Compare commits

5490 Commits

Author SHA1 Message Date
08d8d957e7 [prototype] Invoke subgraph higher order op 2024-09-16 12:43:28 -07:00
a30d5ba16c Fix bug in split-build workflows codegen (#136043)
By just deleting a few rogue lines left out in https://github.com/pytorch/pytorch/pull/135510
If a file in the workflows folder does not have a `.yml` extension, it will not be launched at all, will it?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136043
Approved by: https://github.com/kit1980, https://github.com/atalman
2024-09-13 21:29:06 +00:00
46935c8241 Reduce default iterations to 5 . (#135773)
Running all benchmarks currently takes around 15 minutes; this is the data:
https://www.internalfb.com/phabricator/paste/view/P1583590240
The data looks mostly stable, so 5 iterations should be enough, especially with our 1.5% threshold.
That said, the diff also adds a way to increase the number of iterations for a specific benchmark.

Results after the change:
https://www.internalfb.com/phabricator/paste/view/P1583618969
Time is down by about half (7 minutes).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135773
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-09-13 21:16:38 +00:00
4f407c1884 Only measure compile time instruction count for sum_floordiv benchmark (#135785)
There was recent unexplained noise of +5%/-5%.
Measuring only compile time:
1) avoids GC time;
2) avoids other operations that are not what this benchmark is meant to measure, so noise is less likely.
```
collecting compile time instruction count for sum_floordiv_regression
compile time instruction count for iteration 0 is 8899290248
compile time instruction count for iteration 1 is 1188830489
compile time instruction count for iteration 2 is 1180579615
compile time instruction count for iteration 3 is 1176263131
```
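A minimal sketch of how the stability of such a run could be checked, using the counts from the log above. It assumes iteration 0 is a warm-up that should be discarded and borrows the 1.5% threshold mentioned for the benchmark suite as the tolerance; `relative_spread` is a hypothetical helper, not part of the benchmark code.

```python
def relative_spread(counts, warmup=1):
    """Return (max - min) / min over the iterations after the warm-up ones."""
    steady = counts[warmup:]
    if len(steady) < 2:
        raise ValueError("need at least two post-warm-up iterations")
    return (max(steady) - min(steady)) / min(steady)


# Instruction counts copied from the log above.
sum_floordiv_counts = [8899290248, 1188830489, 1180579615, 1176263131]

spread = relative_spread(sum_floordiv_counts)
print(f"spread across warm iterations: {spread:.2%}")

# Assumed tolerance: the 1.5% threshold, applied here to the spread.
is_stable = spread < 0.015
```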

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135785
Approved by: https://github.com/avikchaudhuri, https://github.com/anijain2305
2024-09-13 21:14:10 +00:00
2e461e54e8 Add gpu and gpu_dynamic versions of add_loop (#135809)
I am thinking maybe 3 iterations are enough for this one?
- I am keeping eager and inductor, since inductor takes 2X the eager time.
- Eager dynamic takes 2X the eager time, so I am keeping it as well.
- Inductor has three tests (dynamic gpu, gpu and cpu).
I am unsure whether I am over-profiling here; happy to trim if anyone has suggestions.
```
collecting compile time instruction count for add_loop_eager
compile time instruction count for iteration 0 is 8213664211
compile time instruction count for iteration 1 is 2798628246
compile time instruction count for iteration 2 is 2796811362
compile time instruction count for iteration 3 is 2794438188
compile time instruction count for iteration 4 is 2794634117
collecting compile time instruction count for add_loop_eager_dynamic
compile time instruction count for iteration 0 is 5724108021
compile time instruction count for iteration 1 is 5499908609
compile time instruction count for iteration 2 is 5569101366
compile time instruction count for iteration 3 is 5493806364
compile time instruction count for iteration 4 is 5493169851
collecting compile time instruction count for add_loop_inductor
compile time instruction count for iteration 0 is 49789381222
compile time instruction count for iteration 1 is 25769347393
compile time instruction count for iteration 2 is 25772594322
compile time instruction count for iteration 3 is 25768695952
compile time instruction count for iteration 4 is 25768032314
collecting compile time instruction count for add_loop_inductor_gpu
compile time instruction count for iteration 0 is 23966942581
compile time instruction count for iteration 1 is 23771950919
compile time instruction count for iteration 2 is 23770784286
compile time instruction count for iteration 3 is 23780160875
compile time instruction count for iteration 4 is 23774634465
collecting compile time instruction count for add_loop_inductor_dynamic_gpu
compile time instruction count for iteration 0 is 41505055086
compile time instruction count for iteration 1 is 41293654089
compile time instruction count for iteration 2 is 41301016100
compile time instruction count for iteration 3 is 41306056207
compile time instruction count for iteration 4 is 41308171566
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135809
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-09-13 20:42:31 +00:00
a3d827a28c Use python 3.11 for Large Wheel build (#136042)
Use Python 3.11 in nightly Large wheel builds. Required for Colab testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136042
Approved by: https://github.com/kit1980, https://github.com/malfet

Co-authored-by: Sergii Dymchenko <kit1980@gmail.com>
2024-09-13 20:27:11 +00:00
4312794b92 [reland][export] fix re-export custom metadata (#135720)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/134778

The previous D62304294 broke some executorch tests. It has already been reverted.

In this diff, `_collect_param_buffer_metadata()` is modified so that when a `call_function` node is encountered and its input nodes include `get_attr`, we skip the fields that have been collected previously and only collect the rest of the fields. This prevents overwriting.
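
The "skip what was already collected" rule can be sketched with plain dictionaries. This is only an illustration of the idea, not the code in `_collect_param_buffer_metadata()`; the helper name and the metadata keys are hypothetical.

```python
def merge_without_overwrite(collected, new_fields):
    """Add only the keys from new_fields that collected does not already hold."""
    for key, value in new_fields.items():
        if key not in collected:
            collected[key] = value
    return collected


# Hypothetical metadata: "tag" was recorded on an earlier pass.
node_meta = {"tag": "handle_0"}
from_get_attr_input = {"tag": "handle_1", "source": "model.py:12"}

merge_without_overwrite(node_meta, from_get_attr_input)
print(node_meta)
```

The earlier value of `tag` survives, and only the missing `source` field is filled in.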

Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//executorch/backends/xnnpack/test:test_xnnpack_ops

buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_re_export_preserve_handle

buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_run_decompositions_preserve_handle
```

Differential Revision: D62514208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135720
Approved by: https://github.com/zhxchen17, https://github.com/jerryzh168
2024-09-13 20:15:15 +00:00
b856f3539b Fix script name in the comments (#135507)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135507
Approved by: https://github.com/atalman
2024-09-13 19:59:47 +00:00
835e7bb077 fix requirements.txt installation failure issue on Windows (#134567)
Fixes #134564

Root cause:

The `lintrunner` wheel released on [pypi.org](https://pypi.org/project/lintrunner/#files) only supports 32-bit Windows and 64-bit Linux. Since compiling PyTorch requires a 64-bit environment, on Windows `lintrunner` has to be compiled from its source distribution. `Rust` is a dependency for that compilation, as indicated in the error message. In addition, a Visual Studio environment is needed for linking libraries.

![image](https://github.com/user-attachments/assets/180cd899-8886-43b5-b42f-031f41e81683)

The error from running `pip install lintrunner` without an activated Visual Studio environment is shown below.

```bash
>python -m pip install lintrunner
Collecting lintrunner
  Downloading lintrunner-0.12.5.tar.gz (62 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: lintrunner
  Building wheel for lintrunner (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for lintrunner (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [137 lines of output]
      Running `maturin pep517 build-wheel -i C:\Users\\miniforge3\envs\py310\python.exe --compatibility off`
      📡 Using build options bindings from pyproject.toml
         Compiling proc-macro2 v1.0.79
         Compiling unicode-ident v1.0.12
         Compiling version_check v0.9.4
         Compiling windows_x86_64_msvc v0.52.4
         Compiling winapi v0.3.9
         Compiling serde v1.0.197
         Compiling autocfg v1.2.0
         Compiling syn v1.0.109
         Compiling lazy_static v1.4.0
         Compiling libc v0.2.153
         Compiling equivalent v1.0.1
         Compiling hashbrown v0.14.3
         Compiling memchr v2.7.2
         Compiling yansi v1.0.1
         Compiling unicode-width v0.1.11
         Compiling regex-syntax v0.8.3
         Compiling encode_unicode v0.3.6
         Compiling cfg-if v1.0.0
         Compiling winnow v0.6.5
         Compiling cc v1.0.92
      error: could not compile `windows_x86_64_msvc` (build script) due to 2 previous errors
      warning: build failed, waiting for other jobs to finish...
      error: could not compile `serde` (build script) due to 2 previous errors
      error: could not compile `proc-macro2` (build script) due to 2 previous errors
      error: could not compile `syn` (build script) due to 2 previous errors
      error: could not compile `libc` (build script) due to 2 previous errors
      error: could not compile `winapi` (build script) due to 2 previous errors
      💥 maturin failed
        Caused by: Failed to build a native library through cargo
        Caused by: Cargo build finished with "exit code: 101": `cargo rustc --manifest-path Cargo.toml --message-format json --release --bins --`
      📦 Including license file "LICENSE"
      🔗 Found bin bindings
      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      Error: command ['maturin', 'pep517', 'build-wheel', '-i', 'C:\\Users\\\\miniforge3\\envs\\py310\\python.exe', '--compatibility', 'off'] returned non-zero exit status 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for lintrunner
Failed to build lintrunner
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (lintrunner)
```
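
A quick pre-flight check along these lines could catch the missing toolchain before `pip` starts the source build. It is a sketch that only looks for `cargo` and `link.exe` on `PATH`; the helper name is hypothetical and finding both tools does not guarantee the build succeeds.

```python
import shutil


def missing_build_tools(tools=("cargo", "link.exe")):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_build_tools()
if missing:
    print("not on PATH:", ", ".join(missing))
else:
    print("cargo and link.exe found; a source build of lintrunner can at least start")
```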
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134567
Approved by: https://github.com/malfet
2024-09-13 18:43:55 +00:00
b6d6aa49b8 Revert "Validate input types for torch.nn.Linear and torch.nn.Bilinear (#135596)"
This reverts commit e157ce3ebbb3f30d008c15914e82eb74217562f0.

Reverted https://github.com/pytorch/pytorch/pull/135596 on behalf of https://github.com/malfet due to It's too restrictive, should allow other int-like types, such as `numpy.int64` ([comment](https://github.com/pytorch/pytorch/pull/135596#issuecomment-2349714104))
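
One way to accept int-like types while still rejecting floats and strings is `operator.index`, which succeeds for any object implementing `__index__` (as `numpy.int64` does). This is a sketch of that general technique, not the validation from the reverted PR; `as_feature_count` and `FakeInt64` are hypothetical names.

```python
import operator


def as_feature_count(value, name):
    """Convert an int-like value to a plain int, rejecting floats and strings."""
    try:
        return operator.index(value)
    except TypeError:
        raise TypeError(
            f"{name} must be int-like, got {type(value).__name__}"
        ) from None


class FakeInt64:
    """Hypothetical stand-in for an int-like scalar such as numpy.int64."""

    def __init__(self, value):
        self._value = value

    def __index__(self):
        return self._value


in_features = as_feature_count(FakeInt64(8), "in_features")
out_features = as_feature_count(3, "out_features")
```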
2024-09-13 18:06:56 +00:00
deee21cb78 Revert "[Inductor] Rename cpp_wrapper_cuda.py as cpp_wrapper_gpu.py (#135313)"
This reverts commit 16b37b309f64ddd4e498c57a99191e1d9b3dfdac.

Reverted https://github.com/pytorch/pytorch/pull/135313 on behalf of https://github.com/izaitsevfb due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/135313#issuecomment-2349662091))
2024-09-13 17:53:21 +00:00
3f69410976 [gpu-profiler] Expose active and repeat in os env var (#135757)
Summary: https://fb.workplace.com/groups/ai.efficiency.tools.users/permalink/1855136444971825/

Test Plan:
`buck2 test mode/opt caffe2/test:profiler -- -r test_kineto_profiler_api `

eyes

Differential Revision: D62529249

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135757
Approved by: https://github.com/Yuzhen11
2024-09-13 17:48:27 +00:00
18f9331e5d Revert "[aoti] Fix workspace generation for triton (#135552)"
This reverts commit d3833253928f29ed760b2dccac2b730028a868ca.

Reverted https://github.com/pytorch/pytorch/pull/135552 on behalf of https://github.com/izaitsevfb due to blocks revert of #135313, internal failures, see D62511427 ([comment](https://github.com/pytorch/pytorch/pull/135552#issuecomment-2349641372))
2024-09-13 17:47:36 +00:00
bc0f330169 [trymerge] Manually close merged PR when Github fails (#135890)
Manually close merged PR when Github fails to do it.

Consequences of current design:
Sleeping for 1 minute ties up the machine, might result in race conditions, and causes the merging label to be removed a bit later; the PR is still left open if this API call fails too (i.e. there is no async cleanup job).

Tested in https://github.com/malfet/deleteme/pull/92 by removing the part of the commit message that has "resolved #pr num"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135890
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-09-13 17:29:24 +00:00
7834c0bb2c [AOTI][Tooling] Add stats summary (mean/min/max, etc) for jit inductor tensor value printing (#135887)
Summary:
As title. Follow up to add stats summary (mean/min/max, etc) for jit inductor tensor value printing as well.

The inductor python wrapper code level printing would look something like this:

 {F1859224287}

Test Plan: CI

Reviewed By: chenyang78

Differential Revision: D62415575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135887
Approved by: https://github.com/chenyang78
2024-09-13 17:19:25 +00:00
6ef49fe8f1 Revert "Pass ideep:lowp_kind to matmul_forward::compute on cache misses (#135058)"
This reverts commit 3d2431380999252d5401f83d5010b398a32e7597.

Reverted https://github.com/pytorch/pytorch/pull/135058 on behalf of https://github.com/malfet due to It regresses x86 performance ([comment](https://github.com/pytorch/pytorch/pull/135058#issuecomment-2349480861))
2024-09-13 17:09:45 +00:00
a15774563b [ROCm] Enable ROCm support for inductor's dynamic_rblock_scaling (#129663)
As of ROCm 6.1 [hipDeviceProp_t::regsPerMultiprocessor](https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/structhip_device_prop__t.html#a7390d5b180d63978c81aa971060270b4) is now available allowing us to enable this attribute on ROCm.
```
>>> torch.cuda.get_device_properties(0)
_CudaDeviceProperties(name='AMD Instinct MI250X/MI250', major=9, minor=0, gcnArchName='gfx90a:sramecc+:xnack-', total_memory=65520MB, multi_processor_count=104)
>>> torch.cuda.get_device_properties(0).regs_per_multiprocessor
65536
```

With https://github.com/triton-lang/triton/pull/3962 we can extract n_regs and n_spills from a triton binary with the AMD backend, allowing us to enable inductor's dynamic_rblock_scaling on ROCm, initially implemented in https://github.com/pytorch/pytorch/pull/115094

Leaving this in draft until following PRs have landed:
- https://github.com/pytorch/pytorch/pull/129361 to bump the triton commit pin
- https://github.com/pytorch/pytorch/pull/128449 to allow us to grab warp_size from device properties instead of hard coding 64 on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129663
Approved by: https://github.com/jansel, https://github.com/shunting314
2024-09-13 16:45:39 +00:00
564d00f364 Revert "Fix clang-tidy warnings in Caffe2 code (#134935)"
This reverts commit 7cfd23636c8fa6fcbb8bf3ea34e15b847ec9ad9d.

Reverted https://github.com/pytorch/pytorch/pull/134935 on behalf of https://github.com/izaitsevfb due to breaks internal builds, caffe2 is still used internally ([comment](https://github.com/pytorch/pytorch/pull/134935#issuecomment-2349368152))
2024-09-13 16:42:37 +00:00
ae02d663cd [FlexAttention] Fix output layout (#135882)
We previously only supported the same head dim for v_head and qk_head. When support for different head dims was added, I accidentally kept the query strides for the output. This PR fixes this bug and also ensures that we always produce output in the same stride order as the input query.
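
The difference between reusing the query strides and reusing only the query stride order can be sketched in plain Python. The shapes, the memory layout, and both helper names are hypothetical; this is not the FlexAttention lowering code.

```python
def stride_order(strides):
    """Return dimension indices sorted from fastest-varying to slowest-varying."""
    return sorted(range(len(strides)), key=lambda dim: strides[dim])


def strides_following(shape, order):
    """Build dense strides for shape, laying dimensions out in the given order."""
    strides = [0] * len(shape)
    step = 1
    for dim in order:
        strides[dim] = step
        step *= shape[dim]
    return tuple(strides)


# Hypothetical sizes: logical dims are (batch, heads, seq_len, head_dim),
# stored in memory as batch, seq_len, heads, head_dim.
query_shape, query_strides = (2, 4, 8, 16), (512, 16, 64, 1)
output_shape = (2, 4, 8, 32)  # v_head_dim = 32 differs from qk_head_dim = 16

output_strides = strides_following(output_shape, stride_order(query_strides))
print(output_strides)
```

Copying `query_strides` onto the output would be wrong here because the last dimension is twice as long; keeping the order and recomputing the steps gives a dense layout for the new shape.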

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135882
Approved by: https://github.com/yanboliang, https://github.com/Chillee
2024-09-13 16:36:05 +00:00
ad2f0e9f81 Add remote cache time saved to compilation metrics (#135490)
Summary:
Record remote cache time saved via frame_phase_timing

We add to the "phase" when remote cache hits and saves us time, so that we have a 1:1 correspondence between a frame and time saved.

Test Plan:
Internally run benchmark, see that it's populated in sandbox table after previous diff lands and logger config is actualized.

Show that column exists in table:

https://fburl.com/scuba/logger_staging_jjwu_30582a48f1ff9cf5f4ac50a4c40af/fp2te0ff

Note that an earlier version of D62105258 had the column as a string so the staging table is a bit messed up. But you can see the most recent samples have the column populated as a float.

Reviewed By: aorenste

Differential Revision: D62106921

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135490
Approved by: https://github.com/aorenste
2024-09-13 16:35:51 +00:00
21ffa18ad1 Fix "expand: SymIntArrayRef expected to contain only concrete integers" in AOTInductor (#135933)
Internal xref:
https://fb.workplace.com/groups/1075192433118967/permalink/1501860707118802/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135933
Approved by: https://github.com/angelayi
2024-09-13 15:23:42 +00:00
eqy
2519e5a8de [CUDA][FP8] Skip rowwise scaling test on sm89 (#135718)
Same reason as https://github.com/pytorch/pytorch/pull/133612: the rowwise scaling implementation is sm90+ specific (e.g., it uses TMA).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135718
Approved by: https://github.com/Skylion007
2024-09-13 15:07:20 +00:00
ba6e0f31ab Remove cycle dependency by localizing the import. (#135926)
Summary:
Since https://www.internalfb.com/diff/D62215095 landed, there have been many silent errors due to the dependency between functional_tensor and config.

```
 File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/export/__init__.py", line 64, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/export/dynamic_shapes.py", line 23, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/export/exported_program.py", line 26, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_higher_order_ops/__init__.py", line 1, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_higher_order_ops/cond.py", line 6, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_subclasses/functional_tensor.py", line 9, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_inductor/config.py", line 44, in <module>
```

https://fburl.com/logarithm/ol5kx0ee
The log complains about a cyclic dependency.

This fixes it.
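
As a generic illustration of localizing an import: moving it from module top level into the function that needs it defers resolution until the function is called, by which point both modules in a cycle have usually finished initializing. The sketch below uses a standard-library module as a stand-in; the message does not say which import was moved, and `load_settings` is a hypothetical name.

```python
def load_settings(text):
    # Function-local import: `json` is resolved when load_settings() runs,
    # not when the enclosing module is first imported.
    import json

    return json.loads(text)


settings = load_settings('{"debug": true}')
print(settings)
```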

Test Plan: buck test multipy/runtime:test_deploy_embedded_cuda_interp_without_cuda_available -- --run-disabled TorchpyTest.AcquireMultipleSessionsInDifferentPackages

Reviewed By: aorenste

Differential Revision: D62616765

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135926
Approved by: https://github.com/aorenste, https://github.com/oulgen, https://github.com/Skylion007
2024-09-13 15:05:41 +00:00
7ed0563cad Revert "[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)"
This reverts commit e504fb70693d4a3741c3380b6a989d441e84f737.

Reverted https://github.com/pytorch/pytorch/pull/134732 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:58 +00:00
eb7dd91dd1 Revert "[Dynamo] Trace torch function modes entered outside of torch.compile (#133137)"
This reverts commit fafdd588f27e1d56090c6d260d0382c255eaf9eb.

Reverted https://github.com/pytorch/pytorch/pull/133137 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:58 +00:00
3f30360d05 Revert "[Dynamo] Support thread local setattr (#135443)"
This reverts commit 30b007bea329f512af3dc4fd4e6c7d145e807b71.

Reverted https://github.com/pytorch/pytorch/pull/135443 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:58 +00:00
4734e356d6 Revert "[Dynamo] Simplify torch function mode stack guard (#135444)"
This reverts commit 0c080cb2c78a85a5320fbeadbbb9a2cc640fd89d.

Reverted https://github.com/pytorch/pytorch/pull/135444 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:57 +00:00
ac169795a9 Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422)"
This reverts commit 2af3b8ffd84e36b91279174e9106f84b2d2a11f2.

Reverted https://github.com/pytorch/pytorch/pull/135422 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:57 +00:00
fca58bfda1 Revert "[Dynamo] Remove ignored modes workaround (#135502)"
This reverts commit 7d5e0dd4b1a8d20fc8624b3085a6f5ddedd89a2e.

Reverted https://github.com/pytorch/pytorch/pull/135502 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:57 +00:00
dc71e7a7d4 Revert "[Dynamo] Remove ignored modes from torch function mode stack guard (#135503)"
This reverts commit c56728b643e2b7d796abd7ec45803319e1c5967d.

Reverted https://github.com/pytorch/pytorch/pull/135503 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:57 +00:00
1cdf658f4a Revert "[PT2][inductor][Optimus] Add pad_aten_mm_pass pattern to resolve long computation kernel in LCE (#135167)"
This reverts commit eb0fe029337b31bcb3d4b2d1e539895393975d68.

Reverted https://github.com/pytorch/pytorch/pull/135167 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI eg. https://github.com/pytorch/pytorch/actions/runs/10845542664/job/30097957154 ([comment](https://github.com/pytorch/pytorch/pull/135167#issuecomment-2348847595))
2024-09-13 12:35:05 +00:00
b5c52e96e8 Revert "[dynamo] Fix support for classmethod(property(...)) (#134968)"
This reverts commit bf68e16e94fc05f10d434cdc162a14d02c6ad23c.

Reverted https://github.com/pytorch/pytorch/pull/134968 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI: eg. https://github.com/pytorch/pytorch/actions/runs/10845542664/job/30097956613 ([comment](https://github.com/pytorch/pytorch/pull/134968#issuecomment-2348837553))
2024-09-13 12:29:03 +00:00
ea2ecab15b [AOTI][reland] Fix assert_function call in cpu autotune template (#135920)
Summary: Reland https://github.com/pytorch/pytorch/pull/135086. In the ABI-compatible mode, assert_function should be AOTI_TORCH_CHECK.

Test Plan: CI

Differential Revision: D62500592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135920
Approved by: https://github.com/chenyang78
2024-09-13 12:21:57 +00:00
2f53d570fe Update document for autocast on CPU (#135299)
Update document for autocast on CPU due to the support of float16 and changes in the operator list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135299
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/svekars
2024-09-13 09:11:47 +00:00
31007cf200 [Distributed] add FP8 support to NaN checker (#135891)
Adding support for `torch.float8_e4m3fn` and `torch.float8_e5m2`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135891
Approved by: https://github.com/wconstab
2024-09-13 08:43:54 +00:00
c56728b643 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135503
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502
2024-09-13 08:41:32 +00:00
7d5e0dd4b1 [Dynamo] Remove ignored modes workaround (#135502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135502
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422
2024-09-13 08:41:32 +00:00
2af3b8ffd8 [Dynamo] Trace enter/exit of TorchFunctionModes (#135422)
This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode)

Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call

resume fn structure:
1. enter context
2. jump
...
3. exit context

The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).

So for torch function modes the structure of our output code is this:

1. graph call
2. mutate tf mode stack to replay mutations
3. unsupported code
4. on exception restore stack
5. resume function

Then our resume fn looks like this:

1. no-op enter torch function mode
2. jump
3. exit tf mode

To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).

Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects because we want to preserve the state of the mode stack as is so that we will properly update the stack with bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.
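
The "enter exactly once, exit exactly once" behaviour can be modelled with a toy stack. This is a sketch under simplifying assumptions, not Dynamo's polyfill: `mode_stack`, `NoOpEnterMode`, `replay_mutations` and `resume_function` are hypothetical names, and a plain list stands in for the torch function mode stack.

```python
mode_stack = []  # toy stand-in for the torch function mode stack


class NoOpEnterMode:
    """Context manager for a mode that is already on the stack.

    Entering does nothing; exiting pops the mode as a normal exit would.
    """

    def __init__(self, mode):
        self.mode = mode

    def __enter__(self):
        return self.mode

    def __exit__(self, exc_type, exc, tb):
        popped = mode_stack.pop()
        assert popped is self.mode
        return False


def replay_mutations(mode):
    # Stands in for the side-effects bytecode: the single real push.
    mode_stack.append(mode)


def resume_function(mode):
    # The `with` block is still present, but entering must not push again.
    with NoOpEnterMode(mode):
        return len(mode_stack)


mode = object()  # hypothetical mode instance
replay_mutations(mode)
depth_inside_with = resume_function(mode)
depth_after_exit = len(mode_stack)
```

A regular enter in `resume_function` would leave the mode on the stack twice inside the `with` block; the no-op enter keeps the depth at one and the exit brings it back to zero.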

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135422
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444
2024-09-13 08:41:24 +00:00
0c080cb2c7 [Dynamo] Simplify torch function mode stack guard (#135444)
The semantics of ignored modes previously had edge cases, this eliminates these by in essence filtering any ignored modes out of both the ref stack and the current torch function mode stack. This is purely to fix complexity in #135422.  The ignored modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts vs inserting them into the graph which needed these extra workarounds for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443
2024-09-13 08:41:17 +00:00
30b007bea3 [Dynamo] Support thread local setattr (#135443)
In preparation for tracing through DeviceContext (defb515306/torch/utils/_device.py (L66))
This PR adds support for calling the setattr of thread local objects. These objects have a slots impl, and since this doesn't appear to have any side effects, we call this setattr impl when replaying mutations, since calling `object.__setattr__` on these objects results in a type error.
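
The type error can be reproduced with the standard library's `threading.local`, used here as a stand-in for the thread-local objects the PR targets. On CPython the generic `object.__setattr__` call is expected to be rejected, while the type's own `__setattr__` works; the attribute name is hypothetical.

```python
import threading

state = threading.local()

try:
    object.__setattr__(state, "device", "cpu")
    generic_setattr_failed = False
except TypeError:
    generic_setattr_failed = True

# The type's own __setattr__ is the call that works for these objects.
type(state).__setattr__(state, "device", "cpu")
print(generic_setattr_failed, state.device)
```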

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137
2024-09-13 08:41:07 +00:00
fafdd588f2 [Dynamo] Trace torch function modes entered outside of torch.compile (#133137)
This PR adds initial tracing for torch function modes.

Details:
In essence, this adds tracing into the torch function of modes entered outside of the torch.compile call.
This does not yet support tracing enter/exit of a torch function mode/ tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and handling graph breaks like we do other torch.* context managers.

Previously landed:
https://github.com/pytorch/pytorch/pull/133135
https://github.com/pytorch/pytorch/pull/133136
https://github.com/pytorch/pytorch/pull/133134
https://github.com/pytorch/pytorch/pull/133133
https://github.com/pytorch/pytorch/pull/133132
https://github.com/pytorch/pytorch/pull/133131
https://github.com/pytorch/pytorch/pull/133729
https://github.com/pytorch/pytorch/pull/133130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #134732
2024-09-13 08:41:00 +00:00
e504fb7069 [Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)
For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
Approved by: https://github.com/ydwu4
2024-09-13 08:40:50 +00:00
b346e99376 remove fast_flush arguments (#135387)
I've removed them from upstream Triton in https://github.com/triton-lang/triton/pull/4485. It looks like most places in the code use the default value of `fast_flush=True` anyway, though there are two PRs from @pearu that use `False`. To my knowledge, there's no reason to use the `False` value.

Differential Revision: [D62325778](https://our.internmc.facebook.com/intern/diff/D62325778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135387
Approved by: https://github.com/nmacchioni, https://github.com/jansel
2024-09-13 08:13:46 +00:00
7dc1788396 [inductor] Remove the batch fusion passes from being a default (#135922)
The Ads team does a search internally to figure out which fusion passes to use.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135922
Approved by: https://github.com/eellison, https://github.com/yanboliang
ghstack dependencies: #135819
2024-09-13 06:07:33 +00:00
9fd54d787d [Inductor UT] Generalize device-bias code in test_triton_kernels.py introduced in #135530 (#135656)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135656
Approved by: https://github.com/EikanWang, https://github.com/zou3519
2024-09-13 05:27:56 +00:00
b38be727eb [Inductor UT] Generalize inductor UT for intel GPU (Part 2) (#134556)
[Inductor UT] Reuse Inductor test case for Intel GPU.
Reuse `test/inductor/test_torchinductor_opinfo.py`
Reuse `test/inductor/test_minifier_isolate.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134556
Approved by: https://github.com/etaf, https://github.com/eellison
2024-09-13 05:16:28 +00:00
e54b559e88 [inductor] More fixes on the keys of constants and signature dictionaries (#135406)
The previous PR (https://github.com/pytorch/pytorch/pull/135170) forgot to change two other places that also create `constants` and `signature`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135406
Approved by: https://github.com/jansel
2024-09-13 04:10:41 +00:00
eea5e6ff0f [DCP][DSD] Add a test case to demonstrate the workaround to load full state dict into a 2D model (#135763)
Fix https://github.com/pytorch/pytorch/issues/134095

This is a workaround for loading a full state dict into an FSDP1+TP 2D model.
Since named_parameters() in FSDP1 does not return DTensor, we don't have the information needed to shard the full_state_dict and load it directly into the 2D model. In order to load a full state dict into an FSDP1+TP 2D model, we need to:
- load the full state dict into a 1D FSDP model
- dcp.save the full/shard state dict into storage
- initialize a 2D FSDP1+TP model
- get the default sharded state dict for the 2D model (full_state_dict=False)
- dcp.load the state dict from storage
- load the state dict into the 2D model
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135763
Approved by: https://github.com/fegin
ghstack dependencies: #135725
2024-09-13 03:51:14 +00:00
6df91b5917 real tensor prop for composite ops (#135717)
Fixes #135632

Adds real tensor propagation for decompositions, checking any symbols on their outputs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135717
Approved by: https://github.com/ezyang
2024-09-13 03:35:16 +00:00
0cdc6a8dcd [DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725)
Fix https://github.com/pytorch/pytorch/issues/134095
This fixes the hang of the distributed state dict full_state_dict option during set_state_dict. We switch `_distribute_tensors` in _state_dict_utils.py to use `DTensor.from_local` instead of `distribute_tensor` to support the FSDP2+TP 2D strided sharding use case, as `distribute_tensor` cannot handle strided sharding yet. `distribute_tensor` incurs a scatter behind the scenes, while `DTensor.from_local` takes the local slice from the full tensor on each rank to create the DTensor (no collective). This means it is the user's responsibility to make sure the full_tensor from the full_state_dict is the same across all ranks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135725
Approved by: https://github.com/fegin
2024-09-13 03:26:36 +00:00
6cdc70bccd [ROCm] skip test_fp8_cast_and_t on non-MI300 machines (#135917)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135917
Approved by: https://github.com/malfet
2024-09-13 02:46:48 +00:00
e6b68359d7 Fix xpu memory stats error (#135818)
# Motivation
fix https://github.com/pytorch/pytorch/issues/135726
After merging two free blocks, the active memory size was decreased by the wrong amount: it should be decreased by the original block size, not by the merged block size.
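
The bookkeeping rule described above can be sketched with a toy model. The `Block` class and `free_and_merge` function are hypothetical names made up for illustration; the real allocator is C++ and considerably more involved:

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Block:
    size: int


def free_and_merge(block: Block, free_neighbor: Block, active_bytes: int) -> Tuple[Block, int]:
    """Free `block`, merge it with an adjacent free block, and update the stats."""
    original_size = block.size  # record the size before merging
    merged = Block(block.size + free_neighbor.size)
    # Only the block being freed was active. The neighbor was already free,
    # so subtracting merged.size would decrease the counter by too much.
    active_bytes -= original_size
    return merged, active_bytes


merged, active = free_and_merge(Block(512), Block(256), active_bytes=512)
```

In this toy run the merged block spans 768 bytes, but the active counter only drops by the 512 bytes that were actually in use.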

# Additional Context
Add a UT to guard this scenario.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135818
Approved by: https://github.com/EikanWang
2024-09-13 02:41:21 +00:00
1c04cbfba6 [BE] Use C10_UNUSED (#135914)
Instead of `(void)foo; // Suppress unused variable`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135914
Approved by: https://github.com/huydhn, https://github.com/eqy
2024-09-13 02:27:07 +00:00
062681a0ed [Profiler] Torch Profiler distributed info is not JSON serializable (#135548)
Summary: To fix https://github.com/pytorch/pytorch/issues/133308 we must create an encoder for numpy values so we can serialize the distributed metadata to JSON.
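
A custom encoder of this kind might look roughly like the sketch below. It is an assumption-laden illustration, not the encoder added by the PR: the class name and metadata keys are made up, and a standard-library `array` stands in for a NumPy value (both expose `tolist()`):

```python
import json
from array import array


class ArrayLikeEncoder(json.JSONEncoder):
    """Hypothetical encoder: converts array-like values through tolist()."""

    def default(self, o):
        if hasattr(o, "tolist"):
            return o.tolist()
        # Defer to the base class, which raises TypeError for unknown types.
        return super().default(o)


# Made-up metadata; the array value is not JSON serializable by default.
metadata = {"backend": "nccl", "rank": 0, "pg_ranks": array("i", [0, 1, 2, 3])}
encoded = json.dumps(metadata, cls=ArrayLikeEncoder)
```

Without `cls=ArrayLikeEncoder`, `json.dumps` raises a `TypeError` on the array value.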

Test Plan: Added unit test to check that numpy values can be serialized

Differential Revision: D62411619

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135548
Approved by: https://github.com/aaronenyeshi, https://github.com/albanD
2024-09-13 02:22:33 +00:00
8c356ce3da Fix lint errors in fbcode (#135614)
Summary: Fixed a bunch of fbcode imports that happened to work but confused autodeps.  After this autodeps still suggests "improvements" to TARGETS (which breaks our builds) but at least it can find all the imports.

Test Plan:
```
fbpython fbcode/tools/build/buck/linters/lint_autoformat.py --linter=autodeps --default-exec-timeout=1800 -- fbcode/caffe2/TARGETS fbcode/caffe2/test/TARGETS
```
Before:
```
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/testing.py:229) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fbur$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export.py:87) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_serdes.py:9) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fb$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_serdes.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_retraceability.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https:$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_retraceability.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See ht$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_nonstrict.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See http$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_nonstrict.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See $
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:8) when processing rule "test_export". Please make sure it's listed in the srcs parameter of an$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of anoth$
ERROR while processing caffe2/test/TARGETS: Found "//python/typeshed_internal:typeshed_internal_library" owner for "cv2" but it is protected by visibility rules: [] (from caffe2/test/test_bundled_images.py:7) when processing rule "test_bundled_$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "caffe2.test.profiler_test_cpp_thread_lib" (from caffe2/test/profiler/test_cpp_thread.py:29) when processing rule "profiler_test_cpp_thread". Please make sure it's listed in t$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_custom_ops.py:23) when processing rule "custom_ops". Please make sure it's listed in the srcs parameter of anoth$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_public_bindings.py:13) when processing rule "public_bindings". Please make sure it's listed in the srcs paramete$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.symbolize_tracebacks" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another $
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.gather_traceback" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another rule$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for include <torch/csrc/autograd/profiler_kineto.h> (from caffe2/test/profiler/test_cpp_thread.cpp:2) when processing profiler_test_cpp_thread_lib.  Some things to try:
```

Differential Revision: D62049222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135614
Approved by: https://github.com/oulgen, https://github.com/laithsakka
2024-09-13 02:04:34 +00:00
bf68e16e94 [dynamo] Fix support for classmethod(property(...)) (#134968)
Fixes #134451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134968
Approved by: https://github.com/yanboliang
2024-09-13 01:14:18 +00:00
eqy
d732df7e56 [Inductor] Disable TF32 in test_slice_scatter_reinplace (#135709)
TF32 linear/matmul numerics seem unrelated to test functionality so disabling it here to abate noisy failures

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135709
Approved by: https://github.com/eellison
2024-09-13 00:30:45 +00:00
c9de2efde6 [Docs] fix inconsistent docs in conv1d, conv2d, and conv3d (#135894)
Addresses https://github.com/pytorch/pytorch/issues/135880
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135894
Approved by: https://github.com/mikaylagawarecki, https://github.com/malfet
2024-09-13 00:19:42 +00:00
1f15c0c7a5 [fx] Replace _snake_case with a regexp (#135822)
~2x speedup on this function, though saves <0.5s overall
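
For illustration, a character loop and a regexp-based equivalent could look like this. Both functions are sketches with made-up names and are not guaranteed to match the code in the PR:

```python
import functools
import re


def snake_case_loop(s: str) -> str:
    """Character-by-character version, shown as an illustrative baseline."""
    chars = []
    prev_lower = False
    for c in s:
        if prev_lower and c.isupper():
            chars.append("_")
        chars.append(c.lower())
        prev_lower = c.islower()
    return "".join(chars)


# Insert "_" before any uppercase letter that directly follows a lowercase letter.
_insert_underscores = functools.partial(re.compile(r"(?<=[a-z])([A-Z])").sub, r"_\1")


def snake_case_regex(s: str) -> str:
    """Regexp version: one substitution pass, then lowercase the result."""
    return _insert_underscores(s).lower()
```

The regexp version moves the per-character work into the `re` engine, which is the usual source of this kind of speedup.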

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135822
Approved by: https://github.com/oulgen
ghstack dependencies: #135787, #135788, #135820, #135821
2024-09-13 00:18:41 +00:00
a72124add9 [fx] Minor optimization in create_arg (#135821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135821
Approved by: https://github.com/oulgen
ghstack dependencies: #135787, #135788, #135820
2024-09-13 00:18:41 +00:00
10ca4c0564 [inductor] Use TracerBase directly in LoopBody (#135820)
This skips some unneeded work in the subclass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135820
Approved by: https://github.com/oulgen
ghstack dependencies: #135787, #135788
2024-09-13 00:18:41 +00:00
d3aab9642b [inductor] Optimize can_fuse_vertical() (#135788)
An O(n^2) to O(n) improvement by not comparing all pairs of deps.
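
The general idea of replacing pairwise comparison with hashed lookups can be sketched as follows. The function names and the `(buffer, index)` tuples are hypothetical, and this is not the actual `can_fuse_vertical()` logic:

```python
def unmatched_reads_quadratic(reads, writes):
    """Compare every read against every write: O(len(reads) * len(writes))."""
    return [r for r in reads if not any(r == w for w in writes)]


def unmatched_reads_linear(reads, writes):
    """Hash the writes once, then do O(1) lookups: O(len(reads) + len(writes))."""
    write_set = set(writes)
    return [r for r in reads if r not in write_set]


# Hypothetical dependencies, modeled as (buffer name, index expression) tuples.
reads = [("buf0", "i0"), ("buf1", "i0"), ("buf2", "i1")]
writes = [("buf0", "i0"), ("buf2", "i1")]
```

Both functions return the same result; only the linear one avoids visiting every pair.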

Before:
![image](https://github.com/user-attachments/assets/797cd1bd-5d53-4374-8e76-ffce4232d7f9)

After:
![image](https://github.com/user-attachments/assets/1e61bf29-adba-41a4-839e-f028130fa979)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135788
Approved by: https://github.com/oulgen
ghstack dependencies: #135787
2024-09-13 00:18:41 +00:00
67a929eea8 [inductor] Remove unused check (#135787)
I think this is unreachable code because mode is always None on reads.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135787
Approved by: https://github.com/oulgen
2024-09-13 00:18:41 +00:00
f576960bbc do not expand in replace/simplify if no changes (#135863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135863
Approved by: https://github.com/ezyang
2024-09-13 00:12:01 +00:00
1aba224cfd Update nightly PyTorch version to 2.6.0 (#135916)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135916
Approved by: https://github.com/kit1980
2024-09-13 00:08:52 +00:00
d383325392 [aoti] Fix workspace generation for triton (#135552)
Fixes #131337

- add `arg_type` for workspace_arg, the type is consistent with the type in `generate_workspace_allocation()`.
- do not generate example tensors for `workspace`, and use `generate_workspace_allocation()` instead.
- add workspace allocation generation code to `kernel_autotune_calls`. e.g.
```python
    workspace = empty_strided_cuda((1280, ), (1, ), torch.uint8)
    workspace.zero_()
    .....
    triton_spl_fused_add_cumprod_0.run(buf2, arg0_1, arg1_1, workspace, 1, 10000, grid=split_scan_grid(1, 10000), stream=stream0)
    del buf2, arg0_1, arg1_1, workspace
```
-  add `empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda` to the header of triton autotune code.

The generated cpp has lines like the ones below, so we also implement a `zero_()` for `AtenTensorHandle`.

```cpp
    static constexpr int64_t int_array_0[] = {1280L, };
    static constexpr int64_t int_array_1[] = {1L, };
    AtenTensorHandle workspace_handle;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_uint8, cached_torch_device_type_cuda,  0, &workspace_handle));

        RAIIAtenTensorHandle workspace(workspace_handle);
        workspace.zero_();
```

- Fix the handling of `grid_fn` for grid computation. Pass in "RBLOCK" to `split_scan_grid`
-  Fix dynamic shapes:
Without the fix we generate code that looks like this `workspace = empty_strided_cuda((32*((255 + s0) // 256), ), (1, ), torch.uint8)` when doing triton autotune and `s0` is not defined.

The solution approach is to use `V.graph.sizevars.size_hint(nbytes)` to realize the workspace size for triton autotune. Note that we only realize it for triton autotune code, but not for the cpp cuda code.

- We also generate slightly different cpp code depending on whether `abi_compatible` is turned on.
```cpp
RAIIAtenTensorHandle workspace(workspace_handle);
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_zero_(workspace.get()));
```
vs

```cpp
    at::Tensor workspace = at::detail::empty_strided_cuda({8L*(c10::div_floor_integer(static_cast<int64_t>((255L + s0)), static_cast<int64_t>(256L))), }, {1L, }, at::kByte, c10::DeviceType::CUDA);
    workspace.zero_();
```

Test Plan:

```
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1  python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
python test/inductor/test_cuda_cpp_wrapper.py DynamicShapesCudaWrapperCudaTests.test_consecutive_split_cumprod_cuda_dynamic_shapes_cuda_wrapper
TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
TORCHINDUCTOR_CPP_WRAPPER=1  python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135552
Approved by: https://github.com/desertfire
2024-09-12 23:53:09 +00:00
00dc7d4356 fix compiled_autograd deadlock throw (#135795)
Fixes #135298

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135795
Approved by: https://github.com/xmfan
2024-09-12 23:24:57 +00:00
1760bbc259 [FlexAttention] Ensure q/k/v and block_mask are on exactly the same device (#135823)
Fixes #134739

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135823
Approved by: https://github.com/BoyuanFeng
2024-09-12 23:11:01 +00:00
fb9d8e3248 [ROCm] Use ieee precision for fp32 in flex attention (#135702)
Commit 3bebc09be9 brought in a change to flex_attention to allow TF32 precision. This largely lacks support on the ROCm side, so we should use ieee there.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135702
Approved by: https://github.com/jeffdaily, https://github.com/drisspg
2024-09-12 23:00:48 +00:00
aaabfc8930 [Easy] Check if quant registered in constant folding (#135875)
Belated fix for https://github.com/pytorch/pytorch/issues/110904

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135875
Approved by: https://github.com/shunting314
2024-09-12 22:16:39 +00:00
63d6cd351a [dynamo] support torch.nn.attention.sdpa_kernel context manager (#135404)
Fixes https://github.com/pytorch/pytorch/issues/134608

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135404
Approved by: https://github.com/jansel, https://github.com/drisspg
2024-09-12 22:04:48 +00:00
3de9e474df Revert "Check function declarations of Core ML code (#135467)"
This reverts commit bc1b8f094d24de27432f4c29f0729e85a6b5ba63.

Reverted https://github.com/pytorch/pytorch/pull/135467 on behalf of https://github.com/malfet due to This breaks ios periodic jobs, see https://github.com/pytorch/pytorch/actions/runs/10797026668/job/29947377532 ([comment](https://github.com/pytorch/pytorch/pull/135467#issuecomment-2347322784))
2024-09-12 22:04:35 +00:00
3e1a4ea132 Revert "[DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725)"
This reverts commit 83c594ebd6dfa517fdd67ae23929cc60d5fa325d.

Reverted https://github.com/pytorch/pytorch/pull/135725 on behalf of https://github.com/ZainRizvi due to This is breaking lint. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/10835983999/job/30068709508) [HUD commit link](83c594ebd6) ([comment](https://github.com/pytorch/pytorch/pull/135725#issuecomment-2347303272))
2024-09-12 21:47:38 +00:00
e157ce3ebb Validate input types for torch.nn.Linear and torch.nn.Bilinear (#135596)
Adds validation of the input types, with clearer error messages when the check fails.
Fixes https://github.com/pytorch/pytorch/issues/135463

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135596
Approved by: https://github.com/malfet
2024-09-12 21:28:37 +00:00
b897ab0540 [export] ignore mark_dynamic() in export (#135536)
Previously we were accommodating `torch._dynamo.mark_dynamic()` for export's dynamic shapes. Here we clean things up and ignore it, requiring users to specify an export input for `dynamic_shapes`.

Note: there are 4 decorators relevant to export: `mark_dynamic, maybe_mark_dynamic, mark_static, mark_unbacked`. User calls that involve export have only been `mark_dynamic()`, and we use `maybe_mark_dynamic` under the hood for `Dim.AUTO`, but we could start using others. One reason I decided to silently ignore rather than warn is that these decorators cause the tensors to carry dynamic info, and it will be hard to tell whether the markers come from export or from user calls when re-exporting with the same inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135536
Approved by: https://github.com/avikchaudhuri
2024-09-12 21:22:19 +00:00
3d24313809 Pass ideep:lowp_kind to matmul_forward::compute on cache misses (#135058)
Optimized dynamic quantization for aarch64 was enabled by #126687 and #134897

This PR fixes an issue for aarch64 where on a [cache miss](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp#L592) (e.g. if input dimensions change) [ideep::matmul_forward::compute ](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L160) (wrongly) runs with the [default lowp_kind (u8s8)](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L174) which is not supported by oneDNN+ACL (Arm Compute Library), causing the workload to fall back to a much slower oneDNN gemm:jit kernel

Example:
```python
import torch

DIM = 4096
INPUT_SIZE1 = 32
INPUT_SIZE2 = 16

class LinearNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(DIM, DIM, bias=False)

    def forward(self, x):
        x = self.fc1(x)
        return x

input1 = torch.randn(size=(INPUT_SIZE1, DIM))
input2 = torch.randn(size=(INPUT_SIZE2, DIM))

with torch.no_grad():
    model = LinearNet()
    model = torch.ao.quantization.quantize_dynamic(model, {torch.nn.Linear})

    model(input1)   # this goes to ACL lowp_gemm
    print("="*50)
    model(input2)   # this goes to gemm:jit without this PR, and to ACL with this PR
```
In the code snippet above:
- The matmul from `model(input1)` goes to oneDNN+ACL (in both cases, with and without the PR)
- The matmul from `model(input2)`: **Without this PR**: there's a cache miss (different input shapes) and matmul_forward::compute is run with the default lowp_kind (u8s8). Hence the matmul falls back to gemm:jit in oneDNN. However, **With this PR** the matmul goes to oneDNN+ACL which is around 10x faster than oneDNN+jit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135058
Approved by: https://github.com/jondea, https://github.com/malfet
2024-09-12 20:30:20 +00:00
cd472bb1e3 [torch][fx] Add new replacement_callback to materialize a replacement just in time (#135553)
Summary:
Sometimes we only want to generate a replacement for a matched pattern
once we know some information about the nodes in the pattern.

So far, we have found this the most useful to do matches based on specific
shapes of tensors flowing into functions.
Use a callback function similar to `match_filters`. By default this isn't used.

Had to make `replacement` a None-able parameter because Callable was
already used to detect a case where a graph needed to be traced.

Differential Revision: D62412628

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135553
Approved by: https://github.com/SherlockNoMad
2024-09-12 18:52:14 +00:00
f032135bbf Add batching rule for torch.scatter_reduce (#135547)
Fixes #134797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135547
Approved by: https://github.com/zou3519
2024-09-12 18:51:21 +00:00
525bec804c NJT <-> padded dense conversions (#125947)
This PR:
* Implements the pre-existing `nt.to_padded_tensor(padding_val)` ATen op via the FBGEMM kernel + appropriate view gymnastics (since that kernel only handles 2D values)
* Introduces a new `_nested_from_padded_tensor` op for the reverse conversion, implemented via the reverse FBGEMM kernel + view gymnastics
    * Note: there is currently no public API for this; the design is deferred to a future PR
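
Conceptually, the two conversions are inverses of each other. The sketch below shows the idea on plain nested lists; the function names are hypothetical, and it does not use the FBGEMM kernels or the real op signatures:

```python
def to_padded(rows, padding_val):
    """Pad each variable-length row to the length of the longest row."""
    max_len = max((len(r) for r in rows), default=0)
    return [list(r) + [padding_val] * (max_len - len(r)) for r in rows]


def from_padded(padded, lengths):
    """Reverse conversion: trim each padded row back to its true length."""
    return [row[:n] for row, n in zip(padded, lengths)]


rows = [[1, 2, 3], [4], [5, 6]]
padded = to_padded(rows, padding_val=0)
restored = from_padded(padded, [len(r) for r in rows])
```

The reverse direction needs the original row lengths, since padding values cannot be told apart from real data.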

TODO:
* ~~Propagate min / max sequence length via the new factory function `_nested_from_padded_tensor`~~
* ~~Verify that Inductor does computation fusion via test logic~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125947
Approved by: https://github.com/soulitzer
2024-09-12 17:54:25 +00:00
83c594ebd6 [DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725)
Fix https://github.com/pytorch/pytorch/issues/134095
This fixes the hang of the distributed state dict full_state_dict option during set_state_dict. We switch `_distribute_tensors` in _state_dict_utils.py to use `DTensor.from_local` instead of `distribute_tensor` to support the FSDP2+TP 2D strided sharding use case, as `distribute_tensor` cannot handle strided sharding yet. `distribute_tensor` incurs a scatter behind the scenes, while `DTensor.from_local` takes the local slice from the full tensor on each rank to create the DTensor (no collective). This means it is the user's responsibility to make sure the full_tensor from the full_state_dict is the same across all ranks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135725
Approved by: https://github.com/fegin
2024-09-12 17:43:57 +00:00
c1277945d3 [AOTI][Tooling] Support debug printing for inductor level extern kernel call such as externkernel.addmm, bmm, etc. (#135731)
Summary:
As title.

Effect after merging this diff would look something like this:

```
        print('inductor: before_launch - triton_poi_fused_0 - buf0', buf0)
        triton_poi_fused_0.run(buf0, 6, grid=grid(6), stream=stream0)
        print('inductor: after_launch - triton_poi_fused_0 - buf0', buf0)
        buf1 = empty_strided_cuda((16, 6), (6, 1), torch.float32)
        # Topologically Sorted Source Nodes: [linear], Original ATen: [aten.addmm]
        print('inductor: before_launch - extern_kernels.addmm - buf0', buf0)
        extern_kernels.addmm(buf0, reinterpret_tensor(arg2_1, (16, 16), (16, 1), 0), reinterpret_tensor(L__self___weight, (16, 6), (1, 16), 0), alpha=1, beta=1, out=buf1)
        print('inductor: after_launch - extern_kernels.addmm - buf0', buf0)
```

Context: D62272588 only supports debug printing codegen for the major Triton JIT kernels in Inductor.

Test Plan: CI & OSS CI

Reviewed By: chenyang78, ColinPeppler

Differential Revision: D62397017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135731
Approved by: https://github.com/ColinPeppler
2024-09-12 17:31:10 +00:00
dab7d646d5 Use a better decomposition for split_with_sizes (#135728)
This decomposition has fewer checks and improves the performance
of torch.compile.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135728
Approved by: https://github.com/ezyang
2024-09-12 16:38:51 +00:00
7647c398ff Allow optional positional arguments for torch.func.functional_call (#134643)
This PR resolves #134408. It adds an additional test, which passes locally.

Do you think we should add a post-check to ensure `args` and `kwargs` are not both `None`? It seems to be possible to have modules without inputs.

This PR does not include any such post-check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134643
Approved by: https://github.com/zou3519
2024-09-12 15:22:06 +00:00
d67cc58181 [ONNX] Fix symbolic values and numpy implementation (#135786)
1. Remove `__eq__` to make `SymbolicTensor` hashable and test for that
2. Update the `__array__` method so that it works for tensor on GPU
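
The first item relies on a standard Python rule: a class that defines `__eq__` without also defining `__hash__` becomes unhashable. A minimal sketch with made-up class names (not the actual `SymbolicTensor`):

```python
class WithEq:
    """Defines __eq__ only, so Python sets __hash__ to None."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return isinstance(other, WithEq) and self.name == other.name


class WithoutEq:
    """Inherits identity-based __eq__ and __hash__ from object."""

    def __init__(self, name):
        self.name = name


try:
    hash(WithEq("x"))
    with_eq_hashable = True
except TypeError:
    with_eq_hashable = False

value_table = {WithoutEq("x"): 0}  # instances can be used as dict keys
```

Dropping `__eq__` restores the default identity-based hash, at the cost of comparing by identity instead of by value.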

Fixes https://github.com/pytorch/pytorch/issues/135700
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135786
Approved by: https://github.com/titaiwangms
2024-09-12 14:24:43 +00:00
dddaadac6c [dynamo] Dont graph break on inner torch.compile (#135819)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135819
Approved by: https://github.com/jansel
2024-09-12 11:39:09 +00:00
02169364e1 [inductor] Split reduction loops when there is no shared reads (#134307)
Fixes #129102

![image](https://github.com/user-attachments/assets/0d00f75b-2bb9-4ce6-a0d9-2daceaff539c)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134307
Approved by: https://github.com/shunting314
2024-09-12 09:45:08 +00:00
c30042fbeb [GPT-fast] Update compilation time target for Llama & Mixtral (#135817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135817
Approved by: https://github.com/xmfan, https://github.com/huydhn
2024-09-12 07:13:44 +00:00
6700175531 [Inductor] simplify indexing_exprs in LoopBody._init_with_copy (#135574)
This PR uses `var_ranges` information to simplify `indexing_exprs` in `LoopBody._init_with_copy`, reducing occurrences of `FloorDiv` and `ModularIndexing` in the `indexing_exprs`.
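
As an illustration of why range information helps, the brute-force check below confirms two identities that hold once a variable is known to lie in `[0, d)`. The helpers are plain-integer stand-ins with assumed names, and the range of 16 is hypothetical; the PR itself operates on symbolic expressions:

```python
def floor_div(x: int, divisor: int) -> int:
    return x // divisor


def modular_indexing(x: int, divisor: int, modulus: int) -> int:
    # Integer arithmetic corresponding to ModularIndexing(x, divisor, modulus)
    return (x // divisor) % modulus


VAR_RANGE = 16  # hypothetical: x iterates over range(16)

# For 0 <= x < 16, x // 16 is always 0 and x % 16 is always x.
floor_div_is_zero = all(floor_div(x, VAR_RANGE) == 0 for x in range(VAR_RANGE))
mod_is_identity = all(modular_indexing(x, 1, VAR_RANGE) == x for x in range(VAR_RANGE))
```

With such identities available, an index expression can drop the division or modulo entirely.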

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135574
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-09-12 06:56:34 +00:00
de8a8653c0 [dtensor][BE] replace compute_local_shape with compute_local_shape_and_global_offset (#135554)
**Summary**
1. This PR removes the public API `compute_local_shape` and replaces its use with the more general API `compute_local_shape_and_global_offset`.
2. To keep `compute_local_shape_and_global_offset` consistent with `compute_local_shape` on empty shards, it now returns local tensor shape `(0,)` for empty shards which is more aligned with DTensor's semantics on non-participating ranks.

**Test**
`pytest test/distributed/_tensor/test_dtensor.py`
`pytest test/distributed/_tensor/test_init.py`
`pytest test/distributed/_tensor/test_tensor_ops.py`

Differential Revision: [D62415591](https://our.internmc.facebook.com/intern/diff/D62415591)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135554
Approved by: https://github.com/tianyu-l, https://github.com/wz337
2024-09-12 06:30:09 +00:00
86335e9135 [reland 3/3][fx] Bypass custom __setattr__ in Node.__init__ (#135735)
Relands #135079 which was reverted by #135562

I broke this up into three parts to test internally.
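
The technique named in the title can be sketched on a toy class. `ToyNode` and its counter are hypothetical stand-ins and do not reflect the real `torch.fx.Node`:

```python
class ToyNode:
    """Toy class whose custom __setattr__ does extra bookkeeping."""

    setattr_calls = 0

    def __init__(self, name, op):
        # Bypass the custom hook during construction, where the
        # bookkeeping is not needed yet.
        object.__setattr__(self, "name", name)
        object.__setattr__(self, "op", op)

    def __setattr__(self, key, value):
        type(self).setattr_calls += 1  # stand-in for real bookkeeping
        object.__setattr__(self, key, value)


n = ToyNode("x", "placeholder")
calls_after_init = ToyNode.setattr_calls
n.name = "y"
calls_after_assign = ToyNode.setattr_calls
```

Construction skips the hook entirely, while later assignments still go through it.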
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135735
Approved by: https://github.com/oulgen
2024-09-12 05:50:39 +00:00
14e3f3c062 [aoti] Remove nlohmann/json.hpp from header (#135765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135765
Approved by: https://github.com/malfet
2024-09-12 05:38:51 +00:00
9852c6d236 xpu: fix 3rd party builds on systems with cmake<3.25 (#135767)
The CMake `LINUX` variable is only available starting from CMake 3.25. It is better to use `CMAKE_SYSTEM_NAME` instead, which relaxes the CMake version requirement.

See: https://cmake.org/cmake/help/v3.25/variable/LINUX.html
Fixes: #135766
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135767
Approved by: https://github.com/malfet, https://github.com/guangyey
2024-09-12 05:31:01 +00:00
6354271178 [inductor] Skip unused call to get_estimated_runtime() (#135776)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135776
Approved by: https://github.com/oulgen
ghstack dependencies: #135445, #135446
2024-09-12 05:22:23 +00:00
12902f6ecf [inductor] Cache get_operation_names/get_buffer_names (#135446)
Before:
![image](https://github.com/user-attachments/assets/db5b6fce-d849-4512-a21d-7a09efc72311)

After:
![image](https://github.com/user-attachments/assets/097e340c-03b2-491e-ad36-132350b37892)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135446
Approved by: https://github.com/oulgen
ghstack dependencies: #135445
2024-09-12 05:22:23 +00:00
3decb676aa [inductor] Optimize cache_on_self (#135445)
This is a small compile time win, but also makes profiles more readable.
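
For readers unfamiliar with the pattern, a decorator that caches a method result on the instance can be sketched as below. This shows the general idea only, not the optimized implementation from the PR, and the `Example` class is made up:

```python
import functools


def cache_on_self(fn):
    """Cache a zero-argument method's result as an attribute on the instance."""
    key = f"__{fn.__name__}_cache"

    @functools.wraps(fn)
    def wrapper(self):
        if not hasattr(self, key):
            setattr(self, key, fn(self))
        return getattr(self, key)

    return wrapper


class Example:
    def __init__(self):
        self.computed = 0

    @cache_on_self
    def buffer_names(self):
        self.computed += 1
        return ("buf0", "buf1")


obj = Example()
first, second = obj.buffer_names(), obj.buffer_names()
```

The second call returns the stored tuple without running the method body again.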

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135445
Approved by: https://github.com/oulgen
2024-09-12 05:22:23 +00:00
8d68a02905 OpenReg: Split the daemon into driver/executor (#135646)
Split the daemon into a proper user-process driver vs device-process executor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135646
Approved by: https://github.com/albanD
2024-09-12 05:03:46 +00:00
28330a8a39 [reland 1/3][fx] Bypass custom __setattr__ in Node.__init__ (#135733)
Relands #135079 which was reverted by #135562

I broke this up into three parts to test internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135733
Approved by: https://github.com/oulgen
2024-09-12 04:29:37 +00:00
eaba287adb [dynamo] Bug fix for _torchdynamo_inline source handling (#135612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135612
Approved by: https://github.com/drisspg
2024-09-12 04:05:08 +00:00
cyy
f5f1d0a753 Fix build warnings for torch_python (#134981)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134981
Approved by: https://github.com/ezyang
2024-09-12 03:59:34 +00:00
5bc238c73e torch.hub: add get_dir/set_dir type hints (#134906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134906
Approved by: https://github.com/Skylion007
2024-09-12 03:53:29 +00:00
79223114db Avoid inserting extra transpose when the input to group norm is NHWC (#135575)
When the input format for group norm is NHWC and the device is privateuseone, it introduces an additional transpose operation. To avoid this issue, a check for the privateuseone device needs to be added here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135575
Approved by: https://github.com/ezyang
2024-09-12 03:36:05 +00:00
cyy
7cfd23636c Fix clang-tidy warnings in Caffe2 code (#134935)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134935
Approved by: https://github.com/ezyang
2024-09-12 03:27:09 +00:00
0d1d69fd25 Update torch-xpu-ops pin (ATen XPU implementation) (#135647)
Release cycle for PyTorch 2.5
1. Fixing runtime error on Windows: Fail to load torch_xpu_ops_unary_binary_kernels.dll as the bin size is large.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135647
Approved by: https://github.com/EikanWang
2024-09-12 03:16:08 +00:00
21a64d57b1 [BE] typing for decorators - masked/_ops (#135108)
Differential Revision: D62184735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135108
Approved by: https://github.com/Skylion007
2024-09-12 01:34:09 +00:00
1a74952925 "Remove BLOCK_LIST" (#135729)
Summary:
Skip test_prepare_qat_conv_bn_fusion_getitem_placeholder when we use training ir, since it's only for bn-getitem pattern, but the pattern doesn't exist in training ir.

Remove BLOCK_LIST since it's empty.
Now all internal unittests will use training ir.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan'  caffe2/test/quantization:test_quantization -- -r test_prepare_qat_conv_bn_fusion_getitem_placeholder
buck2 run 'fbcode//mode/dev-nosan'  caffe2/test:quantization_pt2e_qat -- -r test_prepare_qat_conv_bn_fusion_getitem_placeholder
```

Differential Revision: D62387987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135729
Approved by: https://github.com/tugsbayasgalan
2024-09-12 01:22:06 +00:00
a130ed828a Fix the upload of x86 micro benchmark results (#135780)
The upload stats workflow currently skips this job: https://github.com/pytorch/pytorch/actions/runs/10807251335/job/29977650639. This is a miss from https://github.com/pytorch/pytorch/pull/135042. So the workflow is running, but nothing has been uploaded yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135780
Approved by: https://github.com/atalman
2024-09-12 01:16:38 +00:00
eb0fe02933 [PT2][inductor][Optimus] Add pad_aten_mm_pass pattern to resolve long computation kernel in LCE (#135167)
Summary:
We observed another long computation issue for OBA_AFOC pyper model, thus adding a pattern to avoid the perf regression

- Only happens in A100
- Do not want to use force_shape_pad since it will pad all GEMMs, which may not be optimal. The Optimus pass has more flexibility to customize the GEMM shape and do the corresponding padding
- To enable, we pass the pass to config, where "k_threshold_to_pad" can be customized

inductor_config.patch(post_grad_fusion_options={"pad_aten_mm_pass": {"k_threshold_to_pad" : 8388608}})

Test Plan:
# unit test

```
buck2 test mode/opt //caffe2/test/inductor:pad_mm
```
Buck UI: https://www.internalfb.com/buck2/58b0f272-f405-45be-bc8d-aec2dc4d5841
Test UI: https://www.internalfb.com/intern/testinfra/testrun/10133099209954651
Network: Up: 9.0KiB  Down: 142B  (reSessionID-8eb71a37-a5ca-4aff-a4f1-93ade3e47e4e)
Jobs completed: 9. Time elapsed: 3:18.0s.
Cache hits: 0%. Commands: 3 (cached: 0, remote: 0, local: 3)
Tests finished: Pass 17. Fail 0. Fatal 0. Skip 0. Build failure 0

# e2e test
see [D62388582](https://www.internalfb.com/diff/D62388582)

Differential Revision: D62220158

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135167
Approved by: https://github.com/jackiexu1992
2024-09-12 00:51:34 +00:00
d270e2d240 [FSDP2] better error msg for cpu offloading (#135156)
When CPU offloading is enabled and the user loads a GPU state dict, FSDP2 throws a less obvious error during backward:
```
RuntimeError: attempting to assign a gradient with device type 'cpu' to a tensor with device type 'cuda'. Please ensure that the gradient and the tensor are on the same device
```

This PR throws a more explicit error that specifies which parameters should be moved because of CPU offloading:

```
FSDP parameters should be materialized on cpu when enabling cpu offloading. For example, load cpu state dict or call module.to_empty(device="cpu"). Found following parameters on non-cpu device: ['0.weight']
```

`pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_dp_state_dict_cpu_offload`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135156
Approved by: https://github.com/awgu
2024-09-12 00:05:07 +00:00
16b37b309f [Inductor] Rename cpp_wrapper_cuda.py as cpp_wrapper_gpu.py (#135313)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135313
Approved by: https://github.com/jansel, https://github.com/desertfire
ghstack dependencies: #135312
2024-09-11 23:59:54 +00:00
13ee85ca5e [Inductor] Generalize cuda cpp wrapper as common triton based GPU cpp wrapper, will be reused by xpu in next PR. (#135312)
[Inductor] Generalize cuda cpp wrapper as common triton based GPU cpp wrapper, will be reused by xpu in next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135312
Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/eellison
2024-09-11 23:59:54 +00:00
94d2471d1f [Traceable FSDP2] Use .copy_ instead of .set_ for unsharded_param inplace update; Replace unsharded_param graph input usage with graph intermediate; Support FSDP2+LoRA (#133730)
Using `fsdp.set_` for the unsharded_param in-place update causes difficult-to-debug errors when enabling Traceable FSDP2 on TorchTune models. In this PR, we change it to use `fsdp.copy_`, which fixes the error and also strictly follows eager semantics (i.e. if the user explicitly stores an alias of the unsharded_param during execution of the user's module code, that alias will be updated correctly when the unsharded_param is copied into via copy_; whereas if we just swap out the unsharded_param storage via set_, that user-saved alias will not be updated, which is not good).
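
The alias behavior described above can be sketched with a plain-Python analogy. This is only an illustration of the aliasing argument, not FSDP2 code: the `Param` class and its `copy_`/`set_` methods are hypothetical stand-ins, with a Python list playing the role of tensor storage.

```python
class Param:
    """Hypothetical stand-in for a parameter that owns a storage buffer."""

    def __init__(self, size):
        self.data = [0] * size

    def copy_(self, src):
        # Write into the existing buffer; every alias of it sees the new values.
        self.data[:] = src

    def set_(self, src):
        # Swap in a brand-new buffer; aliases keep pointing at the old one.
        self.data = list(src)


p = Param(3)
alias = p.data            # user code saves an alias of the buffer

p.copy_([1, 2, 3])
alias_after_copy = list(alias)

p.set_([7, 8, 9])
alias_after_set = list(alias)
```

After `copy_` the saved alias reflects the new values, while after `set_` it still holds the values from before the swap, which is the stale-alias situation the message calls out.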

This PR also implements the graph pass to remove the resizes and copy if there is a resize_(full) -> copy_ -> resize_(0) pattern.

------

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_trace_fsdp_copy_`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_partitioner_cse_respects_mutation_boundaries`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_fsdp_set_input_mutation_applied_when_input_gets_no_gradients`
- `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mutation_op_matching`
- `python test/inductor/test_distributed_patterns.py DistributedPatternTests.test_fake_distributed_aot_eager`
- `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=1 PYTORCH_TEST_WITH_CROSSREF=1 python test/functorch/test_aotdispatch.py TestEagerFusionOpInfoCPU.test_aot_autograd_exhaustive_norm_cpu_float32`
- `python test/distributed/test_inductor_collectives.py TestCollectivesInductor.test_backwards`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133730
Approved by: https://github.com/bdhirsh
2024-09-11 23:01:05 +00:00
5ca46be15e Fix/torch cat doc attr (#135698)
The `torch.cat` attr name for tensors in the docs differs from the method signature, unlike other methods.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135698
Approved by: https://github.com/albanD

Co-authored-by: Alexander Jipa <azzhipa@amazon.com>
2024-09-11 22:32:55 +00:00
9a04cfbeff fix for fp16 (#134106)
This PR is a replacement for https://github.com/pytorch/pytorch/pull/133085 for pushing a quick fix for RMSNorm.
The original author is @kkontny

Previous PR summary:
Since FP16 has a quite small dynamic range, it is very easy to overflow while computing `at::pow(input, 2)`, and this happens in real-world computation.

I tried to use the `nn.RMSNorm` fused implementation instead of `LlamaRMSNorm` inside the `transformers` implementation of Llama (`src/transformers/models/llama/modeling_llama.py`). It started to give wrong answers in FP16 while still giving good ones in FP32. I figured out that this happens due to overflow while computing the square of the input tensor.

The original `LlamaRMSNorm` implementation upcasts the input to fp32 to prevent this and to give better numerical stability.

```
class LlamaRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        """
        LlamaRMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)
```

The proposed commit fixes the issue. FP16 in RMSNorm has to be treated in a special way to be usable in real-world implementations.
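
To make the overflow concrete, here is a small standard-library sketch. It is not the fused kernel from this PR: `fits_in_fp16` and `rms_norm_upcast` are hypothetical helper names, `struct`'s half-precision `e` format stands in for FP16 storage, and Python's float64 arithmetic stands in for the fp32 upcast.

```python
import math
import struct


def fits_in_fp16(value: float) -> bool:
    """Return True if `value` can be packed as an IEEE 754 half-precision float."""
    try:
        struct.pack("<e", value)
        return True
    except OverflowError:
        # Raised when the value is too large for the half-precision format.
        return False


def rms_norm_upcast(values, eps=1e-6):
    """RMS-normalize `values`, doing the squares and the mean in wide precision."""
    variance = sum(v * v for v in values) / len(values)
    scale = 1.0 / math.sqrt(variance + eps)
    return [v * scale for v in values]


x = [300.0, -200.0, 150.0]            # each input is representable in FP16
squares_fit = [fits_in_fp16(v * v) for v in x]
normalized = rms_norm_upcast(x)
outputs_fit = all(fits_in_fp16(v) for v in normalized)
```

The inputs and the normalized outputs all fit in half precision; only the intermediate squares exceed the format's range, which is why doing the intermediate math in a wider type is enough.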

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134106
Approved by: https://github.com/mikaylagawarecki, https://github.com/eqy
2024-09-11 22:02:07 +00:00
66db61f0d1 [ONNX] Update fake mode usage in onnx docs (#135512)
Update fake mode usage in onnx docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135512
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-09-11 21:29:04 +00:00
c025f7becc Revert "[Partitioner] Reuse partition to check whether nodes exist (#135317)"
This reverts commit e004d539da3335d97a8134c9081245628f18eb67.

Reverted https://github.com/pytorch/pytorch/pull/135317 on behalf of https://github.com/izaitsevfb due to BC-breaking, breaks executorch and internal meta builds ([comment](https://github.com/pytorch/pytorch/pull/135317#issuecomment-2344730294))
2024-09-11 21:27:53 +00:00
8c4e1148b8 Refactoring byte_order (#135558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135558
Approved by: https://github.com/mikaylagawarecki
2024-09-11 21:06:43 +00:00
e20ee39558 Expand bitwise ops to unsigned types (#135525)
Fixes https://github.com/pytorch/pytorch/issues/135436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135525
Approved by: https://github.com/ezyang
2024-09-11 20:48:52 +00:00
74fd1bf965 [ROCm] Update to AOTriton 0.7b (#134498)
Notable changes:
1. Enable CudaGraph related tests
2. Fix UT problems
3. EXPERIMENTAL Navi31 support. User should enable Navi31 support with Env Var `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`

Known Problem:
1. `test/test_transformers.py` will show massive failures and/or NaN outputs with `--use-pytest`
    + Update: Confirmed that skipping `class TestSDPAPrivateUse1Only` fixes the problem with `--use-pytest`

Note:
AOTriton 0.7b adds support for nested tensors + SDPA, but more work (and consequently a separate PR) is needed to enable it.

Fixes #133540

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134498
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet
2024-09-11 20:34:01 +00:00
5d964a5eb7 [Export] Fix SDPA decomposition (#135297)
Summary: Update the SDPA decomposition to match the updated stride from D62009189, which aligns strides with `aten._scaled_dot_product_attention_math.default` and makes `t.permute().contiguous().permute()` no longer necessary.

Test Plan: CI

Differential Revision: D62278378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135297
Approved by: https://github.com/drisspg
2024-09-11 20:21:59 +00:00
118d7e1480 [Inductor] add _dynamo.reset to test_cat_slice_cat_cuda (#135694)
Summary: test_cat_slice_cat_cuda runs inductor multiple times and checks counters["inductor"] in between, and thus we need to reset properly.

Differential Revision: D62500331

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135694
Approved by: https://github.com/masnesral
2024-09-11 20:07:11 +00:00
dd47f6f623 Simplify expr before getting implications in _maybe_evaluate_static (#135499)
Fixes #134268

Previously we weren't simplifying these expressions before calling get_implications, resulting in inconsistent application of FloorDiv/CleanDiv. See #134268  for more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135499
Approved by: https://github.com/ezyang
2024-09-11 19:48:29 +00:00
e05ea2b179 Add decomposition for transpose_copy (#130943)
* Extracted from #128416
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130943
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-11 19:45:22 +00:00
ad75b09d89 Replace capture_pre_autograd_graph with export_for_training in torch tests (#135623)
Summary: as title

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_conv_dynamic
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r matcher
 buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r x86
```

CI

Differential Revision: D62448302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135623
Approved by: https://github.com/tugsbayasgalan
2024-09-11 19:23:08 +00:00
a2cb9b7331 Flip triton kernel default layout constraint to "needs_fixed_stride_order" (#135581)
This is to match the default layout constraint for custom operators. By
default, Inductor should match the stride order of inputs to a triton
kernel.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135581
Approved by: https://github.com/eellison
ghstack dependencies: #135530
2024-09-11 18:43:18 +00:00
451eaf0ff2 Log full exception trace when error raised in Dynamo (#135697)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135697
Approved by: https://github.com/Skylion007
2024-09-11 18:14:33 +00:00
09519eb195 Support rolling over a percentage of workflows (#134816)
In order to support adding a rollover percentage, this ended up being a complete rewrite of runner_determinator.py.

Details of the new format are in the comments up top.

On the plus side, this now includes some unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134816
Approved by: https://github.com/PaliC, https://github.com/zxiiro
2024-09-11 18:01:26 +00:00
5314ae2660 Don't use exception chaining for BackendCompilerFailed (#135545)
Commandeered from https://github.com/pytorch/pytorch/pull/135496 as I'm now helping @ezyang ship dynamic float arguments in PT2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135545
Approved by: https://github.com/ezyang
2024-09-11 17:49:18 +00:00
da587de9cb [ROCm] [BUGFIX] Re-enable rocm-specific tuning parameters v2 (#133852)
Small bug fix - https://github.com/pytorch/pytorch/pull/124592 replaced the torch.version.hip with device_props but made a mistake in porting the original logic.

The original code was:
`if torch.version.hip is not None:`

Which was incorrectly replaced by:
`if self.device_props.type != "hip":`

Another occurrence of https://github.com/pytorch/pytorch/pull/130617
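
A minimal sketch of why the ported condition was wrong. `DeviceProps` and the three functions are hypothetical stand-ins written for this illustration, and the `== "hip"` form is an assumption about the intended fix based on the two conditions quoted above, not a copy of the patched code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DeviceProps:
    """Hypothetical stand-in for the device properties object."""
    type: str


def rocm_tuning_original(hip_version: Optional[str]) -> bool:
    # Original logic: the ROCm path is taken when a HIP version is present.
    return hip_version is not None


def rocm_tuning_buggy(props: DeviceProps) -> bool:
    # Incorrect port: the comparison is inverted.
    return props.type != "hip"


def rocm_tuning_fixed(props: DeviceProps) -> bool:
    # Port that preserves the original meaning (assumed form of the fix).
    return props.type == "hip"


rocm = DeviceProps(type="hip")
cuda = DeviceProps(type="cuda")
```

With these definitions the buggy port enables the ROCm-specific branch on CUDA devices and disables it on ROCm devices, the opposite of the original check.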

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133852
Approved by: https://github.com/masnesral, https://github.com/malfet
2024-09-11 17:21:40 +00:00
82a4df2d5f [CI] [ROCm] Run rocm workflow on every push to main branch (#135644)
Dial the frequency back up from https://github.com/pytorch/pytorch/pull/131637

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135644
Approved by: https://github.com/huydhn
2024-09-11 17:21:05 +00:00
18a9030952 [CI] Fix update slow tests (#135390)
* Add pytorchbot to list of approvers for file
* Add labels to the auto created PR

The auto generated PR is currently not merging due to some failing tests on slow workflow that were supposed to be moved back to normal

idk if this has much value, clearly we've been managing without the update
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135390
Approved by: https://github.com/ZainRizvi
2024-09-11 17:02:17 +00:00
03f23d07b4 Optimize ShapeEnv.replace (#135652)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135652
Approved by: https://github.com/ezyang
ghstack dependencies: #135621, #135622
2024-09-11 16:50:59 +00:00
8c738c9270 Improve performance of sympy_generic_le (#135622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135622
Approved by: https://github.com/ezyang
ghstack dependencies: #135621
2024-09-11 16:20:03 +00:00
7ddacaf40a Improve performance of canonicalize_bool_expr (#135621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135621
Approved by: https://github.com/ezyang
2024-09-11 16:20:03 +00:00
183c32fd3b Revert "[Dynamo] Trace torch function modes entered outside of torch.compile (#133137)"
This reverts commit 0d15122092c27fec1143b800bab7c996d126b547.

Reverted https://github.com/pytorch/pytorch/pull/133137 on behalf of https://github.com/clee2000 due to something in this stack broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](444b52ff40), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/133137#issuecomment-2344054339))
2024-09-11 15:57:00 +00:00
3ab12e2596 Revert "[Dynamo] Support thread local setattr (#135443)"
This reverts commit 160c228a4bd60ceffa62b045a6b0a6f9413835c5.

Reverted https://github.com/pytorch/pytorch/pull/135443 on behalf of https://github.com/clee2000 due to something in this stack broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](444b52ff40), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/135443#issuecomment-2344042800))
2024-09-11 15:53:55 +00:00
596e93b506 Revert "[dynamo] Bug fix for _torchdynamo_inline source handling (#135612)"
This reverts commit 5c3d0a2dedbc0e85f3b256ce56ac674078a5fae1.

Reverted https://github.com/pytorch/pytorch/pull/135612 on behalf of https://github.com/clee2000 due to broke inductor/test_cpu_select_algorithm.py::TestSelectAlgorithmCPU::test_linear_input_transpose_bias_True_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10805518363/job/29982386304) [HUD commit link](5c3d0a2ded), bad TD ([comment](https://github.com/pytorch/pytorch/pull/135612#issuecomment-2344039370))
2024-09-11 15:51:12 +00:00
f96e8041b1 Revert "[Dynamo] Simplify torch function mode stack guard (#135444)"
This reverts commit 444b52ff40cf4afce7bc3fdcf021a88eab3b954c.

Reverted https://github.com/pytorch/pytorch/pull/135444 on behalf of https://github.com/clee2000 due to something in this stack broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](444b52ff40), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/135444#issuecomment-2344036843))
2024-09-11 15:48:27 +00:00
7cf9c81918 Revert "[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)"
This reverts commit 6a3edfcc1e474e6ebd0c06624000a6d6bf1a0dee.

Reverted https://github.com/pytorch/pytorch/pull/134732 on behalf of https://github.com/clee2000 due to broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](444b52ff40), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2344016694))
2024-09-11 15:39:21 +00:00
49e0b88aab Fix test_triton_kernel_float64_constant (#135583)
Summary: Landed https://github.com/pytorch/pytorch/pull/135260 too soon and the test in that PR doesn't do exactly what I tested (actually test different dtypes).

Test Plan: `python test/inductor/test_triton_kernels.py -k float64_constant`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135583
Approved by: https://github.com/isuruf, https://github.com/eellison, https://github.com/Skylion007
2024-09-11 15:16:23 +00:00
ee8c5cc1cc For S444023: Back out "deprecate search_autotune_cache (#133628)" (#135186)
Summary: For S444023

Test Plan:
Revert prevented the NaN errors - f639391901
Training job ran for 7767 iterations. NaN errors show up within the first 1k.

Reviewed By: nmacchioni

Differential Revision: D62224747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135186
Approved by: https://github.com/kit1980
2024-09-11 14:08:40 +00:00
ce4d146f56 ATen | Fix MPSCNNNeuron creation on Mac Catalyst. (#135595)
Summary:
These are still utilized directly when using relu/sigmoid/tanh tensors directly from here: https://fburl.com/code/k6n7ofzd
However, on Mac Catalyst we always returned `nil`, which in most cases rendered the entire graph useless, most often leaving just stray `MPSTemporaryImage` references that were never written into.

This fixes the issue completely by making sure that we always return the valid kernels back, so they can be executed.

Test Plan: Test with segmentation net that uses a combination of relu and other tensors together - run this via Mac Catalyst build - it works! {F1858576745}

Reviewed By: MichaelTay

Differential Revision: D62430010

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135595
Approved by: https://github.com/MichaelTay
2024-09-11 11:12:23 +00:00
0226fcaacf Disable cuda specific restrictions in _scaled_mm for other devices (#135579)
Fixes #135576

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135579
Approved by: https://github.com/drisspg
2024-09-11 11:05:38 +00:00
4cde5096c4 [Inductor][FlexAttention] Supports dynamic shapes with block mask (#135629)
Fixes #134560 and #135206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135629
Approved by: https://github.com/drisspg
2024-09-11 08:10:50 +00:00
443c015393 [Distributed] Improve efficiency of NaN checker (#135414)
Some customers would like to run the NaN checks on the fly, so we are improving its efficiency.

## Benchmarking
Allreduce 2G floats. `TORCH_NCCL_NAN_CHECK=1`
Red kernel: ncclAllreduce
Blue kernel: Nan check

<img width="1093" alt="Screenshot 2024-09-06 at 10 00 05 PM" src="https://github.com/user-attachments/assets/5501bc31-024f-4115-adb2-dd66eb4025d3">

## Comparison with torch ops:
Let's say a user manually checks for NaNs with the following torch ops before all-reduce:
```
torch.any(torch.isnan(x))
```
<img width="1091" alt="Screenshot 2024-09-06 at 10 14 53 PM" src="https://github.com/user-attachments/assets/1f8b5f63-c955-4612-bb96-241b6c69959b">

So our perf is on par with torch ops.

## Changes
- Load from vidmem using "big packs" of 16 bytes
- Bump `blockDim.x` from 256 to 512
- Separate loads and checks into two loops, each of 8 iterations
- Unroll the loops
- Templated functions for checking NaN in a "big pack" based on dtype
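
The kernel itself is CUDA and is not shown in this log. As a rough host-side illustration of the "big pack" idea, the sketch below reads a float32 buffer 16 bytes at a time and tests the IEEE 754 bit pattern of each word. All names here are hypothetical, and only the float32 case is covered, whereas the real change is templated on dtype.

```python
import struct

PACK_BYTES = 16  # size of one "big pack", as described in the list above


def is_nan_bits_f32(bits: int) -> bool:
    """A float32 is NaN when its exponent bits are all ones and its mantissa is non-zero."""
    return (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0


def pack_has_nan_f32(pack: bytes) -> bool:
    """Check the four little-endian float32 words stored in one 16-byte pack."""
    words = struct.unpack("<4I", pack)
    return any(is_nan_bits_f32(w) for w in words)


def buffer_has_nan_f32(buf: bytes) -> bool:
    """Scan a buffer whose length is a multiple of 16 bytes, one pack at a time."""
    if len(buf) % PACK_BYTES != 0:
        raise ValueError("buffer length must be a multiple of 16 bytes")
    return any(
        pack_has_nan_f32(buf[i:i + PACK_BYTES])
        for i in range(0, len(buf), PACK_BYTES)
    )
```

Checking the raw bits keeps the test independent of floating-point comparison semantics, and infinities are correctly not reported because their mantissa is zero.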

Special thanks to @jbachan from NCCL!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135414
Approved by: https://github.com/wconstab
2024-09-11 07:53:42 +00:00
4ae6d7c18f Back out "[pytorch][PR] [export] fix re-export custom metadata" (#135634)
Summary: Broke some tests. Revert this diff

Test Plan: CI

Differential Revision: D62474337

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135634
Approved by: https://github.com/tugsbayasgalan
2024-09-11 06:16:26 +00:00
3084b7b5c0 [cuDNN][SDPA] Support attn_bias in cuDNN (#130482)
CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130482
Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-11 05:59:25 +00:00
5c3d0a2ded [dynamo] Bug fix for _torchdynamo_inline source handling (#135612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135612
Approved by: https://github.com/drisspg
ghstack dependencies: #135588
2024-09-11 05:23:42 +00:00
c608b17f60 [PTD][BE][c10d] Add some code documents for TCPStore code and cosmetic changes to libUVStore code (#130496)
While designing something else that needed TCPStore, I spent some time digging into the TCPStore codebase and found the code a little challenging to understand without proper documentation. Although people from the OSS community must be smarter than me, I still want to document my findings in the code so that devs and users can use them as a reference down the road.

Also for libuv, we need to mark private variables with a "_", so this is a pure renaming of private variables such as `tcpServer`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130496
Approved by: https://github.com/wconstab
2024-09-11 04:42:25 +00:00
444b52ff40 [Dynamo] Simplify torch function mode stack guard (#135444)
The semantics of ignored modes previously had edge cases; this PR eliminates them by, in essence, filtering any ignored modes out of both the ref stack and the current torch function mode stack. This is purely to fix complexity in #135422. The ignored modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts instead of inserting them into the graph, which required these extra workarounds for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443
2024-09-11 04:18:22 +00:00
160c228a4b [Dynamo] Support thread local setattr (#135443)
In preparation for tracing through DeviceContext (defb515306/torch/utils/_device.py (L66))
This PR adds support for calling the setattr of thread local objects. These objects have a slots impl, and since this doesn't appear to have any side effects, we call this setattr impl when replaying mutations, since calling `object.__setattr__` on these objects results in a type error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137
2024-09-11 04:18:22 +00:00
0d15122092 [Dynamo] Trace torch function modes entered outside of torch.compile (#133137)
This PR adds initial tracing for torch function modes.

Details:
In essence, this adds tracing into the torch function of modes entered outside of the torch.compile call.
This does not yet support tracing enter/exit of a torch function mode/ tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and handling graph breaks like we do other torch.* context managers.

Previously landed:
https://github.com/pytorch/pytorch/pull/133135
https://github.com/pytorch/pytorch/pull/133136
https://github.com/pytorch/pytorch/pull/133134
https://github.com/pytorch/pytorch/pull/133133
https://github.com/pytorch/pytorch/pull/133132
https://github.com/pytorch/pytorch/pull/133131
https://github.com/pytorch/pytorch/pull/133729
https://github.com/pytorch/pytorch/pull/133130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #134732
2024-09-11 04:18:22 +00:00
6a3edfcc1e [Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)
For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
Approved by: https://github.com/ydwu4
2024-09-11 04:18:22 +00:00
356f14e7b7 Fix the output of FileCheck when not run and add unit tests (#135345)
When FileCheck is destructed without execution, it should output all rules.
For example:
```
>>> fc = FileCheck().check("test")
>>> del fc
You have not run this instance of FileCheck!
FileCheck checks:
        CHECK: test
```

Additionally, unit tests for the Python interface of FileCheck will be added.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135345
Approved by: https://github.com/eellison
2024-09-11 04:13:24 +00:00
34dc8f69a1 Adding entry-point based support for out-of-tree rendezvous plugins (#132633)
Fixes #127519

Currently in torchrun rendezvous, only two rendezvous backends are supported out of the box: `C10d` and `Etcd`. The changes in this PR enable distributed elastic users to bring their own out-of-tree rendezvous backend implementations as Python packages.

#### AUTHORING NEW PLUGIN
Any new plugin will be a python package exposing entry-points. For example, the structure of redis plugin is as follows:

```
plugin_root
|_ pyproject.toml
|_ src
   |_ redis
      |_ __init__.py
      |_ redis_store.py
      |_ redis_backend.py
```

The contents of the `pyproject.toml` should indicate that the package exposes a torchrun entry-point by mentioning the group name `torchrun.plugins`. The `pyproject.toml` for the redis plugin would be as follows:

```
[project]
name = "redis"
version = "0.0.1"

[project.entry-points.'torchrun.plugins']
redis = 'redis'
```

The `src/redis/__init__.py` file would contain functions that return the plugin name and plugin handler. The contents of `__init__.py` for redis would be as follows:

```
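from torch.distributed.elastic.rendezvous import RendezvousParameters
from torch.distributed.elastic.rendezvous.dynamic_rendezvous import create_handler


def getPluginHandler():
    # Return a factory that builds the rendezvous handler for the redis backend.
    def _create_redis_handler(params: RendezvousParameters):
        from redis_rendezvous_backend import create_backend
        backend, store = create_backend(params)
        return create_handler(store, backend, params)
    return _create_redis_handler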
```

The files `redis_store` and `redis_backend` contain the implementation of [Store](41189b0da4/torch/_C/_distributed_c10d.pyi (L171)) and [RendezvousBackend](e782918b8e/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py (L61)) respectively.

#### USER EXPERIENCE
Before using the plugin for the first time, the user has to install the plugin package. For example, a published package can be installed using `pip3 install <plugin-name>`, and a plugin in the local file system can be installed using `pip3 install -e <plugin-location>`.

Once installed, the new backend can be used in torchrun as follows:

```
torchrun --rdzv-backend=redis --rdzv-endpoint=redis-container:6379 --nnodes=3 --nproc-per-node=1 --max-restarts=3 --rdzv-id=1 test.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132633
Approved by: https://github.com/fduwjj
2024-09-11 03:35:02 +00:00
cd9ee49a69 [aoti] Add cpp loader (#135374)
* Added a cpp loader, AOTIModelPackageLoader, which can load the .pt2, build the .so, and create a runner. The python-facing API is that users can directly call the `run` function, whereas in cpp users can directly access the `runner_` if they are more familiar with that. I couldn't figure out how to bind the `get_runner()` function to python...
* Added a new config, `aot_inductor.package_cpp_only` which will **not** package the so. This means that whenever the package is loaded, we will need to build the so. This is turned off by default so that new environments do not need to rebuild their so. The `package_cpp_only` is a feature which torchchat intends to use to provide flexibility to users.
* Added a new config, `aot_inductor.metadata` which stores user-provided metadata, serialized to the pt2 as a json file. It also stores the device used when exporting, "cuda" or "cpu", so that during load time, we can use that data to determine which AOTIModelContainerRunner to use. The metadata can be accessed through `loader.get_metadata()`. TODO is to move this metadata to the toplevel `package_aoti` function so that we can remove the metadata as a config.
* Separated out `package_aoti` as a standalone function, instead of it automatically being called in inductor. This is to prepare for the case where users will compile multiple models, and want to bundle it in one package. The specific use case is in torchchat, where we want to package the separately-exported encoder and decoder layers. An example of how to use this is in `test_multiple_methods`.
* `load_package` will load a singular model, given the model name.
* The loader doesn't support windows for now, I think I need to add some more casing to make the build commands work on windows?

Differential Revision: [D62329906](https://our.internmc.facebook.com/intern/diff/D62329906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135374
Approved by: https://github.com/desertfire, https://github.com/malfet
2024-09-11 03:00:01 +00:00
26e5572dd2 Bump triton xpu pin and release version (#135638)
Similar to https://github.com/pytorch/pytorch/pull/135627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135638
Approved by: https://github.com/atalman
2024-09-11 00:56:15 +00:00
693897df42 [dynamo] Missing guard source keys for corner case of NNModuleVariabl… (#135041)
Potentially fixes - https://fb.workplace.com/groups/1286739428954016/permalink/1319662695661689/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135041
Approved by: https://github.com/ezyang
2024-09-11 00:43:26 +00:00
3bf6be457d [MPS] Add missing dispatch to rshift.Tensor (#135607)
Missed it while working on https://github.com/pytorch/pytorch/pull/131813
Test plan: `python -c "import torch;print(torch.randint(100, 500, (64,), device='mps') >> torch.tensor([3,], device='mps'))"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135607
Approved by: https://github.com/manuelcandales
2024-09-11 00:20:53 +00:00
492f064f15 [ONNX] Add assertion nodes to ignoring list (#135591)
Fixes #135419

PS: there are 104 empty output nodes, I suggest we add them one by one when we run into them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135591
Approved by: https://github.com/justinchuby
2024-09-11 00:18:17 +00:00
29408ea81a Add option to tweak inductor stride settings for user-defined triton kernels (#135530)
Previously, Inductor was allowed to modify the stride/storage_offset
(layout) for inputs to user-defined triton kernels. This can cause
silent incorrectness because most triton kernels are written for a
specific striding pattern (usually contiguous).

This PR adds a config to allow the user to choose Inductor's behavior on
this. The options are:
- "flexible_layout" (default): Inductor can modify the layout for inputs
  to user-defined triton kernels as much as it wants.
- "needs_fixed_stride_order": Inductor must preserve the stride order
  (when compared to tracing) for inputs to user-defined triton kernels.

This matches our handling for custom operators. In the future, we'll
want a "needs_exact_strides" option (this is the safest option).
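
To illustrate why a changed layout can cause silent incorrectness, here is a toy sketch in plain Python (no Triton or Inductor involved). `read_2d` is a hypothetical helper that reads a 2D view out of flat storage; the "kernel" is assumed to have been written for a contiguous layout.

```python
def read_2d(storage, shape, strides, offset=0):
    """Read a 2D view from flat storage using the given strides (toy helper)."""
    rows, cols = shape
    return [
        [storage[offset + i * strides[0] + j * strides[1]] for j in range(cols)]
        for i in range(rows)
    ]


storage = [0, 1, 2, 3, 4, 5]
shape = (2, 3)

# Layout the compiler actually picked in this toy scenario: column-major.
actual_strides = (1, 2)
# Layout the hand-written kernel assumes: contiguous (row-major).
assumed_strides = (3, 1)

expected = read_2d(storage, shape, actual_strides)
misread = read_2d(storage, shape, assumed_strides)

print(expected)
print(misread)
```

Both reads succeed without any error, yet they return different values, which is the kind of silent mismatch the new config is meant to let users rule out.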

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135530
Approved by: https://github.com/FindHao, https://github.com/oulgen
2024-09-11 00:11:17 +00:00
02dcb07765 Add boolean support in pack segments ops for both cpu and cuda impls (#132897) (#135620)
Summary:

Same as int types, forward only.

bypass-github-export-checks diff has been synced to github

Test Plan:
buck test mode/dev-nosan //caffe2/torch/fb/sparsenn:test -- test_pack_segments
https://www.internalfb.com/intern/testinfra/testconsole/testrun/16888498646804437/

Reviewed By: garroud

Differential Revision: D60785563

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135620
Approved by: https://github.com/kit1980

Co-authored-by: Haoming Lu <haominglu@meta.com>
2024-09-11 00:03:17 +00:00
5c38aa72c0 [dynamo][dicts][nv-embed] Support update with kwargs (#135588)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135588
Approved by: https://github.com/yanboliang
2024-09-10 23:50:23 +00:00
5134ba7458 Bump triton pin and release version (#135627)
Update the pin and release version to sync with https://github.com/triton-lang/triton/tree/release/3.1.x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135627
Approved by: https://github.com/Chillee, https://github.com/drisspg, https://github.com/malfet
2024-09-10 23:46:36 +00:00
e48ee2cf50 [ONNX] Fix scaled_dot_product_attention with float scale (#135594)
Fixes #125158

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135594
Approved by: https://github.com/justinchuby
2024-09-10 23:04:02 +00:00
eb38ee21ba [ROCm] slow torch.sum optimization by increasing max_values_per_thread in reduce config (#135397)
Fixes #132964

This change is to optimize torch.sum() performance by increasing max_values_per_thread in setReduceConfig() for ROCm platform.
Increasing this parameter uses fewer thread blocks and improves performance.

Test:
Tested on MI300x and H100, and now the MI300x perf improved to 3205GByte/s from ~1690GByte/s for the test case and is slightly better than H100 (3136GByte/s).

We also tested other tensor sizes and saw performance improvements there as well.

```python
import torch
from triton.testing import do_bench

x = torch.randn(2**30, device='cuda')

ms = do_bench(lambda: x.sum(dim=-1))

bandwidth_gbyte = x.numel() * x.dtype.itemsize / (10**9)

time_s = ms / 1000

bw_per_second = bandwidth_gbyte / time_s

print(bw_per_second)
```
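
As a rough sanity check on the reported numbers, the bandwidth formula above can be inverted to get the kernel time each figure implies. This sketch assumes the tensor is float32 (4 bytes per element, the default dtype of `torch.randn`); `implied_time_ms` is a hypothetical helper, and the resulting times are derived values, not measurements from the PR.

```python
def implied_time_ms(numel, itemsize, gbytes_per_s):
    """Time in ms needed to read numel * itemsize bytes at the given bandwidth."""
    gbytes = numel * itemsize / 10**9
    return gbytes / gbytes_per_s * 1000


numel, itemsize = 2**30, 4  # assumption: float32 input as in the script above

before_ms = implied_time_ms(numel, itemsize, 1690)  # ~1690 GByte/s before
after_ms = implied_time_ms(numel, itemsize, 3205)   # 3205 GByte/s after
speedup = before_ms / after_ms

print(round(before_ms, 2), round(after_ms, 2), round(speedup, 2))
```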

Co-author: @carlobertolli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135397
Approved by: https://github.com/eqy, https://github.com/malfet
2024-09-10 21:03:01 +00:00
8057b72763 [ez][inductor] don't benchmark cloning if there are no mutated args (#135533)
When a kernel does not have mutated args (this is quite common?), benchmarking the cost of cloning actually benchmarks a no-op. This still takes >100ms since triton.testing.do_bench will allocate 100 ms budget to run the kernel.
Skipping this benchmarking can save quite some compilation time if the code path is hit multiple times. Let's say, if the code path is hit 100 times when the graph is large, we would save >10s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135533
Approved by: https://github.com/jansel
ghstack dependencies: #135531
2024-09-10 20:54:31 +00:00
7b17918dc9 [inductor] fix a device sync issue for benchmarking fusion (#135531)
Fix https://github.com/pytorch/pytorch/issues/134768 .

When we benchmark the latency for a fused node set, we do benchmarking twice:
1. benchmark the latency of the kernel including cloning mutated args
2. benchmark the latency of cloning mutated args without running the kernel

We subtract result 2 from result 1 to get the latency of the kernel itself.

But when the tensors are not on CUDA device 0, we get equal numbers for result 1 and result 2 no matter how much work the kernel does. The root cause is that, in `triton.testing.do_bench`, the `torch.cuda.synchronize` call syncs the current CUDA device (which is device 0 if it's not overridden). But since the tensors and kernels are located on another device, the sync actually does nothing (unless there happen to be other kernels on device 0).

The fix is to set the correct current device in our benchmarking code.
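
The two-pass subtraction and the failure mode can be sketched with a toy model in plain Python. This is only an illustration under stated assumptions: `FakeGpu` is a hypothetical stand-in where launches are asynchronous and `synchronize(dev)` waits only for work queued on `dev`; the costs (0.25 ms clone, 1.5 ms kernel) are made up. In the toy model the wrong-device results are both zero, whereas on real hardware they would be equal but nonzero.

```python
class FakeGpu:
    """Toy model: async launches, per-device queues, one shared clock in ms."""

    def __init__(self):
        self.clock = 0.0
        self.pending = {0: 0.0, 1: 0.0}

    def launch(self, dev, ms):
        self.pending[dev] += ms  # queue work without waiting for it

    def synchronize(self, dev):
        self.clock += self.pending[dev]  # wait only for this device's work
        self.pending[dev] = 0.0


def bench(gpu, fn, sync_dev):
    """Time fn, synchronizing sync_dev before and after."""
    gpu.synchronize(sync_dev)
    start = gpu.clock
    fn()
    gpu.synchronize(sync_dev)
    return gpu.clock - start


def kernel_latency(work_dev, sync_dev):
    gpu = FakeGpu()
    clone = lambda: gpu.launch(work_dev, 0.25)
    kernel = lambda: gpu.launch(work_dev, 1.5)
    with_clone = bench(gpu, lambda: (clone(), kernel()), sync_dev)  # result 1
    clone_only = bench(gpu, clone, sync_dev)                        # result 2
    return with_clone, clone_only, with_clone - clone_only


wrong = kernel_latency(work_dev=1, sync_dev=0)  # sync on the wrong device
right = kernel_latency(work_dev=1, sync_dev=1)  # sync on the device doing the work
print(wrong, right)
```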

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135531
Approved by: https://github.com/jansel
2024-09-10 20:54:31 +00:00
66c45f3ed9 [export] fix re-export custom metadata (#135282)
Fixes #134778

When a model is exported and debug handles are added to the "custom" field of non-placeholder and non-output nodes in the graph, re-exporting it will change the metadata of placeholder nodes (the "custom" field will be added or copied to these nodes, depending on whether `ExportedProgram` or `ExportedProgram.module()` is passed to `generate_numeric_debug_handle()`).

This occurs because when we re-export the model, `placeholder` nodes are unlifted to `get_attr` nodes. These nodes remain as `get_attr` after being exported to `gm_torch_level`.  Their metadata are modified [here](https://github.com/pytorch/pytorch/blob/main/torch/export/_trace.py#L1347) based on `params_buffers_to_node_meta` which is collected [here](https://github.com/pytorch/pytorch/blob/main/torch/export/_trace.py#L1312).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135282
Approved by: https://github.com/jerryzh168, https://github.com/zhxchen17, https://github.com/tugsbayasgalan
2024-09-10 20:15:02 +00:00
0a9d55d2ee Revert "[AOTI] Fix assert_function call in cpu autotune template (#135086)"
This reverts commit 16c3b8f87cfa9cb5acee8104820baa389e7ee2bd.

Reverted https://github.com/pytorch/pytorch/pull/135086 on behalf of https://github.com/izaitsevfb due to breaks internal tests, see D62405818 ([comment](https://github.com/pytorch/pytorch/pull/135086#issuecomment-2341889428))
2024-09-10 19:51:16 +00:00
4ca65d3323 [CI] Increase sharding for jobs that are timing out (#135582)
Increase sharding for
* slow grad check
* slow cuda tests slow / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test
* avx

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135582
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-09-10 19:45:13 +00:00
c932b39739 [FSDP2] Added _set_unshard_async_op (#135523)
This PR adds a private API `_set_unshard_async_op` that allows for running pre-forward and pre-backward all-gathers using the `async_op=True` path so that all-gather allocations happen in the default stream to avoid inter-stream fragmentation.

If using this option, forward requires explicit prefetching e.g. via the `unshard(async_op=True)` API for overlap. fp32 -> bf16 casts and the all-gather copy-in will not overlap with compute.

Differential Revision: [D62401551](https://our.internmc.facebook.com/intern/diff/D62401551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135523
Approved by: https://github.com/weifengpy
2024-09-10 19:28:02 +00:00
1f15973657 [AOTI][Tooling][7/n] Add debug printing support for JIT inductor codegen path as well (#135285)
Summary:
1.  Add the debug printer call to a level lower for triton kernel python wrapper codegen path
2. Add `torch.save()` for jit inductor as well
3. This also fixes the issue introduced in D61949020 (at python wrapper code level for triton kernel not printing)

Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1  TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_abi_compatible_cuda
```

Differential Revision: D62272588

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135285
Approved by: https://github.com/chenyang78
2024-09-10 19:24:58 +00:00
fc88ba260f [amdsmi][torch] Update amdsmi API usages (#135504)
Summary: In ROCm 6.2.0 there were API name changes-- we check if the new APIs exist and use them in this diff; see 7b2463abe0 for the changes
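
The "use the new API if it exists" check can be sketched generically as below. The function names `get_usage_new` and `get_usage_old` are hypothetical placeholders, not real amdsmi names, and `pick_api` is an illustrative helper rather than the code in this diff.

```python
from types import SimpleNamespace


def pick_api(module, new_name, old_name):
    """Prefer the renamed function when present, else fall back to the old one."""
    fn = getattr(module, new_name, None)
    return fn if fn is not None else getattr(module, old_name)


# Stand-ins for an older and a newer library version (hypothetical names).
old_lib = SimpleNamespace(get_usage_old=lambda: "old")
new_lib = SimpleNamespace(get_usage_old=lambda: "old", get_usage_new=lambda: "new")

print(pick_api(old_lib, "get_usage_new", "get_usage_old")())
print(pick_api(new_lib, "get_usage_new", "get_usage_old")())
```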

Test Plan: CI

Differential Revision: D62325661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135504
Approved by: https://github.com/eqy, https://github.com/houseroad
2024-09-10 19:15:39 +00:00
bf8d0e3107 [inductor] Enable subprocess parallel compile internally with killswitch (#132467)
Differential Revision: [D60629630](https://our.internmc.facebook.com/intern/diff/D60629630)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132467
Approved by: https://github.com/eellison
2024-09-10 19:05:46 +00:00
3a1239a248 [Profiler] Harden Record Function Kwargs (#135365)
Summary:
In S445839, we had HTA break because of the "stream" parameter that was added to gpu traces. This brought up discussions regarding hardening our post processing of said inputs as to not break JSON schema as well as downstream tools. For this reason, this diff does the following.

1. Only allow int, double, bool and string values to be processed as kwinputs for JSON output. We can handle lists if needed in the future.
2. Make sure that any boolean is written in lowercase when converted to a string, so that the JSON does not break when it is parsed
3. Force stream parameter to be an int
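
The three rules can be sketched in Python as follows. This is only an illustration of the described behavior, not the profiler's implementation; `sanitize_kwinputs` is a hypothetical name, and dropping a non-integer `stream` value is an assumption of this sketch.

```python
import json

ALLOWED_TYPES = (bool, int, float, str)  # rule 1: int, double, bool, string only


def sanitize_kwinputs(kwargs):
    """Keep only JSON-safe scalar kwargs and force `stream` to an int."""
    out = {}
    for key, value in kwargs.items():
        if key == "stream":
            try:
                out[key] = int(value)  # rule 3
            except (TypeError, ValueError):
                continue  # assumption: drop a stream value that is not an int
        elif isinstance(value, ALLOWED_TYPES):
            out[key] = value
    return out


raw = {"stream": "7", "flag": True, "shape": [1, 2], "name": "x", "lr": 0.5}
clean = sanitize_kwinputs(raw)

# Rule 2: json.dumps writes booleans as lowercase `true`/`false`,
# unlike str(True), which yields "True" and is not valid JSON.
encoded = json.dumps(clean, sort_keys=True)
print(encoded)
```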

Test Plan: Added unit tests to ensure that the list of requirements above is true for kwargs only.

Differential Revision: D62304843

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135365
Approved by: https://github.com/aaronenyeshi
2024-09-10 18:44:05 +00:00
4f9f1775d8 Fix flaky TestCudaWrapper.test_randint_cuda_cuda_wrapper (#135370)
Summary: This test is flaky when run after `test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_cuda_wrapper` because the TestCase sets config options globally in its setUp() that stick around for subsequent tests. For test isolation, we use a contextlib.ExitStack pattern in other tests to patch the config options and restore them in tearDown(). Update all TestCases in `test/inductor/test_combo_kernels.py` to use that pattern.
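
A minimal sketch of the `contextlib.ExitStack` pattern mentioned above, using only the standard library. `FakeConfig` is a hypothetical stand-in for a global config object, and `mock.patch.object` stands in for whatever config patching helper the real tests use.

```python
import contextlib
import unittest
from unittest import mock


class FakeConfig:
    """Hypothetical global config shared by all tests."""
    combo_kernels = False


class ComboKernelTest(unittest.TestCase):
    def setUp(self):
        super().setUp()
        # Register every patch on the stack so tearDown can undo them all.
        self._stack = contextlib.ExitStack()
        self._stack.enter_context(
            mock.patch.object(FakeConfig, "combo_kernels", True)
        )

    def tearDown(self):
        self._stack.close()  # restores the original config values
        super().tearDown()

    def test_flag_enabled(self):
        self.assertTrue(FakeConfig.combo_kernels)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ComboKernelTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful(), FakeConfig.combo_kernels)
```

Because the patch is undone in `tearDown`, a test that runs afterwards sees the original value rather than whatever the previous TestCase set.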

Test Plan:
```
python test/inductor/test_combo_kernels.py
python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_cuda_wrapper TestCudaWrapper.test_randint_cuda_cuda_wrapper
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135370
Approved by: https://github.com/jansel
2024-09-10 18:43:14 +00:00
5e0788befb Migrate remaining jobs to use runner determinator (#134867)
At this point all self-hosted runner jobs should be using the runner determinator to switch between LF and Meta runners. This change updates the remaining jobs that have not yet been migrated over.

Issue: https://lf-pytorch.atlassian.net/browse/PC-25

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134867
Approved by: https://github.com/ZainRizvi
2024-09-10 18:14:00 +00:00
440f8f57af Revert "[fx] Bypass custom __setattr__ in Node.__init__ (#135079)" (#135562)
This reverts commit 66da3b3b2acacb116a9b23e91b24934830eaf6b8.

#135079 breaks internal tests and needs to be reverted. Revert with mergebot doesn't work as this PR is technically part of the stack, but, according to @jansel, it should be possible to revert it individually.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135562
Approved by: https://github.com/jansel, https://github.com/seemethere
2024-09-10 18:07:11 +00:00
e004d539da [Partitioner] Reuse partition to check whether nodes exist (#135317)
Checking whether a node is in a NodeList takes O(n) time. Reuse the partition for this check instead: `partition.nodes` is a hash table containing the same elements, so the lookup is faster.
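
The difference can be sketched with plain Python containers. The node names are made up, and a `dict` is used here as a stand-in for the hash table; both containers answer the same membership question, but the list scans its elements while the dict does a hash lookup.

```python
nodes = [f"node_{i}" for i in range(1000)]  # made-up node names

node_list = list(nodes)            # membership test scans the list: O(n)
node_table = dict.fromkeys(nodes)  # membership test hashes the key: O(1) on average

queries = ["node_0", "node_999", "missing"]
via_list = [q in node_list for q in queries]
via_table = [q in node_table for q in queries]

print(via_list, via_table)
```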

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135317
Approved by: https://github.com/ezyang
2024-09-10 17:45:29 +00:00
c4b84a46a9 Add more logging to TunableOp validators (#135396)
Summary: Add more logging to TunableOp validators

Test Plan:
Verified additional logging when loading kernel selections:
```
ROCBLAS_VERSION validation: expect 4.0.0-72e57364-dirty to match 4.0.0-72e57364-dirty
GCN_ARCH_NAME validation: expect gfx942:sramecc+:xnack- to match gfx942:sramecc+:xnack-
HIPBLASLT_VERSION validation: expect 800-a15e4178 to match 800-a15e4178
ROCM_VERSION validation: expect 6.0.0.0-12969-1544e39 to match 6.0.0.0-12969-1544e39
PT_VERSION validation: expect 2.5.0 to match 2.5.0
```

```
[qizixi@devgpu039.atn3 /data/users/qizixi/fbsource/fbcode (f9305317d|remote/master)]$ PYTORCH_TUNABLEOP_VERBOSE=1 buck2 run mode/{opt,amd-gpu} -c fbcode.enable_gpu_sections=true //scripts/xdwang/example:fc_llama -- --enable-tuning
File changed: fbcode//hipblas_tuning_pt_llama0.csv
Buck UI: https://www.internalfb.com/buck2/1ed2fac4-743e-49ef-805f-7fb6b9300022
Network: Up: 0B  Down: 0B
Jobs completed: 4189. Time elapsed: 0.2s.
BUILD SUCCEEDED
Enabled tuning
- Run Linear (matmul) 2 x 1280 x 8192, dtype = torch.bfloat16
INFO:2024-09-06 14:38:07 2834864:2835138 CuptiActivityProfiler.cpp:260] HIP versions. Roctracer: 4.1; Runtime: 60032830; Driver: 60032830
INFO:2024-09-06 14:38:07 2834864:2836083 DynoConfigLoader.cpp:61] Setting communication fabric enabled = 0
reading tuning results from hipblas_tuning_pt_llama0.csv
Validator PT_VERSION=2.5.0
Validator ROCM_VERSION=6.0.0.0-12969-1544e39
Validator HIPBLASLT_VERSION=800-a15e4178
Validator GCN_ARCH_NAME=gfx942:sramecc+:xnack-
Validator ROCBLAS_VERSION=4.0.0-72e57364-dirty
ROCBLAS_VERSION validation: expect 4.0.0-72e57364-dirty to match 4.0.0-72e57364-dirty
GCN_ARCH_NAME validation: expect gfx942:sramecc+:xnack- to match gfx942:sramecc+:xnack-
HIPBLASLT_VERSION validation: expect 800-a15e4178 to match 800-a15e4178
ROCM_VERSION validation: expect 6.0.0.0-12969-1544e39 to match 6.0.0.0-12969-1544e39
PT_VERSION validation: expect 2.5.0 to match 2.5.0
Loading results
Avg time: 13.165860176086426 us, Achieved 3.19 TFLOPS, 1598.24 GB/s

- Run Linear (matmul) 2 x 8192 x 1024, dtype = torch.bfloat16
Avg time: 13.230760097503662 us, Achieved 2.54 TFLOPS, 1271.14 GB/s

- Run Linear (matmul) 2 x 7168 x 8192, dtype = torch.bfloat16
Avg time: 26.804399490356445 us, Achieved 8.76 TFLOPS, 4384.90 GB/s

- Run Linear (matmul) 2 x 8192 x 3584, dtype = torch.bfloat16
Avg time: 13.407809734344482 us, Achieved 8.76 TFLOPS, 4384.14 GB/s

2x1280x8192-torch.bfloat16,13.165860176086426,3.18574247630113,1598.237845349412
2x8192x1024-torch.bfloat16,13.230760097503662,2.536092541374924,1271.1420867780075
2x7168x8192-torch.bfloat16,26.804399490356445,8.762778814892096,4384.9040543618985
2x8192x3584-torch.bfloat16,13.407809734344482,8.759112362638383,4384.138585247748
```
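
The validation step visible in the log can be sketched as below. This is an assumption-laden illustration, not the TunableOp code: `validate` is a hypothetical name, and it is assumed that the first value in each log line is the one saved in the results file and the second is the one reported by the current environment.

```python
def validate(saved, current):
    """Compare saved validator values with the current environment's values."""
    ok = True
    for name, expected in saved.items():
        actual = current.get(name)
        # Mirrors the log line format shown above.
        print(f"{name} validation: expect {expected} to match {actual}")
        ok = ok and expected == actual
    return ok


saved = {"PT_VERSION": "2.5.0", "ROCM_VERSION": "6.0.0.0-12969-1544e39"}
matching = validate(saved, dict(saved))
mismatched = validate(saved, {**saved, "PT_VERSION": "2.4.0"})
print(matching, mismatched)
```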

Reviewed By: leitian

Differential Revision: D62322830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135396
Approved by: https://github.com/eqy
2024-09-10 17:20:59 +00:00
cyy
bc1b8f094d Check function declarations of Core ML code (#135467)
Relax the restrictions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135467
Approved by: https://github.com/ezyang
2024-09-10 16:05:22 +00:00
f65a564fa2 [inductor] Flip custom_op_default_layout_constraint (#135239)
By default, Inductor should respect the stride order of input Tensors to
custom operators.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135239
Approved by: https://github.com/albanD
ghstack dependencies: #135391
2024-09-10 14:27:43 +00:00
386b313028 Handle KeyError for compiler collective in scalars too (#135385)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135385
Approved by: https://github.com/jansel
2024-09-10 12:33:04 +00:00
6d7cbc20d2 Add dynamo itertools.pairwise support (#135416)
Fixes #133766

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135416
Approved by: https://github.com/XuehaiPan, https://github.com/jansel

Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
2024-09-10 11:37:59 +00:00
ca16956b20 [Inductor] Generalize device guard codegen for cpp_wrapper mode. (#134761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134761
Approved by: https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #134693
2024-09-10 10:11:52 +00:00
67735d1ee8 [Inductor] Generalize is_cuda to specific device_type to make cpp_wrapper mode be extensible (#134693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134693
Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/jansel
2024-09-10 10:11:13 +00:00
6e13f5eb38 [FlexAttention] Add broadcast support for kv batch dimension (#135505)
This PR adds broadcast support for KV batch dimension.

## Details
Consider Q of shape `[Bq, Hq, Q_LEN, D]`, and K, V of shape `[Bkv, Hkv, KV_LEN, D]`. Prior to this diff, we required `Bq == Bkv`. However, for some use cases, we may have `Bkv < Bq`. For example, in paged attention, we provide K, V of shape `[1, Hkv, MAX_LEN, D]`, while still providing Q of shape `[Bq, Hq, Q_LEN, D]`. Here, MAX_LEN is the maximum number of tokens supported by paged attention.

This PR relaxes this requirement to `Bq == Bkv or (Bq > 1 and Bkv == 1)`. The relaxed requirement applies to flex decoding as well as flex attention forward and backward.
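
The batch-size rule can be sketched as a small check, assuming the broadcast case is `Bkv == 1` as in the paged attention example above (K, V of shape `[1, Hkv, MAX_LEN, D]`). `kv_batch_ok` and `kv_batch_index` are hypothetical helpers for illustration, not functions from FlexAttention.

```python
def kv_batch_ok(bq, bkv):
    """True if the K/V batch size either matches Q's or can be broadcast."""
    return bq == bkv or (bq > 1 and bkv == 1)


def kv_batch_index(b, bkv):
    """K/V batch entry read by query batch `b` (broadcast maps everything to 0)."""
    return 0 if bkv == 1 else b


print(kv_batch_ok(4, 4), kv_batch_ok(4, 1), kv_batch_ok(4, 2))
print([kv_batch_index(b, 1) for b in range(4)])
```

With `Bkv == 1`, every query batch entry reads the same K/V entry, which is what lets a single shared K/V buffer serve a larger Q batch.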

## Benchmark
GPU: H100

We see negligible (1%~2%) performance change from this PR when `Bq == Bkv`.

```
python benchmarks/transformer/score_mod.py --calculate-bwd
```
### Perf before this PR

**FWD**

| Type    |   Speedup | score_mod     | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)        |
|---------|-----------|---------------|------------|----------------|------------------------------|
| Average |     0.743 |               |            |                |                              |
| Max     |     0.955 | head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)   |
| Min     |     0.548 | relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128) |

**BWD**

| Type    |   Speedup | score_mod   | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)       |
|---------|-----------|-------------|------------|----------------|-----------------------------|
| Average |     0.834 |             |            |                |                             |
| Max     |     1.261 | head_bias   | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)   |
| Min     |     0.456 | None        | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128) |

<details>
<summary> Full performance sweep </summary>

| score_mod     | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)         |   fwd_eager_time |   fwd_compiled_time |   bwd_eager_time |   bwd_compiled_time |   fwd_speedup |   bwd_speedup |
|---------------|------------|----------------|-------------------------------|------------------|---------------------|------------------|---------------------|---------------|---------------|
| None          | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.264 |              17.184 |          107.040 |             140.800 |         0.888 |         0.760 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.840 |              19.744 |          112.576 |             140.064 |         0.802 |         0.804 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.232 |              17.344 |           87.744 |             142.496 |         0.878 |         0.616 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.264 |              17.184 |          108.192 |             143.328 |         0.888 |         0.755 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.904 |              22.400 |          106.432 |             136.512 |         0.889 |         0.780 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.424 |              26.752 |           91.712 |             106.688 |         0.726 |         0.860 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.808 |              22.432 |           89.024 |             101.920 |         0.883 |         0.873 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.840 |              22.272 |           88.896 |             102.592 |         0.891 |         0.867 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           30.240 |              32.416 |          116.768 |             112.256 |         0.933 |         1.040 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           29.536 |              37.024 |          113.664 |             102.688 |         0.798 |         1.107 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           30.656 |              32.800 |          116.992 |             127.008 |         0.935 |         0.921 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           30.592 |              32.480 |          116.928 |             112.160 |         0.942 |         1.043 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           40.448 |              61.920 |          198.656 |             204.512 |         0.653 |         0.971 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           37.760 |              62.528 |          189.536 |             170.624 |         0.604 |         1.111 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           40.896 |              62.368 |          198.304 |             205.824 |         0.656 |         0.963 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           40.448 |              61.952 |          198.432 |             203.648 |         0.653 |         0.974 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          318.528 |             355.904 |          947.232 |            1162.496 |         0.895 |         0.815 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          199.776 |             252.128 |          677.792 |             813.184 |         0.792 |         0.834 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          316.512 |             363.328 |          947.712 |            1361.984 |         0.871 |         0.696 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          317.984 |             356.864 |          947.264 |            1165.024 |         0.891 |         0.813 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          446.656 |             734.656 |         1664.288 |            2172.960 |         0.608 |         0.766 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          278.688 |             467.648 |         1182.624 |            1339.296 |         0.596 |         0.883 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          447.872 |             744.096 |         1662.944 |            2196.544 |         0.602 |         0.757 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          448.128 |             732.928 |         1663.072 |            2156.800 |         0.611 |         0.771 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           15.648 |              16.640 |          107.520 |             143.008 |         0.940 |         0.752 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           15.776 |              18.240 |          129.056 |             141.920 |         0.865 |         0.909 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           15.168 |              16.640 |          103.616 |             139.648 |         0.912 |         0.742 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           15.616 |              16.640 |          128.608 |             164.448 |         0.938 |         0.782 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.776 |              21.952 |          125.344 |             170.304 |         0.901 |         0.736 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.776 |              23.712 |          104.288 |             196.896 |         0.834 |         0.530 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.072 |              21.952 |          102.080 |             177.056 |         0.869 |         0.577 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.648 |              21.920 |          109.920 |             170.848 |         0.896 |         0.643 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           30.464 |              31.936 |          127.808 |             228.832 |         0.954 |         0.559 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           29.472 |              33.856 |          113.152 |             215.072 |         0.871 |         0.526 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           30.496 |              32.160 |          116.576 |             231.744 |         0.948 |         0.503 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           30.464 |              31.904 |          116.320 |             229.824 |         0.955 |         0.506 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           40.480 |              61.440 |          176.448 |             345.312 |         0.659 |         0.511 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           38.304 |              59.424 |          169.312 |             371.360 |         0.645 |         0.456 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           40.960 |              61.760 |          176.512 |             358.912 |         0.663 |         0.492 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           40.352 |              61.696 |          176.512 |             344.928 |         0.654 |         0.512 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          316.224 |             357.728 |          905.728 |            1668.448 |         0.884 |         0.543 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          199.904 |             248.416 |          636.544 |            1109.088 |         0.805 |         0.574 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          314.880 |             363.616 |          906.304 |            1658.176 |         0.866 |         0.547 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          316.160 |             354.368 |          906.080 |            1649.024 |         0.892 |         0.549 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          446.912 |             739.840 |         1555.808 |            2521.952 |         0.604 |         0.617 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          279.776 |             463.904 |         1068.928 |            1849.888 |         0.603 |         0.578 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          446.080 |             748.960 |         1553.504 |            2629.888 |         0.596 |         0.591 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          446.208 |             740.608 |         1558.880 |            2524.960 |         0.602 |         0.617 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           33.568 |              41.280 |          170.016 |             147.584 |         0.813 |         1.152 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           30.688 |              43.040 |          159.552 |             146.720 |         0.713 |         1.087 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           34.112 |              41.504 |          170.112 |             152.672 |         0.822 |         1.114 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           34.240 |              41.152 |          170.272 |             134.976 |         0.832 |         1.261 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           48.672 |              76.416 |          295.296 |             263.648 |         0.637 |         1.120 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           45.088 |              72.576 |          281.920 |             237.664 |         0.621 |         1.186 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           48.032 |              76.672 |          295.520 |             265.248 |         0.626 |         1.114 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           48.096 |              76.096 |          295.456 |             262.112 |         0.632 |         1.127 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           93.920 |             111.232 |          401.568 |             382.944 |         0.844 |         1.049 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           68.192 |              95.232 |          338.752 |             326.816 |         0.716 |         1.037 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           93.984 |             111.840 |          401.856 |             444.224 |         0.840 |         0.905 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           94.176 |             110.496 |          401.600 |             383.136 |         0.852 |         1.048 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          131.488 |             227.040 |          727.424 |             739.712 |         0.579 |         0.983 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |           95.616 |             169.760 |          616.864 |             574.112 |         0.563 |         1.074 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          131.680 |             228.672 |          727.616 |             746.048 |         0.576 |         0.975 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          131.104 |             225.696 |          727.904 |             735.392 |         0.581 |         0.990 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1227.296 |            1386.656 |         3720.192 |            4539.904 |         0.885 |         0.819 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |          691.360 |             831.712 |         2515.872 |            3067.808 |         0.831 |         0.820 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1228.192 |            1403.136 |         3715.520 |            5309.280 |         0.875 |         0.700 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1229.024 |            1384.992 |         3715.904 |            4550.368 |         0.887 |         0.817 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1784.832 |            2865.888 |         6539.840 |            8460.224 |         0.623 |         0.773 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1017.408 |            1660.480 |         4369.824 |            5056.992 |         0.613 |         0.864 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1792.448 |            2904.864 |         6546.080 |            8537.024 |         0.617 |         0.767 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1795.552 |            2856.864 |         6544.672 |            8400.160 |         0.629 |         0.779 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           34.240 |              38.880 |          148.832 |             179.936 |         0.881 |         0.827 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           31.168 |              38.080 |          138.528 |             167.552 |         0.818 |         0.827 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           34.240 |              39.168 |          148.512 |             181.248 |         0.874 |         0.819 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           34.240 |              38.784 |          148.864 |             180.224 |         0.883 |         0.826 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           48.832 |              76.352 |          253.632 |             295.968 |         0.640 |         0.857 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           45.760 |              65.792 |          239.040 |             290.752 |         0.696 |         0.822 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           48.768 |              76.576 |          253.312 |             304.032 |         0.637 |         0.833 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           48.768 |              76.192 |          253.600 |             296.096 |         0.640 |         0.856 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           93.728 |             109.728 |          357.696 |             498.912 |         0.854 |         0.717 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           68.704 |              92.288 |          295.616 |             386.240 |         0.744 |         0.765 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           93.632 |             111.392 |          357.408 |             512.448 |         0.841 |         0.697 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           93.280 |             109.952 |          357.696 |             501.440 |         0.848 |         0.713 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          131.392 |             230.496 |          612.224 |             807.552 |         0.570 |         0.758 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |           96.512 |             165.184 |          502.624 |             672.384 |         0.584 |         0.748 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          131.360 |             232.608 |          612.064 |             832.320 |         0.565 |         0.735 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          131.008 |             230.528 |          612.640 |             804.320 |         0.568 |         0.762 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1227.968 |            1377.408 |         3477.920 |            5324.384 |         0.892 |         0.653 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |          695.264 |             824.544 |         2268.224 |            3210.208 |         0.843 |         0.707 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1228.640 |            1404.576 |         3476.832 |            5463.456 |         0.875 |         0.636 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1228.416 |            1378.752 |         3478.048 |            5367.712 |         0.891 |         0.648 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1788.736 |            2867.712 |         6039.520 |            8616.256 |         0.624 |         0.701 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1021.952 |            1653.824 |         3866.208 |            5306.848 |         0.618 |         0.729 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1786.752 |            2896.352 |         6044.128 |            8871.360 |         0.617 |         0.681 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1786.080 |            2868.672 |         6040.160 |            8550.144 |         0.623 |         0.706 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           57.504 |              71.552 |          312.768 |             255.040 |         0.804 |         1.226 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           49.472 |              71.104 |          285.696 |             243.520 |         0.696 |         1.173 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           58.112 |              72.896 |          312.768 |             288.256 |         0.797 |         1.085 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           57.952 |              71.680 |          312.768 |             255.552 |         0.808 |         1.224 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           82.336 |             144.256 |          580.128 |             500.160 |         0.571 |         1.160 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           76.160 |             123.712 |          552.544 |             447.648 |         0.616 |         1.234 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           82.400 |             145.184 |          580.032 |             504.032 |         0.568 |         1.151 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           82.368 |             143.904 |          580.192 |             499.936 |         0.572 |         1.161 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          177.216 |             209.568 |          787.872 |             747.712 |         0.846 |         1.054 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          121.984 |             168.256 |          651.968 |             628.256 |         0.725 |         1.038 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          177.088 |             211.488 |          788.320 |             864.352 |         0.837 |         0.912 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          177.440 |             208.576 |          787.424 |             749.120 |         0.851 |         1.051 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          249.472 |             441.376 |         1405.440 |            1431.648 |         0.565 |         0.982 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          172.960 |             312.064 |         1172.064 |            1096.448 |         0.554 |         1.069 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          249.632 |             446.336 |         1405.408 |            1448.480 |         0.559 |         0.970 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          250.944 |             440.128 |         1406.624 |            1421.952 |         0.570 |         0.989 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2418.720 |            2747.936 |         7330.432 |            9023.712 |         0.880 |         0.812 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         1353.696 |            1608.480 |         4941.696 |            6078.752 |         0.842 |         0.813 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2427.456 |            2746.816 |         7329.792 |           10539.968 |         0.884 |         0.695 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2426.688 |            2763.168 |         7336.256 |            9057.536 |         0.878 |         0.810 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3554.240 |            5634.400 |        12919.872 |           16843.489 |         0.631 |         0.767 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         2003.648 |            3250.784 |         8610.144 |           10015.424 |         0.616 |         0.860 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3582.080 |            5710.944 |        12923.328 |           17011.871 |         0.627 |         0.760 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3581.920 |            5618.144 |        12934.528 |           16745.888 |         0.638 |         0.772 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           57.120 |              71.232 |          269.760 |             295.680 |         0.802 |         0.912 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           49.408 |              65.312 |          242.304 |             253.952 |         0.756 |         0.954 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           57.504 |              72.544 |          269.632 |             298.976 |         0.793 |         0.902 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           57.760 |              71.040 |          269.600 |             296.640 |         0.813 |         0.909 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           82.336 |             147.168 |          466.080 |             487.456 |         0.559 |         0.956 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           76.704 |             115.040 |          435.392 |             453.248 |         0.667 |         0.961 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           81.856 |             147.424 |          465.920 |             499.552 |         0.555 |         0.933 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           81.760 |             146.656 |          466.176 |             485.984 |         0.557 |         0.959 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          176.608 |             206.976 |          678.080 |             866.976 |         0.853 |         0.782 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          121.664 |             164.768 |          538.240 |             636.160 |         0.738 |         0.846 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          176.608 |             209.664 |          677.696 |             883.424 |         0.842 |         0.767 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          177.440 |             207.840 |          677.248 |             868.288 |         0.854 |         0.780 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          250.272 |             449.536 |         1163.424 |            1420.832 |         0.557 |         0.819 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          173.472 |             305.376 |          929.408 |            1104.544 |         0.568 |         0.841 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          249.376 |             454.976 |         1163.648 |            1455.296 |         0.548 |         0.800 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          250.368 |             450.144 |         1163.520 |            1409.984 |         0.556 |         0.825 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2416.576 |            2726.208 |         6835.520 |           10442.784 |         0.886 |         0.655 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         1357.440 |            1590.752 |         4433.664 |            5975.296 |         0.853 |         0.742 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2427.360 |            2747.040 |         6853.056 |           10670.784 |         0.884 |         0.642 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2441.120 |            2718.944 |         6836.640 |           10433.792 |         0.898 |         0.655 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3555.392 |            5620.960 |        11944.000 |           16504.801 |         0.633 |         0.724 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         2010.848 |            3241.152 |         7636.064 |            9870.464 |         0.620 |         0.774 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3557.440 |            5688.352 |        11935.744 |           17090.496 |         0.625 |         0.698 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3562.720 |            5630.432 |        11939.168 |           16392.033 |         0.633 |         0.728 |

</details>

### Perf after this PR

**FWD**

| Type    |   Speedup | score_mod     | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)      |
|---------|-----------|---------------|------------|----------------|----------------------------|
| Average |     0.776 |               |            |                |                            |
| Max     |     1.006 | None          | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64) |
| Min     |     0.566 | relative_bias | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128) |

**BWD**

| Type    |   Speedup | score_mod   | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)       |
|---------|-----------|-------------|------------|----------------|-----------------------------|
| Average |     0.817 |             |            |                |                             |
| Max     |     1.150 | None        | causal     | torch.bfloat16 | (16, 16, 512, 16, 512, 128) |
| Min     |     0.454 | None        | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128) |
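
The speedup columns appear to be the eager time divided by the compiled time, so values below 1.0 mean the compiled kernel is slower. The sketch below recomputes `fwd_speedup` for three rows copied from the sweep after this PR. The division formula and the use of a plain arithmetic mean for the `Average` row are assumptions inferred from the numbers, not taken from the benchmark script, and the names `rows`, `speedup`, and `summary` are illustrative only.

```python
from statistics import mean

# (score_mod, mask_mod, shape, fwd_eager_time, fwd_compiled_time)
# Values copied from three rows of the "after this PR" sweep.
rows = [
    ("None", "None", (2, 16, 1024, 2, 1024, 64), 32.096, 31.904),
    ("None", "causal", (2, 16, 1024, 2, 1024, 128), 39.552, 59.360),
    ("relative_bias", "None", (2, 16, 512, 16, 512, 64), 16.160, 17.280),
]


def speedup(eager_time: float, compiled_time: float) -> float:
    """Assumed definition: eager / compiled (> 1.0 means compiled is faster)."""
    return eager_time / compiled_time


# Round to three decimals to match the precision shown in the tables.
speedups = [round(speedup(eager, compiled), 3) for *_, eager, compiled in rows]

# Assumed aggregation for a summary table: arithmetic mean, max, and min.
summary = {
    "Average": round(mean(speedups), 3),
    "Max": max(speedups),
    "Min": min(speedups),
}

for (score_mod, mask_mod, shape, *_), s in zip(rows, speedups):
    print(f"{score_mod:>13} {mask_mod:>6} {shape} fwd_speedup={s:.3f}")
print(summary)
```

The summary here covers only the three sample rows, so its average will not match the `Average` row above, which covers the full sweep.
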

<details>
<summary> Full performance sweep </summary>

| score_mod     | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)         |   fwd_eager_time |   fwd_compiled_time |   bwd_eager_time |   bwd_compiled_time |   fwd_speedup |   bwd_speedup |
|---------------|------------|----------------|-------------------------------|------------------|---------------------|------------------|---------------------|---------------|---------------|
| None          | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.680 |              17.056 |           64.544 |              73.376 |         0.919 |         0.880 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.712 |              19.872 |           65.408 |              72.864 |         0.791 |         0.898 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           16.160 |              17.280 |           64.896 |              73.888 |         0.935 |         0.878 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           16.192 |              17.120 |           64.896 |              75.424 |         0.946 |         0.860 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.648 |              22.496 |           89.184 |              82.592 |         0.873 |         1.080 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           20.320 |              26.816 |           91.264 |              82.880 |         0.758 |         1.101 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           20.096 |              22.528 |           89.184 |              83.776 |         0.892 |         1.065 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.680 |              22.432 |           89.184 |             120.096 |         0.877 |         0.743 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           32.384 |              32.512 |          119.232 |             128.960 |         0.996 |         0.925 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           30.176 |              37.248 |          113.664 |             119.520 |         0.810 |         0.951 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           32.512 |              32.928 |          119.264 |             131.456 |         0.987 |         0.907 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           32.448 |              32.704 |          119.200 |             128.352 |         0.992 |         0.929 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           41.952 |              62.176 |          199.040 |             214.304 |         0.675 |         0.929 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           39.744 |              62.880 |          189.504 |             179.968 |         0.632 |         1.053 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           41.472 |              62.784 |          199.136 |             217.664 |         0.661 |         0.915 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           42.048 |              61.952 |          199.168 |             214.496 |         0.679 |         0.929 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          341.184 |             357.632 |          980.256 |            1328.896 |         0.954 |         0.738 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          212.576 |             252.960 |          673.888 |             824.864 |         0.840 |         0.817 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          340.000 |             363.296 |          980.768 |            1375.808 |         0.936 |         0.713 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          340.768 |             356.832 |          980.960 |            1326.272 |         0.955 |         0.740 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          459.392 |             737.120 |         1678.240 |            2205.248 |         0.623 |         0.761 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          292.672 |             468.096 |         1178.016 |            1371.584 |         0.625 |         0.859 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          462.144 |             745.312 |         1680.000 |            2252.512 |         0.620 |         0.746 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          462.112 |             736.576 |         1679.008 |            2216.480 |         0.627 |         0.758 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           16.064 |              16.704 |          105.120 |             120.768 |         0.962 |         0.870 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           15.552 |              18.144 |          107.136 |             121.696 |         0.857 |         0.880 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           16.096 |              16.768 |          102.688 |             120.864 |         0.960 |         0.850 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           16.032 |              16.576 |          104.736 |             124.672 |         0.967 |         0.840 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.392 |              21.952 |          104.736 |             174.656 |         0.883 |         0.600 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           20.128 |              23.712 |          105.216 |             199.008 |         0.849 |         0.529 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.904 |              21.888 |          103.744 |             179.520 |         0.909 |         0.578 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.968 |              21.952 |          104.640 |             177.312 |         0.910 |         0.590 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           32.096 |              31.904 |          118.720 |             231.968 |         1.006 |         0.512 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           30.528 |              33.952 |          112.480 |             218.304 |         0.899 |         0.515 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           32.160 |              32.224 |          118.752 |             237.312 |         0.998 |         0.500 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           32.128 |              32.032 |          118.240 |             233.120 |         1.003 |         0.507 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           41.312 |              61.280 |          177.408 |             350.688 |         0.674 |         0.506 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           39.552 |              59.360 |          168.832 |             371.488 |         0.666 |         0.454 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           41.984 |              61.696 |          177.376 |             360.416 |         0.680 |         0.492 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           41.312 |              61.760 |          177.184 |             355.744 |         0.669 |         0.498 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          339.744 |             357.888 |          939.712 |            1665.376 |         0.949 |         0.564 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          212.608 |             248.832 |          633.280 |            1122.848 |         0.854 |         0.564 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          339.712 |             363.232 |          940.448 |            1689.440 |         0.935 |         0.557 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          341.056 |             355.264 |          940.128 |            1641.152 |         0.960 |         0.573 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          460.736 |             741.024 |         1569.824 |            2559.552 |         0.622 |         0.613 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          293.856 |             464.192 |         1066.240 |            1840.416 |         0.633 |         0.579 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          460.704 |             753.152 |         1570.112 |            2641.088 |         0.612 |         0.594 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          460.832 |             745.536 |         1570.144 |            2602.560 |         0.618 |         0.603 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           35.680 |              41.280 |          171.840 |             158.176 |         0.864 |         1.086 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           31.360 |              42.976 |          158.912 |             139.264 |         0.730 |         1.141 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           35.168 |              41.600 |          171.648 |             161.344 |         0.845 |         1.064 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           35.136 |              41.152 |          171.808 |             158.336 |         0.854 |         1.085 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           48.832 |              76.384 |          295.680 |             277.696 |         0.639 |         1.065 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           45.632 |              72.512 |          281.760 |             250.752 |         0.629 |         1.124 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           49.504 |              76.608 |          295.584 |             279.712 |         0.646 |         1.057 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           48.864 |              75.904 |          295.456 |             277.568 |         0.644 |         1.064 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           99.392 |             111.232 |          408.640 |             442.656 |         0.894 |         0.923 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           71.392 |              95.168 |          338.784 |             341.760 |         0.750 |         0.991 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           99.808 |             112.256 |          408.608 |             456.160 |         0.889 |         0.896 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |          100.032 |             110.816 |          408.512 |             444.192 |         0.903 |         0.920 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          135.040 |             226.112 |          726.880 |             774.176 |         0.597 |         0.939 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |           99.904 |             169.696 |          616.448 |             607.104 |         0.589 |         1.015 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          135.488 |             228.384 |          727.776 |             782.368 |         0.593 |         0.930 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          135.744 |             225.664 |          728.000 |             773.600 |         0.602 |         0.941 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1324.192 |            1387.808 |         3866.944 |            5217.184 |         0.954 |         0.741 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |          738.464 |             832.608 |         2507.392 |            3146.688 |         0.887 |         0.797 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1326.016 |            1404.256 |         3867.872 |            5382.624 |         0.944 |         0.719 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1326.144 |            1386.688 |         3867.552 |            5203.264 |         0.956 |         0.743 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1847.488 |            2866.336 |         6612.704 |            8597.696 |         0.645 |         0.769 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1066.592 |            1660.640 |         4357.696 |            5174.016 |         0.642 |         0.842 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1850.464 |            2905.408 |         6616.928 |            8793.280 |         0.637 |         0.752 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1848.896 |            2834.720 |         6623.872 |            8637.920 |         0.652 |         0.767 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           36.384 |              38.656 |          150.336 |             182.624 |         0.941 |         0.823 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           31.360 |              38.112 |          137.664 |             171.840 |         0.823 |         0.801 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           36.608 |              39.040 |          150.528 |             183.872 |         0.938 |         0.819 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           36.064 |              38.656 |          150.560 |             183.520 |         0.933 |         0.820 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           49.344 |              76.352 |          253.920 |             301.440 |         0.646 |         0.842 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           46.720 |              65.824 |          239.424 |             296.384 |         0.710 |         0.808 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           49.248 |              76.416 |          253.728 |             307.808 |         0.644 |         0.824 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           49.376 |              76.288 |          253.728 |             304.736 |         0.647 |         0.833 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           99.264 |             110.144 |          364.960 |             503.072 |         0.901 |         0.725 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           71.136 |              92.384 |          294.432 |             393.056 |         0.770 |         0.749 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           99.200 |             111.360 |          365.152 |             512.640 |         0.891 |         0.712 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           99.264 |             110.240 |          365.088 |             504.224 |         0.900 |         0.724 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          135.680 |             230.336 |          613.472 |             816.896 |         0.589 |         0.751 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          100.256 |             165.088 |          502.144 |             676.480 |         0.607 |         0.742 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          135.008 |             232.480 |          613.184 |             836.672 |         0.581 |         0.733 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          135.232 |             230.624 |          613.536 |             827.136 |         0.586 |         0.742 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1324.064 |            1378.688 |         3631.808 |            5308.384 |         0.960 |         0.684 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |          731.776 |             826.688 |         2263.168 |            3241.344 |         0.885 |         0.698 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1316.128 |            1403.200 |         3625.088 |            5550.688 |         0.938 |         0.653 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1311.904 |            1378.880 |         3616.320 |            5353.696 |         0.951 |         0.675 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1837.856 |            2887.392 |         6121.632 |            8586.656 |         0.637 |         0.713 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1066.976 |            1654.368 |         3843.136 |            5291.040 |         0.645 |         0.726 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1854.208 |            2896.832 |         6130.112 |            8745.984 |         0.640 |         0.701 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1860.512 |            2889.344 |         6135.648 |            8750.592 |         0.644 |         0.701 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           60.640 |              71.552 |          315.968 |             296.512 |         0.847 |         1.066 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           50.784 |              71.040 |          284.288 |             258.880 |         0.715 |         1.098 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           61.312 |              72.704 |          315.680 |             302.016 |         0.843 |         1.045 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           60.800 |              71.776 |          316.320 |             297.152 |         0.847 |         1.065 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           84.576 |             144.416 |          580.576 |             535.936 |         0.586 |         1.083 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           76.064 |             123.648 |          553.344 |             481.376 |         0.615 |         1.150 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           84.160 |             145.248 |          581.024 |             540.000 |         0.579 |         1.076 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           84.512 |             143.552 |          581.088 |             535.776 |         0.589 |         1.085 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          189.152 |             209.408 |          798.400 |             868.704 |         0.903 |         0.919 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          127.552 |             168.800 |          650.816 |             663.328 |         0.756 |         0.981 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          189.376 |             211.360 |          798.080 |             895.552 |         0.896 |         0.891 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          189.440 |             208.576 |          797.888 |             873.152 |         0.908 |         0.914 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          257.536 |             441.760 |         1408.960 |            1514.720 |         0.583 |         0.930 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          179.328 |             312.096 |         1170.368 |            1177.472 |         0.575 |         0.994 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          259.264 |             446.944 |         1408.768 |            1530.400 |         0.580 |         0.921 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          258.080 |             440.480 |         1408.864 |            1514.144 |         0.586 |         0.930 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2595.808 |            2771.456 |         7616.704 |           10405.248 |         0.937 |         0.732 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         1435.744 |            1610.336 |         4927.520 |            6220.000 |         0.892 |         0.792 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2595.264 |            2745.056 |         7611.232 |           10631.392 |         0.945 |         0.716 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2576.256 |            2735.456 |         7626.400 |           10346.976 |         0.942 |         0.737 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3679.744 |            5634.816 |        13077.056 |           17182.528 |         0.653 |         0.761 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         2099.360 |            3250.176 |         8589.664 |           10236.672 |         0.646 |         0.839 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3676.800 |            5716.288 |        13073.088 |           17311.071 |         0.643 |         0.755 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3679.136 |            5570.496 |        13070.720 |           17192.863 |         0.660 |         0.760 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           61.600 |              71.008 |          272.320 |             300.000 |         0.868 |         0.908 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           50.176 |              65.344 |          241.568 |             258.912 |         0.768 |         0.933 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           61.120 |              72.512 |          272.672 |             305.408 |         0.843 |         0.893 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           61.248 |              71.136 |          272.640 |             301.120 |         0.861 |         0.905 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           83.872 |             146.784 |          466.912 |             496.832 |         0.571 |         0.940 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           76.704 |             115.072 |          435.584 |             462.112 |         0.667 |         0.943 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           83.392 |             147.392 |          466.656 |             504.448 |         0.566 |         0.925 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           83.360 |             146.688 |          466.656 |             499.040 |         0.568 |         0.935 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          189.024 |             207.584 |          684.768 |             873.568 |         0.911 |         0.784 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          126.944 |             164.288 |          536.192 |             645.984 |         0.773 |         0.830 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          188.768 |             209.760 |          684.096 |             897.504 |         0.900 |         0.762 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          189.408 |             207.776 |          685.024 |             876.384 |         0.912 |         0.782 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          259.168 |             449.536 |         1167.936 |            1433.280 |         0.577 |         0.815 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          180.000 |             305.312 |          928.000 |            1113.920 |         0.590 |         0.833 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          258.464 |             455.136 |         1167.808 |            1462.848 |         0.568 |         0.798 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          257.824 |             450.208 |         1167.744 |            1448.000 |         0.573 |         0.806 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2598.368 |            2729.120 |         7134.400 |           10381.632 |         0.952 |         0.687 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         1435.456 |            1591.040 |         4424.768 |            6035.808 |         0.902 |         0.733 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2594.752 |            2725.952 |         7128.384 |           10822.496 |         0.952 |         0.659 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2597.888 |            2716.960 |         7101.568 |           10385.440 |         0.956 |         0.684 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3647.648 |            5581.632 |        12089.952 |           16667.233 |         0.654 |         0.725 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         2093.952 |            3241.440 |         7579.392 |            9847.936 |         0.646 |         0.770 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3650.528 |            5650.688 |        12105.568 |           16963.680 |         0.646 |         0.714 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3680.064 |            5585.312 |        12117.504 |           16935.040 |         0.659 |         0.716 |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135505
Approved by: https://github.com/Chillee
2024-09-10 09:30:02 +00:00
23b1486185 [MPS] Allow nan mean reduction in nll_loss (#135434)
This PR allows results from `nll_loss` to be `nan`, which is the same behavior as with CUDA and CPU https://github.com/pytorch/pytorch/pull/64572#issuecomment-926504162.
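
As a rough illustration of where a `nan` can come from, here is a plain-Python sketch of a mean-reduced negative log-likelihood. It assumes the case in question is the one where every target equals `ignore_index`, so the mean divides zero by zero; the function name and the list-based inputs are hypothetical and are not the MPS kernel.

```python
import math


def nll_loss_mean(log_probs, targets, ignore_index=-100):
    """Mean negative log-likelihood over the non-ignored targets.

    Hypothetical reference helper: `log_probs` is a list of rows of
    log-probabilities, `targets` is a list of class indices.
    """
    total = 0.0
    count = 0
    for row, target in zip(log_probs, targets):
        if target == ignore_index:
            continue  # ignored samples contribute neither to the sum nor to the count
        total -= row[target]
        count += 1
    # With no contributing samples the mean is 0/0, reported here as nan.
    return total / count if count else math.nan


# One sample is counted, one is ignored.
print(nll_loss_mean([[-0.5, -1.0], [-2.0, -0.25]], [0, -100]))
# Every sample is ignored, so the result is nan.
print(nll_loss_mean([[-0.5, -1.0]], [-100]))
```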

Fixes #134431

Ref #64572 #119108
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135434
Approved by: https://github.com/malfet
2024-09-10 08:37:59 +00:00
9902b349cb [Inductor] Make static_input_idxs a set for faster lookup (#135314)
`static_input_idxs` is only used for lookups. With large models, this is a large list. This takes over a millisecond in some cases.
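
The difference between the two containers can be sketched in plain Python. The data below is a hypothetical stand-in (the real indices come from the compiled graph), and the absolute timings depend on the machine; the point is only that `in` on a list scans linearly while `in` on a set is a hash lookup.

```python
import timeit

# Hypothetical stand-in data: the real indices come from the compiled graph.
N = 10_000
static_input_idxs_list = list(range(N))
static_input_idxs_set = set(static_input_idxs_list)

# Indices near the end of the list are the worst case for a linear scan.
queries = list(range(N - 100, N))


def count_static(idxs, container):
    """Count how many of `idxs` are members of `container`."""
    return sum(1 for i in idxs if i in container)


if __name__ == "__main__":
    t_list = timeit.timeit(lambda: count_static(queries, static_input_idxs_list), number=5)
    t_set = timeit.timeit(lambda: count_static(queries, static_input_idxs_set), number=5)
    print(f"list: {t_list:.4f}s  set: {t_set:.4f}s")
```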

Profile before change:
<img width="824" alt="image" src="https://github.com/user-attachments/assets/002a0775-fd2f-4d27-8cf2-812b502d7d5e">

Profile after change: gaps are smaller, 1ms speedup before launching the cuda graph
<img width="794" alt="image" src="https://github.com/user-attachments/assets/12a0a0b9-2cc1-4d53-ac87-9bd5140a46f5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135314
Approved by: https://github.com/oulgen
2024-09-10 07:27:55 +00:00
5a9ac83e94 Fix doc (#135551)
Differential Revision: [D62412667](https://our.internmc.facebook.com/intern/diff/D62412667/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135551
Approved by: https://github.com/yushangdi
ghstack dependencies: #135549
2024-09-10 07:18:44 +00:00
1adf28a5c0 [inductor] print triton float64 constants correctly (#135260)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135260
Approved by: https://github.com/jansel
2024-09-10 07:05:02 +00:00
c18052da0e Add some minor doc improvement and ban using training IR for unflattener (#135549)
Title

Differential Revision: [D62412490](https://our.internmc.facebook.com/intern/diff/D62412490/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135549
Approved by: https://github.com/yushangdi
2024-09-10 06:48:42 +00:00
c0d2f991b1 Increase TRITON_MAX_BLOCK['X'] (#135181)
Fixes #135028

As title, increase `TRITON_MAX_BLOCK['X']` to 4096 and fix an error, thanks to @Chillee: https://github.com/pytorch/pytorch/pull/133300/files#r1744706189

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135181
Approved by: https://github.com/jansel
2024-09-10 05:54:37 +00:00
e889252493 Implementation of scan (#134102)
This operation is intended as the counterpart to `associative_scan`, but it can also operate with non-associative functions.
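
For intuition, a generic sequential scan can be sketched in plain Python as below. This is not the API added by the PR: the signature, the `(carry, output)` return convention of `combine_fn`, and the names are assumptions made for illustration. Because the loop applies the function strictly left to right, it does not need associativity, which a parallel associative scan would rely on.

```python
def scan(combine_fn, init, xs):
    """Sequentially fold `combine_fn` over `xs`, collecting per-step outputs.

    Hypothetical convention: `combine_fn(carry, x)` returns `(new_carry, y)`.
    """
    carry = init
    ys = []
    for x in xs:
        carry, y = combine_fn(carry, x)
        ys.append(y)
    return carry, ys


def running_subtract(carry, x):
    # Subtraction is not associative: (a - b) - c != a - (b - c) in general.
    new_carry = carry - x
    return new_carry, new_carry


final, outputs = scan(running_subtract, 10, [1, 2, 3])
print(final, outputs)
```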

@ydwu4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134102
Approved by: https://github.com/ydwu4
2024-09-10 04:51:16 +00:00
6546c6186d do not raise when flatten_fn_with_keys not found when suggesting fixes (#135518)
Test Plan: added test

Differential Revision: D62395371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135518
Approved by: https://github.com/zhxchen17
2024-09-10 03:47:36 +00:00
1d9fefff19 [DCP] Fixes the stateless optimizer issue of distributed state_dict (#135535)
Some optimizers don't have states, which can cause `get_state_dict`/`set_state_dict` to behave incorrectly. This PR fixes the issue.

fixes: https://github.com/pytorch/pytorch/issues/133415

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135535
Approved by: https://github.com/wz337
2024-09-10 03:10:00 +00:00
7ec17b49cf Fix dynamo benchmark skip logic for cpu device (#135193)
Fixes #132380. Adjusts the torchbench and huggingface skip-model lists so that `--no-skip` can be removed when running benchmarks on the 3 suites.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135193
Approved by: https://github.com/chuanqi129, https://github.com/jansel
2024-09-10 03:02:19 +00:00
146921007a [inductor] [cpp] fix the input contiguous check in max-autotune (#134982)
## Description
Fixes the FP32 accuracy failure of `resmlp_12_224` and BF16 accuracy failure of `volo_d1_224` in timm.

In this PR, we check whether the input is contiguous as follows:
If it has a `FixedLayout`, we know the exact strides. For a `FlexibleLayout`, if its data is a `ComputedBuffer`, we can get the fill order of the buffer to decide whether it's contiguous. In all other cases, we don't use the GEMM template because we can't infer whether the input is contiguous.

## Additional context
The current GEMM template only supports this case: `input.get_stride()[-1] == 1`. In `resmlp_12_224`, when we run into this check, the layout of `input` is a `FlexibleLayout`. The reason is that when realizing the input which is a `View` IR, the `convert_to_reinterpret_view` call fails:
d14fe3ffed/torch/_inductor/ir.py (L4712-L4715)

And it finally runs into this `copy_input` and returns a `FlexibleLayout`.
d14fe3ffed/torch/_inductor/ir.py (L4722)

When its stride is checked, this `FlexibleLayout` does satisfy `input.get_stride()[-1] == 1`, but it is later decided as a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`, which is not supported by the GEMM template, thus causing an accuracy issue in this model.
The `FlexibleLayout` is converted to `FixedLayout` during [CppPackedGemmTemplate.add_choices](d14fe3ffed/torch/_inductor/mkldnn_lowerings.py (L1051)) which calls [slice_nd](d14fe3ffed/torch/_inductor/codegen/cpp_template_kernel.py (L150)) when rendering the kernel (`slice_nd(X)`). When creating the `SliceView` IR, [as_storage_and_layout](d14fe3ffed/torch/_inductor/ir.py (L2288)) invokes
[decide_layout](d14fe3ffed/torch/_inductor/ir.py (L2135)) and converts it to a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`.
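
The stride arithmetic behind the failure can be checked with a small plain-Python sketch. The helper names are hypothetical and do not correspond to Inductor functions; the sketch only shows that a row-major layout of size `(3072, 196)` has a unit stride in the last dimension, while the layout that is eventually decided does not.

```python
def contiguous_strides(size):
    """Row-major (C-contiguous) strides for the given size."""
    strides = []
    running = 1
    for dim in reversed(size):
        strides.append(running)
        running *= dim
    return tuple(reversed(strides))


def last_dim_is_unit_stride(stride):
    """The condition quoted above: the last stride must be 1."""
    return stride[-1] == 1


size = (3072, 196)
provisional = contiguous_strides(size)  # what a row-major layout would give
decided = (1, 3072)                     # the strides quoted in the description

print(provisional, last_dim_is_unit_stride(provisional))
print(decided, last_dim_is_unit_stride(decided))
```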

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134982
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-09-10 02:47:38 +00:00
a71e5509bc [inductor]Add profiler to operatorbench (#135515)
Add profiling to operatorbench. A new `--profile` argument is added; the resulting profiling trace looks like the following figure.
<img width="954" alt="image" src="https://github.com/user-attachments/assets/5b00d6e3-4905-4a77-a5e9-9f62620a5fd5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135515
Approved by: https://github.com/shunting314
2024-09-10 02:33:30 +00:00
136e28f616 Enable forward AD in functional.affine_grid (#135494)
Fixes #121411
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135494
Approved by: https://github.com/zou3519, https://github.com/soulitzer
2024-09-10 00:07:07 +00:00
39a61795e3 remove amax_ptr from scaled_gemm (#135421)
amax was removed from _scaled_mm by #128683. Remove it from the internal at::cuda::blas::scaled_gemm, as well.  This allows hipBLASLt to find additional solutions rather than forcing amax to be used and then discarding the result.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135421
Approved by: https://github.com/drisspg, https://github.com/eqy
2024-09-09 23:04:36 +00:00
b4feec9782 [xplat][XNNPACK] don't prefer static linkage in xplat for main target (#135529)
Building XNNPACK as a static library has some issues because of multiple global params floating around.

Let's try to get rid of it in xplat and see how it fares.

Differential Revision: [D60776152](https://our.internmc.facebook.com/intern/diff/D60776152/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D60776152/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135529
Approved by: https://github.com/kimishpatel, https://github.com/mcr229, https://github.com/kirklandsign
2024-09-09 22:47:01 +00:00
d81731615f [Dynamo] Adding CallFunctionNoArgsSource and (#135425)
CallFunctionNoArgsGuardAccessor to support torch.cuda.current_device()

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135425
Approved by: https://github.com/anijain2305
2024-09-09 22:46:00 +00:00
e2f9a83b85 [ONNX] Drop final None values as inputs for nodes in exporter graph (#135520)
When a value for an optional input is not provided, it defaults to `None`, which gets translated to "" in the ONNX graph. To avoid this, if we have a list of inputs and the final few are all `None`, strip them in the graph.
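
The stripping rule can be sketched as follows. This is a minimal illustration with a hypothetical helper name, not the exporter's code: only the trailing run of `None` values is dropped, while a `None` in the middle must stay so that later inputs keep their positions.

```python
def strip_trailing_nones(inputs):
    """Return `inputs` without its trailing run of None values."""
    end = len(inputs)
    while end > 0 and inputs[end - 1] is None:
        end -= 1
    return list(inputs[:end])


# The None in the middle is kept; the two at the end are dropped.
print(strip_trailing_nones(["x", None, "w", None, None]))
```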
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135520
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-09-09 22:28:41 +00:00
70a65a8bd5 Revert "NJT <-> padded dense conversions (#125947)"
This reverts commit 09a5e88bef04d5485b70d8f65f46a675aaa52942.

Reverted https://github.com/pytorch/pytorch/pull/125947 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing dynamo test 09a5e88bef, maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/125947#issuecomment-2339228570))
2024-09-09 22:01:09 +00:00
689d278543 Revert "Add __init__.py to shape inference folder. (#135461)"
This reverts commit dced0d6d9f05f0962f74a3c6227f774111c15715.

Reverted https://github.com/pytorch/pytorch/pull/135461 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it exposes some public function without appropriate doc. I will reopen the issue with hi-prio so that it can be fixed properly ([comment](https://github.com/pytorch/pytorch/pull/135461#issuecomment-2339218382))
2024-09-09 21:55:13 +00:00
9b764491e3 Use upload-artifact@v4.4.0 for create_release.yml (#135528)
Fixes failure: https://github.com/pytorch/pytorch/actions/runs/10780281005/job/29895846007

Due to broken sync between
```
actions/upload-artifact@v2
and
actions/download-artifact@v4.1.7
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135528
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-09-09 20:48:52 +00:00
cbc6b30a24 Fix broken E2E tests on Linux machines (#135394)
Summary:
I'm not entirely sure why this is failing with an `ImportError` (according to lastnameye a super class of `ModuleNotFoundError`s), but on our E2E tests on Linux machines (but not Macs?), we're seeing the import failure not getting caught --
`ImportError: cannot import name 'parutil' from 'libfb.py' (/data/sandcastle/boxes/eden-trunk-hg-full-fbsource/buck-out/v2/gen/fbsource/d0c916ec8d40ce11/arvr/libraries/ctrl/studies/replay/__ctrl-r__/ctrl-r#link-tree/libfb/py/__init__.py)` from this test run https://www.internalfb.com/sandcastle/workflow/2522015791331601269, an instance of this job:  https://www.internalfb.com/intern/test/844425085172858?ref_report_id=0 is the overall job

Test Plan:
`arc skycastle schedule tools/skycastle/workflows2/ctrl/js_tests.sky:test_js_e2e_replay_tests --sandcastle-spec-overrides '{"type": "fbcode", "unicastle_size": "I1_MEDIUM"}'`
->
https://www.internalfb.com/sandcastle/workflow/256705178764255769

Differential Revision: D62321167

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135394
Approved by: https://github.com/laithsakka
2024-09-09 20:18:08 +00:00
5b368de7f7 Revert "[ONNX] Update fake mode usage in onnx docs (#135512)"
This reverts commit a13c118994b4f118388d97a35abcb91a396cd437.

Reverted https://github.com/pytorch/pytorch/pull/135512 on behalf of https://github.com/davidberard98 due to failing test  https://github.com/pytorch/pytorch/actions/runs/10778813316/job/29891679127 ([comment](https://github.com/pytorch/pytorch/pull/135512#issuecomment-2338999090))
2024-09-09 20:15:12 +00:00
09a5e88bef NJT <-> padded dense conversions (#125947)
This PR:
* Implements the pre-existing `nt.to_padded_tensor(padding_val)` ATen op via the FBGEMM kernel + appropriate view gymnastics (since that kernel only handles 2D values)
* Introduces a new `_nested_from_padded_tensor` op for the reverse conversion, implemented via the reverse FBGEMM kernel + view gymnastics
    * Note: there is currently no public API for this; design booted to a future PR

TODO:
* ~~Propagate min / max sequence length via the new factory function `_nested_from_padded_tensor`~~
* ~~Verify that Inductor does computation fusion via test logic~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125947
Approved by: https://github.com/soulitzer
2024-09-09 19:37:32 +00:00
a4e6a0b240 [split build] move periodic split builds into own concurrency group (#135510)
To avoid nightly workflows cancelling each other
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135510
Approved by: https://github.com/clee2000, https://github.com/huydhn, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-09 19:35:57 +00:00
4ab232d0c4 Fix symbolic number's type and tensor's dtype mismatch bug in Tensor ctor (#135433)
Fixes #135432

In the current implementation, if we try to store a symbolic number in Tensor's constructor, it assumes that the tensor's dtype and the symbolic number's type are matched, which is not the case.

In other words, if we try to store a `SymInt`, current implementation assumes tensor's dtype is `torch.int32`, `torch.int64` or something. And if we try to store a `SymFloat`, it assumes tensor's dtype is `torch.float32` or `torch.float64`. However, the tensor's dtype could also be `torch.float32` or something else when we try to store `SymInt`, which would be wrong.

This PR stores symbolic numbers according to the tensor's scalar type by wrapping the guarded number of `SymInt` and `SymFloat` into a PyObject.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135433
Approved by: https://github.com/ezyang
2024-09-09 19:32:18 +00:00
2032f107d7 Don't try to tag s390x docker images (#135509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135509
Approved by: https://github.com/atalman
2024-09-09 19:07:48 +00:00
5f7d956362 Fix bugs blocking flipping the default layout constraint for custom ops (#135391)
Fixes two things:
- For regular PyTorch ops, the default layout constraint tag is always
flexible_layout. This was a bug with #135238
- Mark the new quantized _wrapped_linear_prepack ops as flexible_layout.
  The metas for these are incorrect, I didn't want to fix them (and
  changing the default requires the metas actually be correct).

Test Plan:
- The next PR up in the stack. The PRs are split because the next one is
  riskier.

foo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135391
Approved by: https://github.com/albanD
2024-09-09 18:24:21 +00:00
a13c118994 [ONNX] Update fake mode usage in onnx docs (#135512)
Update fake mode usage in onnx docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135512
Approved by: https://github.com/justinchuby
2024-09-09 18:10:37 +00:00
21241bfeee [CP] Extend CP to support load-balancing shards (#132442)
This PR extends the current ring attention to support load-balancing shards -- the context/sequence is divided into `2 * world_size` shards and each rank gets `rank` and `(world_size * 2 - rank - 1)` shards. The data re-shuffling is done in the `context_parallel` API.
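
The shard assignment described above can be sketched in plain Python. The function name is hypothetical and this is not the `context_parallel` implementation; it only reproduces the index formula from the description. Each rank receives one shard from the front half and its mirror from the back half, so the two indices always sum to `2 * world_size - 1`.

```python
def shard_ids_for_rank(rank, world_size):
    """Indices of the two shards (out of 2 * world_size) owned by `rank`."""
    return (rank, 2 * world_size - rank - 1)


world_size = 4
assignment = {rank: shard_ids_for_rank(rank, world_size) for rank in range(world_size)}
for rank, shards in assignment.items():
    print(f"rank {rank}: shards {shards}")
```

With `world_size = 4`, rank 0 holds shards 0 and 7 and rank 3 holds shards 3 and 4, and every one of the 8 shards is assigned exactly once.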

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132442
Approved by: https://github.com/wconstab
2024-09-09 18:04:38 +00:00
73a6fc6e30 Revert "[Inductor] Make static_input_idxs a set for faster lookup (#135314)"
This reverts commit 011cae9570fb3c44b7f6f0c8004c470579ed21da.

Reverted https://github.com/pytorch/pytorch/pull/135314 on behalf of https://github.com/ZainRizvi due to Lint is failing on this file in trunk. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/10777258770/job/29885960050) [HUD commit link](011cae9570) ([comment](https://github.com/pytorch/pytorch/pull/135314#issuecomment-2338678219))
2024-09-09 17:33:01 +00:00
09287e3af4 [MPS] Add regression test for fft.fftfreq (#135440)
The issue reported in #135223 was already solved in #128393. This PR adds a regression test for it.

Fixes #135223

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135440
Approved by: https://github.com/ezyang
2024-09-09 17:12:36 +00:00
16c3b8f87c [AOTI] Fix assert_function call in cpu autotune template (#135086)
Summary: In the ABI-compatible mode, assert_function should be AOTI_TORCH_CHECK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135086
Approved by: https://github.com/chenyang78, https://github.com/angelayi
ghstack dependencies: #134857
2024-09-09 16:54:12 +00:00
9c6dff4941 [AOTI] Add C shim for aten.mkldnn_rnn_layer in cpp wrapper (#134857)
Summary: Support aten.mkldnn_rnn_layer in the ABI-compatible mode. Because aten.mkldnn_rnn_layer is an aten op, it is easier to add a C shim function for it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134857
Approved by: https://github.com/angelayi
2024-09-09 16:54:12 +00:00
0eb425a563 [Release] Apply Release changes scripts after release 2.4 (#135495)
Based on additional changes required for https://github.com/pytorch/pytorch/pull/128347
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135495
Approved by: https://github.com/kit1980
2024-09-09 16:49:04 +00:00
011cae9570 [Inductor] Make static_input_idxs a set for faster lookup (#135314)
`static_input_idxs` is only used for lookups. With large models, this is a large list. This takes over a millisecond in some cases.
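
The effect can be illustrated with a small stand-alone sketch. The index data and the helper `count_static` are hypothetical and are not taken from Inductor; the point is only that `in` scans a list linearly but is a hash lookup on a set:

```python
import timeit

# Hypothetical data: every other input index is "static".
static_idxs_list = list(range(0, 4000, 2))
static_idxs_set = set(static_idxs_list)


def count_static(static_idxs, num_inputs):
    """Count inputs whose index is static; `in` is O(n) on a list, O(1) on average on a set."""
    return sum(1 for i in range(num_inputs) if i in static_idxs)


if __name__ == "__main__":
    for name, container in [("list", static_idxs_list), ("set", static_idxs_set)]:
        seconds = timeit.timeit(lambda: count_static(container, 4000), number=5)
        print(f"{name}: {seconds:.4f}s")
```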

Profile before change:
<img width="824" alt="image" src="https://github.com/user-attachments/assets/002a0775-fd2f-4d27-8cf2-812b502d7d5e">

Profile after change: gaps are smaller, 1ms speedup before launching the cuda graph
<img width="794" alt="image" src="https://github.com/user-attachments/assets/12a0a0b9-2cc1-4d53-ac87-9bd5140a46f5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135314
Approved by: https://github.com/oulgen
2024-09-09 16:24:58 +00:00
dfb2b661f7 Use float data type for Half var_sum in batchnorm stats updating on CPU (#126525)
Using float data type for Half `var_sum` in batchnorm stats updating on CPU to avoid `var_sum` overflow since the representation range of Half is small.
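
The overflow can be reproduced with a pure-Python emulation. This is only a sketch: it rounds a running sum to IEEE 754 binary16 with the standard `struct` module and is not the ATen kernel; the helper names and the input values are made up for illustration:

```python
import math
import struct


def to_half(x: float) -> float:
    """Round x to the nearest binary16 value; out-of-range values become +/-inf."""
    try:
        return struct.unpack("e", struct.pack("e", x))[0]
    except OverflowError:
        return math.copysign(math.inf, x)


def accumulate(values, round_fn=lambda v: v):
    """Sum `values`, applying `round_fn` after every addition."""
    total = 0.0
    for v in values:
        total = round_fn(total + v)
    return total


if __name__ == "__main__":
    squared_deviations = [100.0] * 1000  # true sum is 100000, above the Half maximum of 65504
    print("half accumulator: ", accumulate(squared_deviations, to_half))
    print("wider accumulator:", accumulate(squared_deviations))
```
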
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126525
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-09 15:31:38 +00:00
5a69e0ebbe [MPS] Update decorator comments with issue ref (#135448)
Updating the comments with references to better places for context now that the bugs have been identified.

xref #135442 #135447 #134184

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135448
Approved by: https://github.com/ezyang
2024-09-09 15:18:52 +00:00
5e145861f2 [ONNX] Improves documentation of ONNX exporter (#135372)
This PR updates the documentation to reflect the changes to the ONNX exporter introduced in PyTorch 2.5.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135372
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-09-09 15:09:01 +00:00
c35b953531 Fix wrong error msg (#135423)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135423
Approved by: https://github.com/ezyang
2024-09-09 13:28:31 +00:00
dced0d6d9f Add __init__.py to shape inference folder. (#135461)
Fixes #135196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135461
Approved by: https://github.com/ezyang
2024-09-09 13:27:58 +00:00
c0436c5701 [inductor][cpp][gemm] fix perf regression xcit_large_24_p8_224 (#134686) (#135438)
Fix #134686.

PR https://github.com/pytorch/pytorch/pull/132729 makes GEMM template faster for one of the GEMMs in xcit_large_24_p8_224:
SingleProcess AUTOTUNE benchmarking takes 1.7088 seconds and 1.9207 seconds precompiling
AUTOTUNE linear_unary(12544x3072, 768x3072, 768)
  cpp_packed_gemm_2 2.9371 ms 100.0%
  _linear_pointwise 3.1584 ms 93.0%

But it is slower than ATen in the e2e run due to different cache behavior. Access to the input data (12544x3072) is LLC-latency bound, and bottlenecks arise from memory synchronization (data transfers and coherence updates across processors). This PR tries to mitigate the problem by having the processors that share the input data cooperatively load different chunks of it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135438
Approved by: https://github.com/leslie-fang-intel
2024-09-09 05:16:02 +00:00
cyy
60e8dc4374 Check function declarations in Caffe2 code (#134925)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134925
Approved by: https://github.com/ezyang
2024-09-09 05:03:29 +00:00
e6c3f58584 Fix example: Address broadcasting error in the addition of `attn_bias… (#135427)
…` and `attn_mask`, and correct device assignment for newly created variables in the method.

Fix example: Address broadcasting error in the addition of `attn_bias` and `attn_mask`, and correct device assignment for newly created variables in the method.

1. Adding `attn_bias += attn_mask` results in a broadcasting error. The expected shape of `attn_bias` is (L, S), so the output should also have the shape (L, S). However, when the input shape is (N, num_heads, L, S), broadcasting occurs, leading to an output shape of (N, num_heads, L, S), which is not desired.
2. `attn_bias` is a newly created variable within the method, but it is not assigned to the correct device.
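
The shape problem in point 1 can be checked without tensors. The sketch below implements the usual right-aligned broadcasting rule in plain Python; the helper names and the concrete sizes are illustrative assumptions, and it does not reproduce the actual fix in this PR:

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the right-aligned broadcast of two shapes, or raise ValueError."""
    out = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and 1 not in (x, y):
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
        out.append(y if x == 1 else x)
    return tuple(reversed(out))


def can_add_in_place(target, other):
    """An in-place add only works if the broadcast result still has the target's shape."""
    try:
        return broadcast_shape(target, other) == tuple(target)
    except ValueError:
        return False


if __name__ == "__main__":
    N, num_heads, L, S = 2, 4, 3, 5
    print(broadcast_shape((L, S), (N, num_heads, L, S)))   # shape of an out-of-place sum
    print(can_add_in_place((L, S), (N, num_heads, L, S)))  # in-place into attn_bias
```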

**This is my retry of PR #130209 . The PR has been merged into commit `d4a79d4a7c746068d25fe5cf9333495561f4ce1f`, but the modifications were overwritten by subsequent commits.**

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
@mikaylagawarecki  provided a more elegant implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135427
Approved by: https://github.com/ezyang
2024-09-09 03:47:34 +00:00
90e12cf63d Fix return type of nansum example. (#135435)
One of the examples in the documentation of `torch.nansum` contains a wrong return type. This fixes it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135435
Approved by: https://github.com/ezyang
2024-09-09 03:34:52 +00:00
44c08f4984 [Partitioner] Query whether nodes exist in graph faster (#135316)
Checking whether a node exists in `graph.nodes` (a linked list) takes too long. Use `graph._find_nodes_lookup_table` (a hash table) instead to speed up the lookup.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135316
Approved by: https://github.com/ezyang
2024-09-09 03:34:02 +00:00
b6186353c6 enable lazy_init for hpu (#135203)
enables lazy_init for hpu device
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135203
Approved by: https://github.com/ezyang
2024-09-09 03:32:20 +00:00
b7eb7256fb docs: torch.nn.utils.rnn.pack_padded_sequence: docs improve (#135417)
docs: `torch.nn.utils.rnn.pack_padded_sequence`: docs improve

/cc @mikaylagawarecki
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135417
Approved by: https://github.com/ezyang
2024-09-09 03:16:11 +00:00
c1ae78be92 [inductor] calibration inductor windows uts (18/N) (#135449)
Skip the test_quantized_* UTs of `test/inductor/test_cpu_select_algorithm.py`.
Inductor on Windows does not support quantization so far.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135449
Approved by: https://github.com/ezyang
2024-09-09 03:10:54 +00:00
defb515306 [NJT]Add permute ops support (#135336)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135336
Approved by: https://github.com/davidberard98
2024-09-08 21:00:41 +00:00
31c4e0d37d [inductor] Cleanup analysis done at lowering time (#135412)
Before this we would take multiple passes over the body of each IRNode as we did lowering.  This combines most analysis into `OpCounterCSE` so it can be done in a single pass.

Before:
![image](https://github.com/user-attachments/assets/0047db09-4258-4491-a9a6-b078e183092a)

After:
![image](https://github.com/user-attachments/assets/1e03adcb-8303-4bb1-8bbb-cc42dacd44d7)

This stack:
![image](https://github.com/user-attachments/assets/d6b50b24-c30c-4d23-8b1a-344b3ba65d7a)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135412
Approved by: https://github.com/oulgen
ghstack dependencies: #135286, #135306, #135377, #135400
2024-09-08 18:02:36 +00:00
53290ca00b [inductor] Refactor BaseSchedulerNode.__init__ (#135400)
Might be a small compile time improvement since we remove a call to extract_read_writes().

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135400
Approved by: https://github.com/oulgen
ghstack dependencies: #135286, #135306, #135377
2024-09-08 18:02:36 +00:00
16f5155992 [inductor] Fast path for extract_read_writes without tracing (#135377)
Before (bottom of stack):
![image](https://github.com/user-attachments/assets/13060ff9-b31d-42a9-8e8f-c50b2bf3dc2f)

After (this PR):
![image](https://github.com/user-attachments/assets/7d190821-b614-46b7-9e9e-9087443df654)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135377
Approved by: https://github.com/oulgen
ghstack dependencies: #135286, #135306
2024-09-08 18:02:32 +00:00
37144be03d [inductor] Remove ReadWrites.op_counts (#135306)
This was (almost) unused.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135306
Approved by: https://github.com/oulgen
ghstack dependencies: #135286
2024-09-08 18:02:28 +00:00
3bdc54ed18 [inductor] Refactor LoopBody.memory_usage (#135286)
This is preparing for some other changes where I speed up extract_read_writes tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135286
Approved by: https://github.com/oulgen
2024-09-08 18:02:24 +00:00
cyy
2196f32475 [22/N] Fix clang-tidy warnings in jit (#135319)
Follows #134537
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135319
Approved by: https://github.com/titaiwangms
2024-09-08 17:18:29 +00:00
cfc227ad43 [reland][dtensor] move DTensor to public namespace (#134203)
reland of https://github.com/pytorch/pytorch/pull/133113

I had to create a new PR because the previously reverted PR could be neither rebased nor imported successfully :(

----

Moving DTensor to be in the public namespace, to formally add the documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next PRs)
* To preserve the BC for users still using the torch.distributed._tensor, I added a shim script to redirect old path calls to the new module

BC preservation is evidenced by the fact that all DTensor tests still pass without changing the public imports, so it's safe to land the changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203
Approved by: https://github.com/tianyu-l
2024-09-08 17:08:40 +00:00
20cab91a12 [dynamo] Remove skip from jit freeze tests (#135281)
Fixes https://github.com/pytorch/pytorch/issues/119781
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135281
Approved by: https://github.com/zou3519
2024-09-08 15:11:12 +00:00
a6fae2e811 Use BRGEMM for Half flash attention forward kernel (#131879)
Use oneDNN BRGEMM on packed data to get better performance on the 5th generation of Xeon where Intel® Advanced Matrix Extensions (AMX) will have fp16 support, e.g. amx-fp16.
Multiple models have achieved acceleration, for instance, FP16 stable diffusion v2.1 has achieved over 50% improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131879
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #131878
2024-09-08 12:32:23 +00:00
042f2f7746 [ONNX] Re-raise the exception if the dynamic shapes cannot be refined (#135418)
Improve error reporting. Otherwise, most of the time users will only see that the shapes could not be refined.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135418
Approved by: https://github.com/titaiwangms
2024-09-08 05:30:34 +00:00
fd494dd426 Change wrapped_linear_prepack and wrapped_quantized_linear_prepacked to private by adding _ as prefix (#135401)
Summary: In https://github.com/pytorch/pytorch/pull/134232, we added two new ops wrapped_linear_prepack and wrapped_quantized_linear_prepacked. From the review comments and offline discussion, we are changing them to private by adding `_` as prefix

Differential Revision: D62325142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135401
Approved by: https://github.com/houseroad
2024-09-08 04:16:24 +00:00
8334cb2fb9 remove commented out breakpoints (#135363)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135363
Approved by: https://github.com/oulgen
2024-09-08 02:15:45 +00:00
e72ed4717e [Dynamo] Fix Huggingface PretrainedConfig get non const attr (#135413)
Fixes #135329

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135413
Approved by: https://github.com/anijain2305
2024-09-07 19:16:29 +00:00
3bebc09be9 [FlexAttention] Align the matmul tensorcore usage (#135168)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135168
Approved by: https://github.com/Chillee
2024-09-07 16:33:41 +00:00
a2db22e6bb [inductor] Catch BrokenProcessPool and print a more helpful message. (#135120)
Summary: BrokenProcessPool means a parallel-compile subprocess exited, which we never expect. It's likely due to a crash, so print a more meaningful error message and instructions that it's probably easier to debug by turning off parallel compile. Output looks like:
```
...
  File "/data/users/slarsen/pytorch/torch/_inductor/runtime/compile_tasks.py", line 45, in _reload_python_module
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/torchinductor_slarsen/4q/c4qw7xk5lbb7whg5txnk4hwbc7z6kepak3o666tr3d64gcad5r5b.py", line 815, in <module>
    async_compile.wait(globals())
  File "/data/users/slarsen/pytorch/torch/_inductor/async_compile.py", line 265, in wait
    raise RuntimeError(
RuntimeError: A compilation subprocess exited unexpectedly. This is likely due to a crash. To facilitate debugging, you can re-run with TORCHINDUCTOR_COMPILE_THREADS=1 to cause compilation to occur in the main process.
```
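
The pattern can be sketched with the standard library alone. This is an illustration under assumptions, not the code in `async_compile.py`: the function name `wait_for_results` is hypothetical, and only the error text and the `TORCHINDUCTOR_COMPILE_THREADS` variable come from the output above.

```python
from concurrent.futures import Future
from concurrent.futures.process import BrokenProcessPool


def wait_for_results(futures):
    """Collect results, translating a dead worker process into an actionable error."""
    results = []
    for future in futures:
        try:
            results.append(future.result())
        except BrokenProcessPool as e:
            # Chain the original exception so the low-level cause stays visible.
            raise RuntimeError(
                "A compilation subprocess exited unexpectedly. This is likely due to "
                "a crash. To facilitate debugging, you can re-run with "
                "TORCHINDUCTOR_COMPILE_THREADS=1 to cause compilation to occur in "
                "the main process."
            ) from e
    return results


if __name__ == "__main__":
    # Simulate a crashed worker without spawning real processes.
    broken = Future()
    broken.set_exception(BrokenProcessPool("worker died"))
    try:
        wait_for_results([broken])
    except RuntimeError as err:
        print(err)
```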

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135120
Approved by: https://github.com/Chillee
2024-09-07 16:33:37 +00:00
eac5e12548 [inductor] Move LoopBody to its own file (#135257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135257
Approved by: https://github.com/oulgen
2024-09-07 16:29:15 +00:00
18479c5f70 [Doc] update max-autotune for CPU (#134986)
The current doc for `max-autotune` is applicable only for GPU. This PR adds the corresponding content for CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134986
Approved by: https://github.com/jgong5, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-07 13:42:40 +00:00
f7c0c06692 Add oneDNN BRGEMM support on CPU (#131878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131878
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-07 13:22:30 +00:00
b53d97c7be [Intel GPU] Add XPU memory-related APIs (#129919)
# Motivation
According to https://github.com/pytorch/pytorch/issues/116322, we will help unify the device allocator. So we first introduced a simple XPU device allocator with only the key functionality, and expected to add memory statistics-related functionality after the unification.
However, some memory statistics-related APIs listed in https://github.com/pytorch/pytorch/issues/127929 have now been requested, and we need more time to unify the device allocator. To improve the user experience, we would like to support these memory statistics-related APIs before the unification.

# Additional Context
Fixes: #127929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129919
Approved by: https://github.com/dvrogozh, https://github.com/abhilash1910, https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #130923
2024-09-07 11:15:17 +00:00
6c1da66407 [Reland] Refactor caching device allocator utils (#130923)
# Motivation
Following [[RFC] Intel GPU Runtime Upstreaming for Allocator ](https://github.com/pytorch/pytorch/issues/116322), this PR aims to refactor caching device allocator utils to improve code reuse.
This is the first PR; follow-up PRs can continue refactoring the device caching allocator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130923
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD, https://github.com/eqy
2024-09-07 11:14:17 +00:00
d7c97e7245 [inductor][cpp][gemm] cache blocking config for dynamic shapes (#133538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133538
Approved by: https://github.com/leslie-fang-intel
ghstack dependencies: #135277, #133447

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
2024-09-07 11:09:30 +00:00
be9f4ffe88 [inductor][cpp][gemm] enable dynamic M for k-slicing (#133447)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133447
Approved by: https://github.com/leslie-fang-intel
ghstack dependencies: #135277

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
2024-09-07 11:09:30 +00:00
692faa9bc6 [inductor][cpp][gemm] reduce memory alloc overhead by allocating local acc once per thread (#135277)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135277
Approved by: https://github.com/leslie-fang-intel

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
2024-09-07 11:09:25 +00:00
32f3af72b7 [ONNX] Support FakeTensor in ONNXProgram (#135399)
Sync with https://github.com/justinchuby/torch-onnx/compare/v0.1.20...v0.1.21 to support FakeTensors in ONNXProgram. Specifically, this PR implements the `apply_weights` method to allow users to supply a dictionary of concrete tensors to replace FakeTensors in the exported model weights.

An error is raised when users try to serialize a FakeTensor to avoid segfaults.

Also fixed a bug in `.save()` when `keep_initializers_as_inputs` is True and `include_initializers` is False.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135399
Approved by: https://github.com/titaiwangms
2024-09-07 04:48:18 +00:00
ebab5c85c4 [FlexAttention] Skip very small block size unit tests on H100 due to Triton bug (#135393)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135393
Approved by: https://github.com/BoyuanFeng
2024-09-07 04:35:22 +00:00
3d734d837b [ONNX] Handle mixed sequence inputs properly (#135378)
Previously, when an input contains a mixture of `Value` and python constants like `[SymbolicTensor('sym_size_int_3', type=Tensor(INT64), shape=[], producer=node_Shape_0, index=0), 512]`, we get errors like

```pytb
Traceback (most recent call last):
  File "/Users/justinc/Documents/GitHub/torch-onnx/src/torch_onnx/_building.py", line 367, in _call_op
    converted_named_inputs = _process_python_constants_and_sequences(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/justinc/Documents/GitHub/torch-onnx/src/torch_onnx/_building.py", line 275, in _process_python_constants_and_sequences
    raise TypeError(
TypeError: Constant input '[SymbolicTensor('sym_size_int_3', type=Tensor(INT64), shape=[], producer=node_Shape_0, index=0), 512]' of type '<class 'list'>' is not supported
```

This PR updates Sequence handling to support this case, as well as variadic inputs and ONNX Sequence inputs.

Synced from https://github.com/justinchuby/torch-onnx/pull/187
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135378
Approved by: https://github.com/titaiwangms
2024-09-07 03:07:39 +00:00
c92227c41a [quant][pt2e] fix placeholder typo and related quantization tests (#135379)
A previous typo on "placeholder" and related tests in quantization are fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135379
Approved by: https://github.com/jerryzh168
2024-09-07 02:31:43 +00:00
e6a0221fc6 [Inductor] Optionally allow padding on non-GPU devices (#135280)
This is the OSS component of a larger MTIA diff.

Currently, Inductor disables padding for non-GPU devices. We need to change this behavior to enable padding on MTIA.

This PR adds a config option to enable padding on the CPU, or any other non-GPU device. In the future, we might want to enable padding on all devices by default. However, that might require supporting device-dependent padding defaults, since CPUs will likely use different settings than H100 GPUs.

Differential Revision: D61038114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135280
Approved by: https://github.com/jfix71, https://github.com/shunting314
2024-09-07 02:19:14 +00:00
a6b9d444fb [ONNX] Refactor exporter errors (#135180)
Refactor exporter errors to combine old errors and new errors for API consistency.

This PR also

1. Removes the `_C._check_onnx_proto(proto)` call in the old exporter. We don't need the ONNX checker because it is limited.
2. Removes the `OnnxExporterError` defined in the dynamo module. This class unnecessarily stores the onnx program object, making it very bulky. Instead, we revert to use the plain OnnxExporterError defined in the `errors` module and use it as the base class for all errors.
3. Continues to expose `OnnxExporterError` in `torch.onnx` and the rest of the errors in `torch.onnx.errors`.
4. Removes the `CheckerError` and `InvalidExportOptionsError` from `torch.onnx`. This is BC breaking but should have low impact.
5. I did not rename existing errors out of compatibility considerations, even though `ExporterError` would have been more succinct.

Fixes https://github.com/pytorch/pytorch/issues/135125
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135180
Approved by: https://github.com/titaiwangms
2024-09-07 00:50:15 +00:00
d42b0c8f22 Add release matrix for 2.5 (#135383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135383
Approved by: https://github.com/huydhn
2024-09-07 00:49:53 +00:00
941d094dd1 [Dynamo][DTensor] Fixes SymNodeVariable() is not a constant error in Compiled DDP + TP unit test (#135315)
Before the fix, the unit test will fail at forward Dynamo tracing:
```
  File "/data/users/willfeng/pytorch/test/distributed/_composable/test_replicate_with_compiler.py", line 415, in test_ddp_tp
    loss = compiled_replicate_model(data).sum()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
torch._dynamo.exc.InternalTorchDynamoError: SymNodeVariable() is not a constant

from user code:
   File "/data/users/willfeng/pytorch/torch/distributed/tensor/parallel/_data_parallel_utils.py", line 34, in _unflatten_tensor
    result = DTensor.from_local(
```
After the fix, the compilation fails at a later step (Compiled Autograd tracing), due to needing "pre-dispatch tracing of backward graph" feature (see details at https://github.com/pytorch/pytorch/issues/127797#issuecomment-2291695474).

I believe this PR is a net improvement, because it should also fix the 1D Traceable FSDP2 failure case on internal models (https://github.com/pytorch/pytorch/issues/130978#issuecomment-2319476690), which is much harder to build a minimal unit test for.

Fixes https://github.com/pytorch/pytorch/issues/130978.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135315
Approved by: https://github.com/bdhirsh
2024-09-07 00:11:25 +00:00
b1a934741e Change test_constant_prop_preserve_metadata (#135268)
Summary: In new export_for_training, "stack_trace" does not exist in node meta anymore.

Test Plan:
```
buck run fbcode//mode/dev-nosan fbcode//caffe2/test:quantization_pt2e -- -r test_constant_prop_preserve_metadata
```

Reviewed By: angelayi

Differential Revision: D62219974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135268
Approved by: https://github.com/angelayi
2024-09-07 00:02:35 +00:00
0c661f3e1a [Split Build] Refactor split build binary builds into their own workflows and move split build binary builds to periodic (#134624)
Because we need to move the split build binary tests from trunk to periodic, this PR refactors those jobs out into their own workflows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134624
Approved by: https://github.com/malfet
2024-09-06 23:57:56 +00:00
2c7e314803 [Inductor][CPP] Fix the issue of view dtype (#135301)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/135160, it's a regression introduced by https://github.com/pytorch/pytorch/pull/134569, where the dtype of `to_dtype_bitcast` was incorrectly handled when using the scalarize implementation.

**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_view_dtype
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135301
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-06 23:36:44 +00:00
ead4407f57 [inductor] Fix loop split optimization (#135303)
Fix https://github.com/pytorch/pytorch/issues/135274.

Improve the check whether the div expr matches: add a check whether `split_var` is in `original_body.iter_vars`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135303
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-09-06 23:06:25 +00:00
2f5b40c099 [aoti test] Disable FP8 fnuz dtypes in fp8 runtime check test (#135373)
Fixing https://github.com/pytorch/pytorch/issues/126734

The key point is that the fnuz FP8 types are for AMD only.

source: https://github.com/openxla/stablehlo/blob/main/rfcs/20230321-fp8_fnuz.md

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135373
Approved by: https://github.com/chenyang78
2024-09-06 23:05:47 +00:00
993b5647ab [export] fix placeholder name collision tests by removing map call (#135366)
The current test is failing because of the current unstable state of map: torch.compile and non-strict export take two separate routes, unlike cond and while_loop. This PR fixes the test itself. We'll fix map in follow-up PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135366
Approved by: https://github.com/angelayi
2024-09-06 22:02:50 +00:00
2ab26806f1 Require tlparse for failing tests in test_structured_trace.py (#135376)
Summary: These tests are currently failing internally. Per discussion, skip if tlparse is unavailable

Test Plan:
```
feature remove tlparse
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --run-disabled --regex test_structured_trace.py
feature install tlparse
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --run-disabled --regex test_structured_trace.py
```

Differential Revision: D62310342

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135376
Approved by: https://github.com/ezyang
2024-09-06 21:53:41 +00:00
b1612569f6 [BE] Clarify defaulting behavior in optimizer (#135384)
Fixes #135340

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135384
Approved by: https://github.com/drisspg, https://github.com/jainapurva
2024-09-06 21:52:55 +00:00
dc0e818738 [FR] Automatically infer a common filename prefix (#135158)
Save the annoyance of specifying this on the command line each time
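
One way to infer such a prefix is sketched below. It is an assumption-laden illustration, not the flight recorder script's actual logic: the function name `infer_prefix` and the example file names are hypothetical.

```python
import os


def infer_prefix(trace_files):
    """Guess the filename prefix shared by all per-rank trace files."""
    names = [os.path.basename(path) for path in trace_files]
    prefix = os.path.commonprefix(names)
    # Drop trailing digits so that ranks such as 10 and 11 do not leak a '1' into the prefix.
    return prefix.rstrip("0123456789")


if __name__ == "__main__":
    files = ["dump/nccl_trace_rank_10", "dump/nccl_trace_rank_11"]
    print(infer_prefix(files))
```
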
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135158
Approved by: https://github.com/fduwjj, https://github.com/c-p-i-o
ghstack dependencies: #135157
2024-09-06 21:44:27 +00:00
06e414d7fe [FR] Make trace_dir a required argument (#135157)
Ensures users get a clean error if they forget to specify the dir, and
improves the help message.
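
With `argparse`, a required directory argument might look like the sketch below. Whether the real script uses a positional argument or a required option is an assumption here, and the description and help strings are made up:

```python
import argparse


def build_parser():
    """Build a parser whose only argument, trace_dir, is mandatory."""
    parser = argparse.ArgumentParser(description="Analyze flight recorder traces (sketch).")
    # A positional argument is required by default, so omitting it produces a usage error.
    parser.add_argument("trace_dir", help="Directory containing one trace file per rank.")
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.trace_dir)
```
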
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135157
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
2024-09-06 21:44:27 +00:00
a681260caf Revert "[ONNX] Refactor exporter errors (#135180)"
This reverts commit 5eebd9315a72422d59b6f8d8ca8e4e573e231d5c.

Reverted https://github.com/pytorch/pytorch/pull/135180 on behalf of https://github.com/clee2000 due to I think this broke test_public_bindings.py::TestPublicBindings::test_correct_module_names [GH job link](https://github.com/pytorch/pytorch/actions/runs/10743909338/job/29800779403) [HUD commit link](5eebd9315a), possibly a landrace with the PR that landed before it ([comment](https://github.com/pytorch/pytorch/pull/135180#issuecomment-2334844191))
2024-09-06 21:39:18 +00:00
95e976a63f [dynamo] recursively skip frames when Dynamo cache limit is hit (#135144)
Fixes https://github.com/pytorch/pytorch/pull/135144 and [T197117723](https://www.internalfb.com/intern/tasks/?t=197117723).

In general, adds `SkipCodeRecursiveException` to Dynamo - when raised in Dynamo, convert_frame will return a `skip_code_recursive_flag` back to C Dynamo, signaling it to skip the current frame and all recursive calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135144
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-09-06 21:38:53 +00:00
306ac44eaa [ez][TD] Fix request for issue body returns None (#135389)
I assumed it would be an empty string if the body is empty, but it's just None.
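
A defensive way to read such a field is sketched below; the helper name `issue_body_text` and the dictionary layout are illustrative assumptions rather than the code changed in this PR:

```python
def issue_body_text(issue: dict) -> str:
    """Return the issue body as a string, treating a missing or null body as empty."""
    # `or ""` covers both an absent key and an explicit None value.
    return issue.get("body") or ""


if __name__ == "__main__":
    print(repr(issue_body_text({"body": None})))
    print(repr(issue_body_text({"body": "some text"})))
```
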
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135389
Approved by: https://github.com/malfet
2024-09-06 21:02:01 +00:00
a7643baceb Revert expectFailureIf condition on tests with torch.compile on Windows (#134759)
Fixes #134716

This PR reverts some changes introduced in 6eae569546 (#133987)

torch.compile is not available on Windows, so these tests should be expected to fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134759
Approved by: https://github.com/malfet
2024-09-06 20:51:55 +00:00
a4030e37be [dynamo] reland map/zip iterator related changes (#135074)
Differential Revision: [D62211019](https://our.internmc.facebook.com/intern/diff/D62211019)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135074
Approved by: https://github.com/jansel, https://github.com/anijain2305, https://github.com/mlazos
2024-09-06 20:38:02 +00:00
22e1fb6faa [test][easy] Add debug utils for cpu select algorithm test (#135038)
Summary: Add debug utils to debug a flaky test in fbcode ci.

Some context: https://github.com/pytorch/pytorch/pull/126545

Test Plan: ci

Differential Revision: D62005445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135038
Approved by: https://github.com/jgong5, https://github.com/XuehaiPan
2024-09-06 20:30:49 +00:00
2a4890e315 [ONNX] Clean up the missed lines from previous PRs (#135368)
Some missed deleted lines

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135368
Approved by: https://github.com/justinchuby
2024-09-06 20:27:52 +00:00
3ce433aef2 [TCPStore] use wait counters (#135283)
This replaces the existing TCPStore counters with the new shared wait counters. There are no users of the TCPStore counters, so they should be completely safe to remove.

Test plan:

Existing tests + build

There's no OSS backend for wait counters, so we can't write any tests with them currently.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135283
Approved by: https://github.com/c-p-i-o
2024-09-06 19:54:25 +00:00
7f2d20e687 Run all autograd node post hooks (#134728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134728
Approved by: https://github.com/albanD, https://github.com/soulitzer
2024-09-06 19:44:28 +00:00
32fd29c1ea [ONNX] Properly handle Attributes in traceable functions (#135367)
Previously the attributes were sent in as Attr objects even when we call the function as a plain Python function. This PR turns them into Python objects.

From https://github.com/justinchuby/torch-onnx/pull/186
Related https://github.com/microsoft/onnxscript/issues/1846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135367
Approved by: https://github.com/justinchuby
2024-09-06 19:35:22 +00:00
5eebd9315a [ONNX] Refactor exporter errors (#135180)
Refactor exporter errors to combine old errors and new errors for API consistency.

This PR also

1. Removes the `_C._check_onnx_proto(proto)` call in the old exporter. We don't need the ONNX checker because it is limited.
2. Removes the `OnnxExporterError` defined in the dynamo module. This class unnecessarily stores the onnx program object, making it very bulky. Instead, we revert to use the plain OnnxExporterError defined in the `errors` module and use it as the base class for all errors.
3. Continues to expose `OnnxExporterError` in `torch.onnx` and the rest of the errors in `torch.onnx.errors`.
4. Removes the `CheckerError` and `InvalidExportOptionsError` from `torch.onnx`. This is BC breaking but should have low impact.
5. I did not rename existing errors out of compatibility considerations, even though `ExporterError` would have been more succinct.

Fixes https://github.com/pytorch/pytorch/issues/135125
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135180
Approved by: https://github.com/titaiwangms
2024-09-06 19:10:56 +00:00
a15aabc975 Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262)
Hi,
I noticed the `unfold` operator was missing on MaskedTensor.

I tested that my change works when calling unfold and backward on a `MaskedTensor` but I didn't find the tests for the dispatch of such operation. Where is it?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262
Approved by: https://github.com/cpuhrsch
2024-09-06 19:06:23 +00:00
b143426db3 [Inductor] Use argument names as the key for the constants dict and the signature dict (#135170)
Referencing how triton constructs these dictionaries

ca3fb5f6fa/python/triton/runtime/jit.py (L639)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135170
Approved by: https://github.com/htyu
2024-09-06 19:05:00 +00:00
13ba0a2e5c Run bypassed graph compile outside the except block to avoid chaining of exceptions (#135175)
Fixes #135172
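
A minimal sketch of why moving the fallback out of the `except` block matters. The function names are hypothetical stand-ins, not Dynamo's own. An exception raised while another one is being handled gets the first one attached as `__context__`, which is what produces the "During handling of the above exception, another exception occurred" traceback.

```python
def compile_graph():
    # Hypothetical stand-in for a compile step that signals "bypass".
    raise KeyError("bypass")


def fallback_compile():
    # Hypothetical stand-in for the bypassed compile path, which itself fails.
    raise RuntimeError("compile failed")


def chained():
    try:
        compile_graph()
    except KeyError:
        # Raised while KeyError is still being handled: __context__ is set.
        fallback_compile()


def unchained():
    bypass = False
    try:
        compile_graph()
    except KeyError:
        bypass = True
    if bypass:
        # Raised after the except block has exited: no implicit chaining.
        fallback_compile()


def context_of(fn):
    # Return the implicit context attached to the RuntimeError raised by fn.
    try:
        fn()
    except RuntimeError as e:
        return e.__context__
```

Recording the decision in a flag and acting on it after the handler exits keeps the second traceback free of the unrelated first exception.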

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135175
Approved by: https://github.com/masnesral, https://github.com/ezyang
2024-09-06 19:03:57 +00:00
8520ce5f78 Fix incorrect trace of post-accumulate grad hook on tensor with zero dims (#135226)
Fix incorrect trace of post-accumulate grad hook on tensor with zero dimensions

Fixes #135207

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135226
Approved by: https://github.com/xmfan
2024-09-06 18:19:54 +00:00
196748d491 [elastic] support local_addr across all rendezvous impls (#135262)
Summary:
There was a regression introduced in https://github.com/pytorch/pytorch/pull/125743 that made `local_addr` no longer used. This fixes that by passing `local_addr` to `RendezvousStoreInfo.build` everywhere it's used.

This also fixes a number of tests allowing them to be run in parallel which hugely sped up the testing cycle as this change touches many different rendezvous implementations. This required a few fixes in unrelated tests.

Test Plan:
Added tests for `local_addr` across the common rendezvous implementations to prevent future regressions.

```
buck2 test @//mode/dev-nosan fbcode//caffe2/test/distributed/elastic/... fbcode//caffe2/torch/distributed/elastic/... -- --stress-runs 3
```

To vet the parallelism changes I also ran with 3 stress runs each to identify flakiness caused by parallelism.

Differential Revision: D62256407

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135262
Approved by: https://github.com/fduwjj, https://github.com/wz337
2024-09-06 17:55:43 +00:00
177e4f4218 remove _check call on item() for torch.istft (#135234)
Fixes #135014

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135234
Approved by: https://github.com/tugsbayasgalan
2024-09-06 17:31:25 +00:00
3988b3468b [aoti][easy] remove breakpoint() in wrapper.py (#134807)
Differential Revision: D61687146

Remove an unintended breakpoint in code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134807
Approved by: https://github.com/YUNQIUGUO
2024-09-06 17:25:05 +00:00
04118d8617 [export] Record the global torch version in serialization. (#135243)
Summary: In general I think it will be useful to also record the global torch version in the EP, so that we can track them in the logging in addition to the schema version.

Test Plan: CI

Reviewed By: henryoier

Differential Revision: D62252626

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135243
Approved by: https://github.com/yushangdi
2024-09-06 17:02:06 +00:00
24482e5c68 [torch][fx] Set maximum warning count during fx.Graph.lint (#135069)
Summary:
resnet152 spent about 15 minutes writing warning messages in _unlift
during `to_executorch` because they're all written to unbuffered stderr
by the `warnings` module.

These warnings are almost always about get_attr nodes referencing a
non-existent name:
```lang=py
warnings.warn(f'Node {node} target {node.target} {atom} of {seen_qualname} does '
  'not reference an nn.Module, nn.Parameter, or buffer, which is '
  'what \'get_attr\' Nodes typically target'
)
```
I'm not aware of a way to configure the warnings module to write this out
at most once, so I'm just going to disable the lint for now.
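
The commit title refers to a maximum warning count. A minimal sketch of that idea, assuming a hypothetical helper `warn_capped` and an illustrative default cap (neither is taken from the actual change), could look like this:

```python
import warnings


def warn_capped(messages, max_warnings=5):
    """Emit at most `max_warnings` warnings, then one summary warning."""
    emitted = 0
    suppressed = 0
    for msg in messages:
        if emitted < max_warnings:
            warnings.warn(msg)
            emitted += 1
        else:
            # Past the cap: count the message instead of writing it out.
            suppressed += 1
    if suppressed:
        warnings.warn(f"{suppressed} additional warnings suppressed")
    return emitted, suppressed
```

This bounds the amount of output written to stderr regardless of how many nodes trigger the same diagnostic.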

Test Plan:
Re-ran resnet152 with Executorch and the XNNPackBackend, it is much faster now

Differential Revision: D62156090

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135069
Approved by: https://github.com/yushangdi
2024-09-06 16:41:59 +00:00
c0ec599f27 Update submodule ideep to include aarch64 change (#134897)
This PR is per ARM request, which is in https://github.com/intel/ideep/issues/334.

Context for the request is: Arm team has upstreamed the dynamic quantization changes, all the PRs were merged (torch, ideep, oneDNN), but without this ideep submodule update, the feature will not work. The change is isolated to only matmul operator and quantization path alone.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134897
Approved by: https://github.com/jgong5, https://github.com/atalman, https://github.com/snadampal
2024-09-06 16:40:26 +00:00
7074de43c0 Porting to GCC 15 (#135188)
`uint8_t` is declared in the `<cstdint>` header.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135188
Approved by: https://github.com/Skylion007
2024-09-06 16:16:53 +00:00
771dcce11d [AOTI][Tooling][6/n] Fix long dtype input tensors calling mean() in aoti_torch_print_tensor_handle (#135072)
Differential Revision: D61635232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135072
Approved by: https://github.com/hl475, https://github.com/ColinPeppler
2024-09-06 15:59:32 +00:00
de74aafff4 error on exporting ScriptModule (#135302)
Test Plan: added test

Differential Revision: D62279179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135302
Approved by: https://github.com/yushangdi
2024-09-06 15:12:40 +00:00
ad29a2c0dc Add Inductor config for default stride behavior (#135238)
By default, Inductor is allowed to manipulate the layout
(strides+storage offset) of input tensors to custom operators.

We want to change it so that the default is that Inductor should respect
the stride order of input tensors to custom operators.

This PR adds a config to toggle the behavior, in the next PR up we'll
change the default. We also make the following changes:
- We add a new operator Tag (flexible_layout), which means that
inductor is allowed to manipulate the layout. When we flip the default,
users can specify they want the old behavior by using this tag.

This is a reland of https://github.com/pytorch/pytorch/pull/126986,
which was previously reverted due to silent incorrectness. We've since
fixed the silent incorrectness
(https://github.com/pytorch/pytorch/pull/133639)

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135238
Approved by: https://github.com/albanD
2024-09-06 14:48:24 +00:00
3a9e33dca8 [torchelastic] Don't do signal handling when off the main thread (#135088)
Summary:
In multiprocessing, signal handling is not possible if the thread is not the main thread. This resulted in the following error:
> "ValueError('signal only works in main thread of the main interpreter')"

To address this issue, the diff checks whether the thread is the main thread and, if not, skips signal handling.
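
The check described above can be sketched as follows. The helper name `install_sigterm_handler` is hypothetical and the choice of `SIGTERM` is only for illustration; the actual torchelastic code may be structured differently.

```python
import signal
import threading


def install_sigterm_handler(handler):
    """Register `handler` for SIGTERM only when called from the main thread.

    Returns True if the handler was installed, False if it was skipped.
    """
    if threading.current_thread() is not threading.main_thread():
        # signal.signal() raises ValueError outside the main thread, so skip.
        return False
    signal.signal(signal.SIGTERM, handler)
    return True
```

Returning a boolean instead of raising lets callers on worker threads proceed without special-casing the error.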

Test Plan:
Before this change, MAST job failed:
https://fburl.com/mlhub/iq2m10v8

With this change, MAST job succeeded:
https://fburl.com/mlhub/q6kb8343

Differential Revision: D62166943

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135088
Approved by: https://github.com/d4l3k
2024-09-06 14:47:03 +00:00
a086882d72 [inductor][triton] mark workspace args as mutated (#134648)
SplitScan makes use of a workspace arg that needs to be zeroed before it is used - then, it is used to communicate between thread blocks during the triton kernel implementation. It is mutated during the execution of the kernel, so it should be marked as such.

Before this PR, it is not marked as mutated; AFAIK this is fine during normal execution, but during autotuning it causes problems. The workspace starts off zeroed (as expected), but during autotuning the kernel will be executed multiple times and the workspace does not get re-set between executions, resulting in incorrect data. If the data is used for indexing, then you can fail device-side asserts (and the results after the initial run (with autotuning) could be wrong). The test added in this PR repros the issue when the fix is removed.

When we mark the arg as mutated, then the arg gets cloned before autotuning, so that the arg passed to the kernel during autotuning will always be zeroed as expected.
804852c1f9/torch/_inductor/runtime/triton_heuristics.py (L685-L689)
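
A toy illustration of the failure mode, using plain Python lists in place of device buffers. The `kernel` and `autotune` functions here are hypothetical and only mimic the behavior described above; they are not Inductor code.

```python
def kernel(workspace, data):
    # Toy stand-in for a kernel that assumes `workspace` starts zeroed
    # and accumulates into it while running.
    for x in data:
        workspace[0] += x
    return workspace[0]


def autotune(workspace, data, runs=3, clone_mutated=True):
    """Run the kernel `runs` times, as a benchmarking loop would."""
    results = []
    for _ in range(runs):
        # Cloning gives every run a fresh zeroed copy; without it the
        # previous run's writes leak into the next run.
        ws = list(workspace) if clone_mutated else workspace
        results.append(kernel(ws, data))
    return results
```

With cloning, every run sees the same zeroed workspace and produces the same result; without it, each run starts from the state the previous run left behind.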

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134648
Approved by: https://github.com/peterbell10, https://github.com/jansel
2024-09-06 14:23:37 +00:00
84ae6b7d6b AOTDispatcher: limit cases when we detach() graph inputs to non-leaves (#134193)
This PR is slightly a revival / update to the discussion from https://github.com/pytorch/pytorch/pull/98960:

Part of FSDP2's tracing strategy right now is that:

(1) it is painful/difficult to handle the case where we have multiple graph input tensors that are aliased to each other and at least one of them is duplicated

(2) we already have longstanding logic to remove duplicate input tensors from the graph in dynamo. Morally, FSDP2 gives us duplicate input tensors in the backward graph for every `unsharded_param`, because we have (a) the `unsharded_param` being closed over by the backward hook to resize/allgather, and (b) the same `unsharded_param` being saved for backward by autograd (we now guarantee in the partitioner that we will always save the base tensor for backward and recompute views)

(3) However, we were still seeing cases where the `unsharded_param` showed up twice in the backward graph inputs, as distinct tensor objects (with different python ids) instead of being true duplicates that dynamo can de-dup.

It turns out that this was because we were `.detach()`ing the `unsharded_param` in AOTDispatcher before plumbing it through the compiled forward (and so autograd would save a detach'd version of the `unsharded_param`). This is precisely because of the logic from https://github.com/pytorch/pytorch/pull/98960.

However, re-reading the detailed comments, it seems unnecessary to do a detach() on a graph input that is a (leaf) `nn.Parameter`, even if it happens to get no gradients in the backward. Since it is a leaf, we don't have to worry about the autograd engine "continuing to backprop through the graph beyond the current tensor" (the leaf has no other grad_fn for autograd to backprop through).

So this PR makes us a bit less aggressive about calling detach() on inputs: we only do it when:

(1) our graph input statically will get a `None` gradient (and also has no metadata mutations, the existing state)

(2) **and** our graph input is a non-leaf tensor (so detach()ing is actually required to prevent autograd from incorrectly backpropping past the non-leaf).
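
The two conditions above can be sketched as a small predicate. `InputInfo` and its field names are hypothetical, chosen only to mirror the wording of this description; they are not AOTDispatcher's actual data structures.

```python
from dataclasses import dataclass


@dataclass
class InputInfo:
    # Hypothetical summary of what is known statically about a graph input.
    grad_is_statically_none: bool
    mutates_metadata: bool
    is_leaf: bool


def should_detach(info: InputInfo) -> bool:
    """Detach only non-leaf inputs with a None gradient and no metadata mutation."""
    return (
        info.grad_is_statically_none
        and not info.mutates_metadata
        and not info.is_leaf
    )
```

Under this sketch a leaf `nn.Parameter` that receives no gradient is passed through unchanged, so autograd saves the original tensor object rather than a detached copy.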

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134193
Approved by: https://github.com/yf225

Co-authored-by: Will Feng <yf225@cornell.edu>
2024-09-06 14:06:48 +00:00
60a097a071 [CD] Update binary_linux_test.sh to include calling builder smoke test (#133869)
Run smoke test

Fixes #1969

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133869
Approved by: https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
2024-09-06 13:27:24 +00:00
13bae39e22 [inductor] [cpp] improve cache blocking for is_dynamic_M (#131306)
## Performance
Models with >= 3% performance speedup are listed below:

### AMP single-thread dynamic shape (measured on CPU with AMX support)
No regressions

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | soft_actor_critic | 3% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131306
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
ghstack dependencies: #135275

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
2024-09-06 13:21:24 +00:00
4ef6c05f65 [inductor][cpp][gemm] fix autotune runtime error from linear_binary fusion (#135275)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135275
Approved by: https://github.com/leslie-fang-intel
2024-09-06 13:21:23 +00:00
d6b9bd3e60 Also handle compiler collective when input variable doesn't exist on all ranks (#135147)
Internal xref:
https://fb.workplace.com/groups/3095840833991792/permalink/3810738595835342/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135147
Approved by: https://github.com/jansel
2024-09-06 13:18:36 +00:00
d0591f4658 Ignore fresh unbacked when doing recursive make_fx inside HOPs (#135053)
Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7705964779531357/

This now also incorporates a test from https://github.com/pytorch/pytorch/pull/133585 (which it fixes) and the prep PR https://github.com/pytorch/pytorch/pull/134407 Including the PR desc from that:

I am trying to fix a problem reported by user in [fb.workplace.com/groups/6829516587176185/permalink/7705964779531357](https://fb.workplace.com/groups/6829516587176185/permalink/7705964779531357/) The summary of this problem is that when we do collect metadata analysis in AOTAutograd, we accumulate pending unbacked symbols which are going to be discarded at the end of the trace. However, if we do a recursive make_fx inside tracing, as occurs with torch.cond, we end up seeing that there are pending unbacked symbols that aren't associated with a binding, even though it's spurious (they've leaked into the inner make_fx call from the outer AOTAutograd analysis).

In https://github.com/pytorch/pytorch/pull/133588 I tried to just prevent adding the symbols to the pending list at all in the first place. But this itself caused some problems which were fixed in https://github.com/pytorch/pytorch/pull/124785 . The problem fixed in that PR is that when we allocate tangents that have unbacked size, something prevented them from having correct unbacked SymInts when ignore fresh unbacked SymInts was enabled. So I had patched it at the time by just not suppressing pending symbols and clearing them out some other way.

I think... I was wrong in that PR? That is to say, it was OK to avoid putting the fresh unbacked symbols in the pending list; the real problem was suppressing unbacked renamings. But there doesn't seem to be a good reason to suppress these; this PR shows that it doesn't actually fail any tests if you do these anyway. Intuitively, this makes sense, because you can't trigger renamings unless you're actually adding unbacked symbols to the pending set.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135053
Approved by: https://github.com/ydwu4
2024-09-06 13:13:15 +00:00
b5dea061c8 check compilation status before query cudnn version in conv (#135332)
This PR fixes https://github.com/pytorch/pytorch/issues/135322. The cudnn compilation status should be checked before querying the version; otherwise, conv may trigger a RuntimeError before any check in other non-cuda backends.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135332
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-09-06 12:50:04 +00:00
041960a1ce [Dynamo] Automatically in-graph traceable tensor subclass ctors (#135151)
Fixes https://github.com/pytorch/pytorch/issues/114389

Previously, dynamo would attempt to trace through the `__init__` of traceable tensor subclasses. Since their constructors are AOT dispatcher traceable by definition, dynamo should automatically put these in the graph like we do for any other tensors. Not doing this is difficult because dynamo would need to apply mutations post tensor subclass creation in the graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135151
Approved by: https://github.com/bdhirsh
2024-09-06 12:23:38 +00:00
67c7924ea1 [inductor] Fix gen_transposed_tile_load_store (#135307)
A recent PR, https://github.com/pytorch/pytorch/pull/131745, introduced new VLA (variable-length array) logic in the cpp codegen. It causes a build failure on MSVC with `Compiler Error C2131`: https://learn.microsoft.com/en-us/cpp/error-messages/compiler-errors-1/compiler-error-c2131?view=msvc-170

reproduce UT:
```cmd
pytest test\inductor\test_torchinductor_dynamic_shapes.py -v -k test_large_block_sizes_dynamic_shapes_cpu
```

Original generated code:
```c++
alignas(16) float tmp1[static_cast<int64_t>(((-256LL)*(c10::div_floor_integer(static_cast<int64_t>(ks1), static_cast<int64_t>(16LL)))) + (16LL*ks1))];
```

Changes:
Allocate a large-enough fixed-size buffer.

New generated code:
```c++
alignas(16) float tmp1[16*16];
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135307
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-06 10:44:08 +00:00
217ba7b2ab [Docs] Update FileCheck doc (#135199)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135199
Approved by: https://github.com/soulitzer
2024-09-06 08:18:38 +00:00
758d515d98 [Inductor][CPP] Select tiling factor for lower precision data types (#133830)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133830
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-06 08:12:37 +00:00
60d98b4cfb Update torch-xpu-ops pin (ATen XPU implementation) (#135300)
Release cycle for PyTorch 2.5
1. Bugfixing: correct reduction logic in cdist kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135300
Approved by: https://github.com/EikanWang
2024-09-06 07:30:09 +00:00
590a3e9f8a [export][training ir migration] quantized_decomposed.quantize_per_tensor decomposition (#134525)
Summary:
In the graph of the TestXNNPACKQuantizer.test_dynamic_linear_with_conv test, some quantized_decomposed.quantize_per_tensor.default ops are becoming quantized_decomposed.dequantize_per_tensor.tensor ops when using the new training IR.

This is because we lift params/buffers before calling make_fx. Previously, for the graph that's passed to make_fx, `graph.L__self___linear1.weight` was a tensor;
now, in the training IR, `graph.L__self___linear1.weight` is a FakeTensor. This caused the node overload to be different.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_dynamic_linear_with_conv
```

Differential Revision: D61364547

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134525
Approved by: https://github.com/tugsbayasgalan, https://github.com/jerryzh168
2024-09-06 07:06:06 +00:00
764ee6e3f9 [FlexAttention] Specify padding_value for boundary checked loads (#134573)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134573
Approved by: https://github.com/Chillee
2024-09-06 06:47:26 +00:00
67f98a99a4 [DeviceMesh][Easy] Make RuntimeError a bit more descriptive by including the actual world_size (#135271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135271
Approved by: https://github.com/fduwjj
2024-09-06 06:23:20 +00:00
e020a8755a [Fix][FR][ez] Remove debugging logs (#135308)
Removing the print added during debugging process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135308
Approved by: https://github.com/wz337
2024-09-06 06:14:33 +00:00
7ffb3b201c [inductor] Remove LoopBody.reads,writes,other (#135256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135256
Approved by: https://github.com/oulgen
ghstack dependencies: #135070, #135076, #135082, #135084, #135079, #135235
2024-09-06 06:11:55 +00:00
f946bf88c4 [inductor] Skip retracing an existing LoopBody (#135235)
This is roughly a 7% speedup in inductor compile time for hf_Bert_large.  The time spent in `LoopBody.__init__` improves from 15% to 8% of `fx_codegen_and_compile`.

Before
![image](https://github.com/user-attachments/assets/7de0f28e-35bd-472f-b4be-b52733d2a85c)

After
![image](https://github.com/user-attachments/assets/5f0cf11a-43c5-43ae-b13c-f32383a75a7f)

Overall
![image](https://github.com/user-attachments/assets/6a369d8c-fb5e-4ad2-9504-0fc745ad6568)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135235
Approved by: https://github.com/oulgen
ghstack dependencies: #135070, #135076, #135082, #135084, #135079
2024-09-06 06:11:55 +00:00
66da3b3b2a [fx] Bypass custom __setattr__ in Node.__init__ (#135079)
Before:
![image](https://github.com/user-attachments/assets/5f0a6ae6-6049-44d0-b5f2-a549a23ad97f)

After:
![image](https://github.com/user-attachments/assets/51c9f91b-f8a0-4043-8362-65813feec823)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135079
Approved by: https://github.com/oulgen
ghstack dependencies: #135070, #135076, #135082, #135084
2024-09-06 06:11:46 +00:00
41e653456e [RDP] Fix "No module named 'libfb'" (#135244)
Summary:
D62215095 introduced an import error in arvr pipelines because the is_fbcode() function does not work as intended.

This changes is_fbcode() to be a much stricter check.

Test Plan:
```
buck2 run arvr/mode/platform010/opt-stripped //arvr/libraries/depthlink/clients/mr_replay:pipeline_runner -c bolt.use_eva3_sim=True -- --config_file arvr/libraries/depthlink/clients/mr_replay/configs/runner_config.yaml --features DEPTH
```

Differential Revision: D62237502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135244
Approved by: https://github.com/aorenste
2024-09-06 04:52:31 +00:00
e40a0a9359 Add randomness checking for sdpa vmap (#135176)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135176
Approved by: https://github.com/zou3519
2024-09-06 04:50:49 +00:00
c05a7adb36 [inductor][debug] fix draw_buffers (#135266)
**Before:**
![image](https://github.com/user-attachments/assets/aac756f3-1349-4647-9da3-87cf105cf647)

**After:**
<img width="791" alt="image" src="https://github.com/user-attachments/assets/d72c663c-e598-42fa-ac40-9e58956f1ec1">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135266
Approved by: https://github.com/yf225
2024-09-06 04:12:41 +00:00
5f57be7571 [Distributed] Change function call in test to non-deprecated to eliminate warning (#134938)
Migrate the function calls in the test to eliminate the warning message below and reduce the chance of test failures when the deprecated methods are removed:

-  change the deprecated `save_state_dict` to `save`
-  change the deprecated `load_state_dict` to `load`

Warning message:
```bash
pytorch/test/distributed/checkpoint/test_fsdp_model_state.py:37: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134938
Approved by: https://github.com/wz337, https://github.com/fegin
2024-09-06 03:25:09 +00:00
29d72c1100 [inductor] check intel compiler minimal version (#135209)
On Windows, early versions of icx have a `-print-file-name` issue and can't preload correctly for inductor. Add a minimal version check for the Intel compiler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135209
Approved by: https://github.com/ezyang
2024-09-06 03:21:07 +00:00
3b1a334c0f [Inductor][CPP] Avoid mistake wgt tensor delete (#135100)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/134998: Previously, we only checked if the `get_attr` FX node for the weight had a single user node. However, two `get_attr` nodes may share the same tensor and should not be deleted in such cases. In this PR, we add the count of users for tensor along with the num of users for nodes to decide whether this tensor can be deleted or not.
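
The check described above can be sketched in plain Python. The helper `deletable_weights` and its tuple-based input are hypothetical simplifications; the real pass operates on FX `get_attr` nodes.

```python
from collections import Counter


def deletable_weights(get_attr_nodes):
    """Return names of get_attr nodes whose tensor is safe to delete.

    `get_attr_nodes` is a list of (node_name, tensor, num_node_users) tuples.
    """
    # Count how many get_attr nodes reference each tensor object.
    tensor_refs = Counter(id(tensor) for _, tensor, _ in get_attr_nodes)
    return [
        name
        for name, tensor, num_users in get_attr_nodes
        # Require a single node user AND a single node referencing the tensor.
        if num_users == 1 and tensor_refs[id(tensor)] == 1
    ]
```

Checking only the per-node user count would mark a shared tensor as deletable, which is the case the issue reports.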

**TestPlan**
```
 python test/inductor/test_cpu_select_algorithm.py -k test_linear_wgt_multi_users
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135100
Approved by: https://github.com/jgong5
2024-09-06 03:13:36 +00:00
07689a38bf [Inductor] Fix AOT weight alignment issue on CPU (#135205)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/135027. On CPU, the `consts_size` used to generate `_binary_constants_bin_start` is not padded to `ALIGN_BYTES`, while `serialized_weights` is, causing a failure in the 16K alignment check.
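
A minimal sketch of the padding arithmetic involved, assuming a hypothetical helper `align_up`. The alignment values used with it are illustrative; the actual value of `ALIGN_BYTES` is not stated here.

```python
def align_up(size: int, alignment: int) -> int:
    """Round `size` up to the next multiple of `alignment`."""
    return (size + alignment - 1) // alignment * alignment
```

If the size used to compute a start offset is left unpadded while the serialized data is padded, the two disagree whenever the raw size is not already a multiple of the alignment.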

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135205
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-09-06 03:06:51 +00:00
06a7dc21c1 Remove dead expect_rational (#135105)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135105
Approved by: https://github.com/malfet
2024-09-06 02:57:27 +00:00
d9a18173fa Report qualname of exception type rather than <class 'RuntimeError'> (#135146)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
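
For illustration, the difference between formatting the type object and formatting its qualified name can be sketched as below. The helper `describe_exception_type` is hypothetical, and whether the actual change includes the module prefix is an assumption.

```python
def describe_exception_type(exc: BaseException) -> str:
    """Return a readable, qualified name for the exception's type."""
    t = type(exc)
    if t.__module__ == "builtins":
        # Built-in exceptions read best without a module prefix.
        return t.__qualname__
    return f"{t.__module__}.{t.__qualname__}"
```

Formatting `type(exc)` directly yields the `<class '...'>` wrapper, while `__qualname__` yields just the name.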

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135146
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/yanboliang
ghstack dependencies: #135148, #135145
2024-09-06 02:56:50 +00:00
d8543e3162 Include exception type qualname when rewrapping InternalTorchDynamoError (#135145)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135145
Approved by: https://github.com/drisspg, https://github.com/anijain2305
ghstack dependencies: #135148
2024-09-06 02:56:50 +00:00
ad01fc194d Consolidate raise and rewrap raise error branches (#135148)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135148
Approved by: https://github.com/anijain2305, https://github.com/albanD, https://github.com/yanboliang, https://github.com/malfet
2024-09-06 02:56:46 +00:00
e162414963 add instrumentation of CCA stats for reserved and allocated memory size (#135231)
As titled
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135231
Approved by: https://github.com/c-p-i-o
2024-09-06 02:48:56 +00:00
9e5a797771 Improve test_public_bindings import module error reporting (#135258)
Error was hard to understand without message. Render it now. See https://github.com/pytorch/pytorch/pull/135259 for it in action.

Example failure:

```
2024-09-05T20:04:45.3022000Z FAILED [5.9524s] test_public_bindings.py::TestPublicBindings::test_modules_can_be_imported - AssertionError: String comparison failed: '' != "torch._logging.scribe failed to import w[112 chars].py)"
2024-09-05T20:04:45.3025413Z + torch._logging.scribe failed to import with error ImportError: cannot import name 'TypeAlias' from 'typing' (/opt/conda/envs/py_3.9/lib/python3.9/typing.py)
2024-09-05T20:04:45.3026990Z
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135258
Approved by: https://github.com/albanD
2024-09-06 02:40:03 +00:00
b46a1b9e2d Use Python 3.9 on all libtorch jobs (#135245)
Part of the migration py3.8->3.9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135245
Approved by: https://github.com/izaitsevfb
2024-09-06 02:27:22 +00:00
9688014820 aarch64: extend matmul heuristic checks to all neoverse platforms (#134548)
For aarch64 Neoverse platforms, there are two GEMM backends available
for the matmul operator in PyTorch: (1) Arm Compute Library and (2) OpenBLAS.
While Arm Compute Library provides better performance than OpenBLAS,
it has overhead in kernel launch time, and hence we use OpenBLAS
for smaller tensor compute. The heuristic was originally implemented for
neoverse_v1. This commit extends the heuristic to other Neoverse platforms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134548
Approved by: https://github.com/malfet
2024-09-06 01:40:50 +00:00
8f6e73f068 [ONNX] Enable experimental exporter logic to dynamo_export and support refine dynamic_shapes (#134976)
(1) Enable experimental exporter logic to dynamo_export
(2) Refine dynamic shapes and retry export in export strategies
(3) Delete `torch_export_graph_extractor` and use the new export logic
(4) Disable ExportedProgram test in `test_fx_onnx_with_onnxruntime.py`, as ONNXProgram is different now.

Fixes https://github.com/pytorch/pytorch/issues/126479
Fixes #135183
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134976
Approved by: https://github.com/justinchuby
2024-09-06 01:29:56 +00:00
1e57ef08fa [AOTI] Support MKLDNN qconv ops in cpp wrapper (#134795)
Summary: Similar to https://github.com/pytorch/pytorch/pull/134475, support qconv in the ABI-compatible mode for cpp-wrapper Inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134795
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/angelayi
ghstack dependencies: #134475, #134783
2024-09-06 01:01:53 +00:00
614b86d602 [AOTI] Support MKLDNN qlinear ops in cpp wrapper (#134783)
Summary: Similar to https://github.com/pytorch/pytorch/pull/134475, support qlinear in the ABI-compatible mode for cpp-wrapper Inductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134783
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/angelayi
ghstack dependencies: #134475
2024-09-06 01:01:53 +00:00
0b96dfb736 [AOTI] Support MKLDNN conv ops in cpp wrapper (#134475)
Summary: Partially fix https://github.com/pytorch/pytorch/issues/123040. In the ABI-compatible mode, MKLDNN fallback ops do not have C shim implementations and thus need to go through the custom ops launch path. Other MKLDNN ops will be fixed in following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134475
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/angelayi
2024-09-06 01:01:53 +00:00
62b221d5cc Add Percentages to Function Events (#135155)
Summary: Users have recently asked that the profiler add self/total CPU and device percentages to FunctionEvents so that teams can process the data procedurally. Some of it could be done mathematically via subroutines, but since we already have the information in `_build_table`, let's build it there.
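
A rough sketch of how such percentages could be derived from per-event times. The dictionary keys and the helper `add_percentages` are hypothetical, and using the sum of self times as the denominator is an assumption rather than a description of the actual implementation.

```python
def add_percentages(events):
    """Attach self/total CPU time percentages to each event dict.

    `events` is a list of dicts with hypothetical keys
    `self_cpu_time` and `cpu_time_total` (same time unit).
    """
    # Self times do not overlap, so their sum serves as the denominator.
    total_self = sum(e["self_cpu_time"] for e in events)
    for e in events:
        if total_self == 0:
            e["self_cpu_percent"] = 0.0
            e["total_cpu_percent"] = 0.0
        else:
            e["self_cpu_percent"] = 100.0 * e["self_cpu_time"] / total_self
            e["total_cpu_percent"] = 100.0 * e["cpu_time_total"] / total_self
    return events
```

Storing the values on the events, rather than only rendering them in a table, is what makes them available for programmatic processing.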

Test Plan: Check that we have the same table as before but also check that the parameters we check also have the expected values

Differential Revision: D62210351

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135155
Approved by: https://github.com/shanw-meta, https://github.com/kit1980
2024-09-06 00:39:11 +00:00
66dd4577b1 Track base of FunctionalTensor in inference mode. (#135141)
The idea behind the tracking is the following: whenever we see a tensor, if the tensor is a root tensor (it does not have any view metas), then we consider it the base of all the tensors that share its storage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135141
Approved by: https://github.com/zou3519
2024-09-06 00:10:25 +00:00
cyy
cc28634172 [Submodule] Bump pybind11 to v2.13.5 (#135202)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135202
Approved by: https://github.com/Skylion007
2024-09-06 00:09:00 +00:00
c83cdf068b [DTensor] Fix view op replicating on tensor dim when the size of the tensor dim = 1 (#135054)
We found a corner case that when a tensor dimension is 1, calling `view(1)` would result in an unexpected replication (see case 1 below). When the tensor dimension to shard is not 1, no matter whether the tensor dimension is evenly-shardable across the mesh dimension, it won't cause an implicit replication behind the scenes if view doesn't change the size of the given tensor dimension (see case 2 and 3).

When the tensor dimension to shard is of size 1, it is not being added to shardable_dims here:
https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/ops/_view_ops.py#L518

```
import torch
from torch.distributed._tensor import Shard, distribute_tensor
from torch.distributed.device_mesh import init_device_mesh

# NOTE: needs a 2-rank distributed launch (e.g. torchrun) so the (2,) mesh can be built.
mesh = init_device_mesh("cuda", (2,))

# uneven case where the size of the tensor dimension to shard is 1
p = torch.randn(1, 2)
dtensor = distribute_tensor(p, mesh, [Shard(0)])
t = dtensor.view(1, 2)
# this would result in replication, meaning t is now replicated across all ranks.

# uneven case where the size of the tensor dimension to shard is not 1
p = torch.randn(3, 2)
dtensor = distribute_tensor(p, mesh, [Shard(0)])
t = dtensor.view(3, 2)
# this would not result in replication, meaning t stays as sharded.

# even case
p = torch.randn(2, 2)
dtensor = distribute_tensor(p, mesh, [Shard(0)])
t = dtensor.view(2, 2)
# this would not result in replication, meaning t stays as sharded.
```

Differential Revision: [D62155606](https://our.internmc.facebook.com/intern/diff/D62155606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135054
Approved by: https://github.com/tianyu-l, https://github.com/wanchaol
2024-09-06 00:03:54 +00:00
28ccfba248 [ONNX] Delete ONNXProgramSerializer (#135261)
Fixes #135182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135261
Approved by: https://github.com/justinchuby
2024-09-05 23:52:51 +00:00
b2386bdca1 [debug] Add helper to run cProfile on a function (#135084)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135084
Approved by: https://github.com/oulgen
ghstack dependencies: #135070, #135076, #135082
2024-09-05 23:41:30 +00:00
bdfc8d9f96 [fx] Don't use generators in map_aggregate (#135082)
While the generators avoid a copy, they are slow.

Before:
![image](https://github.com/user-attachments/assets/70a55a9a-0595-4105-b0ab-22cf77c7409c)

After:
![image](https://github.com/user-attachments/assets/cecb9c59-ae36-47de-8b08-cab2c7cb3d57)
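
A minimal sketch of the idea, not the actual `torch.fx` code: when rebuilding a container, materializing the children with a list comprehension is typically faster in CPython than feeding a generator expression to `tuple()`, at the cost of one temporary list. `map_aggregate_sketch` is a hypothetical, simplified stand-in that handles only tuples, lists, and dicts.

```python
from typing import Any, Callable


def map_aggregate_sketch(a: Any, fn: Callable[[Any], Any]) -> Any:
    """Apply fn to every leaf of a nested tuple/list/dict structure."""
    if isinstance(a, tuple):
        # List comprehension instead of tuple(<generator>): one extra
        # temporary list, but no per-item generator resumption overhead.
        items = [map_aggregate_sketch(elem, fn) for elem in a]
        # Namedtuples are rebuilt from positional arguments.
        return type(a)(*items) if hasattr(a, "_fields") else tuple(items)
    if isinstance(a, list):
        return [map_aggregate_sketch(elem, fn) for elem in a]
    if isinstance(a, dict):
        return {k: map_aggregate_sketch(v, fn) for k, v in a.items()}
    # Anything else is treated as a leaf.
    return fn(a)


nested = (1, [2, 3], {"k": (4, 5)})
doubled = map_aggregate_sketch(nested, lambda x: x * 2)
```

Whether the trade-off pays off depends on container sizes, so it is worth profiling on a real workload as the PR does.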

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135082
Approved by: https://github.com/oulgen
ghstack dependencies: #135070, #135076
2024-09-05 23:41:30 +00:00
70779dded8 [fx] Compile time optimization in Node.__update_args_kwargs (#135076)
Before this we took two passes over all of the args.

Before:
![image](https://github.com/user-attachments/assets/24ce5628-03f4-4983-9f2d-5ddf0ca5816e)

After:
![image](https://github.com/user-attachments/assets/c9681aa2-32f0-4f6b-a598-fc6f90ffafb5)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135076
Approved by: https://github.com/Chillee
ghstack dependencies: #135070
2024-09-05 23:41:30 +00:00
ea231300d1 [inductor] Improve compile time regression from MemoryDep.normalize (#135070)
Possible fix for #135056

Before
![image](https://github.com/user-attachments/assets/3962cb85-e808-4fd4-991f-471ff5ef7eae)

After
![image](https://github.com/user-attachments/assets/2322d48d-6518-4518-baca-336027b5cda8)

Measured based on:
```
python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --training --only hf_Bert_large --stats -n1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135070
Approved by: https://github.com/Chillee
2024-09-05 23:41:30 +00:00
8f66995459 Revert "Support rolling over a percentage of workflows (#134816)"
This reverts commit fc890b55b51098437b6149abf1026a8b2aaee389.

Reverted https://github.com/pytorch/pytorch/pull/134816 on behalf of https://github.com/malfet due to Causes lint to intermittently fail ([comment](https://github.com/pytorch/pytorch/pull/134816#issuecomment-2332902609))
2024-09-05 23:39:41 +00:00
144fde4fd2 [MPS] Add support for autocast in MPS (#99272)
Fixes https://github.com/pytorch/pytorch/issues/88415

Need to run inductor/test_cpu_select_algorithm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99272
Approved by: https://github.com/malfet

Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Roy Hvaara <roy@lightyear.no>
2024-09-05 23:23:17 +00:00
43f4947d44 fix fake tensor tolist implementation (#135131)
Summary:
When exporting for training with `tolist`, we do not hit `FunctionalTensor.tolist` since we do not functionalize. Unfortunately, this means we hit `FakeTensor.tolist`, which creates unbacked symints that are not backed by proxies.

Rather than trying to patch up this low-level implementation, we replace it with essentially what `FunctionalTensor.tolist` does, which is higher-level: we essentially desugar to `item()` calls and let it take care of unbacked symints.

Test Plan:
Some expected failures are gone now.
Also found a test for `tolist` that was written when `FunctionalTensor.tolist` was implemented but not really doing much; repurposed it now to exercise more modes.

Differential Revision: D62197742

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135131
Approved by: https://github.com/ezyang
2024-09-05 23:20:31 +00:00
65e1c34061 [rfc] scuba for flight recorder (#134794)
Summary: Record flight recorder status in a scuba table.

Test Plan: Testing with timing out a job. Will post results soon.

Differential Revision: D61729221

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134794
Approved by: https://github.com/fduwjj
2024-09-05 23:18:10 +00:00
830247c355 [Intel Triton] Update Intel Triton to release/2.5.0 (#134074)
This PR relands https://github.com/pytorch/pytorch/pull/134053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134074
Approved by: https://github.com/EikanWang
2024-09-05 22:46:31 +00:00
4262755b5a [cond] fix typo in cond codegen (#134708)
As titled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134708
Approved by: https://github.com/jansel
2024-09-05 22:38:24 +00:00
3825607144 Add torch._logging.scribe (#135224)
See https://github.com/pytorch/pytorch/pull/135138 for a usage example. Meta only, see https://docs.google.com/document/d/1JpbAQvRhTmuxjnKKjT7qq57dsnV84nxSLpWJo1abJuE/edit#heading=h.9wi46k7np6xw for context

fbscribelogger is a library that allows us to write to scribe, which is Meta's logging infrastructure, when you have an appropriate access token (this token is available for jobs running on main, as well as authorized jobs with the ci-scribe label). The resulting data is accessible via Scuba (a real time in-memory database) and Hive (a more traditional SQL persisted database).

Here's the motivating use case. Suppose there is somewhere in PyTorch's codebase where you'd like to log an event, and then you'd like to find all the situations where this log is called. If PyTorch is rolled out to our internal users, we have some FB-oriented APIs (like torch._utils_internal.signpost_event) with which you can do this. But you have to actually land your PR to main, wait for it to be ingested to fbcode, and then wait for us to actually roll out this version, before you get any data. But what if you want the results within the next few hours? Instead, you can use torch._logging.scribe to directly write to our logging infrastructure *from inside CI jobs.* The most convenient approach is to log unstructured JSON blobs to `open_source_signpost` (added in this PR; you can also add your own dedicated table as described in the GDoc above). After adding logging code to your code, you can push your PR to CI, add the 'ci-scribe' label, and in a few hours view the results in Scuba, e.g., (Meta-only) https://fburl.com/scuba/torch_open_source_signpost/z2mq8o4l If you want continuous logging on all commits on master, you can land your PR and you will continuously get logging for all CI runs that happen on main.

Eventually, if your dataset is important enough, you can consider collaborating with PyTorch Dev Infra to get the data collected in our public AWS cloud so that OSS users can view it without needing Meta-internal access. But this facility is really good for prototyping / one-off experiments. It's entirely self serve: just add your logging, run your PR CI with ci-scribe, get results, do analysis in Scuba.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135224
Approved by: https://github.com/Skylion007
2024-09-05 22:37:13 +00:00
eqy
3c8f71ff93 [cuDNN][64-bit indexing] cuDNN v9.3+ supports non-batch-splittable convolutions with > 2**31 elements (#134890)
For longstanding issues such as #95024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134890
Approved by: https://github.com/Skylion007
2024-09-05 22:22:45 +00:00
fc890b55b5 Support rolling over a percentage of workflows (#134816)
In order to support adding a rollover percentage, this ended up being a complete rewrite of runner_determinator.py.

Details of the new format are in the comments up top.

On the plus side, this now includes some unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134816
Approved by: https://github.com/PaliC, https://github.com/zxiiro
2024-09-05 22:21:45 +00:00
058a69d91a [fbcode][dynamo] Turn on guard_nn_modules using justknobs_check (#134928)
As Title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134928
Approved by: https://github.com/ezyang
2024-09-05 22:05:54 +00:00
6c5920d515 Tune int8 AMX WoQ micro-kernel for CPU (#134832)
This patch prevents performance regression against the default ATen implementation for LLaMA 3.1 int8 GPTQ WoQ workload.

Uses AMX micro-kernel only if `M` >= `block_m`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134832
Approved by: https://github.com/jgong5
2024-09-05 22:01:14 +00:00
116fd474da [export] Expand coverage to more copied sym ops for unflattener. (#135119)
Test Plan:
buck2 test 'fbcode//mode/opt' fbcode//torchrec/ir/tests:test_serializer -- --run-disabled

```
File changed: fbcode//caffe2/torch/export/unflatten.py
Buck UI: https://www.internalfb.com/buck2/2e0377e7-e2b6-4bd0-8133-a787245165a0
Test UI: https://www.internalfb.com/intern/testinfra/testrun/5066549824883887
Network: Up: 0B  Down: 0B
Jobs completed: 16. Time elapsed: 10.2s.
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D62190172

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135119
Approved by: https://github.com/yushangdi
2024-09-05 21:58:20 +00:00
a5d70cf545 [PyTorch] Add isfinite to BFloat16-math.h (#135052)
Missing function from <cmath>.

Differential Revision: [D62148884](https://our.internmc.facebook.com/intern/diff/D62148884/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135052
Approved by: https://github.com/PaliC, https://github.com/albanD
ghstack dependencies: #135031
2024-09-05 21:50:36 +00:00
7fe819d917 [PyTorch] Fix -Wshadow -Werror build in BFloat16-inl.h (#135031)
`float_t` is required to exist in C99 math.h, which causes -Wshadow to fire. We don't need the alias, fortunately.

Differential Revision: [D62135908](https://our.internmc.facebook.com/intern/diff/D62135908/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135031
Approved by: https://github.com/albanD
2024-09-05 21:48:21 +00:00
f63571060c Revert "Use actions/upload-artifact@v4.4.0 for rest of workflows (#135264)"
This reverts commit 9c0b03020b7204ca5d5dbe18174bab005f79c47b.

Reverted https://github.com/pytorch/pytorch/pull/135264 on behalf of https://github.com/atalman due to broke CI ([comment](https://github.com/pytorch/pytorch/pull/135264#issuecomment-2332674607))
2024-09-05 21:43:05 +00:00
38fead8f7c [hop] preserve metadata in re-tracing hop subgraph by running with interpreter (#135159)
In this way, the interpreter.run can preserve the current metadata of subgraphs correctly when tracing the subgraphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135159
Approved by: https://github.com/tugsbayasgalan
2024-09-05 21:36:56 +00:00
24a223c49d Run inductor micro benchmark on x86 metal runner (#135042)
This enables inductor micro benchmark on CPU (x86):

* Running on AWS metal runner for more accurate benchmark
* I add a new `arch` column, which will be either x86_64 or arm64 for CPU, or the GPU name for GPU. We can use this later to differentiate between different setups, e.g. cuda (a100) vs cuda (a10g) or cpu (x86_64) vs cpu (arm64)

The next step would be to run this on cpu arm64, and cuda (a10g).

### Testing
Here are the CSV results from my test run https://github.com/pytorch/pytorch/actions/runs/10709344180

```
name,metric,target,actual,dtype,device,arch,is_model
mlp_layer_norm_gelu,flops_utilization,0.8,17.36,bfloat16,cpu,x86_64,False
gather_gemv,memory_bandwidth(GB/s),990,170.80,int8,cpu,x86_64,False
gather_gemv,memory_bandwidth(GB/s),1060,204.78,bfloat16,cpu,x86_64,False
Mixtral-8x7B-v0.1,token_per_sec,175,26.68,int8,cpu,x86_64,True
Mixtral-8x7B-v0.1,memory_bandwidth(GB/s),1130,171.91,int8,cpu,x86_64,True
Mixtral-8x7B-v0.1,compilation_time(s),162,47.36,int8,cpu,x86_64,True
gemv,memory_bandwidth(GB/s),870,236.36,int8,cpu,x86_64,False
gemv,memory_bandwidth(GB/s),990,305.71,bfloat16,cpu,x86_64,False
Llama-2-7b-chat-hf,token_per_sec,94,14.01,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),1253,185.18,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,compilation_time(s),162,74.99,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,token_per_sec,144,25.09,int8,cpu,x86_64,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),957,165.83,int8,cpu,x86_64,True
Llama-2-7b-chat-hf,compilation_time(s),172,70.69,int8,cpu,x86_64,True
layer_norm,memory_bandwidth(GB/s),950,172.03,bfloat16,cpu,x86_64,False
```
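
As a rough illustration of how the `arch` column could be used to tell setups apart, the sketch below groups rows by `(device, arch)`. It uses only the standard library and three rows copied from the results above; `group_by_setup` is a hypothetical helper, not part of the benchmark scripts.

```python
import csv
import io
from collections import defaultdict

# Three rows copied from the results above; a real run would read the CSV file.
CSV_TEXT = """\
name,metric,target,actual,dtype,device,arch,is_model
gemv,memory_bandwidth(GB/s),870,236.36,int8,cpu,x86_64,False
gemv,memory_bandwidth(GB/s),990,305.71,bfloat16,cpu,x86_64,False
layer_norm,memory_bandwidth(GB/s),950,172.03,bfloat16,cpu,x86_64,False
"""


def group_by_setup(text):
    """Group benchmark rows by (device, arch)."""
    groups = defaultdict(list)
    for row in csv.DictReader(io.StringIO(text)):
        key = (row["device"], row["arch"])
        groups[key].append((row["name"], row["dtype"], float(row["actual"])))
    return dict(groups)


groups = group_by_setup(CSV_TEXT)
```

Once arm64 or a10g rows exist, they would land under their own keys, so comparisons never mix hardware.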
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135042
Approved by: https://github.com/yanboliang
2024-09-05 21:31:36 +00:00
e4920a1364 [Traceable FSDP2][Dynamo] allow tracing through auto_functionalized HOP (#135169)
If an `auto_functionalized` HOP is included in backward graph due to activation checkpointing, we will run into a scenario where Compiled Autograd Dynamo tracing will need to trace through the `auto_functionalized` HOP. This PR adds support for it.

Test commands:
- `pytest -rA test/inductor/test_compiled_autograd.py::TestCompiledAutograd::test_trace_auto_functionalized`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135169
Approved by: https://github.com/zou3519
2024-09-05 21:22:45 +00:00
bc5ecf83d7 [training ir migration] Fix quantization tests (#135184)
Summary:
Fixed some quantization tests for new training ir:

Fix batch norm node pattern matcher. In training ir, we have `aten.batch_norm` node instead of `aten._native_batch_norm_legit` and `aten._native_batch_norm_legit_no_training`.

Test Plan:
```
buck run fbcode//mode/dev-nosan fbcode//caffe2/test:quantization_pt2e
```

Reviewed By: tugsbayasgalan

Differential Revision: D62209819

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135184
Approved by: https://github.com/tugsbayasgalan
2024-09-05 21:19:28 +00:00
e55c0f59e5 Revert "[Reland] Refactor caching device allocator utils (#130923)"
This reverts commit 9809080b9ed657a8c0ea0383be7cbdce3a26e05e.

Reverted https://github.com/pytorch/pytorch/pull/130923 on behalf of https://github.com/kit1980 due to breaking internal builds - Error: Relocation overflow has occured ([comment](https://github.com/pytorch/pytorch/pull/130923#issuecomment-2332640961))
2024-09-05 21:16:14 +00:00
a4cf9653ee Revert "Remove Caffe2 code from tool scripts (#134941)"
This reverts commit c818ecd1698a28d9fadf4a81453a89914b18374a.

Reverted https://github.com/pytorch/pytorch/pull/134941 on behalf of https://github.com/kit1980 due to breaking internal builds - The path `caffe2/operators/hip/gather_op.cuh` does not exist ([comment](https://github.com/pytorch/pytorch/pull/134941#issuecomment-2332636624))
2024-09-05 21:12:54 +00:00
9c0b03020b Use actions/upload-artifact@v4.4.0 for rest of workflows (#135264)
To be consistent with https://github.com/pytorch/pytorch/pull/135263 and rest of workflows. Use v4.4.0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135264
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-09-05 21:05:06 +00:00
034717a029 [ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133438
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2024-09-05 20:36:45 +00:00
9c38b00999 [export] Add ability to run eagerly on UnflattenedModule (#133996)
Summary:
Added the contextmanager, `_disable_interpreter`, which is meant to be put around a call to `unflatten`. This will generate an UnflattenedModule and sub-InterpreterModules which will not use torch.fx.Interpreter to run eagerly. We want to have this as a state of the module instead of a contextmanager around running the module because it's not clear where we are calling the unflattened module.

This seems to improve the performance: https://fb.workplace.com/groups/1075192433118967/posts/1473590629945810/?comment_id=1473621763276030

Test Plan: CI

Differential Revision: D60939034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133996
Approved by: https://github.com/pianpwk
2024-09-05 20:28:42 +00:00
8efe547046 Use actions/upload-artifact@v4.4.0 for triton builds (#135263)
Same as: https://github.com/pytorch/pytorch/pull/135139
Fixes upload failure: https://github.com/pytorch/pytorch/actions/runs/10722567217/job/29748125015
fix regression introduced by https://github.com/pytorch/pytorch/pull/135068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135263
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-09-05 20:03:39 +00:00
82d00acfee Allow cross-device copies for cpu scalars in refs (#135140)
This copies our eager-mode behavior where someone can do torch.add(a, b, out=c)
where a and b are CPU scalar tensors and c is a CUDA tensor.

Fixes https://github.com/pytorch/pytorch/issues/121619 by side effect (we get into a situation where we're writing a CPU scalar into a FakeTensor that is actually a meta tensor)

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135140
Approved by: https://github.com/williamwen42, https://github.com/yanboliang
2024-09-05 19:08:48 +00:00
098431a29d Update Resize.cpp with new device type (#135117)
Update Resize.cpp with new device type

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135117
Approved by: https://github.com/egienvalue
2024-09-05 18:53:13 +00:00
be660ea2d3 [PT2] Directly set meta.val in group_batch_fusion_aten (#135078)
Summary: instead of using FakeTensorProp after the pass

Differential Revision: D62162640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135078
Approved by: https://github.com/frank-wei
2024-09-05 18:17:06 +00:00
52c7c89ea4 [Inductor][CPP] Leverage full bits for BF16/FP16 vectorization (#126502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126502
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-05 17:17:46 +00:00
1efd341d15 [fake_tensor] Move unrecognized_type NotImplemented before ConstProp (#135033)
We should not try to do ConstProp on the unrecognized types (e.g. Subclasses).
For those types, throwing NotImplemented will jump to the next torch_dispatch.
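
The ordering can be pictured with a small pure-Python sketch. This is an analogy only, not the FakeTensor code: the handler names are hypothetical, and here handlers signal "not mine" by returning Python's `NotImplemented` sentinel, which the dispatcher uses to move on to the next handler. The point is that the type check happens before any constant-propagation work.

```python
def const_prop_handler(args):
    """Hypothetical handler that only understands ints and floats."""
    # Check for unrecognized types BEFORE attempting constant propagation.
    if any(not isinstance(a, (int, float)) for a in args):
        return NotImplemented
    return sum(args)  # stand-in for the constant-propagation result


def fallback_handler(args):
    """Hypothetical next handler in the chain; accepts anything."""
    return "handled by fallback"


def dispatch(handlers, args):
    """Try handlers in order, skipping those that return NotImplemented."""
    for handler in handlers:
        result = handler(args)
        if result is not NotImplemented:
            return result
    raise TypeError("no handler accepted the arguments")


known = dispatch([const_prop_handler, fallback_handler], [1, 2.5])
unknown = dispatch([const_prop_handler, fallback_handler], [1, "subclass"])
```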

Test:
```
 python test/functorch/test_aotdispatch.py -k test_aot_test_subclasses_with_tensor_factories
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135033
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2024-09-05 17:09:41 +00:00
a096f2899d Add torch.serialization.skip_data context manager (#134504)
## Semantic

The semantics are:
(1) By default, `torch.serialization.skip_data(materialize_fake_tensors=False)` will make `torch.save` skip writing storages (but reserve space for them in the checkpoint).

```python
import torch
import torch.nn as nn

sd = nn.Linear(3, 5).state_dict()
with torch.serialization.skip_data():
    torch.save(sd, 'foo.pt')
print(torch.load('foo.pt', weights_only=True))
```

(2) With `torch.serialization.skip_data(materialize_fake_tensors=True)`, if a FakeTensor is passed to `torch.save`, the pickler will treat it as being "materialized": space will be reserved in the checkpoint for the associated storage bytes, and when loading, the type will be Tensor instead of FakeTensor.

```python
import torch
import torch.nn as nn
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    m = nn.Linear(3, 5, dtype=torch.float16, device='cuda')

sd = m.state_dict()
with torch.serialization.skip_data(materialize_fake_tensors=True):
    torch.save(sd, 'bla.pt')
print(torch.load('bla.pt', weights_only=True))
# OrderedDict([('weight', tensor([[0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.]], device='cuda:0', dtype=torch.float16)), ('bias', tensor([0., 0., 0., 0., 0.], device='cuda:0', dtype=torch.float16))])

```

## Follow Ups

- [ ] `torch.load` semantic for skip_data context manager
- [ ] Mechanism for getting offsets of storages saved via this method (for writing in a separate pass)

Differential Revision: [D62238610](https://our.internmc.facebook.com/intern/diff/D62238610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134504
Approved by: https://github.com/albanD
2024-09-05 16:53:39 +00:00
dbeb8a1691 Render log filepaths that are not anchored in torch's directory in a reasonable way (#135165)
For example, if I do TORCH_LOGS=fbscribelogger I'll get:

```
I0904 17:59:07.567000 3672513 fbscribelogger/__init__.py:161] stop
```

instead of

```
I0904 12:46:15.332000 2930287 ../../../../../home/ezyang/local/a/pytorch-env/lib/python3.10/site-packages/fbscribelogger/__init__.py:161] stop
```
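
One way to get this kind of rendering is to express the path relative to whichever search root contains it. The sketch below is an assumption about the approach, not necessarily what this PR implements; `make_path_relative` is a hypothetical helper, and the explicit `search_paths` argument stands in for something like `sys.path`.

```python
import pathlib


def make_path_relative(abs_path, search_paths):
    """Return abs_path relative to the first search path that contains it."""
    path = pathlib.PurePosixPath(abs_path)
    for root in search_paths:
        try:
            return str(path.relative_to(root))
        except ValueError:
            # abs_path is not under this root; try the next one.
            continue
    # No root matched: fall back to the absolute path unchanged.
    return str(path)


site_packages = "/home/ezyang/local/a/pytorch-env/lib/python3.10/site-packages"
rendered = make_path_relative(
    site_packages + "/fbscribelogger/__init__.py",
    ["/some/other/root", site_packages],
)
```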

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135165
Approved by: https://github.com/Skylion007
2024-09-05 16:48:09 +00:00
b1f72e2984 Gradient scaler for DTensor (#132816)
Solve the request [here](https://github.com/pytorch/pytorch/issues/120003#issuecomment-2248805798).
Enable DTensor input in gradient scaler's APIs, especially on `.unscale_()`
Related dispatch strategy is added to accept DTensor input.

To let found_inf be reduced across devices, we add an allreduce at dispatch, applied to the args after the dispatch strategy and kernel.
Since `aten._amp_foreach_non_finite_check_and_unscale_.default` is an in-place op, grad_scale as arg[0] will be modified in place, so redesigning a strategy or refactoring the kernel would not help.

The test files cover the following under 1-d (dp) and 2-d (dp, tp) cases:
1. whether the non-inf values are unscaled
2. whether the DTensors on every device can find an inf even when it is not on their own device
3. if no inf is found, whether new parameters are generated
4. if an inf is found, whether the scale is updated
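
The cross-device behavior being tested can be simulated in plain Python. This is a sketch under stated assumptions, not the DTensor implementation: ranks are modeled as lists of floats, and the allreduce is assumed to be a MAX reduction of a per-rank `found_inf` flag. Both function names are hypothetical.

```python
import math


def unscale_and_check(grads, inv_scale):
    """Unscale local grads; return (unscaled, found_inf) for one rank."""
    unscaled = [g * inv_scale for g in grads]
    found_inf = 0.0 if all(math.isfinite(g) for g in unscaled) else 1.0
    return unscaled, found_inf


def simulate_unscale(per_rank_grads, inv_scale):
    """Simulate ranks unscaling shards, then a MAX all-reduce of found_inf."""
    results = [unscale_and_check(g, inv_scale) for g in per_rank_grads]
    # Stand-in for the all-reduce: every rank ends up with the global maximum.
    global_found_inf = max(found for _, found in results)
    return [(unscaled, global_found_inf) for unscaled, _ in results]


# Rank 0 holds only finite gradients; rank 1 holds an inf.
out = simulate_unscale([[2.0, 4.0], [float("inf"), 8.0]], inv_scale=0.5)
```

After the simulated reduction, rank 0 also sees `found_inf == 1.0` even though its own shard is finite, which is the property item 2 checks.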

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132816
Approved by: https://github.com/XilunWu, https://github.com/weifengpy, https://github.com/wanchaol
2024-09-05 16:44:32 +00:00
bb3c2408f4 [inductor][test] in test_unbacked_symints, replace inductor's skipCUDAIf with common device type's skipcudaif (#133936)
Differential Revision: D61506212

Use `skipCUDAIf` from `torch.testing._internal.common_device_type` if we create the test class with `instantiate_device_type_tests`.

`instantiate_device_type_tests` would make sure the class has attr device_type, which works with `skipCUDAIf` from `torch.testing._internal.common_device_type`.

Also skipping test_vertical_pointwise_reduction_fusion for cpu test class, since the test expects cuda.

FAILED [0.0026s] test/inductor/test_unbacked_symints.py::TestUnbackedSymintsCPU::test_vertical_pointwise_reduction_fusion_cpu - AttributeError: 'TestUnbackedSymintsCPU' object has no attribute 'device'

repro:
```
CUDA_VISIBLE_DEVICES="" pytest test/inductor/test_unbacked_symints.py -k cpu -v
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133936
Approved by: https://github.com/ColinPeppler, https://github.com/desertfire
2024-09-05 16:40:14 +00:00
2c99f17a32 Implement VariableTracker.python_type() (#134215)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134215
Approved by: https://github.com/amjames, https://github.com/jansel
2024-09-05 16:35:47 +00:00
0043dcd79e Switch torch pt2e xnnpack tests to use export_for_training (#134788)
Migrate all the callsites inside the pt2e XNNPACK tests to use export_for_training.

Differential Revision: D61994553

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134788
Approved by: https://github.com/mergennachin
2024-09-05 16:11:18 +00:00
2e2fb668fa Upgrade expecttest to 0.2.1 (#135136)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135136
Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/Skylion007
2024-09-05 16:05:35 +00:00
9d24f945ba [CI] Use larger instance for building triton whl (#135201)
When running CI jobs of "Build Triton Wheels", it failed due to the lack of resources. This PR uses a larger runner to avoid these issues.

The failure message is like:

```
Process completed with exit code 137.
```

Related running actions:
Failed actions: https://github.com/pytorch/pytorch/actions/runs/10714445036
Success actions: https://github.com/pytorch/pytorch/actions/runs/10716710830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135201
Approved by: https://github.com/chuanqi129, https://github.com/atalman
2024-09-05 14:36:23 +00:00
ecbd715363 [Intel GPU][Windows] Fix overriding default CMAKE_CXX_FLAGS (#135093)
The root cause is that `/EHsc` is part of the default `CMAKE_CXX_FLAGS` in CMake.
Fix to not override the default `CMAKE_CXX_FLAGS`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135093
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-09-05 12:52:43 +00:00
58f2477a26 [Dynamo] Support builtin function frozenset (#134563)
Support builtin function frozenset in dynamo

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134563
Approved by: https://github.com/anijain2305, https://github.com/EikanWang, https://github.com/jansel
2024-09-05 12:15:10 +00:00
43dcb4bb61 Revise CPU vectorization ISA support API (#135075)
Revising (mostly renaming) CPU vectorization ISA support API (non-frontend-user-facing). Also added AVX512_BF16 ISA detection API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135075
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/ezyang
2024-09-05 12:14:56 +00:00
50d1e37079 [AOTI] Fix a unbacked symint retrieve bug (#134670)
Summary: Fix https://github.com/pytorch/pytorch/issues/134081. When an unbacked symint is computed as the shape of a tensor from a tuple, generated C++ code needs to use std::get<> to extract the tensor.

Differential Revision: [D62142113](https://our.internmc.facebook.com/intern/diff/D62142113)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134670
Approved by: https://github.com/angelayi, https://github.com/22quinn, https://github.com/chenyang78
2024-09-05 11:34:14 +00:00
b99ef1a02e Update torch-xpu-ops pin (ATen XPU implementation) (#135185)
Release cycle for PyTorch 2.5
1. Update specific AOT targets for Windows. On Windows, AOT target list prefers Intel client GPUs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135185
Approved by: https://github.com/EikanWang
2024-09-05 10:05:23 +00:00
8a5c8e5db9 Update unbacked symints in masked_select more precisely (#134899)
## Summary
At the moment, the fake impl for `masked_select` simply sets the upper range to `sys.maxsize` (9223372036854775807, the max value for a signed int64) while updating its size-like SymInt if there are any SymInts in the original input tensor shape. This PR constrains the range more intelligently by using the upper ranges of each SymInt in the input tensor shape.

This solves an issue where a model being lowered to Executorch errors during memory planning because the memory allocated for `masked_select` ended up exceeding the 64-bit address space (`INT_MAX * size(dtype)`).
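
A simplified sketch of the bound calculation, assuming each input dimension either has a known integer upper bound or none at all. `numel_upper_bound` is a hypothetical helper for illustration; the real fake impl works on symbolic shapes and value ranges rather than plain integers.

```python
import math
import sys


def numel_upper_bound(dim_upper_bounds):
    """Upper bound on the element count from per-dimension upper bounds.

    None marks a dimension with no known upper bound.
    """
    if any(upper is None for upper in dim_upper_bounds):
        # Unbounded dimension: fall back to the old, very loose bound.
        return sys.maxsize
    return min(math.prod(dim_upper_bounds), sys.maxsize)


bounded = numel_upper_bound([128, 64])      # both dims have known upper bounds
unbounded = numel_upper_bound([128, None])  # one dim has no upper bound
```

Since `masked_select` can return at most as many elements as its input has, the product of the per-dimension upper bounds is a valid cap on the output size.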

## Test plan
- Passes existing unit tests (tests case where upper bound is inf)
- Added unit test to verify upper bound reduction calculation
- Tested end-to-end by exporting with TORCH_LOGS="export" and ensuring that the range for `masked_select`'s SymInt size has the correct upper bound
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134899
Approved by: https://github.com/ezyang
2024-09-05 09:01:06 +00:00
c7328dff7f Enhance the stability of the complex divide code (#134647)
In C++, when a floating-point literal (e.g., 3.14) is compared with a variable of type float, the literal is by default interpreted as a double.
```c++
float f = 3.14f;
if (f == 3.14) {
    // Do something
}
```
If a device does not support double, an error will occur.
This PR addresses the issue of complex64 errors on machines that do not support double operations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134647
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-09-05 08:36:37 +00:00
749dc6ceda [inductor] [cpp] use_local_acc if template_buffer_has_other_users (#135081)
Fix the compilation error of `coat_lite_mini` in timm and `YituTechConvBert` in HF:
```
/tmp/tmpuu94adg_/nf/cnf3zm677wbfjzzll522zvjp57g44udzfnj66ac2t5b2odvfqpts.cpp:239:33: error: invalid conversion from ‘const float*’ to ‘float*’ [-fpermissive]
  239 |                                 &(in_ptr2[static_cast<int64_t>(n_start + (192L*m_start) + (Nr*nci) + ((-1L)*Nr*nc))]),
      |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                 |
      |                                 const float*
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135081
Approved by: https://github.com/jgong5
ghstack dependencies: #134984
2024-09-05 08:31:31 +00:00
eaeae0ac95 [c10d] Change collective to take in a list of tensors so it work fully for all collectives (#135049)
We found that currently, we only pass one input and output tensor to the function `collective`, and this causes NaNCheck, work numel stats and FR input/output sizes to be inaccurate for all-to-all, scatter and reduce. So we want to let the collective take in a list of tensors to ensure it works for all collectives inside PGNCCL.

This partially reverts what we did in https://github.com/pytorch/pytorch/pull/119421, and down the road we will have another round of cleanup on the collective to make it cleaner. For now, at least for the sake of correctness, we changed it back.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135049
Approved by: https://github.com/kwen2501
2024-09-05 07:56:56 +00:00
5a0e7a408f restore CSE'd node metadata in runtime asserts pass (#134516)
Adds val, and optionally stack_trace & nn_module_stack metadata back to SymInt compute nodes that we CSE, with a hook on `graph.create_node()`. Not sure if there's other metadata we want to populate here?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134516
Approved by: https://github.com/ezyang
2024-09-05 07:50:04 +00:00
81a8624296 [Intel GPU] Customized XPU behaviour in indexing, group norm (#134453)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134453
Approved by: https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #133980
2024-09-05 07:41:57 +00:00
731fd3172a [inductor] [cpp] generate reindexer for each epilogue_node (#134984)
Fixes the FP32 accuracy failure of `levit_128` in timm.

Previously, we used `Y` which is the output of the final epilogue node to calculate the reindexer. We actually need to use each epilogue node to calculate the reindexer from the GEMM output to the epilogue node.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134984
Approved by: https://github.com/jgong5
2024-09-05 07:08:31 +00:00
9d705605dd Fix decomp behaviour in export training IR (#134801)
This is a subset of the changes in https://github.com/pytorch/pytorch/pull/132901; the previous PR can't land because it is too complicated. The rest of the change will be implemented as a follow-up after the export design meeting. This part just makes the training IR -> inference IR decomp take the same path as normal export.

Differential Revision: [D62000525](https://our.internmc.facebook.com/intern/diff/D62000525)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134801
Approved by: https://github.com/avikchaudhuri, https://github.com/angelayi
2024-09-05 06:37:44 +00:00
05feb6e4ed [Inductor] support masked vectorization for the tail_loop for dynamic shapes (#131745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131745
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-09-05 06:17:48 +00:00
7b280c31ba [export] dynamic_shapes serialization, load/dump (#134718)
Adds utility functions `_dump_dynamic_shapes` and `_load_dynamic_shapes`.

- `_dump_dynamic_shapes`: dynamic shapes spec -> serialized format:
    - takes in the `dynamic_shapes` pytree object you'd feed into `export()`, and dumps into serialized format
- `_load_dynamic_shapes`: serialized format -> dynamic shapes spec
    - takes the serialized format, and produces a `dynamic_shapes` object you feed into `export()`

For example with dumping:
```
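import torch
from torch.export import Dim

# `_dump_dynamic_shapes` below is the utility function added by this PR.
dx = Dim("dx", min=4, max=16)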
dy = dx + 1

inputs = (
    [
        torch.randn(4, 4),
        torch.randn(5, 4),
    ],
    torch.randn(4),
    torch.randn(4, 4),
    "hello",
)
dynamic_shapes = {
    "a": [
        (dx, 4),
        (dy, 4),
    ],
    "b": (Dim.AUTO,),
    "c": None,
    "d": None,
}
out = _dump_dynamic_shapes(dynamic_shapes, inputs)
```

would generate the following output:
```
DynamicShapesSpec(
    dynamic_shapes=(
        [
            ['dx', 4],
            ['dx + 1', 4],
        ],
        ['_DimHint.STATIC'],
        ['_DimHint.STATIC', '_DimHint.STATIC'],
        None,
    ),
    dims={
        'dx': RootDim(
            min=4,
            max=16,
            derived=['dx + 1'],
        ),
    },
)
```

The serialized format contains 2 keys, `dynamic_shapes` and `dims`.
- `dynamic_shapes` is the pytree structure matching the input to `export()`, with strings in place of Dim names and enums, and ints/Nones otherwise. Each tensor is represented with a list of shapes, non-tensors with Nones.
- `dims` contains the min/max range and derived dims info for each root dim.

The test cases show some roundtrippability guarantees for these functions. Definitely taking naming suggestions for them :)
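
As a rough illustration of why the format uses only strings, ints, lists and Nones, the sketch below round-trips the example output through JSON. The helpers `dump_spec` and `load_spec` are hypothetical stand-ins written for this note; they are not the `_dump_dynamic_shapes` / `_load_dynamic_shapes` functions from the PR, and the dict layout for `dims` is an assumption modelled on the printed output.

```python
import json

# Hypothetical helpers: a toy stand-in for the serialized format described
# above, not the functions added by this PR.
def dump_spec(dynamic_shapes, dims):
    """Serialize the two keys of the format into a JSON string."""
    return json.dumps({"dynamic_shapes": dynamic_shapes, "dims": dims}, sort_keys=True)

def load_spec(text):
    """Inverse of dump_spec: recover (dynamic_shapes, dims)."""
    spec = json.loads(text)
    return spec["dynamic_shapes"], spec["dims"]

# Values copied from the example output: Dim names and enums appear as
# strings, static sizes as ints, and the non-tensor input as None.
shapes = [
    [["dx", 4], ["dx + 1", 4]],
    ["_DimHint.STATIC"],
    ["_DimHint.STATIC", "_DimHint.STATIC"],
    None,
]
dims = {"dx": {"min": 4, "max": 16, "derived": ["dx + 1"]}}

text = dump_spec(shapes, dims)
restored_shapes, restored_dims = load_spec(text)
```

Because every leaf is a plain string, int or None, the restored values compare equal to the originals.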

Follow up: utility function to extract serializable format from ExportedProgram.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134718
Approved by: https://github.com/avikchaudhuri
2024-09-05 05:39:44 +00:00
f2a7228aed [executorch hash update] update the pinned executorch hash (#135162)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135162
Approved by: https://github.com/pytorchbot
2024-09-05 04:21:51 +00:00
8fb1281db9 [Traceable FSDP2] Skip _backward_prefetch under compile, and rely on compiler pass to have prefetching (#135163)
Before this PR, when traceable FSDP2 + AC is run, an error would be thrown:
```
  File "/data/users/willfeng/pytorch/torch/_dynamo/variables/builtin.py", line 1449, in call_getitem
    return args[0].call_method(tx, "__getitem__", args[1:], kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 435, in call_method
    return super().call_method(tx, name, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 392, in call_method
    return super().call_method(tx, name, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 131, in call_method
    return self.getitem_const(tx, value)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 106, in getitem_const
    return self.items[index]
Error: Index out of bound

from user code:
   File "<eval_with_key>.5", line 105, in forward
    aot0_trace_wrapped = torch__dynamo__trace_wrapped_higher_order_op_self_invoke(aot0_tangents_1, bw_state = aot0_primals_34);  aot0_tangents_1 = None
  File "/data/users/willfeng/pytorch/torch/_dynamo/_trace_wrapped_higher_order_op.py", line 74, in self_invoke
    return _trace_wrapped_op(*args, **dyn_kwargs, **kwargs)
  File "/data/users/willfeng/pytorch/torch/_dynamo/external_utils.py", line 132, in call_hook_from_backward_state
    return getattr(bw_state, hook_name)(*args, **kwargs)
  File "/data/users/willfeng/pytorch/torch/distributed/_composable/fsdp/_fsdp_state.py", line 271, in _pre_backward
    self._fsdp_param_group.pre_backward(default_prefetch)
  File "/data/users/willfeng/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 332, in pre_backward
    self._backward_prefetch()
  File "/data/users/willfeng/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 417, in _backward_prefetch
    target_fsdp_param_group = self.comm_ctx.post_forward_order[target_index]
```

Since it's okay to rely on the compiler to recover the "prefetching" pattern, we will skip this `_backward_prefetch()` code path during tracing to avoid the error, and have a compiler pass (in future PR) to achieve the equivalent prefetching overlap.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135163
Approved by: https://github.com/awgu
2024-09-05 03:32:04 +00:00
a7a53b796b [Intel GPU]device guard codegen for XPU (#133980)
This PR is a supplement to #130082. The previous PR #130082 provided the basic codegen functionality, but we found that it fails to handle the device sameness check in many UTs. The current PR enables XPU device guard code generation.

With the current PR, the code snippet in `RegisterXPU.cpp` is as follows, where we can see that the device guard is successfully generated.
```c++
namespace {
at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) {
  std::optional<Device> common_device = std::nullopt;
(void)common_device; // Suppress unused variable warning
  c10::impl::check_and_update_common_device(common_device, out, "wrapper_XPU_Tensor_float_out_normal_out", "out");
  c10::impl::check_and_update_common_device(common_device, mean, "wrapper_XPU_Tensor_float_out_normal_out", "mean");
  const OptionalDeviceGuard device_guard(device_of(out));
  return at::native::normal_out(mean, std, generator, out);
}
} // anonymous namespace
```
By contrast, without the current change, the generated code is:
```c++
namespace {
at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) {
    // No device check
  // DeviceGuard omitted
  return at::native::normal_out(mean, std, generator, out);
}
} // anonymous namespace
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133980
Approved by: https://github.com/EikanWang, https://github.com/malfet
2024-09-05 01:53:31 +00:00
30b98940b8 Fix typo in comment (#135111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135111
Approved by: https://github.com/aorenste, https://github.com/oulgen
2024-09-05 01:39:04 +00:00
724faac260 [FSDP] casting input args with dataclass(frozen=True) (#135067)
resolve: https://github.com/pytorch/pytorch/pull/135029

When mixed precision is enabled, FSDP casts input args to the desired dtype by calling `_apply_to_tensors`. When the input args contain a `dataclass(frozen=True)` instance, we hit the following runtime error because `_apply_to_tensors` uses `setattr`:

`dataclasses.FrozenInstanceError: cannot assign to field 'some_key'`. The fix is to use the dataclasses API `dataclasses.replace`.
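
The difference between the two approaches can be reproduced with the standard library alone. This is a minimal sketch, not the FSDP code: `Batch`, `cast_fields_setattr`, `cast_fields_replace` and `to_int` are hypothetical names, and a float-to-int conversion stands in for the dtype cast.

```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class Batch:
    some_key: float
    label: str

def cast_fields_setattr(obj, fn):
    """In-place update: raises FrozenInstanceError on frozen dataclasses."""
    for field in dataclasses.fields(obj):
        setattr(obj, field.name, fn(getattr(obj, field.name)))
    return obj

def cast_fields_replace(obj, fn):
    """Build a new instance with converted fields; works for frozen dataclasses."""
    changes = {f.name: fn(getattr(obj, f.name)) for f in dataclasses.fields(obj)}
    return dataclasses.replace(obj, **changes)

def to_int(value):
    # Stand-in for a dtype cast: convert floats, leave everything else alone.
    return int(value) if isinstance(value, float) else value

batch = Batch(some_key=1.5, label="x")

try:
    cast_fields_setattr(batch, to_int)
    setattr_failed = False
except dataclasses.FrozenInstanceError:
    setattr_failed = True

casted = cast_fields_replace(batch, to_int)
```

`dataclasses.replace` returns a new object, so the original `batch` is left untouched.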

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135067
Approved by: https://github.com/awgu
2024-09-05 01:19:53 +00:00
04e11c7eed Update current scripts used for setting up s390x runners (#129866)
Update current scripts used for setting up s390x runners

Just a documentation update.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129866
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-09-05 01:17:54 +00:00
a3e0d4bf07 [FlexAttention] Fix mismatched backward strides for eager impl (#135152)
# Fixes:
The first repro from: https://github.com/pytorch/pytorch/issues/134888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135152
Approved by: https://github.com/Chillee
2024-09-05 01:14:53 +00:00
27d86f93fe Remove redundant code (#134955)
Remove GetPrivateUse1HooksInterface
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134955
Approved by: https://github.com/Skylion007
2024-09-05 01:11:32 +00:00
32f45f01a9 [dynamo] Retire CompileProfiler (#135133)
Fixes confusion in https://github.com/pytorch/pytorch/issues/113443

We have TORCH_LOGS that supersedes CompileProfiler

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135133
Approved by: https://github.com/ezyang
ghstack dependencies: #135039, #135121, #135129, #135130
2024-09-05 01:08:40 +00:00
4a661e089a [FR] Add version based logic to FR script and make traces print can be filtered (#135154)
This PR passes the FR dump version around so that we can have different behaviors for different versions of the FR dump. It also adds logic to filter by PG (desc) and ranks so that only their traces are shown.

It also includes some minor refactors to make names more accurate and to get the util function working.

<img width="1180" alt="image" src="https://github.com/user-attachments/assets/4ef8a2d6-1296-4a45-b9a7-6d3b48fbe233">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135154
Approved by: https://github.com/wconstab
2024-09-05 00:59:32 +00:00
105ac2418c Fix binary builds artifact download (#135139)
By upgrading the upload-artifacts action to v4.4.0.

The artifact store layout differs between the v3 and v4 actions, so artifacts uploaded by v3 cannot be downloaded by v4.

Should fix `Unable to download artifact(s): Artifact not found for name: libtorch-cpu-shared-with-deps-release`, which can be seen, for example, [here](https://github.com/pytorch/pytorch/actions/runs/10707740040/job/29690137218#step:7:29)

I.e., this fixes the regression introduced by https://github.com/pytorch/pytorch/pull/135068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135139
Approved by: https://github.com/atalman, https://github.com/huydhn
2024-09-05 00:43:34 +00:00
560f449d8f Fix: use clone_preserve_strides in auto_functionalized_v2 (#135142)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135142
Approved by: https://github.com/zou3519
ghstack dependencies: #134409
2024-09-05 00:39:48 +00:00
956da79bda [CUDA][AMP] Fix autocast_dtype (#133938)
Fixes #132715

The failure in #132715 is due to `autocast_dtype` being a thread-local variable. This causes `get_autocast_dtype()` to return inconsistent values across threads.

To be exact, what is happening is the following: the amp dtype is set to `bfloat16` on the main thread. The `backward` call runs on a side thread, so `at::autocast::prioritize` fails because `lower_precision_fp` defaults to `float16`:
6f738d6434/aten/src/ATen/autocast_mode.h (L221-L225)

This PR makes `autocast_dtype` thread-global so that it is consistent among all threads of the forward and backward passes.
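
The effect of thread-local versus thread-global state can be shown with a plain Python analogy. The real state lives in C++ inside ATen; the names below (`_local`, `_global`, `get_local_dtype`, `run_in_side_thread`) are hypothetical and only mimic the behavior described above.

```python
import threading

_local = threading.local()        # per-thread storage, like the old behavior
_global = {"dtype": "float16"}    # shared storage, like the new behavior

def get_local_dtype():
    # A thread that never set the value falls back to the default.
    return getattr(_local, "dtype", "float16")

def run_in_side_thread(fn):
    """Run fn on a fresh thread and return its result."""
    result = []
    thread = threading.Thread(target=lambda: result.append(fn()))
    thread.start()
    thread.join()
    return result[0]

# The main thread selects bfloat16 in both kinds of storage.
_local.dtype = "bfloat16"
_global["dtype"] = "bfloat16"

# A side thread (standing in for the backward pass) reads the setting.
seen_local = run_in_side_thread(get_local_dtype)
seen_global = run_in_side_thread(lambda: _global["dtype"])
```

With thread-local storage the side thread sees the default `float16`, while the shared value is `bfloat16` on every thread.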

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133938
Approved by: https://github.com/soulitzer
2024-09-05 00:07:32 +00:00
977a909250 [CI] Build pytorch wheel with Torch XPU Operators on Windows (#133151)
# Description
This pipeline enables the CI build on Windows for PRs labeled with ciflow/xpu. It builds the torch binary with Torch XPU Operators on Windows using Visual Studio BuildTools 2022.

# Changes
1. Install xpu batch file (install_xpu.bat) - Checks whether the build machine has oneAPI in its environment and whether it is the latest version. If not, installs the latest publicly released oneAPI on the machine.
2. GHA callable pipeline (_win-build.yml) - Sets vc_year and use_xpu as parameters to configure the wheel build environment.
3. GHA workflow (xpu.yml) - Adds a new Windows build job and passes parameters to it.
4. Build wheels script (.ci/pytorch/win-test-helpers/build_pytorch.bat) - Prepares the environment for building, e.g. installs the oneAPI bundle.

# Note
1. For building wheels on Intel GPU, you need Visual Studio BuildTools version >= 2022
2. This pipeline requires Visual Studio BuildTools 2022 to build wheels. For now, we specify "windows.4xlarge.nonephemeral" as the build machine label in the yaml file. We will soon request self-hosted runners with Intel GPU and Visual Studio BuildTools 2022 installed.

Work for #114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133151
Approved by: https://github.com/chuanqi129, https://github.com/atalman

Co-authored-by: chuanqiw <chuanqi.wang@intel.com>
2024-09-05 00:02:46 +00:00
b3ef0c99f5 [PP] Fix zero bubble composability with DP (#134052)
Moved all the backward functions (`stage_backward_input`, `stage_backward_weight`, `stage_backward`) under the same `backward_maybe_with_nosync` function which controls the logic of the data parallel wrappers.

FSDP was not working with zero bubble PP because there are twice as many "backward" calls and we update the weight gradients after `autograd.grad` is called. As a result, we need to manually call the FSDP `post_backward_hook()` after the weights have the correct gradients.

Fixes the tests:
`python test/distributed/_composable/test_composability/test_pp_composability.py ComposabilityTest.test_manual_with_data_parallel_dp_type_FSDP_ScheduleClass0_use_new_runtime_False`

`python test/distributed/_composable/test_composability/test_pp_composability.py ComposabilityTest.test_manual_with_data_parallel_dp_type_DDP_ScheduleClass0_use_new_runtime_False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134052
Approved by: https://github.com/kwen2501
2024-09-04 23:46:29 +00:00
43c9b4e0e6 Fix unintentional deduplication of returned tensors (#134726)
When CSE was used, returned tensors that had gone through identical
processing steps but were distinct from a data perspective were pruned
out of the graph.  This commit protects tensors which are directly
output from being pruned, and adds a test for this behavior.
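
A toy model of the problem and the fix, under the assumption that a graph is a list of `(name, op, args)` nodes; the `cse` function and its `protect_outputs` flag are hypothetical and only illustrate the idea of exempting directly returned nodes from deduplication.

```python
def cse(nodes, outputs, protect_outputs):
    """Toy common-subexpression elimination over (name, op, args) nodes.

    Returns the surviving nodes and the (possibly renamed) outputs.
    """
    seen = {}    # (op, args) -> name of the first node computing it
    rename = {}  # pruned node name -> surviving node name
    kept = []
    for name, op, args in nodes:
        args = tuple(rename.get(arg, arg) for arg in args)
        key = (op, args)
        is_protected = protect_outputs and name in outputs
        if key in seen and not is_protected:
            rename[name] = seen[key]  # prune the duplicate
            continue
        seen.setdefault(key, name)
        kept.append((name, op, args))
    return kept, [rename.get(out, out) for out in outputs]

# Two clones of the same input: identical processing, distinct results.
nodes = [("a", "clone", ("x",)), ("b", "clone", ("x",))]

pruned_nodes, pruned_outputs = cse(nodes, ["a", "b"], protect_outputs=False)
safe_nodes, safe_outputs = cse(nodes, ["a", "b"], protect_outputs=True)
```

Without protection both outputs collapse onto `a`, so the caller receives the same object twice; with protection `b` survives as its own node.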

Closes #88813 and #114344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134726
Approved by: https://github.com/amjames, https://github.com/zou3519, https://github.com/bdhirsh
2024-09-04 23:42:56 +00:00
00a8666708 [ONNX] Support output_names in dynamic_axes when dynamo=True (#135134)
Before this PR, if names from output_names appeared in dynamic_axes, an error was raised when converting dynamic_axes to the dynamic_shapes of torch.export, because only input_names were recognized.
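
The description does not spell out how output entries are handled. The sketch below assumes they are simply skipped, since the export-side spec only describes inputs. `dynamic_axes_to_dynamic_shapes` is a hypothetical helper, and plain strings stand in for the `Dim` objects a real conversion would produce.

```python
def dynamic_axes_to_dynamic_shapes(dynamic_axes, input_names, output_names):
    """Hypothetical conversion: keep input entries, ignore output entries."""
    dynamic_shapes = {}
    for name, axes in dynamic_axes.items():
        if name in output_names:
            # Assumption: output shapes follow from the inputs, so there is
            # nothing to specify for them.
            continue
        if name not in input_names:
            raise ValueError(f"unknown tensor name in dynamic_axes: {name!r}")
        # dynamic_axes accepts either {axis: label} or a plain list of axes.
        if isinstance(axes, dict):
            dynamic_shapes[name] = dict(axes)
        else:
            dynamic_shapes[name] = {axis: f"{name}_dim_{axis}" for axis in axes}
    return dynamic_shapes

shapes = dynamic_axes_to_dynamic_shapes(
    {"x": {0: "batch"}, "y": [0], "out": {0: "batch"}},
    input_names=["x", "y"],
    output_names=["out"],
)
```

Here the `out` entry is dropped and only `x` and `y` receive dynamic dimensions.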
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135134
Approved by: https://github.com/justinchuby
2024-09-04 23:42:13 +00:00
eqy
4f70b3cfae [CUDA][complex][TF32] Update test_noncontiguous_samples tolerances for complex64 (#134526)
A recent cuDNN heuristics change surfaces the same TF32 issue as for `float32`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134526
Approved by: https://github.com/ezyang
2024-09-04 23:37:16 +00:00
359077fa43 [export] Fix indentation (#135128)
Summary: as title

Test Plan: CI

Differential Revision: D62195680

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135128
Approved by: https://github.com/tugsbayasgalan
2024-09-04 23:26:36 +00:00
9810ce9ca7 [PP] Go back to export instead of _export (#134299)
Reverts https://github.com/pytorch/pytorch/pull/130998 because FakeTensor + real device suffice to work around the autocast issue in HF.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134299
Approved by: https://github.com/lessw2020
2024-09-04 23:25:17 +00:00
804852c1f9 [dynamo] Search for _torchdynamo_inline only for functions (#135130)
Issue seen in https://github.com/pytorch/pytorch/issues/93633

Fixes https://github.com/pytorch/pytorch/issues/93633

Unable to create a testcase

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135130
Approved by: https://github.com/williamwen42, https://github.com/yanboliang
ghstack dependencies: #135039, #135121, #135129
2024-09-04 23:02:59 +00:00
13a4a0c60d [Inductor] Apply loop split optimization in codegen_node (#132389)
This PR applies the loop split optimization in codegen_node to avoid non-contiguous loads. When a vector is loaded non-contiguously because of a division in the index, we eliminate the division by splitting the loop.

Example:
```
import torch
import torch.nn as nn

class GNReLU(torch.nn.Module):
    def __init__(self, num_groups, num_channels):
        super(GNReLU, self).__init__()
        self.gn = nn.GroupNorm(num_groups, num_channels)

    def forward(self, x):
        return torch.nn.functional.relu(self.gn(x))

input = torch.randn(2, 960, 96, 96).to(memory_format=torch.channels_last)
m = GNReLU(32, 960).eval()
compiled_m = torch.compile(m)

with torch.no_grad():
    compiled_m(input)
```

Generated code:

- Before:
```
cpp_fused_native_group_norm_relu_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/vu/cvuckxaygqfovv2zu2byqhcmiejbke7mdhf2rpgpr5mlscdev2hg.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(56)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<long>(17280L));
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L))
                        {
                            for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 16);
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                            }
                            for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 14);
                                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, 14, &wrecps0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L))
                {
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(960L); x2+=static_cast<long>(16L))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)), 16);
                        auto tmp1 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr0[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp3 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr1[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x2), 16);
                        auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x2), 16);
                        auto tmp2 = tmp0 - tmp1;
                        auto tmp4 = static_cast<float>(276480.0);
                        auto tmp5 = at::vec::Vectorized<float>(tmp4);
                        auto tmp6 = tmp3 / tmp5;
                        auto tmp7 = static_cast<float>(1e-05);
                        auto tmp8 = at::vec::Vectorized<float>(tmp7);
                        auto tmp9 = tmp6 + tmp8;
                        auto tmp10 = tmp9.rsqrt();
                        auto tmp11 = tmp2 * tmp10;
                        auto tmp13 = tmp11 * tmp12;
                        auto tmp15 = tmp13 + tmp14;
                        auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0));
                        tmp16.store(out_ptr2 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)));
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg2_1, = args
    args.clear()
    assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960))
    buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32)
    cpp_fused_native_group_norm_relu_0(arg2_1, _frozen_param3, _frozen_param2, buf0, buf1, buf3)
    del arg2_1
    return (buf3, )
```

- After:
```
cpp_fused_native_group_norm_relu_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/vu/cvuckxaygqfovv2zu2byqhcmiejbke7mdhf2rpgpr5mlscdev2hg.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(56)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<long>(17280L));
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L))
                        {
                            for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 16);
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                            }
                            for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 14);
                                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, 14, &wrecps0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L))
                {
                    #pragma GCC ivdep
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(32L); x2+=static_cast<long>(1L))
                    {
                        for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)), 16);
                            auto tmp1 = out_ptr0[static_cast<long>(x2 + (32L*x0))];
                            auto tmp4 = out_ptr1[static_cast<long>(x2 + (32L*x0))];
                            auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x3 + (30L*x2)), 16);
                            auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x3 + (30L*x2)), 16);
                            auto tmp2 = at::vec::Vectorized<float>(tmp1);
                            auto tmp3 = tmp0 - tmp2;
                            auto tmp5 = static_cast<float>(276480.0);
                            auto tmp6 = tmp4 / tmp5;
                            auto tmp7 = static_cast<float>(1e-05);
                            auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
                            auto tmp9 = 1 / std::sqrt(tmp8);
                            auto tmp10 = at::vec::Vectorized<float>(tmp9);
                            auto tmp11 = tmp3 * tmp10;
                            auto tmp13 = tmp11 * tmp12;
                            auto tmp15 = tmp13 + tmp14;
                            auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0));
                            tmp16.store(out_ptr2 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)));
                        }
                        for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)), 14);
                            auto tmp1 = out_ptr0[static_cast<long>(x2 + (32L*x0))];
                            auto tmp4 = out_ptr1[static_cast<long>(x2 + (32L*x0))];
                            auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x3 + (30L*x2)), 14);
                            auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x3 + (30L*x2)), 14);
                            auto tmp2 = at::vec::Vectorized<float>(tmp1);
                            auto tmp3 = tmp0 - tmp2;
                            auto tmp5 = static_cast<float>(276480.0);
                            auto tmp6 = tmp4 / tmp5;
                            auto tmp7 = static_cast<float>(1e-05);
                            auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
                            auto tmp9 = 1 / std::sqrt(tmp8);
                            auto tmp10 = at::vec::Vectorized<float>(tmp9);
                            auto tmp11 = tmp3 * tmp10;
                            auto tmp13 = tmp11 * tmp12;
                            auto tmp15 = tmp13 + tmp14;
                            auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0));
                            tmp16.store(out_ptr2 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)), 14);
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg2_1, = args
    args.clear()
    assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960))
    buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32)
    cpp_fused_native_group_norm_relu_0(arg2_1, _frozen_param3, _frozen_param2, buf0, buf1, buf3)
    del arg2_1
    return (buf3, )
```
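As a reading aid for the generated kernel above, here is a minimal scalar sketch of what each vector lane appears to compute. It assumes that `out_ptr0` holds the per-group mean and `out_ptr1` the per-group sum of squared deviations (inferred from `tmp4 / tmp5`); the helper names are hypothetical and not part of Inductor.

```python
import math

# Shapes taken from the wrapper above: input is (2, 960, 96, 96); 32 groups is
# inferred from the (2, 32, 1, 1) statistics buffers.
N, C, H, W = 2, 960, 96, 96
GROUPS = 32
CHANNELS_PER_GROUP = C // GROUPS          # 30, the upper bound of the x3 loop
GROUP_SIZE = CHANNELS_PER_GROUP * H * W   # 276480, the tmp5 constant
EPS = 1e-05                               # the tmp7 constant


def group_norm_relu_scalar(x, mean, m2, weight, bias):
    """Scalar equivalent of tmp0..tmp16 for a single element (assumed semantics)."""
    rstd = 1.0 / math.sqrt(m2 / GROUP_SIZE + EPS)   # tmp6..tmp9
    y = (x - mean) * rstd * weight + bias           # tmp3, tmp11, tmp13, tmp15
    return max(y, 0.0)                              # tmp16: clamp_min(..., 0)


def split_vector_and_tail(length, width=16):
    """Return (elements covered by full vectors, elements left for the masked tail)."""
    full = (length // width) * width
    return full, length - full
```

With 30 channels per group and 16-wide vectors, `split_vector_and_tail(30)` gives one full vector plus a 14-element tail, which matches the masked `loadu(..., 14)` and `store(..., 14)` calls in the second loop.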

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132389
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
2024-09-04 22:42:46 +00:00
87842cc658 [dynamo][super] Corner case where the class is not present in the __mro__ (#135129)
I could not come up with a testcase. This was seen in https://github.com/pytorch/pytorch/issues/93633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135129
Approved by: https://github.com/yanboliang
ghstack dependencies: #135039, #135121
2024-09-04 22:30:09 +00:00
d9ae92cd6e [Dynamo] Support for proxying frozen dataclasses (#134846)
Fixes https://github.com/pytorch/pytorch/issues/133858

Details: Previously Dynamo would treat dataclasses as UserDefinedVariables. This was undesirable if we would like to proxy the value into the graph, which is needed for TensorSubclassMetadata. To rectify this, frozen dataclasses can now be proxied similarly to NamedTuples. We require the object to be frozen because, if arbitrary mutation were allowed, we would need to replay those mutations in the graph after construction of the object.

For tracing construction of the variable, the generated `__init__` for the dataclass uses `object.__setattr__` because frozen dataclasses throw errors on the usual `__setattr__` invocation. With this treatment, no special handling is needed in dynamo for frozen dataclass construction.
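The plain-Python behavior this relies on can be illustrated with a small sketch. `TensorMeta`, `try_mutate`, and `rebuild` are hypothetical names used only for illustration; none of this is Dynamo code.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)
class TensorMeta:
    """Hypothetical stand-in for a frozen metadata dataclass."""
    shape: tuple
    requires_grad: bool = False


def try_mutate(obj):
    """Ordinary attribute assignment on a frozen dataclass raises."""
    try:
        obj.shape = (4,)
    except dataclasses.FrozenInstanceError:
        return "frozen"
    return "mutated"


def rebuild(obj):
    """Reconstruct an equal object purely from its fields, much as a NamedTuple could be."""
    values = {f.name: getattr(obj, f.name) for f in dataclasses.fields(obj)}
    return type(obj)(**values)


meta = TensorMeta(shape=(2, 3))

# object.__setattr__ bypasses the frozen check; the generated __init__ uses the
# same call to populate the fields.
other = TensorMeta(shape=(1,))
object.__setattr__(other, "shape", (9,))
```

Because a frozen instance is fully determined by its constructor arguments, rebuilding it from its fields yields an equal object, which is the property that makes proxying practical.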

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134846
Approved by: https://github.com/bdhirsh, https://github.com/anijain2305
2024-09-04 22:17:00 +00:00
ed06772e35 [TorchElastic] add warning when users try to pass a "use_libuv" argument to create_c10d_store (#135062)
**Summary**
Extend the warning message to be more self-explanatory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135062
Approved by: https://github.com/shuqiangzhang
2024-09-04 22:05:51 +00:00
fb1c580892 [BE][optim] Make pyright recognize exported symbols (#135043)
Follows pattern introduced by https://github.com/pytorch/pytorch/pull/80955 which [pyright](https://github.com/microsoft/pyright) prefers over `__all__` symbol, see https://github.com/microsoft/pylance-release/issues/2953#issuecomment-1168956296
Fixes https://github.com/pytorch/pytorch/issues/134985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135043
Approved by: https://github.com/janeyx99
2024-09-04 21:53:46 +00:00
2276940f8c Make Dynamo inline through torch._library.custom_ops.autograd (#135066)
Fixes https://github.com/pytorch/pytorch/issues/135057

The bug was: in the situation that Dynamo graph breaks in the forward
and Compiled Autograd uses Dynamo to introspect the backward, we end up
running into an "Unsupported: inlining through SKIPFILES" error. The
solution is to mark the entirety of this module as inlineable.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135066
Approved by: https://github.com/bdhirsh, https://github.com/williamwen42, https://github.com/yanboliang
2024-09-04 21:48:28 +00:00
4e6df83d19 [PT] Add out variant for avg_pool1d and adaptive_avg_pool1d (#135051)
Test Plan: CI

Differential Revision: D62148410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135051
Approved by: https://github.com/SS-JIA
2024-09-04 21:20:01 +00:00
a8611da86f [dynamo][backend match] Optimize backend match for common case (#135121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135121
Approved by: https://github.com/williamwen42
ghstack dependencies: #135039
2024-09-04 21:02:29 +00:00
09a339fc06 [Flex Attention] update __getitem__ without tree_map_only to support compile (#134627)
Adds a helper function for getting the block mask for a specific row index during decoding. We need this change to avoid the pytree + torch.compile issue #134731. Tested in gpt-fast [pr](https://github.com/pytorch-labs/gpt-fast/pull/196).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134627
Approved by: https://github.com/Chillee
2024-09-04 20:09:41 +00:00
741d52c69f Revert "Add support for 32KB multi_tensor_apply kernel arguments (#134373)"
This reverts commit 08184aa85cf183198ebdf2fd7a49fe7bc4842c13.

Reverted https://github.com/pytorch/pytorch/pull/134373 on behalf of https://github.com/drisspg due to See https://github.com/pytorch/pytorch/issues/135126 for more details ([comment](https://github.com/pytorch/pytorch/pull/134373#issuecomment-2329839011))
2024-09-04 19:44:29 +00:00
dd7cd182ab [AIInfra][DCP] All gather keys checkpoint utils bug fix (#135045)
Summary: All gather keys checkpoint utils bug fix. `dist.get_world_size` should have the process group passed in to avoid an inconsistent world size in case the process group has changed. This is common in the tests.

Test Plan: UTs

Reviewed By: Saiteja64

Differential Revision: D61578832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135045
Approved by: https://github.com/MeetVadakkanchery, https://github.com/LucasLLC
2024-09-04 18:49:34 +00:00
eb0fd17bc4 [Profiler] Fix Raw Metadata Iterator (#135096)
Summary:
D62008788 added an extra parameter to the RawTensorMetadata struct. For some reason this causes some corrupted accesses in other tests as described in T200685032.

Once this is removed the tests pass. Going forward we need to document how to add parameters to this portion of the code as the AppendOnlyLists seem to be very rigid.

Test Plan: Ran all the tests locally and they all passed.

Differential Revision: D62171089

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135096
Approved by: https://github.com/aaronenyeshi
2024-09-04 18:41:50 +00:00
c88c19c6de Revert "restore CSE'd node metadata in runtime asserts pass (#134516)"
This reverts commit 1dfb1052395d908ed6e67288c9357e16022da272.

Reverted https://github.com/pytorch/pytorch/pull/134516 on behalf of https://github.com/pianpwk due to breaking NestedTensor test ([comment](https://github.com/pytorch/pytorch/pull/134516#issuecomment-2329738450))
2024-09-04 18:41:21 +00:00
873abfc18e [inductor] fix compile time regression due the (disabled) loop ordering after fusion (#135071)
It's a bit surprising that the code added in Scheduler.fusable_read_and_write would increase compilation time.

Here are some numbers I got from an H100 on BertForMaskedLM:
- without the fix, cold start compilation time is around 82s
- with the fix, cold start compilation time is around 76s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135071
Approved by: https://github.com/jansel
2024-09-04 18:36:59 +00:00
d7b57c4d63 Fix tensor.data access under inference_mode and compile (#134878)
Fixes https://github.com/pytorch/pytorch/issues/134798

In the regular Tensor case, when you call Tensor.data, there's a check
for if inference mode is active. If it is active, then we don't set the
version counter. We replicate this check for Tensor Subclasses (the bug
was we were trying to set the version counter on a FakeTensor in
inference_mode).

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134878
Approved by: https://github.com/bdhirsh
2024-09-04 17:55:41 +00:00
0d193a0adf Add ExecuTorch warning to mobile_optimizer (#134697)
Preview: https://docs-preview.pytorch.org/pytorch/pytorch/134697/mobile_optimizer.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134697
Approved by: https://github.com/ali-khosh, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-04 17:47:14 +00:00
193c547461 [inductor] Refactor simplify erase_nodes() (#134822)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134822
Approved by: https://github.com/shunting314
ghstack dependencies: #134748, #134749
2024-09-04 17:32:07 +00:00
2ddf3ed707 [inductor] Allow cudagraphs with unused CPU inputs (#134749)
This pattern was preventing cudagraphs from kicking in on torch_multimodal_clip, resulting in `1.6529 → 3.3471` speedup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134749
Approved by: https://github.com/shunting314
ghstack dependencies: #134748
2024-09-04 17:32:07 +00:00
cff1158200 [inductor] Pass to fix device on index(..., [iota]) (#134748)
This pattern was preventing cudagraphs from kicking in on torch_multimodal_clip, resulting in `1.6529 → 3.3471` speedup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134748
Approved by: https://github.com/shunting314
2024-09-04 17:31:58 +00:00
7858045491 Revert "Fix set_unbacked_bindings when list of Tensors is returned (#133585)"
This reverts commit 2a49296d7563150d67bb00bd4c97bc5aafaa77df.

Reverted https://github.com/pytorch/pytorch/pull/133585 on behalf of https://github.com/ezyang due to fails torchrec tests ([comment](https://github.com/pytorch/pytorch/pull/133585#issuecomment-2329602983))
2024-09-04 17:21:32 +00:00
8759ed2ac5 Revert "Compute and do renamings even when ignoring fresh unbacked symbols (#134407)"
This reverts commit 46cb2af7d822681298370bab9d49b3cba5546dd5.

Reverted https://github.com/pytorch/pytorch/pull/134407 on behalf of https://github.com/ezyang due to need to back out https://github.com/pytorch/pytorch/pull/133585 ([comment](https://github.com/pytorch/pytorch/pull/134407#issuecomment-2329597388))
2024-09-04 17:18:21 +00:00
fc07e6bf56 Revert "Ignore fresh unbacked when doing recursive make_fx inside HOPs (#135053)"
This reverts commit a178a053ad2c8e42d1b684ed38385b9646ec3b74.

Reverted https://github.com/pytorch/pytorch/pull/135053 on behalf of https://github.com/ezyang due to need to back out https://github.com/pytorch/pytorch/pull/133585 ([comment](https://github.com/pytorch/pytorch/pull/134407#issuecomment-2329597388))
2024-09-04 17:18:21 +00:00
c8ab9b06a2 Redesign custom op functionalization for better re-inplace (#134409)
- The new implementation (auto_functionalized_v2) is enabled by default but can be disabled
 using an inductor flag.
- In export mode the old implementation is used.

**Motivation**
The previous functionalization fails to re-inplace arguments when they are views of other tensors;
see issue https://github.com/pytorch/pytorch/issues/131192.
The new functionalization makes it easier to re-inplace views.

**A) Functionalization pass**
Consider a program:

```

def func(t):
    x = t[0]
    y = t[1]
    foo(x, y) # custom operator with x, y mutable
    return (x, y, t)
```

- To functionalize `foo` we generate a function that operates on the base tensors of the inputs (x.base() and y.base())
and record how to regenerate each view from its base; for argument x this is `ViewInfo=(x.base(), x.size(), x.stride(), x.storage_offset())`

- Due to some limitations on the torch.export argument format, we have to generate a lot of arguments. This is something we can simplify in the future. For the example above we get the following function.

   ```
   auto_functionalized = torch.ops.higher_order.auto_functionalized(torch.ops.mylib.foo.default,
     _x_base_index = 0, _x_size = (), _x_stride = (), _x_storage_offset = 0 ,
     _y_base_index = 0,_y_size = (), _y_stride = (), _y_storage_offset = 1   ,
     _all_bases = [arg0_1])
   ```
 -  In the code above:
        - _all_bases[t]: refers to the unique set of bases for all foo arguments.
        - for each argument x we have _x_base_index, _x_size, _x_stride, _x_storage_offset, which can be used to regenerate x from _all_bases[_x_base_index] or from a copy of that base.

-  The output of auto_functionalized is the output of foo, followed by one tensor for each base in _all_bases: a copy of the base tensor after observing the mutations of all the arguments that are views of that base.

-  Each use of a base in _all_bases, or of a view of it, that occurs after the call to foo is replaced with a view of the new output.

 For the function above, after functionalization we get:
 ```
    def forward(self, arg0_1: "f32[2][1]cpu"):
        auto_functionalized = torch.ops.higher_order.auto_functionalized(torch.ops.mylib.foo.default, _x_base_index = 0, _x_size = (), _x_stride = (), _x_storage_offset = 0, _y_base_index = 0, _y_size = (), _y_stride = (), _y_storage_offset = 1, _all_bases = [arg0_1])
        getitem_1: "f32[2][1]cpu" = auto_functionalized[1];  auto_functionalized = None
        copy_: "f32[2][1]cpu" = torch.ops.aten.copy_.default(arg0_1, getitem_1);  arg0_1 = copy_ = None

        # No stacktrace found for following nodes
        select_2: "f32[][]cpu" = torch.ops.aten.select.int(getitem_1, 0, 0)
        select_3: "f32[][]cpu" = torch.ops.aten.select.int(getitem_1, 0, 1);  getitem_1 = None
        return (select_2, select_3)
```

**B) Semantics of  auto_functionalize**
The new semantics of auto_functionalize are as follows:
1. For each base in all_bases, copy the base, producing the all_bases copies (if a base is inplaced, we do not need to copy it).
2. For each arg, regenerate the arg from the copy of its base using the view information above.
3. Return the original foo output followed by the new bases.
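The three steps above can be sketched with a toy model in plain Python, where a "tensor" is a flat list and a view is a strided window onto it. This is only an illustration under those assumptions: `View`, `auto_functionalized_toy`, and this `foo` are hypothetical and do not reflect the real higher-order op's signature.

```python
class View:
    """A 1-D strided window onto a flat Python list (toy stand-in for a tensor view)."""

    def __init__(self, storage, length, stride, offset):
        self.storage = storage
        self.length = length
        self.stride = stride
        self.offset = offset

    def __getitem__(self, i):
        return self.storage[self.offset + i * self.stride]

    def __setitem__(self, i, value):
        self.storage[self.offset + i * self.stride] = value


def auto_functionalized_toy(op, view_infos, all_bases):
    # Step 1: copy every base so the caller's inputs are never mutated.
    new_bases = [list(base) for base in all_bases]
    # Step 2: regenerate each argument as a view of the *copied* base.
    args = [
        View(new_bases[base_index], length, stride, offset)
        for (base_index, length, stride, offset) in view_infos
    ]
    # Step 3: run the mutating op and return its output followed by the new bases.
    out = op(*args)
    return (out, *new_bases)


def foo(x, y):
    """Hypothetical custom op that mutates both of its arguments in place."""
    x[0] = x[0] + 1.0
    y[0] = y[0] * 10.0
    return None


t = [1.0, 2.0]
# x = t[0] and y = t[1] share base 0; they differ only in storage offset.
out, new_t = auto_functionalized_toy(foo, [(0, 1, 1, 0), (0, 1, 1, 1)], [t])
```

Here `t` is left untouched while `new_t` carries both mutations, mirroring how callers are rewritten to read views of the returned base instead of the original one.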

**C) Re-inplace pass**
Since auto_functionalize now copies the bases, what we actually inplace is the bases
 (the pass runs just like before, but on the bases instead of the args).

1. For each base b in _all_bases, check whether there is any use of the base (or its aliases/views) after auto_functionalize (before it's overwritten with a copy). If there is none, then inplace it (avoid copying it in step 1 above).
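The check described above amounts to a simple liveness query. A minimal sketch, assuming the graph is given as an ordered list of `(node_name, names_read)` pairs and that alias sets are already known; `bases_safe_to_inplace` and this data layout are hypothetical and much simpler than the real Inductor pass.

```python
def bases_safe_to_inplace(nodes, call_index, aliases):
    """Return the bases with no later use of themselves or any of their aliases.

    nodes:      ordered list of (node_name, names_read) pairs
    call_index: position of the auto_functionalize call in `nodes`
    aliases:    dict mapping each base to the set of names that alias it
    """
    safe = []
    for base, names in aliases.items():
        used_later = any(
            names & set(reads) for _, reads in nodes[call_index + 1:]
        )
        if not used_later:
            safe.append(base)
    return safe


aliases = {"t": {"t", "x", "y"}}

# Only the fresh outputs are read after the call, so t can be mutated in place.
graph_ok = [("x", ["t"]), ("y", ["t"]), ("af", ["t"]), ("ret", ["new_x", "new_y"])]

# The old view x is still read after the call, so t must be copied.
graph_blocked = [("x", ["t"]), ("y", ["t"]), ("af", ["t"]), ("ret", ["x"])]
```

In the first graph the copy in step 1 can be skipped; in the second it cannot, because a stale view of the base is still observable.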

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134409
Approved by: https://github.com/zou3519
2024-09-04 17:08:58 +00:00
195ac85fb6 [Profiler] Allow kwinputs to be non-string values (#134893)
Summary: When we process keyword arguments in profiler today we assume that all values will be strings. This breaks HTA because it assumes that "stream" and other values similar to it will be ints. To fix this we will only put quotes around strings for ivalues.
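The difference between the two behaviors can be illustrated in Python, although the profiler itself is implemented in C++. Both function names are hypothetical; the sketch only shows why quoting every value breaks consumers that expect integers.

```python
import json


def format_kwinputs_quoted(kwargs):
    """Behavior as described for the old code: every value is wrapped in quotes."""
    return "{" + ", ".join(f'"{k}": "{v}"' for k, v in kwargs.items()) + "}"


def format_kwinputs_typed(kwargs):
    """Quote only strings; numbers are emitted as bare JSON literals."""
    return "{" + ", ".join(f'"{k}": {json.dumps(v)}' for k, v in kwargs.items()) + "}"


old_trace = json.loads(format_kwinputs_quoted({"stream": 7}))
new_trace = json.loads(format_kwinputs_typed({"stream": 7, "name": "nccl"}))
```

After parsing, `old_trace["stream"]` is the string `"7"`, while `new_trace["stream"]` is the integer `7`, which is what a tool such as HTA expects.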

Test Plan: Add chrome trace export in unit tests and check that stream does not have quotes around it

Differential Revision: D62056059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134893
Approved by: https://github.com/sanrise, https://github.com/izaitsevfb
2024-09-04 16:34:10 +00:00
60dfe1b35e Fix lint after Bump actions/download-artifact update (#135109)
Fixes lint after auto-generated PR: 367a78495f

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135109
Approved by: https://github.com/ezyang, https://github.com/huydhn
2024-09-04 15:26:17 +00:00
8bfd4916d6 fast path for sympy gcd in floordiv (#134880)
Summary:
Re-implementation of https://github.com/pytorch/pytorch/pull/134150, which was reverted because of some internal tests hanging (case B). The original motivation was to get some other internal test unstuck (case A).

The root cause is that sympy.gcd is very clever but can also blow up in some cases. This PR introduces a fast path with an appropriate fallback to sympy.gcd that ensures that both cases A and B go through.

Test Plan:
See the included test for specific examples.
Also https://fb.workplace.com/groups/1075192433118967/posts/1491493248155548/?comment_id=1491938994777640&reply_comment_id=1492622821375924

Differential Revision: D62043315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134880
Approved by: https://github.com/ezyang
2024-09-04 14:56:49 +00:00
67208f08bd [CD] Enable XPU nightly build on Windows (#134312)
Depends on https://github.com/pytorch/builder/pull/1975 landing. Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134312
Approved by: https://github.com/atalman
2024-09-04 14:46:36 +00:00
6c5669903f Fix Invalid NaN comparison due to infinity-zero multiply on latest sympy (#135044)
Fixes https://github.com/pytorch/pytorch/issues/133735

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135044
Approved by: https://github.com/zou3519
2024-09-04 14:13:09 +00:00
a178a053ad Ignore fresh unbacked when doing recursive make_fx inside HOPs (#135053)
Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7705964779531357/

I'm not sure this is the right approach though...

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135053
Approved by: https://github.com/ydwu4
ghstack dependencies: #134407
2024-09-04 13:25:08 +00:00
46cb2af7d8 Compute and do renamings even when ignoring fresh unbacked symbols (#134407)
This is a bit twisty and I don't entirely understand the situation, but here's my best explanation.

In https://github.com/pytorch/pytorch/pull/133588 I am trying to fix a problem reported by user in https://fb.workplace.com/groups/6829516587176185/permalink/7705964779531357/ The summary of this problem is that when we do collect metadata analysis in AOTAutograd, we accumulate pending unbacked symbols which are going to be discarded at the end of the trace. However, if we do a recursive make_fx inside tracing, as occurs with torch.cond, we end up seeing that there are pending unbacked symbols that aren't associated with a binding, even though it's spurious (they've leaked into the inner make_fx call from the outer AOTAutograd analysis).

In #133588 I tried to just prevent adding the symbols to the pending list at all in the first place. But this itself caused some problems which were fixed in https://github.com/pytorch/pytorch/pull/124785 . The problem fixed in that PR is that when we allocate tangents that have unbacked size, something prevented them from having correct unbacked SymInts when ignore fresh unbacked SymInts was enabled. So I had patched it at the time by just not suppressing pending symbols and clearing them out some other way.

I think... I was wrong in that PR? That is to say, it was OK to avoid putting the fresh unbacked symbols in the pending list; the real problem was suppressing unbacked renamings. But there doesn't seem to be a good reason to suppress these; this PR shows that it doesn't actually fail any tests if you do these anyway. Intuitively, this makes sense, because you can't trigger renamings unless you're actually adding unbacked symbols to the pending set.

But I don't entirely understand all the interactions. I just know that this seems to not cause tests to fail, and it should fix the internal issue (which I need to add a UT for.)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134407
Approved by: https://github.com/ydwu4
2024-09-04 13:25:07 +00:00
5690f003a6 C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED and C10_DIAGNOST should be used in pairs (#135004)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135004
Approved by: https://github.com/aaronenyeshi
2024-09-04 13:14:23 +00:00
dcf05fcb14 Fix stale job using non-existent ARC runner (#134863)
The ARC CI system has been shut down, so this job is currently using a runner that doesn't exist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134863
Approved by: https://github.com/ZainRizvi
2024-09-04 12:57:10 +00:00
a8467c17c3 Remove specific lazy initialization of PrivateUse1 (#135002)
As the title states, the specific lazy initialization of PrivateUse1 can be
removed because maybe_initialize_device already supports PrivateUse1.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135002
Approved by: https://github.com/albanD
2024-09-04 11:45:45 +00:00
80a6d60829 Moving _run_autocast_outofplace to basic class named TestAutocast to reduce redundance (#134460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134460
Approved by: https://github.com/EikanWang, https://github.com/ezyang
2024-09-04 10:48:58 +00:00
c2ff9fe042 [fp8 rowwise] Retune the tile heuristics to increase perf (#134781)
I propose a new heuristic function to select the tile size, cluster size, and transposition given M, N and K. It improves the performance across the board (on average) while remaining simple and relying only on a handful of kernels (to limit build time and binary size).

Across the shapes I benchmarked, the new heuristic gives a (geometric) mean speedup of +16.5%. Some shapes worsen, but 98.6% of the shapes retain their old performance (up to 5% to allow for noise) or improve it.
![image](https://github.com/user-attachments/assets/bca30583-ac32-4af6-a4f9-37164bdb2430)
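For readers unfamiliar with the two summary statistics quoted above, here is a minimal sketch of how they can be computed from paired latencies. The function names are hypothetical and the numbers are made up purely for illustration; they are not the benchmark results.

```python
import math


def geometric_mean_speedup(old_latencies, new_latencies):
    """Geometric mean of the per-shape ratios old / new (values above 1 mean faster)."""
    ratios = [old / new for old, new in zip(old_latencies, new_latencies)]
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def fraction_not_regressed(old_latencies, new_latencies, tolerance=0.05):
    """Share of shapes whose speedup ratio is at least 1 - tolerance."""
    ratios = [old / new for old, new in zip(old_latencies, new_latencies)]
    return sum(1 for r in ratios if r >= 1.0 - tolerance) / len(ratios)


# Illustrative latencies only: one 2x speedup, one 2x slowdown, two unchanged.
old = [2.0, 4.0, 1.0, 1.0]
new = [1.0, 4.0, 2.0, 1.0]
mean_speedup = geometric_mean_speedup(old, new)
kept = fraction_not_regressed(old, new)
```

The geometric mean treats a 2x speedup and a 2x slowdown symmetrically, so they cancel out here; that symmetry is why it is the usual choice for averaging ratios.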

I benchmarked on over 5.4k different shapes:
- For M and N I swept across all values which are the sums of two powers of 2 (limited to multiples of 64, capped at 16,384)
- For K I only used powers of 2 between 1,024 and 8,192 (based on the intuition that the optimal config doesn't depend on K, which turned out to be the case)

Here's the detailed speedup for each shape
![image](https://github.com/user-attachments/assets/acac4318-9ee0-455d-861b-c764b8c13d22)

<details>
<summary>
This is the code I used to benchmark
</summary>

```
import torch
import torch.utils.benchmark

s = set()

for i in range(6, 15):
    s.add(2**i)
    for j in range(6, i):
        s.add(2**i + 2**j)

ms = [i for i in sorted(s) if i <= 2**14]
ns = [i for i in sorted(s) if i <= 2**14]
ks = [2**i for i in range(10, 14)]

def make_graph(n_iters, f):
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        for _ in range(n_iters):
            f()
    return g

def rowwise_scale(t, dtype_t):
    min_v, max_v = torch.finfo(dtype_t).min, torch.finfo(dtype_t).max
    scale_t = torch.clamp(t.abs().amax(dim=-1, keepdim=True).float(), min=1e-12) / max_v
    t_fp8 = (t / scale_t).clamp(min=min_v, max=max_v).to(dtype_t)
    return t_fp8, scale_t

for m in ms:
    for n in ns:
        for k in ks:
            a = torch.randn((m, k), device="cuda", dtype=torch.float)
            b_t = torch.randn((n, k), device="cuda", dtype=torch.float)
            a_fp8, scale_a = rowwise_scale(a, torch.float8_e4m3fn)
            b_t_fp8, scale_b_t = rowwise_scale(b_t, torch.float8_e4m3fn)
            func = lambda: torch._scaled_mm(
                a_fp8,
                b_t_fp8.t(),
                scale_a=scale_a,
                scale_b=scale_b_t.t(),
                bias=None,
                use_fast_accum=True,
                out_dtype=torch.bfloat16
            )
            print(f"{m=},{n=},{k=}")
            print(torch.utils.benchmark.Timer("g.replay()", globals={"g": make_graph(1000, func)}).blocked_autorange(min_run_time=1).mean / 1000)
```
</details>

<details>
<summary>
This is the code I used for the plots
</summary>

```
from itertools import islice

import numpy as np  # used below as np.log, np.exp, np.histogram, ...
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
from matplotlib.colors import FuncNorm
from mpl_toolkits.axes_grid1 import ImageGrid

def batched(iterable, n):
    iterator = iter(iterable)
    while batch := tuple(islice(iterator, n)):
        yield batch

def try_to_convert(v):
    if v == "False":
        return False
    if v == "True":
        return True
    return int(v)

def get_from_paste(filename):
    text = open(filename, "rt").read()
    headers = []
    data = []
    for config, value in batched(text.splitlines(), 2):
        config_elems = config.split(",")
        if not headers:
            headers = [e.partition("=")[0] for e in config_elems]
        data.append((*(try_to_convert(e.partition("=")[-1]) for e in config_elems), float(value)))
    return pd.DataFrame(data, columns=headers + ["latency"])

old_latencies = get_from_paste(...)
new_latencies = get_from_paste(...)

ratios = pd.merge(new_latencies, old_latencies, how="left", left_on=["m", "n", "k"], right_on=["m", "n", "k"], suffixes=("_new", "_old"))
ratios = ratios.assign(ratio=ratios.latency_old / ratios.latency_new)

fig = plt.figure(figsize=(40.0, 10.0))
grid = ImageGrid(
    fig,
    111,
    nrows_ncols=(1, 4),
    axes_pad=0.5,
    share_all=True,
    cbar_location="right",
    cbar_mode="single",
    cbar_size="7%",
    cbar_pad=0.15,
)

log_amax = np.max(np.abs(np.log(ratios.ratio.to_numpy())))

for K, ax in zip([1024, 2048, 4096, 8192], grid):
    pivoted = ratios[(ratios.k == K)].pivot_table(index="m", columns="n", values="ratio")
    im = ax.imshow(np.log(pivoted.to_numpy()), origin="lower", vmin=-log_amax, vmax=log_amax, cmap="PiYG")
    m_vals, n_vals = pivoted.axes
    ax.set_xticks(np.arange(len(n_vals)), labels=[f"N={i}" for i in n_vals.values], fontsize=12)
    ax.set_yticks(np.arange(len(m_vals)), labels=[f"M={i}" for i in m_vals.values], fontsize=12)
    plt.setp(ax.get_xticklabels(), rotation=90, ha="right", rotation_mode="anchor")
    ax.grid(False)
    ax.set_title(f"K={K}", fontsize=20)

norm = FuncNorm((lambda x: np.log(x), lambda x: np.exp(x)), np.exp(-log_amax), np.exp(log_amax))
ax.cax.colorbar(ScalarMappable(norm=norm, cmap="PiYG"))
plt.show()

counts, bins = np.histogram(np.log(ratios.ratio.to_numpy()), bins=500)
plt.stairs(counts, np.exp(bins), fill=True)
plt.xscale("function", functions=(lambda x: np.log(x), lambda x: np.exp(x)))
```
</details>

I only benchmarked fast_accum=True and out_dtype=torch.bfloat16, supposing that these are the most commonly used flags (e.g., with fast_accum=False row-wise scaling is much slower than tensor-wise scaling, hence impractical).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134781
Approved by: https://github.com/drisspg, https://github.com/eqy
ghstack dependencies: #134773
2024-09-04 09:17:28 +00:00
eec8fa038e [fp8 rowwise] Support transposing operands in order to change output layout (#134773)
On some occasions, a column-major output layout is more efficient (it's unclear whether that's because of better store coalescing for some tile shapes, or just because it's CUTLASS's default and thus better optimized).

At this stage I only add a flag that allows transposing; the hardest part will be deciding on a new heuristic to turn it on selectively. This will be in a follow-up PR.
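The identity that makes this kind of operand swap possible is (A·B)ᵀ = Bᵀ·Aᵀ: computing the swapped, transposed product in row-major order yields the same bytes as the original product in column-major order. A minimal pure-Python sketch of that fact follows; it is generic linear algebra and an assumption about the approach, not the CUTLASS kernel code.

```python
def matmul(a, b):
    """Naive matrix product of two lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(m):
    return [list(row) for row in zip(*m)]


def flatten_row_major(m):
    return [v for row in m for v in row]


def flatten_col_major(m):
    return flatten_row_major(transpose(m))


a = [[1, 2, 3], [4, 5, 6]]        # 2x3
b = [[7, 8], [9, 10], [11, 12]]   # 3x2
c = matmul(a, b)                             # the usual product A @ B
c_t = matmul(transpose(b), transpose(a))     # swapped, transposed operands
```

The row-major buffer of `c_t` is element-for-element the column-major buffer of `c`, so a kernel can change its output layout without a separate transpose pass.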
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134773
Approved by: https://github.com/drisspg
2024-09-04 09:17:28 +00:00
679b8fe426 Update generate-xnnpack-wrappers.py parsing to handle build identifier (#134724)
Fixes an issue after updating XNNPACK where parsing the XNNPACK CMakeLists breaks. I'm just ignoring the generated build identifier for now, since it's not used and we would need to update the buck build to generate it at build time.

Remove unused ukernels_xop XNNPACK target as it has no sources (after the recent update) and causes buck1 to complain.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134724
Approved by: https://github.com/mcr229
2024-09-04 08:45:46 +00:00
1dfb105239 restore CSE'd node metadata in runtime asserts pass (#134516)
Adds val, and optionally stack_trace & nn_module_stack metadata back to SymInt compute nodes that we CSE, with a hook on `graph.create_node()`. Not sure if there's other metadata we want to populate here?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134516
Approved by: https://github.com/ezyang
2024-09-04 05:56:28 +00:00
9f00317997 rationalize STATIC vs. None (#134877)
Summary:
A bit of refactoring to prepare to remove `None` as a way to specify static dimensions in dynamic shapes, given we already have `Dim.STATIC` for the same purpose. We will now warn whenever this happens. However no tests were modified because problematic uses of `None` still need to behave as they do today, until we are ready to remove support. It should be easy to port tests by replacing the warning function to raise instead.

Note that other uses of `None`, such as for entire values (tensor or non-tensor) remain as is. Moving forward this should be the only purpose of `None` (at least externally).

Finally, there's a bit of confusion in our representation now because `AUTO` also internally transforms to `None`. Renamed dynamic_shapes to transformed_dynamic_shapes where this happens. Overall the two forms (pre and post transformation) have different properties so should probably not be represented in the same format in the future.

Test Plan: existing

Differential Revision: D62040729

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134877
Approved by: https://github.com/pianpwk
2024-09-04 05:34:26 +00:00
9809080b9e [Reland] Refactor caching device allocator utils (#130923)
# Motivation
Following [[RFC] Intel GPU Runtime Upstreaming for Allocator](https://github.com/pytorch/pytorch/issues/116322), this PR aims to refactor the caching device allocator utils to improve code reuse.
This is the first PR; we could prepare some follow-up PRs to continue refactoring the device caching allocator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130923
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD, https://github.com/eqy
2024-09-04 05:31:08 +00:00
6448d351db [inductor] clean up cpp_builder code. (#134909)
Clean up duplicated code in cpp_builder.

Hi @henrylhtsang, could you please help land this internally?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134909
Approved by: https://github.com/henrylhtsang
2024-09-04 05:29:08 +00:00
2c9b4d2052 [executorch hash update] update the pinned executorch hash (#135077)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135077
Approved by: https://github.com/pytorchbot
2024-09-04 05:17:29 +00:00
6b05aafc57 Add specializations for VecMaskLoad and VecMaskCast (#126501)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126501
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #126500
2024-09-04 05:12:52 +00:00
ffd1e214df Back out "[FSDP2] Set ctx.set_materialize_grads(False) for post-backward (#133498)" (#135059)
Summary:
Original commit changeset: 96513cbc425f

Original Phabricator Diff: D61291210

There is some evidence that FB-FM-v4 has better NE with Set ctx.set_materialize_grads(False), especially when pairing up with prefetching.

See https://www.internalfb.com/intern/anp/view/?id=5732259

Test Plan:
export NUM_WORKERS=128
export BATCH_SIZE=1024
export CONFIG_FILE="mast_joint_arch_exploration_cmf_updated_fbfm_v3_fsdp2.yaml"

export ENTITLEMENT=ads_global_tc_2k_training_large_short
buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -c fbcode.platform010_cuda_version=12 -c hpc_comms.use_nccl=2.17.1 -- mode=${CONFIG_FILE} launcher.tags='[ads_ranking_taxonomy_monetization_genai]' launcher.data_project=pytorch_at_scale launcher.max_retries=10 launcher.fbl_entitlement=${ENTITLEMENT} launcher.oncall=pytorch_training_enablement launcher.hardware=GRANDTETON launcher.num_workers=${NUM_WORKERS} data_loader.dataset.batch_size=${BATCH_SIZE} training.planner.proposer=dynamic_col_dim training.planner.proposer.optim_target=hbm 2>&1| tee ~/tmp/log.mast

Differential Revision: D62009163

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135059
Approved by: https://github.com/awgu
2024-09-04 04:50:32 +00:00
cyy
c818ecd169 Remove Caffe2 code from tool scripts (#134941)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134941
Approved by: https://github.com/ezyang
2024-09-04 03:47:58 +00:00
9e6f4f3f77 [dynamo] Use __eq__ for backend match (#135039)
Fixes https://github.com/pytorch/pytorch/issues/131150

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135039
Approved by: https://github.com/jansel
2024-09-04 03:35:18 +00:00
367a78495f Bump actions/download-artifact from 2 to 4.1.7 in /.github/workflows (#135068)
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 2 to 4.1.7.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](https://github.com/actions/download-artifact/compare/v2...v4.1.7)

---
updated-dependencies:
- dependency-name: actions/download-artifact
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-03 20:33:57 -07:00
362ecd9817 [inductor] Skip the sub-process pool until it's ready (#133508)
Summary: Torch-compiling a quick script can be a bit slower than it needs to be: even though we initialize the subprocess pool early, it still might not be ready by the time we try to compile the first Triton kernel. Instead, let's use the single-threaded path until the pool has successfully completed a no-op job.
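
A minimal sketch of the fallback idea described above, using only the standard library. It is not Inductor's implementation: `WarmPool`, `_noop`, and `submit_or_run` are hypothetical names, and a `ThreadPoolExecutor` is assumed as a stand-in for the compile sub-process pool.

```python
from concurrent.futures import Future, ThreadPoolExecutor


def _noop() -> None:
    """Warm-up job: does nothing, but only completes once a worker is live."""


class WarmPool:
    """Hypothetical wrapper: run work inline until the pool has proven ready."""

    def __init__(self, pool) -> None:
        self._pool = pool
        # Submitted eagerly; its completion signals the pool can take real work.
        self._warmup = pool.submit(_noop)

    def ready(self) -> bool:
        return self._warmup.done() and self._warmup.exception() is None

    def submit_or_run(self, fn, *args) -> Future:
        if self.ready():
            return self._pool.submit(fn, *args)
        # Pool not ready yet: take the single-threaded path in the caller.
        fut: Future = Future()
        try:
            fut.set_result(fn(*args))
        except Exception as exc:
            fut.set_exception(exc)
        return fut
```

Callers receive a `Future` either way, so the code that consumes results does not need to know which path was taken.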

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133508
Approved by: https://github.com/Chillee
2024-09-04 03:26:55 +00:00
7600e9b36f [ONNX] Use the stable APIs in onnxscript and sync the latest logic (#134782)
Use the stable apis from onnxscript: https://github.com/microsoft/onnxscript/issues/1827
Sync with torch-onnx at https://github.com/justinchuby/torch-onnx/compare/v0.1.12...v0.1.15.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134782
Approved by: https://github.com/titaiwangms
2024-09-04 03:10:20 +00:00
982e27e532 [halide-backend] Update CI pin (#130258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130258
Approved by: https://github.com/eellison
2024-09-04 03:08:49 +00:00
ae3aa8ff73 [AOTI][Tooling][5/n] Refactor the debug printer call to a level lower (#134789)
Summary:
1. Move the debug printer call a level lower, to here: https://www.internalfb.com/code/fbsource/[931d7bbb9e7cf2dcb926f42718f56fc940903eec]/fbcode/caffe2/torch/_inductor/codegen/cpp_wrapper_cuda.py?lines=335
2. Add UT for validating debug printer for user defined triton kernel codegen

Having the debug printer call happen at a more centralized place has two benefits: 1) it reduces the duplicated debug-printer logic scattered throughout the codebase, and 2) it can handle more Triton kernel codegen paths, as long as they invoke `generate_kernel_call()`. For example, it automatically supports debug printing for user_defined_kernel, which is a pretty common use case we encounter in debugging.

Test Plan:
```AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_user_defined_triton_kernel_abi_compatible_cuda```

Also verified that templateKernel codegen path still works

Differential Revision: D61949020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134789
Approved by: https://github.com/ColinPeppler
2024-09-04 02:41:30 +00:00
ea89f01281 Remove unused comment (#135034)
As part of my rampup I've been reading through some of @ezyang's diffs. I noticed in https://github.com/pytorch/pytorch/pull/133439 there was a comment that he forgot to remove. This diff removes that comment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135034
Approved by: https://github.com/albanD
2024-09-04 02:32:26 +00:00
175485097a [EASY] Typofix (#135022)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135022
Approved by: https://github.com/albanD
2024-09-04 01:59:40 +00:00
15c25c4580 Fix dim mismatch logic automatic dynamic not working with compiler collectives (#135025)
Fixes
https://fb.workplace.com/groups/3095840833991792/permalink/3810738595835342/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135025
Approved by: https://github.com/albanD
2024-09-04 01:50:21 +00:00
4ebf6b04a8 Turn on expanded index path for Half on CPU (#133553)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133553
Approved by: https://github.com/yanbing-j, https://github.com/jgong5, https://github.com/peterbell10
2024-09-04 00:56:56 +00:00
e000cf0ad9 Fix license metadata in setup.py (#129219)
Package metadata in setup.py lists license as BSD-3 which is not a valid SPDX id. The correct id would be BSD-3-Clause.

Specifying an SPDX id is beneficial to license compliance scanning.

*Taking up #129123 from my personal account.*
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129219
Approved by: https://github.com/malfet, https://github.com/kit1980
2024-09-04 00:21:22 +00:00
45743019cf [PT2][Optimus] Skip meta update on symbolic shape (#134975)
Summary: We noticed that the dim broadcast raises a runtime error when the meta example value has a symbolic shape, so we skip the meta update in that case.

Test Plan:
```
buck2 run mode/opt //caffe2/benchmarks/dynamo/fb:torchbench_run_ads_dhen_5x_training -- -m ads_dhen_5x -t training
```

P1559019921

Differential Revision: D62115015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134975
Approved by: https://github.com/xuzhao9
2024-09-04 00:05:51 +00:00
9ffcca7060 [Profiler] Handle Tensor Sizes/Strides Parsing Error (#134862)
Summary:
Currently some jobs are encountering the following trace, P1539415198. This suggests that when we parse tensors, the path is prone to encountering an invalid address. This is possibly occurring because the sizes() and strides() of a Tensor do not have the same number of dimensions, which we assume when iterating through the shapes to get the IValue generator. When browsing some of the tensor implementations, I found that some of the size and stride paths are different, which could be the cause of this issue. Regardless, the profiler should be flexible enough to handle such issues without bringing down the whole main thread.

If the crashes still persist, this will still give us a data point as to where they are occurring, and we can rule out the strides/sizes as the culprit.

Test Plan: This change doesn't break anything in the happy path; it just makes sure the bad path is not exited abruptly. We should use this to debug which events have mismatching dimensions between sizes and strides.

Differential Revision: D62008788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134862
Approved by: https://github.com/aaronenyeshi
2024-09-03 23:46:38 +00:00
f05b716d6d Add validator to ensure runner determinator script is kept in sync (#134800)
We keep two copies of the runner-determinator script:
1. In runner_determinator.py, for ease of testing.  This however is not actually executed during CI
2. Embedded in _runner-determinator.yml.  This is what CI uses.

Why the duplication? Short version: Because of how github CI works, during a given CI run the workflow yml files could actually come from the main branch, while the remaining files get read from the local commit.
This can lead to a newer version of _runner-determinator.yml trying to invoke an older version of runner_determinator.py than it was actually designed for. Chaos ensues.

We mitigate this by embedding the script into the yml file.  But we still keep the script around because it's much easier to run tests against.

This workflow's job is to ensure that if one edits the script in one of those two locations then they remember to update it in the other location as well
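
A minimal sketch of what such a sync check could look like. This is not the validator added by the PR: the function names are hypothetical, and it assumes the embedded copy sits between two known marker lines in the workflow file (for example a heredoc opener and its terminator).

```python
import textwrap


def extract_embedded_script(workflow_text: str, begin: str, end: str) -> str:
    """Return the dedented text between two marker lines (markers are assumed)."""
    lines = workflow_text.splitlines()
    start = next(i for i, line in enumerate(lines) if line.strip() == begin)
    stop = next(i for i in range(start + 1, len(lines)) if lines[i].strip() == end)
    return textwrap.dedent("\n".join(lines[start + 1 : stop])).strip()


def scripts_in_sync(script_text: str, workflow_text: str, begin: str, end: str) -> bool:
    """True when the standalone script matches the copy embedded in the workflow."""
    return script_text.strip() == extract_embedded_script(workflow_text, begin, end)
```

A CI job built on this would read both files, call `scripts_in_sync`, and fail when it returns `False`.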
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134800
Approved by: https://github.com/zxiiro, https://github.com/PaliC
ghstack dependencies: #134796
2024-09-03 23:29:04 +00:00
469429b959 Refactor runner determinator (#134796)
Some minor refactorings to make the code easier to parse and easier to add unit tests for.  Keeping this as a separate PR for ease of review, since it should have zero functional behavior changes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134796
Approved by: https://github.com/zxiiro, https://github.com/PaliC
2024-09-03 23:29:04 +00:00
c044deb9ce Revert "c10d/logging: add C10D_LOCK_GUARD (#134131)"
This reverts commit f33bcbe5fd67e6b18be259ad2f0dc11c74157075.

Reverted https://github.com/pytorch/pytorch/pull/134131 on behalf of https://github.com/kit1980 due to See D61985186 ([comment](https://github.com/pytorch/pytorch/pull/134131#issuecomment-2327556381))
2024-09-03 22:35:14 +00:00
2fd36086bc Revert "Add torch.serialization.skip_data context manager (#134504)"
This reverts commit 94db935749b8de99d8c3ab23fb880c67c8f3e67a.

Reverted https://github.com/pytorch/pytorch/pull/134504 on behalf of https://github.com/kit1980 due to See D62082697 ([comment](https://github.com/pytorch/pytorch/pull/134504#issuecomment-2327542276))
2024-09-03 22:21:27 +00:00
85fa019697 [Docs] Fix call to deprecated function (#135037)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135037
Approved by: https://github.com/janeyx99, https://github.com/jbschlosser
2024-09-03 20:57:11 +00:00
14c8ef5198 autolabel aotinductor->export (#135040)
"module: aotinductor" will automatically add "oncall: export".

Test Plan:
- none
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135040
Approved by: https://github.com/ydwu4
2024-09-03 20:17:51 +00:00
c40e622966 [inductor] add openmp config for intel compiler on Linux. (#134973)
Config `openmp` for Intel Compiler on Linux.

Based on this PR, we can confirm that the Intel optimized libraries build and work well.
<img width="1039" alt="image" src="https://github.com/user-attachments/assets/838d5114-c778-4961-9cfe-39a814647089">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134973
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-03 20:10:21 +00:00
272f3b9fe1 [FlexAttention] Update tolerance for failing test (#135035)
Summary: Address: T198937061

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:flex_attention -- --exact 'caffe2/test/inductor:flex_attention - test_no_q_info_compile_False (caffe2.test.inductor.test_flex_attention.TestBlockMask)' --run-disabled

Differential Revision: D62137797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135035
Approved by: https://github.com/Chillee
2024-09-03 20:09:21 +00:00
e7731b3f8a [TorchElastic] make torch elastic not have to realize TCPStore backend type and rely on c10d to decide which backend to use (#134882)
D53335860 and D56435815 added an option to torch elastic allowing users to choose a TCPStore backend type to use via
1) explicit argument passing in user code when instantiating `MastRendezvousHandler`
2) pass `--use_libuv` command line argument to `torchrun`.

The motivation was to offer a quick way to roll back to the non-libuv TCPStore backend, since we were making libuv the default in `c10d` code. We now think it is better for torch elastic not to be aware of the TCPStore backend type at all, and instead to rely on `c10d`'s mechanism to decide which backend to use. The TCPStore backend type used by torch elastic will then be identical to the one used in PyTorch.

PyTorch TCPStore uses the environment variable `USE_LIBUV` to determine the backend type:
when `USE_LIBUV="0"`, the non-libuv backend will be used.
when `USE_LIBUV="1"`, the libuv backend will be used. And this is the default option.
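
The rule above can be sketched as a small helper. This is an illustration only, not the `c10d` code: `libuv_enabled` is a hypothetical name, and rejecting values other than `"0"` and `"1"` is a choice made for this sketch because the description does not cover them.

```python
import os


def libuv_enabled(environ=os.environ) -> bool:
    """Mirror the stated rule: "0" selects non-libuv, "1" (the default) selects libuv."""
    value = environ.get("USE_LIBUV", "1")  # libuv is the stated default
    if value not in ("0", "1"):
        # Not covered by the description above; this sketch rejects it.
        raise ValueError(f"unsupported USE_LIBUV value: {value!r}")
    return value == "1"
```

Under this rule, a rollback to the non-libuv backend amounts to exporting `USE_LIBUV=0` before the job starts.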

Differential Revision: [D58259590](https://our.internmc.facebook.com/intern/diff/D58259590/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134882
Approved by: https://github.com/shuqiangzhang
2024-09-03 19:43:21 +00:00
71383dd3da [MPS] Fix batchnorm_2d for channels last (#134618)
By skipping gather of input tensor if memory_layout is channels_last, which is a first step towards fixing  https://github.com/pytorch/pytorch/issues/134580

The underlying problem is much more interesting, though: MPS does not have generic support for channels last, but `c10::is_contiguoius()` is true for the channels last layout.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134618
Approved by: https://github.com/albanD
2024-09-03 19:20:11 +00:00
758d787901 Added complex support for torch.logsumexp (#133187)
Added complex support for `torch.logsumexp`. Implemented complex backward pass for `torch.logsumexp`.
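
As a reference for what the operation computes on complex inputs, here is a minimal standard-library sketch of `log(sum(exp(z)))`. It is not the PyTorch kernel: `logsumexp_complex` is a hypothetical name, and the sketch assumes the principal branch of the complex logarithm and a non-empty input with finite real parts.

```python
import cmath


def logsumexp_complex(values):
    """log(sum(exp(z))) for complex z, shifted by the largest real part for stability."""
    values = [complex(v) for v in values]
    shift = max(v.real for v in values)  # real shift, so the argument is unchanged
    total = sum(cmath.exp(v - shift) for v in values)
    return shift + cmath.log(total)
```

Because the shift is real, the result has the same imaginary part as the unshifted formula; the shift only keeps `exp` from overflowing for large real parts.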

Fixes #133047

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133187
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-09-03 17:28:36 +00:00
6c3767452d Move auto functionalize tests in their own test file (#134834)
title + use `with torch.library._scoped_library as lib` when needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134834
Approved by: https://github.com/zou3519
ghstack dependencies: #134831
2024-09-03 17:09:03 +00:00
2e0b114c06 add a new Gauge API with an empty backend to PyTorch core (#134883)
Summary:
The current use case is to continuously measure the total allocated and reserved CUDA memory size from CUDACachingAllocator, and export their distribution (min, max, p90 etc) over time as timeseries.

The current callback-based API does not work because the backend decides when the measurement is taken, so data points between two measurements may not be recorded. The distribution (e.g. max) as such will not be accurate.

Otherwise, this new API closely follows the design of the existing WaitCounter API.

This is not quite a synchronous version of DynamicCounter, as summing multiple data points does not make sense for my use case.

Test Plan: CI

Differential Revision: D61837528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134883
Approved by: https://github.com/c-p-i-o
2024-09-03 17:08:47 +00:00
7804c089c6 [BE] Update numpy version to 2.0.2 (#134875)
It's long past time to abandon the pre-release version

Partially addresses https://github.com/pytorch/pytorch/issues/134868
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134875
Approved by: https://github.com/justinchuby, https://github.com/clee2000, https://github.com/kit1980, https://github.com/atalman, https://github.com/Skylion007
2024-09-03 17:02:35 +00:00
1b9f51bd88 [ONNX] Bump onnxscript version in CI; temporarily remove op test (#133748)
Bump onnxscript version in CI to 0.1.0.dev20240831, and temporarily remove the fx consistency test. We will add a better version back later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133748
Approved by: https://github.com/titaiwangms
2024-09-03 16:30:07 +00:00
27677ead7c Revert "[ONNX] Bump onnxscript version in CI; temporarily remove op test (#133748)"
This reverts commit 6eed63c8b9c4f54a573bb51960d252cd42bfab0c.

Reverted https://github.com/pytorch/pytorch/pull/133748 on behalf of https://github.com/ZainRizvi due to The version bump appears to be pulling in an unavailable numpy version? [GH job link](https://github.com/pytorch/pytorch/actions/runs/10686076754/job/29620426371) [HUD commit link](6eed63c8b9) ([comment](https://github.com/pytorch/pytorch/pull/133748#issuecomment-2326932868))
2024-09-03 16:19:47 +00:00
a258844a32 Properly handle empty CPUINFO variable (#134916)
Fixes https://github.com/pytorch/pytorch/issues/134915

But I did not root cause why CPUINFO is totally empty to begin with...

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134916
Approved by: https://github.com/Skylion007
2024-09-03 15:59:59 +00:00
f927bcb934 Revert "[Inductor] Apply loop split optimization in codegen_node (#132389)"
This reverts commit 3cb5d251224b3fb59b5a10c6fefbb4c84eb565a6.

Reverted https://github.com/pytorch/pytorch/pull/132389 on behalf of https://github.com/ZainRizvi due to Hi, this seems to be breaking in trunk. See test_dataloader.py::TestDataLoader::test_segfault [GH job link](https://github.com/pytorch/pytorch/actions/runs/10660461216/job/29556282081) [HUD commit link](de3a641476) ([comment](https://github.com/pytorch/pytorch/pull/132389#issuecomment-2326843129))
2024-09-03 15:40:45 +00:00
6eed63c8b9 [ONNX] Bump onnxscript version in CI; temporarily remove op test (#133748)
Bump onnxscript version in CI to 0.1.0.dev20240831, and temporarily remove the fx consistency test. We will add a better version back later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133748
Approved by: https://github.com/titaiwangms
2024-09-03 15:33:09 +00:00
33ba952e31 [subclasses] Do not fakeTensor const prop subclass args (#134855)
The issue:

Const propagation only checks that the arguments contain no FakeTensor. If an argument is a tensor subclass, it passes this check.

As a result, const propagation executes without FakeTensorMode, and a tensor created by a tensor factory inside Subclass.__torch_dispatch__ is not fakified.

Solution:

If any of the arguments is a subclass, do not treat const propagation as doable

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134855
Approved by: https://github.com/zou3519
2024-09-03 13:31:49 +00:00
2a49296d75 Fix set_unbacked_bindings when list of Tensors is returned (#133585)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133585
Approved by: https://github.com/albanD
2024-09-03 12:23:31 +00:00
2443507acc Update torch-xpu-ops pin (ATen XPU implementation) (#134983)
Release cycle for PyTorch 2.5
1. Enable Windows build in latest torch-xpu-ops. Resolved large bin issue.
2. Refine test infrastructure for compatibility on different HW platforms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134983
Approved by: https://github.com/EikanWang
2024-09-03 12:14:37 +00:00
39935e0fde Update cpuinfo submodule (#134891)
Last time it was done in June by https://github.com/pytorch/pytorch/pull/127505
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134891
Approved by: https://github.com/Skylion007
2024-09-03 09:29:59 +00:00
23a2161ad1 Changed addmv to be a decomposition and not a fallback (#134823)
Overall seems to be faster

![image](https://github.com/user-attachments/assets/0cbea76e-fb78-4634-9265-047de0291549)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134823
Approved by: https://github.com/jansel
ghstack dependencies: #134813, #134818, #134819
2024-09-03 06:33:31 +00:00
9856bc50a2 Switch nanmedian to not cuda synchronize (#134819)
Generally, this seems to be faster.

![image](https://github.com/user-attachments/assets/43a86c6f-236d-4ba1-aae0-14e3d88ae401)

And as an added benefit, it works great with cudagraphs and such :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134819
Approved by: https://github.com/Skylion007, https://github.com/eqy
ghstack dependencies: #134813, #134818
2024-09-03 06:33:31 +00:00
6fce1faa10 change multinomial to use async asserts instead of a synchronization (#134818)
Fixes https://github.com/pytorch/pytorch/issues/134442

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134818
Approved by: https://github.com/ezyang
ghstack dependencies: #134813
2024-09-03 06:33:24 +00:00
db193d1e29 add msg to _assert_async (#134813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134813
Approved by: https://github.com/ezyang, https://github.com/eqy, https://github.com/albanD
2024-09-03 06:33:18 +00:00
d14fe3ffed [Inductor][CPP] Turns on inline_inbuilt_nn_modules for CPP GEMM template testing (#132487)
**Summary**
CPP GEMM template testing has been skipped when `inline_inbuilt_nn_modules` is turned on, as described in https://github.com/pytorch/pytorch/issues/131929. Now that https://github.com/pytorch/pytorch/pull/132334 has landed to fix the issues, turn this flag back on, since it is the default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132487
Approved by: https://github.com/anijain2305, https://github.com/jgong5
2024-09-03 05:05:50 +00:00
a00fad0177 Add specializations for vectorized conversion between float and BF16/FP16 (#126500)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126500
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-03 02:09:12 +00:00
45f11094b6 [ONNX] Delete op_level_debug from torch.onnx.ExportOptions (#134961)
op_level_debug helped to identify missing operators and wrongly implemented operators at a time when the dynamo exporter relied on nearest matching and torchlib had just been created. Now that the dispatcher logic has improved and torchlib has matured, we no longer need it.

PS: the op-level-debug diagnostics rule is not deleted in this PR, as it auto-generates lint error code and needs more time to fix. We can delete it when we retire sarif.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134961
Approved by: https://github.com/justinchuby
2024-09-02 23:38:39 +00:00
4c1dd13ba3 [BE] better type annotation for torch.types (#129559)
Closes #129525

- #129525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129559
Approved by: https://github.com/ezyang
2024-09-02 15:35:32 +00:00
76710d4f95 Corrected docstring of `solve_triangular` (#129766)
**Description**
The arguments docstring of [torch.linalg.solve_triangular](https://pytorch.org/docs/stable/generated/torch.linalg.solve_triangular.html#torch.linalg.solve_triangular) incorrectly describes the shape of the ``A`` argument if the option ``left=True``.

The argument ``A`` should have shape $k \times k$ if ``left=False`` in line with the rest of the docstring and the implementation.
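
For reference, the two cases can be written out as follows, taking $B$ to have shape $n \times k$ as in the rest of the docstring and omitting batch dimensions. The symbol $\mathbb{K}$ for the scalar field is notation assumed here for illustration.

$$
\texttt{left=True}: \quad AX = B, \qquad A \in \mathbb{K}^{n \times n}, \quad X, B \in \mathbb{K}^{n \times k}
$$

$$
\texttt{left=False}: \quad XA = B, \qquad A \in \mathbb{K}^{k \times k}, \quad X, B \in \mathbb{K}^{n \times k}
$$

In both cases the product is only defined when the side of $A$ matches the dimension of $X$ it multiplies, which is why the shape of $A$ depends on `left`.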
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129766
Approved by: https://github.com/lezcano
2024-09-02 13:30:30 +00:00
ee03530fd9 Add a test to avoid decorator based regression for cprofile traces (#133086)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133086
Approved by: https://github.com/aorenste
2024-09-02 12:53:34 +00:00
FEI
16de25b1dc fix tensor_repr(at::Tensor) (#134762) (#134764)
Fixes #134762
@ezyang @antocuni
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134764
Approved by: https://github.com/ezyang

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2024-09-02 06:05:10 +00:00
3daca187aa [Inductor] Allow customizing the padding format (#133939)
Based on https://github.com/pytorch/pytorch/pull/130956.

Inductor already supports padding through the `config.comprehensive_padding` option, but the padding format involves a few heuristics that are specific to Nvidia GPUs:
  - When we pad, it is always aligned to the next multiple of 128 bytes.
  - Strides smaller than 1024 are not padded.
  - Only intermediate values are padded, not outputs.

 The last of these is not really GPU-specific, but there are certain cases where we may want to override it. For example, padding outputs is useful on hardware accelerators with specific memory alignment requirements, or for applications where performance is more important than conformity with eager mode.

 This PR surfaces padding parameters up to Inductor's config module, so the user can control them.
   - `config.pad_outputs`: choose whether to pad outputs (default: `False`)
   - `config.padding_alignment_bytes`: choose the alignment size for padding (default: `128`)
   - `config.padding_stride_threshold`:  choose the smallest stride that we will pad. For example, setting this to 0 will pad all unaligned strides. (default: `1024`)
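
The arithmetic behind these three options can be sketched as follows. This is an illustration under stated assumptions, not Inductor's code: `pad_stride` is a hypothetical helper, strides and the threshold are assumed to be measured in elements, and the alignment is converted from bytes to elements using the dtype's item size.

```python
def pad_stride(stride: int, itemsize: int,
               alignment_bytes: int = 128, stride_threshold: int = 1024) -> int:
    """Round a stride (in elements) up to the alignment, if it is large enough to pad."""
    align = max(alignment_bytes // itemsize, 1)  # alignment expressed in elements
    if stride < stride_threshold or stride % align == 0:
        return stride  # too small to pad, or already aligned
    return -(-stride // align) * align  # ceiling division, then scale back up
```

With the defaults and 4-byte elements, a stride of 1025 would be rounded up to 1056 (the next multiple of 32 elements), while a stride of 1000 is left alone because it is below the threshold. Setting `stride_threshold=0` pads every unaligned stride, matching the description of `config.padding_stride_threshold` above.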

 **Test plan**
 Added a new test in `test_padding.py` which tries various combinations of these options, checking that the output strides match our expectations.

  These changes should not affect perf, because the defaults are identical to Inductor's current behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133939
Approved by: https://github.com/shunting314

Co-authored-by: Yueming Hao <yhao@meta.com>
2024-09-02 05:56:33 +00:00
de3a641476 [executorch hash update] update the pinned executorch hash (#134914)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134914
Approved by: https://github.com/pytorchbot
2024-09-02 03:52:40 +00:00
3cb5d25122 [Inductor] Apply loop split optimization in codegen_node (#132389)
This PR applies loop split optimization in codegen_node to avoid non-contiguous load. When the vector is loaded in a non-contiguous manner due to a division in the index, we eliminate the division by splitting the loop to avoid non-contiguous load.
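
The index transformation can be illustrated in plain Python. This is a sketch of the idea only, not Inductor's codegen: the function names are hypothetical, and it assumes the channel count is an exact multiple of the group size (960 channels in 32 groups of 30, as in the example that follows).

```python
def gather_with_division(stats, channels=960, group_size=30):
    """Per-channel lookup with a division in the index (the 'before' pattern)."""
    return [stats[c // group_size] for c in range(channels)]


def gather_with_split_loop(stats, channels=960, group_size=30):
    """Same result after splitting the channel loop into (group, offset)."""
    out = []
    for g in range(channels // group_size):  # outer loop: one iteration per group
        value = stats[g]                     # index no longer needs a division
        for _ in range(group_size):          # inner loop: channels within the group
            out.append(value)
    return out
```

After the split, the per-group statistic is constant across the inner loop, so it no longer has to be gathered element by element through a divided index.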

Example:
```
import torch
import torch.nn as nn

class GNReLU(torch.nn.Module):
    def __init__(self, num_groups, num_channels):
        super(GNReLU, self).__init__()
        self.gn = nn.GroupNorm(num_groups, num_channels)

    def forward(self, x):
        return torch.nn.functional.relu(self.gn(x))

input = torch.randn(2, 960, 96, 96).to(memory_format=torch.channels_last)
m = GNReLU(32, 960).eval()
compiled_m = torch.compile(m)

with torch.no_grad():
    compiled_m(input)
```

Generated code:

- Before:
```
cpp_fused_native_group_norm_relu_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/vu/cvuckxaygqfovv2zu2byqhcmiejbke7mdhf2rpgpr5mlscdev2hg.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(56)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<long>(17280L));
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L))
                        {
                            for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 16);
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                            }
                            for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 14);
                                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, 14, &wrecps0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L))
                {
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(960L); x2+=static_cast<long>(16L))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)), 16);
                        auto tmp1 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr0[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp3 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr1[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x2), 16);
                        auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x2), 16);
                        auto tmp2 = tmp0 - tmp1;
                        auto tmp4 = static_cast<float>(276480.0);
                        auto tmp5 = at::vec::Vectorized<float>(tmp4);
                        auto tmp6 = tmp3 / tmp5;
                        auto tmp7 = static_cast<float>(1e-05);
                        auto tmp8 = at::vec::Vectorized<float>(tmp7);
                        auto tmp9 = tmp6 + tmp8;
                        auto tmp10 = tmp9.rsqrt();
                        auto tmp11 = tmp2 * tmp10;
                        auto tmp13 = tmp11 * tmp12;
                        auto tmp15 = tmp13 + tmp14;
                        auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0));
                        tmp16.store(out_ptr2 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)));
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg2_1, = args
    args.clear()
    assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960))
    buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32)
    cpp_fused_native_group_norm_relu_0(arg2_1, _frozen_param3, _frozen_param2, buf0, buf1, buf3)
    del arg2_1
    return (buf3, )
```

- After:
```
cpp_fused_native_group_norm_relu_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/vu/cvuckxaygqfovv2zu2byqhcmiejbke7mdhf2rpgpr5mlscdev2hg.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(56)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<long>(17280L));
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L))
                        {
                            for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 16);
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                            }
                            for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 14);
                                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, 14, &wrecps0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L))
                {
                    #pragma GCC ivdep
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(32L); x2+=static_cast<long>(1L))
                    {
                        for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)), 16);
                            auto tmp1 = out_ptr0[static_cast<long>(x2 + (32L*x0))];
                            auto tmp4 = out_ptr1[static_cast<long>(x2 + (32L*x0))];
                            auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x3 + (30L*x2)), 16);
                            auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x3 + (30L*x2)), 16);
                            auto tmp2 = at::vec::Vectorized<float>(tmp1);
                            auto tmp3 = tmp0 - tmp2;
                            auto tmp5 = static_cast<float>(276480.0);
                            auto tmp6 = tmp4 / tmp5;
                            auto tmp7 = static_cast<float>(1e-05);
                            auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
                            auto tmp9 = 1 / std::sqrt(tmp8);
                            auto tmp10 = at::vec::Vectorized<float>(tmp9);
                            auto tmp11 = tmp3 * tmp10;
                            auto tmp13 = tmp11 * tmp12;
                            auto tmp15 = tmp13 + tmp14;
                            auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0));
                            tmp16.store(out_ptr2 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)));
                        }
                        for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)), 14);
                            auto tmp1 = out_ptr0[static_cast<long>(x2 + (32L*x0))];
                            auto tmp4 = out_ptr1[static_cast<long>(x2 + (32L*x0))];
                            auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x3 + (30L*x2)), 14);
                            auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x3 + (30L*x2)), 14);
                            auto tmp2 = at::vec::Vectorized<float>(tmp1);
                            auto tmp3 = tmp0 - tmp2;
                            auto tmp5 = static_cast<float>(276480.0);
                            auto tmp6 = tmp4 / tmp5;
                            auto tmp7 = static_cast<float>(1e-05);
                            auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
                            auto tmp9 = 1 / std::sqrt(tmp8);
                            auto tmp10 = at::vec::Vectorized<float>(tmp9);
                            auto tmp11 = tmp3 * tmp10;
                            auto tmp13 = tmp11 * tmp12;
                            auto tmp15 = tmp13 + tmp14;
                            auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0));
                            tmp16.store(out_ptr2 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)), 14);
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg2_1, = args
    args.clear()
    assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960))
    buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32)
    cpp_fused_native_group_norm_relu_0(arg2_1, _frozen_param3, _frozen_param2, buf0, buf1, buf3)
    del arg2_1
    return (buf3, )
```
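The "After" kernel keeps two vector accumulators per group (`tmp_acc0_vec` for the full 16-lane chunk and `masked_tmp_acc0_vec` for the 14-lane tail) and merges them with `welford_combine` before storing the mean and `m2`. As a rough illustration of why merging partial results is exact, here is a scalar Python sketch of the standard pairwise Welford merge. The helper names are made up for this example and it does not mirror Inductor's C++ implementation.

```python
def welford_stats(values):
    """Return (mean, m2, count) using Welford's streaming update."""
    mean, m2, count = 0.0, 0.0, 0
    for v in values:
        count += 1
        delta = v - mean
        mean += delta / count
        m2 += delta * (v - mean)
    return mean, m2, count


def welford_merge(a, b):
    """Merge two (mean, m2, count) partial results into one."""
    mean_a, m2_a, n_a = a
    mean_b, m2_b, n_b = b
    n = n_a + n_b
    if n == 0:
        return 0.0, 0.0, 0
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return mean, m2, n


# A group of 30 channels split as in the kernel: 16 full lanes + 14 tail lanes.
group = [float(i % 7) for i in range(30)]
merged = welford_merge(welford_stats(group[:16]), welford_stats(group[16:]))
direct = welford_stats(group)
```

Up to floating-point rounding, `merged` and `direct` agree, which is what lets the kernel accumulate the full and masked chunks independently.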

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132389
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
2024-09-02 00:28:34 +00:00
c140fa1426 Reorg cache code to make it simpler (#134911)
Summary:
Pull the big nested function out of the middle of cached_autotune() into its own class.

Also factor out creation of the autotune cache itself, which gets shared in the next diff.

Test Plan: unit tests

Differential Revision: D60677501

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134911
Approved by: https://github.com/oulgen
2024-09-02 00:27:40 +00:00
0cbcef12bd Stop adding useless prefix to error message here, you're pushing the important info off the screen. (#133108)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133108
Approved by: https://github.com/Skylion007
2024-09-01 23:11:17 +00:00
208442ea18 Don't setup try-except handler when Dynamo compiling (#133239)
The reraise is not supported and so this just gunks up our actual exception handling. You can trigger this by hitting an exception inside of an NN module that has hooks on it. You end up graph breaking on the reraise here, and losing the inner stack trace from the actual exception that was raised.

This might be kind of controversial.  An alternate strategy is to support reraises in Dynamo or something but IDK this doesn't feel like the right place to apply force.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133239
Approved by: https://github.com/anijain2305
2024-09-01 22:26:46 +00:00
ea01aec8b1 Move FunctionSchema implementations to cpp file (#133856)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133856
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-09-01 19:50:35 +00:00
2dadc2c8fc Log fx graph cache bypass reasons (#134792)
Summary: Let's track when we bypass the cache and why.

Test Plan: unit tests

Differential Revision: D61994739

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134792
Approved by: https://github.com/jamesjwu
2024-09-01 19:02:09 +00:00
cyy
1595e755af [Reland] [Torchgen] Pass mutable to cpp.valuetype_type (#134549)
Reland of #121415

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134549
Approved by: https://github.com/ezyang
2024-09-01 15:15:38 +00:00
eqy
b1a00b7b6d Abate -Wsign-compare warning spam in Indexing.cu (#134805)
Fix for warning spam like
```
 warning: comparison of integer expressions of different signedness: ‘long int’ and ‘uint64_t’ {aka ‘long unsigned int’} [-Wsign-compare]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134805
Approved by: https://github.com/janeyx99
2024-09-01 10:48:07 +00:00
cyy
d03f767cae Check function declarations of Vulkan code (#134550)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134550
Approved by: https://github.com/ezyang
2024-09-01 09:38:35 +00:00
c25b64a057 expose host_emptyCache to python, fix a bug in freeing cudaHostRegist… (#134919)
…ered memory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134919
Approved by: https://github.com/eqy
2024-09-01 09:07:25 +00:00
caa04e0cae [ET] codegen: bool array as array ref (#134886)
Test Plan: CI

Differential Revision: D62046959

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134886
Approved by: https://github.com/larryliu0820
2024-09-01 01:33:43 +00:00
29b7852dc1 drop gil in couple places (leads to deadlocks) (#134910)
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134910
Approved by: https://github.com/eqy
2024-09-01 00:05:53 +00:00
7239b8a4f1 Clean up RemoteCache classes (#134032)
Summary:
The existing RemoteCacheBackend classes were a bit haphazard - some of them accepted bytes only, some accepted objects, some returned different types of objects than were passed in.

Update them to be more consistent:

1. RemoteCacheBackend is an implementation of a backend: Redis, Memcache, Manifold, LocalFile

2. RemoteCacheSerde is an implementation of a serde protocol - to turn structured objects (dict, list, etc) into bytes: RemoteCacheJsonSerde (json encoding), RemoteCachePassthroughSerde (strictly bytes only)

3. RemoteCache is the cache implementation itself, mixing a RemoteCacheBackend along with an RemoteCacheSerde to provide structured caching.
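The three roles listed above can be sketched roughly as follows. This is a minimal illustration with hypothetical class names (`InMemoryBackend`, `JsonSerde`, `PassthroughSerde`, `Cache`) and simplified signatures; it is not the actual `RemoteCache` API.

```python
import json
from typing import Any, Dict, Optional


class InMemoryBackend:
    """Hypothetical backend: stores and returns raw bytes only."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    def get(self, key: str) -> Optional[bytes]:
        return self._store.get(key)

    def put(self, key: str, data: bytes) -> None:
        self._store[key] = data


class JsonSerde:
    """Turns structured objects (dict, list, ...) into bytes and back."""

    def encode(self, obj: Any) -> bytes:
        return json.dumps(obj).encode("utf-8")

    def decode(self, data: bytes) -> Any:
        return json.loads(data.decode("utf-8"))


class PassthroughSerde:
    """Accepts bytes only and passes them through unchanged."""

    def encode(self, obj: bytes) -> bytes:
        if not isinstance(obj, bytes):
            raise TypeError("PassthroughSerde only accepts bytes")
        return obj

    def decode(self, data: bytes) -> bytes:
        return data


class Cache:
    """Combines one backend with one serde to provide structured caching."""

    def __init__(self, backend, serde) -> None:
        self.backend = backend
        self.serde = serde

    def get(self, key: str) -> Any:
        data = self.backend.get(key)
        return None if data is None else self.serde.decode(data)

    def put(self, key: str, value: Any) -> None:
        self.backend.put(key, self.serde.encode(value))
```

With this split, every backend sees only bytes, and the choice of encoding is made once per cache rather than once per backend.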

Other than simply reorganizing the existing cache code this also fixes the Redis autotune caching for OSS.

Test Plan: unit tests

Reviewed By: oulgen

Differential Revision: D61178859

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134032
Approved by: https://github.com/oulgen, https://github.com/bhack
2024-08-31 20:18:59 +00:00
590d96be64 [inductor] move test_fuse_large_params to slow test. (#134900)
Move `test_fuse_large_params` to the slow tests. This case takes about 1.5 minutes.

<img width="855" alt="image" src="https://github.com/user-attachments/assets/adf16dcf-d398-4d66-8dda-0c9cafc4e351">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134900
Approved by: https://github.com/jansel
2024-08-31 18:08:11 +00:00
f4641ca481 [Inductor] Remove VecChecker and fallback non-supported Vec op to Scalar impl with a for loop (#134569)
Fall back to a scalar implementation plus a for loop for ops that have no vectorized implementation.

Example code:
```
cpp_fused_igammac_0 = async_compile.cpp_pybinding(['const double*', 'const double*', 'double*'], '''
#include "/tmp/torchinductor_root/z4/cz4j2mmotlx3z2b7u4fbjtdt4x6plhd67ljwzg5bk7ekv4xz6y7q.h"
extern "C"  void kernel(const double* in_ptr0,
                       const double* in_ptr1,
                       double* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(48L); x0+=static_cast<int64_t>(8L))
        {
            auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), 8);
            auto tmp1 = in_ptr1[static_cast<int64_t>(0L)];
            auto tmp2 = at::vec::VectorizedN<double,2>(tmp1);
            auto tmp3 =
            [&]()
            {
                __at_align__ std::array<double, 8> tmpbuf0;
                tmp0.store(tmpbuf0.data(), 8);
                __at_align__ std::array<double, 8> tmpbuf1;
                tmp2.store(tmpbuf1.data(), 8);
                __at_align__ std::array<double, 8> tmpbuf_out;
                for (int i = 0; i < 8; i++)
                {
                    tmpbuf_out[i] = calc_igammac(tmpbuf0[i], tmpbuf1[i]);
                }
                return at::vec::VectorizedN<double, 2>::loadu(tmpbuf_out.data(), 8);
            }
            ()
            ;
            tmp3.store(out_ptr0 + static_cast<int64_t>(x0), 8);
        }
        #pragma omp simd simdlen(4)
        for(int64_t x0=static_cast<int64_t>(48L); x0<static_cast<int64_t>(50L); x0+=static_cast<int64_t>(1L))
        {
            auto tmp0 = in_ptr0[static_cast<int64_t>(x0)];
            auto tmp1 = in_ptr1[static_cast<int64_t>(0L)];
            auto tmp2 = calc_igammac(tmp0, tmp1);
            out_ptr0[static_cast<int64_t>(x0)] = tmp2;
        }
    }
}
''')

```
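The shape of the generated kernel above (spill the vector lanes to a buffer, run the scalar op per lane, reload the result, then finish with a scalar tail loop) can be mimicked in plain Python. This is only an illustrative sketch: `scalar_op` is a stand-in for an op without a vectorized implementation, and lists play the role of vector registers.

```python
import math

VEC_WIDTH = 8  # lanes per "vector", matching the example kernel above


def scalar_op(x: float) -> float:
    """Stand-in for an op that only has a scalar implementation."""
    return math.erfc(x)


def apply_with_fallback(data):
    out = [0.0] * len(data)
    main_end = len(data) - len(data) % VEC_WIDTH
    # Main loop: one full vector per iteration.
    for start in range(0, main_end, VEC_WIDTH):
        lanes = data[start:start + VEC_WIDTH]    # like tmp0.store(tmpbuf0.data(), 8)
        result = [scalar_op(v) for v in lanes]   # per-lane scalar loop
        out[start:start + VEC_WIDTH] = result    # like loadu(...) followed by store(...)
    # Tail loop: remaining elements are handled one at a time.
    for i in range(main_end, len(data)):
        out[i] = scalar_op(data[i])
    return out
```

For 50 elements this runs six full-vector iterations (0 to 47) and a two-element tail, the same split as the `48L`/`50L` bounds in the kernel.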

`frexp` is difficult to handle with the common `fallback` since it returns two `cse_var`s: 2ba60a1618/torch/_inductor/codegen/cpp.py (L752-L766)
So we added a special function to handle it.
```
cpp_fused_frexp_0 = async_compile.cpp_pybinding(['const double*', 'double*', 'int32_t*'], '''
#include "/tmp/torchinductor_root/z4/cz4j2mmotlx3z2b7u4fbjtdt4x6plhd67ljwzg5bk7ekv4xz6y7q.h"
extern "C"  void kernel(const double* in_ptr0,
                       double* out_ptr0,
                       int32_t* out_ptr1)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(16L); x0+=static_cast<int64_t>(8L))
        {
            auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), 8);
            at::vec::Vectorized<int32_t> tmp1;
            at::vec::VectorizedN<double, 2> tmp2;
            [&]()
            {
                __at_align__ std::array<double, 8> tmpbuf;
                tmp0.store(tmpbuf.data(), 8);
                __at_align__ std::array<int32_t, 8> tmpbuf_exponent;
                __at_align__ std::array<double, 8> tmpbuf_mantissa;
                for (int i = 0; i < 8; i++)
                {
                    tmpbuf_mantissa[i] = std::frexp(tmpbuf[i], &tmpbuf_exponent[i]);
                }
                tmp1 = at::vec::Vectorized<int32_t>::loadu(tmpbuf_exponent.data(), 8);
                tmp2 = at::vec::VectorizedN<double, 2>::loadu(tmpbuf_mantissa.data(), 8);
            }
            ();
            tmp2.store(out_ptr0 + static_cast<int64_t>(x0), 8);
            tmp1.store(out_ptr1 + static_cast<int64_t>(x0), 8);
        }
        #pragma omp simd simdlen(4)
        for(int64_t x0=static_cast<int64_t>(16L); x0<static_cast<int64_t>(20L); x0+=static_cast<int64_t>(1L))
        {
            auto tmp0 = in_ptr0[static_cast<int64_t>(x0)];
            int32_t tmp1;
            auto tmp2 = std::frexp(tmp0, &tmp1);
            out_ptr0[static_cast<int64_t>(x0)] = tmp2;
            out_ptr1[static_cast<int64_t>(x0)] = tmp1;
        }
    }
}
''')
```
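What makes `frexp` awkward for a generic fallback is that each lane produces two results of different types. A small Python sketch of the per-lane loop in the kernel above, using the standard library's `math.frexp` (the function name `frexp_lanes` is made up for this example):

```python
import math


def frexp_lanes(lanes):
    """Apply frexp lane by lane, collecting mantissas and exponents separately."""
    mantissas, exponents = [], []
    for v in lanes:
        m, e = math.frexp(v)  # v == m * 2**e, with 0.5 <= |m| < 1 for nonzero v
        mantissas.append(m)
        exponents.append(e)
    return mantissas, exponents


mantissas, exponents = frexp_lanes([8.0, 0.75, 3.0, 0.0])
```

The two output lists correspond to the separate `tmpbuf_mantissa` and `tmpbuf_exponent` buffers, which are then loaded into vectors of different element types.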

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134569
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-08-31 11:19:57 +00:00
16f119e62a Update compiled optimizer tests for tensor betas (#134169)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134169
Approved by: https://github.com/anijain2305, https://github.com/eellison
ghstack dependencies: #134166, #134167, #134168
2024-08-31 10:24:39 +00:00
4e71418566 [dynamo] rewrite addcmul_ to remove graph break (#134168)
Context: Adding support for the beta parameters to be tensors

Details: Similarly to the previous two PRs, addcmul_ is used with the tensor betas as the value argument. When this occurs, an item() call is invoked in the aten op. To avoid the resulting graph break, addcmul_ is decomposed into its constituent ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134168
Approved by: https://github.com/anijain2305
ghstack dependencies: #134166, #134167
2024-08-31 10:24:39 +00:00
3fb4c6bc38 [dynamo] Rewrite foreach pow to broadcast scalar argument (#134167)
Context: Adding support for the beta parameters to be tensors

Details:
In this PR, similarly to the previous one, foreach_pow calls item() on the first argument when it is a scalar tensor. In this case, we broadcast that scalar tensor into a list of aliases of that tensor to avoid the item() call, and this results in a device copy of the scalar tensor. Once again, I don't think we can change the foreach_pow API due to BC concerns, so this op rewrite allows us to avoid a graph break, generate semantically the same code, and not affect eager.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134167
Approved by: https://github.com/anijain2305
ghstack dependencies: #134166
2024-08-31 10:24:35 +00:00
471c33f007 [dynamo] Rewrite foreach_lerp to avoid aten item call (#134166)
Context: Adding support for the beta parameters to be tensors

Details:
In order to add support for the beta params to be tensors without graph breaks in the Adam family of optimizers, it is necessary to support foreach_lerp(x, y, s) where s is a scalar tensor. Today, this isn't possible because when `s` is a scalar tensor, the aten op internally calls item() on it to extract the value and distribute it to each of the ops on the individual list indices. To support this in dynamo without graph breaks, I decompose the lerp into its constituent ops, which support a scalar tensor in the list argument positions and do not result in an item() call. To be clear, I think the item() call is more performant for eager, and for BC I don't think we can modify that API, so this allows us to have performance in eager and no graph breaks in compile.
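The arithmetic behind the decomposition is the usual lerp identity, `x + s * (y - x)`, applied across the lists as three elementwise steps. The sketch below only illustrates that arithmetic with plain Python floats; the function name is hypothetical and it says nothing about how Dynamo performs the rewrite.

```python
def foreach_lerp_decomposed(xs, ys, s):
    """lerp(x, y, s) = x + s * (y - x), written as sub, mul, add over lists."""
    diffs = [y - x for x, y in zip(xs, ys)]       # elementwise sub
    scaled = [d * s for d in diffs]               # elementwise mul by the weight
    return [x + d for x, d in zip(xs, scaled)]    # elementwise add


result = foreach_lerp_decomposed([0.0, 10.0], [4.0, 20.0], 0.25)
```

Because the weight only ever appears as an operand of an elementwise multiply, nothing in this form needs to pull its value out to the host first.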

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134166
Approved by: https://github.com/anijain2305
2024-08-31 10:24:31 +00:00
eed0d76682 [dynamo][itertools] refactor itertools.islice to use polyfill (#133876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133876
Approved by: https://github.com/jansel
ghstack dependencies: #133864, #133894
2024-08-31 10:08:07 +00:00
ec660c383e [dynamo] reduce overhead for PolyfilledFunctionVariable.call_function (#134842)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134842
Approved by: https://github.com/jansel
2024-08-31 09:12:46 +00:00
d9cc693719 [jit] Change argument names (#134828)
It seems like a bug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134828
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2024-08-31 08:42:30 +00:00
136badae64 [inductor] preload icx built in math libs (#134870)
The Intel compiler implements more built-in math libraries than clang, for performance purposes.
We need to preload them, as we do for the OpenMP library.

UT to reproduce:
```cmd
pytest test/inductor/test_cpu_cpp_wrapper.py -v -k test_silu_cpu_dynamic_shapes_cpp_wrapper
```

Module dependencies:
<img width="804" alt="Image" src="https://github.com/user-attachments/assets/9a672e03-ebf5-4ebb-b182-09180e6f7841">

Local test pass:
<img width="857" alt="image" src="https://github.com/user-attachments/assets/afbb8c1c-8fcc-4d64-a3ad-c8521b137d2d">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134870
Approved by: https://github.com/jansel
2024-08-31 04:50:31 +00:00
090d9cf410 [Dynamo][autograd.Function][vmap] support torch._C._are_functorch_transforms_active (#134889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134889
Approved by: https://github.com/jansel
2024-08-31 04:39:09 +00:00
34b85d301f [executorch hash update] update the pinned executorch hash (#134894)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134894
Approved by: https://github.com/pytorchbot
2024-08-31 04:16:41 +00:00
64fad53b50 [Inductor] Support passing module map parameter to Triton make_ir API (#134774)
In https://github.com/triton-lang/triton/pull/4539 the `make_ir` API was modified to accept a new `module_map` parameter. Update the Inductor callsite accordingly, preserving backwards compatibility following the existing code.
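One common way to keep a call site working across both versions of an API is to inspect the callee's signature before calling it. The sketch below shows that pattern with two hypothetical stand-ins (`make_ir_old`, `make_ir_new`) whose parameter lists are invented for illustration; it is not Triton's real signature, and the PR may use a different mechanism.

```python
import inspect


def make_ir_old(options, context):
    """Hypothetical older API without module_map."""
    return ("ir", options, context)


def make_ir_new(options, module_map, context):
    """Hypothetical newer API that accepts module_map."""
    return ("ir", options, module_map, context)


def call_make_ir(make_ir, options, context, module_map):
    """Pass module_map only when the callee declares that parameter."""
    params = inspect.signature(make_ir).parameters
    if "module_map" in params:
        return make_ir(options, module_map, context)
    return make_ir(options, context)
```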

Fixes #134674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134774
Approved by: https://github.com/EikanWang, https://github.com/zou3519, https://github.com/jansel
2024-08-31 03:38:08 +00:00
aef5da50f4 Cleanup unused pytorch.version (#134381)
This file doesn't seem to be used anywhere? checking CI...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134381
Approved by: https://github.com/zou3519
2024-08-31 02:50:05 +00:00
86e03a64e1 Revert "[Inductor] Allow customizing the padding format (#133939)"
This reverts commit 8b258b3b14408986a1d4142cff5a153c798ceecc.

Reverted https://github.com/pytorch/pytorch/pull/133939 on behalf of https://github.com/ZainRizvi due to sorry but this PR is causing issues with diff train imports reverting it for now but it can be merged back in as-is ([comment](https://github.com/pytorch/pytorch/pull/133939#issuecomment-2322635388))
2024-08-31 00:38:30 +00:00
f95085fd91 [BE][MPS] Prefer xfail to skip (#134858)
This essentially undoes the broad skips of nn.modules tests on everything but macOS Sequoia that were introduced by https://github.com/pytorch/pytorch/pull/128393

Instead, it uses the existing `xfail`, guarded by the `_macos15_or_newer` boolean.

Before the change if run on MacOS 14:
```
 % python3 ../test/test_modules.py -v -k Hardswish 2>&1|tail -n3
Ran 57 tests in 0.053s

OK (skipped=32)
```
After
```
% python3 ../test/test_modules.py -v -k Hardswish 2>&1|tail -n3
Ran 57 tests in 0.229s

OK (skipped=10, expected failures=2)
```
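The difference between the two approaches shows up in the test summary: a skipped test never runs, while an expected failure runs and is reported separately, so it is noticed if it starts passing. A minimal `unittest` sketch of a conditionally applied expected failure follows; the flag `KNOWN_BROKEN_HERE` and the helper `xfail_if` are hypothetical and unrelated to the actual test-suite helpers.

```python
import unittest

KNOWN_BROKEN_HERE = True  # hypothetical platform check


def xfail_if(condition):
    """Mark a test as an expected failure only when `condition` holds."""
    return unittest.expectedFailure if condition else (lambda fn: fn)


class Demo(unittest.TestCase):
    @xfail_if(KNOWN_BROKEN_HERE)
    def test_known_broken(self):
        self.assertEqual(1, 2)

    def test_ok(self):
        self.assertEqual(1, 1)


def run_demo():
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(Demo)
    result = unittest.TestResult()
    suite.run(result)
    return result
```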

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134858
Approved by: https://github.com/janeyx99
2024-08-31 00:29:48 +00:00
050ad925f3 [benchmark] Add to torchbench relative path search (#134871)
Add a relative path to the torchbench search in the benchmark. This enables users to run `torchbench.py` inside the `pytorch/benchmark/dynamo` folder when the `torchbench` repo is cloned at the same level as `pytorch`.
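A relative-path search of this kind usually amounts to trying a list of candidate locations and taking the first one that exists. The sketch below is a generic illustration: the function name, the candidate list, and the directory layout built in `demo` are assumptions, not the benchmark script's actual code.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Optional


def find_checkout(start: Path, candidates: Iterable[str]) -> Optional[Path]:
    """Return the first candidate (relative to `start`) that is a directory."""
    for rel in candidates:
        path = (start / rel).resolve()
        if path.is_dir():
            return path
    return None


def demo() -> bool:
    """Build a throwaway layout with sibling checkouts and search from inside one."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp).resolve()
        script_dir = root / "pytorch" / "benchmarks" / "dynamo"
        script_dir.mkdir(parents=True)
        (root / "torchbench").mkdir()
        found = find_checkout(script_dir, ["../../torchbench", "../../../torchbench"])
        return found == root / "torchbench"
```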

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134871
Approved by: https://github.com/FindHao
2024-08-31 00:28:22 +00:00
a854c3a25e [dynamo] refactor builtins.enumerate to use polyfill (#133894)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133894
Approved by: https://github.com/jansel
ghstack dependencies: #133864
2024-08-31 00:17:27 +00:00
ebbdeeede1 [dynamo][itertools] refactor itertools.chain and itertools.chain.from_iterable to use polyfills (#133864)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133864
Approved by: https://github.com/jansel
2024-08-31 00:11:54 +00:00
5dad6a5a84 [ONNX][DORT] Lazy-import onnxruntime (#134662)
Currently, if installed, `onnxruntime` will be imported when importing `torch._inductor` (which will be imported by some other library, e.g. transformer-engine):

```
  /mnt/c.py(53)<module>()
-> from torch._inductor.utils import maybe_profile
  /usr/local/lib/python3.10/site-packages/torch/_inductor/utils.py(49)<module>()
-> import torch._export
  /usr/local/lib/python3.10/site-packages/torch/_export/__init__.py(25)<module>()
-> import torch._dynamo
  /usr/local/lib/python3.10/site-packages/torch/_dynamo/__init__.py(2)<module>()
-> from . import convert_frame, eval_frame, resume_execution
  /usr/local/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py(48)<module>()
-> from . import config, exc, trace_rules
  /usr/local/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py(52)<module>()
-> from .variables import (
  /usr/local/lib/python3.10/site-packages/torch/_dynamo/variables/__init__.py(38)<module>()
-> from .higher_order_ops import (
  /usr/local/lib/python3.10/site-packages/torch/_dynamo/variables/higher_order_ops.py(14)<module>()
-> import torch.onnx.operators
  /usr/local/lib/python3.10/site-packages/torch/onnx/__init__.py(62)<module>()
-> from ._internal.onnxruntime import (
  /usr/local/lib/python3.10/site-packages/torch/onnx/_internal/onnxruntime.py(37)<module>()
-> import onnxruntime  # type: ignore[import]
```

This breaks the generated Triton kernel, because it imports torch and, with it, unexpected runtime libraries.

I've also added a test for this specific case under `test/onnx`, perhaps we should add more somewhere else?
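The general lazy-import pattern is to move the import from module scope into a function that is called only when the dependency is needed. Below is a minimal sketch using `importlib`, with the standard-library `json` module standing in for a heavy optional dependency; the helper names are hypothetical and this is not the code in `torch/onnx`.

```python
import importlib
from functools import lru_cache


@lru_cache(maxsize=None)
def lazy_backend(name: str = "json"):
    """Import the dependency on first use and reuse the module afterwards."""
    return importlib.import_module(name)


def is_available(name: str) -> bool:
    """Probe for an optional dependency without importing it at module load."""
    try:
        lazy_backend(name)
        return True
    except ImportError:
        return False
```

Importing the module that defines these helpers no longer pulls in the dependency; only the first call to `lazy_backend` does.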

Related issue: https://github.com/huggingface/accelerate/pull/3056
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134662
Approved by: https://github.com/justinchuby
2024-08-31 00:06:28 +00:00
2384f77d76 [XPU] Fix Windows XPU build (#134276)
The linker flag check doesn't work correctly with MSVC, and linking torch_xpu with torch_cpu_library on Windows with MSVC works without any errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134276
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-08-30 23:51:40 +00:00
e688b78791 [Dynamo][autograd.Function] Trace fwd graph under no_grad mode (#134872)
Fixes #134820

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134872
Approved by: https://github.com/zou3519
2024-08-30 22:24:18 +00:00
8b258b3b14 [Inductor] Allow customizing the padding format (#133939)
Based on https://github.com/pytorch/pytorch/pull/130956.

Inductor already supports padding through the `config.comprehensive_padding` option, but the padding format involves a few heuristics that are specific to Nvidia GPUs:
  - When we pad, it is always aligned to the next multiple of 128 bytes.
  - Strides smaller than 1024 are not padded.
  - Only intermediate values are padded, not outputs.

 The last of these is not really GPU-specific, but there are certain cases where we may want to override it. For example, padding outputs is useful on hardware accelerators with specific memory alignment requirements, or for applications where performance is more important than conformity with eager mode.

 This PR surfaces padding parameters up to Inductor's config module, so the user can control them.
   - `config.pad_outputs`: choose whether to pad outputs (default: `False`)
   - `config.padding_alignment_bytes`: choose the alignment size for padding (default: `128`)
   - `config.padding_stride_threshold`:  choose the smallest stride that we will pad. For example, setting this to 0 will pad all unaligned strides. (default: `1024`)
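To make the interaction of the alignment and the threshold concrete, here is a small arithmetic sketch. It assumes the threshold is compared against the stride in elements and that padding rounds the stride up so that its size in bytes is a multiple of the alignment; the function `pad_stride` is hypothetical and the real heuristics in Inductor may differ.

```python
def pad_stride(stride, itemsize, alignment_bytes=128, stride_threshold=1024):
    """Round `stride` (in elements) up to the alignment, unless it is below the threshold."""
    if stride < stride_threshold:
        return stride  # small strides are left alone
    align_elems = max(alignment_bytes // itemsize, 1)
    return -(-stride // align_elems) * align_elems  # ceiling division, then scale back


padded = pad_stride(1025, itemsize=4)                              # float32: 32 elements per 128 bytes
unpadded = pad_stride(1000, itemsize=4)                            # below the default threshold
pad_everything = pad_stride(100, itemsize=4, stride_threshold=0)   # threshold disabled
```

Under these assumptions, a float32 stride of 1025 becomes 1056, a stride of 1000 is left unchanged, and with the threshold set to 0 a stride of 100 becomes 128.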

 **Test plan**
 Added a new test in `test_padding.py` which tries various combinations of these options, checking that the output strides match our expectations.

  These changes should not affect perf, because the defaults are identical to Inductor's current behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133939
Approved by: https://github.com/shunting314

Co-authored-by: Yueming Hao <yhao@meta.com>
2024-08-30 20:34:11 +00:00
a1ba8e61d1 Revert "[ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438)"
This reverts commit 5e8bf29148a590318f678620f84be8f4d5ffff5c.

Reverted https://github.com/pytorch/pytorch/pull/133438 on behalf of https://github.com/ZainRizvi due to This still breaks linux binary builds. Added the appropriate labels to ensure tests can pass. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/10626427003/job/29460479554) [HUD commit link](5e8bf29148) ([comment](https://github.com/pytorch/pytorch/pull/133438#issuecomment-2322246198))
2024-08-30 20:00:41 +00:00
f6398eb0fa dynamic shapes for combo_kenel/foreach_kernel (#134477)
This PR adds dynamic shapes support to foreach and combo kernels for horizontal fusion.
A flag `combo_kernel_foreach_dynamic_shapes` (default False, to avoid disturbing production workflows) is added to _inductor/config.py. Setting it to True enables automatic dynamic shapes for foreach kernels. Dynamic shapes are always enabled for combo kernels. Unit tests are added.

This PR also fixes a flaky test case for [T198833257](https://www.internalfb.com/intern/tasks/?t=198833257)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134477
Approved by: https://github.com/mlazos
2024-08-30 19:58:20 +00:00
db17a9898d regenerate ci workflows for binary builds with new g4dn runners (#133404)
Fixes #103104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133404
Approved by: https://github.com/ZainRizvi
2024-08-30 19:53:22 +00:00
98b813d0d4 Enable cudagraphs in cpp wrapper (#133885)
Fixes https://github.com/pytorch/pytorch/issues/130878

Summary: Enables cudagraphs in cpp wrapper by clearing inputs.

Generated, non-cpp wrapper code:
```python
def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (10, ), (1, ))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        buf0 = empty_strided_cuda((10, ), (1, ), torch.float32)
        # Topologically Sorted Source Nodes: [sin], Original ATen: [aten.sin]
        stream0 = get_raw_stream(0)
        triton_poi_fused_sin_0.run(arg0_1, buf0, 10, grid=grid(10), stream=stream0)
        del arg0_1
    return (buf0, )
```
vs generated cpp wrapper code:
```python
def _wrap_func(f):
    def g(args):
        input_tensors = [arg if isinstance(arg, torch.Tensor) else torch.tensor(arg) for arg in args]
        input_handles = torch._C._aoti.unsafe_alloc_void_ptrs_from_tensors(input_tensors)
        # new:
        args.clear()
        # end new

        output_handles = f(input_handles)
        output_tensors = torch._C._aoti.alloc_tensors_by_stealing_from_void_ptrs(output_handles)
        return output_tensors

    return g

```
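The PR text does not spell out why clearing the input list matters. One plausible reading, stated here as an assumption, is that `args.clear()` drops the caller's references so the callee can release an input as soon as it is done with it. The sketch below demonstrates that effect with a plain object and a weak reference; it relies on CPython's reference counting, and `Buffer`, `call` and `run` are made-up names.

```python
import weakref


class Buffer:
    """Stand-in for a large input tensor."""


def call(args, clear_inputs):
    (buf,) = args
    if clear_inputs:
        args.clear()           # drop the reference held by the caller's list
    probe = weakref.ref(buf)
    del buf                    # the callee is done with the input
    return probe() is None     # True if the buffer has already been freed


def run(clear_inputs):
    inputs = [Buffer()]        # the list is the only owner of the buffer
    return call(inputs, clear_inputs)
```

Without the `clear()`, the caller's list keeps the buffer alive until `call` returns, no matter what the callee deletes.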

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133885
Approved by: https://github.com/eellison, https://github.com/desertfire
2024-08-30 18:48:37 +00:00
bdfa94b787 [RFC] Make fr trace script a console scripts (#134729)
We want the fr analyzer script to be available as a command after users `pip install torch`, so we mimic what torchrun does.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134729
Approved by: https://github.com/c-p-i-o, https://github.com/malfet
ghstack dependencies: #134528, #134780
2024-08-30 18:17:06 +00:00
a0d0c6b7e6 Used torch.equal in test_foreach_copy_with_multi_dtypes (#134861)
`self.assertEqual` allows some tolerance, but here, we want to show that `_foreach_copy_` gives bitwise equivalent results. Let us use `torch.equal` then.
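The distinction is the familiar one between a tolerance-based comparison and an exact one. A plain-Python sketch with floats, using `math.isclose` as a stand-in for a tolerant assertion:

```python
import math

a = 0.1 + 0.2
b = 0.3

close_enough = math.isclose(a, b, rel_tol=1e-9)  # tolerant comparison
exactly_equal = a == b                            # exact comparison
```

Here the tolerant check passes while the exact one does not, which is why a test that claims bitwise-identical results should use the exact form.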
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134861
Approved by: https://github.com/Skylion007, https://github.com/janeyx99, https://github.com/crcrpar
2024-08-30 18:04:41 +00:00
1993a2aa9e [FR] Make pg_name unique, show P2P collective status and fix bugs when running the script as command (#134780)
Fixes a bunch of bugs in the script when running with the generated command and 3D parallel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134780
Approved by: https://github.com/c-p-i-o
ghstack dependencies: #134528
2024-08-30 18:03:17 +00:00
15f5a4858b [inductor] enable Intel Compiler(icx-cl) for inductor windows (#134772)
This PR enables the Intel compiler (`icx-cl`) for Inductor on Windows, like the previous PR https://github.com/pytorch/pytorch/pull/134444, which enabled clang.

Changes:
1. Fix an `icx-cl` crash caused by decoding args with the wrong encoding; the correct encoding is "utf-8".
2. Add an Intel compiler check and an Intel compiler Windows driver (`icx-cl`) check.
3. Add Intel compiler OpenMP args config.
4. Add Intel compiler OpenMP binary preload.

Intel compiler OpenMP binary path:
<img width="788" alt="image" src="https://github.com/user-attachments/assets/54c76356-018d-4bef-a9b7-0ea150fd7aba">

Performance: the Intel compiler (`icx-cl`) performs much better than MSVC (`cl`):
<img width="875" alt="image" src="https://github.com/user-attachments/assets/67865faf-b1de-4535-917a-486b72527204">

Append `clang-cl` performance data:
<img width="821" alt="image" src="https://github.com/user-attachments/assets/476f4568-bf58-457f-b73d-4e57f49be384">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134772
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-08-30 17:51:46 +00:00
9e0ddc0e14 [inductor] don't allow triton config pre_hook (#134633)
The caching autotuner caches triton configs, and it doesn't try to hash or save the pre_hook from the config if it exists. If we had a config with a pre_hook, we might autotune -> save the config (without the pre_hook) -> later load the saved config and try to run it, this time without the pre_hook.

So this PR adds an assert and deletes the pre_hook handling. We can be confident that we didn't have functional pre_hooks, because the pre_hook handling tries to use `self.arg_name`, which doesn't exist.
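
The failure mode and the guard can be sketched in isolation. Everything below is an assumption made for illustration: `FakeConfig`, `save_config`, and `load_config` are hypothetical stand-ins, not the real `triton.Config` or Inductor's caching autotuner.

```python
import json
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class FakeConfig:
    """Hypothetical stand-in for a Triton config (not the real class)."""
    kwargs: dict = field(default_factory=dict)
    num_warps: int = 4
    num_stages: int = 2
    pre_hook: Optional[Callable] = None


def save_config(cfg: FakeConfig) -> str:
    # A callable cannot be stored in a JSON cache entry, so refuse up front
    # instead of silently dropping it and changing behavior after reload.
    assert cfg.pre_hook is None, "pre_hook is not supported by the cache"
    return json.dumps(
        {"kwargs": cfg.kwargs, "num_warps": cfg.num_warps, "num_stages": cfg.num_stages}
    )


def load_config(blob: str) -> FakeConfig:
    # The reloaded config can never carry a pre_hook.
    return FakeConfig(**json.loads(blob))
```

With the assert in place, a config that round-trips through the cache is guaranteed to behave the same before and after loading.
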

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134633
Approved by: https://github.com/shunting314, https://github.com/jansel
2024-08-30 17:39:37 +00:00
e21d7b77ce Update ForeachfuncInfo.sample_inputs_func to yield scalars & scalarlists that are more friendly to test_meta (#134552)
for `test_meta.py` to see more "PASSED" instead of "XFAIL".

`pytest test_meta.py -k "_foreach_"` ran 6400 test cases and:
- This PR: 4702 passed, 260 skipped, 73732 deselected, 1698 xfailed
- main (92c4771853892193d73d87bd60eca4dc7efc51d8): 3906 passed, 260 skipped, 73732 deselected, 2494 xfailed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134552
Approved by: https://github.com/janeyx99
2024-08-30 17:30:50 +00:00
577a93514f [dynamo][dynamic][heuristic] Mark tuple getitem integers as static (#134734)
Fixes issue seen in https://github.com/pytorch/pytorch/issues/132872#issuecomment-2314574656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134734
Approved by: https://github.com/jansel
ghstack dependencies: #134653, #134713
2024-08-30 17:06:57 +00:00
08184aa85c Add support for 32KB multi_tensor_apply kernel arguments (#134373)
## Benchmark

On H100 SXM (HBM2e, 500W TDP), CUDA Toolkit=12.2, Driver Version=535.154.05, with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa) (`torch._foreach_copy_`):

**Baseline**
```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmp0g_x4sys
device ms: 0.891, cpu ms: 7.200
memory bandwidth: 1457.727 GB/s
```

Single iteration trace:
<img width="1432" alt="image" src="https://github.com/user-attachments/assets/8ef54365-0265-4281-a0f0-d4c2f448300e">

**This PR**
```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmp3jqiugli
device ms: 0.683, cpu ms: 6.745
memory bandwidth: 1902.010 GB/s
```

Single iteration trace:
<img width="1074" alt="image" src="https://github.com/user-attachments/assets/e52acad1-d09b-492c-9611-6d69e339f3ac">

## Binary Size and Kernel Specialization
The binary size for `libtorch_cuda.so` increased 6MB (243MB -> 249MB).

```
// NOTE: [32KB kernel argument size support]
// 32KB kernel argument size support has three requirements:
// - CUDART_VERSION >= 12010
// - Driver version >= 530
// - GPU arch >= VOLTA
//
// Due to minor version compatibility, it is possible for binaries built with
// CUDART_VERSION >= 12010 to run with driver version < 530. Since driver
// version can only be checked at runtime, if CUDART_VERSION >= 12010, we have
// to build both 4KB and 32KB kernels and determine the appropriate kernel to
// dispatch at runtime.
//
// - If CUDART_VERSION < 12010, only 4KB kernels will be instantiated.
//
// - If CUDART_VERSION >= 12010:
//   - Host code:
//     - We always instantiate the launching stub for both 4KB and 32KB kernels.
//   - Device code:
//     - If __CUDA_ARCH__ >= 700, we always instantiate both 4KB and 32KB
//     kernels.
//     - If __CUDA_ARCH__ < 700, it's not possible to even compile an empty
//     32KB kernel (formal parameter space overflowed). Thus, we only
//     instantiate a declaration for 32KB kernels. This is valid as long as the
//     declaration-only kernel is not launched.
//
// - At runtime, we dispatch to the 32KB kernel if driver version >= 530 and
// GPU arch >= VOLTA.
//
// - TODO(yifu): once there's a CUDART version that is not compatible with any
// driver version below 530, we can determine at compile time to not compile
// the kernels for 4KB kernel argument size.
//
// https://developer.nvidia.com/blog/cuda-12-1-supports-large-kernel-parameters/
```
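
The runtime decision described in the note can be summarized as a small predicate. This is a sketch in Python for readability, not the actual C++/CUDA dispatch code; the function name and the constants' names are assumptions, while the thresholds are taken from the note.

```python
MIN_CUDART = 12010       # CUDART_VERSION for CUDA 12.1
MIN_DRIVER_MAJOR = 530   # minimum driver version named in the note
VOLTA_SM = 70            # compute capability 7.0 (__CUDA_ARCH__ >= 700)


def kernel_arg_size_kb(cudart_version: int, driver_major: int, sm: int) -> int:
    """Return 32 if the 32KB kernel may be dispatched, else 4."""
    if cudart_version < MIN_CUDART:
        # Only 4KB kernels were instantiated at build time.
        return 4
    if driver_major >= MIN_DRIVER_MAJOR and sm >= VOLTA_SM:
        # Both runtime requirements hold, so the 32KB kernel is usable.
        return 32
    # Built with a new enough toolkit, but the driver or GPU is too old.
    return 4
```
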
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134373
Approved by: https://github.com/eqy, https://github.com/crcrpar, https://github.com/janeyx99
2024-08-30 16:52:28 +00:00
a19a7524f6 [export] Make sure getitem replacement are synced with module call graph. (#134830)
Summary: When we are replacing nodes in the graph, we should also replace the references in module_call_graph.

Test Plan:
buck2 run 'fbcode//mode/opt' torchrec/fb/ir/tests:test_serializer -- --filter-regex test_serialize_deserialize_vlea
buck2 test 'fbcode//mode/opt' fbcode//torchrec/fb/ir/tests:test_serializer -- --exact 'torchrec/fb/ir/tests:test_serializer - torchrec.fb.ir.tests.test_serializer.TestSerializer: test_serialize_empty_value_vlea' --run-disabled

buck2 test 'fbcode//mode/opt' fbcode//torchrec/fb/ir/tests:test_serializer -- --exact 'torchrec/fb/ir/tests:test_serializer - torchrec.fb.ir.tests.test_serializer.TestSerializer: test_deserialized_device_vle' --run-disabled

Differential Revision: D62014035

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134830
Approved by: https://github.com/angelayi
2024-08-30 16:47:05 +00:00
f5b0caee71 Rewrite unsafe_remove_auto_functionalized_pass using decompose_auto_functionalized (#134831)
`unsafe_remove_auto_functionalized_pass` can be written using `decompose_auto_functionalized`. This way we do not have to update it each time we change `auto_functionalize` (e.g. https://github.com/pytorch/pytorch/pull/134409), and we avoid duplicate logic implemented in two different ways.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134831
Approved by: https://github.com/zou3519
2024-08-30 16:27:53 +00:00
351ba3e67c Revert "[c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931)"
This reverts commit 65864d01341d006955579b145f78547314ceb14b.

Reverted https://github.com/pytorch/pytorch/pull/132931 on behalf of https://github.com/ZainRizvi due to This PR is breaking builds internally due to the removal of ProcessGroup::Options ([comment](https://github.com/pytorch/pytorch/pull/132931#issuecomment-2321862402))
2024-08-30 16:27:40 +00:00
994438040c Improvements for associative_scan - combine_mode (#133012)
This is part of a series of PRs to improve `associative_scan`. This specific PR introduces a `combine_mode`, which can be either `pointwise` (default) or `generic`. With `generic`, `associative_scan` is more flexible and can also apply non-pointwise functions. This PR has been derived from https://github.com/pytorch/pytorch/pull/129307.
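
For readers unfamiliar with the operation, the following is a sequential reference for what an inclusive associative scan computes. It is plain Python, not the PyTorch implementation, and the helper names are made up for this sketch. The second call uses 2x2 matrix multiplication, an associative combine function that is not elementwise, which is presumably the kind of function the more flexible mode is aimed at.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")


def inclusive_scan(combine_fn: Callable[[T, T], T], xs: List[T]) -> List[T]:
    """Sequential reference: out[i] combines xs[0], xs[1], ..., xs[i] in order."""
    out: List[T] = []
    for x in xs:
        out.append(x if not out else combine_fn(out[-1], x))
    return out


def matmul2(a, b):
    """Multiply two 2x2 matrices given as nested tuples."""
    return (
        (a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]),
        (a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]),
    )


# Elementwise combine: addition yields a cumulative sum.
cumsum = inclusive_scan(lambda a, b: a + b, [1, 2, 3, 4])

# Non-elementwise combine: running products of a 2x2 matrix.
F = ((1, 1), (1, 0))
powers = inclusive_scan(matmul2, [F, F, F])
```
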

@ydwu4 @Chillee @zou3519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133012
Approved by: https://github.com/ydwu4
2024-08-30 16:09:53 +00:00
c6ecf57dd2 Revert "[dynamo] simplify implementation for functools.reduce (#133778)"
This reverts commit b5f1ffa7ab0988184497788f2738e1769888ab7d.

Reverted https://github.com/pytorch/pytorch/pull/133778 on behalf of https://github.com/ZainRizvi due to This is still failing internally with the same error about 'Graph break due to unsupported builtin _functools.reduce' ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2321787968))
2024-08-30 16:06:10 +00:00
7a85c488a8 Revert "[dynamo] simplify implementation for builtins.sum (#133779)"
This reverts commit eaa449fbf0fe528a0827ee9b5bcfcd307a7c658d.

Reverted https://github.com/pytorch/pytorch/pull/133779 on behalf of https://github.com/ZainRizvi due to This is still failing internally with the same error about 'Graph break due to unsupported builtin _functools.reduce' ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2321787968))
2024-08-30 16:06:10 +00:00
1ad08c7a5b Revert "[dynamo][itertools] refactor itertools.chain and itertools.chain.from_iterable to use polyfills (#133864)"
This reverts commit 1b703669576223024eb84a76c53b7ec5ed8bb270.

Reverted https://github.com/pytorch/pytorch/pull/133864 on behalf of https://github.com/ZainRizvi due to This is still failing internally with the same error about 'Graph break due to unsupported builtin _functools.reduce' ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2321787968))
2024-08-30 16:06:10 +00:00
8aa44e14cf Revert "[dynamo] refactor builtins.enumerate to use polyfill (#133894)"
This reverts commit a2566adfb6064235db6d950568435fb6ef58a598.

Reverted https://github.com/pytorch/pytorch/pull/133894 on behalf of https://github.com/ZainRizvi due to This is still failing internally with the same error about 'Graph break due to unsupported builtin _functools.reduce' ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2321787968))
2024-08-30 16:06:09 +00:00
10c31e96df Revert "[dynamo][itertools] refactor itertools.islice to use polyfill (#133876)"
This reverts commit 7d12e6dceb94a221288f21c0e79ce8ca667d657a.

Reverted https://github.com/pytorch/pytorch/pull/133876 on behalf of https://github.com/ZainRizvi due to This is still failing internally with the same error about 'Graph break due to unsupported builtin _functools.reduce' ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2321787968))
2024-08-30 16:06:09 +00:00
d261a1751a [HOP] fix export x inline_inbuilt_nn_modules (#133731)
TL;DR: this PR supports exporting cond together with the inline_inbuilt_nn_modules flag by inlining into the tracing code in proxy_tensor.py and _symbolic_trace.py (internally, the pattern is make_fx(record_module_stack)(torch.compile(f))).

We have two special treatments for following cases:

1. _ModuleStackTracer wraps all nn modules in _AttrProxy. _AttrProxy has several subtleties that make it hard to inline in dynamo, such as overriding _modules with a property method and overriding `__getattr__`, which mutates captured state when called.

The solution is to unwrap the _AttrProxy and get its corresponding nn_module (a 1-1 correspondence), so that dynamo symbolically traces the original nn module instead of tracing _AttrProxy.

2. The tracer applies a bunch of patches to the `__getattr__` and `__call__` of nn.Module for tracking reasons. This doesn't work well with dynamo. The immediate error we see is `torch._dynamo.exc.Unsupported: 'inline in skipfiles: WeakKeyDictionary.__contains__ | __contains__ /home/yidi/.conda/envs/pytorch/lib/python3.10/weakref.py` caused by a weakdict in PythonKeyTracer.

The solution is to remove the patches temporarily during dynamo symbolic convert, so that dynamo has a clean environment. make_fx will trace the transformed bytecode of dynamo and patch nn modules there instead.
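
The "remove the patches temporarily" idea follows a common pattern: swap the original method back in for the duration of a block and restore the patched one afterwards. The sketch below shows that pattern on a toy class; `temporarily_unpatched`, `Module`, and `traced_call` are hypothetical names, and this is not the code in proxy_tensor.py.

```python
from contextlib import contextmanager


@contextmanager
def temporarily_unpatched(cls, name, original):
    """Restore `original` as cls.<name> inside the block, then re-apply the patch."""
    patched = cls.__dict__[name]
    setattr(cls, name, original)
    try:
        yield
    finally:
        # Always put the patch back, even if the block raises.
        setattr(cls, name, patched)


class Module:
    def __call__(self):
        return "original"


original_call = Module.__call__


def traced_call(self):
    return "patched"


# Simulate a tracer patching the class globally.
Module.__call__ = traced_call

m = Module()
with temporarily_unpatched(Module, "__call__", original_call):
    inside = m()   # sees the unpatched method
after = m()        # the patch is back in place
```
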

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133731
Approved by: https://github.com/anijain2305
ghstack dependencies: #134775
2024-08-30 15:58:20 +00:00
932c4ca5a0 make make_fx collective test single threaded (#134775)
make_fx is not thread-safe because it mutates and patches global state. Making it thread-safe is difficult and low ROI, so just turn the tracing test into a single-threaded test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134775
Approved by: https://github.com/yifuwang
2024-08-30 15:58:20 +00:00
eqy
c07e566baf [CUDA][P2P] Check device capability in requires_cuda_p2p_access (#134523)
Tests seem to fail on, e.g., Volta without this, given the compile-time macros used, e.g., in 79b7fff188/torch/csrc/distributed/c10d/intra_node_comm.cu (L487)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134523
Approved by: https://github.com/yifuwang, https://github.com/Skylion007
2024-08-30 14:08:55 +00:00
92f282ca52 Enable batch matmul for result sizes > 2**32 the tensor can be split along batch axis (#133430)
Fixes #131865. Addresses the issue seen when running llama v3.1 8B parameter model on MPS backend where the batch matmul output size can go over the 32-bit indexing limit of MPS tensors, causing an assert.

Test case to reproduce the issue with the dimensions encountered in llama v3.1 and verify this fix works around it:

```
import torch
device='mps'
a = torch.randn([32, 20064, 128], dtype=torch.float32,device=device)
b = torch.randn([32, 128, 20064], dtype=torch.float32, device=device)
res = torch.bmm(a, b)
```

Notably, the current change only works as long as each individual output matrix in the bmm does not exceed 2**32 elements. This lets us split up the computation along the batch axis to avoid going over the limit.

Added a TORCH_CHECK to raise an error if the individual matrix dimensions are too large to handle for this op until a more general workaround tiling the matmuls is available.
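
The splitting arithmetic can be sketched as follows. This is an illustration of the idea in plain Python, not the MPS backend code; `batch_chunks` and `MAX_ELEMS` are hypothetical names, and treating the limit as "at most 2**32 output elements per chunk" is an assumption based on the description above.

```python
from typing import List, Tuple

MAX_ELEMS = 2**32  # assumed per-tensor element limit


def batch_chunks(batch: int, m: int, n: int, limit: int = MAX_ELEMS) -> List[Tuple[int, int]]:
    """Split [0, batch) into ranges whose (chunk, m, n) outputs fit under `limit`."""
    per_matrix = m * n
    if per_matrix > limit:
        # Mirrors the error case: a single output matrix is already too large.
        raise ValueError("a single output matrix exceeds the element limit")
    step = max(1, limit // per_matrix)
    return [(start, min(start + step, batch)) for start in range(0, batch, step)]


# Shapes from the reproducer: (32, 20064, 128) @ (32, 128, 20064).
chunks = batch_chunks(32, 20064, 20064)
```

For the reproducer shapes, each output matrix has 20064 * 20064 = 402,564,096 elements, so at most 10 batches fit into one chunk.
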
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133430
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-08-30 14:08:43 +00:00
50efbb9f1e [DeviceMesh][Test] Add a unit test for get_local_rank for flattened mesh (#134603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134603
Approved by: https://github.com/fduwjj
ghstack dependencies: #133838, #133839, #134048
2024-08-30 08:13:37 +00:00
0f8bec4399 [dynamo] mark_static_nn_module (#134713)
Fixes issue seen in https://github.com/pytorch/pytorch/issues/132872#issuecomment-2314574656

With this API, we can mark the offending module as static in detectron2.

Today's world: user-defined nn module int attributes are treated as automatic dynamic. Use the API in this PR to make them static if you want.

Alternative: consider all int attributes of any user-defined nn module class static, and then introduce an API, `torch._dynamo.mark_nn_module_attribute_dynamic`. Static by default is worrying if users have a `counter` in their model that is updated in each forward invocation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134713
Approved by: https://github.com/jansel
ghstack dependencies: #134653
2024-08-30 07:01:06 +00:00
a5630239ad [dynamo] Improve minifier error message when fp64 not supported (#134737)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134737
Approved by: https://github.com/anijain2305
2024-08-30 06:42:32 +00:00
1011e0ae98 Generalize devices specific UTs for dynamo (#130714)
## Motivation
This is a follow-up to PR https://github.com/pytorch/pytorch/pull/126970, adding the facility to run content on Intel Gaudi devices.
We intend to extend similar generalization to the rest of the content in test/dynamo, which is currently written to work specifically for CUDA devices. Other devices can add onto it if support is available.

## Changes
- Carve out BERT-related content into another class
- Use the instantiate_device_type utility to instantiate this class for devices that support the functionality

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130714
Approved by: https://github.com/anijain2305
2024-08-30 05:02:47 +00:00
7a694f6683 [justknobs] Override __bool__ method (#134799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134799
Approved by: https://github.com/ezyang
2024-08-30 04:54:02 +00:00
75b86b1554 [executorch hash update] update the pinned executorch hash (#134736)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134736
Approved by: https://github.com/pytorchbot
2024-08-30 04:11:51 +00:00
5e8bf29148 [ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133438
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2024-08-30 03:38:35 +00:00
1f1e2eeb9d [inductor] Install tlparse for test\dynamo\test_structured_trace.py UTs. (#134806)
Install tlparse for test\dynamo\test_structured_trace.py UTs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134806
Approved by: https://github.com/ezyang
2024-08-30 03:16:03 +00:00
0d5f978795 add basic nn modules diff time benchmarks (#134658)
Benchmarks several shapes of basic nn modules, in both eager and inductor.

```
collecting compile time instruction count for basic_modules_ListOfLinears_inductor
compile time instruction count for iteration 0 is 48602516013
compile time instruction count for iteration 1 is 20424350269
compile time instruction count for iteration 2 is 20440350455
compile time instruction count for iteration 3 is 20419269999
compile time instruction count for iteration 4 is 20430782200
compile time instruction count for iteration 5 is 20455049622
compile time instruction count for iteration 6 is 20157290712
compile time instruction count for iteration 7 is 20455324001
compile time instruction count for iteration 8 is 20450158317
compile time instruction count for iteration 9 is 20492987748
collecting compile time instruction count for basic_modules_ListOfLinears_eager
compile time instruction count for iteration 0 is 961328334
compile time instruction count for iteration 1 is 958887896
compile time instruction count for iteration 2 is 958792214
compile time instruction count for iteration 3 is 958375977
compile time instruction count for iteration 4 is 958568525
compile time instruction count for iteration 5 is 958152305
compile time instruction count for iteration 6 is 959322800
compile time instruction count for iteration 7 is 958332703
compile time instruction count for iteration 8 is 958092100
compile time instruction count for iteration 9 is 958095277
collecting compile time instruction count for basic_modules_ModuleForwardHasGraphBreak_inductor
compile time instruction count for iteration 0 is 3572145793
compile time instruction count for iteration 1 is 3503323973
compile time instruction count for iteration 2 is 3501962432
compile time instruction count for iteration 3 is 3501746084
compile time instruction count for iteration 4 is 3500687361
compile time instruction count for iteration 5 is 3822254676
compile time instruction count for iteration 6 is 3498356846
compile time instruction count for iteration 7 is 3499019157
compile time instruction count for iteration 8 is 3500780314
compile time instruction count for iteration 9 is 3500257458
collecting compile time instruction count for basic_modules_ModuleForwardHasGraphBreak_eager
compile time instruction count for iteration 0 is 1844838754
compile time instruction count for iteration 1 is 1843476862
compile time instruction count for iteration 2 is 1844761450
compile time instruction count for iteration 3 is 1845371742
compile time instruction count for iteration 4 is 1845159665
compile time instruction count for iteration 5 is 1845035802
compile time instruction count for iteration 6 is 1844895007
compile time instruction count for iteration 7 is 1844697922
compile time instruction count for iteration 8 is 1844780885
compile time instruction count for iteration 9 is 1844493990
collecting compile time instruction count for basic_modules_SequentialWithDuplicatedModule_inductor
compile time instruction count for iteration 0 is 1597839479
compile time instruction count for iteration 1 is 1348225351
compile time instruction count for iteration 2 is 1347340818
compile time instruction count for iteration 3 is 1348170800
compile time instruction count for iteration 4 is 1348637747
compile time instruction count for iteration 5 is 1678366444
compile time instruction count for iteration 6 is 1348412420
compile time instruction count for iteration 7 is 1348461578
compile time instruction count for iteration 8 is 1347420149
compile time instruction count for iteration 9 is 1349748195
collecting compile time instruction count for basic_modules_SequentialWithDuplicatedModule_eager
compile time instruction count for iteration 0 is 137721777
compile time instruction count for iteration 1 is 139065517
compile time instruction count for iteration 2 is 137130552
compile time instruction count for iteration 3 is 137506030
compile time instruction count for iteration 4 is 137089838
compile time instruction count for iteration 5 is 137477395
compile time instruction count for iteration 6 is 138550452
compile time instruction count for iteration 7 is 137568409
compile time instruction count for iteration 8 is 136968468
compile time instruction count for iteration 9 is 137481664
collecting compile time instruction count for basic_modules_ModuleComparison_inductor
compile time instruction count for iteration 0 is 917209684
compile time instruction count for iteration 1 is 899154426
compile time instruction count for iteration 2 is 898145079
compile time instruction count for iteration 3 is 899817018
compile time instruction count for iteration 4 is 899184687
compile time instruction count for iteration 5 is 898172885
compile time instruction count for iteration 6 is 899958951
compile time instruction count for iteration 7 is 899348186
compile time instruction count for iteration 8 is 897745404
compile time instruction count for iteration 9 is 899581123
collecting compile time instruction count for basic_modules_ModuleComparison_eager
compile time instruction count for iteration 0 is 113165302
compile time instruction count for iteration 1 is 112724376
compile time instruction count for iteration 2 is 112774611
compile time instruction count for iteration 3 is 114465211
compile time instruction count for iteration 4 is 112689572
compile time instruction count for iteration 5 is 112726465
compile time instruction count for iteration 6 is 112853691
compile time instruction count for iteration 7 is 112295238
compile time instruction count for iteration 8 is 114022136
compile time instruction count for iteration 9 is 112664932
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134658
Approved by: https://github.com/anijain2305
ghstack dependencies: #133834, #134635, #134649, #134652
2024-08-30 02:13:52 +00:00
a645a18d2e [reland][dtensor][MTPG] make sharding prop lru cache not shared among threads (#134509)
**Summary**
reland of https://github.com/pytorch/pytorch/pull/134294

Fixes #131446
Fixes #126852
Fixes #126868
Fixes #126493

The PR was reverted due to CI red signal in https://github.com/pytorch/pytorch/actions/runs/10537099590/job/29201744658. It seems that the `gaussian_nll_loss` test had been flaky before my original PR #134294 . Therefore this PR also removes the `xfail` mark on this specific test to make CI signal green.

See the error message below:
```
2024-08-24T13:42:01.3228990Z ==================================== RERUNS ====================================
2024-08-24T13:42:01.3229530Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3229710Z Unexpected success
2024-08-24T13:42:01.3230235Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3230407Z Unexpected success
2024-08-24T13:42:01.3230594Z =================================== FAILURES ===================================
2024-08-24T13:42:01.3231128Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3231296Z Unexpected success
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134509
Approved by: https://github.com/tianyu-l, https://github.com/wz337
2024-08-30 02:13:45 +00:00
27ffa67984 Support __class__ attr for tuple and list variables (#134099)
Fixes #134086

This adds support for the `__class__` attribute on TupleVariable and ListVariable, and allows constructing a tuple or list via the `__class__` attribute. This patch also fixes a bug in NamedTupleVariable, which was missing a return when calling the super var_getattr.
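
The kind of user code this targets might look like the following. It is run here as plain Python without `torch.compile`, so it only shows the pattern, not Dynamo's behavior; `double_all` is a made-up example function.

```python
def double_all(seq):
    # seq.__class__ is `tuple` for a tuple and `list` for a list, so the
    # result keeps the container type of the input.
    return seq.__class__(v * 2 for v in seq)


as_tuple = double_all((1, 2, 3))
as_list = double_all([1, 2, 3])
```
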

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134099
Approved by: https://github.com/anijain2305, https://github.com/jansel
2024-08-30 01:57:49 +00:00
cf11fc0dcb dynamo: Only log if we've disabled eval_frame once. (#134529)
This spams logs pretty badly otherwise

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134529
Approved by: https://github.com/chuanhaozhuge, https://github.com/oulgen
2024-08-30 00:35:25 +00:00
8b68912dfc Correctly detect "Rate limit exceeded" error (#134785)
Currently all 403 errors are treated as "Rate limit exceeded":
https://github.com/pytorch/pytorch/actions/runs/10622019167/job/29445336924

[Github docs](https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api?apiVersion=2022-11-28#exceeding-the-rate-limit) claim:
> If you exceed your primary rate limit, you will receive a 403 or 429 response, and the x-ratelimit-remaining header will be 0. You should not retry your request until after the time specified by the x-ratelimit-reset header.

After this change:
https://github.com/pytorch/pytorch/actions/runs/10622365327/job/29446456395

Note, the 403 error in the jobs above is a separate issue, this PR addresses only the logging.
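
Based on the GitHub docs quoted above, a stricter check could look like the sketch below. The function name is hypothetical and this is not the code changed by the PR; it only encodes the documented condition (status 403 or 429 together with `x-ratelimit-remaining` equal to 0).

```python
from typing import Mapping


def is_rate_limit_error(status: int, headers: Mapping[str, str]) -> bool:
    """Return True only for responses that match the documented rate-limit signature."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {k.lower(): v for k, v in headers.items()}
    remaining = normalized.get("x-ratelimit-remaining")
    return status in (403, 429) and remaining == "0"
```

A 403 with quota still remaining is then reported as an ordinary permission error rather than as "Rate limit exceeded".
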
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134785
Approved by: https://github.com/clee2000
2024-08-29 23:58:15 +00:00
3402a5d865 fix windows xpu build issue (#133845)
# Motivation
If XPU is built with oneAPI 2024.2, the build fails because `sycl-preview.lib` exists on Windows, and linking this unexpected lib results in `error LNK2019: unresolved external symbol`.

# Solution
Explicitly use `sycl-preview` in the Linux build only.

# Additional Context
For `find_library`, please note that the variable will not be updated if it has been stored.
```
If the library is found the result is stored in the variable and the search will not be repeated unless the variable is cleared.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133845
Approved by: https://github.com/min-jean-cho, https://github.com/EikanWang, https://github.com/atalman, https://github.com/malfet
2024-08-29 23:53:32 +00:00
3775fc982d [Inductor][CPP] Fix Index name error (#134645)
**Summary**

Addresses the comment: https://github.com/pytorch/pytorch/pull/122961#issuecomment-2313930242. For all of the cases we see in the three test suites (TorchBench, TIMM, Hugging Face), we expect:

* `_node` is a FX Node with target in ["index_expr", "load", "store"]
* `_node.args[1 if _node.target == "index_expr" else 2]` is another FX node with target `get_index`
* `_node.args[1 if _node.target == "index_expr" else 2].args[0]` is a str for the name of this index expression

This does not seem to hold in some FB-internal test cases, judging from the failure log posted in the link above. So this PR adds a condition check to work around it.
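
The three expectations translate into a defensive lookup along these lines. `FakeNode` is a hypothetical stand-in for an FX node so that the sketch runs without PyTorch, and `index_name` is not the function touched by the PR.

```python
from dataclasses import dataclass
from typing import Any, Optional, Tuple


@dataclass
class FakeNode:
    """Minimal stand-in for an FX node: a target name plus positional args."""
    target: str
    args: Tuple[Any, ...] = ()


def index_name(node: FakeNode) -> Optional[str]:
    """Return the index expression name, or None if any expectation fails."""
    if node.target not in ("index_expr", "load", "store"):
        return None
    pos = 1 if node.target == "index_expr" else 2
    if len(node.args) <= pos:
        return None
    idx = node.args[pos]
    # The argument must itself be a node produced by `get_index`.
    if not isinstance(idx, FakeNode) or idx.target != "get_index":
        return None
    # Its first argument must be the name of the index expression.
    if not idx.args or not isinstance(idx.args[0], str):
        return None
    return idx.args[0]
```

Returning `None` instead of raising lets the caller skip nodes that do not match the expected shape.
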

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134645
Approved by: https://github.com/jgong5, https://github.com/masnesral
2024-08-29 23:33:15 +00:00
d13ce2e2b5 [c10d] release gil lock during eager init (#134779)
Summary:
We found that if we init the PG in a background thread, it blocks
the main thread until init is complete. This is because the pybind binding
never releases the GIL.
Test Plan:
existing CI on eager init

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134779
Approved by: https://github.com/c-p-i-o
2024-08-29 23:25:33 +00:00
71ff168dbb pytorch: llvm_codegen: prefix JIT generated functions with 8B of data so jitted code can be called from ASAN+UBSAN on LLVM17 (llvm/llvm-project#65253) (#134572)
Summary:
Similar workaround was already applied elsewhere in pytorch https://github.com/pytorch/pytorch/pull/133623 {D61348865}

LLVM17 UBSAN change discussion https://github.com/llvm/llvm-project/issues/104505

Here we also have to associate the data with the function with `setPrefixData(dummyPrefixData)` to prevent this workaround being disabled by the `optimize(*module_);` call which  could change layout/remove the unused variable/etc.

Differential Revision: D61845799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134572
Approved by: https://github.com/atalman
2024-08-29 23:15:13 +00:00
496e57283d add add_loop benchmarks (#134652)
This benchmark measures the cost of compiling the following function with the eager and inductor backends;
it is basically two benchmarks.

```
        @torch.compile(backend=self.backend, fullgraph=True)
        def f(a, b):
            result = a.clone()
            for i in range(1000):
                if i % 3 == 0:
                    result = result + b
                elif i % 3 == 1:
                    result = result + 8 * b
                else:
                    result = result.sin()
            return result
```

 PYTHONPATH=$(pwd) python benchmarks/add_loop.py out
 ```
collecting compile time instruction count for add_loop_eager
compile time instruction count for iteration 0 is 8286649663
compile time instruction count for iteration 1 is 2838971338
compile time instruction count for iteration 2 is 2834263023
compile time instruction count for iteration 3 is 2829447493
compile time instruction count for iteration 4 is 2830904231
compile time instruction count for iteration 5 is 2830281077
compile time instruction count for iteration 6 is 2831466595
compile time instruction count for iteration 7 is 2830732164
compile time instruction count for iteration 8 is 2831088056
compile time instruction count for iteration 9 is 2831204407

collecting compile time instruction count for add_loop_inductor
compile time instruction count for iteration 0 is 32585687849
compile time instruction count for iteration 1 is 11747553436
compile time instruction count for iteration 2 is 11746959875
compile time instruction count for iteration 3 is 11749479461
compile time instruction count for iteration 4 is 11750053711
compile time instruction count for iteration 5 is 11750793958
compile time instruction count for iteration 6 is 11751673576
compile time instruction count for iteration 7 is 11754552912
compile time instruction count for iteration 8 is 11753723127
compile time instruction count for iteration 9 is 11759059942
```
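
One way to read a log like this is to set the first iteration aside and look at the spread of the remaining ones. The sketch below does that for the `add_loop_eager` numbers above; treating iteration 0 as a warm-up outlier is an interpretation made here, not something the benchmark script is known to do.

```python
from statistics import median

# add_loop_eager instruction counts, copied from the log above.
counts = [
    8286649663, 2838971338, 2834263023, 2829447493, 2830904231,
    2830281077, 2831466595, 2830732164, 2831088056, 2831204407,
]

steady = counts[1:]            # drop iteration 0
mid = median(steady)

# Largest relative deviation from the median among the steady iterations.
max_rel_dev = max(abs(c - mid) for c in steady) / mid

# How much more expensive the first iteration is than a typical one.
first_iter_ratio = counts[0] / mid
```
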

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134652
Approved by: https://github.com/anijain2305
ghstack dependencies: #133834, #134635, #134649
2024-08-29 23:04:01 +00:00
65864d0134 [c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931)
We introduced the dispatchable backend for a ProcessGroup and collective in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up cleanup that removes the Options of a ProcessGroup and asks users to either set the timeout or backend later on, or directly create the backend after creating a PG.

Also, PGNCCL is using the Options class from ProcessGroup, but it should actually use the Options from the backend class. So this PR aligns the type and name with what we are doing on the cpp side. I don't change the signature of the public API, so it still uses the arg named "pg_options".

We need to change the tests to align them with this change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132931
Approved by: https://github.com/H-Huang
2024-08-29 22:40:12 +00:00
8b4c487581 Fix AOTInductor compilation on ROCM (#134522)
Summary:
The original PR (https://github.com/pytorch/pytorch/pull/124123) was broken by the cpp_builder refactoring, so this resubmits it as a fix.

Test Plan: Test with command here: https://www.internalfb.com/phabricator/paste/view/P1549765548

Differential Revision: D61827208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134522
Approved by: https://github.com/frank-wei
2024-08-29 21:59:04 +00:00
1e92d7b688 [inductor] move loop ordering after fusion (#126254)
Restart the work from PR https://github.com/pytorch/pytorch/pull/100331 in this new PR, since the old one is hard to rebase. Some code is expected to be copied from the previous PR, and the main idea is the same.

Previously we saw a relatively large compilation time increase because too many loop orders were considered. This PR continues the work by pruning and only considering loop orders that we know for sure are relevant (i.e. doing it on demand).

Some manually created cases in which loop ordering matters are added as unit tests. They make sure inductor does not miss fusion opportunities for those cases.

This PR should solve the failure-to-fuse problem in https://github.com/pytorch/pytorch/issues/130015

Right now there is still a significant increase in compilation time, so the feature is disabled by default. Once the compilation time issue is resolved, I'll enable it by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126254
Approved by: https://github.com/jansel
2024-08-29 21:50:07 +00:00
416a7894fe [Windows][XPU] Disable Kineto PTI on Windows only (#134620)
Disable Kineto + XPU PTI on Windows only.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134620
Approved by: https://github.com/guangyey, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-08-29 20:58:55 +00:00
7d12e6dceb [dynamo][itertools] refactor itertools.islice to use polyfill (#133876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133876
Approved by: https://github.com/jansel
ghstack dependencies: #133769, #133778, #133779, #133864, #133894
2024-08-29 20:56:16 +00:00
a2566adfb6 [dynamo] refactor builtins.enumerate to use polyfill (#133894)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133894
Approved by: https://github.com/jansel
ghstack dependencies: #133769, #133778, #133779, #133864
2024-08-29 20:56:16 +00:00
1b70366957 [dynamo][itertools] refactor itertools.chain and itertools.chain.from_iterable to use polyfills (#133864)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133864
Approved by: https://github.com/jansel
ghstack dependencies: #133769, #133778, #133779
2024-08-29 20:56:16 +00:00
eaa449fbf0 [dynamo] simplify implementation for builtins.sum (#133779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133779
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #133769, #133778
2024-08-29 20:56:16 +00:00
b5f1ffa7ab [dynamo] simplify implementation for functools.reduce (#133778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133778
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #133769
2024-08-29 20:56:16 +00:00
e09324e7da [dynamo] simplify polyfill registration for builtins.all and builtins.any (#133769)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133769
Approved by: https://github.com/jansel
2024-08-29 20:56:16 +00:00
b977abd5de [Inductor] Fix error checking for scaled_mm lowering (#134765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134765
Approved by: https://github.com/Skylion007
2024-08-29 20:18:42 +00:00
6180574771 Move py 3.8->3.9 pull, trunk, inductor, periodic CI tests (#133624)
Part of the deprecation of Python 3.8 and the move to 3.9. Related to: https://github.com/pytorch/pytorch/issues/120718
XPU and ROCM jobs are not moved.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133624
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/ZainRizvi
2024-08-29 19:15:59 +00:00
202e5cc87d [inductor] Fix error in debug_str_extra (#134747)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134747
Approved by: https://github.com/Skylion007, https://github.com/shunting314
2024-08-29 19:09:50 +00:00
43e1df64f8 register all entry_point backends on first attempt (#132546)
fixes: https://github.com/pytorch/pytorch/issues/131360

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132546
Approved by: https://github.com/jansel
2024-08-29 18:59:29 +00:00
5470fcd5b9 [5/N] Reconcile barrier and NaN checker (#134707)
By using a zeros() tensor instead of empty() tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134707
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
ghstack dependencies: #134345, #134357, #134701
2024-08-29 18:51:12 +00:00
d91b49dbaa expandable_segments <-> other allocator options (#134338)
Previously, setting garbage_collection_threshold or max_split_size_mb along with expandable_segments:True could cause the allocator to hit assert failures when running nearly out of memory. This PR ensures garbage_collection and max_split freeing do not accidentally try to release expandable segments.
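
For reference, a minimal sketch of how these options end up in one allocator configuration string. The option names come from the description above and `PYTORCH_CUDA_ALLOC_CONF` is the usual environment variable for them; the values 0.8 and 128 are placeholders, and the helper `build_alloc_conf` is hypothetical.

```python
import os


def build_alloc_conf(options):
    """Join allocator options into the comma-separated `key:value` form."""
    return ",".join(f"{key}:{value}" for key, value in options.items())


# Placeholder values; only the option names come from the commit message.
conf = build_alloc_conf(
    {
        "expandable_segments": True,
        "garbage_collection_threshold": 0.8,
        "max_split_size_mb": 128,
    }
)

# Set this before the process makes its first CUDA allocation,
# otherwise the allocator has already read its configuration.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = conf
```

This is the combination that previously could hit the assert failures when memory was nearly exhausted.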

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134338
Approved by: https://github.com/ezyang
2024-08-29 18:43:59 +00:00
3fc6e47d42 [AOTI] Fix cosmetic indentation issue in cuda cpp wrapper codegen for DeferredCudaKernelLine/GridLine (#134705)
Summary:
Follow up fix for D61018114, D61800622

Increase indentation for `loadKernel` `launchKernel` and `Grid` lines.

Test Plan:
```
TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_zero_grid_with_unbacked_symbols_abi_compatible_cuda
```
```
TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_zero_grid_with_backed_symbols_abi_compatible_cuda
```

Differential Revision: D61927248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134705
Approved by: https://github.com/ColinPeppler
2024-08-29 18:38:45 +00:00
5573c17877 [BE][Ez]: Update ruff to 0.6.3 (#134769)
Mostly bugfix release, updating because it fixes an edgecase with a rule we are using

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134769
Approved by: https://github.com/albanD
2024-08-29 18:35:47 +00:00
ce96146623 [PT2] Fix node metadata setting in group_batch_fusion_aten (#134543)
Summary: The current implementation results in `meta` missing fields like `val`; use `FakeTensorProp` to update the information.

Differential Revision: D61832932

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134543
Approved by: https://github.com/frank-wei
2024-08-29 18:32:04 +00:00
348d02a983 Changed masked out rows logsumexp to be -inf and not zero (#134650)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134650
Approved by: https://github.com/yanboliang, https://github.com/BoyuanFeng, https://github.com/drisspg
2024-08-29 17:22:52 +00:00
36a6516290 [export] use single FQN for param_buffer_mapping (#134500)
Fixes #133252

In strict mode, we have this routine for mapping traced parameters to their FQNs using tensor ids. Currently we assume there's at least 1 unique FQN for each traced parameter, but this seems to break with parameter reuse when call_module nodes are present. Adding a test case where this breaks.

Fixes this by assigning the same FQN to all traced parameters with the same tensor id. This is fine because we return the original state_dict for the EP, and the unflattener has its own routine of handling aliasing: https://github.com/pytorch/pytorch/pull/125758
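
A rough sketch of the idea, not the actual export code: plain Python objects stand in for parameter tensors, `id()` stands in for the tensor id, and `map_params_to_fqns` is a hypothetical helper name.

```python
def map_params_to_fqns(named_params, traced_params):
    """Map each traced parameter to a single FQN, keyed by object identity."""
    # The first FQN registered for an object wins; aliases reuse it.
    fqn_by_id = {}
    for fqn, param in named_params:
        fqn_by_id.setdefault(id(param), fqn)
    return [fqn_by_id[id(param)] for param in traced_params]


shared = object()  # stands in for a parameter reused by two modules
other = object()
named = [
    ("layer1.weight", shared),
    ("layer2.weight", shared),
    ("head.bias", other),
]
fqns = map_params_to_fqns(named, [shared, shared, other])
```

Both traced occurrences of the shared parameter resolve to `layer1.weight`, so the mapping no longer needs one unique FQN per traced parameter.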
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134500
Approved by: https://github.com/angelayi
2024-08-29 17:06:31 +00:00
d9d95dc55e [4/N] Test NaN checker against broadcast (#134701)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134701
Approved by: https://github.com/wconstab
ghstack dependencies: #134345, #134357
2024-08-29 17:00:07 +00:00
ab646cd805 Revert "[reland][dtensor][MTPG] make sharding prop lru cache not shared among threads (#134509)"
This reverts commit ba5aec88c678fe4b9ad101602c29726724f56e21.

Reverted https://github.com/pytorch/pytorch/pull/134509 on behalf of https://github.com/ZainRizvi due to Sorry but this fails internally. For details see D61953754 ([comment](https://github.com/pytorch/pytorch/pull/134509#issuecomment-2318323161))
2024-08-29 16:39:19 +00:00
26aea277f7 [3/N] Set correct device to CUDA guards (#134357)
In `collective()`, `pointToPoint()` and `collectiveCoalesced()`, CUDA guards were created with an unset (default) CUDA device. This is the reason for the IMA facing the NaN checker in issue https://github.com/pytorch/pytorch/issues/134062.

With this fix, `torch.cuda.set_device(device)` is not needed to work around the IMA.

Also refactored a couple places where the guard is created -- preferably we create the guard with a known device, rather than setting the device later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134357
Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang
ghstack dependencies: #134345
2024-08-29 16:25:27 +00:00
d503217ea4 [inductor] calibration inductor windows uts (15/N) (#134586)
Fix the `test_logs_out` UT on Windows, so that all UTs in `test/dynamo/test_logging.py` pass on Windows.

Changes:
1. Close `NamedTemporaryFile` to release the file handle and avoid a PermissionError.
2. Set up `NamedTemporaryFile` with `delete=False` so the file is not auto-deleted.
3. Open the log file as "utf-8" to align with Linux.
4. Handle the wrap difference on Windows.
5. Delete the tmp file manually.
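
The file-handling part of these changes can be sketched with the standard library alone. This is an illustration under assumptions, not the test code itself; `write_and_read_log` is a hypothetical helper.

```python
import os
import tempfile


def write_and_read_log(text):
    """Write text to a temp file and read it back in a Windows-safe way."""
    # delete=False: the file must outlive the handle so it can be reopened.
    tmp = tempfile.NamedTemporaryFile(
        mode="w", encoding="utf-8", suffix=".log", delete=False
    )
    try:
        tmp.write(text)
        # Release the handle first; an open handle can cause
        # PermissionError on Windows when the file is reused or removed.
        tmp.close()
        with open(tmp.name, encoding="utf-8") as f:
            return f.read()
    finally:
        tmp.close()  # no-op if already closed
        os.remove(tmp.name)  # manual cleanup, since delete=False


content = write_and_read_log("compile log: ok\n")
```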

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134586
Approved by: https://github.com/jansel
2024-08-29 16:18:40 +00:00
9953f55f4c [2/N] Add flag to control which rank should perform NaN check (#134345)
Fixes https://github.com/pytorch/pytorch/issues/134062.
For example, in case of broadcast / scatter, only the root rank should perform the NaN check.
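
A toy sketch of the rule, assuming the check is decided per collective and per rank. The names `should_check_nan` and `check_for_nan` here are illustrative Python stand-ins, not the c10d implementation.

```python
import math


def should_check_nan(op, rank, root):
    """Return True if this rank's input buffer carries meaningful data."""
    # For rooted collectives only the root contributes input; the other
    # ranks' buffers are receive-only and may hold arbitrary values.
    if op in ("broadcast", "scatter"):
        return rank == root
    return True


def check_for_nan(values, op, rank, root):
    """Raise if a rank that should be checked holds a NaN."""
    if should_check_nan(op, rank, root) and any(math.isnan(v) for v in values):
        raise ValueError(f"NaN detected in {op} input on rank {rank}")
```

With this rule a non-root rank holding garbage in its receive buffer no longer trips the checker, while a NaN on the root still does.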

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134345
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
2024-08-29 16:13:15 +00:00
387d3fc296 [AOTI] Switch benchmarking to use export non-strict mode (#130977)
Summary: Switch the export part used by AOTInductor benchmarking from strict to non-strict, and switch it from producing torch IR to aten IR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130977
Approved by: https://github.com/angelayi
ghstack dependencies: #134639
2024-08-29 16:08:52 +00:00
0dbc72887b [CPU][flash attention] make the stride of output align with input (#134656)
Fixes #133671

Currently, the output of CPU flash attention has a fixed layout, no matter what the input is. This PR makes the stride of output align with input q/k/v, which is the same behavior as math backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134656
Approved by: https://github.com/jgong5, https://github.com/drisspg
2024-08-29 16:04:25 +00:00
4fcd15a667 Fix test_sgd_weight_decay_xpu accuracy error (#134744)
Fixes #134743

This PR adds `test_sgd_weight_decay_xpu` in `KERNEL_COUNT_OVERRIDES` to override.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134744
Approved by: https://github.com/EikanWang, https://github.com/desertfire
2024-08-29 15:12:40 +00:00
594162f7ab [dynamo] Support reading attributes from pybind objects (#134630)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134630
Approved by: https://github.com/jansel
2024-08-29 15:06:52 +00:00
92e38a476f preserve aten::to device in export training (#134622)
Summary:
With training IR, we cannot rely on trapping `to()` in `FunctionalTensor` because the regular decomposition kicks in first, and that can cause it to be optimized away.

So instead we preserve it until we functionalize, and then replace it explicitly with `_to_copy()`.

Test Plan: expected test failures go away

Differential Revision: D61883878

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134622
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
2024-08-29 14:53:30 +00:00
092349dcdd Never CSE aten.empty in the partitioner (#134703)
aten.empty is almost always fusible into its consumer, so we never CSE
it. This fixes a bug that looks like the following:

```py
@torch.library.custom_op("_reinplacing::sin_cos", mutates_args={"out_sin", "out_cos"})
def sin_cos(x: torch.Tensor, out_sin: torch.Tensor, out_cos: torch.Tensor) -> None:
    out_sin.copy_(x.sin())
    out_cos.copy_(x.cos())

@torch.compile
def f(x):
    out0 = torch.empty_like(x)
    out1 = torch.empty_like(x)
    sin_cos(x, out0, out1)
    return x.clone(), out0, out1

x = torch.randn(3, requires_grad=True)
f(x)
```

- cse would de-duplicate the empty nodes
- reinplacing would add an additional clone (because it can't write to
  both tensors at the same time)
- the clone lowers into a new buffer + a copy_ kernel
- the copy_ kernel is unnecessary because "empty" is special - all reinplacing needed was an additional
  buffer, it doesn't matter what the values are.

We could attempt to fix this on the reinplacing side but this seemed
better as a partitioner heuristic and the reinplacing fix is a bit more
tricky (we'd need to identify that the op never reads from the empty
node).
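
As a rough illustration of the heuristic, here is a toy CSE pass over `(name, op, args)` tuples that refuses to merge `empty` nodes. The graph representation and the names `NEVER_CSE` and `cse` are made up for this sketch; the real partitioner works on FX graphs.

```python
NEVER_CSE = {"empty"}  # ops whose results must stay distinct buffers


def cse(nodes):
    """nodes: list of (name, op, args). Returns (kept_nodes, renames)."""
    seen = {}     # (op, args) -> name of the first equivalent node
    renames = {}  # duplicate name -> surviving name
    kept = []
    for name, op, args in nodes:
        # Rewrite args through earlier renames so chains dedupe too.
        args = tuple(renames.get(a, a) for a in args)
        key = (op, args)
        if op not in NEVER_CSE and key in seen:
            renames[name] = seen[key]
            continue
        seen.setdefault(key, name)
        kept.append((name, op, args))
    return kept, renames


graph = [
    ("out0", "empty", ("x",)),
    ("out1", "empty", ("x",)),
    ("s0", "sin", ("x",)),
    ("s1", "sin", ("x",)),
]
kept, renames = cse(graph)
```

The two `sin` nodes are merged, but `out0` and `out1` remain separate, so a mutating op can write to both without reinplacing having to add a clone.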

Test Plan:
- new test (the old number was 27, the new number is 21, so this PR
  helped).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134703
Approved by: https://github.com/yf225
ghstack dependencies: #134466, #134490, #134491
2024-08-29 13:51:19 +00:00
70853b792a [dynamo][itertools] support itertools.tee (#133771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133771
Approved by: https://github.com/jansel
ghstack dependencies: #133801
2024-08-29 13:36:52 +00:00
9e806c1a60 [dynamo] simplify implementation for os.fspath (#133801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133801
Approved by: https://github.com/anijain2305
2024-08-29 13:36:52 +00:00
d01a7a9faa [dynamo] Graph break on FSDP flat_param inconsistent tensor and grad dtype (#134614)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134614
Approved by: https://github.com/awgu, https://github.com/yf225
ghstack dependencies: #134610, #134590, #134621
2024-08-29 09:14:42 +00:00
fb35d1e01f [reland][dynamo][exceptions] Support raise from None (#134621)
The PR was reverted because this PR traced more code and surfaced a latent bug. Resubmitting w/o any changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134621
Approved by: https://github.com/jansel
ghstack dependencies: #134610, #134590
2024-08-29 09:14:42 +00:00
2bf622685d [dynamo][dicts] Support hasattr on dicts (#134590)
Fixes - https://github.com/pytorch/pytorch/issues/134577

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134590
Approved by: https://github.com/Skylion007
ghstack dependencies: #134610
2024-08-29 09:14:42 +00:00
2446dead35 [dynamo][exceptions] Use exception subclass whenever possible (#134610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134610
Approved by: https://github.com/drisspg, https://github.com/jansel
2024-08-29 09:14:42 +00:00
cfb642bb6b [DTensor] Extend implicit replication to replicate DTensor for foreach ops so model doesn't have to be fully tp-ed when using 2D (#134551)
Fixes [134212](https://github.com/pytorch/pytorch/issues/134212)

Currently, when we use 2D FSDP with TP, `optimizer.step()` would fail if the model were not fully tensor parallelized. If we don't have the entire model tensor parallelized when doing 2D, we would have both 1D and 2D DTensor parameters. As foreach is turned on by default, `optimizer.step()` would fail as cross mesh op is not allowed. Error as follows:

```
NotImplementedError: aten._foreach_mul_.Scalar: DTensor does not support cross-mesh operation yet!Got meshes: DeviceMesh('cuda', [[0, 1], [2, 3]], mesh_dim_names=('dp', 'tp')) DeviceMesh('cuda', [1, 3], mesh_dim_names=('dp',))
```

In this PR, we extend implicit_replication to replicate DTensor in missing dimensions for foreach ops. If users don't want to fully tensor parallelize the model when using 2D, they have the option of using the `implicit_replication()` context manager for `optimizer.step()`. In this case, we would swap out the 1D DTensorSpec and replace it with 2D DTensorSpec. However, we don't want to turn this on by default yet, as we want the users to be aware that the tp dimension is replicated if a layer is not tp-ed.

With implicit replication turned on, replicating the DTensor spec in the missing dimension works for most foreach cases, except when the first DTensor in the list is one that also needs to be replicated. This is currently a limitation for which I don't have a good solution yet. In other words, with this change we can handle every case except the one where the first DTensor's ndim is not the largest.
```
[2D_DTensor, 1D_DTensor...] ---> Implicit_replication() can handle this.
[1D_DTensor, 2D_DTensor...] ---> Implicit_replication() can't handle this.
```
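
The limitation can be stated as a one-line predicate. This is only a sketch of the condition described above; `can_implicitly_replicate` is a hypothetical helper, and the integers stand for the mesh dimensionality of each DTensor in the foreach argument list.

```python
def can_implicitly_replicate(mesh_ndims):
    """mesh_ndims: mesh ndim of each DTensor in a foreach argument list."""
    # Per the description above, the list is handled only when the first
    # DTensor already has the largest mesh ndim in the list.
    return bool(mesh_ndims) and mesh_ndims[0] == max(mesh_ndims)


handled = can_implicitly_replicate([2, 1])    # [2D_DTensor, 1D_DTensor...]
unhandled = can_implicitly_replicate([1, 2])  # [1D_DTensor, 2D_DTensor...]
```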

This change doesn't affect the existing default behavior, as `implicit_replication()` is not turned on by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134551
Approved by: https://github.com/tianyu-l
2024-08-29 09:01:31 +00:00
3645634f3c [1/N] Move NaN check onto NCCL stream (#134300)
So that the tensor's lifetime management is the same as the management built for the NCCL pre and post kernels.
Also so that the check kernels show up in the NCCL stream line on visualizers. If they showed up in the compute line instead, users might get confused ("my code does not have these kernels").

The check is thus moved after the point where we make the NCCL stream depend on the last compute kernel.

Also moved declaration of `checkForNan` from Utils.hpp to NCCLUtils.hpp, and renamed Utils.cu to NCCLUtils.cu.

Differential Revision: [D61957573](https://our.internmc.facebook.com/intern/diff/D61957573)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134300
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
2024-08-29 08:28:49 +00:00
578b8d75e5 [2nd try][Traceable FSDP2] Allow tracing through FSDP2 impl in trace_rules.py (#134539)
The previous PR https://github.com/pytorch/pytorch/pull/133532 caused a stuck compilation issue on internal models. In this 2nd attempt PR, we gate the trace_rules.py changes with `if not torch._dynamo.config.skip_fsdp_hooks:`, so that they don't take effect for the current graph-break FSDP2 (which relies on the default config value `skip_fsdp_hooks=True`), and only take effect when we are using Traceable FSDP2 (in which case the user needs to proactively set `skip_fsdp_hooks=False`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134539
Approved by: https://github.com/ckluk2, https://github.com/yanboliang
2024-08-29 06:28:16 +00:00
834d8b0965 [Inductor][mkldnn] Bug fix: incorrect codegen arg order for qconv (#134579)
Fixes #133448

The arg order for mkldnn qconv IR became incorrect after PR #132367 . This PR fixes the bug.

**Test plan**
`python test/inductor/test_mkldnn_pattern_matcher.py -k qconv`
`python test/inductor/test_cpu_cpp_wrapper.py -k qconv`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134579
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-08-29 06:20:52 +00:00
b0a6d9ad27 [DTensor] Add pointwise ops strategy for aten.isinf, aten.isneginf, aten.isposinf (#134699)
Need it for https://github.com/facebookresearch/optimizers/blob/main/distributed_shampoo/utils/shampoo_preconditioner_list.py#L671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134699
Approved by: https://github.com/tianyu-l
2024-08-29 06:01:12 +00:00
da9e61ef70 Get accumulate dtype for Intel GPU (#134465)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

There are two function variants to get accumulated dtype for a given dtype:

- Func1: `c10::ScalarType toAccumulateType(c10::ScalarType type, c10::DeviceType device)`
- Func2: `c10::ScalarType toAccumulateType(c10::ScalarType type, bool is_cuda)`

Func1 is general enough to support different devices, while Func2 only supports CUDA and CPU. This PR adds the Intel GPU path to Func1. We expect users to invoke Func1 to ensure compatibility across devices.

* __->__ #134465

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134465
Approved by: https://github.com/Skylion007, https://github.com/atalman
2024-08-29 05:27:57 +00:00
94db935749 Add torch.serialization.skip_data context manager (#134504)
## Semantics

The semantics are:
(1) By default `torch.serialization.skip_data(materialize_fake_tensors=False)` will make `torch.save` skip writing storages (but reserve space for them in the checkpoint).

```python
import torch
import torch.nn as nn

sd = nn.Linear(3, 5).state_dict()
with torch.serialization.skip_data():
    torch.save(sd, 'foo.pt')
print(torch.load('foo.pt', weights_only=True))
```

(2) With `torch.serialization.skip_data(materialize_fake_tensors=True)`, if a FakeTensor is passed to `torch.save`, the pickler will treat it as being "materialized" (space will be reserved in the checkpoint for the associated storage bytes, and when loading, the type will be Tensor instead of FakeTensor).

```python
import torch
import torch.nn as nn
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    m = nn.Linear(3, 5, dtype=torch.float16, device='cuda')

sd = m.state_dict()
with torch.serialization.skip_data(materialize_fake_tensors=True):
    torch.save(sd, 'bla.pt')
print(torch.load('bla.pt', weights_only=True))
# OrderedDict([('weight', tensor([[0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.]], device='cuda:0', dtype=torch.float16)), ('bias', tensor([0., 0., 0., 0., 0.], device='cuda:0', dtype=torch.float16))])

```

## Follow Ups

- [ ] `torch.load` semantic for skip_data context manager
- [ ] Mechanism for getting offsets of storages saved via this method (for writing in a separate pass)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134504
Approved by: https://github.com/albanD
2024-08-29 04:52:52 +00:00
297b42012d [PyTorch] Use pinned memory for zero_cuda_out (#134712)
Summary: This diff creates a pinned tensor for copying from device for the zero_out op.

Differential Revision: D61759262

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134712
Approved by: https://github.com/zyan0
2024-08-29 04:46:08 +00:00
a32255481b [caffe2][hipify] remove un-used flag from pybind_utils.h (#134404)
Summary:
Encountered issues related to AMD build when working on https://www.internalfb.com/diff/D60739324?dst_version_fbid=2203158110057105 (see stack trace P1545717562)

Looking at the file history, it seems that the flag is no longer used, so I propose to remove it. Alternatively, I could change the `#ifdef` to check both `USE_C10D_NCCL` and `USE_ROCM` and include the corresponding AMD header files.

Let me know which way is preferred.

Test Plan: Sandcastle

Differential Revision: D61762129

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134404
Approved by: https://github.com/malfet
2024-08-29 04:09:44 +00:00
4655eb3ee2 Uses MemPoolContext to route allocations from CUDACachingAllocator (#134685)
Re-open of https://github.com/pytorch/pytorch/pull/133599 that was mistakenly closed by issuing `ghstack land`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134685
Approved by: https://github.com/ezyang
2024-08-29 03:56:31 +00:00
4b4ba7ab06 [NJT] Support NJT SDPA + meta-device flop counting (#134289)
A user wants to use the flop counter with meta devices. This previously caused problems for SDPA+NJT:

1. autocast check: `torch.is_autocast_enabled("meta")` fails because `meta` is not valid for autocasting. If we skip this, we run into the next error
2. math backend: conversion to NST requires getting concrete offsets in a list of python integers, which doesn't work on a meta tensor b2eb0e8c6a/torch/nested/_internal/sdpa.py (L809-L815)
3. (fixed in the previous PR, #134288) - if we force using flash attention backend for flop counting, `_flash_attention_forward` previously didn't support meta tensors.

In this PR, we check specifically for FlopCounterMode, and, if it's enabled and combined with meta tensors, (a) skip autocasting and (b) force it down the flash attention path. This isn't generally safe for tracing (e.g. if you actually care which kernels you are running), but in the absence of actual device information, we have to make some assumptions. By specifically checking for FlopCounterMode, this should reduce the chance of unintended side effects for other meta tensor users.

Note: fake tensor would solve a bunch of these issues, but it's not a viable solution right now for the user.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134289
Approved by: https://github.com/soulitzer
ghstack dependencies: #134288
2024-08-29 03:43:42 +00:00
17e9c2d1e7 Add oneDNN support for Half LSTM on CPU (#132607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132607
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-08-29 03:40:10 +00:00
41e36e2b46 Reflect check_labels status as a signal (#134711)
Fixes the workflow when meta-exported diff (co-dev) doesn't have the required labels, but the signal is suppressed due to job failure (e.g. [see this run](https://github.com/pytorch/pytorch/actions/runs/10590994706/job/29347663526?pr=134484)).

With this change the workflow status correctly reflects the status of the check.

# Testing
* [illegal pr_num](https://github.com/pytorch/pytorch/actions/runs/10603163898/job/29386843591)
* [successful run](https://github.com/pytorch/pytorch/actions/runs/10603279052/job/29387230110) (topic label present)
* no labels: [check fails](https://github.com/pytorch/pytorch/actions/runs/10603310368/job/29387333864)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134711
Approved by: https://github.com/clee2000
2024-08-29 03:11:16 +00:00
4f9c68454a [inductor]Let output or input_as_strided match exact strides (#130956)
Fixes #130394

TorchInductor doesn't respect the original strides of outputs, which opens up optimization opportunities like changing the memory layout. But in some cases, such as the one in https://github.com/pytorch/pytorch/issues/130394, the output must match the exact strides required. Correctness is the first priority, so this PR adds a new API `ir.ExternKernel.require_exact_strides(x, exact_strides, allow_padding=False)` to fix the issue. This PR makes the strides of dense and non-dense outputs follow the strides required by semantics.

The comparison between the generated code before and after this fix for the test is below.

```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 128
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 8
    x1 = (xindex // 8)
-   x2 = xindex
    tmp0 = tl.load(in_ptr0 + (x0 + (16*x1)), xmask)
    tmp1 = tmp0 + tmp0
-   tl.store(out_ptr0 + (x2), tmp1, xmask)
+   tl.store(out_ptr0 + (x0 + (16*x1)), tmp1, xmask)

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (16, 8), (16, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
-       buf1 = empty_strided_cuda((16, 8), (8, 1), torch.float32)
+       buf1 = empty_strided_cuda((16, 8), (16, 1), torch.float32)
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_copy_0.run(arg0_1, buf1, 128, grid=grid(128), stream=stream0)
        del arg0_1
    return (buf1, )
```

buf1 is created with the exact strides required by the user, and its values are written with the same strides as the input.
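
To make the stride difference concrete, here is a small pure-Python sketch of the index arithmetic for the `(16, 8)` buffer in the generated code above. The helpers `contiguous_strides` and `offset` are illustrative, not inductor APIs.

```python
def contiguous_strides(shape):
    """Row-major strides for a dense tensor of the given shape."""
    strides, step = [], 1
    for size in reversed(shape):
        strides.append(step)
        step *= size
    return tuple(reversed(strides))


def offset(index, strides):
    """Linear memory offset of a multi-dimensional index."""
    return sum(i * s for i, s in zip(index, strides))


shape = (16, 8)
dense = contiguous_strides(shape)  # layout of buf1 before the fix
exact = (16, 1)                    # layout of the input, required after the fix

row1_dense = offset((1, 0), dense)  # x0 + 8*x1, i.e. the old store index
row1_exact = offset((1, 0), exact)  # x0 + 16*x1, matching the load index
```

Before the fix the store used the dense offsets while the load used the padded ones; after it, both sides step 16 elements per row.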

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130956
Approved by: https://github.com/eellison, https://github.com/blaine-rister, https://github.com/desertfire
2024-08-29 03:06:58 +00:00
4811dc3de9 Revert "[dynamo] simplify polyfill registration for builtins.all and builtins.any (#133769)"
This reverts commit cc3a76edbac4a48381db6ccc44a83927f80c545b.

Reverted https://github.com/pytorch/pytorch/pull/133769 on behalf of https://github.com/ZainRizvi due to Sorry but this has been discovered to be causing a performance regression internally ([comment](https://github.com/pytorch/pytorch/pull/133769#issuecomment-2316620213))
2024-08-29 03:00:47 +00:00
f65df5edae Revert "[dynamo][itertools] support itertools.tee (#133771)"
This reverts commit 1dbd3476de07d7f07489e243cb7a43073e8c25c1.

Reverted https://github.com/pytorch/pytorch/pull/133771 on behalf of https://github.com/ZainRizvi due to Sorry, have to revert this in order to be able to revert https://github.com/pytorch/pytorch/pull/133769 ([comment](https://github.com/pytorch/pytorch/pull/133771#issuecomment-2316611158))
2024-08-29 02:49:30 +00:00
eaec9e80b8 Revert "[dynamo] simplify implementation for os.fspath (#133801)"
This reverts commit 74341e1150f10b8aaddd33a165e686724424071f.

Reverted https://github.com/pytorch/pytorch/pull/133801 on behalf of https://github.com/ZainRizvi due to Sorry, have to revert this in order to be able to revert https://github.com/pytorch/pytorch/pull/133769 ([comment](https://github.com/pytorch/pytorch/pull/133771#issuecomment-2316611158))
2024-08-29 02:49:30 +00:00
76f975948e [inductor] Cleanup generate_node_schedule (#134306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134306
Approved by: https://github.com/shunting314
2024-08-29 02:45:14 +00:00
cccb121d4e [Inductor] add inductor config: masked_vec (#134566)
This PR adds an inductor config, masked_vec, to enable or disable masked vectorization for the tail_loop. It is enabled by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134566
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-08-29 02:29:06 +00:00
c5f114747e fix flakiness in update_hint_benchmark.py (#134649)
```
compile time instruction count for iteration 1 is 10732129038
compile time instruction count for iteration 2 is 10719776783
compile time instruction count for iteration 3 is 10729546868
compile time instruction count for iteration 4 is 10737655132
compile time instruction count for iteration 5 is 10732564252
compile time instruction count for iteration 6 is 10728721234
compile time instruction count for iteration 7 is 10733354271
compile time instruction count for iteration 8 is 10719588972
compile time instruction count for iteration 9 is 10706311856
```
1. add torch.manual_seed(0); inputs were not the same across iterations
2. disable gc.
3. remove loop (not needed since compilation happens only once)
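
Steps 1 and 2 can be sketched with the standard library alone. This is a minimal illustration and not the benchmark's actual code: `gc_disabled`, `measure_once`, and `workload` are hypothetical names, and `random.seed` stands in for `torch.manual_seed`.

```python
import gc
import random
from contextlib import contextmanager


@contextmanager
def gc_disabled():
    """Turn the garbage collector off for the duration of the block."""
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        # Restore the previous state instead of unconditionally enabling.
        if was_enabled:
            gc.enable()


def measure_once(workload, seed=0):
    """Run `workload` once with a fixed seed and no GC pauses (hypothetical helper)."""
    random.seed(seed)  # stand-in for torch.manual_seed(seed)
    with gc_disabled():
        return workload()


def workload():
    # Hypothetical workload whose result depends on randomly generated inputs.
    return sum(random.random() for _ in range(1000))
```

With the seed fixed, two calls to `measure_once(workload)` see identical inputs, which removes one source of run-to-run variation.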

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134649
Approved by: https://github.com/aorenste
ghstack dependencies: #133834, #134635
2024-08-29 02:22:05 +00:00
f0fceed432 Revert "[dynamo][exceptions] Use exception subclass whenever possible (#134610)"
This reverts commit 880e3d18a406777dbea6aeaf14443b0e3a8b441c.

Reverted https://github.com/pytorch/pytorch/pull/134610 on behalf of https://github.com/ZainRizvi due to Sorry, I had to revert this in order to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/134610#issuecomment-2316568553))
2024-08-29 02:02:12 +00:00
67d7040fce Revert "[dynamo][dicts] Support hasattr on dicts (#134590)"
This reverts commit c566f2465f41b8081caed205fcf5fe973fd970b3.

Reverted https://github.com/pytorch/pytorch/pull/134590 on behalf of https://github.com/ZainRizvi due to Sorry, I had to revert this in order to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/134610#issuecomment-2316568553))
2024-08-29 02:02:12 +00:00
40cebde3bc Revert "[raland][dynamo][exceptions] Support raise from None (#134621)"
This reverts commit e96dc3665a1d48434c02e17f7faed41f779cee2c.

Reverted https://github.com/pytorch/pytorch/pull/134621 on behalf of https://github.com/ZainRizvi due to Sorry, I had to revert this in order to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/134610#issuecomment-2316568553))
2024-08-29 02:02:12 +00:00
c35d1f7b3a Revert "[dynamo] Graph break on FSDP flat_param inconsistent tensor and grad dtype (#134614)"
This reverts commit e4a5958ab58e2f9b5b9c336a1d2a6449784d88d3.

Reverted https://github.com/pytorch/pytorch/pull/134614 on behalf of https://github.com/ZainRizvi due to Sorry, I had to revert this in order to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/134610#issuecomment-2316568553))
2024-08-29 02:02:12 +00:00
25531eb735 Revert "[2nd try][Traceable FSDP2] Allow tracing through FSDP2 impl in trace_rules.py (#134539)"
This reverts commit 26e392132d3039345de6aaf8643e7330f7fc3cbc.

Reverted https://github.com/pytorch/pytorch/pull/134539 on behalf of https://github.com/ZainRizvi due to Sorry, I had to revert this in order to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/134539#issuecomment-2316568257))
2024-08-29 01:59:02 +00:00
cbf5ba1e97 Revert "[1/N] Move NaN check onto NCCL stream (#134300)"
This reverts commit 94caba4899096f160eca9628acddba6032755b3b.

Reverted https://github.com/pytorch/pytorch/pull/134300 on behalf of https://github.com/kwen2501 due to This is breaking builds of MTIA ([comment](https://github.com/pytorch/pytorch/pull/134300#issuecomment-2316559704))
2024-08-29 01:50:22 +00:00
33d0c11b26 Revert "[2/N] Add flag to control which rank should perform NaN check (#134345)"
This reverts commit 2fe7e332c7a61f025ccbcdbbb4875c6bf0b9afdf.

Reverted https://github.com/pytorch/pytorch/pull/134345 on behalf of https://github.com/kwen2501 due to This is breaking builds of MTIA ([comment](https://github.com/pytorch/pytorch/pull/134300#issuecomment-2316559704))
2024-08-29 01:50:22 +00:00
43dc17fd00 Revert "[3/N] Set correct device to CUDA guards (#134357)"
This reverts commit afc76c6f2d46d7726012507ec5c67b4c04e21723.

Reverted https://github.com/pytorch/pytorch/pull/134357 on behalf of https://github.com/kwen2501 due to This is breaking builds of MTIA ([comment](https://github.com/pytorch/pytorch/pull/134300#issuecomment-2316559704))
2024-08-29 01:50:22 +00:00
503c0dd923 Revert "Add MaskedTensor support to *_like API (#128637)"
This reverts commit b6e51711a0ea6174806e75ab6e208d2d910b45f5.

Reverted https://github.com/pytorch/pytorch/pull/128637 on behalf of https://github.com/ZainRizvi due to Actually, seems like it was this commit that introduced the failure: test_maskedtensor.py::TestOperatorsCUDA::test_like_empty_like_layout1_cuda_bool [GH job link](https://github.com/pytorch/pytorch/actions/runs/10604690725/job/29392898277) [HUD commit link](b6e51711a0) ([comment](https://github.com/pytorch/pytorch/pull/128637#issuecomment-2316554188))
2024-08-29 01:42:52 +00:00
1285443994 Revert "Add torch.serialization.skip_data context manager (#134504)"
This reverts commit 202600bc2384cb19a29b8fca503bafc289158c32.

Reverted https://github.com/pytorch/pytorch/pull/134504 on behalf of https://github.com/mikaylagawarecki due to This is breaking Windows docs tests due to NamedTemporaryFile on Windows not working well ([comment](https://github.com/pytorch/pytorch/pull/134504#issuecomment-2316543901))
2024-08-29 01:30:49 +00:00
e7711d6c7d [MPS] Fix SDP training (#134719)
Check whether the input tensors require grad. If required, then we don't get into the fast path and fall back to composite implicit.

Fixes #134678
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134719
Approved by: https://github.com/malfet
2024-08-29 01:28:53 +00:00
ca03a14cf7 hang dim hint constants off Dim (#134702)
Summary: Retry landing https://github.com/pytorch/pytorch/pull/134484

Test Plan: (see original)

Differential Revision: D61925860

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134702
Approved by: https://github.com/pianpwk
2024-08-29 01:02:01 +00:00
7a554e96b4 [AOTI][Tooling] Follow up to print location of saved file path for torch.pickle_save() (#134651)
Summary:
- Follow up to add torch.pickle_save() log for saved file path

- Minor debug printer code refine

Test Plan: CI

Differential Revision: D61883239

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134651
Approved by: https://github.com/muchulee8
2024-08-28 23:58:37 +00:00
202600bc23 Add torch.serialization.skip_data context manager (#134504)
## Semantic

The semantics are:
(1) By default `torch.serialization.skip_data(materialize_fake_tensors=False)` will make `torch.save` skip writing storages (but reserve space for them in the checkpoint).

```python
import torch
import torch.nn as nn

sd = nn.Linear(3, 5).state_dict()
with torch.serialization.skip_data():
    torch.save(sd, 'foo.pt')
print(torch.load('foo.pt', weights_only=True))
```

(2) With `torch.serialization.skip_data(materialize_fake_tensors=True)`, if a FakeTensor is passed to `torch.save`, the pickler will treat it as being "materialized": space will be reserved in the checkpoint for the associated storage bytes, and when loading, the type will be Tensor instead of FakeTensor.

```python
import torch
import torch.nn as nn
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    m = nn.Linear(3, 5, dtype=torch.float16, device='cuda')

sd = m.state_dict()
with torch.serialization.skip_data(materialize_fake_tensors=True):
    torch.save(sd, 'bla.pt')
print(torch.load('bla.pt', weights_only=True))
# OrderedDict([('weight', tensor([[0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.]], device='cuda:0', dtype=torch.float16)), ('bias', tensor([0., 0., 0., 0., 0.], device='cuda:0', dtype=torch.float16))])

```

## Follow Ups

- [ ] `torch.load` semantic for skip_data context manager
- [ ] Mechanism for getting offsets of storages saved via this method (for writing in a separate pass)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134504
Approved by: https://github.com/albanD
2024-08-28 23:53:17 +00:00
f997b2b8e6 Revert "Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262)"
This reverts commit f685018ea9d08f98cbd7106028db134f967f74d3.

Reverted https://github.com/pytorch/pytorch/pull/125262 on behalf of https://github.com/ZainRizvi due to Hi, this PR appears to be causing maskedtensor tests to fail on main. Please rebase your changes onto the latest trunk build to repro the failure. test_maskedtensor.py::TestOperatorsCUDA::test_like_empty_like_layout1_cuda_bool [GH job link](https://github.com/pytorch/pytorch/actions/runs/10604716811/job/29393256312) [HUD commit link](f685018ea9) ([comment](https://github.com/pytorch/pytorch/pull/125262#issuecomment-2316387447))
2024-08-28 23:10:07 +00:00
6dd3f81aaf Add export_for_training as public API (#134677)
Differential Revision: [D61912084](https://our.internmc.facebook.com/intern/diff/D61912084)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134677
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
2024-08-28 22:32:10 +00:00
a7933acd5a Improve custom ops aliasing error message (#134688)
Fixes https://github.com/pytorch/pytorch/issues/134278

Test Plan:
- tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134688
Approved by: https://github.com/yushangdi
ghstack dependencies: #134466, #134490, #134491, #134690, #134692
2024-08-28 22:22:04 +00:00
dd443f418a Improve opcheck docs. (#134692)
Fixes https://github.com/pytorch/pytorch/issues/134119
From user feedback, it's difficult to understand what the tests do. We
clarify the docs more.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134692
Approved by: https://github.com/albanD
ghstack dependencies: #134466, #134490, #134491, #134690
2024-08-28 22:22:04 +00:00
afc76c6f2d [3/N] Set correct device to CUDA guards (#134357)
In `collective()`, `pointToPoint()` and `collectiveCoalesced()`, CUDA guards were created with an unset (default) CUDA device. This is the reason for the IMA facing the NaN checker in issue https://github.com/pytorch/pytorch/issues/134062.

With this fix, `torch.cuda.set_device(device)` is not needed to work around the IMA.

Also refactored a couple places where the guard is created -- preferably we create the guard with a known device, rather than setting the device later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134357
Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang
ghstack dependencies: #134300, #134345
2024-08-28 22:17:11 +00:00
5ff97e79ee Skip test_mutable_custom_op_fixed_layout2 on ROCM (#134690)
ROCM doesn't trigger the layout optimization that makes the test case
valid so we're going to skip the checks.

Should fix the following (I'll close them later)
- https://github.com/pytorch/pytorch/issues/134481
- https://github.com/pytorch/pytorch/issues/134519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134690
Approved by: https://github.com/FindHao
ghstack dependencies: #134466, #134490, #134491
2024-08-28 22:12:24 +00:00
2fe7e332c7 [2/N] Add flag to control which rank should perform NaN check (#134345)
Fixes https://github.com/pytorch/pytorch/issues/134062.
For example, in case of broadcast / scatter, only the root rank should perform the NaN check.
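
The rule described above can be sketched as a small predicate. This is an illustration only: `Collective`, `ROOTED`, and `should_check_nan` are hypothetical names, not the flag or types the PR adds to ProcessGroupNCCL.

```python
from enum import Enum, auto


class Collective(Enum):
    ALL_REDUCE = auto()
    BROADCAST = auto()
    SCATTER = auto()


# Collectives whose meaningful input lives only on the root rank
# (illustrative subset, assumed for this sketch).
ROOTED = {Collective.BROADCAST, Collective.SCATTER}


def should_check_nan(kind, rank, root=0):
    """Return True if `rank` should scan its input buffer for NaNs."""
    if kind in ROOTED:
        # Non-root buffers may hold uninitialized data, so only the root checks.
        return rank == root
    # Every rank contributes input to the other collectives.
    return True
```

On non-root ranks of a broadcast the buffer is about to be overwritten, so scanning it could report NaNs that are not real errors.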

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134345
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
ghstack dependencies: #134300
2024-08-28 21:53:39 +00:00
26ec06e45d [amd][lowering] hipify shim v2 headers (#134689)
Summary: The default c_shim version was switched to 2 for HIP in D60674018. This results in some linking errors where shim function symbols are missing from the compiled .so file (eg. P1551186492) when building lowering benchmark scripts since the required files aren't included. Hipify the shim v2 generated header files as well since they're needed during codegen when the buck binaries are executed.

Reviewed By: frank-wei, zoranzhao, henryoier

Differential Revision: D61865202

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134689
Approved by: https://github.com/zoranzhao
2024-08-28 21:53:24 +00:00
7b3da5f297 Revert "[dynamo] Cache _dynamo.disable results (#134272)"
This reverts commit dbef2b05b4d81e891f7497f92f730a22bebe445d.

Reverted https://github.com/pytorch/pytorch/pull/134272 on behalf of https://github.com/anijain2305 due to Peak mem increase detected internally ([comment](https://github.com/pytorch/pytorch/pull/134272#issuecomment-2316308170))
2024-08-28 21:51:43 +00:00
20b62fed21 Create processes in parallel in mp.start_processes for forkserver (#134629)
Summary:
This fixes https://github.com/pytorch/pytorch/issues/133010.
One way to fix the problem is to start processes in parallel in mp.start_processes.
Also in this diff:
- Refactored the test case api_test, which was repeating many tests because of inheritance.
- Added a unit test for forkserver with parallel start enabled.

Test Plan: Added unit tests

Differential Revision: D61878552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134629
Approved by: https://github.com/d4l3k
2024-08-28 21:34:32 +00:00
f685018ea9 Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262)
Hi,
I noticed the `unfold` operator was missing on MaskedTensor.

I tested that my change works when calling unfold and backward on a `MaskedTensor`, but I didn't find the tests for the dispatch of such an operation. Where are they?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262
Approved by: https://github.com/cpuhrsch
2024-08-28 21:30:39 +00:00
b6e51711a0 Add MaskedTensor support to *_like API (#128637)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128637
Approved by: https://github.com/cpuhrsch
2024-08-28 21:28:23 +00:00
4c16797e71 [c10d FR analyzer] Output a meaningful debug report for users (#134528)
- This PR generates a more useful output log for users: P1552399180.
- It also fixes the logic when we check the all-gather size mismatch.
- Add dtype check for collective input/output
- We store more context information for error match_state so that we can report them in the file.
- Disable the size match for alltoall because we don't log the size for all inputs/outputs.
- Correct some types for func args specification.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134528
Approved by: https://github.com/c-p-i-o
2024-08-28 21:22:47 +00:00
de35d3062f Runtime Estimator for estimating GPU compute time (#134243)
This PR adds a basic Runtime Estimator for single-device models.
It estimates the GPU runtime in milliseconds using various estimation methods under the ``FakeTensorMode``.
It provides a ``TorchDispatchMode`` based context manager that can estimate the eager runtime of PyTorch functions. It supports two estimation modes, benchmarking (`operator-level-benchmark`) and roofline cost modeling (`operator-level-cost-model`).
For modules executed under this context manager, it aggregates the forward and backward operation runtimes and records their execution orders.

```python
import torch
from torch import nn, optim
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.distributed._tools.runtime_estimator import RuntimeEstimator
from torch.testing._internal.distributed._tensor.common_dtensor import (
    ModelArgs,
    Transformer,
)

if __name__ == "__main__":
    def _train_step(
        model: nn.Module,
        optimizer: optim.Optimizer,
        inp: torch.Tensor,
    ):
        out = model(inp)
        loss = out.sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dev = torch.cuda.current_device()
    vocab_size = 8192
    bsz, seq_len = 32, 1024
    model_args = ModelArgs(
        n_layers=4,
        n_heads=12,
        vocab_size=vocab_size,
        max_seq_len=seq_len,
        dim=768,
        dropout_p=0.1,
    )
    runtime_estimator = RuntimeEstimator()

    with FakeTensorMode():
        with torch.device(dev):
            model = Transformer(model_args)
        optimizer = optim.Adam(model.parameters(), lr=1e-2, foreach=True)
        inp = torch.randint(0, model_args.vocab_size, (bsz, model_args.max_seq_len), device=dev)
        with runtime_estimator("operator-level-benchmark"):
            _train_step(model, optimizer, inp)
        with runtime_estimator("operator-level-cost-model"):
            _train_step(model, optimizer, inp)

    # Actual model runtime
    with torch.device(dev):
        model = Transformer(model_args)
    optimizer = optim.Adam(model.parameters(), lr=1e-2, foreach=True)
    inp = torch.randint(0, model_args.vocab_size, (bsz, model_args.max_seq_len), device=dev)
    warmup_iters, actual_iters = 2, 5
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup_iters):
        _train_step(model, optimizer, inp)
    start_event.record()
    for _ in range(actual_iters):
        _train_step(model, optimizer, inp)
    end_event.record()
    torch.cuda.synchronize()
    measured_time = start_event.elapsed_time(end_event) / actual_iters
    print(f"Actual total_time: {measured_time:.3f} ms")
  ```
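
For intuition about the `operator-level-cost-model` mode, the generic roofline idea can be written in a few lines. This is a sketch of the textbook model, not the estimator's implementation: `roofline_time_ms` and its parameters are hypothetical, and the peak numbers in the usage note are made up.

```python
def roofline_time_ms(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Estimate an operator's runtime as the slower of compute and memory traffic."""
    compute_s = flops / peak_flops_per_s       # time if compute-bound
    memory_s = bytes_moved / peak_bytes_per_s  # time if bandwidth-bound
    return max(compute_s, memory_s) * 1e3      # seconds -> milliseconds
```

For example, a 1024x1024x1024 float32 matmul performs `2 * 1024**3` FLOPs and touches `3 * 1024 * 1024 * 4` bytes; on an assumed device with 1e12 FLOP/s and 1e11 B/s, the compute term dominates.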

<img width="506" alt="Screenshot 2024-08-26 at 11 27 15 PM" src="https://github.com/user-attachments/assets/04d243c9-21a6-4389-8c20-80958980788c">

@weifengpy @xuanzhang816 @gnadathur

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134243
Approved by: https://github.com/weifengpy
2024-08-28 20:06:54 +00:00
cae817c862 [ET][CodeGen] Remove TORCH_API from NativeFunctions.h declarations (#134245)
Summary:
Remove TORCH_API from the generated executorch/kernels/portable/NativeFunctions.h declarations

These generated declarations are using ET tensors. They don't need to have the TORCH_API macro prefixed to them, since in this case TORCH_API is just empty. See [codegen/macros.h](https://www.internalfb.com/code/fbsource/[d12d7d3accfb12932368e0216124f2d735c51d73]/fbcode/executorch/codegen/macros.h)

Test Plan: CI

Differential Revision: D61490943

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134245
Approved by: https://github.com/larryliu0820
2024-08-28 19:58:37 +00:00
b07d0a22f5 [hop] require hops to override __call__. (#134352)
Fixes https://github.com/pytorch/pytorch/issues/133719 by making `__call__` of hops an abstractmethod.
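
The effect of an abstract `__call__` can be sketched with the standard `abc` module. `HigherOrderOpBase` and `CondLike` are hypothetical stand-ins for illustration, not the actual classes in `torch._ops`.

```python
from abc import ABC, abstractmethod


class HigherOrderOpBase(ABC):
    """Hypothetical base class: subclasses must define how they are called."""

    def __init__(self, name):
        self.name = name

    @abstractmethod
    def __call__(self, *args, **kwargs):
        ...


class CondLike(HigherOrderOpBase):
    """Hypothetical op that provides the required __call__ override."""

    def __call__(self, pred, true_fn, false_fn, *operands):
        return true_fn(*operands) if pred else false_fn(*operands)


class Incomplete(HigherOrderOpBase):
    """Forgets to override __call__, so it cannot be instantiated."""
```

Instantiating `Incomplete("x")` raises `TypeError` at construction time, so a missing override is caught before the op is ever invoked.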

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134352
Approved by: https://github.com/zou3519
2024-08-28 19:56:40 +00:00
66c33d5989 Revert "[2/N] Add flag to control which rank should perform NaN check (#134345)"
This reverts commit be7752ead3824e79f5ede6a2f59715b415a2f776.

Reverted https://github.com/pytorch/pytorch/pull/134345 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/134345#issuecomment-2316133024))
2024-08-28 19:51:59 +00:00
23e26b84af Revert "[3/N] Set correct device to CUDA guards (#134357)"
This reverts commit 13114da4ef9d14978ea1dfc0fefb236cb4000435.

Reverted https://github.com/pytorch/pytorch/pull/134357 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/134357#issuecomment-2316121423))
2024-08-28 19:44:55 +00:00
3b40b07efb Update PyTorch for XNNPACK 87ee0b4 (#134518)
Summary: Update XNNPACK library version.

Test Plan: Combined diff CI is clean: D61586079 (all changes, has to be split out for export).

Differential Revision: D61822610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134518
Approved by: https://github.com/mcr229
2024-08-28 19:24:04 +00:00
042b733ddd [dynamo][freezing] Set is_static_type to false after marking an input static (#134653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134653
Approved by: https://github.com/mlazos
2024-08-28 19:22:37 +00:00
aa31e7019a [FSDP] Made clip_grad_norm_ norm compute order deterministic (#134673)
Fixes https://github.com/pytorch/pytorch/issues/134393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134673
Approved by: https://github.com/weifengpy
ghstack dependencies: #134152
2024-08-28 18:44:11 +00:00
47ba47a81f [compiled autograd] error instead of deadlock on reentrant autograd (#134530)
Reentrant autograd calls autograd multiple times on the same thread, so it passes all the thread checks and then hangs waiting for a lock it already holds in another scope.
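
The failure mode and the fix can be sketched in Python: a non-reentrant lock deadlocks when the owning thread tries to take it again, so the guard records its owner and raises instead. `ReentrancyGuard` is a hypothetical illustration, not the compiled autograd implementation.

```python
import threading


class ReentrancyGuard:
    """Non-reentrant lock that raises on same-thread re-entry instead of hanging."""

    def __init__(self):
        self._lock = threading.Lock()
        self._owner = None

    def __enter__(self):
        me = threading.get_ident()
        # _owner can only equal `me` if this thread set it, so this
        # unlocked read is safe for detecting re-entry.
        if self._owner == me:
            raise RuntimeError("reentrant call on the thread that already holds the lock")
        self._lock.acquire()
        self._owner = me
        return self

    def __exit__(self, exc_type, exc, tb):
        self._owner = None
        self._lock.release()
        return False
```

A nested `with guard:` on the same thread raises `RuntimeError` immediately, and the outer block still releases the lock as the exception propagates.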

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134530
Approved by: https://github.com/jansel
ghstack dependencies: #134514
2024-08-28 17:54:31 +00:00
c352b6aaaf [compiled autograd][cpp node] point c++ custom autograd functions tracing error to google doc (#134514)
`RuntimeError: Attempting to trace a potentially unsafe C++ autograd function: torch::autograd::CppNode<CustomOpAutogradFunction>. It may be possible to trace it safely, please refer to the instructions in: https://docs.google.com/document/d/11VucFBEewzqgkABIjebZIzMvrXr3BtcY1aGKpX61pJY/.`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134514
Approved by: https://github.com/yf225
2024-08-28 17:54:31 +00:00
ba5aec88c6 [reland][dtensor][MTPG] make sharding prop lru cache not shared among threads (#134509)
**Summary**
reland of https://github.com/pytorch/pytorch/pull/134294
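
The idea in the title, an LRU cache that is not shared among threads, can be sketched with `threading.local`. `PerThreadCache` and `propagate` are hypothetical names used for illustration and do not reflect the DTensor implementation.

```python
import threading
from functools import lru_cache


class PerThreadCache:
    """Wrap `fn` so that each thread gets its own lru_cache instance."""

    def __init__(self, fn, maxsize=128):
        self._fn = fn
        self._maxsize = maxsize
        self._local = threading.local()

    def _cached(self):
        cached = getattr(self._local, "cached", None)
        if cached is None:
            # First use on this thread: build a fresh, private cache.
            cached = lru_cache(maxsize=self._maxsize)(self._fn)
            self._local.cached = cached
        return cached

    def __call__(self, *args):
        return self._cached()(*args)

    def cache_info(self):
        return self._cached().cache_info()


calls = []


def propagate(x):
    # Hypothetical expensive computation; records each real invocation.
    calls.append(x)
    return x * 2


cache = PerThreadCache(propagate)
```

Each thread pays for its own first miss, but no thread can observe entries that another thread populated.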

Fixes #131446
Fixes #126852
Fixes #126868
Fixes #126493

The PR was reverted due to CI red signal in https://github.com/pytorch/pytorch/actions/runs/10537099590/job/29201744658. It seems that the `gaussian_nll_loss` test had been flaky before my original PR #134294 . Therefore this PR also removes the `xfail` mark on this specific test to make CI signal green.

See the error message below:
```
2024-08-24T13:42:01.3228990Z ==================================== RERUNS ====================================
2024-08-24T13:42:01.3229530Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3229710Z Unexpected success
2024-08-24T13:42:01.3230235Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3230407Z Unexpected success
2024-08-24T13:42:01.3230594Z =================================== FAILURES ===================================
2024-08-24T13:42:01.3231128Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3231296Z Unexpected success
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134509
Approved by: https://github.com/tianyu-l, https://github.com/wz337
2024-08-28 17:51:44 +00:00
310eb6d8c6 [AOTI] Fix test_aoti_inference CPU build issue (#134675)
Summary: Fixes https://github.com/pytorch/pytorch/issues/130311. We need to guard CUDA-only code in test_aoti_inference with macros so that it won't fail for CPU-only platform.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134675
Approved by: https://github.com/atalman, https://github.com/chunyuan-w
2024-08-28 17:42:19 +00:00
633a9a3b13 add back sum_floordiv benchmark. (#134635)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134635
Approved by: https://github.com/avikchaudhuri, https://github.com/oulgen
ghstack dependencies: #133834
2024-08-28 17:38:24 +00:00
b8859dc4b8 [PyTorch Pin Memory Allocator] Optimize the free list implementation and add lock sharding (#134154)
Summary: This diff addresses the lock contention issue in free list implementation of CachingHost/Pinned allocator. We add a different data structure for free list and also add lock sharding based on allocation size.
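
Lock sharding by allocation size can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the C++ allocator: `ShardedFreeList` is a hypothetical name, blocks are assumed to be allocated at power-of-two size-class granularity, and the shard count is arbitrary.

```python
import threading
from collections import defaultdict


class ShardedFreeList:
    """Free list split into shards, each protected by its own lock."""

    def __init__(self, num_shards=8):
        self._locks = [threading.Lock() for _ in range(num_shards)]
        self._free = [defaultdict(list) for _ in range(num_shards)]

    @staticmethod
    def _size_class(nbytes):
        # Round up to the next power of two (1 -> 1, 3 -> 4, 1000 -> 1024).
        return 1 << max(nbytes - 1, 0).bit_length()

    def _shard(self, size_class):
        # Different size classes usually land on different locks.
        return size_class.bit_length() % len(self._locks)

    def put(self, block, capacity):
        """Return a block of the given capacity to the free list."""
        size_class = self._size_class(capacity)
        i = self._shard(size_class)
        with self._locks[i]:
            self._free[i][size_class].append(block)

    def get(self, nbytes):
        """Pop a cached block that fits `nbytes`, or None if there is none."""
        size_class = self._size_class(nbytes)
        i = self._shard(size_class)
        with self._locks[i]:
            blocks = self._free[i].get(size_class)
            return blocks.pop() if blocks else None
```

Threads requesting different size classes mostly take different locks, which is what reduces contention relative to one global mutex.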

Differential Revision: D61623367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134154
Approved by: https://github.com/guangyey, https://github.com/jgong5, https://github.com/zyan0, https://github.com/EikanWang, https://github.com/jiayisuse
2024-08-28 17:12:01 +00:00
40de63be09 parameterized test_graph_optims and test_graph_scaling_fused_optimizers (#133749)
Fixes #123451

This is a rework of a reverted pull request, https://github.com/pytorch/pytorch/pull/125127.
The test failure is fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133749
Approved by: https://github.com/janeyx99
2024-08-28 16:34:06 +00:00
c7338f457c [DCP] Fixes the BC issue where the traversal doesn't support versions before 2.4 (#134158)
The original DCP doesn't flatten all the containers, which can cause issues. https://github.com/pytorch/pytorch/pull/125335 intends to solve this by flattening all the dictionaries.

Unfortunately, it breaks checkpoints that were saved before 2.4. This
also exposes some issues in DCP:

1. DCP should record version in the metadata.
2. DCP should have a nice way to load old state_dict.
3. DCP should unflatten all containers (map, list) not just map.

This PR only addresses issue 2 to unblock users. Issue 1 and issue 3 need to be addressed in the future.
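
To make "flattening containers" concrete, here is a minimal sketch that flattens nested dicts and lists into a single-level mapping. The function name and the dotted key format are assumptions for illustration; they are not DCP's actual traversal or on-disk key scheme.

```python
def flatten_state_dict(obj, prefix=""):
    """Flatten nested dicts/lists/tuples into {path: leaf} (illustrative key format)."""
    flat = {}
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, (list, tuple)):
        items = enumerate(obj)
    else:
        # Leaf value: store it under the path accumulated so far.
        flat[prefix] = obj
        return flat
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else str(key)
        flat.update(flatten_state_dict(value, path))
    return flat
```

A checkpoint written with one flattening rule and read with another produces different keys for the same value, which is the kind of mismatch behind the backward-compatibility break.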

@pradeepfn Please let me know if this summary matches our discussion.

Fixes https://github.com/pytorch/pytorch/issues/133923

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134158
Approved by: https://github.com/wz337, https://github.com/pradeepfn
2024-08-28 16:31:44 +00:00
13d40f6fc5 Revert "hang dim hint constants off Dim (#134484)"
This reverts commit c142af7209a423a05504fdec50680333f5a37629.

Reverted https://github.com/pytorch/pytorch/pull/134484 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/134484#issuecomment-2315749549))
2024-08-28 16:05:42 +00:00
2c88a923a7 Revert "Refactor caching device allocator utils (#130923)"
This reverts commit c45ca8092dddf718563a1a754de798ad25eae1ee.

Reverted https://github.com/pytorch/pytorch/pull/130923 on behalf of https://github.com/ZainRizvi due to Sorry but this appears to be causing internal tests to fail with errors like `error: no type named 'DeviceStats' in namespace 'xxx::xxx:xxxAllocator'; did you mean 'DeviceStatus'?` ([comment](https://github.com/pytorch/pytorch/pull/130923#issuecomment-2315730155))
2024-08-28 15:56:08 +00:00
d52aff3e73 Revert "Adding entry-point based support for out-of-tree rendezvous plugins (#132633)"
This reverts commit 136b19b062f62c81ea3ed8fb306debe9d7720e93.

Reverted https://github.com/pytorch/pytorch/pull/132633 on behalf of https://github.com/ZainRizvi due to Sorry but this is causing internal tests to fail with the error `ImportError: cannot import name '_register_out_of_tree_handlers' from 'torch.distributed.elastic.rendezvous.registry'` ([comment](https://github.com/pytorch/pytorch/pull/132633#issuecomment-2315716201))
2024-08-28 15:49:18 +00:00
85d9946001 [CI] change conda to miniforge for XPU images (#134455)
The `.ci/docker` change with the `ciflow/xpu` label triggers a docker image rebuild on the xpu runner, but the xpu runner can't use miniconda, so switch to miniforge. Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134455
Approved by: https://github.com/atalman
2024-08-28 15:16:14 +00:00
208b922327 [Intel GPU] Remove special dispatch logic for xpu in adaptive_avg_pooling (#132217)
We now align the dispatch logic for XPU with CUDA in the adaptive average pooling operation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132217
Approved by: https://github.com/EikanWang, https://github.com/atalman, https://github.com/albanD, https://github.com/malfet
2024-08-28 15:06:35 +00:00
e6bf1710ff [Inductor][Refactor] Rename CPU benchmark test configs (#134639)
Summary: benchmarks/dynamo/ci_expected_accuracy/update_expected.py expects a benchmark run config to be named {config}_{benchmark}, and CPU tests should follow the same naming convention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134639
Approved by: https://github.com/huydhn
2024-08-28 14:49:55 +00:00
c142af7209 hang dim hint constants off Dim (#134484)
Summary: Recently https://github.com/pytorch/pytorch/pull/133620 added support for automatic dynamic shapes, where a new enum, `DIM`, was introduced to provide hints like `AUTO` and `STATIC`. This PR is a nominal change where we expose the hints via the existing public `Dim` API, and remove `DIM` from the public API. The main motivation is to avoid having users need to import too many things.

Test Plan: existing

Differential Revision: D61807361

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134484
Approved by: https://github.com/angelayi
2024-08-28 14:35:40 +00:00
3e42f21eee Bucketize fix to include number and tensor inputs (#133652)
Fixes #132222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133652
Approved by: https://github.com/ezyang
2024-08-28 13:35:41 +00:00
bb22132c8d [aotd] Make effects op registry WeakKeyDictionary (#134470)
An op is used as a dictionary key, but ops can be deregistered; with a regular dictionary, the key keeps the op alive and prevents its deallocation.
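
The difference between the two registry types can be shown with the standard `weakref` module. `Op` and `registry_sizes_after_drop` are hypothetical names for this sketch; the real registry holds PyTorch operator objects.

```python
import gc
import weakref


class Op:
    """Hypothetical stand-in for a registered operator."""

    def __init__(self, name):
        self.name = name


def registry_sizes_after_drop():
    """Register one op in each kind of dict, drop all other references, count entries."""
    strong = {}
    weak = weakref.WeakKeyDictionary()
    op_a, op_b = Op("a"), Op("b")
    strong[op_a] = "effect"
    weak[op_b] = "effect"
    del op_a, op_b
    gc.collect()
    # The plain dict still pins its key; the weak dict has let it go.
    return len(strong), len(weak)
```

The plain dict keeps its entry because the key itself is a strong reference, while the `WeakKeyDictionary` entry disappears once the op has no other owners.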

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134470
Approved by: https://github.com/zou3519
2024-08-28 12:12:00 +00:00
97c8a0739e [Dynamo] Support inspect.signature.Parameter getattr (#134636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134636
Approved by: https://github.com/Chillee, https://github.com/anijain2305
2024-08-28 09:59:41 +00:00
26e392132d [2nd try][Traceable FSDP2] Allow tracing through FSDP2 impl in trace_rules.py (#134539)
The previous PR https://github.com/pytorch/pytorch/pull/133532 caused a stuck-compilation issue on internal models. In this 2nd attempt PR, we gate the trace_rules.py changes with `if not torch._dynamo.config.skip_fsdp_hooks:`, so that they don't take effect for current graph-break FSDP2 (which relies on the default config value `skip_fsdp_hooks=True`), and will only take effect when we are using Traceable FSDP2 (in which case the user needs to proactively set `skip_fsdp_hooks=False`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134539
Approved by: https://github.com/ckluk2, https://github.com/yanboliang
2024-08-28 08:57:56 +00:00
8693322ef0 [Dynamo][autograd.Function] Support mark_non_differentiable (#134087)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134087
Approved by: https://github.com/zou3519
2024-08-28 08:12:37 +00:00
d01415409b [PGNCCL] Improve logic to infer device for barrier (#134617)
Fixes #134391, #124714

The above issues reported that `dist.barrier()` could hang in some cases.
The culprit is that ProcessGroupNCCL inferred a wrong device to perform the dummy all-reduce.

After the PR, the following will be the order of device selection:
- 1st choice: `opts.device_ids`, if provided by user via `barrier(opts)`.
- 2nd choice: bound device id, if provided to `init_process_group` via `device_id` arg.
- 3rd choice: `usedDeviceIdxs_` recorded in current PG. Will have a value from previous collectives.
- 4th choice: `globalRank() % localDeviceCount_`. This can only happen when `dist.barrier()` is the first call of the PG.

What's new:
- Added the 2nd choice.
- In the 4th choice, we use `globalRank()` instead of group-local rank, because the group-local rank can be offset wrt the device id if intra-node GPUs are sharded into multiple dimensions.
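
The selection order above can be sketched as a fallback chain. This is a Python illustration of the described priority, not the C++ code in ProcessGroupNCCL: `pick_barrier_device` and its parameters are hypothetical, and taking the first entry of `device_ids` and the smallest recorded index are assumptions made for determinism.

```python
def pick_barrier_device(opts_device_ids, bound_device_id, used_device_idxs,
                        global_rank, local_device_count):
    """Return the device index to use for the barrier's dummy all-reduce."""
    # 1st choice: device ids passed explicitly via barrier(opts).
    if opts_device_ids:
        return opts_device_ids[0]  # assumption: use the first id
    # 2nd choice: device bound at init_process_group(device_id=...).
    if bound_device_id is not None:
        return bound_device_id
    # 3rd choice: a device recorded by previous collectives in this PG.
    if used_device_idxs:
        return min(used_device_idxs)  # assumption: smallest index
    # 4th choice: derive from the global rank, not the group-local rank.
    return global_rank % local_device_count
```

For example, with nothing provided and no prior collectives, global rank 5 on a node with 4 devices maps to device 1.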

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134617
Approved by: https://github.com/yifuwang, https://github.com/shuqiangzhang
2024-08-28 08:12:09 +00:00
e4a5958ab5 [dynamo] Graph break on FSDP flat_param inconsistent tensor and grad dtype (#134614)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134614
Approved by: https://github.com/awgu, https://github.com/yf225
ghstack dependencies: #134610, #134590, #134621
2024-08-28 07:35:24 +00:00
e96dc3665a [raland][dynamo][exceptions] Support raise from None (#134621)
The PR was reverted because this PR traced more code and surfaced a latent bug. Resubmitting w/o any changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134621
Approved by: https://github.com/jansel
ghstack dependencies: #134610, #134590
2024-08-28 07:35:23 +00:00
c566f2465f [dynamo][dicts] Support hasattr on dicts (#134590)
Fixes - https://github.com/pytorch/pytorch/issues/134577

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134590
Approved by: https://github.com/Skylion007
ghstack dependencies: #134610
2024-08-28 07:35:18 +00:00
880e3d18a4 [dynamo][exceptions] Use exception subclass whenever possible (#134610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134610
Approved by: https://github.com/drisspg, https://github.com/jansel
2024-08-28 07:35:12 +00:00
bf7db4e4f9 [Inductor UT] Generalize inductor UT for intel GPU (#133309)
[Inductor UT] Generalize Inductor test case for Intel GPU.

- Reuse `test/inductor/test_decompose_mem_bound_mm.py`
- Reuse `test/inductor/test_inplacing_pass.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133309
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/etaf
2024-08-28 06:17:43 +00:00
2ba60a1618 fix torch.prod vectorized path for bool (#128009)
Fix https://github.com/pytorch/pytorch/issues/127866.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128009
Approved by: https://github.com/jgong5, https://github.com/albanD
2024-08-28 05:27:50 +00:00
89929d9abc [AOTI][Tooling][4/n] Add torch.save() for individual intermediate tensor (#133871)
Differential Revision: D61415304

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133871
Approved by: https://github.com/ColinPeppler
2024-08-28 04:48:00 +00:00
ca77f0a986 [executorch hash update] update the pinned executorch hash (#133386)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133386
Approved by: https://github.com/pytorchbot
2024-08-28 04:16:42 +00:00
e3308d835d [audio hash update] update the pinned audio hash (#134632)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134632
Approved by: https://github.com/pytorchbot
2024-08-28 04:16:25 +00:00
cyy
bb4dfe90b8 [Reland] [1/N] Fix clang-tidy warnings in inductor (#134544)
Reland #131979 and exclude aoti_torch_index_put_out changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134544
Approved by: https://github.com/ColinPeppler
2024-08-28 04:05:06 +00:00
71d0eff6e7 Back out "[pytorch][PR] [export] Schematize nn_module_stack serialization" (#134628)
Summary: Breaking backward compatibilities for serialization and deserialization

Differential Revision: D61888223

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134628
Approved by: https://github.com/angelayi
2024-08-28 03:45:46 +00:00
cyy
ec3f52dd27 [21/N] Fix clang-tidy warnings in jit (#134537)
Follows #133399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134537
Approved by: https://github.com/Skylion007
2024-08-28 03:22:01 +00:00
5beb859e74 [BE] no need to print stream in comm abort (#134362)
Strictly speaking, NCCL communicator has nothing to do with CUDA streams. Thus, we don't need to print stream in comm abort's message.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134362
Approved by: https://github.com/fduwjj, https://github.com/wconstab
2024-08-28 02:14:18 +00:00
f33bcbe5fd c10d/logging: add C10D_LOCK_GUARD (#134131)
This adds logs if we can't acquire locks in NCCLUtils and ProcessGroupNCCL for 30s.

This is motivated by some deadlocks we're seeing, where it's unclear whether they are in NCCL or on the PyTorch side of things.

This required replacing most `std::mutex` with `std::timed_mutex` and `std::condition_variable_any` as appropriate.
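
For illustration only, the same idea expressed in Python: a guard that logs whenever the lock has not been acquired within a timeout and then keeps waiting. The PR itself is C++ (`C10D_LOCK_GUARD` over `std::timed_mutex`); the helper name `logged_lock_guard`, its parameters, and the retry-after-logging behavior are assumptions of this sketch:

```python
import logging
import threading
from contextlib import contextmanager

logger = logging.getLogger(__name__)


@contextmanager
def logged_lock_guard(lock, name, warn_after_s=30.0):
    """Hold `lock` for the duration of the with-block, logging slow acquires."""
    # acquire(timeout=...) returns False if the lock was not obtained in
    # time; log and retry so a stuck lock shows up in the logs.
    while not lock.acquire(timeout=warn_after_s):
        logger.warning(
            "could not acquire lock %r within %.1fs, still waiting",
            name,
            warn_after_s,
        )
    try:
        yield
    finally:
        lock.release()
```

Usage would look like `with logged_lock_guard(my_lock, "my_lock"): ...`, mirroring a scoped lock guard.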

Test plan:

existing CI for regressions

will add unit tests on `C10D_LOCK_GUARD`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134131
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
2024-08-28 01:40:42 +00:00
c45ca8092d Refactor caching device allocator utils (#130923)
# Motivation
Following [[RFC] Intel GPU Runtime Upstreaming for Allocator ](https://github.com/pytorch/pytorch/issues/116322), this PR refactors the caching device allocator utils to improve code reuse.
This is the first PR; follow-up PRs may continue refactoring the device caching allocator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130923
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD, https://github.com/eqy
2024-08-28 01:35:23 +00:00
d96254631e [CD] Fix docker builds by installing setuptools after python build (#134631)
Follow up after https://github.com/pytorch/pytorch/pull/134595

The same error occurs silently earlier in the build than the error addressed in the above PR (the build continues and produces an invalid Docker image):
```
#47 457.5 Traceback (most recent call last):
#47 457.5   File "<string>", line 1, in <module>
#47 457.5   File "/opt/_internal/cpython-3.12.0/lib/python3.12/site-packages/wheel/pep425tags.py", line 3, in <module>
#47 457.5     import distutils.util
#47 457.5 ModuleNotFoundError: No module named 'distutils'
#47 457.5 + local abi_tag=
#47 457.5 + ln -s /opt/_internal/cpython-3.12.0 /opt/python/
#47 457.5 + rm -f Python-3.12.0.tgz
```

The fix in  https://github.com/pytorch/pytorch/pull/134595 is no longer needed since we will install setuptools right after python installation.

Link: https://github.com/pytorch/pytorch/actions/runs/10584642913/job/29329366729#step:6:6041
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134631
Approved by: https://github.com/kit1980
2024-08-28 01:17:41 +00:00
2b95da7ef4 allow conv_bn mixed dtype folding in post-grad (#133968)
This PR relaxes the condition to allow conv_bn mixed dtype folding in post-grad.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133968
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2024-08-28 01:02:09 +00:00
f7467c3b95 using new device-agnostic api instead of old api like torch.cpu or torch.cuda (#134448)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134448
Approved by: https://github.com/guangyey, https://github.com/shink, https://github.com/albanD
2024-08-28 01:01:49 +00:00
0c7856973b [export] enumerate unsupported sympy.Functions (#134271) (#134598)
Summary:
There are 2 concepts of unsupported sympy.Functions in symbolic_shapes:
1) unsupported by the export solver, meaning the solver doesn't know how to provide useful fixes for those functions
2) unsupported by the sympy interpreter - meaning we can't reify them into FX nodes because the functions aren't present in PythonReferenceAnalysis

This splits the current call into a call for each version, with the Export solver the only user of 1). For 1), we enumerate the functions in _sympy/functions.py, and subtract the functions we know we can support. For 2) there are only 3 functions we've seen pop up in test cases.

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

Differential Revision: D61863394

Pulled By: pianpwk

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134598
Approved by: https://github.com/angelayi
2024-08-28 00:34:38 +00:00
3b33f26513 Add device daemon (#131814)
Base implementation aiming towards https://github.com/pytorch/rfcs/pull/64

Details of the implementation and next steps in https://github.com/pytorch/pytorch/blob/gh/albanD/3/head/test/cpp_extensions/open_registration_extension/README.md

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131814
Approved by: https://github.com/ezyang
2024-08-27 23:32:07 +00:00
d6091c8726 Add compile time instruction count metric (#133834)
PYTHONPATH=$(pwd) python benchmarks/update_hint_benchmark.py out
as of this diff, compile_time_instruction_count counts the number of instructions from within
convert_frame.compile_inner
```
update_hint_regression,compile_time_instruction_count,10522459165
```
 will add result from CI once populated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133834
Approved by: https://github.com/aorenste
2024-08-27 23:29:02 +00:00
ef0f5919c7 [ROCm][Inductor][CK] Fix codegen after ck signature change (#134483)
MakeArgument signature was changed in https://github.com/ROCm/composable_kernel/pull/1453 adding splitK argument to universal gemm templates which are used to codegen addmm and matmul

(part of the series started at #125453 )

# Testing
`pytest test/inductor/test_ck_backend.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134483
Approved by: https://github.com/ColinPeppler
2024-08-27 23:25:42 +00:00
5ead965026 [export] don't duck size for DIM.AUTO (#134486)
Summary: apparently DIM.AUTO leads to duck sizing, I didn't catch this. Doing the least intrusive fix possible by using `torch._dynamo.maybe_mark_dynamic()` under the hood.

Test Plan: added test

Differential Revision: D61809344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134486
Approved by: https://github.com/avikchaudhuri
2024-08-27 23:00:26 +00:00
30094bedbc Revert "[dynamo][dicts] Support hasattr on dicts (#134590)"
This reverts commit d23c0150f3ba5fd1162358e9e7b0e72e7308c87e.

Reverted https://github.com/pytorch/pytorch/pull/134590 on behalf of https://github.com/anijain2305 due to causing trunk CI failures ([comment](https://github.com/pytorch/pytorch/pull/134590#issuecomment-2313705582))
2024-08-27 22:52:52 +00:00
d966d91e37 [FlexAttention] Fix Sparse block multiple to ceildiv instead for floor div (#134538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134538
Approved by: https://github.com/yanboliang
ghstack dependencies: #134507, #134511
2024-08-27 22:04:57 +00:00
f5c67917d3 [FlexAttention] Remove unused code (#134511)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134511
Approved by: https://github.com/yanboliang
ghstack dependencies: #134507
2024-08-27 22:04:57 +00:00
856a8410f2 [FlexAttention] Create new variables for the subgraphs (#134507)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134507
Approved by: https://github.com/yanboliang, https://github.com/BoyuanFeng
2024-08-27 22:04:57 +00:00
41e512a4cd [EZ] Restore test_unicode_comments (#134589)
This reverts changes introduced to test_jit.py by 43737bd78a and adds a lint suppression for it.

As the test name suggests, it should contain a Unicode comment to make sure our parser can handle it.

Part of the fix for https://github.com/pytorch/pytorch/issues/134422
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134589
Approved by: https://github.com/aorenste, https://github.com/Skylion007
2024-08-27 21:51:06 +00:00
1ba39ec1d0 Add test case test_arange_length_with_float32_dtype (#134415)
Adding a test as a followup from https://github.com/pytorch/pytorch/pull/134296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134415
Approved by: https://github.com/ezyang
2024-08-27 21:36:23 +00:00
b58a0c3c4d [split build] fix distributed problems (#134502)
Should fix the issue where USE_C10D_NCCL was not getting propagated to libtorch_python.so
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134502
Approved by: https://github.com/yifuwang
2024-08-27 21:12:58 +00:00
289486d007 Move attention kernels back from fake_impls to meta_registrations (#134288)
See #121528 for additional context.

In #120682, we moved the attention kernels from meta_registrations to fake_impls with the intent of fixing the device handling for seed/offset: these are typically on CPU. We needed to put the registrations in fake_impls to do this because meta_registrations doesn't have a way to specify device, whereas fake_impls does. But when we tried to actually fix the device types (#120839), we had to revert the PR because it broke cudagraph handling (during which seed/offset _are_ on CUDA).

Now, we want to put the registrations back in meta_registrations so that we can call these kernels with meta tensors. The use case is later in this stack - we want to be able to use the flop counter with these kernels.

Also - I specifically skip the `compare_tensor_meta()` check in test_fake / test_fake_autocast tests for the `_efficient_attention_forward` and `_flash_attention_forward` kernels, which fails because of the device mismatch from the seed/offset tensors. Then we can un-skip these opinfos. I verified that the efficient_attention_forward bug (#120842) is now caught by these opinfos if I revert the fix from this PR.

Differential Revision: [D61687369](https://our.internmc.facebook.com/intern/diff/D61687369)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134288
Approved by: https://github.com/drisspg
2024-08-27 21:10:36 +00:00
39ca96398b Update label_to_label with oncall: pt2 hierarchy. (#134582)
Test Plan:
- None
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134582
Approved by: https://github.com/clee2000
2024-08-27 21:05:40 +00:00
cyy
b567ca0f51 Remove unused imported names in python files (#134438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134438
Approved by: https://github.com/zou3519
2024-08-27 20:44:04 +00:00
d23c0150f3 [dynamo][dicts] Support hasattr on dicts (#134590)
Fixes - https://github.com/pytorch/pytorch/issues/134577

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134590
Approved by: https://github.com/Skylion007
ghstack dependencies: #134039
2024-08-27 20:43:40 +00:00
16b8146c9e Exclude test_transformers and unit tests which require recent GPU arch (#132895)
This PR is to exclude test_transformers on ROCm temporarily and skip some unit tests which require recent GPU arch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132895
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet
2024-08-27 20:40:53 +00:00
44dadf2506 [Fix] Check name when registering privateuse1 backend (#134071)
Do some checks when registering a privateuse1 backend to avoid using in-tree device names.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134071
Approved by: https://github.com/albanD
2024-08-27 20:28:30 +00:00
f754c0ae1b [easy] rm duplicate definition for inductor in TORCH_LOGS documentation (#134480)
already defined in
2eb9339b71/torch/_logging/_internal.py (L286-L287)

Test Plan: Sandcastle run

Differential Revision: D61806088

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134480
Approved by: https://github.com/eellison, https://github.com/mlazos
2024-08-27 20:15:10 +00:00
fe6d0e3a04 Do not compute unnecessary tensor!=0 for bool tensors in count_nonzero (#134254)
Updated aten/src/ATen/native/TensorAdvancedIndexing.cpp to only reduce non-bool tensors before computing a sum

Since I have no MPS expertise, I left the MPS backend untouched. Also, in `count_nonzero_impl` for CPU, I assumed the comparison can be optimized by the compiler for boolean values? 90c821814e/aten/src/ATen/native/TensorAdvancedIndexing.cpp (L2262-L2264) Fixes #133983

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134254
Approved by: https://github.com/albanD
2024-08-27 20:09:29 +00:00
b744ed6816 Add a cpu_dispatch_key parameter to the cpu_fallback function (#134321)
Fixes #134322
Add a cpu_dispatch_key parameter to the cpu_fallback function to support fallback, for example, to SparseCPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134321
Approved by: https://github.com/albanD
2024-08-27 19:57:57 +00:00
adf401f822 Links to contributors' GitHub accounts (#133787)
Maintainers have the links to their GitHub profiles, but the major contributors do not have them.
I added the links to the contributors' GitHub accounts in case anyone wants to follow them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133787
Approved by: https://github.com/albanD
2024-08-27 19:56:08 +00:00
534f43ddce [Doc] Fix rendering of the unicode characters (#134597)
https://github.com/pytorch/pytorch/pull/124771 introduced unicode escape sequences inside raw strings, which were not rendered correctly. Also fix typo in `\uue0 ` escape sequence (should have been `\u00e0`)
Fix it by relying on [string literal concatenation](https://docs.python.org/3/reference/lexical_analysis.html#string-literal-concatenation) to join raw and regular strings together during lexical analysis stage
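
A small sketch of the technique (the strings are made up for illustration and are not the docstrings touched by the PR):

```python
# Inside a raw string the escape is NOT processed: this is six characters,
# a backslash followed by "u00e0".
raw_only = r"\u00e0"

# Adjacent string literals are joined into one string, so a raw piece
# (backslashes kept literally) can sit next to a regular piece (escapes
# processed) without any runtime concatenation.
doc = (
    r"Pattern \d+ stays literal; "
    "accented text renders: voil\u00e0"
)
```

The raw part keeps `\d+` verbatim, while the regular part turns `\u00e0` into the single character `à`.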

Fixes https://github.com/pytorch/pytorch/issues/134422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134597
Approved by: https://github.com/aorenste, https://github.com/Skylion007
2024-08-27 19:52:46 +00:00
3ef4c27ab3 Update pt2e numeric debugger to use node.meta["custom"] field (#134040)
Summary:
With https://github.com/pytorch/pytorch/pull/131912 we now have a "custom" field in node.meta that can be preserved
in

* copy/deepcopy
* run_decompositions()
* serialization
* re-exporting

So we refactored numeric debugger to use this.

Test Plan:
python test/test_quantization.py TestNumericDebugger

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134040
Approved by: https://github.com/tarun292
2024-08-27 19:51:03 +00:00
ed494603c7 [inductor] calibration inductor windows uts (16/N) (#134587)
skip UT for `test/inductor/test_compiled_autograd.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134587
Approved by: https://github.com/jansel
2024-08-27 19:45:02 +00:00
b094972051 [inductor] calibration inductor windows uts (17/N) (#134588)
skip UTs for `test/inductor/test_minifier_isolate.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134588
Approved by: https://github.com/jansel
2024-08-27 19:41:17 +00:00
9d0e0e6f1d [inductor] calibration inductor windows uts (14/N) (#134585)
skip UT for `test/dynamo/test_exc.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134585
Approved by: https://github.com/jansel
2024-08-27 19:40:56 +00:00
05ac7cd760 [MPS] Remove superfluous label/link (#134090)
This was probably intended to be a comment. I removed it since the issue is already linked in the warning below.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134090
Approved by: https://github.com/albanD
2024-08-27 19:37:33 +00:00
d5aefadb17 [CD] Fix docker builds by installing setuptools (#134595)
Seeing failures like this:
```
#49 844.6 //build_scripts/manylinux1-check.py:6: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives
.....
[python 3/3] RUN bash build_scripts/build.sh && rm -r build_scripts:
846.9 ...it did, yay.
846.9 + for PYTHON in '/opt/python/*/bin/python'
846.9 + /opt/python/cpython-3.12.0/bin/python build_scripts/manylinux1-check.py
847.0 Traceback (most recent call last):
847.0   File "//build_scripts/manylinux1-check.py", line 55, in <module>
847.0     if is_manylinux1_compatible():
847.0        ^^^^^^^^^^^^^^^^^^^^^^^^^^
847.0   File "//build_scripts/manylinux1-check.py", line 6, in is_manylinux1_compatible
847.0     from distutils.util import get_platform
847.0 ModuleNotFoundError: No module named 'distutils'
------
```
PR: https://github.com/pytorch/pytorch/pull/134455

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134595
Approved by: https://github.com/kit1980, https://github.com/seemethere, https://github.com/malfet
2024-08-27 19:31:44 +00:00
a4b44dd2ef [AOTI] Introduce DeferredCudaGridLine for cuda cpp wrapper (#129268)
Summary: Similar to https://github.com/pytorch/pytorch/pull/129135, use DeferredCudaGridLine to create a deferred grid computation line when generating cpp wrapper.

Differential Revision: [D61800622](https://our.internmc.facebook.com/intern/diff/D61800622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129268
Approved by: https://github.com/angelayi
2024-08-27 19:23:25 +00:00
5fd670e0ef [ROCM] Properly disable Flash Attention/Efficient Attention with environment variables (#133866)
Now `USE_FLASH_ATTENTION=0 USE_MEM_EFF_ATTENTION=0 python setup.py` can compile correctly

Fixes #125230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133866
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily, https://github.com/malfet
2024-08-27 18:24:29 +00:00
5b392d22c6 Revert "fix stuck floordiv (#134150)"
This reverts commit 92c4771853892193d73d87bd60eca4dc7efc51d8.

Reverted https://github.com/pytorch/pytorch/pull/134150 on behalf of https://github.com/anijain2305 due to compile time regression internal ([comment](https://github.com/pytorch/pytorch/pull/134150#issuecomment-2313230404))
2024-08-27 18:23:44 +00:00
0159ebb654 [dtensor] add test for local_map decorator (#127752)
**Summary**
This PR is a follow-up of #126924 to address reviewer's comments:
1) add a test case to show the use of `local_map` as a function decorator.
2) simplify the logic of handling different data types of `out_placements`.
3) correct variable naming in test cases to match math formulas.

**Test**
see #126924

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127752
Approved by: https://github.com/wanchaol
2024-08-27 18:22:23 +00:00
8de0d7690c Use newer toAccumulateType signature in Normalization.cpp (#134540)
This fixes BatchNorm behavior when called with empty tensors on the MPS backend. Removed `expectedFailureMPS` in test_nn.py, deleted the expected failure in `test_mps.py`, and changed `skipIfMPS` to `expectedFailureMPS` in the BatchNorm2d OpInfo decorator, restricted to the memory format tests only.

Test Plan: CI + `python3 -c "import torch; print(torch.nn.BatchNorm2d(3, device='mps')(torch.rand(0, 3, 2, 2, device='mps')))"`

Fixes https://github.com/pytorch/pytorch/issues/134423

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134540
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-08-27 18:09:20 +00:00
68b1a09422 Integrate device agnostic APIs in FSDP library [1/n] (#134337)
Summary: For MTIA FSDP support, we need to ensure the FSDP library code handles accelerator devices not limited to CUDA.

Test Plan: CI

Reviewed By: hanzlfs

Differential Revision: D60587415

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134337
Approved by: https://github.com/LucasLLC, https://github.com/awgu
2024-08-27 17:31:11 +00:00
13049cd6e5 [aotinductor][UserDefinedTritonKernel] fix case with non-constexpr params declared after autotuned params (#134520)
## Context
In some user Triton kernels, we have this set-up for whatever reason.
```
@triton.jit
def mykernel(
  param0,
  param1,
  param2,
  param3: tl.constexpr,   # autotuned
  param4,                 # non-constexpr
):
  ...
```

This is an edge case because it is general practice to declare all constexpr params at the end.

This is an issue for AOTI because it fails to codegen all 4 non-constexpr params, which surfaces as a device-side error: CUDA IMA, invalid argument...

```
>     void* kernel_args_var_0[] = {&var_0, &var_1, &var_2};
---
<     CUdeviceptr var_3;
<     AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_get_data_ptr(buf0, reinterpret_cast<void**>(&var_3)));
<     void* kernel_args_var_0[] = {&var_0, &var_1, &var_2, &var_3};
```

## Root-cause
* `kernel.constexpr` from the Kernel side-table contains the indices of all `constexpr` params, including autotuned params.
* `raw_args`, which is passed to wrapper codegen, excludes autotuned args.
* In the wrapper codegen, we try to find non-constexpr args using `kernel.constexpr` & `raw_args`. This is okay unless there's a `raw_arg` after an autotuned param in the function signature.

79b7fff188/torch/_inductor/codegen/cpp_wrapper_cuda.py (L118-L126)

## Fix
We fix this by calculating the correct constexpr indices wrt `raw_args`.

An illustration
```
         raw_args: [arg0, arg1, arg2, arg4]
 kernel.arg_names: [param0, param1, param2, param3, param4]
kernel.constexprs: [3]                      # param3 is autotuned; this is correct wrt kernel.arg_names
constexpr_indices: []                       # this is correct wrt raw_args
```
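
A rough Python sketch of the index remapping illustrated above. The helper name `constexpr_indices_for_raw_args` and its parameters are hypothetical, and this is not the code from the PR; it only shows how indices valid for the full signature can be re-expressed relative to `raw_args`:

```python
def constexpr_indices_for_raw_args(arg_names, constexprs, autotuned_names):
    """Map constexpr positions in the full signature onto raw_args positions.

    arg_names:       all parameter names, in signature order
    constexprs:      indices into arg_names that are tl.constexpr
    autotuned_names: names supplied by the autotuner (absent from raw_args)
    """
    constexpr_names = {arg_names[i] for i in constexprs}
    # raw_args keeps signature order but drops the autotuned params.
    raw_names = [n for n in arg_names if n not in autotuned_names]
    # Re-index the remaining constexpr params relative to raw_args.
    return [i for i, n in enumerate(raw_names) if n in constexpr_names]


arg_names = ["param0", "param1", "param2", "param3", "param4"]
indices = constexpr_indices_for_raw_args(arg_names, [3], {"param3"})
```

With the kernel from the Context section, `indices` is empty, so all four entries of `raw_args` are treated as non-constexpr and get codegen'd.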

Differential Revision: [D61831625](https://our.internmc.facebook.com/intern/diff/D61831625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134520
Approved by: https://github.com/oulgen
2024-08-27 17:20:27 +00:00
13114da4ef [3/N] Set correct device to CUDA guards (#134357)
In `collective()`, `pointToPoint()` and `collectiveCoalesced()`, CUDA guards were created with an unset (default) CUDA device. This is the reason for the IMA facing the NaN checker in issue https://github.com/pytorch/pytorch/issues/134062.

With this fix, `torch.cuda.set_device(device)` is not needed to work around the IMA.

Also refactored a couple places where the guard is created -- preferably we create the guard with a known device, rather than setting the device later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134357
Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang
ghstack dependencies: #134300, #134345
2024-08-27 16:38:15 +00:00
be7752ead3 [2/N] Add flag to control which rank should perform NaN check (#134345)
Fixes https://github.com/pytorch/pytorch/issues/134062.
For example, in case of broadcast / scatter, only the root rank should perform the NaN check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134345
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
ghstack dependencies: #134300
2024-08-27 16:33:59 +00:00
9dc4bd7466 Create a JustknobConfig for use in config (#134161)
This is designed to be a more ergonomic interface on top of justknob_feature (see https://github.com/pytorch/pytorch/pull/134151 for just the PR with the base commits).

The idea is that people stop having to think about this as much, and can just do JustkobsConfig("//the:thing", "FORCE_THING") and it'll do the right thing.

Primarily sending this to see how people feel about the API, and using it for new config changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134161
Approved by: https://github.com/ezyang
2024-08-27 16:07:33 +00:00
94caba4899 [1/N] Move NaN check onto NCCL stream (#134300)
This is so that the tensor's lifetime management is the same as the management built for the NCCL, pre and post kernels.
It also makes the check kernels show up in the NCCL stream line on visualizers. If they showed up in the compute line, users might get confused ("my code does not have these kernels").

The check is thus moved to after the point where the NCCL stream is made to depend on the last compute kernel.

Also moved declaration of `checkForNan` from Utils.hpp to NCCLUtils.hpp, and renamed Utils.cu to NCCLUtils.cu.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134300
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
2024-08-27 16:02:27 +00:00
c582602245 Update partitioner's is_fusible heuristic to respect triton kernels (#134491)
mutated arguments to triton kernels are fusible into the triton kernel.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134491
Approved by: https://github.com/Chillee
ghstack dependencies: #134364, #134466, #134490
2024-08-27 15:57:32 +00:00
761cf91e3c [DeviceMesh] Add get_all_submeshes in _MeshEnv (#134275)
Adding a private helper method for Shampoo HSDP use cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134275
Approved by: https://github.com/XilunWu
2024-08-27 14:51:19 +00:00
d028b810fe Fix flaky GroupNorm ModuleInfo test (#133899)
Fixes https://github.com/pytorch/pytorch/issues/98677

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133899
Approved by: https://github.com/albanD
2024-08-27 14:45:51 +00:00
2033934ff8 Clarify error messages for NEWOBJ and BUILD in weights_only unpickler (#134346)
Clarify that `add_safe_globals` will allow types for these instructions

Some types do not appear as `GLOBAL` and are only caught in `BUILD`, example from hf slack is `numpy.dtypes.UInt32DType`

```python
import torch
import numpy as np
from tempfile import TemporaryDirectory
from pathlib import Path
from codecs import encode

torch.serialization.add_safe_globals([encode, np.dtype, np.core.multiarray._reconstruct, np.ndarray])

with TemporaryDirectory() as tempdir:
    p = Path(tempdir)
    r2 = np.random.get_state()
    torch.save(r2, p / "r2.pkl")
    torch.load(p / "r2.pkl", weights_only=True)
```

Yields (error comes from BUILD)
```
UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
 Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error: Can only build Tensor, parameter or OrderedDict objects, but got <class 'numpy.dtypes.UInt32DType'>
```

The reason is that `numpy.dtypes.UInt32DType` is constructed via `REDUCE` with `func =<class 'numpy.dtype'>` and `args= ('u4', False, True)`. This PR clarifies the error message to say that calling `add_safe_globals` on these types will also allow them.

After this PR, the error message becomes

```
_pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error: Can only build Tensor, Parameter, OrderedDict or types allowlisted via `add_safe_globals`, but got <class 'numpy.dtypes.UInt32DType'>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134346
Approved by: https://github.com/albanD
2024-08-27 14:45:39 +00:00
2ac710e667 Make torch.serialization.set_default_mmap_options usable as a context manager (#134371)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134371
Approved by: https://github.com/albanD
2024-08-27 14:45:29 +00:00
0fa0ac80e4 Do not use <filesystem> on Linux (#134494)
Because right now it leads to symbol conflict from binary builds.
Use of `std::filesystem::file_exists` was introduced by https://github.com/pytorch/pytorch/pull/126601 and in this PR it is replaced with a very straightforward implementation that calls `stat` on the given path, which is a classic C-way of checking for the file existence.
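
For illustration only, the same stat-based existence check written in Python. The PR itself is C++; `file_exists` here is a hypothetical helper and not the function from the PR:

```python
import os


def file_exists(path):
    """Return True if `path` can be stat'ed, False otherwise."""
    try:
        os.stat(path)
    except OSError:
        # Missing path, permission problem, etc.: treat as "does not exist".
        return False
    return True
```

Note that, like any plain `stat` check, this also returns True for directories.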

This PR should be reverted once one figures out how to keep `std::filesystem` methods linked into the binary private

Fixes symptoms of https://github.com/pytorch/pytorch/issues/133437

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134494
Approved by: https://github.com/atalman, https://github.com/d4l3k
2024-08-27 14:44:10 +00:00
3418708abf Revert "[FlexAttention] Create new variables for the subgraphs (#134507)"
This reverts commit 4d0a44d34a46af6dcc764d55269b30ac537822a0.

Reverted https://github.com/pytorch/pytorch/pull/134507 on behalf of https://github.com/albanD due to Broke lint due to too long line ([comment](https://github.com/pytorch/pytorch/pull/134507#issuecomment-2312505955))
2024-08-27 13:05:27 +00:00
87a3f664e1 Revert "[FlexAttention] Remove unused code (#134511)"
This reverts commit 767c47d3c0ee3fc7804918a08de3f94874143a03.

Reverted https://github.com/pytorch/pytorch/pull/134511 on behalf of https://github.com/albanD due to Broke lint due to too long line ([comment](https://github.com/pytorch/pytorch/pull/134507#issuecomment-2312505955))
2024-08-27 13:05:27 +00:00
3e10a1eb5a Revert "[FlexAttention] Fix Sparse block multiple to ceildiv instead for floor div (#134538)"
This reverts commit a34320a6f225061a3b5fe130a5a8fe35ed7a40f9.

Reverted https://github.com/pytorch/pytorch/pull/134538 on behalf of https://github.com/albanD due to Broke lint due to too long line ([comment](https://github.com/pytorch/pytorch/pull/134507#issuecomment-2312505955))
2024-08-27 13:05:27 +00:00
c7cbcdad76 Update partitioner's is_fusible heuristic to respect auto_functionalized (#134490)
We say Node a is fusible into node b if node b is an auto_functionalized
node that may reinplace node a later on.

This PR also changes aten.empty to be recomputable w.r.t the Partitioner
(it is, like aten.zeros, cheap to recompute and fusible into other ops).

Fixes https://github.com/pytorch/pytorch/issues/134468

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134490
Approved by: https://github.com/Chillee
ghstack dependencies: #134364, #134466
2024-08-27 13:05:01 +00:00
dde5974b13 Implementation for rng ops on hpu and xpu (#133068)
implementation for high_order_op::run_and_save_rng_state and high_order_op::run_with_rng_state on hpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133068
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel, https://github.com/anijain2305
2024-08-27 11:34:37 +00:00
FEI
ef8236f12b Provide default value None for the attn_bias parameter(#133981) (#133986)
Fixes #133981

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133986
Approved by: https://github.com/ezyang
2024-08-27 11:10:43 +00:00
a34320a6f2 [FlexAttention] Fix Sparse block multiple to ceildiv instead for floor div (#134538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134538
Approved by: https://github.com/yanboliang
ghstack dependencies: #134495, #134507, #134511
2024-08-27 09:53:19 +00:00
767c47d3c0 [FlexAttention] Remove unused code (#134511)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134511
Approved by: https://github.com/yanboliang
ghstack dependencies: #134495, #134507
2024-08-27 09:53:19 +00:00
4d0a44d34a [FlexAttention] Create new variables for the subgraphs (#134507)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134507
Approved by: https://github.com/yanboliang, https://github.com/BoyuanFeng
ghstack dependencies: #134495
2024-08-27 09:53:13 +00:00
f480385277 Remove explicit Amz2023 reference from jobs (#134355)
Changes jobs to go back to using the default AMI.

Note: This is only a cleanup PR. It does NOT introduce any behavior changes in CI

Now that the default variant uses the Amazon 2023 AMI and has been shown to be stable for a week, it's time to remove the explicit amz2023 references and go back to using the default variant.

After a week or two, when this is rolled out to most people, we can remove the variants from scale config as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134355
Approved by: https://github.com/jeanschmidt
2024-08-27 08:51:42 +00:00
0916d72e99 Fix the warning for cat operators with same qparams (#133999)
Summary:
Currently the warning is printed when the cat inputs have the same qparam, leading to a flood of warnings.
This diff emits the warning only when cat inputs don't have the same qparam.
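
A minimal sketch of the intended behaviour, assuming qparams can be modelled as a `(scale, zero_point)` pair. `QParams` and `check_cat_inputs` are hypothetical names used only for illustration and are not the code touched by this diff.

```python
import warnings
from typing import NamedTuple, Sequence


class QParams(NamedTuple):
    # Hypothetical stand-in for the quantization parameters of one input.
    scale: float
    zero_point: int


def check_cat_inputs(qparams: Sequence[QParams]) -> bool:
    """Warn only if the inputs to a cat do not all share the same qparams."""
    all_same = len(set(qparams)) <= 1
    if not all_same:
        warnings.warn("cat inputs have different qparams", stacklevel=2)
    return all_same
```

With this shape, the common case (matching qparams) stays silent and only the mismatch produces a warning.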

Test Plan: CI

Reviewed By: aprotopopov

Differential Revision: D60638609

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133999
Approved by: https://github.com/tarun292
2024-08-27 08:21:39 +00:00
3515090006 Fix TypeError when itering NoneType in instantiate_device_type_tests() (#134457)
Fixes #134454

Fix the TypeError introduced by https://github.com/pytorch/pytorch/pull/133082, which iterates over the default arguments ``except_for`` and ``only_for`` even when they are ``None``.
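
The failure mode and the usual guard can be sketched as follows. This is not the body of `instantiate_device_type_tests()`; `select_device_types` is a hypothetical helper that only shows how `None` defaults can be normalised before they are iterated.

```python
from typing import Iterable, List, Optional


def select_device_types(
    available: Iterable[str],
    only_for: Optional[Iterable[str]] = None,
    except_for: Optional[Iterable[str]] = None,
) -> List[str]:
    # Normalise the None defaults first; iterating None raises TypeError.
    only = set(only_for) if only_for is not None else None
    excluded = set(except_for) if except_for is not None else set()
    return [
        device
        for device in available
        if (only is None or device in only) and device not in excluded
    ]
```

Calling the helper with no filters simply returns every available device type.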

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134457
Approved by: https://github.com/shink, https://github.com/albanD
2024-08-27 07:13:36 +00:00
136b19b062 Adding entry-point based support for out-of-tree rendezvous plugins (#132633)
Fixes #127519

Currently in torchrun rendezvous, there are only two rendezvous backends supported out of the box: `C10d` and `Etcd`. The changes in this PR enable distributed elastic users to bring their out-of-tree rendezvous backend implementations as Python packages.

#### AUTHORING NEW PLUGIN
Any new plugin will be a Python package exposing entry-points. For example, the structure of the redis plugin is as follows:

```
plugin_root
|_ pyproject.toml
|_ src
   |_ redis
      |_ __init__.py
      |_ redis_store.py
      |_ redis_backend.py
```

The contents of the `pyproject.toml` should indicate that the package exposes a torchrun entry-point by mentioning the group name `torchrun.plugins`. The `pyproject.toml` for the redis plugin would be as follows:

```
[project]
name = "redis"
version = "0.0.1"

[project.entry-points.'torchrun.plugins']
redis = 'redis'
```

The `src/redis/__init__.py` file would contain functions that return the plugin name and plugin handler. The contents of `__init__.py` for redis would be as follows:

```
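from torch.distributed.elastic.rendezvous import RendezvousParameters
from torch.distributed.elastic.rendezvous.dynamic_rendezvous import create_handler


def getPluginHandler():
    # Return a factory that builds the handler from the rendezvous parameters.
    def _create_redis_handler(params: RendezvousParameters):
        # Imported lazily so the plugin's dependencies load only when used.
        from redis_rendezvous_backend import create_backend
        backend, store = create_backend(params)
        return create_handler(store, backend, params)
    return _create_redis_handler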
```

The files `redis_store` and `redis_backend` contain the implementation of [Store](41189b0da4/torch/_C/_distributed_c10d.pyi (L171)) and [RendezvousBackend](e782918b8e/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py (L61)) respectively.
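
On the consuming side, entry-points registered under a group can be looked up with the standard library. The sketch below is an assumption about how such discovery could look, not the code torchrun uses; `discover_plugins` is a hypothetical helper, and only the group name `torchrun.plugins` comes from the description above.

```python
from importlib.metadata import entry_points
from typing import Any, Dict


def discover_plugins(group: str = "torchrun.plugins") -> Dict[str, Any]:
    """Map each entry-point name registered under `group` to its loaded object."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10+ exposes a selectable collection.
        selected = eps.select(group=group)
    else:
        # Older versions return a plain dict keyed by group name.
        selected = eps.get(group, [])
    return {ep.name: ep.load() for ep in selected}
```

With the redis package from the example installed, the returned mapping would be expected to contain a `redis` key; with nothing installed it is empty.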

#### USER EXPERIENCE
Before using the plugin for the first time, the user has to install the plugin packages. For example, published packages can be installed using `pip3 install <plugin-name>`, and a plugin on the local file system can be installed using `pip3 install -e <plugin-location>`.

Once installed, the new backend can be used in torchrun as follows:

```
torchrun --rdzv-backend=redis --rdzv-endpoint=redis-container:6379 --nnodes=3 --nproc-per-node=1 --max-restarts=3 --rdzv-id=1 test.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132633
Approved by: https://github.com/wconstab
2024-08-27 07:09:41 +00:00
4a18fcf7af [inductor] calibration inductor windows uts (12/N) (#134428)
enable Windows inductor UTs for `test/inductor/test_torchinductor_codegen_dynamic_shapes.py`

The failure below depends on https://github.com/pytorch/pytorch/pull/134429; this PR needs to be rebased after https://github.com/pytorch/pytorch/pull/134429 is merged.
```cmd
2024-08-25T23:57:23.2747794Z Windows CI does not have necessary dependencies for test_torchinductor_dynamic_shapes yet
2024-08-25T23:57:23.2748541Z Traceback (most recent call last):
2024-08-25T23:57:23.2749593Z   File "C:\actions-runner\_work\pytorch\pytorch\test\inductor\test_torchinductor_codegen_dynamic_shapes.py", line 30, in <module>
2024-08-25T23:57:23.2750688Z     from inductor.test_torchinductor_dynamic_shapes import (
2024-08-25T23:57:23.2751877Z   File "C:\actions-runner\_work\pytorch\pytorch\test\inductor\test_torchinductor_dynamic_shapes.py", line 46, in <module>
2024-08-25T23:57:23.2752876Z     raise unittest.SkipTest("requires sympy/functorch/filelock")
2024-08-25T23:57:23.2753545Z unittest.case.SkipTest: requires sympy/functorch/filelock
2024-08-25T23:57:23.2754077Z Got exit code 1
2024-08-25T23:57:23.2754874Z No stepcurrent file found. Either pytest didn't get to run (e.g. import error) or file got deleted (contact dev infra)
```

Local test pass:
<img width="1892" alt="image" src="https://github.com/user-attachments/assets/241ab082-6026-4f33-b3ac-7e9ef7da744d">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134428
Approved by: https://github.com/jansel
2024-08-27 05:43:07 +00:00
0b81f700aa [PT2/Profiler] Add Context Info to Torch-Compiled Regions (#132765)
Summary:
We want to add compile IDs and frames to each Torch-Compiled Region in order to help users cross reference the section they are checking alongside data obtained from tools, such as tlparse.
This diff operates on the assumption that each graph section will enter and exit a CompileContext before it is run, either to compile the graph or to look it up in the cache. Based on this assumption, we can save the value of the graph section from the exited CompileContext in eval_frame.c using a Python C API. After this, we can create a new interface in the cpp shim to wrap around the record_function in order to pass in the new keyword argument for "context".

Test Plan:
Enhance test_profiler_dynamo_compiled_region to look for kwinputs as well as a name to see that the context is now labeled. Also changed test to run graph with more contexts so that we test a wider range of profiling.

Differential Revision: D60803317

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132765
Approved by: https://github.com/anijain2305
2024-08-27 04:55:04 +00:00
de57a6e806 Back out "[dynamo][exception] Support raise exception from None (#134028)" (#134513)
Summary:
The original diff is causing the error "attempting to assign a gradient with dtype 'c10::BFloat16' to a tensor with dtype 'float'".

The context is in: https://fb.workplace.com/groups/1075192433118967/permalink/1491357138169159/

Test Plan: After reverting, the above issue is gone, details are in https://fb.workplace.com/groups/1075192433118967/permalink/1491357138169159/

Differential Revision: D61820520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134513
Approved by: https://github.com/anijain2305
2024-08-27 02:57:14 +00:00
02b0b524b5 [inductor] Turn on UT: test_randint_int64_mod (#134510)
It was fixed by https://github.com/pytorch/pytorch/pull/134229, so turn it on.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134510
Approved by: https://github.com/ezyang
2024-08-27 02:33:07 +00:00
d0147290d8 [BE][Easy][dynamo] ensure trace_rules.MOD_INLINELIST in alphabetical order (#134246)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* __->__ #134246
* #133987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134246
Approved by: https://github.com/yanboliang
2024-08-27 02:29:43 +00:00
cyy
2ee201a7d0 [CMake] Remove BUILDING_WITH_TORCH_LIBS (#134434)
Since BUILDING_WITH_TORCH_LIBS is not used now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134434
Approved by: https://github.com/ezyang
2024-08-27 01:48:21 +00:00
bdfc1d3987 Remove unnecessary expect_true in split_with_sizes (#133439)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133439
Approved by: https://github.com/albanD
2024-08-27 01:34:00 +00:00
c7ca89a11a Improve print stack/locals printing in comptime (#133651)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133651
Approved by: https://github.com/anijain2305
2024-08-27 01:29:50 +00:00
58771315d3 Unify lowerings for auto_functionalized and triton_kernel_wrapper_functional (#134466)
Fixes https://github.com/pytorch/pytorch/issues/134372

The triton_kernel_wrapper_functional lowering was causing problems (it
was generating small kernels with nans in it, probably from realizing
aten.empty nodes). Instead of having its own manual lowering, we change
triton_kernel_wrapper_functional to go the same route as
auto_functionalized where we decompose the node into clone + mutation
nodes.

Test Plan:
- new test
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134466
Approved by: https://github.com/oulgen, https://github.com/eellison
ghstack dependencies: #134364
2024-08-27 00:53:05 +00:00
141a9c7204 Revert "[export] enumerate unsupported sympy.Functions (#134271)"
This reverts commit ddd71e34797f3bb56a048058e007a2df87c5755f.

Reverted https://github.com/pytorch/pytorch/pull/134271 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/134271#issuecomment-2311353460))
2024-08-27 00:45:00 +00:00
4df10a6340 [FlexAttention] Fix bug when checking whether to return LSE (#134495)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134495
Approved by: https://github.com/yanboliang, https://github.com/Chillee, https://github.com/BoyuanFeng
2024-08-27 00:31:46 +00:00
b98d33c155 [inductor] calibration inductor windows uts (13/N) (#134429)
enable Windows inductor UTs for `test/inductor/test_torchinductor_dynamic_shapes.py`

Local test pass:
<img width="1885" alt="image" src="https://github.com/user-attachments/assets/4b96b6d9-715f-4c94-8059-9ee0afaaa574">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134429
Approved by: https://github.com/jansel
2024-08-27 00:16:16 +00:00
74341e1150 [dynamo] simplify implementation for os.fspath (#133801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133801
Approved by: https://github.com/anijain2305
ghstack dependencies: #133771
2024-08-27 00:08:04 +00:00
1dbd3476de [dynamo][itertools] support itertools.tee (#133771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133771
Approved by: https://github.com/jansel
2024-08-27 00:08:04 +00:00
43bbd781f2 Back out "[Traceable FSDPS] Allow tracing through FSDP2 impl in trace_rules.py (#133532)" (#134478)
Summary:
Original commit changeset: 0215a41433e9

Original Phabricator Diff: D61432583

D61432583 causes FSDP2 to get stuck in PT2 compilation when applied to FB-FM-v4.

With D61432583:
https://www.internalfb.com/mast/job/aps-ckluk-745e763d6a

After backing out D61432583:
https://www.internalfb.com/mast/job/aps-ckluk-f9604ea1f9

Test Plan:
hg graft D61774888
scripts/ckluk/aps/mast_joint_arch_exploration_cmf_updated_fbfm_v3_fsdp2_qps.sh

Differential Revision: D61802689

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134478
Approved by: https://github.com/yf225
2024-08-27 00:07:28 +00:00
46ecc673ae [ROCm] Prevent accidental enablement of efficient attention. (#133331)
Currently Efficient attention and Flash attention share the same set of GPU
kernels on ROCM and have common limitations on head sizes.

Fixes https://github.com/pytorch/pytorch/issues/132004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133331
Approved by: https://github.com/malfet, https://github.com/jithunnair-amd
2024-08-27 00:03:45 +00:00
0be6584203 [Inductor UT] Refine test case test_codegen_upcast_to_fp32_upcast to pass on XPU. (#134474)
[Inductor UT] Refine test case test_codegen_upcast_to_fp32_upcast to pass on XPU.
Fix issue: #134476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134474
Approved by: https://github.com/jansel
2024-08-26 23:59:26 +00:00
1565940114 [MPS] Add test/test_nn.py to test suite (#134184)
This PR increases test coverage by including the tests in `test/test_nn.py` in the test suite of MPS.

Some of the tests are decorated with `@expectedFailureMPS` for various reasons: either the op is not implemented, or the outputs do not align. Tests with differing results should be investigated further to rule out any live bugs.

```bash
$ python test/run_test.py --mps --verbose -k TestNN
Running test batch 'tests to run' cost 84.76 seconds
```

Ref #133520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134184
Approved by: https://github.com/albanD, https://github.com/malfet
2024-08-26 23:48:23 +00:00
79b7fff188 Fix docstring for torch.signal.windows.nuttall (#134512)
This partially fixes regression introduced by https://github.com/pytorch/pytorch/pull/124771 but also just improves `z_n` rendering, by using MathML
In 2.3 it was [rendered](https://pytorch.org/docs/2.3/generated/torch.signal.windows.nuttall.html#torch.signal.windows.nuttall)
as
<img width="177" alt="image" src="https://github.com/user-attachments/assets/2c15d1f9-13ad-483f-bb66-41fa3fa4ba9c">

With this change it'll be [rendered](https://docs-preview.pytorch.org/pytorch/pytorch/134512/generated/torch.signal.windows.nuttall.html#torch.signal.windows.nuttall) as
```math
z_n = \frac{2 \pi n}{M}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134512
Approved by: https://github.com/kit1980, https://github.com/aorenste, https://github.com/atalman
2024-08-26 22:51:43 +00:00
ddd71e3479 [export] enumerate unsupported sympy.Functions (#134271)
There are two concepts of unsupported sympy.Functions in symbolic_shapes:
1) unsupported by the export solver, meaning the solver doesn't know how to provide useful fixes for those functions
2) unsupported by the sympy interpreter - meaning we can't reify them into FX nodes because the functions aren't present in PythonReferenceAnalysis

This splits the current call into a call for each version, with the Export solver the only user of 1). For 1), we enumerate the functions in _sympy/functions.py, and subtract the functions we know we can support. For 2) there's only 3 functions we've seen pop up in test cases.
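
The "enumerate, then subtract" step for case 1) can be sketched with a set difference. This is an illustration under stated assumptions, not the code in symbolic_shapes: `unsupported_function_names` is a hypothetical helper, and the module and class names in the demo are placeholders rather than real entries from `_sympy/functions.py`.

```python
import types
from typing import Set


def unsupported_function_names(module: types.ModuleType, supported: Set[str]) -> Set[str]:
    """Names of public classes defined in `module` that are not in `supported`."""
    defined = {
        name
        for name, obj in vars(module).items()
        if isinstance(obj, type)
        and obj.__module__ == module.__name__
        and not name.startswith("_")
    }
    return defined - supported


# Demo with a fake module holding three placeholder function classes.
fake = types.ModuleType("fake_functions")
for cls_name in ("FuncA", "FuncB", "FuncC"):
    setattr(fake, cls_name, type(cls_name, (), {"__module__": "fake_functions"}))

unsupported = unsupported_function_names(fake, supported={"FuncA"})
```

Enumerating from the module keeps the unsupported list in step with newly added functions, at the cost of having to maintain the supported set by hand.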

Differential Revision: D61677956

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134271
Approved by: https://github.com/avikchaudhuri
2024-08-26 22:44:12 +00:00
55236d0cb7 TestForeach::test_parity: Remove check for error message text (#134251)
Previously, error messages were expected to be string equivalent to
error messages thrown by the ref function.  This check fails for dozens
of torch functions, and doesn't appear to add much value for the end
user.  This commit removes this check.
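
One way to picture the relaxed check is a comparison of exception types only. The commit message does not say what the test still compares after the change, so treat the sketch below as an assumption; `raised_type` and `errors_match` are hypothetical helpers.

```python
from typing import Callable, Optional, Type


def raised_type(fn: Callable, *args) -> Optional[Type[BaseException]]:
    """Return the type of the exception raised by fn(*args), or None."""
    try:
        fn(*args)
    except Exception as exc:
        return type(exc)
    return None


def errors_match(fn: Callable, ref_fn: Callable, *args) -> bool:
    # Compare only whether and what kind of error is raised, not its text.
    return raised_type(fn, *args) is raised_type(ref_fn, *args)


def fast_impl(x):
    raise ValueError("fast path: bad input")


def ref_impl(x):
    raise ValueError("reference: unsupported input")


same_kind = errors_match(fast_impl, ref_impl, 1)
```

Here the two implementations fail with different messages but the same exception type, so the comparison still passes.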

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134251
Approved by: https://github.com/amjames, https://github.com/janeyx99
ghstack dependencies: #134253, #134344
2024-08-26 22:40:54 +00:00
ef8c474fcf Add the fast path for bfloat16 lgamma (#134344)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134344
Approved by: https://github.com/amjames, https://github.com/janeyx99
ghstack dependencies: #134253
2024-08-26 22:40:54 +00:00
3c5883e550 Fix test_parity xfail for sigmoid (#134253)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134253
Approved by: https://github.com/amjames, https://github.com/janeyx99
2024-08-26 22:40:54 +00:00
a23dae22d5 Update AC pass use_reentrant message (#134472)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134472
Approved by: https://github.com/albanD
2024-08-26 21:57:38 +00:00
dbef2b05b4 [dynamo] Cache _dynamo.disable results (#134272)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134272
Approved by: https://github.com/yf225, https://github.com/jansel
2024-08-26 21:04:15 +00:00
28a4db84f2 [ARM] Fix infinite recursion in unwind (#134387)
Fixes #119905

The `TORCH_SHOW_CPP_STACKTRACES=1` setting on ARM causes infinite recursive unwind because on failure a `StackTraceFetcher` attempts to unwind the <ins>failed instruction</ins>: 5ad759ca33/torch/csrc/profiler/combined_traceback.cpp (L25)
then the unwind itself fails:
5ad759ca33/torch/csrc/profiler/unwind/unwind.cpp (L10-L12)
and it causes another attempt to unwind the failure in `unwind()`...

In summary, the executed instruction is equivalent to:
```C++
std::vector<void*> unwind() {
  // some instructions ...
  return unwind();
}
```
This PR replaces `TORCH_CHECK` by `TORCH_WARN_ONCE` as it will not cause an uncontrolled recursion. The only side effect would be an empty back-trace.
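
The difference between the two behaviours can be illustrated in Python. This is a sketch of the control flow only, not the C++ code: both function names are hypothetical, and the depth limit exists purely so the demonstration terminates.

```python
import warnings
from typing import List


def unwind_with_check(depth: int = 0, max_depth: int = 50) -> List[str]:
    """A hard failure whose reporting asks for another trace, so it re-enters."""
    if depth >= max_depth:  # guard added only so the demo stops
        raise RecursionError("unwind kept re-entering itself")
    # The check fails, and building the error message requests a new unwind.
    return unwind_with_check(depth + 1, max_depth)


def unwind_with_warning() -> List[str]:
    """Warn and give up, returning an empty back-trace instead of failing."""
    warnings.warn("unwinding is not supported on this platform")
    return []
```

The warning path trades the back-trace for termination, which matches the side effect described above.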

Huge thanks to @nWEIdia who found the root cause!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134387
Approved by: https://github.com/eqy, https://github.com/nWEIdia, https://github.com/malfet
2024-08-26 21:02:31 +00:00
900c5083ed [inductor] calibration inductor windows uts (9/N) (#134425)
enable Windows inductor UTs of `test/inductor/test_binary_folding.py`

Failed UT depends on https://github.com/pytorch/pytorch/pull/134427
Need to rebase after https://github.com/pytorch/pytorch/pull/134427 merged.
```cmd
2024-08-25T23:32:23.0905727Z Traceback (most recent call last):
2024-08-25T23:32:23.0906516Z   File "C:\actions-runner\_work\pytorch\pytorch\test\inductor\test_binary_folding.py", line 18, in <module>
2024-08-25T23:32:23.0908200Z     from inductor.test_inductor_freezing import TestCase
2024-08-25T23:32:23.0909883Z   File "C:\actions-runner\_work\pytorch\pytorch\test\inductor\test_inductor_freezing.py", line 39, in <module>
2024-08-25T23:32:23.0911128Z     raise unittest.SkipTest("requires sympy/functorch/filelock")
2024-08-25T23:32:23.0911801Z unittest.case.SkipTest: requires sympy/functorch/filelock
2024-08-25T23:32:23.0912370Z Got exit code 1
2024-08-25T23:32:23.0913155Z No stepcurrent file found. Either pytest didn't get to run (e.g. import error) or file got deleted (contact dev infra)
```

Local test pass:
<img width="1898" alt="image" src="https://github.com/user-attachments/assets/4a6e3f66-4bbc-4aab-8f0d-2e2318046e53">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134425
Approved by: https://github.com/ezyang, https://github.com/jansel
2024-08-26 20:57:41 +00:00
68624cf089 [dynamo][guards] De-dupe DUPLICATE_INPUT guard (#134354)
Hard to write a test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134354
Approved by: https://github.com/jansel
2024-08-26 20:48:57 +00:00
af82dc816a Fix lint failures (#134488)
Introduced by https://github.com/pytorch/pytorch/pull/131000

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134488
Approved by: https://github.com/Skylion007, https://github.com/msaroufim, https://github.com/albanD, https://github.com/atalman
2024-08-26 20:13:21 +00:00
2588b5e51a Move module_tracker to logging for confused hierarchy (#134467)
Fixes https://github.com/pytorch/pytorch/issues/134242

Make sure to never raise an error when confused. Logs for confusion can be enabled with `TORCH_LOGS="torch.utils.module_tracker"` or the usual python systems.
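
The "log instead of raise" pattern can be sketched with the standard `logging` module. `TrackerSketch` and the logger name below are hypothetical and much simpler than the real module tracker; they only show the shape of the change.

```python
import logging
from typing import List

log = logging.getLogger("module_tracker_sketch")  # hypothetical logger name


class TrackerSketch:
    """Tracks a stack of module names and tolerates mismatched exits."""

    def __init__(self) -> None:
        self.stack: List[str] = []

    def enter(self, name: str) -> None:
        self.stack.append(name)

    def exit(self, name: str) -> None:
        if not self.stack or self.stack[-1] != name:
            # Confused hierarchy: record it and carry on instead of raising.
            log.info("unexpected exit of %r; current stack is %r", name, self.stack)
            return
        self.stack.pop()
```

With the standard library, such messages are made visible by lowering the logger's level, for example `logging.getLogger("module_tracker_sketch").setLevel(logging.INFO)` after configuring a handler.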

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134467
Approved by: https://github.com/malfet
2024-08-26 19:39:08 +00:00
a0e062c6f1 Add mean.dtype_out (#133506)
Give it a try and see if CI is happy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133506
Approved by: https://github.com/bdhirsh
2024-08-26 19:26:11 +00:00
eqy
3541e450af Support larger page sizes with use_mmap_weights (#131000)
Fixes e.g., `test_large_mmaped_weights_non_abi_compatible_cuda` on machines with 64K page size

CC @malfet @tinglvv @nWEIdia

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131000
Approved by: https://github.com/malfet
2024-08-26 18:35:55 +00:00
3322ee236d [aoti] remove c_shim_version v1 logic (#134283)
Summary: Previously, https://github.com/pytorch/pytorch/pull/132750 and https://github.com/pytorch/pytorch/pull/133105 set c_shim_version to 2 for all cases. So removing c_shim_version logic.

Test Plan: ci

Differential Revision: D61574695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134283
Approved by: https://github.com/desertfire
2024-08-26 18:29:40 +00:00
1d231ff8ba [HOO] add hints_wrapper to support passing context hints (#132860)
Fixes #126393

The implementation code is based on feedback here (https://github.com/pytorch/pytorch/pull/121639#issuecomment-2223948842).

Hints are passed as kwargs of hints_wrapper op. It also supports nested hints.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132860
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2024-08-26 18:21:22 +00:00
1ccc8f0200 [dynamo][super] Improve handling of getattr on super (#134039)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134039
Approved by: https://github.com/yanboliang, https://github.com/jansel
2024-08-26 18:20:39 +00:00
1dd4b9221b [inductor] enable clang for Windows inductor (#134444)
Changes:
1. Add Windows clang-cl compiler check.
2. Add openmp config for clang-cl.
3. Preload libomp.dll when use clang.
4. Add compiler flags syntax check for `clang` and `clang++`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134444
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/malfet
2024-08-26 18:19:59 +00:00
0a3c064c12 [inductor] fix _maybe_subprocess_run not support Windows path (#134365)
Windows file paths use `\` as the delimiter, which is also an escape character. We need to translate every `\` in a path to `/`, as on Linux.
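
The problem can be reproduced without the test suite. The sketch below is an illustration, not the patch itself: it pastes a Windows path into generated source, which fails to compile, and then applies the `\` to `/` translation. The variable names are made up; the path is the one from the error message further down.

```python
windows_path = r"C:\Users\Xuhan\AppData\Local\Temp\tmpufu9t3pc"

# Pasting the raw path into generated source makes "\U" look like an escape.
broken_source = f'debug_dir_root = "{windows_path}"'
try:
    compile(broken_source, "<string>", "exec")
    broken_compiles = True
except SyntaxError:
    broken_compiles = False

# Translating "\" to "/" yields source that compiles cleanly.
fixed_source = 'debug_dir_root = "{}"'.format(windows_path.replace("\\", "/"))
namespace = {}
exec(compile(fixed_source, "<string>", "exec"), namespace)
```

An alternative would be to embed `repr(windows_path)` in the generated source, which escapes the backslashes instead of replacing them.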

Reproduce UTs:
```cmd
pytest test\dynamo\test_minifier.py -v -k test_after_dynamo_cpu_accuracy_error
```

Error message:
```cmd
____________________________________________________________________________________________________________ MinifierTests.test_after_dynamo_cpu_accuracy_error _____________________________________________________________________________________________________________
Traceback (most recent call last):
  File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_minifier.py", line 40, in test_after_dynamo_cpu_accuracy_error
    self._test_after_dynamo(
  File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_minifier.py", line 27, in _test_after_dynamo
    self._run_full_test(run_code, "dynamo", expected_error, isolate=False)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_dynamo\test_minifier_common.py", line 235, in _run_full_test
    self.assertIn(expected_error, test_proc.stderr.decode("utf-8"))
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 1112, in assertIn
    self.fail(self._formatMessage(msg, standardMsg))
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 675, in fail
    raise self.failureException(msg)
AssertionError: 'AccuracyError' not found in 'Traceback (most recent call last):\n  File "C:\\Users\\Xuhan\\.conda\\envs\\win_mkl_static\\lib\\site-packages\\torch\\_dynamo\\test_minifier_common.py", line 114, in _maybe_subprocess_run\n    exec(code, {"__name__": "__main__", "__compile_source__": code})\n  File "<string>", line 9\n    torch._dynamo.config.debug_dir_root = "C:\\Users\\Xuhan\\AppData\\Local\\Temp\\tmpufu9t3pc"\n                                                                                         ^\nSyntaxError: (unicode error) \'unicodeescape\' codec can\'t decode bytes in position 2-3: truncated \\UXXXXXXXX escape\n'

To execute this test, run the following from the base repo dir:
    python test\dynamo\test_minifier.py MinifierTests.test_after_dynamo_cpu_accuracy_error

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
--------------------------------------------------------------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------------------------------------------------------------
test stdout:
test stderr: Traceback (most recent call last):
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_dynamo\test_minifier_common.py", line 114, in _maybe_subprocess_run
    exec(code, {"__name__": "__main__", "__compile_source__": code})
  File "<string>", line 9
    torch._dynamo.config.debug_dir_root = "C:\Users\Xuhan\AppData\Local\Temp\tmpufu9t3pc"
                                                                                         ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

--------------------------------------------------------------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------------------------------------------------------------
running test
```
Local test passed:
<img width="849" alt="image" src="https://github.com/user-attachments/assets/4a4eecc2-7c08-4de6-9395-546b69803b16">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134365
Approved by: https://github.com/jansel, https://github.com/jgong5
2024-08-26 17:48:11 +00:00
78128cbdd8 [CD] Use ephemeral arm64 runners for nightly and docker builds (#134473)
Follow up after adding linux arm64 ephemeral instances: https://github.com/pytorch/pytorch/pull/134469
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134473
Approved by: https://github.com/malfet
2024-08-26 17:47:20 +00:00
0f5b052dba [inductor] calibration inductor windows uts (11/N) (#134427)
enable Windows inductor UTs of `test/inductor/test_inductor_freezing.py`

Local test pass:
<img width="1891" alt="image" src="https://github.com/user-attachments/assets/f3a873b4-abb5-4047-92f8-8e6da7c67315">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134427
Approved by: https://github.com/jansel
2024-08-26 17:43:58 +00:00
cyy
73604eed0c [20/N] Fix clang-tidy warnings in jit (#133399)
Follows #133067

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133399
Approved by: https://github.com/Skylion007
2024-08-26 17:43:52 +00:00
019b80855f [inductor] calibration inductor windows uts (10/N) (#134426)
enable Windows inductor UT of `test/inductor/test_efficient_conv_bn_eval.py`

Local test pass:
<img width="1892" alt="image" src="https://github.com/user-attachments/assets/8a94c5e4-68bf-4a6f-8a1b-60d6ede14882">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134426
Approved by: https://github.com/jansel
2024-08-26 17:43:36 +00:00
7ff576072f [inductor] calibration inductor windows uts (8/N) (#134424)
enable Windows inductor UTs of `test/inductor/test_benchmark_fusion.py`

Local test pass:
<img width="1912" alt="image" src="https://github.com/user-attachments/assets/5be34b0c-9411-4430-927e-3313245f7c13">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134424
Approved by: https://github.com/ezyang
2024-08-26 17:38:53 +00:00
adcce538b7 Revert "Allow mp.start_processes to create processes in parallel (#133707)"
This reverts commit 3546628a2a167ace6060737eeccf8ee8fd87ddc0.

Reverted https://github.com/pytorch/pytorch/pull/133707 on behalf of https://github.com/ZainRizvi due to sorry but trunk has been consistently broken since this PR was merged. See: [GH job link](https://github.com/pytorch/pytorch/actions/runs/10529617600/job/29191757055) [HUD commit link](3546628a2a) ([comment](https://github.com/pytorch/pytorch/pull/133707#issuecomment-2310709523))
2024-08-26 17:31:10 +00:00
d0ac5d55ba Memory optimization for DSD for TorchTune LoRA (#134025)
Optimize memory cost at [PR#129635](https://github.com/pytorch/pytorch/pull/129635)

There are two main parts to the optimization:
1. Optimize the tensor-distributing part by postponing the full_tensor generation, which avoids the memory overlap and saves around 50% of peak memory in the 2-param test case.
2. Apply `assign=True` in `load_state_dict`, which saves memory during state dict loading by assigning the input param, around 50% of peak memory in the loading part.

Future work:
Memory optimization to the opt will be conducted in the next PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134025
Approved by: https://github.com/fegin

Co-authored-by: Rachel Guo <guorachel@meta.com>
2024-08-26 17:24:25 +00:00
fc61aae70f Remove color in CI (#133517)
Remove color by default to make CI logs easier to read

Example of color
<img width="569" alt="image" src="https://github.com/user-attachments/assets/0da13544-98b1-47be-8383-64a5b3fd8951">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133517
Approved by: https://github.com/ZainRizvi
2024-08-26 16:58:06 +00:00
42955e04f1 Revert "[dynamo] Cache _dynamo.disable results (#134272)"
This reverts commit a699bd11551e9755bb9238c6b82c369880789397.

Reverted https://github.com/pytorch/pytorch/pull/134272 on behalf of https://github.com/ZainRizvi due to Fails internal tests ([comment](https://github.com/pytorch/pytorch/pull/134272#issuecomment-2310649115))
2024-08-26 16:57:53 +00:00
e94bdc7876 Revert "[dynamo][guards] De-dupe DUPLICATE_INPUT guard (#134354)"
This reverts commit cdb9df5efe78142b7a612ae9c938ddf8a8850d10.

Reverted https://github.com/pytorch/pytorch/pull/134354 on behalf of https://github.com/ZainRizvi due to Fails internal tests ([comment](https://github.com/pytorch/pytorch/pull/134272#issuecomment-2310649115))
2024-08-26 16:57:53 +00:00
a6fac0e969 Use ephemeral runners for windows nightly builds (#134463)
This is the definition of windows.4xlarge:

```
  windows.4xlarge:
    disk_size: 256
    instance_type: c5d.4xlarge
    is_ephemeral: true
    max_available: 420
    os: windows
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134463
Approved by: https://github.com/jeanschmidt
2024-08-26 16:33:19 +00:00
b417e32da2 [CD] fix xpu nightly wheel test env (#134395) (#134464)
Since https://github.com/pytorch/builder/pull/1972 landed, the xpu env is sourced twice in the nightly wheel test.
Works for https://github.com/pytorch/pytorch/issues/114850

Reland of #134395, to be landed with pytorchmergebot
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134464
Approved by: https://github.com/jeanschmidt

Co-authored-by: Wang, Chuanqi <chuanqi.wang@intel.com>
2024-08-26 15:35:48 +00:00
c507f402f1 Add linux arm64 ephemeral runners (#134469)
Should be landed with: https://github.com/pytorch/test-infra/pull/5593

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134469
Approved by: https://github.com/jeanschmidt, https://github.com/clee2000
2024-08-26 15:32:45 +00:00
17e8a51ff2 Revert "[inductor]Let output or input_as_strided match exact strides (#130956)"
This reverts commit a63efee5cd422db0aabe5d02d2fe35fef9be7978.

Reverted https://github.com/pytorch/pytorch/pull/130956 on behalf of https://github.com/ZainRizvi due to sorry but this seems to cause internal tests to fail. Please see D61771533 for details ([comment](https://github.com/pytorch/pytorch/pull/130956#issuecomment-2310490049))
2024-08-26 15:31:23 +00:00
1c4780e69a Revert "c10d/logging: add C10D_LOCK_GUARD (#134131)"
This reverts commit 4c28a0eb0ba437c1b7db559f63f8bec17bd48f69.

Reverted https://github.com/pytorch/pytorch/pull/134131 on behalf of https://github.com/ZainRizvi due to Sorry but this causes formatting errors internally which make it fail to build. See D61759282 ([comment](https://github.com/pytorch/pytorch/pull/134131#issuecomment-2310455878))
2024-08-26 15:19:27 +00:00
50e90d7203 Revert "[dynamo] simplify implementation for functools.reduce (#133778)"
This reverts commit 6c0b15e3828b8e2a0bd726a3e5d4e98c8ced5efe.

Reverted https://github.com/pytorch/pytorch/pull/133778 on behalf of https://github.com/ZainRizvi due to Sorry, but this breaks internal tests because of using functools ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2310445169))
2024-08-26 15:16:17 +00:00
472c7cf962 Revert "[dynamo] simplify implementation for builtins.sum (#133779)"
This reverts commit 8d90392fb02ce5e6854e6b4dbcdc4a7bbd55f8e2.

Reverted https://github.com/pytorch/pytorch/pull/133779 on behalf of https://github.com/ZainRizvi due to Sorry, but this breaks internal tests because of using functools ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2310445169))
2024-08-26 15:16:17 +00:00
3d7f3f6a55 Revert "[dynamo][itertools] support itertools.tee (#133771)"
This reverts commit 0e49b2f18e78386c8ed9ce540a8017411c7ab0cd.

Reverted https://github.com/pytorch/pytorch/pull/133771 on behalf of https://github.com/ZainRizvi due to Sorry, but this breaks internal tests because of using functools ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2310445169))
2024-08-26 15:16:17 +00:00
e1fc4362fb Revert "[dynamo] simplify implementation for os.fspath (#133801)"
This reverts commit c5f6b72041144c00e240bcfdc783a5597c3d8928.

Reverted https://github.com/pytorch/pytorch/pull/133801 on behalf of https://github.com/ZainRizvi due to Sorry, but this breaks internal tests because of using functools ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2310445169))
2024-08-26 15:16:17 +00:00
bb67ff2ba7 Migrate Windows bin jobs to runner determinator (#134231)
Update Windows binary workflows to use the runner determinator script.

Closes: pytorch/ci-infra#262

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134231
Approved by: https://github.com/ZainRizvi
2024-08-26 14:56:00 +00:00
27d97b9649 Remove unnecessary test skip (#134250)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134250
Approved by: https://github.com/amjames, https://github.com/janeyx99
2024-08-26 14:34:53 +00:00
be96ccf77c Revert "[CD] fix xpu nightly wheel test env (#134395)" (#134461)
This reverts commit 96738c9d756fbd64e6f2eba67f711d3e18f1630c.

Merged without pytorchmergebot command by mistake

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134461
Approved by: https://github.com/jeanschmidt
2024-08-26 13:40:17 +00:00
96738c9d75 [CD] fix xpu nightly wheel test env (#134395) 2024-08-26 08:53:15 -04:00
1ff226d88c [inductor] support vec for atomic add (#131314)
Depends on https://github.com/pytorch/pytorch/pull/130827 to have correct `index_expr` dtype

Support vec for atomic add by scalar implementation.
TestPlan:
```
python test/inductor/test_cpu_repro.py -k test_scatter_using_atomic_add_vec
```
Generated code for `test_scatter_using_atomic_add_vec`
```
cpp_fused_scatter_0 = async_compile.cpp_pybinding(['const float*', 'const int64_t*', 'const float*', 'float*'], '''
#include "/tmp/torchinductor_root/nn/cnnpkaxivwaa5rzng6qsyc4ao42vschogi3yk33ukwv3emlvxeqq.h"
extern "C"  void kernel(const float* in_ptr0,
                       const int64_t* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0)
{
    {
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(16L); x0+=static_cast<long>(16L))
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x0), 16);
            tmp0.store(out_ptr0 + static_cast<long>(x0));
        }
        #pragma omp simd simdlen(8)
        for(long x0=static_cast<long>(16L); x0<static_cast<long>(25L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = in_ptr0[static_cast<long>(x0)];
            out_ptr0[static_cast<long>(x0)] = tmp0;
        }
    }
    {
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(16L); x0+=static_cast<long>(16L))
        {
            auto tmp0 = at::vec::VectorizedN<int64_t,2>::loadu(in_ptr1 + static_cast<long>(x0), 16);
            auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x0), 16);
            auto tmp1 = 25L;
            auto tmp2 = c10::convert<int64_t>(tmp1);
            auto tmp3 = at::vec::VectorizedN<int64_t,2>(tmp2);
            auto tmp4 = tmp0 + tmp3;
            auto tmp5 = static_cast<int64_t>(0);
            auto tmp6 = at::vec::VectorizedN<int64_t,2>(tmp5);
            auto tmp7 = at::vec::VecMask<int64_t,2>(tmp0 < tmp6);
            auto tmp8 = decltype(tmp4)::blendv(tmp0, tmp4, tmp7.template cast<int64_t,2>());
            auto tmp9 =
            [&]
            {
                __at_align__ std::array<int64_t, 16> tmpbuf;
                tmp8.store(tmpbuf.data());
                return tmpbuf;
            }
            ()
            ;
            auto tmp10 =
            [&]
            {
                __at_align__ std::array<int64_t, 16> tmpbuf;
                #pragma GCC unroll 16
                for (long x0_inner = 0; x0_inner < 16; x0_inner++)
                {
                    tmpbuf[x0_inner] = static_cast<long>(tmp9[x0_inner]);
                }
                return at::vec::VectorizedN<int64_t,2>::loadu(tmpbuf.data(), 16);
            }
            ()
            ;
            TORCH_CHECK((at::vec::VecMask<int64_t,2>((at::vec::VectorizedN<int64_t,2>(0) <= tmp10) & (tmp10 < at::vec::VectorizedN<int64_t,2>(25L)))).all_masked(), "index out of bounds: 0 <= tmp10 < 25L");
            atomic_add_vec(out_ptr0, tmp8, tmp12);
        }
        #pragma omp simd simdlen(8)
        for(long x0=static_cast<long>(16L); x0<static_cast<long>(20L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = in_ptr1[static_cast<long>(x0)];
            auto tmp9 = in_ptr2[static_cast<long>(x0)];
            auto tmp1 = 25L;
            auto tmp2 = c10::convert<int64_t>(tmp1);
            auto tmp3 = decltype(tmp0)(tmp0 + tmp2);
            auto tmp4 = tmp0 < 0;
            auto tmp5 = tmp4 ? tmp3 : tmp0;
            auto tmp6 = tmp5;
            auto tmp7 = c10::convert<int64_t>(tmp6);
            TORCH_CHECK((0 <= tmp7) & (tmp7 < 25L), "index out of bounds: 0 <= tmp7 < 25L");
            atomic_add(&out_ptr0[static_cast<long>(tmp5)], static_cast<float>(tmp9));
        }
    }
}
''')
```
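
For readers less familiar with the generated C++, the scalar tail loop above can be read roughly as the Python sketch below. `scatter_add_scalar` is a hypothetical name used only for illustration; the size of 25 and the negative-index wrapping mirror the kernel, but this is not Inductor code.

```python
def scatter_add_scalar(out, index, src):
    """Accumulate src[i] into out[index[i]], wrapping negative indices."""
    size = len(out)
    for i, raw in enumerate(index):
        # Negative indices count from the end, like `tmp0 < 0 ? tmp0 + 25 : tmp0`.
        idx = raw + size if raw < 0 else raw
        # Mirrors the TORCH_CHECK bounds check in the kernel.
        if not 0 <= idx < size:
            raise IndexError(f"index out of bounds: 0 <= {idx} < {size}")
        # Duplicate indices must accumulate, which is why the kernel uses atomic_add.
        out[idx] += src[i]
    return out


out = scatter_add_scalar([0.0] * 25, [3, -1, 3], [1.0, 2.0, 4.0])
```

Both writes to index 3 are kept, and `-1` lands in the last slot.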

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131314
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2024-08-26 10:36:51 +00:00
bf5c7bf06d [FR] Fix the bug in FR script (e.g., checking all ranks dump check) (#134383)
The ranks were somehow being converted to strings, which made the ranks check fail. This fix converts them all to int.
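
A minimal sketch of this kind of failure, assuming the ranks are parsed from text and compared against integer ranks. `all_ranks_dumped` and `ranks_from_filenames` are hypothetical names, not the FR script's actual code.

```python
def all_ranks_dumped(expected_world_size, dumped_ranks):
    # Normalize every rank to int so that "3" and 3 compare equal.
    seen = {int(r) for r in dumped_ranks}
    return seen == set(range(expected_world_size))


ranks_from_filenames = ["0", "1", "2", "3"]  # parsed from text, so they are strings

# Comparing strings against ints never matches, so the check fails.
naive_ok = set(ranks_from_filenames) == set(range(4))
# After normalizing to int, the same data passes.
fixed_ok = all_ranks_dumped(4, ranks_from_filenames)
```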

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134383
Approved by: https://github.com/c-p-i-o
2024-08-26 08:21:14 +00:00
92c4771853 fix stuck floordiv (#134150)
Summary: Fixes https://github.com/pytorch/pytorch/issues/134133

Test Plan:
Tested on the small repro in the linked issue with different lengths N (replacing 100), recording N vs. time taken in nanoseconds:
10 127268319
20 220839662
30 325463125
40 429259441
50 553136055
60 670799769
70 999170514
80 899014103
90 997168902
100 1168202035
110 1388556619
120 1457488235
130 1609816470
140 2177889877
150 1917560313
160 2121096113
170 2428502334
180 4117450755
190 4003068224

So N ~ 200 takes ~5s. Previously even smaller N would go for >1 min.

Didn't add a perf test because ezyang is planning to build a benchmark.

Also tested on https://www.internalfb.com/diff/D61560171, which now gets past the stuck point.

Differential Revision: D61619660

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134150
Approved by: https://github.com/ezyang
2024-08-26 07:27:59 +00:00
c5f6b72041 [dynamo] simplify implementation for os.fspath (#133801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133801
Approved by: https://github.com/anijain2305
ghstack dependencies: #133769, #133778, #133779, #133771
2024-08-26 07:12:15 +00:00
38f97ec8e3 [pt2] Add meta for poisson (#134103)
Because aten.poisson doesn't have a meta function registered, the op is executed one extra time in eager mode during the compilation phase of torch.compile.

There are more ops without meta registration. Is there any reason for it?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134103
Approved by: https://github.com/ezyang
2024-08-26 06:14:38 +00:00
ed86ac2f25 [BE] typing for decorators - fx/_compatibility (#134054)
Summary: See #131429

Test Plan: unit tests pass

Differential Revision: D61493706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134054
Approved by: https://github.com/oulgen
2024-08-26 04:00:27 +00:00
7b6b10417d Remove ansi escape chars in assertExpectedInline and add options to skip comments and to skip empty lines (#134248)
I had a nightmare rewriting tests in test_misc.py, specifically:
1. Graphs can have comments that refer to my files ("/lsakka/.."). We really don't care about comments, so add an option to ignore them.
2. Empty lines added when EXPECTTEST_ACCEPT=1 are changed by the linter, causing the tests or the linter to fail!
Add a flag to ignore empty lines.
3. EXPECTTEST_ACCEPT fails when the text has some unreadable characters. Those should not affect string comparison, and they also cause weird diffs when tests fail. I removed ansi_escape chars https://github.com/pytorch/pytorch/pull/133045
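
The three normalizations can be sketched as below. This is an illustration only: `normalize`, its parameter names, and the regular expression are assumptions and may differ from what the PR adds to `assertExpectedInline`.

```python
import re

# Matches common CSI sequences such as "\x1b[31m"; not an exhaustive ANSI pattern.
ANSI_ESCAPE = re.compile(r"\x1b\[[0-9;]*[A-Za-z]")


def normalize(text, skip_comments=False, skip_empty_lines=False):
    text = ANSI_ESCAPE.sub("", text)
    lines = text.split("\n")
    if skip_comments:
        # Drop lines that are only a comment (e.g. file paths in graph dumps).
        lines = [line for line in lines if not line.lstrip().startswith("#")]
    if skip_empty_lines:
        lines = [line for line in lines if line.strip()]
    return "\n".join(lines)


# Hypothetical graph text with color codes, a path comment, and an empty line.
actual = "\x1b[31mx = 1\x1b[0m\n# File: /home/user/a.py\n\ny = 2"
cleaned = normalize(actual, skip_comments=True, skip_empty_lines=True)
```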

this is used in

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134248
Approved by: https://github.com/aorenste
ghstack dependencies: #133639, #134364
2024-08-26 02:03:44 +00:00
2ec149cd3e [inductor] fix test_functional_call_sequential_params_and_buffers expectation on Windows (#134394)
The actual output of this UT differs between Windows and Linux by only one empty line (between `linear` and `add`); the rest of the content is correct.
Reproduce UTs:
```cmd
pytest test\dynamo\test_higher_order_ops.py -v -k test_functional_call_sequential_params_and_buffers
```

We can add `empty_line_normalizer` to fix it.

```cmd
______________________________________________________________________________________________ FuncTorchHigherOrderOpTests.test_functional_call_sequential_params_and_buffers _______________________________________________________________________________________________
Traceback (most recent call last):
  File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py", line 3676, in test_functional_call_sequential_params_and_buffers
    self.assertExpectedInline(
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 2871, in assertExpectedInline
    return super().assertExpectedInline(actual if isinstance(actual, str) else str(actual), expect, skip + 1)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\expecttest\__init__.py", line 271, in assertExpectedInline
    self.assertMultiLineEqualMaybeCppStack(expect, actual, msg=help_text)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\expecttest\__init__.py", line 292, in assertMultiLineEqualMaybeCppStack
    self.assertMultiLineEqual(expect, actual, *args, **kwargs)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 1226, in assertMultiLineEqual
    self.fail(self._formatMessage(msg, standardMsg))
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 675, in fail
    raise self.failureException(msg)
AssertionError: 'clas[509 chars]one\n        add: "f32[1, 1]" = linear + l_buf[69 chars],)\n' != 'clas[509 chars]one\n\n        add: "f32[1, 1]" = linear + l_b[71 chars],)\n'
  class GraphModule(torch.nn.Module):
      def forward(self, L_params_l1_weight_: "f32[1, 1]", L_params_l1_bias_: "f32[1]", L_buffers_buffer_: "f32[1]", L_inputs_: "f32[1, 1]"):
          l_params_l1_weight_ = L_params_l1_weight_
          l_params_l1_bias_ = L_params_l1_bias_
          l_buffers_buffer_ = L_buffers_buffer_
          l_inputs_ = L_inputs_

          linear: "f32[1, 1]" = torch._C._nn.linear(l_inputs_, l_params_l1_weight_, l_params_l1_bias_);  l_inputs_ = l_params_l1_weight_ = l_params_l1_bias_ = None
+ <<<< (difference is here )
          add: "f32[1, 1]" = linear + l_buffers_buffer_;  linear = l_buffers_buffer_ = None
          return (add,)
 : To accept the new output, re-run test with envvar EXPECTTEST_ACCEPT=1 (we recommend staging/committing your changes before doing this)

To execute this test, run the following from the base repo dir:
    python test\dynamo\test_higher_order_ops.py FuncTorchHigherOrderOpTests.test_functional_call_sequential_params_and_buffers

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
========================================================================================================================== short test summary info ==========================================================================================================================
FAILED [0.4275s] test/dynamo/test_higher_order_ops.py::FuncTorchHigherOrderOpTests::test_functional_call_sequential_params_and_buffers - AssertionError: 'clas[509 chars]one\n        add: "f32[1, 1]" = linear + l_buf[69 chars],)\n' != 'clas[509 chars]one\n\n        add: "f32[1, 1]" = linear + l_b[71 chars],)\n'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134394
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2024-08-26 01:41:20 +00:00
7af38eb98b Fix unexpected inference_mode interaction with torch.autograd.functional.jacobian (#130307)
Fixes #128264

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130307
Approved by: https://github.com/soulitzer
2024-08-25 22:14:02 +00:00
dc1959e6a7 [inductor] calibration inductor windows uts (7/N) (#134420)
Disable UTs on Windows: `test/dynamo/test_misc.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134420
Approved by: https://github.com/jansel
2024-08-25 20:39:54 +00:00
97fd087cdb [inductor] calibration inductor windows uts (6/N) (#134419)
Disable UTs for Windows: `test/dynamo/test_aot_autograd_cache.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134419
Approved by: https://github.com/jansel
2024-08-25 20:39:34 +00:00
b5dd60fa75 Fix namespace issues with qnnpack (#134336)
After this I think all `using namespace` will have been eliminated from PyTorch header files. Internally, `-Wheader-hygiene` will prevent more from being added.

Test Plan: Sandcastle

Differential Revision: D61679037

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134336
Approved by: https://github.com/Skylion007
2024-08-25 19:50:01 +00:00
7940f2428f [torch/package_importer] add compatibility name mapping (#134376)
Summary:
This enables patching extern modules to provide compatibility with serialized code depending on different versions of those extern modules.

The main motivation is to enable the Numpy upgrade. In a recent release many aliases of builtin types were deprecated and removed [1]. This breaks loading pickled modules that reference the removed aliases. While the proper solution is to re-generate the pickled modules, that's not always feasible.

This proposes a way to define a mapping to a new type for a module member. The member is only set if it's not present in the loaded module, which removes the need to check for exact versions.
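
The "only set if not present" rule can be sketched as follows. `apply_compat_mapping` and the fake module are hypothetical; the real importer API may look different.

```python
import types


def apply_compat_mapping(module, mapping):
    """Set each missing attribute on `module`; leave existing ones untouched."""
    for name, value in mapping.items():
        if not hasattr(module, name):
            setattr(module, name, value)
    return module


# Stand-in for an extern module whose `int` alias was removed in a newer version.
fake_np = types.ModuleType("fake_numpy")
fake_np.float64 = float

apply_compat_mapping(fake_np, {"int": int, "float64": complex})
```

The missing `int` alias is restored, while the existing `float64` is not overwritten, so no version check is needed.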

https://numpy.org/doc/stable/release/1.20.0-notes.html#using-the-aliases-of-builtin-types-like-np-int-is-deprecated

Differential Revision: D61556888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134376
Approved by: https://github.com/SherlockNoMad
2024-08-25 19:34:46 +00:00
816061843a [Distributed/Profiler] Fix input/output dimension overflow (#134360)
Summary: When using ParamCommsDebugInfo, the input elements and output elements are stored in `int` instead of `int64_t`
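
A small illustration of the overflow, assuming `int` is 32 bits wide (as on common platforms). The element count is made up for the example.

```python
import ctypes

num_elements = 3_000_000_000  # larger than INT32_MAX (2_147_483_647)

# ctypes does no overflow checking, so the value wraps like a C int would.
as_int32 = ctypes.c_int32(num_elements).value
as_int64 = ctypes.c_int64(num_elements).value
```

The 32-bit value wraps to a negative number, while the 64-bit value is preserved.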

Test Plan: Run HTA with new outputted values and make sure overflow does not occur

Reviewed By: fengxizhou

Differential Revision: D61728747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134360
Approved by: https://github.com/fengxizhou, https://github.com/jeanschmidt
2024-08-25 16:25:56 +00:00
eqy
e93ca12c88 [CUDNN][SDPA] Fix unsupported trivial stride-1 transpose case (#134031)
Fixes #134001
The code incorrectly assumed that two contiguous tensors of the same shape would have the same strides.
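
A pure-Python sketch of why the assumption fails: a dimension of size 1 can carry any stride without affecting contiguity. `is_contiguous` here is a simplified check written for this example, not the PyTorch implementation, and the shapes are made up.

```python
def is_contiguous(shape, strides):
    """Row-major contiguity check that ignores the strides of size-1 dims."""
    expected = 1
    for size, stride in zip(reversed(shape), reversed(strides)):
        if size != 1 and stride != expected:
            return False
        expected *= size
    return True


shape = (2, 1, 4)
a_strides = (4, 4, 1)
b_strides = (4, 1, 1)  # differs only in the size-1 dim, e.g. after a trivial transpose

both_contiguous = is_contiguous(shape, a_strides) and is_contiguous(shape, b_strides)
same_strides = a_strides == b_strides
```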

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134031
Approved by: https://github.com/drisspg, https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-08-25 14:31:30 +00:00
08d111250a [ez][c10d] change ERROR to WARNING (#134349)
Summary:
Change error to warning because TCPStore can be torn down during a normal shutdown. It's OK if we're unable to access TCPStore. Should not be an error.

Test Plan:
Ran locally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134349
Approved by: https://github.com/fduwjj, https://github.com/wconstab
2024-08-25 14:22:55 +00:00
4648848696 Revert "[ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438)"
This reverts commit f71c3d265ab52589f983dd252d61461db4e7dbbd.

Reverted https://github.com/pytorch/pytorch/pull/133438 on behalf of https://github.com/jeanschmidt due to seems to have introduced breakages in linux binary builds ([comment](https://github.com/pytorch/pytorch/pull/133438#issuecomment-2308787310))
2024-08-25 11:20:30 +00:00
e5563f7ad7 Revert "[dtensor][MTPG] make sharding prop lru cache not shared among threads (#134294)"
This reverts commit eb15b1a016c6facaf8605dde2c20b5de1586542d.

Reverted https://github.com/pytorch/pytorch/pull/134294 on behalf of https://github.com/jeanschmidt due to seems to have introduced https://github.com/pytorch/pytorch/actions/runs/10537099590/job/29201744658 ([comment](https://github.com/pytorch/pytorch/pull/134294#issuecomment-2308785949))
2024-08-25 11:16:04 +00:00
268092db83 [DeviceMesh] Allow _flatten() to take in an optional mesh_dim_name (#134048)
If a mesh_dim_name is given, we will use the given mesh_dim_name to name the new flattened dim.
Otherwise, the default is a string concatenating the mesh_dim_names of the given submesh with each mesh_dim_name separated by "_".

For example, if we have a 3D mesh DeviceMesh([[[0, 1], [2, 3]], [[4, 5], [6, 7]]], mesh_dim_names=("dp", "cp", "tp")), calling mesh_3d["dp", "cp"]._flatten() will create a 1D submesh DeviceMesh([0, 1, 2, 3], mesh_dim_names=("dp_cp",)) on rank 0, 1, 2, 3 and a 1D submesh DeviceMesh([4, 5, 6, 7], mesh_dim_names=("dp_cp",)) on rank 4, 5, 6, 7.
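
The naming rule alone can be sketched like this. `flattened_dim_name` is a hypothetical helper, not part of DeviceMesh, and `"dp_cp_flat"` is a made-up custom name.

```python
def flattened_dim_name(submesh_dim_names, mesh_dim_name=None):
    # Use the caller-provided name if given.
    if mesh_dim_name is not None:
        return mesh_dim_name
    # Otherwise join the submesh dim names with "_".
    return "_".join(submesh_dim_names)


default_name = flattened_dim_name(("dp", "cp"))
custom_name = flattened_dim_name(("dp", "cp"), mesh_dim_name="dp_cp_flat")
```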

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134048
Approved by: https://github.com/fegin
ghstack dependencies: #133838, #133839
2024-08-25 10:36:01 +00:00
326db8af4c Replace sympy Min/Max with reimplementations (#133319)
Sympy's implementation of Min/Max displays asymptotically bad behavior on `TORCH_COMPILE_CPROFILE=1 python torchrec/distributed/tests/test_pt2_multiprocess.py TestPt2Train.test_compile_multiprocess`. Evidence profile:

![image](https://github.com/user-attachments/assets/142301e9-3a18-4370-b9db-19b32ece7ee8)

On this test case, we spend 42% of all time compiling the network on ShapeEnv.replace, which in turn spends all of its time in xreplace.

The problem appears to be find_localzeros call. By vendoring the implementations of Min/Max, we can potentially reduce the cost of this operation.

The implementation is copy-pasted sympy/functions/elementary/miscellaneous.py but with some adjustments:

* I deleted logic related to differentiation, evalf and heaviside, as it's not relevant to PyTorch reasoning
* There's some massaging to appease PyTorch's linters, including a lot of noqa and type: ignore (which I could potentially refactor away with substantive changes, but that's better as its own change)
* I deleted the second loop iteration for is_connected, as an attempt at initial optimization (this also simplifies the port, since I can omit some code). I'll comment at that point what the exact difference is.

Before this change, the test in question takes 100s with 40 features; after this change, it takes only 69s.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133319
Approved by: https://github.com/Skylion007
2024-08-25 05:05:59 +00:00
8db8ac700d line by line logging (#134298)
Summary:
Today there is no good mechanism to detect progress of non-strict export line-by-line in user code. This caused some pain recently in trying to find the exact line of user code that was triggering a bug where the process appeared stuck because deep down something was calling some symbolic shapes code that was suffering some exponential blowup.

This PR adds an environment variable for extended debugging that will log the line of user code corresponding to every torch function call. It only works in non-strict export for now. Set this environment variable together with `TORCH_LOGS` enabled for `export` logs at `DEBUG` level (i.e., with a `+` prefix), e.g.:

```
TORCHEXPORT_EXTENDED_DEBUG_CURRENT_LOC=1 TORCH_LOGS="+export" ...
```

This will show logs with something like:
```
...
prim::device called at .../example.py:4284 in foo
TensorBase.item called at .../example.py:4277 in bar
...
```

We already have an existing place to intercept torch functions where we process data-dependent errors in non-strict, so parking the logging there. An alternative place we could be doing this is where we add `stack_trace` metadata when generating code, but unfortunately at least the example that motivated this gets stuck before generating code, so that would be too late.

Test Plan: ran it on some sample commands

Differential Revision: D61692156

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134298
Approved by: https://github.com/angelayi
2024-08-25 02:57:11 +00:00
907c32faac [inductor] calibration inductor windows uts (4/N) (#134401)
skip failed UTs of `test/dynamo/test_unspec.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134401
Approved by: https://github.com/ezyang
2024-08-25 00:32:29 +00:00
74ef74be36 [inductor] calibration inductor windows uts (3/N) (#134400)
skip Windows UT of `test/dynamo/test_trace_rules.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134400
Approved by: https://github.com/ezyang
2024-08-24 23:48:50 +00:00
d33d68e326 [Profiler] Add test to make sure FunctionEvents are processed lazily (#134359)
Summary: Create a simple test that checks that the FunctionEvent tree is built lazily by checking that its metrics change before and after the call.

Test Plan: Make sure test passes in CI

Reviewed By: briancoutinho

Differential Revision: D61685429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134359
Approved by: https://github.com/briancoutinho
2024-08-24 23:03:19 +00:00
af4c87953e [inductor] calibration inductor windows uts (5/N) (#134402)
skip UTs of `test/dynamo/test_repros.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134402
Approved by: https://github.com/ezyang
2024-08-24 23:00:11 +00:00
94f92fbd88 Use integer division in arange length calculation when start/end/step are integral (#134296)
Fixes #133338

Test Plan:

```
TORCH_LOGS=dynamic python
import torch

torch._dynamo.config.capture_scalar_outputs = True

@torch.compile()
def f(x):
    y = x.item()
    torch._check_is_size(y)
    r = torch.arange(y, dtype=torch.float32)
    torch._check(r.size(0) == y)
    return r

f(torch.tensor([300]))
```

Before and after diff. Verify the following line

```
I0813 11:05:44.890000 652898 torch/fx/experimental/symbolic_shapes.py:5198] [0/0] runtime_assert Eq(CeilToInt(IntTrueDiv(u0, 1)), u0) [guard added] at aa.py:10 in f (_dynamo/utils.py:2092 in run_node), for more info run with TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED="Eq(CeilToInt(IntTrueDiv(u0, 1)), u0)"
```

no longer shows in the logs. Also verify CI passes.
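
As a rough illustration of the idea (not the PR's actual code), the length of an integral `arange` can be computed with integer ceiling division, with no round trip through a float true division. Both helper names are hypothetical.

```python
import math


def arange_len_float(start, end, step):
    # Float path: true division followed by a ceiling.
    return math.ceil((end - start) / step)


def arange_len_int(start, end, step):
    # Integer ceiling division; valid for positive and negative steps.
    return -((start - end) // step)


small = arange_len_int(0, 10, 3)     # elements 0, 3, 6, 9
negative = arange_len_int(10, 0, -3)  # elements 10, 7, 4, 1
big = 2**53 + 1                       # not exactly representable as a float64
```

For small inputs the two paths agree; for `big` only the integer path returns the exact length.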

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134296
Approved by: https://github.com/aorenste
2024-08-24 21:09:28 +00:00
1a0d00f1f4 [traced-graph][sparse] enable to_dense() for compressed (#133371)
Fixes https://github.com/pytorch/pytorch/issues/133174

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133371
Approved by: https://github.com/ezyang
2024-08-24 20:33:23 +00:00
050aa67e41 [traced-graph][sparse] fix restrictive assert for sparse add (#134037)
Exporting sparse addition can be CPU/Meta. This fixes the overly restrictive assert and adds an export test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134037
Approved by: https://github.com/ezyang
2024-08-24 20:26:47 +00:00
90fb83749e [inductor] fix test torch package working with trace on windows (#134397)
The temporary directory path is currently hard-coded. Fix it by getting the temporary directory path through an API.
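
A minimal sketch of the portable approach using the standard library. The file name is hypothetical, and the test may build its path differently.

```python
import os
import tempfile

# Hard-coded POSIX path: fails on Windows, where /tmp usually does not exist.
hard_coded = "/tmp/torch_package_test.pt"

# Portable: ask the OS for its temporary directory.
portable = os.path.join(tempfile.gettempdir(), "torch_package_test.pt")
```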

Reproduce UTs:
```cmd
python test/dynamo/test_dynamic_shapes.py -v -k test_torch_package_working_with_trace_dynamic_shapes
```

Error message:
```cmd
________________________________________________________________________________________________ DynamicShapesMiscTests.test_torch_package_working_with_trace_dynamic_shapes ________________________________________________________________________________________________
Traceback (most recent call last):
  File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_misc.py", line 7199, in test_torch_package_working_with_trace
    with package.PackageExporter(path) as exp:
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\package\package_exporter.py", line 237, in __init__
    self.zip_file = torch._C.PyTorchFileWriter(f)
RuntimeError: Parent directory /tmp does not exist.

To execute this test, run the following from the base repo dir:
    python test\dynamo\test_dynamic_shapes.py DynamicShapesMiscTests.test_torch_package_working_with_trace_dynamic_shapes

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
========================================================================================================================== short test summary info ==========================================================================================================================
FAILED [0.0080s] test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_package_working_with_trace_dynamic_shapes - RuntimeError: Parent directory /tmp does not exist.
==================================================================================================================== 1 failed, 1665 deselected in 4.00s =====================================================================================================================
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134397
Approved by: https://github.com/ezyang
2024-08-24 20:25:44 +00:00
9cd53b3212 Add Arm copyright line to LICENSE (#133982)
Some historical commits from arm:
- 2021 664126bab5f3f2a275e82b7bde127132cff7f34e
- 2023 2630144786e906b40abbe017294d404bcfe3c6ae
- 2024 ce6130014156fa9555ce3d16c5f9a84cbdadf8f4

See https://github.com/pytorch/pytorch/pull/126687 for initial discussion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133982
Approved by: https://github.com/malfet
2024-08-24 18:41:06 +00:00
50d5aa8c10 Enable optimized dynamic quantization on aarch64 (#126687)
oneDNN+ACL has optimized kernels for s8s8 matmul, so input is signed. This change leaves behaviour on all other platforms the same. This change requires https://github.com/intel/ideep/pull/313 to go in, and oneDNN 3.5 for the optimized kernels. This change speeds up dynamic quantized linear by ~10x.

Also, do you have a policy on copyright headers? Arm's usual policy when contributing to open source projects is to include a copyright header on any file which is modified. Would this be acceptable? If not, is there somewhere else suitable to note copyright?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126687
Approved by: https://github.com/jgong5, https://github.com/malfet, https://github.com/snadampal

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-08-24 18:40:12 +00:00
f71c3d265a [ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133438
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2024-08-24 18:26:49 +00:00
6245d5b87b [CI] Update XPU ci test python version to 3.9 (#134214)
Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134214
Approved by: https://github.com/EikanWang, https://github.com/malfet
2024-08-24 18:11:36 +00:00
a63efee5cd [inductor]Let output or input_as_strided match exact strides (#130956)
Fixes #130394

TorchInductor doesn't respect the original strides of outputs, which opens up optimization opportunities such as changing the memory layout. But in some cases, such as the one in https://github.com/pytorch/pytorch/issues/130394, we do need the output to match the exact strides required. Correctness is the first priority, so this PR adds a new API `ir.ExternKernel.require_exact_strides(x, exact_strides, allow_padding=False)` to fix the issue. This PR makes the strides of non-dense outputs follow the strides required by semantics.

The generated code for the test before and after this fix is compared below.

```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 128
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 8
    x1 = (xindex // 8)
-   x2 = xindex
    tmp0 = tl.load(in_ptr0 + (x0 + (16*x1)), xmask)
    tmp1 = tmp0 + tmp0
-   tl.store(out_ptr0 + (x2), tmp1, xmask)
+   tl.store(out_ptr0 + (x0 + (16*x1)), tmp1, xmask)

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (16, 8), (16, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
-       buf1 = empty_strided_cuda((16, 8), (8, 1), torch.float32)
+       buf1 = empty_strided_cuda((16, 8), (16, 1), torch.float32)
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_copy_0.run(arg0_1, buf1, 128, grid=grid(128), stream=stream0)
        del arg0_1
    return (buf1, )
```

`buf1` is now created with the exact strides the user requires, and its values are written with the same strides as the input.
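
To make the layout difference concrete, here is a minimal pure-Python sketch (not Inductor code) of where each element of the (16, 8) output lands under the two stride choices shown in the diff. The helper name `flat_offset` is hypothetical and exists only for this illustration.

```python
def flat_offset(index, strides):
    """Return the flat storage offset of a multi-dimensional index."""
    return sum(i * s for i, s in zip(index, strides))


shape = (16, 8)
dense_strides = (8, 1)   # layout of buf1 before the fix
exact_strides = (16, 1)  # layout of buf1 after the fix (same as the input)

# Mirror the kernel's index math: x0 = xindex % 8, x1 = xindex // 8.
for xindex in range(shape[0] * shape[1]):
    x0, x1 = xindex % 8, xindex // 8
    # Old store address: out_ptr0 + x2, where x2 = xindex.
    assert flat_offset((x1, x0), dense_strides) == xindex
    # New store address: out_ptr0 + (x0 + 16*x1).
    assert flat_offset((x1, x0), exact_strides) == x0 + 16 * x1

last = (15, 7)
print(flat_offset(last, dense_strides))  # 127
print(flat_offset(last, exact_strides))  # 247
```

With stride (16, 1) the last element sits at offset 247 rather than 127, so a buffer allocated with the dense layout would not match what the caller expects.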

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130956
Approved by: https://github.com/eellison, https://github.com/blaine-rister
2024-08-24 17:04:05 +00:00
cdb9df5efe [dynamo][guards] De-dupe DUPLICATE_INPUT guard (#134354)
Hard to write a test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134354
Approved by: https://github.com/jansel
ghstack dependencies: #134272
2024-08-24 15:17:56 +00:00
d433a603af [BE] use torch.amp.autocast instead of torch.cuda.amp.autocast (#134291)
torch.cuda.amp.autocast / torch.cpu.amp.autocast are deprecated and spew a ton of warnings when these tests run. This PR updates the tests to just use torch.amp.autocast(device).

Note: this uncovers a bug in the test: when `device` is CUDA, it actually shows up as "cuda:0" - so previously, this test was _always_ using `torch.cpu.amp.autocast` even for `cuda` device. This PR fixes this, and uncovers additional bugs in `pinverse` and `linalg.pinv`; `linalg.pinv` was already failing before on CPU, but now the test also catches failures on CUDA, (and this PR adds to the skipped-test list).
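
The string pitfall described in the note can be sketched in plain Python with no PyTorch dependency. `device_type` is a hypothetical helper used only for illustration, and it assumes the test receives the device as a string such as "cuda:0"; it is not the code changed in this PR.

```python
def device_type(device: str) -> str:
    """Strip an optional ':<index>' suffix, e.g. 'cuda:0' -> 'cuda'."""
    return device.split(":", 1)[0]


for dev in ("cpu", "cuda", "cuda:0"):
    # A direct comparison such as `dev == "cuda"` is False for "cuda:0",
    # so code branching on it silently falls through to the CPU path.
    naive = dev == "cuda"
    normalized = device_type(dev) == "cuda"
    print(dev, naive, normalized)
```

Comparing the normalized type rather than the raw string avoids taking the CPU branch for an indexed CUDA device.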
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134291
Approved by: https://github.com/YuqingJ
2024-08-24 15:07:49 +00:00
a1061009c9 [PT2] use statically_known_true in slice_noop (#134270)
Summary:
# context
* when fixing the graph break in _maybe_compute_kjt_to_jt_dict, we encountered this issue P1539489731:
```
[rank0]:   ATTENTION: guard_size_oblivious would fix the error, evaluating expression to False.
[rank0]:   Maybe you need to add guard_size_oblivious to framework code, see doc below for more guidance.
[rank0]:
[rank0]:   Potential framework code culprit (scroll up for full backtrace):
[rank0]:     File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/61f992c26f3f2773/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_inductor/fx_passes/post_grad.py", line 671, in slice_noop
[rank0]:       if start == 0 and end >= 2**63 - 1 and step == 1:
```
* change the condition logic to be compatible with SymInt
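
As a rough illustration of the idea (a toy model, not the Inductor implementation), the sketch below treats a symbolic size as a value range and answers a comparison with True only when it is provable for every value in that range. `SymRange`, `known_eq`, `known_ge`, and `is_noop_slice` are hypothetical names invented for this example.

```python
from dataclasses import dataclass

INT64_MAX = 2**63 - 1


@dataclass(frozen=True)
class SymRange:
    """Toy stand-in for a symbolic int: only its inclusive value range is known."""
    lo: int
    hi: int


def known_eq(x, c):
    """True only if x is provably equal to c."""
    if isinstance(x, SymRange):
        return x.lo == x.hi == c
    return x == c


def known_ge(x, c):
    """True only if x is provably >= c."""
    if isinstance(x, SymRange):
        return x.lo >= c
    return x >= c


def is_noop_slice(start, end, step):
    """Treat the slice as a no-op only when that is statically certain."""
    return known_eq(start, 0) and known_ge(end, INT64_MAX) and known_eq(step, 1)


print(is_noop_slice(0, INT64_MAX, 1))               # True
print(is_noop_slice(0, SymRange(0, INT64_MAX), 1))  # False
```

An undecidable comparison answers False instead of raising, so the pattern is simply skipped when the bounds are symbolic.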

Test Plan:
# commands
* run test
```
TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_v4_mini training.pipeline_type=pt2 2>&1 | tee -a `date +"%Y.%m.%d.%H.%M"`.`sl whereami`.log
```
* tlparse
```
ls -thl /var/tmp/tt | head -9 && tlparse `ls -t /var/tmp/tt/* | head -1`
```

Differential Revision: D61677207

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134270
Approved by: https://github.com/ezyang
2024-08-24 13:58:51 +00:00
ff77c67d16 Use ephemeral runners for linux nightly builds (#134367)
Should be landed with https://github.com/pytorch/test-infra/pull/5590
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134367
Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/seemethere
2024-08-24 12:49:07 +00:00
ff7d94c67e [compiled autograd] fix saved tensor hook firing count (#134361)
The SavedVariable constructor calls the pack hooks. We don't want to call them for the proxy tensor, since it proxies a tensor whose pack hook was already called during forward.

Using the same fix as https://github.com/pytorch/pytorch/pull/123196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134361
Approved by: https://github.com/jansel
ghstack dependencies: #134186, #134200, #134205, #134286, #134290, #134162, #134163
2024-08-24 12:06:36 +00:00
929de1d0d4 Re-enable skipped compiled autograd eager tests (#134163)
Originally disabled in: https://github.com/pytorch/pytorch/pull/131700#discussion_r1727153445, but the failure is no longer in CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134163
Approved by: https://github.com/soulitzer
ghstack dependencies: #134186, #134200, #134205, #134286, #134290, #134162
2024-08-24 12:06:36 +00:00
ad8bdfae1e add compiled_autograd to programmatic set_logs API (#134162)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134162
Approved by: https://github.com/yf225, https://github.com/jansel
ghstack dependencies: #134186, #134200, #134205, #134286, #134290
2024-08-24 12:06:36 +00:00
1431663693 [compiled autograd] finish classifying tests (#134290)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134290
Approved by: https://github.com/yf225
ghstack dependencies: #134186, #134200, #134205, #134286
2024-08-24 12:06:36 +00:00
0b228a2af8 [compiled autograd] match eager behavior for ctx.saved_variables (#134286)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134286
Approved by: https://github.com/jansel
ghstack dependencies: #134186, #134200, #134205
2024-08-24 12:06:36 +00:00
6cc57c64b2 [compiled autograd] match eager behavior for post acc grad hooks (#134205)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134205
Approved by: https://github.com/jansel
ghstack dependencies: #134186, #134200
2024-08-24 12:06:36 +00:00
d7a25e1d8c [compiled autograd] add config patching for certain eager tests (#134200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134200
Approved by: https://github.com/jansel
ghstack dependencies: #134186
2024-08-24 12:06:36 +00:00
0d9208a398 [compiled autograd] match eager behavior for inplace detached activations (#134186)
Fixes `TestAutograd.test_saved_variable_saved_original_inplace_detach` when run under compiled autograd

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134186
Approved by: https://github.com/jansel
2024-08-24 12:06:36 +00:00
ccafc93be5 [AOTI][CPU] Make int8 qlinear work (#134368)
Summary:
This diff will decompose torch.ops._quantized.wrapped_quantized_linear into torch.ops._quantized.wrapped_linear_prepack and torch.ops._quantized.wrapped_quantized_linear_prepacked for AOTI, and added the corresponding impl into shim

The way it works will be similar to what we did previously for fbgemm fp16 dynamic qlinear. We will do constant folding for packed weight during runtime (warm up) to achieve the speed up

Reviewed By: desertfire

Differential Revision: D61396144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134368
Approved by: https://github.com/houseroad
2024-08-24 08:25:25 +00:00
eb15b1a016 [dtensor][MTPG] make sharding prop lru cache not shared among threads (#134294)
**Summary**
Before this PR, the `sharding propagator` was shared among threads. As a result, the cached result of rank 0 was accessible to other ranks (e.g. rank 1), which could lead to wrong DTensor resharding. This PR fixes it by making the cache a thread-local variable, which fixes the `dstack` test (#126493), `inner` (https://github.com/pytorch/pytorch/issues/126852), and `vstack` (https://github.com/pytorch/pytorch/issues/126868). It also fixes `poisson_nll` (https://github.com/pytorch/pytorch/issues/131446) as a by-product.
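
A minimal sketch of the general technique, using only the Python standard library: each thread lazily builds its own `lru_cache`, stored on a `threading.local`, so results cached by one simulated rank are never served to another. The class and method names are hypothetical and do not mirror the DTensor code.

```python
import functools
import threading


class ShardingPropagatorSketch:
    """Toy propagator whose LRU cache is private to each thread."""

    def __init__(self):
        self._local = threading.local()

    def _cache(self):
        # Build the per-thread cache on first use in that thread.
        cache = getattr(self._local, "cache", None)
        if cache is None:
            cache = functools.lru_cache(maxsize=None)(self._propagate_uncached)
            self._local.cache = cache
        return cache

    def _propagate_uncached(self, op_name):
        # Stand-in for real work: the result depends on the calling thread.
        return (op_name, threading.current_thread().name)

    def propagate(self, op_name):
        return self._cache()(op_name)


prop = ShardingPropagatorSketch()
results = {}


def worker(rank):
    results[rank] = prop.propagate("aten.add")


threads = [
    threading.Thread(target=worker, args=(r,), name=f"rank{r}") for r in range(2)
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results.items()))
# [(0, ('aten.add', 'rank0')), (1, ('aten.add', 'rank1'))]
```

With a single shared cache, whichever thread ran first would populate the entry and the other thread would receive that thread's result.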

**Test**
`pytest test/distributed/_tensor/test_dtensor_ops.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134294
Approved by: https://github.com/wz337, https://github.com/awgu
2024-08-24 05:56:45 +00:00
1034f456ef [inductor] fix munge_exc not support windows path (#134348)
Windows file paths use `\` as the delimiter, which is also an escape character. We need to translate every `\` in a path to `/`, as on Linux.
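
The failure mode and the normalization can be sketched as follows. This is an illustration under stated assumptions, not the actual `munge_exc` code: `replace_path_with_basename` is a hypothetical helper, the sample path is shortened, and the `re.escape` call is an extra safeguard added for this sketch.

```python
import posixpath
import re


def replace_path_with_basename(text: str, file: str) -> str:
    """Replace occurrences of `file` in `text` with its basename."""
    # Normalize Windows separators first, as the description above suggests.
    text = text.replace("\\", "/")
    file = file.replace("\\", "/")
    # re.escape keeps any remaining regex metacharacters in the path literal.
    return re.sub(re.escape(file), posixpath.basename(file), text)


win_path = r"D:\xu_git\pytorch\test\dynamo\test_higher_order_ops.py"

# Using the raw path as a regex pattern fails: "\x" starts a hex escape.
try:
    re.compile(win_path)
except re.error as exc:
    print("re.error:", exc)

message = f"Recompiling function fn in {win_path}:2699"
print(replace_path_with_basename(message, win_path))
# Recompiling function fn in test_higher_order_ops.py:2699
```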

Reproduce UT:
```cmd
pytest test\dynamo\test_higher_order_ops.py -v -k test_vmap_grad_vmap_guard_fail
```
Error msg:
```cmd
________________________________________________________________________________________________________ HigherOrderOpVmapGuardTests.test_vmap_grad_vmap_guard_fail _________________________________________________________________________________________________________
Traceback (most recent call last):
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\logging_utils.py", line 89, in test_fn
    fn(self, records)
  File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py", line 2714, in test_vmap_grad_vmap_guard_fail
    munge_exc(record.getMessage()),
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 5252, in munge_exc
    s = re.sub(file, os.path.basename(file), s)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\re.py", line 209, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\re.py", line 303, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_compile.py", line 788, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 955, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 444, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 526, in _parse
    code = _escape(source, this, state)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 370, in _escape
    raise source.error("incomplete escape %s" % escape, len(escape))
re.error: incomplete escape \x at position 2

To execute this test, run the following from the base repo dir:
    python test\dynamo\test_higher_order_ops.py HigherOrderOpVmapGuardTests.test_vmap_grad_vmap_guard_fail

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
--------------------------------------------------------------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------------------------------------------------------------
frames [('total', 2), ('ok', 2)]
inductor []
inline_call []
stats [('calls_captured', 38), ('unique_graphs', 2)]
--------------------------------------------------------------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------------------------------------------------------------
V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles] Recompiling function fn in D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py:2699
V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles]     triggered by the following guard failure(s):
V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles]     - 0/0: torch._functorch.pyfunctorch.compare_functorch_state([('Vmap', 1, 'error')])  # _dynamo\output_graph.py:479 in init_ambient_guards
========================================================================================================================== short test summary info ==========================================================================================================================
FAILED [0.7452s] test/dynamo/test_higher_order_ops.py::HigherOrderOpVmapGuardTests::test_vmap_grad_vmap_guard_fail - re.error: incomplete escape \x at position 2
```
Local test passed:
<img width="860" alt="image" src="https://github.com/user-attachments/assets/90f0d780-0639-4c03-8d7c-6f227c93a3fc">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134348
Approved by: https://github.com/jansel
2024-08-24 05:51:35 +00:00
0694918aeb [export] Temporarily bypass torch_fn in partitioner (#134292)
Summary:
"torch_fn" is not correct for the decomposed add node from batch norm. This is a temporary workaround to bypass torch fn.

For example, for the graph below (test_qat_conv2d_unary graph):
```
graph():
    %conv_weight : [num_users=1] = get_attr[target=conv.weight]
    %bn_weight : [num_users=1] = get_attr[target=bn.weight]
    %bn_bias : [num_users=1] = get_attr[target=bn.bias]
    %bn_running_mean : [num_users=1] = get_attr[target=bn.running_mean]
    %bn_running_var : [num_users=1] = get_attr[target=bn.running_var]
    %bn_num_batches_tracked : [num_users=1] = get_attr[target=bn.num_batches_tracked]
    %x : [num_users=1] = placeholder[target=x]
    %conv2d : [num_users=1] = call_function[target=torch.ops.aten.conv2d.default](args = (%x, %conv_weight, None, [1, 1], [1, 1]), kwargs = {})
    %add_ : [num_users=0] = call_function[target=torch.ops.aten.add_.Tensor](args = (%bn_num_batches_tracked, 1), kwargs = {})
    %batch_norm : [num_users=1] = call_function[target=torch.ops.aten.batch_norm.default](args = (%conv2d, %bn_weight, %bn_bias, %bn_running_mean, %bn_running_var, True, 0.1, 1e-05, True), kwargs = {})
    %relu : [num_users=1] = call_function[target=torch.ops.aten.relu.default](args = (%batch_norm,), kwargs = {})
    %max_pool2d : [num_users=1] = call_function[target=torch.ops.aten.max_pool2d.default](args = (%relu, [3, 3], [3, 3]), kwargs = {})
    return (max_pool2d,)
```

the add_ node has `'torch_fn': ('add__1', 'method_descriptor.add_'),` in its meta.

If we run the line below in `_annotate_qat_conv2d_bn_binary_unary`, we'll have a partition without output nodes.

```
 find_sequential_partitions(
            gm, [torch.nn.Conv2d, torch.nn.BatchNorm2d, operator.add, torch.nn.ReLU]
        )
```

```
partition_list
[
SourcePartition(nodes=[conv_weight, conv2d], source=<class 'torch.nn.modules.conv.Conv2d'>, input_nodes=[x], output_nodes=[conv2d], params=[conv_weight]),

SourcePartition(nodes=[bn_weight, bn_bias, bn_running_mean, bn_running_var, bn_num_batches_tracked, add_, batch_norm], source=<class 'torch.nn.modules.batchnorm.BatchNorm2d'>, input_nodes=[conv2d], output_nodes=[batch_norm], params=[bn_num_batches_tracked, bn_running_var, bn_bias, bn_weight, bn_running_mean]),

SourcePartition(nodes=[add_], source='add_', input_nodes=[bn_num_batches_tracked], output_nodes=[], params=[])
]
```
We should not have the last partition.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_conv2d
```

Differential Revision: D61569049

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134292
Approved by: https://github.com/angelayi
2024-08-24 05:50:18 +00:00
f260cc2edf Enable DTensor sharding propagation of native_layer_norm_backward to more fully accommodate optional args (#133502)
Fixes #133499

### The issue

Testing a variety of TP `requires_grad` patterns (validating maximally flexible finetuning) revealed `DTensor` sharding propagation of `aten.native_layer_norm_backward` (default) fails with an `IndexError` for certain `requires_grad` patterns (pattern 1) (e.g. `output_mask` `[True, False, False]`) and an `AssertionError` for others (pattern 2) (e.g. output mask `[False, True, *]`). Please see issue #133499 for a full description of the observed failure patterns along with reproduction.

### Use Cases and Remediation

Failure pattern 1 is potentially problematic for a variety of finetuning scenarios. Though failure pattern 2 is really an xfail right now since it's not fully supported, IMHO there are use cases (e.g. especially wrt to mechanistic interpretability research, but certain finetuning scenarios too potentially) that justify supporting this output mask (especially since supporting it is fairly straightforward I think).

In this PR I propose some modest changes that:
  * Address the aforementioned failure modes.
  * Add a couple tests that I'm hopeful will help ensure `DTensor` op dispatch (which is so well implemented and such a pleasure working with btw! 🚀 🎉) accommodates a wide variety of (potentially unanticipated) `requires_grad` patterns as it evolves.

To address both failure modes, I'm proposing the following changes:
1. To [`torch.distributed._tensor.ops._math_ops.layer_norm_bwd_strategy`](7b269cc484/torch/distributed/_tensor/ops/_math_ops.py (L873)):
  - Refactor conditional `output_mask` handling such that the input and output specs in the `PlacementStrategy`s of the returned `output_strategy.strategies` list remain aligned with the `op_schema.args_spec` (whose definition does not change at runtime based upon unused optional args).
2. To [`torch.distributed._tensor._sharding_prop.propagate_op_sharding_non_cached`](7b269cc484/torch/distributed/_tensor/_sharding_prop.py (L256-L262)):
  - When iterating through the active `op_schema.args_spec` to build the relevant `expected_input_specs` list, filter any `None` `desired_specs`.
3. To [`torch/distributed/_tensor/_op_schema.OpSchema._inplace_rewrap_schema_suggestion`](7b269cc484/torch/distributed/_tensor/_op_schema.py (L418))
  - When inputs need a redistribute, for runtime-unrequired (`None` arguments in the aligned `suggestion_args_schema`), ignore the associated `suggestion_args_spec`

### Implementation considerations:

- Regarding `1`, to avoid changing the op strategy return args ([`op_strategy`](cf81180007/torch/distributed/_tensor/_sharding_prop.py (L234))), the change in `1` allows `None` elements to exist temporarily in `PlacementStrategy.input_specs` (treating it as `Sequence[DTensorSpec | None] | None` when it's `Sequence[DTensorSpec] | None`). This could be addressed in any number of ways but I thought it best to leave that for a subsequent PR since it could have broader ramifications (e.g. allowing op_strategies to return an `output_strategy.input_specs` mask explicitly, explicitly allowing `None`s in `PlacementStrategy.input_specs`, creating a `Null` DTensorSpec etc.). That's why I'm using an ignore arg-type directive there for now.
- Regarding `2` and `3` above, I don't introspect `op_schema.op._schema.arguments` to verify any `None` arguments are `torch.OptionalType`, leaving adherence to the schema contract the responsibility of the given op. Regarding `2`, I assume any `desired_spec` will be either a `DTensorSpec` or `None`, so only `None` can be Falsy in this context.
- I considered altering the active `args_schema`, which could be inspected and aligned with the active `output_strategy.input_specs` in some cases and avoid the changes in `3`, but I think that would rely on one of (among other possibilities):
    - all supported op signatures having optional Tensors (`DTensorSpec`) args after required tensors (which isn't a planned requirement as far as I know),
    -  (somewhat brittle) heuristic-driven arg alignment
    -  only supporting kwargs etc.

### Added Tests

To facilitate detection of future `requires_grad` pattern op failure modes as `DTensor` evolves, I added the following two tests:

1. `test/distributed/_tensor/test_math_ops.py DistMathOpsTest.test_layer_norm_bwd_req_grad`
    - Tests `native_layer_norm_backward` specifically with 20 subtests that sweep valid `output_mask` patterns along with different LayerNorm dimensionality and `elementwise_affine` configurations.

2. `test/distributed/tensor/parallel/test_tp_examples.py DistTensorParallelExampleTest.test_transformer_req_grad`
    - Samples a subset of `requires_grad` patterns in a more realistic (relative to the `LayerNorm`-specific test) Transformer usage context with different `dtype` and `is_seq_parallel` configurations. Note since there was substantial overlap with the existing `test_transformer_training` test, I took the opportunity to refactor that test to allow relevant code-sharing. I also added an `ExpCommCounts` `NamedTuple` to facilitate the addition of additional `requires_grad` patterns that we may want to test in the future which may result in different comm counts. I created the separate `requires_grad` test to allow decoupling the multi-iteration `test_transformer_training` test and allow addition of new `requires_grad` scenarios as desired while being mindful of resources.

Thanks again to the PyTorch distributed team for your immensely valuable contributions to the open-source ML community!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133502
Approved by: https://github.com/XilunWu
2024-08-24 05:49:54 +00:00
8d3c6494ae [Inductor][FlexAttention] Rename IS_LAST_BLOCK to CHECK_BLOCK_BOUNDARY (#134378)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134378
Approved by: https://github.com/drisspg
2024-08-24 04:40:01 +00:00
5ad759ca33 [inductor] calibration inductor windows uts (2/N) (#134358)
skip unsupported UTs of `test\inductor\test_compile_worker.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134358
Approved by: https://github.com/jansel
2024-08-24 04:08:59 +00:00
5ae9c01794 [DTensor] Add naive replicate strategy for aten._linalg_eigh.default (#134284)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134284
Approved by: https://github.com/awgu
2024-08-24 03:50:05 +00:00
962e1f6ca7 [DTensor] Add aten.any.default,dim,out to linear_reduction_strategy (#134206)
For `aten.any`, we can use `reduce_op="sum"` as the linear reduction op.

When we do `all_reduce` with `reduce_op="sum"` on a bool tensor, if one rank returns `torch.Tensor([True])`, then the reduction result is `torch.Tensor([True])`. Only when all ranks return `torch.Tensor([False])` would the reduction result be `torch.Tensor([False])`. This matches `any`'s behavior.
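
The equivalence can be checked exhaustively in plain Python. This sketch models the boolean sum reduction as "add the 0/1 values, then treat nonzero as True", which is an assumption about how the reduced bool tensor is interpreted; `all_reduce_sum_bool` is a hypothetical stand-in, not a collective call.

```python
import itertools


def all_reduce_sum_bool(per_rank):
    """Sum-reduce per-rank booleans, then cast back to bool (nonzero -> True)."""
    return bool(sum(int(v) for v in per_rank))


# Check every combination of local results across 3 simulated ranks.
for per_rank in itertools.product([False, True], repeat=3):
    assert all_reduce_sum_bool(per_rank) == any(per_rank)

print(all_reduce_sum_bool((False, False, False)))  # False
print(all_reduce_sum_bool((False, True, False)))   # True
```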

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134206
Approved by: https://github.com/tianyu-l, https://github.com/chuanhaozhuge
2024-08-24 03:49:46 +00:00
5d39b14b68 [DeviceMesh] Add DeviceMesh slicing support for flatten mesh dim (#133839)
Add DeviceMesh slicing support such that we could do the following:
```
mesh_3d = init_device_mesh(
    self.device_type, (2, 2, 2), mesh_dim_names=("replicate", "shard", "cp")
)
shard_cp_mesh = mesh_3d["shard", "cp"]._flatten()
hsdp_mesh = mesh_3d["replicate", "shard_cp"]
# we can get the corresponding group of the flatten mesh through

group = shard_cp_mesh.get_group()
# or
group = mesh_3d["shard_cp"].get_group()
# or
mesh_3d.get_group(mesh_dim="shard_cp")
```
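
To see which ranks end up in each flattened group, here is a pure-Python sketch that does not use `DeviceMesh` at all. It assumes ranks 0..7 are laid out in row-major order over the (2, 2, 2) mesh; `flattened_groups` is a hypothetical helper written for this illustration.

```python
import itertools

mesh_shape = (2, 2, 2)
dim_names = ("replicate", "shard", "cp")

# Map each rank to its mesh coordinate, assuming row-major placement.
coords = dict(enumerate(itertools.product(*(range(n) for n in mesh_shape))))


def flattened_groups(flatten_dims):
    """Group ranks that differ only along the dims being flattened."""
    keep = [i for i, name in enumerate(dim_names) if name not in flatten_dims]
    groups = {}
    for rank, coord in coords.items():
        key = tuple(coord[i] for i in keep)
        groups.setdefault(key, []).append(rank)
    return list(groups.values())


print(flattened_groups(("shard", "cp")))  # [[0, 1, 2, 3], [4, 5, 6, 7]]
print(flattened_groups(("replicate",)))   # [[0, 4], [1, 5], [2, 6], [3, 7]]
```

Under that layout, flattening "shard" and "cp" gives one group of four ranks per "replicate" index, which is the shape an HSDP-style setup would shard across.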

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133839
Approved by: https://github.com/fegin
ghstack dependencies: #133838
2024-08-24 03:49:29 +00:00
195abdb85c ppc64le: VSX Support for Inductor (#132746)
### Description

This PR extends the `VecISA` class to include support for VSX on the `ppc64le` architecture within the Inductor backend. This enhancement enables vectorization support, resulting in performance improvements when using `torch.compile()` on `ppc64le`.

### Fixes

- Resolved the `test_acosh_with_negative_large_input` test case in `test_cpu_repro.py` by implementing `acosh` for VSX.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132746
Approved by: https://github.com/jansel
2024-08-24 03:36:09 +00:00
519342962d Pass process group info into NcclWork (#134269)
Summary: Pass process group info into NcclWork

Test Plan: buck2 run mode/dev-nosan kineto/libkineto/fb/integration_tests:pytorch_execution_trace_integration_test

Differential Revision: D61677160

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134269
Approved by: https://github.com/wconstab
2024-08-24 01:04:43 +00:00
e2a87fb1e9 [ONNX] Update exporter logic (#134304)
Sync the exporter logic with torch-onnx at https://github.com/justinchuby/torch-onnx/compare/v0.1.12...v0.1.15.

https://github.com/pytorch/pytorch/issues/129277

- Create a `testing` module to facilitate testing model accuracy. The model is internal
- Improve decomp table
- Improve model verification logic
- Add tests

The next PRs will enable OpInfo tests and clean up existing code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134304
Approved by: https://github.com/titaiwangms
2024-08-24 00:49:54 +00:00
a1d0b4d568 Add option to skip functional passes in the pattern matcher's replacement graph (#134364)
The pattern matcher runs DCE and remove_noop_ops on the replacement
graph by default. Previously we had a switch for the DCE. This PR
changes that switch to also control if we run remove_noop_ops.

The context was that there is silent incorrectness with
auto_functionalized. We use the Pattern matcher to decompose
auto_functionalized into a mutable op + clones; remove_noop_ops were
deleting the clones.

Future: can try #134363

Test Plan:
- new test. I wasn't able to produce a silently incorrect example so I
  settled for asserting that clones still exist in the post-grad graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134364
Approved by: https://github.com/eellison
ghstack dependencies: #133639
2024-08-24 00:38:55 +00:00
2c8fc3f4ce [inductor] Move imports to top of file in generated code (#134195)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134195
Approved by: https://github.com/eellison
ghstack dependencies: #134194
2024-08-24 00:35:57 +00:00
1aa0e35a04 [inductor] Remove dead code in multi_kernel.py (#134194)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134194
Approved by: https://github.com/eellison
2024-08-24 00:35:57 +00:00
4ff1a4dd0f [export] support set_grad_enabled hop in dynamo to enable re-tracing (#134281)
As titled. We added dynamo support for the wrap_with_set_grad_enabled hop to support re-tracing an exported program.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134281
Approved by: https://github.com/tugsbayasgalan
2024-08-24 00:35:53 +00:00
9dc47f5e62 [FlexAttention]Fix how we realize input buffers (#134351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134351
Approved by: https://github.com/Chillee
2024-08-24 00:31:00 +00:00
4c28a0eb0b c10d/logging: add C10D_LOCK_GUARD (#134131)
This adds logs if we can't acquire locks in NCCLUtils and ProcessGroupNCCL for 30s.

This is motivated by some deadlocks we're seeing, where it's unclear if the problem is in NCCL or on the PyTorch side of things.

This required replacing most `std::mutex` with `std::timed_mutex` and `std::condition_variable_any` as appropriate.
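
The same pattern can be sketched in Python, where `threading.Lock.acquire(timeout=...)` plays the role that `std::timed_mutex::try_lock_for` plays in C++. This is an analogy under stated assumptions, not the `C10D_LOCK_GUARD` macro itself; `logged_lock_guard` and the logger name are hypothetical.

```python
import contextlib
import logging
import threading

logger = logging.getLogger("lock_guard_sketch")


@contextlib.contextmanager
def logged_lock_guard(lock, name, timeout_s=30.0):
    """Acquire `lock`, warning each time `timeout_s` passes without success."""
    while not lock.acquire(timeout=timeout_s):
        logger.warning(
            "could not acquire %s within %.2fs; still waiting", name, timeout_s
        )
    try:
        yield
    finally:
        lock.release()


# Demo: the lock is held for about 0.2s, so a 0.05s timeout logs at least once.
messages = []


class _ListHandler(logging.Handler):
    def emit(self, record):
        messages.append(record.getMessage())


logger.addHandler(_ListHandler())
mutex = threading.Lock()
mutex.acquire()
threading.Timer(0.2, mutex.release).start()
with logged_lock_guard(mutex, "mutex", timeout_s=0.05):
    pass
print(len(messages) >= 1)
```

The guard keeps waiting after it logs, so it changes observability rather than locking behavior.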

Test plan:

existing CI for regressions

will add unit tests on `C10D_LOCK_GUARD`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134131
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
2024-08-24 00:27:39 +00:00
e52e93e8fd Update scale-config files with linux.24xlarge.ephemeral (#134380)
Add linux.24xlarge.ephemeral to scale config
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134380
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi
2024-08-24 00:01:39 +00:00
54ff320519 [export] refactor ExportGraphSignature construction (#134059)
Refactors construction of ExportGraphSignature object for export & training IR, explicitly creating AOTAutograd signature for training IR. This will be helpful for upcoming refactors for placeholder naming & runtime asserts prettifying.

Changes:
- dedups `make_argument_spec` call, moved to export/graph_signature.py
- `_sig_to_specs` wrapped into new function `_convert_to_export_graph_signature`, directly converts GraphSignature -> ExportGraphSignature
- `_make_fx_helper` explicitly creates AOTAutograd GraphSignature object
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134059
Approved by: https://github.com/angelayi, https://github.com/ydwu4
2024-08-23 23:29:28 +00:00
aa9f4cc733 [Inductor][CPP] Support vectorization of remainder (#129849)
**Summary**
When checking the vectorization status across 3 test suites, we found that some operators had vectorization disabled with the message `Disabled vectorization: op: remainder`. This PR adds vectorization support for this op.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_remainder
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_int_div_vec
```

Differential Revision: [D61147014](https://our.internmc.facebook.com/intern/diff/D61147014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129849
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-08-23 23:26:51 +00:00
286f2dba9f [2/N refactor NCCLPG error logs][c10d] Make msg in monitoring thread in NCCLPG more accurate and simpler (#134036)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134036
Approved by: https://github.com/wconstab
2024-08-23 23:21:28 +00:00
2cfc2da527 [export] Make move_to_device_pass function public (#134263)
Summary:
This is a follow-up of https://github.com/pytorch/pytorch/pull/133660

Here we make the `move_to_device_pass()` function public so users can call it by `from torch.export.passes import move_to_device_pass`

Test Plan: CI

Differential Revision: D61671310

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134263
Approved by: https://github.com/angelayi
2024-08-23 23:18:30 +00:00
c638a40a93 [Caffe2] Remove unused AVX512 code (#133160)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133160
Approved by: https://github.com/albanD
2024-08-23 23:16:16 +00:00
1f19ccb5b3 [Inductor/Triton] Customize triton codegen to optionally preserve input dtype on tl.load (#132406)
Differential Revision: D60536337

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132406
Approved by: https://github.com/jfix71, https://github.com/blaine-rister
2024-08-23 22:58:43 +00:00
8ff3a5be1b [export] basic auto dynamic shapes (#133620)
Starter version of automatic dynamic shapes for export.

Creates enums `DIM.AUTO`, `DIM.STATIC`, allowing user to specify `AUTO` for dims in dynamic_shapes specs, meaning that corresponding dims are treated as dynamic, and relevant guards will do what's necessary (e.g. refine ValueRanges, set replacements based on equality, or even set static) without raising ConstraintViolationErrors. Basically allows the user to say, "a bunch of these dims can be dynamic, let export do model analysis and return the program with maximum possible dynamism, without complaining".

The usage for specifying `dynamic_shapes` is now:
```
AUTO -> dynamic by default, return whatever produce_guards() says, even if it's static
None/int/STATIC -> static
Dim/DerivedDim -> same as before - will complain if the min/max range is invalid, or if dims related to this are unspecified.
```

Caveat 1: specifying `AUTO` for a dim won't guarantee it'll be dynamic:

- specifying `AUTO` for a dim will return the maximum possible dynamism given your program and other specified constraints, but this can still mean you'll get a static program. For example, with the program below, x is specified dynamic, but it's equal to y, which is specified static, and with how we currently do things we won't promote y to dynamic, but will demote(?) x to static. So this can be surprising if you don't fully know your model, and/or missed one of your other inputs when specifying auto-dynamic shapes.
```
class Foo(torch.nn.Module):
    def forward(self, x, y):
        return x + y
inputs = (torch.randn(6), torch.randn(6))
export(Foo(), inputs, dynamic_shapes={"x": (DIM.AUTO,), "y": None})
```

Caveat 2: specifying `AUTO` and Dims in the same spec is still problematic:

- The way Dims/DerivedDims are currently handled is very strict. A Dim represents a symbol, and we require a user to specify the symbol for all dims governed by the symbol - that's why we've seen errors in the past like `The values of x must always be related to y by ...`, asking the user to specify the exact relation as in the program. We also require the specified min/max range to be a subset of the valid range from model analysis. All this doesn't compose well with specifying `AUTO` just yet - for example in the program below, ideal behavior could be to return a dynamic program, where `dx = x.size(0) = y.size(0)` has range (3,6). Unfortunately this crashes, and correct behavior is to specify `dx` for both inputs. So currently we raise a UserError and crash if both Dims + `AUTO` are present in the spec.
```
class Foo(torch.nn.Module):
    def forward(self, x, y):
        return x + y
inputs = (torch.randn(6), torch.randn(6))
export(Foo(), inputs, dynamic_shapes={"x": (DIM.AUTO,), "y": {0: Dim("dx", min=3, max=6)}})  # this doesn't work, because x & y are related
```

Implementation details:

This is done by setting `assume_static_by_default=False`, and doing a transform on the `dynamic_shapes` spec to preserve semantics. `assume_static_by_default=False` will treat unspecified dims or Nones as dynamic. This is the opposite of what `export.export()` currently does - unspecified Dims/Nones are treated as static. Historically this static-by-default behavior, where the user deals with fewer guards, has been desirable, and we would like to respect that in this implementation. So an internal spec transformation, `_transform_shapes_for_default_dynamic()`, is added to do the spec conversion necessary to be compatible with dynamic by default. Specifically, AUTOs are converted into Nones, and Nones/unspecified dims are filled in with explicitly static constraints.

For example, this would look like, for a 3-d tensor: `{0: DIM.AUTO, 1: None, 2: Dim("dx")} -> {0: None, 1: 32, 2: Dim("dx")}`
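The conversion described above can be sketched in plain Python. This is only an illustration under stated assumptions: the `DIM` and `Dim` classes below are minimal stand-ins, and `transform_spec_for_default_dynamic` is a hypothetical name, not the actual `_transform_shapes_for_default_dynamic()` implementation.

```python
from enum import Enum, auto


class DIM(Enum):
    """Stand-in for the AUTO/STATIC enum described in this commit."""
    AUTO = auto()
    STATIC = auto()


class Dim:
    """Minimal stand-in for a named symbolic dim (not torch.export.Dim)."""
    def __init__(self, name):
        self.name = name


def transform_spec_for_default_dynamic(spec, shape):
    """Rewrite a per-tensor spec for a dynamic-by-default backend.

    AUTO becomes None (dynamic by default), None/STATIC/int/unspecified
    dims become the concrete size (explicitly static), and Dim objects
    are passed through unchanged.
    """
    out = {}
    for i, size in enumerate(shape):
        entry = spec.get(i)
        if entry is DIM.AUTO:
            out[i] = None
        elif entry is None or entry is DIM.STATIC or isinstance(entry, int):
            out[i] = size
        else:
            out[i] = entry
    return out


dx = Dim("dx")
result = transform_spec_for_default_dynamic(
    {0: DIM.AUTO, 1: None, 2: dx}, shape=(8, 32, 16)
)
```

With a tensor of shape `(8, 32, 16)`, dim 0 ends up as `None`, dim 1 as the concrete size `32`, and dim 2 keeps its `Dim`, matching the example in the message.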

This does seem overly complicated, but it's done to preserve dynamic shapes semantics for `torch._dynamo.export()`, which already uses `assume_static_by_default=False`, and follows the same process for generating shape constraints, via `_process_dynamic_shapes`. There the semantics are:
```
None/unspecified: dynamic by default
Dim/DerivedDim: also a strict assertion
```

If we don't care about BC for `_dynamo.export(dynamic_shapes)`, then we can just modify semantics for `_process_dynamic_shapes()` and change all the relevant tests in `test/dynamo/test_export.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133620
Approved by: https://github.com/avikchaudhuri
2024-08-23 22:56:39 +00:00
f5a2a22dc4 [export] Fix unflattener to respect nn.Parameter requires_grad (#134353)
Summary: Fixes P1539870235

Test Plan: CI

Differential Revision: D61726403

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134353
Approved by: https://github.com/pianpwk
2024-08-23 22:49:34 +00:00
eaa2c0e009 Improves error message when passing wrong tensor type to torch.nn.functional.one_hot (#134209)
The function expects a Tensor of type LongTensor. It currently throws the following error: "one_hot is only applicable to index tensor." which, imo, does not provide the user with enough information on what the problem is.

PR simply adds extra information to the error message on this specific scenario.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134209
Approved by: https://github.com/mikaylagawarecki
2024-08-23 22:40:05 +00:00
09a82f3d24 [EZ][BE] Delete references to non-existing AWS_SCCACHE secrets (#134370)
First of all, none of the binary builds should be using sccache for security and reliability reasons (as a distributed cache can become corrupted/compromised), but even if they do, all authentication to AWS services should be done via OIDC

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134370
Approved by: https://github.com/seemethere, https://github.com/atalman
2024-08-23 22:23:48 +00:00
adf0f50cc7 [Compile] Add NEON implementation for bf16->fp32 cast (#134297)
This changes assembly generated for the following routine
```cpp
void bfloat16tofloat(c10::BFloat16* in, float* out) {
        auto tmp0 = at::vec::Vectorized<c10::BFloat16>::loadu(in, 8);
        auto tmp1 = at::vec::convert<float>(tmp0);
        tmp1.store(out);
}
```
from
```asm
bfloat16tofloat(c10::BFloat16*, float*):
0000000000000034        stp     x29, x30, [sp, #-0x10]!
0000000000000038        mov     x29, sp
000000000000003c        sub     x9, sp, #0x90
0000000000000040        and     sp, x9, #0xffffffffffffffe0
0000000000000044        mov     x8, #0x0
0000000000000048        adrp    x9, 0 ; 0x0
000000000000004c        ldr     x9, [x9]
0000000000000050        ldr     x9, [x9]
0000000000000054        str     x9, [sp, #0x88]
0000000000000058        stp     xzr, xzr, [sp, #0x10]
000000000000005c        ldr     q0, [x0]
0000000000000060        str     q0, [sp]
0000000000000064        ldr     q1, [sp, #0x10]
0000000000000068        stp     q0, q1, [sp, #0x20]
000000000000006c        add     x9, sp, #0x40
0000000000000070        add     x10, sp, #0x20
0000000000000074        add     x11, x10, x8
0000000000000078        ldp     d0, d1, [x11]
000000000000007c        shll.4s v0, v0, #16
0000000000000080        shll.4s v1, v1, #16
0000000000000084        stp     q0, q1, [x9], #0x20
0000000000000088        add     x8, x8, #0x10
000000000000008c        cmp     x8, #0x20
0000000000000090        b.ne    0x74
0000000000000094        add     x8, sp, #0x40
0000000000000098        ld1.4s  { v0, v1 }, [x8]
000000000000009c        st1.4s  { v0, v1 }, [x1]
00000000000000a0        ldr     x8, [sp, #0x88]
00000000000000a4        adrp    x9, 0 ; 0x0
00000000000000a8        ldr     x9, [x9]
00000000000000ac        ldr     x9, [x9]
00000000000000b0        cmp     x9, x8
00000000000000b4        b.ne    0xc4
00000000000000b8        mov     sp, x29
00000000000000bc        ldp     x29, x30, [sp], #0x10
00000000000000c0        ret
00000000000000c4        bl      0xc4
```
to
```asm
bfloat16tofloat(c10::BFloat16*, float*):
0000000000000034        ldr     q0, [x0]
0000000000000038        shll.4s v1, v0, #16
000000000000003c        shll2.4s        v2, v0, #16
0000000000000040        st1.4s  { v1, v2 }, [x1]
0000000000000044        ret
```

As a result, this speeds up `python3 torchchat.py generate stories110M --num-samples 3 --compile --device cpu --dtype bfloat16` from 33 to 90 tokens/sec
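The `shll ... #16` instructions in the optimized assembly work because a bfloat16 value is the upper 16 bits of the corresponding float32. A minimal Python sketch of that bit-level relationship (the helper names are hypothetical and only illustrate the conversion, not the vectorized kernel):

```python
import struct


def bf16_bits_to_float(bits):
    """Widen 16 bfloat16 bits to float32 by shifting left 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def float_to_bf16_bits_truncate(x):
    """Narrow float32 to bfloat16 bits by dropping the low 16 bits.

    Truncation only; real conversions usually round to nearest even.
    """
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16


one = bf16_bits_to_float(0x3F80)        # bit pattern of 1.0 in bfloat16
approx_pi = bf16_bits_to_float(0x4049)  # closest bf16 below pi by truncation
```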

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134297
Approved by: https://github.com/kimishpatel
2024-08-23 22:22:59 +00:00
69813dbbfd [export] Schematize nn_module_stack serialization (#134049)
`nn_module_stack` was previously serialized to string by adding commas between the module_path and module_type. This is error-prone when the `nn_module_stack` itself contains commas.

This PR fixes this by creating a dictionary to store the `nn_module_stack` and serialize it to string via `json.dumps()`
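A small sketch of why the comma-joined form is fragile and why `json.dumps()` is not. The stack layout, the `;` separator, and the helper names here are assumptions made for illustration; they are not the exact serialized schema.

```python
import json

# Hypothetical stack entry whose module type contains a comma.
stack = {
    "L__self___layers_0": ("layers.0", "my_pkg.Block[int, float]"),
}


def serialize_with_commas(stack):
    """Old-style idea: join key, module path and module type with commas."""
    return ";".join(f"{k},{path},{ty}" for k, (path, ty) in stack.items())


def deserialize_with_commas(text):
    out = {}
    for item in text.split(";"):
        key, path, ty = item.split(",")  # breaks if any field has a comma
        out[key] = (path, ty)
    return out


def roundtrip_json(stack):
    """JSON escapes and delimits fields itself, so commas are harmless."""
    loaded = json.loads(json.dumps(stack))
    return {k: tuple(v) for k, v in loaded.items()}


try:
    deserialize_with_commas(serialize_with_commas(stack))
    comma_format_ok = True
except ValueError:
    comma_format_ok = False
```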

Fixes #131941

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134049
Approved by: https://github.com/angelayi
2024-08-23 21:50:01 +00:00
78d69bfe11 [SymmetricMemory] introduce multicast support, multimem_all_reduce_ and multimem_one_shot_all_reduce (#133424)
### Summary
- Added multicast support to SymmetricMemory. If the CUDA runtime and CUDA driver have multicast support, SymmetricMemory associates all peer buffers with a multicast object and exposes the multicast virtual address.
- Implemented `multimem_all_reduce_` and `multimem_one_shot_all_reduce` based on the multicast support. The two variants show different performance characteristics for different message sizes. We plan to use Inductor for collective algo selection (and required symmetric memory buffer allocation).

### Benchmark

8xH100 (non-standard version with HBM2e at 650W). NVSwitch V3 with NVLS support.

![image](https://github.com/user-attachments/assets/4998a16b-c2c0-4797-9dd0-1da2303df947)

![image](https://github.com/user-attachments/assets/278ad361-52cb-4864-82c6-bb67e8d0a3fe)

Differential Revision: [D61682507](https://our.internmc.facebook.com/intern/diff/D61682507)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133424
Approved by: https://github.com/yf225, https://github.com/weifengpy
2024-08-23 20:09:20 +00:00
2ca7f0fc5c [Minimizer] for sequential mode, respect find_all setting (#134339)
Summary: Currently, for sequential mode, minimizer search terminates after a node is excluded via the user defined exclusion_fn. However, on some occasions we would like the search to continue past that for the remaining nodes. In this diff I am changing the termination criteria to respect the find_all setting, where we continue sequential search if it is set.

Test Plan: CI

Differential Revision: D61720262

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134339
Approved by: https://github.com/jfix71
2024-08-23 19:59:43 +00:00
58e2cf364b Make DTensor sharding propagation for scaled_dot_product_efficient_attention and scaled_dot_product_flash_attention more conservatively cached (#134146)
Fixes #134050

### The issue

The current `DTensor` sharding propagation caching policy for  `aten.scaled_dot_product_efficient_attention` (default) can result in silently incorrect gradients or trigger an IMA after cuda kernel launch in mixed `require_grad` configurations. Please see issue #134050 for a full description of the observed failure patterns along with reproduction. Note `aten.scaled_dot_product_flash_attention` presents a similar concern so this PR addresses both [as discussed here.](https://github.com/pytorch/pytorch/issues/134050#issuecomment-2299887602)

### Remediation

While there are a number of ways this could be addressed, the most straightforward remediation is to modify the sharding propagation caching policy of [`aten._scaled_dot_product_efficient_attention.default`](b03381cac2/torch/distributed/_tensor/ops/_matrix_ops.py (L337-L340)), registering it with `schema_info=RuntimeSchemaInfo(4)` to prevent cache sharing between differing `compute_log_sumexp` values i.e.

```python
@register_op_strategy(aten._scaled_dot_product_efficient_attention.default, schema_info=RuntimeSchemaInfo(4))
def scaled_dot_product_efficient_attention_strategy(
...
```

[As discussed here](https://github.com/pytorch/pytorch/issues/134050#issuecomment-2299887602),  since `aten::_scaled_dot_product_flash_attention` could be affected by a similar issue wrt `return_debug_mask`, this PR adjusts the sharding propagation caching policy for that op as well.
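To illustrate the failure mode, here is a toy model of a strategy cache whose key either ignores or includes a non-tensor argument. It is not DTensor's implementation: `make_cached_strategy`, `sdpa_strategy`, and the tuple-as-shape convention are hypothetical, and `static_argnum` is only assumed to mean "hash positional arguments from this index onward".

```python
def make_cached_strategy(strategy_fn, static_argnum=None):
    """Cache strategy results; tensor args are modeled as shape tuples."""
    cache = {}

    def cached(*args):
        # Default policy: key only on tensor-like args (shape tuples).
        key = tuple(a for a in args if isinstance(a, tuple))
        if static_argnum is not None:
            # Stricter policy: also key on args from static_argnum onward.
            key += tuple(args[static_argnum:])
        if key not in cache:
            cache[key] = strategy_fn(*args)
        return cache[key]

    return cached


def sdpa_strategy(q, k, v, attn_bias, compute_log_sumexp):
    """Pretend the output layout depends on compute_log_sumexp."""
    return "with_logsumexp" if compute_log_sumexp else "without_logsumexp"


shape = (2, 8, 128, 64)
loose = make_cached_strategy(sdpa_strategy)
strict = make_cached_strategy(sdpa_strategy, static_argnum=4)

loose_first = loose(shape, shape, shape, None, True)
loose_second = loose(shape, shape, shape, None, False)    # stale cache hit
strict_first = strict(shape, shape, shape, None, True)
strict_second = strict(shape, shape, shape, None, False)  # separate entry
```

In this toy, the loose cache returns the result computed for `compute_log_sumexp=True` even when called with `False`, which is the kind of cache sharing the change is meant to prevent.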

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134146
Approved by: https://github.com/tianyu-l
2024-08-23 19:43:30 +00:00
157de30f53 [sparse] Update cuSPARSELt to v0.6.2 (#134022)
Summary:

This PR updated cuSPARSELt to v0.6.2. I think we should land
https://github.com/pytorch/pytorch/pull/128534 first though.

Most of this PR is just enabling tests to run when cuSPARSELt v0.6.2 is
available.

Unfortunately was running into a bug with fp32 support on Hopper, so I
removed fp32 support from the cuSPARSELt backend. I think this should be
fine since almost everybody uses the bfloat/float16/int8 kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134022
Approved by: https://github.com/jerryzh168, https://github.com/malfet
ghstack dependencies: #128534
2024-08-23 19:34:53 +00:00
74a9001ada [aoti] Add additional custom op input type support (#132454)
Summary:
Added support for more custom op input types, now only missing dtype,
layout, memory format as input type, since we need to add some more testing for
mapping the types to their integer values
([previous
comment](https://github.com/pytorch/pytorch/pull/126215#discussion_r1617428066)).

This PR also replaces the `DynamicArg` struct's `serialized_arg_val` with
`list_item_types`, which stores an optional list of strings, where each string
represents the type of the value within this list. This is only used for
parsing lists of optional tensors, where we need to know if a specific value in
the list should be a tensor, or a None. Replacing with a list of strings is
also better than storing the actual json format because then we don't need to
parse the json string during the runtime, and can just loop over a preprocessed
list of strings.
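A rough sketch of how a preprocessed list of type strings could drive the reconstruction of a list of optional tensors. The tag values `"as_tensor"` and `"as_none"` and the function name are assumptions for illustration, not the actual AOTI runtime code.

```python
def rebuild_optional_tensor_list(list_item_types, tensors):
    """Interleave real tensors and None according to per-item type tags.

    list_item_types: e.g. ["as_tensor", "as_none", "as_tensor"]
    tensors: the tensors that are actually present, in order
    """
    remaining = iter(tensors)
    result = []
    for item_type in list_item_types:
        if item_type == "as_tensor":
            result.append(next(remaining))
        elif item_type == "as_none":
            result.append(None)
        else:
            raise ValueError(f"unexpected list item type: {item_type}")
    return result


# Strings stand in for tensor handles here.
rebuilt = rebuild_optional_tensor_list(
    ["as_tensor", "as_none", "as_tensor"], ["t0", "t1"]
)
```

Looping over plain strings like this avoids parsing a JSON payload on every call, which is the motivation given in the message.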

Test Plan: `buck2 run @//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r "test_custom_"`

Reviewed By: desertfire

Differential Revision: D60295995

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132454
Approved by: https://github.com/desertfire
2024-08-23 19:11:36 +00:00
f8fbfe5846 Always emit end events even on failure, use thread local storage for stack (#134279)
Summary:
We should always emit an end event in a finally block so that if a unit test or job fails, the stack is still correct.

Also, we use thread local storage for the stack, so that in multithreaded scenarios the stack will still be correctly added.

Test Plan:
Run benchmark and see that everything still works
Run
```
TORCH_LOGS=dynamo buck run test/functorch:test_aotdispatch -- -r test_backward_mutation_on_grad_out
```
With some extra logging to see that start events with the correct stack are emitted, and the end events are also emitted even though the test fails at runtime.

Differential Revision: D61682556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134279
Approved by: https://github.com/aorenste
2024-08-23 18:13:13 +00:00
a23d86c178 [hop] ban creating hop by directly instantiating HigherOrderOperator. (#133645)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133645
Approved by: https://github.com/zou3519
2024-08-23 17:28:02 +00:00
3546628a2a Allow mp.start_processes to create processes in parallel (#133707)
Summary:
Background discussion in https://fb.workplace.com/groups/319878845696681/posts/1226087421742481

and pytorch issue filed https://github.com/pytorch/pytorch/issues/133010

One way to fix this problem is to add an option to start processes in parallel on the PyTorch side.

Test Plan: Tested aps run in problem and things are in parallel now (next diff)

Differential Revision: D61301603

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133707
Approved by: https://github.com/d4l3k, https://github.com/ezyang
2024-08-23 17:11:20 +00:00
afd081c9d4 [inductor] Fix needs_fixed_stride_order silent incorrectness (#133639)
Fixes #128084

The approach is option 2 of what Elias suggested in the comment
thread:
- We require tensors to have the correct stride at usage. This may
  involve a clone; if there was a clone and then a mutation into it
  then we copy_ back the result of the mutation.

I went with this approach because it was the easiest and
Inductor already works really hard to remove additional clones/copy_.

There are some cases that this doesn't generate efficient code for; for
example, if the tensor is a view, we don't change the base of the view
to have the right stride order, instead we do a clone.
The view case isn't very common so I'm ignoring it for now but we could
improve this in the future.

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133639
Approved by: https://github.com/eellison
2024-08-23 17:07:58 +00:00
2553278bae .github/merge_rules.yaml: added multiprocessing to Distributed (#134262)
This allows the Distributed team to approve changes to torch.multiprocessing which is used by torchelastic/run.

Example PR: https://github.com/pytorch/pytorch/pull/133707

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134262
Approved by: https://github.com/wconstab, https://github.com/PaliC
2024-08-23 17:07:20 +00:00
6eae569546 [dynamo][fix] always use POSIX-style path in trace_rule.py (#133987)
We are hardcoding some paths as POSIX-style strings. This leads to different results on Windows. This PR forces all paths to be in POSIX style.
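As a generic illustration of normalizing to POSIX style (not the actual change in `trace_rule.py`), `pathlib` can convert a Windows-style path string so that it compares equal to a hardcoded POSIX-style one. The helper name and the example path are hypothetical.

```python
from pathlib import PureWindowsPath


def to_posix(path_str):
    """Return the path with forward slashes, whatever separators it used.

    PureWindowsPath accepts both '\\' and '/' as separators, so this works
    for strings produced on either platform. Caveat: on POSIX a backslash
    is a legal filename character, so this assumes no such filenames.
    """
    return PureWindowsPath(path_str).as_posix()


windows_style = to_posix("torch\\_dynamo\\utils.py")
posix_style = to_posix("torch/_dynamo/utils.py")
```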

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133987
Approved by: https://github.com/jansel
2024-08-23 16:28:57 +00:00
2eef749b31 [Inductor][FlexAttention] Fix IS_DIVISIBLE bug and add unit tests (#134055)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134055
Approved by: https://github.com/Chillee
2024-08-23 16:11:09 +00:00
8ae4f82243 [aotd] Support HOP effects in backward (#132638)
Support of effectful operations in backward:

1/ AOTD collects metadata from the forward fn only, so effectful ops can be used in backward that were not used in forward => allow token discovery during the joint function.

FunctionalTensorMode holds _tokens, in Joint function after tracing forward we memoize _tokens as `_tokens_forward_output`.

2/ Tokens are added as primals inputs (forward) in EffectTokensWrapper.
Tokens that will be used in backward are in partitioner saved values. We do not have control on which positions they are saved in forward outputs.

2/ If new tokens discovered in backward after tracing joint_fn, the result graph will be manually added in the end of primals.
_aot_autograd/utils.py

3/ All effectful ops during backward are marked with the 'must_be_in_backward' partitioner_tag, to prevent the partitioner from placing them in forward.

For that functional_tensor_mode got new optional state `self._effects_partitioner_tag` for effectful ops, to set after tracing forward.

There are additional changes in partitioner to improve functionality of 'must_be_in_backward'

4/ Unlift tokens now should run for both forward and backward.
- As saved for backward tokens are placed on non static places - we identify input and output tokens to erase, by input and output of `with_effects` operation
- In forward we can have input tokens, discovered in backward, that are not used in with_effects ops in forward, but saved for backward. We identify them by position in forward inputs.

5/ Adding aot debug logging for graphs before unlifting and before adding additional primal for backward tokens.

Tests:
```
python test/higher_order_ops/test_with_effects.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132638
Approved by: https://github.com/bdhirsh
2024-08-23 15:30:58 +00:00
7fd3b69886 Revert "[dynamo][super] Improve handling of getattr on super (#134039)"
This reverts commit 1da3a049dac3c78554506d5ef9ede55b7c2b774d.

Reverted https://github.com/pytorch/pytorch/pull/134039 on behalf of https://github.com/jeanschmidt due to broke internal torchrec signals, see [D61670727](https://www.internalfb.com/diff/D61670727) ([comment](https://github.com/pytorch/pytorch/pull/134039#issuecomment-2307151643))
2024-08-23 13:57:04 +00:00
09127b096c Revert "[inductor] Fix needs_fixed_stride_order silent incorrectness (#133639)"
This reverts commit 8604c0a150b12e0ba3f9a6faaf52498370f21368.

Reverted https://github.com/pytorch/pytorch/pull/133639 on behalf of https://github.com/jeanschmidt due to Broke internal fbgemm signals, see [D61670495](https://www.internalfb.com/diff/D61670495) ([comment](https://github.com/pytorch/pytorch/pull/133639#issuecomment-2307133060))
2024-08-23 13:48:04 +00:00
75c22dd8bf Revert "[dynamo][fix] always use POSIX-style path in trace_rule.py (#133987)"
This reverts commit b23779ef0af8d4f06e667da460c43d264359f1f0.

Reverted https://github.com/pytorch/pytorch/pull/133987 on behalf of https://github.com/albanD due to This breaks windows trunk jobs ([comment](https://github.com/pytorch/pytorch/pull/133987#issuecomment-2306956764))
2024-08-23 12:08:56 +00:00
0e49b2f18e [dynamo][itertools] support itertools.tee (#133771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133771
Approved by: https://github.com/jansel
ghstack dependencies: #133769, #133778, #133779
2024-08-23 10:13:12 +00:00
8d90392fb0 [dynamo] simplify implementation for builtins.sum (#133779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133779
Approved by: https://github.com/jansel
ghstack dependencies: #133769, #133778
2024-08-23 10:10:19 +00:00
6c0b15e382 [dynamo] simplify implementation for functools.reduce (#133778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133778
Approved by: https://github.com/jansel
ghstack dependencies: #133769
2024-08-23 09:10:44 +00:00
cc3a76edba [dynamo] simplify polyfill registration for builtins.all and builtins.any (#133769)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133769
Approved by: https://github.com/jansel
2024-08-23 09:05:24 +00:00
ca3f48dd5b [XPU] Set make triton install pre-built whl by default (#130313)
Now the user could install the pre-built `triton` for xpu by calling the following:

```Bash
export USE_XPU=1
make triton
```

[Dev Only]: If the user wishes to build it from the source, one could set an additional flag:

```Bash
export TRITON_XPU_BUILD_FROM_SOURCE=1
export USE_XPU=1
make triton
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130313
Approved by: https://github.com/chuanqi129, https://github.com/EikanWang, https://github.com/atalman
2024-08-23 07:36:34 +00:00
55cdcef0f7 [fp8 rowwise] Work around CUDA Invalid Memory Access bug (#134227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134227
Approved by: https://github.com/drisspg, https://github.com/eqy
ghstack dependencies: #134223, #134224, #134225, #134226
2024-08-23 07:27:55 +00:00
9d81767d43 [fp8 rowwise] Rework dispatch logic (#134226)
It's likely a matter of opinion, but I find this new version to have less duplication, even if it might have more boilerplate.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134226
Approved by: https://github.com/drisspg
ghstack dependencies: #134223, #134224, #134225
2024-08-23 07:27:55 +00:00
0afb4872aa [fp8 rowwise] Support non-contiguous inputs and clarify checks (#134225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134225
Approved by: https://github.com/drisspg
ghstack dependencies: #134223, #134224
2024-08-23 07:27:52 +00:00
9f8d3f511f [fp8 rowwise] Some clean-up (#134224)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134224
Approved by: https://github.com/drisspg
ghstack dependencies: #134223
2024-08-23 07:27:48 +00:00
2f198605ac [fp8 rowwise] Simplify epilogue visitor tree via common blocks (#134223)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134223
Approved by: https://github.com/drisspg
2024-08-23 07:27:41 +00:00
25b2e46573 [dynamo] add max iterator limit while inlining generators (#134233)
Related:

- #133879

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134233
Approved by: https://github.com/jansel
2024-08-23 07:03:31 +00:00
673b9bd561 [WIP] [Inductor UT] Reuse inductor UT for intel GPU test/inductor/test_multi_kernel.py (#133943)
[Inductor UT] Reuse Inductor test case for Intel GPU.
Reuse `test/inductor/test_multi_kernel.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133943
Approved by: https://github.com/EikanWang, https://github.com/jansel

Co-authored-by: Justin Chu <justinchu@microsoft.com>
Co-authored-by: Jesse Cai <jcjessecai@gmail.com>
Co-authored-by: Sahdev Zala <spzala@us.ibm.com>
Co-authored-by: rzou <zou3519@gmail.com>
Co-authored-by: FFFrog <ljw1101.vip@gmail.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: yanbing-j <yanbing.jiang@intel.com>
Co-authored-by: Will Feng <yf225@cornell.edu>
Co-authored-by: Bin Bao <binbao@meta.com>
Co-authored-by: Yiming Zhou <yimingzhou@meta.com>
Co-authored-by: Yanbo Liang <ybliang8@gmail.com>
2024-08-23 05:52:29 +00:00
80846caa8c [inductor] fix dynamic size array(vla) build error on msvc v4 (#134221)
MSVC doesn't support dynamic-size arrays (VLA).
Ref: https://stackoverflow.com/questions/56555406/creating-dynamic-sized-array-using-msvc-c-compiler

We tried three solutions:
1. Use `std::vector` instead, in the previous PR: https://github.com/pytorch/pytorch/pull/134140, but it changed the variable's type and failed in UTs.
2. Use `std::unique_ptr` instead, in PR: https://github.com/pytorch/pytorch/pull/134156. @jansel reviewed and commented: https://github.com/pytorch/pytorch/pull/134156#pullrequestreview-2253091693. That makes sense: allocating memory may make the code run slower.
3. Use a fixed-size array instead, in PR: https://github.com/pytorch/pytorch/pull/134210. A fixed size cannot easily handle the situation where the reserved size is smaller than the number of CPUs.
> a. Limiting it with the min() function failed in local tests: https://github.com/pytorch/pytorch/pull/134210#issuecomment-2304447729
> b. Dynamically selecting a fixed-size or dynamic array: https://github.com/pytorch/pytorch/pull/134210#issuecomment-2304128666 . It makes the code too complex to maintain.

After discussing with the original PR (https://github.com/pytorch/pytorch/pull/115620) author @zhuhaozhe, we think:
1. MSVC is the only compiler that does not support VLA.
2. MSVC has worse performance than other compilers, so use `std::unique_ptr` for MSVC to make it work.
3. For other compilers, keep using the current `VLA` code.
4. Windows users can use `clang-cl` or `icx` to get better performance than MSVC.
5. After discussing with @jansel, we need to move the compiler check to the Python side to make the output code cleaner.
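A minimal sketch of what "compiler check on the Python side" could look like: the code generator picks one declaration per compiler instead of emitting preprocessor branches into the kernel. The function name and the exact emitted strings are hypothetical and are not Inductor's actual codegen.

```python
def acc_array_decl(ctype, name, size_expr, is_msvc):
    """Return the C++ declaration for a per-thread accumulator array.

    MSVC has no VLA support, so a heap-backed std::unique_ptr is emitted
    for it; other compilers keep the stack-allocated VLA.
    """
    if is_msvc:
        return f"auto {name} = std::make_unique<{ctype}[]>({size_expr});"
    return f"{ctype} {name}[{size_expr}];"


msvc_line = acc_array_decl("float", "tmp_acc0_arr", "max_threads", is_msvc=True)
gcc_line = acc_array_decl("float", "tmp_acc0_arr", "max_threads", is_msvc=False)
```

Both forms support `tmp_acc0_arr[tid]` indexing, so in this sketch the rest of the generated loop body would not need to change.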

Reproduce UT:
```cmd
pytest test/inductor/test_cpu_repro.py -v -k test_reduction_with_dynamic_threads
```

Error msg:
```cmd
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(13): error C2131: expression did not evaluate to a constant
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(13): note: failure was caused by a read of a variable outside its lifetime
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(13): note: see usage of 'max_threads'
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(16): error C3863: array type 'float [max_threads]' is not assignable
```
Generated code:
```c++

#include "C:/Users/Xuhan/AppData/Local/Temp/tmpt6mxcjzi/j2/cj22tgrdgh42wbunl7gdptg2lintcziox2kmr7rdbcc6n2njrhgx.h"
extern "C" __declspec(dllexport) void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       float* out_ptr0,
                       float* out_ptr1)
{
    {
        {
            float tmp_acc0 = 0;
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
            int max_threads = omp_get_max_threads();
            float tmp_acc0_arr[max_threads];
            for (int tid = 0; tid < max_threads; tid++)
            {
                tmp_acc0_arr[tid] = 0;
            }
            at::vec::Vectorized<float> tmp_acc0_vec_arr[max_threads];
            for (int tid = 0; tid < max_threads; tid++)
            {
                tmp_acc0_vec_arr[tid] = at::vec::Vectorized<float>(0);
            }
            #pragma omp parallel
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134221
Approved by: https://github.com/zhuhaozhe, https://github.com/jansel
2024-08-23 05:40:08 +00:00
49b9f2d8b0 [inductor] fix signbit build fail on Windows. (#134229)
Reproduce UT:
```cmd
pytest test/inductor/test_torchinductor.py -v -k test_randint_int64_mod_cpu
```

Error message:
```cmd
cl : Command line warning D9025 : overriding '/openmp' with '/openmp:experimental'
c6airoloxwj4prmlejdyo5ybp43xa2yo5rbnpk4ttw3oifu6noor.cpp
C:/Users/Xuhan/AppData/Local/Temp/tmpx1fj2bd4/6a/c6airoloxwj4prmlejdyo5ybp43xa2yo5rbnpk4ttw3oifu6noor.cpp(23): error C2668: 'signbit': ambiguous call to overloaded function
C:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt\corecrt_math.h(309): note: could be 'bool signbit(float) noexcept'
C:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt\corecrt_math.h(314): note: or       'bool signbit(double) noexcept'
C:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt\corecrt_math.h(319): note: or       'bool signbit(long double) noexcept'
C:/Users/Xuhan/AppData/Local/Temp/tmpx1fj2bd4/6a/c6airoloxwj4prmlejdyo5ybp43xa2yo5rbnpk4ttw3oifu6noor.cpp(23): note: while trying to match the argument list '(__int64)'
C:/Users/Xuhan/AppData/Local/Temp/tmpx1fj2bd4/6a/c6airoloxwj4prmlejdyo5ybp43xa2yo5rbnpk4ttw3oifu6noor.cpp(24): error C2668: 'signbit': ambiguous call to overloaded function
C:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt\corecrt_math.h(309): note: could be 'bool signbit(float) noexcept'
C:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt\corecrt_math.h(314): note: or       'bool signbit(double) noexcept'
C:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt\corecrt_math.h(319): note: or       'bool signbit(long double) noexcept'
C:/Users/Xuhan/AppData/Local/Temp/tmpx1fj2bd4/6a/c6airoloxwj4prmlejdyo5ybp43xa2yo5rbnpk4ttw3oifu6noor.cpp(24): note: while trying to match the argument list '(int64_t)'
```

Generated code:
```c++

#include "C:/Users/Xuhan/AppData/Local/Temp/tmpcjnxnvkl/4f/c4ff4q4pxgo3yprbo2nkfopkt3qgi6rmptfpgpl2iylgtunvizwn.h"
extern "C" __declspec(dllexport) void kernel(const int64_t* in_ptr0,
                       int64_t* out_ptr0)
{
    #pragma omp parallel num_threads(8)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for
            for(int64_t x0=static_cast<int64_t>(0LL); x0<static_cast<int64_t>(20LL); x0+=static_cast<int64_t>(1LL))
            {
                auto tmp0 = in_ptr0[static_cast<int64_t>(0LL)];
                auto tmp1 = x0;
                auto tmp2 = c10::convert<int32_t>(tmp1);
                auto tmp3 = static_cast<int64_t>(-5);
                auto tmp4 = static_cast<int64_t>(5);
                auto tmp5 = randint64_cpu(tmp0, tmp2, tmp3, tmp4);
                auto tmp6 = static_cast<int64_t>(10);
                auto tmp7 = mod(tmp5, tmp6);
                auto tmp8 = static_cast<int32_t>(0);
                auto tmp9 = tmp7 != tmp8;
                auto tmp10 = std::signbit(tmp7);
                auto tmp11 = std::signbit(tmp6);
                auto tmp12 = tmp10 != tmp11;
                auto tmp13 = tmp9 & tmp12;
                auto tmp14 = decltype(tmp7)(tmp7 + tmp6);
                auto tmp15 = tmp13 ? tmp14 : tmp7;
                out_ptr0[static_cast<int64_t>(x0)] = tmp15;
            }
        }
    }
}
```
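For context, the `signbit` calls in the generated kernel appear to be part of converting a C-style remainder into a Python-style modulo: when the remainder is non-zero and its sign differs from the divisor's, the divisor is added. A pure-Python sketch of that logic, assuming `mod` in the kernel truncates toward zero as C/C++ `%` does (the helper names are hypothetical):

```python
def c_remainder(a, b):
    """Emulate C/C++ integer '%', whose quotient truncates toward zero."""
    q = abs(a) // abs(b)
    if (a < 0) != (b < 0):
        q = -q
    return a - q * b


def python_style_mod(a, b):
    """Adjust the C remainder so the result takes the sign of the divisor."""
    r = c_remainder(a, b)
    if r != 0 and ((r < 0) != (b < 0)):  # sign mismatch, like signbit() != signbit()
        r += b
    return r


example = python_style_mod(-3, 10)
```

The sign test only needs integer comparisons here, which is why calling a floating-point `signbit` overload on an `int64_t` is what trips MSVC's overload resolution.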

Fixed by casting the argument of `std::signbit` to `long double`: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/signbit?view=msvc-170

Local test passed:
<img width="848" alt="image" src="https://github.com/user-attachments/assets/e4467256-a068-40ef-a6ff-19b442e9116d">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134229
Approved by: https://github.com/jansel
2024-08-23 05:40:05 +00:00
311af3b988 Add new ops wrapped_linear_prepack and wrapped_quantized_linear_prepacked (#134232)
Summary:
This diff adds two new operators torch.ops._quantized.wrapped_linear_prepack and torch.ops._quantized.wrapped_quantized_linear_prepacked. It is a decomposition of the op torch.ops._quantized.wrapped_quantized_linear added in the previous diff.

We decomposed it this way because the packed weight can be computed early, so we don't need to do it in every forward in AOTI

Reviewed By: jerryzh168

Differential Revision: D61395887

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134232
Approved by: https://github.com/houseroad
2024-08-23 04:54:26 +00:00
b23779ef0a [dynamo][fix] always use POSIX-style path in trace_rule.py (#133987)
We are hardcoding some paths as POSIX-style strings. This leads to different results on Windows. This PR forces all paths to be in POSIX style.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133987
Approved by: https://github.com/jansel
2024-08-23 04:33:05 +00:00
a699bd1155 [dynamo] Cache _dynamo.disable results (#134272)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134272
Approved by: https://github.com/yf225, https://github.com/jansel
2024-08-23 04:20:50 +00:00
b454c51060 remove dynamic_dim (#134211)
Summary: As promised in https://github.com/pytorch/pytorch/pull/134045.

Test Plan: existing

Differential Revision: D61646937

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134211
Approved by: https://github.com/angelayi
2024-08-23 04:13:03 +00:00
058302494c [AOTI][Tooling] Add a test case where config.debug_intermediate_value_printer=True to check codegen (#133326)
Summary:
As title.

Add a test case in test_aot_inductor to check for codegen (i.e. `aoti_torch_print_tensor_handle` is inserted as expected for debugging printer) for both cpu and cuda based on a simple `addmm` test model.

Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_codegen_abi_compatible_{cuda/cpu}
```

Differential Revision: D61169068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133326
Approved by: https://github.com/ColinPeppler
2024-08-23 02:12:21 +00:00
d2c60749ac [Inductor][FlexAttention] Respect user's input kernel_options (#134065)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134065
Approved by: https://github.com/Chillee
2024-08-23 01:21:05 +00:00
8301add833 [4/N] Further refactor FR script to make it more modulized (#134196)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134196
Approved by: https://github.com/c-p-i-o
2024-08-23 01:15:29 +00:00
bcfc560aea [Profiler/CPU] Add Test for Dynamic Activity Toggling [4/n] (#134149)
Summary: Add tests that check function events for dynamic activity toggling for both GPU and CPU events. Also added comments from previous GH comments

Test Plan: Make sure all tests pass

Differential Revision: D61617514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134149
Approved by: https://github.com/aaronenyeshi
2024-08-23 01:13:42 +00:00
bf5addb613 [FlexAttention] Enable different qk and v head-dims (#134043)
# Summary
Adds the option for the head dims to be different between QK and V tensors.

Fixes issue: https://github.com/pytorch/pytorch/issues/133674

V_DIM > QK_DIM is blocked by landing: https://github.com/triton-lang/triton/pull/4138 / https://github.com/triton-lang/triton/pull/4540

Into PyTorch's triton branch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134043
Approved by: https://github.com/Chillee
2024-08-23 01:06:57 +00:00
7c93c4f8cf [CI][dashboard] Change aarch64 perf run (#134265)
Summary: Reduce the aarch64 dashboard run to only test the default config, until we solve the timeout issue. Also increase the frequency from nightly to 6 times a day, to see if we can reproduce the perf instability Nikita has observed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134265
Approved by: https://github.com/malfet
2024-08-23 00:40:28 +00:00
b3821f1da1 [dynamo][guards][logs] Generate code_parts for debugging (#134181)
Fixes https://github.com/pytorch/pytorch/issues/132692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134181
Approved by: https://github.com/youkaichao, https://github.com/jansel
ghstack dependencies: #133742, #134016, #134039
2024-08-22 23:40:37 +00:00
edbadc904b Do not broadcast uniqueId during a split (#133962)
When using split, we do not need to exchange the NCCL uniqueID at all.
This would avoid connecting to the TCPStore on each split operation.
@exported-using-ghexport

Differential Revision: [D60966980](https://our.internmc.facebook.com/intern/diff/D60966980/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133962
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #133960, #133961
2024-08-22 23:23:32 +00:00
b2eb0e8c6a docker: Use miniforge, install from pip (#134274)
Switch the pytorch package to install from our download.pytorch.org sources, which are better maintained.

Also switch the miniconda installation over to miniforge, to preserve backwards compatibility for users who expect the conda package manager to be installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134274
Approved by: https://github.com/malfet, https://github.com/atalman

Co-authored-by: atalman <atalman@fb.com>
2024-08-22 23:20:22 +00:00
30d7e7a1cd [XPU] Fix patch for old llvm package error for triton xpu (#134204)
Fixes #134199

The PR #133694 works around the problem by replacing the string `"https://tritonlang.blob.core.windows.net/llvm-builds/"` with `"https://oaitriton.blob.core.windows.net/public/llvm-builds/"` in `triton/python/setup.py`. However, in a [newer version of Triton](06e6799f4e), the URL has already been changed to `"https://oaitriton.blob.core....` and does not need to be replaced. Previously, this case threw a runtime error.

This PR makes the `check_and_replace` logic tolerate that scenario, so both the old link and the newer link work.
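
The tolerant behavior described above can be sketched as follows. The function name comes from the PR description, but the signature, the constants, and the error message are assumptions made for illustration:

```python
OLD_URL = "https://tritonlang.blob.core.windows.net/llvm-builds/"
NEW_URL = "https://oaitriton.blob.core.windows.net/public/llvm-builds/"


def check_and_replace(text: str, old: str = OLD_URL, new: str = NEW_URL) -> str:
    """Rewrite `old` to `new`, tolerating input that already uses `new`."""
    if old in text:
        return text.replace(old, new)
    if new in text:
        # Newer Triton already points at the new location: nothing to do.
        return text
    # Neither form is present, so the file is not what we expected.
    raise RuntimeError(f"neither {old!r} nor {new!r} found")
```

Only the case where neither URL appears is treated as an error, which matches the statement that both the old and the newer link are acceptable.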

Also note that `.ci/docker/common/install_triton.sh` does not need the fix, because its `sed` command has no effect when the pattern is absent.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134204
Approved by: https://github.com/chuanqi129, https://github.com/EikanWang, https://github.com/atalman
2024-08-22 23:18:44 +00:00
629bd6f718 Update FlexAttention with masking semantic (#133373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133373
Approved by: https://github.com/yanboliang
2024-08-22 22:50:33 +00:00
e7929809f3 [c10d][ez] Add comments to CudaEventCache class (#134172)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134172
Approved by: https://github.com/d4l3k, https://github.com/kwen2501
2024-08-22 22:44:12 +00:00
b319fa3fd9 [ONNX] Opt into ruff fmt (#134120)
Add ONNX directory to use ruff format.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134120
Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007
2024-08-22 22:44:03 +00:00
25499de814 Remove ncclIdToCommMap_. (#133961)
There is no purpose for this map structure, and it is incorrect in
some cases.  For example, when the uniqueID is not broadcasted to the
other processes.
@exported-using-ghexport

Differential Revision: [D60966882](https://our.internmc.facebook.com/intern/diff/D60966882/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133961
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #133960
2024-08-22 22:06:25 +00:00
b0cf287b46 [export][training ir migration] Fix getitem not exist (#134259)
Summary:
Make quantization tests compatible with the new training IR.

With the new batch norm node `torch.ops.aten.batch_norm.default`, we don't need an additional getitem node after the bn node, so tests need to be fixed to not check for the getitem node.

We added a capture_pre_autograd_graph_using_training_ir() function, which returns True when we are using the training ir, and False otherwise. This way, the code supports both training ir and the old ir.

For now, we are just rolling out the training ir for fbcode internal tests.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_preserve_source_fn_stack
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_update_shared_qspec
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_conv2d
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_conv_bn_relu_fusion

buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_conv_bn_fusion
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_conv_bn_fusion_literal_args
```

Reviewed By: andrewor14, tugsbayasgalan

Differential Revision: D61292102

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134259
Approved by: https://github.com/tugsbayasgalan
2024-08-22 22:00:14 +00:00
f0ba309d78 [CI][dashboard] Add jemalloc back for aarch64 (#134189)
Forward fix based on https://github.com/pytorch/pytorch/pull/133997#discussion_r1726004220
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134189
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-08-22 21:08:39 +00:00
1b6bbaa016 Remove PMI dependencies in PyTorch (#133960)
This patch makes two changes:
1. Whenever ncclCommSplit accepts groupRanks in its config, we should
populate it.  This is independent of using PMI or not.  For example,
non-PMI NCCL can also use this information, if it chooses to.
2. Provide a user flag to decide when to do a uniqueId broadcast and
when to skip it.  This is a performance optimization, and not a
correctness requirement.  If the user forgets to set this, we will
do the uniqueId broadcast, which is wasteful (because it will be
ignored by NCCL), but not incorrect.
@exported-using-ghexport

Differential Revision: [D60966774](https://our.internmc.facebook.com/intern/diff/D60966774/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133960
Approved by: https://github.com/shuqiangzhang
2024-08-22 20:34:43 +00:00
ff61f55387 [Dynamo][autograd.Function] Supports ctx.set_materialize_grads (#133978)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133978
Approved by: https://github.com/zou3519
2024-08-22 20:06:17 +00:00
5633773188 Convert various jobs to be Linux Foundation fleet compatible (#134128)
Migrates a batch of workflows over to LF
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134128
Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt
2024-08-22 19:23:07 +00:00
0eb9c870fd [reland][ROCm] TunableOp for gemm_and_bias (#128919)
Reland of #128143 but added `alpha` and `bias` initialization to `launchTunableGemmAndBias`

Thus far TunableOp was implemented for gemm, bgemm, and scaled_mm. gemm_and_bias was notably missing. This PR closes that gap.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128919
Approved by: https://github.com/malfet
2024-08-22 18:27:50 +00:00
978c5a80a0 [export][training ir migration] fix batch norm pattern match in quantization (#134157)
Summary:
In the new training ir, we produce `torch.ops.aten.batch_norm.default` instead of `torch.ops.aten._native_batch_norm_legit.default` or `torch.ops.aten._native_batch_norm_legit_no_training.default`.

So we need to change the pattern match to accommodate the new op.

- Add `torch.ops.aten.batch_norm.default` to pattern matcher list so it's identified as a batch norm node
- `torch.ops.aten.batch_norm.default` doesn't have a getitem user anymore, so when removing the batch norm node, we need to do `bn_node.replace_all_uses_with(conv_node)` instead of `getitem_node.replace_all_uses_with(conv_node)`

The behavior of capture_pre_autograd_graph is consistent for each run.

If the run is a fbcode test, then capture_pre_autograd_graph uses training IR. This means both _get_aten_graph_module_for_pattern and  replace_pattern_with_filters see the same training IR.

If the run is not a fbcode test, then both would see the old IR.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_conv2d_binary2
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_conv2d_unary
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_linear_unary
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_dynamic_quant_linear
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_dynamic_quant_linear
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_flatten_recipe
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_linear_unary
```

Reviewed By: andrewor14, tugsbayasgalan

Differential Revision: D61291077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134157
Approved by: https://github.com/tugsbayasgalan
2024-08-22 18:25:45 +00:00
fee677eeb6 [fbode-testing][dynamo][reland][inline-inbuilt-nn-modules] Mark attri… (#134136)
Shuai wants to test this internally before https://github.com/pytorch/pytorch/pull/133713 can go in. Creating a separate PR for ghmport.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134136
Approved by: https://github.com/yanboliang
2024-08-22 17:54:58 +00:00
8f7d66f0c3 Enable dynamic rollout for Linux binary workflows (#131472)
Enables dynamic migration of jobs to the LF AWS account for binary workflows.

The new runners are only given to people specified in this issue: pytorch/test-infra#5132

This closes pytorch/ci-infra#251.

Depends-On: pytorch/pytorch#132870
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131472
Approved by: https://github.com/ZainRizvi
2024-08-22 17:12:50 +00:00
d95aedf5fd [BE] typing for decorators - fx/_compatibility (part 1) (#134202)
Part of #134054.

This corresponds to the pytorch mypy changes from D61493706. Updating takes so
long and touches so many files that it's impossible to land as a whole without conflicting with some other intermediate change.
So landing these 'type: ignore' for pytorch in advance of them actually being needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134202
Approved by: https://github.com/Skylion007
2024-08-22 17:07:33 +00:00
44fa9f991c [NJT] add aten.to.dtype support (#134164)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134164
Approved by: https://github.com/davidberard98
2024-08-22 16:59:38 +00:00
b6abac68ec [BE][dynamo] reorganize polyfill module hierarchy (#133977)
Changes:

1. Move `polyfill.py` -> `polyfills/__init__.py`. It can be used as `polyfill.xxx` -> `polyfills.xxx`.
2. Move submodule loading from `polyfills/__init__.py` to `polyfills/loader.py`.

Merge the `polyfill.py` module and the `polyfills/` package. Each polyfill module has its own namespace for better code organization.

The ultimate goal is to make `polyfills/__init__.py` empty and move every polyfill function into its own namespace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133977
Approved by: https://github.com/jansel
2024-08-22 16:42:29 +00:00
c95ddd4bf2 [dynamo] ensure polyfill function has the same signature as the original function in substitute_in_graph (#133813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133813
Approved by: https://github.com/jansel
2024-08-22 16:38:06 +00:00
240467adfe [fx] Implement deepcopy for Proxy (#133706)
Summary: When deep-copying a Proxy, we first try the default deepcopy behavior.

Test Plan: buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r  proxy_deepcopy

Differential Revision: D61398418

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133706
Approved by: https://github.com/angelayi
2024-08-22 16:37:30 +00:00
b0171c3920 Revert "[ONNX] Opt into ruff fmt (#134120)"
This reverts commit 0870398fa8c3e097640f31cb8a8e2e2d3e522d33.

Reverted https://github.com/pytorch/pytorch/pull/134120 on behalf of https://github.com/albanD due to Breaks main branch lint ([comment](https://github.com/pytorch/pytorch/pull/134120#issuecomment-2305089756))
2024-08-22 15:48:14 +00:00
828ab84e19 Improve error msg on _lazy_init() error (#134159)
Reviewed By: hanzlfs

Differential Revision: D61627609

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134159
Approved by: https://github.com/hanzlfs
2024-08-22 15:10:50 +00:00
3c5485fb7f [Retry] Log chromium events to scuba (#134118)
Summary:
This diff implements a bunch of views for internal scuba viewing.

TODOS that I might punt to another diff:
- Saving cache stats via counter is definitely sus here, but there's not really a good way to track "fx graph cache hit for this compile phase" right now. Will think about this more.
- We should definitely log frame id, compile id, etc
- We should definitely be logging configs. That way, we can A/B test based on whether a config is turned on.
- idk what I'm doing with compile_uuid yet, but it's useful when you want to look at samples for a single run. I think if we had mast job info this field is not needed, but it's nice to be able to drill down to a single run and get its chrome trace view or icicle view, so idk

Test Plan:
All of the above views are run with nanogpt benchmark:

```
buck run mode/opt caffe2/benchmarks/dynamo:torchbench -- --training --backend=inductor --only nanogpt --performance
```

Differential Revision: D61603243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134118
Approved by: https://github.com/oulgen
2024-08-22 14:59:45 +00:00
1b10a5c652 Allow SymInts and SymFloats as other in div_softmax_pattern (#133989)
Fixes https://github.com/pytorch/pytorch/issues/133759

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133989
Approved by: https://github.com/ezyang
2024-08-22 14:36:01 +00:00
afc2615d33 Add proper casting to fuse_linear_bn_weights (#134105)
As per title, this PR adds proper casting to fuse_linear_bn_weights in the same style as the conv case above. This previously caused numerical issues on my end, so that is why I am fixing it.
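
For reference, the folding that this function performs can be sketched in plain Python. This is a minimal illustration of the standard eval-mode formula, not the PyTorch implementation; it uses Python floats throughout, so the dtype casting the PR adds does not appear here, and all names are hypothetical:

```python
import math


def fuse_linear_bn(w, b, mean, var, gamma, beta, eps=1e-5):
    """Fold an eval-mode batch norm into the preceding linear layer.

    w is a list of rows (out_features x in_features); every other argument
    is a per-output-feature list.
    """
    scale = [g / math.sqrt(v + eps) for g, v in zip(gamma, var)]
    fused_w = [[s * x for x in row] for s, row in zip(scale, w)]
    fused_b = [(bi - m) * s + bt for bi, m, s, bt in zip(b, mean, scale, beta)]
    return fused_w, fused_b


def linear(w, b, x):
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def batch_norm(y, mean, var, gamma, beta, eps=1e-5):
    return [(yi - m) / math.sqrt(v + eps) * g + bt
            for yi, m, v, g, bt in zip(y, mean, var, gamma, beta)]
```

With real tensors the scale is typically computed from the batch norm statistics, which may have a different dtype than the linear weights, so the fused results have to be cast back to the linear layer's dtypes.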

Also cleans up the docstring.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134105
Approved by: https://github.com/mikaylagawarecki
2024-08-22 14:26:12 +00:00
b459ca78eb [NJT]Add unit tests that cover the internal use cases using new NJT API (#133513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133513
Approved by: https://github.com/davidberard98, https://github.com/soulitzer
2024-08-22 13:54:40 +00:00
1a7e8e5780 Revert "Update FlexAttention with masking semantic (#133373)"
This reverts commit 5a7b544e5c3e37bea62c6a231f6230c004a33d38.

Reverted https://github.com/pytorch/pytorch/pull/133373 on behalf of https://github.com/jeanschmidt due to Broke internal test/inductor signals, see D61611729 ([comment](https://github.com/pytorch/pytorch/pull/133373#issuecomment-2304714503))
2024-08-22 13:47:26 +00:00
88c973005d Revert "[FlexAttention] Enable different qk and v head-dims (#134043)"
This reverts commit e847b6bb9ba281b0db83fcdd79c328252403e9e8.

Reverted https://github.com/pytorch/pytorch/pull/134043 on behalf of https://github.com/jeanschmidt due to Need to revert, in order to be able to revert https://github.com/pytorch/pytorch/pull/133373, feel free to reland this after solving conflicts ([comment](https://github.com/pytorch/pytorch/pull/134043#issuecomment-2304708996))
2024-08-22 13:44:17 +00:00
83b5d449a3 Add full float16/bfloat16 support to MaxUnPool (#133774)
It already supported half so might as well add bfloat16 support for parity

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133774
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-08-22 13:34:43 +00:00
c9c84ae3ee [BE][Ez]: Update CUDNN_frontend submodule to 1.6.1 (#134007)
Update cudnn_frontend submodule to 1.6.1 to patch some minor bugfixes and compiler fixes.
# Bug fix
* Fixed an issue where custom dropout mask was not correctly applied.
* Added -fvisibility=hidden for the pip wheels generated to avoid symbol conflicts with other modules that use cudnn frontend.
* Fixed an issue in sdpa operation which when deserialized will lead to numerical mismatches.
* Fixed an issue in sdpa fp8 fprop operation (in inference mode).
# Samples
* Added a new sample to showcase how a custom dropout mask can be applied to a sdpa operation.
* Added a sample to showcase convolutions on large (c * d * h * w > 2 **31) tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134007
Approved by: https://github.com/eqy
2024-08-22 13:34:17 +00:00
108a75b454 [PP] Add ZeroBubble schedule (#133467)
Zero bubble can be expressed through `ScheduleFlexibleInterleaved1F1B` by setting `enable_zero_bubble=True`. But instead of having to include this flag in schedule initialization, we should create a separate ZeroBubbleSchedule and also transition `Interleaved1F1B` to derive from `ScheduleFlexibleInterleaved1F1B`. Then we don't need to expose `ScheduleFlexibleInterleaved1F1B`, whose naming is not obvious.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133467
Approved by: https://github.com/wconstab
ghstack dependencies: #132691
2024-08-22 13:32:15 +00:00
cedfac20c7 Revert "[SymmetricMemory] introduce multicast support, multimem_all_reduce_ and multimem_one_shot_all_reduce (#133424)"
This reverts commit 66d3eb783c3b3d7087988dd29bfb619b7f4306b7.

Reverted https://github.com/pytorch/pytorch/pull/133424 on behalf of https://github.com/jeanschmidt due to Broke internal ADS builds, see D61611517 ([comment](https://github.com/pytorch/pytorch/pull/133424#issuecomment-2304676328))
2024-08-22 13:29:27 +00:00
592a172910 [FSDP2] Resolved strided sharding todo in clipping tests (#134152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134152
Approved by: https://github.com/XilunWu, https://github.com/weifengpy, https://github.com/wz337
2024-08-22 12:45:13 +00:00
4c645c04d8 Fix type of get_raw_stream (#134187)
Just something I noticed while implementing a new DeviceInterface

I had to add `# type: ignore[assignment]` because mypy thinks
DeviceInterface.get_raw_stream is a `Callable` and therefore
incompatible with a `staticmethod`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134187
Approved by: https://github.com/jansel
2024-08-22 12:00:08 +00:00
5fb8754434 [inductor] write cpp code with encoding utf-8 (#134027)
Windows differs from Linux: each Windows installation can use a different code page, depending on its language pack.
Inductor on Windows writes the generated cpp code using that code page, which can cause character decoding failures.

For this situation, Microsoft suggests using Unicode instead of a specific code page. Ref: https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers

Changes:
1. Use `utf-8` as the encoding for cpp code.
2. Only the encoding for cpp code changes, not the binary type. The binary type is for AoT binary context.
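
A minimal sketch of the two write paths, assuming plain `open()` calls. The helper names `write_cpp_source` and `write_binary` are hypothetical and do not come from the Inductor code:

```python
import tempfile
from pathlib import Path


def write_cpp_source(path: Path, source: str) -> None:
    # Pin the encoding. By default, text-mode open() uses the locale's
    # preferred encoding, which on Windows is usually the active code page.
    with open(path, "w", encoding="utf-8") as f:
        f.write(source)


def write_binary(path: Path, payload: bytes) -> None:
    # Binary mode takes no encoding; the bytes are written unchanged.
    with open(path, "wb") as f:
        f.write(payload)


with tempfile.TemporaryDirectory() as tmp:
    cpp = Path(tmp) / "kernel.cpp"
    write_cpp_source(cpp, "// \u5185\u6838 kernel")
    print(cpp.read_bytes())
```

Reading the file back as bytes shows the UTF-8 encoding of the non-ASCII comment, independent of the host code page.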

It works on https://github.com/pytorch/pytorch/issues/122094#issuecomment-2299592942.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134027
Approved by: https://github.com/desertfire, https://github.com/jgong5, https://github.com/jansel
2024-08-22 11:54:32 +00:00
aea1148d56 [fp8 rowwise] Clarify dtypes (#134114)
Disambiguate some of the dtypes (e.g., for the scales), move the "constant" ones out of the function, and use safe casting functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134114
Approved by: https://github.com/drisspg
ghstack dependencies: #134110, #134111, #134112, #134113
2024-08-22 11:07:39 +00:00
72586ccd14 [fp8 rowwise] Don't build separate kernel for no bias (#134113)
CUTLASS automatically skips a stage in the epilogue if we provide a nullptr. Thus, instead of building a special kernel for bias=None, we can reuse one of the other ones.

This also considerably simplifies the code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134113
Approved by: https://github.com/drisspg
ghstack dependencies: #134110, #134111, #134112
2024-08-22 11:07:39 +00:00
d64fa11095 [fp8 rowwise] Fix bias calculation being done in low precision (#134112)
The compute dtype for the bias addition was set to ElementBias. Thus, for a bf16 bias, we would cast the fp32 accum to bf16 and _then_ add the bias. It is however (slightly?) more accurate to first add the bias in fp32 and only cast at the end.
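
The effect of the ordering can be reproduced without a GPU. The sketch below emulates bfloat16 rounding in plain Python (an assumption made for illustration; the kernel itself uses CUTLASS types) and picks values where rounding the accumulator first loses the bias entirely:

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite value to bfloat16 (nearest, ties to even), returned as a float."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # round to nearest, ties to even
    bits &= 0xFFFF0000                   # keep sign, exponent, top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


accum = 1.0 + 2.0 ** -8   # fp32 accumulator value
bias = 2.0 ** -8          # exactly representable in bf16

# Old behavior: cast the accumulator to bf16, then add the bias in bf16.
cast_then_add = to_bf16(to_bf16(accum) + to_bf16(bias))
# New behavior: add in fp32, cast once at the end.
add_then_cast = to_bf16(accum + bias)

exact = accum + bias
print(abs(cast_then_add - exact), abs(add_then_cast - exact))
```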
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134112
Approved by: https://github.com/drisspg
ghstack dependencies: #134110, #134111
2024-08-22 11:07:34 +00:00
15faed60ca [fp8 rowwise] Make schedule selection more readable (#134111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134111
Approved by: https://github.com/drisspg
ghstack dependencies: #134110
2024-08-22 11:07:30 +00:00
b8ea5b01c9 [fp8 rowwise] Allocate workspace as a PyTorch Tensor (#134110)
This makes us pass through the CUDA caching allocator which is safer e.g. in case of CUDA graphs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134110
Approved by: https://github.com/drisspg
2024-08-22 11:07:26 +00:00
cyy
4c8193b8f0 [14/N] Fix clang-tidy warnings in aten/src/ATen (#132733)
Follows #133807

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132733
Approved by: https://github.com/ezyang
2024-08-22 10:09:15 +00:00
90c821814e SparseCsrCUDA: cuDSS backend for linalg.solve (#129856)
This PR switches to the cuDSS library and has the same purpose as #127692, which is to add Sparse CSR tensor support to linalg.solve.
Fixes #69538

Minimum example of usage:
```
import torch

if __name__ == '__main__':
    spd = torch.rand(4, 3)
    A = spd.T @ spd
    b = torch.rand(3).to(torch.float64).cuda()
    A = A.to_sparse_csr().to(torch.float64).cuda()

    x = torch.linalg.solve(A, b)
    print((A @ x - b).norm())

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129856
Approved by: https://github.com/amjames, https://github.com/lezcano, https://github.com/huydhn

Co-authored-by: Zihang Fang <zhfang1108@gmail.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
2024-08-22 07:57:30 +00:00
64cfcbd8a3 Tune _int_bsr_dense_addmm for int8 inputs on A100 (#134035)
As in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134035
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #133855
2024-08-22 06:43:11 +00:00
b7baa062fc Update torch-xpu-ops pin (ATen XPU implementation) (#133850)
Bug fixes for PyTorch 2.5:
1. Use the SYCL group algorithm API instead of the old style for sub-group shift utilities.
2. Add preprocessing in the reduction kernel for cases requiring a data type cast.
3. Make group norm memory format compatible.
4. ZeroTensor: a. Remove unnecessary aten operator registrations; otherwise ZeroTensor processing is bypassed. b. Align preprocessing with the in-tree implementation in aten::copy_.
5. Rebase checkIndexTensorTypes usage.
6. Align with the latest semantics of PyTorch foreach operators. Return multiple tensors with offset=0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133850
Approved by: https://github.com/EikanWang
2024-08-22 06:27:03 +00:00
cdb9c7d228 Add support for using privateuse1 backend name in instantiate_device_type_tests() (#133082)
As you can see, 'privateuse1' appears many times in out-of-tree extension codebases. I think everything about the device type should behave the same as for other in-tree backends after registering the privateuse1 backend.

For example, after registering a privateuse1 backend named "foo", you should allow "foo" to be passed in as a valid device type.

```diff
- instantiate_device_type_tests(TestIndexing, globals(), only_for='privateuse1')
- instantiate_device_type_tests(NumpyTests, globals(), only_for='privateuse1')
+ instantiate_device_type_tests(TestIndexing, globals(), only_for='foo')
+ instantiate_device_type_tests(NumpyTests, globals(), only_for='foo')
```

> https://github.com/Ascend/pytorch/blob/master/test/test_indexing.py#L1654-L1655

The change is to map privateuse1 backend name to 'privateuse1' when calling `filter_desired_device_types()`.
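
The mapping can be pictured as a small normalization step applied before filtering. This is only a sketch under stated assumptions: `normalize_only_for` is a hypothetical helper, and the registered backend name is passed in explicitly rather than queried from PyTorch:

```python
def normalize_only_for(only_for, privateuse1_name):
    """Map a registered PrivateUse1 backend name back to 'privateuse1'."""
    if isinstance(only_for, str):
        only_for = [only_for]
    # Names that do not match the registered backend pass through unchanged.
    return ["privateuse1" if name == privateuse1_name else name for name in only_for]


print(normalize_only_for("foo", "foo"))
print(normalize_only_for(["cpu", "foo"], "foo"))
```

After this step, the existing filtering logic only ever sees the canonical 'privateuse1' key, while test authors can keep writing the name they registered.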

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133082
Approved by: https://github.com/albanD
2024-08-22 06:17:21 +00:00
24c2dd2002 Migrate fuse_chunk_reshape_concat_pass to PT2 (#134026)
Summary:
This is part of the work of dper pass migration https://fburl.com/gdoc/wxwykxns
This pass has ~2.4% perf impact for adfinder_reels_ctr_model

Test Plan: Still in test

Differential Revision: D60789747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134026
Approved by: https://github.com/huxintong
2024-08-22 06:13:52 +00:00
938f37b745 Added batching rule for sdpa_math, sdpa_efficient_attention forward, cudnn, and flash attention (#133964)
Fixes https://github.com/pytorch/pytorch/issues/117016, https://github.com/pytorch/pytorch/issues/102457, https://github.com/pytorch/pytorch/issues/110525, https://github.com/pytorch/pytorch/issues/108065,

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133964
Approved by: https://github.com/Skylion007
2024-08-22 05:29:49 +00:00
e2ff094008 [inductor] calibration inductor windows uts (1/N) (#134033)
Changes:
1. Re-open fixed UTs.
2. Mark skip reasons for failed UTs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134033
Approved by: https://github.com/jansel
2024-08-22 05:21:28 +00:00
0d7ac1966a kill sharing of constraints (#134045)
Summary:
Previously, reuse of the same `Dim` was encoded by "sharing" internal constraints among constraint targets. This kind of sharing, implemented using `shared` fields between `_Constraint`s, was originally motivated by `dynamic_dim`, specifically to support `==` between `dynamic_dim`s, but we no longer need to maintain this overcomplicated structure: we can simply use names of `Dims` to directly encode sharing information.

Thus this PR vastly simplifies the structure of `_Constraint` by removing `shared` fields. As a result, both `_Constraint` and its moral subclass, `_DerivedConstraint`, are 1-1 with `Dim` and its moral subclass, `DerivedDim`.

Note that this will break `==` over `dynamic_dim`, so an immediate follow-up will be to remove `dynamic_dim` entirely from our public API. (It's been more than 6 months since the deprecation warning anyway.) I just didn't want to deal with that process in the same PR.

Test Plan: existing

Differential Revision: D61559413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134045
Approved by: https://github.com/pianpwk
2024-08-22 04:40:47 +00:00
de06345e9b Avoid Host & Device Sync In LR Scheduler (#133663)
Fixes #133662.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133663
Approved by: https://github.com/janeyx99, https://github.com/eqy

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2024-08-22 03:52:43 +00:00
e847b6bb9b [FlexAttention] Enable different qk and v head-dims (#134043)
# Summary
Adds the option for the head dims to be different between QK and V tensors.

Fixes issue: https://github.com/pytorch/pytorch/issues/133674

V_DIM > QK_DIM is blocked by landing: https://github.com/triton-lang/triton/pull/4138 / https://github.com/triton-lang/triton/pull/4540

Into PyTorch's triton branch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134043
Approved by: https://github.com/Chillee
2024-08-22 03:42:17 +00:00
7868b65c4d [Dynamo] Support dict.setdefault (#134083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134083
Approved by: https://github.com/williamwen42
2024-08-22 01:57:33 +00:00
7b20514f8e [export] Device remapping in export (#133660)
Implemented `move_to_device_pass()` function in `torch._export.passes`.

The user has to explicitly call this method to move the exported program from one torch.device to another one.

Fixes https://github.com/pytorch/pytorch/issues/121761
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133660
Approved by: https://github.com/angelayi
2024-08-22 01:03:35 +00:00
df467f8746 [CI] Do not set Intel OMP for aarch64 (#133997)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133997
Approved by: https://github.com/angelayi
2024-08-22 00:55:46 +00:00
6bddfb9546 [FSDP2] Add cache for FSDP wrapper class (#134135)
Currently, `fully_shard` will create a new `FSDPMyModuleClass` class for each `MyModuleClass` module **object**, which causes Dynamo to guard-fail on every module object's type checking. This PR fixes the issue by caching and reusing previously created FSDP wrapper class.
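
A minimal sketch of the caching idea in plain Python. `FSDPMixin`, `get_wrapper_cls`, and `wrap` are hypothetical stand-ins rather than the FSDP2 implementation; the point is only that the dynamically created class is keyed on the original class and reused:

```python
class FSDPMixin:
    """Stand-in for the behavior that the wrapper class adds."""


_wrapper_cls_cache: dict[type, type] = {}


def get_wrapper_cls(orig_cls: type) -> type:
    """Return one shared wrapper class per original module class."""
    wrapper = _wrapper_cls_cache.get(orig_cls)
    if wrapper is None:
        wrapper = type(f"FSDP{orig_cls.__name__}", (FSDPMixin, orig_cls), {})
        _wrapper_cls_cache[orig_cls] = wrapper
    return wrapper


def wrap(module: object) -> object:
    # Swap the instance's class in place, reusing the cached wrapper class.
    module.__class__ = get_wrapper_cls(type(module))
    return module


class MyModuleClass:
    pass


a, b = wrap(MyModuleClass()), wrap(MyModuleClass())
print(type(a) is type(b), type(a).__name__)
```

Because both instances now share a single class object, a type check recorded for one object also holds for the other, which is what avoids the repeated guard failures.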

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134135
Approved by: https://github.com/awgu
2024-08-22 00:41:30 +00:00
2a73ba298c Upgrade submodule oneDNN to v3.5.3 (#131620)
This PR upgrades the oneDNN submodule to v3.5.3.

## Improvements

- [experimental] Introduced [microkernel API](https://oneapi-src.github.io/oneDNN/ukernels.html) for Intel Architecture Processors. This API exposes internal mechanisms used in matmul and convolution implementation to expert users.
- Improved performance of matmul primitive with sum post-op for batched cases on processors with Intel AMX instruction set support.
- Introduced fp64 matmul support. This functionality is currently implemented on Intel GPUs with hardware acceleration for fp64 math only.

## Validation results on CPU
No regression was found.

1. NLP models accuracy/inference/training

Model Name | Mode Name | Precision | OneDNN | Baseline | OneDNN/Baseline
-- | -- | -- | -- | -- | --
bert-large | realtime | bf16 | 192.498 | 189.664 | 1.014942214
bert-large | throughput | bf16 | 202.424 | 202.156 | 1.001325709
bert-large | train_phase2 | bf16 | 15.955 | 16.029 | 0.995383368
LCM | throughput | bf16 | 1.01983 | 1.06632 | 0.956401455
stable-diffusion | throughput | bf16 | 0.10313 | 0.10184 | 1.012666929
ViT | realtime | bf16 | 1086.48 | 928.43 | 1.17023362
ViT | throughput | bf16 | 1419.07 | 1393.81 | 1.018122987
yolov7 | realtime | bf16 | 413.468682 | 415.16503 | 0.995914039
yolov7 | throughput | bf16 | 369.697 | 366.789 | 1.007928264
bert-large | realtime | fp32 | 46.685 | 46.652 | 1.000707365
bert-large | throughput | fp32 | 47.766 | 48.007 | 0.994979899
bert-large | train_phase2 | fp32 | 7.101 | 7.104 | 0.999577703
LCM | throughput | fp32 | 0.5501 | 0.55023 | 0.999763735
stable-diffusion | throughput | fp32 | 0.04012 | 0.04002 | 1.002498751
ViT | realtime | fp32 | 337.27 | 335.19 | 1.006205436
ViT | throughput | fp32 | 346.52 | 350.08 | 0.989830896
yolov7 | realtime | fp32 | 107.138054 | 107.242747 | 0.999023775
yolov7 | throughput | fp32 | 103.383 | 104.301 | 0.99119855
bert-large | realtime | int8 | 283.541 | 289.569 | 0.979182855
LCM | throughput | int8 | 1.09864 | 1.08998 | 1.0079451
stable-diffusion | throughput | int8 | 0.10617 | 0.10604 | 1.001225952
ViT | realtime | int8 | 1562.11 | 1554.68 | 1.004779119
ViT | throughput | int8 | 1904.38 | 1903.39 | 1.000520125
yolov7 | realtime | int8 | 540.489493 | 539.902488 | 1.001087243
yolov7 | throughput | int8 | 499.999 | 500.757 | 0.998486292

Device | Dtype | Geomean Higher is better
-- | -- | --
All | all | 101.17%
All | fp32 | 99.83%
All | bf16 | 102.24%
All | int8 | 99.91%
All | fp16 | 103.61%
SPR | all | 100.54%
SPR | fp32 | 99.82%
SPR | bf16 | 101.78%
SPR | int8 | 99.90%
GNR | all | 101.58%
GNR | fp32 | 99.85%
GNR | bf16 | 102.66%
GNR | int8 | 99.93%
GNR | fp16 | 103.61%

2. Torchbench cpu userbenchmark inference & training

Perf_Geomean | Ratio (oneDNN/baseline)
-- | --
eager_throughtput_bf16_infer | 1.00x
eager_throughtput_fp32_infer | 1.00x
jit_llga_throughtput_amp_bf16 | 1.00x
jit_llga_throughtput_fp32 | 1.00x
eager_throughtput_fx_int8 | 0.99x
eager_throughtput_bf16_train | 1.01x
eager_throughtput_fp32_train | 1.00x

3. Inductor quantization

Static quant:
Perf_Geomean | Ratio (oneDNN/baseline)
-- | --
PTQ | 1.00x
PTQ_CPP_WRAPPER | 1.00x
QAT | 1.00x

ACC_Geomean | Ratio (oneDNN/baseline)
-- | --
PTQ | 1.00x
PTQ_CPP_WRAPPER | 1.00x
QAT | 1.00x

Dynamic quant:

  | Ratio (oneDNN/baseline)
-- | --
Performance | 1.04x
Accuracy | 1.00x

4. Dynamo benchmarks
GEOMEAN summary
![image](https://github.com/user-attachments/assets/82fc4b76-50f6-4f06-9ba9-034b932f1158)

FP32 Static shape, default wrapper
![image](https://github.com/user-attachments/assets/9335268e-3e99-426b-91f8-f9df90a2007c)

FP32 Dynamic shape, default wrapper
![image](https://github.com/user-attachments/assets/e7cf3f4f-2a62-4b58-9461-5e5ba254d822)

AMP Static shape, default wrapper
![image](https://github.com/user-attachments/assets/12392c88-e44f-4c95-904a-4fa5fc9f34a2)

AMP Dynamic shape, default wrapper
![image](https://github.com/user-attachments/assets/13930b0d-9bb2-46de-9ecb-5d2585d5c2f6)

## Validation results on XPU
Category | Eager | Inductor
-- | -- | --
huggingface_amp_fp16_training | 1.002456 | 0.999998
huggingface_bfloat16_inference | 1.005386 | 1.003511
huggingface_float32_training | 1.002533 | 1.003098
torchbench_amp_fp16_training | 1.009065 | 1.01323
torchbench_bfloat16_inference | 1.003371 | 1.001534
torchbench_float32_training | 1.012102 | 1.011596
timm_models_amp_fp16_training | 1.005511 | 1.010329
timm_models_bfloat16_inference | 1.000935 | 1.000538
timm_models_float32_training | 0.991873 | 0.99721

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131620
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-08-21 23:40:02 +00:00
5f0bd98767 Increase max total number of dynamo partitions to 15 (#134153)
Needed to be able to split some of the aarch64 workflows to 15 shards

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134153
Approved by: https://github.com/seemethere, https://github.com/kit1980, https://github.com/ZainRizvi
2024-08-21 23:10:12 +00:00
a5ef04a3b8 add relevant function (#133946)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133946
Approved by: https://github.com/ezyang
2024-08-21 23:04:59 +00:00
8604c0a150 [inductor] Fix needs_fixed_stride_order silent incorrectness (#133639)
Fixes #128084

The approach is option 2 of what Elias suggested in the comment
thread:
- We require tensors to have the correct stride at usage. This may
  involve a clone; if there was a clone and then a mutation into it
  then we copy_ back the result of the mutation.

The reason why I went this approach was because it was the easiest and
Inductor already works really hard to remove additional clones/copy_.

There are some cases that this doesn't generate efficient code for; for
example, if the tensor is a view, we don't change the base of the view
to have the right stride order, instead we do a clone.
The view case isn't very common so I'm ignoring it for now but we could
improve this in the future.

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133639
Approved by: https://github.com/eellison
2024-08-21 22:54:16 +00:00
d2204d4f0f Remove skip ci recommendation (#134134)
Using `skip ci` is no longer a recommended practice.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134134
Approved by: https://github.com/soulitzer
2024-08-21 22:42:25 +00:00
255cd75a97 [sparse] Add cuSPARSELt as a backend (#128534)
Summary:

This PR adds in cuSPARSELt as a backend to PyTorch.

It is now possible to check whether cuSPARSELt is available, and to query
its version if it is, with
```
torch.backends.cusparselt.is_available()
torch.backends.cusparselt.version()
```

Test Plan:
```
python test/test_sparse_semi_structured.py -k test_cusparselt_backend
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128534
Approved by: https://github.com/cpuhrsch, https://github.com/eqy, https://github.com/syed-ahmed
2024-08-21 22:06:07 +00:00
0870398fa8 [ONNX] Opt into ruff fmt (#134120)
Add ONNX directory to use ruff format.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134120
Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007
2024-08-21 21:43:55 +00:00
96dfe95ed0 Fix DDPLoadBalancingPlanner docstring (#134044)
Summary:
1. Indentation in chunk function was wrong.
1. The previous logic missed a level of zip.

This diff uses the idiom in python zip doc to do chunking https://docs.python.org/3/library/functions.html#zip
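
For reference, the chunking idiom from the linked Python `zip` documentation looks roughly like this. The helper name `chunk` is hypothetical, and the actual docstring in `DDPLoadBalancingPlanner` is not shown here, so treat this as a sketch of the idiom only.

```python
def chunk(items, n):
    """Group items into tuples of length n using the zip idiom."""
    # The same iterator object is repeated n times, so each output tuple
    # pulls n consecutive items. Trailing items that cannot fill a whole
    # tuple are dropped.
    return list(zip(*[iter(items)] * n))


pairs = chunk([1, 2, 3, 4, 5, 6], 2)
```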

Test Plan: Run the docstring locally

Differential Revision: D61548758

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134044
Approved by: https://github.com/fegin
2024-08-21 21:28:22 +00:00
5d5a45dc85 [CI][dashboard] Collect Export pass rate separately (#134076)
Summary: Collect Export pass rate separately when running AOTInduction, so that we can have a better isolated signal.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134076
Approved by: https://github.com/angelayi
2024-08-21 21:18:55 +00:00
b3eef3deaf Triple number of shards for aarch64 cpu inductor tests (#134123)
Let's see if this will work.

Alas, other than linting I can only test it after it lands
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134123
Approved by: https://github.com/clee2000
2024-08-21 20:52:23 +00:00
345578afb4 Add int8 support to bsr_dense_addmm and bsr_dense_mm Triton kernels (#133855)
As in the title. In addition, the PR introduces `_int_bsr_dense_addmm`, which is equivalent to `bsr_dense_addmm` except that for int8 inputs the result is an int32 tensor (similar to the existing `_int_mm`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133855
Approved by: https://github.com/cpuhrsch
2024-08-21 20:44:40 +00:00
a3e1416c05 Fix out_tensor device in diag_test.py (#134020)
This benchmark fails if device='cuda' but out_tensor is on cpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134020
Approved by: https://github.com/soulitzer
2024-08-21 20:43:39 +00:00
6c1e2d2462 [easy] Force inline_inbuilt_nn_modules to remove divergence (#134122)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134122
Approved by: https://github.com/williamwen42, https://github.com/mlazos
2024-08-21 20:42:15 +00:00
865facda44 [pytorch] Remove thread naming when torch is imported (#134066)
Fixes #133690

The naming was added in #121170 to allow performance debugging of latency critical threads. However the `pt_main_thread` name gets inherited every time a new process or thread is created from the parent one, which defeats the purpose. We need a better way to name the thread that launches kernels on accelerators but for the time being we can let users name the threads in the application code, using: `torch.multiprocessing._set_thread_name("insert_name")`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134066
Approved by: https://github.com/soulitzer, https://github.com/d4l3k
2024-08-21 20:34:35 +00:00
1491a61769 Revert "[hop] ban creating hop by directly instantiating HigherOrderOperator. (#133645)"
This reverts commit 696107efcb83f9359aa669ab343c2cfa2a111372.

Reverted https://github.com/pytorch/pytorch/pull/133645 on behalf of https://github.com/ydwu4 due to breaking ci. probably due to land race ([comment](https://github.com/pytorch/pytorch/pull/133645#issuecomment-2302866106))
2024-08-21 19:33:14 +00:00
5fcfccefc6 [export] Migrate capture_pre_autograd_graph to _export_for_training (#132815)
Summary: as title

Test Plan: CI

Differential Revision: D60860909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132815
Approved by: https://github.com/tugsbayasgalan
2024-08-21 19:00:41 +00:00
18aaceb7be Update conda-env-iOS.txt (#134068)
Follow-up to https://github.com/pytorch/pytorch/pull/133814. To fix periodic build failures, update `typing-extensions` to 4.11.0, as 4.10 is missing in conda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134068
Approved by: https://github.com/wdvr, https://github.com/atalman, https://github.com/Skylion007
2024-08-21 18:47:14 +00:00
84b3f1900a C++ network flow implementation in c10 (#132188)
The functorch partitioners use network flow to split the joint graph into a forward and backward graph. Internally, we've found that upgrading to networkx 2.8.8 (from 2.5) results in some hard-to-debug failures (internal reference: https://fburl.com/workplace/jrqwagdm). And I'm told that there's interest to remove the python dependency.

So this PR introduces a C++ implementation that mirrors the API provided by networkx. We'll need to add python bindings and do some additional testing to verify correctness.
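
To illustrate the kind of computation involved, here is a small pure-Python min-cut sketch using Edmonds-Karp. It is not the C++ code from this PR; the function name and the `(cut_value, (reachable, non_reachable))` return shape are chosen to resemble `networkx.minimum_cut`, and the input format (a dict of dicts of capacities) is an assumption made for the example.

```python
from collections import deque


def minimum_cut(capacity, source, sink):
    """Edmonds-Karp max-flow; returns (cut_value, (reachable, non_reachable))."""
    # Build residual capacities, adding zero-capacity reverse edges.
    residual = {source: {}, sink: {}}
    for u, edges in capacity.items():
        for v, cap in edges.items():
            residual.setdefault(u, {})
            residual.setdefault(v, {})
            residual[u][v] = residual[u].get(v, 0) + cap
            residual[v].setdefault(u, 0)

    def bfs():
        """Parent map of every node reachable from source in the residual graph."""
        parent = {source: None}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v, cap in residual[u].items():
                if cap > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        return parent

    flow = 0
    while True:
        parent = bfs()
        if sink not in parent:
            break
        # Walk back from the sink to find the bottleneck on this path.
        bottleneck = float("inf")
        v = sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push the bottleneck amount along the path.
        v = sink
        while parent[v] is not None:
            u = parent[v]
            residual[u][v] -= bottleneck
            residual[v][u] += bottleneck
            v = u
        flow += bottleneck

    # Nodes still reachable from the source form one side of a minimum cut.
    reachable = set(parent)
    non_reachable = set(residual) - reachable
    return flow, (reachable, non_reachable)


capacity = {"s": {"a": 3, "b": 2}, "a": {"t": 2, "b": 1}, "b": {"t": 3}}
cut_value, (reachable, non_reachable) = minimum_cut(capacity, "s", "t")
```

In a partitioner setting, the two sides of the cut would correspond to the nodes kept on the forward side and the nodes recomputed or placed on the backward side; that mapping is outside the scope of this sketch.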

Differential Revision: [D61550977](https://our.internmc.facebook.com/intern/diff/D61550977)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132188
Approved by: https://github.com/Chillee
2024-08-21 18:40:54 +00:00
05304f59f0 [Doc] Fix typo in torch/fx/passes/README.md (#134078)
Fix typo, `utis` to `utils`, in the utility name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134078
Approved by: https://github.com/soulitzer, https://github.com/malfet
2024-08-21 18:35:50 +00:00
32e057636c Enable scribe environment for compile-time benchmarks if requested. (#133891)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133891
Approved by: https://github.com/malfet
2024-08-21 18:02:54 +00:00
750d68ff70 Use amazon linux2 for Docker builds, fix build-docker-conda condition (#134116)
1. Switch failing jobs to Amazon Linux 2:
- CUDA, CPU, ROCM jobs are failing
2. Fix trigger condition for build-docker-conda to be the same as manywheel and libtorch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134116
Approved by: https://github.com/ZainRizvi, https://github.com/nWEIdia
2024-08-21 18:01:16 +00:00
696107efcb [hop] ban creating hop by directly instantiating HigherOrderOperator. (#133645)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133645
Approved by: https://github.com/zou3519
ghstack dependencies: #133521
2024-08-21 17:34:21 +00:00
6835f20d20 [HOP] support generating schema for hop (#133521)
Add a way of generating a FunctionSchema from example values because hop's schema varies even for the same hop.

We didn't use torch._C.FunctionSchema because we cannot construct the classes directly (e.g. "__init__" cannot be used for torch._C.FunctionSchema). Also extending the Basic types in c++ seems not that easy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133521
Approved by: https://github.com/zou3519
2024-08-21 17:34:21 +00:00
dd5a7c8397 [PT2] Add a pass to convert stack to unsqueeze cat (#133966)
Summary: so that we can optimize with `fuse_chunk_reshape_unsqueeze_concat_pass`

Test Plan: new UT

Reviewed By: frank-wei

Differential Revision: D61220221

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133966
Approved by: https://github.com/frank-wei
2024-08-21 17:31:26 +00:00
1da3a049da [dynamo][super] Improve handling of getattr on super (#134039)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134039
Approved by: https://github.com/yanboliang
ghstack dependencies: #133742, #134016
2024-08-21 16:50:35 +00:00
3ef1cc8583 [export] Implement common_getitem_elimination pass. (#133618)
Summary:
In export, we will generate many redundant getitem nodes branching from the same source, inserted by runtime assertions or any passes. This is causing issues with any downstream system relying on any value being uniquely defined by a single node.

I don't think it hurts to remove a bunch of getitem nodes only, so I just added it to the ctor.
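
A toy sketch of the deduplication described above, on a simplified graph representation rather than a real `torch.fx` graph. `Node` and `eliminate_common_getitem` are hypothetical names for this example, and the node list is assumed to be in topological order.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    op: str      # "getitem" or any other op name
    args: tuple  # names of input nodes; getitem uses (source_name, index)


def eliminate_common_getitem(nodes):
    """Keep the first getitem per (source, index) and redirect later uses to it."""
    first_seen = {}  # (source, index) -> name of the surviving getitem node
    renamed = {}     # name of a removed duplicate -> name of the survivor
    result = []
    for node in nodes:
        # Rewrite inputs that pointed at an already-removed duplicate.
        args = tuple(renamed.get(a, a) if isinstance(a, str) else a for a in node.args)
        if node.op == "getitem":
            if args in first_seen:
                renamed[node.name] = first_seen[args]
                continue  # drop the duplicate
            first_seen[args] = node.name
        result.append(Node(node.name, node.op, args))
    return result


graph = [
    Node("split", "split", ("x",)),
    Node("g0", "getitem", ("split", 0)),
    Node("g0_dup", "getitem", ("split", 0)),
    Node("g1", "getitem", ("split", 1)),
    Node("add", "add", ("g0_dup", "g1")),
]
deduped = eliminate_common_getitem(graph)
```

After the pass, each `(source, index)` pair is defined by exactly one node, which is the property downstream systems are described as relying on.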

Test Plan:
rebase on D61256937
```
buck2 run scripts/bearzx:pt2_export_playground
```

Differential Revision: D61351578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133618
Approved by: https://github.com/tugsbayasgalan
2024-08-21 16:48:24 +00:00
2db28a9611 Revert "[BE]: Update Typeguard to TypeIs for better type inference (#133814)"
This reverts commit bce0caba7804b0787684dbf1f4e1c4d9e3acded5.

Reverted https://github.com/pytorch/pytorch/pull/133814 on behalf of https://github.com/ezyang due to root cause of internal failures not addressed ([comment](https://github.com/pytorch/pytorch/pull/133814#issuecomment-2302466444))
2024-08-21 16:13:34 +00:00
57625bacea [partitioner] Fix must_be_in_backward corner cases (#134002)
Preparation PR for https://github.com/pytorch/pytorch/pull/132638

"must_be_in_backward" fails the partitioner if the partitioner picks this node as one of the saved_values.

The fix is to prevent the partitioner from picking those nodes during node classification.

It's hard to make a test without making effectful ops in backward "must_be_in_backward", which will be testing this ( https://github.com/pytorch/pytorch/pull/132638 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134002
Approved by: https://github.com/bdhirsh
ghstack dependencies: #134003
2024-08-21 15:58:49 +00:00
68425e68fe Revert "[dynamo][reland][inline-inbuilt-nn-modules] Mark attributes of nn mod… (#133714)"
This reverts commit e8d3c4be3629582294b5944754009fae60f42f6d.

Reverted https://github.com/pytorch/pytorch/pull/133714 on behalf of https://github.com/anijain2305 due to fails internally ([comment](https://github.com/pytorch/pytorch/pull/133714#issuecomment-2302171472))
2024-08-21 14:21:06 +00:00
32e052e468 [docs] improve torch.stack example code to be reproducible (#133857)
Improve the sample code so that it produces the expected results when copied and executed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133857
Approved by: https://github.com/soulitzer
2024-08-21 14:07:02 +00:00
585c049fa3 Fix Extension attribute name in CppExtension example (#134046)
Hi! It seems there's a typo in `CppExtension` example. I think it should say `extra_link_args` instead of `extra_link_flags`. Not that I spent a few hours debugging missing kernels inside a library's fatbin or anything :D.

Please see `Extension` definition inside setuptools:
ebddeb36f7/setuptools/_distutils/extension.py (L62)

Thanks!
Błażej

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134046
Approved by: https://github.com/soulitzer
2024-08-21 13:58:16 +00:00
afaa5fcecb [BE][Ez]: FURB142,FURB92 misc preview fixes (#133880)
Fixes some miscellaneous code quality issues with some refurb rules that have not been enabled yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133880
Approved by: https://github.com/soulitzer, https://github.com/malfet
2024-08-21 13:54:51 +00:00
683609c631 Skip cpp_extension test internally (#134011)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134011
Approved by: https://github.com/masnesral
2024-08-21 13:51:05 +00:00
4b1fb3b0ed [PP] pt-native input/weight grad split (#132691)
Add `stage_backward_input` and `stage_backward_weight` functions to perform the weight updates for inputs and weights independently.

We still support the `self.dw_builder` argument for a custom backward, but it has become optional. It takes a separate code path and cannot be used in conjunction with the native zero backward.

Added tests:
`python test/distributed/pipelining/test_schedule_multiproc.py -k test_schedule_with_native_zero_bubble`
`python test/distributed/pipelining/test_backward.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132691
Approved by: https://github.com/wconstab
2024-08-21 13:37:54 +00:00
2bffbe06bd [Inductor][CPP] Support vectorization of load_seed and randn (#130317)
**Summary**
Enable the vectorization of `load_seed` and `randn`. For now, `randn` is using the reference implementation.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_randn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130317
Approved by: https://github.com/jgong5
ghstack dependencies: #122961
2024-08-21 13:20:43 +00:00
313bc11963 [inductor][cpp] complete vectorization for int32/int64 (#122961)
**Summary**
Implement the complete vectorization of `index_expr` functionally. We also add heuristics, from a performance perspective, to resolve the regressions reported in https://github.com/pytorch/pytorch/pull/122961#issuecomment-2041336265 by disabling vectorization of specific (fused) scheduler nodes:

- Heuristic 1: when the num of non-contiguous `index_expr/load/store` exceeds the threshold, we disable the vectorization.
- Heuristic 2: when the total number of elements along the vec dim is less than `tiling_factor/2`, we disable the vectorization.
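
The two heuristics can be sketched as a single predicate. The function and parameter names below are hypothetical, and the threshold is left as a parameter because its value is not stated in this description.

```python
def should_vectorize(num_noncontiguous, noncontiguous_threshold,
                     vec_dim_elements, tiling_factor):
    """Return False when either heuristic says vectorization is not worthwhile."""
    # Heuristic 1: too many non-contiguous index_expr/load/store operations.
    if num_noncontiguous > noncontiguous_threshold:
        return False
    # Heuristic 2: too few elements along the vectorized dimension.
    if vec_dim_elements < tiling_factor / 2:
        return False
    return True
```

For example, with `tiling_factor=16`, a node with 7 elements along the vec dim would fall back to scalar code under this sketch, while one with 8 elements would still be vectorized.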

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122961
Approved by: https://github.com/jansel

Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
2024-08-21 13:12:38 +00:00
539be0a769 [dynamo] support ClassMethodDescriptorType (#133862)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133862
Approved by: https://github.com/jansel
2024-08-21 12:56:19 +00:00
0d79f67a25 [dynamo][exception] Support raise exception from None (#134028)
Fixes https://github.com/pytorch/pytorch/issues/132362

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134028
Approved by: https://github.com/yanboliang
2024-08-21 12:48:35 +00:00
bd0db490bf [dynamo][set] Fix EQUALS_MATCH guard for constant sets and lists (#134016)
Fixes https://github.com/pytorch/pytorch/issues/133509

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134016
Approved by: https://github.com/laithsakka, https://github.com/jansel
ghstack dependencies: #133742
2024-08-21 12:41:52 +00:00
c929e1e11f [dynamo] fix polyfill for user defined constructor __new__ (#133822)
In `cls->tp_call`, if `cls->tp_new` does not return an instance of class `cls`, then `cls->tp_init` is not called on the new instance.

Related PR:

- #132977
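
The behavior described above can be observed from plain Python. In this small illustration (the class `Widget` is made up for the example), `__init__` is skipped when `__new__` returns something that is not an instance of the class:

```python
class Widget:
    init_calls = 0

    def __new__(cls, make_other=False):
        if make_other:
            # Not an instance of Widget, so __init__ will not run.
            return 42
        return super().__new__(cls)

    def __init__(self, make_other=False):
        type(self).init_calls += 1


w = Widget()                     # __new__ returns a Widget, __init__ runs
other = Widget(make_other=True)  # __new__ returns an int, __init__ is skipped
```

A polyfill for the constructor call has to reproduce this check to match eager semantics.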

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133822
Approved by: https://github.com/jansel
2024-08-21 12:41:19 +00:00
695291be2f Fix test flakiness due to not resetting state (#134058)
Fixes https://github.com/pytorch/pytorch/issues/133994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134058
Approved by: https://github.com/yanboliang
2024-08-21 11:54:08 +00:00
30dc6338c1 [effects] Prevent inductor dtype promotions for HOP effects tokens (#134003)
Preparation for https://github.com/pytorch/pytorch/pull/132638 and https://github.com/pytorch/pytorch/pull/132755

Inductor promotes argument dtypes to the highest dtype; as a result, the additional token tensor argument with float32 dtype caused dtype promotions for lower types, e.g. int32.

The solution is to use the lowest dtype for tokens: torch.bool.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134003
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2024-08-21 11:42:10 +00:00
19eb14493a [Inductor] Moves intermediary tensors which are constructed on the cpu to XPU when safe, align with CUDA. (#132843)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132843
Approved by: https://github.com/EikanWang, https://github.com/eellison
ghstack dependencies: #132740, #132748
2024-08-21 11:28:09 +00:00
6535f11259 [Inductor] Support _check_triton_bf16_support on XPU. (#132748)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132748
Approved by: https://github.com/EikanWang, https://github.com/eellison
ghstack dependencies: #132740
2024-08-21 11:28:09 +00:00
c2e2602ecd [Inductor] Move GPU_TYPE (the runtime available GPU type, cuda or xpu) from (#132740)
Move GPU_TYPE (the runtime available GPU type, cuda or xpu) from `testing/_internal/inductor_utils.py` to `_inductor/utils.py`, so that we can use it in Inductor and not only in test cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132740
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-08-21 11:18:00 +00:00
3d8db41337 Add new op wrapped_quantized_linear (#134024)
Summary:
This diff adds a new operator wrapped_quantized_linear (torch.ops._quantized.wrapped_quantized_linear) that takes the following input arguments: input (in fp32), input_scale, input_zero_point, weight (in fp32), weight_scale, weight_zero_point, bias (in fp32), output_scale, output_zero_point, and out_channel. It does the following:

1. Use quantize_per_tensor(input, input_scale, input_zero_point) to quantize the input tensor to int8
2. Use quantized::linear_prepack(weight, weight_scale, weight_zero_point, bias) to pack the weight and bias
3. Use quantized::linear to perform int8 quantized linear
4. dequantize
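
The arithmetic behind these four steps can be sketched in plain Python. This is only an illustration of per-tensor affine quantization under assumed conventions (signed 8-bit range, Python lists instead of tensors, no weight packing); it is not the operator's implementation, and all function names are hypothetical.

```python
def quantize(values, scale, zero_point, qmin=-128, qmax=127):
    """Per-tensor affine quantization with clamping to [qmin, qmax]."""
    return [min(qmax, max(qmin, round(v / scale) + zero_point)) for v in values]


def dequantize(qvalues, scale, zero_point):
    """Map quantized integers back to floats."""
    return [(q - zero_point) * scale for q in qvalues]


def quantized_linear_sketch(x, x_scale, x_zp, w, w_scale, w_zp,
                            bias, out_scale, out_zp):
    """x: list of in_features; w: list of rows (out_features x in_features)."""
    qx = quantize(x, x_scale, x_zp)                      # step 1
    qw = [quantize(row, w_scale, w_zp) for row in w]     # stands in for step 2
    out = []
    for row, b in zip(qw, bias):                         # step 3
        # Integer accumulation, then rescale to float and add the fp32 bias.
        acc = sum((a - x_zp) * (c - w_zp) for a, c in zip(qx, row))
        out.append(acc * x_scale * w_scale + b)
    q_out = quantize(out, out_scale, out_zp)             # requantize the output
    return dequantize(q_out, out_scale, out_zp)          # step 4


result = quantized_linear_sketch(
    x=[1.0, 2.0], x_scale=0.5, x_zp=0,
    w=[[1.0, -1.0]], w_scale=0.5, w_zp=0,
    bias=[0.5], out_scale=0.5, out_zp=0,
)
```

With these inputs the float computation is `1*1 + 2*(-1) + 0.5 = -0.5`, and every intermediate value is exactly representable on the chosen scales, so the sketch returns the same number.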

This new op is essentially a wrapper around multiple ops. We do this because torch.export cannot handle models that use the old quantize APIs.

Reviewed By: jerryzh168

Differential Revision: D61377266

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134024
Approved by: https://github.com/houseroad
2024-08-21 09:26:58 +00:00
022cd7c9aa [RFC][dynamo] add decorator to register polyfill for unsupported C++ function to avoid graph break (#133712)
Add decorator `torch.compiler.substitute_in_graph` to register polyfill for unsupported C++ function to avoid graph break. This API provides an official way to add support for dynamo for third-party C extensions. Also, it can be used to simplify our implementation for `torch._dynamo.polyfill`.

5ee070266f/torch/_dynamo/variables/builtin.py (L97-L107)

Example:

```python
>>> import operator
>>> operator.indexOf([1, 2, 3, 4, 5], 3)
2

>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
Unsupported: ...

>>> @torch.compiler.substitute_in_graph(operator.indexOf)
... def indexOf(sequence, x):
...     for i, item in enumerate(sequence):
...         if item is x or item == x:
...             return i
...     raise ValueError("sequence.index(x): x not in sequence")

>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133712
Approved by: https://github.com/jansel
2024-08-21 06:36:41 +00:00
843fdf81c2 Fix a getenv segfault due to a race (#133744)
Summary:
* TLDR:

`getenv` is not thread safe w.r.t `setenv`. Environment variables are kept as a per-process "dictionary" by libc. `setenv` can essentially realloc the whole thing and move this list to a completely different location. If there is a concurrent `getenv` happening at the same time, it is possible that it might end up reading stale memory and segfault.
`getenv` is thread safe w.r.t other `getenv`.

* Details:

Inside PTD init:
```
ProcessGroupNCCL ctor
	...
	ncclCommWatchdogThread_ =
      std::thread(&ProcessGroupNCCL::ncclCommWatchdog, this); (https://fburl.com/code/terf9ai7)
```

Inside ncclCommWatchdog thread:
```
	...
	ncclHeartbeatMonitorThread_ =
        std::thread(&ProcessGroupNCCL::heartbeatMonitor, this);  (https://fburl.com/code/fv9camg2)
    ...
```

Inside heartbeatMonitor thread:
```
	...
	std::optional<DumpPipe> dumpPipe = std::nullopt; (https://fburl.com/code/qdvahzbu)
	dumpPipe.emplace(rank_);
	...
```

Inside DumpPipe ctor (https://fburl.com/code/wvixlqcz)
```
	getCvarString
		getenv <=== SIGSEGV
```

On the main thread:

We go on to initialize NCCL:

Inside getNCCLComm, we call: `getNcclVersion` -> `initEnv` (https://fburl.com/code/j312pccu)

`initEnv` inside NCCL does this: `initEnv` -> `setEnvFile`

This function reads the /etc/nccl.conf file and sets the values of env variables with "setenv" (https://fburl.com/code/cq4r0y0h)
This "setenv" can race with "getenv" in heartbeatMonitor thread.

Ideally, all `setenv` should be done by a single thread before launching other threads. This diff moves getNCCLVersion before launching watchdog thread to make sure all setenvs are done beforehand.
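
The ordering rule can be illustrated with a Python analogue. This is a sketch of the pattern only: the real fix is in C++ and concerns libc's `getenv`/`setenv`, and the variable name `EXAMPLE_DEBUG_DIR` and the function names are hypothetical.

```python
import os
import threading


def init_env():
    """Apply every environment write on the main thread, before any worker exists."""
    os.environ.setdefault("EXAMPLE_DEBUG_DIR", "/tmp/example")


def monitor(results):
    # Worker threads only read the environment.
    results.append(os.environ.get("EXAMPLE_DEBUG_DIR"))


def start():
    init_env()  # 1. all writes happen first
    results = []
    worker = threading.Thread(target=monitor, args=(results,))
    worker.start()  # 2. readers are launched afterwards
    worker.join()
    return results
```

Because no thread is running while `init_env` executes, there is no reader that could observe the environment in the middle of a write.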

I think we are just getting lucky that we are not hitting it in production. IIRC in fact we saw a getenv segfault once in one of the large-scale runs, but now I don't remember the details.

Test Plan: A lot of testing done as part of D61411062 & CI

Differential Revision: D61421292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133744
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2024-08-21 06:27:31 +00:00
af664882dd Safely infer device type + docstrings + tests (#133668)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133668
Approved by: https://github.com/eellison
2024-08-21 05:27:31 +00:00
b39ec7fbe9 [1/N] Make NCCL PG error messages more accurate and simpler (#134017)
We did a thorough review of all the error messages we log inside PGNCCL, and we want to make the log messages simpler and more accurate. This is the first PR in this effort.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134017
Approved by: https://github.com/wconstab
2024-08-21 05:21:24 +00:00
66d3eb783c [SymmetricMemory] introduce multicast support, multimem_all_reduce_ and multimem_one_shot_all_reduce (#133424)
### Summary
- Added multicast support to SymmetricMemory. If the cuda runtime and cuda driver have multicast support, SymmetricMemory associates all peer buffers with a multicast object and exposes the multicast virtual address.
- Implemented `multimem_all_reduce_` and `multimem_one_shot_all_reduce` based on the multicast support. The two variants show different performance characteristics for different message sizes. We plan to use Inductor for collective algo selection (and required symmetric memory buffer allocation).

### Benchmark

8xH100 (non-standard version with HBM2e at 650W). NVSwitch V3 with NVLS support.

![image](https://github.com/user-attachments/assets/4998a16b-c2c0-4797-9dd0-1da2303df947)

![image](https://github.com/user-attachments/assets/278ad361-52cb-4864-82c6-bb67e8d0a3fe)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133424
Approved by: https://github.com/yf225, https://github.com/weifengpy
2024-08-21 05:11:21 +00:00
8337b4d96e [training ir migration] Fix ReorderConvertTest (#134010)
Summary:
Change ReorderConvertTest to work with the new `capture_pre_autograd_graph` implementation using D61175223.

Note that now `ReorderConvertTest` doesn't work with the old `capture_pre_autograd_graph` anymore.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/passes/tests:optimize_test -- -r ReorderConvertTest
```

Differential Revision: D61507772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134010
Approved by: https://github.com/tugsbayasgalan
2024-08-21 04:48:43 +00:00
e8fc1e0118 [ONNX] New export logic leveraging ExportedProgram and ONNX IR (#132530)
1/n PR to

- Move code from torch-onnx from commit 395495e566 into torch.onnx and fixes imports.
- Integrate the new export logic with the torch.onnx.export API and include basic set of tests.
- Refactor the API for the change.
- Improve documentation.

Next PRs will be more tests and docs.

Fix https://github.com/pytorch/pytorch/issues/129277
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132530
Approved by: https://github.com/titaiwangms, https://github.com/malfet
2024-08-21 01:08:42 +00:00
06cc2e83f0 Make optim.swa.util content accessible from the torch.optim doc (#133393)
Link various classes and functions of the `optim.swa.util` to make doc content accessible from the `torch.optim` doc.

Currently, if you click the link,
https://pytorch.org/docs/stable/optim.html#module-torch.optim.swa_utils it goes to a blank, bottom of the page section of `torch.optim`.
Also,
`torch.optim.swa_utils.AveragedModel` and `torch.optim.swa_utils.SWALR` classes as well as `torch.optim.swa_utils.update_bn()` and `optim.swa_utils.get_ema_multi_avg_fn` are not linked to doc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133393
Approved by: https://github.com/janeyx99
2024-08-21 00:43:46 +00:00
d1abd6241a [CI][BE] Update retry action to v3.0.0 (#119403)
To reduce the number of
```
 Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20
```

Finally can land this one, as all nodes have been migrated to AmazonLinux2023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119403
Approved by: https://github.com/clee2000, https://github.com/Skylion007
2024-08-20 23:56:37 +00:00
c42ac54d9e [inductor] prune unused constants in graph scheduling (#132208)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132208
Approved by: https://github.com/leslie-fang-intel

Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
2024-08-20 23:40:11 +00:00
5f3d22a609 Avoid GPU syncs by reusing Pre-allocated Zero Tensor (#128069)
This commit improves the FullyShardedDataParallel (FSDP) class in PyTorch by reducing unnecessary GPU synchronizations by reusing a pre-allocated zero tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128069
Approved by: https://github.com/awgu
2024-08-20 22:51:33 +00:00
5a7b544e5c Update FlexAttention with masking semantic (#133373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133373
Approved by: https://github.com/yanboliang
2024-08-20 22:38:10 +00:00
bc785c2d9a [Inductor][FlexAttention] Don't trigger dynamic shape on building empty block mask (#133836)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133836
Approved by: https://github.com/Chillee
2024-08-20 22:36:53 +00:00
f7c1f32803 Fix partially initialized module error (#134019)
https://github.com/pytorch/pytorch/pull/132990 introduced a dependency on `torch.version`, which might not be imported yet and can result in `AttributeError: partially initialized module 'torch' has no attribute 'version' (most likely due to a circular import)` if the user starts their code with `import torch.cuda`.

Fix it by importing `torch.version` explicitly.

Test Plan: CI

Differential Revision: D61549284

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134019
Approved by: https://github.com/seemethere
2024-08-20 22:20:02 +00:00
41fab40be7 [report_exportability] Avoid re-exporting duplicated modules (#133930)
Summary:
Skip re-exporting modules whose type has already been exported, to speed up the exportability tests.

Real models contain many duplicated modules, and duplicates mostly have the same export issues.
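
A minimal sketch of the deduplication idea, not the actual `report_exportability` code. The `report_by_type` helper and the `try_export` callback are hypothetical names used for illustration, and plain Python objects stand in for `nn.Module` instances:

```python
from typing import Callable, Dict, Optional


def report_by_type(
    modules: Dict[str, object],
    try_export: Callable[[object], Optional[Exception]],
) -> Dict[str, Optional[Exception]]:
    """Try to export one module per distinct type; return {path: error or None}."""
    report: Dict[str, Optional[Exception]] = {}
    seen_types = set()
    for path, mod in modules.items():
        if type(mod) in seen_types:
            # This class was already attempted; assume it hits the same issue.
            continue
        seen_types.add(type(mod))
        report[path] = try_export(mod)
    return report
```

The trade-off is that two instances of the same class configured differently are treated as equivalent, which matches the "mostly have the same export issues" assumption above.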

Test Plan: Existing CI

Differential Revision: D61504630

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133930
Approved by: https://github.com/angelayi
2024-08-20 22:11:57 +00:00
1ae5d5bb62 [dynamo][user-defined] Improve getattr_static for user_defined objects (#133742)
Fixes https://github.com/pytorch/pytorch/issues/133607

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133742
Approved by: https://github.com/Skylion007, https://github.com/jansel
2024-08-20 21:51:03 +00:00
a36739f36a Cherry-Picking don't resolve conflicts (#134047)
During cherry-picking we want to use the default settings and fail if there is a merge conflict.
Here is an example of an invalid conflict resolution:
https://github.com/pytorch/pytorch/pull/131194
and cherry-pick
https://github.com/pytorch/pytorch/pull/133590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134047
Approved by: https://github.com/kit1980
2024-08-20 21:48:05 +00:00
2e1830c7c8 Implement 2D version of masked_select for nestedtensors (#133889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133889
Approved by: https://github.com/soulitzer
2024-08-20 21:46:32 +00:00
15b5a0b67f Revert "[RFC][dynamo] add decorator to register polyfill for unsupported C++ function to avoid graph break (#133712)"
This reverts commit 71dd52f51a05d110c06e83f74cef165f64627842.

Reverted https://github.com/pytorch/pytorch/pull/133712 on behalf of https://github.com/ZainRizvi due to breaking main windows cpu tests - this stack still causes that windows test to fail ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2299776241))
2024-08-20 21:14:45 +00:00
88ead0afc6 Revert "[dynamo] simplify polyfill registration for builtins.all and builtins.any (#133769)"
This reverts commit 178e8563b8a44243a6f69f3d257d9a3aab71b2c5.

Reverted https://github.com/pytorch/pytorch/pull/133769 on behalf of https://github.com/ZainRizvi due to breaking main windows cpu tests - this stack still causes that windows test to fail ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2299776241))
2024-08-20 21:14:45 +00:00
3fa874abbe Revert "[dynamo] simplify implementation for functools.reduce (#133778)"
This reverts commit 37b4bc60a4ec65858044983a36577912fb9b4651.

Reverted https://github.com/pytorch/pytorch/pull/133778 on behalf of https://github.com/ZainRizvi due to breaking main windows cpu tests - this stack still causes that windows test to fail ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2299776241))
2024-08-20 21:14:45 +00:00
98e6a1d8ff Revert "[dynamo] simplify implementation for builtins.sum (#133779)"
This reverts commit 3f58a8051a92470dbd254859322a7eb085a8f243.

Reverted https://github.com/pytorch/pytorch/pull/133779 on behalf of https://github.com/ZainRizvi due to breaking main windows cpu tests - this stack still causes that windows test to fail ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2299776241))
2024-08-20 21:14:44 +00:00
2540ee372a Revert "[dynamo][itertools] support itertools.tee (#133771)"
This reverts commit 28ce3c0227830c78c0b5d4ec592f5c3879bc61a3.

Reverted https://github.com/pytorch/pytorch/pull/133771 on behalf of https://github.com/ZainRizvi due to breaking main windows cpu tests - this stack still causes that windows test to fail ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2299776241))
2024-08-20 21:14:44 +00:00
ccc0aa69ce [ONNX] Remove torch.onnx._export (#133824)
- Remove the deprecated torch.onnx._export function
- Remove test/onnx/test_export_modes.py because export modes are no longer supported
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133824
Approved by: https://github.com/titaiwangms
2024-08-20 20:54:48 +00:00
b03381cac2 [dynamo] support cls.__flags__ (#133970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133970
Approved by: https://github.com/jansel
ghstack dependencies: #133969
2024-08-20 20:03:31 +00:00
5229b52bf2 [dynamo] support cls.__base__ (#133969)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133969
Approved by: https://github.com/jansel
2024-08-20 20:03:31 +00:00
bb0bf09aff [easy] skip test_sdpa_autocast on windows (#134009)
The test is failing because torch.compile doesn't work on Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134009
Approved by: https://github.com/YuqingJ, https://github.com/Skylion007, https://github.com/ZainRizvi
2024-08-20 19:51:55 +00:00
28ce3c0227 [dynamo][itertools] support itertools.tee (#133771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133771
Approved by: https://github.com/jansel
ghstack dependencies: #133712, #133769, #133778, #133779
2024-08-20 19:48:57 +00:00
3f58a8051a [dynamo] simplify implementation for builtins.sum (#133779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133779
Approved by: https://github.com/jansel
ghstack dependencies: #133712, #133769, #133778
2024-08-20 19:48:57 +00:00
37b4bc60a4 [dynamo] simplify implementation for functools.reduce (#133778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133778
Approved by: https://github.com/jansel
ghstack dependencies: #133712, #133769
2024-08-20 19:48:57 +00:00
178e8563b8 [dynamo] simplify polyfill registration for builtins.all and builtins.any (#133769)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133769
Approved by: https://github.com/jansel
ghstack dependencies: #133712
2024-08-20 19:48:57 +00:00
71dd52f51a [RFC][dynamo] add decorator to register polyfill for unsupported C++ function to avoid graph break (#133712)
Add decorator `torch.compiler.substitute_in_graph` to register polyfill for unsupported C++ function to avoid graph break. This API provides an official way to add support for dynamo for third-party C extensions. Also, it can be used to simplify our implementation for `torch._dynamo.polyfill`.

5ee070266f/torch/_dynamo/variables/builtin.py (L97-L107)

Example:

```python
>>> import operator
>>> import torch
>>> operator.indexOf([1, 2, 3, 4, 5], 3)
2

>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
Unsupported: ...

>>> @torch.compiler.substitute_in_graph(operator.indexOf)
... def indexOf(sequence, x):
...     for i, item in enumerate(sequence):
...         if item is x or item == x:
...             return i
...     raise ValueError("sequence.index(x): x not in sequence")

>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133712
Approved by: https://github.com/jansel
2024-08-20 19:48:57 +00:00
49430bfd5c [DeviceMesh] Add a _MeshEnv attr to record the mapping of flatten mesh_dim_name to its mesh dim index in root mesh (#133838)
```
from torch.distributed.device_mesh import init_device_mesh

# suppose we have a 3d mesh
mesh_3d = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"))
dp_cp_mesh = mesh_3d["dp", "cp"]._flatten()

"""
then we would have
flatten_name_to_root_dims[mesh_3d]: {
    "dp_cp": (0, 1)
}
"""
```

We need this information to validate the order of a mesh slice that includes a flattened mesh dim.
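
The mapping can be illustrated without any distributed setup. This is a pure-Python sketch under stated assumptions: `flattened_root_dims` is a hypothetical helper, and joining the dim names with `_` is inferred only from the `"dp_cp"` example above, not from the `_MeshEnv` implementation:

```python
from typing import Dict, Tuple


def flattened_root_dims(
    root_dim_names: Tuple[str, ...],
    sliced_dim_names: Tuple[str, ...],
) -> Dict[str, Tuple[int, ...]]:
    """Map the flattened dim name to the root-mesh dim indices it covers."""
    # Assumption: the flattened name is the sliced names joined by "_".
    flat_name = "_".join(sliced_dim_names)
    root_dims = tuple(root_dim_names.index(name) for name in sliced_dim_names)
    return {flat_name: root_dims}


mapping = flattened_root_dims(("dp", "cp", "tp"), ("dp", "cp"))
```

With the root indices recorded, a later slice such as `("dp_cp", "tp")` could be checked for increasing root dim order.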

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133838
Approved by: https://github.com/fegin
2024-08-20 19:43:45 +00:00
c188d419db [BE] [EZ] Allow linux-build workflows to run on the default runner type (#133640)
Replace usage of `runner` with the new `runner_prefix` input, which allows the workflows to use the default runner type (linux.2xlarge) specified by the reusable workflow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133640
Approved by: https://github.com/clee2000, https://github.com/jeanschmidt, https://github.com/malfet
2024-08-20 19:37:14 +00:00
81a822ddc9 Back out "[1/N] Fix clang-tidy warnings in inductor (#131979)" (#133922)
Summary:
Original commit changeset: cc9392e5fce2

Original Phabricator Diff: D60464909

Differential Revision: D61501052

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133922
Approved by: https://github.com/22quinn
2024-08-20 19:16:29 +00:00
49f6ea6dd9 Revert "[report_exportability] Avoid re-exporting duplicated modules (#133930)"
This reverts commit 278bc985d71f1ee09a499fba2ea5032b7baf2567.

Reverted https://github.com/pytorch/pytorch/pull/133930 on behalf of https://github.com/izaitsevfb due to breaks lint ([comment](https://github.com/pytorch/pytorch/pull/133930#issuecomment-2299513046))
2024-08-20 18:44:09 +00:00
43f78bf37a [MPS] Gather sliced inputs to batch norm (#133610)
This PR removes the `executeGatherOp` flag from batch norm in favor of relying on the logic in 4aa66f68a8/aten/src/ATen/native/mps/OperationUtils.mm (L372) to decide if gathering is necessary.

It's not the most efficient way to solve this issue, but it assures correctness for sliced inputs.

### Performance impact

#### With fix

```
python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)"
100 loops, best of 5: 282 usec per loop

python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])"
100 loops, best of 5: 448 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)"
1000 loops, best of 5: 705 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])"
1000 loops, best of 5: 1.11 msec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x)"
1000 loops, best of 5: 7.16 msec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x[5:])"
1000 loops, best of 5: 11.7 msec per loop
```

#### Without fix

```
python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)"
100 loops, best of 5: 284 usec per loop

python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])"
100 loops, best of 5: 265 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)"
1000 loops, best of 5: 715 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])"
1000 loops, best of 5: 675 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x)"
1000 loops, best of 5: 7.19 msec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x[5:])"
1000 loops, best of 5: 7.13 msec per loop
```

Please feel free to push back or request changes.

Fixes #133520
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133610
Approved by: https://github.com/malfet
2024-08-20 18:24:48 +00:00
278bc985d7 [report_exportability] Avoid re-exporting duplicated modules (#133930)
Summary:
Skip re-exporting modules whose type has already been exported, to speed up the exportability tests.

Real models contain many duplicated modules, and duplicates mostly have the same export issues.

Test Plan: Existing CI

Differential Revision: D61504630

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133930
Approved by: https://github.com/angelayi

Co-authored-by: bearzx <bearzx@fb.com>
2024-08-20 18:20:49 +00:00
333890b701 Enable CUDA 12.4.1 (#132202)
Trying to keep a record of the steps before I lose track of it.

- 1st Commit: Similar to https://github.com/pytorch/builder/pull/1720
- 2nd Commit:  Update CUDA 12.4 CI CUDA versions from 12.4.0 to 12.4.1 mapping to changes in https://github.com/pytorch/pytorch/pull/125944/files
- 3rd Commit: update for aarch64 install_cuda_aarch64.sh docker step
- 4th Commit: aaa456e3e6 Related https://github.com/pytorch/pytorch/pull/121684
- Synchronization point: Meta helps uploading pypi cuda dependencies specified in .github/scripts/generate_binary_build_matrix.py
- The above pypi upload is done (thanks Andrey!), restarted jobs like https://github.com/pytorch/pytorch/actions/runs/10188203670/job/28369471321
- 77532344e3, use temporary docker containers (generated from a previous successful container build). If merged, these containers would be rebuilt, therefore testing them now.  (5th commit)
- 6th commit 5f93c625b5: revert the 5th commit. Update, done but have to debug seemingly irrelevant failures (rocm/xpu/mps)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132202
Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/atalman
2024-08-20 17:52:50 +00:00
e41b520ee3 [3/N] Refactor FR script - Add a processor module (#133933)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133933
Approved by: https://github.com/c-p-i-o
ghstack dependencies: #133927, #133929
2024-08-20 17:36:49 +00:00
bce0caba78 [BE]: Update Typeguard to TypeIs for better type inference (#133814)
Uses TypeIs instead of TypeGuard for better inference. See https://peps.python.org/pep-0742/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133814
Approved by: https://github.com/ezyang
2024-08-20 17:19:57 +00:00
fbf3fc2a30 [inductor] Use int64_t as index type for all platfroms 4 (#133892)
This PR is parallel to https://github.com/pytorch/pytorch/pull/133819 and adds changes that address @jansel's comments.
1. For `torch/_inductor/codegen/cpp_wrapper_cpu.py`, revert to the original code that appends LL on MacOS and Windows: bdc14ad89a
2. For `torch/_inductor/codegen/cpp_utils.py`, append LL on MacOS and Windows for large constants, and fix its UTs: 3a56b76ce0

------------------------------
Another solution for https://github.com/pytorch/pytorch/pull/133615: use `int64_t` as the index type for all platforms.

### Development notes:
The mentioned PR (https://github.com/pytorch/pytorch/pull/133615) fixes the index type not matching the `parse_arg` argument types. In review, @jansel suggested that we unify `INDEX_TYPE` across all platforms.
The current code is cumbersome:
```python
INDEX_TYPE = "int64_t" if _IS_WINDOWS else "long"
```

So, I made some attempts to unify `INDEX_TYPE` as either `long` or `int64_t`.
Using `long` as the index type: https://github.com/pytorch/pytorch/pull/133768
Using `int64_t` as the index type: https://github.com/pytorch/pytorch/pull/133782

After that, we still had to discuss which type to select as the final solution.
![image](https://github.com/user-attachments/assets/b23fa577-2d40-4bd6-b934-fb7994fe0bb0)

The `long` type has a different definition and size across OSs and compilers. So @jansel decided that we should select `int64_t` for all platforms, and I continued my work based on https://github.com/pytorch/pytorch/pull/133782.

https://github.com/pytorch/pytorch/pull/133782 still has two issues:
1. `std::min`/`std::max` could not match function instances by argument types. This was fixed and validated in PR: https://github.com/pytorch/pytorch/pull/133812
2. The CUDA `TestMemoryPlanning::test_cpp_wrapper` failure caused by the wrong index type. This is fixed in this PR.

So, this PR is the final solution.

### Changes:
**1. Use `int64_t` as the index type for all OSs: `Windows`, `Linux` and `MacOS`.**
**2. Use static_cast<int64_t>(`constant`) to convert constants passed to `div_floor_integer` to its argument type (`int64_t`).**
**3. Update the `parse_arg` function signature to `int64_t`, following the index type.**
**4. Append double L (`LL`) to constants on Windows and MacOS, because their int64_t is long long.**
**5. Fix the `std::min/std::max` type mismatch by static_cast to `INDEX_TYPE`.**
**6. Fix UTs, including: cuda `TestMemoryPlanning::test_cpp_wrapper`, and `test_indexing.py`.**
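
Changes 4 and 5 can be sketched as a small string-rendering example. The helper names below (`render_index_constant`, `render_min`) are hypothetical and are not the inductor codegen functions; the plain `L` suffix on other platforms is an assumption made for illustration only:

```python
import sys

INDEX_TYPE = "int64_t"  # one index type for every platform


def needs_ll_suffix(platform: str) -> bool:
    # Per the change list: int64_t is `long long` on Windows and MacOS.
    return platform in ("win32", "darwin")


def render_index_constant(value: int, platform: str = sys.platform) -> str:
    """Render an integer literal for generated C++ (hypothetical helper)."""
    # Assumption: other platforms keep a single `L` suffix.
    suffix = "LL" if needs_ll_suffix(platform) else "L"
    return f"{value}{suffix}"


def render_min(a: str, b: str) -> str:
    # Cast both operands so std::min sees a single argument type.
    return (
        f"std::min(static_cast<{INDEX_TYPE}>({a}), "
        f"static_cast<{INDEX_TYPE}>({b}))"
    )
```

Casting both operands avoids the template deduction failure that occurs when `std::min` receives, for example, a `long` and a `long long`.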

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133892
Approved by: https://github.com/jansel
2024-08-20 16:54:12 +00:00
3caf3baabb [inductor] enable inductor backend for dynamo on Windows. (#133921)
Changes:
Enable inductor backend for dynamo on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133921
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-08-20 16:46:19 +00:00
cyy
c3d02fa390 [Reland2] Update NVTX to NVTX3 (#109843)
Another attempt to update NVTX to NVTX3. We now avoid changing the NVTX header inclusion of existing code. The advantage of NVTX3 over NVTX is that it is a header-only library, so linking with NVTX3 can greatly simplify our CMake and other build scripts for finding libraries in user environments. In addition, NVTX is still present in the latest CUDA versions, but no longer as a compiled library: it is now header-only, which is why there isn't a .lib file anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109843
Approved by: https://github.com/peterbell10, https://github.com/eqy

Co-authored-by: Ivan Zaitsev <108101595+izaitsevfb@users.noreply.github.com>
2024-08-20 16:33:26 +00:00
33f1ee036e [dynamo][user-defined] Simplify call_hasattr (#133935)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133935
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #133745, #133747, #133746, #133799, #133800
2024-08-20 16:27:44 +00:00
cyy
8d93fe510e Remove NestedTensorFactories.h (#133809)
Since it has no code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133809
Approved by: https://github.com/ezyang
2024-08-20 16:16:30 +00:00
187d55018a [BE] Fix MYPY issues (#133872)
Fix some mypy issues that have crept in to the trunk.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133872
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2024-08-20 16:12:04 +00:00
52dfe99dbf Skip test_custom_op_add_abi_compatible_cpu_with_stack_allocation internally (#133704)
Summary: This test is segfaulting internally. Skip for now so we can get the internal tests green.

Differential Revision: D61399618

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133704
Approved by: https://github.com/desertfire
2024-08-20 16:01:39 +00:00
3a2f7192c3 Revert "return state dict without optimized module (#132626)"
This reverts commit e37eef8a7bd5915fa2961d688fd8b02df5cc5fd7.

Reverted https://github.com/pytorch/pytorch/pull/132626 on behalf of https://github.com/ZainRizvi due to Sorry but it seems like this PR broke trunk. distributed/checkpoint/test_state_dict.py::TestStateDict::test_fsdp2 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10458281674/job/28969008325) [HUD commit link](da69a28c6f) ([comment](https://github.com/pytorch/pytorch/pull/132626#issuecomment-2299190664))
2024-08-20 15:54:54 +00:00
f2b57d8831 Fix torch._C submodules population (#133919)
This fixes regression introduced by https://github.com/pytorch/pytorch/pull/132216 that on some Python runtimes failed with
```
>   from torch._C._dynamo.guards import GlobalStateGuard
E   ModuleNotFoundError: No module named 'torch._C._dynamo.guards'; 'torch._C._dynamo' is not a package

c:\users\malfet\git\pytorch\torch\_dynamo\convert_frame.py:28: ModuleNotFoundError
```

Simplify it by always registering submodules by their primary name and by not adding submodules that are not part of the same namespace as the parent. Otherwise a module can be registered by an alias rather than by its primary name.
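
The registration rule can be sketched with plain `types.ModuleType` objects. This is an illustration of the idea only; `register_submodules` is a hypothetical helper, not the code in `torch/__init__.py`:

```python
import sys
import types


def register_submodules(parent: types.ModuleType) -> list:
    """Register module attributes of `parent` in sys.modules by their own __name__."""
    registered = []
    prefix = parent.__name__ + "."
    for attr in list(vars(parent).values()):
        if not isinstance(attr, types.ModuleType):
            continue
        # Skip modules outside the parent's namespace (for example, aliases).
        if not attr.__name__.startswith(prefix):
            continue
        if attr.__name__ not in sys.modules:
            # Key by the primary name, never by the attribute name.
            sys.modules[attr.__name__] = attr
            registered.append(attr.__name__)
            registered.extend(register_submodules(attr))
    return registered
```

Keying by `__name__` means a statement like `from pkg.sub.leaf import x` resolves even when `leaf` is also reachable through an alias attribute elsewhere.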

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133919
Approved by: https://github.com/atalman, https://github.com/izaitsevfb, https://github.com/XuehaiPan, https://github.com/albanD, https://github.com/Skylion007
2024-08-20 15:38:32 +00:00
b02695d65f [export] training ir migration, fix export_rle_model (#133937)
Summary:
- exir.capture + to_edge is deprecated. We need to use export + to_edge.
- Fix the quantization pass to be compatible with the new export IR. In the quantization pass, some nodes might have side effects, so they have no users but are still not removed by the DCE pass. We need to account for that.
- Now export_rle_model works with the default `capture_pre_autograd_graph`; it should also work with the new training IR.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/export:export_rle_model  -- -r export_rle_model
```

Differential Revision: D61485834

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133937
Approved by: https://github.com/tugsbayasgalan
2024-08-20 15:35:25 +00:00
6590f4fb0e [CD] Enable python 3.13 for xpu nightly build (#133670)
Enable python 3.13 for the XPU nightly build; this depends on https://github.com/pytorch/pytorch/pull/133454 landing. Also update the xpu nightly wheel test env.

Works for https://github.com/pytorch/pytorch/issues/114850
Fixes #130543
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133670
Approved by: https://github.com/atalman, https://github.com/malfet
2024-08-20 15:05:20 +00:00
36376efd06 [2/N] Refactor FR script - add a loader module (#133929)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133929
Approved by: https://github.com/c-p-i-o
ghstack dependencies: #133927
2024-08-20 14:27:40 +00:00
2bd02e0c82 Revert "[RFC][dynamo] add decorator to register polyfill for unsupported C++ function to avoid graph break (#133712)"
This reverts commit 641724ed1daad1e6fc2525cc6858d199e576d5cd.

Reverted https://github.com/pytorch/pytorch/pull/133712 on behalf of https://github.com/jeanschmidt due to breaking main windows cpu tests - reverting them all, so we can identify the culprit with more calmness ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2298528797))
2024-08-20 10:34:41 +00:00
91fd270535 Revert "[dynamo] simplify polyfill registration for builtins.all and builtins.any (#133769)"
This reverts commit 59ca56e56ca3e2f6dd80db57079725cf61f06810.

Reverted https://github.com/pytorch/pytorch/pull/133769 on behalf of https://github.com/jeanschmidt due to breaking main windows cpu tests - reverting them all, so we can identify the culprit with more calmness ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2298528797))
2024-08-20 10:34:41 +00:00
5109c5ef23 Revert "[dynamo] simplify implementation for functools.reduce (#133778)"
This reverts commit ff9be0eda99c59cdbcc269853168657de93043c7.

Reverted https://github.com/pytorch/pytorch/pull/133778 on behalf of https://github.com/jeanschmidt due to breaking main windows cpu tests - reverting them all, so we can identify the culprit with more calmness ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2298528797))
2024-08-20 10:34:41 +00:00
241df7e7f8 Add multi-cache autotune test (#133868)
Summary:
The existing tests didn't cover a case where we had multiple autotunes in a single graph.  Add a test to demonstrate that case.

Also added a test dependency on redis and removed the "fake redis" from the previous PR (#133579)

Test Plan: unit tests

Reviewed By: oulgen

Differential Revision: D61178861

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133868
Approved by: https://github.com/oulgen
2024-08-20 10:26:45 +00:00
11af423eca [SymmetricMemory] make buffer_ptrs_dev, signal_pad_ptrs_dev, buffer_size, and signal_pad_size accessible in python (#133680)
This allows us to experiment with creative applications with triton.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133680
Approved by: https://github.com/Chillee
2024-08-20 10:15:35 +00:00
08b5e07e6c Revert "[dynamo] simplify implementation for builtins.sum (#133779)"
This reverts commit 1fdeb4e32918017ee3a712e0bba86e8482fa293b.

Reverted https://github.com/pytorch/pytorch/pull/133779 on behalf of https://github.com/jeanschmidt due to breaking main windows cpu tests ([comment](https://github.com/pytorch/pytorch/pull/133779#issuecomment-2298285206))
2024-08-20 08:33:29 +00:00
68570fca69 Revert "Add MaskedTensor support to *_like API (#128637)"
This reverts commit 8de56e29581fa2706d44f8c4b0827830c9351470.

Reverted https://github.com/pytorch/pytorch/pull/128637 on behalf of https://github.com/jeanschmidt due to Introduced API linting errors ([comment](https://github.com/pytorch/pytorch/pull/128637#issuecomment-2298270307))
2024-08-20 08:26:28 +00:00
42097f0ec1 Revert "[BE]: Update Typeguard to TypeIs for better type inference (#133814)"
This reverts commit cf60fe53a83bafec0857d5b49c2054de6ba4cddc.

Reverted https://github.com/pytorch/pytorch/pull/133814 on behalf of https://github.com/jeanschmidt due to Broke 12k internal signals/jobs, @ezyang please help get those changes merged. More details check D61488368 ([comment](https://github.com/pytorch/pytorch/pull/133814#issuecomment-2298210309))
2024-08-20 08:02:49 +00:00
25d5a815f7 [Dynamo] Guard on torch function mode global state (#133135)
Adds guards checking whether torch function mode is in the all disabled state.

There are three torch function enablement states:
* All torch function disabled (modes + subclasses)
* Torch function subclass disabled
* All enabled

We now have guards checking whether the state is All enabled and whether the state is All disabled.
Each of the three states above maps to a unique pair of values of these two flags.
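
The uniqueness claim is easy to check by enumeration. A toy model, assuming a hypothetical `TorchFunctionState` enum and `guard_flags` helper that do not exist in PyTorch:

```python
from enum import Enum


class TorchFunctionState(Enum):
    ALL_DISABLED = "all_disabled"            # modes + subclasses disabled
    SUBCLASS_DISABLED = "subclass_disabled"  # only subclasses disabled
    ALL_ENABLED = "all_enabled"


def guard_flags(state: TorchFunctionState) -> tuple:
    """Return (is_all_enabled, is_all_disabled) for a state."""
    return (
        state is TorchFunctionState.ALL_ENABLED,
        state is TorchFunctionState.ALL_DISABLED,
    )


# Each state yields a different pair, so two boolean guards identify the state.
pairs = {state: guard_flags(state) for state in TorchFunctionState}
```

The subclass-disabled state is the one where both flags are false.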

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133135
Approved by: https://github.com/anijain2305
ghstack dependencies: #133130, #133729, #133131, #133132, #133133, #133134, #133136
2024-08-20 07:15:04 +00:00
48ee0984ac Add C API to return all torch function disablement status (#133136)
This PR adds a C function to check if all torch function is disabled.
Recall that there are three torch function enablement states:
* All disabled
* Torch Function Subclass disabled
* All enabled

The API before this change provides two functions:
* `_is_torch_function_enabled` - returns True iff the current TF state is All enabled
* `_is_torch_function_mode_enabled` - returns True iff the state is not All disabled and the torch function mode stack is non-empty.

The crux of why a new API is needed is the following: if dynamo enters a frame with the torch function mode stack empty and `_is_torch_function_enabled` == False, it is impossible to determine, after a new mode is pushed, whether we should enter the mode or not. This is because we don't know if the enablement state is All disabled or only subclass disabled. Adding this API to check whether All disabled is True allows us to disambiguate this case.

In the next PR, Dynamo InstructionTranslator will have clearer flags than the underlying C API:
* A flag to indicate if subclasses are disabled (ie All disabled or Subclass Disabled is the current state)
* A flag to indicate if modes are disabled (ie if All disabled is the current state)
* A symbolic stack which can be checked if any modes are present
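
The ambiguity described above can be reproduced with a toy model. The functions below only mirror the prose descriptions of the two existing checks; `old_api_view` and `translator_flags`, and the flag names they return, are hypothetical:

```python
def old_api_view(state: str, mode_stack_len: int) -> tuple:
    """What the two pre-existing checks report, per the description above."""
    is_enabled = state == "all_enabled"
    is_mode_enabled = state != "all_disabled" and mode_stack_len > 0
    return (is_enabled, is_mode_enabled)


def translator_flags(state: str) -> dict:
    """Hypothetical names for the clearer flags planned for the next PR."""
    return {
        "subclasses_disabled": state in ("all_disabled", "subclass_disabled"),
        "modes_disabled": state == "all_disabled",
    }


# With an empty mode stack the old checks cannot tell these two states apart.
ambiguous = old_api_view("all_disabled", 0) == old_api_view("subclass_disabled", 0)
```

A direct "all disabled" query separates the two cases, which is what the derived `modes_disabled` flag relies on.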

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133136
Approved by: https://github.com/bdhirsh
ghstack dependencies: #133130, #133729, #133131, #133132, #133133, #133134
2024-08-20 07:15:04 +00:00
d97ca968cd [Dynamo] Test intermediate tf mode construction (#133134)
Ensures that constructing a torch function mode in the middle of a function is supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133134
Approved by: https://github.com/williamwen42
ghstack dependencies: #133130, #133729, #133131, #133132, #133133
2024-08-20 07:14:56 +00:00
626acaeb16 [Dynamo] Support torch function stack len (#133133)
Adds support for `torch._C._len_torch_function_stack()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133133
Approved by: https://github.com/williamwen42
ghstack dependencies: #133130, #133729, #133131, #133132
2024-08-20 07:14:52 +00:00
d1fdf984c3 [Dynamo] Support push torch function mode stack (#133132)
This PR adds support for `torch._C._push_on_torch_function_stack()` by updating `torch.py` to push onto the symbolic torch function mode stack when a push is encountered. The same side effects infra used in the previous PR is used to track the mutation of the torch function mode stack and add bytecode to update it if it is mutated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133132
Approved by: https://github.com/williamwen42
ghstack dependencies: #133130, #133729, #133131
2024-08-20 07:14:47 +00:00
c0b4aaa8c5 [Dynamo] Support pop torch function mode stack (#133131)
This PR adds support for tracing `torch._C._pop_torch_function_stack()` without graph breaking. To verify the state change, it also adds replay of mutations to the torch function mode stack via side_effects, appending supplemental bytecode as we do for other python mutable objects.

Details:
To represent the torch function mode stack symbolically a deque field is added to the instruction translator. When the InstructionTranslator is initialized, all modes are read from the current torch function mode stack, and stashed in a global weak ref for later access (using existing sources) without needing to push/pop the python/cpp torch function mode stack.

During tracing, when `_pop_torch_function_stack` is encountered, a value is popped from this deque and the variable tracker representing the mode is returned. To ensure the true torch function mode stack matches this state, `TorchFunctionModeStackVariable`, a singleton, is marked as mutated. This adds it to side effects, and during final codegen side effects will codegen a call to a python helper that updates the python torch function mode stack.
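
The trace-then-replay flow can be sketched with a plain list standing in for the real mode stack. `SymbolicModeStack` is a hypothetical toy class, not Dynamo's implementation, and `replay` stands in for the generated helper call:

```python
from collections import deque


class SymbolicModeStack:
    """Toy stand-in for the translator's symbolic torch function mode stack."""

    def __init__(self, real_stack: list):
        # Snapshot the real stack once, at the start of tracing.
        self.modes = deque(real_stack)
        self.mutated = False

    def pop(self):
        # Tracing a pop only touches the symbolic copy and records the mutation.
        self.mutated = True
        return self.modes.pop()

    def replay(self, real_stack: list) -> None:
        # Stand-in for the codegen'd helper that syncs the real stack afterwards.
        if self.mutated:
            real_stack[:] = list(self.modes)
```

Deferring the update keeps tracing free of side effects on the real stack until the generated code actually runs.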

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133131
Approved by: https://github.com/jansel
ghstack dependencies: #133130, #133729
2024-08-20 07:14:42 +00:00
f147349568 Fix DeviceContext bug (#133729)
Fixes https://github.com/pytorch/pytorch/issues/133666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133729
Approved by: https://github.com/bdhirsh
ghstack dependencies: #133130
2024-08-20 07:14:37 +00:00
09e366cb57 [Dynamo] Add torch function mode stack guard to dynamo (#133130)
This PR adds a guard on the torch function mode stack state at the beginning of tracing. The way this is implemented is via a new leaf guard which is passed the initial stack state at construction and compares it to the stack state at the time the guard is run.

Details:
The stack state is extracted via popping all modes, appending them to a list, and pushing all modes back. This list is stored on the output graph and read during guard construction to pass to the stack mode guard. There the length and types of the modes are recorded. Next time the guard is run it compares this recorded state to the current mode stack state.

To implement this in python a helper function was added to utils.py and this is used if cpp guards are not enabled.
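
A rough sketch of what such a guard checks, assuming the stack is visible as a plain list. `ModeStackGuard`, `ModeA` and `ModeB` are hypothetical names; this is neither the helper in utils.py nor the cpp leaf guard.

```python
class ModeStackGuard:
    """Hypothetical pure-Python guard; not Dynamo's actual class."""

    def __init__(self, initial_stack):
        # Record only what the description says is compared: length and types.
        self.expected_types = [type(mode) for mode in initial_stack]

    def check(self, current_stack):
        if len(current_stack) != len(self.expected_types):
            return False
        return all(
            type(mode) is expected
            for mode, expected in zip(current_stack, self.expected_types)
        )


class ModeA:
    pass


class ModeB:
    pass


guard = ModeStackGuard([ModeA(), ModeB()])
same_shape = guard.check([ModeA(), ModeB()])   # new instances, same types
reordered = guard.check([ModeB(), ModeA()])    # same length, different order
shorter = guard.check([ModeA()])               # a mode was popped
```

Because only types are recorded, fresh instances of the same mode classes still pass the guard, while a reordered or shorter stack forces a recompile.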

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133130
Approved by: https://github.com/anijain2305
2024-08-20 07:14:33 +00:00
7492da804f Mark disabled tests as fixed (#133940)
Fixes #132552, #133900, #133901, #133902, #133903, #133904, #133905, #133906, #133908, #133910, #133911, #133912, #133913, #133914, #133915, #133916, #133917

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133940
Approved by: https://github.com/oulgen
2024-08-20 06:58:11 +00:00
e8d3c4be36 [dynamo][reland][inline-inbuilt-nn-modules] Mark attributes of nn mod… (#133714)
Relands https://github.com/pytorch/pytorch/pull/132539
Relands https://github.com/pytorch/pytorch/pull/132736

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133714
Approved by: https://github.com/jansel
2024-08-20 05:57:52 +00:00
f08d484702 Add itertools.islice support in dynamo (#133893)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133893
Approved by: https://github.com/oulgen
2024-08-20 05:55:53 +00:00
b6891f4002 [1/N] Refactor fr trace script to make it modulized - config (#133927)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133927
Approved by: https://github.com/c-p-i-o
2024-08-20 05:47:17 +00:00
15addb00e6 Update test_control_flow.py to device-agnostic. (#133843)
Fixes #133841

This PR makes the `test_pointwise_associative_scan_CUDA_flip` also work on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133843
Approved by: https://github.com/EikanWang, https://github.com/desertfire, https://github.com/malfet, https://github.com/jansel, https://github.com/atalman
2024-08-20 05:05:43 +00:00
994fcb9acd Killswitch based rollout for flight recorder (#133237)
Summary: Defaulting TORCH_NCCL_DUMP_ON_TIMEOUT to "true" and adding a killswitch in case we need to kill this feature in production.
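
A sketch of how a default-on flag can be combined with a killswitch, in Python for illustration only. The real logic lives in the C++ process group code and is not shown in this commit message; `env_flag`, `should_dump_on_timeout` and the `killswitch_engaged` argument are assumptions.

```python
import os


def env_flag(name, default, environ=None):
    """Hypothetical helper: read a boolean flag, falling back to `default`."""
    environ = os.environ if environ is None else environ
    raw = environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in ("1", "true", "yes", "y", "on")


def should_dump_on_timeout(environ, killswitch_engaged):
    # On by default, can be disabled per process through the environment
    # variable, and can be disabled everywhere through the killswitch.
    if killswitch_engaged:
        return False
    return env_flag("TORCH_NCCL_DUMP_ON_TIMEOUT", True, environ)
```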

Test Plan: Tests pass manually but need further testing before this is rolled out fully everywhere.

Differential Revision: D61136320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133237
Approved by: https://github.com/c00w
2024-08-20 04:27:55 +00:00
32f57ac627 [BE] Fix lint issues in qlinear_prepack.cpp (#133797)
Summary: This diff fixes many lint issues in qlinear_prepack.cpp. I am fixing them because I want to add more ops/funcs to this file later.

Test Plan: Sandcastle

Differential Revision: D61425436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133797
Approved by: https://github.com/Skylion007
2024-08-20 04:23:25 +00:00
b0bafd2be5 remove tensor weak ref from constraint target (#133890)
Summary: `_ConstraintTarget` is an internal data structure that has some redundancy: tensors are identified by their id but also carry a weak reference. The weak reference was probably useful a year back but everything is done with ids right now, and the lifetime of these tensors ensures that using their ids is OK.

Test Plan: existing tests

Differential Revision: D61488816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133890
Approved by: https://github.com/tugsbayasgalan
2024-08-20 03:03:05 +00:00
188cb5e67b Bump scikit-image to 0.22.0 (#133932)
Fixes: https://github.com/pytorch/pytorch/issues/133926

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133932
Approved by: https://github.com/malfet
2024-08-20 02:37:16 +00:00
6c82a1c68c [AOTI] Introduce DeferredCudaKernelLine for cuda cpp wrapper (#129135)
Summary: When generating CUDA kernel load and launch, certain Triton kernel meta data are needed, but those meta data only exist after kernel auto-tune is done. DeferredCudaKernelLine is a deferred line which can backfill a string template after kernel auto-tune. This is to prepare for one-pass AOTI codegen implementation.
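
A simplified sketch of the deferred-line idea, assuming the wrapper is built as a list of lines. The `DeferredLine` class, the template string and the metadata values are made up for illustration; this is not the `DeferredCudaKernelLine` implementation.

```python
class DeferredLine:
    """Hypothetical deferred line: a template filled in once metadata exists."""

    def __init__(self, kernel_name, template, keys):
        self.kernel_name = kernel_name
        self.template = template
        self.keys = keys

    def resolve(self, kernel_metadata):
        # Look up the values that only exist after auto-tuning has finished.
        params = kernel_metadata[self.kernel_name]
        return self.template % tuple(params[key] for key in self.keys)


lines = [
    "// wrapper prologue",
    DeferredLine(
        "kernel0",
        "launch(kernel0, num_warps=%s, shared_mem=%s);",
        ("num_warps", "shared_mem"),
    ),
]

# Metadata becomes available later, after auto-tuning (values are made up).
metadata = {"kernel0": {"num_warps": 4, "shared_mem": 2048}}

resolved = [
    line.resolve(metadata) if isinstance(line, DeferredLine) else line
    for line in lines
]
```

Keeping the placeholder in the line list means the wrapper can be emitted in a single pass and patched up at the end.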

Differential Revision: [D61018114](https://our.internmc.facebook.com/intern/diff/D61018114)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129135
Approved by: https://github.com/angelayi
2024-08-20 02:15:44 +00:00
cyy
c51fc7e98e Enable clang-tidy in aten/src/ATen/native/nested/ (#133829)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133829
Approved by: https://github.com/Skylion007
2024-08-20 01:52:15 +00:00
c6ea7b3f21 Update xpu CD used driver to rolling version (#133454)
The main purpose of this PR is to switch XPU CD to the rolling driver, in order to support AOT builds for more client GPUs and to enable Kineto. We also plan to enable Python 3.13 for XPU CD.

Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133454
Approved by: https://github.com/atalman
2024-08-20 01:45:45 +00:00
c7af2728d3 Remove aten dispatch to empty in foreach_norm cuda kernel (#133897)
Saves significant time on aten dispatch. For 2k tensors, goes from 38ms to 58us.
Should shave some overhead mentioned in https://github.com/pytorch/pytorch/issues/133586

Before PR:
![image](https://github.com/user-attachments/assets/7813f059-0f7f-4d44-a9f0-1aaf94ae849f)

After:
![image](https://github.com/user-attachments/assets/ad0855b1-2743-432a-ad31-b574c620e2fd)

script:
```
import torch

# warm up caching allocator
a = torch.rand(200, 10, device="cuda")
b = torch.rand(200, 10, device="cuda")
c = a + b
del a, b, c

ts = [torch.rand(2, 3, device="cuda") for _ in range(2000)]

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    torch._foreach_norm(ts)

print(p.key_averages().table(sort_by="cpu_time_total"))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133897
Approved by: https://github.com/albanD, https://github.com/drisspg
2024-08-20 01:27:09 +00:00
874ae854eb [c10d] Land CudaEventCache with roll out flags (#133727)
@zdevito added a cache for CudaEvent in https://github.com/pytorch/pytorch/pull/122732. And we want to productionize it with a flag in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133727
Approved by: https://github.com/shuqiangzhang, https://github.com/eqy
2024-08-20 01:08:00 +00:00
cfcb9e388d [PT2][Optimus] Add move reshape out of split stack pass (#133710)
Summary: We observed a new pattern in CMF where reshape nodes sit in the middle of the split-stack pattern, introducing massive triton_fused_stack_xxx kernels and increasing compilation time. We therefore move the reshape outside of the pattern and eliminate such split-stack nodes.

Test Plan:
# unit test
```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```

Buck UI: https://www.internalfb.com/buck2/2fb51ae7-832e-436b-b6b7-a81599390182
Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173811074971
Network: Up: 10MiB  Down: 5.4GiB  (reSessionID-96a20105-fdc6-4b4f-b465-813a84a71eba)
Jobs completed: 304618. Time elapsed: 25:24.7s.
Cache hits: 99%. Commands: 120772 (cached: 120410, remote: 357, local: 5)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0

# benchmark

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "cmf_shrink" --flow_id 587303213
```
P1529578588
graph diffing: https://www.internalfb.com/intern/diffing/?paste_number=1529577762

Counter({'pattern_matcher_nodes': 2123, 'pattern_matcher_count': 1715, 'normalization_pass': 404, 'remove_split_with_size_one_pass': 269, 'extern_calls': 193, 'merge_splits_pass': 74, 'normalization_aten_pass': 47, 'fxgraph_cache_miss': 9, 'batch_aten_mul': 6, 'scmerge_split_sections_removed': 5, 'scmerge_split_removed': 4, 'scmerge_cat_removed': 4, 'unbind_stack_pass': 4, 'batch_sigmoid': 2, 'batch_linear': 2, 'move_reshape_out_of_split_stack_pass': 2, 'batch_aten_sub': 2, 'batch_layernorm': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'split_stack_to_cats_pass': 1, 'split_cat_to_slices_pass': 1, 'batch_aten_add': 1, 'batch_relu': 1})

Trace link: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Ftest%2Fcmf_shrink.Aug_15_10_55_41_trace.json.gz&bucket=pyper_traces

The number of triton_fused_stack_xxx kernels has been reduced significantly; the trace shows that the green part becomes smaller.
{F1806406290}

# e2e
ads_dper3:68464f2dc5e849ba2670482079cecaaa
training_platform:8643db0c3453f2658aa7be7d73974ea0

baseline:
f588719502

proposal:
f592116164

Differential Revision: D61249205

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133710
Approved by: https://github.com/jackiexu1992
2024-08-20 00:50:07 +00:00
6f738d6434 Remove early exit in constant_pad_nd for export (#132679)
Summary:
Remove the early exit for padding when padding = [0, 0, 0, 0].

This prevents export from specializing when all padding=0, allowing export when all padding >= 0. Specialization will still happen for negative padding.

This change will be used to export image preprocess for multimodal models, where images of dynamic shape are padded. As images are of dynamic shape, we can't be sure if padding will be required or not. Padding is guaranteed to be non-negative.

Preprocess code: https://github.com/pytorch/torchtune/pull/1242

Note: the alternative is to wrap padding in a custom op, which isn't ideal given the custom op will contain the same impl as constant_pad_nd.

Test Plan: ci

Differential Revision: D60687727

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132679
Approved by: https://github.com/ezyang
2024-08-20 00:07:41 +00:00
9a998d98f1 Fix edge case in inductor triton clean script (#130837)
The regex in the script is too restrictive, as it excludes examples with parentheses in args, like the following:
```
triton_poi_fused_add_0.run(arg0_1.item(), arg1_1.item(), buf0, 1, grid=grid(1), stream=streamNone)
                                       ^
```
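
To illustrate the failure mode, here is a small sketch with two patterns. Both regexes are hypothetical and are not the ones used in the clean script; they only show why a pattern that forbids parentheses inside the argument list misses the call above.

```python
import re

line = (
    "triton_poi_fused_add_0.run(arg0_1.item(), arg1_1.item(), "
    "buf0, 1, grid=grid(1), stream=streamNone)"
)

# Illustrative patterns only.
restrictive = re.compile(r"^(\w+)\.run\(([^()]*)\)$")  # no parentheses allowed in args
permissive = re.compile(r"^(\w+)\.run\((.*)\)$")       # greedy up to the final ")"

restrictive_match = restrictive.match(line)
permissive_match = permissive.match(line)
kernel_name = permissive_match.group(1)
call_args = permissive_match.group(2)
```

The restrictive pattern stops at the first `(` of `arg0_1.item()` and fails to match, while the greedy one captures the whole argument list.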

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130837
Approved by: https://github.com/Chillee
2024-08-19 23:46:11 +00:00
65b3e42074 Warn on fx graph cache bypass and log it to tlparse (#133826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133826
Approved by: https://github.com/aorenste
2024-08-19 23:39:55 +00:00
2ec95ffe57 [cond] support unbacked symbool inputs (#133589)
Fixes https://github.com/pytorch/pytorch/issues/133577.

In dynamo, when we receive an unbacked symbool input, we create an unbacked symint to replace it.

The alternative approach of `not realizing the pred LazyVariable in cond` doesn't work because we need to get the proxy of the symbool input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133589
Approved by: https://github.com/ezyang
2024-08-19 23:36:48 +00:00
3f525c9d5d Upgrade nightly wheels to rocm6.2 - 2 of 2 (binaries) (#133238)
Depends on https://github.com/pytorch/pytorch/pull/132875

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133238
Approved by: https://github.com/atalman
2024-08-19 22:35:33 +00:00
2b95007d12 [dynamo] support random.Random (#133725)
Fixes the observed graph breaks in https://github.com/pytorch/pytorch/issues/121349 and https://github.com/pytorch/pytorch/issues/121350.

But there are still graph breaks since a random output is being used as a seed, e.g.
```python
import random
import torch

def fn(x):
    seed = random.randint(0, 100)
    rand = random.Random(seed)
    return x + rand.randrange(10)

opt_fn = torch.compile(fn, backend="eager", fullgraph=True)
opt_fn(torch.ones(1))
```

fails with
```
torch._dynamo.exc.InternalTorchDynamoError: UnspecializedPythonVariable() is not a constant
```

when tracing the line
```
rand = random.Random(seed)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133725
Approved by: https://github.com/jansel
2024-08-19 22:34:44 +00:00
06faa15194 [pytorch][counters] add pytorch.wait_counter.fx_codgen_and_compile (#133107)
as titled

Differential Revision: [D60876629](https://our.internmc.facebook.com/intern/diff/D60876629/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133107
Approved by: https://github.com/asiab4
2024-08-19 22:29:16 +00:00
afb3e5ed6a Add onnx and onnxscript to CI requirements (#133647)
Add onnx and onnxscript to requirements-ci.txt to allow for `test_public_bindings` and mypy to function when checking `torch.onnx._internal` code as @malfet suggested.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133647
Approved by: https://github.com/titaiwangms, https://github.com/kit1980
2024-08-19 22:15:07 +00:00
1fdeb4e329 [dynamo] simplify implementation for builtins.sum (#133779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133779
Approved by: https://github.com/jansel
ghstack dependencies: #133712, #133769, #133778
2024-08-19 22:14:34 +00:00
ff9be0eda9 [dynamo] simplify implementation for functools.reduce (#133778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133778
Approved by: https://github.com/jansel
ghstack dependencies: #133712, #133769
2024-08-19 22:14:33 +00:00
59ca56e56c [dynamo] simplify polyfill registration for builtins.all and builtins.any (#133769)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133769
Approved by: https://github.com/jansel
ghstack dependencies: #133712
2024-08-19 22:14:33 +00:00
641724ed1d [RFC][dynamo] add decorator to register polyfill for unsupported C++ function to avoid graph break (#133712)
Add decorator `torch.compiler.substitute_in_graph` to register polyfill for unsupported C++ function to avoid graph break. This API provides an official way to add support for dynamo for third-party C extensions. Also, it can be used to simplify our implementation for `torch._dynamo.polyfill`.

5ee070266f/torch/_dynamo/variables/builtin.py (L97-L107)

Example:

```python
>>> import operator
>>> operator.indexOf([1, 2, 3, 4, 5], 3)
2

>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
Unsupported: ...

>>> @torch.compiler.substitute_in_graph(operator.indexOf)
... def indexOf(sequence, x):
...     for i, item in enumerate(sequence):
...         if item is x or item == x:
...             return i
...     raise ValueError("sequence.index(x): x not in sequence")

>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133712
Approved by: https://github.com/jansel
2024-08-19 22:14:33 +00:00
8de56e2958 Add MaskedTensor support to *_like API (#128637)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128637
Approved by: https://github.com/cpuhrsch
2024-08-19 22:13:59 +00:00
14ddd932fd Add MaskedTensor support to _is_any_true (#128574)
Fixes #128557

If there is a better way to detect autograd anomalies consistently, feel free to share your ideas. This is a dirty check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128574
Approved by: https://github.com/cpuhrsch
2024-08-19 21:34:31 +00:00
432638f521 Remove useless environment in reusable workflow (#133659)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133659
Approved by: https://github.com/Skylion007
2024-08-19 20:44:17 +00:00
d131048056 Change install_triton to do git checkout, apply patch, pip install (#133878)
Fixes Docker builds: https://github.com/pytorch/pytorch/actions/runs/10458684809/job/28961048777

Follow up after https://github.com/pytorch/pytorch/pull/133694 to apply same patch to Docker build.

Rather than doing:
```
pip_install "git+${TRITON_REPO}@${TRITON_PINNED_COMMIT}#subdirectory=python"
```

We now use 4 steps: git clone, git checkout, apply patch, pip install
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133878
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
2024-08-19 20:42:50 +00:00
66d6d8b1b9 Support TORCH_COMPILER_COLLECTIVES envvar (#133696)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133696
Approved by: https://github.com/Skylion007, https://github.com/c-p-i-o
2024-08-19 20:13:04 +00:00
0d4eacb9d2 [fake tensor] unbacked symint support for binary op fast path (#133584)
Addresses https://github.com/pytorch/pytorch/issues/133525

We have an unbacked symint in `final_shape` and it's a tuple... So, add `guard_size_oblivious` to do size oblivious checks + `sym_eq` for list equality.

```
op.shape
> torch.Size([1])
final_shape
> (u0 + 1,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133584
Approved by: https://github.com/ezyang
2024-08-19 20:03:05 +00:00
565e2ea019 Scale XBLOCK in triton for pointwise (#133300)
Extends https://github.com/pytorch/pytorch/pull/128826 to also cover `triton_heuristics.pointwise`.

An example we encountered during training qwen-7b with rocm 6.1:

Note: this kernel also hit the limit of `TRITON_MAX_BLOCK['X']`, shall we increase it from 2048 to 4096?

```

import torch

aten = torch.ops.aten
inductor_ops = torch.ops.inductor
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
empty_strided_cpu = torch._C._dynamo.guards._empty_strided_cpu
empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda
reinterpret_tensor = torch._C._dynamo.guards._reinterpret_tensor
alloc_from_pool = torch.ops.inductor._alloc_from_pool

import triton
import triton.language as tl
from triton.compiler.compiler import AttrsDescriptor

from torch._inductor.runtime import triton_heuristics
from torch._inductor.runtime.hints import DeviceProperties

@triton_heuristics.pointwise(
    size_hints=[8589934592],
    filename=__file__,
    triton_meta={'signature': {0: '*bf16'}, 'device': DeviceProperties(type='hip', index=2, cc='gfx942', major=None, regs_per_multiprocessor=None, max_threads_per_multi_processor=None, multi_processor_count=None), 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1), equal_to_1=())]},
    inductor_meta={'autotune_hints': set(), 'kernel_name': 'triton_poi_fused_nll_loss_backward_0', 'mutated_arg_names': [], 'no_x_dim': False, 'num_load': 0, 'num_reduction': 0, 'backend_hash': None, 'are_deterministic_algorithms_enabled': False, 'assert_indirect_indexing': True, 'autotune_local_cache': True, 'autotune_pointwise': True, 'autotune_remote_cache': False, 'force_disable_caches': False, 'dynamic_scale_rblock': True, 'max_autotune': False, 'max_autotune_pointwise': False, 'min_split_scan_rblock': 256, 'spill_threshold': 16, 'store_cubin': False, 'is_hip': True},
    min_elem_per_thread=0
)
@triton.jit
def triton_(out_ptr0, XBLOCK : tl.constexpr):
    xoffset = tl.program_id(0).to(tl.int64) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:].to(tl.int64)
    x0 = xindex
    tmp0 = 0.0
    tl.store(out_ptr0 + (x0), tmp0, None)

import triton
import triton.language as tl
from torch._inductor.runtime.triton_heuristics import grid
from torch._C import _cuda_getCurrentRawStream as get_raw_stream

if __name__ == "__main__":
    with torch.cuda._DeviceGuard(2):
        torch.cuda.set_device(2)
        buf0 = empty_strided_cuda((32752, 151936), (151936, 1), torch.bfloat16)
        stream2 = get_raw_stream(2)
        triton_.run(buf0, grid=grid(4976207872), stream=stream2)

```
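
The sizes in the repro above can be checked by hand. This is plain arithmetic on the numbers in the script; `ceil_div` is a local helper, and 4096 is only the value floated in the note about `TRITON_MAX_BLOCK['X']`.

```python
def ceil_div(a, b):
    # Integer ceiling division without going through floats.
    return -(-a // b)


numel = 32752 * 151936                    # elements in buf0, also the grid() argument
size_hint = 1 << (numel - 1).bit_length() # next power of two, matches size_hints
programs_2048 = ceil_div(numel, 2048)     # programs launched at XBLOCK=2048
programs_4096 = ceil_div(numel, 4096)     # programs launched at XBLOCK=4096
fits_in_int32 = numel <= 2**31 - 1        # why the kernel indexes with tl.int64
```

With roughly 5 billion elements the flat index no longer fits in 32 bits, which is consistent with the `.to(tl.int64)` casts in the generated kernel.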

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133300
Approved by: https://github.com/jansel
2024-08-19 19:41:55 +00:00
fb26b84390 Update fused kernels and call _safe_softmax from SDPA (#133882)
# UPDATE:
This is take 3 of https://github.com/pytorch/pytorch/pull/131863, which was landed via co-dev but did not apply correctly.

# Summary
Changes the stance of SDPA on what to do for fully masked out rows

## Current Behavior
Several PyTorch users have expressed frustration over this issue:
- https://github.com/pytorch/pytorch/issues/41508
- https://github.com/pytorch/pytorch/issues/103749
- https://github.com/pytorch/pytorch/issues/103963

These are significant issues with extensive discussion but no satisfactory resolution. The PyTorch team's consensus, as stated here:
https://github.com/pytorch/pytorch/issues/24816#issuecomment-524415617

Can be paraphrased as follows:

When passing in fully masked out rows, attention becomes ambiguous. We have two main options:

1. Uniformly attend to all values:
   ```python
   scores[masked_out_rows] = 1 / len(row)
   out[masked_out_rows] = 1 / len(row) * value
   ```

2. Decide that attention between no queries (masked) and no keys (masked) is meaningless:
   ```python
   output[fully_masked_rows] = NaN
   ```

We went with option 2. Partially because it was easier to implement, but also people argued that users can slice the output to remove the NaNs:
``` Python
>fill_value = -float("inf")
>row0 = torch.randn(4)
>row1 = torch.tensor([fill_value for _ in range(4)])
>matrix = torch.stack([row0, row1]).requires_grad_(True)
>out = torch.softmax(matrix, 1)
>out = out[0]
>print(out)
tensor([0.5377, 0.2729, 0.0692, 0.1201])
```
Cool, problem solved. But what happens when you call backward?
```Python
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[3.0957e-08, 1.4157e-08, 7.7802e-10, 1.3713e-08],
        [       nan,        nan,        nan,        nan]])
```
Those pesky NaNs are back!

## Why do we see NaNs today?

The core of the problem is the use of the softmax function in SDPA:

```python
> row = torch.tensor([(-float("inf")) for _ in range(4)])
> torch.softmax(row, 0)
tensor([nan, nan, nan, nan])
```
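
The same effect can be reproduced without torch. The sketch below uses one common numerically stable formulation of softmax (subtract the row maximum first); it is an illustration in plain Python, not the kernel PyTorch actually runs.

```python
import math


def softmax(row):
    # Standard stable softmax: shift by the row maximum before exponentiating.
    row_max = max(row)
    exps = [math.exp(x - row_max) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


neg_inf = float("-inf")
shifted = neg_inf - neg_inf             # the shift applied in a fully masked row
masked = softmax([neg_inf] * 4)         # every entry masked out
partly_masked = softmax([0.0, neg_inf]) # one entry survives
```

For a fully masked row the shift `-inf - (-inf)` is already NaN, so every exponent, the sum and the final ratio are NaN. A row with at least one finite entry behaves normally.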

## Quick Aside: Masking in Attention

Attention itself doesn't have a concept of masking. The `sdpa` function has an argument called `attn_mask`, which would be more accurately named `attn_bias`. This is because we don't actually "mask" entries when computing attention. Instead, due to implementation details([performance](https://github.com/pytorch/pytorch/issues/25110#issuecomment-524519087)), we add a value to the masked-out query/key pairs.

We use a large negative number (typically -inf) to decrease the attention weight, as softmax assigns more weight to larger values.

## Alternative Approaches

If we use a very large negative number instead of -inf:

```python
> row = torch.tensor([(-1e6) for _ in range(4)])
> torch.softmax(row, 0)
tensor([0.2500, 0.2500, 0.2500, 0.2500])
```
However if users always remembered to "slice" out their outputs i.e.:
```Python
>fill_value = -1e6
>...
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[-0.0563, -0.0564,  0.1613, -0.0486],
        [ 0.0000,  0.0000,  0.0000,  0.0000]])
```
This would bring us back into a better state.

## A Third Option

We don't necessarily need to alter the behavior of softmax for -inf or very large negative numbers. The fundamental goal is to exclude certain query/key pairs from attention, regardless of the underlying implementation.

This PR implements the new semantic for masking w/ attention in fully masked-out rows:
```python
out[masked_out_rows] = 0
```
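
A plain-Python sketch of this semantic for a single query row, assuming scalar values for brevity. `safe_softmax` and `attend` here are illustrative helpers, not the private `_safe_softmax` op or the fused kernels.

```python
import math


def safe_softmax(row):
    """Illustrative only: softmax that maps a fully masked row to zeros."""
    row_max = max(row)
    if row_max == float("-inf"):
        # Every entry is masked out: define the weights as all zeros.
        return [0.0] * len(row)
    exps = [math.exp(x - row_max) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, values):
    # Weighted sum of scalar values for one query row.
    weights = safe_softmax(scores)
    return sum(w * v for w, v in zip(weights, values))


neg_inf = float("-inf")
fully_masked_out = attend([neg_inf, neg_inf], [5.0, 7.0])
partly_masked_out = attend([0.0, neg_inf], [5.0, 7.0])
```

With zero weights the output of a fully masked row is exactly zero, so no NaN can leak into the forward result.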

**Important Note**: This idea isn't entirely new. The [MaskedTensor](https://pytorch.org/tutorials/prototype/maskedtensor_overview#safe-softmax) prototype, a tensor subclass, was designed to handle such cases. However, it remains a prototype feature and hasn't gained widespread adoption.

## Details
This PR stack does 3 things:
1. Adds a PRIVATE _safe_softmax op
2. Updates semantic for flash_cpu fused kernel
3. Updates semantic for efficient_cuda fused kernel

_safe_softmax is not supposed to be used generically and is only meant to be used within the context of SDPA. Because of this, instead of decomposing softmax and checking for -inf rows, we "cheat" and use nan_to_num.

Why do I think this is okay? (Please raise a counterpoint if you have one.)
There are multiple ways NaNs can emerge. For the fully masked out rows case nan_to_num works. But what if there were other NaNs, wouldn't this silently remove them?

The only case that this can happen is if the input itself had a NaN or an Inf
For example:
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = torch.finfo(torch.float16).max
print(a.softmax(-1))
```
Will return
`tensor([0., 1., 0., 0.], dtype=torch.float16)`

Where
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = float("inf")
a.softmax(-1)
```
returns:
`tensor([nan, nan, nan, nan], dtype=torch.float16)`

If we don't want to allow even the possibility of "inf" or "NaN" attention scores being converted to 0, then we can implement it something like this:

```Python
max = torch.max(a, dim=-1, keepdim=True)
exp = torch.exp(a - max.values)
denom = torch.sum(exp, dim=-1, keepdim=True)
softmax = exp / denom
softmax = torch.where(max.values == float('-inf'), 0.0, softmax)
```
however we would be paying for this in math performance.

## Why Now
I think one point that has substantially changed where PyTorch should lie on this argument is the fact that we have fused implementations for SDPA now. And these fused implementations allow us to easily and performantly support this new semantic.

Differential Revision: [D61418679](https://our.internmc.facebook.com/intern/diff/D61418679)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133882
Approved by: https://github.com/soulitzer
2024-08-19 18:53:11 +00:00
f1dc3b108a Back out "[export] fix test for training ir migration" (#133697)
Summary:
Original commit changeset: 0a1cb57e0338

Original Phabricator Diff: D61223356

Test Plan: buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/export:export_rle_model -- -r  test_export_rle_model

Reviewed By: tugsbayasgalan

Differential Revision: D61395818

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133697
Approved by: https://github.com/tugsbayasgalan
2024-08-19 18:30:42 +00:00
a8619c9a1d Add nitpicker, which allows adding comments to PRs when they match a file pattern (#133861)
This message would have helped avoid https://www.internalfb.com/sevmanager/view/440895

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133861
Approved by: https://github.com/albanD, https://github.com/izaitsevfb
2024-08-19 18:29:59 +00:00
64d9afd8a7 Register nll_loss2d decompositions for core aten (#133534)
When exporting a training model for Executorch (which requires all ops to be core aten) with cross entropy loss (`torch.nn.CrossEntropyLoss`), we ran into the following error from the fx verifier in `to_edge`:

```
torch._export.verifier.SpecViolationError: Operator torch._ops.aten.nll_loss2d_forward.default is not Aten Canonical.
```
The aten [implementation](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/LossNLL.cpp#L624) of `torch.nn.CrossEntropyLoss` uses `nll_loss2d_forward` for inference and `nll_loss2d_backward` for training, so we need to add the decompositions for both (which already exist) to the list of core aten decompositions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133534
Approved by: https://github.com/JacobSzwejbka
2024-08-19 18:26:48 +00:00
ad7dda7b32 [CI] Bump up TIMM pin (#133528)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133528
Approved by: https://github.com/angelayi
2024-08-19 18:13:57 +00:00
773a782249 Decompose _unsafe_index_put into index_put (#133365)
## Description
Create decomposition of _unsafe_index_put (non-core aten) that turns it into index_put (core aten)

## Testing
Phi3 mini + LoRA model successfully passed `to_edge` after failing due to a non-core aten `unsafe_index_put` getting introduced in a decomposition during joint graph calculations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133365
Approved by: https://github.com/pianpwk
2024-08-19 18:07:23 +00:00
517aee5369 [torchscript] Add a sampled logging integration point. (#133484)
Test Plan:
test script:
```
    def test_zhxchen17(self):
        from libfb.py.pyinit import initFacebook

        initFacebook()

        class M(torch.nn.Module):
            def forward(self, x):
                return torch.add(x, x)

        def tmptmp(x, y):
            return torch.mul(x, y)

        m = M()
        n = torch.jit.script(m)
        print(n(torch.tensor(1)))
        print(torch.jit.script(tmptmp)(torch.tensor(1), torch.tensor(2)))
```

```
I0802 12:01:23.932929 4079081 init.cc:407] Logging to scuba: run __torch__.caffe2.test.export.test_export.M.forward sample rate: 1000000
```

Differential Revision: D60920867

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133484
Approved by: https://github.com/davidberard98
2024-08-19 18:04:45 +00:00
6564e746ed [PT2] Port remove_noop to PT2 pre_grad passes (#132183)
Summary: migrate to aten IR, `reshape` -> `view.default`, not covering `flatten` as there are already optimizations done in PT2, see the example here P1506057533

Differential Revision: D60476525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132183
Approved by: https://github.com/frank-wei
2024-08-19 17:46:51 +00:00
da69a28c6f [pipelining] Add schedule runtime for lowered schedule (#130488)
Creates a new runtime that shifts complexity from runtime to
ahead-of-time.

The existing runtime (PipelineScheduleMulti) accepts a
compute-only schedule (only forward, backward, and weight actions are
specified) and infers the communication operations at runtime.
Compared to that runtime, PipelineScheduleRuntime has less logic that
happens at runtime and relies on lowering passes to transform the
compute-only schedule to add communications.

Advantages include
- easier to verify the correctness by dumping a compute+comm schedule
- possible to manually edit the compute+comm schedule if the lowering
  heuristics are insufficient

Functionality included inside the PipelineScheduleRuntime is limited to
- accepting a compute-only schedule and lowering it to add comms
- executing the compute or comm operations specified by the given
  schedule
- handling work.wait() automatically by calling it just before the
  matching compute operation (for RECV ops) or at the end of step (for
  SEND ops)
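
A toy sketch of what "lowering a compute-only schedule to add comms" could look like. The action tuples, the `lower_schedule` function and the op names are hypothetical, and the sketch assumes every stage lives on a different rank; the real lowering passes are more involved.

```python
def lower_schedule(compute_actions, num_stages):
    """Hypothetical lowering pass: add comm actions around compute actions.

    Each action is an (op, stage, microbatch) tuple. Assumes every stage is
    on its own rank, so every stage boundary needs a send/recv pair.
    """
    lowered = []
    for op, stage, mb in compute_actions:
        first = stage == 0
        last = stage == num_stages - 1
        if op == "F":
            if not first:
                lowered.append(("RECV_F", stage, mb))  # activations from previous stage
            lowered.append((op, stage, mb))
            if not last:
                lowered.append(("SEND_F", stage, mb))  # activations to next stage
        elif op == "B":
            if not last:
                lowered.append(("RECV_B", stage, mb))  # gradients from next stage
            lowered.append((op, stage, mb))
            if not first:
                lowered.append(("SEND_B", stage, mb))  # gradients to previous stage
        else:
            lowered.append((op, stage, mb))            # e.g. weight actions need no comms
    return lowered


# Compute-only schedule for the middle stage of a 3-stage pipeline.
schedule = [("F", 1, 0), ("B", 1, 0), ("W", 1, 0)]
lowered = lower_schedule(schedule, num_stages=3)
```

Dumping `lowered` gives an explicit compute+comm schedule that can be inspected or edited by hand before it is executed.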

Follow ups for later PRs
- Some refactoring should be done to replace PipelineScheduleMulti with
  this runtime
- Optimizer execution is not considered (e.g. for zero-bubble cases)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130488
Approved by: https://github.com/H-Huang
2024-08-19 17:44:24 +00:00
f31404ba6f Revert "Update xpu CD used driver to rolling version (#133454)"
This reverts commit 32ed4a3beb746c94c702c80c79c812e45ab3b2f4.

Reverted https://github.com/pytorch/pytorch/pull/133454 on behalf of https://github.com/ZainRizvi due to Sorry, there's [an outage](https://github.com/triton-lang/triton/issues/4527) that's preventing triton from being installed correctly, which has the side effect of breaking our docker builds. Reverting this PR since it requires a docker rebuild (which now fails) to give us more time to properly fix the docker builds. ([comment](https://github.com/pytorch/pytorch/pull/133454#issuecomment-2297073937))
2024-08-19 17:28:50 +00:00
6ca68357b3 [dynamo] Save class vt in UserDefinedObjectVariable (#133800)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133800
Approved by: https://github.com/jansel
ghstack dependencies: #133745, #133747, #133746, #133799
2024-08-19 17:21:48 +00:00
08f14d5492 [refactor][dynamo][side-effects] Helper function for __new__ for user defined class (#133799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133799
Approved by: https://github.com/jansel
ghstack dependencies: #133745, #133747, #133746
2024-08-19 17:21:48 +00:00
d6f30b91e5 Add a smaller default config option for decode (#133646)
## Before
A100
| Type    |   Speedup | score_mod   | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)     |
|---------|-----------|-------------|------------|----------------|---------------------------|
| Average |     0.461 |             |            |                |                           |
| Max     |     0.996 | None        | causal     | torch.bfloat16 | (16, 16, 1, 16, 1024, 64) |
| Min     |     0.188 | None        | causal     | torch.bfloat16 | (2, 16, 1, 16, 512, 128)  |

H100
| Type    |   Speedup | score_mod   | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)     |
|---------|-----------|-------------|------------|----------------|---------------------------|
| Average |     4.528 |             |            |                |                           |
| Max     |    16.710 | None        | offset     | torch.bfloat16 | (2, 16, 1, 2, 4096, 64)   |
| Min     |     1.612 | None        | offset     | torch.bfloat16 | (16, 16, 1, 16, 512, 128) |

## After

A100:
| Type    |   Speedup | score_mod   | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)     |
|---------|-----------|-------------|------------|----------------|---------------------------|
| Average |     0.472 |             |            |                |                           |
| Max     |     1.110 | None        | causal     | torch.bfloat16 | (16, 16, 1, 16, 1024, 64) |
| Min     |     0.182 | None        | causal     | torch.bfloat16 | (2, 16, 1, 16, 4096, 128) |

H100:
| Type    |   Speedup | score_mod   | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)     |
|---------|-----------|-------------|------------|----------------|---------------------------|
| Average |     4.535 |             |            |                |                           |
| Max     |    16.691 | None        | offset     | torch.bfloat16 | (2, 16, 1, 2, 4096, 64)   |
| Min     |     1.607 | None        | offset     | torch.bfloat16 | (16, 16, 1, 16, 512, 128) |

### Failing example code

``` Python
import torch
import torch.nn as nn
import functools
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

class AttentionModel(nn.Module):
    def __init__(self, initial_kv_len):
        super().__init__()
        self.kv_len = initial_kv_len
        self.q_len = 1

    def causal_mask_decode(self, b, h, q_idx, kv_idx):
        offset = self.kv_len - self.q_len
        return offset + q_idx >= kv_idx

    def forward(self, queries, keys, values, mask):
        self.kv_len = keys.shape[-2]
        bs, nh, seq_len, _ = queries.shape

        attention = functools.partial(flex_attention, block_mask=mask, enable_gqa=True)
        attention = torch.compile(attention)
        attn_output = attention(queries, keys, values)

        return attn_output

# Driver code
def main():
    # Set up parameters
    d_model = 256
    q_heads = 32
    kv_heads = 8
    kv_len = 128
    q_len = 1
    batch_size = 1

    # Initialize the model
    model = AttentionModel(kv_len)
    mask = create_block_mask(
        lambda a, b, c, d: model.causal_mask_decode(a, b, c, d), 1, 1, q_len, kv_len
    )

    # Create sample input tensors
    queries = torch.randn(batch_size, q_heads, q_len, d_model, device="cuda")
    keys = torch.randn(batch_size, kv_heads, kv_len, d_model, device="cuda")
    values = torch.randn(batch_size, kv_heads, kv_len, d_model, device="cuda")

    # Forward pass
    output = model(queries, keys, values, mask)

    print(f"Input shapes:")
    print(f"  Queries: {queries.shape}")
    print(f"  Keys: {keys.shape}")
    print(f"  Values: {values.shape}")
    print(f"Output shape: {output.shape}")

if __name__ == "__main__":
    main()

```
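The mask predicate used in the repro above can be checked on its own, without `flex_attention` or a GPU. The following is a minimal sketch in plain Python; the free function and its explicit `q_len`/`kv_len` arguments are an illustrative restatement of `causal_mask_decode`, not part of the PR.

```python
def causal_mask_decode(q_idx: int, kv_idx: int, q_len: int, kv_len: int) -> bool:
    """Return True when query position q_idx may attend to key position kv_idx.

    During decode the queries are the last q_len positions of a sequence of
    length kv_len, so query indices are shifted by kv_len - q_len.
    """
    offset = kv_len - q_len
    return offset + q_idx >= kv_idx


# Same sizes as the driver code: one query token against 128 cached keys.
q_len, kv_len = 1, 128
visible = [kv for kv in range(kv_len) if causal_mask_decode(0, kv, q_len, kv_len)]
print(f"query 0 can attend to {len(visible)} of {kv_len} keys")
```

With a single query token the offset is `kv_len - 1`, so the one query sees every cached key; with `q_len > 1` the later keys are masked for the earlier queries.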

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133646
Approved by: https://github.com/Chillee, https://github.com/joydddd
2024-08-19 17:13:26 +00:00
e37eef8a7b return state dict without optimized module (#132626)
Fixes #123625

We should consider changing the current behaviour and make it similar to 1fb498d6e3/torch/distributed/algorithms/_checkpoint/checkpoint_wrapper.py (L69-L101)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132626
Approved by: https://github.com/williamwen42
2024-08-19 16:58:41 +00:00
8d404581fc Revert "[ONNX] New export logic leveraging ExportedProgram and ONNX IR (#132530)"
This reverts commit 5fab35d77c7d1db7dbb9d5c516254a510b4f4f64.

Reverted https://github.com/pytorch/pytorch/pull/132530 on behalf of https://github.com/ZainRizvi due to Sorry but it seems like Dr. CI incorrectly flagged the [pull / linux-docs / build-docs-python-false](https://hud.pytorch.org/pr/pytorch/pytorch/132530#28918577682) failure as being flaky. The job started failing consistently on CI once your PR was merged. [GH job link](https://github.com/pytorch/pytorch/actions/runs/10454830880/job/28949386844) [HUD commit link](5fab35d77c) ([comment](https://github.com/pytorch/pytorch/pull/132530#issuecomment-2297001423))
2024-08-19 16:47:15 +00:00
68fcd54226 Lower cache mocking to test more pytorch code (#133579)
Summary: Previously we were mocking out FbRemoteFxGraphCacheBackend which meant that we were missing testing a whole bunch of the cache code. Cache at a lower level (CacheClient, LocalAutotuneCacheBackend, ManifoldClient, Redis) so we cover a larger amount of the caching code.
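A minimal sketch of the idea, using only `unittest.mock`: patching the low-level client instead of the whole backend keeps the backend's own logic (key construction, decoding) under test. `RemoteClient` and `CacheBackend` are hypothetical stand-ins, not the classes named in the summary.

```python
from unittest import mock


class RemoteClient:
    """Hypothetical low-level client that would talk to a remote store."""

    def get(self, key):
        raise RuntimeError("network access is not available in unit tests")


class CacheBackend:
    """Hypothetical higher-level cache whose logic we want to keep under test."""

    def __init__(self, client):
        self._client = client

    def lookup(self, key):
        raw = self._client.get(f"v1/{key}")  # key-prefixing logic stays covered
        return None if raw is None else raw.decode("utf-8")


def lookup_with_mocked_client(key, payload):
    # Patch only the lowest layer; CacheBackend.lookup runs for real.
    with mock.patch.object(RemoteClient, "get", return_value=payload) as fake_get:
        backend = CacheBackend(RemoteClient())
        return backend.lookup(key), fake_get.call_args


value, seen_call = lookup_with_mocked_client("graph", b"hit")
print(value, seen_call)
```

Had `CacheBackend` itself been replaced by a mock, neither the key prefixing nor the decoding would have been exercised.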

Test Plan: unit tests

Reviewed By: oulgen

Differential Revision: D60937966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133579
Approved by: https://github.com/oulgen
2024-08-19 16:32:36 +00:00
32ed4a3beb Update xpu CD used driver to rolling version (#133454)
The main purpose of this PR is to switch the XPU CD to the rolling driver, in order to support AOT builds for more client GPUs and to enable Kineto. We also plan to enable Python 3.13 for the XPU CD.

Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133454
Approved by: https://github.com/atalman
2024-08-19 16:01:47 +00:00
df6831562c [Flight Recorder] Add more basic analysis to the script (#133412)
This is the first step toward a basic, working analyzer for Flight Recorder (FR) in production.

- We want to use this script to detect abnormalities in collectives and report them to users.
- We also fixed some type errors.

- [Ongoing] We will also add more unit tests to this script and modularize it so that it is easier to maintain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133412
Approved by: https://github.com/c-p-i-o, https://github.com/atalman
2024-08-19 15:55:00 +00:00
76b0284744 Revert "[inductor][cpp] complete vectorization for int32/int64 (#122961)"
This reverts commit 99b3b58f39507bb8ad5b4bb1b9bedf7f47b64fa3.

Reverted https://github.com/pytorch/pytorch/pull/122961 on behalf of https://github.com/atalman due to Breaks slow jobs: inductor/test_cpu_repro.py::CPUReproTests::test__adaptive_avg_pool2d [GH job link](https://github.com/pytorch/pytorch/actions/runs/10432403692/job/28893704833) [HUD commit link](a0ef8888e6) ([comment](https://github.com/pytorch/pytorch/pull/122961#issuecomment-2296852418))
2024-08-19 15:29:15 +00:00
318d3b39c4 Revert "[Inductor][CPP] Support vectorization of load_seed and randn (#130317)"
This reverts commit a0ef8888e60d934ae7e4ddaec1c1274b12d0d39d.

Reverted https://github.com/pytorch/pytorch/pull/130317 on behalf of https://github.com/atalman due to Breaks slow jobs: inductor/test_cpu_repro.py::CPUReproTests::test__adaptive_avg_pool2d [GH job link](https://github.com/pytorch/pytorch/actions/runs/10432403692/job/28893704833) [HUD commit link](a0ef8888e6) ([comment](https://github.com/pytorch/pytorch/pull/130317#issuecomment-2296819045))
2024-08-19 15:13:39 +00:00
5153550e4b [CI] Add FP32 dynamic, AMP static, AMP dynamic for AOT inductor accuracy CPU CI test (#132836)
This PR adds three more accuracy tests on the AOT Inductor CPU side:
1. FP32 dynamic shape accuracy test, torchbench suite
2. AMP static shape accuracy test, torchbench suite
3. AMP dynamic shape accuracy test, torchbench suite

**Test time cost:**
| Precision | Shape type | Suite      | Time cost |
|-----------|------------|------------|-----------|
| FP32      | dynamic    | Torchbench | 1h40m     |
| AMP       | static     | Torchbench | 1h38m     |
| AMP       | dynamic    | Torchbench | 1h48m     |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132836
Approved by: https://github.com/desertfire
2024-08-19 14:26:48 +00:00
5fab35d77c [ONNX] New export logic leveraging ExportedProgram and ONNX IR (#132530)
1/n PR to

- Move code from torch-onnx at commit 395495e566 into torch.onnx and fix imports.
- Integrate the new export logic with the torch.onnx.export API and include a basic set of tests.
- Refactor the API for the change.
- Improve documentation.

Follow-up PRs will add more tests and docs.

Fix https://github.com/pytorch/pytorch/issues/129277
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132530
Approved by: https://github.com/titaiwangms, https://github.com/malfet
2024-08-19 14:01:07 +00:00
92151c814b [ROCm] Set _HAS_PYNVML to false if amdsmi not installed (#132990)
This is a bugfix for an issue recently encountered in ROCm/DeepSpeed. Currently, if a library installs pynvml and runs on ROCm, PyTorch breaks: `_HAS_PYNVML` is set to true, so the `device_count` call attempts to use the amdsmi library, which is not installed.

This fix will set _HAS_PYNVML to false on ROCm if amdsmi is not installed.
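The guard described above can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the actual PyTorch code: `detect_has_pynvml`, `has_module`, and the `is_rocm` flag are hypothetical names.

```python
import importlib.util


def has_module(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


def detect_has_pynvml(is_rocm: bool, module_available=has_module) -> bool:
    """Decide the flag from the library that will actually be used.

    On ROCm the device queries go through amdsmi, so the mere presence of
    pynvml must not turn the flag on (assumption based on the description).
    """
    required = "amdsmi" if is_rocm else "pynvml"
    return module_available(required)


# Simulate a ROCm machine where only pynvml happens to be installed.
only_pynvml = lambda name: name == "pynvml"
print(detect_has_pynvml(is_rocm=True, module_available=only_pynvml))
```

Passing the availability check in as a parameter keeps the sketch testable on machines that have neither library installed.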

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132990
Approved by: https://github.com/pruthvistony, https://github.com/eqy, https://github.com/malfet
2024-08-19 09:45:58 +00:00
0a976b8899 Enable bf16 float32 mkldnn matmul when float32 precision is 'medium' (#130919)
This fixes an issue on AArch64 CPUs supporting BF16, where torch.set_float32_matmul_precision("highest") does not disable the bf16 downconversion in mkldnn_matmul.

This was discovered from a unit test failure where the decorator `torch.testing._internal.common_mkldnn.bf32_on_and_off`, which internally switches the float32_matmul_precision between "medium" and "highest", was not having the desired effect.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130919
Approved by: https://github.com/jgong5
2024-08-19 09:18:12 +00:00
8b6b1721c8 remove StrobelightCompileTimeProfiler.profile_compile_time from stacktrace when strobelight profiling not enabled (#133831)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133831
Approved by: https://github.com/oulgen
2024-08-19 09:14:52 +00:00
4bae7ae3d9 [DeviceMesh][Easy] Fix typo (#133790)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133790
Approved by: https://github.com/Skylion007
2024-08-19 05:20:22 +00:00
35f36363ec Revert "[dtensor] move DTensor to public namespace (#133113)"
This reverts commit 2ee6b97464d17fcf4c1fc67c29868fa30d0c16e1.

Reverted https://github.com/pytorch/pytorch/pull/133113 on behalf of https://github.com/wanchaol due to looks like it break some internal type imports ([comment](https://github.com/pytorch/pytorch/pull/133113#issuecomment-2295670911))
2024-08-19 05:00:19 +00:00
42e61c783c [Inductor][CPP] Align Half load with BFloat16 load (#132011)
Remove `static_cast<float>` for Half load to align with BFloat16.
Before:
```
extern "C"  void kernel(const half* in_ptr0,
                       half* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(20L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = static_cast<float>(in_ptr0[static_cast<long>(x0)]);
            out_ptr0[static_cast<long>(x0)] = tmp0;
        }
    }
}
```

After:
```
extern "C"  void kernel(const half* in_ptr0,
                       half* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(20L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = in_ptr0[static_cast<long>(x0)];
            out_ptr0[static_cast<long>(x0)] = tmp0;
        }
    }
}

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132011
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-08-19 04:52:39 +00:00
ae00063570 Change default runner's AMI to Amazon 2023 AMI - Part 1 (#133641)
Upgrades the LF scale configs to change the default AMI in accordance with the Amazon 2023 rollout plan.

This PR will be merged on the morning of Monday, Aug 19. Over the next 2-3 days, as new Linux runners are spun up (and old ones spun down), they'll start using this new AMI.

This PR will be paired with https://github.com/pytorch/test-infra/pull/5558, which will be merged after this one
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133641
Approved by: https://github.com/jeanschmidt
2024-08-19 01:32:25 +00:00
e72e924eb5 Add correct typing annotations to rsample() for all distributions (#133516)
Fixes #133514
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133516
Approved by: https://github.com/Skylion007
2024-08-18 20:31:54 +00:00
eqy
c0c82a5f6a [CUDA][SDPA] Bump tolerances for test_mem_efficient_attention_attn_mask_vs (#133738)
Same thing as #133051 but for efficient attention

CC @drisspg @nWEIdia

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133738
Approved by: https://github.com/drisspg, https://github.com/nWEIdia, https://github.com/Skylion007
2024-08-18 19:14:29 +00:00
cf60fe53a8 [BE]: Update Typeguard to TypeIs for better type inference (#133814)
Uses TypeIs instead of TypeGuard for better inference. See https://peps.python.org/pep-0742/
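As a small illustration of the difference (a sketch, not code from this PR): per PEP 742, a `TypeIs` predicate lets a type checker narrow in both the positive and the negative branch, whereas `TypeGuard` narrows only the positive one. The import is kept under `TYPE_CHECKING` and the annotation is a string so the snippet runs even where `typing_extensions` is not installed; `is_str` and `describe` are hypothetical names.

```python
from typing import TYPE_CHECKING, Union

if TYPE_CHECKING:
    # Only needed by the type checker; available as typing.TypeIs on Python 3.13+.
    from typing_extensions import TypeIs


def is_str(value: Union[int, str]) -> "TypeIs[str]":
    """Type predicate: True exactly when value is a str."""
    return isinstance(value, str)


def describe(value: Union[int, str]) -> str:
    if is_str(value):
        return value.upper()  # narrowed to str
    # With TypeIs the checker narrows value to int here;
    # with TypeGuard it would still be Union[int, str].
    return str(value + 1)


print(describe("ab"), describe(1))
```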

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133814
Approved by: https://github.com/ezyang
2024-08-18 19:10:16 +00:00
cyy
0d4cedaa47 [13/N] Fix clang-tidy warnings in aten/src/ATen (#133807)
Follows #133425

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133807
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-08-18 17:54:12 +00:00
cyy
47ed5f57b0 [12/N] Fix clang-tidy warnings in aten/src/ATen (#133425)
Follows #133758

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133425
Approved by: https://github.com/ezyang
2024-08-18 11:03:55 +00:00
fbd020fce6 Add new prop to _XpuDeviceProperties for triton gemm optimization (#131738)
# Motivation
This PR aims to add new properties to `_XpuDeviceProperties` for triton gemm optimization.

# Additional Context
`ext_oneapi_supports_cl_extension` is not an ABI-neutral API. It depends on compiler 2025.0. For more details, see https://github.com/intel/llvm/pull/13212

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131738
Approved by: https://github.com/gujinghui
2024-08-18 08:32:30 +00:00
fed6096e73 [dynamo] Support object.__new__ call (#133746)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133746
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #133745, #133747
2024-08-18 07:18:52 +00:00
d56a395971 [dynamo] Support os.fspath (#133747)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133747
Approved by: https://github.com/yanboliang, https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #133745
2024-08-18 07:18:52 +00:00
27dfd63ee8 remove unnecessary slicing in EffectTokensWrapper (#133737)
When `outs` is a tensor, `[0:]` triggers an additional, unnecessary slicing op, which caused some of XLA's unit tests to fail.
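A minimal sketch of the pattern, assuming the fix amounts to skipping the slice when there is nothing to strip (the actual change in `EffectTokensWrapper` may differ). `RecordingSeq` and `strip_tokens` are hypothetical; the recorder stands in for a backend that turns every slice into a traced op.

```python
class RecordingSeq:
    """Stand-in for a tensor-like object that records slicing operations."""

    def __init__(self, items):
        self.items = list(items)
        self.slice_calls = 0

    def __getitem__(self, index):
        if isinstance(index, slice):
            self.slice_calls += 1  # a real backend would emit a slice op here
        return self.items[index]


def strip_tokens(outs, num_tokens):
    """Drop the leading effect tokens, avoiding a no-op `outs[0:]` slice."""
    if num_tokens == 0:
        return outs
    return outs[num_tokens:]


seq = RecordingSeq([10, 20, 30])
unchanged = strip_tokens(seq, 0)
print(unchanged is seq, seq.slice_calls)
```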
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133737
Approved by: https://github.com/IvanKobzarev
2024-08-18 05:52:48 +00:00
d717df2071 [compiled autograd] fix flaky tests due to torch.cuda.memory_allocated() != 0 (#133733)
FIXES https://github.com/pytorch/pytorch/issues/123949 https://github.com/pytorch/pytorch/issues/124376
torch.cuda.memory_allocated returns the amount of memory allocated in the current process, so if it isn't 0, another test didn't properly clean up after itself. I'm keeping the memory check and isolating these tests in a subprocess, as we don't have a good way to test for activation refcount.

e.g. https://github.com/pytorch/pytorch/runs/28838386083
```
_______________ TestCompiledAutograd.test_free_activation_memory _______________
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/inductor/test_compiled_autograd.py", line 1892, in test_free_activation_memory
    self.assertTrue(torch.cuda.memory_allocated() == 0)
  File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
    raise self.failureException(msg)
AssertionError: False is not true
```
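The isolation idea can be sketched with the standard library alone: a check that depends on process-wide state is run in a fresh interpreter, so leftovers from earlier tests in the parent process cannot affect it. This is an illustration, not the test harness used by the PR; `run_isolated` and the `leftover_allocation` attribute are hypothetical.

```python
import subprocess
import sys


def run_isolated(snippet: str) -> subprocess.CompletedProcess:
    """Run a Python snippet in a brand-new interpreter and capture its output."""
    return subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
        check=False,
    )


# Stand-in for state leaked by an earlier test in this process
# (in the real tests: CUDA memory that was never freed).
sys.leftover_allocation = 123

result = run_isolated("import sys; print(hasattr(sys, 'leftover_allocation'))")
print(result.stdout.strip())
```

The child process starts without the parent's leaked state, so an "is everything freed?" assertion there reflects only what the test itself did.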

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133733
Approved by: https://github.com/jansel
2024-08-18 05:43:35 +00:00
cyy
fb9d2dc641 Remove Wno-invalid-partial-specialization from CMake (#133398)
The code base is clean enough that Winvalid-partial-specialization can be enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133398
Approved by: https://github.com/ezyang
2024-08-18 04:06:21 +00:00
cyy
f8cf1829b5 [Reland] [11/N] Fix clang-tidy warnings in aten/src/ATen (#133758)
Reland of #133298. Remove possible changes that may increase the build time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133758
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-08-17 23:09:44 +00:00
0bde3c4f2f Run cudagraphs on AOTAutograd cache hit (#132294)
This threads through all of the necessary parts into aot autograd from the FXGraphCache changes so that we can run cudagraphs properly on an AOTAutograd cache hit.

Specifics:
- AOTAutograd needs access to the `cudagraphs` boxedbool in order to properly set the backward to not use cudagraphs on a cache hit from the forward.
- We have lots of tests that test this already from the previous PR, so I just added an extra test and made the previous test work with both AOTAutogradCache and FXGraphCache at the same time.

```
TORCH_LOGS=torch._functorch._aot_autograd.autograd_cache,cudagraphs ENABLE_AOT_AUTOGRAD_CACHE=1 TORCHINDUCTOR_FX_GRAPH_CACHE=1 tlp python benchmarks/gpt_fast/benchmark.py --output ~/gpt_fast_benchmark.csv
```
Run the command above twice: once for a cache miss and once for a cache hit.

Here is the Perfetto trace for each (FB-only link):

**Cache Miss:**
Logs:
```
Loading model Llama-2-7b-chat-hf
Time to load model: 0.66 seconds
I0813 10:53:34.416000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:479] [0/0] AOTAutograd cache miss for key alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey
I0813 10:53:51.395000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:558] [0/0] Writing AOTAutograd cache entry to /tmp/torchinductor_jjwu/aotautograd/alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey/entry
I0813 10:54:17.579000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:479] [1/0] AOTAutograd cache miss for key a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt
I0813 10:54:38.636000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:558] [1/0] Writing AOTAutograd cache entry to /tmp/torchinductor_jjwu/aotautograd/a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt/entry
I0813 10:54:39.228000 911030 torch/_inductor/cudagraph_trees.py:385] [__cudagraphs] recording cudagraph tree for graph without symints
V0813 10:54:39.939000 911030 torch/_inductor/cudagraph_trees.py:2160] [__cudagraphs] Running warmup of function 0
V0813 10:55:10.615000 911030 torch/_inductor/cudagraph_trees.py:2119] [__cudagraphs] Recording function 0 of graph recording id 0
Compilation time: 101.24 seconds
Average tokens/sec: 147.96 tokens/sec
Average bandwidth achieved: 1955.22 GB/s
Memory used: 14.51 GB
```

Chromium Event (FB only):
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json&local_cache_key

![image](https://github.com/user-attachments/assets/47fdd77e-3cc1-437e-8e68-7901646269bb)

**Cache Hit:**
Logs:
```
Loading model Llama-2-7b-chat-hf
Time to load model: 0.67 seconds
I0813 10:55:51.821000 944420 torch/_functorch/_aot_autograd/autograd_cache.py:474] [0/0] AOTAutograd cache hit for key alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey
I0813 10:55:55.465000 944420 torch/_functorch/_aot_autograd/autograd_cache.py:474] [1/0] AOTAutograd cache hit for key a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt
I0813 10:55:56.030000 944420 torch/_inductor/cudagraph_trees.py:385] [__cudagraphs] recording cudagraph tree for graph without symints
V0813 10:55:56.192000 944420 torch/_inductor/cudagraph_trees.py:2160] [__cudagraphs] Running warmup of function 0
V0813 10:55:56.426000 944420 torch/_inductor/cudagraph_trees.py:2119] [__cudagraphs] Recording function 0 of graph recording id 0
Compilation time: 9.40 seconds
Average tokens/sec: 147.94 tokens/sec
Average bandwidth achieved: 1954.98 GB/s
Memory used: 14.51 GB
```
Chromium Event (FB only):
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom2%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom2%2Fchromium_events.json&local_cache_key

![image](https://github.com/user-attachments/assets/9bdd14ec-d12a-4c89-8705-135c999ac746)
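For a quick comparison of the two runs, the "Compilation time" values from the cache-miss and cache-hit logs above can be put side by side. This is plain arithmetic on the logged numbers; the variable names are illustrative.

```python
# Values copied from the "Compilation time" lines of the two logs.
cold_compile_s = 101.24  # cache miss
warm_compile_s = 9.40    # cache hit

speedup = cold_compile_s / warm_compile_s
saved_s = cold_compile_s - warm_compile_s

print(f"compile-time speedup: {speedup:.2f}x, saved {saved_s:.2f} s")
```

Throughput is essentially unchanged between the two runs (147.96 vs. 147.94 tokens/sec), so the difference is confined to compilation.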

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132294
Approved by: https://github.com/eellison
2024-08-17 21:24:54 +00:00
d6368985af [BE]: Fix setuptools not installed with Python 3.12 (#133561)
setuptools is not installed correctly for Python 3.12.
See https://github.com/python-poetry/poetry/issues/9630#issuecomment-2291114885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133561
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-08-17 17:42:04 +00:00
b4a1673a67 profiler/unwind: include <dlfcn.h> for dladdr (#133582)
This fixes a compilation error on linux systems using the musl c library.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133582
Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi
2024-08-17 16:15:18 +00:00
215b14530a Add Half for sparse.mm reduce (#133672)
This PR adds Half support for sparse.mm reduce in the CPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133672
Approved by: https://github.com/Skylion007
2024-08-17 15:20:39 +00:00
1c6fbae579 [Easy][dynamo] fix builtin function names for itertools (#133711)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133711
Approved by: https://github.com/Skylion007
2024-08-17 15:12:01 +00:00
a0ef8888e6 [Inductor][CPP] Support vectorization of load_seed and randn (#130317)
**Summary**
Enable the vectorization of `load_seed` and `randn`. For now, `randn` is using the reference implementation.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_randn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130317
Approved by: https://github.com/jgong5
ghstack dependencies: #122961
2024-08-17 07:15:57 +00:00
99b3b58f39 [inductor][cpp] complete vectorization for int32/int64 (#122961)
**Summary**
Implement complete, functional vectorization of `index_expr`. From a performance perspective, we also add heuristics that resolve the regressions reported in https://github.com/pytorch/pytorch/pull/122961#issuecomment-2041336265 by disabling vectorization of specific (fused) scheduler nodes:

- Heuristic 1: when the number of non-contiguous `index_expr/load/store` exceeds the threshold, we disable the vectorization.
- Heuristic 2: when the total number of elements along the vec dim is less than `tiling_factor/2`, we disable the vectorization.
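The two heuristics can be read as a single predicate. The following is a sketch under stated assumptions, not Inductor's implementation: `should_vectorize`, its parameter names, and the example threshold value are hypothetical.

```python
def should_vectorize(
    non_contiguous_ops: int,
    vec_dim_elements: int,
    tiling_factor: int,
    non_contiguous_threshold: int,
) -> bool:
    """Return False when either heuristic says vectorization is not worthwhile."""
    # Heuristic 1: too many non-contiguous index_expr/load/store ops.
    if non_contiguous_ops > non_contiguous_threshold:
        return False
    # Heuristic 2: too few elements along the vectorized dimension.
    if vec_dim_elements < tiling_factor / 2:
        return False
    return True


# Example with an assumed threshold of 2 and a tiling factor of 16.
print(should_vectorize(non_contiguous_ops=1, vec_dim_elements=64,
                       tiling_factor=16, non_contiguous_threshold=2))
```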

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122961
Approved by: https://github.com/jansel

Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
2024-08-17 07:07:49 +00:00
d5f6d68d68 [PT2] Resolve PT2 compatibility issue in slice and diff (#133740)
Summary:
# context
* When running an IG FM training job with PT2, we found a few graph breaks due to the `torch.diff` call in [jagged_tensor.py](https://fburl.com/code/cwssxabc):
```
_length: List[int] = (
    _length_per_key_from_stride_per_key(torch.diff(offsets), stride_per_key)
    if variable_stride_per_key
    else torch.sum(torch.diff(offsets).view(-1, stride), dim=1).tolist()
)
```
* Looking into the failure, we found that the TORCH_CHECK in diff should be a TORCH_SYM_CHECK
* slice_forward error: df3d7729e, [tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpxXZ2em/index.html)
```
RestartAnalysis
Tried to use data-dependent value in the subsequent computation. This can happen when we encounter unbounded dynamic value that is unknown during tracing time.  You will need to explicitly give hint to the compiler. Please take a look at torch._check OR torch._check_is_size APIs.  Could not guard on data-dependent expression ((5*u37 + u38)//(u37 + u38)) < 0 (unhinted: ((5*u37 + u38)//(u37 + u38)) < 0).  (Size-like symbols: u38, u37)

ATTENTION: guard_size_oblivious would fix the error, evaluating expression to False.
Maybe you need to add guard_size_oblivious to framework code, see doc below for more guidance.

Potential framework code culprit (scroll up for full backtrace):
  File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/e99934938a0abe90/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_decomp/decompositions.py", line 771, in slice_forward
    if end_val < 0:
```
* after this diff: [tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpAhv2Sh/failures_and_restarts.html)

Test Plan:
# command
* run model
```
TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_v4_mini training.pipeline_type=pt2
```
* generate tlparse
```
tlparse `ls -t /var/tmp/tt/* | head -1`
```

Reviewed By: ezyang

Differential Revision: D56339251

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133740
Approved by: https://github.com/ezyang
2024-08-17 06:07:21 +00:00
cd89bf77c8 [inductor][cpp][gemm] easy: adjust indentation of template, var renaming etc. (#133312)
Indent the template instructions separately from the generated code, for readability. Also rename M0, N0, K0 to Mr, Nr, Kr ("r" meaning "register") for consistent naming.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133312
Approved by: https://github.com/Skylion007, https://github.com/leslie-fang-intel
ghstack dependencies: #132729, #132730
2024-08-17 05:49:14 +00:00
4dc9795ebf [refactor][easy] Directly call var_getattr method for PythonModuleVariable (#133745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133745
Approved by: https://github.com/yanboliang
2024-08-17 05:30:01 +00:00
2ee6b97464 [dtensor] move DTensor to public namespace (#133113)
Moving DTensor to be in the public namespace, to formally add the
documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (to be added in the next
  PRs)
* To preserve BC for users still using `torch.distributed._tensor`,
  I added a shim script that redirects old path calls to the new module

BC preservation is evidenced by the fact that all DTensor tests still
pass without changing the public imports, so it is safe to land the
changes.
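One common way to build such a backward-compatibility shim is a module-level `__getattr__` (PEP 562) on the old path that forwards lookups to the new module. The sketch below shows that technique with throwaway in-memory modules; it is an assumption about the general approach, not the shim added by this PR, and all module and function names are hypothetical.

```python
import importlib
import sys
import types

# The "new" public module, built in memory for the sake of the example.
new_module = types.ModuleType("demo_pkg_tensor")
new_module.distribute_tensor = lambda data: ("distributed", data)
sys.modules[new_module.__name__] = new_module

# The "old" private path becomes a thin shim that forwards attribute access.
old_module = types.ModuleType("demo_pkg__tensor")


def _forward(name):
    # Called only when the attribute is not found on the shim itself (PEP 562).
    return getattr(importlib.import_module("demo_pkg_tensor"), name)


old_module.__getattr__ = _forward
sys.modules[old_module.__name__] = old_module

# Existing user code keeps importing from the old path.
legacy = importlib.import_module("demo_pkg__tensor")
print(legacy.distribute_tensor("x"))
```

Because the shim resolves names lazily, anything later added to the new module is reachable through the old path without touching the shim.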

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113
Approved by: https://github.com/XilunWu
ghstack dependencies: #133305, #133306
2024-08-17 05:09:52 +00:00
1a4709cef5 [dtensor] add more documentations (#133306)
This PR adds more documentation to the DTensor APIs, to prepare for the
module becoming public

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133306
Approved by: https://github.com/XilunWu, https://github.com/tianyu-l, https://github.com/wz337
ghstack dependencies: #133305
2024-08-17 05:09:52 +00:00
addee9f4d1 [dtensor] add missing __all__ to public modules (#133305)
As titled: some submodules are missing `__all__` for API exposure. This
PR adds the necessary `__all__` to those modules and explicitly makes
some non-public APIs private.
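As a reminder of what `__all__` controls, here is a small self-contained sketch (the module and function names are made up): a star import exports only the names listed in `__all__`, so unlisted helpers stay out of the public surface.

```python
import sys
import types

source = '''
__all__ = ["public_api"]

def public_api():
    return "ok"

def _private_helper():
    return "internal"

def unlisted():
    return "not exported"
'''

# Build a throwaway module in memory and register it so it can be imported.
module = types.ModuleType("demo_all_module")
exec(source, module.__dict__)
sys.modules[module.__name__] = module

namespace = {}
exec("from demo_all_module import *", namespace)
exported = sorted(name for name in namespace if not name.startswith("__"))
print(exported)
```

`unlisted` remains reachable as `demo_all_module.unlisted`, but it is no longer part of what the module advertises as public.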

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133305
Approved by: https://github.com/XilunWu, https://github.com/tianyu-l, https://github.com/wz337
2024-08-17 05:09:48 +00:00
702c810780 move param's device check to _init_group for fused (#131153)
In some cases the params are on the meta device when the optimizer's `__init__` is called and are only materialized in the first computation. This change allows for such situations.
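The effect of moving the check can be illustrated with a toy optimizer in plain Python. This is a hedged sketch of the pattern only: `Param`, `ToyFusedOptimizer`, and the supported-device set are hypothetical and do not mirror the real `torch.optim` classes.

```python
class Param:
    """Toy parameter that only tracks which device it lives on."""

    def __init__(self, device: str):
        self.device = device


class ToyFusedOptimizer:
    SUPPORTED_DEVICES = {"cuda"}  # assumption for the example

    def __init__(self, params):
        # No device validation here: params may still be on "meta".
        self.params = list(params)

    def _init_group(self):
        # Validation is deferred until the params are actually used.
        for param in self.params:
            if param.device not in self.SUPPORTED_DEVICES:
                raise RuntimeError(f"fused step does not support {param.device!r}")

    def step(self) -> str:
        self._init_group()
        return "stepped"


weight = Param("meta")
optimizer = ToyFusedOptimizer([weight])  # would fail if checked eagerly
weight.device = "cuda"                   # materialized before the first step
print(optimizer.step())
```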
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131153
Approved by: https://github.com/mlazos, https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2024-08-17 04:49:47 +00:00
12b8e29203 Add a fudge factor to ephemeral NCCL timeout increase (#133722)
Differential Revision: [D61422431](https://our.internmc.facebook.com/intern/diff/D61422431)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133722
Approved by: https://github.com/c00w, https://github.com/aorenste
ghstack dependencies: #133504
2024-08-17 03:08:40 +00:00
695d7db2d6 remove dead code for suggesting legacy dynamic shapes fixes (#133700)
Summary: `dynamic_dim` based dynamic shapes are long gone, so pretty-printing suggested fixes for them is dead code.

Test Plan: existing tests

Differential Revision: D61398303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133700
Approved by: https://github.com/zhxchen17
2024-08-17 01:59:34 +00:00
455f6bda56 Add cache timings info to tlparse (#133504)
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpLR1T85/rank_1/0_0_0/fx_graph_cache_hash_11.json

Differential Revision: [D61422432](https://our.internmc.facebook.com/intern/diff/D61422432)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133504
Approved by: https://github.com/jamesjwu
2024-08-17 01:37:53 +00:00
dcfa415e6e [Inductor UT] Reuse inductor UT for intel GPU test/inductor/test_compiled_optimizers.py (#133083)
[Inductor UT] Reuse Inductor test case for Intel GPU.
Reuse `test/inductor/test_compiled_optimizers.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133083
Approved by: https://github.com/etaf, https://github.com/jansel, https://github.com/mlazos
2024-08-17 01:15:26 +00:00
983bea399d [compiled autograd] move non-hot path logs into default logger (#133541)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133541
Approved by: https://github.com/yf225, https://github.com/bdhirsh
ghstack dependencies: #133115, #133148
2024-08-17 00:46:52 +00:00
0a6cc15079 [compiled autograd] use same graph node names as AOTDispatcher (#133148)
FIXES https://github.com/pytorch/pytorch/issues/132939

Compiled autograd's trace of the AOT backward may result in some additional ops (e.g. clone to make contiguous, trace_wrapped HOPs), so the graphs may be slightly offset from each other.

hf_Whisper example: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpNv89Pu/index.html
fsdp2 example: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpPdKssS/rank_0/index.html
Unit test example: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpvoQsnl/index.html
```python
 ===== Compiled autograd graph =====
 <eval_with_key>.14 class CompiledAutograd(torch.nn.Module):
    def forward(self, inputs, sizes, scalars, hooks):
        # No stacktrace found for following nodes
        getitem: "f32[]cpu" = inputs[0]
        aot1_primals_1: "f32[4]cpu" = inputs[1]
        aot1_primals_2: "f32[4]cpu" = inputs[2]
        aot0_sin: "f32[4]cpu" = inputs[3]
        aot0_cos: "f32[4]cpu" = inputs[4]
        getitem_5: "f32[4]cpu" = inputs[5];  inputs = None

         # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: SumBackward0 (NodeCall 1)
        expand: "f32[4]cpu" = torch.ops.aten.expand.default(getitem, [4]);  getitem = None

         # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: CompiledFunctionBackward1 (NodeCall 2)
        aot1_tangents_1: "f32[4]cpu" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format);  expand = None
        aot1_sin_1: "f32[4]cpu" = torch.ops.aten.sin.default(aot1_primals_2);  aot1_primals_2 = None
        aot1_neg: "f32[4]cpu" = torch.ops.aten.neg.default(aot1_sin_1);  aot1_sin_1 = None
        aot0_tangents_2: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot1_tangents_1, aot1_neg);  aot1_neg = None
        aot1_cos_1: "f32[4]cpu" = torch.ops.aten.cos.default(aot1_primals_1);  aot1_primals_1 = None
        aot0_tangents_1: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot1_tangents_1, aot1_cos_1);  aot1_tangents_1 = aot1_cos_1 = None

         # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: CompiledFunctionBackward0 (NodeCall 3)
        aot0_neg: "f32[4]cpu" = torch.ops.aten.neg.default(aot0_sin);  aot0_sin = None
        aot0_mul: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot0_tangents_2, aot0_neg);  aot0_tangents_2 = aot0_neg = None
        aot0_mul_1: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot0_tangents_1, aot0_cos);  aot0_tangents_1 = aot0_cos = None
        aot0_add: "f32[4]cpu" = torch.ops.aten.add.Tensor(aot0_mul, aot0_mul_1);  aot0_mul = aot0_mul_1 = None

         # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: torch::autograd::AccumulateGrad (NodeCall 4)
        accumulate_grad_ = torch.ops.inductor.accumulate_grad_.default(getitem_5, aot0_add);  getitem_5 = aot0_add = accumulate_grad_ = None
        _exec_final_callbacks_stub = torch__dynamo_external_utils__exec_final_callbacks_stub();  _exec_final_callbacks_stub = None
        return []
```

where aot1 is
```python
class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[4][1]cpu", primals_2: "f32[4][1]cpu", tangents_1: "f32[4][1]cpu"):
         # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2233 in torch_dynamo_resume_in_f_at_2232, code: return tmp1.sin() + tmp2.cos()
        sin_1: "f32[4][1]cpu" = torch.ops.aten.sin.default(primals_2);  primals_2 = None
        neg: "f32[4][1]cpu" = torch.ops.aten.neg.default(sin_1);  sin_1 = None
        mul: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_1, neg);  neg = None
        cos_1: "f32[4][1]cpu" = torch.ops.aten.cos.default(primals_1);  primals_1 = None
        mul_1: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_1, cos_1);  tangents_1 = cos_1 = None
        return (mul_1, mul)
```

and aot0 is
```python
class GraphModule(torch.nn.Module):
    def forward(self, sin: "f32[4][1]cpu", cos: "f32[4][1]cpu", tangents_1: "f32[4][1]cpu", tangents_2: "f32[4][1]cpu"):
         # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2231 in f, code: tmp2 = x.cos()
        neg: "f32[4][1]cpu" = torch.ops.aten.neg.default(sin);  sin = None
        mul: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_2, neg);  tangents_2 = neg = None

         # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2230 in f, code: tmp1 = x.sin()
        mul_1: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_1, cos);  tangents_1 = cos = None

         # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2230 in f, code: tmp1 = x.sin()
        add: "f32[4][1]cpu" = torch.ops.aten.add.Tensor(mul, mul_1);  mul = mul_1 = None
        return (add,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133148
Approved by: https://github.com/jansel
ghstack dependencies: #133115
2024-08-17 00:46:52 +00:00
4b3ed8bc52 [compiled autograd] log aot id for CompiledFunctionBackward (#133115)
Partially addresses https://github.com/pytorch/pytorch/issues/132939. Adds the AOT ID after the CompiledFunctionBackward annotation in verbose compiled autograd logging

default (no change):
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp8WCSLf/dedicated_log_torch_trace_xw3ktsi_.log/index.html

TORCH_LOGS="compiled_autograd_verbose":
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp8WCSLf/dedicated_log_torch_trace_gsc9q_43.log/index.html

```python
# File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:361 in set_node_origin, code: CompiledFunctionBackward1 (NodeCall 2)
clone: "f32[4]" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format);  expand = None
cos: "f32[4]" = torch.ops.aten.cos.default(getitem_1);  getitem_1 = None
mul: "f32[4]" = torch.ops.aten.mul.Tensor(clone, cos);  clone = cos = None

# File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:361 in set_node_origin, code: CompiledFunctionBackward0 (NodeCall 3)
cos_1: "f32[4]" = torch.ops.aten.cos.default(getitem_2)
mul_1: "f32[4]" = torch.ops.aten.mul.Tensor(mul, cos_1);  mul = cos_1 = None
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133115
Approved by: https://github.com/jansel
2024-08-17 00:46:52 +00:00
b0803129e8 Added meta registration for _fused_adamw_ (#133728)
See https://github.com/pytorch/pytorch/issues/123461#issuecomment-2294335273

<img width="1463" alt="Screenshot 2024-08-16 at 5 38 25 PM" src="https://github.com/user-attachments/assets/fe940c0e-775f-4047-bf69-34a3677d539b">
same signature so should be ok to just add the op to the decorator
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133728
Approved by: https://github.com/janeyx99, https://github.com/fegin
2024-08-17 00:28:31 +00:00
ec28121017 [inductor] Fix test_cudagraph_trees_expandable_segments.py for internal (#133698)
Summary:
These tests aren't running internally because the outer test harness is crashing without listing the tests. To fix we need:
* Add a target for the tools/stats/ folder since this test imports it
* Add a dependence to that target so it's included in the par
* Fix up the relative import syntax, which is somehow different internally vs. fbcode (not sure why this works, but many other tests are doing it)

Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cudagraph_trees_expandable_segments -- --run-disabled`

Differential Revision: D61396711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133698
Approved by: https://github.com/xuzhao9
2024-08-17 00:09:32 +00:00
648fc6c9c1 [Inductor][CPP] Refactor the tiling select into a standalone module to enhance its extensibility (#130892)
**Summary**
After enabling more vectorization, we found that vectorization does not always bring performance benefits. For example, a kernel with several non-contiguous index computations or non-contiguous buffer load/store operations can experience performance regression. A typical case is what we observed in the next PR: after fully enabling vectorization of `index_expr`, we saw a performance regression of `hf_BigBird`.

In this PR, we refactor tiling selection into a standalone module so that more advanced tiling-selection heuristics can be added later. A standalone class `TilingSelect` with a `select_tiling` method has been added. `select_tiling` accepts `fn_list` and `var_sizes_list` as inputs and returns `tiling_factors` and `tiling_indices`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130892
Approved by: https://github.com/jgong5
2024-08-16 23:55:38 +00:00
d04cd7f3ba Improvements for associative_scan - Reverse feature (#133011)
This is part of a series of PRs to improve `associative_scan`. This specific PR introduces a `reverse` flag to `associative_scan` to establish an interface similar to that of `jax.lax.associative_scan`. This PR has been derived from https://github.com/pytorch/pytorch/pull/129307.
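
To make the effect of a `reverse` flag concrete, here is a minimal sequential reference in plain Python. It is only a sketch: `reference_scan` is a hypothetical helper, not the PyTorch API, and the "flip, scan, flip back" semantics are an assumption about what a reversed inclusive scan computes.

```python
from operator import add
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def reference_scan(
    combine_fn: Callable[[T, T], T], xs: Sequence[T], reverse: bool = False
) -> List[T]:
    """Sequential inclusive scan (hypothetical helper, for illustration only)."""
    # Assumption: reverse=True scans from the last element towards the first.
    items = list(reversed(xs)) if reverse else list(xs)
    out: List[T] = []
    for x in items:
        # The first element is passed through; later ones are combined
        # with the running result.
        out.append(x if not out else combine_fn(out[-1], x))
    # Flip back so that out[i] lines up with xs[i].
    return list(reversed(out)) if reverse else out


forward = reference_scan(add, [1, 2, 3, 4])
backward = reference_scan(add, [1, 2, 3, 4], reverse=True)
```

With addition, `forward` holds prefix sums and `backward` holds suffix sums, which is a convenient sanity check when comparing against a parallel implementation.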

@ydwu4 @Chillee @zou3519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133011
Approved by: https://github.com/ydwu4
2024-08-16 23:06:31 +00:00
19ff9059eb Revert "[Inductor][CPP] Support vectorization of remainder (#129849)"
This reverts commit 8624a571b4eecd11547867591d70992843265e97.

Reverted https://github.com/pytorch/pytorch/pull/129849 on behalf of https://github.com/izaitsevfb due to ptedge_executorch_benchmark build failed again with LLVM crash ([comment](https://github.com/pytorch/pytorch/pull/129849#issuecomment-2294408526))
2024-08-16 22:41:05 +00:00
98d6a6eb7d [inductor] clean up TODO comments. (#133718)
clean up TODO comments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133718
Approved by: https://github.com/henrylhtsang
2024-08-16 22:12:01 +00:00
271ee90851 [easy] Fix type annotation for ExportedProgram.run_decompositions (#133720)
Fix the tuple type annotation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133720
Approved by: https://github.com/Skylion007
2024-08-16 22:11:42 +00:00
99e789b52b [Fix 1/n] GPU Test skips - fbcode/ caffe2/test/quantization (#133158)
Summary:
This diff aims to fix the GPU Test skips in the quantization tests under the `caffe2/test/quantization` directory. The changes made in the `TARGETS` files include adding the `should_use_remote_gpu` flag to enable remote GPU testing. This should help to resolve the skipped tests and improve the overall test coverage.

[This diff] Fixed skip count: 4
[Running total] Fixed skip count: 4

Note: Creating separate diffs for each test-group.

Test Plan:
**281475054644766**: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_compare_per_channel_device_numerics (caffe2.test.quantization.core.test_quantized_tensor.TestQuantizedTensor)'
https://www.internalfb.com/intern/testinfra/testrun/5629499773981783

**281475054644780**: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_compare_per_tensor_device_numerics (caffe2.test.quantization.core.test_quantized_tensor.TestQuantizedTensor)'
https://www.internalfb.com/intern/testinfra/testrun/11540474087422107

**281475054644853**: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_quant_pin_memory (caffe2.test.quantization.core.test_quantized_tensor.TestQuantizedTensor)'
https://www.internalfb.com/intern/testinfra/testrun/11540474087422477

**844425008078016**: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_cuda_quantization_does_not_pin_memory (caffe2.test.quantization.core.test_quantized_tensor.TestQuantizedTensor)'
https://www.internalfb.com/intern/testinfra/testrun/1407375259845199

Differential Revision: D60055277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133158
Approved by: https://github.com/jovianjaison
2024-08-16 22:00:57 +00:00
fd33499b0c [PT2][Optimus] Fix mixed precision training problem in decompose mem bound (#133626)
Summary: We recently observed that in AI CMF, enabling the decompose_mm pass leads to mixed-dtype errors for aten.mm and aten.addmm. On investigation, we found that the error comes from torch.sum, which performs an implicit type cast to avoid possible overflow (see a similar discussion on GitHub: https://github.com/pytorch/pytorch/issues/115832). We therefore cast the output to avoid the error.

Test Plan:
# unit test
```
buck2 test mode/dev-nosan //caffe2/test/inductor:decompose_mem_bound_mm -- test_decompose_mm_mixed_precision
```
Buck UI: https://www.internalfb.com/buck2/00dc168e-4d65-40f8-b169-f4a58206f641
Test UI: https://www.internalfb.com/intern/testinfra/testrun/17169973624867151
Network: Up: 25KiB  Down: 44KiB  (reSessionID-b7e2ecc7-16ca-476d-95b2-09ea74645eb0)
Jobs completed: 19. Time elapsed: 1:07.6s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0

# e2e
ads_dper3:68464f2dc5e849ba2670482079cecaaa
training_platform:2c41d916ad5dd82f196372a8c7bd37a0
### build training_platform
```
buck2 run fbcode//fblearner/flow/projects/training_platform:training_platform
```

### register training_platform
```
buck2 run mode/opt fblearner/flow/projects/training_platform:workflow -- register-workflows --project-name training_platform --flow_version training_platform:2c41d916ad5dd82f196372a8c7bd37a0
```

### build ads_dper 3

```
fbpkg build -E ads_dper3 --yes --expire 14d
```

### register ads_dper 3
```
 buck2 run //pyper/core/eval_app_utils:flow_utils_script -- register --pkg-version ads_dper3:68464f2dc5e849ba2670482079cecaaa
```

### extend package (optional)
```
fbpkg expire --extend-only training_platform:2c41d916ad5dd82f196372a8c7bd37a0 30d
```

### before fix
f591360990

### after fix

baseline
f591395056
proposal

Differential Revision: D61351815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133626
Approved by: https://github.com/jackiexu1992
2024-08-16 21:53:12 +00:00
be207af6e1 Disable unwrapping scalar tensors when used as outputs (#132859)
If the scalar tensor is an output tensor, it shouldn't be unwrapped (i.e. `.item()` called) since `tl.store` requires a pointer type for outputs. This issue only occurs for mutated buffers: the input tensor is also used as an output tensor.

@yanboliang @jansel @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132859
Approved by: https://github.com/jansel
2024-08-16 21:40:45 +00:00
861bdf96f4 [MPS] Add native strided API for MPSNDArray starting with macOS 15 (#128393)
Add support for native strides in MPS starting with macOS Sequoia. This will get rid of the additional gather and scatter operations needed to solve the strides or storage offsets of the tensors.

Summary of changes (starting with macOS 15):
- Add support for **MPS strided API** (strides/storage offsets etc):
   - [initWithBuffer:offset:descriptor:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/4391636-initwithbuffer?language=objc)
   - [arrayViewWithCommandBuffer:descriptor:aliasing:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/3114040-arrayviewwithcommandbuffer?language=objc)
   - [arrayViewWithShape:strides:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/4408694-arrayviewwithshape?language=objc)
   - [reshapeWithCommandBuffer:sourceArray:shape:destinationArray:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarrayidentity/4438557-reshapewithcommandbuffer?language=objc)
- Add native support for NHWC convolutions (without incurring any extra copy from NCHW -> NHWC -> NCHW).
- Add support for strided output buffers (previously we would create a contiguous buffer

OSes older than macOS 15 will run the old gather/scatter code path to solve strides/storage offsets.

---

A couple of performance stats collected from torchbench comparing macOS 15 vs macOS 14:
```
- test_train[functorch_maml_omniglot-mps]: 27% faster
- test_train[timm_vision_transformer-mps]: 12% faster
- test_train[hf_T5-mps]: 9.46% faster
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128393
Approved by: https://github.com/albanD

Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
2024-08-16 21:07:50 +00:00
447f428d6d [ROCm] Fix test_export cudnn_attention UT (#133234)
On ROCm we should decompose to flash_attention for sdpa instead of cudnn_attention. Need additional conditionalisation in this code.

Issue observed: https://hud.pytorch.org/failure?name=rocm%20%2F%20linux-focal-rocm6.1-py3.8%20%2F%20test%20(default%2C%203%2C%206%2C%20linux.rocm.gpu.2)&jobName=undefined&failureCaptures=%5B%22export%2Ftest_export.py%3A%3ATestOneOffModelExportResult%3A%3Atest_scaled_dot_product_attention_cuda%22%5D

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133234
Approved by: https://github.com/malfet
2024-08-16 20:49:13 +00:00
f57b00704e [Traceable FSDP2][Dynamo] Support reconstructing CUDA event object within Dynamo graph (#133635)
`torch.cuda.Event` objects are different from `torch.cuda.Stream` in that events are not pooled, meaning we can't look up a previously created CUDA event object by ID. This prevents CUDA event object created outside of the Dynamo graph from being used within the graph (since Dynamo needs a way to emit a `call_function` line in the graph that does the retrieval of the event object for downstream op use). This PR adds a simple object pool within Dynamo utility, to support looking up CUDA event object by ID from within the Dynamo graph.

After this PR, if a user creates a CUDA event object outside of the graph and uses that event within the graph, the behavior will exactly match eager.
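
The lookup-by-ID idea can be sketched in plain Python. This is an illustration under stated assumptions, not the Dynamo utility itself: `ObjectPool`, `register`, `lookup`, and `FakeEvent` are hypothetical names, and `FakeEvent` stands in for `torch.cuda.Event` so the sketch runs without a GPU.

```python
from typing import Any, Dict


class ObjectPool:
    """Toy pool mapping an integer ID to a live object (hypothetical)."""

    def __init__(self) -> None:
        self._by_id: Dict[int, Any] = {}

    def register(self, obj: Any) -> int:
        # Holding a strong reference keeps the object alive, so its id()
        # cannot be reused by another object while it is in the pool.
        key = id(obj)
        self._by_id[key] = obj
        return key

    def lookup(self, key: int) -> Any:
        # A graph only needs to carry the integer; the object itself is
        # retrieved by a plain function call at run time.
        return self._by_id[key]


class FakeEvent:
    """Stand-in for an event object created outside the graph."""


pool = ObjectPool()
event = FakeEvent()
event_id = pool.register(event)
same_event = pool.lookup(event_id)
```

The design point is that an integer is easy to embed in a graph as a constant, whereas an arbitrary Python object is not.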

Test commands:
- `pytest -rA test/dynamo/test_ctx_manager.py::CtxManagerTests::test_cuda_event_created_outside_of_graph`
- `pytest -rA test/dynamo/test_ctx_manager.py::CtxManagerTests::test_cuda_event_across_graph_break`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133635
Approved by: https://github.com/yifuwang
ghstack dependencies: #133532, #133531, #133636
2024-08-16 20:40:46 +00:00
bc9e20b927 Move the layout constraint registration of aten._scaled_mm.default to module scope (#133669)
During Inductor lowering, layout constraints for an op are applied before the op's lowering is called. Currently `add_layout_constraint(aten._scaled_mm.default, constrain_to_fx_strides)` is called inside `aten._scaled_mm.default`'s lowering. This means that if the first `_scaled_mm` to be lowered relies on the layout constraint, it won't be applied and the generated code would fail. The issue won't manifest if the first `_scaled_mm` doesn't rely on the layout constraint.
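
The ordering problem can be reproduced with a toy registry in plain Python. Everything here is a hypothetical stand-in (`register_constraint`, `lower_op`, the string op name); it only models the order of "look up constraint, then run lowering", not Inductor's real data structures.

```python
from typing import Callable, Dict, List

# Toy registry: op name -> constraint function (hypothetical).
constraints: Dict[str, Callable[[], str]] = {}


def register_constraint(op: str, constraint: Callable[[], str]) -> None:
    constraints[op] = constraint


def constrain_to_fx_strides() -> str:
    return "fx_strides"


def lower_op(op: str, lowering: Callable[[], None], log: List[str]) -> None:
    # The constraint is looked up BEFORE the lowering runs.
    constraint = constraints.get(op)
    log.append(constraint() if constraint else "unconstrained")
    lowering()


def late_registering_lowering() -> None:
    # Registering here is too late for the call that is currently lowering.
    register_constraint("scaled_mm", constrain_to_fx_strides)


# Registration inside the lowering: the first op misses the constraint.
late_log: List[str] = []
lower_op("scaled_mm", late_registering_lowering, late_log)
lower_op("scaled_mm", late_registering_lowering, late_log)

# Registration up front (module scope): the first op already sees it.
constraints.clear()
register_constraint("scaled_mm", constrain_to_fx_strides)
early_log: List[str] = []
lower_op("scaled_mm", lambda: None, early_log)
```

In the late-registration run only the second lowering is constrained, which matches the description that the failure depends on whether the first `_scaled_mm` needs the constraint.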

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133669
Approved by: https://github.com/drisspg, https://github.com/yangsiyu007
2024-08-16 20:30:13 +00:00
88ba50279c Consolidate the format for --max-acc-splits flag (#133724)
fixes the partial export of [lowering] Add max_acc_splits (#133041) ([D60133589](https://www.internalfb.com/diff/D60133589))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133724
Approved by: https://github.com/kit1980
2024-08-16 20:28:55 +00:00
3ac527ac5f [BE][Ez]: Update cudnn_frontend submodule to 1.6.0 (#133687)
Updates CUDNN_frontend header only library to make the most of the newest CUDNN features and decrease the overhead of the library.

Copied from commit:
New API
- Graph Slice Operation: Introduced the graph.slice operation for slicing input tensors. Refer to docs/operations/Slice.md for detailed documentation and samples/cpp/misc/slice.cpp for a C++ sample. Pybinds for this operation have also been added.
- SM Carveout Feature: Added the set_sm_count(int32_t type) graph property to support the SM Carveout feature introduced in Ampere and Hopper GPUs. Engines that do not support SM_COUNT will return NOT_SUPPORTED.
Bug Fixes
- Convolution Mode Attribute: Added the missing set_convolution_mode attribute to convolution attributes in forward propagation (fprop), data gradient (dgrad), and weight gradient (wgrad). Previously, this was hardcoded to CUDNN_CROSS_CORRELATION in the 1.x API.
- SDPA FP8 Backward Node: Fixed an issue with the deserialization of the sdpa_fp8_backward node.
Enhancements
- Graph Execution Overhead: Reduced the overhead of graph.execute() by optimizing sub-node tree traversal, collected UIDs, workspace modifications, and workspace size.
- Graph Validation Performance: Significantly improved (~10x) the performance of graph.validate() by deferring graph expansion to a later stage (build_operation_graph).
- Optional Running Stats for BatchNorm: Made the running statistics for the batch normalization operation optional, supported by cuDNN backend version 9.3.0 and later.
- Shape and Stride Inferencing: Enhanced shape and stride inferencing to preserve the stride order of the input.
- Diagnostic Error Message: Added a diagnostic error message to create_execution_plans if called without the preceding build_operation_graph.
- JSON Schema and Deserialization: Improved the JSON schema and deserialization logic with additional checks.
- Logging Overhead: Reduced logging overhead, resulting in faster graph.build() calls.
- CMake Integration: Replaced CMAKE_SOURCE_DIR with PROJECT_SOURCE_DIR in CMake files for better integration. See the relevant pull request for more details.
Samples
- Jupyter Notebooks: Added Jupyter notebooks for RMSNorm, InstanceNorm, and LayerNorm. Refer to the samples/python folder for more information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133687
Approved by: https://github.com/eqy, https://github.com/malfet
2024-08-16 20:27:23 +00:00
41e6619509 [codemod] Del un at::native::metal @ MPSCNNFullyConnectedOp.h:6 (export D59157302) (#133515)
Manual export of D59157302

Original description:
Removes a using namespace from the global namespace in pursuit of enabling -Wheader-hygiene. Qualifies instances that relied on the using namespace.

@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133515
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-08-16 19:59:07 +00:00
a0cb54ab46 Revert "C++ network flow implementation in c10 (#132188)"
This reverts commit e6272acaec63c960486b3ac558d0199cd65d7b97.

Reverted https://github.com/pytorch/pytorch/pull/132188 on behalf of https://github.com/izaitsevfb due to breaks aps models and builds internally ([comment](https://github.com/pytorch/pytorch/pull/132188#issuecomment-2294120234))
2024-08-16 19:48:54 +00:00
fb59440791 Use dedicated docker-build environment for manywheel, libtorch and conda Docker builds - 2 (#133709)
Follow up after https://github.com/pytorch/pytorch/pull/133699. 2 more places where we need to pass these env vars.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133709
Approved by: https://github.com/Skylion007, https://github.com/seemethere
2024-08-16 19:41:11 +00:00
678a8f9e66 [Inductor][FlexAttention] Small cleanup for FlexAttention kernel template (#133664)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133664
Approved by: https://github.com/drisspg
2024-08-16 19:33:36 +00:00
611c104370 [MPS] Add workaround for nonzero with large/complex inputs (#126188)
Fixes Issue #122916

Resolves correctness issue seen with large inputs to the mps nonzero op by using a different scatter mode. Native nonzero op is still used with smaller inputs for better performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126188
Approved by: https://github.com/kulinseth, https://github.com/malfet
2024-08-16 19:04:04 +00:00
0063e56949 Make FX Graph Cache work with distributed training (#133374)
During distributed training, if all ranks except one hit the cache, the rank that did not hit the cache will cause an NCCL timeout, since the rest of the ranks will enter the collective and start the timer. This PR uses the new PTD API to increase the timeout for the ranks that hit the cache by the amount of time the cache would save.
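
The timeout adjustment amounts to simple arithmetic, sketched below. `extended_timeout` and its parameters are hypothetical; the actual PTD API used by the PR is not shown in this message and is not reproduced here.

```python
from datetime import timedelta


def extended_timeout(
    base: timedelta, cache_hit: bool, time_saved: timedelta
) -> timedelta:
    """Return the collective timeout a rank should use (hypothetical helper)."""
    # A rank that hit the cache reaches the collective early by roughly
    # `time_saved`, so it waits that much longer for the rank that compiled.
    return base + time_saved if cache_hit else base


base = timedelta(minutes=10)
saved = timedelta(minutes=3)
hit_rank_timeout = extended_timeout(base, cache_hit=True, time_saved=saved)
miss_rank_timeout = extended_timeout(base, cache_hit=False, time_saved=saved)
```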

Differential Revision: [D61363722](https://our.internmc.facebook.com/intern/diff/D61363722)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133374
Approved by: https://github.com/ezyang
2024-08-16 18:51:14 +00:00
5ee070266f Workaround ASAN failure (#133623)
Summary:
ASAN in llvm 17.x and newer reads 8 bytes in front of every function called. This means the JIT must not place a function immediately at the beginning of a freshly `mmap`ed page. This adds an 8 byte sized dummy variable as the first thing to work around the problem.

See also:
- https://reviews.llvm.org/D148665
- https://github.com/llvm/llvm-project/issues/65253

Test Plan:
- `servicelab create cogwheel_adfinder_ubsan_multi_trial_test --local-commit`: https://www.internalfb.com/servicelab/experiment/3701354882
- sandcastle

Differential Revision: D61348865

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133623
Approved by: https://github.com/Skylion007
2024-08-16 18:48:10 +00:00
cyy
90c3669cd9 Make sure T::is_traceable is bool (#133673)
Add static_assert to C++ templates in custom_function
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133673
Approved by: https://github.com/Skylion007
2024-08-16 18:28:02 +00:00
eb3d517605 [Test] Add SkipIfRocm to test_grad_acc_cpu_offload (#132975)
Fixes #123726

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132975
Approved by: https://github.com/malfet
2024-08-16 18:26:20 +00:00
e5baf43b61 [Inductor] short-term fix for needs_fixed_stride_order silent incorrectness (#133452)
This is a low-risk short-term fix for
https://github.com/pytorch/pytorch/issues/128084, for the purposes of
2.4.1. The actual fix for that issue is more risky and we'll target 2.5.

needs_fixed_stride_order is silently incorrect with args that are
mutable because it creates clones of those args, writes into them, and
doesn't update the original args.

This PR makes it so that needs_fixed_stride_order doesn't apply to
inputs that are being mutated.

This PR doesn't completely fix the problem, but it makes it less
incorrect: most of the time the input already has the correct strides
but inductor fails to recognize it, and in those cases writing directly
to the input is fine.
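
The failure mode is easiest to see without tensors. In this sketch, plain Python lists stand in for mutable tensor arguments and all function names are hypothetical; it only shows why writing into a clone loses the mutation.

```python
from typing import Callable, List


def inplace_double(buf: List[float]) -> None:
    # An op that mutates its argument in place.
    for i, value in enumerate(buf):
        buf[i] = value * 2


def call_with_clone(op: Callable[[List[float]], None], arg: List[float]) -> None:
    # Models the buggy path: the op runs on a clone (e.g. one made to get
    # the required strides) and nothing copies the result back.
    clone = list(arg)
    op(clone)


def call_direct(op: Callable[[List[float]], None], arg: List[float]) -> None:
    # Models the short-term fix: mutated inputs are passed through as-is.
    op(arg)


a = [1.0, 2.0]
call_with_clone(inplace_double, a)  # the mutation is lost

b = [1.0, 2.0]
call_direct(inplace_double, b)  # the mutation is visible to the caller
```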

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133452
Approved by: https://github.com/eellison
2024-08-16 18:14:57 +00:00
caaa339e0f Use dedicated docker-build environment for manywheel, libtorch and conda Docker builds (#133699)
BE change. Apply logic similar to: https://github.com/pytorch/pytorch/blob/main/.github/workflows/docker-builds.yml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133699
Approved by: https://github.com/seemethere
2024-08-16 18:10:43 +00:00
b833990a8f Revert "[CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade (#131493)"
This reverts commit 4aa66f68a803927ddd127ceaaa1521b8d6e90e5f.

Reverted https://github.com/pytorch/pytorch/pull/131493 on behalf of https://github.com/izaitsevfb due to breaks internal builds with identifier "std::numeric_limits< ::cutlass::half_t> ::infinity" is undefined in device code ([comment](https://github.com/pytorch/pytorch/pull/131493#issuecomment-2293939390))
2024-08-16 18:09:33 +00:00
4ee65c7e4e Add message text to BypassFxGraphCache exceptions. (#133505)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133505
Approved by: https://github.com/oulgen
2024-08-16 18:02:59 +00:00
1df1d00ffc [Traceable FSDP2] Remove usage of tuple() generator and simplify code (#133636)
Dynamo doesn't support `tuple()` generator, and this change also simplifies code a bit.
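
For illustration, these are forms that produce the same tuple in eager Python without passing a generator expression to `tuple()`. This is a sketch only: the message does not say which form the PR actually uses, and the variable names are made up.

```python
values = [1, 2, 3]

# Generator expression passed to tuple(): the pattern being removed.
from_generator = tuple(v * 2 for v in values)

# Same result via a list comprehension.
from_list = tuple([v * 2 for v in values])

# Same result via an explicit loop.
doubled = []
for v in values:
    doubled.append(v * 2)
from_loop = tuple(doubled)
```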

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133636
Approved by: https://github.com/awgu
ghstack dependencies: #133532, #133531
2024-08-16 17:47:28 +00:00
374c61cc82 [inductor] make conv template work with symbolic stride/padding (#132938)
Fix https://github.com/pytorch/pytorch/issues/132716

The triton template for convolution does not work when the stride or padding contains dynamic shapes. Use the hint and add guards to handle that. An alternative is to fall back to eager, but since I've seen the lowering rule for convolution use the hint in other cases, I'll just follow the convention.

I don't really know how to add a unit test here since I need to create symbolic strides (not strides of a tensor but the stride parameter for convolution) and paddings. I can try harder if reviewers want me to add unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132938
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #132952
2024-08-16 17:45:12 +00:00
2cffe82dea Fix triton build failure due to tritonlang.blob.core.windows.net not available (#133694)
This should mitigate https://github.com/triton-lang/triton/issues/4527
We should also remove this once our triton pin moves past: https://github.com/triton-lang/triton/pull/4216

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133694
Approved by: https://github.com/Skylion007, https://github.com/kit1980, https://github.com/malfet
2024-08-16 17:34:30 +00:00
f735038c8f [PT2][Optimus] Add unbind_stack_to_slices pass (#133420)
Summary: We find another pattern to be optimized in AI CMF, thus we add the new pattern

Test Plan:
# unit test

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```

Buck UI: https://www.internalfb.com/buck2/b0b9bdf6-1bd1-45db-ba2c-a6892d9d557e
Test UI: https://www.internalfb.com/intern/testinfra/testrun/1125900285323964
Network: Up: 595KiB           Down: 1.7MiB           (reSessionID-e527c3b3-03ac-45f8-bd08-3eb9a28b7dc0)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0

# benchmark

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ai_cmf" --flow_id 558295195 -n
```
P1520513078

Counter({'pattern_matcher_nodes': 1756, 'pattern_matcher_count': 936, 'normalization_pass': 280, 'merge_splits_pass': 250, 'scmerge_cat_removed': 14, 'scmerge_cat_added': 12, 'scmerge_split_removed': 7, 'unbind_stack_pass': 7, 'split_stack_to_cats_pass': 4, 'scmerge_split_sections_removed': 3, 'split_cat_pass': 2, 'scmerge_split_added': 2, 'split_cat_to_slices_pass': 2, 'unbind_stack_to_slices_pass': 1}

# e2e (OBA AFOC)

baseline
f590253290
proposal
f591051921

### QPS and NE
{F1804187079}

### trace analysis
baseline trace link: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2Ff590283096-TrainingApplication%2F4%2Frank-1.Aug_12_08_52_03.3628.pt.trace.json.gz&bucket=pyper_traces

proposal trace link:
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2Ff591081210-TrainingApplication%2F0%2Frank-1.Aug_12_22_23_35.3401.pt.trace.json.gz&bucket=pyper_traces

{F1804227687}{F1804227675}
Based on the traces, the green part has shrunk due to the optimus transformation.

Differential Revision: D61039466

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133420
Approved by: https://github.com/jackiexu1992
2024-08-16 17:30:35 +00:00
6790eb52f9 [Traceable FSDP2] Set torch._dynamo.config.skip_fsdp_hooks to True by default (#133531)
Setting `torch._dynamo.config.skip_fsdp_hooks = True` is required for graph-break compiled FSDP2, thus setting it to default will make this adoption easier. If users want to use Traceable FSDP2, they can set this to False manually (which will allow FSDP2 hooks to be traced through).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133531
Approved by: https://github.com/awgu
ghstack dependencies: #133532
2024-08-16 17:18:42 +00:00
6d85077168 [Traceable FSDP2] Allow tracing through FSDP2 impl in trace_rules.py (#133532)
Test commands:
- `python test/distributed/_composable/fsdp/test_fully_shard_training.py TestFullyShard1DTrainingCompose.test_train_parity_with_activation_checkpointing`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133532
Approved by: https://github.com/yanboliang
2024-08-16 17:13:47 +00:00
18705e371d S390x nightly binaries for python 3.13 (#132984)
Enable building python 3.13 nightly binaries for s390x
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132984
Approved by: https://github.com/malfet
2024-08-16 17:07:27 +00:00
770086fe39 [Dynamo] Support torch.cuda.device ctx manager (#133385)
Fixes #128059

I'm not sure if this is the right way, since Inductor doesn't always respect the device id set by users, so probably we should just wrap it as null context manager and print a warning. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @jansel @anijain2305 @mlazos @williamwen42

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133385
Approved by: https://github.com/jansel
2024-08-16 17:05:55 +00:00
38e5ee1a34 mixed_mm: add more extensive dtype testing (#133292)
This PR adds a test that tests more combinations of dtypes. The bfloat16 and uint8 combination causes a crash somewhere in triton during the generation of LLVM code. Tests like these would have also prevented segfaults like this one https://github.com/pytorch/pytorch/pull/133173.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133292
Approved by: https://github.com/shunting314
2024-08-16 16:49:27 +00:00
9c2d119194 [Profiler/CPU] Add API for Dynamic Activity Toggling [3/n] (#133353)
Summary:
In this diff, we add the CPU activity implementation of being able to dynamically toggle profiling in between steps. To do this we remove the callbacks for Torch Ops and add them back in when an enable call is made.

This diff also adds some support code for doing the same in Python; however, the Python stack comes with its own set of complications when enabling this feature. For one, the Python stack never gets an exit event during the toggle because tracing is turned off, which makes for some tricky post-processing. For this reason, we can leave Python dynamic toggling off for now and revisit it if there is enough demand.
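
The remove-then-restore idea for op callbacks can be sketched in plain Python. This is only an illustration under assumptions: `OpCallbackRegistry` and its methods are hypothetical names, not the profiler's actual C++ API.

```python
from typing import Callable, Dict


class OpCallbackRegistry:
    """Hypothetical stand-in for the profiler's Torch Op callback registration."""

    def __init__(self) -> None:
        self._installed: Dict[str, Callable[[str], None]] = {}
        self._saved: Dict[str, Callable[[str], None]] = {}

    def register(self, name: str, callback: Callable[[str], None]) -> None:
        self._installed[name] = callback

    def disable(self) -> None:
        # Remove the callbacks, but remember them so enable() can restore them.
        self._saved.update(self._installed)
        self._installed.clear()

    def enable(self) -> None:
        # Add the previously removed callbacks back in.
        self._installed.update(self._saved)
        self._saved.clear()

    def on_op(self, op_name: str) -> None:
        # Called for every op; only installed callbacks observe it.
        for callback in list(self._installed.values()):
            callback(op_name)
```

Ops that run between `disable()` and `enable()` are simply not observed, which is the behavior the summary describes for CPU activity between steps.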

Test Plan: Got the following tracing by disabling torch and cuda ops: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Aug_13_13_03_02.606577.pt.trace.json.gz&bucket=gpu_traces

Differential Revision: D61221497

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133353
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
2024-08-16 16:36:57 +00:00
46af996ce7 [c10d] Do not call ncclCommAbort if comm is not initialized (#133630)
Summary:
We saw ncclCommAbort get called and hang during NCCLComm::create.
If the NCCL comm is not properly initialized, the behavior of ncclCommAbort
is 'undefined'; avoiding the call allows the process to properly throw an
exception.
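
A minimal sketch of the guard, written in Python for illustration only. `CommSketch` and its fields are hypothetical and do not mirror the real C++ `NCCLComm` class; the point is just that abort becomes a no-op when initialization never completed.

```python
class CommSketch:
    """Hypothetical communicator wrapper; not the real NCCLComm class."""

    def __init__(self) -> None:
        self.initialized = False
        self.aborted = False

    def create(self, fail: bool = False) -> None:
        if fail:
            # Initialization failed part-way; the comm stays uninitialized.
            raise RuntimeError("communicator creation failed")
        self.initialized = True

    def abort(self) -> bool:
        # Skip the abort call when the comm was never initialized, so the
        # original exception can propagate instead of hanging in abort.
        if not self.initialized:
            return False
        self.aborted = True
        return True
```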
Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133630
Approved by: https://github.com/wconstab
2024-08-16 16:25:07 +00:00
8b8b4e5ae9 AutoHeuristic: documentation for mm (#133611)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133611
Approved by: https://github.com/eellison
ghstack dependencies: #131705, #131710, #131714, #133608
2024-08-16 16:20:38 +00:00
0e0077f3b6 AutoHeuristic: mm ranking heuristic h100 (#133608)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133608
Approved by: https://github.com/eellison
ghstack dependencies: #131705, #131710, #131714
2024-08-16 16:20:38 +00:00
e51c8ad369 AutoHeuristic: Heuristic that ranks choices for mm (#131714)
This PR adds a heuristic for tuned_mm that predicts the top 10 best choices. To be safe, aten.mm is always included.
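
The "top 10 plus aten.mm" rule can be sketched as follows. This assumes the heuristic yields a score per choice and that the fallback is appended when it is not already ranked; `rank_choices` and the choice names in the usage are hypothetical, not the AutoHeuristic API.

```python
from typing import Dict, List


def rank_choices(
    scores: Dict[str, float], top_k: int = 10, fallback: str = "aten.mm"
) -> List[str]:
    """Return the top_k highest-scoring choices, always including the fallback."""
    ranked = sorted(scores, key=lambda name: scores[name], reverse=True)[:top_k]
    if fallback not in ranked:
        # Assumption: the fallback is appended rather than replacing a choice.
        ranked.append(fallback)
    return ranked
```

For example, `rank_choices({"triton_a": 0.9, "triton_b": 0.8}, top_k=1)` keeps `triton_a` and still adds `aten.mm`.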

Perf run: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2008%20Aug%202024%2020%3A20%3A28%20GMT&stopTime=Thu%2C%2015%20Aug%202024%2020%3A20%3A28%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/AlnisM/22/head&lCommit=905826f4ab5344efb0bcaa87e3b27a25299927ab&rBranch=main&rCommit=79ca596dc6ea16b6cdd0f2517451e19840717d37

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131714
Approved by: https://github.com/eellison
ghstack dependencies: #131705, #131710
2024-08-16 16:20:38 +00:00
51e13745be [BE]: Update ruff to 0.6.0 (#133609)
Updates ruff and fixes a couple false negatives it discovered.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133609
Approved by: https://github.com/malfet
2024-08-16 14:11:01 +00:00
eca8b4220f [inductor][cpp][gemm] fix k-slicing bug and add thread blocking config (#132730)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132730
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #132729
2024-08-16 13:50:19 +00:00
a6aa451bde Move python 3.8 to 3.9 for linux-binary-manywheel workflow (#133621)
Part of Deprecation of python 3.8 and moving to 3.9. Related to: https://github.com/pytorch/pytorch/issues/120718
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133621
Approved by: https://github.com/Skylion007, https://github.com/kit1980, https://github.com/malfet
2024-08-16 13:49:26 +00:00
e1b9b89d94 Revert "[Flight Recorder] Add more basic analysis to the script (#133412)"
This reverts commit fcc2fc1a70c35628939611b496b209fa0a1d19bf.

Reverted https://github.com/pytorch/pytorch/pull/133412 on behalf of https://github.com/atalman due to New test: distributed/flight_recorder/test_fr_analysis is constantly failing ([comment](https://github.com/pytorch/pytorch/pull/133412#issuecomment-2293506539))
2024-08-16 13:26:25 +00:00
b444343087 Fix printing symfloat pow in triton (#133614)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133614
Approved by: https://github.com/Skylion007
2024-08-16 13:08:29 +00:00
762b1b4c17 [inductor] [cpp] fix accuracy when template_buffer has users other than the epilogue nodes (#133073)
This PR fixes the accuracy issues when template_buffer has users other than the epilogue nodes. This will fix the accuracy failure of the below models using max-autotune:

- MobileBertForMaskedLM
- MobileBertForQuestionAnswering
- convnext_base
- swin_base_patch4_window7_224

## Issue 1:
Previously we always add `template_buffer` as an alias of `Y`. In case the `template_buffer` has users other than the epilogue nodes, we shouldn't set it as an alias of `Y`. This PR adds the check in such case.

Wrong code before the fix where `tmp4` and `tmp9` are both stored to `Y` while we need 2 different buffers for them since `tmp4` will be used by nodes other than the epilogue node:
```cpp
Y[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp4; // tmp4 is the output of the template
Y[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp9; // tmp9 is the output of the epilogue node
```

Correct code after the fix:
```cpp
out_ptr2[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp4;
Y[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp9;
```

## Issue 2:
When fixing the above issue, we found a correctness issue when `bias` is `False`. The root cause is that when `bias` is `False`, the `template_buffer` has users other than the epilogue nodes, and the GEMM output buffer is localized, we need to add an extra copy epilogue to ensure that the GEMM output (a local buffer) is stored to the `template_buffer` that will be used later by other nodes.
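
The two conditions from Issue 1 and Issue 2 can be written as small predicates. This is a sketch of the decision logic only: the function names and the use of plain strings for nodes are assumptions, not the Inductor implementation.

```python
from typing import Iterable, Set


def can_alias_output(users: Iterable[str], epilogue_nodes: Set[str]) -> bool:
    """Issue 1: alias template_buffer to Y only if every user is an epilogue node."""
    return all(user in epilogue_nodes for user in users)


def needs_copy_epilogue(
    has_bias: bool,
    users: Iterable[str],
    epilogue_nodes: Set[str],
    output_is_local: bool,
) -> bool:
    """Issue 2: an extra copy is needed when there is no bias, template_buffer
    has users other than the epilogue nodes, and the GEMM output is a local buffer."""
    has_other_users = not can_alias_output(users, epilogue_nodes)
    return (not has_bias) and has_other_users and output_is_local
```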

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133073
Approved by: https://github.com/jgong5
ghstack dependencies: #133070
2024-08-16 12:13:10 +00:00
dd69013c7a deprecate search_autotune_cache (#133628)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133628
Approved by: https://github.com/oulgen
2024-08-16 09:29:39 +00:00
15183f5ebf overestimate time_taken_ns for autotuning (#133633)
tldr; in `autotune_to_one_config` we now include the precompile time, and in coordesc tuning we include the time from `autotune_to_one_config`, since this is a precursor
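
As a rough sketch of the accounting described above (the function and parameter names are hypothetical; only the additive relationship comes from the commit message):

```python
from typing import Optional


def reported_time_taken_ns(
    precompile_ns: int, autotune_ns: int, coordesc_ns: Optional[int] = None
) -> int:
    """Overestimate of tuning time following the description above."""
    # autotune_to_one_config now includes the precompile time.
    autotune_to_one_config_ns = precompile_ns + autotune_ns
    if coordesc_ns is None:
        return autotune_to_one_config_ns
    # Coordinate-descent tuning includes its precursor's time as well.
    return autotune_to_one_config_ns + coordesc_ns
```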

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133633
Approved by: https://github.com/oulgen, https://github.com/eellison
2024-08-16 09:28:49 +00:00
30fbf5b19c Remove AMD restrictions on triton hashing (#133616)
Summary: When we added these functions, AMD's triton checkout was very old; it appears to have caught up since. Remove the restrictions.

Test Plan: unit tests

Differential Revision: D61351473

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133616
Approved by: https://github.com/mxz297, https://github.com/nmacchioni, https://github.com/eellison
2024-08-16 08:02:48 +00:00
5ed3b70d09 remove redundant upper bound check at runtime (#133627)
Summary: Some symbols (unbacked symints?) can have upper bound that is `sys.maxsize - 1` but our code for runtime assertions assumes that such upper bounds would come in as `sympy.oo` (like backed symints?) in order to drop them. So we weren't dropping them, which this PR fixes.
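
A minimal sketch of the check, using `math.inf` as a stand-in for `sympy.oo` so the example needs no third-party imports. The helper name and the `>=` comparison are assumptions, not the actual runtime-assertion code.

```python
import math
import sys


def is_redundant_upper_bound(upper: float) -> bool:
    """True when an upper bound carries no information and can be dropped."""
    # math.inf stands in for sympy.oo here. Bounds at or above
    # sys.maxsize - 1 are treated as effectively unbounded too.
    return upper == math.inf or upper >= sys.maxsize - 1
```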

Test Plan: added test

Differential Revision: D61352056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133627
Approved by: https://github.com/SherlockNoMad
2024-08-16 06:57:12 +00:00
f64146aff0 Update source matcher to use torch_fn (#133642)
Updating the source matcher to also accept pattern matching on the torch_fn metadata, which exists in both strict and non-strict export. We want to replace the use of source_fn_stack with torch_fn, as it's not possible for us to get source_fn_stack in non-strict export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133642
Approved by: https://github.com/ydwu4
2024-08-16 06:42:52 +00:00
d12bbcd785 Add auto-tuning for sparse semi-structured MM operator (#123742)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123742
Approved by: https://github.com/kadeng
2024-08-16 06:40:24 +00:00
3d45717219 [ROCm][CK][Inductor] enable dynamic shapes for CK backend to gemm max autotune (#133285)
This PR enables dynamic shapes for the CK backend for gemm max autotune (see #125453).

This is achieved via unhardcoding the problem sizes from the template body and passing them as parameters instead.

We handle passing the problem sizes for the kernel call as well as for the benchmark call.

# Testing

`pytest test/inductor/test_ck_backend.py [-k dynamic]`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133285
Approved by: https://github.com/ColinPeppler
2024-08-16 06:05:23 +00:00
8ea5b572a6 [PT2][Optimus] Add missing example value for the nodes introduced in group batch fusion (#133414)
Summary: Recently we observed more missing example values in nodes introduced by Optimus, which blocks further optimizations that need this node info. This diff adds the meta for these nodes.

Test Plan:
# unit test

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```

Buck UI: https://www.internalfb.com/buck2/c0ad506f-ce9d-4b80-947a-cb79074b72f0
Test UI: https://www.internalfb.com/intern/testinfra/testrun/2251800058834808
Network: Up: 1.4GiB  Down: 2.0GiB  (reSessionID-fb781425-f29b-44b5-8a5b-daffe7274f86)
Jobs completed: 300289. Time elapsed: 13:19.5s.
Cache hits: 99%. Commands: 119360 (cached: 118494, remote: 824, local: 42)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0

# benchmark

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "cmf_shrink" --flow_id 587303213
```

P1520691492

Differential Revision: D61039772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133414
Approved by: https://github.com/jackiexu1992
2024-08-16 04:52:16 +00:00
8a2b064236 [dynamo][user_defined][stable-diffusion] Raise ObservedAttributeError on UserDefinedObject var_getattr (#132806)
Fixes https://github.com/pytorch/pytorch/issues/132551

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132806
Approved by: https://github.com/williamwen42
2024-08-16 04:30:06 +00:00
fcc2fc1a70 [Flight Recorder] Add more basic analysis to the script (#133412)
This is the first step to make sure we have a basic function of analyzer for FR in production.

- We want to use this script to find out abnormalities in collectives and report it to users.
- We also fixed some type errors.

- [Ongoing] Also we will add more unit tests to this script and make it modularized so that we can better maintain it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133412
Approved by: https://github.com/c-p-i-o
2024-08-16 03:53:12 +00:00
d9f17cf4e4 [fx] Do not add Proxy on Tensor (#133470)
Summary: Switch to set_proxy_slot instead of setting the proxy directly on the Tensor. We do not want to add Proxy to tensor objects, because Proxy cannot be deepcopied or pickled and can cause problems when users want to deepcopy or pickle models.

Test Plan: CI

Differential Revision: D61277650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133470
Approved by: https://github.com/zou3519
2024-08-16 03:39:50 +00:00
8a5708ba3d [dynamo] Support object creation of classes with custom __new__ (#132977)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132977
Approved by: https://github.com/jansel
2024-08-16 03:09:23 +00:00
a1a869f2f5 [ts_converter][reland] Add support for LinearOpContext and Conv2dOpContext in quantization pass (#133622)
Summary: Reland of D60871242

Test Plan: CI

Differential Revision: D61352600

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133622
Approved by: https://github.com/SherlockNoMad
2024-08-16 01:55:45 +00:00
1653f7786d Fix type promotion for ldexp (#133519)
According to the documentation, ldexp of half and int should return a half tensor, and ldexp of double should not overflow for a 64-bit exponent.

Introduce `_pow2` helper function that does not follow scalar to float32 promotion pattern if `self` is reduced precision float or double

Add regression tests to `test_ldexp` and enable it to run on both CPU and GPU

Fixes https://github.com/pytorch/pytorch/issues/133267

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133519
Approved by: https://github.com/janeyx99, https://github.com/Skylion007
2024-08-16 01:26:26 +00:00
3a904d1163 AutoHeuristic: Enable explicit support for ranking (#131710)
This PR adds support for heuristics that rank choices in AutoHeuristic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131710
Approved by: https://github.com/eellison
ghstack dependencies: #131705
2024-08-16 01:20:52 +00:00
add0f0085c AutoHeuristic: Support ranking/pruning choices (#131705)
This PR adds support in train_decision if one wants to learn a heuristic for ranking. The main idea is that the user has to provide a number of choices the heuristic should return. I added a way to prune the learned decision tree such that it always returns the number of choices provided by the user.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131705
Approved by: https://github.com/eellison
2024-08-16 01:20:52 +00:00
cyy
929d2f8253 [3/N] Fix clang-tidy warnings in torch/csrc/autograd (#133389)
Follows #133295
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133389
Approved by: https://github.com/Skylion007
2024-08-16 00:57:54 +00:00
c22f51ce7c [inductor][cpp][gemm] improve large bs perf with better cache blocking (#132729)
Improve the cache blocking by reducing Mc_blocks so that A resides in L2 and is reused by B as much as possible. This improves large bs perf for both scenarios: 1) N is large and K is of medium size; 2) K is large. Different strategies are used to handle these scenarios. Check the notes in `get_cache_blocking` in the changes.

Measured with 56-core Intel (R) Xeon (R) CPU Max 9480, jemalloc 5.1 and intel omp, bf16. Run with code cache of B matrix (weights).

Model Shapes | Before Optimization | After Optimization | Speedup | onednn linear | Speedup over onednn
-- | -- | -- | -- | -- | --
M=1024, N=12288, K=4096 (Llama2-8b) | 5.69 ms | 3.71 ms | 1.53 | 4.53 ms | 1.22
M=1024, N=4096, K=4096 (Llama2-8b) | 1.69 ms | 1.63 ms | 1.04 | 2.05 ms | 1.26
M=1024, N=22016, K=4096 (Llama2-8b) | 10.32 ms | 6.57 ms | 1.57 | 8.46 ms | 1.29
M=1024, N=4096, K=11008 (Llama2-8b) | 5.21 ms | 3.26 ms | 1.60 | 4.65 ms | 1.43
M=1024, N=5120, K=4096 (Llama3-8b) | 1.99 ms | 1.78 ms | 1.12 | 2.31 ms | 1.30
M=1024, N=28672, K=4096 (Llama3-8b) | 13.41 ms | 8.56 ms | 1.57 | 10.96 ms | 1.28
M=1024, N=4096, K=14336 (Llama3-8b) | 6.93 ms | 4.31 ms | 1.61 | 6.24 ms | 1.45

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132729
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/jansel
2024-08-16 00:57:51 +00:00
cyy
8f7cf796ea [14/N] Use std::optional (#133417)
Follows #132527
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133417
Approved by: https://github.com/ezyang
2024-08-16 00:48:34 +00:00
d9576c9440 Fix failures when default is flipped for weights_only (#127627)
Tests on XLA shard not fixed yet but there is an issue here https://github.com/pytorch/xla/issues/7799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127627
Approved by: https://github.com/albanD
ghstack dependencies: #132349
2024-08-16 00:22:43 +00:00
c8ad5e37e8 Fix all RuntimeErrors during weights_only load from being erroneously reported with the weights_only message (#132349)
Caught in above PR #127627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132349
Approved by: https://github.com/albanD
2024-08-16 00:22:43 +00:00
0d2be06d94 [export] fix test for training ir migration (#133587)
Summary:
Fix quantization pass to be compatible with the new export IR.

Some nodes might have side-effects, so they don't have users, but still are not removed by the DCE pass.

Test Plan:
CI

buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/export:export_rle_model  -- -r export_rle_model

Differential Revision: D61223356

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133587
Approved by: https://github.com/tugsbayasgalan
2024-08-15 23:55:09 +00:00
eqy
7ad3108ef2 [CUTLASS][FP8] Skip scaled_mm rowwise test on sm89 (#133612)
Rowwise implementation currently uses sm90-specific features incl. TMA
CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133612
Approved by: https://github.com/Skylion007
2024-08-15 23:43:30 +00:00
413416cf33 [PT2] Consolidate args and kwargs usage in pre_grad passes (#133518)
Summary: With acc_tracer disabled, the generated nodes use `args` instead of `kwargs` as before. The current passes mix `args` and `kwargs`, and normalizing nodes to switch between them can make later passes work or fail. In this diff we create a pass that normalizes all nodes to use `kwargs` at the beginning, and we change all the passes to follow the same convention.
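
The normalization idea can be illustrated with the standard library alone: bind the positional arguments against the target's signature and re-emit everything as keywords. `normalize_to_kwargs` and `cat_like` are hypothetical names for this sketch, and it does not handle targets with variadic `*args`.

```python
import inspect
from typing import Any, Callable, Dict, Tuple


def cat_like(tensors, dim=0):
    """Hypothetical stand-in for a call target such as a concat op."""
    return (tensors, dim)


def normalize_to_kwargs(
    target: Callable[..., Any], args: Tuple[Any, ...], kwargs: Dict[str, Any]
) -> Tuple[Tuple[Any, ...], Dict[str, Any]]:
    """Rewrite a call so every argument is passed by keyword."""
    bound = inspect.signature(target).bind(*args, **kwargs)
    # bound.arguments maps parameter names to the values that were supplied.
    return (), dict(bound.arguments)
```

Once every node has this shape, a later pass can read `kwargs["dim"]` without first checking whether `dim` arrived positionally.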

Reviewed By: frank-wei

Differential Revision: D61049898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133518
Approved by: https://github.com/frank-wei
2024-08-15 23:41:39 +00:00
f347174d61 Hipify Pytorch3D (#133343)
Summary:
X-link: https://github.com/fairinternal/pytorch3d/pull/45

X-link: https://github.com/facebookresearch/pytorch3d/pull/1851

Very minor change to extend hipification to a missing hipcub constant. This is needed to hipify some of the kernels in pytorch3d.

Differential Revision: D61171993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133343
Approved by: https://github.com/houseroad
2024-08-15 23:39:07 +00:00
29c4b4ea5a [executorch] Refactor delegation code (#132773)
Summary: Refactoring partitioner-based delegation to prepare for allowing buffer mutations in the delegate (following diff).

Test Plan: CI

Differential Revision: D60813405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132773
Approved by: https://github.com/ydwu4, https://github.com/cccclai
2024-08-15 22:52:12 +00:00
86aa327e4a [FSDP2] Added eager fast-path for fp32->bf16 param cast (#133369)
Some recommendation models have a high number of `nn.Parameter`s. This exacerbates per-tensor CPU overheads in FSDP2 compared to FSDP1.

This PR adds a fast-path for the common bf16/fp32 mixed precision case that casts the parameters from fp32 to bf16, reducing CPU overhead and possibly making the copy more efficient.
- Old: `for` loop + `.to(torch.bfloat16)`, incurring dispatcher overhead per parameter
- New: `torch.empty` + `torch.split` + `torch._foreach_copy_`, incurring three dispatches

---

Example on Llama3-8B which does not have many `nn.Parameter`s (compared to recommendation models):

(Old) on Llama3-8B (0.46 ms CPU overhead for all-gather):
![Screenshot 2024-08-13 at 6 19 39 PM](https://github.com/user-attachments/assets/e6390e9f-ee54-4208-9d60-9451a4142efa)

(New) on Llama3-8B (0.37 ms CPU overhead for all-gather):
![Screenshot 2024-08-13 at 6 20 32 PM](https://github.com/user-attachments/assets/a5dc1d38-53d2-4984-b3cc-85ce5a538ede)

---

Same example as above but now with float8 all-gather:

(Old) on Llama3-8B with float8 (0.996 ms CPU overhead for all-gather):
![Screenshot 2024-08-15 at 11 27 46 AM](https://github.com/user-attachments/assets/2b7e9c9c-56ea-4375-851e-a2a704689d8d)

(New) on Llama3-8B with float8 (1.014 ms CPU overhead for all-gather):
![Screenshot 2024-08-15 at 11 26 33 AM](https://github.com/user-attachments/assets/160cf8f6-bb97-4633-b802-baeae74e3262)

The times are relatively comparable for float8, with the new one possibly slightly slower, but this is mainly because Llama's transformer blocks have only two norm weights that need to be cast to bf16. These screenshots are mainly to show that the optimization still works in the mixed case.

Differential Revision: [D61236983](https://our.internmc.facebook.com/intern/diff/D61236983)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133369
Approved by: https://github.com/weifengpy
ghstack dependencies: #133498
2024-08-15 22:27:20 +00:00
90d2593b3e Revert #132806, #132736, #132539, #132487 (#133570)
This reverts commit 25df063f044202899ab92d6f3d77950af5de482f.
This reverts commit de00c7958301ce81b9716bdef5731ed40d4d14ca.
This reverts commit 419b76c4ac80c8b1c95120cd52db622333a3a688.
This reverts commit bc57d5b6ff8725bbe93f0e67db72459720c750cf.

Differential Revision: [D61335013](https://our.internmc.facebook.com/intern/diff/D61335013)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133570
Approved by: https://github.com/albanD, https://github.com/jansel, https://github.com/anijain2305
2024-08-15 20:54:21 +00:00
5f1470d45d [export] Add InterpreterModule to trace_rules (#132949)
Summary: Added InterpreterModule to trace_rules so that it can be torch.compiled. Fixes https://github.com/pytorch/pytorch/issues/132921

Test Plan: CI

Differential Revision: D60426372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132949
Approved by: https://github.com/zhxchen17
2024-08-15 20:46:13 +00:00
09a489b177 Fix serialization for tensor list output (#133539)
Summary: Some elements of a tensor list output do not have a user. In such cases, create a name of the form `{node_name}_unused_{index}` for them.
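
A minimal sketch of the naming rule, assuming each output slot is represented by its user's name or `None` when it has no user. The helper name and the example node names are hypothetical.

```python
from typing import List, Optional


def name_tensor_list_outputs(
    node_name: str, user_names: List[Optional[str]]
) -> List[str]:
    """Give every element of a tensor list output a name for serialization."""
    return [
        # Elements without a user get a synthesized placeholder name.
        name if name is not None else f"{node_name}_unused_{index}"
        for index, name in enumerate(user_names)
    ]
```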

Test Plan: OSS CI

Differential Revision: D61309011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133539
Approved by: https://github.com/zhxchen17
2024-08-15 20:31:44 +00:00
cdf217cda1 Disable distributed nccl tests to unblock Amazon2023 ami upgrade (#133355)
These tests keep failing on the Linux Amazon 2023 AMI.  The distributed team is looking into them, but until then, disabling the tests in order to unblock the AMI upgrade

Examples of the failures:
Failure 1: https://github.com/pytorch/pytorch/actions/runs/10047579686/job/27770963175
```
FAILED [90.0880s] distributed/test_c10d_nccl.py::NCCLTraceTestDumpOnTimeout::test_timeout_dumps_timing_enabled_False - AssertionError: None mismatch: None is not -6
```

Failure 2: https://github.com/pytorch/pytorch/actions/runs/10047579686/job/27770963494
```
____ NCCLTraceTestTimeoutDumpOnStuckRanks.test_timeout_dumps_on_stuck_ranks ____
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/distributed/test_c10d_nccl.py", line 4214, in test_timeout_dumps_on_stuck_ranks
    self.assertEqual(self._wait_process(0, timeout=90), -6)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3721, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: None mismatch: None is not -6
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133355
Approved by: https://github.com/kit1980, https://github.com/wconstab
2024-08-15 20:15:00 +00:00
161cc137d2 [DTensor] Add naive replicate strategy for aten.triu.default and aten.tril.default (#133545)
Shampoo uses triu and tril [here](https://github.com/facebookresearch/optimizers/blob/main/matrix_functions.py#L63). As the matrix input is replicated, we register the naive replicate strategy to unblock.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133545
Approved by: https://github.com/awgu
2024-08-15 20:05:03 +00:00
99cf567714 Make SCRIBE_GRAPHQL_ACCESS_TOKEN available to test jobs running on main (#133536)
It is possible to write to Meta's internal in-memory database Scuba via the Scribe Graph API: https://www.internalfb.com/intern/wiki/Scribe/users/Knowledge_Base/Interacting_with_Scribe_categories/Graph_API/ This is currently being used by pytorch/benchmark repo to upload torchbench performance results.

I want to make this API generally available to all jobs running on CI in a semi-trusted context. To talk to Scribe, you need a secret access token. I have initially configured an environment prod-branch-main which contains `SCRIBE_GRAPHQL_ACCESS_TOKEN`, and switched a single class of jobs (linux-test) to use this environment when they are running on the main branch. Because we require approvals for running CI on untrusted contributions, we could potentially allow all jobs to run in this environment, including jobs on PRs, but I don't need this for my use case (per-PR benchmark result reporting, and miscellaneous statistics on main.)

If this works, I'll push out this environment to the rest of our test jobs.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133536
Approved by: https://github.com/xuzhao9, https://github.com/malfet, https://github.com/albanD
2024-08-15 19:53:17 +00:00
5dfb22d4c8 AutoHeuristic: tests (#133496)
This PR adds tests to AutoHeuristic that ensure that when existing heuristics are re-generated, the generated code stays the same.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133496
Approved by: https://github.com/eellison
2024-08-15 19:22:44 +00:00
7673ee5456 remove benchmarks/__init__.py (#133390)
trying to address https://github.com/pytorch/pytorch/issues/133377

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133390
Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/ezyang
2024-08-15 19:08:10 +00:00
dff388491b Fix docs for L1Loss and MSELoss (#133501)
The total number of elements is `N` not `n`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133501
Approved by: https://github.com/mikaylagawarecki
2024-08-15 18:56:55 +00:00
cyy
27538671ae Enable clang-tidy coverage on torch/*.h (#133422)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133422
Approved by: https://github.com/albanD, https://github.com/Skylion007
2024-08-15 18:52:08 +00:00
4aa66f68a8 [CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade (#131493)
Unblocks/unbreaks against newer CUTLASS (3.5+)

CC @nWEIdia @xwang233 @ptrblck @thakkarV

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131493
Approved by: https://github.com/Skylion007
2024-08-15 18:33:22 +00:00
41d6cabca1 [c10d]Control logging c++ traces with a flag (#133490)
Summary:
Logging C++ stack traces occasionally races with shutdown processes on exception. It isn't safe and we've seen SIGSEGVs in the field.
These crashes prevent flight recorder dumps from completing.

For now, default this dumping to `true` and provide a knob if we need to control things in production.
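
A sketch of how such a knob might be read, with the default set to `true`. The variable name `TORCH_NCCL_LOG_CPP_STACK_ON_EXCEPTION` comes from the log line in the test plan; the helper name and the set of accepted truthy strings are assumptions, and the real parsing lives in C++.

```python
import os
from typing import Mapping, Optional

ENV_NAME = "TORCH_NCCL_LOG_CPP_STACK_ON_EXCEPTION"


def log_cpp_stack_on_exception(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return whether C++ stack traces should be logged on exception."""
    env = os.environ if env is None else env
    raw = env.get(ENV_NAME)
    if raw is None:
        # Unset means the default, which is to keep logging enabled.
        return True
    # Assumed set of truthy spellings; anything else disables logging.
    return raw.strip().lower() in {"1", "true", "t", "y", "yes"}
```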

Test Plan:
Tested locally on a job named `torchx-chirag_test_run` to make sure that the JK was honored by the code.
It was correctly disabled on my test job.
see (TORCH_NCCL_LOG_CPP_STACK_ON_EXCEPTION: 0) below.

```
] [trainer2]:I0814 11:21:20.152419  3708 ProcessGroupNCCL.cpp:874] [PG ID 0PG GUID 0 Rank 10] ProcessGroupNCCL environments: NCCL version: 2.20.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 0, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 480, TORCH_NCCL_TRACE_BUFFER_SIZE: 2000, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0, TORCH_NCCL_LOG_CPP_STACK_ON_EXCEPTION: 0
```

Differential Revision: D61283335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133490
Approved by: https://github.com/fduwjj
2024-08-15 18:25:02 +00:00
546c53b784 Bump max runners for linux.24xlarge to 500 (#133569)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133569
Approved by: https://github.com/ZainRizvi
2024-08-15 18:02:46 +00:00
59b3f5911d [sigmoid] Support custom obj deserialization. (#133463)
Summary:
It seems we have multiple places deserializing torchbind objects. Moving the code around so that every load essentially shares the same implementation.

Also added a test case "package_reader_testing" which loads back the archive file in Python and eagerly validates the numerical result.

Test Plan: buck test mode/opt sigmoid/inference/test:e2e_test_cpu

Reviewed By: SherlockNoMad

Differential Revision: D61235770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133463
Approved by: https://github.com/ydwu4
2024-08-15 17:58:44 +00:00
5ec9c0bc4a Fix linearize(grad(...)) call (#133364)
Fixes #124550

Also moves `graph.eliminate_dead_code()` call to a few lines after
`_inline_module(...)` in `const_fold.py`

* Test plan:

Add a new test on `test_eager_transforms.py` to ensure the reported
issue was indeed fixed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133364
Approved by: https://github.com/zou3519
2024-08-15 17:55:36 +00:00
cfec69e2a1 Revert "Update fused kernels and call _safe_softmax from SDPA (#131863)"
This reverts commit caba37e99b03d2199848197de4e452b78c8c2a23.

Reverted https://github.com/pytorch/pytorch/pull/131863 on behalf of https://github.com/izaitsevfb due to breaks executorch test executorch/backends/apple/coreml:test - test_vit_skip_conv (executorch.backends.apple.coreml.test.test_coreml_partitioner.TestCoreMLPartitioner) ([comment](https://github.com/pytorch/pytorch/pull/131863#issuecomment-2291855634))
2024-08-15 17:55:07 +00:00
d3b458e603 [export] Do not use export.export for capture_pre_autograd_graph (#133370)
Summary:
Do not use export.export for `capture_pre_autograd_graph` in unittests anymore.

#buildall

Test Plan: CI

Reviewed By: tugsbayasgalan

Differential Revision: D60996041

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133370
Approved by: https://github.com/tugsbayasgalan
2024-08-15 17:37:45 +00:00
2236194c6b [traced-graph][sparse] cleanup test guards (#133375)
Rather than repeating the same guard for every test, simply express it once on the test fixture instead.
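
With the standard `unittest` module, the pattern looks roughly like the following. The flag `HAS_SPARSE_SUPPORT` and the test class are hypothetical; the PR's real guard condition is not shown in the commit message.

```python
import unittest

# Hypothetical capability flag standing in for the real guard condition.
HAS_SPARSE_SUPPORT = False


@unittest.skipIf(not HAS_SPARSE_SUPPORT, "sparse support not available")
class TestSparseSketch(unittest.TestCase):
    # The guard is stated once on the class, so individual tests
    # need no decorators of their own.

    def test_first(self):
        self.assertTrue(HAS_SPARSE_SUPPORT)

    def test_second(self):
        self.assertTrue(HAS_SPARSE_SUPPORT)
```

Adding a new test to the class then inherits the guard automatically, which removes the risk of forgetting it on one method.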
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133375
Approved by: https://github.com/ezyang
2024-08-15 17:32:06 +00:00
a7c6e30a3f [c10d][ez] Add space between PG ID and PG UID (#133497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133497
Approved by: https://github.com/shengbao-zheng, https://github.com/wz337
2024-08-15 17:20:12 +00:00
018e48c337 [Reland] Add wrappers for synchronous GPUDirect Storage APIs (#133489)
Reland #130633

USE_CUFILE turned off by default in this version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133489
Approved by: https://github.com/albanD
2024-08-15 17:11:52 +00:00
c23dceb8f1 Add Adafactor foreach impl (#132336)
This PR adds the foreach impl for Adafactor knowing that there are many ways to improve its runtime perf today (by adding more foreach support). After this PR:
- we have a foreach flag for Adafactor
- It is NOT the default. Why not? It is only slightly faster + uses O(n) more memory where n is the number of params in your max param group. People tend to use Adafactor for memory efficiency.

Next steps:
- make torch.compile possible on it
- make it faster (by adding more foreach apis)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132336
Approved by: https://github.com/albanD
ghstack dependencies: #133360
2024-08-15 17:00:33 +00:00
3434a54fba [CP] Rewrite ring attention backward algorithm and enablement APIs (#131351)
**What does this PR achieve**
1. This PR rewrites the ring attention backward algorithm to fuse the alltoall and overlap the gradient communication with computation.

2. Enables memory-efficient attention with CP by templating the ring attention backward to verify accuracy, as fp32 gives us higher confidence in the implementation's correctness.

3. Provides some experimental APIs to enable context parallelism.

4. Ensures CP works with torch.compiler. The combination of causal masking and torch.compiler does not
work yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131351
Approved by: https://github.com/wanchaol
2024-08-15 16:41:51 +00:00
7470ae85e4 Fix triton codegen with math.trunc (#133354)
Fixes https://github.com/pytorch/pytorch/issues/133172

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133354
Approved by: https://github.com/ezyang, https://github.com/jansel
2024-08-15 16:38:26 +00:00
fc5aa24a6e Rewording doc string for clip_grad_norm_ (#133406)
The doc string for `torch.nn.utils.clip_grad_norm_` needed some clarity. It was previously unclear that the norm is computed over the norms of the individual parameter gradients.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133406
Approved by: https://github.com/mikaylagawarecki
2024-08-15 16:21:27 +00:00
a75248528f [export] refactor _process_dynamic_shapes (#133391)
Sorryyyyy for another refactor. This splits `_process_dynamic_shapes` into 3 parts:
1. `_combine_args` - mostly the same thing
2. `_check_dynamic_shapes`, which is responsible for raising 99% of UserErrors if the dynamic shapes spec is invalid (minus 1 UserError with DerivedDims)
3.  `_process_dynamic_shapes`, which for now, is the same thing, minus the stuff in 2.

This refactor is helpful for incoming automatic dynamic shapes work, because, we're switching to `assume_static_by_default=False`, which is what `_dynamo.export` currently does. This means any unspecified dims are allocated a symbol, in contrast to export today which keeps unspecified dims static. Historically this has been desirable - export users don't want too much dynamism. So we want to change how the spec is translated into constraints.

This means when we switch over to automatic dynamic shapes, we want to plug in something in between steps 2. and 3. which patches up the spec for `assume_static_by_default=False`, filling in static shapes for any unspecified dims, and potentially clearing out the auto-dynamic dims (since they're no-ops). We would do this in-between 2. and 3. to keep `_process_dynamic_shapes` semantically the same, since it's used with `_dynamo.export`.

We could do this without a refactor, plugging in this transform before `_process_dynamic_shapes`, but since that function's responsible for both spec checking + constraint production, moving spec checking to before we transform the specs helps guarantee we're raising errors on what the user's specified, and not an internal export bug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133391
Approved by: https://github.com/avikchaudhuri
2024-08-15 16:21:21 +00:00
dd6ce2fe7c Restore mixed dtypes GEMM auto-tuning for Ampere (#129058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129058
Approved by: https://github.com/kadeng
2024-08-15 15:56:09 +00:00
758a0a88a2 [BE][Easy] enable ruff rule PIE790: unnecessary pass statement (#133200)
This PR removes unnecessary `pass` statements. This is semantically safe because the bytecode for the Python code does not change.

Note that if there is a docstring in the function, an empty function does not need a `pass` statement as a placeholder.
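
A minimal sketch of the bytecode claim, assuming CPython, where `co_code` holds the raw instruction bytes without line-number information. The names `with_pass` and `without_pass` are hypothetical and do not come from this PR:

```python
def with_pass():
    """Docstring only."""
    pass


def without_pass():
    """Docstring only."""


# Compare the raw instruction bytes of the two function bodies.
same_bytecode = with_pass.__code__.co_code == without_pass.__code__.co_code
print(same_bytecode)
```

Comparing `co_code` rather than the whole code object is deliberate here: the two functions have different names and line tables, so only the instruction bytes are expected to match.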

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133200
Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/kit1980
2024-08-15 15:50:19 +00:00
57d1ffc512 Ignore torch.onnx._internal in test_circular_dependencies (#133110)
Ignore the whole `_internal` module as code will depend on onnxscript and onnx.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133110
Approved by: https://github.com/titaiwangms, https://github.com/malfet
2024-08-15 15:37:24 +00:00
a6ad834fa8 Fix counting execution time in run_test.py (#133199)
`elapsed_time` was computed immediately after `start_time`, so it did not reflect the real execution time of `test_batch`.

Move the `elapsed_time` computation and the print call after the `run_tests` call to fix it.
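
The pattern can be sketched as follows. This is an illustration only, not the code in `run_test.py`; `run_and_time` and its `task` argument are hypothetical names:

```python
import time


def run_and_time(task):
    """Run `task` and return (result, elapsed seconds)."""
    start_time = time.perf_counter()
    result = task()
    # Read the clock only after the work finishes, so the duration covers it.
    elapsed_time = time.perf_counter() - start_time
    return result, elapsed_time


result, elapsed = run_and_time(lambda: sum(range(100_000)))
print(f"result={result}, elapsed={elapsed:.6f}s")
```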

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133199
Approved by: https://github.com/clee2000
2024-08-15 15:29:44 +00:00
ec49ce5f8e [CUDA]: Add frexp CUDA bfloat16 support (#133313)
Fixes #133263. Adds CUDA bfloat16 support to cuda_frexp.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133313
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-08-15 15:20:00 +00:00
59e33cd1f4 [FSDP2] Set ctx.set_materialize_grads(False) for post-backward (#133498)
https://pytorch.org/docs/stable/generated/torch.autograd.function.FunctionCtx.set_materialize_grads.html
This avoids unnecessary `aten::zeros` calls for the inputs in the post-backward custom autograd backward. We do not need the gradient values for the post-backward logic.

Differential Revision: [D61291210](https://our.internmc.facebook.com/intern/diff/D61291210)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133498
Approved by: https://github.com/weifengpy
2024-08-15 14:58:26 +00:00
07adae3dac Revert "Make FX Graph Cache work with distributed training (#133374)"
This reverts commit dcdb25453e0ddc6a83e0052fffc544d4d03cdffd.

Reverted https://github.com/pytorch/pytorch/pull/133374 on behalf of https://github.com/albanD due to Broke trunk ([comment](https://github.com/pytorch/pytorch/pull/133374#issuecomment-2291289260))
2024-08-15 13:43:16 +00:00
32d890745d Revert "Add cache timings info to tlparse (#133504)"
This reverts commit 7eb31e5023fa16c51a984257ee7ee4e17fb3c682.

Reverted https://github.com/pytorch/pytorch/pull/133504 on behalf of https://github.com/albanD due to Broke trunk ([comment](https://github.com/pytorch/pytorch/pull/133374#issuecomment-2291289260))
2024-08-15 13:43:16 +00:00
bbddde311a Migrate inductor jobs to runner determinator (#133457)
Updates inductor jobs to use the runner determinator script.

Depends-On: pytorch/pytorch#133352
Closes: pytorch/ci-infra#257
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133457
Approved by: https://github.com/ZainRizvi
2024-08-15 12:16:39 +00:00
9876aa39c0 AutoHeuristic: pad_mm documentation (#133411)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133411
Approved by: https://github.com/Chillee
ghstack dependencies: #133409, #133410
2024-08-15 10:49:56 +00:00
f32a9e953f AutoHeuristic: mixed_mm documentation (#133410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133410
Approved by: https://github.com/Chillee
ghstack dependencies: #133409
2024-08-15 10:49:56 +00:00
142353eca3 AutoHeuristic: util scripts (#133409)
This PR introduces scripts that make it easier to use autoheuristic:
- `collect_data.sh`: The user can specify things like the number of GPUs to be used and the number of training samples to collect. This script will open one tmux pane per GPU and collect num_training_samples/num_gpus samples per GPU.
- `merge_data.py`: This script can be used to merge multiple training data files into a single file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133409
Approved by: https://github.com/Chillee
2024-08-15 10:49:56 +00:00
b0fc6aa412 fix a typo in the householder_product docs (#124279)
The function argument is A, not V.

Remaining inconsistency is the matrix $A$ with columns $v_i$.
It seems, a better solution would be to rename the argument $A \rightarrow V$, but this might lead to backward compatibility issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124279
Approved by: https://github.com/lezcano
2024-08-15 09:34:17 +00:00
b6335cfeab Add an option to use do_bench_using_profiling in TORCHINDUCTOR_PROFILE (#133523)
When I profiled using the "TORCHINDUCTOR_PROFILE" option, some kernels showed less bandwidth than expected. So this adds an option to exclude the CPU overhead from the profiling time:

```
# With the option:
(pytorch-3.10) [shuqiyangdevgpu001.lla3 ~/local/pytorch (gh/shunting314/144/head)]$ TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_WITH_DO_BENCH_USING_PROFILING=1 TORCHINDUCTOR_PROFILE_OUTPUT=/tmp/profile.txt python ../test_pt/a.py
0.038ms         0.067 GB         1777.11GB/s     triton_poi_fused__to_copy_clamp_clone_mul_0
SUMMARY (/tmp/torchinductor_shuqiyang/tmp03wdg8e4/m6/cm6vdqp62ofwsone3u3fmb42vs3fti5omseo3qn4ddh2bhalsvbn.py)
0.04ms           0.07 GB         1777.11GB/s

# Without the option:
(pytorch-3.10) [shuqiyangdevgpu001.lla3 ~/local/pytorch (gh/shunting314/144/head)]$ TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_OUTPUT=/tmp/profile.txt python ../test_pt/a.py
0.040ms         0.067 GB         1663.09GB/s     triton_poi_fused__to_copy_clamp_clone_mul_0
SUMMARY (/tmp/torchinductor_shuqiyang/tmpwr6rraao/s4/cs4npkh77myatwpcmsizyduyfm6ne6o4pg4n3eodejdvvg2j3xzd.py)
0.04ms           0.07 GB         1663.09GB/s
```
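
As a rough check of the numbers above, assuming the bandwidth column is simply data moved divided by kernel time. The printed time and size are rounded, so the recomputed values only approximate the reported ones; `bandwidth_gb_per_s` is a hypothetical helper:

```python
def bandwidth_gb_per_s(gigabytes, milliseconds):
    """Bandwidth in GB/s from data moved (GB) and kernel time (ms)."""
    return gigabytes / (milliseconds / 1000.0)


# Rounded values copied from the two profile outputs above.
with_option = bandwidth_gb_per_s(0.067, 0.038)
without_option = bandwidth_gb_per_s(0.067, 0.040)
print(f"{with_option:.0f} GB/s vs {without_option:.0f} GB/s")
```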

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133523
Approved by: https://github.com/nmacchioni
2024-08-15 09:27:11 +00:00
cf1fc07bd4 [DTensor][Easy] Minor fix to Partial Placement Docstring (#133149)
Minor doc fix: The reduce op string for product should be "product" instead of "prod".
https://github.com/pytorch/pytorch/blob/main/torch/distributed/_functional_collectives.py#L1045

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133149
Approved by: https://github.com/XilunWu, https://github.com/tianyu-l
2024-08-15 08:09:30 +00:00
e6272acaec C++ network flow implementation in c10 (#132188)
The functorch partitioners use network flow to split the joint graph into a forward and backward graph. Internally, we've found that upgrading to networkx 2.8.8 (from 2.5) results in some hard-to-debug failures (internal reference: https://fburl.com/workplace/jrqwagdm). And I'm told that there's interest to remove the python dependency.

So this PR introduces a C++ implementation that mirrors the API provided by networkx. We'll need to add python bindings and do some additional testing to verify correctness.

Differential Revision: [D61284135](https://our.internmc.facebook.com/intern/diff/D61284135)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132188
Approved by: https://github.com/Chillee
2024-08-15 07:32:51 +00:00
c88174df95 typing for remote_cache (#133446)
Summary:
typing annotations for remote_cache
Redo of #133299 with fixed annotations.

Test Plan: unit tests

Differential Revision: D61271883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133446
Approved by: https://github.com/oulgen
2024-08-15 06:36:13 +00:00
7eb31e5023 Add cache timings info to tlparse (#133504)
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpLR1T85/rank_1/0_0_0/fx_graph_cache_hash_11.json

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133504
Approved by: https://github.com/jamesjwu
ghstack dependencies: #133362, #133363, #133374
2024-08-15 05:53:00 +00:00
448d54ee92 AutoHeuristic: instructions (#132894)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132894
Approved by: https://github.com/Chillee
2024-08-15 04:54:54 +00:00
8624a571b4 [Inductor][CPP] Support vectorization of remainder (#129849)
**Summary**
When checking the vectorization status across 3 test suites, we found that some operators had vectorization disabled with the message `Disabled vectorization: op: remainder`. This PR adds vectorization support for this op.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_remainder
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_int_div_vec
```

Differential Revision: [D61147014](https://our.internmc.facebook.com/intern/diff/D61147014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129849
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-08-15 02:06:30 +00:00
1120b5ab55 Revert "[CI] Change inductor-perf-test-nightly naming (#131476)"
This reverts commit 86cb24e6ebf1b85840568fbc62d22629abaf5739.

Reverted https://github.com/pytorch/pytorch/pull/131476 on behalf of https://github.com/desertfire due to a manually triggered dashboard run failing ([comment](https://github.com/pytorch/pytorch/pull/131476#issuecomment-2290224084))
2024-08-15 01:18:06 +00:00
c2b2969b5d made some args optional in create_mask (#133413)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133413
Approved by: https://github.com/yanboliang, https://github.com/drisspg
2024-08-15 00:34:55 +00:00
8676401707 [MPS] Enable MPS mm from macOS >= 14.4 (#133494)
Summary of changes:
- [MPS] Enable MPS `mm` op from macOS >= 14.4. Previously this was disabled in https://github.com/pytorch/pytorch/pull/117549 as it was causing crashes with large matrices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133494
Approved by: https://github.com/malfet
2024-08-15 00:25:22 +00:00
dcdb25453e Make FX Graph Cache work with distributed training (#133374)
During distributed training, if all ranks except one hit the cache, the rank that missed the cache will cause an NCCL timeout, since the rest of the ranks will enter the collective and start the timer. This PR uses the new PTD API to increase the timeout for the ranks that hit the cache by the amount of time the cache would save.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133374
Approved by: https://github.com/ezyang
ghstack dependencies: #133362, #133363
2024-08-14 22:58:48 +00:00
6d4287419c [ONNX] Disable op_level_debug tests (#133485)
op_level_debug is being deprecated. So we disable the tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133485
Approved by: https://github.com/titaiwangms
2024-08-14 22:02:12 +00:00
7a74294786 [sparse] enable meta tests (#133379)
The skip for dynamo is no longer needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133379
Approved by: https://github.com/ezyang
2024-08-14 21:58:23 +00:00
3965f11837 Minor type annotation updates following up D60954888 (#133382)
Summary: As title.

Test Plan:
CI

Ran lintrunner locally, but we may have to keep an eye out for more OSS linting issues if they come up.

Differential Revision: D61240900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133382
Approved by: https://github.com/ColinPeppler
2024-08-14 21:36:42 +00:00
d8c494910b [EZ] Enable explicitly opting into the old Linux Amazon 2 ami - Pt 1 (#133469)
For the next phase of the Amazon 2023 migration we'll be bulk migrating the remaining jobs over to the new AMI by changing the default AMI that we use.

In preparation for that, we're adding the old Linux Amazon 2 AMI as a fixed variant for runners, so that if any of the less frequently run jobs break on the Amazon 2023 AMI, they can shift to explicitly using the Amazon 2 AMI temporarily while the underlying problem is debugged and fixed.

This PR is part 1, and there's a corresponding scale config PR in test-infra: https://github.com/pytorch/test-infra/pull/5551
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133469
Approved by: https://github.com/clee2000
2024-08-14 21:33:02 +00:00
3fc9ee5a31 [DeviceMesh] Directly retrieve flattened mesh if already created (#133195)
Add mapping to keep track of root_to_flatten relationship and directly retrieve the flattened mesh if already created (no pg creation).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133195
Approved by: https://github.com/fegin, https://github.com/wanchaol
ghstack dependencies: #133193
2024-08-14 21:11:04 +00:00
44eaef46d0 [DCP] Fix meta tensor loading (#133256)
We realized that the fix in https://github.com/pytorch/pytorch/pull/129683 for loading the learning rate in place actually broke meta tensor initialization. After PR #129683, the learning rate loads correctly, but the params with meta tensors are still uninitialized.

We cannot use `tree_map_only_` to iterate over the state_dict for in-place initialization, as `empty_like` and `to("cuda")` are both not in-place operations. More context in https://github.com/pytorch/pytorch/issues/130709. Therefore, with the changes in https://github.com/pytorch/pytorch/pull/129683, the tensors after loading are still meta tensors. We previously did not catch this because `self.assertEqual()` does not distinguish a DTensor from a meta DTensor.

In this PR, we added a _iterate_state_dict() function to implement in-place update for state_dict and updated the test to make sure that the params are no longer meta tensors after loading.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133256
Approved by: https://github.com/fegin
2024-08-14 21:07:11 +00:00
c0be0105c7 [aarch64] Replace OpenBLAS with NVPL in cuda arm docker (#132811)
Add NVPL to CUDA ARM docker build

Originally https://github.com/pytorch/builder/pull/1928; moving it to the pytorch/pytorch repo now.

Needs to land together with the builder repo change https://github.com/pytorch/builder/pull/1950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132811
Approved by: https://github.com/atalman
2024-08-14 21:01:50 +00:00
2e8c1be947 Update date for 2.5 in RELEASE.md (#133503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133503
Approved by: https://github.com/atalman
2024-08-14 20:45:58 +00:00
86cb24e6eb [CI] Change inductor-perf-test-nightly naming (#131476)
Summary: To make it consistent with inductor-perf-test-nightly-x86
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131476
Approved by: https://github.com/huydhn, https://github.com/zou3519
2024-08-14 20:42:15 +00:00
bedf96d7ff [AOTI] Switch fbcode HIP to C shim version v2 (#133105)
Summary: Completely switch over the default value of c_shim_version to 2

Test Plan: CI

Differential Revision: D60674018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133105
Approved by: https://github.com/ColinPeppler, https://github.com/zoranzhao
2024-08-14 19:39:10 +00:00
6980e9e569 [AOTI] Disable split_cat_aten passes (#133014)
Summary: disable passes with negative performance impact

Test Plan: run UT

Differential Revision: D60970288

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133014
Approved by: https://github.com/frank-wei
2024-08-14 19:36:17 +00:00
63e5b09218 Add unit test for asymmetric compilation (#133363)
Unit test for asymmetric compilation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133363
Approved by: https://github.com/jamesjwu
ghstack dependencies: #133362
2024-08-14 19:32:18 +00:00
6f51782a59 Add comptime.sleep (#133362)
Add comptime sleep for NCCL timeout testing. The unit test is not great.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133362
Approved by: https://github.com/jamesjwu
2024-08-14 19:32:18 +00:00
cf81180007 allow SubConfigProxy of arbitrary depth (#133418)
Before, having arbitrary depth nested configs like

```
class Foo:
    foo: List[int] = [1, 2, 3]
    class Bar:
        bar: str = "1"
        class Baz:
            baz: int = 1
```

would cause problems beyond the first layer. For example, if we tried

```
from torch._inductor import config as inductor_config

print(inductor_config.Foo)
print(repr(inductor_config.Foo.foo))
print(inductor_config.Foo.Bar)
print(repr(inductor_config.Foo.Bar.bar))
print(inductor_config.Foo.Bar.Baz)
print(repr(inductor_config.Foo.Bar.Baz.baz))
```

we would get some output like

```
<torch.utils._config_module.SubConfigProxy object at 0x7fac65de00a0>
[1, 2, 3]
...
AttributeError: torch._inductor.config.Foo.Bar does not exist
```

Obviously, this is not what we want. With these changes, we get the right values

```
<torch.utils._config_module.SubConfigProxy object at 0x7f840d05bf40>
[1, 2, 3]
<torch.utils._config_module.SubConfigProxy object at 0x7f840cedc940>
'1'
<torch.utils._config_module.SubConfigProxy object at 0x7f840cedc100>
1
```
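
The idea of resolving nested config classes at any depth can be sketched with a small recursive proxy. This is only an illustration under stated assumptions: `NestedProxy` is a hypothetical class and is not how `SubConfigProxy` in `torch.utils._config_module` is implemented.

```python
class Foo:
    foo = [1, 2, 3]

    class Bar:
        bar = "1"

        class Baz:
            baz = 1


class NestedProxy:
    """Hypothetical stand-in that wraps a config class of any nesting depth."""

    def __init__(self, source):
        self._source = source

    def __getattr__(self, name):
        # Only called for names not found on the proxy itself.
        value = getattr(self._source, name)
        # Wrap nested classes again so deeper levels keep resolving.
        if isinstance(value, type):
            return NestedProxy(value)
        return value


proxy = NestedProxy(Foo)
print(repr(proxy.foo))
print(repr(proxy.Bar.bar))
print(repr(proxy.Bar.Baz.baz))
```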

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133418
Approved by: https://github.com/oulgen
2024-08-14 18:43:00 +00:00
d46e0761ca Revert "[11/N] Fix clang-tidy warnings in aten/src/ATen (#133298)"
This reverts commit 35785984013a74469de8c1d29eaecb25aa0c141e.

Reverted https://github.com/pytorch/pytorch/pull/133298 on behalf of https://github.com/izaitsevfb due to causes build time regression in aten/src/ATen/native/cpu/ReduceOpsKernel.cpp ([comment](https://github.com/pytorch/pytorch/pull/133298#issuecomment-2289453440))
2024-08-14 17:47:12 +00:00
07c73a964b [MPS][BE] Delete MacOS-12.3 specific checks (#133141)
And make MPS device unavailable on Sonoma releases, as lots of those checks are 2 years old, are no longer validated in CI, and probably many more such checks are missing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133141
Approved by: https://github.com/kulinseth, https://github.com/clee2000, https://github.com/atalman
2024-08-14 17:42:40 +00:00
7b269cc484 [TD] llm retrieval to not use bash -l {0} (#133464)
https://github.com/pytorch/pytorch/pull/129720 swapped the action used to setup miniconda from [conda incubator](https://github.com/conda-incubator/setup-miniconda) to the [custom action](2aba8f107a/.github/actions/setup-miniconda/action.yml (L1)) we have in test-infra that comes with caching.

The original miniconda [relies on bash profiles](e5293c8fd2/README.md (L746)) to set the environment variables needed to run conda, but the test infra version relies on the user using the env vars that are set during the step.

This PR changes the job to not use `bash -l {0}` to see if not activating the bash profile has an effect on the run. Unfortunately this failure happens rarely on main, so I'm not sure I will be able to see if this has an effect. On the plus side, changing this doesn't seem to have a negative effect on the job, so it should be a noop at worst.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133464
Approved by: https://github.com/kit1980
2024-08-14 16:53:41 +00:00
4bb1650ca3 Bump maximum num warps (#132458)
Fix for https://github.com/pytorch/pytorch/issues/129104

Our heuristic for num_warps was giving the optimal number, but we were capping maximum num_warps at 8. Gives 1% speedup on HF and TIMM in inference, 2% speedup in TIMM training, neutral otherwise.

Ultimately, I think we want live var analysis for register usage, but this is still worth landing now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132458
Approved by: https://github.com/Chillee, https://github.com/shunting314
2024-08-14 16:51:05 +00:00
d114fd78bd [FSDP2] Enable HSDP + TP (#133335)
This PR enables HSDP + TP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133335
Approved by: https://github.com/awgu
2024-08-14 16:34:04 +00:00
7f40ac9be2 Migrate periodic jobs to use runner determinator (#133124)
This updates the Linux & Windows jobs in periodic.yml to use the runner determinator script.

Closes: pytorch/ci-infra#261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133124
Approved by: https://github.com/ZainRizvi
2024-08-14 16:04:15 +00:00
118b2a4139 Convert inductor jobs to Linux Amazon 2023 (#133352)
A continuation of the migration started in
- https://github.com/pytorch/pytorch/pull/131250
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133352
Approved by: https://github.com/zxiiro, https://github.com/seemethere
2024-08-14 15:59:33 +00:00
62cd065de2 Validate that node TK_ASSIGN has field initialized (#127878)
Fixes segmentation fault during model load via C++ API.

An `Assign` statement (`TK_ASSIGN` type) has 3 fields: `lhs`, `rhs` and `type`. The `type` field is of type `Maybe`, which means it may be absent. During model load in `import_source.cpp`, the `type` field was dereferenced without validation.

This is similar to the error fixed in #106041.

Fixes #127877

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127878
Approved by: https://github.com/malfet
2024-08-14 15:27:58 +00:00
e554f71d7e Implement filter in dynamo (#131674)
Fixes https://github.com/pytorch/pytorch/issues/128944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131674
Approved by: https://github.com/amjames, https://github.com/jansel
2024-08-14 14:54:13 +00:00
854a5ba958 [lint] fix lint broken by #131912 (#133428)
lint

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133428
Approved by: https://github.com/aaronenyeshi
2024-08-14 14:50:18 +00:00
378b12f3ad Improve namespace for c10::MemoryFormat::Contiguous in torchgen/api/cpp.py (#131622)
Top-level namespaces are more convenient for out-of-tree device extensions.

For example, now we have a patch for it in `torch_npu`:

98c50ced16/codegen/gen_backend_stubs.py (L772-L778)

```python
JIT_TO_CPP_DEFAULT["contiguous_format"] = "c10::MemoryFormat::Contiguous"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131622
Approved by: https://github.com/zou3519
2024-08-14 14:41:01 +00:00
efc6e8457a [inductor] [cpp] fix the reindexer from template_buffer to Y (#133070)
This PR fixes the accuracy of jx_nest_base and part of the accuracy issue of convnext_base of the max-autotune path. Another fix (https://github.com/pytorch/pytorch/pull/133073 in this ghstack) is needed to make convnext_base fully pass the accuracy check.

The index calculated via the reindexer was wrong before this PR. Both the shape of the reshape reindexer and the stride order of the stride reindexer needed to be fixed.

Index calculated before this PR:
```
# in_ptr4 points to arg4_1: size = (1, 32, 18, 18), stride = (10368, 1, 576, 32)
auto tmp7 = in_ptr4[static_cast<long>((32L*(static_cast<long>((n_start + x1 + (32L*m_start) + (32L*x0))) % static_cast<long>(18L))) + (576L*(static_cast<long>(c10::div_floor_integer((n_start + x1 + (32L*m_start) + (32L*x0)), 324L)) % static_cast<long>(32L))))];
```

The correct one after the fix is:
```
auto tmp7 = in_ptr4[static_cast<long>(n_start + x1 + (32L*(static_cast<long>((m_start + x0)) % static_cast<long>(324L))))];
```
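
A small check of why the corrected expression matches the layout in the comment above. It assumes that `n_start + x1` indexes the channel and that `m_start + x0` indexes the flattened `H * W` position; these roles are an interpretation, and `strided_offset` and `fixed_offset` are hypothetical helper names:

```python
SIZE = (1, 32, 18, 18)        # (N, C, H, W) from the comment above
STRIDE = (10368, 1, 576, 32)  # channels-last strides from the comment above


def strided_offset(c, h, w):
    """Offset of element (0, c, h, w) computed from the strides."""
    return c * STRIDE[1] + h * STRIDE[2] + w * STRIDE[3]


def fixed_offset(m, n):
    """Corrected expression, with n = n_start + x1 and m = m_start + x0."""
    return n + 32 * (m % 324)


mismatches = [
    (c, h, w)
    for c in range(SIZE[1])
    for h in range(SIZE[2])
    for w in range(SIZE[3])
    if strided_offset(c, h, w) != fixed_offset(h * SIZE[3] + w, c)
]
print(len(mismatches))
```

The two agree because `576 * h + 32 * w` equals `32 * (18 * h + w)`, and `18 * h + w` is always below 324 for this shape.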

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133070
Approved by: https://github.com/jgong5
2024-08-14 11:42:03 +00:00
52741043e7 [Inductor][FlexAttention] Support non-divisible sequence lengths (#133019)
Perf benchmark script: https://gist.github.com/yanboliang/7c34a82df611d4ea8869cb9e041bfbfc
* Update ```Q_LEN``` and ```KV_LEN``` to ```8192-9``` for testing non-divisible cases.

Run ```python perf_bench.py --partial-mask```.

* Before this PR

| Sequence length       | Forward | Backward |
|---------------------|-----------------|------------------|
| **Divisible(8192)**       | 0.87            | 0.85             |
| **Non-divisible(8192-9)**   | N/A            | N/A             |

* After this PR

| Sequence length       | Forward | Backward |
|---------------------|-----------------|------------------|
| **Divisible(8192)**       | 0.87            | 0.85             |
| **Non-divisible(8192-9)**   | 0.83            | 0.78             |

Memory out of bounds check passed:
* ```PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer --tool memcheck python perf_bench.py --partial-mask```
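
To illustrate what "non-divisible" means here, the sketch below splits a sequence into fixed-size blocks. The block size of 128 is an assumption made for illustration and is not taken from this PR; `block_layout` is a hypothetical helper:

```python
def block_layout(seq_len, block_size=128):
    """Return (number of blocks, valid rows in the last block)."""
    num_blocks = -(-seq_len // block_size)  # ceiling division
    last_block_rows = seq_len - (num_blocks - 1) * block_size
    return num_blocks, last_block_rows


print(block_layout(8192))
print(block_layout(8192 - 9))
```

With a length of `8192-9`, the last block is only partially filled, which is the case that needs bounds handling.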

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133019
Approved by: https://github.com/Chillee
2024-08-14 10:27:39 +00:00
b5711297a0 Add support for SetVariable.discard (#133317)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133317
Approved by: https://github.com/Skylion007
2024-08-14 09:10:36 +00:00
ef580a0e5c [DeviceMesh] Restrict slicing to be a contiguous or non-contiguous subsequence of the root mesh_dim_names (#133193)
This PR adds a restriction on DeviceMesh slicing: out-of-order subsequence slicing is not allowed. To create a flattened mesh from mesh_dim_names, only in-order slicing is allowed.

```
mesh_3d = init_device_mesh(
    self.device_type, (2,2,2), mesh_dim_names=("dp", "cp", "tp"),
)

# valid 2d slicing
mesh_2d = mesh_3d["dp", "cp"]
mesh_2d = mesh_3d["dp", "tp"]
mesh_2d = mesh_3d["cp", "tp"]

# invalid 2d slicing
mesh_2d = mesh_3d["cp", "dp"]
mesh_2d = mesh_3d["tp", "cp"]
mesh_2d = mesh_3d["tp", "dp"]

# valid way to create dp_cp flatten slice
dp_cp_mesh = mesh_3d["dp", "cp"]._flatten()
# invalid way to create dp_cp flatten slice
dp_cp_mesh = mesh_3d["cp", "dp"]._flatten()
```
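
The ordering rule can be sketched as a plain subsequence check. This is not the DeviceMesh implementation; `is_in_order_slice` is a hypothetical helper used only to illustrate which slices above count as valid:

```python
def is_in_order_slice(root_dim_names, requested):
    """True if `requested` keeps the relative order of `root_dim_names`."""
    # index() raises ValueError for a name that is not in the root mesh.
    positions = [root_dim_names.index(name) for name in requested]
    return all(a < b for a, b in zip(positions, positions[1:]))


root = ("dp", "cp", "tp")
print(is_in_order_slice(root, ("dp", "tp")))
print(is_in_order_slice(root, ("tp", "dp")))
```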

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133193
Approved by: https://github.com/fegin, https://github.com/wanchaol
2024-08-14 07:18:41 +00:00
d143f879e2 [DTensor] Add more aten._foreach ops to _pointwise_ops.py (#133271)
Follow up for https://github.com/pytorch/pytorch/pull/132056. Added the missing foreach ops pointed out by @ad8e.

```
_foreach_sub.Scalar
_foreach_exp
_foreach_exp_
_foreach_cos_
_foreach_log_
```

As @ad8e mentioned, since the list of _foreach ops at https://pytorch.org/cppdocs/api/library_root.html is long and overload-heavy, it could be annoying to manually keep this file updated. We might need to come up with a way to update the list and add associated tests systematically.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133271
Approved by: https://github.com/awgu
2024-08-14 07:14:29 +00:00
a6413d2924 Regression test for S429861 (#133376)
Adds repro test to verify that https://www.internalfb.com/sevmanager/view/429861 does not occur again.

I haven't been able to reduce the size of the repro further; if I remove any buffers, the error disappears!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133376
Approved by: https://github.com/eellison
2024-08-14 06:55:05 +00:00
a30504b2a2 fix silly error when printing diff (#133345)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/133336

When we fail to suggest fixes for a data dependent error because some symbols couldn't be mapped to sources, we print out those symbols but there was a silly bug in the printing code.

New error:
```
...
    raise self._make_data_dependent_error(
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(u0 + 1, CeilToInt(IntTrueDiv(u0 + 1, 1))) (unhinted: Eq(u0 + 1, CeilToInt(IntTrueDiv(u0 + 1, 1)))).  (Size-like symbols: u0)

Potential framework code culprit (scroll up for full backtrace):
  File "/data/users/avik/fbsource/buck-out/v2/gen/fbcode/6ef5f323b6193f0f/pyspeech/fb/tools/__export_speech_llama__/export_speech_llama#link-tree/torch/_refs/__init__.py", line 2972, in expand
    guard_size_oblivious(requested_length == x)

For more information, run with TORCH_LOGS="dynamic"
For extended logs when we create symbols, also add TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="u0"
If you suspect the guard was triggered from C++, add TORCHDYNAMO_EXTENDED_DEBUG_CPP=1
For more debugging help, see https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit?usp=sharing

For C++ stack trace, run with TORCHDYNAMO_EXTENDED_DEBUG_CPP=1

The following call raised this error:
  File "/data/users/avik/fbsource/buck-out/v2/gen/fbcode/6ef5f323b6193f0f/pyspeech/fb/tools/__export_speech_llama__/export_speech_llama#link-tree/pyspeech/nn/utils.py", line 271, in lengths_to_padding_mask
    ).expand(batch_size, max_length)
```

Test Plan: Repro gets past reported error, hits new error

Differential Revision: D61221994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133345
Approved by: https://github.com/ezyang
2024-08-14 06:52:55 +00:00
4d11a9b783 [CI] Fix rowwise scaling tests on h100 (#133384)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133384
Approved by: https://github.com/malfet, https://github.com/nWEIdia
2024-08-14 05:58:33 +00:00
7aee3376e2 [aotd] HOP effect tokens wrapper above SubclassWrapper (#131672)
Original issue:
https://github.com/pytorch/pytorch/issues/129486

Previously, subclass_wrapper() received inputs containing additional effect tokens and failed because they did not match the SubclassMeta indexes.

This happened because functionalization was responsible for adding and removing those tokens.

Functionalization cannot be run above Subclasses, as args/outs are duplicated in case of mutations.

The main design idea is to keep the logic for EffectTokens, Subclasses, and Functionalization knowing as little as possible about each other's transformations.

To that end, this PR extracts EffectTokens manipulation into a separate wrapper, which is processed above SubclassWrapper, while functionalization happens below SubclassWrapper as before.

With that, subclass wrap/unwrap works without any knowledge of the additional arguments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131672
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
2024-08-14 05:57:17 +00:00
2a4304329b [wip][lowering] Add max_acc_splits (#133041)
Summary: Model owners can set the lower_settings with max_acc_splits=2, and lowering will fail during model iteration, to alert them of possible performance degradation from increased fragmentation.

Test Plan: Added unit tests

Differential Revision: D60133589

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133041
Approved by: https://github.com/hl475
2024-08-14 03:50:31 +00:00
f951fcd1d7 Inductor-CPU WoQ int8 GEMM micro-kernel with scale epilogue (#131887)
## Summary

As part of #125683, this PR modifies existing CPU GEMM cpp template & micro-kernel template to enable int8 WoQ GEMM auto-tuning with AVX2, AVX512 & AMX ISAs (the latter is only available on Xeon 4th generation & beyond).

WoQ GEMM takes FP16/BF16 activations, int8 weights, and scale of the same dtype as activations.
The operation is equivalent to `torch.nn.functional.linear(x, w.to(x.dtype)) * scale`, which is essentially what the ATen op `torch.ops.aten._weight_int8pack_mm` currently does (except that weights are not cached by it). Weights will be considered constant & cached, so this implementation is suitable for inference, and not QAT. `scale` is supported as a `mul` epilogue.
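
The equivalence above can be illustrated with a tiny pure-Python reference. This is only a sketch: it uses Python floats rather than BF16, the helper names are hypothetical, and it assumes `scale` holds one value per output channel.

```python
def woq_linear_reference(x, w_int8, scale):
    """Compute linear(x, w) * scale with plain Python lists.

    x: activations, shape [M][K]; w_int8: int8 weights, shape [N][K];
    scale: one float per output channel, shape [N] (an assumption of this sketch).
    """
    out = []
    for row in x:
        out_row = []
        for n, w_row in enumerate(w_int8):
            # GEMM on the weights cast to the activation dtype.
            acc = sum(a * float(w) for a, w in zip(row, w_row))
            # `scale` applied afterwards, as a `mul` epilogue.
            out_row.append(acc * scale[n])
        out.append(out_row)
    return out


def dequantize_then_linear(x, w_int8, scale):
    """Alternative ordering: fold the scale into the weights before the GEMM."""
    w_deq = [[float(w) * scale[n] for w in w_row] for n, w_row in enumerate(w_int8)]
    return [[sum(a * w for a, w in zip(row, w_row)) for w_row in w_deq] for row in x]


x = [[1.0, 2.0]]
w_int8 = [[3, -4], [5, 6]]
scale = [0.5, 0.25]
result = woq_linear_reference(x, w_int8, scale)
```

Both orderings give the same values for this input; with real low-precision dtypes the two can differ by rounding.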

Only BF16 activations are supported in this PR because for FP16 & FP32, weight is dequantized during constant-folding pass of freezing, and then after auto-tuning, performance with a large `M` dimension may be better than either torch.ops.aten._weight_int8pack_mm, or the WoQ micro-kernel support introduced in this PR, which dequantizes `w` within the micro-kernel.
While even BF16 activations with a large `M` dimension may benefit from dequantizing `w` beforehand, for now, they would use WoQ support in GEMM templates for auto-tuning, and then a subsequent PR would add logic for deciding whether or not to dequantize weights beforehand.

### Performance
#### AMX
Op-level speedup due to AMX micro-kernel (selected during auto-tuning) on 32 physical cores of Intel(R) Xeon(R) Platinum 8468H (of Xeon 4th generation series, codenamed Sapphire Rapids) vs. ATen kernel `torch.ops.aten._weight_int8pack_mm`. Intel OpenMP & tcmalloc were preloaded.

In a few cases with an odd `K`, the implementation being added in this PR may not perform as well as the ATen kernel, which is unrelated to this PR, though, since `test_linear_amx` also exhibits similar datapoints. In those cases, the AMX micro-kernel might be slower than AVX512 micro-kernel, so if such sets of shapes are used for auto-tuning, either the AVX512 micro-kernel implementation, or the ATen kernel would be chosen instead.

Benchmarked with unit-tests.

Tabular data at https://gist.github.com/sanchitintel/294811a86c8ff6b867c668ae2107c405?permalink_comment_id=5142442#gistcomment-5142442

The AVX512 micro-kernel was disabled to collect data for AMX micro-kernel.

#### AVX2/AVX512 micro-kernels

Tabular data at https://gist.github.com/sanchitintel/52b5fa9c66f791be19e48e2aa6423dc4?permalink_comment_id=5142437#gistcomment-5142437

### Follow-up
1. int4 WoQ GEMM micro-kernel will also be added in a separate PR.
2. A subsequent PR would add logic for deciding whether or not to dequantize weights beforehand.

E2E perf measurement should be done with #131310.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131887
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-08-14 03:14:45 +00:00
918367ebb0 Add new runner: G4DN Extra Large with T4 for windows binary builds (#133229)
Prep for #103104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133229
Approved by: https://github.com/ZainRizvi
2024-08-14 03:08:49 +00:00
1206958d89 [Dynamo] add EventVariable reconstruct (#133236)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133236
Approved by: https://github.com/yifuwang
2024-08-14 02:56:11 +00:00
d1d6b370ce Upgrade nightly wheels to rocm6.2 - 1 of 2 (docker images) (#132875)
Fixes https://github.com/pytorch/pytorch/issues/132570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132875
Approved by: https://github.com/atalman
2024-08-14 02:46:48 +00:00
14750dd737 Correct return type of grouping helper function in Optimizer (#133360)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133360
Approved by: https://github.com/albanD
2024-08-14 01:56:02 +00:00
5fff960004 [PT2][Optimus] Extend split_stack_to_cats when split and stack have different dims (#133060)
Summary: We observed a special case in AI CMF where the split and stack nodes have different dims, thus we extend our current implementation to include the special case.

Test Plan:
# unit test

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```

Buck UI: https://www.internalfb.com/buck2/6d0502bc-c840-425e-b686-b00b0b7da5f5
Test UI: https://www.internalfb.com/intern/testinfra/testrun/17732923577411786
Network: Up: 353KiB  Down: 611KiB  (reSessionID-1f80d74b-543f-4856-b3bf-181283c0f7e3)
Jobs completed: 29. Time elapsed: 5:36.7s.
Cache hits: 0%. Commands: 4 (cached: 0, remote: 1, local: 3)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0

# benchmark

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ai_cmf" --flow_id 558295195 -n
```

 Counter({'pattern_matcher_nodes': 2321, 'pattern_matcher_count': 1320, 'normalization_pass': 280, 'merge_splits_pass': 250, 'extern_calls': 95, 'normalization_aten_pass': 28, 'scmerge_cat_removed': 14, 'scmerge_cat_added': 12, 'scmerge_split_removed': 7, 'unbind_stack_pass': 7, 'split_stack_to_cats_pass': 4, 'scmerge_split_sections_removed': 3, 'batch_aten_add': 3, 'batch_aten_mul': 3, 'split_cat_pass': 2, 'scmerge_split_added': 2, 'split_cat_to_slices_pass': 2, 'fxgraph_cache_miss': 2, 'batch_linear_post_grad': 1})

torch graph
https://www.internalfb.com/intern/everpaste/?color=0&handle=GK5kwRZRtEMCZTAJAJlRpekhPhp0br0LAAAz

# e2e

Differential Revision: D60998945

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133060
Approved by: https://github.com/jackiexu1992
2024-08-14 01:45:12 +00:00
4af4910b1a Reland "Construct NJT without graph breaks" (#133196)
This reverts commit 154d40ca488e6979ce9c2de89d8a35b53129ebea.

and adds changes from https://github.com/pytorch/pytorch/pull/133061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133196
Approved by: https://github.com/ezyang
ghstack dependencies: #133145
2024-08-14 01:11:13 +00:00
f23dbefe52 [export] Support "custom" metadata field. (#131912)
Summary:
Add a special field in Graph and Node level metadata called "custom", which should be mapped to a json-serializable object, and we guarantee that this field is always preserved across the following transformations:
1. copy/deepcopy
2. run_decompositions()
3. serialization
4. re-exporting

Test Plan: :test_export -- -r custom_tag

Reviewed By: angelayi

Differential Revision: D60291839

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131912
Approved by: https://github.com/angelayi
2024-08-14 01:09:01 +00:00
cyy
c2eeda5da0 [structural binding][12/N] Replace std::tie with structural binding (#131031)
Follows #130830
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131031
Approved by: https://github.com/ezyang
2024-08-14 00:51:34 +00:00
7666ef9d9b [GHF] Fix co-authors attribution (#133372)
According to https://docs.github.com/en/pull-requests/committing-changes-to-your-project/creating-and-editing-commits/creating-a-commit-with-multiple-authors, co-authors must be mentioned at the very end of the commit message and separated by 2 newlines.
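
A minimal sketch of that placement rule, assuming a hypothetical helper `append_co_authors` (this is not the code in `trymerge`, and the name and email are made up):

```python
def append_co_authors(message, co_authors):
    """Append Co-authored-by trailers at the very end of a commit message."""
    if not co_authors:
        return message
    trailers = "\n".join(
        f"Co-authored-by: {name} <{email}>" for name, email in co_authors
    )
    # Two newline characters (one blank line) separate the body from the trailers.
    return message.rstrip("\n") + "\n\n" + trailers


msg = append_co_authors(
    "Fix the thing\n\nPull Request resolved: <url>\n",
    [("Jane Doe", "jane@example.com")],
)
```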

Test plan:
```python
from trymerge import GitHubPR
pr = GitHubPR("pytorch", "pytorch", 133189)
print(pr.gen_commit_message())
```

Fixes https://github.com/pytorch/pytorch/issues/133310

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133372
Approved by: https://github.com/kit1980
2024-08-14 00:48:24 +00:00
cyy
3578598401 [11/N] Fix clang-tidy warnings in aten/src/ATen (#133298)
Follows #133155

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133298
Approved by: https://github.com/ezyang
2024-08-14 00:29:38 +00:00
fbb0adbc32 [TunableOp] lazy init TuningContext singleton (#133347)
Forward fix after #132464 because TuningContext had been created during static library init, which creates the TuningResultsValidator, which tries to query HIP device properties before the HIP runtime has initialized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133347
Approved by: https://github.com/zixi-qi
2024-08-14 00:20:11 +00:00
5947169c9d Add missing endline in exception message (#133240)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133240
Approved by: https://github.com/Skylion007
2024-08-14 00:11:39 +00:00
c91bc499f7 [CI] Do not emit color escape sequence during testing (#133350)
By forcing term to vt100

Fixes the problem reported in https://github.com/pytorch/pytorch/issues/133330, but more broadly it should be fixed on the Nova/Infra side

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133350
Approved by: https://github.com/zou3519
2024-08-13 23:39:16 +00:00
caba37e99b Update fused kernels and call _safe_softmax from SDPA (#131863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131863
Approved by: https://github.com/jbschlosser, https://github.com/Chillee
2024-08-13 23:37:50 +00:00
9de023d44d [Dynamo] Make torch.Size can be reconstructed by LOAD_CONST (#133342)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133342
Approved by: https://github.com/mlazos, https://github.com/jansel
2024-08-13 23:18:38 +00:00
c17d26c3c1 [AOTI][Tooling] A couple fixes / minor updates for initial debug printer (#133016)
Summary:
Follow up small diff to fix a couple issues:
-  add condition for cuda/gpu case to only print kernel name list in the second pass i.e. when we do the cpp wrapper codegen

- other minor fixes around `AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT` option

Test Plan:
```
AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT="triton_poi_fused_0" AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_abi_compatible_cuda
```

Differential Revision: D60954888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133016
Approved by: https://github.com/ColinPeppler
2024-08-13 23:00:29 +00:00
41da528378 [BE] Skip inductor+profiler test for templates if we didn't run select_autotune (#133344)
Sometimes we don't have enough SMs to do autotuning and then we fall back to aten, in which case we won't run the template kernel and it won't show up in the profile trace.

Differential Revision: [D61222101](https://our.internmc.facebook.com/intern/diff/D61222101/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133344
Approved by: https://github.com/masnesral
2024-08-13 22:58:24 +00:00
8e074c4583 [ROCm] skip SymmetricMemory related UTs for ROCm (#133241)
This feature is not yet supported on ROCm.
Skipping:
distributed/test_symmetric_memory.py::SymmetricMemoryTest::test_low_contention_all_gather_symm_mem_input_False
With the errors:
RuntimeError: CUDASymmetricMemory requires PYTORCH_C10_DRIVER_API_SUPPORTED

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133241
Approved by: https://github.com/pruthvistony, https://github.com/malfet
2024-08-13 22:33:51 +00:00
5a1d4f7ddc Migrate lint.yml to runner determinator (#133320)
Update the jobs in lint.yml to use the runner determinator.

Closes: pytorch/ci-infra#258

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133320
Approved by: https://github.com/Skylion007
2024-08-13 22:16:32 +00:00
a9d34138df [traced-graph][sparse] add to_dense() operation to sparse export test (#133175)
This works for sparse COO but surprisingly still fails for the other compressed sparse cases. I filed the following bug report:

https://github.com/pytorch/pytorch/issues/133174
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133175
Approved by: https://github.com/ezyang
2024-08-13 20:36:40 +00:00
69de9e78e9 Revert "typing for remote_cache (#133299)"
This reverts commit 2fde1934f9efc418cc5a398bd0b09b29551cc091.

Reverted https://github.com/pytorch/pytorch/pull/133299 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/133299#issuecomment-2287067434))
2024-08-13 20:26:24 +00:00
fa7ae6cdbc can't infer device on benchmarked function with no args or kwargs (#133290)
when we call benchmarker.benchmark(fn, (), {}) it attempts to infer the device from the args and kwargs, which are both empty. in this case the default behavior is to assume CPU, since `is_cpu_device` is implemented as `all([x.device == "cpu" for x in ... if x is Tensor])`, and `all([]) == True`. I've added a PR that makes this raise an error, but we should just fix this one callsite first
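
The vacuous-truth behavior can be reproduced without any tensors. In this sketch `FakeTensor` is a hypothetical stand-in, and `is_cpu_device_strict` is only one possible shape for a check that raises; neither is the benchmarker's actual code.

```python
class FakeTensor:
    """Hypothetical stand-in for a tensor; it only carries a device string."""

    def __init__(self, device):
        self.device = device


def is_cpu_device(inputs):
    # With no tensor inputs the generator is empty, and all() of nothing is True.
    return all(x.device == "cpu" for x in inputs if isinstance(x, FakeTensor))


def is_cpu_device_strict(inputs):
    # Refuse to guess when there is nothing to infer the device from.
    tensors = [x for x in inputs if isinstance(x, FakeTensor)]
    if not tensors:
        raise ValueError("cannot infer device: no tensor inputs")
    return all(t.device == "cpu" for t in tensors)


inferred_for_empty = is_cpu_device([])
```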

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133290
Approved by: https://github.com/eellison
2024-08-13 20:13:44 +00:00
dfc7c860e4 Allow SymInt input for torch.fx reinplace pass (#133178)
Fixes #133176

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133178
Approved by: https://github.com/ezyang
2024-08-13 20:07:17 +00:00
61625a18ef [profiler] Only parse kineto requests and build tree when required (#132713)
To avoid the high overhead of constructing data structures in Python when the user is simply saving a trace to a file, we only process things lazily.

## Details
1. Delay function event parsing, add a flag to denote when needed.
1. Make profiler.function_events a computed property so code using `prof.function_events` does not need to change.
1. Fix coverage for `str(prof)` in profiler tests.
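
The lazy pattern in the list above can be sketched roughly as follows. `LazyProfile` and its members are hypothetical names for illustration, not the profiler's real classes.

```python
class LazyProfile:
    """Parses events only when `function_events` is first read."""

    def __init__(self, raw_events):
        self._raw_events = raw_events
        self._function_events = None  # doubles as the "parsing needed" flag
        self.parse_count = 0

    def _parse(self):
        # Stand-in for the expensive event parsing and tree building.
        self.parse_count += 1
        return [event.upper() for event in self._raw_events]

    @property
    def function_events(self):
        # Computed property: callers keep using `prof.function_events` unchanged.
        if self._function_events is None:
            self._function_events = self._parse()
        return self._function_events

    def export_trace(self):
        # Saving the raw trace never triggers parsing.
        return list(self._raw_events)


prof = LazyProfile(["aten::mm", "aten::add"])
exported = prof.export_trace()
parses_after_export = prof.parse_count
first = prof.function_events
second = prof.function_events
```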

## Test run
Test program
```
import torch
from torch.profiler import profile, record_function, ProfilerActivity

def payload(use_cuda=False):
    x = torch.randn(10, 10)
    if use_cuda:
        x = x.cuda()
    y = torch.randn(10, 10)
    if use_cuda:
        y = y.cuda()
    z = torch.mm(x, y)
    z = z + y
    if use_cuda:
        z = z.cpu()

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("model_inference"):
        payload()

prof.export_chrome_trace("/tmp/test_trace.json")
#print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

The print "this is computing events" will happen lazily.

```
>]$ python3 profiler_test.py
Brian: this is computing function events
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
                  Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
       model_inference         6.77%     441.628us       100.00%       6.523ms       6.523ms             1
           aten::randn         1.86%     121.108us        46.93%       3.061ms       1.530ms             2
              aten::mm        45.36%       2.959ms        45.44%       2.964ms       2.964ms             1
         aten::normal_        44.72%       2.917ms        44.72%       2.917ms       1.458ms             2
             aten::add         0.87%      56.646us         0.87%      56.646us      56.646us             1
           aten::empty         0.35%      22.808us         0.35%      22.808us      11.404us             2
    aten::resolve_conj         0.08%       5.173us         0.08%       5.173us       1.724us             3
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 6.523ms

$> python3 profiler_test.py
(pytorch) [bcoutinho@devgpu038.ftw6 /data/users/bcoutinho/pytorch (profiler_optimize_parsing)]$
$>ls -a profiler_test.py
$> ls -l /tmp/test_trace.json
-rw-r--r-- 1 bcoutinho users 16471 Aug  5 16:10 /tmp/test_trace.json
```
## Unit test
Updates some tests and they all pass now.
`pytest test/profiler/test_profiler.py`

Also
`python test/test_autograd.py TestAutogradWithCompiledAutograd.test_record_function`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132713
Approved by: https://github.com/sraikund16
2024-08-13 18:58:20 +00:00
657d58bbd8 [inductor] Fix test_codecache::test_inductor_counters (#133244)
Summary: This test is flaky internally, but it's not a great test in the first place since it relies on the max-autotune step to bump a related counter. Instead of doing that, directly install a mock that bumps a counter specifically for this test. Additionally, test that the caching logic correctly accommodates an arbitrary counter delta (previously the relevant counter was only bumped by +1).

Differential Revision: [D61141164](https://our.internmc.facebook.com/intern/diff/D61141164)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133244
Approved by: https://github.com/eellison
2024-08-13 18:52:27 +00:00
2fde1934f9 typing for remote_cache (#133299)
Summary: typing annotations for remote_cache

Test Plan: unit tests

Reviewed By: oulgen

Differential Revision: D60937968

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133299
Approved by: https://github.com/Skylion007
2024-08-13 18:28:41 +00:00
a1ca4dfe0b [ONNX] Fix onnx conversion scaled_dot_product_attention (#133314)
Fixes error message raised by the torch>=2.5: A mismatch between the number of arguments (8) and their descriptors (7) was found at symbolic function 'scaled_dot_product_attention' by adding the newly introduced use_gqa parameter.

From https://github.com/pytorch/pytorch/pull/132689
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133314
Approved by: https://github.com/Skylion007, https://github.com/justinchuby
2024-08-13 18:22:24 +00:00
19416bf38b Reland "[2/2] PT2 Inductor ComboKernels - automatic horizontal fusing (#131675)" (#133291)
Reland by reverting commit 844103197d3e8cf6b4b59176e473365113f4f962. #131675 failed a few internal tests because it imported a diff version which wasn't rebased on the proper dependent diffs. Reland from OSS only to avoid the out-of-sync issue.

Original description from #131675
Summary:
A ComboKernel combines independent Inductor Triton kernels into a single one.
This is the part 2 pull request, which 1) adds automatic horizontal fusion at the end of the Inductor operator fusion process and 2) adds type annotations for triton_combo_kernel.py.

ComboKernel is used in two cases: 1) for existing foreach kernels, combo kernels are used as the backend kernel. the front-end kernel generation logic remains the same. 2) Added an extra optimization phase to the end of the scheduler to generate extra combo kernels if combo_kernels is True in config.py

This is part 2 pull request which deals with the 2nd case above:

The combo kernel generation in the added optimization phase is done in two steps: 1) in the front end, inside the scheduler, it topologically sorts the schedule nodes to find all the nodes with no data dependency and creates a front-end schedule node for them. We currently limit the maximum number of sub-nodes for each combo kernel to 8 (but we still need to find the optimal number). 2) Then, these sub-nodes are combined in the codegen phase to generate the combo kernel code for them based on a few rules. For example, 1d and 2d kernels are separated into different combo kernels, as mixing them is not supported yet. Note that the algorithms we provide are very basic, and users can register their own customized combo kernel generation algorithms for both steps.
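
As a rough illustration of the first step only, the sketch below groups nodes by topological level and caps each group at 8. It is a simplified stand-in, not the scheduler's algorithm; the function name and the `deps` format are assumptions, and every dependency is assumed to also appear as a key.

```python
def group_independent_nodes(deps, max_nodes=8):
    """Group nodes that have no data dependency on each other.

    deps maps each node name to the names it depends on.
    """
    remaining = {node: set(d) for node, d in deps.items()}
    groups = []
    while remaining:
        # Nodes whose dependencies are all satisfied cannot depend on each other.
        ready = sorted(node for node, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependency cycle detected")
        # Cap the number of sub-nodes per group.
        for start in range(0, len(ready), max_nodes):
            groups.append(ready[start:start + max_nodes])
        for node in ready:
            del remaining[node]
        for d in remaining.values():
            d.difference_update(ready)
    return groups


groups = group_independent_nodes({"a": [], "b": [], "c": ["a"], "d": ["a", "b"]})
```

Here `a` and `b` land in one group and `c` and `d` in the next, since the latter two only become independent candidates once their inputs are scheduled.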

Performance-wise, combining small kernels almost always yields a performance gain. However, combining very large kernels may not yield any gain and can sometimes even regress, possibly due to improper block sizes. Thus, a benchmark function is implemented to avoid such perf regressions, and it is recommended to turn it on by setting benchmark_combo_kernels to True whenever combo_kernels is True.

Please refer to part 1 pull request https://github.com/pytorch/pytorch/pull/124969 for more details.

Test Plan: buck2 test mode/dev-nosan caffe2/test/inductor:combo_kernels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133291
Approved by: https://github.com/wdvr
2024-08-13 18:18:12 +00:00
dadb20a9d6 [Memory Snapshot][Viz] Add Allocator Settings Tab (#132518)
Summary: Since we have been storing the allocator settings in the snapshot files for a while now (since https://github.com/pytorch/pytorch/pull/119404), we can expose this to users with a new tab in the visualizer.

Test Plan:
Ran it locally:
![image](https://github.com/user-attachments/assets/5f79ccd0-fe1c-4e42-bb58-106d8f3cccd6)

Differential Revision: D60673548

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132518
Approved by: https://github.com/tianfengfrank, https://github.com/zdevito
2024-08-13 17:35:12 +00:00
7172c732d9 [Memory Snapshot] Skip C++ warmup unwind() call if context is not set (#133038)
Summary: Should skip C++ warmup `unwind::unwind();` if there is no context set. This call is sometimes causing hanging issues since C++ stack collection is not robust.

Test Plan: CI

Differential Revision: D60965985

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133038
Approved by: https://github.com/eqy
2024-08-13 17:25:24 +00:00
be400ee2b4 [inductor][test] Fix test_vertical_pointwise_reduction_fusion (#133276)
Summary: Fix after https://github.com/pytorch/pytorch/pull/131649 changes behavior for fusion.

Test Plan: ci

Differential Revision: D61165949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133276
Approved by: https://github.com/ColinPeppler
2024-08-13 17:18:43 +00:00
89795da5e3 [inductor] process compile_only case in all build options class. (#129975)
Optimize the `compile_only` logic. The original code only applied to `CppTorchCudaOptions`; this PR makes it apply to all build option classes.
Changes:
1. Remove `libraries_dirs` and `libraries` settings, when `compile_only`.
2. Remove compile_only from CppTorchCudaOptions.
3. Make the `compile_only` apply for all classes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129975
Approved by: https://github.com/henrylhtsang
2024-08-13 16:45:27 +00:00
19270cff61 Add a reference for the LRScheduler class (#133243)
The `LRScheduler` class provides methods to adjust the learning rate during optimization (as updated in this PR). Also, as a note, all the `lr_scheduler` classes are already listed in the `How to adjust learning rate` section.

Fixes #127884

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133243
Approved by: https://github.com/janeyx99
2024-08-13 16:20:22 +00:00
aa4fbba42d Make q info optional in prep for inference (#133261)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133261
Approved by: https://github.com/Chillee
ghstack dependencies: #132969
2024-08-13 16:09:39 +00:00
660436d843 Convert Periodic to use Amazon2023 runners (#133036)
Co-authored-by: clee2000 <44682903+clee2000@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133036
Approved by: https://github.com/clee2000, https://github.com/zxiiro
2024-08-13 15:59:50 +00:00
cyy
2f30473fba [19/N] Fix clang-tidy warnings in jit (#133067)
Follows  #132963
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133067
Approved by: https://github.com/Skylion007
2024-08-13 15:59:43 +00:00
2e7d67e6af Migrate slow.yml jobs to use runner determinator (#133232)
Update the jobs in slow.yml to use the runner determinator script.

Closes: pytorch/ci-infra#259

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133232
Approved by: https://github.com/ZainRizvi
2024-08-13 15:44:55 +00:00
c518b50c4c Remove functorch dispatch keys in legacyExtractDispatchKey (#133018)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133018
Approved by: https://github.com/zou3519
2024-08-13 15:32:01 +00:00
cd565bc455 Refactor process_inputs outside of create_aot_dispatcher_function (#130962)
This PR refactors process_inputs so that it occurs earlier outside of create_aot_dispatcher_function for the purpose of calculating a cache key with the inputs after they have been processed.

This way, if tensors have symint sizes/strides, we successfully factor that into the cache key instead of specializing on every possible size and stride. Test that utilizes this incoming.

# Guard behavior
Note that it's technically possible for tensors with symint arguments to introduce guards in aot_dispatch, if they trace through decompositions that branch on tensor size/stride. This can result in multiple graph modules with differing guards having the same key in the cache.

FXGraphCache has this same issue, and the remote FXGraphCache intentionally does not handle this: instead it only saves the first result in the cache, and cache misses if guards miss. The local FXGraphCache does handle this by storing multiple files and iterating through them, but we opt not to introduce that complexity just yet for AOTAutogradCache until we deem it necessary (i.e., models appear where saving multiple cache results with different guards but the same cache key becomes important). Instead, AOTAutogradCache will save a single entry per result, overriding it if it cache misses due to guards.
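
The single-entry policy described above can be sketched as follows. `SingleEntryCache` is a hypothetical illustration in which guards are plain callables; it is not AOTAutogradCache's real interface.

```python
class SingleEntryCache:
    """Keeps one (guard, result) pair per key; a guard miss is a cache miss."""

    def __init__(self):
        self._entries = {}

    def lookup(self, key, inputs):
        entry = self._entries.get(key)
        if entry is None:
            return None
        guard, result = entry
        # Same key but failing guard: report a miss instead of searching further.
        return result if guard(inputs) else None

    def save(self, key, guard, result):
        # Overwrites whatever was stored under this key before.
        self._entries[key] = (guard, result)


cache = SingleEntryCache()
cache.save("key", lambda size: size < 10, "graph_small")
hit = cache.lookup("key", 5)
miss = cache.lookup("key", 20)
cache.save("key", lambda size: size >= 10, "graph_large")
after_override = (cache.lookup("key", 20), cache.lookup("key", 5))
```

The trade-off is visible in the last line: once the entry is overwritten, inputs that matched the earlier guard miss again.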

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130962
Approved by: https://github.com/bdhirsh
2024-08-13 14:56:00 +00:00
4cca18d5b6 Revert "Update fused kernels and call _safe_softmax from SDPA (#131863)"
This reverts commit e61def65d5c6268e79f52776f75277ee60f01462.

Reverted https://github.com/pytorch/pytorch/pull/131863 on behalf of https://github.com/albanD due to Broke forward AD tests in main ([comment](https://github.com/pytorch/pytorch/pull/131863#issuecomment-2286432628))
2024-08-13 14:44:08 +00:00
095c5ccf9f [CD] Change XPU nightly build back to ABI=0 (#132854)
Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132854
Approved by: https://github.com/atalman
2024-08-13 13:46:29 +00:00
cyy
e0a5536cc9 [2/N] Fix clang-tidy warnings in torch/csrc/autograd (#133295)
Follows #133180
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133295
Approved by: https://github.com/Skylion007
2024-08-13 13:23:46 +00:00
7756175273 Add Sleef Implementation for maximum Kernel for ARM (#131642)
The NEON Vectorized<float> implementation does not use SLEEF functions for the maximum implementation, so this PR updates the maximum function with SLEEF calls for better performance on Graviton3. It showed a good performance improvement in LLM models.
The results, taken on a Graviton3 machine, are as follows:
<img width="268" alt="perf_result" src="https://github.com/user-attachments/assets/8c575873-b985-44e1-ba8e-880fe6494c5f">

This maximum kernel is used in softmax. The performance timing of softmax with default and sleef change is as below:(graviton3 machine)
<img width="265" alt="softmax" src="https://github.com/user-attachments/assets/3be22c0e-7c99-407e-a8d1-891cb1e035ad">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131642
Approved by: https://github.com/snadampal, https://github.com/jgong5
2024-08-13 11:08:14 +00:00
40061bd61e [export] overwrite placeholder names when deepcopying (#133269)
In joint-graph export we have a `copy.deepcopy(ep.graph_module)` call. This turns out to be an imperfect deepcopy, because deepcopy allows objects to define their own `__deepcopy__` methods. For fx.Graph, this ends up deferring to `Graph.create_node()`, which checks the graph namespace and can avoid copying the exact name in niche cases, such as when the name is a Python builtin or keyword (e.g. `input` gets renamed to `input_1`).

Names like `input` happen because export's placeholder naming pass overwrites what the namespace creates, based on the model's `forward()` signature. So we can either 1) avoid overwriting such cases, which requires rewriting the naming pass logic, or 2) force another overwrite after deepcopying. This goes with 2).
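
A toy reproduction of the effect and of option 2, using only the standard library. `TinyGraph` is a hypothetical stand-in for illustration and does not mirror fx.Graph's real implementation.

```python
import copy


class TinyGraph:
    """Hypothetical graph whose namespace avoids reserved names."""

    RESERVED = {"input"}

    def __init__(self):
        self.names = []

    def create_node(self, name):
        # The namespace refuses reserved names and picks a suffixed one instead.
        if name in self.RESERVED:
            name = f"{name}_1"
        self.names.append(name)
        return name

    def __deepcopy__(self, memo):
        # Copying goes through create_node, so renaming can happen again.
        new_graph = TinyGraph()
        for name in self.names:
            new_graph.create_node(name)
        return new_graph


graph = TinyGraph()
graph.create_node("x")
graph.names[0] = "input"  # a later naming pass overwrites the namespace's choice

copied = copy.deepcopy(graph)
drifted = list(copied.names)  # the copy no longer carries the overwritten name

copied.names = list(graph.names)  # option 2: force the overwrite again after copying
```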

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133269
Approved by: https://github.com/zhxchen17, https://github.com/dvorjackz, https://github.com/ydwu4
2024-08-13 10:20:43 +00:00
947a446be4 [executorch hash update] update the pinned executorch hash (#131420)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131420
Approved by: https://github.com/pytorchbot
2024-08-13 08:30:51 +00:00
9f17037e8b [dtensor] move tensor constructors to the api module (#133129)
This is to ensure __init__.py only contains public APIs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133129
Approved by: https://github.com/awgu, https://github.com/tianyu-l
2024-08-13 06:09:56 +00:00
cyy
50e837d9c2 [10/N] Fix clang-tidy warnings in aten/src/ATen (#133155)
Follows  #132842

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133155
Approved by: https://github.com/janeyx99, https://github.com/ezyang
2024-08-13 03:48:58 +00:00
cyy
af7830e353 [1/N] Fix clang-tidy warnings in torch/csrc/autograd (#133180)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133180
Approved by: https://github.com/albanD
2024-08-13 03:36:10 +00:00
4671e98656 [export] fix node.users when inlining HOOs (#133144)
The process of inlining HOO subgraphs (e.g. set_grad_enabled) seems to break node.users when a node is present in multiple subgraphs, for example:
```
class SetGradCase(torch.nn.Module):
    def forward(self, x):
        _x = x.shape[0] + 2
        _xx = _x + 2
        with torch.no_grad():
            y = _x * 4
        return _xx, y
```

After inlining, the `_x` node should have 2 users (`_xx` and `y`), but on inspection it only lists `y` as a user.

Previously we were completely clearing node.users for output nodes in HOO subgraphs before inlining them; we should only be deleting the subgraph output nodes.
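
To make the difference concrete, here is a toy model of user bookkeeping. `ToyNode` and `erase_node` are hypothetical and only mimic the idea of a `users` mapping; this is not the fx implementation or the actual inlining pass.

```python
class ToyNode:
    """Hypothetical node that registers itself as a user of each input."""

    def __init__(self, name, *inputs):
        self.name = name
        self.inputs = inputs
        self.users = {}
        for inp in inputs:
            inp.users[self] = None


def erase_node(node):
    # Remove only this node from the users of its inputs.
    for inp in node.inputs:
        inp.users.pop(node, None)


def names(users):
    return sorted(n.name for n in users)


def build():
    x = ToyNode("_x")
    xx = ToyNode("_xx", x)
    y = ToyNode("y", x)
    sub_out = ToyNode("subgraph_output", y)   # output of the inlined subgraph
    ToyNode("output", xx, y)                  # output of the outer graph
    return x, y, sub_out


# Deleting just the subgraph output keeps every other user intact.
x, y, sub_out = build()
erase_node(sub_out)
after_erase = names(y.users)

# Clearing users wholesale also drops the outer graph's output.
x2, y2, sub_out2 = build()
for inp in sub_out2.inputs:
    inp.users.clear()
after_clear = names(y2.users)
```

In the first variant `y` still reports the outer `output` as a user; in the second it reports none.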
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133144
Approved by: https://github.com/larryliu0820, https://github.com/ydwu4
2024-08-13 03:21:42 +00:00
fa36eba77d Turn off remote caching in unit tests unless explicitly on (#133258)
Summary: This PR turns off remote caching in unit tests unless the unit test explicitly turns it on.

Test Plan: existing tests

Differential Revision: D61152154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133258
Approved by: https://github.com/masnesral
2024-08-13 02:49:43 +00:00
1e9bedf688 Add _codecs.encode and builtins.bytearray to _get_allowed_globals to support bytes and bytearray serialization (#133189)
Fixes #133163

Debugged in collaboration with @hariveliki

The `bytes` type requires the global `_codecs.encode`. That means loading currently works only after allowlisting it explicitly:
```python
import _codecs

import torch

torch.save(b'hello', '/tmp/dummy.pth')

torch.serialization.add_safe_globals([_codecs.encode])
torch.load('/tmp/dummy.pth', weights_only=True)
```

Similarly, `bytearray` needs `builtins.bytearray`.

According to the `torch.load` docs, both types should be supported without `add_safe_globals`, as they are both primitive types:
>         weights_only: Indicates whether unpickler should be restricted to
>            loading only tensors, primitive types, dictionaries
>           and any types added via :func:`torch.serialization.add_safe_globals`.

This PR adds both `_codecs.encode` and `builtins.bytearray` to `_get_allowed_globals` and adds tests for saving and loading both types with and without `weights_only`.
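
Why these two globals show up can be seen with the standard library alone. The sketch below assumes the pickle stream is written with protocol 2 (an assumption about the serializer's default, not taken from this message); `globals_in_pickle` is a hypothetical helper.

```python
import pickle
import pickletools


def globals_in_pickle(obj, protocol=2):
    """List the 'module name' pairs referenced by GLOBAL opcodes in a pickle."""
    data = pickle.dumps(obj, protocol=protocol, fix_imports=False)
    return sorted(
        {arg for op, arg, _pos in pickletools.genops(data) if op.name == "GLOBAL"}
    )


# Older protocols have no dedicated opcode for bytes, so the pickler
# rebuilds them by calling a global at load time.
bytes_globals = globals_in_pickle(b"hello")
bytearray_globals = globals_in_pickle(bytearray(b"hello"))
```

A restricted unpickler has to allow every global listed here before it can load the object.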

Co-authored-by: hariveliki <98284163+hariveliki@users.noreply.github.com>
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133189
Approved by: https://github.com/mikaylagawarecki
2024-08-13 02:20:28 +00:00
f1c439cbed AutoHeuristic: refactoring (#133170)
This PR refactors train_decision.py and adds some basic logging, which I'll extend in another PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133170
Approved by: https://github.com/Chillee
2024-08-13 01:46:53 +00:00
cyy
e76f0e0646 Remove QNNPACK reference from setup.py (#133177)
QNNPACK has been removed from third party
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133177
Approved by: https://github.com/albanD
2024-08-13 01:19:12 +00:00
7be77658e9 [Inductor] support masked vectorization for the tail_loop for INT8 datatype (#131155)
This PR supports masked vectorization for the tail_loop for torch.uint8 and torch.int8 datatype to improve performance.
BTW, I fixed the UT of `byte` by setting the range of the sample inputs  to [0, 255] since the range of `torch.uint8` is [0, 255].

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131155
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #130724
2024-08-13 01:12:05 +00:00
370b072d8d [Inductor] support masked vectorization for the tail_loop of the 2d tiles kernel (#130724)
This PR supports masked vectorization for the tail_loop of the 2d tiles kernel to improve the performance.

Example:
```
import torch

def fn(a):
    return torch.permute(a, (2, 0, 1)).contiguous()

input = torch.randn(2, 20, 40)
compiled_fn = torch.compile(fn)

with torch.no_grad():
    for _ in range(3):
        compiled_fn(input)
```

Generated code:
- Before:
```
cpp_fused_clone_0 = async_compile.cpp_pybinding(['const float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/z2/cz2ry4ghylembzwx7hkbanur76fi3mkiu7s6jm3zdi2amy5egq4b.h"
extern "C"  void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(32L); x0+=static_cast<long>(16L))
        {
            #pragma GCC ivdep
            for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(16L))
            {
                float tmp0[16*16] __attribute__ ((aligned (16)));
                at::vec::transpose_mxn<float,16,16>(in_ptr0 + static_cast<long>(x0 + (40L*x1)), static_cast<long>(40L), tmp0, 16);
                for (long x0_inner = 0; x0_inner < 16; x0_inner++)
                {
                    auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + static_cast<long>(16L*x0_inner), 16);
                    tmp1.store(out_ptr0 + static_cast<long>(x1 + (40L*x0) + (40L*x0_inner)));
                }
            }
            #pragma GCC ivdep
            for(long x1=static_cast<long>(32L); x1<static_cast<long>(40L); x1+=static_cast<long>(1L))
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x0 + (40L*x1)), 16);
                [&]
                {
                    __at_align__ std::array<float, 16> tmpbuf;
                    tmp0.store(tmpbuf.data(), 16);
                    #pragma GCC unroll 16
                    for (long x0_inner = 0; x0_inner < 16; x0_inner++)
                    {
                        out_ptr0[static_cast<long>(x1 + (40L*x0) + (40L*x0_inner))] = tmpbuf[x0_inner];
                    }
                }
                ()
                ;
            }
        }
        #pragma GCC ivdep
        for(long x0=static_cast<long>(32L); x0<static_cast<long>(40L); x0+=static_cast<long>(1L))
        {
            #pragma GCC ivdep
            for(long x1=static_cast<long>(0L); x1<static_cast<long>(40L); x1+=static_cast<long>(1L))
            {
                auto tmp0 = in_ptr0[static_cast<long>(x0 + (40L*x1))];
                out_ptr0[static_cast<long>(x1 + (40L*x0))] = tmp0;
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (2, 20, 40), (800, 40, 1))
    buf0 = empty_strided_cpu((40, 2, 20), (40, 20, 1), torch.float32)
    cpp_fused_clone_0(arg0_1, buf0)
    del arg0_1
    return (buf0, )
```

- After:
```
cpp_fused_clone_0 = async_compile.cpp_pybinding(['const float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/z2/cz2ry4ghylembzwx7hkbanur76fi3mkiu7s6jm3zdi2amy5egq4b.h"
extern "C"  void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(32L); x0+=static_cast<long>(16L))
        {
            #pragma GCC ivdep
            for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(16L))
            {
                float tmp0[16*16] __attribute__ ((aligned (16)));
                at::vec::transpose_mxn<float,16,16>(in_ptr0 + static_cast<long>(x0 + (40L*x1)), static_cast<long>(40L), tmp0, 16);
                for (long x0_inner = 0; x0_inner < 16; x0_inner++)
                {
                    auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + static_cast<long>(16L*x0_inner), 16);
                    tmp1.store(out_ptr0 + static_cast<long>(x1 + (40L*x0) + (40L*x0_inner)));
                }
            }
            #pragma GCC ivdep
            for(long x1=static_cast<long>(32L); x1<static_cast<long>(40L); x1+=static_cast<long>(8L))
            {
                float tmp0[16*8] __attribute__ ((aligned (16)));
                at::vec::transpose_mxn<float,8,16>(in_ptr0 + static_cast<long>(x0 + (40L*x1)), static_cast<long>(40L), tmp0, 8);
                for (long x0_inner = 0; x0_inner < 16; x0_inner++)
                {
                    auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + static_cast<long>(8L*x0_inner), 8);
                    tmp1.store(out_ptr0 + static_cast<long>(x1 + (40L*x0) + (40L*x0_inner)), 8);
                }
            }
        }
        #pragma GCC ivdep
        for(long x0=static_cast<long>(32L); x0<static_cast<long>(40L); x0+=static_cast<long>(8L))
        {
            #pragma GCC ivdep
            for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(16L))
            {
                float tmp0[8*16] __attribute__ ((aligned (16)));
                at::vec::transpose_mxn<float,16,8>(in_ptr0 + static_cast<long>(x0 + (40L*x1)), static_cast<long>(40L), tmp0, 16);
                for (long x0_inner = 0; x0_inner < 8; x0_inner++)
                {
                    auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + static_cast<long>(16L*x0_inner), 16);
                    tmp1.store(out_ptr0 + static_cast<long>(x1 + (40L*x0) + (40L*x0_inner)));
                }
            }
            #pragma GCC ivdep
            for(long x1=static_cast<long>(32L); x1<static_cast<long>(40L); x1+=static_cast<long>(8L))
            {
                float tmp0[8*8] __attribute__ ((aligned (16)));
                at::vec::transpose_mxn<float,8,8>(in_ptr0 + static_cast<long>(x0 + (40L*x1)), static_cast<long>(40L), tmp0, 8);
                for (long x0_inner = 0; x0_inner < 8; x0_inner++)
                {
                    auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + static_cast<long>(8L*x0_inner), 8);
                    tmp1.store(out_ptr0 + static_cast<long>(x1 + (40L*x0) + (40L*x0_inner)), 8);
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (2, 20, 40), (800, 40, 1))
    buf0 = empty_strided_cpu((40, 2, 20), (40, 20, 1), torch.float32)
    cpp_fused_clone_0(arg0_1, buf0)
    del arg0_1
    return (buf0, )
```
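
The difference between the two kernels above comes down to how the leftover 8 elements of each 40-element dimension are covered. The sketch below is a simplified model, not Inductor's code generator: `tile_plan` is a hypothetical helper that lists the chunks a loop nest would visit along one dimension.

```python
def tile_plan(extent, width, masked):
    """Return (start, count, kind) chunks covering range(extent).

    Full-width chunks are 'vector'. The remainder is either a single
    'masked' chunk (partial vector) or one 'scalar' chunk per element.
    """
    main_end = (extent // width) * width
    plan = [(start, width, "vector") for start in range(0, main_end, width)]
    remainder = extent - main_end
    if remainder:
        if masked:
            plan.append((main_end, remainder, "masked"))
        else:
            plan.extend((start, 1, "scalar") for start in range(main_end, extent))
    return plan


before = tile_plan(40, 16, masked=False)   # tail handled element by element
after = tile_plan(40, 16, masked=True)     # tail handled as one partial vector
```

With an extent of 40 and a vector width of 16, the masked plan needs 3 chunks where the scalar-tail plan needs 10.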

Co-authored-by: CaoE <e.cao@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130724
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-08-13 01:02:24 +00:00
e61def65d5 Update fused kernels and call _safe_softmax from SDPA (#131863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131863
Approved by: https://github.com/jbschlosser
2024-08-13 00:51:55 +00:00
00aa086298 Revert "[dtensor] move tensor constructors to a separate module (#133129)"
This reverts commit e890d888d916b4f38b383a59e0e9445513c67313.

Reverted https://github.com/pytorch/pytorch/pull/133129 on behalf of https://github.com/fbgheith due to breaking internal tests ([comment](https://github.com/pytorch/pytorch/pull/133129#issuecomment-2285090400))
2024-08-12 23:55:08 +00:00
89670d5bdd Revert "Inductor-CPU WoQ int8 GEMM micro-kernel with scale epilogue (#131887)"
This reverts commit 8fbd7d92a81b61d41363edb1b3902ba7701d5a27.

Reverted https://github.com/pytorch/pytorch/pull/131887 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/131887#issuecomment-2285082401))
2024-08-12 23:45:46 +00:00
844103197d Revert "[2/2] PT2 Inductor ComboKernels - automatic horizontal fusing (#131675)"
This reverts commit bb6eef8ed1de0eb48bde10a07da57b6acc82fb05.

Reverted https://github.com/pytorch/pytorch/pull/131675 on behalf of https://github.com/fbgheith due to breaking internal tests ([comment](https://github.com/pytorch/pytorch/pull/131675#issuecomment-2285069508))
2024-08-12 23:31:16 +00:00
656465fc77 Revert "Conversions between strided and jagged layouts for Nested Tensors (#115749)"
This reverts commit ed97fb77f9a9d9d815f4975caccbc961ebbcb714.

Reverted https://github.com/pytorch/pytorch/pull/115749 on behalf of https://github.com/izaitsevfb due to fails internal jobs, see [S440348](https://www.internalfb.com/sevmanager/view/440348) ([comment](https://github.com/pytorch/pytorch/pull/115749#issuecomment-2285051164))
2024-08-12 23:14:19 +00:00
d4b31f7bcf Refactor BlockMask constructor and add Factory func (#132969)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132969
Approved by: https://github.com/Chillee
2024-08-12 22:38:42 +00:00
e553ef69d0 [BE] Fix typo (#133247)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133247
Approved by: https://github.com/c-p-i-o, https://github.com/zxiiro
2024-08-12 21:58:55 +00:00
8585dee85d [inductor] Add some more reinplacing tests (#132839)
Also add some tests around the counters we added in a previous PR.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132839
Approved by: https://github.com/eellison
2024-08-12 21:34:45 +00:00
592682fe22 Migrate nightly.yml to use runner determinator (#133225)
Updates the nightly.yml jobs to use the runner determinator script.

Closes: pytorch/ci-infra#260

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133225
Approved by: https://github.com/ZainRizvi
2024-08-12 21:25:55 +00:00
80ed3e9ccd s/dipatch/dispatch/g (#133192)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133192
Approved by: https://github.com/albanD
2024-08-12 20:26:58 +00:00
4f0d5f6551 Pin sympy to 1.13.1 (#133235)
Sympy 1.13.2 was released yesterday, and it results in test failures on Windows and Mac

454713fe9d/1

Hopefully these are the places it needs to get pinned
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133235
Approved by: https://github.com/atalman, https://github.com/ZainRizvi
2024-08-12 20:10:09 +00:00
36c4ed8e49 [inductor] add FreeLibrary to DLLWrapper for Windows. (#133184)
Follow-up to the previous PR https://github.com/pytorch/pytorch/pull/132630 . We found that the `DLLWrapper` class doesn't have a `_dlclose` implementation for Windows.

I wrote a small test project to figure out how to make it work on Windows: https://github.com/xuhancn/ctypes_all_lifecycle/blob/main/pysrc/module_manage.py#L30-L61
Test result: https://github.com/xuhancn/ctypes_all_lifecycle/tree/main?tab=readme-ov-file#ctypes_cyclepy

So, I have ported the Windows FreeLibrary implementation to the PyTorch DLLWrapper in this PR.
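
A minimal sketch of a cross-platform unload with `ctypes` follows. It is not the `DLLWrapper` code from this PR; `close_library` is a hypothetical helper, and it relies on the private `_ctypes` module, which exposes `FreeLibrary` on Windows and `dlclose` elsewhere.

```python
import ctypes
import sys

import _ctypes  # private CPython module; treat its use here as an assumption


def close_library(lib: ctypes.CDLL) -> None:
    """Release the OS handle held by a loaded shared library."""
    handle = lib._handle
    if sys.platform == "win32":
        # Windows: FreeLibrary decrements the module's reference count.
        _ctypes.FreeLibrary(handle)
    else:
        # POSIX: dlclose is the counterpart of dlopen.
        _ctypes.dlclose(handle)
```

After the call, the `CDLL` object must not be used again, since its handle may no longer be valid.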

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133184
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-08-12 19:55:48 +00:00
cdcc7dc891 update commit pin for xla (#133120)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133120
Approved by: https://github.com/janeyx99
2024-08-12 19:38:37 +00:00
cc1cc71c46 [MPS] Fix relu for 0-element input case (#133191)
Fixes #133182

Should already be tested by `test/test_mps.py::MPSReluTest::testNumbersGPU`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133191
Approved by: https://github.com/albanD
2024-08-12 19:24:17 +00:00
666362865c [test/profiler] Make test_profiler_pattern_matcher_json_report write … (#133009)
Makes it possible to run `test/profiler/test_profiler.py#test_profiler_pattern_matcher_json_report` on CI environments where the test runner doesn't have write permissions to the current-working-directory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133009
Approved by: https://github.com/zou3519
2024-08-12 18:56:50 +00:00
fa1d7b0262 Revert "Remove unused Caffe2 macros (#132979)"
This reverts commit da65cfbdea4f1f2176f6242004bda940a24f9ddb.

Reverted https://github.com/pytorch/pytorch/pull/132979 on behalf of https://github.com/ezyang due to these are apparently load bearing internally ([comment](https://github.com/pytorch/pytorch/pull/132979#issuecomment-2284666332))
2024-08-12 18:34:56 +00:00
afb73d253c [custom_ops] torch.library.{custom_op, register_kernel} disable Dynamo (#133125)
We promise the user that these custom ops (and their kernels) are black
boxes w.r.t. torch.compile. Unfortunately Dynamo can turn itself back
on in the implementation of the custom operator, so we force it off by
disabling Dynamo

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133125
Approved by: https://github.com/ezyang
2024-08-12 18:29:18 +00:00
d53dfa4680 [BE] Raise when the target model has scalar parameters (#132934)
Address the issue, https://github.com/pytorch/pytorch/issues/130810.

Both FSDP1 and FSDP2 do not support scalar parameters. For FSDP1, the issue happens during state_dict operations while FSDP2 fails during the initialization. This PR adds exceptions to help users debug the issue and change the scalar parameters to 1D parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132934
Approved by: https://github.com/awgu, https://github.com/wz337
2024-08-12 18:28:02 +00:00
0e4c0ef29f fix type of eta_min parameter in CosineAnnealing (int -> float) (#132482)
This fixes errors with type checkers such as `pyright`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132482
Approved by: https://github.com/janeyx99
2024-08-12 18:22:26 +00:00
e7d8d73582 [minor] Correct in-code documentation for complex numbers and LBFGS (#133020)
To @lezcano's credit, this should be associative, as floating point add is actually commutative per IEEE754.
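
As a quick check of the two properties being distinguished here: IEEE 754 addition is commutative, but in general it is not associative. The values below are an arbitrary illustration, not taken from the LBFGS code.

```python
a, b, c = 0.1, 0.2, 0.3

# Swapping the operands never changes the rounded result.
commutative = (a + b) == (b + a)

# Regrouping changes which intermediate result gets rounded.
associative = ((a + b) + c) == (a + (b + c))
```

With these values the first comparison holds and the second does not.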

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133020
Approved by: https://github.com/soulitzer, https://github.com/lezcano
2024-08-12 18:04:11 +00:00
d51e5467fd TunableOp unconditionally add all validators (#132464)
For workloads that only exercised scaled_mm, the csv result file would not contain the same set of validators as a gemm workload.  Trying to reuse the same csv file between workloads would discard the file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132464
Approved by: https://github.com/zixi-qi
2024-08-12 17:35:00 +00:00
d61815cb7d [torch][ao] Use returned model from Quantizer.transform_for_annotation in prepare_pt2e (#132893)
Summary:
The Quantizer subclass can return a new model from `transform_for_annotation`,
and this is common if it uses any ExportPass subclass which does not mutate in-place.

Use the returned model instead of assuming it's the same.
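
The pitfall can be sketched with plain Python objects. Everything below is hypothetical (`ToyModel`, `CopyingQuantizer`, and both `prepare_*` functions); it only illustrates why the return value matters when a transform does not mutate in place.

```python
class ToyModel:
    def __init__(self, ops):
        self.ops = list(ops)


class CopyingQuantizer:
    """Hypothetical quantizer whose transform returns a new model."""

    def transform_for_annotation(self, model):
        # The input is left untouched; the change lives in the returned copy.
        return ToyModel(model.ops + ["decomposed"])


def prepare_ignoring_result(model, quantizer):
    quantizer.transform_for_annotation(model)   # result is dropped
    return model


def prepare_using_result(model, quantizer):
    model = quantizer.transform_for_annotation(model)
    return model


q = CopyingQuantizer()
ignored = prepare_ignoring_result(ToyModel(["linear"]), q)
used = prepare_using_result(ToyModel(["linear"]), q)
```

Only `used` carries the transformation; `ignored` is still the original model.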

Differential Revision: D60869676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132893
Approved by: https://github.com/jerryzh168
2024-08-12 17:23:19 +00:00
1371c420c3 Migrate binary builds to use Amazon2023 runners (#131826)
A continuation of the migration started in
- https://github.com/pytorch/pytorch/pull/131250

Migrates all linux binary builds.

The failures are windows jobs which aren't touched by this PR

prev runs (for tracking):
- https://hud.pytorch.org/pytorch/pytorch/pull/131826?sha=e1ee074b1e7b17008e3f3774e4842b5e1d4c1355
- https://hud.pytorch.org/pytorch/pytorch/pull/131826?sha=50a3488ae776f86bd6bead8b048b051c49a25ec7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131826
Approved by: https://github.com/malfet
2024-08-12 17:18:55 +00:00
b06959e614 [export] change deepcopy to copy in _replace_with_hop passes (#133142)
Summary:
Add back the change in 19897a1647.

The change was lost in refactoring due to a bad rebase.

Test Plan:
CI

```
buck2 run 'fbcode//mode/dev-nosan'  fbcode//torchrec/distributed/tests:test_pt2 -- --filter-text test_sharded_quant_fpebc_non_strict_export
```

Differential Revision: D61052687

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133142
Approved by: https://github.com/ydwu4
2024-08-12 17:15:04 +00:00
3128640c31 [Memory Snapshot][Viz] Show event timestamps if collected (#132523)
Summary: Since we've been capturing timestamps for a while (since https://github.com/pytorch/pytorch/pull/112266), we can surface this into the UI. This can be useful to correlate with timing of other events.

Test Plan:
Ran it locally.

![image](https://github.com/user-attachments/assets/8b3922e8-1ae2-4b09-aa13-20b2b8237064)

Differential Revision: D60673800

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132523
Approved by: https://github.com/tianfengfrank, https://github.com/zdevito
2024-08-12 16:12:04 +00:00
454713fe9d Add inductor-cu124, inductor-rocm to upload test stats (#133143)
Forgot to add them in https://github.com/pytorch/pytorch/issues/128250 and https://github.com/pytorch/pytorch/issues/131637

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133143
Approved by: https://github.com/huydhn
2024-08-12 15:51:51 +00:00
9641abe97a Revert "[export] change deepcopy to copy in _replace_with_hop passes (#133142)"
This reverts commit 2d71f03db124bd1517627d34896dd2d9248227af.

Reverted https://github.com/pytorch/pytorch/pull/133142 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/133142#issuecomment-2284327241))
2024-08-12 15:48:11 +00:00
e9eb8795bb Revert "[Memory Snapshot][Viz] Show event timestamps if collected (#132523)"
This reverts commit 27c44c884e28c9378677fb295a528c36c429c3f7.

Reverted https://github.com/pytorch/pytorch/pull/132523 on behalf of https://github.com/clee2000 due to broke some tests on mac ex export/test_retraceability.py::RetraceExportTestExport::test_disable_forced_specializations_ok_retraceability [GH job link](https://github.com/pytorch/pytorch/actions/runs/10344621336/job/28630686528) [HUD commit link](27c44c884e) Possibly a landrace since I see that some of the failing tests ran on the PR ([comment](https://github.com/pytorch/pytorch/pull/132523#issuecomment-2284312426))
2024-08-12 15:42:07 +00:00
26b0a0c2f3 Fix fsdp_state_dict_type_without_warnings (#132621)
Do actually ignore the warnings. Otherwise this is a no-op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132621
Approved by: https://github.com/fegin
2024-08-12 10:33:09 +00:00
f5e704a6f2 Add instruction count benchmark to run on pull requests (#131475)
This PR only runs the benchmarks on the current PR and prints the results; follow-up diffs will add checking out head~1, running it, and comparing.

To access the results, go to the pr_time_benchmarks test and inspect the logs.
You should see:
```
+ echo 'benchmark results on current PR: '
benchmark results on current PR:
+ cat /var/lib/jenkins/workspace/test/test-reports/pr_time_benchmarks_before.txt
update_hint_regression,instruction_count,27971461254
```
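
Each result line in the log above is a comma-separated record. A minimal sketch of reading it back follows; `parse_results` is a hypothetical helper, and the three-column layout is inferred from the single sample line shown.

```python
import csv
import io


def parse_results(text):
    """Map (benchmark, metric) to an integer value for each CSV row."""
    results = {}
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        name, metric, value = row
        results[(name, metric)] = int(value)
    return results


sample = "update_hint_regression,instruction_count,27971461254\n"
results = parse_results(sample)
```

Two such dictionaries, one per commit, are enough to compute a relative change per benchmark.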

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131475
Approved by: https://github.com/ezyang
2024-08-12 05:20:26 +00:00
27c44c884e [Memory Snapshot][Viz] Show event timestamps if collected (#132523)
Summary: Since we've been capturing timestamps for awhile (since https://github.com/pytorch/pytorch/pull/112266), we can surface this into the UI. This can be useful to correlate with timing of other events.

Test Plan:
Ran it locally.

![image](https://github.com/user-attachments/assets/8b3922e8-1ae2-4b09-aa13-20b2b8237064)

Differential Revision: D60673800

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132523
Approved by: https://github.com/tianfengfrank, https://github.com/zdevito
2024-08-12 01:48:23 +00:00
7f08b73980 Revert "[Memory Snapshot][Viz] Show event timestamps if collected (#132523)"
This reverts commit 456909e5d350810e941290ee61c1dfc3315a9a69.

Reverted https://github.com/pytorch/pytorch/pull/132523 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/132523#issuecomment-2282925079))
2024-08-11 23:33:37 +00:00
456909e5d3 [Memory Snapshot][Viz] Show event timestamps if collected (#132523)
Summary: Since we've been capturing timestamps for awhile (since https://github.com/pytorch/pytorch/pull/112266), we can surface this into the UI. This can be useful to correlate with timing of other events.

Test Plan:
Ran it locally.

![image](https://github.com/user-attachments/assets/8b3922e8-1ae2-4b09-aa13-20b2b8237064)

Differential Revision: D60673800

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132523
Approved by: https://github.com/tianfengfrank, https://github.com/zdevito
2024-08-11 23:27:48 +00:00
2d71f03db1 [export] change deepcopy to copy in _replace_with_hop passes (#133142)
Summary:
Add back the change in 19897a1647.

The change was lost in refactoring due to a bad rebase.

Test Plan:
CI

```
buck2 run 'fbcode//mode/dev-nosan'  fbcode//torchrec/distributed/tests:test_pt2 -- --filter-text test_sharded_quant_fpebc_non_strict_export
```

Differential Revision: D61052687

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133142
Approved by: https://github.com/ydwu4
2024-08-11 21:47:52 +00:00
e7b870c88b mixed_mm: fix segfault when allow_tf32=True (#133173)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133173
Approved by: https://github.com/Chillee
2024-08-11 15:02:24 +00:00
04f37ed57d Add support for returning LSE from FlexAttention (and also differentiating through it) (#133159)
This PR changes the "contract" of `flex_attention_hop` to return LSE (log-sum-exp) in base 2. However, we undo that and return LSE in base e from the `flex_attention` frontend.
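
The relation between the two conventions is a constant factor: a base-2 log-sum-exp multiplied by ln 2 gives the natural-log value. The sketch below assumes the base-2 variant is computed on scores pre-scaled by log2(e); both functions are hypothetical illustrations, not the kernel code.

```python
import math


def lse_base2(scores):
    # log2(sum_i 2**(s_i * log2(e))), which equals log2(sum_i e**s_i).
    scaled = [s * math.log2(math.e) for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    return m + math.log2(sum(2.0 ** (v - m) for v in scaled))


def lse_base_e(scores):
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))


scores = [0.5, -1.25, 2.0]
converted = lse_base2(scores) * math.log(2.0)   # back to base e
```

Up to floating-point rounding, `converted` agrees with `lse_base_e(scores)`.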

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133159
Approved by: https://github.com/yanboliang
2024-08-11 10:29:16 +00:00
78ccbad678 [inductor] remove dtype check/assert for reduction vec and support bool for min/max (#132473)
This PR removes the dtype check/assert for vectorized reduction and adds bool support for min/max reduction.

After the dtype check and assertion were removed, the following UT failed:
```
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=0 python test/inductor/test_torchinductor_opinfo.py -k TestInductorOpInfoCPU.test_comprehensive_max_reduction_no_dim_cpu_bool
```
It is now supported. Generated code:
```
cpp_fused_max_0 = async_compile.cpp_pybinding(['const bool*', 'bool*'], '''
#include "/tmp/torchinductor_root/xf/cxf75ftbahznonqovnsugw7v6sldrabizgtx3j4rhgdmu3r36wlu.h"
extern "C"  void kernel(const bool* in_ptr0,
                       bool* out_ptr0)
{
    {
        {
            bool tmp_acc0 = std::numeric_limits<bool>::min();
            at::vec::VecMask<float,1> tmp_acc0_vec = at::vec::VecMask<float,1>::from(std::numeric_limits<bool>::min());
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(112L); x0+=static_cast<long>(16L))
            {
                auto tmp0 = at::vec::VecMask<float,1>::from(in_ptr0 + static_cast<long>(x0));
                tmp_acc0_vec = tmp_acc0_vec | tmp0;
            }
            #pragma omp simd simdlen(8)
            for(long x0=static_cast<long>(112L); x0<static_cast<long>(125L); x0+=static_cast<long>(1L))
            {
                auto tmp0 = in_ptr0[static_cast<long>(x0)];
                tmp_acc0 = max_propagate_nan(tmp_acc0, tmp0);
            }
            tmp_acc0 = max_propagate_nan(tmp_acc0, tmp_acc0_vec.all_zero());
            out_ptr0[static_cast<long>(0L)] = static_cast<bool>(tmp_acc0);
        }
    }
}
''')
```
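
For bool inputs, a max reduction is a logical OR and a min reduction is a logical AND, which is why the vector accumulator above can be combined with `|`. The sketch below only illustrates that equivalence with a chunked loop and a scalar tail; `chunked_bool_max` is a hypothetical helper, not a translation of the generated kernel.

```python
def chunked_bool_max(values, width=16):
    """Max over bools: OR full-width chunks lane by lane, then finish the tail."""
    main_end = (len(values) // width) * width
    acc_vec = [False] * width
    for start in range(0, main_end, width):
        chunk = values[start:start + width]
        acc_vec = [a or b for a, b in zip(acc_vec, chunk)]
    acc = False
    for v in values[main_end:]:
        acc = max(acc, v)          # scalar tail
    # Fold the vector accumulator into the scalar result.
    return acc or any(acc_vec)


data = [False] * 125
all_false = chunked_bool_max(data)
data[3] = True
one_true = chunked_bool_max(data)
```

The result matches Python's built-in `max` on the same list in both cases.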

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132473
Approved by: https://github.com/jgong5
2024-08-11 08:37:54 +00:00
79ca596dc6 Optimize test_transformers.py (#133049)
- Reduced number of skipped test cases
- Merged redundant test cases

**Benchmark:**

| | Original | New |
| ----- | ----- | ----- |
| Run time | 60 mins | 35 mins |
| Total tests | 75k | 18k |
| Skipped tests | 20k | 4k |

_These are approximate numbers from running test_transformers.py on a single H100, and can change based on the device._

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133049
Approved by: https://github.com/drisspg
2024-08-11 05:20:58 +00:00
a7912bf9dc Make step != 0 test in slice irrefutable (#133091)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133091
Approved by: https://github.com/bdhirsh
2024-08-10 23:56:45 +00:00
cyy
5b7b3e4af0 Fix some issues detected by static analyzer (#132970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132970
Approved by: https://github.com/ezyang
2024-08-10 16:02:46 +00:00
92f650c5b3 [Inductor][Intel GPU] Support codegen empty_strided_xpu, align with #118255. (#126678)
[Inductor][Intel GPU] Support codegen empty_strided_xpu, align with #118255.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126678
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/eellison
2024-08-10 14:33:39 +00:00
4a3a30c36e [inductor] remove deprecated cpp_builder implementation. (#133161)
I have worked with @henrylhtsang to switch cpp_builder to the new implementation, and we have removed the dependency on the old one.
So it is now time to remove the old implementation, which this PR does.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133161
Approved by: https://github.com/ezyang
2024-08-10 14:21:22 +00:00
cyy
32be3e942c Remove -Wno-error=pedantic from CMake (#133074)
The codebase is largely clean so that we can turn it on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133074
Approved by: https://github.com/ezyang
2024-08-10 13:11:21 +00:00
b9922f7a5a [compiled autograd][cpp node] No recaptures from saved float scalars (#133048)
Partially addresses https://github.com/pytorch/pytorch/issues/130170 for float scalars saved from forward pass of a custom c++ autograd function. With this PR, compiled autograd no longer recaptures when the float value changes, but downstream support isn't there yet: 4bdb4bbd86/torch/_dynamo/config.py (L58-L61)

Currently, any non-tensors passed in ctx->saved_data are specialized on by compiled autograd. To stop specializing on float values, we lift the float. We also require user code to use IValue::toSymFloat instead of IValue::toDouble in order to swap the SymFloat to proxy during compiled autograd tracing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133048
Approved by: https://github.com/jansel
ghstack dependencies: #132771
2024-08-10 11:05:44 +00:00
c860889a65 [compiled autograd][cpp node] No recompiles from saved int scalars (#132771)
Addresses https://github.com/pytorch/pytorch/issues/130170 for int scalars saved from forward pass of a custom c++ autograd function

Currently, any non-tensors passed in ctx->saved_data are specialized on by compiled autograd. To stop specializing on int values, we lift the ints. We also require user code to use IValue::toSymInt instead of IValue::toInt in order to swap the SymInt to proxy during compiled autograd tracing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132771
Approved by: https://github.com/jansel
2024-08-10 11:05:44 +00:00
2ad011ca73 [inductor] remove debug code of AotCodeCompiler (#132823)
Since we switched AotCodeCompiler to the new cpp_builder in https://github.com/pytorch/pytorch/pull/132766,
we can remove the debug code of AotCodeCompiler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132823
Approved by: https://github.com/henrylhtsang
2024-08-10 08:04:48 +00:00
343071cd96 Fix privateuse1 backend name case (#132980)
### Problem

`get_privateuse1_backend(bool lower_case)` always returns a lower case name and `lower_case` is not used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132980
Approved by: https://github.com/albanD
2024-08-10 07:39:54 +00:00
c8275e25a7 fix requirement for error classification (#133122)
Test Plan: none

Differential Revision: D61039300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133122
Approved by: https://github.com/yushangdi
2024-08-10 04:59:09 +00:00
9f0d90655d [inductor] cpp_builder add dynamo time trace for compile_file (#133103)
trace `compile_file` time for cpp_builder.
Ref: https://github.com/pytorch/pytorch/pull/132328/files#diff-c9b517f8db609ffa866804dfa2689188a4fee20abacaa0b0dca91625c1b5cb8dR2224

<img width="994" alt="image" src="https://github.com/user-attachments/assets/862c7943-79dc-4d06-b398-a09595ad1295">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133103
Approved by: https://github.com/ezyang
2024-08-10 04:55:02 +00:00
cc5a57d185 Return from monitoring thread on TCPStore failure (#133150)
Summary: Pessimistically assume that things are being torn down if TCPStore is not available and do not attempt to dump stack traces.

Test Plan:
Seeing crashes in production when Flight Recorder is enabled.
Here's the relevant mast link: https://fburl.com/mlhub/qia257xh

Reviewed By: fduwjj

Differential Revision: D61055124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133150
Approved by: https://github.com/fduwjj
2024-08-10 03:45:00 +00:00
e888f401c5 Fix autotuning for flex_decoding (#132157)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132157
Approved by: https://github.com/drisspg, https://github.com/yanboliang
ghstack dependencies: #131559
2024-08-10 03:39:48 +00:00
05de2b2d0f Revert "Construct NJT without graph breaks" (#133145)
This reverts commit 911154271309667b55dfb963ec6384bd0048019b.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133145
Approved by: https://github.com/YuqingJ
2024-08-10 03:11:16 +00:00
e890d888d9 [dtensor] move tensor constructors to a separate module (#133129)
This is to ensure __init__.py only contains public APIs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133129
Approved by: https://github.com/awgu, https://github.com/tianyu-l
2024-08-10 02:51:42 +00:00
8fbd7d92a8 Inductor-CPU WoQ int8 GEMM micro-kernel with scale epilogue (#131887)
## Summary

As part of #125683, this PR modifies existing CPU GEMM cpp template & micro-kernel template to enable int8 WoQ GEMM auto-tuning with AVX2, AVX512 & AMX ISAs (the latter is only available on Xeon 4th generation & beyond).

WoQ GEMM takes FP16/BF16 activations, int8 weights, and scale of the same dtype as activations.
The operation is equivalent to `torch.nn.functional.linear(x, w.to(x.dtype)) * scale`, which is essentially what the ATen op `torch.ops.aten._weight_int8pack_mm` currently does (except that weights are not cached by it). Weights will be considered constant & cached, so this implementation is suitable for inference, and not QAT. `scale` is supported as a `mul` epilogue.

Only BF16 activations are supported in this PR because, for FP16 & FP32, the weight is dequantized during the constant-folding pass of freezing, and then after auto-tuning, performance with a large `M` dimension may be better than either torch.ops.aten._weight_int8pack_mm or the WoQ micro-kernel support introduced in this PR, which dequantizes `w` within the micro-kernel.
While even BF16 activations with a large `M` dimension may benefit from dequantizing `w` beforehand, for now they use the WoQ support in GEMM templates for auto-tuning; a subsequent PR would add logic for deciding whether or not to dequantize weights beforehand.

### Performance
#### AMX
Op-level speedup due to AMX micro-kernel (selected during auto-tuning) on 32 physical cores of Intel(R) Xeon(R) Platinum 8468H (of Xeon 4th generation series, codenamed Sapphire Rapids) vs. ATen kernel `torch.ops.aten._weight_int8pack_mm`. Intel OpenMP & tcmalloc were preloaded.

In a few cases with an odd `K`, the implementation being added in this PR may not perform as well as the ATen kernel, which is unrelated to this PR, though, since `test_linear_amx` also exhibits similar datapoints. In those cases, the AMX micro-kernel might be slower than AVX512 micro-kernel, so if such sets of shapes are used for auto-tuning, either the AVX512 micro-kernel implementation, or the ATen kernel would be chosen instead.

Benchmarked with unit-tests.

Tabular data at https://gist.github.com/sanchitintel/294811a86c8ff6b867c668ae2107c405?permalink_comment_id=5142442#gistcomment-5142442

The AVX512 micro-kernel was disabled to collect data for AMX micro-kernel.

#### AVX2/AVX512 micro-kernels

Tabular data at https://gist.github.com/sanchitintel/52b5fa9c66f791be19e48e2aa6423dc4?permalink_comment_id=5142437#gistcomment-5142437

### Follow-up
1. int4 WoQ GEMM micro-kernel will also be added in a separate PR.
2. A subsequent PR would add logic for deciding whether or not to dequantize weights beforehand.

E2E perf measurement should be done with #131310.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131887
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-08-10 02:01:04 +00:00
eqy
c89936eaa0 [CUDA][SDPA] Bump grad_key fudge factor in test_flash_attention_vs_math_ref_grads (#133051)
Abates failures like `ValueError: grad_key Test error 1.592235639691353e-05 is greater than threshold 1.5236437320709229e-05!` that we've seen when bringing up newer versions of CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133051
Approved by: https://github.com/drisspg, https://github.com/Skylion007
2024-08-10 01:49:30 +00:00
f037803290 Add ChromiumEventLogger, log FXGraphCache and AOTAutogradCache (#132864)
This PR implements ChromiumEventLogger in all @dynamo_timed events. For each dynamo timed call, we log:
- A start event before starting the function execution
- An end event after finishing the function execution
- An extra pair of start/end events for any phase names included in dynamo.

Separately, this also gives us the ability to log instant events. I use them to log cache hits/misses as a first step. The little arrows on the bottom of the UI are cache hits/misses, and you can look at cache details by clicking each triangle.

The outputted chromium trace events can be viewed in perfetto for a timeline of an execution. Here's what it looks like for a run of nanogpt:
![image](https://github.com/user-attachments/assets/cb9e6c7a-1acf-45e6-8a27-6651d9ae6132)

And another with warm start:
![image](https://github.com/user-attachments/assets/cd9709bc-59ef-4da1-a7dd-10b1a0ab9b8f)

Trace events are based around the JSON Event format: https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview
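
As a rough illustration of the event shapes described above, here is a minimal standard-library sketch that records begin/end pairs and an instant event in the JSON Event format. The names `TRACE_EVENTS`, `timed`, `log_instant`, and `compile_step` are hypothetical and are not the ChromiumEventLogger API added by this PR.

```python
import json
import os
import threading
import time
from functools import wraps

# Hypothetical in-memory sink for this sketch; not PyTorch's ChromiumEventLogger.
TRACE_EVENTS = []


def _event(name, phase, **extra):
    """Append one event in the Chrome JSON Event format."""
    event = {
        "name": name,
        "ph": phase,  # "B" = begin, "E" = end, "i" = instant
        "ts": time.perf_counter_ns() / 1000,  # timestamps are in microseconds
        "pid": os.getpid(),
        "tid": threading.get_ident(),
    }
    event.update(extra)
    TRACE_EVENTS.append(event)


def timed(fn):
    """Log a begin event before the call and an end event after it."""
    @wraps(fn)
    def wrapper(*args, **kwargs):
        _event(fn.__name__, "B")
        try:
            return fn(*args, **kwargs)
        finally:
            _event(fn.__name__, "E")
    return wrapper


def log_instant(name, args):
    """Log a zero-duration event, e.g. a cache hit or miss."""
    _event(name, "i", s="p", args=args)  # "s": "p" scopes the event to the process


@timed
def compile_step():
    # Illustrative payload only; real cache keys look different.
    log_instant("fx_graph_cache_miss", {"key": "example"})
    return 42


result = compile_step()
trace_json = json.dumps({"traceEvents": TRACE_EVENTS})
```

Writing `trace_json` to a file should give something a trace viewer such as perfetto can load, with the instant event nested inside the `compile_step` span.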

We may want to switch to the less deprecated Protobuf format later, but so far I don't see any features we care about supported there.

Internal FB employees can see a link to this in the tlparse output:
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpVi1FIl/dedicated_log_torch_trace_bb4zl_bc.log/index.html

I'll also work on logging these

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132864
Approved by: https://github.com/aorenste
2024-08-10 01:15:53 +00:00
de48d54042 [TorchRec] Add Support for FakeProcessGroup (#133039)
Summary:
# context
* use FakeProcessGroup to mimic the multi-process tests
* can use `_test_compile_fake_pg_fn` as the single-process VB compile test
```
from torchrec.distributed.tests.test_pt2_multiprocess import _test_compile_fake_pg_fn
_test_compile_fake_pg_fn(
    rank=0,
    world_size=2,
)
```

reference: D59637444

Test Plan:
# run test
* run command and results: P1519228952, [tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpwMCK1E/index.html)
```
TORCH_TRACE=/var/tmp/tt TORCH_SHOW_CPP_STACKTRACES=1 TORCH_LOGS="+all" buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:test_pt2_multiprocess
```

Differential Revision: D56124045

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133039
Approved by: https://github.com/ezyang
2024-08-10 01:10:47 +00:00
3899465268 relax unification checks when size-like symbols can be 0 (#133112)
Test Plan: Fixes test failure in https://www.internalfb.com/diff/D51127481

Differential Revision: D61031307

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133112
Approved by: https://github.com/angelayi
2024-08-10 00:57:49 +00:00
72f2b29bb0 [CI] disable xpu kineto build (#133069)
The xpu kineto support PR https://github.com/pytorch/pytorch/pull/130811 has landed, but the xpu CI infra is not ready yet. Disable the kineto build as a temporary workaround.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133069
Approved by: https://github.com/seemethere
2024-08-09 23:58:50 +00:00
21302d5891 AutoHeuristic: script to generate data for mm (#131617)
This PR introduces a script that can be used to generate training data for tuned_mm in order to learn a heuristic with AutoHeuristic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131617
Approved by: https://github.com/eellison
ghstack dependencies: #131615, #131616
2024-08-09 23:49:29 +00:00
e7512ab752 inductor mm autotuning: add back previously pruned configs (#131616)
This PR adds back 10 configs for tuned_mm that were previously removed in https://github.com/pytorch/pytorch/pull/126570. The main idea is that we use 30 configs to autotune only when data is collected with AutoHeuristic. The learned heuristic will prune these 30 configs down to 10 configs, which reduces compilation time and at the same time might improve performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131616
Approved by: https://github.com/eellison
ghstack dependencies: #131615
2024-08-09 23:49:29 +00:00
e5fa190e01 AutoHeuristic: tuned_mm (#131615)
This PR enables AutoHeuristic to be used for `tuned_mm`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131615
Approved by: https://github.com/eellison
2024-08-09 23:49:29 +00:00
3b440f358c [elastic collectives API] add missing rank tracing support (#132818)
Option to detect missing ranks (which can be mapped to host info via the `rank_tracing_decoder` lambda argument) in the store barrier operation.

This approach has been used in some form already, moving it to collectives API.
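
The idea can be sketched in plain Python as follows. This is only an illustration under assumptions: `find_missing_ranks` and its parameters are hypothetical, and the only name taken from the PR text is `rank_tracing_decoder`; the actual collectives API signature is not shown here.

```python
from typing import Callable, Iterable, List, Optional


def find_missing_ranks(
    arrived: Iterable[int],
    world_size: int,
    rank_tracing_decoder: Optional[Callable[[int], str]] = None,
) -> List[str]:
    """Return the ranks that have not reached the barrier, decoded if possible.

    Hypothetical helper for illustration; not the elastic collectives API.
    """
    arrived_set = set(arrived)
    missing = [rank for rank in range(world_size) if rank not in arrived_set]
    if rank_tracing_decoder is None:
        # Without a decoder, fall back to the bare rank numbers.
        return [str(rank) for rank in missing]
    return [rank_tracing_decoder(rank) for rank in missing]


# Ranks 0 and 2 checked in; the decoder maps a rank to made-up host info.
missing_hosts = find_missing_ranks([0, 2], 4, lambda rank: f"host-{rank}")
```

Keeping the decoder optional means callers without a rank-to-host mapping still get a usable list of rank numbers.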

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132818
Approved by: https://github.com/d4l3k
2024-08-09 22:55:04 +00:00
6beb2be2ed Fix _dynamo.variables.torch_function.global_mangled_class_name (#132744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132744
Approved by: https://github.com/zou3519
2024-08-09 22:19:01 +00:00
d2ecdcb2f7 [Profiler] Add API for Dynamic Activity Toggling [2/n] (#133035)
Summary: During PT2 there are many GPU/CPU events that are unnecessary to profile in between a given step. To remedy this, we can add an API that takes in a list of activities and an arg to toggle said activities or not. For this diff we are adding the profiler API to propagate down to kineto (and in the future the collection.cpp logic). Subsequent diffs will be added for CPU toggling and e2e testing.

Test Plan: Tested by toggling backward gpu traces off and got following trace: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Jul_31_13_40_55.3251726.pt.trace.json.gz&bucket=gpu_traces

Reviewed By: aaronenyeshi

Differential Revision: D60541767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133035
Approved by: https://github.com/aaronenyeshi
2024-08-09 21:54:54 +00:00
b0b4723062 [c10d] Rename PG name and PG ID attribute (#132915)
As discussed in https://github.com/pytorch/pytorch/pull/132058, we think pg_uid and local_uid might be better names for pg_name and pg_id, so this PR renames them. More PRs are needed to update the logging and other places.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132915
Approved by: https://github.com/fegin
ghstack dependencies: #132058
2024-08-09 21:26:56 +00:00
4110cb6ba7 Add explicit GQA support. (#131559)
### tl;dr
This PR adds GQA support to higher order op `flex_attention`.

## Details
When `enable_gqa` is set to True, HOP `flex_attention(score_mod, query, key, value, block_mask, enable_gqa)` runs Group Query Attention(GQA), where the number of query heads (Hq) is a multiple of number of key/value heads (Hkv). Each group of query heads (`Hq//Hkv` heads) attends to a shared kv head.
Otherwise, `flex_attention` assumes Multi Head Attention (MHA) where the number of query heads is equal to the number of kv heads.
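
The head grouping can be sketched in plain Python. The helper name `kv_head_for_query_head` is hypothetical; it only mirrors the `hq // group` arithmetic used in the example further down and is not part of the `flex_attention` API.

```python
def kv_head_for_query_head(q_head: int, num_q_heads: int, num_kv_heads: int) -> int:
    """Map a query head index to the kv head its group attends to."""
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("num_q_heads must be a multiple of num_kv_heads")
    group_size = num_q_heads // num_kv_heads  # query heads per kv head
    return q_head // group_size


# GQA with Hq=8 and Hkv=2: each group of 4 query heads shares one kv head.
gqa_mapping = [kv_head_for_query_head(h, 8, 2) for h in range(8)]

# MHA is the special case Hq == Hkv, where the mapping is the identity.
mha_mapping = [kv_head_for_query_head(h, 8, 8) for h in range(8)]
```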

The `score_mod` and `mask_mod` API are adapted accordingly to take `q_head` as head index.
```
def score_mod(score: torch.Tensor, batch: torch.Tensor, q_head: torch.Tensor, token_q: torch.Tensor, token_kv: torch.Tensor) -> torch.Tensor

def mask_mod(batch: torch.Tensor, q_head: torch.Tensor, token_q: torch.Tensor, token_kv: torch.Tensor) -> torch.Tensor
```

## Example
```python
import torch
from torch.nn.attention.flex_attention import flex_attention
from torch.nn.attention.flex_attention import create_block_mask

torch.manual_seed(0)

def query_key_value_clones(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    dtype: torch.dtype = None,
):
    """Clones the query, key, and value tensors and moves them to the specified dtype."""
    if dtype is None:
        dtype = query.dtype
    query_ref = query.clone().detach().to(dtype).requires_grad_(query.requires_grad)
    key_ref = key.clone().detach().to(dtype).requires_grad_(key.requires_grad)
    value_ref = value.clone().detach().to(dtype).requires_grad_(value.requires_grad)
    return query_ref, key_ref, value_ref

# Let's create some input tensors.
# Each input tensor has shape (batch_size, num_heads, seq_len, head_dim).
# query and key/value can have different num_heads and seq_len.
# Here 8 query heads share 2 KV heads, i.e. each group of 4 query heads attends to one KV head.
query = torch.randn(2, 8, 2048, 64, device="cuda", dtype=torch.float32, requires_grad=True)
key = torch.randn(2, 2, 2048, 64, device="cuda", dtype=torch.float32, requires_grad=True)
value = torch.randn(2, 2, 2048, 64, device="cuda", dtype=torch.float32, requires_grad=True)

query1, key1, value1 = query_key_value_clones(query, key, value)

# Lets create a score_modification. We take alibi_bias as an example.
# score_mod takes batch index, query head index, query index, and key/value index.
def _generate_alibi_bias(num_kv_heads: int, num_q_heads: int):
    def _alibi_bias(
        score: torch.Tensor,
        b: torch.Tensor,
        hq: torch.Tensor,
        token_q: torch.Tensor,
        token_kv: torch.Tensor,
    ) -> torch.Tensor:
        # Let's calculate kv head from query head index
        group = num_q_heads // num_kv_heads
        hkv = hq // group

        scale = torch.exp2(-((hkv + 1) * 8.0 / num_kv_heads))
        return score + (token_kv - token_q) * scale

    return _alibi_bias

# Let's apply a causal mask on top of it
def causal_mask(b, h, q, kv):
    return q >= kv

# Generate a block mask for our new mask_mod function.
# The mask is broadcast along the head & batch dimensions.
block_mask = create_block_mask(causal_mask, B=1, H=1, Q_LEN=2048, KV_LEN=2048)

# Lets call flex_attention with our new score modification and block mask under eager mode.
output = flex_attention(query, key, value, score_mod=_generate_alibi_bias(2, 8), block_mask=block_mask, enable_gqa=True)

# Now lets compile flex_attention and run the flex_attention kernel.
compiled_flex_attention = torch.compile(flex_attention)
out_compiled = compiled_flex_attention(query1, key1, value1, score_mod=_generate_alibi_bias(2, 8), block_mask=block_mask, enable_gqa=True)

torch.testing.assert_close(output, out_compiled, atol=5e-2, rtol=2e-2)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131559
Approved by: https://github.com/drisspg
2024-08-09 21:25:35 +00:00
dc8bb2636c [c10d][doc] Add docs for ENV variables TORCH_NCCL_ASYNC_ERROR_HANDLING TORCH_NCCL_TRACE_CPP_STACK and TORCH_NCCL_COORD_CHECK_MILSEC (#132920)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132920
Approved by: https://github.com/fegin, https://github.com/wconstab
2024-08-09 21:08:20 +00:00
78fa32a77b Turn off Function Event Accumulation by Default (#133095)
Summary: D56956245 added the ability to accumulate FunctionEvents across multiple cycles in order to perform statistical analysis on them all together. Although this can be useful, it uses too many CPU resources, especially for long-running jobs. For this reason, let's add a flag to the profiler to turn off this behavior by default, but still allow users to turn it on if they wish.

Test Plan: Changed function count test to have acc_events passed in and check the amount of function events based on if flag is true or not

Differential Revision: D61021490

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133095
Approved by: https://github.com/briancoutinho, https://github.com/LucasLLC, https://github.com/aaronenyeshi
2024-08-09 20:47:20 +00:00
c44cb89e06 [export] detach constant tensors when they're not registered as buffer or parameter in unlift (#133031)
Summary:
Fixes T198245910.

In the previous diff D60532628, which caused the test failure, we fixed the inconsistency caused by constant tensors being accidentally registered as buffers by deleting the buffers and reassigning them as constants.

However, this broke several existing tests in pyspeech: when the exported program is re-traced with torch.jit.trace (an anti-pattern on which we should probably reach some alignment), the jit tracer finds the constant tensor requiring grad and errors out.

This PR forces constant attrs to not require grad, which is the correct behavior. A better fix would be to find out where the constants are created in user code and why they require grad, but this has low ROI, so we warn the user about it instead.

Test Plan: See failures in T198245910.

Differential Revision: D60974869

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133031
Approved by: https://github.com/angelayi
2024-08-09 20:33:52 +00:00
cd307fb0b1 [FSDP2] reset FSDPParam.sharded_param in lazy_init (#132954)
motivated by FSDP2 + DoRA https://github.com/pytorch/pytorch/issues/132721

after meta init, we need a user-defined function to move DoRALinear.magnitude from device=meta to device=cuda
The problem is how to trigger reset_sharded_param or _apply to update FSDPParam. Otherwise lazy_init complains that DoRALinear.magnitude is still on device=meta

credit to @awgu for chasing after DDP lazy_init to unblock the PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132954
Approved by: https://github.com/awgu
ghstack dependencies: #133059
2024-08-09 20:26:10 +00:00
78cf8df4a0 [aoti] forward fix of [inductor] switch AotCodeCompiler to new cpp_builder. (take 3) (#133042)
Summary:
Forward fix of a test failure caused by D60773405.

The idea of D60773405 is that we need to use absolute path. So we will want to use the older version of path for output_so and output_o.

However, when I was copying the older definitions of output_so and output_o, I thought it was okay to simplify it a bit. See https://github.com/pytorch/pytorch/pull/131304#issuecomment-2270016609

Turns out I was wrong.

Test Plan: ci

Differential Revision: D60990594

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133042
Approved by: https://github.com/hl475, https://github.com/desertfire
2024-08-09 18:53:27 +00:00
472b0daeaa [DDP][FSDP2] keep DTensor params for replicate(fully_shard) (#133059)
current status: for `replicate(fully_shard)`, DDP lazy_init will convert DTensor into local tensor, and that breaks FSDP unshard

this PR keeps FSDP params untouched during DDP lazy_init
I came across it because of a CI error in FSDP2's unit test #132978
thanks @awgu for fix proposal

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133059
Approved by: https://github.com/Skylion007, https://github.com/fegin
2024-08-09 18:38:05 +00:00
e66084f9bf [BUG FIX] Refactor _scale_attn_mask_fusion_kernel to Use Runtime Argument Instead of Template Parameter (#132434)
**Description**

**_[BUG FIX]_**
This PR fixes a bug which happens during compilation with GCC 11.4 compiler in the FlashAttentionKernel.cpp file. This issue doesn't seem to be with PyTorch main branch but gets introduced with our SVE PR changes (https://github.com/pytorch/pytorch/pull/119571 ) + PyTorch main.

See the CI Pipeline failing in our PR:
https://github.com/pytorch/pytorch/actions/runs/9895714768/job/27336251795?pr=119571

```
/var/lib/jenkins/workspace/build/aten/src/ATen/native/cpu/FlashAttentionKernel.cpp.SVE256.cpp
during RTL pass: expand
In file included from /var/lib/jenkins/workspace/build/aten/src/ATen/native/cpu/FlashAttentionKernel.cpp.SVE256.cpp:1:
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/FlashAttentionKernel.cpp: In lambda function:
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/FlashAttentionKernel.cpp:290:57: internal compiler error: in emit_move_insn, at expr.c:3821
  290 |   at::parallel_for(0, batchSize * num_head * qSlice, 1, [&](int64_t begin, int64_t end) {
      |                                                         ^
0xffffb03f73fb __libc_start_call_main
	../sysdeps/nptl/libc_start_call_main.h:58
0xffffb03f74cb __libc_start_main_impl
	../csu/libc-start.c:392
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <file:///usr/share/doc/gcc-11/README.Bugs> for instructions.

[5731/6839] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/CatKernel.cpp.SVE256.cpp.o
[5732/6839] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/GridSamplerKernel.cpp.SVE256.cpp.o
```

This issue with compilation only happens with GCC 11.4 and works well with the latest GCC 12.3 compiler and also the Clang compiler. The issue is related to the check for `is_b_stride_zero` introduced as a template parameter (compile time check complexity) in the following commit: 5da428d9eb  which was added recently into FlashAttentionKernel.cpp file.

This PR fixes the above compilation failure with GCC 11.4 compiler.

cc : @Valentine233 @yanbing-j @mingfeima @malfet @jgong5 @r-barnes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132434
Approved by: https://github.com/jgong5
2024-08-09 18:34:42 +00:00
b41d62a3a2 Fix typo in docs of all_gather (#133066)
Fix a typo of docs:
```
def all_gather(tensor_list, tensor, group=None, async_op=False):
...
        [tensor([0, 0], device='cuda:0'), tensor([0, 0], device='cuda:1')] # Rank 1
```
`cuda:0` should be `cuda:1`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133066
Approved by: https://github.com/awgu
2024-08-09 18:21:26 +00:00
f3eab23c42 Fix typo in mypy.ini (#133097)
A missing comma in the file list currently leads to errors when running mypy, introduced in #113745

Fixes #133096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133097
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-08-09 18:19:51 +00:00
31ef900a65 Revert "added persistent option to buffers and namedbuffers (#132994)"
This reverts commit 8707c6dfacaed293ddc40cbb5ecf5841568df0e6.

Reverted https://github.com/pytorch/pytorch/pull/132994 on behalf of https://github.com/PaliC due to breaking internal pyre tests ([comment](https://github.com/pytorch/pytorch/pull/132994#issuecomment-2278487672))
2024-08-09 18:14:53 +00:00
6c012f7217 [c10d][Log] Use pg_id instead of pg_name for logging prefix (#132058)
When checking the c10d logs, I found that they showed "[PG 7 rank 7]" when they actually meant "[PG 1 rank 7]". So we need to use pg_id (aka uid_) rather than pg_name_: when creating subpgs, we currently need to call it multiple times, which makes PG names based on bumped-up numbers (e.g., 7 rather than 1). Using pg_id is more accurate and consistent with other logging tools.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132058
Approved by: https://github.com/shengbao-zheng, https://github.com/shuqiangzhang
2024-08-09 18:14:10 +00:00
655ec07525 [ROCm] TunableOp logging improvements (#132173)
Summary:
TunableOp logging improvements:
1. PYTORCH_TUNABLEOP_VERBOSE=1: print out the expected value vs actual value for TunableOp validators, so that if validation fails, we know exactly how to fix it
2. PYTORCH_TUNABLEOP_VERBOSE=3: print out the exact kernel signature for both successful and failure cases in kernel lookup

Test Plan:
> PYTORCH_TUNABLEOP_VERBOSE=3 buck2 run mode/{opt,amd-gpu} -c fbcode.enable_gpu_sections=true //scripts/xdwang/example:fc_llama -- --enable-tuning

```
reading tuning results from hipblas_tuning_pt_llama0.csv
Validator PT_VERSION=2.5.0
Validator ROCBLAS_VERSION=4.0.0-72e57364-dirty
Validator HIPBLASLT_VERSION=800-a15e4178
Validator ROCM_VERSION=6.0.0.0-12969-1544e39
Validator GCN_ARCH_NAME=gfx942:sramecc+:xnack-
GCN_ARCH_NAME validation: expect gfx942:sramecc+:xnack- to match gfx942:sramecc+:xnack-
ROCM_VERSION validation: expect 6.0.0.0-12969-1544e39 to match 6.0.0.0-12969-1544e39
HIPBLASLT_VERSION validation: expect 800-a15e4178 to match 800-a15e4178
ROCBLAS_VERSION validation: expect 4.0.0-72e57364-dirty to match 4.0.0-72e57364-dirty
PT_VERSION validation: expect 2.5.0 to match 2.5.0
Loading results
GemmTunableOp_BFloat16_TN(tn_8192_2_1024) -> Gemm_Hipblaslt_TN_61169,0.0171694
GemmTunableOp_BFloat16_TN(tn_7168_2_8192) -> Gemm_Hipblaslt_TN_61089,0.036138
GemmTunableOp_BFloat16_TN(tn_8192_2_3584) -> Gemm_Hipblaslt_TN_61169,0.0240673
missing params_signature, returning null ResultEntry for GemmTunableOp_BFloat16_TN,tn_1280_2_8192
finding fastest for GemmTunableOp_BFloat16_TN(tn_1280_2_8192) out of 2818 candidates
Rotating buffer 4 MiB. Needed Size: 20 MiB. Needed number of param copies: 1
├──tuning using warmup iters 0 [0 ms] and tuning iters 1 [0.208254 ms] instance id=0, GemmTunableOp_BFloat16_TN(tn_1280_2_8192) Default
├──offset at 3
......
ResultEntry found for GemmTunableOp_BFloat16_TN,tn_8192_2_3584
ResultEntry found for GemmTunableOp_BFloat16_TN,tn_8192_2_3584
ResultEntry found for GemmTunableOp_BFloat16_TN,tn_8192_2_3584
Avg time: 16.42832040786743 us, Achieved 7.15 TFLOPS, 3578.07 GB/s

2x1280x8192-torch.bfloat16,16.260499954223633,2.5794434438103107,1294.0669757533708
2x8192x1024-torch.bfloat16,16.15394949913025,2.0771658350056508,1041.11852032876
2x7168x8192-torch.bfloat16,25.691540241241455,9.14234887416194,4574.841325057144
2x8192x3584-torch.bfloat16,16.42832040786743,7.1486621324818085,3578.0709494714856
```

Differential Revision: D60468273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132173
Approved by: https://github.com/mxz297, https://github.com/jeffdaily, https://github.com/eqy
2024-08-09 17:55:21 +00:00
d13e72fd6a [c10d] set a shorter heartbeat detect timeout to avoid race with NCCL timeout (#133028)
What we found recently is that:
1. Monitoring detects a watchdog hang (no heartbeat) at the same time as the NCCL timeout. This race leads to less useful debug info being dumped to logs (such as CudaEventDestroy and GIL checker).
2. We don't kill the program if the monitoring thread is not enabled, but somehow still silently run the monitoring thread. Also, users who feel this timeout is too short should configure TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC themselves.
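
For users who want a longer timeout, a minimal sketch of overriding it from Python follows. It assumes the variable is read when the NCCL process group is created, so it has to be set before that point; the value 600 is an arbitrary illustration, not a recommendation.

```python
import os

# Assumption: 600 seconds is an arbitrary illustrative value.
# setdefault keeps any value already exported by the launcher or shell.
os.environ.setdefault("TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC", "600")

# Read the effective value back, e.g. to log it at startup.
heartbeat_timeout_sec = int(os.environ["TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC"])
```

Exporting the variable in the job launcher's environment is equivalent and avoids ordering concerns inside the training script.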

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133028
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
2024-08-09 17:48:34 +00:00
574cdf1232 [export] Merge functions in replace set_grad/autocast with HOO (#132724)
Summary: as title

Test Plan: CI

Differential Revision: D60701648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132724
Approved by: https://github.com/ydwu4
2024-08-09 17:25:07 +00:00
2dbe5cb979 [C10D] Clarify warning for concurrent PG usage (#131895)
Addresses a common misconception about safety of using multiple NCCL
process groups from PyTorch.

Notably, it IS safe to use multiple process groups, so long as
communication operations from different groups are not allowed to
overlap.  (Overlap of communication operations from one group with
compute operations IS ok).

TODO: after getting feedback on the text, update other copies of the warning on other APIs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131895
Approved by: https://github.com/fduwjj
2024-08-09 17:06:46 +00:00
bc57d5b6ff [Inductor][CPP] Turns on inline_inbuilt_nn_modules for CPP GEMM template testing (#132487)
**Summary**
The CPP GEMM template testing has been skipped when `inline_inbuilt_nn_modules` is turned on, as in https://github.com/pytorch/pytorch/issues/131929. Now that https://github.com/pytorch/pytorch/pull/132334 has landed to fix the issues, turn this flag back on since it's the default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132487
Approved by: https://github.com/anijain2305, https://github.com/jgong5
2024-08-09 16:56:57 +00:00
23b877cb54 [inductor] a less ambitious way to solve the scalar tensor (#132702)
Fixes #121374

The previous https://github.com/pytorch/pytorch/pull/131775 tried to convert the 0-dim CPU tensor to a DynamicScalar in the lowering stage, but too many lowering rules are incompatible with that approach. So this PR does the conversion in the codegen stage instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132702
Approved by: https://github.com/eellison
2024-08-09 16:29:36 +00:00
50595ecef4 Revert "[BE] Raise when the target model has scalar parameters (#132934)"
This reverts commit ea00036841b225330396f8d8f6ecf796f4826786.

Reverted https://github.com/pytorch/pytorch/pull/132934 on behalf of https://github.com/clee2000 due to I think this broke distributed/_composable/fsdp/test_fully_shard_init.py::TestFullyShardShardedParameterTensor::test_raise_scalar_parameter [GH job link](https://github.com/pytorch/pytorch/actions/runs/10314920655/job/28563430905) [HUD commit link](ea00036841).  Dr CI is wrong, it is not flaky ([comment](https://github.com/pytorch/pytorch/pull/132934#issuecomment-2278208789))
2024-08-09 15:30:34 +00:00
065f7aa44b [inductor] tensor_is_align fallbacking False if unbacked expr not comptime evaled (#132423)
Currently, if storage_offset is an unbacked symbol and is_align cannot be computed at compile time, it hard fails.

Doing the best we can: add guard_size_oblivious and fall back to False if it cannot be evaluated at compile time

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132423
Approved by: https://github.com/ezyang
2024-08-09 15:07:42 +00:00
4bdb4bbd86 Fix fbcode AOTI GPU lowering for ARM64 hosts (#133017)
Summary: Fix fbcode AOTI GPU lowering for ARM64 hosts

Reviewed By: hl475

Differential Revision: D60969898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133017
Approved by: https://github.com/hl475
2024-08-09 14:05:13 +00:00
f2bacd908a [BE] Move function definitions to .cpp (#132927)
Summary:
Non-functional change.

Move function definitions for NCCLTraceBuffer to .cpp files.

Test Plan:
Unit tests.

Approved by: https://github.com/Skylion007, https://github.com/d4l3k
ghstack dependencies: #132916
2024-08-09 13:52:29 +00:00
465e071898 Revert "[CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade (#131493)"
This reverts commit 927b4c11143e047eb6e3430e4c7c912064572f1b.

Reverted https://github.com/pytorch/pytorch/pull/131493 on behalf of https://github.com/nmacchioni due to breaking many tests ([comment](https://github.com/pytorch/pytorch/pull/131493#issuecomment-2277738114))
2024-08-09 11:30:23 +00:00
f565d16acb Fix work-around item non-sync issue on AMD (#133054)
Summary: Otherwise it will break FSDP code paths

Test Plan:
unit test

see next diff for print message
```
sh ./scripts/lufang/amd/small_repro.sh
ROCM_GET_SCALAR_ITEM_SYNC=1 sh ./scripts/lufang/amd/small_repro.sh
```

It will log "====== Async mode ======" or "====== Sync mode ======" correspondingly

Differential Revision: D60995134

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133054
Approved by: https://github.com/houseroad
2024-08-09 09:22:29 +00:00
927b4c1114 [CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade (#131493)
Unblocks/unbreaks against newer CUTLASS (3.5+)

CC @nWEIdia @xwang233 @ptrblck @thakkarV

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131493
Approved by: https://github.com/Skylion007
2024-08-09 07:35:38 +00:00
7b8ab7eb3e [dynamo] Partially support random.Random class (#133037)
This partially fixes the graph break issue when instantiating a `random.Random` class in Python.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133037
Approved by: https://github.com/anijain2305
2024-08-09 07:15:42 +00:00
ea00036841 [BE] Raise when the target model has scalar parameters (#132934)
Address the issue, https://github.com/pytorch/pytorch/issues/130810.

Both FSDP1 and FSDP2 do not support scalar parameters. For FSDP1, the issue happens during state_dict operations while FSDP2 fails during the initialization. This PR adds exceptions to help users debug the issue and change the scalar parameters to 1D parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132934
Approved by: https://github.com/awgu
ghstack dependencies: #132908, #132933
2024-08-09 06:45:48 +00:00
5707c6e952 [Fake tensor] Align the appearance of device_put op in fx_graph generated for CUDA and XPU, which is exposed in the issue #130823 (#132479)
[Fake tensor] Align the appearance of device_put op in fx_graph generated for CUDA and XPU, which is exposed in the issue #130823
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132479
Approved by: https://github.com/EikanWang, https://github.com/zou3519, https://github.com/eellison
2024-08-09 05:31:00 +00:00
cyy
da65cfbdea Remove unused Caffe2 macros (#132979)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132979
Approved by: https://github.com/ezyang
2024-08-09 04:48:20 +00:00
cyy
05e8e87a69 [Submodule] Remove foxi (#132976)
It is not used after removal of Caffe2 code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132976
Approved by: https://github.com/ezyang
2024-08-09 03:46:52 +00:00
bb6eef8ed1 [2/2] PT2 Inductor ComboKernels - automatic horizontal fusing (#131675)
Summary:
A ComboKernel combines independent Inductor Triton kernels into a single one.
This is the part 2 pull request, which 1) adds automatic horizontal fusion at the end of the Inductor operator fusion process, and 2) adds type annotations for triton_combo_kernel.py

ComboKernel is used in two cases: 1) for existing foreach kernels, combo kernels are used as the backend kernel; the front-end kernel generation logic remains the same. 2) An extra optimization phase is added to the end of the scheduler to generate extra combo kernels if combo_kernels is True in config.py

This is part 2 pull request which deals with the 2nd case above:

- The combo kernel generation in the added optimization phase is done in two steps: 1) in the front end, inside the scheduler, it topologically sorts the schedule nodes to find all the nodes with no data dependency and creates a front-end schedule node for them. We currently limit the maximum number of sub-nodes for each combo kernel to 8 (the optimal number is still to be determined). 2) These sub-nodes are then combined in the codegen phase to generate the combo kernel code for them based on a few rules. For example, 1D and 2D kernels are separated into different combo kernels, as mixing them is not supported yet. Note that the algorithms we provide are very basic, and users can register their own customized combo kernel generation algorithms for both steps.

- Performance-wise, combining small kernels almost always yields a performance gain. However, combining very large kernels may not yield any gain and sometimes even causes a regression, possibly due to improper block sizes. Thus, a benchmark function is implemented to avoid such regressions, and it is recommended to turn it on by setting benchmark_combo_kernels to True whenever combo_kernels is True.
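
The first step (finding nodes with no data dependency and capping each group at 8) can be sketched in plain Python. This is only a toy illustration under stated assumptions, not the scheduler's actual algorithm: `group_independent` and the dict-of-sets dependency format are hypothetical, and every node must appear as a key of `deps`.

```python
from typing import Dict, List, Set

MAX_SUBNODES = 8  # cap quoted in the description above


def group_independent(
    deps: Dict[str, Set[str]], max_size: int = MAX_SUBNODES
) -> List[List[str]]:
    """Group nodes by topological level, then chunk each level.

    Nodes that become ready in the same round have no path between
    them, so they are candidates for one combined kernel.
    """
    remaining = {node: set(d) for node, d in deps.items()}
    groups: List[List[str]] = []
    while remaining:
        ready = sorted(n for n, d in remaining.items() if not d)
        if not ready:
            raise ValueError("dependency cycle or unknown dependency")
        # Split a wide level into groups of at most `max_size` nodes.
        for i in range(0, len(ready), max_size):
            groups.append(ready[i:i + max_size])
        for n in ready:
            del remaining[n]
        for d in remaining.values():
            d.difference_update(ready)
    return groups


print(group_independent({"a": set(), "b": set(), "c": {"a"}, "d": {"a", "b"}}))
```

A real implementation would also apply the codegen rules from step 2, such as keeping 1D and 2D kernels in separate groups.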

Please refer to part 1 pull request https://github.com/pytorch/pytorch/pull/124969 for more details.

Test Plan: buck2 test mode/dev-nosan caffe2/test/inductor:combo_kernels

Differential Revision: D60067757

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131675
Approved by: https://github.com/mlazos
2024-08-09 03:14:16 +00:00
8875226d62 [dtensor] multi-dim mesh redistribute follow up (#133023)
follow up from https://github.com/pytorch/pytorch/pull/131210

and added one test case from user in

https://github.com/pytorch/pytorch/issues/132751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133023
Approved by: https://github.com/tianyu-l
ghstack dependencies: #133022
2024-08-09 02:26:23 +00:00
3b7edc12c6 [dtensor] more refactor to imports/paths (#133022)
as titled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133022
Approved by: https://github.com/XilunWu, https://github.com/wz337
2024-08-09 02:26:23 +00:00
22ea248aa8 dynamic shapes mismatch errors (#132982)
Summary: When PyTree detects a structural mismatch between inputs and dynamic shapes, the error messages are quite horrible. This PR fixes these error messages by adding, for each kind of error, the path to the point where the error happens and an actionable reason for the error.
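
To make the idea of "path plus actionable reason" concrete, here is a minimal sketch over plain dicts, lists and tuples. It is not the PyTree or export implementation; `find_mismatch` and its message wording are hypothetical.

```python
from typing import Any, List


def find_mismatch(inputs: Any, spec: Any, path: str = "inputs") -> List[str]:
    """Report where two nested containers differ in structure."""
    if isinstance(inputs, dict):
        if not isinstance(spec, dict):
            return [f"{path}: expected a dict, got {type(spec).__name__}"]
        if set(inputs) != set(spec):
            return [
                f"{path}: keys {sorted(inputs, key=str)} "
                f"do not match {sorted(spec, key=str)}"
            ]
        errors: List[str] = []
        for key in inputs:
            errors += find_mismatch(inputs[key], spec[key], f"{path}[{key!r}]")
        return errors
    if isinstance(inputs, (list, tuple)):
        if not isinstance(spec, (list, tuple)):
            return [f"{path}: expected a sequence, got {type(spec).__name__}"]
        if len(inputs) != len(spec):
            return [f"{path}: expected {len(inputs)} elements, got {len(spec)}"]
        errors = []
        for i, (a, b) in enumerate(zip(inputs, spec)):
            errors += find_mismatch(a, b, f"{path}[{i}]")
        return errors
    return []  # leaves are not compared in this sketch


print(find_mismatch({"x": [1, 2], "y": 3}, {"x": [None], "y": None}))
```

Carrying the path down through the recursion is what lets the message point at the exact sub-structure that disagrees.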

Test Plan: added test with several cases

Differential Revision: D60956976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132982
Approved by: https://github.com/yushangdi
2024-08-09 02:22:32 +00:00
cyy
8967d55b01 [18/N] Fix clang-tidy warnings in jit (#132963)
Follows #132753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132963
Approved by: https://github.com/Skylion007
2024-08-09 01:27:32 +00:00
313aa151da Revert "[ROCm] TunableOp logging improvements (#132173)"
This reverts commit 9cca0494b9d5c89c0a1100aee9477ed8ca7d527b.

Reverted https://github.com/pytorch/pytorch/pull/132173 on behalf of https://github.com/PaliC due to reverted internally ([comment](https://github.com/pytorch/pytorch/pull/132173#issuecomment-2276966242))
2024-08-09 01:04:57 +00:00
4101dd14c2 Make debugging backends accept and ignore options kwargs from torch.compile (#132892)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132892
Approved by: https://github.com/anijain2305, https://github.com/jansel
2024-08-09 00:49:45 +00:00
0ff0bf3d31 [Replicate] Fix replicate with DeviceMesh initialization (#133024)
A follow up on https://github.com/pytorch/pytorch/pull/132339.

`get_parent_mesh` is replaced by `get_root_mesh`. In addition, modify a few places in the tests where the parent mesh is mentioned.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133024
Approved by: https://github.com/Skylion007, https://github.com/fegin
2024-08-09 00:45:47 +00:00
10c2168b31 [pt2-bench] use larger multiplier for smaller tensors for a few models (#132952)
Fix https://github.com/pytorch/pytorch/issues/132922  and https://github.com/pytorch/pytorch/issues/132924

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132952
Approved by: https://github.com/eellison, https://github.com/jansel
2024-08-09 00:09:21 +00:00
3c5b246d3c [export] Remove Proxy from exported programs and modules (#132956)
Summary: Remove Proxy from exported programs and modules because they cannot be deepcopied or pickled.

Test Plan:
CI

```
buck2 run 'fbcode//mode/dev-nosan'  fbcode//caffe2/test/quantization:test_quantization -- -r  qat_conv2d
buck2 run 'fbcode//mode/dev-nosan' fbcode//modai/test:test_modai -- -r test_qat_stinson_htp_export
buck2 run 'fbcode//mode/dev-nosan' fbcode//vizard_projects/ml_depth/tests:test_model -- -r test_qat_model_et
buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/backends/tests:qnn_test -- -r test_qat_bias=False,use_3d_input=False
buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/backends/tests:qnn_test -- -r test_qat_bias=True,use_3d_input=False
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r  test_fold_bn_erases_bn_node
```

Differential Revision: D60940832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132956
Approved by: https://github.com/angelayi
2024-08-09 00:00:20 +00:00
e2b94923ba [PyTorch] Speed up decomposed quantize_per_channel (#133029)
Similar to D60871396 (#132828).

Differential Revision: [D60978385](https://our.internmc.facebook.com/intern/diff/D60978385/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133029
Approved by: https://github.com/cccclai
2024-08-08 23:48:34 +00:00
fa8c34301a [ts-migration]: Quantized ops to standard ops pass. (#133026)
#### Description
Transform quantized operation properly. Add de/quantization before and after the quantized operation.
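
The shape of that rewrite (dequantize, run the standard float op, quantize again) can be sketched without any framework. This is an assumption-laden toy, not the converter pass: the per-tensor affine scheme, the uint8 range, and all function names are chosen here for illustration.

```python
from typing import List, Tuple

QTensor = Tuple[List[int], float, int]  # (integer values, scale, zero_point)


def quantize(xs: List[float], scale: float, zero_point: int,
             qmin: int = 0, qmax: int = 255) -> QTensor:
    """Affine quantization with clamping to [qmin, qmax]."""
    q = [min(qmax, max(qmin, round(x / scale) + zero_point)) for x in xs]
    return q, scale, zero_point


def dequantize(qt: QTensor) -> List[float]:
    q, scale, zero_point = qt
    return [(v - zero_point) * scale for v in q]


def quantized_add_as_standard_ops(a: QTensor, b: QTensor,
                                  out_scale: float, out_zero_point: int) -> QTensor:
    """Express a quantized add as dequantize -> float add -> quantize."""
    fa, fb = dequantize(a), dequantize(b)
    summed = [x + y for x, y in zip(fa, fb)]
    return quantize(summed, out_scale, out_zero_point)


a = quantize([0.5, 1.0], scale=0.5, zero_point=0)
b = quantize([1.0, 1.5], scale=0.5, zero_point=0)
print(quantized_add_as_standard_ops(a, b, out_scale=0.5, out_zero_point=0))
```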

#### Test Plan
`pytest test/export/test_converter.py -s -k test_ts2ep_convert_quantized_model`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133026
Approved by: https://github.com/angelayi
2024-08-08 23:10:17 +00:00
45cf8ef557 add impls for required for nt ops (#132710)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132710
Approved by: https://github.com/jbschlosser
ghstack dependencies: #131060
2024-08-08 23:09:38 +00:00
1434e0b121 Add a private _safe_softmax (#131060)
# Summary
Changes the stance of SDPA on what to do for fully masked out rows

## Current Behavior
Several PyTorch users have expressed frustration over this issue:
- https://github.com/pytorch/pytorch/issues/41508
- https://github.com/pytorch/pytorch/issues/103749
- https://github.com/pytorch/pytorch/issues/103963

These are significant issues with extensive discussion but no satisfactory resolution. The PyTorch team's consensus, as stated here:
https://github.com/pytorch/pytorch/issues/24816#issuecomment-524415617

Can be paraphrased as follows:

When passing in fully masked out rows, attention becomes ambiguous. We have two main options:

1. Uniformly attend to all values:
   ```python
   scores[masked_out_rows] = 1 / len(row)
   out[masked_out_rows] = 1 / len(row) * value
   ```

2. Decide that attention between no queries (masked) and no keys (masked) is meaningless:
   ```python
   output[fully_masked_rows] = NaN
   ```

We went with option 2. Partially because it was easier to implement, but also people argued that users can slice the output to remove the NaNs:
``` Python
>fill_value = -float("inf")
>row0 = torch.randn(4)
>row1 = torch.tensor([fill_value for _ in range(4)])
>matrix = torch.stack([row0, row1]).requires_grad_(True)
>out = torch.softmax(matrix, 1)
>out = out[0]
>print(out)
tensor([0.5377, 0.2729, 0.0692, 0.1201])
```
Cool, problem solved. But what happens when you call backward?
```Python
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[3.0957e-08, 1.4157e-08, 7.7802e-10, 1.3713e-08],
        [       nan,        nan,        nan,        nan]])
```
Those pesky NaNs are back!

## Why do we see NaNs today?

The core of the problem revolves around using the softmax function in sdpa:

```python
> row = torch.tensor([(-float("inf")) for _ in range(4)])
> torch.softmax(row, 0)
tensor([nan, nan, nan, nan])
```
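
Where the NaN comes from can be traced with a textbook, numerically stable softmax written in plain Python. This is a sketch for intuition only; it is not how `torch.softmax` is implemented internally.

```python
import math


def stable_softmax(row):
    """Textbook stable softmax: subtract the row maximum first."""
    m = max(row)
    # For a fully masked row m is -inf, and -inf - (-inf) is nan.
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


neg_inf = float("-inf")
print(stable_softmax([0.0, neg_inf]))   # one live entry: well defined
print(stable_softmax([neg_inf] * 4))    # fully masked: every entry is nan
```

With at least one finite entry, the masked positions simply get weight 0. Only the all `-inf` row has no finite maximum to anchor the computation.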

## Quick Aside: Masking in Attention

Attention itself doesn't have a concept of masking. The `sdpa` function has an argument called `attn_mask`, which would be more accurately named `attn_bias`. This is because we don't actually "mask" entries when computing attention. Instead, due to implementation details([performance](https://github.com/pytorch/pytorch/issues/25110#issuecomment-524519087)), we add a value to the masked-out query/key pairs.

We use a large negative number (typically -inf) to decrease the attention weight, as softmax assigns more weight to larger values.
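
A small pure-Python sketch of masking as an additive bias, under the assumption of a boolean keep-mask for a single query row (`attention_weights` is a hypothetical helper, not the `sdpa` API):

```python
import math


def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(scores, keep):
    """Turn a boolean keep-mask into an additive bias, then softmax."""
    bias = [0.0 if k else float("-inf") for k in keep]
    return softmax([s + b for s, b in zip(scores, bias)])


# The third key has the largest raw score but is masked out.
print(attention_weights([1.0, 1.0, 3.0], keep=[True, True, False]))
```

Nothing is removed from the row: the biased entry is still there, it just ends up with zero weight after the softmax.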

## Alternative Approaches

If we use a very large negative number instead of -inf:

```python
> row = torch.tensor([(-1e6) for _ in range(4)])
> torch.softmax(row, 0)
tensor([0.2500, 0.2500, 0.2500, 0.2500])
```
If users then always remember to "slice" out their outputs, i.e.:
```Python
>fill_value = -1e6
>...
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[-0.0563, -0.0564,  0.1613, -0.0486],
        [ 0.0000,  0.0000,  0.0000,  0.0000]])
```
This would bring us back into a better state.

## A Third Option

We don't necessarily need to alter the behavior of softmax for -inf or very large negative numbers. The fundamental goal is to exclude certain query/key pairs from attention, regardless of the underlying implementation.

This PR implements the new semantic for masking w/ attention in fully masked-out rows:
```python
out[masked_out_rows] = 0
```
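
What this semantic means for the attention output can be sketched in plain Python. This is an illustrative model only, not the fused kernels: `safe_softmax` here is a hypothetical stand-in written with an explicit `-inf` check, and `attend` handles a single query row.

```python
import math


def safe_softmax(row):
    """Softmax that yields all zeros for a row made only of -inf."""
    m = max(row)
    if m == float("-inf"):
        return [0.0] * len(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(scores, values):
    """Weighted sum of value vectors for one query row."""
    weights = safe_softmax(scores)
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


neg_inf = float("-inf")
values = [[1.0, 2.0], [3.0, 4.0]]
print(attend([0.0, neg_inf], values))      # attends only to the first value
print(attend([neg_inf, neg_inf], values))  # fully masked row: zeros, not nan
```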

**Important Note**: This idea isn't entirely new. The [MaskedTensor](https://pytorch.org/tutorials/prototype/maskedtensor_overview#safe-softmax) prototype, a tensor subclass, was designed to handle such cases. However, it remains a prototype feature and hasn't gained widespread adoption.

## Details
This PR stack does 3 things:
1. Adds a PRIVATE _safe_softmax op
2. Updates semantic for flash_cpu fused kernel
3. Updates semantic for efficient_cuda fused kernel

_safe_softmax is not supposed to be used generically and is only meant to be used within the context of SDPA. Because of this, instead of decomposing softmax and checking for -inf rows, we "cheat" and use nan_to_num.

Why do I think this is okay? (Please raise a counterpoint if you have one.)
There are multiple ways NaNs can emerge. For the fully masked out rows case nan_to_num works. But what if there were other NaNs, wouldn't this silently remove them?

The only case that this can happen is if the input itself had a NaN or an Inf
For example:
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = torch.finfo(torch.float16).max
print(a.softmax(-1))
```
Will return
`tensor([0., 1., 0., 0.], dtype=torch.float16)`

Where
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = float("inf")
a.softmax(-1)
```
returns:
`tensor([nan, nan, nan, nan], dtype=torch.float16)`

If we don't want to even allow for the possibility of "inf" or "NaN" attention scores being converted to 0, then we can implement it something like this:

```Python
max = torch.max(a, dim=-1, keepdim=True)
exp = torch.exp(a - max.values)
denom = torch.sum(exp, dim=-1, keepdim=True)
softmax = exp / denom
softmax = torch.where(max.values == float('-inf'), 0.0, softmax)
```
however we would be paying for this in math performance.

## Why Now
I think one point that has substantially changed where PyTorch should lie on this argument is the fact that we have fused implementations for SDPA now. And these fused implementations allow us to easily and performantly support this new semantic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131060
Approved by: https://github.com/jbschlosser
2024-08-08 23:09:38 +00:00
1f66487c69 [BE] Reroute all uses of proxy_tensor.maybe_disable_fake_tensor_mode to fake_tensor.unset_fake_temporarily (#132770)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132770
Approved by: https://github.com/bdhirsh
2024-08-08 23:07:23 +00:00
f25df31008 TunableOp more unit test follow-up (#130065)
More unit tests for preventing TunableOp regressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130065
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2024-08-08 22:42:16 +00:00
3d0de6e1cd [Inductor] Add config option to force higher-dimensional tiling (#132937)
Fixes #125077

**Feature**

This PR creates a new Inductor config, `config.triton.prefer_nd_tiling`, which is disabled by default. When enabled, this encourages the Triton code to use as many tiling dimensions as possible. This simplifies indexing expressions for discontiguous tensors, resulting in expressions like `5 * x + 8 * y` as opposed to `5 * (x // 7) + 8 * (y % 9)`. This allows us to find more block pointers than we normally would. We should now see simplified indexing expressions as long as:
 1. All discontiguous reads/writes have the same shape.
 2. The number of discontiguous dimensions is less than `config.triton.max_tiles`.

 Here's an example kernel (elementwise add of views) with ND tiling disabled:
 ```
 @triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 21
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 7
    x1 = (xindex // 7)
    x2 = xindex
    tmp0 = tl.load(in_ptr0 + (x0 + (9*x1)), xmask)
    tmp1 = tl.load(in_ptr1 + (x0 + (9*x1)), xmask)
    tmp2 = tmp0 + tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[21], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0])
''', device_str='cuda')
 ```

 And here's the version with it enabled:
 ```
 @triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    ynumel = 3
    xnumel = 7
    yoffset = tl.program_id(1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[None, :]
    ymask = yindex < ynumel
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    x1 = xindex
    y0 = yindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[7, 3], strides=[1, 9], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), boundary_check=[0, 1], eviction_policy='evict_last')
    tmp1 = tl.load(tl.make_block_ptr(in_ptr1, shape=[7, 3], strides=[1, 9], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), boundary_check=[0, 1], eviction_policy='evict_last')
    tmp2 = tmp0 + tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[7, 3], strides=[1, 7], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), tl.broadcast_to(tmp2, [XBLOCK, YBLOCK]).to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')
 ```

 With this feature enabled, we get a discontiguous strided block pointer. Previously, this would only have worked for specific shapes, like powers of 2 or multiples of the maximum block size. With this PR, we can support arbitrary shapes so long as we have enough tiles to cover all discontiguous dimensions.
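
The indexing in the two kernels above can be checked by hand. The sketch below reuses the numbers from those kernels (21 elements, inner size 7, row stride 9) and confirms that the 1D form with `%` and `//` and the 2D affine form touch the same memory offsets; the function names are made up for this illustration.

```python
def flat_offsets(xnumel=21, inner=7, row_stride=9):
    """1D tiling: recover (x0, x1) from the flat index with % and //."""
    return [(i % inner) + row_stride * (i // inner) for i in range(xnumel)]


def tiled_offsets(xnumel=7, ynumel=3, row_stride=9):
    """2D tiling: each (x, y) maps to memory with a plain affine expression."""
    return [x + row_stride * y for y in range(ynumel) for x in range(xnumel)]


print(flat_offsets() == tiled_offsets())
print(tiled_offsets()[:8])
```

The affine form `x + 9 * y` is exactly what a block pointer with `shape=[7, 3]` and `strides=[1, 9]` describes, which is why the 2D version can use one.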

**Test plan**

This PR adds some tests for pointwise ops with discontiguous tensors.
 - Test that we can generate block pointers for views with odd shapes like `(5,7)`, `(9,3,5)`, etc.
 - Test that we can generate block pointers for a single discontiguous dim in 3D and 4D tensors.
 - Test that we generate a 2D tiling for a 5D tensor with two discontiguous dims. This case doesn't generate a block pointer, but it checks that the output code is at least correct.

This PR also parametrizes some existing tests to run with and without `triton.prefer_nd_tiling`. That way, we ensure this feature doesn't break existing usage.

Since this setting isn't enabled on most tests, I also created https://github.com/pytorch/pytorch/pull/132935 to test what happens when `triton.prefer_nd_tiling=True` by default. None of the failures seem related to invalid tiling, so I think this feature is safe to merge.

**Limitations and follow-ups**

I can see two main improvements which would expand the usefulness of this feature:

1. This feature currently only works for pointwise kernels, since reductions are never tiled. As a follow-up, we could enable tiled reductions to extend these benefits to reduction kernels.

2. The usefulness of this feature depends on `config.triton.max_tiles`. This is currently restricted to 2 by default, although it can be increased to 3 in certain cases. To support more discontiguous dims, we might consider expanding support for 3D tiling, or even supporting ND tiling, by mapping an ND "virtual" launch grid onto Triton's 3D launch grid.
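
One possible way to map an ND "virtual" grid onto a 3D launch grid is to fold every dimension past the second into the third one. This is a speculative sketch of the idea only; nothing like it is implemented by this PR, and both function names are hypothetical.

```python
from typing import Sequence, Tuple


def to_launch_grid(grid: Sequence[int]) -> Tuple[int, int, int]:
    """Collapse an ND grid to 3 dims by folding dims 2.. into z."""
    dims = list(grid) + [1] * max(0, 3 - len(grid))
    z = 1
    for d in dims[2:]:
        z *= d
    return dims[0], dims[1], z


def unfold_program_id(pid: Tuple[int, int, int],
                      grid: Sequence[int]) -> Tuple[int, ...]:
    """Recover the ND program id from a 3D one (inverse of the folding)."""
    x, y, z = pid
    rest = []
    for d in list(grid)[2:]:
        rest.append(z % d)  # mixed-radix decode of the folded dimension
        z //= d
    return tuple(([x, y] + rest)[:len(grid)])


grid = (4, 3, 2, 5)
print(to_launch_grid(grid))
print(unfold_program_id((1, 2, 7), grid))
```

The decode costs a `%` and a `//` per folded dimension, so such a scheme would trade some of the indexing simplicity gained from tiling for support of more dimensions.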

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132937
Approved by: https://github.com/jansel, https://github.com/eellison
2024-08-08 22:11:56 +00:00
8707c6dfac added persistent option to buffers and namedbuffers (#132994)
Fixes #85235

Alternative to PR https://github.com/pytorch/pytorch/pull/129655, implements 3-valued option (None or bool).

- adds keyword only argument `persistent: Optional[bool] = None` to `nn.Module.buffers`
- updated docstrings slightly.
- added test.
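
The three-valued filter can be pictured with a toy class. This is not `torch.nn.Module`: `ToyModule` is a hypothetical stand-in, and the reading that `None` means "yield every buffer" is an assumption based on it being the default.

```python
from typing import Dict, Iterator, Optional, Tuple


class ToyModule:
    """Stand-in for a module that tracks which buffers are persistent."""

    def __init__(self) -> None:
        self._buffers: Dict[str, Tuple[object, bool]] = {}

    def register_buffer(self, name: str, value: object,
                        persistent: bool = True) -> None:
        self._buffers[name] = (value, persistent)

    def buffers(self, *, persistent: Optional[bool] = None) -> Iterator[object]:
        """None yields every buffer; True or False filters on the flag."""
        for value, is_persistent in self._buffers.values():
            if persistent is None or persistent == is_persistent:
                yield value


m = ToyModule()
m.register_buffer("running_mean", [0.0, 0.0])
m.register_buffer("scratch", [1.0], persistent=False)
print(list(m.buffers()))
print(list(m.buffers(persistent=True)))
print(list(m.buffers(persistent=False)))
```

Making the argument keyword-only keeps existing positional calls unambiguous, and a `None` default leaves current callers unaffected.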

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132994
Approved by: https://github.com/mikaylagawarecki
2024-08-08 21:39:01 +00:00
9cca0494b9 [ROCm] TunableOp logging improvements (#132173)
Summary:
TunableOp logging improvements:
1. PYTORCH_TUNABLEOP_VERBOSE=1: print out the expected value vs actual value for TunableOp validators, so that if validation fails, we know exactly how to fix it
2. PYTORCH_TUNABLEOP_VERBOSE=3: print out the exact kernel signature for both successful and failure cases in kernel lookup

Test Plan:
> PYTORCH_TUNABLEOP_VERBOSE=3 buck2 run mode/{opt,amd-gpu} -c fbcode.enable_gpu_sections=true //scripts/xdwang/example:fc_llama -- --enable-tuning

```
reading tuning results from hipblas_tuning_pt_llama0.csv
Validator PT_VERSION=2.5.0
Validator ROCBLAS_VERSION=4.0.0-72e57364-dirty
Validator HIPBLASLT_VERSION=800-a15e4178
Validator ROCM_VERSION=6.0.0.0-12969-1544e39
Validator GCN_ARCH_NAME=gfx942:sramecc+:xnack-
GCN_ARCH_NAME validation: expect gfx942:sramecc+:xnack- to match gfx942:sramecc+:xnack-
ROCM_VERSION validation: expect 6.0.0.0-12969-1544e39 to match 6.0.0.0-12969-1544e39
HIPBLASLT_VERSION validation: expect 800-a15e4178 to match 800-a15e4178
ROCBLAS_VERSION validation: expect 4.0.0-72e57364-dirty to match 4.0.0-72e57364-dirty
PT_VERSION validation: expect 2.5.0 to match 2.5.0
Loading results
GemmTunableOp_BFloat16_TN(tn_8192_2_1024) -> Gemm_Hipblaslt_TN_61169,0.0171694
GemmTunableOp_BFloat16_TN(tn_7168_2_8192) -> Gemm_Hipblaslt_TN_61089,0.036138
GemmTunableOp_BFloat16_TN(tn_8192_2_3584) -> Gemm_Hipblaslt_TN_61169,0.0240673
missing params_signature, returning null ResultEntry for GemmTunableOp_BFloat16_TN,tn_1280_2_8192
finding fastest for GemmTunableOp_BFloat16_TN(tn_1280_2_8192) out of 2818 candidates
Rotating buffer 4 MiB. Needed Size: 20 MiB. Needed number of param copies: 1
├──tuning using warmup iters 0 [0 ms] and tuning iters 1 [0.208254 ms] instance id=0, GemmTunableOp_BFloat16_TN(tn_1280_2_8192) Default
├──offset at 3
......
ResultEntry found for GemmTunableOp_BFloat16_TN,tn_8192_2_3584
ResultEntry found for GemmTunableOp_BFloat16_TN,tn_8192_2_3584
ResultEntry found for GemmTunableOp_BFloat16_TN,tn_8192_2_3584
Avg time: 16.42832040786743 us, Achieved 7.15 TFLOPS, 3578.07 GB/s

2x1280x8192-torch.bfloat16,16.260499954223633,2.5794434438103107,1294.0669757533708
2x8192x1024-torch.bfloat16,16.15394949913025,2.0771658350056508,1041.11852032876
2x7168x8192-torch.bfloat16,25.691540241241455,9.14234887416194,4574.841325057144
2x8192x3584-torch.bfloat16,16.42832040786743,7.1486621324818085,3578.0709494714856
```

Differential Revision: D60468273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132173
Approved by: https://github.com/mxz297, https://github.com/jeffdaily
2024-08-08 21:24:16 +00:00
cd30861857 [PT2][Optimus] Update unbind_cat_to_view pass to include more complicated cases (#132831)
Summary: We found that recent CMF and IGCTR models have more complicated patterns to optimize in order to remove as many stack/cat nodes as possible, so we design patterns to cover them.

Test Plan:
# unit test

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3659174939423652
Network: Up: 113KiB  Down: 112KiB  (reSessionID-11c9b598-af3a-4727-8f02-ccb1471d092b)
Jobs completed: 27. Time elapsed: 5:45.8s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0

# benchmark

### cmf
```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "cmf_shrink" --flow_id 587303213 -n
```
P1515072258

Counter({'pattern_matcher_nodes': 2170, 'pattern_matcher_count': 1766, 'normalization_pass': 402, 'remove_split_with_size_one_pass': 269, 'extern_calls': 193, 'merge_splits_pass': 74, 'normalization_aten_pass': 51, 'fxgraph_cache_miss': 9, 'batch_aten_mul': 6, 'scmerge_split_sections_removed': 5, 'scmerge_split_removed': 3, 'scmerge_cat_removed': 3, 'unbind_stack_pass': 3, 'batch_sigmoid': 2, 'batch_linear': 2, 'batch_aten_sub': 2, 'batch_layernorm': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'split_stack_to_cats_pass': 1, 'split_cat_to_slices_pass': 1, 'batch_aten_add': 1, 'batch_relu': 1})

### ig_ctr

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ig_ctr" --flow_id 584880697 -n
```
P1515087739

Counter({'pattern_matcher_nodes': 1832, 'pattern_matcher_count': 1564, 'extern_calls': 378, 'normalization_pass': 345, 'normalization_aten_pass': 49, 'fxgraph_cache_miss': 18, 'batch_aten_mul': 6, 'scmerge_cat_removed': 5, 'scmerge_cat_added': 4, 'batch_linear_post_grad': 4, 'scmerge_split_removed': 3, 'unbind_stack_pass': 3, 'unbind_cat_to_view_pass': 3, 'batch_tanh': 2, 'scmerge_split_sections_removed': 2, 'scmerge_split_added': 2, 'split_stack_to_cats_pass': 2, 'split_cat_to_slices_pass': 1})

# e2e

testing the following new patterns
```
                "split_stack_to_cats_pass": {},
                "split_cat_to_slices_pass": {},
                "unbind_cat_to_view_pass": {},
```
Note that you can tune the hyper-parameter "threshold_to_cat" for these patterns; its value should be at least 2. The larger the value, the less aggressively nodes are sliced and the more cat nodes are kept. The default value is 10. For example

```
"split_stack_to_cats_pass": {"threshold_to_cat": 10},
"split_cat_to_slices_pass": {"threshold_to_cat": 10},
"unbind_cat_to_view_pass": {"threshold_to_cat": 10},
```

Note that the default value may not be optimal. It is based on my experiments on CMF and IGCTR, and you are more than welcome to tune the value to find the best threshold for you. For example, in the cmf local run,
- when "threshold_to_cat" is 2
P1515072258
=============Print full analysis for cmf_shrink================
| Metric             | Value           |
|:-------------------|:----------------|
| Batch size         | 10              |
| Latency            | 156.07 ms       |
| Model size         | 844357184 bytes |
| Flops/example      | 583.53 G        |
| TFLOPS             | 37.39           |
| MFU                | 4.67%           |
| Activation/example | 1707.49 MB      |

- when "threshold_to_cat" is 10
P1515912635
=============Print full analysis for cmf_shrink================
| Metric             | Value           |
|:-------------------|:----------------|
| Batch size         | 10              |
| Latency            | 155.09 ms       |
| Model size         | 844357184 bytes |
| Flops/example      | 583.53 G        |
| TFLOPS             | 37.63           |
| MFU                | 4.70%           |
| Activation/example | 1707.49 MB      |

ads_dper3:164562cbe29f6c5aea4546cf3d463b87
training_platform:5e455c643c52940bb4567017f4c7ba83

## cmf
baseline
f588717948
proposal
f588719502

### QPS and NE results
{F1793304642}
{F1793304664}
{F1793304689}
{F1793304683}

### Compilation time reduction

zoomer link: https://www.internalfb.com/intern/zoomer/?profiling_run_fbid=1045728747213538&tab=pt2_metrics

Compile time for that frame is reduced to 1 min from 9 min.

### trace analysis
baseline trace link
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2Ff588722004-TrainingApplication%2F0%2Frank-1.Aug_06_00_03_46.3617.pt.trace.json.gz&bucket=pyper_traces

proposal trace link
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2Ff588723545-TrainingApplication%2F0%2Frank-1.Aug_05_23_54_56.3647.pt.trace.json.gz&bucket=pyper_traces

{F1793312804} {F1793312867}

From the trace, we can see that the green part (introduced by split cat) has been reduced significantly with our new patterns.

Differential Revision: D60750275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132831
Approved by: https://github.com/jackiexu1992
2024-08-08 21:18:01 +00:00
40767e8468 [BE] rename testHelperPrefix test (#132916)
Summary:
Re-enable testHelperPrefix test that was erroneously disabled in CI.
Fixes #50701

Test Plan:
Test passes locally:
```
❯ ./TCPStoreTest --gtest_filter=TCPStoreTest.testHelperPrefix
Running main() from
/data/users/cpio/pytorch/third_party/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = TCPStoreTest.testHelperPrefix
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TCPStoreTest
[ RUN      ] TCPStoreTest.testHelperPrefix
[W807 12:01:31.531576727 socket.cpp:462] [c10d] waitForInput: poll for
socket SocketImpl(fd=6, addr=[localhost]:37984,
remote=[localhost]:37171) returned 0, likely a timeout
[W807 12:01:31.531663710 socket.cpp:487] [c10d] waitForInput: socket
SocketImpl(fd=6, addr=[localhost]:37984, remote=[localhost]:37171) timed
out after 100ms
[       OK ] TCPStoreTest.testHelperPrefix (314 ms)
[----------] 1 test from TCPStoreTest (314 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (314 ms total)
[  PASSED  ] 1 test.
╭─ ~/local/pytorch/build/bin  main *1 +1 ···················· ✔
/home/cpio/local/a/pytorch-env   cpio@devgpu011 ─╮
╰─
```
Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132916
Approved by: https://github.com/Skylion007
2024-08-08 20:54:52 +00:00
7bd0732cbd Fix flaky internal mixed_mm tests (#133015)
This PR fixes flaky internal tests:
- The AutoHeuristic test was sometimes failing because it required autotuning to happen for mixed_mm which didn't end up happening when there was a fx graph cache hit.
- The tests inside pattern_matcher failed because in some cases pad_mm decided to pad which made the mixed_mm pattern not match anymore (instead of cast -> mm, it was cast -> pad -> mm), and the tests also fail when is_big_gpu is false (which I haven't found an explanation for).

Differential Revision: [D60972176](https://our.internmc.facebook.com/intern/diff/D60972176)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133015
Approved by: https://github.com/Chillee, https://github.com/eellison
2024-08-08 20:32:12 +00:00
a9954d22f8 Raise exception if torch.func.* calls torch.compile functions (#128736)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128736
Approved by: https://github.com/zou3519
2024-08-08 20:21:44 +00:00
b845068db2 [dtensor] refactor examples folder (#132914)
as titled:

1. remove checkpoint example as it's not maintained
2. refactor convnext example to use torchrun
3. refactor comm mode feature example to sit in one file

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132914
Approved by: https://github.com/wz337
2024-08-08 20:03:14 +00:00
c326533999 [ROCm][Inductor] Enable AOT Inductor CPP UTs for ROCm (#131521)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131521
Approved by: https://github.com/jataylo, https://github.com/pruthvistony, https://github.com/malfet
2024-08-08 19:49:56 +00:00
de288e2203 Fix inf value reduction in non persistent reduction for scans (#132293)
Fixes https://github.com/pytorch/pytorch/issues/132107

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132293
Approved by: https://github.com/peterbell10
2024-08-08 19:02:32 +00:00
322c9d03a0 [FSDP][dtensor] use _StridedShard to represent nested sharding for correct full_tensor() result (#130760)
Fixes issue #129229 #129206
**Summary**

1. Have `FSDP` choose `_StridedShard` placement for FSDP+TP sharding
2. Added a parity test to FSDP to ensure that FSDP+TP sharding (i.e. strided) and plain TP sharding (i.e. non-strided) have the same `full_tensor()` result
3. Re-enabled the tests that were disabled in #129519

**test**
`pytest test/distributed/_composable/fsdp/`
`pytest test/distributed/_composable/test_composability/test_2d_composability.py`
`pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py`

Differential Revision: [D60606114](https://our.internmc.facebook.com/intern/diff/D60606114)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130760
Approved by: https://github.com/wanchaol, https://github.com/fegin, https://github.com/wz337
ghstack dependencies: #126697, #130239, #132391, #131408
2024-08-08 18:15:29 +00:00
21906ddaba [AOTI] Fix complex64 not defined (#132810)
Partially fixes #122980

- change cpp type mapping for complex64 to std::complex<float>
- add `aoti_torch_item_complex64` and `aoti_torch_scalar_to_tensor_complex64`.
- add `expensiveCopyToTensor()` to convert `ArrayRefTensor<T>` type to `AtenTensorHandle` type.

- To fully fix #122980, ArrayRef and MiniArrayRef still need to take the number of elements in the underlying storage into account. See more details in https://github.com/pytorch/pytorch/pull/132347 (#132347 broke some internal tests, so more work is needed before landing it).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132810
Approved by: https://github.com/desertfire
2024-08-08 18:08:23 +00:00
ac95b2a2f2 Migrate slow self-hosted jobs to Amazon2023 AMI (#131771)
A continuation of the migration started in
- https://github.com/pytorch/pytorch/pull/131250

(for tracking: signal on Aug 6: https://hud.pytorch.org/pytorch/pytorch/pull/131771?sha=38bc4755567527fad5279203ddef534ac132ea94)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131771
Approved by: https://github.com/seemethere
2024-08-08 17:33:57 +00:00
75eb66afc0 Support 'non-contiguous with holes' NJTs for contiguous clone() (#132776)
It's possible to construct an NJT with "holes" by specifying both `offsets` and `lengths` metadata. When `nt.clone(memory_format=torch.contiguous_format)` is called on such an NJT, the result should be an NJT without holes.

This PR fixes this in a simplistic way using `unbind()`, which isn't really supported in `torch.compile`. The longer-term solution involves writing a proper kernel to support this.
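
A minimal pure-Python sketch of the idea, assuming a flat `values` buffer with per-component start `offsets` and `lengths` (the helper names below are hypothetical and are not the NJT implementation): unbinding yields only the referenced slices, and re-packing them gives a buffer without holes and fresh offsets.

```python
from itertools import accumulate


def unbind(values, offsets, lengths):
    """Return one slice per component; elements between slices are the "holes"."""
    return [values[start:start + n] for start, n in zip(offsets, lengths)]


def contiguous_clone(values, offsets, lengths):
    """Re-pack the components back to back and recompute the offsets."""
    parts = unbind(values, offsets, lengths)
    new_values = [x for part in parts for x in part]
    # New offsets are the running sum of component lengths, starting at 0.
    new_offsets = [0, *accumulate(len(part) for part in parts)]
    return new_values, new_offsets


# Components start at 0, 3 and 6 but only cover 2, 1 and 2 elements,
# so elements 12, 14 and 15 are holes that the clone drops.
values = [10, 11, 12, 13, 14, 15, 16, 17]
packed_values, packed_offsets = contiguous_clone(values, [0, 3, 6], [2, 1, 2])
```

The packed result no longer needs separate `lengths` metadata, since consecutive offsets already delimit each component.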

NB: Another limitation is that the returned NJT does not have the same ragged structure as the input. While we could manually hack the nested int registry (or update the union find when that lands), this is the first instance where a NJT with holes and an NJT without holes could have the same ragged structure, and getting those to play nicely together requires some fairly involved updates. For now, this PR punts on these updates until we can clean this up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132776
Approved by: https://github.com/ani300, https://github.com/soulitzer
ghstack dependencies: #131898, #131704, #131937
2024-08-08 17:08:11 +00:00
3ec9ec03a8 Revert "[pipelining] Add schedule runtime for lowered schedule (#130488)"
This reverts commit b73d4b6555dd6b5a39d70d741099b83190eb31f0.

Reverted https://github.com/pytorch/pytorch/pull/130488 on behalf of https://github.com/PaliC due to breaking distributed tests internally (that should be running in OSS) ([comment](https://github.com/pytorch/pytorch/pull/130488#issuecomment-2276266909))
2024-08-08 16:57:50 +00:00
942ffd1b2d Make the __module__ name of HOO to be always "torch.ops.higher_order" (#132775)
Summary: It seems that we can just make this the default so that in the future all the ops printed in the graph should be like torch.ops.higher_order

Test Plan: CI

Differential Revision: D60530900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132775
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2024-08-08 16:55:09 +00:00
eeb6ad0744 [quant] Speed up dequantize_per_channel (#132828)
Tensor-wise operations are much faster than looping over tensor elements. Rewrite loop in dequantize_per_channel to use whole-Tensor operations accordingly.

Differential Revision: [D60871396](https://our.internmc.facebook.com/intern/diff/D60871396/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132828
Approved by: https://github.com/cccclai
2024-08-08 16:44:41 +00:00
dfc5bb0099 Login to Meta's ECR when using non-meta runner (#132870)
The project depends on fetching container images from Meta's ECR repo so when run on non-meta runners we need to ensure that we also login to Meta's ECR too.

Closes pytorch/ci-infra#252.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132870
Approved by: https://github.com/ZainRizvi
2024-08-08 16:34:46 +00:00
4a4dc9d6d9 [inductor] Disable remote caching in failing test_cpu_repro tests (#132955)
Summary: These tests are failing stress tests internally because of remote caching. Most already have local cache disabled; disable remote cache as well

Test Plan: Ran stress tests locally for each of the affected tests

Differential Revision: D60940081

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132955
Approved by: https://github.com/leslie-fang-intel
2024-08-08 16:20:56 +00:00
9d5c85c499 Move exir.delegate to PyTorch core to enforce no out-of-tree HOPs (#132525)
Summary: When HOPs live out of tree, it makes it impossible to make breaking changes to the HOP API. But HOP implementations are deeply entwined with PyTorch internals. Move the HOP into PyTorch tree so that changes are possible.

Test Plan: sandcastle, ossci

Differential Revision: D60674615

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132525
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2024-08-08 16:06:56 +00:00
4ee5547b37 [triton_op] Skip HOP dispatch when possible (#132822)
The capture_triton decorator returns a function that goes through the
triton kernel wrapper HOP. This is useful for make_fx tracing and
non-strict export. However, the HOP dispatch is slow (~1ms) and not
necessary in certain situations.

This PR skips going through the HOP dispatch for any
capture_triton-wrapped triton kernels that are registered as
implementations to a `@triton_op` custom operator. We do this by
creating a new thread-local flag that controls if the
capture_triton-wrapped triton kernel goes through HOP dispatch or not.
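
The thread-local flag pattern described above can be sketched in plain Python. All names here (`hop_dispatch_enabled`, `set_hop_dispatch`, `launch`) are hypothetical stand-ins for illustration, not the functions added by this PR:

```python
import threading
from contextlib import contextmanager

_state = threading.local()


def hop_dispatch_enabled():
    # Default to the slower, traceable path when the flag was never set on this thread.
    return getattr(_state, "enabled", True)


@contextmanager
def set_hop_dispatch(enabled):
    """Temporarily override the flag for the current thread only."""
    previous = hop_dispatch_enabled()
    _state.enabled = enabled
    try:
        yield
    finally:
        _state.enabled = previous


def launch(kernel, *args):
    """Stand-in for a wrapped kernel: report which path was taken."""
    if hop_dispatch_enabled():
        return ("hop", kernel(*args))
    return ("direct", kernel(*args))


def add(a, b):
    return a + b


default_path = launch(add, 1, 2)
with set_hop_dispatch(False):
    fast_path = launch(add, 1, 2)
restored_path = launch(add, 1, 2)
```

Restoring the previous value in `finally` keeps nested overrides and exceptions from leaking the flag into unrelated calls on the same thread.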

Test Plan:
- new test and existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132822
Approved by: https://github.com/SherlockNoMad
2024-08-08 15:56:40 +00:00
b885ad8fce Revert "[Inductor][CPP] Turns on inline_inbuilt_nn_modules for CPP GEMM template testing (#132487)"
This reverts commit 73c083e02cb6093bb3adf06b7ccdf5c4a2e7591c.

Reverted https://github.com/pytorch/pytorch/pull/132487 on behalf of https://github.com/PaliC due to this pr is breaking inductor tests internally ([comment](https://github.com/pytorch/pytorch/pull/132487#issuecomment-2276142742))
2024-08-08 15:47:04 +00:00
0ca8f66e3a [NestedTensor] Modify softmax on ragged dimension to allow for 2D nested tensors (#132812)
Summary:
Modify `softmax` on the ragged dimension, where `ragged_idx == 1`, to allow for 2D nested tensors. This diff now enables a `softmax` operation on tensors of shape `(B, *)`, where `*` is the ragged dimension.

Extend existing `softmax` unit tests to include 2D nested tensors using the `include_2d_tensor=True` keyword argument.

Test Plan:
Verify that existing and modified unit tests pass using the following commands:

```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_softmax
```

```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_jagged_op
```

Reviewed By: davidberard98

Differential Revision: D60780975

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132812
Approved by: https://github.com/davidberard98
2024-08-08 15:41:28 +00:00
c4071c4707 Remove noqa: G004 warnings (#132917)
Remove logging messages with f-strings (G004), https://docs.astral.sh/ruff/rules/logging-f-string/
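
As a small illustration of why the rule exists (the logger name and the `Expensive` class are made up for this sketch): an f-string is formatted before the logging call even when the record is dropped, whereas `%s`-style arguments are only formatted if the level is enabled.

```python
import logging


class Expensive:
    """Counts how often it is converted to a string."""

    def __init__(self):
        self.str_calls = 0

    def __str__(self):
        self.str_calls += 1
        return "expensive"


logger = logging.getLogger("g004_demo")
logger.setLevel(logging.WARNING)  # DEBUG records are dropped

eager, lazy = Expensive(), Expensive()
logger.debug(f"value: {eager}")  # flagged by G004: formatted before the call
logger.debug("value: %s", lazy)  # formatting deferred, and skipped here
```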

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132917
Approved by: https://github.com/Skylion007, https://github.com/c-p-i-o, https://github.com/fduwjj, https://github.com/fegin
ghstack dependencies: #132888
2024-08-08 15:18:53 +00:00
9db5bfccdc [inductor] disable test_torchinductor failed UTs on Windows (#132973)
Disable failed UTs of `test/inductor/test_torchinductor.py` on Windows.

**TODO:**
Debug and enable these UTs, after CI ready.

Local test:
<img width="857" alt="image" src="https://github.com/user-attachments/assets/3d9da274-f147-474e-92f1-a6d3ed8aa003">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132973
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-08-08 14:56:10 +00:00
51ddcde110 [BE] Introduces runner variants for amzn2023 to simplify lf-scale-config.yml and lf-canary-scale-config.yml (#132918)
Depends on https://github.com/pytorch/test-infra/pull/5541 to be deployed on LF and Meta infra

Tests for these changes are in this PR: https://github.com/pytorch/test-infra/pull/5542
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132918
Approved by: https://github.com/zxiiro, https://github.com/ZainRizvi
2024-08-08 14:38:34 +00:00
6f99e97f0a Revert "[ts-migration]: Support quantized operation transformation (#131915)"
This reverts commit 0e8541766fe5ed58c54aa530eee8e34832539199.

Reverted https://github.com/pytorch/pytorch/pull/131915 on behalf of https://github.com/ezyang due to test broken on windows 0e8541766f ([comment](https://github.com/pytorch/pytorch/pull/131915#issuecomment-2275974907))
2024-08-08 14:30:35 +00:00
42cd397a0e Loads .pyd instead of .so in MemPool test for windows (#132749)
Fixes #132650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132749
Approved by: https://github.com/albanD
2024-08-08 14:29:56 +00:00
d1f73fd844 Revert "[BE] Reroute all uses of proxy_tensor.maybe_disable_fake_tensor_mode to fake_tensor.unset_fake_temporarily (#132770)"
This reverts commit 902c6f3a191fb2ecb1976895b3e9eaae4b257b89.

Reverted https://github.com/pytorch/pytorch/pull/132770 on behalf of https://github.com/ezyang due to Removed API was recommitted ([comment](https://github.com/pytorch/pytorch/pull/132770#issuecomment-2275749689))
2024-08-08 12:54:34 +00:00
902c6f3a19 [BE] Reroute all uses of proxy_tensor.maybe_disable_fake_tensor_mode to fake_tensor.unset_fake_temporarily (#132770)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132770
Approved by: https://github.com/bdhirsh
ghstack dependencies: #132674, #132675, #132421, #132062, #132767, #132769
2024-08-08 12:03:25 +00:00
0e43175e22 [BE] Get rid of unnecessary inner_torch_dispatch method (#132769)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132769
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
ghstack dependencies: #132674, #132675, #132421, #132062, #132767
2024-08-08 12:03:25 +00:00
35fd4391bc Format torch.fx.experimental.proxy_tensor.py (#132767)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132767
Approved by: https://github.com/bdhirsh
ghstack dependencies: #132674, #132675, #132421, #132062
2024-08-08 12:03:18 +00:00
b4e2411f6f Big enough count to trigger stack overflow (#132062)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132062
Approved by: https://github.com/bdhirsh
ghstack dependencies: #132674, #132675, #132421
2024-08-08 12:03:12 +00:00
aec6332356 Only thunkify proxies in some situations (#132421)
The goal of this PR is to avoid stack overflow when we create extremely long chains of thunks, and then evaluate them (e.g., as occurs if you sum(long list of symint)). The basic idea behind this PR is to only thunkify proxies if they're being created in places where they may or may not be used--crucially, symint operations that occur in user code we are tracing are eagerly placed into the graph, even if they may eventually be dead.
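
The failure mode can be reproduced in plain Python. This is only an analogy under simplifying assumptions (closures stand in for thunks, and the helper names are hypothetical); it is not the proxy-tensor code itself:

```python
import sys


def lazy_sum(values):
    """Build one thunk per addition; nothing runs until the last thunk is forced."""
    thunk = lambda: 0
    for v in values:
        # Each thunk closes over the previous one, so the chain is as deep as the input.
        thunk = (lambda prev, v: (lambda: prev() + v))(thunk, v)
    return thunk


def eager_sum(values):
    """Perform each addition immediately; no chain is ever built."""
    total = 0
    for v in values:
        total += v
    return total


def lazy_overflows(values):
    try:
        lazy_sum(values)()  # forcing the last thunk recurses through the whole chain
    except RecursionError:
        return True
    return False


values = [1] * (sys.getrecursionlimit() * 2)
```

Forcing the lazy chain recurses once per element and overflows the stack, while the eager version handles the same input with constant stack depth.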

I annotated the PR with explanation of changes.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132421
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #132674, #132675
2024-08-08 12:03:06 +00:00
54efd43022 [BE] Simplify code interacting with get_proxy_mode/enable_tracing (#132675)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132675
Approved by: https://github.com/Skylion007, https://github.com/ydwu4, https://github.com/zou3519
ghstack dependencies: #132674
2024-08-08 12:03:00 +00:00
361db32d47 Consolidate SymDispatchMode into ProxyTensorMode (#132674)
Instead of having a separate context variable for SymDispatchMode, we
now simply delegate to the current active proxy tensor mode when we
need to trace a SymInt.  We maintain a separate `__sym_dispatch__` magic
method as the calling convention is different than `__torch_dispatch__`.

Consolidating the modes in this way means that we can consistently
disable both of these modes in tandem simply by removing the mode
from the proxy mode infra slot.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132674
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2024-08-08 12:02:54 +00:00
0f19d4150b Revert "[inductor]a less ambitious way to slove the scalar tensor (#132702)"
This reverts commit b483ca05a91f2876b0f1f5a435fa264f5467762d.

Reverted https://github.com/pytorch/pytorch/pull/132702 on behalf of https://github.com/ezyang due to breaks trunk jobs ([comment](https://github.com/pytorch/pytorch/pull/132702#issuecomment-2275642109))
2024-08-08 11:59:38 +00:00
ec49796b8f [Inductor] Support use_libdevice_for_f64 for pointwise ops on XPU, align with CUDA. (#132739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132739
Approved by: https://github.com/malfet, https://github.com/EikanWang
2024-08-08 11:50:10 +00:00
24dee99cb7 Populate submodules of torch._C to sys.modules recursively (#132216)
See comment:

e9d1c26275/torch/__init__.py (L938-L950)

This PR recursively sets the submodules in the C extension to `sys.modules` (e.g., `_C._dynamo.eval_frame`).
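
A rough sketch of the recursive registration, using stand-in module objects built with `types.ModuleType` (the `demo_ext` names and the `register_submodules` helper are hypothetical, not the code in `torch/__init__.py`):

```python
import importlib
import sys
import types


def register_submodules(module, prefix, seen=None):
    """Expose every module-valued attribute under `prefix` in sys.modules."""
    seen = set() if seen is None else seen
    if id(module) in seen:  # guard against modules that reference each other
        return
    seen.add(id(module))
    for name, value in list(vars(module).items()):
        if isinstance(value, types.ModuleType):
            qualname = f"{prefix}.{name}"
            sys.modules.setdefault(qualname, value)
            register_submodules(value, qualname, seen)


# Stand-in for a C extension whose submodules exist only as attributes.
root = types.ModuleType("demo_ext")
root._dynamo = types.ModuleType("demo_ext._dynamo")
root._dynamo.eval_frame = types.ModuleType("demo_ext._dynamo.eval_frame")

sys.modules["demo_ext"] = root
register_submodules(root, "demo_ext")
leaf = importlib.import_module("demo_ext._dynamo.eval_frame")
```

Once the entries exist in `sys.modules`, the dotted import resolves without any finder having to locate the submodule on disk.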

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132216
Approved by: https://github.com/ezyang
2024-08-08 10:20:25 +00:00
7f71f2a997 [dtensor] improve docs and comments (#132683)
As titled: fixes typos in various comments and improves the
public documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132683
Approved by: https://github.com/XilunWu
ghstack dependencies: #131210, #132682
2024-08-08 09:24:58 +00:00
9e37e73e01 [dtensor] refactor and improve readability of _dispatch.py (#132682)
As titled. It also changes some comments in _op_schema.py to bring them
up to date.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132682
Approved by: https://github.com/XilunWu
ghstack dependencies: #131210
2024-08-08 09:24:58 +00:00
ac960dced1 Skip Reformer for Dynamic size testing (#132468)
**Summary**

As discussed in https://github.com/pytorch/pytorch/issues/132286, `Reformer` specializes the batch size dim, which makes the `mark_dynamic` API fail: 3a355c1891/torch/_dynamo/decorators.py (L228-L230)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132468
Approved by: https://github.com/ezyang
2024-08-08 08:25:53 +00:00
9c5e0d47fe Add xpu_cmake_macros.h to xpu build (#132847)
# Motivation

fix https://github.com/pytorch/pytorch/issues/132971

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132847
Approved by: https://github.com/EikanWang
2024-08-08 08:06:49 +00:00
751c744ad0 Optimize sort kernel for contiguous tensors (#132236)
Introduces enhancement for SortingKernel.cpp for cases where both the values and indices tensors have a stride 1, indicating contiguous memory layouts.

The changes include:
1. A new function `sort_kernel_impl`, encapsulating the core sorting logic for distinct types of tensor accessors.
2. Modifications to the `sort_kernel` function to utilize `sort_kernel_impl`. It now checks for tensor strides and optimally handles contiguous and non-contiguous tensor scenarios.
3. The optimization aims to improve cache locality and efficiency in memory access for contiguous tensor sorts.
4. Enhanced Code Readability and Structure: The restructuring of the sorting process improves clarity and maintenance by clearly defining how different stride scenarios are handled, making the code more transparent and easier to understand.

Tests have been conducted across various tensor sizes and shapes to ensure stability and reliability of the change.

The result of running the `test/test_sort_and_select.py` test suite is consistent between the main branch, and this modified branch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132236
Approved by: https://github.com/jgong5
2024-08-08 07:01:25 +00:00
83e4af203d [dtensor] rewrite redistribute algorithm for multi-dim mesh (#131210)
As titled, this PR rewrites the current redistribute algorithm to make
the multi-mesh-dim redistribute logic more sound. The previous algorithm
works numerically, but it could incur additional unnecessary steps
when transforming shardings in a multi-dimensional device mesh.

For example, say we want to transform from (S(1), S(1)) -> (S(1), S(2)). The
previous algorithm yields the following steps:

* mesh_dim 1: S(1) -> R, mesh_dim 0: S(1) -> R
* mesh_dim 0: R -> S(1), mesh_dim 1: R -> S(2)

Although this works semantically, it incurs two allgather
transformations, where it should really only incur an S(1) -> S(2) on
mesh dim 1.

The rewritten algorithm takes a more principled approach:

1. We check whether src_spec has sharding. If not, we don't need to worry about the nested sharding case, as sharding would always be in order, so we just go from left to right in the placements and add the transform steps.
2. If src_spec has sharding, there may be either nested or misaligned shardings. So we first traverse from right to left to check for misaligned sharding, as in the example above; if there is any, we replicate that mesh dimension so that the nested sharding is unsharded.
3. We traverse again from left to right to generate the transformations
   after unsharding the nested sharding.
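
The three steps above can be sketched in plain Python. This is a simplified model under stated assumptions, not the DTensor implementation: a placement is `None` for replicate or an `int` tensor dim for shard, `plan_redistribute` is a hypothetical name, and "misaligned" is read as "the outer mesh dims that shard the target tensor dim differ between source and destination".

```python
from typing import List, Optional, Tuple

Placement = Optional[int]  # None = replicate, int = shard on that tensor dim
Step = Tuple[int, Placement, Placement]  # (mesh_dim, from, to)


def plan_redistribute(src: List[Placement], dst: List[Placement]) -> List[Step]:
    current = list(src)
    steps: List[Step] = []
    if any(p is not None for p in current):
        # Step 2: source is sharded, so walk from the innermost mesh dim outward.
        for mesh_dim in reversed(range(len(current))):
            cur, target = current[mesh_dim], dst[mesh_dim]
            if cur is None:
                continue
            if target is not None:
                # Which outer mesh dims already shard the target tensor dim?
                outer_src = [i for i in range(mesh_dim) if current[i] == target]
                outer_dst = [i for i in range(mesh_dim) if dst[i] == target]
                if outer_src != outer_dst:
                    target = None  # misaligned: replicate first to undo nesting
            if cur != target:
                steps.append((mesh_dim, cur, target))
                current[mesh_dim] = target
    # Steps 1 and 3: left-to-right pass for whatever still differs.
    for mesh_dim, (cur, target) in enumerate(zip(current, dst)):
        if cur != target:
            steps.append((mesh_dim, cur, target))
            current[mesh_dim] = target
    return steps


# (S(1), S(1)) -> (S(1), S(2)) needs a single step on mesh dim 1.
single_step = plan_redistribute([1, 1], [1, 2])
```

In this model the example from the description reduces to one S(1) -> S(2) transform on mesh dim 1, while a nested source such as (S(0), S(0)) going to (R, S(0)) is first unsharded from the inside out.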

should also fix https://github.com/pytorch/pytorch/issues/132751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131210
Approved by: https://github.com/tianyu-l
2024-08-08 06:50:30 +00:00
479d460471 [DeviceMesh] Add a private _flatten() API for device_mesh (#132632)
Adds a new private API to flatten a DeviceMesh to a 1D DeviceMesh such that:
```
mesh_3d = init_device_mesh(
    self.device_type, (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"),
)

dp_cp_mesh = mesh_3d["dp", "cp"]
# flattened_mesh on rank 0, 2, 4, 6 is DeviceMesh([0, 2, 4, 6], mesh_dim_names=('dp_cp',))
# flattened_mesh on rank 1, 3, 5, 7 is DeviceMesh([1, 3, 5, 7], mesh_dim_names=('dp_cp',))
flattened_dp_cp_mesh = dp_cp_mesh._flatten()
```
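
The rank groups in the comments above can be checked with a small pure-Python model. It assumes ranks are laid out in row-major order over the mesh shape, and `flattened_groups` is a hypothetical helper rather than part of the DeviceMesh API:

```python
from itertools import product


def flattened_groups(mesh_shape, dim_names, flatten_dims):
    """Group ranks that share coordinates on every dim that is NOT flattened."""
    keep = [i for i, name in enumerate(dim_names) if name not in flatten_dims]
    groups = {}
    # product() varies the last dim fastest, matching a row-major rank layout.
    for rank, coord in enumerate(product(*(range(n) for n in mesh_shape))):
        key = tuple(coord[i] for i in keep)
        groups.setdefault(key, []).append(rank)
    return list(groups.values())


dp_cp_groups = flattened_groups((2, 2, 2), ("dp", "cp", "tp"), ("dp", "cp"))
```

Each group holds the ranks that share a `tp` coordinate, which is why the even ranks and the odd ranks end up in separate flattened meshes.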

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132632
Approved by: https://github.com/fegin, https://github.com/wanchaol
ghstack dependencies: #132310, #132311, #132339
2024-08-08 06:46:42 +00:00
0e8541766f [ts-migration]: Support quantized operation transformation (#131915)
#### Description
Transform quantized operation properly. Add de/quantization before and after the quantized operation.

#### Test Plan
`pytest test/export/test_converter.py -s -k test_ts2ep_convert_quantized_model`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131915
Approved by: https://github.com/angelayi
2024-08-08 06:34:53 +00:00
9e584d0c05 [BE] Test foreach optimizer for FSDP1 optimizer state_dict (#132933)
Summary:
When fixing https://github.com/pytorch/pytorch/issues/130810, we suspected that the FSDP1 optimizer state_dict could not handle the foreach optimizer; that suspicion turned out to be incorrect. For FSDP1, whether the optimizer uses foreach or not does not matter. Since we already have tests for the non-foreach optimizer, this PR changes the distributed state_dict tests for FSDP1 to use the foreach optimizer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132933
Approved by: https://github.com/c-p-i-o
ghstack dependencies: #132908
2024-08-08 06:13:10 +00:00
a270800f0b [export][reland] Add print_readable to unflattened module (#132817)
Reland https://github.com/pytorch/pytorch/pull/128617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132817
Approved by: https://github.com/pianpwk
2024-08-08 06:05:30 +00:00
745665d8b5 [BE] Using with_temp_dir for test_distributed_checkpoint (#132908)
Fixes https://github.com/pytorch/pytorch/issues/113936
Fixes https://github.com/pytorch/pytorch/issues/113937

The original way to broadcast the path seems to cause desync issues.  `with_temp_dir` has been used for other checkpoint related tests without problems. Change the tests to use `with_temp_dir`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132908
Approved by: https://github.com/awgu, https://github.com/Skylion007
2024-08-08 05:42:19 +00:00
aff48f7378 Autoselect default device in FSDP construction. (#127609)
There are still some differences between CUDA and non-CUDA custom devices when
constructing FSDP, because CUDA is selected as the default device. For example,
when constructing FSDP from a CPU model without passing device_id, device_handle
will choose CUDA as the default device. This PR autoselects the real device
as the default device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127609
Approved by: https://github.com/awgu
2024-08-08 05:25:17 +00:00
4a1edbe475 Disable SymDispatchMode when torch.compile'ing (#132433)
Partially addresses https://github.com/pytorch/pytorch/issues/132417

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132433
Approved by: https://github.com/ydwu4
2024-08-08 05:02:43 +00:00
5ae979ab10 [Dynamo] Support torch.autograd._is_checkpoint_valid (#132611)
Hi, we got `torch._dynamo.exc.Unsupported: torch.* op returned non-Tensor bool call_function <function _is_checkpoint_valid at 0x7f0b0d22e290>` while tracing the activation [checkpointing function in deepspeed](324ee65cb0/deepspeed/runtime/activation_checkpointing/checkpointing.py (L630)). Consider adding it to the constant_folding list, similar to https://github.com/pytorch/pytorch/pull/126196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132611
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
2024-08-08 04:05:08 +00:00
4fd0d594a1 [sym_shapes] Not eval sym expression for printing storage_offset (#132911)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132911
Approved by: https://github.com/ezyang
2024-08-08 03:49:29 +00:00
b483ca05a9 [inductor]a less ambitious way to slove the scalar tensor (#132702)
Fixes #121374

The previous https://github.com/pytorch/pytorch/pull/131775 tried to convert the 0-dim CPU tensor to a DynamicScalar in the lowering stage, but too many lowering rules are incompatible with that approach. So this PR does the conversion in the codegen stage instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132702
Approved by: https://github.com/eellison
2024-08-08 03:42:21 +00:00
ac6398b630 [FSDP2] Follow-up fix to correct relaxed overlap test (#132953)
The previous PR forgot to include dummy all-gathers before backward, so the reference time was too short, causing the test to still fail.

I verified the test passes locally.

This should close https://github.com/pytorch/pytorch/issues/120961 (again).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132953
Approved by: https://github.com/weifengpy
ghstack dependencies: #132869
2024-08-08 03:24:46 +00:00
636a7c4859 [13/N] Use std::optional (#132527)
Follows #132361

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132527
Approved by: https://github.com/ezyang
2024-08-08 03:16:28 +00:00
fd874b799f [AOTI][refactor] Update MKLDNN ops cpp wrapper support (#132367)
Summary: Set op_overload for MKLDNN ops so that cpp_kernel_name and python_kernel_name are constructed from there. This is an important step towards supporting those MKLDNN ops in the ABI-compatible mode, because we will need to read the schema from op_overload to generate the correct fallback op call in C++.

Differential Revision: [D60909798](https://our.internmc.facebook.com/intern/diff/D60909798)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132367
Approved by: https://github.com/leslie-fang-intel, https://github.com/angelayi
2024-08-08 03:02:29 +00:00
c69b2d24e3 [dynamo] Support remove method of set (#132943)
Fixes https://github.com/pytorch/pytorch/issues/132800

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132943
Approved by: https://github.com/anijain2305
2024-08-08 02:43:19 +00:00
194ec49d27 [dynamo][lists][stable diffusion] Do not add source on list slice (#132912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132912
Approved by: https://github.com/williamwen42
ghstack dependencies: #132806, #132899
2024-08-08 02:23:07 +00:00
45d0e90bd3 [export] Allow str outputs (#132808)
Summary: Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1478413606130179/

Test Plan: CI

Differential Revision: D60850712

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132808
Approved by: https://github.com/ydwu4
2024-08-08 02:20:59 +00:00
4ca616e6d4 Disable sparse tests in export (#132824)
Summary: Dynamo doesn't trace through sparse tensors in fbcode. So we should disable tests that run sparse tensors in export. We should do this to make the CI green internally.

Test Plan:
Before:
Tests finished: Pass 1409. Fail 71. Fatal 0. Skip 90. Build failure 0
After:
Tests finished: Pass 1408. Fail 0. Fatal 0. Skip 162. Build failure 0

Differential Revision: D60870543

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132824
Approved by: https://github.com/BoyuanFeng
2024-08-08 01:45:12 +00:00
fb6b001cde Disable expandable segments IPC in fbcode, because some jobs seem to be failing. (#132890)
https://fb.workplace.com/groups/1405155842844877/permalink/8867182216642165/

Differential Revision: [D60912371](https://our.internmc.facebook.com/intern/diff/D60912371/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132890
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-08-08 01:42:32 +00:00
5709375d56 [AOTI][tooling][1/n] Add intermediate value debug printer (#132323)
Summary:
**Context:**

Currently we have a helper to print out AtenTensor in [shim_common.cpp](https://github.com/pytorch/pytorch/blob/v2.4.0-rc4/torch/csrc/inductor/aoti_torch/shim_common.cpp#L866)

The way we were using this function was a “manual” process. We inject this function into the generated output.cpp file, and recompile and reload the file. This diff automates the printing value process.

**Changes:**

1. Added a simple initial debug printer helper to print out tensor values

2. Added a filter option to selectively dump tensor values.

**Usage:**

Sample cmd :

```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor, +schedule, output_code"  python test/inductor/test_aot_inductor.py -k test_addmm_abi_compatible_cuda
```

Sample outputs :
```
[  before_launch - triton_poi_fused_0 - buf0  ]:
 0.6331
 1.6358
-0.3459
 1.0196
-0.4122
 1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0

[  after_launch - triton_poi_fused_0 - buf0  ]:
 0.6331
 1.6358
-0.3459
 1.0196
-0.4122
 1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0

[ before_launch - aoti_torch_cuda_addmm_out - buf1  ]:
Min value: -2.25655
Max value: 2.32996
Device: cuda:0
Size: [16, 6]
Stride: [6, 1]
Dtype: float
Layout: Strided
Number of elements: 96
Is contiguous: 1
Requires grad: 0

[  before_launch - aoti_torch_cuda_addmm_out - buf0  ]:
 0.6331
 1.6358
-0.3459
 1.0196
-0.4122
 1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0

[  after_launch - aoti_torch_cuda_addmm_out - buf1  ]:
Min value: -12.0839
Max value: 11.6878
Device: cuda:0
Size: [16, 6]
Stride: [6, 1]
Dtype: float
Layout: Strided
Number of elements: 96
Is contiguous: 1
Requires grad: 0

[  after_launch - aoti_torch_cuda_addmm_out - buf0  ]:
 0.6331
 1.6358
-0.3459
 1.0196
-0.4122
 1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0

stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('extern_calls', 2)]
.
----------------------------------------------------------------------
Ran 1 test in 10.867s

OK

```

The user is able to filter kernel names to print out values by specifying env var `AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT` and see choices of kernel names in a log message like below:
```
torch/_inductor/graph.py:1642] Finished codegen for all nodes. The list of kernel names available: ['triton_poi_fused_0', 'aoti_torch_cuda_addmm_out']

```

A follow-up diff will add `torch.save()` to dump the intermediate tensors into individual `.pt` files that can later be loaded with `torch.load()`.

Test Plan:
Run Unit Tests in OSS: (similar cmd as mentioned above in the usage part)

 `AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor, output_code"  python test/inductor/test_aot_inductor.py -k test_addmm_abi_compatible_cuda`

Differential Revision: D60538496

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132323
Approved by: https://github.com/ColinPeppler
2024-08-08 01:39:59 +00:00
59f4725b49 [NJT] manually autocast in SDPA handling (#132835)
When autocasting is turned on, right now SDPA w/ NJT won't be autocasted. This PR adds manual "autocasting" logic in sdpa.py - at the beginning, it just checks if autocasting is enabled, and if so, it casts the inputs in the way you would expect if autocasting was actually running.

Why normal autocasting won't work:
* NJT intercepts the `__torch_function__` call for scaled_dot_product_attention, which, AFAIK, happens before we get to any dispatcher logic, and then calls efficient attention or flash attention. So autocasting the scaled_dot_product_attention op won't work; we never call the aten op for scaled_dot_product_attention, so we won't ever run autocasting for it.
* If we try to add autocasting handling for `_flash_attention_forward` or `_efficient_attention_forward`, then autocasting will _run_, but it will have the wrong semantics: sdpa.py's handling will run first, and it will do backend selection based on the uncasted inputs to SDPA. This also means that if the inputs to the SDPA call don't have uniform types, the sdpa.py implementation will fail checks (this is the specific issue we're targeting).

Alternative: "just change the backend selection logic for NJT to be autocast aware, but don't actually do the autocast; then, add `_(flash|efficient)_attention_forward` to autocasting rules". I think this would work too. But it's arguably better to make the backend-selection logic and actual-autocast-behavior use the same implementation, in case the implementations are different.

Differential Revision: [D60879916](https://our.internmc.facebook.com/intern/diff/D60879916)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132835
Approved by: https://github.com/soulitzer
2024-08-08 01:36:57 +00:00
bbf568aac8 Split of "[reland] [export] fix zero arg export in training_ir and constant tensor handling" (#132307)
Summary:
A re-land of D60006710.
Fixed TrainingIRToRunDecomp failures for test_tensor_attribute_zero_args and also a few re-tracability failures because run_decomposition does a retracing.

edit: also remove the eliminate_dead_code() in _unlift because of one onnx test failure:
a constant tensor attr was lifted as a constant_tensor input, but it is not used in the graph after aot_autograd due to a shortcut in its decomposition. This causes the setattr to be removed by eliminate_dead_code, but the graph signature still contains the name of that buffer, which causes an inconsistency between the transformed graph and ep's original signature after _unlift. It seems this has happened a few times, where some nodes are accidentally removed and we end up in an inconsistent state.
The alternative to removing it would be: every time we call eliminate_dead_code, we verify the consistency of the graph against 1. the graph before the transformation and 2. all the metadata. But I think this deserves a complete design.

edit 2: Also fix the inconsistency of graph signatures when param_constant is marked as lifted_tensor_constants but it's registered as parameters in the output of ep.module().

Differential Revision: D60532628

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132307
Approved by: https://github.com/zhxchen17
2024-08-08 01:36:16 +00:00
0f90ffe94a Remove ProcessGroupRoundRobin (#132888)
`_round_robin_process_groups` is deprecated and should be removed.

258f47fc0b/torch/csrc/distributed/c10d/ProcessGroupRoundRobin.cpp (L10-L12)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132888
Approved by: https://github.com/Skylion007, https://github.com/wanchaol, https://github.com/c-p-i-o, https://github.com/fduwjj
2024-08-08 01:07:40 +00:00
5cb05a82b4 [BC breaking] move benchmarking + prefer inductor path (#132827)
move benchmarking out of `torch._inductor.runtime.runtime_utils` and into `torch._inductor.runtime.benchmarking`, and prefer this path over directly accessing Triton's benchmarking

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132827
Approved by: https://github.com/eellison
2024-08-08 00:47:45 +00:00
a9036e1cf8 [inductor] raise unsupport msg in capture_pre_autograd_graph on Windows (#132841)
Debugged with @leslie-fang-intel, and we found that https://github.com/pytorch/pytorch/issues/132561 and https://github.com/pytorch/pytorch/issues/132569 both fail because `capture_pre_autograd_graph` does not work well on Windows.

So we added code that raises an error message to let end users know that.

Detailed:
For https://github.com/pytorch/pytorch/issues/132561
```cmd
Traceback (most recent call last):
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 59, in testPartExecutor
    yield
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 549, in _callTestMethod
    method()
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 2918, in wrapper
    method(*args, **kwargs)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 1515, in wrapper
    fn(*args, **kwargs)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_quantization.py", line 399, in wrapper
    fn(*args, **kwargs)
  File "D:\xu_git\dnnl_cb\pytorch\test\quantization\pt2e\test_x86inductor_quantizer.py", line 1737, in test_qat_conv2d
    self._test_quantizer(
  File "D:\xu_git\dnnl_cb\pytorch\test\quantization\pt2e\test_x86inductor_quantizer.py", line 553, in _test_quantizer
    m = capture_pre_autograd_graph(
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_export\__init__.py", line 121, in capture_pre_autograd_graph
    raise RuntimeError("capture_pre_autograd_graph not yet supported on Windows")
RuntimeError: capture_pre_autograd_graph not yet supported on Windows

To execute this test, run the following from the base repo dir:
    python test\quantization\pt2e\test_x86inductor_quantizer.py -k TestQuantizePT2EX86Inductor.test_qat_conv2d

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```

For https://github.com/pytorch/pytorch/issues/132569
```cmd
Traceback (most recent call last):
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 59, in testPartExecutor
    yield
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 549, in _callTestMethod
    method()
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 2918, in wrapper
    method(*args, **kwargs)
  File "D:\xu_git\dnnl_cb\pytorch\test\inductor\test_torchinductor.py", line 11218, in new_test
    return value(self)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_dynamo\testing.py", line 312, in _fn
    return fn(*args, **kwargs)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "D:\xu_git\dnnl_cb\pytorch\test\inductor\test_cpu_cpp_wrapper.py", line 155, in fn
    _, code = test_torchinductor.run_and_get_cpp_code(
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_inductor\utils.py", line 1863, in run_and_get_cpp_code
    result = fn(*args, **kwargs)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_quantization.py", line 415, in wrapper
    fn(*args, **kwargs)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_quantization.py", line 367, in wrapper
    fn(*args, **kwargs)
  File "D:\xu_git\dnnl_cb\pytorch\test\inductor\test_mkldnn_pattern_matcher.py", line 1668, in test_qlinear_gelu_cpu
    self._qlinear_unary_cpu_test_helper((torch.randn((2, 4)),), gelu)
  File "D:\xu_git\dnnl_cb\pytorch\test\inductor\test_mkldnn_pattern_matcher.py", line 1615, in _qlinear_unary_cpu_test_helper
    self._test_common(
  File "D:\xu_git\dnnl_cb\pytorch\test\inductor\test_mkldnn_pattern_matcher.py", line 165, in _test_common
    convert_model = _generate_qdq_quantized_model(
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_quantization.py", line 2949, in _generate_qdq_quantized_model
    export_model = capture_pre_autograd_graph(
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_export\__init__.py", line 121, in capture_pre_autograd_graph
    raise RuntimeError("capture_pre_autograd_graph not yet supported on Windows")
RuntimeError: capture_pre_autograd_graph not yet supported on Windows

To execute this test, run the following from the base repo dir:
    python test\inductor\test_cpu_cpp_wrapper.py -k DynamicShapesCppWrapperCpuTests.test_qlinear_gelu_cpu_dynamic_shapes_cpp_wrapper

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
--------------------------------------------------------------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------------------------------------------------------------
W0807 13:24:34.291000 11228 torch\_export\__init__.py:64] +============================+
W0807 13:24:34.291000 11228 torch\_export\__init__.py:65] |     !!!   WARNING   !!!    |
W0807 13:24:34.291000 11228 torch\_export\__init__.py:66] +============================+
W0807 13:24:34.291000 11228 torch\_export\__init__.py:67] capture_pre_autograd_graph() is deprecated and doesn't provide any function guarantee moving forward.
W0807 13:24:34.291000 11228 torch\_export\__init__.py:68] Please switch to use torch.export instead.
```

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132841
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-08-08 00:28:07 +00:00
441c1c03d5 Prevent an unnecessary device -> host copy for CuPy arrays when not explicitly setting a device in torch.as_tensor. (#132595)
See title. Until now, calling `torch.as_tensor` on a CuPy array would return a CPU tensor, when not providing a device. This is most likely not desired.

Fixes #132553

```python3
import torch
import cupy as cp

cupy_arr = cp.asarray([1, 2, 3])

# Default case
t = torch.as_tensor(cupy_arr)
# New behavior, same device as cupy_arr now, was cpu before
print(t.device)  # cuda:0

# Explicitly set device
t = torch.as_tensor(cupy_arr, device='cpu')
print(t.device)  # cpu

# Implicit default device
torch.set_default_device('cpu')
t = torch.as_tensor(cupy_arr)
print(t.device)  # cpu

# Default device via context manager
torch.set_default_device('cuda')
with torch.device('cpu'):
    t = torch.as_tensor(cupy_arr)
    print(t.device)  # cpu

# Unset default device
torch.set_default_device(None)
t = torch.as_tensor(cupy_arr)
# New behavior, same device as cupy_arr now, was cpu before
print(t.device)  # cuda:0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132595
Approved by: https://github.com/ezyang
2024-08-08 00:26:58 +00:00
374747818d Run performance test non-alternately (#131935)
Summary:
By default, performance tests (speedup experiments) will run the baseline and test backend alternately.

However, this does not work for the torchao backend, which changes the model in place; the baseline run would then also run with the torchao backend, since the model has already been quantized.

Add a new experiment "latency_experiment" to run performance tests non-alternately (first run baseline for a few iterations, then run the test backend).
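
The effect of the run order can be sketched in plain Python. This is only an illustration, not the benchmark harness: `run_alternately`, `run_non_alternately`, `make_runs`, and the dict standing in for a model are all hypothetical names. The "test" callable mutates the model in place, the way an in-place quantizing backend would.

```python
from typing import Callable, Dict, List, Tuple


def run_alternately(baseline: Callable[[], None], test: Callable[[], None], iters: int) -> None:
    # Interleave baseline and test runs: baseline, test, baseline, test, ...
    for _ in range(iters):
        baseline()
        test()


def run_non_alternately(baseline: Callable[[], None], test: Callable[[], None], iters: int) -> None:
    # Run every baseline iteration first, then every test iteration.
    for _ in range(iters):
        baseline()
    for _ in range(iters):
        test()


def make_runs() -> Tuple[Callable[[], None], Callable[[], None], List[bool]]:
    # Hypothetical stand-in for a model that a backend quantizes in place.
    model: Dict[str, bool] = {"quantized": False}
    seen: List[bool] = []

    def baseline() -> None:
        # Record which model state the baseline run actually measured.
        seen.append(model["quantized"])

    def test() -> None:
        model["quantized"] = True

    return baseline, test, seen


baseline, test, seen_alternating = make_runs()
run_alternately(baseline, test, 3)

baseline, test, seen_sequential = make_runs()
run_non_alternately(baseline, test, 3)

print(seen_alternating)  # [False, True, True]
print(seen_sequential)   # [False, False, False]
```

With alternating runs, every baseline iteration after the first one measures the already-quantized model; running the baseline iterations first avoids that.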

other changes:

need to add torch.compiler.cudagraph_mark_step_begin() to avoid the
slowdown from "Unable to hit fast path of CUDAGraphs because of pending, uninvoked backwards"

also updated the torchao APIs to the current versions

X-link: https://github.com/pytorch/benchmark/pull/2394

Test Plan:
python run_benchmark.py torchao --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
python run_benchmark.py torchao --only BartForCausalLM --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
python run_benchmark.py torchao --only timm_efficientnet --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune

(should all be ~1.0)
0.997x
1.006x
0.994x

Reviewed By: xuzhao9

Differential Revision: D60252821

Pulled By: HDCharles

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131935
Approved by: https://github.com/xuzhao9
2024-08-08 00:23:20 +00:00
f16d87eeff Print where raw cprofile lives (#132866)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132866
Approved by: https://github.com/albanD
2024-08-08 00:13:29 +00:00
b73d4b6555 [pipelining] Add schedule runtime for lowered schedule (#130488)
Creates a new runtime that shifts complexity from runtime to
ahead-of-time.

The existing runtime (PipelineScheduleMulti) accepts a
compute-only schedule, in which only compute (forward, backward,
weight) actions are specified, and it infers the communication
operations at runtime.
Compared to that runtime, PipelineScheduleRuntime has less logic that
happens at runtime and relies on lowering passes to transform the
compute-only schedule to add communications.

Advantages include
- easier to verify the correctness by dumping a compute+comm schedule
- possible to manually edit the compute+comm schedule if the lowering
  heuristics are insufficient

Functionality included inside the PipelineScheduleRuntime is limited to
- accepting a compute-only schedule and lowering it to add comms
- executing the compute or comm operations specified by the given
  schedule
- handling work.wait() automatically by calling it just before the
  matching compute operation (for RECV ops) or at the end of step (for
  SEND ops)
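
A toy version of such a lowering pass, for a simple linear pipeline, might look like the following. This is a sketch under my own assumptions, not the PR's implementation: the string action format (`F0`, `B1`, ...), the `SEND_*`/`RECV_*` names, and `lower_compute_schedule` are all hypothetical.

```python
from typing import List


def lower_compute_schedule(actions: List[str], stage: int, num_stages: int) -> List[str]:
    """Add comm actions around the compute actions of one pipeline stage.

    `actions` holds strings such as "F0" (forward, microbatch 0) or
    "B1" (backward, microbatch 1). Any other action passes through unchanged.
    """
    is_first = stage == 0
    is_last = stage == num_stages - 1
    lowered: List[str] = []
    for action in actions:
        kind, microbatch = action[0], action[1:]
        if kind == "F":
            # Forward needs activations from the previous stage and
            # sends its output to the next stage.
            if not is_first:
                lowered.append(f"RECV_F{microbatch}")
            lowered.append(action)
            if not is_last:
                lowered.append(f"SEND_F{microbatch}")
        elif kind == "B":
            # Backward needs gradients from the next stage and
            # sends input gradients to the previous stage.
            if not is_last:
                lowered.append(f"RECV_B{microbatch}")
            lowered.append(action)
            if not is_first:
                lowered.append(f"SEND_B{microbatch}")
        else:
            lowered.append(action)
    return lowered


first_stage = lower_compute_schedule(["F0", "B0"], stage=0, num_stages=2)
print(first_stage)  # ['F0', 'SEND_F0', 'RECV_B0', 'B0']
```

Dumping the lowered list per stage is what makes the compute+comm schedule easy to inspect or edit by hand.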

Follow ups for later PRs
- Some refactoring should be done to replace PipelineScheduleMulti with
  this runtime
- Optimizer execution is not considered (e.g. for zero-bubble cases)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130488
Approved by: https://github.com/H-Huang
2024-08-08 00:08:03 +00:00
9282e6ca78 Don't use _disable_current_modes as decorator (#132809)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132809
Approved by: https://github.com/albanD
ghstack dependencies: #132801, #132802, #132804
2024-08-07 23:59:46 +00:00
42226ca3a3 Don't use use_lazy_graph_module as decorator (#132804)
See https://github.com/pytorch/pytorch/pull/132073 for motivation

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132804
Approved by: https://github.com/albanD
ghstack dependencies: #132801, #132802
2024-08-07 23:59:46 +00:00
5e4d8eb831 Don't generate stack entry for DebugContext.wrap (#132802)
See https://github.com/pytorch/pytorch/pull/132073 for motivation

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132802
Approved by: https://github.com/albanD
ghstack dependencies: #132801
2024-08-07 23:59:38 +00:00
708a99e52a Stop using with_fresh_cache_if_config as decorator (#132801)
See https://github.com/pytorch/pytorch/pull/132073 for motivation

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132801
Approved by: https://github.com/albanD
2024-08-07 23:59:32 +00:00
c3e51c09ed [PP] Add get_schedule_class util (#132768)
Add a function to map a string to a class instance for schedules. This allows users to select a schedule based on a string command line argument and removes the need for glue code (e.g. in torchtitan)
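
The general shape of such a string-to-class lookup can be sketched as below. The classes and the function name `get_schedule_class_sketch` are placeholders for illustration; they are not the schedules or the signature added by this PR.

```python
import argparse
from typing import Dict, Type


class ToyScheduleA:
    """Placeholder schedule class (hypothetical)."""


class ToyScheduleB:
    """Placeholder schedule class (hypothetical)."""


_SCHEDULES: Dict[str, Type] = {
    "ToyScheduleA": ToyScheduleA,
    "ToyScheduleB": ToyScheduleB,
}


def get_schedule_class_sketch(name: str) -> Type:
    # Look the class up by name and fail loudly on unknown names.
    try:
        return _SCHEDULES[name]
    except KeyError:
        known = ", ".join(sorted(_SCHEDULES))
        raise ValueError(f"Unknown schedule {name!r}; expected one of: {known}") from None


# A command line argument selects the schedule, with no per-project glue code.
parser = argparse.ArgumentParser()
parser.add_argument("--schedule", default="ToyScheduleA")
args = parser.parse_args(["--schedule", "ToyScheduleB"])
schedule_cls = get_schedule_class_sketch(args.schedule)
print(schedule_cls.__name__)  # ToyScheduleB
```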

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132768
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
2024-08-07 23:51:03 +00:00
383f2ac914 AutoHeuristic: mixed_mm H100 heuristic (#132685)
H100 heuristic for mixed_mm. Performance looks similar to A100 heuristic.
```
  set     crit  max_depth  min_samples_leaf  correct  wrong  unsure  total  wrong_max_spdup  wrong_gman_spdup  max_spdup_default  gman_spdup_default  max_slowdown_default  non_default_preds  default_better
train  entropy          5              0.01     1562    604     145   2311         1.522201          1.077722          10.399141            3.134170              1.034802               2061               2
 test  entropy          5              0.01      361    164      24    549         1.443590          1.079169           8.159173            3.105360              1.197973                500               2
```

gpt-fast speedups
|batch size|prompt length| fallback    |  heuristic  | speedup |
|----------|-------------|------------:|------------:|--------:|
|     1    |      7      |      109.95  |       220.63|  2      |
|     1    |     11      |      109.65  |       210.92|  1.92   |
|     4    |      7      |       149.04 |       625.80|  4.19   |
|     4    |     11      |       149.56 |       494.64|  3.30   |
|     8    |      7      |       293.68 |       956.72|  3.25   |
|     8    |     11      |       294.48 |       925.60|  3.14   |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132685
Approved by: https://github.com/eellison
2024-08-07 23:48:01 +00:00
c327710a87 [export] Publicize validate function (#132777)
as titled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132777
Approved by: https://github.com/zhxchen17
2024-08-07 23:10:05 +00:00
21d4c48059 Allow distributed breakpoint to skip the first few calls (#129511)
Summary:
PDB supports conditional breakpoints, but that ability does not work in a distributed environment. We can still get a conditional breakpoint by doing the following:

```
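import torch.distributed as dist

counter = 0

def maybe_break():
    # Count calls; only break after the first 100 have passed.
    global counter
    counter += 1
    if counter > 100:
        dist.breakpoint()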
```

This PR makes dist.breakpoint() support this feature as syntactic sugar.
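
The skip-counting idea, independent of how it is exposed on `dist.breakpoint()`, can be sketched as a small wrapper. `SkippingBreakpoint` and its `action` callback are hypothetical names used only for this illustration.

```python
from typing import Callable


class SkippingBreakpoint:
    """Call `action` only after the first `skip` calls have been ignored."""

    def __init__(self, skip: int, action: Callable[[], None]) -> None:
        self.skip = skip
        self.calls = 0
        self.action = action

    def __call__(self) -> bool:
        self.calls += 1
        if self.calls <= self.skip:
            return False  # still within the skipped calls
        self.action()
        return True


hits = []
bp = SkippingBreakpoint(skip=3, action=lambda: hits.append("break"))
results = [bp() for _ in range(5)]
print(results)  # [False, False, False, True, True]
```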

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129511
Approved by: https://github.com/wconstab, https://github.com/c-p-i-o
2024-08-07 21:57:37 +00:00
acad2050c1 [easy][dynamo] Add tx as an arg in getitem_const (#132899)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132899
Approved by: https://github.com/yanboliang
ghstack dependencies: #132806
2024-08-07 21:35:41 +00:00
700a11fdd4 Make inductor kernel metadata comments more descriptive (#126698)
Summary:

A couple of improvements to the generated comments in inductor kernels:

1. Makes the nodes in the comment topologically sorted, I think having them
   alphabetically sorted is a gotcha. I was always confused on why the
   sorting in the comments did not match the code.
2. Adds a printout of the aten graph fragment corresponding to the
   current inductor kernel, to make it easier to map from aten
   code to inductor code

Example float8-overhead-related inductor kernel comment after this PR:

```
# kernel path: /tmp/torchinductor_vasiliy/27/c27ts3rdw56ns7od5j6ovdnhxphished2lcu3adclzzixoo7khg5.py
# Source Nodes: [weight_fp8], Original ATen: [aten.mul, aten.clamp, aten._to_copy]
# Source node to ATen node mapping:
#   weight_fp8 => clamp_max_1, clamp_min_3, convert_element_type_10, convert_element_type_11, convert_element_type_9, mul_3
# Graph fragment:
#   %mul_3 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%primals_2, %convert_element_type_8), kwargs = {})
#   %convert_element_type_9 : [num_users=1] = call_function[target=torch.ops.prims.convert_element_type.default](args = (%mul_3, torch.float32), kwargs = {})
#   %clamp_min_3 : [num_users=1] = call_function[target=torch.ops.aten.clamp_min.default](args = (%convert_element_type_9, -448.0), kwargs = {})
#   %clamp_max_1 : [num_users=1] = call_function[target=torch.ops.aten.clamp_max.default](args = (%clamp_min_3, 448.0), kwargs = {})
#   %convert_element_type_10 : [num_users=1] = call_function[target=torch.ops.prims.convert_element_type.default](args = (%clamp_max_1, torch.bfloat16), kwargs = {})
#   %convert_element_type_11 : [num_users=1] = call_function[target=torch.ops.prims.convert_element_type.default](args = (%convert_element_type_10, torch.float8_e4m3fn), kwargs = {})
triton_poi_fused__to_copy_clamp_mul_5 = async_compile.triton('triton_', '''
```
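
The gotcha with alphabetical ordering is visible in the mapping line above: `convert_element_type_10` sorts before `convert_element_type_9` even though it runs later. A small sketch of the two orderings, using the node names from the example (the variable names are mine):

```python
# Node names in the order they appear in the graph fragment above.
graph_order = [
    "mul_3",
    "convert_element_type_9",
    "clamp_min_3",
    "clamp_max_1",
    "convert_element_type_10",
    "convert_element_type_11",
]

# Alphabetical order compares strings, so "..._10" comes before "..._9".
alphabetical = sorted(graph_order)

# Sorting by position in the graph recovers the execution order.
position = {name: index for index, name in enumerate(graph_order)}
topological = sorted(alphabetical, key=lambda name: position[name])

print(alphabetical)
print(topological)
```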

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126698
Approved by: https://github.com/ezyang
ghstack dependencies: #126573
2024-08-07 21:25:09 +00:00
48f7bdbbe1 aot_autograd: copy metadata from fw to bw nodes (#126573)
Summary:

Uses the `seq_nr` field (introduced to aot_autograd nodes in
https://github.com/pytorch/pytorch/pull/103129) to map the aot_autograd
fx bw nodes to the corresponding fw nodes, and copy the metadata over.

I am trusting the `seq_nr` mapping in the linked PR here. I did
some validation with a toy LLaMa 3 8b training run and the mapping seemed
correct.

I am also trusting that the forward is single threaded, since `seq_nr` is thread local.  If this isn't always true, we'll need to also plumb `thread_id` through the same machinery which is populating `seq_nr`.
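
The mapping idea can be sketched with plain dicts standing in for fx nodes. This is an illustration only: the dict layout, the `module_path` key, and `copy_fw_metadata_to_bw` are hypothetical, and the real code operates on fx node metadata.

```python
from typing import Any, Dict, Iterable, List


def copy_fw_metadata_to_bw(
    fw_nodes: List[Dict[str, Any]],
    bw_nodes: List[Dict[str, Any]],
    keys: Iterable[str] = ("module_path",),
) -> None:
    # Index forward nodes by seq_nr, keeping the first node seen for each value.
    fw_by_seq: Dict[int, Dict[str, Any]] = {}
    for node in fw_nodes:
        fw_by_seq.setdefault(node["seq_nr"], node)

    # Copy the selected metadata onto backward nodes with a matching seq_nr.
    for node in bw_nodes:
        fw_node = fw_by_seq.get(node["seq_nr"])
        if fw_node is None:
            continue  # no forward counterpart, leave the node untouched
        for key in keys:
            if key in fw_node:
                node.setdefault(key, fw_node[key])


fw = [
    {"name": "mm", "seq_nr": 7, "module_path": "layers.0.attn"},
    {"name": "relu", "seq_nr": 8, "module_path": "layers.0.mlp"},
]
bw = [
    {"name": "mm_backward", "seq_nr": 7},
    {"name": "threshold_backward", "seq_nr": 8},
    {"name": "sum_1", "seq_nr": 99},
]
copy_fw_metadata_to_bw(fw, bw)
print(bw)
```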

I'd like to use this data in a future PR to make inductor kernels easily
attributable to the nn.Module path in modeling land, to make it easier
to do performance debugging.

Test Plan:

```
// 1. unit test
python test/dynamo/test_aot_autograd.py -k test_aot_sequence_nr

// 2. manual test
// run LLaMa 3 8B fw + bw with torch.compile, print out the inductor graphs
// seen in `torch/_inductor/utils.py::get_kernel_metadata`, they seemed
// right to me.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126573
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2024-08-07 21:25:09 +00:00
260e7cb143 Make CUDA device properties's __repr__ output actually printable (#132863)
Previously we would write the UUID bytes directly, leading to 'invalid
UTF-8 sequence' errors.
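
A Python illustration of the problem and of one printable alternative follows. It only demonstrates why raw UUID bytes are not safe to print as text; the example bytes are made up, and the formatting used by the actual fix is not shown in this message.

```python
import uuid

# 16 arbitrary bytes standing in for a device UUID; 0xFF never occurs in valid UTF-8.
raw = bytes([0xFF, 0xFE, 0xFD, 0xFC]) + bytes(range(12))

try:
    raw.decode("utf-8")
    decodable = True
except UnicodeDecodeError:
    decodable = False

# Hex formatting is always printable, whatever the byte values are.
printable = str(uuid.UUID(bytes=raw))

print(decodable)  # False
print(printable)  # fffefdfc-0001-0203-0405-060708090a0b
```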
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132863
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-08-07 21:08:43 +00:00
525fdc0f95 [docs] fix incorrect example in convert_conv3d_weight_memory_format (#129318)
The current example fails when using `torch.channels_last`, and the docs are slightly incorrect for the 3d case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129318
Approved by: https://github.com/albanD
2024-08-07 20:06:59 +00:00
6a348e5e57 [CUDAGraph] Warn once if too many distinct sizes (#132832)
Warn once if there are too many distinct sizes for cudagraph, so we can avoid spamming logs.
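
A warn-once guard of this kind can be sketched as follows. `DistinctSizeTracker`, its `limit` parameter, and the warning text are hypothetical and only illustrate the pattern, not the cudagraph code.

```python
import warnings
from typing import Set, Tuple


class DistinctSizeTracker:
    """Track distinct sizes and warn a single time once a limit is exceeded."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.sizes: Set[Tuple[int, ...]] = set()
        self._warned = False

    def record(self, size: Tuple[int, ...]) -> None:
        self.sizes.add(size)
        if len(self.sizes) > self.limit and not self._warned:
            self._warned = True  # later calls stay silent
            warnings.warn(f"Recorded more than {self.limit} distinct sizes")


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    tracker = DistinctSizeTracker(limit=2)
    for n in range(10):
        tracker.record((n,))

print(len(caught))  # 1
```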

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132832
Approved by: https://github.com/eellison
2024-08-07 19:48:06 +00:00
e76bd0b603 [BE] put "show_dispatch_trace()" print logic in .cpp file (#132717)
I find myself occasionally trying to modify this to get additional debug info. Recompiling takes forever after modifying these lines, because the .h file is depended on by a huge number of files.

If we move this logic into a helper function and put it in the .cpp file, recompilation will be a lot faster when adding debug here.

Tested with a local DEBUG=1 build (which is needed to use `TORCH_SHOW_DISPATCH_TRACE=1`) and verified basic sanity - i.e. it still prints `[call]`, etc.

Differential Revision: [D60804331](https://our.internmc.facebook.com/intern/diff/D60804331)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132717
Approved by: https://github.com/soulitzer, https://github.com/bdhirsh
2024-08-07 19:43:29 +00:00
7830373662 Update owner for BC test (#132891)
Add @larryliu0820 to `/test/forward_backward_compatibility/check_forward_backward_compatibility.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132891
Approved by: https://github.com/albanD
2024-08-07 19:42:04 +00:00
59bbaea3a7 [inductor] disable capture_pre_autograd_graph related UTs on Windows (#132848)
Continuation of https://github.com/pytorch/pytorch/pull/132841

We disable the `capture_pre_autograd_graph`-related UTs on Windows.
Disable the `test_lstm_packed_change_input_sizes` and `test_multihead_attention` UTs on Windows.

**TODO:**
Turn them back on after the `capture_pre_autograd_graph` issue on Windows is fixed.

## Local Test:
They are not skipped on Linux:
<img width="1387" alt="image" src="https://github.com/user-attachments/assets/28dfbb4b-d9c0-4d5b-be84-d7b3697bcd3f">

And they are skipped on Windows:
<img width="853" alt="image" src="https://github.com/user-attachments/assets/e96ebcf8-9bf3-43aa-93fd-fb33d3743573">

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132848
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-08-07 19:38:03 +00:00
7ea8374c0e nn.ModuleList.__getitem__ overloads (#132834)
Overloads so that you can get more specific type info based on how you are indexing.

```python
from torch import nn

module_list = nn.ModuleList(32 * [nn.Linear(2, 2)])

# before:
reveal_type(module_list[0])  # Type of "module_list[0]" is "Module | ModuleList"
reveal_type(module_list[:1])  # Type of "module_list[: 1]" is "Module | ModuleList"

# now:
reveal_type(module_list[0])  # Type of "module_list[0]" is "Module"
reveal_type(module_list[:1])  # Type of "module_list[: 1]" is "ModuleList"
```
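
The underlying typing pattern is `typing.overload` on `__getitem__`. A minimal standalone sketch, with `Item` and `ItemList` as hypothetical stand-ins for `Module` and `ModuleList`:

```python
from typing import Iterable, List, Union, overload


class Item:
    """Stand-in for nn.Module (hypothetical)."""


class ItemList:
    """Stand-in for nn.ModuleList (hypothetical)."""

    def __init__(self, items: Iterable[Item]) -> None:
        self._items: List[Item] = list(items)

    @overload
    def __getitem__(self, idx: int) -> Item: ...

    @overload
    def __getitem__(self, idx: slice) -> "ItemList": ...

    def __getitem__(self, idx: Union[int, slice]) -> Union[Item, "ItemList"]:
        # An int index yields a single item, a slice yields a new list.
        if isinstance(idx, slice):
            return ItemList(self._items[idx])
        return self._items[idx]


items = ItemList([Item(), Item(), Item()])
single = items[0]    # a type checker infers Item
subset = items[:2]   # a type checker infers ItemList
print(type(single).__name__, type(subset).__name__)  # Item ItemList
```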
Co-authored-by: Skylion007 <Skylion007@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132834
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-08-07 19:25:23 +00:00
83fa7f871f Work around item non-sync issue on AMD (#132772)
Differential Revision: D59669714

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132772
Approved by: https://github.com/ZhengkaiZ, https://github.com/izaitsevfb
2024-08-07 18:58:11 +00:00
ff81ca8e0c Revert "Populate submodules of torch._C to sys.modules recursively (#132216)"
This reverts commit 672ce4610e41386da9763e07375b0879dc351905.

Reverted https://github.com/pytorch/pytorch/pull/132216 on behalf of https://github.com/PaliC due to was breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/132216#issuecomment-2274112397))
2024-08-07 18:45:00 +00:00
4fe6a5dc34 Move slow tests to be in repo (#132379)
Move the slow test json to be in the pytorch/pytorch repo and make a job that will update it weekly.  The job uses the same environment as the commit hash.  It uses similar code to the hash updates, but the hash update contains a lot of code that is specific to the hash update, so I chose to pick out the parts that are relevant

Remove references to the old file and set up testing to read from the new file instead

The old update cadence was every day, the new one is every week

The auto slow test infra + the lack of pinning between pytorch and test-infra makes it really hard to tell if a test started failing because of a change or because of the slow test json changing.  While this can have benefits, like disable test issues being effective everywhere immediately, it can also be very confusing, especially since we don't have the same insight into slow tests like we do for disable issues.

Example PR made: https://github.com/pytorch/pytorch/pull/132383 (with all the changes from this PR because it was working on top of this)

We should just get rid of this at some point in favor of the slowTest decorator, but there are some tests that take 5+ minutes to run and I don't want to track them down right now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132379
Approved by: https://github.com/huydhn
2024-08-07 18:42:56 +00:00
26b0011fb8 [XPU][Kineto Submodule] Introduce kineto-based XPU profiler (#130811)
As XPU has become a PyTorch built-in device, profiler support is an indispensable part of functional completeness. This PR is associated with the PR that introduces the XPU profiler plugin into kineto. When USE_XPU is enabled, the LIBKINETO_NOXPUPTI option is suppressed accordingly, which allows kineto to build with the XPU profiler plugin.

Associated PR to introduce kineto-based XPU profiler into kineto:
https://github.com/pytorch/kineto/pull/961

Also updates the Kineto Submodule to include XPU changes.

Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130811
Approved by: https://github.com/aaronenyeshi
2024-08-07 18:41:37 +00:00
07551887b8 Revert "Disable SymDispatchMode when torch.compile'ing (#132433)"
This reverts commit 63eb06c0512b636a34caf041eab6fbc0726fc7ee.

Reverted https://github.com/pytorch/pytorch/pull/132433 on behalf of https://github.com/PaliC due to We need to now revert https://github.com/pytorch/pytorch/pull/132216 in OSS and there is a dependency on this pr ([comment](https://github.com/pytorch/pytorch/pull/132433#issuecomment-2274105080))
2024-08-07 18:41:28 +00:00
ca713b8393 llvm update for backward-breaking APIs in 18 and 19 (#132825)
Related to #130661, #129797.  Based on the LLVM tagged releases, these LLVM_VERSION_MAJOR guards are accurate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132825
Approved by: https://github.com/dcci, https://github.com/Skylion007
2024-08-07 18:31:40 +00:00
a9ff190867 Revert "Consolidate SymDispatchMode into ProxyTensorMode (#132674)"
This reverts commit ffdf48e63b94930c81f05b06444721109d0b243d.

Reverted https://github.com/pytorch/pytorch/pull/132674 on behalf of https://github.com/PaliC due to We need to now revert https://github.com/pytorch/pytorch/pull/132216 in OSS and there is a dependency on this pr ([comment](https://github.com/pytorch/pytorch/pull/132674#issuecomment-2274062785))
2024-08-07 18:25:33 +00:00
9d476fee53 Revert "[BE] Simplify code interacting with get_proxy_mode/enable_tracing (#132675)"
This reverts commit c2bccfd4311fe905ff78c0977281b8e642bb10d6.

Reverted https://github.com/pytorch/pytorch/pull/132675 on behalf of https://github.com/PaliC due to We need to now revert https://github.com/pytorch/pytorch/pull/132216 in OSS and there is a dependency on this pr ([comment](https://github.com/pytorch/pytorch/pull/132674#issuecomment-2274062785))
2024-08-07 18:25:33 +00:00
f2ad3c89b0 fix dtype mismatch in lobpcg eigen solver (#132762)
Fixes #132761

If the rerr value is complex (`is_complex`), test against the real part. Since the rerr variable holds a norm calculation, the imaginary part will be 0.0.
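
A plain-Python analog of the issue, using the built-in `complex` type rather than torch tensors (so this is only an analogy, with made-up values): a norm stored in a complex type cannot be ordered against a real tolerance, but its real part can.

```python
# A norm is real-valued, but here it is stored in a complex type.
rerr = complex(abs(3 + 4j))  # (5+0j)
tol = 10.0

try:
    rerr < tol  # ordering comparisons are not defined for complex numbers
    comparable = True
except TypeError:
    comparable = False

# Comparing the real part is safe because the imaginary part of a norm is 0.0.
converged = rerr.real < tol

print(comparable, rerr.imag, converged)  # False 0.0 True
```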

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132762
Approved by: https://github.com/albanD
2024-08-07 18:20:46 +00:00
1749025081 Revert "Fix infinite recursion while walking to submodules (#132763)"
This reverts commit 063a45ed27c3001bba44ea2161d188ec2314d428.

Reverted https://github.com/pytorch/pytorch/pull/132763 on behalf of https://github.com/PaliC due to We need to now revert https://github.com/pytorch/pytorch/pull/132216 in OSS and there is a dependency on this pr ([comment](https://github.com/pytorch/pytorch/pull/132763#issuecomment-2274059792))
2024-08-07 18:20:27 +00:00
25df063f04 [dynamo][user_defined][stable-diffusion] Raise ObservedAttributeError on UserDefinedObject var_getattr (#132806)
Fixes https://github.com/pytorch/pytorch/issues/132551

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132806
Approved by: https://github.com/williamwen42
2024-08-07 18:19:49 +00:00
40ce0a53bb [FSDP][dtensor] add FSDP2+TP distributed state dict test (#131408)
**Test**
`pytest test/distributed/_composable/fsdp/test_fully_shard_training.py`
`pytest test/distributed/_composable/fsdp/test_fully_shard_state_dict.py`
`pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py`
`pytest test/distributed/_composable/fsdp/test_fully_shard_init.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131408
Approved by: https://github.com/fegin
ghstack dependencies: #126697, #130239, #132391
2024-08-07 18:17:12 +00:00
ad0ce89050 [3/N][dtensor] Strided Sharding offset calculation util (#132391)
**Summary**
1. change `compute_local_shape_and_global_offset` to correctly compute shape and offset for strided sharding placement (currently it only handles 2D and some 3D+ sharding).
2. Add a new property `num_shards_map` to `DTensorSpec` denoting how many shards each tensor dimension has. This is necessary for constructing `_StridedShard` placement when we call `distribute_tensor(dtensor_tp, dp_device_mesh, [Shard(0)])` and the `split_factor` argument will just be the number of shards on that sharding tensor dim.
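
How a per-dimension shard count can be derived from placements is sketched below. The representation is simplified and hypothetical: each mesh dimension is described by the tensor dim it shards (or `None` for replicate), and `num_shards_map_sketch` is not the `DTensorSpec` property itself.

```python
from typing import List, Optional


def num_shards_map_sketch(
    mesh_shape: List[int],
    shard_dims: List[Optional[int]],
    tensor_ndim: int,
) -> List[int]:
    """Return, for each tensor dim, how many shards it is split into.

    `shard_dims[i]` is the tensor dim sharded over mesh dim `i`,
    or None if the tensor is replicated over that mesh dim.
    """
    counts = [1] * tensor_ndim
    for mesh_size, dim in zip(mesh_shape, shard_dims):
        if dim is not None:
            counts[dim] *= mesh_size
    return counts


# A 2D tensor sharded on dim 0 over a TP mesh dimension of size 4:
tp_only = num_shards_map_sketch([4], [0], tensor_ndim=2)
print(tp_only)  # [4, 1]
```

In this toy setup, sharding dim 0 again over a data-parallel mesh would use `tp_only[0]`, i.e. 4, as the split factor.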

**Test**
`test/distributed/_tensor/test_utils.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132391
Approved by: https://github.com/wanchaol
ghstack dependencies: #126697, #130239
2024-08-07 18:17:12 +00:00
0b0c660c02 [2/N][dtensor] Strided Sharding shard_to_replicate (#130239)
**Summary**
This PR adds the necessary util function to `_StridedShard` for correct shard-to-replicate resharding.

**Test**
`pytest test/distributed/_tensor/test_utils.py -s -k strided_sharding`
`pytest test/distributed/_tensor/test_utils.py -s -k test_fsdp2_tp_2d_dtensor_local_shards_and_offsets`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130239
Approved by: https://github.com/wanchaol
ghstack dependencies: #126697
2024-08-07 18:17:06 +00:00
92a17f454a [1/N][dtensor] introduce StridedShard placement type and _split_tensor() logic (#126697)
**Summary**
This PR adds a new private placement type `_StridedShard` for FSDP2 + TP style tensor sharding. The previously used `Shard` placement type cannot produce a correct `full_tensor()` result because it assumes the tensor is sharded first over the `dp` mesh dimension and then over the `tp` mesh dimension, which does not hold in the FSDP2 + TP case.
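
The ordering problem can be seen in a toy model that uses plain Python lists instead of tensors. This is only a sketch under the assumption of a 2x2 `(dp, tp)` mesh and 8 rows sharded on dim 0; `chunk` and the `local` dictionary are illustrative names and not part of DTensor.

```python
rows = list(range(8))
DP, TP = 2, 2


def chunk(seq, n):
    """Split seq into n equal contiguous pieces."""
    size = len(seq) // n
    return [seq[i * size:(i + 1) * size] for i in range(n)]


# FSDP2 + TP: TP shards the rows first, then FSDP shards each TP shard.
tp_chunks = chunk(rows, TP)
local = {
    (dp, tp): chunk(tp_chunks[tp], DP)[dp]
    for dp in range(DP)
    for tp in range(TP)
}

# Reassembly that assumes "dp first, then tp" (what plain Shard implies).
naive = [r for dp in range(DP) for tp in range(TP) for r in local[(dp, tp)]]
# Reassembly that respects the actual order (tp first, then dp).
strided = [r for tp in range(TP) for dp in range(DP) for r in local[(dp, tp)]]

print(naive)    # [0, 1, 4, 5, 2, 3, 6, 7]
print(strided)  # [0, 1, 2, 3, 4, 5, 6, 7]
```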

**Test**
`pytest test/distributed/_tensor/test_utils.py -s -k strided_sharding`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126697
Approved by: https://github.com/wanchaol
2024-08-07 18:17:02 +00:00
123d9ec5bf Revert "Loads .pyd instead of .so in MemPool test for windows (#132749)"
This reverts commit 37ab0f33854fafdf9bf4f575260329ffcd960d13.

Reverted https://github.com/pytorch/pytorch/pull/132749 on behalf of https://github.com/syed-ahmed due to Seems like periodic is still failing: 7c79e89bc5 ([comment](https://github.com/pytorch/pytorch/pull/132749#issuecomment-2274041302))
2024-08-07 18:08:44 +00:00
a62710c820 [FSDP2] Relaxed overlap test to address CI flakiness (#132869)
This tries to fix https://github.com/pytorch/pytorch/issues/120961.

This is a similar situation as https://github.com/pytorch/pytorch/pull/132116. The overlap tests were written strictly based on a precise calculation of what compute/communication should be non-overlapped vs. overlapped. This is done via `torch.cuda._sleep()`, which takes inputs in cycles, so we must convert from milliseconds to cycles via `get_cycles_per_ms()`, which is computed once and cached. Variation in CI can cause this `get_cycles_per_ms()` value to be inaccurate when the FSDP overlap tests run. Thus, we decide to relax the overlap tests to just make sure the overlapped runs are faster than a baseline without overlap.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132869
Approved by: https://github.com/weifengpy
2024-08-07 17:37:03 +00:00
cyy
32a284c275 [9/N] Fix clang-tidy warnings in aten/src/ATen (#132842)
Follows #132728

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132842
Approved by: https://github.com/Skylion007
2024-08-07 16:54:21 +00:00
ffd0d92c18 fix autotuning init issues (#132837)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132837
Approved by: https://github.com/yanboliang
2024-08-07 16:36:47 +00:00
8b50d5398f [DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709)
More context in [#132471](https://github.com/pytorch/pytorch/issues/132471) and https://github.com/pytorch/pytorch/issues/132366.

TLDR:
When cuda is available and users move tensors to cuda, we cannot really reuse the default pg if the default pg is gloo, as lots of collectives are not supported on gloo for cuda tensors. For example, `dtensor.full_tensor()` would result in a mysterious SIGTERM when all_gather is called on a cuda tensor using gloo. Without the change in this PR, users would have to know the context and explicitly move the cuda tensor to cpu before invoking most collectives, which I think is not ideal UX.

Therefore, given most collectives are not supported on gloo for cuda tensors, we should init a new pg if the default pg is gloo when torch.cuda.is_available() and device_type is cuda.
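
The condition described above can be written down as a small predicate. `should_create_new_group` is a hypothetical helper used only to spell out the rule from this description; the actual DeviceMesh code may be structured differently.

```python
def should_create_new_group(default_backend: str, cuda_available: bool, device_type: str) -> bool:
    """Return True when the default gloo group should not be reused.

    Mirrors the rule in the description: a new process group is needed only
    when the default backend is gloo, CUDA is available, and the mesh lives
    on CUDA devices.
    """
    return default_backend == "gloo" and cuda_available and device_type == "cuda"


print(should_create_new_group("gloo", True, "cuda"))  # True
print(should_create_new_group("gloo", True, "cpu"))   # False
print(should_create_new_group("nccl", True, "cuda"))  # False
```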

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132709
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-08-07 16:13:11 +00:00
258f47fc0b Add padding_side to pad_sequence with "left" and "right" options ("right" as default) (#131884)
Fixes #10536

Reattempt of #61467. Thank you so much to @mskoh52 for your excellent work!

As I was trying to create a more efficient LLM data collator, I realized that `pad_sequence` only supports right padding, even though left padding is a very common format for LLMs, like Llama and Mistral.

The proposed alternative implementation was to use multiple flips, which tends to be 1.5x-2x slower. Instead we can add a [`padding_side` parameter as there is for Hugging Face tokenizers](9d6c0641c4/src/transformers/tokenization_utils_base.py (L1565)), which requires only a very small change in the C++ code.
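
To make the semantics concrete, here is a minimal pure-Python sketch of left padding done directly and done via two flips around right padding. It uses nested lists rather than tensors, and the helper names are illustrative assumptions, not the C++ implementation touched by this PR.

```python
def pad_right(sequences, padding_value=0):
    """Pad every sequence on the right up to the longest length."""
    max_len = max(len(s) for s in sequences)
    return [list(s) + [padding_value] * (max_len - len(s)) for s in sequences]


def pad_left(sequences, padding_value=0):
    """Pad every sequence on the left up to the longest length."""
    max_len = max(len(s) for s in sequences)
    return [[padding_value] * (max_len - len(s)) + list(s) for s in sequences]


def pad_left_with_flips(sequences, padding_value=0):
    """Left padding emulated as flip, right-pad, flip back."""
    flipped = [list(reversed(s)) for s in sequences]
    return [list(reversed(row)) for row in pad_right(flipped, padding_value)]


batch = [[1, 2, 3], [4, 5], [6]]
print(pad_left(batch))             # [[1, 2, 3], [0, 4, 5], [0, 0, 6]]
print(pad_left_with_flips(batch))  # [[1, 2, 3], [0, 4, 5], [0, 0, 6]]
```

Both routes give the same result; the flip-based route simply does two extra passes over the data.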

Here are the benchmarks of the new implementation!

`float32`:

![eaaa95ef-9384-45d2-be56-6898bc1d3514](https://github.com/user-attachments/assets/3b0eb309-e5a0-4a4d-97bb-4e3298783dbb)

`bool`:

![892f32da-8d9a-492b-9507-18d3f0a41e8e](https://github.com/user-attachments/assets/6824ea15-7d4e-4b89-95f0-8546635f0c2e)

Code:

```python
from __future__ import annotations

import random
import time
from typing import Literal

import numpy as np
import torch

def pad_sequence_with_flips(
    sequences: list[torch.Tensor],
    batch_first: bool = False,
    padding_value: int | float | bool = 0.0,
    padding_side: Literal["left", "right"] | str = "left",
) -> torch.Tensor:
    if padding_side == 'right':
        padded_sequence = torch._C._nn.pad_sequence([t.flatten() for t in sequences], batch_first=batch_first, padding_value=padding_value)
    elif padding_side=='left':
        padded_sequence = torch._C._nn.pad_sequence([t.flatten().flip(0) for t in sequences], batch_first=batch_first, padding_value=padding_value)  # pyright: ignore[reportArgumentType]
        padded_sequence = padded_sequence.flip(int(batch_first))
    else:
        raise ValueError(f"padding_side should be either 'right' or 'left', but got {padding_side}")

    return padded_sequence

sequence_lengths: list[int] = []

flip_left_pad_times: list[float] = []
flip_left_pad_times_std: list[float] = []

left_pad_times: list[float] = []
left_pad_times_std: list[float] = []

RUNS_PER_LOOP: int = 100

for i in range(1, 7):
    sequence_length = i * int(1e6) // 6
    sequence_lengths.append(sequence_length)

    sequences = [torch.randint(0, 2, (random.randint(1, sequence_length),), dtype=torch.bool) for _ in range(64)]

    inner_left_pad_times: list[float] = []
    inner_right_pad_times: list[float] = []

    inner_flip_left_pad_times: list[float] = []
    inner_flip_right_pad_times: list[float] = []

    for _ in range(RUNS_PER_LOOP):

        start = time.perf_counter()
        torch._C._nn.pad_sequence(sequences, batch_first=True, padding_value=False, padding_side="left")
        end = time.perf_counter()
        inner_left_pad_times.append(end - start)

        start = time.perf_counter()
        pad_sequence_with_flips(sequences, batch_first=True, padding_value=False, padding_side="left")
        end = time.perf_counter()
        inner_flip_left_pad_times.append(end - start)

    left_pad_times.append(sum(inner_left_pad_times) / len(inner_left_pad_times))
    left_pad_times_std.append(np.std(inner_left_pad_times))

    flip_left_pad_times.append(sum(inner_flip_left_pad_times) / len(inner_flip_left_pad_times))
    flip_left_pad_times_std.append(np.std(inner_flip_left_pad_times))

    print(f"Sequence Length: {sequence_length}, Left Pad Time: {left_pad_times[-1]}, Left with Flips Pad Time: {flip_left_pad_times[-1]}")

import matplotlib.pyplot as plt

plt.plot(sequence_lengths, left_pad_times, label="new pad_sequence left")
plt.scatter(sequence_lengths, left_pad_times)
plt.errorbar(sequence_lengths, left_pad_times, yerr=left_pad_times_std, linestyle='None', marker='^')

plt.plot(sequence_lengths, flip_left_pad_times, label="old pad_sequence left (2 flips)")
plt.scatter(sequence_lengths, flip_left_pad_times)
plt.errorbar(sequence_lengths, flip_left_pad_times, yerr=flip_left_pad_times_std, linestyle='None', marker='^')

plt.xlabel("Sequence Length")
plt.ylabel("Time (s)")
plt.legend(loc="upper right")

# Sequence Length: 166666, Left Pad Time: 0.06147645162009212, Left with Flips Pad Time: 0.09842291727001794
# Sequence Length: 333333, Left Pad Time: 0.08933195920990329, Left with Flips Pad Time: 0.15597836187991562
# Sequence Length: 500000, Left Pad Time: 0.08863158334006585, Left with Flips Pad Time: 0.15224887342999863
# Sequence Length: 666666, Left Pad Time: 0.10524682551997103, Left with Flips Pad Time: 0.18177212480995877
# Sequence Length: 833333, Left Pad Time: 0.11801802741003485, Left with Flips Pad Time: 0.20821274195001024
# Sequence Length: 1000000, Left Pad Time: 0.131894061660023, Left with Flips Pad Time: 0.23223503091008751
```

Co-authored-by: mskoh52 <mskoh52@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131884
Approved by: https://github.com/ezyang
2024-08-07 15:53:07 +00:00
780310fed7 Revert "Only thunkify proxies in some situations (#132421)"
This reverts commit bb99008c9e7c357b88047bcd6971dc2078341484.

Reverted https://github.com/pytorch/pytorch/pull/132421 on behalf of https://github.com/clee2000 due to I think this broke dynamo/test_subclasses.py::TestNestedTensor::test_in_graph_construction_from_input [GH job link](https://github.com/pytorch/pytorch/actions/runs/10283744685/job/28459340678) [HUD commit link](bb99008c9e).  Test got added in f50621989b which is before your merge base ([comment](https://github.com/pytorch/pytorch/pull/132421#issuecomment-2273742960))
2024-08-07 15:29:54 +00:00
de9b8a42c1 Revert "Add support for other backends in get_preferred_device (#132118)"
This reverts commit c184ac0f6b6d2482cf300d852fde6370a1c1e086.

Reverted https://github.com/pytorch/pytorch/pull/132118 on behalf of https://github.com/clee2000 due to I think this broke distributed/checkpoint/test_file_system_checkpoint_cpu.py::TestDistributedReshardOnLoad::test_load_rowwise_to_colwise_thread_count_1 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10279901233/job/28456599072) [HUD commit link](c184ac0f6b).  Dr CI classification is wrong, the failure is not flaky ([comment](https://github.com/pytorch/pytorch/pull/132118#issuecomment-2273729288))
2024-08-07 15:22:42 +00:00
cyy
13fa59580e Enable clang-tidy on aten/src/ATen/cpu (#132830)
Expands code coverage of clang-tidy to aten/src/ATen/cpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132830
Approved by: https://github.com/Skylion007
2024-08-07 14:44:17 +00:00
ed97fb77f9 Conversions between strided and jagged layouts for Nested Tensors (#115749)
This PR does 3 things:
1. Adds a copy-free strided->jagged layout conversion for NT
2. Adds a copy-free jagged->strided layout conversion for NT
3. Modifies and expands the .to() API to support the layout argument for the specific case of NT layout conversion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115749
Approved by: https://github.com/jbschlosser
2024-08-07 14:18:53 +00:00
fb146fc3c6 Only store necessary tensor_dict fields in node meta (#132805)
Fixes #132290

This PR attempts a more invasive / complete solution than the one from #132338, which removes immediate tensor fields from the `tensor_dict` copy stored in node meta. The approach taken here is to store only those fields of the `tensor_dict` which are absolutely utilized somewhere else.

So far, this appears to be limited to:
* `_dynamo_static_input_type`
* `tag` (at least in the tests). Discussion at #94080 appears to indicate this is depended on for export

(CI may point out more)
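
The allowlist approach can be sketched as a plain dictionary filter. `KEPT_FIELDS` and `filter_tensor_dict` are hypothetical names for illustration; only the two field names come from the description above.

```python
# Fields the description says are still needed downstream.
KEPT_FIELDS = ("_dynamo_static_input_type", "tag")


def filter_tensor_dict(tensor_dict):
    """Return a copy holding only the allowlisted fields."""
    return {k: v for k, v in tensor_dict.items() if k in KEPT_FIELDS}


example = {"tag": "export", "_dynamo_static_input_type": "guarded", "cached_tensor": object()}
print(sorted(filter_tensor_dict(example)))  # ['_dynamo_static_input_type', 'tag']
```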

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132805
Approved by: https://github.com/mlazos
2024-08-07 13:35:16 +00:00
7c79e89bc5 Stop using clear_frame as decorator (#132778)
See https://github.com/pytorch/pytorch/pull/132073 for motivation

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132778
Approved by: https://github.com/albanD
ghstack dependencies: #132774
2024-08-07 11:53:18 +00:00
bb99008c9e Only thunkify proxies in some situations (#132421)
The goal of this PR is to avoid stack overflow when we create extremely long chains of thunks, and then evaluate them (e.g., as occurs if you sum(long list of symint)). The basic idea behind this PR is to only thunkify proxies if they're being created in places where they may or may not be used--crucially, symint operations that occur in user code we are tracing are eagerly placed into the graph, even if they may eventually be dead.
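
The failure mode can be reproduced in plain Python without any tracing machinery. The sketch below is only an analogy, assuming the default interpreter recursion limit: each lazy addition wraps the previous thunk, so forcing the last one recurses once per element, while eager accumulation does not.

```python
def lazy_sum(values):
    """Build a chain of thunks; nothing is computed until the result is called."""
    thunk = lambda: 0
    for v in values:
        # Bind prev and v now so each thunk refers to its own predecessor.
        thunk = (lambda prev, v: (lambda: prev() + v))(thunk, v)
    return thunk


def eager_sum(values):
    """Accumulate immediately; no chain of deferred calls is created."""
    total = 0
    for v in values:
        total += v
    return total


values = [1] * 10_000
try:
    lazy_sum(values)()
    overflowed = False
except RecursionError:
    overflowed = True

print(overflowed)         # True with the default recursion limit
print(eager_sum(values))  # 10000
```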

I annotated the PR with explanation of changes.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132421
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #132674, #132675
2024-08-07 11:51:17 +00:00
32f9a809c7 Replace [[unlikely]] with unlikely(x) (#130816)
Do not use `[[unlikely]]`, as it is a C++20 language feature; see https://en.cppreference.com/w/cpp/language/attributes/likely

Fixes https://github.com/pytorch/pytorch/issues/130815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130816
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/malfet
2024-08-07 10:38:13 +00:00
8c8eb9670a [CI] Enable inductor UT test on avx512 (#132645)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132645
Approved by: https://github.com/desertfire
2024-08-07 10:22:40 +00:00
37ab0f3385 Loads .pyd instead of .so in MemPool test for windows (#132749)
Fixes #132650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132749
Approved by: https://github.com/albanD
2024-08-07 09:58:52 +00:00
8333ecf085 Support hasattr tracing for more PythonModuleVariable (#132731)
Fixes #132237

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132731
Approved by: https://github.com/EikanWang, https://github.com/yanboliang
2024-08-07 09:15:17 +00:00
c8c964f950 [inductor] check best templates first for fusions (#132829)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132829
Approved by: https://github.com/eellison
2024-08-07 07:48:00 +00:00
c184ac0f6b Add support for other backends in get_preferred_device (#132118)
Currently get_preferred_device supports only cuda and cpu. Add support for other backends using the backend config.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132118
Approved by: https://github.com/awgu
2024-08-07 07:19:20 +00:00
87053132ea [DeviceMesh] Remove parent mesh concept from _MeshEnv and replace by root mesh (#132339)
Previously, when we slice out a submesh from a mesh, we assign the mesh as the parent mesh of the submesh. In this case, when we have a 3D mesh topology, the parent mesh of a 1D mesh sliced out from the 3D mesh is different from the parent mesh of the same 1D mesh sliced out from the 2D submesh of the 3D mesh. For example:
```
mesh_3d = init_device_mesh("cuda", (2,2,2), ("dim0", "dim1", "dim2"))
mesh_dim0 = mesh_3d["dim0"]

mesh_2d = mesh_3d["dim0", "dim1"]
mesh_dim0_2 = mesh_2d["dim0"]

# This would evaluate to True
print(_mesh_resources.get_parent_mesh(mesh_dim0) != _mesh_resources.get_parent_mesh(mesh_dim0_2))
```

We can always reconstruct the mesh needed from the mesh dim names, as long as two dims come from the same root. For simplicity, we do not see the necessity of building a tree structure to represent child-parent relationship. Therefore, we are replacing the parent mesh concept with a root mesh concept in `_MeshEnv` so we would have:

```
mesh_3d = init_device_mesh("cuda", (2,2,2), ("dim0", "dim1", "dim2"))
mesh_dim0 = mesh_3d["dim0"]

mesh_2d = mesh_3d["dim0", "dim1"]
mesh_dim0_2 = mesh_2d["dim0"]

# This would evaluate to True
print(_mesh_resources.get_root_mesh(mesh_dim0) == _mesh_resources.get_root_mesh(mesh_dim0_2))
```
With this change, we will have two types of meshes in an environment.
1. `device_mesh != _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh is created by slicing.
2. `device_mesh == _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh is a root mesh not created through slicing.
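
A toy registry shows why tracking only the root is enough. `Mesh` and `MeshEnvSketch` below are hypothetical stand-ins written for this illustration; they do not reproduce `DeviceMesh` or `_MeshEnv`, only the child-to-root bookkeeping described above.

```python
class Mesh:
    """Stand-in for a device mesh; only the dim names are modeled."""

    def __init__(self, dim_names):
        self.dim_names = tuple(dim_names)


class MeshEnvSketch:
    """Maps every sliced mesh to its root instead of its direct parent."""

    def __init__(self):
        self._child_to_root = {}

    def get_root_mesh(self, mesh):
        # A mesh with no recorded root was not created by slicing.
        return self._child_to_root.get(mesh, mesh)

    def slice(self, mesh, dim_names):
        child = Mesh(dim_names)
        # Record the root of the source mesh, not the source mesh itself.
        self._child_to_root[child] = self.get_root_mesh(mesh)
        return child


env = MeshEnvSketch()
mesh_3d = Mesh(["dim0", "dim1", "dim2"])
mesh_dim0 = env.slice(mesh_3d, ["dim0"])
mesh_2d = env.slice(mesh_3d, ["dim0", "dim1"])
mesh_dim0_2 = env.slice(mesh_2d, ["dim0"])

print(env.get_root_mesh(mesh_dim0) is env.get_root_mesh(mesh_dim0_2))  # True
print(env.get_root_mesh(mesh_3d) is mesh_3d)                            # True
```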

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132339
Approved by: https://github.com/wanchaol
ghstack dependencies: #132310, #132311
2024-08-07 07:01:12 +00:00
dc00eeb0f4 [Dynamo] fix incorrect kwargs in create_proxy (#132723)
## Summary
Fixes https://github.com/pytorch/pytorch/issues/132642. The implementation of `create_proxy` requires `kwargs` to be passed in explicitly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132723
Approved by: https://github.com/aorenste
2024-08-07 06:26:24 +00:00
2206a3de00 [Compile] Speedup int8-to-float conversion on aarch64 (#132676)
With this change, the following snippet:
```cpp
#include <ATen/cpu/vec/vec.h>

void int8tofloat(int8_t* in, float* out) {
        auto tmp0 = at::vec::Vectorized<int8_t>::loadu(in, 8);
        auto tmp1 = at::vec::convert<float>(tmp0);
        tmp1.store(out);
}
```
which is the core of the algorithm generated by cpu_inductor for the following compiled function:
```python
@torch.compile
def to_float(x):
  return x.to(torch.float)
```

changes from
```assembly
int8tofloat(signed char*, float*):
0000000000000000	stp	x29, x30, [sp, #-0x10]!
0000000000000004	mov	x29, sp
0000000000000008	sub	x9, sp, #0x30
000000000000000c	and	sp, x9, #0xffffffffffffffe0
0000000000000010	adrp	x8, 0 ; 0x0
0000000000000014	ldr	x8, [x8]
0000000000000018	ldr	x8, [x8]
000000000000001c	str	x8, [sp, #0x28]
0000000000000020	ldr	s0, [x0]
0000000000000024	sshll.8h	v0, v0, #0x0
0000000000000028	sshll.4s	v0, v0, #0x0
000000000000002c	scvtf.4s	v0, v0
0000000000000030	str	q0, [sp]
0000000000000034	ldr	s0, [x0, #0x4]
0000000000000038	sshll.8h	v0, v0, #0x0
000000000000003c	sshll.4s	v0, v0, #0x0
0000000000000040	scvtf.4s	v0, v0
0000000000000044	str	q0, [sp, #0x10]
0000000000000048	mov	x8, sp
000000000000004c	ld1.4s	{ v0, v1 }, [x8]
0000000000000050	st1.4s	{ v0, v1 }, [x1]
0000000000000054	ldr	x8, [sp, #0x28]
0000000000000058	adrp	x9, 0 ; 0x0
000000000000005c	ldr	x9, [x9]
0000000000000060	ldr	x9, [x9]
0000000000000064	cmp	x9, x8
0000000000000068	b.ne	0x78
000000000000006c	mov	sp, x29
0000000000000070	ldp	x29, x30, [sp], #0x10
0000000000000074	ret
0000000000000078	bl	0x78
```
to
```assembly
0000000000000000	ldr	d0, [x0]
0000000000000004	sshll.8h	v0, v0, #0x0
0000000000000008	sshll.4s	v1, v0, #0x0
000000000000000c	scvtf.4s	v1, v1
0000000000000010	sshll2.4s	v0, v0, #0x0
0000000000000014	scvtf.4s	v2, v0
0000000000000018	st1.4s	{ v1, v2 }, [x1]
000000000000001c	ret
```

and improves perf of `python3 torchchat.py generate stories110M --num-samples 3 --quantize '{"linear:int8" : {"groupsize" : 0}}' --compile --device cpu` from 56 to 98 tokens per sec on MacBook M1 Pro

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132676
Approved by: https://github.com/desertfire
2024-08-07 06:26:05 +00:00
4faa0e3efb [Inductor] support masked vectorization for the tail_loop (#126526)
Currently the tail_loop always uses the scalar kernel. This PR supports masked vectorization for the tail_loop to improve the performance.
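
The loop split can be sketched in plain Python as follows. `split_loop` is a hypothetical helper that only computes which `(start, lane_count)` blocks a vectorized main loop and a masked tail would cover; it is not Inductor's codegen.

```python
def split_loop(start, end, vec_width):
    """Return (main_blocks, tail_blocks) as lists of (offset, active_lanes)."""
    main_end = start + (end - start) // vec_width * vec_width
    # Full-width vector iterations.
    main = [(i, vec_width) for i in range(start, main_end, vec_width)]
    # One masked vector iteration covering the leftover elements, if any.
    tail = [(main_end, end - main_end)] if main_end < end else []
    return main, tail


# Inner dimension of 30 elements with 16-lane vectors.
print(split_loop(0, 30, 16))  # ([(0, 16)], [(16, 14)])
```

With 30 elements and 16 lanes, the tail is a single masked step of 14 lanes instead of 14 scalar iterations.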

Example:
```
import torch
import torch.nn as nn

class GN(nn.Module):
    def __init__(self, num_groups, num_channels):
        super(GN, self).__init__()
        self.gn = nn.GroupNorm(num_groups, num_channels)

    def forward(self, x):
        return self.gn(x)

input = torch.randn(2, 960, 96, 96).to(memory_format=torch.channels_last)
m = GN(32, 960).eval()
compiled_m = torch.compile(m)

with torch.no_grad():
    for _ in range(3):
        compiled_m(input)

```

Generated code:
- Before:
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/ky/cky2bufythacofebk7ujv36e4pxyqcqbpsy5r4vojoprjiwcwfxf.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(112)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> weight_recps(static_cast<long>(17280L));
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L))
                        {
                            for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 16);
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &weight_recps);
                            }
                            #pragma omp simd simdlen(8)
                            for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(1L))
                            {
                                auto tmp0 = in_ptr0[static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0))];
                                tmp_acc0 = welford_combine(tmp_acc0, tmp0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L))
                {
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(960L); x2+=static_cast<long>(16L))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)), 16);
                        auto tmp1 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr0[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp3 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr1[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x2), 16);
                        auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x2), 16);
                        auto tmp2 = tmp0 - tmp1;
                        auto tmp4 = static_cast<float>(276480.0);
                        auto tmp5 = at::vec::Vectorized<float>(tmp4);
                        auto tmp6 = tmp3 / tmp5;
                        auto tmp7 = static_cast<float>(1e-05);
                        auto tmp8 = at::vec::Vectorized<float>(tmp7);
                        auto tmp9 = tmp6 + tmp8;
                        auto tmp10 = tmp9.rsqrt();
                        auto tmp11 = tmp2 * tmp10;
                        auto tmp13 = tmp11 * tmp12;
                        auto tmp15 = tmp13 + tmp14;
                        tmp15.store(out_ptr2 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)));
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1 = args
    args.clear()
    assert_size_stride(arg0_1, (960, ), (1, ))
    assert_size_stride(arg1_1, (960, ), (1, ))
    assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960))
    buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32)
    cpp_fused_native_group_norm_0(arg2_1, arg0_1, arg1_1, buf0, buf1, buf3)
    del arg0_1
    del arg1_1
    del arg2_1
    return (buf3, )
```

- After:
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/em/cemtujj65j5txpqlxc7w4pcunpmvz3qtiudkc5ocxxhcmdlknw2m.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(112)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<long>(17280L));
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L))
                        {
                            for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 16);
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                            }
                            for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 14);
                                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, 14, &wrecps0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L))
                {
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(960L); x2+=static_cast<long>(16L))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)), 16);
                        auto tmp1 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr0[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp3 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr1[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x2), 16);
                        auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x2), 16);
                        auto tmp2 = tmp0 - tmp1;
                        auto tmp4 = static_cast<float>(276480.0);
                        auto tmp5 = at::vec::Vectorized<float>(tmp4);
                        auto tmp6 = tmp3 / tmp5;
                        auto tmp7 = static_cast<float>(1e-05);
                        auto tmp8 = at::vec::Vectorized<float>(tmp7);
                        auto tmp9 = tmp6 + tmp8;
                        auto tmp10 = tmp9.rsqrt();
                        auto tmp11 = tmp2 * tmp10;
                        auto tmp13 = tmp11 * tmp12;
                        auto tmp15 = tmp13 + tmp14;
                        tmp15.store(out_ptr2 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)));
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1 = args
    args.clear()
    assert_size_stride(arg0_1, (960, ), (1, ))
    assert_size_stride(arg1_1, (960, ), (1, ))
    assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960))
    buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32)
    cpp_fused_native_group_norm_0(arg2_1, arg0_1, arg1_1, buf0, buf1, buf3)
    del arg0_1
    del arg1_1
    del arg2_1
    return (buf3, )
```
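
For readers following the vectorized epilogue in the generated kernel, the arithmetic from `tmp2` to `tmp15` can be written per element as the scalar sketch below. It assumes `tmp0` is the input value and `tmp1` the per-group mean (both are loaded above this excerpt, so that reading is an assumption), and that `tmp3` holds the group's sum of squared deviations. The function name and parameter names are illustrative and are not part of the generated code.

```python
import math


def group_norm_epilogue(x, mean, m2, weight, bias, n=276480.0, eps=1e-05):
    """Scalar version of the tmp2..tmp15 chain in the kernel excerpt.

    n defaults to 276480 = 30 channels per group * 96 * 96 spatial positions,
    the constant the kernel divides by.
    """
    centered = x - mean                    # tmp2 = tmp0 - tmp1
    variance = m2 / n                      # tmp6 = tmp3 / tmp5
    inv_std = 1.0 / math.sqrt(variance + eps)  # tmp10 = (tmp6 + eps).rsqrt()
    return centered * inv_std * weight + bias  # tmp15 = tmp11 * tmp12 + tmp14


# 960 channels split into 32 groups gives 30 channels per group, which matches
# the div_floor_integer(x2 + x2_inner, 30L) group lookup in the kernel.
assert 960 // 32 == 30
assert 30 * 96 * 96 == 276480
```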

Co-authored-by: CaoE <e.cao@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126526
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-08-07 06:00:12 +00:00
8bc5ef563e Grouped Query Attention (#132689)
### Approach: Using the current function declaration

**Constraint:** Q_Heads % KV_Heads == 0

**Major change:**
- Added a new argument enable_gqa: bool to sdpa function call
- This gives meaning to the third-to-last dimension (the number of heads), which may now differ between query and key/value.

Sample use cases this would enable:
LLama3

```
# LLama3 8b call to SDPA
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)

output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)

# Output Shape
(batch, 32, seq_len_q, D)
```

### Design Choice:

- Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their number of heads are not equal, facilitating correct and efficient computation in attention mechanisms.
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.
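
The head expansion described above can be sketched without tensors. The helper below is a hypothetical, pure-Python stand-in for what `repeat_interleave` does along the head dimension: it repeats each key/value head so the result has one entry per query head. The name `expand_kv_heads` is an assumption for illustration and is not part of the PR.

```python
def expand_kv_heads(kv_heads, q_num_heads):
    """Repeat each KV head so the result has q_num_heads entries.

    The order matches repeat_interleave: [k0, k1] with 4 query heads
    becomes [k0, k0, k1, k1], so consecutive query heads share a KV head.
    """
    kv_num_heads = len(kv_heads)
    # Constraint from the PR description: Q_Heads % KV_Heads == 0.
    if kv_num_heads == 0 or q_num_heads % kv_num_heads != 0:
        raise ValueError(
            f"q_num_heads ({q_num_heads}) must be a multiple of "
            f"kv_num_heads ({kv_num_heads})"
        )
    n_rep = q_num_heads // kv_num_heads
    return [head for head in kv_heads for _ in range(n_rep)]


# Llama 3 8B shape from the example above: 32 query heads, 8 KV heads.
mapping = expand_kv_heads(list(range(8)), 32)
```

With these shapes every group of four consecutive query heads attends to the same key/value head.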

### Benchmarks:

- **sdpa.py: #130634**
For different batch sizes, enable_gqa=True shows a substantial improvement in the run time of sdpa.

| batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time (enable_gqa=True) | forward_time (enable_gqa=False) |
| ---------- | ----------- | ------------ | --------- | ---------- | --------- | ------------------------------ | ------------------------------- |
| 1          | 32          | 8            | 2048      | 2048       | 2048      | 100.71                         | 119.70                          |
| 8          | 32          | 8            | 2048      | 2048       | 2048      | 539.78                         | 628.83                          |
| 16         | 32          | 8            | 2048      | 2048       | 2048      | 1056.81                        | 1225.48                         |
| 32         | 32          | 8            | 2048      | 2048       | 2048      | 2099.54                        | 2440.45                         |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)

- **TorchTitan: https://github.com/pytorch/torchtitan/pull/458**

Differential Revision: D60772086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132689
Approved by: https://github.com/drisspg
2024-08-07 05:35:36 +00:00
527f104a69 add L2 cache size to device properties (#132819)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132819
Approved by: https://github.com/eellison
2024-08-07 04:55:06 +00:00
cyy
bfeb45e46b [17/N] Fix clang-tidy warnings in jit (#132753)
Follows #132604
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132753
Approved by: https://github.com/Skylion007
2024-08-07 03:47:54 +00:00
cyy
03480213de [8/N] Fix clang-tidy warnings in aten/src/ATen (#132728)
Follows  #132727
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132728
Approved by: https://github.com/ezyang
2024-08-07 02:44:17 +00:00
919e384247 [PT2][Optimus] Add unbind_stack_to_cat_pass (#132542)
Summary: We observe that the stack node can be transformed into a cat node to eliminate split nodes, which could further enable the unbind cat optimization. We therefore add a more advanced pattern to do the graph transformation.

Test Plan:
# unit test

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```
Buck UI: https://www.internalfb.com/buck2/de6c1cda-3d74-4a30-8980-7b209b6fe5dc
Test UI: https://www.internalfb.com/intern/testinfra/testrun/12103424042268125
Network: Up: 485KiB  Down: 728KiB  (reSessionID-2f2c01c3-79bb-4e37-b5be-fb77ec09b264)
Jobs completed: 29. Time elapsed: 5:19.8s.
Cache hits: 0%. Commands: 4 (cached: 0, remote: 0, local: 4)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0

# benchmark

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ig_ctr" --flow_id 584880697
```
P1503698962

before and after graph transformation
https://www.internalfb.com/intern/diffing/?paste_number=1504050718

Differential Revision: D60411560

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132542
Approved by: https://github.com/jackiexu1992
2024-08-07 02:26:40 +00:00
063a45ed27 Fix infinite recursion while walking to submodules (#132763)
Fixes https://github.com/pytorch/pytorch/pull/132216#issuecomment-2271555873

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132763
Approved by: https://github.com/ezyang
2024-08-07 02:20:17 +00:00
73c083e02c [Inductor][CPP] Turns on inline_inbuilt_nn_modules for CPP GEMM template testing (#132487)
**Summary**
The CPP GEMM template tests were skipped when `inline_inbuilt_nn_modules` was turned on, as described in https://github.com/pytorch/pytorch/issues/131929. Now that https://github.com/pytorch/pytorch/pull/132334 has landed to fix those issues, turn the flag back on for these tests, since it is the default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132487
Approved by: https://github.com/anijain2305, https://github.com/jgong5
2024-08-07 02:18:51 +00:00
ed224554eb [BE] Don't unnecessarily suggest -k for rerunning tests locally (#132807)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132807
Approved by: https://github.com/malfet
2024-08-07 02:15:18 +00:00
837898d9c8 Stop using preserve_rng_state as decorator (#132774)
See https://github.com/pytorch/pytorch/pull/132073 for motivation

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132774
Approved by: https://github.com/albanD
2024-08-07 01:07:12 +00:00
cyy
b01402b0a4 [7/N] Fix clang-tidy warnings in aten/src/ATen (#132727)
Follows  #132620
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132727
Approved by: https://github.com/Skylion007
2024-08-07 00:29:03 +00:00
178dc0c9c7 various doc fixes (#132803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132803
Approved by: https://github.com/Chillee, https://github.com/joydddd, https://github.com/BoyuanFeng
ghstack dependencies: #132799
2024-08-07 00:19:42 +00:00
cb4d1bfb71 Clean up some tflop calc and add option for saving (#132799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132799
Approved by: https://github.com/BoyuanFeng
2024-08-07 00:19:42 +00:00
cbee9c1fd2 Revert "Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)"
This reverts commit 0e7e61f7cec82a43f2de52b83eff152d703be7a3.

Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2272370386))
2024-08-07 00:05:20 +00:00
e98eac76b3 [inductor] switch AotCodeCompiler to new cpp_builder. (take 3) (#132766)
Summary: This is basically https://github.com/pytorch/pytorch/pull/131304 together with https://github.com/pytorch/pytorch/pull/132594 and absolute path fix for fbcode.

Test Plan: ci

Differential Revision: D60773405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132766
Approved by: https://github.com/xuhancn, https://github.com/chenyang78, https://github.com/desertfire
2024-08-06 23:56:34 +00:00
c7113a6186 Revert "[DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709)"
This reverts commit 1a23ef2ece1c667ee46cd34deb70df2b91bffa32.

Reverted https://github.com/pytorch/pytorch/pull/132709 on behalf of https://github.com/clee2000 due to I think this broke distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_ddp_device_mesh_initialization [GH job link](https://github.com/pytorch/pytorch/actions/runs/10274519791/job/28432469987) [HUD commit link](1a23ef2ece).  Test not run due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/132709#issuecomment-2272350923))
2024-08-06 23:47:53 +00:00
0d6caeb259 Add logging + counter for missed reinplacing opportunities (#132758)
Summary:
- We add Inductor logs for what tensors we tried to reinplace, what
  tensors we were unable to reinplace, and of those tensors, which of
  those might be bugs (the "missed reinplacing opportunities"). You can
  tell this by reading the Inductor output graph but the logs make it
  easier to figure out.
- Add a dynamo_compile counter for missed reinplacing opportunities. The
  goal is to see how widespread existing problems (if any) are. We've had
  trouble getting all of the edge cases for the reinplacing pass; the
  counter will help us hunt down issues.

Test Plan:
- tested locally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132758
Approved by: https://github.com/eellison
2024-08-06 23:44:24 +00:00
cd7f527c59 [3/3] 3D Composability - move tp dp tests (#129802)
pytorch (fsdp, tp, pp) -> pytorch (composable)
Move (fsdp, tp, pp) tests under pytorch into a composable folder

FSDP:
test/distributed/_composable/fsdp/test_fully_shard_trainin.py
-TestFullyShard2DTraining
**DP:
test/distributed/tensor/parallel/test_ddp_2d_parallel.py
TP:
test/distributed/tensor/parallel/test_fsdp_2d_parallel.py**
PP:
test/distributed/pipelining/test_composability.py

=>
**distributed/_composable/test_composability/test_2d_composability.py**
distributed/_composable/test_composability/test_pp_composability.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129802
Approved by: https://github.com/fduwjj
ghstack dependencies: #129801
2024-08-06 23:07:07 +00:00
179b572fd9 [2/3] 3D Composability - move pp tests (#129801)
pytorch (fsdp, tp, pp) -> pytorch (composable)
Move (fsdp, tp, pp) tests under pytorch into a composable folder

FSDP:
test/distributed/_composable/fsdp/test_fully_shard_trainin.py
-TestFullyShard2DTraining
DP:
test/distributed/tensor/parallel/test_ddp_2d_parallel.py
TP:
test/distributed/tensor/parallel/test_fsdp_2d_parallel.py
**PP:
test/distributed/pipelining/test_composability.py**

=>
distributed/_composable/test_composability/test_2d_composability.py
**distributed/_composable/test_composability/test_pp_composability.py**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129801
Approved by: https://github.com/wconstab, https://github.com/atalman
2024-08-06 23:07:07 +00:00
825002c9c6 [export][fx] More robust DCE pass (#132764)
Summary:
- make default DCE pass check schema,
- need to rebase onto https://github.com/pytorch/pytorch/pull/131651 after it's in phabricator (for now the change is manually added).

- mark Proxy dump as NotImplemented for better error msg

- Remove Proxy from tensors when dumping models, as Proxy cannot be dumped.

More details in https://docs.google.com/document/d/1G5vmTXjzxoyVGRI2kpA1gQukK_Glyg2NrE0Oh6Nlg9A/edit?usp=sharing.

Test Plan:
CI
```
- buck2 run 'fbcode//mode/dev-nosan'  fbcode//caffe2/test/quantization:test_quantization -- -r  qat_conv2d
- test_export.py
- buck2 run 'fbcode//mode/dev-nosan' fbcode//modai/test:test_modai -- -r test_qat_stinson_htp_export
- buck2 run 'fbcode//mode/dev-nosan' fbcode//vizard_projects/ml_depth/tests:test_model -- -r test_qat_model_et
- buck2 run 'fbcode//mode/dev-nosan'  fbcode//caffe2/test:fx -- -r dce
- buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/backends/tests:qnn_test -- -r test_qat_bias=False,use_3d_input=False
- buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/backends/tests:qnn_test -- -r test_qat_bias=True,use_3d_input=False
- buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r  test_fold_bn_erases_bn_node
```

Reviewed By: angelayi

Differential Revision: D60319175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132764
Approved by: https://github.com/angelayi
2024-08-06 22:27:22 +00:00
073cee531c [Test][Easy] Remove print in test_device_mesh.py (#132780)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132780
Approved by: https://github.com/XilunWu
2024-08-06 22:04:39 +00:00
1a23ef2ece [DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709)
More context in [#132471](https://github.com/pytorch/pytorch/issues/132471) and https://github.com/pytorch/pytorch/issues/132366.

TLDR:
When cuda is available and users move tensors to cuda, we cannot really reuse the default pg if default pg is gloo, as lots of collectives are not supported on gloo for cuda tensors. For example, `dtensor.full_tensor()` would result in a mysterious SIGTERM when all_gather a cuda tensor using gloo. Without the change in this PR, users would have to know the context and explicitly move the cuda tensor to cpu before invoking most collectives, which I think is not so ideal UX.

Therefore, given most collectives are not supported on gloo for cuda tensors, we should init a new pg if the default pg is gloo when torch.cuda.is_available() and device_type is cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132709
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-08-06 22:00:09 +00:00
18b678082e [Easy] log output code path on cache hit (#132718)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132718
Approved by: https://github.com/oulgen, https://github.com/masnesral
2024-08-06 21:59:30 +00:00
3c1033eeb0 Don't auto request review for reopened PRs (#132681)
This will clobber previous approves.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132681
Approved by: https://github.com/albanD, https://github.com/malfet
2024-08-06 21:36:18 +00:00
2073ddfd1c Actually report the HOP and subclass/mode when there isn't a registration (#132550)
Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132550
Approved by: https://github.com/ydwu4
2024-08-06 21:33:10 +00:00
623d0204f0 [NJT] Support Chunk backward for simple cases (#132193)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132193
Approved by: https://github.com/soulitzer
2024-08-06 21:20:09 +00:00
2f908ffa4a [traced-graph][sparse] sparsity propagation for all current tests (#132690)
This PR makes sure all current tests in the sparsity export test suite pass. Note that there will probably be anecdotal cases that need fixing after this, but the general idea of preserving sparsity metadata has been completed.

Fixes: https://github.com/pytorch/pytorch/issues/117188

```
$ PYTORCH_TEST_WITH_DYNAMO=0 python test/export/test_sparse.py
........................................................................................................................................................
 ----------------------------------------------------------------------
Ran 152 tests
OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132690
Approved by: https://github.com/ezyang
2024-08-06 21:18:13 +00:00
029f8fc701 Bump rexml from 3.2.8 to 3.3.3 in /ios/TestApp (#132469)
Bumps [rexml](https://github.com/ruby/rexml) from 3.2.8 to 3.3.3.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/ruby/rexml/releases">rexml's releases</a>.</em></p>
<blockquote>
<h2>REXML 3.3.3 - 2024-08-01</h2>
<h3>Improvements</h3>
<ul>
<li>
<p>Added support for detecting invalid XML that has unsupported
content before root element</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/184">GH-184</a></li>
<li>Patch by NAITOH Jun.</li>
</ul>
</li>
<li>
<p>Added support for <code>REXML::Security.entity_expansion_limit=</code> and
<code>REXML::Security.entity_expansion_text_limit=</code> in SAX2 and pull
parsers</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/187">GH-187</a></li>
<li>Patch by NAITOH Jun.</li>
</ul>
</li>
<li>
<p>Added more tests for invalid XMLs.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/183">GH-183</a></li>
<li>Patch by Watson.</li>
</ul>
</li>
<li>
<p>Added more performance tests.</p>
<ul>
<li>Patch by Watson.</li>
</ul>
</li>
<li>
<p>Improved parse performance.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/186">GH-186</a></li>
<li>Patch by tomoya ishida.</li>
</ul>
</li>
</ul>
<h3>Thanks</h3>
<ul>
<li>
<p>NAITOH Jun</p>
</li>
<li>
<p>Watson</p>
</li>
<li>
<p>tomoya ishida</p>
</li>
</ul>
<h2>REXML 3.3.2 - 2024-07-16</h2>
<h3>Improvements</h3>
<ul>
<li>
<p>Improved parse performance.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/160">GH-160</a></li>
<li>Patch by NAITOH Jun.</li>
</ul>
</li>
<li>
<p>Improved parse performance.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/169">GH-169</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/170">GH-170</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/171">GH-171</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/172">GH-172</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/173">GH-173</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/174">GH-174</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/175">GH-175</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/176">GH-176</a></li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/ruby/rexml/blob/master/NEWS.md">rexml's changelog</a>.</em></p>
<blockquote>
<h2>3.3.3 - 2024-08-01 {#version-3-3-3}</h2>
<h3>Improvements</h3>
<ul>
<li>
<p>Added support for detecting invalid XML that has unsupported
content before root element</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/184">GH-184</a></li>
<li>Patch by NAITOH Jun.</li>
</ul>
</li>
<li>
<p>Added support for <code>REXML::Security.entity_expansion_limit=</code> and
<code>REXML::Security.entity_expansion_text_limit=</code> in SAX2 and pull
parsers</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/187">GH-187</a></li>
<li>Patch by NAITOH Jun.</li>
</ul>
</li>
<li>
<p>Added more tests for invalid XMLs.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/183">GH-183</a></li>
<li>Patch by Watson.</li>
</ul>
</li>
<li>
<p>Added more performance tests.</p>
<ul>
<li>Patch by Watson.</li>
</ul>
</li>
<li>
<p>Improved parse performance.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/186">GH-186</a></li>
<li>Patch by tomoya ishida.</li>
</ul>
</li>
</ul>
<h3>Thanks</h3>
<ul>
<li>
<p>NAITOH Jun</p>
</li>
<li>
<p>Watson</p>
</li>
<li>
<p>tomoya ishida</p>
</li>
</ul>
<h2>3.3.2 - 2024-07-16 {#version-3-3-2}</h2>
<h3>Improvements</h3>
<ul>
<li>
<p>Improved parse performance.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/160">GH-160</a></li>
<li>Patch by NAITOH Jun.</li>
</ul>
</li>
<li>
<p>Improved parse performance.</p>
<ul>
<li><a href="https://redirect.github.com/ruby/rexml/issues/169">GH-169</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/170">GH-170</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/171">GH-171</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/172">GH-172</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/173">GH-173</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/174">GH-174</a></li>
<li><a href="https://redirect.github.com/ruby/rexml/issues/175">GH-175</a></li>
</ul>
</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="e4a067e112"><code>e4a067e</code></a> Add 3.3.3 entry</li>
<li><a href="17ff3e7874"><code>17ff3e7</code></a> test: add a performance test for attribute list declaration</li>
<li><a href="be86b3de0a"><code>be86b3d</code></a> test: fix wrong test name</li>
<li><a href="b93d790b36"><code>b93d790</code></a> test: use double quote for string literal</li>
<li><a href="0fbe7d5a0e"><code>0fbe7d5</code></a> test: don't use abbreviated name</li>
<li><a href="1599e8785f"><code>1599e87</code></a> test: add a performance test for PI with many tabs</li>
<li><a href="e2546e6eca"><code>e2546e6</code></a> parse pi: improve invalid case detection</li>
<li><a href="73661ef281"><code>73661ef</code></a> test: fix a typo</li>
<li><a href="850488abf2"><code>850488a</code></a> test: use double quote for string literal</li>
<li><a href="46c6397d5c"><code>46c6397</code></a> test: add performance tests for entity declaration</li>
<li>Additional commits viewable in <a href="https://github.com/ruby/rexml/compare/v3.2.8...v3.3.3">compare view</a></li>
</ul>
</details>
<br />

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132469
Approved by: https://github.com/ezyang
2024-08-06 21:17:24 +00:00
e47b684c33 Revert "Temp disable MKL in DistributionKernels.cpp (#132532)"
This reverts commit 7b2664ece6a961ce9e4557be913c2cead09c7390.

Reverted https://github.com/pytorch/pytorch/pull/132532 on behalf of https://github.com/PaliC due to causing numerical instability issues internally ([comment](https://github.com/pytorch/pytorch/pull/132532#issuecomment-2272136210))
2024-08-06 20:57:09 +00:00
94155ce31b [Torch] Support meta device in checkpoint (#132684)
Summary:
## Why
utils.checkpoint doesn't support meta device:

```
  File "/Users/lyu1/torchdev/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 490, in checkpoint
    next(gen)
  File "/Users/lyu1/torchdev/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 1359, in _checkpoint_without_reentrant_generator
    device_module = _get_device_module(device)
  File "/Users/lyu1/torchdev/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 98, in _get_device_module
    device_module = getattr(torch, device)
  File "/Users/lyu1/torchdev/lib/python3.9/site-packages/torch/__init__.py", line 1938, in __getattr__
    raise AttributeError(f"module '{__name__}' has no attribute '{name}'")
AttributeError: module 'torch' has no attribute 'meta'
```

This blocks us from running model with checkpoint enabled in meta mode.

## What
This diff handles the case of meta device in checkpoint.py.

(In checkpoint.py, the device module is mainly used when preserve_rng_state=true, which doesn't apply to the meta case. A more elegant fix might be to set preserve_rng_state=false when the args are detected to be on the meta device, but I didn't find a minimal place to do this check. Let me know if you have ideas.)

Test Plan: Tested with toy model which has checkpoint on its module: P1513716944

Differential Revision: D60749427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132684
Approved by: https://github.com/kit1980
2024-08-06 20:45:50 +00:00
de00c79583 [dynamo][inline_inbuilt_nn_modules] Mark nn module tensor static for cudagraphs (#132736)
Fixes https://github.com/pytorch/pytorch/issues/132714

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132736
Approved by: https://github.com/mlazos
ghstack dependencies: #132538
2024-08-06 20:13:28 +00:00
1954bfacda [Inductor] Small performance, precision, and dependency updates to B2B-GEMM (#132354)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132354
Approved by: https://github.com/masnesral
2024-08-06 20:01:27 +00:00
775c310c0c Preserve source_fn_stack in the training IR decomp (#132033)
Title

Differential Revision: [D60377712](https://our.internmc.facebook.com/intern/diff/D60377712/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132033
Approved by: https://github.com/angelayi
ghstack dependencies: #131988, #131995, #131999
2024-08-06 19:45:40 +00:00
4faa5804f6 [c10d] Used float tensor for PG NCCL barrier all-reduce (#132701)
This helps avoid a CUDA illegal memory access in the NCCL all-reduce part of `barrier()` when the CUDA caching allocator is disabled. NCCL all-reduce seems to assume reading at least 4 bytes. See https://github.com/pytorch/pytorch/issues/132640 for more context.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132701
Approved by: https://github.com/wanchaol, https://github.com/fegin
2024-08-06 19:35:37 +00:00
1e65ccc3de [inductor] export kernel for gemm template. (#132580)
Changes:
1. Move `get_export_declaration` to global scope.
2. Export kernel for gemm template.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132580
Approved by: https://github.com/ezyang
2024-08-06 18:52:22 +00:00
81a5a7a30a [Quantizer] Fix getattr for quantizing constants (#132705)
Mobilebert quantization was failing because there were embedding constants that could not be accessed through getattr().

It seems that we have to search the submodule for the embeddings, which we do here. This is just to help get around looking at unlifted attrs to check if they are large scalars.

Differential Revision: [D60492338](https://our.internmc.facebook.com/intern/diff/D60492338/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132705
Approved by: https://github.com/jerryzh168
ghstack dependencies: #132704
2024-08-06 18:16:27 +00:00
c2bccfd431 [BE] Simplify code interacting with get_proxy_mode/enable_tracing (#132675)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132675
Approved by: https://github.com/Skylion007, https://github.com/ydwu4, https://github.com/zou3519
ghstack dependencies: #132674
2024-08-06 18:13:22 +00:00
1de4ebc85d [Quantizer] Fix Maxpool2d share q params (#132704)
There seems to be a bug in the code for sharing q params for maxpool2d. This case occurs when output_node = maxpool_node. When this happens we overwrite the node's "quantization_annotation" metadata. This fix ensures that qparams are indeed shared across input and output

Differential Revision: [D60492341](https://our.internmc.facebook.com/intern/diff/D60492341/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132704
Approved by: https://github.com/jerryzh168
2024-08-06 18:13:16 +00:00
db0bd04151 [AOTI] Switch to use shim v2 for fbcode (#132750)
Summary: As title

Test Plan: CI

Reviewed By: hl475, ColinPeppler

Differential Revision: D57899065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132750
Approved by: https://github.com/angelayi
2024-08-06 17:57:32 +00:00
8d2c272e5a properly register conjugate/neg fallthroughs to prim ops (#132699)
A few aten ops (like `clone` and `copy_`) get fallthrough registrations to the Conjugate/Negative keys. We haven't been giving the same treatment to their corresponding `prims` variants, which can cause infinite loops in some cases.

Fixes an infinite loop that showed up in tests from https://github.com/pytorch/pytorch/pull/132563

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132699
Approved by: https://github.com/albanD
2024-08-06 17:57:04 +00:00
c6582f11cd Add get_optin_feature() to allow opt-in to amz2023 (#131792)
This extends the runner determinator to be able to opt-in to keywords
to provide additional options when determining which systems to run
jobs on. This enables us to support opt-in users to Amazon Linux 2023.

This change creates a generic get_optin_feature() which hopefully will
be useful to handle additional future features that we might want to
experiment with.

This change keeps backwards compatibility with the existing issue
userlist format and adds support for the comma-separated list of users
in a backwards compatible way.

The user list has the following rules:

- Users are GitHub usernames with the @ prefix
- If the first line is a "*" then all users will use the new runners
- If the first line is a "!" then all users will use the old runners
- Each user may be followed by a comma-separated list of features/experiments to enable
- A "#" prefix indicates the user is opted out of the new runners but is opting
  into features/experiments.

Example user list:

```
@User1
@User2,amz2023
#@UserOptOutOfNewRunner,amz2023
```
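
The rules above can be sketched as a small parser. This is a hypothetical illustration, not the runner determinator's actual code: the names `OptIn`, `parse_user_list`, `uses_new_runners`, and `has_feature` are assumptions, as is the choice to let a leading `*` or `!` override per-user entries.

```python
from typing import Dict, List, NamedTuple


class OptIn(NamedTuple):
    new_runners: bool    # False when the line carried the "#" prefix
    features: List[str]  # features/experiments listed after the username


def parse_user_list(text: str) -> Dict[str, OptIn]:
    """Parse per-user lines such as '@User2,amz2023' or '#@User,amz2023'."""
    users: Dict[str, OptIn] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line in ("*", "!"):
            continue
        opted_out = line.startswith("#")
        if opted_out:
            line = line[1:]
        if not line.startswith("@"):
            continue  # not a GitHub username entry
        name, *features = [part.strip() for part in line.split(",")]
        users[name] = OptIn(not opted_out, [f for f in features if f])
    return users


def uses_new_runners(text: str, user: str) -> bool:
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    first = lines[0] if lines else ""
    if first == "*":  # everyone uses the new runners
        return True
    if first == "!":  # everyone uses the old runners
        return False
    entry = parse_user_list(text).get(user)
    return entry is not None and entry.new_runners


def has_feature(text: str, user: str, feature: str) -> bool:
    entry = parse_user_list(text).get(user)
    return entry is not None and feature in entry.features


EXAMPLE = "@User1\n@User2,amz2023\n#@UserOptOutOfNewRunner,amz2023\n"
```

Under these assumptions, `@UserOptOutOfNewRunner` stays on the old runners but still receives the `amz2023` feature.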

This closes pytorch/ci-infra#249.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131792
Approved by: https://github.com/jeanschmidt, https://github.com/ZainRizvi
2024-08-06 17:54:20 +00:00
e3394e5548 torch.autograd.graph.increment_version: accept List[Tensor], use in AOTDispatcher (#132652)
The regression from https://github.com/pytorch/pytorch/issues/132281 pinpoints e4ace1a396 as the cause. The main delta that commit introduces is that we now manually check `is_inference()` and call `increment_version()` (a pybind call) on every mutated input tensor to the graph.

This PR attempts to reduce overhead a bit by bundling up all of those checks into a single pybind call, by:

(1) updating `torch.autograd.graph.increment_version()` to accept a `Union[Tensor, List[Tensor]]`

(2) updating its semantics to no-op if you pass in a tensor with no version counter, instead of erroring
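
The two changes can be illustrated with a stand-in class. This is a minimal sketch, not the real pybind implementation: `StubTensor` is hypothetical, and `version=None` is used here to model a tensor with no version counter.

```python
from typing import List, Optional, Union


class StubTensor:
    """Hypothetical stand-in; version=None models 'no version counter'."""

    def __init__(self, version: Optional[int]) -> None:
        self.version = version


def increment_version(tensors: Union[StubTensor, List[StubTensor]]) -> None:
    # (1) Accept either a single tensor or a list, so the caller makes one call.
    if isinstance(tensors, StubTensor):
        tensors = [tensors]
    for t in tensors:
        # (2) Tensors without a version counter are skipped instead of raising.
        if t.version is None:
            continue
        t.version += 1


mutated = [StubTensor(0), StubTensor(None), StubTensor(5)]
increment_version(mutated)  # one call covers every mutated input
```

The point of the batching is that the per-tensor check moves inside the single call rather than being repeated by the caller for each mutated input.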

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132652
Approved by: https://github.com/albanD
2024-08-06 17:46:48 +00:00
af67b8df6d [export] Fix exportdb test (#132678)
Summary:
Fix exportdb test for tensor_setattr.

copy.deepcopy(inputs) can fail if the tensor inputs have attributes (i.e. a non-empty __dict__).

We remove the attributes before the deepcopy.

Before the fix, we have

```
inputs[0].__dict__
{'attr': FakeTensor(..., size=(3, 2))}
```

the test errors out with

```
======================================================================
ERROR: test_exportdb_supported_case_tensor_setattr (caffe2.test.export.test_serialize.TestDeserialize)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/a915c8ae5cba5b70/caffe2/test/__test_export__/test_export#link-tree/torch/testing/_internal/common_utils.py", line 529, in instantiated_test
    test(self, **param_kwargs)
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/a915c8ae5cba5b70/caffe2/test/__test_export__/test_export#link-tree/caffe2/test/export/test_serialize.py", line 878, in test_exportdb_supported
    self.check_graph(model, case.example_args, _check_meta=_check_meta)
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/a915c8ae5cba5b70/caffe2/test/__test_export__/test_export#link-tree/caffe2/test/export/test_serialize.py", line 548, in check_graph
    _check_graph(pre_dispatch=True)
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/a915c8ae5cba5b70/caffe2/test/__test_export__/test_export#link-tree/caffe2/test/export/test_serialize.py", line 506, in _check_graph
    copy.deepcopy(inputs),
  File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 211, in _deepcopy_tuple
    y = [deepcopy(a, memo) for a in x]
  File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 211, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 153, in deepcopy
    y = copier(memo)
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/a915c8ae5cba5b70/caffe2/test/__test_export__/test_export#link-tree/torch/_tensor.py", line 206, in __deepcopy__
    new_tensor.__dict__ = deepcopy(self.__dict__, memo)
  File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 153, in deepcopy
    y = copier(memo)
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/a915c8ae5cba5b70/caffe2/test/__test_export__/test_export#link-tree/torch/_tensor.py", line 108, in __deepcopy__
    or (type(self) is not Tensor and self.data_ptr() == 0)
RuntimeError: Cannot access data pointer of Tensor (e.g. FakeTensor, FunctionalTensor). If you're using torch.compile/export/fx, it is likely that we are erroneously tracing into a custom kernel. To fix this, please wrap the custom kernel into an opaque custom op. Please see the following for details: https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html
```

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r  test_exportdb_supported_case_tensor_setattr
```

Differential Revision: D60610860

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132678
Approved by: https://github.com/zhxchen17
2024-08-06 17:45:10 +00:00
e6eee04875 dynamo: use equality guards instead of id guards for Placement/DeviceMesh (#124401)
After talking to @anijain2305, we probably can't land this since it won't work for C++ guards. But we should still be able to do better than ID_MATCH

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124401
Approved by: https://github.com/anijain2305
2024-08-06 17:14:44 +00:00
f50621989b Construct NJT without graph breaks (#130292)
Combines contributions from https://github.com/pytorch/pytorch/pull/130505

Some context can be found in this large comment block:

a5b64d39fd/test/dynamo/test_subclasses.py (L1667-L1681)

Changes in this PR
- For each tensor fakified, check the nested int registry in eager, and eagerly symbolicize if that tensor has already been associated with nested int in eager.
- Adds a separate counter stored on FakeTensorMode as a fake analog to _tensor_id_counter (which keeps track of unique tensors). This counter is initialized to the global eager tensor id counter upon creation of the FakeTensorMode, and needs to be reset when the same FakeTensorMode is reused to trace again (in this PR, we piggyback on the epoch incrementing logic).
- (refactor) Today, we store FakeTensor -> symbolic nested int in the global registry. With this PR, symbolic nested int is stored directly on the FakeTensor. (Eager still caches nested int in the registry, though we should avoid this at some point.)

Basically unchanged, but worth noting:
- `__tensor_unflatten__` is still responsible for determining whether we should cache for now. The logic is somewhat simplified.
- to_copy is still using the trick of updating two different tensors in the registry to point to the same nested int. This is kind of broken, but we try to leave it as is, and plan a better fix with the UnionFind stack.

Differential Revision: [D60406772](https://our.internmc.facebook.com/intern/diff/D60406772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130292
Approved by: https://github.com/bdhirsh
ghstack dependencies: #131916, #131803
2024-08-06 17:03:39 +00:00
406b50835b Use FakeTensor cache for subclass inner tensors (#131803)
Rewrite of original PR in https://github.com/pytorch/pytorch/pull/130291

To answer review comments from https://github.com/pytorch/pytorch/pull/130291#pullrequestreview-2166671953:

> At a higher level, do we need this?

Today, this should not change the behavior of anything. But an invariant of "same tensor always corresponds to the same FakeTensor" is nice (from discussion with @bdhirsh).

> Why does this happen?

Today, both dynamo and meta_utils do some recursion when it comes to FakeTensors. So whenever we fakify a subclass, the process would roughly look like:

```
wrap_to_fake (subclass)
   meta_utils (subclass)
      meta_utils (values) -> not cached because we use callback
      meta_utils(offsets) -> not cached because we use callback
  wrap_to_fake (values)
  wrap_to_fake (offsets) -> cached because we rely on top-level meta_utils
```

However, we know that:
- Caching only occurs at the top-level of meta_utils.
- The return value of the top-level wrap_to_fake is returned.

This means that after all of this:
- The fakified subclass holds inner FakeTensors that are NOT part of the cache
- values/offsets are Fakified a second time, and those instances are cached.
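
The effect described above can be reproduced with a toy model. This is a sketch under stated assumptions: `Tensor`, `Fake`, and `to_fake` are hypothetical stand-ins and do not mirror the real `meta_utils` or `wrap_to_fake` code; the only behavior modeled is "cache at the top level only".

```python
class Tensor:
    def __init__(self, name, inner=()):
        self.name = name
        self.inner = list(inner)


class Fake:
    def __init__(self, source, inner=()):
        self.source = source
        self.inner = list(inner)


def to_fake(t, cache, top_level=True):
    # Only top-level calls consult or populate the cache.
    if top_level and id(t) in cache:
        return cache[id(t)]
    fake = Fake(t, [to_fake(i, cache, top_level=False) for i in t.inner])
    if top_level:
        cache[id(t)] = fake
    return fake


values, offsets = Tensor("values"), Tensor("offsets")
subclass = Tensor("nested", [values, offsets])

cache = {}
fake_subclass = to_fake(subclass, cache)  # inner tensors are fakified, not cached
fake_values = to_fake(values, cache)      # fakified a second time, now cached

# The subclass holds a different Fake object than the one stored in the cache.
print(fake_subclass.inner[0] is fake_values)
```

In this toy model the invariant "same tensor always corresponds to the same Fake" is broken for `values`, which is the situation the PR sets out to avoid.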

Differential Revision: [D60406773](https://our.internmc.facebook.com/intern/diff/D60406773)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131803
Approved by: https://github.com/ezyang
ghstack dependencies: #131916
2024-08-06 17:03:39 +00:00
a94c441e48 Fix symbolic nested int printing (#131916)
Differential Revision: [D60406775](https://our.internmc.facebook.com/intern/diff/D60406775)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131916
Approved by: https://github.com/Skylion007, https://github.com/jbschlosser
2024-08-06 17:03:39 +00:00
ffdf48e63b Consolidate SymDispatchMode into ProxyTensorMode (#132674)
Instead of having a separate context variable for SymDispatchMode, we
now simply delegate to the current active proxy tensor mode when we
need to trace a SymInt.  We maintain a separate `__sym_dispatch__` magic
method as the calling convention is different than `__torch_dispatch__`.

Consolidating the modes in this way means that we can consistently
disable both of these modes in tandem simply by removing the mode
from the proxy mode infra slot.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132674
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2024-08-06 17:03:17 +00:00
7045bc5a77 [export] change error message for specializations (#132698)
https://github.com/pytorch/pytorch/pull/130775 recently killed forced specializations for export on complex guards, so the only way we now get a specialized value is if we're able to solve for it. For example, if we have guards `s0 * 2 = s1`, `s0 + 6 = s1`, we specialize `s0 = 6; s1 = 12`.

That might look like this:
```
class Foo(torch.nn.Module):
    def forward(self, x, y):
        return x.reshape([-1]) + y

dy = Dim("dy", min=6)
x, y = torch.randn(6, 2), torch.randn(12)
dynamic_shapes = {
    "x": (dy - 6, 2),
    "y": (dy,),
}
```

Our current error message is:
`{symbol} must be specialized to {value} because the guards generated for it are too complex`
This is now misleading, so we change it to:
`solving the guards generated for {symbol} resulted in a specialized value of {value}`
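
The specialization in the `s0 * 2 = s1`, `s0 + 6 = s1` example can be checked by hand: substituting gives `s0 * 2 = s0 + 6`, so `s0 = 6` and `s1 = 12`. The snippet below confirms this with a brute-force search; it is only an illustration, and `solve_guards` is a hypothetical helper, not how export solves guards.

```python
def solve_guards(limit=64):
    """Brute-force the two example guards over small positive sizes."""
    return [
        (s0, s1)
        for s0 in range(1, limit)
        for s1 in range(1, limit)
        if s0 * 2 == s1 and s0 + 6 == s1
    ]


solutions = solve_guards()
print(solutions)
```

A single solution means the guards pin both symbols to concrete values, which is the case the new error message describes.
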
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132698
Approved by: https://github.com/avikchaudhuri
2024-08-06 16:59:53 +00:00
ca7ce2fca1 [ts-migration][1/N]: Add prim::Loop for constant number of iterations and condition (#131418)
#### Description
This PR adds prim::Loop support for the simplest case where the number of iterations is constant and the loop termination condition is also a constant.

[PR by stages](https://docs.google.com/document/d/1q6OprW3HBHbYPwEyE_DikBn-uzmhnN284Cmen_CnlhI/edit?usp=sharing)

#### Test Plan
Add reprod example.
* `pytest test/export/test_converter.py -s -k test_ts2ep_with_loop`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131418
Approved by: https://github.com/angelayi
2024-08-06 16:51:08 +00:00
C
c803e35c4b Reduce number of guards introduced by check_cudnn_tensor_shapes when cudnn version is higher enough (#132384)
I found that when using TorchDynamo (torch.compile) with dynamic shape on H100, there are some extra guards added to check the sequence length of inputs of `scaled_dot_product_attention` to be divisible by 64. These guards cause unwanted recompilations when the input shape changes.

In fact, these guards are not necessary if the cuDNN version is high enough, so I changed the order of the checks to use short-circuit evaluation, which skips those checks and avoids the unnecessary guards.
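
The reordering idea can be sketched in plain Python. Everything here is an assumption for illustration: `shapes_ok`, `seq_len_divisible_by_64`, and the `min_version` placeholder are hypothetical and are not the code in `check_cudnn_tensor_shapes`.

```python
recorded_guards = []


def seq_len_divisible_by_64(seq_len):
    # Evaluating this condition on a symbolic shape is what would add a guard;
    # this sketch just records that the condition was evaluated.
    recorded_guards.append(f"{seq_len} % 64 == 0")
    return seq_len % 64 == 0


def shapes_ok(cudnn_version, seq_len, min_version=90000):
    # min_version is a placeholder threshold, not the value used in PyTorch.
    # Testing the version first lets `or` short-circuit and skip the shape check.
    return cudnn_version >= min_version or seq_len_divisible_by_64(seq_len)


new_enough = shapes_ok(90100, 100)
guards_with_new_cudnn = list(recorded_guards)

recorded_guards.clear()
too_old = shapes_ok(80000, 128)
guards_with_old_cudnn = list(recorded_guards)

print(guards_with_new_cudnn, guards_with_old_cudnn)
```

With the version check first, the shape condition is never evaluated on a new enough version, so nothing is recorded for it.
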
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132384
Approved by: https://github.com/eqy, https://github.com/Skylion007
2024-08-06 16:48:13 +00:00
fc7849b93f [pt2e][quant] Ensure BN node is erased after convert (#131651)
Summary: Previously, when folding BN into conv, we rely on DCE
to clean up the unused BN node from the graph. This works if
the model is already in eval mode, but fails if the model is
still in train mode because DCE doesn't remove nodes with
potential side effects (in this case `_native_batch_norm_legit`).
This required users to move the model to eval mode before calling
convert in order to get a properly DCE'd graph.

To solve this, we manually erase the BN node after folding
instead of relying on DCE. This relaxes the ordering constraints
between `move_exported_model_to_eval` and `convert_pt2e`.
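
A toy graph shows why relying on DCE is not enough here. This is a sketch, not `torch.fx`: `Node` and `dead_code_elimination` are hypothetical, and the `impure` flag stands in for "has potential side effects".

```python
class Node:
    def __init__(self, name, inputs=(), impure=False):
        self.name = name
        self.inputs = list(inputs)
        self.impure = impure


def dead_code_elimination(nodes, outputs):
    """Keep nodes reachable from the outputs, plus every node marked impure."""
    live = set()
    stack = list(outputs) + [n for n in nodes if n.impure]
    while stack:
        node = stack.pop()
        if node.name in live:
            continue
        live.add(node.name)
        stack.extend(node.inputs)
    return [n for n in nodes if n.name in live]


x = Node("x")
conv = Node("conv_folded", [x])
bn = Node("bn", [conv], impure=True)  # train-mode BN, treated as side-effecting
graph = [x, conv, bn]

after_dce = dead_code_elimination(graph, outputs=[conv])
manually_erased = [n for n in after_dce if n is not bn]

print([n.name for n in after_dce], [n.name for n in manually_erased])
```

The unused `bn` node survives the DCE pass because it is marked impure, while erasing it explicitly removes it regardless of that flag.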

Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn1d.test_fold_bn_erases_bn_node
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn2d.test_fold_bn_erases_bn_node

Reviewers: jerryzh168, yushangdi

Subscribers: jerryzh168, yushangdi, supriyar

Differential Revision: [D60520149](https://our.internmc.facebook.com/intern/diff/D60520149)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131651
Approved by: https://github.com/yushangdi, https://github.com/leslie-fang-intel
2024-08-06 16:37:39 +00:00
679cdf606a Converted __all__ literal tuple to literal list. (#132404)
Partial Fix for #131765.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132404
Approved by: https://github.com/soulitzer
2024-08-06 15:12:32 +00:00
6753ee127c Allow torch.cuda.memory.mem_get_info to take a device str argument with an unspecified device index. (#132616)
`torch.cuda.memory.mem_get_info` allows device strings given the current type hints. However, `device = torch.device('cuda')` leads to `device.index = None`, which results in downstream problems. Setting `optional=True` will insert the default device index in such cases.

Fixes #132583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132616
Approved by: https://github.com/soulitzer
2024-08-06 13:19:46 +00:00
7100c36c8a Revert "[inductor] export kernel for gemm template. (#132580)"
This reverts commit 87d46d70d7754e32eb0e6689688f4336e4e7c955.

Reverted https://github.com/pytorch/pytorch/pull/132580 on behalf of https://github.com/PaliC due to sys is not defined in torch/_inductor/codegen/cpp_utils.py ([comment](https://github.com/pytorch/pytorch/pull/132580#issuecomment-2271264974))
2024-08-06 13:15:15 +00:00
cyy
656a4d1408 [6/N] Fix clang-tidy warnings in aten/src/ATen (#132620)
Follows #132565

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132620
Approved by: https://github.com/Skylion007
2024-08-06 13:07:16 +00:00
a8f0979962 Add cudagraph static inputs logging (#132726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132726
Approved by: https://github.com/anijain2305
2024-08-06 12:01:20 +00:00
da320214e6 Format tensor (#127992)
Align tensor display
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127992
Approved by: https://github.com/janeyx99
2024-08-06 07:10:16 +00:00
728374d7f7 Changed create_block_mask to just accept BLOCK_SIZE (#132697)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132697
Approved by: https://github.com/drisspg
2024-08-06 04:37:15 +00:00
91df66ee74 [caffe2] Wrap constexpr with preprocessor statements (#132582)
Summary: When the preprocessor check does not pass, we leave an unused constexpr around, so when `-Wunused-const-variable` is enabled we get an error. Let's inline these values, since they're not used anywhere else, in order to avoid this.

Test Plan: CI

Differential Revision: D60723823

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132582
Approved by: https://github.com/houseroad
2024-08-06 04:35:06 +00:00
4260f365ba [inductor] Replace torch.allclose with torch.testing.assert_close in test_fx_fusion (#130618)
Preventative fix of a test failure with the oneDNN v3.5 upgrade, where the order of float32 arithmetic may change in torch.addmm (the bias term can be at the start or end of the arithmetic), resulting in slightly different output due to float32 precision loss.

Replaced occurrences of torch.allclose with ~~torch._dynamo.testing.same~~ torch.testing.assert_close, which is the recommended approach as per this issue: https://github.com/pytorch/pytorch/issues/56544. The default tolerance is more relaxed than that of torch.allclose, which satisfies the test with the upcoming oneDNN change.
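
The difference in defaults can be illustrated without PyTorch. The sketch below assumes the documented defaults (`rtol=1e-5, atol=1e-8` for `torch.allclose`; `rtol=1.3e-6, atol=1e-5` for `torch.testing.assert_close` on float32) and the usual closeness formula; `is_close` is a hypothetical helper, not a PyTorch function.

```python
def is_close(actual, expected, rtol, atol):
    # Closeness criterion: |actual - expected| <= atol + rtol * |expected|
    return abs(actual - expected) <= atol + rtol * abs(expected)


ALLCLOSE_DEFAULTS = dict(rtol=1e-5, atol=1e-8)
ASSERT_CLOSE_FLOAT32_DEFAULTS = dict(rtol=1.3e-6, atol=1e-5)

# A small absolute error on a value near zero, as float32 reordering can produce.
actual, expected = 0.0, 5e-6

passes_allclose = is_close(actual, expected, **ALLCLOSE_DEFAULTS)
passes_assert_close = is_close(actual, expected, **ASSERT_CLOSE_FLOAT32_DEFAULTS)

print(passes_allclose, passes_assert_close)
```

Under these assumed defaults the larger absolute tolerance is what lets near-zero values pass; the relative tolerance of `assert_close` for float32 is actually smaller.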

This should fix aarch64 ci failures in #129932

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130618
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-08-06 03:58:43 +00:00
4e610924d4 [c10d] Add a new API for adding ephemeral timeout for one local rank and the timeout will reset when the first collective finishes (#130905)
We provide an API for users to add an ephemeral timeout across all PGs within one rank. The timeout resets when the first collective issued after the timeout was added finishes.

Each extension only covers collectives issued after the extension is added and before the first such collective finishes. The diagram below shows how the timeout changes:

<img width="1174" alt="image" src="https://github.com/user-attachments/assets/354923b7-581c-40de-ae0f-1cd3da273ccc">

While this feature provides flexibility in specific scenarios, it introduces statefulness to timeout setting. Therefore, it is advisable to use this API sparingly and consider alternative approaches, such as directly setting the timeout or utilizing a barrier collective (one can set any timeout to the barrier), whenever feasible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130905
Approved by: https://github.com/ezyang
2024-08-06 03:47:58 +00:00
39c9b75a68 Add registration mechanism for aoti model runner (#131638)
The current AOTI model runner supports CUDA and CPU. However, for an out-of-tree backend, it is not easy to support this feature.

This PR provides a registration mechanism to support this case through two APIs: `RegisterAOTIModelRunner` and `getAOTIModelRunnerRegistry`.

- `RegisterAOTIModelRunner` is used to register a function(`AOTIModelRunnerABC`) to create a `AOTIModelContainerRunner`. The function signature is as follows.

    ```C++
    using AOTIModelRunnerABC = std::shared_ptr<AOTIModelContainerRunner> (*)(
        const std::string& model_so_path,
        size_t num_models,
        const std::string& device_str,
        const std::string& bin_dir);
    ```
- `getAOTIModelRunnerRegistry` is used to get all the registered backends.

A new backend needs to define its own `AOTIModelContainerRunner` class and then register an `AOTIModelRunnerABC` function with `aoti` to create that `AOTIModelContainerRunner`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131638
Approved by: https://github.com/desertfire, https://github.com/jansel
2024-08-06 02:47:35 +00:00
345bea01dc Refactor thunkify to return proper thunk abstraction (#132407)
This is superior to lru_cache because (1) it's more explicit and (2) it
doesn't leak the original function after it's been forced.
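
A minimal sketch of the idea, assuming a thunk is simply "a zero-argument function that is run at most once". The `Thunk` class below is illustrative and is not claimed to match the implementation in the PR.

```python
from typing import Callable, Generic, Optional, TypeVar

R = TypeVar("R")


class Thunk(Generic[R]):
    """Lazily computes a value once, then drops the function that produced it."""

    def __init__(self, f: Callable[[], R]) -> None:
        self.f: Optional[Callable[[], R]] = f
        self.r: Optional[R] = None

    def force(self) -> R:
        if self.f is None:
            return self.r  # already forced: return the stored result
        self.r = self.f()
        self.f = None  # release the function and anything it closes over
        return self.r


calls = []
thunk = Thunk(lambda: calls.append("ran") or 42)
first, second = thunk.force(), thunk.force()

print(first, second, calls, thunk.f)
```

Unlike an `lru_cache`-wrapped function, which keeps a reference to the wrapped callable for as long as the wrapper lives, this object holds only the result once it has been forced.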

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132407
Approved by: https://github.com/albanD
2024-08-06 02:35:45 +00:00
93fad2f0f2 [export] Fix import in D60427208 (#132707)
Summary:
D60427208 broke the APS release by failing our NE deterministic test. https://www.internalfb.com/intern/test/562950111197340/

This Diff fixes it.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//aps_models/ads/gmp/tests/ne/e2e_deterministic_tests:gmp_e2e_ne_tests -- --filter-text test_mtml_instagram_model_474023725_single_gpu_with_ir
```

Differential Revision: D60790203

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132707
Approved by: https://github.com/ydwu4
2024-08-06 02:35:17 +00:00
2f16e68cab [Intel GPU] Allow XPU device in copy, cdist, index_put_impl (#130088)
# Motivation
The `copy`, `cdist`, and `index_put_impl` operators use `op_stub` for runtime dispatching inside the operators. Each of them contains an extra device list to assure accuracy, and XPU is not in those lists. This PR makes them accept XPU as a supported device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130088
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #130019, #130082
2024-08-06 01:55:50 +00:00
38674bcb45 Revert "Conversions between strided and jagged layouts for Nested Tensors (#115749)"
This reverts commit eca0cb0fbe84bb0a34fa94afe261bceecd52c436.

Reverted https://github.com/pytorch/pytorch/pull/115749 on behalf of https://github.com/izaitsevfb due to breaks test_overrides.py::TestTorchFunctionWarning::test_warn_on_invalid_torch_function_tensor_subclass ([comment](https://github.com/pytorch/pytorch/pull/115749#issuecomment-2270213988))
2024-08-06 01:55:41 +00:00
d6a24b3b92 Removed duplicate __all__ declarations. (#132405)
Partial Fix for #131765.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132405
Approved by: https://github.com/soulitzer
2024-08-06 01:17:44 +00:00
96471ea47c [inductor] support vectorization for torch.any(bool) -> bool (#132472)
Support the `any` reduction from `bool` to `bool`.
Test Plan:
```
python test/inductor/test_cpu_repro.py -k test_any_bool_vec
```

Generated code for `test_any_bool_vec`
```
cpp_fused_any_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'bool*', 'bool*'], '''
#include "/tmp/torchinductor_root/ky/cky2bufythacofebk7ujv36e4pxyqcqbpsy5r4vojoprjiwcwfxf.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       bool* out_ptr0,
                       bool* out_ptr1)
{
    {
        {
            bool tmp_acc0 = 0;
            at::vec::VecMask<float,1> tmp_acc0_vec = at::vec::VecMask<float,1>::from(0);
            bool tmp_acc0_arr[64];
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_arr[tid] = 0;
            }
            at::vec::VecMask<float,1> tmp_acc0_vec_arr[64];
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_vec_arr[tid] = at::vec::VecMask<float,1>::from(0);
            }
            #pragma omp parallel num_threads(64)
            {
                int tid = omp_get_thread_num();
                bool tmp_acc0_local = 0;
                at::vec::VecMask<float,1> tmp_acc0_vec_local = at::vec::VecMask<float,1>::from(0);
                #pragma omp for
                for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(16L))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x0), 16);
                    auto tmp1 = at::vec::VecMask<float,1>::from<float,1>(tmp0);
                    tmp_acc0_vec_local = tmp_acc0_vec_local | tmp1;
                }
                tmp_acc0_arr[tid] = tmp_acc0_local;
                tmp_acc0_vec_arr[tid] = tmp_acc0_vec_local;
            }
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0 = tmp_acc0 || tmp_acc0_arr[tid];
            }
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_vec = tmp_acc0_vec | tmp_acc0_vec_arr[tid];
            }
            tmp_acc0 = tmp_acc0 || at::vec::vec_reduce_all<bool>([](at::vec::Vectorized<bool>& x, at::vec::Vectorized<bool>& y) { return x | y; }, tmp_acc0_vec.to<bool, 1>());
            out_ptr0[static_cast<long>(0L)] = static_cast<bool>(tmp_acc0);
        }
    }
    {
        {
            bool tmp_acc0 = 0;
            at::vec::VecMask<float,1> tmp_acc0_vec = at::vec::VecMask<float,1>::from(0);
            bool tmp_acc0_arr[64];
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_arr[tid] = 0;
            }
            at::vec::VecMask<float,1> tmp_acc0_vec_arr[64];
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_vec_arr[tid] = at::vec::VecMask<float,1>::from(0);
            }
            #pragma omp parallel num_threads(64)
            {
                int tid = omp_get_thread_num();
                bool tmp_acc0_local = 0;
                at::vec::VecMask<float,1> tmp_acc0_vec_local = at::vec::VecMask<float,1>::from(0);
                #pragma omp for
                for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(16L))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x0), 16);
                    auto tmp1 = at::vec::VecMask<float,1>::from<float,1>(tmp0);
                    tmp_acc0_vec_local = tmp_acc0_vec_local | tmp1;
                }
                tmp_acc0_arr[tid] = tmp_acc0_local;
                tmp_acc0_vec_arr[tid] = tmp_acc0_vec_local;
            }
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0 = tmp_acc0 || tmp_acc0_arr[tid];
            }
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_vec = tmp_acc0_vec | tmp_acc0_vec_arr[tid];
            }
            tmp_acc0 = tmp_acc0 || at::vec::vec_reduce_all<bool>([](at::vec::Vectorized<bool>& x, at::vec::Vectorized<bool>& y) { return x | y; }, tmp_acc0_vec.to<bool, 1>());
            out_ptr1[static_cast<long>(0L)] = static_cast<bool>(tmp_acc0);
        }
    }
}
''')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132472
Approved by: https://github.com/jgong5
2024-08-06 01:03:51 +00:00
26c6786109 return_and_correct_aliasing: skip dispatcher when swapping storage (#132524)
`return_and_correct_aliasing` is used by FunctionalTensor today to ensure that when we call view/inplace ops, the input and output `FunctionalTensors` share the same storage.

This was previously done with a dispatcher call to `aten.set_`. In this PR I swap it out with a util that just manually does the storage swap. Benefits:

(1) we know this is safe in the specific way it is used by FunctionalTensor: avoiding the extra assertions in `aten.set_` is necessary to avoid some unbacked symint errors

(2) this should improve compile times a bit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132524
Approved by: https://github.com/ezyang
ghstack dependencies: #132243, #132337, #132322
2024-08-06 00:44:35 +00:00
eca0cb0fbe Conversions between strided and jagged layouts for Nested Tensors (#115749)
This PR does 3 things:
1. Adds a copy-free strided->jagged layout conversion for NT
2. Adds a copy-free jagged->strided layout conversion for NT
3. Modifies and expands the .to() API to support the layout argument for the specific case of NT layout conversion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115749
Approved by: https://github.com/jbschlosser
2024-08-05 23:45:48 +00:00
4306eebab1 [DeviceMesh] Update slicing documentation to include nD and non-continuous slicing (#132311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132311
Approved by: https://github.com/wanchaol
ghstack dependencies: #132310
2024-08-05 23:44:23 +00:00
1add8c5f1c [Easy][DTensor] Rename args_sharding to args_schema for OpSchema __str__ (#132187)
Looks like we don't use the name `args_sharding` anywhere else so just changing it to `args_schema` for naming consistency

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132187
Approved by: https://github.com/wanchaol
2024-08-05 23:40:19 +00:00
cyy
3ef45e5669 Fix ODR (#131032)
Fixes ODR violation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131032
Approved by: https://github.com/ezyang
2024-08-05 23:19:49 +00:00
a74e5abda4 Fix issues in activation_memory_budget for float8 (#132687)
Summary:
When using activation_memory_budget for float8 training, two issues were noticed:

- When `aggressive_options` (https://fburl.com/code/m1yoskxw) is called , all fp8 gemms (the scaled_mm op) are saved for recomputation.
- After adding "scaled_mm" in the `compute_intensive_ops`, we got the next error from `estimate_runtime`: `mat2 must be col_major` from `meta_scaled_mm`.
To fix it, modified `materialize_arg` to also include the stride of the original tensor.

Test Plan: Run float8 training with `activation_memory_budget`.

Differential Revision: D60777297

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132687
Approved by: https://github.com/Chillee
2024-08-05 23:01:35 +00:00
a4ed8eeb33 [hop] makes compiled hops not share code objects (#132427)
Fixes code object sharing issue in https://github.com/pytorch/pytorch/issues/132417.

Before this PR, compiled hops such as cond and flex_attention were wrapped by _dynamo/external_utils.py:wrap_inline, which causes them to share the same code object. There is a condition surrounding the wrap_inline call, and hops currently pass it.

We make hops fail the check so that they don't share code objects by adding them to LEGACY_MOD_INLINELIST. Adding them to MOD_INLINELIST doesn't work because trace_rules.check(fn) doesn't check for MOD_INLINELIST by default.
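
The code object sharing is ordinary Python behavior and can be seen without Dynamo. In this sketch `wrap_inline` is a simplified stand-in for a generic wrapper, not the helper in `_dynamo/external_utils.py`, and `cond_like` and `flex_like` are hypothetical functions.

```python
def wrap_inline(fn):
    # Stand-in for a generic wrapper; every returned closure is built from
    # the same `inner` definition.
    def inner(*args, **kwargs):
        return fn(*args, **kwargs)

    return inner


def cond_like(x):
    return x + 1


def flex_like(x):
    return x * 2


wrapped_a = wrap_inline(cond_like)
wrapped_b = wrap_inline(flex_like)

# Different functions, but the wrappers share one code object.
shares_code = wrapped_a.__code__ is wrapped_b.__code__
print(shares_code, cond_like.__code__ is flex_like.__code__)
```

Anything keyed on the code object, such as a per-code cache, would therefore treat the two wrapped functions as the same entry.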

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132427
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-08-05 22:59:05 +00:00
4a2cf50edf [export][reland] Convert autocast to HOO (#132677)
Summary:
Reland of D60206382.

Suggested in https://github.com/pytorch/pytorch/issues/128394.

If there's an autocast context manager, the predispatch (strict) graph can look something like:

```
class <lambda>(torch.nn.Module):
    def forward(self, x: "f32[1]"):
        ...
        _enter_autocast = torch.amp.autocast_mode._enter_autocast('cuda', torch.bfloat16, True, None)
        mm: "f32[8, 8]" = torch.ops.aten.mm.default(rand, rand_1);  rand = rand_1 = None
        _exit_autocast = torch.amp.autocast_mode._exit_autocast(_enter_autocast);  _enter_autocast = None
        return (mm_1,)
```

But the operator `torch.amp.autocast_mode._enter_autocast` is not a valid ATen op. We remove these nodes by turning autocast into a higher order operator and make a submodule for the blocks between `_enter_autocast` and `_exit_autocast`.

Some potential followup improvements:
1) Merge some of the duplicated logic with `replace_set_grad_with_hop_pass.py`
2) Check the current autocast status (any enabled? dtype?) and do not create a submodule if the autocast args match the current autocast status.

Test Plan:
CI

```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r "test_predispatch_autocast"
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r "test_predispatch_set_grad"
```

Verified that we can now export the llama model in gh issue 128394 and the gemma model in gh issue 131829 without error.

Differential Revision: D60770038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132677
Approved by: https://github.com/angelayi
2024-08-05 22:34:52 +00:00
ea42027e0e [micro_pipeline_tp] support all _scaled_mm args (#131984)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131984
Approved by: https://github.com/weifengpy
2024-08-05 21:44:37 +00:00
2b5e31d099 Move sigmoid run_const_graph HOP to PyTorch core (#132526)
Summary: When HOPs live out of tree, it makes it impossible to make breaking changes to the HOP API. But HOP implementations are deeply entwined with PyTorch internals. Move the HOP into PyTorch tree so that changes are possible.

Test Plan: sandcastle and oss ci

Differential Revision: D60674861

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132526
Approved by: https://github.com/SherlockNoMad
2024-08-05 21:40:56 +00:00
af8b8a47cb fsdp.set_: convey to functionalization that it mutates storage (#132322)
Fixes https://github.com/pytorch/pytorch/issues/132197

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132322
Approved by: https://github.com/albanD, https://github.com/yf225
ghstack dependencies: #132243, #132337
2024-08-05 21:28:59 +00:00
1a0db29932 move torch._functionalize APIs to pybind. add one for marking storage mutations (#132337)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132337
Approved by: https://github.com/albanD, https://github.com/justinchuby
ghstack dependencies: #132243
2024-08-05 21:28:59 +00:00
4db368a475 make functorch CSE respect mutations as barriers (like fsdp.set_) (#132243)
Fixes https://github.com/pytorch/pytorch/issues/132200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132243
Approved by: https://github.com/albanD, https://github.com/zou3519, https://github.com/yf225
2024-08-05 21:28:55 +00:00
ee0ae11b34 Fix a typo in the example code. (#132601)
Since the backward multiplies the gradient by `n`, we must change the forward function to multiply the input tensor by `n`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132601
Approved by: https://github.com/soulitzer
2024-08-05 21:04:20 +00:00
9a1ad3345f Fix periodic windows test (#132648)
This test has failed to clean up folders on Windows for the past week; see 27f61eba58 for an example.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132648
Approved by: https://github.com/janeyx99, https://github.com/zou3519, https://github.com/malfet
2024-08-05 20:54:20 +00:00
cyy
6b12dc0224 [Reland] [11/N] Use std::nullopt and std::optional (#132622)
Reland of #132396, which was reverted due to dependency reversion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132622
Approved by: https://github.com/ezyang
2024-08-05 20:36:33 +00:00
6f4dc56735 [inductor] Default to 1 compile thread for internal (#132540)
Summary: The historical default here is "1", i.e., no parallel compilation. In order to prepare for rolling out the subprocess-based parallel compile, I had previously modified this code to allow parallelism when worker_start_method="subprocess". I realize this probably isn't the best rollout strategy. Rather than opting all internal usages into both a) parallel-compile, _and_ b) a new implementation of parallel compile, let's put the default back to "1" and then start rolling out the new parallel compile implementation only to those usages that have already opted in by explicitly setting compile_thread > 1

Differential Revision: [D60686105](https://our.internmc.facebook.com/intern/diff/D60686105)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132540
Approved by: https://github.com/c00w
2024-08-05 20:23:16 +00:00
1471473b84 Add tests to bsr_dense_addmm_meta. Tune bsr_dense_addmm kernel for ViT shapes. (#132646)
As in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132646
Approved by: https://github.com/cpuhrsch
2024-08-05 20:22:33 +00:00
b7bcfdaff2 Change deprecate warning on dispatch_on_subclass to warn once (#132374)
Summary:
# Problem

`TORCH_WARN` can cause massive log spam.

I output the logs for before and after adding this change.

*Before:*

* The log file size was ~61.15 MB (61148028 bytes).

*After:*

* The log file size was ~56.44 MB (56444057 bytes).

# Context

Looks like we tried to land this change earlier but it was reverted:

* D59413413
* Reverted https://github.com/pytorch/pytorch/pull/130047 on behalf of https://github.com/clee2000 due to broke test_overrides.py::TestTorchFunctionWarning::test_warn_on_invalid_torch_function

# Testing Update

`test_warn_on_invalid_torch_function` would fail because the warning would not be called on the handling of the second torch function class since `TORCH_WARN_ONCE` stops repeats globally.

Updated the test so that it runs separate programs. (I was not able to actually run the test; could someone help me with that?)
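
A rough Python analogue of why a process-wide warn-once flag breaks a test that expects one warning per class. This is a sketch only; `warn_once`, `handle`, and the key string are hypothetical, and `TORCH_WARN_ONCE` itself is a C++ macro.

```python
# Hypothetical analogue of a warn-once guard keyed by call site.
_seen_keys = set()
messages = []


def warn_once(key, msg):
    # The guard lives for the whole process, so only the first call emits.
    if key in _seen_keys:
        return
    _seen_keys.add(key)
    messages.append(msg)


class A:
    pass


class B:
    pass


def handle(cls):
    # Both classes pass through the same call site, hence the same key.
    warn_once("dispatch_on_subclass", f"deprecated dispatch on {cls.__name__}")


handle(A)
handle(B)  # suppressed: the key was already recorded
print(messages)
```

Running each case in a separate program resets the guard, which is why the updated test spawns separate processes.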

Test Plan: Need help with this...

Differential Revision: D60561181

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132374
Approved by: https://github.com/ezyang
2024-08-05 20:02:33 +00:00
2764bee942 Revert "[MPS] Add support for autocast in MPS (#99272)"
This reverts commit 6919e8baaba391ced7b4acaa553d6ea1f3b30e79.

Reverted https://github.com/pytorch/pytorch/pull/99272 on behalf of https://github.com/clee2000 due to Broke test/inductor/test_cpu_select_algorithm.py::TestSelectAlgorithmCPU::test_quantized_linear_amx_batch_size_3_in_features_128_out_features_64_bias_False_cpu on sm86 jobs [GH job link](https://github.com/pytorch/pytorch/actions/runs/10252979157/job/28367091621) [HUD commit link](6919e8baab) Not caught on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/99272#issuecomment-2269808857))
2024-08-05 19:59:04 +00:00
a3ea96b762 Revert "[export] Convert autocast to HOO (#131914)"
This reverts commit aec948adfc224e49213c4bc49586d4e4ba65fbbb.

Reverted https://github.com/pytorch/pytorch/pull/131914 on behalf of https://github.com/davidberard98 due to PR shouldn't have been relanded by the bot, phabricator diff did not have any recent changes and is still internally reverted ([comment](https://github.com/pytorch/pytorch/pull/131914#issuecomment-2269797388))
2024-08-05 19:52:09 +00:00
1d34f33d00 Scale XBLOCK in triton reduction configs to avoid hitting max grid (#128826)
Scale XBLOCK size in triton_config_reduction to avoid hitting maxGridSize limits.

This issue was observed in gpt-fast examples with large sequence length:
Reproducer: https://gist.github.com/jataylo/8a0ba922fbf68e345d360a418b48b9f1

`RuntimeError: Triton Error [HIP]:  Code: 9, Messsage: invalid configuration argument`
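
The arithmetic behind the fix can be sketched as follows. This is an assumption about the general approach (grow the block size until the launch grid fits), not the heuristic the PR actually implements; `scale_xblock` and its parameters are hypothetical.

```python
def scale_xblock(xnumel, xblock, max_grid):
    """Double xblock until ceil(xnumel / xblock) fits within max_grid."""

    def cdiv(a, b):
        return (a + b - 1) // b

    # The grid along x needs one program per block of xblock elements.
    while cdiv(xnumel, xblock) > max_grid:
        xblock *= 2
    return xblock


# Toy numbers chosen for readability, not real device limits.
print(scale_xblock(xnumel=1000, xblock=1, max_grid=100))
```

With these toy numbers the block size grows 1, 2, 4, 8, 16, at which point 63 programs fit under the limit of 100.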

Co-authored-by: Jason Ansel <jansel@jansel.net>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128826
Approved by: https://github.com/jansel, https://github.com/nmacchioni
2024-08-05 19:34:38 +00:00
e1c2bdac2f [easy] fix f-string messages in torch/_ops.py (#132531)
I encountered these when making this change:

```
diff --git a/test/functorch/test_ac.py b/test/functorch/test_ac.py
index 3a2e07fa147..a4d003399e7 100644
--- a/test/functorch/test_ac.py
+++ b/test/functorch/test_ac.py
@@ -259,15 +259,8 @@ class MemoryBudgetTest(TestCase):

         expected = call()
         for budget in range(0, 11):
-            memory_budget = budget / 10
-            torch._dynamo.reset()
-            with config.patch(activation_memory_budget=memory_budget):
-                if memory_budget is not None:
-                    f_compile = torch.compile(
-                        call, backend="aot_eager_decomp_partition"
-                    )
-
-                self.assertEqual(expected, f_compile())
+            get_mem_and_flops(call, memory_budget=budget / 10)
+

     def test_prioritize_cheaper_matmul(self):
         def f(xs, ws):
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132531
Approved by: https://github.com/Skylion007
2024-08-05 18:58:33 +00:00
aec948adfc [export] Convert autocast to HOO (#131914)
Summary:
Suggested in https://github.com/pytorch/pytorch/issues/128394.

If there's an autocast context manager, the predispatch (strict) graph can look something like:

```
class <lambda>(torch.nn.Module):
    def forward(self, x: "f32[1]"):
        ...
        _enter_autocast = torch.amp.autocast_mode._enter_autocast('cuda', torch.bfloat16, True, None)
        mm: "f32[8, 8]" = torch.ops.aten.mm.default(rand, rand_1);  rand = rand_1 = None
        _exit_autocast = torch.amp.autocast_mode._exit_autocast(_enter_autocast);  _enter_autocast = None
        return (mm_1,)
```

But the operator `torch.amp.autocast_mode._enter_autocast` is not a valid ATen op. We remove these nodes by turning autocast into a higher order operator and make a submodule for the blocks between `_enter_autocast` and `_exit_autocast`.
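
A toy illustration of the grouping step, operating on a flat list of node names rather than a real FX graph. It is a sketch under that assumption, does not handle nested regions, and `split_autocast_regions` and the `"autocast_submodule"` label are hypothetical.

```python
def split_autocast_regions(nodes):
    """Group the nodes between enter/exit markers into one sub-list each."""
    out, region = [], None
    for node in nodes:
        if node == "_enter_autocast":
            region = []  # start collecting a new region
        elif node == "_exit_autocast":
            out.append(("autocast_submodule", region))
            region = None
        elif region is not None:
            region.append(node)  # inside a region
        else:
            out.append(node)  # outside any region
    return out


nodes = ["rand", "rand_1", "_enter_autocast", "mm", "_exit_autocast", "output"]
print(split_autocast_regions(nodes))
```

The marker nodes disappear from the result, and the nodes they enclosed travel together as one unit.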

Some potential follow-up improvements:
1) Merge some of the duplicated logic with `replace_set_grad_with_hop_pass.py`
2) Check the current autocast status (any enabled? dtype?) and do not create a submodule if the autocast args match the current autocast status.

Test Plan:
CI

```
parsh --build-flags fbcode//mode/dev-nosan  fbcode//caffe2/test:test_export
run_tests("test_predispatch_autocast")
```

Reviewed By: angelayi

Differential Revision: D60206382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131914
Approved by: https://github.com/angelayi
2024-08-05 18:52:12 +00:00
8d9c3a71f6 Support IPC for Expandable Segments (#130890)
This reapplication commit is the same as before except it resolves a build error in an internal build where `handle` was shadowed.

Differential Revision: [D60547506](https://our.internmc.facebook.com/intern/diff/D60547506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130890
Approved by: https://github.com/dsjohns2
2024-08-05 18:48:13 +00:00
618e2c9de4 fix torch rec test failure (#132437)
Summary: Fixes T192448049. The module call forms an unusual call stack for the nodes: https://www.internalfb.com/phabricator/paste/view/P1507230978. This is currently not supported by the unflattener and needs some extra design to make it work.

Test Plan: buck2 run 'fbcode//mode/opt' torchrec/distributed/tests:test_pt2 -- --filter-text "test_sharded_quant_fpebc_non_strict_export"

Reviewed By: zhxchen17

Differential Revision: D60528900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132437
Approved by: https://github.com/Skylion007
2024-08-05 18:06:07 +00:00
1c7dc335f7 [ROCm][CK][Inductor] Enable addmm for CK backend to gemm max autotune (#130576)
Add functional support for torch.addmm with CK backend. See also #125453

# Implementation details
1. It turns out we can use the same template between addmm and matmul; essentially, matmul is addmm with empty bias
2. The Python generator in CK was updated to generate the shared cpp template. The pip package can be installed from `pip install git+https://github.com/rocm/composable_kernel@add-addmm` and will be merged into `develop` branch after this PR lands to avoid breaking the current matmul
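
Point 1 rests on the identity `addmm(bias, a, b) = beta * bias + alpha * (a @ b)`. A pure-Python sketch with nested lists, using an all-zero bias as a stand-in for the "empty bias" case (that substitution is an assumption made for illustration):

```python
def matmul(a, b):
    # Plain triple-loop matrix product over nested lists.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def addmm(bias, a, b, beta=1.0, alpha=1.0):
    # beta * bias + alpha * (a @ b), elementwise over the product's shape.
    prod = matmul(a, b)
    return [
        [beta * bias[i][j] + alpha * prod[i][j] for j in range(len(prod[0]))]
        for i in range(len(prod))
    ]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
zero = [[0.0, 0.0], [0.0, 0.0]]
assert addmm(zero, a, b) == matmul(a, b)
```

Because the bias term drops out when it contributes nothing, one kernel template can serve both operations.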

# Testing
`pytest test/inductor/test_ck_backend.py -k addmm`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130576
Approved by: https://github.com/chenyang78
2024-08-05 17:49:09 +00:00
7b2664ece6 Temp disable MKL in DistributionKernels.cpp (#132532)
Until https://github.com/pytorch/pytorch/issues/132395 is addressed

Test plan: Add test based on the script below (taken from https://discuss.pytorch.org/t/bug-in-torch-multinomial-generated-distribution-is-modestly-incorrect-edit-this-is-a-regression-and-appears-to-be-due-to-an-analogous-bug-in-tensor-exponential )
```python
import torch

high_bits_for_seed = 16000000000000000000           # to use "good quality" seed
_ = torch.manual_seed (high_bits_for_seed + 2024)

prob = torch.ones (26)
dups_mult = 0
perm_counts_mult = {}
for _ in range (1_000_000):
    p = tuple (torch.multinomial (prob, prob.numel(), replacement=False).tolist())
    if  p in perm_counts_mult:
        dups_mult += 1
        perm_counts_mult[p] += 1
    else:
        perm_counts_mult[p] = 1

print ('duplicate multinomial perms: ', dups_mult)
print ('multiple multinomial perms:  ', (torch.tensor (list (perm_counts_mult.values())) > 1).sum().item())
print ('max of perm_counts_mult:     ', torch.tensor (list (perm_counts_mult.values())).max().item())
print ('len (perm_counts_mult):      ', len (perm_counts_mult))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132532
Approved by: https://github.com/albanD
2024-08-05 17:40:57 +00:00
baa2483cea Revert "Refactor thunkify to return proper thunk abstraction (#132407)"
This reverts commit c65cb37657ef4f7fcd070a7e8e5121eb299919fd.

Reverted https://github.com/pytorch/pytorch/pull/132407 on behalf of https://github.com/ezyang due to td strikes again ([comment](https://github.com/pytorch/pytorch/pull/132407#issuecomment-2269577711))
2024-08-05 17:39:54 +00:00
cyy
d5045cceff [16/N] Fix clang-tidy warnings in jit (#132604)
Follows #132564

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132604
Approved by: https://github.com/Skylion007
2024-08-05 17:36:22 +00:00
e8645fa2b9 [Doc] fix some typos (found by codespell and typos) (#132544)
Applying doc fixes from PR https://github.com/pytorch/pytorch/pull/127267 - with CLA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132544
Approved by: https://github.com/kit1980
2024-08-05 17:21:56 +00:00
3d87dfc088 Add basic OpenReg module scaffolding with autograd (#131708)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131708
Approved by: https://github.com/ezyang
2024-08-05 17:07:11 +00:00
df59084012 Drop GIL around cudart APIs (#132520)
Noticed a hang where the stuck thread was blocked on a cudaHostUnregister
call, probably due to an internal CUDA deadlock caused by something
else; it was holding the GIL at the time and so blocked other Python
threads.

As far as I can tell, none of the cudart APIs require the GIL to be held,
nor are they marked as thread-unsafe.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132520
Approved by: https://github.com/LucasLLC, https://github.com/kirtiteja
2024-08-05 17:04:01 +00:00
6919e8baab [MPS] Add support for autocast in MPS (#99272)
Fixes https://github.com/pytorch/pytorch/issues/88415

Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99272
Approved by: https://github.com/malfet
2024-08-05 17:02:30 +00:00
d532c00c81 [test/torch_np] Fix usages of deprecated NumPy 2.0 APIs in numpy_tests (#131909)
Migrates usages of deprecated APIs in NumPy-2.0 per [numpy-2.0 migration guide](https://numpy.org/devdocs/numpy_2_0_migration_guide.html#numpy-2-0-migration-guide).

I did a grep on the old API usages (see list below) and these were only referenced in test files under `test/torch_np/numpy_tests/**/*.py`.

Specifically, migrates the usages of the following APIs:

1. `np.sctypes` → Access dtypes explicitly instead
2. `np.float_` → `np.float64`
3. `np.complex_` → `np.complex128`
4. `np.longcomplex` → `np.clongdouble`
5. `np.unicode_` → `np.str_`
6. `np.product` → `np.prod`
7. `np.cumproduct` → `np.cumprod`
8. `np.alltrue` → `np.all`
9. `np.sometrue` → `np.any`
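
The one-to-one renames above can be expressed as a small text rewrite. This is a hypothetical helper for illustration, not the tooling used in the PR; `np.sctypes` is left out because it has no single replacement, and the helper assumes NumPy is imported as `np`.

```python
import re

# Renames copied from the list above (np.sctypes intentionally omitted).
RENAMES = {
    "np.float_": "np.float64",
    "np.complex_": "np.complex128",
    "np.longcomplex": "np.clongdouble",
    "np.unicode_": "np.str_",
    "np.product": "np.prod",
    "np.cumproduct": "np.cumprod",
    "np.alltrue": "np.all",
    "np.sometrue": "np.any",
}

# Match a deprecated name only as a whole identifier, so that e.g.
# np.float_power is left untouched.
_PATTERN = re.compile(
    r"(?<![\w.])(" + "|".join(re.escape(k) for k in RENAMES) + r")(?!\w)"
)


def migrate(source):
    """Return source with deprecated NumPy names replaced."""
    return _PATTERN.sub(lambda m: RENAMES[m.group(1)], source)


print(migrate("x = np.product(a.astype(np.float_))"))
```

A plain `str.replace` would also rewrite `np.float_power`, which is why the pattern checks the characters on both sides of each match.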

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131909
Approved by: https://github.com/rgommers, https://github.com/Skylion007, https://github.com/atalman
2024-08-05 16:21:08 +00:00
a672f6c84e [inductor] unificate SUBPROCESS_DECODE_ARGS variable in cpp_builder.py (#132615)
[inductor] unificate SUBPROCESS_DECODE_ARGS variable in cpp_builder.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132615
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-08-05 16:00:35 +00:00
9945caec65 [inductor] Fix autotune non-close attr crash on Windows (#132630)
I enabled the `autotune`-related UTs on Windows and hit the following failure:
<img width="1364" alt="Image" src="https://github.com/user-attachments/assets/b0c9c516-419d-47d0-a4c1-e90c98109d02">

I found that the failure is caused by a missing `close` attribute on Windows. I checked that the DLL type is `CDLL`, which doesn't have a `close` attribute.
This PR checks for the `close` attribute before performing the close operation.
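
The guard can be sketched as below. The stand-in classes and `close_if_possible` are hypothetical and only mirror the shape of the check; the actual change lives in Inductor's benchmark request code, which is not shown here.

```python
class FakeCDLL:
    """Stand-in for a handle type that exposes no close() method."""


class FakeWrapper:
    """Stand-in for a wrapper type that does provide close()."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


def close_if_possible(dll):
    # Call close() only when the object provides it; report whether we did.
    close = getattr(dll, "close", None)
    if callable(close):
        close()
        return True
    return False


wrapper = FakeWrapper()
print(close_if_possible(FakeCDLL()), close_if_possible(wrapper), wrapper.closed)
```

The first call is a silent no-op, which avoids the crash but also leaves that handle open.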

<img width="1624" alt="Image" src="https://github.com/user-attachments/assets/14093900-4ad8-4673-839e-7ba1410c5656">

After this fix, the UTs passed.

Here are some remaining issues:
1. `CDLL` doesn't have a `close` attribute, so the DLLs are never closed, although this didn't crash on Linux.
2. This PR only avoids the crash on Windows; it doesn't actually close the DLL either.

**TODO:**
We need to replace `CDLL` with `DLLWrapper` in `CppBenchmarkRequest`, as in `CUDABenchmarkRequest`. I have added a tracking task: https://github.com/pytorch/pytorch/issues/124245 , and will follow up on this change in a later PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132630
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-08-05 16:00:27 +00:00
a8490a0762 [traced-graph][sparse] propagate sparsity in fx graph (#131920)
This PR proceeds with implementing the feature request #117188 by generalizing more cases that already work with COO to work with the compressed sparse formats as well.

Feature request:
https://github.com/pytorch/pytorch/issues/117188

Rebranch of older PRs (for history):
https://github.com/pytorch/pytorch/pull/131474
https://github.com/pytorch/pytorch/pull/128549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131920
Approved by: https://github.com/ezyang
2024-08-05 15:49:53 +00:00
14edd986b3 Fix missing include file (#132647)
This error only appears with newer gcc releases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132647
Approved by: https://github.com/Skylion007
2024-08-05 15:49:49 +00:00
70cb16b316 [DTensor] Added naive replicate strategy for more diagonal ops (#132201)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132201
Approved by: https://github.com/wz337
ghstack dependencies: #132104
2024-08-05 15:18:56 +00:00
c65cb37657 Refactor thunkify to return proper thunk abstraction (#132407)
This is superior to lru_cache because (1) it's more explicit and (2) it
doesn't leak the original function after it's been forced.
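
A minimal sketch of what such a thunk abstraction might look like; the class below is hypothetical and is not the code from the PR. It shows both stated properties: forcing is explicit, and the wrapped function is dropped once the value is computed.

```python
class Thunk:
    """Runs a zero-argument function at most once and caches its result."""

    def __init__(self, fn):
        self._fn = fn
        self._forced = False
        self._value = None

    def force(self):
        if not self._forced:
            self._value = self._fn()
            self._forced = True
            self._fn = None  # release the closure so it can be collected
        return self._value


calls = []
t = Thunk(lambda: calls.append("run") or 42)
assert t.force() == 42 and t.force() == 42
assert calls == ["run"]
```

By contrast, an `lru_cache`-wrapped function keeps a reference to the original callable for as long as the wrapper is alive.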

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132407
Approved by: https://github.com/albanD
ghstack dependencies: #131649
2024-08-05 14:42:40 +00:00
b465a5843b DTensor: add more foreach ops to supported sharding prop list (#132066)
fixes https://github.com/pytorch/pytorch/issues/132016.

Right now, if you run an op for which DTensor has no sharding prop rule, **and** that op accepts non-trivial pytrees of input tensors as arguments, DTensor can end up looping infinitely before it has the chance to error out for lack of a sharding prop rule.

This PR doesn't fix the problem, but adds rules for the culprit ops (missing foreach ops).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132066
Approved by: https://github.com/wanchaol
2024-08-05 13:51:59 +00:00
c3ee07c71c add missing profiler include in cpp code generation (#132419)
Summary:
When a user sets config.profiler_mark_wrapper_call, RECORD_FUNCTION annotations are added to the code. This requires importing the header <ATen/record_function.h>, but the conditional for doing so didn't check config.profiler_mark_wrapper_call.

Test Plan:
This case is already covered in test_profiler_mark_wrapper_call.

```
(pytorch-3.10) [gabeferns@devvm2252.cco0 ~/pytorch (missing-profile-include)]$ TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k CpuTests.test_profiler_mark_wrapper_call_cpu
stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('fxgraph_cache_miss', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.
----------------------------------------------------------------------
Ran 1 test in 8.080s

OK
```

Fixes https://github.com/pytorch/pytorch/issues/131339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132419
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-08-05 13:40:47 +00:00
b30d0916d9 [FSDP2] Added missing event wait (for future) (#132568)
Nothing is actually wrong currently, but we should add this in case we land https://github.com/pytorch/pytorch/pull/127032 in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132568
Approved by: https://github.com/weifengpy, https://github.com/Skylion007
2024-08-05 12:44:46 +00:00
fb87796d4f [DeviceMesh] Add supports for non-continuous slicing (#132310)
Removes the constraint that slices be continuous, allowing non-continuous slicing, and adds a unit test for 3D non-continuous slicing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132310
Approved by: https://github.com/wanchaol
2024-08-05 09:30:07 +00:00
27f61eba58 serde sympy functions (#132493)
Summary: Sympy functions appearing in symbolic expressions inside tensor metadata were not being deserialized properly.

Test Plan: updated test

Differential Revision: D60573150

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132493
Approved by: https://github.com/pianpwk
2024-08-05 08:08:50 +00:00
55b0c39d82 Reland "[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969)" (#132182)
Summary:
Reland #124969 by backing out D60397377 "Back out "[1/2] PT2 Inductor ComboKernels - Foreach cases  (#124969)""

The original diff D54134695 was reverted because of failure of ads nightly cogwheel tests.

The root cause: the logic for generating the mask in the Triton kernel needed an update after a recent refactoring of triton.py. This diff includes the fix for the root cause.

See D54134695 or #124969 for more details.

Test Plan:
Originally failed tests
f585704630
f585733786

Diff patched:
f586664028
f586663820

Differential Revision: D60458597

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132182
Approved by: https://github.com/Yuzhen11
2024-08-05 06:57:30 +00:00
ae44b8f410 [inductor] support vectorization for torch.argmax/min(float/int64_t)-> int64_t (#131016)
Support reduction argmin/max by scalar implementation.
TestPlan:
```
python test/inductor/test_cpu_repro.py -k test_argmax_argmin_with_nan_value
python test/inductor/test_cpu_repro.py -k test_argmin
python test/inductor/test_cpu_repro.py -k test_reduction_cpu_only
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131016
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-08-05 04:31:53 +00:00
1fb498d6e3 Add try except for _maybe_evaluate_static call in IndexPropagation (#132128)
Fixes the Inductor max-autotune mode failures of the below models:
- GPT2ForSequenceClassification
- PegasusForConditionalGeneration
- XGLMForCausalLM
- hf_GPT2
- tnt_s_patch16_224
```log
  File "/pytorch/torch/_inductor/index_propagation.py", line 329, in statically_true
    evaluated = self.shape_env._maybe_evaluate_static(
  File "/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1499, in wrapper
    return fn_cache(self, *args, **kwargs)
  File "/pytorch/torch/fx/experimental/symbolic_shapes.py", line 4539, in _maybe_evaluate_static
    vr = var_ranges[k]
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fx_wrapper' raised:
KeyError: m_start
```

The `_maybe_evaluate_static` call in `IndexPropagation` may fail. This PR wraps it in a try/except, following the approach in `torch/_inductor/sizevars.py`, by adding a common utility function.
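
The pattern can be sketched as follows. The utility's real name and the exact exceptions it catches are not shown in this log, so `statically_known_true` and `flaky_evaluate` are hypothetical stand-ins.

```python
def statically_known_true(evaluate, expr):
    """Return True only if evaluate proves expr; treat failures as unknown."""
    try:
        return evaluate(expr) is True
    except Exception:
        # An evaluator that cannot handle expr tells us nothing, so be
        # conservative instead of aborting compilation.
        return False


def flaky_evaluate(expr):
    # Toy evaluator that only knows about the symbol "n".
    var_ranges = {"n": (0, 10)}
    lo, hi = var_ranges[expr]  # raises KeyError for unknown symbols
    return lo >= 0


print(statically_known_true(flaky_evaluate, "n"))
print(statically_known_true(flaky_evaluate, "m_start"))
```

The second call mirrors the `KeyError: m_start` in the traceback above: the lookup fails, and the wrapper answers "not known" rather than raising.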

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132128
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-08-05 01:02:51 +00:00
c7cfa51721 Always use high precision for SDPA math backend (#128922)
Summary:
feikou observed large numerical gaps when using the math backend on AMD and NV GPUs. This is mainly because we are not using higher-precision FP32 for the intermediate accumulated/materialized parts.

Since the math backend is expected to be slower anyway, and we expect it to generate the correct reference result, I think it is worth upcasting FP16/BF16 inputs to FP32, doing the computation in FP32/TF32, and then downcasting the FP32 output back to FP16/BF16.
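
The effect of upcasting can be seen even without a GPU. The sketch below emulates FP16 and FP32 rounding with the standard `struct` module and compares a running sum kept in half precision against one kept in single precision and downcast only at the end. It illustrates the general upcast/compute/downcast idea on a plain summation, not the SDPA math backend itself; all function names are hypothetical.

```python
import struct


def to_fp16(x):
    # Round a Python float to the nearest IEEE half-precision value.
    return struct.unpack("e", struct.pack("e", x))[0]


def to_fp32(x):
    # Round a Python float to the nearest IEEE single-precision value.
    return struct.unpack("f", struct.pack("f", x))[0]


def sum_fp16(values):
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + v)  # every intermediate is rounded to FP16
    return acc


def sum_upcast(values):
    acc = 0.0
    for v in values:
        acc = to_fp32(acc + v)  # intermediates kept in FP32
    return to_fp16(acc)         # downcast only the final result


values = [to_fp16(0.1)] * 4000
exact = sum(values)
err_fp16 = abs(sum_fp16(values) - exact)
err_upcast = abs(sum_upcast(values) - exact)
assert err_upcast < err_fp16
```

The half-precision accumulator stalls once the spacing between representable values exceeds the addend, which is the kind of gap that upcasting the intermediates avoids.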

Differential Revision: D58710805

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128922
Approved by: https://github.com/xw285cornell, https://github.com/drisspg
2024-08-04 23:58:14 +00:00
01cdcbf7c8 [dynamo] revert map/zip iterator related changes (#132528)
Need to revert due to internal hangs: S437700

This reverts commit b6c1490cc02316ffe85e5ae74651d80f0158ba64.

Revert "[dynamo] implement IteratorVariable and polyfill fallbacks for enumerate (#131725)"

This reverts commit 2576dbbc35d66e8e9ed6cb12216ccc424cb87ec3.

Revert "[dynamo] add itertools repeat/count bytecode reconstruction (#131716)"

This reverts commit 35b4de32fafc5ad024c20ef1275711bffc557ae9.

Revert "[dynamo] add lazy IteratorVariable implementations for map and zip (#131413)"

This reverts commit 7d282d87550787d8269593093519c2ad7c5032cd.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132528
Approved by: https://github.com/ZainRizvi
2024-08-04 18:46:55 +00:00
09f9c256ad Add basic mypy annotations to inductor (#132416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132416
Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu
ghstack dependencies: #132415
2024-08-04 18:43:37 +00:00
6e79932543 Add basic mypy annotations to dynamo (#132415)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132415
Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu
2024-08-04 18:43:36 +00:00
3558a8cf4a Revert "Add basic mypy annotations to dynamo (#132415)"
This reverts commit 71e22e0959eb8d5a66833bf5c6b5903536a5bef1.

Reverted https://github.com/pytorch/pytorch/pull/132415 on behalf of https://github.com/ZainRizvi due to Sorry, this PR has entered a weird state in the diff train. Trying to revert it to skip it, and then we can try relanding it ([comment](https://github.com/pytorch/pytorch/pull/132415#issuecomment-2267631785))
2024-08-04 18:39:29 +00:00
f2ddd5e9e0 Revert "Add basic mypy annotations to inductor (#132416)"
This reverts commit 78927d37f6085a0b30269cceb731d8097302c091.

Reverted https://github.com/pytorch/pytorch/pull/132416 on behalf of https://github.com/ZainRizvi due to Sorry, this PR has entered a weird state in the diff train. Trying to revert it to skip it, and then we can try relanding it ([comment](https://github.com/pytorch/pytorch/pull/132415#issuecomment-2267631785))
2024-08-04 18:39:29 +00:00
9be33bc584 Revert "[inductor] Add type hints to functions in mkldnn_fusion.py (#131820)"
This reverts commit 6c65fd03942415b68040e102c44cf5109d2d851e.

Reverted https://github.com/pytorch/pytorch/pull/131820 on behalf of https://github.com/ZainRizvi due to Sorry, had to revert this to revert another PR that depends on this change ([comment](https://github.com/pytorch/pytorch/pull/131820#issuecomment-2267629534))
2024-08-04 18:30:59 +00:00
0a25666f92 Revert "[dynamo] revert map/zip iterator related changes (#132528)"
This reverts commit e81e74ca6cb45e1ab831ddfe9a2ba5c7e17fa03f.

Reverted https://github.com/pytorch/pytorch/pull/132528 on behalf of https://github.com/ZainRizvi due to This stack entered a weird state in the diff train. Reverting and relanding to clean the state ([comment](https://github.com/pytorch/pytorch/pull/132528#issuecomment-2267628475))
2024-08-04 18:26:09 +00:00
fd4b649e6c [BE]: Simplify some list comps to generators C419 (#132578)
Simplifies some list comprehensions to generator expressions, which are more efficient. The diffs were, for the most part, applied automatically with ruff.
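
Assuming C419 is the flake8-comprehensions rule about list comprehensions passed to `any()`/`all()`, the gain comes from short-circuiting. A small sketch (the helper `is_positive` is hypothetical):

```python
def is_positive(x, log):
    log.append(x)  # record every element that is actually inspected
    return x > 0


data = [-1, 2, 3, 4]

log_list = []
any([is_positive(x, log_list) for x in data])  # builds the whole list first

log_gen = []
any(is_positive(x, log_gen) for x in data)     # stops at the first True

print(len(log_list), len(log_gen))
```

The list version evaluates the predicate for every element, while the generator version stops as soon as `any()` has its answer and never allocates the intermediate list.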

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132578
Approved by: https://github.com/ezyang
2024-08-04 17:46:26 +00:00
4226ed1585 [BE] Format uncategorized Python files with ruff format (#132576)
Remove patterns `**`, `test/**`, and `torch/**` in `tools/linter/adapters/pyfmt_linter.py` and run `lintrunner`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132576
Approved by: https://github.com/ezyang, https://github.com/Skylion007
ghstack dependencies: #132574
2024-08-04 17:13:31 +00:00
c35061c542 Migrate Python code formatter from black to ruff format (#132574)
See also:

- #124845
- #123062

Closes #124845
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132574
Approved by: https://github.com/ezyang
2024-08-04 17:13:31 +00:00
09fcd792eb [Fix]: ScriptObject lifting issue (#130952)
#### Issue
ScriptObject was previously treated as a normal attribute by the converter. This PR lifts it to a constant and converts it directly to a GetAttr fx node. ScriptObject can also trigger `CallMethod`, and this PR adds support for that as well.

#### Test Plan
Add test case for ScriptObject.
`pytest test/export/test_converter.py -s -k test_convert_script_object`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130952
Approved by: https://github.com/angelayi
2024-08-04 16:52:45 +00:00
5dac4d2c78 Revert "[easy] fix f-string messages in torch/_ops.py (#132531)"
This reverts commit 908d2a153b14cbb7a39c1f4ef9a77534cf2c71bf.

Reverted https://github.com/pytorch/pytorch/pull/132531 on behalf of https://github.com/davidberard98 due to still breaks tests ([comment](https://github.com/pytorch/pytorch/pull/132531#issuecomment-2267584289))
2024-08-04 15:41:56 +00:00
cyy
105ba7b58c [5/N] Fix clang-tidy warnings in aten/src/ATen (#132565)
Follows #132001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132565
Approved by: https://github.com/Skylion007
2024-08-04 14:39:16 +00:00
908d2a153b [easy] fix f-string messages in torch/_ops.py (#132531)
I encountered these when making this change:

```
diff --git a/test/functorch/test_ac.py b/test/functorch/test_ac.py
index 3a2e07fa147..a4d003399e7 100644
--- a/test/functorch/test_ac.py
+++ b/test/functorch/test_ac.py
@@ -259,15 +259,8 @@ class MemoryBudgetTest(TestCase):

         expected = call()
         for budget in range(0, 11):
-            memory_budget = budget / 10
-            torch._dynamo.reset()
-            with config.patch(activation_memory_budget=memory_budget):
-                if memory_budget is not None:
-                    f_compile = torch.compile(
-                        call, backend="aot_eager_decomp_partition"
-                    )
-
-                self.assertEqual(expected, f_compile())
+            get_mem_and_flops(call, memory_budget=budget / 10)
+

     def test_prioritize_cheaper_matmul(self):
         def f(xs, ws):
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132531
Approved by: https://github.com/Skylion007
ghstack dependencies: #132356, #132466
2024-08-04 14:30:42 +00:00
87d46d70d7 [inductor] export kernel for gemm template. (#132580)
Changes:
1. Move `get_export_declaration` to `cpp_utils.py` as basic function.
2. Export kernel for gemm template.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132580
Approved by: https://github.com/ezyang
2024-08-04 11:17:19 +00:00
d2dc173664 Remove lint dependency ufmt (#132573)
`ufmt` is a combination of `black + usort`.

This PR removes `ufmt` and runs `black` and `usort` separately.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132573
Approved by: https://github.com/ezyang
ghstack dependencies: #129769, #132572
2024-08-04 10:24:09 +00:00
f7aeb394b6 [BE][Easy] Remove empty ISORT_SKIPLIST (#132572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132572
Approved by: https://github.com/ezyang, https://github.com/justinchuby
ghstack dependencies: #129769
2024-08-04 10:24:09 +00:00
f3fce597e9 [BE][Easy][17/19] enforce style for empty lines in import segments in torch/[a-c]*/ and torch/[e-n]*/ (#129769)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129769
Approved by: https://github.com/ezyang
2024-08-04 10:24:09 +00:00
2714adce20 [caffe2] Fix compiling ATen-hip in non-opt mode (#132581)
Summary:
It looks like https://github.com/pytorch/pytorch/pull/131894 accidentally broke non-opt hip builds. I.e. `is_flash_attention_available` doesn't get inlined in non-opt mode, so all of `can_use_flash_attention` is compiled into the final object file. This includes a reference to `aotriton::v2::flash::check_gpu`, which we haven't set up yet for HIP builds.

Test Plan:
CI

Differential Revision: D60720707

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132581
Approved by: https://github.com/jianyuh, https://github.com/xw285cornell
2024-08-04 07:51:18 +00:00
cyy
522fa03e91 [Submodule] Bump ONNX to v1.16.2 (#132566)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132566
Approved by: https://github.com/justinchuby
2024-08-04 07:01:54 +00:00
2a8e94347f [TP] verify numeric parity on Transfromers for multiple iterations (#132543)
Before setting up the float8 numeric parity test, I have to set up a regular TP numeric parity test, preferably testing 10 iterations.

This PR sets a baseline for TP numerics. I can verify fp8 on top of it.

Approved by: https://github.com/tianyu-l
ghstack dependencies: #132350
2024-08-04 06:43:27 +00:00
8ff310392e add __torch_function__ handler to get_device cpp (#132567)
From the issue:
```
import torch

class CustomParameter(torch.nn.Parameter):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
         return func.__name__

x = CustomParameter(torch.rand(2))

print(x.square()) # 'square'
print(torch.square(x)) # 'square'
print(x.get_device()) # 'get_device'
print(torch.get_device(x)) # -1
```
after fix:
```
$ python repro.py
square
square
get_device
get_device
```

Fixes: https://github.com/pytorch/pytorch/issues/131944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132567
Approved by: https://github.com/ezyang
2024-08-04 04:26:30 +00:00
7f8a384a8f [inductor] add msvc_cl compiler check (#132571)
add `msvc_cl` compiler check.
Local test:
<img width="880" alt="image" src="https://github.com/user-attachments/assets/fe4da5e0-dd52-4dbc-831e-c32479e27a29">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132571
Approved by: https://github.com/ezyang
2024-08-04 03:48:25 +00:00
81b8d3586f Update torch-xpu-ops pin (ATen XPU implementation) (#132390)
Regular update.
1. 69 new ATen operators and variants are added. See https://github.com/intel/torch-xpu-ops/blob/main/yaml/xpu_functions.yaml.
2. Align with PyTorch in-tree to use safe data pointer access APIs.
3. Enable FP64 conversion emulation for some platforms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132390
Approved by: https://github.com/EikanWang
2024-08-04 02:22:46 +00:00
6ec4af6865 [Inductor][CPP] Add vectorization support for double (#131886)
Before:
```
extern "C"  void kernel(const double* in_ptr0, double* out_ptr0)
{
     #pragma omp parallel num_threads(112)
     {
         int tid = omp_get_thread_num();
         {
             #pragma omp for
             for(long x0=static_cast<long>(0L); x0<static_cast<long>(1024L); x0+=static_cast<long>(1L))
             {
                 auto tmp0 = in_ptr0[static_cast<long>(x0)];
                 auto tmp1 = decltype(tmp0)(tmp0 * tmp0);
                 out_ptr0[static_cast<long>(x0)] = tmp1;
             }
         }
     }
 }
```

After:
```
extern "C"  void kernel(const double* in_ptr0, double* out_ptr0)
{
    #pragma omp parallel num_threads(112)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(1024L); x0+=static_cast<long>(16L))
            {
                auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<long>(x0), 16);
                auto tmp1 = tmp0 * tmp0;
                tmp1.store(out_ptr0 + static_cast<long>(x0), 16);
            }
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131886
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-08-04 02:13:21 +00:00
d984105748 Revert "[export] Convert autocast to HOO (#131914)"
This reverts commit b28c01d90d6575522d2240ce485d7dd87a7242aa.

Reverted https://github.com/pytorch/pytorch/pull/131914 on behalf of https://github.com/ezyang due to Failing lint, but was covered up by master failure on lint ([comment](https://github.com/pytorch/pytorch/pull/131914#issuecomment-2267248773))
2024-08-04 02:10:35 +00:00
6c65fd0394 [inductor] Add type hints to functions in mkldnn_fusion.py (#131820)
Summary: ATT

Test Plan: lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131820
Approved by: https://github.com/eellison
2024-08-03 22:11:47 +00:00
cyy
bc46f205c4 [15/N] Fix clang-tidy warnings in jit (#132564)
Follows  #132477

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132564
Approved by: https://github.com/Skylion007
2024-08-03 19:33:24 +00:00
00097f3458 Revert "C++ network flow implementation in c10 (#132188)"
This reverts commit dccce77935bb023f225b9972929fd9213e754e84.

Reverted https://github.com/pytorch/pytorch/pull/132188 on behalf of https://github.com/ZainRizvi due to Sorry but this appears to be failing internal tests. Please see D60702564 to investigate ([comment](https://github.com/pytorch/pytorch/pull/132188#issuecomment-2267098420))
2024-08-03 18:44:28 +00:00
e3387c6712 [inductor] use uint64_t replace long to add Windows support. (#132491)
The width of the `long` type differs between `Windows` and `Linux`.
This PR uses `int64_t` instead of `long` on Windows. The `LL` suffix is used to initialize `int64_t` values.
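
The width difference can be checked from Python. The sketch below uses only the standard `ctypes` module; it illustrates the platform difference and is not code from this PR.

```python
import ctypes

# Width of C `long` depends on the platform's data model:
# 4 bytes on 64-bit Windows (LLP64), 8 bytes on 64-bit Linux (LP64).
long_bytes = ctypes.sizeof(ctypes.c_long)

# `int64_t` is 8 bytes on every platform, so generated code can rely on it.
int64_bytes = ctypes.sizeof(ctypes.c_int64)

print(f"long: {long_bytes} bytes, int64_t: {int64_bytes} bytes")
```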

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132491
Approved by: https://github.com/malfet
2024-08-03 18:38:30 +00:00
bbce517221 [Inductor][FlexAttention] TestFlexAttention -> TestFlexDecoding (#132547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132547
Approved by: https://github.com/Chillee
ghstack dependencies: #132015
2024-08-03 17:26:44 +00:00
21d02f8b4b Revert "[easy] fix f-string messages in torch/_ops.py (#132531)"
This reverts commit 25903f3932b3a24d4edf323484d2159f3ac92999.

Reverted https://github.com/pytorch/pytorch/pull/132531 on behalf of https://github.com/davidberard98 due to broke lint and tests due to conflict with 132377 ([comment](https://github.com/pytorch/pytorch/pull/132531#issuecomment-2266743391))
2024-08-03 14:49:07 +00:00
a896fb1b36 check unsupported sympy functions for runtime asserts (#132457)
Some sympy Functions aren't supported by sympy_interp(); we can't turn them into FX nodes, so currently the runtime asserts CSE pass avoids CSE'ing on any expression containing a sympy Function. https://github.com/pytorch/pytorch/pull/132325 started tracking unsupported functions, so we switch the check to that to be more precise. We also check for and skip unsupported functions when adding asserts - previously we only did the check for CSE, and not adding new expressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132457
Approved by: https://github.com/avikchaudhuri
2024-08-03 10:17:25 +00:00
0e7e61f7ce Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)
This PR is split from PR #126898.

- #126898

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-08-03 09:43:38 +00:00
159d508f03 [Fix]: prim::If with multiple outputs and input return directly (#131779)
#### Issue
Test is not working for prim::Loop with multiple outputs. Additionally fix issue where input is directly returned, which is not supported by HigherOrderOp.

#### Test Plan
`pytest test/export/test_converter.py -s -k test_convert_if_multiple_out`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131779
Approved by: https://github.com/angelayi, https://github.com/SherlockNoMad
2024-08-03 08:07:21 +00:00
36ec0fdf10 [inductor] check compiler exist on Windows. (#132533)
On Windows, if the MSVC environment is not activated, the current code does not raise a clear error pointing to the compiler:
<img width="904" alt="image" src="https://github.com/user-attachments/assets/725ea608-d181-40b1-8930-42fe2b32643a">

With this PR, the error tells users that the issue comes from the compiler.
<img width="1034" alt="image" src="https://github.com/user-attachments/assets/8515a796-e3e9-4909-a68f-8a14d4864951">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132533
Approved by: https://github.com/jansel
2024-08-03 07:47:11 +00:00
8ad9f89ccc [inductor] Reland: Add flag to ignore unsupported @triton.autotune args in user-written kernel compilation (#132562)
Summary:
This is a reland attempt of [#131431](https://github.com/pytorch/pytorch/pull/131431), as, in its original form, the PR has caused issues internally.

We currently don't support some of the `triton.autotune` arguments when compiling user-written Triton kernels with PT2. In this PR, we're adding a flag to circumvent it. This is to unblock internal compilation in some cases. The flag is supplied with the docs mentioning why it is not a good idea to set it.

Test Plan:
```
python test/inductor/test_triton_kernels.py -k test_triton_kernel_autotune_with_unsupported_args
...
----------------------------------------------------------------------
Ran 3 tests in 3.636s

OK
```

Differential Revision: D60701839

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132562
Approved by: https://github.com/chenyang78
2024-08-03 06:31:28 +00:00
06581c277a [dynamo][stable-diffusion] Support dict(obj) on constrained subclasses of dict and OrderedDict (#132558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132558
Approved by: https://github.com/jansel
2024-08-03 06:31:00 +00:00
b28c01d90d [export] Convert autocast to HOO (#131914)
Summary:
Suggested in https://github.com/pytorch/pytorch/issues/128394.

If there's an autocast context manager, the predispatch (strict) graph can look something like:

```
class <lambda>(torch.nn.Module):
    def forward(self, x: "f32[1]"):
        ...
        _enter_autocast = torch.amp.autocast_mode._enter_autocast('cuda', torch.bfloat16, True, None)
        mm: "f32[8, 8]" = torch.ops.aten.mm.default(rand, rand_1);  rand = rand_1 = None
        _exit_autocast = torch.amp.autocast_mode._exit_autocast(_enter_autocast);  _enter_autocast = None
        return (mm_1,)
```

But the operator `torch.amp.autocast_mode._enter_autocast` is not a valid ATen op. We remove these nodes by turning autocast into a higher order operator and make a submodule for the blocks between `_enter_autocast` and `_exit_autocast`.

Some potential followup improvement:
1) Merge some of the duplicated logic with `replace_set_grad_with_hop_pass.py`
2) Check the current autocast status (any enabled? dtype?) and not create a submodule if the autocast args matches current autocast status.

Test Plan:
CI

```
parsh --build-flags fbcode//mode/dev-nosan  fbcode//caffe2/test:test_export
run_tests("test_predispatch_autocast")
```

Reviewed By: angelayi

Differential Revision: D60206382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131914
Approved by: https://github.com/angelayi
2024-08-03 05:48:57 +00:00
ed4493de0e dim name is identifier (#132557)
Summary: Dim names appear in suggested fixes so should be valid Python identifiers.
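
A minimal sketch of what "valid Python identifier" means in practice. The helper `is_valid_dim_name` is hypothetical and is not the check added by this PR; rejecting reserved keywords is an extra assumption on top of the summary above.

```python
import keyword


def is_valid_dim_name(name: str) -> bool:
    # Hypothetical helper: a name is usable in generated code only if it is
    # a syntactically valid identifier and not a reserved keyword.
    return name.isidentifier() and not keyword.iskeyword(name)


print(is_valid_dim_name("batch"))   # a plain name
print(is_valid_dim_name("my-dim"))  # contains a hyphen
```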

Test Plan: none

Differential Revision: D60696854

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132557
Approved by: https://github.com/pianpwk
2024-08-03 05:28:50 +00:00
1f5dfe00da Subtracer should always be real to inherit fake/real tensors from parent config (#132488)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132488
Approved by: https://github.com/zou3519
2024-08-03 04:55:42 +00:00
6966d44eda [ONNX] Rename _internal/exporter to _exporter_legacy (#132429)
The next PR will be creating an `exporter` directory to house logic from `torch-onnx`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132429
Approved by: https://github.com/titaiwangms
2024-08-03 04:23:05 +00:00
5973aec671 [fx] python_code(verbose=True): show size/strides for all tensors (#132192)
python_code(verbose=True) (or print_readable()) generates a string with the code representing the fx graph, with extra annotations indicating the size or stride of the tensor. Currently, it only shows sizes/strides for FakeTensors provided in metadata. For subclass tensors like NestedTensor, the outer class (provided in the node metadata) will be a non-FakeTensor and the inner tensors will be fake. This PR expands the conditional to show sizes/strides for all tensors, not just FakeTensors.

Testing: I ran this test script (below), ran it with `TORCH_LOGS=+dynamo` and found in the logs the graph shown below - we see that the input nested tensor has sizes and strides associated with it. Also, I stacked a diff on top of this one that forces the readable graph to be generated whenever PT2 is in use in tests, which should hopefully find any issues; https://github.com/pytorch/pytorch/pull/132195 shows no significant failures except for preexisting failures.

test script:
```python
import torch

def fn(x):
    return x.cos()

nt = torch.nested.nested_tensor_from_jagged(
    torch.randn(10, 10),
    torch.tensor([0, 1, 3, 6, 10]),
)

torch.compile(fn)(nt)
```

logs excerpt:
```
[0/0] [__graph_code] TRACED GRAPH
[0/0] [__graph_code]  ===== __compiled_fn_1 =====
[0/0] [__graph_code]  /data/users/dberard/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.M

[0/0] [__graph_code]     def forward(self, L_x_: "f32[4, zf1, 10][10*zf1, 10, 1]cpu", zf1: "Sym(zf1)"):
[0/0] [__graph_code]         l_x_ = L_x_
[0/0] [__graph_code]
[0/0] [__graph_code]          # File: /data/users/dberard/scripts/nt_print_graph.py:4 in fn, code: return x.c

[0/0] [__graph_code]         cos: "f32[4, zf1, 10][10*zf1, 10, 1]cpu" = l_x_.cos();  l_x_ = None
[0/0] [__graph_code]         return (cos,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132192
Approved by: https://github.com/Chillee
2024-08-03 02:54:32 +00:00
0b571b1058 [codemod][pyre] Add missing Pyre mode headers (#132548)
Reviewed By: connernilsen

Differential Revision: D59849027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132548
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi
2024-08-03 02:32:53 +00:00
373e9be457 [Inductor][FlexAttention] Add kwarg to top level for users to specify kernel params (#132015)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132015
Approved by: https://github.com/Chillee
2024-08-03 02:27:02 +00:00
25903f3932 [easy] fix f-string messages in torch/_ops.py (#132531)
I encountered these when making this change:

```
diff --git a/test/functorch/test_ac.py b/test/functorch/test_ac.py
index 3a2e07fa147..a4d003399e7 100644
--- a/test/functorch/test_ac.py
+++ b/test/functorch/test_ac.py
@@ -259,15 +259,8 @@ class MemoryBudgetTest(TestCase):

         expected = call()
         for budget in range(0, 11):
-            memory_budget = budget / 10
-            torch._dynamo.reset()
-            with config.patch(activation_memory_budget=memory_budget):
-                if memory_budget is not None:
-                    f_compile = torch.compile(
-                        call, backend="aot_eager_decomp_partition"
-                    )
-
-                self.assertEqual(expected, f_compile())
+            get_mem_and_flops(call, memory_budget=budget / 10)
+

     def test_prioritize_cheaper_matmul(self):
         def f(xs, ws):
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132531
Approved by: https://github.com/Skylion007
ghstack dependencies: #132356, #132466
2024-08-03 02:23:44 +00:00
419b76c4ac [dynamo] Reland 132308, 132314, 132318, 132334 - Make builtin nn modules attributes static (#132539)
Relanding 4 PRs ending at https://github.com/pytorch/pytorch/pull/132334

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132539
Approved by: https://github.com/Skylion007, https://github.com/yanboliang, https://github.com/mlazos
2024-08-03 02:08:22 +00:00
841cadd555 Fix discrepancies from 129973 (#132545)
#129973 ([D59132793](https://www.internalfb.com/diff/D59132793)) was exported without its changes to `test/cpp/jit/CMakeLists.txt`; this PR remediates that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132545
Approved by: https://github.com/kit1980
2024-08-03 01:57:49 +00:00
243a763e1b ci: Remove split-build CUDA testing from pull.yml (#132537)
This is already represented in trunk.yml so it seems a bit redundant to include this level of testing in pull.yml.

I've been observing a large spike in our usage of `g3.4xlarge` which seems to correspond to these builds in particular so removing these from `pull.yml` since they are already covered in `trunk.yml`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132537
Approved by: https://github.com/ZainRizvi, https://github.com/malfet
2024-08-03 01:24:17 +00:00
a503136583 [export] Detect whether case_name is registered in exportdb (#132420)
Summary:
- moves logging functionalities into `torch/_export/db/logging.py` file.
- add a check in `_dynamo/eval_frame.py` to check for optional input and error out with `UnsupportedError`
- change the case name of `torch_sym_int` to `unsupported_operator`
- Check if the case name is registered in exportdb, if so, we give a link to the case in exportdb.
- TODO: add test

Test Plan:
CI

Running the example in https://pytorch.org/docs/main/generated/exportdb/index.html#optional-input gives the following error logging:

```
E0730 10:53:33.687000 4155538 torch/_dynamo/eval_frame.py:1086] Parameter y is optional with a default value of tensor([[-0.1633,  1.2414, -0.1071],
E0730 10:53:33.687000 4155538 torch/_dynamo/eval_frame.py:1086]         [-0.1936, -0.9425, -0.0824]])
E0730 10:53:33.688000 4155538 torch/export/_trace.py:1043] See optional_input in exportdb for unsupported case.                 https://pytorch.org/docs/main/generated/exportdb/index.html#optional-input
......
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/389acaeb40d57230/tutorials/pytorch/nntest/__torchtest__/torchtest#link-tree/torch/_dynamo/eval_frame.py", line 1091, in produce_matching
    raise Unsupported(
torch._dynamo.exc.Unsupported: Tracing through optional input is not supported yet
```

It also logs a `export.error.classified` event in Scuba.

Reviewed By: zhxchen17

Differential Revision: D60427208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132420
Approved by: https://github.com/zhxchen17
2024-08-03 01:08:48 +00:00
64720f3b89 Introduce checks to validate public API tests (#131390)
This PR introduces a new sanity check for the public API tests in `.ci/pytorch/test.sh`.
* Validates two public API tests:
    1. Ensures `test_correct_module_names` fails when a new file OR an existing file adds an invalid public API function (e.g. one whose `__module__` is unset).
    2. Ensures `test_modules_can_be_imported` fails when a module underneath `torch/` cannot be imported.
* Runs this in CI just before the pre-existing FC / BC checks.

I've verified that re-introducing the bug that #131386 fixed causes the new check to fail:
![public_api_failure](https://github.com/user-attachments/assets/376ddef3-d14a-41f6-93e2-f935deb6555a)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131390
Approved by: https://github.com/albanD
2024-08-03 00:29:00 +00:00
cyy
fcef6cc6d1 [13/N] Fix clang-tidy warnings in jit (#132477)
Follows  #132209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132477
Approved by: https://github.com/Skylion007
2024-08-03 00:13:18 +00:00
705ac311aa Fix Distributed EventList usage (#132448)
Summary: Summarized here: https://github.com/pytorch/pytorch/issues/132227

Test Plan: Use suggestion in issue, should see test passing again

Differential Revision: D60614690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132448
Approved by: https://github.com/aaronenyeshi
2024-08-02 23:55:31 +00:00
e3513fb2af [ts_converter]handle python list append, list add, aten.to.dtype+mutation_op pattern (#132529)
Summary:
#### Description
Add support for aten::append with a python function that returns a new list with the appended element. We then update the `fx_node` in the `name_to_node` mapping.
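
The idea can be sketched in plain Python. The names `list_append` and the dictionary contents below are illustrative assumptions; the real converter stores FX nodes in `name_to_node`, not Python lists.

```python
def list_append(lst, elem):
    # Return a new list rather than mutating `lst` in place, so the
    # operation stays functional.
    return lst + [elem]


# Hypothetical stand-in for the converter's name -> node mapping.
name_to_node = {"xs": [1, 2]}
original = name_to_node["xs"]

# After the "append", the name is rebound to the new value.
name_to_node["xs"] = list_append(original, 3)
```

Rebinding the name means later uses of `xs` see the appended list, while the original value is left untouched.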

aten::append contributed by Jiashen Cao <jiashenc@meta.com>

Fix conversion for csr_ranker_test

```
    model_name: csr_ranker_test_4.ptl
    has_ts_model: True
    has_sample_inputs: True
    ops_maybe_missing_meta: set()
    script_objects: set()
    ts_can_run: True
    ts_run_exception: None
    can_convert: True
    convert_exception: None
    ep_result_correct: True
    ep_run_exception: None
    can_package: True
    package_exception: None
    sigmoid_can_run: False
    sigmoid_run_exception: RuntimeError('not for symbolics')
    sigmoid_result_correct: None
```

Test Plan:
test_aten_add_t
test_aten_append_t
test_aten_to_dtype_with_mutating_storage

buck2 run mode/opt sigmoid/inference/ts_migration:main -- --mode test_one --model_name csr_ranker_test

Differential Revision: D60635893

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132529
Approved by: https://github.com/jiashenC
2024-08-02 23:32:37 +00:00
85f19ce14a Support meta["val"] that is a dict, for triton kernels and for the partitioner (#132466)
Internally there's a model that's using memory_budget with the partitioner, and using custom triton kernels. The partitioner fails when encountering the triton ops because they don't have `meta["val"]`. This PR adds `meta["val"]`  to these fx graph nodes and then adds handling for `meta["val"]` being a dict in the partitioner.

Differential Revision: [D60627813](https://our.internmc.facebook.com/intern/diff/D60627813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132466
Approved by: https://github.com/zou3519
ghstack dependencies: #132356
2024-08-02 23:24:29 +00:00
bcac71517c [Profiler] Test Logging for Empty Traces (#132444)
Summary: Tests D60311331. Please see that diff for explanation

Test Plan: This diff is adding a test itself

Reviewed By: aaronenyeshi

Differential Revision: D60311555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132444
Approved by: https://github.com/aaronenyeshi
2024-08-02 22:04:15 +00:00
1962f9475f [NJT][flop counter] attention: if offsets are fake, use max seqlen (#132356)
The flop counter is used by the partitioner, in which case the tensors passed in can be fake.

The flop computations for nested attention use the offsets to determine the actual amount of compute that will be done. But when the offsets are fake, we end up with unbacked symints (from `(offsets[1:] - offsets[:-1]).tolist()`). If we find that the offsets are fake or functional tensors, then use the max sequence length instead.
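
A minimal pure-Python sketch of the fallback, using lists in place of tensors. The function name and its parameters are hypothetical and do not mirror the flop counter's actual signature.

```python
def seqlens_for_flops(offsets, max_seqlen, num_sequences, offsets_are_fake):
    # Hypothetical helper: choose per-sequence lengths for flop counting.
    if offsets_are_fake:
        # The real values are unavailable, so use the max length as an
        # upper bound for every sequence.
        return [max_seqlen] * num_sequences
    # Otherwise each length is the difference of consecutive offsets.
    return [end - start for start, end in zip(offsets[:-1], offsets[1:])]


real = seqlens_for_flops([0, 1, 3, 6, 10], 4, 4, offsets_are_fake=False)
fake = seqlens_for_flops(None, 4, 4, offsets_are_fake=True)
```

With fake offsets the count is an overestimate, since every sequence is treated as having the maximum length.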

Repro: https://gist.github.com/davidberard98/903fb3e586edb6d1d466786e1a610eba

Differential Revision: [D60597463](https://our.internmc.facebook.com/intern/diff/D60597463)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132356
Approved by: https://github.com/soulitzer
2024-08-02 20:42:29 +00:00
37c3d503b7 [pipelining] Make test_schedule quiet (#132369)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132369
Approved by: https://github.com/H-Huang
ghstack dependencies: #129810, #130378
2024-08-02 20:38:17 +00:00
7c1cca9fda [pipelining] Add schedule send/recv pass (#130378)
Inserts send/recv ops where needed in a compute-only pipeline schedule.

Any F or B action will require a recv op for its input and a send op
for its output, except for at the ends of the pipeline.

To avoid hangs caused by mixed-up orderings of sends/recvs across ranks,
we pick one compute action at a time and insert both its send op (on
that rank's schedule), and the matching recv op for the recipient stage
(on the schedule for the rank for that stage).
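
The pairing rule above can be sketched as follows. This is a simplified illustration, not the pass from this PR: the action tuples, the `SEND_*`/`RECV_*` names, and the single global `compute_order` input are all assumptions, and same-rank transfers are not special-cased.

```python
from collections import defaultdict


def add_send_recv(compute_order, stage_to_rank, num_stages):
    """Insert SEND/RECV actions around compute actions.

    compute_order: (kind, stage, microbatch) tuples in dependency order,
    where kind is "F" (forward) or "B" (backward).
    """
    schedules = defaultdict(list)
    for kind, stage, mb in compute_order:
        rank = stage_to_rank[stage]
        schedules[rank].append((kind, stage, mb))
        # Forward output goes to the next stage, backward output to the previous one.
        dst = stage + 1 if kind == "F" else stage - 1
        if 0 <= dst < num_stages:  # nothing to send past either end of the pipeline
            # Add the send and its matching recv together so that their
            # relative order agrees across ranks.
            schedules[rank].append((f"SEND_{kind}", stage, mb))
            schedules[stage_to_rank[dst]].append((f"RECV_{kind}", dst, mb))
    return dict(schedules)


order = [("F", 0, 0), ("F", 1, 0), ("B", 1, 0), ("B", 0, 0)]
schedules = add_send_recv(order, stage_to_rank={0: 0, 1: 1}, num_stages=2)
```

Because each send is appended at the same moment as its recv, no rank can end up waiting on a recv whose send is scheduled behind another blocked operation.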

TODO
Currently ignores a couple of edge cases
- ignores batching (which is an optimization)
- ignores cases where a stage sends to another stage on the same rank,
  and should skip the send/recv and directly access memory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130378
Approved by: https://github.com/H-Huang
ghstack dependencies: #129810
2024-08-02 20:38:17 +00:00
625f494619 [Pipelining] Add schedule unshard/reshard pass (#129810)
Adds fsdp unshard/reshard ops to a compute-only schedule.

Operates on one pp-rank's schedule at a time, since there is no
cross-pp-rank coordination needed for FSDP.  (Unshard/Reshard is across
DP ranks within a PP group).

Uses a heuristic based on examining the next N stages to run compute
operations on this rank, evicting (resharding) and fetching (unsharding)
ahead of time to give unshard operations a chance to overlap with
compute and PP comms.
- this heuristic has not been validated and may not be optimal
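
One way to picture a lookahead heuristic of this kind is sketched below. It is an assumption-laden illustration, not the implementation in this PR: the `lookahead` parameter, the `(kind, stage)` action tuples, and the eviction rule are all hypothetical.

```python
def add_unshard_reshard(compute_actions, lookahead=2):
    """Wrap one rank's compute actions with UNSHARD/RESHARD actions.

    compute_actions: list of (kind, stage) tuples for a single rank.
    """
    out = []
    active = []  # stages that are currently unsharded
    for i, (kind, stage) in enumerate(compute_actions):
        # Stages needed by the next `lookahead` compute actions, in order.
        upcoming = []
        for _, s in compute_actions[i:i + lookahead]:
            if s not in upcoming:
                upcoming.append(s)
        # Evict stages that are not needed soon.
        for s in list(active):
            if s not in upcoming:
                out.append(("RESHARD", s))
                active.remove(s)
        # Fetch stages ahead of their compute so the fetch can overlap.
        for s in upcoming:
            if s not in active:
                out.append(("UNSHARD", s))
                active.append(s)
        out.append((kind, stage))
    # Reshard whatever is still unsharded at the end.
    out.extend(("RESHARD", s) for s in active)
    return out


plan = add_unshard_reshard([("F", 0), ("F", 2), ("B", 2), ("B", 0)], lookahead=2)
```

A larger `lookahead` fetches earlier, giving more room for overlap at the cost of keeping more stages unsharded at once.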

Makes the assumption that it's fine to add the UNSHARD/RESHARD actions
to the schedule regardless of if FSDP will actually be used.
- this way, users do not have to tell us at PP schedule creation time if
  they plan to use FSDP or DDP
- it is trivial to implement UNSHARD/RESHARD as no-ops inside the
  runtime, if FSDP is not detected on the stage module

TODO
- also add FSDP's reduce-scatter? or is it sufficient to leave this
  handled by PipelineStage at 'last backward' time
- validate 'next N stages' heuristic and expose an API if needed
- add an e2e test

Co-authored-by: Howard Huang <howardhuang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129810
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
2024-08-02 20:38:17 +00:00
f379bbd46d [dynamo] support inspect.signature.bind (#132330)
Fixes https://github.com/pytorch/pytorch/issues/93760.

This was not that small of a task...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132330
Approved by: https://github.com/jansel
ghstack dependencies: #132329
2024-08-02 20:37:05 +00:00
642257db1a Update the FQN for auto_functionalized HOO. (#132171)
Summary:
as title.

torch._higher_order_ops.auto_functionalize.auto_functionalized is a Python FQN, which should NOT be used to talk to the backends; we should use the standard FQN torch.ops.higher_order.auto_functionalized instead.

Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_custom_op_auto_functionalize_pre_dispatch

Differential Revision: D60468759

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132171
Approved by: https://github.com/SherlockNoMad
2024-08-02 20:34:50 +00:00
dccce77935 C++ network flow implementation in c10 (#132188)
The functorch partitioners use network flow to split the joint graph into a forward and backward graph. Internally, we've found that upgrading to networkx 2.8.8 (from 2.5) results in some hard-to-debug failures (internal reference: https://fburl.com/workplace/jrqwagdm). And I'm told that there's interest to remove the python dependency.

So this PR introduces a C++ implementation that mirrors the API provided by networkx. We'll need to add python bindings and do some additional testing to verify correctness.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132188
Approved by: https://github.com/Chillee
2024-08-02 20:30:59 +00:00
f49d5e30eb Change owners of test/test_transformers.py to module: multi-headed-attention (#132519)
So flaky tests get tagged with `module: multi-headed-attention` instead of `module: nn`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132519
Approved by: https://github.com/Skylion007
2024-08-02 20:12:33 +00:00
e81e74ca6c [dynamo] revert map/zip iterator related changes (#132528)
Need to revert due to internal hangs: S437700

This reverts commit b6c1490cc02316ffe85e5ae74651d80f0158ba64.

Revert "[dynamo] implement IteratorVariable and polyfill fallbacks for enumerate (#131725)"

This reverts commit 2576dbbc35d66e8e9ed6cb12216ccc424cb87ec3.

Revert "[dynamo] add itertools repeat/count bytecode reconstruction (#131716)"

This reverts commit 35b4de32fafc5ad024c20ef1275711bffc557ae9.

Revert "[dynamo] add lazy IteratorVariable implementations for map and zip (#131413)"

This reverts commit 7d282d87550787d8269593093519c2ad7c5032cd.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132528
Approved by: https://github.com/ZainRizvi
2024-08-02 19:40:57 +00:00
b71cd149ce Fix file lock issue in AotCodeCompiler (#132343)
Summary:
It looks like there are several places in AotCodeCompiler that write files in a way that isn't safe for concurrency. There's a filelock to cope with that, but it seems like the lock path isn't quite robust enough to prevent races. We have an internal stress test failing when executing multiple concurrent versions of the test. It seems as though there's some variability in the content we write to the cpp file, which means we can get a different 'key' across different runs. The lock path includes that key in its name, but the "consts_path" is computed separately. Therefore, I see things like this:

- The computed 'key' is `cp5tgbuxuegvg5g2j7oi6u74nkf3v7mx5w3qzl6qbedtmw5tq77z`
- The lock_path (based on the key) is: `/tmp/torchinductor_slarsen/locks/cp5tgbuxuegvg5g2j7oi6u74nkf3v7mx5w3qzl6qbedtmw5tq77z.lock`
- The cpp path (which also includes the key) is: `/tmp/torchinductor_slarsen/cenzkqfnhu53mrhrdhzjtnblzyma2hgmeo7hai5yqsxzirdavurh/cp5tgbuxuegvg5g2j7oi6u74nkf3v7mx5w3qzl6qbedtmw5tq77z.cpp`
- The consts_path (not based on the key) is: `/tmp/torchinductor_slarsen/cenzkqfnhu53mrhrdhzjtnblzyma2hgmeo7hai5yqsxzirdavurh/cifbshkqkbsurzldsyi2vl5bsnhvejmavys4kktpwrzmpo4ysuoy.bin`

So we have different test instances using different lock paths, but touching the same consts_path and therefore stomping on each other's consts_path. To fix, include the key in the consts_path.
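
A minimal sketch of the idea: derive every path from the same key so that whoever holds the lock for a key is the only writer of that key's files. The function `aot_paths` and the directory layout are hypothetical and do not match AotCodeCompiler's real structure.

```python
import os


def aot_paths(cache_dir: str, key: str):
    # Hypothetical layout: all three paths are derived from the same key.
    lock_path = os.path.join(cache_dir, "locks", f"{key}.lock")
    cpp_path = os.path.join(cache_dir, f"{key}.cpp")
    consts_path = os.path.join(cache_dir, f"{key}.bin")
    return lock_path, cpp_path, consts_path


lock_a, _, consts_a = aot_paths("/tmp/cache", "key_a")
lock_b, _, consts_b = aot_paths("/tmp/cache", "key_b")
```

Two runs that compute different keys now take different locks and also write different consts files, so neither can overwrite the other's output.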

Test Plan: Ran internal stress test. Repro'd failure and verified this change fixes it.

Differential Revision: D60552021

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132343
Approved by: https://github.com/desertfire
2024-08-02 19:01:37 +00:00
bcb4f7c172 Revert "Grouped Query Attention (#128898)"
This reverts commit 6b28af1b79eaa63e2f423d925bbd42330582983f.

Reverted https://github.com/pytorch/pytorch/pull/128898 on behalf of https://github.com/ZainRizvi due to Sorry, this broke a bunch of tests internally. See D60638265 ([comment](https://github.com/pytorch/pytorch/pull/128898#issuecomment-2265961038))
2024-08-02 18:58:46 +00:00
afca6f5b47 [PT2][Optimus] Add missing example value for introduced nodes (#132297)
Summary:
We observed that many nodes introduced during the split-cat and batch-fusion pattern optimizations did not have example-value metadata, which causes problems in our follow-up pattern optimizations, so we add all the missing values.

We also fix bugs in some metadata updates and a corner-case bug in the old pattern, which caused problems in the follow-up pattern optimization.

We delete the merge_stack_tahn_unbind_pass pattern, which was designed for the cmf model; it can be replaced by the more advanced pattern we added, so we remove it for easier maintenance.

Test Plan:
# unit test
```
buck2 test //caffe2/test/inductor:split_cat_fx_passes
```

Test UI: https://www.internalfb.com/intern/testinfra/testrun/15481123762720165
Network: Up: 230KiB  Down: 702KiB  (reSessionID-756346bf-6da3-4fa0-8d03-1b4fd61e0a7a)
Jobs completed: 30. Time elapsed: 7:23.9s.
Cache hits: 20%. Commands: 5 (cached: 1, remote: 0, local: 4)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0

```
buck2 test @mode/opt pytorch/diff_train_tests/ads/optimus:local_pt2_runner
```

Network: Up: 1.3GiB  Down: 84MiB  (reSessionID-ff135cdd-e42c-4ab5-8217-907ada465f01)
Jobs completed: 61. Time elapsed: 21:56.5s.
Cache hits: 0%. Commands: 39 (cached: 0, remote: 0, local: 39)
Tests finished: Pass 8. Fail 0. Fatal 0. Skip 0. Build failure 0

# benchmark

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run @mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ig_ctr" --flow_id 584880697
```

Counter({'pattern_matcher_nodes': 752, 'pattern_matcher_count': 732, 'normalization_pass': 328, 'normalization_aten_pass': 12, 'scmerge_cat_removed': 5, 'scmerge_cat_added': 4, 'scmerge_split_removed': 3, 'unbind_stack_pass': 3, 'batch_tanh': 2, 'scmerge_split_sections_removed': 2, 'scmerge_split_added': 2, 'optimize_cat_inputs_pass': 1, 'unbind_cat_to_view_pass': 1, 'fxgraph_cache_miss': 1})

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132297
Approved by: https://github.com/jackiexu1992
2024-08-02 18:57:12 +00:00
24d0a32f98 Revert "[dynamo] Wrap unspecialized nn module getattr with UnspecializedNNModuleSource (#132308)"
This reverts commit aa0ed2496f5bf38768c9eda13112fd43359548bb.

Reverted https://github.com/pytorch/pytorch/pull/132308 on behalf of https://github.com/anijain2305 due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/132308#issuecomment-2265959993))
2024-08-02 18:55:51 +00:00
e696f17467 Revert "[dynamo] Track builtin nn modules with UnspecializedBuiltinNNModuleVariable (#132314)"
This reverts commit d6a82ce39bd8e705a4cc2cebb886f4476a7250cf.

Reverted https://github.com/pytorch/pytorch/pull/132314 on behalf of https://github.com/anijain2305 due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/132314#issuecomment-2265953367))
2024-08-02 18:52:38 +00:00
e4e3575fb0 Revert "[11/N] Use std::nullopt and std::optional (#132396)"
This reverts commit d7d61904936617a6a43782868d0b1004cb70dfc0.

Reverted https://github.com/pytorch/pytorch/pull/132396 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR has a dependency on another PR (https://github.com/pytorch/pytorch/pull/128898) that has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/132396#issuecomment-2265952528))
2024-08-02 18:49:42 +00:00
59b73079a0 Revert "Always use high precision for SDPA math backend (#128922)"
This reverts commit fbf3bc0a602b4ec1eab169202d5b1158fe2c1def.

Reverted https://github.com/pytorch/pytorch/pull/128922 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR has a dependency on another PR (https://github.com/pytorch/pytorch/pull/128898) that has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/128922#issuecomment-2265949958))
2024-08-02 18:46:50 +00:00
193a19ee91 Revert "[dynamo] Treat attr of unspecialized buiitin nn modules as static (#132318)"
This reverts commit 7b816d7d6d5d521f913c78f897790f66112c7d84.

Reverted https://github.com/pytorch/pytorch/pull/132318 on behalf of https://github.com/anijain2305 due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/132318#issuecomment-2265945433))
2024-08-02 18:43:32 +00:00
b8f7019df0 Revert "[dynamo] Track params/buffers and mark them as static (#132334)"
This reverts commit babb249a89b51931afe16db8b498ff72cd433afc.

Reverted https://github.com/pytorch/pytorch/pull/132334 on behalf of https://github.com/anijain2305 due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/132334#issuecomment-2265942261))
2024-08-02 18:41:19 +00:00
e0514a5b99 [AOTI][refactor] Consolidate how python_kernel_name is set (#132320)
Summary: Similar to the refactoring of set_cpp_kernel, consolidate the ways of setting python_kernel_name

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132320
Approved by: https://github.com/angelayi, https://github.com/chenyang78
ghstack dependencies: #132319
2024-08-02 18:34:25 +00:00
a9e1133faa [AOTI][refactor] Move set_cpp_kernel to base class (#132319)
Summary: Consolidate how cpp_kernel_name is set and make it a method in the base ExternKernel class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132319
Approved by: https://github.com/angelayi, https://github.com/chenyang78
2024-08-02 18:34:24 +00:00
df781343e2 Link libc10 to pthreads (#132484)
It gets linked as a transitive dependency of `libmkl` on x86_64, but it must be specified explicitly on s390x.

Linking issue only appears when using gcc-13 with gold linker.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132484
Approved by: https://github.com/malfet
2024-08-02 18:03:44 +00:00
19897a1647 [export] change deepcopy to copy in _replace_set_grad_with_hop pass.. (#132181)
Summary:
Fixes T197371132.

Previously, we called copy.deepcopy to avoid mutating the original signature. However, deep-copying fails with "The tensor has a non-zero number of elements, but its data is not allocated yet." when the signature references a FakeScriptObject, which in turn references a real torch.ScriptObject.

We therefore just change it to a shallow copy. This should be good enough for guarding the signature.
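
The difference between the two copy modes can be sketched with the standard library alone. `Unclonable` and `Signature` below are hypothetical stand-ins (for the script object and the export signature respectively), not PyTorch classes, and the error is simulated:

```python
import copy
from dataclasses import dataclass, field


class Unclonable:
    """Hypothetical stand-in for an object that cannot be deep-copied."""

    def __deepcopy__(self, memo):
        raise RuntimeError("data is not allocated yet")


@dataclass
class Signature:
    """Hypothetical stand-in for a graph signature."""

    input_names: list = field(default_factory=list)
    script_obj: object = None


sig = Signature(input_names=["x"], script_obj=Unclonable())

# A deep copy recurses into every attribute and hits the failing object.
try:
    copy.deepcopy(sig)
    deepcopy_failed = False
except RuntimeError:
    deepcopy_failed = True

# A shallow copy creates a new top-level object but shares the attributes.
shallow = copy.copy(sig)
shallow.script_obj = None  # rebinding an attribute leaves `sig` untouched
```

The trade-off is that nested objects stay shared: an in-place mutation such as `shallow.input_names.append("y")` would still be visible through `sig`.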

Test Plan: buck2 run 'fbcode//mode/opt' torchrec/distributed/tests:test_pt2 -- --filter-text "test_sharded_quant_ebc_non_strict_export"

Differential Revision: D60476839

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132181
Approved by: https://github.com/BoyuanFeng
2024-08-02 17:57:09 +00:00
cyy
87d58cc81f [4/N] Fix clang-tidy warnings in aten/src/ATen/native/ (#132001)
Follows #132000
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132001
Approved by: https://github.com/Skylion007
2024-08-02 17:42:02 +00:00
cyy
207e24ff83 Enable clang-tidy on aten/src/ATen/cudnn/* (#130133)
Continued work of applying clang-tidy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130133
Approved by: https://github.com/eqy, https://github.com/Skylion007
2024-08-02 17:39:37 +00:00
0c491702c4 [ONNX] Define the TORCH_ONNX_USE_EXPERIMENTAL_LOGIC flag (#132299)
Define the `TORCH_ONNX_USE_EXPERIMENTAL_LOGIC` flag to allow for enabling the new torch.onnx logic and hiding them during migration and testing. The actual logic migration will happen after.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132299
Approved by: https://github.com/titaiwangms
2024-08-02 17:06:11 +00:00
9167113c16 [easy][MPS] add torch.mps.is_available() (#132426)
Just return "torch.mps.device_count() > 0", which, based on the implementation of device_count(), seems to be equivalent.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132426
Approved by: https://github.com/malfet
2024-08-02 17:05:49 +00:00
fc32732596 Don't attempt to compute hints for unbacked expressions (#132060)
This breaks the inference we made that if you cat an N-D tensor with a 1-D tensor of size (u0,), the u0 must be zero, but no one really wanted that anyway...

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132060
Approved by: https://github.com/Skylion007
2024-08-02 16:39:14 +00:00
8fff976355 Revert "Refactor thunkify to return proper thunk abstraction (#132407)"
This reverts commit d903e664c6b70ad17e0b316ef39d71be5edddc87.

Reverted https://github.com/pytorch/pytorch/pull/132407 on behalf of https://github.com/ezyang due to test_correct_module_names ([comment](https://github.com/pytorch/pytorch/pull/132407#issuecomment-2265754857))
2024-08-02 16:32:43 +00:00
1197550876 Revert "Don't attempt to compute hints for unbacked expressions (#132060)"
This reverts commit d342dc0179944dd317b509b3432da81701836444.

Reverted https://github.com/pytorch/pytorch/pull/132060 on behalf of https://github.com/ezyang due to test_correct_module_names ([comment](https://github.com/pytorch/pytorch/pull/132407#issuecomment-2265754857))
2024-08-02 16:32:43 +00:00
296c339f98 Ensure compiler collective is called even when no graph is compiled (#132163)
It's very important to make sure we always run the compiler collective, because if we don't, we will fail to apply automatic dynamic at all.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132163
Approved by: https://github.com/jansel
2024-08-02 16:31:54 +00:00
82b6480b0a Update SavedTensorHooks TLS stack to use SafePyObject (#131700)
Previously, we had to manually manage refcounting when updating the TLS saved variable stack. With this PR, things should be handled automatically by the SafePyObject.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131700
Approved by: https://github.com/albanD
2024-08-02 16:27:16 +00:00
9eeb5eebab Revert "Ensure compiler collective is called even when no graph is compiled (#132163)"
This reverts commit 0d9c9716b2db52281f6f10a113e07936deeb6e0a.

Reverted https://github.com/pytorch/pytorch/pull/132163 on behalf of https://github.com/ezyang due to test_correct_module_names ([comment](https://github.com/pytorch/pytorch/pull/132163#issuecomment-2265729449))
2024-08-02 16:16:31 +00:00
fca2dba7ca [pytorch][counters] Pybind for WaitCounter (#132357)
Summary:
Basic pybind integration for WaitCounter providing a guard API.
Also fixes broken copy/move constructor in WaitGuard (it wasn't really used with the macro-based C++ API).

Test Plan: unit test

Differential Revision: D60557660

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132357
Approved by: https://github.com/jamesperng, https://github.com/asiab4
2024-08-02 16:08:10 +00:00
d224857b3a Revert "Change signature of CompilerFn for register_backend decorator (#131880)"
This reverts commit ccf9ce8e8c3c86269003547d976da5ed1fc9511b.

Reverted https://github.com/pytorch/pytorch/pull/131880 on behalf of https://github.com/albanD due to Breaking lint ([comment](https://github.com/pytorch/pytorch/pull/131880#issuecomment-2265682757))
2024-08-02 15:49:09 +00:00
63eb06c051 Disable SymDispatchMode when torch.compile'ing (#132433)
Partially addresses https://github.com/pytorch/pytorch/issues/132417

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132433
Approved by: https://github.com/ydwu4
2024-08-02 15:23:49 +00:00
cyy
5aafdc2f87 [3/N] Fix clang-tidy warnings in aten/src/ATen/native/ (#132000)
Follows #131834

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132000
Approved by: https://github.com/ezyang
2024-08-02 15:00:38 +00:00
78f4a3919f Remove duplicate XPU switch case in DispatchStub (#132480)
This PR fixes the issue mentioned in https://github.com/pytorch/pytorch/issues/132481. Duplicated XPU switch cases exist in `DispatchStub.cpp` and this PR removes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132480
Approved by: https://github.com/nautsimon, https://github.com/malfet
2024-08-02 14:39:00 +00:00
ccf9ce8e8c Change signature of CompilerFn for register_backend decorator (#131880)
## Description
Add `...` to show that CompilerFn for a custom backend could take additional options

Re: Recreated closed PR https://github.com/pytorch/pytorch/pull/110006
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131880
Approved by: https://github.com/jansel
2024-08-02 14:30:58 +00:00
053e5080f6 Enable exception chaining in call_user_compiler (#131186)
Enable exception chaining of the BackendCompilerFailed exception in call_user_compiler. This prevents the original exception and traceback, which are often the most useful for debugging, from being discarded.

Example output without the patch
> Traceback (most recent call last):
> [Traceback from test_slice_scatter_issue122291 to raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(]
> [Trace back from call_user_compiler to  _inplace_generalized_scatter raise RuntimeError]
>  torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
>  RuntimeError: shape error in scatter op, can not broadcast torch.Size([16, 2]) to torch.Size([16, 6])
> Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

Example output with the patch
> Traceback (most recent call last):
> [Traceback from_inplace_generalized_scatter to raise error_type(message_evaluated)]
> RuntimeError: expand: attempting to expand a dimension of length 2!
> The above exception was the direct cause of the following exception:
> Traceback (most recent call last):
> [Traceback from  call_user_compiler to  _inplace_generalized_scatter raise RuntimeError]
> RuntimeError: shape error in scatter op, can not broadcast torch.Size([16, 2]) to torch.Size([16, 6])
> The above exception was the direct cause of the following exception:
> Traceback (most recent call last):
> [Traceback from test_slice_scatter_issue122291 to raise BackendCompilerFailed(self.compiler_fn, e) with e]
> RuntimeError: shape error in scatter op, can not broadcast torch.Size([16, 2]) to torch.Size([16, 6])
> Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information
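
The mechanism behind the second output is Python's `raise ... from e`. A minimal sketch follows; the class and function names are hypothetical stand-ins, not the actual dynamo code:

```python
class BackendCompilerFailed(RuntimeError):
    """Hypothetical stand-in for torch._dynamo.exc.BackendCompilerFailed."""


def failing_backend():
    # Simulates a backend that fails while compiling.
    raise RuntimeError("shape error in scatter op")


def call_user_compiler_sketch():
    try:
        failing_backend()
    except Exception as e:
        # `from e` stores the original exception in __cause__, so its
        # traceback is printed before the wrapper's traceback.
        raise BackendCompilerFailed(f"backend raised: {e}") from e


try:
    call_user_compiler_sketch()
except BackendCompilerFailed as err:
    caught = err
```

When `caught` is printed by the interpreter's default handler, the original `RuntimeError` appears first, followed by "The above exception was the direct cause of the following exception:".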

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131186
Approved by: https://github.com/jansel
2024-08-02 14:07:06 +00:00
48929184e9 AutoHeuristic: mixed_mm heuristic for A100 (#131613)
This PR introduces changes to AutoHeuristic that allow one to learn a heuristic as a decision tree. I used this to learn a heuristic for mixed_mm on A100 that consistently performs better than the default choice (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L402).

This is what the results look like:
Explanation of columns:
**wrong_max_spdup**: In the worst case, how much better would the best choice have been
**wrong_gman_spdup**: For inputs where the heuristic is wrong, how much better is the best choice on average (geomean)
**max_spdup_default**: Highest speedup achieved by the learned heuristic over the default choice
**gman_spdup_default**: Geomean speedup achieved by the learned heuristic over the default choice
**max_slowdown_default**: If the default choice is better than the choice predicted by the learned heuristic, how much is it better in the worst case
**non_default_preds**: Number of times the learned heuristic predicted a choice that is not the default choice
**default_better**: Number of times the default choice is better than the choice made by the heuristic
```
  set     crit  max_depth  min_samples_leaf  correct  wrong  unsure  total  wrong_max_spdup  wrong_gman_spdup    max_spdup_default  gman_spdup_default  max_slowdown_default  non_default_preds  default_better
train  entropy          5              0.01     2376    740     323   3439         1.855386          1.063236            11.352318            3.438279              1.022164               3116               2
 test  entropy          5              0.01      563    183      71    817         1.622222          1.060897            10.084181            3.507741              1.017039                746               2
```

While the number of wrong predictions is high, on average the best choice is only around 6% better. What is important is that the choice predicted by the learned heuristic performs better than the default choice.
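
For readers unfamiliar with the geomean columns, here is a small sketch of how such a number can be computed. The speedup values are made up for illustration and are NOT data from this PR:

```python
from statistics import geometric_mean

# Illustrative ratios of (time of predicted choice) / (time of best choice)
# for inputs where the heuristic was wrong. Invented values, not PR data.
speedups = [1.0, 1.05, 1.10, 1.02, 1.20]

gman = geometric_mean(speedups)       # analogous to a *_gman_spdup column
worst = max(speedups)                 # analogous to a *_max_spdup column
percent_better = (gman - 1.0) * 100   # "the best choice is N% better on average"
```

The geometric mean is the usual choice for averaging ratios, since a 2x speedup and a 2x slowdown cancel out to 1.0.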

I evaluated my heuristic on gpt-fast `meta-llama/Llama-2-7b-chat-hf` with int8 weight quantization. To get the `tuned_mixed_mm` to trigger, I had to replace `F.linear()` in https://github.com/pytorch-labs/gpt-fast/blob/main/quantize.py#L355 with `torch.matmul(input, self.weight.t().to(dtype=input.dtype))` because the mixed_mm pattern does not match if there is a transpose between a cast and the matmul.
| batch size | prompt length |     fallback |    heuristic | speedup |
|------------|---------------|-------------:|-------------:|--------:|
|     1      |       7       |  75.31 tok/s | 148.83 tok/s |    1.97 |
|     1      |      11       |  75.99 tok/s | 148.15 tok/s |    1.94 |
|     4      |       7       | 103.48 tok/s | 472.00 tok/s |    4.56 |
|     4      |      11       | 103.56 tok/s | 371.36 tok/s |    3.58 |
|     8      |       7       | 201.92 tok/s | 813.44 tok/s |    4.02 |
|     8      |      11       | 201.76 tok/s | 699.36 tok/s |    3.46 |

Currently, the heuristic only applies to the following inputs:
- m <= 128, k >= 1024, n >= 1024 (For these sizes, one of the triton kernels wins in most cases, but the heuristic still has to be careful to not choose a config that performs worse than the fallback)
- k % 256 == 0 (If k is not a multiple of the block size, some choices perform extremely badly. In one case, a config that usually performs very well was 130x slower.)
- mat1 not transposed
- mat2 transposed (In some cases, it was hard for the learned heuristic to detect some cases where it
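
The conditions listed above can be written as a single predicate. This is only a sketch: the function name and argument names are hypothetical, and the real AutoHeuristic code may check these conditions differently:

```python
def heuristic_applies(m, k, n, mat1_transposed, mat2_transposed):
    """Return True when the input falls inside the ranges listed above.

    Hypothetical helper for illustration; not part of AutoHeuristic.
    """
    return (
        m <= 128
        and k >= 1024
        and n >= 1024
        and k % 256 == 0           # k must be a multiple of the block size
        and not mat1_transposed
        and mat2_transposed
    )


applies = heuristic_applies(64, 4096, 4096, False, True)
rejected_k = heuristic_applies(64, 4000, 4096, False, True)  # 4000 % 256 != 0
```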

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131613
Approved by: https://github.com/eellison
2024-08-02 13:54:37 +00:00
cyy
b9cb1abf65 [12/N] Use std::optional (#132361)
Follows #132396

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132361
Approved by: https://github.com/eqy
2024-08-02 13:46:46 +00:00
56f2917bef [dynamo] Bugfix for recently added str handler (#132461)
There is probably more work to do to improve support, but this is a hot fix to avoid failing on `.__func__`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132461
Approved by: https://github.com/williamwen42
ghstack dependencies: #132425
2024-08-02 13:16:39 +00:00
0d9c9716b2 Ensure compiler collective is called even when no graph is compiled (#132163)
It's very important to make sure we always run the compiler collective, because if we don't, we will fail to apply automatic dynamic at all.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132163
Approved by: https://github.com/jansel
2024-08-02 12:18:34 +00:00
d342dc0179 Don't attempt to compute hints for unbacked expressions (#132060)
This breaks the inference we made that if you cat an N-D tensor with a 1-D tensor of size (u0,), the u0 must be zero, but no one really wanted that anyway...

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132060
Approved by: https://github.com/Skylion007
ghstack dependencies: #131649, #132407
2024-08-02 12:09:37 +00:00
d903e664c6 Refactor thunkify to return proper thunk abstraction (#132407)
This is superior to lru_cache because (1) it's more explicit and (2) it
doesn't leak the original function after it's been forced.
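
To illustrate the idea, a thunk can be sketched as a small class that runs its function once and then drops the reference to it. The class below is a hypothetical sketch, not necessarily the abstraction added in this PR:

```python
class Thunk:
    """Hypothetical sketch of a lazily evaluated, compute-once value."""

    def __init__(self, fn):
        self._fn = fn
        self._value = None

    def force(self):
        if self._fn is not None:
            self._value = self._fn()
            # Drop the callable so anything it closes over can be freed.
            self._fn = None
        return self._value


calls = []


def expensive():
    calls.append(1)
    return 42


t = Thunk(expensive)
first = t.force()
second = t.force()  # served from the stored value; `expensive` is not re-run
```

By contrast, a function wrapped with `functools.lru_cache` keeps a reference to the original function for as long as the wrapper lives.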

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132407
Approved by: https://github.com/albanD
ghstack dependencies: #131649
2024-08-02 12:09:37 +00:00
290f09f829 Ban decorator usage of dynamo_timed (#132328)
This is a more manual version of https://github.com/pytorch/pytorch/pull/132073 that just manually creates the new function at each call site instead of magicking it with clone. Review with whitespace diffs off.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132328
Approved by: https://github.com/albanD
2024-08-02 12:00:46 +00:00
8668bc279d [inductor] contine to fix restrict keyword. (#132463)
This continues the work in PR https://github.com/pytorch/pytorch/pull/132394; all uses of the `restrict` keyword in `cpp_micro_gemm.py` are now fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132463
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-08-02 11:09:17 +00:00
d2e9a8bf6d [Reland] Fix inlining module-scoped store global (#132439)
Reland https://github.com/pytorch/pytorch/pull/132224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132439
Approved by: https://github.com/anijain2305
2024-08-02 09:13:52 +00:00
a4ea776881 Add pinned memory support to sparse COO/CSR/CSC/BSR/BSC tensors (#129645)
As in the title:

To register indices/values of a sparse XYZ tensor with CUDA, the following methods are supported
- `sparse_xyz_tensor(indices, values, pin_memory=True)`
- `sparse_xyz_tensor(indices, values).pin_memory()`
- `sparse_xyz_tensor(indices.pin_memory(), values.pin_memory())`

Fixes https://github.com/pytorch/pytorch/issues/115330

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129645
Approved by: https://github.com/amjames, https://github.com/cpuhrsch, https://github.com/eqy
2024-08-02 08:55:55 +00:00
babb249a89 [dynamo] Track params/buffers and mark them as static (#132334)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132334
Approved by: https://github.com/ezyang, https://github.com/mlazos
2024-08-02 08:55:43 +00:00
2ee9895304 Support optimizer capturable on hpu and xpu (#132119)
as title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132119
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-08-02 08:19:52 +00:00
f936e68506 [CI] Update CPU inductor smoke test model list and target (#132221)
Fixes #132097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132221
Approved by: https://github.com/desertfire
2024-08-02 07:09:54 +00:00
eqy
e5560d10f4 [CUDA][SDPA] Fix expect export on sm90+ (#132194)
CC @drisspg not sure what is causing the scale=0.125 to be omitted here...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132194
Approved by: https://github.com/drisspg
2024-08-02 05:43:58 +00:00
7d8b95e8fb [easy] more debug in partitioner assert (#132456)
Print the name of the node that didn't have good meta['val']. An internal model is failing with this assert, and we need this info to debug further.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132456
Approved by: https://github.com/Chillee
2024-08-02 05:07:01 +00:00
cyy
35d14d22a0 Fix some issues detected by static analysis tools (#131989)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131989
Approved by: https://github.com/ezyang
2024-08-02 04:18:57 +00:00
5ea0f51187 [Dynamo] Support abc.MutableMapping.get (#132363)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132363
Approved by: https://github.com/anijain2305, https://github.com/mlazos
2024-08-02 04:17:35 +00:00
2b86a7fcc7 fix printing of scores and mods names (#132424)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132424
Approved by: https://github.com/Skylion007
2024-08-02 03:30:23 +00:00
cyy
07fe1dd58f [13/N] Fix clang-tidy warnings in jit (#132411)
Follows  #132209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132411
Approved by: https://github.com/Skylion007
2024-08-02 03:14:09 +00:00
1250171866 Use fresh inductor cache on unit tests (#132432)
Summary: This makes it so that stress tests on separate processes on the same machine don't clobber the directories of each other. InductorTestCase will automatically make a fresh tmpdir for each unit test.

Test Plan:
```
buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_aot_autograd_cache.py::AOTAutogradCacheTests::test_nn_module_with_params_global_constant' --run-disabled --stress-runs 10 --record-results
```

Now passes

Differential Revision: D60604811

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132432
Approved by: https://github.com/masnesral
2024-08-02 03:02:36 +00:00
6c4ce4331c [dynamo][exception] Raise Observed KeyError exception for dict __getitem__ (#132425)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132425
Approved by: https://github.com/yanboliang, https://github.com/Skylion007
2024-08-02 02:58:31 +00:00
cd5452aace [CUDA] is_bf16_supported() should not crash if there are no GPUs (#132313)
`False` is the right answer on a system that does not have any CUDA GPUs.
- Added regression test to TestTorch.

Fixes https://github.com/pytorch/pytorch/issues/132303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132313
Approved by: https://github.com/eqy, https://github.com/syed-ahmed
2024-08-02 02:50:43 +00:00
3a355c1891 Correct sample creation of torch.histogram in UT op_db to align PyTorch defined operator semantics (#131630)
Fixes #130916
Per the semantics defined in [torch.histogram](https://pytorch.org/docs/stable/generated/torch.histogram.html#torch-histogram), the bins tensor must be an increasing sequence. Random input doesn't make sense for torch.histogram.
The test case compares the CPU backend against another backend. When the input is random, kernel implementations in other backends have to match the CPU kernel exactly, or the case fails.
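
One way to produce a valid bins sample is to accumulate random positive steps, which guarantees an increasing sequence. The helper below is a hypothetical, standard-library sketch, not the sample generator used in op_db:

```python
import random
from itertools import accumulate


def make_increasing_edges(num_bins, seed=0):
    """Return num_bins + 1 strictly increasing bin edges (hypothetical helper)."""
    rng = random.Random(seed)
    # Every step is positive, so the running sum is strictly increasing.
    steps = [rng.uniform(0.1, 1.0) for _ in range(num_bins + 1)]
    return list(accumulate(steps))


edges = make_increasing_edges(5)
```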

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131630
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-08-02 01:51:09 +00:00
bc510916fa Only make wait_tensor as a side_effect op (#132341)
Summary:
https://github.com/pytorch/pytorch/pull/131023 added all the collective ops to the side effect list. But we should only mark wait_tensor as a side_effect op, because every collective op should have a corresponding wait_tensor.

We should switch to using the high_order effect token.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132341
Approved by: https://github.com/yf225
2024-08-02 01:24:40 +00:00
ef426d5183 [nccl] Wrap nccl code update with version check (#130419)
Fixes the issue that PyTorch cannot be built with NCCL < 2.13 after https://github.com/pytorch/pytorch/issues/128756

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130419
Approved by: https://github.com/eqy, https://github.com/malfet
2024-08-02 01:22:07 +00:00
50ed6ce277 Support built-in id function for TensorVariable on parameters (#130100)
Fixes #130087

This patch tries to provide a built-in id function implementation for TensorVariable when the id function is called on tensors like module parameters. The id function call on intermediate tensors is not supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130100
Approved by: https://github.com/anijain2305
2024-08-02 01:19:25 +00:00
64235c6a71 Skip test_fp8 in test_aot_inductor to temporarily (#132453)
https://github.com/pytorch/pytorch/pull/130422 caused the test `test.inductor.test_aot_inductor.AOTInductorTestABICompatibleCuda.test_fp8_abi_compatible_cuda` to fail (unclear why it was not run in GitHub) with `torch/csrc/inductor/aoti_torch/c/shim.h:390:34: note: candidate function not viable: requires 9 arguments, but 6 were provided`. We suspect that the kernel produced by the lowering function, which is no longer a fallback choice, has a schema issue at codegen. Fp8 is not used through AOTI currently and it is difficult to revert the PR (BE week), so we'll skip the test temporarily while making the new lowering compatible with AOTI.

Testing: the failed test on internal diff is now skipped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132453
Approved by: https://github.com/henrylhtsang
2024-08-02 01:18:03 +00:00
cyy
56334c854c [2/N] Fix clang-tidy warnings in aten/src/ATen/native/*.{cpp,h} (#131834)
Follows #130798

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131834
Approved by: https://github.com/ezyang
2024-08-02 00:49:30 +00:00
ee1ef066fd add src map to data-dependent errors (#132393)
Summary: Currently, suggested fixes pick a map from symbols to user variables. However, it is possible that many user variables point to the same symbol, and some may be preferred over others. Thus we dump this info as well.

Test Plan: updated test

Sample error with new format:
```
Could not guard on data-dependent expression u2 >= 0 (unhinted: u2 >= 0).  (Size-like symbols: none)

<snip>

The following call raised this error:
  File "test/export/test_export.py", line 1950, in forward
    return r.view(items[0], items[2])

To fix the error, insert one of the following checks before this call:
  1. torch._check(items[2] >= 0)
  2. torch._check(items[2] < 0)

(These suggested fixes were derived by replacing `u2` with items[2] in u2 >= 0 and its negation.)
```

Differential Revision: D60574478

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132393
Approved by: https://github.com/BoyuanFeng
2024-08-02 00:31:12 +00:00
625af2d27c [dynamo] fix add_push_null callsites with CALL_FUNCTION_EX (#132329)
Also fix a bug in `PyCodegen.add_push_null` where in Python <= 3.12, we may accidentally duplicate a NULL instead of the object on the stack before it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132329
Approved by: https://github.com/anijain2305
2024-08-02 00:29:21 +00:00
0016be8051 [Docker] Replace epel release rpm by yum install (#132449)
The URL https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm is no longer available, so this replaces it with a yum install of epel-release.

As a backup plan, this is still available: https://archives.fedoraproject.org/pub/archive/epel/7/x86_64/Packages/e/epel-release-7-14.noarch.rpm

Saved on our s3 path, just in case: https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm

Please note that we are still using it for installs like this:
```
RUN yum install -y \
    https://repo.ius.io/ius-release-el7.rpm \
    https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm
```

Test in CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132449
Approved by: https://github.com/kit1980, https://github.com/seemethere, https://github.com/malfet
2024-08-02 00:16:03 +00:00
3855ac5a5d Revert "[export] Add print_readable to unflattener (#128617)"
This reverts commit ab9791c0e342753013181eeeab300a05774fc456.

Reverted https://github.com/pytorch/pytorch/pull/128617 on behalf of https://github.com/angelayi due to never got landed internally due to weird flow... sorry ([comment](https://github.com/pytorch/pytorch/pull/128617#issuecomment-2264224466))
2024-08-01 23:47:29 +00:00
0c3ac428a2 [BE][typing] fix types in common pruning (#132309)
BE task. Add typings and remove mypy errors in torch/testing/_internal/common_pruning.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132309
Approved by: https://github.com/ColinPeppler
2024-08-01 23:34:33 +00:00
87ddf70fc6 Set weights_only=False in export deserialize_torch_artifact (#132348)
Context:

We are planning to make a BC breaking change to `torch.load` by flipping the default for `weights_only` from `False` --> `True` in a future release. With `weights_only=True`, a custom unpickler is used that limits what can be loaded to state_dicts containing tensors (there is also a way for the user to allowlist specific things to be loaded). The goal of this is to attempt to prevent remote execution of arbitrary code when using `torch.load`.

To my understanding, in export, `torch.load` is used internally to load arbitrary objects, so we should set `weights_only=False` here to prevent the flip from breaking export.
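
The general idea of an unpickler that limits what can be loaded can be sketched with the standard `pickle` module. This is an illustration of the technique only, not the unpickler that `torch.load` uses; the `ALLOWED` set and helper names are hypothetical:

```python
import io
import pickle
from collections import OrderedDict

# Hypothetical allowlist of (module, name) pairs that may be resolved.
ALLOWED = {("collections", "OrderedDict")}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        # Called whenever the pickle stream references a global by name.
        if (module, name) in ALLOWED:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"global '{module}.{name}' is not allowed")


def restricted_loads(data):
    return RestrictedUnpickler(io.BytesIO(data)).load()


# A plain container of numbers loads fine.
ok = restricted_loads(pickle.dumps(OrderedDict(weight=[1.0, 2.0])))

# A payload that references an arbitrary global is rejected.
try:
    restricted_loads(pickle.dumps(print))
    blocked = False
except pickle.UnpicklingError:
    blocked = True
```

A loader restricted this way cannot reconstruct arbitrary objects, which is why code that needs to load arbitrary objects has to opt out explicitly.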

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132348
Approved by: https://github.com/angelayi
2024-08-01 23:25:07 +00:00
1362d51e7d [AOTI] Fix number type for AOTI (#132180)
Fixes #131338

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132180
Approved by: https://github.com/desertfire
2024-08-01 22:43:28 +00:00
35400f750f [torchbind] don't warning for certain skippable methods. (#132306)
Summary:
Skip the warning if the fake script object doesn't implement a fake method for:
1. __obj_flatten__: only needed for real script objects.
2. __set_state__ and __get_state__: used for serialization, which is not expected to happen during tracing.

Test Plan: Existing tests.

Reviewed By: angelayi

Differential Revision: D60478460

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132306
Approved by: https://github.com/angelayi
2024-08-01 22:40:42 +00:00
2f54c38594 [AOTI] Fix bfloat16 in CPU (#132150)
Fixes #122986

- add "typedef at::BFloat16 bfloat16;" to the header of generated cpp file

- Suppress warning: comparison of integer expressions of different signedness: ‘long unsigned int’ and ‘int64_t’ {aka ‘long int’} [-Wsign-compare]
  436 |   if (tensor.numel() != numel) {

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132150
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-08-01 22:26:30 +00:00
a356a03f4a Fix DEBUG=1 asserts for mvlgamma backward with NJT (#132422)
mvlgamma backward trips DEBUG=1 asserts when trying to construct an empty tensor with `layout=torch.jagged`. This happens due to passing `self.options()` to `arange()` in `mvlgamma_backward()`. The fix in this PR unconditionally constructs `arange()` with the strided layout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132422
Approved by: https://github.com/albanD
2024-08-01 21:53:16 +00:00
92bebb46fa Support XPU ABI=0 build (#130110)
# Motivation
This PR intends to support ABI=0 build for XPU backend.

# Additional Context
The major change is adding the compilation option `-D__INTEL_PREVIEW_BREAKING_CHANGES` for the host compiler (gcc) and `-fpreview-breaking-changes` for the XPU device kernel compiler (icpx). The reasons are:
- We use gcc to compile host code and link the SYCL runtime, so we need to pass `-D__INTEL_PREVIEW_BREAKING_CHANGES` to tell the host compiler to invoke the ABI-neutral API included in SYCL.
- We use icpx to compile device kernel code and link the SYCL runtime, so we need to pass `-fpreview-breaking-changes` to tell the device kernel compiler to build ABI-neutral code.
- `libsycl-preview.so` is an ABI-neutral library, but `libsycl.so` is not.

This PR depends on https://github.com/pytorch/pytorch/pull/131643.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130110
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
2024-08-01 21:42:14 +00:00
997f64af38 fastpath FunctionalTensor sizes() (#132084)
Another attempt at fast-pathing sizes() in FunctionalTensor, since it appears to improve compile time perf by up to ~10%. See the investigation from https://github.com/pytorch/pytorch/issues/125977#issuecomment-2122915602.

After looking at some failing tests locally, I realized that we need to manually handle metadata mutations now, since the previous "smarter" size dispatch was handling the updates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132084
Approved by: https://github.com/ezyang
2024-08-01 21:09:22 +00:00
c8958f8f84 Revert "Ban decorator usage of dynamo_timed (#132328)"
This reverts commit 9853c048eb53946eb505424b17ac42ce46b66ac1.

Reverted https://github.com/pytorch/pytorch/pull/132328 on behalf of https://github.com/clee2000 due to seems to have broken functorch/test_aotdispatch.py::TestAOTAutograd::test_input_data_and_metadata_mutation_aliases_other_input [GH job link](https://github.com/pytorch/pytorch/actions/runs/10204547165/job/28233976446) [HUD commit link](9853c048eb).  Test passed on PR, probably a landrace, base is only 10 hours old ([comment](https://github.com/pytorch/pytorch/pull/132328#issuecomment-2263909337))
2024-08-01 20:20:28 +00:00
78927d37f6 Add basic mypy annotations to inductor (#132416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132416
Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu
ghstack dependencies: #132415
2024-08-01 20:14:25 +00:00
71e22e0959 Add basic mypy annotations to dynamo (#132415)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132415
Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu
2024-08-01 20:14:25 +00:00
12f61e65eb [mtia][sdpa] MTIA SDPA dispatch via _fused_sdp_choice_stub (#132008)
Summary: as title

Differential Revision: D59823335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132008
Approved by: https://github.com/mortzur
2024-08-01 20:01:40 +00:00
596f568592 [dtensor][debug] adding js script to pytorch github so that i can host the browser visualizer on pytorch (#132185)
**Summary**
This is the javascript portion that is used in CommDebugMode's visual browser. I have placed it here so that I can host the browser on PyTorch. I am following the same procedures to host as memory_viz https://github.com/pytorch/pytorch.github.io/blob/site/memory_viz.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132185
Approved by: https://github.com/XilunWu
ghstack dependencies: #132070
2024-08-01 19:50:23 +00:00
9853c048eb Ban decorator usage of dynamo_timed (#132328)
This is a more manual version of https://github.com/pytorch/pytorch/pull/132073 that just manually creates the new function at each call site instead of magicking it with clone. Review with whitespace diffs off.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132328
Approved by: https://github.com/albanD
2024-08-01 19:27:58 +00:00
40c8f73099 Revert "Fix inlining module-scoped store global (#132224)"
This reverts commit c3a31d90e7d10a9b89b11396b6f8b20ed52bf394.

Reverted https://github.com/pytorch/pytorch/pull/132224 on behalf of https://github.com/ZainRizvi due to Looks like the new import mock_store_global_crossfile_inline fails internally. Please see D60567756 for details ([comment](https://github.com/pytorch/pytorch/pull/132224#issuecomment-2263768729))
2024-08-01 19:06:36 +00:00
93979e7063 Skip frame if torch dispatch mode enabled (#131828)
Fixes https://github.com/pytorch/pytorch/issues/105929

We now skip frames if a dispatch mode is enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131828
Approved by: https://github.com/bdhirsh, https://github.com/anijain2305
2024-08-01 19:06:20 +00:00
fbf3bc0a60 Always use high precision for SDPA math backend (#128922)
Summary:
feikou observed big numerical gaps when using the math backend on AMD and NV GPUs. This is mainly because we are not using higher-precision FP32 for the intermediate accumulated/materialized parts.

Since the math backend is expected to be slower anyway, and we expect it to generate the correct reference result, I think it is worthwhile to upcast FP16/BF16 inputs to FP32, do the computation in FP32/TF32, and then downcast the FP32 output back to FP16/BF16.
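
The effect of a wider accumulator can be illustrated without a GPU. The following is only a sketch under stated assumptions, not the code from this PR: it emulates FP16 rounding with the standard-library `struct` module, the helper names (`to_fp16`, `sum_in_fp16`, `sum_with_upcast`) are made up for illustration, and a 64-bit Python float stands in for the FP32 accumulator.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def sum_in_fp16(values):
    """Accumulate with every intermediate result rounded back to FP16."""
    acc = 0.0
    for v in values:
        acc = to_fp16(acc + to_fp16(v))
    return acc


def sum_with_upcast(values):
    """Upcast FP16 inputs, accumulate in a wider type, downcast once at the end."""
    acc = 0.0  # wide accumulator (stand-in for FP32; an assumption of this sketch)
    for v in values:
        acc += to_fp16(v)
    return to_fp16(acc)


values = [0.1] * 4096
low = sum_in_fp16(values)       # stalls once the FP16 spacing exceeds the addend
high = sum_with_upcast(values)  # stays close to the true sum of about 409.6
print(low, high)
```

The low-precision loop stops growing once the accumulator is large enough that adding one more element rounds away, which is the kind of gap the upcast is meant to avoid.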

Differential Revision: D58710805

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128922
Approved by: https://github.com/xw285cornell, https://github.com/drisspg
2024-08-01 18:55:48 +00:00
0eea2b3947 Cast inputs to low precision kernels in emulate low precision mode (#132345)
Together with https://github.com/pytorch/pytorch/pull/132238, this is sufficient to give no divergence in https://github.com/pytorch/pytorch/issues/132301, although we should discuss that issue at more length.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132345
Approved by: https://github.com/zou3519
2024-08-01 18:02:10 +00:00
Ryo
ce61300141 Enable oneDNN for tanh based GELU on aarch64 (#130925)
Provides speedup for GELU on aarch64 compared to native PyTorch implementation. e.g.

  8.5x speedup compared to native implementation for 1x1x16384 on 32 threads on Graviton 3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130925
Approved by: https://github.com/malfet
2024-08-01 17:54:48 +00:00
97eba8e174 [AOTI] Fix a typo in ExternKernel.codegen_const_args (#132191)
Differential Revision: [D60513923](https://our.internmc.facebook.com/intern/diff/D60513923)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132191
Approved by: https://github.com/chenyang78
2024-08-01 17:46:25 +00:00
f467d55329 Disable remote cache on test_aot_autograd_cache (#132409)
Summary:
AOTAutogradCache currently only checks the local directory instead of both local and remote when saving/loading from the cache, so if remote cache is turned on, it will cache miss.

Disable remote caching for now on these tests: when I work on remote caching compatibility, I'll re-enable them here.

Test Plan:
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_aot_autograd_cache.py::AOTAutogradCacheTests::test_nn_module_with_params_global_constant' --run-disabled
passes

Differential Revision: D60588615

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132409
Approved by: https://github.com/masnesral
2024-08-01 17:26:11 +00:00
010fc7858a [export] Fix serialization of OpOverload w/ SymInt outputs (#132126)
Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1473575486613991/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132126
Approved by: https://github.com/ydwu4
2024-08-01 17:22:04 +00:00
ff4ca0d02a [Easy] Fix argument name collision in HigherOrderOperator dispatched functions (#132377)
Share the same spirit of #129562

- #129562

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132377
Approved by: https://github.com/zou3519
2024-08-01 17:13:37 +00:00
7b816d7d6d [dynamo] Treat attr of unspecialized builtin nn modules as static (#132318)
This fixes the huge increase in compile time with +dynamic with inline_inbuilt_nn_modules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132318
Approved by: https://github.com/yanboliang, https://github.com/mlazos, https://github.com/ezyang
ghstack dependencies: #132302, #132304, #132312, #132308, #132314
2024-08-01 17:11:18 +00:00
69cbf05529 Fix recent build error on ppc64le (#129736)
This PR will fix the recent build issue observed on ppc64le.
Fixes #128130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129736
Approved by: https://github.com/albanD, https://github.com/malfet
2024-08-01 17:09:42 +00:00
30293319a8 [BE][Easy][19/19] enforce style for empty lines in import segments in torch/[o-z]*/ (#129771)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129771
Approved by: https://github.com/justinchuby, https://github.com/janeyx99
2024-08-01 17:07:14 +00:00
c59f3fff52 [PP] Forward only schedule (#132177)
`python test/distributed/pipelining/test_schedule_multiproc.py -k test_forward_only`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132177
Approved by: https://github.com/lessw2020
2024-08-01 16:35:56 +00:00
ee09d066d3 [dynamo] Add line number to _warn_capture_scalar_outputs() (#132333)
Fixes #127667.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132333
Approved by: https://github.com/anijain2305
2024-08-01 16:11:21 +00:00
35fcd59fd8 [inductor] make restrict_keyword cross OSs. (#132394)
Error Msg:
<img width="862" alt="image" src="https://github.com/user-attachments/assets/51fef188-bce8-42a5-8ed4-d11802c6ca89">

<img width="347" alt="image" src="https://github.com/user-attachments/assets/0eafe38e-1c7c-427d-82f5-16a31bccc476">

Handle the `restrict` keyword per OS, ref: https://learn.microsoft.com/en-us/cpp/cpp/extension-restrict?view=msvc-170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132394
Approved by: https://github.com/desertfire
2024-08-01 16:03:10 +00:00
920f0426ae Add None return type to init -- tests rest (#132376)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132376
Approved by: https://github.com/jamesjwu
ghstack dependencies: #132335, #132351, #132352
2024-08-01 15:44:51 +00:00
221350e3a4 Add None return type to init -- tests (#132352)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132352
Approved by: https://github.com/ezyang
ghstack dependencies: #132335, #132351
2024-08-01 15:44:51 +00:00
a6985c09cb Add None return type to init -- functorch and torchgen (#132351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132351
Approved by: https://github.com/jamesjwu
ghstack dependencies: #132335
2024-08-01 15:26:45 +00:00
72d2dba992 Add None return type to init (#132335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132335
Approved by: https://github.com/albanD
2024-08-01 15:26:45 +00:00
30d7f0b15a Remove wget call to builder install_cuda.sh (#132410)
This file ``install_cuda.sh`` now lives in ``.ci/docker/common`` and will be removed from builder repo.
Here is PR that removes it from builder: https://github.com/pytorch/builder/pull/1949
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132410
Approved by: https://github.com/Skylion007
2024-08-01 15:22:08 +00:00
cyy
c99adce9a1 [12/N] Fix clang-tidy warnings in jit (#132209)
Follows #132131

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132209
Approved by: https://github.com/Skylion007
2024-08-01 15:12:12 +00:00
0d88dd0f77 [TS2E] Remove reference to torch.onnx internals (#132186)
Instead, this PR moves the code to the converter to avoid dependence. Feel free to refactor it afterward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132186
Approved by: https://github.com/angelayi
2024-08-01 15:08:02 +00:00
cyy
d7d6190493 [11/N] Use std::nullopt and std::optional (#132396)
Follows #132364
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132396
Approved by: https://github.com/ezyang
2024-08-01 14:46:33 +00:00
a4013e8b72 [inductor] cpp codegen alignas for all OSs. (#132387)
Changes:
1. Make cpp codegen alignas work for all OSs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132387
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-08-01 14:30:09 +00:00
6c1f1563e1 [inductor] fix UndefinedTensorImpl singleton can't export on Windows. (#132326)
This PR fixes the issue that `UndefinedTensorImpl::_singleton` can't be exported on Windows.
Snapshot:
<img width="1346" alt="image" src="https://github.com/user-attachments/assets/b34256ac-a0ae-473b-89e6-10d755eaad24">

The reason is MSVC can't export class static data to external linkage, ref: https://learn.microsoft.com/en-us/cpp/cpp/using-dllimport-and-dllexport-in-cpp-classes?view=msvc-170#_pluslang_using_dllimport_and_dllexport_in_c2b2bselectivememberimportexport

I use another singleton implementation on Windows to avoid the issue.

Since this PR, cpp_wrapper on Windows would start to work.
<img width="1916" alt="image" src="https://github.com/user-attachments/assets/c1d7d7e7-64ca-4c6d-9fb7-e3b91e675b58">

Next step, I will enable the cpp_wrapper UTs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132326
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-08-01 13:37:12 +00:00
6ff1e43a41 [BE][Easy][13/19] enforce style for empty lines in import segments in test/j*/ (#129764)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129764
Approved by: https://github.com/ezyang
2024-08-01 12:13:42 +00:00
672ce4610e Populate submodules of torch._C to sys.modules recursively (#132216)
See comment:

e9d1c26275/torch/__init__.py (L938-L950)

This PR recursively sets the submodules in the C extension to `sys.modules` (e.g., `_C._dynamo.eval_frame`).
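
A rough sketch of the idea, assuming nothing about the actual implementation in `torch/__init__.py`: walk the module-typed attributes of an extension module and register each under its dotted name. The function `register_submodules` and the `fake_c` modules are hypothetical stand-ins used only for illustration.

```python
import sys
import types


def register_submodules(module, qualified_name, _seen=None):
    """Recursively add module-typed attributes of `module` to sys.modules."""
    seen = _seen if _seen is not None else set()
    if id(module) in seen:  # guard against reference cycles between modules
        return
    seen.add(id(module))
    sys.modules.setdefault(qualified_name, module)
    for attr_name, value in list(vars(module).items()):
        if isinstance(value, types.ModuleType):
            register_submodules(value, f"{qualified_name}.{attr_name}", seen)


# Hypothetical stand-in for a C extension with nested submodules.
fake_c = types.ModuleType("fake_c")
fake_c._dynamo = types.ModuleType("fake_c._dynamo")
fake_c._dynamo.eval_frame = types.ModuleType("fake_c._dynamo.eval_frame")

register_submodules(fake_c, "fake_c")
print("fake_c._dynamo.eval_frame" in sys.modules)
```

Once the dotted names are present in `sys.modules`, lookups by full name resolve to the nested module objects rather than failing because no file backs them.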

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132216
Approved by: https://github.com/ezyang
2024-08-01 12:04:59 +00:00
d95756f6a5 [Quantizer][Add] Fix add annotation with constant (#132092)
Summary:
Occasionally we run into a partition that looks like this for Add:

```
SourcePartition(nodes=[_constant2, add_2], source=<built-in function add>, input_nodes=[x], output_nodes=[_constant2, add_2], params=[_constant2])
```

In this case we are adding a constant to an input, and reusing the constant later down the line. This causes our constant to be an output in our SourcePartition. The assumption then that:

```
        add_node = add_partition.output_nodes[0]
```
will not necessarily hold. As a result, we must check that the output node is indeed a call function and not a constant.
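
The check described above can be sketched as follows. This is an illustration only: `Node` is a hypothetical stand-in whose `op` field borrows the torch.fx naming (`"call_function"`, `"get_attr"`), and `find_add_node` is a made-up helper, not the quantizer's actual code.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a graph node."""
    name: str
    op: str  # e.g. "call_function" or "get_attr"


def find_add_node(output_nodes: List[Node]) -> Optional[Node]:
    """Return the first output that is a function call, skipping constants."""
    for node in output_nodes:
        if node.op == "call_function":
            return node
    return None


# Mirrors the partition above: the constant is listed before the add node.
output_nodes = [Node("_constant2", "get_attr"), Node("add_2", "call_function")]
add_node = find_add_node(output_nodes)
print(add_node.name)
```

Indexing `output_nodes[0]` here would have returned the constant, which is the failure mode the summary describes.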

Test Plan: buck test mode/dev-nosan //executorch/backends/xnnpack/test:test_xnnpack_ops -- test_qs8_add_constant

Differential Revision: D60413221

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132092
Approved by: https://github.com/jerryzh168
2024-08-01 09:57:43 +00:00
bdd83c4c7f Add Full block support to flex_decoding (#131404)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131404
Approved by: https://github.com/yanboliang
2024-08-01 07:28:52 +00:00
cyy
043e41f4f4 [10/N] Use std::nullopt and std::make_optional (#132364)
Follows #130674
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132364
Approved by: https://github.com/ezyang
2024-08-01 07:02:35 +00:00
d6a82ce39b [dynamo] Track builtin nn modules with UnspecializedBuiltinNNModuleVariable (#132314)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132314
Approved by: https://github.com/yanboliang
ghstack dependencies: #132302, #132304, #132312, #132308
2024-08-01 06:21:05 +00:00
aa0ed2496f [dynamo] Wrap unspecialized nn module getattr with UnspecializedNNModuleSource (#132308)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132308
Approved by: https://github.com/yanboliang
ghstack dependencies: #132302, #132304, #132312
2024-08-01 06:21:05 +00:00
612ea35395 [dynamo] Introduce UnspecializedBuiltinNNModuleSource (#132312)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132312
Approved by: https://github.com/yanboliang
ghstack dependencies: #132302, #132304
2024-08-01 06:21:05 +00:00
4c29c1a96a [EZ] adjust test to accept training IR input (#131999)
When we do predispatch functional export, sometimes we get harmless additional detach calls. In the new training IR, it actually outputs a slightly different (arguably more correct) result.

Differential Revision: [D60348764](https://our.internmc.facebook.com/intern/diff/D60348764/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131999
Approved by: https://github.com/bdhirsh
ghstack dependencies: #131988, #131995
2024-08-01 06:20:38 +00:00
7a779b5257 Add functions from torch.masked._ops to __all__ for torch.masked (#131288)
Add the non-private operations imported in this file to `__all__` so that pyright considers them to be publicly exported. Solves this error:

```
"mean" is not exported from module "torch.masked" Pylance[reportPrivateImportUsage]
```

Related: https://github.com/pytorch/pytorch/pulls?q=pyright+export

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131288
Approved by: https://github.com/ezyang
2024-08-01 05:45:08 +00:00
928adb7cc2 Fix empty fake mode problem (#131995)
Title

Differential Revision: [D60348541](https://our.internmc.facebook.com/intern/diff/D60348541/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131995
Approved by: https://github.com/angelayi
ghstack dependencies: #131988
2024-08-01 04:55:37 +00:00
f32ab3b9e3 Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004)
Python's `set` does not guarantee a deterministic iteration order. We recently ran into an internal failure that, as a result, did not fail consistently.

See the repro here: P1453035092.

Now, with these changes, it does consistently fail. In follow ups we could also consider adding a lintrule for uses of either set() or set literals.
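
For readers unfamiliar with the pattern, here is a minimal sketch of an insertion-ordered set built on `dict`, which preserves insertion order. `SimpleOrderedSet` is a hypothetical class written for this illustration; it is not the `OrderedSet` that Inductor uses, whose API may differ.

```python
from typing import Dict, Hashable, Iterable, Iterator


class SimpleOrderedSet:
    """Hypothetical insertion-ordered set backed by a dict."""

    def __init__(self, items: Iterable[Hashable] = ()) -> None:
        # dict.fromkeys drops duplicates while keeping first-seen order.
        self._items: Dict[Hashable, None] = dict.fromkeys(items)

    def add(self, item: Hashable) -> None:
        self._items[item] = None

    def __contains__(self, item: object) -> bool:
        return item in self._items

    def __iter__(self) -> Iterator[Hashable]:
        return iter(self._items)

    def __len__(self) -> int:
        return len(self._items)


names = SimpleOrderedSet(["buf2", "buf0", "buf1", "buf0"])
names.add("buf3")
order = list(names)
print(order)
```

Iteration follows insertion order on every run, so any code that depends on the order behaves the same way each time.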

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
2024-08-01 04:37:15 +00:00
bcd1d2e832 [dynamo] Introduce UnspecializedNNModule guard source (#132304)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132304
Approved by: https://github.com/yanboliang
ghstack dependencies: #132302
2024-08-01 04:35:43 +00:00
e772547d70 [dynamo][rename/refactor] Rename guard_source NN_MODULE to SPECIALIZED_NN_MODULE (#132302)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132302
Approved by: https://github.com/yanboliang
2024-08-01 04:35:43 +00:00
90fa64bd7e [torch][take2] Implement BFloat16 __hip_bfloat16 overloads (#132234)
Summary:
In D60024830 I attempted to define these overloads, but gated the implementation on the wrong macros. Namely I used `__CUDACC__` instead of `__HIPCC__` (facepalm).

It might be worth merging this with the nvidia case via typedefs (e.g. `typedef __hip_bfloat16 __gpu_bfloat16` and `typedef __nv_bfloat16 __gpu_bfloat16`), but that seems like an entirely new paradigm for torch, so I'll punt that change to the future so we can focus on supporting `BFloat16(__hip_bfloat16)` here

Test Plan: CI

Differential Revision: D60362079

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132234
Approved by: https://github.com/houseroad
2024-08-01 04:25:46 +00:00
7911b7bfb7 [inductor][cpp] stabilize do_bench_cpu (#131873)
This PR stabilizes `do_bench_cpu` by using milliseconds for warmup and benchmark runs, aligning with Triton's do_bench.
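
A time-budgeted benchmark loop of this kind might look like the sketch below. It is an assumption-laden illustration, not the code of `do_bench_cpu`: the function name `bench_ms`, its parameters, and its default budgets are all hypothetical.

```python
import time


def bench_ms(fn, warmup_ms: float = 25.0, rep_ms: float = 100.0) -> float:
    """Return the mean runtime of fn() in milliseconds."""
    # Estimate the cost of one call; the floor keeps iteration counts bounded.
    start = time.perf_counter()
    fn()
    estimate_ms = max((time.perf_counter() - start) * 1e3, 1e-3)

    # Turn the millisecond budgets into iteration counts.
    n_warmup = max(1, int(warmup_ms / estimate_ms))
    n_repeat = max(1, int(rep_ms / estimate_ms))

    for _ in range(n_warmup):
        fn()

    start = time.perf_counter()
    for _ in range(n_repeat):
        fn()
    return (time.perf_counter() - start) * 1e3 / n_repeat


mean_ms = bench_ms(lambda: sum(range(1000)), warmup_ms=1.0, rep_ms=5.0)
print(f"{mean_ms:.4f} ms per call")
```

Expressing warmup and measurement as time budgets rather than fixed iteration counts means fast kernels get many repetitions and slow kernels get few, which tends to make results more stable across kernel sizes.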

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131873
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/eellison
2024-08-01 04:25:31 +00:00
b25ef91bf1 [BE][Easy][18/19] enforce style for empty lines in import segments in torch/d*/ (#129770)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129770
Approved by: https://github.com/wconstab
2024-08-01 04:22:50 +00:00
bc7ed1fbdc [FSDP2] add __repr__ to FSDPParamGroup and FSDPParam (#132350)
In pdb, it's pretty common to print `FSDPParamGroup` and `FSDPParam`, so make sure their representations are human-readable.

print `FSDPParam` in pdb
```
FSDPParam(fqn=layers.6._checkpoint_wrapped_module.attention.wq.weight, orig_size=torch.Size([128, 256]))
```
print `FSDPParamGroup` in pdb
```
FSDPParamGroup(fqn=layers.6)
```
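
For illustration, a `__repr__` producing output in this style could be written as below. `ParamInfo` is a hypothetical stand-in class, not the FSDP2 implementation, and its fields are chosen only to mirror the printed output above.

```python
class ParamInfo:
    """Hypothetical stand-in for a parameter wrapper."""

    def __init__(self, fqn, orig_size):
        self.fqn = fqn
        self.orig_size = orig_size

    def __repr__(self):
        # Show only the fields that help identify the object in a debugger.
        return f"ParamInfo(fqn={self.fqn}, orig_size={self.orig_size})"


info = ParamInfo("layers.6.attention.wq.weight", (128, 256))
print(repr(info))
```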

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132350
Approved by: https://github.com/awgu
2024-08-01 04:21:57 +00:00
46ed33b207 add decomposition_table as an arg to get_isolated_graphmodule (#130886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130886
Approved by: https://github.com/wanchaol
2024-08-01 04:21:43 +00:00
073430ebea Don't check for autograd state when lowering to inference IR (#131988)
When lowering to inference IR, we shouldn't error on autograd state changes because we will have preserved the autograd state change at the training level. I think the more correct way of implementing it would be to wrap autograd ops in HOP before decomposing, but that seems low ROI.

Differential Revision: [D60346235](https://our.internmc.facebook.com/intern/diff/D60346235/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131988
Approved by: https://github.com/angelayi
2024-08-01 04:15:37 +00:00
81db69278d unsupported sympy functions in export solver (#132325)
Summary:
A bunch of issues around support for sympy functions like `TruncToInt` and `ToFloat` are uncovered by https://github.com/pytorch/pytorch/issues/131897. This PR addresses only one of them (as the title suggests). Another issue is deserialization, filed as a task: T197567691.

However, the most important issue is that adding runtime assertions is broken right now: specifically, sympy_interp with `PythonReferenceAnalysis` currently doesn't work because the implementations of some of these sympy functions in `PythonReferenceAnalysis` (or the ones it falls through to in its base class) do not expect proxies. This means things like `math.trunc`, `math.floor`, `round`, etc. don't work, which can be easily repro'd by using them inside `torch._check`, for example. According to ezyang, these implementations need to point to new torch functions that can expect proxies (see, e.g., how minimum and maximum are implemented).

Test Plan: added test (original repro provided)

Differential Revision: D60540951

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132325
Approved by: https://github.com/ezyang
2024-08-01 04:11:52 +00:00
10344d76bd Revert "[AOTI] Fix bfloat16 in CPU (#132150)"
This reverts commit a488113062b7231197ace8522ab3cab535c77d0b.

Reverted https://github.com/pytorch/pytorch/pull/132150 on behalf of https://github.com/clee2000 due to I think this broke inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_unspec_inputs_cuda_dynamic_shapes_cuda_wrapper [GH job link](https://github.com/pytorch/pytorch/actions/runs/10189155341/job/28189531216) [HUD commit link](a488113062). Test was not run on PR due to being skipped for being slow ([comment](https://github.com/pytorch/pytorch/pull/132150#issuecomment-2261895048))
2024-08-01 03:35:39 +00:00
a28cda11ef Revert "AutoHeuristic: mixed_mm heuristic for A100 (#131613)"
This reverts commit 344c15a0bb66409ec5e576992090d127cbfa2cff.

Reverted https://github.com/pytorch/pytorch/pull/131613 on behalf of https://github.com/AlnisM due to lintrunner issues ([comment](https://github.com/pytorch/pytorch/pull/131613#issuecomment-2261884149))
2024-08-01 03:22:11 +00:00
589aef4bb0 Fix py codegen to delete values that don't have any users (#131028)
Fixes #131025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131028
Approved by: https://github.com/ezyang
2024-08-01 03:18:37 +00:00
718c13cd39 [inductor] Reinplacing should not allow an op to mutate the same input multiple times (#132238)
Fixes #132196

Let's say we have:
- op(x, y) that mutates both x and y
- new_x, new_y = functional_op(x, y) is the functional variant

If we are presented with functional_op(x, x), we must not reinplace
this into op(x, x), because then it would be writing to the same Tensor.
Instead, it's OK to reinplace one of them and to clone the other:
```
>>> y = x.clone()
>>> op(x, y)
```
This also applies if we have views: functional_op(x, x[0])
should not reinplace into op(x, x[0]).

The fix is to avoid reinplacing an arg if a view of it already has been
reinplaced.
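
The aliasing problem can be reproduced with plain Python lists. This is a toy sketch: `functional_op` and `op` are hypothetical operations (add one to `x`, double `y`) chosen only to make the difference visible, and lists stand in for tensors.

```python
def functional_op(x, y):
    """Functional variant: returns new values and mutates nothing."""
    return [v + 1 for v in x], [v * 2 for v in y]


def op(x, y):
    """In-place variant: mutates both arguments."""
    for i in range(len(x)):
        x[i] += 1
    for i in range(len(y)):
        y[i] *= 2


x = [1, 2, 3]
expected_x, expected_y = functional_op(x, x)  # ([2, 3, 4], [2, 4, 6])

# Incorrect reinplacing: both writes land in the same storage.
aliased = [1, 2, 3]
op(aliased, aliased)  # aliased is now [4, 6, 8], matching neither output

# Correct reinplacing: reinplace one argument and clone the other.
a = [1, 2, 3]
b = list(a)  # plays the role of x.clone()
op(a, b)
print(aliased, a, b)
```

With the clone, the in-place call reproduces both functional outputs, while the aliased call matches neither of them.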

Test Plan:
- new and existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132238
Approved by: https://github.com/oulgen, https://github.com/eellison
2024-08-01 02:37:03 +00:00
344c15a0bb AutoHeuristic: mixed_mm heuristic for A100 (#131613)
This PR introduces changes to AutoHeuristic that allow one to learn a heuristic as a decision tree. I used this to learn a heuristic for mixed_mm on A100 that consistently performs better than the default choice (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L402).

This is what the results look like:
Explanation of columns:
**wrong_max_spdup**: In the worst case, how much better would the best choice have been
**wrong_gman_spdup**: For inputs where the heuristic is wrong, how much better is the best choice on average (geomean)
**max_spdup_default**: Highest speedup achieved by the learned heuristic over the default choice
**gman_spdup_default**: Geomean speedup achieved by the learned heuristic over the default choice
**max_slowdown_default**: If the default choice is better than the choice predicted by the learned heuristic, how much is it better in the worst case
**non_default_preds**: Number of times the learned heuristic predicted a choice that is not the default choice
**default_better**: Number of times the default choice is better than the choice made by the heuristic
```
  set     crit  max_depth  min_samples_leaf  correct  wrong  unsure  total  wrong_max_spdup  wrong_gman_spdup    max_spdup_default  gman_spdup_default  max_slowdown_default  non_default_preds  default_better
train  entropy          5              0.01     2376    740     323   3439         1.855386          1.063236            11.352318            3.438279              1.022164               3116               2
 test  entropy          5              0.01      563    183      71    817         1.622222          1.060897            10.084181            3.507741              1.017039                746               2
```

While the number of wrong predictions is high, on average the best choice is only around 6% better. What is important is that the choice predicted by the learned heuristic performs better than the default choice.

I evaluated my heuristic on gpt-fast `meta-llama/Llama-2-7b-chat-hf` with int8 weight quantization. To get the `tuned_mixed_mm` to trigger, I had to replace `F.linear()` in https://github.com/pytorch-labs/gpt-fast/blob/main/quantize.py#L355 with `torch.matmul(input, self.weight.t().to(dtype=input.dtype))` because the mixed_mm pattern does not match if there is a transpose between a cast and the matmul.
|batch size|prompt length| fallback    |  heuristic  | speedup |
|----------|-------------|------------:|------------:|--------:|
|     1    |      7      | 75.31 tok/s | 148.83 tok/s|  1.97   |
|     1    |     11      | 75.99 tok/s | 148.15 tok/s|  1.94   |
|     4    |      7      | 103.48 tok/s | 472.00 tok/s|  4.56   |
|     4    |     11      | 103.56 tok/s |  371.36 tok/s|  3.58   |
|     8    |      7      | 201.92 tok/s | 813.44 tok/s|  4.02   |
|     8    |     11      | 201.76 tok/s |  699.36 tok/s|  3.46   |

Currently, the heuristic only applies to the following inputs:
- m <= 128, k >= 1024, n >= 1024 (For these sizes, one of the triton kernels wins in most cases, but the heuristic still has to be careful to not choose a config that performs worse than the fallback)
- k % 256 == 0 (If k is not a multiple of the block size, some choices perform extremely badly. In one case, a config that usually performs very well was 130x slower.)
- mat1 not transposed
- mat2 transposed (In some cases, it was hard for the learned heuristic to detect some cases where it
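
Read as code, the applicability conditions listed above amount to a simple precondition check. The function name and signature below are hypothetical and only restate the listed conditions; the actual AutoHeuristic code is not shown in this message and may be structured differently.

```python
def heuristic_applies(m: int, k: int, n: int,
                      mat1_transposed: bool, mat2_transposed: bool) -> bool:
    """Return True if the inputs fall in the range the learned heuristic covers."""
    return (
        m <= 128 and k >= 1024 and n >= 1024  # small-m, large-k/n regime
        and k % 256 == 0                      # k must be a multiple of 256
        and not mat1_transposed
        and mat2_transposed
    )


print(heuristic_applies(16, 4096, 4096, False, True))
print(heuristic_applies(16, 4224, 4096, False, True))  # 4224 % 256 != 0
```

Inputs outside these bounds would presumably keep using the default choice.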

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131613
Approved by: https://github.com/eellison
ghstack dependencies: #131610, #131611
2024-08-01 02:25:54 +00:00
2276d9045a [cpu] add more VecConvert for 8bits (#131876)
Adds more intrinsic specializations for 8-bit conversions, in order to speed up 8-bit SDPA in the future.
- u8 -> i16
- i32 -> f32
- f32 -> i32
- i32 -> i8 (vec512 only, because of the lack of avx512vl for vec256)
- i16 -> i8 (vec512 only, because of the lack of avx512vl for vec256)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131876
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2024-08-01 01:38:39 +00:00
7c89ec0f7c Implements torch.cuda.MemPool() API (#131152)
In this PR:
- Pool id creation logic is refactored and moved to a MemPool class. `graph_pool_handle()` API now uses `torch.cuda.MemPool()` to get a unique id for a pool. Existing tests should cover this change.
- MemPool holds a pointer to a CUDAAllocator as proposed in https://github.com/pytorch/pytorch/issues/124807#issuecomment-2077506997. Tests are added to show usage with CUDAPluggableAllocator.
- MemPoolContext API makes a mempool active. Tests are added to show usage of this API. This API will be used in CUDACachingAllocator to route allocations to a user provided allocator. See draft here: https://github.com/pytorch/pytorch/pull/125722/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131152
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-08-01 01:29:30 +00:00
4e966e8a1c Update inference_mode doc (#132321)
Fix https://github.com/pytorch/pytorch/issues/132288
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132321
Approved by: https://github.com/awgu, https://github.com/soulitzer
2024-07-31 23:50:03 +00:00
a488113062 [AOTI] Fix bfloat16 in CPU (#132150)
Fixes #122986

- add "typedef at::BFloat16 bfloat16;" to the header of generated cpp file

- Suppress warning: comparison of integer expressions of different signedness: ‘long unsigned int’ and ‘int64_t’ {aka ‘long int’} [-Wsign-compare]
  436 |   if (tensor.numel() != numel) {

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132150
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-07-31 23:28:24 +00:00
6b28af1b79 Grouped Query Attention (#128898)
### Approach: Using the current function declaration

**Constraint:** Q_Heads % KV_Heads == 0

**Major change:**
- Added a new argument `enable_gqa: bool` to the sdpa function call
- It gives meaning to the third-to-last dimension (the number of heads).

Sample use cases this would enable:
LLama3

```
# LLama3 8b call to SDPA
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)

output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)

# Output Shape
(batch, 32, seq_len_q, D)
```

### Design Choice:

- Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their number of heads are not equal, facilitating correct and efficient computation in attention mechanisms.
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.
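
The head-matching step can be sketched in plain Python, with each list element standing in for one key/value head. This is only an illustration of the repeat-interleave pattern described above; `expand_kv_heads` is a hypothetical helper, not the function added by this PR.

```python
def expand_kv_heads(kv_heads, num_q_heads):
    """Repeat each KV head so that the head count matches the query heads."""
    num_kv_heads = len(kv_heads)
    if num_q_heads % num_kv_heads != 0:
        raise ValueError("Q_Heads must be a multiple of KV_Heads")
    group_size = num_q_heads // num_kv_heads
    # repeat_interleave semantics: [a, b] with group size 2 -> [a, a, b, b]
    return [head for head in kv_heads for _ in range(group_size)]


expanded = expand_kv_heads(["kv0", "kv1"], num_q_heads=4)
print(expanded)
```

Consecutive query heads therefore share the same key/value head, which is what distinguishes this from tiling the whole KV tensor.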

### Benchmarks:

- **sdpa.py: #130634**
For different batch sizes, enable_gqa=True shows a substantial improvement in the run time of sdpa

 | batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True   |   forward_time when enable_gqa=False    |
| ------------ | ------------- | -------------- | ----------- | ------------ | ----------- | ----------- | ---------------- |
|     1      |     32      |      8       |   2048    |    2048    |   2048    |   100.71  |  119.70  |
|     8      |     32      |      8       |   2048    |    2048    |   2048    |   539.78  |  628.83  |
|     16     |     32      |      8       |   2048    |    2048    |   2048    |   1056.81  |  1225.48  |
|     32      |     32      |      8       |   2048    |    2048    |   2048    |   2099.54  |  2440.45  |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)

- **TorchTitan: https://github.com/pytorch/torchtitan/pull/458**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128898
Approved by: https://github.com/drisspg
2024-07-31 22:58:51 +00:00
f0da167ce5 Add fx graph runnable to tl parse (#130976)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130976
Approved by: https://github.com/ezyang
2024-07-31 22:19:35 +00:00
645c1052a6 Refactor local autotune remote cache to make the code less error prone (#132289)
Fixes #132241

This PR refactors local autotune cache so that disabling it is easier and cleaner.

Differential Revision: [D60537196](https://our.internmc.facebook.com/intern/diff/D60537196)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132289
Approved by: https://github.com/aorenste
ghstack dependencies: #132285
2024-07-31 22:12:22 +00:00
b0e06d9d6a Make config.autotune_remote_cache be a three-way option (#132285)
Similar to fx_graph_cache config, make autotune config be three-way so we can hard enable/disable via config options.
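
A three-way option can be sketched as an `Optional[bool]`: `True` and `False` force the behavior, while `None` defers to some default. This is a minimal illustration only. `resolve_three_way` is a hypothetical name, and how the real default is chosen is not described in this commit message.

```python
from typing import Optional


def resolve_three_way(option: Optional[bool], default: bool) -> bool:
    """Resolve a three-way flag (hypothetical helper).

    True / False hard-enable or hard-disable the feature;
    None means "no explicit choice", so the default applies.
    """
    if option is None:
        return default
    return option
```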

Differential Revision: [D60537105](https://our.internmc.facebook.com/intern/diff/D60537105)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132285
Approved by: https://github.com/aorenste
2024-07-31 22:12:22 +00:00
260c991e20 [inductor] Fix unsoundness with negative-valued indexing expressions (#131761)
This fixes a few instances where we assumed indexing expressions were
non-negative. This is not valid when we have more complicated
expressions involving masking e.g. pointwise cat.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131761
Approved by: https://github.com/ezyang
2024-07-31 21:32:20 +00:00
e74ba1b34a [BE][Easy][15/19] enforce style for empty lines in import segments in torch/_d*/ (#129767)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129767
Approved by: https://github.com/anijain2305
2024-07-31 21:18:11 +00:00
ad9826208c Remove string length limit in ET (#132169)
Summary: ET sets the length limit of string input variables to 8192 characters. However, the node process_group::init has more than 8192 characters for an Ads 128-rank job. This diff temporarily removes this limit so that ET can capture the complete information of the process group.

Test Plan: buck2 test mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTrace

Reviewed By: sanrise

Differential Revision: D60341306

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132169
Approved by: https://github.com/sraikund16, https://github.com/sanrise
2024-07-31 20:54:39 +00:00
d3cefc9e3a AutoHeuristic: Collect data for mixed_mm (#131611)
This PR introduces a script that can be used to collect data for mixed_mm to learn a heuristic with AutoHeuristic. This PR also includes the following things:

- Move pad_mm-related AutoHeuristic files into a subdirectory.
- Introduce an interface, benchmark_runner.py, that can be subclassed to introduce new scripts that run benchmarks in order to collect data with AutoHeuristic (see gen_data_pad_mm.py and gen_data_mixed_mm.py).

The idea behind the interface is that, in the end, it hopefully makes it easier to collect data for new optimizations, and thus makes it easier to learn a heuristic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131611
Approved by: https://github.com/eellison
ghstack dependencies: #131610
2024-07-31 20:45:45 +00:00
f8b6e91840 Add sequoia runner to mac-mps (#132190)
Adds MacOS 15 runners to GitHub actions for Mac-mps test suite

Co-authored-by: Joona Havukainen <jhavukainen@apple.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132190
Approved by: https://github.com/malfet
2024-07-31 20:26:04 +00:00
d72e863b3e Fix lint after PR #130572 (#132316)
Fix lint after https://github.com/pytorch/pytorch/pull/130572

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132316
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/ZainRizvi
2024-07-31 20:00:31 +00:00
aeb78c9849 [TD] More files for test_public_bindings (#132284)
It relies on that file

Also we care about .cpp files too apparently
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132284
Approved by: https://github.com/ZainRizvi
2024-07-31 19:53:40 +00:00
cb4c107d70 [pytorch][counters] DynamicCounter (#132166)
Summary:
Implement a callback-based dynamic counter with pluggable backends.
The backend API and integration is similar to WaitCounter. Note that this counter should only be used with C++ callbacks, since making it safe to be used for GIL-requiring callbacks would be pretty challenging and may defeat the whole purpose of this counter (since the duration of the callback can no longer be guaranteed).

Test Plan: unit test

Differential Revision: D60464055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132166
Approved by: https://github.com/asiab4
2024-07-31 19:52:51 +00:00
dc38646c58 Revert "[pytorch][counters] Pybind for WaitCounter (#132167)"
This reverts commit 2c7bd61afa4b762e00b26bbde43685de080af32a.

Reverted https://github.com/pytorch/pytorch/pull/132167 on behalf of https://github.com/clee2000 due to broke test_public_bindings.py::TestPublicBindings::test_correct_module_names [GH job link](https://github.com/pytorch/pytorch/actions/runs/10183687967/job/28172929836) [HUD commit link](2c7bd61afa) not tested on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/132167#issuecomment-2261328275))
2024-07-31 19:51:56 +00:00
6955bc170d Some updates to merge rules (#132296)
The added people from metamates don't actually make a material
difference right now but I added some for fun.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132296
Approved by: https://github.com/albanD, https://github.com/malfet
2024-07-31 19:49:08 +00:00
2138a710eb enable test_max_pool2d6 after resolving empty array (#132219)
Related to Issue: https://github.com/pytorch/pytorch/issues/131335
Resolving PR: https://github.com/pytorch/pytorch/pull/132023

Test output:
```
(pytorch-3.10) [gabeferns@devvm2252.cco0 ~/pytorch (enable-test-max-pool2d6)]$ TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cpu_cpp_wrapper.py -k test_max_pool2d6
inline_call []
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('extern_calls', 3), ('fxgraph_cache_miss', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inline_call []
stats [('calls_captured', 3), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
inductor [('extern_calls', 3), ('fxgraph_cache_miss', 1)]
.
----------------------------------------------------------------------
Ran 2 tests in 8.668s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132219
Approved by: https://github.com/desertfire
2024-07-31 19:13:54 +00:00
cfe61e84ac Add a 'to' method for moving to and from device for BlockMask (#132087)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132087
Approved by: https://github.com/yanboliang
2024-07-31 19:05:30 +00:00
898a431a46 Dump files that look like FX graphs to structured log (#132100)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132100
Approved by: https://github.com/oulgen
2024-07-31 18:45:28 +00:00
f9e4d05c15 Save and run post compilation steps within FXGraphCache (#130572)
This PR mostly refactors by putting code into utils files so that they can be shared between codecache.py and compile_fx.py. Afterwards, it then changes compile_fx so that:
- When saving to FXGraphCache, we save onto the CompiledFXGraph all the necessary metadata for running post compile steps (realigning inputs, cudagraphification).
- When loading from FXGraphCache, we use the saved information directly, instead of calculating them from scratch.

What this does is make it so that `FXGraphCache.load()` is a perfect cache on compile_fx_inner, in that it **returns exactly what compile_fx_inner returns**. This also makes it possible for AOTAutogradCache, given a key to the fx graph cache and example inputs, to get back the full return value of compile_fx_inner.

## What's a post compile step?
We define a **post-compile** to be the set of actions that need to run after FXGraphCache either loads from the cache or misses and runs compilation. These steps include:
- Setting the tracing context's output strides
- Running cudagraphs if enabled
- Maybe realign inputs if cudagraphs didn't run

To run these steps, we save all the necessary metadata in CompiledFxGraph, and use them on a cache hit to reconstruct the object.
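
The idea of running the same post-compile steps on both a cache miss and a cache hit can be sketched in plain Python. Everything here is a simplified stand-in: `CachedGraph`, `full_compile`, `post_compile`, and `load_or_compile` are hypothetical names, and the step names are only labels, not the real inductor functions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class CachedGraph:
    """Hypothetical cached result: the callable plus post-compile metadata."""
    run: Callable[[List[int]], List[int]]
    cudagraphs_ok: bool
    output_strides: Tuple[int, ...]


stats = {"compiles": 0}
cache: Dict[str, CachedGraph] = {}


def full_compile(key: str) -> CachedGraph:
    # Stand-in for the expensive compile; records metadata for later reuse.
    stats["compiles"] += 1
    return CachedGraph(
        run=lambda xs: [2 * x for x in xs],
        cudagraphs_ok=False,
        output_strides=(1,),
    )


def post_compile(graph: CachedGraph) -> Tuple[Callable[[List[int]], List[int]], List[str]]:
    # Uses only the metadata saved on the cached object, never the input graph.
    steps = ["set_output_strides"]
    steps.append("cudagraphify" if graph.cudagraphs_ok else "maybe_realign_inputs")
    return graph.run, steps


def load_or_compile(key: str):
    graph = cache.get(key)
    if graph is None:  # cache miss: compile and save the metadata
        graph = full_compile(key)
        cache[key] = graph
    return post_compile(graph)  # same path for hit and miss


miss = load_or_compile("graph-0")
hit = load_or_compile("graph-0")
```

Because the post-compile steps read only from the cached object, the hit and the miss return equivalent results while the expensive compile runs once.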

## Splitting cudagraphs work into pre/post compile
Cudagraphs does a lot of work on the input graph module to determine if cudagraphs can be enabled. This is the code that involves cudagraph_tests and stack traces. This will work in a world where we have access to the input graph module, but with AOTAutograd warm start, we won't have access to that information anymore. Therefore we can split cudagraphs work into two parts: on a cache miss (and therefore a full compile), we do the cudagraphs testing work, and save cudagraph_fail_reasons into the cache. Then on a cache hit, we know whether or not we can run cudagraphs, and if we can't, we can emit the correct error messages.

Implementation notes:
- We save `fx_kwargs` directly onto the CompiledFXGraph. `fx_kwargs` is already, by definition, part of the cache key, so this is safe to do when it comes to cache correctness.
- ^ Why do we do above even though FXGraphCache.load takes fx_kwargs as an argument? Because AOTAutogradCache **doesn't** have access to fx_kwargs: they're annoyingly encoded in the functools.partial() of the fw_compiler, so *only* inductor knows about these options. They're fully captured by the AOTAutogradCache key (since every key to fx_kwargs is either a global config, or a field that's deterministic based on an input graph module), but their values are still needed to run cudagraphs/postprocessing. Therefore, it's easier/safer to store it on the cached result.
- Willing to hear other approaches here if we think saving these extra fields is not reasonable, though I can't think of another way to do this that's less complicated to explain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130572
Approved by: https://github.com/eellison
2024-07-31 18:32:40 +00:00
b40249b462 propagate XLA's metadata after functional sync (#131076)
Fixes https://github.com/pytorch/xla/issues/7174

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131076
Approved by: https://github.com/bdhirsh
2024-07-31 18:20:00 +00:00
7eb2a99585 Fix to support unary pointwise ops when an NJT is not the first arg (#131937)
**Background:** NJT utilizes a `jagged_unary_pointwise()` fallback that historically has assumed blindly that the first arg is an NJT. This assumption breaks certain ops; for example `pow(scalar, Tensor)` has an NJT as the second arg.

This PR expands `jagged_unary_pointwise()` and the associated schema validation logic to handle an NJT in args other than the first position.
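
The idea of locating the jagged argument by position instead of assuming it comes first can be sketched without PyTorch. `Jagged`, `find_jagged_arg`, and `jagged_pointwise` are hypothetical stand-ins for illustration and do not mirror the real implementation.

```python
from typing import Any, Callable, Sequence, Tuple


class Jagged:
    """Hypothetical stand-in for a nested jagged tensor: variable-length rows."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]


def find_jagged_arg(args: Sequence[Any]) -> Tuple[int, Jagged]:
    # Scan all positional args instead of blindly taking args[0].
    for index, arg in enumerate(args):
        if isinstance(arg, Jagged):
            return index, arg
    raise TypeError("expected at least one Jagged argument")


def jagged_pointwise(func: Callable[..., Any], *args: Any) -> Jagged:
    index, jagged = find_jagged_arg(args)

    def apply(value: Any) -> Any:
        call_args = list(args)
        call_args[index] = value  # substitute the element at the jagged position
        return func(*call_args)

    return Jagged([[apply(v) for v in row] for row in jagged.rows])


# Analogue of pow(scalar, Tensor): the jagged value is the second argument.
result = jagged_pointwise(pow, 2, Jagged([[1, 2], [3]]))
```
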
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131937
Approved by: https://github.com/soulitzer
ghstack dependencies: #131898, #131704
2024-07-31 17:51:03 +00:00
c3a31d90e7 Fix inlining module-scoped store global (#132224)
Fixes https://github.com/pytorch/pytorch/issues/132165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132224
Approved by: https://github.com/anijain2305
2024-07-31 17:37:43 +00:00
6214b5388b typing ir.py - part 1 (#131845)
See #131852

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131845
Approved by: https://github.com/Skylion007, https://github.com/eellison
2024-07-31 17:37:14 +00:00
144639797a Improve side effects error message (#132223)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132223
Approved by: https://github.com/anijain2305
2024-07-31 17:29:26 +00:00
784a6ec5a3 Revert "Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004)"
This reverts commit 13d744464f10e35c0de50feb4e2340d4dae8e05f.

Reverted https://github.com/pytorch/pytorch/pull/130004 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/10183945999/job/28170099930) [HUD commit link](13d744464f) probably a landrace, the base is 21 hours old ([comment](https://github.com/pytorch/pytorch/pull/130004#issuecomment-2260946562))
2024-07-31 16:49:21 +00:00
9826c542f0 [inductor] skip remote fx caching in failing pattern matcher tests (#132206)
Summary: These tests are failing internally with remote caching enabled because the installed pattern increments a nonlocal counter, which we skip with a cache hit.
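
The failure mode can be reproduced in miniature: a side effect that happens during compilation is skipped when the result comes from a cache. In this sketch `functools.lru_cache` stands in for the remote FX graph cache, and `compile_with_pattern` is a hypothetical function.

```python
from functools import lru_cache

counter = {"matches": 0}


@lru_cache(maxsize=None)
def compile_with_pattern(graph_key: str) -> str:
    # The counter is only incremented when compilation actually runs.
    counter["matches"] += 1
    return f"compiled:{graph_key}"


compile_with_pattern("g")  # cache miss: the counter is incremented
compile_with_pattern("g")  # cache hit: the side effect is skipped
```

A test asserting that the counter was incremented on every call would fail on the second call, which is why such tests skip caching.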

Test Plan:
```
buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:pattern_matcher -- --exact 'caffe2/test/inductor:pattern_matcher - test_match_with_mutation (caffe2.test.inductor.test_pattern_matcher.TestPatternMatcher)' --run-disabled --stress-runs 10
buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:pattern_matcher -- --exact 'caffe2/test/inductor:pattern_matcher - test_match_equivalent_function_invocations1 (caffe2.test.inductor.test_pattern_matcher.TestPatternMatcher)' --run-disabled --stress-runs 10
buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:pattern_matcher -- --exact 'caffe2/test/inductor:pattern_matcher - test_match_equivalent_function_invocations2 (caffe2.test.inductor.test_pattern_matcher.TestPatternMatcher)' --run-disabled --stress-runs 10
buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:pattern_matcher -- --exact 'caffe2/test/inductor:pattern_matcher - test_match_equivalent_function_invocations3 (caffe2.test.inductor.test_pattern_matcher.TestPatternMatcher)' --run-disabled --stress-runs 10
```

Differential Revision: D60491503

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132206
Approved by: https://github.com/oulgen
2024-07-31 16:41:04 +00:00
bdd7a0322d [Dynamo] Fix - str handler for UserDefinedObjectVariable (#130506)
Fixes #130301

Adjusted the call_str method to handle str conversion for UserDefinedObjectVariable.
Attempt in a clean branch for unrelated test errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130506
Approved by: https://github.com/oulgen, https://github.com/anijain2305
2024-07-31 16:39:59 +00:00
fe4f8e97cd [Intel GPU] xpu-ops codegen via backend whitelist (#130082)
# Motivation

This PR enhances the codegen so that it can generate code for the XPU backend.

XPU operators currently need to be registered by hand. Developers cannot take advantage of shared code to handle tensor meta setting (like strides, proxy output, structured kernels). Manually porting code is error-prone and may lead to high maintenance effort.

We utilize the backend_whitelist argument in `gen.py` to generate XPU needed headers and source codes.

# Usage
XPU ops live in `third_party/torch-xpu-ops`; the codegen process is triggered before the compilation of `torch-xpu-ops`.

We use the following command to generate XPU operators:

` python -m torchgen.gen --source-path path/to/yaml/of/xpu   --install-dir  build/xpu    --per-operator-headers    --static-dispatch-backend     --backend-whitelist=XPU`

The difference lies in `--backend-whitelist=XPU`. The backend-whitelist key is an existing argument in torchgen.

The inputs of `gen.py` are code templates and the operators yaml. We share the same templates in `aten`. A simplified yaml lives in `third_party/torch-xpu-ops`, which only includes the supported XPU operators. This yaml is a copy-and-modify of `native_functions.yaml`. No extra entry is added, and the format is the same as the one in `aten`.

# Result

All operators headers are generated in `build/xpu/ATen/ops` independently, which would not affect operators declared/defined by CPU/CUDA or any other backend.  XPU operators only include headers in this folder.

# Verification

* In `third_party/torch-xpu-ops`, we migrate all supported kernels to the structured kernel style, where they are registered through `REGISTER_XPU_DISPATCH` or `TORCH_IMPL_FUNC`, and we have UT verification based on `test_ops.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130082
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/atalman
ghstack dependencies: #130019
2024-07-31 16:31:38 +00:00
aec8bc5e4c [easy] fix type annotation on constraint_violations variable (#127064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127064
Approved by: https://github.com/jananisriram
2024-07-31 16:27:10 +00:00
c85088b1f9 [ROCm] performance optimization for index select (#131713)
As observed while working on this fix (https://github.com/pytorch/pytorch/pull/130994), 128 threads per block seems quite low. This PR increases the default to improve performance, and also slightly refactors the code to replace the hard-coded 128 for better maintainability.

After increasing the default max threads per block from 128 to 256, I saw the "CUDA total" time of `aten::index_select` drop from 44.820ms to 33.608ms when profiling the embedding script below:
```
input = torch.randint(low=0, high=16032, size=[131072], device="cuda")
w = torch.randn([16032, 16384], device="cuda")

with profiler.profile(record_shapes=True) as prof:
    x = torch.nn.functional.embedding(input, w)

```
I tested changing the default from 128 to 256, 512, and 1024 on several different types of devices, and observed that the "CUDA total" time keeps dropping, i.e. the latency improvement grows, as the number increases. Below is one example of the latency improvement ratio:

| Max threads per block | Latency improvement ratio |
| --- | --- |
| 128 | 1x |
| 256 | 1.33x |
| 512 | 1.44x |
| 1024 | 1.49x |
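
As a quick check of the arithmetic, the 1.33x figure for 256 threads per block matches the ratio of the two "CUDA total" times quoted above. The variable names are illustrative only.

```python
baseline_ms = 44.820  # "CUDA total" with 128 threads per block
new_ms = 33.608       # "CUDA total" with 256 threads per block

speedup = baseline_ms / new_ms                          # latency improvement ratio
saved_fraction = (baseline_ms - new_ms) / baseline_ms   # share of time removed

print(f"{speedup:.2f}x faster, {saved_fraction:.0%} less time")
```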

Using 512 as the new default max for non-mi300x to be conservative, which is 1.44x faster than using 128 with the above profiling script.

Using 1024 for mi300x is 1.61x faster than using 128 with the same profiling script, and using 512 is 1.57x faster.

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131713
Approved by: https://github.com/jeffdaily, https://github.com/syed-ahmed, https://github.com/malfet
2024-07-31 16:24:01 +00:00
13d744464f Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004)
The iteration order of Python's set is non-deterministic. We recently ran into an internal failure that did not reproduce consistently.

See the repro here: P1453035092.

Now, with these changes, it does consistently fail. In follow ups we could also consider adding a lintrule for uses of either set() or set literals.
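
To illustrate why an ordered set gives deterministic behavior, here is a minimal sketch built on `dict`, which preserves insertion order. `SimpleOrderedSet` is a hypothetical class for illustration, not the OrderedSet used in Inductor.

```python
from typing import Dict, Generic, Hashable, Iterable, Iterator, TypeVar

T = TypeVar("T", bound=Hashable)


class SimpleOrderedSet(Generic[T]):
    """Hypothetical minimal ordered set backed by an insertion-ordered dict."""

    def __init__(self, items: Iterable[T] = ()) -> None:
        # dict.fromkeys drops duplicates but keeps first-insertion order.
        self._items: Dict[T, None] = dict.fromkeys(items)

    def add(self, item: T) -> None:
        self._items[item] = None

    def __contains__(self, item: object) -> bool:
        return item in self._items

    def __iter__(self) -> Iterator[T]:
        return iter(self._items)

    def __len__(self) -> int:
        return len(self._items)


names = SimpleOrderedSet(["buf2", "buf0", "buf1", "buf0"])
names.add("buf3")
ordered = list(names)  # always insertion order, independent of string hashing
```

With a plain `set` of strings, the iteration order can differ between interpreter runs because of hash randomization, so any code that depends on it may fail only intermittently.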

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
2024-07-31 16:22:11 +00:00
2c7bd61afa [pytorch][counters] Pybind for WaitCounter (#132167)
Summary:
Basic pybind integration for WaitCounter providing a guard API.
Also fixes broken copy/move constructor in WaitGuard (it wasn't really used with the macro-based C++ API).

Test Plan: unit test

Reviewed By: asiab4

Differential Revision: D60463979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132167
Approved by: https://github.com/asiab4
2024-07-31 16:04:40 +00:00
39a3c98aa6 [inductor] fix scalar miss constuctor for long type. (#132117)
Fix the `long` to `c10::scalar` conversion issue.

![image](https://github.com/user-attachments/assets/fc44a170-e293-4688-a185-d189484f6638)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132117
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-07-31 15:40:48 +00:00
b2118573d6 [BE] Unify PG assignments (#132230)
python's `or` operator returns `bar` in cases of
`foo = None or bar`
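
A small sketch of the idiom, with hypothetical names (`pick_group` and the string values are placeholders, not the real process group API):

```python
def pick_group(group=None, default="default_pg"):
    # `None or default` evaluates to `default`;
    # a truthy `group` is returned unchanged.
    return group or default


chosen_default = pick_group()
chosen_explicit = pick_group("pg1")
```

Note that `or` falls back on any falsy value, not only `None`, so this idiom is appropriate only when valid values are always truthy.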

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132230
Approved by: https://github.com/Skylion007, https://github.com/wconstab
2024-07-31 15:28:25 +00:00
9c52013559 [subclasses] Fix nested subclasses flattened tensors ordering (#132096)
get_plain_tensors() should return the leaves in DFS order.
The bug was that plain tensors (leaves) on a given level were returned before the plain tensors of subclasses on that level, even when the subclasses come earlier in the "flatten" list.
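
The intended ordering can be sketched with a toy tree, where strings play the role of plain tensors. `Wrapper` and `get_plain_leaves` are hypothetical stand-ins, not the real subclass machinery.

```python
from typing import List, Union


class Wrapper:
    """Hypothetical stand-in for a tensor subclass; `inner` is its flatten order."""

    def __init__(self, *inner: Union["Wrapper", str]) -> None:
        self.inner = list(inner)


def get_plain_leaves(node: Union[Wrapper, str]) -> List[str]:
    # Depth-first: fully expand each child, in flatten order, before the next.
    if isinstance(node, Wrapper):
        leaves: List[str] = []
        for child in node.inner:
            leaves.extend(get_plain_leaves(child))
        return leaves
    return [node]


# The nested subclass comes before the plain leaf "a" in the flatten list.
tree = Wrapper(Wrapper("b", "c"), "a")
leaves = get_plain_leaves(tree)
```

Here DFS yields `b`, `c`, `a`. Returning same-level leaves first would instead yield `a`, `b`, `c`, which is the ordering bug described above.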

Original issue from AO: https://github.com/pytorch/ao/issues/515

Test: TBD; need to make an asymmetric subclass with dense tensors and subclasses.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132096
Approved by: https://github.com/bdhirsh
2024-07-31 14:12:51 +00:00
5406e46b00 Revert "Add fx graph runnable to tl parse (#130976)"
This reverts commit 52c3af62d6fa4a0a4e22764a89f1877f3b1b28f9.

Reverted https://github.com/pytorch/pytorch/pull/130976 on behalf of https://github.com/albanD due to Broke trunk ([comment](https://github.com/pytorch/pytorch/pull/130976#issuecomment-2260579485))
2024-07-31 13:53:57 +00:00
3d7f541597 [BE][TP] Check module has bias before access (#132137)
Some linear modules, such as the ones reconstructed by `torch.export.unflatten()`, may not have the `bias` attribute, if the original linear module has `bias=None`.
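
A defensive access pattern for this case can be sketched as follows. `LinearLike` is a hypothetical stand-in for a linear module that may lack the attribute entirely, and `get_bias` is an illustrative helper, not the code changed in this PR.

```python
class LinearLike:
    """Hypothetical module-like object that may not define `bias` at all."""

    def __init__(self, weight, bias=None, keep_bias_attr=True):
        self.weight = weight
        if keep_bias_attr:
            self.bias = bias


def get_bias(module):
    # getattr with a default covers both "attribute missing" and "bias is None".
    return getattr(module, "bias", None)


with_bias = get_bias(LinearLike(1.0, bias=0.5))
without_attr = get_bias(LinearLike(1.0, keep_bias_attr=False))
```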

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132137
Approved by: https://github.com/wanchaol
2024-07-31 13:45:28 +00:00
dad125a64b Address clang-tidy nits in BFloat16 (#132203)
Summary: In https://github.com/pytorch/pytorch/pull/131359 I forgot to amend with clang-tidy fixes before merging. This addresses that.

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132203
Approved by: https://github.com/houseroad
2024-07-31 13:41:56 +00:00
45e6a364ee Avoid autocast deprecation warning (#132207)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132207
Approved by: https://github.com/awgu
2024-07-31 13:13:39 +00:00
f4f7aba75d Expose function to probe whether PyTorch was built with FlashAttention (#131894)
This is needed by downstream projects (e.g., xFormers) to determine whether they can count on FlashAttention in PyTorch or whether they need to build it themselves.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131894
Approved by: https://github.com/drisspg, https://github.com/eqy
2024-07-31 11:33:09 +00:00
548c460bf1 [BE][Easy][7/19] enforce style for empty lines in import segments in test/[a-c]*/ and test/[q-z]*/ (#129758)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129758
Approved by: https://github.com/ezyang
2024-07-31 10:54:03 +00:00
46994e753b [NestedTensor] Integrate the layer normalization operator along the jagged dimension into NestedTensor (#132172)
Modify the existing `layer normalization` operator in PyTorch, invoked by `torch.layer_norm`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the `aten` padding operator, enables PyTorch users to invoke `torch.nn.functional.layer_norm` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` or `(B, *, M, N)` nested tensor.

Write unit tests based on the `softmax` jagged operator to verify the accuracy of the ragged reduction implementation for `torch.nn.functional.layer_norm`. Add unit tests to verify error handling for unsupported features.

Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. The layer normalization operator also requires an operation on a 2-dimensional layer; for nested tensors with 4 or more dimensions, I flatten the extra dimensions, then unflatten them after performing layer normalization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132172
Approved by: https://github.com/davidberard98
ghstack dependencies: #132170
2024-07-31 10:51:46 +00:00
89053e382a [NestedTensor] Integrate the softmax operator along the jagged dimension into NestedTensor (#132170)
Modify the existing `softmax` operator in PyTorch, invoked by `torch.softmax`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the aten padding operator, enables PyTorch users to invoke `torch.softmax` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` nested tensor.

Write unit tests based on the `sum` and `mean` jagged operators to verify the accuracy of the ragged reduction implementation for `torch.softmax`. Add unit tests to verify error handling for unsupported features in `NestedTensor` `torch.softmax`.

Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. In addition, the `softmax` operator is required to take in as input an integer for the reduction dimension `dim`, requiring new unit tests heavily inspired by the `sum` and `mean` jagged operator unit tests. `Softmax` also allows for reducing along the batch dimension.
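
To make "softmax along the ragged dimension" concrete, here is a pure-Python reference for a `(B, *, M)` layout represented as nested lists. It loops over each batch item instead of using the padding-based approach described above, so it is only a sketch of the expected result; `softmax_over_ragged_dim` is a hypothetical name.

```python
import math
from typing import List


def softmax_over_ragged_dim(batch: List[List[List[float]]]) -> List[List[List[float]]]:
    """For each batch item of shape (len_i, M), softmax each column over its len_i rows."""
    result = []
    for rows in batch:
        if not rows:
            result.append([])
            continue
        width = len(rows[0])
        out = [[0.0] * width for _ in rows]
        for j in range(width):
            column = [row[j] for row in rows]
            peak = max(column)  # subtract the max for numerical stability
            exps = [math.exp(v - peak) for v in column]
            total = sum(exps)
            for i, e in enumerate(exps):
                out[i][j] = e / total
        result.append(out)
    return result


# Two batch items with ragged lengths 2 and 3, and M = 2.
batch = [
    [[0.0, 1.0], [0.0, 3.0]],
    [[1.0, 2.0], [1.0, 2.0], [1.0, 2.0]],
]
probs = softmax_over_ragged_dim(batch)
```

Each column of each batch item sums to 1 over that item's own number of rows, which is what distinguishes the ragged reduction from a reduction over a fixed-size dimension.
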
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132170
Approved by: https://github.com/davidberard98
2024-07-31 10:51:46 +00:00
e7eeee473c [BE][Easy][14/19] enforce style for empty lines in import segments in torch/_[a-c]*/ and torch/_[e-h]*/ and torch/_[j-z]*/ (#129765)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129765
Approved by: https://github.com/ezyang
2024-07-31 10:42:50 +00:00
9e473fd868 Make adding Buffers more like adding Parameters (#125971)
Add semantics for creating a buffer object that mirror those for creating a parameter. This is done by introducing a new Buffer class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same, as the register_buffer method has not been changed. The persistent parameter in the Buffer type indicates whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new Buffer type recognized by inductor and dynamo. The remaining changes are test changes to make sure that the Buffer type can be used as a drop-in replacement for register_buffer, as it just leads to register_buffer being called. The addition of this new functionality still allows normal tensors to be used as buffers, so these changes are intended to be backwards compatible.

Fixes #35735

Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125971
Approved by: https://github.com/albanD, https://github.com/anijain2305, https://github.com/mlazos
2024-07-31 10:32:40 +00:00
a94e507c39 [aota] Needs autograd if an input requires_grad, agnostic to enable_grad (#128890)
Original issue: https://github.com/pytorch/pytorch/issues/114338

Reland of:  https://github.com/pytorch/pytorch/pull/128016

Summary from previous PR:
We assume only two possible mutually exclusive scenarios:

1. Running the compiled region for training (any of the inputs has requires_grad).
   Produced differentiable outputs should have requires_grad.
2. Running the compiled region for inference (none of the inputs has requires_grad).
   All outputs do not have requires_grad.

Even if the user runs the region under no_grad() but has an input Tensor with requires_grad, we go with the training scenario (1).

With current state that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only that any of inputs requires_grad
2/ if needs_autograd => trace_joint (We are in training scenario 1.) => always run compiled region under with.enable_grad()
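
The decision rule in point 1/ can be sketched as a plain function of the inputs. `Input` and this `needs_autograd` are hypothetical stand-ins: the point is only that the global grad mode is not consulted.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Input:
    """Hypothetical stand-in for a tensor input; only requires_grad matters here."""
    requires_grad: bool = False


def needs_autograd(inputs: Sequence[Input]) -> bool:
    # Training scenario iff any input requires grad.
    # Whether grad mode is enabled is deliberately not checked.
    return any(t.requires_grad for t in inputs)


training = needs_autograd([Input(), Input(requires_grad=True)])
inference = needs_autograd([Input(), Input()])
```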

Changes in partitioner?

Inference and training graphs differed in their return container (list vs. tuple).
The partitioner changes unify this so that a tuple is always returned.
As a result, some graph contents in test_aotdispatch.py change from list to tuple.

Why was it reverted?

There was a regression of hf_Reformer model on inference.
```
TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode
```

This happened because one of the compiled graphs contained outputs that are aliases of inputs that are nn.Parameter(requires_grad=True).

Even though the torchbench inference benchmark runs inside `with torch.no_grad()`, alias ops (specifically expand, for hf_Reformer) preserve requires_grad.

As a result we started compiling a training graph instead of an inference graph.

Fix for view ops:

If outputs are aliases of inputs that require grad, the fact that those outputs require grad is not a reason to generate a training graph.

This is handled in aot_autograd.py, where output_and_mutation_safe is calculated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890
Approved by: https://github.com/bdhirsh
2024-07-31 07:25:19 +00:00
e9d1c26275 fix uniform op in dynamo (#132160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132160
Approved by: https://github.com/anijain2305
2024-07-31 06:48:43 +00:00
ae708e9791 [ONNX] Remove the deprecated SymbolicContext (#132184)
Remove the deprecated SymbolicContext class from torch.onnx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132184
Approved by: https://github.com/titaiwangms
2024-07-31 04:24:32 +00:00
cyy
89da94594e [11/N] Fix clang-tidy warnings in jit (#132131)
Follows #132122

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132131
Approved by: https://github.com/Skylion007
2024-07-31 03:45:52 +00:00
91299c95ec Revert "Add functions from torch.masked._ops to __all__ for torch.masked (#131288)"
This reverts commit 78020ea55d1bc06898577887b80c15d6d2b967dc.

Reverted https://github.com/pytorch/pytorch/pull/131288 on behalf of https://github.com/kit1980 due to Broke test_public_bindings.py::TestPublicBindings::test_correct_module_names [GH job link](https://github.com/pytorch/pytorch/actions/runs/10172945925/job/28136657243) [HUD commit link](78020ea55d) ([comment](https://github.com/pytorch/pytorch/pull/131288#issuecomment-2259581854))
2024-07-31 03:45:09 +00:00
27c9262d29 Fix stdout / stderr typing in SubprocessHandler (#132071)
Summary: Fix stdout / stderr typing in SubprocessHandler. Stdout and Stderr should be `Optional[str]` instead of `str`.

Test Plan: CI

Differential Revision: D60319648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132071
Approved by: https://github.com/Skylion007
2024-07-31 02:51:11 +00:00
52c3af62d6 Add fx graph runnable to tl parse (#130976)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130976
Approved by: https://github.com/ezyang
2024-07-31 02:27:22 +00:00
deb788f6cc Merge torch.nn.utils.rnn type stubs (#131872)
I want to re-attempt:

* #61467

See:

* https://github.com/pytorch/pytorch/issues/10536#issuecomment-2251948730

and this is one of the files I would touch.

quoting @ezyang:

* https://github.com/pytorch/pytorch/issues/91648#issuecomment-1372010129

> The back story here is that in https://github.com/pytorch/pytorch/pull/19089 we added pyi stubs for nn modules, but when we got off Python 2 we started merging the pyi stubs directly into the py files, e.g., as in https://github.com/pytorch/pytorch/pull/43044. But not all the modules got the treatment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131872
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-07-31 02:24:59 +00:00
78020ea55d Add functions from torch.masked._ops to __all__ for torch.masked (#131288)
Add the non-private operations imported in this file to `__all__` so that pyright considers them to be publicly exported. Solves this error:

```
"mean" is not exported from module "torch.masked" Pylance[reportPrivateImportUsage]
```

Related: https://github.com/pytorch/pytorch/pulls?q=pyright+export

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131288
Approved by: https://github.com/ezyang
2024-07-31 02:16:38 +00:00
df0494bbba Clean redundant link libraries for XPU (#131322)
`torch_xpu` should link to `libtorch_cpu.so` instead of `torch_cpu_library`; otherwise redundant link libraries will contaminate `torch_xpu`, especially when MKL is present in both CPU and XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131322
Approved by: https://github.com/cyyever, https://github.com/ezyang
2024-07-31 02:15:15 +00:00
c07aa1c9c9 [Easy] reorder functions in torch._jit_internal (#130531)
Split from #128633.

- #128633

Move commonly used functions (e.g. `is_scripting`) to the top of the module to avoid circular dependency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130531
Approved by: https://github.com/EikanWang, https://github.com/ezyang
2024-07-31 02:12:29 +00:00
fbe6f42dcf [BE][Easy][8/19] enforce style for empty lines in import segments in test/[k-p]*/ (#129759)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129759
Approved by: https://github.com/justinchuby, https://github.com/ezyang
2024-07-31 02:09:20 +00:00
914577569d Remove python 3.8 nightly builds (#132138)
Remove Python 3.8 support from nightly builds, as per https://github.com/pytorch/pytorch/issues/120718
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132138
Approved by: https://github.com/albanD, https://github.com/malfet, https://github.com/huydhn
2024-07-31 01:50:03 +00:00
05317cd8f7 [dtensor][be] improving readability and reducing repeating code (#132070)
**Summary**
I created helper functions that reduce repeated code in the console and JSON APIs, which also improves their readability for future developers.

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump

2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132070
Approved by: https://github.com/XilunWu
2024-07-31 00:53:36 +00:00
f85feef127 [DTensor] add support for custom op registration (#131108)
`register_sharding` is an experimental API that allows users to register sharding strategies for an operator when the tensor inputs and outputs are :class:`DTensor`s. It can be useful when: (1) there doesn't exist a default sharding strategy for ``op``, e.g. when `op` is a custom operator that is not supported by `DTensor`; (2) when users would like to overwrite default sharding strategies of existing operators.

Here's an example:

        @register_sharding(aten._softmax.default)
        def custom_softmax_sharding(x, dim, half_to_float):
            softmax_dim = dim if dim >= 0 else dim + x.ndim
            acceptable_shardings = []

            all_replicate = ([Replicate()], [Replicate(), None, None])
            acceptable_shardings.append(all_replicate)

            for sharding_dim in range(x.ndim):
                if sharding_dim != softmax_dim:
                    all_sharded = (
                        [Shard(sharding_dim)],
                        [Shard(sharding_dim), None, None],
                    )
                    acceptable_shardings.append(all_sharded)

            return acceptable_shardings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131108
Approved by: https://github.com/wanchaol
2024-07-31 00:51:16 +00:00
31205d5198 [Inductor][CPP] Fix Local Buffer issue with inplace result line (#132018)
**Summary**
If a `global buffer` has been replaced by a `local buffer`, we add this `global buffer` to `removed_buffers` to avoid unnecessary allocation. However, a special case is when this `global buffer` can reuse a previous buffer. We didn't handle this case previously, which causes a functional failure in f151f25c0b/torch/_inductor/codegen/wrapper.py (L440)

In this PR, we resolve this issue by not adding this global buffer to `V.kernel.inplace_update_buffers` when the buffer has been marked as `removed`.

**Test Plan**
```
python test/inductor/test_cpu_repro.py -k test_local_buffer_with_line_reuse
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132018
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-07-31 00:38:17 +00:00
882d80fd92 Add lowering for updated _scaled_mm (fixing submodules) (#130422)
Add the Inductor lowering for `torch._scaled_mm`, whose API was last updated in https://github.com/pytorch/pytorch/pull/128683.

The lowering does:
- for tensor-wise scaling, auto-tune between the default ATen kernel (cuBLAS) and Triton kernel configurations.
- for row-wise scaling, auto-tune between the default ATen kernel (CUTLASS kernel added in https://github.com/pytorch/pytorch/pull/125204) and Triton kernel configurations.

The Triton kernel template is based on 3ad9031d02 (D56337896) by @choutim, without using SPLIT_K, and on the mm template in `torch/_inductor/kernel/mm.py`

## Testing:
- Logging shows max-autotune tuning (`AUTOTUNE scaled_mm`) for both tensor-wise and row-wise scaling when called with the two scaling types.
- Row-wise scaling allows operator fusion between preceding pointwise/reduction op and amax/cast:
    - output code Evaluating m=256, n=256, k=256, fusion_case='pointwise', scaling_mode='row'
        - P1477224245 - 2 kernels
    - output code Evaluating m=2048, n=256, k=2048, fusion_case='reduction', scaling_mode='row'
        - P1477227340 - 2 kernels

- UT `python test/inductor/test_fp8.py -- TestFP8Lowering`

## Benchmarking

Eager/compiled tensor-wise/row-wise scaling for various shapes:
https://docs.google.com/spreadsheets/d/1VfWEVuyrwoWysfbS0_u2VHJ-PsdWkF1qIsiD60AzTes/edit?gid=2113587669#gid=2113587669
- Some of the “compiled” cases are slightly slower than “eager”. It’s because max-autotune selected the ATen kernel in the compiled case, and I think the discrepancy is variance.

Eager/compiled tensor-wise/row-wise scaling with pointwise/reduction preceding op for various shapes:
https://docs.google.com/spreadsheets/d/1Nv07NrdffQIoDeMjo9E0V-E-EYrEN0WysO_bn1bc6ns/edit?gid=1715488446#gid=1715488446

## Questions for reviewers:
- Should the type of the accumulator `ACC_TYPE` always be in float32? If not, where is this type set (output layout?)?

## Todo:
- Make the Triton template use the improved persistent kernel version (https://github.com/pytorch/FBGEMM/pull/2735 by @htyu)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130422
Approved by: https://github.com/ipiszy
2024-07-30 23:48:48 +00:00
fdcd2f0dd1 [PT2][Optimus] Add unbind cat to view pass (#132152)
Summary: We observed a new graph transformation opportunity in IG_CTR, which can further remove the cat node.

Test Plan:
# unit test

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```

Buck UI: https://www.internalfb.com/buck2/5061a3fe-b788-4031-b3af-66d48564a2df
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9007199298289131
Network: Up: 2.5GiB  Down: 5.7GiB  (reSessionID-a49b1234-c02c-4a2d-a9ad-9f5b23557522)
Jobs completed: 294061. Time elapsed: 13:47.8s.
Cache hits: 68%. Commands: 106996 (cached: 72904, remote: 33875, local: 217)
Tests finished: Pass 10. Fail 0. Fatal 0. Skip 1. Build failure 0

# benchmark

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ig_ctr" --flow_id 584880697
```

Counter({'pattern_matcher_nodes': 1649, 'pattern_matcher_count': 1538, 'normalization_pass': 343, 'extern_calls': 160, 'normalization_aten_pass': 39, 'merge_splits_pass': 19, 'fxgraph_cache_miss': 9, 'scmerge_cat_added': 4, 'scmerge_cat_removed': 4, 'scmerge_split_removed': 3, 'unbind_stack_pass': 3, 'batch_tanh': 2, 'scmerge_split_sections_removed': 2, 'scmerge_split_added': 2, 'merge_stack_tahn_unbind_pass': 1, 'optimize_cat_inputs_pass': 1, 'unbind_cat_to_view_pass': 1})

before vs after graph diffing: https://www.internalfb.com/intern/diffing/?paste_number=1497865201

Differential Revision: D60325668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132152
Approved by: https://github.com/jackiexu1992
2024-07-30 23:27:18 +00:00
afb04d78c8 Don't try hard to compute alignment of unbacked expressions (#131649)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131649
Approved by: https://github.com/bdhirsh
2024-07-30 23:19:42 +00:00
5a33657b31 [micro_pipeline_tp] implement the pass for fused_scaled_matmul_reduce_scatter (#131951)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131951
Approved by: https://github.com/weifengpy
2024-07-30 23:02:49 +00:00
524aac413c Initial OpInfo-based testing for NJTs (#131704)
This PR utilizes the info from the existing OpInfo database `op_db` to contribute to general NJT testing.
* New tests in `TestNestedTensorOpInfo`
    * `test_forward()` - compares forward output to an unbind-based reference
    * `test_backward()` - compares forward output and grads to an unbind-based reference
    * `test_forward_compile()` - compares forward compile output (`backend="aot_eager_decomp_partition"`) to eager
    * `test_backward_compile()` - compares forward compile output (`backend="aot_eager_decomp_partition"`) and grads to eager
* To avoid adding a bunch of NJT-specific stuff to the `OpInfo` structure, this PR translates `op_db` -> a NJT-specific `njt_op_db`.
    * `UnaryUfuncInfo`s utilize a new `sample_inputs_unary_njt_pointwise()` which iterates through a comprehensive list of NJTs: contiguous / non-contiguous, dims 2, 3, and 4, transposed / not, etc.
    * `BinaryUfuncInfo`s utilize a new `sample_inputs_binary_njt_pointwise()` which iterates through a comprehensive list of NJTs: contiguous / non-contiguous, dims 2, 3, and 4, transposed / not, etc.
    * `ReductionOpInfo`s utilize a new `sample_inputs_njt_reduction()` which covers full reductions, reductions over the jagged dim, and reductions over the non-jagged dim
* Several xfails were added to get things passing

TODO (future PRs):
* Pass non-contiguous / non-contiguous with holes NJTs (maybe we should have separate tests for these? most ops don't support NJTs with holes today)
* Mixed (NT, T), (T, NT) inputs for binary ops
* Handle other types of OpInfos (beyond unary pointwise, binary pointwise, and reduction) by manually writing sample_inputs_funcs
* Address all xfails via fixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131704
Approved by: https://github.com/soulitzer
ghstack dependencies: #131898
2024-07-30 23:02:24 +00:00
93facac02c [NeuralNetInference] Bring up iOS builds (#131917)
Summary: Mirror Android setup to static link & use lite interpreter on iOS

Test Plan: CI

Reviewed By: EscapeZero

Differential Revision: D60156611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131917
Approved by: https://github.com/cccclai
2024-07-30 23:01:09 +00:00
53a5e0f1a8 [BE] delete spmd module (#132072)
Summary:
As titled, fully delete the spmd module, as we stopped working on it and the code is already broken with no unit tests enabled.

We should not keep it in the codebase, as it no longer provides any value and it burdens DTensor with constantly maintaining compatibility with it (i.e. code paths/imports).

Test Plan: sandcastle

Differential Revision: D60402105

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132072
Approved by: https://github.com/awgu, https://github.com/XilunWu, https://github.com/fegin, https://github.com/seemethere, https://github.com/albanD, https://github.com/yifuwang
2024-07-30 22:20:21 +00:00
a141334c88 mitigate wrong tensor.dim_order() (#131366)
Summary:
There are some issues with dim order creation; T194410923 has a detailed illustration.

One of the reasons is that the `is_contiguous` function may sometimes report an ambiguous memory format (some tensors can be both channels_last and contiguous at the same time), and dim order generation relies on that memory format result underneath as a shortcut.

To mitigate the issue, we make dim order use the shortcut if and only if the tensor belongs to exactly one memory format. Otherwise, we still recalculate it.
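
The ambiguity can be illustrated with a small pure-Python sketch. This is not the PyTorch implementation: the helper names and the stride-sorting fallback are assumptions made for illustration, and only 4-D shapes with the two formats named above are considered.

```python
CONTIGUOUS = (0, 1, 2, 3)     # NCHW, dims listed outermost to innermost
CHANNELS_LAST = (0, 2, 3, 1)  # NHWC


def strides_for(shape, dim_order):
    """Strides (in elements) of a dense tensor laid out in `dim_order`."""
    strides = [0] * len(shape)
    running = 1
    for d in reversed(dim_order):
        strides[d] = running
        running *= shape[d]
    return tuple(strides)


def matching_dim_orders(shape, strides):
    """Every candidate order consistent with `strides`.

    Strides of size-1 dims are ignored because they never affect
    addressing, which is exactly what makes some tensors ambiguous.
    """
    matches = []
    for order in (CONTIGUOUS, CHANNELS_LAST):
        expected = strides_for(shape, order)
        if all(s == e for n, s, e in zip(shape, strides, expected) if n != 1):
            matches.append(order)
    return matches


def dim_order(shape, strides):
    matches = matching_dim_orders(shape, strides)
    if len(matches) == 1:
        return matches[0]  # shortcut: exactly one memory format applies
    # Ambiguous (or neither): recompute from the strides themselves.
    return tuple(sorted(range(len(shape)), key=lambda d: (-strides[d], d)))


unambiguous = matching_dim_orders((2, 3, 4, 4), (48, 16, 4, 1))
ambiguous = matching_dim_orders((2, 1, 4, 4), (16, 16, 4, 1))
```

Here the shape `(2, 1, 4, 4)` has a single channel, so its strides satisfy both layouts and the shortcut is skipped in favor of recomputation.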

Test Plan: CI

Differential Revision: D60056793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131366
Approved by: https://github.com/ezyang
2024-07-30 21:58:15 +00:00
2b43fab555 [DTensor] Added naive support for nn.init.orthogonal_ (#132104)
Try to unblock https://github.com/pytorch/pytorch/issues/131991

- `nn.init.orthogonal_` uses `tensor.new`, which is the legacy factory function. We change this to `tensor.new_empty` (empty is okay since it will be immediately followed by `.normal_()` to fill the tensor) so that it preserves `DTensor`-ness.
- `nn.init.orthogonal_` uses QR decomposition (`aten.linalg_qr.default`) and `torch.diag` (calling into `aten.diagonal_copy.default`). For simplicity, we use naive replicate strategies for now. `aten.diagonal_copy.default` could do something more sophisticated for sharded inputs, but I would rather defer that to later due to the complexity. For `orthogonal_` support specifically, since the result of the QR decomp will be replicated, the input to `aten.diagonal_copy.default` will be replicated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132104
Approved by: https://github.com/albanD, https://github.com/wanchaol
2024-07-30 21:55:09 +00:00
3e142d766a [EZ] Make consistent with scale-config.yml (#132164)
Fix inconsistencies from test-infra's scale-config.yml file

To be followed up by https://github.com/pytorch/test-infra/pull/5513 which will catch such inconsistencies going forward
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132164
Approved by: https://github.com/clee2000, https://github.com/malfet, https://github.com/zxiiro
2024-07-30 21:42:23 +00:00
69c34f6e4c Corrects Error Codes from cudaHostRegister (#132089)
The incorrect error codes were causing some terrible error messages, e.g.:

```
# printing directly: cudaError.???
# casting to int first: 712

Traceback (most recent call last):
  File "/data/users/lpasqualin/fbsource/fbcode/scripts/lpasqualin/playground.py", line 15, in <module>
    main()
  File "/data/users/lpasqualin/fbsource/fbcode/scripts/lpasqualin/playground.py", line 11, in main
    _create_cpu_state_dict(sd, share_memory=True, pin_memory=True)
  File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 436, in _create_cpu_state_dict
    ret = _iterate_state_dict(
          ^^^^^^^^^^^^^^^^^^^^
  File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 143, in _iterate_state_dict
    ret = {
          ^
  File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 144, in <dictcomp>
    key: _iterate_state_dict(
         ^^^^^^^^^^^^^^^^^^^^
  File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 125, in _iterate_state_dict
    ret = tensor_func(iter_object, pg, device, companion_obj)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 428, in tensor_func
    succ == 0
AssertionError: Pinning shared memory failed with error-code: cudaError.???
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132089
Approved by: https://github.com/Skylion007
2024-07-30 21:42:00 +00:00
ff377e16ab Improve logging in the TSConverter (#132082)
Summary: Currently, running explain with TORCH_LOGS enabled causes duplicate log output because explain uses the exact same code path as conversion. This PR disables logging while explain is running, and moves all logging to convert() so that nothing is logged from __init__ when we are only using explain.

Test Plan: Manual testing with attached outputs.

Reviewed By: SherlockNoMad, angelayi

Differential Revision: D60199007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132082
Approved by: https://github.com/ydwu4
2024-07-30 21:37:44 +00:00
495d413519 Include code object of frame being compiled in stack (#132161)
This is pretty useful to have!

Test plan: https://internalfb.com/intern/fblearner/details/586653862/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132161
Approved by: https://github.com/oulgen
2024-07-30 21:33:27 +00:00
19db4f6014 [capture_triton] fix special kwargs path (#132143)
I didn't test this path when creating the orchestrator. This PR fixes
that path to work in the capture_triton path. The problem is that we are
handling a value that is an int (in the capture_triton path) and a
ConstantVariable (in the Dynamo triton path) so we abstract that out in
the orchestrator.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132143
Approved by: https://github.com/oulgen
2024-07-30 20:30:40 +00:00
1118c74b5f [PT2] Port fuse_chunk_reshape_unsqueeze_concat_pass to PT2 pre_grad passes (#131902) (#132078)
Summary:

Port fuse_chunk_reshape_unsqueeze_concat_pass to PT2 pre_grad passes

Test Plan: run new UTs

Reviewed By: frank-wei

Differential Revision: D60258724

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132078
Approved by: https://github.com/frank-wei
2024-07-30 20:17:06 +00:00
d53b11bb6e Strict shape checking for NJTs with TestCase.assertEqual() (#131898)
**Background**: `TestCase.assertEqual()` is commonly used during test case validation. Historically, to support NSTs, the logic was written to compare two nested tensors by unbinding them and comparing their components. This logic applied to NJTs as well, which in practice meant that two NJTs with different nested ints in their shapes could compare equal if their components were equal.

This PR changes the above logic so that NJTs are no longer unbound during comparison, allowing them to receive full shape validation. This makes `TestCase.assertEqual()` stricter for NJTs, requiring them to have the same nested ints in their shapes to compare equal.

Note that some tests rely on the old, looser behavior. To address this, the PR introduces a base `NestedTensorTestCase` that defines a helper function `assertEqualIgnoringNestedInts()` so that these tests can explicitly opt in to the looser comparison behavior.
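
The difference between the two comparison modes can be sketched without PyTorch. `FakeNJT` and both helper functions below are hypothetical stand-ins, assuming only that an NJT's shape carries a symbolic nested int alongside its components:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeNJT:
    """Hypothetical stand-in for a nested jagged tensor."""
    components: tuple  # one tuple of values per ragged row
    nested_int: str    # symbolic ragged dimension, e.g. "j1"


def equal_ignoring_nested_ints(a, b):
    # Old, looser behavior: unbind and compare components only.
    return a.components == b.components


def equal_strict(a, b):
    # New behavior: the nested ints in the shapes must match as well.
    return a.nested_int == b.nested_int and a.components == b.components


a = FakeNJT(((1, 2), (3,)), "j1")
b = FakeNJT(((1, 2), (3,)), "j2")
loose = equal_ignoring_nested_ints(a, b)
strict = equal_strict(a, b)
```

Two objects with identical components but different nested ints pass the loose check and fail the strict one, which is the case the opt-in helper is meant to cover.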
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131898
Approved by: https://github.com/soulitzer
2024-07-30 20:05:48 +00:00
58f76bc301 Revise skip torchrec logic (#130783)
Summary:
The previous logic adds the skipped files when the file is imported, which happens at a very early stage. However, skip_torchrec may be set at a later stage (e.g., in APS, we set it during trainer execution). In that case, the skip logic still takes effect because the skipped files have already been added.

So in this diff, we revise the logic so that it can adapt to changes of skip_torchrec at later stages.
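
A minimal sketch of the import-time versus call-time difference, assuming a hypothetical `Config` object and path prefix (neither is the actual Dynamo code):

```python
class Config:
    skip_torchrec = True


config = Config()
TORCHREC_PREFIX = "torchrec/"  # hypothetical path prefix

# Old behavior (sketch): the decision is frozen when the module is imported.
_SKIPPED_AT_IMPORT = {TORCHREC_PREFIX} if config.skip_torchrec else set()


def should_skip_eager(path):
    return any(path.startswith(prefix) for prefix in _SKIPPED_AT_IMPORT)


def should_skip_lazy(path):
    # Revised behavior (sketch): consult the flag at every call.
    return config.skip_torchrec and path.startswith(TORCHREC_PREFIX)


# The flag is flipped later, e.g. during trainer execution.
config.skip_torchrec = False
eager = should_skip_eager("torchrec/modules/embedding.py")
lazy = should_skip_lazy("torchrec/modules/embedding.py")
```

The eager variant keeps skipping because its set was built before the flag changed, while the lazy variant follows the current value.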

Test Plan:
Tested on APS models:

  buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher_live -- mode=local_ig_fm_uhm_mini model_name=ig_fm_one_sparse_benchmark features=ig_fm_one_sparse_benchmark model=ig_fm_one_sparse_benchmark training.pipeline_type=pt2

commit: 2fb485d9e

torchrec related paths were not skipped.

Differential Revision: D59779153

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130783
Approved by: https://github.com/yanboliang
2024-07-30 19:55:20 +00:00
964f97539f [MPS] Correct nonzero warning and fix the test (#132127)
#125355 lifted the natively supported macOS version to 14.

Fixes #132110
Probably fixes this flaky test disabling issue: #126492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132127
Approved by: https://github.com/malfet
2024-07-30 19:46:25 +00:00
f2dedc910e Improve SpeculationLog error message (#131982)
There are some substantive changes. Instead of recording the *next* instruction in the speculation log, I record the *current* instruction. I think this is more intuitive: we always call speculation at the beginning of executing an instruction, so logically the entry is associated with the current instruction. (Note that self.instruction_pointer is the next instruction, as conventionally we increment the IP before calling speculate.)

The cosmetic change is to also pass in the Instruction corresponding to the IP and print it, and beef up the error message, including notes about the previous instruction that was run before it failed (this is typically the critical instruction).

At time of submission, this test case triggered the error:

```
diff --git a/test/distributed/test_dynamo_distributed.py b/test/distributed/test_dynamo_distributed.py
index 5ade17856e1..60ef89be346 100644
--- a/test/distributed/test_dynamo_distributed.py
+++ b/test/distributed/test_dynamo_distributed.py
@@ -844,6 +844,39 @@ class TestMultiProc(DynamoDistributedMultiProcTestCase):
             for r in res[1:]:
                 self.assertEqual(res[0], r)

+    @unittest.skipIf(not has_triton(), "Inductor+gpu needs triton and recent GPU arch")
+    @config.patch(enable_compiler_collectives=True)
+    def test_compiler_collectives_automatic_dynamic_speculation_divergence(self):
+        with _dynamo_dist_per_rank_init(self.rank, self.world_size):
+            torch._dynamo.utils.clear_compilation_metrics()
+
+            # TODO: This should be possible to do inside the function, but
+            device = f"cuda:{self.rank}"
+
+            @torch.compile()
+            def f(x, y):
+                zx = x.shape
+                zy = y.shape
+                return x.sum() + y.sum()
+
+            if self.rank == 0:
+                dataloader = [4, 4]
+            else:
+                dataloader = [3, 4]
+
+            for data in dataloader:
+                f(
+                    torch.randn(data, device=self.rank),
+                    torch.randn(data, device=self.rank),
+                )
+
+            metrics = torch._dynamo.utils.get_compilation_metrics()
+            # Number of compiles same on all nodes
+            res = [None] * self.world_size
+            torch.distributed.all_gather_object(res, len(metrics))
+            for r in res[1:]:
+                self.assertEqual(res[0], r)
+

 @requires_nccl()
```

although I plan to fix this soon.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131982
Approved by: https://github.com/anijain2305, https://github.com/mlazos, https://github.com/jansel
2024-07-30 19:21:31 +00:00
e6cddc9271 Fix public API tests (#131386)
This PR fixes a bug in `test_correct_module_names` introduced in #130497. It also addresses post-fix test failures in:
* `torch/ao/quantization/__init__.py` - set the correct `__module__` for several public API helpers
* `torch/library.py` - add `register_vmap` to `__all__`
* `torch/nn/attention/flex_attention.py` - make `round_up_to_multiple` private by prepending an underscore
* `torch/storage.py` - introduce `__all__` to avoid `Self` being re-exported as a public API
* `torch/distributed/pipelining/schedules.py` - add `ZeroBubbleAlgorithm` to `__all__`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131386
Approved by: https://github.com/albanD
2024-07-30 18:42:54 +00:00
f217b470cc [CMAKE] Avoid double setting of LDFLAGS (#130370)
It was observed that in some environments `LDFLAGS` gets directly appended to `CMAKE_SHARED_LINKER_FLAGS`. As a result, the same linker flag can appear twice in `CMAKE_SHARED_LINKER_FLAGS`, because it is also set manually:
1bf4a44b33/CMakeLists.txt (L541-L542)
This flag collision causes build failures at the `cmake` stage.
This PR adds an instruction to `CMakeLists.txt` to avoid double setting of `LDFLAGS` into `CMAKE_SHARED_LINKER_FLAGS`.

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130370
Approved by: https://github.com/atalman, https://github.com/tinglvv, https://github.com/malfet
2024-07-30 18:16:04 +00:00
3816f6420a [BE] remove unnecessary _dispatch_sqrt by using ** 0.5 (#131358)
Based on the discussion at https://github.com/pytorch/pytorch/pull/129905#discussion_r1675605075, where `** 0.5` was found to be no slower than `math.sqrt`.
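
A small sketch of why a dispatch helper becomes unnecessary, assuming the removed helper existed to choose between `math.sqrt` for Python floats and a tensor square root. `FakeTensor` and `dispatch_sqrt` are hypothetical stand-ins so that the example runs without PyTorch:

```python
import math


class FakeTensor:
    """Hypothetical tensor stand-in: supports ** but not math.sqrt."""

    def __init__(self, values):
        self.values = list(values)

    def __pow__(self, exponent):
        return FakeTensor(v ** exponent for v in self.values)


def dispatch_sqrt(x):
    # Sketch of a type-dispatching helper.
    if isinstance(x, FakeTensor):
        return FakeTensor(math.sqrt(v) for v in x.values)
    return math.sqrt(x)


def sqrt_via_pow(x):
    # One expression covers both cases through the ** operator.
    return x ** 0.5


scalar = sqrt_via_pow(16.0)
tensor_like = sqrt_via_pow(FakeTensor([4.0, 9.0]))
```

Because `**` is resolved by the operand's own `__pow__`, the same expression works for plain floats and for tensor-like objects.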

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131358
Approved by: https://github.com/albanD
2024-07-30 18:08:17 +00:00
9f6d7df3d9 docs(multinomial): Add reference to Multinomial class (#131904)
This PR just adds the reference to the class
`torch.distributions.multinomial.Multinomial` in `torch.multinomial`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131904
Approved by: https://github.com/jbschlosser
2024-07-30 18:05:07 +00:00
239d4d2489 Revert "[reland][inductor] switch AotCodeCompiler to new cpp_builder (#130127)"
This reverts commit 9606d61e0c921b886d20cb61454043c6c270ae89.

Reverted https://github.com/pytorch/pytorch/pull/130127 on behalf of https://github.com/ZainRizvi due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/130127#issuecomment-2258871791))
2024-07-30 17:39:41 +00:00
9027db1ab8 TCPStore: fix remote address (#131773) (#131913)
Summary:
This fixes corrupt remote address logs caused by dangling pointers to addrinfo_storage inside of addrinfo.

This relands it since it got reverted due to a fmt::format issue internally.

Original Pull Request: https://github.com/pytorch/pytorch/pull/131773
Approved by: https://github.com/kurman

Test Plan:
Enable debug logs and verify addresses are correct

```
TORCH_CPP_LOG_LEVEL=INFO TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 TORCH_DISTRIBUTED_DEBUG=DETAIL LOGLEVEL=INFO python test/distributed/test_store.py -v
buck2 test @//mode/dev-nosan //caffe2/test/distributed:store
```

Differential Revision: D60296583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131913
Approved by: https://github.com/kurman, https://github.com/rsdcastro, https://github.com/Skylion007
2024-07-30 17:27:33 +00:00
3864a2d834 [profiler ut] Update event name in test_profiler.py (#131757)
Supports kernel names that contain uppercase letters.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131757
Approved by: https://github.com/aaronenyeshi
2024-07-30 17:15:31 +00:00
32c57e78ed Specialize sym node when used as device kwarg (#131811)
Fixes https://github.com/pytorch/pytorch/issues/131189.

We specialize the symint in python_arg_parser when used as kwarg device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131811
Approved by: https://github.com/yanboliang, https://github.com/jansel, https://github.com/albanD
2024-07-30 17:11:57 +00:00
33ce9cf7f9 [FSDP2] Relaxed overlap timing check to avoid flakiness (#132116)
Trying to fix https://github.com/pytorch/pytorch/issues/131081

See https://github.com/pytorch/pytorch/issues/131081#issuecomment-2239443504 for detailed context. This PR is relaxing one assertion against the _baseline_ to try to fix the flakiness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132116
Approved by: https://github.com/Skylion007
2024-07-30 14:28:12 +00:00
16e0868a3d [FSDP] Add hpu device to _get_remote_device_str (#132120)
When creating a chunk-sharded tensor, `_get_remote_device_str` is used. By default it uses the node count to determine the device:instance. For hpu, we need to use the current device to get the device instance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132120
Approved by: https://github.com/awgu
2024-07-30 14:24:24 +00:00
a843178529 Let dynamo inline functional_call (#128646)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128646
Approved by: https://github.com/zou3519
2024-07-30 14:22:23 +00:00
12b67bd998 Fix pyi annotation for ProcessGroupGloo.Options (#132080)
This PR fixes the pyi annotation for `ProcessGroupGloo.Options` based on the definition in the `torch/csrc/distributed/c10d/init.cpp` file.

Fixes #132054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132080
Approved by: https://github.com/Skylion007
2024-07-30 13:52:31 +00:00
499ead96ff Revert "Grouped Query Attention (#128898)"
This reverts commit d039b14207fe659d664c590efc06cc0a2abc96c0.

Reverted https://github.com/pytorch/pytorch/pull/128898 on behalf of https://github.com/albanD due to Broken test on main ([comment](https://github.com/pytorch/pytorch/pull/128898#issuecomment-2258314481))
2024-07-30 13:11:24 +00:00
cyy
bdf57da6a6 [3/N] Enable clang-tidy on torch/csrc/inductor (#132101)
Follows #132040
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132101
Approved by: https://github.com/Skylion007
2024-07-30 13:04:57 +00:00
cyy
eccbd408e5 [10/N] Fix clang-tidy warnings in jit (#132122)
Follows #132010

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132122
Approved by: https://github.com/Skylion007
2024-07-30 12:56:31 +00:00
83db609ee5 [inductor] fix the cudagraph tree test (#132043)
Summary:
There are two kinds of exceptions:
Case #1:
```
static input data pointer changed.
input name: primals_2. data pointer changed from 140315748992000 to 140315748993536. input stack trace:   File "/dev/shm/uid-30083/c0899c70-seed-nspid4026535598_cgpid16622182-ns-4026535192/caffe2/test/inductor/test_cudagraph_trees.py", line 1826, in forward
    return self.static_tensor + x + self.goo(x)
  File "/dev/shm/uid-30083/c0899c70-seed-nspid4026535598_cgpid16622182-ns-4026535192/caffe2/test/inductor/test_cudagraph_trees.py", line 1816, in forward
    return self.linear(x)

input name: primals_3. data pointer changed from 140315748990976 to 140315748993024. input stack trace:   File "/dev/shm/uid-30083/c0899c70-seed-nspid4026535598_cgpid16622182-ns-4026535192/caffe2/test/inductor/test_cudagraph_trees.py", line 1825, in forward
    self.static_tensor.add_(torch.ones((2, 2), device="cuda"))

```
Case #2:
```
static input data pointer changed.
input name: primals_2. data pointer changed from 139852509086720 to 139852509088256. input stack trace: None
input name: primals_3. data pointer changed from 139852509085696 to 139852509087744. input stack trace:   File "/dev/shm/uid-30083/f61ee184-seed-nspid4026560782_cgpid769179-ns-4026560865/caffe2/test/inductor/test_cudagraph_trees.py", line 1825, in forward
    self.static_tensor.add_(torch.ones((2, 2), device="cuda"))

```
The current implementation only covered case #2.

Test Plan: https://www.internalfb.com/intern/testinfra/testrun/15481123762274476

Differential Revision: D60340212

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132043
Approved by: https://github.com/BoyuanFeng
2024-07-30 08:35:56 +00:00
36e8289129 [PT2][Optimus] Optimize cat node inputs pattern (#131866)
Test Plan:
# unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_passes
```

# benchmark

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ig_ctr" --flow_id 584880697
```

Counter({'pattern_matcher_nodes': 1589, 'pattern_matcher_count': 1497, 'extern_calls': 393, 'normalization_pass': 342, 'merge_splits_pass': 19, 'fxgraph_cache_miss': 12, 'scmerge_cat_added': 4, 'scmerge_cat_removed': 4, 'scmerge_split_removed': 3, 'unbind_stack_pass': 3, 'batch_tanh': 2, 'scmerge_split_sections_removed': 2, 'scmerge_split_added': 2, 'merge_stack_tahn_unbind_pass': 1, 'optimize_cat_inputs_pass': 1})

P1496150856

Differential Revision: D60274533

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131866
Approved by: https://github.com/jackiexu1992
2024-07-30 07:49:26 +00:00
54d4f6bbca [Inductor][FlexAttention] Correct partial/full blocks naming (#131993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131993
Approved by: https://github.com/drisspg
2024-07-30 06:40:40 +00:00
03e058189e [dynamo] Support dict unpack of MutableMapping objects (#131961)
Fixes https://github.com/pytorch/pytorch/issues/128067

The basic functionality was alredy introduced earlier. This just ensures
that we support UserDefinedObjectVariable.
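
As a rough illustration of the pattern in the title, the sketch below unpacks a user-defined `MutableMapping` with `**`. The `Config` class and the `scale` function are hypothetical names made up for this example; they are not taken from the PR or the linked issue.

```python
from collections.abc import MutableMapping


class Config(MutableMapping):
    """Hypothetical user-defined mapping (illustrative only)."""

    def __init__(self, **items):
        self._items = dict(items)

    def __getitem__(self, key):
        return self._items[key]

    def __setitem__(self, key, value):
        self._items[key] = value

    def __delitem__(self, key):
        del self._items[key]

    def __iter__(self):
        return iter(self._items)

    def __len__(self):
        return len(self._items)


def scale(x, factor=1, offset=0):
    return x * factor + offset


cfg = Config(factor=3, offset=1)

# Both forms go through the mapping protocol (keys() plus __getitem__).
merged = {**cfg, "extra": True}
result = scale(2, **cfg)
```

Because `Config` is a plain Python object rather than a `dict`, this is presumably the kind of object the commit refers to as a user-defined object.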

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131961
Approved by: https://github.com/williamwen42, https://github.com/mlazos, https://github.com/yanboliang
ghstack dependencies: #131827, #131956
2024-07-30 05:49:58 +00:00
f806128619 [dynamo] Skip <frozen abc> to skip __isisintance__ check on abc objects (#131956)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131956
Approved by: https://github.com/williamwen42, https://github.com/mlazos
ghstack dependencies: #131827
2024-07-30 05:49:58 +00:00
13457d1da0 [dynamo][log] Suggest to use pytree when graph-break on optree (#131827)
Discovered while working on https://github.com/pytorch/pytorch/issues/121369
On the model above, the log looks like this

~~~
/home/anijain/local/pytorch2/torch/_dynamo/variables/functions.py:698: UserWarning: Graph break for an optree C/C++ function optree._C.PyCapsule.flatten. Consider using torch._utils.pytree - https://github.com/pytorch/pytorch/blob/main/torch/utils/_pytree.py.
  torch._dynamo.utils.warn_once(msg)
/home/anijain/local/pytorch2/torch/_dynamo/variables/functions.py:698: UserWarning: Graph break for an optree C/C++ function optree.PyCapsule.unflatten. Consider using torch._utils.pytree - https://github.com/pytorch/pytorch/blob/main/torch/utils/_pytree.py.
  torch._dynamo.utils.warn_once(msg)
  ~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131827
Approved by: https://github.com/zou3519, https://github.com/mlazos
2024-07-30 05:49:58 +00:00
fc6066b80f improve mkldnn_linear_pointwise_binary performance for contiguous tensor with non default contiguous strides (#132019)
Fixes https://github.com/pytorch/pytorch/issues/131734

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132019
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-07-30 05:02:38 +00:00
40f8db5741 [audio hash update] update the pinned audio hash (#132105)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132105
Approved by: https://github.com/pytorchbot
2024-07-30 03:39:27 +00:00
aa1488fe02 [inductor] turn on enable_kernel_profile on Windows. (#132025)
Enable `TORCHINDUCTOR_CPP_ENABLE_KERNEL_PROFILE` on Windows inductor.

Local tested pass:
![image](https://github.com/user-attachments/assets/a82351af-cc56-4ba1-a8f4-08f1c38713d1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132025
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-30 03:02:09 +00:00
475da800c7 [inductor] optimize cflags for Windows. (#131980)
changes:
1. optimize cflags for Windows. Ref: https://github.com/pytorch/pytorch/blob/v2.4.0/torch/utils/cpp_extension.py#L215

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131980
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-30 02:59:51 +00:00
bdc42e3fb8 [inductor] validate_can_generate_cpp_wrapper add win32 support. (#131978)
Changes:
1. `validate_can_generate_cpp_wrapper` add win32 support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131978
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-30 02:59:48 +00:00
baa4c9ca46 Optimize aten.cat calls of a repeated element (#132081)
This was a particular problem for a model I saw which would have a large number of repeats, making compilation slow.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132081
Approved by: https://github.com/shunting314
2024-07-30 02:56:00 +00:00
f8e4060484 [Inductor][CPP] Enhance cppcsevar data type deduce (#130827)
**Summary**
Previously, we used `data_type_propagation` at the start of `codegen` to deduce the data type of each node and save this information in `node.meta[OptimizationContext.key]`. Then, we used this node metadata to update the cppcsevar data type in `update_on_args`. However, this method is not always correct. For example, in the codegen of `indirect_indexing` (see [here](096dc444ce/torch/_inductor/codegen/common.py (L1844))), we insert nodes on the fly and reuse the node of `indirect_indexing` to set the `cppcsevar` data type. In this PR, we plan to enhance the `cppcsevar` data type deduction:

- We will deduce the `cppcsevar` data type in `update_on_args` by reusing the code in `data_type_propagation`.

- To align the data type of scalar and vector variables, we previously always cast the scalar to the vector's data type. This caused a data type misalignment between `codegen` and `data_type_propagation`. We should use the same data type promotion logic to align the data types of scalar and vector variables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130827
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-30 02:51:31 +00:00
b6c1490cc0 [dynamo] make more unpack_var_sequence calls forced (#132069)
Fixes [T197204962](https://www.internalfb.com/intern/tasks/?t=197204962) (example failure: https://www.internalfb.com/intern/testinfra/diagnostics/11540474088277914.281475138576374.1722221031/)

Added tests contain a simple repro for the observed failure (`test_map_unpack_vars`).

Also fixes https://github.com/pytorch/pytorch/issues/132044

Differential Revision: [D60420335](https://our.internmc.facebook.com/intern/diff/D60420335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132069
Approved by: https://github.com/anijain2305
2024-07-30 02:30:08 +00:00
8721b21b38 Fix fake_tensor w/ non-view tensor (#132050)
Summary: This code was overly complex and is confusing some guards - basically if a result cached tensor isn't a view there's no reason to be messing with its storage.

Test Plan: unit tests pass

Differential Revision: D60387821

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132050
Approved by: https://github.com/oulgen
2024-07-30 02:17:18 +00:00
9598c58618 Add config option to skip autotuning conv (#131839)
Requested internally, because for some models the conv templates are not very helpful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131839
Approved by: https://github.com/oulgen
ghstack dependencies: #131400
2024-07-30 01:57:53 +00:00
5a2620302b [inductor] Replace self_cuda_time_total function calls with self_dev… (#131029)
…ice_time_total for wrapper_bench

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131029
Approved by: https://github.com/shunting314
2024-07-30 01:57:39 +00:00
a147fa577b [MPS] Fix masked_fill_ in non_contiguous cases (#131957)
fixes #131285

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131957
Approved by: https://github.com/DenisVieriu97
2024-07-30 01:34:48 +00:00
3716934b1a [Inductor] Refactor autotuning utils to compute max block sizes (#131730)
These OSS changes are part of a larger MTIA diff. The OSS part is a simple refactor that makes it easier to query max block sizes by the prefix of the grid dimension, e.g. `"X"`, as opposed to having to use separate functions for `get_xmax()`, `get_ymax()`, etc.
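
A minimal sketch of what such a prefix-keyed lookup could look like is given below. The table name `MAX_BLOCK`, the helper `max_block_for`, and the numeric limits are all placeholders chosen for illustration; the actual Inductor names and values may differ.

```python
# Placeholder limits; the real values live in Inductor and may differ.
MAX_BLOCK = {"X": 2048, "Y": 1024, "Z": 1024, "R": 4096}


def max_block_for(prefix: str) -> int:
    """Look up the max block size by grid-dimension prefix, e.g. "X"."""
    try:
        return MAX_BLOCK[prefix.upper()]
    except KeyError:
        raise ValueError(f"unknown grid dimension prefix: {prefix!r}") from None


# One lookup replaces separate per-dimension helper functions.
limits = {p: max_block_for(p) for p in ("x", "y", "z")}
```

Keying on the prefix lets callers loop over dimensions instead of naming one helper per axis.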

Differential Revision: D60195669

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131730
Approved by: https://github.com/eellison
2024-07-30 01:04:53 +00:00
7a7dd8c29e Revert "[NestedTensor] Integrate the softmax operator along the jagged dimension into NestedTensor (#131518)"
This reverts commit bcf5c68c18c6a109e1fa00829eea0428d44cfb6b.

Reverted https://github.com/pytorch/pytorch/pull/131518 on behalf of https://github.com/ZainRizvi due to Sorry, reverting this since this is based on an internal diff that has diverged from actual internal commit (the final PR and diff must always be identical). Conflicts arise when that happens which block the diff train. Let's revert both this PR and the internal diff, and then reland them as a proper new codev diff ([comment](https://github.com/pytorch/pytorch/pull/131518#issuecomment-2257259839))
2024-07-30 00:55:10 +00:00
ab9791c0e3 [export] Add print_readable to unflattener (#128617)
Taking inspiration from `GraphModule.print_readable` (aka I copied its [code](17b45e905a/torch/fx/graph_module.py (L824))), I added a `print_readable` to the unflattened module, because it's kind of nontrivial to print the contents of this module.

Example print from `python test/export/test_unflatten.py -k test_unflatten_nested`
```
class UnflattenedModule(torch.nn.Module):
    def forward(self, x: "f32[2, 3]"):
        # No stacktrace found for following nodes
        rootparam: "f32[2, 3]" = self.rootparam

        # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:99 in forward, code: x = x * self.rootparam
        mul: "f32[2, 3]" = torch.ops.aten.mul.Tensor(x, rootparam);  x = rootparam = None

        # No stacktrace found for following nodes
        foo: "f32[2, 3]" = self.foo(mul);  mul = None
        bar: "f32[2, 3]" = self.bar(foo);  foo = None
        return (bar,)

    class foo(torch.nn.Module):
        def forward(self, mul: "f32[2, 3]"):
            # No stacktrace found for following nodes
            child1param: "f32[2, 3]" = self.child1param
            nested: "f32[2, 3]" = self.nested(mul);  mul = None

            # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:79 in forward, code: return x + self.child1param
            add: "f32[2, 3]" = torch.ops.aten.add.Tensor(nested, child1param);  nested = child1param = None
            return add

        class nested(torch.nn.Module):
            def forward(self, mul: "f32[2, 3]"):
                # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:67 in forward, code: return x / x
                div: "f32[2, 3]" = torch.ops.aten.div.Tensor(mul, mul);  mul = None
                return div

    class bar(torch.nn.Module):
        def forward(self, add: "f32[2, 3]"):
            # No stacktrace found for following nodes
            child2buffer: "f32[2, 3]" = self.child2buffer

            # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:87 in forward, code: return x - self.child2buffer
            sub: "f32[2, 3]" = torch.ops.aten.sub.Tensor(add, child2buffer);  add = child2buffer = None
            return sub
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128617
Approved by: https://github.com/zhxchen17, https://github.com/pianpwk
2024-07-30 00:41:44 +00:00
2a4d9aa548 Disable expandable segments checkpointing internally (#132048)
Differential Revision: [D60388286](https://our.internmc.facebook.com/intern/diff/D60388286)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132048
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-07-30 00:26:39 +00:00
be5e44192d Revert "[NestedTensor] Integrate the layer normalization operator along the jagged dimension into NestedTensor (#131519)"
This reverts commit 8fe2bf212dc5e01b15cbe728958f940873230d64.

Reverted https://github.com/pytorch/pytorch/pull/131519 on behalf of https://github.com/ZainRizvi due to Sorry, reverting this since this is based on an internal diff that has diverged from actual internal commit.  Weird conflicts arise when that happens.  Let's revert both this PR and the internal diff, and then reland them as a proper new codev diff ([comment](https://github.com/pytorch/pytorch/pull/131519#issuecomment-2257230717))
2024-07-30 00:18:22 +00:00
b1ccd0c407 [CI] Update environment variable setting for aarch64 (#132046)
Summary: JEMALLOC_LIB and core_number need to be set differently on aarch64.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132046
Approved by: https://github.com/huydhn
2024-07-30 00:09:59 +00:00
e3dc20c94b [NJT] support cat backward (#132076)
cat_tensors_backward uses narrow_symint, so we need to support aten::narrow for NJT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132076
Approved by: https://github.com/davidberard98
2024-07-29 23:49:26 +00:00
5298acb5c7 Back out "[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969)" (#132065)
Summary:
Original commit changeset: 1d8cfdcef69d

Original Phabricator Diff: D54134695

back out: D54134695

Test Plan: more details see: https://docs.google.com/document/d/1noPTmTdNYHVDFyk7AJSSO7jQoNw6fTo4o6k9eTNeZh8/edit#heading=h.xeo30usu77nc

Reviewed By: zw2326

Differential Revision: D60397377

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132065
Approved by: https://github.com/zw2326, https://github.com/qchip
2024-07-29 22:48:29 +00:00
8b507a922a Mode to emulate amp numerics (#131595)
```
# Mode to emulate pytorch eager numerics for lower precision (fp16, bf16)
# Pytorch eager computes bf16/fp16 by upcasting inputs to fp32 and downcasting after
# For multiple, fused pointwise nodes, inductor will elide the intermediary upcasts and downcasts
# Typically this should be closer to fp64 ref numerics. However, it can be useful for debugging
# to emulate the eager numerics.
```

We add extra upcasts and downcasts for pointwise nodes that correspond to casts that existed in the original user program (excluding pointwise nodes that are emitted during decomposition). Since this is mostly for debugging, I added this information in the `meta` so that this mode does not have unintended side effects like changing pattern matching.
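
The numeric effect of eliding intermediate downcasts can be sketched in plain Python. This is only an illustration under stated assumptions: it uses the `struct` half-precision format to round to fp16, and Python's fp64 floats stand in for the fp32 compute type; `to_fp16`, `eager_style`, and `fused_style` are hypothetical helper names, not Inductor code.

```python
import struct


def to_fp16(value: float) -> float:
    """Round a Python float to the nearest fp16 value."""
    return struct.unpack("e", struct.pack("e", value))[0]


def eager_style(x: float) -> float:
    # Downcast after every pointwise op, as eager does for fp16 inputs.
    squared = to_fp16(x * x)
    return to_fp16(squared - 1.0)


def fused_style(x: float) -> float:
    # Keep the intermediate in high precision; downcast once at the end.
    return to_fp16(x * x - 1.0)


x = to_fp16(1.0 + 3 * 2**-10)  # exactly representable in fp16
exact = x * x - 1.0            # high-precision reference
eager = eager_style(x)
fused = fused_style(x)
```

For this input the two paths give different fp16 results, and the fused one is closer to `exact`, which matches the comment above that the fused form is typically closer to the high-precision reference.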

In theory, fused reduction -> reduction chains could involve other casts as well, although I haven't seen this much in practice; that could be done as a follow-up. Note: this only works with the CUDA backend right now.

This mode was sufficient to eliminate compile differences from https://fb.workplace.com/groups/385893200869952/posts/464263173032954/?comment_id=465199259606012&reply_comment_id=465676792891592.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131595
Approved by: https://github.com/shunting314, https://github.com/bdhirsh, https://github.com/jansel
2024-07-29 22:42:23 +00:00
884eadcd19 Fix multi grad hooks thread safety (#132055)
Thanks @awgu  for spotting this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132055
Approved by: https://github.com/Skylion007, https://github.com/awgu, https://github.com/albanD
2024-07-29 22:32:59 +00:00
e55e9d8126 Clear speculation log when restarting due to compiler collective (#131983)
The compiler collective can trigger an input to become dynamic, which
can trigger operations to be recorded to the graph, which would change
the speculation log entries (since they only start being recorded once
we have a non-empty output graph).  Test case triggers this situation.

Production instance:
https://www.internalfb.com/mlhub/pipelines/runs/mast/f584750649-TrainingApplication?job_attempt=2&version=0&env=PRODUCTION

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131983
Approved by: https://github.com/anijain2305, https://github.com/mlazos
2024-07-29 22:32:10 +00:00
62b2e7a553 Revert "Add config option to skip autotuning conv (#131839)"
This reverts commit 3d4de8e96d0bb1fe19b25734a97a19dd85313692.

Reverted https://github.com/pytorch/pytorch/pull/131839 on behalf of https://github.com/eellison due to wrong config name ([comment](https://github.com/pytorch/pytorch/pull/131839#issuecomment-2257117221))
2024-07-29 22:31:51 +00:00
8fe2bf212d [NestedTensor] Integrate the layer normalization operator along the jagged dimension into NestedTensor (#131519)
Modify the existing `layer normalization` operator in PyTorch, invoked by `torch.layer_norm`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the `aten` padding operator, enables PyTorch users to invoke `torch.nn.functional.layer_norm` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` or `(B, *, M, N)` nested tensor.

Write unit tests based on the `softmax` jagged operator to verify the accuracy of the ragged reduction implementation for `torch.nn.functional.layer_norm`. Add unit tests to verify error handling for unsupported features.

Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. The layer normalization operator also requires an operation on a 2-dimensional layer; for nested tensors with 4 or more dimensions, I flatten the extra dimensions, then unflatten them after performing layer normalization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131519
Approved by: https://github.com/davidberard98
ghstack dependencies: #131518
2024-07-29 22:16:32 +00:00
d039b14207 Grouped Query Attention (#128898)
### Approach: Using the current function declaration

**Constraint:** Q_Heads % KV_Heads == 0

**Major change:**
- Added a new argument enable_gqa: bool to sdpa function call
- It gives meaning to the third-to-last dimension (the number of heads).

Sample use cases this would enable:
LLama3

```
# LLama3 8b call to SDPA
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)

output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)

# Output Shape
(batch, 32, seq_len_q, D)
```

### Design Choice:

- Check if Query.size(-3) == Key.size(-3) == Value.size(-3) or, Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave when their head counts differ, so that attention is computed against the correct key/value head.
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.
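
The head check and the repeat_interleave step described above can be sketched without tensors, as a mapping from each query head to the key/value head it ends up using. The helper name `kv_head_index` is hypothetical and the code is an illustration of the idea, not the actual sdpa implementation.

```python
def kv_head_index(q_heads: int, kv_heads: int) -> list[int]:
    """For each query head, return the index of the KV head it is paired with.

    Mirrors the effect of repeat_interleave along the head dimension:
    each KV head is repeated q_heads // kv_heads times, consecutively.
    """
    if q_heads % kv_heads != 0:
        raise ValueError("q_heads must be a multiple of kv_heads")
    group = q_heads // kv_heads
    return [q // group for q in range(q_heads)]


# Head counts from the Llama 3 8B example above: 32 query heads, 8 KV heads.
mapping = kv_head_index(32, 8)
```

With equal head counts the mapping is the identity, which corresponds to the regular sdpa path.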

### Benchmarks:

- **sdpa.py: #130634**
For different batch sizes, enable_gqa=True shows a substantial improvement in the run time of sdpa: forward time is roughly 14-16% lower across the rows of the table below.

| batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True | forward_time when enable_gqa=False |
| ---------- | ----------- | ------------ | --------- | ---------- | --------- | --------------------------------- | ---------------------------------- |
| 1          | 32          | 8            | 2048      | 2048       | 2048      | 100.71                            | 119.70                             |
| 8          | 32          | 8            | 2048      | 2048       | 2048      | 539.78                            | 628.83                             |
| 16         | 32          | 8            | 2048      | 2048       | 2048      | 1056.81                           | 1225.48                            |
| 32         | 32          | 8            | 2048      | 2048       | 2048      | 2099.54                           | 2440.45                            |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)

- **TorchTitan: https://github.com/pytorch/torchtitan/pull/458**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128898
Approved by: https://github.com/drisspg
2024-07-29 21:49:06 +00:00
05a8540041 [cpp-wrapper] create null pointer for zero-size array (#132023)
Zero-size arrays are not supported by the C or C++ standards,
so we create a null pointer for them instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132023
Approved by: https://github.com/desertfire
2024-07-29 21:40:33 +00:00
d8358a2d86 Made register_multi_grad_hook return type RemovableHandle (#132074)
`_MultiHandle` is private. Let us return `RemovableHandle`, which is public.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132074
Approved by: https://github.com/soulitzer
2024-07-29 21:29:34 +00:00
d5e9fbb012 Revert "BE: reset dynamo before each test in test_module.py (#131372)"
This reverts commit 527901f054a947976dc587bb9cf72c86992b7c87.

Reverted https://github.com/pytorch/pytorch/pull/131372 on behalf of https://github.com/kit1980 due to Broke test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10149118852/job/28065175173) [HUD commit link](ca8153ae67) ([comment](https://github.com/pytorch/pytorch/pull/131372#issuecomment-2257019116))
2024-07-29 21:15:25 +00:00
a4723b566f Revert "BE: reset dynamo before each test in test_ops_gradients.py (#131397)"
This reverts commit ca8153ae6758fbf33cc767cfd0cb384b87b8d3ca.

Reverted https://github.com/pytorch/pytorch/pull/131397 on behalf of https://github.com/kit1980 due to Broke test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10149118852/job/28065175173) [HUD commit link](ca8153ae67) ([comment](https://github.com/pytorch/pytorch/pull/131372#issuecomment-2257019116))
2024-07-29 21:15:25 +00:00
bdf5a6dca9 Add decomposition for unsqueeze_copy (#130942)
* Extracted from #128416
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130942
Approved by: https://github.com/peterbell10
2024-07-29 21:13:37 +00:00
3c1562158e [BE] Fix torch.compile docstring formatting issues (#131837)
Fixes #131815

<img width="1098" alt="Screenshot 2024-07-25 at 6 58 39 PM" src="https://github.com/user-attachments/assets/d0f6edc3-419e-4096-803b-cecd45d8644b">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131837
Approved by: https://github.com/williamwen42
2024-07-29 20:52:28 +00:00
dcb03106b7 [Land Internally] MTIA equivalent of torch.cuda.memory_stats (#132007)
Summary: as title

Test Plan: pytorch ci failing: https://github.com/pytorch/pytorch/issues/131962

Differential Revision: D60335413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132007
Approved by: https://github.com/hanzlfs, https://github.com/egienvalue
2024-07-29 20:47:18 +00:00
082d0b80ca Min and max NaN propagation fix in MPS backend (#130445)
Partial fix to issue #130295

Moves min and max ops to use the NaN propagating API in MPS to align with the pytorch convention. Adds a regression test to validate the fix achieves parity with cpu backend.
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130445
Approved by: https://github.com/malfet
2024-07-29 20:09:15 +00:00
f44446e851 [dynamo] Turn on inline_inbuilt_nn_modules (#131275)
Known issues that are deliberately kept open and will be fixed later are tracked here - https://github.com/pytorch/pytorch/issues/131696

Training dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/anijain2305/435/head&lCommit=408b9358b8fca3a5d08b39741419fe8a596941aa&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))

![image](https://github.com/user-attachments/assets/08ef081c-37d7-436d-905b-4b9e2b470644)

Inference dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=gh/anijain2305/435/head&lCommit=914244fa2fe0055917e039e35183b21fa90afdc6&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))
![image](https://github.com/user-attachments/assets/32136eff-a39e-4cde-a438-e51a665bc3c9)

Inference sees a little bit more perf degradation but we are ok with that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131275
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #132053
2024-07-29 20:01:51 +00:00
4c2bcf92cb [inductor] Enable FX graph caching in OSS by default (#125863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125863
Approved by: https://github.com/eellison, https://github.com/oulgen
2024-07-29 19:19:54 +00:00
484852c02b [Doc] update guide install mkl-static from conda to pip (#130026)
<img width="619" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/4ac3ca68-57dc-42c7-ac7a-876dc377ebcf">

The Conda intel channel is no longer available.
Use `pip` install instead of `conda`.

`Windows` and `Linux` binaries are available:
Binary list: https://pypi.org/project/mkl-static/#files

`MacOS` binaries are available only for an older version:
https://pypi.org/project/mkl-static/2021.3.0/#files

TODO:
1. cherry-pick to `release/2.4` branch, @atalman .
2. fix it also in `release/2.3` branch: https://github.com/pytorch/pytorch/pull/131853

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130026
Approved by: https://github.com/jgong5, https://github.com/atalman
2024-07-29 19:19:15 +00:00
301ec32ae8 [EASY][TEST][CUDA] Fix typo in test_graph_make_graphed_callables_same_pool (#132059)
Per title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132059
Approved by: https://github.com/Skylion007
2024-07-29 19:15:37 +00:00
5cc34f61d1 [CI] add new test config label ci-test-showlocals to control test log verbosity (#131981)
Add a new label `ci-test-showlocals` and add it to test config filter.
If the PR is labeled with `ci-test-showlocals`, or `ci-test-showlocals`
appears in the PR comment, the test config filter sets the environment
variable `TEST_SHOWLOCALS`. `pytest` then shows local variables on
failures for better debugging.
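
The decision described above might be sketched as follows. This is a hypothetical helper, not the actual filter script: the function name `showlocals_env`, its arguments, and the value `"1"` for the variable are assumptions made for illustration.

```python
LABEL = "ci-test-showlocals"


def showlocals_env(labels, pr_text):
    """Return extra environment variables for the test job.

    Hypothetical helper: names and the "1" value are illustrative only.
    """
    env = {}
    # Either an explicit PR label or a mention in the PR text enables it.
    if LABEL in labels or LABEL in (pr_text or ""):
        env["TEST_SHOWLOCALS"] = "1"
    return env


labeled = showlocals_env(["ci-test-showlocals"], "")
mentioned = showlocals_env([], "please run with ci-test-showlocals")
plain = showlocals_env(["ciflow/trunk"], "no special request")
```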

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131981
Approved by: https://github.com/malfet
ghstack dependencies: #131151
2024-07-29 18:53:14 +00:00
4694ee1ad2 [BE][tests] show local variables on failure in tests (#131151)
------

As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI.

Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily.
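
For a small local picture of what the `unittest` side of this does, the sketch below runs one deliberately failing test with `tb_locals=True`, which is the runner option that the `--locals` command-line flag enables. The `Demo` test case and the `failure_report` wrapper are made up for this example.

```python
import io
import unittest


def failure_report() -> str:
    """Run one failing test with locals enabled and return the text report."""

    class Demo(unittest.TestCase):
        def test_fails(self):
            answer = 41
            self.assertEqual(answer, 42)

    suite = unittest.defaultTestLoader.loadTestsFromTestCase(Demo)
    stream = io.StringIO()
    # tb_locals=True makes the traceback list each frame's local variables.
    unittest.TextTestRunner(stream=stream, tb_locals=True).run(suite)
    return stream.getvalue()


report = failure_report()
```

The captured report lists the local `answer` next to the failing frame, which is the extra context this change makes available in CI logs.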

Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361

```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000

    @classmethod
    def eval(cls, base, divisor):
        # python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
        # Assert triggered by inequality solver
        # assert base.is_integer, base
        # assert divisor.is_integer, divisor

        # We don't provide the same error message as in Python because SymPy
        # makes it difficult to check the types.
        if divisor.is_zero:
            raise ZeroDivisionError("division by zero")
        if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
            int_oo,
            -int_oo,
            sympy.oo,
            -sympy.oo,
        ):
            return sympy.nan
        if base is sympy.nan or divisor is sympy.nan:
            return sympy.nan

        if base.is_zero:
            return sympy.S.Zero
        if base.is_integer and divisor == 1:
            return base
        if base.is_integer and divisor == -1:
            return sympy.Mul(base, -1)
        if (
            isinstance(base, sympy.Number)
            and isinstance(divisor, sympy.Number)
            and (
                base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
                or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
            )
        ):
            r = float(base) / float(divisor)
            if r == math.inf:
                return int_oo
            elif r == -math.inf:
                return -int_oo
            elif math.isnan(r):
                return sympy.nan
            else:
                return sympy.Integer(math.floor(r))
        if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
            return sympy.Integer(int(base) // int(divisor))
        if isinstance(base, FloorDiv):
            return FloorDiv(base.args[0], base.args[1] * divisor)

        # Expands (x + y) // b into x // b + y // b.
        # This only works if floor is an identity, i.e. x / b is an integer.
        for term in sympy.Add.make_args(base):
            quotient = term / divisor
            if quotient.is_integer and isinstance(divisor, sympy.Integer):
                # NB: this is correct even if the divisor is not an integer, but it
                # creates rational expressions that cause problems with dynamic
                # shapes.
                return FloorDiv(base - term, divisor) + quotient

        try:
            gcd = sympy.gcd(base, divisor)
            if gcd != 1:
>               return FloorDiv(
                    sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
                )

base       = -1.00000000000000
cls        = FloorDiv
divisor    = -1.00000000000000
gcd        = 1.00000000000000
quotient   = 1.00000000000000
term       = -1.00000000000000

/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}

    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
>           retval = cfunc(*args, **kwargs)
E           RecursionError: maximum recursion depth exceeded in comparison
E
E           To execute this test, run the following from the base repo dir:
E               python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E           This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

args       = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc      = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func       = <function Function.__new__ at 0x7fc530317280>
kwargs     = {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
2024-07-29 18:53:14 +00:00
cyy
ab912b7fef [2/N] Fix clang-tidy warnings in inductor (#132040)
Follows #131979
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132040
Approved by: https://github.com/Skylion007
2024-07-29 18:41:24 +00:00
cyy
c764ef6d53 [9/N] Fix clang-tidy warnings in jit (#132010)
Follows  #131997

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132010
Approved by: https://github.com/Skylion007
2024-07-29 18:38:35 +00:00
f389bca2e9 [dynamo][inline_inbuilt_nn_modules] Skip test_dpp_graphs for now (#132053)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132053
Approved by: https://github.com/laithsakka
2024-07-29 17:59:47 +00:00
6c6fbb4691 Fix pyi annotation for ProcessGroupNCCL.Options (#130957)
Probably all the other options need updating too, but this is the one I
needed.  The accurate annotation was determined by reading
torch/csrc/distributed/c10d/init.cpp

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130957
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2024-07-29 17:46:01 +00:00
025242d065 [cpu-test] enable test_cpu_repro in fbcode (#132022)
Summary: This diff enables test_cpu_repro in fbcode

Test Plan: ci

Differential Revision: D60364517

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132022
Approved by: https://github.com/desertfire
2024-07-29 17:45:26 +00:00
ca8153ae67 BE: reset dynamo before each test in test_ops_gradients.py (#131397)
https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR was reverted a couple of times because of post-land test failures that did not show up before merge. This PR only resets dynamo before each test in `test_ops_gradients.py`, to make it easier to land.

Eventually, once dynamo is reset in each individual test file, we can move the change to the base class (TestCase) and remove it from the individual test files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131397
Approved by: https://github.com/zou3519
ghstack dependencies: #131551, #131388, #131372
2024-07-29 17:39:23 +00:00
527901f054 BE: reset dynamo before each test in test_module.py (#131372)
https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR was reverted a couple of times because of post-land test failures that did not show up before merge. This PR resets dynamo only before each test in `test_module.py`, which makes it easier to land.

Once dynamo is reset in each individual test file, we can move the change to the base class (TestCase) and remove it from the individual test files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131372
Approved by: https://github.com/zou3519
ghstack dependencies: #131551, #131388
2024-07-29 17:39:23 +00:00
bd1a29b158 [BE][Ez]: Update ruff to 0.5.5. Bugfixes and better LSP support (#132037)
Updates ruff to the latest and greatest, mainly better LSP support and bugfixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132037
Approved by: https://github.com/malfet
2024-07-29 16:57:13 +00:00
6cf493158e Revert "Enable FlashAttention on Windows (#131906)"
This reverts commit b90bc66766c3503c1f229660710a803488d53c16.

Reverted https://github.com/pytorch/pytorch/pull/131906 on behalf of https://github.com/atalman due to Windows nightly failures ([comment](https://github.com/pytorch/pytorch/pull/131906#issuecomment-2256421183))
2024-07-29 16:49:23 +00:00
3d4de8e96d Add config option to skip autotuning conv (#131839)
Requested internally, because for some models the conv templates are not very helpful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131839
Approved by: https://github.com/oulgen
ghstack dependencies: #131400
2024-07-29 16:43:58 +00:00
e73a4cb21f Revert "[pt2e][quant] Ensure BN node is erased after convert (#131651)"
This reverts commit eba2ffd278a004df8fd335328ab8ba00c978e471.

Reverted https://github.com/pytorch/pytorch/pull/131651 on behalf of https://github.com/ZainRizvi due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/131651#issuecomment-2256407968))
2024-07-29 16:42:24 +00:00
f72266ecea Revert "Let dynamo inline functional_call (#128646)"
This reverts commit 5aab1acc84ff4a4374c9ddd179be48b07c6c8a74.

Reverted https://github.com/pytorch/pytorch/pull/128646 on behalf of https://github.com/clee2000 due to the newly added test dynamo/test_higher_order_ops.py::FuncTorchHigherOrderOpTests::test_functional_call_sequential_params_and_buffers [GH job link](https://github.com/pytorch/pytorch/actions/runs/10147452270/job/28058682000) [HUD commit link](5aab1acc84) is broken, probably a landrace since it passed on PR ([comment](https://github.com/pytorch/pytorch/pull/128646#issuecomment-2256375501))
2024-07-29 16:26:50 +00:00
962f248437 Add decomposition for expand_copy (#130940)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130940
Approved by: https://github.com/peterbell10
2024-07-29 16:23:56 +00:00
e393c7fa05 Tighten torch.library.infer_schema input types (#130705)
Made the following changes:
- mutates_args is now keyword-only and mandatory. This is to align with
  torch.library.custom_op (which makes it mandatory because it's easy to
  miss)
- op_name is now keyword-only. This helps the readability of the API
- updated all usages of infer_schema

This change is not BC-breaking because we introduced
torch.library.infer_schema a couple of days ago.
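
As a rough illustration of the calling convention described above (not the real implementation), the sketch below defines a hypothetical stand-in, `infer_schema_sketch`, with a mandatory keyword-only `mutates_args` and a keyword-only `op_name`. The returned string format is invented for the example.

```python
import inspect


# Hypothetical stand-in that mirrors the calling convention described above;
# it is not torch.library.infer_schema and its output format is made up.
def infer_schema_sketch(prototype_function, /, *, mutates_args, op_name=None):
    params = ", ".join(inspect.signature(prototype_function).parameters)
    return f"{op_name or ''}({params}) mutates={sorted(mutates_args)}"


def scale(x, factor):
    return x * factor


# mutates_args must be spelled out by keyword, even when nothing is mutated.
schema = infer_schema_sketch(scale, mutates_args=(), op_name="mylib::scale")

# Passing it positionally (or omitting it) raises TypeError.
try:
    infer_schema_sketch(scale, ())
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

Making the argument mandatory forces the caller to state explicitly whether an op mutates its inputs, which is the property the commit says is easy to miss.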

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130705
Approved by: https://github.com/yushangdi
ghstack dependencies: #131777
2024-07-29 16:01:19 +00:00
957a89f56c Revert "[inductor] Fix unsoundness with negative-valued indexing expressions (#131761)"
This reverts commit 03760be2714c6ed3b4f44c4dc3ea016f557d8597.

Reverted https://github.com/pytorch/pytorch/pull/131761 on behalf of https://github.com/atalman due to Broke CI: inductor/test_cpu_cpp_wrapper.py::DynamicShapesCppWrapperCpuTests::test_linear_binary_dynamic_shapes_cpp_wrapper [GH job link](https://github.com/pytorch/pytorch/actions/runs/10145214748/job/28051168920) [HUD commit link](03760be271) ([comment](https://github.com/pytorch/pytorch/pull/131761#issuecomment-2256287736))
2024-07-29 15:52:08 +00:00
ca254d145f [BE][Ez]: Update fmtlib submodule to 11.0.2 (#132036)
Updates fmtlib to 11.0.2 which mainly includes minor bugfixes for edge cases such as move-only iterators and formatting on non-posix systems.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132036
Approved by: https://github.com/malfet
2024-07-29 15:50:00 +00:00
5aab1acc84 Let dynamo inline functional_call (#128646)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128646
Approved by: https://github.com/zou3519
ghstack dependencies: #129091, #130490
2024-07-29 15:41:03 +00:00
e0e4e84ef9 wrap self.call_function(...) in try finally block to undo changes to self.kw_names (#130490)
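The general pattern named in the title can be sketched as follows. The `Translator` class and its methods are hypothetical stand-ins, not dynamo's actual classes; the point is only that `finally` restores per-call state whether the call returns or raises.

```python
class Translator:
    """Hypothetical stand-in for an object that carries per-call state."""

    def __init__(self):
        self.kw_names = None

    def call_function(self, fn, args):
        return fn(*args)

    def call_with_kw_names(self, fn, args, kw_names):
        saved = self.kw_names
        self.kw_names = kw_names
        try:
            return self.call_function(fn, args)
        finally:
            # Runs on normal return and on exception alike.
            self.kw_names = saved


def boom():
    raise ValueError("simulated failure")


tx = Translator()
try:
    tx.call_with_kw_names(boom, (), ("x",))
except ValueError:
    pass
restored = tx.kw_names is None
```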
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130490
Approved by: https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #129091
2024-07-29 15:41:03 +00:00
1e9cdf7d91 Relax constraints for creating a GenericContextWrappingVariable (#129091)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129091
Approved by: https://github.com/yanboliang, https://github.com/zou3519
2024-07-29 15:40:59 +00:00
6cbad37bee make _inductor.config.rocm.supported_arch set order deterministic for caching (#131921)
This fixes some AOTAutograd caching tests that were failing flakily internally because they would occasionally cache miss.

[T195598220](https://www.internalfb.com/intern/tasks/?t=195598220)

I found it by running some stress tests and diffing the AOT cache information on each run, and ended up with this diff (`rocm.supported_arch` order was changing from run to run, although apparently not in OSS):
```
--- tmpa.txt    2024-07-26 11:03:46.220924798 -0700
+++ tmpb.txt    2024-07-26 11:03:44.053586437 -0700
@@ -1,4 +1,4 @@
-Autograd graph cache hash details for key ati644hstroc45hvmc6dcgzmxz7n4ezi46vbb2iriu634aojza74:
+Autograd graph cache hash details for key ayfqecv56xcczljwuvigh73sjd7dfvgr6akzf3ikr46nq7dfm6eh:
 [z76jr26kn3enjhz7b3ks3a2dgpwolnnqsqmo3wn6ddml3vxjtam] aot_config: (0, True, False, False, False, [LocalSource(local_name='x', cell_or_freevar=False)], True, False)
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] grad_enabled: False
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] disable_amp: False
@@ -184,7 +184,7 @@
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[rocm.print_kernel_resource_usage]: False
 [tquy2we2efmowuj4wuqzcfcfdcrkzkzmwdae6hprj7fa64jpusq] inductor_config[rocm.rocm_home]: None
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[rocm.save_temps]: False
-[xr3ayxgy2xduff3r5ey7o3ypfndexy7edha62kibw2dexijjvdr] inductor_config[rocm.supported_arch]: {'gfx941', 'gfx942', 'gfx940'}
+[qauhp44riavgubamhd3ehrifxdgm7pkwx2nehsqg5toy54dqqmn] inductor_config[rocm.supported_arch]: {'gfx942', 'gfx940', 'gfx941'}
 [cev5uo2jlwdhw2uyzcm7vr6cl23azjfw437f5r5lskm7spucos6] inductor_config[rocm.use_fast_math]: True
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[rocm.use_preselected_instances]: False
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[save_args]: False
@@ -231,7 +231,7 @@
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[verbose_progress]: False
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[warn_mix_layout]: False
 [a44txxznx23htuc7zxw7larc7yxpxzxmiqzloxznw7z2k2azqj3] inductor_config[worker_start_method]: fork
-Autograd graph cache hash details for key ati644hstroc45hvmc6dcgzmxz7n4ezi46vbb2iriu634aojza74:
+Autograd graph cache hash details for key ayfqecv56xcczljwuvigh73sjd7dfvgr6akzf3ikr46nq7dfm6eh:
 [z76jr26kn3enjhz7b3ks3a2dgpwolnnqsqmo3wn6ddml3vxjtam] aot_config: (0, True, False, False, False, [LocalSource(local_name='x', cell_or_freevar=False)], True, False)
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] grad_enabled: False
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] disable_amp: False
@@ -417,7 +417,7 @@
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[rocm.print_kernel_resource_usage]: False
 [tquy2we2efmowuj4wuqzcfcfdcrkzkzmwdae6hprj7fa64jpusq] inductor_config[rocm.rocm_home]: None
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[rocm.save_temps]: False
-[xr3ayxgy2xduff3r5ey7o3ypfndexy7edha62kibw2dexijjvdr] inductor_config[rocm.supported_arch]: {'gfx941', 'gfx942', 'gfx940'}
+[qauhp44riavgubamhd3ehrifxdgm7pkwx2nehsqg5toy54dqqmn] inductor_config[rocm.supported_arch]: {'gfx942', 'gfx940', 'gfx941'}
 [cev5uo2jlwdhw2uyzcm7vr6cl23azjfw437f5r5lskm7spucos6] inductor_config[rocm.use_fast_math]: True
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[rocm.use_preselected_instances]: False
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[save_args]: False
```
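
A minimal sketch of why set ordering matters for a hash-based cache key, assuming the key is derived from the textual form of each config value. `config_fingerprint` and `stable_fingerprint` are hypothetical helpers, and sorting is just one way to get a canonical order; it is not necessarily how this PR fixes it.

```python
import hashlib


def config_fingerprint(value):
    # Hash the textual form of a config value, as a cache key component might.
    return hashlib.sha256(repr(value).encode()).hexdigest()


def stable_fingerprint(archs):
    # Sorting first makes the text independent of set iteration order,
    # which for strings can differ from one process to the next.
    return config_fingerprint(sorted(archs))


a = stable_fingerprint({"gfx941", "gfx942", "gfx940"})
b = stable_fingerprint({"gfx942", "gfx940", "gfx941"})
```

Both calls hash the same sorted list, so `a == b` even if the two sets iterate in different orders.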

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131921
Approved by: https://github.com/jamesjwu, https://github.com/oulgen
2024-07-29 15:29:04 +00:00
14108c1677 Fix error handling in _triton.py (#132006)
On Windows, _triton.py raises a confusing error ("RuntimeError: Should never be installed") because triton is not supported on Windows. The current PyTorch exception handling does not catch it. This pull request adds handling for the runtime error.
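
A hedged sketch of the kind of handling described, using a hypothetical `has_backend` helper and an injected import function instead of the real `_triton.py` code:

```python
def has_backend(import_backend):
    """Return True if the optional backend can be imported.

    `import_backend` is a hypothetical callable that performs the import.
    """
    try:
        import_backend()
        return True
    except ImportError:
        # The package is simply not installed.
        return False
    except RuntimeError:
        # Some unsupported platforms raise at import time instead.
        return False


def broken_import():
    raise RuntimeError("Should never be installed")


available = has_backend(broken_import)
```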
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132006
Approved by: https://github.com/oulgen
2024-07-29 15:02:25 +00:00
be3eba382f [CI] Run perf test for perf_cpu_aarch64 (#132038)
Summary: Run perf test for perf_cpu_aarch64 instead of regular CI test (test_linux_aarch64).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132038
Approved by: https://github.com/malfet
2024-07-29 13:48:40 +00:00
c35f21e5fc Revert "[BE][tests] show local variables on failure in tests (#131151)"
This reverts commit 14158d892a2bd9b34edb5637f9a05217ea0330bd.

Reverted https://github.com/pytorch/pytorch/pull/131151 on behalf of https://github.com/atalman due to Broke CI: test_testing.py::TestTestingCUDA::test_cuda_assert_should_stop_common_device_type_test_suite_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/10131415299/job/28014665693) [HUD commit link](14158d892a) ([comment](https://github.com/pytorch/pytorch/pull/131151#issuecomment-2255921015))
2024-07-29 13:19:38 +00:00
06fe99a097 Revert "[CI] add new test config label ci-test-showlocals to control test log verbosity (#131981)"
This reverts commit dfa18bf3f39c5a90b48baf956e50fa7da4462d3d.

Reverted https://github.com/pytorch/pytorch/pull/131981 on behalf of https://github.com/atalman due to Sorry, need to revert bottom PR, which broke CI: https://github.com/pytorch/pytorch/pull/131151 ([comment](https://github.com/pytorch/pytorch/pull/131981#issuecomment-2255892628))
2024-07-29 13:09:41 +00:00
7ef927da15 Revert "[dynamo] Turn on inline_inbuilt_nn_modules (#131275)"
This reverts commit 6de65d5dd4226b6bae15352b575c81a6750c819b.

Reverted https://github.com/pytorch/pytorch/pull/131275 on behalf of https://github.com/atalman due to Broke CI: dynamo/test_structured_trace.py::StructuredTraceTest::test_ddp_graphs [GH job link](https://github.com/pytorch/pytorch/actions/runs/10132084288/job/28016215101) [HUD commit link](6de65d5dd4) ([comment](https://github.com/pytorch/pytorch/pull/131275#issuecomment-2255839646))
2024-07-29 12:48:27 +00:00
cyy
efca51e171 [8/N] Fix clang-tidy warnings in jit (#131997)
Follows #131996
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131997
Approved by: https://github.com/Skylion007
2024-07-29 12:40:42 +00:00
eb9409511e Revert "support zb1p and zb2p algorithms (#130752)"
This reverts commit 8fe5b93667b60e37c12d288659a25cbd5ae53c79.

Reverted https://github.com/pytorch/pytorch/pull/130752 on behalf of https://github.com/atalman due to Broke Periodic CI: distributed/pipelining/test_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass4 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10131472868/job/28014900187) [HUD commit link](8fe5b93667) ([comment](https://github.com/pytorch/pytorch/pull/130752#issuecomment-2255819078))
2024-07-29 12:40:00 +00:00
9d497887b8 Changes to support clang-19 (#131905)
Co-authored-by: pruthvistony <pruthvigithub@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131905
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007
2024-07-29 12:38:23 +00:00
cyy
b67811abda [1/N] Fix clang-tidy warnings in inductor (#131979)
Fixes clang-tidy warnings in inductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131979
Approved by: https://github.com/Skylion007
2024-07-29 12:37:56 +00:00
d47c470f47 [dynamo] implement var_getattr in UserFunctionVariable (#130413)
This PR addresses `getattr` on `UserFunctionVariable`. Although this usage is uncommon, it does appear in [Megatron's code](https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/tensor_parallel/layers.py#L635).

```
def linear_with_grad_accumulation_and_async_allreduce(...):
    ....
    if not linear_with_grad_accumulation_and_async_allreduce.warned:
        ....
    ....

linear_with_grad_accumulation_and_async_allreduce.warned = False
```
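
For readers unfamiliar with the idiom, here is a small self-contained example of a function reading and writing an attribute on itself. `warn_once` is a made-up name used only for illustration.

```python
def warn_once(message):
    # Function objects carry their own attribute namespace, so the flag
    # below lives on the function itself rather than in a global variable.
    if not warn_once.warned:
        warn_once.warned = True
        return f"warning: {message}"
    return None


warn_once.warned = False

first = warn_once("slow path")
second = warn_once("slow path")
```

Only the first call returns a warning string; the attribute lookup inside the function body is the `getattr` on a user function that a tracer has to support.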

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130413
Approved by: https://github.com/yanboliang
2024-07-29 08:29:59 +00:00
dfa18bf3f3 [CI] add new test config label ci-test-showlocals to control test log verbosity (#131981)
Add a new label `ci-test-showlocals` and add it to the test config filter.
If the PR is labeled with `ci-test-showlocals`, or `ci-test-showlocals` is
present in the PR comment, the test config filter sets the environment
variable `TEST_SHOWLOCALS`. `pytest` then shows local variables on
failures for better debugging.
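
One way a test runner could translate the variable into a pytest option is sketched below. The `pytest_args` helper and the set of accepted values are assumptions; `--showlocals` is the standard pytest flag for printing local variables in tracebacks.

```python
import os


def pytest_args(base_args, environ=os.environ):
    """Hypothetical helper: build the pytest argument list for a test run."""
    args = list(base_args)
    # Assumption: any of these values counts as "enabled".
    if environ.get("TEST_SHOWLOCALS", "0").lower() in ("1", "true", "yes"):
        args.append("--showlocals")
    return args


enabled = pytest_args(["-v", "test_foo.py"], {"TEST_SHOWLOCALS": "1"})
disabled = pytest_args(["-v", "test_foo.py"], {})
```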

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131981
Approved by: https://github.com/malfet
2024-07-29 07:40:42 +00:00
f151f25c0b BE: reset dynamo before each test in test_torch.py (#131388)
https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR was reverted a couple of times because of post-land test failures that did not show up before merge. This PR resets dynamo only before each test in `test_torch.py`, which makes it easier to land.

Once dynamo is reset in each individual test file, we can move the change to the base class (TestCase) and remove it from the individual test files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131388
Approved by: https://github.com/zou3519
ghstack dependencies: #131551
2024-07-29 04:57:34 +00:00
30e7fc0fe1 Cpp wrapper: set args to CppWrapperKernelArgs in cpp template kernel (#129557)
Fix the compilation error:
```cpp
/tmp/tmpywg34bca/tg/ctg7wbli6pvydsjr2xsxamdbamkquhlincuky3dzopa3ilrxqdwt.cpp:401:24: error: cannot convert ‘at::Tensor’ to ‘const bfloat16*’ {aka ‘const c10::BFloat16*’}
  401 |     cpp_fused_div_mm_0(arg2_1, constant2, _frozen_param1, buf1);
      |                        ^~~~~~
      |                        |
      |                        at::Tensor
```

The generated code after the fix will be:
```cpp
cpp_fused_div_mm_0((bfloat16*)(arg2_1.data_ptr()), (bfloat16*)(constant2.data_ptr()), (bfloat16*)(_frozen_param1.data_ptr()), (bfloat16*)(buf1.data_ptr()));
```

ABI-compatible mode requires multiple changes, which are split into a follow-up PR in this ghstack: https://github.com/pytorch/pytorch/pull/131841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129557
Approved by: https://github.com/leslie-fang-intel
2024-07-29 04:01:17 +00:00
03760be271 [inductor] Fix unsoundness with negative-valued indexing expressions (#131761)
This fixes a few instances where we assumed indexing expressions were
non-negative. This is not valid when we have more complicated
expressions involving masking e.g. pointwise cat.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131761
Approved by: https://github.com/ezyang
2024-07-29 03:14:13 +00:00
2a02b5cd22 [Intel GPU] Dispatch Stub support (#130019)
# Motivation
Structured codegen makes it easier to decouple tensor meta setting from kernel implementation. At present, XPU operators have to handle tensor metas by hand.

We plan to leverage the codegen system to auto-generate structured operators. This PR adds `DispatchStub` support for Intel GPUs. With it, XPU operators can register kernel functors to operator stubs.

This is a prerequisite of PR #130082, where we will modify the codegen system to generate the source files and headers that XPU needs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130019
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
2024-07-29 02:18:52 +00:00
cyy
5b3b2b9cc7 [7/N] Fix clang-tidy warnings in jit (#131996)
Follows #131986

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131996
Approved by: https://github.com/ezyang
2024-07-29 01:21:18 +00:00
cyy
ddd539ba6c [6/N] Fix clang-tidy warnings in jit (#131986)
Follows #131969
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131986
Approved by: https://github.com/ezyang
2024-07-29 00:49:08 +00:00
7b0e10f0e5 fix _MaskPartial when multiple embeddings coexist (#131264)
Previously, using `_MaskPartial` when multiple embeddings coexist had the following issues:
1. Suppose an `nn.Embedding` has shape `[vocab_size, emb_size]`, and there is more than one embedding sharing the same `vocab_size` but with different `emb_size`s. They would not share an `OpStrategy`, since each has a different `OpSchema` when involved in computation; however, there would be a cache hit for redistribute (specifically `_gen_transform_infos` in `torch/distributed/_tensor/_redistribute.py` when doing `Replicate` -> `_MaskPartial`), because `_MaskPartial` only has `vocab_size` as `logical_dim_size` and does not have `emb_size` as an attribute. This cache hit is undesirable and causes trouble when doing all-reduce/reduce-scatter on the new `_MaskPartial` in a separate `OpStrategy`. The error was reported in #130725. This PR introduces `offset_shape` to represent the embedding's full shape, which avoids cache hits between embeddings of different shapes.
2. The second issue arises when two `nn.Embedding`s, `emb1` and `emb2`, have the same shape. There is a cache hit not only in `_gen_transform_infos` but also in `OpStrategy` generation. Previously, if we sequentially did `Replicate` -> `_MaskPartial` for both `emb1` and `emb2` and then sequentially reduced the `_MaskPartial` of `emb1`, the reduction would destroy the `MaskBuffer` and `emb2` would hit an error. This PR adds a `refcount` to the `MaskBuffer` so that it can be properly shared by multiple `nn.Embedding`s.
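
The reference-counting idea in the second item can be sketched in isolation. The class below is hypothetical: its name, methods, and the list used as a mask do not mirror the PyTorch implementation; it only shows how a shared buffer survives until its last user releases it.

```python
class RefCountedMaskBuffer:
    """Hypothetical sketch; names do not mirror the PyTorch implementation."""

    def __init__(self):
        self.data = None
        self.refcount = 0

    def materialize(self, mask):
        # Only the first user stores the mask; later users just take a reference.
        if self.refcount == 0:
            self.data = mask
        self.refcount += 1

    def release(self):
        if self.refcount == 0:
            raise RuntimeError("release without matching materialize")
        self.refcount -= 1
        # The buffer is dropped only when the last user is done with it.
        if self.refcount == 0:
            self.data = None


buf = RefCountedMaskBuffer()
buf.materialize([True, False])  # first embedding takes a reference
buf.materialize([True, False])  # second embedding shares the same buffer
buf.release()                   # reduction for the first embedding
still_available = buf.data is not None
buf.release()                   # reduction for the second embedding
freed = buf.data is None
```

Without the counter, the first `release()` would clear the data and the second embedding's reduction would find nothing to read.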

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131264
Approved by: https://github.com/wanchaol
2024-07-29 00:40:58 +00:00
0ab6551bcb [inductor] Handle NoneLayout in count_numel (#131645)
We're currently under-counting mutations from ExternKernel since they use `NoneLayout` which doesn't have an associated shape and dtype. Instead, we can get that information from the buffer being mutated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131645
Approved by: https://github.com/jansel
2024-07-28 23:02:22 +00:00
cyy
7c1fbc7fe9 [5/N] Remove unused parameter (#131998)
Follows #131291

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131998
Approved by: https://github.com/ezyang
2024-07-28 21:29:06 +00:00
f901b02066 [Distributed] Do not expose nlohmann/json.hpp in public headers (#131925)
Move the `<nlohmann/json.hpp>` dependency, as well as the `NCCLTraceBuffer::getCollectiveTraceJson` and `NCCLTraceBuffer::dump_json` implementations introduced by https://github.com/pytorch/pytorch/pull/129505, from the header into a .cpp file. This relaxes the requirement that all downstream clients depend on the library.

Fixes https://github.com/pytorch/pytorch/issues/130678

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131925
Approved by: https://github.com/albanD, https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/c-p-i-o
ghstack dependencies: #131922
2024-07-28 18:45:24 +00:00
75c8d59ea1 Remove mypy ignore from torch/_dynamo/variables/lazy.py (#131785)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131785
Approved by: https://github.com/aorenste, https://github.com/zou3519
ghstack dependencies: #131786, #131870
2024-07-28 17:13:53 +00:00
7c29665f77 Remove mypy ignore from torch/testing/_internal/distributed/ (#131870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131870
Approved by: https://github.com/aakhundov
ghstack dependencies: #131786
2024-07-28 17:13:53 +00:00
2e4807575c Remove mypy ignore from torch/_dynamo/polyfill.py (#131786)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131786
Approved by: https://github.com/aorenste, https://github.com/zou3519
2024-07-28 17:13:49 +00:00
cc512ea0f6 [inductor] Fix flaky tests in test_aot_inductor.py (#131994)
Summary:
The `test_model_modified_weights` in `test_aot_inductor.py` has been failing internally for a while. The behavior leading to the test failure was that, after updating the eager model's weights and recompiling the (CPU) model with AOTI, the output of the model was identical to the one before the weights were updated.

The root cause is here in Python:

8927fc209f/test/inductor/test_aot_inductor_utils.py (L69-L71)

which, in turn, instantiates the `Runner` object in C++, relying on `dlopen` to load the *.so. The problem is that a repeated `dlopen` call does not reload the library from the same path unless `dlclose` is called between the two `dlopen` calls. There is a `dlclose` in the `Runner`'s destructor, but it is not called, likely because of the way the loaded `runner` gets closed over in Python:

8927fc209f/test/inductor/test_aot_inductor_utils.py (L83-L94)

Here we copy the *.so file to a unique temporary path right before loading it into a `runner`, which avoids the `dlopen` staleness described above. This fixes `test_model_modified_weights` and, hopefully, will help avoid similar errors in future tests.
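
A minimal sketch of the copy-to-unique-path step, assuming a hypothetical `copy_to_unique_path` helper and a placeholder file in place of a real compiled library. The actual loading call is omitted.

```python
import os
import shutil
import tempfile


def copy_to_unique_path(so_path):
    """Hypothetical helper: copy a shared library to a path never used before."""
    # A fresh directory per call guarantees a distinct path for each load,
    # so the loader cannot hand back a handle cached for an earlier path.
    tmp_dir = tempfile.mkdtemp(prefix="aoti_so_")
    unique_path = os.path.join(tmp_dir, os.path.basename(so_path))
    shutil.copy2(so_path, unique_path)
    return unique_path


# Demonstration with a placeholder file standing in for a compiled library.
with tempfile.NamedTemporaryFile(suffix=".so", delete=False) as f:
    f.write(b"placeholder")
    original = f.name

first = copy_to_unique_path(original)
second = copy_to_unique_path(original)
```

Each call yields a different path with identical contents. In this sketch the caller is responsible for removing the temporary directories afterwards.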

Test Plan: Tested internally.

Differential Revision: D60348165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131994
Approved by: https://github.com/chenyang78
2024-07-28 16:55:22 +00:00
6de65d5dd4 [dynamo] Turn on inline_inbuilt_nn_modules (#131275)
Known issues that are deliberately kept open and will be fixed later are tracked here - https://github.com/pytorch/pytorch/issues/131696

Training dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/anijain2305/435/head&lCommit=408b9358b8fca3a5d08b39741419fe8a596941aa&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))

![image](https://github.com/user-attachments/assets/08ef081c-37d7-436d-905b-4b9e2b470644)

Inference dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=gh/anijain2305/435/head&lCommit=914244fa2fe0055917e039e35183b21fa90afdc6&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))
![image](https://github.com/user-attachments/assets/32136eff-a39e-4cde-a438-e51a665bc3c9)

Inference sees a little bit more perf degradation but we are ok with that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131275
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #131744, #131928, #131948
2024-07-28 13:23:00 +00:00
8927fc209f [inductor] Add type hints to functions in debug.py (#131836)
Summary: As the title says.

Test Plan: lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131836
Approved by: https://github.com/eellison
2024-07-28 04:54:22 +00:00
500aea8d50 Build PT aarch64 on arm runner (#131964)
Another fix is needed to address https://github.com/pytorch/pytorch/actions/runs/10118374576/job/27985575620. The build needs to run on an arm runner to stay compatible with the Docker image.

### Testing

https://github.com/pytorch/pytorch/actions/runs/10118589329/job/27985670691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131964
Approved by: https://github.com/malfet
2024-07-28 04:50:38 +00:00
945bf78894 Revert "[BE] typing for decorators - fx/_compatibility (#131568)"
This reverts commit 193f62fde91ee20deb5ddcd9ff4593cd78d74c64.

Reverted https://github.com/pytorch/pytorch/pull/131568 on behalf of https://github.com/clee2000 due to same as https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359 but I clicked the wrong link by accident.  This is where it actually starts ([comment](https://github.com/pytorch/pytorch/pull/131568#issuecomment-2254330781))
2024-07-28 03:43:39 +00:00
b002ec61b6 Revert "[BE] typing for decorators - masked/_ops (#131569)"
This reverts commit aa58af8b43ad0e615415b4d754255f5be481d41a.

Reverted https://github.com/pytorch/pytorch/pull/131569 on behalf of https://github.com/clee2000 due to same as https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359 but I clicked the wrong link by accident.  This is where it actually starts ([comment](https://github.com/pytorch/pytorch/pull/131568#issuecomment-2254330781))
2024-07-28 03:43:39 +00:00
a3ba405871 Revert "[BE] typing for decorators - library (#131570)"
This reverts commit 5731b486c87bedff69aa0264d6c934bf723eb513.

Reverted https://github.com/pytorch/pytorch/pull/131570 on behalf of https://github.com/clee2000 due to same as https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359 but I clicked the wrong link by accident.  This is where it actually starts ([comment](https://github.com/pytorch/pytorch/pull/131568#issuecomment-2254330781))
2024-07-28 03:43:39 +00:00
a0abb77007 Revert "[BE] typing for decorators - distributed/_tensor/ops/utils (#131571)"
This reverts commit 4b985e6f803023ec301238d2b4bab4fbea4dd03c.

Reverted https://github.com/pytorch/pytorch/pull/131571 on behalf of https://github.com/clee2000 due to same as https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359 but I clicked the wrong link by accident.  This is where it actually starts ([comment](https://github.com/pytorch/pytorch/pull/131568#issuecomment-2254330781))
2024-07-28 03:43:39 +00:00
a8a9882899 Implement fused_scaled_matmul_reduce_scatter for async-TP (#131950)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131950
Approved by: https://github.com/weifengpy
ghstack dependencies: #131410, #131831, #131832, #131833
2024-07-28 03:39:12 +00:00
0538a69a8d [micro_pipeline_tp] support all-gather -> _scaled_mm (#131833)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131833
Approved by: https://github.com/weifengpy
ghstack dependencies: #131410, #131831, #131832
2024-07-28 03:39:11 +00:00
492e9a4886 [micro_pipeline_tp] add support for type-erased all-gather pattern observed in DTensor + float8_experimental (#131832)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131832
Approved by: https://github.com/weifengpy
ghstack dependencies: #131410, #131831
2024-07-28 03:39:11 +00:00
fd5b7d4bf9 Revert "[BE] typing for decorators - _meta_registrations (#131572)"
This reverts commit bfe0079b72aa3ed315ae8f140c97a5826c401a65.

Reverted https://github.com/pytorch/pytorch/pull/131572 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:32 +00:00
609447a626 Revert "[BE] typing for decorators - _jit_internal (#131573)"
This reverts commit f0f20f7e97716b4b077dca2a1a42930ccf990c1c.

Reverted https://github.com/pytorch/pytorch/pull/131573 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:32 +00:00
4684b8e9d7 Revert "[BE] typing for decorators - _inductor/lowering (#131574)"
This reverts commit b2cbcf710b26c4cb92d810fff46b6ddcb8d10cbf.

Reverted https://github.com/pytorch/pytorch/pull/131574 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:32 +00:00
07b7f51877 Revert "[BE] typing for decorators - _inductor/fx_passes/post_grad (#131575)"
This reverts commit 42dc5a47a157f9a441ceba53cf569cc42a640732.

Reverted https://github.com/pytorch/pytorch/pull/131575 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:32 +00:00
6a0c3bae21 Revert "[BE] typing for decorators - fx/experimental/migrate_gradual_types/constraint_generator (#131576)"
This reverts commit 37d76c7d48353cff5ed0d868b7ca486ad092ceaf.

Reverted https://github.com/pytorch/pytorch/pull/131576 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:32 +00:00
b1d640a2b7 Revert "[BE] typing for decorators - ao/quantization/quantizer/xnnpack_quantizer_utils (#131577)"
This reverts commit 5ee6a6dacc926da37ebe06e4206dcc307bf891f5.

Reverted https://github.com/pytorch/pytorch/pull/131577 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:32 +00:00
d3c17fea90 Revert "[BE] typing for decorators - _library/custom_ops (#131578)"
This reverts commit c65b197b85aeee61ed4c09527a8f6eecf8c20e27.

Reverted https://github.com/pytorch/pytorch/pull/131578 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:32 +00:00
065d0fe570 Revert "[BE] typing for decorators - fx/experimental/graph_gradual_typechecker (#131579)"
This reverts commit 79f0c4dc04c7976b734767d64c4833932219dcfb.

Reverted https://github.com/pytorch/pytorch/pull/131579 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:31 +00:00
5ced63a005 Revert "[BE] typing for decorators - utils/flop_counter (#131580)"
This reverts commit 81c26ba5ae1edf95da8f6956ae4b5ad23c9833c6.

Reverted https://github.com/pytorch/pytorch/pull/131580 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:31 +00:00
2c4023d65f Revert "[BE] typing for decorators - _refs/nn/functional (#131581)"
This reverts commit dbf7c318b2dd4652467f11f4aaebaa3ed372e728.

Reverted https://github.com/pytorch/pytorch/pull/131581 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:31 +00:00
e448f32944 Revert "[BE] typing for decorators - signal/windows/windows (#131582)"
This reverts commit 8689d377f9b60b70efa6608e654a3889f947f4d8.

Reverted https://github.com/pytorch/pytorch/pull/131582 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:31 +00:00
d90f6b45c0 Revert "[inductor] Add type hints to functions in mkldnn_fusion.py (#131820)"
This reverts commit fb3ddafbcfe6de1c4b208c020bc5ff4c4c4faf79.

Reverted https://github.com/pytorch/pytorch/pull/131820 on behalf of https://github.com/clee2000 due to reverting this to revert something else, only action you should need to do is to rebase and merge again, sorry for the churn ([comment](https://github.com/pytorch/pytorch/pull/131820#issuecomment-2254327833))
2024-07-28 03:26:14 +00:00
8f5cf46405 Revert "Fix public API tests (#131386)"
This reverts commit 91fcfd87600545c19b975bd6ea134f2f931bf84a.

Reverted https://github.com/pytorch/pytorch/pull/131386 on behalf of https://github.com/clee2000 due to reverting this to revert something else, only action you should need to do is to rebase and merge again, sorry for the churn ([comment](https://github.com/pytorch/pytorch/pull/131386#issuecomment-2254327487))
2024-07-28 03:23:04 +00:00
cyy
7be0ce51b6 Fix handle serialization error (#131871)
Attempting to serialise `std::string` in the C API is a bug.
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131871
Approved by: https://github.com/Skylion007
2024-07-28 00:33:20 +00:00
3e0ccb3a9f Fixing fake tensor SymInt caching (#131966)
Summary: Some tests are failing because of a weird interaction between the symbolic sizes and the `set()` - back it out for now.

Differential Revision: D60320595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131966
Approved by: https://github.com/oulgen
2024-07-27 22:43:57 +00:00
d07a125af2 [Inductor] supporting pointwise intermediate nodes in B2B-GEMM (#131685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131685
Approved by: https://github.com/eellison
2024-07-27 20:11:20 +00:00
14158d892a [BE][tests] show local variables on failure in tests (#131151)
------

As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI.

Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily.

Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361

```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000

    @classmethod
    def eval(cls, base, divisor):
        # python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
        # Assert triggered by inequality solver
        # assert base.is_integer, base
        # assert divisor.is_integer, divisor

        # We don't provide the same error message as in Python because SymPy
        # makes it difficult to check the types.
        if divisor.is_zero:
            raise ZeroDivisionError("division by zero")
        if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
            int_oo,
            -int_oo,
            sympy.oo,
            -sympy.oo,
        ):
            return sympy.nan
        if base is sympy.nan or divisor is sympy.nan:
            return sympy.nan

        if base.is_zero:
            return sympy.S.Zero
        if base.is_integer and divisor == 1:
            return base
        if base.is_integer and divisor == -1:
            return sympy.Mul(base, -1)
        if (
            isinstance(base, sympy.Number)
            and isinstance(divisor, sympy.Number)
            and (
                base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
                or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
            )
        ):
            r = float(base) / float(divisor)
            if r == math.inf:
                return int_oo
            elif r == -math.inf:
                return -int_oo
            elif math.isnan(r):
                return sympy.nan
            else:
                return sympy.Integer(math.floor(r))
        if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
            return sympy.Integer(int(base) // int(divisor))
        if isinstance(base, FloorDiv):
            return FloorDiv(base.args[0], base.args[1] * divisor)

        # Expands (x + y) // b into x // b + y // b.
        # This only works if floor is an identity, i.e. x / b is an integer.
        for term in sympy.Add.make_args(base):
            quotient = term / divisor
            if quotient.is_integer and isinstance(divisor, sympy.Integer):
                # NB: this is correct even if the divisor is not an integer, but it
                # creates rational expressions that cause problems with dynamic
                # shapes.
                return FloorDiv(base - term, divisor) + quotient

        try:
            gcd = sympy.gcd(base, divisor)
            if gcd != 1:
>               return FloorDiv(
                    sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
                )

base       = -1.00000000000000
cls        = FloorDiv
divisor    = -1.00000000000000
gcd        = 1.00000000000000
quotient   = 1.00000000000000
term       = -1.00000000000000

/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}

    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
>           retval = cfunc(*args, **kwargs)
E           RecursionError: maximum recursion depth exceeded in comparison
E
E           To execute this test, run the following from the base repo dir:
E               python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E           This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

args       = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc      = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func       = <function Function.__new__ at 0x7fc530317280>
kwargs     = {}
```
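
The `--locals` behaviour described above can be reproduced outside CI. The following is a minimal sketch, assuming only the standard library: `tb_locals=True` on `unittest.TextTestRunner` is the programmatic counterpart of `python -m unittest --locals`. The test case and the helper name `run_failing_test_with_locals` are illustrative and are not part of this PR.

```python
import io
import unittest


def run_failing_test_with_locals() -> str:
    """Run one deliberately failing test and return the runner's text report."""

    # Defined inside the function so test collectors do not pick it up.
    class FloorDivExample(unittest.TestCase):
        def test_floordiv(self):
            base = -1.0
            divisor = -1.0
            # Fails on purpose: -1.0 // -1.0 is 1.0, not 2.0.
            self.assertEqual(base // divisor, 2.0)

    stream = io.StringIO()
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(FloorDivExample)
    # tb_locals=True makes the traceback list each frame's local variables.
    unittest.TextTestRunner(stream=stream, tb_locals=True).run(suite)
    return stream.getvalue()


report = run_failing_test_with_locals()
```

The returned report contains the usual traceback followed by `name = value` lines for the locals of the failing frame, which is the same kind of information shown in the example output above.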

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
2024-07-27 19:39:40 +00:00
466ea8ce54 Add fallback() to torch.library (#131707)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131707
Approved by: https://github.com/zou3519
2024-07-27 18:02:35 +00:00
cyy
8e5a367311 [5/N] Fix clang-tidy warnings in jit (#131969)
Follows #131903
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131969
Approved by: https://github.com/ezyang
2024-07-27 17:54:20 +00:00
918ece4f4d [BE][Easy][11/19] enforce style for empty lines in import segments in test/dy*/ (#129762)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129762
Approved by: https://github.com/anijain2305
2024-07-27 17:43:53 +00:00
ae9f17a821 [aoti] Rename OSS DynamicArg and OpKernel (#131862)
Summary: Fixing P1495466240 which I think is due to the fact that internal also has an "OpKernel" in the same namespace, using thrift instead of json.

Test Plan: https://www.internalfb.com/intern/testinfra/testrun/4785074844896831

Differential Revision: D60273354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131862
Approved by: https://github.com/desertfire
2024-07-27 17:34:50 +00:00
8cdfdb41bc Revert "[NestedTensor] Integrate the layer normalization operator along the jagged dimension into NestedTensor (#131519)"
This reverts commit f862f457304f1952e75336f9f74e4ea3d2a5eb72.

Reverted https://github.com/pytorch/pytorch/pull/131519 on behalf of https://github.com/atalman due to broke CI: test_nestedtensor.py::TestNestedTensorSubclassCPU::test_layer_norm_with_lengths_requires_grad_False_components_require_grad_False_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10121747545/job/27996722731) [HUD commit link](f862f45730) ([comment](https://github.com/pytorch/pytorch/pull/131519#issuecomment-2254167994))
2024-07-27 14:45:47 +00:00
07389163f0 [C10][BE] Use range loop (#131922)
Non-functional change that iterates over entries in `getCollectiveTraceJson` with a range loop and uses `C10_UNUSED` rather than the `(void)i;` trick

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131922
Approved by: https://github.com/XilunWu
2024-07-27 11:26:27 +00:00
cyy
f83ef69b84 Fix typo in assignment operators (#131890)
Most typos were introduced in #131077
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131890
Approved by: https://github.com/Skylion007
2024-07-27 11:13:42 +00:00
cyy
c82441e07a Fix std::optional checking bug (#131874)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131874
Approved by: https://github.com/Skylion007
2024-07-27 11:08:10 +00:00
93a4671746 Add out_dtypes to fused_all_gather_scaled_matmul's args (#131831)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131831
Approved by: https://github.com/weifengpy
ghstack dependencies: #131410
2024-07-27 11:07:43 +00:00
12cd040edd [micro_pipeline_tp] exclude simple overlappable collectives as micro-pipeline TP candidates when reorder_for_compute_comm_overlap is enabled (#131410)
When a collective can be hidden through either simple overlapping or micro-pipeline TP, we prefer simple overlapping to avoid the overhead associated with decomposition. If `reorder_for_compute_comm_overlap` is enabled, we identify collectives that can be hidden through simple overlapping and exclude them from micro-pipeline TP candidates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131410
Approved by: https://github.com/weifengpy
2024-07-27 11:07:43 +00:00
36d24925c6 [inline_inbuilt_nn_modules][inductor-cpu] More skips for dynamic shapes when inlining enabled (#131948)
The issue is tracked here - https://github.com/pytorch/pytorch/issues/131929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131948
Approved by: https://github.com/eellison, https://github.com/leslie-fang-intel
ghstack dependencies: #131744, #131928
2024-07-27 10:03:49 +00:00
aee6bcdba4 [Traceable FSDP2][Inductor] Apply compute/comm reordering passes to achieve overlap (#131614)
This PR enables the Inductor compute/comm reordering passes for Traceable FSDP2 to achieve overlap. Note that the overlap is not maximally optimized yet; follow-up work will be done in subsequent PRs.

Test commands:
- `pytest -rA  test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131614
Approved by: https://github.com/yifuwang
ghstack dependencies: #131510
2024-07-27 08:39:58 +00:00
9e06572704 [Traceable FSDP2][Inductor] Create grouped nodes for FSDP2 all-gather code block and reduce-scatter code block (after Buffer/Operation split) (#131510)
This PR creates these `GroupedSchedulerNode`s:
- One for each all-gather code block (cast + copy-in + all-gather)
- One for each all-gather-wait code block (all-gather-wait + copy-out)
- One for each reduce-scatter code block (copy-in + reduce-scatter)
- One for each reduce-scatter-wait code block (reduce-scatter-wait)

This serves two goals:
- Prevent outside ops from being fused into these op groups, in order to have more predictable memory usage.
- Make it easier to specify the dependency e.g. from `i+1` all-gather group node to the `i` all-gather-wait group node, to enforce FSDP2 comm ordering (i.e. "serialization of comms").
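
The dependency pattern in the second goal can be sketched in plain Python. This is only an illustration of the ordering constraint, assuming one all-gather group and one all-gather-wait group per layer. The node names and the helper `fsdp_comm_order_edges` are hypothetical and do not correspond to the scheduler's actual API.

```python
from typing import List, Tuple


def fsdp_comm_order_edges(num_layers: int) -> List[Tuple[str, str]]:
    """Return (node, depends_on) pairs that serialize the all-gathers.

    The all-gather group of layer i + 1 is made to depend on the
    all-gather-wait group of layer i.
    """
    edges = []
    for i in range(num_layers - 1):
        edges.append((f"all_gather_{i + 1}", f"all_gather_wait_{i}"))
    return edges


edges = fsdp_comm_order_edges(3)
```

With each code block collapsed into a single group node, one edge per pair of adjacent layers is enough to express the ordering.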

The actual "reorder-for-FSDP-compute-comm-overlap" PR will come next.

Test commands:
- `pytest -rA  test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131510
Approved by: https://github.com/yifuwang
2024-07-27 08:39:58 +00:00
cyy
99e13e68e9 [4/N] Fix clang-tidy warnings in jit (#131903)
Follows #131830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131903
Approved by: https://github.com/Skylion007
2024-07-27 08:08:14 +00:00
f862f45730 [NestedTensor] Integrate the layer normalization operator along the jagged dimension into NestedTensor (#131519)
Modify the existing `layer normalization` operator in PyTorch, invoked by `torch.layer_norm`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the `aten` padding operator, enables PyTorch users to invoke `torch.nn.functional.layer_norm` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` or `(B, *, M, N)` nested tensor.

Write unit tests based on the `softmax` jagged operator to verify the accuracy of the ragged reduction implementation for `torch.nn.functional.layer_norm`. Add unit tests to verify error handling for unsupported features.

Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. The layer normalization operator also requires an operation on a 2-dimensional layer; for nested tensors with 4 or more dimensions, I flatten the extra dimensions, then unflatten them after performing layer normalization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131519
Approved by: https://github.com/davidberard98
ghstack dependencies: #131518
2024-07-27 07:09:10 +00:00
bcf5c68c18 [NestedTensor] Integrate the softmax operator along the jagged dimension into NestedTensor (#131518)
Modify the existing `softmax` operator in PyTorch, invoked by `torch.softmax`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the aten padding operator, enables PyTorch users to invoke `torch.softmax` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` nested tensor.

Write unit tests based on the `sum` and `mean` jagged operators to verify the accuracy of the ragged reduction implementation for `torch.softmax`. Add unit tests to verify error handling for unsupported features in `NestedTensor` `torch.softmax`.

Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. In addition, the `softmax` operator is required to take in as input an integer for the reduction dimension `dim`, requiring new unit tests heavily inspired by the `sum` and `mean` jagged operator unit tests. `Softmax` also allows for reducing along the batch dimension.
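
To make "reducing along the ragged dimension" concrete, here is a minimal pure-Python sketch. It does not use the `aten` padding operator from this diff; instead it loops over each component of a `(B, *, M)` nested list and applies a numerically stable softmax down each column. The type alias and function name are illustrative, and every component is assumed to have at least one row.

```python
import math
from typing import List

# Shape (B, *, M): B components, each with its own number of rows of width M.
Ragged = List[List[List[float]]]


def softmax_over_ragged_dim(nt: Ragged) -> Ragged:
    """Softmax along the ragged dimension (dim=1) of a (B, *, M) nested list."""
    out = []
    for component in nt:
        num_cols = len(component[0])
        result = [[0.0] * num_cols for _ in component]
        for m in range(num_cols):
            column = [row[m] for row in component]
            peak = max(column)  # subtract the max for numerical stability
            exps = [math.exp(v - peak) for v in column]
            total = sum(exps)
            for r, e in enumerate(exps):
                result[r][m] = e / total
        out.append(result)
    return out


example = [[[0.0, 1.0], [0.0, 3.0]], [[5.0, 5.0]]]
probs = softmax_over_ragged_dim(example)
```

Each column of each component sums to 1, regardless of how many rows that component has.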
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131518
Approved by: https://github.com/davidberard98
2024-07-27 07:09:10 +00:00
c49e857d32 [pt] immutable accessors in graph signature (#131940)
Summary: splitting PT part of D60253955

Test Plan: existing tests

Differential Revision: D60296909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131940
Approved by: https://github.com/angelayi, https://github.com/zhxchen17
2024-07-27 05:32:53 +00:00
96c1862e0b Remove mypy ignore from torch/_dynamo/variables/__init__.py (#131784)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131784
Approved by: https://github.com/aorenste, https://github.com/zou3519, https://github.com/Skylion007
2024-07-27 05:07:33 +00:00
1bfe7eb7e6 Update how we do sdpa testing (#131743)
## Motivation

This refactor aligns our testing methodology with the Flash Attention upstream repository while addressing several key issues:

1. **Standardized comparison**: We now compare fused kernels against float64 references, using the maximum of a calculated tolerance (based on same-precision math implementation) or standard float32 `atol`.

2. **Reduced redundancy**: Utilizing the same tensors for both same-precision math and fused kernel runs eliminates duplication.

3. **Improved maintainability**: The new approach simplifies tolerance adjustments across all affected tests.

4. **Consistency**: Standardizing tensor comparisons ensures a more uniform and reliable testing suite.

These changes collectively simplify our testing code, improve its maintainability, and provide a more robust framework for validating our attention mechanisms.
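
The comparison rule in item 1 can be sketched as follows. This is a simplified illustration on plain Python lists rather than tensors. The `fudge_factor` and the `DEFAULT_ATOL` value are assumptions chosen for the example, not values taken from the test suite.

```python
from typing import Sequence

DEFAULT_ATOL = 1e-5  # assumed stand-in for the standard float32 atol


def max_abs_error(actual: Sequence[float], reference: Sequence[float]) -> float:
    """Largest element-wise absolute deviation from the reference."""
    return max(abs(a - r) for a, r in zip(actual, reference))


def fused_kernel_tolerance(
    math_low_precision: Sequence[float],
    reference_fp64: Sequence[float],
    fudge_factor: float = 2.0,  # assumed slack on top of the math error
    default_atol: float = DEFAULT_ATOL,
) -> float:
    """Tolerance = max(scaled error of the same-precision math path, default atol)."""
    calculated = fudge_factor * max_abs_error(math_low_precision, reference_fp64)
    return max(calculated, default_atol)


def fused_output_is_close(fused, math_low_precision, reference_fp64) -> bool:
    """Compare the fused output against the float64 reference."""
    tol = fused_kernel_tolerance(math_low_precision, reference_fp64)
    return max_abs_error(fused, reference_fp64) <= tol
```

The fused kernel is therefore allowed roughly as much error as the same-precision math path shows against the float64 reference, with the default `atol` acting as a floor.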

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131743
Approved by: https://github.com/jainapurva, https://github.com/jbschlosser
2024-07-27 03:58:49 +00:00
bcdba9f91d Added hpu backend support in fsdp utils (#127757)
In fsdp init_utils, add support for the hpu backend device in the `_get_device` API.

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127757
Approved by: https://github.com/wconstab, https://github.com/jgong5, https://github.com/awgu
2024-07-27 03:30:59 +00:00
28fd2e905d [inductor] enhance cpp_builder lint check. (#131752)
Enhance the cpp_builder `mypy` check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131752
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-27 02:46:27 +00:00
a90b8b967a [inductor] enable windows inductor UTs (#131767)
Changes:
1. Add `skipIfWindows` function.
2. Fix the error that `fresh_inductor_cache` raises on Windows because loaded modules cannot be deleted.
3. Disable some UTs that do not pass on Windows.
4. Enable test_torchinductor in Windows CI.
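
A decorator like the one in item 1 can be built on the standard library's `unittest.skipIf`. The sketch below is a hypothetical stand-in written for illustration. The name `skip_if_windows`, its default reason string, and the sample function are assumptions and are not the helper added by this PR.

```python
import sys
import unittest

IS_WINDOWS = sys.platform == "win32"


def skip_if_windows(reason: str = "not supported on Windows"):
    """Hypothetical stand-in for `skipIfWindows`: skip a test when on Windows."""
    return unittest.skipIf(IS_WINDOWS, reason)


@skip_if_windows("illustrative reason: behaviour differs on Windows")
def sample_case():
    # On non-Windows platforms the function is returned unchanged.
    return "ran"
```

On Windows, `unittest` marks the decorated callable as skipped and raises `unittest.SkipTest` when it is called. On other platforms the decorator is a no-op.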

I have verified that the tests pass on my dev machine:
<img width="864" alt="image" src="https://github.com/user-attachments/assets/91d5a62f-7383-44b3-b614-99940f196fdb">

TODO: review and fix the skipped cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131767
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-27 02:46:03 +00:00
3768faec2f carry cond in data-dependent error (#131932)
Test Plan: existing

Differential Revision: D60302877

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131932
Approved by: https://github.com/zhxchen17
2024-07-27 02:13:04 +00:00
9606d61e0c [reland][inductor] switch AotCodeCompiler to new cpp_builder (#130127)
Changes:
1. Switch `AotCodeCompiler` to new cpp_builder.
2. Only use `deprecated_cpp_compile_command` for `fb_code`, because I cannot debug it further without access to the Meta internal environment.
3. Add `TODO` comments so that a Meta employee can help continue this work.
4. Given item 3, `deprecated_cpp_compile_command` only remains to be fixed for `fb_code`, so let's remove `validate_new_cpp_commands`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-27 01:46:13 +00:00
fdf1451bfa Add __all__ to torch.optim to define public interface (#131959)
There was a regression in the public interface for `torch.optim` introduced in #125452 when `torch/optim/__init__.pyi` was merged into `torch/optim/__init__.py`. [The import aliases were not preserved and so now `pyright` thinks that these classes are not publicly exported from `torch/optim/__init__.py`.](https://github.com/pytorch/pytorch/pull/125452/files#diff-941595c1e1aa06bec94578499dd3654532a5183d0bc1bcd94d1f33b47e0d0adfL1-L15)

```
error: "SGD" is not exported from module "torch.optim"
```

Adding these classes/modules to `__all__` fixes this.
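
The runtime effect of `__all__` can be shown with a small standard-library sketch. The module `toy_optim` and the helper `public_names` are hypothetical. The sketch only mirrors what a star-import would bind and does not reproduce `pyright`'s export analysis.

```python
import types


def public_names(module) -> list:
    """Names a star-import would bind: `__all__` if set, else non-underscore names."""
    declared = getattr(module, "__all__", None)
    if declared is not None:
        return sorted(declared)
    return sorted(name for name in vars(module) if not name.startswith("_"))


# Hypothetical module standing in for a package `__init__.py`.
toy_optim = types.ModuleType("toy_optim")
exec("import math\nclass SGD: pass\nclass Adam: pass\n", toy_optim.__dict__)

before = public_names(toy_optim)  # the imported `math` module leaks into the list
toy_optim.__all__ = ["SGD", "Adam"]
after = public_names(toy_optim)   # only the declared public classes remain
```

Declaring `__all__` states the intended public surface explicitly, so tools do not have to infer it from how names happen to be imported.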

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131959
Approved by: https://github.com/ezyang
2024-07-27 01:03:25 +00:00
8458980bbf Move benchmarks/dynamo/huggingface configuration to YAML (#131724)
Similar to https://github.com/pytorch/pytorch/pull/120299

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131724
Approved by: https://github.com/shunting314
2024-07-27 00:55:04 +00:00
ef8d118c67 Sync with changes to test-infra's scale-config.yml (#131955)
This synchronizes lf-canary-scale-config and lf-scale-config with the one in test-infra.

This really needs some automatic validation to prevent it from drifting out of sync over and over again (coming soon...)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131955
Approved by: https://github.com/malfet
2024-07-27 00:25:40 +00:00
8b04edcac1 Delete unused yml files (#131298)
To be landed at least 3 days after the previous commit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131298
Approved by: https://github.com/ZainRizvi
ghstack dependencies: #130762
2024-07-27 00:21:22 +00:00
1e00f055a4 Move distributed experimental jobs back to the amazon2 for now (#131963)
Something about the new Amazon2023 AMI is making some distributed tests fail. Moving them back to the old AMI until the issue is fixed

These particular jobs are causing this test to fail:
https://github.com/pytorch/pytorch/issues/129539

More details in https://github.com/pytorch/pytorch/issues/131962
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131963
Approved by: https://github.com/clee2000
2024-07-26 23:44:56 +00:00
91fcfd8760 Fix public API tests (#131386)
This PR fixes a bug in `test_correct_module_names` introduced in #130497. It also addresses post-fix test failures in:
* `torch/ao/quantization/__init__.py` - set the correct `__module__` for several public API helpers
* `torch/library.py` - add `register_vmap` to `__all__`
* `torch/nn/attention/flex_attention.py` - make `round_up_to_multiple` private by prepending an underscore
* `torch/storage.py` - introduce `__all__` to avoid `Self` being re-exported as a public API
* `torch/distributed/pipelining/schedules.py` - add `ZeroBubbleAlgorithm` to `__all__`
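
The kind of check behind the `__module__` fix can be sketched as below. This is a simplified illustration and not the logic of `test_correct_module_names`. The helper `misattributed_names` and the `toy_pkg` module are hypothetical.

```python
import types


def misattributed_names(module) -> list:
    """Public names in `__all__` whose `__module__` differs from the exporting module."""
    bad = []
    for name in getattr(module, "__all__", []):
        obj = getattr(module, name)
        owner = getattr(obj, "__module__", None)
        if owner is not None and owner != module.__name__:
            bad.append(name)
    return bad


# A helper implemented in a private submodule but exported from the package.
Helper = type("Helper", (), {"__module__": "toy_pkg._impl"})
toy_pkg = types.ModuleType("toy_pkg")
toy_pkg.Helper = Helper
toy_pkg.__all__ = ["Helper"]

found_before = misattributed_names(toy_pkg)
Helper.__module__ = "toy_pkg"  # set the correct public location
found_after = misattributed_names(toy_pkg)
```

Setting `__module__` to the public location makes the object report the path users are expected to import it from.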

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131386
Approved by: https://github.com/albanD
2024-07-26 23:38:43 +00:00
02b922900b [aoti] Fix float16 and bfloat16 for generated GPU code (#131437)
Fixes #131333

Summary:
- Add header to define `float16` and `bfloat16` as `at::Half` and `at::BFloat16`.
- Change `float16` and `bfloat16` to `float` before passing to kernel.

code generated before:
```cpp
.....
    half var_1;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float16(convert_arrayref_tensor_to_tensor(arg1_1), &var_1));
....
```

code generated now:
```cpp
typedef at::Half half;
typedef at::BFloat16 bfloat16;
.....
    half var_1_tmp;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float16(convert_arrayref_tensor_to_tensor(arg1_1), &var_1_tmp));
    float var_1 = float(var_1_tmp);
....
```

Test plan: `TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_unspec_inputs_cuda`
Work in progress.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131437
Approved by: https://github.com/desertfire
2024-07-26 23:36:11 +00:00
0272934238 [Inductor][CPU] Fix an InvalidVecISA issue on CI (#131812)
Summary: CPU CI nodes failed to find valid VecISA because importing torch under the default pytorch directory will fail with the following msg, so switch cwd to a tmp directory.

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/var/lib/jenkins/workspace/torch/__init__.py", line 66, in <module>
    from torch.torch_version import __version__ as __version__
  File "/var/lib/jenkins/workspace/torch/torch_version.py", line 4, in <module>
    from torch.version import __version__ as internal_version
ModuleNotFoundError: No module named 'torch.version'
```
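
The workaround described in the summary (running the probe from a temporary working directory so that the source tree's `torch/` directory is not picked up) can be sketched with the standard library. The helper name `child_cwd_when_run_from_tmp` is illustrative and the child command is a placeholder for the real ISA check.

```python
import os
import subprocess
import sys
import tempfile


def child_cwd_when_run_from_tmp() -> tuple:
    """Start a child interpreter in a fresh temp dir and report where it ran."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        result = subprocess.run(
            [sys.executable, "-c", "import os; print(os.getcwd())"],
            cwd=tmp_dir,  # the child no longer sees the repository checkout as cwd
            capture_output=True,
            text=True,
            check=True,
        )
        # Resolve symlinks so both paths are comparable on every platform.
        return os.path.realpath(tmp_dir), os.path.realpath(result.stdout.strip())


expected_dir, actual_dir = child_cwd_when_run_from_tmp()
```

Because the child starts in an empty directory, an `import` inside it resolves against the installed packages instead of a partially built checkout.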

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131812
Approved by: https://github.com/eellison, https://github.com/malfet
2024-07-26 22:31:44 +00:00
5489ff8e94 Use Mermaid for the diagram in torch/ao/quantization/fx/README.md (#131412)
preview 3a0efcdfa3/torch/ao/quantization/fx/README.md
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131412
Approved by: https://github.com/jerryzh168
2024-07-26 22:01:21 +00:00
16cd1aaa1d [inductor] Improve sort kernel perf (#131719)
Closes #129507

This makes two changes to the sort kernel:
1. Use int16 for the indices since we only operate on small dims anyway
2. Instead of passing an explicit mask, we pass the rnumel and imply the
   mask from that which saves an additional reduction in the sort
   kernel's inner loop.

In my benchmarks, this gives enough of a perf improvement to bump up the
max rblock to 512.
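
Both changes can be illustrated in plain Python. This is a conceptual sketch and not the Triton kernel itself. The helper names are hypothetical.

```python
INT16_MAX = 2**15 - 1


def implied_mask(rnumel: int, rblock: int) -> list:
    """Mask for a padded block: lane i is valid iff i < rnumel."""
    return [i < rnumel for i in range(rblock)]


def indices_fit_int16(rblock: int) -> bool:
    """Check that every index in a block of size rblock is representable as int16."""
    return rblock - 1 <= INT16_MAX


mask = implied_mask(rnumel=3, rblock=4)
```

Since the mask is a pure function of the lane index and `rnumel`, it does not need to be passed in as a separate argument.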

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131719
Approved by: https://github.com/eellison
2024-07-26 21:56:47 +00:00
b90bc66766 Enable FlashAttention on Windows (#131906)
Let's just give this a try.

Reland of https://github.com/pytorch/pytorch/pull/131875.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131906
Approved by: https://github.com/drisspg
2024-07-26 21:41:56 +00:00
d73b55d64b Support meta tensors as inputs to the triton_kernel_wrapper HOPs (#131896)
We automatically generate FakeTensor support for them (the FakeTensor
kernel for a triton kernel is "return None"). The same thing should
apply to the meta kernel.

Tests:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131896
Approved by: https://github.com/oulgen
2024-07-26 21:41:03 +00:00
fb98cd33f1 [inline_inbuilt_nn_modules][inductor-cpu] Skip test_quantized_linear_amx (#131928)
The issue is tracked here - https://github.com/pytorch/pytorch/issues/131929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131928
Approved by: https://github.com/eellison
ghstack dependencies: #131744
2024-07-26 21:28:17 +00:00
c8626a4e1f [BE] add a list of inductor test files to skip resetting dynamo (#131551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131551
Approved by: https://github.com/zou3519
2024-07-26 21:08:15 +00:00
fde577702d [TD] More synonyms for filepath (#131838)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131838
Approved by: https://github.com/PaliC, https://github.com/ZainRizvi
2024-07-26 21:02:42 +00:00
1bda3a3135 Migrate nightly.yml workflow & docs to Amazon 2023 (#131821)
A continuation of the migration started in
- https://github.com/pytorch/pytorch/pull/131250

Migrates nightly jobs and the linux-docs job in pull.yml

To preserve reusability, I'm switching to a new format here that allows one to only specify the runner prefix instead of the full runner name, allowing multiple jobs to continue using the same base runner type like how they did before

**Validation:**
- Nightly builds passed in the prev commit: https://github.com/pytorch/pytorch/actions/runs/10102118461/job/27937632823?pr=131821
- Latest commit only updated the docs job in pull.yml, and that has already passed: https://github.com/pytorch/pytorch/actions/runs/10114635537/job/27974392472?pr=131821

The other in-progress jobs are irrelevant
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131821
Approved by: https://github.com/atalman, https://github.com/seemethere
2024-07-26 20:54:43 +00:00
0e6df1e0fb Disable remote cache on test (#131908)
Summary: Fixes test internally

Test Plan:
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cudagraph_trees -- --exact 'caffe2/test/inductor:cudagraph_trees - test_cache_hit_forward_miss_backward (caffe2.test.inductor.test_cudagraph_trees.CudaGraphTreeTests)'

Passes

Differential Revision: D60293177

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131908
Approved by: https://github.com/clee2000
2024-07-26 20:19:02 +00:00
071ac38141 fast-path FakeTensor detach (#131899)
Fixes https://github.com/pytorch/pytorch/issues/128281, see investigation at https://github.com/pytorch/pytorch/issues/128281#issuecomment-2252976926.

benchmark:
```
python benchmarks/dynamo/huggingface.py --performance --timing --explain --backend aot_eager --device cuda --training --float32 --only BertForMaskedLM
```

time before:
```
TIMING: entire_frame_compile:30.85435 backend_compile:23.98599 total_wall_time:30.85435
```

time after:
```
TIMING: entire_frame_compile:24.35898 backend_compile:18.15235 total_wall_time:24.35898
```
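
As a quick check of the numbers above, the relative reduction works out to roughly 21% for `entire_frame_compile` and roughly 24% for `backend_compile`. A small sketch of the arithmetic, where the helper name `relative_reduction` is illustrative:

```python
def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original time that was saved."""
    return (before - after) / before


# (before, after) timings in seconds, copied from the TIMING lines above.
timings = {
    "entire_frame_compile": (30.85435, 24.35898),
    "backend_compile": (23.98599, 18.15235),
}

reductions = {name: relative_reduction(b, a) for name, (b, a) in timings.items()}
for name, r in reductions.items():
    print(f"{name}: {r:.1%} less time")
```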

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131899
Approved by: https://github.com/ezyang, https://github.com/zou3519, https://github.com/albanD
2024-07-26 20:16:08 +00:00
2ec8312a28 Add rerun_disabled_tests for inductor (#131681)
Test in prod?

This also turns on mem leak check

Briefly checked that
```
 python3 ".github/scripts/filter_test_configs.py" \
    --workflow "inductor" \
    --job-name "cuda12.1-py3.10-gcc9-sm86 / build" \
    --test-matrix "{ include: [
    { config: "inductor", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_distributed", shard: 1, num_shards: 1, runner: "linux.g5.12xlarge.nvidia.gpu" },
    { config: "inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_cpp_wrapper_abi_compatible", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
  ]}
  " \
    --selected-test-configs "" \
    --pr-number "${PR_NUMBER}" \
    --tag "${TAG}" \
    --event-name "schedule" \
    --schedule "29 8 * * *" \
    --branch "${HEAD_BRANCH}"
```
has rerun disabled tests option in the test matrix

I don't think all these things need to run but I'm not sure which ones (probably just inductor?)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131681
Approved by: https://github.com/zou3519
2024-07-26 20:05:24 +00:00
da1a1fa55f Move load_yaml_file to common (#131924)
This is for https://github.com/pytorch/pytorch/pull/131724 and future timm_models.py refactoring.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131924
Approved by: https://github.com/shunting314, https://github.com/huydhn
2024-07-26 19:47:52 +00:00
6c95f79645 [CI] Increase the timeout for aarch64 docker build (#131926)
Summary: Increase the timeout limit for pytorch-linux-jammy-aarch64-py3.10-gcc11-inductor-benchmarks. If slow build is a problem later, we can upgrade the arm64 CI instance capability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131926
Approved by: https://github.com/avikchaudhuri
2024-07-26 19:27:45 +00:00
782efd8e5b Revert "Add rerun_disabled_tests for inductor (#131681)"
This reverts commit 85fa66be04b6f78139da4f0ec8f8b1956291e1c5.

Reverted https://github.com/pytorch/pytorch/pull/131681 on behalf of https://github.com/clee2000 due to this is the wrong file ([comment](https://github.com/pytorch/pytorch/pull/131681#issuecomment-2253318038))
2024-07-26 19:08:59 +00:00
0f9bf208ec Revert "[BE][tests] show local variables on failure in tests (#131151)"
This reverts commit 054d214c504b415b155ef2da1a70764a115e1276.

Reverted https://github.com/pytorch/pytorch/pull/131151 on behalf of https://github.com/jbschlosser due to pollutes test failure output for OpInfo tests ([comment](https://github.com/pytorch/pytorch/pull/131151#issuecomment-2253310448))
2024-07-26 19:03:10 +00:00
a3cdbd8189 [FlopCounterMode] Fix register_flop_formula (#131777)
Previously, FlopCounterMode would ignore any custom ops registered
through `register_flop_formula`. The problem was:
- register_flop_formula(target) requires target to be an OpOverloadPacket.
- register_flop_formula used register_decomposition to populate its registry
- register_decomposition decomposes the OpOverloadPacket into OpOverload before
  putting it into the registry
- FlopCounterMode ignores OpOverloads in its registry (it assumes the
  registry is a dictionary mapping OpOverloadPacket to flop formula).

register_decomposition is too heavy of a hammer, plus this isn't a
decomposition, so I changed the registration mechanism.
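
A toy model of the key mismatch described above, in plain Python. The `OverloadPacket` and `Overload` classes and the three helper functions are hypothetical stand-ins for illustration, not the real torch classes or the actual registration code:

```python
class Overload:
    """Stand-in for an OpOverload such as aten.mm.default."""

    def __init__(self, packet, name):
        self.packet = packet
        self.name = name


class OverloadPacket:
    """Stand-in for an OpOverloadPacket such as aten.mm."""

    def __init__(self, name, overload_names):
        self.name = name
        self.overloads = [Overload(self, n) for n in overload_names]


def register_via_overloads(registry, packet, formula):
    # Old behavior: the packet is split into overloads before being stored.
    for overload in packet.overloads:
        registry[overload] = formula


def register_via_packet(registry, packet, formula):
    # New behavior: the packet itself is the registry key.
    registry[packet] = formula


def count_flops(registry, called_overload, *shape):
    # The counter looks up the *packet* of whatever overload was called.
    formula = registry.get(called_overload.packet)
    return formula(*shape) if formula is not None else 0


mm = OverloadPacket("mm", ["default", "out"])


def mm_flops(m, k, n):
    return 2 * m * k * n


broken, fixed = {}, {}
register_via_overloads(broken, mm, mm_flops)
register_via_packet(fixed, mm, mm_flops)
print(count_flops(broken, mm.overloads[0], 2, 3, 4))  # 0: formula is never found
print(count_flops(fixed, mm.overloads[0], 2, 3, 4))   # 48
```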

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131777
Approved by: https://github.com/Chillee
2024-07-26 18:44:50 +00:00
cd53698df0 Add hpu backend support for dynamo torchVariable _in_graph_classes() function (#129948)
A recent change from PR#
f657b2b1f8 (diff-4a52059570bb96333d8383ce6a9d01bbb114c5e34aff6028f820899ca39b5a26R80) hard-codes the flow to the CUDA stream in the in-graph function. For non-CUDA backends (hpu in our case), this breaks the graph.

This PR adds hpu backend support to the dynamo variables function `_in_graph_classes()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129948
Approved by: https://github.com/yanboliang
2024-07-26 18:38:03 +00:00
5f2c80d16d Add inductor OrderedSet (#130003)
Implemented by extending `collections.abc.MutableSet` and backing it with a dictionary, which is ordered. From collections.abc.MutableSet:

```
    A mutable set is a finite, iterable container.

    This class provides concrete generic implementations of all
    methods except for __contains__, __iter__, __len__,
    add(), and discard().
```

In addition to implementing those methods I also had to define some methods of python's set which were not implemented in MutableSet.
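
A minimal sketch of the approach, assuming nothing about the actual inductor class beyond what is described here. The name `OrderedSetSketch` and the choice of `union` as the example of an extra `set` method are illustrative only:

```python
from collections.abc import MutableSet


class OrderedSetSketch(MutableSet):
    """Insertion-ordered set backed by a dict (dicts keep insertion order)."""

    def __init__(self, iterable=None):
        self._dict = dict.fromkeys(iterable if iterable is not None else ())

    # The five methods MutableSet leaves abstract:
    def __contains__(self, item):
        return item in self._dict

    def __iter__(self):
        return iter(self._dict)

    def __len__(self):
        return len(self._dict)

    def add(self, item):
        self._dict[item] = None

    def discard(self, item):
        self._dict.pop(item, None)

    # An example of a built-in set method that MutableSet does not define:
    def union(self, *others):
        result = OrderedSetSketch(self)
        for other in others:
            for item in other:
                result.add(item)
        return result

    def __repr__(self):
        return f"OrderedSetSketch({list(self._dict)})"


s = OrderedSetSketch([3, 1, 3, 2])
print(list(s))                   # [3, 1, 2]
print(list(s.union([5, 1])))     # [3, 1, 2, 5]
```

Operators such as `|`, `&`, and `-` come for free from the `MutableSet` mixins, which is why only the named methods need extra definitions.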

I reused the tests from my Python's lib. A few tests did not pass because of edge-case behavior that is not necessary to reimplement:
- support self-referencing repr
- erroring when a member's `__eq__` function would modify the set itself
- MutableSet supports Iterables as inputs, but not sequences (pretty rare..)
- Some specifics of exact equivalent type errors being thrown
- [The protocol for automatic conversion to immutable](https://docs.python.org/2/library/sets.html#protocol-for-automatic-conversion-to-immutable)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130003
Approved by: https://github.com/aorenste
2024-07-26 18:16:57 +00:00
1dd10ac802 [BE] [Reland] Make nn.Module state_dict load_state_dict pre-hook and state_dict post-hook public (#131690)
Reland https://github.com/pytorch/pytorch/pull/126704

#### Fixes the issue with type of `nn.Module._state_dict_hooks` being changed in that PR which was problematic:
Instead of using `Tuple(Callable, bool)` to keep track of whether the private `_register_state_dict_hook` or the public `register_state_dict_post_hook` API was used to register the hook and toggle the behavior accordingly, I set an attribute on the Callable in the private API, which is never cleaned up.

If a callable previously registered using the private API is registered via the public API, a RuntimeError will be raised
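
A rough sketch of the tagging idea in plain Python, following the description above. The class `HookRegistrySketch`, the attribute name in `_PRIVATE_TAG`, and the method names are hypothetical and are not the `nn.Module` implementation:

```python
_PRIVATE_TAG = "_registered_via_private_api"  # hypothetical attribute name


class HookRegistrySketch:
    def __init__(self):
        self._hooks = []

    def _register_hook(self, hook):
        # Private API: tag the callable itself. The tag is never removed.
        setattr(hook, _PRIVATE_TAG, True)
        self._hooks.append(hook)

    def register_post_hook(self, hook):
        # Public API: refuse callables already tagged by the private API.
        if getattr(hook, _PRIVATE_TAG, False):
            raise RuntimeError("hook was already registered via the private API")
        self._hooks.append(hook)

    def run_hooks(self, state_dict):
        for hook in self._hooks:
            result = hook(state_dict)
            # Only privately registered hooks may replace the state_dict;
            # the return value of public post-hooks is ignored.
            if getattr(hook, _PRIVATE_TAG, False) and result is not None:
                state_dict = result
        return state_dict
```

Because the tag lives on the callable, the hook list can keep holding plain callables instead of `(callable, bool)` tuples. Note that this only works for callables that accept attribute assignment, such as plain functions.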

#### Copied from previous PR description
Fixes https://github.com/pytorch/pytorch/issues/75287 and https://github.com/pytorch/pytorch/issues/117437

- `nn.Module._register_state_dict_hook` --> add public `nn.Module.register_state_dict_post_hook`
   - Add a test as this API was previously untested
- `nn.Module._register_load_state_dict_pre_hook` --> add public `nn.Module.register_load_state_dict_pre_hook` (remove the `with_module` flag, default it to `True`)
    ~- For consistency with optimizer `load_state_dict_pre_hook` raised by @janeyx99, allow the pre-hook to return a new `state_dict`~
 - For the issue raised in https://github.com/pytorch/pytorch/issues/117437 regarding the `_register_state_dict_hook` semantic of returning a new state_dict only being respected for the root for the private hook
       - Document this for private `_register_state_dict_hook`
       - Remove this for the public `register_state_dict_post_hook`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131690
Approved by: https://github.com/albanD
2024-07-26 18:14:07 +00:00
8158cf2f59 [c10d] Fix split_group usage when there is a single rank (#131824)
Summary:
This is a request from xlformer team to allow single rank PG/comms
Test Plan:
UT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131824
Approved by: https://github.com/pavanbalaji, https://github.com/fduwjj
2024-07-26 18:11:17 +00:00
e191b83462 Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633)"
This reverts commit 709ddf7a9dcfa1268848b72f6f56b55afa6728d6.

Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to still failing internally D60265673 ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2253239607))
2024-07-26 18:08:20 +00:00
e4db5dc1c4 Revert "[BE] remove unnecessary _dispatch_sqrt by using ** 0.5 (#131358)"
This reverts commit 4c7f22dee25649cd895bc382192d29f39e482215.

Reverted https://github.com/pytorch/pytorch/pull/131358 on behalf of https://github.com/janeyx99 due to Internal uses this private API and landing that has been a pain so we're reverting this first ([comment](https://github.com/pytorch/pytorch/pull/131358#issuecomment-2253190654))
2024-07-26 17:35:27 +00:00
2576dbbc35 [dynamo] implement IteratorVariable and polyfill fallbacks for enumerate (#131725)
Fixes https://github.com/pytorch/pytorch/issues/112794.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131725
Approved by: https://github.com/anijain2305
ghstack dependencies: #131413, #131716
2024-07-26 17:17:09 +00:00
35b4de32fa [dynamo] add itertools repeat/count bytecode reconstruction (#131716)
Also fix bugs in the count iterator variable implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131716
Approved by: https://github.com/anijain2305
ghstack dependencies: #131413
2024-07-26 17:17:09 +00:00
40cc5c0697 [AOT Autograd] Donated Buffer (#130580)
Implements donated buffer feature and adds unit tests. Donated buffer is a saved tensor that is not aliased with forward inputs, fw_outputs (except saved tensors), and bw_outputs. We detect donated buffers during `aot_dispatch_autograd` and store donated buffers in `ViewAndMutationMetadata`, such that it can be accessed in inductor.

Fixes #129496

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130580
Approved by: https://github.com/bdhirsh
2024-07-26 17:14:34 +00:00
9589d986fa [UT] Relax atol for test_non_contiguous_input_* (3 tests) (#131822)
BE task T195600898 (internal).

The 3 tests
```
test_non_contiguous_input_mm
test_non_contiguous_input_bmm
test_non_contiguous_input_addmm
```
had the following error in TestX:
```
self.assertTrue(torch.allclose(ref, act, atol=1e-2, rtol=1e-2))
AssertionError: False is not true
```

The tolerance comparing eager and compiled results is too small, perhaps because of a Triton update that changed numerics:
```
Mismatched elements: 25 / 38597376 (0.0%)
Greatest absolute difference: 0.015625 at index (3771, 509) (up to 0.01 allowed)
Greatest relative difference: 9.375 at index (13687, 48) (up to 0.01 allowed)
```

Change the absolute tolerance from 0.01 to 0.02. Also switch to use `torch.testing.assert_close` which prints out the greatest absolute/relative difference like above when the assert fails.
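
For reference, a small worked example of the elementwise closeness criterion that `torch.allclose` and `torch.testing.assert_close` document, `|actual - expected| <= atol + rtol * |expected|`, written in plain Python. The helper name `is_close` and the sample values are made up for illustration (only the 0.015625 difference is taken from the output above):

```python
def is_close(actual, expected, atol, rtol):
    # Documented elementwise criterion: |actual - expected| <= atol + rtol * |expected|
    return abs(actual - expected) <= atol + rtol * abs(expected)


# Hypothetical element near zero with the reported absolute difference of 0.015625.
print(is_close(0.015625, 0.0, atol=1e-2, rtol=1e-2))  # False: bound is 0.01
print(is_close(0.015625, 0.0, atol=2e-2, rtol=1e-2))  # True: bound is 0.02

# The same absolute difference on a larger value already passes the old tolerance,
# because the rtol term widens the bound.
print(is_close(1.015625, 1.0, atol=1e-2, rtol=1e-2))  # True: bound is 0.02
```

This suggests the failing elements are ones with small magnitude, where the `rtol` term contributes little and `atol` dominates.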

`test_non_contiguous_input_mm_plus_mm` has a different problem, just switching to `torch.testing.assert_close` to be uniform with the other tests.

Test commands:
```
python test/inductor/test_max_autotune.py -k TestMaxAutotune.test_non_contiguous_input_mm

python test/inductor/test_max_autotune.py -k TestMaxAutotune.test_non_contiguous_input_addmm

python test/inductor/test_max_autotune.py -k TestMaxAutotune.test_non_contiguous_input_bmm
```
Internal stress tests pass now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131822
Approved by: https://github.com/shunting314
2024-07-26 17:11:35 +00:00
161bb67116 Revert "Fix static py::object dangling pointer with py::gil_safe_call_once_and_store (#130341)"
This reverts commit ace6decc9948e434dfe2e253bc28341bb22aa983.

Reverted https://github.com/pytorch/pytorch/pull/130341 on behalf of https://github.com/clee2000 due to unfortunately the internal pybind update got reverted cc @malfet ([comment](https://github.com/pytorch/pytorch/pull/130341#issuecomment-2253147079))
2024-07-26 17:02:56 +00:00
c382fc3fea [Reland] Fix vulkan builds with missing overrides errors (#131760)
Followup after https://github.com/pytorch/pytorch/pull/131524

Add note explaining why C10 macros should not be used in that header
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131760
Approved by: https://github.com/atalman
2024-07-26 17:01:51 +00:00
1a2edf6dca [AOTI] Fix _mm_plus_mm codegen (#131689)
Summary: Fixes https://github.com/pytorch/pytorch/issues/128474

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131689
Approved by: https://github.com/chenyang78
2024-07-26 16:50:12 +00:00
696e83a1da Revert "TCPStore: fix remote address (#131773)"
This reverts commit 9039131a89a5fdb8746bd86b0a4dd91559821e36.

Reverted https://github.com/pytorch/pytorch/pull/131773 on behalf of https://github.com/clee2000 due to broke internal builds D60265883, something about formatter ([comment](https://github.com/pytorch/pytorch/pull/131773#issuecomment-2253123800))
2024-07-26 16:47:57 +00:00
404a8ae8f6 [export] fix set_grad x tensor constant. (#131787)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/130379.

The original error is that the verifier finds that the placeholder nodes' meta["val"] are missing in the subgraph of the WrapSetGradEnabled hop.

In this PR, we fixed it by re-ordering replace_set_grad_with_hop_pass to run after the lift_constant_tensor pass, because the constant attrs only start to have meta["val"] after that pass has run.

Test Plan: buck2 test test:test_export -- -r "test_setgrad_lifted_tensor"

Differential Revision: D60244935

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131787
Approved by: https://github.com/yushangdi
2024-07-26 16:41:59 +00:00
bb64702eb3 Revert "[reland][inductor] switch AotCodeCompiler to new cpp_builder (#130127)"
This reverts commit 520182dbffe09943be74a8a9cd58618fc171738f.

Reverted https://github.com/pytorch/pytorch/pull/130127 on behalf of https://github.com/clee2000 due to broke internal tests D60265910 ([comment](https://github.com/pytorch/pytorch/pull/130127#issuecomment-2253113689))
2024-07-26 16:40:03 +00:00
d57de73fe0 AutoHeuristic: Add support for kernel choice selection (#131610)
This PR enables AutoHeuristic for kernel choice selection, where the feedback can not immediately be provided when AutoHeuristic is called, but only after autotuning has happened. The steps are the following:

When the AutoHeuristic constructor is called, AutoHeuristic registers a function in select_algorithm.py.
After autotuning in select_algorithm.py has happened, and there is an entry in autoheuristic_registry, select_algorithm provides the autotuning results to AutoHeuristic, which stores the results.
I enabled AutoHeuristic for mixed_mm to have an example to test it on. We probably want to add more context, and also add an augment_context function. I will add support for this in another PR.
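
The deferred-feedback flow can be sketched in plain Python roughly as follows. Everything here is a simplified stand-in: `AutoHeuristicSketch`, the `autotune` function, and the kernel names are hypothetical, and the dict only mimics the role of `autoheuristic_registry`:

```python
# Stand-in for the registry that lives in select_algorithm.py.
autoheuristic_registry = {}


class AutoHeuristicSketch:
    def __init__(self, name, choices):
        self.name = name
        self.choices = choices
        self.collected = []
        # Feedback is not available yet, so register a callback instead.
        autoheuristic_registry[name] = self._store_feedback

    def _store_feedback(self, timings):
        # Called later, once autotuning has produced timings per choice.
        for choice in self.choices:
            if choice in timings:
                self.collected.append((choice, timings[choice]))


def autotune(name, timings):
    # Stand-in for the autotuning step: pick the fastest choice, then hand
    # the results to whoever registered interest under this name.
    best = min(timings, key=timings.get)
    callback = autoheuristic_registry.get(name)
    if callback is not None:
        callback(timings)
    return best


heuristic = AutoHeuristicSketch("mixed_mm", ["triton_mm", "fallback_mixed_mm"])
best = autotune("mixed_mm", {"triton_mm": 0.8, "fallback_mixed_mm": 1.2})
```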

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131610
Approved by: https://github.com/eellison
2024-07-26 16:35:55 +00:00
a38890a53f Revert "[2/3] 3D Composability - move pp tests (#129801)"
This reverts commit 29571c5c06f6e5fd143d85c18d8a6b87d2e4e1d3.

Reverted https://github.com/pytorch/pytorch/pull/129801 on behalf of https://github.com/atalman due to Broke periodic CI: distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass4 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10083807511/job/27882848654) [HUD commit link](544f950d14) ([comment](https://github.com/pytorch/pytorch/pull/129801#issuecomment-2253099894))
2024-07-26 16:30:29 +00:00
13ab92b72d [dynamo][recompile-logs] Suggest force_parameter_static_shapes on the recompile log for parameter-related recomps (#131825)
Discovered in https://github.com/pytorch/pytorch/issues/121369

On the user-empathy-day model, the logs look like these
~~~
W0725 15:33:58.022000 1967777 torch/_dynamo/convert_frame.py:807] [0/8] torch._dynamo hit config.cache_size_limit (8)
W0725 15:33:58.022000 1967777 torch/_dynamo/convert_frame.py:807] [0/8]    function: 'auto_repeat_tensors_for_time' (/home/anijain/local/lumiere-pytorch/lumiere_pytorch/lumiere.py:545)
W0725 15:33:58.022000 1967777 torch/_dynamo/convert_frame.py:807] [0/8]    last reason: 0/0: len(L['args']) == 1
W0725 15:33:58.022000 1967777 torch/_dynamo/convert_frame.py:807] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0725 15:33:58.022000 1967777 torch/_dynamo/convert_frame.py:807] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
W0725 15:34:00.282000 1967777 torch/_dynamo/convert_frame.py:807] [11/8] torch._dynamo hit config.cache_size_limit (8)
W0725 15:34:00.282000 1967777 torch/_dynamo/convert_frame.py:807] [11/8]    function: 'forward' (/home/anijain/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/denoising_diffusion_pytorch/karras_unet.py:150)
W0725 15:34:00.282000 1967777 torch/_dynamo/convert_frame.py:807] [11/8]    last reason: 11/0: tensor 'L['x']' size mismatch at index 0. expected 16, actual 8
W0725 15:34:00.282000 1967777 torch/_dynamo/convert_frame.py:807] [11/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0725 15:34:00.282000 1967777 torch/_dynamo/convert_frame.py:807] [11/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
W0725 15:34:10.216000 1967777 torch/_dynamo/convert_frame.py:807] [40/8] torch._dynamo hit config.cache_size_limit (8)
W0725 15:34:10.216000 1967777 torch/_dynamo/convert_frame.py:807] [40/8]    function: 'normalize_weight' (/home/anijain/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/denoising_diffusion_pytorch/karras_unet.py:127)
W0725 15:34:10.216000 1967777 torch/_dynamo/convert_frame.py:807] [40/8]    last reason: 40/1: tensor 'L['weight']' size mismatch at index 0. expected 64, actual 16. Guard failed on a parameter, consider using torch._dynamo.config.force_parameter_static_shapes = False to allow dynamism on parameters.
W0725 15:34:10.216000 1967777 torch/_dynamo/convert_frame.py:807] [40/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0725 15:34:10.216000 1967777 torch/_dynamo/convert_frame.py:807] [40/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
W0725 15:34:11.643000 1967777 torch/_dynamo/convert_frame.py:807] [58/8] torch._dynamo hit config.cache_size_limit (8)
W0725 15:34:11.643000 1967777 torch/_dynamo/convert_frame.py:807] [58/8]    function: 'pack_one' (/home/anijain/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/denoising_diffusion_pytorch/karras_unet.py:38)
W0725 15:34:11.643000 1967777 torch/_dynamo/convert_frame.py:807] [58/8]    last reason: 58/1: tensor 'L['t']' stride mismatch at index 0. expected 32, actual 8. Guard failed on a parameter, consider using torch._dynamo.config.force_parameter_static_shapes = False to allow dynamism on parameters.
W0725 15:34:11.643000 1967777 torch/_dynamo/convert_frame.py:807] [58/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0725 15:34:11.643000 1967777 torch/_dynamo/convert_frame.py:807] [58/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
W0725 15:34:12.029000 1967777 torch/_dynamo/convert_frame.py:807] [62/8] torch._dynamo hit config.cache_size_limit (8)
W0725 15:34:12.029000 1967777 torch/_dynamo/convert_frame.py:807] [62/8]    function: 'torch_dynamo_resume_in_pack_at_70' (/home/anijain/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/einops-0.8.0-py3.10.egg/einops/packing.py:70)
W0725 15:34:12.029000 1967777 torch/_dynamo/convert_frame.py:807] [62/8]    last reason: 62/0: tensor 'L['tensors'][0]' size mismatch at index 0. expected 16, actual 32. Guard failed on a parameter, consider using torch._dynamo.config.force_parameter_static_shapes = False to allow dynamism on parameters.
W0725 15:34:12.029000 1967777 torch/_dynamo/convert_frame.py:807] [62/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0725 15:34:12.029000 1967777 torch/_dynamo/convert_frame.py:807] [62/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
W0725 15:34:12.357000 1967777 torch/_dynamo/convert_frame.py:807] [65/8] torch._dynamo hit config.cache_size_limit (8)
W0725 15:34:12.357000 1967777 torch/_dynamo/convert_frame.py:807] [65/8]    function: 'reshape' (/home/anijain/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/einops-0.8.0-py3.10.egg/einops/_backends.py:91)
W0725 15:34:12.357000 1967777 torch/_dynamo/convert_frame.py:807] [65/8]    last reason: 65/0: tensor 'L['x']' size mismatch at index 0. expected 32, actual 8. Guard failed on a parameter, consider using torch._dynamo.config.force_parameter_static_shapes = False to allow dynamism on parameters.
~~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131825
Approved by: https://github.com/ezyang
ghstack dependencies: #131795, #131801, #131804
2024-07-26 16:25:21 +00:00
7feaa73057 [export] Remove deprecated fields from ExportedProgram ctor. (#131697)
Summary: as title.

Test Plan: CI

Reviewed By: SherlockNoMad

Differential Revision: D60078426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131697
Approved by: https://github.com/ydwu4
2024-07-26 16:19:46 +00:00
546df5daf8 Revert "[3/3] 3D Composability - move tp dp tests (#129802)"
This reverts commit ec3829795dfb58a58ebc9ca241f7949efd60bfda.

Reverted https://github.com/pytorch/pytorch/pull/129802 on behalf of https://github.com/atalman due to Need to revert https://github.com/pytorch/pytorch/pull/129801 that got remerged ([comment](https://github.com/pytorch/pytorch/pull/129802#issuecomment-2253082995))
2024-07-26 16:19:25 +00:00
cyy
2988d33c80 [3/N] Fix clang-tidy warnings in jit (#131830)
Follows #131735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131830
Approved by: https://github.com/ezyang
2024-07-26 15:46:28 +00:00
5612408735 _get_operation_overload: dont raise exception when overload does not exist (#131554)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131554
Approved by: https://github.com/ezyang, https://github.com/zou3519
ghstack dependencies: #131403, #131482, #131665
2024-07-26 15:38:11 +00:00
eba2ffd278 [pt2e][quant] Ensure BN node is erased after convert (#131651)
Summary: Previously, when folding BN into conv, we rely on DCE
to clean up the unused BN node from the graph. This works if
the model is already in eval mode, but fails if the model is
still in train mode because DCE doesn't remove nodes with
potential side effects (in this case `_native_batch_norm_legit`).
This required users to move the model to eval mode before calling
convert in order to get a properly DCE'd graph.

To solve this, we manually erase the BN node after folding
instead of relying on DCE. This relaxes the ordering constraints
between `move_exported_model_to_eval` and `convert_pt2e`.

Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn1d.test_fold_bn_erases_bn_node
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn2d.test_fold_bn_erases_bn_node

Reviewers: jerryzh168, yushangdi

Subscribers: jerryzh168, yushangdi, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131651
Approved by: https://github.com/yushangdi
2024-07-26 15:30:45 +00:00
9440a4824d [CI][dashboard] Add a workflow to collect A10g perf (#131816)
Summary: This is experimental work. Depending on the performance stability and benchmark coverage on A10g, we may consider using A10g for manually-triggered per-PR performance comparison instead of exhausting expensive A100 instances.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131816
Approved by: https://github.com/huydhn
2024-07-26 14:36:14 +00:00
535c17efb3 [torch] Implement c10::BFloat16 ctor from __hip_bfloat16 (#131359)
Summary: Pretty straightforward. ROCm 6.2.0 changed the `__hip_bfloat16` API (see [this PR](481912a1fd)), so we gate impl on `__BF16_HOST_DEVICE__` macro to support older and newer versions of ROCm.

Test Plan: CI

Differential Revision: D60024830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131359
Approved by: https://github.com/houseroad
2024-07-26 14:30:49 +00:00
e4ace1a396 AOTDispatcher: properly bump version counter on input mutations in inference graphs (#131665)
This ensures that in an inference setting, we properly bump the VC of mutated graph inputs. Previously, we would only properly bump the VC for training graphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131665
Approved by: https://github.com/ezyang, https://github.com/zou3519
ghstack dependencies: #131403, #131482
2024-07-26 14:22:20 +00:00
5570a0da0a dont dispatch aten.conj(scalar_tensor) back to python (#131482)
https://github.com/pytorch/pytorch/issues/105290

The problem in the original flow is that:

(1) the user calls `torch.mul(complex_tensor, complex_scalar)`
(2) python arg parser wraps the complex scalar in a `scalar_tensor`, and dispatches to `aten.mul.Tensor(self, scalar_other)`
(3) autograd sees `aten.mul.Tensor`, calls `scalar_other.conj()` [here](https://github.com/pytorch/pytorch/blob/main/torch/csrc/autograd/FunctionsManual.cpp#L597)
(4) during proxy tensor tracing, this gets dispatched to `aten._conj(scalar_tensor)`
(5) when we hit __torch_dispatch__, the scalar_tensor is converted back into a plain python scalar
(6) we error during tracing, because in `FunctionalTensorMode.__torch_dispatch__` we try to redispatch on `aten._conj.default(plain_python_scalar)`, and this overload does not accept python scalars.

My attempted fix in this PR is to update `TensorBase::conj()` to check if the current tensor is a scalar tensor (wrapped number), and if so, manually:
(1) convert the scalar tensor back into a scalar
(2) call scalar.conj() directly
(3) convert the result back into a wrapped tensor

This avoids having to go through python entirely in the tracing case (which is fine, because these scalar tensors are constants that we can const-prop during tracing anyway).

Notably, I did **not** add e.g. a new `aten._conj.Scalar` overload. This would not actually fix the problem, since the bug is that we call `aten._conj.default(python_scalar)` directly. We would also need to muck with all `__torch_dispatch__` call sites to know to convert python scalars back into tensors directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131482
Approved by: https://github.com/zou3519, https://github.com/ezyang
ghstack dependencies: #131403
2024-07-26 14:22:20 +00:00
8bb9aa93a7 dynamo: mutations on .data should be invisible to autograd (#131403)
Fixes https://github.com/pytorch/pytorch/issues/121353

Our handling of `.data` in dynamo today basically just converts `y = x.data` into `y = x.detach()`. The semantics of these two ops are not quite the same, because:

(1) any future mutations on `x.data` will be fully ignored by autograd
(2) any mutations on `x.detach()` will bump x's version counter

The linked model does a .data mutation that is hidden from autograd in eager, but ends up erroring during AOTDispatcher tracing.

I updated dynamo's handling so that:

(1) when dynamo sees a call to `getattr(tensor, "data")` and calls `.detach()` we set a flag on the returned `TensorVariable` indicating it came from `.data`

(2) on any tensor method that we call with an input `TensorVariable` with this flag turned on, we proxy autograd's `preserve_version_counter` logic into the graph, to properly reset the VC after the op is run.

One thing to note is that I don't actually do this on every op that we pass the tensor to: I only do it for tensor methods that appear to be mutations (by checking for a trailing underscore). My thought was that:

(1) I didn't want to do this for **every** op that you pass `y` into, since that will e.g. triple the number of nodes in the graph, and could cause compile time regressions if you use .data

(2) this situation is pretty rare in general, and I'm hoping that "tensor method mutations" cover most reasonable mutation cases. If we manage to miss a case, you will get a loud error during tracing anyway, so there is not a safety issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131403
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2024-07-26 14:22:20 +00:00
7339c8ab28 Revert "immutable accessors in graph signature (#131807)"
This reverts commit 6fd28fc228f900863d63b1c83912dcc000b084e3.

Reverted https://github.com/pytorch/pytorch/pull/131807 on behalf of https://github.com/atalman due to Broke CI: [GH job link](https://github.com/pytorch/pytorch/actions/runs/10111847569/job/27965364355) [HUD commit link](608057afe2) ([comment](https://github.com/pytorch/pytorch/pull/131807#issuecomment-2252875417))
2024-07-26 14:21:12 +00:00
e76e566cfb [Dynamo] Support zip_longest (#131497)
Fixes #121348

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131497
Approved by: https://github.com/mlazos, https://github.com/jansel, https://github.com/zou3519
2024-07-26 14:06:10 +00:00
c9888c2739 Revert "[BE] typing for decorators - optim/optimizer (#131583)"
This reverts commit a1dad77dfa4e244a867ca7c73e9f6b6fe36a1340.

Reverted https://github.com/pytorch/pytorch/pull/131583 on behalf of https://github.com/atalman due to Breaks CI: [GH job link](https://github.com/pytorch/pytorch/actions/runs/10105959146/job/27947741162) [HUD commit link](a1dad77dfa) ([comment](https://github.com/pytorch/pytorch/pull/131583#issuecomment-2252784280))
2024-07-26 13:41:22 +00:00
7ee6831ae8 Revert "Fix vulkan builds with missing overrides errors (#131760)"
This reverts commit 7260eaeca056ffa013de769c10a2bfce9505d937.

Reverted https://github.com/pytorch/pytorch/pull/131760 on behalf of https://github.com/malfet due to Does not work with internal builds ([comment](https://github.com/pytorch/pytorch/pull/131760#issuecomment-2252783645))
2024-07-26 13:38:28 +00:00
d3e932dc10 [CI] Add inductor cpu accuracy test running on AVX2 runners (#128682)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128682
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-07-26 13:24:41 +00:00
e73fa28ec8 [CI] Fix arm64 docker build arch (#131869)
Attempt to fix arm64 docker build arch on https://github.com/pytorch/pytorch/pull/131855
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131869
Approved by: https://github.com/desertfire
2024-07-26 13:19:36 +00:00
608057afe2 [inductor] Fix duplicated range tree codegen in split scan (#131669)
Looks like in the halide codegen refactor, the range tree codegen was
split out from initialize_range_tree into its own function, but
triton_split_scan.py wasn't updated to reflect this change.

The result was the codegen gets invoked twice which is benign but makes
the kernel harder to read.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131669
Approved by: https://github.com/Chillee
2024-07-26 13:11:26 +00:00
945946e817 [AOTI] Fix another ABI-compatible CPU issue (#131798)
Summary: This problem is seen on AOTI CPU dashboard runs, a cpp compilation error because ConstantHandle::get doesn't exist. This PR adds ConstantHandle::get so that the interface is consistent with RAIIAtenTensorHandle.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131798
Approved by: https://github.com/zou3519, https://github.com/chenyang78
ghstack dependencies: #131791
2024-07-26 11:27:58 +00:00
7d282d8755 [dynamo] add lazy IteratorVariable implementations for map and zip (#131413)
Fixes https://github.com/pytorch/pytorch/issues/130750.

Repro of lazy/eager `map` discrepancy without `islice`:
```python
    def fn(a, b):
        y = 1

        def f(x):
            nonlocal y
            y += 1
            return x

        l = list(zip([a, b], map(f, [1, 2, 3, 4])))
        return a + y
```

The major change is that we implement `MapVariable` and `ZipVariable` based on `IteratorVariable`. Before, `map` and `zip` were being traced by immediately unpacking the result as a `TupleVariable`, which is wrong in cases such as the example above.
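
The discrepancy can be reproduced in plain Python without dynamo. The helper below is only an illustration (the name `count_calls` and the string inputs are made up); it counts how often `f` runs when the `map` is consumed lazily by `zip` versus unpacked into a tuple first:

```python
def count_calls(eager_map):
    calls = 0

    def f(x):
        nonlocal calls
        calls += 1
        return x

    mapped = map(f, [1, 2, 3, 4])
    if eager_map:
        # Mimics tracing `map` by immediately unpacking it into a tuple.
        mapped = tuple(mapped)

    # zip stops as soon as its first (shorter) iterable is exhausted.
    pairs = list(zip(["a", "b"], mapped))
    return calls, pairs


print(count_calls(eager_map=False))  # (2, [('a', 1), ('b', 2)])
print(count_calls(eager_map=True))   # (4, [('a', 1), ('b', 2)])
```

The zipped result is identical in both cases, but the side effect differs: in the repro above, `y` ends up as 3 in eager Python, whereas unpacking the `map` up front would leave it at 5.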

`MapVariable`s are not allowed to be unpacked while `ZipVariable`s can only be unpacked if all of its iterables can also be unpacked.

We also add new `[has_]force_unpack_var_sequence` methods to `VariableTracker` for the case where it is safe to unpack the entire sequence lazily, e.g., when building a list from a map (i.e. `list(map(f, ...))`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131413
Approved by: https://github.com/anijain2305
2024-07-26 10:47:38 +00:00
115994fea2 [aotd] Align partitioner graph output type to tuple (#131759)
Brian debugged the difference in output type between inference and training graphs.
The partitioner sometimes returns a list output type.

After this PR it will always return a tuple.

New graphs in tests may land between the time this PR's CI jobs finish and the time it lands.
This can easily be fixed with a fast-forward fix using:
```
EXPECTTEST_ACCEPT=1 python test/test.py
```

Adding ciflows/periodic to minimize this probability

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131759
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2024-07-26 09:46:29 +00:00
1e24f7875e [AOTI] Fix ABI-compatible mode link issue for CPU (#131791)
Summary: Found this "cannot find -ltorch: No such file or directory" issue when collecting AOTI CPU perf for the dashboard. Debugging on the CI machine revealed two problems: 1) no valid VEC_ISA was picked; 2) when 1 happens, libtorch path is not specified in the linker path.

This PR fixes the second problem. A later PR will fix the first problem, but somehow finding the right VEC_ISA causes a performance regression, which needs more investigation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131791
Approved by: https://github.com/zou3519, https://github.com/chenyang78
2024-07-26 09:02:13 +00:00
6fd28fc228 immutable accessors in graph signature (#131807)
Test Plan: existing tests

Differential Revision: D60253955

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131807
Approved by: https://github.com/ydwu4
2024-07-26 08:56:19 +00:00
bceb91222c Fix meta error in _convert_weight_to_int4pack (#130915)
This PR is to fix meta error in _convert_weight_to_int4pack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130915
Approved by: https://github.com/jerryzh168
2024-07-26 08:36:30 +00:00
2bf649f5ae suggested fix for data-dependent error (#125378)
Suggests fixes for data-dependent errors in non-strict export.

Any data-dependent error has an unresolved condition on unbacked symints. A mechanizable strategy for fixing such errors, which this PR enables, is to "bash" them using `torch._check()`s. For each error we suggest using `torch._check()` on the condition or its negation. The user selects and copy-pastes the suggested fix and continues.

For example, here's an existing data-dependent error message with the suffix following `<snip>...</snip>` added by this PR:
```
Could not guard on data-dependent expression Eq(u2, u1) (unhinted: Eq(u2, u1)).  (Size-like symbols: u1)

<snip>...</snip>

User code:
  File "test/export/test_export.py", line 1944, in forward
    return r.view(items[0], items[2])

Suggested fixes (please choose one of the following):
  1. torch._check(items[2] == r.shape[1])
  2. torch._check(items[2] != r.shape[1])
```

Tests in this PR illustrate this workflow, by taking common examples of data-dependent errors and bashing them until success, purely based on suggested fixes. In particular, we test this workflow on the "puzzlers" in https://www.internalfb.com/intern/anp/view/?id=5330476 (thanks @ezyang).

In terms of implementation, we focus on non-strict mode, where we can intercept torch function calls to install a handler that walks up the stack from the error, finding the closest non-torch frame and inspecting its locals for symints appearing in the error. The suggested fixes then access these symints through the local variables so that they can be (a) easily understood by the user (b) directly added to the code.
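
The stack-walking idea can be sketched in plain Python. This is an illustration under stated assumptions, not the PR's implementation: `Sym`, `suggest_fixes`, `is_library_frame`, and the `_lib_` naming convention are all hypothetical stand-ins for unbacked symints, the handler, and the "is this a torch frame" test.

```python
import sys


class Sym:
    """Hypothetical stand-in for an unbacked symint such as u1 or u2."""

    def __init__(self, name):
        self.name = name


def is_library_frame(frame):
    # Assumption for this sketch: library functions are prefixed "_lib_".
    return frame.f_code.co_name.startswith("_lib_")


def suggest_fixes(lhs, rhs):
    # Walk up from the caller to the closest non-library frame.
    frame = sys._getframe(1)
    while frame is not None and is_library_frame(frame):
        frame = frame.f_back
    # Map each symbol to the local variable that holds it in that frame.
    names = {}
    if frame is not None:
        for var, value in frame.f_locals.items():
            if isinstance(value, Sym):
                names[value.name] = var
    left = names.get(lhs.name, lhs.name)
    right = names.get(rhs.name, rhs.name)
    # Suggest both the condition and its negation, phrased via user locals.
    return [
        f"torch._check({left} == {right})",
        f"torch._check({left} != {right})",
    ]


def _lib_view(a, b):
    # Pretend the data-dependent error is raised here, inside the library.
    return suggest_fixes(a, b)


def user_forward():
    rows = Sym("u1")
    cols = Sym("u2")
    return _lib_view(cols, rows)


print(user_forward())
```

Phrasing the suggestions in terms of `rows` and `cols` rather than `u1` and `u2` is what lets the user paste them directly into their own code.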

Implementing this idea in strict mode is follow-up work—we have already investigated what it would take, and decided to separate it out of this PR for reasons described next.

It's not too hard to map symints to locals in Dynamo (although it needs to happen elsewhere, i.e., intercepting torch function calls won't work). However, unfortunately this doesn't seem to be enough; the graph modules created by Dynamo when going through AOTAutograd can raise further data-dependent errors in some cases, and thus we need yet another mechanism to map symints to locals for graph modules, via captured source-level metadata and FX node walking. This latter component will require some care to build properly, or we might conclude it is altogether unnecessary and fix Dynamo instead.

Differential Revision: D56867432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125378
Approved by: https://github.com/ezyang
2024-07-26 08:34:50 +00:00
fb3ddafbcf [inductor] Add type hints to functions in mkldnn_fusion.py (#131820)
Summary: ATT

Test Plan: lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131820
Approved by: https://github.com/eellison
2024-07-26 08:11:34 +00:00
13e806a591 [NestedTensor] Add support for transposed NestedTensors where ragged_idx > 1 for sum and mean operators (#131517)
Add support for transposed, non-contiguous `NestedTensor`s, where `ragged_idx > 1`, for the aten operators `sum` and `mean`. This diff enables reducing along the jagged dimension for non-contiguous `NestedTensor`s, transposed between non-batch dimensions as well as between a ragged and a non-batch dimension. For example, users can now reduce a `NestedTensor` of shape `(B, M, *, N)` along `*` or `(B, N, M, *)` along `*`.

Parametrize existing unit tests and add new unit tests verifying the accuracy of implementations on `NestedTensor`s that transpose between 2 non-batch dimensions as well as between a ragged and a non-batch dimension.

Differential Revision: [D59847927](https://our.internmc.facebook.com/intern/diff/D59847927/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131517
Approved by: https://github.com/davidberard98
2024-07-26 07:21:32 +00:00
63374dda69 [BE][Easy] explicitly define global constants in torch.testing._internal.common_utils (#129826)
This appeases IDE warnings like "torch.testing._internal.common_utils has no member TEST_WITH_ROCM".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129826
Approved by: https://github.com/Skylion007
2024-07-26 06:32:08 +00:00
aebfd3d4de [CUDAGraph] skip cudagraph if too many distinct sizes (#131387)
Current implementation records a new cudagraph for every distinct input size. This leads to significant overhead if there are too many distinct input sizes.

While we currently hint re-recording cudagraph from dynamic shapes, it is at [info level](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/cudagraph_trees.py#L363-L366) which is easy to overlook and leads to several issues, such as Issue #119640 and Issue #128424.

This PR counts the cudagraphs recorded due to dynamic shapes and warns loudly if that count exceeds the threshold `cudagraph_dynamic_shape_limit` (=50).
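
The thresholding logic can be pictured roughly as follows. This is a toy sketch, not the code in `cudagraph_trees.py`: the class `DistinctShapeTracker` and its method are hypothetical, and the choice to stop recording once the limit is exceeded is an assumption based on the PR title.

```python
import warnings


class DistinctShapeTracker:
    """Hypothetical helper that counts distinct input shapes."""

    def __init__(self, limit=50):
        self.limit = limit
        self.seen = set()
        self.warned = False

    def should_record(self, shape):
        self.seen.add(tuple(shape))
        if len(self.seen) <= self.limit:
            return True
        # Over the limit: warn once (loudly) and ask the caller to skip.
        if not self.warned:
            warnings.warn(
                f"{len(self.seen)} distinct input shapes exceed the limit "
                f"of {self.limit}; skipping cudagraph recording"
            )
            self.warned = True
        return False
```

Warning once at `warnings.warn` level, rather than logging at info level on every re-record, is the point: the message is hard to miss but does not flood the output.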

Fixes #119640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131387
Approved by: https://github.com/eellison
2024-07-26 06:17:35 +00:00
16d7cb5049 [CUDAGraph] Type annotation for cudagraph_trees.py (#131621)
As a Better Engineering effort, this PR adds type annotations to `cudagraph_trees.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131621
Approved by: https://github.com/eellison
2024-07-26 06:14:06 +00:00
dfba85c26b Update torch-xpu-ops pin (ATen XPU implementation) (#131643)
# Motivation
Regular update.
1. Some new ATen ops support
2. ABI=0 build support
3. Remove dispatched implementation of pin_memory&is_pinned
4. Enhance deterministic usage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131643
Approved by: https://github.com/EikanWang
2024-07-26 05:51:58 +00:00
baa93e160f [MPS] Add native implementation for shift ops (#131813)
Similar to how AND/OR/XOR ops are implemented

TODO: Consider using MPS method calls rather than metal kernels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131813
Approved by: https://github.com/manuelcandales
2024-07-26 05:01:20 +00:00
a1dad77dfa [BE] typing for decorators - optim/optimizer (#131583)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131583
Approved by: https://github.com/janeyx99
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577, #131578, #131579, #131580, #131581, #131582
2024-07-26 05:00:07 +00:00
8689d377f9 [BE] typing for decorators - signal/windows/windows (#131582)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131582
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577, #131578, #131579, #131580, #131581
2024-07-26 05:00:07 +00:00
dbf7c318b2 [BE] typing for decorators - _refs/nn/functional (#131581)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131581
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577, #131578, #131579, #131580
2024-07-26 05:00:03 +00:00
81c26ba5ae [BE] typing for decorators - utils/flop_counter (#131580)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131580
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577, #131578, #131579
2024-07-26 04:59:58 +00:00
33069630ce [inductor] Add type hints to functions in decompositions.py (#131780)
Summary: ATT

Test Plan: lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131780
Approved by: https://github.com/eellison
2024-07-26 04:50:23 +00:00
5b05ad9697 fix non-persistent buffers (#131756)
Summary:
Dynamo doesn't track whether buffers are `persistent`. This led to some ugly code where we would mark buffers as always persistent when creating signatures, then later check whether the buffers were not in the state dict to infer whether they were non-persistent, and use this to fix up the signature.

This PR instead defines a utility to look up all the non-persistent buffers registered inside a module (this information is recorded in a private `_non_persistent_buffers_set` module attribute), and uses it to (a) correctly set the persistent flag on buffers when creating signatures (b) transfer this information to a Dynamo-traced graph module, which then causes non-persistent buffers to (correctly) not show up in the state dict.

Test Plan: existing tests + new case with non-persistent buffer in nested module

Differential Revision: D60224656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131756
Approved by: https://github.com/zhxchen17, https://github.com/ydwu4
2024-07-26 04:45:30 +00:00
a617919541 [dynamo] Do not guard on keys for _forward_hooks and _forward_pre_hooks (#131682)
Fixes https://github.com/pytorch/pytorch/issues/125836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131682
Approved by: https://github.com/bdhirsh
2024-07-26 04:39:54 +00:00
3d7c424a75 [inductor] update users to buffers instead of scheduler nodes (#131796)
After a recent refactoring of inductor, `.users` are now associated with buffers instead of scheduler nodes.

In `debug.py`, one such usage of `.users` is not updated accordingly, and the change here fixes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131796
Approved by: https://github.com/yf225
2024-07-26 03:34:26 +00:00
6dbf343936 Fix aten implementation for low memory max_pool2d (#131717)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131717
Approved by: https://github.com/peterbell10
2024-07-26 03:23:16 +00:00
c2f3266c8e Not remove collective ops in dce since they have side-effect (#131023)
Fixes #130918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131023
Approved by: https://github.com/yf225
2024-07-26 03:03:32 +00:00
e0d3e4a498 remove unused code for XPU (#131856)
# Motivation
This PR aims to remove unused code in PyTorch for XPU, following https://github.com/pytorch/pytorch/pull/128179
CI will be blocked without this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131856
Approved by: https://github.com/EikanWang
2024-07-26 02:57:12 +00:00
236d055330 [Traceable FSDP2] Add partial-graph (graph-break) unit tests (#131747)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131747
Approved by: https://github.com/bdhirsh
2024-07-26 02:51:57 +00:00
03f49c9523 Revert "[CUDAGraph] Type annotation for cudagraph_trees.py (#131621)"
This reverts commit 16699c7d848fca669865d83ffff205bcbb8665be.

Reverted https://github.com/pytorch/pytorch/pull/131621 on behalf of https://github.com/atalman due to lint is failing, please rebase fix lint and reland ([comment](https://github.com/pytorch/pytorch/pull/131621#issuecomment-2251831163))
2024-07-26 02:08:45 +00:00
16699c7d84 [CUDAGraph] Type annotation for cudagraph_trees.py (#131621)
As a Better Engineering effort, this PR adds type annotations to `cudagraph_trees.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131621
Approved by: https://github.com/eellison
2024-07-26 01:40:23 +00:00
2ff98bc57f [inductor][autotune_at_compile_time] fix some codegen-ing for standalone autotuning file (#131726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131726
Approved by: https://github.com/desertfire
ghstack dependencies: #131253
2024-07-26 00:58:04 +00:00
b343644f3a Revert "MTIA equivalent of torch.cuda.memory_stats (#131673)"
This reverts commit 513ce5f69a7f53742b7aa5798082dd158beec2ed.

Reverted https://github.com/pytorch/pytorch/pull/131673 on behalf of https://github.com/clee2000 due to linked internal diff has internal changes, not sure what happened here, but this shouldn't have been merged externally without also merging the internal diff ([comment](https://github.com/pytorch/pytorch/pull/131673#issuecomment-2251749644))
2024-07-26 00:54:37 +00:00
b893a57f96 [Dynamo] Fix guard_on_nn_modules unit tests discrepancy between OSS and fbcode (#131810)
Fixes Meta internal task: [T195592220](https://www.internalfb.com/intern/tasks/?t=195592220)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131810
Approved by: https://github.com/zou3519
2024-07-26 00:24:46 +00:00
246e32055a [benchmark] Add hf_T5_generate to inline_inbuilt_nn_modules (#131804)
Fixes https://github.com/pytorch/pytorch/issues/121989

We are turning on the flag by default in another PR. But that PR can go
through reverts. So, forcibly adding the benchmark to prevent dashboard
fluctuation in case of reverts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131804
Approved by: https://github.com/yanboliang, https://github.com/shunting314
ghstack dependencies: #131795, #131801
2024-07-26 00:20:42 +00:00
c92f2a19a4 [BE] Use assertEqual in MultiKernel tests (#127725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127725
Approved by: https://github.com/lezcano
ghstack dependencies: #131044, #127724
2024-07-26 00:12:43 +00:00
9ae288f4be [inductor] Simplify multi-kernel codegen by unifying kernel args (#127724)
Persistent kernels are sometimes able to remove intermediate buffers that would
otherwise be needed for the non-persistent reduction kernel. This makes
multi kernel's codegen more complicated as it needs to drop these extra
arguments at runtime after selecting the correct kernel to run.

Instead, this PR updates the persistent kernel's `must_keep_buffers` so these
aren't dropped during codegen so both kernels have the same signature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127724
Approved by: https://github.com/shunting314
ghstack dependencies: #131044
2024-07-26 00:12:43 +00:00
14920c149b Revert "[dynamo] Turn on inline_inbuilt_nn_modules (#131275)"
This reverts commit 0455344777f354dcbbd8e661a46ca2ca20e8a913.

Reverted https://github.com/pytorch/pytorch/pull/131275 on behalf of https://github.com/clee2000 due to I think this broke inductor/test_cpu_select_algorithm.py::TestSelectAlgorithmDynamicShapesCPU::test_quantized_linear_amx_dynamic_shapes_batch_size_16_in_features_4_out_features_64_bias_True_cpu [GH job link](https://github.com/pytorch/pytorch/actions/runs/10102272826/job/27938970118) [HUD commit link](0455344777) not run on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/131275#issuecomment-2251609554))
2024-07-26 00:12:40 +00:00
adbe4f5ecf TCPStore: add better logging on wait timeout (#131808)
This makes TCPStore `wait` timeout print actually useful info instead of a generic `Socket Timeout` message on timeout.

Bonus:

* Fixed weirdness where `connect_timeout` only supported seconds, unlike the rest of our timeouts (thus the minimum timeout was 1s)
* Fixed tests that used a 10s timeout (test_store now only takes 20s instead of 40s)

Ex:

```
DistStoreError: wait timeout after 100ms, keys: /the_key
```

Test plan:

```
python test/distributed/test_store.py
python test/distributed/test_c10d_gloo.py -v -k timeout
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131808
Approved by: https://github.com/kurman
2024-07-25 23:54:41 +00:00
e9443860e7 add python binding for _get_current_graph_task_keep_graph (#131038)
Inductor would like a way to have activations that do not escape the backward graph marked as "donated", so we can re-use their memory during memory planning here: https://github.com/pytorch/pytorch/pull/130580

For this to be safe though, we need to know at runtime that autograd does not plan to retain the current autograd graph (either for another call to .backward() later, or if double backward is being used). In the linked PR, the current plan is to error when we detect this situation, and ask the user to turn off the donated buffer config (although if/once we get to the point of always delaying backward compilation to runtime, we can just wait until we know the runtime value to compile).

There isn't a way to know if the currently running backward is run with `retain_graph=True` from python - @soulitzer helped me figure out where to grab it so I added a python binding for it under `ctx.is_retain_graph()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131038
Approved by: https://github.com/soulitzer
2024-07-25 23:50:40 +00:00
cyy
eac83479cc Enable Wunused-function and Wunused-result globally (#131596)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131596
Approved by: https://github.com/zou3519
2024-07-25 23:50:12 +00:00
2a4ca5ccc4 [dynamo] Pop the exception stack on handling the StopIteration natively (#131801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131801
Approved by: https://github.com/yanboliang
ghstack dependencies: #131795
2024-07-25 23:33:19 +00:00
11673851d9 [dynamo][exception][bugfix] Add a pop for < 3.11 version (#131795)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131795
Approved by: https://github.com/yanboliang
2024-07-25 23:33:19 +00:00
f885a70fab [inductor][autotune_at_compile_time] support Triton kernel with sympy fn str arg (#131253)
## What is sympy fn str arg?
It's a string such as `sqrt` which also happens to be a real sympy function (e.g. `sympy.sqrt`)

## Crash

```
torch/_inductor/sizevars.py", line 468, in symbolic_hint
    expr = self.simplify(expr)        # where expr is 'sqrt'
torch/_inductor/sizevars.py", line 66, in simplify
    return sympy.expand(expr).xreplace(self.replacements)
sympy/core/function.py", line 2816, in expand
    return sympify(e).expand(deep=deep, modulus=modulus, **hints)
AttributeError: 'function' object has no attribute 'expand'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131253
Approved by: https://github.com/desertfire
2024-07-25 23:31:20 +00:00
b4b62d3945 update to 2.5.8 (#131684)
# Summary
This stack brings the current fork of FAv2 near the top of main which is 2.6.2

Notably we need to update cutlass to 3.5.0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131684
Approved by: https://github.com/jainapurva
2024-07-25 23:15:03 +00:00
51f4f87718 [Reland] Ensure staticmethods can be allowed in graph (#131789)
Fixes https://github.com/pytorch/pytorch/issues/124735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131789
Approved by: https://github.com/anijain2305
2024-07-25 22:54:18 +00:00
4de85e3c30 [DeviceMesh] Remove _parent_mesh as an attribute from DeviceMesh and remove it from DeviceMesh's hash (#131636)
We recently revisited the hash implementation and think `_parent_mesh` information should not be burned into DeviceMesh but rather be inferred from the MeshEnv which manages device meshes.

Because `mesh_dim_names` is considered in the device mesh's hash, this should not affect the issue brought up in https://github.com/pytorch/pytorch/issues/121799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131636
Approved by: https://github.com/wanchaol
2024-07-25 22:47:22 +00:00
79f0c4dc04 [BE] typing for decorators - fx/experimental/graph_gradual_typechecker (#131579)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131579
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577, #131578
2024-07-25 22:24:19 +00:00
c65b197b85 [BE] typing for decorators - _library/custom_ops (#131578)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131578
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577
2024-07-25 22:24:19 +00:00
5ee6a6dacc [BE] typing for decorators - ao/quantization/quantizer/xnnpack_quantizer_utils (#131577)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131577
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576
2024-07-25 22:24:19 +00:00
37d76c7d48 [BE] typing for decorators - fx/experimental/migrate_gradual_types/constraint_generator (#131576)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131576
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575
2024-07-25 22:24:19 +00:00
42dc5a47a1 [BE] typing for decorators - _inductor/fx_passes/post_grad (#131575)
See #131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131575
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574
2024-07-25 22:24:19 +00:00
b2cbcf710b [BE] typing for decorators - _inductor/lowering (#131574)
See #131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131574
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573
2024-07-25 22:24:19 +00:00
f0f20f7e97 [BE] typing for decorators - _jit_internal (#131573)
See #131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131573
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572
2024-07-25 22:24:19 +00:00
bfe0079b72 [BE] typing for decorators - _meta_registrations (#131572)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131572
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571
2024-07-25 22:24:19 +00:00
4b985e6f80 [BE] typing for decorators - distributed/_tensor/ops/utils (#131571)
See #131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131571
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570
2024-07-25 22:24:19 +00:00
5731b486c8 [BE] typing for decorators - library (#131570)
See #131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131570
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569
2024-07-25 22:24:19 +00:00
aa58af8b43 [BE] typing for decorators - masked/_ops (#131569)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131569
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568
2024-07-25 22:24:19 +00:00
193f62fde9 [BE] typing for decorators - fx/_compatibility (#131568)
See #131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131568
Approved by: https://github.com/justinchuby, https://github.com/oulgen, https://github.com/zou3519
2024-07-25 22:24:19 +00:00
709ddf7a9d Add wrappers for synchronous GPUDirect Storage APIs (#130633)
Based in part on https://github.com/NVIDIA/apex/pull/1774

Differential Revision: [D60155434](https://our.internmc.facebook.com/intern/diff/D60155434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-25 22:23:38 +00:00
0455344777 [dynamo] Turn on inline_inbuilt_nn_modules (#131275)
Known issues that are deliberately kept open and will be fixed later are tracked here - https://github.com/pytorch/pytorch/issues/131696

Training dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/anijain2305/435/head&lCommit=408b9358b8fca3a5d08b39741419fe8a596941aa&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))

![image](https://github.com/user-attachments/assets/08ef081c-37d7-436d-905b-4b9e2b470644)

Inference dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=gh/anijain2305/435/head&lCommit=914244fa2fe0055917e039e35183b21fa90afdc6&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))
![image](https://github.com/user-attachments/assets/32136eff-a39e-4cde-a438-e51a665bc3c9)

Inference sees a little bit more perf degradation but we are ok with that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131275
Approved by: https://github.com/ezyang
ghstack dependencies: #131744
2024-07-25 22:14:17 +00:00
513ce5f69a MTIA equivalent of torch.cuda.memory_stats (#131673)
Summary: Adding MTIA equivalent of `torch.cuda.memory_stats`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131673
Approved by: https://github.com/egienvalue
2024-07-25 21:59:59 +00:00
9039131a89 TCPStore: fix remote address (#131773)
This fixes corrupt remote address logs caused by dangling pointers to addrinfo_storage inside of addrinfo.

Test plan:

Enable debug logs and verify addresses are correct

```
TORCH_CPP_LOG_LEVEL=INFO TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 TORCH_DISTRIBUTED_DEBUG=DETAIL LOGLEVEL=INFO python test/distributed/test_store.py -v
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131773
Approved by: https://github.com/kurman
2024-07-25 21:55:25 +00:00
520182dbff [reland][inductor] switch AotCodeCompiler to new cpp_builder (#130127)
Changes:
1. Switch `AotCodeCompiler` to new cpp_builder.
2. Only use `deprecated_cpp_compile_command` for `fb_code`, because I cannot debug further without access to the Meta internal environment.
3. Add `TODO` comments so that a Meta employee can help continue this work.
4. Because of item 3, only the `fb_code` use of `deprecated_cpp_compile_command` remains to be fixed, so remove `validate_new_cpp_commands`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-25 21:45:40 +00:00
a34692c0a3 [Inductor] Added and_masks and or_masks utilities & make fully masked out rows 0 instead of nan (#131552)
Combine #131073 and #131012 and fix doc building failures.

Co-authored-by: chilli <chilli@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131552
Approved by: https://github.com/Chillee
2024-07-25 21:29:46 +00:00
89bdd9c18f [kineto] populate src/dst rank for p2p (#130812)
Summary:
as title
populate src/dst rank (global rank) for p2p kernel

Differential Revision: D59794535

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130812
Approved by: https://github.com/aaronenyeshi
2024-07-25 21:10:57 +00:00
1c58aacbc8 [dtensor] move ops to private (#131211)
as titled

Differential Revision: [D60132519](https://our.internmc.facebook.com/intern/diff/D60132519)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131211
Approved by: https://github.com/XilunWu, https://github.com/wz337
ghstack dependencies: #131212
2024-07-25 20:59:55 +00:00
605dfd8fb4 Switch sync_distributed_folder to use non-reverse order (#131683)
`git` on GHA seems to use the reverse commit ordering that I see locally O_o
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131683
Approved by: https://github.com/seemethere
2024-07-25 20:44:23 +00:00
fe2e6f0c51 Revert "[reland][inductor] switch AotCodeCompiler to new cpp_builder (#130127)"
This reverts commit dfc9bfc8839ea3a0ffe933a64cd129fab5e4da75.

Reverted https://github.com/pytorch/pytorch/pull/130127 on behalf of https://github.com/atalman due to Breaks CI test_dataloader.py::TestDataLoader::test_segfault [GH job link](https://github.com/pytorch/pytorch/actions/runs/10099725941/job/27930133346) [HUD commit link](2c1851f04e) ([comment](https://github.com/pytorch/pytorch/pull/130127#issuecomment-2251360224))
2024-07-25 20:44:04 +00:00
1ad4e6f228 Refactor cudagraphs to use serializable placeholder info (#130252)
This PR refactors placeholders in cudagraphs to be serializable. We define a new PlaceholderInfo object which only has the necessary parts of placeholders for logging/debugging, and use that instead of `torch.fx.Node` directly. This allows us to then save PlaceholderInfo into the FXGraphCache/AOTAutogradCache later.
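
As a rough illustration of the idea, the sketch below copies only plain-data fields off a node-like object so the result can be serialized. The names `PlaceholderSummary`, `FakeNode`, `summarize`, `to_json`, and `from_json`, and the choice of fields, are assumptions made for this example; they are not the actual `PlaceholderInfo` definition, and JSON is used here only because it is in the standard library.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class PlaceholderSummary:
    """Hypothetical plain-data view of a placeholder (fields are assumed)."""

    name: str
    stack_trace: Optional[str]


class FakeNode:
    """Stand-in for a graph node that drags along unserializable state."""

    def __init__(self, name, stack_trace):
        self.name = name
        self.stack_trace = stack_trace
        self.graph = object()  # not JSON-serializable


def summarize(node):
    # Keep only what logging/debugging needs; drop the graph reference.
    return PlaceholderSummary(node.name, getattr(node, "stack_trace", None))


def to_json(summary):
    return json.dumps(asdict(summary))


def from_json(payload):
    return PlaceholderSummary(**json.loads(payload))
```

Because the summary holds no reference back to the graph, it can be stored in a cache entry and reloaded in a process that never built the original graph.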

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130252
Approved by: https://github.com/eellison, https://github.com/masnesral
ghstack dependencies: #129384
2024-07-25 20:39:37 +00:00
eqy
69d63b2318 [CUDA][Pooling] Clean up unused accscalar_t in maxpool2d forward (#131728)
maxpool forward doesn't actually do any accumulation and the second template param was just a dupe of the first

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131728
Approved by: https://github.com/mikaylagawarecki
2024-07-25 20:32:42 +00:00
fdc4d6fe96 [inductor] Refactor fusion of inplace operations (#130835)
Resubmit of #128979

`WeakDep`s force readers to have completed before a mutation overwrites the
buffer, but we want to allow fusions to occur for inplace mutations where the
same index is read and written.

Currently this is achieved by:
1. Identifying the buffers used by the mutating op in its `dep_closure`
2. Not creating `WeakDep`s for buffers in the `dep_closure`
3. Fixing up any bad fusions that might occur by an extra check in `can_fuse_vertical`

So we are first over-aggressive in removing `WeakDep`s, then add an ad-hoc fixup.

This PR instead emits all `WeakDep`s and adds a `fusable_weak_dep` check to
`can_fuse_vertical` which selectively allows inplace operation to fuse.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130835
Approved by: https://github.com/lezcano
2024-07-25 20:29:01 +00:00
61d7bb3e79 Migrate trunk workflows to Amazon2023 ami (#131677)
A continuation of the migration started in
- https://github.com/pytorch/pytorch/pull/131250

All migrated trunk jobs passed successfully
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131677
Approved by: https://github.com/malfet
2024-07-25 20:19:16 +00:00
a6ebd56f7b Factor out cudagraph post compile into its own function (#129384)
Moves cudagraphs stuff into a post_compile function that I can later call when loading from AOTAutogradCache. On a cache hit, we only need to save any reasons for disabling cudagraphs along with some metadata needed to run cudagraphify. The arguments to cudagraphs_post_compile should be the set of parameters I'll need to reconstruct on a warm start.

No actual behavioral change should result from this: I'm moving the behavior into separate functions, but every operation should be the same pre and post PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129384
Approved by: https://github.com/eellison
2024-07-25 20:15:44 +00:00
58b8704f28 [aot] Keep backward mutations in backward (#129130)
https://github.com/pytorch/pytorch/issues/127561

Mutations of inputs in backward are emitted manually, after joint_fn tracing.
With the default partitioner logic they would be moved to the "forward" graph, since they are operations on forward inputs.

To keep those mutations in backward:
- Introduce a "subgraph" node key that can be specified with a context manager. When we do a manual `copy_` in backward on a forward input, we know that this is for backward, so we set subgraph="backward".

In the partitioner:
Introduce an optional argument subgraph that filters nodes by their subgraph key (node_subgraph): a node is not added to a subgraph if its node_subgraph is different.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129130
Approved by: https://github.com/Chillee
2024-07-25 20:02:25 +00:00
6c31e02971 Fixes the example for convert_conv3d_weight_memory_format (#131742)
Fixes #129158

Please let me know if changes are needed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131742
Approved by: https://github.com/albanD
2024-07-25 20:01:44 +00:00
fba24252bd [dynamo][frame summary] Skip frame summary for frames from inside torch/nn/modules (#131744)
This ensures that the stack trace points to the user code.

At main (no inlining)
![image](https://github.com/user-attachments/assets/bf6f1f46-2dfe-45a2-95e1-fb733cda7e50)

With inlining but without this PR

![image](https://github.com/user-attachments/assets/fcb16c4d-dd81-4e5d-a63a-391a73683deb)

With inlining and this PR

![image](https://github.com/user-attachments/assets/69f10f65-c2ed-4179-acd5-a2824615129c)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131744
Approved by: https://github.com/ezyang
2024-07-25 19:30:03 +00:00
a1fad03fa8 [ROCm] Enable cudagraph expandable segments UTs in inductory/dynamo (#131111)
Test runtimes extracted from CI logs are as follows.

"linux-focal-rocm6.1-py3.8":
"dynamo/test_cudagraphs_expandable_segments": 3.3185000000000002,
"inductor/test_cudagraph_trees_expandable_segments": 153.233,

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131111
Approved by: https://github.com/eqy, https://github.com/jataylo, https://github.com/peterbell10
2024-07-25 19:26:04 +00:00
8c4683c978 Add device argument to the large_grid unit test (#131702)
Without the device argument, this unit test runs only on CPU. Two unit tests were added in the previous PR https://github.com/pytorch/pytorch/pull/127448, but only one of them uses `device=self.device` to make sure the test runs on the correct device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131702
Approved by: https://github.com/desertfire
2024-07-25 19:19:56 +00:00
bf6aae1468 Improve torch.masked.mean and torch.masked._std_var scaling (#131293)
Fixes #131292

Using `new_ones` is expensive and unnecessary.

Before:

![21232fda-366a-47ea-a017-15a35cd51d0c](https://github.com/user-attachments/assets/779830f0-0027-4fab-a9e6-b99954c80bc5)

After:

![aad2dfcc-52c9-4046-86ab-122b044fa19c](https://github.com/user-attachments/assets/810711c5-c4f0-4b6b-91dc-9a9e714f6ee0)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131293
Approved by: https://github.com/ezyang
2024-07-25 18:52:59 +00:00
2c1851f04e [export] fix output node's meta (#131706)
Summary:
This pr fixes all the places in strict export stack where the output node's meta is not preserved correctly. However, we're getting a new error for the test we intend to fix: `buck2 run caffe2/test/quantization:test_quantization -- -r "test_re_export_preserve_handle"`:

The `get_attr` nodes have wrong metadata. I guess more things need to be fixed to get it working, but that is beyond the scope of this PR.

Test Plan: buck2 run caffe2/test/quantization:test_quantization -- -r "test_re_export_preserve_handle"

Differential Revision: D60198221

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131706
Approved by: https://github.com/yushangdi
2024-07-25 18:44:21 +00:00
dfc9bfc883 [reland][inductor] switch AotCodeCompiler to new cpp_builder (#130127)
Changes:
1. Switch `AotCodeCompiler` to new cpp_builder.
2. Only use `deprecated_cpp_compile_command` for `fb_code`, because I can't debug it any further without access to the Meta internal environment.
3. Add `TODO` comments so that a Meta employee can help continue this work.
4. Due to item 3, only the `deprecated_cpp_compile_command` path for `fb_code` remains to be fixed, so let's remove `validate_new_cpp_commands`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-25 18:34:08 +00:00
f3df7deab8 Revert "Add flag to ignore unsupported @triton.autotune args in user-written kernel compilation (#131431)"
This reverts commit e9db1b059733a02e1fb726d22a0489471044ad98.

Reverted https://github.com/pytorch/pytorch/pull/131431 on behalf of https://github.com/clee2000 due to broke internal tests D60211713 ([comment](https://github.com/pytorch/pytorch/pull/131431#issuecomment-2251091957))
2024-07-25 18:00:46 +00:00
2423d89d0c [dynamo] mirror training flag in OptimizedModule (#131546)
Fixes https://github.com/pytorch/pytorch/issues/122414.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131546
Approved by: https://github.com/yanboliang, https://github.com/anijain2305
2024-07-25 17:43:09 +00:00
c3679bed35 Revert "Fix py codegen to delete values that don't have any users (#131028)"
This reverts commit 91aba7baac3d2a079c0b13db25588842260c98cc.

Reverted https://github.com/pytorch/pytorch/pull/131028 on behalf of https://github.com/clee2000 due to broke inductor/test_triton_kernels inductor/test_triton_kernels.py::KernelTests::test_triton_kernel_functionalize [GH job link](https://github.com/pytorch/pytorch/actions/runs/10094659640/job/27915271250) [HUD commit link](91aba7baac) ([comment](https://github.com/pytorch/pytorch/pull/131028#issuecomment-2251058374))
2024-07-25 17:42:18 +00:00
ec3829795d [3/3] 3D Composability - move tp dp tests (#129802)
pytorch (fsdp, tp, pp) -> pytorch (composable)
Move (fsdp, tp, pp) tests under pytorch into a composable folder

FSDP:
test/distributed/_composable/fsdp/test_fully_shard_trainin.py
-TestFullyShard2DTraining
**DP:
test/distributed/tensor/parallel/test_ddp_2d_parallel.py
TP:
test/distributed/tensor/parallel/test_fsdp_2d_parallel.py**
PP:
test/distributed/pipelining/test_composability.py

=>
**distributed/_composable/test_composability/test_2d_composability.py**
distributed/_composable/test_composability/test_pp_composability.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129802
Approved by: https://github.com/fduwjj
ghstack dependencies: #129800, #129801
2024-07-25 16:36:55 +00:00
29571c5c06 [2/3] 3D Composability - move pp tests (#129801)
pytorch (fsdp, tp, pp) -> pytorch (composable)
Move (fsdp, tp, pp) tests under pytorch into a composable folder

FSDP:
test/distributed/_composable/fsdp/test_fully_shard_trainin.py
-TestFullyShard2DTraining
DP:
test/distributed/tensor/parallel/test_ddp_2d_parallel.py
TP:
test/distributed/tensor/parallel/test_fsdp_2d_parallel.py
**PP:
test/distributed/pipelining/test_composability.py**

=>
distributed/_composable/test_composability/test_2d_composability.py
**distributed/_composable/test_composability/test_pp_composability.py**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129801
Approved by: https://github.com/wconstab
ghstack dependencies: #129800
2024-07-25 16:36:55 +00:00
75c4176b05 [export][BE] consolidate export and export_for_training (#131496)
Summary: This PR consolidates the implementation of export and export_for_training to maximize code re-use. Also add some type annotations and comments in the code for better readability.

Test Plan: Existing tests.

Differential Revision: D60130515

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131496
Approved by: https://github.com/avikchaudhuri, https://github.com/pianpwk
2024-07-25 16:35:16 +00:00
6bc8db1d32 Rename is_training flag to have more information (#131618)
Summary: rename is_training flag into dispatch_tracing_mode = “make_fx” or “aot_export”

Test Plan: OSS CI

Differential Revision: D60154327

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131618
Approved by: https://github.com/ydwu4
2024-07-25 16:29:55 +00:00
f063027d54 [aoti] Fix constant inputs passed to aoti (#131594)
In cases where the program takes in a constant, export will specialize on the constant and embed the constant into the graph, with the graph containing a placeholder node with no users. However, inductor errors further down as typically in torch.compile, these constants don't show up as inputs. Since these constants are already embedded in the graph, we will just ignore these inputs while compiling with AOTI, and filter out the non-tensor inputs during the runtime.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131594
Approved by: https://github.com/desertfire
2024-07-25 16:22:15 +00:00
ffc6bf8149 [dynamo] lazily guard and specialize on the symint when used in f-string. (#131529)
Fixes https://github.com/pytorch/pytorch/issues/103602.

This PR implements the idea of "if someone creates a string and then ends up not using it, we would prefer to NOT have specialized." mentioned in the above issue. Specifically, we create a lazy variable tracker instead of a ConstantVariable when we're in FORMAT_VALUE, and when the lazy variable tracker is realized (i.e. it's going to be used), we create a ConstantVariable and the specialization/guarding happens at the time of realization.
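
The deferred-guard behaviour can be pictured with a toy class. `LazyConstant` and the string guards below are hypothetical and are not dynamo's actual variable trackers; the sketch only shows that a guard is recorded at realization time, not at creation time.

```python
class LazyConstant:
    """Hypothetical stand-in for a lazy variable tracker."""

    def __init__(self, value, guards):
        self._value = value
        self._guards = guards
        self._realized = False

    def realize(self):
        # Specialize (record a guard) only the first time the value is used.
        if not self._realized:
            self._guards.append(f"value == {self._value!r}")
            self._realized = True
        return self._value


guards = []
unused = LazyConstant(3, guards)  # created but never used: no guard
guards_before = list(guards)

used = LazyConstant(5, guards)
message = f"size is {used.realize()}"  # using the value installs the guard
print(guards_before, guards, message)
```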

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131529
Approved by: https://github.com/ezyang
2024-07-25 16:16:34 +00:00
96e8df6a3a [ts_converter] Support prim::max and prim::if with multiple outputs (#131593)
Summary: As title.

Test Plan: test_converter.py

Reviewed By: angelayi

Differential Revision: D60147455

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131593
Approved by: https://github.com/ydwu4
2024-07-25 16:13:31 +00:00
cyy
b07ea91c4c [2/N] Fix clang-tidy warnings in jit (#131735)
Follows #131034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131735
Approved by: https://github.com/ezyang
2024-07-25 15:56:53 +00:00
49a8e061b6 Revert "Support IPC for Expandable Segments (#130890)"
This reverts commit 0e71a88f9b2ca6b950c76a061791559cdd8a8870.

Reverted https://github.com/pytorch/pytorch/pull/130890 on behalf of https://github.com/zdevito due to some internal tests show shutdown issues with the change to the table that holds ipc handles ([comment](https://github.com/pytorch/pytorch/pull/130890#issuecomment-2250767280))
2024-07-25 15:54:57 +00:00
cyy
a4be5cb50e Simplify some c++ code (#131612)
The simplifications were discovered by static analysis tools.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131612
Approved by: https://github.com/ezyang
2024-07-25 15:07:37 +00:00
c3d099ddd1 [BE][Easy] Add hooks to doc for Optimizer base class (#131628)
Happened to notice this was missing from the base class (but is rendering for the other optimizers like Adam etc.) when I wanted to link the state_dict hooks for https://discuss.pytorch.org/t/global-not-per-param-optimizer-state/206769

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131628
Approved by: https://github.com/janeyx99
2024-07-25 15:07:08 +00:00
745b55d14a [CI][dashboard] Add a workflow to collect aarch64 perf (#131729)
Summary: as title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131729
Approved by: https://github.com/huydhn
2024-07-25 14:58:47 +00:00
1eedb0a962 fix torchrun log message (#131652)
fixes https://github.com/pytorch/pytorch/issues/131461

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131652
Approved by: https://github.com/awgu
2024-07-25 14:50:10 +00:00
d0e2ab617d Migrate conda, manywheel and libtorch docker builds to pytorch/pytorch (#129022)
Migration of Docker conda builds to pytorch/pytorch from pytorch/builder: https://github.com/pytorch/builder/blob/main/.github/workflows/build-conda-images.yml

Related to: https://github.com/pytorch/builder/issues/1849

Migrate scripts and workflows, and add logic to execute on PR and upload to ecr with a github hash tag, in order to test the Docker build and nightly on PR.

Test when executing on PR, upload to ecr:
https://github.com/pytorch/pytorch/actions/runs/9799439218/job/27059691327
```
308535385114.dkr.ecr.us-east-1.amazonaws.com/pytorch/conda-builder-cpu:789cf8fcd738088860056160f6e9ea7cd005972b
```

Test With-Push, upload to dockerhub:
https://github.com/pytorch/pytorch/actions/runs/9799783407/job/27060633427
```
docker.io/pytorch/conda-builder:cpu done
```
Will upload here: https://hub.docker.com/r/pytorch/conda-builder/

Test using ecr image in the nightly workflow:
https://github.com/pytorch/pytorch/actions/runs/9798428933/job/27057835235#step:16:87

Note: This is the first part, which builds the docker image and uploads it to either dockerhub or ecr. After merging, a followup PR will need to change the conda nightly workflows to use either the ecr image or the dockerhub image, depending on whether we are running on a PR or from the main/release branch.

Cleanup of workflows and scripts from builder repo: https://github.com/pytorch/builder/pull/1923
Co-authored-by: atalman <atalman@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129022
Approved by: https://github.com/atalman, https://github.com/seemethere, https://github.com/malfet, https://github.com/chuanqi129
2024-07-25 14:36:15 +00:00
4a5a87168e [BE] typing for decorators - _prims_common/wrappers (#131567)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131567
Approved by: https://github.com/oulgen, https://github.com/zou3519
2024-07-25 14:35:13 +00:00
7260eaeca0 Fix vulkan builds with missing overrides errors (#131760)
Followup after https://github.com/pytorch/pytorch/pull/131524

Also, use `C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED` macro to suppress existing warnings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131760
Approved by: https://github.com/atalman
2024-07-25 14:29:44 +00:00
fddb1bcdea [CCA][Memory Snapshot] Move user_defined annotations to Native Caching Allocator (#130964)
Summary: Instead of embedding the user_defined TraceEntry inside of device_traces, which causes issues when some threads may not have the proper device id set, save them into an external_annotations field by using a RingBuffer<AnnotationEntry> called annotation_buffer owned by the NativeCachingAllocator.

Test Plan: CI, resnet run, and FBR model.

Differential Revision: D59703213

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130964
Approved by: https://github.com/zdevito
2024-07-25 14:06:52 +00:00
c88c90a897 [TS2E] Improve logging (#131711)
Serializing the text when it is not needed can be costly for large outputs like ExportedProgram.
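
The cost being avoided can be illustrated with the standard `logging` module. This is a generic sketch, not the change made in this PR; `ExpensiveRepr` and the logger name are made up for the example.

```python
import logging


class ExpensiveRepr:
    """Hypothetical object whose string form is costly to build."""

    def __init__(self):
        self.str_calls = 0

    def __str__(self):
        self.str_calls += 1
        return "<large serialized program>"


logger = logging.getLogger("ts2e_logging_sketch")
logger.setLevel(logging.INFO)  # DEBUG messages are disabled
logger.propagate = False
logger.addHandler(logging.NullHandler())

program = ExpensiveRepr()

# Lazy %-style arguments: __str__ is never called while DEBUG is disabled.
logger.debug("converted program: %s", program)
calls_after_lazy = program.str_calls

# Eager f-string: the text is built before logging can discard the message.
logger.debug(f"converted program: {program}")
calls_after_eager = program.str_calls
print(calls_after_lazy, calls_after_eager)
```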

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131711
Approved by: https://github.com/ydwu4
2024-07-25 13:40:10 +00:00
316c0d3e6b [inductor][cpp][gemm] support k slicing for static shapes (#130821)
This PR provides the initial support for k-slicing (i.e. parallel reduction along k-dim) of CPP GEMM template. Only static shapes are supported now. When k-slicing is enabled, there would be extra temporary buffers allocated to hold the intermediate results and an extra barrier after initial GEMM compute by each thread, i.e. each thread first stores the GEMM result to temporary accumulation buffers (pointed by `local_buf_ptrs` which is an array of pointers pointing to accumulation buffers), followed by a reduction along k-slices, epilogue computes and store to the final output `Y`. In each k-slicing thread group, the reduction along k-slices and epilogue computes are conducted in parallel along M-dim. The algorithm is designed to reduce the synchronization overhead as much as possible.
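
The two phases described above (per-slice partial products into temporary buffers, then a reduction along the k-slices) can be sketched sequentially in plain Python. This is only an illustration of the arithmetic: the helper names are hypothetical, there are no threads, no barrier and no epilogue, and the `local_bufs` list only loosely mirrors the `local_buf_ptrs` array mentioned in the description.

```python
def matmul_slice(a, b, k_start, k_end):
    # Partial product of a (M x K) and b (K x N) restricted to k in [k_start, k_end).
    m, n = len(a), len(b[0])
    return [
        [sum(a[i][k] * b[k][j] for k in range(k_start, k_end)) for j in range(n)]
        for i in range(m)
    ]


def k_sliced_matmul(a, b, num_slices):
    k = len(b)
    step = -(-k // num_slices)  # ceiling division
    # Phase 1: each k-slice writes its partial result to its own buffer.
    local_bufs = [matmul_slice(a, b, s, min(s + step, k)) for s in range(0, k, step)]
    # (A threaded version would need a barrier here.)
    # Phase 2: reduce the partial results along the k-slices.
    m, n = len(a), len(b[0])
    return [[sum(buf[i][j] for buf in local_bufs) for j in range(n)] for i in range(m)]


a = [[1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]]
b = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
full = matmul_slice(a, b, 0, 4)
sliced = k_sliced_matmul(a, b, num_slices=2)
print(full, sliced)
```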

The k-slicing is enabled when blocking on M and N is unable to occupy all threads. Since k-slicing doesn't always bring benefit, an extra configuration is added to enable it (disabled by default). We need to identify a good heuristic in the future to enable k-slicing by default.

Performance numbers with 64x4096x64, 64x10000x64, 64x20000x64 as examples on 60-core SPR. As you can see, the perf of k-slicing is only better than non-k-slicing when K is large enough.

Without k-slicing
AUTOTUNE linear_unary(64x4096, 64x4096, 64)
  cpp_packed_gemm_0 0.0108 ms 100.0%
  _linear_pointwise 0.0431 ms 25.1%

AUTOTUNE linear_unary(64x10000, 64x10000, 64)
  cpp_packed_gemm_0 0.0272 ms 100.0%
  _linear_pointwise 0.0892 ms 30.5%

AUTOTUNE linear_unary(64x20000, 64x20000, 64)
  cpp_packed_gemm_0 0.0781 ms 100.0%
  _linear_pointwise 0.1693 ms 46.1%

With k-slicing:
AUTOTUNE linear_unary(64x4096, 64x4096, 64)
  cpp_packed_gemm_0 0.0260 ms 100.0%
  _linear_pointwise 0.0444 ms 58.5%

AUTOTUNE linear_unary(64x10000, 64x10000, 64)
  cpp_packed_gemm_0 0.0275 ms 100.0%
  _linear_pointwise 0.0893 ms 30.8%

AUTOTUNE linear_unary(64x20000, 64x20000, 64)
  cpp_packed_gemm_0 0.0284 ms 100.0%
  _linear_pointwise 0.1686 ms 16.8%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130821
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #131024
2024-07-25 13:36:38 +00:00
d962dba0c4 Revert "[2/3] 3D Composability - move pp tests (#129801)"
This reverts commit 84cd062fb25c6da7d33b559c28afa38420e64415.

Reverted https://github.com/pytorch/pytorch/pull/129801 on behalf of https://github.com/atalman due to Broke periodic CI: distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass4 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10083807511/job/27882848654) [HUD commit link](544f950d14) ([comment](https://github.com/pytorch/pytorch/pull/129801#issuecomment-2250326191))
2024-07-25 13:30:56 +00:00
9c4cf866c2 Adafactor forloop basic impl (#129905)
#109581

At this point, the vanilla implementation (the default) is good.
Docs: https://docs-preview.pytorch.org/pytorch/pytorch/129905/generated/torch.optim.Adafactor.html#torch.optim.Adafactor

Specifically, the impl in this PR, which attempts to replicate the paper,
```
optim = torch.optim.Adafactor([weight])
```
is close enough to https://pytorch-optimizers.readthedocs.io/en/latest/optimizer/#pytorch_optimizer.AdaFactor
```
optim_c = AdaFactor([weight], betas=(0, 0.999), scale_parameter=False)
```
is close enough to https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adafactor
```
optim = keras.optimizers.Adafactor(learning_rate=0.01)
```

The three results respectively for the same randomly generated weights:
```
# ours
tensor([[ 0.3807594, -0.3912092],
        [ 0.0762539,  0.5377805],
        [ 0.2459473,  0.4662207]])

# pytorch-optimizer
tensor([[ 0.3807592, -0.3912172],
        [ 0.0762507,  0.5377818],
        [ 0.2459457,  0.4662213]])

# keras
array([[ 0.38076326, -0.39121315],
        [ 0.0762547 ,  0.5377859 ],
        [ 0.24594972,  0.46622536]], dtype=float32)
```

This gives me confidence to move forward in speeding up the implementation now that a baseline has been established. If you're curious about differences:
* keras assigns step_size (rho_t in their code) to `min(lr, 1 / sqrt(step))` whereas the OG impl uses a hardcoded 0.01 instead of lr. We do the same thing as keras, but our lr default is 0.01.
* We differ from the pytorch-optimizers default in that our default will not track momentum (thus `beta1=0`) and we do not apply parameter scaling.
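
As a small worked example of the `min(lr, 1 / sqrt(step))` rule quoted above (the function name `step_size` is made up here, and this only evaluates the formula, not the optimizer): with `lr=0.01` the cap applies until step 10,000, after which the `1 / sqrt(step)` term takes over.

```python
import math


def step_size(lr, step):
    # The learning rate acts as an upper bound on the decaying 1/sqrt(step) term.
    return min(lr, 1.0 / math.sqrt(step))


sizes = {step: step_size(0.01, step) for step in (1, 100, 10_000, 40_000)}
for step, size in sizes.items():
    print(step, size)
```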

<details>

Keras Colab: https://colab.research.google.com/drive/1i3xF8ChL7TWKJGV_5v_5nMhXKnYmQQ06?usp=sharing

My script repro:

```
import torch
from pytorch_optimizer import AdaFactor
torch.set_printoptions(precision=7)

weight = torch.tensor([[ 0.37697506, -0.39500135],
        [ 0.07246649,  0.53399765],
        [ 0.24216151,  0.46243715]], dtype=torch.float32)
# bias = torch.tensor([0, 0], dtype=torch.float32)

weight.grad = torch.tensor([[-0.5940447, -0.7743838],
        [-0.5940447, -0.7743838],
        [-0.5940447, -0.7743838]], dtype=torch.float32)
# bias.grad = torch.tensor([-2.5027974,  1.5422692], dtype=torch.float32)

weight_c = weight.clone()
weight_c.grad = weight.grad.clone()

optim = torch.optim.Adafactor([weight])
optim.step()
print(weight)

optim_c = AdaFactor([weight_c], betas=(0, 0.999), scale_parameter=False)
optim_c.step()
print(weight_c)
```

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129905
Approved by: https://github.com/albanD
2024-07-25 13:17:19 +00:00
e8956c9fe6 Allow cpu scalar to be moved to HPU in masked_fill_decomposition (#127871)
Extension of the condition allowing the cpu scalar to be moved to specific devices.

This fixes an HPU specific error:
`torch._dynamo.exc.BackendCompilerFailed: backend='aot_hpu_training_backend' raised:
RuntimeError: Expected `value` to be on same device as `a`While executing %masked_fill : [num_users=1] = call_method[target=masked_fill](args = (%matmul, %expand_as, %tensor), kwargs = {})`

On the HPU in eager mode the problem doesn't occur, because PyTorch's implementation is not used there.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127871
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-07-25 13:04:55 +00:00
91aba7baac Fix py codegen to delete values that don't have any users (#131028)
Fixes #131025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131028
Approved by: https://github.com/ezyang
2024-07-25 13:04:23 +00:00
2784b3f1b7 [inductor] Fix split-scan interaction with multi-kernel (#131044)
This fixes a couple errors that come up when multi-kernel is used with
split-scan.
1. The split-scan was being marked as a persistent kernel, which allowed
   a multi-kernel to be created but this isn't supported. Fix is to
   never mark split-scan as persistent.
2. Benchmark codegen was not handling WorkspaceArg, and would raise a
   KeyError during codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131044
Approved by: https://github.com/shunting314
2024-07-25 11:36:36 +00:00
c04f70bb30 [BE] enable UFMT for torch/ao/ (#128864)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128864
Approved by: https://github.com/ezyang
2024-07-25 11:30:14 +00:00
434f60ce33 Refactor nightly checkout tool (#131134)
Changes:

- Add `-C REPO` in `git` commands so that the tool can be run from anywhere, not only from the repo dir
- Use `pathlib.Path` as much as possible
- Replace `subprocess.run(..., check=True)` with `subprocess.check_{call,output}(...)`
- Add `encoding='utf-8'` for files
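
A minimal sketch of how these pieces fit together, assuming hypothetical helpers `git_command` and `current_branch` that are not taken from the tool itself:

```python
import subprocess
from pathlib import Path


def git_command(repo: Path, *args: str) -> list[str]:
    # `git -C <dir>` makes git behave as if it had been started in <dir>.
    return ["git", "-C", str(repo), *args]


def current_branch(repo: Path) -> str:
    # check_output raises CalledProcessError on a non-zero exit status,
    # so no separate `check=True` is needed.
    return subprocess.check_output(
        git_command(repo, "rev-parse", "--abbrev-ref", "HEAD"),
        encoding="utf-8",
    ).strip()


# Build a command without running it; the path is just a placeholder.
cmd = git_command(Path("/tmp/pytorch"), "status", "--short")
print(cmd)
```
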
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131134
Approved by: https://github.com/ezyang
2024-07-25 11:20:43 +00:00
054d214c50 [BE][tests] show local variables on failure in tests (#131151)
------

As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI.

Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily.

Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361

```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000

    @classmethod
    def eval(cls, base, divisor):
        # python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
        # Assert triggered by inequality solver
        # assert base.is_integer, base
        # assert divisor.is_integer, divisor

        # We don't provide the same error message as in Python because SymPy
        # makes it difficult to check the types.
        if divisor.is_zero:
            raise ZeroDivisionError("division by zero")
        if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
            int_oo,
            -int_oo,
            sympy.oo,
            -sympy.oo,
        ):
            return sympy.nan
        if base is sympy.nan or divisor is sympy.nan:
            return sympy.nan

        if base.is_zero:
            return sympy.S.Zero
        if base.is_integer and divisor == 1:
            return base
        if base.is_integer and divisor == -1:
            return sympy.Mul(base, -1)
        if (
            isinstance(base, sympy.Number)
            and isinstance(divisor, sympy.Number)
            and (
                base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
                or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
            )
        ):
            r = float(base) / float(divisor)
            if r == math.inf:
                return int_oo
            elif r == -math.inf:
                return -int_oo
            elif math.isnan(r):
                return sympy.nan
            else:
                return sympy.Integer(math.floor(r))
        if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
            return sympy.Integer(int(base) // int(divisor))
        if isinstance(base, FloorDiv):
            return FloorDiv(base.args[0], base.args[1] * divisor)

        # Expands (x + y) // b into x // b + y // b.
        # This only works if floor is an identity, i.e. x / b is an integer.
        for term in sympy.Add.make_args(base):
            quotient = term / divisor
            if quotient.is_integer and isinstance(divisor, sympy.Integer):
                # NB: this is correct even if the divisor is not an integer, but it
                # creates rational expressions that cause problems with dynamic
                # shapes.
                return FloorDiv(base - term, divisor) + quotient

        try:
            gcd = sympy.gcd(base, divisor)
            if gcd != 1:
>               return FloorDiv(
                    sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
                )

base       = -1.00000000000000
cls        = FloorDiv
divisor    = -1.00000000000000
gcd        = 1.00000000000000
quotient   = 1.00000000000000
term       = -1.00000000000000

/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}

    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
>           retval = cfunc(*args, **kwargs)
E           RecursionError: maximum recursion depth exceeded in comparison
E
E           To execute this test, run the following from the base repo dir:
E               python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E           This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

args       = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc      = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func       = <function Function.__new__ at 0x7fc530317280>
kwargs     = {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
2024-07-25 10:10:58 +00:00
c4bf4005d1 [dtensor][debug] adding new noise level which allows users to only print operations with dtensors (#131592)
**Summary**
I have added a new noise level between the existing levels of 1 and 2, such that the noise level controls are now:
          0. prints module-level collective counts
          1. prints DTensor operations not included in trivial operations (new noise level)
          2. prints operations not included in trivial operations
          3. prints all operations

This gives the user more flexibility in controlling what information they want to use. The noise levels are used both for creating the console/file log and the json dump. In the example file, I have changed the module_tracing examples to noise level 0 and have changed my transformer examples to show off the new noise level.

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump

2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing

3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing

4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131592
Approved by: https://github.com/XilunWu
ghstack dependencies: #131419, #130996
2024-07-25 06:54:57 +00:00
41e9f9cb7c [inductor] Fix flaky tests in test_select_algorithm.py (#131709)
Summary: Same as [#131699](https://github.com/pytorch/pytorch/pull/131699), but in `test_select_algorithm.py`.

Test Plan: Tested internally.

Differential Revision: D60202778

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131709
Approved by: https://github.com/eellison
2024-07-25 06:42:57 +00:00
3afdbecb23 [inductor] Fix flaky tests in test_debug_trace.py (#131722)
Summary:
When run internally in multiple parallel processes, the `test_debug_trace` hits the cache and skips writing all the expected outputs. Here we force-disable inductor cache to circumvent the problem. Ideally, we should switch to using a cleaner `fresh_inductor_cache` decorator approach, but it doesn't work at the moment.

Additionally, the debug trace dir is now generated by `tempfile.mkdtemp` to avoid a (rather unlikely) race condition.
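
For reference, a short sketch of why `tempfile.mkdtemp` sidesteps the race: each call creates a new, uniquely named directory, so two processes never share a path. The `debug_trace_` prefix is a made-up name for the example.

```python
import shutil
import tempfile
from pathlib import Path

# Each call creates a fresh directory with a unique name.
first = Path(tempfile.mkdtemp(prefix="debug_trace_"))
second = Path(tempfile.mkdtemp(prefix="debug_trace_"))
try:
    distinct = first != second
    both_exist = first.is_dir() and second.is_dir()
finally:
    # mkdtemp does not clean up after itself; the caller removes the dirs.
    shutil.rmtree(first)
    shutil.rmtree(second)
print(distinct, both_exist)
```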

Test Plan: Tested internally.

Differential Revision: D60207586

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131722
Approved by: https://github.com/eellison
2024-07-25 05:56:01 +00:00
059f9fb30b [BE][inductor] Type annotate codecache.py and config.py (#131427)
As title.

Checked against the raw json file for runtime types (and tried to cover all the missing annotations listed in the .json this time).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131427
Approved by: https://github.com/eellison, https://github.com/oulgen
2024-07-25 05:54:38 +00:00
ace6decc99 Fix static py::object dangling pointer with py::gil_safe_call_once_and_store (#130341)
Fix static `py::object`s with `py::gil_safe_call_once_and_store`.

The following code creates a static `py::object` whose destructor is called when the program shuts down. The destructor calls `Py_DECREF(obj.m_ptr)`, which may raise a segmentation fault.

```c++
void func() {
    static py::object obj = py::module_::import("foo").attr("bar");

    ...
}
```

The correct code is to use raw pointers rather than the instance.

```c++
void func() {
    static py::object* obj_ptr = new py::object{py::module_::import("foo").attr("bar")};
    py::object obj = *obj_ptr;

    ...
}
```

This PR uses the `py::gil_safe_call_once_and_store` function from `pybind11`, which runs arbitrary initialization code exactly once, in a thread-safe way, under the Python GIL.

```c++
void func() {
    PYBIND11_CONSTINIT static py::gil_safe_call_once_and_store<py::object> storage;
    py::object obj = storage
                         .call_once_and_store_result(
                             []() -> py::object {
                                 return py::module_::import("foo").attr("bar");
                             }
                         )
                         .get_stored();

    ...
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130341
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-07-25 05:53:09 +00:00
59ef88ea5b [inductor] Fix flaky tests in test_pad_mm (#131699)
Summary: When run internally, some tests in `test_pad_mm.py` that need a sufficiently large GPU to run with `max_autotune=True` fail because they are assigned a smaller GPU than they require. Here we add `skipTest` calls to skip the tests in these (rare) circumstances.

Differential Revision: D60192586

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131699
Approved by: https://github.com/chenyang78, https://github.com/shunting314, https://github.com/eellison
2024-07-25 05:46:45 +00:00
ee996cd63c [inductor] Fix flaky tests in test_benchmark_fusion.py (#131733)
Summary: Same as [#131699](https://github.com/pytorch/pytorch/pull/131699), but in `test_benchmark_fusion.py`.

Test Plan: Tested internally.

Differential Revision: D60211793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131733
Approved by: https://github.com/oulgen
2024-07-25 05:39:14 +00:00
42a4df9447 Support CUDA nightly package in tools/nightly.py (#131133)
Add a new option `--cuda` to `tools/nightly.py` to pull the nightly packages with CUDA support.

```bash
# installs pytorch-nightly with cpuonly
tools/nightly.py pull

# The following are only available on Linux and Windows
# installs pytorch-nightly with latest CUDA we support
tools/nightly.py pull --cuda

# installs pytorch-nightly with CUDA 12.1
tools/nightly.py pull --cuda 12.1
```

Also add targets in `Makefile` and instructions in the contribution guidelines.

```bash
# setup conda environment with pytorch-nightly
make setup-env

# setup conda environment with pytorch-nightly with CUDA support
make setup-env-cuda
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131133
Approved by: https://github.com/ezyang
2024-07-25 05:33:52 +00:00
ceab3121de [inductor] Fix flaky tests in test_memory_planning.py (#131703)
Summary: Internally, the ABI-compatible mode is [enabled by default](eb54ca7abe/torch/_inductor/config.py (L53)). As a result, tests that assume non-ABI-compatible C++ codegen fail internally when the `abi_compatible: False` flag is not specified explicitly. Here we fix one such test in `test_memory_planning.py`.

Test Plan: Tested internally.

Differential Revision: D60197327

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131703
Approved by: https://github.com/eellison
2024-07-25 05:09:08 +00:00
cyy
35bb0d3638 Fix unsigned type bug in CUDACachingAllocator.cpp (#131464)
`curr_block->size` and `block_state.size` are both `size_t`, so a split happens whenever they are not equal. According to the comment, it is better to use `>`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131464
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-07-25 04:48:05 +00:00
5f3f14e5e4 [BE] Annotate subgraph_lowering (#131545)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131545
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2024-07-25 04:35:26 +00:00
00e19ae97a [MTIA] Support module.mtia() (#131499)
Summary: Following other device backends' implementation to support module.mtia() API.

Test Plan: OSS and Internal CIs.

Differential Revision: D60076584

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131499
Approved by: https://github.com/mikaylagawarecki
2024-07-25 04:23:48 +00:00
2ce734cee9 [BE] enable UFMT for torch/ao/quantization/ (#128863)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128863
Approved by: https://github.com/ezyang
ghstack dependencies: #128861, #128862
2024-07-25 04:17:54 +00:00
a2f6eb33d0 Register buffer in static input test (#131686)
Previously, without nn module inlining, dynamo would lift all tensor attributes on an nn module to be constant on the graph. With nn module inlining these need to be buffers explicitly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131686
Approved by: https://github.com/anijain2305
2024-07-25 03:47:56 +00:00
cyy
62704db5c3 [Distributed] [10/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d/control_plane (#131671)
Follows #130109

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131671
Approved by: https://github.com/zou3519
2024-07-25 03:46:55 +00:00
2d7c135757 Bump setuptools from 69.5.1 to 70.0.0 in /tools/build/bazel (#130893)
Bumps [setuptools](https://github.com/pypa/setuptools) from 69.5.1 to 70.0.0.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/pypa/setuptools/blob/main/NEWS.rst">setuptools's changelog</a>.</em></p>
<blockquote>
<h1>v70.0.0</h1>
<h2>Features</h2>
<ul>
<li>Emit a warning when <code>[tools.setuptools]</code> is present in <code>pyproject.toml</code> and will be ignored. -- by :user:<code>SnoopJ</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4150">#4150</a>)</li>
<li>Improved <code>AttributeError</code> error message if <code>pkg_resources.EntryPoint.require</code> is called without extras or distribution
Gracefully &quot;do nothing&quot; when trying to activate a <code>pkg_resources.Distribution</code> with a <code>None</code> location, rather than raising a <code>TypeError</code>
-- by :user:<code>Avasam</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4262">#4262</a>)</li>
<li>Typed the dynamically defined variables from <code>pkg_resources</code> -- by :user:<code>Avasam</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4267">#4267</a>)</li>
<li>Modernized and refactored VCS handling in package_index. (<a href="https://redirect.github.com/pypa/setuptools/issues/4332">#4332</a>)</li>
</ul>
<h2>Bugfixes</h2>
<ul>
<li>In install command, use super to call the superclass methods. Avoids race conditions when monkeypatching from _distutils_system_mod occurs late. (<a href="https://redirect.github.com/pypa/setuptools/issues/4136">#4136</a>)</li>
<li>Fix finder template for lenient editable installs of implicit nested namespaces
constructed by using <code>package_dir</code> to reorganise directory structure. (<a href="https://redirect.github.com/pypa/setuptools/issues/4278">#4278</a>)</li>
<li>Fix an error with <code>UnicodeDecodeError</code> handling in <code>pkg_resources</code> when trying to read files in UTF-8 with a fallback -- by :user:<code>Avasam</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4348">#4348</a>)</li>
</ul>
<h2>Improved Documentation</h2>
<ul>
<li>Uses RST substitution to put badges in 1 line. (<a href="https://redirect.github.com/pypa/setuptools/issues/4312">#4312</a>)</li>
</ul>
<h2>Deprecations and Removals</h2>
<ul>
<li>
<p>Further adoption of UTF-8 in <code>setuptools</code>.
This change regards mostly files produced and consumed during the build process
(e.g. metadata files, script wrappers, automatically updated config files, etc..)
Although precautions were taken to minimize disruptions, some edge cases might
be subject to backwards incompatibility.</p>
<p>Support for <code>&quot;locale&quot;</code> encoding is now <strong>deprecated</strong>. (<a href="https://redirect.github.com/pypa/setuptools/issues/4309">#4309</a>)</p>
</li>
<li>
<p>Remove <code>setuptools.convert_path</code> after long deprecation period.
This function was never defined by <code>setuptools</code> itself, but rather a
side-effect of an import for internal usage. (<a href="https://redirect.github.com/pypa/setuptools/issues/4322">#4322</a>)</p>
</li>
<li>
<p>Remove fallback for customisations of <code>distutils</code>' <code>build.sub_command</code> after long
deprecated period.
Users are advised to import <code>build</code> directly from <code>setuptools.command.build</code>. (<a href="https://redirect.github.com/pypa/setuptools/issues/4322">#4322</a>)</p>
</li>
<li>
<p>Removed <code>typing_extensions</code> from vendored dependencies -- by :user:<code>Avasam</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4324">#4324</a>)</p>
</li>
<li>
<p>Remove deprecated <code>setuptools.dep_util</code>.
The provided alternative is <code>setuptools.modified</code>. (<a href="https://redirect.github.com/pypa/setuptools/issues/4360">#4360</a>)</p>
</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="5cbf12a9b6"><code>5cbf12a</code></a> Workaround for release error in v70</li>
<li><a href="9c1bcc3417"><code>9c1bcc3</code></a> Bump version: 69.5.1 → 70.0.0</li>
<li><a href="4dc0c31644"><code>4dc0c31</code></a> Remove deprecated <code>setuptools.dep_util</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4360">#4360</a>)</li>
<li><a href="6c1ef5748d"><code>6c1ef57</code></a> Remove xfail now that test passes. Ref <a href="https://redirect.github.com/pypa/setuptools/issues/4371">#4371</a>.</li>
<li><a href="d14fa0162c"><code>d14fa01</code></a> Add all site-packages dirs when creating simulated environment for test_edita...</li>
<li><a href="6b7f7a18af"><code>6b7f7a1</code></a> Prevent <code>bin</code> folders to be taken as extern packages when vendoring (<a href="https://redirect.github.com/pypa/setuptools/issues/4370">#4370</a>)</li>
<li><a href="69141f69f8"><code>69141f6</code></a> Add doctest for vendorised bin folder</li>
<li><a href="2a53cc1200"><code>2a53cc1</code></a> Prevent 'bin' folders to be taken as extern packages</li>
<li><a href="720862807d"><code>7208628</code></a> Replace call to deprecated <code>validate_pyproject</code> command (<a href="https://redirect.github.com/pypa/setuptools/issues/4363">#4363</a>)</li>
<li><a href="96d681aa40"><code>96d681a</code></a> Remove call to deprecated validate_pyproject command</li>
<li>Additional commits viewable in <a href="https://github.com/pypa/setuptools/compare/v69.5.1...v70.0.0">compare view</a></li>
</ul>
</details>
<br />

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=setuptools&package-manager=pip&previous-version=69.5.1&new-version=70.0.0)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130893
Approved by: https://github.com/kit1980
2024-07-25 03:32:08 +00:00
d6115439be [MPS] Add SDPA implementation (#131362)
This work is based off @malfet's #119200

Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131362
Approved by: https://github.com/kimishpatel
2024-07-25 03:24:37 +00:00
cyy
d98d00487d [2/N] Remove unused variables (#131468)
Follows #122496

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131468
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-07-25 03:08:07 +00:00
cyy
538258bc13 [1/N] Fix clang-tidy warnings in jit (#131034)
Fix some clang-tidy warnings.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131034
Approved by: https://github.com/ezyang
2024-07-25 03:03:46 +00:00
cyy
46e42ae85d [4/N] Fix Wunused-parameter warnings (#131291)
Follows #131271

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131291
Approved by: https://github.com/ezyang
2024-07-25 02:59:22 +00:00
03979a599e [BE] enable UFMT for torch/ao/pruning/ (#128862)
Part of #123062

- #123062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128862
Approved by: https://github.com/ezyang
ghstack dependencies: #128861
2024-07-25 02:49:35 +00:00
973a1362b9 [BE] enable UFMT for torch/ao/nn/ (#128861)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128861
Approved by: https://github.com/ezyang
2024-07-25 02:49:19 +00:00
c047bddbca [easy][dynamo] Update test for inline_inbuilt_n_modules (#131718)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131718
Approved by: https://github.com/williamwen42, https://github.com/mlazos
ghstack dependencies: #131694
2024-07-25 02:49:16 +00:00
01bc2a8165 [inline-inbuilt-nn-modules] Skip mobilenet_v2 test for cpu inductor (#131694)
Related issue https://github.com/pytorch/pytorch/issues/131693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131694
Approved by: https://github.com/eellison
2024-07-25 02:49:16 +00:00
b5c006acac [BE][Easy] enable UFMT for torch/nn/ (#128865)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128865
Approved by: https://github.com/ezyang
2024-07-25 02:48:42 +00:00
cyy
8ea4c72eb2 [1/N] Fix clang-tidy warnings in aten/src/ATen/native/*.{cpp,h} (#130798)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130798
Approved by: https://github.com/ezyang
2024-07-25 02:36:43 +00:00
ab609d6aa6 [ts_convert] Update conversion for aten.tensor (#131549)
Fixes aten::tensor issues in edgeml models
P1492137675
| suite   |   #models |   #has_ts_model |   #has_sample_inputs |   #ts_can_run |   #can_convert |   #ep_result_correct |   #can_package |   #sigmoid_can_run |   #sigmoid_result_correct |
|---------|-----------|-----------------|----------------------|---------------|----------------|----------------------|----------------|--------------------|---------------------------|
| EDGEML  |        34 |              25 |                   23 |            21 |              2 |                    2 |              2 |                  2 |                         2 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131549
Approved by: https://github.com/jiashenC, https://github.com/SherlockNoMad
2024-07-25 01:11:03 +00:00
e20fb5e975 [PTD][c10d] Include PG status into flight recorder (#131268)
We are considering consolidating the data sources for logging and the flight recorder so that we don't build multiple paths for debugging information. Before we do any merging, we want to first ensure that the PG status is also included in the flight recorder. We can also leverage this information to validate our FR dump. Because the dump is not synchronized, we might see some variation in it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131268
Approved by: https://github.com/shuqiangzhang
2024-07-25 01:01:00 +00:00
c3fe9075a9 [ROCM] Use hipblaslt version from hipblaslt runtime instead of header for tunableops validator (#131078)
Summary:
When tunable ops load selected kernels from a CSV file, they validate the hipblaslt version defined in hipblaslt-version.h.

This PR changes the validator to fetch the hipblaslt version and revision from the hipblaslt runtime instead of the header file, because in our environment we might roll out a new version of the runtime before updating the header file fleet-wide.

Test Plan:
Verified generated tunableops kernel selection has the correct hipblaslt version from runtime:

```
Validator,PT_VERSION,2.5.0
Validator,ROCBLAS_VERSION,4.0.0-72e57364-dirty
Validator,HIPBLASLT_VERSION,800-bf2c3184
Validator,ROCM_VERSION,6.0.0.0-12969-1544e39
Validator,GCN_ARCH_NAME,gfx942:sramecc+:xnack-
GemmTunableOp_BFloat16_TN,tn_8192_2_3584,Gemm_Hipblaslt_TN_572,0.0240676
GemmTunableOp_BFloat16_TN,tn_7168_2_8192,Gemm_Hipblaslt_TN_482,0.0359019
GemmTunableOp_BFloat16_TN,tn_8192_2_1024,Default,0.0173723
GemmTunableOp_BFloat16_TN,tn_1280_2_8192,Gemm_Hipblaslt_TN_491,0.0191047
```
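
The `Validator` rows in the output above can be read back and compared against the values reported by the running environment. The following is a minimal sketch of that idea, not the actual TunableOp implementation: the function names are hypothetical, and the only thing taken from the source is the three-column `Validator,<key>,<value>` row layout.

```python
import csv
import io


def read_validators(csv_text: str) -> dict:
    """Collect `Validator,<key>,<value>` rows; tuning result rows are skipped."""
    validators = {}
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) == 3 and row[0] == "Validator":
            validators[row[1]] = row[2]
    return validators


def stale_entries(recorded: dict, current: dict) -> dict:
    """Return keys whose recorded value differs from the current runtime value."""
    return {
        key: (value, current.get(key))
        for key, value in recorded.items()
        if current.get(key) != value
    }


SAMPLE = """Validator,PT_VERSION,2.5.0
Validator,HIPBLASLT_VERSION,800-bf2c3184
GemmTunableOp_BFloat16_TN,tn_8192_2_3584,Gemm_Hipblaslt_TN_572,0.0240676
"""

recorded = read_validators(SAMPLE)
```

If `current` is built from values queried at runtime rather than from compile-time headers, a newer runtime paired with an older header no longer produces a spurious mismatch.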

Differential Revision: D59889043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131078
Approved by: https://github.com/jeffdaily, https://github.com/xw285cornell
2024-07-25 00:54:07 +00:00
cyy
803c5b8640 [CMake] Fix private compile options for CUDA code (#130546)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130546
Approved by: https://github.com/ezyang
2024-07-25 00:22:18 +00:00
7a42470bcb Annotate all InstructionTranslator (#131509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131509
Approved by: https://github.com/zou3519
2024-07-24 23:45:53 +00:00
7535b23a25 [export] Fix set_grad hoo if output is empty (#131511)
Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1467707973867409/

Test Plan: CI

Differential Revision: D60135531

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131511
Approved by: https://github.com/ydwu4
2024-07-24 23:17:20 +00:00
29c9f8c782 [export] Fix graph_break log registration error when importing export/_trace.py (#131523)
Summary:
When importing `_trace.py`, putting `torch._dynamo.exc.Unsupported` in the global variable ``_ALLOW_LIST`` can cause the import of ``export/_trace.py`` to fail with the error:

ValueError: Artifact name: 'graph_breaks' not registered, please call register_artifact('graph_breaks') in torch._logging.registrations.

The error is raised directly on the line `graph_breaks_log = torch._logging.getArtifactLogger(__name__, "graph_breaks")` in `_dynamo/exc.py`. I've checked that ``register_artifact('graph_breaks')`` does already exist in torch._logging.registrations.

Explicitly calling `import torch._logging` doesn't fix the issue.

(see T196719676)

We move ``_ALLOW_LIST`` to be a local variable.

Test Plan:
buck2 test 'fbcode//mode/opt' fbcode//aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test -- --exact 'aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test - test_serialized_model_for_disagg_acc (aiplatform.modelstore.publish.utils.tests.fc_transform_utils_test.PrepareSerializedModelTest)'

buck2 test 'fbcode//mode/opt' fbcode//aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test -- --exact 'aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test - test_serialized_test_dsnn_module (aiplatform.modelstore.publish.utils.tests.fc_transform_utils_test.PrepareSerializedModelTest)'

Differential Revision: D60136706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131523
Approved by: https://github.com/zhxchen17
2024-07-24 22:40:24 +00:00
236e06f9f9 Revert "Ensure staticmethods can be allowed in graph (#130882)"
This reverts commit 93fdd0237dcfe8cb4c65f3596aef123417b760a1.

Reverted https://github.com/pytorch/pytorch/pull/130882 on behalf of https://github.com/clee2000 due to torchrec test still broken internally D59945836 ([comment](https://github.com/pytorch/pytorch/pull/130882#issuecomment-2249003059))
2024-07-24 22:32:41 +00:00
5db5865614 Revert "Annotate all InstructionTranslator (#131509)"
This reverts commit eafbd20f23746aa6b9090d989a4ccb059f45297e.

Reverted https://github.com/pytorch/pytorch/pull/131509 on behalf of https://github.com/clee2000 due to sorry need to revert this to revert something else, I think you only need to rebase and remerge ([comment](https://github.com/pytorch/pytorch/pull/131509#issuecomment-2249000843))
2024-07-24 22:29:49 +00:00
a7e20ef7e4 [BE] Get rid of missing destructor override warning (#131204)
Regression introduced by https://github.com/pytorch/pytorch/pull/126376

Before this change, compiling torch_cpu on my MacBook prints tons of warnings every time HooksInterface is included
```
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/src/optim/adamw.cpp:1:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/optim/adamw.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/module.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/container/any_module_holder.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/container/any_value.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/detail/static.h:4:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/types.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/ATen.h:7:
In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/Context.h:13:
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/detail/HIPHooksInterface.h:27:11: warning: '~HIPHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~HIPHooksInterface() = default;
          ^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:16:11: note: overridden virtual function is here
  virtual ~AcceleratorHooksInterface() = default;
          ^
1 warning generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131204
Approved by: https://github.com/albanD, https://github.com/seemethere
2024-07-24 22:29:31 +00:00
b56939dae1 Annotate more InstructionTranslator (#131680)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131680
Approved by: https://github.com/zou3519
ghstack dependencies: #131676
2024-07-24 22:14:29 +00:00
f9322c26b2 Remove _export/exported_program.py (#131597)
Summary:
We removed references to _export/exported_program.py in executorch
in D60052318. Now we can remove this file.

Update the pin to executorch.

Test Plan: contbuild & OSS CI:

Differential Revision: D60072980

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131597
Approved by: https://github.com/avikchaudhuri
2024-07-24 22:04:17 +00:00
eb54ca7abe Revert "[BE] Get rid of missing destructor override warning (#131204)"
This reverts commit 8a890b72dc3e4dcd501060c2a2fee139c235a8b8.

Reverted https://github.com/pytorch/pytorch/pull/131204 on behalf of https://github.com/atalman due to sorry @malfet need to revert to make CI green, lets reland with ciflow/periodic label on ([comment](https://github.com/pytorch/pytorch/pull/131204#issuecomment-2248898033))
2024-07-24 21:08:49 +00:00
544f950d14 [BE] Improve error message when there are internal changes (#131547)
Fixes https://github.com/pytorch/test-infra/issues/4988
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131547
Approved by: https://github.com/xuzhao9, https://github.com/malfet, https://github.com/atalman
2024-07-24 20:38:08 +00:00
7f61324268 Add sparse block to flex_decoding kernel (#130884)
fix typo

Finish flex_decoding block sparse

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130884
Approved by: https://github.com/drisspg
2024-07-24 20:30:25 +00:00
b90aa18569 [aoti] Add initial custom op support (#127034)
Re-land of https://github.com/pytorch/pytorch/pull/125242

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127034
Approved by: https://github.com/malfet
2024-07-24 20:29:55 +00:00
44fdf24967 [BE] typing for decorators - jit/_decompositions (#131566)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131566
Approved by: https://github.com/oulgen, https://github.com/zou3519
2024-07-24 20:28:28 +00:00
2b83e4f8d7 [ROCm] Enable flex decoding unit tests (#131048)
Flex decoding tests are passing with upstream pytorch on MI300X/MI2XX.
Only flex attention unit tests have issues.

[result_mi250.log](https://github.com/user-attachments/files/16286954/result_mi250.log)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131048
Approved by: https://github.com/jeffdaily, https://github.com/pruthvistony, https://github.com/malfet
2024-07-24 20:25:34 +00:00
84cd062fb2 [2/3] 3D Composability - move pp tests (#129801)
pytorch (fsdp, tp, pp) -> pytorch (composable)
Move (fsdp, tp, pp) tests under pytorch into a composable folder

FSDP:
test/distributed/_composable/fsdp/test_fully_shard_trainin.py
-TestFullyShard2DTraining
DP:
test/distributed/tensor/parallel/test_ddp_2d_parallel.py
TP:
test/distributed/tensor/parallel/test_fsdp_2d_parallel.py
**PP:
test/distributed/pipelining/test_composability.py**

=>
distributed/_composable/test_composability/test_2d_composability.py
**distributed/_composable/test_composability/test_pp_composability.py**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129801
Approved by: https://github.com/wconstab
ghstack dependencies: #129800
2024-07-24 20:17:54 +00:00
a9e6356271 [ONNX] Update torch.onnx.export API (#131501)
- Add a `kwargs` option; add the `dynamic_shapes` option so users can supply it directly to `torch.export`.
- Make the options keyword-only arguments (bc-breaking)
- Deprecate the `training` and `operator_export_type` options and include a warning message. The exact time for removal is TBD but the message should discourage users from using the options.
- Deprecate two functions `exportable_ops` and pretty print export
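
The keyword-only and deprecation changes listed above follow a common Python pattern, sketched below. This is an assumption-laden illustration: `export_sketch` is a hypothetical stand-in and does not reproduce the real `torch.onnx.export` signature or its warning text.

```python
import warnings


def export_sketch(model, args=(), f=None, *, kwargs=None, dynamic_shapes=None, training=None):
    """Hypothetical stand-in; everything after `*` must be passed by keyword."""
    if training is not None:
        # A deprecated option still works but warns, discouraging further use.
        warnings.warn(
            "'training' is deprecated and will be removed in a future release",
            DeprecationWarning,
            stacklevel=2,
        )
    return {
        "model": model,
        "args": args,
        "kwargs": kwargs or {},
        "dynamic_shapes": dynamic_shapes,
    }


result = export_sketch("model", (1,), dynamic_shapes={"x": None})
```

Callers that previously passed options positionally get a `TypeError` after such a change, which is why it is flagged as bc-breaking.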

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131501
Approved by: https://github.com/titaiwangms
2024-07-24 20:03:17 +00:00
9db567f17d [ONNX] Set dump_exported_program to True in bench (#131670)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131670
Approved by: https://github.com/titaiwangms
2024-07-24 20:02:03 +00:00
85fa66be04 Add rerun_disabled_tests for inductor (#131681)
Test in prod?

This also turns on the mem leak check

Briefly checked that
```
 python3 ".github/scripts/filter_test_configs.py" \
    --workflow "inductor" \
    --job-name "cuda12.1-py3.10-gcc9-sm86 / build" \
    --test-matrix "{ include: [
    { config: "inductor", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_distributed", shard: 1, num_shards: 1, runner: "linux.g5.12xlarge.nvidia.gpu" },
    { config: "inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_cpp_wrapper_abi_compatible", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
  ]}
  " \
    --selected-test-configs "" \
    --pr-number "${PR_NUMBER}" \
    --tag "${TAG}" \
    --event-name "schedule" \
    --schedule "29 8 * * *" \
    --branch "${HEAD_BRANCH}"
```
has rerun disabled tests option in the test matrix

I don't think all these things need to run but I'm not sure which ones (probably just inductor?)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131681
Approved by: https://github.com/zou3519
2024-07-24 19:56:00 +00:00
65ce2bf465 Allow setting PYTHON_LIB_REL_PATH via environment variable (#128419)
This allows builds to customize the location where caffe2's Python modules are installed to.
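
The general pattern of an environment-variable override with a fallback can be sketched as follows. This is only an illustration in Python: the PR may implement it in the build scripts instead, and the default value shown here is a hypothetical placeholder, not the project's actual default.

```python
import os

# Hypothetical fallback used only for this sketch.
DEFAULT_PYTHON_LIB_REL_PATH = "lib/python3/site-packages"


def python_lib_rel_path(environ=None) -> str:
    """Return the override from PYTHON_LIB_REL_PATH if set and non-empty, else the default."""
    env = os.environ if environ is None else environ
    return env.get("PYTHON_LIB_REL_PATH") or DEFAULT_PYTHON_LIB_REL_PATH


default_path = python_lib_rel_path({})
custom_path = python_lib_rel_path({"PYTHON_LIB_REL_PATH": "custom/site-packages"})
```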

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128419
Approved by: https://github.com/PaliC, https://github.com/d4l3k, https://github.com/malfet
2024-07-24 19:49:06 +00:00
074b46b7d9 [1/3] 3D Composability - move fsdp tests (#129800)
pytorch (fsdp, tp, pp) -> pytorch (composable)
Move (fsdp, tp, pp) tests under pytorch into a composable folder

**FSDP:
test/distributed/_composable/fsdp/test_fully_shard_trainin.py
-TestFullyShard2DTraining**
DP:
test/distributed/tensor/parallel/test_ddp_2d_parallel.py
TP:
test/distributed/tensor/parallel/test_fsdp_2d_parallel.py
PP:
test/distributed/pipelining/test_composability.py

=>
**distributed/_composable/test_composability/test_2d_composability.py**
distributed/_composable/test_composability/test_pp_composability.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129800
Approved by: https://github.com/awgu
2024-07-24 19:47:34 +00:00
e0f1bf14a4 Fully type torch/utils/_config_module.py (#131676)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131676
Approved by: https://github.com/zou3519
2024-07-24 19:36:09 +00:00
05681b6838 Migrate missed experimental jobs to Amazon2023 AMI (#131485)
Adding in a few jobs that got missed in https://github.com/pytorch/pytorch/pull/131250

Those jobs have passed with the new AMI:
https://github.com/pytorch/pytorch/actions/runs/10063808680/job/27820050195?pr=131485
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131485
Approved by: https://github.com/atalman, https://github.com/malfet
2024-07-24 19:33:02 +00:00
05064f2827 [CI] Move all ROCm jobs to periodic frequency (#131637)
`inductor` and `rocm` workflows are the major contributors to the CI load on ROCm CI at the moment, resulting in huge backlogs: https://github.com/pytorch/pytorch/pull/131489#issue-2425804464

* Move rocm.yml to cron frequency
* Move ROCm CI jobs from inductor.yml to inductor-rocm.yml
* Introduce `ciflow/inductor-rocm` as PR label to manually invoke inductor jobs for ROCm (no automatic invoking to limit CI load)
* After this PR, only `trunk` workflow jobs for ROCm will run on every commit and PR merge, but since they take 45min*3 time on average, I decided to leave them as-is since it will provide us some basic insulation against ROCm breakage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131637
Approved by: https://github.com/clee2000, https://github.com/atalman, https://github.com/huydhn
2024-07-24 19:26:58 +00:00
8aff6caf67 [CI][dashboard] Rename cpu-x86 to cpu_x86 (#131658)
Summary: '-' is used as a special separator by upload_dynamo_perf_stats.py, so switch to '_' instead.
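
A minimal sketch of why a '-' inside a field is a problem. The `<suite>-<device>-<mode>` layout and the helper names below are assumptions made for illustration; the actual format parsed by upload_dynamo_perf_stats.py may differ.

```python
from typing import Dict, Optional


def parse_artifact_name(name: str) -> Dict[str, str]:
    """Split a hypothetical '<suite>-<device>-<mode>' name into its fields."""
    suite, device, mode = name.split("-")
    return {"suite": suite, "device": device, "mode": mode}


def try_parse(name: str) -> Optional[Dict[str, str]]:
    """Return the parsed fields, or None when the field count is wrong."""
    try:
        return parse_artifact_name(name)
    except ValueError:
        # A '-' inside a field was read as a separator, giving too many parts.
        return None


# 'cpu-x86' is split into two fields, so the name no longer has three parts.
old_style = try_parse("torchbench-cpu-x86-inference")
# 'cpu_x86' survives as a single field.
new_style = try_parse("torchbench-cpu_x86-inference")
```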
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131658
Approved by: https://github.com/huydhn
2024-07-24 19:16:52 +00:00
3ce6f61416 [AOTI] Support fallback ops not in inductor_fallback_ops (#131247)
Summary: For aten ops that are not listed in inductor_fallback_ops, AOTI will use proxy executor to execute them instead of erroring out as missing C shim implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131247
Approved by: https://github.com/angelayi
2024-07-24 19:16:43 +00:00
aeca9845a6 Migrate Lint jobs to Amazon 2023 AMI (#131514)
Continuing in the same vein as https://github.com/pytorch/pytorch/pull/131250, migrate all self-hosted lint.yml jobs to use the new Amazon 2023 AMI
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131514
Approved by: https://github.com/clee2000, https://github.com/malfet, https://github.com/huydhn
2024-07-24 19:11:02 +00:00
b98b3127f7 [easy][pytorch][counters] Move WaitCounter in c10/util (#131021)
Summary: Since the WaitCounter frontend itself has minimal dependencies, it's fine to move it into c10. Specific backends can be registered/linked separately.

Test Plan: unit test

Reviewed By: jamesperng, asiab4, c-p-i-o

Differential Revision: D59842868

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131021
Approved by: https://github.com/asiab4
2024-07-24 18:38:33 +00:00
7718024d2b [3.13] support 3.13 multiline traces in munge_exc (#131207)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131207
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #131206
2024-07-24 18:22:30 +00:00
f0378912a0 [3.13, dynamo] fix test/dynamo/test_bytecode_utils.py (#131206)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131206
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-07-24 18:22:30 +00:00
a86909d251 [inductor] Type annotate constant_folding.py (#131364)
Summary: Type annotate constant_folding.py

Test Plan: mypy

Reviewed By: angelayi

Differential Revision: D60063872

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131364
Approved by: https://github.com/angelayi
2024-07-24 18:20:06 +00:00
8fe5b93667 support zb1p and zb2p algorithms (#130752)
Previously, we proved that ZB2P is not truly zero bubble when num_local_stages exceeds 4, so only ZB1P was supported.

We made a few tweaks to ZB2P to make it truly zero bubble. The algorithm and proof are attached.
[zero_bubble.pdf](https://github.com/user-attachments/files/16238738/zero_bubble.pdf)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130752
Approved by: https://github.com/H-Huang
2024-07-24 17:58:46 +00:00
5e6cfb7db5 Add an extra shard for distributed periodic jobs (#131498)
Fixes issue of timeouts being observed in ROCm periodic workflow for distributed runs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131498
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/clee2000
2024-07-24 16:44:53 +00:00
106c6a49f5 [dynamo] limit number of compiles per frame (#130891)
Fixes https://github.com/pytorch/pytorch/issues/130776

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130891
Approved by: https://github.com/anijain2305
2024-07-24 16:43:40 +00:00
abcd329359 [BE] typing for decorators - onnx/symbolic_helper (#131565)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131565
Approved by: https://github.com/justinchuby, https://github.com/oulgen, https://github.com/zou3519, https://github.com/titaiwangms
2024-07-24 16:39:47 +00:00
0e71a88f9b Support IPC for Expandable Segments (#130890)
This reapplication commit is the same as before except it resolves a build error in an internal build where `handle` was shadowed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130890
Approved by: https://github.com/dsjohns2
2024-07-24 15:45:40 +00:00
eb5883f8aa Add new runner labels to actionlint (#131525)
Adding the labels corresponding to the Amazon2023 ami
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131525
Approved by: https://github.com/atalman
2024-07-24 15:28:59 +00:00
72d17d95d7 [inductor] Enable dynamo for Windows. RC1 (#131286)
Changes:
1. Enable Windows in `check_if_inductor_supported`.
2. Disable Windows in `AotCodeCompiler`.
3. Force Windows inductor to `c++20` to support `std::enable_if_t`.
4. Temporarily disable the `test_x86inductor_quantizer` UT on `Windows`. It still has some issues that need to be fixed: https://github.com/pytorch/pytorch/pull/131308 .

Based on this PR, I have successfully run the first model, `resnet18`, on Windows inductor.
<img width="1036" alt="image" src="https://github.com/user-attachments/assets/2642bda1-1845-417a-aaba-39bdf22e65d6">

TODO:
1. Upgrade pytorch Windows build to `c++20`.
2. Fix and re-enable `test_x86inductor_quantizer` UT on `Windows`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131286
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-24 15:26:55 +00:00
4c7f22dee2 [BE] remove unnecessary _dispatch_sqrt by using ** 0.5 (#131358)
Based on the discussion here, which found that `** 0.5` is not slower than `math.sqrt`: https://github.com/pytorch/pytorch/pull/129905#discussion_r1675605075
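
As a rough illustration of the replacement, the two spellings below compute the same value for a plain Python float. The function names and the Adam-style bias-correction term are assumptions chosen for the example, not code taken from this PR.

```python
import math


def bias_correction_sqrt_math(beta2: float, step: int) -> float:
    """Square root via math.sqrt (the style being replaced)."""
    return math.sqrt(1 - beta2 ** step)


def bias_correction_sqrt_pow(beta2: float, step: int) -> float:
    """Square root via ** 0.5, with no helper or import needed."""
    return (1 - beta2 ** step) ** 0.5


a = bias_correction_sqrt_math(0.999, 10)
b = bias_correction_sqrt_pow(0.999, 10)
# The two results agree to floating-point tolerance.
same = math.isclose(a, b, rel_tol=1e-12)
```

The `** 0.5` form also works unchanged on objects that implement `__pow__`, which is one reason a dispatch helper becomes unnecessary.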

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131358
Approved by: https://github.com/albanD
2024-07-24 14:58:57 +00:00
98984422eb [triton_op] fix autotuning (#131363)
The problem was we were shoving SymInts into the constant_args side
table. The root problem is that torch.fx.node.base_types, which we use
to determine what can be put in the graph, doesn't actually have SymInt
in it. This PR fixes base_types to include SymInt.

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131363
Approved by: https://github.com/oulgen, https://github.com/justinchuby
2024-07-24 14:03:37 +00:00
bc938184de [FSDP2] Added set_reduce_scatter_divide_factor (#129286)
This PR adds an API `FSDPModule.set_reduce_scatter_divide_factor` to allow setting a custom gradient divide factor for reduce-scatter. This can be useful when using parallelisms in combination with FSDP (e.g. expert parallelism), where gradients need to be divided by a custom factor (e.g. an extra `EP` factor).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129286
Approved by: https://github.com/weifengpy
2024-07-24 12:42:35 +00:00
8ffd109a00 Revert "Fix py codegen to delete values that don't have any users (#131028)"
This reverts commit 466c167b71e6021f8eadcfbae1d9156a375663ce.

Reverted https://github.com/pytorch/pytorch/pull/131028 on behalf of https://github.com/atalman due to breaks CI ([comment](https://github.com/pytorch/pytorch/pull/131028#issuecomment-2247771530))
2024-07-24 12:21:43 +00:00
451462dbff [1/N] Add missing constructors or assignment operators (#131077)
Just mark them as deleted in most cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131077
Approved by: https://github.com/ezyang
2024-07-24 12:09:39 +00:00
0c6f1ca064 Introduce torch._dynamo.config.enable_compiler_collectives for syncing compilation across ranks (#130935)
This PR implements an opt-in configuration option for synchronizing compilation across all ranks at the end of Dynamo tracing (and potentially, other places in the future). There are two pieces to this PR:

1. Implementing infrastructure for compiler collectives (DistributedState/LocalState, the actual collective)
2. Using this infrastructure to synchronize automatic dynamic choices across all ranks

The infrastructure in part one can be used for other purposes, just add more (serializable) fields to LocalState.

Here is how automatic dynamic synchronization works:

1. Preflight in "torch/_dynamo/variables/builder.py": On the first Dynamo trace run, we trace without automatic dynamic at all; we assume all Tensor inputs that are not otherwise marked are static. This run is purely to collect all Tensor input sizes in the program.
2. torch/_dynamo/output_graph.py: At the end of the first Dynamo trace run, we perform a compiler collective to distribute all Tensor input sizes to all ranks. Then, we restart Dynamo.
3. Apply the updates in "torch/_dynamo/variables/builder.py": Now that we have all sizes for every rank, we now update frame state with the observed sizes for all ranks, in rank order. Under the assumption that frame state is consistent on all ranks, this series of updates will preserve consistency.
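
The three steps above can be sketched as a single-process simulation. Everything here is hypothetical: `FrameState`, `update_frame_state`, and `sync_across_ranks` are illustrative stand-ins, not the Dynamo data structures, and the "collective" is just a shared Python list.

```python
from typing import Dict, List, Optional, Tuple

Sizes = Dict[str, Tuple[int, ...]]
# None marks a dimension that has been promoted to dynamic.
FrameState = Dict[str, Tuple[Optional[int], ...]]


def update_frame_state(state: FrameState, observed: Sizes) -> None:
    """Record sizes; a dimension that disagrees with the record becomes dynamic."""
    for name, size in observed.items():
        if name not in state:
            state[name] = tuple(size)
            continue
        # Simplification: assumes the rank (number of dims) matches.
        state[name] = tuple(
            old if old == new else None for old, new in zip(state[name], size)
        )


def sync_across_ranks(per_rank_sizes: List[Sizes]) -> List[FrameState]:
    """Simulate the collective: every rank sees all sizes, in rank order."""
    gathered = list(per_rank_sizes)  # stand-in for the compiler collective
    states: List[FrameState] = []
    for _rank in range(len(per_rank_sizes)):
        state: FrameState = {}  # assumed consistent (empty) on all ranks
        for sizes in gathered:  # same order everywhere keeps ranks consistent
            update_frame_state(state, sizes)
        states.append(state)
    return states


# Rank 0 saw x with shape (8, 16); rank 1 saw (4, 16).
states = sync_across_ranks([{"x": (8, 16)}, {"x": (4, 16)}])
```

Because every rank applies the same updates in the same order, each ends up with the same decision: dimension 0 of `x` is dynamic and dimension 1 stays static.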

For future work, it would be safer if we force a consistent hint on all ranks; this is more involved as we have to interpose in fakification.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130935
Approved by: https://github.com/jansel
2024-07-24 11:24:11 +00:00
85d3ee1d67 [micro_pipeline_tp] refactor all-gather and reduce-scatter pattern matchers to be more flexible and testable (#131409)
High level goals:
- Cover the all-gather and reduce-scatter pattern matchers with unit tests
- Make it easier to exclude certain collectives as async-tp candidates
- Make it easier to match other all-gather and reduce-scatter variants (e.g. fp8 collectives)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131409
Approved by: https://github.com/weifengpy
2024-07-24 11:16:27 +00:00
89d5391bbf [inductor] Kill mark_node_as_mutating (#130834)
Resubmit of #129346

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130834
Approved by: https://github.com/lezcano
ghstack dependencies: #130832, #130833
2024-07-24 11:11:19 +00:00
6415c45da5 [inductor] Use multiple outputs for flex-attention (#130833)
Resubmit of #129344

This fixes the DCE issue for attention output

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130833
Approved by: https://github.com/lezcano
ghstack dependencies: #130832
2024-07-24 11:11:19 +00:00
95c248751b [inductor] Make UserDefinedTritonKernel a multi-output operation (#130832)
Resubmit of #129325

Previously each mutation was represented by a `MutationOutput` operation which
was a new scheduler node that must be scheduled immediately afterwards.

Now we have a single scheduler node, which produces multiple `MutationOutput`
buffers as its output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130832
Approved by: https://github.com/lezcano
2024-07-24 11:11:14 +00:00
a4c3f29047 [ONNX][BE] Remove ruff skips in torch/onnx (#131368)
Remove all ruff skips for torch/onnx since we do not do runtime type checking anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131368
Approved by: https://github.com/titaiwangms, https://github.com/Skylion007
2024-07-24 10:56:43 +00:00
62e566b345 [BE] Remove suppression of inconsistent missing overrides (#131524)
This should prevent regressions like the ones fixed by https://github.com/pytorch/pytorch/pull/131204

- Remove global `-Wno-error=inconsistent-missing-override`
- Wrap offending includes (protobuf and asmjit) with `C10_DIAGNOSTIC_PUSH_AND_IGNORE` and `C10_DIAGNOSTIC_POP_AND_IGNORED`
- Add `override` keyword to `at::namespace::tunable::StreamTimer` and `LLVMCodeGenImpl`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131524
Approved by: https://github.com/atalman
2024-07-24 10:07:36 +00:00
83d19620f6 kill tmp _is_executorch flag (#131488)
Test Plan: existing tests

Differential Revision: D60126186

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131488
Approved by: https://github.com/ydwu4
2024-07-24 08:51:37 +00:00
1e34870796 [CI][dashboard][reland] Collect PT2 cpu perf nightly (#131560)
Summary: Add a workflow similar to inductor-perf-test-nightly.yml but use x86 metal instances for perf measurement. The data processing and dashboard update will come next.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131560
Approved by: https://github.com/huydhn
2024-07-24 08:50:33 +00:00
276b5238ef [bug] Add is_compiling check for optimizers to avoid untracked tensor during graph tracing (#130909)
Hey folks, I was using the `stateless_func` [here](7c45476d38/torch/distributed/_spmd/api.py (L435)), which worked well before [this commit](https://github.com/pytorch/pytorch/pull/111084) but then introduced a `_tensor_constant0` and made this func non-stateless. Since there is no way to retrieve this constant tensor before compilation and performance is not an issue when tracing a graph, I think it might be good to fall back to the other branch.
![image](https://github.com/user-attachments/assets/6ee4487d-456b-47e0-8c1d-66cb5a641d47)

![image](https://github.com/user-attachments/assets/1ed46502-e50e-45c4-9751-49aa5a4590ae)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130909
Approved by: https://github.com/mlazos
2024-07-24 08:29:27 +00:00
41189b0da4 Simplify THPEvent_get_device (#131466)
Because self->event.device() always returns Device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131466
Approved by: https://github.com/albanD
2024-07-24 08:24:01 +00:00
e782918b8e [NestedTensor] Add example NestedTensor objects with inner dimension of size 1 to tests reducing along jagged dimension for NestedTensor (#131516)
Add example `NestedTensor`s with inner dimension of size `1` to `_get_example_tensor_lists` with `include_inner_dim_size_1=True`. This diff creates `NestedTensor`s of sizes `(B, *, 1)` and `(B, *, 5, 1)`, ensuring that the current implementations of jagged reductions for `sum` and `mean` hold for tensors of effective shape `(B, *)` and `(B, *, 5)`.

Differential Revision: [D59846023](https://our.internmc.facebook.com/intern/diff/D59846023/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131516
Approved by: https://github.com/davidberard98
2024-07-24 07:01:39 +00:00
e9db1b0597 Add flag to ignore unsupported @triton.autotune args in user-written kernel compilation (#131431)
Summary: We currently don't support some of the `@triton.autotune` arguments when compiling user-written Triton kernels with PT2. In this PR, we're adding a flag to circumvent it. This is to unblock internal compilation in some cases. The flag is supplied with the docs mentioning why it is not a good idea to set it.

Test Plan:

```
python test/inductor/test_triton_kernels.py -k test_triton_kernel_autotune_with_unsupported_args
...
----------------------------------------------------------------------
Ran 3 tests in 3.636s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131431
Approved by: https://github.com/oulgen, https://github.com/zou3519
2024-07-24 05:37:09 +00:00
eafbd20f23 Annotate all InstructionTranslator (#131509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131509
Approved by: https://github.com/zou3519
2024-07-24 05:31:01 +00:00
5772c13f56 Dont wrap negative indexing in scatter reduce (#131503)
Fix for https://github.com/pytorch/pytorch/issues/131321

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131503
Approved by: https://github.com/shunting314
2024-07-24 04:01:32 +00:00
9f96d4b61b Disable inlining on cudagraph fallback tests (#131557)
The cudagraph fallback tests should only run without nn module inlining. The [rerecord limit](fc3d2b26cd/torch/_inductor/cudagraph_trees.py (L1922)) is ignored if nn module inlining is disabled. Arguably it should just be higher, but this PR addresses the failures and allows inlining to be on by default on main.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131557
Approved by: https://github.com/anijain2305
ghstack dependencies: #131556
2024-07-24 04:00:02 +00:00
9575b1afad Ensure tensor dict is populated with compiled autograd (#131556)
The issue addressed is that compiled autograd changes the calling convention of the FX graph to only have a single placeholder which contains a list of inputs. In this case, the meta of the tensor input nodes don't contain the `tensor_dict` meta. This adds them.

The context is that `tensor_dict` is used to convey if a tensor is an input with a static address.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131556
Approved by: https://github.com/anijain2305
2024-07-24 04:00:02 +00:00
dffbd3a1e2 Add mypy typing to pattern_matcher (#131506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131506
Approved by: https://github.com/zou3519
2024-07-24 02:55:43 +00:00
7124efa81b Include _native.h for structured_native_functions (#131208)
In gen.py, the code for generating CompositeViewCopyKernels.cpp includes *_native.h headers for "view_groups" but not "structured_native_functions". As a result, the TORCH_API in the headers is ineffective, which prevents such functions from being used outside libtorch_cpu.so.

This patch ensures that gen.py includes the native headers for "structured_native_functions" in the same way as for "view_groups".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131208
Approved by: https://github.com/bdhirsh
2024-07-24 02:55:36 +00:00
31da9ee711 Use explain function to provide more meaningful information when conversion failed. (#131214)
Summary: In the script of testing different families of models, when the conversion failed, we switch to use output from the explain function to provide more meaningful information.

Test Plan:
Manual testing with attached log information.

```
buck2 run mode/dev-nosan sigmoid/inference/ts_migration:main -- --mode test_all --test_suites ads_merge --model_id 440779101
```

```
Processing 440779101_5455.predictor.disagg.gpu.merge

    model_name: 440779101_5455.predictor.disagg.gpu.merge
    has_ts_model: True
    has_sample_inputs: True
    ops_maybe_missing_meta: set()
    ts_can_run: True
    ts_run_exception: None
    can_convert: False
    convert_exception: Unsupported nodes are found in the following list:

        0. prim::Loop [%14259 : int = prim::Loop(%14258, %1129, %1126), scope: torch.fx.graph_module.GraphModule:: # <torch_package_1>.caffe2/torch/fb/predictor/modules/tensors_to_device_module.py:100:19]

        1. prim::Loop [%14326 : int = prim::Loop(%1115, %1129, %14259), scope: torch.fx.graph_module.GraphModule:: # <torch_package_1>.caffe2/torch/fb/predictor/modules/tensors_to_device_module.py:100:19]
    ep_result_correct: None
    ep_run_exception: None
    can_package: None
    package_exception: None
    sigmoid_can_run: None
    sigmoid_run_exception: None
    sigmoid_result_correct: None
```

Reviewed By: SherlockNoMad

Differential Revision: D59971446

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131214
Approved by: https://github.com/angelayi
2024-07-24 02:42:18 +00:00
0ceaabaf71 [easy][inline-inbuilt-nn-modules] Update test (#131563)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131563
Approved by: https://github.com/mlazos
ghstack dependencies: #131347, #131367, #131378, #131389, #131405, #131480, #131512
2024-07-24 02:32:19 +00:00
0e780a7d69 [BE] Remove some mypy allow-untyped-decorators that are no longer needed (#131564)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131564
Approved by: https://github.com/oulgen
2024-07-24 02:00:08 +00:00
abb313b466 [torch.mtia] Noop set_rng_state and get_rng_state APIs (#130873)
Summary: As title

Test Plan: CI tests

Reviewed By: joebos

Differential Revision: D59036602

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130873
Approved by: https://github.com/hanzlfs
2024-07-24 01:52:21 +00:00
aa1c78c7e9 [PTD][c10d][EZ] LOG error for nccl error rather than info (#131483)
Summary: As title, when we get nccl exception we should log it as error not info.

Test Plan: CI

Reviewed By: csmodlin, rmiao

Differential Revision: D60123773

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131483
Approved by: https://github.com/fegin
2024-07-24 01:08:00 +00:00
466c167b71 Fix py codegen to delete values that don't have any users (#131028)
Fixes #131025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131028
Approved by: https://github.com/ezyang
2024-07-24 01:03:56 +00:00
14495ce288 [BE][MPS] Use isOperatingSystemAtLeastVersion: (#131513)
Instead of trying to come up with different checks for classes responding to selectors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131513
Approved by: https://github.com/atalman
2024-07-24 00:54:25 +00:00
76f7b3e560 [inductor][cpp][gemm] improve thread blocking heuristics (#131024)
This PR improves the thread blocking heuristics to favor full occupancy as much as possible. Also, the "m x n" block size is made as square as possible for better data reuse.

Take the shape M=200000, N=64, K=128 as an example. The original heuristics couldn't use up all the threads when the number of threads is large, say 60:
AUTOTUNE linear_unary(200000x128, 64x128, 64)
  _linear_pointwise 0.1010 ms 100.0%
  cpp_packed_gemm_0 0.8303 ms 12.2%
V0722 02:26:39.220660 302553 torch/_inductor/codegen/cpp_gemm_template.py:503] [0/0] Register blocking: GemmBlocking(block_m=32, block_n=32, block_k=32)
V0722 02:26:39.221042 302553 torch/_inductor/codegen/cpp_gemm_template.py:507] [0/0] Cache blocking: GemmBlocking(block_m=625, block_n=1, block_k=4)
V0722 02:26:39.221118 302553 torch/_inductor/codegen/cpp_gemm_template.py:509] [0/0] Thread blocking: GemmBlocking(block_m=625, block_n=1, block_k=4)
V0722 02:26:39.221252 302553 torch/_inductor/codegen/cpp_gemm_template.py:526] [0/0] Number of threads: 60, occupancy: (10, 2, 1)

After this PR:
AUTOTUNE linear_unary(200000x128, 64x128, 64)
  _linear_pointwise 0.1143 ms 100.0%
  cpp_packed_gemm_0 0.1228 ms 93.1%
V0722 02:29:49.261794 304201 torch/_inductor/codegen/cpp_gemm_template.py:309] [0/0] Register blocking: GemmBlocking(block_m=32, block_n=32, block_k=32)
V0722 02:29:49.262860 304201 torch/_inductor/codegen/cpp_gemm_template.py:313] [0/0] Cache blocking: GemmBlocking(block_m=64, block_n=1, block_k=8)
V0722 02:29:49.262951 304201 torch/_inductor/codegen/cpp_gemm_template.py:315] [0/0] Thread blocking: GemmBlocking(block_m=69, block_n=79, block_k=8)
V0722 02:29:49.263075 304201 torch/_inductor/codegen/cpp_gemm_template.py:332] [0/0] Number of threads: 60, occupancy: (15, 4, 1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131024
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w
2024-07-24 00:36:29 +00:00
fdc9a1404e Remove _BLACK_LISTED_OPS (#131361)
Summary: remove _BLACK_LISTED_OPS after https://github.com/pytorch/pytorch/pull/100749

Test Plan: contbuild & OSS CI

Differential Revision: D60056130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131361
Approved by: https://github.com/angelayi
2024-07-24 00:15:27 +00:00
2cf220956a [inductor] fix CacheBase.get_system on AMD (#131365)
Summary: CacheBase.get_system on AMD is missing device name and hip version, fix that

Test Plan:
on AMD:
```
buck run fbcode//mode/opt-amd-gpu scripts/nmacchioni/repros/amd_cache_key:repro
{'device': {'name': 'gfx942:sramecc+:xnack-'}, 'version': {'triton': '3.0.006965bceb379c60d8184a4166f502457952938167bfb69592ebf48abebfb0ce9-4856d26164925fd955c779d8f67ecf47cc5754052b008714b3a580d708b13dd8-06965bceb379c60d8184a4166f502457952938167bfb69592ebf48abebfb0ce9-23d635e690d670bf61798e1259674b78c0ed5ba222ab6a455f329f27a758fc2d-e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855-166fbf4e6f8845f354611638861a2a9e1dc2654224c278e10b566f09549dae7e-ccd93feaad4c82c8c1604557340de15fda0a3c84fe83f5a4d1e12a07a77bf3f4-cf28658fa328f7f283ec4e6ccc6c48d7c2a8ddbdf5134d3eb35c9b38ce4ace44-b9d80690b3109c2aaf5ece450d62e93b37eb6ab38552089794b3bb36e36a22b3-36130a37af1b19a0dec569aa08d30b00c74c8f02b6b632999d86dea169146792-4a620da64e0c263067f0dbf6c721f5214a5ac315625a07dd98520502ddf7e22f-6ace95666f6a4ecd2b1a7fc7ae865d1a9239608bd020cb6e4b8d15233c2dd9b3', 'hip': '6.0.32830'}, 'hash': 'c4db04316e15953dda8648f5a43a3f208f2c0ba454666cc7d78e40527aab85ec'}
```

on Nvidia:
```
buck run fbcode//mode/opt scripts/nmacchioni/repros/amd_cache_key:repro
{'device': {'name': 'NVIDIA PG509-210'}, 'version': {'triton': '6de41ec76ecad84e618d692e6793a4ebe707ae68a0c033a222105daa72214d7c', 'cuda': '12.0.0'}, 'hash': 'b58d0aa37d80fc2932c1b7576ca876b77aa1258db1c14e27d1f201bd15376faf'}
```

Differential Revision: D60062972

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131365
Approved by: https://github.com/eellison
2024-07-24 00:11:59 +00:00
480ae51f85 [pytree] Only import optree if it's used (#131478)
torch.utils._pytree imports optree if it's available. Instead, we change
it to if it gets used. The motivation for this is better isolation.
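
A minimal sketch of the import-on-first-use pattern described above. The stdlib module `colorsys` stands in for the optional dependency so the example runs anywhere; `get_backend` and the wrapper are hypothetical names, not the helpers in torch.utils._pytree.

```python
import importlib
from functools import lru_cache
from types import ModuleType
from typing import Tuple


@lru_cache(maxsize=None)
def get_backend(name: str = "colorsys") -> ModuleType:
    """Import the optional backend on first use and cache the module."""
    return importlib.import_module(name)


def rgb_to_hsv(r: float, g: float, b: float) -> Tuple[float, float, float]:
    """The backend import happens here, on first call, not at definition time."""
    return get_backend().rgb_to_hsv(r, g, b)


hsv = rgb_to_hsv(1.0, 0.0, 0.0)
```

Code paths that never call into the backend never trigger the import, which is the isolation benefit the message refers to.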

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131478
Approved by: https://github.com/albanD
2024-07-24 00:10:49 +00:00
6850e42266 [dynamo][exception] Remove older specialization for StopIteration (#131512)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131512
Approved by: https://github.com/yanboliang
ghstack dependencies: #131347, #131367, #131378, #131389, #131405, #131480
2024-07-24 00:06:53 +00:00
e2b941a1b4 [dynamo] Rename TENSOR_ALIASING to OBJECT_ALIASING. Permit OBJECT_ALIASING for dict guards (#131480)
Fixes https://github.com/pytorch/pytorch/issues/129667

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131480
Approved by: https://github.com/williamwen42
ghstack dependencies: #131347, #131367, #131378, #131389, #131405
2024-07-24 00:06:53 +00:00
e39f136c35 [debug][dtensor] implemented activation checkpointing differentiation (#130996)
**Summary**
While trying to integrate CommDebugMode with TorchTitan, I realized that the forward_hooks were being registered even though it was in the backward pass. After investigating, I realized that it was activation checkpointing that was causing this. In order to prevent users from being confused, I edited CommDebugMode so that it could differentiate between backward pass operations and activation checkpointing operations. I have also added an example case showing that CommDebugMode is able to successfully differentiate between the backward pass and activation checkpointing. The output for the example can be seen below.

**Test Case**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e activation_checkpointing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130996
Approved by: https://github.com/XilunWu
ghstack dependencies: #131419
2024-07-23 23:44:56 +00:00
7b375c3682 [dtensor][debug] changed which module tracker I inherited from to fix bug with activation checkpointing (#131419)
**Summary**
I switched the module tracker I had been inheriting from PyTorch’s all purpose one to the one written by Sanket in the distributed tools folder. I did this because the original one messed up activation checkpointing by adding itself to the parent set in the backward_pre_hook and then in the forward_pre_hook for the activation_checkpointing.

**Test Case**
pytest test/distributed/_tensor/debug/test_comm_mode_features.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131419
Approved by: https://github.com/XilunWu
2024-07-23 23:44:56 +00:00
161c18ed0b SymmetricMemory-based, low contention intra-node all-gather and reduce-scatter (#130583)
```python
# NOTE [low-contention collectives]
# When a collective is overlapped with abundant compute, it makes sense to
# prioritize reducing the contention between the collective and the overlapped
# compute, even at the cost of a slightly slower collective.
#
# Common collective implementations (e.g., NCCL without user buffer
# registration) optimize for throughput with no ambient compute. However, such
# implementations may not be optimal when they are overlapped with compute:
# - These implementations typically fuse the entire collective into a single
# kernel and reserve SM resources based on the most demanding portion of the
# collective, even when a large portion of the collective does not require this
# much resource.
# - These implementations often use SM-based P2P copy as opposed to copy
# engine-based P2P copy. Copy engine-based P2P copy may not have a significant
# advantage when there's no ambient compute. However, it may significantly
# improve overall resource utilization in the presence of ambient compute.
#
# When overlapped with intensive compute (e.g., persistent matmul kernels), the
# SM-usage of a collective can lead to inefficient overlapping.
#
# Low-contention collectives achieve their goals with the following strategies:
# - Use copy engine-based copy whenever possible.
# - Break down portions of a collective with different resource requirements
# into multiple kernels. This improves the overlapping efficiency at the cost
# of additional launching overhead.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130583
Approved by: https://github.com/weifengpy
2024-07-23 23:37:48 +00:00
1930698140 Fix fake tensor SymInt caching when there's a SymInt storage_offset (#131500)
Test Plan: Internal unit tests failed before and succeeded after.

Differential Revision: D60131273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131500
Approved by: https://github.com/clee2000
2024-07-23 23:37:04 +00:00
fc3d2b26cd Use fake PG for test_compute_comm_reordering.py unit tests (#131415)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131415
Approved by: https://github.com/yifuwang
2024-07-23 22:53:23 +00:00
980bb54361 [BE][Inductor] fix failures in test_padding.py (#131417)
The failure only happens [internally](https://www.internalfb.com/tasks/?t=195598864) because the main block was not executed when the tests are run internally.

Differential Revision: [D60083954](https://our.internmc.facebook.com/intern/diff/D60083954)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131417
Approved by: https://github.com/eellison
2024-07-23 21:53:59 +00:00
53f1f75061 [BE][Inductor] fix do_bench test (#131402)
The test fails internally, see [T195592444](https://www.internalfb.com/intern/tasks/?t=195592444) (a Meta-internal link), but we don't see the failure in OSS.

It turns out that there are 2 issues:
1. `run_test('cuda')` is improperly handled since it tries to import a module named 'cuda' if cuda is available. Since the import fails, all tests in the file are skipped. This hides the failure in OSS. The failure is exposed in internal tests since the main block which runs `run_test('cuda')` is skipped sometimes.
2. fix the real issue that incompatible inputs are provided to `do_bench`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131402
Approved by: https://github.com/eellison
2024-07-23 21:52:35 +00:00
5a0068cc69 [BE] mypy: disallow untyped decorators (#131428)
Untyped decorators strip the types from the functions they decorate, so even when the underlying function is fully typed, its callers get no benefit from the type annotations.
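
A small sketch of the difference, using hypothetical decorators that are not from the PyTorch codebase. Both behave identically at runtime; only what a type checker such as mypy infers for callers differs.

```python
import functools
from typing import Any, Callable, TypeVar, cast

F = TypeVar("F", bound=Callable[..., Any])


def untyped_passthrough(fn):
    """No annotations: a type checker treats the decorated function as untyped."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)
    return wrapper


def typed_passthrough(fn: F) -> F:
    """Annotated with a TypeVar: the decorated function keeps its signature."""
    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        return fn(*args, **kwargs)
    return cast(F, wrapper)


@untyped_passthrough
def add_untyped(a: int, b: int) -> int:
    return a + b


@typed_passthrough
def add_typed(a: int, b: int) -> int:
    return a + b


# Under mypy, a call such as add_typed("1", 2) would be flagged,
# while the same mistake on add_untyped would go unreported.
results = (add_untyped(1, 2), add_typed(1, 2))
```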

Step 1 - Enable the error and override in all the offending files.

#131429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131428
Approved by: https://github.com/justinchuby, https://github.com/oulgen
2024-07-23 21:50:55 +00:00
e3ca4e79e1 Fix mypy errors introduced by #131400 (#131522)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131522
Approved by: https://github.com/zou3519, https://github.com/eellison
2024-07-23 21:25:21 +00:00
c9e74449f3 bump executorch commit pin. (#131486)
Summary: as title. Target commit: 6153b1bf7b

Test Plan: CI

Differential Revision: D60125590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131486
Approved by: https://github.com/huydhn
2024-07-23 21:25:07 +00:00
8a890b72dc [BE] Get rid of missing destructor override warning (#131204)
Regression introduced by https://github.com/pytorch/pytorch/pull/126376

Before this change, compiling torch_cpu on my MacBook prints tons of warnings every time HooksInterface is included
```
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/src/optim/adamw.cpp:1:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/optim/adamw.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/module.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/container/any_module_holder.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/container/any_value.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/detail/static.h:4:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/types.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/ATen.h:7:
In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/Context.h:13:
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/detail/HIPHooksInterface.h:27:11: warning: '~HIPHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
  virtual ~HIPHooksInterface() = default;
          ^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:16:11: note: overridden virtual function is here
  virtual ~AcceleratorHooksInterface() = default;
          ^
1 warning generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131204
Approved by: https://github.com/albanD, https://github.com/seemethere
2024-07-23 21:02:14 +00:00
4eee2e7a6d [operator_benchmark] Remove TARGETS from broken benchmarks (#131460)
Summary:
Remove operator_benchmark caffe2 build due to the removal of caffe2: 2fd75667b4

Plus, we are deleting the TARGETS file from broken benchmarks that we do not intend to maintain.

Test Plan: Sandcastle CI

Differential Revision: D60086216

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131460
Approved by: https://github.com/vmpuri
2024-07-23 20:06:08 +00:00
8497930766 Revert "[CI][dashboard] Collect PT2 cpu perf nightly (#131369)"
This reverts commit 9851c7313d118517d21a112960044e0fdbf560b1.

Reverted https://github.com/pytorch/pytorch/pull/131369 on behalf of https://github.com/atalman due to Sorry need to revert looks like , please run ciflow/inductor looks like this caused failure in [pytorch/pytorch/actions/runs/10058412015/job/27802257096](https://github.com/pytorch/pytorch/actions/runs/10058412015/job/27802257096) ([comment](https://github.com/pytorch/pytorch/pull/131369#issuecomment-2246142022))
2024-07-23 19:41:49 +00:00
d4e3fd613c Revert "[CI] Relax config name matching for cpu inductor tests (#131467)"
This reverts commit aa54bcb6d25fc7c9ac23b82b74ea45f03033c8b2.

Reverted https://github.com/pytorch/pytorch/pull/131467 on behalf of https://github.com/atalman due to Sorry need to revert looks like https://github.com/pytorch/pytorch/pull/131369 broke inductor tests ([comment](https://github.com/pytorch/pytorch/pull/131467#issuecomment-2246136839))
2024-07-23 19:38:35 +00:00
7b82ed2d59 Delete very old misleading info from .ci README (#131502)
I think there is no way to salvage that by updating, so deleting.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131502
Approved by: https://github.com/seemethere, https://github.com/atalman
2024-07-23 19:27:36 +00:00
93fdd0237d Ensure staticmethods can be allowed in graph (#130882)
Fixes https://github.com/pytorch/pytorch/issues/124735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130882
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
2024-07-23 18:59:19 +00:00
faddb0f30c [NestedTensor] Integrate the mean operator along the jagged dimension into NestedTensor (#131132)
Summary:
Modify the existing `mean` operator in PyTorch, invoked by `torch.mean`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff enables PyTorch users to invoke `torch.mean` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` nested tensor.

Parametrize unit tests from `sum` to verify the accuracy of the ragged reduction implementation for `torch.mean`. Add unit tests and parametrize `sum` unit tests to verify error handling for unsupported features in `NestedTensor` `torch.mean`.
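
What a reduction along the ragged dimension computes can be sketched in plain Python. This is only a reference model using nested lists; `jagged_mean` is a hypothetical name and the snippet does not use the actual `NestedTensor` implementation.

```python
from typing import List


def jagged_mean(batch: List[List[List[float]]]) -> List[List[float]]:
    """Mean over the ragged dimension `*` of a (B, *, M) batch.

    Each batch entry is a list of L_i rows of length M, where L_i may
    differ per entry. The result has shape (B, M).
    """
    out = []
    for seq in batch:
        if not seq:
            raise ValueError("cannot take the mean of an empty sequence")
        m = len(seq[0])
        out.append([sum(row[j] for row in seq) / len(seq) for j in range(m)])
    return out


batch = [
    [[1.0, 2.0], [3.0, 4.0]],                # L_0 = 2
    [[10.0, 20.0]],                          # L_1 = 1
    [[0.0, 0.0], [3.0, 6.0], [6.0, 12.0]],   # L_2 = 3
]
result = jagged_mean(batch)  # one length-M row per batch entry
```

Each batch entry is divided by its own length, which is what distinguishes this from a mean over a padded dense tensor.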

Test Plan:
Verify that the new unit test passes via the following command:
```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_mean
```

```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_jagged_op
```

Differential Revision: D59654668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131132
Approved by: https://github.com/davidberard98, https://github.com/jbschlosser
2024-07-23 18:48:34 +00:00
120ca23a1f Fix IMAs in Flash-Attention splitkv kernel (#131277)
# Summary

While debugging CI failures for flash_attention tests I stumbled across 2 IMAs for the split-kv variant of flash attention.
1. Illegal global memory writes during the writing of softmax_lse_accum. This was pinpointed to the temporary lifetime of out_accum and softmax_lse_accum. These were likely getting their refcount dropped **before** the kernel launch that used them, allowing them to potentially get reused for other allocations.
2. After debugging this, there were illegal writes in the combine kernel. I was able to pinpoint this to the write of the reduced LSE. From my understanding, it was assuming that kBlockM evenly divides the global number of rows and wasn't masking out these writes.
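
The second issue can be modeled in plain Python. This is a toy model of a blocked write loop, not the CUDA kernel; `rows_written` and its parameters are hypothetical names, with `block_m` standing in for the per-block row count.

```python
def rows_written(num_rows: int, block_m: int, masked: bool) -> list:
    """Return the row indices a grid of fixed-size blocks would write."""
    num_blocks = (num_rows + block_m - 1) // block_m  # ceil division
    written = []
    for block in range(num_blocks):
        for lane in range(block_m):
            row = block * block_m + lane
            if masked and row >= num_rows:
                continue  # bounds guard: skip lanes past the last row
            written.append(row)
    return written


# 5 rows with blocks of 4: the last block only has one valid row.
unmasked = rows_written(5, 4, masked=False)
guarded = rows_written(5, 4, masked=True)
out_of_bounds = [r for r in unmasked if r >= 5]
```

When the row count is not a multiple of the block size, the unguarded variant touches rows past the end of the buffer. Guarding the write and padding the buffer up to a multiple of the block size are two ways to avoid that.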

### History
My line of thinking for this:

We create the temporary split accum + LSE stats tensors to store the data for each split. We then launch a follow up kernel to do the reduction.

Under ordinary non roofline memory usage the cuda memory caching allocator will keep these allocations alive even though the tensors were created within a temporary scope and no longer have any live references.

On CI we often run near max memory usage. We change/add tests and suddenly we get close to oom threshold. The memory allocator will reap these segments and we get write after free errors.

After that fix I got further, past the splitkv_flash kernel, and then got the following error:

``` Shell
❯ TORCH_DISABLE_ADDR2LINE=1 PYTORCH_NO_CUDA_MEMORY_CACHING=1  compute-sanitizer --show-backtrace=device --tool memcheck --log-file ima.txt python ima.py

softmax_lseaccum_ptr =0x7f5ebb208a00
oaccum_ptr =0x7f5ebb208c00
softmax_lse_ptr = 0x7f5ebb208800
❯
❯ head ima.txt -n 10
========= COMPUTE-SANITIZER
========= Invalid __global__ write of size 4 bytes
=========     at void pytorch_flash::flash_fwd_splitkv_combine_kernel<pytorch_flash::Flash_fwd_kernel_traits<(int)32, (int)64, (int)256, (int)4, (bool)0, (bool)0, cutlass::bfloat16_t, pytorch_flash::Flash_kernel_traits<(int)32, (int)64, (int)256, (int)4, cutlass::bfloat16_t>>, (int)16, (int)1, (bool)1>(pytorch_flash::Flash_fwd_params)+0x630
=========     by thread (2,0,0) in block (0,0,0)
=========     Address 0x7f5ebb208804 is out of bounds
=========     and is 1 bytes after the nearest allocation at 0x7f5ebb208800 of size 4 bytes
```

Okay, I looked at the address and it looks like we are writing consecutive bytes past the softmax_lse_ptr from the combine func. I tried padding out the softmax_lse to q_padded and got no more illegal memory errors on my repro:
```
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
```

Fixes https://github.com/pytorch/pytorch/issues/131240
Fixes https://github.com/pytorch/pytorch/issues/131227
Fixes https://github.com/pytorch/pytorch/issues/131221

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131277
Approved by: https://github.com/malfet
2024-07-23 18:26:49 +00:00
f75d724482 Updating Types in torch/_dynamo/utils.py (#131001)
Adds some type annotations to the torch/_dynamo/utils.py file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131001
Approved by: https://github.com/aorenste
2024-07-23 18:25:52 +00:00
aa54bcb6d2 [CI] Relax config name matching for cpu inductor tests (#131467)
Summary: Matching *cpu* instead of *cpu_inductor* should be sufficient. This fixes torchbench test failures in https://github.com/pytorch/pytorch/pull/131369.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131467
Approved by: https://github.com/zou3519
2024-07-23 18:24:29 +00:00
94f22eb6b2 refactor post-trace fakification in strict (#131421)
Summary:
Previously it was unclear what `_convert_input_to_fake` actually does (used in strict), and in particular how it is different from `make_fake_inputs` (used in non-strict).

This PR splits that function to work purely on user inputs, then renames it to `extract_fake_inputs` and adds a comment clarifying what it does—namely, it extracts fake inputs from a given graph module instead of "converting inputs to fake inputs" (as suggested by the current name) or "making fake inputs" (as happens in non-strict, where no tracing has taken place yet).

The remainder of that function used to also fakify params and buffers. It turns out that this part is identical to what happens in non-strict, hence we also pull `make_fake_inputs` out from `non_strict_utils` into `_trace`, merge it with another util, and make both modes call it.

Test Plan: existing tests

Differential Revision: D60084442

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131421
Approved by: https://github.com/zhxchen17
2024-07-23 18:23:03 +00:00
f85c35872b Remove GraphModuleOpUpgrader in _export.serde.upgrade.py (#131373)
Summary: Remove GraphModuleOpUpgrader in _export.serde.upgrade.py and the file

Test Plan: contbuild & OSS CI

Differential Revision: D60067937

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131373
Approved by: https://github.com/angelayi
2024-07-23 18:09:44 +00:00
22906be8f0 Do not abort on SPARSE_STATUS_INVALID_VALUE (#130382)
Summary:
Newer versions of the MKL library return `SPARSE_STATUS_INVALID_VALUE` when badly formed non-triangular matrices are passed to the `mkl_sparse_?_trsv`/`mkl_sparse_?_mrsv` functions. This would start aborting (badly written) tests that worked with the old version which just filled the result tensor with `-NaN` instead of returning an error status.

This changes the code to fill the result tensor with `-NaN` on `SPARSE_STATUS_INVALID_VALUE` so we get the same behavior regardless of the MKL version in use.

Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test:sparse -- --run-disabled`

Differential Revision: D59542023

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130382
Approved by: https://github.com/malfet
2024-07-23 18:09:36 +00:00
cfb9ccab6c [export] Filter errors by exception type, add case name (#131327)
Summary:
- Log export errors to Scuba and mark them with "classified" and "unclassified"
- Classify errors by exception type (ALLOW_LIST) and a `case_name` attribute
- Add `case_name` for some exceptions.
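
A minimal sketch of this kind of classification, under stated assumptions: the contents of `ALLOW_LIST`, the `ExportCaseError` class, and the `classify` function are all hypothetical and only illustrate the two criteria listed above.

```python
# Hypothetical allow list; the real one is not shown in this description.
ALLOW_LIST = (ValueError, TypeError)


class ExportCaseError(RuntimeError):
    """Illustrative exception that carries a `case_name` attribute."""

    def __init__(self, msg, case_name=None):
        super().__init__(msg)
        self.case_name = case_name


def classify(exc: BaseException) -> str:
    # An error counts as classified if its type is allow-listed
    # or if it names a known case via `case_name`.
    if isinstance(exc, ALLOW_LIST):
        return "classified"
    if getattr(exc, "case_name", None) is not None:
        return "classified"
    return "unclassified"
```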

Test Plan:
Running the code below logs a classified error to `torch_export_usage` table in Scuba.

```
import torch

from torch._export.db.case import SupportLevel

class TorchSymMin(torch.nn.Module):
    """
    torch.sym_min operator is not supported in export.
    """

    def forward(self, x):
        return x.sum() + torch.sym_min(x.size(0), 100)

example_args = (torch.randn(3, 2),)
tags = {"torch.operator"}
support_level = SupportLevel.NOT_SUPPORTED_YET
model = TorchSymMin()

torch.export.export(model, example_args)
```

Differential Revision: D59981459

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131327
Approved by: https://github.com/zhxchen17
2024-07-23 18:01:13 +00:00
6b8ec2b371 Revert "[triton_op] fix autotuning (#131363)"
This reverts commit 154f27455a62314dfb689f1fe13c0cfd52490339.

Reverted https://github.com/pytorch/pytorch/pull/131363 on behalf of https://github.com/ZainRizvi due to This was a tricky one, but looking at the code it's the change to torch/fx/node.py that triggered the type violation errors. Reverting since this is now breaking trunk ([comment](https://github.com/pytorch/pytorch/pull/131363#issuecomment-2245899858))
2024-07-23 18:01:09 +00:00
3fe72e0c2e [4/N] Non-Tensor: Support layout, device and dtype for aten operations (#125897)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125897
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-23 17:50:17 +00:00
68c725a094 [custom ops] Add register_vmap for custom ops (#130589)
Fixes #130284
Fixes #130653

- Add `torch.library.register_vmap` to custom ops
- Add `register_vmap` for operators in custom_op_db.
- Make `torch.autograd.Function` support kwarg-only kwargs for vmap
- test operators in op_db with `tests/test_vmap`.
- change `test_vmap` to allow custom `out_dim` and allow "None" in `out_dim` when testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589
Approved by: https://github.com/zou3519
2024-07-23 17:48:38 +00:00
404d640c39 [1/2] PT2 Inductor ComboKernels - Foreach cases (#124969)
Summary:
A ComboKernel combines independent Inductor Triton kernels into a single one.
Consolidation with Foreach kernel:
1) For the scheduler node, the logic is consolidated into ForeachKernelSchedulerNode
2) The backend kernel is consolidated into ComboKernel.

(Note: this is part 1 which only deals with the 1st case above.)

Details:

1. ComboKernel can be viewed as an extension of the Foreach kernel (see the examples below). The main differences are: 1) the block size is tunable (but currently shared by the sub-kernels), 2) it supports multiple kernel types, like pointwise and reduce, and may extend to matmul as well (it doesn't support mixed 1d and 2d kernels yet, but it can be extended for such cases), 3) the blocks are interleaved among the sub-kernels (can be extended to other arrangements), 4) it is designed to be general enough to combine kernels without dependency and doesn't rely on certain patterns, 5) it doesn't support dynamic sizes yet but can be easily extended for them.

2. ComboKernel is used in two cases: 1) for existing foreach kernels, combo kernels are used as the backend kernel. the front-end kernel generation logic remains the same. 2) Added an extra optimization phase to the end of the scheduler to generate extra combo kernels if combo_kernels is True in config.py

3. The combo kernel generation in the added optimization phase is done in two steps: 1) in the front end inside the scheduler, it topologically sorts the schedule nodes to find all the nodes with no data dependency and creates a front-end schedule node for them. We currently limit the maximal number of sub-nodes for each combo kernel to 8 (but we still need to find the optimal number). 2) then, these sub-nodes are combined in the codegen phase to generate the combo kernel code for them based on a few rules. For example, 1d and 2d kernels are separated into different combo kernels, as mixing them is not supported yet. Note that the algorithms we provide are very basic, and users can register their customized combo kernel generation algorithms for both steps.

4. Performance-wise, combining small kernels almost always yields a performance gain. However, combining very large kernels may not yield any gain, and sometimes even causes a regression, possibly due to improper block sizes. Thus, a benchmark function is implemented to avoid such perf regressions, and it is recommended to turn it on by setting benchmark_combo_kernels to True whenever combo_kernels is True.
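
The front-end grouping step in item 3 can be sketched roughly as follows. This is an illustrative stand-in, not the scheduler's algorithm: it groups nodes by topological depth (nodes at the same depth cannot depend on each other) and caps each group at 8. The function name and the dependency-map input format are assumptions.

```python
from typing import Dict, List, Set

MAX_SUBKERNELS = 8  # the cap quoted in the description above


def group_independent(
    deps: Dict[str, Set[str]], max_size: int = MAX_SUBKERNELS
) -> List[List[str]]:
    """Group nodes of a DAG so that nodes in one group are independent.

    `deps[n]` is the set of nodes that `n` reads from; every dependency
    must itself be a key of `deps`, and the graph must be acyclic.
    """
    level: Dict[str, int] = {}

    def depth(node: str) -> int:
        # Depth 0 for nodes without inputs, else 1 + deepest input.
        if node not in level:
            level[node] = 1 + max((depth(d) for d in deps[node]), default=-1)
        return level[node]

    by_level: Dict[int, List[str]] = {}
    for node in deps:
        by_level.setdefault(depth(node), []).append(node)

    groups: List[List[str]] = []
    for lvl in sorted(by_level):
        nodes = by_level[lvl]
        # Split large levels into chunks of at most `max_size` nodes.
        for i in range(0, len(nodes), max_size):
            groups.append(nodes[i : i + max_size])
    return groups


example = {"a": set(), "b": set(), "c": {"a"}, "d": {"a", "b"}, "e": {"c"}}
example_groups = group_independent(example)
```

Grouping by depth is conservative: it misses some independent pairs at different depths, which is one reason a pluggable grouping algorithm is useful.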

Example:
- element wise kernels
original Pytorch function:
```
 def test_activations(a, b, c):
     a1 = torch.nn.functional.relu(a)
     b1 = torch.nn.functional.sigmoid(b)
     c1 = torch.nn.functional.tanh(c)
     return a1, b1, c1
```
combokernel
```
@triton_heuristics.pointwise(
    size_hints=[512], tile_hint=TileHint.DEFAULT,
    filename=__file__,
    triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32', 3: '*fp32', 4: '*fp32', 5: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2, 3, 4, 5), equal_to_1=())]},
    inductor_meta={'kernel_name': 'triton_poi_fused_0', 'mutated_arg_names': []}
)
@triton.jit
def triton_(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, XBLOCK : tl.constexpr):
    pid = tl.program_id(0)
    if pid % 3 == 0:
        pid_offset = pid // 3
        xnumel = 100
        rnumel = 1
        xoffset = pid_offset * XBLOCK
        xindex = xoffset + tl.arange(0, XBLOCK)[:]
        xmask = xindex < xnumel
        x0 = xindex
        tmp0 = tl.load(in_ptr0 + (x0), xmask)
        tmp1 = triton_helpers.maximum(0, tmp0)
        tl.store(out_ptr0 + (x0), tmp1, xmask)
    elif pid % 3 == 1:
        pid_offset = pid // 3
        xnumel = 400
        rnumel = 1
        xoffset = pid_offset * XBLOCK
        xindex = xoffset + tl.arange(0, XBLOCK)[:]
        xmask = xindex < xnumel
        x1 = xindex
        tmp2 = tl.load(in_ptr1 + (x1), xmask)
        tmp3 = tl.sigmoid(tmp2)
        tl.store(out_ptr1 + (x1), tmp3, xmask)
    elif pid % 3 == 2:
        pid_offset = pid // 3
        xnumel = 100
        rnumel = 1
        xoffset = pid_offset * XBLOCK
        xindex = xoffset + tl.arange(0, XBLOCK)[:]
        xmask = xindex < xnumel
        x2 = xindex
        tmp4 = tl.load(in_ptr2 + (x2), xmask)
        tmp5 = libdevice.tanh(tmp4)
        tl.store(out_ptr2 + (x2), tmp5, xmask)
    else:
        pass
```
- reduction kernels
Original Pytorch function:
```
def test_reduce(a, b, c):
     a1 = torch.sum(a, dim=0)
     b1 = torch.max(b, dim=0)
     c1 = torch.min(c, dim=0)
     return a1, b1, c1
```
Generated combokernel:
```
 @triton_heuristics.persistent_reduction(
     size_hints=[32, 32],
     reduction_hint=ReductionHint.DEFAULT,
     filename=__file__,
     triton_meta={'signature': {0: '*fp32', 1: '*fp32', 2: '*fp32', 3: '*fp32', 4: '*i64', 5: '*fp32', 6: '*i64', 7: '*fp32'}, 'device': 0, 'device_type': 'cuda', 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1, 2, 3, 4, 5, 6, 7), equal_to_1=())]},
     inductor_meta={'kernel_name': 'triton_per_fused_0', 'mutated_arg_names': []}
 )
 @triton.jit
 def triton_(in_ptr0, in_ptr1, in_ptr2, out_ptr0, out_ptr1, out_ptr2, out_ptr3, out_ptr4, XBLOCK : tl.constexpr):
     pid = tl.program_id(0)
     if pid % 3 == 0:
         pid_offset = pid // 3
         xnumel = 20
         rnumel = 20
         RBLOCK_0: tl.constexpr = 32
         xoffset = pid_offset * XBLOCK
         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
         xmask = xindex < xnumel
         rindex = tl.arange(0, RBLOCK_0)[None, :]
         roffset = 0
         rmask = rindex < rnumel
         r1 = rindex
         x0 = xindex
         tmp0 = tl.load(in_ptr0 + (x0 + (20*r1)), rmask & xmask, other=0.0)
         tmp1 = tl.broadcast_to(tmp0, [XBLOCK, RBLOCK_0])
         tmp3 = tl.where(rmask & xmask, tmp1, float("-inf"))
         tmp4 = triton_helpers.max2(tmp3, 1)[:, None]
         tmp6 = tl.broadcast_to(rindex, tmp3.shape)
         _, tmp5_tmp = triton_helpers.max_with_index(tmp3, tmp6, 1)
         tmp5 = tmp5_tmp[:, None]
         tl.store(out_ptr0 + (x0), tmp4, xmask)
         tl.store(out_ptr1 + (x0), tmp5, xmask)
     elif pid % 3 == 1:
         pid_offset = pid // 3
         xnumel = 10
         rnumel = 10
         RBLOCK_1: tl.constexpr = 16
         xoffset = pid_offset * XBLOCK
         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
         xmask = xindex < xnumel
         rindex = tl.arange(0, RBLOCK_1)[None, :]
         roffset = 0
         rmask = rindex < rnumel
         r3 = rindex
         x2 = xindex
         tmp7 = tl.load(in_ptr1 + (x2 + (10*r3)), rmask & xmask, other=0.0)
         tmp8 = tl.broadcast_to(tmp7, [XBLOCK, RBLOCK_1])
         tmp10 = tl.where(rmask & xmask, tmp8, float("inf"))
         tmp11 = triton_helpers.min2(tmp10, 1)[:, None]
         tmp13 = tl.broadcast_to(rindex, tmp10.shape)
         _, tmp12_tmp = triton_helpers.min_with_index(tmp10, tmp13, 1)
         tmp12 = tmp12_tmp[:, None]
         tl.store(out_ptr2 + (x2), tmp11, xmask)
         tl.store(out_ptr3 + (x2), tmp12, xmask)
     elif pid % 3 == 2:
         pid_offset = pid // 3
         xnumel = 10
         rnumel = 10
         RBLOCK_2: tl.constexpr = 16
         xoffset = pid_offset * XBLOCK
         xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
         xmask = xindex < xnumel
         rindex = tl.arange(0, RBLOCK_2)[None, :]
         roffset = 0
         rmask = rindex < rnumel
         r5 = rindex
         x4 = xindex
         tmp14 = tl.load(in_ptr2 + (x4 + (10*r5)), rmask & xmask, other=0.0)
         tmp15 = tl.broadcast_to(tmp14, [XBLOCK, RBLOCK_2])
         tmp17 = tl.where(rmask & xmask, tmp15, 0)
         tmp18 = tl.sum(tmp17, 1)[:, None]
         tl.store(out_ptr4 + (x4), tmp18, xmask)
     else:
         pass
```

Note: ComboKernels uses masks to allow combination of kernels working with tensors of different sizes.
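
The `pid % 3` dispatch and the `xmask` guard in the generated kernels above can be modeled in plain Python. This is a simulation for illustration only; `dispatch` and `grid_size` are hypothetical helpers, and the grid size formula (number of sub-kernels times the largest per-kernel block count) is an assumption of this sketch.

```python
def grid_size(xnumels, xblock):
    # Assumed launch size: every sub-kernel gets as many program ids
    # as the largest sub-kernel needs.
    blocks = max((x + xblock - 1) // xblock for x in xnumels)
    return len(xnumels) * blocks


def dispatch(pid, xnumels, xblock):
    """Return (sub_kernel, element indices) handled by one program id."""
    n = len(xnumels)
    sub = pid % n            # which sub-kernel this program runs
    pid_offset = pid // n    # block index within that sub-kernel
    xnumel = xnumels[sub]
    xoffset = pid_offset * xblock
    # Equivalent of `xmask = xindex < xnumel`: drop out-of-range lanes.
    return sub, [i for i in range(xoffset, xoffset + xblock) if i < xnumel]


xnumels = [100, 400, 100]   # sizes taken from the pointwise example
xblock = 64                 # illustrative block size
covered = {k: [] for k in range(len(xnumels))}
for pid in range(grid_size(xnumels, xblock)):
    sub, idx = dispatch(pid, xnumels, xblock)
    covered[sub].extend(idx)
```

Programs assigned to a small sub-kernel beyond its last block simply get an empty mask, which is what lets tensors of different sizes share one launch.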

Test Plan:
```
buck2 test mode/dev-nosan caffe2/test/inductor:foreach
```
```
buck2 test mode/dev-nosan caffe2/test/inductor:combo_kernels
```

Differential Revision: D54134695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124969
Approved by: https://github.com/mlazos
2024-07-23 17:34:28 +00:00
979429ca89 [inductor]Add DtypeView to avoid memory leak and unnecessary kernel generations (#128883)
Fixes #126338
## Issue Summary

When torchinductor compiles the combination `functional_collective -> view.dtype -> wait`, a memory leak occurs. This happens because `view.dtype` is compiled into an out-of-place Triton kernel that copies the input data to a new tensor, even if the data hasn't completed collection via the wait operation. The tensor used by `collective` is only freed when the `wait` operation triggers the garbage collector, see [~WorkRegistry](https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/Functional.cpp#L41). However, since `wait` now waits for a new tensor, the previous one is never freed. The `view.dtype` should only check the metadata instead of creating a new tensor. The current lowering is against its semantics and causes memory leaks.

See more great discussions in the #126338

This kind of lowering also generates unnecessary triton kernels for `view.dtype` when it can't be fused with other operations.

## Fix
The function `aten.view.dtype` is a CPU operation that changes the metadata of its input. After discussions with @eellison and @bdhirsh, we decided to change the lowering of `aten.view.dtype` to ensure it fallback properly to the correct `aten.view.dtype` instead of generating a Triton kernel in some cases. This approach also preserves the same semantics of the view operation.
When the model calls `aten.view.dtype` with a data type whose bit width matches the input's original data type, we lower it to the newly added `DtypeView` in IR, acting like a `ReinterpretView`. When the operation can be fused, its `make_loader` is called to maintain the correct type conversion for each load instruction. When the operation can't be fused, it falls back to `aten.view.dtype` to avoid Triton kernel generation.
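
The semantics that the lowering has to preserve (a same-bit-width dtype view reinterprets existing memory instead of copying it) can be shown with a standard-library analogy. This uses `array` and `memoryview.cast`, not Inductor or tensors, and is only meant to illustrate the view-versus-copy distinction.

```python
from array import array

storage = array("H", [1, 65535])      # two unsigned 16-bit values
raw = memoryview(storage).cast("B")   # byte view over the same buffer
signed = raw.cast("h")                # reinterpret as signed 16-bit, no copy

before = list(signed)                 # same bits read under the new type

signed[0] = -2                        # write through the reinterpreted view
after = storage[0]                    # the original storage sees the write
```

Because no new buffer is created, nothing extra has to be freed and no kernel is needed just to change the dtype; only the interpretation of the bytes changes.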

## Example

```python
@torch.compile
def fn(x, y):
    x = x.view(torch.float16)
    y = y.view(torch.float16) + 1
    return x @ y

x = torch.randn((2, 2), device=self.device, dtype=torch.bfloat16)
y = torch.randn((2, 2), device=self.device, dtype=torch.bfloat16)
fn(x, y)
```
The output code generated before this fix is like the following.
```python
triton_poi_fused_add_view_0...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
    tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)
    tl.store(out_ptr0 + (x0), tmp1, xmask)

triton_poi_fused_add_view_1...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
    tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)
    tmp2 = 1.0
    tmp3 = tmp1 + tmp2
    tl.store(out_ptr0 + (x0), tmp3, xmask)

def call(args):
...
        triton_poi_fused_view_0.run(arg0_1, buf0, 4, grid=grid(4), stream=stream0)
        del arg0_1
        buf1 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
        # Source Nodes: [view_1, y], Original ATen: [aten.add, aten.view]
        triton_poi_fused_add_view_1.run(arg1_1, buf1, 4, grid=grid(4), stream=stream0)
        del arg1_1
        buf2 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
        # Source Nodes: [matmul, view_1, x, y], Original ATen: [aten.add, aten.mm, aten.view]
        extern_kernels.mm(buf0, buf1, out=buf2)
```
As you can see, the two `view` operations are compiled to two kernels, `triton_poi_fused_view_0` and `triton_poi_fused_add_view_1`. Both of them have a line `tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)` which does the type conversion.

The main issue is that the first `view` operation doesn't do anything to the actual data, yet it generates a Triton kernel with a new output tensor. Another small issue is that this Triton kernel can't be compiled because `bitcast=True` only supports type conversion between types with the same bit width.

The following are output code generated after this PR.

```python
triton_poi_fused_add_0...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
    tmp1 = tmp0.to(tl.bfloat16).to(tl.float32)
    tmp2 = 1.0
    tmp3 = tmp1 + tmp2
    tl.store(out_ptr0 + (x0), tmp3, xmask)
def call(args):
...
        triton_poi_fused_add_0.run(arg1_1, buf0, 4, grid=grid(4), stream=stream0)
        del arg1_1
        buf1 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
        # Source Nodes: [matmul, y], Original ATen: [aten.add, aten.mm]
        extern_kernels.mm(aten.view.dtype(arg0_1, torch.float16), buf0, out=buf1)
```
The first `view` operation has been replaced with the `aten.view.dtype` and it is directly passed as an argument. The second one is still there because it is fused with the following add operation. The invalid bitcast operation is removed too.

The following two code snippets are for the upcasts and downcasts. For dtypes in `torch.float16, torch.bfloat16`, each load is upcast to float32 and then downcast to its original dtype to ensure values with the right precision are used.

7bda23ef84/torch/_inductor/codegen/triton.py (L1725-L1726)
7bda23ef84/torch/_inductor/codegen/triton.py (L629-L642)

Huge thanks to @eellison, @bdhirsh, @shunting314, and @desertfire .

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128883
Approved by: https://github.com/eellison
2024-07-23 17:31:39 +00:00
f93a6a4d31 Add mypy typing to torch_version.py (#131447)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131447
Approved by: https://github.com/angelayi
ghstack dependencies: #131434
2024-07-23 17:31:07 +00:00
eab1595ce2 [dynamo] Delete wrong assertion in bind_args (#131405)
Fix - https://github.com/pytorch/pytorch/issues/130537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131405
Approved by: https://github.com/williamwen42, https://github.com/yanboliang
ghstack dependencies: #131347, #131367, #131378, #131389
2024-07-23 17:28:05 +00:00
e4b5645f83 Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633)"
This reverts commit 5b5e0698a5f560decb9bbdd150ed7b0622eb7777.

Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to breaking a lot of jobs and build rules internally D60085885, possibly needs to update some bazel build? ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2245806738))
2024-07-23 17:19:34 +00:00
f7754c6dc5 Run pull jobs with new AMI (#131250)
Migrate all pull jobs to the new Amazon 2023 AMI runner type.

Exceptions:
- Distributed tests are still on the old AMI since they had some weird [test failures](https://github.com/pytorch/pytorch/actions/runs/10047579686/job/27770963175). Will debug those separately.
- Ported over a couple trunk and slow jobs that had `sync-tag`s set with the pull jobs and so needed to be on the same AMI

Revert plan, in case something starts breaking when we run these new AMIs at a larger scale:
- If specific jobs start failing consistently, we bring those jobs back to the old AMI
- If the failure is more widespread, revert this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131250
Approved by: https://github.com/malfet, https://github.com/atalman
2024-07-23 17:17:12 +00:00
5f0b65bee7 Revert "Replace manual parsing of "TMPDIR", "TMP", "TEMP" and "TEMPDIR" with std::filesystem::temp_directory_path() (#130842)"
This reverts commit d33804f8b6e2ea38f8446826a16be13ce4f9b71e.

Reverted https://github.com/pytorch/pytorch/pull/130842 on behalf of https://github.com/clee2000 due to breaking some builds internally D60085710, Im not sure what the logs mean but I think its something about build size ([comment](https://github.com/pytorch/pytorch/pull/130842#issuecomment-2245799309))
2024-07-23 17:15:06 +00:00
4ca8705035 Add mypy typing to fx node (#131434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131434
Approved by: https://github.com/zou3519
2024-07-23 17:00:31 +00:00
ded5bdb0de Use inductor TestCase for test_replicate_with_compiler.py (#131053)
Summary: `test/distributed/_composable/test_replicate_with_compiler.py` torch.compiles. This change introduces a version of MultiProcessTestCase that derives from the inductor TestCase class to make sure we always get a clean cache dir.

Test Plan: `python test/distributed/_composable/test_replicate_with_compiler.py`

Differential Revision: [D59925519](https://our.internmc.facebook.com/intern/diff/D59925519)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131053
Approved by: https://github.com/eellison
2024-07-23 16:59:55 +00:00
a5ad02d05d Remove MacOS M2 14 runner from MacMPS job (#131465)
As it's been dead for 2+ weeks and causing queuing issues

<img width="760" alt="image" src="https://github.com/user-attachments/assets/4e806cae-3a67-4acb-b84f-1a9131d2a859">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131465
Approved by: https://github.com/seemethere, https://github.com/atalman
2024-07-23 16:51:42 +00:00
c1ef214046 Print ExportedProgram without color by default (#131399)
Summary:
Without plugin, colored ExportedProgram is not really readable.

![image](https://github.com/user-attachments/assets/319920a9-bb4b-4ad2-bcac-0c4f76973b11)

Test Plan: CI

Differential Revision: D60074481

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131399
Approved by: https://github.com/angelayi
2024-07-23 16:41:55 +00:00
db376fb643 Ensure non-contiguous indices are handled (#131430)
The unaligned inputs checker was built on the assumption that static indices form a contiguous range (i.e. 0, 1, 2),
but the new changes with nn module inlining break this assumption.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131430
Approved by: https://github.com/anijain2305
2024-07-23 16:37:55 +00:00
4f0497c747 Divorce triton and pt2 remote caching (#131345)
Now that remote caching has evolved into various parts of PT2, we want to separate triton and pt2 caching as changes to one have caused SEVs to the other.

Differential Revision: D60047752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131345
Approved by: https://github.com/aorenste
2024-07-23 16:28:12 +00:00
154f27455a [triton_op] fix autotuning (#131363)
The problem was we were shoving SymInts into the constant_args side
table. The root problem is that torch.fx.node.base_types, which we use
to determine what can be put in the graph, doesn't actually have SymInt
in it. This PR fixes base_types to include SymInt.

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131363
Approved by: https://github.com/oulgen
2024-07-23 16:15:00 +00:00
3aa45cae77 [export] Removed deprecated dialect field from EP schema. [2/2] (#131344)
Summary: Not landable until we've updated the pin of executorch.

Test Plan: CI

Reviewed By: SherlockNoMad

Differential Revision: D59759620

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131344
Approved by: https://github.com/SherlockNoMad, https://github.com/ydwu4
2024-07-23 16:05:10 +00:00
b61600f6cc [pytorch] fix the leak for pinned memory when using _create_cpu_state… (#131270)
When both pin_memory and share_memory are set to True in _create_cpu_state_dict, the memory is pinned using cudaHostRegister but is never unpinned. So once a tensor is created and freed, the caching allocator can hand the same memory to a newly created tensor, which fails with the error below.

```
obj = <[RuntimeError('CUDA error: part or all of the requested memory range is already mapped\nCUDA kernel errors might be a...pile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n') raised in repr()] Tensor object at 0x7f0028a4d6c0> pg = None, device = None, _ = None
```

This PR fixes this by unregistering this memory on tensor free by attaching a hook.

This is easily reproducible with xlformers checkpointing unit tests and the fix is verified with the same.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131270
Approved by: https://github.com/LucasLLC
2024-07-23 15:47:21 +00:00
1e86387871 Revert "Support IPC for Expandable Segments (#130890)"
This reverts commit 32c2f84e349ad6e34b8559d3f1f9c27020ae702f.

Reverted https://github.com/pytorch/pytorch/pull/130890 on behalf of https://github.com/zdevito due to variable shadowing broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/130890#issuecomment-2245456085))
2024-07-23 14:46:28 +00:00
f064dac588 [CI] change xpu ci build runner type to reduce build time (#130922)
The current XPU build sometimes takes 2+ hours, so change the build runner to `linux.12xlarge` to reduce build time. Works for #114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130922
Approved by: https://github.com/atalman
2024-07-23 14:45:30 +00:00
6bbef2a06b [dynamo] Support set on KeysView (#131389)
Fixes https://github.com/pytorch/pytorch/issues/129664

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131389
Approved by: https://github.com/mlazos
ghstack dependencies: #131347, #131367, #131378
2024-07-23 14:15:26 +00:00
e7c5e06772 [dynamo] Support __contains__ on __dict__ on UserDefinedClassVariable (#131378)
Fixes https://github.com/pytorch/pytorch/issues/129665

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131378
Approved by: https://github.com/mlazos
ghstack dependencies: #131347, #131367
2024-07-23 14:15:26 +00:00
0bc5e26067 [dynamo] Support dict conversion of objects derived from MutableMapping (#131367)
Fixes - https://github.com/pytorch/pytorch/issues/129662

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131367
Approved by: https://github.com/williamwen42
ghstack dependencies: #131347
2024-07-23 14:15:20 +00:00
a944cce5b8 [dynamo] Support if callable on list (#131347)
Fixes https://github.com/pytorch/pytorch/issues/130720

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131347
Approved by: https://github.com/williamwen42, https://github.com/mlazos
2024-07-23 14:15:15 +00:00
250cdb2ac7 Fix cuda_half_test.cu (#131416)
$\operatorname{atanh}(1.0)$ is $\infty$ (see https://www.mathworks.com/help/matlab/ref/atanh.html ) and the difference between two infinities is NaN, which is neither greater than, less than, nor equal to any reasonable threshold

Fix the test by checking that atanh of 0.5 is equal for float and half, and that atanh of 1.0 is equal to infinity
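
The arithmetic behind this can be sketched on the host in plain Python. This is only an illustration, not the CUDA test itself: `to_half` and `matches` are hypothetical helpers, half precision is emulated with the `struct` module's `"e"` format, and infinity is written as `math.inf` because Python's `math.atanh(1.0)` raises `ValueError` instead of returning infinity.

```python
import math
import struct


def to_half(value: float) -> float:
    """Round a Python float to IEEE half precision and back."""
    return struct.unpack("e", struct.pack("e", value))[0]


def matches(a: float, b: float, tol: float = 1e-3) -> bool:
    """Compare two results, treating equal infinities as a match."""
    if math.isinf(a) or math.isinf(b):
        return a == b
    return abs(a - b) <= tol


# Subtracting two infinities yields NaN, and every ordered comparison
# against NaN is False, so a threshold check on the difference cannot pass.
nan_diff = math.inf - math.inf
threshold_check_passes = abs(nan_diff) <= 1e-3

# The robust checks: a finite input compared with a tolerance,
# and the infinite result compared for equality.
finite_ok = matches(math.atanh(0.5), to_half(math.atanh(0.5)))
infinite_ok = matches(math.inf, to_half(math.inf))
```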

Fixes https://github.com/pytorch/pytorch/issues/131401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131416
Approved by: https://github.com/atalman, https://github.com/albanD
2024-07-23 14:10:20 +00:00
4ac77fc6bd [HOP] Don't send HOPs to torch_dispatch (#131370)
I regretted the decision in
https://github.com/pytorch/pytorch/pull/130606. Most user
torch_dispatchs don't have enough to actually handle the HOP correctly,
so for now I'd prefer that users explicitly define the interaction
between the HOP and their torch_dispatch class.

An example is FlopCounterMode: if we allow HOPs to get passed to it, it
will ignore auto_functionalized(mm) by default but it will record flops
for mm, which is weird.

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131370
Approved by: https://github.com/ydwu4
2024-07-23 13:41:08 +00:00
027f35d9e6 [Inductor] Allow customize decompositions for fwd_only trace function (#131329)
Summary:

Inductor will aggressively try to decompose and lower ops into a smaller opset. However, sometimes this may not align with kernel coverage (or perf preference) on different backends (e.g. Inductor will decompose Gelu into primitive ops, but certain backends already have a Gelu op). Therefore, we need a mechanism to allow customization of decomp for the trace function so that Inductor will simply pass this op through.

Test Plan:

Reviewers:
@eellison

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131329
Approved by: https://github.com/eellison
2024-07-23 13:10:48 +00:00
eb146b10db Only depend on sympy 1.12 for conda (no 3.13 there anyways) (#131355)
Fixing nightly after https://github.com/pytorch/pytorch/pull/130895
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131355
Approved by: https://github.com/atalman
2024-07-23 12:19:58 +00:00
9851c7313d [CI][dashboard] Collect PT2 cpu perf nightly (#131369)
Summary: Add a workflow similar to inductor-perf-test-nightly.yml but use x86 metal instances for perf measurement. The data processing and dashboard update will come next.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131369
Approved by: https://github.com/huydhn
2024-07-23 11:55:39 +00:00
3f3b226ffc Fixes for the extension backend tests (#130933)
There were some miscellaneous issues I found:

* The WrapperCodeGen subclass constructors don't accept any arguments, which doesn't mesh with how Inductor can try to construct them.
* A DeviceInterface subclass for Triton doesn't implement `triton_supported() == True`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130933
Approved by: https://github.com/eellison, https://github.com/jansel
2024-07-23 10:46:32 +00:00
d8e2e1fe50 [aoti] use reshape instead of view for flattening tensors for the nan checker (#131302)
For some non-contiguous tensors, tensor.view would trigger the following
runtime error:

"RuntimeError: view size is not compatible with input tensor’s size and stride
(at least one dimension spans across two contiguous subspaces).
Use .reshape(…) instead"

So, let's use reshape instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131302
Approved by: https://github.com/muchulee8, https://github.com/desertfire
2024-07-23 10:15:28 +00:00
16247987a1 Add decomposition for t_copy (#130939)
* Extracted from #128416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130939
Approved by: https://github.com/peterbell10
2024-07-23 08:29:19 +00:00
16a2a1aad3 Annotate graph.py (#131400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131400
Approved by: https://github.com/shunting314
2024-07-23 07:04:12 +00:00
102d8e5a63 MPS LSTM backward kernel workaround on MacOS 14.4+ (#130038)
The bug causing the correctness problem will be fixed in a future OS release. The root cause is a bug in an optimization of the MPSGraph reshape operation in MacOS 14.4, which results in a correctness issue with the shapes the LSTM gradient operation has when num_layers > 2.

Solves silentness of issue #125803.
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130038
Approved by: https://github.com/malfet
2024-07-23 06:30:40 +00:00
29e2e2afb6 Revert D59561509: Multisect successfully blamed "D59561509: [FX][export] DCE pass, check schema for node impurity (#130395)" for one test failure (#131341)
Summary:
This diff reverts D59561509
D59561509: [FX][export] DCE pass, check schema for node impurity (#130395) by yushangdi causes the following test failure:

Tests affected:
- [cogwheel:cogwheel_mtia_cmf_m5_shrunk_test#test_flow_with_verification](https://www.internalfb.com/intern/test/844425041436985/)

Here's the Multisect link:
https://www.internalfb.com/multisect/6533402
Here are the tasks that are relevant to this breakage:
T191383430: 10+ tests unhealthy for ads_mtia_inference

The backout may land if someone accepts it.

If this diff has been generated in error, you can Commandeer and Abandon it.

Test Plan: NA

Differential Revision: D60029318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131341
Approved by: https://github.com/angelayi
2024-07-23 05:23:47 +00:00
b2ad16f01d avoid OpOverloadPacket.__getattr__ calls in inductor lowering (#131348)
We have seen stacktrace samples showing that a lot of compilation time is spent in exceptions raised in `OpOverloadPacket.__getattr__`. It's not entirely clear why/how this happens, but I spot-checked a few places in `_inductor.graph.py` where we previously may have been calling `hasattr(OpOverloadPacket, ...)`, and those calls can be avoided (hasattr goes through getattr, which, for OpOverloadPacket, does a lookup in the dispatch table for all overload names of the packet).
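
Why `hasattr` can be costly here can be shown with a small pure-Python sketch. `OverloadPacketSketch` and its `overload_names` method are hypothetical stand-ins, not the real `OpOverloadPacket`, and the explicit name check is just one possible way to avoid the exception path, not necessarily what this PR does.

```python
class OverloadPacketSketch:
    """Hypothetical stand-in for an object whose __getattr__ does a lookup."""

    def __init__(self, overloads):
        self._overloads = dict(overloads)
        self.failed_lookups = 0

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails.
        overloads = self.__dict__.get("_overloads", {})
        if name in overloads:
            return overloads[name]
        # A miss raises AttributeError, which hasattr() catches and
        # turns into False; the exception is still created and unwound.
        self.failed_lookups += 1
        raise AttributeError(name)

    def overload_names(self):
        """Explicit listing that never goes through the exception path."""
        return tuple(self._overloads)


packet = OverloadPacketSketch({"default": "add.default"})

found = hasattr(packet, "default")    # resolved through __getattr__
missing = hasattr(packet, "Tensor")   # raises and swallows AttributeError
misses_after_hasattr = packet.failed_lookups

# Checking the explicit list answers the same question without raising.
missing_via_names = "Tensor" in packet.overload_names()
misses_after_names = packet.failed_lookups
```

Each failed `hasattr` call pays for the lookup plus an exception, which adds up when it sits on a hot compilation path.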

Test Plan: CI

Differential Revision: D60048270

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131348
Approved by: https://github.com/davidberard98
2024-07-23 04:30:04 +00:00
99d9b369f4 [Optim] Support tensor lr for all optimizers and check it is 1-element (#131065)
Fixes: #130980
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131065
Approved by: https://github.com/janeyx99
2024-07-23 04:27:05 +00:00
781189f25d Add nvjitlink to the list of loadable global deps (#131295)
To fix the cusparse dependency resolution in CUDA-12.x, that has nvJitLink dependency:
```
$ ldd -r /usr/local/cuda-11.8/lib64/libcusparse.so.11.7.5.86
	linux-vdso.so.1 (0x00007ffea6f51000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fb13306f000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007fb133065000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fb13305f000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007fb132f10000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007fb132eeb000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb132cf7000)
	/lib64/ld-linux-x86-64.so.2 (0x00007fb143db7000)
$ ldd -r /usr/local/cuda-12.1/lib64/libcusparse.so.12.1.0.106
	linux-vdso.so.1 (0x00007ffc41909000)
	libnvJitLink.so.12 => /usr/local/cuda-12.1/lib64/libnvJitLink.so.12 (0x00007f3916b38000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007f3916aea000)
	librt.so.1 => /lib/x86_64-linux-gnu/librt.so.1 (0x00007f3916ae0000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007f3916ada000)
	libm.so.6 => /lib/x86_64-linux-gnu/libm.so.6 (0x00007f391698b000)
	libgcc_s.so.1 => /lib/x86_64-linux-gnu/libgcc_s.so.1 (0x00007f3916964000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f3916772000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f3929a8c000)
```
Fixes #131284
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131295
Approved by: https://github.com/malfet
2024-07-23 04:26:33 +00:00
02cd4dbcf4 [BE][CI] Get rid of duplicated code (#131406)
Followup after https://github.com/pytorch/pytorch/pull/131061 Define `run_if_exists` function that runs cpp test if it exists and prints a warning otherwise.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131406
Approved by: https://github.com/kit1980, https://github.com/atalman
2024-07-23 04:01:13 +00:00
35a0e0f018 [tp] improve SequenceParallel and its documentation (#131346)
The SequenceParallel style assumes that an input torch.Tensor is ALREADY sharded on
the sequence dimension when a DTensor is not passed in. Since the documentation
caused some user confusion, this PR:

1. checks the input placements when the input is already a DTensor, and
   redistributes it if it is not sharded on the sequence dimension
2. updates the doc to be more explicit about the cases where the user passes
   in a torch.Tensor and a DTensor

This would fix https://github.com/pytorch/pytorch/issues/129355

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131346
Approved by: https://github.com/awgu
2024-07-23 03:57:01 +00:00
12434504a2 [c10d] remove non-necessary tests (#131212)
As titled: comm tensor is not actively used, since we adopted functional
collectives as our collective tracing approach

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131212
Approved by: https://github.com/XilunWu
2024-07-23 03:48:55 +00:00
8a591da3e7 [CI] Enable AOT inductor in cpu performance smoke test (#130097)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130097
Approved by: https://github.com/chuanqi129, https://github.com/desertfire
2024-07-23 03:44:13 +00:00
6cbb1437c1 Revert "Add sparse block to flex_decoding kernel (#130884)"
This reverts commit 0bf59db6cc076468f44197f0d7ee41f6204c47c2.

Reverted https://github.com/pytorch/pytorch/pull/130884 on behalf of https://github.com/atalman due to Sorry reverting test_causal_full_mask_vs_sdpa constantly failing on trunk ([comment](https://github.com/pytorch/pytorch/pull/130884#issuecomment-2244113663))
2024-07-23 02:10:14 +00:00
28b0ad4f46 [PT2] Minor fix in signpost (#131332)
Summary: compile_id is a named Tuple. We want to log signposts.

Test Plan:
Run e2e job.
Confirm this shows up correctly.
{F1767320364}

Differential Revision: D60045020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131332
Approved by: https://github.com/oulgen
2024-07-23 01:56:00 +00:00
b435d84261 Revert "[custom ops] Add register_vmap for custom ops (#130589)"
This reverts commit 074b42064195c45471912f851e94c753992a9a1f.

Reverted https://github.com/pytorch/pytorch/pull/130589 on behalf of https://github.com/atalman due to Please fix lint and reland ([comment](https://github.com/pytorch/pytorch/pull/130589#issuecomment-2244092174))
2024-07-23 01:44:44 +00:00
8963623494 Re-implement pin_memory to be device-agnostic by leveraging the Accelerator concept (#126376)
This PR re-implements pin memory to drop the optional `device` argument and make all related APIs device-agnostic. We add two new abstract APIs in [AcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/detail/AcceleratorHooksInterface.h#L12) and redefine pin memory as: "Pin memory is always pinned for the current accelerator device". In detail, pin_memory/is_pinned use [getAcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/Context.h#L61) to get the appropriate device and invoke the corresponding overridden interfaces, instead of using BackendSelect and then dispatching to CUDA or other backend-specific implementations.

Note: New backends that want to implement and use pin memory only need to inherit from AcceleratorHooksInterface and override the `isPinnedPtr` and `getPinnedMemoryAllocator` methods.

Additional context: To avoid breaking BC, this PR preserves the `device` arg of the related APIs and throws a deprecation warning if the `device` arg is passed. Another PR will be submitted to update all PT callers (`Tensor.is_pinned()`, `Tensor.pin_memory()`...) not to pass this arg. In the future, the `device` arg will be removed.

Relates #124908
Relates #14560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126376
Approved by: https://github.com/albanD
2024-07-23 01:44:15 +00:00
074b420641 [custom ops] Add register_vmap for custom ops (#130589)
Fixes #130284
Fixes #130653

- Add `torch.library.register_vmap` to custom ops
- Add `register_vmap` for operators in custom_op_db.
- Make `torch.autograd.Function` support kwarg-only kwargs for vmap
- Test operators in op_db with `tests/test_vmap`.
- Change `test_vmap` to allow custom `out_dim` and allow "None" in `out_dim` when testing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589
Approved by: https://github.com/zou3519
2024-07-23 00:54:52 +00:00
1e5ecc4277 move save/load from _export to export (#131353)
Test Plan: existing tests

Differential Revision: D60053905

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131353
Approved by: https://github.com/angelayi
2024-07-23 00:48:28 +00:00
26f7dd286b [export] Allow non-CIA ops to be preserved (#131075)
I feel like the semantics of `run_decompositions(preserve_ops,...)` should be that we should always preserve whatever operator is put into `preserve_ops`, even if it's not CIA?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131075
Approved by: https://github.com/bdhirsh
2024-07-23 00:41:48 +00:00
69b1999586 TunableOp size hotfix (#130800)
Fixes #130727.  GetSize calculation was incorrect for strided batched gemm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130800
Approved by: https://github.com/xw285cornell
2024-07-22 23:42:26 +00:00
8ae1963a61 [Autograd] Cond Higher-Order Operation (#126911)
This is an updated PR to equip cond with the autograd feature and replaces the old [PR](https://github.com/pytorch/pytorch/pull/126007)

@ydwu4 I tried to incorporate your requests already.

Currently there are two problems that I struggle with solving:

1. There seems to be an import issue when trying to import cond in `torch/__init__.py`, see [here](8a704035c9/torch/__init__.py (L1914-L1916)). Therefore, I had to comment out those lines, which resolved the import issues, but I believe cond is not properly exposed as torch.cond.
2. I am not entirely sure how to deal with the opinfo test in `hop_db.py`

Co-authored-by: Yidi Wu <yidi@meta.com>
Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126911
Approved by: https://github.com/ydwu4
2024-07-22 23:18:19 +00:00
c74396e890 Revert "[c10d] remove non-necessary tests (#131212)"
This reverts commit 0c074352ab62acba22265d8f19ea95851ae61d0f.

Reverted https://github.com/pytorch/pytorch/pull/131212 on behalf of https://github.com/atalman due to sorry need to revert breaks OSS CI, module 'test_c10d_common' has no attribute 'CompilerTest' ([comment](https://github.com/pytorch/pytorch/pull/131212#issuecomment-2243961785))
2024-07-22 23:11:44 +00:00
f8f41dcb24 Revert "[inductor] Make UserDefinedTritonKernel a multi-output operation (#130832)"
This reverts commit deacc543f13067ab22e8fb2ab714a20dd60bb056.

Reverted https://github.com/pytorch/pytorch/pull/130832 on behalf of https://github.com/atalman due to broke periodic test ([comment](https://github.com/pytorch/pytorch/pull/130832#issuecomment-2243894772))
2024-07-22 22:10:02 +00:00
15eb10df02 Revert "[inductor] Use multiple outputs for flex-attention (#130833)"
This reverts commit 9df8ea1cf2d62bfe21b46188faea6ef2e29e5210.

Reverted https://github.com/pytorch/pytorch/pull/130833 on behalf of https://github.com/atalman due to broke periodic https://github.com/pytorch/pytorch/pull/130832 ([comment](https://github.com/pytorch/pytorch/pull/130833#issuecomment-2243890944))
2024-07-22 22:07:06 +00:00
f8875e8277 Revert "[inductor] Kill mark_node_as_mutating (#130834)"
This reverts commit 33f036a6f71b386d4ccb9a756ed892c144ec6a5f.

Reverted https://github.com/pytorch/pytorch/pull/130834 on behalf of https://github.com/atalman due to broke periodic https://github.com/pytorch/pytorch/pull/130832 ([comment](https://github.com/pytorch/pytorch/pull/130834#issuecomment-2243886215))
2024-07-22 22:02:43 +00:00
d33804f8b6 Replace manual parsing of "TMPDIR", "TMP", "TEMP" and "TEMPDIR" with std::filesystem::temp_directory_path() (#130842)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130842
Approved by: https://github.com/fegin
2024-07-22 21:49:33 +00:00
a136a7d623 [Functional Collective] enable custom work registration from python (#130354)
This PR does two things:
- Allow tensor -> work registration in Python via `torch._C._distributed_c10d.register_work`. Calling `torch.ops._c10d_functional.wait_tensor` on a tensor would trigger `.wait()` on the registered work object.
- Allow user-defined work object in Python to work with functional collectives.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130354
Approved by: https://github.com/wanchaol, https://github.com/fegin, https://github.com/wconstab
2024-07-22 21:45:19 +00:00
a3922acc06 [TD] More synonyms, new heuristic for test_public_bindings (#130397)
test_public_bindings should be run on anything that changes the public API - need to figure out in the future what is part of the public api, currently I'm using anything in torch/

flex_attention should be run on anything involving autograd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130397
Approved by: https://github.com/malfet
2024-07-22 21:42:54 +00:00
0bf59db6cc Add sparse block to flex_decoding kernel (#130884)
fix typo

Finish flex_decoding block sparse

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130884
Approved by: https://github.com/drisspg
2024-07-22 21:29:43 +00:00
83b355bad5 [aoti] forward fix of D60006838, add back test_multiple_output_alias (#131331) (#131356)
Summary:

Forward fix of D60006838.

The unit test test_multiple_output_alias passes in OSS CI but fails internally, so add it back to the skip list.

Test Plan: ci

Differential Revision: D60044926

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131356
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-07-22 20:17:21 +00:00
e3eaa22126 [torchbench][multisect] Run accuracy check at Diff time (#131266)
Summary:
X-link: https://github.com/pytorch/benchmark/pull/2388

We can enable accuracy checks at Diff time since it is not a performance metric.

* Refactor the existing diff time test to use the new PT2 Benchmark Runner.
* Deprecate the speedup tests and enable the accuracy tests only. We rely on ServiceLab to perform performance testing and regression detection.

Test Plan:
Sandcastle CI

Or buck test command:

```
buck2 test 'fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- test_training_resnet50_accuracy
```

Test UI: https://www.internalfb.com/intern/testinfra/testrun/1688850102375429

Reviewed By: oulgen

Differential Revision: D59825601

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131266
Approved by: https://github.com/oulgen
2024-07-22 20:14:28 +00:00
0c074352ab [c10d] remove non-necessary tests (#131212)
As titled: comm tensor is not actively used, since we adopted functional
collectives as our collective tracing approach

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131212
Approved by: https://github.com/XilunWu
2024-07-22 19:52:44 +00:00
781a33f5d8 Enable dynamic rollout for Linux trunk workflows (#131325)
Enables dynamic migration of jobs to the LF AWS account for the Linux trunk workflow.

The new runners are only given to people specified in this issue: https://github.com/pytorch/test-infra/issues/5132

This closes pytorch/ci-infra#250.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131325
Approved by: https://github.com/ZainRizvi
2024-07-22 19:43:24 +00:00
406f510f89 [c10d] add bfloat16 support for NAN check (#131131)
Summary:
Need another dispatcher macro to support more data types
Test Plan:
(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (86fcae11)]$ python
test/distributed/test_c10d_nccl.py -k test_nan_assert_bfloat16
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:18:
checkForNaN: block: [0,0,0], thread: [85,0,0] Assertion
`!isnan(data[i])` failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:18:
checkForNaN: block: [0,0,0], thread: [18,0,0] Assertion
`!isnan(data[i])` failed.
NCCL version 2.21.5+cuda12.0

devgpu009:1193787:1193787 [0] init.cc:1773 NCCL WARN Cuda failure
'device-side assert triggered'
.
----------------------------------------------------------------------
Ran 1 test in 9.416s

OK

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131131
Approved by: https://github.com/wconstab, https://github.com/XilunWu
2024-07-22 19:41:19 +00:00
9e753d1f20 [AMD] catch exception when other processes belong to other users (#131018)
Summary:
It is a long known pain point that if other users are running things, the call of `torch.cuda.memory.list_gpu_processes()` will error out:
```
  torch.cuda.memory.list_gpu_processes()
  File "torch/cuda/memory.py", line 647, in list_gpu_processes
    procs = amdsmi.amdsmi_get_gpu_process_list(handle)  # type: ignore[attr-defined]
  File "amdsmi/py_interface/amdsmi_interface.py", line 1946, in amdsmi_get_gpu_process_list
    _check_res(
  File "amdsmi/py_interface/amdsmi_interface.py", line 510, in _check_res
    raise AmdSmiLibraryException(ret_code)
amdsmi.py_interface.amdsmi_exception.AmdSmiLibraryException: Error code:
	10 | AMDSMI_STATUS_NO_PERM - Permission Denied

```

So just catch this error

Test Plan: torch.cuda.memory.list_gpu_processes() no longer fails

Differential Revision: D59901053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131018
Approved by: https://github.com/eqy, https://github.com/clee2000
2024-07-22 19:38:51 +00:00
23ae6e2eb3 [FSDP2] Removed state dict error for HSDP (#131320)
Fixes https://github.com/pytorch/torchtitan/issues/441#issuecomment-2241288906.

This PR avoids raising the 2D state dict error for HSDP, which does not depend on strided sharding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131320
Approved by: https://github.com/wanchaol, https://github.com/weifengpy
2024-07-22 19:23:17 +00:00
d3556786b8 Blocklist certain modules for weights_only load (#131259)
Also bold certain text in the error message as suggested
<img width="3000" alt="Screenshot 2024-07-19 at 5 56 48 PM" src="https://github.com/user-attachments/assets/378f20c5-c6b2-4e53-8eaf-0bd26c3a6b60">

With a GLOBAL like `os.execv` the error message is now as such

```python
File "/data/users/mg1998/pytorch/torch/serialization.py", line 1256, in load
    raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Trying to load unsupported GLOBAL posix.execv whose module posix is blocked.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```
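
The idea of refusing globals from blocked modules can be sketched with the standard library's `pickle.Unpickler.find_class` hook. This is a simplified illustration, not the `weights_only` unpickler in `torch/serialization.py`: `BLOCKED_MODULES`, `BlocklistUnpickler`, and `restricted_loads` are hypothetical names, the module list is an assumption, and a blocklist alone is not a complete safety mechanism.

```python
import io
import pickle

# Hypothetical blocklist; the set used by torch.load may differ.
BLOCKED_MODULES = {"os", "posix", "nt", "subprocess", "sys"}


class BlocklistUnpickler(pickle.Unpickler):
    """Unpickler that rejects GLOBAL lookups into blocked modules."""

    def find_class(self, module, name):
        root = module.split(".")[0]
        if root in BLOCKED_MODULES:
            raise pickle.UnpicklingError(
                f"Trying to load unsupported GLOBAL {module}.{name} "
                f"whose module {module} is blocked."
            )
        return super().find_class(module, name)


def restricted_loads(data: bytes):
    """Deserialize bytes while refusing globals from blocked modules."""
    return BlocklistUnpickler(io.BytesIO(data)).load()


# Plain containers of built-in values need no global lookup and load fine.
safe_value = restricted_loads(pickle.dumps({"weights": [1.0, 2.0]}))
```

A payload that references a function such as `os.execv` triggers `find_class` with the function's module, so the load stops before the object is ever resolved.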

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131259
Approved by: https://github.com/malfet, https://github.com/albanD
2024-07-22 18:23:21 +00:00
93ef2e53f8 [3.13, dynamo] support FORMAT_SIMPLE/FORMAT_SPEC (#130751)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130751
Approved by: https://github.com/Skylion007
ghstack dependencies: #130566, #130567, #130568, #130569
2024-07-22 18:07:40 +00:00
375a4d7e9e [3.13, dynamo] decompose fused load/store instructions (#130569)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130569
Approved by: https://github.com/jansel
ghstack dependencies: #130566, #130567, #130568
2024-07-22 18:07:40 +00:00
157f38bc4d [3.13, dynamo] support STORE_FAST_LOAD_FAST (#130568)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130568
Approved by: https://github.com/jansel
ghstack dependencies: #130566, #130567
2024-07-22 18:07:35 +00:00
1e116c7a1e [3.13, dynamo] fix END_FOR (#130567)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130567
Approved by: https://github.com/jansel
ghstack dependencies: #130566
2024-07-22 18:07:32 +00:00
4319147ca9 [3.13, dynamo] fix closures, MAKE_FUNCTION, LOAD_CLOSURE; support SET_FUNCTION_ATTRIBUTE (#130566)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130566
Approved by: https://github.com/jansel
2024-07-22 18:07:28 +00:00
44e689d947 Revert "[TD] More synonyms, new heuristic for test_public_bindings (#130397)"
This reverts commit d8a35d57220cdd5ed2fe52c02bb1f78cc0b3c75b.

Reverted https://github.com/pytorch/pytorch/pull/130397 on behalf of https://github.com/clee2000 due to broke lint, probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/130397#issuecomment-2243518651))
2024-07-22 18:03:22 +00:00
56bb047449 [pt2] Increase dynamo/inductor default log level to info (#131311)
Summary: Avoid overly verbose logs

Test Plan: CI

Differential Revision: D60028647

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131311
Approved by: https://github.com/oulgen
2024-07-22 17:33:29 +00:00
d8a35d5722 [TD] More synonyms, new heuristic for test_public_bindings (#130397)
test_public_bindings should be run on anything that changes the public API - need to figure out in the future what is part of the public api, currently I'm using anything in torch/

flex_attention should be run on anything involving autograd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130397
Approved by: https://github.com/malfet
2024-07-22 17:06:00 +00:00
b9912f31ef Revert "[export] fix zero arg export in training_ir (#130990)"
This reverts commit 50436d5bdb5d2e29307a0c0bcfcce8d7e2da82c0.

Reverted https://github.com/pytorch/pytorch/pull/130990 on behalf of https://github.com/clee2000 due to failing some executorch and torchrec tests internally D60006710 ([comment](https://github.com/pytorch/pytorch/pull/130990#issuecomment-2243395316))
2024-07-22 16:49:25 +00:00
32c2f84e34 Support IPC for Expandable Segments (#130890)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130890
Approved by: https://github.com/dsjohns2
ghstack dependencies: #130888, #130889
2024-07-22 16:15:01 +00:00
0246b28510 [aoti] refactor aoti_torch__scaled_mm and skip aoti fp8 test for some cases (#130868)
Continuing https://github.com/pytorch/pytorch/pull/128683 and https://github.com/pytorch/pytorch/pull/130582.

The api of _scaled_mm has changed. For example, there is only one return now. So change the aoti api as well.

Also, tested the fp8 tests offline. The test_fp8_abi_compatible_cpu_with_stack_allocation_and_minimal_arrayref_interface would fail with `error: use of undeclared identifier 'float8_e4m3fn'` and `error: use of undeclared identifier 'half'`, so skipping them for now.

The reason this wasn't known earlier is probably because the CI doesn't use H100.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130868
Approved by: https://github.com/drisspg, https://github.com/chenyang78, https://github.com/desertfire
2024-07-22 15:24:20 +00:00
5b5e0698a5 Add wrappers for synchronous GPUDirect Storage APIs (#130633)
Based in part on https://github.com/NVIDIA/apex/pull/1774

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-22 14:51:24 +00:00
5c78581fc9 Fix documentation for tensor.repeat. (#131195)
Fixes #130930.

Adjusts the documentation which used `sizes` instead of `repeats`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131195
Approved by: https://github.com/mikaylagawarecki, https://github.com/soulitzer
2024-07-22 14:48:18 +00:00
26383a6cc0 Revert "Added and_masks and or_masks utilities (#131073)"
This reverts commit 92bb323d36adca097c44a2fc8d9f0d574214d801.

Reverted https://github.com/pytorch/pytorch/pull/131073 on behalf of https://github.com/albanD due to The docs build fails here and in trunk ([comment](https://github.com/pytorch/pytorch/pull/131073#issuecomment-2242997958))
2024-07-22 13:44:55 +00:00
3eb9fa5d58 Add support for using LF Canary runners (#131188)
The script is updated such that if a canary build is detected and the label_type is LF runner it will run on an LF Canary runner.

Closes pytorch/ci-infra#245.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131188
Approved by: https://github.com/ZainRizvi
2024-07-22 13:26:46 +00:00
eqy
69e2590490 Fix MKLDNN check in test_aot_inductor.py (#130982)
`torch.ops.mkldnn._is_mkldnn_bf16_supported()` assumes MKLDNN is on the system which isn't the case for e.g., some ARM system configurations

CC @tinglvv @nWEIdia

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130982
Approved by: https://github.com/malfet
2024-07-22 11:58:18 +00:00
92bb323d36 Added and_masks and or_masks utilities (#131073)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131073
Approved by: https://github.com/drisspg
ghstack dependencies: #130871, #130904
2024-07-22 11:48:03 +00:00
68df24f9b6 [xla hash update] update the pinned xla hash (#126672)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned xla hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126672
Approved by: https://github.com/pytorchbot
2024-07-22 11:35:36 +00:00
6d65a2c3f4 [3/N] Non-Tensor: Support string parameter for aten operations (#125831)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125831
Approved by: https://github.com/jansel, https://github.com/jgong5
2024-07-22 09:42:35 +00:00
8da19fec60 [Inductor] Support store SPIR-V binary file output from Intel Triton. (#130849)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130849
Approved by: https://github.com/peterbell10, https://github.com/EikanWang
2024-07-22 05:59:03 +00:00
2820e1d9f8 Update CPython support policy (#130989)
Update as specified in the RFC that was accepted: https://github.com/pytorch/rfcs/blob/master/RFC-0038-cpython-support.md
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130989
Approved by: https://github.com/seemethere
2024-07-22 05:29:07 +00:00
1614891946 [Profiler] exclude gpu_user_annotation when accumulating cuda time total (#130733)
Fixes #[130730](https://github.com/pytorch/pytorch/issues/130730)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130733
Approved by: https://github.com/aaronenyeshi
2024-07-22 04:35:21 +00:00
c2425a3b57 [BE] Use _linux-build.yml instead of -linux-build-label.yml flavor (#130762)
It was also introduced during the ARC experiment and was supposed to be a temporary thing.
Fix `use_split_build` option handling in `_linux_build.yml`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130762
Approved by: https://github.com/Skylion007, https://github.com/atalman, https://github.com/jeanschmidt
2024-07-21 23:17:17 +00:00
500cbb5b90 Add decomposition for view_copy (#130938)
* Extracted from #128416
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130938
Approved by: https://github.com/peterbell10
ghstack dependencies: #130937
2024-07-21 20:39:24 +00:00
f628813066 Fix out_wrapper, _make_copy_from_view to handle all signatures (#130937)
* See #128416 and #129476
* Simplify xskip lists in test/functorch/test_ops.py
* Add supports_out=True to OpInfos for copy ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130937
Approved by: https://github.com/peterbell10
2024-07-21 20:39:24 +00:00
b193894b94 FakeTensor cache SymInt support (#127596)
Adds support for SymInts in the FakeTensor cache.

A couple notes:
1. When a SymInt is present in the input key for a FakeTensor operation we cache on the ShapeEnv instead of using the FakeTensorMode cache. This is necessary so we don't have to remember and check the guards. It reduces the cache hits but there's diminishing return on how much work we can do before the cache becomes more of a burden than a gain.
2. We need to be careful when we cache an output SymInt that is a direct copy from the input: on a cache hit we must copy the SymNode from the input to the output. This is important because the fx-graph building code actually uses SymNode ids in the process of building the graph, so constructing a same-content-but-different-id SymNode will fail.
3. In the cache key we store SymInts as a _PySymInputStub. These represent SymInt (and friends) but support `__hash__` and `__eq__` (which SymInt does not).
4. In the cache entry we store SymInts as a _SymIntOutputStub.

Perf example:
```
python benchmarks/dynamo/timm_models.py --ci --accuracy --timing
--explain --inductor --dynamic-shapes --dynamic-batch-only --device cuda
--training --amp --total-partitions 2 --partition-id 0 --output
/tmp/training_timm_models.csv --filter crossvit_9_240
```
fake tensor cache before:
```
INFO: FakeTensor cache stats:
INFO:   cache_hits: 68137
INFO:   cache_misses: 837
INFO:   cache_bypasses:
INFO:     symbolic shape:            48224
INFO:     CompositeImplicitAutograd: 917
INFO:     non-fake tensor:           70
INFO:     non-FakeTensor output:     62
INFO:     non-builtin:               8
INFO:     dynamic output shape:      1
```
and after:
```
INFO: FakeTensor cache stats:
INFO:   cache_hits: 88187
INFO:   cache_misses: 14233
INFO:   cache_bypasses:
INFO:     CompositeImplicitAutograd: 1037
INFO:     non-FakeTensor output:     602
INFO:     non-fake tensor:           70
INFO:     unsafe view:               36
INFO:     non-builtin:               8
INFO:     dynamic output shape:      1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127596
Approved by: https://github.com/eellison
ghstack dependencies: #131014, #129780
2024-07-21 19:26:38 +00:00
ebce85172e FakeTensor cache SymInt support: flatten cache key (#129780)
This is part of #127596, pulled out to make reviewing a little easier.

Flatten the FakeTensor cache key - so it's a list of singular elements and pointing at one requires a single index rather than a PyTree path.  This is used in the next PR to allow us to have the cache entry refer to an input SymInt that it needs to copy directly into the output.
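A minimal sketch of the flattening idea, assuming a key built from nested tuples and lists. The names `flatten_key`, `nested_key`, and the key contents are hypothetical and are not the actual FakeTensor cache code:

```python
from typing import Any, List


def flatten_key(obj: Any, out: List[Any]) -> None:
    """Append the leaves of a nested tuple/list structure to `out`, in order."""
    if isinstance(obj, (tuple, list)):
        for item in obj:
            flatten_key(item, out)
    else:
        out.append(obj)


# Hypothetical nested key: an op name, tensor metadata, and a keyword argument.
nested_key = ("aten.add", (("s0", 4), "float32"), ("alpha", 1))

flat_key: List[Any] = []
flatten_key(nested_key, flat_key)

# After flattening, any element is addressed by one integer index
# instead of a path through the nested structure.
sym_index = flat_key.index("s0")
```

With a flat key, a cache entry only needs to remember an integer such as `sym_index` to say which input element should be copied into the output.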

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129780
Approved by: https://github.com/oulgen, https://github.com/eellison
ghstack dependencies: #131014
2024-07-21 19:26:38 +00:00
f3562e2cdc backport dataclass(slots=True) (#131014)
Python 3.10 adds `@dataclass(slots=True)` to auto-build the `__slots__` for a dataclass. This is really useful but we can't use it until 3.10 becomes our minimum version.

Copied the code for that functionality from python into a new decorator and ported it to use 3.8 syntax (removed use of `match`).

Usage:
```
@dataclass_slots
@dataclass
class X:
  pass
```
is the same as (in py3.10):
```
@dataclass(slots=True)
class X:
  pass
```
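
As a rough illustration of how such a decorator can work (not the code added by this PR), the sketch below rebuilds an existing dataclass with `__slots__` derived from its fields. The name `dataclass_slots_sketch` is hypothetical, and inheritance from other slotted classes is not handled:

```python
import dataclasses


def dataclass_slots_sketch(cls):
    """Return a copy of dataclass `cls` that defines __slots__ for its fields."""
    if "__slots__" in cls.__dict__:
        raise TypeError(f"{cls.__name__} already specifies __slots__")
    field_names = tuple(f.name for f in dataclasses.fields(cls))
    cls_dict = dict(cls.__dict__)
    cls_dict["__slots__"] = field_names
    for name in field_names:
        # Class-level defaults would clash with the slot descriptors;
        # the generated __init__ already carries the default values.
        cls_dict.pop(name, None)
    # These descriptors belong to the old class layout and must not be copied.
    cls_dict.pop("__dict__", None)
    cls_dict.pop("__weakref__", None)
    new_cls = type(cls)(cls.__name__, cls.__bases__, cls_dict)
    new_cls.__qualname__ = cls.__qualname__
    return new_cls


@dataclass_slots_sketch
@dataclasses.dataclass
class Point:
    x: int
    y: int = 0


p = Point(1)
```

The class has to be recreated because `__slots__` only takes effect at class creation time; adding it to an existing class does nothing.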

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131014
Approved by: https://github.com/oulgen, https://github.com/eellison
2024-07-21 19:26:31 +00:00
1439bd3c9c [Easy][pytree] enable CXX pytree under torch::deploy (#130144)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130144
Approved by: https://github.com/zou3519
ghstack dependencies: #130895, #130139
2024-07-21 07:36:22 +00:00
ddde9dd25c [dynamo][automatic_dynamic] Trigger dynamism on stride changes (#130232)
Fixes https://github.com/pytorch/pytorch/issues/129798

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130232
Approved by: https://github.com/ezyang
2024-07-21 03:45:54 +00:00
e506dfa640 [dynamo] Add a JK kill switch for disabling compile (#131258)
Summary: The JK disables dynamo by passing None to set_eval_frame.

Test Plan:
Ran buck test mode/opt caffe2/test/dynamo:test_dynamo

Buck UI: https://www.internalfb.com/buck2/1fec33b4-c95a-4bdf-b47b-7c0b8ab9e24a
Test UI: https://www.internalfb.com/intern/testinfra/testrun/2814750010105363
Network: Up: 0B  Down: 0B
Jobs completed: 9596. Time elapsed: 28:54.5s.
Tests finished: Pass 4796. Fail 0. Fatal 0. Skip 17. Build failure 0

Also manually wrote a small local test with torch.compile and toggled the code to see if PT2 can be disabled. Validated by running the test and observing the log.

PT2 enabled: P1486847242. Can see dynamo log about graph breaks.
PT2 disabled: P1486847727. No dynamo log. The newly added warning printed.

Reviewed By: ezyang

Differential Revision: D59968925

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131258
Approved by: https://github.com/c00w
2024-07-21 01:22:31 +00:00
cyy
1d1d074072 [3/N] Fix Wunused-parameter warnings (#131271)
Follows #131170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131271
Approved by: https://github.com/ezyang
2024-07-20 23:31:03 +00:00
d57af32e63 Fix undefined tensor error in _copy_from_and_resize when fallback to cpu. (#130237)
1) Skip undefined tensors in the CPU fallback when calling _copy_from_and_resize;
2) Modify the to_cpu function to support optional tensors;
3) Copy back to the original optional tensor when alias_info isWrite is true.

@ezyang @bdhirsh

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130237
Approved by: https://github.com/ezyang
2024-07-20 23:12:17 +00:00
13283fb4bc [distributed] test_store: remove flaky bind test (#131262)
Fixes https://github.com/pytorch/pytorch/issues/131084

There's no good way to fix this since some test environments can bind the protected range. Removing the test, since its value is relatively low: it only tests error messages.

Test plan:

```
python test/distributed/test_store.py -v -k address
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131262
Approved by: https://github.com/mori360, https://github.com/XilunWu
2024-07-20 23:04:31 +00:00
407c87a32c [debug][dtensor] fixed updating current module (#130995)
**Summary**
Fixed issues with updating the current module when transitioning from a child module to its parent module and in the backward pass. The first issue occurred because the pre-hook is not called again when we go back to the parent module, and because the hook being used was a register_module_forward_hook, which runs before the register_module_hook used in redistribute, causing the collective call to be assigned to the incorrect module. To fix this, I updated the current module to be the parent module in a register_forward_hook in the module tracker. The second issue was caused by the parent set in the module tracker I inherit from being incorrect. I fixed this by saving the parents of each module and using them in the collective counter instead of the incorrect set. I have updated the example in module_operation_tracing to reflect the correct output. In addition, I changed the test cases that used the incompatible old CommDebugMode.

**Test Case**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing

2. pytest test/distributed/_tensor/debug/test_comm_mode_features.py -s -k test_transformer_module_tracing

3. python test/distributed/_composable/fsdp/test_fully_shard_training.py -k TestFullyShardGradientAccumulation.test_gradient_accumulation

4. python test/distributed/_tensor/test_math_ops.py -k DistMathOpsTest.test_layer_norm_bwd

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130995
Approved by: https://github.com/XilunWu
ghstack dependencies: #130410
2024-07-20 20:57:29 +00:00
33f036a6f7 [inductor] Kill mark_node_as_mutating (#130834)
Resubmit of #129346

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130834
Approved by: https://github.com/lezcano
ghstack dependencies: #130831, #130832, #130833
2024-07-20 18:53:33 +00:00
fccbe85475 [BE] Improve CUDA UpSample error message (#131252)
`Expected grad_output.numel() <= std::numeric_limits<int32_t>::max() to be true` is not very helpful, it's better to mention method name as well as actual tensor size

This error was reported in https://github.com/pytorch/pytorch/issues/131185

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131252
Approved by: https://github.com/albanD
2024-07-20 16:49:34 +00:00
a7a951a4ae [executorch hash update] update the pinned executorch hash (#130001)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Co-authored-by: Huy Do <huydhn@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130001
Approved by: https://github.com/pytorchbot
2024-07-20 16:44:07 +00:00
b6d477fd56 [BE][Easy][16/19] enforce style for empty lines in import segments in torch/_i*/ (#129768)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129768
Approved by: https://github.com/jansel
2024-07-20 16:20:58 +00:00
8e478d4fb1 Add Alban and Piotr into Core Maintainers (#130903)
See official announcement here: https://dev-discuss.pytorch.org/t/alban-desmaison-and-piotr-bialecki-are-now-pytorch-core-maintainers/2280

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130903
Approved by: https://github.com/albanD, https://github.com/Skylion007
2024-07-20 16:02:42 +00:00
637ab85e7f fix for launching kernel invalid config error when calling embedding … (#130994)
…with large index

Fixes #130806
When an output size of 2147483648 (=131072*16384) is expected in the above issue, it threw the following error:
RuntimeError: HIP error: invalid configuration argument

What happened was that the second parameter passed to hipLaunchKernel was an invalid {2147483648,1,1}.
Found two issues in Indexing.cu:

1: ptrdiff_t was used, but it is a signed integer type, so outTotalSize >= 2147483648 can cause overflow when doing [this](39493aa934/aten/src/ATen/native/cuda/Indexing.cu (L1367)):
2: On ROCm, std::min -> ::min did not work as expected when outTotalSize>=2147483648

As a result, 2147483648 was sent to hipLaunchKernel. The GPU does not support such a huge number, since this number specifies the number of threads per block. The original code intended to set 128 threads per block; this is debatable, as the perf would not be good on the latest powerful GPUs (a TODO item to update for perf, maybe?), but at least it would not cause an `invalid configuration argument` error.
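
The arithmetic behind the overflow can be checked with a short sketch. The helper names `as_int32` and `blocks_needed` are hypothetical, and the block-count formula is only an assumed ceil-division with 128 threads per block, not the exact expression in Indexing.cu:

```python
import ctypes

INT32_MAX = 2**31 - 1
out_total_size = 131072 * 16384  # the output size from the issue


def as_int32(value: int) -> int:
    """Reinterpret `value` as a 32-bit signed integer (wraps on overflow)."""
    return ctypes.c_int32(value).value


def blocks_needed(total: int, threads_per_block: int = 128) -> int:
    """Ceil-divide the total element count by the threads per block."""
    return (total + threads_per_block - 1) // threads_per_block


# One past the largest int32, so a 32-bit signed value wraps to a negative number.
wrapped = as_int32(out_total_size)

# With 64-bit arithmetic the block count stays well defined.
grid = blocks_needed(out_total_size)
```

Here `out_total_size` equals `2**31`, which is exactly one more than `INT32_MAX`, so any intermediate 32-bit signed value cannot hold it.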

[Test]
Run the same code snippet in the [issue](https://github.com/pytorch/pytorch/issues/130806), and print the output, its dim and numel(), which looks like below now:
```
output=tensor([[ 0.4044, -0.0244, -0.6865,  ..., -0.7800,  0.1175,  1.6726],
        [-1.0866, -0.1609,  0.3538,  ...,  1.9105,  0.7882,  1.1583],
        [-2.2079,  0.3736,  0.3610,  ..., -0.2658, -0.0459,  1.3077],
        ...,
        [ 0.8753, -0.7482, -0.1978,  ...,  0.9016,  1.1501, -0.5178],
        [-1.5845, -0.6277,  1.4520,  ...,  0.5733, -2.1198, -0.0915],
        [-0.6310, -1.0239, -0.1910,  ...,  0.4309,  0.1630,  0.3239]],
       device='cuda:0'), dim=2, numel=2147483648
```

Added a large tensor unit test too.
```
/pytorch# pytest test/nn/test_embedding.py -k test_large_tensors
================================================================================== test session starts ===================================================================================
platform linux -- Python 3.9.19, pytest-7.3.2, pluggy-1.4.0
rootdir: /dockerx/development/pytorch
configfile: pytest.ini
plugins: flakefinder-1.1.0, rerunfailures-14.0, xdist-3.3.1, xdoctest-1.1.0, cpp-2.3.0, hypothesis-5.35.1
collected 288 items / 287 deselected / 1 selected
Running 1 items in this shard

test/nn/test_embedding.py .                                                                                                                                                        [100%]

=========================================================================== 1 passed, 287 deselected in 3.16s ============================================================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130994
Approved by: https://github.com/jeffdaily, https://github.com/xw285cornell
2024-07-20 08:33:29 +00:00
a8319698b3 [inductor] [cpp] improve cache blocking with CPU info (#129348)
## Description
For the single-thread case, this PR improves the cache blocking in the CPP GEMM template with the CPU info (the L1 and L2 cache sizes). `Mc_blocks` and `Kc_blocks` are calculated based on the conditions below:
     - size_of_B < L1
     - size_of_A < 0.5 * L2
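
One way to read the two conditions is sketched below. This is an assumption-laden illustration, not the template's actual code: the function `pick_cache_blocks`, its parameters, and the tile shapes (a B tile of `Kc x Nr` elements and an A tile of `Mc x Kc` elements) are all hypothetical.

```python
def pick_cache_blocks(mr, nr, kr, m_blocks, k_blocks, itemsize, l1_bytes, l2_bytes):
    """Pick the largest block counts satisfying size_of_B < L1 and size_of_A < 0.5 * L2.

    mr, nr, kr: register block sizes (assumed); m_blocks, k_blocks: number of
    register blocks along M and K; itemsize: bytes per element.
    """
    # Assumed B tile: (kc_blocks * kr) rows by nr columns, kept strictly below L1.
    b_bytes_per_k_block = kr * nr * itemsize
    kc_blocks = max(1, min(k_blocks, (l1_bytes - 1) // b_bytes_per_k_block))

    # Assumed A tile: (mc_blocks * mr) rows by (kc_blocks * kr) columns,
    # kept strictly below half of L2.
    a_bytes_per_m_block = mr * kc_blocks * kr * itemsize
    mc_blocks = max(1, min(m_blocks, (l2_bytes // 2 - 1) // a_bytes_per_m_block))
    return mc_blocks, kc_blocks


# Example numbers (assumed): bf16 elements, 48 KiB L1, 2 MiB L2.
mc_blocks, kc_blocks = pick_cache_blocks(
    mr=32, nr=16, kr=32, m_blocks=64, k_blocks=128,
    itemsize=2, l1_bytes=48 * 1024, l2_bytes=2 * 1024 * 1024,
)
```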

For multi-thread, we need to tune the task decomposition among threads together with cache blocking. We disabled the cache blocking change for now and will submit a follow-up PR for multi-thread optimizations.

## Performance
No regressions. Models with > 3% performance speedup are listed below:

### BF16 single thread (measured on CPU with AMX support)
- static shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | detectron2_fasterrcnn_r_101_dc5 | 4% |

- dynamic shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | detectron2_fasterrcnn_r_101_dc5 | 4% |

### FP32 single thread (measured on Ice Lake)
- static shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | basic_gnn_edgecnn | 10% |

- dynamic shape

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | basic_gnn_edgecnn | 10% |

### Next step
The E2E level improvement is limited due to the below reasons:

- For several HF models, we can observe a performance improvement at the kernel level for the gemm template kernel, but the performance is either still worse than the ATen kernel (so it won't be selected during autotune) or improved from worse than ATen to similar to ATen, so we don't see an E2E-level performance change.

- There are models where the gemm template kernel gets a > 10% performance improvement with this PR, but since the kernel time is only about 3% of the E2E time, we don't observe a significant E2E-level improvement.

We will continue to find possible optimizations in the gemm template kernel in follow-up PRs.

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129348
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #130675, #130690
2024-07-20 06:53:31 +00:00
0b44e1a74c [inductor][cpp][gemm] optimize arbitrary N in packed gemm template (#130690)
Currently we require `n % register_block_n == 0`, which typically brings good perf when `n` is a multiple of 8, 16, 32 etc., and we fall back to the reference micro gemm otherwise (where `register_block_n == 1`). This PR optimizes this by padding `n` to a multiple of `register_block_n` (which is 8, 16, 32 etc.) for the packed weight. Therefore, the micro-gemm can work as is on the padded `n`. When the weight is padded, we use the local accumulation buffer to get the result from the micro-gemm and then unpad (slice) it before storing back to the output buffer.
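
The pad, compute, and slice steps can be sketched in plain Python. The helpers `pad_to_multiple`, `matmul`, and `padded_matmul` are hypothetical stand-ins for illustration, not the generated C++ template:

```python
from typing import List

Matrix = List[List[float]]


def pad_to_multiple(n: int, block: int) -> int:
    """Round n up to the next multiple of block."""
    return (n + block - 1) // block * block


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Reference matrix multiply: a is M x K, b is K x N."""
    cols = len(b[0])
    return [
        [sum(row[k] * b[k][j] for k in range(len(b))) for j in range(cols)]
        for row in a
    ]


def padded_matmul(a: Matrix, b: Matrix, register_block_n: int) -> Matrix:
    """Pad the N dimension of b with zeros, multiply, then slice the result back to N."""
    n = len(b[0])
    padded_n = pad_to_multiple(n, register_block_n)
    b_padded = [row + [0.0] * (padded_n - n) for row in b]
    acc = matmul(a, b_padded)  # plays the role of the local accumulation buffer
    return [row[:n] for row in acc]


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[1.0, 0.0, 2.0], [0.0, 1.0, 3.0]]  # N = 3 is not a multiple of 8
result = padded_matmul(a, b, register_block_n=8)
```

Because the padding columns are zero, they only add unused output columns, so slicing the first `n` columns recovers the unpadded result.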

Performance numbers measured on "Intel (R) Xeon (R) CPU Max 9480", single core, bf16.

Before
AUTOTUNE linear_unary(512x768, 3073x768, 3073)
  _linear_pointwise 2.3563 ms 100.0%
  cpp_packed_gemm_0 710.5902 ms 0.3%

After
AUTOTUNE linear_unary(512x768, 3073x768, 3073)
  cpp_packed_gemm_0 1.8909 ms 100.0%
  _linear_pointwise 2.1016 ms 90.0%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130690
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #130675
2024-07-20 06:30:15 +00:00
ebc012ace6 Add hooks for execution on intel gaudi devices - 1 (#128584)
## Motivation
This is a follow-up to PR: https://github.com/pytorch/pytorch/pull/126970 to support Gaudi devices for PyTorch UT execution.

## Changes
We are adding additional hooks to:
1. Add dtype exceptions for Gaudi/HPU
2. Extend onlyNativeDevices decorator  functionality to add additional devices

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128584
Approved by: https://github.com/albanD
2024-07-20 05:03:36 +00:00
d31f2ae904 Ensure invariant that all inputs have tensor dict (#131249)
There was a path with freezing enabled that violated the invariant that all inputs have the "tensor_dict" meta. This ensures that `register_attr_or_module` also sets tensor_dict meta.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131249
Approved by: https://github.com/anijain2305
2024-07-20 04:40:58 +00:00
37337ef5c3 add some description on create_block_mask and mask mods (#131209)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131209
Approved by: https://github.com/joydddd
2024-07-20 04:40:48 +00:00
168c0e24a5 [IntraNodeComm] Fix some issues in two-shot all-reduce (#131244)
Two issues:
- Similar to https://github.com/pytorch/pytorch/pull/129501, two-shot all-reduce's reduction order was different across ranks. This PR fixes it.
- When migrating to SymmetricMemory, I accidentally used `get_buffer_ptrs_dev` instead of `get_buffer_ptrs` (the former is an on-device array). This PR fixes it (for https://github.com/pytorch/pytorch/issues/131215).
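
One plausible reason the reduction order matters is that floating-point addition is not associative, so ranks that sum the same values in different orders can end up with slightly different results. A small sketch in plain Python (the helper `reduce_in_order` is hypothetical and does not model the actual kernel):

```python
from typing import List


def reduce_in_order(values: List[float], order: List[int]) -> float:
    """Sum `values` sequentially in the given index order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


values = [0.1, 0.2, 0.3]

# Two "ranks" reduce the same contributions in different orders.
rank0_result = reduce_in_order(values, [0, 1, 2])
rank1_result = reduce_in_order(values, [2, 1, 0])

# The results agree up to rounding error but are not bitwise identical.
mismatch = rank0_result != rank1_result
```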

The failing snippet provided by https://github.com/pytorch/pytorch/issues/131215 now works.
```python
import os
import torch
import torch.distributed as dist

def _get_global_rank() -> int:
    return int(os.environ.get("LOCAL_RANK", "0"))

def is_local():
    return _get_global_rank() == 0

def _get_world_size() -> int:
    return int(os.environ.get("LOCAL_WORLD_SIZE", "1"))

global_rank = _get_global_rank()
world_size = _get_world_size()
torch.cuda.set_device(global_rank)
dist.init_process_group(backend="nccl")
global_group = dist.group.WORLD
draft_group = dist.new_group([0,1])

inp = torch.full((128, 1, 4096), global_rank, dtype=torch.bfloat16, device="cuda")
dist.all_reduce(inp, group=global_group)
expect = sum(range(world_size))
assert inp.eq(expect).all()

if 0 <= global_rank < 2:
    inp = torch.full((128, 1, 2048), global_rank, dtype=torch.bfloat16, device="cuda")
    dist.all_reduce(inp, group=draft_group)
    expect = sum(range(2))
    assert inp.eq(expect).all()

torch.cuda.synchronize()
print("success")
dist.destroy_process_group()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131244
Approved by: https://github.com/weifengpy, https://github.com/Chillee
2024-07-20 02:51:45 +00:00
d2bd9acabd [BE] bump optree version to 0.12.1 (#130139)
0.12.0 Major Updates:

- Add context manager to temporarily set the dictionary sorting mode
- Add accessor APIs
- Use `stable` tag for `pybind11` for Python 3.13 support
- Fix potential segmentation fault for pickling support

0.12.1 Updates:

- Fix warning regression during import when launched with strict warning filters

Closes #130155
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130139
Approved by: https://github.com/zou3519
ghstack dependencies: #130895
2024-07-20 02:41:10 +00:00
50436d5bdb [export] fix zero arg export in training_ir (#130990)
Fixed TrainingIRToRunDecomp failures for test_tensor_attribute_zero_args and also a few re-traceability failures because run_decomposition does a retracing.

**edit:** also remove the eliminate_dead_code() in _unlift because of one onnx test failure:
a constant tensor attr was lifted as constant_tensor input but it's not used in the graph after aot_autograd due to a shortcut in its decomposition. This causes the setattr to be removed by eliminate_dead_code but the graph signature still contains the name of that buffer, which causes an inconsistency between the transformed graph and ep's original signature after _unlift. And it seems that this has happened a few times where some nodes are accidentally removed and we're in an inconsistent state.

The alternative to removing it would be: every time we call eliminate_dead_code, we verify the consistency of the graph with 1. the graph before transformation and 2. all the metadata, but I think this deserves a complete design.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130990
Approved by: https://github.com/pianpwk
2024-07-20 02:35:13 +00:00
3c43fe068f [inductor] parallel compile: Create new pipes for subproc communication (#131194)
Summary: Rather than using stdin/stdout for IPC, we can create new pipes and pass the descriptors to the subproc via the cmd line. https://github.com/pytorch/pytorch/issues/131070 reports an issue where the combination of deepspeed and onnxruntime-training causes _something_ in the subproc to write to stdout and corrupt the IPC. The current implementation was already brittle; we can just create new pipes specifically for the IPC.
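
A minimal sketch of the general technique on POSIX, assuming a child that takes its two descriptors as command-line arguments. The child script, the `roundtrip` helper, and the uppercase "protocol" are hypothetical and are not the inductor implementation:

```python
import os
import subprocess
import sys

# Hypothetical child: reads a request from one fd and replies on another.
# Anything it prints to stdout never touches the IPC channel.
CHILD = r"""
import os, sys
read_fd, write_fd = int(sys.argv[1]), int(sys.argv[2])
print("stray output on stdout")
request = os.read(read_fd, 1024)
os.write(write_fd, request.upper())
"""


def roundtrip(payload: bytes) -> bytes:
    """Send `payload` to a subprocess over dedicated pipes and return its reply."""
    to_child_r, to_child_w = os.pipe()
    from_child_r, from_child_w = os.pipe()
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD, str(to_child_r), str(from_child_w)],
        pass_fds=(to_child_r, from_child_w),  # keep these fds open in the child
        stdout=subprocess.DEVNULL,
    )
    # The parent does not need the child's ends of the pipes.
    os.close(to_child_r)
    os.close(from_child_w)
    os.write(to_child_w, payload)
    os.close(to_child_w)
    reply = os.read(from_child_r, 1024)
    os.close(from_child_r)
    proc.wait()
    return reply


reply = roundtrip(b"ping")
```

`pass_fds` is only available on POSIX; the descriptors keep the same numbers in the child, which is why they can be passed as plain integers on the command line.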

Test Plan: I was able to repro the MemoryError in https://github.com/pytorch/pytorch/issues/131070 by installing deepspeed and onnxruntime-training. Verified this PR fixes.

Differential Revision: [D59968362](https://our.internmc.facebook.com/intern/diff/D59968362)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131194
Approved by: https://github.com/malfet, https://github.com/eellison, https://github.com/atalman
2024-07-20 02:23:01 +00:00
9df8ea1cf2 [inductor] Use multiple outputs for flex-attention (#130833)
Resubmit of #129344

This fixes the DCE issue for attention output

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130833
Approved by: https://github.com/lezcano
ghstack dependencies: #130831, #130832
2024-07-20 02:05:10 +00:00
deacc543f1 [inductor] Make UserDefinedTritonKernel a multi-output operation (#130832)
Resubmit of #129325

Previously each mutation was represented by a `MutationOutput` operation which
was a new scheduler node that must be scheduled immediately afterwards.

Now we have a single scheduler node, which produces multiple `MutationOutput`
buffers as its output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130832
Approved by: https://github.com/lezcano
ghstack dependencies: #130831
2024-07-20 02:05:10 +00:00
27c2a0d63b [inductor] Separate Buffer and Operation into two concepts (#130831)
Resubmit of #128893

Currently a buffer represents both a tensor with physical storage and a
computation that produces the tensor as a result.

This PR attempts to split these into two different concepts in the scheduler.
This should allow us to have multiple outputs from a single operation.

Differential Revision: [D59876059](https://our.internmc.facebook.com/intern/diff/D59876059)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130831
Approved by: https://github.com/lezcano
2024-07-20 02:05:07 +00:00
bb4251213b Add decomposition for channel_shuffle (#118775)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/118775
Approved by: https://github.com/peterbell10
2024-07-20 01:24:41 +00:00
f0075c179b Pin sympy >= 1.13.0 (#130895)
------

The opposite of #130836. Pin `sympy >= 1.13.0` for Python >= 3.9 and `sympy == 1.12.1` for Python 3.8.

- #130836

See the PR description of #130836 for more details.

`sympy` 1.13.0 introduces some breaking changes which break our tests. More specifically:

- Ref [Backwards compatibility breaks and deprecations](https://github.com/sympy/sympy/wiki/release-notes-for-1.13.0#backwards-compatibility-breaks-and-deprecations)

> BREAKING CHANGE: Float and Integer/Rational no longer compare equal with a == b. From now on Float(2.0) != Integer(2). Previously expressions involving Float would compare unequal e.g. x*2.0 != x*2 but an individual Float would compare equal to an Integer. In SymPy 1.7 a Float will always compare unequal to an Integer even if they have the same "value". Use sympy.numbers.int_valued(number) to test if a number is a concrete number with no decimal part. ([#25614](https://github.com/sympy/sympy/pull/25614) by [@smichr](https://github.com/smichr))

`sympy >= 1.13.0` is required to enable Python 3.13 support. This should be part of #130689.

- #130689

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130895
Approved by: https://github.com/ezyang
2024-07-20 00:59:24 +00:00
30d1826b2b Revert "[executorch hash update] update the pinned executorch hash (#130001)"
This reverts commit 4821f72457afd7b1b5c61c1c8c3c49105c1bd22d.

Reverted https://github.com/pytorch/pytorch/pull/130001 on behalf of https://github.com/clee2000 due to the test_sympy_utils failure is real, Dr. CI is wrong https://github.com/pytorch/pytorch/actions/runs/10015433275/job/27687163560 4821f72457 ([comment](https://github.com/pytorch/pytorch/pull/130001#issuecomment-2240807631))
2024-07-20 00:56:14 +00:00
cyy
cd8bbdc71a [2/N] Fix Wunused-parameter warnings (#131170)
Follows #130924
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131170
Approved by: https://github.com/mikaylagawarecki
2024-07-19 23:58:56 +00:00
207fb96155 [functorch] saved tensor hooks error should only apply to grad, vjp transforms. (#131191)
There's no reason to ban them for vmap or jvp, because without the
{grad, vjp} transforms those just act above PyTorch autograd, which will
end up saving regular Tensors.

Test Plan:
- some tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131191
Approved by: https://github.com/drisspg
2024-07-19 23:16:27 +00:00
4821f72457 [executorch hash update] update the pinned executorch hash (#130001)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130001
Approved by: https://github.com/pytorchbot
2024-07-19 23:10:20 +00:00
7c299b46ca Revert "Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264)"
This reverts commit 8390843eba6271dcdbec7d048e9fa4e56d4479d8.

Reverted https://github.com/pytorch/pytorch/pull/125264 on behalf of https://github.com/izaitsevfb due to breaks internal tests ([comment](https://github.com/pytorch/pytorch/pull/125264#issuecomment-2240516202))
2024-07-19 22:58:51 +00:00
35bf05561c [Inductor] B2B-GEMM performance tuning with test (#130778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130778
Approved by: https://github.com/eellison
2024-07-19 22:53:57 +00:00
6657b14a64 [inductor] Fix the method for checking the variable type of entry.numel (#131026)
The data type of numel in the IterationRangesEntry class is sympy.Expr. To determine if it's an integer, we need to use sympy.Integer.

Co-authored-by: peterbell10 <peterbell10@live.co.uk>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131026
Approved by: https://github.com/peterbell10
2024-07-19 22:51:11 +00:00
0e72baddf0 Revert "[easy][pytorch][counters] Move WaitCounter in c10/util (#131021)"
This reverts commit 0ca7b6ddd91192ebffd3c88bf314d07ba6cddf50.

Reverted https://github.com/pytorch/pytorch/pull/131021 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/131021#issuecomment-2240280827))
2024-07-19 21:56:09 +00:00
4aef5a1134 [c10] add an option to pg_config split share (#130877)
Summary:
context is: #129865
We want to give users an option to not share comms resources so that
comm ops can overlap
Test Plan:
Augmented UT

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130877
Approved by: https://github.com/fduwjj
2024-07-19 21:11:26 +00:00
0ca7b6ddd9 [easy][pytorch][counters] Move WaitCounter in c10/util (#131021)
Summary: Since the WaitCounter frontend itself has minimal dependencies, it's fine to move it into c10. Specific backends can be registered/linked separately.

Test Plan: unit test

Reviewed By: jamesperng, asiab4, c-p-i-o

Differential Revision: D59842868

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131021
Approved by: https://github.com/asiab4
2024-07-19 20:58:32 +00:00
c64ad2403c LF runners: Add new runner types for Amazon2023 AMIs (#131246)
Add new LF runner types with the Amazon2023 ami, matching the change done in https://github.com/pytorch/test-infra/pull/5487
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131246
Approved by: https://github.com/malfet
2024-07-19 20:30:41 +00:00
85ca88a2bb [Distributed][PP export] update tracing to handle autocast inclusion (#130998)
Fixes https://github.com/pytorch/pytorch/issues/128394

This updates PP export tracing to use the no_grad() context and to avoid predispatch.
This enables tracing for HF llama models that currently fail due to not handling the use of autocast in the Rope embeddings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130998
Approved by: https://github.com/fduwjj
2024-07-19 20:08:00 +00:00
ceee87df2e [export] modify export code owners (#130894)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130894
Approved by: https://github.com/zhxchen17
2024-07-19 19:49:34 +00:00
5f981388ec Revert "[ROCm] Enable ROCm support for inductor's dynamic_rblock_scaling (#129663)"
This reverts commit d7a78ec8b938a61297221912464f5afef288b823.

Reverted https://github.com/pytorch/pytorch/pull/129663 on behalf of https://github.com/atalman due to Breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/129663#issuecomment-2240011143))
2024-07-19 19:46:26 +00:00
125be005eb [Docs] Fix fake tensor doc (#131205)
Fix this: `# AttributeError: 'FakeTensorMode' object has no attribute 'from_real_tensor'`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131205
Approved by: https://github.com/eellison
2024-07-19 17:59:45 +00:00
e49c0acc39 [dynamo] Revert https://github.com/pytorch/pytorch/pull/130416 (#131058)
All the changes brought by the original PR have been addressed in alternative ways in the stack. Explaining why the original PR has to be reverted requires more effort, because there is some bad interaction with export.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131058
Approved by: https://github.com/williamwen42
2024-07-19 17:26:24 +00:00
042be441ba [aoti] Unskip some aot inductor tests (#130973)
Trying to unskip some tests, and if they are still broken, add reasons.

## example testing command
```
pytest -v test/inductor/test_aot_inductor.py -k test_add_complex
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130973
Approved by: https://github.com/ColinPeppler
2024-07-19 17:19:35 +00:00
9b5c70878b [Fix] Missing parameter happens when retracing an already jit.scripted module (#129787)
#### Issue
Model parameters sometimes do not appear in the output of `named_parameters()`, for example when trying to jit.trace an already jit.scripted model. This PR fixes that by relying on `state_dict` to get both parameters (`requires_grad=True`) and buffers.

#### Test Plan
* `pytest test/export/test_converter.py -s -k test_convert_retrace_nested_scripted_modules`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129787
Approved by: https://github.com/angelayi
2024-07-19 16:58:48 +00:00
abb3f2822c [aotinductor] Support additional lifted constants supplied to const folding. (#130743)
Summary:
In the export workflow, we always have a lifted graph which doesn't fetch constants through get_attr nodes. This causes some compatibility issues when we try to use inductor's split_const_gm function with a lifted graph.

This diff makes an additive change to split_const_gm's interface such that, when the pass sees that a placeholder node is present in the lifted_constants table, it will also use that as the source of constness.

This change won't break the existing code, and the lifted_constants table can be used orthogonally to the existing const folding mechanisms.

Also as required from MTIA team, we want to introduce a small callback function used to skip certain nodes during const folding.
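
The constness rule described above can be sketched in plain Python. This is only an illustration under stated assumptions: `Node`, `find_const_nodes`, and the `skip_folding` callback are hypothetical names, and the real `split_const_gm` signature is not shown here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Toy stand-in for a graph node; not torch.fx.Node.
    name: str
    op: str  # "placeholder", "get_attr", or "call_function"
    inputs: list = field(default_factory=list)


def find_const_nodes(nodes, lifted_constants=None, skip_folding=None):
    """Return the names of nodes that can be const-folded.

    `nodes` must be in topological order. A node is constant if it is a
    get_attr node, a placeholder listed in `lifted_constants`, or a call
    whose inputs are all constant. `skip_folding` is an optional callback
    that excludes a node from folding.
    """
    lifted_constants = lifted_constants or {}
    const = set()
    for node in nodes:
        if skip_folding is not None and skip_folding(node):
            continue
        if node.op == "get_attr":
            const.add(node.name)
        elif node.op == "placeholder":
            if node.name in lifted_constants:
                const.add(node.name)
        elif node.inputs and all(i in const for i in node.inputs):
            const.add(node.name)
    return const


# A lifted graph: the weight "w" arrives as a placeholder, not via get_attr.
graph = [
    Node("x", "placeholder"),
    Node("w", "placeholder"),
    Node("w_t", "call_function", ["w"]),
    Node("w_sq", "call_function", ["w_t"]),
    Node("out", "call_function", ["x", "w_sq"]),
]

print(find_const_nodes(graph))
print(sorted(find_const_nodes(graph, lifted_constants={"w": object()})))
print(sorted(find_const_nodes(
    graph,
    lifted_constants={"w": object()},
    skip_folding=lambda n: n.name == "w_sq",
)))
```

Without the table nothing is foldable, because a lifted graph has no get_attr nodes; with it, everything that depends only on `w` becomes foldable unless the callback skips it.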

For the internal followup counterpart, see D59685145

Test Plan: buck run mode/opt caffe2/test:test_export -- -r split_const_gm

Differential Revision: D59692790

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130743
Approved by: https://github.com/desertfire, https://github.com/SherlockNoMad
2024-07-19 16:48:56 +00:00
31e79aae6a Another follow up to #130260 (#130993)
Another followup to https://github.com/pytorch/pytorch/pull/130260
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130993
Approved by: https://github.com/huydhn
2024-07-19 16:43:54 +00:00
d4a79d4a7c Fix an example: Resolve broadcasting error in attn_bias and attn_mask… (#130209)
… addition, fix device assignment for newly created variables in method

Fix an example: Resolve broadcasting error in attn_bias and attn_mask addition, fix device assignment for newly created variables in method

1. `attn_bias += attn_mask` would cause a broadcasting error. Because `attn_bias` has shape (L, S), an in-place add must also produce an output of shape (L, S). When the input has shape (N, num_heads, L, S), broadcasting is triggered and the result would have shape (N, num_heads, L, S), which does not fit into `attn_bias`.
2. `attn_bias` is a variable newly created in the method, and it is not assigned a device.
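
The shape problem in item 1 can be reproduced without PyTorch. The sketch below is a plain-Python illustration that assumes NumPy-style broadcasting rules; `broadcast_shape` and `can_add_in_place` are hypothetical helpers, not PyTorch APIs, and the sizes are made up.

```python
from itertools import zip_longest


def broadcast_shape(a, b):
    """Return the broadcast shape of two shapes, or raise ValueError."""
    out = []
    # Align shapes from the trailing dimension; missing dims count as 1.
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and x != 1 and y != 1:
            raise ValueError(f"shapes {a} and {b} are not broadcastable")
        out.append(y if x == 1 else x)
    return tuple(reversed(out))


def can_add_in_place(target, other):
    """An in-place add must leave the target's shape unchanged."""
    return broadcast_shape(target, other) == tuple(target)


L, S, N, num_heads = 4, 6, 2, 3
attn_bias = (L, S)
attn_mask = (N, num_heads, L, S)

print(broadcast_shape(attn_bias, attn_mask))   # shape an out-of-place add would produce
print(can_add_in_place(attn_bias, attn_mask))  # the 4-D result does not fit into (L, S)
print(can_add_in_place(attn_bias, (L, S)))     # same-shape masks are fine
```

An out-of-place add does not have this constraint, because the result is a new tensor with the broadcast shape.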

**This is my retry of #130200 .** I used a wrong account in that pr.

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130209
Approved by: https://github.com/mikaylagawarecki
2024-07-19 15:23:22 +00:00
451fc029fe docs: note transposed weight initialisations (#130122)
Fixes #129834

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130122
Approved by: https://github.com/mikaylagawarecki
2024-07-19 15:23:03 +00:00
5f3d8b8788 Revert "[c10] add an option to pg_config split share (#130877)"
This reverts commit 367213a608528ee74e67e03bf11f775e263ef480.

Reverted https://github.com/pytorch/pytorch/pull/130877 on behalf of https://github.com/atalman due to breaks internal build ([comment](https://github.com/pytorch/pytorch/pull/130877#issuecomment-2239298810))
2024-07-19 14:24:50 +00:00
25d8a0480b [lint] Remove unnecessary BUCKRESTRICTEDSYNTAX suppressions
Differential Revision: D59935630

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131187
2024-07-19 07:19:11 -07:00
a6a2cd6257 Typo fix (#131037)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131037
Approved by: https://github.com/albanD, https://github.com/Skylion007
2024-07-19 13:17:54 +00:00
1b72cf0b09 Add hasattr for tensor variable (#131008)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131008
Approved by: https://github.com/anijain2305
ghstack dependencies: #131007
2024-07-19 12:43:27 +00:00
1f961ad495 Runs aten cuda cpp tests in CI (#131061)
It seems like these tests are never run because https://github.com/pytorch/pytorch/pull/99956 got rid of the `pushd $1` which would make the if conditions true in CUDA builds.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131061
Approved by: https://github.com/malfet, https://github.com/eqy
2024-07-19 12:35:33 +00:00
d7a78ec8b9 [ROCm] Enable ROCm support for inductor's dynamic_rblock_scaling (#129663)
As of ROCm 6.1 [hipDeviceProp_t::regsPerMultiprocessor](https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/structhip_device_prop__t.html#a7390d5b180d63978c81aa971060270b4) is now available allowing us to enable this attribute on ROCm.
```
>>> torch.cuda.get_device_properties(0)
_CudaDeviceProperties(name='AMD Instinct MI250X/MI250', major=9, minor=0, gcnArchName='gfx90a:sramecc+:xnack-', total_memory=65520MB, multi_processor_count=104)
>>> torch.cuda.get_device_properties(0).regs_per_multiprocessor
65536
```

With https://github.com/triton-lang/triton/pull/3962 we can extract n_regs and n_spills from a Triton binary with the AMD backend, allowing us to enable inductor's dynamic_rblock_scaling on ROCm, initially implemented in https://github.com/pytorch/pytorch/pull/115094

Leaving this in draft until following PRs have landed:
- https://github.com/pytorch/pytorch/pull/129361 to bump the triton commit pin
- https://github.com/pytorch/pytorch/pull/128449 to allow us to grab warp_size from device properties instead of hard coding 64 on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129663
Approved by: https://github.com/jansel, https://github.com/shunting314
2024-07-19 09:45:03 +00:00
cyy
feef057691 [1/N] Fix Wunused-parameter warnings (#130924)
Before we can turn Wunused-parameter into an error
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130924
Approved by: https://github.com/ezyang
2024-07-19 06:14:51 +00:00
eee76c86a8 Write trace_structured events to scuba (#130955)
Summary: https://fb.workplace.com/groups/1286739428954016/posts/1287192258908733

Test Plan: Run test with tlparse and inspect https://www.internalfb.com/intern/scuba/query/?dataset=pt2_trace_structured_events

Differential Revision: D59866096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130955
Approved by: https://github.com/ezyang
2024-07-19 06:02:47 +00:00
982309b501 Initial commit of flight recorder trace (#130764)
Summary:
`fr_trace.py` is used to analyze flight recorder dump files.
This script was taken from @wconstab and @zdevito.
Only minor changes made were to make the linter happy and add a few odd new fields that I added in version `2.2` of the collector portions.

Test Plan:
Tested manually on some flight recorder data and it seems to run.

TODO:
Address 15 odd `#type: ignore` that I put in there to make the linter happy for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130764
Approved by: https://github.com/fduwjj
2024-07-19 06:00:54 +00:00
fd4899bc58 [ONNX] Run ruff pyupgrade to update type annotations (#130657)
Use the newest syntax for type annotations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130657
Approved by: https://github.com/titaiwangms
2024-07-19 05:09:44 +00:00
4f60a2e39c Set correct output dtype for dequantize op during convert_pt2e in decomposed mode (#128953)
Earlier the signature of dequantize ops for decomposed quantized Tensor was changed for wider use-cases where the output dtype can be different from torch.float and needs to be passed during dequantization.
Please refer: https://github.com/pytorch/pytorch/pull/121450

However, setting of correct output dtype for dequantize ops was still missing in convert_pt2e flow.

This change enables the users to use PT2E quantization flow with non torch.float unquantized dtype, such as torch.bfloat16.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128953
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2024-07-19 04:58:02 +00:00
d59803fb67 Refactored flexattention kernel (#130904)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130904
Approved by: https://github.com/drisspg
ghstack dependencies: #130871
2024-07-19 04:56:32 +00:00
ac76dd606f [dynamo] Alternative way to skip empty hooks guards on inbuilt nn modules (#131057)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131057
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #131056
2024-07-19 04:42:38 +00:00
00e54e74ff [dynamo][cpp-guards] Fix bug in dict tags (#131056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131056
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-07-19 04:42:38 +00:00
3c622fbcd3 [inductor] Fix var_to_range in IndexPropagation (#130984)
The current code assumes that indirect variables will be created by the
same `IndexPropagation` instance, however that isn't true in the case of
masked sub-blocks where we take in variables from the parent block.

This fixes the issue by moving the var range information up to the
`LoopBody` object where it can be shared by all sub-blocks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130984
Approved by: https://github.com/lezcano
2024-07-19 03:08:00 +00:00
b556d31586 Update torch-xpu-ops pin (ATen XPU implementation) (#131015)
Regular update.
1. New 90 ATen operators and their variants are supported for XPU.
2. Bugfixing: a. Fixing out-of-bound memory access in index_put kernel b. Fixing debug build error
3. Binary change. Split device AOT code of SYCL kernel into multiple libraries to avoid linkage failure.
4. torch-xpu-ops test case enhancement: a. Hook PyTorch testing op_db to align opInfo configuration with CUDA b. Hook _check_arg_device2 and freeze_rng_state to make XPU happy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131015
Approved by: https://github.com/EikanWang
2024-07-19 02:18:55 +00:00
52cb9abb1d Add deterministic support in nn.functional.interpolate for XPU (#129864)
For both CUDA and XPU, there is no native deterministic implementation of `aten::upsample_bilinear` and `aten::replication_pad`. CUDA leverages the operator decomposition path in the frontend hook of `nn.functional.interpolate` as its deterministic implementation. The XPU backend uses the same solution in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129864
Approved by: https://github.com/dvrogozh, https://github.com/albanD, https://github.com/EikanWang
2024-07-19 02:15:42 +00:00
39493aa934 [inductor][cpp][gemm] move bias add to epilogue (#130675)
Speedup bias-add compute by moving it to the epilogue. Performance numbers measured on "Intel (R) Xeon (R) CPU Max 9480", single core, bf16.
Before
AUTOTUNE linear_unary(512x768, 3072x768, 3072)
  cpp_packed_gemm_0 1.9200 ms 100.0%
  _linear_pointwise 1.9345 ms 99.3%

After
AUTOTUNE linear_unary(512x768, 3072x768, 3072)
  cpp_packed_gemm_0 1.8321 ms 100.0%
  _linear_pointwise 1.9246 ms 95.2%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130675
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2024-07-19 01:16:34 +00:00
5a6a806b19 [Inductor UT] Generalize device-bias code in case TestFxGraphCache.test_inductor_counters. (#131006)
[Inductor UT] Generalize device-bias code in case `TestFxGraphCache.test_inductor_counters`.
Fix #131005

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131006
Approved by: https://github.com/masnesral
2024-07-19 01:14:22 +00:00
208dffa702 [Compiled DDP] DDP + AC unit test (#130981)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130981
Approved by: https://github.com/fegin
2024-07-19 01:07:41 +00:00
cyy
3cc6183ce1 Fix getAugOp error (#131033)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131033
Approved by: https://github.com/ezyang
2024-07-19 01:07:24 +00:00
6e7b9ee8a0 [inductor] adapte windows file path (#130713)
This PR depends on https://github.com/pytorch/pytorch/pull/130132 landing successfully.
The detailed log: https://github.com/pytorch/pytorch/issues/124245#issuecomment-2211889758

After the file path was adapted for Windows, the first Windows inductor case ran successfully.

```python
import torch

def foo(x, y):
    a = torch.sin(x)
    b = torch.cos(x)
    return a + b
opt_foo1 = torch.compile(foo)
print(opt_foo1(torch.randn(10, 10), torch.randn(10, 10)))
```

Result:
![image](https://github.com/user-attachments/assets/4944df47-e74d-476b-8eb5-1d1fd5abeb41)

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130713
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2024-07-18 23:19:38 +00:00
e880cb2fe0 [ONNX] Remove beartype usage (#130484)
beartype has served us well in identifying type errors and ensuring we call internal functions with the correct arguments (thanks!). However, the value of having beartype is diminished because of the following:

1. When beartype improved its support for Dict[] type checking, it discovered typing mistakes in some functions that were previously uncaught. This caused the exporter to fail with newer versions of beartype when it used to succeed. Since we cannot fix PyTorch and release a new version just because of this, it creates confusion for users who have beartype in their environment when they use torch.onnx.
2. beartype adds an additional call line in the traceback, which makes the already thick dynamo stack even larger, affecting readability when users diagnose errors with the traceback.
3. Since the typing annotations need to be evaluated, we cannot use new syntaxes like `|` because we need to maintain compatibility with Python 3.8. We don't want to wait for PyTorch to take py310 as the lowest supported Python version before using the new typing syntaxes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130484
Approved by: https://github.com/titaiwangms
2024-07-18 22:07:40 +00:00
fb3674b1f4 Revert "[Autograd] Cond Higher-Order Operation (#126911)"
This reverts commit f7058b735e52a1d876912f8c96a594673a495007.

Reverted https://github.com/pytorch/pytorch/pull/126911 on behalf of https://github.com/clee2000 due to broke lint and functorch/test_aotdispatch f7058b735e Probably a landrace since both the test and lint passed on PR ([comment](https://github.com/pytorch/pytorch/pull/126911#issuecomment-2237703182))
2024-07-18 22:06:40 +00:00
686b7f046a [Fix]: TSConverter handles call ops with multiple outputs (#129294)
#### Issue
* The current handling of call ops does not support IR with multiple outputs. If an op has multiple outputs, we add an implicit unpack to map the outputs. E.g.,
```
%5 : Tensor, %6 : Tensor = aten::max(%x.1, %3, %4), scope: export.test_converter.M:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:774:20
```
* There are some cases that `prim::If` sub-blocks do not return any outputs. E.g.,
```
%9 : bool = aten::gt(%8, %3), scope: export.test_converter.M::/torch.nn.modules.pooling.AdaptiveMaxPool2d::pool # <string>:5:9
   = prim::If(%9), scope: export.test_converter.M::/torch.nn.modules.pooling.AdaptiveMaxPool2d::pool # <string>:5:2
    block0():
      -> ()
    block1():
       = prim::RaiseException(%5, %4), scope: export.test_converter.M::/torch.nn.modules.pooling.AdaptiveMaxPool2d::pool # <string>:5:2
      -> ()
```
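
The two cases above (several outputs, and no outputs at all) can be sketched as a small binding step. This is a plain-Python illustration, not the converter's code; `bind_outputs` and the `env` dictionary are hypothetical names.

```python
def bind_outputs(env, output_names, result):
    """Record an op's result(s) in `env` under the IR output names."""
    if len(output_names) == 0:
        # e.g. a prim::If whose sub-blocks return nothing
        return env
    if len(output_names) == 1:
        env[output_names[0]] = result
        return env
    if len(result) != len(output_names):
        raise ValueError(
            f"expected {len(output_names)} outputs, got {len(result)}"
        )
    # Implicit unpack: map each returned value to its own IR output name.
    for name, value in zip(output_names, result):
        env[name] = value
    return env


env = {}
bind_outputs(env, ["%5", "%6"], ("values", "indices"))  # like aten::max with a dim
bind_outputs(env, [], None)                              # op with no outputs
bind_outputs(env, ["%9"], True)                          # ordinary single output
print(env)
```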

#### Test Plan
We did an exhaustive search of all torch APIs that can return multiple outputs. We sampled some of the common ones and added new test cases based on those.
* `pytest test/export/test_converter.py -s -k test_ts2ep_multi_outputs_on_call_ops`

#### Appendix
* aten ops that return multiple outputs.
```
aten._batch_norm_impl_index
aten._batch_norm_no_update
aten._batch_norm_with_update
aten._batch_norm_with_update_functional
aten._cudnn_rnn
aten._efficient_attention_backward
aten._efficient_attention_forward
aten._embedding_bag
aten._embedding_bag_forward_only
aten._flash_attention_backward
aten._flash_attention_forward
aten._fused_adam
aten._fused_dropout
aten._fused_moving_avg_obs_fq_helper
aten._linalg_det
aten._linalg_eigh
aten._linalg_slogdet
aten._linalg_solve_ex
aten._linalg_svd
aten._native_batch_norm_legit
aten._native_batch_norm_legit_functional
aten._native_batch_norm_legit_no_training
aten._pack_padded_sequence
aten._prelu_kernel_backward
aten._scaled_dot_product_efficient_attention
aten._scaled_dot_product_efficient_attention_backward
aten._scaled_dot_product_flash_attention
aten._scaled_dot_product_flash_attention_backward
aten._scaled_dot_product_flash_attention_for_cpu
aten._scaled_dot_product_flash_attention_for_cpu_backward
aten._thnn_fused_lstm_cell
aten._thnn_fused_lstm_cell_backward_impl
aten._unique2
aten._weight_norm_interface
aten.adaptive_max_pool2d
aten.adaptive_max_pool3d
aten.aminmax
aten.batch_norm_backward
aten.convolution_backward
aten.cudnn_batch_norm
aten.cudnn_batch_norm_backward
aten.cummax
aten.cummin
aten.fractional_max_pool2d
aten.frexp
aten.grid_sampler_2d_backward
aten.grid_sampler_3d_backward
aten.gru
aten.linalg_cholesky_ex
aten.linalg_eig
aten.linalg_inv_ex
aten.linalg_ldl_factor_ex
aten.linalg_lu
aten.linalg_lu_factor_ex
aten.linalg_qr
aten.linear_backward
aten.log_sigmoid_forward
aten.lstm
aten.lu_unpack
aten.max
aten.max_pool2d_with_indices
aten.max_pool3d_with_indices
aten.median
aten.min
aten.miopen_batch_norm
aten.miopen_batch_norm_backward
aten.mkldnn_rnn_layer
aten.mkldnn_rnn_layer_backward
aten.mode
aten.multilabel_margin_loss_forward
aten.nanmedian
aten.native_batch_norm
aten.native_batch_norm_backward
aten.native_dropout
aten.native_group_norm
aten.native_group_norm_backward
aten.native_layer_norm
aten.native_layer_norm_backward
aten.nll_loss2d_forward
aten.nll_loss_forward
aten.quantized_gru
aten.quantized_lstm
aten.rnn_relu
aten.rnn_tanh
aten.sort
aten.std_mean
aten.topk
aten.triangular_solve
aten.unique_dim
aten.var_mean
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129294
Approved by: https://github.com/angelayi
2024-07-18 21:55:18 +00:00
7f1cda1533 Autoheuristic: Do not store choices as metadata (#130304)
While for optimizations like pad_mm, there are always only two possible choices, for other decision procedures, like kernel choice selection, the set of "available" choices depends on the input. Instead of storing the choices as metadata, we can instead take a look at all choices for which we have collected data (i.e. `df[CHOICE_COL].unique()`).

In this PR, I also try to replace "choice" and "feedback" with global constants CHOICE_COL and FEEDBACK_COL.
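
A minimal sketch of deriving the choices from collected data rather than from metadata, written in plain Python so it runs without pandas. The constant values and the rows are assumptions made up for illustration; `available_choices` is a hypothetical helper that mirrors what `df[CHOICE_COL].unique()` would return.

```python
CHOICE_COL = "choice"      # assumed value of the constant
FEEDBACK_COL = "feedback"  # assumed value of the constant

# Made-up measurements: one row per (input, choice) pair.
rows = [
    {"m": 512, CHOICE_COL: "pad", FEEDBACK_COL: 1.92},
    {"m": 512, CHOICE_COL: "no_pad", FEEDBACK_COL: 1.83},
    {"m": 1024, CHOICE_COL: "pad", FEEDBACK_COL: 3.10},
]


def available_choices(rows):
    """Distinct choices in first-seen order."""
    return list(dict.fromkeys(row[CHOICE_COL] for row in rows))


print(available_choices(rows))
```

Because the set is computed from the data, a decision procedure whose choices vary with the input needs no separate metadata entry.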

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130304
Approved by: https://github.com/eellison
2024-07-18 21:39:42 +00:00
4d9f2a6d56 Small expandable segments refactor. (#130889)
Makes next PRs that will export/import segment handles easier to write.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130889
Approved by: https://github.com/dsjohns2
ghstack dependencies: #130888
2024-07-18 21:34:38 +00:00
d8fed480ef Move handle-creation logic into cudacaching allocator. (#130888)
A later PR will then make the handle abstract and able to use
either cudaMalloc or expandable segments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130888
Approved by: https://github.com/dsjohns2
2024-07-18 21:34:38 +00:00
3e9cf1cc80 Fix potential segfault during deletion (#131036)
Summary: See comment in code

Test Plan: code reading

Reviewed By: albanD

Differential Revision: D59872819

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131036
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-07-18 21:18:31 +00:00
f7058b735e [Autograd] Cond Higher-Order Operation (#126911)
This is an updated PR to equip cond with the autograd feature and replaces the old [PR](https://github.com/pytorch/pytorch/pull/126007)

@ydwu4 I tried to incorporate your requests already.

Currently there are two problems that I struggle with solving:

1. There seems to be an import issue when trying to import cond in `torch/__init__.py`, see [here](8a704035c9/torch/__init__.py (L1914-L1916)). Therefore, I had to comment those lines, which resolved the import issues, but I believe cond is not proberly exposed as torch.cond.
2. I am not entirely sure how to deal with the opinfo test in `hop_db.py`

Co-authored-by: Yidi Wu <yidi@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126911
Approved by: https://github.com/ydwu4
2024-07-18 21:09:09 +00:00
24467ba2ec Update pin (#130896)
Test the XLA pin update
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130896
Approved by: https://github.com/anijain2305
2024-07-18 21:04:30 +00:00
793b17ebcb Add numeric_debugger top level APIs (#130643)
Summary:
Add three top level APIs for numeric debugger in pt2e flow that can log intermediate output in the model
and calculate summary for metric comparisons between nodes in two graphs

* `prepare_for_propagation_comparison`
* `extract_results_from_loggers`
* `compare_results`

Test Plan:
python test/test_quantization.py -k test_prepare_for_propagation_comparison
python test/test_quantization.py -k test_extract_results_from_loggers

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130643
Approved by: https://github.com/dulinriley, https://github.com/tarun292
2024-07-18 20:54:18 +00:00
726b9268d2 Revert "Re-implement pin_memory to be device-agnostic by leveraging the Accelerator concept (#126376)"
This reverts commit c986aeea2d7d9403be702119e3dd4dcb18134fc2.

Reverted https://github.com/pytorch/pytorch/pull/126376 on behalf of https://github.com/atalman due to Failing internal builds ([comment](https://github.com/pytorch/pytorch/pull/126376#issuecomment-2237496633))
2024-07-18 20:25:20 +00:00
e7f7c5c3f8 [inductor] Avoid fallback case for custom scan op lowering (#130936)
We currently can't generate split scans when there are multiple scan
values, so we normally fall back to ATen. However, for the higher order
scan op, we can't fallback so it makes sense to just generate the slower
kernel anyway. This avoids having special shapes where we fail to
codegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130936
Approved by: https://github.com/lezcano
2024-07-18 19:53:47 +00:00
367213a608 [c10] add an option to pg_config split share (#130877)
Summary:
context is: #129865
We want to give users an option to not share comms resouces so that
comm opts can overlap
Test Plan:
Augmentd UT

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130877
Approved by: https://github.com/fduwjj
2024-07-18 19:03:00 +00:00
c015e5b9e3 Make sure that TransformGetItemToIndex for all graph replay (#131003)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131003
Approved by: https://github.com/Chillee
ghstack dependencies: #130871
2024-07-18 18:32:21 +00:00
82242a258a rm duplicate index_dtype arg (#130803)
- Remove duplicate `index_dtype` argument for `_test_meta_sparse_compressed` operation.
- Also remove unused `y_v_numel` variable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130803
Approved by: https://github.com/soulitzer
2024-07-18 18:30:13 +00:00
6d9f74f0af Add flex decoding benchmark (#130850)
ghstack-source-id: b4f26fb66ed47907b11580c8c853737959c58811
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130788

Add benchmark for flex decoding.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130850
Approved by: https://github.com/Chillee, https://github.com/drisspg
2024-07-18 18:09:25 +00:00
fff92d4f18 Revert "Use inductor TestCase for test_replicate_with_compiler.py (#129494)"
This reverts commit 9f392f8294e928aec49599ad649aa899e1356102.

Reverted https://github.com/pytorch/pytorch/pull/129494 on behalf of https://github.com/atalman due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/129494#issuecomment-2237147504))
2024-07-18 17:42:05 +00:00
745324e487 [export] turn on hybrid symints by default (#130775)
Sets `prefer_deferred_runtime_asserts_over_guards=True` for export, so any guards emitted from `SymNode.expect_true` (for example, guards that are implicitly required to be true for an op to succeed) won't lead to constraint violations. Instead these should appear in the graph as runtime asserts, or potentially as replacement expressions for placeholder shapes.

For example, this reshape op should emit s0 * s1 = s2, deferred as a runtime assert.
```
x = torch.randn(4, 8)  # [s0, s1]
y = torch.randn(32)  # [s2]
out = x.reshape(-1) + y
# this emits Eq(s0 * s1, s2), and we represent y's shape as [s0*s1] in the graph.
```
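
To make the difference between a trace-time constraint violation and a deferred runtime assert concrete, here is a toy sketch in plain Python. It does not use the export machinery: `flatten_add_shape` is a hypothetical helper that only mimics checking `Eq(s0 * s1, s2)` when the program runs, and it ignores broadcasting of size-1 dimensions.

```python
import math


def flatten_add_shape(x_shape, y_shape):
    """Output shape of x.reshape(-1) + y, checking Eq(s0 * s1, s2) at run time."""
    flat = math.prod(x_shape)  # s0 * s1
    (s2,) = y_shape
    if flat != s2:
        raise RuntimeError(
            f"runtime assert failed: Eq({x_shape[0]}*{x_shape[1]}, {s2})"
        )
    return (flat,)


print(flatten_add_shape((4, 8), (32,)))   # the shapes from the example above
print(flatten_add_shape((2, 16), (32,)))  # other sizes pass while s0 * s1 == s2
```

The relation is enforced per call, so any input sizes that satisfy it are accepted and only a mismatch fails.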

However, other complex guards can still cause export to fail, for instance guards emitted from `SymNode.guard_bool/guard_size_oblivious` (e.g. explicit if-else conditions in user code or lower-level op implementations hit during tracing) can still raise constraint violations. These can be deferred with `allow_complex_guards_as_runtime_asserts=True`. We don't yet make this default, because while this makes export more likely to succeed, it results in non-trivial asserts being emitted that often represent specialization to a variant of the op, or checks related to 0/1 specialization.

We also remove forced specializations for export and kill the `_disable_forced_specializations` flag - now any guard we can't express with Dims/DerivedDims either are handled with Hybrid SymInts, or should be resolved with rewriting or deferring.

Follow up:
Currently, `ShapeEnv._set_replacement()` is called for complex equality expressions (e.g. s2 -> s0*s1 in the example above), and the ExportedProgram stores `s0*s1` in the input placeholder. This isn't checked for validity when the program is run, so an option is to avoid replacement and/or runtime assert on equality.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130775
Approved by: https://github.com/avikchaudhuri
2024-07-18 17:40:58 +00:00
22388ffe03 Graph break on tostring for numpy remapping (#131007)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131007
Approved by: https://github.com/williamwen42
2024-07-18 17:23:41 +00:00
8bf0be7c78 [CUDAGraph] Add operator.mul to skip list for find_input_mutations (#130986)
The #130912 error happens since `operator.mul` does not have `_schema`.

So why do we have `operator.mul` and why is it not dispatched to `torch.ops.aten.mul`? This op comes from %mul_3.

    %mul_3 : [num_users=50] = call_function[target=operator.mul](args = (%arg689_1, 4096), kwargs = {})

`%arg689_1` is a placeholder with `meta['val'] = s0`. It comes from dynamic shapes and represents the batch size, since it's also used in many other nodes such as:

    %view_1 : [num_users=1] = call_function[target=torch.ops.aten.view.default](args = (%mm, [%arg689_1, 4096, 320]), kwargs = {})
and

    %native_group_norm_2 : [num_users=1] = call_function[target=torch.ops.aten.native_group_norm.default](args = (%div_1, %arg16_1, %arg17_1, %arg689_1, 320, 4096, 32, 1e-06), kwargs = {})

To fix the issue, we can add `operator.mul` to skip list.
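
The failure mode and the fix can be illustrated with the standard library alone. The sketch below is an assumption-laden toy: `SKIP_TARGETS`, `targets_with_schema`, and `FakeAtenOp` are hypothetical names, not the identifiers used in `find_input_mutations`.

```python
import operator

# Hypothetical skip list: Python-level targets that carry no schema.
SKIP_TARGETS = {operator.mul}


class FakeAtenOp:
    # Stand-in for an ATen overload, which does expose `_schema`.
    _schema = "aten::view(Tensor(a) self, SymInt[] size) -> Tensor(a)"


def targets_with_schema(targets):
    """Return the targets whose schema should be inspected for mutations."""
    checked = []
    for target in targets:
        if target in SKIP_TARGETS:
            continue  # plain Python arithmetic on a SymInt, nothing to inspect
        if not hasattr(target, "_schema"):
            raise AttributeError(f"{target!r} has no _schema")
        checked.append(target)
    return checked


print(hasattr(operator.mul, "_schema"))  # the builtin has no schema attribute
print(targets_with_schema([operator.mul, FakeAtenOp]))
```

Without the skip list, the `operator.mul` node would reach the schema lookup and raise.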

Fixes #130912

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130986
Approved by: https://github.com/eellison
2024-07-18 17:11:39 +00:00
5979014059 DSD for TorchTune LoRA (#129635)
Fixes #128745
Solve the issue with conflicts when users use full_state_dict while the model is FSDP.

This currently solves the issue for `full_state_dict=True`, which raised the error
`'aten.copy_.default: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!',).`

TODO: for `broadcast_from_rank0=True, full_state_dict=True`, the error is
`NotImplementedError: c10d::broadcast_: attempted to run this operator with Meta tensors`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129635
Approved by: https://github.com/fegin
2024-07-18 17:00:35 +00:00
5484c86021 [export] Fully support extension op in serialization/deserialization. (#130851)
Summary: Finishes the mechanism for registering certain types of operators in a registry so that the serializer can handle them correctly. This is expected to be used first by executorch.

Test Plan: buck run mode/opt caffe2/test:test_export -- -r test_export_with_extension_op_serialization

Differential Revision: D59825148

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130851
Approved by: https://github.com/angelayi
2024-07-18 16:47:53 +00:00
85451b2cde [DTensor] Fix shard_dim_alltoall fake tensor return (#129945)
The shard_dim_alltoall op declares a single Tensor as its return type in its schema (here: https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/Functional.cpp#L628),
but its FakeTensor implementation returns a list of tensors (see the chunk() call here: https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/_collective_utils.py#L33).
As a result, it errors out when device="meta".

This PR fixes the fake tensor mode return type for 1d mesh and adds a test to compare shape with non-meta tensor case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129945
Approved by: https://github.com/wanchaol
2024-07-18 16:43:40 +00:00
16aaff7783 Fix mm pad regresion - more conservative estimation of plannable inputs (#128909)
- More conservative estimation of plannable inputs
- Consider constant_pad_nd as pointwise node in concat lowering
- Use aten.cat instead of constant_pad_nd when padding just a single dimension because it can be memory-planned away

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128909
Approved by: https://github.com/Chillee
2024-07-18 16:42:30 +00:00
27ded03545 [FX][export] DCE pass, check schema for node impurity (#130395)
Change the default DCE pass to check node schema for impure nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130395
Approved by: https://github.com/angelayi, https://github.com/jgong5
2024-07-18 16:31:40 +00:00
32ff04d30a [dtensor][debug] adding functionality to control noisiness of the debug output (#130410)
**Summary**
Currently, the output of CommDebugMode contains a lot of noise, such as operations like aten.detach.default that usually tell the user very little. I have created a set of these trivial operations and added a user argument, noise_level, that lets users choose how much information they want to receive.

noise_level = 1 prints module-level collective counts
noise_level = 2 prints operations not included in trivial operations and module information
noise_level = 3 prints all operations

In addition, I have removed the generate_module_tracing_table since noise_level = 1 essentially replaces it. Finally, I have updated the examples and test cases.
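
The three levels can be pictured with a small, self-contained sketch. It does not call CommDebugMode; `TRIVIAL_OPS`, `ModuleTrace`, and `render` are hypothetical names, and the only trivial op listed is the one the description mentions.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical set of "trivial" ops; the source names only aten.detach.default.
TRIVIAL_OPS = {"aten.detach.default"}


@dataclass
class ModuleTrace:
    collective_count: int
    ops: List[str]


def render(trace: Dict[str, ModuleTrace], noise_level: int) -> List[str]:
    """Build output lines for the given noise level (1, 2, or 3)."""
    lines: List[str] = []
    for name, module in trace.items():
        # Level 1 and above: module-level collective counts.
        lines.append(f"{name}: {module.collective_count} collectives")
        if noise_level <= 1:
            continue
        for op in module.ops:
            # Level 2 hides trivial ops; level 3 shows every op.
            if noise_level == 2 and op in TRIVIAL_OPS:
                continue
            lines.append(f"  {op}")
    return lines


trace = {"MLP": ModuleTrace(2, ["aten.detach.default", "aten.mm.default"])}
level_1 = render(trace, 1)
level_2 = render(trace, 2)
level_3 = render(trace, 3)
```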

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_json_dump

2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump

3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing

4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing

5. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing

6. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130410
Approved by: https://github.com/XilunWu
2024-07-18 16:12:59 +00:00
8ea03372a1 [MPS] Store philox counter as part of the RNG state (#130662)
Fixes #130613

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130662
Approved by: https://github.com/malfet
2024-07-18 15:57:28 +00:00
cyy
7c90a82970 [Reland] [5/N] Change static functions in headers to inline (#131010)
Reland of #130673

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131010
Approved by: https://github.com/Skylion007
2024-07-18 15:53:48 +00:00
d6ae8bbf16 Revert "[export] Add print_readable to unflattener (#128617)"
This reverts commit 9fee87e4cd9efb55ee5427a8e6b3c57de7c599f9.

Reverted https://github.com/pytorch/pytorch/pull/128617 on behalf of https://github.com/clee2000 due to broke inductor/test_flex_attention https://github.com/pytorch/pytorch/actions/runs/9984688318/job/27595182606 433ef4e444 Not run on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/128617#issuecomment-2236867975))
2024-07-18 15:31:51 +00:00
120fdf7ee2 Revert "[aota] Needs autograd if an input requires_grad, agnostic to enable_grad (#128890)"
This reverts commit e98135d1ad2f999fec649ecd21b35f3d5676be43.

Reverted https://github.com/pytorch/pytorch/pull/128890 on behalf of https://github.com/zou3519 due to broke trunk tests, probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/128890#issuecomment-2236790805))
2024-07-18 14:58:25 +00:00
5a90ed3523 Reinplacing should ignore copy_ nodes where the mutated arg is not read (#130866)
Might fix #127660, need to test some more cases.

We update the reinplacing pass. If we have something like the following,
where "sin" is a custom op (this situation should also apply to triton
kernels)
```py
def graph(x):
    y = sin(x)
    z = sin(y)
    x.copy_(z)
```
then the reinplacer used to produce the following:
```py
"""step 1: reinplaces the first sin"""
def graph(x):
    x_clone = x.clone()
    sin_out(x, out=x_clone)
    z = sin(x_clone)
    x.copy_(z)

"""step 2: reinplaces the second sin"""
def graph(x):
    x_clone = x.clone()
    sin_out(x, out=x_clone)
    sin_out(x_clone, out=x_clone)
    x.copy_(x_clone)
```
However, the first clone is unnecessary. It is safe to reinplace
the first sin into the following:
```py
def graph(x):
    sin_out(x, out=x)
    z = sin(x)
    x.copy_(z)
```
because there are no users of `x`'s original value (the copy_ node
doesn't actually use the original value of x!)

This PR updates the reinplacing pass to ignore copy_ when computing
whether the original value of the mutated argument is still needed.

NB: this also applies to triton kernels, but it was easier for me to
reason about custom ops (and my repros were all for custom ops).
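
The check described above can be sketched on a toy graph. This is not the Inductor pass; `Node` and `original_value_still_needed` are hypothetical, and node outputs are not modeled.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Node:
    op: str
    args: Tuple[str, ...]  # names of the values passed to the op


def original_value_still_needed(nodes: List[Node], mutated: str, after: int) -> bool:
    """True if any node after index `after` reads the old value of `mutated`.

    A copy_ whose destination (first argument) is `mutated` only overwrites
    it, so it does not count as a reader.
    """
    for node in nodes[after + 1:]:
        if node.op == "copy_" and node.args[0] == mutated:
            # Only the source arguments (args[1:]) are read by copy_.
            if mutated in node.args[1:]:
                return True
            continue
        if mutated in node.args:
            return True
    return False


graph = [
    Node("sin", ("x",)),        # y = sin(x)
    Node("sin", ("y",)),        # z = sin(y)
    Node("copy_", ("x", "z")),  # x.copy_(z)
]
needs_clone = original_value_still_needed(graph, "x", after=0)

# If something else reads x later, the original value must be preserved.
graph_with_reader = graph[:2] + [Node("add", ("x", "z")), graph[2]]
needs_clone_with_reader = original_value_still_needed(graph_with_reader, "x", after=0)
```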

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130866
Approved by: https://github.com/oulgen
2024-07-18 13:47:54 +00:00
dd39dca034 Removing some cruff and updating signatures for consistency (#130871)
# Summary

- This removes a bunch of example score mods that were primarily used for testing and places them directly in the test file. We should follow up with merging test_flex_decode and test_flash when the velocity slows down a little
- Fixes a bug with indexing on block mask
- Adds some doc strings to helper funcs and fixes some misc typing things
- Forces functions passed to `create_block_mask` to be mask_mods and updates test files

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130871
Approved by: https://github.com/joydddd, https://github.com/Chillee
2024-07-18 13:32:11 +00:00
9f6db5d0e2 Revert "Ensure staticmethods can be allowed in graph (#130882)"
This reverts commit b0387449db41c90fb4226baea97a8d889a0951c4.

Reverted https://github.com/pytorch/pytorch/pull/130882 on behalf of https://github.com/atalman due to failing torchrec tests internally, please fix and reland ([comment](https://github.com/pytorch/pytorch/pull/130882#issuecomment-2236528473))
2024-07-18 13:31:30 +00:00
63a0a65df9 Define 'zero-preserving unary functions' in docs (#130804)
Make explicit the definition of 'zero-preserving unary functions' in the sparse tensors documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130804
Approved by: https://github.com/soulitzer
2024-07-18 13:30:29 +00:00
eqy
1b07d42171 Add @syed-ahmed to CUDA CODEOWNERS paths (#130971)
CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130971
Approved by: https://github.com/soulitzer
2024-07-18 11:55:10 +00:00
c986aeea2d Re-implement pin_memory to be device-agnostic by leveraging the Accelerator concept (#126376)
This PR re-implements pin memory, aiming to get rid of the optional `device` argument and make all related APIs device-agnostic. We add two new abstract APIs in [AcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/detail/AcceleratorHooksInterface.h#L12) and redefine pin memory as: "Pin memory is always pinned for the current accelerator device". In detail, it uses [getAcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/Context.h#L61) in pin_memory/is_pinned to get an appropriate device and invoke the corresponding overridden interfaces, instead of using BackendSelect and then dispatching to the implementations of CUDA or other specific backends.

Note: New backends that want to implement and use pin memory just need to inherit from AcceleratorHooksInterface and override the `isPinnedPtr` and `getPinnedMemoryAllocator` methods.

Additional context: To avoid breaking backward compatibility, this PR preserves the `device` arg of the related APIs and throws a deprecation warning if the `device` arg is passed. Another PR will be submitted to update all PT callers (`Tensor.is_pinned()`, `Tensor.pin_memory()`...) not to pass this arg based on this PR. In the future, the `device` arg will be removed entirely.

Relates #124908
Relates #14560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126376
Approved by: https://github.com/albanD
2024-07-18 11:54:14 +00:00
38b7d89aa4 Uses context pointer for deleter to enable multiple CUDAPluggableAllocator usage (#130472)
We should be able to create multiple CUDAPluggableAllocators in the same pytorch program (see https://github.com/pytorch/pytorch/issues/124807, https://github.com/pytorch/pytorch/pull/125722 for context). When mixing CUDAPluggableAllocators in the same pytorch program, we need to make sure that the deleter passed in through the CUDAPluggableAllocator gets "attached" to the data_ptr and persists until program exit (when it's called to free the memory).

Currently, CUDAPluggableAllocator maintains a global `current_custom_allocator`. When creating the `DataPtr`, `raw_deleter` attaches `custom_raw_deleter` to the DataPtr, which calls `current_custom_allocator->raw_delete(...)`. This approach is fine when using only one allocator; however, in the multiple-allocator use case, the DataPtr would use the deleter of whatever `current_custom_allocator` currently points to. For example, if allocation 1 was done with `cudaMalloc` and allocation 2 was done with `ncclMemAlloc`, and `current_custom_allocator` is currently pointing to the CUDAPluggableAllocator with `ncclMemAlloc`, then when cleaning up allocation 1 we would use `ncclMemFree` instead of `cudaFree`.

In this PR, we solve the above problem by remembering the `free_fn_` using a deleter context. Hence, there is no need to go through an allocator object to find the deleter.
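
The difference between the two schemes can be modeled in plain Python. This is only an analogy for the C++ change; `Allocator`, `state`, and both `make_deleter_*` helpers are hypothetical names.

```python
from typing import Callable, Dict, List, Optional

freed: List[str] = []


class Allocator:
    """Toy allocator; `name` stands in for a malloc/free pair."""

    def __init__(self, name: str) -> None:
        self.name = name

    def free(self, ptr: int) -> None:
        freed.append(f"{self.name}:free({ptr})")


# Models the global "current allocator" consulted by the old scheme.
state: Dict[str, Optional[Allocator]] = {"current": None}


def make_deleter_global(ptr: int) -> Callable[[], None]:
    # Old scheme: look up whichever allocator is current when the deleter runs.
    return lambda: state["current"].free(ptr)


def make_deleter_with_context(alloc: Allocator, ptr: int) -> Callable[[], None]:
    # New scheme: remember the free function at allocation time.
    free_fn = alloc.free
    return lambda: free_fn(ptr)


cuda, nccl = Allocator("cuda"), Allocator("nccl")

state["current"] = cuda
old_style = make_deleter_global(1)             # allocation 1, made by "cuda"
new_style = make_deleter_with_context(cuda, 2)  # allocation 2, made by "cuda"

state["current"] = nccl  # a second allocator becomes current
old_style()  # frees with whichever allocator is current now
new_style()  # frees with the allocator that made the allocation
```

In this model the old-style deleter ends up calling the "nccl" free function on memory that "cuda" allocated, while the context-carrying deleter is unaffected by the change of the current allocator.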

CC: @zdevito @ptrblck @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130472
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-07-18 11:33:21 +00:00
28a74b9fa4 [NestedTensor] Integrate sum along the jagged dimension into NestedTensor (#130425)
Summary: Modify the existing `sum` operator in PyTorch, invoked by `torch.sum`, to allow for reductions along the ragged dimension of a nested tensor. This diff enables PyTorch users to invoke `torch.sum` on a nested tensor with `dim=1`, where `ragged_idx=1`.

Functions modified in `caffe2/torch/nested/_internal/ops.py`:
- `sum_dim_IntList()`: The function assumes that `ragged_idx=1`; in the case that `dim=1` as well, where `dim` is the dimension on which we reduce, this diff invokes the PyTorch benchmark found in D58423489. Specifically, this diff pads a nested tensor, e.g. of logical shape `(B, *, M)`, using [`torch.ops.aten._jagged_to_padded_dense_forward`](https://www.internalfb.com/code/fbsource/[92c2a067ab04e3eebc999254fed4ae2fbea6def3]/fbcode/deeplearning/fbgemm/fbgemm_gpu/fb/inductor_lowerings/elementwise_ops.py?lines=26), then reduces across the `*` dimension (`dim == 1`) to a `(B, M)` output tensor.
- `_wrap_jagged_dims()`: This diff adds special handling to allow for the case where `dim` contains `1` and not `0`, but to continue disallowing the case where `dim` contains `0` and not `1`. In this function's creation, I created a helper function, `_get_condition_for_invalid_jagged_reductions()`, which makes it clearer which conditions apply to which operators. Specifically, operators which are enabled with jagged reductions are specified at the top of the file in `SUPPORTED_JAGGED_REDUCTIONS` and have a different set of conditions that need to be tested, as reducing along `dim == 1` without `dim == 0` is now possible.
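
The pad-then-reduce idea can be shown on plain Python lists. This sketch does not call `torch.ops.aten._jagged_to_padded_dense_forward`; `sum_over_ragged_dim` is a hypothetical helper and assumes the first sequence is non-empty so that `M` can be read from it.

```python
from typing import List

Row = List[float]


def sum_over_ragged_dim(nested: List[List[Row]]) -> List[Row]:
    """Reduce a (B, *, M) nested list over the ragged `*` dimension to (B, M)."""
    m = len(nested[0][0])
    max_len = max(len(seq) for seq in nested)
    # Pad every sequence with zero rows so the batch is a dense (B, max_len, M) block.
    padded = [seq + [[0.0] * m] * (max_len - len(seq)) for seq in nested]
    # Sum over the padded dimension; zero padding leaves the totals unchanged.
    return [[sum(row[j] for row in seq) for j in range(m)] for seq in padded]


nested = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],  # sequence of length 3
    [[10.0, 20.0]],                        # sequence of length 1
]
result = sum_over_ragged_dim(nested)
```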

Functions modified in `caffe2/test/test_nestedtensor.py`:
- `test_sum_int_DimList()`: This diff adds special handling in the `sum` unit test to allow for the case where `dim` contains `1` and not `0`, but to continue disallowing the case where `dim` contains `0` and not `1`.
- `test_sum_int_DimList_ragged_dim_1()`: This diff adds a new unit test which verifies the accuracy and feasibility of reducing along the jagged dimension of a nested tensor.

Notes:
- This diff solely adds functionality for the case in which we reduce only along the ragged dimension. Cases in which we reduce along both the ragged and another dimension, like `dim == (1, 2)`, are not permitted, as this set of diffs focuses primarily on the former.
- The `sum` operator is the only operator which uses the function `_wrap_jagged_dims()`; all other operators use `_wrap_jagged_dim()`. I would like to later look into why this is the case and if we can consolidate this!
- I modified some of the comments in the `sum` function as well as the unit tests for more clarity.

Test Plan:
Verify that existing (`test_sum_int_DimList`) and new (`test_sum_int_DimList_ragged_dim_1`) unit tests pass via the following command:

```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_sum_int_DimList
```

Differential Revision: D59571209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130425
Approved by: https://github.com/davidberard98
2024-07-18 10:48:18 +00:00
e98135d1ad [aota] Needs autograd if an input requires_grad, agnostic to enable_grad (#128890)
Reland of:  https://github.com/pytorch/pytorch/pull/128016

Summary from previous PR:
We assume only two possible, mutually exclusive scenarios:

1. Running the compiled region for training (any of the inputs has requires_grad).
   Produced differentiable outputs should have requires_grad.
2. Running the compiled region for inference (none of the inputs has requires_grad).
   No output has requires_grad.

Even if the user runs the region under no_grad() but has an input Tensor with requires_grad, we go with the training scenario (1).

With current state that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only that any of inputs requires_grad
2/ if needs_autograd => trace_joint (we are in training scenario 1) => always run the compiled region under `with torch.enable_grad()`
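
A minimal sketch of the decision rule in point 1/, using a stand-in class instead of real tensors. `FakeInput` and `pick_graph` are hypothetical; the actual logic lives in AOTAutograd.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class FakeInput:
    """Stand-in for a tensor; only the flag relevant to the decision is modeled."""
    requires_grad: bool


def needs_autograd(inputs: Sequence[FakeInput]) -> bool:
    # The decision depends only on the inputs, not on the global grad mode.
    return any(inp.requires_grad for inp in inputs)


def pick_graph(inputs: Sequence[FakeInput], grad_enabled: bool) -> str:
    # `grad_enabled` is accepted only to show that it is deliberately ignored.
    return "training" if needs_autograd(inputs) else "inference"


under_no_grad = pick_graph([FakeInput(True), FakeInput(False)], grad_enabled=False)
no_grad_inputs = pick_graph([FakeInput(False)], grad_enabled=True)
```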

Changes in partitioner?

The inference and training graphs differed in their return container (list vs. tuple).
The partitioner changes unify this so that a tuple is always returned.
As a result, some graph contents in test_aotdispatch.py change from list to tuple.

Why was it reverted?

There was a regression of hf_Reformer model on inference.
```
TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode
```

This happened because one of the compiled graphs contained outputs that are aliases of inputs that are nn.Parameter(requires_grad=True).

Even though the torchbench inference benchmarks run inside `with torch.no_grad()`, alias ops (specifically, expand for hf_Reformer) preserve requires_grad.

As a result, we started compiling a training graph instead of an inference graph.

Fix for view ops:

If outputs are aliases of inputs that require grad, the fact that those outputs require grad is not a reason to generate a training graph.

This is handled in aot_autograd.py, where output_and_mutation_safe is calculated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890
Approved by: https://github.com/bdhirsh
2024-07-18 08:27:53 +00:00
cf3f4285a8 Add recursive metadata guard test (#131002)
Ensures that nested tensor subclasses are guarded properly. It turns out this case is already handled [here](d77af49380/torch/_dynamo/variables/builder.py (L1496)), which recursively wraps inner tensors and adds metadata guards for them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131002
Approved by: https://github.com/bdhirsh
2024-07-18 08:24:43 +00:00
134bc4fc34 [BE][Easy][12/19] enforce style for empty lines in import segments in test/i*/ (#129763)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763
Approved by: https://github.com/jansel
2024-07-18 07:49:19 +00:00
dfc3347c4a [pytorch][counters] Make WaitCounter backend pluggable (#130934)
Summary:
This diff introduces a much more flexible model for WaitCounter backend:
1. Backend can be installed dynamically (even if not linked with pytorch) instead of relying on macros and swapping implementation at compile time
2. Multiple backends are supported at the same time.
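
The two properties listed above (run-time installation and several active backends) can be modeled in Python as follows. The real WaitCounter is C++; `register_backend`, `WaitCounterSketch`, and `ListBackend` are hypothetical names used only for this illustration.

```python
from typing import List, Protocol


class Backend(Protocol):
    def record(self, key: str, event: str) -> None: ...


_backends: List[Backend] = []


def register_backend(backend: Backend) -> None:
    """Install a backend at run time; several may be active together."""
    _backends.append(backend)


class WaitCounterSketch:
    """Forwards start/stop events to every registered backend."""

    def __init__(self, key: str) -> None:
        self.key = key

    def start(self) -> None:
        for backend in _backends:
            backend.record(self.key, "start")

    def stop(self) -> None:
        for backend in _backends:
            backend.record(self.key, "stop")


class ListBackend:
    """Example backend that just remembers the events it receives."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def record(self, key: str, event: str) -> None:
        self.events.append(f"{key}:{event}")


first, second = ListBackend(), ListBackend()
register_backend(first)
register_backend(second)

counter = WaitCounterSketch("example.wait")
counter.start()
counter.stop()
```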

Test Plan: unit test

Reviewed By: jamesperng

Differential Revision: D59795863

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130934
Approved by: https://github.com/asiab4
2024-07-18 07:23:55 +00:00
b732b52f1e Revert "[BE][Easy][12/19] enforce style for empty lines in import segments in test/i*/ (#129763)"
This reverts commit aecc746fccc4495313167e3a7f94210daf457e1d.

Reverted https://github.com/pytorch/pytorch/pull/129763 on behalf of https://github.com/XuehaiPan due to need reland after rerunning lintrunner on main ([comment](https://github.com/pytorch/pytorch/pull/129763#issuecomment-2235736732))
2024-07-18 06:39:58 +00:00
6c2c8ee15b [export] Remove preserved ops from decomp list (#130970)
Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1466016147369925/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130970
Approved by: https://github.com/bdhirsh
2024-07-18 05:15:22 +00:00
aecc746fcc [BE][Easy][12/19] enforce style for empty lines in import segments in test/i*/ (#129763)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129763
Approved by: https://github.com/jansel
2024-07-18 05:13:41 +00:00
740fb22966 [BE][Easy][4/19] enforce style for empty lines in import segments in functorch/ (#129755)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129755
Approved by: https://github.com/zou3519
ghstack dependencies: #129752
2024-07-18 05:08:03 +00:00
a085acd7d6 [dynamo] Revert back changes to UnspecializedBuiltinNNModuleVariable (#130991)
xref - https://fb.workplace.com/groups/1075192433118967/permalink/1466525440652329/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130991
Approved by: https://github.com/williamwen42, https://github.com/mlazos
2024-07-18 05:01:46 +00:00
9f392f8294 Use inductor TestCase for test_replicate_with_compiler.py (#129494)
Summary: `test/distributed/_composable/test_replicate_with_compiler.py` exercises inductor. This change introduces a version of MultiProcessTestCase that derives from the inductor TestCase class to make sure we always get a clean cache dir.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129494
Approved by: https://github.com/eellison
2024-07-18 03:08:32 +00:00
433ef4e444 Revert "[FX][export] DCE pass, check schema for node impurity (#130395)"
This reverts commit e22b0acc766db4a853fe8fd73e919b4adf0e3148.

Reverted https://github.com/pytorch/pytorch/pull/130395 on behalf of https://github.com/yushangdi due to breaking tests, need to rebase and fix ([comment](https://github.com/pytorch/pytorch/pull/130395#issuecomment-2235192986))
2024-07-18 02:46:03 +00:00
bd56bcf0ab [TEST] Fix _scaled_mm tests (#130897)
This PR resolves several sets of `_scaled_mm` test failures:
- `scale_a` and `scale_b` are now required arguments, so the function `sample_inputs_scaled_mm` must supply them
- `_scaled_mm` does not support `"meta"` device, so it should be skipped in `test_meta.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130897
Approved by: https://github.com/drisspg
2024-07-18 02:15:00 +00:00
9fee87e4cd [export] Add print_readable to unflattener (#128617)
Taking inspiration from `GraphModule.print_readable` (aka I copied its [code](17b45e905a/torch/fx/graph_module.py (L824))), I added a `print_readable` to the unflattened module, because it's kind of nontrivial to print the contents of this module.

Example print from `python test/export/test_unflatten.py -k test_unflatten_nested`
```
class UnflattenedModule(torch.nn.Module):
    def forward(self, x: "f32[2, 3]"):
        # No stacktrace found for following nodes
        rootparam: "f32[2, 3]" = self.rootparam

        # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:99 in forward, code: x = x * self.rootparam
        mul: "f32[2, 3]" = torch.ops.aten.mul.Tensor(x, rootparam);  x = rootparam = None

        # No stacktrace found for following nodes
        foo: "f32[2, 3]" = self.foo(mul);  mul = None
        bar: "f32[2, 3]" = self.bar(foo);  foo = None
        return (bar,)

    class foo(torch.nn.Module):
        def forward(self, mul: "f32[2, 3]"):
            # No stacktrace found for following nodes
            child1param: "f32[2, 3]" = self.child1param
            nested: "f32[2, 3]" = self.nested(mul);  mul = None

            # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:79 in forward, code: return x + self.child1param
            add: "f32[2, 3]" = torch.ops.aten.add.Tensor(nested, child1param);  nested = child1param = None
            return add

        class nested(torch.nn.Module):
            def forward(self, mul: "f32[2, 3]"):
                # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:67 in forward, code: return x / x
                div: "f32[2, 3]" = torch.ops.aten.div.Tensor(mul, mul);  mul = None
                return div

    class bar(torch.nn.Module):
        def forward(self, add: "f32[2, 3]"):
            # No stacktrace found for following nodes
            child2buffer: "f32[2, 3]" = self.child2buffer

            # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:87 in forward, code: return x - self.child2buffer
            sub: "f32[2, 3]" = torch.ops.aten.sub.Tensor(add, child2buffer);  add = child2buffer = None
            return sub
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128617
Approved by: https://github.com/zhxchen17, https://github.com/pianpwk
2024-07-18 01:36:01 +00:00
cyy
a0ae77b25b Simpilfy cub::unique_by_key code (#130907)
Removes an unused parameter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130907
Approved by: https://github.com/ezyang
2024-07-18 01:12:00 +00:00
d818c3319f Autoheuristic: add config options for specifying optimizations to collect data for and use heuristics (#130245)
Previously, it was only possible to collect data or use a heuristic regardless of where autoheuristic is used. This PR makes it possible to collect data for some optimizations while using a learned heuristic for other optimizations.
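
One way to picture per-optimization selection is a pair of name lists. This is a sketch under assumptions: `AutoHeuristicOptions`, its fields, the comma-separated format, and the names `opt_a`/`opt_b` are all hypothetical and not the actual Inductor config.

```python
from dataclasses import dataclass
from typing import Set


def _parse(names: str) -> Set[str]:
    """Split a comma-separated list of optimization names."""
    return {name.strip() for name in names.split(",") if name.strip()}


@dataclass
class AutoHeuristicOptions:
    collect: str = ""  # optimizations for which data is collected
    use: str = ""      # optimizations for which a learned heuristic is used

    def should_collect(self, optimization: str) -> bool:
        return optimization in _parse(self.collect)

    def should_use(self, optimization: str) -> bool:
        return optimization in _parse(self.use)


# Collect data for one optimization while using a heuristic for another.
options = AutoHeuristicOptions(collect="opt_a", use="opt_b")
```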

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130245
Approved by: https://github.com/shunting314
2024-07-18 01:04:36 +00:00
051971ab32 Reorder MIOpen conditions so getCUDAHooks only called when CUDA input (#130867)
See post for more details: [fb.workplace.com/groups/1405155842844877/permalink/8719141948112860](https://fb.workplace.com/groups/1405155842844877/permalink/8719141948112860/)
Function getCUDAHooks() returns a reference to an object without checking if the object is null. In the AutoMOS QE, which runs an ML model in Messenger Android, we are getting native crashes for this reason: [internalfb.com/code/fbsource/[b7f8e18320f9d5d8347c3428c67301f20c3c81d2]/xplat/caffe2/aten/src/ATen/native/Convolution.cpp?lines=504](https://www.internalfb.com/code/fbsource/%5Bb7f8e18320f9d5d8347c3428c67301f20c3c81d2%5D/xplat/caffe2/aten/src/ATen/native/Convolution.cpp?lines=504), crash [fburl.com/logview/xi4w7jk4](https://fburl.com/logview/xi4w7jk4)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130867
Approved by: https://github.com/albanD
2024-07-18 00:59:33 +00:00
e22b0acc76 [FX][export] DCE pass, check schema for node impurity (#130395)
Change the default DCE pass to check node schema for impure nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130395
Approved by: https://github.com/angelayi, https://github.com/jgong5
2024-07-18 00:55:20 +00:00
cyy
73d0f484b3 [structural binding][11/N] Replace std::tie with structural binding (#130830)
Follows  #130784

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130830
Approved by: https://github.com/janeyx99
2024-07-18 00:45:06 +00:00
e14d1d10ef Unwrap Identity in prepare indexing (#130967)
We wrap indexing calculation in the concat kernel in `Identity` so that we do not expand int32 intermediates to int64. This was causing an issue where the index simplified to an integer and would not hit an intended [path](752c817898/torch/_inductor/codegen/triton.py (L1554)) which would do wrapping with tl.full.

I couldn't generate a minimal repro to add as a test, but I have a repro you can check here: P1483831261. There is already a test that we don't expand the int32 intermediates to int64.

Differential Revision: [D59871850](https://our.internmc.facebook.com/intern/diff/D59871850)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130967
Approved by: https://github.com/Chillee, https://github.com/jansel
2024-07-18 00:43:53 +00:00
d77af49380 [Traceable FSDP2] Preserve fsdp.set_ op through lowering; Add unit test for multiple .set_ into same primal; Add unit test for FSDP2 module layer reuse (#130786)
Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_fullgraph_backend_inductor`
- `pytest -rA test/functorch/test_aotdispatch.py::TestAOTAutograd::test_input_mutation_fsdp_set__into_same_input`
- `PYTORCH_TEST_WITH_CROSSREF=1 python test/functorch/test_aotdispatch.py -k TestAOTAutogradWithCache.test_input_mutation_fsdp_set__into_same_input`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130786
Approved by: https://github.com/bdhirsh
ghstack dependencies: #129773
2024-07-17 23:25:42 +00:00
fc3dbcd1c3 [Traceable FSDP2][Inductor] Re-inplace all_gather_into_tensor (#129773)
FSDP2 eager pre-allocates the output buffer for AllGather and the AllGather just writes into that buffer. However, under compile, by default we use out-of-place AllGather, which means in Traceable FSDP2 case we will be unnecessarily using more memory than eager. We want to re-inplace that AllGather instead.

This PR adds a post_grad pass to re-inplace all_gather_into_tensor (i.e. changing it from `all_gather_into_tensor.default` out-of-place op to `all_gather_into_tensor_out.default` out-variant op).

One thing to note is that since with this pass we are introducing a mutable op into the post_grad FX graph, we must do this pass after `reinplace_inplaceable_ops` (at which point we are okay again with having mutable ops in the graph). To facilitate this, this PR adds a `post_grad_custom_post_reinplace_pass` extension point to allow user-defined post-reinplace FX passes.

---

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_fullgraph_backend_inductor`

---

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129773
Approved by: https://github.com/eellison
2024-07-17 22:51:20 +00:00
442bfa7fc4 Fix mypy error (#130992)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130992
Approved by: https://github.com/izaitsevfb
2024-07-17 22:49:23 +00:00
a0da1265c5 Define key in codecache (#130979)
Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_misc.py::InlineInbuiltNNModulesMiscTests::test_auto_functionalize_can_with_none_return_inline_inbuilt_nn_modules'
```

Differential Revision: D59875657

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130979
Approved by: https://github.com/jamesjwu
2024-07-17 22:44:50 +00:00
31e3330040 [Reland][FSDP2] Allowed List[nn.Module] as arg (#130949)
This PR allows `fully_shard`'s first argument to be `List[nn.Module]` instead of strictly `nn.Module`. This allows more flexible grouping of modules/parameters for communication, which can lead to memory savings and/or more efficient communication.

**Approach**
At a high level, we can think of a model as a tree of modules. Previously, we could only select specific module nodes in this tree as representing one FSDP parameter group. With this PR, we can select a group of module nodes, effectively becoming a single super node.

To implement the runtime schedule, we define new forward hooks that run based on the following semantics:
- If a module is the first to run the pre-hook, actually run the given pre-hook. Otherwise, the pre-hook is a no-op.
- If a module is the last to run the post-hook, actually run the given post-hook. Otherwise, the post-hook is a no-op.
- First and last are determined by scoreboarding against a set of the modules.
- This set must get cleared at the end of backward in the case that >=1 module in the list is never used, in which case we still want the forward hooks to run in the next forward after this backward.
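
The hook semantics listed above can be sketched without any FSDP code. `GroupHookScoreboard` is a hypothetical class; modules are represented by plain strings and the hooks by callables that take no arguments.

```python
from typing import Callable, Iterable, List, Set


class GroupHookScoreboard:
    """Runs the group pre-hook once (first module) and post-hook once (last module)."""

    def __init__(
        self,
        modules: Iterable[str],
        pre_hook: Callable[[], None],
        post_hook: Callable[[], None],
    ) -> None:
        self._modules: Set[str] = set(modules)
        self._to_run: Set[str] = set()  # the scoreboard
        self._pre_hook = pre_hook
        self._post_hook = post_hook

    def pre_forward(self, module: str) -> None:
        if not self._to_run:  # first module of the group in this forward
            self._to_run.update(self._modules)
            self._pre_hook()

    def post_forward(self, module: str) -> None:
        self._to_run.discard(module)
        if not self._to_run:  # last module of the group in this forward
            self._post_hook()

    def end_of_backward(self) -> None:
        # Clear leftovers so the next forward starts fresh even if a module never ran.
        self._to_run.clear()


calls: List[str] = []
board = GroupHookScoreboard(
    ["norm", "output"], lambda: calls.append("pre"), lambda: calls.append("post")
)

# Both modules run: one pre-hook, one post-hook.
for name in ("norm", "output"):
    board.pre_forward(name)
    board.post_forward(name)
both_ran = list(calls)

# Only "norm" runs, then backward ends; the next forward still fires the pre-hook.
calls.clear()
board.pre_forward("norm")
board.post_forward("norm")
board.end_of_backward()
board.pre_forward("norm")
partial_run = list(calls)
```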

Beyond these new forward hooks, everything else is some simple generalization from `Module` to `List[Module]` or `Tuple[Module, ...]`.

**Examples**
This PR enables wrapping Llama models more efficiently by grouping the final norm and output linear together: https://github.com/pytorch/torchtitan/pull/382.

If at least one of the modules in the list does not run forward before backward, then there will be a warning message like:
```
1 of the 2 modules passed to fully_shard did not run forward before backward, which is error-prone since FSDP post-forward/pre-backward logic will not run for these modules. We recommend passing only modules that run forward together. Modules that did not run forward: [FSDPLinear(in_features=1, out_features=1, bias=True)]
```

---

**Changes for reland:** none since breakage was from PR below

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130949
Approved by: https://github.com/weifengpy
ghstack dependencies: #130947
2024-07-17 22:40:14 +00:00
ff7e021e94 [Reland][PT-D] Relaxed contract to allow Sequence[nn.Module] (#127773) (#130947)
This PR relaxes `@contract` to allow the 1st argument to be `Sequence[nn.Module]` instead of strictly `nn.Module`. This is required for the next PR, which allows `fully_shard` to take in `List[nn.Module]`.

---

**Changes for reland:**
- The previous PR assumed that any `func` decorated with `@contract` would return the same input `module` as output (which is true for PT-D composable APIs).
- However, TorchRec `shard` returns a different module as output (though that module _does_ satisfy the `@contract` FQN check).
- This PR removes the assumption and instead only enforces the FQN check following the input module order. In other words, if calling `func([x1, ..., xN])` for `N` modules `x1, ..., xN` that returns `[y1, ..., yM]` for `M` modules, we require that `N = M` and that FQNs are preserved coordinate-wise: `xi` and `yi` have same FQNs for all `i = 1, ..., N`.
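
A minimal sketch of the coordinate-wise check described above, written in plain Python. The function name `check_fqns_preserved` and the `fqns_of` callable are hypothetical stand-ins; the real `@contract` computes FQNs from the modules themselves.

```python
def check_fqns_preserved(inputs, outputs, fqns_of):
    """Require N == M and identical FQNs for each (x_i, y_i) pair.

    `fqns_of` is a hypothetical callable returning the FQNs of one module.
    """
    if len(inputs) != len(outputs):
        raise ValueError(
            f"expected {len(inputs)} output modules, got {len(outputs)}"
        )
    for i, (x, y) in enumerate(zip(inputs, outputs)):
        # The output may be a different object, as long as its FQNs match.
        if list(fqns_of(x)) != list(fqns_of(y)):
            raise ValueError(f"FQNs differ between input and output at index {i}")
```

Note that the check never compares `x is y`, which is what lets an API such as TorchRec `shard` return a different module.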

Differential Revision: [D59863438](https://our.internmc.facebook.com/intern/diff/D59863438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130947
Approved by: https://github.com/weifengpy, https://github.com/atalman
2024-07-17 22:40:13 +00:00
90105a4f3e [ts-migration] Support RaiseException, prim::Unitialized, prim::Enter, and prim::Exit (#129416)
- Support raise exception. Its behavior matches non-strict export now, thanks to @ydwu4's [PR](https://github.com/pytorch/pytorch/pull/128709).
- Support prim::Unitialized, prim::Enter, and prim::Exit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129416
Approved by: https://github.com/angelayi
2024-07-17 21:59:52 +00:00
874bbc53c9 Revert "Define key in codecache (#130979)"
This reverts commit 4112f687831fb6f3554ff659a0be45909a1b4639.

Reverted https://github.com/pytorch/pytorch/pull/130979 on behalf of https://github.com/clee2000 due to broke lint on torch/_inductor/codecache.py https://github.com/pytorch/pytorch/actions/runs/9981737836/job/27586013811 f0faecd291 ([comment](https://github.com/pytorch/pytorch/pull/130979#issuecomment-2234392332))
2024-07-17 21:59:19 +00:00
43a6d20883 Add decomposition for reflection_pad{1,2,3}d_backward (#130299)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130299
Approved by: https://github.com/lezcano
ghstack dependencies: #130130
2024-07-17 21:56:00 +00:00
0eb43ed189 Revert "[ts-migration] Support RaiseException, prim::Unitialized, prim::Enter, and prim::Exit (#129416)"
This reverts commit f0faecd2915d73e56917922cc995237cef064e50.

Reverted https://github.com/pytorch/pytorch/pull/129416 on behalf of https://github.com/clee2000 due to broke lint, but for torch/_inductor/codecache.py this time https://github.com/pytorch/pytorch/actions/runs/9981737836/job/27586013811 f0faecd291 ([comment](https://github.com/pytorch/pytorch/pull/129416#issuecomment-2234387254))
2024-07-17 21:55:48 +00:00
ebdfc7e37d [BE] Rename ISORT_WHITELIST to ISORT_SKIPLIST (#130987)
To better represent what this list is doing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130987
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi
2024-07-17 21:52:56 +00:00
df5919393c [ROCm] std::clamp work-around for hip-clang compiler (#127812)
Fixes #127666.

Other std math functions are replaced with those in the global namespace during hipify. HIP does not claim to support every function in the C++ standard library. std::clamp is not yet supported and we have been relying on the std implementation. For Fedora 40 + gcc 14, a host-side assert is used which is not supported. Work around this by replacing std::clamp with min and max. Using #ifndef USE_ROCM to differentiate between CUDA using std::clamp and the ROCm replacement broke Windows builds. The replacement generates the same PTX as std::clamp, so it is used unconditionally. See https://godbolt.org/z/Wde9KW3v4 for a sample.
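
For illustration only, the min/max identity behind the replacement can be written in Python. The actual patch is C++ device code; the function below is a hypothetical sketch of the same arithmetic and assumes `low <= high`.

```python
def clamp(value, low, high):
    """Clamp `value` into [low, high] using only min and max (assumes low <= high)."""
    return min(max(value, low), high)
```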

Original patch comes from @lamikr. Modified to improve efficiency.

https://github.com/lamikr/rocm_sdk_builder/pull/37

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127812
Approved by: https://github.com/hongxiayang, https://github.com/malfet
2024-07-17 21:31:17 +00:00
f0faecd291 [ts-migration] Support RaiseException, prim::Unitialized, prim::Enter, and prim::Exit (#129416)
- Support raise exception. Its behavior matches non-strict export now, thanks to @ydwu4's [PR](https://github.com/pytorch/pytorch/pull/128709).
- Support prim::Unitialized, prim::Enter, and prim::Exit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129416
Approved by: https://github.com/angelayi
2024-07-17 21:27:45 +00:00
4112f68783 Define key in codecache (#130979)
Test Plan:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_misc.py::InlineInbuiltNNModulesMiscTests::test_auto_functionalize_can_with_none_return_inline_inbuilt_nn_modules'
```

Differential Revision: D59875657

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130979
Approved by: https://github.com/jamesjwu
2024-07-17 21:19:13 +00:00
0b134c15cd Revert "Relax constraints for creating a GenericContextWrappingVariable (#129091)"
This reverts commit 882fd9186924b4632fba65033717d97d15ad3339.

Reverted https://github.com/pytorch/pytorch/pull/129091 on behalf of https://github.com/clee2000 due to test_jit started failing on main after this stack https://github.com/pytorch/pytorch/actions/runs/9980754603/job/27583474357 a8bd2933d9 ([comment](https://github.com/pytorch/pytorch/pull/129091#issuecomment-2234269541))
2024-07-17 20:59:40 +00:00
c49f909aab Revert "wrap self.call_function(...) in try finally block to undo changes to self.kw_names (#130490)"
This reverts commit a8bd2933d9eaf24ec9582001efa844de499d9e93.

Reverted https://github.com/pytorch/pytorch/pull/130490 on behalf of https://github.com/clee2000 due to test_jit started failing on main after this stack https://github.com/pytorch/pytorch/actions/runs/9980754603/job/27583474357 a8bd2933d9 ([comment](https://github.com/pytorch/pytorch/pull/129091#issuecomment-2234269541))
2024-07-17 20:59:40 +00:00
65b4163bd2 [dynamo][nn-module] Make slice getitem on nn module container sourceless (#130852)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130852
Approved by: https://github.com/mlazos
ghstack dependencies: #130773
2024-07-17 20:17:08 +00:00
a8bd2933d9 wrap self.call_function(...) in try finally block to undo changes to self.kw_names (#130490)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130490
Approved by: https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #129091
2024-07-17 20:07:06 +00:00
882fd91869 Relax constraints for creating a GenericContextWrappingVariable (#129091)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129091
Approved by: https://github.com/yanboliang, https://github.com/zou3519
2024-07-17 20:07:06 +00:00
41f5d5dcaf Revert "[inductor] adapte windows file path (#130713)"
This reverts commit e51e971a8675826e517a78bf2a97f8e2df5f4abd.

Reverted https://github.com/pytorch/pytorch/pull/130713 on behalf of https://github.com/clee2000 due to sorry but I think its still failing, this time on windows CUDA https://github.com/pytorch/pytorch/actions/runs/9971126834/job/27552761451 bb62e9d7c3.  It was not run on PR due to being on the periodic workflow, which isnt usually run on PRs due to capacity issues for windows CUDA machines.  I will add ciflow/periodic to the PR to ensure the test gets run ([comment](https://github.com/pytorch/pytorch/pull/130713#issuecomment-2234092078))
2024-07-17 19:37:16 +00:00
1bf4a44b33 Revert "[ts-migration] Support RaiseException, prim::Unitialized, prim::Enter, and prim::Exit (#129416)"
This reverts commit ef0511245a92bae7057c195dcae2efc237b96f16.

Reverted https://github.com/pytorch/pytorch/pull/129416 on behalf of https://github.com/clee2000 due to broke lint for test/export/test_converter.py https://github.com/pytorch/pytorch/actions/runs/9979009143/job/27577181982 ef0511245a.  Probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/129416#issuecomment-2234067407))
2024-07-17 19:21:52 +00:00
b0387449db Ensure staticmethods can be allowed in graph (#130882)
Fixes https://github.com/pytorch/pytorch/issues/124735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130882
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
2024-07-17 19:18:30 +00:00
e4f9d01cd9 Add test for dataclass field accesses (#130848)
Fixes https://github.com/pytorch/pytorch/issues/120108

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130848
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
2024-07-17 19:14:23 +00:00
470f07c840 Add guard override capability for tensor subclass metadata (#130780)
Fixes https://github.com/pytorch/pytorch/issues/114405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130780
Approved by: https://github.com/anijain2305, https://github.com/bdhirsh
ghstack dependencies: #130779
2024-07-17 19:13:53 +00:00
bea6762c01 Add guards on subclass metadata (#130779)
This PR adds guards in dynamo which verify the equality of tensor subclass metadata along with tests verifying the expected recompile behavior. The next PR adds the capability to override the guard behavior to possibly perform the check in a less expensive manner.

Toward fixing https://github.com/pytorch/pytorch/issues/114405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130779
Approved by: https://github.com/anijain2305, https://github.com/bdhirsh
2024-07-17 19:13:52 +00:00
752c817898 [AOTI][refactor] Unify UserDefinedTritonKernel.codegen (#130796)
Summary: Unify the argument codegen logic between python wrapper and cpp wrapper.

Differential Revision: [D59809273](https://our.internmc.facebook.com/intern/diff/D59809273)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130796
Approved by: https://github.com/oulgen
2024-07-17 18:37:23 +00:00
efefea52e0 renamed inductor kernel args in flexattention properly (#130869)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130869
Approved by: https://github.com/drisspg, https://github.com/joydddd
ghstack dependencies: #130809, #130818
2024-07-17 18:36:03 +00:00
480a5bd881 Renamed mask_fn to mask_mod (#130818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130818
Approved by: https://github.com/drisspg
ghstack dependencies: #130809
2024-07-17 18:36:03 +00:00
d96c80649f [export] constants & non-persistent buffers for training IR (#130864)
Summary: Uses original ExportedProgram constants and graph signature to inform decompositions, so that constant tensors and non-persistent buffers are respected for training IR. Removes 7 test failures for training IR.

Test Plan: test_export

Differential Revision: D59820909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130864
Approved by: https://github.com/angelayi
2024-07-17 18:27:53 +00:00
ef0511245a [ts-migration] Support RaiseException, prim::Unitialized, prim::Enter, and prim::Exit (#129416)
- Support raise exception. Its behavior matches non-strict export now, thanks to @ydwu4's [PR](https://github.com/pytorch/pytorch/pull/128709).
- Support prim::Unitialized, prim::Enter, and prim::Exit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129416
Approved by: https://github.com/angelayi
2024-07-17 17:48:36 +00:00
d552e5c3d5 Fix ciflow/nightly triggering commit hash update workflow (#130570)
Move the if statement to be higher so people don't get the below
![image](https://github.com/user-attachments/assets/e9be7d7c-6400-4f80-880f-d58dcb4b5495)
like https://togithub.com/pytorch/pytorch/pull/130465
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130570
Approved by: https://github.com/ZainRizvi
2024-07-17 17:13:50 +00:00
db3290846e [BE][Easy][10/19] enforce style for empty lines in import segments in test/d*/ (#129761)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129761
Approved by: https://github.com/fegin
2024-07-17 16:57:39 +00:00
1e13cb2f28 Log cache state to structured logs (#130845)
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpRm4MaD/0_0_0/fx_graph_cache_hash_4.json

Differential Revision: D59795574

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130845
Approved by: https://github.com/jamesjwu
2024-07-17 16:45:45 +00:00
af0b5ee924 Reduce number of samples in {svd,pca}_lowrank OpInfos (#127199)
We don't need to generate so many samples for these very expensive ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127199
Approved by: https://github.com/peterbell10, https://github.com/zou3519
2024-07-17 16:29:36 +00:00
6e916f112f [inductor] skip fx remote cache for 2 tests in test_metrics.py (#130853)
Summary: `collect_defined_kernels()` is essentially patching deep inside to see if a specific codegen is happening. We could also patch somewhere in the cache path to make sure it's called, but I'm not sure that's really testing anything interesting. I suggest it's better to just disable the remote cache here.

Test Plan: `buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:metrics -- --exact 'caffe2/test/inductor:metrics - test_kernel_args_num_gb (caffe2.test.inductor.test_metrics.TestMetrics)' --run-disabled --stress-runs 10`

Differential Revision: D59825899

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130853
Approved by: https://github.com/oulgen
2024-07-17 16:17:43 +00:00
1fb572289b [BE][c10d] Add a warning messages in the comment about cuda hang (#130844)
Add comments to warn users of a potential hang in the CUDA event query in NCCLPG.

Co-authored-by: Andrew Gu <31054793+awgu@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130844
Approved by: https://github.com/wconstab
2024-07-17 15:51:19 +00:00
b7d2abd766 Fix vectorized ops.masked (#130130)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130130
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-07-17 14:55:11 +00:00
b29b23137c [Easy] Fix argument name collision in dispatched functions (#129562)
Use positional-only argument to avoid naming collision with aten ops arguments that are named "self".

```python
In [1]: def foo(self, *args, **kwargs):
   ...:     print(self, args, kwargs)
   ...:

In [2]: def bar(self, /, *args, **kwargs):
   ...:     print(self, args, kwargs)
   ...:

In [3]: foo(1, 2, self=3)
TypeError: foo() got multiple values for argument 'self'

In [4]: bar(1, 2, self=3)
1
(2,)
{'self': 3}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129562
Approved by: https://github.com/zou3519, https://github.com/fegin
2024-07-17 14:39:56 +00:00
c0ed38e644 [BE][Easy][3/19] enforce style for empty lines in import segments in benchmarks/ (#129754)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129754
Approved by: https://github.com/ezyang
2024-07-17 14:34:42 +00:00
32995dec28 Add support for XPU accumulate type (#128579)
Provide an accumulate type interface specifically for XPU, similar to what was done for MPS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128579
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-07-17 14:33:53 +00:00
76169cf691 [BE][Easy][9/19] enforce style for empty lines in import segments in test/[e-h]*/ (#129760)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129760
Approved by: https://github.com/ezyang
2024-07-17 14:25:29 +00:00
cbf274d4a7 [aoti] Add packaging solution (#129895)
In this PR, I added support for packaging the AOTI generated files into a zipfile, and loading it in python.

`compile_so` takes the path to the package, a device, and a desired so_path location, and compiles package into a .so, and saves to the specified location.
`load_package` takes a path to the package and device, calls _extract_so, and then creates a callable to run the compiled model.

The zipfile generated looks like the following:
```
|- version
|- archive_format
|- data
   |- aotinductor
      |- cbtnafqaqrhvwztv7xudlal4xs6sofxa5oxccyuaqtrt6aozaklx.cubin  # AOTI cuda generated cubin files
      |- cskkqtna23bty2v3aq7g2q37cxrgufehlkuaaolhlgug5zg6fuwe.cpp  # AOTI generated cpp file
      |- cskkqtna23bty2v3aq7g2q37cxrgufehlkuaaolhlgug5zg6fuwe_compile_flags  # Flags for compiling the .o
      |- c6qqtnpgwfi3dv5nb76ai773kt45ezoxfwdmd7q37lvq6fs2tnoi.o  # AOTI saved const.o
      |- cskkqtna23bty2v3aq7g2q37cxrgufehlkuaaolhlgug5zg6fuwe_linker_flags  # Flags for linking the files to form the .so
   |- constants
      |- constants.pt  # Constants saved using torch.save, can be loaded using mmap
```

The workflow is something like:
```
with torch.no_grad():
    ep = torch.export.export(
        model,
        example_inputs,
        dynamic_shapes=dynamic_shapes,
        strict=False,
    )
    gm = ep.module()
    package_path = torch._inductor.aot_compile(
        gm,
        example_inputs,
        options= {
              "aot_inductor.output_path": "my_path.pt2",  # or a directory
              "aot_inductor.package": True,
        }
    )
compiled_model = torch._inductor.package.load_package(package_path, device)
return compiled_model
```

I tried turning on loading the weights using mmap by default, but had some trouble with it, so that is left as a TODO.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129895
Approved by: https://github.com/malfet
2024-07-17 13:56:58 +00:00
94a910b43b Revert "Renamed mask_fn to mask_mod (#130818)"
This reverts commit 1a97bcf93b2ac98505ef6ff011ccb3565e456596.

Reverted https://github.com/pytorch/pytorch/pull/130818 on behalf of https://github.com/atalman due to Failing internally ([comment](https://github.com/pytorch/pytorch/pull/130818#issuecomment-2233367318))
2024-07-17 13:47:08 +00:00
d027aef8f8 Revert "Removed q_num_blocks from constructor (#130819)"
This reverts commit 03c660468eb57772e82c1034613f5ff8781c775a.

Reverted https://github.com/pytorch/pytorch/pull/130819 on behalf of https://github.com/atalman due to Internal problem with previous PR in stack https://github.com/pytorch/pytorch/pull/130818 ([comment](https://github.com/pytorch/pytorch/pull/130819#issuecomment-2233359569))
2024-07-17 13:43:35 +00:00
4b7ff35622 Fix flex_attention import in score_mod (#130906)
torch.nn.attention._flex_attention has been renamed to torch.nn.attention.flex_attention, so the import does not work currently.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130906
Approved by: https://github.com/Chillee
2024-07-17 13:37:08 +00:00
e1b2d8f975 Revert "[cuDNN][SDPA] Support attn_bias in cuDNN (#130482)"
This reverts commit de177b50f89e45a57ac056ee64a64d7775b450ff.

Reverted https://github.com/pytorch/pytorch/pull/130482 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/130482#issuecomment-2233309217))
2024-07-17 13:21:50 +00:00
d3a11a0198 [Inductor] Handle device_put op in constant folding. (#130824)
Fix #130823

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130824
Approved by: https://github.com/eellison, https://github.com/EikanWang
ghstack dependencies: #130817
2024-07-17 10:13:36 +00:00
2af2d26562 [Inductor UT] Generalize device-bias code in test_triton_kernels.py and test_torchinductor.py (#130817)
[Inductor UT] Generalize newly introduced device-bias code in test_triton_kernels.py::test_add_kernel and test_torchinductor.py::test_ctr_not_moved_to_cuda_when_used_in_index_put
Fix #130814 , #130838

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130817
Approved by: https://github.com/zou3519
2024-07-17 10:13:36 +00:00
2300bb2a88 [3.13, dynamo] support TO_BOOL (#130565)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130565
Approved by: https://github.com/jansel
ghstack dependencies: #130459, #130460, #130461, #130564
2024-07-17 09:47:58 +00:00
539acf7656 [3.13, dynamo] support CALL_KW (#130564)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130564
Approved by: https://github.com/jansel
ghstack dependencies: #130459, #130460, #130461
2024-07-17 09:47:58 +00:00
e2365c05d7 [3.13, dynamo] fix instruction line numbers (#130461)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130461
Approved by: https://github.com/jansel
ghstack dependencies: #130459, #130460
2024-07-17 09:47:58 +00:00
82b2e7a253 [3.13, dynamo] fix CALL_FUNCTION_EX in symbolic_convert (#130460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130460
Approved by: https://github.com/jansel
ghstack dependencies: #130459
2024-07-17 09:47:58 +00:00
8c9a996091 [3.13, dynamo] support LOAD_FAST_LOAD_FAST and STORE_FAST_STORE_FAST (#130459)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130459
Approved by: https://github.com/jansel
2024-07-17 09:47:58 +00:00
bb62e9d7c3 Avoid autocast deprecation warning in DataParallel (#130660)
Fixes #130659

Co-authored-by: Yu, Guangye <106960996+guangyey@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130660
Approved by: https://github.com/guangyey, https://github.com/fegin, https://github.com/albanD
2024-07-17 08:32:19 +00:00
f6838d521a [BE][Easy][5/19] enforce style for empty lines in import segments in tools/ and torchgen/ (#129756)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129756
Approved by: https://github.com/ezyang
2024-07-17 06:44:35 +00:00
ba48cf6535 [BE][Easy][6/19] enforce style for empty lines in import segments in test/ (#129757)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129757
Approved by: https://github.com/ezyang
2024-07-17 06:42:37 +00:00
e51e971a86 [inductor] adapte windows file path (#130713)
This PR depends on https://github.com/pytorch/pytorch/pull/130132 landing successfully.
The detailed log: https://github.com/pytorch/pytorch/issues/124245#issuecomment-2211889758

After the file path was adapted for Windows, the first Windows inductor case ran successfully.

```python
import torch

def foo(x, y):
    a = torch.sin(x)
    b = torch.cos(x)
    return a + b
opt_foo1 = torch.compile(foo)
print(opt_foo1(torch.randn(10, 10), torch.randn(10, 10)))
```

Result:
![image](https://github.com/user-attachments/assets/4944df47-e74d-476b-8eb5-1d1fd5abeb41)

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130713
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2024-07-17 06:36:11 +00:00
7c45476d38 [pytorch][counters] WaitCounter cleanup (#130664)
Summary:
This diff does a minor cleanup of WaitCounters:
1. Fixes some singleton use to ensure one instance of WaitCounterImpl per counter per process
2. Updates API to enable measuring duration of individual wait operations

Test Plan: unit test

Differential Revision: D59709324

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130664
Approved by: https://github.com/c-p-i-o, https://github.com/asiab4
2024-07-17 04:42:35 +00:00
419b8df0b6 [inductor][easy] add debug logs for inlining constants (#130799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130799
Approved by: https://github.com/chenyang78
2024-07-17 04:21:08 +00:00
f2552dcc3d refactor cached tensor more generic (#129359)
# Motivation
solve https://github.com/pytorch/pytorch/issues/129027 to refactor cached tensor to be generic.

# Additional Context
No API name change. This only decouples the cached tensor code from the CUDA build option.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129359
Approved by: https://github.com/eqy, https://github.com/EikanWang, https://github.com/albanD
2024-07-17 03:00:08 +00:00
c6aa03bd4e Add allow_xpu to enable XPU UTs (#130312)
# Motivation
enable UTs under folder test/xpu/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130312
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
2024-07-17 02:40:28 +00:00
fc238db62a Separate AOTI Eager utils as a single file (#125819)
The key change is code movement. We just moved aoti eager related code from `torch._inductor.utils` to `torch._inductor.aoti_eager`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125819
Approved by: https://github.com/jansel, https://github.com/jgong5, https://github.com/desertfire
2024-07-17 02:27:11 +00:00
d1c4e6b55f [BE]: Enable a few additional ruff rules (#130700)
Enables a few extra ruff rules, most of which do not have any violations as I already cleaned them up in earlier PRs; this just turns them on to enforce them. Adds 1 noqa as we want the suboptimal lambda generation + call kept as a test. Also enables the test in flake8.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130700
Approved by: https://github.com/justinchuby, https://github.com/ezyang
2024-07-17 02:06:04 +00:00
c24c50da92 fix tensor print behavior for XPU (#130523)
# Motivation
Some XPU devices don't support the `double` data type, so we have to use `tensor.to(torch.float)` if it is an XPU tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130523
Approved by: https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/albanD
2024-07-17 02:03:32 +00:00
aa95fb99af On advice of James March, log pid instead of tid (#130679)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130679
Approved by: https://github.com/jmarchfb
2024-07-17 02:00:10 +00:00
e9023d57b0 [ROCm] Return correct AMDSMI socket_power metric (#130331)
Extending on the change in https://github.com/pytorch/pytorch/pull/127729

Depending on gcnArch the API to return socket power will change based on underlying gpu_metrics. This PR will handle both cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130331
Approved by: https://github.com/jeffdaily, https://github.com/eqy, https://github.com/malfet
2024-07-17 01:58:58 +00:00
03c660468e Removed q_num_blocks from constructor (#130819)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130819
Approved by: https://github.com/drisspg
ghstack dependencies: #130809, #130818
2024-07-17 01:41:20 +00:00
1a97bcf93b Renamed mask_fn to mask_mod (#130818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130818
Approved by: https://github.com/drisspg
ghstack dependencies: #130809
2024-07-17 01:41:20 +00:00
6024fea0f8 Compute q_num_blocks from kv_num_blocks if q_num_blocks is not passed in (#130809)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130809
Approved by: https://github.com/drisspg
2024-07-17 01:41:15 +00:00
ef9d9be236 TCPStoreLibUvBackend: log port on error (#130797)
Adds better error messages when a socket fails to bind in libuv.

New format:
```
The server socket has failed to bind. port: 1, useIpv6: 0, code: -13, name: EACCES, message: permission denied
```

Old format:

```
The server socket has failed to listen on any local network address. useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use
```
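
As a rough sketch of how the fields in the new format fit together, the message can be rebuilt in Python. `format_bind_error` is a hypothetical helper, not the libuv backend code; it relies on libuv reporting errors as negated errno values on POSIX systems and uses Python's `errno` and `os.strerror` in place of libuv's own name and message lookups.

```python
import errno
import os


def format_bind_error(port, use_ipv6, code):
    """Hypothetical helper mirroring the layout of the new error message."""
    err = -code  # libuv error codes are negated errno values on POSIX
    name = errno.errorcode.get(err, "UNKNOWN")
    message = os.strerror(err).lower()
    return (
        "The server socket has failed to bind. "
        f"port: {port}, useIpv6: {int(use_ipv6)}, code: {code}, "
        f"name: {name}, message: {message}"
    )
```

Including the port makes it possible to tell a privileged-port failure (`EACCES`) apart from a port that is already taken (`EADDRINUSE`) without rerunning the job.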

Test plan:

Added test in `test_store.py`

```
python test/distributed/test_store.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130797
Approved by: https://github.com/kurman
2024-07-17 01:34:15 +00:00
25cb4426d3 [inductor] Add num_matches_for_scatter_upon_const_tensor to list of cached metrics (#130843)
Summary: test/inductor:scatter_optimization is using this counter and fails with remote caching enabled

Test Plan: `buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:scatter_optimization -- --exact 'caffe2/test/inductor:scatter_optimization - test_cross_entropy_loss (caffe2.test.inductor.test_scatter_optimization.TestScatterOpt)' --run-disabled --stress-runs 10`

Differential Revision: D59817406

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130843
Approved by: https://github.com/oulgen
2024-07-17 00:41:22 +00:00
8458dc8966 Revert "Use inductor TestCase for distributed tests (#129494)"
This reverts commit 3cd2ae331a5ed6839456bb0025c729a1ee50bc84.

Reverted https://github.com/pytorch/pytorch/pull/129494 on behalf of https://github.com/masnesral due to fbcode failures ([comment](https://github.com/pytorch/pytorch/pull/129494#issuecomment-2232063690))
2024-07-17 00:32:48 +00:00
d7a8e8f7c5 Revert "[PT-D] Relaxed contract to allow Sequence[nn.Module] (#127773)"
This reverts commit b27695791e9cc4eedb1b713b1be20398bfeb911b.

Reverted https://github.com/pytorch/pytorch/pull/127773 on behalf of https://github.com/atalman due to failing internally ([comment](https://github.com/pytorch/pytorch/pull/127773#issuecomment-2232004006))
2024-07-16 23:48:09 +00:00
9a6d81b178 Fix pytorch JIT build for LLVM 18+ (#130661)
Summary: LLVM upstream (https://github.com/llvm/llvm-project/pull/97824) changed `getHostCPUFeatures` to return a `StringMap`. Fix this to unblock T195389358.

Test Plan:
```
buck2 build mode/opt-clang-thinlto --upload-all-actions -c unicorn.hfsort="1" -c cxx.extra_cxxflags="-gpubnames -w -Wno-enum-constexpr-conversion -Wno-missing-template-arg-list-after-template-kw -Wno-c++11-narrowing -Wno-c++11-narrowing-const-reference -ferror-limit=0" -c cxx.extra_cflags="-gpubnames -w -Wno-enum-constexpr-conversion -Wno-missing-template-arg-list-after-template-kw -Wno-c++11-narrowing -Wno-c++11-narrowing-const-reference" -c cxx.profile="fbcode//fdo/autofdo/unicorn/topaggr/top_aggregator_server:autofdo" unicorn/topaggr:top_aggregator_server
```

Differential Revision: D59708722

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130661
Approved by: https://github.com/Skylion007
2024-07-16 23:47:48 +00:00
eqy
de177b50f8 [cuDNN][SDPA] Support attn_bias in cuDNN (#130482)
CC @drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130482
Approved by: https://github.com/drisspg
2024-07-16 23:45:21 +00:00
4f40a7078e Revert "[FSDP2] Allowed List[nn.Module] as arg (#127786)"
This reverts commit d3ab8cecedd7843b8caed5946404704a18479811.

Reverted https://github.com/pytorch/pytorch/pull/127786 on behalf of https://github.com/atalman due to bottom pr from the stack is failing on internal error ([comment](https://github.com/pytorch/pytorch/pull/127786#issuecomment-2231999178))
2024-07-16 23:45:17 +00:00
7919f0b952 Add buffer static input tests to cudagraph trees (#130402)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130402
Approved by: https://github.com/eellison
ghstack dependencies: #130393
2024-07-16 22:12:38 +00:00
415d5e53ae Propagate buffer and parameter indices through AOT (#130393)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130393
Approved by: https://github.com/bdhirsh
2024-07-16 22:12:38 +00:00
5f3c356a56 Revert "[inductor] adapte windows file path (#130713)"
This reverts commit 69e99172450e40536bf2e6c110183d34a0e283e2.

Reverted https://github.com/pytorch/pytorch/pull/130713 on behalf of https://github.com/clee2000 due to broke functorch\test_eager_transforms.py on windows https://github.com/pytorch/pytorch/actions/runs/9958208725/job/27530132704 69e9917245.  Test failure on PR is real, possibly force merged to get around lint error? ([comment](https://github.com/pytorch/pytorch/pull/130713#issuecomment-2231901793))
2024-07-16 22:07:55 +00:00
2eec02523b [autograd] Support GradientEdge as output for torch.autograd.grad (#127766)
This is useful for splitting grad to run in two parts while preserving intermediates:

<details>

<summary>
Click to see code
</summary>

```python
import collections
import weakref
from typing import Deque

import torch
from torch.autograd.graph import GradientEdge

def _get_grad_fn_or_grad_acc(t):
    if t.requires_grad and t.grad_fn is None:
        return t.view_as(t).grad_fn.next_functions[0][0]
    else:
        return t.grad_fn

def reverse_closure(roots, target_nodes):
    # Recurse until we reach a target node
    closure = set()
    actual_target_nodes = set()
    q: Deque = collections.deque()
    for node in roots:
        if node is not None and node not in closure:
            closure.add(node)
            q.append(node)
    while q:
        node = q.popleft()
        reverse_edges = node.metadata.get("reverse_edges", [])
        for holder_ref, idx in reverse_edges:
            ref = holder_ref()
            if ref is None:
                raise RuntimeError("Reverse graph is no longer alive")
            fn = ref.node
            if fn in closure or fn is None:
                continue
            if fn in target_nodes:
                actual_target_nodes.add(fn)
                continue
            closure.add(fn)
            q.append(fn)
    return closure, actual_target_nodes

# Enable weak pointer
class Holder():
    def __init__(self, node):
        self.node = node

# TODO: use weak references to avoid reference cycle
def construct_reverse_graph(roots):
    q: Deque = collections.deque()
    root_seen = set()
    reverse_graph_refs = []
    for node in roots:
        if node is not None and node not in root_seen:
            q.append(node)
            root_seen.add(node)
    while q:
        node = q.popleft()
        for fn, idx in node.next_functions:
            if fn is not None:
                # Don't necessarily need to store on the graph
                reverse_edges = fn.metadata.get("reverse_edges", [])
                if len(reverse_edges) == 0:
                    q.append(fn)
                holder = Holder(node)
                holder_ref = weakref.ref(holder)
                reverse_graph_refs.append(holder)
                reverse_edges.append((holder_ref, idx))
                fn.metadata["reverse_edges"] = reverse_edges
    return reverse_graph_refs

def get_param_groups(inputs, params):
    inputs_closure, _ = reverse_closure(inputs, set())
    param_groups = dict()  # keyed on intermediates
    for i, param in enumerate(params):
        closure, intersected = reverse_closure([param], inputs_closure)
        param_group = {
            "params": set([param]),
            "intermediates": set(intersected),
        }
        for input_node in intersected:
            existing = param_groups.get(input_node, None)
            if existing is not None:
                existing["params"] = existing["params"].union(param_group["params"])
                existing["intermediates"] = existing["intermediates"].union(param_group["intermediates"])
                param_group = existing
            else:
                param_groups[input_node] = param_group

    # Sanity check: union of all param_groups params should be equal to all params
    union_params = set()
    seen_ids = set()
    unique_param_groups = []
    for param_group in param_groups.values():
        if id(param_group) not in seen_ids:
            seen_ids.add(id(param_group))
            unique_param_groups.append(param_group)
            union_params = union_params.union(param_group["params"])
    assert union_params == set(params)

    return unique_param_groups

def compute_grads_only_inputs2(roots, inps, weights):
    root_grad_fns = list(map(_get_grad_fn_or_grad_acc, roots))
    inp_grad_fns = list(map(_get_grad_fn_or_grad_acc, inps))
    weight_grad_fns = list(map(_get_grad_fn_or_grad_acc, weights))

    reverse_graph_refs = construct_reverse_graph(root_grad_fns)
    param_groups = get_param_groups(inp_grad_fns, weight_grad_fns)
    del reverse_graph_refs

    for param_group in param_groups:
        for i, intermediate in enumerate(param_group["intermediates"]):
            def get_hook(param_group, i):
                def hook(grad_inputs):
                    if param_group.get("grads", None) is None:
                        param_group["grads"] = [None] * len(param_group["intermediates"])
                    param_group["grads"][i] = grad_inputs
                return hook
            # These are always "split" nodes that we need to recompute, so
            # save their inputs.
            intermediate.register_prehook(get_hook(param_group, i))

    dinputs = torch.autograd.grad(tuple(roots), inputs=tuple(inps), grad_outputs=tuple(torch.ones_like(r) for r in roots), retain_graph=True)
    return dinputs, param_groups

def compute_grads_only_weights2(user_weights, param_groups):
    all_dweights = dict()
    for param_group in param_groups:
        # TODO: Handle case where intermediate can have multiple outputs
        intermediate_edges = tuple(GradientEdge(i, 0) for i in param_group["intermediates"])
        weights_edges = tuple(GradientEdge(w, 0) for w in param_group["params"])

        assert all(len(g) == 1 for g in param_group["grads"])
        # [NEW!] Able to pass a GradientEdge to autograd.grad as output
        # We do not need to retain_graph because... guarantee no overlap?
        print("trying to execute: ", intermediate_edges, weights_edges)
        dweights = torch.autograd.grad(intermediate_edges, weights_edges, grad_outputs=sum(param_group["grads"], tuple()))
        for w, dw in zip(param_group["params"], dweights):
            all_dweights[w] = dw
    # return grads in the original order weights were provided in
    out = []
    for w in user_weights:
        grad_acc = _get_grad_fn_or_grad_acc(w)
        out.append(all_dweights[grad_acc])
    return tuple(out)

```

</details>

```python
import torch
import torch.nn as nn

# Setup
mod1 = nn.Linear(10, 10)
mod2 = nn.Linear(10, 10)

a = torch.rand(10, requires_grad=True)

weights = tuple(mod1.parameters()) + tuple(mod2.parameters())
inps = (a,)

out = mod2(mod1(a))

class LoggingTensorMode(torch.utils._python_dispatch.TorchDispatchMode):
    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        if kwargs is None:
            kwargs = {}
        rs = func(*args, **kwargs)
        print(f"{func.__module__}.{func.__name__}")
        return rs

print(" -- SPLIT -- ")
# Compute gradients in two parts
with LoggingTensorMode():
    print("PART 1")
    dinputs, state = compute_grads_only_inputs2((out,), inps, weights)
    print("PART 2")
    dweights = compute_grads_only_weights2(weights, state)

out = mod2(mod1(a))

print(" -- REF -- ")

# Compare with reference
with LoggingTensorMode():
    ref_all_gradients = torch.autograd.grad(out, inputs=tuple(inps) + weights, grad_outputs=(torch.ones_like(out),))

for actual, ref in zip(dinputs + dweights, ref_all_gradients):
    print(torch.allclose(actual, ref))

```

<img width="598" alt="image" src="https://github.com/pytorch/pytorch/assets/13428986/3681b8a7-3ab4-4d1d-a836-abef6913e671">

```
PART 1
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.ones_like.default
V0603 10:17:21.590878 8300067520 torch/autograd/graph.py:751] Executing: <ViewBackward0 object at 0x12a1ee160> with grad_outputs: [f32[10]]
torch._ops.aten.view.default
V0603 10:17:21.591204 8300067520 torch/autograd/graph.py:751] Executing: <AddmmBackward0 object at 0x12a1ee0d0> with grad_outputs: [f32[1, 10]]
torch._ops.aten.t.default
torch._ops.aten.mm.default
V0603 10:17:21.591578 8300067520 torch/autograd/graph.py:751] Executing: <ViewBackward0 object at 0x100d7ae50> with grad_outputs: [f32[1, 10]]
torch._ops.aten.view.default
V0603 10:17:21.591747 8300067520 torch/autograd/graph.py:751] Executing: <ViewBackward0 object at 0x12a1e4a60> with grad_outputs: [f32[10]]
torch._ops.aten.view.default
V0603 10:17:21.591834 8300067520 torch/autograd/graph.py:751] Executing: <AddmmBackward0 object at 0x12a1e4bb0> with grad_outputs: [f32[1, 10]]
torch._ops.aten.t.default
torch._ops.aten.mm.default
V0603 10:17:21.591922 8300067520 torch/autograd/graph.py:751] Executing: <ViewBackward0 object at 0x12a1e4a90> with grad_outputs: [f32[1, 10]]
torch._ops.aten.view.default
PART 2
trying to execute:  (GradientEdge(node=<AddmmBackward0 object at 0x12a1e4bb0>, output_nr=0),) (GradientEdge(node=<AccumulateGrad object at 0x12a21b130>, output_nr=0), GradientEdge(node=<AccumulateGrad object at 0x12a21b7c0>, output_nr=0))
V0603 10:17:21.592223 8300067520 torch/autograd/graph.py:751] Executing: <AddmmBackward0 object at 0x12a1e4bb0> with grad_outputs: [f32[1, 10]]
torch._ops.aten.t.default
torch._ops.aten.mm.default
torch._ops.aten.t.default
torch._ops.aten.sum.dim_IntList
torch._ops.aten.view.default
V0603 10:17:21.592421 8300067520 torch/autograd/graph.py:751] Executing: <TBackward0 object at 0x12a1cad60> with grad_outputs: [f32[10, 10]]
torch._ops.aten.t.default
trying to execute:  (GradientEdge(node=<AddmmBackward0 object at 0x12a1ee0d0>, output_nr=0),) (GradientEdge(node=<AccumulateGrad object at 0x12a1e41c0>, output_nr=0), GradientEdge(node=<AccumulateGrad object at 0x12a21b670>, output_nr=0))
V0603 10:17:21.593481 8300067520 torch/autograd/graph.py:751] Executing: <AddmmBackward0 object at 0x12a1ee0d0> with grad_outputs: [f32[1, 10]]
torch._ops.aten.t.default
torch._ops.aten.mm.default
torch._ops.aten.t.default
torch._ops.aten.sum.dim_IntList
torch._ops.aten.view.default
V0603 10:17:21.593750 8300067520 torch/autograd/graph.py:751] Executing: <TBackward0 object at 0x12a21b2b0> with grad_outputs: [f32[10, 10]]
torch._ops.aten.t.default
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.view.default

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127766
Approved by: https://github.com/albanD
2024-07-16 21:46:19 +00:00
c1e7e40f24 Revert "[Traceable FSDP2][Inductor] Re-inplace all_gather_into_tensor (#129773)"
This reverts commit f2f31027ce8dc3985663bf6eaa66f3c5559b724a.

Reverted https://github.com/pytorch/pytorch/pull/129773 on behalf of https://github.com/clee2000 due to failed inductor/test_torchinductor_dynamic_shapes.py on mac https://github.com/pytorch/pytorch/actions/runs/9963396991/job/27530249256 f2f31027ce.  The build failed on PR so test jobs didn't run ([comment](https://github.com/pytorch/pytorch/pull/129773#issuecomment-2231808437))
2024-07-16 20:54:14 +00:00
4e479568df [PT2] Log compile ID in the signpost event (#130801)
Summary:
We should log compile ID as well for easier comparison.

I'm currently going through some of this data; I think we should make a few more changes as well.

Reland for D59725870

Test Plan: Sandcastle and Pytorch

Differential Revision: D59789110

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130801
Approved by: https://github.com/oulgen
2024-07-16 20:47:36 +00:00
2ceade37c5 [SymmetricMemory] put socket files in /tmp (#130757)
Currently the socket files are put in the current directory, which may not be writable in all environments.
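
As a rough illustration of the idea (not the code from this PR), a socket path can be built under the system temporary directory instead of the current working directory. The helper name `socket_path` and the file naming scheme are assumptions made for this sketch; `tempfile.gettempdir()` usually resolves to `/tmp` on Linux but may differ depending on the environment.

```python
import os
import tempfile


def socket_path(name: str) -> str:
    """Hypothetical helper: place a socket file under the temp dir.

    The temp dir is normally writable even when the current working
    directory is not.
    """
    return os.path.join(tempfile.gettempdir(), f"{name}-{os.getpid()}.sock")


print(socket_path("symm-mem"))
```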

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130757
Approved by: https://github.com/Chillee
ghstack dependencies: #130756
2024-07-16 20:21:05 +00:00
0468f2616a [SymmetricMemory] make sure different subgroups with the same name use different store prefixes (#130756)
This fixes a race condition in which different subgroups with the same name on the same host would use the same store.
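
One way to picture the fix is sketched below. It is an assumption for illustration only and may not match how the PR derives its prefixes: the store prefix is built from both the group name and the group's membership, so two subgroups that share a name still get distinct keys. `store_prefix` is a hypothetical helper, not a PyTorch API.

```python
from typing import Sequence


def store_prefix(group_name: str, ranks: Sequence[int]) -> str:
    """Hypothetical helper: a prefix unique per (name, membership) pair.

    Sorting the ranks makes the prefix identical on every member of the
    same subgroup, regardless of the order in which ranks were listed.
    """
    members = ",".join(str(r) for r in sorted(ranks))
    return f"{group_name}/{members}"


# Two subgroups with the same name but different members no longer collide.
print(store_prefix("tp", [0, 1]))
print(store_prefix("tp", [2, 3]))
```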

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130756
Approved by: https://github.com/Chillee
2024-07-16 20:21:05 +00:00
f2f31027ce [Traceable FSDP2][Inductor] Re-inplace all_gather_into_tensor (#129773)
FSDP2 eager pre-allocates the output buffer for AllGather and the AllGather just writes into that buffer. However, under compile, by default we use out-of-place AllGather, which means in Traceable FSDP2 case we will be unnecessarily using more memory than eager. We want to re-inplace that AllGather instead.

This PR adds a post_grad pass to re-inplace all_gather_into_tensor (i.e. changing it from `all_gather_into_tensor.default` out-of-place op to `all_gather_into_tensor_out.default` out-variant op).

One thing to note is that since with this pass we are introducing a mutable op into the post_grad FX graph, we must do this pass after `reinplace_inplaceable_ops` (at which point we are okay again with having mutable ops in the graph). To facilitate this, this PR adds a `post_grad_custom_post_reinplace_pass` extension point to allow user-defined post-reinplace FX passes.

---

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_fullgraph_backend_inductor`

---

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129773
Approved by: https://github.com/eellison
2024-07-16 20:07:41 +00:00
156b99cfb1 [inductor] Handle inductor counters in fx graph cache (#130635)
Summary: Similar to the handling of metrics, save inductor counter deltas in the FX graph cache entry and increment the counters appropriately on a cache hit

Test Plan: new unit test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130635
Approved by: https://github.com/eellison
2024-07-16 20:07:16 +00:00
d548417d95 [NJT] throw an exception if nested_tensor_from_jagged is fx-traced without being fx.wrapped (#130702)
The NJT constructor can't be fx-traced safely due to the dummy nt used:

774ca93fd2/torch/nested/_internal/nested_tensor.py (L501-L508)

The error doesn't appear immediately, but appears if you try to move a module with an fx-traced NJT constructor onto a different device, or try to serialize it. Let's throw an error if we try to fx-trace the NJT constructor so users know to wrap the call.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130702
Approved by: https://github.com/jbschlosser, https://github.com/soulitzer
2024-07-16 19:21:10 +00:00
0851de5b16 Revert "[ONNX] Remove beartype usage (#130484)"
This reverts commit 1794c35912025aa44b0d70f67ff664b4f7bd1014.

Reverted https://github.com/pytorch/pytorch/pull/130484 on behalf of https://github.com/clee2000 due to test_sympy_utils failure is real https://github.com/pytorch/pytorch/actions/runs/9961499559/job/27523758780 1794c35912.  Dr CI is matching with commits in current commit? ([comment](https://github.com/pytorch/pytorch/pull/130484#issuecomment-2231575577))
2024-07-16 18:41:51 +00:00
09b1b113f5 Cache min / max seq len for torch.nested.as_nested_tensor(t) (#130766)
For the `torch.nested.as_nested_tensor(t)` constructor, computing min / max seq len is trivial since the sequence lengths are all the same. Might as well cache them during construction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130766
Approved by: https://github.com/YuqingJ, https://github.com/soulitzer
2024-07-16 18:32:47 +00:00
408c921d96 Make hashing a SymInt raise an error again (#130548)
See https://github.com/pytorch/pytorch/issues/130547

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130548
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/lezcano
2024-07-16 18:30:30 +00:00
1d8baa4df2 [torchbench][servicelab] Fix servicelab test failures (#130781)
Fix servicelab test failures
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130781
Approved by: https://github.com/desertfire
2024-07-16 17:35:13 +00:00
1794c35912 [ONNX] Remove beartype usage (#130484)
beartype has served us well in identifying type errors and ensuring we call internal functions with the correct arguments (thanks!). However, the value of having beartype is diminished because of the following:

1. When beartype improved its support for Dict[] type checking, it discovered typing mistakes in some functions that were previously uncaught. This caused the exporter to fail with newer versions of beartype when it used to succeed. Since we cannot fix PyTorch and release a new version just because of this, it creates confusion for users who have beartype in their environment when they use torch.onnx
2. beartype adds an additional call line in the traceback, which makes the already thick dynamo stack even larger, affecting readability when users diagnose errors with the traceback.
3. Since the typing annotations need to be evaluated, we cannot use new syntaxes like `|` because we need to maintain compatibility with Python 3.8. We don't want to wait for PyTorch to take py310 as the lowest supported Python before using the new typing syntaxes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130484
Approved by: https://github.com/titaiwangms
2024-07-16 17:34:36 +00:00
67e22d6c61 [Fix]: Convert operator that does specialization to its symbolic counterpart (#129578)
#### Issue
During conversion, use the symbolic operator when one exists.

#### Test Plan
`pytest test/export/test_converter.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129578
Approved by: https://github.com/angelayi
2024-07-16 17:19:57 +00:00
e8998d68c8 [export] add non-strict training IR (#130062)
Summary: Adds non-strict implementation of training IR export. Any expected non-strict training IR failures are also either existing strict training IR or non-strict failures (no new failures added). Four strict training IR failures are also resolved.

Refraining from unifying export/export_for_training, per @ydwu4's feedback :)

Test Plan: added test_export_training_ir_to_run_decomp_non_strict.py for non-strict training IR

Differential Revision: D59349454

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130062
Approved by: https://github.com/ydwu4, https://github.com/zhxchen17
2024-07-16 17:08:00 +00:00
d2f44eabe7 [Export] Support aten.full.default and aten.full_like.default (#130639)
Summary: Add operator tests for full & full_like operators

Test Plan:
Rerun kernel test using
```
buck2 run //glow/fba/tests:run_kernel mode/dev -- --kernel splat --config "input=1;dtype=fp32;fill_value=42.0"  -tl_time
```
{F1752274071}

Operator tests
```
buck2 run mode/{opt,inplace} //caffe2/torch/fb/test_library:afg_operator_test -- -k __full__
```
{F1752340913}

Differential Revision: D59593849

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130639
Approved by: https://github.com/StellarrZ
2024-07-16 16:50:04 +00:00
f272e0ab4a [inductor] support unbacked symint divisors in vars_and_sizes (#130595)
Scenario:
```
>>> nodes
IterationRangesEntry(
    x2,
    divisor=192*u0 + 192576,
    length=s1,
    (xindex//(192*u0 + 192576)),
    {x0: 192, x1: u0 + 1003, x2: s1, x3: 192*s1*u0 + 192576*s1, x4: 192*u0 + 192576})
IterationRangesEntry(
    x1,
    divisor=192,
    length=u0 + 1003,
    ModularIndexing(xindex, 192, u0 + 1003),
    {x0: 192, x1: u0 + 1003, x2: s1, x3: 192*s1*u0 + 192576*s1, x4: 192*u0 + 192576})
IterationRangesEntry(
    x0,
    divisor=1,
    length=192,
    ModularIndexing(xindex, 1, 192),
    {x0: 192, x1: u0 + 1003, x2: s1, x3: 192*s1*u0 + 192576*s1, x4: 192*u0 + 192576})
```

Think about whether using fallback is safe here. I think it's safe because the divisor of one IterationRangesEntry should be the product of the lengths of the preceding IterationRangesEntry? Unless, one of the lengths divides by an unbacked symint?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130595
Approved by: https://github.com/aakhundov, https://github.com/ezyang
2024-07-16 16:21:38 +00:00
2b43d339fe Make FlexAttention API public (#130755)
# Summary

Makes the prototype API flex_attention public

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130755
Approved by: https://github.com/Chillee
2024-07-16 16:21:25 +00:00
cbda8be537 Revert "Propagate buffer and parameter indices through AOT (#130393)"
This reverts commit 69a77389e2c4052834c89a25757cdbf5f83b6208.

Reverted https://github.com/pytorch/pytorch/pull/130393 on behalf of https://github.com/clee2000 due to broke lint for torch/_functorch/_aot_autograd/subclass_utils.py https://github.com/pytorch/pytorch/actions/runs/9948630877/job/27483551649 80236dca90 lint was green on PR, probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/130393#issuecomment-2231263753))
2024-07-16 15:43:34 +00:00
9cb23ba85b Revert "Add buffer static input tests to cudagraph trees (#130402)"
This reverts commit 80236dca90b0874cb2b6f9c9fa5f159c55726401.

Reverted https://github.com/pytorch/pytorch/pull/130402 on behalf of https://github.com/clee2000 due to broke lint for torch/_functorch/_aot_autograd/subclass_utils.py https://github.com/pytorch/pytorch/actions/runs/9948630877/job/27483551649 80236dca90 lint was green on PR, probably a landrace ([comment](https://github.com/pytorch/pytorch/pull/130393#issuecomment-2231263753))
2024-07-16 15:43:34 +00:00
c509319210 [inductor] Disable remote fx graph cache in test_snode_runtime (#130655)
Summary: Unfortunately we can't save / restore metrics.metrics.node_runtimes in the cache entries because these contain objects that don't pickle: `TypeError: cannot pickle 'PyCapsule' object`.

Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:snode_runtime -- --exact 'caffe2/test/inductor:snode_runtime - test_mm (caffe2.test.inductor.test_snode_runtime.ComputeBoundedTests)' --run-disabled --jobs 18 --stress-runs 10`

Differential Revision: D59705654

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130655
Approved by: https://github.com/oulgen
2024-07-16 15:11:17 +00:00
aa4ad711ef [CCA][Memory Snapshot] Create TraceEntryRingBuffer class for alloc_trace logic (#130741)
Summary:
Move the alloc_trace logic into a separate class, to reduce risk of deadlocks when mixing with CCA's lock. Switch to an std::mutex instead of std::recursive_mutex.

Lets us reuse the logic in the TraceEntryRingBuffer class in later diffs.
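
For readers unfamiliar with the pattern, here is a minimal Python sketch of a ring buffer guarded by its own non-reentrant lock. The actual change is in C++ and uses `std::mutex`; the class name `TraceRingBuffer` and its methods are assumptions for illustration. Keeping the buffer's lock private to the class means callers never hold it while taking another lock, which is what reduces the deadlock risk.

```python
import threading
from collections import deque


class TraceRingBuffer:
    """Hypothetical fixed-size trace buffer with its own private lock."""

    def __init__(self, max_entries: int):
        # A deque with maxlen drops the oldest entry once it is full.
        self._entries = deque(maxlen=max_entries)
        # threading.Lock is non-reentrant, analogous to std::mutex.
        self._lock = threading.Lock()

    def insert(self, entry):
        with self._lock:
            self._entries.append(entry)

    def snapshot(self):
        # Copy under the lock so callers can iterate without holding it.
        with self._lock:
            return list(self._entries)


buf = TraceRingBuffer(max_entries=3)
for i in range(5):
    buf.insert(i)
print(buf.snapshot())
```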

Test Plan: CI, resnet run, and FBR model.

Differential Revision: D59690408

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130741
Approved by: https://github.com/davidberard98
2024-07-16 15:01:48 +00:00
e11c41035c Directly use empty strided in cudagraph copy (#130777)
We had an issue where the `-1` somehow resulted in a negative number of required elements. Not sure why the original didn't work; we should land if CI is green.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130777
Approved by: https://github.com/BoyuanFeng
2024-07-16 14:37:30 +00:00
4c3348932c typing: convert_frame (#130670)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130670
Approved by: https://github.com/Skylion007
ghstack dependencies: #130669
2024-07-16 14:31:35 +00:00
ea25febfab typing: storage (#130669)
This isn't a full typing of the file - it just fixes some uses of unbound 'T' (if you use a TypeVar as an output it also needs to be an input).
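
The rule in parentheses can be illustrated with a small sketch that is unrelated to the storage code itself; `first` and `make_default` are hypothetical names. A TypeVar that appears only in the return annotation gives the type checker nothing to solve it from, while one that also appears in a parameter is bound at the call site.

```python
from typing import List, TypeVar

T = TypeVar("T")

# Problematic (left commented out): T appears only in the return type, so
# a type checker has no argument from which to infer it.
# def make_default() -> T: ...


def first(items: List[T]) -> T:
    """T is bound by the argument, so the return type can be inferred."""
    return items[0]


print(first([1, 2, 3]))
print(first(["a", "b"]))
```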

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130669
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2024-07-16 14:31:35 +00:00
8390843eba Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264)
Fixes #104435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125264
Approved by: https://github.com/ezyang
2024-07-16 14:29:29 +00:00
1fbfb3202d [docs][TorchScript] document c10::AliasAnalysisKind::CONSERVATIVE (#130765)
I spent a while trying to search for this to remember what it was called. Adding it to the OVERVIEW.md docs so it's easier to find.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130765
Approved by: https://github.com/nmacchioni, https://github.com/eellison, https://github.com/aaronenyeshi
2024-07-16 14:20:31 +00:00
69e9917245 [inductor] adapte windows file path (#130713)
This PR depends on https://github.com/pytorch/pytorch/pull/130132 landing successfully.
The detailed log: https://github.com/pytorch/pytorch/issues/124245#issuecomment-2211889758

After the file path was adapted for Windows, the first Windows inductor case ran successfully.

```python
import torch

def foo(x, y):
    a = torch.sin(x)
    b = torch.cos(x)
    return a + b
opt_foo1 = torch.compile(foo)
print(opt_foo1(torch.randn(10, 10), torch.randn(10, 10)))
```

Result:
![image](https://github.com/user-attachments/assets/4944df47-e74d-476b-8eb5-1d1fd5abeb41)

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130713
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/desertfire
2024-07-16 13:53:39 +00:00
53e5b8ac5b [BE]: Update flake8-comprehensions and enable C420 (#130699)
Uses `dict.fromkeys` whenever possible as covered by flake8-comprehensions rule C420. While the ruff rule RUF025 is still in preview, flake8-comprehensions has added a new rule which covers this. Using `dict.fromkeys` is faster when the value being added to the dictionary is the same at every iteration and is immutable; it also removes an unnecessary dict comprehension.
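
A short sketch of the rewrite this rule suggests, with made-up keys and values. It also shows why the value should be immutable: `dict.fromkeys` stores the same object under every key.

```python
keys = ["a", "b", "c"]

# The pattern the rule flags: a comprehension with a constant value.
by_comprehension = {k: 0 for k in keys}

# The suggested replacement.
by_fromkeys = dict.fromkeys(keys, 0)
assert by_comprehension == by_fromkeys

# Caveat: with a mutable value, every key shares one object.
shared = dict.fromkeys(keys, [])
shared["a"].append(1)
print(shared["b"])  # the append through "a" is visible under "b" too
```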

This rule will be enabled with our current ruleset in RUF in 0.6 as C420.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130699
Approved by: https://github.com/lezcano, https://github.com/ezyang
2024-07-16 13:47:49 +00:00
213685ba97 [torchao][pt2 benchmark runner] Run performance test non-alternately (#130136)
Summary:
By default, performance tests (speedup experiments) will run the baseline and test backend alternately.

However, this does not work for the torchao backend, which changes the model in place: the baseline run would also run with the torchao backend, since the model has already been quantized.

Add a new experiment "latency_experiment" to run performance tests non-alternately (first run baseline for a few iterations, then run the test backend).
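
The difference between the two schedules can be sketched as below. This is a simplified illustration rather than the benchmark runner's code; `time_call`, `run_alternately`, and `run_sequentially` are hypothetical names. In the sequential schedule every baseline iteration finishes before the test backend runs for the first time, so an in-place model change cannot leak into the baseline timings.

```python
import time


def time_call(fn):
    """Return the wall-clock duration of a single call."""
    start = time.perf_counter()
    fn()
    return time.perf_counter() - start


def run_alternately(baseline, candidate, iters):
    """Default style: interleave baseline and candidate iterations."""
    base_times, cand_times = [], []
    for _ in range(iters):
        base_times.append(time_call(baseline))
        cand_times.append(time_call(candidate))
    return base_times, cand_times


def run_sequentially(baseline, candidate, iters):
    """Non-alternating style: all baseline runs first, then the candidate."""
    base_times = [time_call(baseline) for _ in range(iters)]
    cand_times = [time_call(candidate) for _ in range(iters)]
    return base_times, cand_times


order = []
run_sequentially(lambda: order.append("base"), lambda: order.append("test"), 2)
print(order)
```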

Test Plan:
```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16
```

```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization autoquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
```

Differential Revision: D59332736

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130136
Approved by: https://github.com/jerryzh168
2024-07-16 13:38:17 +00:00
67c6941b4e Update torch.cat decomp for 0-dim (#130763)
Fix for https://github.com/pytorch/pytorch/issues/130615

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130763
Approved by: https://github.com/Skylion007, https://github.com/mlazos
2024-07-16 13:34:01 +00:00
705da70f2c [inductor][cpp] align dtype convert cache between vec and scalar kernels (#130677)
The conversion cache used for fixing https://github.com/pytorch/pytorch/issues/115260 depended on "store", which might be removed and ignored. This could lead to inconsistent code between the vec and scalar kernels: we generate the scalar kernel first, followed by the vector kernel, so a store buffer removed while generating the scalar kernel affects the vector kernel codegen. This PR moves the caching from "store" to the "to_dtype" calls, which are not affected by removed buffers.

`pytest -k test_consistent_remove_buffers test/inductor/test_cpu_repro.py`

before
```c++
extern "C"  void kernel(const bfloat16* in_ptr0,
                       bfloat16* out_ptr1)
{
    {
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(16L))
        {
            auto tmp0 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>(x0), 16);
            auto tmp1 = at::vec::convert<float>(tmp0);
            auto tmp2 = tmp1 + tmp1;
            auto tmp3 = at::vec::convert<bfloat16>(tmp2);
            auto tmp4 = at::vec::convert<float>(tmp3);
            auto tmp5 = tmp1 + tmp4;
            auto tmp6 = at::vec::convert<bfloat16>(tmp5);
            tmp6.store(out_ptr1 + static_cast<long>(x0), 16);
        }
        #pragma omp simd simdlen(8)
        for(long x0=static_cast<long>(64L); x0<static_cast<long>(65L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = in_ptr0[static_cast<long>(x0)];
            auto tmp1 = c10::convert<float>(tmp0);
            auto tmp2 = decltype(tmp1)(tmp1 + tmp1);
            auto tmp3 = c10::convert<bfloat16>(tmp2);
            auto tmp4 = decltype(tmp1)(tmp1 + tmp2);
            auto tmp5 = c10::convert<bfloat16>(tmp4);
            out_ptr1[static_cast<long>(x0)] = tmp5;
        }
    }
}
```

after
```c++
extern "C"  void kernel(const bfloat16* in_ptr0,
                       bfloat16* out_ptr1)
{
    {
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(16L))
        {
            auto tmp0 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>(x0), 16);
            auto tmp1 = at::vec::convert<float>(tmp0);
            auto tmp2 = tmp1 + tmp1;
            auto tmp3 = at::vec::convert<bfloat16>(tmp2);
            auto tmp4 = tmp1 + tmp2;
            auto tmp5 = at::vec::convert<bfloat16>(tmp4);
            tmp5.store(out_ptr1 + static_cast<long>(x0), 16);
        }
        #pragma omp simd simdlen(8)
        for(long x0=static_cast<long>(64L); x0<static_cast<long>(65L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = in_ptr0[static_cast<long>(x0)];
            auto tmp1 = c10::convert<float>(tmp0);
            auto tmp2 = decltype(tmp1)(tmp1 + tmp1);
            auto tmp3 = c10::convert<bfloat16>(tmp2);
            auto tmp4 = decltype(tmp1)(tmp1 + tmp2);
            auto tmp5 = c10::convert<bfloat16>(tmp4);
            out_ptr1[static_cast<long>(x0)] = tmp5;
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130677
Approved by: https://github.com/leslie-fang-intel
2024-07-16 13:25:05 +00:00
68a4f2a3df Revert "Tighten torch.library.infer_schema input types (#130705)"
This reverts commit ca2d424c6e5358f9fee8dc9ee7477de76b50f848.

Reverted https://github.com/pytorch/pytorch/pull/130705 on behalf of https://github.com/atalman due to Failing internal CI ([comment](https://github.com/pytorch/pytorch/pull/130705#issuecomment-2230821876))
2024-07-16 12:57:11 +00:00
dee0f43fde Add a CI job to check runner det sync (#129746)
Add a new CI job that runs only when the runner determinator files are modified. The job checks that the runner_determinator.py script is in sync with the version embedded in _runner-determinator.yaml.

Fixes TBD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129746
Approved by: https://github.com/zxiiro, https://github.com/ZainRizvi, https://github.com/jeanschmidt
2024-07-16 11:44:55 +00:00
e57101d927 Add testing regarding SparseAdam state_dicts (#130645)
Summary:
- Updated SparseAdam to run test_state_dict_deterministic unit test.
- Made gradients sparse while keeping weights dense in the above test.

Test Plan:
- Ran test_optim.py locally.

Fixes #116507

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130645
Approved by: https://github.com/janeyx99
2024-07-16 11:29:22 +00:00
cyy
168e41009b [structural binding][10/N] Replace std::tie with structural binding (#130784)
Follows  #130404

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130784
Approved by: https://github.com/malfet
2024-07-16 10:28:14 +00:00
747b38c131 [BE][Easy][2/19] enforce style for empty lines in import segments in .ci/ and .github/ (#129753)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129753
Approved by: https://github.com/malfet
ghstack dependencies: #129752
2024-07-16 09:40:00 +00:00
096dc444ce Keep zero check be compatible with different sympy versions (#130729)
# Motivation
I found a difference between sympy 1.12 and 1.13.
```python
# for 1.12
>>> import sympy
>>> a = sympy.Number(0.0)
>>> a == 0
True
```
```python
# for 1.13
>>> import sympy
>>> a = sympy.Number(0.0)
>>> a == 0
False
```
This difference affects the result of [safe_mul](6beec34b1c/torch/utils/_sympy/value_ranges.py (L521-L528)): when `a = sympy.Number(0.0)` and `b = inf`, the result is `nan` under sympy 1.13 (the expected result is **0**).
```python
def safe_mul(a, b):
    # Make unknown() * wrap(0.0) == wrap(0.0)
    if a == 0.0:
        return a
    elif b == 0.0:
        return b
    else:
        return a * b
```

Across sympy versions, `sympy.Number(0)` behaves consistently: it always compares equal to 0.0.
```python
>>> import sympy
>>> a = sympy.Number(0)
>>> a == 0.0
True # for different sympy versions
```
So, use 0.0 when checking for zero in safe_mul to stay compatible with different sympy versions.
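
A minimal sketch of why the zero short-circuit matters. It uses plain Python floats as stand-ins for sympy numbers (an assumption, made so the snippet runs without sympy): without the guard, `0.0 * inf` evaluates to `nan` under IEEE 754 arithmetic.

```python
import math


def safe_mul(a, b):
    # Short-circuit on zero so that 0 * inf yields 0 rather than nan.
    if a == 0.0:
        return a
    elif b == 0.0:
        return b
    else:
        return a * b


naive = 0.0 * math.inf              # unguarded product is nan
guarded = safe_mul(0.0, math.inf)   # guarded product stays 0.0
print(math.isnan(naive), guarded)
```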

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130729
Approved by: https://github.com/lezcano, https://github.com/EikanWang
2024-07-16 08:39:00 +00:00
fedae41c57 [dynamo] Do not mark nn.module containers as BuiltinNNModuleVariable (#130773)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130773
Approved by: https://github.com/williamwen42, https://github.com/mlazos
2024-07-16 06:55:46 +00:00
83eedf66b9 Update libfmt submodule to 11.0.1 (#130628)
Update libfmt to 11.0.1; this reopens https://github.com/pytorch/pytorch/pull/129962. It requires a kineto update, and since the new version moves fmt::join into a separate include, that include is added where necessary.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130628
Approved by: https://github.com/aaronenyeshi
2024-07-16 06:12:11 +00:00
c549629696 [CD] Fix xpu nightly wheel test failure (#130742)
The xpu nightly wheel test hit a permission issue on the `linux.idc.xpu` runner: those runners are onboarded with the `jenkins` user, but the binary test runs in the docker container as `root` directly, so the temp files can't be deleted. See https://github.com/pytorch/pytorch/actions/runs/9935452320/job/27448053625#step:8:91
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130742
Approved by: https://github.com/atalman
2024-07-16 05:31:20 +00:00
cyy
95dbbf713e [Distributed] [9/N] Fix clang-tidy warnings in torch/csrc/distributed/rpc (#130109)
Follows #125102

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130109
Approved by: https://github.com/ezyang
2024-07-16 04:23:42 +00:00
7b2e802f31 [dtensor] add a few dunder methods to pointwise ops (#130754)
fixes https://github.com/pytorch/pytorch/issues/130671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130754
Approved by: https://github.com/Skylion007, https://github.com/awgu, https://github.com/msaroufim
ghstack dependencies: #130753
2024-07-16 02:53:35 +00:00
2b2671a7b1 [dtensor] fix foreach_norm when ord is 2 (#130753)
As titled: fixes a case where, when `ord` is passed as 2 (the default value),
the op dispatching does not receive the default value.

We simply check whether the args schema receives an `ord` field or not.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130753
Approved by: https://github.com/awgu
2024-07-16 02:53:35 +00:00
a29052a0bf [BE][Ez]: Update ruff to 0.5.2 (#130698)
Update ruff to 0.5.2, which includes bug fixes and performance improvements.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130698
Approved by: https://github.com/ezyang
2024-07-16 01:31:30 +00:00
ad314a2f05 Pass torch.load(weights_only=) internally to avoid FutureWarning (#130663)
Fixes #130658

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130663
Approved by: https://github.com/malfet, https://github.com/LucasLLC
2024-07-16 01:24:38 +00:00
3cd2ae331a Use inductor TestCase for distributed tests (#129494)
Summary: At least some of the tests deriving from MultiProcessTestCase exercise inductor. Using the inductor TestCase class makes sure we always get a clean cache dir.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129494
Approved by: https://github.com/eellison
2024-07-16 01:24:35 +00:00
39eeaac4e5 inductor: avoiding moving constructor to cuda when it would cause h2d sync in index_put_ fallback (#130338)
My attempt at a fix for https://github.com/pytorch/pytorch/issues/130335, see issue for more details / internal xref. Any feedback from inductor folks is appreciated. I attempted to make the move-constructors-to-cuda pass a bit less aggressive by detecting when the movement would incur a H2D sync for `aten.index_put_`. I'm not sure if there are any other ops that inductor falls back to eager on, that may-or-may-not incur a H2D sync if we change any of their inputs from cpu to cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130338
Approved by: https://github.com/eellison
2024-07-16 00:48:58 +00:00
93a03edcf9 Update error message in meta__convert_weight_to_int4pack (#130707)
This PR is to fix error message in https://github.com/pytorch/pytorch/pull/129940.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130707
Approved by: https://github.com/lezcano, https://github.com/malfet
2024-07-16 00:44:35 +00:00
a3abfa5cb5 [BE][Easy][1/19] enforce style for empty lines in import segments (#129752)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129752
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-07-16 00:42:56 +00:00
eqy
5e617d7ef5 [CUDA] Actually bump tolerances for test_grad_pca_lowrank (#130770)
Fixes change in #129902 to actually bump pca rather than svd, thanks @ptrblck for the catch

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130770
Approved by: https://github.com/Skylion007
2024-07-16 00:41:10 +00:00
80236dca90 Add buffer static input tests to cudagraph trees (#130402)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130402
Approved by: https://github.com/eellison
ghstack dependencies: #130391, #130392, #130503, #130393
2024-07-16 00:25:38 +00:00
69a77389e2 Propagate buffer and parameter indices through AOT (#130393)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130393
Approved by: https://github.com/bdhirsh
ghstack dependencies: #130391, #130392, #130503
2024-07-16 00:25:38 +00:00
200d3d0a89 Remove static param counting if inlining NN modules (#130503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130503
Approved by: https://github.com/bdhirsh
ghstack dependencies: #130391, #130392
2024-07-16 00:25:34 +00:00
0d0c09702a Update mark_static_address for inlining NN modules (#130392)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130392
Approved by: https://github.com/anijain2305
ghstack dependencies: #130391
2024-07-16 00:25:29 +00:00
d8616eb66a Mark nn_module params and buffers as static in dynamo (#130391)
This PR marks all buffers and parameters of an NNModule as static using the `mark_static_address` API. As a result, when tensors are passed to AOT, the `tensor_dict` metadata of placeholder nodes will contain the `static_address_type` key, indicating which graph argument positions are static for cudagraphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130391
Approved by: https://github.com/anijain2305
2024-07-16 00:25:23 +00:00
9ab8d47f9d Constant folding for dynamic shape node (#129686)
Extend constant folding for dynamic shape node, only support pointwise op and some restricted ops

We support dynamic shapes by limiting constant folding to ops that are guaranteed to have uniform values (full, pointwise ops, and views) and running these operators with tensors of shape 1. This also eliminates the possibility of memory overhead from constant folding.

Taken over from https://github.com/pytorch/pytorch/pull/128937

joint work with @imzhuhl

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129686
Approved by: https://github.com/Chillee
ghstack dependencies: #130367
2024-07-16 00:17:11 +00:00
ea4f310ff1 [Nested Tensor][easy] Add softmax backward support (#130602)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130602
Approved by: https://github.com/davidberard98, https://github.com/jbschlosser
2024-07-16 00:07:42 +00:00
d3ab8ceced [FSDP2] Allowed List[nn.Module] as arg (#127786)
This PR allows `fully_shard`'s first argument to be `List[nn.Module]` instead of strictly `nn.Module`. This allows more flexible grouping of modules/parameters for communication, which can lead to memory savings and/or more efficient communication.

**Approach**
At a high level, we can think of a model as a tree of modules. Previously, we could only select specific module nodes in this tree as representing one FSDP parameter group. With this PR, we can select a group of module nodes, effectively becoming a single super node.

To implement the runtime schedule, we define new forward hooks that run based on the following semantics:
- If a module is the first to run the pre-hook, actually run the given pre-hook. Otherwise, the pre-hook is a no-op.
- If a module is the last to run the post-hook, actually run the given post-hook. Otherwise, the post-hook is a no-op.
- First and last are determined by scoreboarding against a set of the modules.
- This set must get cleared at the end of backward in the case that >=1 module in the list is never used, in which case we still want the forward hooks to run in the next forward after this backward.
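
The hook semantics above can be sketched roughly as follows. This is an illustrative toy, not the FSDP2 implementation: `GroupHookState`, its methods, and the use of string names in place of modules are all hypothetical, and resetting the scoreboard once every module has run is an assumption.

```python
class GroupHookState:
    """Hypothetical scoreboard shared by a group of modules."""

    def __init__(self, modules):
        self.modules = set(modules)
        self.pre_seen = set()   # modules whose pre-hook has fired
        self.post_seen = set()  # modules whose post-hook has fired

    def pre_forward(self, module, hook):
        # Only the first module to reach its pre-hook runs the real hook.
        is_first = not self.pre_seen
        self.pre_seen.add(module)
        if is_first:
            hook()

    def post_forward(self, module, hook):
        # Only the last module to reach its post-hook runs the real hook.
        self.post_seen.add(module)
        if self.post_seen == self.modules:
            hook()
            self.reset()

    def reset(self):
        # Also meant to be called at the end of backward, so that a module
        # that never ran does not block the hooks in the next forward.
        self.pre_seen.clear()
        self.post_seen.clear()


calls = []
state = GroupHookState(["norm", "output"])
for name in ["norm", "output"]:
    state.pre_forward(name, lambda: calls.append("pre"))
    state.post_forward(name, lambda: calls.append("post"))
print(calls)  # ['pre', 'post']
```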

Beyond these new forward hooks, everything else is some simple generalization from `Module` to `List[Module]` or `Tuple[Module, ...]`.

**Examples**
This PR enables wrapping Llama models more efficiently by grouping the final norm and output linear together: https://github.com/pytorch/torchtitan/pull/382.

If at least one of the modules in the list does not run forward before backward, then there will be a warning message like:
```
1 of the 2 modules passed to fully_shard did not run forward before backward, which is error-prone since FSDP post-forward/pre-backward logic will not run for these modules. We recommend passing only modules that run forward together. Modules that did not run forward: [FSDPLinear(in_features=1, out_features=1, bias=True)]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127786
Approved by: https://github.com/yf225, https://github.com/weifengpy
ghstack dependencies: #127773
2024-07-15 23:54:10 +00:00
b27695791e [PT-D] Relaxed contract to allow Sequence[nn.Module] (#127773)
This PR relaxes `@contract` to allow the 1st argument to be `Sequence[nn.Module]` instead of strictly `nn.Module`. This is required for the next PR, which allows `fully_shard` to take in `List[nn.Module]`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127773
Approved by: https://github.com/weifengpy
2024-07-15 23:54:10 +00:00
54a932b0ac Support for expandable segments with cuda graph trees (#128068)
This PR adds support to use expandable segments with private memory pools which should unblock using it with cuda graphs and cuda graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool due to checkpoint saving/restoring not meshing well with how we keep track of unmapped blocks.

The PR itself is pretty short, most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work.

Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial due to the fact that each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together.

The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it's split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned back to Cuda when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments are similar to those for normal blocks and do all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to Cuda.

With Cuda Graph Trees and private memory pools, we need to be able to checkpoint the current state of the memory allocator after each graph capture, and to reapply that state before capturing a new graph after replaying a previously captured one. This way, the new cuda graph capture sees the allocator state as of the point after the replayed graph, so it can reuse empty blocks and allocate new ones.

As mentioned in a below comment, memory in a private pool is cached until the private pool is destroyed and allocations can only grow from extra graph captures, any freeing of memory would result in invalid memory addresses and would break cuda graphs.

One implementation detail to note for unmapped blocks with expandable segments is that unmapped blocks are tracked in a member variable `unmapped` of a `BlockPool`. `unmapped` is *not* part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints, since we never free/unmap memory back to cuda; it is persisted across graph captures / replays.

Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved and kept for when it is needed to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint.
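
As a rough illustration of the traversal described above, the toy below walks a linked list of blocks in one segment and records their state. The `Block` dataclass and `checkpoint_segment` are hypothetical names for this sketch and do not mirror the allocator's real C++ structures.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Block:
    size: int
    allocated: bool
    mapped: bool = True
    next: Optional["Block"] = None


def checkpoint_segment(head):
    """Record (size, allocated, mapped) for every block in a segment."""
    state = []
    block = head
    while block is not None:
        state.append((block.size, block.allocated, block.mapped))
        block = block.next
    return state


# One expandable segment: an allocated block, a free block, and a trailing
# unallocated, unmapped block holding the remaining virtual address space.
tail = Block(size=1024, allocated=False, mapped=False)
free = Block(size=256, allocated=False, next=tail)
head = Block(size=512, allocated=True, next=free)

snapshot = checkpoint_segment(head)
print(snapshot)
```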

Reapplying the checkpoints works by freeing all allocated blocks and merging them into a single block per segment, then for each segment, we manually split and allocate all blocks from the checkpoint and then free the blocks marked as unallocated in the checkpoint state. For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068
Approved by: https://github.com/eqy, https://github.com/eellison
2024-07-15 23:23:23 +00:00
006020ff6e Fix the cudagraph capture of SDPA (#130712)
Summary: The scalar tensor is on CPU by default, which made the cuda graph capture fail. To fix the issue, we put the scalar tensor on GPU.

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//gen_ai/llm_inference/fb/tests:test_llama2_multimodal_generator -- --exact 'gen_ai/llm_inference/fb/tests:test_llama2_multimodal_generator - gen_ai.llm_inference.fb.tests.test_llama2_multimodal_generator.TestGenerator: test_multimodal_decode_gen2'

Differential Revision: D59740639

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130712
Approved by: https://github.com/Skylion007, https://github.com/chenyang78
2024-07-15 23:05:48 +00:00
50ef099ad0 Learn a heuristic to decide whether to pad before mm (#128643)
This PR introduces AutoHeuristic, a framework to collect results from autotuning, learn a heuristic as a machine learning model (a regression tree), and then ship the learned heuristic by generating the regression tree to code.

The heuristics have been learned on artificial/random data that has been collected with the `gen_data_pad_mm.py` script. The `gen_pad_mm_a100.sh` scripts can then be used to learn a heuristic and generate it to code.

The best model is decided by doing a grid search over various values for `max_depth` and `min_samples_leaf` and choosing the model with the highest number of correct predictions on the validation set.

The heuristic can return "unsure" which means that it is not sure which choice is the best choice and as a result autotuning will happen.

On A100 only tensors where each dimension is >= 512 are considered. For smaller tensors the heuristics that I learned returned "unsure" too often.

The results for randomly generated data and huggingface look as follows:
`max_wrong_speedup` is max(`wrong_speedups`) where `wrong_speedups` contains all the speedups one could have achieved for those examples where the heuristic made a wrong choice, i.e. a `max_wrong_speedup` of 1.37 means that the heuristic selected a choice, but the other choice would have been 1.37x faster. `gman_wrong_speedup` is the geomean of `wrong_speedups`.
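
The two metrics can be computed along these lines. This is a sketch only: `wrong_speedup_stats` is a hypothetical helper and the input values are made up for illustration. Returning NaN for an empty list is an assumption, chosen because the rows with zero wrong choices in the table report NaN.

```python
import math


def wrong_speedup_stats(wrong_speedups):
    """Return (max_wrong_speedup, gman_wrong_speedup) for a list of speedups."""
    if not wrong_speedups:
        return math.nan, math.nan
    max_ws = max(wrong_speedups)
    # Geometric mean computed via the mean of logarithms.
    log_mean = sum(math.log(s) for s in wrong_speedups) / len(wrong_speedups)
    return max_ws, math.exp(log_mean)


# Hypothetical speedups the other choice would have achieved.
max_ws, gman = wrong_speedup_stats([1.2, 1.37])
print(max_ws, gman)
```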

The heuristic is learned as a regression tree, that returns higher values for better choices. The threshold decides how much better the better choice has to be for it to be returned, i.e. on A100 if the better choice is less than 1.702530x better than the other choice, "unsure" will be returned. This threshold is determined using the validation set.
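
A small sketch of how such a threshold could gate the decision. The function `decide`, the choice labels, and the interpretation of "x better" as a ratio of positive predicted scores are assumptions for illustration, not the generated heuristic code.

```python
def decide(scores, threshold):
    """Pick the best-scoring choice, or 'unsure' if its margin is too small.

    `scores` maps each choice to a positive predicted value (higher is better).
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, best_score), (_, second_score) = ranked[0], ranked[1]
    # The best choice must beat the runner-up by at least `threshold` times.
    if best_score < threshold * second_score:
        return "unsure"
    return best


print(decide({"pad": 2.0, "no_pad": 1.0}, threshold=1.702530))  # pad
print(decide({"pad": 1.5, "no_pad": 1.0}, threshold=1.702530))  # unsure
```

When `decide` returns "unsure", the caller would fall back to autotuning, as described above.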

A100
```
       max_depth  min_samples_leaf dataset  correct  wrong  unsure  total  max_wrong_speedup  gman_wrong_speedup  threshold
15         5.0                10     train     2730      4    3023   5757           1.372220            1.193873   1.702530
16         5.0                10       val      878      0    1042   1920                NaN                 NaN   1.702530
17         5.0                10      test      925      2     993   1920           1.741708            1.354954   1.702530
18         5.0                10  hf-train       14      0      22     36                NaN                 NaN   1.702530
19         5.0                10    hf-inf        7      0       1      8                NaN                 NaN   1.702530
```

The numbers for huggingface only include tensors where each dim is >=512. If all tensors had been included, there would have been the following number of matmuls where at least one dimension is unaligned:
A100 hf-train: 60
A100 hf-inf: 10

## Results on running huggingface locally
This only includes models where the learned heuristic made at least one decision. For the examples here, it takes around 0.25-0.3 seconds to perform autotuning for the padded and unpadded version, so each decision that the heuristic makes saves around 0.25-0.3 seconds.
#pad_mm_autotuning is the number of times autotuning happened in pad_mm and #heuristic_made_decision is the number of times the heuristic made a decision (i.e. it didn't return "unsure").

I ran huggingface locally, each model 5 times and took the median speedup and compilation_latency.
Results on huggingface training
```
                          name speedup_heuristic speedup_baseline  speedup_diff compilation_latency_heuristic compilation_latency_baseline  compilation_latency_diff  comp_latency_reduction%  #pad_mm_autotuning  #heuristic_made_decision
               BartForCausalLM   1.19 (+/- 0.00)  1.19 (+/- 0.00)         -0.00              40.33 (+/- 1.13)             40.95 (+/- 0.78)                     -0.62                     1.52                   3                         2
  BartForConditionalGeneration   1.53 (+/- 0.06)  1.47 (+/- 0.05)          0.06              81.93 (+/- 5.20)             82.23 (+/- 1.92)                     -0.30                     0.36                   3                         1
    BlenderbotSmallForCausalLM   1.86 (+/- 0.04)  1.86 (+/- 0.00)          0.00              36.76 (+/- 0.49)             37.62 (+/- 1.33)                     -0.87                     2.31                   3                         2
                     CamemBert   2.36 (+/- 0.01)  2.35 (+/- 0.01)          0.01              97.60 (+/- 1.91)             98.69 (+/- 1.35)                     -1.09                     1.11                   2                         1
                   DistillGPT2   2.57 (+/- 0.01)  2.57 (+/- 0.01)          0.00              57.33 (+/- 0.77)             58.26 (+/- 1.41)                     -0.93                     1.59                   3                         2
             PLBartForCausalLM   2.07 (+/- 0.01)  2.06 (+/- 0.01)          0.01              32.54 (+/- 0.83)             34.65 (+/- 0.71)                     -2.11                     6.10                   3                         2
PLBartForConditionalGeneration   1.87 (+/- 0.00)  1.88 (+/- 0.00)         -0.01              58.45 (+/- 1.24)             58.95 (+/- 1.92)                     -0.50                     0.85                   3                         1
            RobertaForCausalLM   2.39 (+/- 0.01)  2.40 (+/- 0.01)         -0.01              97.38 (+/- 1.52)             97.69 (+/- 1.18)                     -0.31                     0.32                   2                         1
              TrOCRForCausalLM   1.70 (+/- 0.00)  1.70 (+/- 0.00)         -0.00              44.79 (+/- 1.33)             45.25 (+/- 1.08)                     -0.46                     1.01                   3                         2

Mean difference in speedup: 0.01
Mean compilation latency saved: -0.80s
Mean compilation latency reduction: 1.68%
```

Results on huggingface inference
```
                          name speedup_heuristic speedup_baseline  speedup_diff compilation_latency_heuristic compilation_latency_baseline  compilation_latency_diff  comp_latency_reduction%  #pad_mm_autotuning  #heuristic_made_decision
               BartForCausalLM   1.11 (+/- 0.00)  1.11 (+/- 0.00)          0.00              19.02 (+/- 0.28)             19.40 (+/- 0.35)                     -0.38                     1.95                   3                         2
  BartForConditionalGeneration   1.26 (+/- 0.01)  1.23 (+/- 0.03)          0.03              36.84 (+/- 0.40)             36.55 (+/- 0.75)                      0.30                    -0.81                   3                         1
    BlenderbotSmallForCausalLM   1.87 (+/- 0.02)  1.87 (+/- 0.01)          0.00              17.53 (+/- 0.31)             18.03 (+/- 0.43)                     -0.49                     2.74                   3                         2
                   DistillGPT2   2.50 (+/- 0.02)  2.50 (+/- 0.01)          0.00              16.16 (+/- 0.29)             16.40 (+/- 0.18)                     -0.24                     1.46                   3                         2
             PLBartForCausalLM   1.93 (+/- 0.01)  1.94 (+/- 0.01)         -0.00              15.30 (+/- 0.22)             16.01 (+/- 0.71)                     -0.71                     4.43                   3                         2
PLBartForConditionalGeneration   1.98 (+/- 0.01)  1.98 (+/- 0.01)          0.00              25.90 (+/- 0.32)             26.58 (+/- 0.62)                     -0.67                     2.53                   3                         1
              TrOCRForCausalLM   1.61 (+/- 0.00)  1.62 (+/- 0.00)         -0.01              21.38 (+/- 0.37)             21.85 (+/- 0.16)                     -0.47                     2.16                   3                         2

Mean difference in speedup: 0.00
Mean compilation latency saved: -0.38s
Mean compilation latency reduction: 2.07%
```

For now, the heuristic can only be applied to decide whether to pad for mm. One could also learn heuristics for bmm and addmm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128643
Approved by: https://github.com/Chillee, https://github.com/eellison
2024-07-15 23:04:06 +00:00
9a5204dc2d [inductor] Remove "spawn" as an option for parallel compile method (#130746)
Summary: Looks like "spawn" is broken. Since we have "subprocess", I don't think we need it any more, so just remove as an option.

Test Plan: Verified that we get: `AssertionError: Invalid start method: spawn`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130746
Approved by: https://github.com/Skylion007
2024-07-15 22:55:54 +00:00
3f031b96c6 [Fix] Correctly identifying arguments for sub-blocks with renaming logic during TorchScript to ExportedProgram conversion (#128386)
#### Issue
Fix two issues related to inputs lifting when there are sub-blocks.
* Some inputs may appear in the nested sub-blocks, which need a recursive search to identify which arguments need to be lifted / passed in the top-level block.
* Some inputs to the sub-block are intermediate results, meaning their names are only numbers. This will cause issues during code generation (i.e., invalid argument names). We rename those to valid names.
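
The renaming step can be pictured with the following sketch. The helper `to_valid_name` and the `_lifted_` prefix are hypothetical and are not the converter's actual naming scheme.

```python
import re


def to_valid_name(name, prefix="_lifted_"):
    """Return `name` unchanged if it is a valid identifier, else a fixed-up one."""
    if name.isidentifier():
        return name
    # Replace characters that cannot appear in an identifier, then add a
    # prefix so that purely numeric names no longer start with a digit.
    return prefix + re.sub(r"\W", "_", name)


for original in ["weight", "12", "x.1"]:
    print(original, "->", to_valid_name(original))
```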

#### Test Plan
* `pytest test/export/test_converter.py -s -k test_convert_nn_module_with_nested_if_and_param`
* `pytest test/export/test_converter.py -s -k test_hidden_input_name`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128386
Approved by: https://github.com/angelayi
2024-07-15 22:48:13 +00:00
b893aa71ca Rename generate_numeric_debug_handle to numeric_debugger (#130590)
Summary:
att

Test Plan:
CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130590
Approved by: https://github.com/dulinriley, https://github.com/tarun292
2024-07-15 22:42:27 +00:00
535016967a Enable UFMT on all of torch/sparse (#130545)
Partially addresses #123062
Ran lintrunner on:
- torch/sparse

Detail:
```
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130545
Approved by: https://github.com/ezyang
2024-07-15 22:35:52 +00:00
7d4f50de19 dynamo add support for defaultdict(set) (#130745)
Fixes #130554

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130745
Approved by: https://github.com/Skylion007
2024-07-15 22:23:33 +00:00
3928ca2ab6 [dynamo] update call map to allow multiple input parameters (#130748)
Fixes https://github.com/pytorch/pytorch/issues/128072.

Commandeering https://github.com/pytorch/pytorch/pull/128282 since the issue is now hi pri.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130748
Approved by: https://github.com/Skylion007, https://github.com/anijain2305
2024-07-15 22:16:49 +00:00
eqy
6f32dc0c7b Don't pass error message as places in assertGreaterAlmostEqual (#130648)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130648
Approved by: https://github.com/awgu
2024-07-15 22:14:49 +00:00
dff9d68f18 Revert "Fix names conflict when lifting (#129817)"
This reverts commit 53cf46b8c602f8512d49a5c30bca7fcf5411e25c.

Reverted https://github.com/pytorch/pytorch/pull/129817 on behalf of https://github.com/clee2000 due to Failing inductor/test_flex_attention.py https://github.com/pytorch/pytorch/actions/runs/9940532858/job/27478084137 74da2a467f Sorry for the churn, possibly a landrace? ([comment](https://github.com/pytorch/pytorch/pull/129817#issuecomment-2229519886))
2024-07-15 22:08:45 +00:00
78799e82b0 Revert "Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264)"
This reverts commit 1bc390c5f5ac065c156f55f4eceed267ecc67b41.

Reverted https://github.com/pytorch/pytorch/pull/125264 on behalf of https://github.com/jithunnair-amd due to test test/inductor/test_cudagraph_trees.py::CudaGraphTreeTests::test_fallback_to_eager_if_recompiling_too_many_times is failing https://github.com/pytorch/pytorch/actions/runs/9933628108/job/27477785946 1bc390c5f5. Test was introduced by fa5f572748 which is before the merge base ([comment](https://github.com/pytorch/pytorch/pull/125264#issuecomment-2229508737))
2024-07-15 21:59:46 +00:00
db3a641b71 Implement operator for micro-pipelined all-gather -> _scaled_mm (#129289)
This PR implements `torch.ops.symm_mem.fused_all_gather_scaled_matmul`. It's similar to `torch.ops.symm_mem.fused_all_gather_matmul`, except that it takes scales and calls ` _scaled_mm`.

[Profiling Trace vs. Baseline](https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmp0gmg1f2_) (FB internal only)

Co-authored-by: Will Feng <yf225@cornell.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129289
Approved by: https://github.com/Chillee, https://github.com/weifengpy, https://github.com/drisspg
2024-07-15 21:48:35 +00:00
77fb5b0e23 [c10d] a new Pytorch API (split_group) to create a process group (#130507)
This is the implementation following the RFC: https://github.com/pytorch/pytorch/issues/130407

ncclCommSplit
Summary:
In current Pytorch/c10d, the new_group API is used to create a new
process group from the default pg.  When device_id is specified in
init_process_group and nccl is used as the backend, the new_group call
will use ncclCommSplit to create the nccl communicators to save
communicator resources. It has a few drawbacks:

Redundant calls
Suppose the default group has 256 ranks, we need to have 32 children PGs
and each child PG has 8 ranks. in this case, each rank needs to call
new_group and ncclCommSplit 32 times because of how we implement
new_group API and the collective requirement of ncclCommSplit. For a
specific global rank, 31 calls of ncclCommSplit would be no_color split,
and only 1 of them is colored split. With the proposed new split_group
API, we expect only 1 call of split_group/ncclCommSplit is needed per
rank in the above example case
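
The call counts in this example can be checked with a short sketch. It assumes, purely for illustration, that child PG `i` holds the contiguous ranks `8*i .. 8*i+7`; the function name is hypothetical.

```python
WORLD_SIZE = 256
RANKS_PER_CHILD = 8
NUM_CHILDREN = WORLD_SIZE // RANKS_PER_CHILD  # 32 child PGs


def new_group_split_calls(rank):
    """Count (colored, no_color) ncclCommSplit calls one rank makes via new_group."""
    colored = 0
    no_color = 0
    for child in range(NUM_CHILDREN):
        members = range(child * RANKS_PER_CHILD, (child + 1) * RANKS_PER_CHILD)
        if rank in members:
            colored += 1   # this rank belongs to the child PG being created
        else:
            no_color += 1  # collective call still required, but no_color
    return colored, no_color


print(new_group_split_calls(0))    # (1, 31)
print(new_group_split_calls(255))  # (1, 31)
```

With split_group, the expectation stated above is that only the single colored call remains per rank.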

new_group can only split from default_pg
Ideally, a new pg should be able to be split from any pg

With the new split_group API, users can create new PGs using
ncclCommSplit with fewer calls and initialize the PG eagerly.
This is also useful when creating many P2P communicators.
Test Plan:
New UTs:
e.g., python test/distributed/test_c10d_nccl.py -k
test_comm_split_group_larger_scale

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130507
Approved by: https://github.com/wconstab
2024-07-15 21:26:43 +00:00
ac3e2cb64a [BE] Delete unused -rg.yml workflow (#130759)
As well as `_linux-test-label.yml` as ARC experiment is dead
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130759
Approved by: https://github.com/ZainRizvi
2024-07-15 20:41:59 +00:00
ee6f0ab190 [DeviceMesh][Reland] Only include the real thread_id in DeviceMesh hash under threaded backend (#130495) (#130685)
Summary:
As a followup to https://github.com/pytorch/pytorch/pull/130454, users are hitting the cross-mesh operation error because the thread ID in the DeviceMesh hash differs between the saved and the loaded DTensor.

This is a hot fix to only consider the real thread_id in DeviceMesh hash under threaded backend, but set it to None for all other cases.
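
A minimal sketch of the hashing rule described above, assuming a simplified key class. `MeshKeySketch` and `make_key_in_new_thread` are hypothetical names used for illustration; this is not the actual DeviceMesh implementation.

```python
import threading
from typing import Optional, Tuple


class MeshKeySketch:
    """Hypothetical stand-in for the hashing rule; not the real DeviceMesh."""

    def __init__(self, ranks: Tuple[int, ...], backend: str) -> None:
        self.ranks = ranks
        # Only the threaded backend records the real thread id. Every other
        # backend stores None, so the hash does not depend on the thread.
        self.thread_id: Optional[int] = (
            threading.get_ident() if backend == "threaded" else None
        )

    def __hash__(self) -> int:
        return hash((self.ranks, self.thread_id))

    def __eq__(self, other: object) -> bool:
        return isinstance(other, MeshKeySketch) and (
            (self.ranks, self.thread_id) == (other.ranks, other.thread_id)
        )


def make_key_in_new_thread(backend: str) -> MeshKeySketch:
    """Build a key on a different thread, as a loading thread might."""
    result = []
    worker = threading.Thread(
        target=lambda: result.append(MeshKeySketch((0, 1), backend))
    )
    worker.start()
    worker.join()
    return result[0]


saved = MeshKeySketch((0, 1), "nccl")
loaded = make_key_in_new_thread("nccl")
same_mesh = saved == loaded and hash(saved) == hash(loaded)
print(same_mesh)
```

Under the non-threaded backend the two keys compare equal even though they were built on different threads, which is the behavior the hot fix is after.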

As a follow up, we need to look at the following test failures, which occur if thread_id is not included as part of the hash, to better root cause the DeviceMesh failures related to MTPG.
```
test/distributed/_composable/fsdp/test_fully_shard_training.py::TestFullyShardRegisteredParams::test_param_registration_after_forward
test/distributed/_tensor/test_dtensor_ops.py::TestDTensorOpsCPU::test_dtensor_op_db_column_stack_cpu_float32
```

Adding an additional is_initialized() check since APF has a test mocking the backend without pg initialized. Therefore, we need to add the is_initialized() check to avoid test failure. In real use case, we should have a pg initialized before the get_backend() check. Not sure if we want to add this specifically for the test, but temporarily adding it to unblock APF conveyor runs.

Test Plan:
```
[irisz@devgpu051.cln3 /data/users/irisz/fbsource/fbcode (38e4a0a3b)]$ buck2 test 'fbcode//mode/opt' fbcode//apf/distributed/tests:pipeline_parallel_test_cpu -- --exact 'apf/distributed/tests:pipeline_parallel_test_cpu - apf.distributed.tests.pipeline_parallel_test_cpu.PipelineParallelContextTestCPU: test_stage_pg_creation_with_different_backends'
```

Reviewed By: gag1jain

Differential Revision: D59725924

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130685
Approved by: https://github.com/gag1jain
2024-07-15 20:05:26 +00:00
27322355de Added some more documentation to block mask creation (#130649)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130649
Approved by: https://github.com/drisspg
ghstack dependencies: #130626
2024-07-15 19:48:42 +00:00
0e79e1f958 [NJT+SDPA]Fix flash_attention output when batch_size=1 and seq_len=1 (#130652)
fix issue  #130196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130652
Approved by: https://github.com/Skylion007, https://github.com/drisspg, https://github.com/jbschlosser
2024-07-15 19:44:04 +00:00
074a5c0c9b Revert "[BE] bump optree version to 0.12.1 (#130139)"
This reverts commit 8fcb156e8b5697a8f292db6db2a1803c5f4ce2d7.

Reverted https://github.com/pytorch/pytorch/pull/130139 on behalf of https://github.com/clee2000 due to broke inductor/test_torchinductor_codegen_dynamic_shapes.py and test_sympy_utils.py 8fcb156e8b ([comment](https://github.com/pytorch/pytorch/pull/130139#issuecomment-2229248447))
2024-07-15 19:42:11 +00:00
f1456c74a0 Fix mkl-static issue for Windows. (#130697)
Background:
We found a performance regression in the PyTorch Windows release/2.4 build: https://github.com/pytorch/pytorch/issues/130619

After some debugging, I found that the PyTorch Windows static mkl build options are wrong:
<img width="1049" alt="image" src="https://github.com/user-attachments/assets/38692142-bfca-4c98-8092-6e105c82bb13">
1. The threading library is wrong.
2. The `openmp` library and config are missing.
> Debug history: https://github.com/pytorch/pytorch/issues/130619#issuecomment-2226782504 and https://github.com/pytorch/pytorch/issues/130619#issuecomment-2226418611

This PR fixes the `mkl-static` build options.
<img width="863" alt="image" src="https://github.com/user-attachments/assets/834f6cee-7e6d-4d74-b2bc-8a270f05e429">

Reference:
<img width="482" alt="image" src="https://github.com/user-attachments/assets/8184dadb-f230-4062-a49f-51df1d7285f5">

https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html#gs.c6izlg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130697
Approved by: https://github.com/jgong5, https://github.com/atalman
2024-07-15 19:28:11 +00:00
a7cfe40c9b [dtensor] Improve from_local API with run_check (#130289)
as titled, this PR:
1. Switches `run_check` to default to False and adds extra doc/comments
   about the correctness guarantee. Since I observed that many callers
   forget to pass run_check=False, we should simply skip the metadata
   check by default and make our documentation explicit
2. Implements the metadata check by picking up the changes from https://github.com/pytorch/pytorch/pull/115229
3. Improves the from_local documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130289
Approved by: https://github.com/awgu, https://github.com/wz337
ghstack dependencies: #130286, #130287, #130288
2024-07-15 18:52:55 +00:00
3342f3aa4e [dtensor] simplify sdpa strategies (#130288)
as titled, this PR simplifies both flash and efficient attention op
strategy generation paths

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130288
Approved by: https://github.com/tianyu-l
ghstack dependencies: #130286, #130287
2024-07-15 18:52:55 +00:00
7d82dc2c23 [dtensor] slice_backward to use op strategy (#130287)
as titled. slice_backward currently forwards the sharding
unconditionally, which is mathematically wrong. This PR switches it to
an op strategy and only allows replication

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130287
Approved by: https://github.com/awgu
ghstack dependencies: #130286
2024-07-15 18:52:49 +00:00
53cf46b8c6 Fix names conflict when lifting (#129817)
## Bug description
When the pending args that may be lifted [here](58f346c874/torch/_dynamo/output_graph.py (L1866)) share the same base name, like `contiguous` and `contiguous_1`, the call into [create_graph_input](58f346c874/torch/_dynamo/output_graph.py (L2081)) can end up creating a name ([here](58f346c874/torch/fx/graph.py (L1008))) that overwrites one of the args to lift, which causes the graph to produce wrong output.

## Reproducing
Below is a reproducible example:
```python
import logging
from typing import List

import torch
from functorch.compile import aot_module_simplified, make_boxed_func

@torch.library.custom_op("mylib::somefunc_forward", mutates_args=())
def somefunc_forward(
    input_: torch.Tensor,
    weight: torch.Tensor,
    shape: List[int],
) -> torch.Tensor:
    return torch.ones_like(input_)

@somefunc_forward.register_fake
def _(input_, weight, shape):
    return torch.empty_like(input_)

@torch.library.custom_op("mylib::somefunc_backward", mutates_args=())
def somefunc_backward(
    grad_output: torch.Tensor,
    input_: torch.Tensor,
    weight: torch.Tensor,
    shape: List[int],
) -> torch.Tensor:
    print(f"backward.{grad_output.shape=}")
    print(f"backward.{input_.shape=}")
    print(f"backward.{weight.shape=}")
    print(f"backward.{shape=}")
    assert list(weight.shape) == shape
    return torch.ones_like(weight)

@somefunc_backward.register_fake
def _(grad_output, input_, weight, shape):
    return torch.empty_like(weight)

def a_func(grad_output, input_, weight_, shape):
    return torch.ones_like(input_.sum() * weight_)

class SomeFunc(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input, weight, normalized_shape):
        ctx.normalized_shape = normalized_shape
        input_ = input.contiguous()
        weight_ = weight.contiguous()
        output = somefunc_forward(input_, weight_, ctx.normalized_shape)
        ctx.save_for_backward(input_, weight_)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        input_, weight_ = ctx.saved_tensors
        # grad_weight = a_func(grad_output, input_, weight_, ctx.normalized_shape)
        grad_weight = somefunc_backward(
            grad_output.contiguous(),
            input_,
            weight_,
            ctx.normalized_shape,
        )
        return None, grad_weight, None

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(7))

    def forward(self, x):
        return SomeFunc.apply(x, self.weight, [7])

model = MyModel()
torch._logging.set_logs(dynamo=logging.DEBUG, aot=logging.DEBUG, graph_code=True)

def aot_print_backend(gm, sample_inputs):
    # Forward compiler capture
    def fw(gm, sample_inputs):
        print(f"----- fw")
        gm.print_readable()
        return make_boxed_func(gm.forward)

    # Backward compiler capture
    def bw(gm, sample_inputs):
        print(f"----- bw")
        gm.print_readable()
        return make_boxed_func(gm.forward)

    # Call AOTAutograd
    gm_forward = aot_module_simplified(
        gm, sample_inputs, fw_compiler=fw, bw_compiler=bw
    )
    return gm_forward

model = torch.compile(
    model,
    backend=aot_print_backend,
    dynamic=False,
)
out = model(torch.rand((128, 4, 7)))
out.mean().backward()
```

The log shows the calls into create_graph_input:
```log
V0629 02:08:46.839914 8200981504 torch/_dynamo/output_graph.py:2042] [0/0] create_graph_input contiguous (none)
V0629 02:08:46.839998 8200981504 torch/_dynamo/output_graph.py:2042] [0/0] create_graph_input contiguous_1 (none)
```

The generated backward graph looks like this:
```log
class GraphModule(torch.nn.Module):
    def forward(self, function_ctx, somefunc_forward_default: "f32[128, 4, 7]", contiguous: "f32[128, 4, 7]", contiguous_1: "f32[7]"):
        contiguous_1 = contiguous
        contiguous_2 = contiguous_1

        # No stacktrace found for following nodes
        _set_grad_enabled = torch._C._set_grad_enabled(False)

         # File: /Users/bytedance/testtorch/test_custom_op_bug.py:61 in backward, code: grad_output.contiguous(),
        contiguous: "f32[128, 4, 7]" = somefunc_forward_default.contiguous();  somefunc_forward_default = None

         # File: /opt/tiger/pytorch/torch/_library/custom_ops.py:506 in __call__, code: return self._opoverload(*args, **kwargs)
        somefunc_backward_default: "f32[7]" = torch.ops.mylib.somefunc_backward.default(contiguous, contiguous_1, contiguous_2, [7]);  contiguous = contiguous_1 = contiguous_2 = None

        # No stacktrace found for following nodes
        _set_grad_enabled_1 = torch._C._set_grad_enabled(True)
        return (None, somefunc_backward_default)
```

The original `somefunc_backward` takes `grad_output`, `input_`, `weight` and `shape` as inputs, where `weight` should have shape `torch.Size([7])`. However, in the graph, `contiguous_1` and `contiguous_2` are both assigned from `contiguous`, which triggers the assertion failure I added in `somefunc_backward`.

## Environment
```log
Collecting environment information...
PyTorch version: 2.5.0a0+git0b7e8df
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.5 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: version 3.26.4
Libc version: N/A

Python version: 3.9.19 (main, May  6 2024, 14:39:30)  [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-14.5-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M3 Pro

Versions of relevant libraries:
[pip3] numpy==2.0.0
[pip3] optree==0.11.0
[pip3] torch==2.5.0a0+git0b7e8df
[pip3] torchgraph==0.0.1
[conda] numpy                     2.0.0                    pypi_0    pypi
[conda] optree                    0.11.0                   pypi_0    pypi
[conda] torch                     2.5.0a0+git0b7e8df           dev_0    <develop>
[conda] torchgraph                0.0.1                     dev_0    <develop>
```

## How to fix?

I put up a naive fix that adds the potential args to lift into the used_names. This accesses private variables; I will fix that if this issue makes sense to you.
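
The idea behind the fix can be illustrated with a simplified model of name allocation. This is a sketch under assumptions, not the actual torch.fx or dynamo code: `create_name` and the set-based bookkeeping are hypothetical stand-ins.

```python
def create_name(candidate: str, used_names: set) -> str:
    """Return a name not yet in used_names, adding _1, _2, ... as needed."""
    name, suffix = candidate, 0
    while name in used_names:
        suffix += 1
        name = f"{candidate}_{suffix}"
    used_names.add(name)
    return name


# Saved tensors that are still waiting to be lifted as graph inputs.
pending_args = ["contiguous", "contiguous_1"]

# Without reserving the pending names, a new node created from
# grad_output.contiguous() can take a name that a pending arg needs.
used = set()
clashing_name = create_name("contiguous", used)

# With the pending names added to used_names first, the new node is
# pushed to a fresh suffix and the pending args keep their names.
used = set(pending_args)
safe_name = create_name("contiguous", used)

print(clashing_name, safe_name)
```

In this model the unreserved case hands out `contiguous`, which collides with a pending arg, while the reserved case hands out `contiguous_2`.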

@zou3519 @oulgen

Co-authored-by: rzou <zou3519@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129817
Approved by: https://github.com/zou3519
2024-07-15 18:49:12 +00:00
b4b64f76e5 Ensure tensors devices match on torch.index_put batch rule impl (#130479)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130479
Approved by: https://github.com/zou3519
2024-07-15 18:16:31 +00:00
00d71b3e86 Tweak tolerances for test_vjp_linalg_tensorsolve_cuda_float32 to pass in Windows / debug builds (#130449)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130449
Approved by: https://github.com/zou3519, https://github.com/malfet
ghstack dependencies: #128238, #130360
2024-07-15 17:35:34 +00:00
9e161af179 Revert "Increase tolerance for tensorsolve tests (#130620)"
This reverts commit 103b6ccab2bd025dfacc8c8a91f71f3d68e50426.

Reverted https://github.com/pytorch/pytorch/pull/130620 on behalf of https://github.com/clee2000 due to didn't work, test is still failing on this PR and on main, reverting in favor of https://github.com/pytorch/pytorch/pull/130449 instead ([comment](https://github.com/pytorch/pytorch/pull/130620#issuecomment-2229036418))
2024-07-15 17:35:04 +00:00
8fcb156e8b [BE] bump optree version to 0.12.1 (#130139)
0.12.0 Major Updates:

- Add context manager to temporarily set the dictionary sorting mode
- Add accessor APIs
- Use `stable` tag for `pybind11` for Python 3.13 support
- Fix potential segmentation fault for pickling support

0.12.1 Updates:

- Fix warning regression during import when launched with strict warning filters

Closes #130155
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130139
Approved by: https://github.com/zou3519
2024-07-15 17:27:07 +00:00
1e897a0ca4 Revert "Fix names conflict when lifting (#129817)"
This reverts commit 74da2a467f166e00316aee82ba24835ca563ed87.

Reverted https://github.com/pytorch/pytorch/pull/129817 on behalf of https://github.com/clee2000 due to broke dynamo/test_inline_inbuilt_nn_modules.py https://github.com/pytorch/pytorch/actions/runs/9940532858/job/27461141919 74da2a467f.  Test passed on PR, possibly a landrace? ([comment](https://github.com/pytorch/pytorch/pull/129817#issuecomment-2228993570))
2024-07-15 17:09:52 +00:00
0099e15b47 Also put unbacked symbols in symbol_to_node in split_module pass (#130535)
This is not a complete fix but it is a simple one; the full fix is
tracked in https://github.com/pytorch/pytorch/issues/130534

Internal xref:
https://fb.workplace.com/groups/6829516587176185/posts/7510238679103969/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130535
Approved by: https://github.com/malfet
2024-07-15 16:56:01 +00:00
ca2d424c6e Tighten torch.library.infer_schema input types (#130705)
Made the following changes:
- mutates_args is now keyword-only and mandatory. This is to align with
  torch.library.custom_op (which makes it mandatory because it's easy to
  miss)
- op_name is now keyword-only. This helps the readability of the API
- updated all usages of infer_schema

This change is not BC-breaking because we introduced
torch.library.infer_schema a couple of days ago.
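
To illustrate the calling convention described above, here is a sketch using a hypothetical stand-in function. `infer_schema_sketch` only mirrors the keyword-only, mandatory `mutates_args` and keyword-only `op_name`; it is not `torch.library.infer_schema` and its return format is invented for the example.

```python
import inspect


def infer_schema_sketch(prototype_function, *, mutates_args, op_name=None):
    """Hypothetical stand-in that only mirrors the calling convention."""
    params = ", ".join(inspect.signature(prototype_function).parameters)
    prefix = op_name if op_name is not None else ""
    return f"{prefix}({params}) mutates={tuple(mutates_args)}"


def scale(x, factor):
    return x * factor


# Both options must be spelled out by keyword.
schema = infer_schema_sketch(scale, mutates_args=(), op_name="mylib::scale")


def call_is_rejected(*args, **kwargs) -> bool:
    """Report whether the call fails with a TypeError."""
    try:
        infer_schema_sketch(*args, **kwargs)
    except TypeError:
        return True
    return False


print(schema)
print(call_is_rejected(scale, ()))  # mutates_args passed positionally
print(call_is_rejected(scale))      # mutates_args omitted
```

Making `mutates_args` mandatory means a caller cannot silently forget to declare mutation, which is the same reasoning given for torch.library.custom_op.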

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130705
Approved by: https://github.com/yushangdi
2024-07-15 16:43:57 +00:00
9df4bc6a0d Revert "Constant folding for dynamic shape node (#129686)"
This reverts commit b7d287fbec0a05a3d4c9524006e6bfd1de6a71a0.

Reverted https://github.com/pytorch/pytorch/pull/129686 on behalf of https://github.com/atalman due to Failing internally.  Test: https://github.com/pytorch/ao/blob/main/test/prototype/mx_formats/test_mx_linear.py ([comment](https://github.com/pytorch/pytorch/pull/129686#issuecomment-2228755295))
2024-07-15 15:19:24 +00:00
7cd48df2da Refine the logic of device construction when only device index is given (#129119)
# Motivation
Before this PR, constructing a device from only a device index produced the `cuda` type, or the `PrivateUse1` type if a `PrivateUse1` backend was registered.
```bash
>>> import torch
>>> device = torch.device(0)
>>> device.type
'cuda'
>>> a = torch.tensor([1, 2])
>>> b = a.to(0)
>>> b
tensor([1, 2], device='cuda:0')
```
This works well on a CUDA GPU, but on XPU it reports an unexpected device type and raises an error.
```bash
>>> import torch
>>> device = torch.device(0)
>>> device.type
'cuda'
>>> a = torch.tensor([1, 2])
>>> b = a.to(0)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/xxx/pytorch/torch/cuda/__init__.py", line 302, in _lazy_init
    raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
```
This PR refines the logic to use the currently available device type instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129119
Approved by: https://github.com/albanD, https://github.com/gujinghui, https://github.com/EikanWang
ghstack dependencies: #129463, #129205, #129363
2024-07-15 14:34:29 +00:00
9cae2160f5 Introduce the concept of Accelerators to PyTorch doc (#129363)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129363
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #129463, #129205
2024-07-15 14:24:46 +00:00
74da2a467f Fix names conflict when lifting (#129817)
## Bug description
When the pending args that may be lifted [here](58f346c874/torch/_dynamo/output_graph.py (L1866)) share the same base name, like `contiguous` and `contiguous_1`, the call into [create_graph_input](58f346c874/torch/_dynamo/output_graph.py (L2081)) can end up creating a name ([here](58f346c874/torch/fx/graph.py (L1008))) that overwrites one of the args to lift, which causes the graph to produce wrong output.

## Reproducing
Below is a reproducible example:
```python
import logging
from typing import List

import torch
from functorch.compile import aot_module_simplified, make_boxed_func

@torch.library.custom_op("mylib::somefunc_forward", mutates_args=())
def somefunc_forward(
    input_: torch.Tensor,
    weight: torch.Tensor,
    shape: List[int],
) -> torch.Tensor:
    return torch.ones_like(input_)

@somefunc_forward.register_fake
def _(input_, weight, shape):
    return torch.empty_like(input_)

@torch.library.custom_op("mylib::somefunc_backward", mutates_args=())
def somefunc_backward(
    grad_output: torch.Tensor,
    input_: torch.Tensor,
    weight: torch.Tensor,
    shape: List[int],
) -> torch.Tensor:
    print(f"backward.{grad_output.shape=}")
    print(f"backward.{input_.shape=}")
    print(f"backward.{weight.shape=}")
    print(f"backward.{shape=}")
    assert list(weight.shape) == shape
    return torch.ones_like(weight)

@somefunc_backward.register_fake
def _(grad_output, input_, weight, shape):
    return torch.empty_like(weight)

def a_func(grad_output, input_, weight_, shape):
    return torch.ones_like(input_.sum() * weight_)

class SomeFunc(torch.autograd.Function):
    @staticmethod
    def forward(ctx, input, weight, normalized_shape):
        ctx.normalized_shape = normalized_shape
        input_ = input.contiguous()
        weight_ = weight.contiguous()
        output = somefunc_forward(input_, weight_, ctx.normalized_shape)
        ctx.save_for_backward(input_, weight_)
        return output

    @staticmethod
    def backward(ctx, grad_output):
        input_, weight_ = ctx.saved_tensors
        # grad_weight = a_func(grad_output, input_, weight_, ctx.normalized_shape)
        grad_weight = somefunc_backward(
            grad_output.contiguous(),
            input_,
            weight_,
            ctx.normalized_shape,
        )
        return None, grad_weight, None

class MyModel(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = torch.nn.Parameter(torch.ones(7))

    def forward(self, x):
        return SomeFunc.apply(x, self.weight, [7])

model = MyModel()
torch._logging.set_logs(dynamo=logging.DEBUG, aot=logging.DEBUG, graph_code=True)

def aot_print_backend(gm, sample_inputs):
    # Forward compiler capture
    def fw(gm, sample_inputs):
        print(f"----- fw")
        gm.print_readable()
        return make_boxed_func(gm.forward)

    # Backward compiler capture
    def bw(gm, sample_inputs):
        print(f"----- bw")
        gm.print_readable()
        return make_boxed_func(gm.forward)

    # Call AOTAutograd
    gm_forward = aot_module_simplified(
        gm, sample_inputs, fw_compiler=fw, bw_compiler=bw
    )
    return gm_forward

model = torch.compile(
    model,
    backend=aot_print_backend,
    dynamic=False,
)
out = model(torch.rand((128, 4, 7)))
out.mean().backward()
```

The log shows the calls into create_graph_input:
```log
V0629 02:08:46.839914 8200981504 torch/_dynamo/output_graph.py:2042] [0/0] create_graph_input contiguous (none)
V0629 02:08:46.839998 8200981504 torch/_dynamo/output_graph.py:2042] [0/0] create_graph_input contiguous_1 (none)
```

The generated backward graph looks like this:
```log
class GraphModule(torch.nn.Module):
    def forward(self, function_ctx, somefunc_forward_default: "f32[128, 4, 7]", contiguous: "f32[128, 4, 7]", contiguous_1: "f32[7]"):
        contiguous_1 = contiguous
        contiguous_2 = contiguous_1

        # No stacktrace found for following nodes
        _set_grad_enabled = torch._C._set_grad_enabled(False)

         # File: /Users/bytedance/testtorch/test_custom_op_bug.py:61 in backward, code: grad_output.contiguous(),
        contiguous: "f32[128, 4, 7]" = somefunc_forward_default.contiguous();  somefunc_forward_default = None

         # File: /opt/tiger/pytorch/torch/_library/custom_ops.py:506 in __call__, code: return self._opoverload(*args, **kwargs)
        somefunc_backward_default: "f32[7]" = torch.ops.mylib.somefunc_backward.default(contiguous, contiguous_1, contiguous_2, [7]);  contiguous = contiguous_1 = contiguous_2 = None

        # No stacktrace found for following nodes
        _set_grad_enabled_1 = torch._C._set_grad_enabled(True)
        return (None, somefunc_backward_default)
```

The original `somefunc_backward` takes `grad_output`, `input_`, `weight` and `shape` as inputs, where `weight` should have shape `torch.Size([7])`. However, in the graph, `contiguous_1` and `contiguous_2` are both assigned from `contiguous`, which triggers the assertion failure I added in `somefunc_backward`.

## Environment
```log
Collecting environment information...
PyTorch version: 2.5.0a0+git0b7e8df
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 14.5 (arm64)
GCC version: Could not collect
Clang version: 15.0.0 (clang-1500.3.9.4)
CMake version: version 3.26.4
Libc version: N/A

Python version: 3.9.19 (main, May  6 2024, 14:39:30)  [Clang 14.0.6 ] (64-bit runtime)
Python platform: macOS-14.5-arm64-arm-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Apple M3 Pro

Versions of relevant libraries:
[pip3] numpy==2.0.0
[pip3] optree==0.11.0
[pip3] torch==2.5.0a0+git0b7e8df
[pip3] torchgraph==0.0.1
[conda] numpy                     2.0.0                    pypi_0    pypi
[conda] optree                    0.11.0                   pypi_0    pypi
[conda] torch                     2.5.0a0+git0b7e8df           dev_0    <develop>
[conda] torchgraph                0.0.1                     dev_0    <develop>
```

## How to fix?

I put up a naive fix that adds the potential args to lift into the used_names. This accesses private variables; I will fix that if this issue makes sense to you.

@zou3519 @oulgen

Co-authored-by: rzou <zou3519@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129817
Approved by: https://github.com/zou3519
2024-07-15 13:41:46 +00:00
ee039c0614 [custom_op] triton_op API V0 (#130637)
This is the initial version of an API to create custom operators whose
implementations are backed by triton kernels. While user-defined triton
kernels work out-of-the-box with torch.compile, you may wish to
construct a custom operator if you need to compose with other PyTorch
subsystems, like Tensor subclasses or vmap.

I'm hoping to get design feedback on this and ship it so that we can
begin experimenting with customers.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130637
Approved by: https://github.com/albanD
2024-07-15 13:00:54 +00:00
cyy
6beec34b1c [structural binding][9/N] Replace std::tie with structural binding (#130404)
Follows  #130544

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130404
Approved by: https://github.com/janeyx99
2024-07-15 10:14:52 +00:00
ac28ae18dc [BE][Ez]: Update pybind11 submodule to v2.13.1 (#129827)
Updates pybind11 submodule to v2.13.1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129827
Approved by: https://github.com/XuehaiPan, https://github.com/atalman, https://github.com/albanD
2024-07-15 08:58:56 +00:00
1d983bbb28 [easy][inline-inbuilt-nn-module] Update test output (#130681)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130681
Approved by: https://github.com/zou3519, https://github.com/jansel
ghstack dependencies: #130654, #130420
2024-07-15 06:19:53 +00:00
1a266def4f [dynamo][unsoundness but very controlled] Skip guards on inbuilt nn module hooks (#130420)
Reduces the guard overhead from 2.1k units to 1k units. Compared to no-inlining (0.4k units), this reduces the slowdown from 5x to 2.5x.

This introduces unsoundness, but only for hooks for inbuilt nn modules (user defined nn module hooks are fine).

Each builtin nn module adds 4 empty ordered dict checks in the check_fn. This blows up for models with large numbers of builtin nn modules. With this PR, we skip those guards. There is no other easy way I can think of right now to control the guard overhead.
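
The figures quoted above can be cross-checked with a few lines of arithmetic. The variable and function names are illustrative only; the numbers are the ones stated in this description.

```python
# Guard overhead figures quoted above, in thousands of units.
no_inlining = 0.4
inlining_before = 2.1
inlining_after = 1.0

# Slowdown relative to no-inlining, before and after this PR.
slowdown_before = inlining_before / no_inlining
slowdown_after = inlining_after / no_inlining


def skipped_hook_guards(num_builtin_modules: int, checks_per_module: int = 4) -> int:
    """Number of empty-ordered-dict checks no longer added to check_fn."""
    return num_builtin_modules * checks_per_module


print(round(slowdown_before, 2), round(slowdown_after, 2))
print(skipped_hook_guards(100))
```

The ratios come out at roughly 5x and 2.5x, matching the description, and the number of skipped checks grows linearly with the number of builtin nn modules.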

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130420
Approved by: https://github.com/jansel
ghstack dependencies: #130654
2024-07-15 06:19:53 +00:00
dc7725cc16 [halide-backend] Random number generation (#130211)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130211
Approved by: https://github.com/jansel
2024-07-15 05:03:24 +00:00
1bc390c5f5 Invalidate StorageImpl instances when tensor is overwritten with cudagraphs (#125264)
Fixes #104435

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125264
Approved by: https://github.com/ezyang
2024-07-15 04:16:17 +00:00
a3c0bab502 [inductor] [cpp] use non-temporal tile load for A (#129455)
Use non-temporal tile load `_tile_stream_loadd` for A to keep B in L1.
Verified AMP with static shapes and dynamic shapes on CPU with AMX support and saw no obvious performance boost (and no regression either) at the end-to-end level. We expect a performance gain once https://github.com/pytorch/pytorch/pull/129348 (also in this ghstack) is added on top of this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129455
Approved by: https://github.com/jgong5
2024-07-15 04:07:29 +00:00
c547b2e871 Fix python detection in cuda.cmake (#130651)
If the Python package has not been detected previously, run the detection here

This fixes a regression introduced by https://github.com/pytorch/pytorch/pull/128801 that results in the annoying but harmless warning reported in https://github.com/pytorch/pytorch/issues/129777

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130651
Approved by: https://github.com/Skylion007
2024-07-15 03:45:31 +00:00
c0897919da Revert " [5/N] Change static functions in headers to inline (#130673)"
This reverts commit 4410c44ae6fd8eb36f2358ac76f7d988ca7537c5.

Reverted https://github.com/pytorch/pytorch/pull/130673 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it causes CUDA build 12.1/12.4 to timeout in trunk, I am not sure what I am looking at yet, so attempt to revert to see if it fixes trunk.  Plz keep in mind that a cancelled job is counted as a failure ([comment](https://github.com/pytorch/pytorch/pull/130673#issuecomment-2227641368))
2024-07-15 03:27:11 +00:00
cyy
28f6ae2718 [9/N] Replace c10::optional with std::optional (#130674)
Follows  #130509

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130674
Approved by: https://github.com/Skylion007
2024-07-15 00:48:43 +00:00
774ca93fd2 Added zb1p schedule (#130210)
Adds the ZB1P schedule in https://arxiv.org/pdf/2401.10241.

The ZB2P schedule might not be zero bubble when pp_group_size > 4. Proof:

![image](https://github.com/pytorch/pytorch/assets/13212964/fac4a738-c323-47c7-bcaa-c6cdd1cf20d7)

Since ZB2P generates longer schedules in some cases, and we might need a collective for the fault-tolerance all-reduce at the end of every iteration for llama 4, we are holding off on implementing a fancier ZBV schedule for now unless it would be useful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130210
Approved by: https://github.com/H-Huang
2024-07-14 17:32:59 +00:00
cyy
5fe9515d35 [structural binding][8/N] Replace std::tie with structural binding (#130544)
Follows #130216
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130544
Approved by: https://github.com/ezyang
2024-07-14 13:23:20 +00:00
81322aee74 [Inductor][CPP] Support more than one LocalBuffer (#129121)
**Summary**
Support more than one Local Buffer in an outer loop fused node, and also the case where multiple global buffers share the same local buffer.

**TestPlan**
```
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_two_local_buffers_in_outer_loop_fusion
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_share_local_buffers_in_outer_loop_fusion
```

**Next Step**

- [✓] Support more than one Local Buffer/Global Buffer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129121
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #126967
2024-07-14 11:31:14 +00:00
adaa0fea5a [Inductor][CPP] Enable Local Buffer for Outer loop fusion (#126967)
**Summary**
Currently, the Inductor CPP backend [generated code](https://gist.github.com/leslie-fang-intel/98f91d43dabed581a1ffe23daf133a65#file-bf16-softmax-generated-code-wo-local-buffer-py) for `Softmax` with BF16 data type is significantly slower than the [ATen Implementation](9a2beb862d/aten/src/ATen/native/cpu/SoftMaxKernel.cpp (L149)). Upon comparing the generated code with ATen, the performance bottleneck appears to be related to the usage of [local buffer in ATen](9a2beb862d/aten/src/ATen/native/cpu/SoftMaxKernel.cpp (L159-L160)).

In the current implementation, Inductor uses the output buffer of the Kernel Group Args to store and load temporary results (such as `exp`), since this buffer corresponds to a `SchedulerNode`. Each thread accesses a portion of this output buffer via indexing. However, since this buffer (taking `exp` as an example) is only used internally within the decomposed `softmax`, it can be replaced with a thread-local buffer, similar to ATen's approach.

In this PR, we have introduced the optimizations of `LocalBuffer`. Following this enhancement, the [new generated Inductor code with local buffer](https://gist.github.com/leslie-fang-intel/98f91d43dabed581a1ffe23daf133a65#file-bf16-softmax-generated-code-w-local-buffer-py) for BF16 `Softmax` demonstrates significantly improved performance. Running the benchmark [here](https://gist.github.com/leslie-fang-intel/37d81441237b5139c8295f5e6c4cd31a) to test this BF16 `Softmax` case on an 8480 Xeon server shows similar performance between the Inductor CPP Backend and the ATen implementation.
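
The local-buffer idea can be sketched in plain Python. This is only a conceptual model of the data flow, not the C++ code Inductor generates; the function name is hypothetical and the rows stand in for the slices each thread would process.

```python
import math
from typing import List


def softmax_rows_local_buffer(rows: List[List[float]]) -> List[List[float]]:
    """Row-wise softmax that keeps the exp intermediate in a local buffer."""
    out = []
    for row in rows:  # in generated code, this outer loop is split across threads
        row_max = max(row)
        # The exp results are only needed inside this iteration, so they
        # live in a small local scratch buffer instead of a global one.
        local_exp = [math.exp(x - row_max) for x in row]
        total = sum(local_exp)
        out.append([e / total for e in local_exp])
    return out


result = softmax_rows_local_buffer([[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]])
print(result)
```

Because `local_exp` never escapes the loop body, no full-size intermediate buffer has to be written and read back, which is the effect the `LocalBuffer` optimization aims for.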

**TestPlan**
```
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_local_buffer_in_outer_loop_fusion
```

**Next Step**

- [ ] Support more than one Local Buffer/Global Buffer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126967
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-07-14 11:28:10 +00:00
dcaa111dc8 support intersection by polyfill (#130672)
Fixes https://github.com/pytorch/pytorch/issues/130557

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130672
Approved by: https://github.com/anijain2305
2024-07-14 10:44:26 +00:00
4d7bf72d93 [BE][Easy] fix ruff rule needless-bool (SIM103) (#130206)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130206
Approved by: https://github.com/malfet
2024-07-14 08:17:52 +00:00
fa5f572748 [cudagraph] fallback to eager if re-record too many times (#129349)
Summary:
CUDAGraph Trees previously relied on the assumption that static inputs (parameters and buffers) do not change tensor addresses across multiple function invocations. This assumption can be used to reduce the number of tensor copies and improve performance. We also use `check_static_inputs_are_stable()` to check whether this assumption holds at runtime.

While this assumption is true in most cases, we recently observed a few cases where it is not valid:
- [Inline inbuilt nn modules](https://github.com/pytorch/pytorch/pull/126822): the same function (a nn module) is used in multiple places and different parameters and buffers are passed to this function with different tensor addresses
- Some user code changes tensor addresses of parameters/buffers. See [internal example]( https://www.internalfb.com/mlhub/pipelines/runs/mast/sw-935450288-OfflineTraining_08ba1cf0?job_attempt=1&version=0&env=PRODUCTION)
- Compiled Autograd may also pass parameters/buffers with different tensor addresses across runs.

Previous PR [#126822](https://github.com/pytorch/pytorch/pull/126822) (by @mlazos) allows detecting static tensor address changes at runtime and re-recording a cudagraph when that happens. However, if the same function is re-recorded too many times, it may introduce large overhead and hurt performance. This PR adds `torch._inductor.config.triton.cudagraph_max_recording` (=5) to fall back to eager if a function has been recorded more than `cudagraph_max_recording` times for a specific node in the CUDAGraph Trees.

A summary on how static tensor address changes are handled now:

- For each child node, check the assumption via `check_invariants`. If it holds, execute the node under the assumption.
- If the assumption does not hold for any child node, re-record, provided the function_id has not been recorded too many times for the current_node.
- If the function_id has been re-recorded too many times, fall back to the eager function and emit a warning.
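
The three steps above can be sketched in plain Python. This is a toy model only: `ToyNode`, `dispatch`, and the module-level `CUDAGRAPH_MAX_RECORDING` constant are hypothetical names standing in for the real CUDAGraph Trees internals, and the exact boundary of the "more than" comparison is an assumption taken from the wording of this message.

```python
import warnings
from collections import defaultdict

# Hypothetical stand-in for torch._inductor.config.triton.cudagraph_max_recording.
CUDAGRAPH_MAX_RECORDING = 5


class ToyNode:
    """Illustrative node; not the real CUDAGraph Trees node class."""

    def __init__(self, static_addresses):
        self.static_addresses = tuple(static_addresses)
        self.children = []
        # function_id -> number of recordings made under this node
        self.num_recordings = defaultdict(int)

    def check_invariants(self, addresses):
        # The assumption holds when the static inputs kept their addresses.
        return tuple(addresses) == self.static_addresses


def dispatch(current_node, function_id, addresses):
    """Return the path a call takes: 'replay', 'record', or 'eager'."""
    # Step 1: reuse a child whose recorded addresses still match.
    for child in current_node.children:
        if child.check_invariants(addresses):
            return "replay"
    # Step 3: too many recordings for this (node, function) pair.
    if current_node.num_recordings[function_id] > CUDAGRAPH_MAX_RECORDING:
        warnings.warn("re-recorded too many times; falling back to eager")
        return "eager"
    # Step 2: record a new child for the new addresses.
    current_node.num_recordings[function_id] += 1
    current_node.children.append(ToyNode(addresses))
    return "record"
```

Calling `dispatch` repeatedly with a fresh address list each time returns `"record"` until the counter exceeds the limit, after which every call with unseen addresses returns `"eager"`.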

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129349
Approved by: https://github.com/eellison
2024-07-14 04:17:24 +00:00
cyy
4410c44ae6 [5/N] Change static functions in headers to inline (#130673)
Follows #128286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130673
Approved by: https://github.com/ezyang
2024-07-14 03:15:28 +00:00
6f275ae4d0 Add kwinputs to Kineto Traces (#130373)
Summary: On the autograd side of things, we are currently saving the kwinputs but we aren't doing anything with them on the profiler side. This diff enables the use of the kwinputs for both FunctionEvents and Chrome Traces.

Test Plan: Added unit testing for both chrome traces and FunctionEvents. Used RecordFunctionFast to test kwinputs since test already had kwargs being passed in but not tested.

Differential Revision: D59472345

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130373
Approved by: https://github.com/davidberard98
2024-07-14 00:40:59 +00:00
f9f85bfc0b [Inductor] FlexAttention supports partial masking (#130415) (#130626)
This is the new version of https://github.com/pytorch/pytorch/pull/130415

Updated test script: https://gist.github.com/yanboliang/7c34a82df611d4ea8869cb9e041bfbfc
Updated perf numbers:
```
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py
fwd speedup: 0.7166695598192317
bwd speedup: 0.7142133867805904
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py --partial-mask
fwd speedup: 0.8428246087169973
bwd speedup: 0.8486261278030254
```
Approved by: https://github.com/Chillee

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130626
Approved by: https://github.com/drisspg, https://github.com/yanboliang
2024-07-14 00:37:26 +00:00
cbb7e26acd [3.13, dynamo] fix jump target offset calculation (#130458)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130458
Approved by: https://github.com/jansel
ghstack dependencies: #130383, #130384, #130385
2024-07-13 23:32:06 +00:00
0b5792c0ae [3.13, dynamo] fix NULL ordering in symbolic_convert CALL (#130385)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130385
Approved by: https://github.com/jansel
ghstack dependencies: #130383, #130384
2024-07-13 23:32:05 +00:00
87b406d7e5 [3.13, dynamo] codegen TO_BOOL before conditional jump (#130384)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130384
Approved by: https://github.com/jansel
ghstack dependencies: #130383
2024-07-13 23:32:02 +00:00
92ac9ee83c [3.13, dynamo] swap null and pop_null in codegen (#130383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130383
Approved by: https://github.com/jansel
2024-07-13 23:31:57 +00:00
97cfc65dbc Back out "[DeviceMesh] Only include the real thread_id in DeviceMesh hash under threaded backend (#130495)" (#130676)
Summary:
Original commit changeset: 80c2ca639146

Original Phabricator Diff: D59612200

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//apf/distributed/tests:pipeline_parallel_test_cpu -- --exact 'apf/distributed/tests:pipeline_parallel_test_cpu - apf.distributed.tests.pipeline_parallel_test_cpu.PipelineParallelContextTestCPU: test_stage_pg_creation_with_different_backends'

Differential Revision: D59719562

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130676
Approved by: https://github.com/xunnanxu
2024-07-13 23:19:22 +00:00
e5de25896f Fixed CUDA randint generation for large ranges. (#126066)
Fixes #125224

For large ranges, calls to CUDA `randint` use a different `unroll_factor` to generate random ints. This `unroll_factor` was not considered correctly in the calculation of the Philox offsets. Thus, some of the random states were reused, resulting in lower entropy (see #125224).

This also affects multiple other random functions, such as `torch.rand` and `torch.randn`.
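
Why a wrong `unroll_factor` leads to reused random states can be illustrated with a toy model of a counter-based generator. This is not the actual CUDA kernel arithmetic: `counter_ranges`, `overlapping`, and their parameters are hypothetical, and the model only assumes that each call consumes a contiguous range of counter values and that the next call starts at the advanced offset.

```python
def counter_ranges(num_calls, steps, assumed_unroll, actual_unroll):
    """Return the [start, stop) counter range consumed by each call.

    assumed_unroll is what the offset bookkeeping uses to advance;
    actual_unroll is what the kernel really consumes per step.
    """
    ranges = []
    offset = 0
    for _ in range(num_calls):
        consumed = steps * actual_unroll
        ranges.append((offset, offset + consumed))
        offset += steps * assumed_unroll
    return ranges


def overlapping(ranges):
    """True if any call starts before the previous call's range ended."""
    return any(
        prev_stop > start
        for (_, prev_stop), (start, _) in zip(ranges, ranges[1:])
    )
```

When `assumed_unroll` is smaller than `actual_unroll`, consecutive ranges overlap, which is the toy analogue of consecutive calls drawing from the same random states.
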
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126066
Approved by: https://github.com/eqy, https://github.com/lezcano
2024-07-13 21:42:27 +00:00
1f162a5fce Revert "[Inductor][CPP] Support vectorization of remainder (#129849)"
This reverts commit 5bc18ec0a181fac0994522fefaf664f917d64b86.

Reverted https://github.com/pytorch/pytorch/pull/129849 on behalf of https://github.com/izaitsevfb due to fails the compilation of executorch benchmark internally ([comment](https://github.com/pytorch/pytorch/pull/129849#issuecomment-2227054413))
2024-07-13 19:28:34 +00:00
8714b7fc69 [dynamo][cpp-guards] Use dict tags to skip guards on immutable dict getitems (#130654)
Reduces the guard overhead from 3.7k units to 2.1k units.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130654
Approved by: https://github.com/jansel
2024-07-13 15:31:10 +00:00
cyy
7c83f5f7d5 [8/N] Replace c10::optional with std::optional (#130509)
Follows #130510

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130509
Approved by: https://github.com/ezyang
2024-07-13 13:05:36 +00:00
0effcb70ef Revert "[ONNX] Remove beartype usage (#130484)"
This reverts commit f44739cf42e22a569bd1bdb0c113f8a069c17a41.

Reverted https://github.com/pytorch/pytorch/pull/130484 on behalf of https://github.com/huydhn due to Sorry for reverting your change but those failures show up in trunk after the commit landed f44739cf42, I am reverting it to see if it fix trunk ([comment](https://github.com/pytorch/pytorch/pull/130484#issuecomment-2226812311))
2024-07-13 07:52:59 +00:00
567482973d typing fake_tensor.py (#128041)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128041
Approved by: https://github.com/eellison
ghstack dependencies: #129182
2024-07-13 06:07:40 +00:00
1ad0f38a37 Fix IMAs in FlexAttention + autotuning (#130352)
# Summary

Makes error message better for non divisible sequence lengths.

Update: this PR was blocked by two IMAs (illegal memory accesses).
- The first is that when the kv indices end up being an 'arange', i.e. there are non-sparse blocks, we end up loading off of kv_indices + 1.
- For the second, I don't really have a clear answer. We were hitting an IMA here:
9f401187c7/torch/_inductor/kernel/flex_attention.py (L846)
I noticed that for our inputs (2048 with q_blocksize = 128) we were again exactly at 16. Something felt fishy. I suspect we launch one extra sparse_q block, but why only during autotuning...

### Repro:
https://gist.github.com/drisspg/f312a66426f3440b7756c6c0cc037f4c
### After this change:
```
========= COMPUTE-SANITIZER
AUTOTUNE flex_attention(1x1x2048x64, 1x1x2048x64, 1x1x2048x64, 1x1x2048, 1x1x16, 1x1x16x16)
  triton_flex_attention_0 2.1118 ms 100.0% BLOCK_DMODEL=64, BLOCK_M=128, BLOCK_N=128, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4
  triton_flex_attention_3 2.4306 ms 86.9% BLOCK_DMODEL=64, BLOCK_M=64, BLOCK_N=128, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4
  triton_flex_attention_1 2.5729 ms 82.1% BLOCK_DMODEL=64, BLOCK_M=128, BLOCK_N=64, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4
  triton_flex_attention_4 2.8035 ms 75.3% BLOCK_DMODEL=64, BLOCK_M=64, BLOCK_N=64, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4
  triton_flex_attention_2 2.8837 ms 73.2% BLOCK_DMODEL=64, BLOCK_M=128, BLOCK_N=128, OUTPUT_LOGSUMEXP=True, PRESCALE_QK=False, ROWS_GUARANTEED_SAFE=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4
SingleProcess AUTOTUNE benchmarking takes 0.7225 seconds and 1.5218 seconds precompiling
AUTOTUNE flex_attention_backward(1x1x2048x64, 1x1x2048x64, 1x1x2048x64, 1x1x2048, 1x1x2048, 1x1x2048x64, 1x1x2048x64, 1x1x2048x64, 1x1x16, 1x1x16x16, 1x1x16, 1x1x16x16)
  triton_flex_attention_backward_30 2.7763 ms 100.0% BLOCK_DMODEL=64, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=4
  triton_flex_attention_backward_15 3.1404 ms 88.4% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4
  triton_flex_attention_backward_14 3.2604 ms 85.2% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=4
  triton_flex_attention_backward_7 3.4176 ms 81.2% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=3, num_warps=4
  triton_flex_attention_backward_8 3.4182 ms 81.2% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=4, num_warps=4
  triton_flex_attention_backward_34 3.4939 ms 79.5% BLOCK_DMODEL=64, BLOCK_M1=64, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=64, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=8
  triton_flex_attention_backward_6 3.6517 ms 76.0% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=32, BLOCK_N1=32, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=4
  triton_flex_attention_backward_26 3.7000 ms 75.0% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=8
  triton_flex_attention_backward_22 4.0120 ms 69.2% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=128, BLOCK_N1=128, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=4
  triton_flex_attention_backward_18 4.5052 ms 61.6% BLOCK_DMODEL=64, BLOCK_M1=32, BLOCK_M2=64, BLOCK_N1=64, BLOCK_N2=32, PRESCALE_QK=False, SM_SCALE=0.125, SPARSE_KV_BLOCK_SIZE=128, SPARSE_Q_BLOCK_SIZE=128, num_stages=1, num_warps=8
SingleProcess AUTOTUNE benchmarking takes 6.6558 seconds and 6.3567 seconds precompiling
torch.Size([1, 1, 2048, 64])
Test completed successfully!
========= ERROR SUMMARY: 0 errors
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130352
Approved by: https://github.com/Skylion007, https://github.com/Chillee
2024-07-13 05:27:39 +00:00
c03e667276 [Inductor][PatternMatcher] Always prevent match across mutations (#130584)
Preventing match across mutations should always be the safe thing to do. This will be especially important for Traceable FSDP2 because in that case we do have mutation ops (`.set_` and `.resize_(0)`) in the middle of the graph for both joint-graph and post-grad graph, so making sure the pattern matcher passes work well with middle-of-graph mutation ops is important.

Q: Why can't we move these mutation ops to the end of graph, to make pass writing easier?
A: We attempted to do that in https://github.com/pytorch/pytorch/pull/129852, but the custom FX passes (in `torch/_functorch/_aot_autograd/fx_passes.py`) for the re-functionalization is complicated to maintain, and the changes to partitioner (in `torch/_functorch/partitioners.py`) also feels hacky. Hence we want to preserve these mutation ops in the middle of graph to avoid the complexity.

Test commands:
- `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_uint4x2_mixed_mm`
- `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_serialized_patterns_up_to_date`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130584
Approved by: https://github.com/jansel
2024-07-13 03:39:21 +00:00
3710a79622 Flex Attention HOP: Add support for flex decoding (#129415)
# Flex Decoding
tl;dr This PR adds `flex_decoding` kernel to higher-order-op: `flex_attention` as the backend for multi-head attention decoding.

Higher-order-op `flex_attention` was introduced in https://github.com/pytorch/pytorch/pull/121845 to accept a user-defined score modification callable (`score_mod`) and, through `torch.compile`, to create an efficient fused flash attention kernel instantiation. The `flex_attention` kernel is efficient for long-query (>512 tokens) attention. This PR introduces the `flex_decoding` kernel as an alternative backend for the `flex_attention` HOP to handle LLM inference where short queries (<32 tokens) attend to long key/value sequences.

### Details

LLM decoding iteratively attends each newly generated token (query length = 1) to a long key/value context (up to 132k). The `flex_attention` kernel only parallelizes attention along the query length (M), batch size (B) and number of heads (H) dimensions. LLM decoding lacks enough parallelism in the M dimension to fill up all SMs on modern GPUs.

`flex_decoding` adds parallelization along the key/value sequence length (N). The key/value cache of a single head is split into multiple blocks and the query tokens attend to them in parallel. The results for the same head are then reduced across KV blocks to generate a global output.
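
The split-and-reduce idea can be sketched in pure Python for a single query vector. This is a minimal numerical illustration, not the Triton kernel: `attend` and `split_kv_attend` are hypothetical helper names, the softmax scale factor and `score_mod` are omitted, and the reduction through per-block log-sum-exp values is the standard way to merge partial softmax results, assumed here rather than taken from the PR.

```python
import math


def attend(q, keys, values):
    """Attention for one query vector over one KV block.

    Returns (out, lse): the softmax-weighted sum of the values and the
    log-sum-exp of the scores in this block.
    """
    scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    out = [
        sum(w * v[d] for w, v in zip(weights, values)) / total
        for d in range(dim)
    ]
    return out, m + math.log(total)


def split_kv_attend(q, keys, values, num_splits):
    """Attend to KV blocks independently, then reduce across blocks."""
    block = math.ceil(len(keys) / num_splits)
    partials = [
        attend(q, keys[i:i + block], values[i:i + block])
        for i in range(0, len(keys), block)
    ]
    # Weight each partial output by its share of the global softmax mass.
    global_m = max(lse for _, lse in partials)
    scales = [math.exp(lse - global_m) for _, lse in partials]
    denom = sum(scales)
    dim = len(partials[0][0])
    return [
        sum(s * out[d] for s, (out, _) in zip(scales, partials)) / denom
        for d in range(dim)
    ]
```

Up to floating-point rounding, `split_kv_attend(q, keys, values, n)` matches `attend(q, keys, values)[0]` for any number of splits, which is what allows the per-block work to run in parallel.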

## Examples

Consider a Group Query Attention (GQA) decoding case, where a query token of 16 query heads (Hq) attends to 2 kv heads (Hkv). Assume a batch size of 2 (B=2) and a kv cache length of 4096 (N=4096). The attention kernel iteratively attends to the newly generated query token (Mq = 1).

We transform this problem into a Multiheaded Attention (MHA) problem by assuming a query length equal to the number of query heads per kv head, i.e. M=Hq//Hkv.
The inputs to the `flex_attention` HOP are thus a query of shape (B=2, H=Hkv=2, M=Hq//Hkv=8, D=64) and a key and value of shape (B=2, H=Hkv=2, N=4096, D=64), which lead to an intermediate attention score matrix of shape (2, 2, 8, 4096) and an output of shape (2, 2, 8, 64).

```Python
import torch
from torch.nn.attention._flex_attention import _flex_attention as flex_attention

torch.manual_seed(0)

# Let's create some input tensors
# query of shape (B, Hkv, Hq//Hkv, D)
# key/value of shape (B, Hkv, N, D)
query = torch.randn(2, 2, 8, 64, device="cuda", dtype=torch.float32)
key = torch.randn(2, 2, 4096, 64, device="cuda", dtype=torch.float32)
value = torch.randn(2, 2, 4096, 64, device="cuda", dtype=torch.float32)

# Let's create a new score_modification checkerboard.
def checkerboard(score, batch, head, token_q, token_kv):
    score = torch.where(torch.abs(token_kv - token_q) == 1, score * 0.5, score)
    score = torch.where(torch.abs(token_kv - token_q) == 2, score * 2.0, score)
    return score

# Let's call flex_attention with this new score modification for decoding.
# The flex_attention HOP will choose flex_decoding as its backend since our query length (M) is only 8.
output = flex_attention(query, key, value, score_mod=checkerboard)

compiled_flex_attention = torch.compile(flex_attention)
out_compiled = compiled_flex_attention(query, key, value, score_mod=checkerboard)

torch.testing.assert_close(output, out_compiled, atol=2e-2, rtol=2e-2)
```

## Future Plans
- This PR does not implement a load mask for the score_mod function. This means that if the score_mod function takes a captured buffer along the M dimension, the buffer must be padded to a q length of 16, or to the next 2^n of the query length if q_len > 16.
i.e.
```python
import torch

Hq, Hkv = 16, 2  # head counts from the GQA example above
q_scale = torch.randn(Hq//Hkv, device="cuda")
q_scale = torch.nn.functional.pad(q_scale, (0, 16-Hq//Hkv)) # Pad captured buffer to q length of 16
def bias_mod(score, batch, head, token_q, token_kv):
    score = score + q_scale[token_q]
    return score
```
- The backward path for short queries (<128 tokens) currently does not work because the `flex_attention_backward` kernel lacks mask support and only takes query lengths that are a multiple of 128.
- Dynamic shapes and max_autotuning are currently not working.
- Add block sparse mask support (#129216 is a draft for flex_attention kernel)
- Add explicit GQA support. (#130076 is a draft for GQA support on flex_attention kernel)
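
The padding rule in the first item above can be written as a small helper. This is a sketch under one stated assumption: "next 2^n of query length" is read as the smallest power of two that is at least `q_len`. The name `padded_q_len` is hypothetical and does not appear in the PR.

```python
def padded_q_len(q_len, min_len=16):
    """Length a captured M-dimension buffer would need to be padded to."""
    if q_len <= min_len:
        return min_len
    # Smallest power of two >= q_len (assumed reading of "next 2^n").
    return 1 << (q_len - 1).bit_length()
```

For the GQA example above, `Hq//Hkv = 8` gives a padded length of 16, which is the `16-Hq//Hkv` padding amount used in the snippet.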

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129415
Approved by: https://github.com/Chillee
2024-07-13 00:41:48 +00:00
f44739cf42 [ONNX] Remove beartype usage (#130484)
beartype has served us well in identifying type errors and ensuring we call internal functions with the correct arguments (thanks!). However, the value of having beartype is diminished because of the following:

1. When beartype improved its support for Dict[] type checking, it discovered typing mistakes in some functions that were previously uncaught. This caused the exporter to fail with newer versions of beartype where it used to succeed. Since we cannot fix PyTorch and release a new version just because of this, it creates confusion for users who have beartype in their environment when using torch.onnx.
2. beartype adds an additional call line in the traceback, which makes the already thick dynamo stack even larger, affecting readability when users diagnose errors with the traceback.
3. Since the typing annotations need to be evaluated, we cannot use new syntax like `|` because we need to maintain compatibility with Python 3.8. We don't want to wait for PyTorch to take py310 as the lowest supported Python before using the new typing syntax.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130484
Approved by: https://github.com/titaiwangms
2024-07-13 00:08:25 +00:00
a7f54c7f8a [dynamo] add meta fn for aten.kthvalue.default (#130562)
I saw
```
torch._dynamo.exc.Unsupported: unsupported operator: aten.kthvalue.default
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130562
Approved by: https://github.com/jingsh, https://github.com/zou3519
2024-07-12 23:48:31 +00:00
634b62f111 typing proxy_tensor.py (#129182)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129182
Approved by: https://github.com/Chillee
2024-07-12 23:17:09 +00:00
ea78b0c177 Revert "Fix static py::object dangling pointer with py::gil_safe_call_once_and_store (#130341)"
This reverts commit a17d1e5322229a31f868d98987996a04736933a6.

Reverted https://github.com/pytorch/pytorch/pull/130341 on behalf of https://github.com/izaitsevfb due to internal needs pybind update ([comment](https://github.com/pytorch/pytorch/pull/130341#issuecomment-2226499397))
2024-07-12 23:07:37 +00:00
f422027fce fix torch.linalg.lstsq input check (#130612)
Fixes [#117236](https://github.com/pytorch/pytorch/issues/117236)
The current case does not meet the vector scenario requirements, and it lacks sufficient checks (relying solely on `dim_diff` is insufficient). Consequently, it triggers an internal assertion error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130612
Approved by: https://github.com/lezcano
2024-07-12 23:06:52 +00:00
06ebf87a1e Fix and improve reorder_compute_for_overlap (#130573)
Since the raise_comms and sink_waits passes are also scheduling-based, we can now implement reorder_compute_for_overlap as an optional step in the same pass. Merging them into the same pass greatly simplifies the logic and makes it easier to reason about the synergy between different passes.

- The unit tests are now fixed and re-enabled.
- Verified that the pass produces good scheduling w/ Llama3 70B in torchtitan (the scheduling was sub-optimal before this PR).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130573
Approved by: https://github.com/Chillee
ghstack dependencies: #129980
2024-07-12 22:25:49 +00:00
619029e892 [easy] Small rendering fix in Tensor.module_load doc (#130489)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130489
Approved by: https://github.com/janeyx99
2024-07-12 22:12:53 +00:00
95046c86e3 [HOP] add HOP x torch_dispatch interaction (#130606)
This involved beefing up the Python dispatcher to handle torch_dispatch.
Given a HOP and a torch_dispatch Tensor subclass:
- the HOP will show up in the subclass's `__torch_dispatch__`
- you can also use HOP.py_impl to register a rule for the HOP x
  subclass interaction
- (coming soon) we'll offer a way to open register HOP x subclass
  interaction without needing to touch the subclass's
  `__torch_dispatch__` or the HOP's .py_impl.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130606
Approved by: https://github.com/ydwu4
2024-07-12 21:51:36 +00:00
f093cd4086 Fix custom ops warning during export (#130623)
Fixes https://github.com/pytorch/pytorch/issues/130588

The problem was we were warning on all custom ops, not just ones marked
as CompositeImplicitAutograd. This PR changes the warning to just warn
on CompositeImplicitAutograd ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130623
Approved by: https://github.com/williamwen42
2024-07-12 21:34:29 +00:00
7c289c2a5c Add torch.serialization.safe_globals context manager (#127939)
Add context manager mentioned in https://github.com/pytorch/pytorch/pull/127808#pullrequestreview-2096298486

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127939
Approved by: https://github.com/albanD
2024-07-12 20:38:43 +00:00
f0d7164cb9 Revert "[inductor] switch AotCodeCompiler to new cpp_builder (#130127)"
This reverts commit 2abc7cc21b8a215f000ac037c316ca178e9ade81.

Reverted https://github.com/pytorch/pytorch/pull/130127 on behalf of https://github.com/izaitsevfb due to breaks meta-internal tests ([comment](https://github.com/pytorch/pytorch/pull/130127#issuecomment-2226313943))
2024-07-12 20:36:00 +00:00
103b6ccab2 Increase tolerance for tensorsolve tests (#130620)
Fix current failure in periodic trunk https://hud.pytorch.org/failure?name=periodic%20%2F%20linux-focal-cuda11.8-py3.10-gcc9-debug%20%2F%20test%20(default%2C%204%2C%205%2C%20linux.4xlarge.nvidia.gpu)&jobName=undefined&failureCaptures=%5B%22functorch%2Ftest_ops.py%3A%3ATestOperatorsCUDA%3A%3Atest_vjp_linalg_tensorsolve_cuda_float32%22%5D

Since it appeared with https://github.com/pytorch/pytorch/pull/128238 that only updates random seed for the test, I expect this is just bad luck of the draw. Thus increasing tolerance like we do for other tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130620
Approved by: https://github.com/lezcano, https://github.com/atalman, https://github.com/malfet
2024-07-12 20:08:18 +00:00
af4da0799c [PyTorch] Half: don't disable direct conversion to/from float on mobile (#130465)
As far as I can tell, `FCVT` (https://developer.arm.com/documentation/ddi0602/2024-06/SIMD-FP-Instructions/FCVT--Floating-point-convert-precision--scalar--?lang=en)
is part of the base aarch64 instruction set, so it should work fine on mobile.

Differential Revision: [D59589733](https://our.internmc.facebook.com/intern/diff/D59589733/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130465
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-07-12 19:46:30 +00:00
d727e2f2d1 add total wall time in calculate_time_spent (#130611)
Fixes #ISSUE_NUMBER

Actual wall time is fwd_entire_frame_time + bwd_inductor_compile. `calculate_time_spent` is accessed internally for monitoring use https://fburl.com/code/iiurj5m6. However, summing the values up loses the fwd/bwd information.

This PR adds a new key of `total_wall_time` without affecting dynamo counters.
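
A minimal sketch of the idea follows, assuming timings are kept per frame as a mapping from phase name to seconds. The function name `time_spent_with_total`, the phase names `entire_frame_compile` and `inductor_compile`, and the frame keys in the usage note are illustrative assumptions, not the actual dynamo data layout.

```python
def time_spent_with_total(frame_phase_timing):
    """Sum timings per phase and add a separate total_wall_time key.

    frame_phase_timing maps a frame key to {phase_name: seconds}.
    A forward frame is assumed to report "entire_frame_compile"; a
    backward frame is assumed to report only "inductor_compile".
    """
    total_by_key = {}
    total_wall_time = 0.0
    for timings in frame_phase_timing.values():
        for phase, seconds in timings.items():
            total_by_key[phase] = total_by_key.get(phase, 0.0) + seconds
        # Count each frame once: whole-frame time for forward frames,
        # inductor compile time for backward frames.
        total_wall_time += timings.get(
            "entire_frame_compile", timings.get("inductor_compile", 0.0)
        )
    total_by_key["total_wall_time"] = total_wall_time
    return total_by_key
```

With a forward frame `{"entire_frame_compile": 2.0, "inductor_compile": 1.5}` and a backward frame `{"inductor_compile": 0.5}`, the per-phase sums keep their original meaning while `total_wall_time` is 2.0 + 0.5.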

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130611
Approved by: https://github.com/oulgen, https://github.com/Yuzhen11
2024-07-12 19:32:44 +00:00
eqy
60fc01d0ab [CUDA] Don't double-destroy CUDA graph when debug dump is used (#130401)
Repro from @eellison

Could have sworn we had another PR with this fix floating around somewhere but I couldn't find it...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130401
Approved by: https://github.com/Skylion007, https://github.com/eellison
2024-07-12 18:57:07 +00:00
43b98fa521 Add debug repr to SymNode (#129925)
Fixes #129403

Create a separate printing function to debug SymNode, since we can't easily change `__repr__` that is used by GraphModule.recompile() to create a pythonic version of a graph

This is my first contribution, please let me know if there is anything that I should look into in further detail

Thank you for your guidance! 🙏 I hope to contribute more in the future!

@aorenste
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129925
Approved by: https://github.com/aorenste
2024-07-12 18:31:23 +00:00
2c4303c1d1 [ROCm] [BUGFIX] Re-enable rocm-specific tuning parameters (#130617)
Small bug fix - https://github.com/pytorch/pytorch/pull/124592 replaced the torch.version.hip with device_props but made a mistake in porting the original logic.

The original code was:
```
if torch.version.hip is not None:
```

Which was incorrectly replaced by:
```
if self.device_props.type != "hip":
```

Perhaps we need to write some unit tests here in the future.
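
A unit test along those lines could look like the sketch below. The fixed condition is not shown in this message, so `== "hip"` is an assumption about the intended port of `torch.version.hip is not None`; `use_rocm_tuning`, `use_rocm_tuning_buggy`, and the `SimpleNamespace` stand-in for `device_props` are hypothetical.

```python
from types import SimpleNamespace


def use_rocm_tuning(device_props):
    """Assumed intended port: enable ROCm tuning only on HIP devices."""
    return device_props.type == "hip"


def use_rocm_tuning_buggy(device_props):
    """The inverted check quoted above, kept for comparison."""
    return device_props.type != "hip"


def check_rocm_tuning_selection():
    hip = SimpleNamespace(type="hip")
    cuda = SimpleNamespace(type="cuda")
    assert use_rocm_tuning(hip) and not use_rocm_tuning(cuda)
    # The buggy version selects the ROCm path on exactly the wrong devices.
    assert not use_rocm_tuning_buggy(hip) and use_rocm_tuning_buggy(cuda)
    return True
```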

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130617
Approved by: https://github.com/masnesral
2024-07-12 18:29:59 +00:00
741c1710e8 [cond] inlining into one of the branches when pred is a python constant (#130493)
Reland https://github.com/pytorch/pytorch/pull/128709.

When the input predicate is a python constant, we specialize into one of the branches and warn users that torch.cond is not preserving the dynamism. The previous behavior is that we baked in True/False in the cond operator. This can be confusing. In this PR, we change it to be specializing into one of the branches when the inputs are constants.
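
The specialization described above can be sketched with a toy dispatcher. `toy_cond` is a hypothetical stand-in and not the real `torch.cond`; it only models the constant-predicate path and leaves the traced path out of scope.

```python
import warnings


def toy_cond(pred, true_fn, false_fn, operands):
    """Illustrative dispatcher for the constant-predicate case only."""
    if isinstance(pred, bool):
        # A Python constant cannot vary at runtime, so only one branch
        # is ever taken: inline it and tell the user.
        warnings.warn(
            "pred is a Python constant; specializing into one branch"
        )
        branch = true_fn if pred else false_fn
        return branch(*operands)
    # A data-dependent predicate would be traced into a cond operator,
    # which this sketch does not model.
    raise NotImplementedError("non-constant predicates are out of scope here")
```

With a constant predicate, the call behaves exactly like calling the chosen branch directly, which is why the result no longer preserves any dynamism.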

We additionally change the naming of the cond operator to the default one without overriding its name. This allows better testing on the de-serialized graph.

Test Plan:
The predicate in some existing tests is the result of a shape comparison. When no dynamic shape is involved, the predicate is a python bool. To fix them, we either change the predicate to be some data-dependent tensor or change the test to check that cond is specialized into one of the branches.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130493
Approved by: https://github.com/BoyuanFeng
2024-07-12 18:02:09 +00:00
0bf9a091ec [torchbind] add tracing_mode support (#129586)
Sometimes, it could be difficult to write a fake class e.g. when the original implementation is using some third-party libraries or users are certain that the class is safe to trace with the real object.

This PR allows users to specify their intention by implementing a "safe_to_trace_with_real_obj" method on their script class.

Test Plan:
`pytest test/export/test_torchbind.py -k safe`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129586
Approved by: https://github.com/zou3519
2024-07-12 18:01:47 +00:00
c3e77d144e [3.12, 3.13, dynamo] simplified construction for frame f_locals/localsplus (#129185)
Construct frame localsplus in 3.12+ using our own simplified way rather than copypasting from CPython.

This is necessary for 3.13 since we can no longer generate frame `f_locals` before executing the interpreter frame.
We also enable this for 3.12 since the `f_locals` construction between 3.12 and 3.13 is the same, so we can test for correctness with 3.12.

This is also one of the first steps to completing https://github.com/pytorch/pytorch/issues/93753 - we will implement simplified f_locals generation of previous Python versions in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129185
Approved by: https://github.com/jansel
2024-07-12 17:56:38 +00:00
b0a597fcb4 Fix #121334: graph break on constant method call (#130158)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130158
Approved by: https://github.com/lezcano
2024-07-12 17:34:46 +00:00
4865c6425c Add new control plane handler (#129712)
Summary:
Add a new control plane handler to retrieve flight recorder data as
JSON.

Test Plan:
Unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129712
Approved by: https://github.com/wconstab
2024-07-12 17:32:01 +00:00
55dc82bef9 [EZ] Make test_pytree_inputs actually run tests on CUDA (#130593)
Right now it's only running it on CPU even when `self.device` is set to CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130593
Approved by: https://github.com/angelayi
2024-07-12 17:17:28 +00:00
988ed4d5db [export] clean up allow_complex_guards_as_runtime_asserts flag (#130596)
Summary: removes underscore, cleans up dead code in DimConstraints

Test Plan: existing export tests

Reviewed By: angelayi

Differential Revision: D59612746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130596
Approved by: https://github.com/angelayi
2024-07-12 17:17:11 +00:00
dafef3ff35 [CP] Make CP loss curve on par with TP (#129515)
Summary:
This PR changes two implementations to make the CP (CP8) loss curve on par with TP (TP8).

1. Making key and value contiguous before doing ring attention. It is unclear why this is a requirement as SDPA does not have this requirement.

2. Use the out, grad_out, softmax_lse passed by autograd to do the backward. This implementation is similar to the implementation in transformer engine. The original implementation reruns the SDPA to get the output and logsumexp and uses the recalculated results to infer the corrected softmax_lse. But that implementation does not give better accuracy or a better loss curve. Instead, it converges more slowly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129515
Approved by: https://github.com/d4l3k, https://github.com/wanchaol
ghstack dependencies: #129512, #129514
2024-07-12 16:55:28 +00:00
c35f12c67c [EZ] Add formatting changes to .git-blame-ignore-revs (#130627)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130627
Approved by: https://github.com/izaitsevfb, https://github.com/clee2000
2024-07-12 16:37:46 +00:00
22fd89c904 [TEST][Inductor] Fix scaled_mm call (#130582)
`_scaled_mm` no longer returns `amax` (see #128683)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130582
Approved by: https://github.com/drisspg
2024-07-12 16:25:18 +00:00
34e57025e1 Add unsigned int types to torch/types.h (#130616)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130616
Approved by: https://github.com/NicolasHug, https://github.com/albanD
2024-07-12 16:24:29 +00:00
2b1df24877 Revert "Make hashing a SymInt raise an error again (#130548)"
This reverts commit 3100455b8eeebdfbc3428ff9454579ac50666faf.

Reverted https://github.com/pytorch/pytorch/pull/130548 on behalf of https://github.com/clee2000 due to broke inductor/test_triton_kernels.py https://github.com/pytorch/pytorch/actions/runs/9908970127/job/27377960411 3100455b8e. Not run on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/130548#issuecomment-2225912018))
2024-07-12 16:20:12 +00:00
2a1f22e57f Change BN to eval before QAT Convert phase (#130598)
**Summary**
In the QAT convert phase, we fold bn into conv and do DCE to this BN node. We should change `torch.ops.aten._native_batch_norm_legit.default` to `torch.ops.aten._native_batch_norm_legit_no_training.default`  for a safe DCE.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130598
Approved by: https://github.com/jgong5, https://github.com/yushangdi
2024-07-12 16:03:56 +00:00
18418a7dbb [ONNX] Fix torch_onnx patch accuracy bug in benchmark (#130586)
The ONNX related compilers have another route of accuracy check, and this PR brings torch_onnx compiler to the right measurement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130586
Approved by: https://github.com/justinchuby
2024-07-12 15:47:59 +00:00
e5657024b5 Fix loss_parallel with BF16 logits (#130550)
Fixes #130549

This PR uses the specific dtype for the `grad_input` buffer and fixes the error

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130550
Approved by: https://github.com/tianyu-l
2024-07-12 15:47:38 +00:00
ea4b80e6d6 [FX][export] strict DCE pass, check schema for node impurity (#130552)
Fixes the failure in `test/export/test_export_training_ir_to_run_decomp.py` caused by dead code elimination removing nodes with side effects.

For background, in export, we may want to export higher-level IRs that are not functional, so we need to check for side effects more carefully.

 A call_function node is impure if it has at least one mutable argument.
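
The impurity rule above can be illustrated with a toy dead code elimination pass. This is only a sketch in plain Python, not the FX implementation: the `Node` class and its `mutates_arg` flag are hypothetical stand-ins for an FX node and for "the op schema has at least one mutable argument".

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    inputs: list = field(default_factory=list)  # names of the nodes this node reads
    mutates_arg: bool = False  # hypothetical: schema has a mutable argument


def eliminate_dead_code(nodes, outputs):
    """Drop nodes whose results are unused, unless the node is impure."""
    live = set(outputs)
    kept = []
    # Walk in reverse topological order so users are visited before producers.
    for node in reversed(nodes):
        if node.name in live or node.mutates_arg:
            kept.append(node)
            live.update(node.inputs)
    kept.reverse()
    return kept


graph = [
    Node("buf"),
    Node("x"),
    Node("add_", inputs=["buf", "x"], mutates_arg=True),  # in-place, result unused
    Node("unused_mul", inputs=["x", "x"]),  # pure and unused
    Node("out", inputs=["x"]),
]
kept_names = [n.name for n in eliminate_dead_code(graph, outputs=["out"])]
```

Here the in-place `add_` survives even though nothing reads its result, while the pure `unused_mul` is removed.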

Fixed the tests below:

test_to_module_with_mutated_buffer_multiple_update_sub_later
test_export_input_mutation_static_shape
test_buffer_util

Another attempt modifying the original DCE pass is made in PR #130395, but it breaks some other tests, so here we add a flag and use it for export only.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130552
Approved by: https://github.com/pianpwk
2024-07-12 15:43:27 +00:00
febadda107 [MPS] Fix torch.[all|any] for 5+D tensors (#130542)
Work around a bug in `reductionAndWithTensor:` that kills the app with the
following assert if a 5+D tensor is passed as an input
```
Assertion failed: (0 <= mpsAxis && mpsAxis < 4 && "Runtime canonicalization must simplify reduction axes to minor 4 dimensions."), function encodeNDArrayOp, file GPUReductionOps.mm, line 76.
```
by reshaping the tensor to 2D/3D one before running the reduction.

Refactored common code into `all_any_common_impl_mps` as both `reductionOrWithTensor:` and `reductionAndWithTensor:` suffer from the same issue

Enabled `test_reduction_ops_5D` and added a regression test to it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130542
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #130541
2024-07-12 15:06:22 +00:00
d443fbc025 [inductor] Cache precompilation functions based on configs (#130350)
Summary: If we attempt to precompile sets of different choices (e.g. Triton vs Cutlass) that have the same key, the cached pool of futures doesn't work, since it only includes the first set of configs.  Add the config's hashes to the key to avoid this problem.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130350
Approved by: https://github.com/eellison
2024-07-12 14:21:49 +00:00
9c69684af8 [custom_ops] expose torch.library.register_torch_dispatch (#130261)
This is the API for defining the interaction between a torch_dispatch
class and a custom op. Taking API bikeshedding.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130261
Approved by: https://github.com/albanD
ghstack dependencies: #130064
2024-07-12 14:13:01 +00:00
ba941769b5 Add API for open registration between operators and subclasses (and modes) (#130064)
We add torch.library.Library._register_torch_dispatch_rule. Here, a user
can provide us a specific rule to run for a specific
(torch_dispatch_class, operator) pair. The motivation is that a user
might want to extend a subclass/mode but may not have access to the
source code of the subclass/mode.

I'll make this public in a follow-up PR if we think the approach and API
is good.

Keep in mind that many subclasses will likely deliver their own open
registration solution (DTensor has register_sharding_prop_rule and NJT
has register_jagged_op); _register_torch_dispatch_rule is meant as a
catch-all open registration mechanism for when the subclass hasn't
provided anything more specific.
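
The shape of such a registry can be sketched in plain Python. This is a toy model, not the `torch.library` implementation: `register_rule`, `dispatch`, and `ThirdPartyMode` are hypothetical names, and the real mechanism hooks into `__torch_dispatch__` rather than a free function.

```python
_rules = {}


def register_rule(dispatch_cls, op_name, rule):
    """Attach a rule to one (dispatch class, operator) pair."""
    _rules[(dispatch_cls, op_name)] = rule


def dispatch(dispatch_cls, op_name, *args):
    rule = _rules.get((dispatch_cls, op_name))
    if rule is None:
        return NotImplemented  # fall back to the class's own handling
    return rule(*args)


class ThirdPartyMode:
    """Stands in for a subclass/mode whose source we cannot edit."""


# Extend the third-party class from the outside.
register_rule(ThirdPartyMode, "mylib::double", lambda x: x * 2)
result = dispatch(ThirdPartyMode, "mylib::double", 21)
```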

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130064
Approved by: https://github.com/albanD
2024-07-12 14:13:01 +00:00
ae3ac9cb64 Only test _is_param if doing instance check on Parameter base (#130578)
Fixes https://github.com/pytorch/pytorch/issues/111348

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130578
Approved by: https://github.com/Skylion007
2024-07-12 13:55:13 +00:00
6f54e961ea Add trace_shape_events artifact tracing for ShapeEnv events (#130473)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130473
Approved by: https://github.com/lezcano
2024-07-12 13:50:25 +00:00
3100455b8e Make hashing a SymInt raise an error again (#130548)
See https://github.com/pytorch/pytorch/issues/130547

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130548
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-07-12 13:49:56 +00:00
b75cc70875 [Pipelining] add looped schedules to fsdp/ddp test (#130563)
It feels like an oversight that these were not tested, especially since
the test case already handles multi schedules specially but no
multi-schedules were being tested
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130563
Approved by: https://github.com/H-Huang
2024-07-12 13:39:47 +00:00
da030e7add Revert "[Inductor] FlexAttention supports partial masking (#130415)"
This reverts commit 207564bab1c4fe42750931765734ee604032fb69.

Reverted https://github.com/pytorch/pytorch/pull/130415 on behalf of https://github.com/janeyx99 due to Windows trunk test_proxy_tensor test failures look relevant  ([comment](https://github.com/pytorch/pytorch/pull/130415#issuecomment-2225575622))
2024-07-12 13:20:18 +00:00
207564bab1 [Inductor] FlexAttention supports partial masking (#130415)
This is the new version of #130235

Updated test script: https://gist.github.com/yanboliang/7c34a82df611d4ea8869cb9e041bfbfc
Updated perf numbers:
```
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py
fwd speedup: 0.7166695598192317
bwd speedup: 0.7142133867805904
(pt) [ybliang@devgpu002.ash8 ~/local/debug]$ CUDA_VISIBLE_DEVICES=4 python debug7.py --partial-mask
fwd speedup: 0.8428246087169973
bwd speedup: 0.8486261278030254
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130415
Approved by: https://github.com/Chillee
2024-07-12 07:19:28 +00:00
e568c91a7b [CP] Fix the incorrect ring schedule in the fwd and bwd (#129514)
Summary:
1. The argument order for all_to_all_single is "block, output_split_size, input_split_sizes, pg".
2. Uses the correct ring order for the grad_kv.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129514
Approved by: https://github.com/d4l3k, https://github.com/drisspg, https://github.com/wanchaol
ghstack dependencies: #129512
2024-07-12 07:05:36 +00:00
0d8dedb01b [dtensor] Add dtensor to TORCH_LOGS (#129512)
Summary:
Add the basic log for dispatcher of dtensor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129512
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2024-07-12 06:50:53 +00:00
b6215f44ef DCP checkpoint_dist_client integration (#130452)
Summary:
Integrate scope tracking with `checkpoint/fb/logging_handlers.py`.

Add a map of uuid -> tracker context manager. when logging handler has following events:
* `start`: create scope_tracker object, call `__enter__`, add to map with uuid
* `end`: retrieve scope_tracker object by uuid, call `__exit__`.
* `exception`: retrieve scope_tracker object by uuid, call `__exit__` with current exception info.

Test Plan:
Test with bento notebook (attached).
with a runtime_error in finish_checkpoint method.

scuba records:
https://fburl.com/scuba/workflow_signpost/ddttgmv2

Differential Revision: D56654417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130452
Approved by: https://github.com/LucasLLC
2024-07-12 06:01:56 +00:00
ff25dfca5a Save quantization_tag in export graph serialization (#127473)
Summary: `quantization_tag` is a first class citizen metadata in quantization flows that is preserved by it. As we'll want to store the quantized exported graphs we also need to preserve this metadata as it's used in later flows. Only json supported metadata will be allowed to be serialized.

Test Plan: Added test case

Differential Revision: D57939282

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127473
Approved by: https://github.com/angelayi
2024-07-12 05:06:40 +00:00
b7d287fbec Constant folding for dynamic shape node (#129686)
Extend constant folding for dynamic shape node, only support pointwise op and some restricted ops

We support dynamic shapes by limiting constant folding of ops that are guaranteed to have uniform values (full, pointwise ops, and views) and running these operators with tensors of shape 1. This also eliminates the possibility of memory overhead of constant folding.

Taken over from https://github.com/pytorch/pytorch/pull/128937

joint work with @imzhuhl

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129686
Approved by: https://github.com/Chillee
ghstack dependencies: #130367
2024-07-12 03:44:29 +00:00
ae0edadea0 [SDPA] Replace masked_fill_ with aten::where (#130281)
Summary:
full context in D59385876

Based on the offline discussion with PT2 folks, we switched to change the SDPA impl to mitigate the AOTI lowering issue

Test Plan: PYTORCH_TEST_FBCODE=1 buck2 run  mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true caffe2/test/inductor:test_inductor -- -r test_sdpa_inference_mode_aot_compile

Differential Revision: D59495634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130281
Approved by: https://github.com/drisspg, https://github.com/zou3519, https://github.com/Skylion007, https://github.com/justinchuby
2024-07-12 03:04:31 +00:00
c16e90fe06 The device_suffix in a test_name is "privateuse1" sometimes. (#130091)
When running some test cases on the privateuse1 device, the device_suffix in a test_name is 'privateuse1' sometimes.
For examples, a test_name is 'test_Dropout1d_npu', while it would be 'test_Dropout1d_privateuse1' sometimes.
When setUpClass() didn't set it, the device_suffix would be "privateuse1".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130091
Approved by: https://github.com/zou3519
2024-07-12 02:51:40 +00:00
9ae40c6bc0 Fix and improve raise_comms and sink_waits (#129980)
The tests for `raise_comms` and `sink_waits` passes were not enabled in CI. The passes are now broken due to functional collective v2 and possibly other changes.

Correctness issues:
- The original passes did not take mutation into consideration and may yield semantically different scheduling order. This may be due to the recent changes to how mutations are expressed in Inductor IR (e.g., MutationOutput).

Effectiveness issues:
- The original passes only moved the comm/wait nodes themselves. However, comm nodes can come with prologues (e.g., clone for all_reduce_, split-cat for non-zero dim all-gather). Whenever there are any prologues, the comms won't be raised at all.
- The prologues are often horizontally fused with other pointwise nodes. This can severely delay the scheduling of the comm node.

This PR:
- Make the passes handle mutation correctly.
- Instead of moving individual comm/wait nodes, schedule all nodes using a scored method. This way the comm nodes can be optimally raised even in the presence of prologues.
- The horizontal fusion of prologues often severely delays the scheduling of the comm node. Horizontally fusing this clone can almost never out-perform scheduling the comm node earlier. Also in most cases, this clone is eliminated via in-place reuse. Therefore, we tell the scheduler to not fuse it.
- Enable the tests in CI.

Co-authored-by: Will Feng <yf225@cornell.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129980
Approved by: https://github.com/yf225
2024-07-12 01:55:47 +00:00
c6a676add4 [Traceable FSDP2][Inductor] Add GroupedSchedulerNode to contain nodes that must be scheduled together (#128568)
As discussed with @mlazos and @Chillee in the Inductor group chat, we need the concept of `GroupedSchedulerNode` to be able to express nodes that must be scheduled together one-after-another (i.e. no other node is allowed to fuse into them or schedule in-between them).

This is particularly important for comm reordering and fine-grained control of peak memory. For Traceable FSDP2, there are two very important requirements:
- At any time, there must be only one AllGather in flight. However, our existing comm reordering pass will naturally raise **all** of the AllGather ops to the beginning of the graph, which will clearly blow up memory usage. Instead, we leverage `GroupedSchedulerNode`, which provides simple connection points to build the "chaining" on, i.e. we use it to express the schedule `(copyin + AllGather1) -> (AllGather1Wait+copyout) -> (copyin + AllGather2) -> (AllGather2Wait+copyout) ...` by setting up fake deps between the `GroupedSchedulerNode`s, which is a very clean and easy-to-understand way to express this schedule.
- The "comms" in FSDP2 are not just comms, but a combination of compute and comm. We must prevent other nodes from being scheduled in-between that set of nodes, otherwise we are artificially delaying the release of comm buffer memory which makes the peak memory usage quite bad. This is particularly pronounced for `AllGatherWait+copyout`.

From these two requirements, we derive the behavior of `GroupedSchedulerNode`: it contains nodes that must be scheduled together one-after-another (i.e. no other node is allowed to fuse into them or schedule in-between them).
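
A toy model of the grouping behavior, in plain Python. The `Op`, `Group`, and `hoist_comms` names are hypothetical and much simpler than the Inductor scheduler; the sketch only shows that a reordering pass which hoists bare comm ops leaves grouped members adjacent and in order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    is_comm: bool = False


@dataclass(frozen=True)
class Group:
    members: tuple  # ops that must run back-to-back, in this order


def hoist_comms(schedule):
    """Toy reordering pass: move every bare comm op to the front (stable)."""
    def is_bare_comm(item):
        return isinstance(item, Op) and item.is_comm

    comms = [item for item in schedule if is_bare_comm(item)]
    rest = [item for item in schedule if not is_bare_comm(item)]
    return comms + rest


def flatten(schedule):
    """Expand groups into the final linear order of op names."""
    ops = []
    for item in schedule:
        ops.extend(item.members if isinstance(item, Group) else (item,))
    return [op.name for op in ops]


copyin1, ag1 = Op("copyin1"), Op("all_gather1", is_comm=True)
copyin2, ag2 = Op("copyin2"), Op("all_gather2", is_comm=True)
compute = Op("matmul")

ungrouped = flatten(hoist_comms([copyin1, ag1, compute, copyin2, ag2]))
grouped = flatten(hoist_comms([Group((copyin1, ag1)), compute, Group((copyin2, ag2))]))
```

Without grouping, both all-gathers end up at the front; with grouping, each `copyin + AllGather` pair stays together and the original interleaving is preserved.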

----

Q: Can we leverage `ir.Subgraph`?
A: I looked into the possibility of using `ir.Subgraph` to implement this, but realized that:
1. `ir.Subgraph` requires defining the subgraph in FX IR.
2. There is no guarantee that the Inductor IR nodes that we want to group together will all have a corresponding FX IR node, because some of those Inductor IR nodes can potentially be dynamically generated by a custom pass in the scheduler (e.g. for merging multiple all-gathers into one big all-gather, and later we want to group that big all-gather with some other op). Dynamically generated Inductor IR node doesn't have a corresponding upstream FX IR node.
3. For the above reasons, we can't use the `ir.Subgraph`, and need to define a new (and more lightweight) concept of `GroupedSchedulerNode` to achieve the behavior we need (this PR).

----

Test commands:
- `pytest -rA  test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc::test_grouped_scheduler_node`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128568
Approved by: https://github.com/eellison, https://github.com/mlazos
2024-07-12 01:42:38 +00:00
c101c4517a Add python type for list iterators (#130511)
Fixes https://github.com/pytorch/pytorch/issues/117026

Also not sure why this was missing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130511
Approved by: https://github.com/williamwen42, https://github.com/yanboliang, https://github.com/anijain2305
2024-07-12 01:14:18 +00:00
536b5b19b5 Revert "Simplify c10::string_view (#130009)"
This reverts commit 10c7f037fe3271cb3865816c216007ba403f5347.

Reverted https://github.com/pytorch/pytorch/pull/130009 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/130009#issuecomment-2224223526))
2024-07-12 00:46:49 +00:00
7f2436014e add MTIA as valid device type for prof averages (#130340)
Summary: Add MTIA as valid device option for getting profile averages

Test Plan: Tested with auto-trace on MTIA

Differential Revision: D59486392

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130340
Approved by: https://github.com/aaronenyeshi
2024-07-12 00:39:01 +00:00
7ce5b5767c Revert "Make c10::string_view an alias of std::string_view (#130417)"
This reverts commit c9551a3f50efc8163d8508a3c2189536528577ac.

Reverted https://github.com/pytorch/pytorch/pull/130417 on behalf of https://github.com/izaitsevfb due to depends on #130009 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130417#issuecomment-2224212227))
2024-07-12 00:37:04 +00:00
b5b91b418d [Easy] Update record_function Comment (#130561)
Summary: Users have been confused why user annotations on GPU tracks do not show when doing GPU only tracing. This comment should help users understand that to use this function they need to have CPU activities enabled.

Test Plan: N/A it is just updating a comment

Differential Revision: D59649390

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130561
Approved by: https://github.com/aaronenyeshi
2024-07-11 23:51:25 +00:00
18b7633bfb [export] fix kwargs in run_decompositions() for training IR (#130553)
Re-exporting GraphModule expects all inputs to be in args, though not in pytree-flattened format. This avoids failing when we run with a fx.Interpreter subclass in [AOTAutograd tracing](973037be6a/torch/_functorch/_aot_autograd/traced_function_transforms.py (L760-L762)).

Removes 7 test failures for training IR export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130553
Approved by: https://github.com/zhxchen17, https://github.com/ydwu4
2024-07-11 22:53:18 +00:00
26c2b92525 [export] make with_effect mark op has_effect to prevent them from DCEed. (#129680)
Before the PR, custom ops that don't return outputs will get eliminated after calling `.module()` because the effect_token that keeps the operator alive is removed in remove_effect_token pass. The reason why we want to remove_effect_token is because we don't want the token to be part of input. However, this causes DCE calls in remove_effect_token itself and the dce calls in unlift to remove the custom op in the graph causing an error in the exported graph.

This PR calls has_side_effect in with_effect to make sure graph.eliminate_dead_code doesn't remove the calls by accident.

Test Plan:
Add a new test pytest test/export/test_torchbind.py -k test_export_inplace_custom_op

Differential Revision: [D59498728](https://our.internmc.facebook.com/intern/diff/D59498728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129680
Approved by: https://github.com/angelayi
2024-07-11 22:46:21 +00:00
9c6c0deadc Add eager_compile_backwards_failure to tlparse (#130434)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130434
Approved by: https://github.com/albanD
2024-07-11 22:35:33 +00:00
d97d962082 Revert "Add decompositions for copy variants of view ops (#128416)"
This reverts commit 68751799b85aa7f659420801bdbb8451f01ab09a.

Reverted https://github.com/pytorch/pytorch/pull/128416 on behalf of https://github.com/izaitsevfb due to breaks test_qs8_permute_copy test in executorch ([comment](https://github.com/pytorch/pytorch/pull/128416#issuecomment-2224023423))
2024-07-11 22:09:23 +00:00
a2f630a9a4 Revert "Decompose expand_copy and permute_copy (#129476)"
This reverts commit 7d4cb2109823f1c4001dff62b461bb9eda07ca17.

Reverted https://github.com/pytorch/pytorch/pull/129476 on behalf of https://github.com/izaitsevfb due to depends on #128416 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/129476#issuecomment-2224019720))
2024-07-11 22:06:15 +00:00
fc872e98f3 Infer prim tags from equivalent aten ones (#130367)
Take the intersection of all the tags for corresponding aten op overloads. Previously, some of the rng ops not having tags caused issues with constant folding (they should get decomposed but that's a separate issue).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130367
Approved by: https://github.com/ezyang
2024-07-11 20:53:52 +00:00
726a287271 [export] Expand verifier to be multiple on ExportedProgram (#130364)
Summary: This diff updates the ExportedProgram class in PyTorch to allow for multiple verifiers to be attached to it. This is done by adding a new field to the ExportedProgram schema called "verifiers" which is a list of strings representing the names of the verifiers to be attached to the program. The verifiers are loaded using the "load_verifier" function which is defined in the "torch._export.serde.serialize" module. The "exported_program.dialect" field is also deprecated in favor of the "verifiers" field.

Test Plan: CI

Differential Revision: D59408546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130364
Approved by: https://github.com/angelayi, https://github.com/ydwu4
2024-07-11 20:34:49 +00:00
5c6edd29ec Turn on splitShare=1 to make the optimization of comm_split effective. (#129929)
Fixes #129865
Currently, new_group will call ncclCommSplit in some cases. In theory, ncclCommSplit will bring performance and memory benefits. However, the config parameter of the ncclCommSplit function in pytorch does not set "splitShare=1", which results in the optimization of ncclCommSplit being turned off and the benefits being invalid.
This PR turns on splitShare=1 to make the optimization of comm_split effective.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129929
Approved by: https://github.com/shuqiangzhang
2024-07-11 20:14:58 +00:00
c50b189280 Move trunk windows builds to CUDA-12.1 (#130446)
That should catch build regressions that were previously only detectable during the nightly builds
Win + CUDA-11.8 builds and tests are still run as part of periodic workflow
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130446
Approved by: https://github.com/atalman
2024-07-11 19:50:57 +00:00
bc18863713 Corner-case fix for upscale_histogram in the new HistogramObserver (#130316)
Summary: Small fix to the bucketize function that caused a run-time error in some corner cases.

Test Plan: Unit tests

Differential Revision: D59508432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130316
Approved by: https://github.com/jerryzh168
2024-07-11 19:49:21 +00:00
cd9bae30de Allow kwargs in _remove_effect_tokens_pass (#130491)
Summary: Previously, the remove_effect_tokens pass didn't pass kwargs to the internal nodes. This PR fixes it and adds a test for it.

Test Plan: buck2 run caffe2/test:test_export -- -r test_remove_effect_token_kwargs

Reviewed By: angelayi

Differential Revision: D59603147

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130491
Approved by: https://github.com/angelayi
2024-07-11 19:03:19 +00:00
578388bed8 Revert "Support for expandable segments with cuda graph trees (#128068)"
This reverts commit fdc83610f272610ce50d1a6f5b6354f2df1baabb.

Reverted https://github.com/pytorch/pytorch/pull/128068 on behalf of https://github.com/janeyx99 due to Reverting for breaking ROCm tests on trunk, I think the tests need to be qualified with @onlyCUDA ([comment](https://github.com/pytorch/pytorch/pull/128068#issuecomment-2223672381))
2024-07-11 18:58:13 +00:00
1cae60a87e Caching attr_proxy for nn_module attribute to fix guard check failure (#130280)
Fixes https://github.com/pytorch/pytorch/issues/129939

Differential Revision: [D59594605](https://our.internmc.facebook.com/intern/diff/D59594605)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130280
Approved by: https://github.com/anijain2305
2024-07-11 18:21:35 +00:00
0a4fe2ff86 [DSD] Use no_grad() to make some operations faster and avoid possible memory leakage (#130355)
Use no_grad() to make some operations faster and avoid possible memory leakage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130355
Approved by: https://github.com/wz337
2024-07-11 18:18:08 +00:00
973037be6a [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199)
This PR changes the empty collection factory call to Python literals:

- `list()` -> `[]`
- `tuple()` -> `()`
- `dict()` -> `{}`

The Python literals are more performant and safer. For example, the bytecode for building an empty dictionary:

```bash
$ python3 -m dis - <<EOS
import collections

d1 = {}
d2 = dict()

dict = collections.OrderedDict
d3 = dict()
EOS
```

```text
  0           0 RESUME                   0

  1           2 LOAD_CONST               0 (0)
              4 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (collections)
              8 STORE_NAME               0 (collections)

  3          10 BUILD_MAP                0
             12 STORE_NAME               1 (d1)

  4          14 PUSH_NULL
             16 LOAD_NAME                2 (dict)
             18 CALL                     0
             26 STORE_NAME               3 (d2)

  6          28 LOAD_NAME                0 (collections)
             30 LOAD_ATTR                8 (OrderedDict)
             50 STORE_NAME               2 (dict)

  7          52 PUSH_NULL
             54 LOAD_NAME                2 (dict)
             56 CALL                     0
             64 STORE_NAME               5 (d3)
             66 RETURN_CONST             1 (None)
```

The dict literal `{}` only has one bytecode `BUILD_MAP`, while the factory call `dict()` has three `PUSH_NULL + LOAD_NAME + CALL`. Also, the factory call is not safe if users override the `dict` name in `locals` or `globals` (see the example of replacing with `OrderedDict` above).
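
The safety point can also be checked at runtime. The following is a small sketch (the function name `make_empty_mappings` is made up for illustration) that shadows `dict` inside a function, mirroring the `OrderedDict` replacement above: the literal still builds a builtin `dict`, while the call follows whatever the name currently refers to.

```python
import builtins
import collections


def make_empty_mappings():
    dict = collections.OrderedDict  # shadows the builtin inside this function
    from_literal = {}  # BUILD_MAP: no name lookup involved
    from_call = dict()  # resolves the (shadowed) name at runtime
    return from_literal, from_call


literal_result, call_result = make_empty_mappings()
literal_is_builtin = type(literal_result) is builtins.dict
call_is_ordered = type(call_result) is collections.OrderedDict
```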

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130199
Approved by: https://github.com/malfet
2024-07-11 17:30:28 +00:00
492de213e2 Revert "Change deprecated warning on dispatch_on_subclass to warn once (#130047)"
This reverts commit f21a21828ac6e16d903ee88f726fdb2278c04782.

Reverted https://github.com/pytorch/pytorch/pull/130047 on behalf of https://github.com/albanD due to The failure on the PR are valid, they should not have been ignored ([comment](https://github.com/pytorch/pytorch/pull/130047#issuecomment-2223488933))
2024-07-11 17:24:02 +00:00
f21a21828a Change deprecated warning on dispatch_on_subclass to warn once (#130047)
Summary:
Right now the deprecated warning fires on every operator that calls into torch_function. Changing it to TORCH_WARN_ONCE instead.

More context in https://fb.workplace.com/groups/260102303573409/permalink/445299188387052/

Test Plan: Sandcastle

Differential Revision: D59338775

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130047
Approved by: https://github.com/XilunWu
2024-07-11 17:02:26 +00:00
3896ba3260 [DeviceMesh] Only include the real thread_id in DeviceMesh hash under threaded backend (#130495)
As a followup to https://github.com/pytorch/pytorch/pull/130454, users are hitting the cross-mesh operation error because the DeviceMesh thread ID differs between the saved vs. loaded DTensor due to thread id being different.

This is a hot fix to only consider the real thread_id in DeviceMesh hash under threaded backend, but set it to None for all other cases.
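
The hashing rule can be sketched with a toy class. `ToyMesh` and `make_in_thread` are hypothetical and far simpler than `DeviceMesh`; the sketch assumes the only thing that matters is whether the thread id takes part in the hash and equality key.

```python
import threading


class ToyMesh:
    def __init__(self, device_type, ranks, threaded_backend=False):
        self.device_type = device_type
        self.ranks = tuple(ranks)
        # Only record the real thread id under the threaded backend.
        self.thread_id = threading.get_ident() if threaded_backend else None

    def _key(self):
        return (self.device_type, self.ranks, self.thread_id)

    def __hash__(self):
        return hash(self._key())

    def __eq__(self, other):
        return isinstance(other, ToyMesh) and self._key() == other._key()


def make_in_thread(**kwargs):
    """Build a mesh on a worker thread, e.g. to mimic loading elsewhere."""
    out = []
    worker = threading.Thread(target=lambda: out.append(ToyMesh("cpu", range(4), **kwargs)))
    worker.start()
    worker.join()
    return out[0]


same = ToyMesh("cpu", range(4)) == make_in_thread()
differ = ToyMesh("cpu", range(4), threaded_backend=True) != make_in_thread(threaded_backend=True)
```

With the thread id set to `None`, a mesh created on another thread compares and hashes equal, which is what avoids the spurious cross-mesh error outside the threaded backend.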

As a follow up, we need to look at the following test failures to better root cause specific DeviceMesh related failures related to MTPG, if thread_id is not included as part of the hash.
```
test/distributed/_composable/fsdp/test_fully_shard_training.py::TestFullyShardRegisteredParams::test_param_registration_after_forward
test/distributed/_tensor/test_dtensor_ops.py::TestDTensorOpsCPU::test_dtensor_op_db_column_stack_cpu_float32
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130495
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-07-11 17:02:18 +00:00
72d9135679 increase tensor size to force out of memory exception on the latest generations of GPUs (#130334)
This PR fixes profiler/test_profiler.py::TestProfiler::test_oom_tracing
The test expects an OOM when allocating a huge tensor, but MI300X has enough memory to allocate such a tensor.
This PR increases the tensor size by a large margin to force an OutOfMemory exception on MI300X and future GPU generations

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130334
Approved by: https://github.com/jeffdaily, https://github.com/janeyx99
2024-07-11 16:59:40 +00:00
9c1ba5ac10 [BE] Cleanup unused vars in MPS (#130541)
And move `using namespace mps` outside of every function as there is no
need to repeat it
Use `getTensorsStringKey` instead of explicit
`getMPSShapeString(getMPSShape(t)) + getMPSDataTypeString(t)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130541
Approved by: https://github.com/Skylion007
2024-07-11 16:48:03 +00:00
68ad3eb722 Do not set hints for mark_unbacked quantities (#130483)
Fixes https://github.com/pytorch/pytorch/issues/130456

When we mark_unbacked a size, we actually DO have a hint for it
(because we have a real input tensor), and previously, we were
accidentally putting it into the hint field of SymNode.  If the marked
unbacked size is zero or one, this can lead to inconsistency between
hint compute and static evaluation compute under guard size oblivious,
since that's the whole point of size oblivious.  The answer is to scrub
out hints on mark unbacked ints.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130483
Approved by: https://github.com/lezcano
2024-07-11 15:51:00 +00:00
ca023f77bc [CD] Add pytorch xpu wheel build in nightly (#129560)
Add pytorch xpu wheel build in nightly after the xpu build image enabling PR https://github.com/pytorch/builder/pull/1879 merged

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129560
Approved by: https://github.com/atalman
2024-07-11 15:49:04 +00:00
fb9bc6d74a [custom op] add doc for CustomOpDef.set_kernel_enabled (#130406)
<img width="1067" alt="Screenshot 2024-07-09 at 6 14 55 PM" src="https://github.com/pytorch/pytorch/assets/22356083/941751f8-8e12-43cb-8477-c739476e0096">
<img width="965" alt="Screenshot 2024-07-09 at 6 14 59 PM" src="https://github.com/pytorch/pytorch/assets/22356083/aa9be099-f26c-45a3-8a14-742a2bb7c28b">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130406
Approved by: https://github.com/zou3519
2024-07-11 15:47:35 +00:00
5ed72ff5f5 Reduce all tensors to their metadata in AOTAutogradCache; add tests (#128583)
This PR makes it so that all tensors are reduced to their metadata in AOTAutogradCache. Because dynamo always embeds constant tensors into the FXgraph directly, there's no risk of a constant tensor whose values are semantically important being lost here. AOTAutograd itself may take a constant tensor and set it as an attribute on an FXGraph for inductor, but Dynamo never does this.

One other thing that this diff does is add `[pickler.fast](https://docs.python.org/3/library/pickle.html#pickle.Pickler.fast)` to our pickling algorithm for cache key generation. Pickle will often memoize/intern strings when pickling, leading to false cache misses due to inconsistent memoization. Turning on pickler.fast removes this behavior.
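
The memoization effect can be reproduced with the standard library alone. The sketch below is not the AOTAutogradCache code; the helper name `dumps` and the sample strings are assumptions made for illustration. It pickles two equal tuples that differ only in whether their elements are the same object.

```python
import io
import pickle


def dumps(obj, fast):
    """Pickle obj, optionally disabling memoization via Pickler.fast."""
    buf = io.BytesIO()
    pickler = pickle.Pickler(buf)
    if fast:
        pickler.fast = True  # deprecated in the docs, but still supported
    pickler.dump(obj)
    return buf.getvalue()


# Two equal strings that are distinct objects (built at runtime).
s1 = "".join(["cache", "_key"])
s2 = "".join(["cache", "_key"])

shared = (s1, s1)    # second element is the same object as the first
distinct = (s1, s2)  # equal value, different object

# With memoization, the shared string is written once and then referenced,
# so two equal values can serialize to different bytes.
memo_differs = dumps(shared, fast=False) != dumps(distinct, fast=False)

# With fast=True nothing is memoized, so equal values serialize identically.
fast_matches = dumps(shared, fast=True) == dumps(distinct, fast=True)
```

If the pickled bytes feed a hash, the first comparison is what a false cache miss looks like, and the second is the behavior `fast` buys.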

Technically `fast` is a "deprecated" feature according to python docs. But it's still supported in py3.8-3.12, and if it ever is removed, the only downside will just be a few more cache misses, so I think it's worth just adding here (and removing later as needed)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128583
Approved by: https://github.com/oulgen
ghstack dependencies: #128335
2024-07-11 15:39:09 +00:00
be7bf20234 Add JK to enable fx graph cache for amd (#130463)
Test Plan: ad hoc testing

Differential Revision: D59593961

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130463
Approved by: https://github.com/nmacchioni, https://github.com/mxz297
2024-07-11 15:28:38 +00:00
6f662e9575 update the input weight of _convert_weight_to_int4pack to [n][k / 2] uint8 (#129940)
This PR updates the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8` for CPU, CUDA and MPS, which helps decouple int4 model checkpoints from specific ISAs and platforms in `gpt-fast`. The advantage is that an int4 model checkpoint can be shared across test machines without being regenerated on each platform. Meanwhile, the size of the input `weight` is reduced to `1 / 8`.

Before this PR, the packed weight was stored in a CUDA-specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. The CPU packed weight was viewed as the SAME shape but stored in a different layout: `[n/64][k][32]`, dtype uint8. The weight was strongly coupled with platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar), and users could not use a generated weight on a different ISA or platform, because the compute format differs when the weight is loaded onto the device.
![image](https://github.com/pytorch/pytorch/assets/61222868/64971c4b-29b9-42cf-9aeb-ffa01cea93dd)

Now, we use a common serialized layout (`[n][k/2] uint8`) for different devices or ISAs as the input `weight` of `_convert_weight_to_int4pack`, and each backend chooses how to interpret it as its compute layout.
![image](https://github.com/pytorch/pytorch/assets/61222868/c7990761-c723-417b-aca2-7c60db7785c7)
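
As a rough illustration of the `[n][k] int32` to `[n][k/2] uint8` change and the `1 / 8` size claim, here is a pure-Python sketch. The function name `pack_int4_rows` is hypothetical, and the nibble order (even index in the high nibble) is an assumption; the PR text above does not specify it.

```python
def pack_int4_rows(weight):
    """Pack an [n][k] matrix of 4-bit values into n rows of k // 2 bytes.

    Assumption: the even-indexed value goes in the high nibble.
    """
    packed = []
    for row in weight:
        if len(row) % 2 != 0:
            raise ValueError("k must be even")
        if any(not 0 <= v <= 15 for v in row):
            raise ValueError("values must fit in 4 bits")
        packed.append(
            bytes((row[i] << 4) | row[i + 1] for i in range(0, len(row), 2))
        )
    return packed


weight = [[1, 2, 3, 4], [15, 0, 7, 8]]  # n = 2, k = 4
packed = pack_int4_rows(weight)

int32_bytes = len(weight) * len(weight[0]) * 4  # [n][k] int32: 4 bytes each
uint8_bytes = sum(len(row) for row in packed)   # [n][k / 2] uint8: 1 byte each
```

Here `int32_bytes` is 32 and `uint8_bytes` is 4: half as many elements, each a quarter of the width.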

### Performance
Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores)
There is no obvious regression of this PR.
![image](https://github.com/pytorch/pytorch/assets/61222868/6046dcf4-920b-4c63-9ca3-1c8c3cafebde)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima
2024-07-11 15:26:48 +00:00
cyy
c4a2b6a943 [2/N] Fix NVCC warnings (#130214)
Follows #130191
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130214
Approved by: https://github.com/ezyang
2024-07-11 14:46:53 +00:00
a833582dbb [dynamo][tuple] Optimize guard for small tuples - helps conv2d guards (#130400)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130400
Approved by: https://github.com/yanboliang, https://github.com/jansel
ghstack dependencies: #130285, #130368, #130416
2024-07-11 14:13:24 +00:00
f7d7b94017 [dynamo][unspecialized-nn-module] Distinguish between user-defined and builtin nn module (#130416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130416
Approved by: https://github.com/jansel
ghstack dependencies: #130285, #130368
2024-07-11 14:13:24 +00:00
fed8b0055f [dynamo][bufgix] Fix the value for key manager (#130368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130368
Approved by: https://github.com/jansel
ghstack dependencies: #130285
2024-07-11 14:13:19 +00:00
9c612df504 [dynamo][cpp-guards][QOL] Print NO_TENSOR_ALIASING guard once (#130285)
NO_TENSOR_ALIASING guard lists all tensors. Printing it on every occurrence is ugly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130285
Approved by: https://github.com/jansel
2024-07-11 14:13:14 +00:00
bac10cdd6f [DCP] Fix duplicated logging messages when enable both c10d and dcp l… (#130423)
…ogger

Fixes #129951 . Would you take a moment to review it? @LucasLLC

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130423
Approved by: https://github.com/Skylion007
2024-07-11 13:43:39 +00:00
0d66ccaf23 [IntraNodeComm] fix an issue where input check fails when running all-reduce on sub groups (#130492)
Tested against the following snippet with `ENABLE_INTRA_NODE_COMM=1`.

```python
import os
import torch
import torch.distributed as dist

def main():
    rank = int(os.environ["RANK"])
    local_rank = int(os.environ["LOCAL_RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    torch.cuda.set_device(f"cuda:{local_rank}")
    dist.init_process_group("nccl")

    draft_group = dist.new_group([0, 1, 2, 3])
    target_group = dist.new_group([4, 5, 6, 7])

    inp = torch.full((128, 128), rank, dtype=torch.bfloat16, device="cuda")
    dist.all_reduce(inp)
    expect = sum(range(world_size))
    assert inp.eq(expect).all()

    if 0 <= rank < 4:
        inp = torch.full((128, 128), rank, dtype=torch.bfloat16, device="cuda")
        dist.all_reduce(inp, group=draft_group)
        expect = sum(range(4))
        assert inp.eq(expect).all()
    else:
        inp = torch.full((128, 128), rank, dtype=torch.bfloat16, device="cuda")
        dist.all_reduce(inp, group=target_group)
        expect = sum(range(4, 8))
        assert inp.eq(expect).all()

    torch.cuda.synchronize()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130492
Approved by: https://github.com/Chillee
2024-07-11 13:39:14 +00:00
f261c6ebe8 Revert "[halide-backend] Update CI pin (#130258)"
This reverts commit 4fcfd475bea24b832da32a0c4d464dd87c73a2a9.

Reverted https://github.com/pytorch/pytorch/pull/130258 on behalf of https://github.com/albanD due to Seems to have broken trunk pretty bad 4fcfd475be ([comment](https://github.com/pytorch/pytorch/pull/130258#issuecomment-2222935064))
2024-07-11 13:26:01 +00:00
354edb232a Make public binding test only consider files that are packaged in the wheels (#130497)
In particular, when creating the PyTorch wheel, we use setuptools find_packages 551b3c6dca/setup.py (L1055) which explicitly skips packages without `__init__.py` files (namespace packages) https://setuptools.pypa.io/en/latest/userguide/package_discovery.html#finding-simple-packages.

So this PR reverts the change that stopped skipping these namespace packages: even though they are in the codebase, they are not in the published binaries, so we're ok relaxing the public API and importability rules for them.
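
To make the skipping rule concrete, here is a standard-library sketch of a traversal that only reports directories containing `__init__.py`. It is not the setuptools implementation or the test's actual code; `find_regular_packages` and the temporary `pkg` layout are hypothetical.

```python
import os
import tempfile


def find_regular_packages(root):
    """Return dotted names of directories under root that contain __init__.py.

    A directory without __init__.py is skipped together with its children.
    """
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        if dirpath == root:
            continue
        if "__init__.py" not in filenames:
            dirnames[:] = []  # prune: do not descend into namespace packages
            continue
        rel = os.path.relpath(dirpath, root)
        found.append(rel.replace(os.sep, "."))
    return sorted(found)


with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: "pkg/examples" has no __init__.py.
    for d in ("pkg", "pkg/sub", "pkg/examples"):
        os.makedirs(os.path.join(root, *d.split("/")))
    for f in ("pkg/__init__.py", "pkg/sub/__init__.py", "pkg/examples/demo.py"):
        open(os.path.join(root, *f.split("/")), "w").close()
    packages = find_regular_packages(root)
```

In this layout `pkg.examples` is on disk but never reported, which is the situation the listing below describes for the real tree.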

A manual diff of the two traversal methods:
```
torch._inductor.kernel.bmm
torch._inductor.kernel.conv
torch._inductor.kernel.flex_attention
torch._inductor.kernel.mm
torch._inductor.kernel.mm_common
torch._inductor.kernel.mm_plus_mm
torch._inductor.kernel.unpack_mixed_mm
torch._strobelight.examples.cli_function_profiler_example
torch._strobelight.examples.compile_time_profile_example
torch.ao.pruning._experimental.data_sparsifier.benchmarks.dlrm_utils
torch.ao.pruning._experimental.data_sparsifier.benchmarks.evaluate_disk_savings
torch.ao.pruning._experimental.data_sparsifier.benchmarks.evaluate_forward_time
torch.ao.pruning._experimental.data_sparsifier.benchmarks.evaluate_model_metrics
torch.ao.pruning._experimental.data_sparsifier.lightning.tests.test_callbacks
torch.ao.quantization.experimental.APoT_tensor
torch.ao.quantization.experimental.adaround_fake_quantize
torch.ao.quantization.experimental.adaround_loss
torch.ao.quantization.experimental.adaround_optimization
torch.ao.quantization.experimental.apot_utils
torch.ao.quantization.experimental.fake_quantize
torch.ao.quantization.experimental.fake_quantize_function
torch.ao.quantization.experimental.linear
torch.ao.quantization.experimental.observer
torch.ao.quantization.experimental.qconfig
torch.ao.quantization.experimental.quantizer
torch.csrc.jit.tensorexpr.codegen_external
torch.csrc.jit.tensorexpr.scripts.bisect
torch.csrc.lazy.test_mnist
torch.distributed._tensor.examples.checkpoint_example
torch.distributed._tensor.examples.comm_mode_features_example
torch.distributed._tensor.examples.comm_mode_features_example_argparser
torch.distributed._tensor.examples.convnext_example
torch.distributed._tensor.examples.torchrec_sharding_example
torch.distributed._tensor.examples.visualize_sharding_example
torch.distributed.benchmarks.benchmark_ddp_rpc
torch.distributed.checkpoint.examples.async_checkpointing_example
torch.distributed.checkpoint.examples.fsdp_checkpoint_example
torch.distributed.checkpoint.examples.stateful_example
torch.distributed.examples.memory_tracker_example
torch.fx.experimental.shape_inference.infer_shape
torch.fx.experimental.shape_inference.infer_symbol_values
torch.include.fp16.avx
torch.include.fp16.avx2
torch.onnx._internal.fx.analysis.unsupported_nodes
torch.onnx._internal.fx.passes._utils
torch.onnx._internal.fx.passes.decomp
torch.onnx._internal.fx.passes.functionalization
torch.onnx._internal.fx.passes.modularization
torch.onnx._internal.fx.passes.readability
torch.onnx._internal.fx.passes.type_promotion
torch.onnx._internal.fx.passes.virtualization
torch.utils._strobelight.examples.cli_function_profiler_example
torch.utils.benchmark.examples.sparse.compare
torch.utils.benchmark.examples.sparse.fuzzer
torch.utils.benchmark.examples.sparse.op_benchmark
torch.utils.tensorboard._convert_np
torch.utils.tensorboard._embedding
torch.utils.tensorboard._onnx_graph
torch.utils.tensorboard._proto_graph
torch.utils.tensorboard._pytorch_graph
torch.utils.tensorboard._utils
torch.utils.tensorboard.summary
torch.utils.tensorboard.writer

```

These are all either namespace packages (which we want to remove) or packages that are not importable (and tagged as such in the test).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130497
Approved by: https://github.com/aorenste
2024-07-11 13:22:04 +00:00
215013daad [cuDNN][SDPA] Limit cuDNN SDPA head-dim to 128 (#130494)
Limit cuDNN SDPA to head-dim 128 globally. Apparently the support for 256 is only for the forward on sm90+, which would be clunky to maintain as it would mean dispatching differently for forward/backward.

CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130494
Approved by: https://github.com/drisspg, https://github.com/Skylion007
2024-07-11 13:21:18 +00:00
cyy
9822fdc354 [7/N] Replace c10::optional with std::optional (#130510)
Follows #130438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130510
Approved by: https://github.com/janeyx99
2024-07-11 13:21:05 +00:00
f52b2ee90f Modularize aten parameter parser and checker (#125308)
In this PR, we abstracted the different types of aten operation parameters as `ParameterMetadata`. This structure is intended to represent and store the metadata of each aten operation parameter. Currently, it only supports `Tensor`, `TensorList`, and `Scalar`.

```C++
using ParameterMetadataValue = std::variant<TensorMetadata, std::vector<TensorMetadata>, c10::Scalar>;
```

With this PR, we can extend support to other parameter types, like `string`, `int`, and `double`, in a more modular way.

Differential Revision: [D59399546](https://our.internmc.facebook.com/intern/diff/D59399546)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125308
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/atalman
2024-07-11 13:17:25 +00:00
2a51ccc77e When translation validation is enabled, assert that hint is consistent (#130478)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130478
Approved by: https://github.com/lezcano
2024-07-11 13:02:31 +00:00
cyy
c9551a3f50 Make c10::string_view an alias of std::string_view (#130417)
Follows #130009 to further facilitate the migration from c10::string_view to std::string_view. The old c10::string_view was renamed to c10::string_view_ext.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130417
Approved by: https://github.com/ezyang
2024-07-11 12:31:06 +00:00
cyy
c5b66c3fe1 Enable -Werror=pedantic on torch targets (#130319)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130319
Approved by: https://github.com/ezyang
2024-07-11 12:27:32 +00:00
5db9bd467e Skip test_nnc_correctness for new op _unsafe_masked_index (#130375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130375
Approved by: https://github.com/lezcano
2024-07-11 08:17:16 +00:00
b1942a1af4 [fbgemm_gpu] Break up fbgemm_cuda_utils.cuh, pt 10 (#130468)
Summary:
X-link: https://github.com/pytorch/FBGEMM/pull/2814

X-link: https://github.com/facebookresearch/FBGEMM/pull/19

- Break up `fbgemm_cuda_utils.cuh`, pt 10

Test Plan:
```
buck2 targets //deeplearning/fbgemm/fbgemm_gpu/test/jagged/... | grep -v '-' | xargs -I % sh -c 'buck2 run @//mode/opt -c fbcode.nvcc_arch=v100 -c fbcode.platform=platform010 % || exit 255'

buck2 targets //deeplearning/fbgemm/fbgemm_gpu/test/tbe/... | grep -v '-' | xargs -I % sh -c 'buck2 run @//mode/opt -c fbcode.nvcc_arch=v100 -c fbcode.platform=platform010 % || exit 255'

buck2 targets //deeplearning/fbgemm/fbgemm_gpu/test/sparse/... | grep -v '-' | xargs -I % sh -c 'buck2 run @//mode/opt -c fbcode.nvcc_arch=v100 -c fbcode.platform=platform010 % || exit 255'

buck2 build --config fbcode.enable_gpu_sections=true --flagfile fbcode//mode/dev-nosan-amd-gpu fbcode//smart/inference_platform_sp/llm_predictor_amd:service

buck2 build --flagfile fbcode//mode/amd-gpu fbcode//hpc/ops:sparse_ops

buck2 build --flagfile fbcode//mode/dev-nosan-amd-gpu fbcode//caffe2/benchmarks/operator_benchmark/pt:add_test
```

Reviewed By: spcyppt

Differential Revision: D59545097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130468
Approved by: https://github.com/ezyang
2024-07-11 07:10:27 +00:00
79c41bb58a [inductor] switch CppCodeCache to new cpp_builder. (#130132)
Changes:
1. switch CppCodeCache to new cpp_builder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130132
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-11 07:03:43 +00:00
75ab027fbb [dtensor] move bernolli to op strategy (#130286)
as titled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130286
Approved by: https://github.com/awgu, https://github.com/yifuwang
2024-07-11 06:43:11 +00:00
fdc83610f2 Support for expandable segments with cuda graph trees (#128068)
This PR adds support to use expandable segments with private memory pools which should unblock using it with cuda graphs and cuda graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool due to checkpoint saving/restoring not meshing well with how we keep track of unmapped blocks.

The PR itself is pretty short, most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work.

Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial due to the fact that each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together.

The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it's split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned back to Cuda when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments are similar to those for normal blocks and do all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to Cuda.
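
The reserve / map / cache / unmap lifecycle can be pictured with a toy model. This is only a sketch under heavy simplifying assumptions: it is not the caching allocator, it works in whole pages, it only frees the most recent allocation, and the class and method names (`ToyExpandableSegment`, `free_last`, `release_cached`) are hypothetical.

```python
class ToyExpandableSegment:
    """Toy model: a fixed virtual range whose pages are mapped on demand."""

    def __init__(self, reserved_pages):
        self.reserved_pages = reserved_pages  # virtual address space, in pages
        self.mapped_pages = 0                 # physical pages currently mapped
        self.allocated_pages = 0              # pages handed out to callers

    def malloc(self, pages):
        """Hand out pages, mapping more physical memory only when needed."""
        needed = self.allocated_pages + pages
        if needed > self.reserved_pages:
            raise MemoryError("out of reserved virtual address space")
        self.mapped_pages = max(self.mapped_pages, needed)
        self.allocated_pages = needed
        return needed - pages  # page offset of the new block

    def free_last(self, pages):
        """Return the most recent pages to the cache; they stay mapped."""
        self.allocated_pages -= pages

    def release_cached(self):
        """Unmap cached (mapped but unallocated) pages, e.g. after an OOM."""
        released = self.mapped_pages - self.allocated_pages
        self.mapped_pages = self.allocated_pages
        return released


seg = ToyExpandableSegment(reserved_pages=8)
first = seg.malloc(2)              # maps pages 0-1
second = seg.malloc(3)             # maps pages 2-4
seg.free_last(3)                   # pages 2-4 are cached, still mapped
reused = seg.malloc(1)             # reuses a cached page, no new mapping
mapped_before_release = seg.mapped_pages
released = seg.release_cached()    # unmaps the cached tail
```

The point of the model is that addresses stay contiguous within one reserved range, while physical memory grows and shrinks underneath it.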

With Cuda Graph Trees and private memory pools, we need to be able to checkpoint the current state of the memory allocator after each graph capture, and to reapply that state before capturing a new graph after replaying a captured graph. This gives the new cuda graph capture access to the allocator state as of the point after the replay, so it can reuse empty blocks and allocate new ones.

As mentioned in a comment below, memory in a private pool is cached until the private pool is destroyed, and allocations can only grow from extra graph captures; any freeing of memory would result in invalid memory addresses and would break cuda graphs.

One implementation detail to note for unmapped blocks with expandable segments is that unmapped blocks are tracked in a member variable `unmapped` of a `BlockPool`. `unmapped` is *not* part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints, since we never free/unmap memory back to cuda; it is persisted across graph captures / replays.

Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved and kept for when it is needed to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint.

Reapplying the checkpoints works by freeing all allocated blocks and merging them into a single block per segment, then for each segment, we manually split and allocate all blocks from the checkpoint and then free the blocks marked as unallocated in the checkpoint state. For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068
Approved by: https://github.com/zdevito, https://github.com/eqy
2024-07-11 05:33:09 +00:00
da24823e06 [BE][EZ] Migrate to new dcp save and load APIs (#130475)
While playing with DCP for distributed inference, I found that we are still using deprecated DCP APIs, even in unit tests. So this PR switches to the new API with the unified lowercase "dcp" naming.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130475
Approved by: https://github.com/wz337
2024-07-11 04:13:39 +00:00
5835ff1ed5 [Easy][Inductor] Add comment for .min_order and .max_order (#130390)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130390
Approved by: https://github.com/anijain2305
2024-07-11 03:58:03 +00:00
a4576dad34 [reland][custom ops] infer schema (#130079)
Fixes #129617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130079
Approved by: https://github.com/zou3519
2024-07-11 03:39:07 +00:00
9f401187c7 [pipelining] Refactor test_schedule to fix "-k" (#130294)
This is kind of a short-sighted workaround and we should actually come
up with a way to fix this in general, but I got annoyed that I can't use
-k to filter tests in test_schedule, and realized it's because we jam
tests using the new MultiProcContinuousTest fixture together with
old-style tests.

For now I separate the two types of tests so -k works again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130294
Approved by: https://github.com/H-Huang
2024-07-11 03:18:02 +00:00
dfd1d1971e Fix warning when pickle.load torch.Storage (#130246)
Fixes https://github.com/pytorch/pytorch/issues/130242

Since `torch.save` does not use pickle for storages, the `torch.load` in `_load_from_bytes` should not ever be called when `torch.load`-ing a checkpoint. Setting weights_only=False explicitly in `_load_from_bytes` to avoid the weights_only warning when using the pickle module

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130246
Approved by: https://github.com/albanD
2024-07-11 02:40:29 +00:00
4fcfd475be [halide-backend] Update CI pin (#130258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130258
Approved by: https://github.com/eellison
2024-07-11 02:26:16 +00:00
df9d1b44e7 Preserve _numeric_debug_handle through deepcopy and re-export (#129287)
Summary:
* Added support for preserving `_numeric_debug_handle` during deepcopy; the args need to be remapped since _numeric_debug_handle refers
to the nodes in the graph

TODO: need to fully support re-export, currently the metadata for output node is not preserved

Test Plan:
python test/test_quantization.py -k test_deepcopy_preserve_handle
python test/test_quantization.py -k test_copy_preserve_handle

all related tests:
python test/test_quantization.py -k TestGenerateNumericDebugHandle

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129287
Approved by: https://github.com/zhxchen17
2024-07-11 02:19:41 +00:00
a205a53c50 Make sym_node log more useful (#130436)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130436
Approved by: https://github.com/Skylion007
2024-07-11 01:42:53 +00:00
79e34800c3 Suppress guards generated by empty_strided in ir_node_to_tensor (#130431)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130431
Approved by: https://github.com/IvanKobzarev
2024-07-11 01:19:11 +00:00
cyy
798b9652f7 [6/N] Replace c10::optional with std::optional (#130438)
Follows #130408

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130438
Approved by: https://github.com/janeyx99
2024-07-11 01:15:37 +00:00
5bc18ec0a1 [Inductor][CPP] Support vectorization of remainder (#129849)
**Summary**
When checking the vectorization status across 3 test suites, we found that some operators disabled vectorization with the message `Disabled vectorization: op: remainder`. In this PR, we add vectorization support for this op.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_remainder
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_int_div_vec
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129849
Approved by: https://github.com/jgong5, https://github.com/lezcano
ghstack dependencies: #130405
2024-07-11 00:50:50 +00:00
6adc725157 doc - fix the max_norm value in a note (#129687)
`max_norm=True` is currently written in the note, but `max_norm` can be a `float`, NOT a `bool` (as the [docstring](ec284d3a74/torch/nn/modules/sparse.py (L30)) says).
That note was created in #45595

The current pull request cleans it up.
The value `True` in the note can mislead users into thinking it can be a boolean.

In fact, a counter-intuitive behavior will happen if users try to set it to `False`:
it will be interpreted as 0, so the values of the embedding will become 0 - not what the users were expecting by setting it to `False`.
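
The effect is easy to see in plain Python, where `False` behaves as the number 0. The `renorm` helper below is a toy stand-in written for this note, not the embedding implementation.

```python
import math


def renorm(vector, max_norm):
    """Toy renormalization: rescale vector if its L2 norm exceeds max_norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm > max_norm:
        scale = max_norm / norm
        return [x * scale for x in vector]
    return list(vector)


row = [3.0, 4.0]              # L2 norm is 5.0
clipped = renorm(row, 1.0)    # rescaled so the norm is about 1.0
zeroed = renorm(row, False)   # False acts as 0, so every entry becomes 0.0
```

Passing `False` does not mean "no limit"; it means a limit of zero.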

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129687
Approved by: https://github.com/mikaylagawarecki, https://github.com/malfet
2024-07-11 00:01:17 +00:00
358da54be5 [inductor] Better messaging when triton version is too old (#130403)
Summary:
If triton is available, but we can't import triton.compiler.compiler.triton_key, then we see some annoying behavior:
1) If we don't actually need to compile triton, the subprocess pool will still spew error messages about the import failure; it's unclear to users if this is an actual problem.
2) If we do need to compile triton, we a) see the error messages from above and b) get a vanilla import exception without the helpful "RuntimeError: Cannot find a working triton installation ..."
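
The general pattern, checking once for a required attribute and raising a single actionable error, can be sketched with `importlib`. This is not the inductor code; `has_attr`, `require`, and the error wording are hypothetical.

```python
import importlib


def has_attr(module_name, attr_name):
    """Return True if module_name imports and exposes attr_name."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return False
    return hasattr(module, attr_name)


def require(module_name, attr_name):
    """Raise one actionable error instead of a bare import failure."""
    if not has_attr(module_name, attr_name):
        raise RuntimeError(
            f"{module_name}.{attr_name} is unavailable; "
            "the installed package may be missing or too old"
        )


# Callers that do not need the feature can branch on has_attr() quietly;
# callers that do need it call require() and get a clear message.
json_ok = has_attr("json", "dumps")
missing_attr = has_attr("json", "no_such_attribute")
missing_module = has_attr("no_such_module_for_this_sketch", "anything")
```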

Test Plan: Ran with and without torch.compile for a) recent version of triton, b) triton 2.2, and c) no triton. In all cases, verified expected output (success or meaningful error message)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130403
Approved by: https://github.com/eellison
2024-07-10 23:45:50 +00:00
ceedee23ec [DTensor] Included meshes in cross-mesh error msg (#130454)
The current error message is not actionable since we do not know which meshes are involved. Including the `__repr__` of each mesh in the error helps but is not always sufficient.

7d4cb21098/torch/distributed/device_mesh.py (L395-L408)

The problem is that `DeviceMesh.__eq__` is actually pretty involved, and we cannot see all parts of the `__eq__` criteria just from the `__repr__` (e.g. the thread ID).
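
A toy class shows how two objects can print identically and still compare unequal. `ToyMesh` is a hypothetical stand-in, not `DeviceMesh`; the only thing it borrows from the description above is that equality looks at a thread ID the repr does not show.

```python
class ToyMesh:
    """Hypothetical stand-in for a mesh; not the real DeviceMesh."""

    def __init__(self, device_type, ranks, thread_id):
        self.device_type = device_type
        self.ranks = tuple(ranks)
        self.thread_id = thread_id

    def __repr__(self):
        # The repr omits thread_id.
        return f"ToyMesh({self.device_type!r}, {list(self.ranks)})"

    def _key(self):
        return (self.device_type, self.ranks, self.thread_id)

    def __eq__(self, other):
        if not isinstance(other, ToyMesh):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self):
        return hash(self._key())


a = ToyMesh("cuda", [0, 1], thread_id=1)
b = ToyMesh("cuda", [0, 1], thread_id=2)
same_repr = repr(a) == repr(b)  # the printed forms match
equal = a == b                  # but the meshes are not equal
```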

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130454
Approved by: https://github.com/wz337, https://github.com/wanchaol
2024-07-10 22:40:57 +00:00
2abc7cc21b [inductor] switch AotCodeCompiler to new cpp_builder (#130127)
Changes:
1. Switch `AotCodeCompiler` to new cpp_builder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-10 22:28:29 +00:00
551b3c6dca Use irange to avoid -Wsign-compare errors (#130388)
Fixes meta-internal errors after importing #128753

(see [D59498679](https://www.internalfb.com/diff/D59498679))
```
fbcode/caffe2/aten/src/ATen/Context.cpp:286:34: error: comparison of integers of different signs: 'int' and 'size_t' (aka 'unsigned long') [-Werror,-Wsign-compare]
      for (auto index = 0; index < at::getNumGPUs(); index++) {
                           ~~~~~ ^ ~~~~~~~~~~~~~~~~
1 error generated.
```
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130388
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-07-10 22:07:51 +00:00
ce499eee0c Revert "Add API for open registration between operators and subclasses (and modes) (#130064)"
This reverts commit c23d103afae65588772cb30037ea4110f01f6f41.

Reverted https://github.com/pytorch/pytorch/pull/130064 on behalf of https://github.com/izaitsevfb due to fails internal builds, see [D59553526](https://www.internalfb.com/diff/D59553526) ([comment](https://github.com/pytorch/pytorch/pull/130064#issuecomment-2221587575))
2024-07-10 21:50:32 +00:00
83c95c48f7 Flight recorder data as JSON (#129505)
Summary:
Provide a new API to retrieve flight recorder data as JSON.
The one minor difference between the Pickle and JSON flight recorder APIs is
that the JSON API does not retrieve stack traces at the moment,
since they end up being far too much data.

Test Plan:
unit test

Differential Revision: [D59536460](https://our.internmc.facebook.com/intern/diff/D59536460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129505
Approved by: https://github.com/wconstab, https://github.com/d4l3k
2024-07-10 21:50:27 +00:00
86bca69c5f Revert "[custom_ops] expose torch.library.register_torch_dispatch (#130261)"
This reverts commit bb9a73f767526e0d23c60360db5212b6bed0e8bc.

Reverted https://github.com/pytorch/pytorch/pull/130261 on behalf of https://github.com/izaitsevfb due to depends on #130064 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130261#issuecomment-2221569707))
2024-07-10 21:43:28 +00:00
e14a0f45ed Revert "[reland][custom ops] infer schema (#130079)"
This reverts commit bef085bdfa62cc14589c70279de17108b2c2089f.

Reverted https://github.com/pytorch/pytorch/pull/130079 on behalf of https://github.com/izaitsevfb due to depends on #130064 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130079#issuecomment-2221561483))
2024-07-10 21:40:16 +00:00
46c52661bc Use a better cherry-pick strategy for stable pytorch w/ distribute changes (#129987)
1. Update the branch name from internal feedback
2. Only cherry-pick in the changes to these folders
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129987
Approved by: https://github.com/seemethere
2024-07-10 20:55:36 +00:00
80a421a54d [TD] Pin numpy to 1.26.0 in indexer (#130442)
Temporarily pin 1.26.0 to get the workflow working while I go sort out which dependencies need to be updated

Succeeding run: https://github.com/pytorch/pytorch/actions/runs/9877733366/job/27280052419?pr=130442

Tested by adding my branch to the trust relationship for the policy and removing the environment
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130442
Approved by: https://github.com/atalman, https://github.com/malfet
2024-07-10 20:52:24 +00:00
cd2638be09 Revert "[pipelining] Refactor test_schedule to fix "-k" (#130294)"
This reverts commit 1352f13f7827cd1862a6e0507fb17dccddf73dc2.

Reverted https://github.com/pytorch/pytorch/pull/130294 on behalf of https://github.com/clee2000 due to broke lint https://github.com/pytorch/pytorch/actions/runs/9879591538/job/27286156803 ([comment](https://github.com/pytorch/pytorch/pull/130294#issuecomment-2221376073))
2024-07-10 20:26:58 +00:00
b81767161e Revert "[aota] Needs autograd if an input requires_grad, agnostic to enable_grad (#128890)"
This reverts commit 08d5423d339ac4b302f8ae6b63b334e032104753.

Reverted https://github.com/pytorch/pytorch/pull/128890 on behalf of https://github.com/clee2000 due to broke inductor/test_flex_attention https://github.com/pytorch/pytorch/actions/runs/9879109008/job/27286339304 08d5423d33 test was not run on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/128890#issuecomment-2221368245))
2024-07-10 20:22:24 +00:00
1b3b4c2fb9 [runtime asserts] deduplicate runtime asserts & CSE (#128599) (#130380)
original PR: https://github.com/pytorch/pytorch/pull/128599 (re-created after revert + poisoned diff train)

Summary:
This PR adds deduplication and CSE for runtime asserts. Existing size computation in the graph is CSE'd along with added runtime asserts, and redundant asserts are removed. Shape calls on intermediate tensors are also turned into compute on input sizes if possible, allowing intermediate tensors to be freed earlier. For example:
```
z = torch.cat([x, x], dim=0)  # 2*s0
w = z.repeat(y.shape[0])  # 2*s0*s1
_w = w.shape[0]

s0 = x.shape[0]
s1 = y.shape[0]
_w0 = 2 * s0
_w = _w0 * s1
```

Additionally, constrain_range calls are deduplicated. Single-symbol bound checks for unbacked symbols (e.g. u0 >= 0, u0 <= 5) and sym_constrain_range.default calls are also removed, since they accumulate range info in the ShapeEnv, and are replaced with two _assert_scalar.default calls that check the min/max bounds. For example:
```
torch.sym_constrain_range_for_size(n, min=2, max=16)
torch.sym_constrain_range(n, min=4, max=20)
torch._check(n >= 0)
torch._check(n >= 3)
torch._check(n <= 14)

torch.sym_constrain_range_for_size(n)
torch._check(n >= 4)
torch._check(n <= 14)
```
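
The bound merging in the example above can be sketched in plain Python. This is only an illustration of intersecting ranges, not the actual pass; `merge_bounds` and the tuple encoding of each constraint are hypothetical names chosen for this sketch.

```python
import math


def merge_bounds(constraints):
    """Intersect (lower, upper) bounds for one symbol; None means unbounded."""
    lower, upper = -math.inf, math.inf
    for lo, hi in constraints:
        if lo is not None:
            lower = max(lower, lo)  # keep the tightest lower bound
        if hi is not None:
            upper = min(upper, hi)  # keep the tightest upper bound
    return lower, upper


# Hypothetical encoding of the five calls from the example above.
constraints = [
    (2, 16),     # torch.sym_constrain_range_for_size(n, min=2, max=16)
    (4, 20),     # torch.sym_constrain_range(n, min=4, max=20)
    (0, None),   # torch._check(n >= 0)
    (3, None),   # torch._check(n >= 3)
    (None, 14),  # torch._check(n <= 14)
]
merged = merge_bounds(constraints)
```

The merged pair corresponds to the two remaining checks in the deduplicated output (`n >= 4` and `n <= 14`).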

Test Plan:
contbuild & OSS CI, see 940e4477ab

Original Phabricator Test Plan:
Imported from GitHub, without a `Test Plan:` line.

Differential Revision: D59543603

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130380
Approved by: https://github.com/izaitsevfb
2024-07-10 19:23:37 +00:00
1352f13f78 [pipelining] Refactor test_schedule to fix "-k" (#130294)
This is kind of a short-sighted workaround and we should actually come
up with a way to fix this in general, but I got annoyed that I can't use
-k to filter tests in test_schedule, and realized it's because we jam
tests using the new MultiProcContinuousTest fixture together with
old-style tests.

For now I separate the two types of tests so -k works again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130294
Approved by: https://github.com/H-Huang
2024-07-10 18:32:51 +00:00
cf090e222e Update torch-xpu-ops pin (ATen XPU implementation) (#130333)
1. Fixing compilation error due to PyTorch update. The helper function prototype changes, `checkIndexTensorTypes`.
2. Fixing compilation error due to PyTorch update. PyTorch forced -Werror=unused-function.
3. Fixing inductor case failure due to CUDA bias implementation in the case. https://github.com/pytorch/pytorch/issues/130426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130333
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-07-10 18:10:53 +00:00
4b7ee51260 [BE][MPS] Cleanup optimizers code (#130453)
- Fix C++20 forward compatibility warnings, namely
```
warning: use of function template name with no prior declaration in function call with explicit template arguments is a C++20 extension [-Wc++20-extensions]
  multi_tensor_apply_for_fused_optimizer<2, 512>(kernel_name,
```
- Use nested namespaces
- Do not explicitly specify `at::` namespace for functions already implemented inside of that namespace
- Use more convenience methods (rather than call by hand)
- Use C++14 `return f();` for void functions

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130453
Approved by: https://github.com/Skylion007
2024-07-10 18:00:05 +00:00
08d5423d33 [aota] Needs autograd if an input requires_grad, agnostic to enable_grad (#128890)
Reland of:  https://github.com/pytorch/pytorch/pull/128016

Summary from previous PR:
We assume only two possible mutually exclusive scenarios:

1. Running the compiled region for training (any of the inputs has requires_grad).
   Produced differentiable outputs should have requires_grad.

2. Running the compiled region for inference (none of the inputs has requires_grad).
   No outputs have requires_grad.

Even if the user runs the region under no_grad() but passes an input Tensor with requires_grad, we take the training scenario (1).

With current state that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only that any of inputs requires_grad
2/ if needs_autograd => trace_joint (we are in training scenario 1) => always run the compiled region under `with torch.enable_grad()`
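
A minimal sketch of the decision described in 1/ and 2/, assuming a stand-in `FakeInput` class instead of real tensors. The names `needs_autograd` and `choose_graph` here are illustrative and do not mirror the actual code in aot_autograd.py; the alias handling discussed further down is not modeled.

```python
from dataclasses import dataclass


@dataclass
class FakeInput:
    """Hypothetical stand-in for a tensor; only tracks requires_grad."""
    requires_grad: bool = False


def needs_autograd(inputs, grad_enabled):
    # The grad mode is deliberately ignored: only the inputs decide.
    del grad_enabled
    return any(inp.requires_grad for inp in inputs)


def choose_graph(inputs, grad_enabled):
    # Training scenario traces the joint graph; otherwise inference.
    if needs_autograd(inputs, grad_enabled):
        return "training (trace_joint)"
    return "inference"
```

Under this sketch, `choose_graph([FakeInput(True)], grad_enabled=False)` still picks the training path, which is the behavior the summary describes for a region run under no_grad().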

Changes in partitioner?

Inference and Training graphs had difference in return container, list/tuple.
The changes in partitioner are done to unify and return always tuple.
As a result - some changes in test_aotdispatch.py for graph contents list -> tuple.

Why was it reverted?

There was a regression of hf_Reformer model on inference.
```
TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode
```

Because one of the compiled graphs contained outputs, which are aliases to the inputs that are nn.Parameter(requires_grad=True).

Even though the torchbench inference benchmark runs inside `with torch.no_grad()`, alias ops (for hf_Reformer, specifically expand) preserve requires_grad.

As a result we started compiling training graph instead of inference.

Fix for view ops:

If we have outputs that are aliases of inputs that require grad, the fact that those outputs require grad is not a reason to generate a training graph.

This is handled in aot_autograd.py, where output_and_mutation_safe are calculated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890
Approved by: https://github.com/bdhirsh
2024-07-10 17:56:32 +00:00
0beeac35fa Revert "[cond] inlining into one of the branches when pred is a python constant (#128709)"
This reverts commit fe3e6878c4bb2a6001045c179fd7fa9838242558.

Reverted https://github.com/pytorch/pytorch/pull/128709 on behalf of https://github.com/ydwu4 due to causing error on truck due to a land racing: fe3e6878c4 ([comment](https://github.com/pytorch/pytorch/pull/128709#issuecomment-2221104043))
2024-07-10 17:47:19 +00:00
b4b7477d3f Fix CPU Annotation Overlapping with Python Events (#129599)
Summary:
Currently we have an issue where CPU User annotations can overlap with python events in the event that a python event calls step() within the function itself. To combat this, we can move the left side of the user annotation to the beginning of the parent python function. We do this because when instantiating the profiler we already start on step 0.
To implement this, we start by collecting all instances of ProfilerStep during post processing. Since TorchOps and Python events are sorted already, we can easily check if the current python event partially overlaps with the current ProfilerStep and, if so, alter the start time of the current ProfilerStep. We then move to the next ProfilerStep and continue iterating through all the python events. This keeps the time complexity of adding events to 'out' at O(s + n) -> O(n) post sorting, where "s" is the number of ProfilerSteps and "n" is the length of all events.
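
The sweep described above can be sketched as follows. This is a simplified illustration, not the profiler code: `Span` and `align_steps` are hypothetical names, and the sketch assumes both lists are sorted by start time and that the Python events do not nest.

```python
from dataclasses import dataclass


@dataclass
class Span:
    name: str
    start: int
    end: int


def align_steps(steps, py_events):
    """Move a step's start back to the start of a partially overlapping event.

    Assumes both lists are sorted by start and py_events are non-nested,
    so a single forward-only pointer visits each event at most once.
    """
    i = 0
    for step in steps:
        # Skip Python events that finish before this step begins.
        while i < len(py_events) and py_events[i].end <= step.start:
            i += 1
        if i < len(py_events):
            ev = py_events[i]
            # Partial overlap: the event starts before the step and ends inside it.
            if ev.start < step.start < ev.end < step.end:
                step.start = ev.start
    return steps


steps = [Span("ProfilerStep#1", 30, 80), Span("ProfilerStep#2", 80, 120)]
py_events = [Span("train_step", 10, 50), Span("helper", 90, 100)]
aligned = align_steps(steps, py_events)
```

Because the pointer into `py_events` never moves backward, the sweep stays linear in the number of steps plus events once the inputs are sorted.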

Test Plan:
Added unit test in which step() is called midway through a function. Afterwards, we print out a trace and then load the json to check that there are no overlaps. Also make sure that there is no regression in performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129599
Approved by: https://github.com/aaronenyeshi
2024-07-10 17:33:56 +00:00
6b3460ae0d fix discrepancy from the export of #126601 (#130296)
#126601 (internally [D58103182](https://www.internalfb.com/diff/D58103182)) was exported missing one class definition. This PR brings github repo in sync with fbcode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130296
Approved by: https://github.com/kit1980, https://github.com/seemethere, https://github.com/malfet
2024-07-10 17:26:44 +00:00
7d4cb21098 Decompose expand_copy and permute_copy (#129476)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129476
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-07-10 17:12:01 +00:00
a7aa066b09 Fix link to dynamo in torch/fx readme (#130233)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130233
Approved by: https://github.com/janeyx99
2024-07-10 17:00:49 +00:00
a09910d3a9 add strobelight profile links to tlparse (#129703)
Summary: title.

Test Plan:
TORCH_TRACE=~/my_trace_log_dir buck2 run  @//mode/inplace  @//mode/opt  //caffe2/fb/strobelight:compile_time_profiler_example

tlparse ~/my_trace_log_dir

result
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpBrQJcL/index.html

Differential Revision: D59130581

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129703
Approved by: https://github.com/aorenste
2024-07-10 16:53:21 +00:00
fe3e6878c4 [cond] inlining into one of the branches when pred is a python constant (#128709)
When the input predicate is a python constant, we specialize into one of the branches and warn users that torch.cond is not preserving the dynamism. The previous behavior is that we baked in True/False in the cond operator. This can be confusing. In this PR, we change it to be specializing into one of the branches when the inputs are constants.

We additionally change the naming of cond operator to default one without overriding its name. This allows better testing on de-serialized graph.
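
A rough sketch of the specialization described above, in plain Python. `cond_sketch` is a hypothetical helper, not `torch.cond`; it only models the constant-predicate case, and the warning text is made up for illustration.

```python
import warnings


def cond_sketch(pred, true_fn, false_fn, operands):
    """Illustrative only: specialize when pred is a Python bool."""
    if isinstance(pred, bool):
        # A constant predicate cannot stay dynamic, so pick one branch
        # up front and tell the user that dynamism is lost.
        warnings.warn(
            "pred is a Python constant; specializing into one branch",
            stacklevel=2,
        )
        return true_fn(*operands) if pred else false_fn(*operands)
    raise NotImplementedError("non-constant predicates are outside this sketch")
```

With a constant predicate only the chosen branch runs, so a traced graph would contain that branch's ops directly rather than a cond operator with a baked-in True/False.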

Test Plan:
The predicate in some existing tests is the result of a shape comparison. When no dynamic shape is involved, the predicate is a python bool. To fix them, we either change the predicate to be some data-dependent tensor or change the test to check cond is specialized as one of the branches.

Differential Revision: [D59589709](https://our.internmc.facebook.com/intern/diff/D59589709)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128709
Approved by: https://github.com/zou3519
2024-07-10 16:44:27 +00:00
9d94b122f0 Fix usage of USE_ROCM when calling cudaFuncGetAttributes (#130441)
This fixes MSVC build regression introduced by https://github.com/pytorch/pytorch/pull/129710 as VC++ fails to unroll nested defines in the specific order and fails with
```
C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\int4mm.cu(984): error: "#" not expected here
    do { const cudaError_t __err = cudaFuncGetAttributes( &funcAttr, #if defined(USE_ROCM) (void *)func #else func #endif ); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "C:\\actions-runner\\_work\\pytorch\\pytorch\\builder\\windows\\pytorch\\aten\\src\\ATen\\native\\cuda\\int4mm.cu", __func__, static_cast<uint32_t>(991), true); } while (0);
```

Fixes https://github.com/pytorch/pytorch/issues/130437

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130441
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-07-10 16:30:43 +00:00
ae73489b7d [codemod] Use C++17 [[fallthrough]] in 1 file inc caffe2/aten/src/ATen/native/cuda/DistributionTemplates.h (#130433)
Test Plan: Sandcastle

Reviewed By: meyering

Differential Revision: D59528276

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130433
Approved by: https://github.com/malfet
2024-07-10 16:30:37 +00:00
bef085bdfa [reland][custom ops] infer schema (#130079)
Fixes #129617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130079
Approved by: https://github.com/zou3519
2024-07-10 16:18:36 +00:00
ce4d95143f Add scale kwarg to FlexAttention (and some changes that get FlexAttention numerics to be as accurate as FA2) (#130250)
After this PR, our numerical error is within 3% of FA2 for forward and gradients. Prior, for `dq` our numerical error was 30% higher. I also added a `PRESCALE_QK` kernel option that increases perf by about 3-4% but incurs about 20-30% more numerical error.

![image](https://github.com/pytorch/pytorch/assets/6355099/7b5ff44e-219b-4a05-8a1b-2a0182c01ab2)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130250
Approved by: https://github.com/drisspg
ghstack dependencies: #130227
2024-07-10 16:14:45 +00:00
a7715e36de Add block mask utility support for batches and heads > 1 (#130227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130227
Approved by: https://github.com/yanboliang
2024-07-10 16:14:45 +00:00
c83b941141 [export] add dynamic shapes argument and infer from graph nodes (#129928)
Fixes the example in #118304 for `torch._functorch.aot_autograd.aot_export_module` and `torch.export.export`.

On a high level, the issue is caused by not detecting fake_mode when there's no input.

Change plan:

1) we add a  `dynamic_shapes: Union[bool, None] = None` arg to `aot_export_module` and `_aot_export_function`.

2) if the input is not a graph module, then we can only rely on this `dynamic_shapes` input arg.

3) If the input is a graph module, then we can traverse the graph and check.

4) So we check if the input mod is a graph module or just a module, and do 2) or 3) depending on the type.

Fixes #129927

Bug source: dynamo's fake_mode is not detected correctly in `_convert_input_to_fake` in `_traced.py` when there's no input to the graph. So in `_strict_export_lower_to_aten_ir`, we create another fake_mode, and `dynamo_fake_mode` is not the same as the fake_mode used by dynamo.

Change plan:
check `gm_torch_level` graph's node meta "example_value" for fake mode in addition.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129928
Approved by: https://github.com/angelayi
2024-07-10 15:51:05 +00:00
cyy
d31f866b33 [BE] [CMake] Remove AT_CORE_STATIC_WINDOWS option (#130409)
AT_CORE_STATIC_WINDOWS was inherited from torch and is not used anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130409
Approved by: https://github.com/malfet
2024-07-10 15:50:47 +00:00
81ea298600 Wrap the test func with try/except to always call destroy_process_group (#124961)
This avoids the PG warning about not calling destroy_process_group.
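
A minimal sketch of the wrapping idea, assuming a generic cleanup callable in place of the real `destroy_process_group`. The decorator name `always_cleanup` is hypothetical, and the sketch uses try/finally as a stand-in for the try/except in the actual change.

```python
import functools


def always_cleanup(cleanup):
    """Hypothetical decorator: run `cleanup` whether or not the test raises."""
    def decorator(test_fn):
        @functools.wraps(test_fn)
        def wrapper(*args, **kwargs):
            try:
                return test_fn(*args, **kwargs)
            finally:
                # Runs on success and on failure; the exception still propagates.
                cleanup()
        return wrapper
    return decorator


calls = []


@always_cleanup(lambda: calls.append("destroy_process_group"))
def failing_test():
    raise RuntimeError("boom")
```

Calling `failing_test()` still raises, but the cleanup is recorded first, so the process group would be torn down even when the test body fails.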

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124961
Approved by: https://github.com/wanchaol, https://github.com/wz337
2024-07-10 15:36:38 +00:00
81df076bfd Fix Apple crash when running PyTorch with Metal API validation turned on (#130377)
Fixes #130376 (at least, for my usage)

There may be other places in the code base where `-setBytes:length:` is called with a length of 0 besides this, but this is the case that has triggered for me. Please let me know if there are any specific tests I should run.
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130377
Approved by: https://github.com/malfet
2024-07-10 15:07:47 +00:00
417c83e7cf [ROCm] Unskip scaled_dot_product_attention tests on ROCm (#127966)
Needle has moved quite a bit on the ROCm backend front. This PR intended to examine the tests referenced in the following issue: https://github.com/pytorch/pytorch/issues/96560

This is a follow-up PR to https://github.com/pytorch/pytorch/pull/125069 that unskips the next batch of tests referenced by the aforementioned issue. No explicit source changes were needed, as the tests passed immediately after unskipping.

The tests previously marked with xfail have now been modified to not expect a failure iff running on ROCm as they now pass. Behavior is unchanged for them on other architectures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127966
Approved by: https://github.com/malfet
2024-07-10 14:53:41 +00:00
b38de2f9e2 [decomps] Fix aten._to_copy decomp (#130381)
`aten._to_copy` can receive a python number as input. This occurs in
torch.compile support for vmap (see #130188). Previously, this would
raise an assertion error. This PR changes it so that if we see a python
number, we call torch.scalar_tensor on it first (h/t @bdhirsh).

Fixes #130362

Fixes #130188

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130381
Approved by: https://github.com/Chillee
2024-07-10 14:34:28 +00:00
cyy
bd3452f431 [5/N] Change #include <c10/util/Optional.h> to #include <optional> (#130408)
Follows  #130329

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130408
Approved by: https://github.com/malfet
2024-07-10 14:29:43 +00:00
99967e1119 [MPS][TYPE_PROMOTION] Fix Clamp (#130226)
Summary:
1. Fixed #130201 by adding type promotion.
2. Added proper tests.
3. Found torch's type promotion is different from numpy as follows:

```python
import torch
import numpy as np
np.clip(np.array([1], dtype=np.float32), np.array([1], dtype=np.int32), None).dtype  # dtype('float64')
torch.clamp(torch.tensor([1], dtype=torch.float32), torch.tensor([1], dtype=torch.int32)).dtype  # torch.float32
```

~Not sure the proper way to handle it, it causes numpy ref tests to fail.~
Reason here, so think I'm gonna xfail it:
3c1cf03fde/test/test_ops.py (L260-L264)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130226
Approved by: https://github.com/malfet
2024-07-10 14:27:39 +00:00
6ce0bd7d3b [HOP] Use user directed names for variables where possible (#130271)
Afaict the previous check was too strict. Removing it passes all the
mutation tests (mutation checks happen via the TensorVariable's mutable_local).

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130271
Approved by: https://github.com/Chillee, https://github.com/ydwu4
2024-07-10 13:59:20 +00:00
637cc8d27f Revert "update the input weight of _convert_weight_to_int4pack to [n][k / 2] uint8 (#129940)"
This reverts commit 6367f02a0e136ced05c665301bcdaa4d76690457.

Reverted https://github.com/pytorch/pytorch/pull/129940 on behalf of https://github.com/albanD due to Broke rocm tests on main 6367f02a0e ([comment](https://github.com/pytorch/pytorch/pull/129940#issuecomment-2220554681))
2024-07-10 13:48:32 +00:00
a1590e16df Add single Python 3.10, single Cuda 12.1 build with dependencies included (#130349)
Build large wheel for Python 3.10, CUDA 12.1 that will be used in Colab. Build name: ``manywheel-py3_11-cuda12_1-full-build``

We still have all code to support the full build in builder repo, here:
https://github.com/pytorch/builder/blob/main/manywheel/build_cuda.sh#L151

Test:
```
import sys
import torch
sys.version_info
print(torch.__version__)
sys.version_info

2.3.0+cu121
sys.version_info(major=3, minor=10, micro=12, releaselevel='final', serial=0)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130349
Approved by: https://github.com/malfet
2024-07-10 12:57:39 +00:00
cb2bce98de [MPS][BE] Reduce the number of parameters encoded for no momentum fused SGD (#130131)
Summary:

1. Reduce the number of parameters encoded for no momentum fused SGD
2. Use convenience functions `mtl_setBuffer` and `mtl_setBytes`.

Just a BE, no significant performance difference is observed.

Test plan: Relying on CI signals
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130131
Approved by: https://github.com/janeyx99, https://github.com/malfet
2024-07-10 07:58:38 +00:00
6367f02a0e update the input weight of _convert_weight_to_int4pack to [n][k / 2] uint8 (#129940)
This PR updates the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8` for CPU, CUDA, and MPS, which helps decouple int4 model checkpoints from specific ISAs and platforms in `gpt-fast`. The advantage is that an int4 model checkpoint can be shared across different test machines without being re-generated on one particular platform. In addition, the size of the input `weight` is reduced to `1 / 8`.

Before this PR, the packed weight was stored in a CUDA-specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. The CPU packed weight was viewed as the SAME shape but stored in a different layout: `[n/64][k][32]`, dtype uint8. The weight was strongly coupled with platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar), and users could not use a weight generated on one ISA or platform on another, because the compute format differs when the weight is loaded onto a device.
![image](https://github.com/pytorch/pytorch/assets/61222868/64971c4b-29b9-42cf-9aeb-ffa01cea93dd)

Now, we use a common serialized layout (`[n][k/2] uint8`) across devices and ISAs as the input `weight` of `_convert_weight_to_int4pack`, and each backend chooses how to interpret it as its compute layout.
![image](https://github.com/pytorch/pytorch/assets/61222868/c7990761-c723-417b-aca2-7c60db7785c7)
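
As an illustration of the `[n][k/2] uint8` idea, the sketch below packs two 4-bit values into each byte using plain Python lists. The nibble order (even column in the high nibble) is an assumption made for this sketch, and `pack_int4_rows` / `unpack_int4_rows` are hypothetical helpers, not PyTorch APIs.

```python
def pack_int4_rows(weight):
    """Pack an [n][k] matrix of values in [0, 15] into [n][k // 2] bytes.

    Assumption: the even column goes to the high nibble, the odd column
    to the low nibble.
    """
    packed = []
    for row in weight:
        if len(row) % 2:
            raise ValueError("k must be even")
        packed.append(
            [(row[j] << 4) | row[j + 1] for j in range(0, len(row), 2)]
        )
    return packed


def unpack_int4_rows(packed):
    """Inverse of pack_int4_rows under the same nibble-order assumption."""
    return [
        [nibble for byte in row for nibble in (byte >> 4, byte & 0xF)]
        for row in packed
    ]


weight = [[1, 2, 3, 4], [15, 0, 7, 8]]
packed = pack_int4_rows(weight)
```

Storing `n * k` values as int32 takes `4 * n * k` bytes, while the packed form takes `n * k / 2` bytes, which is where the `1 / 8` size ratio comes from.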

### Performance
Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores)
There is no obvious regression of this PR.
![image](https://github.com/pytorch/pytorch/assets/61222868/6046dcf4-920b-4c63-9ca3-1c8c3cafebde)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima
2024-07-10 07:38:42 +00:00
e29657efb6 [Inductor][CPP] Fix typo in merge rules (#130405)
**Summary**
There is a typo of the `CPU Inductor` group in `merge_rules.yaml` which should be `test/inductor/test_cpu_repro.py` instead of `test/inductor/test_cpu_repo.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130405
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-07-10 07:13:03 +00:00
cyy
10c7f037fe Simplify c10::string_view (#130009)
Make c10::basic_string_view a subclass of std::basic_string_view for easier replacement in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130009
Approved by: https://github.com/ezyang
2024-07-10 05:02:16 +00:00
a17d1e5322 Fix static py::object dangling pointer with py::gil_safe_call_once_and_store (#130341)
Fix static `py::object`s with `py::gil_safe_call_once_and_store`.

The following code will leak a `py::object` whose destructor is called when the program shuts down. The destructor calls `Py_DECREF(obj.m_ptr)`, which may raise a segmentation fault.

```c++
void func() {
    static py::object obj = py::module_::import("foo").attr("bar");

    ...
}
```

The correct code is to use raw pointers rather than the instance.

```c++
void func() {
    static py::object* obj_ptr = new py::object{py::module_::import("foo").attr("bar")};
    py::object obj = *obj_ptr;

    ...
}
```

This PR uses the `py::gil_safe_call_once_and_store` function from `pybind11`, which can run arbitrary initialization code only once under the Python GIL thread safely.

```c++
void func() {
    PYBIND11_CONSTINIT static py::gil_safe_call_once_and_store<py::object> storage;
    py::object obj = storage
                         .call_once_and_store_result(
                             []() -> py::object {
                                 return py::module_::import("foo").attr("bar");
                             }
                         )
                         .get_stored();

    ...
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130341
Approved by: https://github.com/ezyang
2024-07-10 04:23:37 +00:00
5abe7ebd41 Add new (private) capture_triton API (#130178)
When applied to a triton kernel, capture_triton allows the triton kernel
to be captured when tracing with make_fx. It does this by transforming the
call to the triton kernel into a call to the
triton_kernel_wrapper_mutation HOP, which can actually be traced into a
graph via make_fx.

We have two main use cases for this:
- non-strict export doesn't use Dynamo, but people want to use
  non-strict export to export programs with triton kernels.
  non-strict export uses make_fx tracing, so this is a necessary step in
  that direction.
- People want to write inductor passes that replace a sequence of
  operators with a call to a function that may contain a triton kernel.
  The way these passes work today is that we have a FX graph and want to
  replace a subgraph of it with a new subgraph. We obtain said subgraph
  from calling make_fx on the function; this won't work on raw triton
  kernels but will work if one uses capture_triton.

Test Plan:
- I wrote some manual tests to run make_fx over two of the triton
  kernels in test_triton_kernels. It would be nice to be able to run
  make_fx through all of the tests in the file but I'm not sure how to
  do that refactor right now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130178
Approved by: https://github.com/oulgen
ghstack dependencies: #130177
2024-07-10 03:09:29 +00:00
99c68f7bea Refactor TritonKernelVariable's logic so it can be shared (#130177)
TritonKernelVariable's logic tells us how to go from a user-defined
triton kernel and a grid to a call to the triton_kernel_wrapper_mutation
HOP. We want to re-use this in a setting without Dynamo; in the next PR
up, we create a new decorator (capture_triton) that, when applied to a
triton kernel, transforms a call to the triton kernel into a call
to the triton_kernel_wrapper_mutation HOP.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130177
Approved by: https://github.com/oulgen, https://github.com/ydwu4
2024-07-10 03:09:29 +00:00
868d9a4f12 [cpu][flash attention] fix nan issue (#130014)
Fixes #127055.

NaNs are generated in flash attention because of the computation of `std::exp((-inf) - (-inf))` and `+/-inf * 0` in lazy softmax. We fix the issue by avoiding the related calculation.
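
The two NaN sources can be reproduced with Python floats. The guard below is only one hypothetical way to avoid the calculation and is not taken from the actual fix; `naive_rescale` and `guarded_rescale` are illustrative names.

```python
import math

NEG_INF = float("-inf")


def naive_rescale(old_max, new_max):
    # (-inf) - (-inf) is NaN, so exp() of it is NaN as well.
    return math.exp(old_max - new_max)


def guarded_rescale(old_max, new_max):
    # Hypothetical guard: if the old running max is -inf, nothing has been
    # accumulated yet, so return 0.0 without computing the difference.
    if old_max == NEG_INF:
        return 0.0
    return math.exp(old_max - new_max)


# The second pattern named above: infinity times zero is also NaN.
inf_times_zero = float("inf") * 0.0
```

For finite inputs both helpers agree; they differ only when the old maximum is `-inf`, which is the case a fully masked row would hit.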

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130014
Approved by: https://github.com/jgong5, https://github.com/drisspg
2024-07-10 02:33:26 +00:00
68751799b8 Add decompositions for copy variants of view ops (#128416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128416
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-07-10 01:39:09 +00:00
cyy
007e75958f [4/N] Change #include <c10/util/Optional.h> to #include <optional> (#130329)
Follows #130300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130329
Approved by: https://github.com/ezyang
2024-07-10 01:26:50 +00:00
9912209743 check if the input fx graph of aot_compile return tuple (#129824)
Fixes https://github.com/pytorch/pytorch/issues/129719

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129824
Approved by: https://github.com/angelayi, https://github.com/yushangdi
2024-07-10 01:18:55 +00:00
cyy
85b8503621 [Caffe2] Remove Caffe2 documentation (#130089)
Due to the removal of Caffe2 code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130089
Approved by: https://github.com/r-barnes, https://github.com/albanD
2024-07-10 00:52:16 +00:00
cyy
7a3ab1fe79 [structural binding][7/N] Replace std::tie with structural binding (#130216)
Follows #120353

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130216
Approved by: https://github.com/albanD
2024-07-10 00:52:04 +00:00
fb696bf264 Revert "Add block mask utility support for batches and heads > 1 (#130227)"
This reverts commit 64139987c0588f2eef198a0b9fd6904783b37b2c.

Reverted https://github.com/pytorch/pytorch/pull/130227 on behalf of https://github.com/izaitsevfb due to breaks internal builds, please see D59498662 ([comment](https://github.com/pytorch/pytorch/pull/130227#issuecomment-2218842579))
2024-07-09 22:34:39 +00:00
44815ed67e Revert "Add scale kwarg to FlexAttention (and some changes that get FlexAttention numerics to be as accurate as FA2) (#130250)"
This reverts commit 3e48d927332915e1ecbd3c7f2c6b9680428f181e.

Reverted https://github.com/pytorch/pytorch/pull/130250 on behalf of https://github.com/izaitsevfb due to depends on #130227 which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130250#issuecomment-2218840674))
2024-07-09 22:32:54 +00:00
5b5a1f5202 Add on to Mark some test_decomp tests as slow on win #130260 (#130337)
An add on to https://github.com/pytorch/pytorch/pull/130260

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130337
Approved by: https://github.com/malfet
2024-07-09 22:30:53 +00:00
fd43a2ba27 Forward fix for test_compare_cpu_cuda_float32 (#130360)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130360
Approved by: https://github.com/malfet
ghstack dependencies: #128238
2024-07-09 22:28:39 +00:00
3be4922a9d Revert "[HOP] Use user directed names for variables where possible (#130271)"
This reverts commit adb65682affdfc37f724c02ea8c8930d3925fc07.

Reverted https://github.com/pytorch/pytorch/pull/130271 on behalf of https://github.com/clee2000 due to broke inductor/test_flex_attention https://github.com/pytorch/pytorch/actions/runs/9863205414/job/27236960046 adb65682af Test not run on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/130271#issuecomment-2218832643))
2024-07-09 22:24:39 +00:00
37d4d04309 [torchscript] Add logging for model id. (#130118)
Summary: as title.

Test Plan: CI

Reviewed By: angelayi

Differential Revision: D59348256

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130118
Approved by: https://github.com/BoyuanFeng
2024-07-09 22:24:16 +00:00
fb5cb17fbe [torch][fx] Add normalize_args constructor argument to FxGraphDrawer (#130348)
Summary:
When writing out Graphviz files for graphs, sometimes the arguments are all
in a row and it's unclear which is which. Like for `aten.conv2d`, someone might not
remember the stride, padding, dilation order.

Add an option `normalize_args` (defaults to False) to normalize all args into kwargs.
This should help the readability of a graph.
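
To illustrate what normalizing args into kwargs means, here is a small sketch using `inspect.signature`. It is not how FxGraphDrawer implements the option; `normalize_to_kwargs` and `conv2d_like` are hypothetical names, and the parameter list only imitates a conv2d-style signature.

```python
import inspect


def normalize_to_kwargs(fn, args, kwargs):
    """Bind positional args to parameter names and fill in defaults."""
    bound = inspect.signature(fn).bind(*args, **kwargs)
    bound.apply_defaults()
    return dict(bound.arguments)


def conv2d_like(input, weight, bias=None, stride=1, padding=0, dilation=1, groups=1):
    """Hypothetical stand-in with a conv2d-style parameter order."""


normalized = normalize_to_kwargs(conv2d_like, ("x", "w", None, 2, 1), {})
```

After normalization every value carries its parameter name, so a reader no longer has to remember whether the fourth positional argument is the stride or the padding.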

Differential Revision: D59529417

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130348
Approved by: https://github.com/mcremon-meta
2024-07-09 22:16:54 +00:00
df83142131 [CCA][Memory Snapshot] Stop duplicating annotations to all device_traces (#130315)
Summary: This diff fixes a bug, where all record_annotations will save a TraceEntry to each of the device_traces. Instead, we should only save annotations to the current device_trace that is being called by the thread calling the native allocator's recordAnnotation.

Test Plan: CI and ran workloads on MVAI WPR FBR.

Reviewed By: zdevito

Differential Revision: D59477339

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130315
Approved by: https://github.com/zdevito
2024-07-09 21:38:47 +00:00
bb9a73f767 [custom_ops] expose torch.library.register_torch_dispatch (#130261)
This is the API for defining the interaction between a torch_dispatch
class and a custom op. Taking API bikeshedding.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130261
Approved by: https://github.com/albanD
ghstack dependencies: #130064
2024-07-09 21:11:27 +00:00
c23d103afa Add API for open registration between operators and subclasses (and modes) (#130064)
We add torch.library.Library._register_torch_dispatch_rule. Here, a user
can provide us a specific rule to run for a specific
(torch_dispatch_class, operator) pair. The motivation is that a user
might want to extend a subclass/mode but may not have access to the
source code of the subclass/mode.

I'll make this public in a follow-up PR if we think the approach and API
is good.

Keep in mind that many subclasses will likely deliver their own open
registration solution (DTensor has register_sharding_prop_rule and NJT
has register_jagged_op); _register_torch_dispatch_rule is meant as a
catch-all open registration mechanism for when the subclass hasn't
provided anything more specific.
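
Conceptually, open registration of this kind amounts to a table keyed by the (torch_dispatch_class, operator) pair. The sketch below shows that idea in plain Python; `_RULES`, `register_rule`, and `find_rule` are hypothetical names and say nothing about how torch.library stores its rules.

```python
# Hypothetical registry: (dispatch class, operator name) -> rule callable.
_RULES = {}


def register_rule(dispatch_cls, op_name, rule):
    """Record the rule to run for this (class, operator) pair."""
    _RULES[(dispatch_cls, op_name)] = rule


def find_rule(dispatch_cls, op_name):
    """Return the registered rule, or None if the pair has no rule."""
    return _RULES.get((dispatch_cls, op_name))


class MySubclass:
    """Stand-in for a tensor subclass whose source we cannot edit."""


register_rule(MySubclass, "mylib::foo", lambda x: x * 2)
```

A lookup that returns `None` would fall back to whatever the subclass does by default, which matches the catch-all role described above.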

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130064
Approved by: https://github.com/albanD
2024-07-09 21:11:27 +00:00
9c9744c3ac Revert "[runtime asserts] deduplicate runtime asserts & CSE (#128599)"
This reverts commit 940e4477ab0b81eea25051447cf5f599080c903f.

Reverted https://github.com/pytorch/pytorch/pull/128599 on behalf of https://github.com/izaitsevfb due to breaking internal APS tests, see D59498864 ([comment](https://github.com/pytorch/pytorch/pull/128599#issuecomment-2218724762))
2024-07-09 21:03:49 +00:00
f85bda8bdd c10d/Handlers: expose running handlers from Python (#130149)
This adds a `_run_handler` method that will invoke a specific handler.

Test plan:

```
python test/distributed/elastic/test_control_plane.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130149
Approved by: https://github.com/kurman, https://github.com/c-p-i-o
2024-07-09 20:20:59 +00:00
1d93367cfa Fix typo (#130305)
Fixes #130241

This reopens #130244 to try to fix the failed job.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130305
Approved by: https://github.com/Skylion007
2024-07-09 20:02:00 +00:00
721a798886 add bits16 to graph dtype_abbrs (#130339)
As title, patch the dtype in torch.fx.graph
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130339
Approved by: https://github.com/angelayi
2024-07-09 19:58:51 +00:00
42f647219a [ROCm] Add int4 support (#129710)
- Add AMD support for int4 kernel
  - Only supports CDNA2 and CDNA3 gpus for now
  - Uses `mfma_f32_16x16x16bf16` instruction for matrix multiply
  - Uses `v_and_or_b32` instruction and `__hfma2` intrinsic for unpacking bf16 values
  - Enable hipify for `__nv_bfloat16` and `__nv_bfloat162` data types
- Enable int4 unit tests for CDNA2 and CDNA3 AMD gpus
- Fix torchscript issues due to hipify for `__nv_bfloat16` type
  - TorchScript has its own implementation for bfloat16 type
    - Implemented in `__nv_bfloat16` structure at [resource_strings.h](https://github.com/pytorch/pytorch/blob/main/torch/csrc/jit/codegen/fuser/cuda/resource_strings.h)
    - So, we shouldn't hipify any reference of `__nv_bfloat16` in the torchscript implementation
    - Hence moved the `__nv_bfloat16` direct references in `codegen.cpp` and `cuda_codegen.cpp` to `resource_strings.h` which is already exempted from hipify

Fixes #124699
Fixes pytorch-labs/gpt-fast/issues/154

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129710
Approved by: https://github.com/malfet
2024-07-09 19:49:12 +00:00
adb65682af [HOP] Use user directed names for variables where possible (#130271)
Afaict the previous check was too strict. Removing it passes all the
mutation tests (mutation checks happen via the TensorVariable's mutable_local).

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130271
Approved by: https://github.com/Chillee, https://github.com/ydwu4
ghstack dependencies: #130255, #130268
2024-07-09 19:42:52 +00:00
cyy
a6345d3477 [CMake] [3/N] Remove unused code (#130322)
Some functions used by Caffe2 were removed along with some outdated checks. Follows #130006.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130322
Approved by: https://github.com/r-barnes
2024-07-09 19:33:33 +00:00
3477ee38e4 fix the use of initial learning rate in the OneCycleLR example (#130306)
Fixes #127649

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130306
Approved by: https://github.com/janeyx99
2024-07-09 18:58:07 +00:00
3689471ea4 [inductor] Add FileCheck to flex attention epilogue test (#129343)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129343
Approved by: https://github.com/lezcano
2024-07-09 18:15:55 +00:00
c6cce976b2 Fix an issue where ENABLE_INTRA_NODE_COMM=1 + multiple process groups leads to failure (#130269)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130269
Approved by: https://github.com/Chillee
2024-07-09 17:42:09 +00:00
cb4bec311a Fix nodes has more than one output users after replace_set_grad_with_hop pass (#129716)
Summary: Previously, when we inlined subgraphs that don't have a different require_grad environment, we didn't clean up the nodes' users in the subgraph and directly used them to replace the output of the call_modules. This recorded dead dependencies in node.users. This PR fixes this.

Test Plan:
Added a new test.

Also see the torchrec tests:
Step 1:
buck run mode/dev-nosan //aimp/experimental/pt2:pt2_export -- --model-entity-id 934687114 --output /tmp/934687114.zip --use-torchrec-eager-mp --use-manifold

Step 2:
buck run mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true aimp/cli:cli --  --platform=aps --template=disagg_gpu_aps_pt2 --pt2 --model-entity-id=934687114 non-request-only-tagging torchrec-shard-and-quantize gpu-disagg-split assign-device materialize-weights script-and-save

Differential Revision: D59132214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129716
Approved by: https://github.com/angelayi
2024-07-09 17:04:03 +00:00
e4c51d22c5 [cuDNN] Cleanup < 8.5 #ifdefs (#130283)
We've said cuDNN 8.5 is the minimum supported version for a bit now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130283
Approved by: https://github.com/Skylion007
2024-07-09 16:35:39 +00:00
cab90b0049 [custom ops] disable kernel temporarily (#130190)
Fixes #128621

Sometimes we want to disable the backend implementation for testing/benchmarking purposes.

For example:

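```python
import torch
from torch import Tensor
from torch.library import custom_op

@custom_op("mylib::f", mutates_args=())
def f(x: Tensor) -> Tensor:
    return torch.zeros(1)

print(f(torch.randn(1)))  # tensor([0.])

@f.register_kernel("cpu")
def _(x):
    return torch.ones(1)

print(f(torch.randn(1)))  # tensor([1.])

# Temporarily disable the CPU kernel; the op falls back to the default implementation.
with f.set_kernel_enabled("cpu", enabled=False):
    print(f(torch.randn(1)))  # tensor([0.])
```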
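The same idea can be sketched in plain Python without torch. This is only an illustration of the "temporarily disable a per-device kernel" pattern; `ToyOp` and all of its methods are hypothetical stand-ins and not the `torch.library` implementation:

```python
from contextlib import contextmanager


class ToyOp:
    """Hypothetical stand-in for a custom op with per-device kernels."""

    def __init__(self, default_fn):
        self._default = default_fn
        self._kernels = {}      # device type -> kernel function
        self._disabled = set()  # device types whose kernel is switched off

    def register_kernel(self, device):
        def decorator(fn):
            self._kernels[device] = fn
            return fn
        return decorator

    @contextmanager
    def set_kernel_enabled(self, device, enabled=True):
        was_disabled = device in self._disabled
        if enabled:
            self._disabled.discard(device)
        else:
            self._disabled.add(device)
        try:
            yield
        finally:
            # Restore the previous state even if the body raised.
            if was_disabled:
                self._disabled.add(device)
            else:
                self._disabled.discard(device)

    def __call__(self, x, device="cpu"):
        kernel = self._kernels.get(device)
        if kernel is None or device in self._disabled:
            return self._default(x)
        return kernel(x)


op = ToyOp(lambda x: 0.0)


@op.register_kernel("cpu")
def _(x):
    return 1.0
```

In this sketch the previous enabled/disabled state is restored on exit, so nested uses of the context manager compose.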
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130190
Approved by: https://github.com/williamwen42, https://github.com/zou3519
2024-07-09 16:13:50 +00:00
edf273edf4 Revert some PRs (#130303)
Summary:
Revert https://github.com/pytorch/pytorch/pull/129346 thru
https://github.com/pytorch/pytorch/pull/128893

For S430832

Test Plan: Tests

Differential Revision: D59503843

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130303
Approved by: https://github.com/bdhirsh
2024-07-09 14:46:00 +00:00
cyy
71efbf701d [3/N] Change #include <c10/util/Optional.h> to #include <optional> (#130300)
Follows #130236

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130300
Approved by: https://github.com/ezyang
2024-07-09 13:32:57 +00:00
a5f816df18 Add more dtypes to __cuda_array_interface__ (#129621)
`__cuda_array_interface__` was missing some unsigned integer dtypes as well as BF16.

numba doesn't support BF16 so I skip tests for that one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129621
Approved by: https://github.com/lezcano
2024-07-09 10:47:19 +00:00
3e48d92733 Add scale kwarg to FlexAttention (and some changes that get FlexAttention numerics to be as accurate as FA2) (#130250)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130250
Approved by: https://github.com/drisspg
ghstack dependencies: #130160, #130106, #130224, #130227
2024-07-09 09:24:06 +00:00
eqy
86fb76e871 [SDPA] Clean up print in test/test_transformers.py (#130302)
Left this in #125343, oops...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130302
Approved by: https://github.com/awgu
2024-07-09 09:20:52 +00:00
953c6476bd [CMAKE] Look for Development.Module instead of Development (#129669)
Based on the [cmake issue](https://gitlab.kitware.com/cmake/cmake/-/issues/23716) and [manylinux issue](https://github.com/pypa/manylinux/issues/1347), when building a python module, it should find the `Development.Module` module, not `Development`, which includes `Development.Module` and `Development.Embed`, and will expect the shared python library only. After this PR and before #124613, pytorch could be built with a static libpython (e.g. in manylinux).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129669
Approved by: https://github.com/malfet
2024-07-09 09:16:43 +00:00
b139b5090f [pytorch] Name threads in thread pools for better debugging (#130270)
Threads inside the thread pools are not named, so they inherit the main process name or the name of the first thread. In our case if we set `pt_main_thread` as the thread name when a thread does `import torch`, this name will be inherited by all the threads in the created pools.
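
The change itself is in the C++ thread pools, but the effect is easy to see with a Python analogue. A minimal sketch, assuming a made-up prefix `pt_thread_pool` (the names used by the PR are not shown here):

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Without thread_name_prefix, workers get a generic default name.
# With it, every worker in the pool carries a recognizable name.
with ThreadPoolExecutor(max_workers=2, thread_name_prefix="pt_thread_pool") as pool:
    worker_name = pool.submit(lambda: threading.current_thread().name).result()

print(worker_name)
```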

This PR names the threads in the pools I was able to find. There are other pools created, like OpenMP ones and we need to follow-up on those.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130270
Approved by: https://github.com/d4l3k, https://github.com/albanD
2024-07-09 08:03:47 +00:00
312652c325 [RFC] Add support for device extension autoloading (#127074)
Fixes #122468

- Load device extensions at the end of `torch/__init__.py`
- Enabled by default, or you can disable it with `TORCH_DEVICE_BACKEND_AUTOLOAD=0`
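
A minimal sketch of the opt-out check, assuming that only the value `0` disables autoloading (the helper name and the `environ` parameter are hypothetical and only here to make the sketch testable):

```python
import os


def is_device_backend_autoload_enabled(environ=None):
    """Return True unless TORCH_DEVICE_BACKEND_AUTOLOAD is set to "0"."""
    environ = os.environ if environ is None else environ
    # Assumption: unset means enabled, and only "0" turns the feature off.
    return environ.get("TORCH_DEVICE_BACKEND_AUTOLOAD", "1") != "0"


print(is_device_backend_autoload_enabled({}))
print(is_device_backend_autoload_enabled({"TORCH_DEVICE_BACKEND_AUTOLOAD": "0"}))
```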

run test:

```python
python test/run_test.py -i test_autoload_enable
python test/run_test.py -i test_autoload_disable
```

doc:

https://docs-preview.pytorch.org/pytorch/pytorch/127074/miscellaneous_environment_variables.html

co-author:  @jgong5 @bsochack @bkowalskiINTEL @jczaja @FFFrog @hipudding

Co-authored-by: albanD <desmaison.alban@gmail.com>
Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127074
Approved by: https://github.com/albanD, https://github.com/jgong5
2024-07-09 06:14:13 +00:00
6c4efd4e95 [Memory Snapshot][BE] Clean up record function callback scope (#130265)
Summary: We can directly set the scope to at::RecordScope::USER_SCOPE for the at::RecordFunctionCallback object, rather than performing a check inside of the callback.

Test Plan:
Ran locally, works fine.

https://www.internalfb.com/pytorch_memory_visualizer/mvai_gpu_traces/tree/gpu_snapshot/fire-aaronshi-20240704-1709-7a80b83b/0/rank-0_itrn-1503.Jul_04_17_24_02.3577.snapshot.pickle

Differential Revision: D59477046

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130265
Approved by: https://github.com/davidberard98
2024-07-09 05:23:48 +00:00
ded469cfbd [issue scrubbing] Fix imports in test_memory_planning.py to work with pytest (#130275)
Summary: I actually don't grok why this pattern works; I guess pytest expects a different import syntax for these relative imports?? But this pattern is used in many other tests here (notably `test_aot_inductor.py`), so it must be right ;)

Test Plan:
Ran both ways:
* `python test/inductor/test_memory_planning.py`
* `pytest test/inductor/test_memory_planning.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130275
Approved by: https://github.com/zou3519
2024-07-09 05:20:56 +00:00
e235db98c9 [Inductor] Add aot_mode UT to new cpp_builder. (#130105)
Changes:
1. Add `aot_mode` parameter to `validate_new_cpp_commands` UT.
2. Switch AotCodeCompiler vec isa command gen to new cpp_builder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130105
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-09 04:08:35 +00:00
31df1d235e Support tensor stride (#129297)
Summary:
X-link: https://github.com/facebookresearch/param/pull/126

Support tensor stride for execution trace.

Test Plan: buck2 test mode/opt caffe2/test:test_profiler_cuda profiler.test_execution_trace.TestExecutionTrace

Differential Revision: D58900476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129297
Approved by: https://github.com/sanrise, https://github.com/izaitsevfb
2024-07-09 03:55:46 +00:00
e836ee1955 Enhancements to recompiles logs (#130043)
----

- We now record on CacheEntry what the compile id that populated it was, so now we can say why a specific frame was rejected
- Add structured log for recompiles under name artifact "recompile_reasons". As it stands, it's not terribly structured, but this was the easiest thing I could do to start
- Slightly reformat multi-reason printing; since we only report one guard failure seems better to have it as a single line

Example output:

```
V0703 10:34:13.273000 140345997743104 torch/_dynamo/guards.py:2590] [0/1] [__recompiles] Recompiling function f in /data/users/ezyang/a/pytorch/b.py:3
V0703 10:34:13.273000 140345997743104 torch/_dynamo/guards.py:2590] [0/1] [__recompiles]     triggered by the following guard failure(s):
V0703 10:34:13.273000 140345997743104 torch/_dynamo/guards.py:2590] [0/1] [__recompiles]     - 0/0: tensor 'L['x']' size mismatch at index 0. expected 4, actual 5
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130043
Approved by: https://github.com/anijain2305
2024-07-09 03:40:56 +00:00
cyy
29861779ce [2/N] Change #include <c10/util/Optional.h> to #include <optional> (#130236)
Follows  #128301. The changes were made by grep and sed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130236
Approved by: https://github.com/ezyang
2024-07-09 03:17:24 +00:00
d1e0653fad [fx][easy] print_readable should recursively apply options (#130268)
For example, print_readable(colored=True) should also print submodules
with colors.

Test Plan:
- tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130268
Approved by: https://github.com/Chillee
ghstack dependencies: #130255
2024-07-09 02:50:20 +00:00
f2c9f0c0db [HOP] improve naming for subgraph inputs (#130255)
Previously, subgraph input names were whatever the input proxies were,
which were confusing. This PR changes those names to be
whatever the names of the arguments the functions being
speculate_subgraph'ed are. This is best-effort: if we can't figure it
out then we go back to the previous strategy.
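
The best-effort naming can be sketched roughly as follows. This is not the Dynamo code; `subgraph_input_names` and the proxy names are hypothetical, and the fallback conditions are assumptions:

```python
import inspect


def subgraph_input_names(fn, proxy_names):
    """Prefer fn's argument names; fall back to the proxy names."""
    try:
        params = list(inspect.signature(fn).parameters.values())
    except (TypeError, ValueError):
        # Some callables expose no signature; keep the old names.
        return list(proxy_names)
    simple_kinds = (
        inspect.Parameter.POSITIONAL_ONLY,
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
    )
    if len(params) != len(proxy_names) or any(
        p.kind not in simple_kinds for p in params
    ):
        # *args, **kwargs, or a count mismatch: we can't figure it out.
        return list(proxy_names)
    return [p.name for p in params]


def body(x, weight):
    return x


names = subgraph_input_names(body, ["l_x_", "l_y_"])
fallback = subgraph_input_names(lambda *args: args, ["l_x_", "l_y_"])
```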

Test Plan:
- existing expecttests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130255
Approved by: https://github.com/ydwu4
2024-07-09 02:46:40 +00:00
abe81d5d05 Fix the rest of foreach flakers (#130277)
Reenable foreach tests on non-sm86 machines. I believe I've fixed the flakes that are caused when TORCH_SHOW_CPP_STACKTRACES=1, though I know @clee2000 had also just landed https://github.com/pytorch/pytorch/pull/129004 for the same effect.

Regardless, this makes the foreach tests more robust against future disruptions anyway. Fix similar in flavor to https://github.com/pytorch/pytorch/pull/129003

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130277
Approved by: https://github.com/soulitzer
2024-07-09 02:08:21 +00:00
d44c30e2f9 Revert "Add API for open registration between operators and subclasses (and modes) (#130064)"
This reverts commit 922d2737d5e0ad22ee1dcf91c48ab09d641de840.

Reverted https://github.com/pytorch/pytorch/pull/130064 on behalf of https://github.com/huydhn due to Sorry for reverting your change but test_profiler_tree is failing in trunk after this lands 922d2737d5, maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/130064#issuecomment-2216135497))
2024-07-09 01:48:38 +00:00
75fa10066d Mark some test_decomp tests as slow on win (#130260)
Auto slow test detection is marking and then unmarking these as slow, so permanently mark them as slow on windows.

These tests take >500s on windows.

This is part of the reason why test_decomp keeps failing on windows (ex da66e50e6e)

The other part is something to do with reruns + thresholds that I am still investigating
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130260
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-07-09 00:16:31 +00:00
7f08d3d9a0 [C10D] Fix corrupt log due to uint_8 printing as char (#130184)
Previously, jobs would log lines like this due to interpreting an int8 value as a signed char when streaming out.

"ProcessGroupNCCL created ncclComm_ 0x94960120 on CUDA device: ^@"

We need a better solution for avoiding this systematically, but at least
for now fix the spot we know about.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130184
Approved by: https://github.com/eeggl, https://github.com/Skylion007
2024-07-08 23:37:50 +00:00
4c19623800 Change numeric_debug_handle to store per-node id (#129811)
Summary:
Previously we stored edge id in numeric_debug_handle to support operator fusion and operator decomposition throughout the stack,
but according to feedback from customers, people prefer the simpler per-node id, and they are fine with not having the additional
support for numerical debugging for inputs and are willing to hack around to achieve this.

This PR changes the structure of numeric_debug_handle to store unique_id for each node instead.

e.g.
graph:
```
node = op(input_node, weight_node)
```
Before:
```
node.meta[NUMERIC_DEBUG_HANDLE_KEY] = {input_node: id1, weight_node: id2, "output": id3}
```

After:
```
node.meta[NUMERIC_DEBUG_HANDLE_KEY] = id1
```
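
A toy version of the per-node scheme, for illustration only. `ToyNode` and `assign_per_node_ids` are hypothetical, and the key string is an assumption; the real pass operates on torch.fx nodes:

```python
from itertools import count

NUMERIC_DEBUG_HANDLE_KEY = "numeric_debug_handle"  # assumed key name


class ToyNode:
    """Hypothetical stand-in for an fx node with a meta dict."""

    def __init__(self, name, inputs=()):
        self.name = name
        self.inputs = list(inputs)
        self.meta = {}


def assign_per_node_ids(nodes):
    """Give each node one unique integer id instead of a per-edge dict."""
    ids = count()
    for node in nodes:
        if NUMERIC_DEBUG_HANDLE_KEY not in node.meta:
            node.meta[NUMERIC_DEBUG_HANDLE_KEY] = next(ids)


input_node = ToyNode("input")
weight_node = ToyNode("weight")
node = ToyNode("op", [input_node, weight_node])
assign_per_node_ids([input_node, weight_node, node])
```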

Test Plan:
python test/test_quantization.py -k TestGenerateNumericDebugHandle

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129811
Approved by: https://github.com/tarun292
2024-07-08 23:36:19 +00:00
a28bb3268d [Pipelining] Reorder _Action from F1_1 to 1F1 (#129786)
Also steers away from accessing _Action via positional unpacking since
that is error-prone
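
To illustrate why named access is safer than positional unpacking, here is a small sketch. `ToyAction` is a hypothetical stand-in and its field names are assumptions, not the real `_Action` definition:

```python
from typing import NamedTuple


class ToyAction(NamedTuple):
    stage_index: int
    computation_type: str
    microbatch_index: int

    def __repr__(self):
        # Renders as e.g. "1F1": stage, computation type, microbatch.
        return f"{self.stage_index}{self.computation_type}{self.microbatch_index}"


action = ToyAction(stage_index=1, computation_type="F", microbatch_index=1)

# Named access keeps working if the field order changes again.
stage = action.stage_index
kind = action.computation_type

# Positional unpacking silently depends on the current field order.
first, second, third = action
```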

Co-authored-by: Howard Huang <howardhuang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129786
Approved by: https://github.com/H-Huang
2024-07-08 23:07:51 +00:00
60d9f3f7d9 Set the epoch timestamp when uploading data to dynamoDB (#130273)
This moves away from the `_event_time` field from Rockset, which we cannot use when reimporting the data
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130273
Approved by: https://github.com/clee2000
2024-07-08 22:58:32 +00:00
b4cc25f126 [custom_op]Fix self in mutation_args (#130179)
Fixes #124933

## Issue Summary
If users name `self` in mutates_args, an error occurs: `TypeError: AutoFunctionalized.__call__() got multiple values for argument 'self'`. For the following example, the schema for mutates_args is parsed as {"self": FakeTensor}.  6df963a2c8/torch/_higher_order_ops/auto_functionalize.py (L234)
In the above line, it is unpacked as `self=FakeTensor`, which passes the wrong argument because `self` is the conventional first parameter of methods of a class, such as https://github.com/pytorch/pytorch/compare/main...findhao/fix-self-custom-ops#diff-9453b6b52a54783beec3dd1c60248620f61c3a524d404a188af17bbdf6be3d9eR292 .
```python
import torch

@torch.library.custom_op("mylib::foo", mutates_args={"self"})
def foo(self: torch.Tensor) -> None:
    self.sin_()

x = torch.randn(3)

@torch.compile(backend="inductor", fullgraph=True)
def f(x):
    foo(x)

f(x)
```
## Fix
This PR changes all related default argument `self` to `self_` following the existing way in 6fc771d19b/torch/_ops.py (L667)
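
The collision can be reproduced in plain Python without torch. In this sketch `ToyOperator` and both of its methods are hypothetical; it only shows why renaming the method's first parameter to `self_` avoids the error:

```python
class ToyOperator:
    def call_with_self(self, op, **kwargs):
        return op(**kwargs)

    # First parameter renamed, so a "self" keyword lands in **kwargs.
    def call_with_self_(self_, op, **kwargs):
        return op(**kwargs)


def foo(self):
    return f"foo got {self}"


handler = ToyOperator()
mutated = {"self": "tensor"}

try:
    # The bound method already supplies `self`, so self=... collides.
    handler.call_with_self(foo, **mutated)
    error = None
except TypeError as exc:
    error = str(exc)

result = handler.call_with_self_(foo, **mutated)
```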

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130179
Approved by: https://github.com/zou3519
2024-07-08 22:55:50 +00:00
17ca0d0edf Add linux manywheel python 3.13 binary workflows (#130030)
Test with passing linux manywheel workflows is here: https://github.com/pytorch/pytorch/pull/121979
Builder PR already merged: https://github.com/pytorch/builder/pull/1910

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130030
Approved by: https://github.com/albanD
2024-07-08 22:50:15 +00:00
00335a27b4 Accept min / max sequence length in nested_tensor_from_jagged() constructor (#130175)
This PR updates the public API for NJT construction `torch.nested.nested_tensor_from_jagged()` to accept values for min / max sequence length. It's useful to provide these ahead of time to avoid GPU -> CPU syncs from on-demand computation later on.

NB: The test changes are extensive because I reworked the existing `_validate_nt()` helper function used throughout our NJT construction tests to verify more (specifically: expected cached min / max seq len and contiguity).

API design question: should we additionally provide an option to compute these from `offsets` at construction time? I can think of three possible cases during construction:
1. Min / max seq len has already been obtained from *somewhere* (manual calculation, static values, etc.) and they should be used in the cache
2. Min / max seq len should be computed immediately at construction time for use in the cache (ideally, the caller wouldn't have to do this computation manually)
3. Min / max seq len are not needed at all (i.e. SDPA isn't ever called) and computation should be skipped
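
For reference, the computation that case 2 refers to can be sketched on a plain list of offsets. `seqlen_bounds` is a hypothetical helper, not part of the `torch.nested` API:

```python
def seqlen_bounds(offsets):
    """Compute (min, max) sequence length from jagged offsets."""
    lengths = [end - start for start, end in zip(offsets, offsets[1:])]
    if not lengths:
        raise ValueError("offsets must contain at least two entries")
    return min(lengths), max(lengths)


# Three sequences of lengths 3, 1 and 5.
offsets = [0, 3, 4, 9]
min_seqlen, max_seqlen = seqlen_bounds(offsets)
```

On a GPU tensor the equivalent reduction would need a device-to-host sync to read the result, which is the cost that passing the values up front avoids.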
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130175
Approved by: https://github.com/davidberard98, https://github.com/soulitzer
2024-07-08 22:14:52 +00:00
922d2737d5 Add API for open registration between operators and subclasses (and modes) (#130064)
We add torch.library.Library._register_torch_dispatch_rule. Here, a user
can provide us a specific rule to run for a specific
(torch_dispatch_class, operator) pair. The motivation is that a user
might want to extend a subclass/mode but may not have access to the
source code of the subclass/mode.

I'll make this public in a follow-up PR if we think the approach and API
is good.

Keep in mind that many subclasses will likely deliver their own open
registration solution (DTensor has register_sharding_prop_rule and NJT
has register_jagged_op); _register_torch_dispatch_rule is meant as a
catch-all open registration mechanism for when the subclass hasn't
provided anything more specific.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130064
Approved by: https://github.com/albanD
2024-07-08 22:13:05 +00:00
44a773c121 Revert "[custom ops] infer schema (#130079)"
This reverts commit 3fe324ffb612c8712f6af7639c1e7bcec5f3b4fd.

Reverted https://github.com/pytorch/pytorch/pull/130079 on behalf of https://github.com/huydhn due to The test_public_bindings failure looks legit 3fe324ffb6 ([comment](https://github.com/pytorch/pytorch/pull/130079#issuecomment-2215420957))
2024-07-08 22:02:29 +00:00
f9bb258892 Revert "[Inductor] Add aot_mode UT to new cpp_builder. (#130105)"
This reverts commit 21eeedb4554edab22b42bcb2f75f19e85652b72e.

Reverted https://github.com/pytorch/pytorch/pull/130105 on behalf of https://github.com/izaitsevfb due to Breaks 46 tests internally at meta with: OSError: CUDA_HOME environment variable is not set ([comment](https://github.com/pytorch/pytorch/pull/130105#issuecomment-2215392198))
2024-07-08 21:40:03 +00:00
5e467604c3 Revert "[inductor] switch AotCodeCompiler to new cpp_builder (#130127)"
This reverts commit dc5f37193f8d144d3de8525bf64eb1775d91e932.

Reverted https://github.com/pytorch/pytorch/pull/130127 on behalf of https://github.com/izaitsevfb due to Depends on #130105 which has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130127#issuecomment-2215355259))
2024-07-08 21:25:28 +00:00
09d57f577b Revert "[inductor] switch CppCodeCache to new cpp_builder. (#130132)"
This reverts commit 3957b3b34976896e0b13e1d09cf19e1da5b8292e.

Reverted https://github.com/pytorch/pytorch/pull/130132 on behalf of https://github.com/izaitsevfb due to Depends on  #130105 which has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/130132#issuecomment-2215352180))
2024-07-08 21:22:39 +00:00
856fe230c7 [AOTI] better approach to generating runtime checks for symbolic dimensions (#130220)
Previously, we only handled cases where the symbolic dimension is of
Symbol. We should use bound_sympy which handles more general cases for us.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130220
Approved by: https://github.com/aakhundov
2024-07-08 20:46:38 +00:00
3fe324ffb6 [custom ops] infer schema (#130079)
Fixes #129617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130079
Approved by: https://github.com/zou3519
2024-07-08 20:46:23 +00:00
1e61cb8c87 Revert "[3.12, 3.13, dynamo] simplified construction for frame f_locals/localsplus (#129185)"
This reverts commit b428f1ad77aedfd150e920c8b0d23b7e6393ad6f.

Reverted https://github.com/pytorch/pytorch/pull/129185 on behalf of https://github.com/huydhn due to dr ci categorization is wrong, the test_linalg xsuccess is real, theres also a test_jit failure https://github.com/pytorch/pytorch/actions/runs/9844339391/job/27178009798 b428f1ad77 ([comment](https://github.com/pytorch/pytorch/pull/129185#issuecomment-2215230345))
2024-07-08 20:37:07 +00:00
f059201e0d [dtensor][debug] added deviceMesh for relevant operations and module parameter sharding and module fqn (#130072)
**Summary**
In order to give users more information, I have added the deviceMesh for operations with DTensor inputs, and module parameter sharding and FQN. These changes have only been placed in operation tracing log. In the future, I plan to just have one logging function with an argument to show how detailed a user wants the log to be, and will get rid of the module tracing log function. This information has also been added to the JSON dump and can be seen in the browser visual. I have also edited the test case file as the module_depth dictionary has been replaced with module_helper_dict and have edited the example output for the MLP operation tracing which can be seen below:

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_json_dump

2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump

3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing

4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing

5. pytest test/distributed/_tensor/debug/test_comm_mode_features.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130072
Approved by: https://github.com/XilunWu
ghstack dependencies: #129994
2024-07-08 20:12:52 +00:00
3e53cae0fc Release 2.4 matrix update. Future releases dates (#130267)
Added Release Compatibility Matrix for release 2.4
Updated future release dates for 2.6-2.9
Updated possible patch release date for 2.4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130267
Approved by: https://github.com/malfet, https://github.com/albanD
2024-07-08 20:09:17 +00:00
36e2608783 [Quant][PT2E] enable qlinear post op fusion for dynamic quant & qat (#122667)
**Description**
Add fusion path for dynamic quant and for QAT.
The following patterns can be matched for static quant with QAT cases:
`qx -> qlinear -> add -> optional relu -> optional type convert -> optional quant`

The following patterns can be matched for dynamic quant cases:
`qx -> qlinear -> add -> optional relu`

**Test plan**
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear
python test/inductor/test_cpu_cpp_wrapper.py -k test_qlinear
python test/test_quantization.py -k test_linear_unary
python test/test_quantization.py -k test_linear_binary

Differential Revision: [D57655830](https://our.internmc.facebook.com/intern/diff/D57655830)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122667
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2024-07-08 20:04:39 +00:00
a8985a97f9 elastic/store: use wait instead of get for barrier (#130148)
Summary: We call `.get` in the elastic store barrier operation but we don't need the result. This switches it to use `.wait` instead which eliminates one network round trip as `get` internally does a wait first.
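
A toy model of the saving, assuming each store operation costs one network round trip. `ToyStore` is hypothetical and does not block the way a real store's `wait` does:

```python
class ToyStore:
    """Hypothetical in-memory store that counts simulated round trips."""

    def __init__(self):
        self._data = {}
        self.round_trips = 0

    def set(self, key, value):
        self.round_trips += 1
        self._data[key] = value

    def wait(self, key):
        self.round_trips += 1
        if key not in self._data:
            raise KeyError(key)

    def get(self, key):
        # get waits for the key first, then fetches the value.
        self.wait(key)
        self.round_trips += 1
        return self._data[key]


store = ToyStore()
store.set("barrier/last_member", b"1")

before = store.round_trips
store.get("barrier/last_member")
get_cost = store.round_trips - before

before = store.round_trips
store.wait("barrier/last_member")
wait_cost = store.round_trips - before
```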

Test Plan:

CI + existing tests -- no behavior change

Differential Revision: D59396199

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130148
Approved by: https://github.com/kurman, https://github.com/wconstab
2024-07-08 19:53:42 +00:00
22c809aa73 [FSDP] Runtime Error on Checkpoint Loading for optimizer state (#129110)
For checkpointed optimizer state, tensors are created on CUDA even when other backends are used. This is because by default a torch.device() constructed from a single device ordinal is treated as a CUDA device.

In _alloc_tensor, empty tensors are created using device = cast(torch.device, _get_device_module(device_type).current_device()). This returns only the index, which creates the empty tensor on CUDA by default. So, change it to use torch.device(device_type,device_module(device_type).current_device()) to get the device with the index.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129110
Approved by: https://github.com/fegin
2024-07-08 18:52:13 +00:00
9158bb7837 Ignore functional tensor wrapper when caching (#128335)
This PR makes it so that we don't try to serialize FunctionalTensorWrappers. FunctionalTensorWrappers don't pickle well because they have no underlying storage. This should be fixable at a later point, but I might not be the right author for implementing the serialization for it. If there's a way to avoid actually saving the FunctionalTensorWrappers themselves and just saving the ViewMetadata so we can replay it, that would also work.

To do this, we disable view_replay_input_mutations when using AOTAutogradCache, and then only keep the functional tensor in the ViewAndMutationMeta if we need it for view_replay_input_mutations (i.e. the cache is off).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128335
Approved by: https://github.com/bdhirsh
2024-07-08 18:39:20 +00:00
6dc64026cb Restrict fusions in foreach if there are dependencies on multiple subkernels (#130046)
In https://www.internalfb.com/intern/sevmanager/view/s/429861/, a downstream consuming buffer `buf486_buf526` had two read dependencies, `buf373` and `buf394`, which were at separate indices of the upstream foreach op. `buf486_buf526` was fused into `buf373` because in the usual fused case this is completely fine if all dependencies are met in the upstream fused buffer. However, in the foreach case, and in this case specifically, foreach ops can be partitioned if there are many arguments in order to stay under CUDA driver arg limits. As a result, this large foreach op was split into two, and the latter split had `buf394` in its node schedule for allocation while the earlier split did not, even though `buf486_buf526` uses `buf394`. As a result we would hit the unbound local error.

@eellison provided this repro to help debug the issue (https://www.internalfb.com/phabricator/paste/view/P1453035092)

To fix this, we no longer return a valid producer subnode if there are multiple producer subnodes for a downstream consuming op. In short we should not fuse if there are dependencies on multiple foreach subkernels because 1) their execution order is non-deterministic and 2) (this issue) we may not properly handle dependencies in the presence of foreach partitioning.
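
The rule can be sketched as follows. This is not the Inductor scheduler code; `find_producer_subnode` and the subnode names are hypothetical, and buffers are modelled as plain sets of names:

```python
def find_producer_subnode(consumer_reads, subnode_writes):
    """Return the single foreach subnode producing the consumer's reads.

    Returns None (meaning: do not fuse) when the reads are produced by
    zero subnodes or by more than one subnode.
    """
    producers = [
        name
        for name, writes in subnode_writes.items()
        if writes & consumer_reads
    ]
    return producers[0] if len(producers) == 1 else None


subnode_writes = {
    "foreach_sub0": {"buf373"},
    "foreach_sub1": {"buf394"},
}

single = find_producer_subnode({"buf373"}, subnode_writes)
multiple = find_producer_subnode({"buf373", "buf394"}, subnode_writes)
```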

Co-authored-by: David Berard <dberard@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130046
Approved by: https://github.com/eellison
2024-07-08 18:25:16 +00:00
64139987c0 Add block mask utility support for batches and heads > 1 (#130227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130227
Approved by: https://github.com/yanboliang
ghstack dependencies: #130160, #130106, #130224
2024-07-08 18:15:35 +00:00
cd683212a2 Fix indexing twice with score_mod (#130224)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130224
Approved by: https://github.com/yanboliang
ghstack dependencies: #130160, #130106
2024-07-08 18:15:35 +00:00
e16276b9bf [ROCm] Check supported archs before setting preferred blas backend to hipblasLT (#128753)
This PR is needed to resolve usability issues with PyTorch ROCm nightly wheels on non-gfx90a/gfx94x architectures as a result of https://github.com/pytorch/pytorch/pull/127944.

Addresses https://github.com/pytorch/pytorch/issues/119081#issuecomment-2166504992

### With this PR's changes, I get the following on a gfx908 (unsupported by hipblasLT) architecture:
_Using setter function:_
```
>>> torch.backends.cuda.preferred_blas_library(backend="cublaslt")
[W617 19:58:58.286088851 Context.cpp:280] Warning: torch.backends.cuda.preferred_blas_library is an experimental feature. If you see any error or unexpected behavior when this flag is set please file an issue on GitHub. (function operator())
[W617 19:59:02.125161985 Context.cpp:291] Warning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (function operator())
<_BlasBackend.Cublas: 0>
```

_Using `TORCH_BLAS_PREFER_HIPBLASLT` env var:_
```
root@9d47bf40d4d4:/tmp/pytorch# TORCH_BLAS_PREFER_CUBLASLT=1 python
>>> import torch
>>> torch.backends.cuda.preferred_blas_library()
[W619 06:14:11.627715807 Context.cpp:274] Warning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (function operator())
<_BlasBackend.Cublas: 0>
```

### and the following on a gfx90a (supported by hipblasLT) architecture:
_Using setter function:_
```
>>> import torch
>>> torch.backends.cuda.preferred_blas_library()
<_BlasBackend.Cublaslt: 1>
>>> torch.backends.cuda.preferred_blas_library(backend="cublas")
<_BlasBackend.Cublas: 0>
>>> torch.backends.cuda.preferred_blas_library(backend="cublaslt")
[W620 18:38:29.404265518 Context.cpp:293] Warning: torch.backends.cuda.preferred_blas_library is an experimental feature. If you see any error or unexpected behavior when this flag is set please file an issue on GitHub. (function operator())
<_BlasBackend.Cublaslt: 1>
```

_Using `TORCH_BLAS_PREFER_HIPBLASLT` env var:_
```
root@9d47bf40d4d4:/tmp/pytorch# TORCH_BLAS_PREFER_HIPBLASLT=1 python
>>> import torch
>>> torch.backends.cuda.preferred_blas_library()
<_BlasBackend.Cublaslt: 1>
```
(Same result for _Using `TORCH_BLAS_PREFER_CUBLASLT` env var:_)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128753
Approved by: https://github.com/malfet
2024-07-08 17:43:41 +00:00
b428f1ad77 [3.12, 3.13, dynamo] simplified construction for frame f_locals/localsplus (#129185)
Construct frame localsplus in 3.12+ using our own simplified way rather than copypasting from CPython.

This is necessary for 3.13 since we can no longer generate frame `f_locals` before executing the interpreter frame.
We also enable this for 3.12 since the `f_locals` construction between 3.12 and 3.13 is the same, so we can test for correctness with 3.12.

This is also one of the first steps to completing https://github.com/pytorch/pytorch/issues/93753 - we will implement simplified f_locals generation of previous Python versions in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129185
Approved by: https://github.com/jansel
2024-07-08 17:39:05 +00:00
d325aaef39 [halide-backend] Use get_reduction_combine_fn for reduction ops (#130212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130212
Approved by: https://github.com/eellison
2024-07-08 17:23:32 +00:00
a18568f293 [dtensor][debug] Added functionality to convert log into a json file (#129994)
**Summary**
Currently, users have two options for viewing the tracing data. The first is the console, where colored text helps users read the information. The second is logging the information to a text file, which is useful when the log is too long to fit in the console. However, depending on model complexity, these logs can run to thousands of lines, making it difficult to find specific information. To fix this, I have added functionality to convert the log into a JSON file, which is used to create a tree view in a browser, allowing the user to collapse parts of the log that are not useful to them. Users can pass their own file path; a default path is used if none is provided. The beginning of the expected JSON file and the browser view for the MLP model are shown below:

<img width="542" alt="Screenshot 2024-07-02 at 3 40 41 PM" src="https://github.com/pytorch/pytorch/assets/50644008/b9570540-e1d2-4777-b643-db4801b60ed8">

<img width="777" alt="Screenshot 2024-07-02 at 3 41 43 PM" src="https://github.com/pytorch/pytorch/assets/50644008/9296e255-c3ae-48a4-8be7-4273f69ee178">

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_json_dump

2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129994
Approved by: https://github.com/XilunWu
2024-07-08 17:15:34 +00:00
61017eb77b Add missing mapping between DLDevice and ATenDevice for MAIA (#129615)
This PR adds the missing mapping between `DLDevice` and `ATenDevice` for the MAIA device. These changes are necessary for `dlpack` support for `maia` tensors.

[MAIA has already been added to the `DLDeviceType` enum in the dlpack repo](bbd2f4d324/include/dlpack/dlpack.h (L120)).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129615
Approved by: https://github.com/albanD
2024-07-08 17:08:39 +00:00
63743b223c [AO] catch qparam mismatch for cat (#123769)
Summary:
Use `&=` instead of `|=`, since `|=` ignores an incorrect scale/zp.
Change the scale check to use float comparison instead of int comparison.

Issue warning instead of error for backward compatibility: ex: P1204628034
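
A minimal sketch of why the accumulator operator matters, assuming the check folds per-input comparisons into a single boolean. The function names and the `(scale, zero_point)` tuples are hypothetical and are not taken from the actual change:

```python
import math


def all_match_buggy(ref, others):
    """Buggy pattern: `|=` can never flip True to False, and int() truncates scales."""
    ok = True
    for scale, zp in others:
        ok |= (int(scale) == int(ref[0])) and (zp == ref[1])
    return ok


def all_match_fixed(ref, others):
    """Fixed pattern: `&=` stays False after the first mismatch; scales compared as floats."""
    ok = True
    for scale, zp in others:
        ok &= math.isclose(scale, ref[0], rel_tol=1e-6) and (zp == ref[1])
    return ok


ref = (0.5, 0)
inputs = [(0.5, 0), (0.25, 3)]  # second input has a different scale and zero point
buggy_result = all_match_buggy(ref, inputs)
fixed_result = all_match_fixed(ref, inputs)
```

In this sketch the buggy version reports a match for the mismatched input while the fixed version does not; the relative tolerance is an illustrative choice.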

Test Plan: see warning in: P1204628034

Reviewed By: jerryzh168

Differential Revision: D55699212

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123769
Approved by: https://github.com/jerryzh168
2024-07-08 16:47:14 +00:00
f4774d64bf Skip test_profile_memory on windows (#130037)
The test was introduced in https://github.com/pytorch/pytorch/pull/128743
It is failing on windows cuda a9a744e442/1 (it is skipped on cpu jobs)

After talking with the author and Aaron, I have been advised to skip it on windows, as windows support for kineto is not a high priority
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130037
Approved by: https://github.com/huydhn, https://github.com/aaronenyeshi
2024-07-08 16:11:51 +00:00
d7b7f8b79f Revert "[ROCm] Add int4 support (#129710)"
This reverts commit d0ad13fa42fc2e9935bd3bda2937a3491276d274.

Reverted https://github.com/pytorch/pytorch/pull/129710 on behalf of https://github.com/jeffdaily due to original ROCm PR did not have ciflow/rocm, missed signal ([comment](https://github.com/pytorch/pytorch/pull/129710#issuecomment-2214558368))
2024-07-08 16:07:53 +00:00
c8ab2e8b63 Set seed per sample for OpInfo tests + support for restricting to a single sample input (#128238)
This PR:
* Sets a random seed before generating each sample for an OpInfo test. It does this by intercepting the sample input iterator via `TrackedInputIter`, optionally setting the seed to a test name specific seed before each iterator call (default is to set the seed).
    * Some quick and dirty benchmarking shows (hopefully) negligible overhead from setting the random seed before each sample input generation. For a trivial (single assert) test that uses `@ops`:
* Uncovered a bunch of test issues:
    * Test breakdown (>100 total)
        * A lot of tolerance issues (tweaked tolerance values to fix)
        * 1 broken OpInfo (`sample_inputs_masked_fill` was generating a sample of the wrong dtype)
        * 3 actually broken semantics (for masked tensor; added xfails)
        * 4 Jacobian mismatches (added xfails)
        * 2 nan results (skip for now, need fixing)
        * 3 results too far from reference result (add xfails)
* Skips MPS tests for now (there are so many failures!). Those will default to the old behavior.

**before (no seed setting):**
```
real	0m21.306s
user	0m19.053s
sys	0m5.192s
```

**after (with seed setting):**
```
real	0m21.905s
user	0m19.578s
sys	0m5.390s
```

* Utilizing the above for reproducible sample input generation, adds support for restricting the iterator to a single sample input. This is done via an env var `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX` and its usage is included in the repro command.

```
======================================================================
ERROR: test_bar_add_cuda_uint8 (__main__.TestFooCUDA.test_bar_add_cuda_uint8)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 971, in test_wrapper
    return test(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/home/jbschlosser/branches/testing_updates/test/test_ops.py", line 2671, in test_bar
    self.assertFalse(True)
AssertionError: True is not false

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 2816, in wrapper
    method(*args, **kwargs)
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 2816, in wrapper
    method(*args, **kwargs)
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 419, in instantiated_test
    result = test(self, **param_kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 1426, in wrapper
    fn(*args, **kwargs)
  File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 982, in test_wrapper
    raise new_e from e
Exception: Caused by sample input at index 3: SampleInput(input=Tensor[size=(10, 5), device="cuda:0", dtype=torch.uint8], args=TensorList[Tensor[size=(), device="cuda:0", dtype=torch.uint8]], kwargs={}, broadcasts_input=False, name='')

To execute this test, run the following from the base repo dir:
    PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=3 python test/test_ops.py -k TestFooCUDA.test_bar_add_cuda_uint8

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
Ran 1 test in 0.037s

FAILED (errors=1)
```
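
The per-sample seeding and single-sample restriction described above can be sketched roughly as follows. This is an illustration only: `seeded_samples`, `make_sample`, and the CRC-based seed are hypothetical stand-ins (not the PR's `TrackedInputIter`), and the standard-library `random` module is used in place of the torch RNG:

```python
import random
import zlib


def seeded_samples(make_sample, count, test_name, only_index=None):
    """Yield (index, sample) pairs, resetting the RNG before each sample."""
    seed = zlib.crc32(test_name.encode("utf-8"))  # stable, test-name-specific seed
    for idx in range(count):
        random.seed(seed)  # sample idx no longer depends on RNG use elsewhere
        sample = make_sample(idx)
        if only_index is None or idx == only_index:
            yield idx, sample


def make_sample(idx):
    # Hypothetical sample generator that draws from the global RNG.
    return [random.randint(0, 255) for _ in range(idx + 1)]


all_samples = dict(seeded_samples(make_sample, 5, "test_bar_add"))
only_third = list(seeded_samples(make_sample, 5, "test_bar_add", only_index=3))
```

Because the seed is reset before every sample, the sample at index 3 is the same whether the test runs over all samples or only that one. Here `only_index` plays the role that `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX` plays in the repro command.
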
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128238
Approved by: https://github.com/janeyx99, https://github.com/justinchuby
2024-07-08 16:06:38 +00:00
acf9e31cf8 adding MTIA to supported activities (#130052)
Summary: Put the hasMTIA block in the if condition as well to let MTIA activities be added to supported activities

Test Plan: Tested with auto-trace

Differential Revision: D59280848

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130052
Approved by: https://github.com/aaronenyeshi
2024-07-08 15:20:05 +00:00
16d53cb7d5 Only run mixed_mm heuristic if shapes are static (#130081)
If we have dynamic shapes, the heuristic in mixed_mm will cause a crash, because it cannot compare m, k and n to integer values. This PR makes it so that the heuristic only runs if we have static shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130081
Approved by: https://github.com/Chillee
2024-07-08 14:20:55 +00:00
010009e642 [compiled autograd] c++ autograd function saved_data: lift tensors (#130057)
Avoid recompiles when custom C++ autograd functions use `ctx->saved_data` to save tensors.

iv.toTensor can return reference for `after(iv.toTensor())`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130057
Approved by: https://github.com/jansel
2024-07-08 07:42:07 +00:00
cyy
f4dcf2ae93 [1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301
Approved by: https://github.com/ezyang, https://github.com/r-barnes
2024-07-08 07:03:53 +00:00
f053be2a97 [dynamo] Graph break on random_ op (#130222)
Fixes https://github.com/pytorch/pytorch/issues/121621

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130222
Approved by: https://github.com/jansel
2024-07-08 06:10:24 +00:00
31bb65de19 [Inductor] Fix conditional codegen (#129492)
Summary:
We use a cache to guarantee that each `sym` is codegen'd only once; see the following code:
```
def ensure_size_computed(self, sym: sympy.Symbol):
    if isinstance(sym, sympy.Symbol) and symbol_is_type(sym, SymT.PRECOMPUTED_SIZE):
        if sym in self.computed_sizes:
            return
        self.computed_sizes.add(sym)
        expr = V.graph.sizevars.inv_precomputed_replacements[sym]
        self.writeline(
            f"{self.declare}{sym} = {self.expr_printer(expr)}{self.ending}"
        )
```
However, we did not consider the case where the same `sym` needs to be codegen'd in both branches of a conditional (true branch and false branch), which caused an `undefined symbols` error: P1441378833

To fix the issue, we use a stack to capture the state before the conditional codegen and restore it after the codegen is done.
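
A simplified sketch of the save/restore idea, assuming the only state involved is the set of already-emitted symbols. The class and method names (`BranchAwareCodegen`, `codegen_branch`) are hypothetical and do not mirror the actual Inductor wrapper code:

```python
class BranchAwareCodegen:
    def __init__(self):
        self.computed_sizes = set()
        self.lines = []
        self._saved_states = []  # stack of snapshots, one per open branch

    def ensure_size_computed(self, sym, expr):
        # Emit the definition of `sym` at most once per visible scope.
        if sym in self.computed_sizes:
            return
        self.computed_sizes.add(sym)
        self.lines.append(f"{sym} = {expr}")

    def codegen_branch(self, label, needed):
        # Snapshot the cache before the branch body is generated.
        self._saved_states.append(set(self.computed_sizes))
        self.lines.append(f"# {label} branch")
        for sym, expr in needed:
            self.ensure_size_computed(sym, expr)
        # Restore the cache so the sibling branch re-emits what it needs.
        self.computed_sizes = self._saved_states.pop()


cg = BranchAwareCodegen()
cg.codegen_branch("true", [("ps0", "2*s0")])
cg.codegen_branch("false", [("ps0", "2*s0")])
```

Without the snapshot and restore, the second call would skip `ps0` because the cache still remembered it from the first branch, leaving the symbol undefined in the false branch.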

Test Plan:
TORCH_LOGS="+inductor" buck2 run mode/dev-nosan -c fbcode.nvcc_arch=h100 -c fbcode.enable_gpu_sections=true --config 'cxx.extra_cxxflags=-g1' -c fbcode.platform010_cuda_version=12 //scripts/hhh:repro_cond_torch_compile

PYTORCH_TEST_FBCODE=1 TORCH_COMPILE_DEBUG=1 buck2 run  mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true //caffe2/test/inductor:control_flow -- -r test_cond_control_flow_with_precomputed_size

Differential Revision: D58973730

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129492
Approved by: https://github.com/aakhundov
2024-07-08 05:33:47 +00:00
c5c9dbece1 [dynamo][user-defined] Simplify and improve scope of UserDefinedObject var_getattr (#130169)
Fixes https://github.com/pytorch/pytorch/issues/122649

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130169
Approved by: https://github.com/jansel
ghstack dependencies: #118448, #130159
2024-07-08 04:10:56 +00:00
d0ad13fa42 [ROCm] Add int4 support (#129710)
Add AMD support for int4 kernel using mfma_f32_16x16x16bf16 instruction.
Only supports CDNA2 and CDNA3 gpus for now.
Fixes #124699

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129710
Approved by: https://github.com/malfet
2024-07-07 23:54:22 +00:00
d1b832e739 [inductor][mkl][inline-inbuilt-nn-modules] Change assertion (#130219)
Fixes the test in the next PR - `python test/inductor/test_mkldnn_pattern_matcher.py -k TestDynamicPatternMatcher.test_conv3d_unary_dynamic_shapes`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130219
Approved by: https://github.com/leslie-fang-intel
2024-07-07 21:32:07 +00:00
940e4477ab [runtime asserts] deduplicate runtime asserts & CSE (#128599)
This PR adds deduplication and CSE for runtime asserts. Existing size computation in the graph is CSE'd along with added runtime asserts, and redundant asserts are removed. Shape calls on intermediate tensors are also turned into compute on input sizes if possible, allowing intermediate tensors to be freed earlier. For example:
```
z = torch.cat([x, x], dim=0)  # 2*s0
w = z.repeat(y.shape[0])  # 2*s0*s1
_w = w.shape[0]
# something with _w ...

# turns into ->
s0 = x.shape[0]
s1 = y.shape[0]
_w0 = 2 * s0
_w = _w0 * s1
```

Additionally, constrain_range calls are deduplicated. Single-symbol bound checks for unbacked symbols (e.g. u0 >= 0, u0 <= 5) and sym_constrain_range.default calls are also removed, since they accumulate range info in the ShapeEnv, and are replaced with two _assert_scalar.default calls that check the min/max bounds. For example:
```
torch.sym_constrain_range_for_size(n, min=2, max=16)
torch.sym_constrain_range(n, min=4, max=20)
torch._check(n >= 0)
torch._check(n >= 3)
torch._check(n <= 14)

# turns into
torch.sym_constrain_range_for_size(n)
torch._check(n >= 4)
torch._check(n <= 14)
```
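
The bound deduplication in the second example amounts to keeping only the tightest lower and upper bound. A minimal sketch of that arithmetic, where `tighten_bounds` is a hypothetical helper and not part of the PR:

```python
import math


def tighten_bounds(lower_bounds, upper_bounds):
    """Collapse many bound checks on one symbol into a single (min, max) pair."""
    lo = max(lower_bounds, default=-math.inf)
    hi = min(upper_bounds, default=math.inf)
    if lo > hi:
        raise ValueError(f"contradictory bounds: {lo} > {hi}")
    return lo, hi


# Lower bounds 2, 4, 0, 3 and upper bounds 16, 20, 14 from the example above.
lo, hi = tighten_bounds([2, 4, 0, 3], [16, 20, 14])
print(lo, hi)  # 4 14
```
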

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128599
Approved by: https://github.com/ezyang
2024-07-07 20:10:14 +00:00
0c44684901 [Typo] Fix typo in DispatchKeyExtractor.h (#130221)
Summary: typo_helper

Test Plan: ci

Differential Revision: D59424671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130221
Approved by: https://github.com/Skylion007
2024-07-07 19:43:31 +00:00
e423224546 Revert "[Inductor][CPP] Enable Local Buffer for Outer loop fusion (#126967)"
This reverts commit 98929ceae3873f18f4747b88cdff708fde107aa7.

Reverted https://github.com/pytorch/pytorch/pull/126967 on behalf of https://github.com/leslie-fang-intel due to Broken trunk and need rebase ([comment](https://github.com/pytorch/pytorch/pull/126967#issuecomment-2212337926))
2024-07-07 06:16:32 +00:00
1b57dce35f Revert "[Inductor][CPP] Support more than one LocalBuffer (#129121)"
This reverts commit f794cf59bd0891ff4a4337e0d919ee68ba1f0472.

Reverted https://github.com/pytorch/pytorch/pull/129121 on behalf of https://github.com/leslie-fang-intel due to Broken trunk and need rebase ([comment](https://github.com/pytorch/pytorch/pull/129121#issuecomment-2212337590))
2024-07-07 06:13:40 +00:00
f794cf59bd [Inductor][CPP] Support more than one LocalBuffer (#129121)
**Summary**
Support more than one Local Buffer in an outer-loop fused node, as well as the case where multiple global buffers share the same local buffer.

**TestPlan**
```
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_two_local_buffers_in_outer_loop_fusion
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_share_local_buffers_in_outer_loop_fusion
```

**Next Step**

- [✓] Support more than one Local Buffer/Global Buffer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129121
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #126967
2024-07-07 05:43:08 +00:00
98929ceae3 [Inductor][CPP] Enable Local Buffer for Outer loop fusion (#126967)
**Summary**
Currently, the Inductor CPP backend [generated code](https://gist.github.com/leslie-fang-intel/98f91d43dabed581a1ffe23daf133a65#file-bf16-softmax-generated-code-wo-local-buffer-py) for `Softmax` with BF16 data type is significantly slower than the [ATen Implementation](9a2beb862d/aten/src/ATen/native/cpu/SoftMaxKernel.cpp (L149)). Upon comparing the generated code with ATen, the performance bottleneck appears to be related to the usage of [local buffer in ATen](9a2beb862d/aten/src/ATen/native/cpu/SoftMaxKernel.cpp (L159-L160)).

In the current implementation, Inductor uses the output buffer of Kernel Group Args to store and load temporary results (such as `exp`), since this buffer corresponds to a `SchedulerNode`. Each thread accesses a portion of this output buffer via indexing. However, since this buffer (taking `exp` as an example) is only used internally within the decomposed `softmax`, it can be replaced with a thread-local buffer, similar to ATen's approach.

In this PR, we have introduced the optimizations of `LocalBuffer`. Following this enhancement, the [new generated Inductor code with local buffer](https://gist.github.com/leslie-fang-intel/98f91d43dabed581a1ffe23daf133a65#file-bf16-softmax-generated-code-w-local-buffer-py) for BF16 `Softmax` demonstrates significantly improved performance. Running the benchmark [here](https://gist.github.com/leslie-fang-intel/37d81441237b5139c8295f5e6c4cd31a) to test this BF16 `Softmax` case on an 8480 Xeon server shows similar performance between the Inductor CPP Backend and the ATen implementation.

**TestPlan**
```
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_local_buffer_in_outer_loop_fusion
```

**Next Step**

- [ ] Support more than one Local Buffer/Global Buffer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126967
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-07-07 05:34:57 +00:00
a3ce9eddd6 [BE][Easy] apply autofix for ruff rule unnecessary-literal-set (C405) and unnecessary-map (C417) (#130198)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130198
Approved by: https://github.com/Skylion007
2024-07-07 00:58:22 +00:00
9983242c8e [inductor] support adding a new inductor backend using PrivateUse1 (#129953)
Add handling for custom devices registered via PrivateUse1 in the `init_backend_registration()` function.

Fixes #129952

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129953
Approved by: https://github.com/jansel
2024-07-06 21:15:40 +00:00
3d138af943 [Inductor] First implementation of the B2B-GEMM pass with tests (#129995)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129995
Approved by: https://github.com/eellison
2024-07-06 19:10:22 +00:00
3957b3b349 [inductor] switch CppCodeCache to new cpp_builder. (#130132)
Changes:
1. switch CppCodeCache to new cpp_builder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130132
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-06 18:57:44 +00:00
dc5f37193f [inductor] switch AotCodeCompiler to new cpp_builder (#130127)
Changes:
1. Switch `AotCodeCompiler` to new cpp_builder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-06 18:44:13 +00:00
cyy
dfe3534134 [1/N] Fix NVCC warnings (#130191)
Fixes NVCC warnings, a required step toward enabling Werror on CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130191
Approved by: https://github.com/Skylion007
2024-07-06 18:25:04 +00:00
3f50e197c4 [BE] annotate torch.autograd.graph (#129558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129558
Approved by: https://github.com/soulitzer
2024-07-06 18:14:16 +00:00
01ec03bac6 [inductor] switch HalideCodeCache to new cpp_builder. (#130146)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130146
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-06 17:35:17 +00:00
cyy
2f219f7d79 Enforce unused-{variable/function} checks to all torch targets (#130189)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130189
Approved by: https://github.com/ezyang
2024-07-06 16:03:01 +00:00
cyy
096eca2f9a [2/N] Replace exceptions with static_assert(false) in some templates (#130116)
Follows #127371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130116
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-07-06 13:23:05 +00:00
520a4642bf [CI] Enable build with asserts (#129924)
Not a standard CMake config, as far as I can tell, but it introduces an important concept of optimized build without `NDEBUG`. Test by running `python -c "import torch; torch._C._crash_if_debug_asserts_fail(424242)"`, which is a no-op unless debug_assert_fail is enabled.

Add the recently added `_unsafe_masked_index`/`_unsafe_masked_index_put_accumulate` to DONT_ENFORCE_SAME_TENSOR_IMPL_OR_STORAGE so that tests involving those ops do not all fail with an internal assert.
Suppress a number of internal asserts to make CI green, see https://github.com/pytorch/pytorch/issues/130073

Fixes https://github.com/pytorch/pytorch/issues/102105

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129924
Approved by: https://github.com/atalman, https://github.com/albanD
2024-07-06 13:14:32 +00:00
da66e50e6e Added compile option to create_block_mask (#130106)
Compiling the `create_block_mask` function allows us to "materialize" extremely large masks. This would have been a 1 *trillion* element tensor if fully materialized.

```
print(do_bench(lambda: create_block_mask(causal_mask, 1, 1, 2**20, 2**20, _compiled=True)))
```
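
For scale, the element count of the mask in the benchmark above can be worked out directly. The one-byte-per-element figure is an assumption about a dense boolean mask, not something stated in this PR:

```python
seq_len = 2**20                      # Q and KV lengths passed to create_block_mask above
elements = seq_len * seq_len         # dense mask would hold one entry per (q, kv) pair
print(f"{elements:,}")               # 1,099,511,627,776

tib_at_one_byte = elements / 2**40   # size in TiB, assuming 1 byte per element
```
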

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130106
Approved by: https://github.com/yanboliang
ghstack dependencies: #130160
2024-07-06 08:09:56 +00:00
963f430d13 Revert "[runtime asserts] deduplicate runtime asserts & CSE (#128599)"
This reverts commit 0267b2ddcb58aa66b2b62336216da7df4f9939d8.

Reverted https://github.com/pytorch/pytorch/pull/128599 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to cause a landrace and fails inductor/test_cudagraph_trees in trunk 0267b2ddcb ([comment](https://github.com/pytorch/pytorch/pull/128599#issuecomment-2211690518))
2024-07-06 07:20:05 +00:00
aa4899eee9 [CCA][Memory Snapshot] Fix race on alloc_trace vector - S430480 (#130180)
Summary:
Multiple threads can access the `alloc_trace` `std::vector` concurrently, which results in SIGSEGVs when objects are double-freed, accessed after being freed, or inserted by two threads at the same time.

We need to lock when inserting, accessing or removing TraceEntry in alloc_trace.
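
The locking discipline can be illustrated with a small Python analogue. This is only a sketch: the real fix is in C++ around a `std::vector`, the `LockedTrace` class is hypothetical, and a Python list would not crash in the same way, so the example shows only where the lock is taken:

```python
import threading


class LockedTrace:
    def __init__(self):
        self._entries = []
        self._lock = threading.Lock()

    def append(self, entry):
        with self._lock:  # guard inserts
            self._entries.append(entry)

    def snapshot(self):
        with self._lock:  # guard reads by copying under the lock
            return list(self._entries)

    def clear(self):
        with self._lock:  # guard removals
            self._entries.clear()


def worker(trace, worker_id, n):
    for j in range(n):
        trace.append((worker_id, j))


trace = LockedTrace()
threads = [threading.Thread(target=worker, args=(trace, i, 1000)) for i in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```
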

Test Plan:
This is a rare crash, exposed when we introduced recordAnnotations, which saves record_function annotations into the snapshot files. Saving a lot of annotations can trigger this bug. Here are a few jobs that crashed before and that this diff fixes.

Differential Revision: D59380507

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130180
Approved by: https://github.com/eqy, https://github.com/kit1980
2024-07-06 06:14:54 +00:00
e019540c9e Revert "Fix the SDPA AOT export issue (#130164)"
This reverts commit 1927c406844affbfe3496d5cbc31d4ebe11c8bfb.

Reverted https://github.com/pytorch/pytorch/pull/130164 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is breaking ExecuTorch tests in trunk 1927c40684 ([comment](https://github.com/pytorch/pytorch/pull/130164#issuecomment-2211667777))
2024-07-06 05:59:49 +00:00
bf609630ae Fix a bunch of stride issues with FlexAttention (#130160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130160
Approved by: https://github.com/yanboliang
2024-07-06 03:58:14 +00:00
10c831567b Make sympify'ing SymInt/etc produce their sympy expression (#130166)
There is one huge problem this fixes: today, sympify(symint)
produces a float(!!) because Sympy attempts to see if you can
coerce the symint to float in sympify and of course this works on
SymInt.

However, this also has another nontrivial effect: anywhere in Inductor
where sympy expressions are passed around, it is also valid to pass
around a SymInt now.  I'm ambivalent about this: it's currently a
mistake to be passing around a SymInt when a sympy expression is
expected.  But maybe this is fine?

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130166
Approved by: https://github.com/yf225
2024-07-06 03:56:45 +00:00
acd03ca2d9 [halide-backend] Support scan kernels (#129035)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129035
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #130129
2024-07-06 03:49:50 +00:00
c5110f6388 [halide-backend] Use 0D scalar inputs/outputs (#130129)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130129
Approved by: https://github.com/shunting314
2024-07-06 03:49:50 +00:00
0267b2ddcb [runtime asserts] deduplicate runtime asserts & CSE (#128599)
This PR adds deduplication and CSE for runtime asserts. Existing size computation in the graph is CSE'd along with added runtime asserts, and redundant asserts are removed. Shape calls on intermediate tensors are also turned into compute on input sizes if possible, allowing intermediate tensors to be freed earlier. For example:
```
z = torch.cat([x, x], dim=0)  # 2*s0
w = z.repeat(y.shape[0])  # 2*s0*s1
_w = w.shape[0]
# something with _w ...

# turns into ->
s0 = x.shape[0]
s1 = y.shape[0]
_w0 = 2 * s0
_w = _w0 * s1
```

Additionally, constrain_range calls are deduplicated. Single-symbol bound checks for unbacked symbols (e.g. u0 >= 0, u0 <= 5) and sym_constrain_range.default calls are also removed, since they accumulate range info in the ShapeEnv, and are replaced with two _assert_scalar.default calls that check the min/max bounds. For example:
```
torch.sym_constrain_range_for_size(n, min=2, max=16)
torch.sym_constrain_range(n, min=4, max=20)
torch._check(n >= 0)
torch._check(n >= 3)
torch._check(n <= 14)

# turns into
torch.sym_constrain_range_for_size(n)
torch._check(n >= 4)
torch._check(n <= 14)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128599
Approved by: https://github.com/ezyang
2024-07-06 03:44:49 +00:00
7c43f59a45 [audio hash update] update the pinned audio hash (#129429)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129429
Approved by: https://github.com/pytorchbot
2024-07-06 03:34:12 +00:00
bd0252fb98 [dynamo][user-defined] Support method descriptors (#130159)
Fixes https://github.com/pytorch/pytorch/issues/120650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130159
Approved by: https://github.com/jansel
ghstack dependencies: #118448
2024-07-06 02:03:09 +00:00
a1a2023eb8 Back out "Pass device to is_pinned call inside TensorProperties.create_from_tensor" (#129972)
Summary:
It turns out that the device passed as a param to is_pinned is meant to be the accelerator device with respect to which pinning is expected. Passing 'cpu' always makes the return value false, regardless of whether the actual tensor is a cpu tensor pinned to Cuda.

Besides, there is a PR https://github.com/pytorch/pytorch/pull/126376 about to be merged which automatically uses the correct accelerator device which obviates the need for users to pass any kind of explicit  device and doesn't create Cuda context for pure cpu tensors.

Note, https://www.internalfb.com/intern/test/844425019931542?ref_report_id=0 test is expected to be broken by this diff, but it should be fixed forward by https://github.com/pytorch/pytorch/pull/126376

Test Plan: Sandcastle.

Differential Revision: D59283190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129972
Approved by: https://github.com/LucasLLC
2024-07-06 01:07:32 +00:00
1927c40684 Fix the SDPA AOT export issue (#130164)
Summary:
## Context
TL;DR: aot_export failed for SDPA memory efficient backend when using `inference_mode`

The CMF AOTI lowering started to fail on trunk. We have a script (https://fburl.com/code/kfk64i5s) to reproduce the issue quickly (log: P1469307638). By bisecting the stack, we found that the issue starts at D58701607

## Root Cause
In the `inference_mode()`,
`aten::scaled_dot_product_attention` was not decomposed before `functionalization`, and the op itself is an out-of-place op, so `functionalization` made no change; it was then decomposed into `masked_fill_`, which was in turn decomposed into `copy_`.
So it's `aten::sdpa` --- (functionalization) ---> `aten::sdpa` --- (decompose) ---> `masked_fill_` --- (decompose) ---> `copy_` ---> failure

In the `torch.no_grad()`,
`aten::sdpa` was decomposed before `functionalization`, so the story is
`aten::sdpa` --- (decompose) ---> `masked_fill_` --- (functionalization) ---> `masked_fill` --- (decompose) ---> `out-place ops` ---> good

## How to fix
Long-term:
The issue was tracked in the ticket (https://github.com/pytorch/pytorch/issues/129418). The long-term fix could be we do one more round of `functionalization` after the `decompose`, like

`aten::sdpa` --- (functionalization) ---> `aten::sdpa` --- (decompose) ---> `masked_fill_` --- (functionalization) ---> `masked_fill` ---> good

Short-term:
That would be a big change, I guess. To unblock the production use case, I marked `aten::sdpa` to be decomposed in this diff.

Test Plan:
local repro works now

buck run mode/opt scripts/sijiac/prototypes:sdpa_aoti

Differential Revision: D59385876

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130164
Approved by: https://github.com/zou3519
2024-07-06 00:57:47 +00:00
c5ede865c4 [pt2-bench] raise tolerance for squeezenet1_1 (#130165)
The training accuracy for this model starts to regress. It does not show up on the weekly run yet but
1. it shows up in my MA runs [here](https://hud.pytorch.org/benchmark/torchbench/inductor_max_autotune?dashboard=torchinductor&startTime=Fri,%2028%20Jun%202024%2006:53:45%20GMT&stopTime=Fri,%2005%20Jul%202024%2006:53:45%20GMT&granularity=hour&mode=training&dtype=amp&lBranch=gh/shunting314/162/head&lCommit=cb236e8c198b54901e4fb19698f91be786f72e25&rBranch=main&rCommit=4ee1cb9b955fcc5d75a421b19393998122136f2c)
2. I can repro it locally

Command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/torchbench.py --accuracy --training --amp --backend inductor --device cuda --only squeezenet1_1
```

Raise the tolerance to fix.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130165
Approved by: https://github.com/jansel
ghstack dependencies: #129996, #129941, #130005, #130163
2024-07-06 00:49:15 +00:00
0fcbca9adb [pt2-bench] use eval mode for vision_maskrcnn (#130163)
Try to fix https://github.com/pytorch/pytorch/issues/130161

`--accuracy` works because we use eval mode. `--training` does not work because we use training mode, but TorchBench does not return target tensors, and in training mode vision_maskrcnn requires target tensors.

I fix that by always using eval mode for vision_maskrcnn training.

With the fix, I start to see a segfault: https://gist.github.com/shunting314/5a70df3463b2a4421b2c34aa88e78d1f

I'm not sure if that's due to my local setup, but I think the fix in this PR is something we need anyway. We can check the dashboard after the PR is in.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130163
Approved by: https://github.com/jansel
ghstack dependencies: #129996, #129941, #130005
2024-07-06 00:49:15 +00:00
cyy
e5841bb8d5 [3/N] Enforce unused-function and unused-variable checks (#130084)
Follows #129878.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130084
Approved by: https://github.com/ezyang
2024-07-05 23:56:00 +00:00
126796d239 [c10d] fixing an UT after a change in eager mode new group (#130167)
Summary:
After https://github.com/pytorch/pytorch/pull/129284, new_group is eager if device_id is specified, which broke one UT.
This PR fixes it.

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130167
Approved by: https://github.com/wconstab
2024-07-05 23:18:30 +00:00
d1d0a7080f [torchgen] reference generated comment to actual location of the generator and template (#130020)
As per title.

```diff
# torch/_VF.pyi

- # @generated from torch/_C/_VariableFunctions.pyi.in
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/_VariableFunctions.pyi.in
```

```diff
# torch/return_types.pyi

- # @generated from torch/_C/return_types.pyi
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/return_types.pyi.in
```

```diff
# torch/_C/__init__.pyi

- # @generated from torch/_C/__init__.pyi.in
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/__init__.pyi.in
```

```diff
# torch/_C/_nn.pyi

+ # @generated by tools/pyi/gen_pyi.py from torch/_C/_nn.pyi.in
```

```diff
# torch/_C/_VariableFunctions.pyi

- # @generated from torch/_C/_VariableFunctions.pyi.in
+ # @generated by tools/pyi/gen_pyi.py from torch/_C/_VariableFunctions.pyi.in
```

```diff
# torch/nn/functional.pyi

+ # @generated by tools/pyi/gen_pyi.py from torch/nn/functional.pyi.in
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130020
Approved by: https://github.com/ezyang
2024-07-05 21:47:14 +00:00
6fc771d19b Revert "Change depreacate warning on dispatch_on_subclass to warn once (#130047)"
This reverts commit 8ff243bcf190bab62348310693f0ad2f90061c89.

Reverted https://github.com/pytorch/pytorch/pull/130047 on behalf of https://github.com/clee2000 due to broke test_overrides.py::TestTorchFunctionWarning::test_warn_on_invalid_torch_function on multiple jobs 8ff243bcf1 https://github.com/pytorch/pytorch/actions/runs/9812489165/job/27097342443.  Dr CI is doing something weird about the unstable failures ([comment](https://github.com/pytorch/pytorch/pull/130047#issuecomment-2211409090))
2024-07-05 21:03:36 +00:00
df50452279 Pin optree==0.11.0 on windows CI (#130155)
doctests
test_testing

Failing run has 0.12.0 https://github.com/pytorch/pytorch/actions/runs/9804335516/job/27072891998
Succeeding run has 0.11.0 https://github.com/pytorch/pytorch/actions/runs/9798330845/job/27057359554

optree is already pinned on macOS and Linux.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130155
Approved by: https://github.com/huydhn, https://github.com/atalman
2024-07-05 20:28:58 +00:00
18e75c098b [DCP] Adds Checkpointing Team (dcp) to merge rules (#129582)
[DCP] Adds Checkpointing Team (dcp) to merge rules. Please comment to this PR if you think you should be added as well!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129582
Approved by: https://github.com/fegin
2024-07-05 20:09:31 +00:00
739fc01ac9 [NCCL] Make sure current device is correct in torch.distributed.barrier()'s streamSynchronize (#129908)
The real root cause of the issue is that the current stream on a given CUDA device may be the legacy default stream, which doesn't seem to have a device associated with it. If the current CUDA device as reported by `cudaGetDevice` doesn't match the device of the intended legacy default stream (this happens if a user is running distributed code without, e.g., `torch.cuda.set_device(mylocalrank)`), then the stream synchronize will not have the intended effect. Previous stream sync code here correctly inserted a `DeviceGuard` to ensure that this legacy-default-stream-sync with a mismatched current device didn't happen, but the check is elided here. The simplest fix is to just use the `CUDAStream` wrapper's `synchronize()` call, which already correctly uses a `DeviceGuard` internally:
a21d4363d2/c10/cuda/CUDAStream.h (L132)

OUTDATED below:
The current behavior of `barrier`'s `synchronizeInternal` seems to be a bit counterintuitive, as it is synchronizing on a device's current `CUDAStream` rather than the one used for the actual `allreduce` (the `ncclStream`). In practice this results in a script like the following:
```
import logging
import os
import time
import torch
import torch.distributed as dist

def main():
    logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")

    backend = 'nccl'
    group = torch.distributed.init_process_group(backend=backend)
    rank = torch.distributed.get_rank(group=group)

    for i in range(4):
        time.sleep(rank)
        logging.info(f"Rank {rank}: enter barrier {i}")
        dist.barrier()
        logging.info(f"Rank {rank}: exit barrier {i}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```
appearing to show that ranks can exit barrier(s) before other ranks have entered. Note that the device-side ordering should still be correct in this case, but the host is free to run ahead.

The issue can be worked around by adding a `torch.cuda.synchronize(rank)` after the `barrier`, but this seems to be against the spirit of the stream synchronization, which deliberately tries to avoid a device synchronization.

This PR does a sync on the `allreduce`'s stream so that a device synchronization is not needed to align the host's output with the device.

CC @wujingyue @Aidyn-A @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129908
Approved by: https://github.com/kwen2501
2024-07-05 19:53:54 +00:00
faebaef089 [EZ] Fix typo in upload stats OIDC rolename (#130168)
My mistake from https://github.com/pytorch/pytorch/pull/129544
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130168
Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/atalman
2024-07-05 19:38:24 +00:00
3d56673b24 [Split Build][BE] remove extraneous .py, .a, and .so files (#130053)
Removes extraneous .a, .so, and .py files from the split build. From here we can also clean up the builder script which produces the binary to do this. That PR is https://github.com/pytorch/builder/pull/1912

Verification:

The built wheel with BUILD_LIBTORCH_WHL=1 has the following files only (with .a, .so, and .py extensions)

```
sahanp@devgpu086 ~/p/dist (viable/strict)> pwd
/home/sahanp/pytorch/dist
sahanp@devgpu086 ~/p/dist (viable/strict)> find . -type f \( -name "*.py" -o -name "*.a" -o -name "*.so" \)
./torch/__init__.py
./torch/lib/libbackend_with_compiler.so
./torch/lib/libc10.so
./torch/lib/libjitbackend_test.so
./torch/lib/libtorch.so
./torch/lib/libtorch_cpu.so
./torch/lib/libtorch_global_deps.so
./torch/lib/libtorchbind_test.so
sahanp@devgpu086 ~/p/dist (viable/strict)>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130053
Approved by: https://github.com/atalman
2024-07-05 19:05:32 +00:00
8ff243bcf1 Change depreacate warning on dispatch_on_subclass to warn once (#130047)
Summary:
Right now the deprecation warning fires on every operator call that goes into torch_function. This changes it to TORCH_WARN_ONCE instead.
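
The difference between warning on every call and warning once can be sketched in plain Python. This is only an illustration of the semantics, not the C++ `TORCH_WARN_ONCE` macro; `warn_every_call`, `make_warn_once`, `count_warnings`, and the message text are hypothetical names made up for the example.

```python
import warnings


def warn_every_call():
    # Behaviour before the change: one warning per call.
    warnings.warn("dispatch_on_subclass is deprecated", DeprecationWarning)


def make_warn_once():
    # Hypothetical helper: the closure keeps its own "already warned" flag.
    warned = False

    def warn_once():
        nonlocal warned
        if not warned:
            warned = True
            warnings.warn("dispatch_on_subclass is deprecated", DeprecationWarning)

    return warn_once


def count_warnings(fn, calls):
    # Record every warning emitted while calling fn `calls` times.
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        for _ in range(calls):
            fn()
    return len(caught)
```

Calling `count_warnings(warn_every_call, 5)` records five warnings, while `count_warnings(make_warn_once(), 5)` records one.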

More context in https://fb.workplace.com/groups/260102303573409/permalink/445299188387052/

Test Plan: Sandcastle

Differential Revision: D59338775

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130047
Approved by: https://github.com/XilunWu
2024-07-05 18:52:49 +00:00
784e3b4123 Revert "Change numeric_debug_handle to store per-node id (#129811)"
This reverts commit a9a744e442975cfbc6f4b26a532e5c1b3d9d5692.

Reverted https://github.com/pytorch/pytorch/pull/129811 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/129811#issuecomment-2211245852))
2024-07-05 18:14:02 +00:00
889ed48a22 Fix missing id-token write in upload stats (#130153)
Fix the mistake from https://github.com/pytorch/pytorch/pull/129544
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130153
Approved by: https://github.com/clee2000
2024-07-05 18:05:46 +00:00
7c5f3cd049 Add explain function to TSConverter. (#129968)
Summary: The explain function performs a conversion dry run and reports to users which operators are unsupported or fail conversion.
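
A dry run of this kind can be sketched roughly as follows. This is a simplified illustration and not the TSConverter implementation; the `explain` signature, the `converters` mapping, and the op names used with it are assumptions made for the example.

```python
def explain(op_names, converters):
    """Dry run: report ops with no converter and ops whose converter raises.

    `converters` is a hypothetical mapping from op name to a zero-argument
    callable that performs the conversion for that op.
    """
    unsupported = []
    failed = {}
    for name in op_names:
        convert = converters.get(name)
        if convert is None:
            # No converter registered for this op.
            unsupported.append(name)
            continue
        try:
            convert()
        except Exception as exc:
            # Keep going so the user gets the full picture in one pass.
            failed[name] = repr(exc)
    return unsupported, failed
```

The point of the design is that a failure does not stop the loop, so a single call lists every problematic operator instead of only the first one.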

Test Plan: * `pytest test/export/test_converter.py`

Differential Revision: D59251934

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129968
Approved by: https://github.com/angelayi
2024-07-05 18:04:29 +00:00
7ea8a3c9b8 [dynamo] Validate check_fn (#118448)
Fixes - https://github.com/pytorch/pytorch/issues/128090

Tracker issue here - https://github.com/pytorch/pytorch/issues/129937

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118448
Approved by: https://github.com/jansel, https://github.com/ezyang
2024-07-05 18:04:12 +00:00
7192ee0735 Default to input tensor device for as_nested_tensor(t) (#130050)
Fixes #129647
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130050
Approved by: https://github.com/YuqingJ
2024-07-05 17:50:08 +00:00
a33ee73a28 Upload perf stats to both Rockset and dynamoDB (#129544)
To avoid outage on HUD, I plan to migrate perf stats to dynamoDB as follows:

1. Upload perf stats to both Rockset and dynamoDB
2. Copy all the existing content from Rockset to dynamoDB
3. Create new Rockset tables to map to dynamoDB
4. Switch HUD to use the new Rockset tables (temporarily)
5. Delete the existing tables

This depends on https://github.com/pytorch-labs/pytorch-gha-infra/pull/422

### Testing

```
python3 -m tools.stats.upload_dynamo_perf_stats --workflow-run-id 9770217910 --workflow-run-attempt 1 --repo "pytorch/pytorch" --head-branch "gh/shunting314/162/head" --rockset-collection torch_dynamo_perf_stats --rockset-workspace inductor --dynamodb-table torchci-dynamo-perf-stats --match-filename "^inductor_"
...
Writing 1607 documents to DynamoDB torchci-dynamo-perf-stats
```

Then confirm that the same number of documents is in the table:

![Screenshot 2024-07-03 at 18 10 35](https://github.com/pytorch/pytorch/assets/475357/6c055c96-00ca-4cb3-bbe5-fe4914f9da9b)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129544
Approved by: https://github.com/clee2000
2024-07-05 16:31:49 +00:00
e7ab7b83bc Have torch_key hash entire torch directory (#129250)
Summary:
Title. This way, both FXGraphCache and AOTAutogradCache use the same torch_key, and we don't need to only hash specific files.

There's an argument to be made to only hash *.py and *.cpp files. Maybe we can fix the glob to do that.

We use a buck_filegroup because otherwise $SRCs gets too large. By using `$(location :torch_sources)`, we make the genrule implicitly depend on all files globbed by torch_sources.
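
For intuition, hashing a whole source tree can be sketched in plain Python as below. This is not the buck genrule or the actual torch_key code; `hash_directory` and its `suffixes` parameter are hypothetical, with `suffixes` standing in for the possible `*.py` / `*.cpp` restriction mentioned above.

```python
import hashlib
from pathlib import Path


def hash_directory(root, suffixes=None):
    """Return a SHA-256 hex digest over every file under `root`.

    Files are visited in sorted order so the digest is deterministic.
    `suffixes` (for example {".py", ".cpp"}) optionally restricts the files.
    """
    root = Path(root)
    digest = hashlib.sha256()
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        if suffixes is not None and path.suffix not in suffixes:
            continue
        # Mix in the relative path so renames change the hash too.
        digest.update(path.relative_to(root).as_posix().encode())
        digest.update(b"\0")
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

Editing any file under `root` changes the digest, which mirrors the check described in the test plan.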

Test Plan:
Unit tests still pass on OSS
For torch_key:

```
buck2 build caffe2:src_hash.txt -v 2 --show-output
```
See the output, then make any change to any torch file. See that the hash changes.

Reviewed By: oulgen

Differential Revision: D58875785

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129250
Approved by: https://github.com/oulgen
2024-07-05 15:37:16 +00:00
eea4ece256 Revert "[audio hash update] update the pinned audio hash (#129429)"
This reverts commit 30fc4b06f55c7c4a915f938d7d5d6abbbc23bf61.

Reverted https://github.com/pytorch/pytorch/pull/129429 on behalf of https://github.com/jeanschmidt due to pytorch bot should not have allowed this merge, as there are failing jobs ([comment](https://github.com/pytorch/pytorch/pull/129429#issuecomment-2210894639))
2024-07-05 13:38:44 +00:00
4b05d9d233 Revert "[NCCL] Make sure current device is correct in torch.distributed.barrier()'s streamSynchronize (#129908)"
This reverts commit c9f1db265e317829b3a4d3af5be5c9266874dcd4.

Reverted https://github.com/pytorch/pytorch/pull/129908 on behalf of https://github.com/jeanschmidt due to Seems to have introduced windows errors on main ([comment](https://github.com/pytorch/pytorch/pull/129908#issuecomment-2210888890))
2024-07-05 13:34:59 +00:00
8f6765f7a7 [pt2-bench] fix accuracy failure for beit_base_patch16_224 during training (#130005)
This model's accuracy test recently regressed. The debugging process to find the cause went quite smoothly, so I'm writing it down in case it is helpful.

Clicking the model name beit_base_patch16_224 on the dashboard shows the pass status of the model over, e.g., the past month. For this model, we can see that it started to fail on June 08:

<img width="1118" alt="Screenshot 2024-07-02 at 5 36 35 PM" src="https://github.com/pytorch/pytorch/assets/52589240/32f27ccd-3ec7-4431-88b3-febeff831f8e">

What's nice is the dashboard shows the nightly commits for each run.

Running
```
git log --oneline a448b3ae9537c0ae233fb9199a4a221fdffbb..0e6c204642a571d5a7cd60be0caeb9b50faca030 torch/_inductor/
```
gives us the list of Inductor PRs between the good and bad commits: https://gist.github.com/shunting314/eb57965688fc9e1746fcfa9b7b6b02df

Skimming through the PRs, I suspect that
```
ffc202a1b91 Added remove_noop_ops to joint_graph_passes (#124451)
```
can change numerics, so I disabled it locally with this one-line change: https://gist.github.com/shunting314/13aec768bda986056d0fb40dce53396e . The accuracy test then passes. (Command: time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only beit_base_patch16_224  )

Horace's PR (https://github.com/pytorch/pytorch/pull/124451) itself is valid. It removes no-op ops in the joint graph. I think the changed graph may cause the partitioner to make different recomputation decisions, which can change numerics.

Since this is not a real issue, I'll raise the tolerance to make it pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130005
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #129996, #129941
2024-07-05 10:26:39 +00:00
c0735a3dd3 [pt2-bench] fix accuracy failure for a few models (#129941)
This PR batches the fixes for a few accuracy failures during training by raising tolerances. I do that only for models whose failures I believe are not due to a real issue.

## sebotnet33ts_256

The accuracy test for this model started to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256).

I cannot repro locally, but the log from the dashboard shows:
```
RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
raising the tolerance should fix it.

## DebertaForQuestionAnswering

This model fails the accuracy test on the dashboard only in max-autotune mode. I cannot repro locally with the command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering
```

From the error message on the dashboard:
```
RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
```

0.02 tolerance should suppress this error.
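
The arithmetic behind that claim can be checked with a small sketch. It assumes the check has the form `res <= multiplier * ref + tol / 10`, which is consistent with the logged numbers above but is an assumption here, not a quote of `torch._dynamo.utils.same`; the function name `passes` is hypothetical.

```python
def passes(res_rmse, ref_rmse, multiplier, tol):
    # Assumed form of the RMSE check; see the note above.
    return res_rmse <= multiplier * ref_rmse + tol / 10.0


# Numbers from the DebertaForQuestionAnswering log line.
res_rmse, ref_rmse, multiplier = 0.01803, 0.00537, 3.0

# tol = 0.01: bound is 3 * 0.00537 + 0.001 = 0.01711, below 0.01803.
fails_with_old_tol = not passes(res_rmse, ref_rmse, multiplier, 0.01)

# tol = 0.02: bound is 3 * 0.00537 + 0.002 = 0.01811, above 0.01803.
passes_with_new_tol = passes(res_rmse, ref_rmse, multiplier, 0.02)
```

Under that assumption the new bound clears the measured error only narrowly, by about 0.00008.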

## gluon_inception_v3

This model fails on the dashboard in max-autotune mode. I cannot repro locally with the command
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3
```

From the error message on the dashboard
```
RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var
```
raising tolerance should suppress this error.

## mobilenetv3_large_100

Fails in max-autotune (MA) mode. I cannot repro locally with the command
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only
```
The error message on the dashboard is
```
RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```

The tensor is so small that the noise can be high. I use a larger multiplier for smaller tensors in torch._dynamo.utils.same.
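
A size-dependent multiplier could look roughly like the sketch below. The threshold of 1000 elements and the multiplier of 10.0 are hypothetical values picked for illustration, not values taken from this PR, and the check again assumes the form `res <= multiplier * ref + tol / 10`.

```python
def pick_multiplier(numel, base=3.0, small_numel=1000, small_multiplier=10.0):
    # Hypothetical rule: small tensors get a more forgiving multiplier
    # because their RMSE is dominated by noise.
    return small_multiplier if numel < small_numel else base


def passes(res_rmse, ref_rmse, numel, tol):
    multiplier = pick_multiplier(numel)
    # Assumed form of the RMSE check.
    return res_rmse <= multiplier * ref_rmse + tol / 10.0


# mobilenetv3_large_100 log line: scalar tensor (numel == 1), tol 0.04.
# With multiplier 3.0 the bound would be 3 * 0.05205 + 0.004 = 0.16015,
# which is below the measured 0.29754; with 10.0 the bound is 0.5245.
scalar_case_ok = passes(0.29754, 0.05205, numel=1, tol=0.04)
```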

## yolov3

Fails on the dashboard with the error
```
Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```

Fix it by using a larger multiplier for smaller tensors and raising the tolerance.

## timm_efficientdet

Fails on the dashboard with the error
```
E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
But I cannot repro locally with the command
```
time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet  --training
```

Raising the tolerance should fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941
Approved by: https://github.com/jansel
ghstack dependencies: #129996
2024-07-05 10:26:39 +00:00
8f1c2e1e28 [pt2-bench] pass acc test if ref is NaN (#129996)
I'm debugging the accuracy failure for training vision_maskrcnn.

Unfortunately, I could not run it locally (I've checked that the pinned commits for torchbenchmark/torchvision are correct, and reinstalled torchbenchmark for mask_rcnn). I get this error:
```
eager run fail: AssertionError: targets should not be none when in training mode
```
(Command: time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --training --only vision_maskrcnn )

But looking at the log from the dashboard:
```
E0623 19:17:59.085000 140114670171328 torch/_dynamo/utils.py:1468] RMSE (res-fp64): nan, (ref-fp64): nan and shape=torch.Size([1024, 256, 1, 1]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```

We can see that both the reference number and the pt2 number are NaN. This PR changes torch._dynamo.utils.same to return true if both RMSE values are NaN.
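
The reason a special case is needed is that any comparison involving NaN is false, so a plain `<=` check can never pass. A minimal sketch, assuming the check has the form `res <= multiplier * ref + tol / 10` (the function name `same_rmse` is hypothetical and this is not the actual code in torch._dynamo.utils):

```python
import math


def same_rmse(res_rmse, ref_rmse, multiplier=3.0, tol=1e-3):
    # Treat "both sides are NaN" as a pass: the compiled result is
    # no worse than the eager reference.
    if math.isnan(res_rmse) and math.isnan(ref_rmse):
        return True
    # Assumed form of the regular check. If only one side is NaN,
    # the comparison is False and the test still fails.
    return res_rmse <= multiplier * ref_rmse + tol / 10.0
```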

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129996
Approved by: https://github.com/jansel
2024-07-05 10:26:39 +00:00
78a0b010eb Refine XPU UTs (#130138)
# Motivation
1. enable all test cases related to `TestXpu` running in XPU CI.
2. make `test_lazy_init` stable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130138
Approved by: https://github.com/EikanWang
2024-07-05 09:56:22 +00:00
3240bff56a [benchmarking] Add join_results.py (#129202)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129202
Approved by: https://github.com/yanboliang, https://github.com/shunting314
2024-07-05 06:55:30 +00:00
30fc4b06f5 [audio hash update] update the pinned audio hash (#129429)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129429
Approved by: https://github.com/pytorchbot
2024-07-05 03:32:29 +00:00
c9f1db265e [NCCL] Make sure current device is correct in torch.distributed.barrier()'s streamSynchronize (#129908)
The real root cause of the issue is that the current stream on a given CUDA device may be the legacy default stream, which doesn't seem to have a device associated with it. If the current CUDA device as reported by `cudaGetDevice` doesn't match the device of the intended legacy default stream (this happens if a user is running distributed code without, e.g., `torch.cuda.set_device(mylocalrank)`), then the stream synchronize will not have the intended effect. Previous stream sync code here correctly inserted a `DeviceGuard` to ensure that this legacy-default-stream-sync with a mismatched current device didn't happen, but the check is elided here. The simplest fix is to just use the `CUDAStream` wrapper's `synchronize()` call, which already correctly uses a `DeviceGuard` internally:
a21d4363d2/c10/cuda/CUDAStream.h (L132)

OUTDATED below:
The current behavior of `barrier`'s `synchronizeInternal` seems to be a bit counterintuitive, as it is synchronizing on a device's current `CUDAStream` rather than the one used for the actual `allreduce` (the `ncclStream`). In practice this results in a script like the following:
```
import logging
import os
import time
import torch
import torch.distributed as dist

def main():
    logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")

    backend = 'nccl'
    group = torch.distributed.init_process_group(backend=backend)
    rank = torch.distributed.get_rank(group=group)

    for i in range(4):
        time.sleep(rank)
        logging.info(f"Rank {rank}: enter barrier {i}")
        dist.barrier()
        logging.info(f"Rank {rank}: exit barrier {i}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```
appearing to show that ranks can exit barrier(s) before other ranks have entered. Note that the device-side ordering should still be correct in this case, but the host is free to run ahead.

The issue can be worked around by adding a `torch.cuda.synchronize(rank)` after the `barrier`, but this seems to be against the spirit of the stream synchronization, which deliberately tries to avoid a device synchronization.

This PR does a sync on the `allreduce`'s stream so that a device synchronization is not needed to align the host's output with the device.

CC @wujingyue @Aidyn-A @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129908
Approved by: https://github.com/kwen2501
2024-07-04 20:36:58 +00:00
7128504424 [inductor] Add Triton template for Conv3D (#129518)
This commit adds a Triton template for Conv3D ops,
following the same logic as Conv2D. Conv3D ops
aren't used as frequently as Conv2D, so they may
receive fewer optimizations in various libraries. Having
a Triton-based Inductor implementation can therefore improve
performance in some cases.

Co-authored-by: Jack Taylor <108682042+jataylo@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129518
Approved by: https://github.com/jansel, https://github.com/jataylo
2024-07-04 20:30:50 +00:00
e590168865 Enable sharing meta tensors between processes (#129520)
Fixes #129436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129520
Approved by: https://github.com/ezyang
2024-07-04 20:29:48 +00:00
21eeedb455 [Inductor] Add aot_mode UT to new cpp_builder. (#130105)
Changes:
1. Add `aot_mode` parameter to `validate_new_cpp_commands` UT.
2. Switch AotCodeCompiler vec isa command gen to new cpp_builder.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130105
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-04 19:08:56 +00:00
d496145534 [CD] Add triton xpu wheel build (#129730)
Enable the Triton XPU wheel build first, then add the PyTorch XPU nightly wheel build.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129730
Approved by: https://github.com/atalman
2024-07-04 17:55:20 +00:00
f78b79daaa Forward fix the missing torch.nn.Module.set_submodule from D59140215 (#130075)
Summary: This is a forward fix for D59140215 from a PyTorch open source contributor (T194074371). On the PyTorch side, we need to use isinstance instead of type when checking for nn.Module. This is the same way get_submodule is currently implemented.
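
Why the two checks differ can be shown without PyTorch. In this sketch `Module` and `Linear` are hypothetical stand-ins for `nn.Module` and one of its subclasses, and both helper names are made up for the example.

```python
class Module:
    """Stand-in for nn.Module (hypothetical)."""


class Linear(Module):
    """Stand-in for a user-defined or built-in subclass (hypothetical)."""


def is_module_by_type(obj):
    # Exact type comparison: rejects every subclass instance.
    return type(obj) is Module


def is_module_by_isinstance(obj):
    # Accepts the base class and all of its subclasses.
    return isinstance(obj, Module)
```

Since practically every real module is a subclass instance, an exact type comparison would reject almost all valid submodules.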

Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//dper3/dper3/core/tests:module_test`

Differential Revision: D59254638

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130075
Approved by: https://github.com/mikaylagawarecki
2024-07-04 17:46:56 +00:00
5b5f4b02c2 [pipelining] [BE] Move pipeline_order validation to schedules.py (#129369)
# Changes
* small fix in stage error message
* Move `format_pipeline_order` and `_validate_pipeline_order` out of `test_schedule.py` into `schedules.py`.
* Wrap the execution runtime in a try-except which on error will log the timestep and schedule plan before re-raising the exception.
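
The last item follows a common log-and-re-raise pattern, sketched below. This is not the code in `schedules.py`; `run_schedule`, `execute`, and the action strings used with them are hypothetical.

```python
import logging

logger = logging.getLogger(__name__)


def run_schedule(actions, execute):
    """Run each action in order; on failure, log context and re-raise."""
    for timestep, action in enumerate(actions):
        try:
            execute(action)
        except Exception:
            # Log where the schedule failed and what the full plan was,
            # then let the original exception propagate unchanged.
            logger.error(
                "Schedule failed at timestep %d (action %r); plan: %s",
                timestep,
                action,
                actions,
            )
            raise
```

A bare `raise` keeps the original traceback, so the extra logging adds context without hiding where the error came from.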

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129369
Approved by: https://github.com/wconstab
ghstack dependencies: #129368
2024-07-04 16:38:30 +00:00
6dfa53ca76 Revert "[pt2-bench] pass acc test if ref is NaN (#129996)"
This reverts commit 51fa0bd436cf627bd0c8ccf3a3a8b9c07d260622.

Reverted https://github.com/pytorch/pytorch/pull/129996 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages in main cuda12 focal jobs ([comment](https://github.com/pytorch/pytorch/pull/129996#issuecomment-2209175516))
2024-07-04 14:55:38 +00:00
fa3953a2e1 Revert "[pt2-bench] fix accuracy failure for a few models (#129941)"
This reverts commit dafbd603ee6672d9592ec72b59300a2631f431d2.

Reverted https://github.com/pytorch/pytorch/pull/129941 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages in main cuda12 focal jobs ([comment](https://github.com/pytorch/pytorch/pull/129996#issuecomment-2209175516))
2024-07-04 14:55:38 +00:00
54da35a2e0 Revert "[pt2-bench] fix accuracy failure for beit_base_patch16_224 during training (#130005)"
This reverts commit 0af8c8a981e79b05767089e57e81262dbbf2b1b4.

Reverted https://github.com/pytorch/pytorch/pull/130005 on behalf of https://github.com/jeanschmidt due to Seems to have introduced breakages in main cuda12 focal jobs ([comment](https://github.com/pytorch/pytorch/pull/129996#issuecomment-2209175516))
2024-07-04 14:55:38 +00:00
57d05f2616 [RELAND] Add xpu to getAccelerator (#129205)
# Motivation
Add `xpu` support to `getAccelerator`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129205
Approved by: https://github.com/albanD, https://github.com/gujinghui
ghstack dependencies: #129463
2024-07-04 10:26:52 +00:00
551f3b92b2 [Dynamo] Add assertion for tensor unpack shape mismatch (#130077)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130077
Approved by: https://github.com/Chillee
2024-07-04 09:25:08 +00:00
f3962cfd9c [RELAND] XPUHooksInterface inherits from AcceleratorHooksInterface (#129463)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129463
Approved by: https://github.com/gujinghui, https://github.com/albanD
2024-07-04 08:46:34 +00:00
fa4e489d70 [dynamo][dynamic-shapes] Graph break if out shape changes on out= variants (#130074)
Fixes https://github.com/pytorch/pytorch/issues/130068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130074
Approved by: https://github.com/ezyang
ghstack dependencies: #129913, #129914
2024-07-04 08:36:12 +00:00
e98587c58d Update torch-xpu-ops pin (ATen XPU implementation) (#129353)
188 new ATen operators/variants are added in the pin update, involving eager and torch.compile usage on HuggingFace, TIMM and TorchBench models. 16 new unit tests ported to enhance functionality coverage. Aligned source file directory structure with ATen native. Fixed corner case failures in aten::resize, aten::index_add and aten::index_put.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129353
Approved by: https://github.com/EikanWang
2024-07-04 07:36:17 +00:00
bffb278700 [ONNX] Add artifacts_dir to torch-onnx-patch in benchmark (#130069)
Add `artifacts_dir` to torch-onnx-patch to save error report for debugging.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130069
Approved by: https://github.com/justinchuby
2024-07-04 07:11:02 +00:00
d62d351107 [Optim][BE] Change str(device) to _get_device_type(device) (#129984)
Prevent using vague expressions like `"cuda" in str(device)`
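
One way a substring test can mislead is sketched below. The device string `"mycuda:0"` is a made-up backend name used only for illustration, and `device_type_of` is a hypothetical string helper, not the `_get_device_type` function this PR switches to.

```python
def looks_like_cuda(device_str):
    # The vague check: any string containing "cuda" matches.
    return "cuda" in device_str


def device_type_of(device_str):
    # Take the part before the optional ":index" suffix.
    return device_str.split(":", 1)[0]


# A hypothetical backend whose name merely contains "cuda".
substring_match = looks_like_cuda("mycuda:0")
exact_match = device_type_of("mycuda:0") == "cuda"
```

Here `substring_match` is true while `exact_match` is false, which is the kind of ambiguity an explicit device-type comparison avoids.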
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129984
Approved by: https://github.com/janeyx99
ghstack dependencies: #129451, #129552
2024-07-04 06:44:48 +00:00
42f3d7e948 [MPS] Add mps profiler env vars to docs (#129552)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129552
Approved by: https://github.com/malfet
ghstack dependencies: #129451
2024-07-04 06:44:48 +00:00
cyy
07b06f0f0a [2/N] Remove outdated CMake code (#130006)
Follows #129851

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130006
Approved by: https://github.com/drisspg
2024-07-04 06:24:22 +00:00
26be691e6b Unify shard logic for inductor and dynamo test_config (#129508)
Addresses https://github.com/pytorch/pytorch/pull/129480#issuecomment-2189954552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129508
Approved by: https://github.com/clee2000, https://github.com/huydhn
2024-07-04 06:04:29 +00:00
9c9ac670a0 [dtensor][be] Reduced redundant LOC by creating functions to set up models used in example (#129613)
**Summary**
As the CommModeFeature example file grew, too many lines of code were repeated to set up the models it uses. I created two functions, one to handle the MLP and MLPStacked models and the other for transformer models. The output of the examples is unchanged.

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_distributed_sharding_display

2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLPStacked_distributed_sharding_display

3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing

4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing

5. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing

6. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129613
Approved by: https://github.com/XilunWu
ghstack dependencies: #129602
2024-07-04 06:00:58 +00:00
0b9995c1ce [dtensor][debug] Added forward and backward differentiation for module level tracing (#129602)
**Summary**
Currently, comm_mode only allows users to differentiate between forward and backward passes at the operation level. I modified the code so that users can now see the collective counts for the passes at the module level. I also slightly changed how the output is formatted, making it easier to differentiate between a collective count and an operation. I designed the operational trace table function so that, in the future, a user can use command-line arguments to choose the level of information to display, instead of having two similar functions. Finally, I updated the expected output and test cases for the comm_mode example and test files. The expected outputs for the first 3 examples are shown below:

<img width="320" alt="Screenshot 2024-06-26 at 2 30 25 PM" src="https://github.com/pytorch/pytorch/assets/50644008/b8e88075-a07f-4e84-b728-a08959df3661">

<img width="497" alt="Screenshot 2024-06-26 at 2 29 15 PM" src="https://github.com/pytorch/pytorch/assets/50644008/5ef4bea7-1355-4089-bfb0-c7e3f588ac77">

<img width="615" alt="Screenshot 2024-06-26 at 2 31 05 PM" src="https://github.com/pytorch/pytorch/assets/50644008/feacae51-76f7-403b-b6cd-dd15e981770e">

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing

2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing

3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing

4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing

5. pytest test/distributed/_tensor/debug/test_comm_mode_features.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129602
Approved by: https://github.com/XilunWu, https://github.com/wz337
2024-07-04 06:00:58 +00:00
e2e624a02f [AOTAutograd] Micro-optimize runtime_wrapper (#128188)
This moves a bunch of runtime inspection of the `output_info` for alias handling into the construction of fixed output handlers that are created during compilation and captured by the runtime wrapper.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128188
Approved by: https://github.com/bdhirsh
2024-07-04 03:53:06 +00:00
a7a7363be0 [dynamo] Skip side effect tracking for c wrappers/descriptors (#129914)
Fixes PYTORCH_TEST_WITH_DYNAMO=1 pytest -vs test/test_python_dispatch.py::TestPythonDispatch::test_deepcopy_wrapper_subclass

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129914
Approved by: https://github.com/jansel
ghstack dependencies: #129913
2024-07-04 03:14:45 +00:00
da8af685ac [dynamo] Skip ID_MATCH guard on GetSetDescriptorType (#129913)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129913
Approved by: https://github.com/jansel
2024-07-04 03:14:45 +00:00
8405ba21c1 [inductor][cpp] fix the vec convertion between float and int64 on AVX2 (#130013)
Fix https://github.com/pytorch/pytorch/issues/129863

AVX2 has no single instruction to convert between fp and int64, so the conversion has to be emulated. The original fast implementation (see https://stackoverflow.com/questions/41144668) assumes the data range is within [-2^51, 2^51]. The issue reported in https://github.com/pytorch/pytorch/issues/129863 has input data outside this range, which made the test fail. This PR supports the full range of the conversion.
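
The range restriction can be seen in a small Python emulation of the bit trick. This is only a sketch under my own assumptions: it covers the int64-to-double direction only, it uses the magic constant 2^52 + 2^51 described in the linked Stack Overflow answer, and the function name `fast_int64_to_double` is hypothetical. It is not the AVX2 code from this PR.

```python
import struct

MAGIC = 2**52 + 2**51            # exactly representable as a double
MAGIC_BITS = 0x4338000000000000  # IEEE-754 bit pattern of float(MAGIC)
MASK64 = (1 << 64) - 1


def fast_int64_to_double(x: int) -> float:
    """Emulate the magic-number conversion (hypothetical helper).

    The result is only exact while MAGIC + x stays in [2**52, 2**53],
    i.e. for x within [-2**51, 2**51].
    """
    # Integer add on the raw bits; wraps modulo 2**64 like a 64-bit lane add.
    bits = (MAGIC_BITS + x) & MASK64
    # Reinterpret the 64 bits as a double, then subtract the magic constant.
    (as_double,) = struct.unpack("<d", struct.pack("<Q", bits))
    return as_double - float(MAGIC)


# In-range values round-trip exactly.
print(fast_int64_to_double(12345) == 12345.0)
print(fast_int64_to_double(-(2**51)) == float(-(2**51)))
# One past the upper bound no longer matches the true conversion.
print(fast_int64_to_double(2**51 + 1) == float(2**51 + 1))
```

Once `MAGIC + x` leaves that window the exponent changes and the low bits of `x` no longer line up with the mantissa, which is why inputs outside the range need a different code path.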

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130013
Approved by: https://github.com/lezcano
2024-07-04 03:01:49 +00:00
cyy
99ec7bbee7 Force inconsistent-missing-override for torch targets (#130010)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130010
Approved by: https://github.com/ezyang
2024-07-04 02:37:57 +00:00
0af8c8a981 [pt2-bench] fix accuracy failure for beit_base_patch16_224 during training (#130005)
This model's accuracy test recently regressed. The debugging process to figure out the cause was quite smooth, so I'd like to write it down in case it is helpful.

Clicking the model name beit_base_patch16_224 on the dashboard, we are able to see the pass status of the model in e.g. the past month. For this model, we can see that it starts to fail on June 08:

<img width="1118" alt="Screenshot 2024-07-02 at 5 36 35 PM" src="https://github.com/pytorch/pytorch/assets/52589240/32f27ccd-3ec7-4431-88b3-febeff831f8e">

What's nice is the dashboard shows the nightly commits for each run.

Running
```
git log --oneline a448b3ae9537c0ae233fb9199a4a221fdffbb..0e6c204642a571d5a7cd60be0caeb9b50faca030 torch/_inductor/
```
Gives us the list of Inductor PRs between the good and bad commit: https://gist.github.com/shunting314/eb57965688fc9e1746fcfa9b7b6b02df

Skimming through the PRs, I feel
```
ffc202a1b91 Added remove_noop_ops to joint_graph_passes (#124451)
```
can change numerics, so I disabled it locally with this one-line change: https://gist.github.com/shunting314/13aec768bda986056d0fb40dce53396e . The accuracy test then passes. (Command: time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only beit_base_patch16_224 )

Horace's PR (https://github.com/pytorch/pytorch/pull/124451) itself is valid. It removes no-op ops in the joint graph. I think the graph may have changed in a way that causes the partitioner to make different recomputation decisions, which can cause some change in numerics.

Since this is not a real issue, I'll raise the tolerance to make it pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130005
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #129996, #129941
2024-07-04 01:14:29 +00:00
dafbd603ee [pt2-bench] fix accuracy failure for a few models (#129941)
This PR batches fixes for a few accuracy failures during training by raising tolerances. I do that only for models whose failures I think are not due to a real issue.

## sebotnet33ts_256

The accuracy test for this model started to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256).

I can not repro locally, but from the log from the dashboard:
```
RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
raising the tolerance should fix it.

## DebertaForQuestionAnswering

This model fails accuracy test on the dashboard only in max-autotune mode. I can not repro locally by command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering
```

From error message on the dashboard:
```
RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
```

0.02 tolerance should suppress this error.
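
The arithmetic behind that claim can be checked with a short sketch. It assumes the comparison has the form `res_error <= multiplier * ref_error + tol / 10`; that form is my reading of the logged fields, not something stated in this commit message, and `passes_rmse_check` is a hypothetical helper name.

```python
def passes_rmse_check(res_error, ref_error, multiplier=3.0, tol=0.01):
    """Assumed shape of the RMSE comparison (hypothetical helper)."""
    return res_error <= multiplier * ref_error + tol / 10.0


# Values taken from the DebertaForQuestionAnswering log line above.
res_error, ref_error = 0.01803, 0.00537

# tol=0.01: threshold is 3 * 0.00537 + 0.001 = 0.01711, below 0.01803.
print(passes_rmse_check(res_error, ref_error, tol=0.01))
# tol=0.02: threshold is 3 * 0.00537 + 0.002 = 0.01811, above 0.01803.
print(passes_rmse_check(res_error, ref_error, tol=0.02))
```

Under the same assumption, the sebotnet33ts_256 numbers give a threshold of 3 * 0.02971 + 0.004 = 0.09313 against a measured 0.09441, so that failure is also marginal.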

## gluon_inception_v3

This model fails on the dashboard in max-autotune mode. I cannot repro locally with the command
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3
```

From error message on the dashboard
```
RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var
```
raising tolerance should suppress this error.

## mobilenetv3_large_100
Fails in max-autotune (MA) mode. I cannot repro locally with the command
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only
```
The error message on the dashboard is
```
RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```

The tensor is so small that the noise can be high. I use a larger multiplier for smaller tensors in torch._dynamo.utils.same.

## yolov3

Fails on the dashboard with the error
```
Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```

Fix it by using a larger multiplier for smaller tensors and raising the tolerance.

## timm_efficientdet

Fails on the dashboard with the error
```
E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
But I can not repro locally with command
```
time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet  --training
```

Raising the tolerance should fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941
Approved by: https://github.com/jansel
ghstack dependencies: #129996
2024-07-04 01:14:29 +00:00
51fa0bd436 [pt2-bench] pass acc test if ref is NaN (#129996)
I'm debugging the accuracy failure for training vision_maskrcnn.

Unfortunately, I could not get it to run locally (I've checked that the pinned commits for torchbenchmark/torchvision are correct, and reinstalled torchbenchmark for mask_rcnn). I get this error:
```
eager run fail: AssertionError: targets should not be none when in training mode
```
(Command: time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --training --only vision_maskrcnn )

But look at the log from the dashboard
```
E0623 19:17:59.085000 140114670171328 torch/_dynamo/utils.py:1468] RMSE (res-fp64): nan, (ref-fp64): nan and shape=torch.Size([1024, 256, 1, 1]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```

We can see both the reference number and the pt2 number are NaN. I change torch._dynamo.utils.same to return true if both RMSE values are NaN.
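
A minimal sketch of that rule, written against plain floats rather than the real `torch._dynamo.utils.same` code. The helper name `rmse_errors_acceptable` and the non-NaN comparison `res_error <= multiplier * ref_error + tol / 10` are assumptions made for illustration.

```python
import math


def rmse_errors_acceptable(res_error, ref_error, multiplier=3.0, tol=0.001):
    """Hypothetical stand-in for the pass/fail decision on two RMSE values."""
    # New behavior described above: if the reference itself is NaN and the
    # compiled result is NaN too, treat the outputs as matching.
    if math.isnan(res_error) and math.isnan(ref_error):
        return True
    # Assumed form of the ordinary comparison; any comparison with NaN
    # evaluates to False, so a one-sided NaN still fails.
    return res_error <= multiplier * ref_error + tol / 10.0


nan = float("nan")
print(rmse_errors_acceptable(nan, nan))      # both NaN
print(rmse_errors_acceptable(nan, 0.001))    # only the result is NaN
print(rmse_errors_acceptable(0.0002, 0.0001))
```

Only the both-NaN case is special-cased, so a NaN that appears in the compiled result alone is still reported as an accuracy failure.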

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129996
Approved by: https://github.com/jansel
2024-07-04 01:14:29 +00:00
9108b74bbc Updates to scaled_mm for rowwise scaling (#130059)
# Summary

This updates _scaled_mm's API to enforce that input scales are always two-dimensional. This resolves ambiguity around the scaling scheme.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130059
Approved by: https://github.com/vkuzo
2024-07-04 00:53:17 +00:00
cd70ac884f c10d/Utils: better error message on 0 bytes (#130056)
This improves the error messages on 0 bytes sent/received. We currently log it as a connection reset when it's caused by other reasons.

Test plan:

```
python test/distributed/test_store.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130056
Approved by: https://github.com/kurman, https://github.com/rsdcastro
2024-07-04 00:48:20 +00:00
cyy
efb73eda51 [2/N] Fix some violations of unused-function and unused-variable checks in torch_cpu (#129878)
Follows #128670

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129878
Approved by: https://github.com/ezyang
2024-07-04 00:39:28 +00:00
d95a019704 [export] construct empty graph when there's no tensor computation (#129541)
Fixes [#127110](https://github.com/pytorch/pytorch/issues/127110).

When the input module does not contain any tensor computation, we create a graph with inputs and outputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129541
Approved by: https://github.com/angelayi
2024-07-04 00:26:17 +00:00
2fe7c1fe04 [custom ops] Support factory function (#129978)
Fixes #129389

If a user registers a device-specific implementation for an operator that accepts no Tensors, then we require the operator to have a "device: torch.device" argument.

We switch on the device argument to select the correct backend to dispatch to.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129978
Approved by: https://github.com/zou3519
2024-07-04 00:10:52 +00:00
779fc8119e Revert "XPUHooksInterface inherits from AcceleratorHooksInterface (#129463)"
This reverts commit 6353a12e6a80f06217645b10fb69cffeac08a8d0.

Reverted https://github.com/pytorch/pytorch/pull/129463 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/129463#issuecomment-2207529072))
2024-07-03 23:43:15 +00:00
8a9725bedb Revert "Add xpu to getAccelerator (#129205)"
This reverts commit 3e2df3ca9d0a593e09bc94c14bbf2b213413cbf3.

Reverted https://github.com/pytorch/pytorch/pull/129205 on behalf of https://github.com/kit1980 due to Need to revert https://github.com/pytorch/pytorch/pull/129463 which breaks Meta builds ([comment](https://github.com/pytorch/pytorch/pull/129205#issuecomment-2207514346))
2024-07-03 23:37:24 +00:00
a9a744e442 Change numeric_debug_handle to store per-node id (#129811)
Summary:
Previously we stored an edge id in numeric_debug_handle to support operator fusion and operator decomposition throughout the stack,
but according to feedback from customers, people prefer the simpler per-node id. They are fine with not having the additional
support for numerical debugging of inputs and are willing to hack around this.

This PR changes the structure of numeric_debug_handle to store unique_id for each node instead.

e.g.
graph:
```
node = op(input_node, weight_node)
```
Before:
```
node.meta[NUMERIC_DEBUG_HANDLE_KEY] = {input_node: id1, weight_node: id2, "output": id3}
```

After:
```
node.meta[NUMERIC_DEBUG_HANDLE_KEY] = id1
```
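
As a rough illustration of the per-node scheme, here is a sketch that assigns one unique id per node. The `Node` class is a plain stand-in (not `torch.fx.Node`), and both the key string and the helper name `assign_per_node_handles` are hypothetical.

```python
from dataclasses import dataclass, field
from itertools import count

# Assumed key name, used only for this illustration.
NUMERIC_DEBUG_HANDLE_KEY = "numeric_debug_handle"


@dataclass
class Node:
    """Stand-in for a graph node that carries a metadata dict."""
    name: str
    meta: dict = field(default_factory=dict)


def assign_per_node_handles(nodes):
    """Give every node that lacks a handle a fresh integer id."""
    next_id = count()
    for node in nodes:
        if NUMERIC_DEBUG_HANDLE_KEY not in node.meta:
            node.meta[NUMERIC_DEBUG_HANDLE_KEY] = next(next_id)
    return nodes


graph = [Node("input_node"), Node("weight_node"), Node("node")]
for n in assign_per_node_handles(graph):
    print(n.name, n.meta[NUMERIC_DEBUG_HANDLE_KEY])
```

With one integer per node, a lookup no longer needs the per-edge dictionary, at the cost of not identifying individual inputs.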

Test Plan:
python test/test_quantization.py -k TestGenerateNumericDebugHandle

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129811
Approved by: https://github.com/tarun292
2024-07-03 22:03:31 +00:00
b0d0114f5b Enable automigration for windows jobs (#129977)
Enable Windows jobs to automatically use LF runners when the author is opted-in

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129977
Approved by: https://github.com/clee2000
2024-07-03 22:02:56 +00:00
a79bb8db91 Make _embedding_bag_backward explicitly dispatch to CPU and CUDA. (#129691)
This PR modifies `_embedding_bag_backward` item inside _native_functions.yaml_, so that it
dispatches to CPU and CUDA directly, instead of `CompositeImplicitAutograd`.

*Context:* PyTorch operations that have the `CompositeImplicitAutograd` dispatch do not
allow third party backends (e.g. XLA) to modify its implementation, since this dispatch
key has higher priority. When calling `_embedding_bag_backward` operation using XLA, a
dispatch error will be thrown, since PyTorch/XLA doesn't support sparse tensors.

*Problem:* `_embedding_bag_backward` has a `sparse` parameter that controls whether the
operation should return a sparse or dense tensor. However, at the moment, PyTorch/XLA does
not support sparse tensors. In order to fall back to dense execution, i.e. change the
flag at runtime, we need to be able to modify its implementation.

*Solution:* we have changed the dispatch of `_embedding_bag_backward` to CPU and CUDA,
which allowed us to introduce our own kernel for it.

Additionally, this PR refactored the representation of its mode from constant integers
into an enum class. It also introduces two additional operators: `int == EmbeddingBagMode`
and `int != EmbeddingBagMode`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129691
Approved by: https://github.com/lezcano
2024-07-03 21:54:49 +00:00
7bbd6cf931 [custom_ops] Mark older custom ops prototypes as deprecated (#130032)
I've had at least one person try to call APIs from here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130032
Approved by: https://github.com/yushangdi, https://github.com/williamwen42
2024-07-03 21:11:05 +00:00
a21d4363d2 [Profiler] Remove all instances of TMP_USE_TSC_AS_TIMESTAMP (#129973)
Summary: Now that D56584521 is in, we can remove all instances of TMP_USE_TSC_AS_TIMESTAMP

Test Plan:
Ran resnet. Trace looks good
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Jun_27_14_46_01.1967733.pt.trace.json.gz&bucket=gpu_traces

Reviewed By: aaronenyeshi, swolchok

Differential Revision: D59132793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129973
Approved by: https://github.com/aaronenyeshi
2024-07-03 19:28:52 +00:00
042d764872 [export] Update example inputs format for DB. (#129982)
Summary: To give users simpler example code, we are getting rid of ExportArgs in favor of example_args and example_kwargs.

Test Plan: CI

Differential Revision: D59288920

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129982
Approved by: https://github.com/angelayi
2024-07-03 17:53:15 +00:00
9b902b3ee3 AOTI: dont treat views of buffers as constants (#129688)
More context [here](https://github.com/pytorch/pytorch/issues/129682#issuecomment-2195463838), but this change was enough to get this AOTI + float8 repro running for me (below).

Previously, it would fail an assertion [here](https://github.com/pytorch/pytorch/blob/main/torch/_meta_registrations.py#L5387) at inductor lowering time. It looks like during lowering, we were supposed to pass `param.transpose(1, 0)` as the second argument to the scaled_mm kernel. But in the inductor IR, this object is a `ReinterpretView` with `get_name()` equal to one of the param constants, so we would end up passing the constant directly into the kernel, instead of performing the view first.

I'm not totally sure if this is the right place to make the change, so interested in any thoughts from inductor folks (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @eellison )

```
import torch
from torch.export import export
from torch.export._trace import _export
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD 3-Clause license found in the
# LICENSE file in the root directory of this source tree.
import copy
import io
import random
import unittest
import pytest
import torch
import torch.nn as nn
import torch.nn.functional as F
from float8_experimental.float8_dynamic_linear import Float8DynamicLinear
from float8_experimental.float8_linear_utils import swap_linear_with_float8_linear
from float8_experimental.float8_tensor import Float8Tensor
from float8_experimental.float8_utils import compute_error
random.seed(0)
torch.manual_seed(0)
is_H100 = torch.cuda.is_available() and torch.cuda.get_device_capability() >= (9, 0)
import torch.nn.utils.parametrize as parametrize
# NOTE: we should upstream this directly into export and make it more automatic!
class UnwrapTensorSubclass(torch.nn.Module):
    def forward(self, *tensors):
        todo = list(tensors)
        for tp, meta, inner_tensors in reversed(self.rebuild_stack):
            nb_tensor = len(inner_tensors)
            inner_tensors = {a: b for a, b in zip(inner_tensors, todo[-nb_tensor:])}
            todo = todo[nb_tensor:]
            rebuilt = tp.__tensor_unflatten__(inner_tensors, meta, None, None)
            todo.append(rebuilt)
        assert len(todo) == 1
        return todo[0]
    def right_inverse(self, tensor):
        assert type(tensor) is not torch.Tensor
        rebuild_stack = []
        plain_tensors = []
        todo = [tensor]
        while todo:
            obj = todo.pop()
            inner_tensors, metadata = obj.__tensor_flatten__()
            rebuild_stack.append((type(obj), metadata, inner_tensors))
            for attr_name in inner_tensors:
                val = getattr(obj, attr_name)
                if type(val) is torch.Tensor:
                    plain_tensors.append(val)
                else:
                    assert isinstance(val, torch.Tensor)
                    todo.append(val)
        self.rebuild_stack = rebuild_stack
        return plain_tensors
def unwrap_tensor_subclass(model, filter_fn=None):
    for name, child in model.named_children():
        if (
            isinstance(child, Float8DynamicLinear) and
            hasattr(child, "weight") and
            type(child.weight) is not torch.Tensor and
            isinstance(child.weight, torch.Tensor)
        ):
            parametrize.register_parametrization(child, "weight", UnwrapTensorSubclass())
        unwrap_tensor_subclass(child)
    return model
class FeedForward(nn.Module):
    def __init__(self) -> None:
        super().__init__()
        self.w1 = nn.Linear(4096, 14336, bias=False)
        self.w3 = nn.Linear(4096, 14336, bias=False)
        self.w2 = nn.Linear(14336, 4096, bias=False)
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
    def reset_parameters(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                m.reset_parameters()
export_model = FeedForward().to("cuda")
swap_linear_with_float8_linear(
    export_model,
    Float8DynamicLinear,
    from_float_kwargs={"pre_quantize_weight": True},
)
export_model = unwrap_tensor_subclass(export_model)
batch_size = 4
num_tokens = 1024
embedding_dim = 4096
input_tensor = torch.randn(
    batch_size, num_tokens, embedding_dim, device="cuda", dtype=torch.float32
)
example_args = (input_tensor,)
# NOTE: this breaks unless we use strict=False, pre_dispatch=False!
exported_program: torch.export.ExportedProgram = _export(
    export_model,
    example_args,
    strict=False,
    pre_dispatch=False,
)
with torch.no_grad():
    so_path = torch._inductor.aot_compile(exported_program.module(), example_args)
    print(so_path)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129688
Approved by: https://github.com/eellison
2024-07-03 17:24:08 +00:00
35600bcaad Print float with full precision, don't truncate (#130027)
Fixes https://github.com/pytorch/pytorch/issues/119338

Exercised in https://github.com/pytorch/pytorch/pull/118448

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130027
Approved by: https://github.com/lezcano, https://github.com/Skylion007
2024-07-03 17:20:19 +00:00
01e41f1814 Modified autotuning for flex_attention to pass in (proper) fake inputs for the block sparse entries (#129915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129915
Approved by: https://github.com/yanboliang, https://github.com/eellison
ghstack dependencies: #129846, #129950
2024-07-03 17:08:45 +00:00
e2eb33b089 Added methods to blockmask to visualize them (#129950)
<img width="319" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/319b10f4-f6fe-4ff8-9529-d366ff411b95">
<img width="324" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/27a8953a-3c50-4922-b5d0-4ea5630a133a">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129950
Approved by: https://github.com/yanboliang, https://github.com/drisspg
ghstack dependencies: #129846
2024-07-03 17:08:45 +00:00
29c68df600 Stop immediately specializing common constants 0/1 for plain int (#128327)
Fixes https://github.com/pytorch/pytorch/issues/128319

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128327
Approved by: https://github.com/lezcano
ghstack dependencies: #129983
2024-07-03 16:41:51 +00:00
9e1e58e052 Support allowlisted modules and op overloads in AOTAutogradCache (#128329)
Ops in torch, torch.functional, and torch.nn.functional are cache safe by default (at least, based on my cursory audit of the ops). This fixes a few tests that use these ops with the cache.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128329
Approved by: https://github.com/bdhirsh
2024-07-03 14:59:24 +00:00
64a04d2225 Make sparse empty constructors specialize instead of fail on symbolic inputs (#129983)
Exercised in https://github.com/pytorch/pytorch/pull/128327

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129983
Approved by: https://github.com/anijain2305
2024-07-03 13:27:19 +00:00
735044191f [Easy] Add whitespace after comma when re-rendering tuple default value in schema (#129884)
The default value of `rot90()` in the schema registry is `[0,1]` because we split the function schema by `", "`. There should be no space after `,` in `[0,1]`.

5c9d5272e4/aten/src/ATen/native/native_functions.yaml (L6120-L6126)

Then the default value is formatted to `(0,1)` in `pyi` files. This PR manually adds an extra whitespace when rerendering the default value to a string.

```python
", ".join(string.split(","))
```

```python
# before
def rot90(input: Tensor, k: _int = 1, dims: _size = (0,1)) -> Tensor: ...
# after
def rot90(input: Tensor, k: _int = 1, dims: _size = (0, 1)) -> Tensor: ...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129884
Approved by: https://github.com/ezyang
2024-07-03 11:45:24 +00:00
8f70bf7a94 Skip TestSDPAPrivateUse1Only on FBCODE (#129997)
Summary: The test is from D59181111, but I couldn't figure out a way to make it pass on FBCODE because loading PyTorch C++ extension requires Ninja which is not going to work with BUCK

Test Plan: `buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test:transformers`

Differential Revision: D59304327

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129997
Approved by: https://github.com/drisspg
2024-07-03 06:48:51 +00:00
62b710782d change LayoutLMForSequenceClassification inference accuracy tolerance (#129728)
Fixes #128510.

https://github.com/pytorch/pytorch/pull/124451 makes LayoutLMForSequenceClassification hit SDPA pattern 1 and then run into the accuracy issue. The issue only happens with single-threaded BF16 inference. This PR increases the model tolerance so that the check passes. Note that even the math-version SDPA could have the issue because of some small implementation differences.

The test log:
Single thread
```
correct_result:  SequenceClassifierOutput(loss=tensor(0.5998), logits=tensor([[0.3301, 0.1338]], dtype=torch.bfloat16), hidden_states=None, attentions=None)
new_result:  SequenceClassifierOutput(loss=tensor(0.6016), logits=tensor([[0.3281, 0.1357]], dtype=torch.bfloat16), hidden_states=None, attentions=None)
E0627 01:09:16.762789 140281313759104 torch/_dynamo/utils.py:1476] RMSE (res-fp64): 0.00151, (ref-fp64): 0.00046 and shape=torch.Size([1, 2]). res.dtype: torch.bfloat16, multiplier: 3.000000, tol: 0.001000
E0627 01:09:16.762972 140281313759104 torch/_dynamo/utils.py:1390] Accuracy failed for key name logits
fail_accuracy
```

Multiple threads
```
correct_result:  SequenceClassifierOutput(loss=tensor(0.6007), logits=tensor([[0.3301, 0.1357]], dtype=torch.bfloat16), hidden_states=None, attentions=None)
new_result:  SequenceClassifierOutput(loss=tensor(0.6016), logits=tensor([[0.3281, 0.1357]], dtype=torch.bfloat16), hidden_states=None, attentions=None)
pass
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129728
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-03 06:28:27 +00:00
4fc9157e90 [halide-backend] Disable split reductions for Halide (#129320)
In theory Halide doesn't need the split reduction stuff we do for Triton since it can generate multiple kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129320
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #129321
2024-07-03 05:56:40 +00:00
0abcca85b7 [halide-backend] Support manual schedules (#129321)
Currently using this for some by-hand hacking, but might need to implement our own scheduler later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129321
Approved by: https://github.com/shunting314
2024-07-03 05:56:40 +00:00
8af58f66bb Fix typo in floordiv solver code that affects flipped relation (#129888)
Fixes https://github.com/pytorch/pytorch/issues/123535

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129888
Approved by: https://github.com/lezcano
2024-07-03 04:47:32 +00:00
424cd1e1df Enable TORCH_TRACE by default on Conda on Mast (#129988)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129988
Approved by: https://github.com/kunalb
2024-07-03 03:35:45 +00:00
1026b0f687 Use setup-miniconda step from test-infra for llm retrival workflow (#129720)
Undo https://github.com/pytorch/pytorch/pull/129722

Use the setup-miniconda step written in test-infra to install miniconda in the llm retrieval workflow.  It comes with a cache so we don't have to worry about hitting cache limits.  The llm retrieval job was failing due to too many requests https://github.com/pytorch/pytorch/issues/129718#issue-2379260544

2aba8f107a/.github/actions/setup-miniconda/action.yml (L1)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129720
Approved by: https://github.com/PaliC, https://github.com/malfet, https://github.com/huydhn
2024-07-03 03:02:23 +00:00
31fc5b8966 Add support for inline_asm_elementwise in Inductor lowerings (#129846)
This doesn't actually expose `inline_asm_elementwise` from any public API, but makes it pretty easy to register a lowering for a custom op that uses it.

<img width="667" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/f125f4bb-4f8c-46e7-8e06-925f37ed2930">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129846
Approved by: https://github.com/shunting314
2024-07-03 02:34:03 +00:00
9ee8c18309 TCPStore: add ping to verify network connectivity on connect (#129985)
This does a round trip request on socket connect -- this allows for detecting connection resets etc and retrying before the non-retryable application requests are sent.

This adds support for PING to both the libuv and legacy backend.

Example error:
```
[trainer85612|12]:W0701 13:41:43.421574  4776 TCPStore.cpp:182] [c10d] recvValue failed on SocketImpl(fd=24, ...): Connection reset by peer
[trainer85612|12]:Exception raised from recvBytes at /mnt/code/pytorch/torch/csrc/distributed/c10d/Utils.hpp:669 (most recent call first):
...
[trainer85612|12]:#9 c10d::TCPStore::incrementValueBy(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84809637
[trainer85612|12]:#10 c10d::TCPStore::waitForWorkers() from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84812868
[trainer85612|12]:#11 c10d::TCPStore::TCPStore(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10d::TCPStoreOptions const&) from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84814775
```

Test plan:

```
python test/distributed/test_store.py -v
```

```
tristanr@devvm4382 ~/pytorch (d4l3k/tcpstore_ping)> python ~/pt_tests/tcpstore_large_test.py
starting pool
started 90000
started 30000
started 70000
started 20000
started 80000
started 60000
started 0
[W702 16:16:25.301681870 TCPStore.cpp:343] [c10d] Starting store with 100000 workers but somaxconn is 4096.This might cause instability during bootstrap, consider increasing it.
init 20000
set 20000
init 80000
set 80000
init 70000
set 70000
init 60000
set 60000
init 30000
set 30000
init 90000
set 90000
started 40000
init 40000
set 40000
started 50000
init 50000
set 50000
started 10000
init 10000
set 10000
init 0
set 0
run finished 617.2992351055145
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129985
Approved by: https://github.com/rsdcastro, https://github.com/kurman
2024-07-03 02:09:44 +00:00
91a8376d47 run_test: Unset cpp stacktraces after reruns (#129004)
Rerun the failing test singly with the env var set.  If it succeeds, start a new process without the cpp stack traces env var

We don't want to waste time generating these if we don't have to

They can also show up in assertion errors, which may cause unexpected failures if a test wants to check these

Adds a new --rs (run single) flag, used the same way as --scs and --sc. It runs only the single test recorded in the step current file.

https://hud.pytorch.org/pytorch/pytorch/pull/129004?sha=2c349d3557d399020bf1f6a8b7045e2e4957ba46 has some examples of logs

In the above:
* test_checkpoint_valid failed, then passed in another subprocess.  The testing continued in a different new subprocess from the test right after it (test_checkpointing_without_reentrant_early_free)
* test_format_traceback_short failed consistently, but it continued to run because keep-going was set

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129004
Approved by: https://github.com/PaliC
2024-07-03 01:50:15 +00:00
c77c139878 [Intel Triton] Update Intel Triton to resolve installation issue on manylinux. (#129847)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129847
Approved by: https://github.com/Skylion007, https://github.com/gujinghui, https://github.com/atalman
ghstack dependencies: #129782
2024-07-03 01:46:32 +00:00
c686304277 Enable UFMT on test/test_public_bindings.py (#128389)
Part of: https://github.com/pytorch/pytorch/issues/123062

Ran lintrunner on:
> test/test_public_bindings.py

Detail:
```
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

Co-authored-by: Edward Z. Yang <ezyang@fb.com>
Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128389
Approved by: https://github.com/malfet
2024-07-03 01:43:41 +00:00
3b77b122c5 [Inductor UT] update rtol for convolution on XPU. (#129782)
[Inductor UT] update rtol for convolution on XPU.
Fix https://github.com/pytorch/pytorch/issues/129974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129782
Approved by: https://github.com/atalman
2024-07-03 01:37:16 +00:00
1e27af335e [easy] enhance local model loading (#129897)
Summary:
1. add one more model lib dep.
2. add error message when torchscript failed to find a class in python compilation unit.

Test Plan: CI

Reviewed By: jingsh

Differential Revision: D59243250

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129897
Approved by: https://github.com/jingsh
2024-07-03 00:29:02 +00:00
be2d79a16b [dynamic] config to disable duck sizing (#129804)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129804
Approved by: https://github.com/ezyang
2024-07-03 00:20:54 +00:00
111f9b5d44 [Dynamo] Add config to skip/inline torchrec (#129912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129912
Approved by: https://github.com/anijain2305
2024-07-03 00:14:51 +00:00
89646ebb11 Revert "[export] make with_effect mark op has_effect to prevent them from DCEed. (#129680)"
This reverts commit 4b8a5e03745924c8f987dc072fa4d41f4cb6f103.

Reverted https://github.com/pytorch/pytorch/pull/129680 on behalf of https://github.com/kit1980 due to breaking internal builds, see D59181183 ([comment](https://github.com/pytorch/pytorch/pull/129680#issuecomment-2204737227))
2024-07-03 00:03:50 +00:00
921c116089 [inductor] Kill mark_node_as_mutating (#129346)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129346
Approved by: https://github.com/lezcano
ghstack dependencies: #128893, #129325, #129343, #129344
2024-07-02 23:50:07 +00:00
b2ac8d2af3 [inductor] Use multiple outputs for flex-attention (#129344)
This fixes the DCE issue for attention output

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129344
Approved by: https://github.com/lezcano
ghstack dependencies: #128893, #129325, #129343
2024-07-02 23:50:07 +00:00
45844e0d4e [inductor] Add FileCheck to flex attention epilogue test (#129343)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129343
Approved by: https://github.com/lezcano
ghstack dependencies: #128893, #129325
2024-07-02 23:50:04 +00:00
7955cd3e83 [inductor] Make UserDefinedTritonKernel a multi-output operation (#129325)
Previously each mutation was represented by a `MutationOutput` operation which
was a new scheduler node that must be scheduled immediately afterwards.

Now we have a single scheduler node, which produces multiple `MutationOutput`
buffers as its output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129325
Approved by: https://github.com/lezcano
ghstack dependencies: #128893
2024-07-02 23:50:00 +00:00
fb078c20c1 [inductor] Separate Buffer and Operation into two concepts (#128893)
Currently a buffer represents both a tensor with physical storage and a
computation that produces the tensor as a result.

This PR attempts to split these into two different concepts in the scheduler.
This should allow us to have multiple outputs from a single operation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128893
Approved by: https://github.com/lezcano
2024-07-02 23:49:57 +00:00
872d972e41 [custom_op] better error message on no returns (#129896)
I run into this a lot. I can imagine that it would look opaque to users,
so made it more friendly

Old error message: "ValueError: infer_schema(func): Return has unsupported type <class 'inspect._empty'>."

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129896
Approved by: https://github.com/yushangdi
2024-07-02 23:34:23 +00:00
aa0352ca38 [custom ops] add default value support for device types (#129792)
Fixes #129371

I think the first case in Issue #129371 is already supported in the current code? Since it takes care of string default values. This PR adds support for device type default values.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129792
Approved by: https://github.com/zou3519
2024-07-02 23:31:29 +00:00
d7680a564b Bug fixes for disabling 0/1 specialization on plain int (#129961)
These bug fixes will be exercised in
https://github.com/pytorch/pytorch/pull/128327 but I separate them from
the actual policy change (which is more risky)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129961
Approved by: https://github.com/lezcano
2024-07-02 23:19:48 +00:00
eqy
29ffa20bb1 [CUDA] Bump tolerances for test_grad_pca_lowrank (#129902)
The revert of #127199 seems to surface an additional failure on A100---small tolerance bump to account for this.

I did find what appears to be a race condition in one of the kernels used in this workload, but I'm not sure it's related here...

CC @nWEIdia

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129902
Approved by: https://github.com/ezyang
2024-07-02 23:17:02 +00:00
b5fdbc1a9f Revert "[pipelining] [BE] Move pipeline_order validation to schedules.py (#129369)"
This reverts commit ec789a3c9ddd4e550b3dea6934ce2d41deb98784.

Reverted https://github.com/pytorch/pytorch/pull/129369 on behalf of https://github.com/clee2000 due to broke test/distributed/pipelining/test_schedule.py::ScheduleTest::test_non_symmetric_stage_ids_ScheduleClass0 on distributed cuda https://github.com/pytorch/pytorch/actions/runs/9766039400/job/26959115773 ec789a3c9d.  You can see the error on the PR, but Dr. CI classified it wrong ([comment](https://github.com/pytorch/pytorch/pull/129369#issuecomment-2204568418))
2024-07-02 22:30:53 +00:00
b6f781e433 Bug fix for capturing execution trace grid function (#129832)
Summary:
The grid function takes a variable number of arguments: one, two, or three numbers. The previous implementation captured them as a single tuple, for example "grid((16,))". The fix captures them as separate elements, so the previous example becomes "grid(16,)".

PARAM et-replay code will be modified to reflect this change in a follow-up diff.
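
The difference between the two captured forms can be illustrated with a minimal sketch. The helper names below are hypothetical and are not taken from the profiler code; they only reproduce the two strings quoted in the summary using plain Python string formatting.

```python
def format_grid_as_tuple(grid):
    # Old form: the whole grid tuple is recorded as a single argument.
    return f"grid({tuple(grid)})"


def format_grid_as_varargs(grid):
    # New form: each grid dimension is its own argument. The tuple's own
    # repr supplies the parentheses (and the trailing comma for one element).
    return f"grid{tuple(grid)}"


print(format_grid_as_tuple((16,)))    # grid((16,))
print(format_grid_as_varargs((16,)))  # grid(16,)
print(format_grid_as_varargs((4, 2))) # grid(4, 2)
```

The multi-dimensional output format is an assumption of this sketch; the source only shows the one-element case.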

Test Plan: buck2 test  mode/dev-nosan caffe2/test:profiler -- -- test_execution_trace_with_pt2

Differential Revision: D59195933

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129832
Approved by: https://github.com/Skylion007, https://github.com/davidberard98
2024-07-02 22:23:57 +00:00
39357ba06f [dynamo] don't constrain range on the replacement for a symbol (#129907)
# Error
```
  File "/data/users/colinpeppler/pytorch/torch/_meta_registrations.py", line 704, in sym_constrain_range
    constrain_range(size, min=min, max=max)
  File "/data/users/colinpeppler/pytorch/torch/fx/experimental/symbolic_shapes.py", line 898, in constrain_range
    a.node.shape_env._constrain_range(a.node.expr, min, max)
  File "/data/users/colinpeppler/pytorch/torch/fx/experimental/recording.py", line 245, in wrapper
    return fn(*args, **kwargs)
  File "/data/users/colinpeppler/pytorch/torch/fx/experimental/symbolic_shapes.py", line 2813, in _constrain_range
    assert isinstance(a, sympy.Symbol), f"constraining non-Symbols NYI, {a} is {type(a)}"
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
AssertionError: constraining non-Symbols NYI, s1 + s2 is <class 'sympy.core.add.Add'>
```

# Context
I ran into the following scenario:
```
getitem = ...
sym_size_int = torch.ops.aten.sym_size.int(getitem, 0) # this is u0 = s0 + s1
_check_is_size = torch._check_is_size(sym_size_int)
# we fail at this guy
sym_constrain_range_default = torch.ops.aten.sym_constrain_range.default(sym_size_int, min = 4, max = 1234)

# runtime assertion
add = sym_size_int + sym_size_int_1
eq = add == sym_size_int
_assert_scalar_default = torch.ops.aten._assert_scalar(eq, "Runtime assertion failed for expression Eq(s0 + s1, u0) on node 'eq'")
```

everything but getitem was asserted into the FX graph by insert_deferred_runtime_asserts()
7e4329c258/torch/fx/passes/runtime_assert.py (L38-L52)

In the above scenario, we fail trying to constrain the range on `s0 + s1`, which is not a `sympy.Symbol`.

And why exactly are we constraining the range on `s0 + s1`? Because it's the replacement for `u0`.

# Approach
Whenever we try to constrain the range on the replacement of ~~an unbacked symint~~ a non-symbol, just ignore it.

In the scenario above, it is safe to ignore it because whenever there's a replacement on an unbacked symint, we will update its range. Hence, there is no need to constrain the range on `s0 + s1`. We can confirm this with `TORCH_LOGS="+dynamic"`.
```
torch/fx/experimental/symbolic_shapes.py:4737: _update_var_to_range u0 = VR[4, 198] (update)
torch/fx/experimental/symbolic_shapes.py:4856: set_replacement u0 = s1 + s2 (trivial_lhs) VR[4, 198]
```

600bf978ba/torch/fx/experimental/symbolic_shapes.py (L4759-L4764)

Differential Revision: [D59257079](https://our.internmc.facebook.com/intern/diff/D59257079)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129907
Approved by: https://github.com/jingsh
2024-07-02 21:46:40 +00:00
c22e66896f Revert "Fix typo in floordiv solver code that affects flipped relation (#129888)"
This reverts commit 3c6c3b94486d49614bae5e76e7bd6b9579f643d4.

Reverted https://github.com/pytorch/pytorch/pull/129888 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the updated test starts to fail flakily in trunk somehow, so I am reverting the change to see if it helps ([comment](https://github.com/pytorch/pytorch/pull/129888#issuecomment-2204442653))
2024-07-02 21:16:59 +00:00
1ddb100318 [FSDP1][Easy] Remove Spammy Log Line in _runtime_utils.py (#129967)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129967
Approved by: https://github.com/awgu, https://github.com/fduwjj, https://github.com/Skylion007
2024-07-02 21:08:57 +00:00
deefc10dd3 [executorch hash update] update the pinned executorch hash (#129428)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129428
Approved by: https://github.com/pytorchbot
2024-07-02 20:39:39 +00:00
cyy
26de2c2487 [3/N] Enable clang-tidy on torch/csrc/jit/serialization/* (#129850)
Follows #129300.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129850
Approved by: https://github.com/ezyang
2024-07-02 20:08:48 +00:00
8ec5ba960f [MPS] Add tensor_lr overloads to fused adam & adamw (#129451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129451
Approved by: https://github.com/janeyx99
2024-07-02 19:46:30 +00:00
2631a96f2a Stop updating hints (#129893)
Some profiling suggests that the repeated maybe evaluate static calls are expensive.

Ref: https://github.com/pytorch/pytorch/issues/123964

With test script:

```
import torch
import torch._dynamo.config

torch._dynamo.config.capture_scalar_outputs = True

@torch.compile(fullgraph=True)
def f(a, b):
    xs = b.tolist()
    for x in xs:
        torch._check_is_size(x)
        torch._check(x <= 20)
    return a.split(xs)

N = 20

splits = torch.randint(10, (N,))
sz = splits.sum().item()

f(torch.randn(sz), splits)
```

Before:

```
real    0m18.526s
user    0m16.555s
sys     0m11.031s
```

After:

```
real    0m13.831s
user    0m12.152s
sys     0m10.941s
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129893
Approved by: https://github.com/lezcano
2024-07-02 19:24:33 +00:00
1f6c1fcd36 [dtensor][debug] add operation tracing to comm_mode (#129017)
**Summary**
I have added an even more detailed module tracker that now includes the collective counts and operations that happen in each submodule, making it easier for users to debug. The tracing now includes the input shape and sharding of each operation's DTensor arguments. As with the module collective tracing, the user also has the option to log the tracing table to an output.txt file. I have decided not to include the example output for the transformer as it is too many lines. The expected output for MLP_operation_tracing is shown below:

<img width="574" alt="Screenshot 2024-06-25 at 3 33 16 PM" src="https://github.com/pytorch/pytorch/assets/50644008/a09e2504-19d5-4c69-96e8-f84e852d7786">

<img width="467" alt="Screenshot 2024-06-25 at 3 33 45 PM" src="https://github.com/pytorch/pytorch/assets/50644008/55c07d2d-6cb6-410f-82ac-2849bb7bfbbb">

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing
2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129017
Approved by: https://github.com/XilunWu
2024-07-02 19:05:05 +00:00
bf05ea2bab Re-generate Linux build workflows after #124014 (#129976)
This looks like a land race, as lint passed on #124014
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129976
Approved by: https://github.com/kit1980
2024-07-02 18:57:20 +00:00
080149cb38 [Inductor][FlexAttention] Add helper functions of converting score_mod to block_mask (#129909)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129909
Approved by: https://github.com/Chillee, https://github.com/drisspg
ghstack dependencies: #129831, #129859
2024-07-02 18:48:16 +00:00
1f3e2d7877 [Inductor] Rename TemplatedAttention to FlexAttention (#129859)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129859
Approved by: https://github.com/Chillee, https://github.com/drisspg
ghstack dependencies: #129831
2024-07-02 18:48:16 +00:00
aa7ea6b45c Add wraps back (#129933)
Fixes https://github.com/pytorch/pytorch/issues/129922

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129933
Approved by: https://github.com/eqy, https://github.com/janeyx99
2024-07-02 18:24:02 +00:00
ec789a3c9d [pipelining] [BE] Move pipeline_order validation to schedules.py (#129369)
# Changes
* small fix in stage error message
* Move `format_pipeline_order` and `_validate_pipeline_order` out of `test_schedule.py` into `schedules.py`.
* Wrap the execution runtime in a try-except which on error will log the timestep and schedule plan before re-raising the exception.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129369
Approved by: https://github.com/wconstab
ghstack dependencies: #129368
2024-07-02 18:19:28 +00:00
4eb449f7dc [pipelining] add small logging section to docs (#129368)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129368
Approved by: https://github.com/wconstab
2024-07-02 18:19:28 +00:00
34e94c507a [Inductor] Make FlexAttention block_mask argument as tuple (#129831)
Re-organize the ```block_mask``` related arguments into a tuple to reduce the number of individual arguments. I tried to use a named tuple, but aot autograd doesn't work well with named tuples. The only downside of using a tuple rather than a named tuple is that we need to use an index to access its elements. But we only need this in one place, so it should be fine.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129831
Approved by: https://github.com/Chillee, https://github.com/drisspg
2024-07-02 17:18:33 +00:00
9105d54c6b [dynamo][sparse] Graph break on sparse tensors (#129883)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129883
Approved by: https://github.com/ezyang
ghstack dependencies: #129830, #129858, #129857, #129881
2024-07-02 16:51:56 +00:00
75443d3daf [dynamic-shapes] Don't create symbol if .item() is a nan (#129881)
Passes ` PYTORCH_TEST_WITH_DYNAMO=1 pytest test/torch_np/numpy_tests/lib/test_function_base.py::TestInterp::test_scalar_interpolation_point` in the stack.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129881
Approved by: https://github.com/ezyang, https://github.com/zou3519
ghstack dependencies: #129830, #129858, #129857
2024-07-02 16:51:56 +00:00
d146a62e77 [MPS][BE] Introduce mtl_setBytes (#129910)
Which for primitive types calls `[encoder setBytes:&val length:sizeof(val) atIndex:idx];` and for container types passes a number of elements equal to the size of the container

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129910
Approved by: https://github.com/Skylion007
2024-07-02 16:36:57 +00:00
9fb2dec7a6 [custom ops] Add unknown arg (#129614)
Fixes #129372

Add a mutates_args="unknown" option that pessimistically assumes that all inputs to the operator are being mutated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129614
Approved by: https://github.com/zou3519
2024-07-02 16:10:14 +00:00
e3b3431c42 Fix for HistogramObserver (#129387)
Summary:
There were two problems with the HistogramObserver:
1. It does not work when someone passes a batch_size 1, tensor_size 1 data-point.
2. The Histogram doesn't seem to actually update if the range of the new x falls within the old one

These issues were both fixed.

On top of this, I greatly simplified the logic for the histogram updating. It no longer does the downsampling, which saves a lot of memory and code. The accuracy can still be controlled with the upsampling ratio. This ratio was also too high for the accuracy we generally need here, so I reduced its default.
The code is also cleaner now and much easier to follow.

test_histogram_observer_same_inputs was likely wrong: if I pass 0s and 1s to my HistogramObserver, I want them to actually count! The current test now thinks it's good to discard and ignore these values.

Test Plan: You can run the included tests.

Differential Revision: D58931336

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129387
Approved by: https://github.com/jerryzh168
2024-07-02 15:41:44 +00:00
03440a1c13 Revert "Add support for inline_asm_elementwise in Inductor lowerings (#129846)"
This reverts commit badc638eb68c0b07ae3b857e885e6d0137b218aa.

Reverted https://github.com/pytorch/pytorch/pull/129846 on behalf of https://github.com/jeffdaily due to introduced ROCm breakages in trunk ([comment](https://github.com/pytorch/pytorch/pull/129846#issuecomment-2203519554))
2024-07-02 15:25:34 +00:00
3fd128361e [traced-graph][sparse] add relay override for layout_impl (#129930)
In the "layout()" method of "TensorImpl" defined in the file core/TensorImpl.h, the following code and documentation can be found:

```
  Layout layout() const {
  ...
  if .. {
  ...
  } else if (is_sparse_compressed()) {
      // Typically, the tensor dispatch keys define the tensor layout
      // uniquely. This allows using non-virtual layout method for
      // better performance. However, when tensor's layout depends,
      // say, on tensor attributes, one must use this execution path
      // where the corresponding tensor impl class overwrites virtual
      // layout_impl() method.
      return layout_impl();
    } else {
    ...
    }
  }

```
However, this override was never implemented. This PR puts the override in place, to prepare for sparsity propagation in another PR.

https://github.com/pytorch/pytorch/issues/117188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129930
Approved by: https://github.com/ezyang
2024-07-02 15:24:34 +00:00
dacc33d2fa Make sym_min/sym_max handle Numpy scalars (#129917)
Internal xref:
https://fb.workplace.com/groups/1069285536500339/posts/7773876449374514/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129917
Approved by: https://github.com/Skylion007
2024-07-02 14:59:20 +00:00
f1df13f023 [BE][Easy] Fix PYI001: unprefixed-type-param in torch/utils/data/datapipes (#129885)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129885
Approved by: https://github.com/ezyang
2024-07-02 14:56:27 +00:00
257b9c7936 Fix layout for *_like() factories on NJTs (#129879)
Background: this bug was triggering DEBUG=1 asserts in the backward for `unbind()`, which calls `empty_like()`. I found that the NJT implementation of `empty_like()` was redispatching on `values` while blindly passing along all kwargs. This resulted in `empty_like(values, ..., layout=torch.jagged)`, which is incorrect since `values` is strided, tripping the debug assert here:

433b691f98/aten/src/ATen/EmptyTensor.cpp (L305)

This PR explicitly sets `layout=torch.strided` when redispatching `*_like()` factories on `values`.
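
The fix can be pictured with a torch-free sketch. Everything here is a hypothetical stand-in: `fake_empty_like` imitates a factory that rejects a non-strided layout for a strided input, the strings "strided" and "jagged" stand in for `torch.strided` and `torch.jagged`, and `like_on_values` imitates the redispatch. It is not the NJT implementation.

```python
def fake_empty_like(values, *, layout="strided", dtype=None):
    # Stand-in factory: `values` is strided, so any other layout is an error.
    if layout != "strided":
        raise AssertionError(f"layout {layout!r} is invalid for strided values")
    return {"numel": len(values), "layout": layout, "dtype": dtype}


def like_on_values(factory, values, **kwargs):
    # Forward the caller's kwargs, but always override the layout so the
    # jagged layout of the outer tensor never reaches the values buffer.
    new_kwargs = dict(kwargs)
    new_kwargs["layout"] = "strided"
    return factory(values, **new_kwargs)


result = like_on_values(fake_empty_like, [1.0, 2.0, 3.0], layout="jagged", dtype="float32")
print(result["layout"])  # strided
```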
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129879
Approved by: https://github.com/soulitzer
2024-07-02 14:51:23 +00:00
6c2a8b6b38 [Ez][BE]: Enable new stable ruff rules (#129825)
Applies a bunch of new ruff lint rules that are now stable. Some of these improve efficiency or readability. Since I already did passes on the codebase for these when they were in preview, there should be relatively few changes to the codebase. This is just more for future hardening of it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129825
Approved by: https://github.com/XuehaiPan, https://github.com/jansel, https://github.com/malfet
2024-07-02 14:47:10 +00:00
2926655761 [inductor] optimize cpp builder configuration code (#129577)
Changes:
1. Combine the ISA selection condition dispatch code.
2. Unify the macOS OpenMP configuration code.
3. Clean up unused code.

Co-authored-by: Jason Ansel <jansel@jansel.net>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129577
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-02 14:41:59 +00:00
6cb0ad3375 [BE]: Update NCCL submodule to 2.21.5 (#124014)
Update NCCL to the latest version. This release is mostly bugfixes with a few new minor features.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124014
Approved by: https://github.com/eqy, https://github.com/ezyang, https://github.com/nWEIdia, https://github.com/malfet, https://github.com/atalman
2024-07-02 14:39:33 +00:00
dc75ec252a [inductor] Fix can_merge check for expr=q0*q1 (#129806)
Fixes #111884

In the minimised reproducer, we have a loop with the index expression `-q0*q1`
for which in the merge tester we get:
```
expr1 = - 0 * (_merge_tester * 16) = 0
expr2 = - _merge_tester * 0 = 0
```
so it decides we can merge the dimensions and `q0` is set to `0`, meaning `-q0*q1` is always zero!

Here I change the test so we have at least one case where no zeros are
substituted so we can catch this situation. In the normal strided case we get
e.g.
```
expr = 16 * q0 + q1
expr1 = 16 * _merge_tester2 + (16 * _merge_tester1)
expr2 = 16 * (_merge_tester2 + _merge_tester1)
```
which are still equivalent expressions.
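
The effect of the changed test can be sketched numerically. This is only an illustration under stated assumptions: the real check compares symbolic expressions, while the hypothetical helpers below plug in fixed nonzero sample values for `_merge_tester1` and `_merge_tester2`, following the substitutions shown in the two examples above. A numeric comparison at one sample point is not sound in general.

```python
def old_can_merge(index, size, t=7):
    # Old test: each expression has a zero substituted for one variable.
    expr1 = index(0, t * size)
    expr2 = index(t, 0)
    return expr1 == expr2


def new_can_merge(index, size, t1=7, t2=11):
    # New test: the first expression substitutes no zeros at all.
    expr1 = index(t2, size * t1)
    expr2 = index(t2 + t1, 0)
    return expr1 == expr2


def product_index(q0, q1):
    return -q0 * q1  # index expression from the reproducer


def strided_index(q0, q1):
    return 16 * q0 + q1  # normal strided case


print(old_can_merge(product_index, 16))  # True: both sides collapse to 0
print(new_can_merge(product_index, 16))  # False: merge is rejected
print(new_can_merge(strided_index, 16))  # True: still mergeable
```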

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129806
Approved by: https://github.com/lezcano
2024-07-02 14:30:02 +00:00
37e3c60897 [Inductor][CPP] Remove redundant INT8-specific logic in the INT8 GEMM template (#129470)
**Summary**
Remove redundant INT8-specific logic in the INT8 GEMM template to unify the code structure with FP32/BF16/FP16 GEMM Template.

**Test Plan**
```
numactl -C 56-111 -m 1 python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129470
Approved by: https://github.com/jgong5
ghstack dependencies: #128825, #129048, #129049, #129103, #129220, #129221
2024-07-02 13:15:15 +00:00
b6379591a9 [Inductor][CPP] Pass weight dtype explicitly for cpp gemm template (#129221)
**Summary**
This PR mainly refactors two things:

1. Pass the weight's data type explicitly in `create_micro_gemm` as `input2.dtype`. When registering `CppMicroGemmConfig`, we reuse `input.dtype` if `input2.dtype` is not explicitly registered.
2. Add a utility function to get the output data type and compute data type from the input data type.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129221
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #128825, #129048, #129049, #129103, #129220
2024-07-02 13:06:32 +00:00
72fa864098 [Inductor][CPP] Enable Quantized Linear with AMX MicroGEMM (#129220)
**Summary**
Add the AMX micro gemm kernel with int8 data type.

**Test Plan**
```
clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_amx
```

**Next Step**
- [✓] Unary post op fusion
- [✓] Int8 output
- [✓] Binary Fusion
- [✓] AMX int8 MicroGEMM Kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129220
Approved by: https://github.com/jgong5
ghstack dependencies: #128825, #129048, #129049, #129103
2024-07-02 12:53:35 +00:00
a796358330 [Inductor][CPP] Enable Quantized Linear GEMM Template with Binary Fusion (#129103)
**Summary**
Based on the previous PR, add the config to support quantized linear with binary and optional unary post-op fusion.

- Activation dtype: uint8
- Weight dtype: int8
- Output dtype: float32/bfloat16/uint8
- Post Op Fusion: with binary and optional[Unary] post operator fusion

**Test Plan**
```
clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_with_pointwise_binary
```

**Next Step**
- [✓] Unary post op fusion
- [✓] Int8 output
- [✓] Binary Fusion
- [ ] AMX int8 MicroGEMM Kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129103
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #128825, #129048, #129049
2024-07-02 12:45:10 +00:00
86e2d16ba0 [Inductor][Quant] Change the schema of QLinear Binary (#129049)
**Summary**
We change the schema of QLinear Binary, so it will be easier to enable the corresponding gemm template.

- The extra input of the binary post-op is a tensor that needs to be an input node for autotuning, so we move it in front of `output_scale`, which is a scalar.
- We also move it in front of `bias`, since `bias` is an optional tensor for this fusion, while `other` is required for linear binary fusion.

**Test Plan**
```
python -u -m pytest -s -v test/quantization/core/test_quantized_op.py -k qlinear
python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k qlinear
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129049
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #128825, #129048
2024-07-02 12:36:38 +00:00
07450e9713 Revert "[MPS] Add support for autocast in MPS (#99272)"
This reverts commit 6240cfd5c751bea6ca91dc765085e1d871b22345.

Reverted https://github.com/pytorch/pytorch/pull/99272 on behalf of https://github.com/jeanschmidt due to introduced breakages in trunk ([comment](https://github.com/pytorch/pytorch/pull/99272#issuecomment-2203033719))
2024-07-02 12:29:51 +00:00
0441173ab2 Add slowTest marker to test_linalg_solve_triangular_large (#129903)
In NVIDIA internal testing, on slower devices such as Orin NX and with large dtypes like complex128, test_linalg_solve_triangular_large takes multiple hours to complete and times out CI. This PR adds a slowTest marker so it can be skipped due to speed issues. cc @nWEIdia
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129903
Approved by: https://github.com/lezcano
2024-07-02 12:27:12 +00:00
95a5958db4 [ROCm] Update nightly triton-rocm pin to release branch (#129361)
Update pin to tip of https://github.com/triton-lang/triton/commits/release/3.0.x/ following upstream strategy here https://github.com/pytorch/pytorch/pull/126098

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129361
Approved by: https://github.com/peterbell10
2024-07-02 11:49:52 +00:00
3c6c3b9448 Fix typo in floordiv solver code that affects flipped relation (#129888)
Fixes https://github.com/pytorch/pytorch/issues/123535

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129888
Approved by: https://github.com/lezcano
2024-07-02 11:15:03 +00:00
8ef8240172 Don't mark conversion to float as is_integer = False (#129890)
Zero is an integer, so if you say is_integer = False, you are also
saying the result cannot be zero, which is undesirable.

This is exercised by next PR in the stack.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129890
Approved by: https://github.com/lezcano
2024-07-02 11:08:09 +00:00
eb1ff76f23 Make are_strides_like_channels_last size oblivious (#129677)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129677
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #129869
2024-07-02 11:05:20 +00:00
ebeeb22669 Correctly put mark_unbacked symbols in shape_env_to_source_to_symbol_cache (#129869)
Internal xref:
https://www.internalfb.com/intern/anp/view/?source=version_selector&id=5534845

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129869
Approved by: https://github.com/albanD
2024-07-02 11:05:20 +00:00
567dd1a3ca [inductor] unify toolchain code. (#129816)
This PR implements plan 2 of https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902 and is a follow-up to https://github.com/pytorch/pytorch/pull/129789

Changes:
1. Unify the cpp builder's toolchain code.
2. Move all build-related code to `cpp_builder.py`.
3. Optimize the import logic of `codecache.py`, `cpp_builder.py` and `cpu_vec_isa.py`, following https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129816
Approved by: https://github.com/jansel
2024-07-02 09:52:06 +00:00
badc638eb6 Add support for inline_asm_elementwise in Inductor lowerings (#129846)
This doesn't actually expose `inline_asm_elementwise` from any public API, but makes it pretty easy to register a lowering for a custom op that uses it.

<img width="667" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/f125f4bb-4f8c-46e7-8e06-925f37ed2930">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129846
Approved by: https://github.com/shunting314
2024-07-02 09:31:38 +00:00
ccc4ee7793 check boolean alpha and beta of Fake tensor impl for Tensor.addr (#129839)
Fixes https://github.com/pytorch/pytorch/issues/127043

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129839
Approved by: https://github.com/lezcano
2024-07-02 09:20:49 +00:00
5c9d5272e4 fixes #124582 (#128483)
added check for existence of outputs requiring grad to make_graphed_callables.

added new test case, updated existing test case to include parameterless modules.

Fixes #124582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128483
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-07-02 08:45:59 +00:00
1ad683033b Implemented flexible PP schedule (#129597)
Enabled some cases to work where num_microbatches % pp_size != 0. Using the flex_pp schedule, we will have

num_rounds = max(1, n_microbatches // pp_group_size), and it works as long as n_microbatches % num_rounds is 0. A few examples that are now supported:

pp_group_size = 4, n_microbatches = 10. We will have num_rounds = 2 and n_microbatches % 2 is 0.
pp_group_size = 4, n_microbatches = 3. We will have num_rounds = 1 and n_microbatches % 1 is 0.
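
The rule above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in this description; `flex_pp_num_rounds` is a hypothetical helper name, not a function from the PR.

```python
def flex_pp_num_rounds(n_microbatches: int, pp_group_size: int) -> int:
    """Return num_rounds for the given config, or raise if it is unsupported."""
    # Formula quoted in the description above.
    num_rounds = max(1, n_microbatches // pp_group_size)
    # The schedule only works when the microbatches split evenly into rounds.
    if n_microbatches % num_rounds != 0:
        raise ValueError(
            f"n_microbatches={n_microbatches} is not divisible by "
            f"num_rounds={num_rounds}"
        )
    return num_rounds


print(flex_pp_num_rounds(10, 4))  # first example above: num_rounds = 2
print(flex_pp_num_rounds(3, 4))   # second example above: num_rounds = 1
```

By the same rule, pp_group_size = 4 with n_microbatches = 11 would be rejected, since num_rounds = 2 and 11 % 2 is 1.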

Moved over from PiPPy (https://github.com/pytorch/PiPPy/pull/1129)

Tested using the config in (1), schedule looks like the following graph:

```
=========== ALL_RANK_ACTIONS ===========
         Rank 0  Rank 1  Rank 2  Rank 3
Step 00: F0_s0   None    None    None
Step 01: F1_s0   F0_s1   None    None
Step 02: F2_s0   F1_s1   F0_s2   None
Step 03: F3_s0   F2_s1   F1_s2   F0_s3
Step 04: F4_s0   F3_s1   F2_s2   F1_s3
Step 05: F0_s4   F4_s1   F3_s2   F2_s3
Step 06: F1_s4   F0_s5   F4_s2   F3_s3
Step 07: F2_s4   F1_s5   F0_s6   F4_s3
Step 08: F3_s4   F2_s5   F1_s6   F0_s7
Step 09: F4_s4   F3_s5   None    B0_s7
Step 10: F5_s0   None    F2_s6   F1_s7
Step 11: None    None    B0_s6   B1_s7
Step 12: None    F4_s5   F3_s6   F2_s7
Step 13: None    B0_s5   B1_s6   B2_s7
Step 14: F6_s0   F5_s1   F4_s6   F3_s7
Step 15: B0_s4   B1_s5   B2_s6   B3_s7
Step 16: F7_s0   F6_s1   F5_s2   F4_s7
Step 17: B1_s4   B2_s5   B3_s6   B4_s7
Step 18: F8_s0   F7_s1   F6_s2   F5_s3
Step 19: B2_s4   B3_s5   B4_s6   B0_s3
Step 20: F9_s0   F8_s1   F7_s2   F6_s3
Step 21: B3_s4   B4_s5   B0_s2   B1_s3
Step 22: F5_s4   F9_s1   F8_s2   F7_s3
Step 23: B4_s4   B0_s1   B1_s2   B2_s3
Step 24: F6_s4   F5_s5   F9_s2   F8_s3
Step 25: B0_s0   B1_s1   B2_s2   B3_s3
Step 26: F7_s4   F6_s5   F5_s6   F9_s3
Step 27: B1_s0   B2_s1   B3_s2   B4_s3
Step 28: F8_s4   F7_s5   F6_s6   F5_s7
Step 29: B2_s0   B3_s1   B4_s2   B5_s7
Step 30: F9_s4   F8_s5   F7_s6   F6_s7
Step 31: B3_s0   B4_s1   B5_s6   B6_s7
Step 32: None    F9_s5   F8_s6   F7_s7
Step 33: B4_s0   B5_s5   B6_s6   B7_s7
Step 34: None    None    F9_s6   F8_s7
Step 35: B5_s4   B6_s5   B7_s6   B8_s7
Step 36: None    None    None    F9_s7
Step 37: B6_s4   B7_s5   B8_s6   B9_s7
Step 38: None    None    None    None
Step 39: B7_s4   B8_s5   B9_s6   B5_s3
Step 40: None    None    None    None
Step 41: B8_s4   B9_s5   B5_s2   B6_s3
Step 42: None    None    None    None
Step 43: B9_s4   B5_s1   B6_s2   B7_s3
Step 44: None    None    None    None
Step 45: B5_s0   B6_s1   B7_s2   B8_s3
Step 46: None    None    None    None
Step 47: B6_s0   B7_s1   B8_s2   B9_s3
Step 48: None    None    None
Step 49: B7_s0   B8_s1   B9_s2
Step 50: None    None
Step 51: B8_s0   B9_s1
Step 52: None
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129597
Approved by: https://github.com/H-Huang
2024-07-02 07:54:38 +00:00
3e2df3ca9d Add xpu to getAccelerator (#129205)
# Motivation
Add `xpu` support to `getAccelerator`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129205
Approved by: https://github.com/albanD, https://github.com/gujinghui
ghstack dependencies: #129463
2024-07-02 06:48:24 +00:00
6353a12e6a XPUHooksInterface inherits from AcceleratorHooksInterface (#129463)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129463
Approved by: https://github.com/gujinghui, https://github.com/albanD
2024-07-02 06:48:24 +00:00
76259ebfdd [inductor] split cpu vec isa to dedicate file (keep git history) (#129789)
This PR is the implementation of https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902 plan 1

Changes:
1. Duplicate `codecache.py` to `cpu_vec_isa.py` with its `git history`.
<img width="745" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/106533da-ce80-4825-8271-35ffb3141f92">

2. Make `cpu_vec_isa.py` the dedicated file for CPU vec ISA. This also makes it easier to extend to more architectures and vec ISAs.
3. Update code for above changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129789
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-02 05:29:05 +00:00
f6edd1f7c9 [BE] Make ActivationWrapper an abstract class (#129808)
Fixes #95481

Test Plan:
Unit tested checkpoint_wrapper.py by instantiating ActivationWrapper and got TypeError as expected.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129808
Approved by: https://github.com/Skylion007
2024-07-02 04:29:43 +00:00
c2d0b7b96d Revert "[ROCm] std::clamp work-around for hip-clang compiler (#127812)"
This reverts commit 8c2c3a03fb87c3568a22362d83b00d82b9fb3db2.

Reverted https://github.com/pytorch/pytorch/pull/127812 on behalf of https://github.com/ezyang due to windows trunk job failing ([comment](https://github.com/pytorch/pytorch/pull/127812#issuecomment-2201653245))
2024-07-02 01:52:31 +00:00
6240cfd5c7 [MPS] Add support for autocast in MPS (#99272)
Fixes https://github.com/pytorch/pytorch/issues/88415

Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99272
Approved by: https://github.com/malfet
2024-07-02 01:49:52 +00:00
600bf978ba [Pipelining] Add to/from CSV format and improved __repr__ (#129264)
_Action.__repr__ gets rearranged so it doesn't require an underscore or
a 's' prefix, but still keeps multi-digit stage and microbatch indices
separated by an alpha character indicating the action type.

to/from CSV methods allow dumping a generated schedule to CSV format for
offline visualization or manual editing in a spreadsheet and reloading
to use at runtime.
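
A rough sketch of such a round trip, using only the standard library. The cell format (`<stage><action letter><microbatch>`, e.g. `0F1`) and the one-row-per-rank layout are assumptions inferred from this description, and every function name below is hypothetical rather than the API added by this PR.

```python
import csv
import io
import re

# Assumed cell format: stage index, alphabetic action type, microbatch index.
_CELL = re.compile(r"^(\d+)([A-Za-z]+)(\d+)$")


def parse_cell(cell):
    """Parse one CSV cell into (stage, action, microbatch); empty cell -> None."""
    if not cell:
        return None
    match = _CELL.match(cell)
    if match is None:
        raise ValueError(f"unrecognized action: {cell!r}")
    return int(match.group(1)), match.group(2), int(match.group(3))


def format_cell(action):
    """Inverse of parse_cell; None (an idle step) becomes an empty cell."""
    if action is None:
        return ""
    stage, kind, microbatch = action
    return f"{stage}{kind}{microbatch}"


def schedule_to_csv(schedule):
    """Write one CSV row per rank and one column per step."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    for rank_actions in schedule:
        writer.writerow([format_cell(a) for a in rank_actions])
    return buf.getvalue()


def schedule_from_csv(text):
    """Read back the structure produced by schedule_to_csv."""
    return [[parse_cell(c) for c in row] for row in csv.reader(io.StringIO(text))]


schedule = [
    [(0, "F", 0), (0, "F", 1), None, (0, "B", 0)],
    [None, (1, "F", 0), (1, "B", 0), (1, "F", 1)],
]
assert schedule_from_csv(schedule_to_csv(schedule)) == schedule
```

The alphabetic action type in the middle keeps multi-digit indices unambiguous: `12F10` can only be read as stage 12, forward, microbatch 10.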

Co-authored-by: Howard Huang <howardhuang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129264
Approved by: https://github.com/H-Huang
2024-07-02 01:26:23 +00:00
83e6ec2ccd [FSDP2+TP] Disable 2D state_dict (#129519)
Fixes #ISSUE_NUMBER

Gonna fill in the RFC but just want to run CI to see if anything else breaks.

Test:
```
python test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_raise_not_implemented_state_dict_if_2d
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129519
Approved by: https://github.com/awgu
2024-07-02 01:26:14 +00:00
cyy
46366888d7 Remove outdated CMake code (#129851)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129851
Approved by: https://github.com/ezyang
2024-07-02 00:40:37 +00:00
7e4329c258 [EZ][BE] Bump min cmake version to 3.18 (#129906)
As this is a min CMake version supported by top level PyTorch

Hides
```
CMake Deprecation Warning at aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/CMakeLists.txt:7 (cmake_minimum_required):
  Compatibility with CMake < 3.5 will be removed from a future version of
  CMake.

  Update the VERSION argument <min> value or use a ...<max> suffix to tell
  CMake that the project does not need compatibility with older versions.

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129906
Approved by: https://github.com/kit1980
2024-07-01 23:06:49 +00:00
9645eaaaec [BE] Improve logging for runner-determinator (#129679)
This lets us be more flexible about what data we output and throwing exceptions. It's also less likely to break when others make changes (e.g. any print statement would have broken this code before since the printed output was expected to only be a json)
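
The failure mode described above can be reproduced with a small standalone sketch. It does not use the runner-determinator code; the `label_type` key and the sample strings are made up for illustration.

```python
import json


def parse_runner_output(stdout_text):
    """Consumer that expects stdout to contain nothing but a JSON document."""
    return json.loads(stdout_text)


clean = '{"label_type": "example"}'
# A single extra print() in the producer prepends non-JSON text to stdout.
noisy = "debug: checking user list\n" + clean

print(parse_runner_output(clean))
try:
    parse_runner_output(noisy)
except json.JSONDecodeError as exc:
    print(f"stray print broke the consumer: {exc}")
```
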
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129679
Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt, https://github.com/Skylion007
2024-07-01 22:31:35 +00:00
eeef68671d [autograd] Do not detach when unpacking tensors that do not require grad (#127959)
In this PR:
- Ensure that if a tensor not requiring grad is saved for backward, unpacking does not trigger a detach (unless the user installs a saved tensor pack hook that returns a tensor requiring grad).
- Update non-reentrant checkpoint to also no longer detach for this case.

Alternatives:
- For custom autograd Function, you could directly save on ctx to work around this, but that would not work for when we switch to using custom ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127959
Approved by: https://github.com/YuqingJ
ghstack dependencies: #125795, #128545, #129262
2024-07-01 21:57:36 +00:00
87693b534c [ROCm] Use AOTriton as a dynamic library (#129094)
This PR enables using AOTriton as a shared library dependency instead of a static one.

Resolves the linker errors seen when building PyTorch for many (>7 or so) gfx archs, which are caused by the huge size of the AOTriton static library.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129094
Approved by: https://github.com/malfet
2024-07-01 21:39:27 +00:00
8c2c3a03fb [ROCm] std::clamp work-around for hip-clang compiler (#127812)
Fixes #127666.

Other std math functions are replaced with those in the global namespace during hipify. HIP does not claim to support every function in the C++ standard library. std::clamp is not yet supported and we have been relying on the std implementation. For Fedora 40 + gcc 14, a host-side assert is used which is not supported. Work around this by replacing std::clamp with min and max for USE_ROCM builds.
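
The min/max formulation of a clamp can be illustrated in Python as follows. This is only a sketch of the equivalence, not the C++ patch itself, and `clamp_via_min_max` is a hypothetical name.

```python
def clamp_via_min_max(value, low, high):
    """Clamp value into [low, high] using only min and max (assumes low <= high)."""
    return min(max(value, low), high)


print(clamp_via_min_max(-1, 0, 10))  # below the range: returns low
print(clamp_via_min_max(5, 0, 10))   # inside the range: returned unchanged
print(clamp_via_min_max(11, 0, 10))  # above the range: returns high
```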

Patch comes from @lamikr. Modified to use #ifndef USE_ROCM.

https://github.com/lamikr/rocm_sdk_builder/pull/37

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127812
Approved by: https://github.com/hongxiayang, https://github.com/malfet
2024-07-01 21:00:33 +00:00
750c701e49 [ROCm] Update xlogy comment detailing issue (#128151)
update skip reason comment with more accurate descriptor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128151
Approved by: https://github.com/zou3519
2024-07-01 20:58:58 +00:00
78cda9a810 [symbolic-shapes] Add FloatPow in the symbolic shape guard closure (#129857)
Fixes test failure raised in the next diff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129857
Approved by: https://github.com/ezyang
ghstack dependencies: #129830, #129858
2024-07-01 20:44:59 +00:00
53d67165c0 [dynamo] Skip FUNCTION_MATCH guards for descriptors (#129858)
Hard to write tests. This PR makes many tests in the stack pass, such as

`PYTORCH_TEST_WITH_DYNAMO=1 pytest test/test_ao_sparsity.py::TestComposability::test_convert_without_squash_mask`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129858
Approved by: https://github.com/mlazos
ghstack dependencies: #129830
2024-07-01 20:44:59 +00:00
f86dbae247 Fix typo in lxml requirement (#129695)
Extra period at the end throws off pip:
```
root@f04177cab5af:/data/pytorch# pip install -r .ci/docker/requirements-ci.txt
ERROR: Invalid requirement: 'lxml==5.0.0.': Expected end or semicolon (after version specifier)
    lxml==5.0.0.
        ~~~~~~~^ (from line 309 of .ci/docker/requirements-ci.txt)
```

Not sure why CI docker builds do not have an issue with this period.

Typo comes from f73b1b9388
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129695
Approved by: https://github.com/huydhn
2024-07-01 19:43:37 +00:00
fdd0a7f9b4 Run test_mps_allocator_module serially (#129340)
Not sure why this test starts to fail (maybe runner update) 8a2fed7e6a/1 or why it was XFAIL in this old PR https://github.com/pytorch/pytorch/pull/97151, but the test is passing locally for me now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129340
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-07-01 18:44:48 +00:00
b02186ffc1 Revert "Allow get attributes on DDP similar to FSDP (#128620)"
This reverts commit 065c386990dce444db17eff7b254bf79e82450ef.

Reverted https://github.com/pytorch/pytorch/pull/128620 on behalf of https://github.com/jeanschmidt due to Reverting in order to see if the trunk error on inductor is fixed ([comment](https://github.com/pytorch/pytorch/pull/128620#issuecomment-2200717876))
2024-07-01 17:57:00 +00:00
bb0f3df562 Fix index issues in torch.fx.interpreter (#129527)
Summary: Fix index issues in torch.fx.interpreter by changing range from `[:i]` to `[:i+1]`. Because if there are `n` elements, the last index `i` of the `for` loop is `n-1` and `[:i]` can only get access to elements from index `0` to index `n-2` and miss the last element. `[:i+1]` can get access to all elements correctly.
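
The off-by-one can be seen with a plain list. The node names are placeholders, and this is a generic illustration of the slicing behavior rather than the interpreter code.

```python
nodes = ["placeholder", "call_function", "output"]

for i, node in enumerate(nodes):
    up_to_exclusive = nodes[:i]      # old range: stops before the current element
    up_to_inclusive = nodes[:i + 1]  # new range: includes the current element
    assert node not in up_to_exclusive
    assert node in up_to_inclusive

# On the last iteration i == len(nodes) - 1, so [:i] misses the final element.
print(up_to_exclusive)  # ['placeholder', 'call_function']
print(up_to_inclusive)  # ['placeholder', 'call_function', 'output']
```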

Test Plan: Test with Node API

Differential Revision: D59028395

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129527
Approved by: https://github.com/dulinriley
2024-07-01 17:46:13 +00:00
1956d87c1f Increase riscv implementation in DepthwiseConvKernel (#127867)
**Summary:**

Add a RISC-V implementation to DepthwiseConvKernel.

**Compile:**

export USE_CUDA=0
export USE_DISTRIBUTED=0
export USE_MKLDNN=0
export MAX_JOBS=4
export CMAKE_CXX_COMPILER=clang++
export CMAKE_C_COMPILER=clang
export CMAKE_C_FLAGS=-march=rv64gcv
export CMAKE_CXX_FLAGS=-march=rv64gcv
python3 setup.py develop --cmake

**Test Plan:**

**Correctness** - Check the results of the run before and after test_convolution.py

python3 test/run_test.py --include nn/test_convolution --keep-going

**Before:**
===== 9 passed, 13 skipped, 564 deselected in 46.55s =====
The following tests failed consistently:
test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_backward_twice
test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_inconsistent_types
test/nn/test_convolution.py::TestConvolutionNN::test_conv_modules_raise_error_on_incorrect_input_size
test/nn/test_convolution.py::TestConvolutionNN::test_conv_shapecheck
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv1d
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv2d
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv3d
test/nn/test_convolution.py::TestConvolutionNN::test_mismatch_shape_conv2d
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_complex64
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_float32

**After:**
===== 9 passed, 13 skipped, 564 deselected in 48.13s =====
The following tests failed consistently:
test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_backward_twice
test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_inconsistent_types
test/nn/test_convolution.py::TestConvolutionNN::test_conv_modules_raise_error_on_incorrect_input_size
test/nn/test_convolution.py::TestConvolutionNN::test_conv_shapecheck
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv1d
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv2d
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv3d
test/nn/test_convolution.py::TestConvolutionNN::test_mismatch_shape_conv2d
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_complex64
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_float32

**Performance** - Compare the results before and after mobilenet_v2

python3 run.py mobilenet_v2  -d cpu -t eval

**Before:**
Running eval method from mobilenet_v2 on cpu in eager mode with input batch size 16 and precision fp32.
CPU Wall Time per batch: 19590.647 milliseconds
CPU Wall Time:       19590.647 milliseconds
Time to first batch:         5271.3518 ms
CPU Peak Memory:                0.3809 GB

**After:**
Running eval method from mobilenet_v2 on cpu in eager mode with input batch size 16 and precision fp32.
CPU Wall Time per batch: 13523.530 milliseconds
CPU Wall Time:       13523.530 milliseconds
Time to first batch:         2696.0304 ms
CPU Peak Memory:                0.3408 GB

**Versions:**
Clang version: 17.0.2
Platform: CanMV-K230
Architecture: riscv64
OS: Ubuntu 23.10

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127867
Approved by: https://github.com/malfet
2024-07-01 17:11:21 +00:00
c9dc9887db Revert "Enable UFMT on test/test_public_bindings.py (#128389)"
This reverts commit fe5424d0f8604f6e66d827ae9f94b05cb7119d55.

Reverted https://github.com/pytorch/pytorch/pull/128389 on behalf of https://github.com/clee2000 due to broke test_mps.py::TestMPS::test_mps_allocator_module? https://github.com/pytorch/pytorch/actions/runs/9730750763/job/26854426294 fe5424d0f8 Not sure how this change can do that.  Build failed on PR so test didn't run ([comment](https://github.com/pytorch/pytorch/pull/128389#issuecomment-2200589719))
2024-07-01 16:34:04 +00:00
433b691f98 Revert "[inductor] optimize cpp builder configuration code (#129577)"
This reverts commit 2e3ff394bf94d3b9cbab0fe8a93a9ea7c9cb4267.

Reverted https://github.com/pytorch/pytorch/pull/129577 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, see D59181128 ([comment](https://github.com/pytorch/pytorch/pull/129577#issuecomment-2200554824))
2024-07-01 16:14:06 +00:00
19e17216a2 Revert "[inductor] split cpu vec isa to dedicate file (keep git history) (#129789)"
This reverts commit 58f346c874a8a982679b4d4f3876602cc05d66d4.

Reverted https://github.com/pytorch/pytorch/pull/129789 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert https://github.com/pytorch/pytorch/pull/129577 ([comment](https://github.com/pytorch/pytorch/pull/129789#issuecomment-2200545144))
2024-07-01 16:08:44 +00:00
b6dc37bb4e Revert "[inductor] unificate toolchain code. (#129816)"
This reverts commit 67c9ec2b6d12ffd0e83861dcc16c1cd1a9b74d35.

Reverted https://github.com/pytorch/pytorch/pull/129816 on behalf of https://github.com/jeanschmidt due to Need to revert in order to revert #129577 ([comment](https://github.com/pytorch/pytorch/pull/129816#issuecomment-2200539687))
2024-07-01 16:06:22 +00:00
cyy
ca5d13c672 [1/N] Enable unused variable warnings on torch_cpu and fix some violations (#128670)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128670
Approved by: https://github.com/ezyang
2024-07-01 14:56:46 +00:00
e385bf8ef8 Revert "[halide-backend] Disable split reductions for Halide (#129320)"
This reverts commit a18eb651d352e45860a96869abaf9fb7b215eac6.

Reverted https://github.com/pytorch/pytorch/pull/129320 on behalf of https://github.com/jeanschmidt due to This PR is breaking internal builds, please check comments on it D59204360 ([comment](https://github.com/pytorch/pytorch/pull/129320#issuecomment-2200351678))
2024-07-01 14:44:35 +00:00
a83eaf1c3a Revert "[halide-backend] Support manual schedules (#129321)"
This reverts commit 9ae78a578caff195821ad535a9e8d8ef59552142.

Reverted https://github.com/pytorch/pytorch/pull/129321 on behalf of https://github.com/jeanschmidt due to Reverting, as it is required to do so in order to revert #129320 ([comment](https://github.com/pytorch/pytorch/pull/129321#issuecomment-2200345664))
2024-07-01 14:42:33 +00:00
cc9b005bf2 Enable torchao nightly workflow (#129779)
Summary:
Make the following improvements:
* Schedule the torchao benchmark nightly
* Enable torchbench, timm, and huggingface models
* Refactor the benchmarking script to better arrange the benchmarking groups

Test workflow: https://github.com/pytorch/benchmark/actions/runs/9705589352

X-link: https://github.com/pytorch/benchmark/pull/2336

Differential Revision: D59074571

Pulled By: xuzhao9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129779
Approved by: https://github.com/jerryzh168
2024-07-01 14:28:38 +00:00
75f64e1203 Fix test test_type_hints.py::TestTypeHints::test_doc_examples (#129829)
As per the title, this test was broken for months.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129829
Approved by: https://github.com/ezyang
2024-07-01 13:28:37 +00:00
e1b426b345 [ROCm] CUDA_VISIBLE_DEVICES fallback option for device_count (#129650)
Updating `_parse_visible_devices` to allow use of CUDA_VISIBLE_DEVICES if HIP_VISIBLE_DEVICES is unset, to avoid any unnecessary code changes in workloads that already rely on CUDA_VISIBLE_DEVICES.
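
A minimal sketch of the fallback order described above. It is not the actual `_parse_visible_devices` code; `visible_devices_value` and its parameters are hypothetical and only show which variable wins.

```python
import os


def visible_devices_value(environ=None, is_hip=True):
    """Return the raw device list string, preferring HIP_VISIBLE_DEVICES on ROCm."""
    env = os.environ if environ is None else environ
    if is_hip:
        hip_value = env.get("HIP_VISIBLE_DEVICES")
        if hip_value is not None:
            return hip_value
    # Fall back to CUDA_VISIBLE_DEVICES when the HIP variable is unset.
    return env.get("CUDA_VISIBLE_DEVICES")


print(visible_devices_value({"CUDA_VISIBLE_DEVICES": "0,1"}))
print(visible_devices_value({"HIP_VISIBLE_DEVICES": "2", "CUDA_VISIBLE_DEVICES": "0,1"}))
```

The first call returns `0,1` through the fallback; the second returns `2`, because the HIP variable takes precedence when it is set.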

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129650
Approved by: https://github.com/hongxiayang, https://github.com/malfet
2024-07-01 11:40:09 +00:00
cyy
313eec02cc Add hash function of std::string_view to torch/csrc/lazy/core/hash.h (#128800)
For easier moving of c10::string_view to std::string_view in PyTorch/XLA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128800
Approved by: https://github.com/ezyang
2024-07-01 09:53:34 +00:00
f6a0be5023 Add warpSize to Device properties (#128449)
Adding warp_size to CudaDeviceProperties.

>>> import torch
>>> prop = torch.cuda.get_device_properties(torch.cuda.current_device())
>>> prop.warp_size
64
>>>

@jeffdaily @pruthvistony @jithunnair-amd @ROCmSupport

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128449
Approved by: https://github.com/eqy, https://github.com/jataylo, https://github.com/jithunnair-amd, https://github.com/malfet
2024-07-01 09:13:32 +00:00
04a0d85620 [BE] Print all pip packages installed on the system after TorchChat (#129809)
To make it easier to debug regressions like the one last Wednesday, when the release of a new torchao version resulted in TorchBench downgrading the PyTorch version to 2.3.1.

Test plan: Look at the log output, for example https://github.com/pytorch/pytorch/actions/runs/9720408234/job/26832794157?pr=129809#step:20:1158, which contains
```
+ echo 'Print all dependencies after TorchBench is installed'
Print all dependencies after TorchBench is installed
+ python -mpip freeze
absl-py==2.1.0
accelerate==0.31.0
aiohttp==3.9.5
aiosignal==1.3.1
astunparse==1.6.3
async-timeout==4.0.3
attrs==23.2.0
audioread==3.0.1
beautifulsoup4==4.12.3
boto3==1.19.12
botocore==1.22.12
bs4==0.0.2
cachetools==5.3.3
certifi==2024.6.2
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129809
Approved by: https://github.com/kit1980, https://github.com/atalman
2024-07-01 04:51:53 +00:00
cyy
eb1583dbc1 [2/N] Fix clang-tidy warnings in torch/csrc/jit/serialization (#129300)
Follows #129055
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129300
Approved by: https://github.com/ezyang
2024-07-01 01:09:00 +00:00
e62073d799 [dynamo] Skip FUNCTION_MATCH on method-wrapper objects (#129830)
Fixes https://github.com/pytorch/pytorch/issues/118563

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129830
Approved by: https://github.com/jansel
2024-06-30 20:21:18 +00:00
eqy
24b6c5a41f [cuDNN][SDPA] Bail out of dispatching to cuDNN for head dim > 128 on Ampere (#129587)
Fix for #129579

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129587
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2024-06-30 19:37:44 +00:00
eqy
f845a7a91a [cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343)
Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes.

What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here...

Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343
Approved by: https://github.com/Skylion007
2024-06-30 19:22:16 +00:00
eqy
7b0e9a27ba Restore allowed_info in OOM message when applicable (#129546)
Seems to be removed following #99699?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129546
Approved by: https://github.com/Skylion007
2024-06-30 17:22:32 +00:00
8755e035d2 [CUDA][Pooling] Fix 64-bit indexing in avg_pool_2d backward attempt 2 (#129818)
Somehow the original PR was missing the `CUDA_KERNEL_LOOP_TYPE` change???

Thanks @johnc-keen @Chillee for the great repro! (#129785)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129818
Approved by: https://github.com/Chillee, https://github.com/Skylion007
2024-06-30 16:52:33 +00:00
eqy
4dd3cff234 [CUDA] Fix more DeviceIndex printing (#128540)
Same `char` dtype causing device index `0` to be interpreted as a null-terminator, see also #123984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128540
Approved by: https://github.com/nWEIdia, https://github.com/Skylion007
2024-06-30 16:44:14 +00:00
eqy
68484621fe [cuDNN][functorch] Bump tolerances for nn.functional.conv2d in test_vmap_autograd_grad (#129796)
Newer versions of cuDNN can dispatch to a winograd kernel here on A100 which affects numerics a bit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129796
Approved by: https://github.com/Skylion007
2024-06-30 16:36:12 +00:00
fff633f087 [CI] Enable AOT inductor FP32 accuracy test for CPU (#129040)
This PR enabled AOT inductor backend FP32 accuracy check for CPU in CI workflow, which could catch AOT inductor issue at early stage.

**Test Time cost:**
| Suite       | Precision | Time cost |
|-------------|-----------|-----------|
| Huggingface | FP32      | 1h12m     |
| Timm models | FP32      | 1h32m     |
| Torchbench  | FP32      | 1h40m     |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129040
Approved by: https://github.com/chuanqi129, https://github.com/desertfire, https://github.com/malfet
2024-06-30 14:00:09 +00:00
8a5fda0377 added type hints for __contains__ (#129653)
- Fixes #129646
- Added test in test/typing/reveal/tensor_constructors.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129653
Approved by: https://github.com/ezyang
2024-06-30 11:49:11 +00:00
1a689ea38c [Inductor][CPP] Enable Quantized Linear GEMM Template with INT8 output and Unary Post Op (#129048)
**Summary**
Based on the previous PR, add the config to support int8 output and unary post op fusion with `ReLU` and `GeLU`

- Activation dtype: uint8
- Weight dtype: int8
- Output dtype: float32/bfloat16/uint8
- Post Op Fusion: with unary post operator fusion

**Test Plan**
```
clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_with_pointwise
```

**Next Step**
- [✓] Unary post op fusion
- [✓] Int8 output
- [ ] Binary Fusion
- [ ] AMX int8 MicroGEMM Kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129048
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #128825
2024-06-30 09:53:55 +00:00
35a197defa [Inductor][CPP] Enable Quantized Linear GEMM Template with FP32 output (#128825)
**Summary**
Support the int8 GEMM Template with a reference MicroInt8GEMM kernel for the case:

- Activation dtype: uint8
- Weight dtype: int8
- Output dtype: float32/bfloat16
- Post Op Fusion: without unary post operator fusion

**Test Plan**
```
clear && python -u -m pytest -s -v test/inductor/test_cpu_select_algorithm.py -k test_quantized_linear_with_pointwise
```

**Next Step**
- [ ] Unary post op fusion
- [ ] Int8 output
- [ ] Binary Fusion
- [ ] AMX int8 MicroGEMM Kernel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128825
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-30 09:45:43 +00:00
fe5424d0f8 Enable UFMT on test/test_public_bindings.py (#128389)
Part of: https://github.com/pytorch/pytorch/issues/123062

Ran lintrunner on:
> test/test_public_bindings.py

Detail:
```
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

Co-authored-by: Edward Z. Yang <ezyang@fb.com>
Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128389
Approved by: https://github.com/ezyang
2024-06-30 08:49:51 +00:00
4ee1cb9b95 [BE][Easy] replace import pathlib with from pathlib import Path (#129426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129426
Approved by: https://github.com/malfet
2024-06-30 01:36:07 +00:00
2effbcfcd8 Revert "[BE][Easy] replace import pathlib with from pathlib import Path (#129426)"
This reverts commit 6d75604ef135925e8c85363c2f4a5e0b6f7fef28.

Reverted https://github.com/pytorch/pytorch/pull/129426 on behalf of https://github.com/XuehaiPan due to recognize `Path` as new exported API ([comment](https://github.com/pytorch/pytorch/pull/129426#issuecomment-2198371625))
2024-06-29 23:24:06 +00:00
67c9ec2b6d [inductor] unificate toolchain code. (#129816)
This PR is the implementation of https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902 plan 2, and it continues https://github.com/pytorch/pytorch/pull/129789

Changes:
1. Unify the cpp builder's toolchain code.
2. Move all build-related code to `cpp_builder.py`.
3. Optimize the import logic of `codecache.py`, `cpp_builder.py` and `cpu_vec_isa.py`, following: https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129816
Approved by: https://github.com/jansel
2024-06-29 23:21:13 +00:00
3fec0efd34 [Inductor][CPP] Support vectorization of bitwise fn (#129733)
**Summary**
When checking the vectorization status across 3 test suites, we found that some operators had vectorization disabled, with the message `Disabled vectorization: op: bitwise_and`. In this PR, we add vectorization support for 6 bitwise functions.

In this PR, we also remove `bitwise_xor` from the `ops_to_bool` list, which sets the output data type to bool in data type propagation. This seems wrong: according to the doc
https://pytorch.org/docs/stable/generated/torch.bitwise_xor.html, it should return the same integral data type as the input, and the test case `test_bitwise3` failed due to this issue.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_bitwise
python -u -m pytest -s -v test/inductor/test_torchinductor.py -k test_bitwise3
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129733
Approved by: https://github.com/jgong5, https://github.com/Skylion007
2024-06-29 17:25:27 +00:00
6d75604ef1 [BE][Easy] replace import pathlib with from pathlib import Path (#129426)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129426
Approved by: https://github.com/malfet
2024-06-29 15:42:09 +00:00
7837a12474 [BE] enforce style for empty lines in import segments (#129751)
This PR follows https://github.com/pytorch/pytorch/pull/129374#pullrequestreview-2136555775 cc @malfet:

> Lots of formatting changes unrelated to PR goal, please keep them as part of separate PR (and please add lint rule if you want to enforce those, or at least cite one)

`usort` allows empty lines within import segments. For example, `usort` does not change any of the following code:

```python
import torch.aaa
import torch.bbb
import torch.ccc

x = ...  # some code
```

```python
import torch.aaa

import torch.bbb
import torch.ccc

x = ...  # some code
```

```python
import torch.aaa

import torch.bbb

import torch.ccc

x = ...  # some code
```

This PR first sorts imports via `isort`, then re-sorts the file using `ufmt` (`usort` + `black`). This enforces the following import style:

1. no empty lines within segments.
2. single empty line between segments.
3. two spaces after import statements.

All the code snippets above will be formatted to:

```python
import torch.aaa
import torch.bbb
import torch.ccc

x = ...  # some code
```

which produces a consistent code style.
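
The effect of rule 1 can be sketched in a few lines of plain Python. This is only an illustration of the resulting style for a single import segment; `collapse_import_blank_lines` is a hypothetical helper and is not how `isort`, `usort`, or `ufmt` work internally.

```python
def _is_import(line):
    # Treat only top-level `import x` / `from x import y` lines as imports.
    return line.startswith(("import ", "from "))


def collapse_import_blank_lines(source):
    """Drop blank lines that sit between two import statements."""
    lines = source.splitlines()
    out = []
    for i, line in enumerate(lines):
        if line.strip() == "":
            prev = out[-1] if out else ""
            # First non-blank line after this blank one, if any.
            nxt = next((l for l in lines[i + 1:] if l.strip()), "")
            if _is_import(prev) and _is_import(nxt):
                continue  # blank line inside an import segment: remove it
        out.append(line)
    return "\n".join(out)


messy = "import torch.aaa\n\nimport torch.bbb\n\nimport torch.ccc\n\nx = ...  # some code"
print(collapse_import_blank_lines(messy))
```

The sketch deliberately ignores the distinction between segments (stdlib, third-party, first-party), which the real tools handle.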

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129751
Approved by: https://github.com/malfet
2024-06-29 14:15:24 +00:00
9ae78a578c [halide-backend] Support manual schedules (#129321)
Currently using this for some by-hand hacking, but might need to implement our own scheduler later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129321
Approved by: https://github.com/shunting314
ghstack dependencies: #126417, #129025, #129026, #127506, #129036, #129320
2024-06-29 14:06:28 +00:00
a18eb651d3 [halide-backend] Disable split reductions for Halide (#129320)
In theory Halide doesn't need the split reduction stuff we do for Triton since it can generate multiple kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129320
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417, #129025, #129026, #127506, #129036
2024-06-29 14:06:28 +00:00
4cb8cb04a7 [halide-backend] Enable bfloat16 support (#129036)
Requires https://github.com/halide/Halide/pull/8255

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129036
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417, #129025, #129026, #127506
2024-06-29 14:06:25 +00:00
b93bf55b6a [halide-backend] Add GPU support (#127506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127506
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417, #129025, #129026
2024-06-29 14:06:21 +00:00
86cadc6385 [halide-backend] Dimension-based indexing (#129026)
Prior to this the generated Halide code was a rather literal translation of the Triton code, with XBLOCK/YBLOCK/RBLOCK and 1D inputs.  Halide prefers dimensions, and this 1D index triggers a lot of bugs and perf issues.  This PR infers dimensions and changes the indexing in the generated code.

Before
```py
@hl.generator(name="kernel")
class Kernel:
    in_ptr0 = hl.InputBuffer(hl.Float(32), 1)
    out_ptr3 = hl.OutputBuffer(hl.Float(32), 2)

    def generate(g):
        in_ptr0 = g.in_ptr0
        out_ptr3 = g.out_ptr3
        xindex = hl.Var('xindex')
        rindex = hl.Var('rindex')
        r1 = rindex
        x0 = xindex
        idom = hl.RDom([hl.Range(0, 16), hl.Range(0, 32)])
        odom = hl.RDom([hl.Range(0, 16)])
        rdom = hl.RDom([hl.Range(0, 32)])
        xindex_idom = idom.x
        xindex_odom = odom.x
        rindex_idom = idom.y
        r1_idom = rindex_idom
        x0_idom = xindex_idom
        x0_odom = xindex_odom
        tmp0 = hl.Func('tmp0')
        tmp0[rindex, xindex] = in_ptr0[r1 + (32*x0)]
        tmp1 = hl.Func('tmp1')
        tmp1[xindex] = hl.maximum(rdom, tmp0[rdom, xindex])
        tmp2 = hl.Func('tmp2')
        tmp2[rindex, xindex] = tmp0[rindex, xindex] - tmp1[xindex]
        tmp3 = hl.Func('tmp3')
        tmp3[rindex, xindex] = hl.fast_exp(hl.cast(hl.Float(32), tmp2[rindex, xindex])) if tmp2.type().bits() <= 32 else hl.exp(tmp2[rindex, xindex])
        tmp4 = hl.Func('tmp4')
        tmp4[xindex] = hl.sum(rdom, tmp3[rdom, xindex])
        tmp5 = hl.Func('tmp5')
        tmp5[rindex, xindex] = tmp3[rindex, xindex] / tmp4[xindex]
        out_ptr3_i0 = hl.Var('out_ptr3_i0')
        out_ptr3_i1 = hl.Var('out_ptr3_i1')
        out_ptr3[out_ptr3_i0, out_ptr3_i1] = hl.cast(out_ptr3.type(), tmp5[out_ptr3_i0, out_ptr3_i1])

        assert g.using_autoscheduler()
        in_ptr0.set_estimates([hl.Range(0, 512)])
        out_ptr3.set_estimates([hl.Range(0, 32), hl.Range(0, 16)])
```

After
```py
@hl.generator(name="kernel")
class Kernel:
    in_ptr0 = hl.InputBuffer(hl.Float(32), 2)
    out_ptr3 = hl.OutputBuffer(hl.Float(32), 2)

    def generate(g):
        in_ptr0 = g.in_ptr0
        out_ptr3 = g.out_ptr3
        h0 = hl.Var('h0')
        h1 = hl.Var('h1')
        rdom = hl.RDom([hl.Range(0, 32)])
        hr1 = rdom[0]
        tmp0 = hl.Func('tmp0')
        tmp0[h0, h1] = in_ptr0[h0, h1,]
        tmp1 = hl.Func('tmp1')
        tmp1[h1] = hl.maximum(rdom, tmp0[hr1, h1])
        tmp2 = hl.Func('tmp2')
        tmp2[h0, h1] = tmp0[h0, h1] - tmp1[h1]
        tmp3 = hl.Func('tmp3')
        tmp3[h0, h1] = hl.fast_exp(hl.cast(hl.Float(32), tmp2[h0, h1])) if tmp2.type().bits() <= 32 else hl.exp(tmp2[h0, h1])
        tmp4 = hl.Func('tmp4')
        tmp4[h1] = hl.sum(rdom, tmp3[hr1, h1])
        tmp5 = hl.Func('tmp5')
        tmp5[h0, h1] = tmp3[h0, h1] / tmp4[h1]
        out_ptr3[h0, h1,] = hl.cast(hl.Float(32), tmp5[h0, h1])

        assert g.using_autoscheduler()
        in_ptr0.dim(0).set_min(0)
        in_ptr0.dim(0).set_stride(1)
        in_ptr0.dim(0).set_extent(32)
        in_ptr0.dim(1).set_min(0)
        in_ptr0.dim(1).set_stride(32)
        in_ptr0.dim(1).set_extent(16)
        in_ptr0.set_estimates([hl.Range(0, 32), hl.Range(0, 16)])
        out_ptr3.set_estimates([hl.Range(0, 32), hl.Range(0, 16)])
```
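
As a sanity check on the two forms, the flat index `r1 + (32*x0)` in the "Before" kernel and the `(h0, h1)` index with strides 1 and 32 in the "After" kernel address the same elements. A minimal pure-Python sketch follows; the helper names are hypothetical, and the extents and strides are copied from the generated code above.

```python
DIM0_EXTENT, DIM1_EXTENT = 32, 16  # extents set on in_ptr0 in the After kernel
STRIDES = (1, 32)                  # strides of dim 0 and dim 1


def flat_offset(h0, h1):
    """Offset into the 1D buffer for the 2D index (h0, h1)."""
    return h0 * STRIDES[0] + h1 * STRIDES[1]


def unflatten(offset):
    """Recover (h0, h1) from a 1D offset."""
    h1, h0 = divmod(offset, DIM0_EXTENT)
    return h0, h1


# Every 2D index maps to a distinct offset in [0, 512), matching
# hl.Range(0, 512) in the Before kernel.
offsets = {flat_offset(h0, h1) for h0 in range(DIM0_EXTENT) for h1 in range(DIM1_EXTENT)}
print(len(offsets), min(offsets), max(offsets))
print(flat_offset(5, 3), unflatten(flat_offset(5, 3)))
```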

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129026
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417, #129025
2024-06-29 14:06:16 +00:00
da5f37515e [halide-backend] Generate standalone runtime (#129025)
This puts the Halide runtime in a global shared object, rather than copying it into each kernel.  Having many copies of the runtime causes many issues with CUDA.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129025
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417
2024-06-29 14:06:12 +00:00
e34b7e6af3 [halide-backend] Initial implementation of HalideKernel and HalideScheduling (#126417)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126417
Approved by: https://github.com/shunting314, https://github.com/eellison
2024-06-29 14:06:08 +00:00
13d4be1dc7 [pipelining] Support W action for schedules (#129233)
Add support for the `W` action in `_step_microbatches`.

## TODO:
- Clean up the tests; there's a lot of copy-pasted, repeated code there

Co-authored-by: Will Constable <whc@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129233
Approved by: https://github.com/wconstab
ghstack dependencies: #128983, #128976
2024-06-29 11:51:40 +00:00
a6da01bd01 [pipelining] Support arbitrary stage ordering on ranks (#128976)
Fixes based on discussion in https://github.com/pytorch/pytorch/issues/128665

Our previous assumption was that for looped schedules `stage_ids = range(rank, total_stages, num_local_stages)`. This is not true for all schedules. This change relaxes that assumption and allows arbitrary ordering of stages. For example, the added test uses rank 0: [stage0, stage3], rank 1: [stage1, stage2]. The test also adds a schedule registry (for testing), which runs 1 microbatch through this schedule:

```
F0_0 None None F0_3 B0_3 None None B0_0
None F0_1 F0_2 None None B0_2 B0_1 None
```
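
The difference between the old assumption and an arbitrary placement can be sketched as follows. Both helpers are hypothetical and only illustrate the bookkeeping; they are not the pipelining API.

```python
def assumed_stage_ids(rank, total_stages, num_local_stages):
    """Stage ids under the previous looped-schedule assumption."""
    return list(range(rank, total_stages, num_local_stages))


def stage_to_rank(placement, total_stages):
    """Invert an arbitrary {rank: [stage ids]} placement and validate it."""
    mapping = {}
    for rank, stages in placement.items():
        for stage in stages:
            if stage in mapping:
                raise ValueError(f"stage {stage} placed on more than one rank")
            mapping[stage] = rank
    if sorted(mapping) != list(range(total_stages)):
        raise ValueError("every stage must be placed exactly once")
    return mapping


# Old assumption with 2 ranks and 4 stages.
print(assumed_stage_ids(0, 4, 2), assumed_stage_ids(1, 4, 2))

# Placement from the added test: rank 0 holds stages 0 and 3, rank 1 holds 1 and 2.
owner = stage_to_rank({0: [0, 3], 1: [1, 2]}, total_stages=4)
print([owner[s] for s in range(4)])  # rank that runs each forward stage in order
```

The rank sequence for the forward pass lines up with the `F0_0`, `F0_1`, `F0_2`, `F0_3` columns in the schedule above.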

Co-authored-by: Will Constable <whc@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128976
Approved by: https://github.com/wconstab
ghstack dependencies: #128983
2024-06-29 11:51:39 +00:00
18ae3bab2f [Pipelining] Support separate dw_runner for PipelineStage (#128983)
Fixes #128974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128983
Approved by: https://github.com/H-Huang
2024-06-29 11:51:34 +00:00
b0e5c9514d use shutil.which in check_compiler_ok_for_platform (#129069)
the same as https://github.com/pytorch/pytorch/pull/126060
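
For reference, `shutil.which` resolves an executable name against `PATH` and returns `None` when nothing is found. A minimal sketch, where `compiler_on_path` is a hypothetical helper and not the body of `check_compiler_ok_for_platform`:

```python
import shutil


def compiler_on_path(compiler):
    """Return the resolved path of `compiler`, or None if it is not on PATH."""
    return shutil.which(compiler)


for name in ("g++", "clang++", "definitely-not-a-real-compiler"):
    print(name, "->", compiler_on_path(name))
```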
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129069
Approved by: https://github.com/ezyang
2024-06-29 11:38:51 +00:00
56935684c3 Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in .pyi stub files (#129419)
------

- [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585): e.g. `typing.List[T] -> list[T]`, `typing.Dict[KT, VT] -> dict[KT, VT]`, `typing.Type[T] -> type[T]`.
- [Union Type (PEP 604)](https://peps.python.org/pep-0604): e.g. `Union[X, Y] -> X | Y`, `Optional[X] -> X | None`, `Optional[Union[X, Y]] -> X | Y | None`.

Note that in `.pyi` stub files, we do not need `from __future__ import annotations`. So this PR does not violate issue #117449:

- #117449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129419
Approved by: https://github.com/ezyang
ghstack dependencies: #129375, #129376
2024-06-29 09:23:39 +00:00
9120992c72 [BE][Easy] enable postponed annotations in torchgen (#129376)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129376
Approved by: https://github.com/ezyang
ghstack dependencies: #129375
2024-06-29 09:23:39 +00:00
8a67daf283 [BE][Easy] enable postponed annotations in tools (#129375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129375
Approved by: https://github.com/malfet
2024-06-29 09:23:35 +00:00
58f346c874 [inductor] split cpu vec isa to dedicate file (keep git history) (#129789)
This PR implements plan 1 from https://github.com/pytorch/pytorch/issues/124245#issuecomment-2197778902

Changes:
1. Duplicate `codecache.py` to `cpu_vec_isa.py` with its `git history`.
<img width="745" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/106533da-ce80-4825-8271-35ffb3141f92">

2. Make `cpu_vec_isa.py` the dedicated file for CPU vec ISA. This also makes it easier to extend to more architectures and vec ISAs.
3. Update code for above changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129789
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-29 07:19:54 +00:00
a676b7c5f3 Add XGLMForCausalLM to the flaky model list (#129776)
Not failing on devGPU. Went to CI machine ... flaky. So adding to the flaky list.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129776
Approved by: https://github.com/mlazos
ghstack dependencies: #129583, #129610, #129775
2024-06-29 05:47:28 +00:00
5d1763d159 Add lcnet to the inline_inbuilt_nn_module list (#129775)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129775
Approved by: https://github.com/mlazos
ghstack dependencies: #129583, #129610
2024-06-29 05:47:28 +00:00
89696db4b0 Revert "[LLVM/TensorExpr] Update for an API change in LLVM 18." (#129797)
This reverts commit 20f394f10a389bcf13485929be8862f98ad4b322 (https://github.com/pytorch/pytorch/pull/117086)

LLVM upstream changed the pass builder API again, so registerPassBuilderCallbacks no longer takes extra boolean for PopulateClassToPassNames. Update accordingly.

Relevant LLVM upstream change:
https://github.com/llvm/llvm-project/pull/96321
https://github.com/llvm/llvm-project/pull/96462

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129797
Approved by: https://github.com/dcci
2024-06-29 05:17:20 +00:00
3ef44df667 [ts-migration] support prim::SetAttr and fix prim::GetAttr (#129440)
- Lifting tensor constant attributes to buffers: TorchScript does not automatically lift tensor constant attributes to buffers, so the previous converter could not access them. This PR fixes the issue.
- Add SetAttr support for tensor attributes by copy_.
- Add SetAttr support for non-tensor attributes. In particular, we maintain the current value of non-tensor attributes in `name_to_non_tensor_attribute_node`, similar to an interpreter pass on non-tensor attributes. So we can support the following use case:
```python
def forward(self, x):
    c1 = self.count
    self.count += 1
    c2 = self.count
    return x + c1 + c2
```
- Fixed a bug in GetAttr to support the following use case:
```python
def forward(self, inp):
  x = self.buffer
  self.buffer += 1
  y = self.buffer
  return x + y + inp
```
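
The "interpreter pass" idea for non-tensor attributes can be sketched with a toy instruction list: keep a dictionary of the latest value per attribute and resolve each read against it. The instruction names and the `interpret` helper are hypothetical and far simpler than the real converter.

```python
def interpret(instructions, initial_attrs):
    """Run a toy GetAttr/SetAttr program, tracking the current attribute values."""
    current = dict(initial_attrs)  # attribute name -> latest value
    env = {}                       # local variable name -> value
    for op, *args in instructions:
        if op == "GetAttr":
            out, attr = args
            env[out] = current[attr]      # read sees the most recent write
        elif op == "SetAttr":
            attr, src = args
            current[attr] = env[src]
        elif op == "AddConst":
            out, src, const = args
            env[out] = env[src] + const
        else:
            raise ValueError(f"unknown op {op}")
    return env, current


# Mirrors: c1 = self.count; self.count += 1; c2 = self.count
program = [
    ("GetAttr", "c1", "count"),
    ("AddConst", "tmp", "c1", 1),
    ("SetAttr", "count", "tmp"),
    ("GetAttr", "c2", "count"),
]
env, attrs = interpret(program, {"count": 0})
x = 10
print(x + env["c1"] + env["c2"], attrs)
```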
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129440
Approved by: https://github.com/angelayi
2024-06-29 05:08:13 +00:00
ec47d4d9a8 [Inductor] FlexAttention supports block sparse mask (#129216)
Benchmark script (causal mask): https://gist.github.com/yanboliang/c2010a1fd081d4e8ca94fadec9eef286
Initial perf number:
* fwd speedup: 0.44 -> 0.72
* bwd speedup: 0.38 -> 0.71

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129216
Approved by: https://github.com/Chillee
2024-06-29 04:44:38 +00:00
7b5a8424a1 [GPT-fast] Update micro benchmark numbers as A100-50G (#129799)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129799
Approved by: https://github.com/Chillee
2024-06-29 04:36:07 +00:00
065c386990 Allow get attributes on DDP similar to FSDP (#128620)
FSDP implements the following logic, but it's missing from DDP.
This PR adds an equivalent function to DDP.

```python
    def __getattr__(self, name: str) -> Any:
        """Forward missing attributes to the wrapped module."""
        try:
            return super().__getattr__(name)  # defer to nn.Module's logic
        except AttributeError:
            return getattr(self._fsdp_wrapped_module, name)
```
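
The same forwarding pattern can be tried without PyTorch. In the sketch below, `Wrapper` and `Inner` are made-up stand-ins for the DDP/FSDP wrapper and the wrapped module. Python only calls `__getattr__` after normal attribute lookup fails, so attributes defined on the wrapper itself still win.

```python
class Inner:
    hidden_size = 8

    def describe(self):
        return "inner module"


class Wrapper:
    def __init__(self, module):
        self.module = module

    def __getattr__(self, name):
        # Only reached when normal lookup on the wrapper fails.
        if name == "module":
            # Guard against infinite recursion before `module` is set.
            raise AttributeError(name)
        return getattr(self.module, name)


w = Wrapper(Inner())
print(w.hidden_size, w.describe(), hasattr(w, "missing"))
```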

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128620
Approved by: https://github.com/awgu
2024-06-29 01:57:22 +00:00
2bc6f329b2 Make PyTorch argparser understand complex (#129580)
It understands float and int, so why not `complex`?

Test plan: `python -c "import torch;print(torch.rand(3, dtype=complex))"`

Fixes https://github.com/pytorch/pytorch/issues/126837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129580
Approved by: https://github.com/albanD
2024-06-29 01:21:12 +00:00
dfd55d1714 Revert "[cond] inlining into one of the branches when pred is a python constant (#128709)"
This reverts commit 23adf166e166bd56e3446284939af7e46a181079.

Reverted https://github.com/pytorch/pytorch/pull/128709 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is breaking one ExecuTorch test ([comment](https://github.com/pytorch/pytorch/pull/128709#issuecomment-2197806850))
2024-06-29 01:03:55 +00:00
3d96217891 Revert "[BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)"
This reverts commit 9e1f3ecaa710785a1ab03c6ad5093a5566d6c5e5.

Reverted https://github.com/pytorch/pytorch/pull/129374 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is still failing with the same error ([comment](https://github.com/pytorch/pytorch/pull/129374#issuecomment-2197801405))
2024-06-29 00:47:15 +00:00
c0782e7c81 Kineto profiler: collecting observer traces from C++ child threads (#128743)
Summary:
In a C++ program, if we have child threads doing GPU work, it would be nice to get traces of those threads as well. The problem is that pushProfilingCallbacks() is not called on child threads, so no observer traces are collected on them and they are entirely missing from the final output.

This diff provides a new API that a child thread may elect to call to register itself with the profiler that was started in the main thread (or whichever Python thread manages the profiler).

Test Plan:
```
buck2 test @mode/opt //caffe2/test:profiler_test_cpp_thread
```

Reviewed By: aaronenyeshi

Differential Revision: D56669942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128743
Approved by: https://github.com/aaronenyeshi
2024-06-29 00:44:30 +00:00
a32ce5ce34 Revert "[BE][Easy] enable postponed annotations in tools (#129375)"
This reverts commit 59eb2897f1745f513edb6c63065ffad481c4c8d0.

Reverted https://github.com/pytorch/pytorch/pull/129375 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert to cleanly revert https://github.com/pytorch/pytorch/pull/129374, please do a rebase and reland this ([comment](https://github.com/pytorch/pytorch/pull/129375#issuecomment-2197800541))
2024-06-29 00:44:25 +00:00
6063bb9d45 Revert "[BE][Easy] enable postponed annotations in torchgen (#129376)"
This reverts commit 494057d6d4e9b40daf81a6a4d7a8c839b7424b14.

Reverted https://github.com/pytorch/pytorch/pull/129376 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert to cleanly revert https://github.com/pytorch/pytorch/pull/129374, please do a rebase and reland this ([comment](https://github.com/pytorch/pytorch/pull/129375#issuecomment-2197800541))
2024-06-29 00:44:25 +00:00
83caf4960f Revert "Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in .pyi stub files (#129419)"
This reverts commit e40f50cb87bcd176a380b729af5dda13dbe9c399.

Reverted https://github.com/pytorch/pytorch/pull/129419 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert to cleanly revert https://github.com/pytorch/pytorch/pull/129374, please do a rebase and reland this ([comment](https://github.com/pytorch/pytorch/pull/129375#issuecomment-2197800541))
2024-06-29 00:44:24 +00:00
00d7bba2fa Revert "[BE] enforce style for empty lines in import segments (#129751)"
This reverts commit f5ff1a3ab9ef279655308266029faf6543a8a1ca.

Reverted https://github.com/pytorch/pytorch/pull/129751 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I need to revert to cleanly revert https://github.com/pytorch/pytorch/pull/129374, please do a rebase and reland this ([comment](https://github.com/pytorch/pytorch/pull/129751#issuecomment-2197799814))
2024-06-29 00:41:41 +00:00
fa6c0fe3e4 Revert "Conversions between strided and jagged layouts for Nested Tensors (#115749)"
This reverts commit 9450e198aa0bdf6f81ccb8ad2f74c06e81d1af6e.

Reverted https://github.com/pytorch/pytorch/pull/115749 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/115749#issuecomment-2197790226))
2024-06-29 00:16:47 +00:00
24f69eef6a [FSDP2] Ran reduce-scatter copy-in in default stream (#129721)
This PR runs the reduce-scatter copy-in in the default stream, allowing the reduce-scatter input (large allocation proportional to unsharded gradients) to be allocated in the default stream to avoid fragmenting that memory across stream memory pools.
- In general, minimizing memory usage spikes in non-default-stream memory pools helps because otherwise, that memory cannot be reused by the default stream outside of that spike. This reduce-scatter input allocation represents one such spike. The reduce-scatter outputs are still allocated in the separate `reduce_scatter` stream since they are small and have a non-spiky allocation/free pattern (we iteratively allocate them through backward and free them altogether after optimizer).
- This PR should not have any impact on overlap (I sanity checked Llama3-8B traces from torchtitan; plus we have the `test_fully_shard_overlap.py` unit tests).

**Experiment**
**(Before)** Llama3-8B, 1D FSDP, 8 H100s, bf16/fp32 mixed precision, no AC, local batch size 1:
```
[rank0]:2024-06-27 16:38:56,620 - root - INFO - step:  1  loss: 12.2764  memory: 71.99GiB(75.75%)  wps: 1,436  mfu: 8.41%
[rank0]:2024-06-27 16:38:56,620 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:2024-06-27 16:38:57,943 - root - INFO - step:  2  loss: 12.1001  memory: 79.82GiB(83.98%)  wps: 6,195  mfu: 36.28%
[rank0]:2024-06-27 16:38:59,266 - root - INFO - step:  3  loss: 11.7697  memory: 79.82GiB(83.98%)  wps: 6,193  mfu: 36.27%
[rank0]:2024-06-27 16:39:00,587 - root - INFO - step:  4  loss: 11.2807  memory: 79.82GiB(83.98%)  wps: 6,203  mfu: 36.32%
[rank0]:2024-06-27 16:39:01,910 - root - INFO - step:  5  loss: 10.9494  memory: 79.82GiB(83.98%)  wps: 6,198  mfu: 36.30%
```

**(After)** Llama3-8B, 1D FSDP, 8 H100s, bf16/fp32 mixed precision, no AC, local batch size 1:
```
[rank0]:2024-06-27 16:41:12,106 - root - INFO - step:  1  loss: 12.2560  memory: 69.46GiB(73.08%)  wps: 1,158  mfu: 6.78%
[rank0]:2024-06-27 16:41:12,106 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:2024-06-27 16:41:13,502 - root - INFO - step:  2  loss: 12.0949  memory: 77.29GiB(81.32%)  wps: 5,870  mfu: 34.37%
[rank0]:2024-06-27 16:41:14,839 - root - INFO - step:  3  loss: 11.7770  memory: 77.29GiB(81.32%)  wps: 6,130  mfu: 35.90%
[rank0]:2024-06-27 16:41:16,154 - root - INFO - step:  4  loss: 11.3188  memory: 77.29GiB(81.32%)  wps: 6,230  mfu: 36.48%
[rank0]:2024-06-27 16:41:17,474 - root - INFO - step:  5  loss: 10.9443  memory: 77.29GiB(81.32%)  wps: 6,211  mfu: 36.37%
```
**2.53 GiB reduction in peak reserved memory.**
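
The reduction can be reproduced from the logs by taking the peak of the `memory:` field in each run. A minimal sketch, with the log lines shortened to the relevant field:

```python
import re

BEFORE = [
    "step:  1  loss: 12.2764  memory: 71.99GiB(75.75%)",
    "step:  2  loss: 12.1001  memory: 79.82GiB(83.98%)",
    "step:  5  loss: 10.9494  memory: 79.82GiB(83.98%)",
]
AFTER = [
    "step:  1  loss: 12.2560  memory: 69.46GiB(73.08%)",
    "step:  2  loss: 12.0949  memory: 77.29GiB(81.32%)",
    "step:  5  loss: 10.9443  memory: 77.29GiB(81.32%)",
]

MEMORY_RE = re.compile(r"memory:\s*([\d.]+)GiB")


def peak_memory_gib(lines):
    """Largest reserved-memory value reported across the given log lines."""
    return max(float(MEMORY_RE.search(line).group(1)) for line in lines)


reduction = round(peak_memory_gib(BEFORE) - peak_memory_gib(AFTER), 2)
print(f"{reduction} GiB")
```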

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129721
Approved by: https://github.com/weifengpy, https://github.com/yifuwang
2024-06-28 23:55:12 +00:00
f06e3a1569 [Split Build] Make script not crash if split build is not set (#129774)
Fixes issue causing https://github.com/pytorch/pytorch/actions/runs/9704484834/job/26801889463 to crash
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129774
Approved by: https://github.com/atalman
2024-06-28 23:50:18 +00:00
7bda23ef84 [BE]: Update ruff to 0.5.0 (#129744)
Update ruff to 0.5.0 so we can enable some of the new checks I've been wanting to add to the codebase. This first step just updates the code to comply with some rule changes and a couple of minor API changes / deprecations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129744
Approved by: https://github.com/ezyang
2024-06-28 21:49:56 +00:00
0a337613f8 Fix typo in stack_module_state doc (#129126)
I think there is a typo in the first example of the `torch.func.stack_module_state` documentation. The first parameter in the function call in the `wrapper` return is missing an 's'.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129126
Approved by: https://github.com/zou3519
2024-06-28 21:36:40 +00:00
f5ff1a3ab9 [BE] enforce style for empty lines in import segments (#129751)
This PR follows https://github.com/pytorch/pytorch/pull/129374#pullrequestreview-2136555775 cc @malfet:

> Lots of formatting changes unrelated to PR goal, please keep them as part of separate PR (and please add lint rule if you want to enforce those, or at least cite one)

`usort` allows empty lines within import segments. For example, `usort` does not change any of the following snippets:

```python
import torch.aaa
import torch.bbb
import torch.ccc

x = ...  # some code
```

```python
import torch.aaa

import torch.bbb
import torch.ccc

x = ...  # some code
```

```python
import torch.aaa

import torch.bbb

import torch.ccc

x = ...  # some code
```

This PR first sorts imports via `isort`, then re-sorts the file using `ufmt` (`usort` + `black`). This enforces the following import style:

1. no empty lines within segments.
2. single empty line between segments.
3. two spaces after import statements.

All the code snippets above will be formatted to:

```python
import torch.aaa
import torch.bbb
import torch.ccc

x = ...  # some code
```

which produces a consistent code style.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129751
Approved by: https://github.com/malfet
2024-06-28 21:02:59 +00:00
5b96a552df Add a check and error message for no support on MPS for conv with output_channels > 2^16 (#129484)
Fixes the silent correctness issue in #129207 by preventing the user from calling the convolution op on MPS device with an unsupported value.

The fix for the missing support will come later: it requires work on the kernel side, so it will take more time.
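
The shape of such a guard, sketched in Python: fail loudly on an unsupported value instead of silently returning wrong results. The function name, the error type, and the message are assumptions for illustration; only the 2^16 limit comes from the PR title.

```python
MAX_MPS_CONV_OUTPUT_CHANNELS = 2 ** 16  # limit taken from the PR title


def check_conv_output_channels(out_channels, device):
    """Hypothetical pre-flight check that raises instead of computing garbage."""
    if device == "mps" and out_channels > MAX_MPS_CONV_OUTPUT_CHANNELS:
        raise NotImplementedError(
            f"conv with output_channels={out_channels} > 2^16 is not supported on MPS"
        )


check_conv_output_channels(2 ** 16, "mps")      # at the limit: accepted
check_conv_output_channels(2 ** 16 + 1, "cpu")  # other devices are unaffected
try:
    check_conv_output_channels(2 ** 16 + 1, "mps")
except NotImplementedError as err:
    print(err)
```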
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129484
Approved by: https://github.com/kulinseth
2024-06-28 20:57:40 +00:00
bc8883a7c4 fix the error msg in device_mesh (#129747)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129747
Approved by: https://github.com/awgu, https://github.com/wconstab
2024-06-28 20:12:09 +00:00
45f3e20527 Improve error message for weights_only load (#129705)
As @vmoens pointed out, the current error message does not make the "either/or" between setting `weights_only=False` and using `add_safe_globals` clear enough, and it should print the code for the user to call `add_safe_globals`.

The new formatting looks like this.

In the case that `add_safe_globals` can be used:

```python
>>> import torch
>>> from torch.testing._internal.two_tensor import TwoTensor
>>> torch.save(TwoTensor(torch.randn(2), torch.randn(2)), "two_tensor.pt")
>>> torch.load("two_tensor.pt", weights_only=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/users/mg1998/pytorch/torch/serialization.py", line 1225, in load
    raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options
        (1) Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
        (2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
        WeightsUnpickler error: Unsupported global: GLOBAL torch.testing._internal.two_tensor.TwoTensor was not an allowed global by default. Please use `torch.serialization.add_safe_globals([TwoTensor])` to allowlist this global if you trust this class/function.

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```

For other issues (unsupported bytecode)
```python
>>> import torch
>>> t = torch.randn(2, 3)
>>> torch.save(t, "protocol_5.pt", pickle_protocol=5)
>>> torch.load("protocol_5.pt", weights_only=True)
/data/users/mg1998/pytorch/torch/_weights_only_unpickler.py:359: UserWarning: Detected pickle protocol 5 in the checkpoint, which was not the default pickle protocol used by `torch.load` (2). The weights_only Unpickler might not support all instructions implemented by this protocol, please file an issue for adding support if you encounter this.
  warnings.warn(
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/users/mg1998/pytorch/torch/serialization.py", line 1225, in load
    raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
 Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error: Unsupported operand 149

Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```

Old formatting would have been like:
```python
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/data/users/mg1998/pytorch/torch/serialization.py", line 1203, in load
    raise pickle.UnpicklingError(UNSAFE_MESSAGE + str(e)) from None
_pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you get the file from a trusted source. Alternatively, to load with `weights_only` please check the recommended steps in the following error message. WeightsUnpickler error: Unsupported global: GLOBAL torch.testing._internal.two_tensor.TwoTensor was not an allowed global by default. Please use `torch.serialization.add_safe_globals` to allowlist this global if you trust this class/function.
```
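
The allowlist idea behind `weights_only` can be illustrated with the standard library alone: a `pickle.Unpickler` subclass that overrides `find_class` and rejects any global not explicitly allowed. This is a generic sketch of the technique, not PyTorch's `_weights_only_unpickler`; `AllowlistUnpickler` and `safe_loads` are made-up names.

```python
import io
import pickle
from collections import OrderedDict
from fractions import Fraction


class AllowlistUnpickler(pickle.Unpickler):
    """Unpickler that only resolves globals named in an explicit allowlist."""

    def __init__(self, file, allowed):
        super().__init__(file)
        self._allowed = set(allowed)

    def find_class(self, module, name):
        if (module, name) in self._allowed:
            return super().find_class(module, name)
        raise pickle.UnpicklingError(f"Unsupported global: {module}.{name}")


def safe_loads(data, allowed=()):
    return AllowlistUnpickler(io.BytesIO(data), allowed).load()


# Plain containers need no globals at all.
print(safe_loads(pickle.dumps({"weights": [1.0, 2.0]})))

# An allowlisted class loads; anything else is rejected.
print(safe_loads(pickle.dumps(OrderedDict(a=1)), allowed=[("collections", "OrderedDict")]))
try:
    safe_loads(pickle.dumps(Fraction(1, 2)))
except pickle.UnpicklingError as err:
    print(err)
```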

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129705
Approved by: https://github.com/albanD, https://github.com/vmoens
ghstack dependencies: #129239, #129396, #129509
2024-06-28 19:36:31 +00:00
99456a612b [AOTI] Properly indent launchKernel calls in AOTInductor (#129616)
Summary:
There is a small cosmetic issue in the C++ wrapper file generated by AOTInductor: the launchKernel() call isn't properly indented.

This PR adds indentation for the launchKernel() call block when it sits under an `if` condition, i.e. when `grid_uses_symbolic_shapes` is `True`.

Test Plan:
Test cmd ran (in pytorch oss):

`TORCH_LOGS="output_code" TORCH_COMPILE_DEBUG=1 python test/inductor/test_aot_inductor.py -k test_zero_grid_with_backed_symbols_abi_compatible_cuda`

And then manually verified the output code generated in a path like
`/tmp/torchinductor_guorachel/coraisesuchpl3qabrazn7ydydszcit6lwpn7ckd3b4wej4rep5l/cba5g5ajeh5sym3tp5iqn7kkokimj7qqd4krs2rruhupbfqgppge.cpp`

Similarly, also verified for test case:`test_zero_grid_with_unbacked_symbols_abi_compatible_cuda`

Differential Revision: D58897157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129616
Approved by: https://github.com/ColinPeppler
2024-06-28 19:16:18 +00:00
6120aa3718 [nn-module] Use standard dict for _parameters, _modules and _buffers (#129164)
TorchDynamo's guard mechanism guards on the key order of a dictionary if the user iterates over it. For a standard dict, we can write a fast C++ implementation using PyDict_Next. But with OrderedDict, we have to rely on the `keys` Python API to get the key ordering. This makes guard evaluation slow.

With Dynamo inlining into inbuilt nn modules, I am seeing many guards over the OrderedDict on `_modules`, `_parameters`. From reading the code, I don't see any reason to not use standard dicts. I think OrderedDict was preferred over dict because of the ordering, but dicts are now ordered. With this PR, I am observing ~20% reduction in guard overhead of a HF model.

Functionality impact
- The only difference between dict and OrderedDict is the `move_to_end` method of OrderedDict ([link](https://stackoverflow.com/questions/34305003/difference-between-dictionary-and-ordereddict)). But the changes here are internal to nn module, and we do not use `move_to_end` for `_parameters`, `_modules` and `_buffers`. We use `move_to_end` for hooks, but this PR keeps the OrderedDict for hooks untouched (we should still follow up with hooks, but in a separate PR).

Perf impact
- I don't anticipate any perf impact. `dict` is completely implemented in C. OrderedDict is a Python wrapper over dict with only a few methods overridden ([link](https://stackoverflow.com/questions/34305003/difference-between-dictionary-and-ordereddict)).

Typing impact
- I don't anticipate any. For all the user-visible methods of nn.Module, we don't expose the underlying `_modules` etc. We have iterators like `named_parameters` which return an Iterator of Parameter. So, no typing changes are required.
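
A quick illustration of the two properties the argument relies on: a plain `dict` preserves insertion order, and `move_to_end` exists only on `OrderedDict`. The keys below are placeholders.

```python
from collections import OrderedDict

# Plain dicts keep insertion order (guaranteed since Python 3.7).
params = {}
params["weight"] = "w"
params["bias"] = "b"
print(list(params))

# move_to_end is specific to OrderedDict, which is why hooks keep using it.
hooks = OrderedDict(first=1, second=2)
hooks.move_to_end("first")
print(list(hooks), hasattr(params, "move_to_end"))
```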
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129164
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #129163
2024-06-28 18:30:13 +00:00
db4c7bb7fc Refine typing annotation for compile (#129136)
before
![image](https://github.com/pytorch/pytorch/assets/46243324/91372d0f-ad0e-4abe-9582-7fe892f99ec8)

after
![image](https://github.com/pytorch/pytorch/assets/46243324/175066ff-78f9-44a1-a3bb-5df809f7e86d)

Co-authored-by: Nyakku Shigure <sigure.qaq@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129136
Approved by: https://github.com/ezyang
2024-06-28 17:57:44 +00:00
59e4e92556 sdp::SDPBackend::flash_attention support PrivateUse1 (#126392)
Fixes https://github.com/pytorch/pytorch/issues/124271

cc  @cpuhrsch @drisspg @albanD @soulitzer

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126392
Approved by: https://github.com/drisspg
2024-06-28 17:48:40 +00:00
26d633b721 [BE] Correctly catch skip signals emitting from sys.exit in Sandcastle (#129731)
https://github.com/pytorch/pytorch/pull/129581 does not work correctly with the Sandcastle environment. This PR fixes the issue.

Differential Revision: [D59144062](https://our.internmc.facebook.com/intern/diff/D59144062/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129731
Approved by: https://github.com/wz337
2024-06-28 17:24:12 +00:00
c12a4f2e65 Add decomposition for slice_scatter (#123744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123744
Approved by: https://github.com/peterbell10
2024-06-28 17:02:10 +00:00
6897631ceb Guard on inner tensor names for traceable wrapper subclasses (#129618)
Fixes #129601

Background: it's possible that a traceable wrapper subclass will have an optional inner tensor constituent (e.g. NJT's cached min / max sequence lengths). To specify this, the subclass's `__tensor_flatten__()` impl should leave out any unspecified optional inner tensors in the returned list of `attrs`.

This PR guards on the list of inner tensor `attrs` returned in `subclass.__tensor_flatten__()[0]`.
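
As a rough illustration of the idea, here is a torch-free sketch. `ToyWrapper` and `attrs_guard_matches` are hypothetical names, plain lists stand in for inner tensors, and the real guard lives inside Dynamo rather than in a helper like this.

```python
class ToyWrapper:
    """Stand-in for a traceable wrapper subclass; inner 'tensors' are plain lists."""

    def __init__(self, values, cached_max=None):
        self._values = values
        self._cached_max = cached_max  # optional inner constituent

    def __tensor_flatten__(self):
        # Leave out optional inner tensors that are not set.
        attrs = ["_values"]
        if self._cached_max is not None:
            attrs.append("_cached_max")
        ctx = {}
        return attrs, ctx


def attrs_guard_matches(compiled_for, candidate):
    """Toy guard: compare the attr lists returned by __tensor_flatten__()[0]."""
    return compiled_for.__tensor_flatten__()[0] == candidate.__tensor_flatten__()[0]


without_cache = ToyWrapper([1, 2, 3])
with_cache = ToyWrapper([1, 2, 3], cached_max=[3])
```

Under this sketch, an input with a different set of inner tensors fails the guard and would need a recompile.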

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129618
Approved by: https://github.com/anijain2305
2024-06-28 16:30:25 +00:00
b84036e3fb [AOTI] Fix test_dynamic_scalar_abi_compatible_cpu_with_stack_allocation (#129173)
Fixes #122978
## Summary
To fix compilation error for test test_dynamic_scalar_abi_compatible_cpu_with_stack_allocation

- Error 1
```
error: no matching function for call to ‘torch::aot_inductor::ArrayRefTensor<float>::ArrayRefTensor(float [1], const int64_t [0], const int64_t [0], int&, int32_t&)’
  613 |     ArrayRefTensor<float> buf3(buf3_storage, int_array_6, int_array_6, cached_torch_device_type_cpu, this->device_idx_);
      |                                                                                                                       ^
...
torch/include/torch/csrc/inductor/aoti_runtime/arrayref_tensor.h:188:35: note:   no known conversion for argument 2 from ‘const int64_t [0]’ {aka ‘const long int [0]’} to ‘torch::aot_inductor::MiniArrayRef<const long int>’
  188 |       MiniArrayRef<const int64_t> sizes,
      |       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~
```
Fix: added constructor for empty array in arrayref_tensor.h
- Error 2
```
error: cannot convert ‘torch::aot_inductor::ArrayRefTensor<float>’ to ‘AtenTensorHandle’ {aka ‘AtenTensorOpaque*’}
  625 |     AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float32(buf3, &zuf0_raw));
      |                                                         ^~~~
      |                                                         |
      |                                                         torch::aot_inductor::ArrayRefTensor<float>
```
Fix: in cpp_wrapper_cpu.py, added codegen to convert ArrayRefTensor to AtenTensorHandle first.
## Test Plan
```
python test/inductor/test_aot_inductor.py -k AOTInductorTestABICompatibleCpuWithStackAllocation.test_dynamic_scalar_abi_compatible_cpu_with_stack_allocation
```

Before the fix, detailed in  #122978:
```
 |     AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float32(buf3, &zuf0_raw));
      |                                                         ^~~~
      |                                                         |
      |                                                         torch::aot_inductor::ArrayRefTensor<float>
/home/yingzhaoseattle/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/utils.h:34:8: note: in definition of macro ‘AOTI_TORCH_ERROR_CODE_CHECK’
Ran 1 test in 4.377s
FAILED (errors=1)
```
After the fix

```
/home/yingzhaoseattle/pytorch/torch/backends/cudnn/__init__.py:107: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system.
  warnings.warn(
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('extern_calls', 1)]
.
----------------------------------------------------------------------
Ran 1 test in 9.633s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129173
Approved by: https://github.com/chenyang78
2024-06-28 16:27:42 +00:00
04264efab6 Add structured logging on FXGraphCache hit (#129588)
We'll also want to do this for AOTAutogradCache once that's ready

Differential Revision: [D59144226](https://our.internmc.facebook.com/intern/diff/D59144226)
Co-authored-by: Oguz Ulgen <oulgen@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129588
Approved by: https://github.com/oulgen, https://github.com/xmfan
2024-06-28 16:06:22 +00:00
e40f50cb87 Use Generic TypeAlias (PEP 585) and Union Type (PEP 604) in .pyi stub files (#129419)
------

- [Generic TypeAlias (PEP 585)](https://peps.python.org/pep-0585): e.g. `typing.List[T] -> list[T]`, `typing.Dict[KT, VT] -> dict[KT, VT]`, `typing.Type[T] -> type[T]`.
- [Union Type (PEP 604)](https://peps.python.org/pep-0604): e.g. `Union[X, Y] -> X | Y`, `Optional[X] -> X | None`, `Optional[Union[X, Y]] -> X | Y | None`.

Note that in `.pyi` stub files, we do not need `from __future__ import annotations`. So this PR does not violate issue #117449:

- #117449

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129419
Approved by: https://github.com/ezyang
ghstack dependencies: #129375, #129376
2024-06-28 15:37:57 +00:00
494057d6d4 [BE][Easy] enable postponed annotations in torchgen (#129376)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129376
Approved by: https://github.com/ezyang
ghstack dependencies: #129375
2024-06-28 15:37:57 +00:00
59eb2897f1 [BE][Easy] enable postponed annotations in tools (#129375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129375
Approved by: https://github.com/malfet
2024-06-28 15:37:54 +00:00
2e3ff394bf [inductor] optimize cpp builder configuration code (#129577)
Changes:
1. Combine choose isa condition dispatch code.
2. Unify macOS OpenMP configuration code.
3. Clean up useless code.

Co-authored-by: Jason Ansel <jansel@jansel.net>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129577
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-28 15:08:54 +00:00
eabe6574c0 [metal] Parameterize group_size in int4_mm test, fix int4mm shader for group_size > 128 (#129628)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129628
Approved by: https://github.com/kimishpatel
2024-06-28 15:01:30 +00:00
635d6c9d66 [FSDP2] Ran post-acc-grad hooks manually (#129450)
FSDP2 accumulates gradients for sharded parameters outside of the autograd engine's normal accumulation logic. We can respect registered post-accumulate-grad hooks by running them manually.

**Discussion**
Discussing with @soulitzer, changing FSDP2 to make the sharded parameters autograd leaves requires nontrivial changes to FSDP and some changes to the autograd engine (around forward vs. backward streams) where the changes may not preserve eager-mode performance and/or add some complexity.

Under the FSDP2 design, the sharded parameters never participate in autograd, so calling `register_post_accumulate_grad_hook` on them would otherwise be a no-op. In other words, there is virtually no chance for FSDP2 incorrectly re-running the hook when it should not.

Given these, a reasonable near-term solution is for FSDP2 to run the post-accumulate-grad hooks manually.
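
The idea of running hooks manually can be sketched without PyTorch. `ToyShardedParam` and `accumulate_grad_manually` are hypothetical stand-ins, gradients are plain floats, and this is not the FSDP2 implementation.

```python
class ToyShardedParam:
    """Stand-in for a sharded parameter; `grad` is a plain float here."""

    def __init__(self, name):
        self.name = name
        self.grad = None
        self._post_acc_grad_hooks = []

    def register_post_accumulate_grad_hook(self, hook):
        self._post_acc_grad_hooks.append(hook)


def accumulate_grad_manually(param, new_grad):
    """Accumulate outside any autograd engine, then run the hooks by hand."""
    param.grad = new_grad if param.grad is None else param.grad + new_grad
    for hook in param._post_acc_grad_hooks:
        hook(param)


seen = []
p = ToyShardedParam("layer0.weight")
p.register_post_accumulate_grad_hook(lambda param: seen.append((param.name, param.grad)))
accumulate_grad_manually(p, 1.5)
accumulate_grad_manually(p, 0.5)
```

Because the accumulation code is the only place the gradient changes, it is also the only place that fires the hooks, so each hook runs exactly once per accumulation.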

**Caveats**
- Running `foreach=False` optimizer _per parameter tensor_  incurs significantly higher CPU overhead compared to `foreach=True` (partially due to `DTensor` being a `__torch_dispatch__` tensor subclass).
    - On preliminary benchmarking on Llama3-8B on 8 GPUs, this CPU overhead is mostly tolerable, but on smaller # of GPUs or a less compute-intensive model, this may not be.
    - One solution for native Adam/AdamW is to use `fused=True`, which makes both the CPU overhead lower and GPU compute faster. However, this is generally not an option for user-defined optimizers.
    - If this CPU overhead blocks adoption of this feature, then we should seriously consider an FSDP-specific API like `register_post_backward_hook(params: List[nn.Parameter]) -> None` that allows the user to see all parameters in the `FSDPParamGroup` together for the hook so that the user can still run a `foreach=True` optimizer step on that `List[nn.Parameter]`.
- The post-accumulate-grad hook runs in the reduce-scatter stream. Our current stream handling logic does not have the default stream wait for the reduce-scatter stream until the end of backward. Unless we add that, we cannot simply run the post-accumulate-grad hook in the default stream.
    - This means that optimizer compute will overlap with backward compute, which may slow down end-to-end execution slightly (e.g. due to SM contention or wave quantization effects). For example, on Llama3-8B, we see about a 3% decrease in MFU when running optimizer in backward even though the optimizer steps are fully overlapped and there are no CPU boundedness issues.
- This PR's goal is only to run the hook manually. State dict etc. for optimizer-in-backward is out of scope.

**Experiments (torchtitan)**
- Llama3-8B on 2 GPUs, local batch size 1, with full activation checkpointing, and bf16/fp32 mixed precision:
    - Without optimizer-in-backward: 82.03 GiB reserved memory; 28.1% MFU
    - With optimizer-in-backward (`foreach=False`): 72.84 GiB reserved memory; 28.9% MFU (speedup from more of optimizer step overlapped)
    - With optimizer-in-backward (`fused=True`): 70.84 GiB reserved memory; 30.4% MFU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129450
Approved by: https://github.com/weifengpy, https://github.com/yf225
2024-06-28 14:50:09 +00:00
fe4032fe20 [BE][CMake] Do not use EXEC_PROGRAM (#129714)
It was deprecated since CMake-3.0 in favor of `execute_process`, see https://cmake.org/cmake/help/v3.18/command/exec_program.html

This makes the following warning disappear:
```
CMake Warning (dev) at cmake/Modules/FindARM.cmake:5 (EXEC_PROGRAM):
  Policy CMP0153 is not set: The exec_program command should not be called.
  Run "cmake --help-policy CMP0153" for policy details.  Use the cmake_policy
  command to set the policy and suppress this warning.

  Use execute_process() instead.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129714
Approved by: https://github.com/kit1980
2024-06-28 13:29:52 +00:00
98d34d849d Add a XPU UT to ensure lazy init (#129638)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129638
Approved by: https://github.com/gujinghui
2024-06-28 13:22:17 +00:00
22a06869f2 include jit/*.pyi (#129654)
Fixes #108781, see https://github.com/pytorch/pytorch/pull/108782#issuecomment-1927321532

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129654
Approved by: https://github.com/ezyang
2024-06-28 12:40:11 +00:00
424068d0d2 [Windows] remove mkl shared library dependency. (#129493)
# Background
I have fixed the PyTorch Windows missing mkl shared library dependency issue: https://github.com/pytorch/pytorch/issues/124009
The solution is to have the torch_cpu module link the mkl library statically:
1. pytorch static link mkl PR: https://github.com/pytorch/pytorch/pull/124925
2. builder install mkl static library: https://github.com/pytorch/builder/pull/1790

Double-checked that the current build links mkl statically: https://github.com/pytorch/pytorch/issues/124009#issuecomment-2160941802

# Goal
Remove the setup.py `install_requires` entry that installs the mkl shared library for PyTorch on Windows. It is no longer required because we now link mkl statically.
This reduces the network traffic of a PyTorch install and avoids installing a useless mkl shared library package.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129493
Approved by: https://github.com/malfet
2024-06-28 11:42:21 +00:00
a0dac3de31 Noise tensor using same size/stride with input to promote performance when channel last situation. (#129467)
All ops in the _dropout_impl function are point-wise ops. When the input and output tensors have the same size and stride, those operators perform better. So I have removed the memory format argument from at::empty_like when making the noise tensor.

@ezyang

Test code:
```
import torch

input1 = torch.randn((50, 20, 50 ,30)).cuda()
input2 = torch.randn((50, 20, 50 ,30)).cuda().to(memory_format=torch.channels_last)
input3 = torch.randn((50, 20, 50 , 50)).cuda()[...,10:40]
dropout = torch.nn.Dropout(p=0.5, inplace=True)

# warmup:
for i in range(20):
    output = dropout(input1)
start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
num = 10000
start_event.record()
for i in range(num):
    output = dropout(input1)
end_event.record()
end_event.synchronize()
time = start_event.elapsed_time(end_event)
print("input1 each time: {0}.".format(time * 1.0/num), flush =True)

start_event.record()
for i in range(num):
    output = dropout(input2)
end_event.record()
end_event.synchronize()
time = start_event.elapsed_time(end_event)
print("input2 each time: {0}.".format(time * 1.0/num), flush =True)

start_event.record()
for i in range(num):
    output = dropout(input3)
end_event.record()
end_event.synchronize()
time = start_event.elapsed_time(end_event)
print("input3 each time: {0}.".format(time * 1.0/num), flush =True)
```

Test result:

  | Operator name | Input size / stride | Memory format argument passed to empty | Time (ms) | Notes
-- | -- | -- | -- | -- | --
1 | dropout | (50, 20, 50 ,30) / (30000, 1500, 30, 1) | LEGACY_CONTIGUOUS_MEMORY_FORMAT | 0.0426735 |  
2 | dropout | (50, 20, 50 ,30) / (30000, 1, 600, 20) | LEGACY_CONTIGUOUS_MEMORY_FORMAT | 0.0461689 |  
3 | dropout | (50, 20, 50 ,30) / (50000, 2500, 50, 1) | LEGACY_CONTIGUOUS_MEMORY_FORMAT | 0.0512882 |  
4 | dropout | (50, 20, 50 ,30) / (30000, 1500, 30, 1) | None; size/stride follow the input | 0.0426598 | Compared with 1: essentially the same
5 | dropout | (50, 20, 50 ,30) / (30000, 1, 600, 20) | None; size/stride follow the input | 0.0422751 | Compared with 2: about 8.4% faster
6 | dropout | (50, 20, 50 ,30) / (50000, 2500, 50, 1) | None; size/stride follow the input | 0.0509037 | Compared with 3: essentially the same

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129467
Approved by: https://github.com/ezyang
2024-06-28 10:06:13 +00:00
999eec8dea Revert "[cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343)"
This reverts commit b7e7a4cb01de394af7686ab6feb216a8a5c716bb.

Reverted https://github.com/pytorch/pytorch/pull/125343 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break some test_transformer running on internal A100 and V100 ([comment](https://github.com/pytorch/pytorch/pull/125343#issuecomment-2196202003))
2024-06-28 06:03:54 +00:00
d21993bbb8 Revert "[cuDNN][SDPA] Bail out of dispatching to cuDNN for head dim > 128 on Ampere (#129587)"
This reverts commit 7854d84acbfb7a4e3e807951188535a0316b585e.

Reverted https://github.com/pytorch/pytorch/pull/129587 on behalf of https://github.com/huydhn due to Sorry for revert yet another of your change but I need to revert this to cleanly revert https://github.com/pytorch/pytorch/pull/125343#issuecomment-2196187332 ([comment](https://github.com/pytorch/pytorch/pull/129587#issuecomment-2196198756))
2024-06-28 06:01:07 +00:00
c43923a116 Revert "[Inductor] FlexAttention supports block sparse mask (#129216)"
This reverts commit b9d3cedd648d4ed9d0bf5b918893341e5f95289a.

Reverted https://github.com/pytorch/pytorch/pull/129216 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is still failing in trunk b9d3cedd64, maybe a landrace given that TD has been turned off ([comment](https://github.com/pytorch/pytorch/pull/129216#issuecomment-2196182882))
2024-06-28 05:44:46 +00:00
73eb4503cc Enable UFMT for numpy_test files, test_xnnpack_integration.py (#129023)
Fixes #123062

Run lintrunner on files:
test/test_xnnpack_integration.py

```bash
$ lintrunner

  FLAKE8 success!
  CLANGFORMAT success!
  MYPY success!
  MYPYSTRICT success!
  CLANGTIDY success!
  TYPEIGNORE success!
  TYPENOSKIP success!
  NOQA success!
  NATIVEFUNCTIONS success!
  NEWLINE success!
  CONSTEXPR success!
  SPACES success!
  TABS success!
  INCLUDE success!
  PYBIND11_INCLUDE success!
  ERROR_PRONE_ISINSTANCE success!
  PYBIND11_SPECIALIZATION success!
  PYPIDEP success!
  EXEC success!
  CUBINCLUDE success!
  RAWCUDADEVICE success!
  RAWCUDA success!
  ROOT_LOGGING success!
  DEPLOY_DETECTION success!
  CMAKE success!
  SHELLCHECK success!
  ACTIONLINT success!
  TESTOWNERS success!
  TEST_HAS_MAIN success!
  CALL_ONCE success!
  ONCE_FLAG success!
  WORKFLOWSYNC success!
  UFMT success!
  COPYRIGHT success!
  BAZEL_LINTER success!
  LINTRUNNER_VERSION success!
  ATEN_CPU_GPU_AGNOSTIC success!
  MERGE_CONFLICTLESS_CSV success!
  RUFF success!
ok No lint issues.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129023
Approved by: https://github.com/ezyang
2024-06-28 05:40:31 +00:00
b019f38fdd [inductor] Fix pattern replacements with multiple users (#129689)
Fixes #129685

After matching a pattern, we currently try to remove all the nodes of that
pattern, which doesn't work if any intermediate node has users outside of the
pattern. In that case, we can't delete those particular nodes.
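
A toy sketch of the rule, assuming a graph described only by a "node -> users" mapping. `erase_matched_nodes` and the node names are hypothetical; this is not Inductor's pattern matcher.

```python
def erase_matched_nodes(users, matched_in_topo_order):
    """Erase matched nodes that no longer have any users.

    `users` maps each node name to the set of node names that consume it,
    after the users of the pattern output were redirected to the replacement.
    """
    live_users = {node: set(consumers) for node, consumers in users.items()}
    erased = []
    # Walk backwards so a node is visited after everything that consumes it.
    for node in reversed(matched_in_topo_order):
        if live_users[node]:
            continue  # still used outside the erased set, so keep it
        erased.append(node)
        for consumers in live_users.values():
            consumers.discard(node)
    return erased


# mul -> add -> relu is the matched pattern; "extra" also consumes mul.
users = {"mul": {"add", "extra"}, "add": {"relu"}, "relu": set(), "extra": set()}
erased = erase_matched_nodes(users, ["mul", "add", "relu"])
```

Here `mul` survives because `extra` still reads it, while `add` and `relu` are removed.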

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129689
Approved by: https://github.com/shunting314
2024-06-28 05:16:17 +00:00
eqy
7854d84acb [cuDNN][SDPA] Bail out of dispatching to cuDNN for head dim > 128 on Ampere (#129587)
Fix for #129579

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129587
Approved by: https://github.com/Skylion007, https://github.com/drisspg
2024-06-28 04:42:45 +00:00
8d4216af8c Fix compile error with Intel oneAPI compiler (#129589)
I am building PyTorch with the Intel oneAPI 2024.0.0 compiler, and encountered this compile error:
```
[ 85%] Building CXX object caffe2/CMakeFiles/cpu_rng_test.dir/__/aten/src/ATen/test/cpu_rng_test.cpp.o
In file included from /home/src/pytorch/aten/src/ATen/test/cpu_rng_test.cpp:2:
/home/src/pytorch/aten/src/ATen/test/rng_test.h:119:41: error: loop variable 'to' creates a copy from type 'const ::std::optional<int64_t>' (aka 'const optional<long>') [-Werror,-Wrange-loop-construct]
  119 |     for (const ::std::optional<int64_t> to : tos) {
      |                                         ^
/home/src/pytorch/aten/src/ATen/test/rng_test.h:119:10: note: use reference type 'const ::std::optional<int64_t> &' (aka 'const optional<long> &') to prevent copying
  119 |     for (const ::std::optional<int64_t> to : tos) {
      |          ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                         &
1 error generated.
```

This change makes the compiler happy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129589
Approved by: https://github.com/colesbury
2024-06-28 02:35:10 +00:00
4b8a5e0374 [export] make with_effect mark op has_effect to prevent them from DCEed. (#129680)
Before the PR, custom ops that don't return outputs will get eliminated after calling `.module()` because the effect_token that keeps the operator alive is removed in the remove_effect_token pass. The reason why we want to remove_effect_token is that we don't want the token to be part of the input. However, this causes the DCE calls in remove_effect_token itself and the DCE calls in unlift to remove the custom op from the graph, causing an error in the exported graph.

This PR calls has_side_effect in with_effect to make sure graph.eliminate_dead_code doesn't remove the calls by accident.

Test Plan:
Add a new test pytest test/export/test_torchbind.py -k test_export_inplace_custom_op

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129680
Approved by: https://github.com/angelayi
2024-06-28 02:22:30 +00:00
4b598d87d3 Fix FindBLAS.cmake (#129713)
Fixes regression introduced by https://github.com/pytorch/pytorch/pull/125227 by adding `INCLUDE(CheckFunctionExists)` that fixes
```
CMake Error at cmake/Modules/FindBLAS.cmake:413 (check_function_exists):
  Unknown CMake command "check_function_exists".
```

Fixes https://github.com/pytorch/pytorch/issues/129693

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129713
Approved by: https://github.com/kit1980
2024-06-28 02:15:16 +00:00
b9d3cedd64 [Inductor] FlexAttention supports block sparse mask (#129216)
Benchmark script (causal mask): https://gist.github.com/yanboliang/c2010a1fd081d4e8ca94fadec9eef286
Initial perf number:
* fwd speedup: 0.44 -> 0.72
* bwd speedup: 0.38 -> 0.71

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129216
Approved by: https://github.com/Chillee
2024-06-28 01:32:54 +00:00
c07a799ed5 [Traceable FSDP2] Add Dynamo support for run_with_rng_state HOP (#127247)
Test command:
`pytest -rA test/inductor/test_compiled_autograd.py::TestCompiledAutograd::test_trace_run_with_rng_state`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127247
Approved by: https://github.com/bdhirsh
ghstack dependencies: #129502
2024-06-28 01:04:49 +00:00
36b9d9cfcd [Inductor UT] Generalize device-bias code in newly added UT test_scatter_optimization.py (#129622)
[Inductor UT] Generalize device-bias code in newly added UT test_scatter_optimization.py and test_torchinductor_dynamic_shapes.py
Fixes issues #129624, #129642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129622
Approved by: https://github.com/EikanWang, https://github.com/peterbell10
2024-06-28 01:04:21 +00:00
deaab33f3f [custom op] add error message (#129417)
Fixes [#129370](https://github.com/pytorch/pytorch/issues/129370)

Suggest the correct List type annotation when the input is a Tuple type. To avoid confusion, we only suggest a type if the type is supported.

Example:
Tuple[int, int] -> List[int]
Tuple[Tensor, Tensor, Optional[Tensor]] -> List[Optional[Tensor]]
Tuple[int, ...] -> List[int]
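
The mapping in the examples above can be sketched with the standard `typing` helpers. `suggest_list_annotation`, `_strip_optional` and the placeholder `Tensor` class are hypothetical and are not the code used by `infer_schema`.

```python
from typing import List, Optional, Tuple, Union, get_args, get_origin


class Tensor:
    """Placeholder standing in for torch.Tensor."""


def _strip_optional(tp):
    """Return (inner_type, True) for Optional[T]; (tp, False) otherwise."""
    if get_origin(tp) is Union:
        args = get_args(tp)
        inner = [a for a in args if a is not type(None)]
        if len(args) == 2 and len(inner) == 1:
            return inner[0], True
    return tp, False


def suggest_list_annotation(tuple_type):
    """Suggest a List[...] annotation for a Tuple[...] one, or None if unsure."""
    elems = [a for a in get_args(tuple_type) if a is not Ellipsis]
    if not elems:
        return None
    stripped = [_strip_optional(a) for a in elems]
    base_types = {base for base, _ in stripped}
    if len(base_types) != 1:
        return None  # mixed element types: no single List type fits
    (base,) = base_types
    if any(is_optional for _, is_optional in stripped):
        return List[Optional[base]]
    return List[base]
```

Returning `None` for mixed element types mirrors the stated goal of only suggesting a type when one is supported.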

ValueError: infer_schema(func): Parameter y has unsupported type typing.Tuple[torch.Tensor, torch.Tensor, typing.Optional[torch.Tensor]]. Tuple type annotation is not supported. Please try to use a List instead. For example, typing.List[typing.Optional[torch.Tensor]].
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129417
Approved by: https://github.com/zou3519
2024-06-28 01:03:14 +00:00
8ba0f6c7c2 Revert "[nn-module] Use standard dict for _parameters, _modules and _buffers (#129164)"
This reverts commit f2840bb22079a6952c61446a3d0dfc12f6452852.

Reverted https://github.com/pytorch/pytorch/pull/129164 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break some internal dper3 tests ([comment](https://github.com/pytorch/pytorch/pull/129164#issuecomment-2195888838))
2024-06-28 00:49:39 +00:00
9e1f3ecaa7 [BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)
Changes, in the order they were applied:

1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.

    `.parent{...}.absolute()` -> `.absolute().parent{...}`

4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)

    `.parent.parent.parent.parent` -> `.parents[3]`

5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~

    ~`.parents[3]` -> `.parents[4 - 1]`~

6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~
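
A small sketch of the rewrites that were kept, using pure POSIX paths so it behaves the same on any platform. The file path is hypothetical.

```python
import posixpath
from pathlib import PurePosixPath

# Hypothetical file path, used only for illustration.
path = "/repo/tools/stats/upload/script.py"
p = PurePosixPath(path)

# Steps 1-2: nested dirname calls versus chained .parent
nested_dirname = posixpath.dirname(posixpath.dirname(path))
chained_parent = str(p.parent.parent)

# Step 4: a chain of N .parent accesses equals .parents[N - 1]
four_parents = p.parent.parent.parent.parent
indexed = p.parents[3]

# Step 3: why the path is made absolute first. On a relative path,
# .parent stops at "." instead of walking up the directory tree.
relative_parent = PurePosixPath("script.py").parent.parent
```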

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-06-28 00:35:15 +00:00
d4b6ff6fbe Disable llm-td step (#129722)
As it often fails during conda install step with `Unexpected HTTP response: 429`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129722
Approved by: https://github.com/kit1980, https://github.com/clee2000
2024-06-28 00:12:32 +00:00
0ffb17547e [Simple FSDP] Add unit test for torch.compile + reparameterization + SAC (#129641)
This can reproduce the error in https://github.com/pytorch/pytorch/issues/129684. Adding a unit test so that we hold the line for torch.compile + reparameterization + SAC to always be working, to pave the path for Tianyu's intern's project.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129641
Approved by: https://github.com/tianyu-l
2024-06-28 00:00:36 +00:00
169b4ca07e add uuid in cudaDeviceProperties (#125083)
Replaces #99967.

Fixes #99903.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125083
Approved by: https://github.com/pruthvistony, https://github.com/albanD, https://github.com/eqy, https://github.com/malfet
2024-06-27 23:53:13 +00:00
cyy
fb5888c719 Remove unused type traits in torch/csrc/utils (#128799)
Follows  #127852

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128799
Approved by: https://github.com/ezyang
2024-06-27 23:51:18 +00:00
3fc279633b [ATen] Make argsort.stable CompositeImplicitAutograd (#129529)
It literally just calls `at::sort` and returns the indices, so is composite compliant.
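
In pure Python the same composition looks like the sketch below: a stable sort is performed and only the indices are returned. `stable_argsort` is a hypothetical helper, not the ATen kernel.

```python
def stable_argsort(values):
    """Sort (index, value) pairs by value with a stable sort, keep the indices."""
    return [index for index, _ in sorted(enumerate(values), key=lambda pair: pair[1])]


indices = stable_argsort([3, 1, 2, 1])
```

Equal values keep their original relative order, so the two `1` entries come back as indices 1 and 3 in that order.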

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129529
Approved by: https://github.com/lezcano
2024-06-27 23:49:16 +00:00
7cf0b90e49 [BE] enable UFMT in torch.utils.data (#127705)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127705
Approved by: https://github.com/ezyang
ghstack dependencies: #127706, #127704
2024-06-27 23:16:24 +00:00
f911957573 [BE] sort imports in torch.utils.data (#127704)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127704
Approved by: https://github.com/ezyang
ghstack dependencies: #127706
2024-06-27 23:16:24 +00:00
d80939e5e9 [BE] enable UFMT for torch/storage.py (#127706)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127706
Approved by: https://github.com/ezyang
2024-06-27 23:16:24 +00:00
67416a2996 [c10d] Introduce a util for detecting DMA connectivity among devices (#129510)
This PR introduces `_detect_dma_connectivity` - a utility for detecting DMA connectivity among devices.

The "DMA connectivity" in this context is more stringent than the ability to perform memory copy without CPU involvement. We define it as the ability for a device to issue load/store instructions and perform atomic operations on memory that resides on connected devices. The ability translates to the ability to run most aten GPU operations with operands backed by remote memory. `_detect_dma_connectivity` can help PyTorch and its users to determine whether certain DMA-based optimizations are possible.

`_detect_dma_connectivity` takes a `(device_type, connection_type)` pair and returns a matrix describing the connectivity. Connectivity detectors are statically registered on a `(device_type, connection_type)` basis. This PR implements the detector for `(CUDA, "nvlink")`. Later, detectors for pairs such as `(ROCM, "infinity_fabric")` can be introduced.

Example:

```python3
>>> from torch._C._autograd import DeviceType
>>> from torch._C._distributed_c10d import _detect_dma_connectivity
>>> connectivity = _detect_dma_connectivity(DeviceType.CUDA, "nvlink")
>>> for row in connectivity.matrix:
...     print(row)
...
[0, 18, 18, 18, 18, 18, 18, 18]
[18, 0, 18, 18, 18, 18, 18, 18]
[18, 18, 0, 18, 18, 18, 18, 18]
[18, 18, 18, 0, 18, 18, 18, 18]
[18, 18, 18, 18, 0, 18, 18, 18]
[18, 18, 18, 18, 18, 0, 18, 18]
[18, 18, 18, 18, 18, 18, 0, 18]
[18, 18, 18, 18, 18, 18, 18, 0]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129510
Approved by: https://github.com/weifengpy
2024-06-27 23:02:07 +00:00
305ba62906 Add support to GradScaler for respecting an already set grad_scale value (#123429)
Fixes #123428

Co-authored-by: Yousuf Mohamed-Ahmed <youmed.tech@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123429
Approved by: https://github.com/ezyang
2024-06-27 22:40:54 +00:00
83a4a8b510 [C10D] clean up pointless 'or None' clause (#129522)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129522
Approved by: https://github.com/awgu
2024-06-27 22:40:11 +00:00
5e7ac69a67 [Dynamic Shapes] fixed dynamic shape inference (#128807)
Made a dynamic dimension that is indirectly bound to an integer constrained.
After each ShapeEnv._refine_ranges call, check whether the new ValueRange is a singleton; if it is, replace the symbol.
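
The check can be pictured with a toy sketch in which ranges are `(lower, upper)` tuples keyed by symbol name. `refine_and_replace` is a hypothetical helper and does not reflect the actual `ShapeEnv` code.

```python
def refine_and_replace(ranges, replacements, symbol, lower, upper):
    """Intersect the symbol's range with [lower, upper]; pin it if one value is left."""
    old_lower, old_upper = ranges[symbol]
    new_lower, new_upper = max(old_lower, lower), min(old_upper, upper)
    ranges[symbol] = (new_lower, new_upper)
    if new_lower == new_upper:
        # Singleton range: the symbol can only be this integer.
        replacements[symbol] = new_lower


ranges = {"s0": (2, 100)}
replacements = {}
refine_and_replace(ranges, replacements, "s0", 5, 100)  # s0 >= 5
pinned_after_first = dict(replacements)
refine_and_replace(ranges, replacements, "s0", 2, 5)    # s0 <= 5
```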

Fixes #122307

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128807
Approved by: https://github.com/ezyang
2024-06-27 22:33:32 +00:00
b8398b771c Upload test stats when workflow regardless of conclusion (#129694)
Always upload test stats for a workflow so that we can get status for cancelled workflows (especially ones that were cancelled manually)

There aren't that many workflow conclusions, so we might as well always run it, and we can see what happens

Undoes [this old PR](https://togithub.com/pytorch/pytorch/pull/79180)

Notable pitfalls from the above:
Might cause noise if things can't be downloaded, but since this workflow doesn't show up on PRs, I think it's ok to slowly deal with what comes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129694
Approved by: https://github.com/huydhn
2024-06-27 21:12:21 +00:00
1d0efedc85 [Profiler] Add TSC Clock Callback to CUPTI (#125036)
Summary:
Right now we use the default clock for CUPTI, which is neither monotonic nor particularly fast. We have already added the Kineto side of the implementation here: https://www.internalfb.com/diff/D56525885

This diff only adds the compile flags such that the TSC format is used and sets the converter using a libkineto call in the profiler

Test Plan:
Obtained following trace using resnet test:
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Apr_25_11_03_18.3862943.pt.trace.json.gz&bucket=gpu_traces

TBD: Add benchmarks

Differential Revision: D56584521

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125036
Approved by: https://github.com/aaronenyeshi
2024-06-27 21:07:43 +00:00
602b5cb218 [inductor] switch HalideCodeCache to new cpp_builder. (#129441)
The original PRs were damaged by conflicts and rebases: https://github.com/pytorch/pytorch/pull/128303, https://github.com/pytorch/pytorch/pull/129144

This PR just switches `HalideCodeCache` to the new cpp_builder and is not `fb_code` related. It can be merged without the `fb_code` test.
Let's land this change first.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129441
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-27 20:50:13 +00:00
39427288f4 Taskify training IR + run_decomp flow failures (#129547)
Differential Revision: [D59069088](https://our.internmc.facebook.com/intern/diff/D59069088)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129547
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #128077, #129092, #129249
2024-06-27 20:43:22 +00:00
23adf166e1 [cond] inlining into one of the branches when pred is a python constant (#128709)
When the input predicate is a python constant, we specialize into one of the branches and warn users that torch.cond is not preserving the dynamism. The previous behavior is that we baked in True/False in the cond operator. This can be confusing. In this PR, we change it to be specializing into one of the branches when the inputs are constants.

We additionally change the naming of cond operator to default one without overriding its name. This allows better testing on de-serialized graph.

Test Plan:
The predicate in some existing tests is the result of a shape comparison. When no dynamic shape is involved, the predicate is a python bool. To fix them, we either change the predicate to be some data-dependent tensor or change the test to check that cond is specialized as one of the branches.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128709
Approved by: https://github.com/zou3519
2024-06-27 20:28:50 +00:00
71f5ecd1ee Fixed Memory Leaks in tests (#129640)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129640
Approved by: https://github.com/clee2000
ghstack dependencies: #129400
2024-06-27 20:26:21 +00:00
dabaebd339 Make run_decomp work (#129249)
In this PR, we implement the first version of training_ir.run_decomp functionality. Since we don't return the modified buffers as extra output in training IR, our previous strategy of reusing graph signature won't work. In fact, this run_decomp is more similar to retracing. So I reuse some of the export steps here. After this PR:
export_for_training().run_decomp({}, _preserve_ops=[all 183 ops]) == export_for_predispatch() - autograd_manipulating_ops.

Differential Revision: [D59069090](https://our.internmc.facebook.com/intern/diff/D59069090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129249
Approved by: https://github.com/zhxchen17
ghstack dependencies: #128077, #129092
2024-06-27 19:16:07 +00:00
ec284d3a74 Prototype for export_for_training (#129092)
This PR implements export_for_training where the IR is not-functional, pre-dispatch aten IR. The general strategy:
1. Call dynamo to get torch IR
2. Lift param/buffer
3. call make_fx

TODO:
1. run_decomp doesn't work
2. not-strict is not supported

Differential Revision: [D59069087](https://our.internmc.facebook.com/intern/diff/D59069087)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129092
Approved by: https://github.com/zhxchen17
ghstack dependencies: #128077
2024-06-27 18:27:11 +00:00
4dcc1ceff3 [dynamo] Fakify result of delegate (#128752)
Summary: Somehow the delegate returns a real tensor result even though we pass in fake tensors. So here we need to convert the result to fake.

Test Plan: `buck2 run @//mode/dev-nosan //on_device_ai/helios/multi_zion:multi_zion_test -- -r test_single_delegate_dsp_only`

Differential Revision: D58617091

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128752
Approved by: https://github.com/ydwu4
2024-06-27 17:59:52 +00:00
389492e264 Fix runner determinator bug (#129612)
Currently the runner determinator is buggy and doesn't let anyone's workflows run against the LF runners (it prefixes a "@" to the user names in the issue instead of either stripping it or prefixing it to the incoming names)

This PR fixes the bug so that people opted in to using LF runners can actually use them. It also puts the python code back into the repo.  Even though the code isn't directly invoked, having it there makes testing and linting easier/possible

Also includes lint fixes

Note: if you just review the .yml file you'll see all the relevant diffs
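
The name-matching problem can be sketched as follows. This is an illustration of the "strip the `@`" option only, not the code in `runner_determinator.py`; the helper names `normalize_username` and `is_opted_in` are hypothetical.

```python
def normalize_username(name: str) -> str:
    """Drop surrounding whitespace and any leading '@' from a user name."""
    return name.strip().lstrip("@")


def is_opted_in(actor: str, issue_lines: list[str]) -> bool:
    """Check whether actor appears in the opt-in list taken from the issue body."""
    # Normalize both sides so "@ZainRizvi" in the issue matches "ZainRizvi".
    opted_in = {normalize_username(line) for line in issue_lines if line.strip()}
    return normalize_username(actor) in opted_in


enabled = is_opted_in("ZainRizvi", ["@ZainRizvi", "@someone-else"])
```

Normalizing both the issue entries and the incoming names avoids having to care which side carries the prefix.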

### Testing:
#### Before
```
python .github/scripts/runner_determinator.py --github-token $GH_KEY --github-issue 5132 --github-actor ZainRizvi --github-issue-owner ZainRizvi --github-branch foo
{"label_type": "", "message": "LF Workflows are disabled for ZainRizvi, ZainRizvi. Using meta runners."}
```

#### After
```
python .github/scripts/runner_determinator.py --github-token $GH_KEY --github-issue 5132 --github-actor ZainRizvi --github-issue-owner ZainRizvi --github-branch foo
{"label_type": "lf.", "message": "LF Workflows are enabled for ZainRizvi, ZainRizvi. Using LF runners."}
```

Aside: updated test case after rebase:
```
python .github/scripts/runner_determinator.py --github-token $GH_KEY --github-issue 5132 --github-actor ZainRizvi --github-issue-owner ZainRizvi2 --github-branch foo  --github-repo python/pythonss --github-ref-type branch
{"label_type": "lf.", "message": "LF Workflows are enabled for ZainRizvi. Using LF runners."}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129612
Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt
2024-06-27 17:51:09 +00:00
a4d7aa498b [Traceable FSDP2] Add auto-functionalize support for mutable list[Tensor] (copy from Brian's PR #127347); enable E2E inductor unit test for transformer model (#129502)
Copy of Brian's PR: https://github.com/pytorch/pytorch/pull/127347 with additional changes to support mutable `List[Tensor]` in Inductor. Also enable E2E inductor unit test for Traceable FSDP2 + transformer model.

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_trace_fsdp_set_`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_simple_mlp_fullgraph_backend_aot_eager`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_simple_mlp_fullgraph_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_fullgraph_backend_aot_eager`
- `pytest -rA test/dynamo/test_misc.py::MiscTests::test_auto_functionalize_tensorlist`
- `pytest -rA  test/inductor/test_torchinductor.py::GPUTests::test_fallback_mutable_op_list_cuda`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129502
Approved by: https://github.com/zou3519
2024-06-27 17:50:57 +00:00
9174d14551 Don't install remaining caffe2 python files (#129067)
It is assumed that they are no longer needed.
And keeping their installation as is breaks
"python setup.py develop --user" workflow
when non-root user is used.

This change is follow up for 3d617333e700
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129067
Approved by: https://github.com/cyyever, https://github.com/r-barnes
2024-06-27 17:25:59 +00:00
e0bba37d66 [codemod] Add [[noreturn]] to 2 files inc caffe2/c10/util/TypeCast.cpp (#129575)
Summary: LLVM-15 has a warning `-Wno-return` which can be used to identify functions that do not return. Qualifying these functions with `[[noreturn]]` is a perf optimization.

Test Plan: Sandcastle

Reviewed By: dmm-fb

Differential Revision: D59003594

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129575
Approved by: https://github.com/Skylion007
2024-06-27 17:23:22 +00:00
321bdcb372 Fix device propagation for checkpointing (#128671)
Fixes: #128478

In the backward() implementation, the checkpointing code was querying the device type from the rng_state tensors saved in forward(). These tensors are CPU-only and don't carry device information. As a result, CUDA was assumed as the default device, which is not correct if the user runs on some other device, for example XPU.

This patch saves full device information in forward() and uses it in backward() to get the device type. Previously, forward() saved only the device index.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128671
Approved by: https://github.com/guangyey, https://github.com/soulitzer
2024-06-27 17:14:13 +00:00
04206d1898 TunableOp hotfix, unit test follow-up (#129606)
PR #129281 was landed to fix critical issues but did not contain unit tests to exercise those issues.  This is a follow-up set of unit tests that would exercise the problems seen previously.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129606
Approved by: https://github.com/atalman
2024-06-27 17:01:04 +00:00
5c6af2b583 [cpu] Fix div with rounding_mode="floor" when division overflows (#129536)
Fixes #77742

`Sleef_fmod` returns NaN when the division overflows, where `libm` returns 0. In this narrow case we can drop the `fmod` from the calculation entirely.
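
A rough scalar sketch of the idea, assuming finite inputs and `b != 0`. The function `div_floor` and its pluggable `fmod` parameter are hypothetical and only mimic a floor-division routine built on `fmod`; this is not the vectorized CPU kernel.

```python
import math


def div_floor(a: float, b: float, fmod=math.fmod) -> float:
    """Floor division built on fmod; assumes finite a and b with b != 0."""
    quotient = a / b
    if math.isinf(quotient):
        # The division overflowed: skip fmod entirely, since some
        # implementations return NaN here, and return the infinity.
        return quotient
    mod = fmod(a, b)
    div = (a - mod) / b
    # fmod takes the sign of a; step down when mod and b disagree in sign.
    if mod != 0 and (b < 0) != (mod < 0):
        div -= 1.0
    if div == 0:
        return math.copysign(0.0, quotient)
    floordiv = float(math.floor(div))
    if div - floordiv > 0.5:
        floordiv += 1.0
    return floordiv


def nan_fmod(a: float, b: float) -> float:
    """Hypothetical fmod that always yields NaN, to show it is never consulted."""
    return float("nan")


overflow_result = div_floor(1e308, 1e-308, fmod=nan_fmod)
```

Because the overflow branch returns before `fmod` is called, a NaN-producing `fmod` cannot leak into the result.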

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129536
Approved by: https://github.com/lezcano
2024-06-27 16:50:47 +00:00
5ceba6a3cb Revert "[Inductor] FlexAttention supports block sparse mask (#129216)"
This reverts commit 4082759925a712b7cb340164d3da3a1dab372d9f.

Reverted https://github.com/pytorch/pytorch/pull/129216 on behalf of https://github.com/clee2000 due to broke functorch/aot_dispatch and test_proxy_tensor on windows https://github.com/pytorch/pytorch/actions/runs/9691331440/job/26743164471 4082759925 missed on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/129216#issuecomment-2195087274))
2024-06-27 15:57:52 +00:00
82c8fc3a2b [inductor] Add size_hint to conv dilation (#129631)
Summary: [Here](ea588d7fd3/torch/_inductor/kernel/conv.py (L252)) in the `conv` lowering `dilation` is not `size_hint`-ed. This breaks if `dilation` is a symbolic expression (which we see in some internal models). The PR fixes it by adding a `size_hints`.

Test Plan:
```
$ python test/inductor/test_torchinductor.py -k test_convolution5
...
----------------------------------------------------------------------
Ran 2 tests in 7.329s

OK
```

Differential Revision: D59097019

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129631
Approved by: https://github.com/chenyang78
2024-06-27 15:27:57 +00:00
483dbfcf2a [BE] Correctly catch skip signals emitting from sys.exit (#129581)
Some tests in test_c10d_nccl.py override `_join_process()` and `_check_return_codes()`, which causes the skip signals not to be caught properly. This PR fixes the issue.

Differential Revision: [D59067457](https://our.internmc.facebook.com/intern/diff/D59067457/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129581
Approved by: https://github.com/fduwjj
2024-06-27 15:12:51 +00:00
2d9012ad25 Forward fix internal pyre failure from D58983461 (#129525)
Summary: Somehow, using underscore alias of some builtin types breaks pyre

Test Plan:
All failed tests from D58983461 are passing:

```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/training_toolkit/utils/tests:gpu_memory_utils_test-type-checking
buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/lib:device_util-type-checking
buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/lib:thompson_samplers_gpu-type-checking
buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/modules/retrieval/diversity/tests:combined_sampling_diversifier_test-type-checking
buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/modules/retrieval/diversity/tests:submodular_opt_test-type-checking
```

Differential Revision: D59029768

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129525
Approved by: https://github.com/XuehaiPan, https://github.com/clee2000, https://github.com/malfet
2024-06-27 14:41:20 +00:00
0680e6cd1c [Profiler] Add sraikund16 to profiler paths in CODEOWNERS (#129591)
Summary: Add Shivam to the list of code owners for the profiler code paths, so that Shivam gets added to reviewers for PRs too.

Test Plan: CI

Differential Revision: D59072152

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129591
Approved by: https://github.com/sraikund16
2024-06-27 14:22:09 +00:00
ad607b91f4 [dynamo][onnx] Skip some dynamic=True test with inlining in built nn modules (#129610)
These tests fail with dynamic=True when inlining in built nn modules. There are a few more recompilations. Since `dynamic=True` is not a recommended usage, I am skipping these tests for now. This is the tracking issue to come back later and fix/update these tests - https://github.com/pytorch/pytorch/issues/129456
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129610
Approved by: https://github.com/yanboliang
ghstack dependencies: #129583
2024-06-27 10:56:24 +00:00
a028e5862d [profiler] Directly use end_ns to create the FunctionEvent instead of using start_ns + duration_ns in pytorch profiler post processing for checking parent-child precisely (#129554)
Use the raw end_ns directly, instead of the sum of start_ns and duration_ns, in order to avoid negative CPU time in profiler.
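
One way separately derived end times can break parent-child containment is sketched below. The rounding scheme in `ns_to_us` is an assumption made for illustration; the commit message does not say how the profiler converts units.

```python
def ns_to_us(ns: int) -> int:
    """Round nanoseconds to the nearest microsecond (assumed conversion)."""
    return (ns + 500) // 1000


def interval_from_duration(start_ns: int, end_ns: int) -> tuple[int, int]:
    """End time derived as rounded start plus rounded duration."""
    start_us = ns_to_us(start_ns)
    return (start_us, start_us + ns_to_us(end_ns - start_ns))


def interval_from_end(start_ns: int, end_ns: int) -> tuple[int, int]:
    """End time taken directly from the raw end timestamp."""
    return (ns_to_us(start_ns), ns_to_us(end_ns))


def contains(parent: tuple[int, int], child: tuple[int, int]) -> bool:
    return parent[0] <= child[0] and child[1] <= parent[1]


parent_ns = (1400, 2800)
child_ns = (1600, 2800)  # the child ends exactly when the parent ends

derived_ok = contains(interval_from_duration(*parent_ns), interval_from_duration(*child_ns))
raw_ok = contains(interval_from_end(*parent_ns), interval_from_end(*child_ns))
```

In this example the derived intervals place the child's end after the parent's end, while the intervals built from the raw end timestamps keep the child inside the parent.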

Fix https://github.com/pytorch/pytorch/issues/101861

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129554
Approved by: https://github.com/gujinghui, https://github.com/aaronenyeshi
2024-06-27 10:46:05 +00:00
ff026f3d0a Fix an issue in meta_scaled_mm (#129521)
Summary:
To fix the following failure cases:

For example, when `M, K, N = 245760, 656, 6560`, fp8 with compile fails due to `RuntimeError: mat2 must be col_major`.

---------
From the inductor generated code (https://fburl.com/everpaste/epcagkrd)
```
V0625 01:38:55.551000 140329914449920 torch/_inductor/scheduler.py:1623] [0/0] scheduling ComputedBuffer(name='buf12', layout=FixedLayout('cuda', torch.float8_e4m3fn, size=[656, 6560], stride=[6656, 1]),
... ...
V0625 01:38:56.194000 140329914449920 torch/_inductor/graph.py:1680] [0/0] [__output_code]         buf12 = empty_strided_cuda((656, 6560), (6656, 1), torch.float8_e4m3fn)
... ...
V0625 01:38:56.194000 140329914449920 torch/_inductor/graph.py:1680] [0/0] [__output_code]     return (buf10, buf2, buf5, buf6, reinterpret_tensor(buf11, (245760, 656), (1, 245760), 0), reinterpret_tensor(buf12, (6560, 656), (1, 6656), 0), )
... ...
V0625 01:39:12.098000 140312968167424 torch/_inductor/graph.py:1680] [1/0_1] [__output_code]     assert_size_stride(permute_10, (6560, 656), (1, 6656))
... ...
V0625 01:39:12.098000 140312968167424 torch/_inductor/graph.py:1680] [1/0_1] [__output_code]         buf8 = aten._scaled_mm.default(buf6, permute_10, buf7, reciprocal_3, None, None, torch.bfloat16)
```

Inductor gives the mat2 (`permute_10`) a different stride (`6656`) instead of using its shape[0] (`(6560, 656)`).

Therefore, the `stride[1] == shape[0]` condition fails.

To fix the issue, simply modify the `is_col_major` check to exclude this condition as it doesn't hold for all valid cases.
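
The failing check can be reproduced with plain tuples. This is a sketch under assumptions: the exact form of the original and updated `is_col_major` checks is not shown in this message, so both functions below are hypothetical reconstructions, and the relaxed version simply drops the `stride[1] == shape[0]` condition.

```python
def is_col_major_strict(shape: tuple[int, int], stride: tuple[int, int]) -> bool:
    """Assumed original check: requires a tightly packed column-major layout."""
    return stride[0] == 1 and stride[1] == shape[0]


def is_col_major_relaxed(shape: tuple[int, int], stride: tuple[int, int]) -> bool:
    """Assumed relaxed check: tolerates a padded leading dimension."""
    return stride[0] == 1


# Values taken from the inductor output above for `permute_10`.
shape, stride = (6560, 656), (1, 6656)
strict = is_col_major_strict(shape, stride)
relaxed = is_col_major_relaxed(shape, stride)
```

The padded stride of 6656 makes the strict check reject a layout that is still column-major in the sense that elements within a column are contiguous.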

Test Plan:
Run the failed case again. It works with the fix.
-----
Sandcastle / GitHub CI will make sure the existing tests could still pass.

Reviewed By: vkuzo

Differential Revision: D58994704

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129521
Approved by: https://github.com/drisspg
2024-06-27 07:03:34 +00:00
9f29a2291c Feat: Updated torch.nn.Modules.set_submodules() (#127714)
modified:   torch/nn/modules/module.py

Implemented feature request by #127712.
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127714
Approved by: https://github.com/mikaylagawarecki
2024-06-27 06:38:54 +00:00
c9798d123b [dynamo][compile-time] Manually trace torch.nn.Module.parameters (#129583)
With this PR, we are not worse than no-inlining for Dynamo-only compilation time (there is a little bit of noise, so the outlier of 0.89 is probably ok here). For most of the models, we see positive numbers because of better caching in `UserDefinedObjectVariable`.

![image](https://github.com/pytorch/pytorch/assets/13822661/719d34fd-3e7f-4886-b7e0-1dbfc7141aa5)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129583
Approved by: https://github.com/jansel
2024-06-27 06:06:04 +00:00
cf392d8a89 [pytorch][cuda] Generate kernels for 5x5 filters on depth wise convolution backward (#129609)
In #125362 we improved the default implementation of depth wise convolution 2D forward pass by precomputing boundaries of accessed slices instead of doing expensive edge checks in the inner loops. We also generated kernels for 5x5 filters as this is a common problem size.

In this PR we tried to apply the same strategy to the backward kernel, but we only saw good gains from generating code for 5x5 filters. We could also write a fallback implementation that precomputes access boundaries when filter size and stride are not known at compile time. That may bring some speedup, but the kernel would very rarely be called.

This PR also hints the thread count at compile time and leaves only the unroll directive that seems to help performance.
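
The boundary-precomputation strategy from #125362 can be illustrated in one dimension. This is a plain-Python sketch with hypothetical function names, not the CUDA kernel; it only shows how per-element edge checks in the inner loop are replaced by a precomputed index range.

```python
def conv1d_edge_checks(x, w, pad):
    """Zero-padded 1D correlation that tests bounds for every filter tap."""
    n, k = len(x), len(w)
    out = []
    for o in range(n + 2 * pad - k + 1):
        acc = 0
        for j in range(k):
            i = o - pad + j
            if 0 <= i < n:  # edge check inside the inner loop
                acc += x[i] * w[j]
        out.append(acc)
    return out


def conv1d_precomputed(x, w, pad):
    """Same result, but the valid tap range is computed once per output."""
    n, k = len(x), len(w)
    out = []
    for o in range(n + 2 * pad - k + 1):
        base = o - pad
        j_start = max(0, -base)    # first tap that lands inside x
        j_end = min(k, n - base)   # one past the last tap inside x
        acc = 0
        for j in range(j_start, j_end):  # no bounds test needed here
            acc += x[base + j] * w[j]
        out.append(acc)
    return out


signal = [1, 2, 3, 4, 5, 6, 7]
taps = [1, 2, 3, 4, 5]
same = conv1d_edge_checks(signal, taps, 2) == conv1d_precomputed(signal, taps, 2)
```

When the filter size is a compile-time constant such as 5, the inner loop can additionally be fully unrolled, which is the specialization this PR generates for the backward kernel.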

Before:

```
         B      C      iH      iW    kH    kW  conv2d-backward (cuda)  conv2d-fp16-backward (cuda)
0      8.0   64.0  1024.0  1008.0   5.0   5.0               89.002686                    26.400480
1      8.0   64.0  1008.0  1008.0   5.0   5.0               88.885025                    25.995296
2      4.0   48.0   720.0   539.0   6.0   1.0                9.488832                     9.091136
3      4.0  120.0   379.0   283.0   6.0   1.0                4.194640                     3.844432
4      4.0   32.0   713.0   532.0   6.0   1.0                8.027296                     7.700064
5      4.0    3.0   712.0   542.0  31.0  31.0               15.618095                    15.097760
6      4.0  120.0   379.0   288.0   1.0   6.0                3.788224                     3.499648
7   1024.0  384.0     1.0   928.0   1.0   3.0               18.988289                    14.152768
8      4.0   24.0   687.0   512.0   6.0   1.0                6.902704                     6.685056
9     96.0   96.0   112.0   112.0   5.0   5.0               15.672400                     4.953984
10    96.0   80.0    56.0    56.0   5.0   5.0                3.261152                     1.250320
11    64.0  128.0    64.0    84.0   3.0   3.0                3.172192                     1.515648
12    16.0  960.0     7.0     7.0   5.0   5.0                0.197024                     0.072736
13    16.0   64.0   112.0   112.0   3.0   3.0                1.126240                     0.650304
```

After
```
conv2d-performance:
         B      C      iH      iW    kH    kW  conv2d-backward (cuda)  conv2d-fp16-backward (cuda)
0      8.0   64.0  1024.0  1008.0   5.0   5.0               76.278656                    26.418720
1      8.0   64.0  1008.0  1008.0   5.0   5.0               73.211617                    26.018433
2      4.0   48.0   720.0   539.0   6.0   1.0                8.901312                     9.322912
3      4.0  120.0   379.0   283.0   6.0   1.0                3.815616                     3.992208
4      4.0   32.0   713.0   532.0   6.0   1.0                7.753024                     8.032433
5      4.0    3.0   712.0   542.0  31.0  31.0               15.244144                    15.277296
6      4.0  120.0   379.0   288.0   1.0   6.0                3.503264                     3.552976
7   1024.0  384.0     1.0   928.0   1.0   3.0               16.682976                    14.167969
8      4.0   24.0   687.0   512.0   6.0   1.0                6.802576                     7.019040
9     96.0   96.0   112.0   112.0   5.0   5.0               12.713024                     4.958656
10    96.0   80.0    56.0    56.0   5.0   5.0                2.648352                     1.254752
11    64.0  128.0    64.0    84.0   3.0   3.0                3.213568                     1.517952
12    16.0  960.0     7.0     7.0   5.0   5.0                0.182208                     0.076256
13    16.0   64.0   112.0   112.0   3.0   3.0                1.139952                     0.652432
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129609
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-06-27 06:01:47 +00:00
4082759925 [Inductor] FlexAttention supports block sparse mask (#129216)
Benchmark script (causal mask): https://gist.github.com/yanboliang/c2010a1fd081d4e8ca94fadec9eef286
Initial perf number:
* fwd speedup: 0.44 -> 0.72
* bwd speedup: 0.38 -> 0.71

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129216
Approved by: https://github.com/Chillee
2024-06-27 05:44:27 +00:00
5ee893a84a Add inductor support for conv3d transpose (#129458)
This PR adds Conv3d Transpose support in inductor. It basically reuses and expands the Conv2d Transpose implementation and unit tests to cover Conv3d Transpose.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129458
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-27 05:27:10 +00:00
9b5b93c58f [CUDA][Inductor][CI] Revert PR#127150 since cu124 is now behaving similar enough to cu121 (#128423)
Pre-requisite: close https://github.com/pytorch/pytorch/issues/126692 first.

This PR also gives a current read on cu121 and cu124 parity.

Essentially reverting #127150

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128423
Approved by: https://github.com/atalman, https://github.com/eqy
2024-06-27 05:22:18 +00:00
ea588d7fd3 [SymmetricMemory] use SCM_RIGHTS socket control message to share exported cumem handle (#129412)
`SymmetricMemory` currently uses the `pidfd_getfd` syscall to share the exported cumem fd among devices. The syscall is introduced in linux kernel 5.6 which is relatively new and not available everywhere.

This PR replaces the use of the `pidfd_getfd` syscall with socket + SCM_RIGHTS control message. The approach is demonstrated in [memMapIPCDrv](https://github.com/NVIDIA/cuda-samples/tree/master/Samples/3_CUDA_Features/memMapIPCDrv) in [cuda-samples](https://github.com/NVIDIA/cuda-samples/tree/master/Samples) (relevant code: https://github.com/NVIDIA/cuda-samples/blob/master/Common/helper_multiprocess.cpp).
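
The SCM_RIGHTS mechanism itself can be demonstrated with Python's standard library (3.9+ on a Unix system). This is a single-process sketch using a pipe in place of the exported cumem handle; the function names are hypothetical, and in real use the two socket ends would live in different processes.

```python
import os
import socket


def pass_fd_over_socket(fd: int) -> int:
    """Send fd across a Unix socket pair and return the receiver's copy."""
    sender, receiver = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
    try:
        # send_fds wraps sendmsg() with an SCM_RIGHTS ancillary message.
        socket.send_fds(sender, [b"x"], [fd])
        _data, fds, _flags, _addr = socket.recv_fds(receiver, 1, 1)
        return fds[0]
    finally:
        sender.close()
        receiver.close()


def demo() -> bytes:
    """Write through the received descriptor and read from the original pipe."""
    read_end, write_end = os.pipe()
    received = pass_fd_over_socket(write_end)
    try:
        os.write(received, b"hello")
        return os.read(read_end, 5)
    finally:
        for fd in (read_end, write_end, received):
            os.close(fd)
```

The kernel installs a new descriptor in the receiver that refers to the same open file, so no `pidfd_getfd` support is required.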

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129412
Approved by: https://github.com/Chillee
2024-06-27 04:38:13 +00:00
84ad5452f6 [MPS] Fused SGD optimizer (#129350)
```
[-------------------------------------- Fused SGD --------------------------------------]
                                                          |  Fused: True  |  Fused: False
1 threads: ------------------------------------------------------------------------------
      numel: 1024, num_tensors: 100, momentum: True       |        2      |       15
      numel: 1024, num_tensors: 100, momentum: False      |        2      |        5
      numel: 65536, num_tensors: 100, momentum: True      |        3      |       16
      numel: 65536, num_tensors: 100, momentum: False     |        2      |        5
      numel: 1048576, num_tensors: 100, momentum: True    |       11      |       16
      numel: 1048576, num_tensors: 100, momentum: False   |        8      |        6
      numel: 1024, num_tensors: 500, momentum: True       |       29      |       70
      numel: 1024, num_tensors: 500, momentum: False      |       20      |       24
      numel: 65536, num_tensors: 500, momentum: True      |       33      |       76
      numel: 65536, num_tensors: 500, momentum: False     |       22      |       26
      numel: 1048576, num_tensors: 500, momentum: True    |       70      |       80
      numel: 1048576, num_tensors: 500, momentum: False   |       43      |       40
      numel: 1024, num_tensors: 1000, momentum: True      |      108      |      139
      numel: 1024, num_tensors: 1000, momentum: False     |       72      |       48
      numel: 65536, num_tensors: 1000, momentum: True     |      116      |      150
      numel: 65536, num_tensors: 1000, momentum: False    |       77      |       52
      numel: 1048576, num_tensors: 1000, momentum: True   |      190      |      170
      numel: 1048576, num_tensors: 1000, momentum: False  |      120      |       50
```

```python
def profile_fused_sgd():
    import torch
    from torch.optim.sgd import sgd
    import torch.utils.benchmark as benchmark

    import itertools

    def profile(fn, params, grads, momentum_buffer_list, fused):
        fn(
            params,
            grads,
            momentum_buffer_list,
            momentum=True if len(momentum_buffer_list) > 0 else False,
            dampening=0.0,
            nesterov=False,
            foreach=False,
            fused=fused,
            lr=1e-3,
            weight_decay=.0,
            maximize=False,
            grad_scale=None,
            found_inf=None,
        )
        torch.mps.synchronize()

    device = "mps"

    results = []

    for num_tensors, numel, momentum in itertools.product([100, 500, 1000], [1024, 65536, 1048576], [True, False]):
        sublabel = f"numel: {numel}, num_tensors: {num_tensors}, momentum: {momentum}"
        print(sublabel)
        params, grads = [[torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)] for _ in range(2)]
        momentum_buffer_list = [torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)] if momentum else []
        fn = sgd

        for fused in [True, False]:

            t = benchmark.Timer(
                    stmt='profile(fn, params, grads, momentum_buffer_list, fused)',
                    label='Fused SGD',
                    sub_label=sublabel,
                    globals=locals(),
                    description= f"Fused: {fused}",
                ).blocked_autorange(min_run_time=5)
            results.append(t)

    compare = benchmark.Compare(results)
    compare.trim_significant_figures()
    compare.colorize(rowwise=True)
    compare.print()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129350
Approved by: https://github.com/janeyx99
ghstack dependencies: #129006, #129008, #129007, #129105
2024-06-27 04:37:14 +00:00
e19042481b [cuDNN][cuDNN Frontend] Bump cuDNN FE submodule to 1.5.2 (#129592)
Some relevant fixes include stride-0 support 👀

CC @drisspg @Skylion007 @vedaanta

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129592
Approved by: https://github.com/Skylion007
2024-06-27 04:01:23 +00:00
9450e198aa Conversions between strided and jagged layouts for Nested Tensors (#115749)
This PR does 3 things:
1. Adds a copy-free strided->jagged layout conversion for NT
2. Adds a copy-free jagged->strided layout conversion for NT
3. Modifies and expands the .to() API to support the layout argument for the specific case of NT layout conversion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115749
Approved by: https://github.com/jbschlosser
2024-06-27 03:41:28 +00:00
c9ceae3fac Use JK for mast rdzv handler tcpstore handling and additional logging (#129603)
Summary:
Use JK to control the release instead of using env variable to toggle the feature.

Note: sharing the store reduces shutdown races as the TCPStore lifecycle is managed outside of trainer rank execution time.

Test Plan: CI

Differential Revision: D59071544

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129603
Approved by: https://github.com/d4l3k
2024-06-27 03:34:52 +00:00
b9697eacd3 [torchbind] support tensor ops inside of __obj_flatten__ (#129605)
As titled. Previously, __obj_flatten__ could run in a fake tensor mode, e.g. in process_input of aot_autograd, which is surrounded by a fake tensor mode. This causes the tensor ops inside __obj_flatten__ to run under fake tensor mode. However, the tensors inside the script object are real tensors, which causes the fake tensor mode to error out saying that we need to first fakify all the tensors (because allow_non_fake_inputs is set to True).

In this PR, we disable all the dispatch modes when running to_fake_obj.

Note that the output of `__obj_flatten__` will be fakified and filled into the corresponding FakeScriptObject. So during tracing, we'll be using a FakeScriptObject that has fake tensor contents.

Test Plan:
Add a new test: pytest test/export/test_torchbind.py -k test_compile_tensor_op_in_tensor_flatten

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129605
Approved by: https://github.com/angelayi
2024-06-27 03:07:31 +00:00
cdbd6542d0 Fix inductor benchmarks (#129620)
By installing torchao explicitly, as torchao-0.3.0, which was recently released to PyPI, introduced a hard dependency on torch-2.3.1, which results in the following cryptic error: `RuntimeError: operator torchvision::nms does not exist`

TODOs:
 - Figure out what installs torchao from pypi rather than builds from source
 - Add proper CI pin for torchao
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129620
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-06-27 02:59:08 +00:00
27a14405d3 enable device index check for all device types (#126767)
enable device index check for all device types for grad setter.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126767
Approved by: https://github.com/albanD
2024-06-27 01:09:53 +00:00
0b7e8df7d8 [CUDAGraph Trees] Enable input mutation support in OSS (#129184)
Summary: Enable input mutation support for cudagraph trees in OSS.

Test Plan: CI

Differential Revision: D58847850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129184
Approved by: https://github.com/eellison
2024-06-27 00:49:45 +00:00
7bb558fd6e add _flash_attention_forward and _efficient_attention_forward to compute intensive ops in partitioner (#129533)
Avoid recompute of SDPA during the backward.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129533
Approved by: https://github.com/drisspg
2024-06-27 00:49:00 +00:00
b6689e0fb8 [ts migration] add logging as part of torch logging system (#129405)
#### Description
Add more verbose logging of conversion process. Output which IR is being converted, which function is used to do conversion, and whether it succeeds.

#### Example
`TORCH_LOGS="+export,ts2ep_conversion" pytest test/export/test_converter.py -s -k test_prim_tolist`
```
test/export/test_converter.py I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734] TorchScript graph
I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734]
I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734] graph(%x.1 : Long(3, strides=[1], requires_grad=0, device=cpu)):
I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734]   %1 : __torch__.export.test_converter.___torch_mangle_1.Module = prim::CreateObject()
I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734]   %2 : int = prim::Constant[value=1](), scope: export.test_converter.Module::
I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734]   %3 : int = prim::Constant[value=0](), scope: export.test_converter.Module::
I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734]   %4 : int[] = prim::tolist(%x.1, %2, %3), scope: export.test_converter.Module::
I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734]   return (%4)
I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734]
I0624 13:19:26.416000 140608224474112 torch/_export/converter.py:734]
V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert [%1 : __torch__.export.test_converter.___torch_mangle_1.Module = prim::CreateObject()]
V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert using [convert_prim_CreateObject] succeeds
V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert [%2 : int = prim::Constant[value=1](), scope: export.test_converter.Module::]
V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert using [convert_prim_Constant] succeeds
V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert [%3 : int = prim::Constant[value=0](), scope: export.test_converter.Module::]
V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert using [convert_prim_Constant] succeeds
V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert [%4 : int[] = prim::tolist(%x.1, %2, %3), scope: export.test_converter.Module::]
V0624 13:19:26.417000 140608224474112 torch/_export/converter.py:690] Convert using [convert_prim_tolist] succeeds
I0624 13:19:26.427000 140608224474112 torch/_export/converter.py:760] TS2EPConverter IR-to-IR conversion succeeds
```

#### Test Plan
`pytest test/export/test_converter`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129405
Approved by: https://github.com/angelayi
2024-06-27 00:20:20 +00:00
90f6043368 Don't decompose functional composite ops in export inference IR (#128077)
Recently we decided to split export IR into two different IRs (training vs inference). One major change in the inference IR is that we keep the composite ops that the user specified in the IR. This PR does that by overriding the CompositeImplicitAutograd decomp in the export inference path.

Differential Revision: [D58701607](https://our.internmc.facebook.com/intern/diff/D58701607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128077
Approved by: https://github.com/bdhirsh
2024-06-26 23:07:55 +00:00
64f1111d38 Expose nholmann json to torch (#129570)
Summary:

Expose the nlohmann json library so that it can be used from inside PyTorch. The library already exists in the `third_party` directory. This PR makes the `nlohmann/json.hpp` header available to be used from `torch.distributed`.
The next PR makes actual use of this header.

imported-using-ghimport

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D59035246

Pulled By: c-p-i-o

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129570
Approved by: https://github.com/d4l3k, https://github.com/malfet
2024-06-26 21:59:26 +00:00
5ad2ad5921 Update start_, end_ and retired only for the right entry when retire a work (#128948)
Fixes #128805
If the buffer size of NCCLTraceBuffer is 10 and the pg has recorded 11 works, the entry for work 0 will have been overwritten by work 10. So when the watchdog retires work 0, the start_ and end_ of entry 0 shouldn't be set to nullptr.
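
The overwrite scenario can be sketched in a few lines of Python. This is only an illustration, not the C++ code in NCCLTraceBuffer: the `TraceBuffer` and `Entry` classes and their fields are hypothetical, and the sketch assumes a work's slot is its id modulo the buffer size.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Entry:
    # Hypothetical stand-in for one trace entry.
    work_id: int
    start: Optional[str] = "start-event"
    end: Optional[str] = "end-event"
    retired: bool = False


class TraceBuffer:
    """Fixed-size ring buffer; assumes slot = work_id % capacity."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.entries: List[Entry] = []
        self.next_id = 0

    def record(self) -> int:
        work_id = self.next_id
        self.next_id += 1
        entry = Entry(work_id)
        if len(self.entries) < self.capacity:
            self.entries.append(entry)
        else:
            # Buffer is full: overwrite the oldest slot.
            self.entries[work_id % self.capacity] = entry
        return work_id

    def retire(self, work_id: int) -> bool:
        entry = self.entries[work_id % self.capacity]
        if entry.work_id != work_id:
            # The slot now belongs to a newer work; leave it untouched.
            return False
        entry.start = None
        entry.end = None
        entry.retired = True
        return True


buf = TraceBuffer(capacity=10)
for _ in range(11):
    buf.record()

retired_stale = buf.retire(0)          # slot 0 now holds work 10
start_after_stale = buf.entries[0].start
retired_live = buf.retire(10)          # the right entry is updated
```

The key design choice is the id check in `retire`: without it, retiring work 0 would clear the events that belong to work 10.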

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128948
Approved by: https://github.com/wconstab, https://github.com/c-p-i-o
2024-06-26 21:58:00 +00:00
b8e5678ad2 Delete lazy ddp optimizer (#120727)
This is no longer necessary now that the normal ddp optimizer works correctly with inductor strides.

Differential Revision: [D54858819](https://our.internmc.facebook.com/intern/diff/D54858819)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120727
Approved by: https://github.com/jansel, https://github.com/yf225
2024-06-26 21:53:54 +00:00
13316a8d46 [Profiler] Add Rank to NCCL Debug Info (#129528)
Summary: We need to add the Rank information to the NCCL debug data so that kineto can infer all the necessary process group info such that on-demand can create distributedInfo metadata. Kineto portion will be added in a follow up diff

Test Plan: Tested in D58736045, this diff just splits the kineto and profiler instances

Differential Revision: D59028819

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129528
Approved by: https://github.com/aaronenyeshi
2024-06-26 21:24:05 +00:00
7b1988f922 [ez] Give trymerge id token write permissions after #129503 (#129594)
Forgot to do this in #129503

Also fix minor typo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129594
Approved by: https://github.com/huydhn
2024-06-26 20:33:14 +00:00
795db80975 Upload release tag source code to s3 (#128842)
Upload tarball containing source code to s3 for release tags

Can be found here https://us-east-1.console.aws.amazon.com/s3/buckets/pytorch?region=us-east-1&bucketType=general&prefix=source_code/test/&showversions=false

D58695048 for adding permissions to allow uploading to the s3 folder
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128842
Approved by: https://github.com/atalman, https://github.com/malfet
2024-06-26 20:32:40 +00:00
28480dd7dc [CI] Fix runner determinator for ciflow (#129500)
In case of ciflow, runs are triggered by a tag which is created by @pytorchbot, which breaks the logic of the runner determinator.

In case of tag triggers, extract the pr number from the tag name, fetch the pr and extract the user login from it.
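
A minimal sketch of the tag-parsing step, assuming ciflow tags look like `ciflow/<label>/<pr number>` (the shape of `ciflow/rocm/129013`). The function name and regular expression are hypothetical, and fetching the PR to read the user login is left out because it needs the GitHub API.

```python
import re
from typing import Optional

# Assumed tag shape: ciflow/<label>/<pr number>, optionally prefixed by refs/tags/.
CIFLOW_TAG = re.compile(r"^(?:refs/tags/)?ciflow/[^/]+/(\d+)$")


def pr_number_from_ciflow_tag(ref: str) -> Optional[int]:
    """Return the PR number encoded in a ciflow tag, or None for other refs."""
    match = CIFLOW_TAG.match(ref.strip())
    return int(match.group(1)) if match else None


pr_from_tag = pr_number_from_ciflow_tag("ciflow/rocm/129013")
pr_from_branch = pr_number_from_ciflow_tag("refs/heads/main")
```

Returning `None` for non-tag refs lets the caller fall back to the regular actor-based lookup.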

Both the inline and standalone python scripts have been updated for consistency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129500
Approved by: https://github.com/malfet, https://github.com/zxiiro
2024-06-26 20:27:06 +00:00
d3d6764082 [pytorch][logging] add fb internal ODS implementation of wait counter (#128605)
* created fb internal implementation in `caffe2/torch/csrc/monitor/fb/instrumentation.cpp`
    * uses `facebook::data_preproc::WaitCounterUs` under the hood by having `WaitCounterImpl` trivially subclass it.
    * this makes `WaitCounterHandle` a glorified pointer to `facebook::data_preproc::WaitCounterUs` which is statically defined in the `STATIC_WAIT_COUNTER` macro making these pointers Meyers singletons.
        * `facebook::data_preproc::WaitCounterUs` uses 3 singletons:
             1. `std::unique_ptr<DynamicCounter::State>` map — leaky singleton
             2. `std::weak_ptr<WaitCounterUs::State>` map — leaky singleton
             3. publisherSingleton — normal singleton since it manages resources (threads)
        * `facebook::data_preproc::WaitCounterUs` actually owns shared pointers to the state and its destructor will remove it from the `std::weak_ptr<WaitCounterUs::State>` map when the reference count for the state hits 0.
* linked `caffe2/torch/csrc/monitor/fb/instrumentation.cpp` and added `//data_preproc/common:counters` (dpp dependency) to `caffe2/fb/fbcode/target_definitions.bzl`
* wrapped OSS null implementation in `#ifndef FBCODE_CAFFE2` so that internally we use the fb internal implementation.

as a follow-up I might move the counter implementation out of the data_preproc/counters library to a more common ai infra library?

Differential Revision: [D58458751](https://our.internmc.facebook.com/intern/diff/D58458751/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128605
Approved by: https://github.com/c-p-i-o
ghstack dependencies: #128466
2024-06-26 19:11:21 +00:00
90f82426b9 RS migration - trymerge to upload merge records to s3 (#129503)
Uploads merge records to the ossci-raw-job-status (public) bucket instead of directly to rockset

The runner used by trymerge is a GH runner, so it doesn't have access to s3. Instead, I save the record as a json and upload the json to s3 in a different step that runs after the aws credentials are configured.

The role is defined [here](https://togithub.com/pytorch-labs/pytorch-gha-infra/pull/421)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129503
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi, https://github.com/malfet
2024-06-26 19:06:52 +00:00
895316119d Revert "[BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)"
This reverts commit 0314c4c101c44d5d89b4fad9d37a012dc6f31128.

Reverted https://github.com/pytorch/pytorch/pull/129374 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it causes lots of internal build failures where they fail to find hipify module ([comment](https://github.com/pytorch/pytorch/pull/129374#issuecomment-2192437052))
2024-06-26 19:03:57 +00:00
e9aefad641 Revert "[CUDA][Inductor][CI] Revert PR#127150 since cu124 is now behaving similar enough to cu121 (#128423)"
This reverts commit 551e4127185195ae8a5331dc8bbfdffd5d4dd1b8.

Reverted https://github.com/pytorch/pytorch/pull/128423 on behalf of https://github.com/nWEIdia due to Sorry for reverting your change but I need to revert it to cleanly revert https://github.com/pytorch/pytorch/pull/129374 ([comment](https://github.com/pytorch/pytorch/pull/128423#issuecomment-2192423840))
2024-06-26 18:54:41 +00:00
cca85c96cd [export] minor typo fix (#129543)
Fixes a typo in torch.export doc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129543
Approved by: https://github.com/angelayi
2024-06-26 18:35:31 +00:00
87d14ad419 [inductor] Fix TORCHINDUCTOR_FORCE_DISABLE_CACHES (#129257)
Summary: See https://github.com/pytorch/pytorch/issues/129159; this option wasn't doing its job for a few reasons. In this PR:
* Fix the with_fresh_cache_if_config() decorator
* Reset the "TORCHINDUCTOR_CACHE_DIR" & "TRITON_CACHE_DIR" env vars in sub-process to support them changing in the parent process

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129257
Approved by: https://github.com/oulgen
2024-06-26 18:34:48 +00:00
61bf1452a3 Add one more shard for CPU jobs (#129299)
The first shard is very close to 3.5h and now sometimes times out 1c75ddff35 (26540310592)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129299
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi
2024-06-26 18:32:10 +00:00
b9a1c2c991 [ROCm] Enable F8 Inductor Unit tests (#128353)
First batch of inductor unit test enablement on ROCm for the fnuz f8 variant on MI300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128353
Approved by: https://github.com/jansel, https://github.com/eellison
2024-06-26 18:30:43 +00:00
8e4f7f742f [DCP] Capture reader, writer and planner components in the DCP API logger (#129548)
Summary: Capture reader, writer and planner components in the DCP API logger

Test Plan:
logs can be found in scuba pytorch_dcp_logging

https://fburl.com/scuba/pytorch_dcp_logging/ruqez1ki

Differential Revision: D59040866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129548
Approved by: https://github.com/wz337, https://github.com/fegin
2024-06-26 18:11:16 +00:00
7373492c9b Use _unsafe_masked_index in masked_scatter decomposition (#123667)
and remove masked_scatter_with_index inductor prims

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123667
Approved by: https://github.com/peterbell10
2024-06-26 17:18:24 +00:00
1b1fd0f4fe [ROCm] Use additional shard for inductor workflow to resolve timeouts (#129480)
This will help with timeouts on the inductor workflow. The cuda equivalent job also moved to 2 shards since e0aa992d73

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129480
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd, https://github.com/malfet
2024-06-26 17:18:20 +00:00
bc68907caa [EZ][BE] Replace assertTrue with more appropriate checks (#129569)
Based on this https://github.com/pytorch/pytorch/pull/129340#issuecomment-2191228046 I.e.
- `assertTrue(x == y)` -> `assertEqual(x, y)`
- `assertTrue(not x)` -> `assertFalse(x)`
- `assertTrue(x > y)` -> `assertGreater(x, y)`
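
As a small illustration of why the specific checks are preferable, the sketch below compares the failure messages produced by `assertTrue` and `assertEqual` from the standard `unittest` module. The helper `message_of` is hypothetical and exists only to capture the message.

```python
import unittest

case = unittest.TestCase()


def message_of(assertion, *args) -> str:
    """Run an assertion method and return its failure message ('' if it passes)."""
    try:
        assertion(*args)
    except AssertionError as exc:
        return str(exc)
    return ""


# assertTrue only sees the already-evaluated boolean.
vague = message_of(case.assertTrue, 1 == 2)
# assertEqual sees both operands and can report them.
specific = message_of(case.assertEqual, 1, 2)
```

The specific assertion reports both operands on failure, while `assertTrue` can only report that the expression was false.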

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129569
Approved by: https://github.com/jeanschmidt, https://github.com/Skylion007
2024-06-26 16:29:59 +00:00
9cf8e5dd32 chore(quantization): Enable PT2E symmetric dynamic quantization (#124615)
In the `_find_choose_qparams_node` function, check whether
the current node is affine or symmetric
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124615
Approved by: https://github.com/kimishpatel, https://github.com/malfet
2024-06-26 16:14:58 +00:00
f7708ffebb Revert "[AOTI][refactor] Unify UserDefinedTritonKernel.codegen (#129378)"
This reverts commit 52009068bc39ebc846bd37b44f5f9c5f62257778.

Reverted https://github.com/pytorch/pytorch/pull/129378 on behalf of https://github.com/clee2000 due to broke inductor/test_aot_inductor.py::AOTInductorTestABICompatibleCuda::test_triton_kernel_sympy_expr_arg_abi_compatible_cuda and a few other tests https://github.com/pytorch/pytorch/actions/runs/9680978494/job/26713689249 52009068bc. The tests were added in https://github.com/pytorch/pytorch/pull/129301 which is before your base ([comment](https://github.com/pytorch/pytorch/pull/129378#issuecomment-2192032697))
2024-06-26 15:46:17 +00:00
474d743dba [torchao][benchmark] Skip all accuracy tests by returning pass_due_to_skip (#129545)
Summary: As the title says.

Test Plan:
```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BERT_pytorch --quantization noquant --inference --bfloat16 --accuracy
```

Differential Revision: D59040593

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129545
Approved by: https://github.com/HDCharles
2024-06-26 14:21:53 +00:00
25cec43678 Remove dependency on private _compat_pickle in CPython (#129509)
Use the IMPORT_MAPPING and NAME_MAPPING from here https://github.com/python/cpython/blob/main/Lib/_compat_pickle.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129509
Approved by: https://github.com/malfet
ghstack dependencies: #129239, #129396
2024-06-26 14:20:27 +00:00
3b531eace7 Add example for torch.serialization.add_safe_globals (#129396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129396
Approved by: https://github.com/albanD, https://github.com/malfet
ghstack dependencies: #129239
2024-06-26 14:20:27 +00:00
303ad8d7f5 Add warning for weights_only (#129239)
Also changes default for `weights_only` to `None` per comment below (hence the `suppress-bc-linter` tag)
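
The pattern of a `None` default that triggers a warning can be sketched as follows. This is not the code in `torch.load`: the function `load_checkpoint`, the warning text, the `FutureWarning` category, and the fallback to `False` are all assumptions made for illustration.

```python
import warnings
from typing import Optional


def load_checkpoint(path: str, weights_only: Optional[bool] = None) -> dict:
    """Hypothetical loader that warns when `weights_only` is left unset."""
    if weights_only is None:
        # None means "the caller did not choose"; warn, then use the old default.
        warnings.warn(
            "`weights_only` was not set explicitly; falling back to False.",
            FutureWarning,
            stacklevel=2,
        )
        weights_only = False
    return {"path": path, "weights_only": weights_only}


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    implicit = load_checkpoint("model.pt")                      # warns
    explicit = load_checkpoint("model.pt", weights_only=True)   # silent
```

Using `None` as the default lets the function tell an unset argument apart from an explicit `False`, which a plain boolean default cannot do.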

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129239
Approved by: https://github.com/albanD, https://github.com/malfet
2024-06-26 14:20:19 +00:00
52009068bc [AOTI][refactor] Unify UserDefinedTritonKernel.codegen (#129378)
Summary: Unify the UserDefinedTritonKernel argument codegen logic between python wrapper and cpp wrapper. This prepares for later PRs that will simplify AOTI codegen.

Differential Revision: [D59002226](https://our.internmc.facebook.com/intern/diff/D59002226)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129378
Approved by: https://github.com/oulgen, https://github.com/chenyang78
ghstack dependencies: #129267
2024-06-26 13:53:27 +00:00
42d490d41d [AOTI][refactor] Move generate_user_defined_triton_kernel (#129267)
Summary: Move generate_user_defined_triton_kernel from cpp_wrapper_cpu to cpp_wrapper_cuda as it's for CUDA only

Differential Revision: [D58953005](https://our.internmc.facebook.com/intern/diff/D58953005)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129267
Approved by: https://github.com/chenyang78
2024-06-26 13:50:39 +00:00
53fafdd0c3 [BE] Runner determinator: more resilient user matching (#129462)
Small improvements on runner determinator script:

* Don't do splitting of the issue comment, unless necessary;
* Match username against a set over a list;
* Match both triggering_actor and issue owner over only actor (to avoid edge cases, where we get `pytorch-bot[bot]`)
* Add stripping, to remove potentially breaking, invisible whitespace;
* Don't use linux.4xlarge as a runner: it should not depend on meta runners, for reliability;
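
The matching rules above can be sketched as below. The input format (one `@login` per line) and the function names are assumptions for illustration, not the actual runner determinator script.

```python
from typing import Iterable, Set


def parse_users(lines: Iterable[str]) -> Set[str]:
    """Build a set of logins, stripping whitespace and a leading '@'."""
    return {line.strip().lstrip("@") for line in lines if line.strip()}


def is_opted_in(opted_in: Set[str], *candidates: str) -> bool:
    """True if any candidate login (e.g. triggering actor, issue owner) matches."""
    return any(candidate.strip() in opted_in for candidate in candidates)


opted_in = parse_users(["@alice\r", "  @bob  ", ""])

# The triggering actor is a bot, but the issue owner is an opted-in user.
matched = is_opted_in(opted_in, "pytorch-bot[bot]", "alice ")
unmatched = is_opted_in(opted_in, "pytorch-bot[bot]", "carol")
```

Checking several candidate logins covers the case where the triggering actor is a bot account, and the set gives constant-time membership tests.
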
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129462
Approved by: https://github.com/zxiiro, https://github.com/ZainRizvi
2024-06-26 13:47:52 +00:00
211f38e742 Revert "[ALI] [Reland] Use LF runners for Lint (#129071)"
This reverts commit 1b92bdd0ea326cd30bc3945602701ffe28c85fd5.

Reverted https://github.com/pytorch/pytorch/pull/129071 on behalf of https://github.com/malfet due to All LF jobs are backlogged, so revert this one ([comment](https://github.com/pytorch/pytorch/pull/129071#issuecomment-2191676677))
2024-06-26 13:19:00 +00:00
92be3403ea Fix an issue in oneShotAllReduce where different ranks perform reduction in different order (#129501)
In `oneShotAllReduce`, ranks read data from peers in a round-robin fashion to load-balance NVLinks. However, the subsequent reduction is also performed in this order, which differs across ranks. This can result in slight numerical differences across ranks, which can lead to a hang in data-dependent applications like speculative decoding.
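
The numerical effect comes from floating-point addition not being associative. The sketch below is plain Python rather than the CUDA kernel, and the round-robin order starting at `rank + 1` is an assumption used only to give each rank a different order.

```python
from typing import List


def reduce_in_order(values: List[float], order: List[int]) -> float:
    """Sum values sequentially in the given index order."""
    total = 0.0
    for index in order:
        total += values[index]
    return total


def round_robin_order(rank: int, world_size: int) -> List[int]:
    # Assumed scheme: each rank starts at its next peer and wraps around.
    return [(rank + 1 + step) % world_size for step in range(world_size)]


# One contribution per rank; large and small magnitudes expose rounding.
values = [1e16, 1.0, -1e16, 1.0]
world_size = len(values)

# Each rank reduces in its own read order: results can differ across ranks.
per_rank = [
    reduce_in_order(values, round_robin_order(rank, world_size))
    for rank in range(world_size)
]

# Every rank reduces in the same fixed order: results agree.
fixed = [
    reduce_in_order(values, list(range(world_size)))
    for _ in range(world_size)
]
```

Reading can stay round-robin for load balancing; only the order of the additions has to be the same on every rank.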

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129501
Approved by: https://github.com/Chillee
2024-06-26 08:43:10 +00:00
f2840bb220 [nn-module] Use standard dict for _parameters, _modules and _buffers (#129164)
TorchDynamo guard mechanism guards on the key order on the dictionaries if the user iterates over the dictionary. For standard dict, we can write a fast C++ implementation by using PyDict_Next. But with OrderedDict, we have to rely on `keys` Python API to get the key ordering. This makes guard evaluation slow.

With Dynamo inlining into inbuilt nn modules, I am seeing many guards over the OrderedDict on `_modules`, `_parameters`. From reading the code, I don't see any reason to not use standard dicts. I think OrderedDict was preferred over dict because of the ordering, but dicts are now ordered. With this PR, I am observing ~20% reduction in guard overhead of a HF model.

Functionality impact
- The only difference between dict and OrderedDict is the `move_to_end` method of OrderedDict ([link](https://stackoverflow.com/questions/34305003/difference-between-dictionary-and-ordereddict)). But the changes here are internal to nn module, and we do not use `move_to_end` for `_parameters`, `_modules` and `_buffers`. We use `move_to_end` for hooks, but this PR keeps the OrderedDict for hooks untouched (we should still follow up on hooks, but in a separate PR).

Perf impact
- I don't anticipate any perf impact. `dict` is completely implemented in C. OrderedDict is a Python wrapper over dict with only a few methods overridden ([link](https://stackoverflow.com/questions/34305003/difference-between-dictionary-and-ordereddict)).

Typing impact
- I don't anticipate any. For all the user-visible methods of nn.Module, we don't expose the underlying `_modules` etc. We have iterators like `named_parameters` which return an Iterator of Parameter. So, no typing changes are required.
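
The ordering argument can be checked with a short standalone sketch. The keys and values are made up for illustration and nothing here touches `nn.Module`.

```python
from collections import OrderedDict

# A plain dict keeps insertion order.
plain = {}
plain["conv"] = "Conv2d"
plain["bn"] = "BatchNorm2d"
plain["relu"] = "ReLU"

ordered = OrderedDict(plain)
same_order = list(plain) == list(ordered)

# OrderedDict has move_to_end; a plain dict does not.
ordered.move_to_end("conv")

# The same effect on a plain dict: pop the key and insert it again.
plain["conv"] = plain.pop("conv")
```

After both operations the two containers again list their keys in the same order, which shows that code relying only on iteration order does not need `OrderedDict`.
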
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129164
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #129163
2024-06-26 07:59:42 +00:00
ead97ee486 [Compile+SAC] Only warn for in-place ops once (#129397)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129397
Approved by: https://github.com/tianyu-l
2024-06-26 07:25:02 +00:00
c422a9549d [easy][DCP] Fix test_fsdp_ep.py for _MeshEnv.create_child_mesh API ch… (#129445)
…ange

Update test/distributed/checkpoint/e2e/test_fsdp_ep.py for #127465 change.
Failure info:
```bash
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] Caught exception:
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] Traceback (most recent call last):
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]   File "/projs/framework/fooooo/code/pytorch_new/torch/testing/_internal/common_distributed.py", line 657, in run_test
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]     getattr(self, test_name)()
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]   File "/projs/framework/fooooo/code/pytorch_new/torch/testing/_internal/common_distributed.py", line 539, in wrapper
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]     fn()
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]   File "/projs/framework/fooooo/code/pytorch_new/torch/testing/_internal/common_utils.py", line 2744, in wrapper
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]     method(*args, **kwargs)
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]   File "/projs/framework/fooooo/code/pytorch_new/torch/testing/_internal/distributed/_tensor/common_dtensor.py", line 369, in wrapper
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]     func(self, *args, **kwargs)  # type: ignore[misc]
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]   File "/projs/framework/fooooo/code/pytorch_new/torch/testing/_internal/common_distributed.py", line 180, in wrapper
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]     return func(*args, **kwargs)
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]   File "/projs/framework/fooooo/code/pytorch_new/torch/testing/_internal/distributed/checkpoint_utils.py", line 44, in wrapper
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]     func(self, *args, **kwargs)
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]   File "/projs/framework/fooooo/code/pytorch_new/test/distributed/checkpoint/e2e/test_fsdp_ep.py", line 76, in test_e2e
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]     mesh_fsdp_ep = _mesh_resources.create_child_mesh(mesh_fsdp_tp, 0, "dp")
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] TypeError: _MeshEnv.create_child_mesh() takes 3 positional arguments but 4 were given
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] To execute this test, run the following from the base repo dir:
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]      python test/distributed/checkpoint/e2e/test_fsdp_ep.py -k TestFSDPWithEP.test_e2e
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664]
[rank4]:E0625 10:50:33.502000 140043309847744 torch/testing/_internal/common_distributed.py:664] This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129445
Approved by: https://github.com/fegin, https://github.com/wz337
2024-06-26 06:43:30 +00:00
8b8e2fcdda [DCP] Fix Optimizer Learning Rate not being loaded correctly (#129398)
Fixes #129079

Currently, tensor objects are loaded correctly in-place, but non-tensor objects such as the learning rate are not loaded correctly after f518cf811d, which is a regression introduced in 2.3.

This PR replaces tree_map_only and manual replacement of the state dict items with _tree_map_only and fixes the regression of non-tensor loading.

Test:
```
# test to make sure lr is loading correctly
python3 test/distributed/checkpoint/e2e/test_e2e_save_and_load.py -k test_init_state_dict
# test to make sure load on meta device model still works
python3 test/distributed/checkpoint/test_tp_checkpoint.py -k test_tp_checkpoint_load_on_meta_device
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129398
Approved by: https://github.com/fegin
2024-06-26 06:41:47 +00:00
000f2d637b Refactoring the code to make it lint clean (#129424)
Summary: Refactoring the code to make it lint clean

Test Plan: buck2 build mode/dev-tsan caffe2/test:test_profiler_cuda

Differential Revision: D58971175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129424
Approved by: https://github.com/aaronenyeshi
2024-06-26 06:12:01 +00:00
610894e978 [MPS][BE] Generalize Fused optimizers (#129105)
This PR generalizes the multi_tensor_apply function for other fused optimizers

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129105
Approved by: https://github.com/malfet
ghstack dependencies: #129006, #129008, #129007
2024-06-26 06:00:41 +00:00
d02bba519c [export] match fake mode for _decompose_exported_program() (#129421)
Summary:
_decompose_exported_program() ran into an issue with trace_joint, where trace_joint() produces values with mismatching FakeModes. Adding fake mode context to aot_export_module() so this doesn't happen.

#thanks to tugsbayasgalan for the fix!

Test Plan: test_experimental

Differential Revision: D58977694

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129421
Approved by: https://github.com/tugsbayasgalan, https://github.com/zhxchen17
2024-06-26 05:52:31 +00:00
7420bad74c [BE] Do not assert if the barrier is not created (#129497)
The folder will be created as long as TEMP_DIR is set and the program
has write permission. This ensures that some test environments can run the
spawn tests.

Differential Revision: [D59020736](https://our.internmc.facebook.com/intern/diff/D59020736/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129497
Approved by: https://github.com/fduwjj, https://github.com/wz337
2024-06-26 05:51:36 +00:00
c04cec609d [dtensor][debug] fixing CommDebugMode module collective tracing (#128887)
**Summary**
The logic for CommDebugMode module collective tracing is incorrect as it only worked for leaf module nodes on the model's module tree. If we had a sub-module that had a collective call along with a nested module inside it, the sub-module was not removed from the module_tracker parent set leading to double-counting collectives. This problem was addressed by checking to make sure the current sub-module was not already in the parent set. The output of the below test cases should remain the same.

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing

2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128887
Approved by: https://github.com/XilunWu
ghstack dependencies: #128729
2024-06-26 05:25:57 +00:00
bd3a11776f [dtensor][test] test case suite for comm_mode features (#128729)
**Summary**
Currently, there is only an example file for comm_mode and its features. I have created test cases that mirror the examples while the more complicated test cases also ensure that comm_mode resets all variables when used multiple times in the same function. This test case suite will also help developers ensure that new code they add to comm_mode does not affect correctness of old features.
#128536

**Test Plan**
pytest test/distributed/_tensor/debug/test_comm_mode_features.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128729
Approved by: https://github.com/XilunWu
2024-06-26 05:25:57 +00:00
6181e65cd8 Nested tensor subclass support (#127431)
When we have nested tensor subclasses, we need to recursively flatten/unflatten in Fake tensor creation and AOTAutograd. Most of the PR is a mechanical change that makes today's single-level flatten logic recursive.
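
The idea of recursive flatten/unflatten can be sketched without PyTorch. The `Wrapper` class is a hypothetical stand-in for a tensor subclass that holds inner values, and the spec format is invented for this example. It is not the AOTAutograd implementation.

```python
from dataclasses import dataclass
from typing import Any, List, Tuple


@dataclass
class Wrapper:
    """Hypothetical stand-in for a tensor subclass holding inner tensors."""
    inner: Tuple[Any, ...]


def flatten(value: Any) -> Tuple[List[Any], Any]:
    """Return (leaves, spec); spec is None for a leaf."""
    if not isinstance(value, Wrapper):
        return [value], None
    leaves: List[Any] = []
    spec = []
    for child in value.inner:
        # Recurse: a child may itself be a Wrapper.
        child_leaves, child_spec = flatten(child)
        leaves.extend(child_leaves)
        spec.append((len(child_leaves), child_spec))
    return leaves, spec


def unflatten(leaves: List[Any], spec: Any) -> Any:
    """Rebuild the nested structure from leaves and the spec from flatten."""
    if spec is None:
        (leaf,) = leaves
        return leaf
    children = []
    offset = 0
    for count, child_spec in spec:
        children.append(unflatten(leaves[offset:offset + count], child_spec))
        offset += count
    return Wrapper(tuple(children))


nested = Wrapper((Wrapper(("a", "b")), "c"))
leaves, spec = flatten(nested)
rebuilt = unflatten(leaves, spec)
```

A single-level version would stop at the outer `Wrapper` and treat the inner one as a leaf; the recursion is what reaches the plain values inside it.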

Differential Revision: [D58533224](https://our.internmc.facebook.com/intern/diff/D58533224)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127431
Approved by: https://github.com/bdhirsh
2024-06-26 04:45:22 +00:00
cda4d4887d Skip signals from older runs of the same workflows (#129291)
I discovered this bug in trymerge when debugging https://github.com/pytorch/pytorch/pull/129013 in which Dr.CI reported no relevant failures while mergebot complained about some unrelated ROCm failures https://github.com/pytorch/pytorch/pull/129013#issuecomment-2183009217.

It turns out that mergebot took into account stale signals from older runs of the same workflow here.  For example,
* https://github.com/pytorch/pytorch/actions/runs/9604985361 was the first run where it had a ROCm failure
* While https://github.com/pytorch/pytorch/actions/runs/9608926565 was the second attempt and it was all green

Notice that both runs came from the same push to commit [be69191](be69191f2d) with [ciflow/rocm/129013](https://github.com/pytorch/pytorch/tree/ciflow/rocm/129013).  So, we just need to check the signals from the newer run.

Note that Dr.CI handles this part correctly using the logic in https://github.com/pytorch/test-infra/blob/main/torchci/pages/api/drci/drci.ts#L1079-L1088.  So, the fix in this PR is to bring the same logic to trymerge.
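
A minimal sketch of the filtering step, assuming that a larger run id means a newer run of the same workflow on the same commit. The `WorkflowRun` type, its field names, and the workflow name are hypothetical; this is not the trymerge or Dr.CI code.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class WorkflowRun(NamedTuple):
    workflow: str
    head_sha: str
    run_id: int
    conclusion: str


def latest_runs(runs: Iterable[WorkflowRun]) -> List[WorkflowRun]:
    """Keep only the newest run per (workflow, commit) pair."""
    latest: Dict[Tuple[str, str], WorkflowRun] = {}
    for run in runs:
        key = (run.workflow, run.head_sha)
        # Assumption: run ids grow over time, so the larger id is newer.
        if key not in latest or run.run_id > latest[key].run_id:
            latest[key] = run
    return list(latest.values())


runs = [
    WorkflowRun("rocm", "be69191f2d", 9604985361, "failure"),  # first run
    WorkflowRun("rocm", "be69191f2d", 9608926565, "success"),  # second attempt
]
kept = latest_runs(runs)
```

With the two runs from the example above, only the second, green attempt remains, so the stale ROCm failure no longer blocks the merge.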

### Testing

`pytest -v test_trymerge.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129291
Approved by: https://github.com/ZainRizvi
2024-06-26 03:49:09 +00:00
c718e2f43b [pytorch][logging] add empty wait counter implementation (#128466)
Differential Revision: [D58441466](https://our.internmc.facebook.com/intern/diff/D58441466)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128466
Approved by: https://github.com/c-p-i-o
2024-06-26 03:47:17 +00:00
54f27b886e [Inductor UT] Reuse test_distributed_patterns.py for Intel GPU (#129437)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129437
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-06-26 02:58:45 +00:00
555f71a15b Fix test_auto_simd in machine with AMX support (#129444)
Fixes #129438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129444
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-06-26 02:50:55 +00:00
a89a1ed072 [easy][DCP] make BroadcastingTorchSaveReader device generic (#129231)
Tests in test/distributed/checkpoint/test_format_utils.py pass on GPU and on other devices.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129231
Approved by: https://github.com/fegin
2024-06-26 02:37:30 +00:00
90d5a6f001 [inductor] Add lowering and codegen for aten.sort (#128458)
Closes #125633

Benchmarks:
| Shape       | dim | stable | compiled | eager   | speedup |
|-------------|-----|--------|----------|---------|---------|
| (256, 4096) | 0   | False  | 0.73 ms  | 1.26 ms | 1.7     |
| (256, 4096) | 0   | True   | 0.75 ms  | 1.27 ms | 1.7     |
| (4096, 256) | 1   | False  | 0.20 ms  | 0.73 ms | 3.7     |
| (4096, 256) | 1   | True   | 0.21 ms  | 0.73 ms | 3.5     |
| (255, 4096) | 0   | False  | 1.05 ms  | 1.48 ms | 1.4     |
| (255, 4096) | 0   | True   | 1.03 ms  | 1.47 ms | 1.4     |
| (4096, 255) | 1   | False  | 0.52 ms  | 0.98 ms | 1.9     |
| (4096, 255) | 1   | True   | 0.54 ms  | 1.00 ms | 1.9     |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128458
Approved by: https://github.com/lezcano, https://github.com/eellison
2024-06-26 01:36:39 +00:00
b7e7a4cb01 [cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343)
Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes.

What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here...

Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343
Approved by: https://github.com/Skylion007
2024-06-26 00:49:18 +00:00
9554a9af87 [GPT-benchmark] Distinguish LLM models and micro-benchmarks (#129498)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129498
Approved by: https://github.com/huydhn
2024-06-26 00:25:05 +00:00
0d0d42c4a7 test_qat_mobilenet_v2 succeeding on dynamo (#129532)
https://github.com/pytorch/pytorch/actions/runs/9669572961/job/26677024995

Test is usually marked as slow so it doesn't get run on dynamo since dynamo doesn't have a slow equivalent

However, it is succeeding, so we might as well do what the logs tell us to do and remove the failure
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129532
Approved by: https://github.com/malfet, https://github.com/kit1980
2024-06-25 23:55:12 +00:00
112ef79f29 [inductor] Remove comm-specific node attributes from scheduler (#129084)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129084
Approved by: https://github.com/lezcano
2024-06-25 23:52:19 +00:00
d1f9e822dd [DTensor][Test] Update implicit replication unit tests for tensor arg being the first in args list (#127803)
Change the operand order so we have test coverage for the case where the first arg is a tensor arg instead of a DTensor arg.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127803
Approved by: https://github.com/XilunWu
2024-06-25 23:51:58 +00:00
575bc1e3af [Reopen #114036] Allow "must recompute" in torch.compile + selective checkpointing (SAC) (#129295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129295
Approved by: https://github.com/Chillee
2024-06-25 23:47:08 +00:00
f389541ce0 Add Strided Input test for flex attention (#128915)
Test strided inputs to the flex_attention HOP. Similar to how inputs are generated in
https://github.com/pytorch/pytorch/blob/main/benchmarks/transformer/score_mod.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128915
Approved by: https://github.com/Chillee, https://github.com/drisspg
2024-06-25 23:26:34 +00:00
87ebd627a7 RS migration - upload sccache stats to s3 instead of rockset (#129490)
Upload sccache stats to s3 instead of rockset

I don't think we use these anywhere, so it's ok to cut off the ingest into rockset right now.

We should consider deleting this entirely if we don't plan on using it

I will work on copying existing data over from rockset to s3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129490
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-06-25 23:23:16 +00:00
52341c28e8 Revert "[FSDP2] Ran post-acc-grad hooks manually (#129450)"
This reverts commit 7ebffef4d02a3cc68dbbcf44b92d63c7fe0ebb67.

Reverted https://github.com/pytorch/pytorch/pull/129450 on behalf of https://github.com/clee2000 due to broke distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_simple_mlp_fullgraph_backend_aot_eager 7ebffef4d0 https://github.com/pytorch/pytorch/actions/runs/9667812641/job/26671489454.  Test got added in https://github.com/pytorch/pytorch/pull/129157 which is before your mergebase ([comment](https://github.com/pytorch/pytorch/pull/129450#issuecomment-2190174363))
2024-06-25 23:13:57 +00:00
bbd47f7b2f Remove ProcessGroupCudaP2P and change async-TP to use SymmetricMemory (#128762)
This PR removes `ProcessGroupCudaP2P` and changes async-TP to use `SymmetricMemory`. The async-TP implementation is still workspace-based, but it now doesn't require a buffer size to be specified upfront.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128762
Approved by: https://github.com/wanchaol
2024-06-25 22:32:21 +00:00
1c5df9107d [BE] Fix several incorrect skip tests (#129488)
These tests may not be skipped properly if the NCCL library exists but CUDA is not available.

Differential Revision: [D59013855](https://our.internmc.facebook.com/intern/diff/D59013855/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129488
Approved by: https://github.com/wz337, https://github.com/fduwjj
2024-06-25 22:10:31 +00:00
fd414d6189 [inductor] don't materialize the large sparse matrix in CE bwd (#129043)
Inductor currently materializes a large sparse matrix in the backward pass for CrossEntropyLoss and loads it to compute the gradients of the Softmax input. If we can fuse the sparse matrix computation into its consumers, we win on both performance and memory usage.

The FX graph snippet that constructs this sparse matrix looks like:
```
       full_default_3: "bf16[32768, 50257]" = torch.ops.aten.full.default([32768, 50257], 0, dtype = torch.bfloat16, layout = torch.strided, device = device(type='cuda', index=0), pin_memory = False)
       scatter: "bf16[32768, 50257]" = torch.ops.aten.scatter.value(full_default_3, 1, where_2, -1.0);  full_default_3 = where_2 = None
```
The following observations:
- the scatter is applied to an all-zero tensor (or, more generally, a constant tensor)
- the index tensor for the scatter has a single element along the scatter dimension (in this case it is the label tensor)

allow us to lower this 'scatter_upon_const_tensor' pattern to a pointwise kernel that can be easily fused with downstream kernels:

```
    def inner_fn(idx):
        selector_idx = list(idx)
        selector_idx[dim] = 0  # can do this since the index tensor has a single element on the scatter dimension

        selector = selector_loader(selector_idx)
        return ops.where(
            selector == ops.index_expr(idx[dim], torch.int64),
            ops.constant(val, dtype),
            ops.constant(background_val, dtype),
        )
```
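
As a rough illustration of why the lowering is valid, the sketch below checks in plain Python (no Inductor involved) that scattering a value into a constant matrix gives the same result as computing each element pointwise. The function names `scatter_upon_const` and `pointwise_equivalent` are hypothetical and exist only for this example.

```python
def scatter_upon_const(rows, cols, index, val, background_val=0.0):
    """Reference: materialize the full matrix, then write `val` at column index[r] of row r."""
    out = [[background_val] * cols for _ in range(rows)]
    for r in range(rows):
        out[r][index[r]] = val
    return out


def pointwise_equivalent(rows, cols, index, val, background_val=0.0):
    """Pointwise form: every element depends only on its own coordinates and the index tensor."""
    def inner_fn(r, c):
        # Valid because the index tensor has a single element per row.
        return val if index[r] == c else background_val

    return [[inner_fn(r, c) for c in range(cols)] for r in range(rows)]


labels = [2, 0, 3]  # one label per row, playing the role of the index tensor
dense = scatter_upon_const(3, 4, labels, -1.0)
lazy = pointwise_equivalent(3, 4, labels, -1.0)
assert dense == lazy
```

Because the pointwise form never needs the whole matrix at once, a consumer kernel can evaluate it on the fly instead of loading a materialized tensor.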

## Test result on microbenchmark

For the microbenchmark added as `test_cross_entropy_loss`, we improve latency from 47.340ms to 42.768ms, memory footprint from 10.524GB to 7.227GB on A100. (on H100, we improve latency from 27.54ms to 23.51ms, memory footprint from 10.574GB to 7.354GB).

The saving matches the back-of-envelope calculation. We avoid storing a BF16 tensor with shape [30K, 50K], which is about 3GB in size. On A100, avoiding loading and storing such a tensor saves roughly 3GB x 2 / 1.5TB/s = 4ms.
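
The same estimate can be reproduced with the exact shape from the FX graph snippet. This is only a sketch of the arithmetic; the 1.5 TB/s bandwidth figure is the rounded value used in the estimate, not a measured number.

```python
rows, cols = 32768, 50257          # shape of the scatter output in the FX graph
bytes_per_elem = 2                 # bfloat16
tensor_bytes = rows * cols * bytes_per_elem
bandwidth_bytes_per_s = 1.5e12     # assumed ~1.5 TB/s A100 memory bandwidth

# One store when the matrix is written plus one load when it is consumed.
saved_seconds = 2 * tensor_bytes / bandwidth_bytes_per_s

print(f"tensor size: {tensor_bytes / 1e9:.2f} GB")          # tensor size: 3.29 GB
print(f"estimated saving: {saved_seconds * 1e3:.2f} ms")    # estimated saving: 4.39 ms
```

This is in line with the measured A100 latency improvement of 47.340ms - 42.768ms = 4.572ms.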

## Test result on llm.c

We also test this on llm.c and the saving is much larger, especially for memory footprint. The reason is that autotuning allocates extra memory for benchmarking. (Check https://github.com/pytorch/pytorch/issues/129258 and https://github.com/pytorch/pytorch/pull/129399 for more details).

For llm.c PyTorch implementation on A100, we improve from
171K tokens/s , 33.6G peak memory usage to
180K tokens/s, 18.6G peak memory usage. (A **45%** saving of peak memory)

## Test on PyTorch 2.0 Dashboard

The optimization is quite general especially for transformers. We tested this on PyTorch2.0 dashboard. Here is the [result](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2017%20Jun%202024%2018%3A07%3A51%20GMT&stopTime=Mon%2C%2024%20Jun%202024%2018%3A07%3A51%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/shunting314/158/head&lCommit=c62c55e29c65497d495217b6574bb36b0c4da7d4&rBranch=main&rCommit=0d25f096c1beaf8749932a3d6083ad653405ed71).

TLDR, for Huggingface benchmark suite, we get **6%** geomean perf improvement and **10%** geomean memory footprint improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129043
Approved by: https://github.com/jansel, https://github.com/Chillee
2024-06-25 21:25:50 +00:00
e1499f6342 [C10D] Make new_group eager when used with comm_split (#129284)
If users pass `device_id` to init_process_group, they enable eager init
for the default group.  Then if they subsequently call `new_group`, the
device_id argument is not required as it should be assumed to match the
one used for init_process_group.

However, both `init_process_group` and `new_group` apis share a helper
function, which expects a `device_id` value that defaults to None.  When
it's None, eager initialization is disabled.

This PR ensures that if a device_id was passed to init_process_group,
the same device_id will automatically be fed into the helper function
for any new_group calls that follow.
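
The intended behavior can be sketched with a toy model. Everything here (`_shared_helper`, `init_default_group`, `make_subgroup`, `_default_device_id`) is a hypothetical stand-in, not the actual c10d code; it only shows the idea of remembering the device_id and feeding it to the shared helper.

```python
from typing import Optional

_default_device_id: Optional[str] = None  # hypothetical record of the default group's device


def _shared_helper(ranks, device_id: Optional[str] = None) -> dict:
    """Stand-in for the shared helper: eager init happens only when device_id is not None."""
    return {"ranks": list(ranks), "device_id": device_id, "eager": device_id is not None}


def init_default_group(world_size: int, device_id: Optional[str] = None) -> dict:
    """Stand-in for init_process_group: remembers the device_id it was given."""
    global _default_device_id
    _default_device_id = device_id
    return _shared_helper(range(world_size), device_id)


def make_subgroup(ranks) -> dict:
    """Stand-in for new_group: forwards the remembered device_id instead of None."""
    return _shared_helper(ranks, _default_device_id)


init_default_group(4, device_id="cuda:0")
print(make_subgroup([0, 1]))  # the subgroup inherits device_id, so it is eager as well
```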

**Test plan**
I found an existing test in CI  `test_comm_split_subgroup` that failed after my change, because it was asserting that backend comm_split counter did not increment eagerly, and its behavior had changed to increment eagerly.  I updated the test in the PR to pass with my change.

I also tested locally via simple program with TORCH_CPP_LOG_LEVEL=INFO and
observed eager initialization of the 'lows' and 'highs' PGs before the
'Here' print.

```
import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl", device_id =torch.device(f"cuda:{torch.distributed.get_node_local_rank(0)}"))
dist.new_group([0, 1], group_desc="lows")
dist.new_group([2, 3], group_desc="highs")
print("Here")
torch.distributed.destroy_process_group()
```

Output:
https://gist.github.com/wconstab/88a5ba0b970244ca1f79133f989e0349

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129284
Approved by: https://github.com/pavanbalaji, https://github.com/fduwjj, https://github.com/d4l3k, https://github.com/nvcastet
2024-06-25 21:09:34 +00:00
e58ef5b65f [export] Rewrite exportdb formatting. (#129260)
Summary: It'll be easier to generate examples if the code doesn't depend on exportdb library.

Test Plan: CI

Differential Revision: D58886554

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129260
Approved by: https://github.com/tugsbayasgalan
2024-06-25 21:04:53 +00:00
551e412718 [CUDA][Inductor][CI] Revert PR#127150 since cu124 is now behaving similar enough to cu121 (#128423)
Pre-requisite: close https://github.com/pytorch/pytorch/issues/126692 first.

This PR also gives a current read on cu121 and cu124 parity.

Essentially reverting #127150

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128423
Approved by: https://github.com/atalman, https://github.com/eqy
2024-06-25 20:59:49 +00:00
79959d707c [Inductor][ROCm] Composable Kernel backend for Inductor (#125453)
This PR adds an alternative backend for Inductor, adding Composable Kernel Universal GEMM instances to the autotune instance selection.

The implementation is heavily influenced by the series of PRs which add the CUTLASS backend (https://github.com/pytorch/pytorch/issues/106991). The main differences are
 (1) customizing compiler for the ROCm platform
 (2) customizing template code generation for Composable Kernel Universal GEMM instances.

We provide config tuning knobs for balancing between instance sources compilation time and finding the best instance.

### Testing
Install the ck library
```
pip install git+https://github.com/rocm/composable_kernel@develop
```
Run the test
```
TORCH_LOGS=+torch._inductor \
pytest --capture=tee-sys test/inductor/test_ck_backend.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125453
Approved by: https://github.com/eellison, https://github.com/jansel
2024-06-25 20:54:14 +00:00
ae0f84d89c [CI] Enable amp accuracy check for inductor cpu (#127758)
This enables the inductor AMP accuracy check on CPU in the CI workflow to catch issues early. Three suites are included: timms, huggingface and torchbench.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127758
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-25 20:34:18 +00:00
45f2876934 [Fix] NumToTensor resulted from numel() and size() in TSConverter (#128761)
#### Issue
In jit.trace, torch.numel() is automatically cast to a `LongTensor`. But during conversion, we lost the casting part. `prim::NumToTensor` was previously converted to `torch.ops.aten.scalar_tensor`, which uses the same `dtype` as the input tensor instead of `LongTensor`. In this PR, we add a cast to convert it to the correct `dtype`.

#### Test Plan
We re-enable a previously failing test case.
* `pytest test/export/test_converter.py -s -k test_implicit_constant_to_tensor_handling`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128761
Approved by: https://github.com/angelayi
2024-06-25 20:20:03 +00:00
e68ee2cadb TunableOp hotfix (#129281)
Fixes.
- PYTORCH_TUNABLEOP_NUMERICAL_CHECK=1 had a memory leak.
- The strided batched gemm size calculation for buffer rotation was incorrect resulting in a mem fault.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129281
Approved by: https://github.com/xw285cornell, https://github.com/eqy, https://github.com/mxz297
2024-06-25 20:12:46 +00:00
1865fe282f Log whenever we sleep (#129197)
Summary:
Log whenever we sleep for heartbeatTimeout.
Useful for debugging stuck jobs.
This will eventually turn into a metric.

Test Plan:
none.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129197
Approved by: https://github.com/Skylion007, https://github.com/d4l3k, https://github.com/wconstab
2024-06-25 20:09:41 +00:00
b1f486aff9 Revert "Add warning for weights_only (#129239)"
This reverts commit 381ce0821c3fa2b342f0b8660c76cc27f48543c4.

Reverted https://github.com/pytorch/pytorch/pull/129239 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I am seeing some test_nn failures from ROCm 381ce0821c, trying to revert this to see if trunk recovers ([comment](https://github.com/pytorch/pytorch/pull/129239#issuecomment-2189812903))
2024-06-25 19:30:07 +00:00
7cf454ec52 Revert "Add example for torch.serialization.add_safe_globals (#129396)"
This reverts commit f18becaaf1c7a7bf851e3ae8d215eee8dba688b6.

Reverted https://github.com/pytorch/pytorch/pull/129396 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I am seeing some test_nn failures from ROCm 381ce0821c, trying to revert this to see if trunk recovers ([comment](https://github.com/pytorch/pytorch/pull/129239#issuecomment-2189812903))
2024-06-25 19:30:07 +00:00
0298560ca2 TCPStore: improve connect and retry logic (#129261)
We've been facing issues where TCPStore can successfully connect but then fail in the validate() function because of connection resets caused by listen backlog queue overflow, when reset is enabled and init times are long.

This PR does a few things:
* Retry the connect and validate steps until the specified timeout expires.
* Use exponential backoff for the retry logic with jitter instead of a fixed 1s sleep.
* Eliminate the `sleep(std::chrono::milliseconds(numWorkers))` on init which can add significant delays to startup. This is no longer necessary per @XilunWu https://github.com/pytorch/pytorch/pull/116141
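
For readers unfamiliar with the technique, here is a minimal Python sketch of retrying with jittered exponential backoff until a deadline. It is not the C++ code added by this PR; `retry_with_backoff` and all of its parameters and default values are assumptions made for illustration.

```python
import random
import time


def retry_with_backoff(op, timeout_s, initial_s=0.05, max_s=1.0, multiplier=2.0,
                       sleep=time.sleep, clock=time.monotonic, rng=random.random):
    """Call op() until it succeeds or timeout_s elapses, sleeping between attempts."""
    deadline = clock() + timeout_s
    interval = initial_s
    while True:
        try:
            return op()
        except ConnectionError:
            remaining = deadline - clock()
            if remaining <= 0:
                raise  # out of time: surface the last failure
            # Jitter: pick a delay in [0.5, 1.0) times the current interval so that
            # many clients do not retry in lockstep.
            delay = min(interval * (0.5 + 0.5 * rng()), remaining)
            sleep(delay)
            # Exponential growth, capped at max_s.
            interval = min(interval * multiplier, max_s)


attempts = {"n": 0}


def flaky_connect():
    """Fails twice with a simulated reset, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("connection reset")
    return "connected"


print(retry_with_backoff(flaky_connect, timeout_s=5.0))  # connected
```

Compared with a fixed 1s sleep, early retries happen quickly, and the jitter spreads out clients that failed at the same moment.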

Test plan:

```
python test/distributed/test_store.py -v
./build/bin/BackoffTest
```

Will do internal testing with some large scale jobs to ensure TCPStore works correctly.

At 4k scale: 4x improvement

```
tristanr@devvm4382 ~/pt_tests [SIGABRT]> time TORCH_SHOW_CPP_STACKTRACES=1 python tcpstore_large_test.py                                                                                                   (pytorch-3.10)
started 0
init 0
set 0
joined all

________________________________________________________
Executed in    1.98 secs    fish           external
   usr time    0.93 secs   91.00 micros    0.93 secs
   sys time    1.98 secs  954.00 micros    1.97 secs

tristanr@devvm4382 ~/pt_tests> conda activate torchdrive-3.10                                                                                                                                              (pytorch-3.10)
tristanr@devvm4382 ~/pt_tests> time TORCH_SHOW_CPP_STACKTRACES=1 python tcpstore_large_test.py                                                                                                          (torchdrive-3.10)
started 0
init 0
set 0
joined all

________________________________________________________
Executed in    8.20 secs    fish           external
   usr time    2.15 secs    0.00 micros    2.15 secs
   sys time    2.76 secs  843.00 micros    2.76 secs
```

```py
import time
import os
import threading
from multiprocessing import Pool

WORLD_SIZE = 10000

import torch.distributed as dist

def run(rank):
    should_log = rank % (WORLD_SIZE // 10) == 0
    if should_log:
        print(f"started {rank}")
    store = dist.TCPStore(
        host_name="devvm4382.nao0.facebook.com",
        port=29500,
        world_size=WORLD_SIZE,
        is_master=rank == 0,
        use_libuv=True,
    )
    if should_log:
        print(f"init {rank}")
    store.set(f"key{rank}", "1234")
    if should_log:
        print(f"set {rank}")
    del store

def noop(rank):
    pass

print("starting pool")
with Pool(WORLD_SIZE) as pool:
    pool.map(noop, range(WORLD_SIZE), 1)
    print("pool hot")
    start = time.time()
    pool.map(run, range(WORLD_SIZE), 1)
    print("run finished", time.time()-start)
```

```
tristanr@devvm4382 ~/pt_tests> python tcpstore_large_test.py                                                                                                                                (pytorch-3.10)
starting pool
pool hot
started 0
[W624 16:58:09.086081750 TCPStore.cpp:343] [c10d] Starting store with 10000 workers but somaxconn is 4096.This might cause instability during bootstrap, consider increasing it.
started 1000
init 1000
set 1000
started 2000
init 2000
set 2000
started 3000
init 3000
set 3000
started 4000
init 4000
set 4000
started 5000
init 5000
set 5000
started 6000
init 6000
set 6000
started 7000
init 7000
set 7000
started 8000
init 8000
set 8000
started 9000
init 9000
set 9000
init 0
set 0
run finished 0.705092191696167
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129261
Approved by: https://github.com/rsdcastro, https://github.com/wconstab, https://github.com/kurman, https://github.com/XilunWu, https://github.com/c-p-i-o
2024-06-25 19:24:22 +00:00
816e8a3f21 [MacOS] Improve libomp packaging (#129473)
Instead of replacing `@rpath/libomp.dylib` with `@loader_path/libomp.dylib`, keep it in place and add `@loader_path` as a new rpath.

This should prevent double-loading of the OpenMP runtime: with `@rpath` the loader is allowed to reuse other libraries, whereas the `@loader_path` directive forces it to load the library from a location relative to the binary that loads it.

Test plan:
- Prepare the environment
```shell
conda create -n py310-cf python=3.10 numpy pip -c conda-forge
conda activate py310-cf
pip install torch --index-url https://download.pytorch.org/whl/test/cpu
```
- Verify that OpenMP is loaded twice and then crashes
```shell
KMP_VERSION=true python -c "import numpy as np; import torch; print(torch.__version__, torch.backends.openmp.is_available()); print(torch.rand(300, 300).abs().max())"
```
output:
```
LLVM OMP version: 5.0.20140926
LLVM OMP library type: performance
LLVM OMP link type: dynamic
LLVM OMP build time: no_timestamp
LLVM OMP build compiler: Clang 16.0
LLVM OMP alternative compiler support: yes
LLVM OMP API version: 5.0 (201611)
LLVM OMP dynamic error checking: no
LLVM OMP thread affinity support: no
LLVM OMP version: 5.0.20140926
LLVM OMP library type: performance
LLVM OMP link type: dynamic
LLVM OMP build time: no_timestamp
LLVM OMP build compiler: Clang 12.0
LLVM OMP alternative compiler support: yes
LLVM OMP API version: 5.0 (201611)
LLVM OMP dynamic error checking: no
LLVM OMP thread affinity support: no
2.4.0 True
zsh: segmentation fault  KMP_VERSION=true python -c
```
- Install artifact from this PR and make sure it passes the same test
```shell
python -mpip install ~/Downloads/torch-2.5.0.dev20240625-cp310-none-macosx_11_0_arm64.whl
KMP_VERSION=true python -c "import numpy as np; import torch; print(torch.__version__, torch.backends.openmp.is_available()); print(torch.rand(300, 300).abs().max())"
```
output
```
LLVM OMP version: 5.0.20140926
LLVM OMP library type: performance
LLVM OMP link type: dynamic
LLVM OMP build time: no_timestamp
LLVM OMP build compiler: Clang 16.0
LLVM OMP alternative compiler support: yes
LLVM OMP API version: 5.0 (201611)
LLVM OMP dynamic error checking: no
LLVM OMP thread affinity support: no
2.5.0.dev20240625 True
tensor(1.0000)
```
- Make sure it still uses bundled OpenMP if none is available in the environment
```
conda uninstall numpy -c conda-forge
KMP_VERSION=true python -c "from ctypes import cdll, c_char_p, c_uint32; import torch; from ctypes import cdll, c_char_p, c_uint32; libdyld = cdll.LoadLibrary('libSystem.dylib'); libdyld._dyld_image_count.restype = c_uint32; libdyld._dyld_get_image_name.restype = c_char_p; libdyld._dyld_get_image_name.argtypes = [c_uint32]; print(torch.rand(300, 300).abs().max()); libs = [libdyld._dyld_get_image_name(i).decode('ascii') for i in range(libdyld._dyld_image_count())]; print([l for l in libs if 'libomp.dylib' in l])"
```

Fixes https://github.com/pytorch/pytorch/issues/124497 and https://github.com/pytorch/pytorch/issues/126385
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129473
Approved by: https://github.com/atalman
2024-06-25 19:12:34 +00:00
b045878f81 Revert "Remove test_mps_allocator_module XFAIL (#129340)"
This reverts commit c888ee36325148ed99db4298bf2ae739ebbeacdc.

Reverted https://github.com/pytorch/pytorch/pull/129340 on behalf of https://github.com/huydhn due to The test is now failing again in trunk after a day or so of staying green, we need to continue the investigation ([comment](https://github.com/pytorch/pytorch/pull/129340#issuecomment-2189701706))
2024-06-25 18:37:54 +00:00
7ebffef4d0 [FSDP2] Ran post-acc-grad hooks manually (#129450)
FSDP2 accumulates gradients for sharded parameters outside of the autograd engine's normal accumulation logic. We can respect registered post-accumulate-grad hooks by running them manually.

**Discussion**
Discussing with @soulitzer, changing FSDP2 to make the sharded parameters autograd leaves requires nontrivial changes to FSDP and some changes to the autograd engine (around forward vs. backward streams) where the changes may not preserve eager-mode performance and/or add some complexity.

Under the FSDP2 design, the sharded parameters never participate in autograd, so calling `register_post_accumulate_grad_hook` on them would otherwise be a no-op. In other words, there is virtually no chance for FSDP2 incorrectly re-running the hook when it should not.

Given these, a reasonable near-term solution is for FSDP2 to run the post-accumulate-grad hooks manually.

**Caveats**
- Running `foreach=False` optimizer _per parameter tensor_  incurs significantly higher CPU overhead compared to `foreach=True` (partially due to `DTensor` being a `__torch_dispatch__` tensor subclass).
    - On preliminary benchmarking on Llama3-8B on 8 GPUs, this CPU overhead is mostly tolerable, but on smaller # of GPUs or a less compute-intensive model, this may not be.
    - One solution for native Adam/AdamW is to use `fused=True`, which makes both the CPU overhead lower and GPU compute faster. However, this is generally not an option for user-defined optimizers.
    - If this CPU overhead blocks adoption of this feature, then we should seriously consider an FSDP-specific API like `register_post_backward_hook(params: List[nn.Parameter]) -> None` that allows the user to see all parameters in the `FSDPParamGroup` together for the hook so that the user can still run a `foreach=True` optimizer step on that `List[nn.Parameter]`.
- The post-accumulate-grad hook runs in the reduce-scatter stream. Our current stream handling logic does not have the default stream wait for the reduce-scatter stream until the end of backward. Unless we add that, we cannot simply run the post-accumulate-grad hook in the default stream.
    - This means that optimizer compute will overlap with backward compute, which may slow down end-to-end execution slightly (e.g. due to SM contention or wave quantization effects). For example, on Llama3-8B, we see about a 3% decrease in MFU when running optimizer in backward even though the optimizer steps are fully overlapped and there are no CPU boundedness issues.
- This PR's goal is only to run the hook manually. State dict etc. for optimizer-in-backward is out of scope.

**Experiments (torchtitan)**
- Llama3-8B on 2 GPUs, local batch size 1, with full activation checkpointing, and bf16/fp32 mixed precision:
    - Without optimizer-in-backward: 82.03 GiB reserved memory; 28.1% MFU
    - With optimizer-in-backward (`foreach=False`): 72.84 GiB reserved memory; 28.9% MFU (speedup from more of optimizer step overlapped)
    - With optimizer-in-backward (`fused=True`): 70.84 GiB reserved memory; 30.4% MFU

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129450
Approved by: https://github.com/weifengpy
2024-06-25 18:34:56 +00:00
dd00f5e78d Fixes T192448049 (#129146)
Differential Revision: D58767610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129146
Approved by: https://github.com/angelayi
2024-06-25 17:50:15 +00:00
53f462c506 Write dynamo benchmarks performance result to csv when throw exceptions (#126764)
**Performance mode issue**: When the dynamo benchmarks performance warm-up fails, the result is not written to the csv file. In accuracy mode, however, the result is written as `fail_to_run` even when the dynamo pass fails. As a result, the number of models in the accuracy csv file does not match the number in the performance csv file.
![image](https://github.com/pytorch/pytorch/assets/84730719/9043d215-130b-46b4-a835-f148c225947c)

- **Fix**: Models whose warm-up failed are now recorded in the csv file, as shown below:
![image](https://github.com/pytorch/pytorch/assets/84730719/7907a3c2-c942-42bb-b31c-55424a0e8117)

**Accuracy mode issue**: `detectron2_fasterrcnn_r` models failed in accuracy mode but passed in performance mode. The accuracy failure is the same as the one in PR ee557d8f61.
```
Dynamic Shape:
Traceback (most recent call last):
  File "benchmarks/dynamo/torchbench.py", line 449, in <module>
    torchbench_main()
  File "benchmarks/dynamo/torchbench.py", line 445, in torchbench_main
    main(TorchBenchmarkRunner(), original_dir)
  File "/workspace/pytorch/benchmarks/dynamo/common.py", line 3650, in main
    process_entry(0, runner, original_dir, args)
  File "/workspace/pytorch/benchmarks/dynamo/common.py", line 3582, in process_entry
    return run(runner, args, original_dir)
  File "/workspace/pytorch/benchmarks/dynamo/common.py", line 4163, in run
    assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 4
```
![image](https://github.com/pytorch/pytorch/assets/84730719/f25392f0-f982-46c8-8e2c-a8a25d85a21a)

- **Fix**: same as PR ee557d8f61: skip setting batch_size to 4 when testing dynamic shapes.

Dynamic shapes pass rate improved from 89% to **95%**
| Comp Item | Compiler | Suite      | Before     | After fix  |
|-----------|----------|------------|------------|------------|
| Pass Rate | Inductor | torchbench | 89%, 73/82 | 95%, 79/83 |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126764
Approved by: https://github.com/jansel
2024-06-25 17:49:04 +00:00
e317a8b264 Add guard to use AMX for x86_64 only (#129479)
Trying to mitigate aarch64 and s390 nightly failures as per this comment:
https://github.com/pytorch/pytorch/pull/127195#issuecomment-2189177949

Fixes https://github.com/pytorch/pytorch/issues/129443

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129479
Approved by: https://github.com/nWEIdia, https://github.com/malfet
2024-06-25 17:31:28 +00:00
45b2931b7e Revert "[Traceable FSDP2] Don't decompose fsdp.split_with_sizes_copy (#129414)"
This reverts commit b24787b7576c184a54d13c1833ada23a395f5c31.

Reverted https://github.com/pytorch/pytorch/pull/129414 on behalf of https://github.com/ZainRizvi due to This PR is seems to be causing multiple macos failures.  Looks like it was merged before trunk jobs were started, which would have run those tests ([comment](https://github.com/pytorch/pytorch/pull/129414#issuecomment-2189479505))
2024-06-25 17:05:55 +00:00
fb40ba6fc2 Revert "[Traceable FSDP2] Add Dynamo support for run_with_rng_state HOP (#127247)"
This reverts commit aa4ee2cb9e1f9be6bbdd27654e0f768b7fe9be6c.

Reverted https://github.com/pytorch/pytorch/pull/127247 on behalf of https://github.com/ZainRizvi due to This PR is seems to be causing multiple macos failures.  Looks like it was merged before trunk jobs were started, which would have run those tests ([comment](https://github.com/pytorch/pytorch/pull/129414#issuecomment-2189479505))
2024-06-25 17:05:55 +00:00
ad76da6c16 Revert "[inductor] Fix TORCHINDUCTOR_FORCE_DISABLE_CACHES (#129257)"
This reverts commit 7b57ddd38c6d502ba313c0e6b0c92b6787d69986.

Reverted https://github.com/pytorch/pytorch/pull/129257 on behalf of https://github.com/clee2000 due to one of the PRs in the stack seems to have broken test/distributed/_composable/test_replicate_with_compiler.py::ReplicateTest::test_bucketing_concat_op on distributed https://github.com/pytorch/pytorch/actions/runs/9653941844/job/26627760340 4c1e4c5f30, not tested on this PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/129257#issuecomment-2189444171))
2024-06-25 16:48:32 +00:00
b38f6d4cd2 Revert "[inductor] Enable FX graph caching in OSS by default (#125863)"
This reverts commit 4c1e4c5f307f9743014a08cf97d3fa8de7e1ce5f.

Reverted https://github.com/pytorch/pytorch/pull/125863 on behalf of https://github.com/clee2000 due to one of the PRs in the stack seems to have broken test/distributed/_composable/test_replicate_with_compiler.py::ReplicateTest::test_bucketing_concat_op on distributed https://github.com/pytorch/pytorch/actions/runs/9653941844/job/26627760340 4c1e4c5f30, not tested on this PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/129257#issuecomment-2189444171))
2024-06-25 16:48:32 +00:00
f8db12a538 Fix logic to find sbgemm in BLAS library (#125227)
The current logic that sets the HAS_SBGEMM flag is skipped when the BLAS libraries have already been found, i.e., when they are set from the environment variable BLAS=OpenBLAS. If BLAS_LIBRARIES is already set, the code that checks whether the BLAS library has sbgemm is never executed. This commit moves that check out so that it runs unconditionally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125227
Approved by: https://github.com/malfet
2024-06-25 16:34:38 +00:00
665d6ea05b [export] Fix IR canonicalization. (#129401)
Summary: As title. We should unpack the results from _canonicalize_graph.

Differential Revision: D58963429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129401
Approved by: https://github.com/tugsbayasgalan
2024-06-25 16:33:02 +00:00
e364290718 Support linear backward for NJT with dim > 3 (#129393)
Replaces usage of `torch.mm()` with `torch.matmul()` in NJT's impl of linear_backward to support higher dims. See [here](https://github.com/pytorch/pytorch/issues/125214#issuecomment-2184968703) for more context.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129393
Approved by: https://github.com/soulitzer
2024-06-25 16:06:23 +00:00
0e6bb7f1ce [caffe2][be] migrate global static initializer (#128784)
Summary:
The Caffe2 lib has 200+ uses of global static initializers, each a paper cut for startup perf. Details in this post: https://fb.workplace.com/groups/arglassesperf/permalink/623909116287154.

This diff migrates StorageImpl.cpp.

Additional context: https://fb.workplace.com/groups/arglassesperf/permalink/623909116287154

Test Plan: CI

Differential Revision: D58639283

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128784
Approved by: https://github.com/aaronenyeshi
2024-06-25 15:30:49 +00:00
fd4af87855 Fix non-portable path warning (#129474)
MacOS uses a case-insensitive filesystem by default, but it's better to specify the include path using proper capitalization

Should fix
```
MultiTensorApply.h:4:10: warning: non-portable path to file '<ATen/native/mps/operations/FusedOptimizerOps.h>'; specified path differs in case from file name on disk [-Wnonportable-include-path]
#include <Aten/native/mps/operations/FusedOptimizerOps.h>
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129474
Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/qqaatw
2024-06-25 15:17:21 +00:00
cb1c56caba Set target dependencies to always build for sm90a on rowwise scaling (#129402)
# Summary

Instead of landing global builder changes (https://github.com/pytorch/builder/pull/1878), this PR targets only the Rowwise file and adds the sm90a features.

Verified locally by setting:
```
TORCH_CUDA_ARCH_LIST=9.0
```

We can see in the build.ninja file that the proper flags are set:

```
build caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/RowwiseScaledMM.cu.o: CUDA_COMPILER__torch_cuda_unscanned_Release /home/drisspg/meta/pytorch/aten/src/ATen/native/cuda/RowwiseScaledMM.cu || cmake_object_order_depends_target_torch_cuda
  DEFINES = -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS
  DEP_FILE = caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/RowwiseScaledMM.cu.o.d
  FLAGS = -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda  -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler=-Wall,-Wextra,-Wdeprecated,-Wno-unused-parameter,-Wno-missing-field-initializers,-Wno-unknown-pragmas,-Wno-type-limits,-Wno-array-bounds,-Wno-unknown-pragmas,-Wno-strict-overflow,-Wno-strict-aliasing,-Wno-unused-function,-Wno-maybe-uninitialized -Wno-deprecated-copy -gencode arch=compute_90a,code=sm_90a
  INCLUDES = -I/home/drisspg/meta/pytorch/build/aten/src -I/home/drisspg/meta/pytorch/aten/src -I/home/drisspg/meta/pytorch/build -I/home/drisspg/meta/pytorch -I/home/drisspg/meta/pytorch/third_party/onnx -I/home/drisspg/meta/pytorch/build/third_party/onnx -I/home/drisspg/meta/pytorch/third_party/foxi -I/home/drisspg/meta/pytorch/build/third_party/foxi -I/home/drisspg/meta/pytorch/aten/src/THC -I/home/drisspg/meta/pytorch/aten/src/ATen/cuda -I/home/drisspg/meta/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/drisspg/meta/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/drisspg/meta/pytorch/build/caffe2/aten/src -I/home/drisspg/meta/pytorch/aten/src/ATen/.. -I/home/drisspg/meta/pytorch/build/nccl/include -I/home/drisspg/meta/pytorch/c10/cuda/../.. -I/home/drisspg/meta/pytorch/c10/.. -I/home/drisspg/meta/pytorch/third_party/tensorpipe -I/home/drisspg/meta/pytorch/build/third_party/tensorpipe -I/home/drisspg/meta/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/drisspg/meta/pytorch/torch/csrc/api -I/home/drisspg/meta/pytorch/torch/csrc/api/include -isystem /home/drisspg/meta/pytorch/build/third_party/gloo -isystem /home/drisspg/meta/pytorch/cmake/../third_party/gloo -isystem /home/drisspg/meta/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/drisspg/meta/pytorch/third_party/protobuf/src -isystem /home/drisspg/meta/pytorch/third_party/ittapi/include -isystem /home/drisspg/meta/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda-12.3/include -isystem /home/drisspg/meta/pytorch/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /home/drisspg/meta/pytorch/third_party/ideep/include -isystem /home/drisspg/meta/pytorch/cmake/../third_party/cudnn_frontend/include
  OBJECT_DIR = caffe2/CMakeFiles/torch_cuda.dir
  OBJECT_FILE_DIR = caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda
 ```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129402
Approved by: https://github.com/malfet
2024-06-25 13:54:51 +00:00
71ebe5121a [MPS] Fast math env var (#129007)
Allow users to decide whether they want to have fast math enabled via env var
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129007
Approved by: https://github.com/malfet
ghstack dependencies: #129006, #129008
2024-06-25 13:52:07 +00:00
bbdeff76fc fix add decomposition for complex numbers (#129044)
Fixes #125745

Bug source: When addition requires broadcasting, adding complex numbers is not implemented correctly in `torch/_inductor/decomposition.py` because `x.view(x.real.dtype)` would multiply the last dimension by 2, and then broadcasting wouldn't work.

Fix: re-shape the complex tensors after view and before broadcasting.
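
The shape problem described above can be sketched without tensors. This is a minimal illustration, not the code from the PR: `broadcast_shapes`, `view_as_real_dtype`, and `reshape_after_view` are hypothetical helpers, and the trailing axis of size 2 is an assumption about what the reshape looks like.

```python
from itertools import zip_longest


def broadcast_shapes(a, b):
    """Return the broadcast of two shapes, or None if they are incompatible."""
    out = []
    for x, y in zip_longest(reversed(a), reversed(b), fillvalue=1):
        if x != y and x != 1 and y != 1:
            return None
        out.append(max(x, y))
    return tuple(reversed(out))


def view_as_real_dtype(shape):
    # Models x.view(x.real.dtype): the last dimension doubles.
    return shape[:-1] + (shape[-1] * 2,)


def reshape_after_view(shape):
    # Assumed reshape: keep the complex shape and add a trailing axis of size 2.
    return shape + (2,)


x_shape, y_shape = (3, 1), (1, 4)  # complex operands that broadcast to (3, 4)

# (3, 2) against (1, 8): the doubled last dimensions no longer line up.
naive = broadcast_shapes(view_as_real_dtype(x_shape), view_as_real_dtype(y_shape))

# (3, 1, 2) against (1, 4, 2): broadcasting works on the original dimensions.
fixed = broadcast_shapes(reshape_after_view(x_shape), reshape_after_view(y_shape))
```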

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129044
Approved by: https://github.com/zou3519, https://github.com/lezcano
2024-06-25 11:05:41 +00:00
6508f0f5d4 Improved backward tracking and attribution, fixed typing for python < 3.10 (#129400)
For #125323
* Fixes typing for python < 3.10
* Fixes #129390

For #124688
* Improved attribution by registering `register_hook` and `post_accumulate_grad_hook` on params.
* Fixed premature per-module bw peak state initialization for AC.
* This improves per-module stats, global `peak_mem` was already accurate and remains unaffected.

For #128508
* When AC is applied to a `mod (nn.Module)` the backward order of execution is `pre-bw -> pre-fw -> post-fw -> post-bw`. Since the `ModTracker` maintains the `parents` attribute as a set, the `post-fw` during backward was prematurely removing it from parents.
* With the fix we now maintain a per-module counter and only remove a module from `parents` when its counter goes to 0.
* Added tests to ensure this.
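
The counter-based bookkeeping can be sketched as follows. `ParentTracker` and its `enter`/`exit` methods are hypothetical names used only for illustration; the real `ModTracker` hooks are not shown in this message.

```python
from collections import Counter


class ParentTracker:
    """Hypothetical stand-in for the per-module counter described above."""

    def __init__(self):
        self.parents = set()
        self._active = Counter()

    def enter(self, name):
        self._active[name] += 1
        self.parents.add(name)

    def exit(self, name):
        self._active[name] -= 1
        # Only drop the module once every matching enter has been exited.
        if self._active[name] == 0:
            self.parents.discard(name)


tracker = ParentTracker()
tracker.enter("mod")  # pre-bw
tracker.enter("mod")  # pre-fw (recomputation under AC)
tracker.exit("mod")   # post-fw: counter is 1, so "mod" stays in parents
still_parent = "mod" in tracker.parents
tracker.exit("mod")   # post-bw: counter reaches 0
removed = "mod" not in tracker.parents
```

With a plain set and no counter, the `post-fw` exit would already have removed `"mod"` while its backward pass was still running.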

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129400
Approved by: https://github.com/awgu, https://github.com/huydhn
2024-06-25 10:54:58 +00:00
63474620ab test_jit: Replace plain assert by test assert (#128950)
The plain assert doesn't show the values in case of failure
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128950
Approved by: https://github.com/zou3519
2024-06-25 09:04:53 +00:00
0314c4c101 [BE][Easy] use pathlib.Path instead of dirname / ".." / pardir (#129374)
Changes, in the order they were applied:

1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.

    `.parent{...}.absolute()` -> `.absolute().parent{...}`

4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)

    `.parent.parent.parent.parent` -> `.parents[3]`

5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~

    ~`.parents[3]` -> `.parents[4 - 1]`~

6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~
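
The equivalences behind steps 2 and 4 can be checked with the standard library alone. The path below is hypothetical, and `PurePosixPath` is used so the sketch behaves the same on any platform.

```python
import posixpath
from pathlib import PurePosixPath

path = PurePosixPath("/repo/tools/scripts/build/gen.py")  # hypothetical path

# Step 2: nested dirname calls correspond to chained .parent accesses.
nested = posixpath.dirname(posixpath.dirname(str(path)))
chained = str(path.parent.parent)

# Step 4: N chained .parent accesses correspond to .parents[N - 1].
four_up = path.parent.parent.parent.parent
indexed = path.parents[3]
```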

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-06-25 08:28:38 +00:00
4ca8eecca4 skip test_graph_capture_oom for jetson (#128661)
On Jetson IGX, `python test/test_cuda.py -k test_graph_capture_oom` fails with the following error:

```
RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":841, please report a bug to PyTorch.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
    yield
  File "/usr/lib/python3.10/unittest/case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "/usr/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2759, in wrapper
    method(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2759, in wrapper
    method(*args, **kwargs)
  File "/opt/pytorch/pytorch/test/test_cuda.py", line 2255, in test_graph_capture_oom
    with self.assertRaisesRegex(RuntimeError, oom_regex):
  File "/usr/lib/python3.10/unittest/case.py", line 239, in __exit__
    self._raiseFailure('"{}" does not match "{}"'.format(
  File "/usr/lib/python3.10/unittest/case.py", line 163, in _raiseFailure
    raise self.test_case.failureException(msg)
AssertionError: "out of memory" does not match "NVML_SUCCESS == r INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":841, please report a bug to PyTorch. "

```

This is a known issue: NVML support on Jetson is limited, and the OOM reporting in CUDACachingAllocator.cpp requires NVML to be properly loaded, which fails on Jetson.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128661
Approved by: https://github.com/eqy, https://github.com/atalman
2024-06-25 08:25:11 +00:00
8bfd9e9815 [cuDNN] Graph-capturable cuDNN CTCLoss (#128271)
cuDNN v8.x added a graph-capturable CTCLoss, which slots "neatly" into the `Tensor` variant

~~WIP as cuDNN has a restriction on the max target length (255), but this is not checkable in the graph-capture case, so the UX around warnings/error-messages here might need to be tuned...~~
Currently checks restriction on max target length during warmup run(s), and bails out during capture if this constraint was violated during warmup.

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128271
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-06-25 06:01:50 +00:00
533c4190f9 [inductor][cpp] support nested kernel with indirect indexing (#129223)
This PR makes sure the current kernel is used for generating CSE variables when nested kernel codegen is involved, e.g., nested CppKernel is used to generate epilogue of CppTemplateKernel. Without the fix, the epilogue with indirect indexing would fail to run.

pytest -k test_linear_with_embedding_bias_False_cpu test_cpu_select_algorithm.py

Epilogue code Before:
```c++
                {
                    #pragma GCC ivdep
                    for(long x0=static_cast<long>(0L); x0<static_cast<long>(m_end + ((-1L)*m_start)); x0+=static_cast<long>(1L))
                    {
                        for(long x1=static_cast<long>(0L); x1<static_cast<long>(16L*(c10::div_floor_integer(N0, 16L))); x1+=static_cast<long>(16L))
                        {
                            auto tmp0 = in_ptr2[static_cast<long>(m_start + x0)];
                            auto tmp11 = at::vec::Vectorized<float>::loadu(local_acc_buf + static_cast<long>(x1 + (N0*x0)), 16);
                            auto tmp1 = 64L;
                            auto tmp2 = c10::convert<int64_t>(tmp1);
                            auto tmp3 = decltype(tmp0)(tmp0 + tmp2);
                            auto tmp4 = tmp0 ? tmp3 : tmp0;
                            auto tmp5 = decltype(tmp4)(tmp4 + tmp2);
                            auto tmp6 = tmp1 ? tmp5 : tmp4;
                            auto tmp7 = tmp6;
                            auto tmp8 = c10::convert<int64_t>(tmp7);
                            TORCH_CHECK((0 <= tmp8) & (tmp8 < 64L), "index out of bounds: 0 <= tmp8 < 64L");
                            auto tmp10 = at::vec::Vectorized<float>::loadu(in_ptr3 + static_cast<long>(n_start + x1 + (384L*tmp6)), 16);
                            auto tmp12 = (tmp11);
                            auto tmp13 = tmp10 + tmp12;
                            tmp13.store(Y + static_cast<long>(n_start + x1 + (384L*m_start) + (384L*x0)));
                        }
                        #pragma omp simd simdlen(8)
                        for(long x1=static_cast<long>(16L*(c10::div_floor_integer(N0, 16L))); x1<static_cast<long>(N0); x1+=static_cast<long>(1L))
                        {
                            auto tmp0 = in_ptr2[static_cast<long>(m_start + x0)];
                            auto tmp11 = local_acc_buf[static_cast<long>(x1 + (N0*x0))];
                            auto tmp1 = 64L;
                            auto tmp2 = c10::convert<int64_t>(tmp1);
                            auto tmp3 = decltype(tmp0)(tmp0 + tmp2);
                            auto tmp4 = tmp0 ? tmp3 : tmp0;
                            auto tmp5 = decltype(tmp4)(tmp4 + tmp2);
                            auto tmp6 = tmp1 ? tmp5 : tmp4;
                            auto tmp7 = tmp6;
                            auto tmp8 = c10::convert<int64_t>(tmp7);
                            TORCH_CHECK((0 <= tmp8) & (tmp8 < 64L), "index out of bounds: 0 <= tmp8 < 64L");
                            TORCH_CHECK((0 <= tmp8) & (tmp8 < 64L), "index out of bounds: 0 <= tmp8 < 64L");
                            auto tmp10 = in_ptr3[static_cast<long>(n_start + x1 + (384L*tmp6))];
                            auto tmp12 = c10::convert<float>(tmp11);
                            auto tmp13 = decltype(tmp10)(tmp10 + tmp12);
                            Y[static_cast<long>(n_start + x1 + (384L*m_start) + (384L*x0))] = tmp13;
                        }
                    }
                }
```

Epilogue code After:
```c++
                {
                    #pragma GCC ivdep
                    for(long x0=static_cast<long>(0L); x0<static_cast<long>(m_end + ((-1L)*m_start)); x0+=static_cast<long>(1L))
                    {
                        for(long x1=static_cast<long>(0L); x1<static_cast<long>(16L*(c10::div_floor_integer(N0, 16L))); x1+=static_cast<long>(16L))
                        {
                            auto tmp0 = in_ptr2[static_cast<long>(m_start + x0)];
                            auto tmp13 = at::vec::Vectorized<float>::loadu(local_acc_buf + static_cast<long>(x1 + (N0*x0)), 16);
                            auto tmp1 = 64L;
                            auto tmp2 = c10::convert<int64_t>(tmp1);
                            auto tmp3 = decltype(tmp0)(tmp0 + tmp2);
                            auto tmp4 = tmp0 < 0;
                            auto tmp5 = tmp4 ? tmp3 : tmp0;
                            auto tmp6 = decltype(tmp5)(tmp5 + tmp2);
                            auto tmp7 = tmp5 < 0;
                            auto tmp8 = tmp7 ? tmp6 : tmp5;
                            auto tmp9 = tmp8;
                            auto tmp10 = c10::convert<int64_t>(tmp9);
                            TORCH_CHECK((0 <= tmp10) & (tmp10 < 64L), "index out of bounds: 0 <= tmp10 < 64L");
                            auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr3 + static_cast<long>(n_start + x1 + (384L*tmp8)), 16);
                            auto tmp14 = (tmp13);
                            auto tmp15 = tmp12 + tmp14;
                            tmp15.store(Y + static_cast<long>(n_start + x1 + (384L*m_start) + (384L*x0)));
                        }
                        #pragma omp simd simdlen(8)
                        for(long x1=static_cast<long>(16L*(c10::div_floor_integer(N0, 16L))); x1<static_cast<long>(N0); x1+=static_cast<long>(1L))
                        {
                            auto tmp0 = in_ptr2[static_cast<long>(m_start + x0)];
                            auto tmp13 = local_acc_buf[static_cast<long>(x1 + (N0*x0))];
                            auto tmp1 = 64L;
                            auto tmp2 = c10::convert<int64_t>(tmp1);
                            auto tmp3 = decltype(tmp0)(tmp0 + tmp2);
                            auto tmp4 = tmp0 < 0;
                            auto tmp5 = tmp4 ? tmp3 : tmp0;
                            auto tmp6 = decltype(tmp5)(tmp5 + tmp2);
                            auto tmp7 = tmp5 < 0;
                            auto tmp8 = tmp7 ? tmp6 : tmp5;
                            auto tmp9 = tmp8;
                            auto tmp10 = c10::convert<int64_t>(tmp9);
                            TORCH_CHECK((0 <= tmp10) & (tmp10 < 64L), "index out of bounds: 0 <= tmp10 < 64L");
                            TORCH_CHECK((0 <= tmp10) & (tmp10 < 64L), "index out of bounds: 0 <= tmp10 < 64L");
                            auto tmp12 = in_ptr3[static_cast<long>(n_start + x1 + (384L*tmp8))];
                            auto tmp14 = c10::convert<float>(tmp13);
                            auto tmp15 = decltype(tmp12)(tmp12 + tmp14);
                            Y[static_cast<long>(n_start + x1 + (384L*m_start) + (384L*x0))] = tmp15;
                        }
                    }
                }
```
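
One way to read the two listings is that the comparison temporaries (`tmp0 < 0`, `tmp5 < 0`) are missing in the Before version, so the index and the constant end up used as conditions. The following sketch models both index computations in plain Python; it is an interpretation of the generated code above, not output from Inductor.

```python
SIZE = 64  # the `64L` constant in the generated code


def wrap_before(i):
    t3 = i + SIZE
    t4 = t3 if i else i        # `tmp0 ? tmp3 : tmp0`: the index itself is the condition
    t5 = t4 + SIZE
    return t5 if SIZE else t4  # `tmp1 ? tmp5 : tmp4`: the constant 64 is the condition


def wrap_after(i):
    t3 = i + SIZE
    t5 = t3 if i < 0 else i      # `tmp4 = tmp0 < 0; tmp5 = tmp4 ? tmp3 : tmp0`
    t6 = t5 + SIZE
    return t6 if t5 < 0 else t5  # `tmp7 = tmp5 < 0; tmp8 = tmp7 ? tmp6 : tmp5`


def in_bounds(i):
    # Mirrors the TORCH_CHECK condition.
    return 0 <= i < SIZE
```

Under this reading, a valid index such as 5 is pushed out of range by `wrap_before`, which would trip the bounds check, while `wrap_after` leaves it unchanged and maps -1 to 63.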

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129223
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2024-06-25 05:21:00 +00:00
665dbc2f52 [easy][DCP] Fix test_fine_tuning.py for get/set_state_dict API changes (#129365)
Update test/distributed/checkpoint/e2e/test_fine_tuning.py for https://github.com/pytorch/pytorch/pull/112203 change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129365
Approved by: https://github.com/fegin
2024-06-25 05:12:02 +00:00
0e1e289033 [ONNX] Benchmark refactored ONNX export (#129427)
Reuse torch.onnx.export with torch_onnx patch to test ExportedProgram -> ONNX IR exporter

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129427
Approved by: https://github.com/justinchuby
2024-06-25 04:47:53 +00:00
f18becaaf1 Add example for torch.serialization.add_safe_globals (#129396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129396
Approved by: https://github.com/albanD
ghstack dependencies: #129244, #129251, #129239
2024-06-25 04:19:44 +00:00
381ce0821c Add warning for weights_only (#129239)
Also changes default for `weights_only` to `None` per comment below (hence the `suppress-bc-linter` tag)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129239
Approved by: https://github.com/albanD
ghstack dependencies: #129244, #129251
2024-06-25 04:19:44 +00:00
c5f7755e86 Allow BUILD/NEWOBJ instruction for items added via torch.serialization.add_safe_globals (#129251)
Previously, allowlisting functions/classes via `torch.serialization.add_safe_globals(obj)` for the `weights_only` Unpickler had the following effect:

- For a [`GLOBAL`](https://github.com/python/cpython/blob/3.12/Lib/pickletools.py#L1926-L1939) instruction, `GLOBAL obj.__module__ obj.__name__` would be allowed and translated back to obj to be pushed back to the stack.
- For a [`REDUCE`](https://github.com/python/cpython/blob/3.12/Lib/pickletools.py#L1926-L1982) instruction where we expect the stack to contain `func` and `args`, `func` is allowed if it was added via `add_safe_globals`

However, it did not have an effect on `BUILD` and `NEWOBJ` instructions

Some classes may be rebuilt via [`NEWOBJ`](https://github.com/python/cpython/blob/3.12/Lib/pickletools.py#L2091-L2104) instruction, which indicates that their constructor should be used to rebuild the class.

Further, a [`BUILD`](https://github.com/python/cpython/blob/3.12/Lib/pickletools.py#L1984-L2007) instruction might be used if an object's `__reduce__`/`__reduce_ex__` returns a non-None value for `state`, which indicates that `__setstate__` or `__dict__.update` will be used to restore that state.

**This PR makes sure that adding objects to the allowlist will also allow `NEWOBJ` and `BUILD` instructions for them.**
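
To see these instructions in an actual pickle stream, the sketch below pickles a plain stdlib class with protocol 2 and lists the opcode names. `argparse.Namespace` is chosen only because it has instance state and no custom `__reduce__`; it is unrelated to the classes this PR targets.

```python
import argparse
import pickle
import pickletools

# A plain class instance with state in __dict__ and no custom __reduce__.
obj = argparse.Namespace(scale=2)

data = pickle.dumps(obj, protocol=2)

# genops yields (opcode, argument, position) triples for each instruction.
opcodes = [op.name for op, arg, pos in pickletools.genops(data)]
```

The stream contains `GLOBAL` for the class, `NEWOBJ` to construct the instance, and `BUILD` to restore its state.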

In particular, the update for `NEWOBJ` should unblock allowlisting of [`ScaledMMConfig`](d4ade877df/float8_experimental/float8_tensor.py (L26-L30)) in float8_experimental @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129251
Approved by: https://github.com/albanD
ghstack dependencies: #129244
2024-06-25 04:19:44 +00:00
1bb1e3463c Fix allowlisting of builtins for weights_only unpickler (#129244)
Since we use [`DEFAULT_PROTOCOL=2`](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L62), some functions/classes that were renamed from python 2-->3 will be pickled with their python2 name. This PR ensures that when a `GLOBAL <python2_mod>.<python2_name>` instruction is encountered, it is properly mapped to `<python3_mod>.<python3_name>`, [following the strategy used by pickle](https://github.com/python/cpython/blob/main/Lib/pickle.py#L1590C13-L1593C63).

This fix ensures that `add_safe_globals` works properly for such functions/classes (i.e. users will allowlist the python3 func and the weights_only unpickler will do the appropriate translation when checking whether a class was allowlisted).

An example is as follows:
`__builtin__` was renamed to `builtins`, see the [release notes for Python 3.0](https://docs.python.org/3/whatsnew/3.0.html)

> Renamed module `__builtin__` to [`builtins`](https://docs.python.org/3/library/builtins.html#module-builtins) (removing the underscores, adding an ‘s’). The __builtins__ variable found in most global namespaces is unchanged. To modify a builtin, you should use [builtins](https://docs.python.org/3/library/builtins.html#module-builtins), not `__builtins__`!

However, since we use [`DEFAULT_PROTOCOL=2`](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L62), builtins will be pickled with their module string as `__builtin__`.

```python
>>> import pickle
>>> import pickletools
>>> print.__module__
'builtins'
>>> with open('print.pkl', 'wb') as f:
...     pickle.dump(print, f, protocol=2) # 2 because this is the default protocol used by pytorch
...
>>> with open('print.pkl', 'rb') as f:
...     pickletools.dis(f)
...
0: \x80 PROTO      2
2: c    GLOBAL     '__builtin__ print' # pickle saves the module string as __builtin__ !!! :(
21: q    BINPUT     0
23: .    STOP
```
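
The translation step can be sketched with the tables that CPython's own `pickle` uses. `to_python3_name` is a hypothetical helper, and whether the `weights_only` unpickler consults exactly these tables is an assumption; the description above only says it follows pickle's strategy.

```python
import _compat_pickle  # private CPython module holding pickle's py2 -> py3 tables


def to_python3_name(module, name):
    """Map a Python 2 (module, name) pair to its Python 3 equivalent."""
    # Renamed objects are looked up first, then renamed modules.
    if (module, name) in _compat_pickle.NAME_MAPPING:
        return _compat_pickle.NAME_MAPPING[(module, name)]
    if module in _compat_pickle.IMPORT_MAPPING:
        return _compat_pickle.IMPORT_MAPPING[module], name
    return module, name
```

For example, `to_python3_name("__builtin__", "print")` yields the `builtins` module name, which is what a user would have allowlisted.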

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129244
Approved by: https://github.com/albanD
2024-06-25 04:19:44 +00:00
aa4ee2cb9e [Traceable FSDP2] Add Dynamo support for run_with_rng_state HOP (#127247)
Test command:
`pytest -rA test/inductor/test_compiled_autograd.py::TestCompiledAutograd::test_trace_run_with_rng_state`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127247
Approved by: https://github.com/bdhirsh
ghstack dependencies: #129414
2024-06-25 03:13:38 +00:00
b24787b757 [Traceable FSDP2] Don't decompose fsdp.split_with_sizes_copy (#129414)
This makes it easier to do pattern-matching on `fsdp.split_with_sizes_copy` in Inductor passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129414
Approved by: https://github.com/bdhirsh
2024-06-25 03:08:56 +00:00
e6bfa2958b Add aten._unsafe_masked_index (#116491)
To generate masked indexing operations that would generate
masked loads in triton code
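
As a rough mental model only, a masked index over a 1-D list might behave like the sketch below. `masked_index_1d` and its argument order are hypothetical; the operator's real signature and semantics are not spelled out in this message.

```python
def masked_index_1d(values, mask, indices, fill):
    """Pure-Python model of a masked gather over a 1-D list (illustrative only)."""
    # Where the mask is False the index is never dereferenced,
    # so it may be out of range without causing an error.
    return [values[i] if m else fill for m, i in zip(mask, indices)]


out = masked_index_1d([10, 20, 30], [True, False, True], [2, 99, 0], fill=0)
```

The out-of-range index 99 is harmless here because its mask entry is False.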

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116491
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-06-25 02:45:02 +00:00
4d04203852 [BE] Runner determinator: Expect usernames to be prefixed with '@' (#129246)
Expect the username in the runner rollover issue (https://github.com/pytorch/test-infra/issues/5132) to be prefixed with a "@".

This will make typos way less likely since GitHub's autocomplete/autoformatting will help out

For now, I've updated the issue to have usernames both with and without the @ while this change rolls out
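
A minimal sketch of the stricter check, assuming the issue body lists one entry per line. `is_opted_in` is a hypothetical helper, not the logic in `get_workflow_type.py`.

```python
def is_opted_in(issue_body, username):
    """Return True only if the user is listed with an '@' prefix."""
    entries = {line.strip() for line in issue_body.splitlines()}
    return "@" + username in entries


# During the rollout the issue lists both spellings, so either check passes.
transition_body = "ZainRizvi\n@ZainRizvi\n"
# A bare username alone no longer counts under the '@' rule.
old_style_body = "ZainRizvi\n"
```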

Testing:
Ran the script locally on both this issue and a new test issue and verified they both had the expected output:
```
(venv) (base) ➜  ~/pytorch git:(zainr/improve-get-workflow-type)
python .github/scripts/get_workflow_type.py --github-token github_pat_***  --github-issue 5132 --github-user ZainRizvi --github-branch "zainr/stuff"
{"label_type": "lf.", "message": "LF Workflows are enabled for ZainRizvi. Using LF runners."}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129246
Approved by: https://github.com/zxiiro, https://github.com/huydhn
2024-06-25 02:39:33 +00:00
533395e204 Fix build error on s390x (#129326)
This PR fixes the build error on s390x after #127195.

The build fails because `SYS_arch_prctl` is not defined on s390x. The following is the log of the build on s390x:
```
...
[792/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/FlushDenormal.cpp.o
[793/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o
FAILED: caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o
/usr/bin/c++ -DAT_PER_OPERATOR_HEADERS -DCAFFE2_BUILD_MAIN_LIB -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cpu_EXPORTS -I/pytorch/build/aten/src -I/pytorch/aten/src -I/pytorch/build -I/pytorch -I/pytorch/cmake/../third_party/benchmark/include -I/pytorch/third_party/onnx -I/pytorch/build/third_party/onnx -I/pytorch/third_party/foxi -I/pytorch/build/third_party/foxi -I/pytorch/torch/csrc/api -I/pytorch/torch/csrc/api/include -I/pytorch/caffe2/aten/src/TH -I/pytorch/build/caffe2/aten/src/TH -I/pytorch/build/caffe2/aten/src -I/pytorch/build/caffe2/../aten/src -I/pytorch/torch/csrc -I/pytorch/third_party/miniz-2.1.0 -I/pytorch/third_party/kineto/libkineto/include -I/pytorch/third_party/kineto/libkineto/src -I/pytorch/third_party/cpp-httplib -I/pytorch/aten/src/ATen/.. -I/pytorch/c10/.. 
-I/pytorch/third_party/FP16/include -I/pytorch/third_party/tensorpipe -I/pytorch/build/third_party/tensorpipe -I/pytorch/third_party/tensorpipe/third_party/libnop/include -I/pytorch/third_party/fmt/include -I/pytorch/third_party/flatbuffers/include -isystem /pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /pytorch/cmake/../third_party/googletest/googlemock/include -isystem /pytorch/cmake/../third_party/googletest/googletest/include -isystem /pytorch/third_party/protobuf/src -isystem /pytorch/cmake/../third_party/eigen -isystem /pytorch/build/include -Wno-maybe-uninitialized -Wno-uninitialized -Wno-free-nonheap-object -Wno-nonnull -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -O3 -DNDEBUG -DNDEBUG -fPIC -DTORCH_USE_LIBUV -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-type-limits -Wno-array-bounds -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wno-maybe-uninitialized -fvisibility=hidden -O2 -fopenmp -MD -MT caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o -MF caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o.d -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/cpu/Utils.cpp.o -c /pytorch/aten/src/ATen/cpu/Utils.cpp
/pytorch/aten/src/ATen/cpu/Utils.cpp: In function 'bool at::cpu::init_amx()':
/pytorch/aten/src/ATen/cpu/Utils.cpp:60:21: error: 'SYS_arch_prctl' was not declared in this scope; did you mean 'SYS_prctl'?
   60 |   long rc = syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA);
      |                     ^~~~~~~~~~~~~~
      |                     SYS_prctl
[794/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/Integration.cpp.o
[795/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/GridSampler.cpp.o
[796/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/detail/CPUGuardImpl.cpp.o
[797/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/ThreadLocalState.cpp.o
[798/2147] Building CXX object caffe2/CMakeFiles/vec_test_all_types_DEFAULT.dir/__/aten/src/ATen/test/vec_test_all_types.cpp.o
[799/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/Utils.cpp.o
[800/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/VmapModeRegistrations.cpp.o
[801/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/ZeroTensorFallback.cpp.o
[802/2147] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/autocast_mode.cpp.o
ninja: build stopped: subcommand failed.
Building wheel torch-2.5.0a0+git94dc325
-- Building version 2.5.0a0+git94dc325
cmake -GNinja -DBUILD_CAFFE2=0 -DBUILD_PYTHON=True -DBUILD_TEST=True -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/pytorch/torch -DCMAKE_PREFIX_PATH=/usr/local/lib/python3.10/dist-packages -DPython_EXECUTABLE=/usr/bin/python3 -DTORCH_BUILD_VERSION=2.5.0a0+git94dc325 -DUSE_GLOO=0 -DUSE_NUMPY=True /pytorch
cmake --build . --target install --config Release
Build step 'Execute shell' marked build as failure
...
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129326
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-06-25 02:39:13 +00:00
c4dd752d97 [dynamo][compile-time][inlining-inbuilt-nn-modules] Manually implement nn.Module._call_impl (#129285)
# Compile time for eager backend
## AlbertForMaskedLM
No inlining - 3.65 seconds
Inlining on main - 7.48 seconds
Inlining + this PR - 2.86 seconds

## MobileBertForMaskedLM
No inlining - 26.90 seconds
Inlining on main - 48.21 seconds
Inlining + this PR - 24.25 seconds

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129285
Approved by: https://github.com/jansel
ghstack dependencies: #129316, #129315
2024-06-25 01:31:26 +00:00
514f9279f8 [dynamo][compile-time] Manually implement nn.Module.__getattr__ to reduce compile time (#129315)
# Compile time for eager backend
## AlbertForMaskedLM
No inlining - 3.65 seconds
Inlining on main - 7.48 seconds
Inlining + this PR - 6.70 seconds

## MobileBertForMaskedLM
No inlining - 26.90 seconds
Inlining on main - 48.21 seconds
Inlining + this PR - 43.85 seconds

*Next PR in the stack makes the total compile time better/comparable to no inlining*

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129315
Approved by: https://github.com/jansel
ghstack dependencies: #129316
2024-06-25 01:31:26 +00:00
c012013aa6 Revert "Add Strided Input test for flex attention (#128915)"
This reverts commit 41bb81b58279f492e72bd270b3b071dd2953ed8c.

Reverted https://github.com/pytorch/pytorch/pull/128915 on behalf of https://github.com/huydhn due to Sorry for reverting your change but its tests are failing in trunk, i.e. 41bb81b582 (26627138290) ([comment](https://github.com/pytorch/pytorch/pull/128915#issuecomment-2187695317))
2024-06-25 00:43:34 +00:00
1315be4893 [aotinductor] only autotune at compile time when enabled via config (#129413)
internal breakage when enabled.

Test Plan: CI

Differential Revision: D58965784

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129413
Approved by: https://github.com/jingsh, https://github.com/desertfire
2024-06-25 00:41:10 +00:00
78e40b271b Change index_put on GPU to accept FP8 inputs (#128758)
As the title says, this PR changes the dispatcher for the CUDA index_put_ kernel to accept FP8 inputs. This is useful for Transformers models where the KV cache is FP8 and has been pre-allocated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128758
Approved by: https://github.com/eqy, https://github.com/drisspg
2024-06-25 00:38:03 +00:00
8b6391ee59 [Test][DTensor] Temporarily skip gloo test for test_depthwise_convolution (#129391)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129391
Approved by: https://github.com/awgu
2024-06-25 00:29:50 +00:00
81de71fdc5 [inductor] fix a double clone in coordesc tuning (#129399)
It's embarrassing that there is a hidden double clone bug in coordinate descent tuning.

In `CachingAutotuner.coordinate_descent_tuning`, we clone mutated args to make sure benchmarking does not cause numerical problems. But later on in `CachingAutotuner.bench` we do that again.

This double clone is fine if
- the tensor is small
- the allocation of the tensor is not on the critical path for memory footprint.

But neither holds for quite common usage of cross entropy loss.

This is related to the memory usage debugging in https://github.com/pytorch/pytorch/pull/129043 . Note that the general issue of peak memory usage increasing due to autotuning still exists. This bug just makes it worse (since we double allocate).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129399
Approved by: https://github.com/Chillee, https://github.com/jansel
2024-06-25 00:18:51 +00:00
14dc08ddc7 Inductor to fail gracefully on Voltas for bf16 tensors (#129288)
Volta (sm_7x) GPUs do not have hardware support for the bfloat16 datatype. It is emulated in software, so PyTorch eager can use bfloat16 tensors, but Triton cannot. So if a graph with either CUDA bf16 input or output tensors is used, raise warnings and skip the frame.

Add optional parameter `including_emulation` to the `torch.cuda.is_bf16_supported` method and call it from `torch._inductor.compile_fx._check_triton_bf16_support`.

Test plan: Modify `is_bf16_supported` to return False and see that warning is generated

Fixes https://github.com/pytorch/pytorch/issues/118122 and https://github.com/pytorch/pytorch/issues/118581

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129288
Approved by: https://github.com/eqy, https://github.com/jansel
2024-06-25 00:04:13 +00:00
4c1e4c5f30 [inductor] Enable FX graph caching in OSS by default (#125863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125863
Approved by: https://github.com/eellison, https://github.com/oulgen
ghstack dependencies: #129257
2024-06-24 23:39:43 +00:00
7b57ddd38c [inductor] Fix TORCHINDUCTOR_FORCE_DISABLE_CACHES (#129257)
Summary: See https://github.com/pytorch/pytorch/issues/129159; this option wasn't doing its job for a few reasons. In this PR:
* Fix the with_fresh_cache_if_config() decorator
* Reset the "TORCHINDUCTOR_CACHE_DIR" & "TRITON_CACHE_DIR" env vars in sub-process to support them changing in the parent process

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129257
Approved by: https://github.com/oulgen
2024-06-24 23:39:43 +00:00
b22f0f5f51 [torchbind] fix bug of mutating FakeScriptObjects twice in aot_export (#128844)
This PR does two things:
1. It duplicates the fake script object because aot_export traces the program twice. Without the copy, mutations made during the first trace would make the result of the second trace wrong.
2. It also adds a new test for methods that return constant outputs. Before the PR, there is no meta["val"] for these nodes because fx won't track these constants. We still need to preserve these constant return operators in the graph because torchbind objects are stateful and deleting them would remove the implicit state mutation inside of the object.
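
The double-mutation problem in item 1 can be illustrated with ordinary Python objects. `FakeQueue` and `trace` are hypothetical stand-ins, and `copy.deepcopy` is used here only to model the duplication; the PR's actual copying mechanism is not shown in this message.

```python
import copy


class FakeQueue:
    """Hypothetical stateful object standing in for a fake script object."""

    def __init__(self, items):
        self.items = list(items)

    def pop(self):
        return self.items.pop(0)


def trace(obj):
    # Stand-in for one tracing pass: it calls a mutating method on the object.
    return obj.pop()


# Sharing one object: the second pass observes state left by the first.
shared = FakeQueue([1, 2])
shared_results = (trace(shared), trace(shared))

# Duplicating before each pass: both passes start from the same state.
original = FakeQueue([1, 2])
copied_results = (trace(copy.deepcopy(original)), trace(copy.deepcopy(original)))
```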

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128844
Approved by: https://github.com/angelayi
2024-06-24 23:14:34 +00:00
41bb81b582 Add Strided Input test for flex attention (#128915)
Test strided inputs to the flex_attention HOP. Similar to how inputs are generated in
https://github.com/pytorch/pytorch/blob/main/benchmarks/transformer/score_mod.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128915
Approved by: https://github.com/Chillee, https://github.com/drisspg
2024-06-24 22:56:39 +00:00
00f675bb4c [Nested Tensor]fix sdpa backward for the special case with ragged second batch dim and constant length (#128349)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128349
Approved by: https://github.com/jbschlosser
2024-06-24 22:35:07 +00:00
7b7f357042 Fix DEBUG=1 asserts with NJT ops (#129014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129014
Approved by: https://github.com/YuqingJ, https://github.com/soulitzer
2024-06-24 22:32:01 +00:00
5f912f480c Fix max_pool2d decomposition for empty list and integer limits (#129106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129106
Approved by: https://github.com/peterbell10, https://github.com/lezcano, https://github.com/malfet
ghstack dependencies: #129096, #129097
2024-06-24 22:19:42 +00:00
e096faaf30 Fix rot90 decomposition for no rotation (#129097)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129097
Approved by: https://github.com/peterbell10
ghstack dependencies: #129096
2024-06-24 22:19:42 +00:00
fbca70718f Fix scatter lowering when src is a Number (#129096)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129096
Approved by: https://github.com/peterbell10
2024-06-24 22:19:39 +00:00
8edb7b96b1 Enable dynamic rollout for pull workflow (#129243)
Enables dynamic migration of jobs to the LF AWS account for the pull workflow. For now, it leaves out a few jobs that need a bit more testing, namely Windows and Android runners.

The new runners are only given to people specified in this issue:
https://github.com/pytorch/test-infra/issues/5132

Note: The non-pull jobs updated are the ones that are synced to jobs in pull.yml (via `sync-tag`) and thus have to be updated whenever their corresponding pull.yml jobs are edited.

Based on https://github.com/pytorch/pytorch/pull/128597
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129243
Approved by: https://github.com/zxiiro, https://github.com/huydhn, https://github.com/malfet
2024-06-24 22:15:53 +00:00
30bfdf1afc Errors when 0-dim tensor of complex or bool type passed to aminmax. (#128404)
Fixes #126742

Added errors for the case of 0-dim tensors of complex or bool types passed to aminmax.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128404
Approved by: https://github.com/janeyx99
2024-06-24 21:46:49 +00:00
18fdc0ae5b [executorch hash update] update the pinned executorch hash (#129099)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129099
Approved by: https://github.com/pytorchbot
2024-06-24 21:01:40 +00:00
93a33bf3ac [BE] update type annotations for basic utilities in torch/__init__.py (#129001)
Changes:

1. Make some arguments positional-only as we only support Python 3.8+
2. Clean up `torch.typename(obj)` implementation.
3. Update type annotations, especially for `is_tensor()` and `is_masked_tensor()`, using `TypeGuard`.
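
As a rough illustration of items 1 and 3, the sketch below combines a positional-only parameter with a `TypeGuard` return type. The functions `is_str_list` and `join_if_strings` are hypothetical and are not part of `torch/__init__.py`; `TypeGuard` lives in `typing` from Python 3.10, so the fallback import assumes `typing_extensions` is available on older interpreters.

```python
from typing import Any, List

try:
    from typing import TypeGuard  # Python 3.10+
except ImportError:  # older interpreters
    from typing_extensions import TypeGuard


def is_str_list(values: List[Any], /) -> TypeGuard[List[str]]:
    """Return True when every element is a str; `values` is positional-only."""
    return all(isinstance(v, str) for v in values)


def join_if_strings(values: List[Any]) -> str:
    if is_str_list(values):
        # A type checker narrows `values` to List[str] in this branch.
        return ",".join(values)
    return ""
```

Passing `values` by keyword, as in `is_str_list(values=[...])`, raises `TypeError` because of the `/` marker.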

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129001
Approved by: https://github.com/malfet
2024-06-24 18:04:38 +00:00
1a54bb0f96 Revert "[halide-backend] Initial implementation of HalideKernel and HalideScheduling (#126417)"
This reverts commit 4f9399bd0d2bc0cbd14348b80e32b263de5c6bc0.

Reverted https://github.com/pytorch/pytorch/pull/126417 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/126417#issuecomment-2186999121))
2024-06-24 16:50:15 +00:00
063facf352 Revert "[halide-backend] Generate standalone runtime (#129025)"
This reverts commit 10c64c3b49e2008a50f9229e600c68c8a3d49292.

Reverted https://github.com/pytorch/pytorch/pull/129025 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/129025#issuecomment-2186995467))
2024-06-24 16:47:25 +00:00
c888ee3632 Remove test_mps_allocator_module XFAIL (#129340)
Not sure why this test starts to fail (maybe runner update) 8a2fed7e6a/1 or why it was XFAIL in this old PR https://github.com/pytorch/pytorch/pull/97151, but the test is passing locally for me now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129340
Approved by: https://github.com/kit1980
2024-06-24 16:26:38 +00:00
cb4919344a Revert "[BE] update type annotations for basic utilities in torch/__init__.py (#129001)"
This reverts commit e53d9590287cbf97521f96d055910394f6e9a849.

Reverted https://github.com/pytorch/pytorch/pull/129001 on behalf of https://github.com/XuehaiPan due to lint failure ([comment](https://github.com/pytorch/pytorch/pull/129001#issuecomment-2186944549))
2024-06-24 16:18:43 +00:00
7b910285db Revert "[inductor] Refactor fusion of inplace operations (#128979)"
This reverts commit 72e3aca227ae1e3dc1b91aee415cf27b0cb22f2b.

Reverted https://github.com/pytorch/pytorch/pull/128979 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128979#issuecomment-2186846940))
2024-06-24 15:29:40 +00:00
df51d0b623 [aotinductor][UserDefinedTritonKernel] use appropriate expr printer when printing args (#129301)
Encountered the following C++ compile error.
```
Declared in this scope; did you mean ‘std::max’?
  619 |     auto var_5 = max(1, u0);
```

This PR uses the C++ printer during C++ codegen. Before this PR, the Python printer was used even during C++ codegen.
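
The idea can be sketched in isolation as follows. `PythonPrinter`, `CppPrinter`, and `print_kernel_arg` are hypothetical names chosen for illustration and do not mirror the actual inductor classes; the sketch only shows why the printer has to match the language being generated.

```python
class PythonPrinter:
    """Hypothetical printer that emits Python source."""

    def print_max(self, a, b):
        return f"max({a}, {b})"


class CppPrinter:
    """Hypothetical printer that emits C++ source."""

    def print_max(self, a, b):
        return f"std::max({a}, {b})"


def print_kernel_arg(a, b, cpp_codegen):
    # Pick the printer that matches the language being generated.
    printer = CppPrinter() if cpp_codegen else PythonPrinter()
    return printer.print_max(a, b)
```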

Differential Revision: [D58913123](https://our.internmc.facebook.com/intern/diff/D58913123)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129301
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-06-24 15:23:05 +00:00
e53d959028 [BE] update type annotations for basic utilities in torch/__init__.py (#129001)
Changes:

1. Make some arguments positional-only as we only support Python 3.8+
2. Clean up `torch.typename(obj)` implementation.
3. Update type annotations, especially for `is_tensor()` and `is_masked_tensor()`, using `TypeGuard`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129001
Approved by: https://github.com/malfet
2024-06-24 14:35:41 +00:00
c89a9f5d17 Allow SAC policy_fn to return bool for backward compatibility (#129262)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129262
Approved by: https://github.com/Chillee, https://github.com/fmassa
ghstack dependencies: #125795, #128545
2024-06-24 13:54:30 +00:00
9094248090 [FSDP2] Fixed unshard without lazy init (#129241)
Previously, the `FSDPCommContext` only defines the stream attributes when `FSDPCommContext.init` is called from lazy initialization. This means that if the user calls `module.unshard()` before lazy init (e.g. first forward pass), then it would error in `wait_for_unshard()`. This PR fixes this by making sure that the stream attributes are defined, only with the default stream, at construction time.
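
A minimal sketch of the pattern, assuming nothing about the real `FSDPCommContext` beyond what is described above. The class `CommContext`, its attribute name, and the string placeholders for streams are all hypothetical.

```python
class CommContext:
    """Hypothetical stand-in for a communication context holding streams."""

    DEFAULT_STREAM = "default-stream"

    def __init__(self):
        # Define the attribute at construction so it always exists.
        self.all_gather_stream = self.DEFAULT_STREAM

    def lazy_init(self):
        # Lazy initialization later swaps in a dedicated stream.
        self.all_gather_stream = "dedicated-all-gather-stream"


def unshard(ctx):
    # Reads the attribute; this would raise AttributeError if the attribute
    # were only defined inside lazy_init().
    return ctx.all_gather_stream
```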

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129241
Approved by: https://github.com/Skylion007, https://github.com/weifengpy
2024-06-24 13:31:54 +00:00
d21f311af8 [Easy][Traceable FSDP2] Skip rocm for the E2E tests (#129339)
The CUDA implementation of `resize_storage_bytes_` doesn't run on rocm yet, so need to skip it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129339
Approved by: https://github.com/msaroufim
2024-06-24 06:38:33 +00:00
662e9e1076 [BE] enable UFMT for torch/nn/functional.py (#128592)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128592
Approved by: https://github.com/mikaylagawarecki
2024-06-24 06:24:12 +00:00
8a2fed7e6a [Inductor][CPP] Fallback QLinear Binaryfusion from postop sum to binary add when others is view (#128808)
**Summary**
In the int8 GEMM template, we view the input from 3D to 2D and view the output back to 3D for QLinear, which makes the output of this QLinear a `view`. If this output view is an input to a QLinear-Binary fusion, it breaks the assumption of QLinear-Binary with the in-place post op `sum`. For this case, we change the post-op name from in-place `sum` to out-of-place `add`, similar to the FP32/BF16 Linear in-place handling in 1208347d09/torch/_inductor/fx_passes/mkldnn_fusion.py (L541-L543).

**TestPlan**
```
clear && numactl -C 56-111 -m 1 python -u -m pytest -s -v inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_dequant_promotion_cpu_input_dim_exceeds_2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128808
Approved by: https://github.com/jgong5
ghstack dependencies: #128804
2024-06-24 01:12:18 +00:00
287c68c5ec [Inductor][Quant] Use output dtype torch.uint8 explicitly (#128804)
**Summary**
Previously, we implicitly used `None` as the output data type for uint8 in the lowering of QLinear/QConv. This is unclear, so we should use `torch.uint8` explicitly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128804
Approved by: https://github.com/Xia-Weiwen, https://github.com/jgong5
2024-06-24 01:08:49 +00:00
7b9e6430ed [Split Build] Add periodic and trunk CI for cuda builds (#129269)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129269
Approved by: https://github.com/atalman
2024-06-23 17:04:37 +00:00
f85d1e845a [BE] enable UFMT for torch/nn/*.py (#128593)
Part of #123062

- #123062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128593
Approved by: https://github.com/mikaylagawarecki
2024-06-23 16:05:13 +00:00
dadc0ed4c8 [Traceable FSDP2] Add aot_eager backend E2E tests for transformer model (#129157)
This PR adds Traceable FSDP2 `aot_eager` backend E2E tests for simple MLP as well as transformer model.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129157
Approved by: https://github.com/awgu
ghstack dependencies: #129203
2024-06-23 06:11:11 +00:00
b91a9dc328 [Brian's PR #128754] Use torch.ops.fsdp.set_ for FSDP2 storage resize; dont functionalize resize_, set_, split_with_sizes_copy.out (#129203)
This is a copy of Brian's PR https://github.com/pytorch/pytorch/pull/128754, with some changes in the test_distributed_patterns.py unit tests to more closely reflect FSDP2 patterns. Also disabled two tests `test_input_mutation_storage_resize_up_down` and `test_input_mutation_storage_resize_not_supported` in test_aotdispatch.py until we figure out the right behavior for them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129203
Approved by: https://github.com/bdhirsh
2024-06-23 06:07:19 +00:00
62ccf6d7cd [BE] enable UFMT for torch/nn/modules (#128594)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128594
Approved by: https://github.com/mikaylagawarecki
2024-06-23 05:37:57 +00:00
440d8fbd4a FSDP2 Memory Tracker (#125323)
### Why do we need the FSDP Memory Tracker?

**Tuning Decisions**

1. What is the expected peak memory with current configuration?
2. If I change my FSDP wrapping, how much effect will it have on peak memory?
3. What is the best batch size to use?
4. What is the maximum sequence length that one can run with current configuration?
5. How does increasing/decreasing the “DP” world size affect peak memory?
6. How much memory do I save if I move the optimizer to the CPU?
7. Which activation checkpointing policy should I use?
8. If I have various SAC policies, how do they compare against each other?
9. What happens if I apply different SAC policies to different FSDP units?
10. If I make my gradient reduction in fp32, what effect will it have on memory?
11. If I want to use a custom mixed precision policy, how will it affect the peak memory?
12. When does it make sense to use HSDP?
13. Can I reshard to a smaller mesh without increasing peak memory substantially?
14. Can I safely disable post-forward reshard without causing an OOM?

**Debugging**

1. Which module contributes most to activation memory?
2. Which FSDP unit is holding a lot of unsharded memory?
3. AC is not releasing memory?

The FSDP2 Memory Tracker addresses all of the above. It is based on:
 *  #124688
 *  #128508

Example and Output:

```
if __name__ == "__main__":
    from contextlib import nullcontext
    from functools import partial
    import torch
    from torch.distributed._composable import checkpoint
    from torch.distributed._composable.fsdp import (
        CPUOffloadPolicy,
        fully_shard,
        MixedPrecisionPolicy,
    )
    from torch.distributed._tensor import DeviceMesh
    from torch.distributed._tools.fsdp2_mem_tracker import FSDPMemTracker
    from torch._subclasses.fake_tensor import FakeTensorMode
    from torch.testing._internal.distributed._tensor.common_dtensor import (
        ModelArgs,
        Transformer,
        TransformerBlock,
    )
    from torch.testing._internal.distributed.fake_pg import FakeStore
    dev = torch.device("cuda:0")
    torch.cuda.set_device(dev)
    world_size = 4
    store = FakeStore()
    torch.distributed.init_process_group(
        "fake", rank=0, world_size=world_size, store=store
    )
    mesh = DeviceMesh("cuda", torch.arange(0, world_size))
    torch.cuda.empty_cache()
    torch.manual_seed(42)
    use_fake_mode = False
    with FakeTensorMode() if use_fake_mode else nullcontext():
        vocab_size = 8192
        bsz, seq_len = 32, 1024
        with torch.device(dev):
            model_args = ModelArgs(
                n_layers=2,
                n_heads=16,
                vocab_size=vocab_size,
                max_seq_len=seq_len,
                dropout_p=0.1,
            )
            model = Transformer(model_args)
        foreach = True
        mp_policy = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)
        offload_policy = CPUOffloadPolicy(pin_memory=not use_fake_mode)
        reshard_after_forward = True
        fsdp_config = {

        }
        fully_shard_fn = partial(
            fully_shard,
            mesh=mesh,
            reshard_after_forward=reshard_after_forward,
            offload_policy=offload_policy,
            mp_policy=mp_policy,
        )
        for module in model.modules():
            if isinstance(module, TransformerBlock):
                checkpoint(module, preserve_rng_state=not use_fake_mode)
                fully_shard_fn(module)
        fully_shard_fn(model)
        optim = torch.optim.Adam(model.parameters(), lr=1e-2, foreach=foreach)

        torch.manual_seed(42)
        inp = torch.randint(0, vocab_size, (bsz, seq_len), device=dev)
        torch.cuda.reset_accumulated_memory_stats()
        torch.cuda.reset_peak_memory_stats()
        fmt = FSDPMemTracker(model, optim)
        fmt.track_inputs((inp,))
        with fmt:
            for iter_idx in range(2):
                loss = model(inp).sum()
                loss.backward()
                optim.step()
                optim.zero_grad()
                if iter_idx == 0:
                    fmt.reset_mod_stats()
    mem_stats = torch.cuda.memory_stats()
    tracker_peak = fmt.get_tracker_snapshot("peak")[dev]["Total"]
    cuda_peak_active = mem_stats["active_bytes.all.peak"]
    fmt.display_modulewise_snapshots(depth=4, units="MiB", tabulate=True)
    fmt.display_snapshot("peak", units="MiB", tabulate=True)
    print(
        f"peak active: {cuda_peak_active / (1024**3)} GiB | "
        f"Tracker Max: {tracker_peak / (1024 ** 3)} GiB"
    )
    if not use_fake_mode:
        print(f"Accuracy: {tracker_peak/cuda_peak_active}")

    try:
        torch.distributed.destroy_process_group()
    except Exception as e:
        print(e)
```

<img width="1236" alt="Screenshot 2024-06-21 at 5 16 49 PM" src="https://github.com/pytorch/pytorch/assets/12934972/9be40b8b-e635-4112-b111-418413e6b959">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125323
Approved by: https://github.com/awgu
2024-06-23 05:23:00 +00:00
17d1723aee [dynamo][unspecialized-nn-modules] Remove dead (also incorrect) code (#129316)
This code is unused because we just inline the `.parameters` call. The code was also wrong because side effects only track the first level of mutations: an object might not be marked as mutated if one of its child objects (like a dict) is mutated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129316
Approved by: https://github.com/jansel
2024-06-23 03:02:27 +00:00
cac6f99d41 Fix Windows CUDA periodic inductor/test_pattern_matcher test (#129198)
The check was run on Windows and crashed there because Windows doesn't have triton, i.e. https://github.com/pytorch/pytorch/actions/runs/9606662121/job/26502347998#step:15:13196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129198
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi, https://github.com/malfet
2024-06-23 02:32:27 +00:00
749c03406c [metal] Add int4mm weight packing mps kernel, and improved int4mm shader (#128965)
Adds _convert_weight_to_int4pack MPS kernel
Replaces the previous int4mm Metal shader with a shader authored by @kimishpatel, which improves perf by ~40%

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128965
Approved by: https://github.com/malfet
2024-06-23 02:10:46 +00:00
856541c701 [custom_op] support default dtype values (#129189)
This PR:
- moves some of the dtype-string utilities into ScalarType.{h, cpp}
- adds a new utility to get a mapping from dtype name to the C++ dtype
- the parser now checks if the string is a dtype name; if it is, then it
  pulls the C++ dtype from the mapping.
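
The parser behavior in the last bullet can be sketched roughly as below. The mapping contents, the string values standing in for C++ dtypes, and the `parse_default` function are assumptions made for illustration; they are not the actual schema parser.

```python
# Hypothetical mapping from dtype name to a dtype (plain strings here).
DTYPE_BY_NAME = {
    "float32": "ScalarType::Float",
    "int64": "ScalarType::Long",
}


def parse_default(text):
    """Parse a default-value string from a schema-like signature."""
    if text in DTYPE_BY_NAME:
        # The string names a dtype, so return the mapped dtype.
        return DTYPE_BY_NAME[text]
    if text == "None":
        return None
    try:
        return int(text)
    except ValueError:
        return float(text)
```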

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129189
Approved by: https://github.com/albanD
ghstack dependencies: #129177, #129178, #129179
2024-06-23 00:13:23 +00:00
3e02ecd740 Test only one sample with huber_loss (#129245)
Fixes https://github.com/pytorch/pytorch/issues/129238

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129245
Approved by: https://github.com/huydhn
2024-06-22 21:15:39 +00:00
94dc3253a0 [BE][Easy] enable UFMT for torch/distributed/ (#128870)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128870
Approved by: https://github.com/fegin, https://github.com/wconstab
2024-06-22 18:53:28 +00:00
e165a5971f [Traceable FSDP2] Fix support for CUDA resize_storage_bytes_ (#129215)
Currently if `x` is a CUDA tensor, calling `x.untyped_storage().resize_()` seems to always go into the `built without cuda` branch of `resize_storage_bytes_()` regardless of whether PyTorch is built with CUDA. I suspect this is because `inductor_ops.cpp` is only included in `libtorch_cpu.so` thus doesn't have the `USE_CUDA` information or ability to link to CUDA-related functions.

This PR moves `resize_storage_bytes_()` related custom op functions out of `inductor_ops.cpp` into its standalone file `resize_storage_bytes.cpp` to be included in `libtorch_python.so` instead. This mimics the setup for `StorageMethods.cpp`. This way, `resize_storage_bytes_()` can have access to the CUDA-related functions, which passes the CUDA unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129215
Approved by: https://github.com/jansel
2024-06-22 18:38:47 +00:00
0e6118a68e [dtensor][debug] added logging module tracing table to file feature (#128721)
**Summary**
Currently, the only way for users to view the module tracing table is to print it to the console, which can be hard to read. I have added functionality to comm_debug_mode that lets a user log the module tracing table to an output.txt file, giving the user more options for viewing module tracing. I have implemented the use case in the module tracing examples. The expected output is shown below for MLPModule tracing:
<img width="349" alt="Screenshot 2024-06-14 at 10 39 07 AM" src="https://github.com/pytorch/pytorch/assets/50644008/a05288a9-3cdb-483b-8e27-daab50da6251">

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing

2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128721
Approved by: https://github.com/tianyu-l, https://github.com/XilunWu
ghstack dependencies: #128720
2024-06-22 18:14:13 +00:00
1afd492d88 [dtensor][example] add functionality allowing users to choose which example they'd like to run (#128720)
**Summary**
The previous example file would run all examples at the same time, leading to confusing output as the 4 processes would mix up the order. To fix this, I have added the ability to choose which example to run, making the output easier for users to read. Due to the import from torch.testing._internal.distributed._tensor.common_dtensor, the argparser from a file in the dependency tree would overwrite the argparser that I attempted to place in the example file. As a result, I created an argparser in a different file and imported it above the previously mentioned import.

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_distributed_sharding_display

2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLPStacked_distributed_sharding_display

3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing

4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing

5. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -h

The first four outputs will be the same as the outputs seen in previous PRs. The expected output for help argument is seen below:
<img width="931" alt="Screenshot 2024-06-14 at 10 25 06 AM" src="https://github.com/pytorch/pytorch/assets/50644008/547ca112-1e7a-4769-857a-558292c6fe7b">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128720
Approved by: https://github.com/XilunWu
2024-06-22 18:14:13 +00:00
10c64c3b49 [halide-backend] Generate standalone runtime (#129025)
This puts the halide runtime in a global shared object, rather than copying it to each kernel.  Having many copies of the runtime causes many issues with cuda.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129025
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #126417
2024-06-22 17:39:52 +00:00
4f9399bd0d [halide-backend] Initial implementation of HalideKernel and HalideScheduling (#126417)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126417
Approved by: https://github.com/shunting314, https://github.com/eellison
2024-06-22 17:39:52 +00:00
79aabaf626 [3.13, dynamo] codegen PUSH_NULL when callable is codegen'd (#129172)
Significant bytecode generation API change!

The new suggested convention for generating bytecode to call a function is to wrap the instructions that push a callable onto the stack with `add_push_null`; the callable is then called with `create_call_function` with `push_null=False` (see diff for examples).

In Python 3.13, NULL is now expected to be pushed after the callable. In <=3.12, the NULL was pushed before the callable.  This change abstracts away the exact placement of the NULL, but the developer must be aware that a NULL may be needed when codegen'ing a callable.

This abstraction also reduces the need for the `push_null=True` option in `create_call_function`, which removes the need to rotate a NULL to the right place on the stack with a sequence of `SWAP` instructions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129172
Approved by: https://github.com/jansel
2024-06-22 17:25:23 +00:00
905dfa186c Fix ConstraintViolationError exception string when exprs are int (#129271)
As titled. If `expr1` and `expr2` are ints, there is no need to call `.xreplace`.

See example error:

```
UserError: L['args'][0][0].size()[1] = 35 is not equal to L['args'][0][2].size()[1] = 23
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129271
Approved by: https://github.com/lezcano
2024-06-22 16:33:40 +00:00
920ebccca2 [inductor][cpp] refactor CppTemplateKernel to inherit CppKernel (#129101)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129101
Approved by: https://github.com/leslie-fang-intel
2024-06-22 12:50:37 +00:00
72e3aca227 [inductor] Refactor fusion of inplace operations (#128979)
`WeakDep`s force readers to have completed before a mutation overwrites the
buffer, but we want to allow fusions to occur for inplace mutations where the
same index is read and written.

Currently this is achieved by:
1. Identifying the buffers used by the mutating op in its `dep_closure`
2. Not creating `WeakDep`s for buffers in the `dep_closure`
3. Fixing up any bad fusions that might occur by an extra check in `can_fuse_vertical`

So we are first over-aggressive in removing `WeakDep`s, then add an ad-hoc fixup.

This PR instead emits all `WeakDep`s and adds a `fusable_weak_dep` check to
`can_fuse_vertical` which selectively allows inplace operation to fuse.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128979
Approved by: https://github.com/lezcano
ghstack dependencies: #129082, #129083
2024-06-22 12:38:22 +00:00
88a35b5b64 BE: Use future annotations in _inductor/comms.py (#129083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129083
Approved by: https://github.com/lezcano
ghstack dependencies: #129082
2024-06-22 12:38:22 +00:00
73ba226d98 [inductor] Linear time dead node elimination (#129082)
The nodes are already topologically sorted by this point, so DCEing a chain of
nodes will take one full iteration per node. Simply reversing the iteration
order means all users will be removed before checking a node.
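
The reasoning above can be sketched with a small, self-contained example. This is not the inductor scheduler code: the function name, the `users` mapping, and the `has_side_effects` set are hypothetical, and the sketch assumes every node that appears in `users` is also listed in `nodes`.

```python
def eliminate_dead_nodes(nodes, users, has_side_effects):
    """Remove dead nodes in a single pass.

    `nodes` is topologically sorted (producers before consumers), `users`
    maps each node to the set of nodes that read it, and
    `has_side_effects` is the set of nodes that must always be kept.
    """
    # Invert `users` so each consumer knows which producers it reads.
    inputs = {}
    for producer, consumers in users.items():
        for consumer in consumers:
            inputs.setdefault(consumer, set()).add(producer)
    live_users = {node: set(users.get(node, ())) for node in nodes}
    kept = []
    # Visit consumers first so a dead consumer is gone before its producer
    # is checked.
    for node in reversed(nodes):
        if live_users[node] or node in has_side_effects:
            kept.append(node)
        else:
            for producer in inputs.get(node, ()):
                live_users[producer].discard(node)
    kept.reverse()
    return kept
```

With forward iteration, a dead chain `b -> c` would lose only `c` on the first pass, since `b` still has a user when it is visited. Reverse iteration removes both in one pass.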

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129082
Approved by: https://github.com/lezcano
2024-06-22 12:38:17 +00:00
cb126711cd [merge_rule] add more cpp inductor files (#129192)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129192
Approved by: https://github.com/leslie-fang-intel, https://github.com/atalman
2024-06-22 09:04:14 +00:00
b57fa8d9c0 [BE] Remove JNI from libtorch builds (#124995)
Removes jni files from the libtorch build as we do not plan to distribute them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124995
Approved by: https://github.com/malfet
2024-06-22 07:41:54 +00:00
9ffdbb5d12 Forward Fix PR for #128683 (#129037)
Summary:
This forward fixes this diff:
D58699985

Since we have a few things in flight it would be much better to forward fix this test

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:test_inductor_cuda -- --exact 'caffe2/test/inductor:test_inductor_cuda - test_red_followed_by_transposed_pointwise (caffe2.test.inductor.test_torchinductor.TritonCodeGenTests)'

Differential Revision: D58767577

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129037
Approved by: https://github.com/vkuzo
2024-06-22 05:50:21 +00:00
64743de6d8 [Split Build][BE] consolidate pip install commands (#129253)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129253
Approved by: https://github.com/atalman
ghstack dependencies: #129011
2024-06-22 05:49:14 +00:00
7661d1220a [Split Build] Fix typo in pull ci (#129270)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129270
Approved by: https://github.com/atalman
2024-06-22 05:48:01 +00:00
b0044e2e18 [Split Build] Support nightly release (#129011)
This PR adds the split build to our binaries workflow. Validation for the workflow is done using the PR above in conjunction with https://github.com/pytorch/builder/pull/1876.

Test Workflow: Check CI in the workflow above
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129011
Approved by: https://github.com/atalman
2024-06-22 05:45:14 +00:00
b72ef9df0d Update torchbench model expected accuracy values after pinning numpy (#129213)
After pinning numpy on torchbench, we need to move torchbench inductor benchmark jobs out of unstable state asap, so that more failures don't sneak in. I'm updating the expected values here to make trunk green.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129213
Approved by: https://github.com/xuzhao9, https://github.com/malfet, https://github.com/desertfire
2024-06-22 04:59:50 +00:00
f42d5b6dca [Memory Snapshot] Make recordAnnotations callback initialize lazily (#129242)
Summary: Make the recordAnnotations' Record function callback lazily initialize when record memory history starts. This will help reduce the impact on Time To First Batch metric.
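
A minimal sketch of lazy callback registration, using a hypothetical `MemoryRecorder` class rather than the real memory snapshot code. Registration is deferred until recording starts, so nothing is paid before then.

```python
class MemoryRecorder:
    """Hypothetical recorder that registers its callback on first use."""

    def __init__(self):
        self.callback_registered = False
        self.events = []

    def _register_callback(self):
        # Stand-in for the comparatively costly registration step.
        self.callback_registered = True

    def start_recording(self):
        # Register only when recording actually starts, not at import or
        # construction time.
        if not self.callback_registered:
            self._register_callback()

    def annotate(self, name):
        # Annotations are recorded only once the callback is in place.
        if self.callback_registered:
            self.events.append(name)
```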

Test Plan: CI and ran locally.

Differential Revision: D58875576

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129242
Approved by: https://github.com/zdevito
2024-06-22 04:05:55 +00:00
858fb05dac Modify ExternKernelAlloc with NoneLayout to not assign its result to anything (#129188)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129188
Approved by: https://github.com/yifuwang
2024-06-22 02:57:44 +00:00
2f8b301c32 Clean up distributed/CONTRIBUTING.md (#128450)
Click [here](cf6c88af48/torch/distributed/CONTRIBUTING.md) to see the rendered version of the file in this PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128450
Approved by: https://github.com/wanchaol
2024-06-22 02:41:22 +00:00
5b14943213 Run TestAOTAutograd test suite with cache (#128222)
This diff introduces AOTAutogradTestWithCache, which runs AOTAutogradTests with both dynamo and AOTAutogradCache.

To do this, for any verify_aot_autograd() calls in the original tests, we run compiled_f an extra time. We also turn on a new strict mode that throws any time a cache is missed due to weird reasons, like BypassAOTAutogradCache or FxGraphCacheMiss.
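
The "run an extra time under strict mode" idea can be sketched as follows. `CompileCache`, `CacheMiss`, `verify_with_cache`, and `double` are hypothetical names for illustration and are not the AOTAutogradCache API.

```python
class CacheMiss(Exception):
    """Raised in strict mode when a lookup misses unexpectedly."""


class CompileCache:
    """Hypothetical compile cache keyed by function name."""

    def __init__(self, strict=False):
        self.strict = strict
        self.entries = {}
        self.hits = 0

    def compile(self, fn, expect_hit=False):
        key = fn.__name__
        if key in self.entries:
            self.hits += 1
            return self.entries[key]
        if self.strict and expect_hit:
            raise CacheMiss(key)
        self.entries[key] = fn
        return fn


def verify_with_cache(fn, arg, cache):
    # First run populates the cache; the extra run must be served from it.
    first = cache.compile(fn)(arg)
    second = cache.compile(fn, expect_hit=True)(arg)
    assert first == second
    return second


def double(x):
    return 2 * x
```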

We use a mocked version of FXGraphCache to decrease the number of variables for these tests. The normal tests in test_aot_autograd_cache.py will still run with FXGraphCache. I might change my mind and unmock these in the future.

In total, 87 of the tests pass naturally. None of the tests fail in non strict cache mode, so the cache never crashes, it just misses more often than we'd like. The remaining 27 tests fail due to relatively simple (though not necessarily easy to fix) reasons. I'll fix the remaining test failures in the next few PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128222
Approved by: https://github.com/bdhirsh
2024-06-22 02:13:28 +00:00
c5b9ee7408 [easy][dynamo] Remove try except from call_getattr (#129217)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129217
Approved by: https://github.com/lezcano
ghstack dependencies: #129098, #129015
2024-06-21 23:56:00 +00:00
1c75ddff35 Revert "[cuDNN] Graph-capturable cuDNN CTCLoss (#128271)"
This reverts commit 40e8675fcbb233c98ec532607d5cd421ec850253.

Reverted https://github.com/pytorch/pytorch/pull/128271 on behalf of https://github.com/malfet due to This makes PyTorch buildable only with CuDNN v9 ([comment](https://github.com/pytorch/pytorch/pull/128271#issuecomment-2183576996))
2024-06-21 23:29:20 +00:00
ef55446538 [FSDP2] Add 'TORCH_LOGS=+fsdp' to log hooks(pre/post forward/backward) and FQN (_init_fqns) (#128663)
Summary:
Add  '`TORCH_LOGS=+fsdp`' in the CLI to print fsdp logs
Example:
`TORCH_LOGS=+fsdp torchrun --standalone --nproc_per_node=2 run_fsdp.py`
Description:
Add logging to `FSDPParamGroup.pre_forward`, `FSDPParamGroup.post_forward`, `FSDPParamGroup.pre_backward`, and `FSDPParamGroup.post_backward`, `FSDPState._root_pre_forward` if it is the root, and `FSDPState._root_post_backward_final_callback`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128663
Approved by: https://github.com/weifengpy, https://github.com/awgu
2024-06-21 23:25:58 +00:00
9d1b65b569 [PT2][Observability] Change the log logic (#129201)
Summary: We only log the multiplier when users change the default value.

Test Plan: see signal

Differential Revision: D58854330

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129201
Approved by: https://github.com/Skylion007, https://github.com/dshi7
2024-06-21 21:48:34 +00:00
40e8675fcb [cuDNN] Graph-capturable cuDNN CTCLoss (#128271)
cuDNN v8.x added a graph-capturable CTCLoss, which slots "neatly" into the `Tensor` variant

~~WIP as cuDNN has a restriction on the max target length (255), but this is not checkable in the graph-capture case, so the UX around warnings/error-messages here might need to be tuned...~~
Currently checks restriction on max target length during warmup run(s), and bails out during capture if this constraint was violated during warmup.

CC @ptrblck

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128271
Approved by: https://github.com/ezyang
2024-06-21 21:40:23 +00:00
9103b40a47 Fix small typo in docstring in ParameterList (#129193)
In the docstring of `nn.ParameterList`, ParameterDict.append/extend was being used, which is most likely a typo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129193
Approved by: https://github.com/mikaylagawarecki
2024-06-21 20:53:52 +00:00
92ca17d85d Update triton pin (#126098)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126098
Approved by: https://github.com/bertmaher
2024-06-21 18:46:15 +00:00
d52684e9a8 [BE]: Update CUDNN_frontend submodule to v1.5.1 (#128612)
Updates submodule to cudnn_frontend v1.5.1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128612
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-06-21 18:17:35 +00:00
ebf25e128c [autograd] Do not stash version counter for saved tensor (#128545)
Fixes https://github.com/pytorch/pytorch/issues/128611

We detach using tensor_data, which already preserves the version counter, so there is no reason to save it prior to unpacking:
```
at::TensorBase VariableHooks::tensor_data(const at::TensorBase& self) const {
  TORCH_CHECK(self.defined(), "cannot call tensor_data() on undefined tensor");
  auto self_impl_copy = self.unsafeGetTensorImpl()->shallow_copy_and_detach(
      /*version_counter=*/self.unsafeGetTensorImpl()->version_counter(),
      /*allow_tensor_metadata_change=*/
      self.unsafeGetTensorImpl()->allow_tensor_metadata_change());
  return at::Tensor(self_impl_copy);
}
```
This changes the behavior when hooks are involved:
- Previously, if you had a hook that replaced the saved tensor with an entirely new tensor, we would've smashed the saved version counter onto that during unpack, which is not quite correct because the tensor returned by user's pack hook is not necessarily aliased to the tensor originally being saved (unlikely), and even if it were, the version counter would already be shared, if the user did their operations not in inference mode (unlikely).
- In this PR, we restore the version counter using the version counter from the unpack hook's output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128545
Approved by: https://github.com/albanD
ghstack dependencies: #125795
2024-06-21 18:03:06 +00:00
58cefaf53b Fix hipify regular expression for AOTI wrapper (#128912)
Summary: We need to redefine RE_PYTORCH_PREPROCESSOR here because hipify_torch wraps the pattern in a positive lookbehind (?<=\W) and lookahead (?=\W), so it does not match a keyword at the very beginning or end of a code line. Generated code can place a keyword in exactly those positions, in which case the pattern fails to match.
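
The effect can be reproduced with Python's `re` module. The sketch below is only an illustration: `cudaMalloc` is a stand-in keyword, and the `\b` variant is one possible relaxation, not necessarily the pattern this PR defines.

```python
import re

# Stand-in keyword; the real pattern is built from a much larger keyword set.
KEYWORD = "cudaMalloc"

# Pattern guarded the way the summary describes: a non-word character is
# required on both sides of the keyword.
guarded = re.compile(r"(?<=\W)(" + KEYWORD + r")(?=\W)")

# Assumed alternative for illustration: word boundaries also match at the
# start and end of the string.
bounded = re.compile(r"\b(" + KEYWORD + r")\b")

line_start = "cudaMalloc(&ptr, size);"    # keyword is the first token
line_end = "auto fn = cudaMalloc"         # keyword is the last token
indented = "    cudaMalloc(&ptr, size);"  # keyword has neighbours on both sides

for text in (line_start, line_end, indented):
    print(repr(text), bool(guarded.search(text)), bool(bounded.search(text)))
```

Only the indented line satisfies the guarded pattern, because the lookbehind and lookahead each need an actual character to inspect.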

Test Plan:
```
buck2 run //caffe2/test/inductor:test_cpp_wrapper_hipify
```

```
File changed: fbcode//caffe2/test/inductor/test_cpp_wrapper_hipify.py
Buck UI: https://www.internalfb.com/buck2/395155fa-b2dc-4892-8c71-74e52c65fa2f
Note:    Using experimental modern dice
Network: Up: 0B  Down: 0B  (reSessionID-8fcfc520-755c-48f9-bacc-507c62f59231)
Jobs completed: 10947. Time elapsed: 0.5s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
BUILD SUCCEEDED
/data/users/zhuoran/fbsource/buck-out/v2/gen/fbcode/15b7034708b669be/caffe2/test/inductor/__test_cpp_wrapper_hipify__/test_cpp_wrapper_hipify#link-tree/torch/_utils_internal.py:282: NCCL_DEBUG env var is set to None
/data/users/zhuoran/fbsource/buck-out/v2/gen/fbcode/15b7034708b669be/caffe2/test/inductor/__test_cpp_wrapper_hipify__/test_cpp_wrapper_hipify#link-tree/torch/_utils_internal.py:300: NCCL_DEBUG is forced to WARN from None
test_hipify_aoti_driver_header (caffe2.test.inductor.test_cpp_wrapper_hipify.TestCppWrapperHipify) ... ok
test_hipify_basic_declaration (caffe2.test.inductor.test_cpp_wrapper_hipify.TestCppWrapperHipify) ... ok
test_hipify_cross_platform (caffe2.test.inductor.test_cpp_wrapper_hipify.TestCppWrapperHipify) ... ok

----------------------------------------------------------------------
Ran 3 tests in 0.262s

OK
```

e2e test:

```
TORCH_LOGS="output_code,graph_code" buck2 run mode/{opt,amd-gpu,inplace} -c fbcode.triton_backend=amd -c fbcode.enable_gpu_sections=true //aiplatform/modelstore/model_generation/gpu_lowering_service:gpu_lowering_cli -- --model_input_path="ads_storage_fblearner/tree/user/facebook/fblearner/predictor/936383960/0/gpu_lowering/input.merge" --model_output_path="ads_storage_fblearner/tree/user/facebook/fblearner/predictor/936383960/0/gpu_lowering/mi300_inductor_output.merge" --lowering_backend AOT_INDUCTOR --is_ads_model False --aot_inductor_lowering_settings_json='{"use_scripting":true,"preset_lowerer":"standalone_hstu_cint;disable_new_lowering_weights;disable_dper_passes:passes=fuse_parallel_linear_no_weight_change","precision":4,"output_precision":4, "remove_unexpected_type_cast":false, "sample_input_tile_factor":32}' 2>&1 | tee local_benchmark_log.txt
```

Differential Revision: D58705216

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128912
Approved by: https://github.com/desertfire
2024-06-21 18:00:40 +00:00
2db33054b3 Disable fast path in TransformerEncoderLayer when there are forward (pre-)hooks attached to modules (#128415)
Fixes #128413

Disable fast-path if there are forward hooks or pre-hooks.

Example failure case given in the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128415
Approved by: https://github.com/mikaylagawarecki
2024-06-21 17:38:08 +00:00
8edd4c71c6 [AOTI][refactor] Remove GridExprCppPrinter (#129142)
Summary: Previously we thought using CppPrinter is not ABI-compatibility safe, but c10/util/generic_math.h has been changed to header-only implementation, so we can remove GridExprCppPrinter now.

Differential Revision: [D58854214](https://our.internmc.facebook.com/intern/diff/D58854214)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129142
Approved by: https://github.com/chenyang78
2024-06-21 17:18:37 +00:00
bdc39eef3b [inductor] Add --inductor-config benchmark flag (#129034)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129034
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #129024, #129033
2024-06-21 16:53:42 +00:00
bb4ab59651 [inductor] Run more test on correct device (#129033)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129033
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #129024
2024-06-21 16:53:42 +00:00
feb3f3ad77 [inductor] Refactors for Halide backend (#129024)
Pulling these inductor-related refactors out of the larger Halide
backend PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129024
Approved by: https://github.com/shunting314, https://github.com/eellison
2024-06-21 16:53:35 +00:00
237c4e6163 Improved flexattention bwd perf + added configurations for benchmarks (#129013)
Before:
<img width="519" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/6f4a9b37-4aff-48d3-aaba-7e8e5a5bf0fb">

After:
<img width="541" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/423f179e-76f5-457b-8064-ee8a70247534">

After fixing strides:
![image](https://github.com/pytorch/pytorch/assets/6355099/58471587-404b-4bfc-b9b2-7546bdf53f54)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129013
Approved by: https://github.com/drisspg, https://github.com/yanboliang
ghstack dependencies: #128938
2024-06-21 15:58:53 +00:00
bdd11483ea [3.13] get C dynamo to compile with python callback and custom frame eval (#129171)
Start enabling parts of C Dynamo for 3.13

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129171
Approved by: https://github.com/jansel, https://github.com/albanD
2024-06-21 15:58:02 +00:00
b0ae0db815 [Inductor][Intel GPU] Support reduction split. (#129120)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129120
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/desertfire
ghstack dependencies: #129124
2024-06-21 15:11:59 +00:00
fb0c51b61c [Inductor UT] Fix UT failure 'test_polar_dynamic_shapes_xpu' introduced by #128722 (#129124)
Fix the UT failure 'test_polar_dynamic_shapes_xpu' introduced by #128722. XPU CI currently does not gate PR merges, so some PRs may break XPU CI and have to be fixed after the fact.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129124
Approved by: https://github.com/EikanWang, https://github.com/desertfire
2024-06-21 15:08:17 +00:00
715b09ae2d Revert "Fix DEBUG=1 asserts with NJT ops (#129014)"
This reverts commit 2bb8ee602b264b652a9dbd6877da61018054d313.

Reverted https://github.com/pytorch/pytorch/pull/129014 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/129014#issuecomment-2182922009))
2024-06-21 15:03:02 +00:00
cyy
479ce5e2f4 Remove outdated CUDA code from CMake (#128801)
It's possible to simplify some CUDA handling logic in CMake.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128801
Approved by: https://github.com/r-barnes, https://github.com/malfet
2024-06-21 15:00:00 +00:00
cyy
2c7c286fa4 [1/N] Fix clang-tidy warnings in torch/csrc/jit/serialization (#129055)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129055
Approved by: https://github.com/r-barnes
2024-06-21 14:56:31 +00:00
53be7ff0e4 Make tl.atomic_add relaxed (#129133)
We don't use any fancy synchronization within our atomic ops; we just
want them to be atomic, so it is better to make them relaxed than to use the
default acquire/release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129133
Approved by: https://github.com/peterbell10
2024-06-21 14:49:58 +00:00
62e5d045c0 [AOTI] Auto-tune Triton kernels in a seperate block (#129057)
Summary: Currently AOTI does a two-pass compilation for the CUDA backend. In the first pass AOTI generates Python code, runs the generated code once with real example inputs to trigger Triton kernel compilation and tuning, and then AOTI runs the second pass to generate cpp code and compiles that into a shared library.

There are several problems with this approach when we want to enable the cpp wrapper mode for JIT Inductor:
* Compilation time: JIT compilation is more sensitive to compilation time than AOT compilation. The two-pass approach does add extra overhead for compilation.
* Peak memory size: when executing the first-pass generated code with real inputs, some inputs need to be cloned to avoid side effect coming from input mutation. This can raise the high-water mark for memory consumption.
* Missing triton kernel autotuning: Because kernel autotune depends on the kernel being executed in the two-pass approach, some kernels will not be autotuned when a model contains control flow such as torch.if or torch.while.

This PR is the first step towards solving these problems by moving Triton kernel autotuning to compile time and using random inputs for tuning. The cpp wrapper codegen still has two passes, but in the first pass, Inductor generates separate code just for kernel autotuning, with https://gist.github.com/desertfire/606dc772b3e989b5e2edc66d76593070 as an example, and we no longer need to execute the model after the first pass finishes. After that we run a second pass to generate cpp code. This reduces peak memory consumption and enables kernel autotuning when there is control flow. Making the codegen truly one-pass will come later, once this solution is proven stable and generates kernels as performant as before.

Differential Revision: [D58782766](https://our.internmc.facebook.com/intern/diff/D58782766)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129057
Approved by: https://github.com/jansel, https://github.com/eellison
2024-06-21 14:34:13 +00:00
9795dba1e0 Optim package docstring fix (#129086)
Fix docstrings in various files in the optim package. This is the last remaining fix for issue #112593.

The fix can be verified by running pydocstyle path-to-file --count

Fixes #112593

Related #128248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129086
Approved by: https://github.com/janeyx99
2024-06-21 14:30:53 +00:00
b697808056 [BE][Easy] eliminate relative import in torchgen (#128872)
Fix generated by:

```bash
ruff check --config 'lint.flake8-tidy-imports.ban-relative-imports="all"' --fix --select=TID $(fd '.pyi?$' torchgen)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128872
Approved by: https://github.com/zou3519
2024-06-21 14:11:46 +00:00
e1c1052829 Backward support for unbind() with NJT (#128032)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128032
Approved by: https://github.com/soulitzer
2024-06-21 14:05:23 +00:00
27ae1f981d [inductor] fix linear_add_bias for autocast case (#129138)
Previously, `linear_add_bias` only supported the case where the added tensor is `bfloat16`.

```
        class M(torch.nn.Module):
            def __init__(self, dtype):
                super().__init__()
                self.linear1 = torch.nn.Linear(10, 64, bias=False)
                self.bias1 = torch.randn(64).bfloat16()  # if the bias is not bf16, we will crash

            def forward(self, x):
                return self.linear1(x) + self.bias1
```
In `Autocast(bf16)` cases, `self.bias1` is not converted to bf16. We also did not check the dtypes of the weight and bias in the pattern matcher, which leads to an error if the weight is bf16 while the bias is fp32.

We have 2 options to resolve this:
 - Check the bias/weight dtypes and only fold the bias when they have the same dtype.
 - Fold them even when they do not have the same dtype, by inserting a to_dtype for the `bias node` to force it to have the same dtype as the weight.

This PR chooses option 1, since implicitly casting the bias to bf16 here would lose precision.
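
To make the precision argument concrete, here is a small stdlib-only sketch. `round_to_bf16` emulates a bfloat16 cast by keeping the top 16 bits of the float32 encoding, and `can_fold_bias` is a hypothetical guard mirroring option 1 (dtypes are plain strings here); neither is the Inductor pattern-matcher code.

```python
import struct


def round_to_bf16(x: float) -> float:
    """Round x to the nearest bfloat16 value (ties to even); finite inputs only."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bfloat16 keeps the top 16 bits of the float32 encoding, so round
    # away the low 16 bits using round-to-nearest-even.
    rounding = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + rounding) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def can_fold_bias(weight_dtype: str, bias_dtype: str) -> bool:
    # Hypothetical guard: fold the bias only when both dtypes agree.
    return weight_dtype == bias_dtype


bias_fp32 = struct.unpack("<f", struct.pack("<f", 0.1))[0]  # 0.1 stored as float32
bias_bf16 = round_to_bf16(bias_fp32)

print("fp32 bias:", bias_fp32)
print("bf16 bias:", bias_bf16)
print("abs error:", abs(bias_fp32 - bias_bf16))
print("fold bf16 weight + fp32 bias?", can_fold_bias("bf16", "fp32"))
print("fold bf16 weight + bf16 bias?", can_fold_bias("bf16", "bf16"))
```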

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129138
Approved by: https://github.com/jgong5
2024-06-21 14:04:30 +00:00
5d8e23b49c [custom_op] Support string default values in schema (#129179)
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129179
Approved by: https://github.com/albanD
ghstack dependencies: #129177, #129178
2024-06-21 13:31:40 +00:00
08b616281f [custom ops] Switch out references from old landing page to new landing page (#129178)
Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129178
Approved by: https://github.com/albanD
ghstack dependencies: #129177
2024-06-21 13:31:40 +00:00
311fadb1fb [docs] Redirect custom ops landing page to the correct place (#129177)
I'm moving it to pytorch/tutorials
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129177
Approved by: https://github.com/albanD
2024-06-21 13:31:32 +00:00
217aac96d7 Introduce a prototype for SymmetricMemory (#128582)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them.

### SymmetricMemory

`SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for **op-level custom communication patterns** (via the get_buffer APIs and the synchronization primitives), as well as **custom communication kernels** (via the buffer and signal_pad device pointers).

### Python API Example

```python
import torch
from torch._C._distributed_c10d import _SymmetricMemory

# Set a store for rendezvousing symmetric allocations on a group of devices
# identified by group_name. The concept of groups is logical; users can
# utilize predefined groups (e.g., a group of devices identified by a
# ProcessGroup) or create custom ones. Note that a SymmetricMemoryAllocator
# backend might employ a more efficient communication channel for the actual
# rendezvous process and only use the store for bootstrapping purposes.
_SymmetricMemory.set_group_info(group_name, rank, world_size, store)

# Identical to empty_strided, but allows symmetric memory access to be
# established for the allocated tensor via _SymmetricMemory.rendezvous().
# This function itself is not a collective operation.
t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name)

# Users can write Python custom ops that leverage the symmetric memory access.
# Below are examples of things users can do (assuming the group's world_size is 2).

# Establishes symmetric memory access on tensors allocated via
# _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process,
# and the mapping between a local memory region and the associated SymmetricMemory
# object is unique. Subsequent calls to rendezvous() with the same tensor will receive
# the cached SymmetricMemory object.
#
# The function has a collective semantic and must be invoked simultaneously
# from all rendezvous participants.
symm_mem = _SymmetricMemory.rendezvous(t)

# This represents the allocation on rank 0 and is accessible from all devices.
buf = symm_mem.get_buffer(0, (64, 64), torch.float32)

if symm_mem.rank == 0:
    symm_mem.wait_signal(src_rank=1)
    assert buf.eq(42).all()
else:
    # The remote buffer can be used as a regular tensor
    buf.fill_(42)
    symm_mem.put_signal(dst_rank=0)

symm_mem.barrier()

if symm_mem.rank == 0:
    symm_mem.barrier()
    assert buf.eq(43).all()
else:
    new_val = torch.empty_like(buf)
    new_val.fill_(43)
    # Contiguous copies to/from a remote buffer utilize copy engines,
    # which bypass SMs (i.e. no need to load the data into registers)
    buf.copy_(new_val)
    symm_mem.barrier()
```

### Custom CUDA Comm Kernels

Given a tensor, users can access the associated `SymmetricMemory`, which provides pointers to the remote buffers/signal_pads needed for custom communication kernels.

```cpp
TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory(
    const at::Tensor& tensor);

class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target {
 public:
  ...
  virtual std::vector<void*> get_buffer_ptrs() = 0;
  virtual std::vector<void*> get_signal_pad_ptrs() = 0;
  virtual void** get_buffer_ptrs_dev() = 0;
  virtual void** get_signal_pad_ptrs_dev() = 0;
  virtual size_t get_buffer_size() = 0;
  virtual size_t get_signal_pad_size() = 0;
  virtual int get_rank() = 0;
  virtual int get_world_size() = 0;
  ...
};
```

### Limitations of IntraNodeComm and ProcessGroupCudaP2p
Both `IntraNodeComm` and `ProcessGroupCudaP2p` (which uses `IntraNodeComm`) manage a single fixed-size workspace. This approach:
- Leads to awkward UX in which the required workspace needs to be specified upfront.
- Cannot avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather).
- Prevents torch.compile from eliminating all copies.

In addition, they only offer out-of-the-box communication kernels and don't expose required pointers for user-defined, custom CUDA comm kernels.

* __->__ #128582

Differential Revision: [D58849033](https://our.internmc.facebook.com/intern/diff/D58849033)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582
Approved by: https://github.com/wanchaol
2024-06-21 08:49:11 +00:00
f0443ad174 [compiled autograd] flatten runtime inputs with fast path (#129116)
covered by test_compiled_autograd.py and test_standalone_compile.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129116
Approved by: https://github.com/jansel
ghstack dependencies: #127960, #128905, #128982, #128987, #129181
2024-06-21 08:16:33 +00:00
d97dfe9313 [compiled autograd] move inputs to cuda with non_blocking=True (#129181)
non_blocking=True requires first pinning, which shouldn't be a problem given that they are cpu scalars

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129181
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #127960, #128905, #128982, #128987
2024-06-21 08:16:33 +00:00
8f320fd6c6 [compiled autograd] treat input params as static (#128987)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128987
Approved by: https://github.com/eellison, https://github.com/BoyuanFeng
ghstack dependencies: #127960, #128905, #128982
2024-06-21 08:16:33 +00:00
fafa1867d1 [compiled autograd] use in_compiled_autograd_region instead of compiled_autograd_enabled_count (#128982)
current implementation of compiled_autograd_enabled_count affects the entire region under the context manager. so if the context manager wraps torch.compile calls unrelated to the backward, they are affected too:
- no lazy compile for compiled fw
- no aot autograd cache for inference graphs

we instead maintain a flag when we execute the compiled backward callable, to isolate the special handling to the compiled backward graph
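
a toy sketch of the difference, using plain module globals as stand-ins for the two mechanisms (`enable`, `run_compiled_backward` and `unrelated_compile` are hypothetical names, not the PyTorch implementation):

```python
import contextlib

enabled_count = 0           # stand-in for compiled_autograd_enabled_count
in_backward_region = False  # stand-in for in_compiled_autograd_region


@contextlib.contextmanager
def enable():
    """Counter-based scheme: everything under the context manager sees count > 0."""
    global enabled_count
    enabled_count += 1
    try:
        yield
    finally:
        enabled_count -= 1


def run_compiled_backward(fn):
    """Flag-based scheme: the flag is set only while the backward callable runs."""
    global in_backward_region
    prior = in_backward_region
    in_backward_region = True
    try:
        return fn()
    finally:
        in_backward_region = prior


def unrelated_compile():
    # Reports what each scheme would conclude at this point in the program.
    return enabled_count > 0, in_backward_region


with enable():
    seen_in_forward = unrelated_compile()
    seen_in_backward = run_compiled_backward(unrelated_compile)

print(seen_in_forward)
print(seen_in_backward)
```

in the forward call only the counter-based check fires, which is the over-broad behavior described above; the flag is true only inside the backward callable.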

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128982
Approved by: https://github.com/jansel
ghstack dependencies: #127960, #128905
2024-06-21 08:16:33 +00:00
68b33453f4 [aot autograd] collect static parameter metadata when graphs fallback to inference (#128905)
https://github.com/pytorch/pytorch/pull/126820 but for graphs that have requires_grad inputs but no requires_grad outputs i.e. inference graph

the implementation of inference graph fallback was throwing away the static parameter information during metadata recomputation

also adding a cudagraphs counter to test this easier

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128905
Approved by: https://github.com/mlazos
ghstack dependencies: #127960
2024-06-21 08:16:33 +00:00
123812790b [compiled autograd] update benchmarks to use cli flags for fullgraph/dynamic (#127960)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127960
Approved by: https://github.com/jansel
2024-06-21 08:16:33 +00:00
aee512cc9d [dtensor][op] Fixed stack op strategy (#129018)
**Summary**
The previous stack op strategy caused the input to be resharded, resulting in a list index out of range error. I delayed the resharding until after the input_specs are created so that the new dimension can be inserted, which prevents the error. I have also run all the other test cases to ensure the changes did not introduce any new bugs.

**Test Plan**
pytest test/distributed/_tensor/test_tensor_ops.py -s -k test_stack

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129018
Approved by: https://github.com/XilunWu
2024-06-21 08:10:28 +00:00
6b5fbc544e [dynamo] Use polyfill to trace through the attributes of torch.jit.* and lru_cache_wrapper (#128336)
Earlier we were taking the vt for `obj` and then monkeypatching that `vt.source` to be `obj._torchdynamo_inline`. If one accessed `obj.attr_a`, this would cause problems because Dynamo would then look it up as `obj._torchdynamo_inline.attr_a`. This PR makes it more functional, so that we have different vts for `obj` and `obj._torchdynamo_inline`.

Fixes https://github.com/pytorch/pytorch/issues/93698

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128336
Approved by: https://github.com/jansel, https://github.com/yanboliang
ghstack dependencies: #129117
2024-06-21 07:44:44 +00:00
914d3ca2ba [inductor][cpp] BF16 AMX micro-gemm support (#127195)
This PR adds an intrinsics-based micro-gemm for BF16 using the Advanced Matrix eXtension (AMX) instructions available in 4th and 5th generation Intel Xeon processors. A compilation check is added to `codecache.py` to verify compiler support. Also, since AMX requires an initialization in the Linux kernel for the extra register state, an initialization function is added to do that and is triggered via `codecache.py`.

Performance speedups of >=10% on BF16 AMP, max_autotune vs. no autotune, measured on Intel(R) Xeon(R) Platinum 8488C:
Static shapes
Single-threaded
| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| timm_models | mixer_b16_224 | 1.54 |
| timm_models | convit_base | 1.53 |
| huggingface | MobileBertForQuestionAnswering | 1.52 |
| torchbench | fastNLP_Bert | 1.44 |
| torchbench | llama | 1.33 |
| timm_models | swin_base_patch4_window7_224 | 1.31 |
| torchbench | dlrm | 1.28 |
| torchbench | timm_vision_transformer_large | 1.28 |
| huggingface | MobileBertForMaskedLM | 1.27 |
| timm_models | vit_base_patch16_224 | 1.26 |
| timm_models | beit_base_patch16_224 | 1.23 |
| timm_models | jx_nest_base | 1.21 |
| torchbench | pyhpc_equation_of_state | 1.18 |
| huggingface | Speech2Text2ForCausalLM | 1.15 |
| timm_models | pit_b_224 | 1.14 |
| timm_models | twins_pcpvt_base | 1.14 |
| torchbench | maml_omniglot | 1.1 |
| timm_models | botnet26t_256 | 1.1 |

Multi-threaded
| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | BERT_pytorch | 1.35 |
| torchbench | lennard_jones | 2.43 |
| torchbench | hf_Albert | 1.35 |
| torchbench | hf_T5 | 1.34 |
| torchbench | soft_actor_critic | 1.34 |
| torchbench | fastNLP_Bert | 1.28 |
| huggingface | LayoutLMForSequenceClassification | 1.26 |
| torchbench | llama | 1.24 |
| huggingface | GPT2ForSequenceClassification | 1.19 |
| torchbench | hf_Bart | 1.17 |
| torchbench | hf_Bert_large | 1.16 |
| torchbench | hf_GPT2 | 1.16 |
| timm_models | gmixer_24_224 | 1.16 |
| torchbench | hf_GPT2_large | 1.15 |
| torchbench | maml_omniglot | 1.14 |
| torchbench | hf_Bert | 1.13 |
| torchbench | hf_DistilBert | 1.13 |
| torchbench | hf_T5_large | 1.12 |
| huggingface | MT5ForConditionalGeneration | 1.11 |

Dynamic shapes
Single-threaded
| Model Family | Model Name | Speedup |
|--------------|------------|-------|
| timm_models | mixer_b16_224 | 1.52 |
| timm_models | convit_base | 1.5 |
| huggingface | MobileBertForQuestionAnswering | 1.49 |
| torchbench | fastNLP_Bert | 1.42 |
| torchbench | timm_vision_transformer_large | 1.28 |
| timm_models | swin_base_patch4_window7_224 | 1.27 |
| torchbench | llama | 1.26 |
| huggingface | MobileBertForMaskedLM | 1.25 |
| timm_models | vit_base_patch16_224 | 1.25 |
| timm_models | beit_base_patch16_224 | 1.24 |
| timm_models | jx_nest_base | 1.2 |
| torchbench | dlrm | 1.19 |
| timm_models | pit_b_224 | 1.13 |
| timm_models | twins_pcpvt_base | 1.13 |
| torchbench | hf_Bert_large | 1.12 |
| torchbench | hf_BigBird | 1.11 |
| huggingface | Speech2Text2ForCausalLM | 1.11 |
| timm_models | eca_botnext26ts_256 | 1.11 |
| timm_models | botnet26t_256 | 1.1 |

Multi-threaded
| Model Family | Model Name | Speedup |
|--------------|------------|-------|
| torchbench | BERT_pytorch | 1.18 |
| torchbench | lennard_jones | 2.18 |
| torchbench | hf_Albert | 1.37 |
| torchbench | soft_actor_critic | 1.31 |
| huggingface | GPT2ForSequenceClassification | 1.29 |
| torchbench | hf_T5 | 1.28 |
| torchbench | fastNLP_Bert | 1.27 |
| torchbench | hf_Bart | 1.21 |
| torchbench | hf_Bert_large | 1.19 |
| torchbench | hf_T5_large | 1.19 |
| torchbench | hf_Bert | 1.16 |
| torchbench | hf_GPT2 | 1.16 |
| huggingface | CamemBert | 1.16 |
| torchbench | hf_GPT2_large | 1.13 |
| torchbench | functorch_maml_omniglot | 1.12 |
| huggingface | BertForMaskedLM | 1.12 |
| huggingface | MT5ForConditionalGeneration | 1.12 |
| torchbench | hf_DistilBert | 1.11 |
| timm_models | mixnet_l | 1.11 |
| timm_models | tf_mixnet_l | 1.11 |

No perf regressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127195
Approved by: https://github.com/jansel
2024-06-21 07:21:47 +00:00
632910e2a8 Add test to xfail_list only for abi_compatible (#128506)
https://github.com/pytorch/pytorch/pull/126717 will skip the tests in both ABI compatible and non-ABI compatible mode.
It's not expected to skip them in non-ABI compatible mode since they can actually run successfully in such mode but only have issues in ABI compatible mode.

We leverage the existing `xfail_list` for those that will only fail in ABI compatible mode.

- `test_qlinear_add` is already in the `xfail_list`.
- `test_linear_packed` doesn't fail either in my local run (running with `TORCHINDUCTOR_ABI_COMPATIBLE=1`) or in the CI of this PR so I didn't add it into `xfail_list`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128506
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-21 07:19:28 +00:00
62e425ab03 Memory Tracker for tracking Module wise memory (#124688)
We present a utility MemTracker, that tracks the module-wise memory for the code executed under its context. The core features that this tool aims to provide are:

1. Capturing 'snapshots' of memory for each module during its execution. Specifically, at 8 points, during pre-forward, post-forward, pre-backward, 2nd pre-forward (if AC is applied), 2nd post-forward (if AC is applied), post-backward. Also capturing peak memory snapshot during forward and backward.
2. Each such snapshot provides the per device (cpu, cuda etc) memory breakdown in terms of the global parameters, gradients, activations, optimizer states and temporary memory.
3. A summary for each module (that can be analyzed or processed later), in terms of the memory occupied by its own parameters, buffers, inputs and outputs. The remaining components can be derived from these per module attributes and its corresponding captured snapshots.
4. Record the global peak memory consumption per device and their respective breakdowns.
5. Ability to do all of this under the FakeTensorMode so that all these statistics can be obtained without executing code on real data.
6. Ability to register and track modules, optimizers and any other tensors that are created outside the context of MemTracker.
7. Ability to capture a custom memory snapshot at any point during program execution.
8. Utility functions to display all of these statistics in a user-friendly and human-readable manner.

These features will enable users to anticipate OOMs, debug and pinpoint where the majority of memory comes from, and experiment with different activation checkpointing policies, batch sizes, mixed precision, model architecture features (e.g. number of layers, hidden dimensions, number of attention heads) and inter-device memory movement (e.g. CPU off-loading), among others. Basically anything and everything related to device memory.

* __->__ #128508

Example:

>     import torch
>     import torchvision.models as models
>     from torch.distributed._tools.mem_tracker import MemTracker
>     device, dtype = "cuda", torch.float32
>     with torch.device(device):
>         model = models.resnet18().to(dtype=dtype)
>     optim = torch.optim.Adam(model.parameters(), foreach=True)
>     mem_tracker = MemTracker()
>     mem_tracker.track_external(model, optim)
>     with mem_tracker as mt:
>         for i in range(2):
>             input_batch = torch.randn(256, 3, 224, 224, device=device, dtype=dtype)
>             model(input_batch).sum().backward()
>             optim.step()
>             optim.zero_grad()
>             if i == 0:
>                 # to account for lazy init of optimizer state
>                 mt.reset_mod_stats()
>     mt.display_snapshot("peak", units="MiB", tabulate=True)
>     mt.display_modulewise_snapshots(depth=2, units="MiB", tabulate=True)
>     # Check for accuracy of peak memory
>     tracker_max = mt.get_tracker_snapshot('peak')[device]['Total']
>     cuda_max = torch.cuda.max_memory_allocated()
>     accuracy = tracker_max / cuda_max
>     print(f"Tracker Max: {tracker_max}, CUDA Max: {cuda_max}, Accuracy: {accuracy}")

Output

<img width="1197" alt="Screenshot 2024-06-15 at 12 10 12 AM" src="https://github.com/pytorch/pytorch/assets/12934972/83e953db-43dc-4094-90eb-9f1d2ca8e758">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124688
Approved by: https://github.com/awgu
2024-06-21 07:15:32 +00:00
2b1b055a96 [Split Build] Fix libtorch_python RPATH (#129088)
In the split build we end up with an incorrect RPATH for `libtorch_python.so`. This PR fixes said RPATH.

What the rpath should look like:
```
sahanp@devgpu086 ~/pytorch ((636de71c…))> objdump -p ~/main_so_files/libtorch_python.so | grep "RPATH"                        (pytorch-3.10)
  RPATH                /lib/intel64:/lib/intel64_win:/lib/win-x64:/home/sahanp/pytorch/build/lib:/home/sahanp/.conda/envs/pytorch-3.10/lib:
```

Before

```
sahanp@devgpu086 ~/pytorch ((636de71c…))> objdump -p ~/split_so_files/libtorch_python.so | grep "RPATH"                       (pytorch-3.10)
  RPATH                /home/sahanp/pytorch/torch/lib:/home/sahanp/pytorch/build/lib:
```

After
```
sahanp@devgpu086 ~/pytorch ((636de71c…))> objdump -p build/lib/libtorch_python.so | grep "RPATH"                              (pytorch-3.10)
  RPATH                /lib/intel64:/lib/intel64_win:/lib/win-x64:/home/sahanp/pytorch/build/lib:/home/sahanp/pytorch/torch/lib:/home/sahanp/.conda/envs/pytorch-3.10/lib:
```
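The three RPATH values above can also be compared programmatically. The following is a minimal sketch, assuming the `RPATH` line has the format printed by `objdump -p`; the helper name `rpath_entries` is hypothetical and not part of this PR.

```python
def rpath_entries(objdump_line: str) -> list[str]:
    """Split an `objdump -p` RPATH line into its non-empty path entries."""
    # The line looks like "  RPATH   /a:/b:"; everything after the tag is the value.
    _, _, value = objdump_line.strip().partition("RPATH")
    return [entry for entry in value.strip().split(":") if entry]


# RPATH of the split build before the fix (copied from the output above).
before = "  RPATH                /home/sahanp/pytorch/torch/lib:/home/sahanp/pytorch/build/lib:"
# RPATH of the regular (non-split) build (copied from the output above).
expected = (
    "  RPATH                /lib/intel64:/lib/intel64_win:/lib/win-x64:"
    "/home/sahanp/pytorch/build/lib:/home/sahanp/.conda/envs/pytorch-3.10/lib:"
)

# Entries present in the regular build but absent from the split build.
missing = [e for e in rpath_entries(expected) if e not in rpath_entries(before)]
print(missing)
```

Running the same comparison against the "After" value would be expected to report no missing entries.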

Testing that this works is in the above PR. Similarly, after running ciflow/binaries the output of objdump -p should not change https://www.diffchecker.com/14PRmCNz/ (checked manywheel py 3.10 cuda 12.1)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129088
Approved by: https://github.com/malfet
2024-06-21 06:49:19 +00:00
c008488b9c [dynamo][guards] Dont run TYPE_MATCH for DICT_LENGTH C++ guard (#129163)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129163
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-06-21 06:27:19 +00:00
cyy
5c676bb8b3 Remove Caffe2 handling from onnx_unpack_quantized_weights (#129021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129021
Approved by: https://github.com/justinchuby, https://github.com/albanD
2024-06-21 06:16:44 +00:00
3a2fdbb142 [dynamo] - Add JK killswitch for dynamo compilation. (#128538)
This allows easy disablement of dynamo in emergency situations where env variables are hard to set.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128538
Approved by: https://github.com/jansel
2024-06-21 06:14:06 +00:00
f73b451e78 Revert "Improved flexattention bwd perf + added configurations for benchmarks (#129013)"
This reverts commit ff89ebc50a738c734496393dc25313cf197fd0b4.

Reverted https://github.com/pytorch/pytorch/pull/129013 on behalf of https://github.com/huydhn due to Sorry for reverting your change but one of the test_torchinductor_opinfo test starts to fail after this commit ff89ebc50a, I am reverting to see if it helps trunk recovers ([comment](https://github.com/pytorch/pytorch/pull/129013#issuecomment-2182042422))
2024-06-21 05:46:46 +00:00
b542825066 Enable deterministic support for oneDNN (#127277)
This PR is a part of RFC https://github.com/pytorch/pytorch/issues/114848.
To meet a request from the TorchBench models, this PR enables the deterministic attribute for the oneDNN operators on XPU backends, such as convolution, deconvolution and matmul.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127277
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/desertfire, https://github.com/gujinghui
2024-06-21 05:21:24 +00:00
e8dbb45e98 [dynamo][user-defined-object] Check that object is valid (#129117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129117
Approved by: https://github.com/yf225
2024-06-21 04:18:54 +00:00
cyy
e99a24ce7c Remove TensorImpl_test.cpp (#129054)
It's not used because of removal of Caffe2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129054
Approved by: https://github.com/albanD, https://github.com/malfet
2024-06-21 04:17:36 +00:00
880e894c39 [Brian's PR #128981] fix dynamo isinstance inlining for nn.Parameter + subclasses (#129162)
This is a copy of Brian's PR https://github.com/pytorch/pytorch/pull/128981, with very small changes to work around numpy related errors.

For discussions, please see Brian's original PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129162
Approved by: https://github.com/bdhirsh
2024-06-21 03:48:10 +00:00
8cd9b10456 Fix exp decomp numerics (#129154)
Our previous implementation would sometimes generate `inf` because we did not do the same numerics tricks as in eager:

See comment / [link](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/core/TransformationHelper.h#L123-L144) :
```
    # curand_uniform has (0,1] bounds. log(1) is 0 and exponential excludes 0.
    # we need log to be not 0, and not underflow when converted to half
    # fast __logf approximation can underflow, so set log to -epsilon/2 for 1 or close to 1 args
```
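The edge case described in the quoted comment can be sketched in plain Python. This is an illustration only, not the decomposition code from this PR: the helper name `exponential_from_uniform` is hypothetical, and double-precision machine epsilon is assumed as the epsilon.

```python
import math
import sys


def exponential_from_uniform(u: float, rate: float = 1.0) -> float:
    """Map a uniform sample u in (0, 1] to an exponential sample.

    Hypothetical helper illustrating the trick quoted above.
    """
    eps = sys.float_info.epsilon
    # log(1) == 0 would give a sample of exactly 0, which the exponential
    # distribution excludes, so use -eps/2 for arguments at or near 1.
    log_u = -eps / 2 if u >= 1.0 - eps / 2 else math.log(u)
    return -log_u / rate


print(exponential_from_uniform(1.0))  # strictly positive, never exactly 0
print(exponential_from_uniform(0.5))  # the usual -log(u) / rate
```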

Fix for https://github.com/pytorch/pytorch/issues/127749.

Added a test for non-inf, but it would be great to have more robust decomp distribution tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129154
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
2024-06-21 03:21:30 +00:00
ff89ebc50a Improved flexattention bwd perf + added configurations for benchmarks (#129013)
Before:
<img width="519" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/6f4a9b37-4aff-48d3-aaba-7e8e5a5bf0fb">

After:
<img width="541" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/423f179e-76f5-457b-8064-ee8a70247534">

After fixing strides:
![image](https://github.com/pytorch/pytorch/assets/6355099/58471587-404b-4bfc-b9b2-7546bdf53f54)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129013
Approved by: https://github.com/drisspg, https://github.com/yanboliang
ghstack dependencies: #128938
2024-06-21 03:01:16 +00:00
0acd09aecd [torchrec][pt-d][model store] introduce LocalShardsWrapper for DTensor (#129150)
Summary:
Same as D57688538, recreated because of GH issues

This diff introduces LocalShardsWrapper which is crucial to migrating from using ShardedTensor to DTensor in TRec state dict representation. As well as any changes needed in PT-D and ModelStore to support this.

It allows us to extend DTensor to support multiple shards on a rank as well as empty shards on a rank as needed by TRec sharding logic.

This diff also extends the support for LocalShardsWrapper to be used in conjunction with DTensor in checkpointing cases (ModelStore and DCP)

See D54375878 for how it is used.

**LocalShardsWrapper supports the following torch ops:**
+ torch.ops._c10d_functional.all_gather_into_tensor.default
+ aten._to_copy.default
+ aten.view.default
+ aten.equal.default
+ aten.detach.default

With extensibility to add more as required by use cases.

See https://docs.google.com/document/d/16Ptl50mGFJW2cljdF2HQ6FwsiA0scwbAbjx_4dhabJw/edit?usp=drivesdk for more info regarding design and approach.

NOTE: This version of LocalShardsWrapper does not support empty shards, that is added in the next diff enabling CW. D57063512

Test Plan:
` buck test mode/opt -c python.package_style=inplace aiplatform/modelstore/client/tests_gpu:dist_checkpoint_save_load_with_stateful_tests -- --print-passing-details`

`buck2 test 'fbcode//mode/dev-nosan' fbcode//torchrec/distributed/tests:test_tensor_configs -- --print-passing-details`

Sandcastle

Reviewed By: XilunWu, wanchaol

Differential Revision: D58570479

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129150
Approved by: https://github.com/XilunWu
2024-06-21 01:58:51 +00:00
31c9e3d2f4 [FSDP][Test] Test save model save with FSDP1 and load into FSDP2 applied model (#129028)
A lot of models have already been saving the model state in FULL_STATE_DICT mode with FSDP1 in APF. This unit test just demonstrates the FSDP1 -> FSDP2 transition. The use of deprecated APIs in this test is intentional.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129028
Approved by: https://github.com/awgu, https://github.com/fegin
2024-06-21 01:40:58 +00:00
8758fedbfc [export] copy sym ops when respecting call module signature (#129153)
Summary:
Export, through AOTAutograd, [deduplicates](11ff5345d2/torch/fx/experimental/proxy_tensor.py (L198)) sym_size calls, which can cause issues during unflattening when the sym_size node is used in multiple submodules.

If preserve_call_module_signature is set, these nodes can't be passed between submodules as placeholders, so the calls (and any downstream un-duplicated nodes) must be copied. Adding this to unflattener

Test Plan: export unflatten test case

Reviewed By: TroyGarden, angelayi

Differential Revision: D58697231

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129153
Approved by: https://github.com/angelayi
2024-06-21 01:40:22 +00:00
5da428d9eb [cpu][flash attention] fix attention mask issue (#128816)
For attention mask in flash attention:

- Fix the issue of accessing illegal memory when the last size of mask is 1.
- Add UT of attention mask for various shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128816
Approved by: https://github.com/jgong5, https://github.com/drisspg
2024-06-21 01:12:48 +00:00
d4022b4658 Revert "[BE] enable UFMT for torch/nn/modules (#128594)"
This reverts commit 95ac2d648279ebc73feccf6d8eccafa4b2759de8.

Reverted https://github.com/pytorch/pytorch/pull/128594 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128594#issuecomment-2181788935))
2024-06-21 00:50:08 +00:00
cc8193c707 Revert "[BE] enable UFMT for torch/nn/functional.py (#128592)"
This reverts commit f6e6e55fa7d883a89ba99584f8632c260519ba73.

Reverted https://github.com/pytorch/pytorch/pull/128592 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128592#issuecomment-2181783936))
2024-06-21 00:44:16 +00:00
9c929f6ce9 Revert "[BE][Easy] enable UFMT for torch/distributed/ (#128870)"
This reverts commit a0e1e20c4157bb3e537fc784a51d7aef1e754157.

Reverted https://github.com/pytorch/pytorch/pull/128870 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128870#issuecomment-2181780356))
2024-06-21 00:38:28 +00:00
9dd8f8cf8b [cpuinfo][submodule] bump cpuinfo to the latest to support amx isa check (#127505)
Fix https://github.com/pytorch/pytorch/issues/127368

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127505
Approved by: https://github.com/ezyang
2024-06-21 00:17:44 +00:00
c027c8935b [distributed] NCCL result code update (#128777)
The nccl result codes are outdated. This PR fixes #128756.

Fixes #128756

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128777
Approved by: https://github.com/Skylion007
2024-06-20 23:51:39 +00:00
43060a1dbc Add shard support to test_inductor (#129160)
I added one more shard for inductor tests earlier in https://github.com/pytorch/pytorch/pull/129108, but didn't realize that the second shard didn't do any inductor tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129160
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-06-20 23:41:00 +00:00
31d5753247 Short-term fix to preserve NJT metadata cache in torch.compile (#122836)
Idea: close over min / max sequence length in the main NJT view func (`_nested_view_from_jagged`) so that view replay during fake-ification propagates these correctly in torch.compile.

For dynamic shapes support for min / max sequence length, this PR uses a hack that stores the values in `(val, 0)` shaped tensors.

**NB: This PR changes SDPA to operate on real views instead of using `buffer_from_jagged()` / `ViewNestedFromBuffer`, which may impact the internal FIRST model. That is, it undoes the partial revert from #123215 alongside a fix to the problem that required the partial revert. We need to verify that there are no regressions there before landing.**

Differential Revision: [D55448636](https://our.internmc.facebook.com/intern/diff/D55448636)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122836
Approved by: https://github.com/soulitzer
2024-06-20 23:15:53 +00:00
63a724d8e1 Revert "Introduce a prototype for SymmetricMemory (#128582)"
This reverts commit 8771e3429c3d7327f08c48d547ad73546d5603b3.

Reverted https://github.com/pytorch/pytorch/pull/128582 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128582#issuecomment-2181656181))
2024-06-20 22:31:29 +00:00
5fba5d83f0 add xpu for amp (#127276)
As support for Intel GPU has been upstreamed, this PR is to add the XPU-related contents to AMP doc.

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127276
Approved by: https://github.com/dvrogozh, https://github.com/albanD, https://github.com/malfet
2024-06-20 21:49:35 +00:00
adc14adb88 Fix flakiness with test_binary_op_list_error_cases (#129003)
So how come this PR fixes any flakiness?

Well, following my investigation (read pt 1 in the linked ghstack PR below), I had realized that this test only consistently errors after another test was found flaky.

Why? Because TORCH_SHOW_CPP_STACKTRACES=1 gets turned on for _every_ test after _any_ test reruns, following this PR https://github.com/pytorch/pytorch/pull/119408. And yea, this test checked for exact error message matching, which no longer would match since the stacktrace for a foreach function is obviously going to be different from a nonforeach.

So we improve the test.
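To illustrate why exact matching is brittle here, a minimal sketch follows. It assumes that any C++ stack trace is appended after the first line of the error message; the helper `same_error` and the sample messages are hypothetical and do not reproduce the actual test change.

```python
def same_error(expected: str, actual: str) -> bool:
    """Compare only the first line of two error messages.

    Hypothetical helper: assumes a stack trace, if present, follows the
    first line of the message.
    """
    def first_line(msg: str) -> str:
        lines = msg.splitlines()
        return lines[0] if lines else ""

    return first_line(expected) == first_line(actual)


# Made-up messages: the second one carries an appended stack trace.
plain = "Tensors must have same number of dimensions"
with_trace = plain + "\nException raised from foo at bar.cpp:42\nframe #0: ..."

print(plain == with_trace)            # exact matching breaks once a trace is appended
print(same_error(plain, with_trace))  # first-line matching is unaffected
```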

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129003
Approved by: https://github.com/soulitzer
2024-06-20 21:48:22 +00:00
61fa3de4cb ci: Hardcode runner-determinator (#128985)
Hardcode the runner-determinator script for testing ALI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128985
Approved by: https://github.com/ZainRizvi
2024-06-20 21:14:23 +00:00
aace8ffc00 Revert "[BE] enable UFMT for torch/nn/*.py (#128593)"
This reverts commit a87d82abd746240e7b46b992fa9df7ae6d3e6d4a.

Reverted https://github.com/pytorch/pytorch/pull/128593 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128593#issuecomment-2181562604))
2024-06-20 21:09:44 +00:00
f2f4dde2d3 [dynamo] Remove ID_MATCH for FSDPModuleVariable (#129015)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129015
Approved by: https://github.com/yf225
ghstack dependencies: #129098
2024-06-20 19:23:32 +00:00
e84cf805d2 Revert "Modularize aten parameter parser and checker (#125308)"
This reverts commit 60bbdc0b40656cf70b2b098c7d715e19f031fb0d.

Reverted https://github.com/pytorch/pytorch/pull/125308 on behalf of https://github.com/fbgheith due to test failures when run by meta ([comment](https://github.com/pytorch/pytorch/pull/125308#issuecomment-2181327211))
2024-06-20 18:52:05 +00:00
254487f288 Revert "Separate AOTI Eager utils as a single file (#125819)"
This reverts commit 18634048a1f939a961b7c96b0acfe78b474c821e.

Reverted https://github.com/pytorch/pytorch/pull/125819 on behalf of https://github.com/fbgheith due to test failures when run by meta ([comment](https://github.com/pytorch/pytorch/pull/125819#issuecomment-2181317332))
2024-06-20 18:49:08 +00:00
73340f0909 Revert "[3/N] Non-Tensor: Support string parameter for aten operations (#125831)"
This reverts commit a52c8ace98afe76dc9e2c330b415972fd1529077.

Reverted https://github.com/pytorch/pytorch/pull/125831 on behalf of https://github.com/fbgheith due to test failures when run by meta ([comment](https://github.com/pytorch/pytorch/pull/125831#issuecomment-2181313892))
2024-06-20 18:45:41 +00:00
8c2542623b [Traceable FSDP2] [Dynamo] Add tracing support for out-variant custom ops that return None (#129078)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129078
Approved by: https://github.com/yanboliang
2024-06-20 17:46:13 +00:00
734891ac22 Fix export log script (#128967)
Summary: Title

Test Plan: CI

Differential Revision: D58699557

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128967
Approved by: https://github.com/jiashenC
2024-06-20 17:01:00 +00:00
ddb95dbb0d Fixing equalize with three things and improving functionality (#124632)
Summary:
(1) Make code work when a first layer does not have a bias.
(2) Make it possible to provide both modules and module names as input
(3) Allow sequences of contiguous layers as input, that then get split into pairs
(4) fix documentation to be more clear on inputs to be provided
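Item (3) can be pictured with a short sketch. It assumes that "split into pairs" means overlapping pairs of adjacent layers; the function name `split_into_pairs` is hypothetical, and the actual implementation in this PR may differ.

```python
from typing import Sequence, TypeVar

T = TypeVar("T")


def split_into_pairs(layers: Sequence[T]) -> list[tuple[T, T]]:
    """Turn [a, b, c] into [(a, b), (b, c)]: each layer paired with its successor."""
    return list(zip(layers, layers[1:]))


# Module names are used here; module objects would work the same way.
print(split_into_pairs(["conv1", "conv2", "conv3"]))
```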

Test Plan:
Run this new version of the algorithm on a network and see if it throws errors.

There's also this notebook to run and test N5199827

If you tell me where I can find the tests for this code, I can add some simple unit tests as well.

Differential Revision: D55895862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124632
Approved by: https://github.com/jerryzh168
2024-06-20 16:55:56 +00:00
832fc35211 Revert "Improved flexattention bwd perf + added configurations for benchmarks (#129013)"
This reverts commit 6d2b3c90f144d7b77d51da27e6696192b2b97ebd.

Reverted https://github.com/pytorch/pytorch/pull/129013 on behalf of https://github.com/ZainRizvi due to Sorry but this is causing a flexattention test to fail on ROCm. Can you please fix that test before remerging this in? See 6d2b3c90f1 for details ([comment](https://github.com/pytorch/pytorch/pull/129013#issuecomment-2181133070))
2024-06-20 16:51:41 +00:00
65286883d4 [export] reland "experimental joint graph API." (#129081)
Summary: the previous diff got reverted despite CI being green.

Test Plan: CI

Differential Revision: D58790048

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129081
Approved by: https://github.com/tugsbayasgalan
2024-06-20 16:50:53 +00:00
fc5b0ff2d7 [BE][Hackaday] deprecate legacy cuda docker image (#128859)
Fixes https://github.com/pytorch/builder/issues/1795 from the pytorch side specifically for the cuda image

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128859
Approved by: https://github.com/atalman
2024-06-20 16:30:49 +00:00
b2a9b8d485 [CpuInductor] Enable NEON ISA detection on Linux ARM (#129075)
Also, clean up the code a bit to use `x in [y, z]` instead of `x == y or x == z`
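As an illustration of that cleanup pattern (not the actual code touched by this PR), a membership test replaces a chain of equality comparisons. The function name and the architecture strings below are hypothetical:

```python
def is_arm64(machine: str) -> bool:
    # Membership test replaces `machine == "arm64" or machine == "aarch64"`.
    return machine in ["arm64", "aarch64"]


print(is_arm64("aarch64"), is_arm64("x86_64"))
```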

And do not redefine `at_align`, but instead use `alignas(64)` as was suggested in https://github.com/pytorch/pytorch/pull/128686/files#r1639365978

Test plan: `python3 -c "import torch._inductor.codecache as cc; isa = cc.valid_vec_isa_list()[0];print(str(isa), bool(isa))"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129075
Approved by: https://github.com/jansel
2024-06-20 16:22:57 +00:00
e0aa992d73 Fix inductor and deploy jobs timing out (#129108)
Some trunk and periodic jobs are timing out at the moment, including:

* `deploy`.  This is because https://github.com/pytorch/pytorch/pull/127952 removed the `deploy` config, but there is one left over in periodic.
    * [periodic / linux-focal-cuda12.4-py3.10-gcc9 / test (deploy, 1, 1, linux.4xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/actions/runs/9525590191/job/26260620457).
* `inductor`, including `py3.10`, `py3.12`, and `cuda12.1`, `cuda12.4`.  The increase comes from this change https://github.com/pytorch/pytorch/pull/128343, so I add another GPU shard.
    * [inductor / cuda12.1-py3.12-gcc9-sm86 / test (inductor, 1, 1, linux.g5.4xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/actions/runs/9522817887/job/26255069269)
    * [inductor / cuda12.1-py3.10-gcc9-sm86 / test (inductor, 1, 1, linux.g5.4xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/actions/runs/9524651902/job/26260009757)
    * [inductor-cu124 / cuda12.4-py3.10-gcc9-sm86 / test (inductor, 1, 1, linux.g5.4xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/actions/runs/9587982228/job/26440205869)
    * [inductor-cu124 / cuda12.4-py3.12-gcc9-sm86 / test (inductor, 1, 1, linux.g5.4xlarge.nvidia.gpu)](https://github.com/pytorch/pytorch/actions/runs/9587982228/job/26440634200)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129108
Approved by: https://github.com/malfet
2024-06-20 16:03:11 +00:00
2bb8ee602b Fix DEBUG=1 asserts with NJT ops (#129014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129014
Approved by: https://github.com/YuqingJ, https://github.com/soulitzer
2024-06-20 15:15:28 +00:00
7178b4e987 [Dynamo x torch_function] fix incorrect source (#128980)
Fixes https://github.com/pytorch/pytorch/issues/128964

The problem was that we were installing the source for a type
incorrectly.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128980
Approved by: https://github.com/mlazos
2024-06-20 14:54:00 +00:00
ea47d542ca [dynamo][guards] Remove BOOL_FALSE - not needed after C++ guards (#129098)
PyDict_Size is very fast ... earlier, with Python guards, CPython would go through layers of fluff to finally call PyDict_Size. With C++ guards, it's not needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129098
Approved by: https://github.com/jansel
2024-06-20 14:40:27 +00:00
54b0006cb2 Evaluate symexprs on load path of cache not write (#128997)
When caching is enabled, an internal model fails with
```
assert_size_stride(bmm_9, (17, s0, 512), (54784, 512, 1))
AssertionError: expected size 17==17, stride 57344==54784 at dim=0
```
Looking at this model, the exact problem is that when the cache is hit on the forward graph, the generated code for backward fails because the strides of the forward outputs, passed to backward as inputs, are not what we expected.

This PR changes the evaluation logic so that we defer evaluation of output stride exprs to load path as opposed to eagerly doing it on save path.
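The difference between the two evaluation points can be sketched in plain Python. This is an illustration under stated assumptions, not the inductor cache code: stride expressions are modeled as functions of the symbol values, and the dim-0 stride is assumed to be `512 * s0`, which is an inference from the numbers in the assertion above (54784 = 512 * 107, 57344 = 512 * 112) rather than something stated in the PR.

```python
from typing import Callable, Dict, Tuple

# Hypothetical stand-in for a symbolic stride expression such as 512 * s0.
StrideExpr = Callable[[Dict[str, int]], int]

stride_exprs: Tuple[StrideExpr, ...] = (
    lambda syms: 512 * syms["s0"],
    lambda syms: 512,
    lambda syms: 1,
)


def evaluate(exprs: Tuple[StrideExpr, ...], syms: Dict[str, int]) -> Tuple[int, ...]:
    """Evaluate every stride expression for the given symbol values."""
    return tuple(expr(syms) for expr in exprs)


# Eager: evaluate once on the save path and cache the concrete integers.
cached_eager = evaluate(stride_exprs, {"s0": 107})

# Deferred: cache the expressions and evaluate them on the load path,
# using the symbol values seen at load time.
loaded_deferred = evaluate(stride_exprs, {"s0": 112})

print(cached_eager)     # stale if s0 differs at load time
print(loaded_deferred)  # matches the strides of the tensors actually passed in
```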

I have not been able to come up with a unit test repro for this problem.

Differential Revision: [D58796503](https://our.internmc.facebook.com/intern/diff/D58796503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128997
Approved by: https://github.com/ezyang
2024-06-20 08:55:12 +00:00
799acd31b4 [MPS] Add lu_factor (#99269)
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at d75cde1</samp>

Added MPS support and autograd formulas for LU factorization of tensors. Implemented the `linalg_lu_factor` and `linalg_lu_factor.out` functions for the MPS backend in `LinearAlgebra.mm` and added tests in `test_mps.py`. Added the corresponding dispatch entries in `native_functions.yaml` and the backward and forward formulas in `derivatives.yaml`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99269
Approved by: https://github.com/kulinseth, https://github.com/lezcano
2024-06-20 07:35:29 +00:00
0d25f096c1 [CppInductor] Fix erfinv codegen when non-vectorized isa (#129090)
Fix erfinv codegen when ISA could not be detected

Manual test plan (on MacOS):
 - Modify `valid_vec_isa_list` to return empty list
 - Run `python3 inductor/test_torchinductor_opinfo.py -v -k test_comprehensive_erfinv_cpu_bool`

Before this change, the above-mentioned test fails with
```
Output:
/var/folders/rk/fxg20zvx6vvb5bk7cplq4xrc0000gn/T/tmpgic60b6c/ns/cnsp7snp7fyclkm5lsfiyiv3m6c3svevkbhcb3v7pijdfjwlyaij.cpp:11:25: error: use of undeclared identifier 'calc_erfinv'
            auto tmp2 = calc_erfinv(tmp1);
                        ^
1 error generated.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129090
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-20 06:09:48 +00:00
6d2b3c90f1 Improved flexattention bwd perf + added configurations for benchmarks (#129013)
Before:
<img width="519" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/6f4a9b37-4aff-48d3-aaba-7e8e5a5bf0fb">

After:
<img width="541" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/423f179e-76f5-457b-8064-ee8a70247534">

After fixing strides:
![image](https://github.com/pytorch/pytorch/assets/6355099/58471587-404b-4bfc-b9b2-7546bdf53f54)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129013
Approved by: https://github.com/drisspg, https://github.com/yanboliang
ghstack dependencies: #128938
2024-06-20 05:15:48 +00:00
ad2593cb86 [Animesh's PR #125340] [dynamo][fsdp] Track FSDPNNModuleVariable for mutations (#129045)
This is a copy of Animesh's work in https://github.com/pytorch/pytorch/pull/125340, with very small changes to the unit test. It's needed sooner for the Traceable FSDP2 work, so I copy it here and will work through landing it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129045
Approved by: https://github.com/anijain2305
2024-06-20 04:02:36 +00:00
19f3abcde4 [Docs][MPS] Add mps environment variable table (#129008)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129008
Approved by: https://github.com/malfet
ghstack dependencies: #129006
2024-06-20 03:30:35 +00:00
609ffaf717 Add more shards for slow CPU and ROCm jobs (#128873)
As they start to time out in trunk fc2913fb80/1.  Adding one more shard for the slow CPU job is trivial.  ROCm runners are harder to find, but I assume that this is ok because slow jobs only run periodically.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128873
Approved by: https://github.com/PaliC
2024-06-20 03:13:19 +00:00
d8db074988 [Traceable FSDP2] [Dynamo] Fix OptimizedModule._initialize to allow tracing into FSDP2 module hooks for module from user-defined module class (#129046)
This is a workaround to allow inplace fully-sharded module to still go into this branch:
3a185778ed/torch/_dynamo/eval_frame.py (L163)
instead of the second branch:
3a185778ed/torch/_dynamo/eval_frame.py (L166)

If we don't do this, `torch.compile(fully_shard(module_from_user_defined_module_class))` will ignore all module hooks which will break FSDP tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129046
Approved by: https://github.com/anijain2305
2024-06-20 00:15:55 +00:00
859fa183fe BE: Use future annotations in inductor scheduler and ir (#128892)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128892
Approved by: https://github.com/lezcano
2024-06-20 00:10:43 +00:00
a2b1673dfb [Horace's PR #126446] Prevent partitioner from ever saving views (#129039)
Most of the work was done by Horace in https://github.com/pytorch/pytorch/issues/126446; this PR just additionally adds the config for it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129039
Approved by: https://github.com/Chillee
2024-06-19 23:21:16 +00:00
9d06e3783d [Inductor][CPP] Fix the symbolic size cast issue in GEMM Benchmark (#128824)
**Summary**
The symbolic size generated from the size hint (a Python int) differs from the C type `long` of the kernel args, which may cause the benchmark to fail to run.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128824
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-19 23:11:53 +00:00
a6ac6447b5 Re-enable py3.12 nightly wheel builds and add triton dependency for ROCm (#128525)
The llnl-hatchet developers have published the py3.12 binaries on [PyPI](https://pypi.org/project/llnl-hatchet/#files). In fact, looking [here](https://download.pytorch.org/whl/nightly/llnl-hatchet), it seems we already have the py3.12 wheels mirrored. This should allow us to re-enable py3.12 binaries for ROCm.

This PR reverts commit 9d849d4312cd1e62d97b9e9d58979ec78d36c95f.

It also adds the pytorch-triton-rocm dependency for torch wheels on ROCm since pytorch-triton-rocm py3.12 wheels are available now

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128525
Approved by: https://github.com/malfet
2024-06-19 21:56:54 +00:00
571a0db132 [inductor] Fix logging for run_and_get_cpp_code (#128794)
Summary: Found during testing with remote caching: Use the same output logger object between graph.py and codecache.py since it's patched in `run_and_get_cpp_code`. That allows us to capture any logging produced from the codecache path when using `run_and_get_cpp_code`. I'm also fixing a few tests that were passing mistakenly because logging was missing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128794
Approved by: https://github.com/oulgen, https://github.com/leslie-fang-intel
2024-06-19 21:32:34 +00:00
cyy
277f2914a5 [9/N] Remove unused functions (#128704)
MKL cannot be enabled on aarch64, and because CI compiles code with `-Werror=unused-function`, the build fails with
```
/usr/bin/c++ -DAT_PER_OPERATOR_HEADERS -DBUILD_ONEDNN_GRAPH -DCAFFE2_BUILD_MAIN_LIB -DCPUINFO_SUPPORTED_PLATFORM=1 -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DFXDIV_USE_INLINE_ASSEMBLY=0 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNNP_CONVOLUTION_ONLY=0 -DNNP_INFERENCE_ONLY=0 -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DUSE_C10D_GLOO -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cpu_EXPORTS -I/var/lib/jenkins/workspace/build/aten/src -I/var/lib/jenkins/workspace/aten/src -I/var/lib/jenkins/workspace/build -I/var/lib/jenkins/workspace -I/var/lib/jenkins/workspace/cmake/../third_party/benchmark/include -I/var/lib/jenkins/workspace/third_party/onnx -I/var/lib/jenkins/workspace/build/third_party/onnx -I/var/lib/jenkins/workspace/third_party/foxi -I/var/lib/jenkins/workspace/build/third_party/foxi -I/var/lib/jenkins/workspace/torch/csrc/api -I/var/lib/jenkins/workspace/torch/csrc/api/include -I/var/lib/jenkins/workspace/caffe2/aten/src/TH -I/var/lib/jenkins/workspace/build/caffe2/aten/src/TH -I/var/lib/jenkins/workspace/build/caffe2/aten/src -I/var/lib/jenkins/workspace/build/caffe2/../aten/src -I/var/lib/jenkins/workspace/torch/csrc -I/var/lib/jenkins/workspace/third_party/miniz-2.1.0 -I/var/lib/jenkins/workspace/third_party/kineto/libkineto/include -I/var/lib/jenkins/workspace/third_party/kineto/libkineto/src -I/var/lib/jenkins/workspace/third_party/cpp-httplib -I/var/lib/jenkins/workspace/aten/src/ATen/.. -I/var/lib/jenkins/workspace/third_party/FXdiv/include -I/var/lib/jenkins/workspace/c10/.. 
-I/var/lib/jenkins/workspace/third_party/pthreadpool/include -I/var/lib/jenkins/workspace/third_party/cpuinfo/include -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/include -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/src -I/var/lib/jenkins/workspace/aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/include -I/var/lib/jenkins/workspace/third_party/NNPACK/include -I/var/lib/jenkins/workspace/third_party/FP16/include -I/var/lib/jenkins/workspace/third_party/tensorpipe -I/var/lib/jenkins/workspace/build/third_party/tensorpipe -I/var/lib/jenkins/workspace/third_party/tensorpipe/third_party/libnop/include -I/var/lib/jenkins/workspace/third_party/fmt/include -I/var/lib/jenkins/workspace/build/third_party/ideep/mkl-dnn/include -I/var/lib/jenkins/workspace/third_party/ideep/mkl-dnn/src/../include -I/var/lib/jenkins/workspace/third_party/flatbuffers/include -isystem /var/lib/jenkins/workspace/build/third_party/gloo -isystem /var/lib/jenkins/workspace/cmake/../third_party/gloo -isystem /var/lib/jenkins/workspace/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /var/lib/jenkins/workspace/cmake/../third_party/googletest/googlemock/include -isystem /var/lib/jenkins/workspace/cmake/../third_party/googletest/googletest/include -isystem /var/lib/jenkins/workspace/third_party/protobuf/src -isystem /var/lib/jenkins/workspace/third_party/XNNPACK/include -isystem /var/lib/jenkins/workspace/cmake/../third_party/eigen -isystem /var/lib/jenkins/workspace/third_party/ideep/mkl-dnn/include/oneapi/dnnl -isystem /var/lib/jenkins/workspace/third_party/ideep/include -isystem /var/lib/jenkins/workspace/build/include -D_GLIBCXX_USE_CXX11_ABI=1 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOCUPTI -DLIBKINETO_NOROCTRACER -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=range-loop-construct 
-Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wsuggest-override -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Werror -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Wno-stringop-overflow -O3 -DNDEBUG -DNDEBUG -std=gnu++17 -fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -D__NEON__ -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-unknown-pragmas -Wno-type-limits -Wno-array-bounds -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wno-maybe-uninitialized -fvisibility=hidden -O2 -pthread -fopenmp -MD -MT caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/mkldnn/Linear.cpp.o -MF caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/mkldnn/Linear.cpp.o.d -o caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/mkldnn/Linear.cpp.o -c /var/lib/jenkins/workspace/aten/src/ATen/native/mkldnn/Linear.cpp
/var/lib/jenkins/workspace/aten/src/ATen/native/mkldnn/Linear.cpp:426:15: error: ‘at::Tensor at::native::mkl_linear(const at::Tensor&, const at::Tensor&, const at::Tensor&, const std::optional<at::Tensor>&, int64_t)’ defined but not used [-Werror=unused-function]
  426 | static Tensor mkl_linear(
      |               ^~~~~~~~~~
```

Follows #128499

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128704
Approved by: https://github.com/malfet
2024-06-19 20:46:45 +00:00
fca408fa29 s390x vectorization: rework operators (#129066)
Move operators from member functions to free functions. This is needed to fix torch inductor on s390x.

This change fixes tests like
DynamicShapesMiscTests::test_numpy_min_dynamic_shapes from test/dynamo/test_dynamic_shapes.py

This change also fixes a recently introduced build failure on s390x.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129066
Approved by: https://github.com/malfet
2024-06-19 20:12:41 +00:00
73f5d2b787 Run ET unit tests on PT CI (#128560)
This is the first PR to add all existing ET unit tests into PT CI. The goal is to improve coverage there so that PT changes that would break ET are caught. With this, any future unit tests on ET will automatically be run on PT CI. The job now takes 40+ minutes, which is not too bad.

This also fixed the failed ET build in https://github.com/pytorch/pytorch/pull/123043.

Adding model coverage is a bit more involved and requires adding new shards, so I will follow up on that in separate PRs.

[T192117506](https://www.internalfb.com/intern/tasks/?t=192117506), with the failed diffs D58295865 and D58394154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128560
Approved by: https://github.com/guangy10, https://github.com/digantdesai
2024-06-19 20:08:58 +00:00
df94d57c0a Revert "[export] experimental joint graph API. (#128847)"
This reverts commit 0707811286d1846209676435f4f86f2b4b3d1a17.

Reverted https://github.com/pytorch/pytorch/pull/128847 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/128847#issuecomment-2179326891))
2024-06-19 19:04:36 +00:00
b5d541609d [Memory Snapshot] Add recordAnnotations to capture record_function annotations (#129072)
Summary:
Add new traceEvents into Memory Snapshot for record_function annotations. These will capture both the profiler's step annotation as well as user annotations.

Test Plan:
CI

Pulled By:
aaronenyeshi

Differential Revision: D55941362

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129072
Approved by: https://github.com/zdevito
2024-06-19 18:05:41 +00:00
bafd68b4fc [inductor] fix windows python module ext and func export declaration (#129059)
I have run the first inductor case on Windows based on the exploration code: https://github.com/pytorch/pytorch/pull/128330
Some fundamental PRs still need to pass `fb_code`: https://github.com/pytorch/pytorch/pull/128303
In the meantime, this PR lands part of the exploration code:
1. Fix Windows python module ext type: pyd.
2. Add function export declaration for Windows.
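
A minimal sketch of the two items above, not the code from this PR. The helper names `python_module_ext` and `export_declaration` are hypothetical, and the non-Windows branches are assumptions for illustration. CPython on Windows loads extension modules from `.pyd` files, and MSVC exports a function from a DLL only when it is marked `__declspec(dllexport)`.

```python
import sys


def python_module_ext(platform: str = sys.platform) -> str:
    """Return the extension-module suffix for a platform (hypothetical helper)."""
    # CPython on Windows loads compiled extension modules from .pyd files;
    # for every other platform this sketch assumes a plain .so shared object.
    return ".pyd" if platform == "win32" else ".so"


def export_declaration(platform: str = sys.platform) -> str:
    """Return the prefix used to export a C/C++ function (hypothetical helper)."""
    # MSVC only exports symbols that are explicitly marked dllexport.
    if platform == "win32":
        return 'extern "C" __declspec(dllexport)'
    return 'extern "C"'
```

Passing the platform as a parameter keeps both helpers easy to exercise on any host, for example `python_module_ext("win32")`.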

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129059
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-19 17:51:32 +00:00
0707811286 [export] experimental joint graph API. (#128847)
Summary:
WARNING: This API is highly unstable and will be subject to change in the future.

Add a prototype to "decompose" an ExportedProgram into a joint graph form, so that we can compute the gradients on this graph.

Test Plan: buck test mode/opt caffe2/torch/fb/export:test_experimental

Differential Revision: D55657917

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128847
Approved by: https://github.com/tugsbayasgalan
2024-06-19 16:45:27 +00:00
0fc603ece4 [optim] Fused implementation stability table (#129006)
I'd like to discuss the criteria by which we regard an implementation as stable. If there is no existing standard, my initial proposal would be to regard an implementation as stable 6 months after its commit. As a result, Adam and AdamW on CUDA would now be considered stable, while the rest are beta.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129006
Approved by: https://github.com/malfet
2024-06-19 16:29:49 +00:00
1b92bdd0ea [ALI] [Reland] Use LF runners for Lint (#129071)
Quick experiment with using LF runners for lint jobs.

Picking a set of jobs where infra failures would be obvious to most people (lint)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129071
Approved by: https://github.com/malfet
2024-06-19 16:10:51 +00:00
236fbcbdf4 [Split Build] Test split build in pull CI workflow (#126813)
This PR builds the split build in the pull workflow and runs the appropriate tests against them. A single linux cpu and single gpu build were chosen arbitrarily to not add too many tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126813
Approved by: https://github.com/atalman
ghstack dependencies: #127934
2024-06-19 15:57:21 +00:00
7d33ff59ba [Split Build]Use same package (#127934)
This PR removes the second separate package we were using for the libtorch wheel.
To test that this works, we will use the PRs above this one in the stack.

As a sanity check, these are the wheels produced by running
```
python setup.py clean && BUILD_LIBTORCH_WHL=1 with-proxy python setup.py bdist_wheel && BUILD_PYTHON_ONLY=1 with-proxy python setup.py bdist_wheel --cmake
```

```
sahanp@devgpu086 ~/pytorch ((5f15e171…))> ls -al dist/                                                        (pytorch-3.10)
total 677236
drwxr-xr-x 1 sahanp users       188 Jun  4 12:19 ./
drwxr-xr-x 1 sahanp users      1696 Jun  4 12:59 ../
-rw-r--r-- 1 sahanp users  81405742 Jun  4 12:19 torch-2.4.0a0+gitca0a73c-cp310-cp310-linux_x86_64.whl
-rw-r--r-- 1 sahanp users 612076919 Jun  4 12:19 libtorch-2.4.0a0+gitca0a73c-py3-none-any.whl
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127934
Approved by: https://github.com/atalman
2024-06-19 15:57:21 +00:00
lyb
ffb50fb691 [ONNX] Add onnx::Gelu support for version 20 (#128773)
Fixes https://github.com/pytorch/pytorch/issues/128772
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128773
Approved by: https://github.com/justinchuby
2024-06-19 15:39:02 +00:00
3397d5ef90 Revert "[ALI] Use lf runners for Lint" (#129070)
Reverts pytorch/pytorch#128978
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129070
Approved by: https://github.com/atalman
2024-06-19 14:48:16 +00:00
118f9ceb7c [inductor][ci] Fix torchbench dependency issue with numpy (#128968)
For some reason, pip will always upgrade the numpy version even when an older version has been installed.
We have to lock the numpy version to the old one to make this constraint explicit.

Torchbench commit: 23512dbebd

Second attempt to fix #128845

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128968
Approved by: https://github.com/eellison
2024-06-19 12:10:50 +00:00
e49525275d Make TraceUtils.h to be device-agnostic (#126969)
Some features of third-party devices depend on TraceUtils.h, so some of the CUDA code was removed and split into NCCLUtils files.

In addition, some common functions still remain in TraceUtils.h since I'm not sure if other devices will use them later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126969
Approved by: https://github.com/c-p-i-o
2024-06-19 09:06:49 +00:00
7fac03aee9 [ALI] Use lf runners for Lint (#128978) 2024-06-19 10:59:07 +02:00
50567f7081 Pass device to is_pinned call inside TensorProperties.create_from_tensor (#128896)
Summary:
The default device for the is_pinned function is CUDA. This can unnecessarily create a CUDA context for CPU tensors when just generating TensorProperties, bloating memory usage. Passing the device to the is_pinned call inside create_from_tensor solves this issue.

This also fixes Model Store test
https://www.internalfb.com/intern/test/844425019931542?ref_report_id=0
which is currently broken on memory usage assertions.

Test Plan: UT

Differential Revision: D58695006

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128896
Approved by: https://github.com/fegin
2024-06-19 08:50:46 +00:00
d3e8b8bf47 Remove cuda check in the CUDAGraph destructor (#127382)
Fixes #125804

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127382
Approved by: https://github.com/eqy, https://github.com/eellison
2024-06-19 08:09:31 +00:00
ba92f5277f [inductor][refactor] Unify the use of generate_kernel_call (#128467)
Summary: Refactor TritonTemplateKernel.call_kernel and ForeachKernel.call_kernel to use wrapper.generate_kernel_call to generate kernel calls instead of explicitly composing the kernel call string. This consolidates the entry point of generate_kernel_call and simplifies later changes in this PR stack.

Differential Revision: [D58733631](https://our.internmc.facebook.com/intern/diff/D58733631)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128467
Approved by: https://github.com/shunting314
2024-06-19 07:47:25 +00:00
3a185778ed [aotinductor] Add torch.polar fallback op for shim v2 (#128722)
Compilation error:
```
$ TORCHINDUCTOR_C_SHIM_VERSION=2 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_LOGS_FORMAT="%(pathname)s:%(lineno)s: %(message)s" TORCH_LOGS="+output_code" python test/inductor/test_cpu_cpp_wrapper.py -k test_polar

/tmp/tmp2sp128xj/dy/cdypvu3hvgg3mwxydwbiuddsnmuoi37it3mrpjktcnu6vt4hr3ki.cpp:59:33: error: ‘aoti_torch_cpu_polar’ was not declared in this scope; did you mean ‘aoti_torch_cpu_topk’?
```

Steps:
1. Add aten.polar
2. run `python torchgen/gen.py --update-aoti-c-shim`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128722
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-06-19 05:06:58 +00:00
a584b2a389 Revert "Add test to xfail_list only for abi_compatible (#128506)"
This reverts commit df85f34a14dd30f784418624b05bd52b12ab8b0b.

Reverted https://github.com/pytorch/pytorch/pull/128506 on behalf of https://github.com/huydhn due to The failure shows up in trunk df85f34a14 ([comment](https://github.com/pytorch/pytorch/pull/128506#issuecomment-2177744578))
2024-06-19 04:59:10 +00:00
fcf2a1378b Enable fp8 rowwise scaling kernel on cuda, TAKE 2: #125204 (#128989)
# Summary
First PR got reverted and needed a redo

This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met:
- `x`'s scale should be a 1-dimensional tensor of length `M`.
- `y`'s scale should be a 1-dimensional tensor of length `N`.
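
The conditions above can be sketched as a shape check. This is an illustration only, working on plain shape tuples; `uses_rowwise_scaling` is a hypothetical name and the real dispatch inside `scaled_mm` may check more than this (dtypes, layout, contiguity).

```python
from typing import Sequence


def uses_rowwise_scaling(
    x_shape: Sequence[int],
    y_shape: Sequence[int],
    x_scale_shape: Sequence[int],
    y_scale_shape: Sequence[int],
) -> bool:
    """Check the scale-shape conditions for x: [M, K] and y: [K, N] (hypothetical helper)."""
    m, k = x_shape
    k2, n = y_shape
    if k != k2:
        raise ValueError("inner dimensions of x and y must match")
    # x's scale must be 1-D with length M; y's scale must be 1-D with length N.
    return tuple(x_scale_shape) == (m,) and tuple(y_scale_shape) == (n,)
```

For example, `uses_rowwise_scaling((16, 32), (32, 8), (16,), (8,))` satisfies both conditions, while scalar scales with shape `()` do not.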

It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row".

The following two PRs were required to enable local builds:
- [PR #126185](https://github.com/pytorch/pytorch/pull/126185)
- [PR #125523](https://github.com/pytorch/pytorch/pull/125523)

### Todo
We still do not build our Python wheels with this architecture.

@ptrblck @malfet, should we replace `sm_90` with `sm_90a`?

The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit:
https://github.com/pytorch/pytorch/pull/125204/files#r1586986954

#### ifdef

I tried to use : `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && \
    defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate the building of the kernel. I was having a hell of a time with this, so I am not really sure of the right way to do this.

Kernel Credit:
@jwfromm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128989
Approved by: https://github.com/yangsiyu007, https://github.com/vkuzo
2024-06-19 04:49:39 +00:00
2f88597aad [inductor] For internal, allow multiple workers if the method is "subprocess" (#129002)
Summary: This does not change the current default behavior in fbcode ("fork" if unspecified and no worker processes if unspecified). But it allows us to more easily test the subprocess-based parallel if we override the start method to subprocess.

Test Plan: Set `TORCHINDUCTOR_WORKER_START=subprocess` and locally ran all torchbench models listed [here](https://www.internalfb.com/intern/wiki/PyTorch/Teams/PyTorch_Perf_Infra/TorchBench/#torchbench-internal-mode)

Differential Revision: D58755021

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129002
Approved by: https://github.com/eellison
2024-06-19 04:28:27 +00:00
1f0a68b572 [ROCm] Fix fp32 atomicAdd for non-MI100 GPUs (#128750)
Current implementation is very specific to MI100.
This is causing performance degradation for other GPUs.

Fixes #128631

Benchmarking on MI300X:
```
Before:  1918.5126953125 ms
After: 0.8285150527954102 ms
```

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128750
Approved by: https://github.com/xw285cornell
2024-06-19 03:56:20 +00:00
acefc5c016 [torch.compile] Enable bwd compilation metrics (#128973)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128973
Approved by: https://github.com/dshi7
2024-06-19 03:45:41 +00:00
eb9f4da11e Modified template indexing to broadcast indices to out instead of mask and some other flexattention micro-opts (#128938)
For headdim=64 and headdim=128

Old:
<img width="656" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/2c5d1613-96dc-4300-8dc0-dccaef59e73c">

New:
<img width="644" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/730004a8-6d5f-46a5-82a0-2594feb5e192">

Note, this does regress headdim=256. We can unregress it by special casing `headdim=256`, but ehh.... we can do it later

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128938
Approved by: https://github.com/drisspg
2024-06-19 03:41:22 +00:00
8771e3429c Introduce a prototype for SymmetricMemory (#128582)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them.

### SymmetricMemory

`SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for **op-level custom communication patterns** (via the get_buffer APIs and the synchronization primitives), as well as **custom communication kernels** (via the buffer and signal_pad device pointers).

### Python API Example

```python
import torch
from torch._C.distributed_c10d import _SymmetricMemory

# Set a store for rendezvousing symmetric allocations on a group of devices
# identified by group_name. The concept of groups is logical; users can
# utilize predefined groups (e.g., a group of devices identified by a
# ProcessGroup) or create custom ones. Note that SymmetricMemoryAllocator
# backends might employ a more efficient communication channel for the actual
# rendezvous process and only use the store for bootstrapping purposes.
_SymmetricMemory.set_group_info(group_name, rank, world_size, store)

# Identical to empty_strided, but allows symmetric memory access to be
# established for the allocated tensor via _SymmetricMemory.rendezvous().
# This function itself is not a collective operation.
t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name)

# Users can write Python custom ops that leverage the symmetric memory access.
# Below are examples of things users can do (assuming the group's world_size is 2).

# Establishes symmetric memory access on tensors allocated via
# _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process,
# and the mapping between a local memory region and the associated SymmetricMemory
# object is unique. Subsequent calls to rendezvous() with the same tensor will receive
# the cached SymmetricMemory object.
#
# The function has a collective semantic and must be invoked simultaneously
# from all rendezvous participants.
symm_mem = _SymmetricMemory.rendezvous(t)

# This represents the allocation on rank 0 and is accessible from all devices.
buf = symm_mem.get_buffer(0, (64, 64), torch.float32)

if symm_mem.rank == 0:
    symm_mem.wait_signal(src_rank=1)
    assert buf.eq(42).all()
else:
    # The remote buffer can be used as a regular tensor
    buf.fill_(42)
    symm_mem.put_signal(dst_rank=0)

symm_mem.barrier()

if symm_mem.rank == 0:
    symm_mem.barrier()
    assert buf.eq(43).all()
else:
    new_val = torch.empty_like(buf)
    new_val.fill_(43)
    # Contiguous copies to/from a remote buffer utilize copy engines
    # which bypass SMs (i.e. no need to load the data into registers)
    buf.copy_(new_val)
    symm_mem.barrier()
```

### Custom CUDA Comm Kernels

Given a tensor, users can access the associated `SymmetricMemory`, which provides pointers to the remote buffers/signal_pads needed for custom communication kernels.

```cpp
TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory(
    const at::Tensor& tensor);

class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target {
 public:
  ...
  virtual std::vector<void*> get_buffer_ptrs() = 0;
  virtual std::vector<void*> get_signal_pad_ptrs() = 0;
  virtual void** get_buffer_ptrs_dev() = 0;
  virtual void** get_signal_pad_ptrs_dev() = 0;
  virtual size_t get_buffer_size() = 0;
  virtual size_t get_signal_pad_size() = 0;
  virtual int get_rank() = 0;
  virtual int get_world_size() = 0;
  ...
};
```

### Limitations of IntraNodeComm and ProcessGroupCudaP2p
`IntraNodeComm` (used by `ProcessGroupCudaP2p`) manages a single fixed-size workspace. This approach:
- Leads to awkward UX in which the required workspace needs to be specified upfront.
- Cannot avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather).
- Prevents torch.compile from eliminating all copies.

In addition, they only offer out-of-the-box communication kernels and don't expose required pointers for user-defined, custom CUDA comm kernels.

* __->__ #128582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582
Approved by: https://github.com/wanchaol
2024-06-19 03:38:58 +00:00
ed5b8432cd Enable mixed_mm only if casting from lower-bitwidth type to a higher one (#128899)
This PR changes the behavior of `cuda_and_enabled_mixed_mm` such that mixed_mm is only enabled if we are casting from a lower-bitwidth type to a higher one.
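
The bitwidth rule can be sketched as follows. This is an illustration only: the `BITWIDTH` table and the `casts_to_wider_type` helper are hypothetical, use dtype names as strings rather than `torch.dtype` objects, and leave out any other conditions that `cuda_and_enabled_mixed_mm` applies.

```python
# Bit widths for a few dtype names; the table is illustrative, not exhaustive.
BITWIDTH = {"int8": 8, "uint8": 8, "float16": 16, "bfloat16": 16, "float32": 32}


def casts_to_wider_type(src_dtype: str, dst_dtype: str) -> bool:
    """True only when the cast goes from a lower-bitwidth type to a higher one."""
    return BITWIDTH[src_dtype] < BITWIDTH[dst_dtype]
```

Under this rule an `int8` to `float16` cast qualifies, while `float32` to `float16` and same-width casts such as `float16` to `bfloat16` do not.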

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128899
Approved by: https://github.com/eellison
2024-06-19 03:12:18 +00:00
df85f34a14 Add test to xfail_list only for abi_compatible (#128506)
https://github.com/pytorch/pytorch/pull/126717 will skip the tests in both ABI compatible and non-ABI compatible mode.
It's not expected to skip them in non-ABI compatible mode since they can actually run successfully in such mode but only have issues in ABI compatible mode.

We leverage the existing `xfail_list` for those that will only fail in ABI compatible mode.

- `test_qlinear_add` is already in the `xfail_list`.
- `test_linear_packed` doesn't fail either in my local run (running with `TORCHINDUCTOR_ABI_COMPATIBLE=1`) or in the CI of this PR so I didn't add it into `xfail_list`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128506
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-19 01:18:37 +00:00
4bc90185fb fix: Print statements causing parse error (#128969)
The print statements in the get_workflow_type script are problematic because the shell script calling it expects the output to be JSON only. This PR resolves this by removing all print statements and converting them to a message field in the JSON output, so that the output remains valid JSON while still giving us the debug data we are looking for.
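
A minimal sketch of the pattern, assuming a hypothetical `build_output` helper and a hypothetical `workflow_type` key; only the `message` field is named in this description, and the actual script may structure its output differently.

```python
import json


def build_output(workflow_type: str, message: str) -> str:
    """Serialize the result and the debug text together as one JSON document."""
    # Debug information travels inside the JSON instead of being printed on
    # its own line, so the caller can always parse stdout as JSON.
    return json.dumps({"workflow_type": workflow_type, "message": message})


if __name__ == "__main__":
    # The only thing written to stdout is the JSON document itself.
    print(build_output("default", "no matching label found, using default"))
```

A caller can then parse the whole of stdout with a JSON parser and read the debug text from the `message` field.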

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128969
Approved by: https://github.com/tylertitsworth, https://github.com/ZainRizvi
2024-06-19 01:17:08 +00:00
eda375a490 [Inductor] Remove min/max from inductor opinfo test (#128925)
**Summary**
Remove `max.binary, min.binary, maximum, minimum` from the `inductor_one_sample` op list now that the bool vectorization issue is fixed in https://github.com/pytorch/pytorch/pull/126841.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_maximum
python -u -m pytest -s -v test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_minimum
python -u -m pytest -s -v test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_min_binary
python -u -m pytest -s -v test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_max_binary
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128925
Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10
2024-06-19 01:14:27 +00:00
2458f79f83 [Inductor UT][Intel GPU] Skip newly added test case test_torchinductor_strided_blocks:test_reduction for Intel GPU (#128881)
Skip newly added test case test_torchinductor_strided_blocks:test_reduction for Intel GPU because
it has not implemented reduction kernel split.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128881
Approved by: https://github.com/blaine-rister, https://github.com/EikanWang, https://github.com/malfet
2024-06-19 00:44:57 +00:00
b0d2fe6299 Revert "Short-term fix to preserve NJT metadata cache in torch.compile (#122836)"
This reverts commit 2a41fc03903de63270d325bd1886a50faf32d7e4.

Reverted https://github.com/pytorch/pytorch/pull/122836 on behalf of https://github.com/jbschlosser due to internal test failures with DEBUG=1 asserts ([comment](https://github.com/pytorch/pytorch/pull/122836#issuecomment-2177298245))
2024-06-19 00:28:53 +00:00
5ffb032be6 Revert "Backward support for unbind() with NJT (#128032)"
This reverts commit 5dc4f652bc5c068ef15130c955e3f2ffe11f4b74.

Reverted https://github.com/pytorch/pytorch/pull/128032 on behalf of https://github.com/jbschlosser due to reverting to revert parent PR ([comment](https://github.com/pytorch/pytorch/pull/128032#issuecomment-2177296325))
2024-06-19 00:26:40 +00:00
35c78668b4 Improve the debugging message for when foreach mta_called (#128991)
The hope that lives in this PR: I am currently trying to debug why the foreach tests are so flaky. It looks like every flaky test falls under this pattern:
- a test is flaky due to the mta_called assertion, which gathers data from the profiler regarding whether the multi_tensor_apply_kernel has been called.
- then, a later test fails deterministically, usually failing to compare two results.

```
================== 1 failed, 241 deselected, 2 rerun in 1.76s ==================
Got exit code 1
Stopping at first consistent failure
The following tests failed and then succeeded when run in a new process ['test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_add_cuda_bfloat16']
The following tests failed consistently: ['test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_add_cuda_bfloat16']
```

So my suspicion is that the first causes the second, but what causes the first? Idk! So it would be nice to have the error message tell us what the profiler actually saw in case it's getting muddled. This change would help mostly because I have not been able to repro this flakiness locally.

Also undo the useless changes in #128220 which are actually redundant as Joel and I realized that we set the seed during the setUp of every test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128991
Approved by: https://github.com/clee2000
2024-06-19 00:25:09 +00:00
99f042d336 Revert "Forward fix to skip ROCm tests for #122836 (#128891)"
This reverts commit 4061b3b8225f522ae0ed6db00111441e7d3cc3d5.

Reverted https://github.com/pytorch/pytorch/pull/128891 on behalf of https://github.com/jbschlosser due to reverting to revert parent PR ([comment](https://github.com/pytorch/pytorch/pull/128891#issuecomment-2177291249))
2024-06-19 00:21:21 +00:00
670b94c9c8 [inductor][mkldnn] Use floats instead of ints for pattern matcher test (#128484)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128484
Approved by: https://github.com/mlazos
ghstack dependencies: #128428
2024-06-19 00:06:46 +00:00
c5e0b84484 [dynamo][trace_rules] Remove incorrectly classified Ingraph functions (#128428)
Co-authored-by: Laith Sakka <lsakka@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128428
Approved by: https://github.com/yanboliang, https://github.com/mlazos
2024-06-19 00:06:46 +00:00
cyy
cb5e9183c6 [Caffe2] [2/N] Remove Caffe2 from tests (#128911)
Follows #128675

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128911
Approved by: https://github.com/titaiwangms, https://github.com/r-barnes
2024-06-19 00:05:50 +00:00
ac5f565fa7 [FSDP2] Added set_post_optim_event (#128975)
This PR adds `set_post_optim_event`, which allows power users to provide their own CUDA event, recorded after the optimizer step, for the FSDP root module's all-gather streams to wait on.
```
def set_post_optim_event(self, event: torch.cuda.Event) -> None:
```
By default, the root would have the all-gather streams wait on the current stream (`wait_stream`), which may introduce false dependencies if there is unrelated computation after the optimizer step and before the wait. For example, this pattern can appear in recommendation models.

To avoid those false dependencies while preserving the correctness guarantee, we provide this API so that the user can provide their own CUDA event for the all-gather streams to wait on.

We include both a correctness test (`test_fully_shard_training.py`) and an overlap test (`test_fully_shard_overlap.py`).

---

One possible way to use the API is to register a post-step hook on the optimizer. For example:
12e8d1399b/test/distributed/_composable/fsdp/test_fully_shard_training.py (L546-L552)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128975
Approved by: https://github.com/sanketpurandare, https://github.com/weifengpy
ghstack dependencies: #128884
2024-06-18 22:26:14 +00:00
d9c294c672 [Inductor] Fix arguments passed to triton kernel launch hooks (#128732)
`binary.launch_enter_hook` is treated as an instance method and will add a `self` argument to the hooks.
`CompiledKernel.launch_enter_hook` is a static method, which matches the hook calling convention of profilers (i.e., a single `LazyDict` argument only).
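
The difference can be reproduced in plain Python. This sketch uses a stand-in class, `FakeKernel`, and a stand-in hook, `count_args`; neither is Triton code. It only shows how the same function receives an extra `self` when looked up on an instance rather than on the class.

```python
class FakeKernel:
    """Stand-in for a compiled kernel class; not Triton's real CompiledKernel."""

    launch_enter_hook = None  # hook stored as a plain class attribute


def count_args(*args):
    """Hook that reports how many positional arguments it received."""
    return len(args)


FakeKernel.launch_enter_hook = count_args
binary = FakeKernel()
metadata = {"name": "kernel"}  # placeholder for the single LazyDict argument

# Looked up on the instance, the function is bound and receives `self` first.
via_instance = binary.launch_enter_hook(metadata)
# Looked up on the class, it is called with exactly the arguments given.
via_class = FakeKernel.launch_enter_hook(metadata)
```

Here `via_instance` is 2 and `via_class` is 1, which is why a hook written for a single argument only works when it is reached through the class.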

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128732
Approved by: https://github.com/shunting314, https://github.com/bertmaher
2024-06-18 22:06:55 +00:00
a0e1e20c41 [BE][Easy] enable UFMT for torch/distributed/ (#128870)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128870
Approved by: https://github.com/fegin
ghstack dependencies: #128868, #128869
2024-06-18 21:49:08 +00:00
3b798df853 [BE][Easy] enable UFMT for torch/distributed/{fsdp,optim,rpc}/ (#128869)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128869
Approved by: https://github.com/fegin
ghstack dependencies: #128868
2024-06-18 21:49:08 +00:00
cec31050b4 [BE][Easy] enable UFMT for torch/distributed/{tensor,_tensor}/ (#128868)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128868
Approved by: https://github.com/fegin
2024-06-18 21:49:02 +00:00
e47603a549 Fix weight_norm decomposition behavior (#128956)
By upcasting norm to float32 to align with CUDA and CPU behaviors
e6d4451ae8/aten/src/ATen/native/WeightNorm.cpp (L56-L59)

Discovered this when I started running OpInfo tests, see https://github.com/pytorch/pytorch/actions/runs/9552858711/job/26332062502#step:20:1060
```
  File "/var/lib/jenkins/workspace/test/test_decomp.py", line 185, in op_assert_ref
    assert orig.dtype == decomp.dtype, f"{i} Operation:  {op}"
AssertionError: 1 Operation:  aten._weight_norm_interface.default
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128956
Approved by: https://github.com/albanD
ghstack dependencies: #128955
2024-06-18 21:24:12 +00:00
2227da4431 [Profiler] Clean up use_mtia to follow standard use_device instead (#126284)
Summary:
use_mtia should instead set use_device='mtia' similar to cuda, xpu, and privateuseone. Avoid an ever-growing list of use_* arguments.

Since use_mtia is specific to FBCode, we don't need a deprecation warning.

Test Plan: CI.

Differential Revision: D57338005

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126284
Approved by: https://github.com/fenypatel99
2024-06-18 21:01:03 +00:00
4cc3fb5ee2 Bump urllib3 from 2.2.1 to 2.2.2 in /tools/build/bazel (#128908)
Bumps [urllib3](https://github.com/urllib3/urllib3) from 2.2.1 to 2.2.2.
- [Release notes](https://github.com/urllib3/urllib3/releases)
- [Changelog](https://github.com/urllib3/urllib3/blob/main/CHANGES.rst)
- [Commits](https://github.com/urllib3/urllib3/compare/2.2.1...2.2.2)

---
updated-dependencies:
- dependency-name: urllib3
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-06-18 13:38:22 -07:00
5dc4f652bc Backward support for unbind() with NJT (#128032)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128032
Approved by: https://github.com/soulitzer
2024-06-18 20:29:00 +00:00
44722c6b10 Revert "[dynamo][fsdp] Dont take unspecializedNNModuleVariable path for FSDP modules (#128453)"
This reverts commit 2b28b107dbafeec18d1095a2002e79511aa241df.

Reverted https://github.com/pytorch/pytorch/pull/128453 on behalf of https://github.com/anijain2305 due to luca saw bad compile time ([comment](https://github.com/pytorch/pytorch/pull/128453#issuecomment-2176877667))
2024-06-18 20:09:00 +00:00
1babeddbbf Revert "[inductor][mkldnn] Use floats instead of ints for pattern matcher test (#128484)"
This reverts commit 1f6e84fa6852805e15ddc9583c5f36c3a7f93df8.

Reverted https://github.com/pytorch/pytorch/pull/128484 on behalf of https://github.com/anijain2305 due to luca saw bad compile time ([comment](https://github.com/pytorch/pytorch/pull/128453#issuecomment-2176877667))
2024-06-18 20:09:00 +00:00
5bc9835d64 Revert "[dynamo][trace_rules] Remove incorrectly classified Ingraph functions (#128428)"
This reverts commit c52eda896eb3ec7f8d04b6321861f4c5614a40bb.

Reverted https://github.com/pytorch/pytorch/pull/128428 on behalf of https://github.com/anijain2305 due to luca saw bad compile time ([comment](https://github.com/pytorch/pytorch/pull/128453#issuecomment-2176877667))
2024-06-18 20:09:00 +00:00
9a7e2519d3 [MPS] Fused Adam & AdamW (#127242)
Summary:

This PR adds fused Adam and AdamW implementations.

Benchmark on Macbook Pro with M1 Max chip and 64GB unified memory:
**Fast math enabled:**
```
[---------------------------------------------- Fused Adam ----------------------------------------------]
                                                                           |  Fused: True  |  Fused: False
1 threads: -----------------------------------------------------------------------------------------------
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100        |       10      |       100
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100       |        9      |        89
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100       |        9      |        90
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100      |        9      |        83
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100       |       12      |        94
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100      |       11      |        88
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100      |       12      |        90
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100     |       11      |       100
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100     |       27      |       100
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100    |       23      |       100
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100    |       27      |       100
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100   |       23      |        98
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 500        |       82      |       480
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 500       |       72      |       450
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 500       |       82      |       450
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 500      |       73      |       420
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 500       |       91      |       500
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 500      |       83      |       400
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 500      |       94      |       500
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 500     |       78      |       400
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 500     |      170      |       500
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 500    |      140      |       600
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 500    |      170      |       600
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 500   |      140      |       500
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 1000       |      250      |       890
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 1000      |      220      |       850
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 1000      |      250      |       830
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 1000     |      220      |       770
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 1000      |      270      |       870
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 1000     |      230      |       840
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 1000     |      270      |       810
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 1000    |      240      |       800
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 1000    |      400      |      1000
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 1000   |      360      |      2000
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 1000   |      430      |      2000
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 1000  |      360      |      1300

Times are in milliseconds (ms).
```

**Fast math disabled:**
```
[---------------------------------------------- Fused Adam ----------------------------------------------]
                                                                           |  Fused: True  |  Fused: False
1 threads: -----------------------------------------------------------------------------------------------
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 100        |       10      |       100
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 100       |        9      |        84
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 100       |        9      |        84
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 100      |        9      |        79
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 100       |       11      |        93
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 100      |       10      |        90
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 100      |       11      |        91
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 100     |       11      |        81
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 100     |       34      |       100
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 100    |       31      |       100
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 100    |       34      |        95
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 100   |       31      |       100
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 500        |       94      |       500
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 500       |       82      |       430
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 500       |       92      |       430
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 500      |       81      |       390
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 500       |       98      |       500
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 500      |       88      |       430
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 500      |      100      |       500
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 500     |       88      |       400
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 500     |      210      |       500
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 500    |      190      |       610
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 500    |      210      |       510
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 500   |      190      |       500
      amsgrad: True, adamWflag: True, numel: 1024, num_tensors: 1000       |      300      |       900
      amsgrad: False, adamWflag: True, numel: 1024, num_tensors: 1000      |      260      |       850
      amsgrad: True, adamWflag: False, numel: 1024, num_tensors: 1000      |      295      |       900
      amsgrad: False, adamWflag: False, numel: 1024, num_tensors: 1000     |      260      |       800
      amsgrad: True, adamWflag: True, numel: 65536, num_tensors: 1000      |      320      |       910
      amsgrad: False, adamWflag: True, numel: 65536, num_tensors: 1000     |      280      |       900
      amsgrad: True, adamWflag: False, numel: 65536, num_tensors: 1000     |      320      |       900
      amsgrad: False, adamWflag: False, numel: 65536, num_tensors: 1000    |      300      |       900
      amsgrad: True, adamWflag: True, numel: 1048576, num_tensors: 1000    |      500      |      2000
      amsgrad: False, adamWflag: True, numel: 1048576, num_tensors: 1000   |      480      |      2000
      amsgrad: True, adamWflag: False, numel: 1048576, num_tensors: 1000   |      540      |      1500
      amsgrad: False, adamWflag: False, numel: 1048576, num_tensors: 1000  |      480      |      1200

Times are in milliseconds (ms).
```

```python
def profile_fused_adam():
    import torch
    from torch.optim import adam, adamw
    import torch.utils.benchmark as benchmark

    import itertools

    def profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused):
        fn(
            params,
            grads,
            exp_avgs,
            exp_avg_sqs,
            max_exp_avg_sqs,
            state_steps,
            foreach=False,
            capturable=False,
            fused=fused,
            amsgrad=amsgrad,
            beta1=0.9,
            beta2=0.99,
            lr=1e-3,
            weight_decay=.0,
            eps=1e-5,
            maximize=False,
            grad_scale=None,
            found_inf=None,
        )
        torch.mps.synchronize()

    device = "mps"

    results = []

    for num_tensors, numel, adamWflag, amsgrad in itertools.product([100, 500, 1000], [1024, 65536, 1048576], [True, False], [True, False]):
        print(f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}")
        params, grads, exp_avgs, exp_avg_sqs = [[torch.arange(numel, dtype=torch.float32, device=device) + (numel * i) for i in range(num_tensors)] for _ in range(4)]
        max_exp_avg_sqs = [torch.arange(numel, dtype=torch.float32, device=device) for _ in range(num_tensors)] if amsgrad else []
        state_steps = [torch.tensor([5], dtype=torch.float32, device=device) for _ in range(num_tensors)]
        if adamWflag:
            fn = adamw.adamw
        else:
            fn = adam.adam

        for fused in [True, False]:

            t = benchmark.Timer(
                    stmt='profile(fn, params, grads, exp_avgs, exp_avg_sqs, max_exp_avg_sqs, state_steps, amsgrad, fused)',
                    label='Fused Adam',
                    sub_label=f"amsgrad: {amsgrad}, adamWflag: {adamWflag}, numel: {numel}, num_tensors: {num_tensors}",
                    globals=locals(),
                    description= f"Fused: {fused}",
                ).blocked_autorange(min_run_time=5)
            results.append(t)

    compare = benchmark.Compare(results)
    compare.trim_significant_figures()
    compare.colorize(rowwise=True)
    compare.print()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127242
Approved by: https://github.com/kulinseth, https://github.com/janeyx99
2024-06-18 19:59:50 +00:00
fe8558b7aa [DSD] Add unittest to verify HSDP1 + broadcast_from_rank0 (#128755)
HSDP1 + broadcast_from_rank0 actually behaves differently from FSDP1 + broadcast_from_rank0, so we need a unit test to cover this use case.

This test relies on the fix from https://github.com/pytorch/pytorch/pull/128446.

Differential Revision: [D58621436](https://our.internmc.facebook.com/intern/diff/D58621436/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128755
Approved by: https://github.com/Skylion007, https://github.com/wz337
ghstack dependencies: #128685
2024-06-18 19:42:51 +00:00
abde6cab4c Remove compile_threads=1 in test_inductor_collectives.py (#128580)
Summary: I believe https://github.com/pytorch/pytorch/issues/125235 should be fixed after switching to subprocess-based parallel compile.

Test Plan: Ran locally with python-3.9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128580
Approved by: https://github.com/eellison
2024-06-18 19:31:13 +00:00
04a5d3228e [ts migration] Support prim::tolist and aten::len (#128894)
Support prim::tolist and aten::len. Add unit tests for prim::min.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128894
Approved by: https://github.com/angelayi
2024-06-18 19:11:07 +00:00
44483972bd [EZ] Keep weight_norm var name aligned (#128955)
To keep it aligned with
e6d4451ae8/aten/src/ATen/native/native_functions.yaml (L6484)
I.e.  `x`->`v`, `y`->`g`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128955
Approved by: https://github.com/albanD, https://github.com/Skylion007
2024-06-18 18:40:59 +00:00
bdffd9f0c6 [export] Graph break on nn.Parameter construction (#128935)
Fixes https://github.com/pytorch/pytorch/issues/126109

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128935
Approved by: https://github.com/angelayi
2024-06-18 18:37:44 +00:00
1a527915a6 [DSD] Correctly handle shared parameters for optimizer state_dict (#128685)
Fixes https://github.com/pytorch/pytorch/issues/128011

See the discussion in https://github.com/pytorch/pytorch/pull/128076

Current implementation of `set_optimizer_state_dict()` assumes that all the fqns returned by `_get_fqns()` must exist in the optimizer state_dict. This is not true if the model has shared parameters. In such a case, only one fqn of the shared parameters will appear in the optimizer state_dict. This PR addresses the issue.
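
The shared-parameter case described above can be illustrated with a minimal sketch. It is not the code from this PR: `find_state_for_param`, the FQN strings, and the plain-dict state are hypothetical stand-ins, chosen only to show why a lookup that requires every FQN to be present breaks down.

```python
from typing import Any, Dict, Iterable, Optional


def find_state_for_param(fqns: Iterable[str], optim_state: Dict[str, Any]) -> Optional[Any]:
    """Return the state stored under whichever FQN of a shared parameter exists.

    Hypothetical helper: a shared parameter has several FQNs, but only one of
    them is expected to appear as a key in the optimizer state.
    """
    for fqn in fqns:
        if fqn in optim_state:
            return optim_state[fqn]
    return None


# Two FQNs that refer to the same (shared) parameter; the names are made up.
shared_fqns = ["decoder.embed.weight", "encoder.embed.weight"]
# Only one of the FQNs is present in the (simplified) optimizer state.
optim_state = {"encoder.embed.weight": {"step": 5}}

state = find_state_for_param(shared_fqns, optim_state)
print(state)
```

A strict lookup such as `optim_state["decoder.embed.weight"]` would raise `KeyError` here, which is the kind of failure the commit message describes.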

Differential Revision: [D58573487](https://our.internmc.facebook.com/intern/diff/D58573487/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128685
Approved by: https://github.com/LucasLLC
2024-06-18 18:34:32 +00:00
d77a1aaa86 DOC: add note about same sized tensors to dist.gather() (#128676)
Fixes #103305

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128676
Approved by: https://github.com/wconstab
2024-06-18 18:26:07 +00:00
1877b7896c [checkpoint] Clean up selective activation checkpoint and make public (#125795)
### bc-breaking for existing users of the private API:
- Existing policy functions must now change their return value to be [CheckpointPolicy](c0b40ab42e/torch/utils/checkpoint.py (L1204-L1230))  Enum instead of bool.
   - To restore previous behavior, return `PREFER_RECOMPUTE` instead of `False` and `{PREFER,MUST}_SAVE` instead of `True` depending whether you prefer the compiler to override your policy.
- Policy function now accepts a `ctx` object instead of `mode` for its first argument.
   - To restore previous behavior, `mode = "recompute" if ctx.is_recompute else "forward"`.
- Existing calls to `_pt2_selective_checkpoint_context_fn_gen` must be renamed to `create_selective_checkpoint_contexts`. The way you use the API remains the same. It would've been nice to do something different (not make the user have to use functools.partial?), but this was the easiest to compile (idk if this should actually be a constraint).
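
The migration steps above can be sketched as a small adapter. This is an illustration under stated assumptions, not code from the PR: the `CheckpointPolicy` enum below is a local stand-in so the snippet runs without PyTorch (the real one lives in `torch.utils.checkpoint`), and `adapt_legacy_policy`, `FakeCtx`, and `legacy_policy` are hypothetical names.

```python
from enum import Enum, auto


class CheckpointPolicy(Enum):
    # Stand-in for torch.utils.checkpoint.CheckpointPolicy; only the member
    # names matter for this illustration.
    MUST_SAVE = auto()
    PREFER_SAVE = auto()
    MUST_RECOMPUTE = auto()
    PREFER_RECOMPUTE = auto()


def adapt_legacy_policy(legacy_fn, save_policy=CheckpointPolicy.PREFER_SAVE):
    """Wrap an old bool-returning policy(mode, op, ...) in the new convention."""

    def policy_fn(ctx, op, *args, **kwargs):
        # The first argument is now a ctx object; recover the old mode string.
        mode = "recompute" if ctx.is_recompute else "forward"
        should_save = legacy_fn(mode, op, *args, **kwargs)
        # True maps to a SAVE policy, False maps to PREFER_RECOMPUTE.
        return save_policy if should_save else CheckpointPolicy.PREFER_RECOMPUTE

    return policy_fn


class FakeCtx:
    # Hypothetical ctx stand-in exposing only `is_recompute`.
    def __init__(self, is_recompute):
        self.is_recompute = is_recompute


def legacy_policy(mode, op, *args, **kwargs):
    # Old-style policy: save outputs of "mm", recompute everything else.
    return op == "mm"


new_policy = adapt_legacy_policy(legacy_policy)
print(new_policy(FakeCtx(False), "mm"))
print(new_policy(FakeCtx(True), "relu"))
```

Passing `save_policy=CheckpointPolicy.MUST_SAVE` instead would correspond to not letting the compiler override the decision.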

Related doc: https://docs.google.com/document/d/1BKyizkZPdri9mHqdDOLAUpkI7SbbKfLHRFVVpK9ZWqo/edit

Memory considerations:
- As with the existing SAC, cached values are cleared upon first use.
- We error if the user wishes to backward a second time on a region forwarded with SAC enabled.

In-place:
- We use version counting to enforce that no cached tensor has been mutated. In-place operations that do not mutate cached tensors are allowed.
- `allow_cache_entry_mutation=True` can be passed to disable this check (useful in the case of auto AC where the user cleverly also saves the output of the in-place op)

Randomness, views
- Currently in this PR, we don't do anything special for randomness or views; the author of the policy function is expected to handle them properly. (Would it be beneficial to error? We either want to save all or recompute all random tensors.)

Tensor object preservation
- ~We guarantee that if a tensor does not require grad, and it is saved, then what you get out is the same tensor object.~ UPDATE: We guarantee that if a tensor is of non-differentiable dtype AND it is not a view, and it is saved, then what you get out is the same tensor object. This is a nice guarantee for nested tensors, which care about the object identity of the offsets tensor.

Policy function
- Enum values are `{MUST,PREFER}_{SAVE,RECOMPUTE}` (bikeshed welcome). Alternatively there was `{SAVE,RECOMPUTE}_{NON_,}OVERRIDABLE`. The former was preferred bc it seemed clearer that two `MUST` clashing should error, versus it is ambiguous whether two `NON_OVERRIDABLE` being stacked should silently ignore or error.
- The usage of Enum today. There actually is NO API to stack SAC policies today. The only thing the Enum should matter for in the near term is the compiler. The stacking SAC policy would be useful if someone wants to implement something like simple FSDP, but it is not perfect because with a policy of `PREFER_SAVE` you are actually saving more than autograd would save normally (would be fixed with AC v3).
- The number of times we call the policy_fn is something that should be documented as part of public API. We call the policy function for all ops except ~~detach~~ (UPDATE: metadata ops listed in `torch.utils.checkpoint.SAC_IGNORED_OPS`) because these ops may be called a different number of times by AC itself between forward and recompute.
- The policy function can be a stateful object (we do NOT make separate copies of this object for forward/recompute, the user is expected to handle that via `is_recompute`, see below).
Tensors guaranteed to be the same tensor as-is
- Policy function signature takes a ctx object as its first argument. The ctx object encapsulates info that may be useful to the user; it currently only holds "is_recompute". Adding this indirection gives us flexibility to add more attrs later if necessary.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125795
Approved by: https://github.com/Chillee, https://github.com/fmassa
2024-06-18 18:18:50 +00:00
77830d509f Revert "Introduce a prototype for SymmetricMemory (#128582)"
This reverts commit 7a39755da28d5a109bf0c37f72b364d3a83137b1.

Reverted https://github.com/pytorch/pytorch/pull/128582 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/128582#issuecomment-2176685232))
2024-06-18 18:11:43 +00:00
84c86e56bd Update tracker issues after successfully cherry-picking a PR (#128924)
This extends the capacity of the cherry-pick bot to automatically update the tracker issue with the information.  For this to work, the tracker issue needs to be an open one with a `release tracker` label, i.e. https://github.com/pytorch/pytorch/issues/128436.  The version from the release branch, i.e. `release/2.4`, will be matched with the title of the tracker issue, i.e. `[v.2.4.0] Release Tracker` or `[v.2.4.1] Release Tracker`
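
The branch-to-title matching described above could look roughly like the following. This is a hypothetical sketch, not the bot's actual code: `version_from_branch` and `tracker_matches` are made-up names, and the regular expressions are inferred only from the two example titles in this message.

```python
import re
from typing import Optional


def version_from_branch(branch: str) -> Optional[str]:
    """Extract "2.4" from a branch name such as "release/2.4"."""
    match = re.fullmatch(r"release/(\d+\.\d+)", branch)
    return match.group(1) if match else None


def tracker_matches(branch: str, issue_title: str) -> bool:
    """Check whether a tracker issue title belongs to the given release branch."""
    version = version_from_branch(branch)
    if version is None:
        return False
    # Titles look like "[v.2.4.0] Release Tracker"; any patch number matches.
    pattern = rf"\[v\.{re.escape(version)}\.\d+\] Release Tracker"
    return re.search(pattern, issue_title) is not None


print(tracker_matches("release/2.4", "[v.2.4.0] Release Tracker"))
print(tracker_matches("release/2.4", "[v.2.3.1] Release Tracker"))
```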

### Testing

`python cherry_pick.py --onto-branch release/2.4 --classification release --fixes "DEBUG DEBUG" --github-actor huydhn 128718`

* On the PR https://github.com/pytorch/pytorch/pull/128718#issuecomment-2174846771
* On the tracker issue https://github.com/pytorch/pytorch/issues/128436#issuecomment-2174846757

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128924
Approved by: https://github.com/atalman
2024-06-18 17:48:47 +00:00
4e03263224 [CUDA][Convolution] Add missing launch bounds to vol2col_kernel (#128740)
Fix "too many resources requested" that can happen with recent toolkits on V100.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128740
Approved by: https://github.com/mikaylagawarecki
2024-06-18 17:26:23 +00:00
26e374e3ca [EZ] Fix typos in RELEASE.md (#128769)
This PR fixes typos in `RELEASE.md`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128769
Approved by: https://github.com/yumium, https://github.com/mikaylagawarecki
2024-06-18 17:15:05 +00:00
9818283da1 re-enable jacrev/jacfwd/hessian after #128028 landed (#128622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128622
Approved by: https://github.com/zou3519
2024-06-18 17:08:58 +00:00
ec616da518 RNN API cleanup for cuDNN 9.1 (#122011)
Can potentially avoid a bit of boilerplate if we move directly to cuDNN 9.1's RNN API...

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122011
Approved by: https://github.com/Skylion007
2024-06-18 16:16:38 +00:00
108318ad10 [BE][JIT] Handle case where codegen object can be unset (#128951)
Summary:
Unblocks a test that's failing.

`codegen` can be unset until `compile` is called. If `codegen` is not set, then just use the kernel name directly.

Test Plan:
```
buck2 run //caffe2/test:tensorexpr -- --regex test_simple_add
```

Differential Revision: D58727391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128951
Approved by: https://github.com/aaronenyeshi
2024-06-18 15:40:45 +00:00
4817180601 make fallback for aten.argsort.stable (#128907)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128907
Approved by: https://github.com/lezcano
ghstack dependencies: #128343
2024-06-18 14:56:35 +00:00
22d258427b [BE][Easy] enable UFMT for torch/distributed/_shard/ (#128867)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128867
Approved by: https://github.com/fegin
ghstack dependencies: #128866
2024-06-18 14:39:25 +00:00
e6d4451ae8 [BE][Easy] enable UFMT for torch/distributed/{algorithms,autograd,benchmarks,checkpoint,elastic}/ (#128866)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128866
Approved by: https://github.com/fegin
2024-06-18 13:51:53 +00:00
f2805a0408 [FSDP2] Added APIs for explicit fwd/bwd prefetching (#128884)
This PR adds two APIs `set_modules_to_forward_prefetch` and `set_modules_to_backward_prefetch` to enable explicit forward/backward all-gather prefetching, respectively.

```
def set_modules_to_forward_prefetch(self, modules: List[FSDPModule]) -> None: ...
def set_modules_to_backward_prefetch(self, modules: List[FSDPModule]) -> None: ...
```

**Motivation**
FSDP2 implements _reasonable defaults_ for forward and backward prefetching. In forward, it uses implicit prefetching and allows two all-gather output tensors to be alive at once (so that the current all-gather copy-out can overlap with the next all-gather). In backward, it uses explicit prefetching based on the reverse post-forward order.

However, there may be cases where, with expert knowledge, we can reduce communication bubbles by moving all-gathers manually. One way to expose such behavior is to expose _prefetching limits_, i.e. integers that configure how many outstanding all-gathers/all-gather output tensors can be alive at once. IMHO, this leans toward _easy_, not _simple_ (see [PyTorch design principles](https://pytorch.org/docs/stable/community/design.html#principle-2-simple-over-easy)).

The crux of the problem is that there may be special cases where manual intervention can give better performance. Exposing a prefetching limit and allowing users to pass a value >1 just smooths over the problem since such a limit would generally apply over the entire model even though it possibly should not. Then, expert users will see a specific all-gather that they want to deviate from this limit, and there is little we can do.

Thus, we instead choose to expose the most primitive extension point: namely, every `FSDPModule` gives an opportunity to prefetch other all-gathers in forward and in backward. How to leverage this extension point is fully up to the user. Implementing the prefetch limit can be done using this extension point (e.g. record the post-forward order yourself using forward hooks, iterate over that order, and call the `set_modules_to_forward_prefetch` / `set_modules_to_backward_prefetch` APIs).
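
As a rough illustration of building a prefetch limit on top of this extension point, the sketch below computes, for each module in a recorded order, the next `limit` modules to prefetch. It is a sketch under assumptions: `build_prefetch_schedule` is a hypothetical helper, and plain strings stand in for the `FSDPModule` instances that forward hooks would record.

```python
from typing import Dict, List, Sequence, TypeVar

M = TypeVar("M")


def build_prefetch_schedule(order: Sequence[M], limit: int) -> Dict[int, List[M]]:
    """Map each position in `order` to the next `limit` modules after it.

    Positions with nothing left to prefetch (the last module) are omitted.
    """
    if limit < 1:
        raise ValueError("limit must be at least 1")
    return {
        i: list(order[i + 1 : i + 1 + limit])
        for i in range(len(order))
        if i + 1 < len(order)
    }


# Placeholder names stand in for FSDPModule instances recorded via forward hooks.
post_forward_order = ["layer0", "layer1", "layer2", "layer3"]
schedule = build_prefetch_schedule(post_forward_order, limit=2)
for i, targets in schedule.items():
    # With real modules this step would be:
    #   post_forward_order[i].set_modules_to_forward_prefetch(targets)
    print(post_forward_order[i], "->", targets)
```

For backward prefetching, the same helper could be applied to the reversed order and paired with `set_modules_to_backward_prefetch`.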

Differential Revision: [D58700346](https://our.internmc.facebook.com/intern/diff/D58700346)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128884
Approved by: https://github.com/ckluk2, https://github.com/weifengpy
2024-06-18 13:32:57 +00:00
3dd5f0ecbb Remove circular import (#128875)
Summary: A spurious import is causing circular dependency errors

Test Plan: phabricator signals

Differential Revision: D58685676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128875
Approved by: https://github.com/kit1980
2024-06-18 12:30:13 +00:00
304c934572 Move MKLDNN Specific IR to Separate File (#126504)
**Summary**
Following the discussion in https://github.com/pytorch/pytorch/pull/122593#discussion_r1604144782, Move Inductor MKLDNN specific IRs to a separate file.

Co-authored-by: Isuru Fernando <ifernando@quansight.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126504
Approved by: https://github.com/desertfire, https://github.com/jgong5
ghstack dependencies: #126841, #126940
2024-06-18 09:29:13 +00:00
6e43897912 [BE][ptd_fb_test][3/N] Enable TestSlide for MultiThreadedTestCase (#128843)
Enabling testslide for MultiThreadedTestCase, similar to https://github.com/pytorch/pytorch/pull/127512.

Differential Revision: [D58677457](https://our.internmc.facebook.com/intern/diff/D58677457/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128843
Approved by: https://github.com/wz337
2024-06-18 07:05:31 +00:00
60baeee59f [BE] Skip the test if CUDA is not available (#128885)
As title

Differential Revision: [D58690210](https://our.internmc.facebook.com/intern/diff/D58690210/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128885
Approved by: https://github.com/wz337
2024-06-18 07:02:44 +00:00
e3a39d49a0 [Traceable FSDP][Compiled Autograd] Add queue_callback() support (#126366)
Adds support for `Variable._execution_engine.queue_callback()`, which is used in FSDP2.

Important tests:
- `pytest -rA test/inductor/test_compiled_autograd.py::TestCompiledAutograd::test_callback_graph_break_throws_error`
- `pytest -rA test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_callback_adds_callback`
- `PYTORCH_TEST_WITH_DYNAMO=1 python test/test_autograd.py -k TestAutograd.test_callback_adds_callback`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126366
Approved by: https://github.com/xmfan
2024-06-18 06:22:14 +00:00
f7eae27946 Pass params to dump_nccl_trace_pickle (#128781)
Summary:
Pass parameters from request to dump_nccl_trace_pickle handler.
The supported parameter names and values are all lowercase.
includecollectives={true, false}
includestacktraces={true, false}
onlyactive={true, false}

Example post is:
/handler/dump_nccl_trace_pickle?includecollectives=true&includestacktraces=false&onlyactive=true
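
A request path like the one above can be assembled with the standard library. This is a minimal sketch; `build_dump_path` is a hypothetical helper name, and only the handler path and the three parameter names come from this message.

```python
from urllib.parse import urlencode


def build_dump_path(include_collectives: bool, include_stack_traces: bool, only_active: bool) -> str:
    """Build the handler path with lowercase parameter names and values."""
    params = {
        "includecollectives": str(include_collectives).lower(),
        "includestacktraces": str(include_stack_traces).lower(),
        "onlyactive": str(only_active).lower(),
    }
    return "/handler/dump_nccl_trace_pickle?" + urlencode(params)


print(build_dump_path(True, False, True))
```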

Test Plan:
unit tests

Differential Revision: [D58640474](https://our.internmc.facebook.com/intern/diff/D58640474)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128781
Approved by: https://github.com/d4l3k
2024-06-18 03:46:57 +00:00
d9eaa224f2 Fixes #128429: NaN in triu op on MPS (#128575)
Fixes the triu op when k > 0 and the lower triangle of the input tensor contains inf, which led to NaNs because the result was computed through the complement. Fixed by using the select API instead.
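
The failure mode can be reproduced in plain Python. This sketch assumes that "through complement" means subtracting the lower part from the input, which is an interpretation rather than something the message spells out; `triu_by_complement` and `triu_by_select` are hypothetical names and operate on nested lists, not MPS tensors.

```python
import math


def triu_by_complement(matrix, k=1):
    # upper = input - lower_part. Where the lower part holds inf,
    # inf - inf evaluates to NaN under IEEE 754 arithmetic.
    result = []
    for r, row in enumerate(matrix):
        out_row = []
        for c, value in enumerate(row):
            lower_part = value if c - r < k else 0.0
            out_row.append(value - lower_part)
        result.append(out_row)
    return result


def triu_by_select(matrix, k=1):
    # Selecting between the value and 0.0 never does arithmetic on masked entries.
    return [
        [value if c - r >= k else 0.0 for c, value in enumerate(row)]
        for r, row in enumerate(matrix)
    ]


m = [[1.0, 2.0], [math.inf, 4.0]]
print(triu_by_complement(m))
print(triu_by_select(m))
```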

Fixes #128429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128575
Approved by: https://github.com/kulinseth
2024-06-18 03:44:42 +00:00
59b4983dc0 DebugPlane: add dump_traceback handler (#128904)
This adds a `dump_traceback` handler so you can see all running threads for a job. This uses a temporary file as a buffer when calling `faulthandler.dump_traceback` and requires the GIL to be held during dumping.
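
The temporary-file buffering mentioned above can be sketched with the standard library alone. This is not the handler from the PR: `dump_all_thread_tracebacks` is a hypothetical helper that only shows how `faulthandler.dump_traceback`, which writes to a file descriptor, can be captured as a string.

```python
import faulthandler
import tempfile


def dump_all_thread_tracebacks() -> str:
    """Capture the tracebacks of all running threads as text."""
    # faulthandler writes straight to a file descriptor, so a real file
    # (here a temporary one) serves as the buffer.
    with tempfile.TemporaryFile() as buffer:
        faulthandler.dump_traceback(file=buffer, all_threads=True)
        buffer.seek(0)
        return buffer.read().decode("utf-8", errors="replace")


print(dump_all_thread_tracebacks())
```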

Test plan:

```
python test/distributed/elastic/test_control_plane.py -v -k traceback
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128904
Approved by: https://github.com/c-p-i-o
2024-06-18 03:40:16 +00:00
17abbafdfc [inductor] Fix some windows cpp builder issue (#128765)
1. Fix some Windows build args.
2. Fix the C++20 `likely` issue on Windows, reference: https://github.com/pytorch/pytorch/pull/124997.
3. Remove the compiler return-value check: different compilers return different values, so rely on exceptions to catch errors instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128765
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-18 03:25:20 +00:00
4061b3b822 Forward fix to skip ROCm tests for #122836 (#128891)
Fixes broken ROCm tests from #122836.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128891
Approved by: https://github.com/huydhn
ghstack dependencies: #127007, #128057, #122836
2024-06-18 03:01:19 +00:00
c017c97333 [dynamo][inlining-inbuilt-nn-modules] Update test output (#128880)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128880
Approved by: https://github.com/mlazos
ghstack dependencies: #128315, #128748, #128877, #128878
2024-06-18 02:18:09 +00:00
4e97d37fd9 [inlining-inbuilt-nn-modules][pre-grad] Adjust efficient_conv_bn_eval_graph for inlining (#128878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128878
Approved by: https://github.com/mlazos
ghstack dependencies: #128315, #128748, #128877
2024-06-18 02:18:09 +00:00
22f1793c0a [dynamo][easy] Use LazyVariableTracker for UserDefinedObject var_getattr (#128877)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128877
Approved by: https://github.com/mlazos
ghstack dependencies: #128315, #128748
2024-06-18 02:17:56 +00:00
43998711a7 [CUDAGraph] add more docs for cudagraph trees (#127963)
This PR adds more documentation for CUDAGraph Trees, including
- Iteration Support
- Input Mutation Support
- Dynamic Shape Support
- NCCL Support
- Reasons for Skipping CUDAGraph

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127963
Approved by: https://github.com/eellison
2024-06-18 02:07:07 +00:00
e12fa93b8b add is_big_gpu(0) check to test_select_algorithm tests in tests/inductor/test_cuda_cpp_wrapper.py (#128652)
In NVIDIA internal CI, on Jetson devices we are seeing this failure for `python test/inductor/test_cuda_cpp_wrapper.py -k test_addmm_cuda_cuda_wrapper -k test_linear_relu_cuda_cuda_wrapper`:

```
/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py:132: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(
W0613 20:57:17.722000 281473279256672 torch/_inductor/utils.py:902] [0/0] Not enough SMs to use max_autotune_gemm mode
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 2), ('unique_graphs', 1)]
inductor [('extern_calls', 2), ('fxgraph_cache_miss', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1)]
aot_autograd [('total', 1), ('ok', 1)]
F
======================================================================
FAIL: test_linear_relu_cuda_cuda_wrapper (__main__.TestCudaWrapper)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2759, in wrapper
    method(*args, **kwargs)
  File "/opt/pytorch/pytorch/test/inductor/test_torchinductor.py", line 9818, in new_test
    return value(self)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/pytorch/pytorch/test/inductor/test_cuda_cpp_wrapper.py", line 152, in fn
    _, code = test_torchinductor.run_and_get_cpp_code(
  File "/opt/pytorch/pytorch/test/inductor/test_torchinductor.py", line 356, in run_and_get_cpp_code
    result = fn(*args, **kwargs)
  File "/opt/pytorch/pytorch/test/inductor/test_select_algorithm.py", line 43, in wrapped
    return fn(*args, **kwargs)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/lib/python3.10/unittest/mock.py", line 1379, in patched
    return func(*newargs, **newkeywargs)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/opt/pytorch/pytorch/test/inductor/test_select_algorithm.py", line 62, in test_linear_relu_cuda
    self.assertEqual(counters["inductor"]["select_algorithm_autotune"], 1)
  File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 3642, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Scalars are not equal!

Expected 1 but got 0.
Absolute difference: 1
Relative difference: 1.0
```
Looking into it, we see the failure is from https://github.com/pytorch/pytorch/blob/main/test/inductor/test_select_algorithm.py#L62. The warning `W0613 20:57:17.722000 281473279256672 torch/_inductor/utils.py:902] [0/0] Not enough SMs to use max_autotune_gemm` is triggered from https://github.com/pytorch/pytorch/blob/main/torch/_inductor/utils.py#L973. Printing `torch.cuda.get_device_properties(0).multi_processor_count` returns 16 on the computelab AGX Orin. Since `min_required_sms` is 68, the check fails as expected and Inductor does not pick the autotune algorithm. The main block of test_select_algorithm.py only runs these tests if `is_big_gpu(0)` is true: https://github.com/pytorch/pytorch/blob/main/test/inductor/test_select_algorithm.py#L344. This PR adds a similar check where test_cuda_cpp_wrapper.py invokes these tests.
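
A minimal sketch of the gating logic described above, assuming the check reduces to comparing the device's SM count against a threshold. The helper name `has_enough_sms` is hypothetical; only the threshold of 68 and the count of 16 on the AGX Orin come from the text, and the real `is_big_gpu` may check more than this.

```python
MIN_REQUIRED_SMS = 68  # threshold quoted in the description above


def has_enough_sms(sm_count: int, min_required_sms: int = MIN_REQUIRED_SMS) -> bool:
    """Hypothetical stand-in for the SM-count part of the big-GPU check."""
    return sm_count >= min_required_sms


# On a CUDA machine the count would come from
# torch.cuda.get_device_properties(0).multi_processor_count.
orin_sms = 16
run_autotune_tests = has_enough_sms(orin_sms)
```

With 16 SMs the helper returns False, which matches the autotune counter staying at 0 in the failing assertion.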

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128652
Approved by: https://github.com/soulitzer, https://github.com/eqy
2024-06-18 02:00:04 +00:00
9e8443b56f Remove dtype from gpt-fast micro benchmark experiments model name (#128789)
Per comments on https://github.com/pytorch/test-infra/pull/5344, we already have a dtype column with the same information

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128789
Approved by: https://github.com/yanboliang
2024-06-18 01:26:45 +00:00
fbc7559ceb [custom ops] convert string type annotation to real type (#128809)
Fixes #105157

Bug source: `from __future__ import annotations` converts type annotations to strings to make forward references easier. However, existing custom ops do not consider strings to be valid types.

Fix: We check whether the argument and return type annotations are strings. If so, we try to use `eval` to convert them to types.
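
A rough sketch of the idea, not the code from the PR. The annotations are written as string literals here, which is what `from __future__ import annotations` would produce; `scale` and `resolve_annotation` are hypothetical names used only for illustration.

```python
import inspect


def scale(x: "float", factor: "int") -> "float":
    # String annotations, as `from __future__ import annotations` would produce.
    return x * factor


def resolve_annotation(annotation, namespace):
    """Turn a string annotation into a real type; pass real types through."""
    if isinstance(annotation, str):
        return eval(annotation, namespace)
    return annotation


signature = inspect.signature(scale)
raw_types = [p.annotation for p in signature.parameters.values()]
real_types = [resolve_annotation(a, scale.__globals__) for a in raw_types]
return_type = resolve_annotation(signature.return_annotation, scale.__globals__)
```

Evaluating in the function's own globals lets names imported in the defining module resolve. The standard library's `typing.get_type_hints` is an alternative that performs a similar resolution.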

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128809
Approved by: https://github.com/zou3519
2024-06-18 00:55:50 +00:00
c35ffaf954 [Inductor][CPP] Add ne with VecMask (#126940)
**Summary**
Fix https://github.com/pytorch/pytorch/issues/126824#issuecomment-2125039161 which is missing the support of `ne` with `VecMask`.

**Test Plan**
```
python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_ne_cpu_bool
```

Co-authored-by: Isuru Fernando <ifernando@quansight.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126940
Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #126841
2024-06-18 00:23:03 +00:00
beb29836cd [Inductor][CPP] Add Min/Max with VecMask (#126841)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/126824 which is missing the support of `min/max` with `VecMask`.

**Test Plan**
```
python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_clamp_max_cpu_bool
python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_clamp_min_cpu_bool
```

Co-authored-by: Isuru Fernando <ifernando@quansight.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126841
Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10
2024-06-18 00:20:32 +00:00
11ff5345d2 Changed colored logging to only be turned on if printing to interactive terminal (#128874)
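A minimal sketch of this kind of check, assuming the decision is based on whether the output stream is a TTY. The helper names and the ANSI colour are illustrative and are not taken from the PR.

```python
import io
import sys


def should_use_color(stream=None) -> bool:
    """Return True only when `stream` is attached to an interactive terminal."""
    stream = sys.stderr if stream is None else stream
    isatty = getattr(stream, "isatty", None)
    return bool(isatty and isatty())


def colorize(message: str, stream=None) -> str:
    # Wrap in ANSI red only for interactive terminals; leave files and pipes plain.
    if should_use_color(stream):
        return f"\033[31m{message}\033[0m"
    return message


plain = colorize("warning: recompiling", stream=io.StringIO())
```

An in-memory stream is not a terminal, so `plain` carries no escape codes, which is the behaviour wanted when logs are redirected to a file.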
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128874
Approved by: https://github.com/anijain2305
2024-06-17 23:53:26 +00:00
b70440f0a7 Document the torch.cuda.profiler.profile function (#128216)
Fixes https://github.com/pytorch/pytorch/issues/127901

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128216
Approved by: https://github.com/malfet, https://github.com/eqy
2024-06-17 23:42:40 +00:00
95b5ea9cde Add mark_unbacked (#128638)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128638
Approved by: https://github.com/IvanKobzarev
2024-06-17 23:39:48 +00:00
8415a4ba98 Back out "[ROCm] TunableOp for gemm_and_bias (#128143)" (#128815)
Summary:
Original commit changeset: 35083f04fdae

Original Phabricator Diff: D58501726

This PR introduces a large numerical gap, e.g. for a 256 x 4096 x 4096 GEMM, if we enable tunable op + DISABLE_ADDMM_HIP_LT=0, the results are way off.

Differential Revision: D58660832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128815
Approved by: https://github.com/mxz297, https://github.com/eqy, https://github.com/malfet
2024-06-17 22:52:27 +00:00
3b8c9b8ab1 [Docker Release] Test if pytorch was compiled with CUDA before pushing to repo (#128852)
Related to: https://github.com/pytorch/pytorch/issues/125879
Checks whether PyTorch was compiled with CUDA before publishing the CUDA Docker nightly image.
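
For readers less familiar with `test ... -a ...`, the shell condition in the Dockerfile step shown in the log below can be restated in Python. This is only a restatement of that condition; the function name is hypothetical.

```python
def should_abort_publish(is_cuda: str, cuda_version: str) -> bool:
    """Mirror of: test "${IS_CUDA}" != "True" -a ! -z "${CUDA_VERSION}"."""
    # Abort only when a CUDA image was requested (non-empty version)
    # but the installed torch reports it was not compiled with CUDA.
    return is_cuda != "True" and cuda_version != ""


cuda_image_without_cuda = should_abort_publish("False", "12.4.0")
cuda_image_with_cuda = should_abort_publish("True", "12.4.0")
cpu_image = should_abort_publish("False", "")
```

The CPU-only case passes because an empty `CUDA_VERSION` makes the second half of the condition false.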

Test
```
#18 [conda-installs 5/5] RUN IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())');    echo "Is torch compiled with cuda: ${IS_CUDA}";     if test "${IS_CUDA}" != "True" -a ! -z "12.4.0"; then 	exit 1;     fi
#18 1.656 Is torch compiled with cuda: False
#18 ERROR: process "/bin/sh -c IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())');    echo \"Is torch compiled with cuda: ${IS_CUDA}\";     if test \"${IS_CUDA}\" != \"True\" -a ! -z \"${CUDA_VERSION}\"; then \texit 1;     fi" did not complete successfully: exit code: 1
------
 > [conda-installs 5/5] RUN IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())');    echo "Is torch compiled with cuda: ${IS_CUDA}";     if test "${IS_CUDA}" != "True" -a ! -z "12.4.0"; then 	exit 1;     fi:
1.656 Is torch compiled with cuda: False
------
Dockerfile:80
--------------------
  79 |     RUN /opt/conda/bin/pip install torchelastic
  80 | >>> RUN IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())');\
  81 | >>>     echo "Is torch compiled with cuda: ${IS_CUDA}"; \
  82 | >>>     if test "${IS_CUDA}" != "True" -a ! -z "${CUDA_VERSION}"; then \
  83 | >>> 	exit 1; \
  84 | >>>     fi
  85 |
--------------------
ERROR: failed to solve: process "/bin/sh -c IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())');    echo \"Is torch compiled with cuda: ${IS_CUDA}\";     if test \"${IS_CUDA}\" != \"True\" -a ! -z \"${CUDA_VERSION}\"; then \texit 1;     fi" did not complete successfully: exit code: 1
(base) [ec2-user@ip-172-30-2-248 pytorch]$ docker buildx build --progress=plain  --platform="linux/amd64"  --target official -t ghcr.io/pytorch/pytorch:2.5.0.dev20240617-cuda12.4-cudnn9-devel --build-arg BASE_IMAGE=nvidia/cuda:12.4.0-devel-ubuntu22.04 --build-arg PYTHON_VERSION=3.11 --build-arg CUDA_VERSION= --build-arg CUDA_CHANNEL=nvidia --build-arg PYTORCH_VERSION=2.5.0.dev20240617 --build-arg INSTALL_CHANNEL=pytorch --build-arg TRITON_VERSION= --build-arg CMAKE_VARS="" .
#0 building with "default" instance using docker driver
```

Please note: it looks like the PR build installs from the pytorch channel rather than the nightly channel, hence CUDA 12.4 is failing since it is not in the pytorch channel yet:
https://github.com/pytorch/pytorch/actions/runs/9555354734/job/26338476741?pr=128852

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128852
Approved by: https://github.com/malfet
2024-06-17 22:51:12 +00:00
1835e3beab Fix the inductor ci (#128879)
Fix the torchbench+inductor ci on trunk due to recent upgrade to numpy 2.0.0rc1.
We have to remove DALLE2_pytorch model, since it depends on embedding-reader, which is not compatible with numpy>2: https://github.com/rom1504/embedding-reader/blob/main/requirements.txt#L3

Fixes #128845

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128879
Approved by: https://github.com/eellison
2024-06-17 22:20:33 +00:00
7baf32b5e7 [c10d] fix p2p group commsplit (#128803)
Summary:
For PointToPoint(sendrecv), the deviceId is lower_rank:higher_rank. This means a p2p group cannot be created through commSplit since it cannot find a parent.

Fix this by using the right device key of current rank.

Differential Revision: D58631639

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128803
Approved by: https://github.com/shuqiangzhang
2024-06-17 22:07:40 +00:00
1fd7496ab2 [MTIA] Fix synchronize API (#128714)
Reviewed By: fenypatel99

Differential Revision: D58590313

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128714
Approved by: https://github.com/aaronenyeshi
2024-06-17 21:58:46 +00:00
cyy
163847b1bb [1/N] [Caffe2] Remove caffe2_aten_fallback code (#128675)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128675
Approved by: https://github.com/r-barnes
2024-06-17 21:25:59 +00:00
8953725e6d [Inductor][FlexAttention] Tune backwards kernel block sizes (#128853)
This replaces #128767, which was somehow closed by mistake.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128853
Approved by: https://github.com/angelayi
2024-06-17 21:10:55 +00:00
a489792bb2 [GPT-benchmark] Fix memory bandwidth for MoE (#128783)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128783
Approved by: https://github.com/Chillee
ghstack dependencies: #128768
2024-06-17 21:04:57 +00:00
8c06eae17e [GPT-benchmark] Add metric: compilation time for GPT models (#128768)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128768
Approved by: https://github.com/Chillee
2024-06-17 21:04:57 +00:00
a59766ee05 replace AT_ERROR(...) with TORCH_CHECK(false, ...) (#128788)
As per title. Encountered the old-fashioned macro by chance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128788
Approved by: https://github.com/mikaylagawarecki
2024-06-17 20:50:22 +00:00
0f89e66d17 Validate logs are created by default (#128522)
Summary: Make sure that logs are captured in default settings

Test Plan: ci

Differential Revision: D58395812

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128522
Approved by: https://github.com/d4l3k
2024-06-17 20:07:13 +00:00
1577328ea4 Set bash shell on Windows (#128854)
Attempt to fix the missing python3 command on the new Windows AMI https://github.com/pytorch/pytorch/actions/runs/9551494945/job/26325922503.  I added the logic to copy python to python3 to make the command available. It worked with the previous AMI but has started to fail now, and the cause is not clear (maybe it's not the AMI, but a new GitHub runner version).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128854
Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/atalman
2024-06-17 19:24:09 +00:00
b181b58857 Fix Storage.filename to not track the filename when storage was mmap-ed with MAP_PRIVATE (#128725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128725
Approved by: https://github.com/albanD
2024-06-17 18:55:47 +00:00
213eba7d2e Configure mergebot via config (#128840)
* Companion to https://github.com/pytorch/test-infra/pull/5312
* See the above for details + possible risks
* Without the above PR, this should have no effects
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128840
Approved by: https://github.com/huydhn
2024-06-17 18:53:56 +00:00
c172b58fe0 Revert "Update DALLE2_pytorch expected accuracy result on CPU (#128718)"
This reverts commit fd27138c4a86bd763a6b8128d940a7c98f951603.

Reverted https://github.com/pytorch/pytorch/pull/128718 on behalf of https://github.com/huydhn due to This has reverted back to the previous expected value for some reason 153362fbc9 ([comment](https://github.com/pytorch/pytorch/pull/128718#issuecomment-2174194219))
2024-06-17 18:49:15 +00:00
5344c41d43 Use forked torchbench branch with pinned numpy (#128856)
Adds pinned numpy commit to yolov3 dependencies to the existing pinned commit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128856
Approved by: https://github.com/huydhn, https://github.com/PaliC
2024-06-17 18:41:42 +00:00
cyy
d35cdee97f [Caffe2] Remove caffe2 onnx tests (#128687)
They are not used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128687
Approved by: https://github.com/r-barnes
2024-06-17 18:17:58 +00:00
153362fbc9 Support HSDP + Monolith Checkpointing (#128446)
Fixes #128444. Rank 0 check should be in the same group as the broadcast

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128446
Approved by: https://github.com/fegin
2024-06-17 16:59:41 +00:00
c6b180a316 Created docs (and example) for cudart function in torch.cuda (#128741)
Fixes #127908

## Description

Created docs to document the torch.cuda.cudart function to solve the issue #127908.
I tried to stick to the [guidelines to document a function](https://github.com/pytorch/pytorch/wiki/Docstring-Guidelines#documenting-a-function) but I was not sure if there is a consensus on how to handle the docs of a function that calls an internal function. So I went ahead and tried what the function will raise, etc. from the user endpoint and documented it (i.e. I am giving what actually _lazy_init() will raise).

Updated PR from #128298 since I made quite a big mistake in my branch. I apologize for the newbie mistake.

### Summary of Changes

- Added docs for torch.cuda.cudart
- Added the cudart function in the autosummary of docs/source/cuda.rst

## Checklist
- [X] The issue that is being fixed is referred in the description
- [X] Only one issue is addressed in this pull request
- [X] Labels from the issue that this PR is fixing are added to this pull request
- [X] No unnecessary issues are included in this pull request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128741
Approved by: https://github.com/msaroufim
2024-06-17 16:50:37 +00:00
fc2913fb80 Remove amax return from _scaled_mm (#128683)
# Summary
The primary reasons for the change were the lack of a current use case and the need to work around two Inductor issues:
- Tensor arguments as kwarg only
- multiple outputs from triton templates

If the need for the amax return type arises, we can consider either adding it back or, more likely, creating a separate op.

In principle PyTorch is moving away from ops that bundle lots of functionality into "mega ops". We instead rely upon the compiler to generate appropriate fused kernels.

### Changes:
- This removes the amax return type from scaled_mm. We have found that the common use case is to return in "high-precision" (a type with more precision than fp8), while amax is only relevant when returning in low precision.
- We currently still allow fp8 returns and a scaled result. Perhaps we should ban this as well...

New signature:
```Python
def meta_scaled_mm(
    self: torch.Tensor,
    mat2: torch.Tensor,
    scale_a: torch.Tensor,
    scale_b: torch.Tensor,
    bias: Optional[torch.Tensor] = None,
    scale_result: Optional[torch.Tensor] = None,
    out_dtype: Optional[torch.dtype] = None,
    use_fast_accum: bool = False,
) -> torch.Tensor:
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128683
Approved by: https://github.com/vkuzo
2024-06-17 16:48:00 +00:00
73b78d1cbe Document the torch.nn.parallel.scatter_gather.gather function (#128566)
Fixes #127899

### Description
Add docstring to `torch/nn/parallel/scatter_gather.py:gather` function

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128566
Approved by: https://github.com/kwen2501
2024-06-17 16:44:17 +00:00
316b729677 [Fix] TS converter constant to tensor (#128442)
#### Issue
Tensor constants were previously lifted directly as inputs in the fx graph, which results in errors for multiple test cases with tensor constants. This PR introduces a fix to convert a tensor constant to a `GetAttr` in the fx graph.

This PR also introduces other fixes to maintain a valid `state_dict` for exported program when there are tensor constants. In short, after tensor constants are converted as `GetAttr`, they are treated as buffers during retracing. The fix will convert those back from buffer to constant.

#### Test Plan
Add new test cases that generate tensor constants
* `pytest test/export/test_converter.py -s -k test_implicit_constant_to_tensor_handling`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128442
Approved by: https://github.com/angelayi
2024-06-17 16:42:43 +00:00
a87d82abd7 [BE] enable UFMT for torch/nn/*.py (#128593)
Part of #123062

- #123062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128593
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #128596, #128594, #128592
2024-06-17 16:29:29 +00:00
f6e6e55fa7 [BE] enable UFMT for torch/nn/functional.py (#128592)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128592
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #128596, #128594
2024-06-17 16:29:29 +00:00
95ac2d6482 [BE] enable UFMT for torch/nn/modules (#128594)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128594
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #128596
2024-06-17 16:29:25 +00:00
dff6342a0b [BE][Easy] enable UFMT for torch/nn/parallel (#128596)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128596
Approved by: https://github.com/mikaylagawarecki
2024-06-17 16:29:22 +00:00
bfad0aee44 [export] Preserve requires_grad for export inputs. (#128656)
Summary: Today meta['val'] on placeholder nodes doesn't preserve requires_grad information consistent with the original inputs. There seems to be no easy way to fix this directly at the proxy tensor layer. Preserving it is useful for re-exporting the joint graph.

Test Plan: test_preserve_requires_grad_placeholders

Differential Revision: D58555651

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128656
Approved by: https://github.com/tugsbayasgalan
2024-06-17 16:26:08 +00:00
2a41fc0390 Short-term fix to preserve NJT metadata cache in torch.compile (#122836)
Idea: close over min / max sequence length in the main NJT view func (`_nested_view_from_jagged`) so that view replay during fake-ification propagates these correctly in torch.compile.

For dynamic shapes support for min / max sequence length, this PR uses a hack that stores the values in `(val, 0)` shaped tensors.

**NB: This PR changes SDPA to operate on real views instead of using `buffer_from_jagged()` / `ViewNestedFromBuffer`, which may impact the internal FIRST model. That is, it undoes the partial revert from #123215 alongside a fix to the problem that required the partial revert. We need to verify that there are no regressions there before landing.**

Differential Revision: [D55448636](https://our.internmc.facebook.com/intern/diff/D55448636)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122836
Approved by: https://github.com/soulitzer
ghstack dependencies: #127007, #128057
2024-06-17 15:25:09 +00:00
24443fe16a [inductor] parallel compile: Print traceback detail when there's an exception in a sub-process (#128775)
Summary: We lose traceback info when an exception occurs in a subprocess because Python traceback objects don't pickle. In the subprocess-based parallel compile, we _are_ logging an exception in the subprocess, but a) those messages are easy to miss because they're not in the traceback output, and b) it seems that logging in the subproc is swallowed by default in internal builds. This PR captures the traceback in the subprocess and makes it available in the exception thrown in the main process. Users now see failures that look like this:

```
  ...
  File "/home/slarsen/.conda/envs/pytorch-3.10_3/lib/python3.10/concurrent/futures/_base.py", line 458, in result
    return self.__get_result()
  File "/home/slarsen/.conda/envs/pytorch-3.10_3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
    raise self._exception
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
SubprocException: An exception occurred in a subprocess:

Traceback (most recent call last):
  File "/data/users/slarsen/pytorch-3.10_3/torch/_inductor/compile_worker/subproc_pool.py", line 270, in do_job
    result = SubprocMain.foo()
  File "/data/users/slarsen/pytorch-3.10_3/torch/_inductor/compile_worker/subproc_pool.py", line 263, in foo
    SubprocMain.bar()
  File "/data/users/slarsen/pytorch-3.10_3/torch/_inductor/compile_worker/subproc_pool.py", line 260, in bar
    SubprocMain.baz()
  File "/data/users/slarsen/pytorch-3.10_3/torch/_inductor/compile_worker/subproc_pool.py", line 257, in baz
    raise Exception("an error occurred")
Exception: an error occurred
```
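
The underlying limitation and the workaround can be sketched in a few lines. This is a simplified, single-process illustration in which a `pickle` round trip stands in for the process boundary; the function names are hypothetical and the actual subprocess pool code differs.

```python
import pickle
import traceback


def failing_job():
    raise ValueError("an error occurred")


def run_in_worker(job):
    """Run `job`; on failure, return the traceback as plain text."""
    try:
        return {"status": "ok", "detail": job(), "tb_picklable": None}
    except Exception as exc:
        try:
            pickle.dumps(exc.__traceback__)
            tb_picklable = True
        except TypeError:
            # Traceback objects cannot be pickled, which is why the detail is lost.
            tb_picklable = False
        # A formatted string survives pickling, so send that instead.
        return {
            "status": "error",
            "detail": traceback.format_exc(),
            "tb_picklable": tb_picklable,
        }


# Simulate sending the result from the worker back to the main process.
received = pickle.loads(pickle.dumps(run_in_worker(failing_job)))
```

The main process can then embed `received["detail"]` in the message of the exception it raises, so the worker's frames show up in the user-visible output.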

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128775
Approved by: https://github.com/jansel
2024-06-17 15:10:47 +00:00
e3093849e5 [Docs] Update links (#128795)
From
https://pytorch.org/docs/stable/nn.html#torch.nn.Embedding to
https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html

And from
https://pytorch.org/docs/stable/nn.html#torch.nn.EmbeddingBag  to
https://pytorch.org/docs/stable/generated/torch.nn.EmbeddingBag.html

Fixes https://github.com/pytorch/pytorch/issues/128774

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128795
Approved by: https://github.com/atalman
2024-06-17 14:55:32 +00:00
0f81473d7b Update fake tensor error checks for bool tensor subtraction (#128492)
Fixes #127003

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128492
Approved by: https://github.com/soulitzer
2024-06-17 13:41:15 +00:00
b0282071c4 [dynamo] override torch.nn.modules.activation._is_make_fx_tracing (#128748)
Discovered while inlining `MultiHeadAttention` nn Module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128748
Approved by: https://github.com/jansel
ghstack dependencies: #128315
2024-06-17 08:49:29 +00:00
b40a033c38 [cpp_extension][inductor] Fix sleef windows depends. (#128770)
# Issue:
While working on enabling Inductor on PyTorch Windows, I found a sleef lib dependency issue.
<img width="1011" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/423bd854-3c5f-468f-9a64-a392d9b514e3">

# Analysis:
After we enabled SIMD on PyTorch Windows (https://github.com/pytorch/pytorch/pull/118980), the sleef functions are called from the VEC headers, which makes sleef a dependency.

Here is a difference between Windows and Linux:
## Linux :
Linux exports functions by default, so libtorch_cpu.so statically links to sleef.a and then also exports sleef's functions.
<img width="647" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/00ac536c-33fc-4943-a435-25590508840d">

## Windows:
Windows does not export functions by default and has many limitations on exporting functions; reference: https://github.com/pytorch/pytorch/issues/80604
We can't package sleef functions via torch_cpu.dll as we do on Linux.

# Solution:
Actually, we also package the sleef static lib as part of the release. We just need to help users link to sleef.lib and it should be fine.
1. Add sleef to cpp_builder for inductor.
2. Add sleef to cpp_extension for C++ extensions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128770
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-17 05:44:34 +00:00
a52c8ace98 [3/N] Non-Tensor: Support string parameter for aten operations (#125831)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125831
Approved by: https://github.com/jansel, https://github.com/jgong5
2024-06-17 05:11:29 +00:00
cyy
74e11a4210 Enable clang-tidy on torch/csrc/mps (#128782)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128782
Approved by: https://github.com/Skylion007
2024-06-17 02:19:48 +00:00
cyy
f9dae86222 Concat namespaces in torch/csrc/utils/* (#128787)
Concat namespaces in torch/csrc/utils/*
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128787
Approved by: https://github.com/Skylion007
2024-06-16 23:51:14 +00:00
6cbdbb6c3c Remove top lev numpy dependency from fuzzer.py (#128759)
Test CI

This fixes issues like the one below, where I don't even intend to use the fuzzer. With this change, numpy is imported only when someone calls functions from the fuzzer; otherwise the import does not happen at the top of the file.

```
>>> import torchao
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/__init__.py", line 26, in <module>
    from torchao.quantization import (
  File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/quantization/__init__.py", line 7, in <module>
    from .smoothquant import *  # noqa: F403
  File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/quantization/smoothquant.py", line 18, in <module>
    import torchao.quantization.quant_api as quant_api
  File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/quantization/quant_api.py", line 23, in <module>
    from torchao.utils import (
  File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/utils.py", line 2, in <module>
    import torch.utils.benchmark as benchmark
  File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torch/utils/benchmark/__init__.py", line 4, in <module>
    from torch.utils.benchmark.utils.fuzzer import *  # noqa: F403
  File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torch/utils/benchmark/utils/fuzzer.py", line 5, in <module>
    import numpy as np
ModuleNotFoundError: No module named 'numpy'
```
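
A small sketch of the deferred-import pattern described above, assuming a loader closure; `lazy_module`, `get_numpy`, and `random_lengths` are hypothetical names, and the simplest form is just an `import numpy as np` inside the function body.

```python
import importlib


def lazy_module(name: str):
    """Return a zero-argument loader that imports `name` on first call."""
    cache = {}

    def load():
        if "module" not in cache:
            cache["module"] = importlib.import_module(name)
        return cache["module"]

    return load


# Nothing is imported here; a missing numpy only matters once get_numpy() runs.
get_numpy = lazy_module("numpy")


def random_lengths(count: int, seed: int = 0):
    np = get_numpy()  # import happens here, not at module import time
    return np.random.RandomState(seed).randint(1, 16, size=count)
```

Importing a module that defines `random_lengths` this way succeeds without numpy installed; the `ModuleNotFoundError` would only surface for callers that actually use the function.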
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128759
Approved by: https://github.com/Skylion007
2024-06-16 16:34:12 +00:00
f8d60e0e0a [Inductor][CPP] Fix Half data type cse cache issue for CPP Backend (#128498)
**Summary**
Fixing issue: https://github.com/pytorch/pytorch/issues/128263. After https://github.com/pytorch/pytorch/issues/115260, we cached the higher precision cse variable to avoid duplicate casting between buffers. However, it failed to check the original data type. This means if we convert `int32` to `bf16` for `store` and then convert `bf16` back to `fp32` for `load`, it would incorrectly hit the cache and reuse the `int32` cse var. This PR fixes the issue.
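
The failure mode can be illustrated with a toy cache. This is a hypothetical model, not Inductor's CSE code: it maps a low-precision variable back to the variable it was cast from, and the fix corresponds to also comparing the recorded dtype with the requested one.

```python
class UpcastCache:
    """Hypothetical cache from a low-precision variable to its original variable."""

    def __init__(self):
        self._origin = {}

    def record_downcast(self, low_var, orig_var, orig_dtype):
        self._origin[low_var] = (orig_var, orig_dtype)

    def lookup(self, low_var, target_dtype):
        entry = self._origin.get(low_var)
        if entry is None:
            return None
        orig_var, orig_dtype = entry
        # The dtype comparison is the point: without it, an int32 variable
        # would be reused for a request that asks for fp32.
        return orig_var if orig_dtype == target_dtype else None


cache = UpcastCache()
cache.record_downcast("tmp1_bf16", "tmp0", "int32")  # store: int32 -> bf16
hit = cache.lookup("tmp1_bf16", "int32")
miss = cache.lookup("tmp1_bf16", "float32")  # load: bf16 -> fp32 must not reuse tmp0
```

In this model, a `None` result means the caller has to emit a fresh cast instead of reusing the cached variable.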

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_issue_128263
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128498
Approved by: https://github.com/jgong5, https://github.com/zhuhaozhe, https://github.com/jerryzh168
2024-06-16 11:27:13 +00:00
979edbbe12 [Traceable FSDP2] Dynamo support FSDP2 use_training_state context manager (#127854)
Improve Dynamo to support the FSDP2 `use_training_state()` context manager.

Test command:
`
pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_dynamo_trace_use_training_state
`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127854
Approved by: https://github.com/yanboliang
2024-06-16 08:48:52 +00:00
e4d8aa4d24 [torchbench] Enable some models with inline_inbuilt_nn_modules (#128315)
For all models except drq, the number of graph breaks/recompiles goes down.
For drq it goes up, and that increase is legitimate.

Co-authored-by: Laith Sakka <lsakka@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128315
Approved by: https://github.com/jansel
2024-06-16 08:37:23 +00:00
cc518ebd38 [Inductor Intel GPU backend Upstream] Reuse inductor test for Intel GPU (PART 2) (#124147)
Reuse Inductor test case for Intel GPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124147
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-06-16 08:07:05 +00:00
f1ee3589a1 [Inductor] Emit strided block pointer from ModularIndexing and FloorDiv (#127342)
**Summary**

Inductor currently uses modulo and division to compute indices into certain multi-dimensional tensors, such as those arising from row padding. This PR matches on that indexing pattern, replacing it with an N-D block pointer. This should be more efficient than computing indices with division and modulo, and it can easily map to DMAs on non-GPU hardware targets.

Because the 1D block size needs to map to an integer block shape in ND, we need to know that the ND block size evenly divides the size of the iteration range. This PR only generates ND block pointers when it can guarantee that the iteration order and number of elements loaded are unchanged. This means that the number of elements in a slice of the iteration range must either be:
  - Powers of 2. Since Triton block sizes are powers of 2, any integer power of 2 either divides the block size, or is greater than the block size. In the latter case, `CeilDiv(x, y)` rounds up to 1.
  - Multiples of the maximum block size. Since block sizes are powers of 2, the maximum block size is a multiple of every possible block size.

Note that a *slice* of the iteration range does not include the leading dimension. Thus we can support arbitrary leading dimensions like `(5,8)`.
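
One way to read the condition above is as a predicate on the shape. The sketch below is my interpretation rather than the PR's code: it treats each trailing sub-shape (everything after the leading dimension, then after the next one, and so on) as a slice, and the function names and the `max_block` value of 1024 in the examples are assumptions.

```python
from math import prod


def is_power_of_two(n: int) -> bool:
    return n > 0 and (n & (n - 1)) == 0


def slice_sizes(shape):
    """Element counts of the trailing slices, excluding the leading dimension."""
    return [prod(shape[i:]) for i in range(1, len(shape))]


def can_use_block_ptr(shape, max_block: int) -> bool:
    # Every slice must be a power of 2 or a multiple of the maximum block size.
    return all(
        is_power_of_two(n) or n % max_block == 0 for n in slice_sizes(shape)
    )


arbitrary_leading_dim = can_use_block_ptr((5, 8), max_block=1024)
non_divisible = can_use_block_ptr((5, 6), max_block=1024)
```

Under this reading `(5, 8)` qualifies because the only slice has 8 elements, while `(5, 6)` does not because 6 is neither a power of 2 nor a multiple of the maximum block size.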

Feature proposal and discussion: https://github.com/pytorch/pytorch/issues/125077

Example kernel:
```
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 4096
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    tmp0 = tl.reshape(tl.load(tl.make_block_ptr(in_ptr0, shape=[32, 16, 8], strides=[1024, 32, 1], block_shape=[32 * (32 <= ((127 + XBLOCK) // 128)) + ((127 + XBLOCK) // 128) * (((127 + XBLOCK) // 128) < 32), 16 * (16 <= ((7 + XBLOCK) // 8)) + ((7 + XBLOCK) // 8) * (((7 + XBLOCK) // 8) < 16), 8 * (8 <= XBLOCK) + XBLOCK * (XBLOCK < 8)], order=[0, 1, 2], offsets=[(xoffset // 128), (xoffset // 8) % 16, xoffset % 8]), boundary_check=[0, 1, 2]), [XBLOCK])
    tmp1 = tmp0 + tmp0
    tl.store(tl.make_block_ptr(out_ptr0, shape=[4096], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp1, [XBLOCK]).to(tl.float32))
''', device_str='cuda')
```
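
The `block_shape` expressions in the kernel are easier to read once you notice that `a * (a <= b) + b * (b < a)` is a branch-free `min(a, b)` and that `(127 + XBLOCK) // 128` is a ceiling division. The plain-Python check below restates that arithmetic for the `[32, 16, 8]` tensor; the helper names are mine.

```python
def ceildiv(a: int, b: int) -> int:
    return (a + b - 1) // b


def branchless_min(a: int, b: int) -> int:
    # Same form as the generated code: a * (a <= b) + b * (b < a).
    return a * (a <= b) + b * (b < a)


def block_shape(xblock: int):
    """Block shape for the [32, 16, 8] tensor in the example kernel."""
    return [
        branchless_min(32, ceildiv(xblock, 128)),
        branchless_min(16, ceildiv(xblock, 8)),
        branchless_min(8, xblock),
    ]


shape_1024 = block_shape(1024)
shape_64 = block_shape(64)
```

For power-of-2 values of `XBLOCK` up to 4096, the three block dimensions multiply back to `XBLOCK`, which is what allows the loaded block to be reshaped to `[XBLOCK]`.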

**Test Plan**

This PR adds a new CI test script to cover this feature. The tests can be grouped into a few main categories:
  - Can we generate strided block pointers for the appropriate shapes?
     - Powers of 2
     - Non-power of 2, but multiple of the maximum block size
     - Arbitrary leading dimensions, with power of 2 inner dimensions
     - Weird strides and offsets
     - Reductions
     - Symbolic shapes that are multiples of the maximum block size (wasn't able to trace this through dynamo)
     - Broadcasts (some variables are missing from the indexing expression)
  - Do we still compile other cases correctly, even if we don't expect to be able to generate block pointers?
     - Unsupported static shapes
     - Unsupported symbolic shapes
  - Mixing and matching these cases:
     - Pointwise and reduction in the same kernel
  - Sanity check the test harness
     - Do we raise an exception if the expected number of block pointers and the actual number are different?

**Follow-ups**

There are a few important cases which this PR can't handle. I'm hoping these can be deferred to follow-up PRs:
  - Handle non-divisible shapes
      - Change the tiling algorithm to generate a 2D (X,Y) blocking, if doing so enables block pointers to be emitted.
      - Pad unsupported loads up to the nearest divisible size, then mask/slice out the extra elements? This is probably the best solution, but I'm not yet sure how to go about it in triton.
 - Take advantage of this analysis when `triton.use_block_ptr=False`. I'm guessing we can still avoid `%` and `/` without requiring block pointers. Maybe we could compute block indices with arange and broadcast instead?

Differential Revision: D56739375

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127342
Approved by: https://github.com/jansel, https://github.com/shunting314
2024-06-16 07:35:57 +00:00
a61939467a Enable passing dynamo-traced complex test (#128771)
Fixes https://github.com/pytorch/pytorch/issues/118159

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128771
Approved by: https://github.com/anijain2305
2024-06-16 07:28:09 +00:00
ab13980424 [ONNX] Update 'person_of_interest.rst', 'CODEOWNERS' and 'merge_rules.yaml' (#126364)
The following are all constrained under the ONNX exporter project scope.

- `persons_of_interest.rst`
  - Moving folks no longer working on the project to emeritus.
  - Adding @justinchuby, @titaiwangms, @shubhambhokare1 and @xadupre,
    who have all made countless contributions to this project.
- `CODEOWNERS`
  - Removing folks no longer working on the project.
  - Updating new owners who will now be notified with PRs related to
    the specific file paths.
- `merge_rules.yaml`
  - Removing folks no longer working on the project.

🫡

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126364
Approved by: https://github.com/titaiwangms, https://github.com/justinchuby, https://github.com/albanD
2024-06-16 04:52:16 +00:00
6079c50910 Make config.fx_graph_remote_cache be three-value switch (#128628)
Summary:
We want to allow for three configurations
False: Force off
True: Force on
None: OFF for OSS and JK config for internal
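
A three-value switch like this is usually resolved in one place. The sketch below shows one way to do it; `should_use_remote_cache`, `is_internal_build`, and `internal_default` are hypothetical names chosen for illustration, not the actual Inductor config code.

```python
from typing import Callable, Optional


def should_use_remote_cache(
    setting: Optional[bool],
    is_internal_build: bool,
    internal_default: Callable[[], bool],
) -> bool:
    """Resolve a tri-state config value to a concrete on/off decision."""
    if setting is not None:
        # True forces the cache on, False forces it off.
        return setting
    if not is_internal_build:
        # None means "off" for open-source builds.
        return False
    # None defers to an internal, externally controlled default.
    return internal_default()
```

Keeping `None` distinct from `False` lets an explicit user choice always win over the environment-dependent default.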

Test Plan: CI

Differential Revision: D58535897

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128628
Approved by: https://github.com/masnesral, https://github.com/eellison
2024-06-15 17:52:09 +00:00
94c0dcbe1d [inductor] Parallel compile: handle crashes in subprocesses (#128757)
Summary: If any subprocess in the pool crashes, we get a BrokenProcessPool exception and the whole pool becomes unusable. Handle crashes by recreating the pool.

Test Plan:
* New unit test
* Started a long-running test (`test/inductor/test_torchinductor.py`), periodically killed subprocess manually, made sure the test run recovers and makes progress.
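
The recovery pattern described in the summary can be sketched with the standard library alone. This is a simplified illustration, not the Inductor implementation: `ResilientPool` is a hypothetical wrapper, and it assumes a POSIX system where the `fork` start method is available.

```python
import concurrent.futures as cf
import multiprocessing as mp
import os
from concurrent.futures.process import BrokenProcessPool


class ResilientPool:
    """Process pool that replaces itself after a worker crash (sketch)."""

    def __init__(self, workers: int = 2):
        self._workers = workers
        self.restarts = 0
        self._pool = self._new_pool()

    def _new_pool(self) -> cf.ProcessPoolExecutor:
        # Assumption: "fork" is available (Linux and other POSIX systems).
        return cf.ProcessPoolExecutor(
            max_workers=self._workers, mp_context=mp.get_context("fork")
        )

    def run(self, fn, *args):
        try:
            return self._pool.submit(fn, *args).result()
        except BrokenProcessPool:
            # A dead worker makes the whole executor unusable, so swap in a
            # fresh one before reporting the failure to the caller.
            self._pool.shutdown(wait=False)
            self._pool = self._new_pool()
            self.restarts += 1
            raise

    def close(self) -> None:
        self._pool.shutdown(wait=True)
```

The failed job still surfaces as an exception, but later calls to `run()` succeed because they go to the new pool.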

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128757
Approved by: https://github.com/jansel
2024-06-15 17:35:04 +00:00
f0d68120f4 [subclasses] Handle dynamo inputs that are subclass views with (-1) in the view (#128662)
When handling an input to dynamo that's a view of a subclass, dynamo does some handling to reconstruct the view. Part of this is to construct symints for the input parameters to the view.

Previously, the code would just call `create_symbol()` which by default specifies a _positive_ symint (>= 0); this fails in the case where you have an aten::view that was called with a -1.

Fix: just specify `positive=None` when calling `create_symbol()`, to avoid restricting the symint to >= 0 or <= 0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128662
Approved by: https://github.com/jbschlosser
2024-06-15 14:58:18 +00:00
18634048a1 Separate AOTI Eager utils as a single file (#125819)
The key change is code movement. We just moved aoti eager related code from `torch._inductor.utils` to `torch._inductor.aoti_eager`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125819
Approved by: https://github.com/jansel, https://github.com/jgong5, https://github.com/desertfire
ghstack dependencies: #125308
2024-06-15 13:42:49 +00:00
7a39755da2 Introduce a prototype for SymmetricMemory (#128582)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them.

### SymmetricMemory

`SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for **op-level custom communication patterns** (via the get_buffer APIs and the synchronization primitives), as well as **custom communication kernels** (via the buffer and signal_pad device pointers).

### Python API Example

```python
import torch
from torch._C._distributed_c10d import _SymmetricMemory

# Set a store for rendezvousing symmetric allocations on a group of devices
# identified by group_name. The concept of groups is logical; users can
# utilize predefined groups (e.g., a group of devices identified by a
# ProcessGroup) or create custom ones. Note that SymmetricMemoryAllocator
# backends might employ a more efficient communication channel for the actual
# rendezvous process and only use the store for bootstrapping purposes.
_SymmetricMemory.set_group_info(group_name, rank, world_size, store)

# Identical to empty_strided, but allows symmetric memory access to be
# established for the allocated tensor via _SymmetricMemory.rendezvous().
# This function itself is not a collective operation.
t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name)

# Users can write Python custom ops that leverage symmetric memory access.
# Below are examples of things users can do (assuming the group's world_size is 2).

# Establishes symmetric memory access on tensors allocated via
# _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process,
# and the mapping between a local memory region and the associated SymmetricMemory
# object is unique. Subsequent calls to rendezvous() with the same tensor will receive
# the cached SymmetricMemory object.
#
# The function has a collective semantic and must be invoked simultaneously
# from all rendezvous participants.
symm_mem = _SymmetricMemory.rendezvous(t)

# This represents the allocation on rank 0 and is accessible from all devices.
buf = symm_mem.get_buffer(0, (64, 64), torch.float32)

if symm_mem.rank == 0:
    symm_mem.wait_signal(src_rank=1)
    assert buf.eq(42).all()
else:
    # The remote buffer can be used as a regular tensor
    buf.fill_(42)
    symm_mem.put_signal(dst_rank=0)

symm_mem.barrier()

if symm_mem.rank == 0:
    symm_mem.barrier()
    assert buf.eq(43).all()
else:
    new_val = torch.empty_like(buf)
    new_val.fill_(43)
    # Contiguous copies to/from a remote buffer utilize copy engines,
    # which bypass SMs (i.e., no need to load the data into registers)
    buf.copy_(new_val)
    symm_mem.barrier()
```

### Custom CUDA Comm Kernels

Given a tensor, users can access the associated `SymmetricMemory`, which provides the pointers to remote buffers/signal_pads needed for custom communication kernels.

```cpp
TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory(
    const at::Tensor& tensor);

class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target {
 public:
  ...
  virtual std::vector<void*> get_buffer_ptrs() = 0;
  virtual std::vector<void*> get_signal_pad_ptrs() = 0;
  virtual void** get_buffer_ptrs_dev() = 0;
  virtual void** get_signal_pad_ptrs_dev() = 0;
  virtual size_t get_buffer_size() = 0;
  virtual size_t get_signal_pad_size() = 0;
  virtual int get_rank() = 0;
  virtual int get_world_size() = 0;
  ...
};
```

### Limitations of IntraNodeComm and ProcessGroupCudaP2p
`IntraNodeComm` (used by `ProcessGroupCudaP2p`) manages a single fixed-size workspace. This approach:
- Leads to awkward UX in which the required workspace needs to be specified upfront.
- Cannot avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather).
- Prevents torch.compile from eliminating all copies.

In addition, they only offer out-of-the-box communication kernels and don't expose required pointers for user-defined, custom CUDA comm kernels.

* __->__ #128582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582
Approved by: https://github.com/wanchaol
2024-06-15 10:20:21 +00:00
60bbdc0b40 Modularize aten parameter parser and checker (#125308)
In this PR, we abstract the different types of aten operation parameters as `ParameterMetadata`. This structure is intended to represent and store the metadata of each aten operation parameter. Currently, it only supports `Tensor`, `TensorList`, and `Scalar`.

```C++
using ParameterMetadataValue = std::variant<TensorMetadata, std::vector<TensorMetadata>, c10::Scalar>;
```

With this PR, we can add support for other parameter types, such as `string`, `int`, and `double`, in a more modular way. The candidate types are summarized in the following list, which is collected from all aten operations and ordered by how often each type is used.

- `Tensor`
- `bool`
- `int64_t`
- `TensorList`
- `Scalar`
- `c10::SymIntArrayRef`
- `::std::optional<Tensor>`
- `IntArrayRef`
- `double`
- `c10::SymInt`
- `::std::optional<ScalarType>`
- `::std::optional<double>`
- `::std::optional<bool>`
- `::std::optional<Layout>`
- `::std::optional<Device>`
- `::std::optional<int64_t>`
- `Dimname`
- `::std::optional<Generator>`
- `c10::string_view`
- `::std::optional<c10::string_view>`
- `OptionalIntArrayRef`
- `::std::optional<Scalar>`
- `OptionalSymIntArrayRef`
- `::std::optional<MemoryFormat>`
- `::std::optional<c10::SymInt>`
- `ScalarType`
- `ArrayRef<Scalar>`
- `DimnameList`
- `::std::optional<ArrayRef<double>>`
- `::std::array<bool,3>`
- `::std::optional<DimnameList>`
- `c10::List<::std::optional<Tensor>>`
- `::std::array<bool,2>`
- `Storage`
- `::std::array<bool,4>`
- `Device`
- `DeviceIndex`
- `ITensorListRef`
- `Stream`
- `Layout`
- `MemoryFormat`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125308
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-15 09:18:44 +00:00
de4f379cf2 run mkldnn test with inlining (#128749)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128749
Approved by: https://github.com/anijain2305
2024-06-15 09:04:08 +00:00
b50c0e94c2 TCPStoreLibUvBackend: use somaxconn and enable TCP_NODELAY (#128739)
This adjusts the settings of the libuv backend to match the older TCPStore.

* DEFAULT_BACKLOG: setting this to -1 uses the host's somaxconn value instead of a hardcoded 16k value. Going over this limit with `tcp_abort_on_overflow` set results in connections being reset.
* TCP_NODELAY: Since TCPStore primarily sends small messages, there's no benefit to using Nagle's algorithm, and it may add latency to store operations.
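
Python's standard `socket` module exposes the same two knobs, which makes for a compact illustration. This is a sketch of the socket options only, not of the libuv backend: `make_listener` and `connect_nodelay` are hypothetical helpers, and `socket.SOMAXCONN` is a constant from the platform headers rather than the live sysctl value (on Linux the kernel caps the requested backlog at `net.core.somaxconn`).

```python
import socket


def make_listener(host: str = "127.0.0.1", port: int = 0, backlog: int = -1) -> socket.socket:
    """Listening socket; a negative backlog means "use the platform maximum"."""
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind((host, port))
    srv.listen(socket.SOMAXCONN if backlog < 0 else backlog)
    return srv


def connect_nodelay(addr) -> socket.socket:
    """Client socket with Nagle's algorithm disabled for small messages."""
    conn = socket.create_connection(addr)
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return conn
```

With `TCP_NODELAY` set, each small write is sent immediately instead of waiting to be coalesced with later data.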

Test plan:

```
python test/distributed/test_store.py -v -k LibUv
```

Benchmark script:
```
import time
import os

import torch.distributed as dist

rank = int(os.environ["RANK"])

store = dist.TCPStore(
    host_name="<server>",
    port=29500,
    world_size=2,
    is_master=(rank == 0),
    use_libuv=True,
)

if rank == 1:
    total_iters = 0
    total_dur = 0
    for iter in range(10):
        iters = 500000
        start = time.perf_counter()
        for i in range(iters):
            store.set(f"key_{i}", f"value_{i}")
        dur = time.perf_counter() - start
        print(f"{iter}. {iters} set, qps = {iters/dur}")
        total_iters += iters
        total_dur += dur

    print(f"overall qps = {total_iters/total_dur}")
else:
    print("sleeping")
    time.sleep(1000000000)
```

On a single host, the performance difference between enabling TCP_NODELAY and leaving it off appears to be negligible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128739
Approved by: https://github.com/rsdcastro, https://github.com/kurman, https://github.com/c-p-i-o
2024-06-15 07:40:18 +00:00
cyy
e4c32d14a8 [3/N] Remove inclusion of c10/util/string_utils.h (#128504)
Follows #128372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128504
Approved by: https://github.com/malfet
2024-06-15 06:38:40 +00:00
472211c97a Make assert_size_stride to return all errors (#128764)
This will help debug some problems I'm encountering, but in general, it is best to show the entire error.
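
The idea of reporting every mismatch instead of stopping at the first can be sketched in plain Python. The functions below are hypothetical stand-ins that compare sizes and strides element by element; the real check may apply additional rules that are not modeled here.

```python
def size_stride_errors(actual_size, actual_stride, expected_size, expected_stride) -> list:
    """Return a message for every mismatching dimension (empty list if all match)."""
    errors = []
    if len(actual_size) != len(expected_size):
        # Ranks differ, so per-dimension comparison is not meaningful.
        return [f"rank mismatch: {len(actual_size)} != {len(expected_size)}"]
    for dim, (got, want) in enumerate(zip(actual_size, expected_size)):
        if got != want:
            errors.append(f"size mismatch at dim {dim}: {got} != {want}")
    for dim, (got, want) in enumerate(zip(actual_stride, expected_stride)):
        if got != want:
            errors.append(f"stride mismatch at dim {dim}: {got} != {want}")
    return errors


def assert_size_stride_sketch(actual_size, actual_stride, expected_size, expected_stride) -> None:
    """Raise a single AssertionError that lists all mismatches at once."""
    errors = size_stride_errors(actual_size, actual_stride, expected_size, expected_stride)
    if errors:
        raise AssertionError("; ".join(errors))
```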

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128764
Approved by: https://github.com/jansel
2024-06-15 06:32:40 +00:00
4ccbf711e2 Learning Rate Scheduler docstring fix (#128679)
Fix docstrings in Learning Rate Scheduler.

The fix can be verified by running pydocstyle path-to-file --count

Related #112593

**BEFORE the PR:**
pydocstyle torch/optim/lr_scheduler.py --count

92


**AFTER the PR:**
pydocstyle torch/optim/lr_scheduler.py --count

0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128679
Approved by: https://github.com/janeyx99
2024-06-15 05:30:35 +00:00
108adbc726 [dynamo][side effects] Raise assertion error if the object is already tracked for mutation (#128590)
This issue was pointed out by @tombousso here - https://github.com/pytorch/pytorch/pull/128269#issuecomment-2163755792

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128590
Approved by: https://github.com/mlazos
ghstack dependencies: #128715, #128269
2024-06-15 05:07:49 +00:00
9ebf77b13b Fix windows inductor defination issue (#128686)
Changes:
1. Add memory align macro support on Windows.
2. Fix `#pragma unroll` not being supported by the MSVC `cl` compiler.
`#pragma unroll` causes an error with the MSVC `cl` compiler, but it is supported by `clang` on Windows.
We disable it only for the `__msvc_cl__` compiler, so builds that use `clang` still get the better performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128686
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-15 03:02:00 +00:00
7e092a62e6 [dynamo] Support weakref objects (#128533)
Fixes https://github.com/pytorch/pytorch/issues/125720

I was earlier worried that DELETE_* or STORE_* on referent values should result in a graph break, because they could invalidate the weak ref. But then @zou3519 pointed out that weakref invalidation will happen EVENTUALLY; CPython provides no guarantees about when the weakref will be invalidated (even when the user calls del x and x is the last reference).

So any code that relies on del x to invalidate the weakref of x right away is BAD code. CPython provides no guarantees. Therefore we can (ab)use this nuance, and can just ignore DELETE_* or STORE_* on the referent objects.

The only corner case is when Dynamo is reconstructing the weakref object. Dynamo will have a hard time being correct here, so just SKIP_FRAME on such a case. This is rare.

CPython notes
1) https://docs.python.org/3/library/weakref.html
2) https://docs.python.org/3/reference/datamodel.html#index-2
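
Code that follows the reasoning above never assumes a particular moment at which a weak reference dies; it dereferences and checks for `None` every time. A minimal sketch with the standard `weakref` module, where `Node` and `read_through` are hypothetical names:

```python
import weakref


class Node:
    def __init__(self, value):
        self.value = value


def read_through(ref, default=None):
    """Dereference a weak reference safely.

    Calling the ref returns a strong reference to the object, or None once
    the referent has been collected, so the result is checked on every use.
    """
    obj = ref()
    return default if obj is None else obj.value


node = Node(7)
ref = weakref.ref(node)
```

Because the check happens at each use, the caller behaves correctly whether the referent is reclaimed immediately after `del` or at some later point.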

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128533
Approved by: https://github.com/jansel
2024-06-15 02:16:25 +00:00
62a0e39ced [dynamo][inlining-nn-modules] Update tests with new expected counts (#128463)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128463
Approved by: https://github.com/yanboliang
2024-06-15 02:08:02 +00:00
2d01f87737 Enable torch.empty for float8 dtypes + deterministic mode + cpu (#128744)
Summary:

Enables creating empty float8 tensors for:
* cuda when `torch.use_deterministic_algorithms` is set to True
* cpu for all settings of `torch.use_deterministic_algorithms`

Context for NaN values of float8_e4m3fn and float8_e5m2: https://arxiv.org/pdf/2209.05433, Section 3, Table 1

Context for NaN values of float8_e4m3fnuz and float8_e5m2fnuz: https://arxiv.org/pdf/2206.02915, Section 3.2, "instead of reserving one exponent field to represent Inf and NaN, we reserve only a single codeword (corresponding to negative zero)"

Test Plan:

```
python test/test_quantization.py -k test_empty
```

Reviewers:

Subscribers:

Tasks:

Tags:

Fixes https://github.com/pytorch/pytorch/issues/128733

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128744
Approved by: https://github.com/malfet, https://github.com/drisspg
2024-06-15 02:05:30 +00:00
846bb30e13 Revert "[1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)"
This reverts commit bd72e28314d8d63bb347becb8309f5ac7761c6b5.

Reverted https://github.com/pytorch/pytorch/pull/128301 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it fails XLA build bd72e28314. Please rebase your PR before relanding because I think the failure is hidden by an unrelated broken trunk XLA failure from your current base commit ([comment](https://github.com/pytorch/pytorch/pull/128301#issuecomment-2169035822))
2024-06-15 01:58:20 +00:00
5efe71f134 Revert "[export] Add print_readable to unflattener (#128617)"
This reverts commit 5d9a609b4f6c94fb930188e4d7c99f53d989c022.

Reverted https://github.com/pytorch/pytorch/pull/128617 on behalf of https://github.com/huydhn due to Sorry for reverting your change but another failed test shows up in trunk inductor/test_flex_attention.py where it needs to be updated 5d9a609b4f.  I guess it is easier to revert and reland this ([comment](https://github.com/pytorch/pytorch/pull/128617#issuecomment-2169030779))
2024-06-15 01:46:23 +00:00
f37121bb74 Add model name, quantization and device to gpt_fast micro benchmark output (#128091)
A small enhancement to https://hud.pytorch.org/benchmark/llms with these columns in the output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128091
Approved by: https://github.com/yanboliang
2024-06-15 01:39:48 +00:00
3f47c72268 add multiprocessing checks in test_dataloader.py (#128244)
Add multiprocessing checks in test_dataloader.py for tests requiring multiprocessing similar to test_multiprocessing.py: https://github.com/pytorch/pytorch/blob/main/test/test_multiprocessing.py#L41-L52. Change all Jetson skips to TEST_CUDA_IPC checks since that is the root cause of the failures on Jetson in the first place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128244
Approved by: https://github.com/eqy, https://github.com/malfet
2024-06-15 01:32:55 +00:00
73ba432d32 [custom_op]Fix None return schema (#128667)
Fixes #125044

If a user-defined schema returns `None`, the return type is parsed as `torch.NoneType`. Auto-functionalization supports `()` as an empty return but not `None`, so a `None` return fails the check for [`can_auto_functionalize`](https://github.com/pytorch/pytorch/blob/findhao/fix_none_return_functionalize/torch/_higher_order_ops/auto_functionalize.py#L71) even though it can be treated as a `()` return. This PR skips the check for `None` returns.

Ideally this would be fixed at a [deeper level](31e44c72ca), but that fix breaks a lot of existing schemas, so for now it is better to fix the issue in auto_functionalize.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128667
Approved by: https://github.com/zou3519
2024-06-15 00:41:37 +00:00
6616ad030f [Inductor] Fix the High Order Op layout issue (#128275)
Fix the issue: https://github.com/pytorch/pytorch/issues/127995

- In the current implementation of `FallbackKernel` creation, the `device` of the `NoneLayout` is set to `None` when the `example_output` returned from `cls.process_kernel` is `None`. 921aa194c7/torch/_inductor/ir.py (L5632-L5649)
- If an `ExternalKernel schedulerNode` has a `None` device, the previous buffer is not flushed before codegen of this `ExternalKernel schedulerNode`, which results in incorrect generated code.
ef2b5ed500/torch/_inductor/scheduler.py (L2701-L2709)

**Test Plan**
```
python -u -m pytest -s -v test/higher_order_ops/test_with_effects.py -k test_compile_inductor_external_op_return_none
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128275
Approved by: https://github.com/eellison
2024-06-15 00:33:21 +00:00
5d9a609b4f [export] Add print_readable to unflattener (#128617)
Taking inspiration from `GraphModule.print_readable` (aka I copied its [code](17b45e905a/torch/fx/graph_module.py (L824))), I added a `print_readable` to the unflattened module, because it's kind of nontrivial to print the contents of this module.

Example print from `python test/export/test_unflatten.py -k test_unflatten_nested`
```
class UnflattenedModule(torch.nn.Module):
    def forward(self, x: "f32[2, 3]"):
        # No stacktrace found for following nodes
        rootparam: "f32[2, 3]" = self.rootparam

        # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:99 in forward, code: x = x * self.rootparam
        mul: "f32[2, 3]" = torch.ops.aten.mul.Tensor(x, rootparam);  x = rootparam = None

        # No stacktrace found for following nodes
        foo: "f32[2, 3]" = self.foo(mul);  mul = None
        bar: "f32[2, 3]" = self.bar(foo);  foo = None
        return (bar,)

    class foo(torch.nn.Module):
        def forward(self, mul: "f32[2, 3]"):
            # No stacktrace found for following nodes
            child1param: "f32[2, 3]" = self.child1param
            nested: "f32[2, 3]" = self.nested(mul);  mul = None

            # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:79 in forward, code: return x + self.child1param
            add: "f32[2, 3]" = torch.ops.aten.add.Tensor(nested, child1param);  nested = child1param = None
            return add

        class nested(torch.nn.Module):
            def forward(self, mul: "f32[2, 3]"):
                # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:67 in forward, code: return x / x
                div: "f32[2, 3]" = torch.ops.aten.div.Tensor(mul, mul);  mul = None
                return div

    class bar(torch.nn.Module):
        def forward(self, add: "f32[2, 3]"):
            # No stacktrace found for following nodes
            child2buffer: "f32[2, 3]" = self.child2buffer

            # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:87 in forward, code: return x - self.child2buffer
            sub: "f32[2, 3]" = torch.ops.aten.sub.Tensor(add, child2buffer);  add = child2buffer = None
            return sub
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128617
Approved by: https://github.com/zhxchen17, https://github.com/pianpwk
2024-06-15 00:26:04 +00:00
d67923b955 Adding kwargs to composable AC API to enable full capabilities (#128516)
Summary:
Firstly, this does not change any existing behaviour, since all the
default values for kwargs were hardcoded into the ``_checkpoint_without_reentrant_generator`` call.

Secondly, this is needed for unlocking the full potential of composable
checkpointing making it equivalent to ``torch.utils.checkpoint.checkpoint(use_reentrant=False)``.

Finally, an added benefit is now composable checkpointing can be used under ``FakeTensorMode`` by
passing ``preserve_rng_state=False``.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128516
Approved by: https://github.com/awgu
2024-06-15 00:23:48 +00:00
271852aa7e inductor: pre-grad bmm pass shouldn't match if output is mutated (#128570)
This PR is enough to get this test to pass when using `TORCHDYNAMO_INLINE_INBUILT_NN_MODULES`:
```
TORCHDYNAMO_INLINE_INBUILT_NN_MODULES=1  python test/inductor/test_group_batch_fusion.py -k TestPostGradBatchLinearFusion.test_batch_linear_post_grad_fusion
```

inductor has a pre-grad pass to swap out multiple `linear` layers with `addbmm`, but it also needs to insert an `unbind()` at the end. If that unbind is then followed by a mutation (like `add_()`), the autograd engine will complain (autograd does not let you mutate the output of multiple-out-view ops like unbind).

I made a tweak to the pattern matching logic to avoid matching if the output of the linear is used in an op that mutates its input. Two notes:
(1) My hope is that this situation is rare enough that it won't materially impact pattern matching in real world code.
(2) I had to use a heuristic for "is an op a mutable op", since the graph we get is from dynamo, so it can contain code like `operator.iadd` in it.
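
A heuristic of this kind might look like the sketch below. It is an assumption about the general shape of such a check, not the code from this PR: `looks_like_mutation` is a hypothetical helper that treats Python's in-place operators and trailing-underscore method names (the PyTorch in-place naming convention, e.g. `add_`) as mutations.

```python
import operator

# In-place operator functions that can appear as call targets in a traced graph.
_INPLACE_OPERATORS = {
    operator.iadd,
    operator.isub,
    operator.imul,
    operator.itruediv,
    operator.ifloordiv,
    operator.imatmul,
}


def looks_like_mutation(target) -> bool:
    """Guess whether a call target mutates its first argument."""
    if not isinstance(target, str) and target in _INPLACE_OPERATORS:
        return True
    # Method calls are identified by name; functions by their __name__.
    name = target if isinstance(target, str) else getattr(target, "__name__", "")
    # "add_" counts as in-place; dunder names such as "__init__" do not.
    return name.endswith("_") and not name.endswith("__")
```

Being a heuristic, it can misclassify: an in-place operator applied to an immutable value rebinds rather than mutates, which is acceptable here because a false positive only skips an optimization.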

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128570
Approved by: https://github.com/eellison, https://github.com/mlazos
ghstack dependencies: #127927
2024-06-15 00:08:44 +00:00
ba19ed9a1a FunctionalTensor: dispatch metadata directly to inner tensor (#127927)
Fixes https://github.com/pytorch/pytorch/issues/127374

The error in the linked repro is:
```
AssertionError: Please convert all Tensors to FakeTensors first or instantiate FakeTensorMode with 'allow_non_fake_inputs'. Found in aten.sym_storage_offset.default(_to_functional_tensor(FakeTensor(..., device='cuda:0', size=(16, 4), dtype=torch.uint8),
       device='cuda:0'))
```

Where we hit FakeTensor.__torch_dispatch__, but our input is a C++ `FunctionalTensorWrapper`.

What should actually have happened is that the call to `aten.sym_storage_offset` hits the `Functionalize` dispatch key, which should remove the `FunctionalTensorWrapper`  and redispatch. I spent some time debugging and haven't actually figured out why this isn't happening. Instead, this PR just skips that step completely, and asks `FunctionalTensor` to directly unwrap the C++ `FunctionalTensorWrapper` when querying tensor metadata.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127927
Approved by: https://github.com/tugsbayasgalan
2024-06-15 00:08:44 +00:00
574a2cbcb7 Enable UFMT on common_device_type.py and common_dtype.py (#128490)
Part of: https://github.com/pytorch/pytorch/issues/123062

Ran lintrunner on:
> torch/testing/_internal/common_device_type.py
> torch/testing/_internal/common_dtype.py

Detail:
```
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128490
Approved by: https://github.com/ezyang, https://github.com/XuehaiPan
2024-06-15 00:07:42 +00:00
0492ec460a [BE] Remove external testing of torch::deploy (#127952)
Since torch::deploy is no longer supported and we don't expect it to have external users, we are removing its external testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127952
Approved by: https://github.com/malfet
2024-06-14 23:32:02 +00:00
cyy
bd72e28314 [1/N] Change #include <c10/util/Optional.h> to #include <optional> (#128301)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301
Approved by: https://github.com/ezyang
2024-06-14 23:21:01 +00:00
52d4442a00 [c10d] Socket, TCPStore: add better logging (#128673)
This adds better logging of errors to the socket and TCPStore classes.

All socket operations should now include the local and remote addresses and we actually log errors from the TCPStoreBackend::run as well as TCPStoreBackendUV which were previously INFO messages and not actually logged.

It also overhauls test_wait in test_store.py as it had a race condition causing it to be flaky.

Test plan:

```
python test/distributed/test_store.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128673
Approved by: https://github.com/c-p-i-o
2024-06-14 23:08:29 +00:00
4abecd7102 [AOTI] fixed performance issue for AOTI_TORCH_CHECK (#128402)
We introduced AOTI_TORCH_CHECK in #119220 to resolve slow-compilation
time issues. Unfortunately, it caused perf regressions for CPU,
as described in issue #126665. After some investigation, it turned
out the slow compilation was caused by the use of the builtin
function __builtin_expect provided by gcc/clang. Moreover,
nuking __builtin_expect doesn't seem to cause any performance penalty,
even though its purpose is to improve performance by providing the
compiler with branch prediction information.

Absolute latency numbers using the script shared in #126665:

                            before the fix      after the fix
T5Small                     1019.055694         917.875027
T5ForConditionalGeneration  1009.825196         916.369239
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128402
Approved by: https://github.com/desertfire
2024-06-14 23:03:17 +00:00
fd27138c4a Update DALLE2_pytorch expected accuracy result on CPU (#128718)
I suspect that the issue shows up because of the new version of https://pypi.org/project/pyarrow/16.1.0/#history released yesterday.  The package is a dependency of DALLE2_pytorch https://github.com/pytorch/benchmark/blob/main/torchbenchmark/models/DALLE2_pytorch/install.py#L22.

I'll just update the expected accuracy result on CPU benchmark because the model fails to run there anyway.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128718
Approved by: https://github.com/malfet
2024-06-14 22:54:21 +00:00
d3a4d9e4fe Update cu124 dynamo benchmark expected values (#128737)
Missed one in https://github.com/pytorch/pytorch/pull/128589

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128737
Approved by: https://github.com/Skylion007
2024-06-14 22:23:00 +00:00
bca2cf00ed [ONNX] Add dynamic axes support to torchscript exporter with dynamo=True (#128371)
This PR allows specific axes to be marked dynamic by calling `torch.export.export` with `torch.export.Dim`.

Features:
(1) Converts `dynamic_axes` to `dynamic_shapes`.
(2) Dim constraints remain the same (see the test case that hits a constraint). This may change the user experience, since the TorchScript-based ONNX exporter did not enforce any constraints.
(3) If `input_names` is used in `dynamic_axes`, a ValueError is raised, as `input_names` is currently not supported.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128371
Approved by: https://github.com/justinchuby
2024-06-14 21:56:51 +00:00
f103247a14 Run all samples for torchinductor tests (#128343)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128343
Approved by: https://github.com/lezcano
2024-06-14 21:52:12 +00:00
e9c6e8369c Torchbind call method + effects support (#128397)
Adds effect token support to torchbind method calls by allowing `with_effects` to take in `torch.ops._higher_order_ops.call_torchbind` as an input.

Here is the print from `TORCH_LOGS="aot" python test/export/test_torchbind.py -k test_compile_obj_torchbind_op`:
```python
def forward(self, arg0_1: "f32[0]", arg1_1: "f32[2]", arg2_1):
    # File: /data/users/angelayi/pytorch2/test/export/test_torchbind.py:1266 in f, code: torch.ops._TorchScriptTesting.queue_push(tq, x.cos())
    cos: "f32[2]" = torch.ops.aten.cos.default(arg1_1)
    with_effects = torch._higher_order_ops.effects.with_effects(arg0_1, torch.ops._TorchScriptTesting.queue_push.default, arg2_1, cos);  arg0_1 = cos = None
    getitem: "f32[0]" = with_effects[0];  with_effects = None

    # File: /data/users/angelayi/pytorch2/test/export/test_torchbind.py:1267 in f, code: torch.ops._TorchScriptTesting.queue_push(tq, x.cos() + 1)
    cos_1: "f32[2]" = torch.ops.aten.cos.default(arg1_1)
    add: "f32[2]" = torch.ops.aten.add.Tensor(cos_1, 1);  cos_1 = None
    with_effects_1 = torch._higher_order_ops.effects.with_effects(getitem, torch.ops._TorchScriptTesting.queue_push.default, arg2_1, add);  getitem = add = None
    getitem_2: "f32[0]" = with_effects_1[0];  with_effects_1 = None

    # File: /data/users/angelayi/pytorch2/test/export/test_torchbind.py:1268 in f, code: torch.ops._TorchScriptTesting.queue_pop(tq)
    with_effects_2 = torch._higher_order_ops.effects.with_effects(getitem_2, torch.ops._TorchScriptTesting.queue_pop.default, arg2_1);  getitem_2 = None
    getitem_4: "f32[0]" = with_effects_2[0];  with_effects_2 = None

    # File: /data/users/angelayi/pytorch2/test/export/test_torchbind.py:1269 in f, code: torch.ops._TorchScriptTesting.queue_push(tq, x.sin())
    sin: "f32[2]" = torch.ops.aten.sin.default(arg1_1);  arg1_1 = None
    with_effects_3 = torch._higher_order_ops.effects.with_effects(getitem_4, torch.ops._TorchScriptTesting.queue_push.default, arg2_1, sin);  getitem_4 = sin = None
    getitem_6: "f32[0]" = with_effects_3[0];  with_effects_3 = None

    # File: /data/users/angelayi/pytorch2/test/export/test_torchbind.py:1270 in f, code: return tq.pop(), tq.pop() + tq.size(), tq
    with_effects_4 = torch._higher_order_ops.effects.with_effects(getitem_6, torch.ops._higher_order_ops.call_torchbind, arg2_1, 'pop');  getitem_6 = None
    getitem_8: "f32[0]" = with_effects_4[0]
    getitem_9: "f32[2]" = with_effects_4[1];  with_effects_4 = None
    with_effects_5 = torch._higher_order_ops.effects.with_effects(getitem_8, torch.ops._higher_order_ops.call_torchbind, arg2_1, 'pop');  getitem_8 = None
    getitem_10: "f32[0]" = with_effects_5[0]
    getitem_11: "f32[2]" = with_effects_5[1];  with_effects_5 = None
    with_effects_6 = torch._higher_order_ops.effects.with_effects(getitem_10, torch.ops._higher_order_ops.call_torchbind, arg2_1, 'size');  getitem_10 = arg2_1 = None
    getitem_12: "f32[0]" = with_effects_6[0];  with_effects_6 = None
    add_1: "f32[2]" = torch.ops.aten.add.Tensor(getitem_11, 0);  getitem_11 = None
    return (getitem_12, getitem_9, add_1)
```

In order to support this, this PR makes the following changes:
* Adds `FakeScriptObject` to `CustomObjArgument`, which will be put on the `meta["val"]` of nodes representing torchbind objects.
* Adds pickle/deepcopy support to FunctionSchema.
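
The token threading visible in the graph above can be modelled in a few lines of plain Python. This is only a toy sketch: the `with_effects` below is a hypothetical stand-in written for illustration, not `torch._higher_order_ops.effects.with_effects`, and the integer token and `deque`-based queue are assumptions. It shows how passing each call's output token into the next call makes the order of side effects explicit as a data dependency.

```python
from collections import deque
from typing import Any, Callable, Tuple


def with_effects(token: int, op: Callable[..., Any], *args: Any) -> Tuple[Any, ...]:
    """Run `op`, then return a fresh token followed by the op's result (if any)."""
    result = op(*args)
    new_token = token + 1  # toy token: a counter standing in for an opaque value
    return (new_token,) if result is None else (new_token, result)


def queue_push(q: deque, x: float) -> None:
    q.append(x)


def queue_pop(q: deque) -> float:
    return q.popleft()


queue: deque = deque()
token = 0
# Each call consumes the previous token, so the calls cannot be reordered.
(token,) = with_effects(token, queue_push, queue, 1.0)
(token,) = with_effects(token, queue_push, queue, 2.0)
token, first = with_effects(token, queue_pop, queue)
```

As in the printed graph, element `[0]` of each result is the new token and element `[1]`, when present, is the value returned by the op.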
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128397
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2024-06-14 21:28:17 +00:00
65d3ddcb8b Add GLIBC requirements for libtorch to solve #113124 (#128135)
Fixes #113124.

## Description

I modified the installing.rst file to address the system requirements and troubleshooting steps for using LibTorch with different GLIBC versions.

### Summary of Changes

- Added system requirements specifying the GLIBC version needed for both the cxx11 ABI version and the pre-cxx11 ABI version of LibTorch.
- Included a troubleshooting section with instructions on how to check the dependencies of the LibTorch libraries and identify the required GLIBC version using the `ldd lib/libtorch.so` command.

## Checklist
- [X] The issue that is being fixed is referred in the description
- [X] Only one issue is addressed in this pull request
- [X] Labels from the issue that this PR is fixing are added to this pull request
- [X] No unnecessary issues are included in this pull request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128135
Approved by: https://github.com/jbschlosser
2024-06-14 21:24:53 +00:00
e9a29aaa4a [ONNX] Add upsample trilinear to skip decomp (#128259)
(1) Add upsample trilinear vec to skip decomposition
(2) Add tests to make sure that torch.export.export still decomposes them
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128259
Approved by: https://github.com/justinchuby
2024-06-14 21:20:44 +00:00
e6e102cf85 Dynamo testing: add some skips (#128734)
The following tests are failing consistently for me locally, so we're
going to skip them. They're disabled in CI but it looks like they're
just always failing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128734
Approved by: https://github.com/williamwen42
ghstack dependencies: #128731
2024-06-14 20:53:30 +00:00
11de50f17c [Dynamo] skip some TorchScript tests (#128731)
We don't care about the Dynamo x TorchScript composition, so I'm
disabling these tests (so they don't get reported as flaky). Not
disabling all of the TorchScript tests yet because they have been useful
to catch random bugs.

Test Plan:
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128731
Approved by: https://github.com/williamwen42
2024-06-14 20:53:30 +00:00
4b96575a09 [dynamo][aot autograd] Silently disable default saved tensor hooks during tracing (#123196)
FIXES #113263. Same idea as in https://github.com/pytorch/pytorch/pull/113417, but we need a more intrusive C API to silently nop default saved tensor hooks, in order to support user code that uses torch.autograd.disable_saved_tensors_hooks (see test_unpack_hooks_can_be_disabled). We mock the output of get_hooks while leaving push/pop untouched.

For compiled autograd, we're firing pack hooks once and unpack hooks twice right now; I'll look into this separately from this issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123196
Approved by: https://github.com/soulitzer
2024-06-14 20:28:08 +00:00
1aafb9eb90 [dynamo][yolov3] Track UnspecializedNNModuleVariable for mutation (#128269)
Fixes https://github.com/pytorch/pytorch/issues/101168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128269
Approved by: https://github.com/jansel
ghstack dependencies: #128715
2024-06-14 20:17:03 +00:00
9c77332116 [torch.compile][ci] Flaky models in CI (similar to DISABLED_TEST) (#128715)
These models are really flaky. I went into the CI machine and ran the model many times; sometimes it fails, sometimes it passes. Even PyTorch eager results change from run to run, so the accuracy comparison is fundamentally broken/non-deterministic. I am hitting these issues more frequently in inlining work. There is nothing wrong with inlining; I think these models are on the edge of an already-broken accuracy measurement, and inlining is just pushing it in a more broken direction.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128715
Approved by: https://github.com/eellison
2024-06-14 20:17:03 +00:00
2e5366fbc0 Extended Module Tracker (#128508)
This is an extension of [ModuleTracker](https://github.com/pytorch/pytorch/blob/main/torch/utils/module_tracker.py) with added features and bug fixes.

1. Allows installing user-defined hooks to be called in pre-fw, post-fw, pre-bw and post-bw hooks of the ``ModTracker``.
2. Adds a function ``get_known_fqn`` that retrieves the fqn of the module as tracked by the ``ModTracker``.
3. Only registers the multi-grad hooks if we are in the forward pass. This is important because a module's pre-fw and post-fw hooks get called in the backward during AC, and we do not want to register multi-grad hooks in this case.
4. Sets the kwarg ``always_call=True`` for post-fw hooks, so that they are called post AC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128508
Approved by: https://github.com/wanchaol
2024-06-14 19:48:46 +00:00
d50712e5e3 [PT2] add inductor log for unbind_stack_pass (#128684)
Summary: Currently, we do not log the pass. To better enable pattern hit inspection, we enable it.

Test Plan: see signal

Differential Revision: D58571992

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128684
Approved by: https://github.com/dshi7
2024-06-14 19:45:55 +00:00
9035fff2de [BE] Do not test deprecated torch.nn.utils.weight_norm (#128727)
Test `torch.nn.utils.parametrizations.weight_norm` instead
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128727
Approved by: https://github.com/kit1980
ghstack dependencies: #128726
2024-06-14 19:14:44 +00:00
27458cc097 [BE] Refactor repeated code in test_weight_norm (#128726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128726
Approved by: https://github.com/kit1980
2024-06-14 19:14:44 +00:00
a6bd154a42 [inductor] Support mm decomps for matrices with unbacked sizes (#128655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128655
Approved by: https://github.com/jansel
2024-06-14 18:35:42 +00:00
b94c52dd29 [GHF] Refuse merge to non-default branch (#128710)
Unless PR is ghstack one

Test plan:
```
% GITHUB_TOKEN=$(gh auth token)  python3 -c "from trymerge import GitHubPR; pr=GitHubPR('pytorch', 'pytorch', 128591); print(pr.base_ref(), pr.default_branch())"
release/2.4 main
```
Fixes: https://github.com/pytorch/test-infra/issues/5339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128710
Approved by: https://github.com/seemethere, https://github.com/atalman
2024-06-14 18:23:25 +00:00
be0eec9031 [export] Improve static typing in tracer. (#128552)
Summary: as title.

Test Plan: CI

Differential Revision: D58485487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128552
Approved by: https://github.com/angelayi
2024-06-14 17:57:37 +00:00
2367161e4b Revert "[ROCm] Unskip scaled_dot_product_attention tests on ROCm (#127966)"
This reverts commit c339efaf023b4af056dad4cb2f11c07930ed8af6.

Reverted https://github.com/pytorch/pytorch/pull/127966 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI ([comment](https://github.com/pytorch/pytorch/pull/127966#issuecomment-2168505985))
2024-06-14 17:57:23 +00:00
d7fc871175 [inductor] Improve superfluous mask handling in triton codegen (#128518)
This takes the logic from `filter_masks` and factors it out into
`_has_constant_mask`. I also improve support for `persistent_reduction` kernels
by making use of the static RBLOCK value and potentially XBLOCK too in the
`no_x_dim` case.

I then use this helper when generating the `xmask` and `rmask`, so we can
generate them as constants meaning triton can optimize them even if they are
included.

e.g. `compiled_sum(torch.randn(1024, 512, device="cuda"), dim=-1)`
before:
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, rnumel):
    xnumel = 1024
    XBLOCK: tl.constexpr = 1
    rnumel = 512
    RBLOCK: tl.constexpr = 512
    xoffset = tl.program_id(0) * XBLOCK
    xindex = tl.full([1], xoffset, tl.int32)
    xmask = xindex < xnumel
    rindex = tl.arange(0, RBLOCK)[:]
    roffset = 0
    rmask = rindex < rnumel
    r1 = rindex
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (r1 + (512*x0)), rmask & xmask, other=0.0)
    tmp1 = tl.broadcast_to(tmp0, [RBLOCK])
    tmp3 = tl.where(rmask & xmask, tmp1, 0)
    tmp4 = triton_helpers.promote_to_tensor(tl.sum(tmp3, 0))
    tl.store(out_ptr0 + (x0), tmp4, xmask)
```

after:
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, rnumel):
    xnumel = 1024
    XBLOCK: tl.constexpr = 1
    rnumel = 512
    RBLOCK: tl.constexpr = 512
    xoffset = tl.program_id(0) * XBLOCK
    xindex = tl.full([1], xoffset, tl.int32)
    xmask = tl.full([RBLOCK], True, tl.int1)
    rindex = tl.arange(0, RBLOCK)[:]
    roffset = 0
    rmask = tl.full([RBLOCK], True, tl.int1)
    r1 = rindex
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (r1 + (512*x0)), None)
    tmp1 = tl.broadcast_to(tmp0, [RBLOCK])
    tmp3 = triton_helpers.promote_to_tensor(tl.sum(tmp1, 0))
    tl.store(out_ptr0 + (x0), tmp3, None)
```
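
Why the masks in the example above can be replaced by constants can be sketched in plain Python. The helpers below (`block_mask`, `has_constant_mask`, `num_programs`) are hypothetical names written for illustration and are not Inductor's `_has_constant_mask`; the assumption is simply that a mask of the form `index < numel` is all-True for every program whenever `numel` is a multiple of the block size.

```python
from typing import List


def block_mask(program_id: int, block: int, numel: int) -> List[bool]:
    """Mimic `index < numel` for the block of indices handled by one program."""
    offset = program_id * block
    return [offset + i < numel for i in range(block)]


def num_programs(numel: int, block: int) -> int:
    """Number of programs needed to cover `numel` elements (ceiling division)."""
    return -(-numel // block)


def has_constant_mask(numel: int, block: int) -> bool:
    """True when no program has a partially filled block."""
    return numel % block == 0


# rnumel = 512 with RBLOCK = 512: the single block is completely in bounds.
full = all(all(block_mask(p, 512, 512)) for p in range(num_programs(512, 512)))

# 1000 elements in blocks of 512: the second block runs past the end.
partial = all(block_mask(1, 512, 1000))
```

With `xnumel = 1024` and `XBLOCK = 1` the same check holds trivially, which matches the constant `xmask` in the "after" kernel.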

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128518
Approved by: https://github.com/lezcano
2024-06-14 17:52:55 +00:00
2357490524 [PT2] Enable shape_padding multiplier adjustment (#128346)
Summary:
Our experiments demonstrate that the current default value 1.1 may not be the best multiplier, and we thus enable the adjustment of the value to further improve the QPS.

context: https://docs.google.com/document/d/10VjpOJkTv5A4sNX7dD6qT7PyhBxn6LSeLAuaqYtoOto/edit

Test Plan:
# IG_CTR

{F1682138315}

Differential Revision: D58373261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128346
Approved by: https://github.com/jackiexu1992
2024-06-14 17:49:24 +00:00
cyy
d4807da802 Various fixes of torch/csrc files (#127252)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127252
Approved by: https://github.com/r-barnes
2024-06-14 17:31:24 +00:00
089e76cca3 [traced-graph][sparse] remove redundant assert in sparse prop test (#128523)
The assertEqualMeta() method already tests that the first argument is a FakeTensor

https://github.com/pytorch/pytorch/issues/117188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128523
Approved by: https://github.com/huydhn
2024-06-14 17:05:17 +00:00
1fb4effe7a [GPT-fast benchmark] Add MLP, gather + gemv, gemv micro benchmark (#128002)
Output example:
```
| name                         | metric                    | target  | actual  |
|------------------------------|---------------------------|---------|---------|
| layer_norm_bfloat16          | memory_bandwidth(GB/s)    | 1017    | 1000.01 |
| mlp_layer_norm_gelu_bfloat16 | flops_utilization         | 0.71    | 0.71    |
| gemv_int8                    | memory_bandwidth(GB/s)    | 990     | 984.06  |
| gemv_bfloat16                | memory_bandwidth(GB/s)    | 1137    | 1137.92 |
| gather_gemv_int8             | memory_bandwidth(GB/s)    | 1113    | 1111.09 |
| gather_gemv_bfloat16         | memory_bandwidth(GB/s)    | 1249    | 1248.15 |

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128002
Approved by: https://github.com/Chillee
2024-06-14 17:03:22 +00:00
4c84af0f5d Fix indexing and slicing of ranges in dynamo (#128567)
Fix https://github.com/pytorch/pytorch/issues/128520
Dynamo does not handle range()[binary subscript] or range()[trinary_subscript] correctly. Right now it calls
the get_item function, which applies the subscript operation to the list [start, end, step]; that is unrelated to the expected result.

In Python, range()[complex subscript] is another range, e.g.:
range(1, 10, 2)[1:4:1] is range(3, 9, 2)
and range(1, 10, 2)[1:4:1]  is range(-9, 9, 2)

This diff fixes index and slice applications on range.
It mimics the implementation in CPython (https://github.com/python/cpython/blob/main/Objects/rangeobject.c).
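
The expected semantics can be sketched in plain Python. The helpers `slice_range` and `index_range` below are hypothetical names written for illustration; they are not the Dynamo code from this PR, only a small model of what CPython's range object computes, checked against the built-in behaviour.

```python
def slice_range(r: range, s: slice) -> range:
    """Apply a slice to a range and return the resulting range."""
    # slice.indices() resolves None and negative values and clamps to len(r).
    start, stop, step = s.indices(len(r))
    return range(r.start + start * r.step, r.start + stop * r.step, r.step * step)


def index_range(r: range, i: int) -> int:
    """Return the i-th element of a range, supporting negative indices."""
    n = len(r)
    if i < 0:
        i += n
    if not 0 <= i < n:
        raise IndexError("range object index out of range")
    return r.start + i * r.step


r = range(1, 10, 2)                 # 1, 3, 5, 7, 9
sliced = slice_range(r, slice(1, 4, 1))
last = index_range(r, -1)
```

Here `sliced` equals `r[1:4:1]`, that is `range(3, 9, 2)`, and `last` equals `r[-1]`.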

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128567
Approved by: https://github.com/anijain2305
2024-06-14 16:49:49 +00:00
f75f5987aa Revert "Extended Module Tracker (#128508)"
This reverts commit 1f46284f9ed5b60981174e689d750b358b19e4c4.

Reverted https://github.com/pytorch/pytorch/pull/128508 on behalf of https://github.com/malfet due to Broke lint, see https://github.com/pytorch/pytorch/actions/runs/9515753429/job/26230639980 ([comment](https://github.com/pytorch/pytorch/pull/128508#issuecomment-2168405784))
2024-06-14 16:46:03 +00:00
732b4e9074 Fix generated vararg types (#128648)
In the generated files torchgen is incorrectly generating types on the varargs.

The changes all look like this (changing `*size: _int` to `*size: Union[_int, SymInt]`):
```
--- ./torch/_VF.pyi.sav	2024-06-13 20:36:49.189664629 -0700
+++ ./torch/_VF.pyi	2024-06-13 20:36:57.208894614 -0700
@@ -168,17 +168,17 @@
 @overload
 def _efficientzerotensor(size: Sequence[Union[_int, SymInt]], *, dtype: Optional[_dtype] = None, layout: Optional[_layout] = None, device: Optional[Optional[DeviceLikeType]] = None, pin_memory: Optional[_bool] = False, requires_grad: Optional[_bool] = False) -> Tensor: ...
 @overload
-def _efficientzerotensor(*size: _int, dtype: Optional[_dtype] = None, layout: Optional[_layout] = None, device: Optional[Optional[DeviceLikeType]] = None, pin_memory: Optional[_bool] = False, requires_grad: Optional[_bool] = False) -> Tensor: ...
+def _efficientzerotensor(*size: Union[_int, SymInt], dtype: Optional[_dtype] = None, layout: Optional[_layout] = None, device: Optional[Optional[DeviceLikeType]] = None, pin_memory: Optional[_bool] = False, requires_grad: Optional[_bool] = False) -> Tensor: ...
 def _embedding_bag(weight: Tensor, indices: Tensor, offsets: Tensor, scale_grad_by_freq: _bool = False, mode: _int = 0, sparse: _bool = False, per_sample_weights: Optional[Tensor] = None, include_last_offset: _bool = False, padding_idx: _int = -1) -> Tuple[Tensor, Tensor, Tensor, Tensor]: ...
 def _embedding_bag_forward_only(weight: Tensor, indices: Tensor, offsets: Tensor, scale_grad_by_freq: _bool = False, mode: _int = 0, sparse: _bool = False, per_sample_weights: Optional[Tensor] = None, include_last_offset: _bool = False, padding_idx: _int = -1) -> Tuple[Tensor, Tensor, Tensor, Tensor]: ...
 @overload
```
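
As a reminder of why the annotation matters: the type written on `*size` describes each individual positional argument, not the collected tuple. A minimal sketch, using a stand-in `SymInt` class (an assumption so the snippet runs without torch) and a hypothetical `make_shape` function:

```python
from typing import Tuple, Union


class SymInt:
    """Stand-in for torch.SymInt, used only so this sketch is self-contained."""

    def __init__(self, hint: int) -> None:
        self.hint = hint


def make_shape(*size: Union[int, SymInt]) -> Tuple[Union[int, SymInt], ...]:
    # Inside the function, `size` is a tuple whose elements are int or SymInt.
    return size


# With `*size: int`, a type checker would reject the SymInt argument here.
shape = make_shape(2, SymInt(3))
```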

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128648
Approved by: https://github.com/jamesjwu
2024-06-14 16:04:37 +00:00
8629939a51 [torch/c10] Add C10_UBSAN_ENABLED macro and use it to disable SymInt_… (#127967)
Adds `C10_UBSAN_ENABLED` macro and uses it to disable `SymIntTest::Overflows` (fails under `signed-integer-overflow` UBSAN check).

Also cleans up UBSAN guard in `jit/test_misc.cpp` to use `C10_UBSAN_ENABLED`  and the existing `C10_ASAN_ENABLED` instead of locally defining `HAS_ASANUBSAN`.

> NOTE: This should fix `SymIntTest::Overflows` failing under ubsan in fbcode too...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127967
Approved by: https://github.com/atalman, https://github.com/d4l3k, https://github.com/malfet
2024-06-14 16:01:12 +00:00
ee140a198f Revert "[Port][Quant][Inductor] Bug fix: mutation nodes not handled correctly for QLinearPointwiseBinaryPT2E (#128591)"
This reverts commit 03e8a4cf45ee45611de77b55b515a8936f60ce31.

Reverted https://github.com/pytorch/pytorch/pull/128591 on behalf of https://github.com/atalman due to Contains release only changes should not be landed ([comment](https://github.com/pytorch/pytorch/pull/128591#issuecomment-2168308233))
2024-06-14 15:51:00 +00:00
c187593418 Prevent expansion of cat indexing to avoid int64 intermediate (#127815)
Fix for https://github.com/pytorch/pytorch/issues/127652

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127815
Approved by: https://github.com/shunting314, https://github.com/peterbell10
2024-06-14 15:42:08 +00:00
c339efaf02 [ROCm] Unskip scaled_dot_product_attention tests on ROCm (#127966)
The needle has moved quite a bit on the ROCm backend front. This PR is intended to examine the tests referenced in the following issue: https://github.com/pytorch/pytorch/issues/96560

This is a follow-up PR to https://github.com/pytorch/pytorch/pull/125069, unskipping the next batch of tests referenced by the aforementioned issue. No explicit source changes were needed, as the tests worked immediately after unskipping.

The tests previously marked with xfail have now been modified to not expect a failure iff running on ROCm as they now pass. Behavior is unchanged for them on other architectures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127966
Approved by: https://github.com/pruthvistony, https://github.com/zou3519
2024-06-14 15:24:28 +00:00
c76a9d13cb Revert D56709309 (#128481)
Summary: potential fw compatibility issue raised from D58397323

Test Plan: Sandcastle

Reviewed By: houseroad

Differential Revision: D58443190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128481
Approved by: https://github.com/desertfire
2024-06-14 14:57:17 +00:00
9972e5f447 Rename impl_abstract to register_fake, part 2/2 (#123938)
This PR renames the implementation details of register_fake to align
more with the new name. It is in its own PR because this is risky
(torch.package sometimes depends on private library functions and
implementation details).

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123938
Approved by: https://github.com/williamwen42
2024-06-14 14:37:24 +00:00
a2d9c430b4 Adding a note for Getting Started with PyTorch on Intel GPUs (#127872)
Adding a note for Getting Started with PyTorch on Intel GPUs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127872
Approved by: https://github.com/svekars
2024-06-14 14:24:28 +00:00
dfc4b608e1 Remove leftover warning causing log spew (#128688)
This warning was left in by mistake; it is uninformative (the user is doing nothing wrong) and causes log spew during training. See https://github.com/pytorch/pytorch/pull/120750#discussion_r1638430500
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128688
Approved by: https://github.com/drisspg
2024-06-14 14:08:11 +00:00
e1dfc61250 Document CI/CD security philosophy (#128316)
Namely:
- When use of non-ephemeral runners is OK, vs when it is not
- Why binary build pipelines should not use distributed caching
- Why temporary CI artifacts should not be considered safe
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128316
Approved by: https://github.com/seemethere, https://github.com/atalman
2024-06-14 13:47:25 +00:00
cyy
bfd5ea93e0 Enable clang-tidy on c10/util/Float8*.h (#120573)
This PR clears warnings and enables clang-tidy on c10/util/Float8*.h.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120573
Approved by: https://github.com/drisspg
2024-06-14 13:47:07 +00:00
1f46284f9e Extended Module Tracker (#128508)
This is an extension of [ModuleTracker](https://github.com/pytorch/pytorch/blob/main/torch/utils/module_tracker.py) with added features and bug fixes.

1. Allows installing user-defined hooks to be called in pre-fw, post-fw, pre-bw and post-bw hooks of the ``ModTracker``.
2. Adds a function ``get_known_fqn`` that retrieves the fqn of the module as tracked by the ``ModTracker``.
3. Only registers the multi-grad hooks if we are in the forward pass. This is important because a module's pre-fw and post-fw hooks get called in the backward during AC, and we do not want to register multi-grad hooks in this case.
4. Sets the kwarg ``always_call=True`` for post-fw hooks, so that they are called post AC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128508
Approved by: https://github.com/wanchaol
2024-06-14 12:01:53 +00:00
e397ad6883 Improve codegen for ops.masked in triton (#128054)
Fixes https://github.com/pytorch/pytorch/issues/127930
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128054
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2024-06-14 11:52:56 +00:00
7e734e2d08 [inductor] Fix nested indirect indexing case for index_propagation (#128378)
Tries to fix #127677.

# Context

Just as @peterbell10 pointed out, we have the following scenario:
```
a = ops.indirect_indexing(...)
b = ops.index_expr(a, ...)
c = ops.indirect_indexing(b, ...)
```

We can repro this as:
```
def forward(self, arg0_1, arg1_1, arg2_1):
    iota = torch.ops.prims.iota.default(arg0_1, start = 0, step = 1, index=0),
    repeat_interleave = torch.ops.aten.repeat_interleave.Tensor(arg1_1);
    index = torch.ops.aten.index.Tensor(iota, [repeat_interleave]);
    index_1 = torch.ops.aten.index.Tensor(arg2_1, [index]);
    return (index_1,)
```

which should generate a JIT py file like this:
```
def triton_poi_fused_index_select_0(in_ptr0, in_ptr1, out_ptr0, ks0, xnumel, XBLOCK : tl.constexpr):
    ...
    tmp0 = tl.load(in_ptr0 + (x1), xmask, eviction_policy='evict_last')
    tmp1 = ks0
    tmp2 = tmp0 + tmp1
    tmp3 = tmp0 < 0
    tmp4 = tl.where(tmp3, tmp2, tmp0)
    # check_bounds()
    tl.device_assert(((0 <= tmp4) & (tmp4 < ks0)) | ~(xmask), "index out of bounds: 0 <= tmp4 < ks0")

def call():
  arg0_1, arg1_1, arg2_1 = args
  buf1 = aten.repeat_interleave.Tensor(arg1_1)
  buf4 = empty_strided_cuda((u0, 64), (64, 1))
  triton_poi_fused_index_select_0.run(
    buf1, arg2_1, buf4, s0,
    triton_poi_fused_index_select_0_xnumel,
    grid=grid(triton_poi_fused_index_select_0_xnumel),
    stream=stream0)
```

# Issue
In our `IndexPropagation.indirect_indexing()` call we have `expr=indirect0` which is spawned in `LoopBodyBlock.indirect_indexing()`.
3b555ba477/torch/_inductor/ir.py (L8154-L8160)

When we try to see if we can prove its bounds, we fail because `indirect0` isn't in `var_ranges`.

# Approach
When creating `indirect` symbols from fallback, specify their range to be `[-size, size - 1]` to avoid a lookup error with `indirectX`.
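
The choice of `[-size, size - 1]` follows from negative-index wrapping, as in the generated `tl.where(tmp3, tmp2, tmp0)` above. A plain-Python sketch, where `wrap_index` is a hypothetical helper written for illustration rather than Inductor code:

```python
def wrap_index(idx: int, size: int) -> int:
    """Wrap a possibly negative index, then check bounds like the device assert."""
    wrapped = idx + size if idx < 0 else idx
    if not 0 <= wrapped < size:
        raise IndexError(f"index out of bounds: 0 <= {wrapped} < {size}")
    return wrapped


size = 4
# Every value in [-size, size - 1] maps into [0, size - 1] after wrapping.
wrapped_values = [wrap_index(i, size) for i in range(-size, size)]
```

Anything outside that interval, such as `size` itself or `-size - 1`, fails the bounds check.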

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128378
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-06-14 10:07:06 +00:00
99988be423 [halide-backend] Add test shard (#127308)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127308
Approved by: https://github.com/shunting314, https://github.com/eellison
ghstack dependencies: #128266
2024-06-14 10:02:57 +00:00
03e8a4cf45 [Port][Quant][Inductor] Bug fix: mutation nodes not handled correctly for QLinearPointwiseBinaryPT2E (#128591)
Port #127592 from main to release/2.4

------
Fixes #127402

- Revert some changes to `ir.MutationOutput` and inductor/test_flex_attention.py
- Add checks of mutation for QLinearPointwiseBinaryPT2E

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127592
Approved by: https://github.com/leslie-fang-intel, https://github.com/Chillee

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128591
Approved by: https://github.com/jgong5, https://github.com/Chillee
2024-06-14 09:31:38 +00:00
43ae3073f9 Revert "[traced-graph][sparse] remove redundant assert in sparse prop test (#128523)"
This reverts commit ba3726d02b25dff92762c59d4dffe96a7babfa75.

Reverted https://github.com/pytorch/pytorch/pull/128523 on behalf of https://github.com/DanilBaibak due to Sorry for the revert. Looks like your changes broke the inductor tests: linux-jammy-cpu-py3.8-gcc11-inductor, linux-jammy-cpu-py3.8-gcc11-inductor, linux-jammy-cpu-py3.8-gcc11-inductor. [Here you can find more details](ba3726d02b). ([comment](https://github.com/pytorch/pytorch/pull/128523#issuecomment-2167518145))
2024-06-14 08:27:05 +00:00
0344f95c2e Add missing #include <array> to thread_name.cpp (#128664)
I got local compile errors (using clang 14.0.6) due to this missing include after pulling the
latest pytorch main.  It's totally puzzling why CI appears to pass
without this fix.  Hopefully someone else will have an idea if we are
missing some CI coverage or if I am using a strange build setup locally.

The PR introducing the compile errors was https://github.com/pytorch/pytorch/pull/128448.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128664
Approved by: https://github.com/fduwjj, https://github.com/malfet, https://github.com/d4l3k
2024-06-14 07:49:09 +00:00
03725a0512 [dtensor][example] added MLPStacked example for printing sharding (#128461)
**Summary**
Currently, the comm_mode_feature_examples does not have an example for printing sharding information for a model with nested modules. While adding the new example to the suite, I recognized a way to refactor existing examples in order to make them more readable for users. The expected output can be found below:
<img width="354" alt="Screenshot 2024-06-11 at 5 41 14 PM" src="https://github.com/pytorch/pytorch/assets/50644008/68cef7c7-cb1b-4e51-8b60-85123d96ca92">

**Test Plan**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128461
Approved by: https://github.com/XilunWu
ghstack dependencies: #128369, #128451
2024-06-14 07:30:31 +00:00
dd3b79a08f [dtensor][be] improving readability of comm_mode.py and comm_mode_features_example.py (#128451)
**Summary**
I have added comments to address previous readability concerns in comm_mode.py and comm_mode_features_example.py. I also renamed files and test cases in order to better reflect what they are about. Removed non-distributed test case and other lines of code that do not contribute to the example of how comm_mode can be used. Finally, I've added the expected output for each example function so users are not forced to run code.

**Test Plan**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128451
Approved by: https://github.com/XilunWu
ghstack dependencies: #128369
2024-06-14 07:30:31 +00:00
e886122e98 [dtensor][debug] add module level tracing and readable display (#128369)
**Summary**
Currently, CommDebugMode only allows displaying collective tracing at a model level, whereas a user may require a more detailed breakdown. In order to make this possible, I have changed the ModuleParamaterShardingTracker by adding a string variable to track the current sub-module as well as a dictionary keeping track of the depths of the submodules in the model tree. The CommDebugMode class was changed by adding a new dictionary keeping track of the module collective counts as well as a function that displays the counts in a way that is easy for the user to read. Two examples using MLPModule and Transformer have been added to showcase the new changes. The expected output of the simpler MLPModule example is:

<img width="255" alt="Screenshot 2024-06-10 at 4 58 50 PM" src="https://github.com/pytorch/pytorch/assets/50644008/cf2161ef-2663-49c1-a8d5-9f97e96a1791">

**Test Plan**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/display_sharding_example.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128369
Approved by: https://github.com/XilunWu
2024-06-14 07:30:31 +00:00
4669c6d3ae [quant][pt2e][quantizer] Support set_module_name_qconfig in X86InductorQuantizer (#126044)
Summary:
Added `set_module_name_qconfig` support to allow users to set configurations based on module name in `X86InductorQuantizer`.

For example, only quantize the `sub`:

```python
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(5, 5)
        self.sub = Sub()

    def forward(self, x):
        x = self.linear(x)
        x = self.sub(x)
        return x

m = M().eval()
example_inputs = (torch.randn(3, 5),)
# Set config for a specific submodule.
quantizer = X86InductorQuantizer()
quantizer.set_module_name_qconfig("sub", xiq.get_default_x86_inductor_quantization_config())
```

- Added `set_module_name_qconfig` to allow users to set the configuration at the `module_name` level.
- Unified the annotation process to follow this order:  `module_name_qconfig`, `operator_type_qconfig`, and `global_config`.
- Added `config_checker` to validate all user configurations and prevent mixing of static/dynamic or QAT/non-QAT configs.
- Moved `_get_module_name_filter` from `xnnpack_quantizer.py` into `utils.py` as it is common to all quantizers.
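
The annotation order listed above can be pictured as a simple priority lookup. The sketch below is a hypothetical model in plain Python (the function `resolve_qconfig`, the string-valued configs, and the prefix-matching rule are all assumptions), not the `X86InductorQuantizer` implementation:

```python
from typing import Dict, Optional


def resolve_qconfig(
    module_path: str,
    op_type: str,
    module_name_qconfig: Dict[str, str],
    operator_type_qconfig: Dict[str, str],
    global_config: Optional[str],
) -> Optional[str]:
    """Pick a config for one node: module name first, then operator type, then global."""
    for name, cfg in module_name_qconfig.items():
        # Assume a node belongs to module `name` if its path is `name` or nested under it.
        if module_path == name or module_path.startswith(name + "."):
            return cfg
    if op_type in operator_type_qconfig:
        return operator_type_qconfig[op_type]
    return global_config


by_name = {"sub": "sub_config"}
by_op = {"linear": "linear_config"}

in_sub = resolve_qconfig("sub.linear", "linear", by_name, by_op, None)
top_level = resolve_qconfig("linear", "linear", by_name, by_op, None)
unmatched = resolve_qconfig("act", "relu", by_name, by_op, None)
```

A node inside `sub` gets the module-name config even though an operator-type config also matches; a node that matches nothing falls through to the global config, which is `None` here and would mean "leave unquantized" in this toy model.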

Test Plan

```bash
python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_set_module_name
```

@Xia-Weiwen @leslie-fang-intel  @jgong5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126044
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jerryzh168
2024-06-14 07:13:10 +00:00
674be9d3be Update cu124 dynamo benchmark expected values (#128589)
I believe this corresponds to changes in https://github.com/pytorch/pytorch/pull/127780

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128589
Approved by: https://github.com/nWEIdia, https://github.com/DanilBaibak
2024-06-14 07:04:34 +00:00
18f35d9e12 Revert "Run all samples for torchinductor tests (#128343)"
This reverts commit 41df20c07caecddb6d21d69a125f2998ae9313e8.

Reverted https://github.com/pytorch/pytorch/pull/128343 on behalf of https://github.com/clee2000 due to broke inductor/test_torchinductor_opinfo.py::TestInductorOpInfoCUDA::test_comprehensive_nn_functional_avg_pool3d_cuda_float16 and other tests 41df20c07c https://github.com/pytorch/pytorch/actions/runs/9509191526/job/26213490266. I think this might be a landrace ([comment](https://github.com/pytorch/pytorch/pull/128343#issuecomment-2167275337))
2024-06-14 06:08:17 +00:00
f48f7615dc [easy][subclasses] dynamo.reset() in test_subclass_views (#128659)
When we don't dynamo.reset(), we don't recompile on different dynamic shapes.

Also, some of the returned views were tuples - so when we `* 2`, we actually just copy all the inputs twice in the tuple. I changed it so that it would just return one of the values from the return tuple.
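
The tuple pitfall mentioned above is ordinary Python sequence repetition. A tiny illustration, with a hypothetical `returns_tuple` function standing in for a view op that returns several values:

```python
from typing import Tuple


def returns_tuple(x: float) -> Tuple[float, float]:
    """Stand-in for an op that returns more than one value."""
    return (x, x + 1.0)


out = returns_tuple(1.5)
repeated = out * 2    # sequence repetition: the tuple's contents appear twice
scaled = out[0] * 2   # arithmetic on a single element of the tuple
```

`repeated` has four elements and no arithmetic has happened, while `scaled` is the doubled first value.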

Additionally, this exposes a bug that fails with the slice operation, so I skipped it when we're testing with dynamic shapes:

```
  File "/home/dberard/local/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3996, in produce_guards
    sexpr = ShapeGuardPrinter(symbol_to_source, source_ref, self.var_to_sources).doprint(expr)
  File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 292, in doprint
    return self._str(self._print(expr))
  File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 331, in _print
    return printmethod(expr, **kwargs)
  File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 56, in _print_Add
    t = self._print(term)
  File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 331, in _print
    return printmethod(expr, **kwargs)
  File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 366, in _print_Mul
    a_str = [self.parenthesize(x, prec, strict=False) for x in a]
  File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 366, in <listcomp>
    a_str = [self.parenthesize(x, prec, strict=False) for x in a]
  File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 37, in parenthesize
    return self._print(item)
  File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 331, in _print
    return printmethod(expr, **kwargs)
  File "/home/dberard/local/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1494, in _print_Symbol
    assert self.symbol_to_source.get(expr), (
AssertionError: s3 (could be from ['<ephemeral: symint_visitor_fn>', '<ephemeral: symint_visitor_fn>']) not in {s0: ["L['x'].a.size()[1]", "L['x'].b.size()[1]", "L['x'].size()[1]", "L['x'].a.size()[1]", "L['x'].b.size()[1]", "L['x'].a.size()[1]", "L['x'].b.size()[1]"], s1: ["L['x'].a.stride()[0]", "L['x'].b.stride()[0]", "L['x'].stride()[0]", "L['x'].a.stride()[0]", "L['x'].b.stride()[0]", "L['x'].a.stride()[0]", "L['x'].b.stride()[0]"], s2: ["L['x'].a.storage_offset()", "L['x'].b.storage_offset()", "L['x'].a.storage_offset()", "L['x'].b.storage_offset()"]}.  If this assert is failing, it could be due to the issue described in https://github.com/pytorch/pytorch/pull/90665
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128659
Approved by: https://github.com/YuqingJ
2024-06-14 05:18:07 +00:00
9ac08dab1f Updates diskspace-cleanup for ROCm CI (#127947)
Gets the location of the docker directory and outputs how much disk space is being used by docker.

This is required since the new Cirrascale CI nodes for ROCm have docker root directory in a different partition.

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127947
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2024-06-14 04:32:38 +00:00
eff01bce21 Only run inductor A100 perf benchmark smoke test periodically (#128677)
Attempt to mitigate the long queue on A100 as reported in https://github.com/pytorch/pytorch/issues/128627.

From what I see, this change 03467b3fed/1 doubles the job duration from 20+ to 40+ minutes. This, together with https://github.com/pytorch/pytorch/blob/main/.github/workflows/inductor-cu124.yml and maybe an increased number of PRs with `ciflow/inductor`, is contributing to the long queue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128677
Approved by: https://github.com/atalman, https://github.com/desertfire
2024-06-14 02:39:33 +00:00
ba3726d02b [traced-graph][sparse] remove redundant assert in sparse prop test (#128523)
The assertEqualMeta() method already tests that the first argument is a FakeTensor

https://github.com/pytorch/pytorch/issues/117188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128523
Approved by: https://github.com/soulitzer
2024-06-14 02:34:51 +00:00
685fcfb40d Fix docstring in autograd (#128657)
Fix docstrings in autograd files.

The fix can be verified by running pydocstyle path-to-file --count

Related #112593

**BEFORE the PR:**

pydocstyle torch/autograd/anomaly_mode.py --count
8
pydocstyle torch/autograd/__init__.py --count
9

**AFTER the PR:**

pydocstyle torch/autograd/anomaly_mode.py --count
0
pydocstyle torch/autograd/__init__.py --count
0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128657
Approved by: https://github.com/soulitzer
2024-06-14 02:18:59 +00:00
0186b386cd Revert "[ONNX] Add upsample trilinear to skip decomp (#128259)"
This reverts commit b72989a2b5ac4637612e31e325d7c8233fcbd7a1.

Reverted https://github.com/pytorch/pytorch/pull/128259 on behalf of https://github.com/huydhn due to Sorry for reverting your change but its ONNX job is failing in trunk b72989a2b5 ([comment](https://github.com/pytorch/pytorch/pull/128259#issuecomment-2167058937))
2024-06-14 01:44:26 +00:00
f48ca2561d Document torch.cuda.profiler.start (#128098)
Document the start function of cuda/profiler.py (https://github.com/pytorch/pytorch/issues/127917).

Fixes #127917

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128098
Approved by: https://github.com/aaronenyeshi
2024-06-14 01:44:18 +00:00
41df20c07c Run all samples for torchinductor tests (#128343)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128343
Approved by: https://github.com/lezcano
2024-06-14 01:28:32 +00:00
6895a5804c Revert "[checkpoint] Clean up selective activation checkpoint and make public (#125795)"
This reverts commit c472cec5656b9ffb668af97a02d711bdbdf5ebec.

Reverted https://github.com/pytorch/pytorch/pull/125795 on behalf of https://github.com/soulitzer due to breaking torchtitan CI ([comment](https://github.com/pytorch/pytorch/pull/125795#issuecomment-2167036157))
2024-06-14 01:14:59 +00:00
6564d63e69 Use mv kernel for small M (#128632)
Previously we were using:
* mv kernel for M == 1
* mm kernel for 1 < M < 4
* llama.cpp inspired mm kernel for M >= 4

This PR consolidates these into only 2 kernels, using the same mv kernel for M < 12.
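
The selection rule above can be sketched as follows. This is only an illustration of the thresholds in this message: the function names and kernel labels are hypothetical, and the assumption that the llama.cpp inspired mm kernel is the one kept for M >= 12 is not confirmed by the text.

```python
def kernel_before(m: int) -> str:
    """Kernel choice before this PR, as listed in the message."""
    if m == 1:
        return "mv"
    if m < 4:
        return "mm"
    return "llama_cpp_mm"


def kernel_after(m: int, mv_threshold: int = 12) -> str:
    """Kernel choice after this PR: the mv kernel for every M below the threshold.

    Assumption: the remaining kernel for larger M is the llama.cpp inspired one.
    """
    return "mv" if m < mv_threshold else "llama_cpp_mm"


for m in (1, 3, 4, 11, 12):
    print(m, kernel_before(m), kernel_after(m))
```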

Benchmarked on https://github.com/malfet/llm_experiments/blob/main/metal-perf/int8mm.mm

Mac M1 Max, input size M x 4128 x 4096

![llama cpp shader and ATen shader (2)](https://github.com/pytorch/pytorch/assets/8188269/9e2e3024-c5ea-4303-88bf-ff3646296396)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128632
Approved by: https://github.com/malfet
2024-06-14 01:06:53 +00:00
ae2359638b Save DOT file of graph instead of SVG for GraphTransformObserver (#128634)
Summary:
GraphTransformObserver saves the SVG file of the input/output graph in each inductor pass. In my test with CMF model, if the graph is large, GraphViz took forever to convert DOT to SVG. That is NOT acceptable.

This DIFF saves a DOT file instead of an SVG file to speed this up. The DOT file is also an order of magnitude smaller than the SVG.

To view these graphs, users can run dot -Txxx input.dot to convert DOT to any other format they want. Users can also control how many iterations are used to lay out the graph properly. Refer to https://web.archive.org/web/20170507095019/http://graphviz.org/content/attrs#dnslimit for details.
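
A minimal sketch of the conversion step, assuming Graphviz is installed and that `nslimit` is the attribute the link above refers to. The helper name `dot_command` and the output file naming are hypothetical:

```python
import os
import shutil
import subprocess


def dot_command(dot_path, out_format="svg", nslimit=None):
    """Build a Graphviz `dot` argument list that converts a DOT file."""
    cmd = ["dot", f"-T{out_format}"]
    if nslimit is not None:
        # -Gname=value sets a graph attribute; nslimit bounds the number of
        # network simplex iterations used during layout.
        cmd.append(f"-Gnslimit={nslimit}")
    cmd += [dot_path, "-o", f"{dot_path}.{out_format}"]
    return cmd


cmd = dot_command("input.dot", "svg", nslimit=2)
print(" ".join(cmd))  # dot -Tsvg -Gnslimit=2 input.dot -o input.dot.svg

# Only run the conversion when Graphviz and the input file are available.
if shutil.which("dot") and os.path.exists("input.dot"):
    subprocess.run(cmd, check=True)
```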

Test Plan: buck2 test mode/dev-sand caffe2/test:fx --  fx.test_fx_xform_observer.TestGraphTransformObserver

Differential Revision: D58539182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128634
Approved by: https://github.com/mengluy0125
2024-06-14 00:54:22 +00:00
6f181756dc Use by-column algorithm for fp16/bf16 CPUBlas gemm_transb kernels (#127318)
Summary: #96074 (D44340826) changed the algorithm for 16-bit types for gemm_notrans_ and gemm_transb_ for the sake of precision. In this diff, we go back to the old algorithm for gemm_transb_, maintaining precision by allocating temporary space equal to (in elements, so actually double since we are accumulating 16-bit types into fp32) the size of `c` to accumulate into.
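
To show why the wider accumulator matters, here is a small pure-Python sketch, not the CPUBlas code. It emulates fp16 rounding with `struct`, and a Python float stands in for the fp32 temporary; the function names are hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dot_fp16_accumulator(a, b):
    """Round the running sum to fp16 after every addition."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_fp16(acc + to_fp16(x * y))
    return acc


def dot_wide_accumulator(a, b):
    """Accumulate in a wider type and round to fp16 once at the end."""
    acc = 0.0  # Python float stands in for the fp32 temporary buffer
    for x, y in zip(a, b):
        acc += x * y
    return to_fp16(acc)


a = [1.0] * 4096
b = [1.0] * 4096
print(dot_fp16_accumulator(a, b))  # 2048.0: fp16 cannot represent 2049
print(dot_wide_accumulator(a, b))  # 4096.0
```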

Test Plan: Used https://github.com/malfet/llm_experiments (benchmarks/benchmark_torch_mm.py) to benchmark before and after:

before:
```
mv_nt    torch.float32    5.47 usec
mv_nt    torch.float16    8.45 usec
mv_nt   torch.bfloat16  183.43 usec
mv_ta    torch.float32    5.70 usec
mv_ta    torch.float16   24.17 usec
mv_ta   torch.bfloat16   97.27 usec
notrans  torch.float32    5.58 usec
notrans  torch.float16   25.18 usec
notrans torch.bfloat16   63.11 usec
trans_a  torch.float32    5.59 usec
trans_a  torch.float16   68.94 usec
trans_a torch.bfloat16  311.60 usec
trans_b  torch.float32    5.63 usec
trans_b  torch.float16    8.76 usec
trans_b torch.bfloat16   29.17 usec
```

after:
```
mv_nt    torch.float32    5.53 usec
mv_nt    torch.float16    8.57 usec
mv_nt   torch.bfloat16  188.17 usec
mv_ta    torch.float32    5.78 usec
mv_ta    torch.float16   28.59 usec
mv_ta   torch.bfloat16   98.45 usec
notrans  torch.float32    5.71 usec
notrans  torch.float16   26.08 usec
notrans torch.bfloat16   64.06 usec
trans_a  torch.float32    5.72 usec
trans_a  torch.float16   32.21 usec
trans_a torch.bfloat16   32.10 usec
trans_b  torch.float32    5.83 usec
trans_b  torch.float16    9.05 usec
trans_b torch.bfloat16   29.66 usec
```

Also expanded coverage to a range of larger matrix-vector and matrix-matrix sizes.

before:
```
Matrix-vector:
m=1024, n=1024, k=1
====================
notrans  torch.float32   24.75 usec
notrans  torch.float16  258.04 usec
notrans torch.bfloat16  245.64 usec
trans_a  torch.float32   26.94 usec
trans_a  torch.float16  692.09 usec
trans_a torch.bfloat16 1709.53 usec
m=4100, n=4100, k=1
====================
notrans  torch.float32 2811.48 usec
notrans  torch.float16 4192.06 usec
notrans torch.bfloat16 4041.01 usec
trans_a  torch.float32 2778.38 usec
trans_a  torch.float16 17218.41 usec
trans_a torch.bfloat16 27561.21 usec
m=16384, n=16384, k=1
====================
notrans  torch.float32 60157.66 usec
notrans  torch.float16 64121.38 usec
notrans torch.bfloat16 65714.65 usec
trans_a  torch.float32 84975.39 usec
trans_a  torch.float16 1024223.33 usec
trans_a torch.bfloat16 1078683.21 usec

Matrix-matrix:
m=1024, n=1024, k=256
====================
notrans  torch.float32  302.55 usec
notrans  torch.float16 172869.06 usec
notrans torch.bfloat16 172837.81 usec
trans_a  torch.float32  250.03 usec
trans_a  torch.float16 333373.38 usec
trans_a torch.bfloat16 432760.00 usec
m=4100, n=4100, k=128
====================
notrans  torch.float32 5278.56 usec
notrans  torch.float16 1426335.29 usec
notrans torch.bfloat16 1404249.37 usec
trans_a  torch.float32 4818.63 usec
trans_a  torch.float16 2969936.17 usec
trans_a torch.bfloat16 3432565.96 usec
m=16384, n=16384, k=16
====================
notrans  torch.float32 72225.71 usec
notrans  torch.float16 1439875.54 usec
notrans torch.bfloat16 1443716.33 usec
trans_a  torch.float32 221130.21 usec
trans_a  torch.float16 16910654.17 usec
trans_a torch.bfloat16 21447377.63 usec
```

after:
```
Matrix-vector:
m=1024, n=1024, k=1
====================
notrans  torch.float32   25.11 usec
notrans  torch.float16  252.76 usec
notrans torch.bfloat16  238.58 usec
trans_a  torch.float32   26.62 usec
trans_a  torch.float16  167.40 usec
trans_a torch.bfloat16  174.08 usec
m=4100, n=4100, k=1
====================
notrans  torch.float32 2774.28 usec
notrans  torch.float16 3991.70 usec
notrans torch.bfloat16 3945.44 usec
trans_a  torch.float32 3011.25 usec
trans_a  torch.float16 2666.85 usec
trans_a torch.bfloat16 2686.95 usec
m=16384, n=16384, k=1
====================
notrans  torch.float32 58682.15 usec
notrans  torch.float16 63077.52 usec
notrans torch.bfloat16 63319.33 usec
trans_a  torch.float32 70549.57 usec
trans_a  torch.float16 42145.45 usec
trans_a torch.bfloat16 42270.13 usec

Matrix-matrix:
m=1024, n=1024, k=256
====================
notrans  torch.float32  289.37 usec
notrans  torch.float16 179704.87 usec
notrans torch.bfloat16 173490.33 usec
trans_a  torch.float32  330.89 usec
trans_a  torch.float16 42466.26 usec
trans_a torch.bfloat16 42811.19 usec
m=4100, n=4100, k=128
====================
notrans  torch.float32 4793.33 usec
notrans  torch.float16 1407557.04 usec
notrans torch.bfloat16 1388212.17 usec
trans_a  torch.float32 4714.20 usec
trans_a  torch.float16 359406.58 usec
trans_a torch.bfloat16 350419.42 usec
m=16384, n=16384, k=16
====================
notrans  torch.float32 65757.08 usec
notrans  torch.float16 1427715.71 usec
notrans torch.bfloat16 1440883.00 usec
trans_a  torch.float32 202263.44 usec
trans_a  torch.float16 1387522.33 usec
trans_a torch.bfloat16 1762253.92 usec
```

We are improving, but still have a lot of room for improvement compared to float32 BLAS. Full disclosure: applying this same method to gemm_notrans (which does correspond to notrans in the benchmark's nomenclature) does not improve performance across the board; the 16KB x 16KB x 16 matmul regresses and I haven't figured out why yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127318
Approved by: https://github.com/peterbell10, https://github.com/malfet
2024-06-14 00:39:18 +00:00
18f5357f4f Introduce heuristic for mixed_mm on A100 (#128232)
This PR introduces a heuristic for tuned_mixed_mm. The heuristic is only enabled on an A100, because it has only been tested on an A100, and it is only enabled if force_mixed_mm="heuristic".

I compared the heuristic to the aten fallback implementation and triton+autotune:
 Geometric mean speedup: 2.51
 ```
 m     n     k  triton + autotune (GB/s)  aten (GB/s)  heuristic (GB/s)  used_heuristic  speedup (heuristic/aten)
  1  4096  4096                    456.95       134.59            459.37            True                      3.41
  1  4096  8192                    523.93       138.29            553.50            True                      4.00
  1  4096 16394                    233.70       161.62            234.14            True                      1.45
  1  8192  4096                    633.25       140.64            574.86            True                      4.09
  1  8192  8192                    737.54       147.41            690.26            True                      4.68
  1  8192 16394                    413.67       175.88            408.68            True                      2.32
  1 16394  4096                    717.22       167.22            665.36            True                      3.98
  1 16394  8192                    812.69       177.17            815.90            True                      4.61
  1 16394 16394                    473.17       178.58            435.11            True                      2.44
  4  4096  4096                    479.46       134.80            486.74            True                      3.61
  4  4096  6333                    174.27       106.74            171.64            True                      1.61
  4  4096  8192                    567.14       138.32            571.09            True                      4.13
  4  4096 12313                    179.65       105.91            180.03            True                      1.70
  4  4096 16394                    222.96       145.54            222.81            True                      1.53
  4  6333  4096                    491.78       126.37            473.20            True                      3.74
  4  6333  6333                    268.79       143.40            269.75            True                      1.88
  4  6333  8192                    783.80       135.12            796.23            True                      5.89
  4  6333 12313                    286.35       142.37            287.30            True                      2.02
  4  6333 16394                    362.47       139.66            361.47            True                      2.59
  4  8192  4096                    642.73       140.53            641.88            True                      4.57
  4  8192  6333                    287.65       137.63            287.38            True                      2.09
  4  8192  8192                    738.42       150.16            721.59            True                      4.81
  4  8192 12313                    301.27       146.18            302.31            True                      2.07
  4  8192 16394                    415.37       167.66            393.41            True                      2.35
  4 12313  4096                    823.66       141.81            745.40            True                      5.26
  4 12313  6333                    433.92       148.17            429.83            True                      2.90
  4 12313  8192                    984.60       149.30            988.95            True                      6.62
  4 12313 12313                    452.00       150.87            452.50            True                      3.00
  4 12313 16394                    609.88       159.20            609.71            True                      3.83
  4 16394  4096                    779.44       157.46            777.10            True                      4.94
  4 16394  6333                    402.93       139.50            309.47            True                      2.22
  4 16394  8192                    950.38       175.49            949.67            True                      5.41
  4 16394 12313                    414.62       153.99            315.95            True                      2.05
  4 16394 16394                    497.56       174.97            461.77            True                      2.64
16  4096  4096                    475.92       134.45            478.57            True                      3.56
16  4096  6333                    146.36       112.50            145.35            True                      1.29
16  4096  8192                    560.00       138.22            557.19            True                      4.03
16  4096 12313                    152.02       105.06            151.27            True                      1.44
16  4096 16394                    222.48       156.72            222.88            True                      1.42
16  6333  4096                    692.41       122.14            696.88            True                      5.71
16  6333  6333                    220.74       140.90            225.41            True                      1.60
16  6333  8192                    813.56       140.21            820.28            True                      5.85
16  6333 12313                    232.48       131.19            232.55            True                      1.77
16  6333 16394                    367.39       134.93            361.87            True                      2.68
16  8192  4096                    665.54       140.29            266.24            True                      1.90
16  8192  6333                    254.77       136.65            240.12            True                      1.76
16  8192  8192                    750.63       146.26            736.93            True                      5.04
16  8192 12313                    266.61       127.13            251.81            True                      1.98
16  8192 16394                    397.25       160.42            390.76            True                      2.44
16 12313  4096                    857.48       141.36            851.36            True                      6.02
16 12313  6333                    423.21       132.40            357.55            True                      2.70
16 12313  8192                   1021.24       145.68           1024.60            True                      7.03
16 12313 12313                    370.12       143.94            383.52            True                      2.66
16 12313 16394                    608.52       141.03            608.48            True                      4.31
16 16394  4096                    826.48       155.94            826.74            True                      5.30
16 16394  6333                    420.38       144.09            265.23            True                      1.84
16 16394  8192                    988.07       156.21            984.63            True                      6.30
16 16394 12313                    431.40       146.92            265.49            True                      1.81
16 16394 16394                    497.39       167.86            461.79            True                      2.75
23  4096  4096                    344.43       132.84            338.64            True                      2.55
23  4096  6333                    195.34       118.48            195.31            True                      1.65
23  4096  8192                    389.83       140.02            376.62            True                      2.69
23  4096 12313                    204.49       137.96            204.80            True                      1.48
23  4096 16394                    242.48       148.99            242.74            True                      1.63
23  6333  4096                    429.25       126.52            517.75            True                      4.09
23  6333  6333                    295.56       133.51            296.14            True                      2.22
23  6333  8192                    594.88       137.05            581.78            True                      4.25
23  6333 12313                    315.18       131.67            314.64            True                      2.39
23  6333 16394                    386.46       141.45            386.54            True                      2.73
23  8192  4096                    553.52       142.05            568.35            True                      4.00
23  8192  6333                    215.58       139.01            210.86            True                      1.52
23  8192  8192                    609.21       154.85            528.76            True                      3.41
23  8192 12313                    220.38       142.93            233.54            True                      1.63
23  8192 16394                    402.63       158.39            403.21            True                      2.55
23 12313  4096                    723.54       131.58            581.94            True                      4.42
23 12313  6333                    307.90       131.58            307.90            True                      2.34
23 12313  8192                    893.36       129.97            623.72            True                      4.80
23 12313 12313                    322.40       134.84            317.80            True                      2.36
23 12313 16394                    512.97       142.31            409.45            True                      2.88
23 16394  4096                    703.66       154.54            643.53            True                      4.16
23 16394  6333                    305.55       127.55            293.17            True                      2.30
23 16394  8192                    768.12       154.60            681.53            True                      4.41
23 16394 12313                    311.61       140.92            307.01            True                      2.18
23 16394 16394                    467.24       171.07            467.29            True                      2.73
32  4096  4096                    344.71       132.30            338.62            True                      2.56
32  4096  6333                    206.48       107.59            205.55            True                      1.91
32  4096  8192                    387.24       137.82            353.12            True                      2.56
32  4096 12313                    216.35       120.61            214.50            True                      1.78
32  4096 16394                    242.05       149.92            241.94            True                      1.61
32  6333  4096                    525.50       127.12            518.02            True                      4.08
32  6333  6333                    300.50       118.41            296.55            True                      2.50
32  6333  8192                    600.92       136.99            601.94            True                      4.39
32  6333 12313                    316.13       136.45            316.03            True                      2.32
32  6333 16394                    386.11       141.34            386.10            True                      2.73
32  8192  4096                    546.18       140.18            341.14            True                      2.43
32  8192  6333                    218.40       130.65            263.42            True                      2.02
32  8192  8192                    608.29       147.16            542.12            True                      3.68
32  8192 12313                    225.60       135.04            225.23            True                      1.67
32  8192 16394                    434.75       160.42            401.28            True                      2.50
32 12313  4096                    787.80       136.28            583.60            True                      4.28
32 12313  6333                    316.66       125.76            323.35            True                      2.57
32 12313  8192                    891.38       128.88            639.50            True                      4.96
32 12313 12313                    326.11       132.37            325.88            True                      2.46
32 12313 16394                    521.64       139.47            395.69            True                      2.84
32 16394  4096                    625.55       158.46            651.16            True                      4.11
32 16394  6333                    304.14       131.13            284.55            True                      2.17
32 16394  8192                    767.79       162.95            704.34            True                      4.32
32 16394 12313                    310.74       137.68            303.39            True                      2.20
32 16394 16394                    465.92       171.43            465.37            True                      2.71
43  4096  4096                    345.05       133.87            196.47            True                      1.47
43  4096  6333                    148.64        99.92            148.97            True                      1.49
43  4096  8192                    386.50       135.39            214.00            True                      1.58
43  4096 12313                    190.39       109.36            156.27            True                      1.43
43  4096 16394                    203.63       150.24            204.05            True                      1.36
43  6333  4096                    421.35       106.04            132.25            True                      1.25
43  6333  6333                    224.75       113.01            224.97            True                      1.99
43  6333  8192                    471.11       117.61            327.39            True                      2.78
43  6333 12313                    234.55       115.61            234.74            True                      2.03
43  6333 16394                    311.56       132.24            312.01            True                      2.36
43  8192  4096                    400.73       140.12            269.11            True                      1.92
43  8192  6333                    167.32       119.13            168.84            True                      1.42
43  8192  8192                    435.45       146.98            286.21            True                      1.95
43  8192 12313                    161.05       127.82            162.78            True                      1.27
43  8192 16394                    207.16       156.40            208.90            True                      1.34
43 12313  4096                    484.01       120.10            313.35            True                      2.61
43 12313  6333                    234.54       106.63            232.85            True                      2.18
43 12313  8192                    515.34       130.23            411.70            True                      3.16
43 12313 12313                    239.39       130.04            239.03            True                      1.84
43 12313 16394                    316.02       137.39            316.29            True                      2.30
43 16394  4096                    475.60       152.57            340.97            True                      2.23
43 16394  6333                    241.21       132.49            208.59            True                      1.57
43 16394  8192                    499.34       157.43            361.61            True                      2.30
43 16394 12313                    246.25       132.31            211.68            True                      1.60
43 16394 16394                    302.90       158.56            277.05            True                      1.75
64  4096  4096                    280.48       126.82            195.97            True                      1.55
64  4096  6333                    150.94       101.63            150.48            True                      1.48
64  4096  8192                    305.47       135.06            211.03            True                      1.56
64  4096 12313                    158.12       110.06            158.15            True                      1.44
64  4096 16394                    206.68       136.21            201.28            True                      1.48
64  6333  4096                    409.11       105.10            296.07            True                      2.82
64  6333  6333                    229.98       108.46            230.59            True                      2.13
64  6333  8192                    469.32       112.24            330.58            True                      2.95
64  6333 12313                    245.02       117.16            244.84            True                      2.09
64  6333 16394                    317.78       125.80            318.37            True                      2.53
64  8192  4096                    323.42       139.92            267.31            True                      1.91
64  8192  6333                    167.51       118.45            167.56            True                      1.41
64  8192  8192                    341.13       146.71            284.88            True                      1.94
64  8192 12313                    172.21       123.42            171.97            True                      1.39
64  8192 16394                    217.22       153.18            216.99            True                      1.42
64 12313  4096                    482.19       123.32            311.82            True                      2.53
64 12313  6333                    238.73       123.88            238.66            True                      1.93
64 12313  8192                    516.32       122.11            330.50            True                      2.71
64 12313 12313                    248.73       125.32            296.82            True                      2.37
64 12313 16394                    314.98       134.06            320.31            True                      2.39
64 16394  4096                    476.59       154.58            340.84            True                      2.20
64 16394  6333                    240.54       119.60            214.82            True                      1.80
64 16394  8192                    501.36       149.02            359.45            True                      2.41
64 16394 12313                    244.65       126.01            222.47            True                      1.77
64 16394 16394                    302.48       160.36            283.66            True                      1.77
```
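
For reference, the speedup column and a geometric mean can be recomputed as sketched below. Only the first three rows of the table are used, so the result is not expected to match the 2.51 reported for the full table.

```python
from statistics import geometric_mean

# (aten GB/s, heuristic GB/s) taken from the first three rows of the table.
rows = [(134.59, 459.37), (138.29, 553.50), (161.62, 234.14)]

# speedup = heuristic throughput / aten throughput
speedups = [heuristic / aten for aten, heuristic in rows]
print([round(s, 2) for s in speedups])  # [3.41, 4.0, 1.45]

# The geometric mean is the n-th root of the product of the speedups.
subset_geomean = geometric_mean(speedups)
print(round(subset_geomean, 2))
```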

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128232
Approved by: https://github.com/Chillee
2024-06-14 00:31:22 +00:00
cyy
9ebec1f345 Enable Wunused-function in torch_cpu (#128576)
Follows #128499

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128576
Approved by: https://github.com/ezyang, https://github.com/r-barnes
2024-06-14 00:12:58 +00:00
6767e38267 Fix manual licensing (#128630)
It has come to my attention that some of our licenses are incorrect, so I attempted to rectify a few of them based on given recommendations for:
clog - BSD-3
eigen - MPL-2.0
ffnvcodec - LGPL-2.1
-> **hungarian - Permissive (free to use)**
irrlicht - The Irrlicht Engine License (zlib/libpng)
-> **pdcurses - Public Domain for core**
-> **sigslot - Public Domain**
test - BSD-3
Vulkan - Apache-2.0 or MIT
fb-only: more context is here https://fb.workplace.com/groups/osssupport/posts/26333256012962998/?comment_id=26333622989592967

This PR addresses the manual licensing mismatches mentioned above (two of the bolded entries; the third is being addressed in #128085). Everything else is generated by pulling through other files, so I did not address those. It is unclear what would need to be updated for the remaining entries to be accurate, or whether they are inaccurate today.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128630
Approved by: https://github.com/malfet
2024-06-14 00:12:09 +00:00
afdaa7fc95 [while_loop] expose it as torch.while_loop (#128562)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128562
Approved by: https://github.com/zou3519
2024-06-13 23:44:10 +00:00
c486e2ab64 Add coloring to fx graph print out (#128476)
Note: Won't land immediately, at least I'll need to add a color option to the field. But curious if any tests fail.

Old:
<img width="1294" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/c3a750ed-5e54-4621-b2e4-be5481be15b6">

New:
<img width="1303" alt="image" src="https://github.com/pytorch/pytorch/assets/6355099/3a1f1adc-6f3a-413e-8b87-ee53da9bf4ed">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128476
Approved by: https://github.com/ezyang
2024-06-13 23:39:04 +00:00
61421c42c0 [custom_op] don't invoke autograd.Function when unnecessary (#127976)
This matches our autograd logic for pytorch native operators. There's no
need to invoke an autograd.Function if we're under a torch.no_grad() or
if none of the inputs have requires_grad=True (invoking an
autograd.Function results in (noticeable) overhead).

Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127976
Approved by: https://github.com/williamwen42
2024-06-13 23:38:23 +00:00
b72989a2b5 [ONNX] Add upsample trilinear to skip decomp (#128259)
(1) Add upsample trilinear vec to skip decomposition
(2) Add tests to make sure that torch.export.export still decomposes them
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128259
Approved by: https://github.com/justinchuby
2024-06-13 23:31:34 +00:00
8c20f53a5e Try seeding individual foreach tests (#128220)
A first easy attempt to deflake foreach

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128220
Approved by: https://github.com/ZainRizvi, https://github.com/crcrpar, https://github.com/huydhn
2024-06-13 22:42:16 +00:00
865d7b3424 [Reland][dynamo] Enable some inlining inbuilt nn module tests (#128440)
Co-authored-by: Laith Sakka <lsakka@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128440
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-06-13 22:39:22 +00:00
3a0006ef22 Remove global variable SIZE, and fix linter warning (#128559)
- Resolve a TODO by removing global variable `SIZE`.
- Fix a linter warning in `test/test_nestedtensor.py`.

`pytest pytorch/test/test_sort_and_select.py` and `pytest test/test_nestedtensor.py` pass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128559
Approved by: https://github.com/kit1980, https://github.com/Skylion007
2024-06-13 22:09:51 +00:00
6211e67e49 Document torch.jit.frontend.get_default_args (#128408)
Fixes #127896

### Description
Add docstring to `torch/jit/frontend.py:get_default_args` function

### Checklist
- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128408
Approved by: https://github.com/malfet
2024-06-13 21:49:16 +00:00
bf8a05f483 [FSDP2] Included module FQN in FSDPParamGroup record_functions (#128624)
This PR adds the module FQN into the `FSDPParamGroup` `record_function`s for improved clarity in profiler traces.

Differential Revision: [D58544809](https://our.internmc.facebook.com/intern/diff/D58544809)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128624
Approved by: https://github.com/ckluk2
2024-06-13 21:35:33 +00:00
c8e9656a12 Revert "Add test to xfail_list only for abi_compatible (#128506)"
This reverts commit 49366b2640df1cba5a3b40bedd31b57b08529612.

Reverted https://github.com/pytorch/pytorch/pull/128506 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it causes an inductor test to fail in trunk 49366b2640 ([comment](https://github.com/pytorch/pytorch/pull/128506#issuecomment-2166824714))
2024-06-13 21:30:07 +00:00
8763d44bf1 add xpu to torch.compile (#127279)
Now that support for Intel GPU has been upstreamed, this PR adds XPU-related content to the torch.compile doc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127279
Approved by: https://github.com/dvrogozh, https://github.com/svekars
2024-06-13 21:15:09 +00:00
790138fdc7 Add profiler annotation for fused_all_gather_matmul and fused_matmul_reduce_scatter (#127556)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127556
Approved by: https://github.com/awgu
ghstack dependencies: #127454, #127455
2024-06-13 20:52:46 +00:00
3b28dc6c9d Improve the scheduling for fused_matmul_reduce_scatter (#127455)
In fused_all_gather_matmul, each rank copies their shard into their
local p2p buffer, performs a barrier, then performs (copy -> matmul) for
each remote shard. The (copy -> matmul)s for remote shards run on two
streams without synchronization. This not only allows for
computation/communication overlapping, but also computation/computation
overlapping which alleviates the wave quantization effect caused by
computation decomposition.

However, the synchronization-free approach doesn't work well with
fused_matmul_reduce_scatter, in which there's a barrier in every step.
Without synchronization between the two streams, a matmul in one stream
can delay a barrier in the other stream, further delaying the copy
waiting for the barrier.

This PR addresses the issue by adding synchronization between the two
streams such that the matmul of step i can only start after the barrier
of step i-1 completes. With this approach, we lose the
computation/computation overlapping, but avoid slowdown due to delayed
barrier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127455
Approved by: https://github.com/Chillee
ghstack dependencies: #127454
2024-06-13 20:52:46 +00:00
c0b40ab42e doc string for torch.jit.frontend.get_jit_class_def method (#128391)
Fixes #127904

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128391
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-06-13 19:51:02 +00:00
a3af32c2fb Add functionality to make ViewAndMutationData (slightly more) cache safe (#127618)
This PR changes the traced_tangents field of ViewAndMutationMeta to be cache safe. Specifically, at runtime, the only time we need the fw_metadata's traced_tangents field is for Tensor subclass metadata from __tensor_flatten__. So instead of storing an entire FakeTensor, which has many fields that can be unserializable, we only store the result of __tensor_flatten__() on any FakeTensors representing subclasses.

That said, there's no guarantee that `__tensor_flatten__` is actually serializable: if we fail to pickle the result of __tensor_flatten__ we won't save to the cache.
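
The fallback described above (skip the cache when the flattened metadata cannot be pickled) can be sketched with the standard library alone. This is a minimal illustration, not the code from this PR: `serialize_for_cache` is a hypothetical helper name, and the tuples below are stand-ins for whatever `__tensor_flatten__` returns.

```python
import pickle
import threading


def serialize_for_cache(flattened_meta):
    """Return pickled bytes, or None when the metadata cannot be pickled.

    Hypothetical helper: a None result tells the caller to bypass the cache.
    """
    try:
        return pickle.dumps(flattened_meta)
    except (pickle.PicklingError, TypeError, AttributeError):
        # Unserializable metadata: do not save an entry for this graph.
        return None


# Plain attribute names and constants pickle without trouble.
ok = serialize_for_cache((["a", "b"], {"requires_grad": True}))

# A lock is a convenient example of an object that pickle rejects.
bad = serialize_for_cache((["a"], {"lock": threading.Lock()}))
```

Returning `None` rather than raising keeps the "cannot cache" case on the normal control path, which matches the behavior described above of simply not saving to the cache.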

To do this, we also make a small change to `__coerce_same_metadata_as_tangent__`, so that it takes in the return value of tensor_flatten() instead of an entire FakeTensor. Let me know if we should change the name of the function.

By doing this, we can now run the dynamic shapes cache test with autograd turned on.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127618
Approved by: https://github.com/bdhirsh
2024-06-13 19:45:33 +00:00
39193b10e8 [inductor] fx graph cache: memoize devices to make cache key calculation more predictable (#128366)
Summary: I've seen this issue once in the wild and oulgen was able to repro in a unit test. The problem is this:
- We're using pickle to turn everything related to the FX graph cache key into a byte stream, then hashing the bytes to compute the cache key.
- Pickle is optimized to avoid serializing the same ID more than once; it instead drops a reference to a previously-pickled object if it encounters the same ID.
- That pickle behavior means that we can see different cache keys if an object id appears more than once in the hashed objects vs. being functionally equivalent but distinct objects.

The cases I've investigated only involve the torch.device objects in the tensor graph args. That is, we may compile a graph with two tensor args, each referencing `torch.device('cpu')`. In one run, those devices may reference the same object; in another, they may reference distinct (but equivalent) objects. In practice, my observation is that the compiler is largely deterministic and this situation is rare. I've seen cache misses on a real benchmark only when enabling/disabling FakeTensor caching in order to introduce different code paths that otherwise produce the same fx graph. But the failing unit test seems to be enough motivation for a remediation?
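
The pickle behavior described above can be reproduced without PyTorch. The sketch below uses small tuples as hypothetical stand-ins for `torch.device` objects; `make_device`, `cache_key`, and `canonical` are illustrative names and not part of the inductor code. It shows that two equal containers hash differently depending on object identity, and that memoizing equal objects to one canonical instance makes the key stable.

```python
import hashlib
import pickle


def make_device():
    # Built at runtime so that every call returns a distinct tuple object.
    return tuple(["cpu", 0])


def cache_key(obj):
    # Same idea as the FX graph cache key: pickle, then hash the bytes.
    return hashlib.sha256(pickle.dumps(obj)).hexdigest()


shared = make_device()
same_object_twice = [shared, shared]                 # one object, referenced twice
equal_but_distinct = [make_device(), make_device()]  # two equal objects

# Memoization: map every equal "device" to a single canonical instance.
_memo = {}


def canonical(dev):
    return _memo.setdefault(dev, dev)


normalized = [canonical(d) for d in equal_but_distinct]

print(same_object_twice == equal_but_distinct)                     # equal by value
print(cache_key(same_object_twice) == cache_key(equal_but_distinct))
print(cache_key(same_object_twice) == cache_key(normalized))
```

The second comparison differs because pickle emits a back-reference for the repeated object in one list but serializes the tuple twice in the other; after normalization both lists have the same identity structure and therefore the same bytes.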

I don't really love this solution, but I've failed to find another way to make the pickling phase robust to these kinds of changes, e.g., by changing the protocol version or by overriding internal methods (which would also be gross). But I'm definitely open to other creative ideas.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128366
Approved by: https://github.com/oulgen, https://github.com/eellison
2024-06-13 19:25:14 +00:00
c54e358bdb enable comprehensive padding internally (#128555)
Summary: The feature was previously disabled in fbcode because it broke the deterministic NE unit tests. It has now been on in OSS for quite a while, and we verified that it has no NE impact on CMF, so we want to update the unit test and enable the feature.

Test Plan:
```
time buck2 test 'fbcode//mode/opt' fbcode//aps_models/ads/icvr/tests/ne/e2e_deterministic_tests:fm_tests -- --exact 'aps_models/ads/icvr/tests/ne/e2e_deterministic_tests:fm_tests - aps_models.ads.icvr.tests.ne.e2e_deterministic_tests.icvr_fm_test.ICVR_FM_DeterministicTest: test_icvr_fm_pt2_fsdp_multi_gpus'

```

Differential Revision: D58425432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128555
Approved by: https://github.com/eellison
2024-06-13 19:20:00 +00:00
cdc37e4bff Add a shape property to IR nodes (#127818)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127818
Approved by: https://github.com/peterbell10
2024-06-13 19:11:52 +00:00
5a80d2df84 [BE] enable UFMT for torch/nn/utils (#128595)
Part of #123062

- #123062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128595
Approved by: https://github.com/Skylion007
2024-06-13 18:34:57 +00:00
9f55c80a9f [AOTI] Fix a minimal_arrayref_interface test failure (#128613)
Summary: When calling a fallback op in the minimal_arrayref_interface mode with an optional tensor, a temporary RAIIAtenTensorHandle needs to be explicitly created in order to pass a pointer to the tensor as the optional tensor parameter.

Test Plan: CI

Differential Revision: D58528575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128613
Approved by: https://github.com/hl475
2024-06-13 18:25:04 +00:00
a265556362 inductor fusion logs: make it easier to attribute to aten graph (#127159)
Summary:

I want to be able to look at inductor fusion logs and reason about which parts of the aot_autograd aten graph were fused / not fused.

This PR adds a short description of each buffer to the fusion logs. Example for forward of `Float8Linear`:

```
torch._inductor.scheduler.__fusion: ===== attempting fusion (1/10): 13 nodes =====
torch._inductor.scheduler.__fusion: fuse_nodes_once, candidates:
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf0'), Reduction(['[254201]', 'max', 'origins={abs_1, max_1}'])
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf3'), Reduction(['[114688]', 'max', 'origins={abs_2, max_2}'])
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf6'), Pointwise(['[]', 'origins={reciprocal_1, convert_element_type_6, clamp_min_2, mul_2, copy_1, reciprocal_3, convert_element_type_5}'])
torch._inductor.scheduler.__fusion:   ExternKernelSchedulerNode(name='buf10')
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf2'), Pointwise(['[]', 'origins={full_default}'])
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf8'), Pointwise(['[8192, 7168]', 'origins={convert_element_type, clamp_min, convert_element_type_1, _scaled_mm, convert_element_type_4, clamp_max, convert_element_type_3, clamp_min_1, copy, convert_element_type_2, mul_1, mul, reciprocal}'])
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf4'), Reduction(['[512]', 'max', 'origins={abs_2, max_2}'])
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf13'), Pointwise(['[8192, 7168]', 'origins={clone_2}'])
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf7'), Pointwise(['[16384, 8192]', 'origins={convert_element_type, clamp_min, convert_element_type_1, _scaled_mm, convert_element_type_4, clamp_max, convert_element_type_3, clamp_min_1, copy, convert_element_type_2, mul_1, mul, reciprocal}'])
torch._inductor.scheduler.__fusion:   ExternKernelSchedulerNode(name='buf9')
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf1'), Reduction(['[528]', 'max', 'origins={abs_1, max_1}'])
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf5'), Pointwise(['[]', 'origins={convert_element_type, clamp_min, convert_element_type_1, copy, reciprocal_2, mul, reciprocal}'])
torch._inductor.scheduler.__fusion:   SchedulerNode(name='buf12'), Pointwise(['[8192, 16384]', 'origins={clone_1}'])
torch._inductor.scheduler.__fusion: cannot fuse buf0 with buf7: no shared data
torch._inductor.scheduler.__fusion: cannot fuse buf0 with buf12: no shared data
torch._inductor.scheduler.__fusion: cannot fuse buf0 with buf1: numel/rnumel mismatch (reduce) (528, 1), (254201, 528)
torch._inductor.scheduler.__fusion: cannot fuse buf7 with buf1: nodes numel incompatibility
torch._inductor.scheduler.__fusion: cannot fuse buf12 with buf1: nodes numel incompatibility
torch._inductor.scheduler.__fusion: cannot fuse buf5 with buf7: numel/rnumel mismatch (non-reduce) (1, 134217728), (1, 1)
torch._inductor.scheduler.__fusion: cannot fuse buf5 with buf12: numel/rnumel mismatch (non-reduce) (1, 134217728), (1, 1)
torch._inductor.scheduler.__fusion: cannot fuse buf3 with buf8: intermediate nodes between node1 & node2
torch._inductor.scheduler.__fusion: cannot fuse buf3 with buf13: no shared data
torch._inductor.scheduler.__fusion: cannot fuse buf3 with buf4: numel/rnumel mismatch (reduce) (512, 1), (114688, 512)
torch._inductor.scheduler.__fusion: cannot fuse buf8 with buf4: nodes numel incompatibility
torch._inductor.scheduler.__fusion: cannot fuse buf13 with buf4: nodes numel incompatibility
torch._inductor.scheduler.__fusion: cannot fuse buf6 with buf8: numel/rnumel mismatch (non-reduce) (1, 58720256), (1, 1)
torch._inductor.scheduler.__fusion: cannot fuse buf6 with buf13: numel/rnumel mismatch (non-reduce) (1, 58720256), (1, 1)
torch._inductor.scheduler.__fusion: cannot fuse buf5 with buf9: node2 is extern or nop
torch._inductor.scheduler.__fusion: cannot fuse buf6 with buf9: node2 is extern or nop
torch._inductor.scheduler.__fusion: cannot fuse buf7 with buf9: node2 is extern or nop
torch._inductor.scheduler.__fusion: cannot fuse buf8 with buf9: node2 is extern or nop
torch._inductor.scheduler.__fusion: cannot fuse buf9 with buf10: node1 is extern or nop
torch._inductor.scheduler.__fusion: found 4 possible fusions
torch._inductor.scheduler.__fusion: fusing buf7 with buf12
torch._inductor.scheduler.__fusion: fusing buf8 with buf13
torch._inductor.scheduler.__fusion: fusing buf4 with buf6
torch._inductor.scheduler.__fusion: fusing buf1 with buf5
torch._inductor.scheduler.__fusion: completed fusion round (1/10): fused 13 nodes into 9 nodes
```
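
Logs in the format above are easy to summarize with a short script. The following is a rough sketch that assumes the exact line shapes shown in this example (`cannot fuse X with Y: reason` and `fusing X with Y`); `summarize_fusion_log` is a hypothetical helper, not something this PR adds.

```python
import re
from collections import Counter

REJECTED = re.compile(r"cannot fuse (\w+) with (\w+): (.+)$")
FUSED = re.compile(r"\bfusing (\w+) with (\w+)$")
# Trailing "(a, b), (c, d)" shape details, dropped so that reasons group together.
SHAPES = re.compile(r"\s*\(\d+, \d+\), \(\d+, \d+\)$")


def summarize_fusion_log(lines):
    """Tally rejection reasons and collect the buffer pairs that were fused."""
    reasons = Counter()
    fused = []
    for line in lines:
        line = line.rstrip()
        match = REJECTED.search(line)
        if match:
            reasons[SHAPES.sub("", match.group(3))] += 1
            continue
        match = FUSED.search(line)
        if match:
            fused.append((match.group(1), match.group(2)))
    return reasons, fused


sample = [
    "torch._inductor.scheduler.__fusion: cannot fuse buf0 with buf7: no shared data",
    "torch._inductor.scheduler.__fusion: cannot fuse buf0 with buf12: no shared data",
    "torch._inductor.scheduler.__fusion: cannot fuse buf0 with buf1: "
    "numel/rnumel mismatch (reduce) (528, 1), (254201, 528)",
    "torch._inductor.scheduler.__fusion: fusing buf7 with buf12",
]
reasons, fused = summarize_fusion_log(sample)
print(reasons.most_common())
print(fused)
```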

Test Plan: will add tests once we align that some version of this can land

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127159
Approved by: https://github.com/mlazos
2024-06-13 18:22:02 +00:00
de9a072ac4 Updating the sigslot license to Public Domain (#128085)
It seems that Sigslot's license is Public Domain, not Apache 2. https://sigslot.sourceforge.net

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128085
Approved by: https://github.com/janeyx99
2024-06-13 18:13:54 +00:00
8733c4f4be docs: Add link to test-infra issue (#128608)
It's not immediately obvious from this file that the issue being referred to is in another repo. Add that detail and link to make it easier for folks reading this code to jump to the correct issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128608
Approved by: https://github.com/DanilBaibak, https://github.com/jeanschmidt, https://github.com/ZainRizvi
2024-06-13 18:00:53 +00:00
dd19c9150c Revert "[aota] compiled forward outputs requires_grad alignment with eager (#128016)"
This reverts commit b459713ca75f6ab7c8a59acec0258e0f77904ada.

Reverted https://github.com/pytorch/pytorch/pull/128016 on behalf of https://github.com/bdhirsh due to fix torchbench regression ([comment](https://github.com/pytorch/pytorch/pull/128016#issuecomment-2166446841))
2024-06-13 17:56:42 +00:00
52f529105d force_stride_order on fused_all_gather_matmul/fused_matmul_reduce_scatter's operands to avoid a copy due to layout transformation (#127454)
When performing fused_all_gather_matmul/fused_matmul_reduce_scatter and gather_dim/scatter_dim != 0, a copy of the lhs operand (A_shard/A) is needed for layout transformation.
This copy can be avoided if the lhs operand already has the following stride order:

    lhs.movedim(gather_dim, 0).contiguous().movedim(0, gather_dim).stride()

In `micro_pipeline_tp` passes, we enforce the lhs operand to have such stride order via `inductor_prims.force_stride_order`. This way if the lhs operand has a flexible layout, the copy is avoided.
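
To make the stride expression above concrete, here is a pure-Python sketch of the same arithmetic, written without PyTorch so that it can be checked by hand. `stride_order_for_gather_dim` is a hypothetical name, and the sketch assumes all sizes are positive. It makes `gather_dim` the outermost dimension in memory and keeps the remaining dimensions in their original relative order.

```python
def stride_order_for_gather_dim(shape, gather_dim):
    """Strides of a tensor laid out as if `gather_dim` were the leading dim.

    Intended to mirror
    lhs.movedim(gather_dim, 0).contiguous().movedim(0, gather_dim).stride()
    for positive sizes.
    """
    # Memory order from outermost to innermost: gather_dim first, then the rest.
    order = [gather_dim] + [d for d in range(len(shape)) if d != gather_dim]
    strides = [0] * len(shape)
    running = 1
    # Walk from the innermost dimension outwards, accumulating element counts.
    for dim in reversed(order):
        strides[dim] = running
        running *= shape[dim]
    return tuple(strides)


print(stride_order_for_gather_dim((2, 3, 4), 0))  # plain contiguous layout
print(stride_order_for_gather_dim((2, 3, 4), 1))  # dim 1 is outermost in memory
```

For shape `(2, 3, 4)` with `gather_dim=1`, the moved tensor has shape `(3, 2, 4)` and contiguous strides `(8, 4, 1)`; moving the leading dimension back gives `(4, 8, 1)`.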

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127454
Approved by: https://github.com/Chillee
2024-06-13 17:52:37 +00:00
d5780396c7 Skip debug asserts for mixed dense, subclass views in autograd_not_implemented_fallback (#128057)
Fixes #125503
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128057
Approved by: https://github.com/albanD, https://github.com/soulitzer
ghstack dependencies: #127007
2024-06-13 17:13:02 +00:00
9a8917fdbd Naive CPU kernels for jagged <-> padded dense conversions (#127007)
This PR introduces naive CPU impls for:
* `_jagged_to_padded_dense_forward()`
* `_padded_dense_to_jagged_forward()`

On the CUDA side, these are backed by lifted FBGEMM kernels. We may want to revisit the CPU versions with higher-performance implementations at a later time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127007
Approved by: https://github.com/davidberard98
2024-06-13 17:13:02 +00:00
a0604193a2 handle call_function with Parameter args in DDPOptimizer splitting (#128034)
When nn module inlining is enabled, modules are replaced with the underlying function calls in the output fx graph.
example:
```
class GraphModule(torch.nn.Module):
  def forward(self, L_x_: "f32[1024, 1024]"):
      l_x_ = L_x_

      # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_structured_trace.py:284 in forward, code: return self.layers(x)
      l__self___layers_0: "f32[1024, 1024]" = self.L__self___layers_0(l_x_);  l_x_ = None
      l__self___layers_1: "f32[1024, 1024]" = self.L__self___layers_1(l__self___layers_0);  l__self___layers_0 = None
      return (l__self___layers_1,)
```

will be
```
class GraphModule(torch.nn.Module):
    def forward(self, L_self_layers_0_weight: "f32[1024, 1024]", L_self_layers_0_bias: "f32[1024]", L_x_: "f32[1024, 1024]", L_self_layers_1_weight: "f32[1024, 1024]", L_self_layers_1_bias: "f32[1024]"):
        l_self_layers_0_weight = L_self_layers_0_weight
        l_self_layers_0_bias = L_self_layers_0_bias
        l_x_ = L_x_
        l_self_layers_1_weight = L_self_layers_1_weight
        l_self_layers_1_bias = L_self_layers_1_bias

        # File: /data/users/lsakka/pytorch/pytorch/torch/nn/modules/linear.py:116 in forward, code: return F.linear(input, self.weight, self.bias)
        input_1: "f32[1024, 1024]" = torch._C._nn.linear(l_x_, l_self_layers_0_weight, l_self_layers_0_bias);  l_x_ = l_self_layers_0_weight = l_self_layers_0_bias = None
        input_2: "f32[1024, 1024]" = torch._C._nn.linear(input_1, l_self_layers_1_weight, l_self_layers_1_bias);  input_1 = l_self_layers_1_weight = l_self_layers_1_bias = None
        return (input_2,)
```
When splitting, the DDP optimizer does not handle the inlined graph, because it does not handle function calls: previously there were no function calls with parameters as inputs, only calls to modules.

This diff addresses that: it uses the example_value of each argument to identify the Parameter arguments of a function call
and their properties.
This addresses https://github.com/pytorch/pytorch/issues/127552

Running the optimizer on the code above with inlining yields the following splitting:
```
---submod_0 graph---
graph():
    %l_x_ : torch.Tensor [num_users=1] = placeholder[target=l_x_]
    %l_self_layers_0_weight : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=l_self_layers_0_weight]
    %l_self_layers_0_bias : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=l_self_layers_0_bias]
    %linear : [num_users=1] = call_function[target=torch._C._nn.linear](args = (%l_x_, %l_self_layers_0_weight, %l_self_layers_0_bias), kwargs = {})
    return linear

---submod_1 graph---
graph():
    %input_1 : [num_users=1] = placeholder[target=input_1]
    %l_self_layers_1_weight : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=l_self_layers_1_weight]
    %l_self_layers_1_bias : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=l_self_layers_1_bias]
    %linear : [num_users=1] = call_function[target=torch._C._nn.linear](args = (%input_1, %l_self_layers_1_weight, %l_self_layers_1_bias), kwargs = {})
    return linear

---final graph---
graph():
    %l_self_layers_0_weight : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=L_self_layers_0_weight]
    %l_self_layers_0_bias : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=L_self_layers_0_bias]
    %l_x_ : torch.Tensor [num_users=1] = placeholder[target=L_x_]
    %l_self_layers_1_weight : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=L_self_layers_1_weight]
    %l_self_layers_1_bias : torch.nn.parameter.Parameter [num_users=1] = placeholder[target=L_self_layers_1_bias]
    %submod_0 : [num_users=1] = call_module[target=compiled_submod_0](args = (%l_x_, %l_self_layers_0_weight, %l_self_layers_0_bias), kwargs = {})
    %submod_1 : [num_users=1] = call_module[target=compiled_submod_1](args = (%submod_0, %l_self_layers_1_weight, %l_self_layers_1_bias), kwargs = {})
    return (submod_1,)
---------------

```
whereas without inlining it used to be
```
---submod_0 graph---
graph():
    %l_x_ : torch.Tensor [num_users=1] = placeholder[target=l_x_]
    %l__self___layers_0 : [num_users=1] = call_module[target=L__self___layers_0](args = (%l_x_,), kwargs = {})
    return l__self___layers_0
/data/users/lsakka/pytorch/pytorch/torch/_inductor/compile_fx.py:133: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
  warnings.warn(

---submod_1 graph---
graph():
    %l__self___layers_0 : [num_users=1] = placeholder[target=l__self___layers_0]
    %l__self___layers_1 : [num_users=1] = call_module[target=L__self___layers_1](args = (%l__self___layers_0,), kwargs = {})
    return l__self___layers_1

---final graph---
graph():
    %l_x_ : torch.Tensor [num_users=1] = placeholder[target=L_x_]
    %submod_0 : [num_users=1] = call_module[target=compiled_submod_0](args = (%l_x_,), kwargs = {})
    %submod_1 : [num_users=1] = call_module[target=compiled_submod_1](args = (%submod_0,), kwargs = {})
    return (submod_1,)
---------------
```

TESTING:

(1) running
``` TORCHDYNAMO_INLINE_INBUILT_NN_MODULES=1   pytest test/distributed/test_dynamo_distributed.py -k ```
reduces the number of failures from 6 to 2 with this PR.

The two remaining failures are FSDP-related; they do not look trivial and involve many details, so they are left for future work.

Co-authored-by: Animesh Jain <anijain@umich.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128034
Approved by: https://github.com/anijain2305, https://github.com/wconstab
2024-06-13 17:07:27 +00:00
3e3435678c Remove some implications from the static_eval pattern matcher (#128500)
We should be able to remove this as, with the new canonicalisation, we
have that `a < b` and `-a > -b` should be canonicalised to the same
expression (if SymPy does not interfere too much).

nb. I thought this would further cut the compilation time, but I was running
the benchmarks wrong (I was not clearing Triton's cache, oops). It turns out that
after the first PR in this stack, https://github.com/pytorch/pytorch/issues/128398 is fully fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128500
Approved by: https://github.com/ezyang
ghstack dependencies: #128410, #128411
2024-06-13 16:50:00 +00:00
0fdd8d84fa Do not generate -1* in SymPy expressions when canonicalising (#128411)
Partially addresses https://github.com/pytorch/pytorch/issues/128150
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128411
Approved by: https://github.com/ezyang
ghstack dependencies: #128410
2024-06-13 16:49:59 +00:00
bdeb9225b0 Do not call get_implications unnecessarily (#128410)
This should improve compilation times. With this PR and the patch in
the original issue, I get a compilation time of `Compilation time: 307.30 second`.

Fixes https://github.com/pytorch/pytorch/issues/128398
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128410
Approved by: https://github.com/Chillee
2024-06-13 16:49:55 +00:00
cyy
e2a72313e8 Concat namespaces of torch/csrc/profiler code and other fixes (#128606)
Improve namespaces and modernize codebase of torch/csrc/profiler code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128606
Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi
2024-06-13 16:46:34 +00:00
7c370d2fb0 expose set_thread_name to Python and set thread names (#128448)
This adds a new multiprocessing method `_set_thread_name` and calls it from torchelastic and dataloader main functions. This will allow better monitoring of processes as we can separate elastic and dataloading processes from the main training process.

Threads named:

* torchrun/elastic
* PyTorch dataloader worker processes + pin memory thread
* TCPStore
* ProcessGroupNCCL background threads
* WorkerServer httpserver thread

Test plan:

```
$ torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c 'ps -eL | grep pt_'
3264281 3264281 pts/45   00:00:02 pt_elastic
3264281 3267950 pts/45   00:00:00 pt_elastic
```

dataloading

```py
import torch
import time

from torch.utils.data import (
    DataLoader,
    Dataset,
)

class NoopDataset(Dataset):
    def __getitem__(self, index):
        return index

    def __len__(self):
        return 10

dataloader = DataLoader(NoopDataset(), num_workers=2)

for i, x in enumerate(dataloader):
    print(i, x)
    time.sleep(10000)
```

```
$ python3 ~/scripts/dataloader_test.py
$ ps -eL | grep pt_
1228312 1228312 pts/45   00:00:02 pt_main_thread
1228312 1230058 pts/45   00:00:00 pt_main_thread
1228312 1230059 pts/45   00:00:00 pt_main_thread
1230052 1230052 pts/45   00:00:00 pt_data_worker
1230052 1230198 pts/45   00:00:00 pt_data_worker
1230052 1230740 pts/45   00:00:00 pt_data_worker
1230055 1230055 pts/45   00:00:00 pt_data_worker
1230055 1230296 pts/45   00:00:00 pt_data_worker
1230055 1230759 pts/45   00:00:00 pt_data_worker
```
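
The `ps -eL` listing above can be grouped by process and thread name to confirm that each worker got the expected label. A small sketch, assuming the five-column layout shown in this test plan (PID, LWP, TTY, TIME, CMD); `threads_per_name` is a hypothetical helper, not part of this PR.

```python
from collections import Counter


def threads_per_name(ps_output):
    """Count threads per (pid, thread name) in `ps -eL`-style output."""
    counts = Counter()
    for line in ps_output.splitlines():
        fields = line.split()
        if len(fields) != 5:
            continue  # skip blank lines and anything that is not a thread row
        pid, _tid, _tty, _time, name = fields
        counts[(int(pid), name)] += 1
    return counts


sample = """\
1228312 1228312 pts/45   00:00:02 pt_main_thread
1228312 1230058 pts/45   00:00:00 pt_main_thread
1228312 1230059 pts/45   00:00:00 pt_main_thread
1230052 1230052 pts/45   00:00:00 pt_data_worker
1230052 1230198 pts/45   00:00:00 pt_data_worker
1230052 1230740 pts/45   00:00:00 pt_data_worker
1230055 1230055 pts/45   00:00:00 pt_data_worker
1230055 1230296 pts/45   00:00:00 pt_data_worker
1230055 1230759 pts/45   00:00:00 pt_data_worker
"""
counts = threads_per_name(sample)
for (pid, name), n in sorted(counts.items()):
    print(pid, name, n)
```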

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128448
Approved by: https://github.com/c-p-i-o, https://github.com/andrewkho, https://github.com/rsdcastro
2024-06-13 16:38:23 +00:00
b05b8d3989 [EZ][ALI Migration] Add logging for workflow type determination (#128619)
To help figure out what went wrong when the wrong label appears to have been set
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128619
Approved by: https://github.com/zxiiro, https://github.com/clee2000
2024-06-13 16:37:07 +00:00
e9b81e4edf Fakify torch bind input by default (#128454)
Summary: Try a reland of https://github.com/pytorch/pytorch/pull/127116 after some fixes landed

Differential Revision: D58418251

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128454
Approved by: https://github.com/angelayi
2024-06-13 16:25:11 +00:00
c63ccead5e Revert "[dynamo] Enable some inlining inbuilt nn module tests (#128440)"
This reverts commit 1602c7d0c861a4382746ccb18c76d8703a636f4e.

Reverted https://github.com/pytorch/pytorch/pull/128440 on behalf of https://github.com/clee2000 due to new test broke internally D58501220 ([comment](https://github.com/pytorch/pytorch/pull/128440#issuecomment-2166127531))
2024-06-13 16:14:37 +00:00
17b45e905a Fix get output code when caching is enabled (#128445)
Summary: Improve output code retrieval mechanism so that it works in the presence of cache hits.

Test Plan: ci

Differential Revision: D58429602

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128445
Approved by: https://github.com/jansel, https://github.com/eellison, https://github.com/masnesral
2024-06-13 16:00:30 +00:00
93a14aba6e [BE]: Update mypy to 1.10.0 (#127717)
Updates mypy to the latest and greatest.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127717
Approved by: https://github.com/ezyang
2024-06-13 15:57:13 +00:00
49366b2640 Add test to xfail_list only for abi_compatible (#128506)
https://github.com/pytorch/pytorch/pull/126717 will skip the tests in both ABI compatible and non-ABI compatible mode.
It's not expected to skip them in non-ABI compatible mode since they can actually run successfully in such mode but only have issues in ABI compatible mode.

We leverage the existing `xfail_list` for those that will only fail in ABI compatible mode.

- `test_qlinear_add` is already in the `xfail_list`.
- `test_linear_packed` doesn't fail either in my local run (running with `TORCHINDUCTOR_ABI_COMPATIBLE=1`) or in the CI of this PR so I didn't add it into `xfail_list`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128506
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-13 15:32:15 +00:00
cf7adc2fa1 [Inductor] Update Intel GPU Triton commit pin. (#124842)
Update Intel triton for Pytorch 2.4 release.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124842
Approved by: https://github.com/EikanWang
2024-06-13 14:34:37 +00:00
edb45dce85 Add OpInfo entry for as_strided_copy (#127231)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127231
Approved by: https://github.com/lezcano
2024-06-13 13:58:47 +00:00
7cc07a3eb1 [custom_op] stop using nonlocals to store information (#128547)
Fixes https://github.com/pytorch/pytorch/issues/128544
Fixes https://github.com/pytorch/pytorch/issues/128535

We had a problem with multithreading where the nonlocals were being
clobbered. In the first place, we stored these nonlocals because we
wanted to ferry information from an autograd.Function.apply to
autograd.Function.forward.

Our new approach is:
- pass the information directly as an input to the
  autograd.Function.apply. This means that the autograd.Function.forward
  will receive the information too.
- this messes up ctx.needs_input_grad, which has an element per input to
  forward. The user should not see the additional information we passed.
  We fix this by temporarily overriding ctx.needs_input_grad to the
  right thing.
- this exposed a bug in that ctx.needs_input_grad wasn't correct for
  TensorList inputs. This PR fixes that too.

Test Plan:
- existing and new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128547
Approved by: https://github.com/williamwen42, https://github.com/soulitzer
2024-06-13 13:36:39 +00:00
2b9465d62a [aota] Allow some mutations in backward (#128409)
https://github.com/pytorch/pytorch/issues/127572

Allow mutations in backward on forward inputs, if
1/ not mutating metadata
Enforced at compilation time.

2/ if create_graph=True: mutated input does not require_grad
Enforced at runtime, where create_graph mode can be detected by checking torch.is_grad_enabled()

Adding input_joint_info to track mutations of inputs during joint.
Created a separate field in ViewAndMutationMeta as it is filled only after joint fn tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128409
Approved by: https://github.com/bdhirsh
2024-06-13 12:09:08 +00:00
d0c08926d1 allow inlining functions in _python_dispatch and _is_make_fx_tracing (#128485)
This fixes graph breaks in the torch_multimodal_clip benchmark.

Co-authored-by: Animesh Jain <anijain@umich.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128485
Approved by: https://github.com/anijain2305
ghstack dependencies: #128428
2024-06-13 09:56:39 +00:00
1fd2cd26a0 [inductor][cpp] support bf16/fp16 gemm template epilogue fusion (#126545)
As part of #125683, this PR adds epilogue fusion support for bf16/fp16 gemms. The key changes are as follows:
1. bf16 linear w/ epilogue fusion of some ops was originally supported via ATen oneDNN linear pointwise ops. In order to match the ATen op semantics, in-template epilogue support is added to the cpp gemm template so that we would have: "gemm + in-template epilogues -> template buffer". If the template is chosen for codegen, the in-template epilogues will be concatenated with the out-of-template epilogues that are appended during the scheduling.
2. Support bf16/fp16 legalization for `codegen_loop_bodies` which is used to generate the epilogue loops.
3. We used to leverage the in-place buffer mechanism to handle the in-place buffers in the epilogue codegen, in particular, for the reuses for output buffers of GEMM, template and epilogues. This is not correct since the output buffer is an "output" not an "in-place" buffer of the template kernel itself. Now, we use a dedicated "aliases" dict to manage such buffer reuses and the intermediate aliasing buffers are removed after codegen.
4. Add `localize_buffer` method to `LocalBufferScope` to allow the replacement of a global buffer with a local one in the given inductor IR nodes. This helps the fused loops to work on smaller-sized local buffers for better data locality.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126545
Approved by: https://github.com/jansel
2024-06-13 09:46:22 +00:00
c897651392 [inductor] Add BackendFeature gating (#128266)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128266
Approved by: https://github.com/shunting314
2024-06-13 07:31:51 +00:00
88974fedd0 Clean up xpu ut to make CI happy (#128383)
# Motivation
Before #127611 merged, the xpu-specific UT `test/test_xpu.py` was skipped temporarily. This PR aims to fix the UT bug introduced by #127741.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128383
Approved by: https://github.com/EikanWang
2024-06-13 07:06:41 +00:00
ce79b09415 [CUDA][Sparse] Change comparison function of test_sparse_semi_structured.py and bump tolerances for sp24_matmuls (#128553)
Minor tweak of comparison as using `assert` on `torch.allclose` prevents the mismatches from being logged. Also bump a few tolerances that seem to be causing failures on sm86/sm90

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128553
Approved by: https://github.com/jcaip
2024-06-13 06:58:07 +00:00
0678742924 [MPS] Add Metal implementation of exp op (#128421)
To improve accuracy, use `precise::exp()` (and `precise::sin()`/`precise::cos()` for complex flavor)
Reuse `test_exp1` to check that accuracy of `exp` ops is sometimes closer to CPU

Fix bug in non-contiguous tensors handling

Fixes https://github.com/pytorch/pytorch/issues/84936
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128421
Approved by: https://github.com/kulinseth
ghstack dependencies: #128373, #128375
2024-06-13 06:53:17 +00:00
14c9eb5ed2 Add XPU code owners (#128486)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128486
Approved by: https://github.com/atalman, https://github.com/malfet
2024-06-13 06:33:45 +00:00
518c9e6455 Forward fix lint (#128587)
merge at will
After https://github.com/pytorch/pytorch/pull/125968
and https://github.com/pytorch/pytorch/pull/127693
landrace

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128587
Approved by: https://github.com/huydhn
2024-06-13 06:19:03 +00:00
c52eda896e [dynamo][trace_rules] Remove incorrectly classified Ingraph functions (#128428)
Co-authored-by: Laith Sakka <lsakka@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128428
Approved by: https://github.com/yanboliang, https://github.com/mlazos
ghstack dependencies: #126578, #128440, #128470, #128453, #128484
2024-06-13 06:08:56 +00:00
1f6e84fa68 [inductor][mkldnn] Use floats instead of ints for pattern matcher test (#128484)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128484
Approved by: https://github.com/mlazos
ghstack dependencies: #126578, #128440, #128470, #128453
2024-06-13 06:08:56 +00:00
ea541dd965 SymIntify cross_entropy_loss_prob_target numel call (#128141)
This PR replaces call to ```numel``` with ```sym_numel``` in cross_entropy_loss_prob_target.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128141
Approved by: https://github.com/ezyang
2024-06-13 05:37:17 +00:00
ade3d07483 GGML inspired int8 MM Metal shader (#127646)
## Context

This PR ported GGML int8 per channel matrix multiplication and matrix vector multiplication metal shaders into ATen library.
llama.cpp LICENSE: https://github.com/ggerganov/llama.cpp/blob/master/LICENSE

## Key Changes

Made the following changes to the original code:

* Memory layout of weight and scales is different than llama.cpp.
* Weight dequantization (scales multiplication) is done after MM is finished.
* Following PyTorch naming convention (M, K, N and assuming row major).
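
The second change above (scaling after the MM) can be checked with a tiny pure-Python sketch. It assumes one scale per output channel (one per column along N) and row-major nested lists; the names `matmul`, `w_q`, and `scales` are illustrative and are not taken from the Metal shader.

```python
# Illustrative only: int8 weights w_q of shape (K, N), one scale per output column n.
def matmul(a, b):
    """Naive (M, K) x (K, N) matrix product on nested lists."""
    return [[sum(a[m][k] * b[k][n] for k in range(len(b))) for n in range(len(b[0]))]
            for m in range(len(a))]

x = [[1.0, 2.0], [3.0, 4.0]]   # activations, shape (M=2, K=2)
w_q = [[2, -4], [6, 8]]        # quantized weights, shape (K=2, N=2)
scales = [0.5, 0.25]           # assumed: one scale per output channel n

# Option 1: dequantize the weights first, then multiply.
w_deq = [[w_q[k][n] * scales[n] for n in range(2)] for k in range(2)]
dequant_first = matmul(x, w_deq)

# Option 2: multiply with the raw quantized values, then scale each output column once.
raw = matmul(x, w_q)
dequant_after = [[raw[m][n] * scales[n] for n in range(2)] for m in range(2)]
```

Both options give the same result because the scale for column n is a common factor of every term in that column's dot product; deferring it means M*N multiplications by a scale instead of K*N.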

## Benchmark

When M = 1, mv shader improves existing ATen int8mm by 40%.
When M > 4, mm shader outperforms existing ATen int8mm by up to 10x for a large M, as shown below.
![image](https://github.com/pytorch/pytorch/assets/8188269/fd9eff71-c538-4263-a7b5-f96fe479ae9d)

Hence the kernel chooses different shaders based on M.

## Test Plan

Tests are passing:
```
❯ python test/test_mps.py -v -k _int8_
/Users/larryliu/CLionProjects/pytorch/venv/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'dlopen(/Users/larryliu/CLionProjects/pytorch/venv/lib/python3.8/site-packages/torchvision/image.so, 0x0006): Symbol not found: __ZN3c1017RegisterOperatorsD1Ev
  Referenced from: <A770339A-37C9-36B2-84FE-4125FBE26FD6> /Users/larryliu/CLionProjects/pytorch/venv/lib/python3.8/site-packages/torchvision/image.so
  Expected in:     <5749F98A-0A0C-3F89-9CBF-277B3C8EA00A> /Users/larryliu/CLionProjects/pytorch/torch/lib/libtorch_cpu.dylib'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
test__int8_mm_m_1_k_32_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_1_k_32_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_1_k_64_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_1_k_64_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_32_k_32_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_32_k_32_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_32_k_64_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_32_k_64_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_64_k_32_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_64_k_32_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_64_k_64_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_64_k_64_n_64_mps (__main__.TestLinalgMPSMPS) ... ok

----------------------------------------------------------------------
Ran 12 tests in 1.180s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127646
Approved by: https://github.com/malfet
2024-06-13 05:23:56 +00:00
b86b4ace88 Invalidate eager params when inlining and freezing nn modules (#128543)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128543
Approved by: https://github.com/anijain2305
2024-06-13 04:50:17 +00:00
83bb9b7c53 [BE] explicitly export subpackage torch.utils (#128342)
Resolves #126401

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128342
Approved by: https://github.com/Skylion007
ghstack dependencies: #127707
2024-06-13 04:39:16 +00:00
2229884102 Introduce int_oo (#127693)
In a previous life, we used sympy.oo to represent the lower/upper bounds of integer ranges. Later, we changed this to be sys.maxsize - 1 for a few reasons: (1) sometimes we do tests on a value being exactly sys.maxsize, and we wanted to avoid a data dependent guard in this case, (2) sympy.oo corresponds to floating point infinity, so you get incorrect types for value ranges with oo, and (3) you can do slightly better reasoning if you assume that input sizes fall within representable 64-bit integer range.

After working in the sys.maxsize regime for a bit, I've concluded that this was actually a bad idea. Specifically, the problem is that you end up with sys.maxsize in your upper bound, and then whenever you do any sort of size-increasing computation like size * 2, you end up with 2 * sys.maxsize, and you end up doing a ton of arbitrary precision int computation that is totally unnecessary. A symbolic bound is better.

But especially after #126905, we can't go back to using sympy.oo, because that advertises that it's not an integer, and now your ValueRanges is typed incorrectly. So what do we do? We define a new numeric constant `int_oo`, which is like `sympy.oo` but it advertises `is_integer`. **test/test_sympy_utils.py** describes some basic properties of the number, and **torch/utils/_sympy/numbers.py** has the actual implementation.
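
A rough sketch of the idea, in plain Python rather than sympy. The class `IntInfinity` is hypothetical and is not the implementation in torch/utils/_sympy/numbers.py; it only illustrates a bound that compares above every ordinary integer, reports itself as an integer, and stays symbolic under size-increasing arithmetic.

```python
import functools
import sys

@functools.total_ordering
class IntInfinity:
    """Hypothetical stand-in for an integer-typed +infinity (not the sympy-based one)."""
    is_integer = True

    def __eq__(self, other):
        return isinstance(other, IntInfinity)

    def __hash__(self):
        return hash("IntInfinity")

    def __gt__(self, other):
        # Larger than every ordinary int, equal only to itself.
        return not isinstance(other, IntInfinity)

    def __mul__(self, other):
        # Scaling by a positive int leaves the bound unchanged.
        if isinstance(other, int) and other > 0:
            return self
        return NotImplemented

    __rmul__ = __mul__

upper_old = sys.maxsize - 1   # concrete bound: grows past 64 bits under arithmetic
upper_new = IntInfinity()     # symbolic bound: stays symbolic
```

With the concrete bound, `upper_old * 2` is already an arbitrary-precision integer; with the symbolic one, `upper_new * 2` is still the same object.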

The rest of the changes of the PR are working out the implications of this change. I'll give more commentary as inline comments.

Fixes https://github.com/pytorch/pytorch/issues/127396

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127693
Approved by: https://github.com/lezcano
ghstack dependencies: #126905
2024-06-13 04:08:20 +00:00
d3b8230639 Fix profiler_kineto Clang errors (#128464)
Summary: There are clang errors in profiler_kineto. It would probably be a good idea to fix them as the file is already quite dense.

Test Plan: Make sure that on Phabricator all tests under static_tests/lint_root pass

Differential Revision: D58431005

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128464
Approved by: https://github.com/aaronenyeshi
2024-06-13 03:10:50 +00:00
d630e1e838 Revert "[dynamo][yolov3] Track UnspecializedNNModuleVariable for mutation (#128269)"
This reverts commit f2d7f235a684c593f5a1ff2ca0b47b47274bfe85.

Reverted https://github.com/pytorch/pytorch/pull/128269 on behalf of https://github.com/anijain2305 due to incorrect ([comment](https://github.com/pytorch/pytorch/pull/128269#issuecomment-2164267320))
2024-06-13 03:04:26 +00:00
7fe9ab9ccc update amp example to device-agnostic (#127278)
As support for Intel GPU has been upstreamed, this PR is to make the AMP example doc device-agnostic.

Co-authored-by: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127278
Approved by: https://github.com/dvrogozh, https://github.com/EikanWang, https://github.com/svekars
2024-06-13 02:01:16 +00:00
cyy
3f9b8446cf [8/N] Remove unused functions (#128499)
Follows #128407

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128499
Approved by: https://github.com/malfet
2024-06-13 01:15:11 +00:00
ede74940a1 optimize vec isa check dispatch logical. (#128320)
Optimize the CPU vec ISA check dispatch by architecture; this makes the code easier to read and maintain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128320
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-13 01:06:34 +00:00
c1cd946818 [cond] add a set_ and data mutation expected failure test (#128457)
A follow up of the discussion in https://github.com/pytorch/pytorch/pull/126936.

Cond errors out early because of a graph break triggered by DelayGraphBreakVariable, which is created due to `aten.set_` [here](https://github.com/pytorch/pytorch/blob/main/torch/_dynamo/variables/tensor.py#L366-L376).

We might need to see what happens to this test if we allow graph breaks in higher order ops.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128457
Approved by: https://github.com/zou3519
2024-06-13 00:16:59 +00:00
c472cec565 [checkpoint] Clean up selective activation checkpoint and make public (#125795)
Related doc: https://docs.google.com/document/d/1BKyizkZPdri9mHqdDOLAUpkI7SbbKfLHRFVVpK9ZWqo/edit

Memory considerations:
- As with the existing SAC, cached values are cleared upon first use.
- We error if the user wishes to backward a second time on a region forwarded with SAC enabled.

In-place:
- We use version counting to detect whether any cached tensor has been mutated, and error if so. In-place operations that do not mutate cached tensors are allowed.
- `allow_cache_entry_mutation=True` can be passed to disable this check (useful in the case of auto AC where the user cleverly also saves the output of the in-place operation)

Randomness, views
- Currently in this PR, we don't do anything special for randomness or views; the author of the policy function is expected to handle them properly. (Would it be beneficial to error? We either want to save all or recompute all random tensors.)

Tensor object preservation
- We guarantee that if a tensor does not require grad, and it is saved, then what you get out is the same tensor object. If the tensor does require grad, we must detach to avoid creating a reference cycle. This is a nice guarantee for nested tensors which care about the object identity of the offsets tensor.

Policy function
- Enum values are `{MUST,PREFER}_{SAVE,RECOMPUTE}` (bikeshed welcome). Alternatively there was `{SAVE,RECOMPUTE}_{NON_,}OVERRIDABLE`. The former was preferred bc it seemed clearer that two `MUST` clashing should error, versus it is ambiguous whether two `NON_OVERRIDABLE` being stacked should silently ignore or error.
- The usage of Enum today. There actually is NO API to stack SAC policies today. The only thing the Enum should matter for in the near term is the compiler. The stacking SAC policy would be useful if someone wants to implement something like simple FSDP, but it is not perfect because with a policy of `PREFER_SAVE` you are actually saving more than autograd would save normally (would be fixed with AC v3).
- The number of times we call the policy_fn is a documented part of the public API. We call the policy function for all ops except detach because detach is itself called a different number of times by AC between forward and recompute.
- The policy function can be a stateful object (we do NOT make separate copies of this object for forward/recompute, the user is expected to handle that via is_recompute see below).
Tensors guaranteed to be the same tensor as-is
- The policy function signature takes a ctx object as its first argument. The ctx is an object encapsulating info that may be useful to the user; it currently only holds "is_recompute". Adding this indirection gives us flexibility to add more attrs later if necessary.
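
The policy-function points above can be sketched in plain Python. Everything here is hypothetical and for illustration: the `Policy` enum only mirrors the `{MUST,PREFER}_{SAVE,RECOMPUTE}` names, `Ctx` models a context that holds nothing but `is_recompute`, and the policy takes an op name as a string as a simplifying assumption.

```python
from dataclasses import dataclass
from enum import Enum, auto

class Policy(Enum):
    """Hypothetical mirror of the {MUST,PREFER}_{SAVE,RECOMPUTE} values."""
    MUST_SAVE = auto()
    PREFER_SAVE = auto()
    MUST_RECOMPUTE = auto()
    PREFER_RECOMPUTE = auto()

@dataclass
class Ctx:
    """Hypothetical context object; per the text it currently only holds is_recompute."""
    is_recompute: bool

class CountingPolicy:
    """Stateful policy: the same object sees both the forward and the recompute pass."""
    def __init__(self, ops_to_save):
        self.ops_to_save = set(ops_to_save)
        self.calls = {"forward": 0, "recompute": 0}

    def __call__(self, ctx, op_name):
        # No separate copy is made per pass, so the policy tells them apart via ctx.
        self.calls["recompute" if ctx.is_recompute else "forward"] += 1
        return Policy.MUST_SAVE if op_name in self.ops_to_save else Policy.PREFER_RECOMPUTE

policy = CountingPolicy({"mm"})
forward_decisions = [policy(Ctx(is_recompute=False), op) for op in ("mm", "relu")]
recompute_decisions = [policy(Ctx(is_recompute=True), op) for op in ("mm", "relu")]
```

The policy is called the same number of times in both passes and returns the same decisions, which is the property a stateful policy has to preserve itself.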

"bc-breaking" for existing users of the private API:
- Existing policy functions must now change their return value to use the Enum.
- Existing calls to `_pt2_selective_checkpoint_context_fn_gen` must be renamed to `gen_selective_checkpoint_context_fn`. The way you use the API remains the same. It would've been nice to do something different (not make the user have to use functools.partial?), but this was the easiest to compile (idk if this should actually be a constraint).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125795
Approved by: https://github.com/Chillee, https://github.com/fmassa
2024-06-12 23:57:33 +00:00
25b7537a27 doc comment typo fixes and improvements (#128512)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128512
Approved by: https://github.com/LucasLLC
2024-06-12 23:55:09 +00:00
eb1db6702f [2nd try][AOTI] Switch to use shim v2 (#128521)
Test Plan: Sandcastle

Differential Revision: D58470269

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128521
Approved by: https://github.com/desertfire
2024-06-12 23:44:24 +00:00
4423e1bbdc [release] Increase version 2.4.0->2.5.0 (#128514)
Same as https://github.com/pytorch/pytorch/pull/121974
Branch cut for 2.4.0 completed hence advance main version to 2.5.0

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128514
Approved by: https://github.com/malfet
2024-06-12 23:40:01 +00:00
3bc2004f91 [ts_converter] Fix prim::dtype (#128517)
Summary: prim::dtype has the signature `(Tensor a) -> int`, where it gets the dtype of the tensor and returns the integer corresponding to this dtype based on the enum in ScalarType.h. Previously we were converting prim::dtype by returning the actual dtype of the tensor (ex. torch.float32). This causes some incorrect control flow behavior, specifically where it checks if `prim::dtype(tensor) in [3, 5, 7]`, where [3, 5, 7] correspond to torch.int32, torch.float16, torch.float64. This control flow would always return False because we would be comparing torch.float32 against the integers [3, 5, 7], which is a type mismatch.
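
The type mismatch can be reproduced without torch. In this sketch `FakeDtype` is a hypothetical stand-in for a dtype object, the two `convert_dtype_*` helpers are made-up names for the old and new behavior, and the integer codes are the ones named above plus 6 for float32 (an assumption based on the ScalarType enum order).

```python
# Integer codes for a few dtypes; 3, 5, 7 come from the text, 6 is assumed from ScalarType.h.
SCALAR_TYPE_CODE = {"int32": 3, "float16": 5, "float32": 6, "float64": 7}

class FakeDtype:
    """Hypothetical stand-in for a dtype object such as torch.float64."""
    def __init__(self, name):
        self.name = name

def convert_dtype_old(dtype):
    # Old conversion: hand back the dtype object itself.
    return dtype

def convert_dtype_new(dtype):
    # New conversion: return the integer code, matching `(Tensor a) -> int`.
    return SCALAR_TYPE_CODE[dtype.name]

double = FakeDtype("float64")
old_check = convert_dtype_old(double) in [3, 5, 7]   # object vs ints: never matches
new_check = convert_dtype_new(double) in [3, 5, 7]   # 7 is in the list
```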

Test Plan: 7/22 internal models are now convertible and runnable in eager and sigmoid! P1410243909

Reviewed By: jiashenC

Differential Revision: D58469232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128517
Approved by: https://github.com/jiashenC
2024-06-12 23:02:50 +00:00
2fa6f80b13 Perform reciprocal optimization with foreach_div (#128433)
Fixes https://github.com/pytorch/pytorch/issues/114165

Internal xref
https://fb.workplace.com/groups/1144215345733672/posts/2801223606699496/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128433
Approved by: https://github.com/awgu
2024-06-12 22:57:03 +00:00
8db4a41973 Use computeStorageNbytesContiguous if possible (#128515)
```at::detail::computeStorageNbytesContiguous``` does fewer data-dependent tests compared to ```at::detail::computeStorageNbytes```. Therefore, use of the former is more likely to succeed with dynamic shapes. This PR checks is_contiguous and dispatches to the appropriate function. This should be helpful in unblocking aot_eager for torchrec. As an aside, this is an alternative solution to the unsound solution I had first proposed in another [PR](#128141).
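
A sketch of the arithmetic behind the two functions, written in Python under the assumption that the contiguous variant only needs the element count while the general variant has to inspect each size and stride. The function names below are illustrative; overflow checks are omitted.

```python
import math

def storage_nbytes_contiguous(sizes, itemsize, storage_offset=0):
    """Bytes for a contiguous tensor: depends only on the element count."""
    return itemsize * (storage_offset + math.prod(sizes))

def storage_nbytes_strided(sizes, strides, itemsize, storage_offset=0):
    """Bytes for an arbitrary strided tensor: must look at every dimension."""
    if any(size == 0 for size in sizes):   # a data-dependent test per dimension
        return 0
    last_index = sum((size - 1) * stride for size, stride in zip(sizes, strides))
    return itemsize * (storage_offset + last_index + 1)

contiguous = storage_nbytes_contiguous((2, 3), itemsize=4)
same_via_strides = storage_nbytes_strided((2, 3), (3, 1), itemsize=4)
padded = storage_nbytes_strided((2, 3), (6, 2), itemsize=4)
```

For a contiguous layout both formulas agree, so the cheaper one can be used whenever contiguity is already known.
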
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128515
Approved by: https://github.com/ezyang
2024-06-12 22:53:06 +00:00
e2610240f9 [ROCm] Enable several inductor UTs (#127761)
Needs https://github.com/pytorch/pytorch/pull/125396

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127761
Approved by: https://github.com/peterbell10, https://github.com/pruthvistony
2024-06-12 22:47:45 +00:00
bb3cf8a339 Lift inductor lowerings for jagged <-> padded dense kernels (#125968)
This PR lifts internal lowerings written for FBGEMM kernels that do jagged <-> padded dense conversions. In particular, this PR provides lowerings and meta registrations for the following ATen ops:
* `_jagged_to_padded_dense_forward()`
* `_padded_dense_to_jagged_forward()`
    * NB: if `total_L` is not provided, the output shape is data-dependent. An unbacked SymInt is used for this case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125968
Approved by: https://github.com/davidberard98
2024-06-12 22:46:09 +00:00
b4a7b543e5 Add targeted unit tests for guards-related functions used in the codecache (#128482)
Summary: Add a few unit tests that exercise `produce_guards_expression` and `evaluate_guards_expression` (and specifically "ToFloat" "FloatTrueDiv" added in https://github.com/pytorch/pytorch/pull/128418)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128482
Approved by: https://github.com/ezyang
ghstack dependencies: #128418
2024-06-12 22:41:50 +00:00
1f302d6885 Support aten operations with out tensor (#124926)
This PR intends to support the aten operations with the `out` tensor.

Currently, the AOT compile always does **NOT** keep input tensor mutations. According to the comments, this is because it has not encountered such a use case.
> For now there's no use case involving keeping input mutations in the graph (which we can only do in the inference case anyway). We can add this later if we need to.

However, for aten operations, it is popular that the `out` tensor is an input parameter and needs to be mutated. This PR intends to support it by adding a `keep_inference_input_mutations` flag to `aot_inductor.keep_inference_input_mutations`. This flag can provide flexibility to the callee in deciding whether the AOT compile needs to keep input tensor mutations in the graph.

Take `clamp` as an example as follows.
```python
out_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(-2.0)
inp_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(1.0)
min_tensor = inp_tensor - 0.05
max_tensor = inp_tensor + 0.05
torch.clamp(input=inp_tensor, min=min_tensor, max=max_tensor, out=out_tensor)
```

W/O this PR
```python
def forward(self):
    arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";

    arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
    clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1);  arg0_1 = arg1_1 = None
    clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1);  clamp_min = arg2_1 = None
    return (clamp_max, clamp_max)
```

W/ this PR
```python
def forward(self):
    arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";

    arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
    clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1);  arg0_1 = arg1_1 = None
    clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1);  clamp_min = arg2_1 = None
    copy_: "f32[128]" = torch.ops.aten.copy_.default(arg3_1, clamp_max);  arg3_1 = clamp_max = None
    return (copy_,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124926
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/angelayi
2024-06-12 22:31:59 +00:00
f4edd67fe7 [c10d] fix OSS commSplit bug (#128459)
Summary:
D56907877 modified OSS commSplit. However, commSplit requires every rank to call it, even a rank that is no-color. ncclCommSplit will not create a communicator for no-color ranks, hence this line of code can throw an error like `NCCL WARN CommUserRank : comm argument is NULL`

Revert this change from D56907877

Test Plan: CI

Differential Revision: D58436088

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128459
Approved by: https://github.com/shuqiangzhang
2024-06-12 22:29:01 +00:00
f39ab8a0fe Fix side effect pruning (#128028)
Summary:
The previous side effect pruning algorithm would keep many dead cell
variables alive. For example, in
https://github.com/pytorch/pytorch/issues/125078, the compiled function
has one return but there were three in the Dynamo graph due to two
dead cell variables not being pruned away.

This PR adds a corrected algorithm. "new cell variables" are alive if
they can be reached from one of the following:
1. any of the tx.symbolic_locals or tx.stack (that is, if they are
   involved in a return from the function or intermediate variable
   during a graph break). Example: an alive NestedUserFunctionVariable
2. "mutations to pre-existing objects". Example: appending a
   NestedUserFunctionVariable to a global list

The new algorithm reflects this, but please let me know if there are
more cases to handle.
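
The liveness rule above amounts to a reachability search from two kinds of roots. The sketch below is a plain-Python illustration on a made-up object graph; `live_cells`, the node names, and the edge dictionary are hypothetical and do not correspond to Dynamo's data structures.

```python
def live_cells(edges, roots):
    """Return every node reachable from the roots by following references."""
    live, stack = set(), list(roots)
    while stack:
        node = stack.pop()
        if node in live:
            continue
        live.add(node)
        stack.extend(edges.get(node, ()))
    return live

# Hypothetical object graph: a returned closure references cell_a; a mutated
# global list holds a function that references cell_b; nothing reachable from
# the roots references cell_c or cell_d.
edges = {
    "returned_fn": ["cell_a"],
    "mutated_global_list": ["appended_fn"],
    "appended_fn": ["cell_b"],
    "cell_c": ["cell_d"],
}
# Roots: case 1 (symbolic locals / stack) and case 2 (mutated pre-existing objects).
roots = ["returned_fn", "mutated_global_list"]
new_cells = {"cell_a", "cell_b", "cell_c", "cell_d"}
alive = new_cells & live_cells(edges, roots)
```

Here `cell_c` and `cell_d` reference each other's chain but are unreachable from either root, so they are pruned.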

Test Plan:
- existing tests (afaict, test/dynamo/test_python_autograd is the best
  SideEffects test case we have)
- see in test/dynamo/test_higher_order_ops that the expecttests changed
  -- the functorch dynamo graphs no longer return dead cellvars.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128028
Approved by: https://github.com/jansel
2024-06-12 22:25:37 +00:00
cyy
3008644297 [Caffe2] Remove remaining unused perfkernels (#128477)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128477
Approved by: https://github.com/ezyang, https://github.com/r-barnes
2024-06-12 22:19:36 +00:00
55a6b38f52 [inductor] enable fx graph cache on torchbench (#128239)
Summary: We've already enabled for timm and huggingface, but we had failures saving cache entries for moco. It looks like https://github.com/pytorch/pytorch/pull/128052 has fixed that issue, so we can enable for torchbench.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128239
Approved by: https://github.com/oulgen
2024-06-12 22:15:02 +00:00
6206da55ef Fix lint after #119459 (#128558)
TSIA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128558
Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/malfet
2024-06-12 22:11:37 +00:00
2b28b107db [dynamo][fsdp] Dont take unspecializedNNModuleVariable path for FSDP modules (#128453)
Co-authored-by: Laith Sakka <lsakka@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128453
Approved by: https://github.com/yf225
ghstack dependencies: #126578, #128440, #128470
2024-06-12 22:03:45 +00:00
6aef2052ea Save backward graphs lazily to cache (#126999)
This PR makes it so we lazily save to the cache on backward call instead of saving ahead of time always. We have to pass a closure to post_compile to prevent cyclic dependencies.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126999
Approved by: https://github.com/bdhirsh
ghstack dependencies: #126791
2024-06-12 21:58:34 +00:00
87072dcfdb Change Dynamo's custom ops warning message to be less spammy (#128456)
This is a short-term fix (for 2.4). In the longer term we should
fix https://github.com/pytorch/pytorch/issues/128430

The problem is that warnings.warn that are inside Dynamo print
all the time. Python warnings are supposed to print once, unless their
cache is reset: Dynamo ends up resetting that cache every time it runs.

As a workaround we provide our own warn_once cache that is keyed on the
warning msg. I am not worried about this increasing memory usage because
that's effectively what python's warnings.warn cache does.
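
A minimal sketch of such a cache, assuming a module-level set keyed on the message string. The names `warn_once` and `_warned_messages` here are illustrative and may differ from the helper added in the PR.

```python
import warnings

_warned_messages = set()

def warn_once(msg, category=UserWarning):
    """Emit `msg` at most once per process, even if the warnings registry is reset."""
    if msg in _warned_messages:
        return False
    _warned_messages.add(msg)
    warnings.warn(msg, category, stacklevel=2)
    return True
```

Because the set lives outside Python's own warning registry, resetting that registry does not cause the message to be printed again. Memory grows by one string per distinct message.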

Test Plan:
- fix tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128456
Approved by: https://github.com/anijain2305
2024-06-12 21:57:12 +00:00
c53d65b3d3 [inductor] fix linear add bias pattern (#128473)
Fix https://github.com/pytorch/pytorch/issues/128287.
Previously, the assertions in `linear_add_bias` were fragile:
```
assert packed_weight_node.name == "_reorder_linear_weight"
assert transpose_weight_node.name == "permute_default"
```
because the `name` can be changed to `_reorder_linear_weight_id, permute_default_id` if we have more than 1 reorder/permute.

Checking `target` instead of `name` solves this issue.

The UT is also updated to match more than one `linear_add_bias` pattern to cover this case.
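
The difference between the two checks can be shown with a toy node type. `Node` below is a hypothetical stand-in with only `name` and `target` fields, and the `_1` suffix is an assumed example of how a second node of the same op gets a distinct name.

```python
from dataclasses import dataclass

@dataclass
class Node:
    """Hypothetical stand-in for a graph node: `name` is unique, `target` is the op."""
    name: str
    target: str

# With two permutes in one graph, the second node's name carries a suffix.
nodes = [
    Node("permute_default", "aten.permute.default"),
    Node("permute_default_1", "aten.permute.default"),
]

matched_by_name = [n for n in nodes if n.name == "permute_default"]
matched_by_target = [n for n in nodes if n.target == "aten.permute.default"]
```

Matching on `name` finds only the first permute, while matching on `target` finds both.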

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128473
Approved by: https://github.com/jgong5
2024-06-12 21:55:35 +00:00
bb13fad7aa Share TCPStore by default when using c10d rdzv handler (#128096)
Summary:
A number of features rely on the TCP store as a control plane. By default the TCPStore server is started on the rank0 trainer, and this can create a race condition when rank0 exits (on error or graceful exit) and any other ranks reading/writing will fail.

Solution: The TCPStore server should outlive all the trainer processes. Moving ownership of the TCPStore to the torchelastic agent naturally fixes the lifecycle of the server.

Static rendezvous in torchelastic does already support sharing of the TCPStore server. We are extending this to more commonly used c10d rendezvous handler.

Any handler that would like to manage the TCP store has to:
- Return true on `use_agent_store` property
- `RendezvousInfo`.`RendezvousStoreInfo`#[`master_addr/master_port`] values refer to managed TCPStore (those are returned on `next_rendezvous` call)

Note: in some instances users may want to use non-TCPStore based stores for the torchelastic rendezvous process, so the handler will need to create and hold a reference to TCPStore (as done in this change)

Test Plan:
`cat ~/workspace/dist-demo/stores.py`
~~~
import torch
import logging
import sys
import torch.distributed as dist
import torch

import os
import time

logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler(sys.stderr))
logger.setLevel(logging.INFO)

def _run_test(store):

    if dist.get_rank() == 1:
        logger.info("Rank %s is sleeping", dist.get_rank())
        time.sleep(5)
        key = "lookup_key"
        logger.info("Checking key %s in store on rank %s", key, dist.get_rank())
        store.check([key])
    else:
        logger.info("rank %s done", dist.get_rank())

def main() -> None:
    use_gpu = torch.cuda.is_available()
    dist.init_process_group(backend="nccl" if use_gpu else "gloo")
    dist.barrier()

    logger.info(f"Hello World from rank {dist.get_rank()}")

    host = os.environ['MASTER_ADDR']
    port = os.environ['MASTER_PORT']
    world_size = os.environ['WORLD_SIZE']

    logger.info("testing TCPStore")
    store = dist.TCPStore(
        host_name=host, port=int(port), world_size=int(world_size),
    )
    _run_test(store)

if __name__ == "__main__":
    main()
~~~

With the fix (TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 or just drop the option)
~~~
(pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 python -m torch.distributed.run --rdzv-backend c10d --nproc-per-node 3 ~/workspace/dist-demo/stores.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Hello World from rank 1
Hello World from rank 2
Hello World from rank 0
testing TCPStore
testing TCPStore
testing TCPStore
rank 2 done
Rank 1 is sleeping
rank 0 done
Checking key lookup_key in store on rank 1
~~~

TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1
~~~
(pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 python -m torch.distributed.run --rdzv-backend c10d --nproc-per-node 3 ~/workspace/dist-demo/stores.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************

Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Hello World from rank 0
Hello World from rank 2
Hello World from rank 1
testing TCPStore
testing TCPStore
testing TCPStore
rank 0 done
rank 2 done
Rank 1 is sleeping
Checking key lookup_key in store on rank 1
[rank1]: Traceback (most recent call last):
[rank1]:   File "/home/kurman/workspace/dist-demo/stores.py", line 46, in <module>
[rank1]:     main()
[rank1]:   File "/home/kurman/workspace/dist-demo/stores.py", line 42, in main
[rank1]:     _run_test(store)
[rank1]:   File "/home/kurman/workspace/dist-demo/stores.py", line 22, in _run_test
[rank1]:     store.check([key])
[rank1]: torch.distributed.DistNetworkError: Connection reset by peer
E0605 17:40:22.853277 140249136719680 torch/distributed/elastic/multiprocessing/api.py:832] failed (exitcode: 1) local_rank: 1 (pid: 2279237) of binary: /home/kurman/.conda/envs/pytorch_38/bin/python
Traceback (most recent call last):
  File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/users/kurman/pytorch/torch/distributed/run.py", line 904, in <module>
    main()
  File "/data/users/kurman/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/data/users/kurman/pytorch/torch/distributed/run.py", line 900, in main
    run(args)
  File "/data/users/kurman/pytorch/torch/distributed/run.py", line 891, in run
    elastic_launch(
  File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 263, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/kurman/workspace/dist-demo/stores.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-06-05_17:40:22
  host      : devgpu011.cln5.facebook.com
  rank      : 1 (local_rank: 1)
  exitcode  : 1 (pid: 2279237)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
~~~

Differential Revision: D58180193

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128096
Approved by: https://github.com/shuqiangzhang
2024-06-12 21:49:42 +00:00
c0ea8fc3a3 Disable inlining nn modules on static inputs tests (#128529)
With inlining NN modules, these tests no longer raise runtime errors, because changing static ptrs induces a rerecording instead of a runtime error. The solution is to run the tests with inlining disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128529
Approved by: https://github.com/anijain2305
ghstack dependencies: #128528
2024-06-12 21:40:29 +00:00
ff3ba99320 Disable inline nn modules on unstable ptr test (#128528)
With inlining NN modules, these tests no longer raise runtime errors, because changing static ptrs induces a rerecording instead of a runtime error. The solution is to run the tests with inlining disabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128528
Approved by: https://github.com/anijain2305
2024-06-12 21:40:29 +00:00
1026b7cfbe Add docstring for the torch.typename function (#128129)
Fixes: #127885

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128129
Approved by: https://github.com/malfet
2024-06-12 21:34:20 +00:00
cba840fde9 Fix accidental variable shadow (#128460)
Fixes #128322

We should probably crank up clang's warning levels...

Test:
```
import torch

def addmv_slice(input, mat, vec, slice_op):
    vec = vec[slice_op]
    res = torch.addmv(input, mat, vec)  # traced line: 25
    return res

torch._dynamo.reset()
model_opt = torch.compile(addmv_slice)

input = torch.empty(size=[11]).uniform_(-1, 1)
mat = torch.empty([11, 128]).uniform_(-10.0, 20.0)

vec = torch.empty([256]).uniform_(-10.0, 20.0)
slice_op = slice(None, None, 2)
out = model_opt(input, mat, vec, slice_op)

vec = torch.empty([384]).uniform_(-10.0, 20.0)
slice_op = slice(None, None, 3)
out = model_opt(input, mat, vec, slice_op)
```
Before this change, the test fails with:
```
torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in function getitem>(*(FakeTensor(..., size=(s0,)), slice(None, None, s1)), **{}):
slice step cannot be zero
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128460
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2024-06-12 21:14:04 +00:00
0444e89931 [export] Remove replace_sym_size_ops_pass (#128443)
Summary: Not needed anymore.

Test Plan: CI

Differential Revision: D58429458

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128443
Approved by: https://github.com/angelayi
2024-06-12 21:03:06 +00:00
67e6c76a18 Support apply_(callable) sugar for CPU NJTs (#125416)
Example:
```python
nt.apply_(lambda x: x * 2)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125416
Approved by: https://github.com/soulitzer
2024-06-12 20:30:57 +00:00
dd143d44cc [BE] enable UFMT for top-level files torch/*.py (#127707)
Part of #123062

- #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127707
Approved by: https://github.com/ezyang
2024-06-12 20:15:05 +00:00
cc231a8e2b First version of AOTAutogradCache (#126791)
This PR implements "V0" of AOTAutogradCache. Given an input to AOTAutograd, we calculate a cache key, then save an AOTAutogradCacheEntry.
Each AOTAutogradCacheEntry has:
- A CompiledForward and optionally a CompiledBackward
- A bunch of metadata.

CompiledForward and CompiledBackward each save the *key* to the FXGraphCache associated with the compiled object. FXGraphCache populates this key field as long as it's able to return a compiled graph given a set of inputs. We then load the same object from the FXGraphCache on an AOTAutogradCache hit.

On cache miss:
- Run AOTAutograd, up to AOTAutogradDispatch.post_compile.
- Save an AOTAutogradCacheEntry to the cache after compiling the necessary portions and receiving a cache key from FXGraphCache. In this we *always* compile the backwards ahead of time. The PR above this one implements backward lazy caching, so that we only save to the cache after compiling the backward in a lazy backward scenario.
- Return the resulting object

On cache hit:
- Run AOTAutogradCacheEntry.post_compile() on the cache key.
- This attempts to load the forward and backward graphs from FXGraphCache
- As long as we successfully load from FXGraphCache, it's a hit. We then rewrap the callable with post compile wrappers using our saved metadata.
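
The miss and hit paths above can be sketched as a two-level cache. This is a toy model under stated assumptions, not the PR's code: `ToyCacheEntry`, `ToyGraphCache`, `ToyOuterCache`, and `lookup` are hypothetical names, plain callables stand in for compiled graphs, and the step that rewraps the callable with post compile wrappers is left out.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, Optional, Tuple


@dataclass
class ToyCacheEntry:
    # The entry stores *keys* into the inner cache, not the graphs themselves.
    forward_key: str
    backward_key: Optional[str]
    metadata: Dict[str, Any] = field(default_factory=dict)


class ToyGraphCache:
    """Hypothetical stand-in for the inner (FXGraphCache-like) cache."""

    def __init__(self) -> None:
        self.graphs: Dict[str, Callable] = {}

    def save(self, key: str, graph: Callable) -> str:
        self.graphs[key] = graph
        return key

    def load(self, key: str) -> Optional[Callable]:
        return self.graphs.get(key)


class ToyOuterCache:
    """Hypothetical stand-in for the outer (AOTAutogradCache-like) cache."""

    def __init__(self, graph_cache: ToyGraphCache) -> None:
        self.graph_cache = graph_cache
        self.entries: Dict[str, ToyCacheEntry] = {}
        self.hits = 0
        self.misses = 0

    def lookup(
        self, key: str, compile_fn: Callable[[], Tuple[Callable, Optional[Callable], dict]]
    ) -> Tuple[Callable, Optional[Callable]]:
        entry = self.entries.get(key)
        if entry is not None:
            fwd = self.graph_cache.load(entry.forward_key)
            bwd = None
            bwd_ok = True
            if entry.backward_key is not None:
                bwd = self.graph_cache.load(entry.backward_key)
                bwd_ok = bwd is not None
            # It only counts as a hit if every referenced graph loads.
            if fwd is not None and bwd_ok:
                self.hits += 1
                return fwd, bwd
        # Miss: compile, store the graphs in the inner cache, then save an entry
        # that records the keys the inner cache handed back.
        self.misses += 1
        fwd, bwd, metadata = compile_fn()
        fwd_key = self.graph_cache.save(key + ":fwd", fwd)
        bwd_key = self.graph_cache.save(key + ":bwd", bwd) if bwd is not None else None
        self.entries[key] = ToyCacheEntry(fwd_key, bwd_key, metadata)
        return fwd, bwd
```

Storing keys rather than graphs means an outer entry is only as good as the inner cache: if the inner cache has dropped a graph, the lookup falls back to the miss path and recompiles.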

For now, we ignore the fakified out and debug wrappers. We only save to the cache if Fakified out is turned off.

V0 Guards behavior:
FXGraphCache serializes guards that are needed in the shape_env based on the symint inputs to the graph. The invariant that AOTAutograd uses here is that the sources for symints given to it by dynamo are exactly the same as the ones it passes to inductor, for both the forward and backward passes. (This does *not* mean that the tensor values passed in are the same: only that their symints are). That is, AOTAutograd and Inductor never create new guards based on symints with *different sources* than those passed to it by inductor.

We don't currently store any AOTAutograd specific guards: my hypothesis is that FXGraphCache already stores these, as any guards generated by AOTAutograd should already be in the shape_env before calling into inductor, and we don't generate new guards post inductor. If this is needed, I'll add it in another diff.

Testing:
We'll start with some basic unit tests, but I'll be adding more and more complicated testing as the next step.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126791
Approved by: https://github.com/bdhirsh
2024-06-12 20:04:44 +00:00
7775fee10f [tp] refactor and fix PrepareModuleInput for DTensor inputs (#128431)
As titled, this PR refactors the PrepareModuleInput style to have a common
method, prepare_input_arg, so that both args and kwargs can reuse this logic.

This also fixes https://github.com/pytorch/pytorch/issues/128365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128431
Approved by: https://github.com/awgu
2024-06-12 19:16:33 +00:00
ec1fdda196 Fix jagged NT softmax semantics (#119459)
Before: `softmax` definition uses `jagged_unary_pointwise()` (wrong)
After: `softmax` impl adjusts the `dim` arg to account for the difference in dimensionality between the outer NT and the NT's `_values`
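
A minimal sketch of the kind of `dim` adjustment described in the "After" line, assuming a jagged layout in which an outer NT of shape `(B, j, D, ...)` is backed by `_values` of shape `(sum(j), D, ...)`, so the batch and ragged dims collapse into a single leading dim. The function name is hypothetical, and rejecting the batch and ragged dims is an assumption of this sketch rather than something the commit message states.

```python
def adjust_dim_for_values(outer_ndim: int, dim: int) -> int:
    """Map a dim of the outer NT to the matching dim of its values buffer."""
    if not -outer_ndim <= dim < outer_ndim:
        raise IndexError(f"dim {dim} out of range for {outer_ndim} dims")
    wrapped = dim % outer_ndim  # normalize negative dims
    if wrapped < 2:
        # Assumption: the batch dim (0) and the ragged dim (1) have no single
        # counterpart in the values buffer, so this sketch rejects them.
        raise ValueError("softmax over the batch or ragged dim is not handled here")
    # The values buffer has one dim fewer than the outer NT.
    return wrapped - 1
```

For an outer NT with 3 dims, `dim=-1` and `dim=2` both map to dim 1 of the values buffer.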
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119459
Approved by: https://github.com/soulitzer
2024-06-12 19:12:03 +00:00
817ce6835b Revert "[cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343)"
This reverts commit 4c971932e839fc5da2b91906ad028d4654932bca.

Reverted https://github.com/pytorch/pytorch/pull/125343 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/125343#issuecomment-2163690162))
2024-06-12 18:47:52 +00:00
6d1b1ddd3e Select Runner Label Dynamically (#127287)
Updated `get_workflow_type.py` logic to dynamically select a prefix for the runner label.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127287
Approved by: https://github.com/ZainRizvi
2024-06-12 18:47:47 +00:00
7db501ba2b Revert "[cuDNN][SDPA] Support different key, value dimension in cuDNN SDPA (#128350)"
This reverts commit 45dccfddcd8fce804f50075484421ade27f1f021.

Reverted https://github.com/pytorch/pytorch/pull/128350 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/128350#issuecomment-2163669538))
2024-06-12 18:35:18 +00:00
d71f92213c [DSD] keep 'exp_avg' as DTensor after torch.distributed.checkpoint.state_dict.set_optimizer_state_dict (#128004)
Fixes #126950
With `broadcast_from_rank0=False`, `ptd_state_dict` might miss 2 condition checks in `set_optimizer_state_dict`.
Here we add another condition, `full_state_dict=True`, which distributes the corresponding tensors without broadcasting when `broadcast_from_rank0=False`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128004
Approved by: https://github.com/fegin
2024-06-12 18:14:56 +00:00
624e8ae491 Documentation for is_dependent function (#128197)
Docstring for torch.distributions.constraints.is_dependent

Fixes #127900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128197
Approved by: https://github.com/fritzo, https://github.com/malfet
2024-06-12 17:50:41 +00:00
a70a7337d2 Update torch.nanmean() docstring to mention input dtype requirement (#128155)
Fixes #120570

## Description
Update torch.nanmean() docstring to mention input dtype requirement as either floating point type or complex.
Previously, the torch.mean() docstring had been updated in #120208 in a similar manner, but the torch.nanmean() docstring was not updated.

## Checklist

- [X] The issue that is being fixed is referred to in the description.
- [X] Only one issue is addressed in this pull request.
- [x] Labels from the issue that this PR is fixing are added to this pull request.
- [X] No unnecessary issues are included in this pull request.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128155
Approved by: https://github.com/malfet
2024-06-12 17:46:36 +00:00
0f52dc7e51 Document torch.cuda.profiler.stop (#128196)
Fixes #127918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128196
Approved by: https://github.com/malfet, https://github.com/eqy
2024-06-12 17:39:43 +00:00
5001f41b90 Revert "Make TraceUtils.h to be device-agnostic (#126969)"
This reverts commit 648625b230e8e6e7478fb219ff4f0aa6a45070f5.

Reverted https://github.com/pytorch/pytorch/pull/126969 on behalf of https://github.com/clee2000 due to failing internal builds D58443769 ([comment](https://github.com/pytorch/pytorch/pull/126969#issuecomment-2163462600))
2024-06-12 16:32:57 +00:00
f89574fa23 Revert "Pass params to dump_nccl_trace_pickle (#128307)"
This reverts commit eb567b1f40233667b982f81e3a75deec0fdfd9ca.

Reverted https://github.com/pytorch/pytorch/pull/128307 on behalf of https://github.com/clee2000 due to sorry need to revert this in order to revert 126969 ([comment](https://github.com/pytorch/pytorch/pull/128307#issuecomment-2163459399))
2024-06-12 16:29:51 +00:00
81e4e12f02 Revert "Support aten operations with out tensor (#124926)"
This reverts commit cba195c8edd6c7149036ef0767772d11fff5390e.

Reverted https://github.com/pytorch/pytorch/pull/124926 on behalf of https://github.com/clee2000 due to newly added test broke in internal D58444103.  Test passed in OSS CI though ([comment](https://github.com/pytorch/pytorch/pull/124926#issuecomment-2163441547))
2024-06-12 16:20:04 +00:00
c5172b8de8 Revert "[AOTI] Switch to use shim v2 (#127674)"
This reverts commit 9a38cae299e5ffd8143182bec878c28f96cfd72a.

Reverted https://github.com/pytorch/pytorch/pull/127674 on behalf of https://github.com/clee2000 due to tests failed internally D56709309 ([comment](https://github.com/pytorch/pytorch/pull/127674#issuecomment-2163436728))
2024-06-12 16:17:07 +00:00
9e39c62908 correct avx512_vnni isa name. (#128318)
`x86` currently has two VNNI ISAs: `avx2_vnni` and `avx512_vnni`.
This PR corrects the function name to `avx512_vnni`.

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128318
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
2024-06-12 16:12:49 +00:00
f2dcbe89d6 Revert "Prevent expansion of cat indexing to avoid int64 intermediate (#127815)"
This reverts commit 793df7b7cb1473004837f5867f4c1c4b2b0f751d.

Reverted https://github.com/pytorch/pytorch/pull/127815 on behalf of https://github.com/clee2000 due to the newly added test is failing internally D58444153.  Test exists in opensource and passed in OSS CI, maybe env difference? ([comment](https://github.com/pytorch/pytorch/pull/127815#issuecomment-2163421968))
2024-06-12 16:09:22 +00:00
8df56afc20 Add support in Python API for the recommended max working set size. (#128289)
Adds a way for users to query the recommended max working set size for Metal on Mac. It plumbs through
https://developer.apple.com/documentation/metal/mtldevice/2369280-recommendedmaxworkingsetsize?language=objc

Can be used like
```
import torch
max_memory = torch.mps.recommended_max_memory()
print("Recommended Max Memory : ", (max_memory/(1024*1024*1024)), "GB")
```

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128289
Approved by: https://github.com/malfet
2024-06-12 16:03:57 +00:00
b19c2319e4 [ROCm] TunableOp for gemm_and_bias (#128143)
Thus far TunableOp was implemented for gemm, bgemm, and scaled_mm.  gemm_and_bias was notably missing.  This PR closes that gap.

This PR also fixes a regression after #124362 disabled the numerical check by default. The env var to enable it no longer worked.

CC @xw285cornell

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128143
Approved by: https://github.com/Skylion007
2024-06-12 15:53:39 +00:00
3c971d2ef3 Flip default value for mypy disallow_untyped_defs [final] (#127836)
Not requiring all functions to have types allows a lot of 'Any' types to slip in - which poison types and make mypy unable to properly typecheck the code.  I want to flip the default so that new files are required to have fully typed defs and we can have a burndown list of files that fail to require full types.

The preceding stack of PRs (split up simply to keep the number of file changes per PR reasonable) adds `# mypy: allow-untyped-defs` to any file which didn't immediately pass mypy with the flag flipped.  Due to changing files and merge conflicts, it will probably be necessary to make several passes before landing this final PR, which turns the option on.
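
To illustrate what the flag changes, here is a small sketch with made-up function names. With `disallow_untyped_defs` enabled, mypy would report a def such as `scale_untyped` for missing annotations, unless the file opts out with the `# mypy: allow-untyped-defs` comment described above. The fully annotated `scale_typed` passes either way.

```python
# mypy: allow-untyped-defs
# The comment above opts this file out of the new default; removing it would
# make mypy flag scale_untyped below once disallow_untyped_defs is on.
from typing import List


def scale_untyped(values, factor):
    # No annotations: parameters and return value are implicitly Any, so
    # mypy cannot check callers of this function.
    return [v * factor for v in values]


def scale_typed(values: List[float], factor: float) -> List[float]:
    # Fully annotated: mypy can check both the body and every call site.
    return [v * factor for v in values]
```

Both functions behave the same at runtime; the difference is only in what mypy can verify.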

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127836
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2024-06-12 15:28:42 +00:00
15ab636007 Revert "Fix side effect pruning (#128028)"
This reverts commit a55d0d9718c11eb2897423c78eff18b168dd0a06.

Reverted https://github.com/pytorch/pytorch/pull/128028 on behalf of https://github.com/clee2000 due to broke test in internal D58443816.  Test exists in external too though ([comment](https://github.com/pytorch/pytorch/pull/128028#issuecomment-2163249251))
2024-06-12 14:55:57 +00:00
5ef70faaa7 Revert "Make torch_geometric models compatible with export (#123403)" (#128377)
This reverts commit d78991a7381adb3df5e9b63c365db4506643edce.

This PR reverts https://github.com/pytorch/pytorch/pull/123403 to fix the performance regression as discussed in https://github.com/pytorch/pytorch/issues/127513#issuecomment-2158835653.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128377
Approved by: https://github.com/jgong5, https://github.com/angelayi, https://github.com/desertfire
2024-06-12 14:53:01 +00:00
71f491554c Revert "First version of AOTAutogradCache (#126791)"
This reverts commit abc3eec22d38079bee855fbcb75da62a9558284c.

Reverted https://github.com/pytorch/pytorch/pull/126791 on behalf of https://github.com/DanilBaibak due to The changes broke a number of linux jobs ([comment](https://github.com/pytorch/pytorch/pull/126791#issuecomment-2163081643))
2024-06-12 13:59:29 +00:00
abc3eec22d First version of AOTAutogradCache (#126791)
This PR implements "V0" of AOTAutogradCache. Given an input to AOTAutograd, we calculate a cache key, then save an AOTAutogradCacheEntry.
Each AOTAutogradCacheEntry has:
- A CompiledForward and optionally a CompiledBackward
- A bunch of metadata.

CompiledForward and CompiledBackward each save the *key* to the FXGraphCache associated with the compiled object. FXGraphCache populates this key field as long as it's able to return a compiled graph given a set of inputs. We then load the same object from the FXGraphCache on an AOTAutogradCache hit.

On cache miss:
- Run AOTAutograd, up to AOTAutogradDispatch.post_compile.
- Save an AOTAutogradCacheEntry to the cache after compiling the necessary portions and receiving a cache key from FXGraphCache. In this we *always* compile the backwards ahead of time. The PR above this one implements backward lazy caching, so that we only save to the cache after compiling the backward in a lazy backward scenario.
- Return the resulting object

On cache hit:
- Run AOTAutogradCacheEntry.post_compile() on the cache key.
- This attempts to load the forward and backward graphs from FXGraphCache
- As long as we successfully load from FXGraphCache, it's a hit. We then rewrap the callable with post compile wrappers using our saved metadata.

For now, we ignore the fakified out and debug wrappers. We only save to the cache if Fakified out is turned off.

V0 Guards behavior:
FXGraphCache serializes guards that are needed in the shape_env based on the symint inputs to the graph. The invariant that AOTAutograd uses here is that the sources for symints given to it by dynamo are exactly the same as the ones it passes to inductor, for both the forward and backward passes. (This does *not* mean that the tensor values passed in are the same: only that their symints are). That is, AOTAutograd and Inductor never create new guards based on symints with *different sources* than those passed to it by inductor.

We don't currently store any AOTAutograd specific guards: my hypothesis is that FXGraphCache already stores these, as any guards generated by AOTAutograd should already be in the shape_env before calling into inductor, and we don't generate new guards post inductor. If this is needed, I'll add it in another diff.

Testing:
We'll start with some basic unit tests, but I'll be adding more and more complicated testing as the next step.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126791
Approved by: https://github.com/bdhirsh
2024-06-12 13:44:30 +00:00
2e065f2486 [Quant][Inductor] Bug fix: mutation nodes not handled correctly for QLinearPointwiseBinaryPT2E (#127592)
Fixes #127402

- Revert some changes to `ir.MutationOutput` and inductor/test_flex_attention.py
- Add checks of mutation for QLinearPointwiseBinaryPT2E

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127592
Approved by: https://github.com/leslie-fang-intel, https://github.com/Chillee
2024-06-12 10:49:16 +00:00
46a35a1ed4 [BE] enable UFMT for torch/__init__.py (#127710)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127710
Approved by: https://github.com/ezyang
ghstack dependencies: #127703, #127708, #127709
2024-06-12 10:40:23 +00:00
26433b86de [BE][Easy] sort __all__ in torch/__init__.py (#127709)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127709
Approved by: https://github.com/ezyang
ghstack dependencies: #127703, #127708
2024-06-12 10:21:36 +00:00
2386045e4f Add OpInfo entry for alias_copy (#127232) (#128142)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128142
Approved by: https://github.com/lezcano
2024-06-12 09:39:58 +00:00
1edcb31d34 [RELAND][inductor][cpp] bf16/fp16 gemm template computed with fp32 (#128472)
reland for https://github.com/pytorch/pytorch/pull/126068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128472
Approved by: https://github.com/desertfire
2024-06-12 08:37:16 +00:00
ebb00a92bd [dynamo] Skip freezing expect failure for inlining inbuilt nn modules (#128470)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128470
Approved by: https://github.com/mlazos
ghstack dependencies: #126578, #128440
2024-06-12 08:21:50 +00:00
1602c7d0c8 [dynamo] Enable some inlining inbuilt nn module tests (#128440)
Co-authored-by: Laith Sakka <lsakka@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128440
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #126578
2024-06-12 08:21:50 +00:00
04037f3d22 [BE] sort imports in torch/__init__.py (#127708)
----

- Sort imports via `usort`
- Change relative import `from . import xxx` to absolute import `from torch import xxx`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127708
Approved by: https://github.com/ezyang
ghstack dependencies: #127703
2024-06-12 08:03:54 +00:00
0b331fd5d7 [CUDA] Abate SoftMax.cu compiler warning spam (#128468)
Avoids excessively spammy warnings such as
```
pytorch/aten/src/ATen/native/cuda/SoftMax.cu(844): warning #191-D: type qualifier is meaningless on cast type
        [&] { const auto& the_type = input.scalar_type(); constexpr const char* at_dispatch_name = "host_softmax"; at::ScalarType _st = ::detail::scalar_type(the_type); ; switch (_st) { case at::ScalarType::Double: { do { if constexpr (!at::should_include_kernel_dtype( at_dispatch_name, at::ScalarType::Double)) { do { ::c10::detail::deprecated_AT_ERROR(); if (!(false)) { ::c10::detail::torchCheckFail( __func__, "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", static_cast<uint32_t>(844), (::c10::detail::torchCheckMsgImpl( "Expected " "false" " to be true, but got false.  " "(Could this error message be improved?  If so, " "please report an enhancement request to PyTorch.)", ::c10::str("dtype '", toString(at::ScalarType::Double), "' not selected for kernel tag ", at_dispatch_name)))); }; } while (false); } } while (0); using scalar_t __attribute__((__unused__)) = c10::impl::ScalarTypeToCPPTypeT<at::ScalarType::Double>; return [&] { using accscalar_t = acc_type<scalar_t, true>; if (!half_to_float) { auto output_ptr = output.mutable_data_ptr<scalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1L << 30L) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, scalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % 
ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(880), true); } while (0); } } else { auto output_ptr = output.mutable_data_ptr<accscalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1<<30) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, accscalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, 
accscalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(916), true); } while (0); } } }(); } case at::ScalarType::Float: { do { if constexpr (!at::should_include_kernel_dtype( at_dispatch_name, at::ScalarType::Float)) { do { ::c10::detail::deprecated_AT_ERROR(); if (!(false)) { ::c10::detail::torchCheckFail( __func__, "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", static_cast<uint32_t>(844), (::c10::detail::torchCheckMsgImpl( "Expected " "false" " to be true, but got false.  " "(Could this error message be improved?  If so, " "please report an enhancement request to PyTorch.)", ::c10::str("dtype '", toString(at::ScalarType::Float), "' not selected for kernel tag ", at_dispatch_name)))); }; } while (false); } } while (0); using scalar_t __attribute__((__unused__)) = c10::impl::ScalarTypeToCPPTypeT<at::ScalarType::Float>; return [&] { using accscalar_t = acc_type<scalar_t, true>; if (!half_to_float) { auto output_ptr = output.mutable_data_ptr<scalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1L << 30L) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, scalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); 
size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(880), true); } while (0); } } else { auto output_ptr = output.mutable_data_ptr<accscalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1<<30) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, accscalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= 
!(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(916), true); } while (0); } } }(); } case at::ScalarType::Half: { do { if constexpr (!at::should_include_kernel_dtype( at_dispatch_name, at::ScalarType::Half)) { do { ::c10::detail::deprecated_AT_ERROR(); if (!(false)) { ::c10::detail::torchCheckFail( __func__, "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", static_cast<uint32_t>(844), (::c10::detail::torchCheckMsgImpl( "Expected " "false" " to be true, but got false.  " "(Could this error message be improved?  
If so, " "please report an enhancement request to PyTorch.)", ::c10::str("dtype '", toString(at::ScalarType::Half), "' not selected for kernel tag ", at_dispatch_name)))); }; } while (false); } } while (0); using scalar_t __attribute__((__unused__)) = c10::impl::ScalarTypeToCPPTypeT<at::ScalarType::Half>; return [&] { using accscalar_t = acc_type<scalar_t, true>; if (!half_to_float) { auto output_ptr = output.mutable_data_ptr<scalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1L << 30L) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, scalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), 
"/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(880), true); } while (0); } } else { auto output_ptr = output.mutable_data_ptr<accscalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1<<30) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, accscalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(916), true); } while (0); } } }(); } case at::ScalarType::BFloat16: { do { if constexpr (!at::should_include_kernel_dtype( at_dispatch_name, at::ScalarType::BFloat16)) 
{ do { ::c10::detail::deprecated_AT_ERROR(); if (!(false)) { ::c10::detail::torchCheckFail( __func__, "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", static_cast<uint32_t>(844), (::c10::detail::torchCheckMsgImpl( "Expected " "false" " to be true, but got false.  " "(Could this error message be improved?  If so, " "please report an enhancement request to PyTorch.)", ::c10::str("dtype '", toString(at::ScalarType::BFloat16), "' not selected for kernel tag ", at_dispatch_name)))); }; } while (false); } } while (0); using scalar_t __attribute__((__unused__)) = c10::impl::ScalarTypeToCPPTypeT<at::ScalarType::BFloat16>; return [&] { using accscalar_t = acc_type<scalar_t, true>; if (!half_to_float) { auto output_ptr = output.mutable_data_ptr<scalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1L << 30L) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, scalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_sz, 
stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, scalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(880), true); } while (0); } } else { auto output_ptr = output.mutable_data_ptr<accscalar_t>(); auto input_ptr = input.const_data_ptr<scalar_t>(); if (dim_size <= 1024 && dim_size*sizeof(scalar_t) <= 4096) { int64_t remaining = outer_size; int64_t chunk_size = (1<<30) / dim_size; while(remaining > 0) { dispatch_softmax_forward<scalar_t, accscalar_t, accscalar_t, is_log_softmax, false>( output_ptr, input_ptr, dim_size, dim_size, std::min<int64_t>(remaining, chunk_size), nullptr ); input_ptr += chunk_size * dim_size; output_ptr += chunk_size * dim_size; remaining -= chunk_size; } } else { constexpr int ILP = sizeof(float4) / sizeof(scalar_t); dim3 block = SoftMaxForward_getBlockSize(dim_size); size_t smem_reduction_sz = block.x / 32 * sizeof(accscalar_t); auto max_elements_per_smem = (at::cuda::getCurrentDeviceProperties()->sharedMemPerBlock - smem_reduction_sz) / sizeof(scalar_t); bool can_use_smem = dim_size < max_elements_per_smem; can_use_smem &= !(reinterpret_cast<const uintptr_t>(input_ptr) % ALIGN_BYTES); can_use_smem &= (!(reinterpret_cast<uintptr_t>(output_ptr) % ALIGN_BYTES)); can_use_smem &= !(dim_size % ILP); if (can_use_smem) { size_t smem_sz = dim_size * sizeof(scalar_t) + smem_reduction_sz; cunn_SoftMaxForwardSmem<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_sz, stream>>>(output_ptr, input_ptr, dim_size); } else { cunn_SoftMaxForward<ILP, scalar_t, accscalar_t, accscalar_t, Epilogue> <<<grid, block, smem_reduction_sz, stream>>>(output_ptr, input_ptr, dim_size); } do { const cudaError_t __err = cudaGetLastError(); 
c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", __func__, static_cast<uint32_t>(916), true); } while (0); } } }(); } default: do { ::c10::detail::deprecated_AT_ERROR(); if (!(false)) { ::c10::detail::torchCheckFail( __func__, "/workspace/pytorch/aten/src/ATen/native/cuda/SoftMax.cu", static_cast<uint32_t>(844), (::c10::detail::torchCheckMsgImpl( "Expected " "false" " to be true, but got false.  " "(Could this error message be improved?  If so, " "please report an enhancement request to PyTorch.)", ::c10::str('"', at_dispatch_name, "\" not implemented for '", toString(_st), "'")))); }; } while (false); } }()

```
and
```
SoftMax.cu:844: warning: comparison of integer expressions of different signedness: ‘int64_t’ {aka ‘long int’} and ‘long unsigned int’ [-Wsign-compare]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128468
Approved by: https://github.com/valentinandrei
2024-06-12 07:47:14 +00:00
8b3daf1768 Add FloatTrueDiv and ToFloat to SYMPY_INTERP (#128418)
Summary: I admit I'm not 100% sure what I'm doing here. I'm hitting a bug in the FX graph cache when we try to evaluate a guards expression. We're creating guards that look like this:
```
Ne(CeilToInt(FloatTrueDiv(ToFloat(8*L['t0']) - 4.0, 8.0))*CeilToInt(FloatTrueDiv(ToFloat(8*L['t1']) - 4.0, 8.0)), CeilToInt(FloatTrueDiv(ToFloat(8*L['t1']) - 4.0, 8.0))) and ...
```
It looks like we have a facility to define these operators in the SYMPY_INTERP map and we're just missing FloatTrueDiv and ToFloat. What's surprising to me is that we're only hitting this problem with the FX graph cache enabled. We can create such guards, but we've never actually evaluated any?
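
The role of an operator-name-to-implementation map can be illustrated with a minimal, self-contained sketch. Everything here is hypothetical: `INTERP`, `evaluate`, `ceil_term`, and the nested-tuple expression encoding are illustrative names, not the actual `SYMPY_INTERP` table or PyTorch's guard evaluator. It only shows why a missing entry goes unnoticed while expressions are merely created and surfaces once one is evaluated.

```python
import math
import operator

# Hypothetical interpretation table: operator name -> Python callable.
INTERP = {
    "Mul": operator.mul,
    "Sub": operator.sub,
    "Ne": operator.ne,
    "CeilToInt": math.ceil,
    "FloatTrueDiv": operator.truediv,
    "ToFloat": float,
}


def evaluate(expr, env, interp=INTERP):
    """Evaluate a nested-tuple expression such as ("ToFloat", "t0")."""
    if isinstance(expr, str):            # symbol: look it up in the environment
        return env[expr]
    if isinstance(expr, (int, float)):   # literal: return unchanged
        return expr
    name, *args = expr
    if name not in interp:
        raise KeyError(f"no interpretation registered for {name!r}")
    return interp[name](*(evaluate(arg, env, interp) for arg in args))


def ceil_term(sym):
    # Encodes CeilToInt(FloatTrueDiv(ToFloat(8*sym) - 4.0, 8.0)).
    return ("CeilToInt",
            ("FloatTrueDiv",
             ("Sub", ("ToFloat", ("Mul", 8, sym)), 4.0),
             8.0))


# Same shape as the guard quoted above, with t0 and t1 as plain symbols.
guard = ("Ne", ("Mul", ceil_term("t0"), ceil_term("t1")), ceil_term("t1"))
result = evaluate(guard, {"t0": 3, "t1": 5})
print(result)
```

Building `guard` never consults the table; only `evaluate` does, so dropping `FloatTrueDiv` or `ToFloat` from the table fails at evaluation time with a `KeyError`.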

Test Plan:
`TORCHINDUCTOR_FX_GRAPH_CACHE=1 python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --only detectron2_fcos_r_50_fpn`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128418
Approved by: https://github.com/ezyang
2024-06-12 06:26:43 +00:00
a421699998 Revert "[tp] refactor and fix PrepareModuleInput for DTensor inputs (#128431)"
This reverts commit 089f9a116ac8b2c14d6351b52614b529caba126b.

Reverted https://github.com/pytorch/pytorch/pull/128431 on behalf of https://github.com/DanilBaibak due to Sorry for the revert. Your changes broke the linter. Here you can find more details - 089f9a116a ([comment](https://github.com/pytorch/pytorch/pull/128431#issuecomment-2162197858))
2024-06-12 06:25:53 +00:00
dcc0093dba [BE][Easy] export explicitly imported public submodules (#127703)
Add top-level submodules `torch.{storage,serialization,functional,amp,overrides,types}`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127703
Approved by: https://github.com/ezyang
2024-06-12 05:52:18 +00:00
62311257ad Add 1 test case for Convtranspose1D in op microbenchmark (#127216)
Operator ConvTranspose1d suffers a performance regression with a specific shape (#120982), so this PR adds that shape to the op-level benchmark.

I reproduced the regression for ConvTranspose1d with shape [2016, 1026, 1024, 256, 1, 224]. Here is the summary:

Hardware info: Intel SPR8480-56cores per socket with frequency=2.1G.
Performance comparison between torch 1.13 vs. torch 2.2
Benchmarking **PyTorch1.13**: ConvTranspose1d Mode: Eager
Name: ConvTranspose1d_IC2016_OC1026_kernel1024_stride256_N1_L224_cpu
Input: IC: 2016, OC: 1026, kernel: 1024, stride: 256, N: 1, L: 224, device: cpu
Forward Execution Time (s) : **0.96s**

Benchmarking **PyTorch2.2:** ConvTranspose1d
Mode: Eager
Name: ConvTranspose1d_IC2016_OC1026_kernel1024_stride256_N1_L224_cpu
Input: IC: 2016, OC: 1026, kernel: 1024, stride: 256, N: 1, L: 224, device: cpu
Forward Execution Time (s) : **7.988s**

Also benchmarking for 7 rounds to check the variance.

Version | Round1 | Round2 | Round3 | Round4 | Round5 | Round6 | Round7 | Normalized Variance
-- | -- | -- | -- | -- | -- | -- | -- | --
PyTorch 1.13 | 0.971 | 0.972 | 0.969 | 0.970 | 0.972 | 0.970 | 0.971 | 0.0002%
PyTorch 2.2 | 8.064 | 8.053 | 8.027 | 7.927 | 7.971 | 7.929 | 7.902 | 0.0059%
Ratio v2.2 vs. v1.13 (lower is better) | 8.31 | 8.28 | 8.29 | 8.18 | 8.20 | 8.18 | 8.14 |
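
As a rough cross-check of the ratio row, the per-round slowdown can be recomputed from the times in the table. This is only a sketch: the variable names are illustrative, and because the inputs are the rounded three-decimal times, the recomputed ratios may differ from the table in the last digit.

```python
# Forward execution times in seconds, copied from the table above.
torch_1_13 = [0.971, 0.972, 0.969, 0.970, 0.972, 0.970, 0.971]
torch_2_2 = [8.064, 8.053, 8.027, 7.927, 7.971, 7.929, 7.902]

# Per-round slowdown of 2.2 relative to 1.13 (lower is better).
ratios = [new / old for old, new in zip(torch_1_13, torch_2_2)]
mean_ratio = sum(ratios) / len(ratios)

for round_no, ratio in enumerate(ratios, start=1):
    print(f"Round{round_no}: {ratio:.2f}x")
print(f"mean: {mean_ratio:.2f}x")
```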

Reproduce script:
numactl -N 0 python -m pt.conv_test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127216
Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/atalman
2024-06-12 05:33:54 +00:00
089f9a116a [tp] refactor and fix PrepareModuleInput for DTensor inputs (#128431)
As titled, this PR refactors the PrepareModuleInput style to have a common
method, prepare_input_arg, so that both args and kwargs can reuse this logic.

This also fixes https://github.com/pytorch/pytorch/issues/128365

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128431
Approved by: https://github.com/awgu
2024-06-12 05:22:24 +00:00
77a0ca66e4 Add threadfence to 2-stage reduction for correct writes visibility (#128455)
The final block accumulating the 2-stage reduction result has to complete the acquire pattern to make sure the writes of all other blocks are visible to it; see https://docs.nvidia.com/cuda/parallel-thread-execution/index.html?highlight=atom#release-and-acquire-patterns
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128455
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-06-12 04:13:36 +00:00
c0b87afcad [RELAND2][dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578)
Tracing through `__init__` is important because it initializes (calls STORE_ATTR on) members. By doing that, we kick in the mutation tracking for these objects, so things like mutating `_modules` are tracked automatically.
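
To make the STORE_ATTR point concrete, the following sketch uses only CPython's standard `dis` module on a toy class. `Layer` is a hypothetical stand-in, not an `nn.Module`, and nothing here involves Dynamo itself. It only shows that each `self.<name> = ...` assignment in `__init__` compiles to a STORE_ATTR instruction, which is the kind of instruction a bytecode tracer would observe.

```python
import dis


class Layer:
    """Toy class; hypothetical stand-in for a module with attributes."""

    def __init__(self, width):
        # Each assignment to `self.<name>` compiles to a STORE_ATTR instruction.
        self.width = width
        self.children = {}


# Collect the attribute names written by __init__, in bytecode order.
store_attr_targets = [
    ins.argval
    for ins in dis.get_instructions(Layer.__init__)
    if ins.opname == "STORE_ATTR"
]
print(store_attr_targets)
```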

Fixes https://github.com/pytorch/pytorch/issues/111837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126578
Approved by: https://github.com/jansel
2024-06-12 04:09:23 +00:00
02e7519ac3 DOC: strip inaccurate either float32 or float64 statement from set_default_type (#128192)
Fixes #126647

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128192
Approved by: https://github.com/malfet
2024-06-12 03:57:48 +00:00
cyy
8cf302dce4 [5/N] Change static functions in headers to inline (#128406)
Follows #128286

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128406
Approved by: https://github.com/ezyang
2024-06-12 03:25:54 +00:00
86b5df3e71 Documenting the torch.fx.annotate.annotate function (#128337)
Fixes #127903

This PR adds docstring to the `torch.fx.annotate.annotate` function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128337
Approved by: https://github.com/malfet
2024-06-12 03:06:32 +00:00
7c2058338a Improve convert fp32 to fp16 fx pass (#127829)
Summary: Improve the convert fp32 to fp16 fx pass to use to_dtype node and const folding instead of inplace conversion.

Test Plan:
```
buck2 test @//mode/{opt,inplace} //glow/fb/fx/fba/tests:test_fba_pass_manager_builder
```

Differential Revision: D57803843

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127829
Approved by: https://github.com/Skylion007
2024-06-12 02:50:37 +00:00
3ddec713b8 Revert "[cuDNN][Quantization] Don't print when plan finalization fails in cuDNN quantization backend (#128177)"
This reverts commit cac7a22b92478d897488688010e562b7bd36b97f.

Reverted https://github.com/pytorch/pytorch/pull/128177 on behalf of https://github.com/clee2000 due to broke test/test_quantization.py::TestQuantizedLinear::test_qlinear_cudnn on sm86 tests cac7a22b92 https://github.com/pytorch/pytorch/actions/runs/9470648757/job/26100448913.  Probably a landrace, test ran on the PR and succeed ([comment](https://github.com/pytorch/pytorch/pull/128177#issuecomment-2161977110))
2024-06-12 02:20:15 +00:00
85eeb90d2c [dynamo] Fix graph breaks related to HF ModelOutput (#127780)
Fixes https://github.com/pytorch/pytorch/issues/126028 and https://github.com/pytorch/pytorch/issues/126027.

Changes:
- Support building `CustomizedDictVariable` in `VariableBuilder` (but only for HF `ModelOutput` subclasses)
- Remove `DataClassVariable` since it's not really being used anywhere (`CustomizedDictVariable` can be used instead)
- Support side effects for `CustomizedDictVariable`
- Allow `NO_HASATTR` leaf guard on `DictSubclassGuardManager`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127780
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-06-12 02:16:24 +00:00
7f6daf289b [inductor] parallel compile: set LD_LIBRARY_PATH for sub-processes in internal (#128376)
Test Plan: `TORCHINDUCTOR_WORKER_START=subprocess TORCHINDUCTOR_COMPILE_THREADS=16 buck run mode/opt scripts/slarsen/torch_compile:run`

Differential Revision: D58371264

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128376
Approved by: https://github.com/eellison
2024-06-12 01:55:53 +00:00
3d55d84ec2 [Fix] Check tensor dtype before using torch.allclose in _trace log (#128438)
#### Issue
`torch.allclose` errors out during logging due to different dtypes.

#### Test
* `pytest test/test_jit.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128438
Approved by: https://github.com/angelayi
2024-06-12 01:52:09 +00:00
bb2a995529 Back out "[Dynamo] Treat integers stored on nn.Modules as dynamic (#126466)" (#128432)
Summary:
Original commit changeset: c7d2e6b13922

Original Phabricator Diff: D57618942

Differential Revision: D58383241

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128432
Approved by: https://github.com/ezyang, https://github.com/Yuzhen11
2024-06-12 01:34:32 +00:00
cyy
9538bf4e7c [2/N] Remove inclusion of c10/util/string_utils.h (#128372)
Follows  #128300.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128372
Approved by: https://github.com/aaronenyeshi
2024-06-12 01:18:20 +00:00
cyy
219da29dfd [7/N] Remove unused functions (#128407)
Follows  #128309
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128407
Approved by: https://github.com/ezyang
2024-06-12 01:10:33 +00:00
cyy
fb013ecb24 Remove unused private List::ptr_to_first_element (#128405)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128405
Approved by: https://github.com/ezyang
2024-06-12 01:07:14 +00:00
6af4c6acad Migrate test to internal base class, fixes (#128367)
Summary:
## Remove etcd deps
converted tests to a non-etcd based rdzv handler so that tests don't have a dependency on an etcd server

## Adopt pytorch test conventions
- test starts with `test_TESTS.py`
- Test base class is torch.testing._internal.common_utils.TestCase
- include __main__  handler

## reduce test timing (used to take > 300 seconds):

3.05s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_init_method_env_with_torchelastic
2.59s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_init_method_tcp_with_torchelastic
2.33s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic_worker_raise_exception
2.33s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_run_path
2.30s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_nproc_launch_auto_configurations
2.24s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_is_torchelastic_launched_with_logs_spec_defined
2.24s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_is_torchelastic_launched
2.17s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic_multiple_agents
2.12s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic
2.08s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_nproc_gpu_launch_configurations
1.32s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_standalone
1.05s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_nproc_launch_number_configurations
1.05s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_with_env_vars
1.05s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_python
1.05s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_python_caffe2_bc
1.04s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_bash
1.03s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_user_script_default_nproc
0.04s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_logs_logs_spec_entrypoint_must_be_defined
0.01s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_elastic_agent_raise_exception
0.01s call     test/distributed/launcher/run_test.py::ElasticLaunchTest::test_launch_shutdown

Test Plan: pytest --durations=0  test/distributed/launcher/run_test.py

Differential Revision: D58388182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128367
Approved by: https://github.com/d4l3k
2024-06-12 01:03:40 +00:00
786c24a4cd [inductor] Always realize sigmoid for CPU (#128339)
Summary: Currently the CPU backend prefers to always realize exp because it's a heavy op on CPU. For the same reason, we need to realize sigmoid as well. This solves a problem in llama2 inference where exp was repeated many times in an inner loop.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128339
Approved by: https://github.com/eellison, https://github.com/helloguo, https://github.com/jansel, https://github.com/jgong5, https://github.com/peterbell10
2024-06-12 00:46:33 +00:00
5d8c7f39d4 Revert "Introduce int_oo (#127693)"
This reverts commit 9cab5987bdeb66df8efbc581b3469bfe300e168c.

Reverted https://github.com/pytorch/pytorch/pull/127693 on behalf of https://github.com/clee2000 due to sorry executorch CI is a bit weird regarding pins, I'll make a chat with mergen with the choices of what to do and how it'll affect executorch CI, reverting for now to prevent more divergences in the meantime ([comment](https://github.com/pytorch/pytorch/pull/127693#issuecomment-2161775400))
2024-06-11 23:36:08 +00:00
c9c1fed065 Revert "Flip default value for mypy disallow_untyped_defs [10+2/11] (#128374)"
This reverts commit c13e03c87428b986972a48d8fc78dbffc2579f63.

Reverted https://github.com/pytorch/pytorch/pull/128374 on behalf of https://github.com/clee2000 due to sorry I need to revert this in order to revert something else, to remerge, just rebase and fix the merge conflict ([comment](https://github.com/pytorch/pytorch/pull/128374#issuecomment-2161772864))
2024-06-11 23:34:03 +00:00
94fea82d66 init sub comment (#128082)
Fixes #127905

### Description

Add docstring to torch/onnx/symbolic_opset9.py:sigmoid function

### Checklist
- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128082
Approved by: https://github.com/titaiwangms
2024-06-11 22:42:35 +00:00
447173198b Add docstring for the torch.fx.operator_schemas.create_type_hint func… (#128139)
Fixes: #127916

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128139
Approved by: https://github.com/SherlockNoMad
2024-06-11 22:42:11 +00:00
b79d056e76 [export] Fix unflattener for preserving modules containing unused inputs (#128260)
Currently the unflattener fails if the module whose signature it is preserving contains unused inputs/outputs.

This also fixes unflattener issues in D57829276.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128260
Approved by: https://github.com/pianpwk
2024-06-11 22:32:08 +00:00
eb567b1f40 Pass params to dump_nccl_trace_pickle (#128307)
Summary:
Pass parameters from request to dump_nccl_trace_pickle handler.
The supported parameters and their values are all lowercase:
includecollectives={true, false}
includestacktraces={true, false}
onlyactive={true, false}

Example post is:
/handler/dump_nccl_trace_pickle?includecollectives=true&includestacktraces=false&onlyactive=true
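
A minimal sketch of how such a query string maps to boolean flags, using only Python's standard library. This is not the handler's implementation: `parse_dump_params` and `DEFAULTS` are hypothetical names, the default values are assumptions (the source does not state them), and the strict lowercase check simply mirrors the note above.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical defaults; the real handler's defaults are not stated in the source.
DEFAULTS = {
    "includecollectives": True,
    "includestacktraces": True,
    "onlyactive": False,
}


def parse_dump_params(url, defaults=DEFAULTS):
    """Return a dict of boolean flags parsed from the URL's query string."""
    flags = dict(defaults)
    query = parse_qs(urlsplit(url).query)
    for name, values in query.items():
        if name not in flags:
            raise ValueError(f"unsupported parameter: {name}")
        value = values[-1]  # the last occurrence wins if a key is repeated
        if value not in ("true", "false"):
            raise ValueError(f"{name} must be 'true' or 'false', got {value!r}")
        flags[name] = value == "true"
    return flags


params = parse_dump_params(
    "/handler/dump_nccl_trace_pickle"
    "?includecollectives=true&includestacktraces=false&onlyactive=true"
)
print(params)
```

Rejecting anything other than lowercase `true`/`false` keeps the sketch consistent with the stated convention; a more lenient parser would be an equally valid design.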

Test Plan:
unit tests

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128307
Approved by: https://github.com/d4l3k
ghstack dependencies: #128191
2024-06-11 22:28:53 +00:00
1dd2431f86 [Test] Add test for only_active flag (#128191)
Summary:
Add a unit test for the only_active flag to _dump_nccl_trace API call.
With this flag, we only expect active records to be returned.

Test Plan:
Unit test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128191
Approved by: https://github.com/d4l3k
2024-06-11 22:26:01 +00:00
5fcb5f0c8b init reshape_from_tensor_shape comment (#128171)
Fixes #127897

### Description
Add docstring to torch/onnx/symbolic_opset9.py:sigmoid function

### Checklist
- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128171
Approved by: https://github.com/titaiwangms
2024-06-11 21:56:33 +00:00
a55d0d9718 Fix side effect pruning (#128028)
Summary:
The previous side effect pruning algorithm would keep many dead cell
variables alive. For example, in
https://github.com/pytorch/pytorch/issues/125078, the compiled function
has one return but there were three in the Dynamo graph due to two
dead cell variables not being pruned away.

This PR adds a corrected algorithm. "new cell variables" are alive if
they can be reached from one of the following:
1. any of the tx.symbolic_locals or tx.stack (that is, if they are
   involved in a return from the function or intermediate variable
   during a graph break). Example: an alive NestedUserFunctionVariable
2. "mutations to pre-existing objects". Example: appending a
   NestedUserFunctionVariable to a global list

The new algorithm reflects this, but please let me know if there are
more cases to handle.
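
The reachability rule described above can be sketched as a plain graph traversal. This is a simplified illustration, not Dynamo's SideEffects code: `live_cells`, the `edges` mapping, and all object names are hypothetical, and the roots stand in for `tx.symbolic_locals`/`tx.stack` plus mutated pre-existing objects.

```python
def live_cells(roots, edges, new_cells):
    """Return the subset of `new_cells` reachable from `roots`.

    `edges` maps an object name to the names of the objects it references.
    """
    seen = set()
    stack = list(roots)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, ()))
    return seen & set(new_cells)


# Hypothetical object graph; the names are illustrative only.
edges = {
    "returned_fn": ["cell_a"],       # closure returned from the function
    "global_list": ["appended_fn"],  # mutation of a pre-existing object
    "appended_fn": ["cell_b"],
    "dead_fn": ["cell_c"],           # never returned, never stored anywhere
}
roots = ["returned_fn", "global_list"]
alive = live_cells(roots, edges, ["cell_a", "cell_b", "cell_c"])
print(sorted(alive))
```

Here `cell_c` is only referenced by an unreachable function, so it is pruned, while the cells behind the returned closure and the function appended to the global list stay alive.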

Test Plan:
- existing tests (afaict, test/dynamo/test_python_autograd is the best
  SideEffects test case we have)
- see in test/dynamo/test_higher_order_ops that the expecttests changed
  -- the functorch dynamo graphs no longer return dead cellvars.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128028
Approved by: https://github.com/jansel
2024-06-11 21:40:48 +00:00
8c1247cffb [BE] Fixed CPU autocast warning (#127774)
This PR fixes
```
/data/users/andgu/pytorch/torch/utils/checkpoint.py:1398: FutureWarning: `torch.cpu.amp.autocast(args...)` is deprecated. Please use `torch.amp.autocast('cpu', args...)` instead.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127774
Approved by: https://github.com/soulitzer, https://github.com/Skylion007, https://github.com/tianyu-l
2024-06-11 21:33:35 +00:00
70a1e85718 [Traceable FSDP2] Use custom ops for AllGather copy-in / copy-out and ReduceScatter copy-in (#127856)
Making these operations into custom ops helps Inductor identify these ops and enforce the FSDP communication op ordering.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127856
Approved by: https://github.com/awgu
2024-06-11 20:15:03 +00:00
adb699189b Revert "[RELAND][dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578)"
This reverts commit b2d602306a9eb19e30328cbaee941c874f8148a9.

Reverted https://github.com/pytorch/pytorch/pull/126578 on behalf of https://github.com/clee2000 due to failed internal test D58394084.  Author has forward fix but includes external changes so reverting is a bit easier to coordinate ([comment](https://github.com/pytorch/pytorch/pull/126578#issuecomment-2161481839))
2024-06-11 19:41:41 +00:00
eqy
45dccfddcd [cuDNN][SDPA] Support different key, value dimension in cuDNN SDPA (#128350)
CC @vedaanta-nvidia @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128350
Approved by: https://github.com/Skylion007
2024-06-11 19:22:21 +00:00
3e09123797 Enable UFMT on test_nestedtensor.py (#128359)
split it into two PRs since it is more than 2k lines of change

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128359
Approved by: https://github.com/davidberard98
2024-06-11 19:14:04 +00:00
61f922c2ca Fix 'get_real_value' on placeholder nodes (#127698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127698
Approved by: https://github.com/jansel
ghstack dependencies: #127695, #127696
2024-06-11 18:57:25 +00:00
984b1a8c35 Fix 'get_attr' call in dynamo 'run_node' (#127696)
Fixes #124858

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127696
Approved by: https://github.com/jansel
ghstack dependencies: #127695
2024-06-11 18:57:25 +00:00
205410cb44 add xpu to torch.tensors (#127280)
As support for Intel GPU has been upstreamed, this PR adds XPU-related content to the torch.tensors doc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127280
Approved by: https://github.com/svekars
2024-06-11 18:13:01 +00:00
cac7a22b92 [cuDNN][Quantization] Don't print when plan finalization fails in cuDNN quantization backend (#128177)
Similar in spirit to #125790, hopefully addresses failures seen for the cuDNN 9.1 upgrade: https://github.com/pytorch/pytorch/pull/128166

CC @nWEIdia @atalman

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128177
Approved by: https://github.com/nWEIdia, https://github.com/Skylion007
2024-06-11 18:09:25 +00:00
8a09940a54 [inductor] fix compile time regression by caching get_gpu_type (#128363)
We recently observed a significant compile time regression in torchtitan when turning
on 2D parallel + torch.compile, so I decided to get a deeper
understanding of why.

It turns out this affects **all training jobs** that have functional collectives
captured in the graph, not only 2D parallel (2D parallel was just the
job that happened to have collectives captured in the TP region).

The root cause is that during inductor lowering we call
the comm analysis pass to get an estimated collective time for each
collective node in the graph, and for each collective
node we call `get_gpu_type()`, which under the hood calls
`torch.utils.collect_env.run` to get the GPU info. However, this call is
super expensive! It effectively spawns a new
process and calls `nvidia-smi` to get the GPU info, so the cost is **linear**
in the number of collective nodes in the graph.

see https://github.com/pytorch/pytorch/blob/main/torch/utils/collect_env.py#L75

The fix is to add an LRU cache to the function, so that we only call it
once and reuse the cached result afterwards.
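
The caching fix described above can be sketched with the standard library alone. This is an illustration, not the code from the PR: `query_gpu_type`, the `probe_calls` counter, and the placeholder return value are hypothetical stand-ins for the real `get_gpu_type()`, which shells out to `nvidia-smi`.

```python
import functools

# Counts how many times the expensive body actually runs.
probe_calls = 0


@functools.lru_cache(maxsize=None)
def query_gpu_type() -> str:
    """Hypothetical stand-in for an expensive, process-spawning GPU query."""
    global probe_calls
    probe_calls += 1
    return "A100"  # placeholder value; a real query would inspect the machine


# Simulate the comm analysis pass asking once per collective node.
gpu_types = [query_gpu_type() for _ in range(1000)]
```

With the decorator in place the body runs once and the remaining lookups are served from the cache, so the cost no longer grows with the number of collective nodes.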

torchtitan benchmark shows:
* before this fix: 2D parallel + fp8 compile time: 6min +
* after this fix: 2D parallel + fp8 compile time: 2min 48s (more than 100% improvement)

There is more room to improve the compile time, but this PR fixes the biggest regression I have found so far.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128363
Approved by: https://github.com/yf225
2024-06-11 18:02:13 +00:00
1d233b8f50 Revert "Make nn.Module state_dict load_state_dict pre-hook and state_dict post hook public (#126704)"
This reverts commit c38b3381a12a0ec033dd417827c530c4474b8165.

Reverted https://github.com/pytorch/pytorch/pull/126704 on behalf of https://github.com/clee2000 due to broke internal typecheck D58394110 (which probably means the code wouldn't work either but I guess it didn't run on the diff). Probably an easy fix? ([comment](https://github.com/pytorch/pytorch/pull/126704#issuecomment-2161299193))
2024-06-11 17:45:20 +00:00
491c4a5dcb Revert "Make sure #126704 is BC for torch.save-ed nn.Module (#128344)"
This reverts commit 841d87177a900c2bbd59b6589165189141c4e8bb.

Reverted https://github.com/pytorch/pytorch/pull/128344 on behalf of https://github.com/clee2000 due to broke internal typecheck D58394110 (which probably means the code wouldn't work either but I guess it didn't run on the diff). Probably an easy fix? ([comment](https://github.com/pytorch/pytorch/pull/126704#issuecomment-2161299193))
2024-06-11 17:45:20 +00:00
4345d98663 [dynamo] Fix for #127696 (#128358)
Test Plan:
`buck2 test @//mode/dev-nosan //executorch/exir/backend/...`
https://www.internalfb.com/intern/testinfra/testrun/12666373989243932

Differential Revision: D58384518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128358
Approved by: https://github.com/ydwu4
2024-06-11 16:43:15 +00:00
a838e90964 Add Intel Gaudi device/HPU to auto load in instantiate_device_type_tests (#126970)
### Motivation
The Intel Gaudi accelerator (device name hpu) has a good pass rate on the PyTorch framework UTs. However, because it is an out-of-tree device, we face challenges in adapting it to natively run the existing PyTorch UTs under pytorch/test. The UTs are nonetheless a good indicator of device stack health, so we run them regularly with adaptations.
Although we can add the Gaudi/HPU device to generate device-specific tests using the TORCH_TEST_DEVICES environment variable, we miss out on a lot of features, such as executing for specific dtypes, skipping, and overriding opInfo. With significant changes introduced in every PyTorch release, maintaining these adaptations becomes difficult and time-consuming.
Hence, with this PR we introduce the Gaudi device in the common_device_type framework, so that the tests are instantiated for Gaudi when the library is loaded.
The eventual goal is to make Gaudi out-of-tree support equivalent to that of in-tree devices.

### Changes
Add HPUTestBase of type DeviceTypeTestBase, specifying appropriate attributes for Gaudi/HPU.
Include code to check whether the Intel Gaudi software library is loaded and, if so, add the device to the list of devices considered for instantiation of device type tests.
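
A rough sketch of the "add the device only if its library is available" idea, using only the standard library. Every name here (`DeviceTestBase`, `CPUTestBase`, `is_module_available`, `get_device_test_bases`) is hypothetical and does not mirror the actual common_device_type code; only `HPUTestBase` is taken from the description above, and the module name to probe is left as a parameter because the message does not state it.

```python
import importlib.util
from typing import List, Sequence, Tuple, Type


class DeviceTestBase:
    """Hypothetical base class carrying per-device attributes."""
    device_type = "generic"


class CPUTestBase(DeviceTestBase):
    device_type = "cpu"


class HPUTestBase(DeviceTestBase):
    device_type = "hpu"


def is_module_available(name: str) -> bool:
    """Return True if `name` looks importable, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        return False


def get_device_test_bases(
    optional_backends: Sequence[Tuple[str, Type[DeviceTestBase]]],
) -> List[Type[DeviceTestBase]]:
    """Start from in-tree devices, then append each backend whose library is present."""
    bases: List[Type[DeviceTestBase]] = [CPUTestBase]
    for module_name, base in optional_backends:
        if is_module_available(module_name):
            bases.append(base)
    return bases
```

Probing with `find_spec` avoids importing the backend as a side effect of collecting tests; whether the real change probes or imports is not stated in the message.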

### Additional Context
Please refer to the following RFC: https://github.com/pytorch/rfcs/pull/63/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126970
Approved by: https://github.com/albanD
2024-06-11 16:35:17 +00:00
29081059b6 [Static Runtime] Fix & run gen_static_runtime_ops (#128299)
gen_static_runtime_ops hasn't been updated in a while. In preparation for https://github.com/pytorch/pytorch/pull/127675 in which I need to re-run the codegen step for cumprod, I want to land these changes beforehand in case there are any other issues that arise.

I added a number of ops to the blocklist:
```
+        "_nested_tensor_storage_offsets",
+        "_nested_get_values",  # no CPU backend
+        "_nested_get_values_copy",  # no CPU backend
+        "_nested_view_from_jagged",  # testing needs to be patched
+        "_nested_view_from_jagged_copy",  # testing needs to be patched
+        "_nested_view_from_buffer",  # testing needs to be patched
+        "_nested_view_from_buffer_copy",  # testing needs to be patched
+        "_int_mm",  # testing needs to be patched
+        "_to_sparse_csc",  # testing needs to be patched
+        "_to_sparse_csr",  # testing needs to be patched
+        "segment_reduce",  # testing needs to be patched
```

Most of these are added just because testing doesn't work right now.

Additionally, a few `fft` ops seem to have been removed from native_functions.yaml; I'm guessing it's unlikely FFT would have been used in many real models though.

Differential Revision: [D58329403](https://our.internmc.facebook.com/intern/diff/D58329403/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128299
Approved by: https://github.com/YuqingJ
2024-06-11 16:27:39 +00:00
f8c45996d5 [MPS] Make erfinv compilable for bfloat16 (#128375)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128375
Approved by: https://github.com/Skylion007
ghstack dependencies: #128373
2024-06-11 16:04:11 +00:00
c13e03c874 Flip default value for mypy disallow_untyped_defs [10+2/11] (#128374)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128374
Approved by: https://github.com/Skylion007
2024-06-11 15:58:28 +00:00
053930e194 [MPS][BE] Remove code duplication (#128373)
Use `scalarToMetalTypeString` instead of `getMetalType`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128373
Approved by: https://github.com/Skylion007
2024-06-11 15:58:04 +00:00
9a38cae299 [AOTI] Switch to use shim v2 (#127674)
Differential Revision: D56709309

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127674
Approved by: https://github.com/desertfire
2024-06-11 15:01:25 +00:00
55901fb3da [fx] Preserve Fx graph node order in partitioner across runs (#115621)
Fixes #ISSUE_NUMBER
The partitioner generates a different graph on recompilation in each run.
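
The message does not say where the run-to-run difference came from. One common source of this kind of nondeterminism is iterating over an unordered `set`, and a usual remedy is to deduplicate while keeping first-seen order. The sketch below shows only that general technique; `ordered_unique` and the node names are hypothetical and are not taken from the partitioner.

```python
from typing import Hashable, Iterable, List


def ordered_unique(items: Iterable[Hashable]) -> List[Hashable]:
    """Deduplicate while keeping first-seen order (dicts preserve insertion order)."""
    return list(dict.fromkeys(items))


# Hypothetical node names; iterating a plain set of strings could yield
# a different order in different interpreter runs.
visited = ["conv", "relu", "conv", "add", "relu"]
stable_order = ordered_unique(visited)
```
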
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115621
Approved by: https://github.com/ezyang
2024-06-11 14:04:52 +00:00
fc77fdca6f [guard_size_oblivious] Add gso ExpandUtils:_sym_to (#128224)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128224
Approved by: https://github.com/ezyang
2024-06-11 14:01:34 +00:00
648625b230 Make TraceUtils.h to be device-agnostic (#126969)
Some features of third-party devices depend on TraceUtils.h, so some of the CUDA code was removed and split into NCCLUtils files.

In addition, some common functions still remain in TraceUtils.h since I'm not sure if other devices will use them later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126969
Approved by: https://github.com/c-p-i-o
2024-06-11 08:38:07 +00:00
207c2248a8 [inductor] Fix lowering full with SymBool value (#128213)
Fixes #128161, fixes #128095

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128213
Approved by: https://github.com/lezcano
2024-06-11 08:33:35 +00:00
a206dcc79e fb_memcache: Move to fbcode from thirdparty (#128174)
Summary: The fb_memcache injections location and path is changing.

Test Plan: Existing tests should pass.

Reviewed By: bertmaher, oulgen

Differential Revision: D57973772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128174
Approved by: https://github.com/oulgen
2024-06-11 07:46:12 +00:00
f2d7f235a6 [dynamo][yolov3] Track UnspecializedNNModuleVariable for mutation (#128269)
Fixes https://github.com/pytorch/pytorch/issues/101168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128269
Approved by: https://github.com/jansel
ghstack dependencies: #128295, #126578, #128268, #128254
2024-06-11 07:09:04 +00:00
402b289f3b Properly register parameter for binary folding test (#128356)
This PR properly registers the tensor used in the module compute as a parameter. This bug was previously hidden because dynamo considered all tensors on nn modules to be constant; with inlining of NN modules, this is no longer the case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128356
Approved by: https://github.com/anijain2305
ghstack dependencies: #128355
2024-06-11 06:48:26 +00:00
a32157c67c Mark params static if inlining modules and freezing (#128355)
Today inlining builtin nn modules is not compatible with parameter freezing. Freezing parameters and then constant folding them through the graph relies on the assumption that they will not be inputs and will be static across calls to the same graph. When inlining builtin nn modules this assumption is broken and we reuse the same graph for different instances of the same nn module. There are three options: 1) abandon constant folding, 2) create a dispatcher layer (like cudagraphs) which will dispatch to the correct constant-folded graph for each distinct set of parameters, or 3) recompile.

This PR implements option 3 by introducing guards on the parameter pointers. This choice was made because freezing is relatively rare and performance sensitive. Option 2 had many more unknowns, and option 1 is not viable due to the drop in performance.
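
To make the "guard on parameter identity, recompile on mismatch" idea concrete, here is a toy model in plain Python. It is a sketch under stated assumptions: `FrozenGraphCache` and `fold_constants` are hypothetical, `id()` stands in for a parameter pointer, and one-element lists stand in for parameter tensors. It does not reflect how dynamo guards are actually implemented.

```python
from typing import Callable, Dict, List, Sequence, Tuple


class FrozenGraphCache:
    """Toy cache: one compiled artifact per distinct set of parameter objects."""

    def __init__(self, compile_fn: Callable[[Sequence[list]], Callable]):
        self._compile_fn = compile_fn
        self._entries: Dict[Tuple[int, ...], Tuple[Callable, List[list]]] = {}
        self.compile_count = 0

    def lookup(self, params: Sequence[list]) -> Callable:
        key = tuple(id(p) for p in params)  # the "guard": object identity
        if key not in self._entries:
            self.compile_count += 1  # guard failed, so recompile
            # Keep references to the params so their ids cannot be reused.
            self._entries[key] = (self._compile_fn(params), list(params))
        return self._entries[key][0]


def fold_constants(params: Sequence[list]) -> Callable[[float], float]:
    scale = params[0][0] * params[1][0]  # folded once, at "compile" time
    return lambda x: x * scale


cache = FrozenGraphCache(fold_constants)
w1, b1 = [2.0], [5.0]
w2, b2 = [3.0], [5.0]
f1 = cache.lookup([w1, b1])
f1_again = cache.lookup([w1, b1])  # same objects: reuse
f2 = cache.lookup([w2, b2])        # different objects: recompile
```

Calls with the same parameter objects reuse the folded function, while a second module instance with different parameters triggers exactly one more compilation.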

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128355
Approved by: https://github.com/anijain2305
2024-06-11 06:48:26 +00:00
24e7f29099 Lowering for avg_pool_3d_backward (Fixes:#127101) (#127722)
We implemented a lowering for the avg_pool3d_backward operation and created tests for it.
We ran some benchmarks and achieved the following results:

```
[-------------- avgpool_3d_backwards --------------]
                             |  Decomposed  |  Eager
16 threads: ----------------------------------------
      (3, 5, 400, 200, 200)  |     6061     |  11160
      (3, 5, 300, 200, 200)  |     4547     |   8372
      (3, 5, 200, 200, 200)  |     3032     |   5585
      (3, 5, 300, 300, 300)  |    10100     |  18840
      (3, 5, 100, 100, 100)  |      381     |    703
      (3, 5, 100, 300, 200)  |     2270     |   4190
      (8, 8, 128, 128, 128)  |     3397     |   6253
      (2, 3, 150, 150, 150)  |      520     |    947
      (1, 3, 128, 128, 128)  |      161     |    299
      (8, 16, 64, 64, 64)    |      851     |   1569
      (1, 1, 50, 50, 50)     |       17     |     11
      (3, 5, 20, 40, 40)     |       17     |     30
      (3, 5, 10, 20, 20)     |       17     |     11
      (1, 1, 10, 10, 10)     |       16     |     11
      (3, 5, 5, 10, 10)      |       17     |     11
      (3, 5, 2, 5, 5)        |       17     |     11
```
These were run on an RTX 3050, so we were not able to allocate larger tensors due to memory limitations.
We believe it would be beneficial to benchmark this on more recent hardware, just to check if the performance holds up with larger sizes.

Furthermore, we also refactored code from adaptive_avg_pool2d and adaptive_max_pool2d, to reduce code duplication.
We diffed the kernels and they are identical.

Fixes #127101

Co-authored-by: Martim Mendes <martimccmendes@tecnico.ulisboa.pt>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127722
Approved by: https://github.com/jansel
2024-06-11 06:39:04 +00:00
5b5d269d34 Speed up fx graph iteration by implementing it in C++ (#128288)
Before this change
```
python benchmarks/dynamo/microbenchmarks/fx_microbenchmarks.py
iterating over 100000000 FX nodes took 19.5s (5132266 nodes/s)
```

After this change
```
python benchmarks/dynamo/microbenchmarks/fx_microbenchmarks.py
iterating over 100000000 FX nodes took 3.4s (29114001 nodes/s)
```

5.7x improvement
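
The quoted factor can be checked from the two benchmark lines above. The snippet below just redoes the arithmetic; the variable names are ours.

```python
# Figures copied from the before/after benchmark output.
before_seconds, after_seconds = 19.5, 3.4
before_rate, after_rate = 5_132_266, 29_114_001  # nodes per second

speedup_from_time = before_seconds / after_seconds
speedup_from_rate = after_rate / before_rate

print(f"{speedup_from_time:.2f}x by wall time, {speedup_from_rate:.2f}x by throughput")
```

The two ratios differ slightly because the reported times are rounded to one decimal place; both round to 5.7.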

Differential Revision: [D58343997](https://our.internmc.facebook.com/intern/diff/D58343997)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128288
Approved by: https://github.com/jansel, https://github.com/albanD
2024-06-11 05:48:31 +00:00
fa88f390a0 Revert "[inductor] enable fx graph cache on torchbench (#128239)"
This reverts commit 734e8f6ad7e7f0fa0341fb658f1f986225173f5f.

Reverted https://github.com/pytorch/pytorch/pull/128239 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to surface a bunch of inductor failures in trunk 734e8f6ad7 ([comment](https://github.com/pytorch/pytorch/pull/128239#issuecomment-2159789242))
2024-06-11 04:53:38 +00:00
fe39c07826 [pipelining][doc] Remove duplicated words (#128368)
"for execution" is used in both step titles

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128368
Approved by: https://github.com/wconstab
ghstack dependencies: #128361
2024-06-11 04:52:57 +00:00
cba195c8ed Support aten operations with out tensor (#124926)
This PR intends to support the aten operations with the `out` tensor.

Currently, the AOT compile always does **NOT** keep input tensor mutations. According to the comments, this is because it has not encountered such a use case.
> For now there's no use case involving keeping input mutations in the graph (which we can only do in the inference case anyway). We can add this later if we need to.

However, for aten operations, it is common for the `out` tensor to be an input parameter that needs to be mutated. This PR intends to support it by adding a `keep_inference_input_mutations` flag to `aot_inductor.keep_inference_input_mutations`. This flag can provide flexibility to the callee in deciding whether the AOT compile needs to keep input tensor mutations in the graph.

Take `clamp` as an example as follows.
```python
out_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(-2.0)
inp_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(1.0)
min_tensor = inp_tensor - 0.05
max_tensor = inp_tensor + 0.05
torch.clamp(input=inp_tensor, min=min_tensor, max=max_tensor, out=out_tensor)
```

W/O this PR
```python
def forward(self):
    arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";

    arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
    clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1);  arg0_1 = arg1_1 = None
    clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1);  clamp_min = arg2_1 = None
    return (clamp_max, clamp_max)
```

W/ this PR
```python
def forward(self):
    arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";

    arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
    clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1);  arg0_1 = arg1_1 = None
    clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1);  clamp_min = arg2_1 = None
    copy_: "f32[128]" = torch.ops.aten.copy_.default(arg3_1, clamp_max);  arg3_1 = clamp_max = None
    return (copy_,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124926
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/angelayi
2024-06-11 04:35:27 +00:00
16e67be7f1 Also preserve unbacked SymInts when partitioning as backward inputs (#128338)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128338
Approved by: https://github.com/IvanKobzarev
2024-06-11 04:27:09 +00:00
7afffdf48b [CI] Comment hf_T5_generate, hf_GPT2 and timm_efficientnet in inductor cpu smoketest for performance unstable issue (#127588)
Fixes #126993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127588
Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/desertfire
2024-06-11 03:12:11 +00:00
ca45649eb5 [easy][dynamo][inline work] Fix test with inlining inbuilt nn modules (#128254)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128254
Approved by: https://github.com/williamwen42
ghstack dependencies: #128295, #126578, #128268
2024-06-11 03:02:51 +00:00
665e568381 [inductor][inlining nn module] Skip batchnorm version check test for inlining (#128268)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128268
Approved by: https://github.com/zou3519
ghstack dependencies: #128295, #126578
2024-06-11 03:02:51 +00:00
4077cdd589 [pipelining][doc] Update arg list of pipeline API (#128361)
And document the use of `build_stage` API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128361
Approved by: https://github.com/wconstab
2024-06-11 02:55:17 +00:00
cyy
e4bd0adca5 [6/N] Remove unused functions (#128309)
Follows #127185

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128309
Approved by: https://github.com/ezyang
2024-06-11 02:46:33 +00:00
793df7b7cb Prevent expansion of cat indexing to avoid int64 intermediate (#127815)
Fix for https://github.com/pytorch/pytorch/issues/127652

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127815
Approved by: https://github.com/shunting314, https://github.com/peterbell10
2024-06-11 02:41:07 +00:00
d1d9bc7aa6 init add comment (#128083)
Fixes #127898

### Description

Add docstring to torch/onnx/symbolic_opset9.py:sigmoid function

### Checklist
- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128083
Approved by: https://github.com/titaiwangms
2024-06-11 02:37:04 +00:00
841d87177a Make sure #126704 is BC for torch.save-ed nn.Module (#128344)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128344
Approved by: https://github.com/albanD
ghstack dependencies: #126906, #126704
2024-06-11 02:26:06 +00:00
3b555ba477 Add docstring for torch.utils.data.datapipes.decoder.basichandlers (#128018)
Fixes #127912

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128018
Approved by: https://github.com/andrewkho
2024-06-11 01:32:45 +00:00
734e8f6ad7 [inductor] enable fx graph cache on torchbench (#128239)
Summary: We've already enabled for timm and huggingface, but we had failures saving cache entries for moco. It looks like https://github.com/pytorch/pytorch/pull/128052 has fixed that issue, so we can enable for torchbench.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128239
Approved by: https://github.com/oulgen
2024-06-11 00:40:31 +00:00
cyy
99f5a85a09 [Clang Tidy] Fix misc-header-include-cycle errors in clang-tidy and ignore some files (#127233)
There are such cycles in libfmt and PyTorch, and they are detected by clang-tidy:
```
/home/cyy/pytorch/third_party/fmt/include/fmt/format-inl.h:25:10: error: circular header file dependency detected while including 'format.h', please check the include path [misc-header-include-cycle,-warnings-as-errors]
   25 | #include "format.h"
      |          ^
/home/cyy/pytorch/third_party/fmt/include/fmt/format.h:4530:12: note: 'format-inl.h' included from here
 4530 | #  include "format-inl.h"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127233
Approved by: https://github.com/ezyang
2024-06-10 23:49:58 +00:00
f843ccbb1a [MTIA] Add set_device support (#128040)
Summary: Support set_device API in MTIA backend.

Reviewed By: gnahzg

Differential Revision: D58089498

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128040
Approved by: https://github.com/gnahzg
2024-06-10 23:42:52 +00:00
cyy
30875953a4 [1/N] Remove inclusion of c10/util/string_utils.h (#128300)
This is a first step toward removing it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128300
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-06-10 23:40:47 +00:00
cyy
2126ae186e Remove caffe2/perfkernels files (#128186)
These files are not used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128186
Approved by: https://github.com/ezyang, https://github.com/r-barnes
2024-06-10 23:40:18 +00:00
739aa224ec [Fix] Parameter un/lifting issues in the TorchScript to ExportedProgram converter (#127975)
This PR fixes issues related to parameters and inputs lifting in the converter.

#### Issue 1
```
> Graph[linear.weights, bias.weights, x.1]
%1 ...
%2 ...
%3 = CreateObject()

	> Block 0[]
        %linear.0 = GetAttr(linear)[%3]

	             > Block 0.0[]
	             %weight.0 = GetAttr(weights)[%linear.0]

	> Block 1[]
	...
```
* Model parameters for the top level module should be unlifted, while parameters from sub-blocks should be lifted.
#### Fixes
* Use a bottom-up traversal (i.e., start from the innermost block) to figure out which parameters need to be lifted for sub-blocks.
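
The bottom-up traversal can be pictured as a post-order walk over nested blocks. The sketch below is an assumption-laden toy, not the converter's code: `Block`, `used_params`, and `collect_lifted_params` are hypothetical, and it assumes that a block must receive every parameter read by itself or by any block nested inside it.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Block:
    name: str
    # Parameters read directly in this block (e.g. via GetAttr).
    used_params: Set[str] = field(default_factory=set)
    children: List["Block"] = field(default_factory=list)


def collect_lifted_params(block: Block, result: Dict[str, Set[str]]) -> Set[str]:
    """Post-order walk: children are resolved before their parent."""
    needed = set(block.used_params)
    for child in block.children:
        needed |= collect_lifted_params(child, result)
    result[block.name] = needed
    return needed


inner = Block("block0.0", {"m1.linear.weight"})
outer = Block("block0", {"linear.weight"}, [inner])
top = Block("graph", set(), [outer, Block("block1")])

lifted: Dict[str, Set[str]] = {}
collect_lifted_params(top, lifted)
```

Starting from the innermost block means each parent already knows what its children need by the time its own set is computed, so a single pass is enough.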

#### Test Plan
* Add test cases for nested block without control flow `pytest test/export/test_converter.py -s -k test_convert_nn_module_with_nested_param`
* Add test cases for nested block with control flow `pytest test/export/test_converter.py -s -k test_convert_nn_module_with_nested_if_and_param`

#### Outcome
##### TorchScript
```
graph(%x.1 : Float(3, strides=[1], requires_grad=0, device=cpu),
      %m1.m1.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu),
      %m1.m1.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu),
      %m1.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu),
      %m1.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu),
      %m1.m2.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu),
      %m1.m2.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu),
      %linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu),
      %linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu),
      %m2.m1.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu),
      %m2.m1.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu),
      %m2.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu),
      %m2.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu),
      %m2.m2.linear.weight : Float(3, 3, strides=[3, 1], requires_grad=0, device=cpu),
      %m2.m2.linear.bias : Float(3, strides=[1], requires_grad=0, device=cpu)):
  %15 : __torch__.export.test_converter.___torch_mangle_14.SuperNestedM1 = prim::CreateObject()
  %16 : NoneType = prim::Constant(), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
  %17 : int = prim::Constant[value=1](), scope: export.test_converter.SuperNestedM1:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:342:34
  %18 : Tensor = aten::max(%x.1), scope: export.test_converter.SuperNestedM1:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:342:19
  %19 : Tensor = aten::gt(%18, %17), scope: export.test_converter.SuperNestedM1:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:342:19
  %20 : bool = aten::Bool(%19), scope: export.test_converter.SuperNestedM1:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:342:19
  %21 : Tensor = prim::If(%20), scope: export.test_converter.SuperNestedM1:: # /data/users/jiashenc/pytorch/test/export/test_converter.py:342:16
    block0():
      %linear.6 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%15), scope: export.test_converter.SuperNestedM1::
      %m1.1 : __torch__.export.test_converter.___torch_mangle_15.NestedM = prim::GetAttr[name="m1"](%15), scope: export.test_converter.SuperNestedM1::
      %24 : Tensor = aten::sum(%x.1, %16), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19
      %25 : Tensor = aten::gt(%24, %17), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19
      %26 : bool = aten::Bool(%25), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19
      %27 : Tensor = prim::If(%26), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:16
        block0():
          %linear.10 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m1.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %m1.3 : __torch__.export.test_converter.___torch_mangle_16.M = prim::GetAttr[name="m1"](%m1.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %linear.12 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m1.3), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %weight.4 : Tensor = prim::GetAttr[name="weight"](%linear.12), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %bias.4 : Tensor = prim::GetAttr[name="bias"](%linear.12), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %33 : Tensor = aten::linear(%x.1, %weight.4, %bias.4), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
          %weight.6 : Tensor = prim::GetAttr[name="weight"](%linear.10), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %bias.6 : Tensor = prim::GetAttr[name="bias"](%linear.10), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %36 : Tensor = aten::linear(%33, %weight.6, %bias.6), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
          -> (%36)
        block1():
          %linear.14 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m1.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %m2.3 : __torch__.export.test_converter.___torch_mangle_16.M = prim::GetAttr[name="m2"](%m1.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %linear.16 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m2.3), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %weight.8 : Tensor = prim::GetAttr[name="weight"](%linear.16), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %bias.8 : Tensor = prim::GetAttr[name="bias"](%linear.16), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %42 : Tensor = aten::linear(%x.1, %weight.8, %bias.8), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
          %weight.2 : Tensor = prim::GetAttr[name="weight"](%linear.14), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %bias.2 : Tensor = prim::GetAttr[name="bias"](%linear.14), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1
          %45 : Tensor = aten::linear(%42, %weight.2, %bias.2), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m1 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
          -> (%45)
      %weight.10 : Tensor = prim::GetAttr[name="weight"](%linear.6), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear
      %bias.10 : Tensor = prim::GetAttr[name="bias"](%linear.6), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear
      %48 : Tensor = aten::linear(%27, %weight.10, %bias.10), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
      -> (%48)
    block1():
      %linear.8 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%15), scope: export.test_converter.SuperNestedM1::
      %m2.1 : __torch__.export.test_converter.___torch_mangle_15.NestedM = prim::GetAttr[name="m2"](%15), scope: export.test_converter.SuperNestedM1::
      %51 : Tensor = aten::sum(%x.1, %16), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19
      %52 : Tensor = aten::gt(%51, %17), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19
      %53 : bool = aten::Bool(%52), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:19
      %54 : Tensor = prim::If(%53), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/test/export/test_converter.py:327:16
        block0():
          %linear.1 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m2.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %m1 : __torch__.export.test_converter.___torch_mangle_16.M = prim::GetAttr[name="m1"](%m2.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %linear.5 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %weight.1 : Tensor = prim::GetAttr[name="weight"](%linear.5), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %bias.1 : Tensor = prim::GetAttr[name="bias"](%linear.5), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %60 : Tensor = aten::linear(%x.1, %weight.1, %bias.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
          %weight.3 : Tensor = prim::GetAttr[name="weight"](%linear.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %bias.3 : Tensor = prim::GetAttr[name="bias"](%linear.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %63 : Tensor = aten::linear(%60, %weight.3, %bias.3), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
          -> (%63)
        block1():
          %linear.3 : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m2.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %m2 : __torch__.export.test_converter.___torch_mangle_16.M = prim::GetAttr[name="m2"](%m2.1), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %linear : __torch__.torch.nn.modules.linear.___torch_mangle_17.Linear = prim::GetAttr[name="linear"](%m2), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %weight.5 : Tensor = prim::GetAttr[name="weight"](%linear), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %bias.5 : Tensor = prim::GetAttr[name="bias"](%linear), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %69 : Tensor = aten::linear(%x.1, %weight.5, %bias.5), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
          %weight.12 : Tensor = prim::GetAttr[name="weight"](%linear.3), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %bias.12 : Tensor = prim::GetAttr[name="bias"](%linear.3), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2
          %72 : Tensor = aten::linear(%69, %weight.12, %bias.12), scope: export.test_converter.SuperNestedM1::/export.test_converter.NestedM::m2 # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
          -> (%72)
      %weight : Tensor = prim::GetAttr[name="weight"](%linear.8), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear
      %bias : Tensor = prim::GetAttr[name="bias"](%linear.8), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear
      %75 : Tensor = aten::linear(%54, %weight, %bias), scope: export.test_converter.SuperNestedM1::/torch.nn.modules.linear.Linear::linear # /data/users/jiashenc/pytorch/torch/nn/modules/linear.py:116:15
      -> (%75)
  return (%21)
```
##### ExportedProgram
```
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, p_linear_weight: "f32[3, 3]", p_linear_bias: "f32[3]", p_m1_linear_weight: "f32[3, 3]", p_m1_linear_bias: "f32[3]", p_m1_m1_linear_weight: "f32[3, 3]", p_m1_m1_linear_bias: "f32[3]", p_m1_m2_linear_weight: "f32[3, 3]", p_m1_m2_linear_bias: "f32[3]", p_m2_linear_weight: "f32[3, 3]", p_m2_linear_bias: "f32[3]", p_m2_m1_linear_weight: "f32[3, 3]", p_m2_m1_linear_bias: "f32[3]", p_m2_m2_linear_weight: "f32[3, 3]", p_m2_m2_linear_bias: "f32[3]", x_1: "f32[3]"):
            # No stacktrace found for following nodes
            max_1: "f32[]" = torch.ops.aten.max.default(x_1)
            gt: "b8[]" = torch.ops.aten.gt.Scalar(max_1, 1);  max_1 = None

            # File: <eval_with_key>.137:23 in forward, code: cond = torch.ops.higher_order.cond(l_args_0_, cond_true_2, cond_false_2, [l_args_3_0_, l_args_3_13_, l_args_3_5_, l_args_3_12_, l_args_3_14_, l_args_3_1_, l_args_3_3_, l_args_3_4_, l_args_3_7_, l_args_3_10_, l_args_3_11_, l_args_3_2_, l_args_3_6_, l_args_3_8_, l_args_3_9_]);  l_args_0_ = cond_true_2 = cond_false_2 = l_args_3_0_ = l_args_3_13_ = l_args_3_5_ = l_args_3_12_ = l_args_3_14_ = l_args_3_1_ = l_args_3_3_ = l_args_3_4_ = l_args_3_7_ = l_args_3_10_ = l_args_3_11_ = l_args_3_2_ = l_args_3_6_ = l_args_3_8_ = l_args_3_9_ = None
            true_graph_0 = self.true_graph_0
            false_graph_0 = self.false_graph_0
            conditional = torch.ops.higher_order.cond(gt, true_graph_0, false_graph_0, [p_linear_weight, p_linear_bias, x_1, p_m1_linear_weight, p_m1_m1_linear_bias, p_m1_linear_bias, p_m1_m2_linear_weight, p_m1_m2_linear_bias, p_m1_m1_linear_weight, p_m2_m2_linear_bias, p_m2_m1_linear_weight, p_m2_linear_weight, p_m2_m1_linear_bias, p_m2_m2_linear_weight, p_m2_linear_bias]);  gt = true_graph_0 = false_graph_0 = p_linear_weight = p_linear_bias = x_1 = p_m1_linear_weight = p_m1_m1_linear_bias = p_m1_linear_bias = p_m1_m2_linear_weight = p_m1_m2_linear_bias = p_m1_m1_linear_weight = p_m2_m2_linear_bias = p_m2_m1_linear_weight = p_m2_linear_weight = p_m2_m1_linear_bias = p_m2_m2_linear_weight = p_m2_linear_bias = None
            getitem: "f32[3]" = conditional[0];  conditional = None
            return (getitem,)

        class <lambda>(torch.nn.Module):
            def forward(self, p_linear_weight: "f32[3, 3]", p_linear_bias: "f32[3]", x_1: "f32[3]", p_m1_linear_weight: "f32[3, 3]", p_m1_m1_linear_bias: "f32[3]", p_m1_linear_bias: "f32[3]", p_m1_m2_linear_weight: "f32[3, 3]", p_m1_m2_linear_bias: "f32[3]", p_m1_m1_linear_weight: "f32[3, 3]", p_m2_m2_linear_bias: "f32[3]", p_m2_m1_linear_weight: "f32[3, 3]", p_m2_linear_weight: "f32[3, 3]", p_m2_m1_linear_bias: "f32[3]", p_m2_m2_linear_weight: "f32[3, 3]", p_m2_linear_bias: "f32[3]"):
                # File: <eval_with_key>.134:8 in forward, code: sum_default = torch.ops.aten.sum.default(l_args_3_5__1, dtype = None)
                sum_1: "f32[]" = torch.ops.aten.sum.default(x_1)

                # File: <eval_with_key>.134:9 in forward, code: gt_scalar = torch.ops.aten.gt.Scalar(sum_default, 1);  sum_default = None
                gt: "b8[]" = torch.ops.aten.gt.Scalar(sum_1, 1);  sum_1 = None

                # File: <eval_with_key>.134:12 in forward, code: cond = torch.ops.higher_order.cond(gt_scalar, cond_true_0, cond_false_0, [l_args_3_12__true_branch, l_args_3_1__true_branch, l_args_3_5__1, l_args_3_14__true_branch, l_args_3_7__true_branch, l_args_3_3__true_branch, l_args_3_4__true_branch]);  gt_scalar = cond_true_0 = cond_false_0 = l_args_3_12__true_branch = l_args_3_1__true_branch = l_args_3_5__1 = l_args_3_14__true_branch = l_args_3_7__true_branch = l_args_3_3__true_branch = l_args_3_4__true_branch = None
                true_graph_0 = self.true_graph_0
                false_graph_0 = self.false_graph_0
                conditional = torch.ops.higher_order.cond(gt, true_graph_0, false_graph_0, [p_m1_linear_weight, p_m1_linear_bias, x_1, p_m1_m1_linear_bias, p_m1_m1_linear_weight, p_m1_m2_linear_weight, p_m1_m2_linear_bias]);  gt = true_graph_0 = false_graph_0 = p_m1_linear_weight = p_m1_linear_bias = x_1 = p_m1_m1_linear_bias = p_m1_m1_linear_weight = p_m1_m2_linear_weight = p_m1_m2_linear_bias = None
                getitem: "f32[3]" = conditional[0];  conditional = None

                # File: <eval_with_key>.134:14 in forward, code: linear_default = torch.ops.aten.linear.default(getitem, l_args_3_0__1, l_args_3_13__1);  getitem = l_args_3_0__1 = l_args_3_13__1 = None
                linear: "f32[3]" = torch.ops.aten.linear.default(getitem, p_linear_weight, p_linear_bias);  getitem = p_linear_weight = p_linear_bias = None
                return (linear,)

            class <lambda>(torch.nn.Module):
                def forward(self, p_m1_linear_weight: "f32[3, 3]", p_m1_linear_bias: "f32[3]", x_1: "f32[3]", p_m1_m1_linear_bias: "f32[3]", p_m1_m1_linear_weight: "f32[3, 3]", p_m1_m2_linear_weight: "f32[3, 3]", p_m1_m2_linear_bias: "f32[3]"):
                    # File: <eval_with_key>.130:8 in forward, code: linear_default = torch.ops.aten.linear.default(l_args_3_5__1, l_args_3_7__true_branch, l_args_3_14__true_branch);  l_args_3_5__1 = l_args_3_7__true_branch = l_args_3_14__true_branch = None
                    linear: "f32[3]" = torch.ops.aten.linear.default(x_1, p_m1_m1_linear_weight, p_m1_m1_linear_bias);  x_1 = p_m1_m1_linear_weight = p_m1_m1_linear_bias = None

                    # File: <eval_with_key>.130:9 in forward, code: linear_default_1 = torch.ops.aten.linear.default(linear_default, l_args_3_12__1, l_args_3_1__1);  linear_default = l_args_3_12__1 = l_args_3_1__1 = None
                    linear_1: "f32[3]" = torch.ops.aten.linear.default(linear, p_m1_linear_weight, p_m1_linear_bias);  linear = p_m1_linear_weight = p_m1_linear_bias = None
                    return (linear_1,)

            class <lambda>(torch.nn.Module):
                def forward(self, p_m1_linear_weight: "f32[3, 3]", p_m1_linear_bias: "f32[3]", x_1: "f32[3]", p_m1_m1_linear_bias: "f32[3]", p_m1_m1_linear_weight: "f32[3, 3]", p_m1_m2_linear_weight: "f32[3, 3]", p_m1_m2_linear_bias: "f32[3]"):
                    # File: <eval_with_key>.131:8 in forward, code: linear_default = torch.ops.aten.linear.default(l_args_3_5__1, l_args_3_3__false_branch, l_args_3_4__false_branch);  l_args_3_5__1 = l_args_3_3__false_branch = l_args_3_4__false_branch = None
                    linear: "f32[3]" = torch.ops.aten.linear.default(x_1, p_m1_m2_linear_weight, p_m1_m2_linear_bias);  x_1 = p_m1_m2_linear_weight = p_m1_m2_linear_bias = None

                    # File: <eval_with_key>.131:9 in forward, code: linear_default_1 = torch.ops.aten.linear.default(linear_default, l_args_3_12__1, l_args_3_1__1);  linear_default = l_args_3_12__1 = l_args_3_1__1 = None
                    linear_1: "f32[3]" = torch.ops.aten.linear.default(linear, p_m1_linear_weight, p_m1_linear_bias);  linear = p_m1_linear_weight = p_m1_linear_bias = None
                    return (linear_1,)

        class <lambda>(torch.nn.Module):
            def forward(self, p_linear_weight: "f32[3, 3]", p_linear_bias: "f32[3]", x_1: "f32[3]", p_m1_linear_weight: "f32[3, 3]", p_m1_m1_linear_bias: "f32[3]", p_m1_linear_bias: "f32[3]", p_m1_m2_linear_weight: "f32[3, 3]", p_m1_m2_linear_bias: "f32[3]", p_m1_m1_linear_weight: "f32[3, 3]", p_m2_m2_linear_bias: "f32[3]", p_m2_m1_linear_weight: "f32[3, 3]", p_m2_linear_weight: "f32[3, 3]", p_m2_m1_linear_bias: "f32[3]", p_m2_m2_linear_weight: "f32[3, 3]", p_m2_linear_bias: "f32[3]"):
                # File: <eval_with_key>.135:8 in forward, code: sum_default = torch.ops.aten.sum.default(l_args_3_5__1, dtype = None)
                sum_1: "f32[]" = torch.ops.aten.sum.default(x_1)

                # File: <eval_with_key>.135:9 in forward, code: gt_scalar = torch.ops.aten.gt.Scalar(sum_default, 1);  sum_default = None
                gt: "b8[]" = torch.ops.aten.gt.Scalar(sum_1, 1);  sum_1 = None

                # File: <eval_with_key>.135:12 in forward, code: cond = torch.ops.higher_order.cond(gt_scalar, cond_true_1, cond_false_1, [l_args_3_2__false_branch, l_args_3_5__1, l_args_3_9__false_branch, l_args_3_11__false_branch, l_args_3_6__false_branch, l_args_3_10__false_branch, l_args_3_8__false_branch]);  gt_scalar = cond_true_1 = cond_false_1 = l_args_3_2__false_branch = l_args_3_5__1 = l_args_3_9__false_branch = l_args_3_11__false_branch = l_args_3_6__false_branch = l_args_3_10__false_branch = l_args_3_8__false_branch = None
                true_graph_0 = self.true_graph_0
                false_graph_0 = self.false_graph_0
                conditional = torch.ops.higher_order.cond(gt, true_graph_0, false_graph_0, [p_m2_linear_weight, x_1, p_m2_linear_bias, p_m2_m1_linear_weight, p_m2_m1_linear_bias, p_m2_m2_linear_bias, p_m2_m2_linear_weight]);  gt = true_graph_0 = false_graph_0 = p_m2_linear_weight = x_1 = p_m2_linear_bias = p_m2_m1_linear_weight = p_m2_m1_linear_bias = p_m2_m2_linear_bias = p_m2_m2_linear_weight = None
                getitem: "f32[3]" = conditional[0];  conditional = None

                # File: <eval_with_key>.135:14 in forward, code: linear_default = torch.ops.aten.linear.default(getitem, l_args_3_0__1, l_args_3_13__1);  getitem = l_args_3_0__1 = l_args_3_13__1 = None
                linear: "f32[3]" = torch.ops.aten.linear.default(getitem, p_linear_weight, p_linear_bias);  getitem = p_linear_weight = p_linear_bias = None
                return (linear,)

            class <lambda>(torch.nn.Module):
                def forward(self, p_m2_linear_weight: "f32[3, 3]", x_1: "f32[3]", p_m2_linear_bias: "f32[3]", p_m2_m1_linear_weight: "f32[3, 3]", p_m2_m1_linear_bias: "f32[3]", p_m2_m2_linear_bias: "f32[3]", p_m2_m2_linear_weight: "f32[3, 3]"):
                    # File: <eval_with_key>.132:8 in forward, code: linear_default = torch.ops.aten.linear.default(l_args_3_5__1, l_args_3_11__true_branch, l_args_3_6__true_branch);  l_args_3_5__1 = l_args_3_11__true_branch = l_args_3_6__true_branch = None
                    linear: "f32[3]" = torch.ops.aten.linear.default(x_1, p_m2_m1_linear_weight, p_m2_m1_linear_bias);  x_1 = p_m2_m1_linear_weight = p_m2_m1_linear_bias = None

                    # File: <eval_with_key>.132:9 in forward, code: linear_default_1 = torch.ops.aten.linear.default(linear_default, l_args_3_2__1, l_args_3_9__1);  linear_default = l_args_3_2__1 = l_args_3_9__1 = None
                    linear_1: "f32[3]" = torch.ops.aten.linear.default(linear, p_m2_linear_weight, p_m2_linear_bias);  linear = p_m2_linear_weight = p_m2_linear_bias = None
                    return (linear_1,)

            class <lambda>(torch.nn.Module):
                def forward(self, p_m2_linear_weight: "f32[3, 3]", x_1: "f32[3]", p_m2_linear_bias: "f32[3]", p_m2_m1_linear_weight: "f32[3, 3]", p_m2_m1_linear_bias: "f32[3]", p_m2_m2_linear_bias: "f32[3]", p_m2_m2_linear_weight: "f32[3, 3]"):
                    # File: <eval_with_key>.133:8 in forward, code: linear_default = torch.ops.aten.linear.default(l_args_3_5__1, l_args_3_8__false_branch, l_args_3_10__false_branch);  l_args_3_5__1 = l_args_3_8__false_branch = l_args_3_10__false_branch = None
                    linear: "f32[3]" = torch.ops.aten.linear.default(x_1, p_m2_m2_linear_weight, p_m2_m2_linear_bias);  x_1 = p_m2_m2_linear_weight = p_m2_m2_linear_bias = None

                    # File: <eval_with_key>.133:9 in forward, code: linear_default_1 = torch.ops.aten.linear.default(linear_default, l_args_3_2__1, l_args_3_9__1);  linear_default = l_args_3_2__1 = l_args_3_9__1 = None
                    linear_1: "f32[3]" = torch.ops.aten.linear.default(linear, p_m2_linear_weight, p_m2_linear_bias);  linear = p_m2_linear_weight = p_m2_linear_bias = None
                    return (linear_1,)

Graph signature: ExportGraphSignature(input_specs=[InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_linear_weight'), target='linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_linear_bias'), target='linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_linear_weight'), target='m1.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_linear_bias'), target='m1.linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_m1_linear_weight'), target='m1.m1.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_m1_linear_bias'), target='m1.m1.linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_m2_linear_weight'), target='m1.m2.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m1_m2_linear_bias'), target='m1.m2.linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_linear_weight'), target='m2.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_linear_bias'), target='m2.linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_m1_linear_weight'), target='m2.m1.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_m1_linear_bias'), target='m2.m1.linear.bias', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_m2_linear_weight'), target='m2.m2.linear.weight', persistent=None), InputSpec(kind=<InputKind.PARAMETER: 2>, arg=TensorArgument(name='p_m2_m2_linear_bias'), target='m2.m2.linear.bias', persistent=None), InputSpec(kind=<InputKind.USER_INPUT: 1>, arg=TensorArgument(name='x_1'), 
target=None, persistent=None)], output_specs=[OutputSpec(kind=<OutputKind.USER_OUTPUT: 1>, arg=TensorArgument(name='getitem'), target=None)])
Range constraints: {}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127975
Approved by: https://github.com/angelayi, https://github.com/ydwu4
2024-06-10 23:24:16 +00:00
b2d602306a [RELAND][dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578)
Tracing through `__init__` is important because it initializes members (via STORE_ATTR), which kicks in mutation tracking for those objects. As a result, mutations to attributes such as `_modules` are tracked automatically.
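
As a rough illustration of the STORE_ATTR point (plain CPython bytecode only, not Dynamo itself), the sketch below disassembles a hypothetical `Tiny.__init__` and counts the attribute stores a tracer would see. The class and its attribute names are assumptions made up for this example.

```python
import dis


class Tiny:
    """Hypothetical class; only here to show what __init__ compiles to."""

    def __init__(self):
        self.weight = 1.0  # compiles to a STORE_ATTR instruction
        self.bias = 0.0    # compiles to a second STORE_ATTR instruction


# Collect the opcode names of __init__ and count the attribute stores.
opnames = [ins.opname for ins in dis.get_instructions(Tiny.__init__)]
store_attr_count = opnames.count("STORE_ATTR")
print(store_attr_count)
```

Each assignment to `self.<name>` appears as one STORE_ATTR, which is the point where a bytecode tracer can start tracking the mutated object.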

Fixes https://github.com/pytorch/pytorch/issues/111837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126578
Approved by: https://github.com/jansel
ghstack dependencies: #128295
2024-06-10 23:11:04 +00:00
05711eece9 [dynamo][inlining inbuilt modules] Ensure BC for nn_module_stack (#128295)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128295
Approved by: https://github.com/ydwu4
2024-06-10 23:11:04 +00:00
a287ff75d0 Use init_torchbind_implementations in inductor torchbind tests. (#128341)
Summary: To unify how we load the torch bind libraries for testing.

Test Plan: Existing tests.

Differential Revision: D58372372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128341
Approved by: https://github.com/angelayi
2024-06-10 23:02:48 +00:00
4bbadeee8a Revert "Set simdlen based on ATEN_CPU_CAPABILITY (#123514)"
This reverts commit b66e3f0957b96b058c9b632ca60833d9717a9d8a.

Reverted https://github.com/pytorch/pytorch/pull/123514 on behalf of https://github.com/clee2000 due to broke test/inductor/test_torchinductor.py::CpuTests::test_new_cpp_build_logical_cpu on periodic test on the no gpu tests b66e3f0957 https://github.com/pytorch/pytorch/actions/runs/9453518547/job/26040077301 ([comment](https://github.com/pytorch/pytorch/pull/123514#issuecomment-2159433432))
2024-06-10 22:46:01 +00:00
2176ef7dfa [compiled autograd] support .backward(inputs=) (#128252)
Autograd already marks nodes as needed or not before calling compiled autograd, so our worklist already skips nodes not specified in the `inputs` kwarg.

For the .backward(inputs=) case, I'm keeping the grads as outputs, just like for .grad(inputs=), so that we still guard on graph_output when we collect the nodes. This does not get DCE'd right now, and is ignored in the post-graph bytecode.
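
A toy model of the "mark needed, then skip in the worklist" idea, assuming a hand-written backward graph. `EDGES`, `mark_needed` and `run_worklist` are hypothetical names for this sketch and do not correspond to the autograd or compiled autograd internals.

```python
from collections import deque

# Toy backward graph: each node maps to the nodes its gradient flows into.
EDGES = {
    "loss": ["mul"],
    "mul": ["x", "y"],
    "x": [],
    "y": [],
}


def mark_needed(edges, root, inputs):
    """Mark a node as needed if it is a requested input or leads to one."""
    needed = {}

    def visit(node):
        if node not in needed:
            # Build the list first so every child gets marked, then reduce.
            child_flags = [visit(child) for child in edges[node]]
            needed[node] = node in inputs or any(child_flags)
        return needed[node]

    visit(root)
    return needed


def run_worklist(edges, root, inputs):
    """Visit nodes breadth-first, skipping the ones that were not marked."""
    needed = mark_needed(edges, root, inputs)
    visited, seen, queue = [], {root}, deque([root])
    while queue:
        node = queue.popleft()
        if not needed[node]:
            continue  # not on a path to any requested input
        visited.append(node)
        for child in edges[node]:
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return visited


print(run_worklist(EDGES, "loss", {"x"}))
```

With `inputs={"x"}`, the leaf `y` is never visited because it was marked as not needed before the traversal started.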

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128252
Approved by: https://github.com/jansel
2024-06-10 22:20:51 +00:00
583a56d5a8 DOC: add docstring to construct_and_record_rdzv_event() (#128189)
Fixes #127902

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128189
Approved by: https://github.com/kurman
2024-06-10 22:17:33 +00:00
c38b3381a1 Make nn.Module state_dict load_state_dict pre-hook and state_dict post hook public (#126704)
Fixes https://github.com/pytorch/pytorch/issues/75287 and https://github.com/pytorch/pytorch/issues/117437

- `nn.Module._register_state_dict_hook` --> add public `nn.Module.register_state_dict_post_hook`
  - Add a test as this API was previously untested
- `nn.Module._register_load_state_dict_pre_hook` --> add public `nn.Module.register_load_state_dict_pre_hook` (remove the `with_module` flag, default it to `True`)
  - ~For consistency with optimizer `load_state_dict_pre_hook` raised by @janeyx99, allow the pre-hook to return a new `state_dict`~
- Document issue pointed out by https://github.com/pytorch/pytorch/issues/117437 regarding the `_register_state_dict_hook` semantic of returning a new state_dict only being respected for the root for the private hook
  - Remove this for the public `register_state_dict_post_hook`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126704
Approved by: https://github.com/albanD
ghstack dependencies: #126906
2024-06-10 21:50:17 +00:00
a2d4fea872 [easy] Move state_dict hooks tests to test_module_hooks and decorate tests that call load_state_dict with swap (#126906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126906
Approved by: https://github.com/albanD
2024-06-10 21:50:17 +00:00
58083ffb10 Improve unbacked reasoning involving has internal overlap (#128332)
Fixes https://github.com/pytorch/pytorch/issues/122477
Partially addresses https://github.com/pytorch/pytorch/issues/116336

This PR is slightly overkill: not only does it disable the overlap test
when there are unbacked SymInts, it also improves the is non-overlapping
and dense test for some more unbacked situations.  We technically don't
need the latter change, but I was already deep in the sauce and just
went ahead and did it.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128332
Approved by: https://github.com/lezcano
2024-06-10 21:49:38 +00:00
6630dcd53c Add docstring for the torch.serialization.default_restore_location function (#128132)
Fixes: #127887

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128132
Approved by: https://github.com/mikaylagawarecki
2024-06-10 21:33:56 +00:00
3a2d0755a4 enable test_ParameterList with dynamo if nn module inlining enabled only (#128308)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128308
Approved by: https://github.com/anijain2305
2024-06-10 21:25:40 +00:00
b459713ca7 [aota] compiled forward outputs requires_grad alignment with eager (#128016)
Original issue: https://github.com/pytorch/pytorch/issues/114338

We assume only two possible, mutually exclusive scenarios:

1. Running the compiled region for training (any of the inputs has requires_grad)
	- Produced differentiable outputs should have requires_grad.

2. Running the compiled region for inference (none of the inputs has requires_grad)
	- No outputs have requires_grad.

Even if the user runs the region under no_grad(), an input Tensor with requires_grad puts us in the training scenario (1).

In the current state, that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only whether any of the inputs requires_grad
2/ if needs_autograd => trace_joint (we are in training scenario 1) => always run the compiled region under `with torch.enable_grad()`
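
The decision rule above can be sketched without torch as follows. `FakeTensor`, `needs_autograd` and `pick_scenario` are hypothetical stand-ins for illustration, not the AOTAutograd code.

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    """Hypothetical stand-in for a tensor; only carries requires_grad."""
    requires_grad: bool


def needs_autograd(inputs, grad_enabled):
    # grad_enabled is deliberately ignored: per the rule above, only the
    # inputs' requires_grad flags decide between training and inference.
    return any(t.requires_grad for t in inputs)


def pick_scenario(inputs, grad_enabled):
    """Return which of the two mutually exclusive scenarios applies."""
    return "training" if needs_autograd(inputs, grad_enabled) else "inference"


# An input with requires_grad under no_grad() still selects training.
print(pick_scenario([FakeTensor(True), FakeTensor(False)], grad_enabled=False))
```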

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128016
Approved by: https://github.com/bdhirsh
2024-06-10 20:51:22 +00:00
4460e481bc Disable jacrev/jacfwd/hessian if compiling with dynamo (#128255)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128255
Approved by: https://github.com/zou3519
2024-06-10 20:47:53 +00:00
90bb510ece Revert "Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)"
This reverts commit 348b181a97abc2e636a6c18e5880a78e5d1dab94.

Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/clee2000 due to sorry I think https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456 is still relevant, I will reach out to them to see what needs to be done in internal to get this remerged ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2159248859))
2024-06-10 20:44:42 +00:00
38e0a0440c [AMD] Default to hipblaslt in gemm (#127944)
Summary: It has been a constant pain that we have to specify an env var to take the hipblaslt path. The default path is very slow on MI300. Therefore, let's default to hipblaslt.

Differential Revision: D58150764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127944
Approved by: https://github.com/aaronenyeshi, https://github.com/houseroad
2024-06-10 19:55:21 +00:00
946f554c8f Flip default value for mypy disallow_untyped_defs [10+1/11] (#128293)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128293
Approved by: https://github.com/oulgen
2024-06-10 19:32:44 +00:00
55646554b7 [EZ] Fix typos in SECURITY.md (#128340)
permisisons -> permissions
lates -> latest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128340
Approved by: https://github.com/clee2000, https://github.com/atalman, https://github.com/kit1980
2024-06-10 19:21:39 +00:00
9cab5987bd Introduce int_oo (#127693)
In a previous life, we used sympy.oo to represent the lower/upper bounds of integer ranges. Later, we changed this to be sys.maxsize - 1 for a few reasons: (1) sometimes we do tests on a value being exactly sys.maxsize, and we wanted to avoid a data dependent guard in this case, (2) sympy.oo corresponds to floating point infinity, so you get incorrect types for value ranges with oo, and (3) you can do slightly better reasoning if you assume that input sizes fall within representable 64-bit integer range.

After working in the sys.maxsize regime for a bit, I've concluded that this was actually a bad idea. Specifically, the problem is that you end up with sys.maxsize in your upper bound, and then whenever you do any sort of size-increasing computation like size * 2, you end up with 2 * sys.maxsize, and you end up doing a ton of arbitrary precision int computation that is totally unnecessary. A symbolic bound is better.

But especially after #126905, we can't go back to using sympy.oo, because that advertises that it's not an integer, and now your ValueRanges is typed incorrectly. So what do we do? We define a new numeric constant `int_oo`, which is like `sympy.oo` but it advertises `is_integer`. **test/test_sympy_utils.py** describes some basic properties of the number, and **torch/utils/_sympy/numbers.py** has the actual implementation.
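
A minimal sketch of the motivation, assuming a toy class rather than the real `int_oo` from torch/utils/_sympy/numbers.py: doubling a concrete `sys.maxsize - 1` bound leaves the native integer range, while a symbolic integer infinity absorbs the multiplication. `ToyIntInfinity` is a hypothetical name used only here.

```python
import functools
import sys


@functools.total_ordering
class ToyIntInfinity:
    """Toy stand-in for an integer-typed infinity; not torch's int_oo."""

    is_integer = True  # unlike sympy.oo, advertises itself as an integer

    def __eq__(self, other):
        return isinstance(other, ToyIntInfinity)

    def __lt__(self, other):
        return False  # nothing compares greater than infinity

    def __hash__(self):
        return hash("ToyIntInfinity")

    def __mul__(self, factor):
        # Multiplying by a positive int leaves the bound unchanged.
        if isinstance(factor, int) and factor > 0:
            return self
        return NotImplemented

    __rmul__ = __mul__


# Concrete bound: doubling it produces a value past sys.maxsize.
concrete_upper = sys.maxsize - 1
doubled_concrete = concrete_upper * 2

# Symbolic bound: doubling it returns the same object.
symbolic_upper = ToyIntInfinity()
doubled_symbolic = symbolic_upper * 2

print(doubled_concrete > sys.maxsize, doubled_symbolic is symbolic_upper)
```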

The rest of the changes of the PR are working out the implications of this change. I'll give more commentary as inline comments.

Fixes https://github.com/pytorch/pytorch/issues/127396

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127693
Approved by: https://github.com/lezcano
ghstack dependencies: #126905
2024-06-10 19:09:53 +00:00
db2fa7b827 Revert "[export] FIx unflattener for preserving modules containing unused inputs (#128260)"
This reverts commit 093a4ff5f859ccbbd8ba62dd189f76e5faadfb04.

Reverted https://github.com/pytorch/pytorch/pull/128260 on behalf of https://github.com/angelayi due to breaking windows test ([comment](https://github.com/pytorch/pytorch/pull/128260#issuecomment-2159050726))
2024-06-10 18:42:33 +00:00
093a4ff5f8 [export] FIx unflattener for preserving modules containing unused inputs (#128260)
Currently the unflattener fails if the module whose signature it's preserving contains unused inputs/outputs.

This also fixes unflattener issues in D57829276.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128260
Approved by: https://github.com/pianpwk
2024-06-10 18:39:33 +00:00
fa8ec8e718 [dynamo] handle hashable exceptions in trace_rules lookup (#128078)
Summary: Found during user empathy day when attempting to hash a fractions.Fraction object before it was fully constructed. See https://github.com/pytorch/pytorch/issues/128075
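
A hedged sketch of the general pattern, not the actual trace_rules code: a dict lookup calls `__hash__`, so a half-constructed object can raise during the lookup, and the lookup has to treat that as "no rule found". `HalfBuilt`, `safe_rule_lookup` and `RULES` are hypothetical names for this example.

```python
class HalfBuilt:
    """Hypothetical object whose __hash__ fails, like a not-yet-initialized Fraction."""

    def __hash__(self):
        raise AttributeError("object is not fully constructed yet")


def safe_rule_lookup(rules, obj, default=None):
    """Look up obj in a rules dict, tolerating objects that fail to hash."""
    try:
        return rules.get(obj, default)
    except Exception:
        # hash(obj) raised inside dict.get; fall back to the default rule.
        return default


RULES = {len: "builtin"}

print(safe_rule_lookup(RULES, len))
print(safe_rule_lookup(RULES, HalfBuilt(), default="unknown"))
```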

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128078
Approved by: https://github.com/anijain2305
2024-06-10 18:23:22 +00:00
136bdb96cb Update Kineto submodule with fix to test_basic_chrome_trace (#128333)
Summary: We've updated the sort_index in Kineto chrome traces to support device ids up to 16 devices. This should make chrome trace rows be ordered in the same way as CUDA. We need to update the unit test as well.

Test Plan:
Ran the changed test locally:
```
$ buck2 test 'fbcode//mode/opt' fbcode//caffe2/test:test_profiler_cuda -- --exact 'caffe2/test:test_profiler_cuda - test_basic_chrome_trace (profiler.test_profiler.TestProfiler)'
File changed: fbcode//caffe2/third_party/kineto.submodule.txt
Buck UI: https://www.internalfb.com/buck2/f4fd1e9a-99f1-4422-aeed-b54903c64146
Test UI: https://www.internalfb.com/intern/testinfra/testrun/16888498639845776
Network: Up: 5.4KiB  Down: 8.6KiB  (reSessionID-0329120e-7fa2-4bc0-b539-7e58058f8fce)
Jobs completed: 6. Time elapsed: 1:01.2s.
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D58362964

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128333
Approved by: https://github.com/Skylion007
2024-06-10 18:12:34 +00:00
83941482f7 Add docstring for the torch.distributed.elastic.utils.distributed.get_free_port function (#128133)
Fixes: #127914

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128133
Approved by: https://github.com/H-Huang
2024-06-10 18:10:58 +00:00
08d038f8a8 [PT2] Fix a typo and lint problem (#128258)
Summary: Titled

Test Plan: see signal

Differential Revision: D58310169

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128258
Approved by: https://github.com/dshi7, https://github.com/Yuzhen11
2024-06-10 18:03:40 +00:00
46948300a2 [c10d] integrate PMI NCCL initialization to NCCL-PG (#128243)
Summary: Move broadcastUniqueID check to NCCLUtils

Differential Revision: D58273755

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128243
Approved by: https://github.com/wconstab
2024-06-10 17:20:03 +00:00
ab3a0b192a [RFC] add per-collective timeout value in flight recorder (#128190)
Summary:
Add timeout value field on every collected record.

Test Plan:
Unit tests


Pull Request resolved: https://github.com/pytorch/pytorch/pull/128190
Approved by: https://github.com/wconstab
2024-06-10 17:12:57 +00:00
8e482e909b Add some guard to size oblivious has_internal_overlap (#128328)
This doesn't actually help on
https://github.com/pytorch/pytorch/issues/122477 but I noticed this
modest improvement so sure, why not.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128328
Approved by: https://github.com/Skylion007
2024-06-10 17:11:26 +00:00
7b9c5e0e3f Turn on GraphTransformObserver for inductor (#127962)
The FX graphs for some PT2 models are very complicated, Inductor usually goes through many passes of graph optimization to generate the final FX graph. It’s very difficult to see the change in each pass, and check if the optimized graph is correct and optimal.

GraphTransformObserver is an observer that listens to all add/erase node events on a GraphModule during a graph transform pass and saves the changed nodes. When the pass is done, and if there is any change in the graph, GraphTransformObserver saves SVG files of the input graph and the output graph for that pass.
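
The observer idea can be sketched with a toy graph, assuming nothing about torch.fx. `ToyGraph` and `ToyTransformObserver` are hypothetical classes that only record which nodes a pass added or erased; the SVG dumping step is left out.

```python
class ToyGraph:
    """Toy graph that notifies listeners on node add/erase events."""

    def __init__(self, nodes=()):
        self.nodes = list(nodes)
        self._listeners = []

    def subscribe(self, listener):
        self._listeners.append(listener)

    def add_node(self, name):
        self.nodes.append(name)
        for listener in self._listeners:
            listener.on_node_added(name)

    def erase_node(self, name):
        self.nodes.remove(name)
        for listener in self._listeners:
            listener.on_node_erased(name)


class ToyTransformObserver:
    """Records the nodes changed by one pass and a snapshot of the input graph."""

    def __init__(self, graph, pass_name):
        self.pass_name = pass_name
        self.input_nodes = list(graph.nodes)  # snapshot before the pass runs
        self.added, self.erased = [], []
        graph.subscribe(self)

    def on_node_added(self, name):
        self.added.append(name)

    def on_node_erased(self, name):
        self.erased.append(name)

    def changed(self):
        return bool(self.added or self.erased)


graph = ToyGraph(["x", "add", "relu"])
observer = ToyTransformObserver(graph, "fuse_add_relu")

# A made-up fusion pass: replace add + relu with one fused node.
graph.erase_node("add")
graph.erase_node("relu")
graph.add_node("fused_add_relu")

print(observer.changed(), observer.added, observer.erased)
```

A real observer would only write the before/after graphs when `changed()` is true, which matches the behavior described above.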

This PR is to enable GraphTransformObserver for inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127962
Approved by: https://github.com/jansel
2024-06-10 16:49:02 +00:00
ca561d639b Revert "Fix 'get_attr' call in dynamo 'run_node' (#127696)"
This reverts commit b741819b0580204e6a6b60c62ce44dacaf7787c8.

Reverted https://github.com/pytorch/pytorch/pull/127696 on behalf of https://github.com/clee2000 due to broke (executorch?) internal tests D58295865 ([comment](https://github.com/pytorch/pytorch/pull/127696#issuecomment-2158820093))
2024-06-10 16:29:20 +00:00
d22287d1ad Revert "Fix 'get_real_value' on placeholder nodes (#127698)"
This reverts commit 19b31d899a78a6806314bcc73b88172dabf0c26e.

Reverted https://github.com/pytorch/pytorch/pull/127698 on behalf of https://github.com/clee2000 due to broke (executorch?) internal tests D58295865 ([comment](https://github.com/pytorch/pytorch/pull/127696#issuecomment-2158820093))
2024-06-10 16:29:20 +00:00
3b73f5de3a Revert "Add OpInfo entry for alias_copy (#127232) (#128142)"
This reverts commit 04da6aeb61f4d57bf73ed1054dd897abbcceca83.

Reverted https://github.com/pytorch/pytorch/pull/128142 on behalf of https://github.com/DanilBaibak due to The changes broke the test_output_match_alias_copy_cpu_complex64 test. ([comment](https://github.com/pytorch/pytorch/pull/128142#issuecomment-2158793878))
2024-06-10 16:17:16 +00:00
c993f1b37f Fix edge cases for gather in inductor (#126893)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126893
Approved by: https://github.com/peterbell10
ghstack dependencies: #126876
2024-06-10 15:31:03 +00:00
04da6aeb61 Add OpInfo entry for alias_copy (#127232) (#128142)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128142
Approved by: https://github.com/lezcano
2024-06-10 15:01:53 +00:00
b66e3f0957 Set simdlen based on ATEN_CPU_CAPABILITY (#123514)
This is part of https://github.com/pytorch/pytorch/issues/123224. Set simdlen based on the environment variable ATEN_CPU_CAPABILITY to control the CPU vec ISA, as eager does.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123514
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-06-10 09:02:14 +00:00
df43d5843e fix miss isa bool check (#128274)
The new cpp builder missed the ISA bool (dry-compile) check.
@jgong5 found that this was missing, so I submitted this PR to fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128274
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-06-10 02:45:46 +00:00
cyy
26f6a87ae9 [5/N] Remove unused functions (#127185)
Follows #128193

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127185
Approved by: https://github.com/ezyang
2024-06-10 01:57:49 +00:00
d3817d8a60 Don't create python tuple when _maybe_handle_torch_function is called from C++ (#128187)
Marginal overhead reduction when calling through the `torch.ops` API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128187
Approved by: https://github.com/lezcano
ghstack dependencies: #128183, #128184, #128185
2024-06-10 00:16:59 +00:00
cd2ad29afe [inductor] Reduce binding overhead of _reinterpret_tensor (#128185)
Going through the dispatcher + pybind11 + torch.ops adds about 2 us overhead
per call compared to `PyArgParser`.

Note that views of inputs are reconstructed by AOTAutograd before being returned
to the python code, so dispatching for autograd's sake shouldn't be required
here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128185
Approved by: https://github.com/lezcano
ghstack dependencies: #128183, #128184
2024-06-09 23:33:03 +00:00
253fa9c711 [AOTAutograd] Remove runtime import from view replay function (#128184)
`gen_alias_from_base`, which is called for each view in the graph output,
spends about 0.5 us in this import statement.
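
To illustrate the kind of overhead being removed, here is a minimal stdlib-only sketch that compares a function-local import with a module-level one. The function names are hypothetical and `math` merely stands in for the real import; absolute timings depend on the machine and interpreter.

```python
import math
import timeit


def with_local_import(x):
    # The import statement executes on every call. After the first call it is
    # a sys.modules lookup plus a local name binding: cheap, but not free.
    import math
    return math.floor(x)


def with_module_import(x):
    # The module was bound once at load time; the call only does the work.
    return math.floor(x)


if __name__ == "__main__":
    n = 200_000
    local = timeit.timeit(lambda: with_local_import(1.5), number=n)
    hoisted = timeit.timeit(lambda: with_module_import(1.5), number=n)
    print(f"local import:  {local / n * 1e6:.3f} us/call")
    print(f"module import: {hoisted / n * 1e6:.3f} us/call")
```

Both functions return the same value; only the per-call cost differs.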
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128184
Approved by: https://github.com/lezcano
ghstack dependencies: #128183
2024-06-09 23:33:03 +00:00
55b2a0a002 [AOTAutograd] Use _set_grad_enabled instead of no_grad (#128183)
This saves ~1us of overhead from each inductor graph call.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128183
Approved by: https://github.com/lezcano
2024-06-09 23:33:03 +00:00
5e7377e044 [Dynamo][TVM] Make the opt_level parameter adjustable (#127876)
Fixes #127874

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127876
Approved by: https://github.com/jansel
2024-06-09 21:38:00 +00:00
c7e2c9c37e [c10d][doc] add a doc page for NCCL ENVs (#128235)
Addressing issue: https://github.com/pytorch/pytorch/issues/128204

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128235
Approved by: https://github.com/wconstab
2024-06-09 16:08:38 +00:00
0bf2fe522a [RFC] Provide optional switches to _dump_nccl_trace (#127651)
Summary:
Data from PyTorch distributed is mostly useful during the initial stages of model development.
Provide options to reduce the data sent/dumped.
`_dump_nccl_trace` takes 3 optional switches. With the defaults it returns everything, as before.
- `includeCollectives`: whether to include collectives. Default is True.
- `includeStacktraces`: whether to include stack traces in collectives. Default is True.
- `onlyActive`: whether to send only active collective work, i.e. work that has not completed. Default is
    False (i.e. send everything).
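
A toy model of how the three switches could combine, as a minimal sketch only. This is not the c10d implementation: the function name `filter_trace` and the record keys `state` and `frames` are assumptions made up for illustration, and the real dump format may differ.

```python
def filter_trace(entries, include_collectives=True,
                 include_stack_traces=True, only_active=False):
    """Filter hypothetical trace records according to the three switches."""
    if not include_collectives:
        # No collective records at all.
        return []
    result = []
    for entry in entries:
        if only_active and entry.get("state") == "completed":
            # Skip work that has already finished.
            continue
        record = dict(entry)  # copy so the caller's data is untouched
        if not include_stack_traces:
            record.pop("frames", None)
        result.append(record)
    return result


records = [
    {"op": "all_reduce", "state": "completed", "frames": ["train.py:10"]},
    {"op": "all_gather", "state": "started", "frames": ["train.py:12"]},
]
active_without_stacks = filter_trace(
    records, include_stack_traces=False, only_active=True
)
```

With the defaults the input comes back unchanged, which matches the "returns everything" behaviour described above.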

Test Plan:
Unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127651
Approved by: https://github.com/wconstab
2024-06-09 14:00:57 +00:00
75b0720a97 Revert "Use hidden visibility in OBJECTCXX files (#127265)"
This reverts commit 669560d51aa1e81ebd09e2aa8288d0d314407d82.

Reverted https://github.com/pytorch/pytorch/pull/127265 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but I suspect that it causes this failure https://github.com/pytorch/vision/issues/8478 on vision where its C++ extension could not be loaded on macOS ([comment](https://github.com/pytorch/pytorch/pull/127265#issuecomment-2156401838))
2024-06-09 09:05:17 +00:00
eqy
4c971932e8 [cuDNN][SDPA] Remove TORCH_CUDNN_SDPA_ENABLED=1, enable cuDNN SDPA by default on H100 and 2nd on other archs >= sm80 (#125343)
Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes.

What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here...

Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343
Approved by: https://github.com/Skylion007
2024-06-09 06:53:34 +00:00
3964a3ec73 Complete revamp of float/promotion sympy handling (#126905)
At a high level, the idea behind this PR is:

* Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc in sympy, instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.)
* Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers.

The story begins in **torch/utils/_sympy/functions.py**. Here, I make some changes to how we represent certain operations in sympy expressions:

* FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing).
* ModularIndexing, LShift, RShift now assert they are given integer inputs.
* Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver.
* TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows us to eventually generate accurate code for Python-semantics IntTrueDiv, which is written in a special way to preserve more precision when the inputs are >= 2**53 than first coercing the integers to floats and then doing true division would.
* Trunc is split to TruncToFloat and TruncToInt.
* Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result.
* RoundDecimal updated to consistently only ever return a float
* Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing)

In **torch/__init__.py**, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations.  Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information.
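
The difference from `builtins.max` can be sketched in plain Python. `promoting_max` below is a hypothetical stand-in that only models the promotion rule described above on ordinary ints and floats; it is not `torch.sym_max`.

```python
import builtins


def promoting_max(a, b):
    """Toy model: the result type depends only on the argument types."""
    result = builtins.max(a, b)
    if isinstance(a, float) or isinstance(b, float):
        # If either argument is a float, always return a float.
        return float(result)
    return result


builtin_result = builtins.max(2, 1.5)    # returns the int argument itself
promoted_result = promoting_max(2, 1.5)  # same value, promoted to float
```

With the builtin, the result type depends on which argument happens to win at runtime; with promotion it can be determined from the argument types alone.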

We also need to introduce some new op handlers in **torch/_inductor/ops_handler.py**:

* `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy
* `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv`
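
The precision claim for integer true division can be checked in plain Python. This is a sketch of the arithmetic only, not of the Inductor handler: it relies on CPython's `int / int` being rounded from the exact quotient, whereas `float(a) / float(b)` rounds each operand first. The helper names and the operands are chosen for illustration.

```python
from fractions import Fraction


def int_truediv(a: int, b: int) -> float:
    # Python's int / int rounds the exact quotient once.
    return a / b


def float_truediv(a: int, b: int) -> float:
    # Rounds a and b to floats first, then divides.
    return float(a) / float(b)


# Both operands exceed 2**53, so neither is exactly representable as a float.
a = 2**53 + 1
b = 2**53 + 3

exact = Fraction(a, b)
err_int = abs(Fraction(int_truediv(a, b)) - exact)
err_float = abs(Fraction(float_truediv(a, b)) - exact)

if __name__ == "__main__":
    print("int_truediv  :", int_truediv(a, b), "error", float(err_int))
    print("float_truediv:", float_truediv(a, b), "error", float(err_float))
```

For these operands the two results differ in the last bits, and the integer path is closer to the exact rational quotient.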

These changes have consequences. First, we need to make some administrative changes:

* Actually wire up these Sympy functions from SymInt/SymFloat in **torch/fx/experimental/sym_node.py**, including the new promotion rules (promote2)
* Add support for new Sympy functions in **torch/utils/_sympy/interp.py**, **torch/utils/_sympy/reference.py**
  * In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function
  * TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency to fix tests here
* Add printer support for the Sympy functions in **torch/_inductor/codegen/common.py**, **torch/_inductor/codegen/cpp_utils.py**, **torch/_inductor/codegen/triton.py**. `int_truediv` and mixed precision equality are currently not implemented soundly, so we will lose precision in codegen for large values. TODO: The additions here are not exhaustive yet
* Update ValueRanges logic to use new sympy functions in **torch/utils/_sympy/value_ranges.py**. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions.

In **torch/fx/experimental/symbolic_shapes.py** we need to make some symbolic reasoning adjustments:

* Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. This simplification then triggers an addition simplification rule `(x + y) / c --> x / c + y / c` which is bad because x / c is a rational number now
* `_assert_bound_is_rational` is no more, we no longer generate rational bounds
* Don't intersect non-int value ranges with the `int_range`
* Support more sympy Functions for guard SYMPY_INTERP
* Assert the type of value range is consistent with the variable type

The new asserts uncovered necessary bug fixes:

* **torch/_inductor/codegen/cpp.py**, **torch/_inductor/select_algorithm.py**, **torch/_inductor/sizevars.py** - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions
* **torch/_inductor/utils.py** - make sure you actually pass in sympy.Expr to these functions
* **torch/_inductor/ir.py** - make_contiguous_strides_for takes int/SymInt, not sympy.Expr!
* **torch/export/dynamic_shapes.py** - don't use infinity to represent int ranges, instead use sys.maxsize - 1

Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at **test/test_proxy_tensor.py**

**Reland notes.** This requires this internal fbcode diff https://www.internalfb.com/phabricator/paste/view/P1403322587 but I cannot prepare the diff codev due to https://fb.workplace.com/groups/osssupport/posts/26343544518600814/

It also requires this Executorch PR https://github.com/pytorch/executorch/pull/3911 but the ET PR can be landed prior to this landing.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905
Approved by: https://github.com/xadupre, https://github.com/lezcano
2024-06-09 06:20:25 +00:00
31c3fa6cf5 [audio hash update] update the pinned audio hash (#128178)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128178
Approved by: https://github.com/pytorchbot
2024-06-09 04:29:04 +00:00
cyy
7bfd1db53a [4/N] Change static functions in headers to inline (#128286)
Follows #128194.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128286
Approved by: https://github.com/Skylion007, https://github.com/XuehaiPan
2024-06-09 03:08:53 +00:00
f681e3689b [dtensor][experiment] experimenting with displaying distributed model parameters and printing sharding info (#127987)
**Summary**
Example code to display distributed model parameters and verify them against ground truth. Also prints sharding information.

**Test Plan**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/display_sharding_example.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127987
Approved by: https://github.com/XilunWu
ghstack dependencies: #127358, #127360, #127630
2024-06-09 00:14:07 +00:00
2c2cf1d779 [dtensor][experiment] experimenting with displaying model parameters (#127630)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

**Summary**
Example code to display model parameters and verify them against ground truth. Also expanded on moduletracker to accomplish this.

**Test Plan**
python3 torch/distributed/_tensor/examples/display_sharding_example.py

* #127987
* __->__ #127630
* #127360
* #127358

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127630
Approved by: https://github.com/XilunWu
ghstack dependencies: #127358, #127360
2024-06-09 00:14:07 +00:00
d34075e0bd Add Efficient Attention support on ROCM (#124885)
This patch implements `with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):` by reusing AOTriton's accelerated SDPA implementation

Known limitations:
- Only supports MI200/MI300X GPUs
- Does not support varlen
- Does not support `CausalVariant`
- Optional arguments `causal_diagonal` and `seqlen_k` in `_efficient_attention_forward/backward` must be null
- Does not work well with inductor's SDPA rewriter. The rewriter has been updated to only use math and flash attention on ROCM.

This PR also uses a different approach of installing AOTriton binary instead of building it from source in the base docker image. More details on motivation: https://github.com/pytorch/pytorch/pull/124885#issuecomment-2153229129

`PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda" python test/test_transformers.py` yields "55028 passed, 20784 skipped" results with this change.  [Previous result](https://hud.pytorch.org/pr/127528) of `test_transformers.py` was 0 error, 0 failure, 55229 skipped out of 75517 tests in total (the XML report does not contain total number of passed tests).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124885
Approved by: https://github.com/malfet
2024-06-08 22:41:05 +00:00
6e7a23475d [easy] Run autograd if any mutations on inputs that require grad (#128229)
If any inputs that require grad are mutated, we should still run autograd with a backwards graph, even if none of the outputs require grad. This fixes two tests: test_input_mutation_alias_everything and test_view_detach.

Fixes #128035
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128229
Approved by: https://github.com/aorenste
2024-06-08 21:18:38 +00:00
aee154edbe [Traceable FSDP2] Make FSDPParam._unsharded_param creation traceable (#127245)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127245
Approved by: https://github.com/awgu
2024-06-08 21:10:15 +00:00
0dd55ee159 Fix bug in _update_process_group API (#128262)
`local_used_map_` was undefined when `find_unused_parameters=False`, which resulted in an error when we ran `local_used_map_.fill_(0);`

Added a unit test as well

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128262
Approved by: https://github.com/awgu
2024-06-08 19:52:24 +00:00
3494f3f991 [dynamo] Skip inlining builtin nn modules for torch.compile inside cond (#128247)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128247
Approved by: https://github.com/ydwu4
ghstack dependencies: #128246
2024-06-08 19:20:00 +00:00
33972dfd58 [easy][inline-inbuilt-nn-modules] Fix expected graph for control flow test (#128246)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128246
Approved by: https://github.com/ydwu4
2024-06-08 19:20:00 +00:00
57536286e2 Flip default value for mypy disallow_untyped_defs [10/11] (#127847)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127847
Approved by: https://github.com/oulgen
ghstack dependencies: #127842, #127843, #127844, #127845, #127846
2024-06-08 18:50:06 +00:00
8db9dfa2d7 Flip default value for mypy disallow_untyped_defs [9/11] (#127846)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127846
Approved by: https://github.com/ezyang
ghstack dependencies: #127842, #127843, #127844, #127845
2024-06-08 18:50:06 +00:00
27f9d3b0a1 Flip default value for mypy disallow_untyped_defs [8/11] (#127845)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127845
Approved by: https://github.com/oulgen
ghstack dependencies: #127842, #127843, #127844
2024-06-08 18:49:56 +00:00
038b927590 Flip default value for mypy disallow_untyped_defs [7/11] (#127844)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127844
Approved by: https://github.com/oulgen
ghstack dependencies: #127842, #127843
2024-06-08 18:49:45 +00:00
7c12cc7ce4 Flip default value for mypy disallow_untyped_defs [6/11] (#127843)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127843
Approved by: https://github.com/oulgen
ghstack dependencies: #127842
2024-06-08 18:49:29 +00:00
3a0d088517 Flip default value for mypy disallow_untyped_defs [5/11] (#127842)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127842
Approved by: https://github.com/oulgen
2024-06-08 18:49:18 +00:00
62bcdc0ac9 Flip default value for mypy disallow_untyped_defs [4/11] (#127841)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127841
Approved by: https://github.com/oulgen
2024-06-08 18:36:48 +00:00
afe15d2d2f Flip default value for mypy disallow_untyped_defs [3/11] (#127840)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127840
Approved by: https://github.com/oulgen
2024-06-08 18:28:01 +00:00
ea614fb2b1 Flip default value for mypy disallow_untyped_defs [2/11] (#127839)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127839
Approved by: https://github.com/oulgen
2024-06-08 18:23:08 +00:00
dcfa7702c3 Flip default value for mypy disallow_untyped_defs [1/11] (#127838)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127838
Approved by: https://github.com/oulgen
2024-06-08 18:16:33 +00:00
2369c719d4 [DSD][BE] Cleanup unused variables and rename variables to avoid exposure to the users (#128249)
These APIs and variables should not be exposed to users as they are designed to be used internally.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128249
Approved by: https://github.com/wz337
2024-06-08 17:12:17 +00:00
02a901f1e9 Revert "[RFC] Provide optional switches to _dump_nccl_trace (#127651)"
This reverts commit 0a761f0627130e739f0e2748e3f71a0c347552c4.

Reverted https://github.com/pytorch/pytorch/pull/127651 on behalf of https://github.com/atalman due to Breaks internal CI ([comment](https://github.com/pytorch/pytorch/pull/127651#issuecomment-2156076838))
2024-06-08 15:30:04 +00:00
57a24c4fdb Revert "[RFC] add per-collective timeout value in flight recorder (#128190)"
This reverts commit 09cccbc1c74c9d1157c1caca5526e79ee9b7ea01.

Reverted https://github.com/pytorch/pytorch/pull/128190 on behalf of https://github.com/atalman due to Sorry need to revert this, in conflict with https://github.com/pytorch/pytorch/pull/127651 that needs reverting ([comment](https://github.com/pytorch/pytorch/pull/128190#issuecomment-2156075318))
2024-06-08 15:25:27 +00:00
348b181a97 Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)
This PR is split from PR #126898.

- #126898

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690
Approved by: https://github.com/Skylion007
2024-06-08 15:25:03 +00:00
917387f66d [AOTI] fix a constant tensor device move issue (#128265)
Summary: When copying a constant tensor to another device, `.to` returns a fake tensor and causes a problem when a real tensor is expected.

Test Plan: CI

Differential Revision: D58313034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128265
Approved by: https://github.com/chenyang78
2024-06-08 13:23:49 +00:00
cyy
695502ca65 [3/N] Change static functions in headers to inline (#128194)
Follows #127764

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128194
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-06-08 08:06:31 +00:00
73d6ec2db6 Increase verbosity of FX graph dumps (#128042)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128042
Approved by: https://github.com/aorenste
2024-06-08 07:24:58 +00:00
0e6c204642 [pipelining] Friendly error message when not traceable (#128276)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128276
Approved by: https://github.com/H-Huang
2024-06-08 06:36:11 +00:00
44371bd432 Revert "[dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578)"
This reverts commit 7ede78f9f5d7e6c993faa1a70a5f0b0eaec5640d.

Reverted https://github.com/pytorch/pytorch/pull/126578 on behalf of https://github.com/anijain2305 due to pippy tests fail ([comment](https://github.com/pytorch/pytorch/pull/126578#issuecomment-2155836555))
2024-06-08 06:35:34 +00:00
6e13c7e874 Revert "[dynamo] Support if cond on UnspecializedNNModuleVariable and add inline tests (#128158)"
This reverts commit 747fc35ff54154ddec2a5ab5661f57c28d65c591.

Reverted https://github.com/pytorch/pytorch/pull/128158 on behalf of https://github.com/anijain2305 due to pippy tests fail ([comment](https://github.com/pytorch/pytorch/pull/128158#issuecomment-2155835787))
2024-06-08 06:32:28 +00:00
94165dba7b Revert "[dynamo] Inline the getattr of fx graph and proxy graph (#128172)"
This reverts commit 662a78f957fb89e53ebeba7deb880561e10ecaf6.

Reverted https://github.com/pytorch/pytorch/pull/128172 on behalf of https://github.com/anijain2305 due to pippy tests fail ([comment](https://github.com/pytorch/pytorch/pull/128172#issuecomment-2155835201))
2024-06-08 06:29:36 +00:00
8a0bc8c9ee [fsdp2] simplify fsdp_param logic with DTensorSpec (#128242)
as titled, we can use a single DTensorSpec to save the SPMD sharding
spec, plus the global shape/stride to simplify the FSDPParam logic

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128242
Approved by: https://github.com/awgu
2024-06-08 05:56:41 +00:00
cbb7e3053f View specialization (#127641)
This PR adds specialization shortcuts for converting n-d to 1-d and 1-d to 2-d views.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127641
Approved by: https://github.com/ezyang
2024-06-08 05:52:52 +00:00
310f80995b Added memory budget to partitioner (#126320)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126320
Approved by: https://github.com/shunting314
2024-06-08 05:52:40 +00:00
ffc202a1b9 Added remove_noop_ops to joint_graph_passes (#124451)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124451
Approved by: https://github.com/ezyang, https://github.com/fmassa
2024-06-08 05:48:11 +00:00
c446851829 [fsdp2] update foreach_reduce accumulate_grad (#128117)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128117
Approved by: https://github.com/awgu
2024-06-08 05:13:57 +00:00
613c7d270d [pipelining] Format doc (#128279)
- Should use two dots around `var`
- Wrap lines
- Add section cross ref
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128279
Approved by: https://github.com/H-Huang
ghstack dependencies: #128273, #128278
2024-06-08 04:59:04 +00:00
2e42671619 [pipelining] Rename to stage.py and schedules.py (#128278)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128278
Approved by: https://github.com/H-Huang
ghstack dependencies: #128273
2024-06-08 04:42:35 +00:00
0e3fe694d1 [pipelining] Restore a stage constructor for tracer path (#128273)
If the user modified the stage module out of place, for example:
mod = DDP(mod)
mod = torch.compile(mod)

they need a stage builder other than `pipe.build_stage()`.

This PR provides an API to do so:
```
def build_stage(
  stage_module,
  stage_index,
  pipe.info(),
  ...
)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128273
Approved by: https://github.com/wconstab
2024-06-08 04:42:35 +00:00
8a45cf4c64 [AOTI] align data_size of the constants (#127610)
https://github.com/pytorch/pytorch/pull/124272 set the alignment of `consts_o`, but if the `data_size` of any tensor in `consts_o` is not divisible by the alignment, the tensors that follow it are no longer aligned, resulting in poor performance on CPU.
We align the `data_size` as well in this PR and pad the serialized bytes. Since the `size` of the tensor, rather than the `data_size`, is used when creating a tensor from the serialized bytes ([link](f4d7cdc5e6/torch/csrc/inductor/aoti_runtime/model.h (L236-L259))), there is no correctness issue. `data_size` is only used to record the [bytes_read](f4d7cdc5e6/torch/csrc/inductor/aoti_runtime/model.h (L217)).

This PR will improve the performance on CPU for 4 models in HF, 7 models in TIMM and 1 model in Torchbench.

For the unit test, I added a bias value whose original `data_size` is not divisible by the alignment, to test correctness:
```
constants_info_[0].dtype = static_cast<int32_t>(at::kFloat);
constants_info_[0].data_size = 64; // was 40 before this PR
constants_info_[0].shape = {10};

constants_info_[1].dtype = static_cast<int32_t>(at::kFloat);
......
```
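
The padding arithmetic behind the 40 -> 64 change can be sketched as follows. This is a minimal illustration, assuming a 64-byte alignment (consistent with the numbers above); `align_up` and `pad_bytes` are hypothetical helper names, not functions from the PR.

```python
def align_up(nbytes: int, alignment: int = 64) -> int:
    """Round nbytes up to the next multiple of alignment."""
    return (nbytes + alignment - 1) // alignment * alignment


def pad_bytes(raw: bytes, alignment: int = 64) -> bytes:
    """Append zero bytes so the serialized length is a multiple of alignment."""
    return raw + b"\x00" * (align_up(len(raw), alignment) - len(raw))


# A float32 bias with 10 elements occupies 10 * 4 = 40 bytes.
numel, itemsize = 10, 4
raw_size = numel * itemsize
padded_size = align_up(raw_size)

# The next tensor starts at an aligned offset only if the padded size is used.
next_offset_unpadded = raw_size
next_offset_padded = padded_size
```

Because only the first `numel * itemsize` bytes are read back into the tensor, the trailing zero padding does not affect its contents.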

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127610
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-08 04:31:00 +00:00
1d84c7e100 [DeviceMesh] Update get_group and add get_all_groups (#128097)
Fixes #121984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128097
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2024-06-08 04:28:56 +00:00
6e5c2a1a3b [inductor] Add missing files to torch_key (#128230)
Previously, no subdirs (like torch.inductor.codegen) were hashed.
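
Why this matters for a cache key can be shown with a small stdlib sketch. The two functions below are hypothetical and are not the `torch_key` implementation; they only contrast hashing the top-level files of a directory with hashing the whole tree.

```python
import hashlib
import os


def hash_top_level(root: str) -> str:
    """Hash only the files directly inside root (subdirectories are ignored)."""
    digest = hashlib.sha256()
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        if os.path.isfile(path):
            digest.update(name.encode())
            with open(path, "rb") as f:
                digest.update(f.read())
    return digest.hexdigest()


def hash_tree(root: str) -> str:
    """Hash every file under root, including files in subdirectories."""
    digest = hashlib.sha256()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            digest.update(os.path.relpath(path, root).encode())
            with open(path, "rb") as f:
                digest.update(f.read())
    return digest.hexdigest()
```

A key built like `hash_top_level` stays the same when a file in a subdirectory changes, so stale cache entries could be reused; `hash_tree` changes in that case.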

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128230
Approved by: https://github.com/oulgen
2024-06-08 03:26:48 +00:00
6220602943 [torchbind] support query schema of methods (#128267)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128267
Approved by: https://github.com/angelayi
2024-06-08 03:20:44 +00:00
0ef5229569 Revert "Change lerp decomp to use aten.as_strided_copy instead of prims.copy_strided (#128030)"
This reverts commit fdf1666b20f63e4acf01798f009e478d997a7f7f.

Reverted https://github.com/pytorch/pytorch/pull/128030 on behalf of https://github.com/nWEIdia due to breaking cuda12.1 test_cuda, see HUD https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor ([comment](https://github.com/pytorch/pytorch/pull/128030#issuecomment-2155764546))
2024-06-08 02:34:06 +00:00
f9508b4c1f [pipelining] Update Pipelining Docs (#128236)
----

- Bring PipelineStage/Schedule more front-and-center
- provide details on how to manually construct PipelineStage
- move tracer example and manual example below so the high-level flow
  (e2e) is closer to the top
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128236
Approved by: https://github.com/H-Huang
ghstack dependencies: #128201, #128228
2024-06-08 02:03:46 +00:00
fe74bbd6f0 init sigmoid comments (#127983)
Fixes #127913

### Description
Add docstring to `torch/onnx/symbolic_opset9.py`:`sigmoid` function

### Checklist

- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127983
Approved by: https://github.com/xadupre
2024-06-08 01:48:00 +00:00
921aa194c7 [pipelining] Move modify_graph_op_device to _IR.py (#128241)
This part is more IR related.
Thus moving from `PipelineStage` constructor to `pipe.build_stage(..., device, ...)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128241
Approved by: https://github.com/wconstab
ghstack dependencies: #128240
2024-06-08 01:35:07 +00:00
ad96f991a5 [pipelining] Add pipe.build_stage() (#128240)
The `PipelineStage` name has been given to the manual side.
This PR therefore adds a method under `Pipe` to create a PipelineStage.
Moved `PipeInfo` to utils.py to avoid circular dependency between `_IR` and `PipelineStage`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128240
Approved by: https://github.com/wconstab, https://github.com/H-Huang
2024-06-08 01:26:02 +00:00
5ef081031e [MPS] Include MPSGraphVenturaOps.h for complex types on macOS 12 (#127859)
Fixes this on macOS 12:

```
/Users/qqaatw/Forks/pytorch/aten/src/ATen/native/mps/operations/FastFourierTransform.mm:108:60: error: use of undeclared identifier 'MPSDataTypeComplexFloat16'; did you mean 'MPSDataTypeFloat16'?
            (inputTensor.dataType == MPSDataTypeFloat16) ? MPSDataTypeComplexFloat16 : MPSDataTypeComplexFloat32;
                                                           ^~~~~~~~~~~~~~~~~~~~~~~~~
                                                           MPSDataTypeFloat16
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127859
Approved by: https://github.com/kulinseth
2024-06-08 00:54:30 +00:00
647815049e Inductor: Allow small sizes of m for mixed mm autotuning (#127663)
For mixed mm with small sizes of m, such as in the example provided in #127056, being able to set BLOCK_M to 16 leads to better performance. This PR introduces kernel configs that are specific to mixed mm by extending the mm configs with two configs that work well for the example provided in #127056.
I am excluding configs with (BLOCK_M=16, BLOCK_K=16, BLOCK_N=64) because triton crashes when this config is used.

For the example in #127056:
- Without my changes, skip_triton is evaluated to true which disables autotuning. On my machine I achieve 146GB/s.
- If autotuning is enabled, but BLOCK_M>=32, I achieve 614 GB/s.
- With the changes in this PR (i.e. autotuning enabled and BLOCK_M=16), I achieve 772 GB/s.
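
For reference, the relative speedups implied by the three bandwidth figures above can be worked out directly. The numbers are taken from the list; the variable and function names are only for illustration.

```python
# Achieved memory bandwidth in GB/s, as reported above.
no_autotune = 146.0       # skip_triton evaluates to true
autotune_block_m32 = 614.0  # autotuning enabled, BLOCK_M >= 32
autotune_block_m16 = 772.0  # autotuning enabled, BLOCK_M = 16 (this PR)


def speedup(new: float, old: float) -> float:
    """Ratio of the new bandwidth to the old one."""
    return new / old


vs_baseline = speedup(autotune_block_m16, no_autotune)
vs_block_m32 = speedup(autotune_block_m16, autotune_block_m32)
autotune_only = speedup(autotune_block_m32, no_autotune)

if __name__ == "__main__":
    print(f"BLOCK_M=16 vs no autotuning: {vs_baseline:.2f}x")
    print(f"BLOCK_M=16 vs BLOCK_M>=32:   {vs_block_m32:.2f}x")
    print(f"autotuning alone:            {autotune_only:.2f}x")
```

Most of the gain comes from enabling autotuning at all; allowing BLOCK_M=16 adds roughly another quarter on top.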

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127663
Approved by: https://github.com/Chillee
2024-06-08 00:46:16 +00:00
cyy
ef2b5ed500 [4/N] Remove unused functions (#128193)
Follows #128179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128193
Approved by: https://github.com/ezyang
2024-06-08 00:09:26 +00:00
39dd4740e6 [inductor][dynamo-inline-nn-modules] Fix test with inlining flag (#128200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128200
Approved by: https://github.com/Skylion007
ghstack dependencies: #128001, #126578, #128158, #128172
2024-06-07 23:51:58 +00:00
bef586111a [pipelining] pipelining.rst updates (#128228)
fix some nits and add `PipelineStage` (manual)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128228
Approved by: https://github.com/wconstab
ghstack dependencies: #128201
2024-06-07 23:29:54 +00:00
09cccbc1c7 [RFC] add per-collective timeout value in flight recorder (#128190)
Summary:
Add timeout value field on every collected record.

Test Plan:
Unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128190
Approved by: https://github.com/wconstab
2024-06-07 23:29:35 +00:00
11f2d8e823 Move inductor cuda 124 jobs to a separate workflow that is not triggered by ciflow/inductor (#128250)
https://github.com/pytorch/pytorch/pull/127825

The majority of the g5 runner usage comes from inductor (it's something like 2x everything else).
In the past week, inductor ran roughly 1300 times on PRs and 300 times on main. Inductor-periodic ran 50 times on main, so the previous move from inductor -> inductor-periodic only results in 250 fewer runs.

I was under the impression that cu124 is currently experimental and that eventually we'll need to switch to it, so this will stay until we switch or inductor uses far fewer runners.

Are we expected to be able to handle two versions of cuda in CI? Currently we cannot, at least not comfortably.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128250
Approved by: https://github.com/huydhn
2024-06-07 23:01:52 +00:00
5b3624117a update test_issue175 to handle inline_inbuilt_nn_modules (#128026)
With inlining, the output graph has more function calls; this updates the test, which counts the number of function calls, to reflect that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128026
Approved by: https://github.com/anijain2305
ghstack dependencies: #127553
2024-06-07 22:07:16 +00:00
ba81c3c290 [inductor] add cpp builder code. (take 2) (#125849)
A fully manual rebase of the code from PR: https://github.com/pytorch/pytorch/pull/124045
The old PR seems to have crashed due to too many commits and too many rebases. Please refer to: https://github.com/pytorch/pytorch/pull/124045#issuecomment-2103744588

-------
It is the first step of RFC https://github.com/pytorch/pytorch/issues/124245.
Changes:
1. Add cpp builder code; the new cpp_builder supports Windows.
2. Add a CPU ISA checker which is cross-OS and exported from the backend cpuinfo.
3. Switch the compiler ISA checker to the new cpp builder.
4. CppCodeCache uses the new ISA checker.
5. Add a temporary `test_new_cpp_build_logical` UT to help with the transition to the new code.
<img width="1853" alt="Image" src="https://github.com/pytorch/pytorch/assets/8433590/ce6519ab-ba92-4204-b1d6-7d15d2ba2cbe">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125849
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-07 20:49:58 +00:00
3a620a0f65 bug fix of dynamo_timed in cprofile (#128203)
fb-only: "Entire Frame" was missing before this change.

Before: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/f565966006-TrainingApplication/20240527/rank_0/5_0_1/compilation_metrics_23.html
After: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/f569854578-TrainingApplication/20240606/rank_0/0_0_0/compilation_metrics_16.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128203
Approved by: https://github.com/Chillee
2024-06-07 20:47:27 +00:00
8892ddaacc [TD] Test removal on sm86 (#127131)
Yolo

I'm excited to break CI :')
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127131
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
2024-06-07 20:19:18 +00:00
fdf1666b20 Change lerp decomp to use aten.as_strided_copy instead of prims.copy_strided (#128030)
aten.lerp decomposition causes prims::copy_strided to appear in the graph, which is not core aten.

Internal ref: https://fb.workplace.com/groups/pytorch.edge.users/permalink/1525644288305859/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128030
Approved by: https://github.com/Skylion007, https://github.com/zou3519
2024-06-07 20:12:52 +00:00
e647ea55a3 [pipelining] redirect README to document (#128205)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128205
Approved by: https://github.com/wconstab, https://github.com/H-Huang
2024-06-07 19:34:52 +00:00
dcb63fcedb [pipelining] Remove num_microbatches from stage (#128201)
This is similar to https://github.com/pytorch/pytorch/pull/127979, but instead of removing `num_microbatches` from schedule, we remove it from `PipelineStage`. This also means that during `PipelineSchedule` init we need to setup the buffers for the stage(s).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128201
Approved by: https://github.com/kwen2501
2024-06-07 18:56:44 +00:00
cafbcb6376 [BE]: Update ruff to 0.4.8 (#128214)
Updates ruff to 0.4.8. Mostly minor fixes, but notably it is 10% faster on a microbenchmark, which should further reduce local and CI runtime of the linter. Also includes a few bugfixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128214
Approved by: https://github.com/ezyang
2024-06-07 18:41:35 +00:00
8ca4cefc7d [C10D] Ensure gil is not released when calling toPyBytes (#128212)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128212
Approved by: https://github.com/Skylion007, https://github.com/XilunWu
2024-06-07 18:24:10 +00:00
0a6df4fca6 delete inductor config.trace.compile_profile (#127143)
https://fb.workplace.com/groups/257735836456307/posts/687858786777341/?comment_id=687861123443774&reply_comment_id=687865486776671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127143
Approved by: https://github.com/Chillee
2024-06-07 18:05:50 +00:00
82d7a36a27 Added torchao nightly workflow (#128152)
Summary:
Add torchao benchmark workflow, upload the artifacts to GHA.

X-link: https://github.com/pytorch/benchmark/pull/2273

Test Plan:
```
python run_benchmark.py torchao --ci
```

Differential Revision: D58140479

Pulled By: xuzhao9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128152
Approved by: https://github.com/jerryzh168
2024-06-07 17:52:15 +00:00
0c7f4353e5 [inductor] simplify indexing (#127661)
This is a short term fix for: https://github.com/pytorch/pytorch/issues/124002

We found that the bad perf of the int8_unpack kernel is caused by sub-optimal indexing. In this PR we introduce 2 indexing optimizations:
1. Expand FloorDiv to the entire expression when feasible. E.g. `x1 * 1024 + x2 // 2` will be transformed to `(x1 * 2048 + x2) // 2`. The motivation is that we have more chance to simplify loops for `x1 * 2048 + x2`.
2. Merge ModularIndexing pairs: `ModularIndexing(ModularIndexing(x, 1, a), 1, b)` can be simplified to `ModularIndexing(x, 1, b)` if a is a multiple of b.

With both indexing optimizations, we improve int8_unpack perf by 1.54x (183us -> 119us).
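
The two rewrites above can be sanity-checked with plain integers. The sketch below is not the inductor implementation; it assumes `ModularIndexing(x, div, mod)` means `(x // div) % mod` and that all index variables are non-negative. The helper names are made up for illustration.

```python
def modular_indexing(x, div, mod):
    """Plain-integer stand-in (assumed semantics): (x // div) % mod."""
    return (x // div) % mod


def check_floordiv_expansion(limit=64):
    # x1 * 1024 + x2 // 2 == (x1 * 2048 + x2) // 2 for non-negative integers,
    # because x1 * 2048 is an exact multiple of the divisor 2.
    return all(
        x1 * 1024 + x2 // 2 == (x1 * 2048 + x2) // 2
        for x1 in range(limit)
        for x2 in range(limit)
    )


def check_modular_merge(a, b, limit=1024):
    # True when the nested form equals the merged form for every x tested.
    return all(
        modular_indexing(modular_indexing(x, 1, a), 1, b)
        == modular_indexing(x, 1, b)
        for x in range(limit)
    )
```

`check_modular_merge(12, 4)` holds because 12 is a multiple of 4, while `check_modular_merge(6, 4)` does not (for example at `x = 6`), which is why the merge is conditional.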

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127661
Approved by: https://github.com/jansel
2024-06-07 17:51:30 +00:00
662a78f957 [dynamo] Inline the getattr of fx graph and proxy graph (#128172)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128172
Approved by: https://github.com/yanboliang
ghstack dependencies: #128001, #126578, #128158
2024-06-07 17:14:58 +00:00
19b31d899a Fix 'get_real_value' on placeholder nodes (#127698)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127698
Approved by: https://github.com/jansel
ghstack dependencies: #127695, #127696
2024-06-07 17:13:43 +00:00
b741819b05 Fix 'get_attr' call in dynamo 'run_node' (#127696)
Fixes #124858

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127696
Approved by: https://github.com/jansel
ghstack dependencies: #127695
2024-06-07 17:13:43 +00:00
3aa623d407 Fix assume_constant_result for UnspecializedNNModuleVariable methods (#127695)
Fixes #127509

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127695
Approved by: https://github.com/jansel
2024-06-07 17:13:43 +00:00
754e6d4ad0 Make jobs with LF runners still pass lint (#128175)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128175
Approved by: https://github.com/huydhn
2024-06-07 17:13:04 +00:00
85758fa5ae [c10d][TCPStore] make TCPStore server use libuv by default (#127957)
**Summary**
This PR switches the default TCPStore server backend to a new implementation that utilizes [`libuv`](https://github.com/libuv/libuv) for significantly lower initialization time and better scalability:
<img width="714" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/18503011-da5d-4104-8ba9-abc456438b02">

We hope users will benefit from a much shorter startup time in large-scale jobs. Eventually, we hope to fully replace the old TCPStore backend implementation with the libuv one.

**What it changes**
This PR changes the underlying TCPStore server backend to `libuv` if users don't explicitly specify to use the old TCPStore server. This change is not supposed to be noticeable to users, except for significantly faster TCPStore startup in large-scale jobs.

One thing to note: we do not support the initialization approach where the user passes in a socket for the libuv backend. We plan to support it as a next step, but we chose to disable it until it is fully tested. If you are initializing TCPStore this way, see the next section for how to keep using the old TCPStore server.

**Fallback/Remain using the old TCPStore server**
For users who want to stay with the old TCPStore backend, there are 3 ways:

1. If the user is directly instantiating a TCPStore object, pass the argument `use_libuv=False` to use the old TCPStore server backend, e.g. `store = torch.distributed.TCPStore(..., use_libuv=False)`.
2. Or, specify the TCPStore backend option in `init_method` when calling the default ProcessGroup init, e.g. `torch.distributed.init_process_group(..., init_method="{YOUR_RENDEZVOUS_METHOD}://{YOUR_HOSTNAME}:{YOUR_PORT}?use_libuv=0")`
3. Or, set the environment variable `USE_LIBUV` to `"0"` when launching.

These 3 approaches are listed in order of precedence. That is, if the user specifies `use_libuv=0` in `init_method` and also sets the environment variable `USE_LIBUV="1"`, the former takes effect and the TCPStore backend instantiated will be the old one instead of the one using libuv.
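
The precedence described above can be modelled as a small resolver. This is only a sketch of the stated rule, not the code in `torch.distributed`; `resolve_use_libuv` and its parameters are hypothetical names, and the URL is parsed with the standard library purely for illustration.

```python
from urllib.parse import parse_qs, urlparse


def resolve_use_libuv(ctor_arg=None, init_method=None, env=None):
    """Hypothetical helper: True if the libuv backend would be selected."""
    env = env or {}
    # 1. An explicit use_libuv constructor argument wins.
    if ctor_arg is not None:
        return bool(ctor_arg)
    # 2. Next, a use_libuv query parameter in the init_method URL.
    if init_method is not None:
        query = parse_qs(urlparse(init_method).query)
        if "use_libuv" in query:
            return query["use_libuv"][0] == "1"
    # 3. Next, the USE_LIBUV environment variable.
    if "USE_LIBUV" in env:
        return env["USE_LIBUV"] == "1"
    # Otherwise the new default applies: libuv.
    return True
```

For example, `resolve_use_libuv(init_method="tcp://localhost:29500?use_libuv=0", env={"USE_LIBUV": "1"})` selects the old backend, matching the case spelled out in the paragraph above.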

**Operating Systems Compatibility**
From the CI signals, we believe the new implementation has the same behavior as the old TCPStore server on all supported platforms. If you notice any behavior discrepancy, please file an issue with `oncall: distributed` label.

**Test Plan**
`pytest test/distributed/test_store.py`
<img width="2548" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/dc0aebeb-6d5a-4daa-b98c-e56bd39aa588">
note: `TestMultiThreadedWait::test_wait` is a broken test that has been there for some time.

`test/distributed/elastic/utils/distributed_test.py`
<img width="2558" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/a6a3266d-b798-41c4-94d2-152056a034f6">

**TODO**
1. Update the doc at

- https://pytorch.org/docs/stable/distributed.html#distributed-key-value-store
- https://pytorch.org/docs/stable/distributed.html#tcp-initialization

2. Make torch elastic rendezvous to use libuv TCPStore as well. See `torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py` cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @kurman
3. Test if libuv backend is okay with initialization with socket. Change `LibUvTCPStoreTest::test_take_over_listen_socket`.

Differential Revision: [D58259591](https://our.internmc.facebook.com/intern/diff/D58259591)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127957
Approved by: https://github.com/kurman
ghstack dependencies: #127956
2024-06-07 16:53:01 +00:00
6c824cd9fb [BE][c10d] fix use of TORCH_ERROR in TCPStore libuv backend (#127956)
**Summary**
The use of TORCH_ERROR in TCPStore libuv backend code needs update.

Differential Revision: [D58259589](https://our.internmc.facebook.com/intern/diff/D58259589)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127956
Approved by: https://github.com/shuqiangzhang, https://github.com/cyyever
2024-06-07 16:53:01 +00:00
b9b89ed638 [pipelining] fix LoopedBFS (#127796)
# Issues

Currently two issues need to be fixed with LoopedBFS:
1. The wrap around send operation to the looped around stage blocks will cause a hang. For some reason this doesn't surface on single node, but on multihost this surfaces in a hang.
<img width="1311" alt="image" src="https://github.com/pytorch/pytorch/assets/14858254/210d9d18-455f-4f65-8a11-7ce2c1ec73fd">
2. When microbatches are popped off in `backward_one_chunk`, it automatically uses a `bwd_chunk_id` starting from 0. This works for interleaved 1f1b and 1f1b, but for loopedBFS we want to start popping at `num_microbatches - 1`. Does the same need to be fixed for gpipe?

# Changes
- Update LoopedBFS implementation to share `_step_microbatches` with `Interleaved1F1B`
- Also share the tests between the two schedules for varying num_microbatches, local_stages, and world_sizes
- Update `backward_one_chunk` to optionally take a `bwd_chunk_id` argument.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127796
Approved by: https://github.com/wconstab
2024-06-07 16:46:38 +00:00
d9696ea624 [AOTInductor] [Tooling] Update NaN and INF Checker for AOTInductor (#127574)
Summary:
1. Integrate NaN and INF checker with existing config, controllable by env var.
2. Move the injection point of the NaN & INF checker earlier; this can prevent buffers from being freed before the check.
3. Inject debugging code at the kernel level, which prevents us from trying to read buffers that are fused in place into a single kernel.

Test Plan:
Debugging utility.
Test and check by existing tests with env var:
```
TORCHINDUCTOR_NAN_ASSERTS=1 TORCHINDUCTOR_MAX_AUTOTUNE=0 python test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCuda.test_seq_non_abi_compatible_cuda
```

Reviewed By: ColinPeppler

Differential Revision: D57989176

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127574
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-06-07 16:46:26 +00:00
fc6e3ff96d [ROCm] Update triton pin to fix libtanh issue (#125396)
There were some internal build issues related to tanh when we moved to upstream triton in ROCm. These issues were fixed by the following triton commit: https://github.com/triton-lang/triton/pull/3810 . This PR moves the triton pin to incorporate that change. Added some skips for unit tests that regressed due to the triton commit bump in this PR.

Needs https://github.com/pytorch/pytorch/pull/127968 since this PR introduces a triton dependency on llnl-hatchet, which doesn't have py3.12 wheels available currently.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125396
Approved by: https://github.com/pruthvistony, https://github.com/malfet
2024-06-07 16:23:04 +00:00
128952625b Revert "Added memory budget to partitioner (#126320)"
This reverts commit 2184cdd29128a924583e4702489177f83fb8270a.

Reverted https://github.com/pytorch/pytorch/pull/126320 on behalf of https://github.com/ZainRizvi due to The new test_ac.py fails on ROCm machines ([comment](https://github.com/pytorch/pytorch/pull/126320#issuecomment-2155141886))
2024-06-07 16:15:03 +00:00
cyy
c219fa5eb9 [3/N] Remove unused functions (#128179)
Following https://github.com/pytorch/pytorch/pull/128005, this PR continues to remove unused functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128179
Approved by: https://github.com/ezyang
2024-06-07 16:13:16 +00:00
8d16a73f0f Manipulate triton_hash_with_backend so that it doesn't contain any keywords (#128159)
Summary: See https://github.com/pytorch/pytorch/issues/127637 where "def" appears in the backend_hash and causes a problem.
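
Why a hash can contain a keyword at all: hex digests use only `0-9` and `a-f`, so the substring `def` can appear by chance. The commit message does not say how the hash was changed; the sketch below shows one possible approach (upper-casing the digest), with hypothetical helper names.

```python
import hashlib
import keyword


def keyword_free_hash(payload: bytes) -> str:
    # Assumption for illustration: upper-casing a hex digest leaves only
    # the characters 0-9 and A-F, and every Python keyword contains at
    # least one lowercase letter, so no keyword can survive as a substring.
    return hashlib.sha256(payload).hexdigest().upper()


def contains_keyword(text: str) -> bool:
    # True if any Python keyword occurs anywhere inside the text.
    return any(kw in text for kw in keyword.kwlist)
```

A lowercase digest such as `"abcdef0123"` trips `contains_keyword`, while any output of `keyword_free_hash` does not.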

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128159
Approved by: https://github.com/jansel
2024-06-07 16:10:44 +00:00
852b7b4c99 [inductor] Enable subprocess-based parallel compile as the default (#126817)
Differential Revision: [D58239826](https://our.internmc.facebook.com/intern/diff/D58239826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126817
Approved by: https://github.com/eellison
ghstack dependencies: #128037, #128086
2024-06-07 16:10:11 +00:00
ac51f782fe Revert "Complete revamp of float/promotion sympy handling (#126905)"
This reverts commit 2f7cfecd86009a9d396fdbdcdfb4ba7a005db16b.

Reverted https://github.com/pytorch/pytorch/pull/126905 on behalf of https://github.com/atalman due to Sorry need to revert - failing internally ([comment](https://github.com/pytorch/pytorch/pull/126905#issuecomment-2155118778))
2024-06-07 16:01:46 +00:00
23c156cd2d Revert "[inductor] simplify indexing (#127661)"
This reverts commit 901226ae837bd4629b34735c84a3481c4988bb5b.

Reverted https://github.com/pytorch/pytorch/pull/127661 on behalf of https://github.com/atalman due to Sorry reverting because in conflict with https://github.com/pytorch/pytorch/pull/126905 which needs to be reverted, will be relanding it ([comment](https://github.com/pytorch/pytorch/pull/127661#issuecomment-2155115388))
2024-06-07 15:58:36 +00:00
cyy
a1b664adeb Add default values to PyTorchMemEffAttention::AttentionKernel::Params members (#112215)
Default values were added to Params in order to eliminate CUDA warnings like
```
and the implicitly-defined constructor does not initialize ‘PyTorchMemEffAttention::AttentionKernel<float, cutlass::arch::Sm80, true, 64, 64, 64, true, true>::accum_t PyTorchMemEffAttention::AttentionKernel<float, cutlass::arch::Sm80, true, 64, 64, 64, true, true>::Params::scale’
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112215
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-06-07 15:54:07 +00:00
3090667cf9 [pipelining] pipeline() taking microbatch as example input (#128163)
Changed the API of `pipeline()` to take microbatch instead of full batch as example args.

Main purpose is to:
- make this API more atomic;
- decouple tracing frontend from runtime info like `num_chunks`.

Side effects:
- Creates opportunity for varying `num_chunks` of schedules with the same `pipe` object.
- User has to create example microbatch input.
- Chunk spec stuff are now all moved to runtime side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128163
Approved by: https://github.com/H-Huang
2024-06-07 15:51:53 +00:00
224b4339e5 Revert "Make ValueRange repr less chatty by default (#128043)"
This reverts commit f0dd11df5534ae074ad2d090e6700576a22719d6.

Reverted https://github.com/pytorch/pytorch/pull/128043 on behalf of https://github.com/atalman due to Sorry reverting because in conflict with [#126905](https://github.com/pytorch/pytorch/pull/126905) which needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/128043#issuecomment-2155091732))
2024-06-07 15:43:39 +00:00
6e75024ff0 Run TestAOTAutograd with dynamo (#128047)
My goal is to run these tests with the autograd cache on, but first I want them running with dynamo. These tests already caught an interesting issue so I thought it would be helpful to just have them.

Next up I'll have a second subclass of these tests, run them twice, and expect a cache hit the second time from autograd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128047
Approved by: https://github.com/ezyang
2024-06-07 15:42:28 +00:00
771be55bb0 Documenting torch.onnx.operator.shape_as_tensor (#128051)
Fixes #127890

This PR adds docstring to the `torch.onnx.operator.shape_as_tensor` function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128051
Approved by: https://github.com/xadupre
2024-06-07 15:20:18 +00:00
3f9798a4fd add docstring to masked_fill, expand, select, unsqueeze, cat fns (#128055)
Fixes #127891
Fixes #127893
Fixes #127894
Fixes #127907
Fixes #127910

## Description
Add docstring to `masked_fill`, `expand`, `select`, `unsqueeze`, and `cat` functions in torch.onnx.symbolic_opset9.py

remaining pydocstyle errors: 257

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128055
Approved by: https://github.com/xadupre
2024-06-07 15:17:22 +00:00
543a870943 [pipelining] Rename ManualPipelineStage -> PipelineStage (#128157)
Renaming ManualPipelineStage to remove the "Manual" part. I needed to replace the existing `PipelineStage` which takes in the `pipe` argument, so I have renamed that to `TracerPipelineStage`. @kwen2501 will remove this entirely in favor of adding a util to `Pipe` to just create the stage directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128157
Approved by: https://github.com/wconstab
2024-06-07 09:24:16 +00:00
5f81265572 [Traceable FSDP2] Return early from _register_post_backward_hook when compile (#127864)
Dynamo doesn't support `RegisterPostBackwardFunction` very well yet. This PR skips it and relies on `root_post_backward_callback` under compile. We will improve `RegisterPostBackwardFunction` support in Q3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127864
Approved by: https://github.com/awgu
2024-06-07 09:19:07 +00:00
7efaeb1494 [AOTI] docs: add suggestion to turn on freezing on CPU (#128010)
With https://github.com/pytorch/pytorch/pull/124350 landed, it is now suggested in AOTI to turn on freezing on CPU to get better performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128010
Approved by: https://github.com/desertfire
2024-06-07 08:57:02 +00:00
0c16800b4a [pipelining] include lifted constants in input_to_state (#128173)
Previous PR only looked at state dict to determine inputs to state, missing out on lifted tensors

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128173
Approved by: https://github.com/kwen2501
2024-06-07 08:40:54 +00:00
01601ebd41 Retire torch.distributed.pipeline (#127354)
Actually retiring the module after the deprecation warning has been in place for a while.
The new supported module is: torch.distributed.pipelining.
Please migrate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127354
Approved by: https://github.com/wconstab
2024-06-07 08:11:58 +00:00
70724bdbfe Bugfix for nondeterminstic torch_key (#128111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128111
Approved by: https://github.com/oulgen
2024-06-07 07:17:39 +00:00
00c6ca4459 [compiled autograd][cudagraphs] Inputs runtime wrapper to move cpu scalars to cuda (#125382)
CPU scalars are most commonly used for the philox random seed. Right now, any CPU input causes cudagraphing to be skipped for the entire graph. We need both the traced graph and the runtime inputs to be cudaified.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125382
Approved by: https://github.com/jansel
2024-06-07 07:12:46 +00:00
190f06d468 [pipelining] Lower _configure_data_parallel_mode to stage (#127946)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127946
Approved by: https://github.com/wconstab
ghstack dependencies: #127935
2024-06-07 07:06:23 +00:00
a448b3ae95 [Traceable FSDP2] Check hasattr('fsdp_pre_all_gather') only when not compile (#127855)
Dynamo doesn't support `hasattr(inner_tensor, "fsdp_post_all_gather")` yet. We will work on this support in Q3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127855
Approved by: https://github.com/awgu
2024-06-07 06:36:40 +00:00
2ff312359c skip hf_T5_generate in dynamic shape test (#121129)
As reported in https://github.com/pytorch/pytorch/issues/119434, `hf_T5_generate` failed with dynamic shape testing, we propose to skip the dynamic batch size testing of this model in this PR.

* Error msg is
```
  File "/home/jiayisun/pytorch/torch/_dynamo/guards.py", line 705, in SHAPE_ENV
    guards = output_graph.shape_env.produce_guards(
  File "/home/jiayisun/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3253, in produce_guards
    raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['inputs_tensor'].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
  - Not all values of RelaxedUnspecConstraint(L['inputs_tensor'].size()[0]) are valid because L['inputs_tensor'].size()[0] was inferred to be a constant (4).
```

* Root Cause is
This error happens while creating guard for this [model script line](https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L561): `scores += position_bias_masked`
I ran it with TORCH_LOGS="+dynamic" and got the key line: `I0305 00:21:00.849974 140376923287424 torch/fx/experimental/symbolic_shapes.py:3963] [6/0_1] eval Eq(s0, 4) [guard added] at miniconda3/envs/pt2/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py:561 in forward (_refs/__init__.py:403 in _broadcast_shapes)`
The reason for this error is that the batch dimension of `inputs_tensor` in the dynamic batch size test is marked as the dynamic shape `s0`, so the batch dimension of `scores`, generated by a series of operations on `inputs_tensor`, is also `s0`. However, the function that creates `attention_mask` runs in Python rather than in Dynamo, so the batch dimension of `attention_mask` is the real shape `4`, and the batch dimension of `position_bias_masked`, generated by a series of operations on `attention_mask`, is also the real shape `4`, not the dynamic shape `s0`. The line `scores += position_bias_masked` requires creating a guard that checks whether the batch dimension of `scores` always equals the batch dimension of `position_bias_masked`, i.e. Eq(s0, 4), and that is where the error happens.
So the root cause of this error is that the function that creates `attention_mask` runs in Python rather than in Dynamo. It is not in Dynamo because Dynamo has a graph break on this function (at the [model script line](https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L476): `is_pad_token_in_inputs = (pad_token_id is not None) and (pad_token_id in inputs)`) due to the following error:
`torch._dynamo.exc.Unsupported: Tensor.item`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121129
Approved by: https://github.com/leslie-fang-intel, https://github.com/ezyang
2024-06-07 06:28:29 +00:00
d943357a21 [XPU] Add xpu support of make triton (#126513)
This PR is to add XPU support for `make triton`.

If a user wishes to use Triton with XPU support, the user needs to install the  [intel-xpu-backend-for-triton](https://github.com/intel/intel-xpu-backend-for-triton).

This PR allows the user to easily install Triton for xpu backend support:

```
# clone the pytorch repo
export USE_XPU=1
make triton
```
The XPU version of triton will always be built from the source. It will cat the commit id from `.ci/docker/ci_commit_pins/triton-xpu.txt`, for example, `b8c64f64c18d8cac598b3adb355c21e7439c21de`.

So the final call would be like:

```
pip install --force-reinstall "git+https://github.com/intel/intel-xpu-backend-for-triton@b8c64f64c18d8cac598b3adb355c21e7439c21de#subdirectory=python"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126513
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-06-07 06:25:47 +00:00
68cc63ae27 introduce skipIfNNModuleInlined and skip test_cpu_cuda_module_after_dynamo (#128023)
See https://github.com/pytorch/pytorch/issues/127636 for details about the issue. TL;DR: when inlining is enabled, we create a fake tensor while tracing in dynamo and try to perform aten.add.Tensor between
two tensors on different devices (cpu and cuda:0); without inlining we do not hit that operation during tracing.
```
Failed running call_function <built-in function add>(*(FakeTensor(..., size=(20, 20), grad_fn=<AddBackward0>), FakeTensor(..., device='cuda:0', size=(20, 20))), **{}):
Unhandled FakeTensor Device Propagation for aten.add.Tensor, found two different devices cpu, cuda:0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128023
Approved by: https://github.com/anijain2305
ghstack dependencies: #127487, #127553
2024-06-07 06:00:33 +00:00
7e48d6a497 reset dynamo in test_do_not_skip_side_effects unit test loop to avoid dynamo cache limit hit (#127487)
fix https://github.com/pytorch/pytorch/issues/127483

When nn module inlining is enabled, all recompilations are attributed to the same frame, so we hit the cache limit for
test_do_not_skip_side_effects. Without inlining things are different: each time we hit a new Object Model we do not consider that a recompilation, as explained in https://github.com/pytorch/pytorch/issues/127483

For that test we do not really care about cache size, hence I reset dynamo in the main loop.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127487
Approved by: https://github.com/anijain2305
2024-06-07 06:00:33 +00:00
dc8e3c2e90 [inductor] subproc parallel compile: initialize future before sending work to the pool (#128086)
Summary: I got reports of intermittent failures in CI and the logs show errors like this:
```
CRITICAL:concurrent.futures:Future 139789013754560 in unexpected state: FINISHED
```
I can't repro locally, but it seems clear that we should initialize the future _before_ sending work to the subprocess pool, since the work could finish before we call set_running_or_notify_cancel()
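
The ordering issue can be shown with the standard library alone. The sketch below is not the inductor code: it uses a thread as a stand-in for the subprocess pool, and `submit_job` and `racy_order_fails` are hypothetical names. It marks the future as running before any worker exists, and separately replays the bad ordering deterministically.

```python
import threading
from concurrent.futures import Future


def submit_job(job, *args):
    future = Future()
    # Move the future from PENDING to RUNNING before the worker exists,
    # so the worker can never complete it first.
    future.set_running_or_notify_cancel()

    def worker():
        try:
            future.set_result(job(*args))
        except Exception as exc:  # forward failures to the caller
            future.set_exception(exc)

    threading.Thread(target=worker, daemon=True).start()
    return future


def racy_order_fails():
    # Replay the bad ordering: the result arrives first, then the caller
    # tries to mark the already-finished future as running.
    future = Future()
    future.set_result(42)
    try:
        future.set_running_or_notify_cancel()
    except RuntimeError:
        return True
    return False
```

In the bad ordering, `concurrent.futures` logs a critical "unexpected state: FINISHED" message and raises `RuntimeError`, which is consistent with the log line quoted above.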

Differential Revision: [D58239829](https://our.internmc.facebook.com/intern/diff/D58239829)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128086
Approved by: https://github.com/jansel
ghstack dependencies: #128037
2024-06-07 04:17:35 +00:00
6a2bf48cfa [inductor] subproc parallel-compile: start thread last in init (#128037)
Summary: Observed on an internal workload: the helper thread started and attempted to access member variables before they were initialized.
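
A minimal sketch of the pattern in the title, using a hypothetical `ResultReader` class rather than the actual inductor helper: every member the thread touches is initialized first, and the thread is started as the last statement of `__init__`.

```python
import queue
import threading


class ResultReader:
    """Hypothetical helper that drains a queue on a background thread."""

    def __init__(self):
        # Initialize every member the thread reads or writes first...
        self.inbox = queue.Queue()
        self.seen = []
        self.done = threading.Event()
        # ...and only then start the thread, as the very last step.
        self.thread = threading.Thread(target=self._run, daemon=True)
        self.thread.start()

    def _run(self):
        while True:
            item = self.inbox.get()
            if item is None:  # sentinel: stop draining
                break
            self.seen.append(item)
        self.done.set()

    def close(self):
        self.inbox.put(None)
        self.thread.join()
```

If `self.thread.start()` ran before `self.seen` or `self.done` were assigned, `_run` could raise `AttributeError` on a fast machine, which is the class of failure the summary describes.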

Differential Revision: [D58239827](https://our.internmc.facebook.com/intern/diff/D58239827)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128037
Approved by: https://github.com/Skylion007, https://github.com/eellison
2024-06-07 04:17:35 +00:00
e8e0bdf541 [inductor] parallel-compile: call triton_key() before forking (#127639)
Summary:
A user reported a severe slowdown on a workload when using parallel compile. The issue is that in some environments, the process affinity changes after forking such that all forked subprocesses use a single logical processor. Described here: https://github.com/pytorch/pytorch/issues/99625. That requires a separate fix, but during debugging we noticed that we can at least optimize the expensive call to triton_key() before forking.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127639
Approved by: https://github.com/eellison, https://github.com/anijain2305
2024-06-07 04:12:57 +00:00
96806b1777 [pipelining][doc] Add frontend description and change tracer example (#128070)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128070
Approved by: https://github.com/wconstab, https://github.com/H-Huang
2024-06-07 04:09:36 +00:00
3df53c2a8f [dtensor] directly return local_tensor under no_grad (#128145)
as titled, skip the autograd function and directly return the
local_tensor if it's under no_grad context, this would avoid creating
views

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128145
Approved by: https://github.com/awgu
ghstack dependencies: #128112
2024-06-07 04:01:47 +00:00
747fc35ff5 [dynamo] Support if cond on UnspecializedNNModuleVariable and add inline tests (#128158)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128158
Approved by: https://github.com/jansel
ghstack dependencies: #128001, #126578
2024-06-07 03:50:33 +00:00
5e5bbdb35e [DDP] Bucket handling: make first bucket size equal to bucket_cap_mb if it was set (#121640)
The first DDP bucket is always created with the size of `dist._DEFAULT_FIRST_BUCKET_BYTES` (1 MiB) by default, regardless of `bucket_cap_mb`. The proposal is to use `bucket_cap_mb` as the one main bucket size if it was supplied by the user.
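
The proposed rule can be written out as a small pure-Python sketch. This is not the DDP implementation: `bucket_size_limits` is a hypothetical helper, the 1 MiB constant stands in for `dist._DEFAULT_FIRST_BUCKET_BYTES`, and the 25 MiB fallback is assumed from the documented `bucket_cap_mb` default.

```python
MIB = 1024 * 1024
DEFAULT_FIRST_BUCKET_BYTES = 1 * MIB  # stands in for dist._DEFAULT_FIRST_BUCKET_BYTES
DEFAULT_BUCKET_CAP_MB = 25            # assumed default when the user sets nothing


def bucket_size_limits(bucket_cap_mb=None):
    """Return (first_bucket_bytes, other_bucket_bytes) under the proposed rule."""
    if bucket_cap_mb is None:
        # Nothing supplied: keep the small first bucket and the default cap.
        return DEFAULT_FIRST_BUCKET_BYTES, DEFAULT_BUCKET_CAP_MB * MIB
    # User supplied a cap: use it for the first bucket as well.
    cap = int(bucket_cap_mb * MIB)
    return cap, cap
```

With no argument this returns a 1 MiB first bucket and 25 MiB for the rest; with `bucket_cap_mb=50` both limits become 50 MiB.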

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121640
Approved by: https://github.com/wanchaol
2024-06-07 03:33:33 +00:00
4d0ece8196 [pipelining] Consolidate chunk counting between stage and schedule (#127935)
We used to have two backward chunk id counting systems, one at schedule level, the other at stage level.
(Which makes safety dependent on the two advancing hand-in-hand.)

This PR consolidates the counting system to the schedule side only, which would pass `mb_index` to the following stage calls:
`forward_one_chunk`
`backward_one_chunk`
`get_bwd_send_ops`
...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127935
Approved by: https://github.com/H-Huang
2024-06-07 03:33:18 +00:00
476bfe6cce fix torch.compile with triton kernels under inference_mode (#124489)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124489
Approved by: https://github.com/albanD
2024-06-07 03:29:37 +00:00
50155e825b [export] provide refine function for automatically accepting dynamic shapes suggested fixes (#127436)
Summary:
Part of the work helping export's automatic dynamic shapes / dynamic shapes refining based on suggested fixes.

Introduces a util function refine_dynamic_shapes_from_suggested_fixes() that takes the message of a ConstraintViolationError containing suggested dynamic shapes fixes, along with the original dynamic shapes spec, and returns the new spec. Written so that the suggested fixes from export can be directly parsed and used.

Example usage for the automatic dynamic shapes workflow:
```
# export, fail, parse & refine suggested fixes, re-export
try:
    export(model, inps, dynamic_shapes=dynamic_shapes)
except torch._dynamo.exc.UserError as exc:
    new_shapes = refine_dynamic_shapes_from_suggested_fixes(exc.msg, dynamic_shapes)
    export(model, inps, dynamic_shapes=new_shapes)
```

For examples of behavior, see the added test and docstring. Will take suggestions for renaming the function to something else 😅

Test Plan: test_export tests

Differential Revision: D57409142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127436
Approved by: https://github.com/avikchaudhuri
2024-06-07 03:29:06 +00:00
65aa16f968 Revert "Default XLA to use swap_tensors path in nn.Module._apply (#126814)" (#128170)
https://github.com/pytorch/pytorch/issues/128165 :(

This reverts commit a7b1dd82ff3063894fc665ab0c424815231c10e6.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128170
Approved by: https://github.com/drisspg, https://github.com/albanD
2024-06-07 01:44:14 +00:00
f99409903c Documenting torch.distributions.utils.clamp_probs (#128136)
Fixes https://github.com/pytorch/pytorch/issues/127889

This PR adds docstring to the `torch.distributions.utils.clamp_probs` function.

Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128136
Approved by: https://github.com/janeyx99, https://github.com/svekars, https://github.com/malfet
2024-06-07 00:49:41 +00:00
740cd0559f Filter non input symexprs from codecache guards (#128052)
Summary: Dynamo lifts all symexprs that appear in the inputs to top level which means that we do not need to look at guards that contain symexprs that do not appear in the inputs. Prune them.

Test Plan: added two new tests

Differential Revision: D58200476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128052
Approved by: https://github.com/ezyang, https://github.com/masnesral
2024-06-07 00:48:49 +00:00
117ab34891 Documenting the torch.utils.collect_env.get_pretty_env_info function (#128123)
Fixes #127888

This PR adds docstring to the `torch.utils.collect_env.get_pretty_env_info` function

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128123
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-06-07 00:43:18 +00:00
901226ae83 [inductor] simplify indexing (#127661)
This is a short term fix for: https://github.com/pytorch/pytorch/issues/124002

We found that the bad perf of the int8_unpack kernel is due to sub-optimal indexing. In this PR we introduce 2 indexing optimizations:
1. expand FloorDiv to the entire expression when feasible. E.g. `x1 * 1024 + x2 // 2` will be transformed to `(x1 * 2048 + x2) // 2`. The motivation is that we have more chance to simplify loops for `x1 * 2048 + x2`.
2. merge ModularIndexing pairs: `ModularIndexing(ModularIndexing(x, 1, a), 1, b)` can be simplified to `ModularIndexing(x, 1, b)` if a is a multiple of b.

With both indexing optimizations, we improve int8_unpack perf by 1.54x (183us -> 119us).
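
The two rewrites above can be sanity-checked on plain integers. This is only a brute-force sketch: it assumes `ModularIndexing(x, d, m)` means `(x // d) % m`, and `modular_indexing` / `check_identities` are illustrative helpers, not Inductor code.

```python
def modular_indexing(x: int, divisor: int, modulus: int) -> int:
    """Integer model assumed here: (x // divisor) % modulus."""
    return (x // divisor) % modulus


def check_identities(limit: int = 64) -> bool:
    """Brute-force both rewrites over small non-negative integers."""
    for x1 in range(limit):
        for x2 in range(limit):
            # Rewrite 1: fold the addend into the FloorDiv.
            if x1 * 1024 + x2 // 2 != (x1 * 2048 + x2) // 2:
                return False
    for b in range(1, 9):
        for k in range(1, 5):
            a = k * b  # a is a multiple of b
            for x in range(limit * 4):
                inner = modular_indexing(x, 1, a)
                # Rewrite 2: the nested modulo collapses to the outer one.
                if modular_indexing(inner, 1, b) != modular_indexing(x, 1, b):
                    return False
    return True
```

The "multiple of b" condition matters: with `a = 3`, `b = 2` and `x = 5`, the nested form gives 0 while `ModularIndexing(x, 1, b)` gives 1.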

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127661
Approved by: https://github.com/jansel
2024-06-06 23:57:45 +00:00
7ede78f9f5 [dynamo][nn-modules] Trace through nn.Module dunder methods for UnspecializedNNModule (#126578)
Tracing through `__init__` is important because it initializes (calls STORE_ATTR on) members. By doing that, we kick in the mutation tracking for these objects. So, things like mutations of `_modules` etc. are tracked automatically.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126578
Approved by: https://github.com/jansel
ghstack dependencies: #128001
2024-06-06 23:05:49 +00:00
e5b3387166 [dynamo] Bugfix for nn parameter construction (#128001)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128001
Approved by: https://github.com/jansel
2024-06-06 23:05:49 +00:00
6dfdce92ba Fixed typos in the complex numbers portion of the autograd docs (#127948)
This PR fixes several typos in the complex numbers section of the docs for autograd. Only documentation was altered.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127948
Approved by: https://github.com/soulitzer
2024-06-06 22:47:04 +00:00
56a3d276fe Handle custom op during TorchScript to ExportedProgram conversion (#127580)
#### Description
Handle custom ops during TorchScript to ExportedProgram conversion.
```python
torch.library.define(
    "mylib::foo",
    "(Tensor x) -> Tensor",
    lib=lib,
)

# PyTorch custom op implementation
@torch.library.impl(
    "mylib::foo",
    "CompositeExplicitAutograd",
    lib=lib,
)
def foo_impl(x):
    return x + x

# Meta function of the custom op.
@torch.library.impl_abstract(
    "mylib::foo",
    lib=lib,
)
def foo_meta(x):
    return x + x

class M(torch.nn.Module):
    def forward(self, x):
        return torch.ops.mylib.foo(x)
```

#### Test Plan
* Add a test case where custom op is called and converted. `pytest test/export/test_converter.py -s -k test_ts2ep_converter_custom_op`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127580
Approved by: https://github.com/angelayi
2024-06-06 22:06:51 +00:00
80fa2778ed Update types for verbose in lr_scheduler (#127943)
I'm currently locked into jsonargparse version 4.19.0, and it complains when used in combination with LightningCLI (v2.0.8). This is because it cares about the types declared in google style docstrings. This causes a problem when it tries to parse how it should cast arguments to construct an instance of an LRScheduler class because the docstrings declare the "verbose" parameter as a bool, but the defaults recently changed to a string "deprecated". This means the type should really be `bool | str`.

This PR adds a `| str` to the docstring type in each learning rate scheduler class. This will prevent jsonargparse from complaining.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127943
Approved by: https://github.com/janeyx99
2024-06-06 21:59:22 +00:00
0a761f0627 [RFC] Provide optional switches to _dump_nccl_trace (#127651)
Summary:
Data from PyTorch distributed is mostly useful during initial stages of model development.
Provide options to reduce data sent/dumped.
`_dump_nccl_trace` takes 3 optional switches. The default, as before, returns everything.
- `includeCollectives`: option to also include collectives: Default is True.
- `includeStacktraces`: option to include stack traces in collectives. Default is True.
- `onlyActive`: option to only send active collective work - i.e. not completed. Default is
    False (i.e. send everything)
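
A toy model of how the three switches could interact on the collectives portion of a dump. The entry layout (`state`, `frames` keys) and the `filter_trace` helper are assumptions made for illustration; the real dump is produced in C++ and has a different format.

```python
from typing import Any, Dict, List


def filter_trace(
    entries: List[Dict[str, Any]],
    include_collectives: bool = True,
    include_stack_traces: bool = True,
    only_active: bool = False,
) -> List[Dict[str, Any]]:
    """Hypothetical filter mirroring the three switches described above."""
    if not include_collectives:
        # No collective records are emitted at all.
        return []
    result = []
    for entry in entries:
        if only_active and entry["state"] == "completed":
            # Skip work that has already finished.
            continue
        record = dict(entry)  # copy so the input is left untouched
        if not include_stack_traces:
            record.pop("frames", None)
        result.append(record)
    return result
```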

Test Plan:
Unit tests

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127651
Approved by: https://github.com/wconstab
2024-06-06 21:59:09 +00:00
54fe2d0e89 [cuDNN][quantization] skip qlinear test in cuDNN v9.1.0 (#128166)
#120006 unskipped this test only 3 days ago, so we don't consider it a blocker for cuDNN v9 for now

CC @atalman

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128166
Approved by: https://github.com/atalman, https://github.com/nWEIdia
2024-06-06 21:43:29 +00:00
04272a0e12 Add docstring for the torch.ao.quantization.utils.get_combined_dict function (#128127)
Fixes: #127906

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128127
Approved by: https://github.com/jerryzh168
2024-06-06 21:22:09 +00:00
baaa914bf7 [small] test clean up (#128079)
remove unnecessary line: https://github.com/pytorch/pytorch/issues/123733
add main so test can be run `python ...`: https://github.com/pytorch/pytorch/issues/124906

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128079
Approved by: https://github.com/awgu
2024-06-06 21:21:40 +00:00
9554300436 [inductor][codegen] Codegen constexpr globals and constexpr annotated globals correctly. (#126195)
[Triton #3762](https://github.com/triton-lang/triton/pull/3762)
disallows access to globals which are not `tl.constexpr`

Triton has always treated captured globals this way, but they now
require it be explicit in user code.

Updated codegen to make sure these variables are defined before writing
the kernel source when compiling a user defined triton kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126195
Approved by: https://github.com/alexbaden, https://github.com/bertmaher
2024-06-06 20:50:11 +00:00
2184cdd291 Added memory budget to partitioner (#126320)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126320
Approved by: https://github.com/shunting314
2024-06-06 20:32:29 +00:00
7e059b3c95 Add a call to validate docker images after build step is complete (#127768)
Adds validation to docker images. As discussed here: https://github.com/pytorch/pytorch/issues/125879
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127768
Approved by: https://github.com/huydhn, https://github.com/Skylion007
2024-06-06 20:25:39 +00:00
e8670f6aea [Dynamo][TVM] Support macOS and Linux/aarch64 platforms (#128124)
Fixes #128122
With this fix, I've confirmed that the repro works on the platforms below.
- macOS 14.5 (arm64)
- Ubuntu 20.04.6 LTS (GNU/Linux 5.10.120-tegra aarch64)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128124
Approved by: https://github.com/malfet
2024-06-06 19:47:11 +00:00
de4f8b9946 [BE]: Update cudnn to 9.1.0.70 (#123475)
cuDNN has managed to upload cu11 and cu12 wheels for ~~9.0.0.312~~ 9.1.0.70, so trying this out...

CC @Skylion007 @malfet

Co-authored-by: Wei Wang <weiwan@nvidia.com>
Co-authored-by: atalman <atalman@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123475
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/nWEIdia, https://github.com/atalman
2024-06-06 18:45:22 +00:00
fba21edf5b [CI] Ensure inductor/test_cpu_cpp_wrapper is actually run in inductor_cpp_wrapper_abi_compatible (#126717)
`inductor/test_cpu_cpp_wrapper` is not actually being run in `inductor_cpp_wrapper_abi_compatible` test config

The cpu device type gets removed in d28868c7e8/torch/testing/_internal/common_device_type.py (L733)

so d28868c7e8/test/inductor/test_cpu_cpp_wrapper.py (L396) returns false.

Feel free to make a PR with a different way to do this (a better RUN_CPU check?)

Add a skip for a failing test.  I am not equipped to fix it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126717
Approved by: https://github.com/ZainRizvi
2024-06-06 18:23:52 +00:00
936225d7b2 [mergebot] Fix pending unstable jobs being viewed as failed (#128080)
https://github.com/pytorch/pytorch/pull/128038#issuecomment-2150802030

In the above, pending unstable jobs get put into the ok_failed_checks list, and because there are a lot of unstable jobs, it exceeds the threshold and merge fails.

I don't think unstable jobs should be considered in the ok failed checks threshold, only flaky and broken trunk jobs should be considered there.

Change looks big, but main thing is that unstable jobs don't get included in the check for how many flaky failures there are.  The other changes are mostly renames so things are clearer
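
A rough sketch of the intended counting rule. The classification strings and the `ok_failed_count` / `merge_blocked` helpers are hypothetical and do not reflect the real mergebot code.

```python
from typing import Iterable, Tuple

# Only these categories count toward the ok-failed-checks threshold.
COUNTED = {"FLAKY", "BROKEN_TRUNK"}


def ok_failed_count(checks: Iterable[Tuple[str, str]]) -> int:
    """Count (name, classification) pairs that hit the threshold."""
    return sum(1 for _name, kind in checks if kind in COUNTED)


def merge_blocked(checks: Iterable[Tuple[str, str]], threshold: int) -> bool:
    # Unstable jobs are deliberately ignored here, pending or not.
    return ok_failed_count(checks) > threshold
```
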
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128080
Approved by: https://github.com/huydhn
2024-06-06 18:22:20 +00:00
32fb68960e [FSDP2] Added experimental warning to unshard API (#128138)
There is still ongoing discussion on how this API should work.

Current approach:
- The pre-all-gather ops run in the default stream and the all-gather is called from the default stream with `async_op=True`.
- Pros:
    - The all-gather input and output tensors are allocated in the default stream, so there is no increased memory fragmentation across stream pools.
    - There is no need for additional CUDA synchronization. The API is self-contained.
- Cons:
    - The pre-all-gather ops (e.g. cast from fp32 -> bf16 and all-gather copy-in device copies) cannot overlap with other default stream compute. The biggest concern here is for CPU offloading, the H2D copies cannot overlap.

Alternative approach:
- Follow the default implicit prefetching approach, where the pre-all-gather ops and all-gather run in separate streams.
- Pros:
    - The pre-all-gather ops can overlap with default stream compute.
- Cons:
    - We require an API that should be called after the last optimizer step (namely, last op that modified sharded parameters) and before the first `unshard` call that has the all-gather streams wait for the default stream. The API is no longer self-contained and now has a complementary API.
    - The all-gather input and output tensors are allocated in separate streams (not the default stream), so there can be increased memory fragmentation across pools.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128138
Approved by: https://github.com/wanchaol
ghstack dependencies: #128100
2024-06-06 18:18:42 +00:00
78a6b0c479 update test_reformer_train test to handle nn module inlining (#127467)
The number of call nodes increases due to inlining.
before inlining:
```
 class GraphModule(torch.nn.Module):
        def forward(self, function_ctx, cat: "f32[1, s0, 512]"):
            # No stacktrace found for following nodes
            _set_grad_enabled = torch._C._set_grad_enabled(False)

            # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:283 in backward, code: grad_attn_output, grad_hidden_states = torch.chunk(
            chunk = torch.chunk(cat, 2, dim = -1);  cat = None
            getitem: "f32[1, s0, 256]" = chunk[0]
            getitem_1: "f32[1, s0, 256]" = chunk[1];  chunk = None

            # No stacktrace found for following nodes
            _set_grad_enabled_1 = torch._C._set_grad_enabled(True)
            return (getitem_1, None)
```

after inlining:
```
class GraphModule(torch.nn.Module):
    def forward(self, s0: "Sym(s0)", L_hidden_states_: "f32[1, s0, 256]", L_self_layers_0_weight: "f32[256, 256]", L_self_layers_0_bias: "f32[256]", L_self_layer_norm_weight: "f32[512]", L_self_layer_norm_bias: "f32[512]", L_self_layer_norm_normalized_shape_0_: "Sym(512)"):
        l_hidden_states_ = L_hidden_states_
        l_self_layers_0_weight = L_self_layers_0_weight
        l_self_layers_0_bias = L_self_layers_0_bias
        l_self_layer_norm_weight = L_self_layer_norm_weight
        l_self_layer_norm_bias = L_self_layer_norm_bias
        l_self_layer_norm_normalized_shape_0_ = L_self_layer_norm_normalized_shape_0_

        # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:332 in forward, code: hidden_states = torch.cat([hidden_states, hidden_states], dim=-1)
        hidden_states: "f32[1, s0, 512]" = torch.cat([l_hidden_states_, l_hidden_states_], dim = -1);  l_hidden_states_ = None

        # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:333 in forward, code: hidden_states = _ReversibleFunction.apply(
        function_ctx = torch.autograd.function.FunctionCtx()

        # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:258 in forward, code: hidden_states, attn_output = torch.chunk(hidden_states, 2, dim=-1)
        chunk = torch.chunk(hidden_states, 2, dim = -1);  hidden_states = None
        hidden_states_1: "f32[1, s0, 256]" = chunk[0]
        attn_output: "f32[1, s0, 256]" = chunk[1];  chunk = None

        # File: /data/users/lsakka/pytorch/pytorch/torch/nn/modules/linear.py:116 in forward, code: return F.linear(input, self.weight, self.bias)
        attn_output_1: "f32[1, s0, 256]" = torch._C._nn.linear(attn_output, l_self_layers_0_weight, l_self_layers_0_bias);  attn_output = l_self_layers_0_weight = l_self_layers_0_bias = None

        # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:272 in forward, code: ctx.save_for_backward(attn_output.detach(), hidden_states.detach())
        detach: "f32[1, s0, 256]" = attn_output_1.detach()
        detach_1: "f32[1, s0, 256]" = hidden_states_1.detach()

        # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:279 in forward, code: return torch.cat([attn_output, hidden_states], dim=-1)
        hidden_states_2: "f32[1, s0, 512]" = torch.cat([attn_output_1, hidden_states_1], dim = -1);  attn_output_1 = hidden_states_1 = None

        # File: /data/users/lsakka/pytorch/pytorch/torch/nn/modules/normalization.py:201 in forward, code: return F.layer_norm(
        hidden_states_3: "f32[1, s0, 512]" = torch.nn.functional.layer_norm(hidden_states_2, (l_self_layer_norm_normalized_shape_0_,), l_self_layer_norm_weight, l_self_layer_norm_bias, 1e-12);  hidden_states_2 = l_self_layer_norm_normalized_shape_0_ = l_self_layer_norm_weight = l_self_layer_norm_bias = None

        # File: /data/users/lsakka/pytorch/pytorch/test/dynamo/test_repros.py:352 in forward, code: hidden_states = torch.nn.functional.dropout(
        hidden_states_4: "f32[1, s0, 512]" = torch.nn.functional.dropout(hidden_states_3, p = 0.5, training = True);  hidden_states_3 = None
        return (hidden_states_4,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127467
Approved by: https://github.com/anijain2305
ghstack dependencies: #126444, #127146, #127424, #127440
2024-06-06 17:56:36 +00:00
304956e1fb Switch to torch.float16 on XPU AMP mode (#127741)
# Motivation
Previously, the default dtype for AMP on XPU was aligned with the CPU. To align with other GPUs, we intend to change the default dtype for AMP to `torch.float16`. This change aims to save users the effort of converting models from `torch.float16` to `torch.bfloat16`, or vice versa when they want to run the model on different types of GPUs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127741
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-06-06 17:40:13 +00:00
1d0c1087dd Allow overriding per-dim group options via _MeshEnv.set_dim_group_options (#126599)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126599
Approved by: https://github.com/wanchaol
ghstack dependencies: #126598
2024-06-06 17:18:12 +00:00
e9c5144cbc Fix bug in update_process_group DDP API (#128092)
Fix bug in `_update_process_group` DDP API where we didn't correctly reset `local_used_map_` and a few other variables. This resulted in errors like `Encountered gradient which is undefined, but still allreduced by...`

Added a unit test as well that reproduced the issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128092
Approved by: https://github.com/awgu, https://github.com/fegin
2024-06-06 17:10:42 +00:00
2ffdf556ea Add back API that some people rely on in torch.cuda.amp.grad_scaler namespace (#128056)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128056
Approved by: https://github.com/kit1980, https://github.com/eqy
2024-06-06 17:02:32 +00:00
2d47385f0f [BE]: Enable ruff TCH rules and autofixes for better imports (#127688)
Automated fixes to put imports that are only used in type hints into TYPE_CHECKING imports. This also enables the RUFF TCH rules which will automatically apply autofixes to move imports in and out of TYPE_CHECKING blocks as needed in the future, this will make the initial PyTorch import faster and will reduce cyclic dependencies.
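
For reference, a minimal example of the pattern these rules produce. The choice of `decimal.Decimal` and the `describe` function are purely illustrative.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Evaluated by static type checkers only; skipped at runtime,
    # so this import costs nothing when the module is loaded.
    from decimal import Decimal


def describe(value: "Decimal") -> str:
    # The annotation is a string, so the name does not have to
    # exist at runtime.
    return f"value={value}"
```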

Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127688
Approved by: https://github.com/XuehaiPan, https://github.com/ezyang, https://github.com/malfet
2024-06-06 16:55:58 +00:00
4f87f47ea1 [dtensor] reuse DTensorSpec as much as possible (#128112)
as titled, given that our DTensorSpec is immutable, we can always reuse
the spec if the input/output have the same tensor metadata. This helps in two ways:
1. We don't need to re-calculate the hash every time we produce a
   DTensorSpec, which reduces runtime operator overhead.
2. It reduces the DTensor construction overhead.

A local benchmark on an 800-parameter clip_grad_norm shows that for
foreach_norm the CPU overhead drops from 11ms to 7.8ms (around 30% improvement)
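
The reuse idea can be sketched with a frozen dataclass. `Spec` and `output_spec` are hypothetical stand-ins and much simpler than the real DTensorSpec.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Spec:
    """Hypothetical immutable spec: placements plus tensor metadata."""

    placements: Tuple[str, ...]
    shape: Tuple[int, ...]
    dtype: str


def output_spec(input_spec: Spec, out_shape: Sequence[int], out_dtype: str) -> Spec:
    if input_spec.shape == tuple(out_shape) and input_spec.dtype == out_dtype:
        # Same metadata: return the existing object rather than
        # constructing (and later re-hashing) an equal one.
        return input_spec
    return Spec(input_spec.placements, tuple(out_shape), out_dtype)
```

Reuse is only safe because the spec is immutable: no caller can change the shared object behind another caller's back.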

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128112
Approved by: https://github.com/awgu
2024-06-06 16:55:50 +00:00
f0dd11df55 Make ValueRange repr less chatty by default (#128043)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128043
Approved by: https://github.com/lezcano
2024-06-06 16:42:48 +00:00
eqy
0de6d2427f Bump tolerances for inductor/test_efficient_conv_bn_eval.py::EfficientConvBNEvalCudaTests::test_basic_cuda attempt 2 (#128048)
CC @nWEIdia @huydhn @Skylion007

Same thing but also bump backward tolerances...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128048
Approved by: https://github.com/Skylion007
2024-06-06 16:17:43 +00:00
a5b86a1ec0 Revert "FP8 rowwise scaling (#125204)"
This reverts commit 5dc912822913b3d90f4938891c7eca722a057cf1.

Reverted https://github.com/pytorch/pytorch/pull/125204 on behalf of https://github.com/atalman due to Sorry need to revert this failing, on internal CI. I suggest to reimport this and try to land internally resolving all issues ([comment](https://github.com/pytorch/pytorch/pull/125204#issuecomment-2152905513))
2024-06-06 16:12:34 +00:00
a5ba9b2858 Fix for addcdiv contiguous problem (#124442)
Fixes issue number #118115
Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124442
Approved by: https://github.com/kulinseth
2024-06-06 16:09:18 +00:00
c58d3af3b4 Revert "Add OpInfo entry for alias_copy (#127232)"
This reverts commit 457df212e1c6e1aa4f1eb2ad6ee292052d7c07e1.

Reverted https://github.com/pytorch/pytorch/pull/127232 on behalf of https://github.com/clee2000 due to broke [onnx](https://github.com/pytorch/pytorch/actions/runs/9397057801/job/25880181144) and [mps](https://github.com/pytorch/pytorch/actions/runs/9397057805/job/25879818705) tests, [hud link](457df212e1) , base is 15 days old, the onnx test xfailed on the pr but the xfail was removed so if you rebase itll surface, mps build failed so no mps tests were run on the pr ([comment](https://github.com/pytorch/pytorch/pull/127232#issuecomment-2152848758))
2024-06-06 15:44:47 +00:00
9d849d4312 Disable py3.12 nightly wheel builds for ROCm (#127968)
The Triton commit bump PR https://github.com/pytorch/pytorch/pull/125396 was reverted due to a missing llnl-hatchet dependency for triton. The workaround is to disable py3.12 binary build jobs for ROCm on PyTorch CI until llnl-hatchet publishes py3.12 wheels on [PyPI](https://pypi.org/project/llnl-hatchet/#files)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127968
Approved by: https://github.com/atalman, https://github.com/pruthvistony
2024-06-06 15:17:35 +00:00
48a54146e7 Revert "[dynamo] Support ndarray.dtype attribute access (#124490)"
This reverts commit 4adee71155bec4e419bac32be2cbc1763bc6c98f.

Reverted https://github.com/pytorch/pytorch/pull/124490 on behalf of https://github.com/atalman due to Breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/124490#issuecomment-2152664749))
2024-06-06 14:21:29 +00:00
f08fd8e9e3 Remove redundant device guard in Resize.h (#126498)
In https://github.com/pytorch/pytorch/pull/113386 a device guard was [inserted](https://github.com/pytorch/pytorch/pull/113386/files#diff-2691af3a999b3a8f4a0f635aabcd8edf0ffeda501edfa9366648e8a89de12a90R30).

The newly inserted device guard has a clearer and more confined scope.
It is hard to tell the exact purpose and scope of the [old device guard](78ffe49a3f/aten/src/ATen/native/cuda/Resize.h (L41)).

Removing the old guard has a negligible positive performance impact and makes the code more understandable.

Thanks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126498
Approved by: https://github.com/eqy, https://github.com/lezcano
2024-06-06 13:01:42 +00:00
c97e3ebb96 Fix wrongly exposed variables in torch/__init__.py (#127795)

This PR removes temporary variables in `torch/__init__.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127795
Approved by: https://github.com/albanD
2024-06-06 08:31:41 +00:00
457df212e1 Add OpInfo entry for alias_copy (#127232)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127232
Approved by: https://github.com/lezcano
2024-06-06 07:46:26 +00:00
f5328542b5 Allow multiple cudagraph recordings per compiled graph (#126822)
### Introduction/Problem

Today when dynamo traces a builtin nn module (nn.Linear for example) it will specially handle parameters of that module by storing them as constant attributes of the graph. This requires that dynamo guard on the ID of the NNModule because if the instance of the module changes, we need to retrace and recollect the new parameters as attributes of the graph. This creates a 1:1 compiled graph to cudagraph relationship.

With hierarchical compilation, dynamo will treat builtin nn modules like any other code. This reduces complexity and, critically, if there are multiple identical layers in a model, we only need to compile one of those layers once, and reuse the same compiled artifact for each layer. This introduces a problem for the current approach to parameter handling. Since the parameters could now possibly change across calls to the compiled artifact, these need to be inputs to the graph instead of attributes. This in turn creates a problem for cudagraphs: previously cudagraphs was guaranteed that the parameters of builtin NN Modules would be constant across calls, but now, since the compiled artifact needs to be agnostic to the actual instance of the NN module being used, these parameter memory locations may vary. Cudagraphs normally copies varying inputs to cudagraph-owned memory, but since the parameters are quite large, doing this for parameters would be catastrophic for performance.

### Solution
To avoid this performance cliff, this PR allows cudagraphs to re-record a new cudagraph if only parameters change. Metadata about which arguments are parameters are propagated from AOT Autograd to compile_fx, and these indices are passed to cudagraphs. If these memory locations change, a new graph is recorded vs previously where this would be an error (because this previously should not happen). This enables a 1:many compiled graph to cudagraph relationship. Across similar modules we will re-record cudagraphs and dispatch the correct graph if parameter pointers match when the cudagraph is executed.
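
Conceptually, the 1:many relationship behaves like a cache keyed by parameter addresses. The sketch below is a toy model under that assumption; `MultiRecordingCache` and its `record` callback are hypothetical and unrelated to the actual cudagraph trees code.

```python
from typing import Callable, Dict, Tuple


class MultiRecordingCache:
    """Toy model: one compiled graph, many recordings keyed by parameter addresses."""

    def __init__(self, record: Callable[[Tuple[int, ...]], str]) -> None:
        self._record = record
        self._recordings: Dict[Tuple[int, ...], str] = {}

    def get(self, param_ptrs: Tuple[int, ...]) -> str:
        if param_ptrs not in self._recordings:
            # Parameter addresses changed: record a new graph
            # instead of raising an error or copying parameters.
            self._recordings[param_ptrs] = self._record(param_ptrs)
        # Addresses match an earlier recording: dispatch it as-is.
        return self._recordings[param_ptrs]
```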

### Next steps (if needed)
It is theoretically possible that a user passes Parameters that change frequently as inputs to model code - if this is a common issue this design allows for dynamo to pass metadata indicating which parameters were created in a builtin NN Module context to only permit those parameters to have the multi-cudagraph behavior, but this PR does not implement this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126822
Approved by: https://github.com/eellison
ghstack dependencies: #126820, #126821
2024-06-06 06:39:59 +00:00
5a3bea1e88 Remove unused arg to GraphLowering (#126821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126821
Approved by: https://github.com/eellison
ghstack dependencies: #126820
2024-06-06 06:39:59 +00:00
70ba6f0ab6 Collect static parameter metadata in aot (#126820)
Collect the indices of the static parameters to pass down to cudagraphs in order to re-record if necessary.
This location was chosen in order to allow us to restrict this (if needed) in the future by setting metadata in dynamo.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126820
Approved by: https://github.com/bdhirsh
2024-06-06 06:39:50 +00:00
c8ff1cd387 [FSDP2] Changed test_register_forward_method to use multiprocess test (#128100)
The test seems to be flaky due to multi-threaded process group. This PR converts the test to use normal multi-process `ProcessGroupNCCL` to fix the flakiness.

This PR closes https://github.com/pytorch/pytorch/issues/126851.

Interestingly, the original MTPG version passes for me on devgpu. Either way, the new version also passes on devgpu, so we can see in CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128100
Approved by: https://github.com/weifengpy
2024-06-06 06:34:02 +00:00
638f543ac2 Enable single nadam test (#128087)
https://github.com/pytorch/pytorch/issues/117150 has been fixed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128087
Approved by: https://github.com/xmfan
2024-06-06 06:25:00 +00:00
cd42b95047 Handle aten::__contains__ during TorchScript to ExportedProgram conversion (#127544)
#### Description
Add support for converting `aten::__contains__` from TorchScript IR to ExportedProgram, e.g.,
```python
class MIn(torch.nn.Module):
    def forward(self, x: torch.Tensor):
        return x.dtype in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```
#### Test Plan
* Add test cases to cover contains IR resulting from both primitive types and Tensor. `pytest test/export/test_converter.py -s -k test_ts2ep_converter_contains`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127544
Approved by: https://github.com/angelayi
2024-06-06 05:00:13 +00:00
cyy
68eb771265 [2/N] Remove unused test functions (#128005)
Following #127881, this PR continues to remove unused test functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128005
Approved by: https://github.com/ezyang
2024-06-06 03:41:32 +00:00
2f7cfecd86 Complete revamp of float/promotion sympy handling (#126905)
At a high level, the idea behind this PR is:

* Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc in sympy, instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.)
* Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers.

The story begins in **torch/utils/_sympy/functions.py**. Here, I make some changes to how we represent certain operations in sympy expressions:

* FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing).
* ModularIndexing, LShift, RShift now assert they are given integer inputs.
* Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver
* TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows us to eventually generate accurate code for Python-semantics IntTrueDiv, which is written in a special way to preserve precision when the inputs are >= 2**53, beyond what first coercing the integers to floats and then doing true division would give.
* Trunc is split to TruncToFloat and TruncToInt.
* Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result.
* RoundDecimal updated to consistently only ever return a float
* Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing)
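
The precision point behind IntTrueDiv can be seen in plain Python, without any sympy or Inductor code. The inputs below are chosen for illustration so that the exact quotient is 2**53 + 1; this shows Python's own integer true division, not the generated code.

```python
# Python's int / int rounds the exact quotient once; converting each
# operand to float first rounds twice and can land on a different double.
b = 3
a = 3 * (2**53 + 1)  # exact quotient a / b is 2**53 + 1

exact = a / b                # Python-style integer true division
naive = float(a) / float(b)  # coerce to float first, then divide

print(exact, naive, exact == naive)
```

The two results differ in the last representable step, which is the kind of loss the message warns about for codegen that casts to float before dividing.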

In **torch/__init__.py**, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations.  Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information.
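
A minimal pure-Python sketch of the promotion semantics described above. `promoting_max` is a hypothetical helper written for illustration; it is not the implementation of `torch.sym_max`.

```python
from typing import Union

Number = Union[int, float]


def promoting_max(a: Number, b: Number) -> Number:
    """Hypothetical helper: max whose result type depends only on input types."""
    if isinstance(a, float) or isinstance(b, float):
        # One float argument is enough to make the result a float.
        return float(max(a, b))
    return max(a, b)


print(type(max(2, 1.5)).__name__)            # builtin keeps the winner's type
print(type(promoting_max(2, 1.5)).__name__)  # promoted because one input is float
```

The builtin returns whichever argument wins, so its result type depends on runtime values; with promotion the result type is known from the argument types alone.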

We also need to introduce some new op handlers in **torch/_inductor/ops_handler.py**:

* `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy
* `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv`

These changes have consequences. First, we need to make some administrative changes:

* Actually wire up these Sympy functions from SymInt/SymFloat in **torch/fx/experimental/sym_node.py**, including the new promotion rules (promote2)
* Add support for new Sympy functions in **torch/utils/_sympy/interp.py**, **torch/utils/_sympy/reference.py**
  * In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function
  * TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency this introduces to fix tests here
* Add printer support for the Sympy functions in **torch/_inductor/codegen/common.py**, **torch/_inductor/codegen/cpp_utils.py**, **torch/_inductor/codegen/triton.py**. `int_truediv` and mixed precision equality is currently not implemented soundly, so we will lose precision in codegen for large values. TODO: The additions here are not exhaustive yet
* Update ValueRanges logic to use new sympy functions in **torch/utils/_sympy/value_ranges.py**. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions.

In **torch/fx/experimental/symbolic_shapes.py** we need to make some symbolic reasoning adjustments:

* Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. This simplification then triggers an addition simplification rule `(x + y) / c --> x / c + y / c` which is bad because x / c is a rational number now
* `_assert_bound_is_rational` is no more, we no longer generate rational bounds
* Don't intersect non-int value ranges with the `int_range`
* Support more sympy Functions for guard SYMPY_INTERP
* Assert the type of value range is consistent with the variable type

The new asserts uncovered necessary bug fixes:

* **torch/_inductor/codegen/cpp.py**, **torch/_inductor/select_algorithm.py**, **torch/_inductor/sizevars.py** - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions
* **torch/_inductor/utils.py** - make sure you actually pass in sympy.Expr to these functions
* **torch/_inductor/ir.py** - make_contiguous_strides_for takes int/SymInt, not sympy.Expr!
* **torch/export/dynamic_shapes.py** - don't use infinity to represent int ranges, instead use sys.maxsize - 1

Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at **test/test_proxy_tensor.py**

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905
Approved by: https://github.com/xadupre, https://github.com/lezcano
2024-06-06 02:29:45 +00:00
c1a43a69e4 [NestedTensor] Add error checks for unbind operator coverage when ragged_idx != 1 (#128058)
Summary:
Add the following error checks for the `unbind` operator on `NestedTensor`s when `ragged_idx != 1`:

- The current implementation allows the creation of `NestedTensor` instances from the class definition with an `offsets` tensor that applies to a dimension other than the jagged dimension. This diff ensures that `unbind` fails when the `offsets` exceed the length of the jagged dimension.
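
The kind of check described above can be sketched on plain Python lists. `unbind_slices` is a hypothetical function, and treating `offsets` as cumulative boundaries along the jagged dimension is an assumption of this sketch, not a statement about the actual `NestedTensor` code.

```python
from typing import List, Sequence, Tuple


def unbind_slices(offsets: Sequence[int], jagged_dim_len: int) -> List[Tuple[int, int]]:
    """Hypothetical: (start, end) of each component along the jagged dimension."""
    if len(offsets) < 2:
        raise RuntimeError("offsets must contain at least two boundaries")
    if max(offsets) > jagged_dim_len:
        # Offsets that point past the jagged dimension cannot describe it.
        raise RuntimeError(
            f"offsets reach {max(offsets)} but the jagged dimension "
            f"has length {jagged_dim_len}"
        )
    return list(zip(offsets[:-1], offsets[1:]))
```

For example, `unbind_slices([0, 2, 5], 5)` yields two slices, while `unbind_slices([0, 2, 7], 5)` raises instead of silently reading out of range.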

Test Plan:
Added the following unit tests:

`test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu` verifies that `unbind` fails when there is a mismatch between the offsets and the jagged dimension, for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```

Reviewed By: davidberard98

Differential Revision: D57989082

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128058
Approved by: https://github.com/davidberard98
2024-06-06 01:56:12 +00:00
9795c4224b Revert "[DDP] Bucket handling: make first bucket size equal to bucket_cap_mb if it was set (#121640)"
This reverts commit e98662bed99df57b7d79f9fc1cbe670afc303235.

Reverted https://github.com/pytorch/pytorch/pull/121640 on behalf of https://github.com/clee2000 due to Sorry but it looks like you're failing `distributed/_composable/test_replicate_with_compiler.py::ReplicateTest::test_bucketing_coalesced_op`. The build failed so the tests didn't run; consider rebasing. There have been a couple of PRs lately related to cudnn, so you are probably based on either a bad or too-old commit e98662bed9 https://github.com/pytorch/pytorch/actions/runs/9392731942/job/25868060913 ([comment](https://github.com/pytorch/pytorch/pull/121640#issuecomment-2151258585))
2024-06-06 01:50:18 +00:00
sdp
b4a0161449 Build SYCL kernels for ATen XPU ops on Native Windows (take 2) (#127390)
Original PR https://github.com/pytorch/pytorch/pull/126725 is closed due to bad rebase.

-------
As proposed in https://github.com/pytorch/pytorch/issues/126719, we are enabling PyTorch XPU on Native Windows on Intel GPU.

This PR  enables XPU build on Windows as the first step of #126719:

- Enable `USE_XPU` build on Windows using MSVC as host compiler. The use of MSVC as host compiler seamlessly aligns with the existing PyTorch build on Windows.
- Build oneDNN GPU library on Windows.

Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127390
Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/ezyang
2024-06-06 01:41:06 +00:00
6adcf21b2b Documenting the torch.cuda.nccl.version function (#128022)
Fixes #127892

This PR adds docstring to the torch.cuda.nccl.version function

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128022
Approved by: https://github.com/malfet
2024-06-06 01:13:07 +00:00
bf2c05352e Make length == stop size oblivious too (#128050)
This doesn't do anything right now (need some other PRs to activate)
but since it edits a header file it would be better to land this
earlier.

Context: https://github.com/pytorch/pytorch/pull/127693

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128050
Approved by: https://github.com/Skylion007, https://github.com/lezcano
2024-06-06 01:09:37 +00:00
80d34217c6 Typo fixes: et al. (#127811)
"et al." is short for _et alia_ and should be abbreviated with a period on the second word. Noticed this typo when reading through the SGD docs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127811
Approved by: https://github.com/janeyx99
2024-06-06 01:03:25 +00:00
d3ad84c38f Use pexpr, not texpr in Triton launch codegen (#128038)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128038
Approved by: https://github.com/Skylion007
2024-06-06 00:45:59 +00:00
8bcebc8dae Add runtime dependency on setuptools for cpp_extensions (#127921)
As per title, since this was removed from the builtin python binary in 3.12 and we use it in `torch.utils.cpp_extension.*`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127921
Approved by: https://github.com/Skylion007
2024-06-05 23:59:38 +00:00
cyy
2fd75667b4 [Caffe2]Remove Caffe2 scripts and benchmarks (#126747)
Due to removal of Caffe2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126747
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-06-05 23:46:31 +00:00
e98662bed9 [DDP] Bucket handling: make first bucket size equal to bucket_cap_mb if it was set (#121640)
The first DDP bucket is always created with the size `dist._DEFAULT_FIRST_BUCKET_BYTES` (1 MiB) by default, regardless of `bucket_cap_mb`. The proposal is to use `bucket_cap_mb` as the first bucket's size if it was supplied by the user.
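
A rough sketch of the proposed sizing rule, assuming the user's value arrives as an optional number of MiB. `first_bucket_bytes` and the module-level constant are hypothetical names for illustration, not DDP's actual code.

```python
from typing import Optional

# 1 MiB, the default first-bucket size quoted in the message above.
DEFAULT_FIRST_BUCKET_BYTES = 1024 * 1024


def first_bucket_bytes(bucket_cap_mb: Optional[float]) -> int:
    """Hypothetical helper: pick the size of the first bucket in bytes."""
    if bucket_cap_mb is None:
        # User did not set a cap: keep the small default first bucket.
        return DEFAULT_FIRST_BUCKET_BYTES
    # User set a cap: the first bucket follows it like every other bucket.
    return int(bucket_cap_mb * 1024 * 1024)
```

Under this rule, `first_bucket_bytes(None)` stays at 1 MiB, while `first_bucket_bytes(50)` gives a 50 MiB first bucket.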

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121640
Approved by: https://github.com/wanchaol
2024-06-05 23:44:54 +00:00
ffaea656b5 WorkerServer: add support for binding to TCP (#127986)
This adds support for the WorkerServer binding to TCP as well as the existing unix socket support.

```py
server = _WorkerServer("", 1234)
```

Test plan:

Added unit test

```
python test/distributed/elastic/test_control_plane.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127986
Approved by: https://github.com/c-p-i-o
2024-06-05 22:56:32 +00:00
a7c596870d [BE][Eazy] remove torch.torch.xxx usages (#127800)
NB: `torch` is exposed in `torch/__init__.py`. So there can be `torch.torch.torch.xxx`.
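
Why such chains resolve at all can be shown with a stand-in module object. `pkg` below is a hypothetical module built at runtime; it only mimics a package whose namespace contains a reference to itself.

```python
import types

# Hypothetical stand-in for a package whose __init__.py imports itself.
pkg = types.ModuleType("pkg")
pkg.value = 42
pkg.pkg = pkg  # the module's namespace now holds a name bound to the module

# Every extra `.pkg` hop lands on the same object, so any depth works.
print(pkg.pkg.pkg.value)
print(pkg.pkg.pkg is pkg)
```

Each hop is an ordinary attribute lookup, so the redundant prefixes are harmless at runtime but obscure the real call site.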

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127800
Approved by: https://github.com/peterbell10, https://github.com/kit1980, https://github.com/malfet
2024-06-05 21:53:49 +00:00
4123323eff [ONNX] Single function for torch.onnx.export and torch.onnx.dynamo_export (#127974)
Add `dynamo: bool = True` as a switch in `torch.onnx.export` to provide users an option to try `torch.onnx.dynamo_export`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127974
Approved by: https://github.com/justinchuby
2024-06-05 21:27:46 +00:00
01694eaa56 Move cuda 12.4 jobs to periodic for both pull and inductor (#127825)
Moves 12.4 sm86/a10g jobs in pull to trunk
Moves 12.4 cuda non sm86 jobs to periodic
Moves 12.4 jobs in inductor to inductor-periodic, except inductor_timm which seems to give important signal

There has been a lot of queueing for cuda runners due to the addition of jobs for cuda 12.4, so move those jobs to other workflows that are run less often
Co-authored-by: Andrey Talman <atalman@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127825
Approved by: https://github.com/ZainRizvi, https://github.com/nWEIdia, https://github.com/atalman, https://github.com/malfet
2024-06-05 21:01:36 +00:00
8184cd85fc [fake tensor] Set _is_param for base fake tensors for views (#127823)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127823
Approved by: https://github.com/eellison, https://github.com/ezyang
ghstack dependencies: #127972
2024-06-05 20:26:52 +00:00
626dc934d1 [dynamo][pippy] Hotfix for nn_module_stack for pippy usecase (#127972)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127972
Approved by: https://github.com/ydwu4
2024-06-05 20:14:50 +00:00
72e863df27 Update _learnable_fake_quantize.py (#127993)
Remove sentence "For literature references, please see the class _LearnableFakeQuantizePerTensorOp." and add "s" to "support"

(Possibly) Fixes #99107 (But not sure, sorry)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127993
Approved by: https://github.com/jerryzh168
2024-06-05 20:02:33 +00:00
6e545392cd Move nongpu workflows from trunk to periodic (#128049)
We don't need to run them on every PR. These are used to test for graceful degradation of GPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128049
Approved by: https://github.com/clee2000
2024-06-05 18:31:26 +00:00
6412c6060c [reland] Refresh OpOverloadPacket if a new OpOverload gets added (#128000)
If a user accesses an OpOverloadPacket, then creates a new OpOverload,
then uses the OpOverloadPacket, the new OpOverload never gets hit. This
is because OpOverloadPacket caches OpOverloads when it is constructed.

This PR fixes the problem by "refreshing" the OpOverloadPacket if a new
OpOverload gets constructed and the OpOverloadPacket exists.
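
The caching problem and the refresh idea can be modelled in a few lines. `Packet`, `REGISTRY`, `PACKETS` and `register_overload` are hypothetical names for this sketch; they are not the real `OpOverloadPacket` machinery.

```python
from typing import Dict, List

REGISTRY: Dict[str, List[str]] = {}  # hypothetical: op name -> overload names
PACKETS: Dict[str, "Packet"] = {}    # hypothetical: packets created so far


class Packet:
    """Caches the overload names that exist when it is constructed."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.overloads = list(REGISTRY.get(name, []))  # snapshot
        PACKETS[name] = self

    def refresh(self) -> None:
        self.overloads = list(REGISTRY.get(self.name, []))


def register_overload(op: str, overload: str, refresh: bool = True) -> None:
    REGISTRY.setdefault(op, []).append(overload)
    # The fix in miniature: refresh an existing packet on new registrations.
    if refresh and op in PACKETS:
        PACKETS[op].refresh()


register_overload("demo", "default")
packet = Packet("demo")
register_overload("demo", "Tensor", refresh=False)  # stale: packet misses it
print(packet.overloads)
register_overload("demo", "Scalar")                 # refreshed: packet sees all
print(packet.overloads)
```

The `refresh=False` call reproduces the old behaviour, where a packet built earlier never learns about later overloads.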

Test Plan:
- new tests

This is the third land attempt. The first one was reverted for breaking
internal tests, the second was reverted for being erroneously suspected
of causing a perf regression.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128000
Approved by: https://github.com/albanD
2024-06-05 17:57:09 +00:00
bb68b54be0 [BE][ptd_fb_test][1/N] Enable testslide (#127512)
This change enables Testslide, which gives us more readable output, import time, etc. The PR was previously stamped at https://github.com/pytorch/pytorch/pull/126460 but the old PR had a ghexport issue.

Differential Revision: [D57919583](https://our.internmc.facebook.com/intern/diff/D57919583/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127512
Approved by: https://github.com/wz337, https://github.com/Skylion007
2024-06-05 17:45:15 +00:00
3acbfd602e Document torch.utils.collect_env.get_env_info function (#128021)
Fixes #127911

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128021
Approved by: https://github.com/malfet
2024-06-05 17:44:47 +00:00
6454e95824 [FSDP2] enable CI for torch.compile(root Transformer) (#127832)
This CI showcases FSDP2 works with `torch.compile` root model, since FSDP1 can do the same

compiling root Transformer without AC: `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group`

compiling root Transformer with AC: `pytest test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127832
Approved by: https://github.com/awgu
2024-06-05 17:29:46 +00:00
4adee71155 [dynamo] Support ndarray.dtype attribute access (#124490)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124490
Approved by: https://github.com/lezcano
ghstack dependencies: #125717
2024-06-05 17:20:01 +00:00
a9cc147fa1 [DSD][FSDP1] Deprecate FSDP.state_dict_type and redirect users to DSD (#127794)
Summary:
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127794
Approved by: https://github.com/awgu
ghstack dependencies: #127793
2024-06-05 16:55:05 +00:00
9acc19f8da [inductor] Take absolute value of strides when picking loop order (#127425)
Fixes #126860

The stride hint is found by comparing the value of the indexing expression
evaluated at `idx` set to all zeros and at `idx[dim] = 1`. This causes a problem
for padded inputs where 0 and 1 are still in the padded region.

In particular, for reflection padding this causes the stride to be negative.
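
The effect can be reproduced with a toy 1-D reflection-padding index function. `reflect_index` and `stride_hint` are hypothetical helpers written for this sketch; they only mirror the procedure described above, not Inductor's code.

```python
from typing import Callable, List


def reflect_index(i: int, pad: int, n: int) -> int:
    """Source index read by output position i under 1-D reflection padding."""
    j = i - pad
    if j < 0:
        return -j               # left padded region mirrors around index 0
    if j >= n:
        return 2 * (n - 1) - j  # right padded region mirrors around n - 1
    return j


def stride_hint(index_fn: Callable[[List[int]], int], ndim: int, dim: int) -> int:
    """Index at idx[dim] = 1 minus index at all zeros."""
    zero = [0] * ndim
    one = list(zero)
    one[dim] = 1
    return index_fn(one) - index_fn(zero)


pad, n = 2, 8
hint = stride_hint(lambda idx: reflect_index(idx[0], pad, n), ndim=1, dim=0)
print(hint, abs(hint))
```

Positions 0 and 1 both fall in the left padded region, where the source index decreases as the output index grows, so the raw hint is negative even though the underlying data is unit-stride.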

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127425
Approved by: https://github.com/lezcano
2024-06-05 16:48:22 +00:00
22964d1007 [DSD] Deprecate submodules feature for DSD (#127793)
Summary:
Getting a partial state_dict and setting the state_dict with the type Dict[nn.Module, Dict[str, Any]] is too complicated and can confuse users. The feature can be achieved by simple pre-processing and post-processing by users. So this PR adds a deprecation warning to the feature.

The previous PR, https://github.com/pytorch/pytorch/pull/127070, assumed
no one was using the feature and removed it without a grace period. This
seemed too aggressive and caused some concerns. This PR adds the
deprecation warning and tests.

We will remove the support in 2.5.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127793
Approved by: https://github.com/LucasLLC
2024-06-05 16:31:29 +00:00
5dc9128229 FP8 rowwise scaling (#125204)
# Summary
This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met:
- `x`'s scale should be a 1-dimensional tensor of length `M`.
- `y`'s scale should be a 1-dimensional tensor of length `N`.

It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row".
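
The selection conditions listed above amount to a shape check, sketched here on plain tuples. `uses_rowwise_scaling` is a hypothetical function for illustration and says nothing about how the real dispatch inside `scaled_mm` is written.

```python
from typing import Tuple

Shape = Tuple[int, ...]


def uses_rowwise_scaling(x: Shape, y: Shape, x_scale: Shape, y_scale: Shape) -> bool:
    """Hypothetical check: do the scale shapes match the row-wise conditions?"""
    (m, k), (k2, n) = x, y
    if k != k2:
        raise ValueError(f"inner dimensions differ: {k} vs {k2}")
    # x needs one scale per row (length M); y needs one per column (length N).
    return x_scale == (m,) and y_scale == (n,)


print(uses_rowwise_scaling((16, 32), (32, 8), (16,), (8,)))  # per-row scales
print(uses_rowwise_scaling((16, 32), (32, 8), (), ()))       # scalar scales
```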

The following two PRs were required to enable local builds:
- [PR #126185](https://github.com/pytorch/pytorch/pull/126185)
- [PR #125523](https://github.com/pytorch/pytorch/pull/125523)

### Todo
We still do not build our Python wheels with this architecture.

@ptrblck @malfet, should we replace `sm_90` with `sm_90a`?

The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit:
https://github.com/pytorch/pytorch/pull/125204/files#r1586986954

#### ifdef

I tried to use : `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && \
    defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate the building of the kernel. I was having a hell of a time with this.. so I am not really sure the right way to do this

Kernel Credit:
@jwfromm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125204
Approved by: https://github.com/lw, https://github.com/malfet
2024-06-05 15:46:40 +00:00
4f9fcd7156 Handle unpacking during TorchScript to ExportedProgram conversion (#127419)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127419
Approved by: https://github.com/angelayi
2024-06-05 15:27:13 +00:00
cyy
9f2c4b9342 Replace with standard type traits in torch/csrc (#127852)
In preparation to clean up more type traits.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127852
Approved by: https://github.com/ezyang
2024-06-05 15:22:48 +00:00
cyy
3d617333e7 Simplify CMake code (#127683)
Due to the recent adoption of find(python), it is possible to further simplify some CMake code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127683
Approved by: https://github.com/ezyang
2024-06-05 15:17:31 +00:00
cyy
df75a9dc80 Remove Caffe2/onnx (#127991)
Remove Caffe2/onnx since it is not used. Other tiny fixes are also applied.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127991
Approved by: https://github.com/ezyang
2024-06-05 15:10:12 +00:00
d48c25c7d1 [BE] Fix missing-prototypes errors in Metal backend (#127994)
By declaring a bunch of functions static.
Removed `USE_PYTORCH_METAL` from the list of flags that suppress `-Werror=missing-prototypes`. This will prevent regressions like the ones reported in https://github.com/pytorch/pytorch/issues/127942 from sneaking past CI, which builds PyTorch with Metal support.
Use nested namespaces
Remove spurious semicolon after TORCH_LIBRARY declaration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127994
Approved by: https://github.com/Skylion007, https://github.com/ZainRizvi
2024-06-05 14:58:19 +00:00
8992141dba Restore MPS testing on MacOS 13 and m2 metal (#127853)
The runners are ready now https://github.com/organizations/pytorch/settings/actions/runners?qr=label%3Amacos-m1-13, we want to keep some MacOS 13 runner for mps coverage until MacOS 15 is out.

This also fixes the `macos-m2-14` mistake from https://github.com/pytorch/pytorch/pull/127582.

The current `macos-m2-14` runner is on 14.2 while our `macos-m1-14` has 14.4.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127853
Approved by: https://github.com/malfet
2024-06-05 14:44:00 +00:00
879d01afcb [dynamo][numpy] Add unsigned integer dtypes (#125717)
We should support these to whatever extent we can. The corresponding
`torch.uint<w>` types are defined, so I don't see an issue with
generating the various casting rules and allowing them to trace.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125717
Approved by: https://github.com/lezcano
2024-06-05 14:33:47 +00:00
4ce5322a1f Enable UFMT on test_shape_ops.py test_show_pickle.py test_sort_and_select.py (#127165)
Fixes some files in #123062

Run lintrunner on files:
test_shape_ops.py
test_show_pickle.py
test_sort_and_select.py

```bash
$ lintrunner --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127165
Approved by: https://github.com/ezyang
2024-06-05 14:31:26 +00:00
faabda4fc9 [Inductor] Skip model_fail_to_load and eager_fail_to_run models in inductor benchmarks test (#127210)
Aligned with test-infra repo, we skipped `model_fail_to_load` and `eager_fail_to_run` models
Refer code logic:
d3b79778f8/torchci/rockset/inductor/__sql/compilers_benchmark_performance.sql (L57-L58)
```SQL
  WHERE
    filename LIKE '%_accuracy'
    AND filename LIKE CONCAT(
      '%_', : dtypes, '_', : mode, '_', : device,
      '_%'
    )
    AND _event_time >= PARSE_DATETIME_ISO8601(:startTime)
    AND _event_time < PARSE_DATETIME_ISO8601(:stopTime)
    AND (workflow_id = :workflowId OR :workflowId = 0)
    AND accuracy != 'model_fail_to_load'
    AND accuracy != 'eager_fail_to_run'
),
```

Comp Item | Compiler | suite | Before | After fix
-- | -- | -- | -- | --
Pass Rate | Inductor | torchbench | 96%, 80/83 | 100%, 80/80

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127210
Approved by: https://github.com/jansel
2024-06-05 14:23:09 +00:00
c3949b20a1 Opt model save and load (#126374)
## save&load support for OptimizedModule

[Issue Description](https://github.com/pytorch/pytorch/pull/101651)

English is not my native language; please excuse typing errors.

This pr is based on commit b9588101c4d3411b107fdc860acfa8a72c642f91\
I'll do something with the merge conflicts later

### test result for test/dynamo

Conclusion:\
It performs the same as before as far as I can see.

ENV(CPU only):\
platform linux -- Python 3.10.14, pytest-7.3.2, pluggy-1.5.0\
configfile: pytest.ini\
plugins: anyio-3.7.1, cpp-2.3.0, flakefinder-1.1.0, xdist-3.3.1, xdoctest-1.1.0, metadata-3.1.1, html-4.1.1, hypothesis-5.35.1, rerunfailures-14.0

#### before this pr:

[before](https://github.com/pytorch/pytorch/files/15329370/before.md)

#### after this pr:

[after](https://github.com/pytorch/pytorch/files/15329376/after.md)

### some changes

1. add test_save_and_load to test/dynamo/test_modules.py with & without "backend='inductor'"
2. add \_\_reduce\_\_ function to OptimizedModule and derived classes of _TorchDynamoContext for pickling & unpickling
3. change the wrappers into wrapper classes ( including convert_frame_assert, convert_frame, catch_errors_wrapper in torch/_dynamo/convert_frame.py & wrap_backend_debug in torch/_dynamo/repro/after_dynamo.py )
4. change self.output.compiler_fn into innermost_fn(self.output.compiler_fn) in torch/_dynamo/symbolic_convert.py to get the origin compiler_fn and to avoid the "compiler_fn is not eager" condition
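
Items 2 and 3 rest on a general Python fact: a closure cannot be pickled, but a class instance that defines `__reduce__` can. A minimal sketch follows, with `BackendWrapper` as a hypothetical class that is not one of the Dynamo wrappers named above.

```python
import pickle
from typing import Any, Callable, Tuple


class BackendWrapper:
    """Hypothetical wrapper class standing in for a closure-based wrapper."""

    def __init__(self, fn: Callable[..., Any]) -> None:
        self.fn = fn

    def __call__(self, *args: Any, **kwargs: Any) -> Any:
        return self.fn(*args, **kwargs)

    def __reduce__(self) -> Tuple[Any, Tuple[Any, ...]]:
        # Tell pickle how to rebuild the object: call BackendWrapper(self.fn).
        return (BackendWrapper, (self.fn,))


if __name__ == "__main__":
    # Round-trip through pickle; the wrapped builtin is pickled by reference.
    restored = pickle.loads(pickle.dumps(BackendWrapper(abs)))
    print(restored(-3))
```

The wrapped callable must itself be picklable (here the builtin `abs`); a lambda or nested function in its place would still fail.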

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126374
Approved by: https://github.com/msaroufim, https://github.com/jansel
2024-06-05 13:01:16 +00:00
9a8ab778d3 Revert "[BE]: Update cudnn to 9.1.0.70 (#123475)"
This reverts commit c490046693e77e254664e19d940e9b05a1da18ef.

Reverted https://github.com/pytorch/pytorch/pull/123475 on behalf of https://github.com/huydhn due to CUDA trunk jobs are pretty red after this change, and the forward fix https://github.com/pytorch/pytorch/pull/127984 does not look working ([comment](https://github.com/pytorch/pytorch/pull/123475#issuecomment-2149258430))
2024-06-05 08:59:53 +00:00
bb2de3b101 Fixed broken link and removed unfinished sentence from issue #126367 (#127938)
Fixes #126367.

## Description

Fixed a broken link in the pytorch/docs/source/torch.compiler_faq.rst doc and deleted a few words that were extra according to the issue tagged above.

## Checklist
- [X] The issue that is being fixed is referred in the description
- [X] Only one issue is addressed in this pull request
- [X] Labels from the issue that this PR is fixing are added to this pull request
- [X] No unnecessary issues are included into this pull request

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127938
Approved by: https://github.com/msaroufim
2024-06-05 07:37:32 +00:00
4a384d813b [SDPA/memeff] Backport changes from xFormers to PT (#127090)
Backporting a few fixes from xFormers:
* Bug fixes for local attention (which is not exposed in PT at the moment)
* Massively reduced memory usage on the BW pass (see also https://github.com/facebookresearch/xformers/pull/1028)

Essentially this will also make xFormers build process much easier, as we will be able to use mem-eff from PyTorch (if the user has a recent enough version) rather than building it at xFormers install time
The goal is to have the source of truth for these files in PT moving forward, and remove them from xFormers eventually once our users have a recent-enough version of PT.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127090
Approved by: https://github.com/drisspg
2024-06-05 07:33:27 +00:00
cyy
b054470db2 Remove unused functions (#127881)
Some unused functions detected by g++ warnings can be removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127881
Approved by: https://github.com/zou3519
2024-06-05 05:21:24 +00:00
30788739f4 [c10d] add a simple test to demonstrate the user usage of collectives (#127665)
Summary:
While playing around with the unit tests, I thought it would be good to
give a simple example of a user function that can be used with different
subclasses of _ControlCollectives, and to test that the user function
executes correctly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127665
Approved by: https://github.com/d4l3k
2024-06-05 04:32:11 +00:00
e505132797 [export] track TORCH_DYNAMO_DO_NOT_EMIT_RUNTIME_ASSERTS for export runtime asserts (#127554)
Track TORCH_DYNAMO_DO_NOT_EMIT_RUNTIME_ASSERTS=1 in export so it doesn't omit runtime asserts.

Differential Revision: D57978699

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127554
Approved by: https://github.com/tugsbayasgalan
2024-06-05 04:16:54 +00:00
d5cb5d623a Revert "Complete revamp of float/promotion sympy handling (#126905)"
This reverts commit fb696ef3aa34e20c0fef1c0210a397abd3ea5885.

Reverted https://github.com/pytorch/pytorch/pull/126905 on behalf of https://github.com/ezyang due to internal user reported ceiling equality simplification problem, I have a plan ([comment](https://github.com/pytorch/pytorch/pull/126905#issuecomment-2148805840))
2024-06-05 03:57:58 +00:00
55a4ef80c4 [pipelining] test pipeline_order in schedule (#127559)
Add a unit test to validate the pipeline order for different `num_stages`, `num_microbatches`, `num_world_size` combinations. This doesn't actually run the schedule; it only validates that the ordering of processed microbatches is valid, and therefore doesn't require GPUs or multiple processes.

Will add more combinations and negative tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127559
Approved by: https://github.com/wconstab
ghstack dependencies: #127084, #127332
2024-06-05 03:51:27 +00:00
71e684bfae [BE][Mac] Add missing prototypes (#127988)
Really confused how CI did not catch this one, but this triggers missing-prototype errors if compiled from scratch on MacOS Sonoma using clang-15

Fixes https://github.com/pytorch/pytorch/issues/127942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127988
Approved by: https://github.com/Skylion007, https://github.com/huydhn
2024-06-05 02:16:50 +00:00
cyy
ce4436944c Fix IOS builds (#127985)
IOS builds fail these days, fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127985
Approved by: https://github.com/ezyang
2024-06-05 02:14:43 +00:00
a135776307 Remove tensor subclass detection logic from weights_only unpickler (#127808)
Remove logic to auto-detect and allow subclasses that did not override certain methods from the weights_only unpickler from https://github.com/pytorch/pytorch/pull/124331 for 2.4 release

Subclasses should be loadable using `torch.serialization.add_safe_globals`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127808
Approved by: https://github.com/malfet
2024-06-05 02:14:30 +00:00
8e496046e5 Update torch-xpu-ops pin (ATen XPU implementation) (#127879)
Support AMP GradScaler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127879
Approved by: https://github.com/EikanWang
2024-06-05 02:13:46 +00:00
6c07e2c930 fix redundant tensor (#127850)
As title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127850
Approved by: https://github.com/mikaylagawarecki
2024-06-05 02:03:02 +00:00
8830b81208 [c10d] Add commCreateFromRanks to c10d (#127421) (#127982)
This is a duplicate of https://github.com/pytorch/pytorch/pull/127421, which we can't merge. It has already landed internally.

Summary:

`ncclCommCreateFromRanks` - described in this [document](https://docs.google.com/document/d/1QIRkAO4SAQ6eFBpxE51JmRKRAH2bwAHn8OIj69XuFqQ/edit#heading=h.5g71oqe3soez), replaces `ncclCommSplit` in NCCLX versions 2.21.5+.  The difference is that `ncclCommCreateFromRanks` is given a list of active ranks and is collective only over those ranks as opposed to `ncclCommSplit` for which you give it a color for every rank including NO_COLOR for inactive ranks and the collective is over the entire world.

This diff connects `ncclCommCreateFromRanks` to `c10d`

`ncclCommSplit` will still be available at the NCCL API but, in this diff, is not used starting at version 2.21.5

Split the python test and implementation of `split()` for internal FB and external OSS builds.

The diff defines `"USE_C10D_NCCL_FBCODE"` as a compiler option. When defined, we use the version of split in the newly created `NCCLUtils.cpp` in the `fb` directory.  The `fb` directory is not *shipit*-ed to *github*.

The same API is used for `split()` in both the `ncclx` and `nccl` versions adding `ranks` to the API.  This argument is not used in the `nccl` version nor in the 2.18 `ncclx` version where `ncclCommSplit()` is used instead of `ncclCommCreateFromRanks()` in `ncclx`

This diff was squashed with D57343946 - see D57343946 for additional review comments.

Test Plan:
for 2.18.3-1 and 2.21.5-1 versions:
```
buck2 run fbcode//mode/opt -c param.use_nccl=True -c fbcode.nvcc_arch=a100 -c hpc_comms.use_ncclx="$VERSION" -c fbcode.enable_gpu_sections=true  fbcode//caffe2/test/distributed/fb:test_comm_split_subgroup_x
```

```
BUILD SUCCEEDED
...
ok

----------------------------------------------------------------------
Ran 1 test in 10.210s

OK
~/scripts
```

OSS build:
`[cmodlin@devgpu003.vll5 ~/fbsource/third-party/ncclx/v2.21.5-1 (e56338cfa)]$ ./maint/oss_build.sh`

OSS build output:
```
...
ncclCommHash 197dce9b413e2775
nccl commDesc example_pg
Dump from comm 0x4708aa0 rings: [[0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0], [0]]
Dump from comm 0x4708aa0 commDesc: example_pg
Dump from comm 0x4708aa0 nRanks: 1
Dump from comm 0x4708aa0 nNodes: 1
Dump from comm 0x4708aa0 node: 0
Dump from comm 0x4708aa0 localRanks: 1
Dump from comm 0x4708aa0 localRank: 0
Dump from comm 0x4708aa0 rank: 0
Dump from comm 0x4708aa0 commHash: "197dce9b413e2775"

2024-05-24T09:02:54.385543 devgpu003:3040664:3040744 [0][AsyncJob]ctran/backends/ib/CtranIb.cc:143 NCCL WARN CTRAN-IB : No active device found.

2024-05-24T09:02:54.385607 devgpu003:3040664:3040744 [0][AsyncJob]ctran/mapper/CtranMapper.cc:187 NCCL WARN CTRAN: IB backend not enabled
Created NCCL_SPLIT_TYPE_NODE type splitComm 0x11c76d0, rank 0
~/fbsource/third-party/ncclx/v2.21.5-1
```

Reviewed By: wconstab, wesbland

Differential Revision: D56907877

Co-authored-by: Cory Modlin <cmodlin@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127982
Approved by: https://github.com/izaitsevfb
2024-06-05 00:19:52 +00:00
7fdfb88f03 [pipelining] rewrite interleaved 1f1b (#127332)
## Context

Interleaved 1F1B has multiple points in the schedule where communication is criss-crossed across ranks, leading to hangs due to (1) the looped nature of schedules and (2) the batched nature of forward + backward in the 1F1B phase.

<img width="1370" alt="image" src="https://github.com/pytorch/pytorch/assets/14858254/a07c2b1d-8a99-420b-9ba3-32a0115d228b">

In the current implementation, it is difficult to fix these hangs since it requires `dist.recv` from a prior point in time, but each rank operates on its own step schedule and does not have knowledge of other ranks' operations to perform the `recv` prior to their own `send`.

## New implementation

The new implementation is split into 2 parts:

1. Creating the pipeline order.

Each rank will create the timestep normalized ordering of all schedule actions across all ranks. This is created once during the initialization of the schedule class. The timestep between each rank is normalized as each rank can only have 1 computation action (forward or backward) during that timestep.

<img width="1065" alt="image" src="https://github.com/pytorch/pytorch/assets/14858254/196f2347-7ff4-49cf-903b-d8db97d1156f">

2. Executing the pipeline order.

Once the pipeline order is determined, execution is simple because each rank performs its send to its peer (based on whether it did a forward or a backward). Now that each rank has a global understanding of the schedule, it can check its previous and next neighbor ranks to see whether it needs to recv any activations/gradients from them. Therefore, during execution, all ranks are aligned and executing the same timestep.

## Benefits

- Implementation is faster since 1f1b computation can now be split up in two time steps, 1 for forward and 1 for backward.
- Debugging is easier since we can now determine which timestep each rank is hung on
- Testing is easier since we can just validate the pipeline order, without running the schedule. This allows us to test on a large number of ranks without actually needing the GPUs.
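
The two-part design above can be sketched as follows. This is a minimal illustration, not the PR's implementation: it assumes one stage per rank (so no interleaving), and the `Action` encoding, the `ORDER` table and the `recvs_at` helper are hypothetical names chosen for the example.

```python
from typing import List, Optional, Tuple

# Hypothetical encoding: ("F", mb) is a forward on microbatch mb, ("B", mb) a
# backward, and None an idle slot. ORDER[rank][t] is that rank's action at
# timestep t, so every rank has at most one compute action per timestep.
Action = Optional[Tuple[str, int]]

ORDER: List[List[Action]] = [
    [("F", 0), ("F", 1), None, ("B", 0), None, ("B", 1)],  # rank 0
    [None, ("F", 0), ("B", 0), ("F", 1), ("B", 1), None],  # rank 1
]


def recvs_at(order: List[List[Action]], rank: int, t: int) -> List[Tuple[str, int, int]]:
    """Return (what, microbatch, src_rank) receives for `rank` at timestep `t`.

    Every rank holds the same global table, so a rank can inspect its
    neighbors' actions instead of guessing when a peer will send.
    """
    recvs = []
    prev_rank, next_rank = rank - 1, rank + 1
    if prev_rank >= 0:
        action = order[prev_rank][t]
        if action is not None and action[0] == "F":
            # The previous stage ran a forward, so it sends us activations.
            recvs.append(("activation", action[1], prev_rank))
    if next_rank < len(order):
        action = order[next_rank][t]
        if action is not None and action[0] == "B":
            # The next stage ran a backward, so it sends us gradients.
            recvs.append(("gradient", action[1], next_rank))
    return recvs


# Walk the table timestep by timestep, as an aligned execution loop would.
for t in range(len(ORDER[0])):
    for rank in range(len(ORDER)):
        print(t, rank, ORDER[rank][t], recvs_at(ORDER, rank, t))
```

Because the table is plain data, a test can assert on it without launching any processes or allocating GPUs.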

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127332
Approved by: https://github.com/wconstab
ghstack dependencies: #127084
2024-06-04 23:46:05 +00:00
1f67cfd437 [inductor] raise tolerance for cspdarknet (#127949)
cspdarknet was previously flaky, but after https://github.com/pytorch/pytorch/pull/127367 it fails consistently. This is probably due to a small numerical change from the mentioned PR, which makes inductor generate different code due to different loop orders.

Raise tolerance to pass CI.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127949
Approved by: https://github.com/atalman, https://github.com/nWEIdia, https://github.com/eqy
2024-06-04 23:28:20 +00:00
907cb28f67 Revert "Inductor: Allow small sizes of m for mixed mm autotuning (#127663)"
This reverts commit d8d0bf264a736c7fb3cd17799a1c1aba4addf8d9.

Reverted https://github.com/pytorch/pytorch/pull/127663 on behalf of https://github.com/soulitzer due to breaks torch ao CI, see: https://github.com/pytorch/pytorch/issues/127924 ([comment](https://github.com/pytorch/pytorch/pull/127663#issuecomment-2148554128))
2024-06-04 23:06:43 +00:00
f4b05ce683 Add registry for TorchScript to ExportedProgram conversion (#127464)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127464
Approved by: https://github.com/ydwu4, https://github.com/angelayi
2024-06-04 22:53:00 +00:00
0eb9ec958a Revert "Inductor respects strides for custom ops by default (#126986)" (#127923)
This reverts commit dd64ca2a02434944ecbc8f3e186d44ba81e3cb26.

There's a silent incorrectness bug with needs_fixed_stride_order=True and
mutable custom ops, so it's better to flip the default back to avoid
silent incorrectness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127923
Approved by: https://github.com/williamwen42
2024-06-04 22:25:45 +00:00
20f966a8e0 Ignore undocumented PipelineSchedule.step (#127955)
Ignore undocumented PipelineSchedule.step to fix doc build:

https://github.com/pytorch/pytorch/actions/runs/9372492435/job/25805861083?pr=127938#step:11:1284

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127955
Approved by: https://github.com/kit1980
2024-06-04 22:11:09 +00:00
a7b1dd82ff Default XLA to use swap_tensors path in nn.Module._apply (#126814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126814
Approved by: https://github.com/JackCaoG, https://github.com/albanD
ghstack dependencies: #127313
2024-06-04 21:40:49 +00:00
1b704a160f Add linker script optimization flag to CMAKE rule for CUDA ARM wheel (#127514)
Original PR - https://github.com/pytorch/pytorch/pull/127220

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127514
Approved by: https://github.com/Aidyn-A, https://github.com/atalman
2024-06-04 20:51:44 +00:00
6dc0a291b9 Revert "[dynamo] Bugfix for nn parameter construction (#127806)"
This reverts commit f27c4dd862bf79f37019ef277957cd577d57b66f.

Reverted https://github.com/pytorch/pytorch/pull/127806 on behalf of https://github.com/PaliC due to causing nn tests to fail ([comment](https://github.com/pytorch/pytorch/pull/127806#issuecomment-2148393903))
2024-06-04 20:51:41 +00:00
597922ba21 Reapply "distributed debug handlers (#126601)" (#127805)
This reverts commit 7646825c3eb687030c4f873b01312be0eed80174.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127805
Approved by: https://github.com/PaliC
2024-06-04 19:44:30 +00:00
e76b28c765 [dtensor][debug] added c10d alltoall_ and alltoall_base_ to CommDebugMode (#127360)
**Summary**
Added c10d alltoall_ and alltoall_base tracing to CommDebugMode and edited test case in test_comm_mode to include added features.

**Test Plan**
pytest test/distributed/_tensor/debug/test_comm_mode.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127360
Approved by: https://github.com/wz337, https://github.com/XilunWu, https://github.com/yifuwang
ghstack dependencies: #127358
2024-06-04 18:29:48 +00:00
01e6d1cae4 [dtensor][debug] added c10d reduce_scatter_ and reduce_scatter_tensor_coalesced tracing_ to CommDebugMode (#127358)
**Summary**
Added c10d reduce_scatter_ and reduce_scatter_tensor_coalesced tracing to CommDebugMode and edited test case in test_comm_mode to include added features.

**Test Plan**
pytest test/distributed/_tensor/debug/test_comm_mode.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127358
Approved by: https://github.com/wz337, https://github.com/XilunWu, https://github.com/yifuwang
2024-06-04 18:29:48 +00:00
9a25ff77af Revert "[inductor] Enable subprocess-based parallel compile as the default (#126817)"
This reverts commit cf77e7dd9770caf65e898ac2ee82045aa0408e30.

Reverted https://github.com/pytorch/pytorch/pull/126817 on behalf of https://github.com/huydhn due to There are lots of flaky inductor failure showing up in trunk after this commit cf77e7dd97, so I am trying to revert this to see if this helps ([comment](https://github.com/pytorch/pytorch/pull/126817#issuecomment-2148143502))
2024-06-04 18:26:12 +00:00
f27c4dd862 [dynamo] Bugfix for nn parameter construction (#127806)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127806
Approved by: https://github.com/jansel
ghstack dependencies: #127785, #127802
2024-06-04 18:25:46 +00:00
569c5e72e7 [dynamo] Unspec nn module when global backward hooks are present (#127802)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127802
Approved by: https://github.com/jansel
ghstack dependencies: #127785
2024-06-04 18:25:46 +00:00
c7e936a56a [dynamo] Tensorvariable - track grad with _grad field (#127785)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127785
Approved by: https://github.com/jansel
2024-06-04 18:25:46 +00:00
3bcc3cddb5 Using scalarType instead string in function _group_tensors_by_device_and_dtype. (#127869)
Now that torch.dtype can pass through pybind11, modify function _group_tensors_by_device_and_dtype to use scalar type. This avoids converting between torch.dtype and string on the Python and C++ sides.
@ezyang @bdhirsh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127869
Approved by: https://github.com/ezyang
2024-06-04 18:19:33 +00:00
0ff60236ab Revert "Retire torch.distributed.pipeline (#127354)"
This reverts commit b9c058c203ee38032594f898f27cd8404f113a63.

Reverted https://github.com/pytorch/pytorch/pull/127354 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the doc build failure looks legit b9c058c203 ([comment](https://github.com/pytorch/pytorch/pull/127354#issuecomment-2148133982))
2024-06-04 18:19:31 +00:00
627d2cd87d [CI] disable td for xpu ci test by default (#127611)
Because td has been enabled by default for the xpu ci test, a lot of test cases (75%) have been skipped in CI tests. This let some ci failures escape the ci tests, for example issue #127539. This PR depends on PR #127595 landing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127611
Approved by: https://github.com/etaf, https://github.com/atalman
2024-06-04 17:15:10 +00:00
36e9b71613 Enable UFMT on test/test_jit_fuser_te.py (#127759)
Part of #123062

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127759
Approved by: https://github.com/ezyang
2024-06-04 16:56:03 +00:00
ff32f6c93b Use freshly traced jit-traced module to be used in export analysis (#127577)
Summary: When we export an already traced module, it seems to modify some global state, causing the traced modules to fail to run. For now, we are only logging for test cases, so it is probably OK to trace a fresh copy to be used in export.

Test Plan: CI

Differential Revision: D57983518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127577
Approved by: https://github.com/pianpwk
2024-06-04 16:54:23 +00:00
c490046693 [BE]: Update cudnn to 9.1.0.70 (#123475)
cuDNN has managed to upload cu11 and cu12 wheels for ~~9.0.0.312~~ 9.1.0.70, so trying this out...

CC @Skylion007 @malfet

Co-authored-by: Wei Wang <weiwan@nvidia.com>
Co-authored-by: atalman <atalman@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123475
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/nWEIdia
2024-06-04 16:33:06 +00:00
97ea2b5d83 documentation for pattern_matcher.py (#127459)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127459
Approved by: https://github.com/oulgen
ghstack dependencies: #127457, #127458
2024-06-04 15:24:47 +00:00
7a60a75256 Add typing annotations to pattern_matcher.py (#127458)
Turn on `mypy: disallow-untyped-defs` in pattern_matcher.py and fix the fallout.

There are still a bunch of `type: ignore` annotations which should eventually be ironed out.

In the process found a bug: #127457

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127458
Approved by: https://github.com/Skylion007
ghstack dependencies: #127457
2024-06-04 15:24:47 +00:00
9adfa143d7 fix post_grad pattern (#127457)
The lowering pattern built by cuda_and_enabled_mixed_mm_and_not_int8() was using ListOf() incorrectly - ListOf() is meant to represent a single repeating pattern - but cuda_and_enabled_mixed_mm_and_not_int8() was passing two patterns - I think based on the comment it's trying to build a sequence which would be represented by an actual list, not ListOf().

The behavior of the existing pattern would be to pass the second pattern as the `partial` parameter of `ListOf` which is meant to be a boolean - so it's almost certainly not what was intended.

I tried changing it to be what I thought was the intended behavior but then the resnet152 test failed accuracy - so I'm just preserving the existing behavior with the correct parameter types.
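
The argument mix-up described above can be reproduced with a toy class. This is only a sketch: `ToyListOf` is a hypothetical stand-in that assumes a constructor taking one pattern plus a boolean `partial` flag, and the string patterns are placeholders rather than real pattern objects.

```python
class ToyListOf:
    """Toy stand-in: one repeating pattern plus a boolean flag."""

    def __init__(self, pattern, partial=False):
        self.pattern = pattern
        self.partial = partial


first, second = "pattern_a", "pattern_b"

# Passing two patterns positionally binds the second one to `partial`.
mistaken = ToyListOf(first, second)
print(mistaken.pattern, repr(mistaken.partial))

# In this toy, a non-empty string is truthy, so the mistaken call behaves
# like an explicit partial=True with correctly typed arguments.
explicit = ToyListOf(first, partial=True)
print(bool(mistaken.partial) == explicit.partial)
```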

Found when adding annotations to pattern_matcher.py (#127458)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127457
Approved by: https://github.com/oulgen
2024-06-04 15:24:41 +00:00
cyy
f8c6d43524 Concat namespaces and other fixes in torch/csrc/utils (#127833)
It contains formatting and other minor fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127833
Approved by: https://github.com/ezyang
2024-06-04 15:12:45 +00:00
91461601b6 [TORCH_FA2_flash_api] Update total_q to the reshaped query 0th dimension (#127524)
There is a difference (and a bug) between the TORCH_FA2_flash_api:**mha_varlen_fwd** and FA2_flash_api:**mha_varlen_fwd** at the query transposition (GQA) step.

```
at::Tensor temp_q = q;
if (seqlenq_ngroups_swapped) {
        temp_q = q.reshape( ...
 ...
}
const int total_q = q.sizes()[0];
CHECK_SHAPE(temp_q, total_q, num_heads, head_size_og);
```

When doing query transposition we need to update total_q to the reshaped query 0th dimension, i.e:
```
const int total_q = temp_q.sizes()[0];
```

In the original FA2_flash_api:**mha_varlen_fwd** they don't introduce a new variable temp_q but overwrite the q value directly.
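
The effect of reading dimension 0 from the wrong tensor can be shown with plain shape tuples. This is a sketch under stated assumptions: the sizes are hypothetical, and `check_shape` is a made-up helper that only mimics the idea of a shape check.

```python
def check_shape(shape, *expected):
    """Raise if `shape` does not equal `expected`, mimicking a shape check."""
    if tuple(shape) != tuple(expected):
        raise ValueError(f"expected {expected}, got {tuple(shape)}")


# Hypothetical sizes; only the change in dimension 0 matters here.
q_shape = (3, 8, 64)        # shape of q before the reshape
temp_q_shape = (12, 2, 64)  # shape of temp_q after the reshape
num_heads, head_size = temp_q_shape[1], temp_q_shape[2]

# Buggy: dimension 0 is read from the original q.
try:
    check_shape(temp_q_shape, q_shape[0], num_heads, head_size)
    buggy_passed = True
except ValueError:
    buggy_passed = False

# Fixed: dimension 0 is read from the reshaped temp_q.
total_q = temp_q_shape[0]
check_shape(temp_q_shape, total_q, num_heads, head_size)
print(buggy_passed, total_q)
```
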
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127524
Approved by: https://github.com/drisspg
2024-06-04 14:44:45 +00:00
c209fbdc53 [inductor] Fix missing unbacked def for unbacked in input expr (#127770)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127770
Approved by: https://github.com/ezyang
2024-06-04 14:43:01 +00:00
cyy
059cae6176 [Caffe2] Remove Caffe2 proto and other files (#127655)
Remove Caffe2 proto files altogether.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127655
Approved by: https://github.com/ezyang
2024-06-04 14:22:21 +00:00
4c074a9b8b Revert "[torchbind] always fakify script object by default in non-strict export (#127116)"
This reverts commit c27882ffa8c1c7e4cf8ebc6c2f879e5b6c8814ad.

Reverted https://github.com/pytorch/pytorch/pull/127116 on behalf of https://github.com/atalman due to Failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/127116#issuecomment-2147459339))
2024-06-04 12:53:19 +00:00
fb696ef3aa Complete revamp of float/promotion sympy handling (#126905)
At a high level, the idea behind this PR is:

* Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc in sympy, instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.)
* Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers.

The story begins in **torch/utils/_sympy/functions.py**. Here, I make some changes to how we represent certain operations in sympy expressions:

* FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing).
* ModularIndexing, LShift, RShift now assert they are given integer inputs.
* Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver
* TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows for us to eventually generate accurate code for Python semantics IntTrueDiv, which is written in a special way to preserve precision when the inputs are >= 2**53 beyond what first coercing the integer to floats and then doing true division.
* Trunc is split to TruncToFloat and TruncToInt.
* Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result.
* RoundDecimal updated to consistently only ever return a float
* Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing)

In **torch/__init__.py**, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations.  Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information.
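
The difference between promotion semantics and the builtin behavior can be illustrated in plain Python. This is a sketch only: `promoting_min` is a hypothetical helper written for this example and is not the code behind `torch.sym_min`.

```python
import builtins


def promoting_min(a, b):
    """Hypothetical helper: min whose result is a float if either input is."""
    result = builtins.min(a, b)
    if isinstance(a, float) or isinstance(b, float):
        return float(result)
    return result


# The builtin returns whichever argument wins, keeping that argument's type.
print(type(builtins.min(1, 2.0)).__name__)
# With promotion, the result type depends only on the input types.
print(type(promoting_min(1, 2.0)).__name__)
```

With promotion, the result type can be determined from the argument types alone, without knowing at analysis time which argument is smaller.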

We also need to introduce some new op handlers in **torch/_inductor/ops_handler.py**:

* `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy
* `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv`
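
The precision gap that motivates a dedicated integer true division can be seen in plain Python, whose `/` on integers rounds the exact quotient once. This is only an illustration with hand-picked values; it does not use the new op handlers.

```python
# Hand-picked operands above 2**53, where not every integer is a float.
a = 3 * (2**53 + 1)
b = 3

# Python-style integer true division: the exact quotient is rounded once.
exact = a / b

# Casting first rounds `a` to a float, then the quotient is rounded again.
via_floats = float(a) / float(b)

print(exact, via_floats, exact == via_floats)
```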

These changes have consequences. First, we need to make some administrative changes:

* Actually wire up these Sympy functions from SymInt/SymFloat in **torch/fx/experimental/sym_node.py**, including the new promotion rules (promote2)
* Add support for new Sympy functions in **torch/utils/_sympy/interp.py**, **torch/utils/_sympy/reference.py**
  * In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function
  * TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency to fix tests here
* Add printer support for the Sympy functions in **torch/_inductor/codegen/common.py**, **torch/_inductor/codegen/cpp_utils.py**, **torch/_inductor/codegen/triton.py**. `int_truediv` and mixed precision equality are currently not implemented soundly, so we will lose precision in codegen for large values. TODO: The additions here are not exhaustive yet
* Update ValueRanges logic to use new sympy functions in **torch/utils/_sympy/value_ranges.py**. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions.

In **torch/fx/experimental/symbolic_shapes.py** we need to make some symbolic reasoning adjustments:

* Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. This simplification then triggers an addition simplification rule `(x + y) / c --> x / c + y / c` which is bad because x / c is a rational number now
* `_assert_bound_is_rational` is no more, we no longer generate rational bounds
* Don't intersect non-int value ranges with the `int_range`
* Support more sympy Functions for guard SYMPY_INTERP
* Assert the type of value range is consistent with the variable type

The new asserts uncovered necessary bug fixes:

* **torch/_inductor/codegen/cpp.py**, **torch/_inductor/select_algorithm.py**, **torch/_inductor/sizevars.py** - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions
* **torch/_inductor/utils.py** - make sure you actually pass in sympy.Expr to these functions
* **torch/_inductor/ir.py** - make_contiguous_strides_for takes int/SymInt, not sympy.Expr!
* **torch/export/dynamic_shapes.py** - don't use infinity to represent int ranges, instead use sys.maxsize - 1

Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at **test/test_proxy_tensor.py**

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905
Approved by: https://github.com/xadupre, https://github.com/lezcano
2024-06-04 11:47:32 +00:00
db515b6ac7 [ROCm] Fix error in torch.cuda initialisation if amdsmi is not available (#127528)
Reported in https://github.com/AUTOMATIC1111/stable-diffusion-webui/issues/15874

When nvml_count is set via 9f73c65b8f/torch/cuda/__init__.py (L834)

If amdsmi is not available this will throw an error
```
File "python3.10/site-packages/torch/cuda/__init__.py", line 634, in _raw_device_count_amdsmi
    except amdsmi.AmdSmiException as e:
NameError: name 'amdsmi' is not defined
```
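
The failure mode in the traceback can be reproduced without any GPU library. This is a sketch under stated assumptions: `hypothetical_smi` is a made-up module name standing in for an optional dependency that is not installed, and `guarded_count` shows one possible guard, not necessarily the one used in this PR.

```python
try:
    import hypothetical_smi  # assumed to be missing, like an optional SMI library

    HAS_SMI = True
except ImportError:
    HAS_SMI = False


def unguarded_count():
    try:
        raise RuntimeError("query failed")
    except hypothetical_smi.SmiException:
        # The module name is looked up only while matching the exception,
        # so a failed import surfaces here as a NameError.
        return -1


def guarded_count():
    # Bail out before any code path can touch the missing module.
    if not HAS_SMI:
        return -1
    try:
        return hypothetical_smi.device_count()
    except hypothetical_smi.SmiException:
        return -1


try:
    unguarded_count()
except NameError as err:
    print("NameError:", err)

print(guarded_count())
```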

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127528
Approved by: https://github.com/jeffdaily, https://github.com/eqy, https://github.com/pruthvistony, https://github.com/atalman
2024-06-04 11:16:02 +00:00
49048e7f26 [FSDP2] Fixed variable shadowing of module (#127776)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127776
Approved by: https://github.com/wanchaol
ghstack dependencies: #127771
2024-06-04 10:27:34 +00:00
f325b39303 Introduce Inductor passes to micro-pipeline all-gather-matmul and matmul-reduce-scatter in certain cases (#126598)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126598
Approved by: https://github.com/wanchaol
2024-06-04 09:06:56 +00:00
cf77e7dd97 [inductor] Enable subprocess-based parallel compile as the default (#126817)
Differential Revision: [D58056502](https://our.internmc.facebook.com/intern/diff/D58056502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126817
Approved by: https://github.com/eellison
2024-06-04 07:48:32 +00:00
b9c058c203 Retire torch.distributed.pipeline (#127354)
Actually retiring module after deprecation warning for a while.
The new supported module is: torch.distributed.pipelining.
Please migrate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127354
Approved by: https://github.com/wconstab
2024-06-04 07:03:26 +00:00
6abca6a564 [export][unflatten] More strictly respect scope when removing inputs (#127607)
Code snippet from TorchTitan (LLaMa):
```
for layer in self.layers.values():
    h = layer(h, self.freqs_cis)
```
`self.freqs_cis` is a buffer of root module (`self`).
It is also an explicit arg in the call signature of original `layer` modules.
If not respecting scope -- `freqs_cis`'s scope only corresponds to root -- `_sink_param` can remove `freqs_cis` from `layer`'s call signature, resulting in runtime error.

There are two fixes in this PR:
1. We filter out the `inputs_to_state` corresponding to the current scope, using existing code that does prefix matching.
2. We delay the removal of param inputs from `call_module` nodes' `args`, till `_sink_param` call on that submodule returns. The return now returns information on which input is actually removed by the submodule, thus more accurate than just doing:
```
    for node in call_module_nodes:
        node.args = tuple(filter(lambda n: n.name not in inputs_to_state, node.args))
```

Before the PR:
![Screenshot 2024-05-31 at 1 40 24 AM](https://github.com/pytorch/pytorch/assets/6676466/a2e06b18-44d5-40ca-b242-0edab45075b7)

After the PR:
![Screenshot 2024-05-31 at 1 43 41 AM](https://github.com/pytorch/pytorch/assets/6676466/b72afb94-cdfa-420d-b88b-29a92bf2a0c0)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127607
Approved by: https://github.com/pianpwk
2024-06-04 06:43:54 +00:00
e216df48c8 [Dynamo][TVM] Fix ignored trials argument for MetaSchedule (#127747)
Fixes #127746

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127747
Approved by: https://github.com/jansel
2024-06-04 06:13:02 +00:00
2122c9e2a9 [BE] Enabled lintrunner on torch/distributed/utils.py (#127771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127771
Approved by: https://github.com/wanchaol, https://github.com/Skylion007
2024-06-04 06:10:33 +00:00
ef77f2ca4a [pipelining] Simple 1F1B schedule (#127673)
![Screenshot 2024-05-31 at 9 13 18 PM](https://github.com/pytorch/pytorch/assets/6676466/ecf3ca24-33a6-4188-9f7c-df6e96311caa)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127673
Approved by: https://github.com/wconstab
2024-06-04 06:09:51 +00:00
f4b77ce8e2 Masked scale meta function registration #119984 (#127389)
Fixes #119984

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127389
Approved by: https://github.com/cpuhrsch
2024-06-04 06:09:17 +00:00
cyy
e7cb43a2d2 Check unused variables in tests (#127498)
Enables unused variable checks in CMake.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127498
Approved by: https://github.com/ezyang
2024-06-04 05:35:25 +00:00
2ad0e4197d [ts-migration] support aten::__is__, aten::__isnot__, aten::__not__, profiler::_record_function_enter_new, profiler::_record_function_exit (#127656)
Support more ops in ts converter and add unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127656
Approved by: https://github.com/SherlockNoMad
2024-06-04 04:51:29 +00:00
8d153e0bab [Inductor] Add FlexAttention backward kernel dynamic shape tests (#127728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127728
Approved by: https://github.com/Chillee
2024-06-04 04:32:03 +00:00
e793ae220f [Inductor][Flex-attention] Support different sequence lengths for Query and Key/Value (#127678)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127678
Approved by: https://github.com/Chillee
2024-06-04 04:27:24 +00:00
dae757c971 Specify supported OS matrix (#127816)
Windows-10 or newer
manylinux-2014
MacOS-11 or newer (but only on Apple Silicon)

Fixes https://github.com/pytorch/pytorch/issues/126679

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127816
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-06-04 04:25:41 +00:00
22368eac10 [FSDP2] Fix submesh slicing to enable 3D parallelism (#127585)
Ensures that sharded parameters are created on a submesh that
excludes the Pipeline Parallelism dimension.

Also cleans up the logic for storing placements to no longer consider the outer / global dims.  Since we store an 'spmd' submesh, we can avoid this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127585
Approved by: https://github.com/wanchaol
2024-06-04 04:24:09 +00:00
69f5b66132 [Inductor] FlexAttention backward kernel optimization (#127208)
BWD Speedups (before this PR):
```
| Type    |   Speedup | shape             | score_mod     | dtype          |
|---------|-----------|-------------------|---------------|----------------|
| Average |     0.211 |                   |               |                |
| Max     |     0.364 | (16, 16, 512, 64) | relative_bias | torch.bfloat16 |
| Min     |     0.044 | (2, 16, 4096, 64) | causal_mask   | torch.bfloat16 |
```
BWD Speedups (after this PR, though not optimizing block size yet):
```
| Type    |   Speedup | shape              | score_mod     | dtype          |
|---------|-----------|--------------------|---------------|----------------|
| Average |     0.484 |                    |               |                |
| Max     |     0.626 | (2, 16, 512, 256)  | head_bias     | torch.bfloat16 |
| Min     |     0.355 | (8, 16, 4096, 128) | relative_bias | torch.bfloat16 |
```

A few follow-ups remain:
* Optimized default block size on A100/H100.
* Support different seqlen for Q and K/V.
* Support dynamic shapes for backward.
* Enhance unit tests to check there is no ```nan``` value in any grad. I think we should make some changes to ```test_padded_dense_causal``` because it has invalid inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127208
Approved by: https://github.com/Chillee
2024-06-04 04:22:41 +00:00
2498ef7490 Fix scheduler typehints (#127769)
Fixes scheduler typehints

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127769
Approved by: https://github.com/jansel
2024-06-04 04:19:06 +00:00
6580a18f86 [c10d][BE] fix test_init_pg_and_rpc_with_same_socket (#127654)
**Summary**
Fix `test_init_pg_and_rpc_with_same_socket` in `test/distributed/test_store.py`, which missed a call to destroy the created ProcessGroup before exiting the test function. This led to an "init PG twice" error in the test.

**Test Plan**
`pytest test/distributed/test_store.py -s -k test_init_pg_and_rpc_with_same_socket`
`ciflow/periodic` since this test is included in `.ci/pytorch/multigpu-test.sh`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127654
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-06-04 04:00:28 +00:00
7e906ec9e5 [PT2][Optimus] Improve group batch fusion with same parent/users fusion enablement (#127648)
Summary:
Currently, we fuse the ops in random places. Here we enable same parent/users fusion to allow potential follow-up split cat elimination.

Context

https://docs.google.com/document/d/1MSZY23wKD2keW2Z-DfAI1DscDERHKjOJAnuB5bxa06I/edit

Test Plan:
# local reproduce

```
buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "pm_cmf" --flow_id 559694026
```
P1386889671

Differential Revision: D58037636

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127648
Approved by: https://github.com/jackiexu1992
2024-06-04 03:41:44 +00:00
c32fe6b279 [FSDP] keep paras in torch.distributed.checkpoint.state_dict.set_optimizer_state_dict (#127644)
Fixes https://github.com/pytorch/pytorch/issues/126948
In the previous code, inside the `_load_optim_state_dict` function under the `info.broadcast_from_rank0` condition, `optim_state_dict` holds the parameters based on `optim`.
Changes here aim to synchronize the differential parameters.
Unit tests are conducted under `test_state_dict.py` in `test_optim_state_dict_para_matching`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127644
Approved by: https://github.com/fegin
2024-06-04 03:32:22 +00:00
4d0386ce1c [torch/jit-runtime] Add explicit include of <chrono> to torch/jit/run… (#127779)
Added an explicit include to `<chrono>` in `jit/runtime/logging.h` since `std::chrono::time_point<std::chrono::high_resolution_clock>` is directly referenced in the header.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127779
Approved by: https://github.com/albanD
2024-06-04 02:12:17 +00:00
ddef7c350f Add comments about runner labels (#127827)
To distinguish between org-wide and repo-specific runners as well as highlight where they are hosted (by DevInfra, LF or various partners)

Delete unused `bm-runner`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127827
Approved by: https://github.com/huydhn
2024-06-04 02:06:43 +00:00
1208347d09 [inductor][ez] fix loop ordering test (#127807)
I didn't realize that the main block is not being run when inductor tests are being run in FBCode via remote GPUs. This is a quick fix. I've tested it in both OSS and FBCode.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127807
Approved by: https://github.com/eellison, https://github.com/jansel
2024-06-04 01:14:34 +00:00
41033a4274 PyPI: fix link to images to be rendered (#127798)
It addresses the long-pending issues on PyPI. The [package description](https://pypi.org/project/torch/2.3.0/) is the repo's README, but unlike GitHub rendering, PyPI accepts only raw images linked via Markdown images.
![image](https://github.com/pytorch/pytorch/assets/6035284/1d8e51d5-c8c1-4f92-b323-f7684879adb4)
 This minor link edit makes the image become raw images and so correctly rendered via PyPI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127798
Approved by: https://github.com/albanD
2024-06-04 00:59:58 +00:00
cyy
05fa05cbae [2/N] Change static functions in headers to inline (#127764)
Follows #127727

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127764
Approved by: https://github.com/Skylion007
2024-06-04 00:49:04 +00:00
dbf39a6e63 [inductor] fix linear_add_bias path (#127597)
Previously, the `linear_add_bias` path did not work.
This PR fixes it and adds more unit tests for it.

**TestPlan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_add_bias
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127597
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-06-04 00:39:01 +00:00
b42cfcabc4 Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946)
PyTorch can't depend on `fbgemm_gpu` as a dependency because `fbgemm_gpu` already has a dependency on PyTorch. So this PR copy / pastes kernels from `fbgemm_gpu`:
* `dense_to_jagged_forward()` as CUDA registration for new ATen op `_padded_dense_to_jagged_forward()`
* `jagged_to_padded_dense_forward()` as CUDA registration for new ATen op `_jagged_to_padded_dense_forward()`

CPU impls for these new ATen ops will be added in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125946
Approved by: https://github.com/davidberard98
2024-06-03 23:41:54 +00:00
eqy
ac568fc007 [CUDNN] Remove defunct cuDNN V8 API build flag (#120006)
The flag basically does nothing following #95722

Let's see if the quantization tests break

CC @malfet @atalmanagement

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120006
Approved by: https://github.com/malfet
2024-06-03 22:42:05 +00:00
0e7bd7fedd [ROCm] TunableOp improvements (#124362)
- use less memory; smaller default hipblaslt workspace size
- options to avoid cache effects
  - icache flush option
  - rotating buffers during tuning
- python APIs
- unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124362
Approved by: https://github.com/xw285cornell
2024-06-03 22:30:11 +00:00
0f1f0d3015 Onboard ARM bfloat16 to gemv fast path (#127484)
Summary: Used bfloat16 dot support from #127477 to write a bfloat16 transposed fast path and integrated it.

Test Plan: Ran https://github.com/malfet/llm_experiments/blob/main/benchmarks/benchmark_torch_mm.py before and after on my Apple M1 Pro.
Before:
```
mv_nt    torch.float32    6.77 usec
mv_nt    torch.float16    8.24 usec
mv_nt   torch.bfloat16  184.74 usec
mv_ta    torch.float32    5.71 usec
mv_ta    torch.float16   27.95 usec
mv_ta   torch.bfloat16   98.06 usec
notrans  torch.float32    5.55 usec
notrans  torch.float16   25.11 usec
notrans torch.bfloat16   63.55 usec
trans_a  torch.float32    5.62 usec
trans_a  torch.float16   74.48 usec
trans_a torch.bfloat16  313.19 usec
trans_b  torch.float32    5.68 usec
trans_b  torch.float16    8.18 usec
trans_b torch.bfloat16   14.96 usec
```

After:
```
mv_nt    torch.float32    5.40 usec
mv_nt    torch.float16    8.25 usec
mv_nt   torch.bfloat16   12.81 usec
mv_ta    torch.float32    5.69 usec
mv_ta    torch.float16   27.94 usec
mv_ta   torch.bfloat16   98.18 usec
notrans  torch.float32    5.60 usec
notrans  torch.float16   25.17 usec
notrans torch.bfloat16   63.22 usec
trans_a  torch.float32    5.61 usec
trans_a  torch.float16   69.32 usec
trans_a torch.bfloat16  316.62 usec
trans_b  torch.float32    5.60 usec
trans_b  torch.float16    8.09 usec
trans_b torch.bfloat16   14.61 usec
```

Note large improvement in mv_nt torch.bfloat16 case.
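
As a rough illustration of the size of that improvement, the speedup can be computed from the two `mv_nt torch.bfloat16` timings listed above. The variable names are illustrative only and are not part of the benchmark script.

```python
# Timings copied from the benchmark output above (usec, mv_nt torch.bfloat16).
before_usec = 184.74
after_usec = 12.81

# Ratio of the old time to the new time; values > 1 mean the new path is faster.
speedup = before_usec / after_usec
print(f"mv_nt bfloat16 speedup: {speedup:.1f}x")
```

The other rows differ by only a few percent between the two runs, which is consistent with this PR touching only the bfloat16 transposed gemv path.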
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127484
Approved by: https://github.com/malfet
ghstack dependencies: #127477, #127478
2024-06-03 22:14:16 +00:00
f6ca822366 Patch ARM Half use_gemv_fast_path gate to avoid kernel duplication (#127478)
Summary: The existing code didn't gate the fast path, so the fast path had to duplicate the stock kernel. Now we gate it and delete the duplicate kernel.

Test Plan: Existing tests. Flipped the TORCH_INTERNAL_ASSERT_DEBUG_ONLY to non-debug and forced to fail (locally) to make sure we had test coverage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127478
Approved by: https://github.com/malfet
ghstack dependencies: #127477
2024-06-03 22:14:16 +00:00
6faa3d5f18 Onboard ARM bfloat16 to gemm-by-dot-product-for-gemm_transa_ infrastructure (#127477)
Summary: This gets us a baseline level of reasonable performance for
bfloat16 matrix-vector and matrix-matrix multiplication on my Apple
M1. I've intentionally left using intrinsics for future work.

Test Plan: Used
https://github.com/malfet/llm_experiments/blob/main/benchmarks/benchmark_torch_mm.py
(modified to run larger sizes) to benchmark a range of LLM-interesting
matrix-vector and matrix-matrix sizes on my Apple M1 Pro. bfloat16 performance is
improved across the board (except possibly for very small cases) and
now exceeds float32 performance (as it should) for the matrix-vector
cases.

Before:
```
Matrix-vector:
m=8, n=128, k=1
====================
trans_b  torch.float32    0.75 usec
trans_b  torch.float16    0.71 usec
trans_b torch.bfloat16    0.81 usec
m=128, n=8, k=1
====================
trans_b  torch.float32    0.75 usec
trans_b  torch.float16    0.93 usec
trans_b torch.bfloat16    0.98 usec
m=4096, n=4096, k=1
====================
trans_b  torch.float32 2194.31 usec
trans_b  torch.float16  661.27 usec
trans_b torch.bfloat16 3758.42 usec
m=11008, n=4096, k=1
====================
trans_b  torch.float32 5792.04 usec
trans_b  torch.float16 1789.98 usec
trans_b torch.bfloat16 10120.67 usec
m=4096, n=11008, k=1
====================
trans_b  torch.float32 6101.22 usec
trans_b  torch.float16 1927.34 usec
trans_b torch.bfloat16 10469.47 usec
m=32000, n=4096, k=1
====================
trans_b  torch.float32 18353.20 usec
trans_b  torch.float16 5161.06 usec
trans_b torch.bfloat16 29601.69 usec

Matrix-matrix (prompt len 4:
m=8, n=128, k=4
====================
trans_b  torch.float32    2.14 usec
trans_b  torch.float16    0.85 usec
trans_b torch.bfloat16    1.19 usec
m=128, n=8, k=4
====================
trans_b  torch.float32    1.47 usec
trans_b  torch.float16    1.85 usec
trans_b torch.bfloat16    1.75 usec
m=4096, n=4096, k=4
====================
trans_b  torch.float32 4416.40 usec
trans_b  torch.float16 2688.36 usec
trans_b torch.bfloat16 14987.33 usec
m=11008, n=4096, k=4
====================
trans_b  torch.float32 6140.24 usec
trans_b  torch.float16 7467.26 usec
trans_b torch.bfloat16 40295.52 usec
m=4096, n=11008, k=4
====================
trans_b  torch.float32 6143.10 usec
trans_b  torch.float16 7298.04 usec
trans_b torch.bfloat16 41393.43 usec
m=32000, n=4096, k=4
====================
trans_b  torch.float32 17650.72 usec
trans_b  torch.float16 21346.63 usec
trans_b torch.bfloat16 116849.98 usec

Matrix-matrix (prompt len 8:
m=8, n=128, k=8
====================
trans_b  torch.float32    1.05 usec
trans_b  torch.float16    1.03 usec
trans_b torch.bfloat16    1.69 usec
m=128, n=8, k=8
====================
trans_b  torch.float32    2.05 usec
trans_b  torch.float16    3.08 usec
trans_b torch.bfloat16    2.95 usec
m=4096, n=4096, k=8
====================
trans_b  torch.float32 2323.99 usec
trans_b  torch.float16 5265.45 usec
trans_b torch.bfloat16 29942.40 usec
m=11008, n=4096, k=8
====================
trans_b  torch.float32 6202.01 usec
trans_b  torch.float16 14677.90 usec
trans_b torch.bfloat16 80625.18 usec
m=4096, n=11008, k=8
====================
trans_b  torch.float32 6112.05 usec
trans_b  torch.float16 14340.52 usec
trans_b torch.bfloat16 82799.99 usec
m=32000, n=4096, k=8
====================
trans_b  torch.float32 17650.65 usec
trans_b  torch.float16 42551.43 usec
trans_b torch.bfloat16 236081.08 usec

Matrix-matrix (prompt len 16:
m=8, n=128, k=16
====================
trans_b  torch.float32    1.26 usec
trans_b  torch.float16    1.34 usec
trans_b torch.bfloat16    2.69 usec
m=128, n=8, k=16
====================
trans_b  torch.float32    1.60 usec
trans_b  torch.float16    5.81 usec
trans_b torch.bfloat16    5.34 usec
m=4096, n=4096, k=16
====================
trans_b  torch.float32 2328.05 usec
trans_b  torch.float16 10526.58 usec
trans_b torch.bfloat16 60028.28 usec
m=11008, n=4096, k=16
====================
trans_b  torch.float32 6243.35 usec
trans_b  torch.float16 28505.08 usec
trans_b torch.bfloat16 163670.15 usec
m=4096, n=11008, k=16
====================
trans_b  torch.float32 5870.11 usec
trans_b  torch.float16 28597.89 usec
trans_b torch.bfloat16 165404.88 usec
m=32000, n=4096, k=16
====================
trans_b  torch.float32 17746.27 usec
trans_b  torch.float16 83393.87 usec
trans_b torch.bfloat16 472313.13 usec

Matrix-matrix (prompt len 32:
m=8, n=128, k=32
====================
trans_b  torch.float32    1.35 usec
trans_b  torch.float16    2.01 usec
trans_b torch.bfloat16    4.68 usec
m=128, n=8, k=32
====================
trans_b  torch.float32    1.19 usec
trans_b  torch.float16   10.98 usec
trans_b torch.bfloat16   10.13 usec
m=4096, n=4096, k=32
====================
trans_b  torch.float32 2525.29 usec
trans_b  torch.float16 23106.71 usec
trans_b torch.bfloat16 122987.04 usec
m=11008, n=4096, k=32
====================
trans_b  torch.float32 6131.34 usec
trans_b  torch.float16 57537.41 usec
trans_b torch.bfloat16 327825.00 usec
m=4096, n=11008, k=32
====================
trans_b  torch.float32 6395.01 usec
trans_b  torch.float16 57456.33 usec
trans_b torch.bfloat16 331325.58 usec
m=32000, n=4096, k=32
====================
trans_b  torch.float32 19078.68 usec
trans_b  torch.float16 167735.08 usec
trans_b torch.bfloat16 975736.88 usec

Matrix-matrix (prompt len 128:
m=8, n=128, k=128
====================
trans_b  torch.float32    2.40 usec
trans_b  torch.float16    6.07 usec
trans_b torch.bfloat16   16.83 usec
m=128, n=8, k=128
====================
trans_b  torch.float32    1.78 usec
trans_b  torch.float16   40.35 usec
trans_b torch.bfloat16   37.21 usec
m=4096, n=4096, k=128
====================
trans_b  torch.float32 4827.60 usec
trans_b  torch.float16 84341.24 usec
trans_b torch.bfloat16 478917.75 usec
m=11008, n=4096, k=128
====================
trans_b  torch.float32 11879.96 usec
trans_b  torch.float16 226484.33 usec
trans_b torch.bfloat16 1289465.50 usec
m=4096, n=11008, k=128
====================
trans_b  torch.float32 10707.75 usec
trans_b  torch.float16 229200.58 usec
trans_b torch.bfloat16 1327416.67 usec
m=32000, n=4096, k=128
====================
trans_b  torch.float32 33306.32 usec
trans_b  torch.float16 662898.21 usec
trans_b torch.bfloat16 3815866.63 usec
```

After:
```
Matrix-vector:
m=8, n=128, k=1
====================
trans_b  torch.float32    0.77 usec
trans_b  torch.float16    0.72 usec
trans_b torch.bfloat16    0.77 usec
m=128, n=8, k=1
====================
trans_b  torch.float32    0.73 usec
trans_b  torch.float16    0.93 usec
trans_b torch.bfloat16    1.56 usec
m=4096, n=4096, k=1
====================
trans_b  torch.float32 2195.22 usec
trans_b  torch.float16  675.40 usec
trans_b torch.bfloat16 1038.29 usec
m=11008, n=4096, k=1
====================
trans_b  torch.float32 5980.27 usec
trans_b  torch.float16 1806.08 usec
trans_b torch.bfloat16 2756.46 usec
m=4096, n=11008, k=1
====================
trans_b  torch.float32 6339.95 usec
trans_b  torch.float16 1844.71 usec
trans_b torch.bfloat16 2726.52 usec
m=32000, n=4096, k=1
====================
trans_b  torch.float32 18137.17 usec
trans_b  torch.float16 6020.75 usec
trans_b torch.bfloat16 8612.89 usec

Matrix-matrix (prompt len 4:
m=8, n=128, k=4
====================
trans_b  torch.float32    2.24 usec
trans_b  torch.float16    0.91 usec
trans_b torch.bfloat16    1.07 usec
m=128, n=8, k=4
====================
trans_b  torch.float32    1.58 usec
trans_b  torch.float16    1.96 usec
trans_b torch.bfloat16    2.11 usec
m=4096, n=4096, k=4
====================
trans_b  torch.float32 4583.43 usec
trans_b  torch.float16 3014.04 usec
trans_b torch.bfloat16 4434.04 usec
m=11008, n=4096, k=4
====================
trans_b  torch.float32 6245.55 usec
trans_b  torch.float16 7513.82 usec
trans_b torch.bfloat16 11207.80 usec
m=4096, n=11008, k=4
====================
trans_b  torch.float32 6096.22 usec
trans_b  torch.float16 7688.82 usec
trans_b torch.bfloat16 11143.72 usec
m=32000, n=4096, k=4
====================
trans_b  torch.float32 17982.88 usec
trans_b  torch.float16 22001.28 usec
trans_b torch.bfloat16 32470.62 usec

Matrix-matrix (prompt len 8:
m=8, n=128, k=8
====================
trans_b  torch.float32    1.05 usec
trans_b  torch.float16    1.02 usec
trans_b torch.bfloat16    1.44 usec
m=128, n=8, k=8
====================
trans_b  torch.float32    2.07 usec
trans_b  torch.float16    3.10 usec
trans_b torch.bfloat16    3.38 usec
m=4096, n=4096, k=8
====================
trans_b  torch.float32 2245.43 usec
trans_b  torch.float16 5597.87 usec
trans_b torch.bfloat16 8775.08 usec
m=11008, n=4096, k=8
====================
trans_b  torch.float32 6227.68 usec
trans_b  torch.float16 15102.41 usec
trans_b torch.bfloat16 22457.37 usec
m=4096, n=11008, k=8
====================
trans_b  torch.float32 6082.16 usec
trans_b  torch.float16 15131.57 usec
trans_b torch.bfloat16 21860.15 usec
m=32000, n=4096, k=8
====================
trans_b  torch.float32 19659.00 usec
trans_b  torch.float16 45075.64 usec
trans_b torch.bfloat16 67746.75 usec

Matrix-matrix (prompt len 16:
m=8, n=128, k=16
====================
trans_b  torch.float32    1.31 usec
trans_b  torch.float16    1.41 usec
trans_b torch.bfloat16    2.04 usec
m=128, n=8, k=16
====================
trans_b  torch.float32    1.66 usec
trans_b  torch.float16    5.76 usec
trans_b torch.bfloat16    6.37 usec
m=4096, n=4096, k=16
====================
trans_b  torch.float32 2271.34 usec
trans_b  torch.float16 11198.46 usec
trans_b torch.bfloat16 16893.54 usec
m=11008, n=4096, k=16
====================
trans_b  torch.float32 6266.85 usec
trans_b  torch.float16 29342.49 usec
trans_b torch.bfloat16 45159.22 usec
m=4096, n=11008, k=16
====================
trans_b  torch.float32 5999.16 usec
trans_b  torch.float16 29157.43 usec
trans_b torch.bfloat16 43295.81 usec
m=32000, n=4096, k=16
====================
trans_b  torch.float32 18028.83 usec
trans_b  torch.float16 89626.88 usec
trans_b torch.bfloat16 128164.62 usec

Matrix-matrix (prompt len 32:
m=8, n=128, k=32
====================
trans_b  torch.float32    1.38 usec
trans_b  torch.float16    2.03 usec
trans_b torch.bfloat16    3.29 usec
m=128, n=8, k=32
====================
trans_b  torch.float32    1.24 usec
trans_b  torch.float16   10.58 usec
trans_b torch.bfloat16   11.97 usec
m=4096, n=4096, k=32
====================
trans_b  torch.float32 2591.56 usec
trans_b  torch.float16 21683.62 usec
trans_b torch.bfloat16 32657.68 usec
m=11008, n=4096, k=32
====================
trans_b  torch.float32 6468.43 usec
trans_b  torch.float16 57811.33 usec
trans_b torch.bfloat16 89263.21 usec
m=4096, n=11008, k=32
====================
trans_b  torch.float32 6034.74 usec
trans_b  torch.float16 59372.56 usec
trans_b torch.bfloat16 88107.85 usec
m=32000, n=4096, k=32
====================
trans_b  torch.float32 18609.27 usec
trans_b  torch.float16 167298.00 usec
trans_b torch.bfloat16 255116.37 usec

Matrix-matrix (prompt len 128:
m=8, n=128, k=128
====================
trans_b  torch.float32    2.44 usec
trans_b  torch.float16    6.11 usec
trans_b torch.bfloat16   10.92 usec
m=128, n=8, k=128
====================
trans_b  torch.float32    1.80 usec
trans_b  torch.float16   40.26 usec
trans_b torch.bfloat16   44.82 usec
m=4096, n=4096, k=128
====================
trans_b  torch.float32 4773.29 usec
trans_b  torch.float16 84458.54 usec
trans_b torch.bfloat16 131248.58 usec
m=11008, n=4096, k=128
====================
trans_b  torch.float32 12249.16 usec
trans_b  torch.float16 234411.87 usec
trans_b torch.bfloat16 351970.71 usec
m=4096, n=11008, k=128
====================
trans_b  torch.float32 11439.24 usec
trans_b  torch.float16 233347.04 usec
trans_b torch.bfloat16 354475.96 usec
m=32000, n=4096, k=128
====================
trans_b  torch.float32 33803.03 usec
trans_b  torch.float16 688157.54 usec
trans_b torch.bfloat16 1048221.42 usec
```

Also ran the stock configuration; it was unchanged, indicating that we need to integrate this path with torch.mv separately, which will come in a follow-up PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127477
Approved by: https://github.com/malfet
2024-06-03 22:14:10 +00:00
01fc22056a [BE] enable UFMT for torch/masked/ (#127715)
Part of #123062

- #123062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127715
Approved by: https://github.com/cpuhrsch
2024-06-03 22:01:49 +00:00
406532f864 [AMD] Fix power_draw api (#127729)
Summary: average_socket_power only gives me NA. So we need to change it to current_socket_power

Test Plan: Before this change, `torch.cuda.power_draw` gives me NA; after it, I get the right power reading (e.g. 441).

Differential Revision: D58047484

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127729
Approved by: https://github.com/nmacchioni, https://github.com/eqy
2024-06-03 21:46:50 +00:00
c27882ffa8 [torchbind] always fakify script object by default in non-strict export (#127116)
This diff can be risky for internal tests: any torchbind class that hasn't registered a fake class will fail and we should fix them. We've gained some confidence that this can work e2e by implementing FakeTensorQueue for TBE models in sigmoid with [D54210823](https://www.internalfb.com/diff/D54210823).

Differential Revision: [D57991002](https://our.internmc.facebook.com/intern/diff/D57991002)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127116
Approved by: https://github.com/zou3519
ghstack dependencies: #127113, #127114
2024-06-03 21:38:57 +00:00
3efac92888 [torchbind] support torch.compile with aot_eager backend (#127114)
Differential Revision: [D57991001](https://our.internmc.facebook.com/intern/diff/D57991001)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127114
Approved by: https://github.com/zou3519
ghstack dependencies: #127113
2024-06-03 21:38:57 +00:00
c6dc624690 [torchbind] remove test cases that don't fakify script objects (#127113)
As titled.

Differential Revision: [D57991003](https://our.internmc.facebook.com/intern/diff/D57991003)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127113
Approved by: https://github.com/zou3519
2024-06-03 21:38:50 +00:00
6d4ec9b2ec [RFC] Introduce Checkpointable for DCP (#127540) (#127628)
Summary:
# Introduce Checkpointable interface for DCP to support arbitrary tensor subclasses for checkpointing

**Authors:**
* zainhuda

## **Summary**
This diff adds a CheckpointableTensor interface to allow for future compatibility for any tensor subclass with DCP in a clean and maintainable way.

## **Motivation**
For the TorchRec sharding migration from ShardedTensor to DTensor, we create a tensor subclass that is stored by DTensor to support TorchRec's sharding schemes (e.g., empty shards, multiple shards on a rank).

## **Proposed Implementation**
See the CheckpointableTensor interface implementation, in which we introduce the minimal set of methods needed to be compatible with DCP. Any tensor subclass is expected to implement these methods, and doing so makes it checkpointable by DCP.

## **Drawbacks**
No drawbacks, it extends functionality in a clean and maintainable way.

## **Alternatives**
The alternative design was to add code paths that check for certain attributes in tensor subclasses, which can get messy and make it hard to maintain or to understand why the checks were there in the first place.

Test Plan:
Sandcastle

cc mrshenli pritamdamania87 zhaojuanmao satgera gqchen aazzolini osalpekar jiayisuse H-Huang kwen2501 awgu penguinwu fegin XilunWu wanchaol fduwjj wz337 tianyu-l wconstab yf225 chauhang d4l3k LucasLLC

Differential Revision: D57970603

Pulled By: iamzainhuda

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127628
Approved by: https://github.com/wz337, https://github.com/XilunWu, https://github.com/fegin
2024-06-03 21:21:55 +00:00
a4064da8ca Always simplify sympy expressions before printing. (#127543)
This is important because if a replacement has happened during inductor lowering, we may have stale symbols in sympy expressions that we need to replace away.  Do this at the very end.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127543
Approved by: https://github.com/lezcano
2024-06-03 20:36:14 +00:00
ef9451ac8d Move the build of AOTriton to base ROCM docker image. (#127012)
Mitigates #126111

AOTriton, as a math library, takes a long time to build. However, the library is not moving as fast as PyTorch itself, and it is not cost-efficient to build it for every CI check.

This PR moves the build of AOTriton from PyTorch to its base docker image, avoiding duplicated and long build times.

Pre-this-PR:
* PyTorch base docker build job duration: 1.1-1.3h
* PyTorch build job duration: 1.4-1.5hr (includes AOTriton build time of 1hr6min on a linux.2xlarge node)

Post-this-PR:
* PyTorch base docker build job duration: 1.3h (includes AOTriton build time of 20min on a linux.12xlarge node)
* PyTorch build job duration: <20 min

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127012
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/huydhn
2024-06-03 20:35:22 +00:00
941316f821 [pipelining] Stress test schedules with multi iters (#127475)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127475
Approved by: https://github.com/wconstab
2024-06-03 20:24:07 +00:00
db9d457a3f Use sleef on macOS Apple silicon by default (#126509)
Use sleef ~~for aarch64~~ on macOS Apple silicon by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126509
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-06-03 19:33:06 +00:00
2fc907971a Revert "[Inductor] FlexAttention backward kernel optimization (#127208)"
This reverts commit f7171313abf14d9501a330457140b2f8a01c9985.

Reverted https://github.com/pytorch/pytorch/pull/127208 on behalf of https://github.com/yanboliang due to test_flex_attention is failing internally ([comment](https://github.com/pytorch/pytorch/pull/127208#issuecomment-2145830810))
2024-06-03 18:13:27 +00:00
3f45fa63f2 Revert "[Inductor] Add FlexAttention backward kernel dynamic shape tests (#127728)"
This reverts commit 10e3406ea5d115a54a7d753d33110762eb6c07ff.

Reverted https://github.com/pytorch/pytorch/pull/127728 on behalf of https://github.com/yanboliang due to Internal breakage of https://github.com/pytorch/pytorch/pull/127208 hence reverting ([comment](https://github.com/pytorch/pytorch/pull/127728#issuecomment-2145822667))
2024-06-03 18:10:46 +00:00
c35b65715c Revert "[Inductor][Flex-attention] Support different sequence lengths for Query and Key/Value (#127678)"
This reverts commit e2e3ca94ccce1c0abbfd75ac0368793e1756c268.

Reverted https://github.com/pytorch/pytorch/pull/127678 on behalf of https://github.com/atalman due to Internal breakage of https://github.com/pytorch/pytorch/pull/127208 hence reverting ([comment](https://github.com/pytorch/pytorch/pull/127678#issuecomment-2145821489))
2024-06-03 18:07:57 +00:00
3437177e2b Quick Fix on #126854, deepcopy lr and other possible base_parameters (#127190)
* Apply `deepcopy` to every base parameter (`initial_lr`, `max_lr`) when instantiating `LRScheduler`.

Fixes #126854
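
A minimal sketch of why the copy matters, assuming the learning rate is a mutable object (such as a tensor) that is later updated in place. `MutableLR` is a hypothetical stand-in used only for illustration; it is not a PyTorch class, and this is not the scheduler's actual code.

```python
import copy


class MutableLR:
    """Hypothetical stand-in for a mutable learning-rate value (e.g. a tensor)."""

    def __init__(self, value):
        self.value = value

    def scale_(self, factor):
        # In-place update, analogous to an in-place tensor operation.
        self.value *= factor
        return self


group = {"lr": MutableLR(1.0)}

# Without a copy, the "initial" value is just another name for the live lr.
aliased_initial = group["lr"]
# With deepcopy, the initial value is an independent snapshot.
copied_initial = copy.deepcopy(group["lr"])

# A later in-place change to the live lr ...
group["lr"].scale_(0.5)

# ... silently changes the aliased "initial" value but not the copied one.
print(aliased_initial.value, copied_initial.value)
```

With plain Python floats the aliasing cannot happen, since floats are immutable; the problem only shows up for mutable values.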

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127190
Approved by: https://github.com/janeyx99
2024-06-03 18:06:31 +00:00
d8d0bf264a Inductor: Allow small sizes of m for mixed mm autotuning (#127663)
For mixed mm with small sizes of m, such as in the example provided in #127056, being able to set BLOCK_M to 16 leads to better performance. This PR introduces kernel configs that are specific to mixed mm by extending the mm configs with two configs that work well for the example provided in #127056.
I am excluding configs with (BLOCK_M=16, BLOCK_K=16, BLOCK_N=64) because triton crashes when this config is used.

For the example in #127056:
- Without my changes, skip_triton evaluates to true, which disables autotuning. On my machine I achieve 146 GB/s.
- If autotuning is enabled, but BLOCK_M>=32, I achieve 614 GB/s.
- With the changes in this PR (i.e. autotuning enabled and BLOCK_M=16), I achieve 772 GB/s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127663
Approved by: https://github.com/Chillee
2024-06-03 17:53:48 +00:00
7c3740d388 [NestedTensor] Extend coverage for unbind when ragged_idx != 1 (#127493)
Summary:
Extend coverage for the `NestedTensor` `unbind` operator to cases in which `ragged_idx != 1`.

Currently, the `unbind` operator in the `NestedTensor` class splits a tensor along the 0-th dimension, where the `ragged_idx` property, which controls the jagged dimension upon which `unbind` splits, is 1. This diff extends support for `ragged_idx != 1` in `NestedTensor`s, allowing `unbind` to split a tensor along a jagged dimension greater than 0 for `NestedTensor`s with and without the `lengths` property.
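
The idea can be sketched in plain Python with nested lists instead of tensors. This is an illustration under stated assumptions, not the `NestedTensor` implementation: it assumes a dense 2-D `values` buffer plus an `offsets` list of boundaries, and that the jagged dimension `ragged_idx` of the nested tensor corresponds to dimension `ragged_idx - 1` of `values`. The function name `unbind_jagged` is hypothetical.

```python
def unbind_jagged(values, offsets, ragged_idx):
    """Split a 2-D nested list into per-batch pieces along the jagged dimension.

    values:     2-D nested list (the dense buffer).
    offsets:    boundaries along the jagged dimension, e.g. [0, 2, 5].
    ragged_idx: 1 slices rows of `values`, 2 slices columns of `values`.
    """
    pieces = []
    for start, end in zip(offsets[:-1], offsets[1:]):
        if ragged_idx == 1:
            # Jagged dimension is dim 0 of `values`: take a block of rows.
            pieces.append(values[start:end])
        elif ragged_idx == 2:
            # Jagged dimension is dim 1 of `values`: take a block of columns.
            pieces.append([row[start:end] for row in values])
        else:
            raise ValueError("this sketch only handles ragged_idx 1 or 2")
    return pieces


# ragged_idx == 2: two components with 2 and 3 columns respectively.
cols = unbind_jagged([[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]], [0, 2, 5], ragged_idx=2)

# ragged_idx == 1: two components with 1 and 2 rows respectively.
rows = unbind_jagged([[0, 1], [2, 3], [4, 5]], [0, 1, 3], ragged_idx=1)
```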

Test Plan:
Added the following unit tests:

`test_unbind_ragged_idx_equals_2_cpu`, `test_unbind_ragged_idx_equals_3_cpu`, and `test_unbind_ragged_idx_equals_last_dim_cpu` verify that `unbind` works for all jagged dimensions greater than 1, for `NestedTensor`s without `lengths`.
```
test_unbind_ragged_idx_equals_2_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
test_unbind_ragged_idx_equals_3_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
test_unbind_ragged_idx_equals_last_dim_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```

`test_unbind_with_lengths_cpu` and `test_unbind_with_lengths_ragged_idx_equals_1_cpu` verify that `unbind` works when the jagged dimension is 1, for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
test_unbind_with_lengths_ragged_idx_equals_1_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```

`test_unbind_with_lengths_ragged_idx_equals_2_cpu` and `test_unbind_with_lengths_ragged_idx_equals_3_cpu` verify that `unbind` works when the jagged dimension is greater than 1, for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_ragged_idx_equals_2_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
test_unbind_with_lengths_ragged_idx_equals_3_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```

`test_unbind_with_lengths_ragged_idx_equals_0_cpu` verifies that `unbind` fails when the jagged dimension is 0 (the batch dimension), for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_ragged_idx_equals_0_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```

`test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu` verifies that `unbind` fails when there is a mismatch between the offsets and the jagged dimension, for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```

`test_unbind_with_wrong_lengths_cpu` verifies that `unbind` fails when the lengths exceed the limitations set by offsets, for `NestedTensor`s with `lengths`.

```
test_unbind_with_wrong_lengths_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```

Differential Revision: D57942686

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127493
Approved by: https://github.com/davidberard98
2024-06-03 17:46:12 +00:00
4d32de14b6 [export] Handle serializing duplicate getitem nodes (#127633)
We ran into a graph that looks something like the following, where we have 2 getitem calls to the same index (%getitem, %getitem_2 both query topk[0]):
```
graph():
    %x : [num_users=1] = placeholder[target=x]
    %topk : [num_users=3] = call_function[target=torch.ops.aten.topk.default](args = (%x, 2), kwargs = {})
    %getitem : [num_users=1] = call_function[target=operator.getitem](args = (%topk, 0), kwargs = {})
    %getitem_1 : [num_users=1] = call_function[target=operator.getitem](args = (%topk, 1), kwargs = {})
    %getitem_2 : [num_users=1] = call_function[target=operator.getitem](args = (%topk, 0), kwargs = {})
    %mul_tensor : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%getitem, %getitem_2), kwargs = {})
    %mul : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%mul_tensor, 2), kwargs = {})
    return (mul, getitem_1)
```

The duplicate getitem call gets created during a pass, so there are a couple of solutions:

1. Change serializer to support the case of duplicate getitem calls
2. Change the pass so that it doesn’t produce duplicate getitem calls
3. Add a pass which dedups the getitem calls

As a framework, we should do 1 and 3 (through a CSE pass).

This PR implements solution 1. However, the serializer currently does some special handling for getitem nodes: instead of serializing the getitem nodes directly, we serialize the output of the node that outputs a list of tensors (the %topk node in this example) into a list of nodes, one for each output ([%getitem, %getitem_1]). This fails when we have duplicate getitem nodes for the same index (%getitem_2), since we do not record that duplicate getitem node anywhere. So, the solution this PR takes is for the serializer to deduplicate the getitem nodes (%getitem_2 will be replaced with %getitem). This results in a semantically correct graph, but one that is not necessarily node-to-node identical to the original fx graph.
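
The deduplication step can be sketched as follows. This is a simplified illustration, not the serializer's code: nodes are modeled as plain `(name, source, index)` tuples rather than fx nodes, and `dedup_getitems` is a hypothetical helper name.

```python
def dedup_getitems(getitems):
    """Keep the first getitem per (source, index) and map duplicates onto it.

    getitems: list of (node_name, source_name, index) tuples in graph order.
    Returns (kept, replacements), where `replacements` maps each duplicate
    node name to the name of the node that should be used in its place.
    """
    first_seen = {}
    kept = []
    replacements = {}
    for name, source, index in getitems:
        key = (source, index)
        if key in first_seen:
            # Same source and index as an earlier node: reuse that node.
            replacements[name] = first_seen[key]
        else:
            first_seen[key] = name
            kept.append((name, source, index))
    return kept, replacements


# The three getitem nodes from the example graph above.
nodes = [("getitem", "topk", 0), ("getitem_1", "topk", 1), ("getitem_2", "topk", 0)]
kept, replacements = dedup_getitems(nodes)
print(replacements)
```

Every use of a replaced name (here, the second argument of `%mul_tensor`) would then be rewritten to the surviving node.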
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127633
Approved by: https://github.com/ydwu4
2024-06-03 17:25:51 +00:00
12c4a2c297 [BE]: Apply PLR1736 fixes (unnecessary index lookup) (#127716)
Applies the PLR1736 preview rule with some more autofixes to cut down on unnecessary accesses. Added a noqa since that test is actually testing the dunder method.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127716
Approved by: https://github.com/ezyang
2024-06-03 17:22:13 +00:00
21144ce570 [dtensor] implement scatter op with simple replication (#126713)
As titled, implement the torch.scatter op with a simple replication
strategy. We need to follow up and see whether we can actually support
any sharding pattern.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126713
Approved by: https://github.com/tianyu-l
ghstack dependencies: #126712
2024-06-03 16:16:28 +00:00
ded580a594 [dtensor] standardize multi mesh-dim strategy with utils (#126712)
This PR standardizes multi mesh-dim strategy generation by unifying a
util that expands a single mesh-dim strategy into a multi mesh-dim
strategy, making strategy generation simpler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126712
Approved by: https://github.com/tianyu-l
2024-06-03 16:16:28 +00:00
d1fad416a8 Revert "Add aten._unsafe_masked_index (#116491)"
This reverts commit f03f8bc901a6c9038308a6353e8d280f4b5628f5.

Reverted https://github.com/pytorch/pytorch/pull/116491 on behalf of https://github.com/PaliC due to breaking onnx tests ([comment](https://github.com/pytorch/pytorch/pull/116491#issuecomment-2145557724))
2024-06-03 15:51:50 +00:00
53f001c599 Revert "correct BLAS input (#126200)" (#127762)
This reverts commit ea13e9a097aaa875a2b404822579b7f8b62ea291.

Looks like this could have caused: https://github.com/pytorch/pytorch/actions/runs/9346105069/job/25722431775#step:17:984

Aarch64 tests failures:
```
+ echo 'Checking that MKLDNN is available on aarch64'
Checking that MKLDNN is available on aarch64
+ pushd /tmp
/tmp /
+ python -c 'import torch; exit(0 if torch.backends.mkldnn.is_available() else 1)'
Error: Process completed with exit code 1.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127762
Approved by: https://github.com/PaliC, https://github.com/malfet
2024-06-03 15:49:48 +00:00
8677508167 [c10d] guard gpu context during abort (#127363)
This is a mitigation for an internal out-of-memory issue on GPU0 that happened during comms abort. This PR was tested internally and confirmed to fix the out-of-memory issue.

Note: this is supposed to be a mitigation only, as the ideal fix should be within the NCCL comm libs, which should just set the right CUDA context before any CUDA call and restore it to its exact previous state.

ncclCommDestroy/ncclCommAbort -> commReclaim -> commDestroySync (https://fburl.com/code/pori1tka)

In commDestroySync, it thinks that the "current device context" is not the same as the comm's device context. It tries to:
1) save the current context
2) set the comm's device context
3) clean up things
4) restore the "previously stored context" by another cudaSetDevice.
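
The four steps above follow the usual save/set/restore guard pattern. A minimal sketch in plain Python, assuming a hypothetical process-wide "current device" (the `_state` dict, `set_device`, `current_device`, and `device_guard` are illustrative stand-ins, not NCCL or PyTorch APIs):

```python
from contextlib import contextmanager

# Hypothetical stand-in for the process-wide "current CUDA device".
_state = {"device": 0}


def set_device(index):
    """Illustrative replacement for a cudaSetDevice-style call."""
    _state["device"] = index


def current_device():
    return _state["device"]


@contextmanager
def device_guard(index):
    previous = current_device()  # 1) save the current context
    set_device(index)            # 2) set the target device context
    try:
        yield                    # 3) cleanup work runs inside the guard
    finally:
        set_device(previous)     # 4) restore the previously stored context
```

The `finally` clause is the point of the pattern: the previous device is restored even if the cleanup step raises.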

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127363
Approved by: https://github.com/wconstab
2024-06-03 15:41:11 +00:00
430cdfc0ac [ATen][Native] fixes sparse SPMV on aarch64 (#127642)
Fixes #127491
In #127491, `result` was allocated as `result = at::empty(...)`, which does not guarantee that `result` is filled with zeros, so `torch.mv` was producing non-finite values. This happened mainly because the corner case (`beta = 0`) of `addmv` was not taken care of, as it should be in any other `addmv`/`addmm`:
923edef31c/aten/src/ATen/native/mkl/SparseBlasImpl.cpp (L307-L311)
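
To illustrate why `beta = 0` needs special handling, here is a small pure-Python sketch (not the ATen code; `addmv_naive` and `addmv_guarded` are hypothetical names). Multiplying uninitialized memory by zero is not safe, because `0 * nan` and `0 * inf` are both `nan`:

```python
import math


def addmv_naive(y, mat, x, beta=1.0, alpha=1.0):
    """Computes beta * y + alpha * (mat @ x), always reading y."""
    return [
        beta * y[i] + alpha * sum(a * b for a, b in zip(row, x))
        for i, row in enumerate(mat)
    ]


def addmv_guarded(y, mat, x, beta=1.0, alpha=1.0):
    """Same computation, but y is ignored entirely when beta == 0."""
    out = []
    for i, row in enumerate(mat):
        acc = alpha * sum(a * b for a, b in zip(row, x))
        if beta != 0:
            acc += beta * y[i]
        out.append(acc)
    return out


# Stand-in for an uninitialized output buffer.
uninitialized = [math.nan, math.inf]
mat = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, 1.0]
```

With these inputs, `addmv_naive(uninitialized, mat, x, beta=0.0)` contains non-finite values, while `addmv_guarded` returns the plain matrix-vector product.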
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127642
Approved by: https://github.com/malfet
2024-06-03 15:38:27 +00:00
badf898df2 Remove unstable ARC jobs (#127563)
Disable these jobs since we're no longer trying to enable ARC
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127563
Approved by: https://github.com/huydhn
2024-06-03 15:30:06 +00:00
63d7ffe121 Retry of D58015187 Move AsyncCompile to a different file (#127691)
Summary:
This is a retry of https://github.com/pytorch/pytorch/pull/127545/files
and
D58015187, fixing the internal test that also imported codecache

Test Plan: Same tests as CI in github, plus sandcastle for internal unit tests should pass now

Differential Revision: D58054611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127691
Approved by: https://github.com/oulgen
2024-06-03 15:29:41 +00:00
3f8b8f08c8 [Split Build] Make libtorch_global_deps accessible from libtorch wheel (#127570)
Title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127570
Approved by: https://github.com/atalman, https://github.com/malfet
2024-06-03 15:14:29 +00:00
d05cddfe23 Revert "FP8 rowwise scaling (#125204)"
This reverts commit 923edef31c7f3e98a14625724f2019b1422dcb26.

Reverted https://github.com/pytorch/pytorch/pull/125204 on behalf of https://github.com/atalman due to Broke nightlies and internal tests ([comment](https://github.com/pytorch/pytorch/pull/125204#issuecomment-2145422196))
2024-06-03 15:00:21 +00:00
f03f8bc901 Add aten._unsafe_masked_index (#116491)
To generate masked indexing operations that would generate
masked loads in triton code
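
As a rough illustration of what a masked index means, the following pure-Python sketch works on flat lists. The function name, argument order, and fill behaviour here are assumptions made for illustration and are not taken from the ATen signature:

```python
def masked_index(values, mask, indices, fill):
    """Reads values[i] only where the mask is True; otherwise yields fill.

    Indices under a False mask are never dereferenced, so they may be
    out of bounds without causing an error.
    """
    return [values[i] if m else fill for m, i in zip(mask, indices)]


data = [10, 20, 30]
result = masked_index(data, [True, False, True], [2, 99, 0], fill=-1)
```

The second index (99) is out of range but is skipped because its mask entry is False, which is the property that maps naturally onto a masked load.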

Pull Request resolved: https://github.com/pytorch/pytorch/pull/116491
Approved by: https://github.com/lezcano, https://github.com/peterbell10
2024-06-03 14:44:03 +00:00
d6963e769c Force Inductor output code to be dumped even if it fails to compile (#127700)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127700
Approved by: https://github.com/oulgen
2024-06-03 14:06:53 +00:00
f343f98710 [jit] Validate mobile module fields parsed by flatbuffer loader (#127437)
Fixes an error in the `torch.jit.load` Python API function that causes a crash in the C-backend of PyTorch.
The mobile module is successfully parsed from flatbuffer format, but its fields are used without any validation.

Fixes #127434

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127437
Approved by: https://github.com/davidberard98
2024-06-03 08:48:12 +00:00
e017b56c0c [dtensor] local_map UX change: keep func signature and be compatible with Tensor input (#126924)
**Summary**
This PR makes two changes to `local_map`:

1. Regulates the way users can access `DeviceMesh` inside the `func` argument of `local_map`. This means `local_map` will strictly follow the `func` signature without implicitly passing any argument to `func`. If the user wants to use `DeviceMesh` inside `func`, this mesh must be explicitly passed to `func` as an argument by the user. For example,

```
def user_function(device_mesh, /, *args, **kwargs):
    USER CODE HERE

local_func = local_map(func=user_function, ...)
dtensor_out = local_func(device_mesh, dtensor_input, ...)
```

Before this PR, user code was like:
```
def user_function(device_mesh, /, *args, **kwargs):
    USER CODE HERE

local_func = local_map(func=user_function, ...)
dtensor_out = local_func(dtensor_input, ...)  # local_map passes mesh implicitly for user
```

2. `local_map` now supports mixed use of `torch.Tensor` and `DTensor` in arguments:

- Pure torch.Tensor case: no `DTensor` argument is passed in, all tensor arguments are `torch.Tensor`. Bypass the `in_placements` check and unwrapping steps. The output will not be wrapped into `DTensor` but directly returned.
- Pure DTensor case: no `torch.Tensor` argument is passed in, all tensor arguments are `DTensor`. This follows the default rule: `in_placements` check, unwrapping arguments, pass into `func`, wrapping the `torch.Tensor` output into `DTensor` if the `out_placements` is not `None`.
- Mix of the above two: some arguments are `torch.Tensor` while some are `DTensor`. Only perform `in_placements` check and unwrapping on `DTensor` arguments. For output processing, it's the same as Pure DTensor case.
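
The three cases above can be sketched with stand-in types. This is only an illustration of the dispatch rule as described, not the `local_map` implementation: `FakeDTensor`, `local_map_sketch`, and the tuple-based placements are hypothetical, and plain lists play the role of `torch.Tensor`.

```python
from dataclasses import dataclass


@dataclass
class FakeDTensor:
    """Hypothetical stand-in for DTensor: local data plus placements."""
    local: list
    placements: tuple


def local_map_sketch(func, out_placements, in_placements):
    def wrapped(*args):
        saw_dtensor = False
        local_args = []
        for arg, expected in zip(args, in_placements):
            if isinstance(arg, FakeDTensor):
                saw_dtensor = True
                # Placement check and unwrapping apply only to DTensor args.
                if expected is not None and arg.placements != expected:
                    raise ValueError("unexpected input placements")
                local_args.append(arg.local)
            else:
                local_args.append(arg)  # plain tensors pass through untouched
        out = func(*local_args)
        # Wrap the output only if a DTensor was involved and placements are given.
        if saw_dtensor and out_placements is not None:
            return FakeDTensor(out, out_placements)
        return out

    return wrapped


def add(a, b):
    return [x + y for x, y in zip(a, b)]


mapped = local_map_sketch(add, ("shard",), (("shard",), None))
```

Calling `mapped` with two plain lists returns a plain list, while passing a `FakeDTensor` for either argument returns a wrapped result.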

**Test**
`pytest test/distributed/_tensor/experimental/test_local_map.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126924
Approved by: https://github.com/wanchaol
2024-06-03 08:41:59 +00:00
2d1ad0c31a [CI] Add freezing for cpu inductor accuracy test in inductor CI (#124715)
This PR enables '--freezing' when running the dynamo accuracy check in CI.
Background:
Issue [#124286](https://github.com/pytorch/pytorch/issues/124286) is not captured by CI since freezing is not enabled for cpu-inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124715
Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/atalman, https://github.com/desertfire
2024-06-03 07:37:30 +00:00
10e3406ea5 [Inductor] Add FlexAttention backward kernel dynamic shape tests (#127728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127728
Approved by: https://github.com/Chillee
2024-06-03 07:15:46 +00:00
6d21685b45 [DSD] Fixes various bugs for broadcast_from_rank0 (#127635)
Fixes https://github.com/pytorch/pytorch/issues/126285

Summary:
1. Fixes https://github.com/pytorch/pytorch/issues/126285
2. Broadcast one tensor at a time to avoid OOM.
3. Add some docstrings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127635
Approved by: https://github.com/weifengpy
2024-06-03 06:35:21 +00:00
48846cd164 Update torch-xpu-ops pin (ATen XPU implementation) (#127730)
Regular bi-weekly pin update.
1. Porting operator-related PyTorch unit tests. The existing operators in torch-xpu-ops are covered by 1) operator-specific tests, like test_binary_ufuncs.py, and 2) common operator tests, like test_ops.py.
2. Bug fixes under the latest PyTorch unit test scope, https://github.com/intel/torch-xpu-ops/tree/release/2.4/test/xpu.

In total, 297 ATen operators are implemented in torch-xpu-ops. https://github.com/intel/torch-xpu-ops/blob/release/2.4/yaml/xpu_functions.yaml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127730
Approved by: https://github.com/EikanWang
2024-06-03 05:55:00 +00:00
e2e3ca94cc [Inductor][Flex-attention] Support different sequence lengths for Query and Key/Value (#127678)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127678
Approved by: https://github.com/Chillee
2024-06-03 04:35:50 +00:00
cyy
288df042c5 [1/N] Change static functions in headers to inline (#127727)
So that it may fix some tricky linking issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127727
Approved by: https://github.com/ezyang
2024-06-03 04:34:36 +00:00
cyy
1b182ea0d2 Remove c10::guts::{conjunction,disjunction} (#127726)
They are not used in PyTorch OSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127726
Approved by: https://github.com/ezyang
2024-06-03 04:06:21 +00:00
3399ad8d9d [Inductor][CPP] Add UT for bitwise right shift (#127731)
**Summary**
Per the discussion in https://github.com/pytorch/pytorch/issues/127310, `bitwise_right_shift` failed in Torch 2.1 but passes with the latest PyTorch. Add the UT in this PR to ensure correctness.

**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_bitwise_right_shift
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127731
Approved by: https://github.com/Skylion007
2024-06-03 04:05:41 +00:00
7e97b33fbb [Dynamo] Log backward graph compilation metrics (#126629)
Fixes #125313

Compilation metric logs for the code example at #125313:
```
%s CompilationMetrics(compile_id='0/0', frame_key='1', co_name='forward', co_filename='/data/users/ybliang/debug/debug2.py', co_firstlineno=10, cache_size=0, accumulated_cache_size=0, guard_count=11, shape_env_guard_count=0, graph_op_count=1, graph_node_count=3, graph_input_count=1, start_time=1716247236.6165977, entire_frame_compile_time_s=7.926939964294434, backend_compile_time_s=7.887059926986694, inductor_compile_time_s=4.108498811721802, code_gen_time_s=3.97833514213562, fail_type=None, fail_reason=None, fail_user_frame_filename=None, fail_user_frame_lineno=None, non_compliant_ops=set(), compliant_custom_ops=set(), restart_reasons={"'skip function graph_break in file /home/ybliang/local/pytorch/torch/_dynamo/decorators.py'"}, dynamo_time_before_restart_s=0.025330543518066406, has_guarded_code=True, is_fwd=True)
%s CompilationMetrics(compile_id='1/0', frame_key='2', co_name='torch_dynamo_resume_in_forward_at_12', co_filename='/data/users/ybliang/debug/debug2.py', co_firstlineno=12, cache_size=0, accumulated_cache_size=0, guard_count=10, shape_env_guard_count=0, graph_op_count=2, graph_node_count=5, graph_input_count=1, start_time=1716247244.544928, entire_frame_compile_time_s=0.10148310661315918, backend_compile_time_s=0.08753013610839844, inductor_compile_time_s=0.03691983222961426, code_gen_time_s=0.022417306900024414, fail_type=None, fail_reason=None, fail_user_frame_filename=None, fail_user_frame_lineno=None, non_compliant_ops=set(), compliant_custom_ops=set(), restart_reasons=set(), dynamo_time_before_restart_s=0.0, has_guarded_code=True, is_fwd=True)
tensor([[-0.1622, -0.0000, -0.0000,  0.5643, -0.0000,  0.0000, -0.5087,  0.0914,
         -0.0000, -0.0421]], grad_fn=<CompiledFunctionBackward>)
%s CompilationMetrics(compile_id='1/0', frame_key=None, co_name=None, co_filename=None, co_firstlineno=None, cache_size=None, accumulated_cache_size=None, guard_count=None, shape_env_guard_count=None, graph_op_count=None, graph_node_count=None, graph_input_count=None, start_time=None, entire_frame_compile_time_s=None, backend_compile_time_s=None, inductor_compile_time_s=0.026738643646240234, code_gen_time_s=0.016446352005004883, fail_type=None, fail_reason=None, fail_user_frame_filename=None, fail_user_frame_lineno=None, non_compliant_ops=None, compliant_custom_ops=None, restart_reasons=None, dynamo_time_before_restart_s=None, has_guarded_code=None, is_fwd=False)
%s CompilationMetrics(compile_id='0/0', frame_key=None, co_name=None, co_filename=None, co_firstlineno=None, cache_size=None, accumulated_cache_size=None, guard_count=None, shape_env_guard_count=None, graph_op_count=None, graph_node_count=None, graph_input_count=None, start_time=None, entire_frame_compile_time_s=None, backend_compile_time_s=None, inductor_compile_time_s=0.14563536643981934, code_gen_time_s=0.08652091026306152, fail_type=None, fail_reason=None, fail_user_frame_filename=None, fail_user_frame_lineno=None, non_compliant_ops=None, compliant_custom_ops=None, restart_reasons=None, dynamo_time_before_restart_s=None, has_guarded_code=None, is_fwd=False)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126629
Approved by: https://github.com/ezyang
2024-06-03 03:55:33 +00:00
84776d7597 Revert "[BE]: Update mypy to 1.10.0 (#127717)"
This reverts commit 30213ab0a7b27277e76ea9dd707ce629a63d91ee.

Reverted https://github.com/pytorch/pytorch/pull/127717 on behalf of https://github.com/huydhn due to I am not sure why but the failures look legit and they are showing up in trunk 30213ab0a7 ([comment](https://github.com/pytorch/pytorch/pull/127717#issuecomment-2144183347))
2024-06-03 02:52:47 +00:00
e57f51b80f Update _dedup_save_plans.py (#126569)
To resolve https://github.com/pytorch/pytorch/issues/125740, save each tensor on the lowest rank.
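
A minimal sketch of the "lowest rank wins" rule, assuming each rank's save plan can be modelled as a list of tensor keys (the real plan objects are richer; `dedup_to_lowest_rank` is a hypothetical helper, not the function in `_dedup_save_plans.py`):

```python
def dedup_to_lowest_rank(plans):
    """plans[rank] is the list of keys that rank intends to save.

    Each key is kept only in the plan of the lowest rank that lists it.
    """
    owner = {}
    for rank, keys in enumerate(plans):
        for key in keys:
            owner.setdefault(key, rank)  # the first (lowest) rank claims the key
    return [
        [key for key in keys if owner[key] == rank]
        for rank, keys in enumerate(plans)
    ]


plans = [["a", "b"], ["b", "c"], ["a", "c"]]
deduped = dedup_to_lowest_rank(plans)
```

Here rank 0 keeps `a` and `b`, rank 1 keeps only `c`, and rank 2 has nothing left to write.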

Fixes #125740

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126569
Approved by: https://github.com/LucasLLC
2024-06-03 01:55:03 +00:00
fec8ef8c17 [Aten][BlasKernel] Add function prototype to fix compiler error (#127719)
Adds a prototype for function `fp16_dot_with_fp32_arith()` in `aten/src/ATen/native/BlasKernel.cpp`.

Without this patch the build fails on Apple silicon/macOS (CPU) with the error `no previous prototype for function 'fp16_dot_with_fp32_arith' [-Werror,-Wmissing-prototypes]`.

The function cannot be marked `static` because its use is not limited to this file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127719
Approved by: https://github.com/Skylion007
2024-06-02 23:41:43 +00:00
8b08b0f340 [BE] enable ruff rule Q from flake8-quotes (#127713)
Enable [ruff rule `Q`](https://docs.astral.sh/ruff/rules/#flake8-quotes-q) from flake8-quotes. Fixes:

- [avoidable-escaped-quote (Q003)](https://docs.astral.sh/ruff/rules/avoidable-escaped-quote/#avoidable-escaped-quote-q003)
- [unnecessary-escaped-quote (Q004)](https://docs.astral.sh/ruff/rules/unnecessary-escaped-quote/#unnecessary-escaped-quote-q004)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127713
Approved by: https://github.com/ezyang
2024-06-02 23:25:26 +00:00
139b9c6529 Avoid reference cycle in inner closure (#127711)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127711
Approved by: https://github.com/Skylion007, https://github.com/izaitsevfb
2024-06-02 21:28:46 +00:00
30213ab0a7 [BE]: Update mypy to 1.10.0 (#127717)
Updates mypy to the latest and greatest.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127717
Approved by: https://github.com/ezyang
2024-06-02 21:07:23 +00:00
fb53cd6497 [aten_cuda/flash_attn] Add typename to template argument Kernel_trait… (#127634)
Adds the `typename` keyword to the template argument `Kernel_traits::TiledMma` and `Kernel_traits::TiledMmaSdP` (which are dependent type names) when calling the template function `pytorch_flash::convert_layout_acc_Aregs`.

Without `typename`, flash_attention kernels do not compile with Clang under C++20, since Clang compiles the entire .cu file in a single pass, as opposed to NVCC, which split-compiles the host and device code. Adding `typename` seems to be OK under NVCC, based on CI CUDA builds succeeding.

Below is the excerpt of the compilation error:

```
third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/static_switch.h:46:24: note: expanded from macro 'ALIBI_SWITCH'
   46 |   #define ALIBI_SWITCH BOOL_SWITCH
      |                        ^
third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/flash_bwd_launch_template.h:132:5: note: in instantiation of function template specialization 'pytorch_flash::run_flash_bwd_seqk_parallel<pytorch_flash::Flash_bwd_kernel_traits<160, 64, 64, 8, 4, 4, 4, false, true>, true>' requested here
  132 |     run_flash_bwd_seqk_parallel<Kernel_traits, Is_dropout>(params, stream);
      |     ^
third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/flash_bwd_launch_template.h:280:13: note: in instantiation of function template specialization 'pytorch_flash::run_flash_bwd<pytorch_flash::Flash_bwd_kernel_traits<160, 64, 64, 8, 4, 4, 4, false, true>, true>' requested here
  280 |             run_flash_bwd<Flash_bwd_kernel_traits<Headdim, 64, 64, 8, 4, 4, 4, false, true, T>, Is_dropout>(params, stream);
      |             ^
third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/static_switch.h:36:26: note: expanded from macro 'DROPOUT_SWITCH'
   36 |   #define DROPOUT_SWITCH BOOL_SWITCH
      |                          ^
third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim160_fp16_sm80.cu:12:5: note: in instantiation of function template specialization 'pytorch_flash::run_mha_bwd_hdim160<cutlass::half_t>' requested here
   12 |     run_mha_bwd_hdim160<cutlass::half_t>(params, stream);
      |     ^
In file included from third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/kernels/flash_bwd_hdim160_fp16_sm80.cu:7:
In file included from third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/flash_bwd_launch_template.h:12:
third_party/py/torch/aten/src/ATen/native/transformers/cuda/flash_attn/flash_bwd_kernel.h:543:86: error: missing 'typename' prior to dependent type name 'Flash_bwd_kernel_traits<160, 64, 64, 8, 4, 4, 4, false, true>::TiledMmaSdP'
  543 |         Tensor tPrP = make_tensor(rP.data(), pytorch_flash::convert_layout_acc_Aregs<Kernel_traits::TiledMmaSdP>(rP.layout()));
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127634
Approved by: https://github.com/Skylion007
2024-06-02 16:25:02 +00:00
08653fe355 Beef up the allow_in_graph docs (#127117)
We make the following changes:
- most of the time when someone uses allow_in_graph, they actually
  wanted to make a custom op. We add a link to the custom ops landing
  page and explain the differences between allow_in_graph and custom
  ops.
- we warn people against using allow_in_graph footguns and document
  them.

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127117
Approved by: https://github.com/jansel, https://github.com/albanD
2024-06-02 15:00:46 +00:00
e24a87ed8d [BE][Ez]: Apply PYI059 - Generic always come last (#127685)
The Generic base class should always come last, or unexpected issues can occur, especially in non-stub files (such as with MRO). Applies autofixes from the preview PYI059 rule to fix the issues in the codebase.
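
One observable consequence of the base-class order is the method resolution order. A small sketch with hypothetical classes (`Base`, `GenericFirst`, `GenericLast`) shows how `Generic` moves ahead of the real base class when it is listed first:

```python
from typing import Generic, TypeVar

T = TypeVar("T")


class Base:
    pass


class GenericFirst(Generic[T], Base):  # the ordering PYI059 flags
    pass


class GenericLast(Base, Generic[T]):  # the preferred ordering
    pass


def mro_names(cls):
    """Returns the class names in method resolution order."""
    return [c.__name__ for c in cls.__mro__]
```

In `GenericFirst`, attribute lookups consult `Generic` before `Base`; in `GenericLast`, `Base` takes precedence, which is usually what the author intends.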

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127685
Approved by: https://github.com/ezyang
2024-06-02 13:38:58 +00:00
c2547dfcc3 [BE][Ez]: Enable ruff PYI019 (#127684)
Tells pytorch to use typing_extensions.Self when it's able to.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127684
Approved by: https://github.com/ezyang
2024-06-02 13:38:33 +00:00
67ef2683d9 [BE] wrap deprecated function/class with typing_extensions.deprecated (#127689)
Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing.

Note that only warnings whose messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.
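
For the fallback case, a minimal sketch using only the standard library (the function `old_api` and its message are hypothetical). Passing `category=FutureWarning` explicitly replaces the default `UserWarning` category of `warnings.warn`:

```python
import warnings


def old_api():
    # Without an explicit category, warnings.warn would emit a UserWarning.
    warnings.warn(
        "old_api is deprecated, use new_api instead",
        category=FutureWarning,
        stacklevel=2,  # attribute the warning to the caller of old_api
    )
    return 42


def collect_categories():
    """Calls old_api and returns the categories of the warnings it emitted."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        old_api()
    return [w.category for w in caught]
```

Unlike `DeprecationWarning`, `FutureWarning` is shown by default outside of `__main__`, so end users are more likely to see the message.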

Resolves #126888

- #126888

This PR is split from PR #126898.

- #126898

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127689
Approved by: https://github.com/Skylion007
2024-06-02 12:30:43 +00:00
c1dd3a615f Implement Graph Transform Observer (#127427)
Summary: Implement Graph Transform Observer

Differential Revision: D57887518

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127427
Approved by: https://github.com/angelayi
2024-06-02 06:49:47 +00:00
cyy
4e7f497bb3 [Submodule] Remove ios-cmake (#127694)
It has not been updated for a long time and CI iOS builds don't rely on it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127694
Approved by: https://github.com/ezyang
2024-06-02 04:40:21 +00:00
2129903aa3 Properly detect nested torch function args (#127496)
Dynamo was not detecting nested torch function classes in containers. This was due to pytree compatibility for variable trackers being removed.
Fixes https://github.com/pytorch/pytorch/issues/127174

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127496
Approved by: https://github.com/anijain2305
2024-06-02 03:43:22 +00:00
16578e8584 [symbolic shapes] if symbol not in var_ranges default to unknown range (#127681)
Purpose of this PR is to get around this error: https://github.com/pytorch/pytorch/issues/127677

Differential Revision: D58048558

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127681
Approved by: https://github.com/lezcano
2024-06-02 02:28:40 +00:00
4fd777ed59 [ONNX] Add quantized layer norm op to opset 17 (#127640)
Fixes #126160
Continue #126555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127640
Approved by: https://github.com/justinchuby
2024-06-02 02:10:02 +00:00
c19ad112f6 [Inductor UT][Intel GPU] Skip test case which doesn't currently work on the XPU stack but newly re-enabled by community. (#127629)
The Inductor UT test/inductor/test_triton_heuristics.py:test_artificial_zgrid, which was previously skipped, was recently enabled by the PR https://github.com/pytorch/pytorch/pull/127448. However, the test doesn't currently work on the XPU stack: it will hang on GPU, so this PR skips the test for Intel GPU instead of marking it as an expected failure.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127629
Approved by: https://github.com/EikanWang, https://github.com/peterbell10
2024-06-02 01:00:33 +00:00
2cef2fc2b4 [ts migration] support aten::dim, aten::len, aten::__getitem__ (#127593)
- Add support for aten::dim, aten::len, aten::__getitem__ for torchscript to export converter.
- Add unit tests
Co-authored-by: cyy <cyyever@outlook.com>
Co-authored-by: Menglu Yu <mengluy@meta.com>
Co-authored-by: Animesh Jain <anijain@umich.edu>
Co-authored-by: Simon Fan <xmfan@meta.com>
Co-authored-by: Zain Rizvi <ZainR@meta.com>
Co-authored-by: Tugsbayasgalan (Tugsuu) Manlaibaatar <tmanlaibaatar@meta.com>
Co-authored-by: titaiwangms <titaiwang@microsoft.com>
Co-authored-by: Yueming Hao <yhao@meta.com>
Co-authored-by: IvanKobzarev <ivan.kobzarev@gmail.com>
Co-authored-by: PyTorch MergeBot <pytorchmergebot@users.noreply.github.com>
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Co-authored-by: Bin Bao <binbao@meta.com>
Co-authored-by: Feny Patel <fenypatel@meta.com>
Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Co-authored-by: xinan.lin <xinan.lin@intel.com>
Co-authored-by: Zain Huda <zainhuda@meta.com>
Co-authored-by: Chien-Chin Huang <chienchin@fb.com>
Co-authored-by: Wei Wang <weiwan@nvidia.com>
Co-authored-by: Jason Ansel <jansel@meta.com>
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Co-authored-by: Iris Z <31293777+wz337@users.noreply.github.com>
Co-authored-by: Wang, Eikan <eikan.wang@intel.com>
Co-authored-by: angelayi <yiangela7@gmail.com>
Co-authored-by: Svetlana Karslioglu <svekars@meta.com>
Co-authored-by: Yanbo Liang <ybliang8@gmail.com>
Co-authored-by: Catherine Lee <csl@fb.com>
Co-authored-by: Kwanghoon An <kwanghoon@meta.com>
Co-authored-by: Brian Hirsh <hirsheybar@fb.com>
Co-authored-by: Robert Mast <rmast@live.nl>
Co-authored-by: drisspg <drisspguessous@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127593
Approved by: https://github.com/SherlockNoMad, https://github.com/malfet
2024-06-02 00:36:33 +00:00
0d9e527c4d Remove tensor storage_offset/storage_bytes from the cache key (#127319)
Summary: We observed differences in these fields and inductor does not specialize on them so it is safe to remove them from the key.

Test Plan: CI

Reviewed By: masnesral

Differential Revision: D57871276

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127319
Approved by: https://github.com/masnesral
2024-06-02 00:28:43 +00:00
eqy
2e779166eb [Functorch][cuDNN] Bump tolerances for test_vmapjvpvjp (#127355)
cuDNN can select a winograd kernel for this case which slightly affects tolerances...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127355
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2024-06-01 21:22:55 +00:00
6e2e09f6cc [inductor] fix redis-related env vars in remote_cache.py (#127583)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127583
Approved by: https://github.com/oulgen
2024-06-01 19:55:25 +00:00
b505e86475 [Inductor][CI][CUDA 12.4] Update dynamic_inductor_timm_training.csv - change gluon_inception_v3 from fail_accuracy to pass (#127672)
From the HUD, most of the time the "X" is due to "improved_accuracy" for gluon_inception_v3.

![image](https://github.com/pytorch/pytorch/assets/143543872/d4f70377-2756-4921-872d-587426f00302)

https://hud.pytorch.org/hud/pytorch/pytorch/main/1?per_page=50&name_filter=inductor_timm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127672
Approved by: https://github.com/eqy, https://github.com/Skylion007
2024-06-01 19:12:43 +00:00
17dea09b15 Revert "Default XLA to use swap_tensors path in nn.Module._apply (#126814)"
This reverts commit bfdec93395f675a0e5a59e95aef9104ac8f5081a.

Reverted https://github.com/pytorch/pytorch/pull/126814 on behalf of https://github.com/izaitsevfb due to suspicious build instructions count regression, see [D58015016](https://www.internalfb.com/diff/D58015016) ([comment](https://github.com/pytorch/pytorch/pull/126814#issuecomment-2143545818))
2024-06-01 18:46:16 +00:00
82cd7a7dab Revert "Default meta device to use swap_tensors in nn.Module._apply (.to_empty and .to('meta')) (#126819)"
This reverts commit fa426b096b3635daab6ce26b44d50f3baab5a4e5.

Reverted https://github.com/pytorch/pytorch/pull/126819 on behalf of https://github.com/izaitsevfb due to suspicious build instructions count regression, see [D58015016](https://www.internalfb.com/diff/D58015016) ([comment](https://github.com/pytorch/pytorch/pull/126814#issuecomment-2143545818))
2024-06-01 18:46:16 +00:00
42312a52b3 [DSD] Adds type_check param to copy state dict utils (#127417)
[DSD] Adds type_check param to copy state dict utils.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127417
Approved by: https://github.com/fegin
2024-06-01 17:50:52 +00:00
edffb28d39 [BE][Ez]: Enable B019 - flags memory leaks through LRU cache on method (#127686)
Flags potential memory leaks through LRUCache and will hopefully make future contributors rethink this pattern, which can cause memory leaks. Adds noqa to the violations we currently have (these should be fixed later).
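
A short sketch of the pattern B019 flags, using hypothetical classes. A `functools.lru_cache` applied to a method stores `self` in its cache keys, so the class-level cache keeps every instance alive; building the cache per instance avoids that:

```python
import functools
import gc
import weakref


class Leaky:
    @functools.lru_cache(maxsize=None)  # B019: cache keys hold a reference to self
    def double(self, x):
        return 2 * x


class Safe:
    def __init__(self):
        # Per-instance cache: it goes away together with the instance.
        self.double = functools.lru_cache(maxsize=None)(self._double)

    def _double(self, x):
        return 2 * x


def is_collected(cls):
    """Creates an instance, uses the cached method, and checks if it can be freed."""
    obj = cls()
    obj.double(3)
    ref = weakref.ref(obj)
    del obj
    gc.collect()  # also breaks the self-referential cycle in Safe
    return ref() is None
```

`is_collected(Leaky)` stays false for as long as the class-level cache holds the entry, while `is_collected(Safe)` becomes true after a garbage collection pass.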

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127686
Approved by: https://github.com/c-p-i-o
2024-06-01 17:19:24 +00:00
22f392ba40 Revert "[easy?] Move AsyncCompile to a different file (#127235)"
This reverts commit f58fc16e8f059232f452a333f32e14ff681e12af.

Reverted https://github.com/pytorch/pytorch/pull/127235 on behalf of https://github.com/izaitsevfb due to breaking internal tests, see [D58015187](https://www.internalfb.com/diff/D58015187) ([comment](https://github.com/pytorch/pytorch/pull/127235#issuecomment-2143518610))
2024-06-01 17:16:16 +00:00
d49dc8f4b8 Revert "Add noqa to prevent lint warnings (#127545)"
This reverts commit f9937afd4f87fbb4844642ae2f587b13b5caa08c.

Reverted https://github.com/pytorch/pytorch/pull/127545 on behalf of https://github.com/izaitsevfb due to reverting to unblock the revert of #127545 ([comment](https://github.com/pytorch/pytorch/pull/127545#issuecomment-2143517711))
2024-06-01 17:12:46 +00:00
114c752b14 Revert "Improve MAGMA conditional macro in BatchLinearAlgebra.cpp (#127495)"
This reverts commit ee08cf57924a4230edad3101666890d8fe050c75.

Reverted https://github.com/pytorch/pytorch/pull/127495 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/127495#issuecomment-2143508218))
2024-06-01 16:39:06 +00:00
efcea2d2fd [dynamo] Support __getitem__ on NNModuleVariable __dict__ (#126956)
Moves further along (but still fails) for the testcase in https://github.com/pytorch/pytorch/pull/126875

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126956
Approved by: https://github.com/jansel
ghstack dependencies: #126923
2024-06-01 15:22:45 +00:00
4129c3e596 Let us find out why we wrote foreach meta regs (#127623)
Turns out it was for no reason!...well, after realizing that these ops are all CompositeExplicit, their meta impls come for free.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127623
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #127412
2024-06-01 13:58:18 +00:00
ac60bdaf01 Allow slow foreach to run for any backend, not just CPU (#127412)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127412
Approved by: https://github.com/albanD
2024-06-01 13:58:18 +00:00
4aa7a1efcf [dynamo] Initial exception handling support (#126923)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126923
Approved by: https://github.com/williamwen42, https://github.com/jansel
2024-06-01 13:00:32 +00:00
25994a7ed1 [AOTI] Fix a bug when mutated buffer meets .to (#127671)
Summary: Before this change, the added unit test would trigger: `AssertionError: Can not find the original value for L__self____tensor_constant0_cuda0`. The reason is that GraphLowering.constant_name could rename a constant with a device suffix, but AOTI requires that new name to be registered properly.

Differential Revision: [D58047165](https://our.internmc.facebook.com/intern/diff/D58047165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127671
Approved by: https://github.com/ColinPeppler, https://github.com/22quinn
2024-06-01 12:30:56 +00:00
c3be459f26 [inductor] fix mkldnn linear binary fusion check ut (#127296)
In this PR:

(1) Fix the unary fusion for bf16 conv/linear.
    Previously we registered the same fusion pattern for `bf16` and `fp16`, and we did not check the dtype while matching the pattern. As a result, the `fp16` case matched the `bf16` pattern, but in the later replacement we found an unexpected float16, so we did not fuse them. We fix it by checking dtypes so that the `fp16` case no longer matches the `bf16` pattern.

```
  def _is_valid_computation_unary_fusion(computation_op, lowp_dtype=None):
      def fn(match):
          matched = _is_single_computation_op(computation_op, lowp_dtype)(match) # previously we did not check lowp_dtype here

```

This was not exposed before because we only checked the match count, and the match count was correct anyway because we did match the pattern. To address this, we add a check on the number of `generated_kernel`. If the ops are not fused, there will be an additional kernel to compute the post op.

(2) Previously the UT
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_binary
```
does not check the fusion status; fix it in this PR.

(3) Extend `test_conv_binary` to test with lp.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127296
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-06-01 11:10:29 +00:00
e62925930f Clear dest impl extra_meta_ info when shallow_copy_from src impl to dest impl. (#127616)
`tensorA.data = tensorB` calls the shallow_copy_from function to copy tensorB's metadata and storage to tensorA. If tensorB's extra_meta_ is nullptr, tensorA's old extra_meta_ is still kept in tensorA. This contaminates the new metadata in tensorA.
@ezyang  @bdhirsh
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127616
Approved by: https://github.com/ezyang
2024-06-01 06:54:32 +00:00
554265d450 [Inductor]: Use new device-agnostic libdevice import from triton.language (#127348)
Triton refactored `libdevice` in 5e6952d8c5

While both imports still appear to work under CUDA, this change is required to pull the correct libdevice variants under the Intel XPU backend. I am working on developing a test that catches this behavior. The easiest path would be to enable `test/inductor/test_triton_kernels.py` under the XPU backend, but a different group at Intel manages that test and I need to see if they already have an enabling plan.

I am not sure the double `libdevice` import (see line 22 where I have the nolint flag) is really necessary but have yet to find a conclusive test case.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127348
Approved by: https://github.com/etaf, https://github.com/peterbell10
2024-06-01 06:15:33 +00:00
7ef7c265d4 Ack codecvt_utf8_utf16 as a deprecated func in C++17 (#127659)
https://en.cppreference.com/w/cpp/header/codecvt.  This starts to fail on MacOS after migrating it to MacOS 14 with a newer toolchain.  For example 57baae9c9b.

As there is no clear alternative to the deprecated function yet, I just ack the warning to fix the build and complete the migration https://github.com/pytorch/pytorch/issues/127490
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127659
Approved by: https://github.com/kit1980, https://github.com/atalman
2024-06-01 04:31:39 +00:00
3c1cf03fde Add fake impl for aten.unique_dim (#126561)
Follow-up to #113118 and #124306.

Developed in coordination with the solution to https://github.com/microsoft/onnxscript/pull/1547

This PR adds the missing fake tensor implementation for `aten.unique_dim`, thus enabling tracing and compilation of `torch.unique` when `dim` is not None.

Local testing has proceeded with the following simple script (provided that one has checked out the changes in https://github.com/microsoft/onnxscript/pull/1547):

```python
    import onnx
    import onnxruntime as ort
    import logging
    import numpy as np
    onnx_program = torch.onnx.dynamo_export(
        lambda x: torch.unique(x,
                               dim=0,
                               return_inverse=True),
        torch.arange(10),
        export_options=torch.onnx.ExportOptions(
            dynamic_shapes=True,
            diagnostic_options=torch.onnx.DiagnosticOptions(
                verbosity_level=logging.DEBUG)))
    onnx_program.save("torch_unique.onnx")
    onnx_inputs = onnx_program.adapt_torch_inputs_to_onnx(torch.arange(10))
    onnx_outputs = onnx_program(*onnx_inputs)
    loaded_onnx_program = onnx.load("torch_unique.onnx")
    onnx.checker.check_model(loaded_onnx_program)
    ort_session = ort.InferenceSession("torch_unique.onnx")
    inputs = np.random.randint(0, 10, 10)
    print(f"Inputs: {inputs}")
    outputs = ort_session.run(None,
                              {
                                  "l_x_": inputs
                              })
    print(f"Outputs: {outputs}")
    print("Success")
```

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126561
Approved by: https://github.com/ezyang
2024-06-01 04:03:10 +00:00
25447ba241 Always Link libtorch and libtorch_cpu to ensure the functionality for AOT mode (#127381)
Fix #126763: The root cause is that the produced library does not link any torch library because the vec ISA is invalid, and then it cannot run into another path without linking `libtorch` and `libtorch_cpu`.

https://github.com/pytorch/pytorch/blob/main/torch/_inductor/codecache.py#L1637-L1642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127381
Approved by: https://github.com/desertfire
2024-06-01 01:47:41 +00:00
df53cc7114 [reland] "[reland] _foreach_copy with different src/dst dtypes" (#127186)
Fixes #115171

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127186
Approved by: https://github.com/ezyang
2024-06-01 01:25:10 +00:00
ff8042bcfb Enable AOTI shim v2 build and add into libtorch (#125211)
Summary:
Follow up of https://github.com/pytorch/pytorch/pull/125087

This diff will create shim v2 header and cpp file and corresponding build

Differential Revision: D56617546

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125211
Approved by: https://github.com/desertfire
2024-05-31 23:56:11 +00:00
a8c9b26534 [BE] Fix dependabot security errors (#127567)
Fixes https://github.com/pytorch/pytorch/security/dependabot/36 and https://github.com/pytorch/pytorch/security/dependabot/37 by deleting spurious dependency

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127567
Approved by: https://github.com/malfet
2024-05-31 23:00:07 +00:00
f7171313ab [Inductor] FlexAttention backward kernel optimization (#127208)
BWD Speedups (before this PR):
```
| Type    |   Speedup | shape             | score_mod     | dtype          |
|---------|-----------|-------------------|---------------|----------------|
| Average |     0.211 |                   |               |                |
| Max     |     0.364 | (16, 16, 512, 64) | relative_bias | torch.bfloat16 |
| Min     |     0.044 | (2, 16, 4096, 64) | causal_mask   | torch.bfloat16 |
```
BWD Speedups (after this PR, though not optimizing block size yet):
```
| Type    |   Speedup | shape              | score_mod     | dtype          |
|---------|-----------|--------------------|---------------|----------------|
| Average |     0.484 |                    |               |                |
| Max     |     0.626 | (2, 16, 512, 256)  | head_bias     | torch.bfloat16 |
| Min     |     0.355 | (8, 16, 4096, 128) | relative_bias | torch.bfloat16 |
```

There are a few things to do as follow-ups:
* Optimized default block size on A100/H100.
* Support different seqlen for Q and K/V.
* Support dynamic shapes for backward.
* Enhance unit tests to check there is no ```nan``` value in any grad. I think we should make some changes to ```test_padded_dense_causal``` because it has invalid inputs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127208
Approved by: https://github.com/Chillee
2024-05-31 22:56:10 +00:00
57baae9c9b Migrating CI/CD jobs to macOS 14 (#127582)
We already have half the fleet on macOS 14 and it has been running fine so far https://github.com/pytorch/pytorch/issues/127490.  So, I'm preparing the final push to replace the rest of them.  This also switches the release build from 13 to 14 (GitHub runners)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127582
Approved by: https://github.com/atalman
2024-05-31 22:30:59 +00:00
02248b73eb [EZ] Port over all test-infra scale configs to lf runners (#127645)
Follow up to https://github.com/pytorch/pytorch/pull/127578

Since GPU builds seem to be working correctly, porting over all remaining scale configs from [the org-wide scale config file](https://github.com/pytorch/test-infra/blob/main/.github/scale-config.yml)

The naming convention here is all temporary. We'll figure out something better before completing the migration
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127645
Approved by: https://github.com/malfet
2024-05-31 22:24:41 +00:00
bb1468d506 Updates state dict in state dict loader (#127617)
Fixes #125096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127617
Approved by: https://github.com/Skylion007, https://github.com/fegin
2024-05-31 21:59:10 +00:00
f33beb767d [NestedTensor] Use maybe_mark_dynamic instead of mark_dynamic (#127453)
Fixes #127097

**TL;DR**: dimensions marked with mark_dynamic can result in assertion failures if the marked-dynamic dimensions get specialized. In NJT, we don't care _that_ much that a dimension is marked as dynamic. So instead, mark with `maybe_mark_dynamic` which suggests that a dimension should be dynamic, but doesn't fail if the dimension gets specialized.

**Background**:
NJT marks the values tensor as dynamic:

49ad90349d/torch/nested/_internal/nested_tensor.py (L122)

It does this for two reasons:
1. **Conceptual**: We know that this dimension _should_ be dynamic; it's a nested tensor, so the sequence lengths will _probably_ vary between batches in the common case. Therefore, we should compile it as dynamic to prevent needing a recompile to trigger automatic dynamic shapes.
2. **Implementation detail**: Right now we run into issues with torch.compile / tensor_unflatten / other details when the dimensions are not marked as dynamic. We have some attempts to remove this (e.g. https://github.com/pytorch/pytorch/pull/126563) but while testing this I wasn't able to get all tests to pass, so there could be potential regressions here if we removed the mark_dynamic.

**Justification for this change**

1. **Conceptual**: AFAIK, we don't care enough about the dynamism of this dimension to error out if we specialize. We'd prefer that we don't have to recompile to get automatic dynamic shapes, but it's also better to not have this issue (and not to force the user to go hunt down all the other equivalent shapes to mark them as dynamic as well). This solution allows us to suggest the dynamism but not force it.
2. **Implementation detail**: This still marks the dimension as symbolic at the beginning of dynamo tracing, so we will (probably) avoid a lot of the issues we run into when we completely remove the `mark_dynamic` decorators.
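
The difference between the two markings can be sketched as "hard requirement" versus "hint". This is a toy model, not Dynamo's implementation: `DimSketch`, `SpecializationError`, and both flags are hypothetical names chosen to mirror the behaviour described above.

```python
class SpecializationError(Exception):
    """Raised in this sketch when a hard-dynamic dim is specialized."""


class DimSketch:
    def __init__(self, size: int):
        self.size = size
        self.must_be_dynamic = False   # analogue of mark_dynamic
        self.prefer_dynamic = False    # analogue of maybe_mark_dynamic

    def starts_symbolic(self) -> bool:
        # Both markings ask the tracer to begin with a symbolic size.
        return self.must_be_dynamic or self.prefer_dynamic

    def specialize(self) -> int:
        # Called when tracing discovers the size must be a constant.
        if self.must_be_dynamic:
            raise SpecializationError(
                f"dim was marked dynamic but specialized to {self.size}"
            )
        return self.size


hard = DimSketch(8)
hard.must_be_dynamic = True

soft = DimSketch(8)
soft.prefer_dynamic = True
```

In this model `soft.specialize()` quietly returns the constant, while `hard.specialize()` raises, which is the assertion failure the change avoids.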

Differential Revision: [D57933779](https://our.internmc.facebook.com/intern/diff/D57933779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127453
Approved by: https://github.com/soulitzer, https://github.com/YuqingJ
2024-05-31 21:32:12 +00:00
6bfc6e0875 Add back private function torch.cuda.amp.autocast_mode._cast (#127433)
This is unfortunately used in a few places in the wild: https://github.com/search?q=torch.cuda.amp.autocast_mode._cast&type=code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127433
Approved by: https://github.com/zou3519, https://github.com/guangyey
2024-05-31 20:48:15 +00:00
923edef31c FP8 rowwise scaling (#125204)
# Summary
This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met:
- `x`'s scale should be a 1-dimensional tensor of length `M`.
- `y`'s scale should be a 1-dimensional tensor of length `N`.

It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row".
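
As a reference for the shape conditions above, here is a plain-Python sketch of the arithmetic, assuming the scales are multiplicative dequantization factors applied per row of `x` and per column of `y`. It uses ordinary floats rather than fp8 and is not the kernel itself; `rowwise_scaled_mm` is a hypothetical name.

```python
from typing import List

Matrix = List[List[float]]


def rowwise_scaled_mm(x: Matrix, y: Matrix,
                      x_scale: List[float], y_scale: List[float]) -> Matrix:
    """out[i][j] = x_scale[i] * y_scale[j] * sum_p x[i][p] * y[p][j]."""
    m, k, n = len(x), len(y), len(y[0])
    assert all(len(row) == k for row in x), "x must be [M, K]"
    assert len(x_scale) == m, "x scale must have length M"
    assert len(y_scale) == n, "y scale must have length N"
    out = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            acc = sum(x[i][p] * y[p][j] for p in range(k))
            out[i][j] = acc * x_scale[i] * y_scale[j]
    return out


x = [[1.0, 2.0], [3.0, 4.0]]              # M=2, K=2
y = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]    # K=2, N=3
result = rowwise_scaled_mm(x, y, x_scale=[2.0, 0.5], y_scale=[1.0, 4.0, 0.25])
print(result)
```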

The following two PRs were required to enable local builds:
- [PR #126185](https://github.com/pytorch/pytorch/pull/126185)
- [PR #125523](https://github.com/pytorch/pytorch/pull/125523)

### Todo
We still do not build our Python wheels with this architecture.

@ptrblck @malfet, should we replace `sm_90` with `sm_90a`?

The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit:
https://github.com/pytorch/pytorch/pull/125204/files#r1586986954

#### ifdef

I tried to use `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && \
    defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate the building of the kernel. I had a hard time with this, so I am not sure of the right way to do it.

Kernel Credit:
@jwfromm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125204
Approved by: https://github.com/lw
2024-05-31 20:09:08 +00:00
806e6257f3 Unconditionally assign symbolic shapes as locals (#127486)
Internal xref: https://fb.workplace.com/groups/1405155842844877/posts/8493858177307906

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127486
Approved by: https://github.com/albanD
2024-05-31 20:01:44 +00:00
033e733021 Revert "[BE] wrap deprecated function/class with typing_extensions.deprecated (#126898)"
This reverts commit 749a132fb0a8325cbad4734a563aa459ca611991.

Reverted https://github.com/pytorch/pytorch/pull/126898 on behalf of https://github.com/fbgheith due to switching typing-extensions=4.3.0 to 4.9.0 causes internal failure ([comment](https://github.com/pytorch/pytorch/pull/126898#issuecomment-2142884456))
2024-05-31 19:47:24 +00:00
ea13e9a097 correct BLAS input (#126200)
Fixes #32407

With this small correction to Dependencies.cmake it is possible to build an MKL-free version of PyTorch from version v2.0.0 onward by explicitly choosing another MKL-free BLAS.

This pull request fulfills the "if not already present" part of the original comment in Dependencies.cmake:
"setting default preferred BLAS options if not already present."

It was tested with this GitHub Actions workflow (.yml):
```
name: Build PyTorch v2.0.0 without AVX

on:
  push:
    branches:
      - v2.0.0
  pull_request:
    branches:
      - v2.0.0

jobs:
  build:
    runs-on: ubuntu-20.04
    defaults:
      run:
        shell: bash -el {0}
    steps:

    - name: Checkout repository
      uses: actions/checkout@v4
      with:
        #repository: 'pytorch/pytorch'
        #ref: 'v2.3.0'
        submodules: 'recursive'

    - uses: conda-incubator/setup-miniconda@v3
      with:
        auto-activate-base: true
        activate-environment: true
        python-version: 3.10.13

    - name: Install Dependencies - Common - Linux 2
      run: |
        conda info
        conda list
        conda install nomkl
        conda install astunparse numpy ninja pyyaml setuptools cmake cffi typing_extensions future six requests dataclasses
        export PYTORCH_CPU_CAPABILITY=cpu
        export ATEN_CPU_CAPABILITY_DEFAULT=cpu
        export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
        export ATEN_CPU_CAPABILITY=default
        export USE_NNPACK=0
        export MAX_JOBS=4
        export USE_CUDA=0
        export USE_ROCM=0
        export BLAS=OpenBLAS
        export CMAKE_ARGS="-D CMAKE_BUILD_TYPE=Release -D USE_AVX=OFF -D USE_NNPACK=OFF -D C_HAS_AVX_2=OFF -D C_HAS_AVX2_2=OFF -D CXX_HAS_AVX_2=OFF -D CXX_HAS_AVX2_2=OFF -D CAFFE2_COMPILER_SUPPORTS_AVX512_EXTENSIONS=OFF -DPYTHON_INCLUDE_DIR=$(python -c "import sysconfig; print(sysconfig.get_path('include'))") -DPYTHON_LIBRARY=$(python -c "import sysconfig; print(sysconfig.get_config_var('LIBDIR'))") -DPYTHON_EXECUTABLE:FILEPATH=`which python`"
        pip install build wheel typing_extensions
        python setup.py bdist_wheel
    - name: Archive production artifacts
      uses: actions/upload-artifact@v4
      with:
        name: dist-without-markdown
        path: |
          dist
          !dist/**/*.md
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126200
Approved by: https://github.com/jgong5, https://github.com/kit1980
2024-05-31 19:38:42 +00:00
bbf892dd58 Revert "Add back private function torch.cuda.amp.autocast_mode._cast (#127433)"
This reverts commit 6e0eeecc7cd4dc389683e35d1f2e34738e09e597.

Reverted https://github.com/pytorch/pytorch/pull/127433 on behalf of https://github.com/fbgheith due to depends on https://github.com/pytorch/pytorch/pull/126898 which is failing internally and needs to be reverted ([comment](https://github.com/pytorch/pytorch/pull/127433#issuecomment-2142869610))
2024-05-31 19:35:15 +00:00
1103444870 [AOTI] Add back include_pytorch for specifying link paths (#126802)
Summary: Running the dashboard with the cpp wrapper mode sometimes hits errors like "undefined symbol: aoti_torch_empty_stride", although this cannot be reproduced locally and seems to happen only on the dashboard CI.

Differential Revision: [D57911442](https://our.internmc.facebook.com/intern/diff/D57911442)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126802
Approved by: https://github.com/chenyang78
ghstack dependencies: #126916, #127037
2024-05-31 19:32:52 +00:00
8af1c655e5 improve eager overhead of _disable_dynamo (#127325)
It seems like `_disable_dynamo` actually has a fair amount of overhead (especially when it was added to `DTensor.__new__`). This change speeds up @wanchaol's repro from 0.380s to 0.312s: P1378202570 (that repro runs a vanilla MLP using 2D parallelism, and calls the DTensor constructor 1280 times).

It looks like most of the slowdown comes from repeatedly running `import torch._dynamo` and constructing an instance of `torch._dynamo.disable(fn, recursive)` on every call to the constructor. This PR caches it on the first invocation.
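
The caching idea can be sketched without torch. `expensive_disable` is a hypothetical stand-in for the import plus wrapper construction; the decorator builds the wrapper lazily on the first call and reuses it afterwards. This is a sketch of the technique, not the code in the PR.

```python
import functools

stats = {"built": 0}  # counts how often the expensive wrapper is constructed


def expensive_disable(fn):
    """Hypothetical stand-in for importing and building the disabled wrapper."""
    stats["built"] += 1
    return fn


def disable_with_cache(fn):
    cached = None  # built lazily on the first call, then reused

    @functools.wraps(fn)
    def inner(*args, **kwargs):
        nonlocal cached
        if cached is None:
            cached = expensive_disable(fn)
        return cached(*args, **kwargs)

    return inner


@disable_with_cache
def make_tensor(x):
    return x * 2


# Mirrors the repro's 1280 constructor calls: the wrapper is built only once.
results = [make_tensor(i) for i in range(1280)]
print(stats["built"])
```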

~~Update: I realized I cannot use `torch.compiler.is_compiling` to know when to fast-path, because when we hit a graph break, cpython will be running so it will return False.~~

~~As a test / potential fix, I added a new config, `torch._dynamo.config._is_compiling` that is set to True **always** inside a compiled region (even on frames that are run by cpython). This definitely seems to do what I want in terms of knowing when to fastpath and avoid overhead - although interested in feedback on how reasonable this is~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127325
Approved by: https://github.com/wanchaol, https://github.com/anijain2305
2024-05-31 19:30:47 +00:00
b704c7cf0f Re trying Support min/max carry over for eager mode from_float method (#127576)
Summary:
Original commit changeset: 2605900516c8

Original Phabricator Diff: D57977896

Test Plan: Re enabling due to prod failure

Reviewed By: jerryzh168

Differential Revision: D57978925

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127576
Approved by: https://github.com/jerryzh168
2024-05-31 19:08:07 +00:00
121c55d8d1 Old branch deletion script to also delete old ciflow tags (#127625)
Change branch deletion script to also delete left over ciflow tags that the bot doesn't get to, as well as the one created by triggering a workflow on HUD

Example run https://github.com/pytorch/pytorch/actions/runs/9322082915/job/25662376463?pr=127625
(didn't actually delete the tag, but lists what tags it would delete)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127625
Approved by: https://github.com/huydhn
2024-05-31 18:54:54 +00:00
0be06b08fc [GPT-fast benchmark] Merge GPT-fast and micro benchmark output as one CSV file (#127586)
Consolidate GPT-fast models benchmark with micro-benchmark, and save output as one CSV file with the same format as https://github.com/pytorch/pytorch/pull/126754#issue-2307296847.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127586
Approved by: https://github.com/Chillee
2024-05-31 18:50:49 +00:00
4a0d96e496 Add a GH action to autolabel docathon PRs (#127569)
To ease oncall burden for the docathon PR reviewers and ensure all PRs are correctly labeled, adding this GH action that will look for the issue number in the PR and if that issue has a docathon-h1-2024 label, then it would propagate the labels from the issues into the PR. It should not conflict with the existing labelers because we use ``pull_request.add_to_labels`` - credit @kit1980.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127569
Approved by: https://github.com/kit1980
2024-05-31 17:57:07 +00:00
b2f5fd8efb [ts_converter] Basic support for prim::If conversion (#127336)
Script module:
```
graph(%self : __torch__.M,
      %x.1 : Tensor,
      %y.1 : Tensor):
  %11 : int = prim::Constant[value=1]()
  %5 : bool = aten::Bool(%x.1) # /data/users/angelayi/pytorch2/test/export/test_converter.py:27:19
  %21 : Tensor = prim::If(%5) # /data/users/angelayi/pytorch2/test/export/test_converter.py:27:16
    block0():
      %8 : Tensor = aten::mul(%y.1, %y.1) # /data/users/angelayi/pytorch2/test/export/test_converter.py:28:27
      -> (%8)
    block1():
      %12 : Tensor = aten::add(%y.1, %y.1, %11) # /data/users/angelayi/pytorch2/test/export/test_converter.py:30:27
      -> (%12)
  return (%21)
```
ExportedProgram:
```
ExportedProgram:
    class GraphModule(torch.nn.Module):
        def forward(self, x_1: "b8[]", y_1: "i64[]"):
            # File: <eval_with_key>.23:9 in forward, code: cond = torch.ops.higher_order.cond(l_args_0_, cond_true_0, cond_false_0, [l_args_3_0_]);  l_args_0_ = cond_true_0 = cond_false_0 = l_args_3_0_ = None
            true_graph_0 = self.true_graph_0
            false_graph_0 = self.false_graph_0
            conditional = torch.ops.higher_order.cond(x_1, true_graph_0, false_graph_0, [y_1]);  x_1 = true_graph_0 = false_graph_0 = y_1 = None
            return (conditional,)

        class <lambda>(torch.nn.Module):
            def forward(self, y_1: "i64[]"):
                # File: <eval_with_key>.20:6 in forward, code: mul_tensor = torch.ops.aten.mul.Tensor(l_args_3_0__1, l_args_3_0__1);  l_args_3_0__1 = None
                mul: "i64[]" = torch.ops.aten.mul.Tensor(y_1, y_1);  y_1 = None
                return mul

        class <lambda>(torch.nn.Module):
            def forward(self, y_1: "i64[]"):
                # File: <eval_with_key>.21:6 in forward, code: add_tensor = torch.ops.aten.add.Tensor(l_args_3_0__1, l_args_3_0__1, alpha = 1);  l_args_3_0__1 = None
                add: "i64[]" = torch.ops.aten.add.Tensor(y_1, y_1);  y_1 = None
                return add
```

This PR also adds support for TupleIndex and incorporates some changes from https://github.com/pytorch/pytorch/pull/127341
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127336
Approved by: https://github.com/BoyuanFeng
2024-05-31 17:46:16 +00:00
cyy
3e66052e16 Improve python3 discovery code in CMake (#127600)
The improvement is based on my comments in #124613 and it also fixes the current linux-s390x-binary-manywheel  CI failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127600
Approved by: https://github.com/Skylion007
2024-05-31 17:29:06 +00:00
8d7393cb5e Update triton-xpu commit pin merge rules for XPU (#127203)
Add the ".ci/docker/ci_commit_pins/triton-xpu.txt" to the XPU merge rules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127203
Approved by: https://github.com/atalman
2024-05-31 17:19:19 +00:00
1699edaabb [DeviceMesh] Adding nD slicing support back (#127465)
Fixes #126530

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127465
Approved by: https://github.com/wconstab, https://github.com/wanchaol
2024-05-31 17:06:36 +00:00
8bf2c0a203 [BE][Ez]: Update ruff to 0.4.6 (#127614)
Update ruff linter to 0.4.6. Uneventful PR that fixes bugs and reduces false positives.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127614
Approved by: https://github.com/albanD
2024-05-31 17:01:50 +00:00
58b461d57a Revert "[ROCm] Update triton pin to fix libtanh issue (#125396)"
This reverts commit 19333d1eb9b8965edd6c8a52fd59b5c67b4fb523.

Reverted https://github.com/pytorch/pytorch/pull/125396 on behalf of https://github.com/atalman due to Broke nightly builds ([comment](https://github.com/pytorch/pytorch/pull/125396#issuecomment-2142638237))
2024-05-31 16:51:39 +00:00
225ec08e35 Fix typo in .ci/docker/ubuntu-cuda/Dockerfile (#127503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127503
Approved by: https://github.com/nWEIdia, https://github.com/Skylion007
2024-05-31 16:50:35 +00:00
67f0807042 [Inductor] [CI] [CUDA] Skip the failed models and tests the better way (#127150)
Address subtasks in https://github.com/pytorch/pytorch/issues/126692

After enabling the disabled shards, the following two models regressed (for cu124 configuration):
dynamic_inductor_timm_training.csv
cspdarknet53,pass,7   (expected)                                        | cspdarknet53,fail_accuracy,7           (actual)
eca_botnext26ts_256,pass,7        (expected)                            | eca_botnext26ts_256,fail_accuracy,7 (actual)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127150
Approved by: https://github.com/huydhn, https://github.com/eqy, https://github.com/atalman
2024-05-31 16:35:57 +00:00
64c581a1d4 [DSD] Make distributed state_dict support torch.distributed is not initialized case (#127385)
Fixes https://github.com/pytorch/pytorch/issues/124942

Summary:
Allow DSD to support loading the regular optimizer state_dict, so that it can be used when torch.distributed.is_initialized() is False.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127385
Approved by: https://github.com/wz337
ghstack dependencies: #127070, #127071, #127384
2024-05-31 16:28:16 +00:00
8b4ad3a8d9 [DSD] Unify the API signatures of set_model_state_dict and set_optimizer_state_dict (#127384)
Summary:
Allow the optim_state_dict argument to be a positional argument. This makes sense since it is a required argument, and it makes the function signature consistent with set_model_state_dict without causing BC issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127384
Approved by: https://github.com/wz337
ghstack dependencies: #127070, #127071
2024-05-31 16:24:51 +00:00
bd868eeb28 [DSD] Support flattening the optimizer state_dict when saving and unflattening when loading (#127071)
Fixes https://github.com/pytorch/pytorch/issues/126595

**What does this PR do?**
This PR flattens the optimizer state_dict when saving (and unflattens it when loading), similar to what TorchRec does. The current `get_optimizer_state_dict()` converts the parameter IDs to FQNs in order to avoid any conflict with different optimizers on different ranks. The currently returned optimizer state_dict looks like the following:
```
{
    "state": {
          "layer1.weight": {"step": 10, "exp_avg": SomeTensor, "exp_avg_sq": SomeTensor},
          "layer2.weight": {"step": 10, "exp_avg": SomeTensor, "exp_avg_sq": SomeTensor},
    },
    "param_group": [
         {"lr": 0.0, "betas": (0.9, 0.95), ..., "params": ["layer1.weight", "layer2.weight"]}
    ]
}
```
While this can avoid the conflict and can support the use case of merging multiple optimizers (e.g., optimizer in backward), the current optimizer state_dict still cannot support MPMD (e.g., pipeline parallelism). The root cause is `param_group`. `param_group` cannot generate unique keys during saving: DCP will flatten the dict, but for `param_group`, DCP will get keys like `param_group.lr` or `param_group.params`. These keys will conflict when using pipeline parallelism.

This PR flattens the optimizer state_dict into the following form:
```
{
    "state.layer1.weight.step": 10,
    "state.layer2.weight.step": 10,
    "state.layer1.weight.exp_avg": SomeTensor,
    "state.layer2.weight.exp_avg": SomeTensor,
    "state.layer1.weight.exp_avg_sq": SomeTensor,
    "state.layer2.weight.exp_avg_sq": SomeTensor,
    "param_group.layer1.weight.lr" : 0.1,
    "param_group.layer2.weight.lr" : 0.1,
    "param_group.layer1.weight.betas" : (0.9, 0.95),
    "param_group.layer2.weight.betas" : (0.9, 0.95),
}
```
This allows distributed state_dict (DSD) to support MPMD (e.g., pipeline parallelism).
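
A minimal sketch of the flatten/unflatten round trip, following the key layout shown above. The helper names are hypothetical and this is not the DSD implementation; it assumes values are plain scalars or tuples (so they can be compared with `==`) and that hyperparameter names contain no dots. Note that it regroups parameters purely by equal hyperparameters, so two distinct groups with identical settings would be merged.

```python
from typing import Any, Dict


def flatten_optim_state_dict(osd: Dict[str, Any]) -> Dict[str, Any]:
    flat: Dict[str, Any] = {}
    for fqn, state in osd["state"].items():
        for name, value in state.items():
            flat[f"state.{fqn}.{name}"] = value
    for group in osd["param_group"]:
        for fqn in group["params"]:
            for name, value in group.items():
                if name != "params":
                    # Every parameter carries its own copy of the group values.
                    flat[f"param_group.{fqn}.{name}"] = value
    return flat


def unflatten_optim_state_dict(flat: Dict[str, Any]) -> Dict[str, Any]:
    state: Dict[str, Dict[str, Any]] = {}
    hparams: Dict[str, Dict[str, Any]] = {}
    for key, value in flat.items():
        prefix, rest = key.split(".", 1)
        fqn, name = rest.rsplit(".", 1)  # FQNs may themselves contain dots
        target = state if prefix == "state" else hparams
        target.setdefault(fqn, {})[name] = value
    groups = []
    for fqn, hp in hparams.items():
        for group in groups:
            if {k: v for k, v in group.items() if k != "params"} == hp:
                group["params"].append(fqn)
                break
        else:
            groups.append({**hp, "params": [fqn]})
    return {"state": state, "param_group": groups}


osd = {
    "state": {"layer1.weight": {"step": 10}, "layer2.weight": {"step": 10}},
    "param_group": [
        {"lr": 0.1, "betas": (0.9, 0.95), "params": ["layer1.weight", "layer2.weight"]}
    ],
}
flat = flatten_optim_state_dict(osd)
round_trip = unflatten_optim_state_dict(flat)
```

Because every flat key starts with a parameter FQN after the prefix, keys from different pipeline stages no longer collide the way `param_group.lr` would.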

**Pros and Cons**
*Pros*
1. Can support optimizer resharding (e.g., changing the parallelisms from 3D to 2D or changing the number of workers).
2. Users don't need to manually add prefixes to different optimizers.
3. Allow users to merge the optimizer states easily. One use case is loop-based pipeline parallelism.

*Cons*
1. The implementation makes strong assumptions about the structure of `param_groups` and its values. If those assumptions change or some customized optimizers do not meet them, the implementation will break.
2. There will be extra values saved in the checkpoints. The assumption here is `param_group` generally contains scalars which are cheap to save.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127071
Approved by: https://github.com/wconstab, https://github.com/wz337
ghstack dependencies: #127070
2024-05-31 16:20:36 +00:00
6b1b8d0193 [DSD] Remove the support of Dict[nn.Module, Dict[str, Any]] state_dict (#127070)
Summary:
This is a very complicated signature that is hard for users to reason about. Remove support for this feature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127070
Approved by: https://github.com/wz337
2024-05-31 16:16:05 +00:00
a010fa9e24 [DCP] Fix variable spelling (#127565)
Summary: tsia

Test Plan: sandcastle

Differential Revision: D57983752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127565
Approved by: https://github.com/wz337, https://github.com/fegin
2024-05-31 15:32:08 +00:00
75e7588f47 [Inductor UT] Fix expected failure but pass for test case on Intel GPU. (#127595)
The XPU expected-failure test case `TritonCodeGenTests.test_codegen_config_option_dont_assume_alignment` should have been marked as expected to pass after PR #126261 merged, but because the test was flaky, this case was skipped when landing that PR. The "expected failure but passed" error was then exposed in the periodic test: https://github.com/pytorch/pytorch/actions/runs/9302864965/job/25605549183#step:14:2082.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127595
Approved by: https://github.com/EikanWang, https://github.com/chuanqi129, https://github.com/peterbell10, https://github.com/atalman
2024-05-31 15:32:00 +00:00
4644def434 Update docstring for weights_only (#127575)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127575
Approved by: https://github.com/malfet
2024-05-31 14:27:31 +00:00
cddb8dbebe add workloadd events to pytorch (#127415)
Summary: add workloadd events to pytorch

Test Plan: CIs

Differential Revision: D57914472

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127415
Approved by: https://github.com/sraikund16
2024-05-31 14:25:44 +00:00
10a92b5f84 [AOTI] Fix a bool value codegen issue when calling custom ops (#127398)
Summary: fixes https://github.com/pytorch/pytorch/issues/127392

Differential Revision: [D57911527](https://our.internmc.facebook.com/intern/diff/D57911527)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127398
Approved by: https://github.com/angelayi, https://github.com/chenyang78
ghstack dependencies: #126916, #127037
2024-05-31 14:01:36 +00:00
17c5b6508b [AOTI] Support _CollectiveKernel in the cpp wrapper mode (#127037)
Summary: _CollectiveKernel appears in TorchBench moco training. It's a special Fallback op that requires extra care.

Differential Revision: [D57911441](https://our.internmc.facebook.com/intern/diff/D57911441)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127037
Approved by: https://github.com/malfet
ghstack dependencies: #126916
2024-05-31 13:58:50 +00:00
413b81789f [AOTI][refactor] Unify val_to_arg_str and val_to_cpp_arg_str (#126916)
Summary: Now fallback argument type information has been passed, so time to unify val_to_arg_str and val_to_cpp_arg_str

Differential Revision: [D57907751](https://our.internmc.facebook.com/intern/diff/D57907751)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126916
Approved by: https://github.com/chenyang78
2024-05-31 13:56:11 +00:00
aaef7b29e9 Only register _inductor_test ops if not running with deploy (#127557)
Internal xref: https://fb.workplace.com/groups/1405155842844877/posts/8498194410207616

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127557
Approved by: https://github.com/zou3519
2024-05-31 13:34:23 +00:00
029b3ec775 Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)"
This reverts commit dae33a4961addb5847dbb362e7bb907bbfc64929.

Reverted https://github.com/pytorch/pytorch/pull/126068 on behalf of https://github.com/PaliC due to failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/126068#issuecomment-2141992307))
2024-05-31 12:33:25 +00:00
cyy
a6bae1f6db Remove more caffe2 files (#127511)
Remove more caffe2 files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127511
Approved by: https://github.com/r-barnes
2024-05-31 11:26:27 +00:00
df0c69f32d [inductor] Add fallback for collectives size estimation for unbacked (#127562)
Differential Revision: [D57982928](https://our.internmc.facebook.com/intern/diff/D57982928)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127562
Approved by: https://github.com/yifuwang
2024-05-31 11:15:46 +00:00
f4d7cdc5e6 [dynamo] Add current instruction to BlockStackEntry (#127482)
Will be used by exception handling in later PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127482
Approved by: https://github.com/jansel
2024-05-31 08:58:53 +00:00
2a03bf5a14 [inductor] fix grid z bug for large grid (#127448)
Fixes #123210

2f3d3ddd70/torch/_inductor/runtime/triton_heuristics.py (L1733-L1753)

If a kernel's y_grid is larger than 65535, it will be split into multiple z grids. The grid_fn above does this split before the kernel launch; however, the computations for yoffset and the y_grid are incorrect. For example, if we have xy numel of `(1*XBLOCK, 65537*YBLOCK)`, this function will return an [xyz]_grid of (1, 32768, 2). XBLOCK and YBLOCK here are used by the following `get_grid_dim`. Let's use their default values (4, 1024).

2f3d3ddd70/torch/_inductor/runtime/triton_heuristics.py (L1734)

[xyz]_grid = (1, 32768, 2) means the workload is divided into two z grids. Because the Triton kernel generation still follows the xy dimensions, one example of a generated kernel is shown below.

```python
@triton.jit
def triton_(in_ptr0, out_ptr0, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    ynumel = 65537*1024
    xnumel = 1*4
    yoffset = tl.program_id(1) * (tl.program_id(2) + 1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[None, :]
    ymask = yindex < ynumel
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    x2 = xindex
    y0 = yindex % 128
    y1 = (yindex // 128)
    y3 = yindex
    tmp0 = tl.load(in_ptr0 + (y0 + (128*x2) + (512*y1)), xmask, eviction_policy='evict_last')
    tl.store(out_ptr0 + (x2 + (4*y3)), tmp0, xmask)
```

For a Triton block with xyz index (0, 0, 1), its yoffset and xoffset are both 0 based on the computations `yoffset = tl.program_id(1) * (tl.program_id(2) + 1) * YBLOCK` and `xoffset = tl.program_id(0) * XBLOCK`. So this Triton block will access the very first elements of the input. However, the correct yoffset should be `(y_index + z_index * y_grid) * YBLOCK`, which is the starting position of the 2nd z grid.

At the same time, because we used `y_grid = y_grid // div` to compute the maximum number of grids in the y dimension, the y_grid is 32768. The total number of y grids is 32768*2 = 65536, which is less than the actual 65537. So we should use `y_grid = ceildiv(y_grid, div)` to compute the y grid so that the remaining grids are covered.
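
The two problems described above can be checked with a small pure-Python model of the split. This is only a sketch: `split_y_grid` and `covered_blocks` are hypothetical helpers written for illustration, not the code in `triton_heuristics.py`, and the 65535 limit and the way `div` is derived are assumptions based on the description.

```python
MAX_Y_GRID = 65535  # assumed per-dimension launch limit, taken from the text


def ceildiv(a, b):
    """Integer division rounding up."""
    return -(-a // b)


def split_y_grid(y_blocks, use_ceildiv):
    """Split y_blocks program instances across the y and z grid dimensions."""
    if y_blocks <= MAX_Y_GRID:
        return y_blocks, 1
    div = ceildiv(y_blocks, MAX_Y_GRID)
    y_grid = ceildiv(y_blocks, div) if use_ceildiv else y_blocks // div
    return y_grid, div


def covered_blocks(y_grid, z_grid, y_blocks, fixed_offset):
    """Return the set of y block indices reached by all (y, z) program ids."""
    seen = set()
    for z in range(z_grid):
        for y in range(y_grid):
            # Old formula: y * (z + 1); corrected formula: y + z * y_grid.
            idx = (y + z * y_grid) if fixed_offset else y * (z + 1)
            if idx < y_blocks:  # plays the role of the ymask bound
                seen.add(idx)
    return seen


old_grid = split_y_grid(65537, use_ceildiv=False)  # (32768, 2)
new_grid = split_y_grid(65537, use_ceildiv=True)   # (32769, 2)
old_cover = covered_blocks(*old_grid, 65537, fixed_offset=False)
new_cover = covered_blocks(*new_grid, 65537, fixed_offset=True)
```

In this model the old split and offset leave some y blocks unvisited, while the ceildiv split with the corrected offset reaches all 65537 of them.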

#123210 is not about AOTInductor; the root cause is the Triton kernel generated by TorchInductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127448
Approved by: https://github.com/eellison
2024-05-31 08:01:34 +00:00
4935a019e4 [ONNX] Update decomposition table to core ATen ops (#127353)
Fixes #125894

Prior to this PR, some ATen core ops were missing from the decomposition table because we thought they might be decomposed into prim ops, as they are under _refs. This PR adds them back according to f6ef832e87/torch/_decomp/__init__.py (L253)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127353
Approved by: https://github.com/justinchuby
2024-05-31 06:35:47 +00:00
cyy
0c5faee372 Replace python::python with Python::Module (#127485)
Use found Python::Module target
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127485
Approved by: https://github.com/ezyang
2024-05-31 05:57:05 +00:00
b5e85b8ecc Add deferred_runtime_assertion pass after run_decompositions (#127305)
Summary: We also want to reinsert the deferred_runtime passes after run_decompositions

Test Plan: CI

Reviewed By: zhxchen17

Differential Revision: D57802237

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127305
Approved by: https://github.com/BoyuanFeng
2024-05-31 05:45:28 +00:00
ae47152ca8 Expand supported labels to most self-hosted linux pull.yml workflows (#127578)
Initial set of runners added in https://github.com/pytorch/pytorch/pull/127566 seem to be working.

Expanding to include more machine types, especially GPU machines
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127578
Approved by: https://github.com/huydhn
2024-05-31 05:40:16 +00:00
ec098b88b6 [compiled autograd] torch.compile API (#125880)
- enter existing compiled autograd ctx manager before entering torch.compile frames

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125880
Approved by: https://github.com/jansel
2024-05-31 04:38:20 +00:00
cyy
ee08cf5792 Improve MAGMA conditional macro in BatchLinearAlgebra.cpp (#127495)
Unnecessary TORCH_CHECK(false) are changed to macro coverage as mentioned in #127371
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127495
Approved by: https://github.com/ezyang
2024-05-31 04:27:20 +00:00
159632aecd [dynamo] Support hasattr on BuiltinVariable (#127372)
Fixes https://github.com/pytorch/pytorch/issues/127172

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127372
Approved by: https://github.com/williamwen42, https://github.com/yanboliang
ghstack dependencies: #127377
2024-05-31 04:23:56 +00:00
bb6bfd9ad8 [dynamo][compile-time] Cache the child guard managers (#127377)
Reduces compile time of the MobileBertForMaskedLM model from 39 seconds to 26 seconds. This was a regression introduced by #125202. Before that PR, compile time was 24 seconds. The extra two seconds is just because we are going through an enormous number of guards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127377
Approved by: https://github.com/jansel
2024-05-31 04:23:56 +00:00
f264745ff1 [interformer] batch pointwise op + unbind stack pass in post grad (#126959)
Summary: Tested on H100 with single GPU, and the bs is set to 64.

Test Plan:
# local script

```
buck2 run mode/opt scripts/jackiexu0313/pt2:uniarch_perf_benchmark -- single-module-benchmark --provider interformer --enable_pt2 True --batch_size 64
```

baseline: P1370993922

| Metric             | Value        |
|:-------------------|:-------------|
| Latency            | 120.84 ms    |
| Model size         | 5.93 G bytes |
| Flops/example      | 62.22 GB     |
| TFLOPS             | 32.95        |
| MFU                | 4.12%        |
| Activation/example | 128.17 MB    |

proposal: P1371676068

config
```
torch._inductor.config.pre_grad_fusion_options = {}
torch._inductor.config.post_grad_fusion_options = {
        "batch_aten_mul": {"min_fuse_set_size": 50},
        "batch_aten_sigmoid": {"min_fuse_set_size": 50},
        "batch_aten_relu": {"min_fuse_set_size": 50},
        "batch_linear_post_grad": {"min_fuse_set_size": 50},
        "unbind_stack_aten_pass": {},
}
```

| Metric             | Value        |
|:-------------------|:-------------|
| Latency            | 117.30 ms    |
| Model size         | 5.93 G bytes |
| Flops/example      | 62.65 GB     |
| TFLOPS             | 34.18        |
| MFU                | 4.27%        |
| Activation/example | 163.12 MB    |

Differential Revision: D57595173

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126959
Approved by: https://github.com/jackiexu1992
2024-05-31 03:54:43 +00:00
cyy
8629f9b3f2 Remove more unused variables in tests (#127510)
Follows #127379

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127510
Approved by: https://github.com/Skylion007, https://github.com/r-barnes
2024-05-31 03:39:45 +00:00
0aaac68c57 Add structured logging for tensor fakeification (#126879)
This adds dumps of MetaTensorDesc and MetaStorageDesc to structured logs
when they are triggered from Dynamo.  The logs look like this:

```
V0522 08:13:25.267000 140224882566144 torch/_subclasses/meta_utils.py:195] {"describe_storage": {"id": 0, "describer_id": 0, "size": 32}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}
V0522 08:13:25.267000 140224882566144 torch/_subclasses/meta_utils.py:220] {"describe_tensor": {"id": 0, "ndim": 1, "dtype": "torch.float32", "device": "device(type='cpu')", "size": [8], "is_leaf": true, "stride": [1], "storage": 0, "view_func": "<built-in method _view_func_unsafe of Tensor object at 0x7f882959e840>", "describer_id": 0}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}
V0522 08:13:25.268000 140224882566144 torch/_subclasses/meta_utils.py:1594] {"describe_source": {"describer_id": 0, "id": 0, "source": "L['x']"}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}
```

The `describer_id` is used to disambiguate ids.  We expect it to be
unique per frame id, but if there is a bug it possibly is not.  Note you will get
redundant dumps when evaluation restarts.

tlparse can use this to give a visualization of input tensors to a
model; you could also use this to generate example inputs to run graphs
on.
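
As a rough illustration of consuming these logs, the sketch below splits each line at the first `] ` and parses the rest as JSON. The helper names `parse_structured_line` and `index_by_kind` are hypothetical, and the assumption that the payload always follows the first `] ` is based only on the sample lines above; this is not how tlparse is implemented.

```python
import json


def parse_structured_line(line):
    """Split a structured log line into (prefix, payload dict)."""
    prefix, _, payload = line.partition("] ")
    return prefix, json.loads(payload)


def index_by_kind(lines):
    """Group the describe_* payloads by their kind."""
    out = {}
    for line in lines:
        _, payload = parse_structured_line(line)
        for key, value in payload.items():
            if key.startswith("describe_"):
                out.setdefault(key, []).append(value)
    return out


# Two of the sample lines from the log excerpt above.
SAMPLE = [
    """V0522 08:13:25.267000 140224882566144 torch/_subclasses/meta_utils.py:195] {"describe_storage": {"id": 0, "describer_id": 0, "size": 32}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}""",
    """V0522 08:13:25.268000 140224882566144 torch/_subclasses/meta_utils.py:1594] {"describe_source": {"describer_id": 0, "id": 0, "source": "L['x']"}, "frame_id": 0, "frame_compile_id": 0, "attempt": 0}""",
]

index = index_by_kind(SAMPLE)
```

Here `index["describe_storage"]` holds the storage descriptions and `index["describe_source"]` maps them back to Dynamo sources such as `L['x']`.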

Some care is taken to avoid redumping the tensor metadata multiple
times, which would happen ordinarily because AOTAutograd refakifies
everything after Dynamo, to deal with metadata mutation.

Partially fixes https://github.com/pytorch/pytorch/issues/126644

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126879
Approved by: https://github.com/jamesjwu
2024-05-31 01:58:44 +00:00
b1792a622d [pipelining] handle param aliasing (#127471)
Adds support for parameter aliasing in pipelining. Does this by reading the state_dict, and creating a map of id -> valid tensor FQNs (to be used in _sink_params). Assigns additional FQN attributes that may be used, runs _sink_params(), and then deletes unused attributes. Shares some similarity with how export's unflattener does it.
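
The id -> FQN map can be pictured with a minimal sketch. Plain Python objects stand in for tensors here, and `alias_map` and `aliased_groups` are hypothetical names for illustration only; the real change works on the module's state_dict and `_sink_params`.

```python
from collections import defaultdict


def alias_map(state):
    """Map id(value) -> list of fully qualified names bound to that object."""
    by_id = defaultdict(list)
    for fqn, value in state.items():
        by_id[id(value)].append(fqn)
    return dict(by_id)


def aliased_groups(state):
    """Return only the groups of FQNs that share one underlying object."""
    return [names for names in alias_map(state).values() if len(names) > 1]


# Stand-in for a state_dict in which two names are tied to one parameter.
shared = object()
state = {
    "embed.weight": shared,
    "lm_head.weight": shared,
    "mlp.bias": object(),
}

groups = aliased_groups(state)  # [["embed.weight", "lm_head.weight"]]
```

Any FQN in a group is a valid name for the same tensor, which is what lets a later pass pick whichever one is reachable from a given submodule.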

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127471
Approved by: https://github.com/kwen2501
2024-05-31 01:52:57 +00:00
d535de1747 [inductor] remove reordering_reindex (#127367)
This fixes the loop ordering issue for avg_pool2d here (https://github.com/pytorch/pytorch/issues/126255#issuecomment-2117931529).

The reason we can not fuse the 2 kernels for avg_pool2d is due to ComputedBuffer.iter_reordering_reindex. Take a simpler example:

```
        def f(x, y):
            """
            Add a matmul since inductor may force layout for output.
            """
            return (x.sum(dim=-1) + 1) @ y

        # Make the first 2 dimension not able to merge on purpose so that
        # ComputedBuffer.iter_reordering_reindex will be updated.
        x = rand_strided([20, 20, 30], [30, 900, 1], device="cuda")
        y = torch.randn(20, 20)
```

Suppose x.sum is stored to x2. The computed buffer for x2 will remember that we have reordered its first and second dimensions (i.e. loop order [1, 0]). Later, when we decide the loop order for x2 while computing 'x2 + 1', we pick loop order [1, 0] according to the stride analysis. We then use the saved ComputedBuffer.iter_reordering_reindex to further reorder the loops. The net effect is that we use loop order [0, 1], which prevents the pointwise kernel from fusing with the reduction kernel.
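
The cancellation can be seen with a tiny sketch that treats a loop order as a permutation. `reorder` is a hypothetical helper, not an inductor API; it only shows that applying [1, 0] twice gives back the original order.

```python
def reorder(dims, order):
    """Apply a loop order (a permutation of indices) to a list of dims."""
    return [dims[i] for i in order]


dims = ["d0", "d1"]
once = reorder(dims, [1, 0])   # order chosen by the stride analysis
twice = reorder(once, [1, 0])  # saved reindex applied on top of it
```

After the second reorder the loops are back to `["d0", "d1"]`, i.e. an effective loop order of [0, 1].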

I feel that we don't need ComputedBuffer.iter_reordering_reindex, and the test results show that removing it has a neutral impact on the dashboard [link](https://hud.pytorch.org/benchmark/compilers?startTime=Wed%2C%2022%20May%202024%2017%3A30%3A29%20GMT&stopTime=Wed%2C%2029%20May%202024%2017%3A30%3A29%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/shunting314/153/head&lCommit=195f42cf1a414d2d1a0422b8a081a85ff52b7d20&rBranch=main&rCommit=d6e3e89804c4063827ea21ffcd3d865e5fe365d9)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127367
Approved by: https://github.com/jansel
2024-05-31 01:36:43 +00:00
7646825c3e Revert "distributed debug handlers (#126601)"
This reverts commit 3d541835d509910fceca00fc5a916e9718c391d8.

Reverted https://github.com/pytorch/pytorch/pull/126601 on behalf of https://github.com/PaliC due to breaking internal typechecking tests ([comment](https://github.com/pytorch/pytorch/pull/126601#issuecomment-2141076987))
2024-05-31 01:21:24 +00:00
cyy
d44daebdbc [Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2024-05-31 01:20:45 +00:00
da9fb670d2 Nadam support the flag for "maximize" (#127214)
Fixes https://github.com/pytorch/pytorch/issues/126642

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127214
Approved by: https://github.com/janeyx99
2024-05-31 01:11:16 +00:00
f6e303fa47 Revert "[DeviceMesh] Adding nD slicing support back (#127465)"
This reverts commit e72232f8f032b970b74da18200678b3a4617bf95.

Reverted https://github.com/pytorch/pytorch/pull/127465 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lint e72232f8f0, the error does not look like a trivial fix, so I revert the change for a forward fix ([comment](https://github.com/pytorch/pytorch/pull/127465#issuecomment-2141051630))
2024-05-31 00:43:13 +00:00
af5ed05416 Include triton in py3.12 binaries (#127547)
Additional Builder PR: https://github.com/pytorch/builder/pull/1846/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127547
Approved by: https://github.com/williamwen42
2024-05-31 00:30:10 +00:00
fc73d07e5e [c10d] Decorate methods in NCCLUtils.hpp with TORCH_API (#127550)
Summary:
User-defined PyTorch modules that use `C10D_NCCL_CHECK` run into undefined symbol errors
when loaded by `torch.library.load()`, because the symbols have not been exported.  This change
exports the symbols needed to resolve those runtime errors.

Test Plan: PyTorch CI

Differential Revision: D57977944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127550
Approved by: https://github.com/Skylion007
2024-05-31 00:17:25 +00:00
a2bff4dc8c Fix lint (#127584)
Trivial fix after https://github.com/pytorch/pytorch/pull/124678

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127584
Approved by: https://github.com/huydhn
2024-05-31 00:00:11 +00:00
e72232f8f0 [DeviceMesh] Adding nD slicing support back (#127465)
Fixes #126530

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127465
Approved by: https://github.com/wconstab
2024-05-30 23:55:21 +00:00
214dd44608 [c10d] add Work's numel to logger for debugging purposes (#127468)
Summary:
We have seen some cases where all ranks call into a collective but it gets
stuck, probably due to incorrect tensor sizes. This adds the size
info to the logging for debugging.

Also, taking this chance to consolidate all logger-related status
metrics into one struct.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127468
Approved by: https://github.com/wconstab
2024-05-30 23:32:33 +00:00
620ec081ec Extract inner loops into separate function for ARM64 fp16_dot_with_fp32_arith (#127476)
Summary: Preparing to generalize to bf16. (This should not be committed unless the following bf16 PR is committed!)

Test Plan: Spot-checked llm_experiments benchmark result to make sure it didn't regress.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127476
Approved by: https://github.com/malfet
ghstack dependencies: #127435, #127451
2024-05-30 23:28:17 +00:00
603bde1de3 Use efficient ARM fp16 dot product for gemm_transa_ general case (#127451)
Summary: This doesn't change the overall gemm algorithm away from repeated dot products, just uses our efficient fp16 dot product developed for the gemv case. It seems to improve performance for every prompt length I tested.
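
To make "repeated dot products" concrete, here is a minimal pure-Python sketch of a transposed-A GEMM over column-major buffers. It is an illustration under stated assumptions, not the ARM kernel: `dot` and `gemm_transa` are hypothetical names, the column-major layout is assumed, and plain Python floats are used, so no fp16 or fp32 precision behaviour is modelled.

```python
def dot(x, y):
    """Plain dot product of two equal-length sequences."""
    acc = 0.0
    for a, b in zip(x, y):
        acc += a * b
    return acc


def gemm_transa(m, n, k, a, lda, b, ldb):
    """Compute C = A^T @ B for column-major A (k x m) and B (k x n).

    Returns C as a flat column-major list with leading dimension m.
    """
    c = [0.0] * (m * n)
    for j in range(n):
        b_col = b[j * ldb : j * ldb + k]      # column j of B is contiguous
        for i in range(m):
            a_col = a[i * lda : i * lda + k]  # column i of A is contiguous
            c[j * m + i] = dot(a_col, b_col)  # one dot product per output
    return c


# A has columns (1, 2) and (3, 4); B is the single column (5, 6).
c = gemm_transa(2, 1, 2, [1.0, 2.0, 3.0, 4.0], 2, [5.0, 6.0], 2)
```

Because both operands of every dot product are contiguous, a faster `dot` speeds up the whole loop nest without changing its structure.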

Test Plan: Use https://github.com/malfet/llm_experiments/blob/main/benchmarks/benchmark_torch_mm.py , edited to test only the trans_b (really gemm_transa_) case for the sizes outlined in the output.

Before:
```
Matrix-vector:
m=8, n=128, k=1
====================
trans_b  torch.float32    1.05 usec
trans_b  torch.float16    0.97 usec
trans_b torch.bfloat16    1.06 usec
m=128, n=8, k=1
====================
trans_b  torch.float32    0.80 usec
trans_b  torch.float16    0.97 usec
trans_b torch.bfloat16    1.00 usec
m=4096, n=4096, k=1
====================
trans_b  torch.float32 2160.75 usec
trans_b  torch.float16  659.77 usec
trans_b torch.bfloat16 3800.13 usec
m=11008, n=4096, k=1
====================
trans_b  torch.float32 6343.68 usec
trans_b  torch.float16 1789.42 usec
trans_b torch.bfloat16 10098.34 usec
m=4096, n=11008, k=1
====================
trans_b  torch.float32 6217.20 usec
trans_b  torch.float16 1874.47 usec
trans_b torch.bfloat16 10490.30 usec
m=32000, n=4096, k=1
====================
trans_b  torch.float32 17934.45 usec
trans_b  torch.float16 5323.81 usec
trans_b torch.bfloat16 29320.80 usec

Matrix-matrix (prompt len 4:
m=8, n=128, k=4
====================
trans_b  torch.float32    2.40 usec
trans_b  torch.float16    1.22 usec
trans_b torch.bfloat16    1.22 usec
m=128, n=8, k=4
====================
trans_b  torch.float32    1.52 usec
trans_b  torch.float16    1.33 usec
trans_b torch.bfloat16    1.77 usec
m=4096, n=4096, k=4
====================
trans_b  torch.float32 4317.09 usec
trans_b  torch.float16 15541.04 usec
trans_b torch.bfloat16 15032.29 usec
m=11008, n=4096, k=4
====================
trans_b  torch.float32 6191.19 usec
trans_b  torch.float16 40436.29 usec
trans_b torch.bfloat16 40626.93 usec
m=4096, n=11008, k=4
====================
trans_b  torch.float32 6049.22 usec
trans_b  torch.float16 42367.16 usec
trans_b torch.bfloat16 42482.43 usec
m=32000, n=4096, k=4
====================
trans_b  torch.float32 17611.36 usec
trans_b  torch.float16 117368.54 usec
trans_b torch.bfloat16 116958.85 usec

Matrix-matrix (prompt len 8:
m=8, n=128, k=8
====================
trans_b  torch.float32    1.04 usec
trans_b  torch.float16    1.71 usec
trans_b torch.bfloat16    1.74 usec
m=128, n=8, k=8
====================
trans_b  torch.float32    2.10 usec
trans_b  torch.float16    2.01 usec
trans_b torch.bfloat16    2.91 usec
m=4096, n=4096, k=8
====================
trans_b  torch.float32 2456.23 usec
trans_b  torch.float16 30112.76 usec
trans_b torch.bfloat16 29941.58 usec
m=11008, n=4096, k=8
====================
trans_b  torch.float32 6236.12 usec
trans_b  torch.float16 80361.22 usec
trans_b torch.bfloat16 80466.64 usec
m=4096, n=11008, k=8
====================
trans_b  torch.float32 6236.10 usec
trans_b  torch.float16 82990.74 usec
trans_b torch.bfloat16 83899.80 usec
m=32000, n=4096, k=8
====================
trans_b  torch.float32 17606.43 usec
trans_b  torch.float16 234397.38 usec
trans_b torch.bfloat16 237057.29 usec

Matrix-matrix (prompt len 16:
m=8, n=128, k=16
====================
trans_b  torch.float32    1.31 usec
trans_b  torch.float16    2.67 usec
trans_b torch.bfloat16    2.72 usec
m=128, n=8, k=16
====================
trans_b  torch.float32    1.66 usec
trans_b  torch.float16    3.36 usec
trans_b torch.bfloat16    5.18 usec
m=4096, n=4096, k=16
====================
trans_b  torch.float32 2504.24 usec
trans_b  torch.float16 60896.53 usec
trans_b torch.bfloat16 59852.49 usec
m=11008, n=4096, k=16
====================
trans_b  torch.float32 6407.11 usec
trans_b  torch.float16 163294.92 usec
trans_b torch.bfloat16 161199.10 usec
m=4096, n=11008, k=16
====================
trans_b  torch.float32 6132.30 usec
trans_b  torch.float16 167244.77 usec
trans_b torch.bfloat16 170064.35 usec
m=32000, n=4096, k=16
====================
trans_b  torch.float32 17635.56 usec
trans_b  torch.float16 475020.00 usec
trans_b torch.bfloat16 476332.29 usec

Matrix-matrix (prompt len 32:
m=8, n=128, k=32
====================
trans_b  torch.float32    1.40 usec
trans_b  torch.float16    4.67 usec
trans_b torch.bfloat16    4.80 usec
m=128, n=8, k=32
====================
trans_b  torch.float32    1.24 usec
trans_b  torch.float16    6.10 usec
trans_b torch.bfloat16   10.03 usec
m=4096, n=4096, k=32
====================
trans_b  torch.float32 2660.63 usec
trans_b  torch.float16 122436.04 usec
trans_b torch.bfloat16 121687.96 usec
m=11008, n=4096, k=32
====================
trans_b  torch.float32 6405.60 usec
trans_b  torch.float16 324708.42 usec
trans_b torch.bfloat16 324866.67 usec
m=4096, n=11008, k=32
====================
trans_b  torch.float32 6566.74 usec
trans_b  torch.float16 330801.04 usec
trans_b torch.bfloat16 332561.79 usec
m=32000, n=4096, k=32
====================
trans_b  torch.float32 18610.84 usec
trans_b  torch.float16 944578.75 usec
trans_b torch.bfloat16 940674.33 usec

Matrix-matrix (prompt len 128:
m=8, n=128, k=128
====================
trans_b  torch.float32    2.48 usec
trans_b  torch.float16   16.43 usec
trans_b torch.bfloat16   17.11 usec
m=128, n=8, k=128
====================
trans_b  torch.float32    1.83 usec
trans_b  torch.float16   22.31 usec
trans_b torch.bfloat16   37.00 usec
m=4096, n=4096, k=128
====================
trans_b  torch.float32 4806.59 usec
trans_b  torch.float16 485338.83 usec
trans_b torch.bfloat16 478835.08 usec
m=11008, n=4096, k=128
====================
trans_b  torch.float32 12109.51 usec
trans_b  torch.float16 1300928.58 usec
trans_b torch.bfloat16 1293181.63 usec
m=4096, n=11008, k=128
====================
trans_b  torch.float32 11223.70 usec
trans_b  torch.float16 1326119.92 usec
trans_b torch.bfloat16 1330395.12 usec
m=32000, n=4096, k=128
====================
trans_b  torch.float32 33485.34 usec
trans_b  torch.float16 3869227.17 usec
trans_b torch.bfloat16 3792905.00 usec
```

After:
```
Matrix-vector:
m=8, n=128, k=1
====================
trans_b  torch.float32    0.75 usec
trans_b  torch.float16    0.71 usec
trans_b torch.bfloat16    0.81 usec
m=128, n=8, k=1
====================
trans_b  torch.float32    0.75 usec
trans_b  torch.float16    0.93 usec
trans_b torch.bfloat16    0.98 usec
m=4096, n=4096, k=1
====================
trans_b  torch.float32 2194.31 usec
trans_b  torch.float16  661.27 usec
trans_b torch.bfloat16 3758.42 usec
m=11008, n=4096, k=1
====================
trans_b  torch.float32 5792.04 usec
trans_b  torch.float16 1789.98 usec
trans_b torch.bfloat16 10120.67 usec
m=4096, n=11008, k=1
====================
trans_b  torch.float32 6101.22 usec
trans_b  torch.float16 1927.34 usec
trans_b torch.bfloat16 10469.47 usec
m=32000, n=4096, k=1
====================
trans_b  torch.float32 18353.20 usec
trans_b  torch.float16 5161.06 usec
trans_b torch.bfloat16 29601.69 usec

Matrix-matrix (prompt len 4:
m=8, n=128, k=4
====================
trans_b  torch.float32    2.14 usec
trans_b  torch.float16    0.85 usec
trans_b torch.bfloat16    1.19 usec
m=128, n=8, k=4
====================
trans_b  torch.float32    1.47 usec
trans_b  torch.float16    1.85 usec
trans_b torch.bfloat16    1.75 usec
m=4096, n=4096, k=4
====================
trans_b  torch.float32 4416.40 usec
trans_b  torch.float16 2688.36 usec
trans_b torch.bfloat16 14987.33 usec
m=11008, n=4096, k=4
====================
trans_b  torch.float32 6140.24 usec
trans_b  torch.float16 7467.26 usec
trans_b torch.bfloat16 40295.52 usec
m=4096, n=11008, k=4
====================
trans_b  torch.float32 6143.10 usec
trans_b  torch.float16 7298.04 usec
trans_b torch.bfloat16 41393.43 usec
m=32000, n=4096, k=4
====================
trans_b  torch.float32 17650.72 usec
trans_b  torch.float16 21346.63 usec
trans_b torch.bfloat16 116849.98 usec

Matrix-matrix (prompt len 8:
m=8, n=128, k=8
====================
trans_b  torch.float32    1.05 usec
trans_b  torch.float16    1.03 usec
trans_b torch.bfloat16    1.69 usec
m=128, n=8, k=8
====================
trans_b  torch.float32    2.05 usec
trans_b  torch.float16    3.08 usec
trans_b torch.bfloat16    2.95 usec
m=4096, n=4096, k=8
====================
trans_b  torch.float32 2323.99 usec
trans_b  torch.float16 5265.45 usec
trans_b torch.bfloat16 29942.40 usec
m=11008, n=4096, k=8
====================
trans_b  torch.float32 6202.01 usec
trans_b  torch.float16 14677.90 usec
trans_b torch.bfloat16 80625.18 usec
m=4096, n=11008, k=8
====================
trans_b  torch.float32 6112.05 usec
trans_b  torch.float16 14340.52 usec
trans_b torch.bfloat16 82799.99 usec
m=32000, n=4096, k=8
====================
trans_b  torch.float32 17650.65 usec
trans_b  torch.float16 42551.43 usec
trans_b torch.bfloat16 236081.08 usec

Matrix-matrix (prompt len 16:
m=8, n=128, k=16
====================
trans_b  torch.float32    1.26 usec
trans_b  torch.float16    1.34 usec
trans_b torch.bfloat16    2.69 usec
m=128, n=8, k=16
====================
trans_b  torch.float32    1.60 usec
trans_b  torch.float16    5.81 usec
trans_b torch.bfloat16    5.34 usec
m=4096, n=4096, k=16
====================
trans_b  torch.float32 2328.05 usec
trans_b  torch.float16 10526.58 usec
trans_b torch.bfloat16 60028.28 usec
m=11008, n=4096, k=16
====================
trans_b  torch.float32 6243.35 usec
trans_b  torch.float16 28505.08 usec
trans_b torch.bfloat16 163670.15 usec
m=4096, n=11008, k=16
====================
trans_b  torch.float32 5870.11 usec
trans_b  torch.float16 28597.89 usec
trans_b torch.bfloat16 165404.88 usec
m=32000, n=4096, k=16
====================
trans_b  torch.float32 17746.27 usec
trans_b  torch.float16 83393.87 usec
trans_b torch.bfloat16 472313.13 usec

Matrix-matrix (prompt len 32:
m=8, n=128, k=32
====================
trans_b  torch.float32    1.35 usec
trans_b  torch.float16    2.01 usec
trans_b torch.bfloat16    4.68 usec
m=128, n=8, k=32
====================
trans_b  torch.float32    1.19 usec
trans_b  torch.float16   10.98 usec
trans_b torch.bfloat16   10.13 usec
m=4096, n=4096, k=32
====================
trans_b  torch.float32 2525.29 usec
trans_b  torch.float16 23106.71 usec
trans_b torch.bfloat16 122987.04 usec
m=11008, n=4096, k=32
====================
trans_b  torch.float32 6131.34 usec
trans_b  torch.float16 57537.41 usec
trans_b torch.bfloat16 327825.00 usec
m=4096, n=11008, k=32
====================
trans_b  torch.float32 6395.01 usec
trans_b  torch.float16 57456.33 usec
trans_b torch.bfloat16 331325.58 usec
m=32000, n=4096, k=32
====================
trans_b  torch.float32 19078.68 usec
trans_b  torch.float16 167735.08 usec
trans_b torch.bfloat16 975736.88 usec

Matrix-matrix (prompt len 128:
m=8, n=128, k=128
====================
trans_b  torch.float32    2.40 usec
trans_b  torch.float16    6.07 usec
trans_b torch.bfloat16   16.83 usec
m=128, n=8, k=128
====================
trans_b  torch.float32    1.78 usec
trans_b  torch.float16   40.35 usec
trans_b torch.bfloat16   37.21 usec
m=4096, n=4096, k=128
====================
trans_b  torch.float32 4827.60 usec
trans_b  torch.float16 84341.24 usec
trans_b torch.bfloat16 478917.75 usec
m=11008, n=4096, k=128
====================
trans_b  torch.float32 11879.96 usec
trans_b  torch.float16 226484.33 usec
trans_b torch.bfloat16 1289465.50 usec
m=4096, n=11008, k=128
====================
trans_b  torch.float32 10707.75 usec
trans_b  torch.float16 229200.58 usec
trans_b torch.bfloat16 1327416.67 usec
m=32000, n=4096, k=128
====================
trans_b  torch.float32 33306.32 usec
trans_b  torch.float16 662898.21 usec
trans_b torch.bfloat16 3815866.63 usec
```

torch.float16 performance seems to be improved for all except the
m=128, n=8, k=128 case, where it is roughly neutral. This case
motivated the addition of the "first-tier tail fixup" in the dot
kernel.
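
For a quick read of the larger torch.float16 cases, the before/after ratio can be computed from rows copied out of the tables above. This is just a convenience sketch; `speedup` is a hypothetical helper and only three rows are included.

```python
# (m, n, k) -> (before usec, after usec) for torch.float16, copied from the tables above
FP16_ROWS = {
    (4096, 4096, 4): (15541.04, 2688.36),
    (4096, 4096, 128): (485338.83, 84341.24),
    (32000, 4096, 128): (3869227.17, 662898.21),
}


def speedup(before, after):
    """Ratio of old to new time; values above 1.0 mean the new code is faster."""
    return before / after


ratios = {shape: speedup(*times) for shape, times in FP16_ROWS.items()}
for shape, ratio in ratios.items():
    print(shape, f"{ratio:.2f}x")
```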

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127451
Approved by: https://github.com/malfet
ghstack dependencies: #127435
2024-05-30 23:28:17 +00:00
74b89b9283 Extract dot-product functions from fp16_gemv_trans gemv kernels (#127435)
Summary: Refactoring step before we attempt to use these to implement a less bad fp16 GEMM.

Test Plan: Existing tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127435
Approved by: https://github.com/malfet
2024-05-30 23:28:17 +00:00
a3c00e4331 [Easy] Move V.fake_mode inside of replace_by_example (#127494)
Was writing docs and saw that we always have this duplicated usage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127494
Approved by: https://github.com/shunting314, https://github.com/aorenste
2024-05-30 23:23:42 +00:00
f9a1bc2c65 [FSDP] Remove _sync_module_states (#124678)
Remove this unused API

Differential Revision: [D56445639](https://our.internmc.facebook.com/intern/diff/D56445639/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124678
Approved by: https://github.com/awgu
2024-05-30 23:02:09 +00:00
029af29e6d support operator.index function (#127440)
Fix https://github.com/pytorch/pytorch/issues/127426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127440
Approved by: https://github.com/mlazos
ghstack dependencies: #126444, #127146, #127424
2024-05-30 22:44:18 +00:00
3b88c27c46 Mark DynamicShapesExportTests::test_retracibility_dict_container_inp_out as slow (#127558)
Same as https://github.com/pytorch/pytorch/pull/117896, another slowpoke `DynamicShapesExportTests::test_retracibility_dict_container_inp_out` showed up recently on macOS.  For example, https://ossci-raw-job-status.s3.amazonaws.com/log/25585713394

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127558
Approved by: https://github.com/clee2000
2024-05-30 22:40:48 +00:00
e02971fcfb Revert "Enable UFMT on test_shape_ops.py test_show_pickle.py test_sort_and_select.py (#127165)"
This reverts commit a288b95d4e5ceed327c5bdb9696331aa87688d60.

Reverted https://github.com/pytorch/pytorch/pull/127165 on behalf of https://github.com/atalman due to lint is failing ([comment](https://github.com/pytorch/pytorch/pull/127165#issuecomment-2140930658))
2024-05-30 22:06:46 +00:00
4ee003abdf [inductor] Repeat should not return a view (#127533)
Fixes #127474

`as_strided` unwraps views and looks at the underlying storage, so it isn't
legal to lower `repeat`, which should return a new storage, into a view.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127533
Approved by: https://github.com/lezcano
2024-05-30 21:38:59 +00:00
a288b95d4e Enable UFMT on test_shape_ops.py test_show_pickle.py test_sort_and_select.py (#127165)
Fixes some files in #123062

Run lintrunner on files:
test_shape_ops.py
test_show_pickle.py
test_sort_and_select.py

```bash
$ lintrunner --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127165
Approved by: https://github.com/ezyang
2024-05-30 21:34:16 +00:00
f471482eb2 Try to include NCCL related header file with macro USE_C10D_NCCL (#127501)
Try to include NCCL related header file with macro USE_C10D_NCCL, so that third-party device compilation will not be interrupted.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127501
Approved by: https://github.com/ezyang
2024-05-30 21:33:41 +00:00
6849b80411 Add ninja as dev dependency (#127380)
`ninja` is required to build C++ extensions in tests.

```pytb
ERROR: test_autograd_cpp_node (__main__.TestCompiledAutograd)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/PanXuehai/Projects/pytorch/torch/testing/_internal/common_utils.py", line 2741, in wrapper
    method(*args, **kwargs)
  File "test/inductor/test_compiled_autograd.py", line 1061, in test_autograd_cpp_node
    module = torch.utils.cpp_extension.load_inline(
  File "/home/PanXuehai/Projects/pytorch/torch/utils/cpp_extension.py", line 1643, in load_inline
    return _jit_compile(
  File "/home/PanXuehai/Projects/pytorch/torch/utils/cpp_extension.py", line 1718, in _jit_compile
    _write_ninja_file_and_build_library(
  File "/home/PanXuehai/Projects/pytorch/torch/utils/cpp_extension.py", line 1800, in _write_ninja_file_and_build_library
    verify_ninja_availability()
  File "/home/PanXuehai/Projects/pytorch/torch/utils/cpp_extension.py", line 1849, in verify_ninja_availability
    raise RuntimeError("Ninja is required to load C++ extensions")
RuntimeError: Ninja is required to load C++ extensions

To execute this test, run the following from the base repo dir:
     python test/inductor/test_compiled_autograd.py -k TestCompiledAutograd.test_autograd_cpp_node
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127380
Approved by: https://github.com/ezyang
2024-05-30 21:22:42 +00:00
094183dba6 [torchbench][pt2] Enable Huggingface and Timm models for internal buck runner (#127460)
Summary: Add huggingface and timm model runs to the internal pt2 benchmark runner.

Test Plan:
Testing huggingface model:

```
$ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only BlenderbotSmallForCausalLM --performance --training --device=cuda --amp

 33/ 33 +0 frames   2s 13 graphs 13 graph calls    0/ -12 =   0% ops   0% time
```

Testing timm model:

```
$ buck2 run mode/opt //pytorch/benchmark:pt2 -- --only coat_lite_mini --performance --training --device=cuda --amp

loading model: 0it [00:11, ?it/s]
cuda train coat_lite_mini
  8/  8 +0 frames   4s  2 graphs  2 graph calls    0/  -1 =   0% ops   0% time
```

Differential Revision: D57930582

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127460
Approved by: https://github.com/HDCharles, https://github.com/huydhn
2024-05-30 21:18:28 +00:00
cyy
bf2f5e70dd Fix warnings in SmallVector (#127250)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127250
Approved by: https://github.com/ezyang
2024-05-30 21:13:20 +00:00
ad1b18ab2f Add repo-specific scale config files (#127566)
Part of moving pytorch/pytorch CI infra to an AWS account run by the Linux Foundation.

For self-hosted runners that can run jobs from just a single repo, the runner scalers expect the scale-config files to be stored in the repo itself.

These scale-config files define how the Linux Foundation's self-hosted runners are configured. They apply to runners that are only available to the pytorch/pytorch and pytorch/pytorch-canary repos.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127566
Approved by: https://github.com/zxiiro, https://github.com/huydhn, https://github.com/atalman
2024-05-30 21:08:45 +00:00
846f79e61a Revert "Reduce number of samples in {svd,pca}_lowrank OpInfos (#127199)"
This reverts commit 18a3f781e6382e2222d7c30c18136267407f9953.

Reverted https://github.com/pytorch/pytorch/pull/127199 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing MacOS trunk job 18a3f781e6 (25619618844) ([comment](https://github.com/pytorch/pytorch/pull/127199#issuecomment-2140834363))
2024-05-30 20:45:31 +00:00
cce2192396 [pipelining] Support calling multiple recv fwd/bwd ops (#127084)
Currently, only a single `get_fwd_recv_ops` or `get_bwd_recv_ops` can be called before `forward_one_chunk` and `backward_one_chunk` since they both share the same chunk_id counter. This creates a separate `recv_chunk_id` counter so that recvs can be accumulated.
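
A minimal sketch of the counter split described above. The class and method names are hypothetical and do not mirror the real pipeline stage class; only `chunk_id` and `recv_chunk_id` come from the description. The point is that posting receives advances its own counter, so several receives can be queued before any chunk is computed.

```python
class StageCounters:
    """Hypothetical illustration only; not the actual pipelining stage."""

    def __init__(self):
        self.chunk_id = 0       # advanced when a chunk is computed
        self.recv_chunk_id = 0  # advanced when a recv is posted

    def post_recv(self):
        # Hand out the next chunk index for a receive without touching chunk_id.
        chunk = self.recv_chunk_id
        self.recv_chunk_id += 1
        return chunk

    def compute_one_chunk(self):
        # Compute consumes chunk indices independently of posted receives.
        chunk = self.chunk_id
        self.chunk_id += 1
        return chunk


stage = StageCounters()
posted = [stage.post_recv() for _ in range(3)]            # accumulate three recvs
computed = [stage.compute_one_chunk() for _ in range(3)]  # then compute the chunks
```

With a single shared counter, the second `post_recv` call would already be out of step with the compute side.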

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127084
Approved by: https://github.com/wconstab
2024-05-30 20:15:52 +00:00
aa3d041830 [pipelining] Fix block comments for doc rendering (#127418)
Previous:
<img width="915" alt="image" src="https://github.com/pytorch/pytorch/assets/14858254/14626937-7d79-4a7a-9d0b-3fcfe64b4667">
<img width="926" alt="image" src="https://github.com/pytorch/pytorch/assets/14858254/58ab009c-3f93-46d7-a04f-499a2a0ba390">

New:
https://docs-preview.pytorch.org/pytorch/pytorch/127418/distributed.pipelining.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127418
Approved by: https://github.com/wconstab
2024-05-30 20:10:07 +00:00
ff23c5b7d7 [cudagraph] improve log for mutating static input tensor addresses (#127145)
Summary: This diff adds more logging for cudagraph when a static input tensor's address mutates. For each placeholder whose static input tensor address mutates, we log its name, the changed data pointer address, and the input stack trace. Since some placeholders may have an empty stack trace, we find the first user with a non-empty stack trace and print that stack trace instead.

Test Plan: buck2 run fbcode//caffe2/test/inductor:cudagraph_trees -- --r test_static_inputs_address_mutation_log

Differential Revision: D57805118

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127145
Approved by: https://github.com/eellison
2024-05-30 19:57:32 +00:00
19333d1eb9 [ROCm] Update triton pin to fix libtanh issue (#125396)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125396
Approved by: https://github.com/pruthvistony, https://github.com/nmacchioni
2024-05-30 19:26:58 +00:00
2cb6f20867 Warn env vars only once during program (#127046)
This avoids logs being excessively noisy in some training runs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127046
Approved by: https://github.com/kwen2501, https://github.com/wconstab
2024-05-30 19:10:53 +00:00
4afc5c7bb9 [torchscript] Handle prim::device and prim::dtype (#127466)
- Support prim::device and prim::dtype during torchscript migration to export
- Add unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127466
Approved by: https://github.com/SherlockNoMad
2024-05-30 18:35:44 +00:00
fa426b096b Default meta device to use swap_tensors in nn.Module._apply (.to_empty and .to('meta')) (#126819)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126819
Approved by: https://github.com/albanD
ghstack dependencies: #127313, #126814
2024-05-30 18:28:13 +00:00
bfdec93395 Default XLA to use swap_tensors path in nn.Module._apply (#126814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126814
Approved by: https://github.com/JackCaoG, https://github.com/albanD
ghstack dependencies: #127313
2024-05-30 18:28:13 +00:00
39cf2f8e66 Added sorting notes for eig/eigvals (#127492)
Fixes #58034

@lezcano , Added suggested comments for eig and eigvals in the documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127492
Approved by: https://github.com/lezcano, https://github.com/kit1980
2024-05-30 18:13:22 +00:00
7827afca14 Copy the constant folding pass to the pass under export/passes folder (#127456)
It's a generic pass and I'm trying to find a good place to host it. It's currently needed by the quantization flow. See context in D55930580; it's too much effort to land a fix in the inductor folder.

Differential Revision: [D57934182](https://our.internmc.facebook.com/intern/diff/D57934182/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127456
Approved by: https://github.com/angelayi
2024-05-30 18:04:08 +00:00
f9937afd4f Add noqa to prevent lint warnings (#127545)
This prevents the import from being removed as an unused import. What's annoying is that the check does not run consistently: lintrunner doesn't warn me on this PR even without the comment, but it does on other PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127545
Approved by: https://github.com/masnesral
2024-05-30 17:56:49 +00:00
12d6446507 Revert "[inductor] fix mkldnn linear binary fusion check ut (#127296)"
This reverts commit cdeb242fc977210e211fd77b217320205c9f4042.

Reverted https://github.com/pytorch/pytorch/pull/127296 on behalf of https://github.com/huydhn due to Sorry for reverting your change but one of the tests is failing on trunk ROCm. Please help fix and reland the change https://github.com/pytorch/pytorch/actions/runs/9302535020/job/25606932572 ([comment](https://github.com/pytorch/pytorch/pull/127296#issuecomment-2140334323))
2024-05-30 17:18:23 +00:00
e9a6bbbf7c Revert "[CI] add xpu test in periodic workflow (#126410)"
This reverts commit 30d98611a3a35287c47ded9647f0b4c81fbdf036.

Reverted https://github.com/pytorch/pytorch/pull/126410 on behalf of https://github.com/malfet due to Let's sync up on the test strategy/policies here ([comment](https://github.com/pytorch/pytorch/pull/126410#issuecomment-2140269549))
2024-05-30 17:01:02 +00:00
cyy
8777443d73 Remove FindMatlabMex.cmake (#127414)
It is not used anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127414
Approved by: https://github.com/ezyang
2024-05-30 16:26:35 +00:00
b506d37331 Fix multiple errors while parsing NativeFunctions from YAML (#127413)
Fixes multiple errors in parse_native_yaml when loading NativeFunctions from a YAML file.

Adds assertions that validate the parsed data.

Fixes #127404, #127405, #127406, #127407, #127408, #127409, #127410, #127411

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127413
Approved by: https://github.com/ezyang
2024-05-30 16:25:04 +00:00
ea5c17de90 Revert "Add torchao nightly testing workflow (#126885)"
This reverts commit d938170314fa89acaad6b06fbbaac6b98f1e618f.

Reverted https://github.com/pytorch/pytorch/pull/126885 on behalf of https://github.com/atalman due to Broke inductor periodic test ([comment](https://github.com/pytorch/pytorch/pull/126885#issuecomment-2140139486))
2024-05-30 16:23:06 +00:00
cyy
be7be9fa16 [Distributed] [8/N] Fix clang-tidy warnings in torch/csrc/distributed/c10d (#125102)
This PR continues to clean clang-tidy warnings in torch/csrc/distributed/c10d, following https://github.com/pytorch/pytorch/pull/124987.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125102
Approved by: https://github.com/ezyang
2024-05-30 16:19:53 +00:00
576c5ef1dd [inductor] fix some tests in test_max_autotune.py (#127472)
Fix https://github.com/pytorch/pytorch/issues/126176. We should not use torch.empty to generate input data if we are going to do any accuracy test. torch.empty may return NaN. In that case both the reference and the actual result may contain NaN at the same index, but `NaN != NaN`, so the test fails.

Also, whether torch.empty returns NaN is not deterministic. It may depend on other tests that ran earlier.

Generating random data instead of calling torch.empty fixes the problem.
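
A framework-free sketch of the failure mode, using plain Python floats rather than tensors. The helper name `all_equal` is hypothetical and is not part of the test suite. Two results that agree everywhere, including a NaN at the same index, still fail an elementwise equality check.

```python
import math
import random


def all_equal(expected, actual):
    # Elementwise equality, the way a naive accuracy check compares results.
    return len(expected) == len(actual) and all(
        e == a for e, a in zip(expected, actual)
    )


# Stand-in for uninitialized memory that happens to contain a NaN.
uninitialized = [1.0, math.nan, 3.0]
reference = [x * 2 for x in uninitialized]
actual = [x * 2 for x in uninitialized]
nan_case_matches = all_equal(reference, actual)

# Seeded random inputs are finite, so identical computations compare equal.
rng = random.Random(0)
inputs = [rng.uniform(-1.0, 1.0) for _ in range(3)]
random_case_matches = all_equal([x * 2 for x in inputs], [x * 2 for x in inputs])
```

Here `nan_case_matches` is false even though both sides ran the same computation, while `random_case_matches` is true.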

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127472
Approved by: https://github.com/eellison, https://github.com/jansel
2024-05-30 16:04:48 +00:00
d2df0f56a3 Fix compilation_latency regression caused by #127060 (#127326)
It seems that while #127060 improved the speed for tacotron2 it introduced a compilation_latency regression for some of the TIMM benchmarks.

The original change was to precompute the Dep metadata - but apparently some benchmarks have few enough overlaps that precomputing O(n) deps was slower than ignoring O(n^2) deps.  So change it to go back to computing the Dep metadata on demand but to then cache the result.
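
The "compute on demand, then cache" approach can be sketched as follows. This is an illustration under assumptions, not Inductor's code: the `Node` class, its `reads` field, and the `dep_metadata` property are hypothetical names.

```python
from functools import cached_property


class Node:
    """Hypothetical stand-in for a scheduler node; not Inductor's real class."""

    compute_calls = 0  # counts how often the expensive computation runs

    def __init__(self, reads):
        self.reads = reads

    @cached_property
    def dep_metadata(self):
        # Computed on first access only, then stored on the instance.
        Node.compute_calls += 1
        return frozenset(self.reads)


nodes = [Node(["a", "b"]), Node(["b", "c"]), Node(["d"])]

# Only the nodes that are actually compared pay for their metadata.
overlap = nodes[0].dep_metadata & nodes[1].dep_metadata
overlap_again = nodes[0].dep_metadata & nodes[1].dep_metadata
```

The third node is never inspected, so its metadata is never built, and repeated lookups on the first two reuse the cached value.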

`dm_nfnet_f0` was a good example because on the dashboard it showed an increase from 140s -> 154s.

```
python benchmarks/dynamo/timm_models.py --performance --cold-start-latency --training --amp --backend inductor --dynamic-shapes --dynamic-batch-only --device cuda --total-partitions 5 --partition-id 1 --output timm-0.csv --only dm_nfnet_f0
```

Looking at the compilation_latency result.

On viable (d6e3e8980):
172.777958
176.725071
177.907955

On viable with #127060 and #127061 fully backed out:
158.305166
158.688560
160.791187

On viable w/ this change:
160.094164
160.201845
161.752157

I think that's probably close enough considering the variance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127326
Approved by: https://github.com/oulgen
2024-05-30 15:37:08 +00:00
ffe506e853 Better graph break msg (and warning) on Dynamo x Python C++ extension (#127301)
Dynamo graph breaks on Python C/C++ extensions (e.g. pybinded
functions). The usual way to handle this is to turn those extensions
into custom ops. This PR adds a nicer graph break message and also
changes it to unconditionally warn on this graph break (because graph
break messages are usually not visible).

Fixes https://github.com/pytorch/pytorch/issues/126799

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127301
Approved by: https://github.com/jansel
ghstack dependencies: #127291, #127292, #127400, #127423
2024-05-30 14:54:29 +00:00
c9beea13ac Rewrite existing links to custom ops gdocs with the landing page (#127423)
NB: these links will be live after the docs build happens, which is once
a day.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127423
Approved by: https://github.com/jansel, https://github.com/williamwen42
ghstack dependencies: #127291, #127292, #127400
2024-05-30 14:54:29 +00:00
18a3f781e6 Reduce number of samples in {svd,pca}_lowrank OpInfos (#127199)
We don't need to generate so many samples for these very expensive ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127199
Approved by: https://github.com/peterbell10, https://github.com/zou3519
ghstack dependencies: #125580
2024-05-30 14:45:58 +00:00
48538d3d14 Implement svd_lowrank and pca_lowrank for complex numbers (#125580)
We fix a number of bugs previously present in the complex
implementation.

We also heavily simplify the implementation, using, among
other things, that we now have conjugate views.

I saw there is a comment regarding how slow some checks on this
function are. As such, I removed quite a few of the combinations of inputs
to make the OpInfo lighter. I still left a couple relevant examples to not regress
coverage though.

Fixes https://github.com/pytorch/pytorch/issues/122188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125580
Approved by: https://github.com/pearu, https://github.com/peterbell10
2024-05-30 14:45:58 +00:00
3fb8a0b627 Fix nextafter in inductor CPP codegen (#126876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126876
Approved by: https://github.com/peterbell10, https://github.com/jgong5
2024-05-30 14:08:16 +00:00
ce63b676f3 Revert "[compiled autograd] torch.compile API (#125880)"
This reverts commit e1c322112a3d7b128b42e27f68bc9a714bfd9a09.

Reverted https://github.com/pytorch/pytorch/pull/125880 on behalf of https://github.com/atalman due to sorry your PR broke lint, need to revert ([comment](https://github.com/pytorch/pytorch/pull/125880#issuecomment-2139605376))
2024-05-30 13:53:31 +00:00
6e0eeecc7c Add back private function torch.cuda.amp.autocast_mode._cast (#127433)
This is unfortunately used in a few places in the wild: https://github.com/search?q=torch.cuda.amp.autocast_mode._cast&type=code
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127433
Approved by: https://github.com/zou3519, https://github.com/guangyey
2024-05-30 13:29:23 +00:00
3f5d8636aa [inductor] Copy RedisRemoteCacheBackend into pytorch (#127480)
Summary: We need an implementation of RedisRemoteCacheBackend with the same API that we're using for FbMemcacheRemoteFxGraphCacheBackend. So we'll stop using the Triton implementation and adapt a version for use by inductor. I also renamed parameters and cache entries to match our cache terminology.

Test Plan: Ran this command twice and inspected log output to ensure I got cache hits:
```
TORCH_LOGS=+torch._inductor.codecache TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE=1 python benchmarks/dynamo/torchbench.py --performance --inductor --device cuda --training --amp --print-compilation-time --only dcgan
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127480
Approved by: https://github.com/oulgen
2024-05-30 13:08:10 +00:00
cdeb242fc9 [inductor] fix mkldnn linear binary fusion check ut (#127296)
In this PR:

(1) Fix the unary fusion for bf16 conv/linear.
    Previously we registered the same fusion pattern for `bf16` and `fp16`, and we did not check the dtype while matching the pattern. As a result, the `fp16` case matched the `bf16` pattern, but during the later replacement we found an unexpected float16 and did not fuse. We fix this by checking dtypes so that the `fp16` case no longer matches the `bf16` pattern.

```
  def _is_valid_computation_unary_fusion(computation_op, lowp_dtype=None):
      def fn(match):
          # lowp_dtype is now passed here; previously it was not checked
          matched = _is_single_computation_op(computation_op, lowp_dtype)(match)

```

This was not exposed before because we only checked the match count, which was correct either way because the pattern did match. To catch this, we add a check on the number of `generated_kernel`. If the op is not fused, there will be an additional kernel to compute the post op.

(2) Previously, the unit test
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_binary
```
did not check the fusion status; this PR fixes that.

(3) Extend `test_conv_binary` to test with low-precision (lp) dtypes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127296
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/jansel
2024-05-30 12:29:36 +00:00
9f73c65b8f xpu: pass MAX_JOBS building xpu_mkldnn_proj (#126562)
mkldnn is quite a big project, and MAX_JOBS support is essential when building on a system with a large number of CPUs and limited memory.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126562
Approved by: https://github.com/jgong5, https://github.com/guangyey, https://github.com/albanD
2024-05-30 12:10:33 +00:00
30d98611a3 [CI] add xpu test in periodic workflow (#126410)
Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126410
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-05-30 12:10:15 +00:00
1071437169 Introduce cuda_p2p based fused_all_gather_matmul and fused_matmul_reduce_scatter (#126634)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126634
Approved by: https://github.com/Chillee, https://github.com/wanchaol
2024-05-30 12:10:11 +00:00
705346bf8d [ONNX] Skip optimizer when it fails (#127349)
Continues #127039

(1) Skip optimizer when it fails
(2) Update onnx, ort, and onnx-script
(3) The onnx-script update is what actually enables the optimizer and rewriter in this PR, since https://github.com/pytorch/pytorch/pull/123379 did not update onnx-script.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127349
Approved by: https://github.com/justinchuby
2024-05-30 07:08:45 +00:00
cd06ae0cb8 Relax use_count constraints for swap_tensors when AccumulateGrad holds a reference (#127313)
### Before this PR:
`torch.utils.swap_tensors(a, b)` required the `use_count` of `a` and `b` to be 1

```python
a = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, 4)
out = a * 2
out.sum().backward()
# Calling swap_tensors here would fail due to the reference held by AccumulateGrad node, which is not cleaned up after backward
# torch.utils.swap_tensors(a, b)
del out
# Calling swap_tensors here would pass
torch.utils.swap_tensors(a, b)
```
### After this PR:
`torch.utils.swap_tensors(a, b)` requires the `use_count` of `a` and `b` to be 1 or 2 IF the second reference is held by `AccumulateGrad`

A pre-hook will be registered on the `AccumulateGrad` node so that it will fail if it is called (i.e. if user attempts to backward through the graph).

```python
a = torch.randn(2, 3, requires_grad=True)
b = torch.randn(2, 4)
out = a * 2
out.sum().backward()
# Calling swap_tensors here is ok
torch.utils.swap_tensors(a, b)
# If we ever backward to the AccumulateGrad node it will error that it was poisoned by swap_tensors
```

### Application to `nn.Module`

This issue is especially pertinent in the context of `nn.Module`, where parameters will have `AccumulateGrad` nodes initialized after forward. Specifically, this is intended to address https://github.com/pytorch/pytorch/pull/126814#issuecomment-2127777866. Previously, the following would fail at `m.cpu()`. We want users to be able to do something like this, and to get an error only if they ever attempt to backward through the poisoned `AccumulateGrad` node.

```python
import torch
import torch.nn as nn
m = nn.Linear(3, 5)
inp = torch.randn(2, 3)
out = m(inp)
out.sum().backward()
m.cpu()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127313
Approved by: https://github.com/soulitzer
2024-05-30 07:06:55 +00:00
d44ab8ba6d [dynamo] utility to generate bytecode from template function (#127359)
This will be helpful in reducing some of the hardcoded and python-version-dependent bytecode generation in various places in dynamo - e.g. resume function generation and object reconstruction.
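
To illustrate the general idea (this is not the Dynamo utility itself, whose API is not shown here), the standard `dis` module can extract instructions from an ordinary function, so the running interpreter produces the version-appropriate bytecode instead of a hardcoded opcode list. The function name `template` is a hypothetical example.

```python
import dis
import sys


def template(a, b):
    # The "template": ordinary Python whose compiled bytecode we want to reuse.
    return a + b


# Let the running interpreter produce the instructions instead of hardcoding them.
opnames = [inst.opname for inst in dis.get_instructions(template)]
print(sys.version_info[:2], opnames)
```

The printed opcode names differ between Python versions (for example, `BINARY_ADD` up to 3.10 versus `BINARY_OP` from 3.11), which is the kind of version dependence that deriving bytecode from a template avoids spelling out by hand.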

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127359
Approved by: https://github.com/jansel
ghstack dependencies: #127329
2024-05-30 06:37:32 +00:00
5d316c81be [Inductor] Add 0 initialization to Triton masked loads (#127311)
For a masked `tl.load` operation, the Triton language specifies that values masked out (i.e. where the mask evaluates to false) are undefined in the output of the load. Triton provides an optional `other` parameter which, when included, provides an explicit value to use for masked out values from the load. If the output from a masked load without the `other` parameter is used in a conditional, unexpected behavior can occur.

Despite the language specification, all Triton backends currently in use by PyTorch Inductor (NVIDIA, AMD, and Intel) 0-initialize masked loads if `other` is not present. (We recently changed the Intel backend behavior to match NVIDIA and AMD because that's what our users expect, even if it means not following the Triton spec to a T.) This PR attempts to "future-proof" Inductor for new backends, or for changes in the current backends, by adding an explicit `other` in instances where later conditionals depend on the `tl.load` output. We did not see any performance change from 0-initializing in the Intel XPU backend, but one could imagine compiler optimizations that remove paths depending on undefined values. I also removed an exception to the `other` behavior for boolean loads, which was put in place for a Triton bug that should be fixed. I added `other` to the getting started documentation as a hint that masked load behavior requires explicit initialization, even though I don't expect `undef` values to cause the example code to fail if the underlying output is not 0-initialized. Finally, I added `other` to the `make_load` function in `select_algorithm.py`, though I wasn't able to determine whether that function is actually being called.
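
The semantics at stake can be emulated in plain Python. This is not Triton code: `masked_load` and the `UNDEFINED` sentinel are hypothetical names used only to model "value not specified by the language" for masked-out lanes.

```python
UNDEFINED = object()  # sentinel standing in for an unspecified value


def masked_load(buffer, offsets, mask, other=UNDEFINED):
    # Lanes where the mask is False take `other` if it was given;
    # otherwise their value is left undefined.
    return [buffer[o] if m else other for o, m in zip(offsets, mask)]


buffer = [5, 7, 9]
offsets = [0, 1, 2, 3]                     # lane 3 is out of bounds
mask = [o < len(buffer) for o in offsets]  # so lane 3 is masked out

without_other = masked_load(buffer, offsets, mask)
with_other = masked_load(buffer, offsets, mask, other=0)

# A later conditional on the loaded values is well-defined only with `other`.
positive = [v > 0 for v in with_other]
```

Without `other`, the last lane holds the sentinel, and any comparison on it depends on whatever a backend happens to put there. With `other=0`, every backend would see the same value.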

Fixes #126535

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127311
Approved by: https://github.com/jansel
2024-05-30 04:50:54 +00:00
3947731887 enable test_parameter_free_dynamic_shapes test when nn module inlining is on (#127424)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127424
Approved by: https://github.com/mlazos
ghstack dependencies: #126444, #127146
2024-05-30 04:20:07 +00:00
15cc9f2e7e [dtensor][be] added checksAssert function and refactored test cases (#127356)
**Summary**
Added c10d checksAsserts functions to reduce written lines of code and refactored test cases. Merged one test case into another.

**Test Plan**
pytest test/distributed/_tensor/debug/test_comm_mode.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127356
Approved by: https://github.com/XilunWu
ghstack dependencies: #127025, #127029, #127040, #127134, #127334
2024-05-30 03:48:17 +00:00
998f38814c [dtensor][debug] added c10d allgather, allgather_coalesced, and allgather_into_tensor_coalesced tracing to CommDebugMode (#127334)
**Summary**
Added c10d allgather, allgather_coalesced, and allgather_into_tensor_coalesced tracing to CommDebugMode and edited test case in test_comm_mode to include added features.

**Test Plan**
pytest test/distributed/_tensor/debug/test_comm_mode.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127334
Approved by: https://github.com/XilunWu, https://github.com/yifuwang
ghstack dependencies: #127025, #127029, #127040, #127134
2024-05-30 03:48:17 +00:00
f58fc16e8f [easy?] Move AsyncCompile to a different file (#127235)
By moving AsyncCompile to its own file, we can import codecache without running the side effects of AsyncCompile. This will be important for AOTAutogradCaching, where we want to share some implementation details with codecache.py without spawning new processes.

To conservatively maintain the same behavior elsewhere, every time we import codecache, I've added an import to torch._inductor.async_compile (except in autograd_cache.py, where the explicit goal is to not do this)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127235
Approved by: https://github.com/aorenste, https://github.com/oulgen, https://github.com/masnesral
2024-05-30 02:43:02 +00:00
e0fc1ab625 Forward fix for templates + views (#127446)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127446
Approved by: https://github.com/eellison
2024-05-30 02:34:35 +00:00
3d541835d5 distributed debug handlers (#126601)
This adds debug handlers as described in:
* https://gist.github.com/d4l3k/828b7be585c7615e85b2c448b308d925 (public copy)
* https://docs.google.com/document/d/1la68szcS6wUYElUUX-P6zXgkPA8lnfzpagMTPys3aQ8/edit (internal copy)

This is only adding the C++ pieces that will be used from the main process. The Python and torchrun pieces will be added in a follow up PR.

This adds 2 handlers out of the box:

* `/handler/ping` for testing purposes
* `/handler/dump_nccl_trace_pickle` as a POC integration with Flight Recorder

Test plan:

```
python test/distributed/elastic/test_control_plane.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126601
Approved by: https://github.com/kurman, https://github.com/c-p-i-o
2024-05-30 02:21:08 +00:00
e1c322112a [compiled autograd] torch.compile API (#125880)
- enter existing compiled autograd ctx manager before entering torch.compile frames

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125880
Approved by: https://github.com/jansel
2024-05-30 02:10:06 +00:00
da39461d61 [optim] Move test_grad_scaling_autocast_fused_optimizers to test_cuda.py (#126418)
This PR addresses the review comments on #124904.

- Move test_grad_scaling_autocast_fused_optimizers to test_cuda.py
- Combine _grad_scaling_autocast_fused_optimizers into test_grad_scaling_autocast_fused_optimizers
- Move to OptimizerInfo framework.
- For the failing tests test_grad_scaling_autocast_fused_optimizers AdamW_cuda_float32 and Adam_cuda_float32:
    - Added toleranceOverride in this PR
    - Created issue #127000

```
> (c2env) [sandish@devgpu166.ash6 ~/pytorch (refactoroptimizers)]$ python test/test_cuda.py -k test_grad_scaling_autocast_fused_optimizers -v
/home/sandish/pytorch/torch/backends/cudnn/__init__.py:106: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system.
  warnings.warn(
/home/sandish/pytorch/torch/backends/cudnn/__init__.py:106: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system.
  warnings.warn(
test_grad_scaling_autocast_fused_optimizers_Adagrad_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True}
{'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'lr': 0.1, 'fused': True}
{'lr': 0.1, 'fused': True}
{'initial_accumulator_value': 0.1, 'weight_decay': 0.1, 'fused': True}
{'initial_accumulator_value': 0.1, 'weight_decay': 0.1, 'fused': True}
{'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.1, 'fused': True}
{'lr': 0.1, 'lr_decay': 0.5, 'weight_decay': 0.1, 'fused': True}
{'lr': tensor(0.0010), 'fused': True}
{'lr': tensor(0.0010), 'fused': True}
ok
test_grad_scaling_autocast_fused_optimizers_AdamW_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True}
{'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': 0.01, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
ok
test_grad_scaling_autocast_fused_optimizers_Adam_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True}
{'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': 0.01, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
ok
test_grad_scaling_autocast_fused_optimizers_SGD_cpu_float32 (__main__.TestCudaOptimsCPU) ... {'fused': True}
{'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': tensor(0.0010), 'fused': True}
{'lr': tensor(0.0010), 'fused': True}
{'momentum': 0.9, 'fused': True}
{'momentum': 0.9, 'fused': True}
{'momentum': 0.9, 'dampening': 0.5, 'fused': True}
{'momentum': 0.9, 'dampening': 0.5, 'fused': True}
{'momentum': 0.9, 'weight_decay': 0.1, 'fused': True}
{'momentum': 0.9, 'weight_decay': 0.1, 'fused': True}
{'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True}
{'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
ok
test_grad_scaling_autocast_fused_optimizers_Adagrad_cuda_float32 (__main__.TestCudaOptimsCUDA) ... skipped 'cuda is not supported for fused on Adagrad'
test_grad_scaling_autocast_fused_optimizers_AdamW_cuda_float32 (__main__.TestCudaOptimsCUDA) ... {'fused': True}
{'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': 0.01, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
{'capturable': True, 'fused': True}
{'capturable': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True}
{'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True}
{'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True}
ok
test_grad_scaling_autocast_fused_optimizers_Adam_cuda_float32 (__main__.TestCudaOptimsCUDA) ... {'fused': True}
{'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': 0.01, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'fused': True}
{'capturable': True, 'fused': True}
{'capturable': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True}
{'weight_decay': 0.1, 'amsgrad': True, 'capturable': True, 'fused': True}
{'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True}
{'lr': tensor(0.0010), 'amsgrad': True, 'capturable': True, 'fused': True}
ok
test_grad_scaling_autocast_fused_optimizers_SGD_cuda_float32 (__main__.TestCudaOptimsCUDA) ... {'fused': True}
{'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': 0.01, 'fused': True}
{'lr': tensor(0.0010), 'fused': True}
{'lr': tensor(0.0010), 'fused': True}
{'momentum': 0.9, 'fused': True}
{'momentum': 0.9, 'fused': True}
{'momentum': 0.9, 'dampening': 0.5, 'fused': True}
{'momentum': 0.9, 'dampening': 0.5, 'fused': True}
{'momentum': 0.9, 'weight_decay': 0.1, 'fused': True}
{'momentum': 0.9, 'weight_decay': 0.1, 'fused': True}
{'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True}
{'momentum': 0.9, 'nesterov': True, 'weight_decay': 0.1, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
{'weight_decay': 0.1, 'maximize': True, 'fused': True}
ok

----------------------------------------------------------------------
Ran 8 tests in 16.117s

OK (skipped=1)

> lintrunner test/test_cuda.py
----------------------------------------------------------------------
ok No lint issues.

> lintrunner torch/testing/_internal/common_optimizers.py
----------------------------------------------------------------------
ok No lint issues.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126418
Approved by: https://github.com/janeyx99
2024-05-30 01:47:41 +00:00
67739d8c6f Revert "[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)"
This reverts commit 699db7988d84d163ebb6919f78885e4630182a7a.

Reverted https://github.com/pytorch/pytorch/pull/127051 on behalf of https://github.com/PaliC due to This PR needs to be synced using the import button as there is a bug in our diff train ([comment](https://github.com/pytorch/pytorch/pull/127051#issuecomment-2138496995))
2024-05-30 01:16:57 +00:00
1abcac9dab New Custom Ops Documentation landing page (#127400)
We create a new landing page for PyTorch custom ops (suggested by
jansel). All of our error messages will link here, and I'll work with
the docs team to see if we can boost SEO for this page.

NB: the landing page links some non-searchable webpages. Two of those
(the Python custom ops tutorial and C++ custom ops tutorial) will turn
into actual webpages when PyTorch 2.4 comes around. I'll make the third one
(the Custom Operators Manual) once it stabilizes (we continuously add new
things to it and the length means that we might want to create a custom
website for it to make the presentation more digestible).

Test Plan:
- view docs preview.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127400
Approved by: https://github.com/jansel
ghstack dependencies: #127291, #127292
2024-05-30 01:06:04 +00:00
49ad90349d Correct error message for aten::_local_scalar_dense on meta tensor (#124554)
registering a meta for aten::_local_scalar_dense with a different error message.

Fixes pytorch#119588

Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124554
Approved by: https://github.com/ezyang
2024-05-30 00:50:29 +00:00
d66f12674c Handle tuple and dict during TorchScript to ExportedProgram conversion (#127341)
* Add some test cases for testing List, Tuple, and Dict
* Refactor the conversion code slightly
* Add logic to handle Dict
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127341
Approved by: https://github.com/SherlockNoMad, https://github.com/angelayi
2024-05-30 00:08:09 +00:00
f14dc3bde8 Fix check message (#126951)
As title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126951
Approved by: https://github.com/Skylion007, https://github.com/kit1980
2024-05-29 23:58:09 +00:00
76fc58c160 Document the legacy constructor for Tensor (#122625)
Fixes https://github.com/pytorch/pytorch/issues/122408

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122625
Approved by: https://github.com/albanD
2024-05-29 23:23:19 +00:00
7931eee5c5 Support torch.dtype as parameter in pybind11 cpp extension. (#126865)
Support torch.dtype as parameter in pybind11 cpp extension.
Example:
```
cpp_extension.my_ops(self, other, torch.dtype)
```

@ezyang @bdhirsh
Co-authored-by: Edward Z. Yang <ezyang@mit.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126865
Approved by: https://github.com/ezyang
2024-05-29 23:19:32 +00:00
cyy
8ea1dc8748 Use Python::NumPy target (#127399)
Now that we use FindPython, use it again for numpy detection.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127399
Approved by: https://github.com/malfet
2024-05-29 23:17:58 +00:00
0fa2c5b049 Fix mask propagation in the presence of where (#125574)
Before, when calling ops.where, masks were not properly propagated. We
now restrict the optimisation to `ops.masked`, which I think is what
the original code intended to do.

I'm not 100% sure that even in the masked case this code is not
introducing some bugs, but this is a strict improvement over the
previous state.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125574
Approved by: https://github.com/peterbell10
ghstack dependencies: #114471, #126783
2024-05-29 23:17:41 +00:00
15a7916c0e Ability to capture Process Groups information into Execution Traces (#126995)
Contains a method added to the ExecutionTraceObserver class to record the snapshot of the current process group config upon tracing start.

Unit test:

```
(pytorch) [dsang@devgpu021.nha2 ~/github/pytorch-fork (viable/strict)]$ touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_execution_trace
/home/dsang/github/pytorch-fork/torch/distributed/optim/__init__.py:28: UserWarning: TorchScript support for functional optimizers isdeprecated and will be removed in a future PyTorch release. Consider using the torch.compile optimizer instead.
  warn("TorchScript support for functional optimizers is"
test_ddp_profiling_execution_trace (__main__.TestDistBackendWithSpawn.test_ddp_profiling_execution_trace) ... /home/dsang/github/pytorch-fork/torch/distributed/optim/__init__.py:28: UserWarning: TorchScript support for functional optimizers isdeprecated and will be removed in a future PyTorch release. Consider using the torch.compile optimizer instead.
  warn("TorchScript support for functional optimizers is"
/home/dsang/github/pytorch-fork/torch/distributed/optim/__init__.py:28: UserWarning: TorchScript support for functional optimizers isdeprecated and will be removed in a future PyTorch release. Consider using the torch.compile optimizer instead.
  warn("TorchScript support for functional optimizers is"
NCCL version 2.20.5+cuda12.0
[rank1]:[W523 16:06:01.705774398 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank0]:[W523 16:06:01.705905760 reducer.cpp:1400] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank1]:[W523 16:06:01.715182258 execution_trace_observer.cpp:819] Enabling Execution Trace Observer
printing pg info into trace
[rank0]:[W523 16:06:01.715841805 execution_trace_observer.cpp:819] Enabling Execution Trace Observer
printing pg info into trace
[rank1]:[W523 16:06:01.727881877 execution_trace_observer.cpp:831] Disabling Execution Trace Observer
[rank0]:[W523 16:06:01.728792871 execution_trace_observer.cpp:831] Disabling Execution Trace Observer
Execution trace saved at /tmp/tmpdsov4ngi.et.json
[{'id': 3, 'name': '## process_group:init ##', 'ctrl_deps': 2, 'inputs': {'values': ['[{"pg_name": "0", "pg_desc": "default_pg", "backend_config": "cuda:nccl", "ranks": [], "group_size": 2, "group_count": 1}]'], 'shapes': [[]], 'types': ['String']}, 'outputs': {'values': [], 'shapes': [], 'types': []}, 'attrs': [{'name': 'rf_id', 'type': 'uint64', 'value': 1}, {'name': 'fw_parent', 'type': 'uint64', 'value': 0}, {'name': 'seq_id', 'type': 'int64', 'value': -1}, {'name': 'scope', 'type': 'uint64', 'value': 7}, {'name': 'tid', 'type': 'uint64', 'value': 1}, {'name': 'fw_tid', 'type': 'uint64', 'value': 0}, {'name': 'op_schema', 'type': 'string', 'value': ''}, {'name': 'kernel_backend', 'type': 'string', 'value': ''}, {'name': 'kernel_file', 'type': 'string', 'value': ''}]}]
Execution trace saved at /tmp/tmpsdiqy6az.et.json
[{'id': 3, 'name': '## process_group:init ##', 'ctrl_deps': 2, 'inputs': {'values': ['[{"pg_name": "0", "pg_desc": "default_pg", "backend_config": "cuda:nccl", "ranks": [], "group_size": 2, "group_count": 1}]'], 'shapes': [[]], 'types': ['String']}, 'outputs': {'values': [], 'shapes': [], 'types': []}, 'attrs': [{'name': 'rf_id', 'type': 'uint64', 'value': 1}, {'name': 'fw_parent', 'type': 'uint64', 'value': 0}, {'name': 'seq_id', 'type': 'int64', 'value': -1}, {'name': 'scope', 'type': 'uint64', 'value': 7}, {'name': 'tid', 'type': 'uint64', 'value': 1}, {'name': 'fw_tid', 'type': 'uint64', 'value': 0}, {'name': 'op_schema', 'type': 'string', 'value': ''}, {'name': 'kernel_backend', 'type': 'string', 'value': ''}, {'name': 'kernel_file', 'type': 'string', 'value': ''}]}]
ok

----------------------------------------------------------------------
Ran 1 test in 24.447s

OK
```
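
The process group snapshot in the trace above is stored as a JSON string inside the node's first input value. A minimal sketch of reading it back, assuming the node layout matches the log output (the `node` dict is a trimmed copy of that output, and `process_groups_from_node` is a hypothetical helper, not part of the PR):

```python
import json

# Trimmed copy of the '## process_group:init ##' node printed in the log above.
node = {
    "id": 3,
    "name": "## process_group:init ##",
    "inputs": {
        "values": [
            '[{"pg_name": "0", "pg_desc": "default_pg", '
            '"backend_config": "cuda:nccl", "ranks": [], '
            '"group_size": 2, "group_count": 1}]'
        ],
        "shapes": [[]],
        "types": ["String"],
    },
}


def process_groups_from_node(node):
    """Decode the JSON string stored as the node's first input value."""
    return json.loads(node["inputs"]["values"][0])


groups = process_groups_from_node(node)
for group in groups:
    print(group["pg_desc"], group["backend_config"], group["group_size"])
```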

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126995
Approved by: https://github.com/briancoutinho, https://github.com/sraikund16
2024-05-29 23:16:17 +00:00
3174e6cb8e [Temp][CI] Run older MPS tests/Mac builds on MacOS 13 (#127428)
To avoid ambiguity while migration outlined in https://github.com/pytorch-labs/pytorch-gha-infra/pull/399 is in progress. Otherwise, MPS jobs for Ventura can be accidentally scheduled on Sonoma or builds, which might result in flaky failures on trunk.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127428
Approved by: https://github.com/huydhn
2024-05-29 22:58:41 +00:00
9257a0698b [Split Build] Load dependencies from libtorch in __init__.py (#126826)
This PR makes it such that we search for a libtorch wheel when initializing pytorch in order to find the necessary shared libraries.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126826
Approved by: https://github.com/huydhn, https://github.com/atalman, https://github.com/ZainRizvi
2024-05-29 22:03:50 +00:00
b0ef363972 [dtensor] rename _Partial -> Partial for all imports (#127420)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127420
Approved by: https://github.com/awgu
2024-05-29 21:42:40 +00:00
d99b115eb3 Fix delete old branches workflow (#127442)
The ubuntu runner started using 2.45.1 (prev 2.43.2), which includes 1f49f7506f (changes +00:00 timezone to Z)

Python versions prior to 3.11 do not support Z when parsing isoformat, so update the workflow to use 3.11
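
For illustration, a version-independent way to parse such timestamps is to rewrite the trailing `Z` before calling `fromisoformat`. This is only a sketch of the incompatibility; the workflow fix itself just moves to Python 3.11, and `parse_iso_timestamp` is a hypothetical helper:

```python
from datetime import datetime, timezone


def parse_iso_timestamp(value: str) -> datetime:
    """Parse an ISO 8601 timestamp, accepting a trailing 'Z' on any Python 3.7+."""
    if value.endswith("Z"):
        # datetime.fromisoformat only accepts 'Z' from Python 3.11 onwards.
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


parsed = parse_iso_timestamp("2024-05-29T21:17:09Z")
print(parsed.isoformat())
```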

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127442
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-05-29 21:17:09 +00:00
38a33c3202 don't call .item in onehot for XLA (#127335)
We found that `nn.functional.one_hot` will cause a graph break due to the `.item()` call in the native implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127335
Approved by: https://github.com/ezyang
2024-05-29 20:37:26 +00:00
cyy
84b5aa9a68 [Caffe2] [Reland] Remove Caffe2 proto files (#127394)
Reland of #126134, which was reverted due to the wrong base. Now that #126705 has been relanded, it's time to reland this one.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127394
Approved by: https://github.com/r-barnes
2024-05-29 20:37:02 +00:00
92d081e228 [Docs] Add str type to cuda.get_device_name() and cuda.get_device_capability() function (#126743)
Fixes #126400

The `get_device_name()` and `get_device_capability()` allow passing in a string, but it's not stated in the doc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126743
Approved by: https://github.com/eqy, https://github.com/kit1980
2024-05-29 20:09:52 +00:00
24a4bfdcc2 [AdaRound] Make versatile for data / extra param for callback function (#126891)
Summary:
For sequential speech models, there can be cases where calling model(data) does not work correctly as the feed-forward step.

Speech models use a different type of criterion (a.k.a. loss function) that feeds data to individual components such as the encoder, predictor, and joiner.

Hence we need an extra parameter for passing a feed-forward wrapper.

Differential Revision: D57680391

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126891
Approved by: https://github.com/jerryzh168
2024-05-29 20:05:27 +00:00
c404b2968c Support min/max carry over for eager mode from_float method (#127309)
Summary:
After QAT is completed, or when a pre-tuned weight observer is provided by a tunable PTQ algorithm, the observer's min/max should not be overwritten again with the given weight; for static QAT this should never happen.

By design, dynamic QAT does not need to re-run the weight observer either.

This PR fixes that.

Test Plan: Signals

Differential Revision: D57747749

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127309
Approved by: https://github.com/jerryzh168
2024-05-29 19:33:26 +00:00
82a370ae3a Revert "Refresh OpOverloadPacket if a new OpOverload gets added (#126863)" (#127366)
This reverts commit ed734178abc99bc1d83ad2c61d3a1e4d4f5d20c8.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127366
Approved by: https://github.com/zou3519
2024-05-29 19:26:06 +00:00
05e99154ee Allow int vals to go down the fastpath for _foreach_max (#127303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127303
Approved by: https://github.com/albanD
ghstack dependencies: #127187
2024-05-29 19:08:58 +00:00
601c5e085d Add _foreach_max (#127187)
This PR adds _foreach_max support, the second reduction foreach op we have :D

I did have to change the autogen slightly for foreach. I can promise that the existing foreach ops' derivative behavior has not changed as I've added a skip list for the harder requirement I am setting (that the arg list should match in length). I needed to add this requirement as there is another wrong max (the one that does take in a dim for reduction) that keeps getting matched first.

Caveats!
- We do not fast path if the shapes, dtypes, device, the regular shebang for foreach are not met. We fall back to slowpath!
- MORE IMPORTANTLY, we also do not fast path for int8 and int16 and bool, but that's really a skill issue on my end as I've hardcoded -INFINITY into the CUDA kernels, and -INFINITY is not defined for small ints. It'd be nice to know how to do this properly, but that work can also come later.
- This does NOT support empty Tensors in the list, because the original max op also does not support empty Tensors. ~I think this should be allowed though, and this PR may come later.~ I understand why this is not allowed.
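
As a side note on the -INFINITY caveat: for integer types, one common choice of identity element for a max reduction is the type's minimum value. The sketch below shows that idea in plain Python only; it is an assumption for illustration, not what this PR or any follow-up does, and `max_identity` and `int_max` are hypothetical names:

```python
from functools import reduce


def max_identity(bits: int) -> int:
    """Smallest value of a signed two's-complement integer with `bits` bits."""
    return -(1 << (bits - 1))


def int_max(values, bits: int) -> int:
    """Max reduction seeded with the integer minimum instead of -infinity."""
    return reduce(max, values, max_identity(bits))


int8_result = int_max([3, -128, 7], bits=8)
print(max_identity(8), max_identity(16), int8_result)
```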

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127187
Approved by: https://github.com/albanD
2024-05-29 19:08:58 +00:00
90f4b3fcb2 PyTorch Distributed security assumptions (#127403)
To highlight that PyTorch Distributed should only be used in a trusted environment and never on nodes with open network access, which is very similar in spirit to https://github.com/tensorflow/tensorflow/blob/master/SECURITY.md#running-a-tensorflow-server

Thanks to @Xbalien and @K1ingzzz for drawing attention to missing documentation on distributed workloads security assumptions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127403
Approved by: https://github.com/wconstab
2024-05-29 19:08:20 +00:00
5196ef1b59 support builtin id function on user defined object variables. (#127146)
Fix: https://github.com/pytorch/pytorch/pull/127146

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127146
Approved by: https://github.com/anijain2305
ghstack dependencies: #126444
2024-05-29 19:00:37 +00:00
ff65b18fcf Update the is_causal explanation in the SDPA doc (#127209)
Fixes #126873

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127209
Approved by: https://github.com/drisspg
2024-05-29 18:53:17 +00:00
cyy
9cc0d56fdc Remove unused variables in tests (#127379)
Reland test fixes in #127161 and reduce reduce_ops_test into floating point types.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127379
Approved by: https://github.com/ezyang
2024-05-29 18:30:51 +00:00
d938170314 Add torchao nightly testing workflow (#126885)
Add and test torchao nightly testing workflow.

This workflow will be triggered under the following conditions:
1. If the PR has ciflow/torchao label
2. Manual trigger

It will run the torchao benchmark on torchbench/timm/huggingface model workloads with 5 configs (noquant, autoquant, int8dynamic, int8weightonly, int4weightonly). The output will be updated to the PT2 Dashboard: https://hud.pytorch.org/benchmark/compilers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126885
Approved by: https://github.com/huydhn
2024-05-29 18:22:29 +00:00
090a031d6f Use bit_cast instead of UB type-pun-via-union in Half.h (#127321)
Summary: Type punning via union has undefined behavior due to the strict aliasing rule. bit_cast does the same thing safely (using memcpy under the hood).
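
The same idea, reinterpreting a bit pattern by copying its bytes instead of aliasing memory, can be sketched in Python with the standard `struct` module. This is only an analogue of the C++ change (the helper names are hypothetical); it decodes a 16-bit pattern as an IEEE 754 half-precision value:

```python
import struct


def half_bits_to_float(bits: int) -> float:
    """Reinterpret a 16-bit pattern as an IEEE 754 binary16 value."""
    raw = struct.pack("<H", bits)       # copy the 16-bit pattern into a byte buffer
    return struct.unpack("<e", raw)[0]  # read those same bytes back as a half float


def float_to_half_bits(value: float) -> int:
    """Inverse direction: encode as binary16, then read the raw 16 bits."""
    return struct.unpack("<H", struct.pack("<e", value))[0]


one = half_bits_to_float(0x3C00)
print(one, hex(float_to_half_bits(-2.0)))
```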

Test Plan: CI

Godbolt demonstrates that doing this via memcpy still generates the same instructions: https://godbolt.org/z/PhePzd4Ex
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127321
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-05-29 17:43:50 +00:00
8b5cbb7c68 Improve NLLLoss docs (#127346)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127346
Approved by: https://github.com/mikaylagawarecki
2024-05-29 17:29:06 +00:00
28de9143a3 opcheck should be usable without optional dependencies (#127292)
This PR excises opcheck's dependency on
torch.testing._internal.common_utils, (which comes with dependencies on
expecttest and hypothesis). We do this by moving what we need to
torch.testing._utils and adding a test for it.

Fixes #126870, #126871

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127292
Approved by: https://github.com/williamwen42
ghstack dependencies: #127291
2024-05-29 17:17:49 +00:00
8a31c2aa84 [export] allow complex guards as runtime asserts (#127129)
With the current state of export's dynamic shapes, we struggle with guards and constraints that are beyond the current dynamic shapes language, expressed with dims and derived dims. While we can compile and guarantee correctness for guards within the current language (e.g. min/max ranges, linear relationships, integer divisibility) we struggle to dynamically compile guards which extend beyond that.

For these "complex" guards, we typically do either of the following: 1) raise a constraint violation error, along the lines of "not all values of <symbol> in the specified range satisfy <guard>", with or without suggested fixes, 2) specialize to the provided static values and suggest removing dynamism, or 3) fail compilation due to some arbitrary unsupported case. Previous [work](https://github.com/pytorch/pytorch/pull/124949) went towards resolving this by disabling forced specializations, instead allowing the user to fail at runtime with incorrect inputs.

In this PR, relying on [hybrid backed-unbacked symints](https://github.com/pytorch/pytorch/issues/121749), [deferred runtime asserts](https://github.com/pytorch/pytorch/blob/main/torch/fx/passes/runtime_assert.py), and the function [_is_supported_equivalence()](d7de4c9d80/torch/fx/experimental/symbolic_shapes.py (L1824)), we add a flag `_allow_complex_guards_as_runtime_asserts` which allows the user to compile exported programs containing these guards and maintain dynamism, while adding correctness checks as runtime assertions in the graph.

Hybrid backed-unbacked symints allow us to easily bypass "implicit" guards emitted from computation - guards that we ~expect to be true. Popular examples revolve around reshapes:
```
# reshape
def forward(self, x, y):  # x: [s0, s1], y: [s2]
    return x.reshape([-1]) + y  # guard s0 * s1 = s2
```

This leads to the following exported program:

```
class GraphModule(torch.nn.Module):
    def forward(self, x: "f32[s0, s1]", y: "f32[s2]"):
        sym_size_int: "Sym(s2)" = torch.ops.aten.sym_size.int(y, 0)
        mul: "Sym(-s2)" = -1 * sym_size_int;  sym_size_int = None
        sym_size_int_1: "Sym(s0)" = torch.ops.aten.sym_size.int(x, 0)
        sym_size_int_2: "Sym(s1)" = torch.ops.aten.sym_size.int(x, 1)
        mul_1: "Sym(s0*s1)" = sym_size_int_1 * sym_size_int_2;  sym_size_int_1 = sym_size_int_2 = None
        add: "Sym(s0*s1 - s2)" = mul + mul_1;  mul = mul_1 = None
        eq: "Sym(Eq(s0*s1 - s2, 0))" = add == 0;  add = None
        _assert_scalar = torch.ops.aten._assert_scalar.default(eq, "Runtime assertion failed for expression Eq(s0*s1 - s2, 0) on node 'eq'");  eq = None

        view: "f32[s0*s1]" = torch.ops.aten.view.default(x, [-1]);  x = None
        add_1: "f32[s0*s1]" = torch.ops.aten.add.Tensor(view, y);  view = y = None
        return (add_1,)
```
Another case is symbol divisibility:
```
def forward(self, x):  # x: [s0, s1]
    return x.reshape([-1, x.shape[0] - 1])  # Eq(Mod(s0 * s1, s0 - 1), 0)
```

Applying deferred runtime asserts also helps dynamic compilation for "explicit" complex guards that typically cause problems for export. For example we can generate runtime asserts for not-equal guards, and complex conditions like the following:
```
class Foo(torch.nn.Module):
    def forward(self, x, y):
        # check that negation of first guard also shows up as runtime assertion
        if x.shape[0] == y.shape[0]:  # False
            return x + y
        elif x.shape[0] == y.shape[0] ** 3:  # False
            return x + 2, y + 3
        elif x.shape[0] ** 2 == y.shape[0] * 3:  # True
            return x * 2.0, y * 3.0
```
For the above graph we will generate 3 runtime assertions: the negation of the first 2, and the 3rd condition as a guard.
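
To make the three assertions concrete, here is a plain-Python sketch that evaluates them on example sizes. It does not use export or sympy; the expression labels and helper names are hypothetical, and `x0` and `y0` stand for `x.shape[0]` and `y.shape[0]`:

```python
def runtime_assertions(x0: int, y0: int) -> dict:
    """Evaluate the three conditions that guard the traced (third) branch."""
    return {
        "Ne(x0, y0)": x0 != y0,               # negation of the first condition
        "Ne(x0, y0**3)": x0 != y0 ** 3,       # negation of the second condition
        "Eq(x0**2, 3*y0)": x0 ** 2 == 3 * y0, # the third condition itself
    }


def check_inputs(x0: int, y0: int) -> None:
    """Raise if any assertion fails, mimicking a runtime assert in the graph."""
    failed = [expr for expr, ok in runtime_assertions(x0, y0).items() if not ok]
    if failed:
        raise AssertionError(f"Runtime assertion failed for: {failed}")


check_inputs(6, 12)  # 6 != 12, 6 != 12**3, and 6**2 == 3 * 12, so this passes
```

Inputs such as `x0 = 3, y0 = 3` would take the first branch in eager mode, so they fail the first assertion here.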

One additional benefit here over the current state of exported programs is that this adds further correctness guarantees - previously with explicit complex guards, if compilation succeeded, the guards would be ignored at runtime, treated as given.

As shown above, the runtime asserts appear as math ops in the graph, generated by the sympy interpreter, resulting in an _assert_scalar call. There is an option to avoid adding these asserts into the graph, by setting `TORCH_DYNAMO_DO_NOT_EMIT_RUNTIME_ASSERTS=1`. This results in the "original" computation graph, with dynamism, and any incorrect inputs will fail on ops during runtime. Further work could go into prettifying the printer, so the majority of the graph isn't guard-related.

Ideally this PR would subsume and remove the recently added [_disable_forced_specializations](https://github.com/pytorch/pytorch/pull/124949) flag, but that flag still handles one additional case of specialization: single-variable equalities where the symbol is solvable for a concrete value: see this [PR](https://github.com/pytorch/pytorch/pull/126925)

This PR doesn't change any behavior around data-dependent errors/unbacked symints yet, that could be further work.

NOTE: will take naming change suggestions for the flag :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127129
Approved by: https://github.com/avikchaudhuri
2024-05-29 17:15:25 +00:00
cc6e72d882 Drop caffe2 core tests and some other stuff (#127089)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127089
Approved by: https://github.com/Skylion007
2024-05-29 17:11:45 +00:00
cyy
e8e327ba82 Cover clang-tidy to torch/csrc/onnx/init.cpp (#127393)
Enabling it will not cause issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127393
Approved by: https://github.com/Skylion007
2024-05-29 17:05:28 +00:00
cyy
7de1352457 [1/N] Replace exceptions with static_assert(false) in some templates (#127371)
This PR tries to report some failures at build time. Once the build fails, it generally indicates that we can wrap the code inside some conditional macros, and it is a hint to further reduce the built code size. The sizeof operations were used to ensure that the assertion depends on specific template instantiations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127371
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-05-29 16:14:00 +00:00
cyy
c69562caf9 [Caffe2]Remove more caffe2 files (#126628)
They are not used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126628
Approved by: https://github.com/albanD
2024-05-29 16:08:48 +00:00
80a8fc07b2 [dynamo] Handle np.iinfo/finfo/dtype as input (#124482)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124482
Approved by: https://github.com/lezcano
ghstack dependencies: #124481
2024-05-29 16:00:15 +00:00
9a8e8101a8 Fix wording in nn.Linear docstring. (#127240)
Definition (Linear Transformation):
A mapping $T : V \to W$ between $F$-vector spaces $V,W$ is called a *linear transformation* if and only if

a) $T(u+v)=T(u)+T(v)$,
b) $T(cv)=cT(v)$

for all $u, v \in V$, $c \in F$.

Consequently, $T(0_V)=0_W$.

Thus $x \mapsto xA^T+b$ for nonzero $b$ is **not** a linear transformation, but is often referred to as an affine linear transformation.
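
A scalar version of this argument can be checked numerically. The sketch below uses a one-dimensional stand-in for $x \mapsto xA^T+b$ with hypothetical values $a = 2$, $b = 1$, and tests both defining properties:

```python
from functools import partial


def affine(x: float, a: float = 2.0, b: float = 1.0) -> float:
    """One-dimensional analogue of x -> x A^T + b."""
    return a * x + b


def is_additive(f, u: float, v: float) -> bool:
    return f(u + v) == f(u) + f(v)


def is_homogeneous(f, c: float, v: float) -> bool:
    return f(c * v) == c * f(v)


linear = partial(affine, b=0.0)  # same map with the bias removed

print(is_additive(affine, 1.0, 2.0), is_homogeneous(affine, 3.0, 2.0))
print(is_additive(linear, 1.0, 2.0), is_homogeneous(linear, 3.0, 2.0))
print(affine(0.0), linear(0.0))
```

With a nonzero bias both properties fail and the map sends 0 to `b`; with `b = 0` both hold.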

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127240
Approved by: https://github.com/soulitzer, https://github.com/albanD
2024-05-29 14:55:40 +00:00
ade075444f [dynamo] Support numpy.dtype (#124481)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124481
Approved by: https://github.com/lezcano
2024-05-29 14:45:14 +00:00
bf966588f1 [BE][Ez]: Update cudnn_frontend submodule to v1.4.0 (#127175)
Updates the cudnn_frontend submodule to the latest 1.4.0 version.

Should be a straightforward, header-only submodule update.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127175
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-05-29 14:23:38 +00:00
0910429d72 [BE][CMake] Use FindPython module (#124613)
As FindPythonInterp and FindPythonLibs have been deprecated since cmake-3.12

Replace `PYTHON_EXECUTABLE` with `Python_EXECUTABLE` everywhere (CMake variable names are case-sensitive)

This makes PyTorch buildable with python3 binary shipped with XCode on MacOS

TODO: Get rid of `FindNumpy` as it's part of Python package
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124613
Approved by: https://github.com/cyyever, https://github.com/Skylion007
2024-05-29 13:17:35 +00:00
942d9abd66 [AOTI] Update reinplace to cover mutated buffer (#127297)
Summary: Unlike JIT Inductor, AOTI currently unlifts weights and buffers from input args, so the reinplace pass didn't really work for AOTI because it only checks mutation on placeholder, which led to excessive memory copies for kv_cache updates in LLM models. This PR removes those memory copies and roughly offers a 2x speedup. In the future, we will revert the unlift logic in AOTI and make the behavior consistent with JIT Inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127297
Approved by: https://github.com/peterbell10, https://github.com/chenyang78
2024-05-29 13:07:53 +00:00
af69a52f06 Reapply "Remove more of caffe2 (#126705)" (#127317)
This reverts commit 00fe0a0d795680ade029fc552f33fffed75c0250.

Originally was unnecessarily reverted by an oncall. Landing again.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127317
Approved by: https://github.com/izaitsevfb
2024-05-29 12:20:25 +00:00
749a132fb0 [BE] wrap deprecated function/class with typing_extensions.deprecated (#126898)
Use `typing_extensions.deprecated` for deprecation annotation if possible. Otherwise, add `category=FutureWarning` to `warnings.warn("message")` if the category is missing.

Note that only warnings whose messages contain `[Dd]eprecat(ed|ion)` are updated in this PR.

UPDATE: Use `FutureWarning` instead of `DeprecationWarning`.
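
A minimal sketch of the fallback path, passing `category=FutureWarning` to `warnings.warn`, using only the standard library. `old_api` and its message are hypothetical; where `typing_extensions` is available, its `deprecated` decorator can carry the same message and category instead:

```python
import warnings


def old_api() -> int:
    """Hypothetical deprecated function that warns with an explicit category."""
    warnings.warn(
        "old_api is deprecated, use new_api instead",
        category=FutureWarning,
        stacklevel=2,  # attribute the warning to the caller, not to old_api
    )
    return 42


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    value = old_api()

print(value, caught[0].category.__name__)
```

Under Python's default filters, `FutureWarning` is shown to end users, whereas `DeprecationWarning` is ignored unless it is triggered by code in `__main__`.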

Resolves #126888

- #126888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126898
Approved by: https://github.com/albanD
2024-05-29 12:09:27 +00:00
cyy
699db7988d [Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2024-05-29 11:58:03 +00:00
02b1cdab23 [Sync torch_FA2 and FA2 flash_api] + [Expose seqused_k & alibi_slopes arguments] (#126520)
1. **Expose seqused_k & alibi_slopes arguments**:
- This can be used when your sequence length k is not the full extent of the tensor. This is useful for kv cache scenarios and was not previously supported in the FA2 TORCH integration. We need these arguments for external xformers lib call to the _flash_attention_forward API.
Before:
```
  std::optional<Tensor> seqused_k = c10::nullopt;
  std::optional<Tensor> alibi_slopes = c10::nullopt;
```
After:
```
_flash_attention_forward(...
    std::optional<Tensor>& seqused_k,
    std::optional<Tensor>& alibi_slopes,
```

2. There is a difference between the **TORCH_FA2_flash_api:mha_fwd** and **FA2_flash_api:mha_fwd** (same for **mha_varlen_fwd**) at the query transposition (GQA) step.

The **CHECK_SHAPE** is applied on the original query vs the reshaped query. This causes an error (because of the shape constraint) for such inputs:
```
q = torch.randn([7, 1, 4, 256], dtype=torch.bfloat16, device='cuda')
k = torch.randn([7, 51, 1, 256], dtype=torch.bfloat16, device='cuda')
v = torch.randn([7, 51, 1, 256], dtype=torch.bfloat16, device='cuda')
```

![image](https://github.com/pytorch/pytorch/assets/927999/77ea6bf6-b6e9-4f3f-96a9-8d952956ddd9)

- I've modified the code as little as possible, but if you prefer a more verbose change like the following, don't hesitate to tell me:
```
at::Tensor swapped_q = seqlenq_ngroups_swapped
    ? q.reshape({batch_size, num_heads_k, num_heads / num_heads_k, head_size_og}).transpose(1, 2)
    : q;

if (seqlenq_ngroups_swapped) {
    seqlen_q = num_heads / num_heads_k;
    num_heads = num_heads_k;
}

CHECK_SHAPE(swapped_q, batch_size, seqlen_q, num_heads, head_size_og);
```
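
The shape arithmetic in the snippet above can be checked in plain Python. The sketch below mirrors the reshape and transpose, taking the swap decision as an input because the exact condition for `seqlenq_ngroups_swapped` is not shown here; `swapped_query_shape` is a hypothetical helper:

```python
def swapped_query_shape(q_shape, num_heads_k: int, swapped: bool):
    """Shape that CHECK_SHAPE should see for q: (batch, seqlen_q, num_heads, head_size)."""
    batch_size, seqlen_q, num_heads, head_size = q_shape
    if swapped:
        # reshape to (batch, num_heads_k, num_heads / num_heads_k, head_size),
        # then transpose(1, 2): the groups take the place of the sequence dimension.
        seqlen_q = num_heads // num_heads_k
        num_heads = num_heads_k
    return (batch_size, seqlen_q, num_heads, head_size)


# q and k shapes from the failing example above: k has a single KV head.
q_shape = (7, 1, 4, 256)
k_shape = (7, 51, 1, 256)
expected = swapped_query_shape(q_shape, num_heads_k=k_shape[2], swapped=True)
print(expected)
```

Checking the original `q` against this shape fails because its dimensions are still `(7, 1, 4, 256)`, which is the mismatch described above.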
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126520
Approved by: https://github.com/drisspg
2024-05-29 11:54:44 +00:00
dae33a4961 [inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)
As part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet which will be added in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068
Approved by: https://github.com/jansel
ghstack dependencies: #124021, #126019
2024-05-29 11:15:41 +00:00
65af1a9c26 FIX the document of distributed.new_group() (#122703)
As of now, the documentation of distributed.new_group() says that it returns `None` when the current rank is not in the newly created process group. However, it actually returns `GroupMember.NON_GROUP_MEMBER`. I have checked the code and think it is more appropriate to fix the documentation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122703
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2024-05-29 09:40:25 +00:00
6c81856dca [inductor] Add a subprocess-based parallel compile (#126816)
Summary:
Adds a "safe" parallel compile implementation that a) Popens a sub-process with an entry point we control, and b) uses a ProcessPoolExecutor in that sub-process to perform parallel compiles. This change essentially squashes these two implementations from jansel, but removes the "thread-based" approach since benchmarking revealed that compile-time performance was poor compared to the existing impl:
https://github.com/pytorch/pytorch/pull/124682
https://github.com/pytorch/pytorch/pull/122941

This PR adds the implementation, but defaults to the existing "fork". I'll submit a separate change to enable.
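
The two-level structure described in the summary can be sketched with the standard library alone. This is an illustration under stated assumptions, not the Inductor implementation: `WORKER_SOURCE` and `compile_in_subprocess` are hypothetical names, and the built-in `len` stands in for a real compile function.

```python
import json
import subprocess
import sys

# Entry point executed inside the sub-process. The process pool lives here,
# so the parent process never forks its own (possibly threaded) state.
WORKER_SOURCE = r"""
import json, sys
from concurrent.futures import ProcessPoolExecutor

if __name__ == "__main__":
    jobs = json.load(sys.stdin)
    with ProcessPoolExecutor(max_workers=2) as pool:
        # `len` is a placeholder for the per-kernel compile step.
        results = list(pool.map(len, jobs))
    json.dump(results, sys.stdout)
"""


def compile_in_subprocess(jobs):
    """Send `jobs` to a fresh interpreter and return its results in order."""
    proc = subprocess.Popen(
        [sys.executable, "-c", WORKER_SOURCE],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        text=True,
    )
    out, _ = proc.communicate(json.dumps(jobs))
    if proc.returncode != 0:
        raise RuntimeError(f"worker exited with code {proc.returncode}")
    return json.loads(out)
```

Because the parent only talks to the worker over pipes, the pool's start method is decided inside an interpreter the caller fully controls.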

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126816
Approved by: https://github.com/jansel
2024-05-29 09:40:21 +00:00
92bc444ee3 [inductor][cpp] epilogue support for gemm template (#126019)
As part of #125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019
Approved by: https://github.com/jansel
ghstack dependencies: #124021
2024-05-29 09:12:03 +00:00
00999fd8b1 Prefer flip over index_select (#126783)
It's faster and has a lower memory footprint in eager.
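
For reference, a small sketch of the two equivalent ways to reverse a tensor along one dimension. The tensor shape is arbitrary and chosen only for illustration.

```python
import torch

x = torch.arange(12).reshape(4, 3)

# Reverse along dim 0 by gathering rows through an explicit index tensor.
reverse_idx = torch.arange(x.size(0) - 1, -1, -1)
via_index_select = x.index_select(0, reverse_idx)

# Same result without materializing the index tensor.
via_flip = torch.flip(x, dims=[0])
```
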
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126783
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #114471
2024-05-29 09:10:25 +00:00
8a21532e53 Fix constant propagation pass (#114471)
This pass was broken in a number of ways, as we were not generating
asserts whenever we took it, even though we need to. While doing so,
we found that the analysis we were using for choosing
whether to generate asserts or not for dynamic shapes was completely
broken.

Eliminating indirect indexing in this way allows for a number of optimisations.
In particular, we can now fuse against these kernels (indirect indexing disallows fusions).

The new strategy is as follows:

- We always propagate sympy expressions if we can.
- If an expression was an indirect_indexing, we call `check_bounds`
- We also call `check_bounds` within `CSEProxy.indirect_indexing`
- The checks are issued in the buffer where they would go if they were used in a load
   - This makes them always be codegen'd before the load and stores
   - In the case of stores, they will be generated potentially much earlier than the stores themselves, which is fine.

We add quite a few asserts to preexisting tests to strengthen them. In particular, we make sure
that issuing an assert plays well with all kinds of C++ vectorisation.

For now, we rely on the logic within `_maybe_evaluate_static` to prove
these bounds. This logic is rather limited though. In the future, we might want
to rely on Z3 here to be able to prove bounds in a more general way.
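
The idea of proving a bound statically and only otherwise emitting a runtime assert can be sketched as follows. This is a simplified stand-in, not the logic of `check_bounds` or `_maybe_evaluate_static`; `ValueRange` and `bounds_check` are hypothetical names and the emitted text is illustrative.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ValueRange:
    lower: int
    upper: int  # inclusive


def bounds_check(index_name: str, index_range: ValueRange, size: int) -> Optional[str]:
    """Return None when 0 <= index < size follows from the range alone,
    otherwise the text of a runtime assert to emit before the load."""
    checks = []
    if index_range.lower < 0:
        checks.append(f"0 <= {index_name}")
    if index_range.upper >= size:
        checks.append(f"{index_name} < {size}")
    if not checks:
        return None  # proven statically, nothing to emit
    return "assert " + " and ".join(checks)
```

Only the half of the bound that cannot be proven is emitted, which keeps the generated kernel free of redundant checks.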

Supersedes https://github.com/pytorch/pytorch/pull/113068
Fixes https://github.com/pytorch/pytorch/issues/121251

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114471
Approved by: https://github.com/peterbell10
2024-05-29 09:10:25 +00:00
51b22d9cf2 [dynamo] Support enum construction (#127364)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127364
Approved by: https://github.com/yanboliang
ghstack dependencies: #127263
2024-05-29 08:09:21 +00:00
ad7700bfdb [inductor] Misc changes (#127307)
Pulling unrelated changes out of the larger halide PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127307
Approved by: https://github.com/yanboliang
2024-05-29 08:00:06 +00:00
cef776bcd1 [inductor][cpp] GEMM template (infra and fp32) (#124021)
This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.
1. Cpp template infrastructure
Template abstractions similar to those of the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`, plus the MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates.
2. Initial FP32 gemm template
This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with constant weight (`B`), requiring `N` to be a multiple of the register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix would be prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm` which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with ATen vec abstraction.
3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static shapes and dynamic shapes. Since it is an initial implementation, we are still working on further performance improvements with follow-up PRs, including optimizations in kernels as well as fusions. Compared to the ATen kernels, which are implemented with MKL, perf gains are observed only on a select number of models. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are details.

Static shapes
| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up:
drq: 1.14x
soft_act: 1.12
cait_m36_384: 1.18x

Dynamic shapes
| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up:
BERT_pytorch: 1.22x
pyhpc_turbulent: 1.13x
soft_actor_critic: 1.77x
BlenderbotForCausalLM: 1.09x
cait_m36_384: 1.17x
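
The layering described in item 2 (an outer template that blocks the problem, and a micro-kernel that computes one tile) can be sketched in plain Python. This is an illustration only: the function names and block sizes are hypothetical, thread decomposition and weight prepacking are omitted, and the real template generates C++.

```python
def micro_gemm(A, B, C, m0, m1, n0, n1, k0, k1):
    """Innermost kernel: accumulate one tile of C in place. In a real
    micro-kernel this is where register blocking would live."""
    for i in range(m0, m1):
        for j in range(n0, n1):
            acc = C[i][j]
            for k in range(k0, k1):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc


def blocked_gemm(A, B, mc=2, nc=2, kc=2):
    """Outer loops: walk the M, N and K dimensions in cache-sized blocks
    and hand each block to the micro-kernel."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, mc):
        for n0 in range(0, N, nc):
            for k0 in range(0, K, kc):
                micro_gemm(A, B, C,
                           m0, min(m0 + mc, M),
                           n0, min(n0 + nc, N),
                           k0, min(k0 + kc, K))
    return C
```

Keeping the tile loop separate from the block loops is what lets the micro-kernel be swapped per data type or CPU architecture without touching the outer template.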

Differential Revision: [D57585365](https://our.internmc.facebook.com/intern/diff/D57585365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021
Approved by: https://github.com/jansel
2024-05-29 07:37:41 +00:00
719589c9bf [dynamo] move bytecode tests from test_misc to new bytecode test file (#127329)
Also merge with bytecode hook test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127329
Approved by: https://github.com/yanboliang, https://github.com/jansel
2024-05-29 06:10:59 +00:00
a60b06bd2b [dtensor] update public API docs (#127340)
This PR updates the API documentations for the public facing APIs

More examples are needed for each API; the plan is to add them in a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127340
Approved by: https://github.com/wz337
ghstack dependencies: #127338, #127339
2024-05-29 05:18:47 +00:00
2c9a420da3 [dtensor] move some modules to private namespace (#127339)
as titled, moving some modules that are mainly for DTensor private usage
to be a private module.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127339
Approved by: https://github.com/awgu
ghstack dependencies: #127338
2024-05-29 05:18:47 +00:00
72ef2555e3 [dtensor] make Partial placement public (#127338)
As titled, partial placement is standardized right now and I think we
would want to expose this as a public API to allow users to annotate
the sharding layout more easily, given that we already have use cases,
both internal and external, that use Partial.

Keeping the old _Partial name for a while for BC reasons.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127338
Approved by: https://github.com/awgu, https://github.com/wz337, https://github.com/kwen2501
2024-05-29 05:18:47 +00:00
5359af0c7e [dynamo] wrap GraphModule exceptions in dynamo-wrapped tests (#126341)
Better approach to https://github.com/pytorch/pytorch/pull/126197 to catch issues like https://github.com/pytorch/pytorch/issues/125568.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126341
Approved by: https://github.com/anijain2305, https://github.com/jansel
2024-05-29 05:18:04 +00:00
cdf2133186 Add compile time profiler for non fbcode targets (#126904)
This is a tool that allows profiling compile time using the strobelight profiler. It is a Meta-only tool,
but it works on non-fbcode targets.

A follow up diff will unify this with caffe2/fb/strobelight/compile_time_profiler.py.
example test:

```
run  python tools/strobelight/examples/compile_time_profile_example.py
```

```
python torch/utils/_strobelight/examples/compile_time_profile_example.py
strobelight_compile_time_profiler, line 61, 2024-05-23 10:49:28,101, INFO: compile time strobelight profiling enabled
strobelight_compile_time_profiler, line 93, 2024-05-23 10:49:28,102, INFO: Unique sample tag for this run is: 2024-05-23-10:49:282334638devvm4561.ash0.facebook.com
strobelight_compile_time_profiler, line 94, 2024-05-23 10:49:28,102, INFO: You can use the following link to access the strobelight profile at the end of the run: https://www.internalfb.com/intern/scuba/query/?dataset=pyperf_experimental%2Fon_demand&drillstate=%7B%22purposes%22%3A[]%2C%22end%22%3A%22now%22%2C%22start%22%3A%22-30%20days%22%2C%22filterMode%22%3A%22DEFAULT%22%2C%22modifiers%22%3A[]%2C%22sampleCols%22%3A[]%2C%22cols%22%3A[%22namespace_id%22%2C%22namespace_process_id%22]%2C%22derivedCols%22%3A[]%2C%22mappedCols%22%3A[]%2C%22enumCols%22%3A[]%2C%22return_remainder%22%3Afalse%2C%22should_pivot%22%3Afalse%2C%22is_timeseries%22%3Afalse%2C%22hideEmptyColumns%22%3Afalse%2C%22timezone%22%3A%22America%2FLos_Angeles%22%2C%22compare%22%3A%22none%22%2C%22samplingRatio%22%3A%221%22%2C%22metric%22%3A%22count%22%2C%22aggregation_field%22%3A%22async_stack_complete%22%2C%22top%22%3A10000%2C%22aggregateList%22%3A[]%2C%22param_dimensions%22%3A[%7B%22dim%22%3A%22py_async_stack%22%2C%22op%22%3A%22edge%22%2C%22param%22%3A%220%22%2C%22anchor%22%3A%220%22%7D]%2C%22order%22%3A%22weight%22%2C%22order_desc%22%3Atrue%2C%22constraints%22%3A[[%7B%22column%22%3A%22sample_tags%22%2C%22op%22%3A%22all%22%2C%22value%22%3A[%22[%5C%222024-05-23-10:49:282334638devvm4561.ash0.facebook.com%5C%22]%22]%7D]]%2C%22c_constraints%22%3A[[]]%2C%22b_constraints%22%3A[[]]%2C%22ignoreGroupByInComparison%22%3Afalse%7D&view=GraphProfilerView&&normalized=1712358002&pool=uber
strobelight_function_profiler, line 241, 2024-05-23 10:49:34,943, INFO: strobelight run id is: 3507039740348330
strobelight_function_profiler, line 243, 2024-05-23 10:50:00,907, INFO: strobelight profiling running
strobelight_function_profiler, line 224, 2024-05-23 10:50:02,741, INFO: strobelight profiling stopped
strobelight_function_profiler, line 215, 2024-05-23 10:50:06,173, INFO: Total samples: 7
strobelight_function_profiler, line 215, 2024-05-23 10:50:06,173, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/75cxdro3
strobelight_function_profiler, line 215, 2024-05-23 10:50:06,173, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/qsgydsee
strobelight_compile_time_profiler, line 120, 2024-05-23 10:50:06,174, INFO: 1 strobelight success runs out of 1 non-recursive compilation events.
strobelight_function_profiler, line 241, 2024-05-23 10:50:08,137, INFO: strobelight run id is: 8721740011604497
strobelight_function_profiler, line 243, 2024-05-23 10:50:34,801, INFO: strobelight profiling running
strobelight_function_profiler, line 224, 2024-05-23 10:50:36,803, INFO: strobelight profiling stopped
strobelight_function_profiler, line 215, 2024-05-23 10:50:41,289, INFO: Total samples: 3
strobelight_function_profiler, line 215, 2024-05-23 10:50:41,289, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/qmi2ucwp
strobelight_function_profiler, line 215, 2024-05-23 10:50:41,289, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/7fjkhs9i
strobelight_compile_time_profiler, line 120, 2024-05-23 10:50:41,289, INFO: 2 strobelight success runs out of 2 non-recursive compilation events.
strobelight_function_profiler, line 241, 2024-05-23 10:50:43,597, INFO: strobelight run id is: 1932476082259558
strobelight_function_profiler, line 243, 2024-05-23 10:51:09,791, INFO: strobelight profiling running
strobelight_function_profiler, line 224, 2024-05-23 10:51:11,883, INFO: strobelight profiling stopped
strobelight_function_profiler, line 215, 2024-05-23 10:51:16,218, INFO: Total samples: 3
strobelight_function_profiler, line 215, 2024-05-23 10:51:16,218, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/vy1ujxec
strobelight_function_profiler, line 215, 2024-05-23 10:51:16,218, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/2xgadviv
strobelight_compile_time_profiler, line 120, 2024-05-23 10:51:16,219, INFO: 3 strobelight success runs out of 3 non-recursive compilation events.
```

or pass TORCH_COMPILE_STROBELIGHT=TRUE for any torch compile python program.
ex running on XLNetLMHeadModel.
```
 TORCH_COMPILE_STROBELIGHT=TRUE TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 time python benchmarks/dynamo/huggingface.py --ci --accuracy --timing --explain --inductor --device cuda --training --amp  --only XLNetLMHeadModel
 ```
 result:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126904
Approved by: https://github.com/aorenste
ghstack dependencies: #126444
2024-05-29 05:06:37 +00:00
2b72e2a596 [Cudagraph] better support for streams (#126809)
This PR fixes Issue #124391.

There are two root causes.

### Root Cause 1 [better support for stream during cudagraph capture]

When recording a new function, CUDA graph tree records memory block states (e.g., address, size, allocated, etc) via `getCheckpointState`. Let's say the record is called `block_state`.

Later, CUDA graph tree would like to recover exactly the same memory block states by `apply_checkpoint_execution_state_in_allocator`, which a) frees all memory blocks; b) allocates all recorded block states (regardless of `block_state->allocated`); c) frees blocks with `block_state->allocated == False`; and d) checks that block_state matches the remaining blocks (e.g., `block_state->ptr == block->ptr`).

An error may occur when multiple streams exist during recording. [Note](https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp#L2149-L2152) that a block will not be merged with other blocks if it is used by some streams, even if `block->allocated==False`. This may lead to a mismatch between `block_state->ptr` and `block->ptr` in `apply_checkpoint_execution_state_in_allocator`.

This PR solves the issue by not inserting events if they come from a stream used during cudagraph capture. The reason is that we know all events or streams used during cudagraph capture must have completed before cudagraph capture finishes.

### Root Cause 2 [fix a bug in checkpoint state]
When we call `getCheckpointState`, we create the block state. At that time, we do not record `block->device`, so `block_state->device == 0` no matter the real value of `block->device`. See [how](https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp#L744-L750) BlockState is created from a block.

When the block state is used during `setSegmentStateToCheckpoint`, we use [block_state.device (=0)](https://github.com/pytorch/pytorch/blob/main/c10/cuda/CUDACachingAllocator.cpp#L1526). This leads to errors.

We fixed this issue by recording block->device into block_state in getCheckpointState.
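
Root Cause 2 is an instance of a snapshot that silently keeps a default value. A Python analogue, with hypothetical `Block`, `BlockState`, and snapshot helpers standing in for the C++ structures, might look like this:

```python
from dataclasses import dataclass


@dataclass
class Block:
    device: int
    ptr: int
    size: int
    allocated: bool


@dataclass
class BlockState:
    ptr: int
    size: int
    allocated: bool
    device: int = 0  # stays 0 unless the snapshot copies it explicitly


def snapshot_buggy(block: Block) -> BlockState:
    # Mirrors the bug described above: the device is never recorded.
    return BlockState(ptr=block.ptr, size=block.size, allocated=block.allocated)


def snapshot_fixed(block: Block) -> BlockState:
    # Mirrors the fix: carry the device over into the recorded state.
    return BlockState(ptr=block.ptr, size=block.size,
                      allocated=block.allocated, device=block.device)
```

On a single-device machine both versions agree, which is one reason this kind of bug can go unnoticed until a block lives on a device other than 0.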

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126809
Approved by: https://github.com/eellison
2024-05-29 04:52:35 +00:00
a41f828da7 [c10d] fix group_name/group_desc set up in eager initialization (#127053)
Summary:
ProcessGroupNCCL sets up group_name/desc in the c10d log and NCCL when initializing the NCCL communicator. In eager initialization mode, pg_name and pg_desc are set after communicator initialization, so the information won't be available in the PyTorch log or the NCCL communicator.

This PR fixes this by setting pg_name/desc earlier.

Differential Revision: D57759816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127053
Approved by: https://github.com/wconstab, https://github.com/kwen2501
2024-05-29 04:42:34 +00:00
932e04142d extract calculate_time_spent from print_time_report (#127362)
wrap certain steps in a separate function for easier TTFB instrumentation (fb internal use case)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127362
Approved by: https://github.com/yanboliang, https://github.com/mengluy0125
2024-05-29 04:37:15 +00:00
a25b28a753 [Split Build] Add option to create libtorch wheel and use it to build pytorch as a separate wheel (#126328)
Creates an option to just build the libtorch portion of pytorch such that we have the necessary .so files. Then it builds a torch package using the libtorch wheel. These options are enabled using `BUILD_LIBTORCH_WHL` and `BUILD_PYTHON_ONLY`.

We run

```
BUILD_LIBTORCH_WHL=1 python setup.py install
python setup.py clean
BUILD_PYTHON_ONLY=1 python setup.py install
```

to produce

```
sahanp@devgpu086 ~/pytorch (detached HEAD|REBASE-i 3/5)> ls /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/torch/lib/
libshm.so*  libtorch_global_deps.so*  libtorch_python.so*
sahanp@devgpu086 ~/pytorch (detached HEAD|REBASE-i 3/5)> ldd build/lib/libtorch_python.so
        linux-vdso.so.1 (0x00007ffdc2d37000)
        libtorch.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libtorch.so (0x00007f539fe99000)
        libshm.so => /home/sahanp/pytorch/build/lib/libshm.so (0x00007f539fe90000)
        libcudnn.so.8 => /usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn.so.8 (0x00007f539e800000)
        libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x00007f539e400000)
        libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f539e000000)
        libm.so.6 => /lib64/libm.so.6 (0x00007f539fda5000)
        libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f539ebe5000)
        libc.so.6 => /lib64/libc.so.6 (0x00007f539dc00000)
        /lib64/ld-linux-x86-64.so.2 (0x00007f539fea0000)
        libtorch_cpu.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libtorch_cpu.so (0x00007f5392400000)
        libtorch_cuda.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libtorch_cuda.so (0x00007f5380000000)
        librt.so.1 => /lib64/librt.so.1 (0x00007f539fd9e000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f539fd99000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007f539fd94000)
        libc10.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libc10.so (0x00007f539eb07000)
        libmkl_intel_lp64.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_intel_lp64.so.2 (0x00007f537ec00000)
        libmkl_gnu_thread.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_gnu_thread.so.2 (0x00007f537ce00000)
        libmkl_core.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_core.so.2 (0x00007f5378800000)
        libomp.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/libomp.so (0x00007f539e707000)
        libcupti.so.12 => /usr/local/cuda/lib64/libcupti.so.12 (0x00007f5377e00000)
        libcudart.so.12 => /usr/local/cuda/lib64/libcudart.so.12 (0x00007f5377a00000)
        libc10_cuda.so => /home/sahanp/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/libtorch/lib/libc10_cuda.so (0x00007f539ea6a000)
        libcusparse.so.12 => /usr/local/cuda/lib64/libcusparse.so.12 (0x00007f5368400000)
        libcufft.so.11 => /usr/local/cuda/lib64/libcufft.so.11 (0x00007f535ee00000)
        libcusolver.so.11 => /usr/local/cuda/lib64/libcusolver.so.11 (0x00007f534c800000)
        libcurand.so.10 => /usr/local/cuda/lib64/libcurand.so.10 (0x00007f5346200000)
        libcublas.so.12 => /usr/local/cuda/lib64/libcublas.so.12 (0x00007f533f800000)
        libcublasLt.so.12 => /usr/local/cuda/lib64/libcublasLt.so.12 (0x00007f531e800000)
        libutil.so.1 => /lib64/libutil.so.1 (0x00007f539ea63000)
        libnvJitLink.so.12 => /usr/local/cuda/lib64/libnvJitLink.so.12 (0x00007f531b800000)
sahanp@devgpu086 ~/pytorch (detached HEAD|REBASE-i 3/5)> ldd build/lib/libtorch_global_deps.so
        linux-vdso.so.1 (0x00007ffc265df000)
        libmkl_intel_lp64.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_intel_lp64.so.2 (0x00007fa93fc00000)
        libmkl_gnu_thread.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_gnu_thread.so.2 (0x00007fa93de00000)
        libmkl_core.so.2 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libmkl_core.so.2 (0x00007fa939800000)
        libm.so.6 => /lib64/libm.so.6 (0x00007fa940f05000)
        libcudart.so.12 => /usr/local/cuda/lib64/libcudart.so.12 (0x00007fa939400000)
        libnvToolsExt.so.1 => /usr/local/cuda/lib64/libnvToolsExt.so.1 (0x00007fa939000000)
        libgomp.so.1 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libgomp.so.1 (0x00007fa93fb07000)
        libc.so.6 => /lib64/libc.so.6 (0x00007fa938c00000)
        libdl.so.2 => /lib64/libdl.so.2 (0x00007fa940efe000)
        libpthread.so.0 => /lib64/libpthread.so.0 (0x00007fa940ef9000)
        /lib64/ld-linux-x86-64.so.2 (0x00007fa940ff5000)
        librt.so.1 => /lib64/librt.so.1 (0x00007fa940ef2000)
        libstdc++.so.6 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libstdc++.so.6 (0x00007fa93921d000)
        libgcc_s.so.1 => /home/sahanp/.conda/envs/pytorch-3.10/lib/libgcc_s.so.1 (0x00007fa93faec000)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126328
Approved by: https://github.com/atalman
2024-05-29 04:33:56 +00:00
8090145936 [pipelining] add back support for multi-use parameters/buffers (#126653)
## Motivation
Resolves #126626 to support TorchTitan.

With this PR, we add back support for cases where a parameter or buffer is used in multiple stages. An example of such usage is in LLaMA (torchtitan), code snippet:
```
for layer in self.layers.values():
    h = layer(h, self.freqs_cis)
```

## Solution
Step 1:
Remove the previous guards of `if len(node.users) == 1`.
Step 2:
Call `move_param_to_callee` multiple times, one for each stage ("callee").
Step 3:
Delay deletion of the `get_attr` node (for getting the param) from root till this param has been sunk into each stage that uses it.

The PR also cleans up the old code around this (dropping the TRANSMIT mode and supporting REPLICATE mode only).
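
The three steps can be illustrated with a toy model in which the graph is reduced to a mapping from each parameter to the stages that use it. This is not the `torch.fx` pass itself; `sink_params` and its arguments are hypothetical.

```python
def sink_params(param_users, root_params):
    """Move every parameter into each stage that uses it.

    param_users: parameter name -> list of stage names that use it.
    root_params: parameter name -> value held by the root module.
    Returns (stage name -> {parameter name: value}, leftover root params).
    """
    stage_params = {}
    remaining = dict(root_params)
    for name, stages in param_users.items():
        # One move per using stage ("callee"), with no single-user guard.
        for stage in stages:
            stage_params.setdefault(stage, {})[name] = root_params[name]
        # Drop the root entry only after every user has received it.
        remaining.pop(name, None)
    return stage_params, remaining
```

Deleting the root entry inside the inner loop instead would lose the parameter for every stage after the first, which is the failure the delayed deletion in Step 3 avoids.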

## Test
Changed the `ExampleCode` model to use `mm_param1` in multiple stages.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126653
Approved by: https://github.com/pianpwk
2024-05-29 03:36:47 +00:00
781f26240a Add script to copy distributed commits to stable branch (#126918)
This will be used as part of a prototype of a stable pytorch with a fast-moving distributed folder

Tasks: T189915739

Test plan:

I ran the script in a few configurations on my local machine. It worked as expected
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126918
Approved by: https://github.com/seemethere, https://github.com/malfet
2024-05-29 03:33:44 +00:00
10d2373abd Add a registry for GraphModuleSerializer (#126550)
This PR adds a registration function and a global registry for GraphModuleSerializer. After this PR, custom serialization methods can be done through registration instead of subclassing for ease of maintenance.
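
The general pattern of registering handlers in a global registry rather than subclassing can be sketched as below. All names here (`register_serializer`, `serialize_op`, `MyCustomOp`) are hypothetical and do not reflect the actual `GraphModuleSerializer` API.

```python
from typing import Any, Callable, Dict

# Global registry: op type -> function that serializes an op of that type.
_SERIALIZER_REGISTRY: Dict[type, Callable[[Any], str]] = {}


def register_serializer(op_type: type):
    """Decorator that records `fn` as the handler for `op_type`."""
    def decorator(fn):
        _SERIALIZER_REGISTRY[op_type] = fn
        return fn
    return decorator


def serialize_op(op) -> str:
    """Dispatch to the first registered handler whose type matches `op`."""
    for op_type, handler in _SERIALIZER_REGISTRY.items():
        if isinstance(op, op_type):
            return handler(op)
    raise TypeError(f"no serializer registered for {type(op).__name__}")


class MyCustomOp:
    def __init__(self, name: str):
        self.name = name


@register_serializer(MyCustomOp)
def _serialize_my_custom_op(op: MyCustomOp) -> str:
    return f"custom::{op.name}"
```

With this shape, supporting a new op type means adding one decorated function, and the serializer class itself does not change.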

## Changes
- Add a test case where it injects custom op to test serialization.
- Add custom op handler
- Change allowed op for verifier
Co-authored-by: Zhengxu Chen <zhxchen17@outlook.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126550
Approved by: https://github.com/zhxchen17
2024-05-29 03:12:48 +00:00
cdbb2c9acc Revert "[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)"
This reverts commit 4fdbaa794f9d5af2f171f772a51cb710c51c925f.

Reverted https://github.com/pytorch/pytorch/pull/127051 on behalf of https://github.com/PaliC due to This PR needs to be synced using the import button as there is a bug in our diff train ([comment](https://github.com/pytorch/pytorch/pull/127051#issuecomment-2136428735))
2024-05-29 03:02:35 +00:00
7a506dd005 Revert "[Caffe2]Remove Caffe2 proto files (#126134)"
This reverts commit a40658481ada9ecfd5716513a8537818c79cb3ef.

Reverted https://github.com/pytorch/pytorch/pull/126134 on behalf of https://github.com/malfet due to Broke bazel builds, see https://github.com/pytorch/pytorch/actions/runs/9278148147/job/25528691981 ([comment](https://github.com/pytorch/pytorch/pull/126134#issuecomment-2136373096))
2024-05-29 01:53:45 +00:00
cyy
669560d51a Use hidden visibility in OBJECTCXX files (#127265)
Since it can eliminate some linker warnings on MacOS

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127265
Approved by: https://github.com/ezyang, https://github.com/malfet
2024-05-29 01:40:23 +00:00
52e448a7f9 Revert "Enable Wunused-variable on tests (#127161)"
This reverts commit 6436a6407d9d65c42efb8e55beeb8b391b67fd64.

Reverted https://github.com/pytorch/pytorch/pull/127161 on behalf of https://github.com/malfet due to Broke ReduceTests on Windows (by testing more), see https://github.com/pytorch/pytorch/actions/runs/9274944325/job/25519484937 ([comment](https://github.com/pytorch/pytorch/pull/127161#issuecomment-2136339435))
2024-05-29 01:09:45 +00:00
85172fbe84 Back out "Prevent partitioner from ever saving views (#126446)" (#127316)
Summary: Revert "Prevent partitioner from ever saving views (#126446)" due to a torchinductor failure on CU Training Framework tests.

Reviewed By: Chillee

Differential Revision: D57868343

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127316
Approved by: https://github.com/Chillee
2024-05-29 00:29:44 +00:00
cyy
a40658481a [Caffe2]Remove Caffe2 proto files (#126134)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126134
Approved by: https://github.com/r-barnes
2024-05-29 00:22:14 +00:00
f4cbcff8ef [TorchScript] Expand TorchScript __init__ annotation warning (#127045)
Summary:
Expand TorchScript `__init__` annotation warning to `list` and `dict` with reference to GSD task T187638414 and annotation warning reproduction D56834720.

Currently, the TorchScript compiler ignores and throws `UserWarning`s for the following annotation types for empty values within the `__init__` function: `List`, `Dict`, `Optional`. However, the compiler should additionally cover warnings for `list` and `dict`. This diff adds support for `list` and `dict`.
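
To make the affected pattern concrete, here is a standalone sketch that uses the standard `ast` module to find annotated empty assignments in `__init__`. It is not the TorchScript checker: `find_flagged_attributes` is a hypothetical helper, and treating `[]`, `{}`, and `None` as the "empty values" is an assumption.

```python
import ast

# Annotation names the commit message lists, including the new lowercase ones.
WARNED_ANNOTATIONS = {"List", "Dict", "Optional", "list", "dict"}


def _base_name(annotation):
    # `List[int]` -> "List", `typing.Dict[str, int]` -> "Dict", `list` -> "list"
    if isinstance(annotation, ast.Subscript):
        annotation = annotation.value
    if isinstance(annotation, ast.Attribute):
        return annotation.attr
    if isinstance(annotation, ast.Name):
        return annotation.id
    return None


def _is_empty_value(value):
    if isinstance(value, ast.List):
        return not value.elts
    if isinstance(value, ast.Dict):
        return not value.keys
    return isinstance(value, ast.Constant) and value.value is None


def find_flagged_attributes(source: str):
    """Return names of `self.<attr>` assignments in `__init__` that pair a
    container/Optional annotation with an empty value."""
    flagged = []
    for func in ast.walk(ast.parse(source)):
        if isinstance(func, ast.FunctionDef) and func.name == "__init__":
            for stmt in ast.walk(func):
                if (isinstance(stmt, ast.AnnAssign)
                        and stmt.value is not None
                        and isinstance(stmt.target, ast.Attribute)
                        and _base_name(stmt.annotation) in WARNED_ANNOTATIONS
                        and _is_empty_value(stmt.value)):
                    flagged.append(stmt.target.attr)
    return flagged
```

For example, `self.a: list[int] = []` and `self.b: dict[str, int] = {}` are both reported, while `self.d: int = 0` and `self.e: list[int] = [1]` are not.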

Test Plan:
Added 4 new unit tests:

`test_annotated_empty_list_lowercase` and `test_annotated_empty_dict_lowercase` verify that TorchScript throws UserWarnings for the list and dict type annotations on empty values.
```
(base) [jananisriram@devvm2248.cco0 /data/users/jananisriram/fbsource/fbcode (e4ce427eb)]$ buck2 test @mode/{opt,inplace} //caffe2/test:jit -- --regex test_annotated_empty_list_lowercase
...
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
(base) [jananisriram@devvm2248.cco0 /data/users/jananisriram/fbsource/fbcode (e4ce427eb)]$ buck2 test @mode/{opt,inplace} //caffe2/test:jit -- --regex test_annotated_empty_dict_lowercase
...
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
```

`test_annotated_with_jit_empty_list_lowercase` and `test_annotated_with_jit_empty_dict_lowercase` verify that TorchScript throws UserWarnings for the list and dict type annotations on empty values with the jit annotation.
```
(base) [jananisriram@devvm2248.cco0 /data/users/jananisriram/fbsource/fbcode (e4ce427eb)]$ buck2 test @mode/{opt,inplace} //caffe2/test:jit -- --regex test_annotated_with_jit_empty_list_lowercase
...
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
```
```
(base) [jananisriram@devvm2248.cco0 /data/users/jananisriram/fbsource/fbcode (e4ce427eb)]$ buck2 test @mode/{opt,inplace} //caffe2/test:jit -- --regex test_annotated_with_jit_empty_dict_lowercase
...
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D57752002

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127045
Approved by: https://github.com/davidberard98
2024-05-28 23:49:10 +00:00
1be7e4086a Drop caffe2 nomnigraph (#127086)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127086
Approved by: https://github.com/Skylion007
2024-05-28 23:20:46 +00:00
f6ef832e87 [inductor] Use symbolic_hint when bounding fallback size hint (#127262)
The previous fallback ignores any known hint values in the expression and only
looks at the value ranges. By using the `symbolic_hint` we will use both hints
and value ranges.

Also removed the recursive use of `size_hint` on the bounds, since these should
always be constants.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127262
Approved by: https://github.com/lezcano
ghstack dependencies: #127251
2024-05-28 22:51:45 +00:00
26a8fa3a06 [inductor] Restore ExpandView sanity checks (#127251)
This restores the assertion removed in #124864

The handling of unbacked symints is incidental, the main purpose of this assert
was to catch bugs in lowerings.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127251
Approved by: https://github.com/lezcano
2024-05-28 22:51:45 +00:00
db0a0ecb60 [FSDP2] Added test for N-way TP and 1-way FSDP with CPU offloading (#127024)
This PR shows that we can use FSDP solely for CPU offloading when composing with N-way TP. Each FSDP mesh is just 1 rank.

This was motivated by an ask on Slack :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127024
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
2024-05-28 22:51:36 +00:00
6b24155827 [dtensor][debug] added c10d gather, reduce, scatter tracing to CommDebugMode (#127134)
**Summary**
Added c10d gather, reduce, and scatter tracing to CommDebugMode and edited test case in test_comm_mode to include added features.

**Test Plan**
pytest test/distributed/_tensor/debug/test_comm_mode.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127134
Approved by: https://github.com/XilunWu
ghstack dependencies: #127025, #127029, #127040
2024-05-28 22:48:07 +00:00
eqy
a76faff71c [NCCL][CUDA] Optionally avoid rethrowing CUDA Errors in NCCL Watchdog (#126587)
Doesn't affect current behavior by default, for #126544
I'm not sure what the exact mechanism is here but CUDA errors appear to already be thrown in the main process, meaning that the watchdog is separately throwing CUDA errors again. However this rethrown error causes the process to be terminated as it cannot be handled from user code (which doesn't have visibility of the watchdog thread).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126587
Approved by: https://github.com/kwen2501
2024-05-28 22:17:15 +00:00
93bfe57144 cudagraphs: fix backward hooks & fsdp interaction (#126914)
Fixes

> ERROR: expected to be in states [<TrainingState.FORWARD_BACKWARD: 2>] but current state is TrainingState.IDLE

Error that would occur when composing pt2 fsdp and cudagraphs. Cudagraphs caches output tensor impls in the fast path, so we were inadvertently accumulating multiple hooks on what should have been fresh allocations.

from code comment:
```
# this output represents a fresh allocated tensor.
# We return the same TensorImpl from run to run to avoid overhead.
# autograd.Function will reset the Autograd meta of output tensors
# as part of aot_autograd, but _backward_hooks are stored on tensors separately,
# so we need to manually reset hooks.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126914
Approved by: https://github.com/awgu, https://github.com/xmfan
2024-05-28 22:07:41 +00:00
4154c8358a [BE] Wrap store check in a try/catch (#127030)
Summary:
Global store may already have been destroyed when we do the check.
This leads to a Null Pointer Exception. This caused a SEV in Production.
Stack trace from crash:
```
[trainer2]:# 5  c10d::TCPStore::check(std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const&)
[trainer2]:# 6  c10d::ProcessGroupNCCL::heartbeatMonitor()
```
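
The actual fix lives in the C++ heartbeat monitor; as a rough Python analogue of the same idea, the sketch below guards a store check so that a store torn down mid-shutdown does not crash the monitor. `FakeStore`, `StoreDestroyedError`, and `safe_check` are hypothetical names for illustration, not c10d APIs.

```python
class StoreDestroyedError(RuntimeError):
    """Hypothetical error raised when the store is used after teardown."""


class FakeStore:
    """Hypothetical stand-in for a key-value store that can be torn down."""

    def __init__(self):
        self._keys = {"dump_signal"}
        self._alive = True

    def destroy(self):
        self._alive = False

    def check(self, keys):
        if not self._alive:
            raise StoreDestroyedError("store already destroyed")
        return all(k in self._keys for k in keys)


def safe_check(store, keys):
    """Return the check result, or False if the store is no longer usable."""
    try:
        return store.check(keys)
    except Exception as exc:
        # Treat a failed check as "nothing to do" instead of crashing the monitor.
        print(f"store check failed, skipping: {exc}")
        return False


store = FakeStore()
alive_result = safe_check(store, ["dump_signal"])
store.destroy()
destroyed_result = safe_check(store, ["dump_signal"])
```

The design choice is to fail closed: when the store cannot be queried, the monitor behaves as if no signal was set.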

Test Plan:
Will deploy in small training job and with `NCCL_DUMP_ON_TIMEOUT` set.
Job should complete with no exceptions.

Reviewers:

Subscribers:

Tasks: T190163458

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127030
Approved by: https://github.com/Skylion007, https://github.com/shuqiangzhang
2024-05-28 20:57:36 +00:00
f206c5c628 [export] handle new roots & root swapping in derived dims suggested fixes (#125543)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125543

This PR addresses two issues with derived dim suggested fixes: 1) newly introduced roots, and 2) root swapping.

1 | Newly introduced roots appear with modulo guards, e.g. Mod(dx, 2) = 0 suggests dx is a derived dim equal to 2 * _dx, introducing a new root _dx. Currently the final suggested fixes handle this correctly, but we can get intermediate results where related derived dims don't rely on a unified root, and are a mixture of min/max range and derived suggestions.

For example:
```
"dx": {"eq": 3*_dx-1, "max": 36}
"dy": {"eq": dx+1}
This should lead to suggested fixes
  _dx = Dim('_dx', max=12)
  dx = 3 * _dx - 1
  dy = 3 * _dx
```
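
The arithmetic behind this example can be checked with a small plain-Python sketch. It assumes each derived dim is an affine function `scale * root + offset` with a positive scale; `root_max` and `compose` are hypothetical helpers, not part of `torch.export`.

```python
def root_max(scale, offset, derived_max):
    """Largest integer root r with scale * r + offset <= derived_max (scale > 0)."""
    return (derived_max - offset) // scale


def compose(outer, inner):
    """Compose affine maps given as (scale, offset): outer(inner(r))."""
    (a2, b2), (a1, b1) = outer, inner
    return (a2 * a1, a2 * b1 + b2)


dx_of_root = (3, -1)  # dx = 3 * _dx - 1
dy_of_dx = (1, 1)     # dy = dx + 1

# Push the bound dx <= 36 down to the root _dx.
max_root = root_max(*dx_of_root, derived_max=36)

# Express dy directly in terms of the root: dy = 3 * _dx + 0.
dy_of_root = compose(dy_of_dx, dx_of_root)
```

Here `max_root` is 12 and `dy_of_root` is `(3, 0)`, matching `_dx = Dim('_dx', max=12)` and `dy = 3 * _dx` above.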

This PR prettifies the suggested fixes routine by unifying to a single root, and making each intermediate suggestion either a derived dim or min/max range, not both.

2 | The current suggested fixes for derived dims can lead to root dims/derived dims being swapped, e.g. `dy - 1, dy` -> `dx, dx + 1`. This leads to problematic suggested fixes that look like `dy - 1 = Dim("dy - 1")` since we don't have access to the original variable name.

This PR only adds a suggested fix for the root dim, and removes all other derived suggestions.

For example, with the export test case test_derived_dim_out_of_order_simplified:
```
_dimz = torch.export.Dim("_dimz", min=6, max=8)
dimy = _dimz - 1
dimx = dimy - 1
dimz = torch.export.Dim("dimz", min=6, max=8)  # doesn't work, should be = _dimz

class Foo(torch.nn.Module):
    def forward(self, x, y, z):
        return x + y[1:] + z[2:]

foo = Foo()
u, v, w = torch.randn(5), torch.randn(6), torch.randn(7)
export(
    foo,
    (u, v, w),
    dynamic_shapes=({0: dimx}, {0: dimy}, {0: dimz}),
)
```

Before:
```
Suggested fixes:
  _dimz = Dim('_dimz', min=3, max=9223372036854775807)  # 2 <= _dimz - 1 <= 9223372036854775806
  _dimz - 2 = Dim('_dimz - 2', min=4, max=6)
  _dimz = Dim('_dimz', min=2, max=9223372036854775806)  # 2 <= _dimz <= 9223372036854775806
  _dimz - 1 = _dimz - 1
  dimz = _dimz
```

New suggested fixes:
```
Suggested fixes:
  dimz = _dimz
```

Note: This assumes the specified derived relations between dims are correct. This should be valid because: 1) if the relation is plain wrong (e.g. (dx, dx - 1) provided with inputs (6, 4)), this gets caught beforehand in produce_guards. 2) if the relation is correct but does not match the emitted guard, for example:
```
def forward(self, x, y):
    return x.reshape([-1]) + y  # guard: s0 * 2 = s1
dx = Dim("dx")
export(
    model,
    (torch.randn(6, 2), torch.randn(12)),
    dynamic_shapes={"x": (dx, 2), "y": (dx + 6, )}
)
```
This produces two linear equations, leading to specialization since a) produce_guards is able to solve for a concrete value, and b) the export constraint solver will force specializations anyway due to range constraints.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125543
Approved by: https://github.com/avikchaudhuri
2024-05-28 20:41:43 +00:00
cyy
0a9d73a814 Remove c10::guts::bool_constant and c10::guts::negation (#127300)
They are not used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127300
Approved by: https://github.com/r-barnes
2024-05-28 19:55:20 +00:00
03005bb655 Improve the clarity of the torch.Tensor.backward doc (#127201)
Improve the clarity of the torch.Tensor.backward doc, particularly wrt the arg `gradient`.
Reference https://pytorch.org/tutorials/beginner/blitz/autograd_tutorial.html,
```
We need to explicitly pass a gradient argument in Q.backward() because it is a vector. gradient is a tensor of the same shape as Q, and it represents the gradient of Q w.r.t. itself
```
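
To make the role of `gradient` concrete, here is a plain-Python sketch (no torch) of the vector-Jacobian product for the tutorial's example Q = 3a^3 - b^2, computed elementwise. `backward_q` is a hypothetical helper that mimics what `Q.backward(gradient=...)` accumulates into `a.grad` and `b.grad` for this one function; it is not how autograd is implemented.

```python
def backward_q(a, b, gradient):
    """Return (grad_a, grad_b) for Q_i = 3 * a_i**3 - b_i**2, weighted by gradient_i."""
    # dQ_i/da_i = 9 * a_i**2 and dQ_i/db_i = -2 * b_i; each is scaled by gradient_i.
    grad_a = [g * 9 * x**2 for g, x in zip(gradient, a)]
    grad_b = [g * -2 * y for g, y in zip(gradient, b)]
    return grad_a, grad_b


a, b = [2.0, 3.0], [6.0, 4.0]

# A gradient of ones has the same shape as Q and weights every element equally.
grad_a, grad_b = backward_q(a, b, gradient=[1.0, 1.0])
```

With a gradient of ones this yields `9 * a**2` and `-2 * b` elementwise; any other `gradient` simply rescales each element's contribution.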

@janeyx99 feel free to assign to the corresponding reviewers, thanks
Co-authored-by: Jeffrey Wan <soulitzer@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127201
Approved by: https://github.com/soulitzer
2024-05-28 19:25:51 +00:00
f600faf248 [metal] Improve perf of int4pack_mm shader (#127135)
Using vectorized data types and using SIMD groups to optimize memory access pattern

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127135
Approved by: https://github.com/malfet
2024-05-28 18:22:58 +00:00
c9172d4471 print default value in FunctionSignature (#127059)
Fixes #[126758](https://github.com/pytorch/pytorch/issues/126758) and #[126759](https://github.com/pytorch/pytorch/issues/126759)

The output information in the issue is not accurate because `FunctionSignature::toString()` prints the schema strings without default values.
cb6ef68caa/torch/csrc/utils/python_arg_parser.cpp (L1282-L1283)
This PR adds a `default_value` field that saves the default string, which should be printed. Alternatively, a new API could convert `default_bool`/`default_int` back to a string, but that is slightly more complicated.
result:
![image](https://github.com/pytorch/pytorch/assets/37650440/f58a4cbf-b0f4-4c81-9106-59f0d35c54ea)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127059
Approved by: https://github.com/janeyx99
2024-05-28 18:04:31 +00:00
045309aa35 [MPS] Enable torch.mm and friends for complex dtypes (#127241)
- Add `supportedFloatingOrComplexType`
- Change dtype check to those
- Extend low-precision fp32 list to complex types
- Mark conv2d as supported now, as it was failing due to tighter accuracy constraints than the same op has for the float32 dtype

Fixes https://github.com/pytorch/pytorch/issues/127178

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127241
Approved by: https://github.com/janeyx99
2024-05-28 17:56:13 +00:00
829f594d7d [small] guard_size_oblivious, skip check for meta (#127298)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127298
Approved by: https://github.com/ezyang
2024-05-28 17:53:08 +00:00
9521528f71 Log export result of torch.jit.trace to scuba (#126900)
Summary: We want to track how well torch.jit.trace can be converted to export at scale. As a first step, for every torch.jit.trace unit test we log whether we can convert the traced module to an export module or export the model directly.

Test Plan: CI

Differential Revision: D57629682

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126900
Approved by: https://github.com/SherlockNoMad
2024-05-28 17:49:34 +00:00
3f79e09515 Revert "Made some minor improvements to flexattention perf + added more autotune configs (#126811)"
This reverts commit 84e59f052d4342ac9453703be55758de102e20d3.

Reverted https://github.com/pytorch/pytorch/pull/126811 on behalf of https://github.com/PaliC due to breaking on V100s / internal tests ([comment](https://github.com/pytorch/pytorch/pull/126811#issuecomment-2135798983))
2024-05-28 17:48:26 +00:00
254783ce80 [Fix]: populate input parameter name when convert TorchScript to ExportedProgram (#126787)
## Goal
As title

## Design
Based on the fact that each TorchScript module has a `code` property which provides the original source code for the `forward` function, I implemented a function to extrapolate `forward` function signature by using the AST parser.

Other options considered and their tradeoffs:
* Directly parsing src code as string --> will be very buggy
* Directly using `compile` function in Python to get the function object --> raises a lot of exceptions because of missing packages or undefined variable names
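
A minimal sketch of the AST approach, using only the standard-library `ast` module. The helper name `forward_arg_names` and the sample source string are assumptions for illustration; the PR's actual implementation may differ.

```python
import ast


def forward_arg_names(source):
    """Return the positional parameter names of `forward`, excluding `self`."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef) and node.name == "forward":
            return [arg.arg for arg in node.args.args if arg.arg != "self"]
    raise ValueError("no forward function found")


# Stand-in for the text of a TorchScript module's `code` property.
# Annotations such as Tensor are parsed but never resolved, so no imports are needed.
code = """
def forward(self, x: Tensor, y: Tensor) -> Tensor:
    return x + y
"""

names = forward_arg_names(code)
```

Because `ast.parse` only builds a syntax tree, undefined names and missing packages in the source do not raise, which is the advantage over `compile`-and-execute noted above.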
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126787
Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan
2024-05-28 17:33:44 +00:00
122282111d [inductor][reland] Various improvements to error handling during autotuning (#126847)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126847
This is a reland of [D56764094](https://www.internalfb.com/diff/D56764094) / https://github.com/pytorch/pytorch/pull/125762. It was originally reverted due to rebase conflicts.
Original commit changeset: 45875a1e5de2
Original Phabricator Diff: [D56764094](https://www.internalfb.com/diff/D56764094)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126847
Approved by: https://github.com/chenyang78
2024-05-28 17:22:26 +00:00
df360e2add Update derivatives.yaml (#127193)
Fixed a typo in docs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127193
Approved by: https://github.com/soulitzer
2024-05-28 16:56:03 +00:00
cbb79a2baf [export] Disable backend decomps for capture_pre_autograd (#127120)
Differential Revision: D57785713

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127120
Approved by: https://github.com/ydwu4
2024-05-28 16:37:13 +00:00
cyy
c40408850a [1/N] Fix clang-tidy warnings in aten/src/ATen/cuda/ (#127183)
Fixes clang-tidy warnings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127183
Approved by: https://github.com/soulitzer, https://github.com/Skylion007
2024-05-28 15:35:29 +00:00
cyy
3d88c618d5 Concat namespaces in torch/csrc/profiler and other fixes (#127266)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127266
Approved by: https://github.com/soulitzer
2024-05-28 15:21:32 +00:00
4d4d2a96f2 Add space in MetaFallbackKernel.cpp error message (#127291)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127291
Approved by: https://github.com/Skylion007
2024-05-28 13:54:38 +00:00
a6b994ed54 Fix lint after #126845 (#127286)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127286
Approved by: https://github.com/NicolasHug, https://github.com/DanilBaibak
2024-05-28 12:38:27 +00:00
ec8b254ef4 Refactored template codegen to explicitly set current body when generating code (#127144)
The main motivation for this refactor is that today, when generating templates, this is what happens.

```
def_kernel() # registers hook for fully generating function definition
store_output() # registers hook for generating the output store. *also* keeps a number of things generated on `self.body`.
```

Later on, when we codegen the template: f8c4c268da/torch/_inductor/codegen/simd.py (L1402)

```
epilogue_node.codegen() # Also writes to body!
template.finalize() # Calls the above two hooks for def_kernel and store_output, which then reads from the accumulated `self.body`
```

Today, this is fine, as long as `store_output` is the last function called in the template. However, there are a couple of things we probably want to do with kernels that make this annoying.

1. In FlexAttention backwards, we might want a `modification` to be positioned *after* the `store_output` (just logically from a code organization POV). This doesn't work today because `modification` also needs to codegen a subgraph, but writing to `body` here conflicts with `store_output`'s implicit saved state on `self.body`.
2. If we want to support prologue fusion, we need to go through a bunch of contortions today to call the template hook finalization a couple times (https://github.com/pytorch/pytorch/pull/121211/files#diff-73b89475038a5b4705da805f1217783883fb90398ee1164995db392fc4a342c1R322)
3. The current code also makes it quite difficult to support fusion into multiple output nodes.

To resolve this, I do two things:
1. I *remove* the default `self.body` on `TritonTemplateKernel`. Instead, I have a dict of `self.subgraph_bodies`, which can be enabled in a context with `TritonTemplateKernel.set_subgraph_body`. This allows multiple different template functions to write to their own isolated bodies.
2. I add functions that allow you to finalize specific hooks on `PartialRender`.
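
As a rough illustration of the first point, the sketch below keeps one isolated body per name and selects the active one with a context manager. Only the names `subgraph_bodies` and `set_subgraph_body` come from the description above; the class `TemplateKernelSketch`, its `writeline` method, and the emitted lines are hypothetical and much simpler than the real `TritonTemplateKernel`.

```python
from contextlib import contextmanager


class TemplateKernelSketch:
    """Hypothetical kernel with one isolated code body per subgraph name."""

    def __init__(self):
        self.subgraph_bodies = {}
        self._current = None

    @contextmanager
    def set_subgraph_body(self, name):
        # Activate the named body, restoring the previous one on exit.
        previous = self._current
        self.subgraph_bodies.setdefault(name, [])
        self._current = name
        try:
            yield
        finally:
            self._current = previous

    def writeline(self, line):
        if self._current is None:
            raise RuntimeError("no subgraph body is active")
        self.subgraph_bodies[self._current].append(line)


kernel = TemplateKernelSketch()
with kernel.set_subgraph_body("store_output"):
    kernel.writeline("tl.store(out_ptr, acc)")
with kernel.set_subgraph_body("modification"):
    kernel.writeline("score = score * scale")
```

Since there is no default body, writing outside a `set_subgraph_body` context fails loudly instead of silently mixing into another hook's code.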
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127144
Approved by: https://github.com/jansel
2024-05-28 09:49:13 +00:00
457b9f7397 Optimize mask memory for flash attention (#126961)
This PR optimizes mask memory for flash attention. Instead of converting the whole mask to fp32 up front, we convert it block by block. This decreases peak memory usage (tested with https://huggingface.co/microsoft/Phi-3-mini-128k-instruct, where peak memory usage drops by ~50%) and also brings some performance improvement.
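
The real change is in the C++ CPU kernel; the plain-Python sketch below only illustrates the idea of converting one block of the mask at a time so the converted copy never exceeds the block size. `converted_blocks`, `masked_sum`, and the additive-mask toy computation are assumptions for illustration.

```python
def converted_blocks(mask_row, block_size):
    """Yield (start, float_block) pairs, converting one block at a time."""
    for start in range(0, len(mask_row), block_size):
        block = mask_row[start:start + block_size]
        yield start, [float(v) for v in block]


def masked_sum(scores, mask_row, block_size=4):
    """Add an additive mask to the scores block by block and sum the result."""
    total = 0.0
    for start, block in converted_blocks(mask_row, block_size):
        # Only `block` exists in converted form here, never the whole mask.
        for offset, m in enumerate(block):
            total += scores[start + offset] + m
    return total


scores = [1.0] * 10
mask = [0] * 5 + [-1] * 5  # integers stand in for a low-precision mask
result = masked_sum(scores, mask, block_size=4)
largest_block = max(len(b) for _, b in converted_blocks(mask, 4))
```

The converted temporary is bounded by `block_size` rather than by the mask length, which is where the peak-memory saving comes from.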

### Performance result
single socket in Intel (R) Xeon (R) CPU Max 9480
batch_size = 12, q_seq_len = 1030, kv_seq_len = 1179, n_head = 3, head_dim = 33, mask_dim = 4, bool_mask = 0
dtype | Forward speedup | Backward speedup
-- | -- | --
float64 | 0.82% | 3.76%
float32 | 2.2% | 3.9%
bfloat16 | 16.15% | 7.56%

**segment-anything-fast**
Follow https://github.com/pytorch-labs/segment-anything-fast/tree/main/experiments
Single socket in Intel (R) Xeon (R) CPU Max 9480
Dtype: bfloat16, models: vit_b and vit_h, test in `SDPA` and `Triton` commit https://github.com/pytorch-labs/segment-anything-fast/blob/main/experiments/run_experiments.py#L199-L200, select the time of 20th iteration.
Backend | vit_b, attn_mask w/o block-wise | vit_b, attn_mask w/ block-wise | vit_h, attn_mask w/o block-wise | vit_h, attn_mask w/ block-wise
-- | -- | -- | -- | --
SDPA | 10.95s/it | 6.59s/it | 19.93s/it | 12.33s/it
Triton | 10.66s/it | 7.12s/it | 19.87s/it | 12.26s/it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126961
Approved by: https://github.com/Valentine233, https://github.com/jgong5
2024-05-28 09:12:18 +00:00
1507d5205a [dynamo][fsdp] Skip Dynamo tracing of __getattr__ if its top-level frame (#127263)
The generated bytecode for the first frame is below. Inline comments explain the LOAD_ATTR that causes Dynamo to trigger again on `__getattr__`.

~~~
[__bytecode] MODIFIED BYTECODE fn /data/users/anijain/pytorch2/test/dynamo/test_activation_checkpointing.py line 1129
[__bytecode] 1129           0 COPY_FREE_VARS           1
[__bytecode]                2 RESUME                   0
[__bytecode]                4 PUSH_NULL
[__bytecode]                6 LOAD_GLOBAL             10 (__compiled_fn_1)
[__bytecode]               18 LOAD_FAST                0 (x)
[__bytecode]               20 LOAD_DEREF               1 (mod)
[__bytecode]               22 LOAD_ATTR                6 (_checkpoint_wrapped_module)
[__bytecode]               32 LOAD_CONST               1 (0)
[__bytecode]               34 BINARY_SUBSCR
[__bytecode]               44 LOAD_ATTR                7 (weight)
[__bytecode]               54 LOAD_DEREF               1 (mod)
[__bytecode]               56 LOAD_ATTR                6 (_checkpoint_wrapped_module)
[__bytecode]               66 LOAD_CONST               1 (0)
[__bytecode]               68 BINARY_SUBSCR
[__bytecode]               78 LOAD_ATTR                8 (bias)

# When this optimized bytecode is executed, these two lines call the __getattr__ of ActivationWrapper module.
# Dynamo gets invoked on __getattr__.

# If we had inlined __getattr__ during the tracing, we would have seen the LOAD_ATTR
# on more low level data structures like _modules, obviating the need for CPython
# to call python overridden __getattr__. But today, UnspecializedNNModuleVariable
# calls python getattr at tracing time (instead of inlining it), resulting in LOAD_ATTR
# on the module itself.

# To prevent Dynamo from tracing __getattr__ on the optimized bytecode,
# we can check if it is the top-level frame and just skip it.

[__bytecode]               88 LOAD_DEREF               1 (mod)
[__bytecode]               90 LOAD_ATTR                0 (a)

[__bytecode]              100 PRECALL                  4
[__bytecode]              104 CALL                     4
[__bytecode]              114 UNPACK_SEQUENCE          1
[__bytecode]              118 RETURN_VALUE
~~~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127263
Approved by: https://github.com/yf225
2024-05-28 08:16:53 +00:00
cyy
d6e3e89804 Remove c10::void_t (#127248)
OSS version doesn't use it anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127248
Approved by: https://github.com/ezyang
2024-05-28 06:59:20 +00:00
246311c944 Unconditionally add asserts after export (#127132)
Summary: Today AOTAutograd drops some of the assert nodes, so we reapply them after strict export.

Test Plan: CI

Reviewed By: angelayi

Differential Revision: D57786907

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127132
Approved by: https://github.com/zhxchen17
2024-05-28 06:31:39 +00:00
cyy
e4b245292f Remove caffe2::tensorrt target code from cuda.cmake (#127204)
Following #126542.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127204
Approved by: https://github.com/ezyang
2024-05-28 04:42:14 +00:00
cyy
c6b36ec2f9 Remove calls of deprecated _aminmax (#127182)
While  #125995 is pending, the calls should be removed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127182
Approved by: https://github.com/ezyang
2024-05-28 03:51:45 +00:00
d957c2d5de [Doc] update default magma cuda version in readme (#122125)
Since we use cuda 12.1 by default now, it would be better to update the doc.

Many people (including me), want to directly copy-paste commands in readme 😉  Let's make our life easier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122125
Approved by: https://github.com/malfet
2024-05-28 03:37:23 +00:00
7c61e7be5c Address issue #125307 (#126351)
PyTorch overrides SymPy's Mod and does its own symbolic simplification. Inspired by issue #125307, this PR adds one more simplification tactic.

Fixes #125307

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126351
Approved by: https://github.com/ezyang
2024-05-28 02:03:24 +00:00
8979412442 Enable ufmt format on test files (#126845)
Fixes some files in  #123062

Run lintrunner on files:

test/test_nnapi.py,
test/test_numba_integration.py,
test/test_numpy_interop.py,
test/test_openmp.py,
test/test_optim.py

```bash
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126845
Approved by: https://github.com/ezyang
2024-05-28 01:42:07 +00:00
cyy
57000708fc Remove c10::invoke_result (#127160)
Following #124169, it can be safely removed from the OSS version.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127160
Approved by: https://github.com/ezyang
2024-05-28 01:39:28 +00:00
cyy
6436a6407d Enable Wunused-variable on tests (#127161)
This PR enables unused-variable warnings in tests and fixes some test code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127161
Approved by: https://github.com/ezyang
2024-05-28 01:37:46 +00:00
cyy
70d8bc2da1 Fix various errors in TCPStoreLibUvBackend.cpp (#127230)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127230
Approved by: https://github.com/Skylion007
2024-05-27 19:14:01 +00:00
0ff2f8b522 update kineto submodule hash (#126780)
Summary: update kineto submodule hash

Test Plan: CIs

Differential Revision: D57620964

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126780
Approved by: https://github.com/Skylion007
2024-05-27 18:11:48 +00:00
25a9262ba4 Add structured logging for fx graph cache hash (#127156)
Summary: Add structured logging for fx graph cache hash so that we can debug MAST jobs easily.

Test Plan: ad hoc testing

Differential Revision: D57791537

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127156
Approved by: https://github.com/jamesjwu
2024-05-27 17:18:41 +00:00
26f4f10ac8 [5/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torch (#127126)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Apart from `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
2024-05-27 14:49:57 +00:00
c7f6fbfa9d Revert "[FSDP2] Added test for N-way TP and 1-way FSDP with CPU offloading (#127024)"
This reverts commit 9117779b0a178ec5ca548585a97bcb44be631644.

Reverted https://github.com/pytorch/pytorch/pull/127024 on behalf of https://github.com/atalman due to failing in CI ([comment](https://github.com/pytorch/pytorch/pull/127024#issuecomment-2133566325))
2024-05-27 14:12:09 +00:00
7121ea6f70 Revert "Add compile time profiler for non fbcode targets (#126904)"
This reverts commit 575cb617db4043dd7a76aaf523dc3ab7ee07e7a5.

Reverted https://github.com/pytorch/pytorch/pull/126904 on behalf of https://github.com/atalman due to Broke nightly smoke test ([comment](https://github.com/pytorch/pytorch/pull/126904#issuecomment-2133418687))
2024-05-27 12:52:09 +00:00
00fe0a0d79 Revert "Remove more of caffe2 (#126705)"
This reverts commit f95dbc12761cb4466099b0e9a3667057ca39272b.

Reverted https://github.com/pytorch/pytorch/pull/126705 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126705#issuecomment-2133325449))
2024-05-27 11:59:14 +00:00
1110edb94b Fix stream type to generic in comms default hooks (#120069)
In comms default_hooks, the decompress stream is hardcoded to the CUDA type. This fixes it to use a generic stream type based on the grad tensor's device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/120069
Approved by: https://github.com/jgong5, https://github.com/fegin
2024-05-27 10:27:30 +00:00
55c0ab2887 Revert "[5/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torch (#127126)"
This reverts commit 7763c83af67eebfdd5185dbe6ce15ece2b992a0f.

Reverted https://github.com/pytorch/pytorch/pull/127126 on behalf of https://github.com/XuehaiPan due to Broken CI ([comment](https://github.com/pytorch/pytorch/pull/127126#issuecomment-2133044286))
2024-05-27 09:22:08 +00:00
4608971f7a Revert "[inductor][cpp] GEMM template (infra and fp32) (#124021)"
This reverts commit 0d1e22855022a04a8601a2d94f3079950283ba5d.

Reverted https://github.com/pytorch/pytorch/pull/124021 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2133002071))
2024-05-27 09:01:45 +00:00
343a41fba8 Revert "[inductor][cpp] epilogue support for gemm template (#126019)"
This reverts commit 56c412d9063de3dc8163b8e1b0b9b5bf9581ad05.

Reverted https://github.com/pytorch/pytorch/pull/126019 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2133002071))
2024-05-27 09:01:45 +00:00
68fddebf84 Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)"
This reverts commit 4aa43d11f332b2d7b8f19b4da5ceba612133889d.

Reverted https://github.com/pytorch/pytorch/pull/126068 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2133002071))
2024-05-27 09:01:45 +00:00
ed9951ace7 Revert "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion (#126545)"
This reverts commit 43baabe9b94c86bd36ba4a00f501e52d833d7ec8.

Reverted https://github.com/pytorch/pytorch/pull/126545 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2133002071))
2024-05-27 09:01:45 +00:00
4c2e671a3b Revert "[Inductor][CPP] Add Min/Max with VecMask (#126841)"
This reverts commit 1ef4306ab11410a506e0868543a466e87ea879b5.

Reverted https://github.com/pytorch/pytorch/pull/126841 on behalf of https://github.com/DanilBaibak due to Blocks reverting of the broken PR ([comment](https://github.com/pytorch/pytorch/pull/126841#issuecomment-2132995404))
2024-05-27 08:58:01 +00:00
5247446396 Revert "[Inductor][CPP] Add ne with VecMask (#126940)"
This reverts commit f8c4c268da67e9684f3287b7468f36a5a27c6a0b.

Reverted https://github.com/pytorch/pytorch/pull/126940 on behalf of https://github.com/DanilBaibak due to Blocks reverting of the broken PR ([comment](https://github.com/pytorch/pytorch/pull/126841#issuecomment-2132995404))
2024-05-27 08:58:01 +00:00
60523fa674 Revert "Move MKLDNN Specific IR to Separate File (#126504)"
This reverts commit bf2909b871579a78e841b661b9b0c302f311d010.

Reverted https://github.com/pytorch/pytorch/pull/126504 on behalf of https://github.com/DanilBaibak due to Blocks reverting of the broken PR ([comment](https://github.com/pytorch/pytorch/pull/126841#issuecomment-2132995404))
2024-05-27 08:58:01 +00:00
ff63e8bac8 [CI] fix doctest case by adding requires (#126855)
The triton update introduces the new dependency `llnl-hatchet`, and `pydot` is a dependency of `llnl-hatchet`. As a result, the doctest case `torch/fx/passes/graph_drawer.py::FxGraphDrawer.get_dot_graph:0` is no longer skipped.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126855
Approved by: https://github.com/etaf, https://github.com/EikanWang, https://github.com/peterbell10
2024-05-27 07:40:27 +00:00
22712ba5c5 Radam support the flag for "maximize" (#126765)
Fixes #[126642](https://github.com/pytorch/pytorch/issues/126642)

This follows the `maximize` handling in `Adam` and adds a `maximize` flag to `RAdam`. If this PR is OK, I will add another PR for `NAdam`.
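
For context, `maximize` in `Adam` works by negating the gradient before the update. The sketch below shows that sign flip on a plain gradient step, deliberately not RAdam's actual update rule; `step` and the toy objective are hypothetical.

```python
def step(param, grad, lr=0.1, maximize=False):
    """One plain gradient step; `maximize` flips the sign of the gradient."""
    if maximize:
        grad = -grad
    return param - lr * grad


def objective_grad(x):
    """Gradient of f(x) = -(x - 3)**2, which has its maximum at x = 3."""
    return -2.0 * (x - 3.0)


x = 0.0
for _ in range(200):
    x = step(x, objective_grad(x), maximize=True)
```

With `maximize=True` the iterate climbs toward the maximizer at 3; with the default `maximize=False` the same loop would move away from it.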
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126765
Approved by: https://github.com/janeyx99
2024-05-27 06:34:50 +00:00
cyy
5cca904c51 [3/N] Enable clang-tidy in aten/src/ATen/detail/ (#127184)
Following #127168

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127184
Approved by: https://github.com/jansel
2024-05-27 06:28:07 +00:00
1c2e221e25 CUDA 12.4 ARM wheel integration to CD - nightly build (#126174)
Rebases https://github.com/pytorch/pytorch/pull/124112.
There were too many conflicting files, so this starts a new PR.

Test https://github.com/pytorch/builder/pull/1775 (merged) for ARM wheel addition
Test https://github.com/pytorch/builder/pull/1828 (merged) for setting MAX_JOBS

Current issue to follow up:
https://github.com/pytorch/pytorch/issues/126980

Co-authored-by: Aidyn-A <aidyn.b.aitzhan@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126174
Approved by: https://github.com/nWEIdia, https://github.com/atalman
2024-05-27 05:50:36 +00:00
7763c83af6 [5/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torch (#127126)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Apart from `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127126
Approved by: https://github.com/kit1980
ghstack dependencies: #127122, #127123, #127124, #127125
2024-05-27 04:22:18 +00:00
cyy
4fdbaa794f [Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2024-05-27 03:54:03 +00:00
6aa5bb1a76 [inductor] Support persistent reductions for dynamic shapes (#126684)
Currently persistent reductions are only supported when the reduction dimension
is static, however we only really need to know that the rnumel is bounded.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126684
Approved by: https://github.com/lezcano
2024-05-27 02:30:20 +00:00
bf2909b871 Move MKLDNN Specific IR to Separate File (#126504)
**Summary**
Following the discussion in https://github.com/pytorch/pytorch/pull/122593#discussion_r1604144782, move the Inductor MKLDNN-specific IRs to a separate file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126504
Approved by: https://github.com/desertfire, https://github.com/jgong5
ghstack dependencies: #126841, #126940
2024-05-27 00:48:09 +00:00
39de62845a [decomp] Fix default values missing from inplace rrelu decomposition (#126978)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126978
Approved by: https://github.com/lezcano
2024-05-26 23:49:40 +00:00
06934518a2 [AMD] Fix deprecated amdsmi api (#126962)
Summary: https://github.com/pytorch/pytorch/pull/119182 uses an API that has already been deprecated by c551c3caed. This fixes it in a backward-compatible way.

Differential Revision: D57711088

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126962
Approved by: https://github.com/eqy, https://github.com/izaitsevfb
2024-05-26 20:11:23 +00:00
ee6cb6daa1 Turn the mutation dependency of MutationOutput to weak deps (#127151)
A writeup of how mutation works in Inductor: https://docs.google.com/document/d/1P0fSq4Nm-3CvdUe9v-mLdEWD3dgIHUf1czQXMmQsuxc/edit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127151
Approved by: https://github.com/oulgen
ghstack dependencies: #127148, #127149
2024-05-26 01:21:03 +00:00
f8c4c268da [Inductor][CPP] Add ne with VecMask (#126940)
**Summary**
Fixes https://github.com/pytorch/pytorch/issues/126824#issuecomment-2125039161, which reports missing support for `ne` with `VecMask`.

**Test Plan**
```
python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_ne_cpu_bool
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126940
Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #126841
2024-05-25 23:54:48 +00:00
1ef4306ab1 [Inductor][CPP] Add Min/Max with VecMask (#126841)
**Summary**
Fixes https://github.com/pytorch/pytorch/issues/126824, which reports missing support for `min/max` with `VecMask`.

**Test Plan**
```
python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_clamp_max_cpu_bool
python test/inductor/test_torchinductor_opinfo.py -k test_comprehensive_clamp_min_cpu_bool
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126841
Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10
2024-05-25 23:52:21 +00:00
b8ee7d0cc1 Change direct uses of MutationOutput to mark_node_as_mutating (#127149)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127149
Approved by: https://github.com/oulgen
ghstack dependencies: #127148
2024-05-25 23:47:39 +00:00
3817c4f9fa Unify add_fake_dep and add_mutation_dep, as they're literally the same thing (#127148)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127148
Approved by: https://github.com/oulgen
2024-05-25 23:47:39 +00:00
cyy
9bead53519 [2/N] Fix clang-tidy warnings in aten/src/ATen/detail/ (#127168)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127168
Approved by: https://github.com/Skylion007
2024-05-25 22:50:02 +00:00
a28bfb5ed5 [4/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort functorch (#127125)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127125
Approved by: https://github.com/Skylion007
ghstack dependencies: #127122, #127123, #127124
2024-05-25 22:45:38 +00:00
35ea5c6b22 [3/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort torchgen (#127124)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127124
Approved by: https://github.com/Skylion007
ghstack dependencies: #127122, #127123
2024-05-25 19:20:03 +00:00
0dae2ba5bd [2/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort caffe2 (#127123)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127123
Approved by: https://github.com/Skylion007
ghstack dependencies: #127122
2024-05-25 18:26:34 +00:00
da141b096b Enable UFMT on test/test_hub.py (#127155)
Partially addresses #123062

Ran lintrunner on:
test/test_hub.py

Detail:
```
$ lintrunner -a --take UFMT test/test_hub.py
ok No lint issues.
Successfully applied all patches.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127155
Approved by: https://github.com/Skylion007
2024-05-25 18:23:24 +00:00
12d11fe4e5 Revert "reset dynamo cache before each test (#126586)"
This reverts commit bd24991f461476036d6ba20fed92651c7e46ef7c.

Reverted https://github.com/pytorch/pytorch/pull/126586 on behalf of https://github.com/malfet due to Broke tons of tests, see bd24991f46  ([comment](https://github.com/pytorch/pytorch/pull/126586#issuecomment-2131365576))
2024-05-25 17:17:19 +00:00
71eafe9e97 Refactor dispatch logic to clarify control flow (#126402)
As discussed, this cleans up the code so that create_aot_dispatcher literally chooses an aot_dispatch function and runs it. Moves wrapper logic to jit_compile_runtime_wrappers, and adds aot_dispatch_export to handle export cases in one place.

This also makes aot_dispatch_* return the same type always: a Callable and the forward metadata, instead of returning different number of arguments in export cases. Callers that don't care about fw_metadata can just ignore it. Added return type hints to enforce the same exact interface among all the aot_dispatch_* functions.
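
The uniform return shape described above can be sketched in plain Python. This is only an illustration under stated assumptions: `dispatch_base`, `dispatch_export`, `create_dispatcher`, and the metadata dict are hypothetical stand-ins, not the actual `aot_dispatch_*` functions or their metadata type.

```python
from typing import Any, Callable, Dict, Tuple

# Every dispatcher returns the same shape: (compiled callable, forward metadata).
DispatchResult = Tuple[Callable[..., Any], Dict[str, Any]]


def dispatch_base(fn: Callable[..., Any]) -> DispatchResult:
    # Hypothetical non-export path.
    return fn, {"mode": "base"}


def dispatch_export(fn: Callable[..., Any]) -> DispatchResult:
    # Hypothetical export path; same return shape as the base path.
    return fn, {"mode": "export"}


def create_dispatcher(fn: Callable[..., Any], export: bool) -> DispatchResult:
    # Choose one dispatch function, then run it; no per-case return shapes.
    chosen = dispatch_export if export else dispatch_base
    return chosen(fn)


# A caller that does not care about the metadata can simply ignore it.
compiled, _ = create_dispatcher(lambda x: x + 1, export=False)
_, metadata = create_dispatcher(lambda x: x + 1, export=True)
```

Because both paths share one return type, callers need no special casing for export.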

It'd be nice to move the checks from the synthetic base and dedup wrappers that have to do with export outside of those wrappers, but it's probably fine for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126402
Approved by: https://github.com/oulgen, https://github.com/bdhirsh
ghstack dependencies: #126193
2024-05-25 16:06:34 +00:00
7642cdef25 Improve fusable_read_and_write() (#127061)
Related to https://github.com/pytorch/pytorch/issues/98467

The tacotron2 benchmark creates a lot of nodes which fusion then checks. This improves some of the perf of that checking.

`can_fuse_vertical` calls `fusable_read_and_write` on O(read deps * write deps) combinations - but only cares about write deps that are MemoryDeps - so do the isinstance check outside the inner loop to save O(read deps) when it won't matter anyway.

Also moves `fusable_read_and_write` to an instance method (instead of a closure) since it doesn't actually capture any variables.

I also tried pre-splitting the read deps into `StarDep` vs `MemoryDep` but that didn't actually make any perf difference.
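
The `isinstance` hoisting described above can be sketched as follows. This is a minimal illustration rather than the Inductor code: `MemoryDep` and `StarDep` are stand-in dataclasses, and `fusable` and `count_fusable_pairs` are hypothetical helpers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryDep:  # stand-in for Inductor's MemoryDep, for illustration only
    name: str


@dataclass(frozen=True)
class StarDep:  # stand-in for Inductor's StarDep, for illustration only
    name: str


def fusable(read, write):
    # Hypothetical pairwise check: both deps refer to the same buffer name.
    return read.name == write.name


def count_fusable_pairs(read_deps, write_deps):
    count = 0
    for write in write_deps:
        # Type check once per write dep instead of once per (read, write) pair.
        if not isinstance(write, MemoryDep):
            continue
        for read in read_deps:
            if fusable(read, write):
                count += 1
    return count


reads = [MemoryDep("buf0"), StarDep("buf1"), MemoryDep("buf2")]
writes = [MemoryDep("buf0"), StarDep("buf1"), MemoryDep("buf2")]
pairs = count_fusable_pairs(reads, writes)
```

Skipping a non-`MemoryDep` write before entering the inner loop avoids one pass over all read deps for that write.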

Testing:
```
time python benchmarks/dynamo/torchbench.py --accuracy --inference --amp --backend inductor --disable-cudagraphs --device cuda --only tacotron2
```
Before this change: 10m15s
After this change: 9m31s

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127061
Approved by: https://github.com/peterbell10, https://github.com/jansel
ghstack dependencies: #127060
2024-05-25 15:17:25 +00:00
6c79299a35 Improve score_fusion_memory() (#127060)
Related to #98467

The tacotron2 benchmark creates a lot of nodes which fusion then checks. This
improves some of the perf of that checking.

`score_fusion_memory` is called O(n^2) times - so by moving the set union, `has_unbacked_symbols` check, and `numbytes_hint` out of the loop we call them O(n) times and the O(n^2) call gets cheaper.
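
A minimal sketch of the hoisting idea, assuming a simplified score (the number of shared buffers) in place of the real byte-size estimate. The node dicts and both functions are hypothetical and only illustrate moving per-node work out of the pairwise loop.

```python
def score_pairs_naive(nodes):
    # Recomputes the union of reads and writes for both nodes on every pair.
    scores = {}
    for a in nodes:
        for b in nodes:
            if a["name"] < b["name"]:
                a_deps = a["reads"] | a["writes"]
                b_deps = b["reads"] | b["writes"]
                scores[(a["name"], b["name"])] = len(a_deps & b_deps)
    return scores


def score_pairs_hoisted(nodes):
    # Compute each node's dependency set once: O(n) set unions in total.
    deps = {n["name"]: n["reads"] | n["writes"] for n in nodes}
    names = sorted(deps)
    scores = {}
    # The O(n^2) loop now only performs the cheap intersection.
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            scores[(a, b)] = len(deps[a] & deps[b])
    return scores


nodes = [
    {"name": "n0", "reads": {"x"}, "writes": {"y"}},
    {"name": "n1", "reads": {"y"}, "writes": {"z"}},
    {"name": "n2", "reads": {"w"}, "writes": {"v"}},
]
naive = score_pairs_naive(nodes)
hoisted = score_pairs_hoisted(nodes)
```

Both versions produce the same scores; only the amount of repeated work inside the pairwise loop differs.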

Testing:
```
time python benchmarks/dynamo/torchbench.py --accuracy --inference --amp --backend inductor --disable-cudagraphs --device cuda --only tacotron2
```

Before this change: 12m33s
After this change: 10m15s

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127060
Approved by: https://github.com/peterbell10, https://github.com/jansel
2024-05-25 15:17:25 +00:00
ba3b05fdf3 [1/N][Easy] fix typo for usort config in pyproject.toml (kown -> known): sort stdlib (#127122)
The `usort` config in `pyproject.toml` has no effect due to a typo. Fixing the typo makes `usort` do more and generates the changes in this PR. Except for `pyproject.toml`, all changes are generated by `lintrunner -a --take UFMT --all-files`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127122
Approved by: https://github.com/kit1980
2024-05-25 08:25:50 +00:00
4a997de8b9 [AOTI] support freezing for MKLDNN (#124350)
## Description
Fixes https://github.com/pytorch/pytorch/issues/114450. This PR builds upon the work from @imzhuhl done in https://github.com/pytorch/pytorch/pull/114451.

This PR requires https://github.com/pytorch/pytorch/pull/122472 to land first.

We leverage the serialization and deserialization API from oneDNN v3.4.1 to save the opaque MKLDNN tensor during the compilation and restore the opaque tensor when loading the compiled .so.
ideep version is updated so that we won't break any pipeline even if third_party/ideep is not updated at the same time.

### Test plan:
```sh
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_conv_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_deconv_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_linear_freezing_non_abi_compatible_cpu
```

### TODOs in follow-up PRs
1. We found that using `AOTI_TORCH_CHECK` causes a performance drop on several models (`DistillGPT2`, `MBartForConditionalGeneration`, `T5ForConditionalGeneration`, `T5Small`) compared with JIT Inductor, which uses `TORCH_CHECK`. How to address this may need further discussion (`AOTI_TORCH_CHECK` was introduced in https://github.com/pytorch/pytorch/pull/119220).
2. Freezing in non-ABI-compatible mode works with the support in this PR. For ABI-compatible mode, we first need to address this issue: `AssertionError: None, i.e. optional output is not supported`.
6c4f43f826/torch/_inductor/codegen/cpp_wrapper_cpu.py (L2023-L2024)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124350
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-05-25 07:15:36 +00:00
e7a42702f9 generalize custom_fwd&custom_bwd to be device-agnostic (#126531)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126531
Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/albanD, https://github.com/EikanWang
ghstack dependencies: #126527
2024-05-25 06:48:16 +00:00
c09205a057 Deprecate device-specific GradScaler autocast API (#126527)
# Motivation

## for `torch.amp.GradScaler`,
- `torch.cpu.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cpu", args...)`.
- `torch.cuda.amp.GradScaler(args...)` is completely equivalent to `torch.amp.GradScaler("cuda", args...)`.

So, we intend to deprecate them and **strongly recommend** that developers use `torch.amp.GradScaler`.
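
The deprecation pattern described here, where a device-specific class keeps working but warns and forwards to the device-agnostic one, can be sketched in plain Python. `Scaler` and `CpuScaler` are hypothetical stand-ins, not the `torch.amp` classes, and the choice of `FutureWarning` is an assumption.

```python
import warnings


class Scaler:
    """Hypothetical device-agnostic class; takes the device as first argument."""

    def __init__(self, device="cuda", init_scale=65536.0):
        self.device = device
        self.init_scale = init_scale


class CpuScaler(Scaler):
    """Hypothetical deprecated device-specific alias for Scaler("cpu", ...)."""

    def __init__(self, init_scale=65536.0):
        warnings.warn(
            "CpuScaler(args...) is deprecated; use Scaler('cpu', args...) instead.",
            FutureWarning,
            stacklevel=2,
        )
        # Forward to the device-agnostic constructor with the device filled in.
        super().__init__("cpu", init_scale=init_scale)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    old = CpuScaler(init_scale=1024.0)

new = Scaler("cpu", init_scale=1024.0)
```

The old spelling still constructs an equivalent object, and a unit test can assert on the captured warning.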

## for `custom_fwd` and `custom_bwd`,
this is a good way to make a custom function run with or without effect even in an autocast-enabled region, and it can be shared by other backends such as CPU and XPU.
So we generalize them to be device-agnostic, put them into `torch/amp/autocast_mode.py`, and re-expose them as `torch.amp.custom_fwd` and `torch.amp.custom_bwd`. Meanwhile, we deprecate `torch.cuda.amp.custom_fwd` and `torch.cuda.amp.custom_bwd`.

# Additional Context
Add a UT to cover the deprecation warning.
No more UTs are needed to cover the functionality of `torch.amp.custom_f/bwd`; the existing UTs that previously covered `torch.cuda.amp.custom_f/bwd` already cover it.
To facilitate the review, we separate these code changes into two PRs. The first PR covers `torch.amp.GradScaler`. The follow-up covers `custom_fwd` and `custom_bwd`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126527
Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/janeyx99, https://github.com/EikanWang
2024-05-25 06:41:34 +00:00
ef86a27dba Mark test_set_per_process_memory_fraction serial (#127087)
Occasionally OOMs

Also should probably give the entire GPU for this anyways
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127087
Approved by: https://github.com/huydhn
2024-05-25 06:26:47 +00:00
0f67d38f0f add TORCHDYNAMO_CAPTURE_DYNAMIC_OUTPUT_SHAPE_OPS (#127017)
tlparse prints failure description like this

> dynamic shape operator: aten._unique2.default; to enable, set torch._dynamo.config.capture_dynamic_output_shape_ops = True

Adding an OS environment variable makes it easier to set this for testing.
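
A minimal sketch of how an environment variable can seed such a config flag. The parsing rule (the string `"1"` means enabled) is an assumption, and `read_flag` is a hypothetical helper rather than Dynamo's code.

```python
import os


def read_flag(name, default="0", environ=os.environ):
    # Hypothetical helper: treat the literal string "1" as enabled.
    return environ.get(name, default) == "1"


# Use a fake environment so the example does not depend on the real one.
fake_env = {"TORCHDYNAMO_CAPTURE_DYNAMIC_OUTPUT_SHAPE_OPS": "1"}
capture_dynamic_output_shape_ops = read_flag(
    "TORCHDYNAMO_CAPTURE_DYNAMIC_OUTPUT_SHAPE_OPS", environ=fake_env
)
```

Under that assumed parsing rule, exporting `TORCHDYNAMO_CAPTURE_DYNAMIC_OUTPUT_SHAPE_OPS=1` before running a test would have the same effect as setting the config attribute by hand.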

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127017
Approved by: https://github.com/jackiexu1992
2024-05-25 05:42:41 +00:00
84e59f052d Made some minor improvements to flexattention perf + added more autotune configs (#126811)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126811
Approved by: https://github.com/drisspg, https://github.com/yanboliang, https://github.com/Neilblaze
2024-05-25 05:03:31 +00:00
cyy
9f11fc666a [1/N] Fix clang-tidy warnings in aten/src/ATen/detail/ (#127057)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127057
Approved by: https://github.com/Skylion007
2024-05-25 04:55:52 +00:00
bd24991f46 reset dynamo cache before each test (#126586)
In https://github.com/pytorch/pytorch/issues/125967, we found that test results depend on test order. The root cause is that earlier tests populate the dynamo cache and affect later tests.

This PR clears the dynamo cache before each unit test so we get more deterministic unit test results.
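
The order-dependence problem and the per-test reset can be illustrated with a plain `unittest` sketch. `_CACHE` and `compile_once` are hypothetical stand-ins for a process-wide compilation cache; this is not the actual test harness change.

```python
import unittest

_CACHE = {}  # hypothetical stand-in for a process-wide compilation cache


def compile_once(key):
    # Returns True when the key was compiled fresh, False on a cache hit.
    fresh = key not in _CACHE
    _CACHE[key] = True
    return fresh


class CacheIsolatedTest(unittest.TestCase):
    def setUp(self):
        # Clear shared state so results do not depend on test order.
        _CACHE.clear()

    def test_first(self):
        self.assertTrue(compile_once("f"))

    def test_second(self):
        # Without the reset in setUp, this would fail whenever test_first ran earlier.
        self.assertTrue(compile_once("f"))


suite = unittest.defaultTestLoader.loadTestsFromTestCase(CacheIsolatedTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```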

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126586
Approved by: https://github.com/jansel
2024-05-25 04:48:09 +00:00
8bd26ecf0b [pipelining] test composability with DDP and FSDP (#127066)
Added to `multigpu` test config, which is run periodically.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127066
Approved by: https://github.com/H-Huang, https://github.com/wconstab
ghstack dependencies: #127136, #126931
2024-05-25 04:30:40 +00:00
c1d2564acf [pipelining] Add grad test for interleaved schedules (#126931)
Added `test_grad_with_manual_interleaved`:
- Model: `MultiMLP`
- Tested schedules: Interleaved1F1B, LoopedBFS
- Two stages per rank
```
Rank 0 stages: [0, 2]
Rank 1 stages: [1, 3]
```
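
The stage layout above matches a round-robin placement, which can be sketched as follows. `stages_for_rank` is a hypothetical helper for illustration and not part of the pipelining API.

```python
def stages_for_rank(rank, num_ranks, stages_per_rank):
    # Round-robin (interleaved) placement: stage s lives on rank s % num_ranks.
    return [rank + i * num_ranks for i in range(stages_per_rank)]


layout = {r: stages_for_rank(r, num_ranks=2, stages_per_rank=2) for r in range(2)}
```

With two ranks and two stages per rank, this gives rank 0 stages 0 and 2, and rank 1 stages 1 and 3.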

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126931
Approved by: https://github.com/wconstab
ghstack dependencies: #127136
2024-05-25 04:13:28 +00:00
eaace67444 [pipelining] do not check inputs for non-0 stages (#127136)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127136
Approved by: https://github.com/wconstab
2024-05-25 04:13:28 +00:00
cc9a3412d4 Implement a post_compile step for aot_dispatch_autograd (#126193)
This PR moves the post compile portion of aot_dispatch_autograd into runtime_wrappers.py. Completing this allows us to run the post compile section on its own when warm starting.

I considered leaving this thing in jit_compile_runtime_wrappers, but we're gonna run into circular dependency issues later if we don't move it over
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126193
Approved by: https://github.com/bdhirsh
ghstack dependencies: #126907
2024-05-25 03:24:20 +00:00
52bcf120e5 Make inductor config hashing more portable (#127022)
Summary: masnesral and I noticed that the config contains non-portable artifacts. Let's fix that.

Test Plan: adhoc testing

Differential Revision: D57748025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127022
Approved by: https://github.com/masnesral
2024-05-25 03:01:33 +00:00
665637714f Remove SparseAdam weird allowance of raw Tensor input (#127081)
This continues the full deprecation after https://github.com/pytorch/pytorch/pull/114425. It's been 6 months! And I'm fairly certain no one is going to yell at me as this patch is not really used.

------

# BC Breaking note

As of this PR, SparseAdam will become consistent with the rest of our optimizers in that it will only accept containers of Tensors/Parameters/param groups and fully complete deprecation of this path. Hitherto, the SparseAdam constructor had allowed raw tensors as the params argument to the constructor. Now, if you write the following code, there will be an error similar to every other optim: "params argument given to the optimizer should be an iterable of Tensors or dicts"

```
import torch
param = torch.rand(16, 32)
optimizer = torch.optim.SparseAdam(param)
```

Instead you should replace the last line with
```
optimizer = torch.optim.SparseAdam([param])
```
to no longer error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127081
Approved by: https://github.com/soulitzer
2024-05-25 02:58:24 +00:00
cyy
29a1f62f23 Replace c10::invoke_result with std::invoke_result (#124169)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124169
Approved by: https://github.com/swolchok
2024-05-25 02:42:13 +00:00
9ef6f8dfc1 Fix typo in inductor workflow for CUDA 12.4 jobs (#127121)
Discovered by @clee2000.  The change was introduced in https://github.com/pytorch/pytorch/pull/121956
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127121
Approved by: https://github.com/clee2000, https://github.com/Skylion007
2024-05-25 02:36:39 +00:00
ed838793df [pipelining] Remove qualname mapping (#127018)
`QualnameMapMixin` was intended to provide a mapping from new FQN of the piped model to the FQN of the original model. It was there because previous tracers and flattening during tracing would modify the FQNs.

Now that we use the unflattener, the FQNs of the stage modules are the same as the original FQNs. We don't need `QualnameMapMixin` any more.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127018
Approved by: https://github.com/H-Huang
2024-05-25 02:32:40 +00:00
5f15110499 Update dispatch stub to make SDPA routing cleaner (#126832)
# Summary

Adds a public method to dispatchstub to check if a fn has been registered for a device. We use this new function to clean up the dispatching logic for SDPA, as well as make the private use dispatching simpler:
#126392
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126832
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-05-25 01:40:53 +00:00
db9c6aeec6 Revert "Skip test_memory_format_nn_BatchNorm2d in inductor (#125970)" (#126594)
This reverts commit 0a9c6e92f8d1a35f33042c8dab39f23b7f39d6e7.

enable the test since it's fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126594
Approved by: https://github.com/huydhn
ghstack dependencies: #126593
2024-05-25 01:27:02 +00:00
b03dc3d167 don't check memory format for empty tensors (#126593)
Fix https://github.com/pytorch/pytorch/issues/125967. The test actually fails for empty 4D or 5D tensors when checking the memory format.

I'm not exactly sure which recent inductor change caused the failure, but it may not be that important to maintain strides for an empty tensor. (?)

I just skip the check for empty tensors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126593
Approved by: https://github.com/ezyang
2024-05-25 01:19:45 +00:00
84f8cd22ac [dynamo][TensorVariable] Support "if param.grad_fn" usecase (#126960)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126960
Approved by: https://github.com/jansel
ghstack dependencies: #126922
2024-05-25 01:09:26 +00:00
bbeb0906c4 Register creak_node_hook (#126671)
Differential Revision: D57469157

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126671
Approved by: https://github.com/angelayi
2024-05-24 23:32:15 +00:00
72f0bdcc22 Remove torch._constrain_as_value (#127103)
Summary: This API doesn't do anything useful and should be subsumed by torch._check.

Test Plan: CI

Differential Revision: D57786740

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127103
Approved by: https://github.com/angelayi
2024-05-24 22:49:46 +00:00
d5bf3a98db [inductor] Refactor indexing() into triton.py (#127047)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127047
Approved by: https://github.com/shunting314
ghstack dependencies: #126944, #126945
2024-05-24 22:46:20 +00:00
92433217cb [inductor] Misc refactors (#126945)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126945
Approved by: https://github.com/shunting314
ghstack dependencies: #126944
2024-05-24 22:46:20 +00:00
1b6e3e3bcb [inductor] Refactor part of IterationRangesEntry into triton.py (#126944)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126944
Approved by: https://github.com/shunting314
2024-05-24 22:46:20 +00:00
83617017e0 [dtensor][debug] add c10d allreduce_coalesced_ tracing to CommDebugMode (#127040)
**Summary**
Added c10d all_reduce_coalesced tracing to CommDebugMode and added test case to test_comm_mode.py.

**Test Plan**
pytest test/distributed/_tensor/debug/test_comm_mode.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127040
Approved by: https://github.com/XilunWu
ghstack dependencies: #127025, #127029
2024-05-24 22:25:44 +00:00
59052071b7 Disallow fusions of foreach and reductions (#127048)
Fixes https://github.com/pytorch/pytorch/issues/120857

This isn't supported until we enable foreach reduction kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127048
Approved by: https://github.com/weifengpy
2024-05-24 21:35:06 +00:00
023c1baf82 Add global configurations to cache key (#126907)
This adds a bunch of global configurations to the cache key. There's definitely more I haven't added, but this is just an audit of all of the `torch.*` globals that are used in jit_compile_runtime_wrappers, runtime_wrappers, etc.

It also makes the hash details object subclass FXGraphHashDetails, which implements other hashed data like configs inductor depends on.
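
Why global configuration belongs in a cache key can be shown with a small hashing sketch. `cache_key` and the config names are hypothetical; the real hash details object serializes far more state and does not necessarily use JSON.

```python
import hashlib
import json


def cache_key(graph_repr, global_config):
    # Serialize with sorted keys so the key does not depend on dict ordering.
    payload = json.dumps(
        {"graph": graph_repr, "config": global_config}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


base = cache_key("add(x, y)", {"grad_enabled": True, "deterministic": False})
same = cache_key("add(x, y)", {"deterministic": False, "grad_enabled": True})
changed = cache_key("add(x, y)", {"grad_enabled": False, "deterministic": False})
```

The same graph under a different global setting gets a different key, so a cached artifact compiled under one configuration is not reused under another.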

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126907
Approved by: https://github.com/aorenste
2024-05-24 21:26:46 +00:00
c133665d4a [CUDA] Parallelize upsampling OPS across the batch/channel dimension. (#127082)
This can make this operation 200x+ faster on modern GPUs for small grid sizes, as otherwise this kernel is scheduled with a single block (!)

Tested on A100 with:
```
python test/test_nn.py TestNNDeviceTypeCUDA
```

**Benchmarks FW**
Ran on A100 / bf16
## Forward pass benchmarks

| batch size | input size | output size | before runtime (mem bandwidth) | after runtime (mem bandwidth) | speedup |
|------------|------------|-------------|------------------|-----------------|---------|
| 768 | 16x16 | 6x6 | 5855us (0.07 GB/s) | 38us (10 GB/s) | 154x |
| 768 | 16x16 | 7x7 | 5214us (0.08 GB/s) | 37us (11 GB/s) | 138x |
| 768 | 16x16 | 14x14 | 2314us (0.27 GB/s) | 36us (17 GB/s) | 63x |
| 768 | 16x16 | 16x16 | 1232us (0.59 GB/s) | 33us (21 GB/s) | 36x |
| 768 | 32x32 | 6x6 | 19442us (0.07 GB/s) | 98us (15 GB/s) | 197x |
| 768 | 32x32 | 7x7 | 16918us (0.09 GB/s) | 89us (17 GB/s) | 188x |
| 768 | 32x32 | 14x14 | 6023us (0.28 GB/s) | 69us (25 GB/s) | 86x |
| 768 | 32x32 | 16x16 | 3455us (0.52 GB/s) | 55us (32 GB/s) | 62x |
| 768 | 48x48 | 6x6 | 38597us (0.08 GB/s) | 179us (18 GB/s) | 214x |
| 768 | 48x48 | 7x7 | 34700us (0.09 GB/s) | 163us (20 GB/s) | 211x |
| 768 | 48x48 | 14x14 | 10647us (0.33 GB/s) | 112us (31 GB/s) | 94x |
| 768 | 48x48 | 16x16 | 7388us (0.49 GB/s) | 100us (36 GB/s) | 73x |
| 768 | 64x64 | 6x6 | 76288us (0.07 GB/s) | 310us (19 GB/s) | 246x |
| 768 | 64x64 | 7x7 | 54981us (0.1 GB/s) | 257us (23 GB/s) | 213x |
| 768 | 64x64 | 14x14 | 16565us (0.37 GB/s) | 169us (36 GB/s) | 97x |
| 768 | 64x64 | 16x16 | 12037us (0.51 GB/s) | 141us (43 GB/s) | 84x |
| 1024 | 16x16 | 6x6 | 8123us (0.06 GB/s) | 44us (12 GB/s) | 183x |
| 1024 | 16x16 | 7x7 | 7017us (0.08 GB/s) | 45us (12 GB/s) | 155x |
| 1024 | 16x16 | 14x14 | 3150us (0.27 GB/s) | 45us (18 GB/s) | 69x |
| 1024 | 16x16 | 16x16 | 1695us (0.57 GB/s) | 41us (23 GB/s) | 40x |
| 1024 | 32x32 | 6x6 | 25918us (0.07 GB/s) | 120us (16 GB/s) | 214x |
| 1024 | 32x32 | 7x7 | 22622us (0.09 GB/s) | 108us (18 GB/s) | 208x |
| 1024 | 32x32 | 14x14 | 8245us (0.28 GB/s) | 87us (26 GB/s) | 94x |
| 1024 | 32x32 | 16x16 | 4599us (0.53 GB/s) | 68us (35 GB/s) | 67x |
| 1024 | 48x48 | 6x6 | 51486us (0.08 GB/s) | 219us (20 GB/s) | 234x |
| 1024 | 48x48 | 7x7 | 46501us (0.09 GB/s) | 202us (22 GB/s) | 229x |
| 1024 | 48x48 | 14x14 | 14280us (0.33 GB/s) | 145us (32 GB/s) | 98x |
| 1024 | 48x48 | 16x16 | 9877us (0.49 GB/s) | 125us (39 GB/s) | 79x |
| 1024 | 64x64 | 6x6 | 101731us (0.07 GB/s) | 378us (20 GB/s) | 268x |
| 1024 | 64x64 | 7x7 | 73465us (0.1 GB/s) | 320us (24 GB/s) | 229x |
| 1024 | 64x64 | 14x14 | 22109us (0.37 GB/s) | 218us (37 GB/s) | 101x |
| 1024 | 64x64 | 16x16 | 16081us (0.51 GB/s) | 178us (46 GB/s) | 90x |
| 1536 | 16x16 | 6x6 | 12546us (0.06 GB/s) | 61us (13 GB/s) | 205x |
| 1536 | 16x16 | 7x7 | 11064us (0.07 GB/s) | 63us (13 GB/s) | 175x |
| 1536 | 16x16 | 14x14 | 4839us (0.26 GB/s) | 62us (20 GB/s) | 77x |
| 1536 | 16x16 | 16x16 | 2630us (0.55 GB/s) | 59us (24 GB/s) | 44x |
| 1536 | 32x32 | 6x6 | 38898us (0.07 GB/s) | 170us (17 GB/s) | 227x |
| 1536 | 32x32 | 7x7 | 34079us (0.09 GB/s) | 155us (19 GB/s) | 219x |
| 1536 | 32x32 | 14x14 | 12632us (0.27 GB/s) | 124us (28 GB/s) | 101x |
| 1536 | 32x32 | 16x16 | 6900us (0.53 GB/s) | 98us (37 GB/s) | 70x |
| 1536 | 48x48 | 6x6 | 77272us (0.08 GB/s) | 316us (21 GB/s) | 243x |
| 1536 | 48x48 | 7x7 | 70153us (0.09 GB/s) | 291us (23 GB/s) | 240x |
| 1536 | 48x48 | 14x14 | 21500us (0.33 GB/s) | 208us (34 GB/s) | 103x |
| 1536 | 48x48 | 16x16 | 14851us (0.49 GB/s) | 181us (40 GB/s) | 81x |
| 1536 | 64x64 | 6x6 | 152669us (0.07 GB/s) | 548us (21 GB/s) | 278x |
| 1536 | 64x64 | 7x7 | 110348us (0.1 GB/s) | 466us (25 GB/s) | 236x |
| 1536 | 64x64 | 14x14 | 33350us (0.36 GB/s) | 316us (38 GB/s) | 105x |
| 1536 | 64x64 | 16x16 | 24173us (0.51 GB/s) | 263us (47 GB/s) | 91x |
| 4096 | 16x16 | 6x6 | 34638us (0.06 GB/s) | 138us (16 GB/s) | 249x |
| 4096 | 16x16 | 7x7 | 31590us (0.07 GB/s) | 144us (16 GB/s) | 218x |
| 4096 | 16x16 | 14x14 | 13203us (0.26 GB/s) | 149us (23 GB/s) | 88x |
| 4096 | 16x16 | 16x16 | 7328us (0.53 GB/s) | 143us (27 GB/s) | 51x |
| 4096 | 32x32 | 6x6 | 103802us (0.07 GB/s) | 405us (19 GB/s) | 256x |
| 4096 | 32x32 | 7x7 | 91354us (0.08 GB/s) | 372us (22 GB/s) | 245x |
| 4096 | 32x32 | 14x14 | 34501us (0.26 GB/s) | 312us (29 GB/s) | 110x |
| 4096 | 32x32 | 16x16 | 18465us (0.52 GB/s) | 247us (39 GB/s) | 74x |
## Backward pass benchmarks

| batch size | input size | output size | before runtime (mem bandwidth) | after runtime (mem bandwidth) | speedup |
|------------|------------|-------------|------------------|-----------------|---------|
| 768 | 16x16 | 6x6 | 78656us (0.0 GB/s) | 323us (1 GB/s) | 243x |
| 768 | 16x16 | 7x7 | 67167us (0.0 GB/s) | 292us (1 GB/s) | 230x |
| 768 | 16x16 | 14x14 | 27478us (0.02 GB/s) | 229us (2 GB/s) | 119x |
| 768 | 16x16 | 16x16 | 131us (5.59 GB/s) | 56us (13 GB/s) | 2x |
| 768 | 32x32 | 6x6 | 271752us (0.0 GB/s) | 888us (1 GB/s) | 305x |
| 768 | 32x32 | 7x7 | 224110us (0.0 GB/s) | 813us (1 GB/s) | 275x |
| 768 | 32x32 | 14x14 | 85365us (0.02 GB/s) | 450us (3 GB/s) | 189x |
| 768 | 32x32 | 16x16 | 67700us (0.02 GB/s) | 360us (5 GB/s) | 187x |
| 768 | 48x48 | 6x6 | 593709us (0.0 GB/s) | 1988us (1 GB/s) | 298x |
| 768 | 48x48 | 7x7 | 485566us (0.0 GB/s) | 1694us (1 GB/s) | 286x |
| 768 | 48x48 | 14x14 | 164059us (0.02 GB/s) | 897us (3 GB/s) | 182x |
| 768 | 48x48 | 16x16 | 134317us (0.02 GB/s) | 674us (5 GB/s) | 199x |
| 768 | 64x64 | 6x6 | 1026651us (0.0 GB/s) | 3360us (1 GB/s) | 305x |
| 768 | 64x64 | 7x7 | 770901us (0.0 GB/s) | 2584us (2 GB/s) | 298x |
| 768 | 64x64 | 14x14 | 277850us (0.02 GB/s) | 1556us (3 GB/s) | 178x |
| 768 | 64x64 | 16x16 | 236245us (0.02 GB/s) | 1144us (5 GB/s) | 206x |
| 1024 | 16x16 | 6x6 | 106638us (0.0 GB/s) | 341us (1 GB/s) | 312x |
| 1024 | 16x16 | 7x7 | 90886us (0.0 GB/s) | 314us (1 GB/s) | 288x |
| 1024 | 16x16 | 14x14 | 36572us (0.02 GB/s) | 292us (2 GB/s) | 124x |
| 1024 | 16x16 | 16x16 | 171us (5.69 GB/s) | 56us (17 GB/s) | 3x |
| 1024 | 32x32 | 6x6 | 356900us (0.0 GB/s) | 936us (2 GB/s) | 380x |
| 1024 | 32x32 | 7x7 | 299139us (0.0 GB/s) | 870us (2 GB/s) | 343x |
| 1024 | 32x32 | 14x14 | 113205us (0.02 GB/s) | 576us (4 GB/s) | 196x |
| 1024 | 32x32 | 16x16 | 90886us (0.02 GB/s) | 458us (5 GB/s) | 198x |
| 1024 | 48x48 | 6x6 | 786896us (0.0 GB/s) | 2127us (2 GB/s) | 369x |
| 1024 | 48x48 | 7x7 | 640515us (0.0 GB/s) | 1837us (2 GB/s) | 348x |
| 1024 | 48x48 | 14x14 | 218720us (0.02 GB/s) | 1152us (4 GB/s) | 189x |
| 1024 | 48x48 | 16x16 | 178827us (0.02 GB/s) | 863us (5 GB/s) | 207x |
| 1024 | 64x64 | 6x6 | 1379991us (0.0 GB/s) | 3589us (2 GB/s) | 384x |
| 1024 | 64x64 | 7x7 | 1047466us (0.0 GB/s) | 2774us (2 GB/s) | 377x |
| 1024 | 64x64 | 14x14 | 370139us (0.02 GB/s) | 1999us (4 GB/s) | 185x |
| 1024 | 64x64 | 16x16 | 316501us (0.02 GB/s) | 1470us (5 GB/s) | 215x |
| 1536 | 16x16 | 6x6 | 159057us (0.0 GB/s) | 477us (1 GB/s) | 332x |
| 1536 | 16x16 | 7x7 | 135578us (0.0 GB/s) | 441us (1 GB/s) | 306x |
| 1536 | 16x16 | 14x14 | 53002us (0.02 GB/s) | 400us (3 GB/s) | 132x |
| 1536 | 16x16 | 16x16 | 252us (5.79 GB/s) | 55us (26 GB/s) | 4x |
| 1536 | 32x32 | 6x6 | 545653us (0.0 GB/s) | 1323us (2 GB/s) | 412x |
| 1536 | 32x32 | 7x7 | 447491us (0.0 GB/s) | 1248us (2 GB/s) | 358x |
| 1536 | 32x32 | 14x14 | 173491us (0.02 GB/s) | 787us (4 GB/s) | 220x |
| 1536 | 32x32 | 16x16 | 136395us (0.02 GB/s) | 633us (5 GB/s) | 215x |
| 1536 | 48x48 | 6x6 | 1198639us (0.0 GB/s) | 3057us (2 GB/s) | 392x |
| 1536 | 48x48 | 7x7 | 985549us (0.0 GB/s) | 2645us (2 GB/s) | 372x |
| 1536 | 48x48 | 14x14 | 331419us (0.02 GB/s) | 1581us (4 GB/s) | 209x |
| 1536 | 48x48 | 16x16 | 270972us (0.02 GB/s) | 1186us (6 GB/s) | 228x |
| 1536 | 64x64 | 6x6 | 2094282us (0.0 GB/s) | 5214us (2 GB/s) | 401x |
| 1536 | 64x64 | 7x7 | 1593449us (0.0 GB/s) | 4086us (2 GB/s) | 389x |
| 1536 | 64x64 | 14x14 | 559244us (0.02 GB/s) | 2828us (4 GB/s) | 197x |
| 1536 | 64x64 | 16x16 | 469471us (0.02 GB/s) | 2057us (6 GB/s) | 228x |
| 4096 | 16x16 | 6x6 | 430494us (0.0 GB/s) | 1008us (2 GB/s) | 427x |
| 4096 | 16x16 | 7x7 | 360346us (0.0 GB/s) | 1015us (2 GB/s) | 354x |
| 4096 | 16x16 | 14x14 | 142868us (0.02 GB/s) | 988us (3 GB/s) | 144x |
| 4096 | 16x16 | 16x16 | 658us (5.93 GB/s) | 56us (69 GB/s) | 11x |
| 4096 | 32x32 | 6x6 | 1425928us (0.0 GB/s) | 2796us (2 GB/s) | 509x |
| 4096 | 32x32 | 7x7 | 1188862us (0.0 GB/s) | 2906us (2 GB/s) | 409x |
| 4096 | 32x32 | 14x14 | 464286us (0.02 GB/s) | 1965us (4 GB/s) | 236x |
| 4096 | 32x32 | 16x16 | 363903us (0.02 GB/s) | 1588us (6 GB/s) | 229x |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127082
Approved by: https://github.com/fmassa
2024-05-24 21:17:12 +00:00
b0871f9b33 [DSD] Add a test to verify FSDP lazy initialization case (#127069)
Summary:
Distributed state_dict should not error out because the `model.state_dict()` will trigger FSDP to initialize.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127069
Approved by: https://github.com/wz337
2024-05-24 21:09:11 +00:00
7394ec7123 [AOTI][refactor] Update DTYPE_TO_CPP mapping (#126915)
Summary: Use more consistent cpp int types in DTYPE_TO_CPP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126915
Approved by: https://github.com/chenyang78
2024-05-24 21:03:12 +00:00
800f461b2a [User-Written Triton] Handle the scf.for and scf.while case (#127065)
Summary:
This is the official fix of the issue, reported in https://fb.workplace.com/groups/1075192433118967/permalink/1427865377851669/

The root cause is that the MLIR mutation analysis doesn't find the mutated tensors, which made AOT autograd think there were no users of the Triton kernel and then remove it 😔

---

Triton IR: P1369315213
Wrong Analyze Graph: P1364305956
Right Analyze Graph: P1369324977

Test Plan:
buck2 run mode/opt scripts/liptds/domain_kernels:triton_dcpp_flash

unit tests

Differential Revision: D57606053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127065
Approved by: https://github.com/oulgen, https://github.com/chenyang78
2024-05-24 21:01:13 +00:00
dce29a8a87 Replaced same with assertEqual in two files (#126994)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126994
Approved by: https://github.com/masnesral
2024-05-24 20:50:36 +00:00
c34f8c7f91 Revert "Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946)"
This reverts commit 5e69e11d098a2cfccc8a59377c431e9c71cab9a8.

Reverted https://github.com/pytorch/pytorch/pull/125946 on behalf of https://github.com/clee2000 due to sorry the Dr CI fix hasn't been merged yet and its still failing 5e69e11d09 https://github.com/pytorch/pytorch/actions/runs/9228887299/job/25393895252 ([comment](https://github.com/pytorch/pytorch/pull/125946#issuecomment-2130305958))
2024-05-24 20:26:07 +00:00
fdda9a22c3 Performance parity for 32-bit-precision in FP16 ARM matrix-vector kernel using FMLAL instruction (#127033)
Summary: I discovered this instruction by checking all the intrinsics on https://arm-software.github.io/acle/neon_intrinsics/advsimd.html .

Test Plan: Existing test coverage
benchmarked custom sizes with https://github.com/malfet/llm_experiments benchmarks/benchmark/torch_mm.py:

```
m=1024, n=1024, k=1
====================
trans_b  torch.float16   43.93 usec

Using FP16 accumulation
trans_b  torch.float16   43.76 usec
m=4100, n=4100, k=1
====================
trans_b  torch.float16  719.35 usec

Using FP16 accumulation
trans_b  torch.float16  719.33 usec
m=4104, n=4104, k=1
====================
trans_b  torch.float16  727.79 usec

Using FP16 accumulation
trans_b  torch.float16  702.72 usec
m=16384, n=16384, k=1
====================
trans_b  torch.float16 18465.11 usec

Using FP16 accumulation
trans_b  torch.float16 11435.28 usec
```

also checked the default sizes. Relevant output before:
```
mv_nt    torch.float16   13.05 usec
trans_b  torch.float16   13.69 usec

Using FP16 accumulation
mv_nt    torch.float16    8.65 usec
trans_b  torch.float16    9.24 usec
```

after:
```
mv_nt    torch.float16    8.66 usec
trans_b  torch.float16    8.85 usec

Using FP16 accumulation
mv_nt    torch.float16    8.52 usec
trans_b  torch.float16    8.60 usec
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127033
Approved by: https://github.com/malfet, https://github.com/Skylion007
ghstack dependencies: #126745, #126746, #126793, #126794, #126877, #127016
2024-05-24 19:47:50 +00:00
1d3aa08327 Cleanup: use c10::ForceUnroll and constexpr variables in ARM FP16 matrix-vector fast path (#127016)
Summary: Just straightforward code cleanup in this path.

Test Plan: Existing CI, double-checked benchmark_torch_mm didn't regress as per previous diffs in stack.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127016
Approved by: https://github.com/peterbell10
ghstack dependencies: #126745, #126746, #126793, #126794, #126877
2024-05-24 19:47:50 +00:00
cyy
67d52d7fcb [caffe2] Remove import_legacy.cpp (#126149)
I think they are for Caffe2 and should be deleted.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126149
Approved by: https://github.com/r-barnes
2024-05-24 19:47:32 +00:00
5e69e11d09 Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946)
PyTorch can't depend on `fbgemm_gpu` as a dependency because `fbgemm_gpu` already has a dependency on PyTorch. So this PR copy / pastes kernels from `fbgemm_gpu`:
* `dense_to_jagged_forward()` as CUDA registration for new ATen op `_padded_dense_to_jagged_forward()`
* `jagged_to_padded_dense_forward()` as CUDA registration for new ATen op `_jagged_to_padded_dense_forward()`

CPU impls for these new ATen ops will be added in a follow-up PR.
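
For readers unfamiliar with the layout conversion, the following is a minimal pure-Python sketch of what "jagged to padded dense" and its inverse mean. It is an illustration only: the function names, the `values` + `offsets` representation, and the `max_length` / `padding_value` parameters are assumptions, not the signatures of the ATen ops or the `fbgemm_gpu` kernels.

```python
def jagged_to_padded_dense(values, offsets, max_length, padding_value=0.0):
    """Expand a jagged sequence (flat values + row offsets) into fixed-width rows.

    Row i holds values[offsets[i]:offsets[i + 1]], truncated or padded
    to exactly max_length entries.
    """
    rows = []
    for start, end in zip(offsets[:-1], offsets[1:]):
        row = list(values[start:end][:max_length])
        row.extend([padding_value] * (max_length - len(row)))
        rows.append(row)
    return rows


def padded_dense_to_jagged(dense, offsets):
    """Inverse direction: drop the padding and concatenate the valid prefixes."""
    values = []
    for row, start, end in zip(dense, offsets[:-1], offsets[1:]):
        values.extend(row[: end - start])
    return values


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
offsets = [0, 2, 3, 6]  # three rows with lengths 2, 1, and 3

dense = jagged_to_padded_dense(values, offsets, max_length=3)
print(dense)
print(padded_dense_to_jagged(dense, offsets))
```

The round trip recovers the original values only when no row is longer than `max_length`; longer rows are truncated in this sketch.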
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125946
Approved by: https://github.com/davidberard98
2024-05-24 19:16:29 +00:00
9d4731f952 [AOTI] Disable stack allocation for OSS (#125732)
Summary: Stack allocation is for certain small CPU models, but its coverage still needs improvement, so default to OFF for OSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125732
Approved by: https://github.com/chenyang78
ghstack dependencies: #126720, #126801
2024-05-24 19:10:33 +00:00
72d30aa026 [AOTI] Fix an int array codegen issue (#126801)
Summary: fixes https://github.com/pytorch/pytorch/issues/126779. When an int array contains a symbolic expression, we can't declare it with constexpr.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126801
Approved by: https://github.com/chenyang78
ghstack dependencies: #126720
2024-05-24 19:10:33 +00:00
71f1aebe1f [AOTI] Add more fallback ops (#126720)
Summary: These ops are in either unit tests or TorchBench. Fixes https://github.com/pytorch/pytorch/issues/122050

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126720
Approved by: https://github.com/chenyang78
2024-05-24 19:10:33 +00:00
f508cd6e00 Update assigntome job (#127027)
Updating for the new docathon

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127027
Approved by: https://github.com/kit1980
2024-05-24 19:04:51 +00:00
3cb16ebf08 [BE]: Update ruff to 0.4.5 (#126979)
Updates ruff to 0.4.5 and addresses some false negatives that have been found in the newer version.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126979
Approved by: https://github.com/ezyang
2024-05-24 18:38:35 +00:00
4a09117d16 Introduce ProcessGroupCudaP2P (#122163)
## Context
This stack prototypes automatic micro-pipelining of `all-gather -> matmul` and `matmul -> reduce-scatter` via Inductor. The idea originates from the paper [Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models](https://dl.acm.org/doi/pdf/10.1145/3567955.3567959). The implementation and some key optimizations are heavily influenced by @lw's implementation in xformers.

The stack contains several components:
- `ProcessGroupCudaP2P` - a thin wrapper around `ProcessGroupNCCL`. It in addition maintains a P2P workspace that enables SM-free, one-sided P2P communication which is needed for optimal micro-pipelining.
- `fused_all_gather_matmul` and `fused_matmul_reduce_scatter` dispatcher ops.
- Post-grad fx pass that detects `all-gather -> matmul` and `matmul -> reduce-scatter` and replaces them with the fused dispatcher ops.

To enable the prototype feature:
- Set the distributed backend to `cuda_p2p`.
- Set `torch._inductor.config._micro_pipeline_tp` to `True`.

*NOTE: the prototype sets nothing in stone w.r.t. each component's design. The purpose is to have a performant baseline with a reasonable design on which each component can be further improved.*

## Benchmark
Setup:
- 8 x H100 (500W) + 3rd gen NVSwitch.
- Llama3 8B training w/ torchtitan.
- 8-way TP. Reduced the number of layers from 32 to 8 for benchmarking purpose.

Trace (baseline): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpjaz8zgx0
<img width="832" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4addba77-5abc-4d2e-93ea-f68078587fe1">

Trace (w/ micro pipelining): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpn073b4wn
<img width="963" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4f44e78d-8196-43ab-a1ea-27390f07e9d2">

## This PR
`ProcessGroupCudaP2P` is a thin wrapper around `ProcessGroupNCCL`. By default, it routes all collectives to the underlying `ProcessGroupNCCL`. In addition, `ProcessGroupCudaP2P` initializes a P2P workspace that allows direct GPU memory access among the members. The workspace can be used in Python to optimize intra-node communication patterns or to create custom intra-node collectives in CUDA.

`ProcessGroupCudaP2P` aims to bridge the gap where certain important patterns can be better optimized via fine-grained P2P memory access than with collectives in the latest version of NCCL. It is meant to complement NCCL rather than replacing it.
Usage:
```
    # Using ProcessGroupCudaP2P
    dist.init_process_group(backend="cuda_p2p", ...)

    # Using ProcessGroupCudaP2P while specifying ProcessGroupCudaP2P.Options
    pg_options = ProcessGroupCudaP2P.Options()
    dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)

    # Using ProcessGroupCudaP2P while specifying ProcessGroupNCCL.Options
    pg_options = ProcessGroupNCCL.Options()
    dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)

    # Using ProcessGroupCudaP2P while specifying both
    # ProcessGroupCudaP2P.Options and ProcessGroupNCCL.Options
    pg_options = ProcessGroupCudaP2P.Options()
    pg_options.nccl_options = ProcessGroupNCCL.Options()
    dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)

    # Down-casting the backend to access p2p buffers for cuda_p2p specific
    # optimizations
    if is_cuda_p2p_group(group):
        backend = get_cuda_p2p_backend(group)
        if required_p2p_buffer_size > backend.get_buffer_size():
            # fallback
        p2p_buffer = backend.get_p2p_buffer(...)
    else:
        # fallback
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122163
Approved by: https://github.com/wanchaol
2024-05-24 18:33:18 +00:00
01f04230cf [cond] support torch built in function as subgraph (#126909)
Fixes https://github.com/pytorch/pytorch/issues/126818.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126909
Approved by: https://github.com/zou3519
ghstack dependencies: #127026
2024-05-24 18:31:43 +00:00
2d6d2dbc0b [dynamo] make callable(nn_module) return True (#127026)
Before this PR, `callable(nn_module)` caused a graph break:
```python
import torch
from torch import nn


class M(nn.Module):
    def forward(self, x):
        return x.sin()

def f(m):
    return callable(m)

res = torch.compile(f, fullgraph=True)(M())
```

```
Traceback (most recent call last):
  File "/data/users/yidi/pytorch/t.py", line 17, in <module>
    out = torch.compile(f, backend="eager", fullgraph=True)(M())
  File "/data/users/yidi/pytorch/torch/_dynamo/eval_frame.py", line 414, in _fn
    return fn(*args, **kwargs)
  File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 1077, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state, skip=1)
  File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 456, in _convert_frame_assert
    return _compile(
  File "/data/users/yidi/pytorch/torch/_utils_internal.py", line 74, in wrapper_function
    return function(*args, **kwargs)
  File "/home/yidi/.conda/envs/pytorch/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 799, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/data/users/yidi/pytorch/torch/_dynamo/utils.py", line 210, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 618, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/data/users/yidi/pytorch/torch/_dynamo/bytecode_transformation.py", line 1167, in transform_code_object
    transformations(instructions, code_options)
  File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 177, in _fn
    return fn(*args, **kwargs)
  File "/data/users/yidi/pytorch/torch/_dynamo/convert_frame.py", line 564, in transform
    tracer.run()
  File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 2244, in run
    super().run()
  File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 886, in run
    while self.step():
  File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 801, in step
    self.dispatch_table[inst.opcode](self, inst)
  File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 496, in wrapper
    return inner_fn(self, inst)
  File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 1255, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/data/users/yidi/pytorch/torch/_dynamo/symbolic_convert.py", line 739, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/data/users/yidi/pytorch/torch/_dynamo/variables/builtin.py", line 948, in call_function
    return handler(tx, args, kwargs)
  File "/data/users/yidi/pytorch/torch/_dynamo/variables/builtin.py", line 711, in <lambda>
    return lambda tx, args, kwargs: obj.call_function(
  File "/data/users/yidi/pytorch/torch/_dynamo/variables/builtin.py", line 948, in call_function
    return handler(tx, args, kwargs)
  File "/data/users/yidi/pytorch/torch/_dynamo/variables/builtin.py", line 835, in builtin_dipatch
    unimplemented(error_msg)
  File "/data/users/yidi/pytorch/torch/_dynamo/exc.py", line 216, in unimplemented
    raise Unsupported(msg)
torch._dynamo.exc.Unsupported: builtin: callable [<class 'torch._dynamo.variables.nn_module.NNModuleVariable'>] False
```
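
For reference, the eager-mode behavior that the compiled function is expected to match can be sketched without PyTorch. The `Module` class below is a hypothetical stand-in for `nn.Module`, assumed only to define `__call__` and dispatch to `forward`; Python's built-in `callable()` returns `True` for any object whose type defines `__call__`.

```python
class Module:
    """Hypothetical stand-in for nn.Module: calling an instance runs forward()."""

    def __call__(self, *args, **kwargs):
        return self.forward(*args, **kwargs)


class M(Module):
    def forward(self, x):
        return x + 1


def f(m):
    # callable() checks whether type(m) defines __call__.
    return callable(m)


print(f(M()))       # instance of a class that defines __call__
print(f(object()))  # plain object without __call__
```

Since module instances are callable in eager mode, returning `True` under tracing keeps the two modes consistent.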

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127026
Approved by: https://github.com/jansel
2024-05-24 18:31:43 +00:00
cyy
f2c6fddbe1 Remove unnecessary const_cast and other fixes (#127054)
Removes unnecessary const casts and copies.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127054
Approved by: https://github.com/Skylion007
2024-05-24 18:05:06 +00:00
9117779b0a [FSDP2] Added test for N-way TP and 1-way FSDP with CPU offloading (#127024)
This PR shows that we can use FSDP solely for CPU offloading when composing with N-way TP. Each FSDP mesh is just 1 rank.

This was motivated from an ask on Slack :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127024
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #127004
2024-05-24 17:09:12 +00:00
87f79af24d Fix map_location for wrapper subclass and device tensors that go through numpy (#126728)
Fixes https://github.com/pytorch/pytorch/issues/124418

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126728
Approved by: https://github.com/albanD
2024-05-24 16:39:30 +00:00
4ff9113e3d [MPS] Add _weight_int8pack_mm tests (#127041)
Also extends the test to cover MV cases (where the A matrix is 1xM). Limits int8 op testing to 32x32 matrix sizes for now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127041
Approved by: https://github.com/larryliu0820, https://github.com/manuelcandales
2024-05-24 16:08:06 +00:00
194950c0ca Default ThreadPool size to number of physical cores (#125963)
TODO: Some benchmarks

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125963
Approved by: https://github.com/janeyx99, https://github.com/Skylion007, https://github.com/gajjanag, https://github.com/jgong5
2024-05-24 16:06:48 +00:00
5ae9daa4a2 Revert "[AOTI] support freezing for MKLDNN (#124350)"
This reverts commit 654afb6f3ae3ddbd926a753f9af95a6f6e22131c.

Reverted https://github.com/pytorch/pytorch/pull/124350 on behalf of https://github.com/clee2000 due to Seems to have broken inductor/test_aot_inductor.py::AOTInductorTestNonABICompatibleCpu::test_freezing_non_abi_compatible_cpu 654afb6f3a https://github.com/pytorch/pytorch/actions/runs/9224838183/job/25382780192 ([comment](https://github.com/pytorch/pytorch/pull/124350#issuecomment-2129889809))
2024-05-24 16:03:07 +00:00
2ac739cc80 [DOCS] Fixed KLDiv example (#126857)
Small import fix to make the example run
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126857
Approved by: https://github.com/albanD
2024-05-24 15:39:50 +00:00
4105f91cfc [inductor] fix an assertion for node debug str (#127021)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127021
Approved by: https://github.com/aorenste
2024-05-24 13:37:05 +00:00
654afb6f3a [AOTI] support freezing for MKLDNN (#124350)
## Description
Fixes https://github.com/pytorch/pytorch/issues/114450. This PR builds upon the work from @imzhuhl done in https://github.com/pytorch/pytorch/pull/114451.

This PR requires https://github.com/pytorch/pytorch/pull/122472 to land first.

We leverage the serialization and deserialization API from oneDNN v3.4.1 to save the opaque MKLDNN tensor during the compilation and restore the opaque tensor when loading the compiled .so.
ideep version is updated so that we won't break any pipeline even if third_party/ideep is not updated at the same time.

### Test plan:
```sh
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_conv_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_deconv_freezing_non_abi_compatible_cpu
python -u test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCpu.test_linear_freezing_non_abi_compatible_cpu
```

### TODOs in follow-up PRs
1. We found that using `AOTI_TORCH_CHECK` causes a performance drop on several models (`DistillGPT2`, `MBartForConditionalGeneration`, `T5ForConditionalGeneration`, `T5Small`) compared with JIT Inductor, which uses `TORCH_CHECK`. How to address this may need further discussion (`AOTI_TORCH_CHECK` was introduced in https://github.com/pytorch/pytorch/pull/119220).
2. Freezing in non-ABI compatible mode will work with the support in this PR. For ABI compatible mode, we first need to address this issue: `AssertionError: None, i.e. optional output is not supported`.
6c4f43f826/torch/_inductor/codegen/cpp_wrapper_cpu.py (L2023-L2024)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124350
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-05-24 13:34:04 +00:00
43baabe9b9 [inductor][cpp] support bf16/fp16 gemm template epilogue fusion (#126545)
As part of #125683, this PR adds epilogue fusion support for bf16/fp16 gemms. The key changes are as follows:
1. bf16 linear w/ epilogue fusion of some ops was originally supported via ATen oneDNN linear pointwise ops. In order to match the ATen op semantics, in-template epilogue support is added to the cpp gemm template so that we would have: "gemm + in-template epilogues -> template buffer". If the template is chosen for codegen, the in-template epilogues will be concatenated with the out-of-template epilogues that are appended during the scheduling.
2. Support bf16/fp16 legalization for `codegen_loop_bodies` which is used to generate the epilogue loops.
3. We used to leverage the in-place buffer mechanism to handle the in-place buffers in the epilogue codegen, in particular, for the reuses for output buffers of GEMM, template and epilogues. This is not correct since the output buffer is an "output" not an "in-place" buffer of the template kernel itself. Now, we use a dedicated "aliases" dict to manage such buffer reuses and the intermediate aliasing buffers are removed after codegen.
4. Add `localize_buffer` method to `LocalBufferScope` to allow the replacement of a global buffer with a local one in the given inductor IR nodes. This helps the fused loops to work on smaller-sized local buffers for better data locality.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126545
Approved by: https://github.com/jansel
ghstack dependencies: #124021, #126019, #126068
2024-05-24 12:29:06 +00:00
4aa43d11f3 [inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)
As part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet which will be added in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068
Approved by: https://github.com/jansel
ghstack dependencies: #124021, #126019
2024-05-24 12:24:35 +00:00
56c412d906 [inductor][cpp] epilogue support for gemm template (#126019)
As part of #125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019
Approved by: https://github.com/jansel
ghstack dependencies: #124021
2024-05-24 12:14:12 +00:00
dd64ca2a02 Inductor respects strides for custom ops by default (#126986)
Previously, the default was that Inductor did not respect strides for
all (builtin and custom) ops unless the op has a
"needs_fixed_stride_order" tag on it. This PR changes it so that:

- inductor doesn't respect strides for builtin ops. To change the
  behavior, one can add the "needs_fixed_stride_order" tag
- inductor does respect strides for custom ops. To change the behavior,
  one can add the "does_not_need_fixed_stride_order" tag
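
The policy described above can be summarized as a small decision function. This is a sketch of the rule as stated in this message, not Inductor's actual code: the function name `respects_strides` is hypothetical, and the precedence when both tags are present is an assumption.

```python
NEEDS_FIXED = "needs_fixed_stride_order"
DOES_NOT_NEED_FIXED = "does_not_need_fixed_stride_order"


def respects_strides(is_builtin: bool, tags=frozenset()) -> bool:
    """Return True if strides should be preserved for an op under the new default."""
    if NEEDS_FIXED in tags:
        # Assumption: an explicit request to fix the stride order wins.
        return True
    if DOES_NOT_NEED_FIXED in tags:
        return False
    # No tag: custom ops keep their strides, builtin ops do not.
    return not is_builtin


print(respects_strides(is_builtin=True))
print(respects_strides(is_builtin=False))
print(respects_strides(is_builtin=True, tags={NEEDS_FIXED}))
print(respects_strides(is_builtin=False, tags={DOES_NOT_NEED_FIXED}))
```

Defaulting custom ops to the stricter behavior is the conservative choice, since Inductor cannot know whether a user-defined kernel depends on a particular memory layout.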

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126986
Approved by: https://github.com/ezyang, https://github.com/albanD
2024-05-24 11:11:18 +00:00
f14cdc570d Fix to #126656 (#127050)
Fix failure from fbcode - in the case of a foreach node the fake `group` needs to be hashable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127050
Approved by: https://github.com/DanilBaibak
ghstack dependencies: #126656
2024-05-24 10:56:53 +00:00
47c976b904 Revert "[AOTI] Add more fallback ops (#126720)"
This reverts commit 19cd4484ec8449b8c5ebf46be1f8f2fcbace8c6c.

Reverted https://github.com/pytorch/pytorch/pull/126720 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126720#issuecomment-2129011751))
2024-05-24 09:07:07 +00:00
f749c5def8 Revert "[AOTI] Fix an int array codegen issue (#126801)"
This reverts commit ff617ab6c8f6f67ae912fbcd45a913a89e19effb.

Reverted https://github.com/pytorch/pytorch/pull/126801 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126720#issuecomment-2129011751))
2024-05-24 09:07:07 +00:00
fd9cdeed19 Revert "[AOTI] Disable stack allocation for OSS (#125732)"
This reverts commit 599e684ad6f34dd069eff8611f45e25b7695a339.

Reverted https://github.com/pytorch/pytorch/pull/125732 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126720#issuecomment-2129011751))
2024-05-24 09:07:07 +00:00
f95dbc1276 Remove more of caffe2 (#126705)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126705
Approved by: https://github.com/malfet
2024-05-24 06:53:08 +00:00
0d1e228550 [inductor][cpp] GEMM template (infra and fp32) (#124021)
This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.
1. Cpp template infrastructure
Template abstractions similar to those of the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`, plus the MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates.
2. Initial FP32 gemm template
This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with constant weight (`B`), requiring `N` to be a multiple of the register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix would be prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm` which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with ATen vec abstraction.
3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since it is an initial implementation, we are still working on further performance improvements with follow-up PRs, including optimizations in kernels as well as fusions. The perf gains are only observed on a select number of models compared to the ATen kernels, which are implemented with MKL. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are details.
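
The overall structure (prepack `B` into panels, loop over blocks of `M`, call a micro-kernel per block) can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the generated C++: `pack_b`, `micro_gemm`, and `packed_gemm` are hypothetical names, there is no thread decomposition or blocking along `K`, and `n_block` / `m_block` stand in for the register and cache block sizes.

```python
def pack_b(b, n_block):
    """Prepack B (K x N) into column panels of width n_block."""
    n = len(b[0])
    assert n % n_block == 0, "N must be a multiple of the register block width"
    return [[row[j:j + n_block] for row in b] for j in range(0, n, n_block)]


def micro_gemm(a_rows, b_panel, c_rows, col0):
    """Accumulate a_rows @ b_panel into columns col0.. of c_rows, in place."""
    for a_row, c_row in zip(a_rows, c_rows):
        for kk, a_val in enumerate(a_row):
            for j, b_val in enumerate(b_panel[kk]):
                c_row[col0 + j] += a_val * b_val


def packed_gemm(a, packed_b, n_block, m_block):
    """Compute A @ B, where B was prepacked by pack_b; M may be any size."""
    m = len(a)
    n = len(packed_b) * n_block
    c = [[0.0] * n for _ in range(m)]
    for i in range(0, m, m_block):            # block over rows of A
        for p, panel in enumerate(packed_b):  # one packed panel of B at a time
            micro_gemm(a[i:i + m_block], panel, c[i:i + m_block], p * n_block)
    return c


a = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
b = [[1.0, 0.0, 2.0, 1.0], [0.0, 1.0, 1.0, 3.0]]
print(packed_gemm(a, pack_b(b, n_block=2), n_block=2, m_block=2))
```

Packing happens once because the weight is constant, so its cost is amortized over every call; only `M` varies at run time, which matches the inference setting described above.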

Static shapes
| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up:
drq: 1.14x
soft_act: 1.12
cait_m36_384: 1.18x

Dynamic shapes
| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up:
BERT_pytorch: 1.22x
pyhpc_turbulent: 1.13x
soft_actor_critic: 1.77x
BlenderbotForCausalLM: 1.09x
cait_m36_384: 1.17x

Differential Revision: [D57585365](https://our.internmc.facebook.com/intern/diff/D57585365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021
Approved by: https://github.com/jansel
2024-05-24 06:26:33 +00:00
505b8ceaa2 Double registers per iteration in FP32-arithmetic FP16 ARM gemv kernel (#126877)
Summary: I found that doubling this significantly improved performance, but doubling again did not, so I stopped here.

Test Plan: CI
Benchmarked with llm_experiments repo as previously in stack; relevant data:

before:
trans_b torch.float16 1396.11 usec (4100)
trans_b torch.float16 1399.54 usec (4104)

after:
trans_b  torch.float16 1096.00 usec (4100)
trans_b  torch.float16 1093.47 usec (4104)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126877
Approved by: https://github.com/malfet
ghstack dependencies: #126745, #126746, #126793, #126794
2024-05-24 05:57:09 +00:00
e8fa0f10c5 Quadruple registers per iteration in ARM64 FP16 kernel (#126794)
The machine has plenty of registers we weren't using. This looks like it might improve performance a couple percent, though there is noise so I'm not certain.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126794
Approved by: https://github.com/malfet
ghstack dependencies: #126745, #126746, #126793
2024-05-24 05:57:09 +00:00
f6366454db Add privateuse1 in FSDP's sharded grad scaler (#126971)
1. add privateuse1 in FSDP's sharded grad scaler
2. support found_inf copy for more devices

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126971
Approved by: https://github.com/awgu, https://github.com/weifengpy
2024-05-24 05:54:25 +00:00
2f6954c7c3 Update the modification api (#127035)
# Summary
Updates the modification jinja template's API so that the output_name for the fixed buffer can be specified. Also updates flex-attention's usage to make the algorithm clearer and more closely aligned with the vmap impl.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127035
Approved by: https://github.com/Chillee
2024-05-24 04:45:34 +00:00
894efcd0e9 [DTensor] Supported simple replicate strategy for SVD (#127004)
This PR adds a simple strategy to always replicate for `torch.linalg.svd()`. This is to help unblock some GaLore exploration.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127004
Approved by: https://github.com/wanchaol
2024-05-24 04:34:43 +00:00
70dc59c55f Fix perf regression caused by #122074 (#126996)
The original change was about 9.5% slower than before #122074.
This improves it to be only about 1.4% slower.

Also touched up some unrelated nits that the linter complained about.

Fixes #126293

Ran torchbench 3 times on each change. Perf values before (stable), after (fix),
and with #122074 backed out (backout):
```
../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench pyhpc_isoneutral_mixing amp first dynamic cpp
stable:
43.948x
45.754x
44.906x

fix:
47.505x
49.987x
47.493x

backout:
48.243x
48.199x
48.192x

../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench pyhpc_equation_of_state amp first static default
stable:
15.224x
13.286x
15.354x

fix:
16.402x
16.370x
16.183x

backout:
16.554x
16.675x
16.787x

../inductor-tools/scripts/modelbench/inductor_single_run.sh single inference performance torchbench lennard_jones float32 first static default
stable:
1.712x
1.651x
1.640x

fix:
1.804x
1.798x
1.792x

backout:
1.864x
1.824x
1.836x
```
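
As a worked example of how such percentages can be derived from the raw runs, the sketch below averages the three `lennard_jones` runs listed above and compares each variant against the backout. It assumes the `x` values are speedups over eager (higher is better) and that "slower" means the relative shortfall in mean speedup; the helper name `slowdown_pct` is hypothetical, and a single benchmark will not necessarily reproduce the aggregate figures quoted in this message.

```python
from statistics import mean


def slowdown_pct(candidate, reference):
    """Percent by which the mean candidate speedup falls short of the reference."""
    return 100.0 * (1.0 - mean(candidate) / mean(reference))


# lennard_jones float32 runs copied from the log above.
stable = [1.712, 1.651, 1.640]
fix = [1.804, 1.798, 1.792]
backout = [1.864, 1.824, 1.836]

print(f"stable vs backout: {slowdown_pct(stable, backout):.1f}% slower")
print(f"fix vs backout:    {slowdown_pct(fix, backout):.1f}% slower")
```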

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126996
Approved by: https://github.com/jansel
2024-05-24 04:27:22 +00:00
cb6ef68caa Propagate tokens in aotautograd (#127028)
Test Plan: `buck run mode/dev-nosan //aimp/experimental/pt2:pt2_export -- --model-entity-id 938593492 --output /tmp/938593492.zip --use-torchrec-eager-mp --use-manifold`

Differential Revision: D57750072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127028
Approved by: https://github.com/tugsbayasgalan
2024-05-24 03:23:17 +00:00
99a11efc8a Revert "Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946)"
This reverts commit e2f081837f4276c1a6a37739bd28157f62004a06.

Reverted https://github.com/pytorch/pytorch/pull/125946 on behalf of https://github.com/clee2000 due to I think dr ci is wrong and the windows build failure is real e2f081837f https://github.com/pytorch/pytorch/actions/runs/9216826622/job/25357819877 ([comment](https://github.com/pytorch/pytorch/pull/125946#issuecomment-2128388126))
2024-05-24 02:37:46 +00:00
cfb374dc73 [BE] Create grad check util (#126991)
# Summary
Add small utility func for deciding if we should compute LSE and update to also check for gradMode
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126991
Approved by: https://github.com/cpuhrsch
2024-05-24 02:36:00 +00:00
27594be3ed [dtensor][be] remove repeated test in test_comm_mode.py (#127029)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127029
Approved by: https://github.com/XilunWu
ghstack dependencies: #127025
2024-05-24 01:42:13 +00:00
89c638f9a5 [dtensor][debug] add all_reduce_coalesced tracing to CommDebugMode (#127025)
**Summary**
Added all_reduce_coalesced tracing to CommDebugMode and added test case to test_comm_mode test suite.

**Test Plan**
pytest test/distributed/_tensor/debug/test_comm_mode.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127025
Approved by: https://github.com/XilunWu
2024-05-24 01:42:13 +00:00
575cb617db Add compile time profiler for non fbcode targets (#126904)
This is a tool that allows profiling compile time using the strobelight profiler. It is a Meta-only tool,
but it works on non-fbcode targets.

A follow up diff will unify this with caffe2/fb/strobelight/compile_time_profiler.py.
example test:

```
run  python tools/strobelight/examples/compile_time_profile_example.py
```

```
python torch/utils/_strobelight/examples/compile_time_profile_example.py
strobelight_compile_time_profiler, line 61, 2024-05-23 10:49:28,101, INFO: compile time strobelight profiling enabled
strobelight_compile_time_profiler, line 93, 2024-05-23 10:49:28,102, INFO: Unique sample tag for this run is: 2024-05-23-10:49:282334638devvm4561.ash0.facebook.com
strobelight_compile_time_profiler, line 94, 2024-05-23 10:49:28,102, INFO: You can use the following link to access the strobelight profile at the end of the run: https://www.internalfb.com/intern/scuba/query/?dataset=pyperf_experimental%2Fon_demand&drillstate=%7B%22purposes%22%3A[]%2C%22end%22%3A%22now%22%2C%22start%22%3A%22-30%20days%22%2C%22filterMode%22%3A%22DEFAULT%22%2C%22modifiers%22%3A[]%2C%22sampleCols%22%3A[]%2C%22cols%22%3A[%22namespace_id%22%2C%22namespace_process_id%22]%2C%22derivedCols%22%3A[]%2C%22mappedCols%22%3A[]%2C%22enumCols%22%3A[]%2C%22return_remainder%22%3Afalse%2C%22should_pivot%22%3Afalse%2C%22is_timeseries%22%3Afalse%2C%22hideEmptyColumns%22%3Afalse%2C%22timezone%22%3A%22America%2FLos_Angeles%22%2C%22compare%22%3A%22none%22%2C%22samplingRatio%22%3A%221%22%2C%22metric%22%3A%22count%22%2C%22aggregation_field%22%3A%22async_stack_complete%22%2C%22top%22%3A10000%2C%22aggregateList%22%3A[]%2C%22param_dimensions%22%3A[%7B%22dim%22%3A%22py_async_stack%22%2C%22op%22%3A%22edge%22%2C%22param%22%3A%220%22%2C%22anchor%22%3A%220%22%7D]%2C%22order%22%3A%22weight%22%2C%22order_desc%22%3Atrue%2C%22constraints%22%3A[[%7B%22column%22%3A%22sample_tags%22%2C%22op%22%3A%22all%22%2C%22value%22%3A[%22[%5C%222024-05-23-10:49:282334638devvm4561.ash0.facebook.com%5C%22]%22]%7D]]%2C%22c_constraints%22%3A[[]]%2C%22b_constraints%22%3A[[]]%2C%22ignoreGroupByInComparison%22%3Afalse%7D&view=GraphProfilerView&&normalized=1712358002&pool=uber
strobelight_function_profiler, line 241, 2024-05-23 10:49:34,943, INFO: strobelight run id is: 3507039740348330
strobelight_function_profiler, line 243, 2024-05-23 10:50:00,907, INFO: strobelight profiling running
strobelight_function_profiler, line 224, 2024-05-23 10:50:02,741, INFO: strobelight profiling stopped
strobelight_function_profiler, line 215, 2024-05-23 10:50:06,173, INFO: Total samples: 7
strobelight_function_profiler, line 215, 2024-05-23 10:50:06,173, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/75cxdro3
strobelight_function_profiler, line 215, 2024-05-23 10:50:06,173, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/qsgydsee
strobelight_compile_time_profiler, line 120, 2024-05-23 10:50:06,174, INFO: 1 strobelight success runs out of 1 non-recursive compilation events.
strobelight_function_profiler, line 241, 2024-05-23 10:50:08,137, INFO: strobelight run id is: 8721740011604497
strobelight_function_profiler, line 243, 2024-05-23 10:50:34,801, INFO: strobelight profiling running
strobelight_function_profiler, line 224, 2024-05-23 10:50:36,803, INFO: strobelight profiling stopped
strobelight_function_profiler, line 215, 2024-05-23 10:50:41,289, INFO: Total samples: 3
strobelight_function_profiler, line 215, 2024-05-23 10:50:41,289, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/qmi2ucwp
strobelight_function_profiler, line 215, 2024-05-23 10:50:41,289, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/7fjkhs9i
strobelight_compile_time_profiler, line 120, 2024-05-23 10:50:41,289, INFO: 2 strobelight success runs out of 2 non-recursive compilation events.
strobelight_function_profiler, line 241, 2024-05-23 10:50:43,597, INFO: strobelight run id is: 1932476082259558
strobelight_function_profiler, line 243, 2024-05-23 10:51:09,791, INFO: strobelight profiling running
strobelight_function_profiler, line 224, 2024-05-23 10:51:11,883, INFO: strobelight profiling stopped
strobelight_function_profiler, line 215, 2024-05-23 10:51:16,218, INFO: Total samples: 3
strobelight_function_profiler, line 215, 2024-05-23 10:51:16,218, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/vy1ujxec
strobelight_function_profiler, line 215, 2024-05-23 10:51:16,218, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/2xgadviv
strobelight_compile_time_profiler, line 120, 2024-05-23 10:51:16,219, INFO: 3 strobelight success runs out of 3 non-recursive compilation events.
```

Or set TORCH_COMPILE_STROBELIGHT=TRUE when running any torch.compile Python program.
Example: running on XLNetLMHeadModel.
```
 TORCH_COMPILE_STROBELIGHT=TRUE TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 time python benchmarks/dynamo/huggingface.py --ci --accuracy --timing --explain --inductor --device cuda --training --amp  --only XLNetLMHeadModel
 ```
 result:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126904
Approved by: https://github.com/aorenste
ghstack dependencies: #126693
2024-05-24 01:39:40 +00:00
e2f081837f Lift jagged -> padded dense forward / backward kernels from fbgemm_gpu (#125946)
PyTorch can't depend on `fbgemm_gpu` as a dependency because `fbgemm_gpu` already has a dependency on PyTorch. So this PR copy / pastes kernels from `fbgemm_gpu`:
* `dense_to_jagged_forward()` as CUDA registration for new ATen op `_padded_dense_to_jagged_forward()`
* `jagged_to_padded_dense_forward()` as CUDA registration for new ATen op `_jagged_to_padded_dense_forward()`

CPU impls for these new ATen ops will be added in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125946
Approved by: https://github.com/davidberard98
2024-05-24 00:42:59 +00:00
3f5b59eef4 [codemod] c10::optional -> std::optional in caffe2/aten/src/ATen/DeviceGuard.h +117 (#126901)
Summary:
Generated with
```
fbgs -f '.*\.(cpp|cxx|cc|h|hpp|cu|cuh)$' c10::optional -l | perl -pe 's/^fbsource.fbcode.//' | grep -v executorch | xargs -n 50 perl -pi -e 's/c10::optional/std::optional/g'
```

 - If you approve of this diff, please use the "Accept & Ship" button :-)

(117 files modified.)

Test Plan: Sandcastle

Reviewed By: palmje

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126901
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-05-24 00:26:15 +00:00
cyy
95e5c994f9 [Submodule] Clear USE_QNNPACK build option (#126941)
Following the removal of QNNPACK third-party module #126657, we can clear more build system code. Also third_party/neon2sse was removed because it is not used.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126941
Approved by: https://github.com/ezyang
2024-05-24 00:12:56 +00:00
dfabae5b89 Revert "[pipelining] Add grad test for interleaved schedules (#126931)"
This reverts commit abf6d4e6bc1a9a0e08bfc2204560ca7858fa90cd.

Reverted https://github.com/pytorch/pytorch/pull/126931 on behalf of https://github.com/clee2000 due to newly added test fails distributed/pipelining/test_schedule.py::ScheduleTest::test_grad_with_manual_interleaved_ScheduleClass0 abf6d4e6bc https://github.com/pytorch/pytorch/actions/runs/9214413308/job/25352507591, pull workflow failed on startup on PR, so no distributed tests ran at all ([comment](https://github.com/pytorch/pytorch/pull/126931#issuecomment-2128228496))
2024-05-23 23:51:29 +00:00
2db13633e7 [export] disable forced specializations, even when solvable with single var (#126925)
Summary:
Previously https://github.com/pytorch/pytorch/pull/124949 added the ability to disable forced specializations on dynamic shapes for export, keeping dynamism for complex guards instead of specializing, allowing unsoundness by having the user fail at runtime.

It avoided disabling one case: single-variable equality guards, where a variable is specified as dynamic but is solvable for a concrete value, suggesting the correct behavior is specialization. For example, the guard Eq(s0 // 4, 400) suggests s0 should specialize to 1600.

In debugging, some users (e.g. APS) would like to keep this dynamic, and defer to failing at runtime instead. This PR adds this, so now all forced specializations should be turned off. Mostly this should be used for debugging, since it produces unsoundness, and lets the user proceed with (probably) incorrect dynamism.

Test Plan: export tests

Differential Revision: D57698601

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126925
Approved by: https://github.com/angelayi
2024-05-23 23:43:30 +00:00
6eac3f45c7 Add basic sanity checks for graph ops to cache key (#124745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124745
Approved by: https://github.com/bdhirsh
2024-05-23 23:37:43 +00:00
ff82e2e7cf [traced-graph][sparse] propagate sparsity metadata into traced graph (#117907)
Propagate sparsity metadata from sparse tensors of torch.sparse into the traced graph representation (which would be useful for a JIT backend that supports a "sparse compiler"). This is a first careful attempt, since the actual "meta" feature still seems incomplete for coo and completely lacking for csr/csc/bsr/bsc.

For background see forum postings (with examples):
  https://discuss.pytorch.org/t/connecting-pytorch-sparse-tensors-with-mlir/195145
  https://dev-discuss.pytorch.org/t/connecting-pytorch-sparse-tensors-with-mlir/1803

And feature request:
  https://github.com/pytorch/pytorch/issues/117188

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117907
Approved by: https://github.com/pearu, https://github.com/ezyang
2024-05-23 22:46:46 +00:00
93ba5e7291 Fix typo for input (#126981)
The variable name should be `cloned_inputs` rather than `clone_inputs`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126981
Approved by: https://github.com/xuzhao9
2024-05-23 22:08:14 +00:00
d11e44c0d0 Reset grad state across unittests (#126345)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126345
Approved by: https://github.com/ezyang
2024-05-23 21:16:39 +00:00
a31a60d85b Change run_test.py arg parsing to handle additional args better (#126709)
Do not inherit parser from common_utils
* I don't think we use any variables in run_test that depend on those, and I think all tests except doctests run in a subprocess so they will parse the args in common_utils and set the variables.  I don't think doctests wants any of those variables?

Parse known args, add the extra args as extra, pass the extra ones along to the subprocess
Removes the first instance of `--`
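
A minimal sketch of the parsing approach described above, using `argparse.parse_known_args`. The `split_args` helper and the single `--verbose` option are hypothetical stand-ins, not the actual run_test.py code:

```python
import argparse


def split_args(argv):
    """Parse the options this script knows and return the rest untouched."""
    # Hypothetical stand-in parser; the real run_test.py defines many more options.
    parser = argparse.ArgumentParser(allow_abbrev=False)
    parser.add_argument("-v", "--verbose", action="store_true")
    # parse_known_args collects unrecognized arguments instead of raising an error.
    known, extra = parser.parse_known_args(argv)
    # Drop only the first "--" separator, if one is still present.
    if "--" in extra:
        extra.remove("--")
    return known, extra


known, extra = split_args(["-v", "--", "--some-test-flag", "value"])
```

Here `extra` would be forwarded to the test subprocess as-is, which is also why an invalid argument is no longer rejected up front.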

I think I will miss run_test telling me if an arg is valid or not

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126709
Approved by: https://github.com/ZainRizvi, https://github.com/huydhn, https://github.com/Flamefire
2024-05-23 21:08:12 +00:00
09a73da190 Downgrade requests to 2.31.0 for ios and android (#126989)
Ex https://github.com/pytorch/pytorch/actions/runs/9211850483/job/25342181353
https://github.com/pytorch/pytorch/actions/runs/9211850483/job/25342182105

2.32.0 isn't on the conda channels yet?

Is there a way to add them?

If not here's a PR to downgrade
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126989
Approved by: https://github.com/atalman, https://github.com/malfet
2024-05-23 21:02:50 +00:00
0d2ac9782b [FSDP1] Update docstring to include device_mesh arg (#126589)
Fixes #126548

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126589
Approved by: https://github.com/wanchaol
2024-05-23 20:40:48 +00:00
0902929d58 [CUDA] [CI]: Enable CUDA 12.4 CI (#121956)
Reference PR: https://github.com/pytorch/pytorch/pull/93406

Co-authored-by: Aidyn-A <31858918+Aidyn-A@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121956
Approved by: https://github.com/atalman
2024-05-23 20:37:47 +00:00
abf6d4e6bc [pipelining] Add grad test for interleaved schedules (#126931)
Added `test_grad_with_manual_interleaved`:
- Model: `MultiMLP`
- Tested schedules: Interleaved1F1B, LoopedBFS
- Two stages per rank
```
Rank 0 stages: [0, 2]
Rank 1 stages: [1, 3]
```
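
The stage layout above is consistent with a round-robin placement of stages over ranks. A small sketch under that assumption (`stages_for_rank` is a hypothetical helper, not a function from the test):

```python
def stages_for_rank(rank, num_ranks, stages_per_rank):
    """Return the stage indices owned by `rank` under round-robin placement."""
    # Stage i * num_ranks + rank goes to `rank`, for each local stage slot i.
    return [rank + i * num_ranks for i in range(stages_per_rank)]


layout = {rank: stages_for_rank(rank, num_ranks=2, stages_per_rank=2) for rank in range(2)}
```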

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126931
Approved by: https://github.com/wconstab
ghstack dependencies: #126812, #126721, #126735, #126927
2024-05-23 20:26:08 +00:00
c46b38bc75 [pipelining] Generalize definition of MultiMLP for testing interleaved schedules (#126927)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126927
Approved by: https://github.com/wconstab
ghstack dependencies: #126812, #126721, #126735
2024-05-23 20:26:08 +00:00
6b39146b3f [pipelining] Validate stage input/output shape/dtype (#126732)
Address the classes of user errors stemming from (possibly)
unintentional dynamic shapes usage or mismatch of configuration time and
run time data shapes/dtypes.

The goal is to ensure a clear error is raised rather than relying on some underlying
error to bubble up when a tensor shape is not compatible, or worse,
having a silent correctness issue.

**Classes of shape/dtype errors**
* (a) error is thrown within the stage-module forward code, but may be
hard to understand/trace back to an input issue
* (b) silent correctness issue happens inside the stage-module forward,
but the expected output shape is still produced
* (c) the stage-module produces an output that is locally correct, but not
matching the expectation of the following stage, leading to a hang or
correctness issue down the line

**How validation helps**

Input shape validation
- improves debuggability of case (a)
- guards against case (b)
- only needed on first stage, since subsequent stages use pre-allocated recv
  buffers that can't change shape/size even if they wanted to

Output shape validation
- guards against case (c)

Validation of first stage input and all stages' outputs inductively verifies all shapes

Shape/dtype are most critical as they literally affect the number of
bytes on the wire.  Strides and other tensor properties may also (?)
matter, and the validation function can be adjusted accordingly if needed.
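
A rough sketch of the kind of check described above. It uses a plain `TensorMeta` tuple in place of real tensors so it runs without torch; `TensorMeta` and `validate_metadata` are hypothetical names, not the functions added by this PR:

```python
from typing import NamedTuple, Sequence, Tuple


class TensorMeta(NamedTuple):
    """Stand-in for the shape/dtype a stage was configured with."""
    shape: Tuple[int, ...]
    dtype: str


def validate_metadata(desc: str, expected: Sequence[TensorMeta], actual: Sequence[TensorMeta]) -> None:
    """Raise a clear error if the runtime shapes/dtypes differ from the configured ones."""
    if len(expected) != len(actual):
        raise ValueError(f"{desc}: expected {len(expected)} tensors, got {len(actual)}")
    for i, (exp, act) in enumerate(zip(expected, actual)):
        if exp.shape != act.shape:
            raise ValueError(f"{desc} #{i}: shape {act.shape} does not match expected {exp.shape}")
        if exp.dtype != act.dtype:
            raise ValueError(f"{desc} #{i}: dtype {act.dtype} does not match expected {exp.dtype}")


configured = [TensorMeta((8, 16), "float32")]
validate_metadata("stage 0 input", configured, [TensorMeta((8, 16), "float32")])  # passes silently
```

Running the same check on the first stage's inputs and on every stage's outputs corresponds to the inductive argument above.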

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126732
Approved by: https://github.com/kwen2501
2024-05-23 20:16:06 +00:00
9b91c91e64 Don't add to replacements when guard is suppressed (#126210)
Also improve logging when guards are suppressed

Partially addresses https://github.com/pytorch/pytorch/issues/125641

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126210
Approved by: https://github.com/jbschlosser
2024-05-23 20:10:29 +00:00
f8857cef45 [Reland] Verify types in custom op schemas (#126861)
Summary:
co-dev reland of https://github.com/pytorch/pytorch/pull/124520, which requires
the removal of some executorch tests.

Before this PR, we didn't check that types in a schema were valid. This
is because TorchScript treats unknown types as type variables.

This PR checks types in a schema for the TORCH_LIBRARY APIs. To do this,
we add an `allow_typevars` flag to parseSchema so that TorchScript can
use allow_typevars=True. We also add some error messages for common
mistakes (e.g. using int64_t or double in schema).
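
As an illustration of the "common mistakes" case: schema strings use `int` and `float` where C++ signatures use `int64_t` and `double`. The helper below is a hypothetical standalone sketch of such a hint, not the parser change made in this PR, and the message wording is invented:

```python
import re

# C++ spellings that are not valid schema types, mapped to the schema type to use.
CPP_TO_SCHEMA = {"int64_t": "int", "double": "float"}


def suggest_schema_fixes(schema: str):
    """Return hints for C++ type names that appear in an operator schema string."""
    hints = []
    for cpp_name, schema_name in CPP_TO_SCHEMA.items():
        # Match whole words only, so e.g. "int" alone is not flagged.
        if re.search(rf"\b{re.escape(cpp_name)}\b", schema):
            hints.append(f"use '{schema_name}' instead of '{cpp_name}'")
    return hints


hints = suggest_schema_fixes("foo(Tensor x, int64_t n, double alpha) -> Tensor")
```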

Test Plan: Wait for tests

Differential Revision: D57666659

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126861
Approved by: https://github.com/albanD
2024-05-23 19:53:52 +00:00
c921c5cc77 [c10d] Print certain logs only on head rank of each node (#125432)
Recently we added the following warning, which is printed on every rank and makes the log a bit verbose.

This PR dedups certain logs that are identical across ranks and prints them only on head rank of each node.

Resolves https://github.com/pytorch/pytorch/issues/126275

=========================================

[rank0]:[W502 14:06:55.821964708 ProcessGroupNCCL.cpp:1113] WARNING: process group has NOT been destroyed before it is being destructed. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL data transfers have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4
[rank1]:[W502 14:06:57.994276972 ProcessGroupNCCL.cpp:1113] WARNING: process group has NOT been destroyed before it is being destructed. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL data transfers have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4
[rank2]:[W502 14:07:00.353013116 ProcessGroupNCCL.cpp:1113] WARNING: process group has NOT been destroyed before it is being destructed. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL data transfers have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4
[rank3]:[W502 14:07:02.515511670 ProcessGroupNCCL.cpp:1113] WARNING: process group has NOT been destroyed before it is being destructed. On normal program exit, the application should call destroy_process_group to ensure that any pending NCCL data transfers have finished in this process. In rare cases this process can exit before this point and block the progress of another member of the process group. This constraint has always been present,  but this warning has only been added since PyTorch 2.4
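
For illustration only, the dedup idea can be sketched in Python. The actual change lives in the c10d C++ code; this sketch assumes the head rank of a node is the process with `LOCAL_RANK` 0 (the variable torchrun sets), and `log_once_per_node` is a hypothetical helper:

```python
import os


def is_node_head_rank(local_rank=None):
    """Assume the head rank of a node is the process whose local rank is 0."""
    if local_rank is None:
        local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    return local_rank == 0


def log_once_per_node(message, local_rank=None, emit=print):
    """Emit `message` only on the node's head rank; return whether it was emitted."""
    if is_node_head_rank(local_rank):
        emit(message)
        return True
    return False


# Simulate four ranks on one node: only local rank 0 emits the warning.
emitted = []
for local_rank in range(4):
    log_once_per_node("process group has NOT been destroyed", local_rank, emit=emitted.append)
```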

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125432
Approved by: https://github.com/wconstab
2024-05-23 19:16:11 +00:00
0625f92993 [inductor] Run some tests on correct device (#126943)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126943
Approved by: https://github.com/yanboliang
2024-05-23 18:47:44 +00:00
abf40320dd remove ax/ay arrays in fp16 ARM matmul kernels (#126793)
These shouldn't do anything as only two elements are live at once, so we can simplify the code. (I checked assembly for the inner loops in instruments and it seems to be the same.)

Differential Revision: [D57732738](https://our.internmc.facebook.com/intern/diff/D57732738)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126793
Approved by: https://github.com/malfet
ghstack dependencies: #126745, #126746
2024-05-23 18:42:45 +00:00
5dcf3d0f9e use arith-by-dot-products approach for fp32 accumulation in fp16 matmul (#126746)
Summary: The faster fp16-native kernel is gated off by default. Let's give people better performance in the default case.

Test Plan: CI
benchmarked matmul of size 4100x4100x1 and 4104x4104x1 using https://github.com/malfet/llm_experiments/blob/main/benchmarks/benchmark_torch_mm.py (4100 % 32 = 4100 % 8 = 4). Relevant timing numbers without FP16 reduction (which then uses this kernel):

after:
trans_b  torch.float16 1396.11 usec (4100)
trans_b  torch.float16 1399.54 usec (4104)

before:
trans_b  torch.float16 1840.79 usec (4100)
trans_b  torch.float16 1786.67 usec (4104)

Differential Revision: [D57732736](https://our.internmc.facebook.com/intern/diff/D57732736)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126746
Approved by: https://github.com/malfet
ghstack dependencies: #126745
2024-05-23 18:42:45 +00:00
fd4fd24080 add tail fixup for fp16 gemv transposed fast path (#126745)
Summary: We previously had restrictive gating for the fp16 kernel; now it supports arbitrary m & n.

Test Plan: 1) ran test coverage added in  #126700, passes
2) benchmarked matmul of size 4100x4100x1 and 4104x4104x1 using https://github.com/malfet/llm_experiments/blob/main/benchmarks/benchmark_torch_mm.py (4100 % 32 = 4100 % 8 = 4). Relevant timing numbers with FP16 reduction enabled (which gates this kernel):

after:
trans_b  torch.float16  716.42 usec (4100)
trans_b  torch.float16  711.10 usec (4104)

before:
trans_b  torch.float16 1808.66 usec (4100)
trans_b  torch.float16 1083.18 usec (4104)

Differential Revision: [D57732737](https://our.internmc.facebook.com/intern/diff/D57732737)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126745
Approved by: https://github.com/malfet
2024-05-23 18:42:35 +00:00
b36e390b6c Revert "Default XLA to use swap_tensors path in nn.Module._apply (#126814)"
This reverts commit eb41ed5d90e946e62dd664d7037ebbb021baf33e.

Reverted https://github.com/pytorch/pytorch/pull/126814 on behalf of https://github.com/mikaylagawarecki due to broke xla ci ([comment](https://github.com/pytorch/pytorch/pull/126814#issuecomment-2127719337))
2024-05-23 17:43:06 +00:00
6a06d36296 Revert "Default meta device to use swap_tensors in nn.Module._apply (.to_empty and .to('meta')) (#126819)"
This reverts commit ab61309ab8f6452975021994a6d4a102d55feba8.

Reverted https://github.com/pytorch/pytorch/pull/126819 on behalf of https://github.com/mikaylagawarecki due to broke xla ci ([comment](https://github.com/pytorch/pytorch/pull/126814#issuecomment-2127719337))
2024-05-23 17:43:06 +00:00
041e8d73fd Separate non/strict functions in _export (#126718)
Move non/strict _export to different functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126718
Approved by: https://github.com/angelayi
2024-05-23 17:41:23 +00:00
cyy
e5db6758c8 [BE]: Use make_unique (#126966)
Adds make_unique in places

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126966
Approved by: https://github.com/Skylion007
2024-05-23 17:39:48 +00:00
264155a8d7 [DCP][AC] Add test for apply AC with FSDP1 (#126935)
Adding test for this cherry pick. https://github.com/pytorch/pytorch/pull/126559/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126935
Approved by: https://github.com/fegin
2024-05-23 17:35:54 +00:00
bbe68a16b9 [codemod][lowrisk] Remove extra semi colon from caffe2/caffe2/core/observer.h (#126976)
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D57632765

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126976
Approved by: https://github.com/Skylion007
2024-05-23 17:31:19 +00:00
a63310eebc TorchScript 2 ExportedProgram Converter (#126920)
Summary:
Initial commit for TorchScript 2 ExportedProgram Converter.

TODO:
- Improve TorchScript IR coverage
- parameter and buffers should be owned by output ExportedProgram
- Experiment on conditional op conversion

Test Plan: buck2 run mode/dev-nosan fbcode//caffe2/test:test_export -- -r TestConverter

Differential Revision: D57694784

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126920
Approved by: https://github.com/angelayi, https://github.com/tugsbayasgalan
2024-05-23 17:00:18 +00:00
1b29c16e5e Revert "Introduce ProcessGroupCudaP2P (#122163)"
This reverts commit 2dd269986027ea25c092f769ef8e9524920aaef6.

Reverted https://github.com/pytorch/pytorch/pull/122163 on behalf of https://github.com/jithunnair-amd due to This is breaking ROCm distributed CI on trunk ([comment](https://github.com/pytorch/pytorch/pull/122163#issuecomment-2127518473))
2024-05-23 16:06:14 +00:00
ab61309ab8 Default meta device to use swap_tensors in nn.Module._apply (.to_empty and .to('meta')) (#126819)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126819
Approved by: https://github.com/albanD
ghstack dependencies: #126814
2024-05-23 15:43:32 +00:00
eb41ed5d90 Default XLA to use swap_tensors path in nn.Module._apply (#126814)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126814
Approved by: https://github.com/JackCaoG, https://github.com/albanD
2024-05-23 15:43:32 +00:00
f0366de414 [dynamo] Support __contains__ on obj.__dict__ (#126922)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126922
Approved by: https://github.com/jansel, https://github.com/yanboliang
2024-05-23 09:01:29 +00:00
25b8dbc3e4 Revert "[inductor][cpp] GEMM template (infra and fp32) (#124021)"
This reverts commit 9da7efa6774777890c8e4a713f6d23ea5cfcf6a4.

Reverted https://github.com/pytorch/pytorch/pull/124021 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126568331))
2024-05-23 08:50:18 +00:00
45784cd229 Revert "[inductor][cpp] epilogue support for gemm template (#126019)"
This reverts commit 08f57b4bffe6edfdb016703219744482b4d03e23.

Reverted https://github.com/pytorch/pytorch/pull/126019 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126568331))
2024-05-23 08:50:18 +00:00
926327e8fc Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)"
This reverts commit 31412cb2f25bda0fe31dae7b2afc88278794cad6.

Reverted https://github.com/pytorch/pytorch/pull/126068 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126568331))
2024-05-23 08:50:18 +00:00
30c9ca0899 Revert "[inductor][cpp] support bf16/fp16 gemm template epilogue fusion (#126545)"
This reverts commit 7b6d036c05bd782f5e59bdb353f9e47865e9db50.

Reverted https://github.com/pytorch/pytorch/pull/126545 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126568331))
2024-05-23 08:50:18 +00:00
da7bf1d588 [export] Fix unflatten with empty nn_module_stack (#126785)
Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1433418843962989/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126785
Approved by: https://github.com/tugsbayasgalan
2024-05-23 08:34:25 +00:00
a6155d23d1 [easy] Delete dead code global (#126903)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126903
Approved by: https://github.com/aorenste
ghstack dependencies: #126083
2024-05-23 08:29:29 +00:00
cc61d03ac9 Do not trace into triton/backends (#126083)
Fixes #125807

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126083
Approved by: https://github.com/yanboliang, https://github.com/jansel
2024-05-23 08:29:29 +00:00
558c4413ce add strobelight cli function profiler (#126693)
This is a Meta-only tool. It allows users to profile any Python function with the strobelight profiler by annotating it with the **strobelight** decorator.
ex
```
    def fn(x, y, z):
        return x * y + z

    # use decorator with default profiler.
    @strobelight()
    @torch.compile()
    def work():
        for i in range(100):
            for j in range(5):
                fn(torch.rand(j, j), torch.rand(j, j), torch.rand(j, j))

    work()
```

test
```
 python torch/utils/strobelight/examples/cli_function_profiler_example.py
strobelight_cli_function_profiler, line 274, 2024-05-20 11:05:41,513, INFO: strobelight run id is: -6222660165281106
strobelight_cli_function_profiler, line 276, 2024-05-20 11:06:08,318, INFO: strobelight profiling running
strobelight_cli_function_profiler, line 257, 2024-05-20 11:06:11,867, INFO: strobelight profiling stopped
strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:16,164, INFO: Total samples: 2470
strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:16,164, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/oiqmyltg
strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:16,164, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/b10x92x0
strobelight_cli_function_profiler, line 274, 2024-05-20 11:06:18,476, INFO: strobelight run id is: -4112659701221677
strobelight_cli_function_profiler, line 276, 2024-05-20 11:06:45,096, INFO: strobelight profiling running
strobelight_cli_function_profiler, line 257, 2024-05-20 11:06:52,366, INFO: strobelight profiling stopped
strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:56,222, INFO: Total samples: 1260
strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:56,222, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/0yyx6el5
strobelight_cli_function_profiler, line 237, 2024-05-20 11:06:56,223, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/8m2by4ea
(base) [lsakka@devvm4561.ash0 /data/users/lsakka/pytorch/pytorch (strobelight2)]$ python torch/profiler/strobelight_cli_function_profiler_example.py
strobelight_cli_function_profiler, line 274, 2024-05-20 11:07:26,701, INFO: strobelight run id is: -2373009368202256
strobelight_cli_function_profiler, line 276, 2024-05-20 11:07:53,477, INFO: strobelight profiling running
strobelight_cli_function_profiler, line 257, 2024-05-20 11:07:56,827, INFO: strobelight profiling stopped
strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:01,138, INFO: Total samples: 2372
strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:01,138, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/dk797xg9
strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:01,138, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/4w6c8vnm
strobelight_cli_function_profiler, line 274, 2024-05-20 11:08:03,235, INFO: strobelight run id is: -1919086123693716
strobelight_cli_function_profiler, line 276, 2024-05-20 11:08:29,848, INFO: strobelight profiling running
strobelight_cli_function_profiler, line 257, 2024-05-20 11:08:37,233, INFO: strobelight profiling stopped
strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:41,138, INFO: Total samples: 1272
strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:41,138, INFO: GraphProfiler (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/43r58aew
strobelight_cli_function_profiler, line 237, 2024-05-20 11:08:41,138, INFO: Icicle view (python stack): https://fburl.com/scuba/pyperf_experimental/on_demand/9g52onmw
(base) [lsakka@devvm4561.ash0 /data/users/lsakka/pytorch/pytorch (strobelight2)]$
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126693
Approved by: https://github.com/aorenste
2024-05-23 07:42:25 +00:00
7b6d036c05 [inductor][cpp] support bf16/fp16 gemm template epilogue fusion (#126545)
As part of #125683, this PR adds epilogue fusion support for bf16/fp16 gemms. The key changes are as follows:
1. bf16 linear w/ epilogue fusion of some ops was originally supported via ATen oneDNN linear pointwise ops. In order to match the ATen op semantics, in-template epilogue support is added to the cpp gemm template so that we would have: "gemm + in-template epilogues -> template buffer". If the template is chosen for codegen, the in-template epilogues will be concatenated with the out-of-template epilogues that are appended during the scheduling.
2. Support bf16/fp16 legalization for `codegen_loop_bodies` which is used to generate the epilogue loops.
3. We used to leverage the in-place buffer mechanism to handle the in-place buffers in the epilogue codegen, in particular, for the reuses for output buffers of GEMM, template and epilogues. This is not correct since the output buffer is an "output" not an "in-place" buffer of the template kernel itself. Now, we use a dedicated "aliases" dict to manage such buffer reuses and the intermediate aliasing buffers are removed after codegen.
4. Add `localize_buffer` method to `LocalBufferScope` to allow the replacement of a global buffer with a local one in the given inductor IR nodes. This helps the fused loops to work on smaller-sized local buffers for better data locality.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126545
Approved by: https://github.com/jansel
ghstack dependencies: #124021, #126019, #126068
2024-05-23 07:39:29 +00:00
31412cb2f2 [inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)
As part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet which will be added in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068
Approved by: https://github.com/jansel
ghstack dependencies: #124021, #126019
2024-05-23 07:39:29 +00:00
08f57b4bff [inductor][cpp] epilogue support for gemm template (#126019)
As part of #125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019
Approved by: https://github.com/jansel
ghstack dependencies: #124021
2024-05-23 07:39:29 +00:00
9da7efa677 [inductor][cpp] GEMM template (infra and fp32) (#124021)
This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.
1. Cpp template infrastructure
Template abstractions similar to those of the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`, plus the MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates.
2. Initial FP32 gemm template
This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with constant weight (`B`). It requires `N` to be a multiple of the register blocking size, while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix would be prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm` which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with ATen vec abstraction.
3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static shape and dynamic shapes. Since it is an initial implementation, we are still working on further performance improves with follow-up PRs including the optimizations in kernels as well as fusions. The perf gains are only observed from a selective number of models compared to the ATen kernels which are implemented with MKL. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are details.

Static shapes
| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up:
drq: 1.14x
soft_act: 1.12
cait_m36_384: 1.18x

Dynamic shapes
| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up:
BERT_pytorch: 1.22x
pyhpc_turbulent: 1.13x
soft_actor_critic: 1.77x
BlenderbotForCausalLM: 1.09x
cait_m36_384: 1.17x

Differential Revision: [D57585365](https://our.internmc.facebook.com/intern/diff/D57585365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021
Approved by: https://github.com/jansel
2024-05-23 07:39:29 +00:00
aa6de76181 Fix silu test for flexattention (#126641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126641
Approved by: https://github.com/ezyang, https://github.com/drisspg
ghstack dependencies: #126615, #126446
2024-05-23 05:45:07 +00:00
36e70572d0 [Dynamo] make bytecode of resume function resemble natural bytecode (#126630)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126630
Approved by: https://github.com/williamwen42
2024-05-23 05:06:33 +00:00
2c90b99267 Revert "reset dynamo cache before each test (#126586)"
This reverts commit 43f2f43eb3b6d8cbe8eb7f45acb50376092f1a16.

Reverted https://github.com/pytorch/pytorch/pull/126586 on behalf of https://github.com/clee2000 due to broke tests on inductor? test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float64 43f2f43eb3 https://github.com/pytorch/pytorch/actions/runs/9200644034/job/25308511495 ([comment](https://github.com/pytorch/pytorch/pull/126586#issuecomment-2126228689))
2024-05-23 04:54:28 +00:00
b1e214ceb1 Revert "don't check memory format for empty tensors (#126593)"
This reverts commit 12dee4f2046d07db97cddc7b3c5bdf06fc304ae3.

Reverted https://github.com/pytorch/pytorch/pull/126593 on behalf of https://github.com/clee2000 due to broke tests on inductor? test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float64 43f2f43eb3 https://github.com/pytorch/pytorch/actions/runs/9200644034/job/25308511495 ([comment](https://github.com/pytorch/pytorch/pull/126586#issuecomment-2126228689))
2024-05-23 04:54:28 +00:00
df4b7cb5f7 Reapply "Skip test_memory_format_nn_BatchNorm2d in inductor (#125970)" (#126594)
This reverts commit ce6e36bf8b524c3f4b07605c5b3af2b7d5ba8fd9.

Reverted https://github.com/pytorch/pytorch/pull/126594 on behalf of https://github.com/clee2000 due to broke tests on inductor? test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float64 43f2f43eb3 https://github.com/pytorch/pytorch/actions/runs/9200644034/job/25308511495 ([comment](https://github.com/pytorch/pytorch/pull/126586#issuecomment-2126228689))
2024-05-23 04:54:28 +00:00
4f14282e35 Revert "[inductor][cpp] GEMM template (infra and fp32) (#124021)"
This reverts commit 2ac33a9f663269e6060246337c776a20c3b7c858.

Reverted https://github.com/pytorch/pytorch/pull/124021 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it has a land race and failing in trunk 2ac33a9f66 ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126016522))
2024-05-23 01:13:29 +00:00
657d39e44c Revert "[inductor][cpp] epilogue support for gemm template (#126019)"
This reverts commit 57108d9a4990f6b2ed3578cee58354ab01505dd3.

Reverted https://github.com/pytorch/pytorch/pull/126019 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it has a land race and failing in trunk 2ac33a9f66 ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126016522))
2024-05-23 01:13:29 +00:00
205f08140e Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)"
This reverts commit 57c185b4c765c522a7f2908a773d128c66def190.

Reverted https://github.com/pytorch/pytorch/pull/126068 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it has a land race and failing in trunk 2ac33a9f66 ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2126016522))
2024-05-23 01:13:29 +00:00
2b57652278 Update requests to 2.32.2 (#126805)
To address CVE-2024-35195 (though it does not really affect PyTorch, only CI)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126805
Approved by: https://github.com/atalman, https://github.com/kit1980, https://github.com/seemethere, https://github.com/Skylion007
2024-05-23 00:21:28 +00:00
eqy
ebbd431d9e [CPU] Bump test_complex_2d thresholds for LBFGS on complex64 (#126358)
Is this supposed to be bitwise identical? Wasn't sure how to interpret the comment but it seems to be giving mismatches like:
```
Mismatched elements: 1 / 2 (50.0%)
Greatest absolute difference: 4.6372413635253906e-05 at index (1,) (up to 1e-05 allowed)
Greatest relative difference: 3.4600801882334054e-05 at index (1,) (up to 1.3e-06 allowed)

To execute this test, run the following from the base repo dir:
     python test/test_optim.py -k test_complex_2d_LBFGS_cpu_complex64
```

on Neoverse-N2 SBSA ARM CPUs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126358
Approved by: https://github.com/lezcano, https://github.com/janeyx99
2024-05-23 00:16:45 +00:00
57c185b4c7 [inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)
As part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet which will be added in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068
Approved by: https://github.com/jansel
ghstack dependencies: #124021, #126019
2024-05-23 00:12:38 +00:00
57108d9a49 [inductor][cpp] epilogue support for gemm template (#126019)
As part of #125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019
Approved by: https://github.com/jansel
ghstack dependencies: #124021
2024-05-23 00:07:52 +00:00
2ac33a9f66 [inductor][cpp] GEMM template (infra and fp32) (#124021)
This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.
1. Cpp template infrastructure
Template abstractions similar to the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`, plus the MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates.
2. Initial FP32 gemm template
This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with constant weight (`B`), requiring `N` to be a multiple of the register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix would be prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm`, which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with the ATen vec abstraction.
3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since it is an initial implementation, we are still working on further performance improvements with follow-up PRs, including optimizations in kernels as well as fusions. The perf gains are only observed on a select number of models compared to the ATen kernels, which are implemented with MKL. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are details.

Static shapes
| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up:
drq: 1.14x
soft_act: 1.12
cait_m36_384: 1.18x

Dynamic shapes
| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up:
BERT_pytorch: 1.22x
pyhpc_turbulent: 1.13x
soft_actor_critic: 1.77x
BlenderbotForCausalLM: 1.09x
cait_m36_384: 1.17x
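
The blocking hierarchy described in item 2 above can be illustrated with a toy sketch. This is not the generated C++ kernel: `blocked_gemm`, `micro_gemm`, and the block sizes `mc`, `kc`, `nr` are hypothetical names chosen for illustration, and thread decomposition and `B` prepacking are omitted. It only shows an outer cache-blocking loop nest that hands small tiles to a micro-kernel, with `N` required to be a multiple of the register block.

```python
def micro_gemm(A, B, C, m0, n0, k0, mr, nr, kb):
    """Hypothetical micro-kernel: accumulate one mr x nr tile of C over kb steps of K."""
    for i in range(m0, m0 + mr):
        for j in range(n0, n0 + nr):
            acc = C[i][j]
            for k in range(k0, k0 + kb):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc


def blocked_gemm(A, B, mc=2, kc=2, nr=2):
    """Toy blocked GEMM: C = A @ B using nested block loops (block sizes are assumptions)."""
    M, K, N = len(A), len(B), len(B[0])
    if N % nr != 0:
        # Mirrors the stated restriction that N is a multiple of the register block.
        raise ValueError("N must be a multiple of the register block size")
    C = [[0] * N for _ in range(M)]
    for m0 in range(0, M, mc):          # block over M; the last block may be smaller
        mr = min(mc, M - m0)
        for k0 in range(0, K, kc):      # block over K; partial sums accumulate in C
            kb = min(kc, K - k0)
            for n0 in range(0, N, nr):  # register-sized block over N
                micro_gemm(A, B, C, m0, n0, k0, mr, nr, kb)
    return C
```

Because `M` is handled with a remainder block while `N` is not, the sketch accepts any number of rows in `A`, which loosely corresponds to the static-or-dynamic `M` mentioned above.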

Differential Revision: [D57585365](https://our.internmc.facebook.com/intern/diff/D57585365)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021
Approved by: https://github.com/jansel
2024-05-22 23:59:12 +00:00
e3db9ba37a [FSDP2] Added test for manual reshard with reshard_after_forward=False (#126892)
This test shows that we could always set `reshard_after_forward=False` but manually insert calls to `module.reshard()` to implement the resharding after forward. This is useful for advanced PP schedules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126892
Approved by: https://github.com/wanchaol
ghstack dependencies: #126887
2024-05-22 23:35:06 +00:00
203f2641e9 [FSDP2] Used CommDebugMode for comm. count test (#126887)
simplify the test :)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126887
Approved by: https://github.com/wanchaol
2024-05-22 23:35:06 +00:00
69325e4de6 [FSDP] Warned on wrapping ModuleList/ModuleDict (#124764)
This partially addresses https://github.com/pytorch/pytorch/issues/113794.

To avoid being BC breaking, we just issue a warning when wrapping `ModuleList` or `ModuleDict`. We want to add this warning since this is a common pitfall.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124764
Approved by: https://github.com/wanchaol
2024-05-22 23:34:52 +00:00
b0e849870e Change error message when nn module inlining is enabled for MiscTests.test_map_side_effects (#126444)
#fix https://github.com/pytorch/pytorch/issues/126355

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126444
Approved by: https://github.com/anijain2305
2024-05-22 23:24:03 +00:00
17186bd5b6 [inductor] make conv lowering work with dynamic shapes (#126823)
Fix an issue reported by an internal user where conv lowering does not work well with dynamic shapes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126823
Approved by: https://github.com/jansel
2024-05-22 23:15:29 +00:00
14c5c753de [inductor] use smaller RBLOCK for expensive reduction kernels (#126477)
Triton sometimes uses fewer registers for a more expensive kernel, which results in worse perf ( https://github.com/pytorch/pytorch/issues/126463 ). This may make inductor end up with a sub-optimal config. Use a smaller max RBLOCK if the reduction potentially needs many registers.

Will run perf test..

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126477
Approved by: https://github.com/jansel
2024-05-22 22:47:10 +00:00
ce6e36bf8b Revert "Skip test_memory_format_nn_BatchNorm2d in inductor (#125970)" (#126594)
This reverts commit 0a9c6e92f8d1a35f33042c8dab39f23b7f39d6e7.

enable the test since it's fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126594
Approved by: https://github.com/huydhn
ghstack dependencies: #126586, #126593
2024-05-22 22:43:09 +00:00
12dee4f204 don't check memory format for empty tensors (#126593)
Fix https://github.com/pytorch/pytorch/issues/125967 . The test actually fails for empty 4D or 5D tensors when checking for memory format.

I'm not exactly sure what recent inductor change caused the failure, but it may not be that important to maintain strides for an empty tensor. (?)

I just skip the check for empty tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126593
Approved by: https://github.com/ezyang
ghstack dependencies: #126586
2024-05-22 22:43:09 +00:00
43f2f43eb3 reset dynamo cache before each test (#126586)
In https://github.com/pytorch/pytorch/issues/125967, we found that test results depend on test order. The root cause is that earlier tests populate the dynamo cache and affect later tests.

This PR clears the dynamo cache before each unit test so that unit test results are more deterministic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126586
Approved by: https://github.com/jansel
2024-05-22 22:43:09 +00:00
08c260bc29 [pipelining] Test schedules against manual stage (#126735)
Added manual stage in test_schedule.py so that we can test various schedules against it.

In this file we now have:
- test_schedule_with_tracer
- test_schedule_with_manual
- test_grad_with_tracer
- test_grad_with_manual

Tested schedules are:
- ScheduleGPipe
- Schedule1F1B

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126735
Approved by: https://github.com/wconstab, https://github.com/H-Huang
ghstack dependencies: #126812, #126721
2024-05-22 21:54:27 +00:00
6a539e80dd Update descriptor fields to resolve fft precision issue (#125328)
Fixes #124096
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125328
Approved by: https://github.com/kulinseth, https://github.com/malfet
2024-05-22 21:48:49 +00:00
5ccc634603 [CI] Pin uv==0.1.45 for lintrunner (#126908)
e4623de4cf/1
```

2024-05-22T19:10:48.5974515Z + python3 -m pip install uv
2024-05-22T19:10:48.5975198Z Collecting uv
2024-05-22T19:10:48.5976496Z   Downloading uv-0.1.45-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (32 kB)
2024-05-22T19:10:48.5977828Z Downloading uv-0.1.45-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.8 MB)
2024-05-22T19:10:48.5986243Z [?25l   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/12.8 MB ? eta -:--:--
2024-05-22T19:10:48.5988326Z    ━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━ 6.8/12.8 MB 205.8 MB/s eta 0:00:01
2024-05-22T19:10:48.5990300Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 12.8/12.8 MB 215.1 MB/s eta 0:00:01
2024-05-22T19:10:48.5991645Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 12.8/12.8 MB 215.1 MB/s eta 0:00:01
2024-05-22T19:10:48.5992724Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.8/12.8 MB 97.8 MB/s eta 0:00:00
2024-05-22T19:10:48.5993443Z Installing collected packages: uv
2024-05-22T19:10:48.5993950Z Successfully installed uv-0.1.45
2024-05-22T19:10:48.5994363Z + CACHE_DIRECTORY=/tmp/.lintbin
2024-05-22T19:10:48.5994772Z + [[ -d /tmp/.lintbin ]]
2024-05-22T19:10:48.5995157Z + cp -r /tmp/.lintbin .
2024-05-22T19:10:48.5995497Z + lintrunner init
2024-05-22T19:10:48.5995839Z + [[ 1 == \1 ]]
```
vs
```

2024-05-22T20:33:53.5563991Z + python3 -m pip install uv
2024-05-22T20:33:53.5564921Z Collecting uv
2024-05-22T20:33:53.5566259Z   Downloading uv-0.2.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (32 kB)
2024-05-22T20:33:53.5568142Z Downloading uv-0.2.1-py3-none-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.9 MB)
2024-05-22T20:33:53.5570253Z [?25l   ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 0.0/12.9 MB ? eta -:--:--
2024-05-22T20:33:53.5571889Z    ━━━━━━━━━━━━━━━━━━━━━╺━━━━━━━━━━━━━━━━━━ 7.0/12.9 MB 208.8 MB/s eta 0:00:01
2024-05-22T20:33:53.5573716Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 12.9/12.9 MB 206.7 MB/s eta 0:00:01
2024-05-22T20:33:53.5575478Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╸ 12.9/12.9 MB 206.7 MB/s eta 0:00:01
2024-05-22T20:33:53.5577240Z    ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 12.9/12.9 MB 101.6 MB/s eta 0:00:00
2024-05-22T20:33:53.5578531Z Installing collected packages: uv
2024-05-22T20:33:53.5579316Z Successfully installed uv-0.2.1
2024-05-22T20:33:53.5580033Z + CACHE_DIRECTORY=/tmp/.lintbin
2024-05-22T20:33:53.5580640Z + [[ -d /tmp/.lintbin ]]
2024-05-22T20:33:53.5581229Z + cp -r /tmp/.lintbin .
2024-05-22T20:33:53.5581799Z + lintrunner init
2024-05-22T20:33:53.5603302Z Traceback (most recent call last):
2024-05-22T20:33:53.5604857Z   File "/home/ec2-user/actions-runner/_work/pytorch/pytorch/test-infra/.github/scripts/run_with_env_secrets.py", line 101, in <module>
2024-05-22T20:33:53.5605805Z     main()
2024-05-22T20:33:53.5606687Z   File "/home/ec2-user/actions-runner/_work/pytorch/pytorch/test-infra/.github/scripts/run_with_env_secrets.py", line 97, in main
2024-05-22T20:33:53.5607762Z     run_cmd_or_die(f"docker exec -t {container_name} /exec")
2024-05-22T20:33:53.5608949Z   File "/home/ec2-user/actions-runner/_work/pytorch/pytorch/test-infra/.github/scripts/run_with_env_secrets.py", line 38, in run_cmd_or_die
2024-05-22T20:33:53.5610107Z     raise RuntimeError(f"Command {cmd} failed with exit code {exit_code}")
2024-05-22T20:33:53.5611328Z RuntimeError: Command docker exec -t e551764bdba0c87c2fc392fba9ea265e8821a552915b36010f18299d8035b304 /exec failed with exit code 1
2024-05-22T20:33:53.5626540Z ##[error]Process completed with exit code 1.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126908
Approved by: https://github.com/huydhn
2024-05-22 21:41:21 +00:00
a30baec0c3 [Docs] Fix NumPy + backward example (#126872)
We were calling backward on a tensor not a scalar...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126872
Approved by: https://github.com/albanD
2024-05-22 21:29:31 +00:00
e4623de4cf typing scheduler.py [2/2]: Apply types (#126656)
Add `# mypy: disallow-untyped-defs` to scheduler.py and then fix the resulting fallout.

We probably should eventually add a new node between BaseSchedulerNode and all the non-FusedSchedulerNode types to indicate the split between nodes that have a valid `self.node` and ones that don't. That would cause a lot of the `assert self.node is not None` churn to go away - but was a bigger change because a lot of code makes assumptions about types that aren't reflected in the types themselves.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126656
Approved by: https://github.com/eellison
2024-05-22 20:33:31 +00:00
3591bce6c7 Add usage explanation in torch.dot ducment (#125908)
Fixes #125842

Add a note on unsupported usage of <code>torch.dot</code> to avoid misuse such as:

```python
>>> t1, t2 = torch.tensor([0,1]), torch.tensor([2,3])
>>> torch.dot(input=t1, other=t2)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: dot() missing 1 required positional arguments: "tensor"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125908
Approved by: https://github.com/albanD
2024-05-22 20:33:12 +00:00
0939b68980 Support dtype kwarg in _foreach_norm (#125665)
Fixes #125040

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125665
Approved by: https://github.com/janeyx99
2024-05-22 20:27:50 +00:00
d62b025efc [TorchElastic] Option for sharing TCPStore created by rdzv handlers (#125743)
Summary:

1. Define explicit `use_agent_store` on rdzv handlers. Handlers that set it to true can share the store.
2. Instead of the agent coordinating master_addr/master_port values, the logic is now encapsulated by a *rdzv_handler* where `RendezvousInfo` will have a `RendezvousStoreInfo` object that handlers must return.
    - Depending on the implementation they can either:
         - point to an existing store (and are expected to set `use_agent_store` to true - point 1). Client code will rely on the `TORCHELASTIC_USE_AGENT_STORE` env variable to know if the store is shared.
         - build args that `torch.distributed.init_process_group` can bootstrap by creating new store.

Additional points:

- When TCPStore is shared, it should be wrapped in PrefixStore to qualify/scope namespace for other usecases.
- `next_rendezvous` signature changed to return instance of `RendezvousInfo` instead of a (store, rank, world_size) tuple for extensibility purposes.
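
A minimal sketch of the shape change, assuming nothing about the real classes beyond what is stated above: `ToyStoreInfo`, `ToyRendezvousInfo`, `ToyHandler`, `worker_env`, and the field name `bootstrap_store_info` are hypothetical stand-ins, not the TorchElastic API.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class ToyStoreInfo:
    """Hypothetical stand-in for RendezvousStoreInfo: where workers find the store."""
    master_addr: str
    master_port: int


@dataclass
class ToyRendezvousInfo:
    """Hypothetical stand-in for RendezvousInfo, replacing a (store, rank, world_size) tuple."""
    store: Any
    rank: int
    world_size: int
    bootstrap_store_info: ToyStoreInfo  # field name is an assumption


class ToyHandler:
    """Toy rendezvous handler that shares the store it already created."""
    use_agent_store = True

    def __init__(self, store, addr, port):
        self._store = store
        self._addr = addr
        self._port = port

    def next_rendezvous(self):
        # Returning an object lets new fields be added without breaking callers.
        return ToyRendezvousInfo(
            store=self._store,
            rank=0,
            world_size=1,
            bootstrap_store_info=ToyStoreInfo(self._addr, self._port),
        )


def worker_env(handler):
    """Build the environment a worker would read; the handler decides the store info."""
    info = handler.next_rendezvous()
    return {
        "MASTER_ADDR": info.bootstrap_store_info.master_addr,
        "MASTER_PORT": str(info.bootstrap_store_info.master_port),
        "TORCHELASTIC_USE_AGENT_STORE": str(handler.use_agent_store),
    }
```

The caller no longer computes the address and port itself; it only forwards what the handler returned, which is the encapsulation point 2 describes.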

Why:
- Reduce moving parts
   - easier to swap implementation
   - improve tractability
   - addressing perf/debug-ability will benefit all usecases

Test Plan: CI

Differential Revision: D57055235

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125743
Approved by: https://github.com/d4l3k
2024-05-22 18:24:11 +00:00
fde1e8af7a [dtensor] implement distributed topk operator (#126711)
as titled. Implemented the topk operator in DTensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126711
Approved by: https://github.com/wz337
ghstack dependencies: #126710
2024-05-22 18:11:56 +00:00
af633e4a7b [dtensor] remove unused failed_reason (#126710)
as titled, this field is not actively used, so removing it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126710
Approved by: https://github.com/wz337
2024-05-22 18:11:56 +00:00
a8195f257e [custom_op] use new python custom ops API on prims ops (#124665)
Also adds a non-decorator version of `custom_op`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124665
Approved by: https://github.com/zou3519
2024-05-22 17:48:33 +00:00
db0b74bbc5 [CUDA Caching Allocator] Allow division of 0 (#126833)
Summary: Division of 0 means disabling roundup.

Test Plan: CI

Differential Revision: D57651410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126833
Approved by: https://github.com/banitag1
2024-05-22 17:40:39 +00:00
d4ec18bdad Prevent partitioner from ever saving views (#126446)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126446
Approved by: https://github.com/anijain2305
ghstack dependencies: #126615
2024-05-22 17:28:46 +00:00
51e707650f Fix flexattention not realizing inputs before lowering (also refactored runtime estimation) (#126615)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126615
Approved by: https://github.com/yanboliang, https://github.com/drisspg, https://github.com/xmfan
2024-05-22 17:28:46 +00:00
3e826c477a [pipelining] Add pipeline stage test (#126721)
Test tracer's and manual's stage creation by using a basic schedule (GPipe).

(Migrated from https://github.com/pytorch/PiPPy/blob/main/test/test_pipeline_stage.py)

Test command:
```
$ python test_stage.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126721
Approved by: https://github.com/wconstab, https://github.com/H-Huang
ghstack dependencies: #126812
2024-05-22 16:24:51 +00:00
403012b50a [pipelining] expose APIs per pytorch rule (#126812)
Rule is enforced by #126103.

The rule:
- If `torch.a.b` defines a public class `C` (i.e. to be exposed in torch API namespace), then `torch.a.b` must be a public path, i.e. no `_`.
- `torch.a.b` should ideally have an `__all__` that defines what should be imported from this file when it is imported.
- All other definitions in `torch.a.b` that you don't want to expose should have a `_` prefix.
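
The rule can be expressed as a small check. This is a sketch only: `is_public_path`, `exported_names`, and `make_toy_module` are hypothetical helpers written for illustration and are not the enforcement code from #126103.

```python
import types


def is_public_path(dotted):
    """A module path is public if no component starts with an underscore."""
    return not any(part.startswith("_") for part in dotted.split("."))


def exported_names(module):
    """Names a module exposes: `__all__` if defined, else every non-underscore name."""
    declared = getattr(module, "__all__", None)
    if declared is not None:
        return sorted(declared)
    return sorted(name for name in vars(module) if not name.startswith("_"))


def make_toy_module():
    """Build an in-memory module that follows the rule (names are illustrative)."""
    mod = types.ModuleType("toy.a.b")

    class C:
        """Public class, listed in __all__."""

    def _helper():
        """Private helper, hidden by its underscore prefix."""

    mod.C = C
    mod._helper = _helper
    mod.__all__ = ["C"]
    return mod
```

For example, `is_public_path("torch.a.b")` is true while `is_public_path("torch.a._b")` is false, and the toy module exposes only `C`.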

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126812
Approved by: https://github.com/wconstab
2024-05-22 16:21:13 +00:00
599e684ad6 [AOTI] Disable stack allocation for OSS (#125732)
Summary: Stack allocation is for certain small CPU models, but its coverage still needs improvement, so default to OFF for OSS.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125732
Approved by: https://github.com/chenyang78
ghstack dependencies: #126720, #126801
2024-05-22 15:33:24 +00:00
ff617ab6c8 [AOTI] Fix an int array codegen issue (#126801)
Summary: fixes https://github.com/pytorch/pytorch/issues/126779. When an int array contains a symbolic expression, we can't declare it with constexpr.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126801
Approved by: https://github.com/chenyang78
ghstack dependencies: #126720
2024-05-22 15:33:24 +00:00
19cd4484ec [AOTI] Add more fallback ops (#126720)
Summary: These ops appear in either unit tests or TorchBench. Fixes https://github.com/pytorch/pytorch/issues/122050

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126720
Approved by: https://github.com/chenyang78
2024-05-22 15:33:24 +00:00
0d17aae242 Teach FakeTensor to fill in item_memo when converting scalar CPU tensor (#126245)
This PR requires a little justification, but let's start with what it does first:

1. When you have a 0d CPU scalar int64/float64 tensor input to a graph, we will preallocate a backed SymInt/SymFloat corresponding to what you would get if you call item() on this tensor. This means you can freely change your input to be a Python int/float or a Tensor with an item() call and end up with exactly the same level of expressivity (specifically, you can guard on the internal SymInt/SymFloat no matter what). By default, the source of the backed SymInt/SymFloat is `L['tensor'].item()`, but if you have promoted a float input into a Tensor, we will cancel out `torch.as_tensor(L['float']).item()` into just `L['float']`.
2. We switch wrap_symfloat to use this, instead of hand crafting the new SymNodeVariable. Everything works out, except that we carefully pass the item() result to tracked fakes (and not the fake Tensor argument)

OK, so why do this at all? There is some marginal benefit where now some item() calls on scalar inputs can be guarded on, but IMO this is a pretty marginal benefit, and if it was the only reason, I wouldn't do this. The real reason for this is that I need to be able to propagate fake tensors through the graphs that are produced by Dynamo, and if I am doing the old custom wrap_symfloat logic, there's no way I can do this, because ordinarily an item() call will cause an unbacked SymInt when I reallocate.

The other obvious way to solve the problem above is to make a HOP alternative to item() that "bakes in" the backed SymInt it's supposed to return. But this strategy seems more parsimonious, and it does have the marginal benefit I mentioned above. The main downside is that what I have to do next is make it so that when I run tensor computation, I also apply the equivalent operations to the SymInt/SymFloat as well. That's the next PR.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126245
Approved by: https://github.com/eellison
ghstack dependencies: #126637
2024-05-22 15:25:38 +00:00
86ad101370 Enable pickling torch._C.Generator (#126271)
Fixes #71398

Add `__reduce__` and `__setstate__` methods for `torch._C.Generator`.

`__reduce__` returns a tuple of 3 values:

1. `torch.Generator` itself.
2. A one-element tuple containing the `torch.device` to create the `Generator` with, since this cannot be changed after the object is created.
3. The state, a three-element tuple: the initial seed, the offset (or `None` if a CPU `Generator`), and the RNG state tensor.

`__setstate__` calls `manual_seed`, `set_offset` (if not `None`), and `set_state` on each respective element of the state.

Added test demonstrating successful reserialization with cpu and cuda `Generator`s.
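
The protocol described above can be sketched in pure Python. `ToyGenerator` is a hypothetical stand-in, not `torch._C.Generator`; only the shape of the `__reduce__` tuple and the order of calls in `__setstate__` follow the description. `copy.copy` is used to drive the hooks because it consults the same `__reduce__`/`__setstate__` pair that `pickle` does.

```python
import copy


class ToyGenerator:
    """Hypothetical generator whose device is fixed at construction time."""

    def __init__(self, device="cpu"):
        self.device = device
        self.seed = 0
        # In this toy model, CPU generators have no offset.
        self.offset = None if device == "cpu" else 0
        self.state = [0]

    def manual_seed(self, seed):
        self.seed = seed
        self.state = [seed]
        return self

    def set_offset(self, offset):
        self.offset = offset

    def set_state(self, state):
        self.state = list(state)

    def __reduce__(self):
        # (callable, constructor args, state): device goes through the constructor
        # because it cannot be changed after the object exists.
        return (ToyGenerator, (self.device,), (self.seed, self.offset, list(self.state)))

    def __setstate__(self, state):
        seed, offset, rng_state = state
        self.manual_seed(seed)
        if offset is not None:
            self.set_offset(offset)
        self.set_state(rng_state)


def clone_via_reduce(gen):
    """Rebuild a generator through its __reduce__/__setstate__ hooks."""
    return copy.copy(gen)
```

Restoring the RNG state last matters in this sketch because `manual_seed` overwrites the state; the real ordering rationale is not stated in the description.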

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126271
Approved by: https://github.com/ezyang
2024-05-22 14:38:47 +00:00
ed734178ab Refresh OpOverloadPacket if a new OpOverload gets added (#126863)
If a user accesses an OpOverloadPacket, then creates a new OpOverload,
then uses the OpOverloadPacket, the new OpOverload never gets hit. This
is because OpOverloadPacket caches OpOverloads when it is constructed.

This PR fixes the problem by "refreshing" the OpOverloadPacket if a new
OpOverload gets constructed and the OpOverloadPacket exists.
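
A toy model of the stale-cache problem and the fix, assuming nothing about the real classes: `ToyRegistry` and `ToyPacket` are hypothetical names, and the real `OpOverloadPacket` internals may differ.

```python
class ToyPacket:
    """Caches the overload names known at construction time."""

    def __init__(self, op, overloads):
        self.op = op
        self.cached = tuple(overloads)

    def refresh(self, overloads):
        # Replace the snapshot so later lookups see newly added overloads.
        self.cached = tuple(overloads)


class ToyRegistry:
    """Hypothetical operator registry that hands out one packet per op name."""

    def __init__(self):
        self.overloads = {}  # op name -> list of overload names
        self.packets = {}    # op name -> ToyPacket

    def packet(self, op):
        if op not in self.packets:
            self.packets[op] = ToyPacket(op, self.overloads.get(op, []))
        return self.packets[op]

    def register(self, op, overload):
        self.overloads.setdefault(op, []).append(overload)
        existing = self.packets.get(op)
        if existing is not None:
            # The fix: refresh a packet that was created before this overload.
            existing.refresh(self.overloads[op])
```

Without the `refresh` call in `register`, a packet obtained before the second registration would keep reporting only the first overload.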

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126863
Approved by: https://github.com/albanD
2024-05-22 14:13:27 +00:00
082251e76b fix invalid call to aoti_torch_tensor_copy_ (#126668)
Fixes #123039

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126668
Approved by: https://github.com/desertfire
2024-05-22 13:02:02 +00:00
2dd2699860 Introduce ProcessGroupCudaP2P (#122163)
## Context
This stack prototypes automatic micro-pipelining of `all-gather -> matmul` and `matmul -> reduce-scatter` via Inductor. The idea originates from the paper [Overlap Communication with Dependent Computation via Decomposition in Large Deep Learning Models](https://dl.acm.org/doi/pdf/10.1145/3567955.3567959). The implementation and some key optimizations are heavily influenced by @lw's implementation in xformers.

The stack contains several components:
- `ProcessGroupCudaP2P` - a thin wrapper around `ProcessGroupNCCL`. It in addition maintains a P2P workspace that enables SM-free, one-sided P2P communication which is needed for optimal micro-pipelining.
- `fused_all_gather_matmul` and `fused_matmul_reduce_scatter` dispatcher ops.
- Post-grad fx pass that detects `all-gather -> matmul` and `matmul -> reduce-scatter` and replaces them with the fused dispatcher ops.
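
The decomposition behind `all-gather -> matmul` micro-pipelining rests on a simple identity: multiplying the gathered row shards by `B` gives the same result as multiplying each shard by `B` and concatenating. A pure-Python sketch with no real communication follows; `all_gather` and `pipelined_matmul` are hypothetical helpers, not the fused dispatcher ops.

```python
def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def all_gather(shards):
    """Toy all-gather: concatenate the row shards held by each rank."""
    return [row for shard in shards for row in shard]


def gathered_then_matmul(shards, B):
    """Baseline: wait for every shard, then run one large matmul."""
    return matmul(all_gather(shards), B)


def pipelined_matmul(shards, B):
    """Decomposed form: one matmul per shard, in arrival order.

    In a real pipeline each per-shard matmul could overlap with the transfer
    of the next shard; here the loop only demonstrates the equivalence.
    """
    out = []
    for shard in shards:
        out.extend(matmul(shard, B))
    return out
```

Since both functions return the same matrix, the per-shard form is free to start computing before the full gather completes.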

To enable the prototype feature:
- Set the distributed backend to `cuda_p2p`.
- Set `torch._inductor.config._micro_pipeline_tp` to `True`.

*NOTE: the prototype sets nothing in stone w.r.t. each component's design. The purpose is to have a performant baseline with a reasonable design on which each component can be further improved.*

## Benchmark
Setup:
- 8 x H100 (500W) + 3rd gen NVSwitch.
- Llama3 8B training w/ torchtitan.
- 8-way TP. Reduced the number of layers from 32 to 8 for benchmarking purposes.

Trace (baseline): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpjaz8zgx0
<img width="832" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4addba77-5abc-4d2e-93ea-f68078587fe1">

Trace (w/ micro pipelining): https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmpn073b4wn
<img width="963" alt="image" src="https://github.com/pytorch/pytorch/assets/4156752/4f44e78d-8196-43ab-a1ea-27390f07e9d2">

## This PR
`ProcessGroupCudaP2P` is a thin wrapper around `ProcessGroupNCCL`. By default, it routes all collectives to the underlying `ProcessGroupNCCL`. In addition, `ProcessGroupCudaP2P` initializes a P2P workspace that allows direct GPU memory access among the members. The workspace can be used in Python to optimize intra-node communication patterns or to create custom intra-node collectives in CUDA.

`ProcessGroupCudaP2P` aims to bridge the gap where certain important patterns can be better optimized via fine-grained P2P memory access than with collectives in the latest version of NCCL. It is meant to complement NCCL rather than replacing it.
Usage:
```
    # Using ProcessGroupCudaP2P
    dist.init_process_group(backend="cuda_p2p", ...)

    # Using ProcessGroupCudaP2P while specifying ProcessGroupCudaP2P.Options
    pg_options = ProcessGroupCudaP2P.Options()
    dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)

    # Using ProcessGroupCudaP2P while specifying ProcessGroupNCCL.Options
    pg_options = ProcessGroupNCCL.Options()
    dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)

    # Using ProcessGroupCudaP2P while specifying both
    # ProcessGroupCudaP2P.Options and ProcessGroupNCCL.Options
    pg_options = ProcessGroupCudaP2P.Options()
    pg_options.nccl_options = ProcessGroupNCCL.Options()
    dist.init_process_group(backend="cuda_p2p", pg_options=pg_options, ...)

    # Down-casting the backend to access p2p buffers for cuda_p2p specific
    # optimizations
    if is_cuda_p2p_group(group):
        backend = get_cuda_p2p_backend(group)
        if required_p2p_buffer_size > backend.get_buffer_size():
            pass  # fallback
        p2p_buffer = backend.get_p2p_buffer(...)
    else:
        pass  # fallback
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122163
Approved by: https://github.com/wanchaol
2024-05-22 09:33:05 +00:00
8a4597980c Revert "Fix flexattention not realizing inputs before lowering (also refactored runtime estimation) (#126615)"
This reverts commit 831efeeadf5fa8d9e7f973057e634a57e3bcf04b.

Reverted https://github.com/pytorch/pytorch/pull/126615 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126615#issuecomment-2124169157))
2024-05-22 08:23:40 +00:00
0f37fd06d9 Revert "Prevent partitioner from ever saving views (#126446)"
This reverts commit da2292ce6b37028746bf5beeae04442eef1e803d.

Reverted https://github.com/pytorch/pytorch/pull/126446 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126615#issuecomment-2124169157))
2024-05-22 08:23:40 +00:00
d2cbbdee31 Revert "Fix silu test for flexattention (#126641)"
This reverts commit cd3a71f754a2248bcfe500de7c9860bd7d2002bf.

Reverted https://github.com/pytorch/pytorch/pull/126641 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126615#issuecomment-2124169157))
2024-05-22 08:23:40 +00:00
4575d3be83 [Quant][onednn] fix performance regression of depth-wise qconv (#126761)
Fixes #125663

It did not handle groups correctly in the original implementation.

Test plan:
Functionality is covered by UT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126761
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-05-22 07:53:11 +00:00
aede940975 [inductor] Fix cuda compilation under fbcode remote execution (#126408)
Differential Revision: D57390072

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126408
Approved by: https://github.com/desertfire
2024-05-22 07:51:35 +00:00
edea2b81b5 [ONNX] Adds Support for Some Bitwise Ops in Onnx Exporter (#126229)
Addresses #126194

Adds support for
- "aten::bitwise_right_shift"
- "aten::bitwise_left_shift"
- "aten::bitwise_and"

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126229
Approved by: https://github.com/justinchuby
2024-05-22 07:47:43 +00:00
b516de8cac [halide-backend] Add HalideCodeCache (#126416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126416
Approved by: https://github.com/shunting314
ghstack dependencies: #126631, #126655
2024-05-22 06:52:50 +00:00
d937d0db0f [SAC] fix ignored ops in eager mode to recompute (#126751)
As titled. I found an issue in eager-mode SAC where recompute
sometimes pops ops from storage that are missing there; these ops are
detach ops. This PR refactors the two modes so that they always
recompute ignored ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126751
Approved by: https://github.com/yf225
2024-05-22 06:47:22 +00:00
3b0f6cce5c [pytree] freeze attributes of TreeSpec (#124011)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124011
Approved by: https://github.com/zou3519
2024-05-22 05:57:00 +00:00
6edf989e2f [CUDA Caching Allocator] Round to nearest 512 bytes boundary if number of divisions=1 (#126830)
Summary: This diff fixes an issue when the number of divisions=1, resulting in unaligned memory accesses.
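
As a rough illustration of aligning a request to a 512-byte boundary, here is a minimal sketch. The helper name `round_up_to_boundary` is hypothetical, and reading "nearest" as "round up to the next multiple" is an assumption; the allocator's actual C++ logic is not reproduced here.

```python
def round_up_to_boundary(size: int, boundary: int = 512) -> int:
    """Round `size` up to the next multiple of `boundary`.

    Hypothetical helper for illustration only; not the allocator's code.
    """
    if size < 0 or boundary <= 0:
        raise ValueError("size must be >= 0 and boundary must be > 0")
    # Integer arithmetic avoids floating-point error for large sizes.
    return (size + boundary - 1) // boundary * boundary


aligned = [round_up_to_boundary(n) for n in (1, 512, 513, 1000)]
```

Sizes that are already a multiple of the boundary are returned unchanged, so aligned requests pay no extra padding.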

Reviewed By: 842974287

Differential Revision: D57648763

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126830
Approved by: https://github.com/842974287
2024-05-22 04:57:24 +00:00
ae66c94eaa Capture dtype in Flight Recorder (#126581)
Summary:
Capture dtype in flight recorder.
Mismatched dtypes can lead to hangs.

Newly added job logs show a mismatched dtype for an op, which affects the
data size even though the sizes match. Until now the dtype was not visible
in the FR log.

We end up capturing the type as follows:
```
{'entries': [{'record_id': 0, 'pg_id': 0, 'process_group': ('0', 'default_pg'), 'collective_seq_id': 1, 'p2p_seq_id': 0, 'op_id': 1, 'profiling_name': 'nccl:all_reduce', 'time_created_ns': 1715989097552775261, 'duration_ms': 6.697696208953857, 'input_sizes': [[3, 4]], 'input_dtypes': [6], 'output_sizes': [[3, 4]], 'output_dtypes': [6], 'state': 'completed', 'time_discovered_started_ns': 1715989097593778240, 'time_discovered_completed_ns': 1715989097593778461, 'retired': True,
```
Notice the new fields:
input_dtypes: [6]
output_dtypes: [6]
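
A minimal sketch of how the new fields could be used to spot a mismatch across ranks. The field names come from the entry above; the per-rank dictionary layout, the helper name `find_dtype_mismatches`, and the dtype code `5` are assumptions made for illustration.

```python
from collections import defaultdict


def find_dtype_mismatches(entries_by_rank):
    """Return collective_seq_ids whose dtypes differ between ranks.

    `entries_by_rank` maps rank -> list of flight recorder entries
    (hypothetical layout; only the field names are taken from the log).
    """
    signatures = defaultdict(set)
    for entries in entries_by_rank.values():
        for entry in entries:
            signature = (
                tuple(entry["input_dtypes"]),
                tuple(entry["output_dtypes"]),
            )
            signatures[entry["collective_seq_id"]].add(signature)
    # More than one distinct signature means the ranks disagree.
    return sorted(seq for seq, sigs in signatures.items() if len(sigs) > 1)


# Same sizes on both ranks, but different dtype codes.
dumps = {
    0: [{"collective_seq_id": 1, "input_sizes": [[3, 4]],
         "input_dtypes": [6], "output_dtypes": [6]}],
    1: [{"collective_seq_id": 1, "input_sizes": [[3, 4]],
         "input_dtypes": [5], "output_dtypes": [5]}],
}
mismatched = find_dtype_mismatches(dumps)
```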

Test Plan:
unit tests

Pull Request resolved: https://github.com/pytorch/pytorch/issues/126554

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126581
Approved by: https://github.com/wconstab
2024-05-22 03:38:09 +00:00
7530cfe7e4 [dynamo][flaky tests] test_conv_empty_input_* (#126790)
Run CI, maybe fixes https://github.com/pytorch/pytorch/issues/126178

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126790
Approved by: https://github.com/mikaylagawarecki
2024-05-22 03:14:21 +00:00
ac1f0befcf Remove redundant serialization code (#126803)
After https://github.com/pytorch/pytorch/pull/123308, we no longer need a separate serialization path to handle different types that exist in the nn_module metadata. This PR cleans up the redundant code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126803
Approved by: https://github.com/angelayi
2024-05-22 03:14:17 +00:00
608a11c496 [pipelining] Retire PIPPY_VERBOSITY in favor of TORCH_LOGS=pp (#126828)
https://github.com/pytorch/pytorch/pull/126499/ established:

`TORCH_LOGS=pp` --> info
`TORCH_LOGS=-pp` --> warn
`TORCH_LOGS=+pp` --> debug
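
The mapping above can be sketched as a small lookup. The function `pp_log_level` is hypothetical and is not the parser PyTorch uses; it only restates the three settings as standard `logging` levels.

```python
import logging


def pp_log_level(torch_logs: str):
    """Return the logging level implied for `pp` by a TORCH_LOGS value.

    Hypothetical helper for illustration; returns None when `pp` is
    not mentioned in the setting.
    """
    levels = {"+pp": logging.DEBUG, "pp": logging.INFO, "-pp": logging.WARNING}
    for token in torch_logs.split(","):
        token = token.strip()
        if token in levels:
            return levels[token]
    return None
```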

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126828
Approved by: https://github.com/wconstab
2024-05-22 02:52:58 +00:00
e3c96935c2 Support CUDA_INC_PATH env variable when compiling extensions (#126808)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126808
Approved by: https://github.com/amjames, https://github.com/ezyang
2024-05-22 02:44:32 +00:00
5fa7aefb49 [pipelining] Do not print loss (#126829)
`loss` is a tensor, printing it would induce a GPU-CPU sync, which would slow down the program more than regular debug overhead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126829
Approved by: https://github.com/wconstab
2024-05-22 02:32:04 +00:00
e6f655697b [AOTI] Fix unsupported type of output=s1 (#126797)
Fixes #123036

In unit test `DynamicShapesCudaWrapperCudaTests.test_scaled_dot_product_attention_cuda_dynamic_shapes_cuda_wrapper`, computed buffer buf3 is compiled to a fallback kernel `aoti_torch_cuda__scaled_dot_product_flash_attention`. It has 9 outputs whose types are `[MultiOutput, MultiOutput, None, None, s1, s1, MultiOutput, MultiOutput,MultiOutput]`. The type `s1` here is passed from [generate_output](acfe237a71/torch/_inductor/ir.py (L5658)).

The type check for Symbol is missing in fallback kernel output generation. This PR fixes this issue by checking `output.is_Symbol`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126797
Approved by: https://github.com/desertfire
2024-05-22 02:15:43 +00:00
a379ed6e98 Fix SobolEngine default dtype handling (#126781)
- Change default dtype argument to `None` and fetch its value via a `torch.get_default_dtype()` call if not defined
- Fix bug in first draw handling logic, that would ignore dtype in favor of default one due to type promotion
- Add regression tests

Fixes https://github.com/pytorch/pytorch/issues/126478
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126781
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-05-22 01:55:48 +00:00
28f29e074b Dont mutate tensor stride in place in cudnn conv (#126786)
Fix for https://github.com/pytorch/pytorch/issues/126241.

Within the cudnn convolution, we were in-place updating the strides of the tensor to disambiguate for size-1 dims and contiguous and channels last tensors. Instead of mutating the tensor's stride, just use a temporary. Inside cudnn it is then copied: d7ccb5b3c4/include/cudnn_frontend_Tensor.h (L201-L203).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126786
Approved by: https://github.com/ezyang, https://github.com/shunting314, https://github.com/eqy
2024-05-22 01:53:44 +00:00
66c23cb021 Add micro-benchmark framework and multi_layer_norm as an example (#126754)
```micro_benchmark.py``` output csv example (all numbers are fake, just for demo)
```
name,metric,target,actual
multi_layer_norm,inference_time(s),20,19.87
multi_layer_norm,memory_bandwidth(GB/s),108,108.04
llama2-int8, token_per_sec,155,156
llama2-int8,memory_bandwidth(GB/s),92,92.7
```
Expected dashboard looks like:
```
| name             | metric                 | target | actual | change |
|------------------|------------------------|--------|--------|--------|
| multi_layer_norm | inference_time(s)      | 20     | 19.87  | 99%    |
|                  | memory_bandwidth(GB/s) | 108    | 108.04 | 101%   |
| llama2-int8      | token_per_sec          | 155    | 156    | 100%   |
|                  | memory_bandwidth(GB/s) | 92     | 92.7   | 101%   |

```
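
A minimal sketch of turning the CSV above into dashboard rows. Defining `change` as `actual / target` is an assumption, and since the demo numbers are fake, the computed percentages need not match the table above.

```python
import csv
import io

CSV_TEXT = """name,metric,target,actual
multi_layer_norm,inference_time(s),20,19.87
multi_layer_norm,memory_bandwidth(GB/s),108,108.04
llama2-int8, token_per_sec,155,156
llama2-int8,memory_bandwidth(GB/s),92,92.7
"""


def load_rows(text):
    """Parse the benchmark CSV and attach an actual/target ratio."""
    # skipinitialspace tolerates the space after the comma in
    # "llama2-int8, token_per_sec".
    reader = csv.DictReader(io.StringIO(text), skipinitialspace=True)
    rows = []
    for row in reader:
        target = float(row["target"])
        actual = float(row["actual"])
        rows.append({
            "name": row["name"],
            "metric": row["metric"],
            "target": target,
            "actual": actual,
            "change": actual / target,  # assumed definition of "change"
        })
    return rows


rows = load_rows(CSV_TEXT)
for r in rows:
    print(f"{r['name']:<18}{r['metric']:<26}{r['change']:.1%}")
```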

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126754
Approved by: https://github.com/Chillee
2024-05-22 01:27:37 +00:00
636e79991c [FSDP2] Fixed 2D clip grad norm test (#126497)
This fixes https://github.com/pytorch/pytorch/issues/126484.

We change from transformer to MLP stack since transformer seems to introduce slight numeric differences when using TP. We include a sequence parallel layer norm module in the MLP stack to exercise `(S(0), R)` placement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126497
Approved by: https://github.com/weifengpy, https://github.com/wz337
2024-05-22 00:29:13 +00:00
25ea32567e [caffe2][1/n] migrate global Static Initializer (#126688)
Summary:
The Caffe2 lib has 200+ usages of global static initializers, which are paper cuts for startup perf. Details in this post https://fb.workplace.com/groups/arglassesperf/permalink/623909116287154.
Kick off a stack to migrate all usages of global static initializers in caffe2.

Test Plan: TODO: Please advise how can i test this change?

Differential Revision: D57531083

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126688
Approved by: https://github.com/ezyang
2024-05-22 00:16:06 +00:00
10a5c1b26c [Dynamo][TVM] Fix tvm backend interface (#126529)
Fixes #126528

The repro in the above issue works fine with this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126529
Approved by: https://github.com/xmfan
2024-05-21 23:31:15 +00:00
1e818db547 [torchbench] Fix torchao benchmarking script (#126736)
As the title says.

Test Plan:

```
python benchmarks/dynamo/torchbench.py --only BERT_pytorch --bfloat16 --quantization int8dynamic --performance --inference --print-memory

cuda eval  BERT_pytorch
[XZ Debug] Torch grad status: False
memory: eager: 0.82 GB, dynamo: 0.92 GB, ratio: 0.89
running benchmark: 100%
1.001x
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126736
Approved by: https://github.com/jerryzh168, https://github.com/huydhn
2024-05-21 23:15:12 +00:00
9dba1aca0e [inductor] Relax type annotations for statically_known_* (#126655)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126655
Approved by: https://github.com/Skylion007, https://github.com/shunting314
ghstack dependencies: #126631
2024-05-21 23:12:42 +00:00
c08afbb3da [inductor] Add kernel_code logging artifact (#126631)
This is useful for some compile errors where we don't finish outputting the full graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126631
Approved by: https://github.com/shunting314
2024-05-21 23:12:42 +00:00
4e921593a4 [c10d]skip nan tests for lower versions of CUDA (#126701)
Summary:
We found that the UNIT tests would hang only in one test,
linux-focal-cuda11.8-py3.9-gcc9 / test (multigpu, 1, 1,
linux.g5.12xlarge.nvidia.gpu),
in which DSA would still be raised, but somehow the process would cause
errors like:
P1369649418

Test Plan:
Run CI tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126701
Approved by: https://github.com/wconstab
ghstack dependencies: #126409
2024-05-21 22:25:29 +00:00
f6ffe32a9d [AOTInductor] Automatic detection for buffer mutation and binary linking (#126706)
Summary: Instead of an explicit config for users to determine buffer mutation, we automatically detect whether there's buffer mutation in the model and determine in which section constants should be placed. If constants are too large and don't fit within the section, we error out directly.

Test Plan: Existing tests for buffer mutation and large weight linking

Differential Revision: D57579800

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126706
Approved by: https://github.com/desertfire
2024-05-21 21:49:13 +00:00
fed536dbcf [DTensor][Optim] Add support for fused_adam and fused_adamw when lr is a tensor (#126750)
Fixes #126670

In this PR, we update the following:
1. lr is a kwarg. Add support to automatically turn on implicit replication for kwargs. We only did this for args previously.
2. add associated tensor_lr ops in pointwises.py
3. add associated unit test in test_optimizers.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126750
Approved by: https://github.com/wanchaol, https://github.com/msaroufim
2024-05-21 21:38:05 +00:00
7ee74d986a Enable UFMT format on test/typing files (#126038)
Fixes some files in #123062

Run lintrunner on files:
test/typing/**/*

```
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126038
Approved by: https://github.com/shink, https://github.com/ezyang
2024-05-21 21:37:07 +00:00
1cc9354cb0 Unify the dtype to VecMask<float, N> in ops.masked (#126662)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/126449. For `ops.masked` in CPP backend, when input dtype is `bool`, we actually load it as `VecMask<float, N>`. So, we should unify the type of `other` and `mask` to the same as  `VecMask<float, N>` to invoke `blendv` method.

**Test Plan**
```
clear && python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_ops_masked_with_bool_input
clear && PYTORCH_ALL_SAMPLES=1 python -u -m pytest -s -v test/inductor/test_torchinductor_opinfo.py -k test_comprehensive__chunk_cat_cpu_bool
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126662
Approved by: https://github.com/isuruf, https://github.com/jgong5, https://github.com/peterbell10
2024-05-21 20:52:25 +00:00
fd7293db71 Bump rexml from 3.2.5 to 3.2.8 in /ios/TestApp (#126455)
Bumps [rexml](https://github.com/ruby/rexml) from 3.2.5 to 3.2.8.
- [Release notes](https://github.com/ruby/rexml/releases)
- [Changelog](https://github.com/ruby/rexml/blob/master/NEWS.md)
- [Commits](https://github.com/ruby/rexml/compare/v3.2.5...v3.2.8)

---
updated-dependencies:
- dependency-name: rexml
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-05-21 13:47:12 -07:00
fe0a36fd7c Fix a link in the compiler backend doc (#126079)
Core ATen is the core subset of ATen, and it seems to be the correct link to replace the broken one.

Fixes #125961

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126079
Approved by: https://github.com/svekars
2024-05-21 20:16:04 +00:00
5325a6de64 [dtensor] remove output_ prefix from OpStrategy properties (#126359)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126359
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2024-05-21 19:54:29 +00:00
c73c9457aa Add guard_size_oblivious to vector_norm (#126772)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126772
Approved by: https://github.com/lezcano, https://github.com/Skylion007
ghstack dependencies: #126771
2024-05-21 19:53:21 +00:00
97eef61474 Don't assume compare_arg is fx.Node (#126771)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126771
Approved by: https://github.com/Skylion007
2024-05-21 19:53:21 +00:00
fc594ed219 Remove lint from retryable_workflows (#126806)
Related to https://github.com/pytorch/test-infra/pull/4934

Lint workflow now uses Docker, so there should not be network-related errors for pip installing stuff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126806
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi, https://github.com/huydhn
2024-05-21 19:47:23 +00:00
4e6673e244 Remove MAX_STACK_ENTRY from _build_table (#126583)
Summary:
As reported by this issue: https://github.com/pytorch/pytorch/issues/83584

We already store the entries in evt.stack so there is no need to cap the limit when we output the table to 5

Test Plan: Regression testing should cover this. We have unit tests to check the stack already.

Differential Revision: D57513565

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126583
Approved by: https://github.com/nmacchioni
2024-05-21 18:52:04 +00:00
0c76018714 [inductor] Don't inherit __future__ flags from the calling scope when compile -ing generated modules (#126454)
This file includes `from __future__ import annotations` which interacts with `compile` by causing type annotations to be populated as strings. Triton does not parse the string annotation correctly. Avoid this behavior by passing `dont_inherit=True` to `compile`.
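
The inheritance behavior can be reproduced with the standard library alone. In this sketch the names `GENERATED`, `CALLER`, `build`, and `kernel` are made up for illustration; the "calling scope" is simulated by a small module that enables the future import and then calls the builtin `compile` on generated source.

```python
# Source standing in for a generated module.
GENERATED = "def kernel(x: int) -> int:\n    return x\n"

# Source standing in for the calling file, which enables the future import.
CALLER = '''
from __future__ import annotations

def build(src, dont_inherit):
    namespace = {}
    code = compile(src, "<generated>", "exec", dont_inherit=dont_inherit)
    exec(code, namespace)
    return namespace["kernel"].__annotations__
'''

caller_ns = {}
exec(compile(CALLER, "<caller>", "exec", dont_inherit=True), caller_ns)

# Default behavior: the future flag leaks in and annotations become strings.
inherited = caller_ns["build"](GENERATED, False)
# With dont_inherit=True the generated code keeps real annotation objects.
isolated = caller_ns["build"](GENERATED, True)
```

With inheritance, `kernel.__annotations__` holds the string `'int'`; without it, it holds the `int` type itself, which is the form a parser of real annotation objects would expect.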

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126454
Approved by: https://github.com/peterbell10
2024-05-21 18:51:13 +00:00
cyy
7428fd19fe Remove outdated options from setup.py (#125988)
Since the recent removal of Caffe2 files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125988
Approved by: https://github.com/ezyang
2024-05-21 18:48:23 +00:00
b40fb2de59 [AOTI] Fix a codegen issue when .item() is used for kernel arg (#126575)
Summary: fixes https://github.com/pytorch/pytorch/issues/126574 . Pass kernel argument type information into generate_args_decl, so it can generate the argument declaration instead of relying on string matching.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126575
Approved by: https://github.com/chenyang78
ghstack dependencies: #126369
2024-05-21 18:20:20 +00:00
5e2de16a6f [AOTI] Codegen None as empty tensor (#126369)
Summary: When None denotes an optional tensor, we codegen NULL to represent it; but when None is for actual tensor type, we need to codegen an empty tensor for it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126369
Approved by: https://github.com/chenyang78
2024-05-21 18:20:20 +00:00
ac51920656 Reapply "c10d: add Collectives abstraction (#125978)" (#126695)
This reverts commit d9c3485146913324ab4b3e211d2a4517e138f4af.

Reapplies #125978.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126695
Approved by: https://github.com/c-p-i-o
2024-05-21 18:00:09 +00:00
d8f5627a88 prune back configs (#126570)
We had a previous PR that added configs for an internal model. Running the below script on output from autotuning, we can prune back the added configs with negligible perf loss: P1365917790.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126570
Approved by: https://github.com/nmacchioni
2024-05-21 17:44:32 +00:00
85fd76f76d Add test coverage for fp16 matrix-vector specialized kernel (#126700)
Summary: This kernel is special-cased on ARM because it's important for LLMs, so let's have test coverage.

Test Plan: Ran locally and it passes. Intentionally broke fp16_gemv_trans and saw it fail, confirming it provides coverage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126700
Approved by: https://github.com/malfet
2024-05-21 17:23:16 +00:00
bae3b17fd9 Tweak a comment and fix spelling (#126681)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126681
Approved by: https://github.com/Skylion007
2024-05-21 17:19:06 +00:00
0756f9f5fd Remove debug breakpoint (#126756)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126756
Approved by: https://github.com/BowenBao, https://github.com/Skylion007
2024-05-21 17:04:50 +00:00
140ab89c02 typing scheduler.py [1/2]: Bug fix (#126610)
Found while getting scheduler.py to typecheck - split off to make reviewing easier.

1. is_template: I'm pretty sure this is a bug.  Based on the definition of `is_template` I'm pretty sure we want to return the node's `get_template_node()`, not the node itself.

2. can_free: It seems that this was intended to be a raise, not a return.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126610
Approved by: https://github.com/eellison
2024-05-21 16:59:37 +00:00
ac2c547838 [TD] Upload names of failures to s3 for pytest cache (#126315)
Some tests don't get run through pytest and pytest crashes when a test segfaults, so in both cases, the pytest cache won't have an entry (similar to https://github.com/pytorch/test-infra/pull/5205).

Instead, manually upload/download an extra file that lists the failing test files

Technically this would be more general than the pytest cache
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126315
Approved by: https://github.com/ZainRizvi
2024-05-21 16:29:31 +00:00
4a7b46be3d small changes to padding (#126716)
Add cost of writing padding 0s to benchmark, skip dimension that can be squeezed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126716
Approved by: https://github.com/shunting314
2024-05-21 16:09:32 +00:00
980f5ac049 Revert "[Quant][PT2E] enable qlinear post op fusion for dynamic quant & qat (#122667)"
This reverts commit 3642e51ea527e23ded10afc266f298b0cb5350c8.

Reverted https://github.com/pytorch/pytorch/pull/122667 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/122667#issuecomment-2122642317))
2024-05-21 13:45:07 +00:00
b36e01801b [3.12, inductor] re-enable AsyncCompile.warm_pool for 3.12 (#126724)
Somehow working now? Fixes https://github.com/pytorch/pytorch/issues/124192 and https://github.com/pytorch/pytorch/issues/125979.

Still getting the warning
```
/home/williamwen/local/installs/python3.12/debug/install/lib/python3.12/multiprocessing/popen_fork.py:66: DeprecationWarning: This process (pid=2360707) is multi-threaded, use of fork() may lead to deadlocks in the child.
  self.pid = os.fork()
```
though

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126724
Approved by: https://github.com/masnesral, https://github.com/jansel
2024-05-21 08:50:13 +00:00
cyy
faa72dca41 Remove QNNPACK submodule (#126657)
QNNPACK has been integrated into ATen for a long time, and removing it from third party causes no build issues.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126657
Approved by: https://github.com/ezyang
2024-05-21 07:25:24 +00:00
7d34cfd28a Update torch-xpu-ops pin (ATen XPU implementation) (#126744)
Regular bi-weekly pin update. 85 new ATen operators are implemented in the XPU backend.
https://github.com/intel/torch-xpu-ops/blob/release/2.4/yaml/xpu_functions.yaml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126744
Approved by: https://github.com/EikanWang
2024-05-21 07:21:52 +00:00
4b23c4fc5d [Pipelining] Clean up function names in 1f1b schedule (#126582)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126582
Approved by: https://github.com/kwen2501
ghstack dependencies: #126539
2024-05-21 06:50:02 +00:00
8c9d332953 [c10d] fix excepthook crash on exc after destroy_process_group (#126739)
fixes #126379

This is the easy fix.  An additional fix that I did not do is to
deregister the excepthook (or rather, restore the original one) when
calling dist.destroy_process_group.  This might be a bit complicated in
practice, so landing as is for now.

Also, couldn't figure out a clean way to test this.  assertRaisesRegex
wasn't getting a string value, probably because of the stderr
redirection done via the excepthook in the first place.

Output from the original repro is cleaned up with the fix:

```
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/users/whc/pytorch/except.py", line 6, in <module>
[rank0]:     raise ZeroDivisionError
[rank0]: ZeroDivisionError
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126739
Approved by: https://github.com/yf225
2024-05-21 06:39:18 +00:00
e363a8a222 Revert "[pipelining] Add pipeline stage test (#126721)"
This reverts commit b948b1ad7a9cf61c9692506c60c295fd40e00f43.

Reverted https://github.com/pytorch/pytorch/pull/126721 on behalf of https://github.com/clee2000 due to The test_public_bindings failure is real, you just got unlucky since it was also broken on trunk for a different reason ([comment](https://github.com/pytorch/pytorch/pull/126721#issuecomment-2121725408))
2024-05-21 04:40:05 +00:00
dc2560f073 [Pipelining] Add debug logs for batch p2p ops (#126539)
logs from torchtitan:

<img width="2878" alt="image" src="https://github.com/pytorch/pytorch/assets/4984825/4039c85f-0bf1-4924-92fa-2c55e8e4da2a">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126539
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
2024-05-21 03:54:46 +00:00
b96d9090d2 [C10D] make get_node_local_rank() accept fallback_rank (#126737)
Addresses follow up comments on #123992 and allows the use case of
writing code that checks `get_node_local_rank(fallback_rank=0)` and
runs correctly whether in the presence of a launcher (e.g. torchrun),
or run locally on a single device.
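
A minimal sketch of the fallback behavior described above, written as a stand-in rather than the c10d implementation. The name `node_local_rank` and the `environ` parameter are hypothetical, and reading the launcher's `LOCAL_RANK` environment variable is an assumption about how the rank is discovered.

```python
import os


def node_local_rank(fallback_rank=None, environ=None):
    """Return the node-local rank, or `fallback_rank` without a launcher.

    Hypothetical stand-in for illustration; `environ` exists only so the
    behavior can be exercised without touching the real environment.
    """
    env = os.environ if environ is None else environ
    if "LOCAL_RANK" in env:
        # A launcher such as torchrun is assumed to set LOCAL_RANK.
        return int(env["LOCAL_RANK"])
    if fallback_rank is not None:
        return int(fallback_rank)
    raise RuntimeError("LOCAL_RANK is not set and no fallback_rank was given")
```

Passing `fallback_rank=0` lets the same script run under a launcher or as a plain single-device process.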

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126737
Approved by: https://github.com/shuqiangzhang
2024-05-21 03:38:02 +00:00
c1b90a4e8a [Dynamo] Treat integers stored on nn.Modules as dynamic (#126466)
Fixes #115711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126466
Approved by: https://github.com/jansel
2024-05-21 03:31:20 +00:00
a83e745356 [BE] split seq_id to collective_seq_id and p2p_seq_id (#125727)
Summary:
Split out `seq_id` into `collective_seq_id` and `p2p_seq_id`. The main idea here is that collectives that go to all machines should have identical `collective_seq_id`, which makes it easier to spot if one of the machines isn't handling a collective operation.
Next, we can attempt to match up p2p operations to ensure that the sender(s)/receiver(s) are in sync.
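
A minimal sketch of the check this split enables: given the last `collective_seq_id` seen on each rank, report the ranks that are behind. The helper name `lagging_ranks` and the input layout are assumptions for illustration.

```python
def lagging_ranks(last_collective_seq_id):
    """Return ranks whose last collective_seq_id is behind the newest one.

    `last_collective_seq_id` maps rank -> last collective_seq_id observed
    on that rank (hypothetical layout).
    """
    if not last_collective_seq_id:
        return []
    newest = max(last_collective_seq_id.values())
    return sorted(
        rank for rank, seq in last_collective_seq_id.items() if seq < newest
    )


# Rank 2 has not reached collective 12 yet.
behind = lagging_ranks({0: 12, 1: 12, 2: 11, 3: 12})
```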

Resolves issue: https://github.com/pytorch/pytorch/issues/125173

Test Plan:
Unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125727
Approved by: https://github.com/zdevito
2024-05-21 03:26:49 +00:00
eqy
5f64086d08 [NT][SDPA] Bump tolerances for test_sdpa_with_packed_in_proj_cuda_bfloat16 (#126356)
Current tolerances fail on RTX 6000 (Ada) with `Mismatched elements: 2 / 144 (1.4%)`

```
AssertionError: Tensor-likes are not close!

Mismatched elements: 2 / 144 (1.4%)
Greatest absolute difference: 0.002197265625 at index (5, 0, 0) (up to 0.001 allowed)
Greatest relative difference: 0.08203125 at index (3, 0, 0) (up to 0.016 allowed)

To execute this test, run the following from the base repo dir:
     python test/test_nestedtensor.py -k test_sdpa_with_packed_in_proj_cuda_bfloat16

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

----------------------------------------------------------------------
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126356
Approved by: https://github.com/drisspg
2024-05-21 03:25:30 +00:00
40cc616909 Fix caching allocator of out-of-tree device is destructed before the … (#126677)
…destruction of tensors cached by autocast

## Root Cause
An out-of-tree device extension is loaded after torch (it lives in a different .so), so the global variable `cached_casts` may be constructed before the extension's caching allocator and then, because destruction runs in reverse order at exit, destructed after it.

## Fix
Lazily initialize `cached_casts` to correct the order.

## How to Reproduce && Test
Modify the testcase `TestAutocastGPU.test_cast_cache_is_global` in test/test_autocast.py to run on your out-of-tree device. You will see the following failure at the end of the test.
```bash
----------------------------------------------------------------------
Ran 1 test in 4.812s

OK
free: 0x30080ff44000400
terminate called after throwing an instance of 'c10::Error'
  what():  invalid device pointer: 0x30080ff44000400
Exception raised from free at /projs/framework/betterman/code/pytorch_new/catch/torch_mlu/csrc/framework/core/caching_allocator.cpp:1609 (most recent call first):
frame #0: <unknown function> + 0x118fe1 (0x7ffaef4d3fe1 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #1: <unknown function> + 0x11b1c4 (0x7ffaef4d61c4 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #2: <unknown function> + 0x117677 (0x7ffaef4d2677 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #3: <unknown function> + 0x11a2bf (0x7ffaef4d52bf in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #4: <unknown function> + 0x11a186 (0x7ffaef4d5186 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #5: <unknown function> + 0x119fde (0x7ffaef4d4fde in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #6: <unknown function> + 0x119d2e (0x7ffaef4d4d2e in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #7: <unknown function> + 0x119be0 (0x7ffaef4d4be0 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #8: <unknown function> + 0x119977 (0x7ffaef4d4977 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #9: <unknown function> + 0x119313 (0x7ffaef4d4313 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #10: <unknown function> + 0x118b4c (0x7ffaef4d3b4c in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #11: c10::Error::Error(c10::SourceLocation, std::string) + 0x34 (0x7ffaef4d27c4 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #12: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x7f (0x7ffaef4d04ed in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #13: torch_mlu::MLUCachingAllocator::Native::NativeCachingAllocator::free(void*) + 0xe6 (0x7ff9a8eeb112 in /projs/framework/betterman/code/pytorch_new/catch/torch_mlu/csrc/lib/libtorch_mlu.so)
frame #14: torch_mlu::MLUCachingAllocator::Native::local_raw_delete(void*) + 0x3b (0x7ff9a8ed9480 in /projs/framework/betterman/code/pytorch_new/catch/torch_mlu/csrc/lib/libtorch_mlu.so)
frame #15: std::unique_ptr<void, void (*)(void*)>::~unique_ptr() + 0x50 (0x7ffb0a5ea322 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so)
frame #16: <unknown function> + 0x1269890 (0x7ffb0a5e4890 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so)
frame #17: <unknown function> + 0x1269928 (0x7ffb0a5e4928 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so)
frame #18: <unknown function> + 0x127572c (0x7ffb0a5f072c in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so)
frame #19: <unknown function> + 0x1275758 (0x7ffb0a5f0758 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_python.so)
frame #20: <unknown function> + 0xb9bc7 (0x7ffaef474bc7 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #21: <unknown function> + 0xb97bc (0x7ffaef4747bc in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #22: <unknown function> + 0xdbc50 (0x7ffaef496c50 in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #23: c10::TensorImpl::~TensorImpl() + 0x82 (0x7ffaef49157e in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #24: c10::TensorImpl::~TensorImpl() + 0x1c (0x7ffaef4915aa in /projs/framework/betterman/code/pytorch_new/torch/lib/libc10.so)
frame #25: <unknown function> + 0x2f596d9 (0x7ffaf24fc6d9 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #26: <unknown function> + 0x2f589c2 (0x7ffaf24fb9c2 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #27: <unknown function> + 0x2f57b92 (0x7ffaf24fab92 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #28: <unknown function> + 0x2f5c228 (0x7ffaf24ff228 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #29: <unknown function> + 0x30f3f70 (0x7ffaf2696f70 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #30: <unknown function> + 0x30f3f90 (0x7ffaf2696f90 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #31: <unknown function> + 0x30f5004 (0x7ffaf2698004 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #32: <unknown function> + 0x30f5024 (0x7ffaf2698024 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #33: <unknown function> + 0x31207f0 (0x7ffaf26c37f0 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #34: <unknown function> + 0x3120814 (0x7ffaf26c3814 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #35: <unknown function> + 0x30f51e8 (0x7ffaf26981e8 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #36: <unknown function> + 0x30f5148 (0x7ffaf2698148 in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #37: <unknown function> + 0x316ecea (0x7ffaf2711cea in /projs/framework/betterman/code/pytorch_new/torch/lib/libtorch_cpu.so)
frame #38: <unknown function> + 0x468a7 (0x7ffb0c9ed8a7 in /lib/x86_64-linux-gnu/libc.so.6)
frame #39: on_exit + 0 (0x7ffb0c9eda60 in /lib/x86_64-linux-gnu/libc.so.6)
<omitting python frames>
frame #47: __libc_start_main + 0xf3 (0x7ffb0c9cb083 in /lib/x86_64-linux-gnu/libc.so.6)

Aborted (core dumped)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126677
Approved by: https://github.com/ezyang
2024-05-21 03:20:17 +00:00
51c07f9f69 [dynamo] Allow asserts to fail (#126661)
Currently if an assertion is statically known to be false, dynamo converts it to
`_assert_async` which inductor currently ignores. Instead this graph breaks to
raise the original assertion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126661
Approved by: https://github.com/ezyang
2024-05-21 02:42:13 +00:00
d777685ef9 Script for choosing template configurations (#126560)
This adds logging that marks every invocation of a matmul for a particular set of input shapes and records the performance of every template config on it. We can then feed that log into a script that minimizes the total mm execution time given N allowed templates. In the future it can support other experiments as well.
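
The selection step described above can be sketched as follows. This is a minimal illustration and not the script added by the PR: the function name `pick_templates`, the `timings` layout, and the config names are all hypothetical, and it uses exhaustive search, which is only practical for a small number of configs.

```python
from itertools import combinations


def pick_templates(timings, n):
    """Pick the n configs that minimize total time across all logged shapes.

    timings: {shape: {config_name: measured_time}} (hypothetical layout).
    For each shape we assume the fastest config among the chosen subset is used.
    """
    configs = sorted({c for per_shape in timings.values() for c in per_shape})
    best_subset, best_total = None, float("inf")
    for subset in combinations(configs, n):
        total = sum(
            min(per_shape.get(c, float("inf")) for c in subset)
            for per_shape in timings.values()
        )
        if total < best_total:
            best_subset, best_total = subset, total
    return best_subset, best_total


# Made-up timings for three matmul shapes and three template configs.
timings = {
    (64, 64, 64): {"cfg_a": 1.0, "cfg_b": 2.0, "cfg_c": 1.5},
    (128, 128, 128): {"cfg_a": 5.0, "cfg_b": 3.0, "cfg_c": 4.0},
    (256, 256, 256): {"cfg_a": 9.0, "cfg_b": 8.0, "cfg_c": 6.0},
}

print(pick_templates(timings, 1))
print(pick_templates(timings, 2))
```

A greedy or integer-programming formulation would scale better once the number of configs grows; the brute-force version just makes the objective explicit.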

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126560
Approved by: https://github.com/nmacchioni, https://github.com/jansel
2024-05-21 02:28:39 +00:00
d30cdc4321 [ROCm] amdsmi library integration (#119182)
Adds monitoring support for ROCm using amdsmi in place of pynvml.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119182
Approved by: https://github.com/jeffdaily, https://github.com/malfet, https://github.com/xw285cornell
2024-05-21 01:59:26 +00:00
b948b1ad7a [pipelining] Add pipeline stage test (#126721)
Test tracer's and manual's stage creation by using a basic schedule (GPipe).

(Migrated from https://github.com/pytorch/PiPPy/blob/main/test/test_pipeline_stage.py)

Test command:
```
$ python test_stage.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126721
Approved by: https://github.com/wconstab, https://github.com/H-Huang
2024-05-21 01:22:10 +00:00
31ba6ee49b Traceable wrapper subclass support for deferred runtime asserts (#126198)
The padded dense -> jagged conversion op has the signature:
```
_fbgemm_dense_to_jagged_forward(Tensor dense, Tensor[] offsets, SymInt? total_L=None) -> Tensor
```

When `total_L` is not specified, the meta registration has a data-dependent output shape (based on `offsets[0][-1]`). Returning an unbacked SymInt here should work in theory, but later code lacks the traceable wrapper subclass support needed to handle deferred runtime asserts. This PR fixes this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126198
Approved by: https://github.com/ezyang
2024-05-21 01:21:46 +00:00
82b4528788 [cudagraph] fix verbose graph logging (#126694)
According to the [doc](https://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__TYPES.html#group__CUDART__TYPES_1g0907ca7a1e7d0211b71ee49c5403072b):

> enum cudaGraphDebugDotFlags
> CUDA Graph debug write options
>
> Values
> cudaGraphDebugDotFlagsVerbose = 1<<0
> Output all debug data as if every debug flag is enabled
> cudaGraphDebugDotFlagsKernelNodeParams = 1<<2
> Adds cudaKernelNodeParams to output
> cudaGraphDebugDotFlagsMemcpyNodeParams = 1<<3
> Adds cudaMemcpy3DParms to output
> cudaGraphDebugDotFlagsMemsetNodeParams = 1<<4
> Adds cudaMemsetParams to output
> cudaGraphDebugDotFlagsHostNodeParams = 1<<5
> Adds cudaHostNodeParams to output
> cudaGraphDebugDotFlagsEventNodeParams = 1<<6
> Adds cudaEvent_t handle from record and wait nodes to output
> cudaGraphDebugDotFlagsExtSemasSignalNodeParams = 1<<7
> Adds cudaExternalSemaphoreSignalNodeParams values to output
> cudaGraphDebugDotFlagsExtSemasWaitNodeParams = 1<<8
> Adds cudaExternalSemaphoreWaitNodeParams to output
> cudaGraphDebugDotFlagsKernelNodeAttributes = 1<<9
> Adds cudaKernelNodeAttrID values to output
> cudaGraphDebugDotFlagsHandles = 1<<10
> Adds node handles and every kernel function handle to output
> cudaGraphDebugDotFlagsConditionalNodeParams = 1<<15
> Adds cudaConditionalNodeParams to output
>

`1 << 10` is not the most verbose flag. It is just one flag that adds node handles and every kernel function handle to the output. `1 << 0` is the most verbose flag, under the name `cudaGraphDebugDotFlagsVerbose`.
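
To make the bit values concrete, here is a small Python mirror of the enum quoted above. The class and member names are hypothetical (the real enum lives in the CUDA runtime headers); only the numeric values are taken from the quoted documentation.

```python
from enum import IntFlag


class GraphDebugDotFlags(IntFlag):
    """Hypothetical Python mirror of cudaGraphDebugDotFlags (values from the doc)."""

    VERBOSE = 1 << 0
    KERNEL_NODE_PARAMS = 1 << 2
    MEMCPY_NODE_PARAMS = 1 << 3
    MEMSET_NODE_PARAMS = 1 << 4
    HOST_NODE_PARAMS = 1 << 5
    EVENT_NODE_PARAMS = 1 << 6
    EXT_SEMAS_SIGNAL_NODE_PARAMS = 1 << 7
    EXT_SEMAS_WAIT_NODE_PARAMS = 1 << 8
    KERNEL_NODE_ATTRIBUTES = 1 << 9
    HANDLES = 1 << 10
    CONDITIONAL_NODE_PARAMS = 1 << 15


# 1 << 10 selects only the "handles" bit; the verbose bit is not set.
old_flag = GraphDebugDotFlags(1 << 10)
print(old_flag == GraphDebugDotFlags.HANDLES)
print(bool(old_flag & GraphDebugDotFlags.VERBOSE))

# 1 << 0 is the flag documented as "output all debug data".
new_flag = GraphDebugDotFlags(1 << 0)
print(new_flag == GraphDebugDotFlags.VERBOSE)
```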

Here is an example of graph, dumped with `1 << 10`:

```dot
digraph dot {
subgraph cluster_1 {
label="graph_1" graph[style="dashed"];
"graph_1_node_0"[style="solid" shape="rectangle" label="0
MEM_ALLOC
node handle: 0x000055D2889750F0
"];

"graph_1_node_1"[style="bold" shape="octagon" label="1
_Z3addPhS_S_m
node handle: 0x000055D288979A20
func handle: 0x000055D288978D40
"];

"graph_1_node_2"[style="solid" shape="trapezium"label="2
MEMCPY
node handle: 0x000055D28897A130
(DtoH,1024)
"];

"graph_1_node_3"[style="solid" shape="rectangle" label="3
MEM_FREE
node handle: 0x000055D2889890C0
"];

"graph_1_node_0" -> "graph_1_node_1";
"graph_1_node_1" -> "graph_1_node_2";
"graph_1_node_2" -> "graph_1_node_3";
}
}
```

The same graph dumped with `1 << 0`:

```dot
digraph dot {
subgraph cluster_1 {
label="graph_1" graph[style="dashed"];
"graph_1_node_0"[style="solid" shape="record" label="{
MEM_ALLOC
| {{ID | node handle} | {0 (topoId: 3) | 0x000055D2889750F0}}
| {{{poolProps | {allocType | handleTypes | {location | {type | id}}} | {PINNED | NONE | DEVICE | 0}}}}
| {{bytesize | dptr} | {1024 | 0x0000000A02000000}}
}"];

"graph_1_node_1"[style="bold" shape="record" label="{KERNEL
| {ID | 1 (topoId: 2) | _Z3addPhS_S_m\<\<\<4,256,0\>\>\>}
| {{node handle | func handle} | {0x000055D288979A20 | 0x000055D288978D40}}
| {accessPolicyWindow | {base_ptr | num_bytes | hitRatio | hitProp | missProp} | {0x0000000000000000 | 0 | 0.000000 | N | N}}
| {cooperative | 0}
| {priority | 0}
}"];

"graph_1_node_2"[style="solid" shape="record" label="{
MEMCPY
| {{ID | node handle} | {2 (topoId: 1) | 0x000055D28897A130}}
| {kind | DtoH (DEVICE to HOST PAGEABLE)}
| {{srcPtr | dstPtr} | {pitch | ptr | xsize | ysize | pitch | ptr | xsize | ysize} | {0 | 0x0000000A02000000 | 0 | 0 | 0 | 0x000055D287CA6DB0 | 0 | 0}}
| {{srcPos | {{x | 0} | {y | 0} | {z | 0}}} | {dstPos | {{x | 0} | {y | 0} | {z | 0}}} | {Extent | {{Width | 1024} | {Height | 1} | {Depth | 1}}}}
}"];

"graph_1_node_3"[style="solid" shape="record" label="{
MEM_FREE
| {{ID | node handle} | {3 (topoId: 0) | 0x000055D2889890C0}}
| {{dptr} | {0x0000000A02000000}}
}"];

"graph_1_node_0" -> "graph_1_node_1" [headlabel=0];
"graph_1_node_1" -> "graph_1_node_2" [headlabel=0];
"graph_1_node_2" -> "graph_1_node_3" [headlabel=0];
}
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126694
Approved by: https://github.com/eqy, https://github.com/eellison
2024-05-21 00:55:15 +00:00
4644611b14 [cprofile] log manifold link instead of raw data to trace_structured (#126451)
Internal D57459752 returns manifold URL and this PR adds to tlparse payload

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126451
Approved by: https://github.com/jamesjwu
2024-05-21 00:44:55 +00:00
b85f9d7fa2 Add symbolic_shape_specialization structured trace (#126450)
This is typically the information you want when diagnosing why something
overspecialized in dynamic shapes.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126450
Approved by: https://github.com/albanD
2024-05-21 00:34:05 +00:00
cd3a71f754 Fix silu test for flexattention (#126641)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126641
Approved by: https://github.com/ezyang, https://github.com/drisspg
ghstack dependencies: #126615, #126446
2024-05-20 23:40:56 +00:00
da2292ce6b Prevent partitioner from ever saving views (#126446)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126446
Approved by: https://github.com/anijain2305
ghstack dependencies: #126615
2024-05-20 23:40:56 +00:00
831efeeadf Fix flexattention not realizing inputs before lowering (also refactored runtime estimation) (#126615)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126615
Approved by: https://github.com/yanboliang, https://github.com/drisspg, https://github.com/xmfan
2024-05-20 23:40:56 +00:00
14dc8d4f63 Protect codecache against cache failures (#126696)
When there are Manifold, memcache, or filesystem-related issues, or network outages, we should not fail compilation outright but instead fall back to a cold start.
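
The fallback behavior can be sketched like this. It is an assumption-laden illustration, not the code cache implementation: `load_or_compile`, `cache_get`, `cache_put`, and `compile_fn` are hypothetical names standing in for whatever the real cache layer uses.

```python
def load_or_compile(key, cache_get, cache_put, compile_fn):
    """Return a cached artifact if possible, otherwise compile from scratch.

    Any exception raised by the cache backend is treated as a cache miss,
    so an outage degrades to a cold start instead of a compile failure.
    """
    try:
        hit = cache_get(key)
    except Exception:
        hit = None  # backend unavailable: behave as if nothing was cached
    if hit is not None:
        return hit

    result = compile_fn(key)
    try:
        cache_put(key, result)
    except Exception:
        pass  # failing to store the result must not fail the compile
    return result


def broken_backend(*args):
    # Stand-in for a network or filesystem outage.
    raise ConnectionError("cache backend unreachable")


print(load_or_compile("graph_0", broken_backend, broken_backend, lambda k: f"compiled:{k}"))
```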

Differential Revision: [D57573835](https://our.internmc.facebook.com/intern/diff/D57573835/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126696
Approved by: https://github.com/aorenste
2024-05-20 22:22:41 +00:00
6f1935b0b5 doc: torch.utils.data.Sampler: __len__ is optional (#125938)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125938
Approved by: https://github.com/andrewkho, https://github.com/xmfan
2024-05-20 22:20:36 +00:00
74b053d7c4 Pass model path to observer (#126503)
Summary: Passing model path to observer so that they can get additional info if needed.

Test Plan: contbuild & OSS CI

Differential Revision: D57475129

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126503
Approved by: https://github.com/kirklandsign
2024-05-20 22:17:56 +00:00
acfe237a71 Fix C++ compilation error for tensor array in abi_compatible mode (#126412)
Fixes #122048

There is a compilation error (https://github.com/pytorch/pytorch/issues/122048) when the element type in an array is a tensor. This is because `val_to_arg_str` does not take the arg type as input and always generates an int array.

This PR changes the underlying `codegen_int_array_var` to `codegen_var_array` by adding type checks and the corresponding code generation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126412
Approved by: https://github.com/desertfire
2024-05-20 20:57:50 +00:00
3d4f1c3083 [export] Make error name private (#126715)
Fixes CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126715
Approved by: https://github.com/clee2000
2024-05-20 20:50:11 +00:00
d28868c7e8 Change skipIfs to xfails in test_mps.py for test_isin (#125412)
Follow-up to #124896 to move the added test to use expectedFailure instead of skip.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125412
Approved by: https://github.com/kulinseth
2024-05-20 20:23:53 +00:00
8bca0847c2 Revert "[TD] Upload names of failures to s3 for pytest cache (#126315)"
This reverts commit 655038687afd19a4a4c9371b77ff046fd6c84be1.

Reverted https://github.com/pytorch/pytorch/pull/126315 on behalf of https://github.com/clee2000 due to broke inductor ([comment](https://github.com/pytorch/pytorch/pull/126315#issuecomment-2121133045))
2024-05-20 20:15:08 +00:00
2813f0672a fix huggingface models input issue in torchbench (#126579)
Fixes https://github.com/pytorch/benchmark/issues/2263.

According to https://github.com/pytorch/pytorch/blob/main/benchmarks/dynamo/common.py#L509, example_inputs are formatted as dictionaries for HuggingFace models. However, this forward_pass function passes all inputs to mod with *, which may only pass the input_ids key in HuggingFace model's example inputs.

To reproduce, run the following command.
```bash
python pytorch/benchmarks/dynamo/torchbench.py --performance --inference -dcuda --only=hf_Bert --output=torchbench_inference.csv
```
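
The underlying pitfall is plain Python behavior: unpacking a dictionary with `*` passes its keys, not its values. A minimal sketch, where `call_model` and `toy_model` are hypothetical stand-ins for the benchmark's forward pass and a HuggingFace model:

```python
def call_model(mod, example_inputs):
    """Call mod with dict inputs as keyword arguments, anything else positionally."""
    if isinstance(example_inputs, dict):
        return mod(**example_inputs)
    return mod(*example_inputs)


def toy_model(input_ids, labels=None):
    # Echo the arguments so we can see what the model actually received.
    return {"input_ids": input_ids, "labels": labels}


inputs = {"input_ids": [1, 2, 3], "labels": [0, 1, 0]}

# Star-unpacking a dict iterates over its keys, so the model sees strings.
print(toy_model(*inputs))

# Dispatching on the input type forwards the real values.
print(call_model(toy_model, inputs))
```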

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126579
Approved by: https://github.com/xuzhao9
2024-05-20 19:10:46 +00:00
11c2d127ec [AOTInductor] Add config to allow buffer mutation (#126584)
Summary:
Add an additional config to allow buffer mutation.
For data that's greater than 2GB, we would need to set it as read-only, otherwise overflow would occur.
This is a temporary solution since it won't handle cases that require mutable data greater than 2GB.

Test Plan: Included in commit.

Differential Revision: D57514729

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126584
Approved by: https://github.com/chenyang78
2024-05-20 18:16:00 +00:00
2068dadbe8 [torchbench] Add torchao to PT2 Benchmark Runner (#126469)
Summary:
X-link: https://github.com/pytorch/benchmark/pull/2268

Support torchao performance and accuracy tests in PT2 Benchmark Runner, using the inductor backend as the baseline.

Test Plan:
```
$ buck2 run mode/opt //caffe2/benchmarks/dynamo:torchbench -- --only BERT_pytorch --bfloat16 --quantization int8dynamic --performance --inference --print-memory

loading model: 0it [00:50, ?it/s]
cuda eval  BERT_pytorch
memory: eager: 0.75 GB, dynamo: 0.75 GB, ratio: 1.00
running benchmark: 100%
1.003x
```

Reviewed By: jerryzh168

Differential Revision: D57463273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126469
Approved by: https://github.com/huydhn
2024-05-20 17:53:44 +00:00
022adf8c5e Fix bug for comptime.get_local for cells/closures (#126637)
I wasn't paying enough attention and didn't notice that LOAD_DEREF is
defined differently for InliningInstructionTranslator.  Match it up with
the code there.

This also fixes comptime.print(), which was broken, because closing over
an argument turned it into a cell rather than a regular local.
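
The cell behavior mentioned above is visible in plain CPython, independent of Dynamo. In this small sketch (function names are illustrative only), an argument that is captured by an inner function is listed in `co_cellvars`, and the inner function sees it as a free variable:

```python
def plain(x):
    # x is an ordinary local here.
    return x + 1


def closes_over_arg(x):
    # Because inner() captures x, the compiler stores x in a cell.
    def inner():
        return x

    return inner


print(plain.__code__.co_cellvars)
print(closes_over_arg.__code__.co_cellvars)
print(closes_over_arg(1).__code__.co_freevars)
```

Any tool that looks up a local by name therefore has to handle both the plain-local and the cell case.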

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126637
Approved by: https://github.com/yanboliang
2024-05-20 17:51:28 +00:00
f9de510121 [dynamo] Graph break on set_num_threads (#126623)
Fixes #125364

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126623
Approved by: https://github.com/yanboliang
2024-05-20 17:44:32 +00:00
89c1cfe144 [export] Allow modules to be created in the forward (#125725)
Fixes the error in non-strict export when we're tracing a module that initializes another module in its forward function. This appears in [many huggingface models](https://github.com/search?q=repo%3Ahuggingface%2Ftransformers+CrossEntropyLoss%28%29&type=code&fbclid=IwAR285uKvSevJM6SDbXmb4-monj4iH7wf8opkvnec-li7sKpn4lUMjIvbGKc). It's probably not good practice to do this, but since it appears in so many places, and strict-export supports this, we will also support this.

The approach we'll take for these cases is that we will inline the call to the module. Parameters and buffers initialized as constants (with `torch.tensor`) will be represented as constant tensors, and those initialized with tensor factory functions (`torch.ones`) will show up as an operator in the graph. The module stack for the ops in the inlined module will reflect the toplevel's module stack.

One issue is that strict-export seems to segfault when there is an `nn.Parameter` call in the constructor (https://github.com/pytorch/pytorch/issues/126109). Non-strict export will succeed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125725
Approved by: https://github.com/ydwu4
2024-05-20 17:42:20 +00:00
655038687a [TD] Upload names of failures to s3 for pytest cache (#126315)
Some tests don't get run through pytest and pytest crashes when a test segfaults, so in both cases, the pytest cache won't have an entry (similar to https://github.com/pytorch/test-infra/pull/5205).

Instead, manually upload/download an extra file that lists the failing test files

Technically this would be more general than the pytest cache
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126315
Approved by: https://github.com/ZainRizvi
2024-05-20 17:36:30 +00:00
8c38d0cd64 [inductor] Fix edge case in JIT vs. AOT fusion after finalizing MultiTemplateBuffer (#126622)
# Context
Here's a peripheral scenario causing the JIT-pass and AOT-pass to pick different fusions.
```py
# JIT -- buf3 is a MultiTemplateBuffer
V.graph.buffers = [buf0, buf1, buf2, buf3, buf4]
                                ^          ^
# JIT pass calls finalize_multi_template_buffers()
V.graph.buffers = [buf0, buf1, buf2, buf4, *buf3*]

# AOT, note proximity_score(buf2, buf4) is "better" for fusion than JIT
V.graph.buffers = [buf0, buf1, buf2, buf4, *buf3*]
                                ^    ^
```

It happens like this:
* JIT starts with the original set of nodes using V.graph.buffers
* In JIT, finalize_multi_template_buffers() is called which can change the order of the buffers.
* This makes the order of buffers/scheduler nodes different.
* Now, each node's min/max-order is different than before.
* As a result, the proximity between two nodes is different. ad67553c5c/torch/_inductor/scheduler.py (L2316-L2335)
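
The effect of the reordering can be sketched with a toy model. The `proximity` function below is a simplified assumption (negative distance between positions in the buffer list), not the scheduler's actual scoring code, and `move_to_end` only imitates the reordering shown in the diagram above.

```python
def proximity(order, a, b):
    """Toy proximity score: closer buffers get a higher (less negative) score."""
    return -abs(order.index(a) - order.index(b))


def move_to_end(order, name):
    """Imitate the reordering in the diagram: the finalized buffer moves last."""
    return [buf for buf in order if buf != name] + [name]


jit_initial = ["buf0", "buf1", "buf2", "buf3", "buf4"]
after_finalize = move_to_end(jit_initial, "buf3")

# JIT scored (buf2, buf4) on the original order; AOT sees the reordered list.
print(proximity(jit_initial, "buf2", "buf4"))
print(proximity(after_finalize, "buf2", "buf4"))
```

Under this toy score the same pair of buffers looks like a better fusion candidate after the reorder, which is how the two passes can diverge.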

# Error
```
$ TORCH_LOGS="+fusion" python test/inductor/test_max_autotune.py -k test_jit_fusion_matches_aot_fusion
======================================================================
FAIL: test_jit_fusion_matches_aot_fusion (__main__.TestMaxAutotune)
----------------------------------------------------------------------
Traceback (most recent call last):
  ...
  File "/data/users/colinpeppler/pytorch/torch/_inductor/graph.py", line 1718, in compile_to_fn
    code, linemap = self.codegen_with_cpp_wrapper()
  File "/data/users/colinpeppler/pytorch/torch/_inductor/graph.py", line 1618, in codegen_with_cpp_wrapper
    return self.codegen()
  File "/data/users/colinpeppler/pytorch/torch/_inductor/graph.py", line 1636, in codegen
    self.scheduler.codegen()
  File "/data/users/colinpeppler/pytorch/torch/_dynamo/utils.py", line 210, in time_wrapper
    r = func(*args, **kwargs)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 2602, in codegen
    self.get_backend(device).codegen_node(node)  # type: ignore[possibly-undefined]
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/cuda_combined_scheduling.py", line 66, in codegen_node
    return self._triton_scheduling.codegen_node(node)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3377, in codegen_node
    return self.codegen_node_schedule(node_schedule, buf_accesses, numel, rnumel)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3602, in codegen_node_schedule
    final_kernel.call_kernel(final_kernel.kernel_name)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3055, in call_kernel
    grid = wrapper.generate_default_grid(name, grid)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/cpp_wrapper_cuda.py", line 174, in generate_default_grid
    params is not None
AssertionError: cuda kernel parameters for triton_poi_fused_add_0 should already exist at this moment, only found dict_keys(['Placeholder.DESCRIPTIVE_NAME', 'triton_poi_fused_add_mul_0', 'triton_poi_fused_pow_1'])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126622
Approved by: https://github.com/chenyang78
ghstack dependencies: #125982
2024-05-20 16:58:08 +00:00
7aa853a54e [CI] Install sccache on XLA build job (#126117)
The XLA build job uses a docker image from XLA, which doesn't have sccache installed. The XLA build job only builds PyTorch; XLA itself gets built during the test job. The PyTorch build was taking over an hour; with a warm cache it takes under 30 minutes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126117
Approved by: https://github.com/malfet
2024-05-20 16:39:14 +00:00
3642e51ea5 [Quant][PT2E] enable qlinear post op fusion for dynamic quant & qat (#122667)
**Description**
Add fusion path for dynamic quant and for QAT.
The following patterns can be matched for static quant with QAT cases:
`qx -> qlinear -> add -> optional relu -> optional type convert -> optional quant`

The following patterns can be matched for dynamic quant cases:
`qx -> qlinear -> add -> optional relu`

**Test plan**
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear
python test/inductor/test_cpu_cpp_wrapper.py -k test_qlinear
python test/test_quantization.py -k test_linear_unary
python test/test_quantization.py -k test_linear_binary

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122667
Approved by: https://github.com/jgong5
2024-05-20 15:55:18 +00:00
2f53747ec6 Speedup bf16 gemm fallback on ARM (#126592)
By dispatching it to multiple threads and using a vectorized dot operation (with bf16 to fp32 upcasts via left shift).
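
The left-shift upcast relies on bfloat16 being the upper 16 bits of an IEEE float32. A scalar Python sketch of that idea (the helper names are hypothetical, and the real kernel does this with vector instructions rather than `struct`):

```python
import struct


def float_to_bf16_bits(x):
    """Truncate a float32 to bfloat16 by keeping its upper 16 bits (no rounding)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16


def bf16_bits_to_float(bits):
    """Upcast bfloat16 to float32: shift left by 16 and reinterpret the bits."""
    (value,) = struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))
    return value


def bf16_dot(a_bits, b_bits):
    """Dot product of two bfloat16 vectors, accumulated after upcasting."""
    return sum(bf16_bits_to_float(a) * bf16_bits_to_float(b) for a, b in zip(a_bits, b_bits))


a = [float_to_bf16_bits(v) for v in (1.0, 2.0)]
b = [float_to_bf16_bits(v) for v in (1.5, -2.0)]
print(bf16_dot(a, b))
```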

This bumps stories110M eval from 22 to 55 tokens/sec using bfloat16

TODO:
 - Refactor tinygemm template and use it here

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126592
Approved by: https://github.com/mikekgfb
2024-05-20 12:39:51 +00:00
cb69c51b6f Revert " Updated test_graph_optims and test_graph_scaling_fused_optimizers to use new OptimizerInfo infrastructure (#125127)"
This reverts commit cf35a591b95220aa1bfcc04ff8a943efd1d6d6eb.

Reverted https://github.com/pytorch/pytorch/pull/125127 on behalf of https://github.com/DanilBaibak due to Broken trunk ([comment](https://github.com/pytorch/pytorch/pull/125127#issuecomment-2120337584))
2024-05-20 12:14:22 +00:00
7100a72950 [inductor] Fix ops.scan for non-commutative operators (#126633)
`tl.associative_scan` supports non-commutative combine functions but `tl.reduce`
doesn't. This affects non-persistent scans, where we use the reduction from
the previous loop iterations as the base for future iterations.

Here I work around this by taking the last element of the scan output and using
that as the reduced value. This is done using a trick where we create a
mask that is 1 at the desired element and 0 elsewhere, then sum over it.
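
The mask trick can be illustrated in plain Python. This is a sketch of the idea only, not the Triton code from the PR: the combine function (composition of affine maps `x -> a*x + b`) is chosen here just because it is associative but not commutative.

```python
from functools import reduce
from itertools import accumulate


def combine(f, g):
    """Compose affine maps: apply f first, then g. Associative, not commutative."""
    a1, b1 = f
    a2, b2 = g
    return (a1 * a2, a2 * b1 + b2)


def last_via_mask(scanned):
    """Extract the final scan element with a 0/1 mask and a sum."""
    n = len(scanned)
    mask = [1 if i == n - 1 else 0 for i in range(n)]
    return tuple(
        sum(m * elem[k] for m, elem in zip(mask, scanned))
        for k in range(len(scanned[0]))
    )


block = [(2, 1), (3, 0), (1, 5)]
scanned = list(accumulate(block, combine))  # inclusive scan, in order

# The last scan element is the ordered reduction of the whole block.
print(last_via_mask(scanned))

# Reducing in a different order gives a different answer for this operator.
print(reduce(combine, reversed(block)))
```

Because the sum only ever adds zeros to the single selected element, it is safe even though the combine function itself is order-sensitive.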

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126633
Approved by: https://github.com/Chillee, https://github.com/lezcano
2024-05-20 10:27:17 +00:00
d9c3485146 Revert "c10d: add Collectives abstraction (#125978)"
This reverts commit 4b2ae2ac338f3a0de340c9711b03989b8ce66fc6.

Reverted https://github.com/pytorch/pytorch/pull/125978 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/125978#issuecomment-2119858015))
2024-05-20 07:40:41 +00:00
53f73cdeb6 Revert "Add symbolic_shape_specialization structured trace (#126450)"
This reverts commit da1fc85d60fcf0bd1e8638d643a7c0c6560c3a5f.

Reverted https://github.com/pytorch/pytorch/pull/126450 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126450#issuecomment-2119798075))
2024-05-20 06:59:58 +00:00
5ad2f10034 Revert "[inductor] Load python modules using importlib (#126454)"
This reverts commit faa26df72e2a3ff08f9dd564bb50756916826854.

Reverted https://github.com/pytorch/pytorch/pull/126454 on behalf of https://github.com/DanilBaibak due to Break internal build ([comment](https://github.com/pytorch/pytorch/pull/126454#issuecomment-2119771267))
2024-05-20 06:41:11 +00:00
cf35a591b9 Updated test_graph_optims and test_graph_scaling_fused_optimizers to use new OptimizerInfo infrastructure (#125127)
This PR is meant to address issue #123451, more specifically, the ```test_graph_optims``` and ```test_graph_scaling_fused_optimizers``` functions in ```test_cuda.py``` have been updated so that they now use the new OptimizerInfo infrastructure.

Lintrunner passed:
```
$ lintrunner test/test_cuda.py
ok No lint issues.
```
Tests passed:
```
>python test_cuda.py -k test_graph_optims
Ran 19 tests in 7.463s

OK (skipped=9)

>python test_cuda.py -k test_graph_scaling_fused_optimizers
Ran 6 tests in 2.800s

OK (skipped=3)
```
Both the functions have been moved to the newly created TestCase class ```TestCudaOptims```. The test is mostly the same except the ```@optims``` decorator is used at the top of the function to implicitly call the function using each of the optimizers mentioned in the decorator instead of explicitly using a for loop to iterate through each of the optimizers.

I was unable to use the ```_get_optim_inputs_including_global_cliquey_kwargs``` to get all kwargs for each of the optimizers since some of the kwargs that are used in the original ```test_graph_optims``` function are not being returned by the new OptimizerInfo infrastructure, more specifically, for the ```torch.optim.rmsprop.RMSprop``` optimizer, the following kwargs are not returned whenever ```_get_optim_inputs_including_global_cliquey_kwargs``` is called:
```
{'foreach': False, 'maximize': True, 'weight_decay': 0}
{ 'foreach': True, 'maximize': True, 'weight_decay': 0}
```
I ran into the same issue for ```test_graph_scaling_fused_optimizers```, for the ```torch.optim.adamw.AdamW``` optimizer, whenever ```optim_info.optim_inputs_func(device=device)``` was called, the following kwarg was not returned:
```
{'amsgrad': True}
```

Due to this issue, I resorted to using a dictionary to store the kwargs for each of the optimizers, I am aware that this is less than ideal. I was wondering whether I should use the OptimizerInfo infrastructure to get all the kwargs regardless of the fact that it lacks some kwargs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125127
Approved by: https://github.com/janeyx99
2024-05-20 06:20:45 +00:00
5fb11cda4f [compiled autograd] Better cache miss logging (#126602)
- log only first node key cache miss
- log existing node key sizes
- log which node's collected sizes became dynamic
e.g.
```
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to new autograd node: torch::autograd::GraphRoot (NodeCall 0) with key size 39, previous key sizes=[]
...
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to new autograd node: torch::autograd::AccumulateGrad (NodeCall 5) with key size 32, previous key sizes=[21]
...
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 0 of torch::autograd::GraphRoot (NodeCall 0)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of SumBackward0 (NodeCall 1)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 4 of SumBackward0 (NodeCall 1)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of ReluBackward0 (NodeCall 2)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 9 of AddmmBackward0 (NodeCall 3)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of torch::autograd::AccumulateGrad (NodeCall 5)
DEBUG:torch._dynamo.compiled_autograd.__compiled_autograd_verbose:Cache miss due to dynamic shapes: collected size idx 2 of ReluBackward0 (NodeCall 6)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126602
Approved by: https://github.com/jansel
ghstack dependencies: #126144, #126146, #126148, #126483
2024-05-19 23:49:52 +00:00
be67985bd7 [compiled autograd] log in cpp using python logger (#126483)
Internal infra may not preserve python and c++ log ordering e.g. MAST logs: https://fburl.com/mlhub/38576cxn, all the `[python_compiled_autograd.cpp] Creating cache entry [...]` logs of the entire run are at the beginning of the file

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126483
Approved by: https://github.com/jansel
ghstack dependencies: #126144, #126146, #126148
2024-05-19 23:49:52 +00:00
cyy
574ae9afb8 [Submodule] Remove third-party onnx-tensorrt (#126542)
It seems that tensorrt is not used by the C++ code, maybe due to the removal of Caffe2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126542
Approved by: https://github.com/ezyang
2024-05-19 22:34:24 +00:00
cyy
853081a8e7 Replace torch.library.impl_abstract with torch.library.register_fake (#126606)
To remove the disruptive warning
```
      warnings.warn("torch.library.impl_abstract was renamed to "
                    "torch.library.register_fake. Please use that instead; "
                    "we will remove torch.library.impl_abstract in a future "
                    "version of PyTorch.",
                    DeprecationWarning, stacklevel=2)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126606
Approved by: https://github.com/ezyang
2024-05-19 13:21:39 +00:00
5ea956a61f Update hf_BirdBird periodic-dynamo-benchmarks results (#126414)
Can't repro this regression. Also, nothing in the faulty PR range would cause it for only one model. The job is still causing noise, so we should mute it. I think just updating the graph break count is better than skipping the model here, since it's still passing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126414
Approved by: https://github.com/ezyang
2024-05-19 10:58:07 +00:00
c4dfd783f4 UFMT torch.utils._sympy.functions (#126553)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126553
Approved by: https://github.com/lezcano, https://github.com/Skylion007
ghstack dependencies: #126511
2024-05-19 10:35:48 +00:00
7dae7d3ca5 Remove unnecessary implementations from MockHandler (#126511)
Dead implementations are confusing and can cause bugs when people
accidentally hit them.  Better for it to be missing.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126511
Approved by: https://github.com/peterbell10, https://github.com/lezcano
2024-05-19 04:43:54 +00:00
71b6459edc Revert "[Dynamo] Treat integers stored on nn.Modules as dynamic (#126466)"
This reverts commit 6bb9d6080d33c817fcbf9e5ae8a59b76812a53d2.

Reverted https://github.com/pytorch/pytorch/pull/126466 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the ONNX test failure looks legit, not flaky, as it starts failing in trunk 6bb9d6080d ([comment](https://github.com/pytorch/pytorch/pull/126466#issuecomment-2119078245))
2024-05-19 02:52:11 +00:00
e3230f87aa Cached required_fw_nodes creation (#126613)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126613
Approved by: https://github.com/anijain2305
2024-05-19 01:48:52 +00:00
abc4b66124 Forward fix the failed new test from D57474327 (#126596)
Summary: TSIA.  The two look the same to me, but buck was failing with the following error when `with torch._inductor.utils.fresh_inductor_cache()` is used:

```
_________________________ ReproTests.test_issue126128 __________________________

self = <caffe2.test.dynamo.test_repros.ReproTests testMethod=test_issue126128>

    def test_issue126128(self):
        def fn():
            x = torch.randn(1, 10)
            y = torch.randn(10, 1)
            return torch.mm(x, y).sum()

        def fn2():
            x = torch.randn(10, 100)
            y = torch.randn(100, 10)
            return torch.mm(x, y).sum()

>       with torch._inductor.utils.fresh_inductor_cache():
E       AttributeError: module 'torch._inductor' has no attribute 'utils'
```

Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_repros.py::ReproTests::test_issue126128'`

Differential Revision: D57516676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126596
Approved by: https://github.com/xmfan
2024-05-18 23:56:03 +00:00
ad67553c5c Updated test_torch.py to use new OptimizerInfo infrastructure (#125538)
Fixes #123451 (only addresses test_torch.py cases)

This PR solves the specific task to update `test_grad_scaling_autocast` and `test_params_invalidated_with_grads_invalidated_between_unscale_and_step` in `test/test_torch.py` to use the new OptimizerInfo infrastructure.

I have combined tests that call `_grad_scaling_autocast_test` into one called `test_grad_scaling_autocast` and used `_get_optim_inputs_including_global_cliquey_kwargs` to avoid hard-coded configurations.

```
$ lintrunner test/test_cuda.py
ok No lint issues.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125538
Approved by: https://github.com/janeyx99
2024-05-18 15:42:45 +00:00
99af1b3ab0 Refactor variables / function names related to non-strict export (#126458)
Improve variable and function naming for better clarity: `non strict` --> `aten`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126458
Approved by: https://github.com/angelayi
2024-05-18 06:05:14 +00:00
6bb9d6080d [Dynamo] Treat integers stored on nn.Modules as dynamic (#126466)
Fixes #115711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126466
Approved by: https://github.com/jansel
2024-05-18 05:02:16 +00:00
a44d0cf227 [Traceable FSDP2] Change from register_multi_grad_hook to per-tensor backward hook (#126350)
As discussed with Andrew before, under compile we will register per-tensor backward hook instead of multi-grad hook, because it's difficult for Dynamo to support `register_multi_grad_hook` (or anything `.grad_fn` related). We expect both to have the same underlying behavior, ~~and we will add integration test (in subsequent PR) to show that compile and eager has same numerics.~~

As discussed below, we will change eager path to use per-tensor backward hook as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126350
Approved by: https://github.com/awgu
2024-05-18 04:44:29 +00:00
d4704dcacc Map float8 types to uint8 for allgather (#126556)
# Summary
Different take on this one:
https://github.com/pytorch/pytorch/issues/126338

We should probably not allow this mapping for 'compute' ops e.g. reductions

### Corresponding fp8 PR
https://github.com/pytorch-labs/float8_experimental/pull/263

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126556
Approved by: https://github.com/wanchaol
2024-05-18 03:19:16 +00:00
bf099a08f0 [2/N] Non-Tensor: Scalar Support: Add scalar to the cache for eager-through-torch.compile (#124070)
Add scalar information to the kernel configuration.

#### Additional Context
Currently, the input parameters are orchestrated by input order in the kernel configuration and loaded/mapped to the kernel at runtime. For example, the cache order of the input parameters of `torch.add(a, b, alpha=2.0)` is `a` first, followed by `b` and then `alpha`. The same order is used for cache loading.

However, the orchestration mechanism does not support kwargs because the order of kwargs is useless. For example, the `out` of `aten::gelu.out(Tensor self, *, str approximate='none', Tensor(a!) out) -> Tensor(a!)` may be before `approximate`. We will support it with subsequent PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124070
Approved by: https://github.com/jansel, https://github.com/jgong5
2024-05-18 03:08:37 +00:00
c1767d8626 Faster(?) FP16 gemv kernel (#126297)
Differential Revision: [D57369266](https://our.internmc.facebook.com/intern/diff/D57369266/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D57369266/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126297
Approved by: https://github.com/malfet
2024-05-18 03:03:03 +00:00
b98decfc38 [halide-backend] Refactor codegen/triton.py into codegen/simd.py (#126415)
This PR is primarily just moving stuff around.  It creates a new
common baseclass for TritonCodegen and the (upcoming) HalideCodegen.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126415
Approved by: https://github.com/shunting314
2024-05-18 02:43:42 +00:00
cyy
74b99438f2 [Submodule] Remove third-party CUB (#126540)
Because it was last updated 4 years ago, and now all supported CUDA versions provide CUB.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126540
Approved by: https://github.com/Skylion007
2024-05-18 02:28:17 +00:00
1191168c45 [pipelining] Follow improvements in export.unflatten (#126217)
Previously, we made a copy of `torch.export.unflatten` in pippy/_unflatten.py.

But it turned out to be too hard to track bug fixes and improvements in the upstream version. For example, `torch.export.unflatten` recently added support for tied parameters, which is something pipelining needs.

Now that we have moved into pytorch, we reference `torch.export.unflatten` instead of maintaining a copy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126217
Approved by: https://github.com/H-Huang
2024-05-18 02:24:01 +00:00
661ecedbd0 gitmodules: switch cpp-httplib to https (#126580)
Fixes issue introduced in https://github.com/pytorch/pytorch/pull/126470#issuecomment-2118374811

Test plan:

CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126580
Approved by: https://github.com/PaliC, https://github.com/jeffdaily
2024-05-18 01:31:28 +00:00
224f2bef9f [C10D] Add __repr__ to P2POp class (#126538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126538
Approved by: https://github.com/Skylion007, https://github.com/kwen2501, https://github.com/c-p-i-o
ghstack dependencies: #126419
2024-05-18 00:58:57 +00:00
bcee6f708a [Pipelining] Fix 1f1b schedule (#126419)
This schedule was running fine locally but failing (hanging) on CI.

After analysis (https://fburl.com/gdoc/xt80h1gd), it seems like the
schedule was not correct previously but may still work depending on the
runtime.

The fix bundles together fwd-recv(s->s+1) and bwd-send(s+1->s) into one
coalesced group so they would not block each other.

Design drawing
<img width="803" alt="image" src="https://github.com/pytorch/pytorch/assets/4984825/906a9a66-39ae-4a6a-bc1a-18b77eaaa784">

Flight recorder traces show the same coalescing pattern as designed
<img width="1013" alt="image" src="https://github.com/pytorch/pytorch/assets/4984825/ab10646e-eaef-4191-83dd-73f448876c27">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126419
Approved by: https://github.com/c-p-i-o, https://github.com/kwen2501
2024-05-18 00:58:57 +00:00
41fb4bcc73 [AOTI] Flag to include aoti sources when building lite interpreter (#126572)
Summary:
Added USE_LITE_AOTI cmake flag, which is turned OFF by default.
When it is turned on, the AOTI sources  (inductor_core_resources) are included when building lite interpreter

Test Plan:
```
ANDROID_ABI=arm64-v8a ./scripts/build_android.sh -DUSE_LITE_AOTI=ON
```

Differential Revision: D57394078

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126572
Approved by: https://github.com/malfet
2024-05-18 00:39:42 +00:00
2863c76b1f [torch-distributed] Make log directory creation idempotent (#126496)
Summary:
https://docs.python.org/3/library/os.html#os.makedirs
> If exist_ok is False (the default), a FileExistsError is raised if the target directory already exists.
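
A minimal sketch of what idempotent directory creation looks like, assuming the change relies on `exist_ok=True` as the quoted documentation suggests. The helper name `ensure_log_dir` and the directory layout are illustrative only and are not taken from this commit.

```python
import os
import tempfile


def ensure_log_dir(path):
    """Create `path` (and any missing parents); do nothing if it already exists."""
    os.makedirs(path, exist_ok=True)
    return path


# Hypothetical layout, used only for the demonstration.
base = tempfile.mkdtemp()
log_dir = os.path.join(base, "logs", "attempt_0")

first = ensure_log_dir(log_dir)
second = ensure_log_dir(log_dir)  # repeated call is a no-op, not an error

# Without exist_ok=True the second creation attempt raises FileExistsError.
try:
    os.makedirs(log_dir)
    raised = False
except FileExistsError:
    raised = True
```

Calling the helper from several workers that race to create the same directory is safe for the same reason: whichever call loses the race simply finds the directory already present.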

Test Plan: Existing tests

Differential Revision: D57471577

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126496
Approved by: https://github.com/d4l3k
2024-05-18 00:17:13 +00:00
0d5ba547ec Tool for scouting exportability in one shot (#126471)
Summary:
Tool for scouting exportability issues in one shot.

- Collect sample inputs for all submodules by running eager inference with forward_pre_hook.
- Starting from the root module, recursively try exporting child modules if the current module's export fails.

Limitations:
- Only works for an nn.Module that contains a tree-like submodule structure. It doesn't work for a flattened GraphModule.

TODO: support dynamic_dims

Sample output: https://docs.google.com/spreadsheets/d/1jnixrqBTYbWO_y6AaKA13XqOZmeB1MQAMuWL30dGoOg/edit?usp=sharing

```
exportability_report =
        {
            '': UnsupportedOperatorException(func=<OpOverload(op='testlib.op_missing_meta', overload='default')>),
            'submod_1': UnsupportedOperatorException(func=<OpOverload(op='testlib.op_missing_meta', overload='default')>),
            'submod_2': None
        }
```
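
The recursive strategy from the summary can be sketched in plain Python. This is not the tool's implementation: `Module`, `try_export`, and `scout` are hypothetical stand-ins, and a module's export is assumed to fail whenever it or any descendant contains an unsupported operator. The report has the same shape as the sample above (module path mapped to an exception or `None`).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Module:
    """Hypothetical stand-in for an nn.Module with named children."""
    name: str
    has_unsupported_op: bool = False
    children: Dict[str, "Module"] = field(default_factory=dict)


def contains_unsupported(module: Module) -> bool:
    return module.has_unsupported_op or any(
        contains_unsupported(child) for child in module.children.values()
    )


def try_export(module: Module) -> Optional[Exception]:
    """Stand-in for an export attempt: returns the failure, or None on success."""
    if contains_unsupported(module):
        return RuntimeError(f"unsupported operator under {module.name!r}")
    return None


def scout(module: Module, path: str = "", report=None):
    """Record the export result for `module`; descend only if it failed."""
    if report is None:
        report = {}
    error = try_export(module)
    report[path] = error
    if error is not None:
        for child_name, child in module.children.items():
            child_path = f"{path}.{child_name}" if path else child_name
            scout(child, child_path, report)
    return report


root = Module(
    "root",
    children={
        "submod_1": Module("submod_1", has_unsupported_op=True),
        "submod_2": Module("submod_2"),
    },
)
report = scout(root)
clean_report = scout(Module("ok", children={"a": Module("a")}))
```

Descending only on failure keeps the report short: a subtree that exports cleanly is reported once at its root and never visited further.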

Test Plan: buck2 run mode/dev-nosan fbcode//caffe2/test:test_export -- -r TestExportTools

Differential Revision: D57466486

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126471
Approved by: https://github.com/zhxchen17
2024-05-18 00:10:46 +00:00
54bc55c515 Remove dist_ prefix from TORCH_LOGS shortcuts (#126499)
e.g. dist_ddp -> ddp

'distributed' shortcut remains unchanged

Feedback has been that it is not appealing to have the dist_ prefix,
and the main reason for it was to keep the distributed shortcuts grouped
together in the help menu.  It's nice to have shorter shortcuts.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126499
Approved by: https://github.com/XilunWu, https://github.com/kwen2501
ghstack dependencies: #126322
2024-05-18 00:07:30 +00:00
93844a31b3 Fix aarch64 debug build with GCC (#126290)
By working around GCC's quirks in instantiating templates that require immediate values.
Provide alternative implementation for scaling the output if compiled without any optimizations (both GCC and clang define `__OPTIMIZE__` if invoked with anything but `-O0`)

Test plan (after change was reverted): ssh into aarch64 runner and rebuild given file with `-O0`

Fixes https://github.com/pytorch/pytorch/issues/126283

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126290
Approved by: https://github.com/atalman, https://github.com/seemethere
2024-05-17 23:47:08 +00:00
d54c28e7fc Added error checks for invalid inputs on thnn_conv2d (#121906)
Fixes #121188
Prevent Segmentation Fault in 'torch._C._nn.thnn_conv2d'

Previously, calling 'torch._C._nn.thnn_conv2d' with invalid arguments for padding, stride, and kernel_size would result in a segmentation fault. This issue has been resolved by implementing argument validation (using Torch Check). Now, when invalid arguments are detected, a runtime error is raised with a debug message detailing the correct format.

Additionally, this commit includes tests to cover the three referenced cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/121906
Approved by: https://github.com/janeyx99
2024-05-17 23:41:48 +00:00
173b1d811d [dynamo] Sourceless builder - ordered dict and re.pattern (#126468)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126468
Approved by: https://github.com/Skylion007
2024-05-17 23:24:55 +00:00
faa26df72e [inductor] Load python modules using importlib (#126454)
The `compile` + `exec` workflow is susceptible to behavior drifting from
a "normal" import; use importlib instead to avoid this.

In particular, here annotations were being stored as strings due to
`from __future__ import annotations` in the scope calling `compile`.
Triton cares about annotations on global variables and this makes it
much easier to reliably code-gen them.
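
A small standalone sketch of the drift described above, not the Inductor code itself. By default `compile()` inherits future-statement flags from the calling scope, so the sketch passes the `annotations` flag explicitly to simulate a caller that has the future import in effect. The helper names and the generated source are hypothetical.

```python
import __future__
import importlib.util
import os
import tempfile

SOURCE = "x: int = 1\n"

# Simulates the flag that compile() would inherit from a calling scope
# that has `from __future__ import annotations` in effect.
INHERITED_FLAGS = __future__.annotations.compiler_flag


def load_with_exec(source):
    """compile + exec: the caller's future flags affect the generated code."""
    namespace = {}
    code = compile(source, "<generated>", "exec", flags=INHERITED_FLAGS)
    exec(code, namespace)
    return namespace


def load_with_importlib(source, name="generated_demo_module"):
    """Write the source to a file and load it the way a normal import would."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, name + ".py")
        with open(path, "w") as f:
            f.write(source)
        spec = importlib.util.spec_from_file_location(name, path)
        module = importlib.util.module_from_spec(spec)
        spec.loader.exec_module(module)
        return module


exec_annotations = load_with_exec(SOURCE)["__annotations__"]
import_annotations = load_with_importlib(SOURCE).__annotations__
```

The `exec` path stores the annotation as the string `'int'`, while the importlib path stores the `int` type itself, because the generated file is compiled on its own terms rather than under the caller's flags.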

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126454
Approved by: https://github.com/peterbell10
2024-05-17 23:13:07 +00:00
d7de4c9d80 Fix issue of lowering nn.linear ops with kwargs (#126331)
Summary: Support kwarg bias for nn.linear quantization

Differential Revision: D57403190

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126331
Approved by: https://github.com/ZhengkaiZ, https://github.com/huydhn
2024-05-17 21:50:55 +00:00
c26f6548f9 [AOTI] config target platform (#126306)
Test Plan: AOTI compile stories15M for Android

Differential Revision: D57392830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126306
Approved by: https://github.com/desertfire
2024-05-17 21:42:19 +00:00
09fd771485 Disable vulkan test batch_norm_invalid_inputs (#126571)
Fails flakily, e.g. https://github.com/pytorch/pytorch/actions/runs/9130802617/job/25109131748
https://github.com/pytorch/pytorch/actions/runs/9125548571/job/25092535707

First bad commit I can find is 538877d204

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126571
Approved by: https://github.com/SS-JIA
2024-05-17 21:11:07 +00:00
bed1c600bb Experimental prototype for converting torch.jit.trace modules to export (#124449)
Differential Revision: [D56440613](https://our.internmc.facebook.com/intern/diff/D56440613)

We want to do this for the following reasons:
1. There is a current limitation in export tracing for torch.jit.trace'd modules that cannot be easily upstreamed
2. We need to run internal CI regularly to understand feature gaps and continuously track them
3. Multiple people will be working on this prototype, so it is better to have a checked-in version so we don't always run into merge conflicts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124449
Approved by: https://github.com/angelayi, https://github.com/avikchaudhuri
2024-05-17 20:42:42 +00:00
30b70b1a63 [ROCm] enable faster_load_save for Fused_SGD (#125456)
Reopen due to rebase error. Fixes https://github.com/pytorch/pytorch/issues/117599

The reported hanging test, `test_cuda.py::TestCuda::test_grad_scaling_autocast_fused_optimizers`, passes with this PR

HSA Async copy / host wait on completion signal is resolved in MultiTensorApply.cuh

```
:4:command.cpp              :347 : 8881368803196 us: [pid:1268211 tid:0x7f5af80d7180] Command (InternalMarker) enqueued: 0xc4e2070
:4:rocvirtual.cpp           :556 : 8881368803201 us: [pid:1268211 tid:0x7f5af80d7180] Host wait on completion_signal=0x7f5967df3e00
:3:rocvirtual.hpp           :66  : 8881368803209 us: [pid:1268211 tid:0x7f5af80d7180] Host active wait for Signal = (0x7f5967df3e00) for -1 ns
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125456
Approved by: https://github.com/jeffdaily, https://github.com/eqy, https://github.com/janeyx99
2024-05-17 20:36:47 +00:00
d782e43464 Revert "[FSDP2] Fixed 2D clip grad norm test (#126497)"
This reverts commit 3f289063117673650db868c978bf3cb8125a22dc.

Reverted https://github.com/pytorch/pytorch/pull/126497 on behalf of https://github.com/jeanschmidt due to reverting to check if might have introduced inductor cuda 12 issues ([comment](https://github.com/pytorch/pytorch/pull/126497#issuecomment-2118338716))
2024-05-17 20:29:20 +00:00
95b2766864 [BE][Ez]: Use NotADirectoryError in tensorboard writer (#126534)
Slightly improve exception typing for tensorboard writer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126534
Approved by: https://github.com/ezyang
2024-05-17 19:52:13 +00:00
90a5aeea79 [distributed] Add cpp-httplib to pytorch (#126470)
Adds https://github.com/yhirose/cpp-httplib such that we are able to use https for host to host communication in distributed (specifically torchrun)

Todo: We likely need to add cpp-httplib somewhere in the build (cmake/bazel) but first we should write the code for it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126470
Approved by: https://github.com/d4l3k, https://github.com/Skylion007
2024-05-17 19:45:08 +00:00
eb0b16db92 Initial implementation of AdaRound (#126153)
Summary:
This is an implementation of AdaRound from the paper https://arxiv.org/abs/2004.10568

This algorithm is going to be used by multiple people, hence we need to make it an official implementation.

Differential Revision: D57227565

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126153
Approved by: https://github.com/jerryzh168, https://github.com/huydhn
2024-05-17 19:44:50 +00:00
875221dedf Revert "Fix aarch64 debug build with GCC (#126290)"
This reverts commit 91bf952d10e9524a9b078900d9807efa5d252f5c.

Reverted https://github.com/pytorch/pytorch/pull/126290 on behalf of https://github.com/huydhn due to There seems to be a mis-match closing curly bracket here and it breaks some internal build in D57474505 ([comment](https://github.com/pytorch/pytorch/pull/126290#issuecomment-2118246756))
2024-05-17 19:30:02 +00:00
f89500030b Revert "Remove redundant serialization code (#126249)"
This reverts commit aab448e381366d4cf499145adffe9fcb1ac2b28d.

Reverted https://github.com/pytorch/pytorch/pull/126249 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing sigmoid/frontend:serialization_test internally ([comment](https://github.com/pytorch/pytorch/pull/126249#issuecomment-2118233656))
2024-05-17 19:19:02 +00:00
de42af4b00 Add coms metadata to execution trace (ET) (#126317)
Add Execution Trace communication collective metadata.
For specification see https://github.com/pytorch/pytorch/issues/124674

New fields look like
```
    {
      "id": 80, "name": "record_param_comms", "ctrl_deps": 79,
      "inputs": {"values": [[[78,74,0,100,4,"cuda:0"]],21,["0","default_pg"],0,"allreduce",[],[],0,1,2], "shapes": [[[100]],[],[[],[]],[],[],[],[],[],[],[]], "types": ["GenericList[Tensor(float)]","Int","Tuple[String,String]","Int","String","GenericList[]","GenericList[]","Int","Int","Int"]},                             "outputs": {"values": [[[78,74,0,100,4,"cuda:0"]]], "shapes": [[[100]]], "types": ["GenericList[Tensor(float)]"]},
      "attrs": [{"name": "rf_id", "type": "uint64", "value": 53},{"name": "fw_parent", "type": "uint64", "value": 0},{"name": "seq_id", "type": "int64", "value": -1},{"name": "scope", "type": "uint64", "value": 0},{"name": "tid", "type": "uint64", "value": 2},{"name": "fw_tid", "type": "uint64", "value": 0},{"name": "op_schema", "type": "string", "value": ""},{"name": "kernel_backend", "type": "string", "value": ""},{"name": "kernel_file", "type": "string", "value": ""},
  {"name": "collective_name", "type": "string", "value": "allreduce"},
  {"name": "dtype", "type": "string", "value": "Float"},
  {"name": "in_msg_nelems", "type": "uint64", "value": 100},
  {"name": "out_msg_nelems", "type": "uint64", "value": 100},
  {"name": "in_split_size", "type": "string", "value": "[]"},
  {"name": "out_split_size", "type": "string", "value": "[]"},
  {"name": "global_rank_start", "type": "uint64", "value": 0},
  {"name": "global_rank_stride", "type": "uint64", "value": 1},
  {"name": "pg_name", "type": "string", "value": "0"},
  {"name": "pg_desc", "type": "string", "value": "default_pg"},
  {"name": "pg_size", "type": "uint64", "value": 2}]
 }
```

## Unit Test
Added a new unit test to check that the collected execution trace has the right attributes

`touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_execution_trace`

```
STAGE:2024-05-08 17:39:10 62892:62892 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 17:39:10 62893:62893 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 17:39:11 62892:62892 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 17:39:11 62893:62893 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 17:39:11 62892:62892 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-08 17:39:11 62893:62893 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
[rank1]:[W508 17:39:12.329544411 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model
indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank0]:[W508 17:39:12.329626774 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model
indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank0]:[W508 17:39:12.339239982 execution_trace_observer.cpp:825] Enabling Execution Trace Observer
[rank1]:[W508 17:39:12.339364516 execution_trace_observer.cpp:825] Enabling Execution Trace Observer
STAGE:2024-05-08 17:39:12 62892:62892 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 17:39:12 62893:62893 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
[rank1]:[W508 17:39:12.352452400 execution_trace_observer.cpp:837] Disabling Execution Trace Observer
STAGE:2024-05-08 17:39:12 62893:62893 ActivityProfilerController.cpp:322] Completed Stage: Collection
[rank0]:[W508 17:39:12.354019014 execution_trace_observer.cpp:837] Disabling Execution Trace Observer
STAGE:2024-05-08 17:39:12 62893:62893 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-08 17:39:12 62892:62892 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 17:39:12 62892:62892 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
Execution trace saved at /tmp/tmpy01ngc3w.et.json
Execution trace saved at /tmp/tmptf8543k4.et.json
ok

----------------------------------------------------------------------
```

Also ran the profiler unit test
`touch /tmp/barrier && TEMP_DIR="/tmp" BACKEND="nccl" WORLD_SIZE="2" python test/distributed/test_distributed_spawn.py -v TestDistBackendWithSpawn.test_ddp_profiling_torch_profiler`

```
STAGE:2024-05-08 18:24:22 1926775:1926775 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 18:24:22 1926774:1926774 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
[rank1]:[W508 18:24:24.508622236 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
[rank0]:[W508 18:24:24.508622241 reducer.cpp:1399] Warning: find_unused_parameters=True was specified in DDP constructor, but did not find any unused parameters in the forward pass. This flag results in an extra traversal of the autograd graph every iteration,  which can adversely affect performance. If your model indeed never has any unused parameters in the forward pass, consider turning this flag off. Note that this warning may be a false positive if your model has flow control causing later iterations to have unused parameters. (function operator())
STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:316] Completed Stage: Warm Up
STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:322] Completed Stage: Collection
STAGE:2024-05-08 18:24:24 1926774:1926774 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
STAGE:2024-05-08 18:24:24 1926775:1926775 ActivityProfilerController.cpp:326] Completed Stage: Post Processing
Trace saved to /tmp/tmpdrw_cmcu.json
Trace saved to /tmp/tmpnio7ec9j.json
ok

----------------------------------------------------------------------
Ran 1 test in 19.772s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126317
Approved by: https://github.com/yoyoyocmu, https://github.com/sanrise
2024-05-17 19:08:55 +00:00
6931f781c2 [quant][pt2e] Allow multi users without output observers (#126487)
Summary: The PT2E quantization flow does not support unquantized
outputs yet. To work around this, users may wish to remove the
output observer from their graphs. However, this fails currently
in some cases because the `PortNodeMetaForQDQ` pass is too
restrictive, for example:

```
conv -> obs -------> output0
         \\-> add -> output1
```

Previously we expected conv to always have exactly 1 user,
which is the observer. When the observer is removed, however,
conv now has 2 users, and this fails the check.

```
conv -------> output0
  \\-> add -> output1
```

This commit relaxes the error into a warning to enable
this workaround.

Test Plan:
python test/test_quantization.py TestQuantizePT2E.test_multi_users_without_output_observer

Reviewers: jerryzh168

Subscribers: jerryzh168, supriyar

Differential Revision: [D57472601](https://our.internmc.facebook.com/intern/diff/D57472601)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126487
Approved by: https://github.com/tarun292
2024-05-17 18:48:21 +00:00
ecd9a4e5c3 Enable FX graph cache for huggingface and timm benchmarks (#126205)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126205
Approved by: https://github.com/eellison
2024-05-17 18:36:05 +00:00
66dc8fb7ff Allow tensor subclasses and add torch.serialization.add_safe_globals that allows users to allowlist classes for weights_only load (#124331)
#### Conditions for allowlisting tensor subclasses
We allow tensor subclass types that
(1) Do not override `__setstate__`, `__getattr__`, `__setattr__`, `__get__`, `__set__` or `__getattribute__` of `torch.Tensor` (`torch.Tensor` does not have a definition of `__getattr__`, `__get__` or `__set__` so we check that these are `None`)
(2) Use the generic `tp_alloc`
(3) Are in a module that *has been imported by the user*
to be pushed onto the stack as strings by `GLOBAL` instructions, while storing the type in a dict

The strings will be converted to the classes as appropriate when executing `REBUILD` with `_rebuild_from_type_v2`

*Note that we use `inspect.getattr_static(sys.modules[module], name)` to get the class/function as this method claims to have no code execution.

The rationale for the 3 conditions above is as follows:

The rebuild func provided by `Tensor.__reduce_ex__` is `torch._tensor._rebuild_from_type_v2`, which is defined as such (note the call to `getattr`, `Tensor.__setstate__` and the call to `as_subclass` as well as the call to `_set_obj_state` which calls `setattr`)

4e66aaa010/torch/_tensor.py (L57-L71)

`as_subclass` is implemented with a call to `THPVariable_NewWithVar`

that will eventually call `tp_alloc` here
4e66aaa010/torch/csrc/autograd/python_variable.cpp (L2053)

The `func` arg to `_rebuild_from_type_v2` for wrapper subclasses is `Tensor.rebuild_wrapper_subclass`, which will similarly call into `THPVariable_NewWithVar` and hit the above `tp_alloc`

**Note that we do not call `tp_init` or `tp_new` (i.e. `cls.__init__` or `cls.__new__`) when unpickling**

### How do we check something is a tensor subclass/constraints around imports

In order to check whether the global named by the bytecode `GLOBAL module.name` is a tensor subclass, we need to do an `issubclass` check, which entails converting the global string to the appropriate type. We *do not* arbitrarily import modules but will perform this check as long as the given subclass (given by `module.name`) has already been imported by the user (i.e. `module in sys.modules` and `issubclass(getattr(sys.modules[module], name), torch.Tensor)`).
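
The check described above can be sketched with the standard library alone. This is an illustration under stated assumptions, not the unpickler's code: `BaseTensor` is a hypothetical stand-in for `torch.Tensor`, `resolve_if_imported` is an invented helper name, and the demo module is registered by hand to play the role of a module the user has already imported.

```python
import inspect
import sys
import types


class BaseTensor:
    """Hypothetical stand-in for torch.Tensor."""


class MySubclass(BaseTensor):
    pass


def resolve_if_imported(module, name, base):
    """Return the class for `module.name` only if `module` is already imported
    and the attribute is a subclass of `base`; never import anything here."""
    if module not in sys.modules:
        return None
    try:
        # getattr_static looks the attribute up without running descriptors
        # or a module-level __getattr__.
        obj = inspect.getattr_static(sys.modules[module], name)
    except AttributeError:
        return None
    if isinstance(obj, type) and issubclass(obj, base):
        return obj
    return None


# Register a demo module, as if the user had imported it earlier.
demo = types.ModuleType("demo_subclasses")
demo.MySubclass = MySubclass
demo.NotATensor = dict
sys.modules["demo_subclasses"] = demo

resolved = resolve_if_imported("demo_subclasses", "MySubclass", BaseTensor)
rejected = resolve_if_imported("demo_subclasses", "NotATensor", BaseTensor)
missing = resolve_if_imported("module_never_imported_xyz", "Anything", BaseTensor)
```

Refusing to import on demand is the point of the design: a checkpoint cannot cause arbitrary module code to run just by naming a module in a `GLOBAL` instruction.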

This PR also allowlisted  `torch._utils._rebuild_wrapper_subclass` and `torch.device` (used by `_rebuild_wrapper_subclass`)

### API for allow listing
This PR also adds `torch.serialization.{add/get/clear}_safe_globals`, which enable users to allowlist globals they have deemed safe and to manipulate this list (for example, they could allowlist a tensor subclass with a custom `__setstate__` if they have checked that this is safe).

Next steps:
- Add testing and allowlist required classes for all in-core tensor subclasses (e.g. `DTensor`, `FakeTensor` etc.)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124331
Approved by: https://github.com/albanD
2024-05-17 17:56:57 +00:00
31ea8290e7 Workflow for uploading additional test stats on workflow dispatch (#126080)
This is kind of an experiment for uploading test stats during the run, and also for test dashboard stuff so it can recalculate the info

Add workflow that is callable via workflow dispatch for uploading additional test stats
Adds script that only calculates the additional info

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126080
Approved by: https://github.com/ZainRizvi
2024-05-17 17:29:44 +00:00
6bcf15669e [inductor] fix unbacked case in pointwise + reduction vertical fusion (#125982)
```
$ INDUCTOR_TEST_DISABLE_FRESH_CACHE=1 python test/inductor/test_unbacked_symints.py -k test_vertical_pointwise_reduction_fusion

  File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 1953, in fuse_nodes_once
    for node1, node2 in self.get_possible_fusions():
  File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 2010, in get_possible_fusions
    check_all_pairs(node_grouping)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 1997, in check_all_pairs
    if self.can_fuse(node1, node2):
  File "/data/users/colinpeppler/pytorch/torch/_inductor/scheduler.py", line 2252, in can_fuse
    return self.get_backend(device).can_fuse_vertical(node1, node2)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/cuda_combined_scheduling.py", line 39, in can_fuse_vertical
    return self._triton_scheduling.can_fuse_vertical(node1, node2)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3237, in can_fuse
    if not all(
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 3238, in <genexpr>
    TritonKernel.is_compatible((numel2, rnumel2), n.get_ranges())
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 1543, in is_compatible
    cls._split_iteration_ranges(groups, lengths)
  File "/data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py", line 1507, in _split_iteration_ranges
    while current_group < len(remaining) and sv.size_hint(remaining[current_group]) == 1:
  File "/data/users/colinpeppler/pytorch/torch/_inductor/sizevars.py", line 442, in size_hint
    return int(out)
  File "/home/colinpeppler/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/core/expr.py", line 320, in __int__
    raise TypeError("Cannot convert symbols to int")
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
TypeError: Cannot convert symbols to int
```

Where the unbacked symints show up:
```
> /data/users/colinpeppler/pytorch/torch/_inductor/codegen/triton.py(1506)_split_iteration_ranges()
(Pdb) print(groups)
(1, 512*u0)
(Pdb) print(lengths)
([u0, 32, 16], [])
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125982
Approved by: https://github.com/jansel
2024-05-17 17:06:24 +00:00
7e9a037b47 [Perf] Vectorize more dtype for int4mm (#126512)
It used to be vectorized only for f16, but no reason not to do the same for bf16 or f32

Spiritual followup of https://github.com/pytorch/pytorch/pull/125290

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126512
Approved by: https://github.com/Skylion007
2024-05-17 16:34:19 +00:00
81277baa0c Remove removed ruff rule TRY200 (#126256)
My TOML linter is complaining that "TRY200" is not acceptable for the `tool.ruff.lint` schema.

From the ruff docs: https://docs.astral.sh/ruff/rules/reraise-no-cause/

> This rule has been removed and its documentation is only available for historical reasons.
>
> This rule is identical to [B904](https://docs.astral.sh/ruff/rules/raise-without-from-inside-except/) which should be used instead.

and we are currently explicitly ignoring B904.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126256
Approved by: https://github.com/Skylion007
2024-05-17 16:31:05 +00:00
402170b22f Early return in _recursive_build if obj is a Tensor (#125639)
Fix issue #125551

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125639
Approved by: https://github.com/ezyang
2024-05-17 15:53:37 +00:00
7e166e8057 [optim] Fix: wrong ASGD implementation (#126375)
This PR is based on #125440, additionally merging the latest main branch and fixing the lint failures from #126361.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126375
Approved by: https://github.com/janeyx99
2024-05-17 15:46:39 +00:00
078e530446 Delete refactored function, move changes over (#126407)
Oops, in https://github.com/pytorch/pytorch/pull/125610 I moved this function to runtime_wrappers.py, but forgot to delete the old one. https://github.com/pytorch/pytorch/pull/126234 then modified it which would do nothing, so I'm applying the change correctly now and deleting the function as I intended.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126407
Approved by: https://github.com/eellison
2024-05-17 15:28:18 +00:00
ab307a8992 Default to env variable instead of config value for precompile parallelism (#126333)
Previously, we would default to the config `compile_threads`. That controls the number of forks we use for async compile. It defaults to 1 in fbcode because fork() has known issues with safety. In precompilation, we are using threads, which have no safety issues and should strictly improve compile time. There isn't really any reason to reduce the thread count except for testing, and it doesn't make sense to share the same value as the one that determines forks.

This change makes precompilation default to as many threads as needed unless the env variable is set.
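
A rough sketch of that defaulting logic, under stated assumptions: the env variable name `HYPOTHETICAL_PRECOMPILE_THREADS` and the helper `precompile_threads` are made up for illustration, and capping at the CPU count is my own choice, not something this PR states.

```python
import os


def precompile_threads(num_choices: int,
                       env_var: str = "HYPOTHETICAL_PRECOMPILE_THREADS") -> int:
    # An explicit env override wins (useful for tests that want fewer threads).
    override = os.environ.get(env_var)
    if override is not None:
        return max(0, int(override))
    # Otherwise use as many threads as there is work, independent of the
    # fork-related `compile_threads` setting. The CPU cap is an assumption.
    cpu = os.cpu_count() or 1
    return max(1, min(num_choices, cpu))
```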

Differential Revision: [D57473023](https://our.internmc.facebook.com/intern/diff/D57473023)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126333
Approved by: https://github.com/nmacchioni
2024-05-17 14:58:55 +00:00
3f28906311 [FSDP2] Fixed 2D clip grad norm test (#126497)
This fixes https://github.com/pytorch/pytorch/issues/126484.

We change from transformer to MLP stack since transformer seems to introduce slight numeric differences when using TP. We include a sequence parallel layer norm module in the MLP stack to exercise `(S(0), R)` placement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126497
Approved by: https://github.com/weifengpy, https://github.com/wz337
2024-05-17 13:38:31 +00:00
55033ab43a Update ops handler documentation some more (#126480)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126480
Approved by: https://github.com/peterbell10
ghstack dependencies: #126292, #126299
2024-05-17 13:31:44 +00:00
cyy
4ed93d6e0c [Submodule] Remove zstd dependency (#126485)
After searching in the codebase, it seems that zstd is not in use now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126485
Approved by: https://github.com/ezyang
2024-05-17 12:49:23 +00:00
6c503f1dbb save the reciprocal of weights for welford_reduce (#125148)
Save the reciprocals of the weights for welford_reduce to avoid redundant divisions and improve performance; `weight_recps` is inserted into the generated vec kernel.

Generated code:

- Before:

```
for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
{
    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024L*x0)), 16);
    tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0);
}
```

- After:

```
static WeightRecp<at::vec::Vectorized<float>> weight_recps(64);
for(long x1=static_cast<long>(0L); x1<static_cast<long>(1024L); x1+=static_cast<long>(16L))
{
    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x1 + (1024L*x0)), 16);
    tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &weight_recps);
}
```
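
The idea can be sketched in scalar Python (an illustration only, not the vectorized kernel): a Welford update divides by the running count on every step, so the reciprocals `1/1, 1/2, ...` can be computed once and reused as multiplications. `welford_with_recips` is a hypothetical name.

```python
def welford_with_recips(values):
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")
    # recps[k] == 1 / (k + 1); computed once, reusable across reductions
    # of the same length.
    recps = [1.0 / k for k in range(1, n + 1)]
    mean, m2 = 0.0, 0.0
    for count, x in enumerate(values):
        delta = x - mean
        mean += delta * recps[count]  # multiply instead of dividing by count + 1
        m2 += delta * (x - mean)
    return mean, m2 / n  # mean and population variance
```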

Performance:

- Single core:

Op | shape | eager/ms | inductor/ms | optimized inductor/ms
-- | -- | -- | -- | --
layernorm | (56, 384, 1024) | 16.825 | 22.338 | 15.208
var | (56, 384, 1024) | 21.752 | 13.258 | 13.102

- 4 cores:

Op | shape | eager/ms | inductor/ms | optimized inductor/ms
-- | -- | -- | -- | --
layernorm | (56, 384, 1024) | 4.249 | 5.899 | 4.223
var | (56, 384, 1024) | 5.3152 | 3.278 | 2.163

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125148
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-05-17 08:20:12 +00:00
8619fe6214 variable search spaces for gemm autotuning (#126220)
Add a switch to change the gemm autotuning search space between the default (the current set of hardcoded configs) and an exhaustive search space that enumerates all block sizes in [16, 32, 64, 128, 256], stages in [1, 2, 3, 4, 5], and warps in [2, 4, 6].
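
To give a feel for the size of the exhaustive space, here is a sketch that enumerates it. It assumes the block size is chosen independently for three tile dimensions (M, N, K), which the description above does not spell out, and `exhaustive_configs` is a hypothetical name.

```python
from itertools import product

BLOCK_SIZES = [16, 32, 64, 128, 256]
NUM_STAGES = [1, 2, 3, 4, 5]
NUM_WARPS = [2, 4, 6]


def exhaustive_configs():
    # One tuple per candidate: (block_m, block_n, block_k, num_stages, num_warps).
    return list(product(BLOCK_SIZES, BLOCK_SIZES, BLOCK_SIZES,
                        NUM_STAGES, NUM_WARPS))
```

Under that assumption the space has 5**3 * 5 * 3 = 1875 candidates, which is why it sits behind a switch rather than being the default.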

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126220
Approved by: https://github.com/eellison
2024-05-17 08:09:53 +00:00
45f2d09452 [Quant][Inductor] Enable lowering of qlinear-binary(-unary) fusion for X86Inductor (#122593)
**Description**
Lower the qlinear binary post op pattern to Inductor. Use post op sum (in-place) if the extra input has the same dtype as output. Otherwise, it uses binary add.
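
The selection rule in the description can be written as a tiny sketch. `choose_binary_post_op` is a hypothetical helper and dtypes are plain strings here; the real lowering works on graph patterns.

```python
def choose_binary_post_op(extra_input_dtype: str, output_dtype: str) -> str:
    # "sum" accumulates in place into the extra input's buffer, so it only
    # applies when that buffer already has the output dtype.
    return "sum" if extra_input_dtype == output_dtype else "add"
```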

**Supported linear-binary(-unary) patterns**
```
    linear(X)   extra input
           \   /
            Add
             |
        Optional(relu)
             |
             Y

1. int8-mixed-fp32
+---+---------------+-----------+------------------------------+---------+
| # | Add type      | Quant out | Pattern                      | Post op |
+---+---------------+-----------+------------------------------+---------+
| 1 | In-/out-place | Yes       | linear + fp32 -> (relu) -> q | add     |
+---+---------------+-----------+------------------------------+---------+
| 2 | In-/out-place | No        | linear + fp32 -> (relu)      | sum     |
+---+---------------+-----------+------------------------------+---------+

2. int8-mixed-bf16
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| # | X2 dtype | Add type      | Quant out | Pattern                                          | Post op |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 1 | BF16     | In-/out-place | Yes       | linear + bf16 -> (relu) -> to_fp32 -> q          | add     |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 2 | BF16     | In-/out-place | No        | linear + bf16 -> (relu)                          | sum     |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 3 | FP32     | Out-place     | Yes       | linear + fp32 -> (relu) -> q                     | add     |
|   |          | In-place right|           |                                                  |         |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 4 | FP32     | Out-place     | No        | linear + fp32 -> (relu)                          | sum     |
|   |          | In-place right|           |                                                  |         |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 5 | FP32     | In-place left | Yes       | linear + fp32 -> to_bf16 -> relu -> to_fp32 -> q | add     |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
| 6 | FP32     | In-place left | No        | linear + fp32 -> to_bf16 -> (relu)               | add     |
+---+----------+---------------+-----------+--------------------------------------------------+---------+
```
Note
(1) The positions of linear and the extra input can be swapped.
(2) we don't insert q-dq before the extra input of linear-add by recipe. But if q-dq is found at the
extra input, we don't match that pattern because we cannot match all these patterns in 3 passes.

**Test plan**
python test/inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_add
python test/inductor/test_cpu_cpp_wrapper.py -k test_qlinear_add

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122593
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/eellison
2024-05-17 07:46:48 +00:00
2edaae436a Fix cummax and cummin lowering for empty case (#126461)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126461
Approved by: https://github.com/peterbell10
2024-05-17 07:08:32 +00:00
15ca562f86 [DTensor] Turn on foreach implementation for clip_grad_norm_ for DTensor by default (#126423)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126423
Approved by: https://github.com/awgu
2024-05-17 06:57:52 +00:00
f9a7033194 Refactor partitioner and clean it up (#126318)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126318
Approved by: https://github.com/anijain2305
2024-05-17 06:15:00 +00:00
5756b53dd8 [XPU] call empty_cache for dynamo tests (#126377)
When running a batch of models, lacking `empty_cache()` would result in OOM for subsequent models.

This PR unifies the `empty_cache` call for both CUDA and XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126377
Approved by: https://github.com/EikanWang, https://github.com/guangyey, https://github.com/desertfire
2024-05-17 06:05:51 +00:00
9edf54df4d [dtensor] refactor view ops to use OpStrategy (#126011)
As titled. Some ops require adjustment of output shape argument. In rule-based sharding prop, global output shape was inferred in the rule (in `view_ops.py`). In strategy-based sharding prop, it is now obtained from propagated out_tensor_meta (in `sharding_prop.py`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126011
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
2024-05-17 05:39:21 +00:00
a0df40f195 Add dist_pp shortcut to TORCH_LOGS (#126322)
The distributed log category already includes pipelining since it's under the
torch.distributed umbrella.

So both TORCH_LOGS=distributed and TORCH_LOGS=dist_pp will enable PP
logs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126322
Approved by: https://github.com/kwen2501
2024-05-17 05:32:15 +00:00
4b2ae2ac33 c10d: add Collectives abstraction (#125978)
This adds a new `Collectives` API for doing distributed collective operations. This is intended to replace the [current Elastic store abstraction](https://github.com/pytorch/pytorch/blob/main/torch/distributed/elastic/utils/store.py) with more performant and debuggable primitives.

Design doc: https://docs.google.com/document/d/147KcKJXEHvk1Q6tISLbJVvLejHg_1kIhBQeu-8RQxhY/edit

The standard implementation is using `StoreCollectives` but other more performant backends will be added in a follow up PR.

Test plan:

```
python test/distributed/test_collectives.py -v
```

This tests both functionality using multiple threads as well as timeout behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125978
Approved by: https://github.com/shuqiangzhang
2024-05-17 05:09:11 +00:00
a8c41e0678 dont pad 0 dim mm inputs (#126475)
Otherwise you get an error in constant_pad_nd.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126475
Approved by: https://github.com/huydhn
ghstack dependencies: #125772, #125773, #125780
2024-05-17 05:03:27 +00:00
88582195fd [FSDP2][Test] Fix _test_clip_grad_norm (#126457)
We need to compare ref_total_norm to total_norm.full_tensor().
Example:
```
iter_idx:0, rank:0,\
ref_total_norm=tensor(1052.5934, device='cuda:0'),\
total_norm=DTensor(local_tensor=482.0861511230469, device_mesh=DeviceMesh([0, 1]), placements=(_NormPartial(reduce_op='sum', norm_type=2.0),)),\
total_norm.full_tensor()=tensor(1052.5934, device='cuda:0')
```
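
Why the local value differs from the full one: with a partial 2-norm placement, each rank only holds the norm of its own shard, and the full norm is the p-th root of the summed p-th powers. A sketch with made-up numbers (not the values from the log above); `combine_local_norms` is a hypothetical helper, not a DTensor API.

```python
def combine_local_norms(local_norms, norm_type=2.0):
    # Sum the p-th powers across ranks, then take the p-th root.
    return sum(n ** norm_type for n in local_norms) ** (1.0 / norm_type)


full = combine_local_norms([3.0, 4.0])  # two ranks with local norms 3 and 4
```

Here `full` is 5.0, while either local norm on its own would fail a comparison against the reference.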

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126457
Approved by: https://github.com/awgu
2024-05-17 04:29:21 +00:00
1a27e24ff5 Make inductor scheduler graph extension configurable (#125578)
This patch makes the inductor scheduler graph extension configurable.
It enables ease of debugging by changing the graph format (dot, png, etc.).

Particularly, it's very convenient to work with the graph interactively using tools like https://github.com/tintinweb/vscode-interactive-graphviz

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125578
Approved by: https://github.com/Chillee
2024-05-17 04:19:23 +00:00
da1fc85d60 Add symbolic_shape_specialization structured trace (#126450)
This is typically the information you want when diagnosing why something
overspecialized in dynamic shapes.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126450
Approved by: https://github.com/albanD
2024-05-17 02:01:21 +00:00
d2f5a8ac99 [doc] expose torch.Tensor.xpu API to doc (#126383)
# Motivation
The docstring for `torch.Tensor.xpu` has been added [here](d61a81a9e7/torch/_tensor_docs.py (L1434)) but is not exposed in the public docs, unlike [torch.Tensor.cuda](https://pytorch.org/docs/stable/generated/torch.Tensor.cuda.html#torch.Tensor.cuda). This PR exposes the documentation of `torch.Tensor.xpu` in the public docs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126383
Approved by: https://github.com/albanD
2024-05-17 01:19:03 +00:00
776b878917 [easy] Fix typing for map_location docs in torch.load (#125473)
Currently the docs incorrectly list `Callable[[Tensor, str], Tensor]` as a possible type signature; it should be `Callable[[Storage, str], Storage]`.

<img width="716" alt="Screenshot 2024-05-03 at 12 09 54 PM" src="https://github.com/pytorch/pytorch/assets/35276741/b8946f95-8297-445f-a9d9-570b8a3caab1">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125473
Approved by: https://github.com/albanD
2024-05-17 01:15:25 +00:00
697ed6f5b3 [DeviceMesh] Supported N groups in from_group (#126258)
**Overview**
This PR supports constructing an ND mesh with `from_group()` by passing in `group: List[ProcessGroup]` and `mesh: Union[torch.Tensor, "ArrayLike"]` together. The `ndim` of the device mesh returned from `from_group()` is equal to the number of `ProcessGroup`s passed. If the `ndim` is greater than 1, then the `mesh` argument is required (since there is no simple way to recover the `mesh` tensor from the process groups otherwise).

This PR also adds `mesh_dim_names` as an argument to forward to the device mesh for convenience.
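
To illustrate why the `mesh` argument is needed for `ndim > 1`, here is a plain-Python sketch with no process groups involved. `groups_for_rank` is a hypothetical helper that lists, for one rank of a 2D mesh, the ranks it shares each mesh dimension with.

```python
def groups_for_rank(mesh, rank):
    """mesh is a list of rows of ranks; returns (dim0_ranks, dim1_ranks)."""
    for row in mesh:
        if rank in row:
            col = row.index(rank)
            dim0 = [r[col] for r in mesh]  # vary dim 0, fixed column
            dim1 = list(row)               # vary dim 1, fixed row
            return dim0, dim1
    raise ValueError(f"rank {rank} is not in the mesh")


mesh = [[0, 1, 2, 3], [4, 5, 6, 7]]  # e.g. (replicate, shard) = (2, 4)
dim0, dim1 = groups_for_rank(mesh, 5)
```

Rank 5 only sees its own two groups (`[1, 5]` and `[4, 5, 6, 7]`), one per mesh dimension, which is not enough to rebuild the whole rank array without being told the mesh.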

<details>
<summary> Old Approach </summary>

**Overview**
- This PR mainly adds `mesh_shape` to `from_group()` so that the user can construct an ND (N > 1) device mesh from a process group. This is to unblock HSDP, where we can pass the overall data parallel process group to `from_group()` with `mesh_shape = (replicate_dim_size, shard_dim_size)` and `from_group()` will construct subgroups for the user. (The user can then get the subgroups from the submeshes.)
    - Constructing the 2D `DeviceMesh` from an existing shard process group and replicate process group is hard because we cannot easily recover the array of ranks in their parent group on each rank in general.
- This PR also adds `mesh_dim_names` to `from_group()` so that the user can name the mesh dimensions of the constructed device mesh.

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126258
Approved by: https://github.com/wanchaol
2024-05-17 01:03:21 +00:00
1018a68e31 [export] Delete predispatch tests (#126459)
Deleting predispatch tests as we moved export to predispatch already
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126459
Approved by: https://github.com/tugsbayasgalan
2024-05-17 00:48:32 +00:00
8bb7a2f46d Fix documentation for register_fake_class (#126422)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126422
Approved by: https://github.com/angelayi
2024-05-17 00:45:21 +00:00
762ce6f062 Add Lowering for FlexAttention Backwards (#125515)
# Summary
#### What does this PR do?
It enables Inductor to actually generate the fused flex attention kernel for the backwards pass.

I did some other things along the way:
- Abstract out the 'build_subgraph_buffer' subroutine and make it reusable between flex attention and flex attention backwards. In total we need to build 3 subgraphs for fwd + bwd: 1 for the fwd graph and 2 in the bwd. The FAv2 algorithm recomputes parts of the forward (more efficiently, since we already have the row_max via logsumexp), so we need to inline both the fwd graph and the joint graph in the bwd kernel.
- The backwards kernel is from a somewhat older version of the Triton tutorial implementation. I think we should update to a newer version in a follow-up. Notably, the blocks need to be square for this to work as currently implemented. I am sure there are many opportunities for optimization.
- I didn't correctly register the decomp table + IndexMode when I landed https://github.com/pytorch/pytorch/pull/123902; this remedies that.
- The rel_bias helper func was reversed in terms of causality. I updated it and added a test specifically for "future causal" attention.
- The main point in this PR that I think still needs to be worked out is the store_output call. I have it hacked up to be 'fake', but I don't think we want to land that; we likely want to just have a mutated 'dq' and a stored_output 'dk'.
- I also needed to update the `TritonTemplateKernel` to actually accept multiple subgraphs (modifications)
- I updated the benchmark to also profile bwds performance
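
The recomputation mentioned in the first bullet can be sketched for a single row of scores in plain Python (not the Triton kernel; function names are hypothetical). The forward saves `logsumexp`, and the backward rebuilds the softmax probabilities from it without a second max/sum reduction.

```python
import math


def forward_row(scores):
    # Numerically stable softmax that also returns logsumexp for the backward.
    row_max = max(scores)
    exps = [math.exp(s - row_max) for s in scores]
    total = sum(exps)
    probs = [e / total for e in exps]
    lse = row_max + math.log(total)
    return probs, lse


def recompute_probs(scores, lse):
    # Backward-side recomputation: exp(s - lse) equals softmax(s), no reduction.
    return [math.exp(s - lse) for s in scores]
```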

### Benchmark Numbers:
_The current implementation is not parallelizing over ctx length in the bwd_
FWD Speedups

| Type    |   Speedup | shape              | score_mod   | dtype          |
|---------|-----------|--------------------|-------------|----------------|
| Average |     0.991 |                    |             |                |
| Max     |     1.182 | (16, 16, 4096, 64) | noop        | torch.bfloat16 |
| Min     |     0.796 | (2, 16, 512, 256)  | head_bias   | torch.bfloat16 |

BWD Speedups

| Type    |   Speedup | shape              | score_mod   | dtype          |
|---------|-----------|--------------------|-------------|----------------|
| Average |     0.291 |                    |             |                |
| Max     |     0.652 | (8, 16, 512, 64)   | head_bias   | torch.bfloat16 |
| Min     |     0.073 | (2, 16, 4096, 128) | head_bias   | torch.bfloat16 |
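
For reading these tables: the speedup columns are consistent with eager time divided by compiled time, so values below 1 mean the compiled kernel is slower. A quick check against the `(2, 16, 512, 64)` / `noop` row of the full data (the `speedup` helper is only for illustration):

```python
def speedup(eager_time: float, compiled_time: float) -> float:
    return eager_time / compiled_time


fwd = round(speedup(19.936, 19.092), 3)    # reported fwd_speedup: 1.044
bwd = round(speedup(57.851, 193.564), 3)   # reported bwd_speedup: 0.299
```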

<details>

<summary>Full Data</summary>

| shape               | score_mod     | dtype          |   fwd_eager_time |   fwd_compiled_time |   bwd_eager_time |   bwd_compiled_time |   fwd_speedup |   bwd_speedup |
|---------------------|---------------|----------------|------------------|---------------------|------------------|---------------------|---------------|---------------|
| (2, 16, 512, 64)    | noop          | torch.bfloat16 |           19.936 |              19.092 |           57.851 |             193.564 |         1.044 |         0.299 |
| (2, 16, 512, 64)    | causal_mask   | torch.bfloat16 |           19.955 |              19.497 |           57.662 |             206.278 |         1.024 |         0.280 |
| (2, 16, 512, 64)    | relative_bias | torch.bfloat16 |           19.455 |              21.297 |           57.674 |             195.219 |         0.913 |         0.295 |
| (2, 16, 512, 64)    | head_bias     | torch.bfloat16 |           19.958 |              21.289 |           57.674 |             193.859 |         0.938 |         0.298 |
| (2, 16, 512, 128)   | noop          | torch.bfloat16 |           28.157 |              28.615 |           82.831 |             454.211 |         0.984 |         0.182 |
| (2, 16, 512, 128)   | causal_mask   | torch.bfloat16 |           28.154 |              28.444 |           83.091 |             432.083 |         0.990 |         0.192 |
| (2, 16, 512, 128)   | relative_bias | torch.bfloat16 |           28.722 |              27.897 |           83.175 |             446.789 |         1.030 |         0.186 |
| (2, 16, 512, 128)   | head_bias     | torch.bfloat16 |           28.299 |              27.673 |           83.052 |             459.179 |         1.023 |         0.181 |
| (2, 16, 512, 256)   | noop          | torch.bfloat16 |           41.167 |              50.504 |          175.019 |            1083.545 |         0.815 |         0.162 |
| (2, 16, 512, 256)   | causal_mask   | torch.bfloat16 |           41.656 |              51.933 |          175.078 |            1171.176 |         0.802 |         0.149 |
| (2, 16, 512, 256)   | relative_bias | torch.bfloat16 |           41.697 |              50.722 |          175.159 |            1097.312 |         0.822 |         0.160 |
| (2, 16, 512, 256)   | head_bias     | torch.bfloat16 |           41.690 |              52.387 |          175.184 |            1097.336 |         0.796 |         0.160 |
| (2, 16, 1024, 64)   | noop          | torch.bfloat16 |           39.232 |              37.454 |          127.847 |             612.430 |         1.047 |         0.209 |
| (2, 16, 1024, 64)   | causal_mask   | torch.bfloat16 |           39.930 |              39.599 |          127.755 |             665.359 |         1.008 |         0.192 |
| (2, 16, 1024, 64)   | relative_bias | torch.bfloat16 |           39.417 |              41.304 |          127.902 |             614.990 |         0.954 |         0.208 |
| (2, 16, 1024, 64)   | head_bias     | torch.bfloat16 |           39.965 |              42.034 |          127.953 |             613.273 |         0.951 |         0.209 |
| (2, 16, 1024, 128)  | noop          | torch.bfloat16 |           63.964 |              71.024 |          226.510 |            1637.669 |         0.901 |         0.138 |
| (2, 16, 1024, 128)  | causal_mask   | torch.bfloat16 |           63.843 |              72.451 |          226.750 |            1558.949 |         0.881 |         0.145 |
| (2, 16, 1024, 128)  | relative_bias | torch.bfloat16 |           64.301 |              70.487 |          226.651 |            1610.063 |         0.912 |         0.141 |
| (2, 16, 1024, 128)  | head_bias     | torch.bfloat16 |           64.033 |              71.394 |          226.676 |            1668.511 |         0.897 |         0.136 |
| (2, 16, 1024, 256)  | noop          | torch.bfloat16 |          129.348 |             141.390 |          507.337 |            4405.175 |         0.915 |         0.115 |
| (2, 16, 1024, 256)  | causal_mask   | torch.bfloat16 |          129.538 |             145.680 |          507.178 |            4768.874 |         0.889 |         0.106 |
| (2, 16, 1024, 256)  | relative_bias | torch.bfloat16 |          129.438 |             142.782 |          507.004 |            4401.002 |         0.907 |         0.115 |
| (2, 16, 1024, 256)  | head_bias     | torch.bfloat16 |          129.058 |             146.242 |          507.547 |            4434.251 |         0.883 |         0.114 |
| (2, 16, 4096, 64)   | noop          | torch.bfloat16 |          481.606 |             409.120 |         1440.890 |           14147.269 |         1.177 |         0.102 |
| (2, 16, 4096, 64)   | causal_mask   | torch.bfloat16 |          480.227 |             438.847 |         1434.419 |           14973.386 |         1.094 |         0.096 |
| (2, 16, 4096, 64)   | relative_bias | torch.bfloat16 |          480.831 |             458.104 |         1432.935 |           14193.253 |         1.050 |         0.101 |
| (2, 16, 4096, 64)   | head_bias     | torch.bfloat16 |          480.749 |             452.497 |         1437.040 |           14084.869 |         1.062 |         0.102 |
| (2, 16, 4096, 128)  | noop          | torch.bfloat16 |          872.534 |             848.275 |         2600.895 |           35156.849 |         1.029 |         0.074 |
| (2, 16, 4096, 128)  | causal_mask   | torch.bfloat16 |          872.647 |             868.279 |         2587.581 |           31919.531 |         1.005 |         0.081 |
| (2, 16, 4096, 128)  | relative_bias | torch.bfloat16 |          871.484 |             827.644 |         2593.989 |           34805.634 |         1.053 |         0.075 |
| (2, 16, 4096, 128)  | head_bias     | torch.bfloat16 |          871.422 |             856.437 |         2602.482 |           35708.591 |         1.017 |         0.073 |
| (2, 16, 4096, 256)  | noop          | torch.bfloat16 |         1904.497 |            1758.183 |         6122.416 |           66754.593 |         1.083 |         0.092 |
| (2, 16, 4096, 256)  | causal_mask   | torch.bfloat16 |         1911.174 |            1762.821 |         6113.207 |           72759.392 |         1.084 |         0.084 |
| (2, 16, 4096, 256)  | relative_bias | torch.bfloat16 |         1911.254 |            1727.108 |         6123.530 |           66577.988 |         1.107 |         0.092 |
| (2, 16, 4096, 256)  | head_bias     | torch.bfloat16 |         1916.977 |            1801.804 |         6118.158 |           67359.680 |         1.064 |         0.091 |
| (8, 16, 512, 64)    | noop          | torch.bfloat16 |           44.984 |              43.974 |          170.276 |             262.259 |         1.023 |         0.649 |
| (8, 16, 512, 64)    | causal_mask   | torch.bfloat16 |           45.001 |              46.265 |          170.509 |             274.893 |         0.973 |         0.620 |
| (8, 16, 512, 64)    | relative_bias | torch.bfloat16 |           45.466 |              48.211 |          170.606 |             262.759 |         0.943 |         0.649 |
| (8, 16, 512, 64)    | head_bias     | torch.bfloat16 |           45.481 |              48.435 |          170.267 |             261.265 |         0.939 |         0.652 |
| (8, 16, 512, 128)   | noop          | torch.bfloat16 |           72.565 |              74.736 |          313.220 |             773.126 |         0.971 |         0.405 |
| (8, 16, 512, 128)   | causal_mask   | torch.bfloat16 |           72.015 |              75.755 |          313.311 |             775.513 |         0.951 |         0.404 |
| (8, 16, 512, 128)   | relative_bias | torch.bfloat16 |           72.105 |              74.189 |          313.806 |             769.238 |         0.972 |         0.408 |
| (8, 16, 512, 128)   | head_bias     | torch.bfloat16 |           72.005 |              74.364 |          313.509 |             775.237 |         0.968 |         0.404 |
| (8, 16, 512, 256)   | noop          | torch.bfloat16 |          138.656 |             165.453 |          663.707 |            2672.067 |         0.838 |         0.248 |
| (8, 16, 512, 256)   | causal_mask   | torch.bfloat16 |          139.096 |             172.613 |          663.593 |            2926.538 |         0.806 |         0.227 |
| (8, 16, 512, 256)   | relative_bias | torch.bfloat16 |          139.500 |             168.417 |          663.938 |            2658.629 |         0.828 |         0.250 |
| (8, 16, 512, 256)   | head_bias     | torch.bfloat16 |          139.776 |             173.549 |          662.920 |            2667.266 |         0.805 |         0.249 |
| (8, 16, 1024, 64)   | noop          | torch.bfloat16 |          134.883 |             125.004 |          484.706 |            1195.254 |         1.079 |         0.406 |
| (8, 16, 1024, 64)   | causal_mask   | torch.bfloat16 |          134.297 |             132.875 |          485.420 |            1234.953 |         1.011 |         0.393 |
| (8, 16, 1024, 64)   | relative_bias | torch.bfloat16 |          134.839 |             139.231 |          485.470 |            1198.556 |         0.968 |         0.405 |
| (8, 16, 1024, 64)   | head_bias     | torch.bfloat16 |          133.822 |             136.449 |          485.608 |            1189.198 |         0.981 |         0.408 |
| (8, 16, 1024, 128)  | noop          | torch.bfloat16 |          235.470 |             234.765 |          886.094 |            2662.944 |         1.003 |         0.333 |
| (8, 16, 1024, 128)  | causal_mask   | torch.bfloat16 |          236.305 |             241.382 |          886.293 |            2646.984 |         0.979 |         0.335 |
| (8, 16, 1024, 128)  | relative_bias | torch.bfloat16 |          236.414 |             233.980 |          885.250 |            2642.178 |         1.010 |         0.335 |
| (8, 16, 1024, 128)  | head_bias     | torch.bfloat16 |          237.176 |             239.040 |          885.754 |            2665.242 |         0.992 |         0.332 |
| (8, 16, 1024, 256)  | noop          | torch.bfloat16 |          504.445 |             517.855 |         1978.956 |            9592.906 |         0.974 |         0.206 |
| (8, 16, 1024, 256)  | causal_mask   | torch.bfloat16 |          502.428 |             536.002 |         1978.611 |           10607.342 |         0.937 |         0.187 |
| (8, 16, 1024, 256)  | relative_bias | torch.bfloat16 |          503.396 |             523.960 |         1977.993 |            9539.284 |         0.961 |         0.207 |
| (8, 16, 1024, 256)  | head_bias     | torch.bfloat16 |          503.818 |             536.014 |         1980.131 |            9576.262 |         0.940 |         0.207 |
| (8, 16, 4096, 64)   | noop          | torch.bfloat16 |         1970.139 |            1674.930 |         5750.940 |           16724.134 |         1.176 |         0.344 |
| (8, 16, 4096, 64)   | causal_mask   | torch.bfloat16 |         1959.036 |            1775.056 |         5780.512 |           17390.350 |         1.104 |         0.332 |
| (8, 16, 4096, 64)   | relative_bias | torch.bfloat16 |         1947.198 |            1773.869 |         5780.643 |           16779.699 |         1.098 |         0.345 |
| (8, 16, 4096, 64)   | head_bias     | torch.bfloat16 |         1963.935 |            1829.502 |         5780.018 |           16703.259 |         1.073 |         0.346 |
| (8, 16, 4096, 128)  | noop          | torch.bfloat16 |         3582.711 |            3362.623 |        10436.069 |           36415.565 |         1.065 |         0.287 |
| (8, 16, 4096, 128)  | causal_mask   | torch.bfloat16 |         3581.504 |            3499.472 |        10346.869 |           36164.959 |         1.023 |         0.286 |
| (8, 16, 4096, 128)  | relative_bias | torch.bfloat16 |         3589.779 |            3337.849 |        10529.621 |           36261.696 |         1.075 |         0.290 |
| (8, 16, 4096, 128)  | head_bias     | torch.bfloat16 |         3602.265 |            3436.444 |        10458.660 |           36507.790 |         1.048 |         0.286 |
| (8, 16, 4096, 256)  | noop          | torch.bfloat16 |         7695.923 |            7126.275 |        24643.009 |          140949.081 |         1.080 |         0.175 |
| (8, 16, 4096, 256)  | causal_mask   | torch.bfloat16 |         7679.939 |            7186.252 |        24538.105 |          157156.067 |         1.069 |         0.156 |
| (8, 16, 4096, 256)  | relative_bias | torch.bfloat16 |         7681.374 |            6994.832 |        24549.713 |          140077.179 |         1.098 |         0.175 |
| (8, 16, 4096, 256)  | head_bias     | torch.bfloat16 |         7679.822 |            7212.278 |        24627.823 |          140675.003 |         1.065 |         0.175 |
| (16, 16, 512, 64)   | noop          | torch.bfloat16 |           80.126 |              78.291 |          333.719 |             541.165 |         1.023 |         0.617 |
| (16, 16, 512, 64)   | causal_mask   | torch.bfloat16 |           80.065 |              81.696 |          333.779 |             551.113 |         0.980 |         0.606 |
| (16, 16, 512, 64)   | relative_bias | torch.bfloat16 |           80.138 |              86.715 |          333.364 |             542.118 |         0.924 |         0.615 |
| (16, 16, 512, 64)   | head_bias     | torch.bfloat16 |           80.415 |              85.204 |          333.294 |             536.840 |         0.944 |         0.621 |
| (16, 16, 512, 128)  | noop          | torch.bfloat16 |          134.964 |             138.025 |          607.093 |            1333.102 |         0.978 |         0.455 |
| (16, 16, 512, 128)  | causal_mask   | torch.bfloat16 |          134.192 |             141.523 |          606.269 |            1424.318 |         0.948 |         0.426 |
| (16, 16, 512, 128)  | relative_bias | torch.bfloat16 |          135.711 |             138.639 |          606.283 |            1327.974 |         0.979 |         0.457 |
| (16, 16, 512, 128)  | head_bias     | torch.bfloat16 |          135.552 |             140.555 |          607.107 |            1347.370 |         0.964 |         0.451 |
| (16, 16, 512, 256)  | noop          | torch.bfloat16 |          275.113 |             315.144 |         1301.583 |            5268.153 |         0.873 |         0.247 |
| (16, 16, 512, 256)  | causal_mask   | torch.bfloat16 |          274.867 |             328.106 |         1302.513 |            5770.594 |         0.838 |         0.226 |
| (16, 16, 512, 256)  | relative_bias | torch.bfloat16 |          276.052 |             321.770 |         1302.904 |            5241.920 |         0.858 |         0.249 |
| (16, 16, 512, 256)  | head_bias     | torch.bfloat16 |          271.409 |             328.839 |         1302.142 |            5266.037 |         0.825 |         0.247 |
| (16, 16, 1024, 64)  | noop          | torch.bfloat16 |          260.489 |             237.463 |          955.884 |            1817.558 |         1.097 |         0.526 |
| (16, 16, 1024, 64)  | causal_mask   | torch.bfloat16 |          262.378 |             254.350 |          955.280 |            1843.807 |         1.032 |         0.518 |
| (16, 16, 1024, 64)  | relative_bias | torch.bfloat16 |          261.338 |             268.253 |          956.038 |            1820.036 |         0.974 |         0.525 |
| (16, 16, 1024, 64)  | head_bias     | torch.bfloat16 |          262.153 |             264.156 |          956.023 |            1810.076 |         0.992 |         0.528 |
| (16, 16, 1024, 128) | noop          | torch.bfloat16 |          476.475 |             461.413 |         1760.578 |            4306.521 |         1.033 |         0.409 |
| (16, 16, 1024, 128) | causal_mask   | torch.bfloat16 |          473.794 |             479.178 |         1761.277 |            4619.439 |         0.989 |         0.381 |
| (16, 16, 1024, 128) | relative_bias | torch.bfloat16 |          473.839 |             463.282 |         1758.692 |            4290.562 |         1.023 |         0.410 |
| (16, 16, 1024, 128) | head_bias     | torch.bfloat16 |          472.979 |             472.896 |         1763.086 |            4367.931 |         1.000 |         0.404 |
| (16, 16, 1024, 256) | noop          | torch.bfloat16 |         1014.184 |            1026.764 |         3922.997 |           19104.147 |         0.988 |         0.205 |
| (16, 16, 1024, 256) | causal_mask   | torch.bfloat16 |         1013.217 |            1039.046 |         3928.382 |           21086.281 |         0.975 |         0.186 |
| (16, 16, 1024, 256) | relative_bias | torch.bfloat16 |         1008.519 |            1015.278 |         3922.133 |           18980.652 |         0.993 |         0.207 |
| (16, 16, 1024, 256) | head_bias     | torch.bfloat16 |         1011.360 |            1047.542 |         3931.245 |           19069.172 |         0.965 |         0.206 |
| (16, 16, 4096, 64)  | noop          | torch.bfloat16 |         3929.850 |            3325.667 |        11411.704 |           23344.280 |         1.182 |         0.489 |
| (16, 16, 4096, 64)  | causal_mask   | torch.bfloat16 |         3885.262 |            3581.544 |        11390.515 |           23725.639 |         1.085 |         0.480 |
| (16, 16, 4096, 64)  | relative_bias | torch.bfloat16 |         3865.737 |            3537.308 |        11489.901 |           23406.330 |         1.093 |         0.491 |
| (16, 16, 4096, 64)  | head_bias     | torch.bfloat16 |         3880.530 |            3665.249 |        11484.411 |           23299.496 |         1.059 |         0.493 |
| (16, 16, 4096, 128) | noop          | torch.bfloat16 |         7030.306 |            6745.715 |        20621.264 |           57464.096 |         1.042 |         0.359 |
| (16, 16, 4096, 128) | causal_mask   | torch.bfloat16 |         7095.414 |            7034.385 |        20410.656 |           61660.511 |         1.009 |         0.331 |
| (16, 16, 4096, 128) | relative_bias | torch.bfloat16 |         7084.779 |            6686.497 |        20315.161 |           57243.969 |         1.060 |         0.355 |
| (16, 16, 4096, 128) | head_bias     | torch.bfloat16 |         7075.367 |            6863.305 |        20494.385 |           58481.953 |         1.031 |         0.350 |
| (16, 16, 4096, 256) | noop          | torch.bfloat16 |        15612.741 |           14297.482 |        55306.847 |          281161.865 |         1.092 |         0.197 |
| (16, 16, 4096, 256) | causal_mask   | torch.bfloat16 |        15326.592 |           14263.878 |        55227.806 |          313063.232 |         1.075 |         0.176 |
| (16, 16, 4096, 256) | relative_bias | torch.bfloat16 |        15297.963 |           14007.379 |        54558.029 |          279529.175 |         1.092 |         0.195 |
| (16, 16, 4096, 256) | head_bias     | torch.bfloat16 |        15216.160 |           14276.027 |        55081.581 |          280996.826 |         1.066 |         0.196 |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125515
Approved by: https://github.com/Chillee
2024-05-17 00:41:55 +00:00
337830f657 Revert "[inductor][cpp] GEMM template (infra and fp32) (#124021)"
This reverts commit f060b0c6e608436997a1dc229c82ce26c1e6676f.

Reverted https://github.com/pytorch/pytorch/pull/124021 on behalf of https://github.com/huydhn due to Unfortunately, the new tests are still failing internally ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2116415398))
2024-05-17 00:22:40 +00:00
4a5ef0b793 Revert "[inductor][cpp] epilogue support for gemm template (#126019)"
This reverts commit 7844c202b2076ec3efa23264226f3eaef11a6fcb.

Reverted https://github.com/pytorch/pytorch/pull/126019 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the dependency PR https://github.com/pytorch/pytorch/pull/124021 is going to be reverted ([comment](https://github.com/pytorch/pytorch/pull/126019#issuecomment-2116408137))
2024-05-17 00:15:00 +00:00
59ca0d8c14 Revert "[inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)"
This reverts commit 927e631dc2356c0cb600dbdf9e8f84ce792a8ba1.

Reverted https://github.com/pytorch/pytorch/pull/126068 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but the dependency PR https://github.com/pytorch/pytorch/pull/124021 is going to be reverted ([comment](https://github.com/pytorch/pytorch/pull/126019#issuecomment-2116408137))
2024-05-17 00:15:00 +00:00
cb3b8cd0d3 Use object identity for deepcopy memo (#126126)
Copy of #126089, with some additional fixes & tests

Partial fix for #125635: previously, the deepcopy implementation would group together any tensors with any aliasing relationship and assign them to the same tensor. This was sort of good if you have two tensors `b = a.detach()`, because then if you deepcopy `list = [a, b]` to `list2 = list.deepcopy()`, then writes to `list2[0]` will also modify `list2[1]`. But for the most part, it's bad; (1) if you have `b = a.as_strided((4, 4), (16, 1), 16)`, then it'll make `b == a` in the deepcopied implementation, which is completely wrong; and (2) even if you have `b = a.detach()`, these are still initially two different tensors which become the same tensor after the old deepcopy implementation.

The new implementation only groups together tensors that have the same identity. This is a partial fix, but it's more reasonable. What changes:
* (becomes more correct): different views of the same base tensor will no longer all become equal after deepcopying
* (still kind of wrong): views won't actually alias each other after deepcopying.
* (arguably a minor regression): equivalent views of the same tensor will no longer be copied to the same tensor - so they won't alias.
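
The identity-based grouping described above can be illustrated with Python's standard `copy.deepcopy`, whose memo dictionary is keyed by `id()`. This is only a sketch: plain lists stand in for tensors (an assumption for illustration), and it does not reproduce the PyTorch or C++ implementation touched by this PR.

```python
import copy

# Plain lists stand in for tensors here (an assumption for illustration);
# copy.deepcopy keeps a memo keyed by id(obj), i.e. by object identity.
a = [1.0, 2.0, 3.0]
b = list(a)          # equal contents, but a different object than `a`

same = [a, a]        # the same object referenced twice
different = [a, b]   # two distinct objects that merely compare equal

same_copy = copy.deepcopy(same)
different_copy = copy.deepcopy(different)

# Same identity in the source -> one shared copy in the result.
shares_identity = same_copy[0] is same_copy[1]

# Distinct identities in the source -> distinct copies in the result,
# even though their contents are equal.
stays_distinct = different_copy[0] is not different_copy[1]

print(shares_identity, stays_distinct)
```

Grouping by identity rather than by any aliasing relationship means two objects are merged in the copy only when they were literally the same object in the source.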

BC breaking: C++ deepcopy interface changes from accepting `IValue::HashAliasedIValueMap memo` to accepting `IValue::HashIdentityIValueMap memo`. If there are objections, we can keep the old API. However, it seems likely that users generally won't try to deepcopy from C++.

Differential Revision: [D57406306](https://our.internmc.facebook.com/intern/diff/D57406306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126126
Approved by: https://github.com/ezyang
2024-05-17 00:06:26 +00:00
55628624b8 [c10d] add pg_name and pg_desc to logger (#126409)
Summary:
This should further improve our debuggability

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126409
Approved by: https://github.com/XilunWu
2024-05-16 23:56:19 +00:00
796dff7147 Import MKL via //third-party/mkl targets (#126371)
Summary:
This is a step towards upgrading the MKL library and using buckified targets rather than importing from TP2.

- Add new `//third-party/mkl:mkl_xxx` targets that are currently aliases to `third-party//IntelComposerXE:mkl_xxx`.
- Switch usage of `external_deps = [("IntelComposerXE", None, "mkl_xxx")]` to `deps = ["fbsource//third-party/mkl:mkl_xxx"]`

Note that this only changes references to `mkl_xxx` in `IntelComposerXE`, not references to "svml" or "ipp*".

Test Plan: sandcastle

Differential Revision: D57360438

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126371
Approved by: https://github.com/bertmaher
2024-05-16 22:51:26 +00:00
62403b57b9 Add prefix option to CapabilityBasedPartitioner (#126382)
Summary: Add prefix arg so that users can provide the submodule name to partitioner.

Test Plan: https://fburl.com/anp/2kue4qp9

Differential Revision: D57416926

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126382
Approved by: https://github.com/SherlockNoMad
2024-05-16 22:38:07 +00:00
c226839f5c Eliminate some C++11 checks (#126308)
Test Plan: Sandcastle

Reviewed By: palmje

Differential Revision: D57246912

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126308
Approved by: https://github.com/Skylion007
2024-05-16 22:37:45 +00:00
f17572fcf6 add 3.12 inductor CI tests (#126218)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126218
Approved by: https://github.com/huydhn, https://github.com/desertfire
2024-05-16 22:29:24 +00:00
93524cf5ff [compiled autograd] clear compiled_autograd_verbose once test is done (#126148)
The verbose flag leaks into tests run afterwards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126148
Approved by: https://github.com/jansel
ghstack dependencies: #126144, #126146
2024-05-16 22:23:02 +00:00
cef7756c9c [inductor] Clear cache on ctx manager exit (#126146)
FIXES https://github.com/pytorch/pytorch/issues/126128.

Right now, we only clear the cache on ctx manager enter, so the state left behind after exit is stale unless we call fresh_inductor_cache again. This is usually fine in tests.

Cue compiled autograd tests when going from TestCompiledAutograd -> TestAutogradWithCompiledAutograd.
TestCompiledAutograd uses the ctx manager, but TestAutogradWithCompiledAutograd doesn't.
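
The enter-and-exit clearing can be sketched with a generic context manager. The `_cache` dictionary and the `fresh_cache` helper below are hypothetical stand-ins, not the actual `fresh_inductor_cache` implementation.

```python
from contextlib import contextmanager

# Hypothetical stand-in for a process-wide compilation cache.
_cache = {}

@contextmanager
def fresh_cache():
    """Clear the cache on entry and again on exit."""
    _cache.clear()
    try:
        yield _cache
    finally:
        # Clearing here keeps state from leaking into code that runs
        # after the block without entering the context manager itself.
        _cache.clear()

with fresh_cache() as cache:
    cache["kernel"] = "compiled artifact"
    inside = dict(cache)

leaked = dict(_cache)
print(inside, leaked)
```

With clearing only on entry, `leaked` would still hold the entry written inside the block, which is the situation a later test class that never enters the context manager would inherit.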

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126146
Approved by: https://github.com/jgong5, https://github.com/oulgen
ghstack dependencies: #126144
2024-05-16 22:23:02 +00:00
4cd4463c1c [compiled autograd] Fix LoggingTensor flaky test (#126144)
LoggingTensor fails consistently when root logger level is INFO or lower
By default, root logger should be WARNING
But, triton driver initialization will overwrite root logger to INFO, which causes flakiness: https://github.com/pytorch/pytorch/issues/126143
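
One generic way to guard a test against this kind of side effect is to save and restore the root logger level around the code that changes it. The source does not say how the PR fixes the test, so the helper below (`preserve_root_level`) and the stand-in `noisy_init` are assumptions for illustration only.

```python
import logging
from contextlib import contextmanager

@contextmanager
def preserve_root_level():
    """Restore the root logger level after the block (hypothetical helper)."""
    root = logging.getLogger()
    saved = root.level
    try:
        yield root
    finally:
        root.setLevel(saved)

def noisy_init():
    # Stand-in for a library that lowers the root level during initialization.
    logging.getLogger().setLevel(logging.INFO)

logging.getLogger().setLevel(logging.WARNING)
with preserve_root_level():
    noisy_init()
    level_inside = logging.getLogger().level

level_after = logging.getLogger().level
print(logging.getLevelName(level_inside), logging.getLevelName(level_after))
```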

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126144
Approved by: https://github.com/jansel
2024-05-16 22:23:02 +00:00
4b7eee3450 Print export warning only once in capture_pre_autograd (#126403)
Summary: Missed this in D57163341

Test Plan: CI

Differential Revision: D57442088

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126403
Approved by: https://github.com/zhxchen17
2024-05-16 21:55:11 +00:00
e9719aec30 Fix strict default value in StateDictOptions (#125998)
Fixes #125992

The default value of the parameter `strict` should be `True`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125998
Approved by: https://github.com/fegin
2024-05-16 21:42:53 +00:00
f5abf28e41 [Traceable FSDP2] Use DTensor.from_local() in _from_local_no_grad when compile (#126346)
As discussed before, for now Dynamo is not able to support the DTensor constructor, and instead we have to use `DTensor.from_local()`.

This won't affect eager and it's a compile-only change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126346
Approved by: https://github.com/awgu
2024-05-16 21:37:00 +00:00
4f1a56cd42 Switched from parameter in can_cast to from_. (#126030)
Fixes #126012.

`from` is a reserved keyword in Python, thus we can't make the C++ impl available with `from` as function parameter. This PR changes the name to `from_` and also adjusts the docs.

If we want to preserve backwards compatibility, we can leave the C++ name as-is and only fix the docs. However, `torch.can_cast(from_=torch.int, to=torch.int)` won't work then.
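
The keyword restriction can be checked without PyTorch by asking the Python compiler whether a snippet parses. This is a small sketch: the `compiles` helper is hypothetical, and the `can_cast` strings are never executed, only parsed.

```python
# `from` is a reserved keyword, so it cannot be a parameter name or a
# keyword-argument name; compile() reports this without running anything.
def compiles(source: str) -> bool:
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False

bad_def = "def can_cast(from, to): ..."
good_def = "def can_cast(from_, to): ..."
bad_call = "can_cast(from=int, to=float)"
good_call = "can_cast(from_=int, to=float)"

results = {src: compiles(src) for src in (bad_def, good_def, bad_call, good_call)}
print(results)
```

Both the definition and the call site are rejected when `from` is used, which is why a keyword call only becomes possible once the parameter is named `from_`.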

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126030
Approved by: https://github.com/albanD
2024-05-16 20:58:24 +00:00
82c66bc41a Make 'pytest test/inductor/test_memory_planning.py' work (#126397)
There's still another naughty direct test_* import, I'm out of patience
right now though.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126397
Approved by: https://github.com/peterbell10, https://github.com/int3
2024-05-16 20:28:20 +00:00
866ca4630c Don't install inplace_methods on MockHandler, not needed (#126398)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126398
Approved by: https://github.com/jansel, https://github.com/peterbell10
2024-05-16 20:28:05 +00:00
8f0c207e18 xpu: implement xpu serialization (#125530)
Fixes: #125529

BC-breaking note:
The deprecated "async" argument to Storage.cuda and Storage.hpu has been removed. Use non_blocking instead.

CC: @jbschlosser, @frank-wei @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @albanD

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125530
Approved by: https://github.com/guangyey, https://github.com/albanD
2024-05-16 20:22:17 +00:00
da9bf77f0a [Dynamo] Support SET_UPDATE (#126243)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126243
Approved by: https://github.com/anijain2305, https://github.com/Skylion007, https://github.com/jansel
2024-05-16 20:05:34 +00:00
aab448e381 Remove redundant serialization code (#126249)
After https://github.com/pytorch/pytorch/pull/123308, we no longer need separate serialization path to handle different types that exist in the `nn_module` metadata. This PR cleans up the redundant code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126249
Approved by: https://github.com/angelayi
2024-05-16 19:22:20 +00:00
5862521ad1 [onnx.export] Cache SetGraphInputTypeReliable (#124912)
This PR is part of an effort to speed up torch.onnx.export (https://github.com/pytorch/pytorch/issues/121422).

- For each node that is processed in onnx.export, a check is run to see if all inputs are "reliable" (static shape, etc.). This value does not change, so it is much faster to cache it on the first computation. The caching is added to the ConstantMap state.
- Resolves (6) in #121422.
- Also see #123028 with a similar addition of a cache state.
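
The caching idea in the first bullet can be sketched as a per-node memo. All names below (`ReliableCache`, `compute_all_inputs_reliable`, the node names) are hypothetical and written in Python for illustration; the exporter's actual code and its ConstantMap state are not shown here.

```python
# Hypothetical names throughout; this is not the exporter's own code.
calls = 0

def compute_all_inputs_reliable(node_inputs):
    """Stand-in for the expensive per-node check (None = unknown shape)."""
    global calls
    calls += 1
    return all(shape is not None for shape in node_inputs)

class ReliableCache:
    """Caches the result per node, since it does not change once computed."""

    def __init__(self):
        self._results = {}

    def all_inputs_reliable(self, node_name, node_inputs):
        if node_name not in self._results:
            self._results[node_name] = compute_all_inputs_reliable(node_inputs)
        return self._results[node_name]

cache = ReliableCache()
first = cache.all_inputs_reliable("conv1", [(1, 3, 224, 224), (64, 3, 7, 7)])
second = cache.all_inputs_reliable("conv1", [(1, 3, 224, 224), (64, 3, 7, 7)])
dynamic = cache.all_inputs_reliable("reshape1", [None])
print(first, second, dynamic, calls)
```

The second lookup for `conv1` is served from the memo, so the underlying check runs once per node rather than once per query.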

(partial fix of #121545)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124912
Approved by: https://github.com/justinchuby
2024-05-16 18:48:56 +00:00
a0429c01ad [BE][FSDP] Remove unnecessary warnings (#126365)
As title

Differential Revision: [D57419704](https://our.internmc.facebook.com/intern/diff/D57419704/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126365
Approved by: https://github.com/awgu, https://github.com/Skylion007
ghstack dependencies: #126362
2024-05-16 17:34:01 +00:00
0dd53650dd [BE][FSDP] Change the logging level to info (#126362)
As title

Differential Revision: [D57419445](https://our.internmc.facebook.com/intern/diff/D57419445/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126362
Approved by: https://github.com/awgu, https://github.com/Skylion007
2024-05-16 17:31:06 +00:00
9fbf2696d7 [AOTI][refactor] Add aoti_torch_item as a util function (#126352)
Summary: The logic has been repeated several times in the code, so it's worth writing a common util function.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126352
Approved by: https://github.com/chenyang78
ghstack dependencies: #126181, #126182, #126183
2024-05-16 17:07:06 +00:00
0332b5812e [AOTI] Support InplaceBernoulliFallback in the ABI-compatible codegen (#126183)
Summary: Update the torchgen rule for inplace ops like bernoulli_, and update InplaceBernoulliFallback to codegen in the ABI-compatible mode. Fixes https://github.com/pytorch/pytorch/issues/121809

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126183
Approved by: https://github.com/angelayi
ghstack dependencies: #126181, #126182
2024-05-16 17:07:06 +00:00
5792bc3c3e [AOTI] Refactor some fallback op util functions (#126182)
Summary: Move some util functions for cpp kernel naming and missing arg filling from FallbackKernel to ExternKernel, since they are useful for ExternKernel in general.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126182
Approved by: https://github.com/chenyang78
ghstack dependencies: #126181
2024-05-16 17:07:00 +00:00
c5f926ab87 [AOTI][torchgen] Support at::Generator via C shim (#126181)
Summary: Support at::Generator which is used by many random number generator ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126181
Approved by: https://github.com/chenyang78
2024-05-16 17:06:53 +00:00
a55d63659a Add 2nd shard to ROCm trunk workflow for core distributed UTs (#121716)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121716
Approved by: https://github.com/ezyang, https://github.com/huydhn
2024-05-16 16:50:02 +00:00
f155ed6bf2 [ROCm] amax hipblaslt integration (#125921)
AMAX is coming as part of rocm6.2. This code adds that functionality

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125921
Approved by: https://github.com/eqy, https://github.com/lezcano
2024-05-16 16:40:31 +00:00
14d8e3aec0 Add distributed/_tensor/test_attention to ROCM_BLOCKLIST (#126336)
Fixes #125504
Fixes #126252
Fixes #126296
Fixes #126330

This PR doesn't really fix the RingAttentionTest tests for ROCm, but explicitly adds the whole test file to ROCM_BLOCKLIST to get a clean signal on ROCm distributed CI. We will enable these tests in a follow-up PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126336
Approved by: https://github.com/huydhn, https://github.com/pruthvistony
2024-05-16 16:38:09 +00:00
91bf952d10 Fix aarch64 debug build with GCC (#126290)
By working around GCC's quirks in instantiating templates that require immediate values.
Provide alternative implementation for scaling the output if compiled without any optimizations (both GCC and clang define `__OPTIMIZE__` if invoked with anything but `-O0`)

Fixes https://github.com/pytorch/pytorch/issues/126283

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126290
Approved by: https://github.com/atalman, https://github.com/seemethere
2024-05-16 13:41:45 +00:00
ab07867084 [FSDP2] Supported set_all_reduce_gradients=False for HSDP (#126166)
**Context**
For FSDP, gradient accumulation across microbatches has two flavors: (1) reduce-scatter or (2) no reduce-scatter. (1) incurs the collective per microbatch backward but saves gradient memory (storing the sharded gradients), while (2) avoids the communication but uses more gradient memory (storing the unsharded gradients).
- FSDP2 offers (1) without any intervention. The user should simply make sure to run the optimizer step after `K` microbatches for `K > 1`.
- FSDP2 offers (2) via `module.set_requires_gradient_sync()` (e.g. `module.set_requires_gradient_sync(is_last_microbatch)`).

For HSDP, since we reduce-scatter and then all-reduce, we have additional flexibility and get three flavors: (1) reduce-scatter and all-reduce, (2) reduce-scatter but no all-reduce, and (3) no reduce-scatter and no all-reduce. This PR adds support for (2).
- FSDP2 offers (1) without any intervention like mentioned above.
- FSDP2 offers (3) via `module.set_requires_gradient_sync()` like mentioned above.
- FSDP2 offers (2) via `module.set_requires_all_reduce()` similar to `set_requires_gradient_sync()`.

**Overview**
For HSDP, to reduce-scatter but not all-reduce during gradient accumulation, the user can do something like:
```
for microbatch_idx, microbatch in enumerate(microbatches):
    is_last_microbatch = microbatch_idx == len(microbatches) - 1
    model.set_requires_all_reduce(is_last_microbatch)
    # Run forward/backward
```

This PR also makes the minor change of making the `recurse: bool` argument in these setter methods keyword-only.

**Developer Notes**
We choose to implement this by saving the partial reduce output to the `FSDPParamGroup` for simplicity, where we assume that the set of parameters that receive gradients does not change across microbatches. An alternative would be to view into the partial reduce output per parameter and save the view to each parameter. We prefer to avoid this alternative for now because it introduces more complexity to do extra viewing when saving the partial reduce output to each parameter, accumulating into them, and accumulating back to the last microbatch's reduce output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126166
Approved by: https://github.com/weifengpy, https://github.com/wanchaol
ghstack dependencies: #126067, #126070, #126161
2024-05-16 12:29:22 +00:00
c2f8c75129 [Reopen] Upgrade submodule oneDNN to v3.4.2 (#126137)
Reopen of https://github.com/pytorch/pytorch/pull/122472

## Improvements
This upgrade fixes the following issues:
- https://github.com/pytorch/pytorch/issues/120982

This upgrade brings the following new features:
- Introduced memory descriptor serialization API. This API is needed to support freezing on CPU in AOTInductor (https://github.com/pytorch/pytorch/issues/114450)

## Validation results on CPU
Original results with oneDNN v3.4.1 are here: https://github.com/pytorch/pytorch/pull/122472#issue-2201602846

Need to rerun validation and update results.

Co-authored-by: Sunita Nadampalli <nadampal@amazon.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126137
Approved by: https://github.com/jgong5, https://github.com/snadampal, https://github.com/atalman
2024-05-16 12:00:16 +00:00
691af57fbc Fix broken link of scikit-learn (#120972)
The link is broken in https://pytorch.org/docs/main/community/design.html
Pull Request resolved: https://github.com/pytorch/pytorch/pull/120972
Approved by: https://github.com/Skylion007
2024-05-16 11:46:34 +00:00
4333e122d4 [Traceable FSDP2] Add all_gather_into_tensor out variant (#126334)
This PR adds `torch.ops._c10d_functional.all_gather_into_tensor_out`.

It's important for tracing FSDP2, because FSDP2 pre-allocates the output buffer of AllGather, and makes input buffer an alias of the output buffer, and expects both of them to be used to achieve lower memory usage. If we don't preserve this behavior and instead functionalize the AllGather op, AllGather op will then create a brand-new output buffer (instead of reusing), thus significantly increasing the memory usage.

The expectation is that we will "re-inplace" the AllGather op by switching to the out variant in Inductor post-grad stage via an FX pass, so this API is not expected to be directly used by users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126334
Approved by: https://github.com/yifuwang, https://github.com/wanchaol
2024-05-16 10:27:06 +00:00
d61a81a9e7 Fix lint failures coming from #126035 (#126378)
MYPY somehow shows lots of local failures for me.  The issue is tracked in https://github.com/pytorch/pytorch/issues/126361.  This is only to keep trunk sane.  These two lines were added by #126035 as an attempt to fix lint there, but didn't seem to help.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126378
Approved by: https://github.com/kit1980
2024-05-16 06:05:47 +00:00
0716f75cfb Revert "Add Lowering for FlexAttention Backwards (#125515)"
This reverts commit 95b9e981c3ab68fc17f78b8a6bbfd9569745ae4c.

Reverted https://github.com/pytorch/pytorch/pull/125515 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the newly added test runs out of memory 95b9e981c3 ([comment](https://github.com/pytorch/pytorch/pull/125515#issuecomment-2114084869))
2024-05-16 05:52:13 +00:00
cdcba4dee5 Revert "Fix lint failures coming from #126035 (#126378)"
This reverts commit 5fa1f4c6e46d92482d99614c06b6e288cc8d6c8d.

Reverted https://github.com/pytorch/pytorch/pull/126378 on behalf of https://github.com/huydhn due to Trying to add yet another lint fix from https://hud.pytorch.org/pr/pytorch/pytorch/126357 and will reland this ([comment](https://github.com/pytorch/pytorch/pull/126378#issuecomment-2114060547))
2024-05-16 05:32:19 +00:00
58378f1224 [Doc] Add deprecated autocast comments for doc (#126062)
# Motivation
We generalize a device-agnostic API `torch.amp.autocast` in [#125103](https://github.com/pytorch/pytorch/pull/125103).  After that,
- `torch.cpu.amp.autocast(args...)` is completely equivalent to `torch.amp.autocast('cpu', args...)`, and
- `torch.cuda.amp.autocast(args...)` is completely equivalent to `torch.amp.autocast('cuda', args...)`

in both eager mode and JIT mode.
Based on this, we would like to deprecate `torch.cpu.amp.autocast` and `torch.cuda.amp.autocast` and **strongly recommend** that developers use `torch.amp.autocast`, which is a device-agnostic API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126062
Approved by: https://github.com/eqy, https://github.com/albanD
2024-05-16 05:26:43 +00:00
08aa704d0c [1/N] Non-Tensor: Scalar Support: Enable aot compile to support aten operations with scalar input like alpha (#124177)
Some operations have a scalar input parameter, like `torch.add(a, b, alpha=2.0)`.  Currently, AOT compile does not support such a case because it requires the signature of the captured graph to align with the operation's signature. This means that some inputs in the captured graph may be scalars (float, int, bool, etc.). This breaks the assumption of `compile_fx_aot`, which assumes that all the example inputs are tensors - 0f6ce45bcb/torch/_inductor/compile_fx.py (L1048)

This PR intends to support such cases by allowing a non-aligned signature and filtering out the non-Tensor parameters.

Captured graph for `torch.add(a, b, alpha=2.0)`

```
opcode         name      target           args              kwargs
-------------  --------  ---------------  ----------------  --------------
placeholder    arg0_1    arg0_1           ()                {}
placeholder    arg1_1    arg1_1           ()                {}
call_function  add       aten.add.Tensor  (arg0_1, arg1_1)  {'alpha': 2.0}
output         output_1  output           ((add,),)         {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124177
Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/jgong5
2024-05-16 05:15:55 +00:00
5fa1f4c6e4 Fix lint failures coming from #126035 (#126378)
MYPY somehow shows lots of local failures for me.  The issue is tracked in https://github.com/pytorch/pytorch/issues/126361.  This is only to keep trunk sane.  These two lines were added by #126035 as an attempt to fix lint there, but didn't seem to help.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126378
Approved by: https://github.com/kit1980
2024-05-16 05:12:27 +00:00
e661a42428 [Add sliding window attention bias] (#126061)
Summary:
This PR implements sliding window and updates "aten._flash_attention_forward/_flash_attention_backward" to expose the window_size_left and window_size_right arguments. With this kwarg added we can dispatch to the FAv2 impl if the necessary constraints are met.

These arguments will eventually be provided to "aten.sdpa_flash" but for now they are needed when called by xformers in their effort to directly use the Pytorch FAv2 impl instead of building their own.
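
As a rough illustration of what a left/right window means, the sketch below builds a boolean attention mask in plain Python. The `sliding_window_mask` helper is hypothetical, and it assumes equal query and key lengths and that a negative window size means "unbounded" on that side (consistent with the -1 default mentioned in the test plan); it is not the FAv2 kernel logic.

```python
def sliding_window_mask(seq_len, window_size_left=-1, window_size_right=-1):
    """Return mask[i][j] == True when query i may attend to key j.

    Assumed convention: a negative window size means "unbounded" on that side.
    """
    mask = []
    for i in range(seq_len):
        row = []
        for j in range(seq_len):
            left_ok = window_size_left < 0 or j >= i - window_size_left
            right_ok = window_size_right < 0 or j <= i + window_size_right
            row.append(left_ok and right_ok)
        mask.append(row)
    return mask

full = sliding_window_mask(4)
local_causal = sliding_window_mask(4, window_size_left=1, window_size_right=0)
for row in local_causal:
    print("".join("x" if allowed else "." for allowed in row))
```

With both sizes left at -1 every position is visible, so the defaults leave existing behavior unchanged; a right window of 0 yields a causal pattern restricted to the most recent keys.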

Test Plan:
Use the default aten.sdpa_flash tests since we've added optional arguments set to the previous default value: -1, /*window_size_left*/

Using buck2 build --flagfile fbcode//mode/dev-nosan fbcode//caffe2/caffe2/fb/predictor/tests:inference_context_test

Differential Revision: D56938087

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126061
Approved by: https://github.com/drisspg, https://github.com/desertfire
2024-05-16 04:50:47 +00:00
8dc6f455bd [ez] fix exported diff mismatch (#126357)
Fixes the following issue:
D55803461 differs from the exported PR: #123658

⚠️ this PR needs to be skipped on diff train!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126357
Approved by: https://github.com/huydhn, https://github.com/fegin
2024-05-16 04:49:48 +00:00
6e6e44bdcc Generate runtime asserts when propagate real tensor is used (#126287)
This means that propagate real tensor is no longer unsound: if the
route we took at compile time diverges with runtime, you will get a
runtime assert.

Also add structured trace logs for these.

Also fix bug where xreplace with int range is not guaranteed to return
a sympy expression.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126287
Approved by: https://github.com/Skylion007
2024-05-16 04:45:57 +00:00
c860df5a9d [c10d] Add an option for NAN check on every collective (#125726)
Summary:
The NAN CHECK is done through device side assert without copying needed
from GPU to CPU
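
The PR performs the check on the device; the sketch below is only a host-side analogue of the same idea, namely refusing to run a collective when the input buffer contains a NaN. `checked_collective` and `fake_all_reduce` are hypothetical names, and the real check raises a CUDA device-side assert rather than a Python exception.

```python
import math
from typing import Callable, List, Sequence


def checked_collective(
    collective: Callable[[Sequence[float]], List[float]],
    buffer: Sequence[float],
) -> List[float]:
    """Run `collective` only if `buffer` contains no NaN values."""
    for index, value in enumerate(buffer):
        if math.isnan(value):
            raise RuntimeError(f"NaN found at index {index} before collective")
    return collective(buffer)


def fake_all_reduce(buffer: Sequence[float]) -> List[float]:
    """Hypothetical single-process stand-in that mimics a sum over two ranks."""
    return [2.0 * v for v in buffer]
```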
Test Plan:
Unit test for collectives that should experience run time error

(sqzhang_1) [sqzhang@devgpu009.cln1 ~/pytorch (38f5143e)]$  python
test/distributed/test_c10d_nccl.py ProcessGroupNCCLTest.test_nan_assert
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [0,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [1,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [2,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [3,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [4,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [5,0,0] Assertion `!isnan(val)`
failed.
[rank0]:[E507 17:31:56.885473996 Utils.cu:30] CUDA error during
checkForNan: device-side assert triggered

/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [0,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [1,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [2,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [3,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [4,0,0] Assertion `!isnan(val)`
failed.
/home/sqzhang/pytorch/torch/csrc/distributed/c10d/Utils.cu:15:
checkForNaN: block: [0,0,0], thread: [5,0,0] Assertion `!isnan(val)`
failed.
[rank1]:[E507 17:31:56.128961534 Utils.cu:30] CUDA error during
checkForNan: device-side assert triggered

.
----------------------------------------------------------------------
Ran 1 test in 7.723s

OK

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125726
Approved by: https://github.com/kwen2501
2024-05-16 04:35:15 +00:00
0214711f05 Add mode to MemoryDep to track atomic accumulates (#123223)
And allow fusion of buffers where writes are only atomic accumulates.
This allows fusing of ops like

  _unsafe_index_put(_unsafe_index_put(a, ...), ...)
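
A small sketch of why such writes can share a buffer, assuming both calls accumulate (add into the destination) rather than overwrite: additions commute, so two chained accumulate steps give the same result as a single pass that applies both. The helper names are hypothetical and plain lists stand in for tensors.

```python
from typing import List, Sequence, Tuple

Update = Tuple[Sequence[int], Sequence[float]]


def index_put_accumulate(
    base: Sequence[float], indices: Sequence[int], values: Sequence[float]
) -> List[float]:
    """Return a copy of `base` with values[k] added at indices[k]."""
    out = list(base)
    for idx, val in zip(indices, values):
        out[idx] += val  # stands in for an atomic add
    return out


def fused_index_put_accumulate(
    base: Sequence[float], first: Update, second: Update
) -> List[float]:
    """Apply two accumulate steps in one pass over a single output buffer."""
    out = list(base)
    for indices, values in (first, second):
        for idx, val in zip(indices, values):
            out[idx] += val
    return out


a = [0.0, 0.0, 0.0]
first = ([0, 2, 0], [1.0, 2.0, 3.0])
second = ([2, 1], [4.0, 5.0])
nested = index_put_accumulate(index_put_accumulate(a, *first), *second)
fused = fused_index_put_accumulate(a, first, second)
```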

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123223
Approved by: https://github.com/peterbell10
2024-05-16 04:34:09 +00:00
d0dfcd2c34 fix the device type for with_comms decorator (#125798)
Found by @yifuwang: it looks like we are wrongly using
self.device_type="cuda" for the gloo backend, which is triggering some
flakiness, i.e. https://github.com/pytorch/pytorch/issues/125366

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125798
Approved by: https://github.com/yifuwang
2024-05-16 03:40:19 +00:00
bcc8d25e47 [dynamo] Delete extra testing of cpp guard manager (#126343)
CPP guard manager has been on for a few weeks now. This separate testing was part of phasing when the cpp guard manager was not enabled. Now this is not needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126343
Approved by: https://github.com/williamwen42
ghstack dependencies: #126303, #126316, #126314, #126327
2024-05-16 03:30:38 +00:00
95b9e981c3 Add Lowering for FlexAttention Backwards (#125515)
# Summary
#### What does this PR do?
It enables Inductor to actually generate the fused flex attention kernel for the backwards

I did some other things along the way:
- Abstract out the 'build_subgraph_buffer' subroutine and make it reusable between flex attention and flex_attention backwards. In total we need to build 3 subgraphs for fwd + bwd: 1 for the fwd graph and then 2 in the bwd. The FAv2 algorithm recomputes parts of the forward (more efficiently, since we already have the row_max via logsumexp), therefore we need to inline both the fwd graph and the joint graph in the bwds kernel.
- The version of the backwards kernel is from a somewhat older version of the triton tutorial implementation. I think that we should update in a follow up to a newer version. Notably the blocks need to be square for this to work as currently implemented. I am sure there are many opportunities for optimization.
- I didn't correctly register the decomp table + IndexMode when I landed https://github.com/pytorch/pytorch/pull/123902; this remedies that.
- The rel_bias helper func was reversed in terms of causality. I fixed it and added a test specifically for "future causal" attention.
- The main point in this PR that I think still needs to be worked out is the store_output call. I have it hacked up to be 'fake', but I don't think we want to land that; we likely want to just have a mutated 'dq' and a stored_output 'dk'
- I also needed to update the `TritonTemplateKernel` to actually accept multiple subgraphs (modifications)
- I updated the benchmark to also profile bwds performance
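
The recomputation mentioned in the first bullet can be sketched for a single row of scores: once the forward pass has saved the logsumexp, the backward pass can rebuild the softmax probabilities with one elementwise `exp`, without repeating the max and sum reductions. This is a plain-Python illustration with hypothetical function names, not the Triton kernel.

```python
import math
from typing import List, Tuple


def softmax_forward(scores: List[float]) -> Tuple[List[float], float]:
    """Return the softmax probabilities and the logsumexp of one score row."""
    row_max = max(scores)
    lse = row_max + math.log(sum(math.exp(s - row_max) for s in scores))
    probs = [math.exp(s - lse) for s in scores]
    return probs, lse


def recompute_probs(scores: List[float], lse: float) -> List[float]:
    """Backward-pass recomputation: no max or sum reduction is needed."""
    return [math.exp(s - lse) for s in scores]


scores = [0.5, -1.0, 2.0]
probs, lse = softmax_forward(scores)
recomputed = recompute_probs(scores, lse)
```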

### Benchmark Numbers:
_The current implementation is not parallelizing over ctx length in the bwd_
FWD Speedups

| Type    |   Speedup | shape              | score_mod   | dtype          |
|---------|-----------|--------------------|-------------|----------------|
| Average |     0.991 |                    |             |                |
| Max     |     1.182 | (16, 16, 4096, 64) | noop        | torch.bfloat16 |
| Min     |     0.796 | (2, 16, 512, 256)  | head_bias   | torch.bfloat16 |

BWD Speedups

| Type    |   Speedup | shape              | score_mod   | dtype          |
|---------|-----------|--------------------|-------------|----------------|
| Average |     0.291 |                    |             |                |
| Max     |     0.652 | (8, 16, 512, 64)   | head_bias   | torch.bfloat16 |
| Min     |     0.073 | (2, 16, 4096, 128) | head_bias   | torch.bfloat16 |

<details>

<summary>Full Data</summary>

| shape               | score_mod     | dtype          |   fwd_eager_time |   fwd_compiled_time |   bwd_eager_time |   bwd_compiled_time |   fwd_speedup |   bwd_speedup |
|---------------------|---------------|----------------|------------------|---------------------|------------------|---------------------|---------------|---------------|
| (2, 16, 512, 64)    | noop          | torch.bfloat16 |           19.936 |              19.092 |           57.851 |             193.564 |         1.044 |         0.299 |
| (2, 16, 512, 64)    | causal_mask   | torch.bfloat16 |           19.955 |              19.497 |           57.662 |             206.278 |         1.024 |         0.280 |
| (2, 16, 512, 64)    | relative_bias | torch.bfloat16 |           19.455 |              21.297 |           57.674 |             195.219 |         0.913 |         0.295 |
| (2, 16, 512, 64)    | head_bias     | torch.bfloat16 |           19.958 |              21.289 |           57.674 |             193.859 |         0.938 |         0.298 |
| (2, 16, 512, 128)   | noop          | torch.bfloat16 |           28.157 |              28.615 |           82.831 |             454.211 |         0.984 |         0.182 |
| (2, 16, 512, 128)   | causal_mask   | torch.bfloat16 |           28.154 |              28.444 |           83.091 |             432.083 |         0.990 |         0.192 |
| (2, 16, 512, 128)   | relative_bias | torch.bfloat16 |           28.722 |              27.897 |           83.175 |             446.789 |         1.030 |         0.186 |
| (2, 16, 512, 128)   | head_bias     | torch.bfloat16 |           28.299 |              27.673 |           83.052 |             459.179 |         1.023 |         0.181 |
| (2, 16, 512, 256)   | noop          | torch.bfloat16 |           41.167 |              50.504 |          175.019 |            1083.545 |         0.815 |         0.162 |
| (2, 16, 512, 256)   | causal_mask   | torch.bfloat16 |           41.656 |              51.933 |          175.078 |            1171.176 |         0.802 |         0.149 |
| (2, 16, 512, 256)   | relative_bias | torch.bfloat16 |           41.697 |              50.722 |          175.159 |            1097.312 |         0.822 |         0.160 |
| (2, 16, 512, 256)   | head_bias     | torch.bfloat16 |           41.690 |              52.387 |          175.184 |            1097.336 |         0.796 |         0.160 |
| (2, 16, 1024, 64)   | noop          | torch.bfloat16 |           39.232 |              37.454 |          127.847 |             612.430 |         1.047 |         0.209 |
| (2, 16, 1024, 64)   | causal_mask   | torch.bfloat16 |           39.930 |              39.599 |          127.755 |             665.359 |         1.008 |         0.192 |
| (2, 16, 1024, 64)   | relative_bias | torch.bfloat16 |           39.417 |              41.304 |          127.902 |             614.990 |         0.954 |         0.208 |
| (2, 16, 1024, 64)   | head_bias     | torch.bfloat16 |           39.965 |              42.034 |          127.953 |             613.273 |         0.951 |         0.209 |
| (2, 16, 1024, 128)  | noop          | torch.bfloat16 |           63.964 |              71.024 |          226.510 |            1637.669 |         0.901 |         0.138 |
| (2, 16, 1024, 128)  | causal_mask   | torch.bfloat16 |           63.843 |              72.451 |          226.750 |            1558.949 |         0.881 |         0.145 |
| (2, 16, 1024, 128)  | relative_bias | torch.bfloat16 |           64.301 |              70.487 |          226.651 |            1610.063 |         0.912 |         0.141 |
| (2, 16, 1024, 128)  | head_bias     | torch.bfloat16 |           64.033 |              71.394 |          226.676 |            1668.511 |         0.897 |         0.136 |
| (2, 16, 1024, 256)  | noop          | torch.bfloat16 |          129.348 |             141.390 |          507.337 |            4405.175 |         0.915 |         0.115 |
| (2, 16, 1024, 256)  | causal_mask   | torch.bfloat16 |          129.538 |             145.680 |          507.178 |            4768.874 |         0.889 |         0.106 |
| (2, 16, 1024, 256)  | relative_bias | torch.bfloat16 |          129.438 |             142.782 |          507.004 |            4401.002 |         0.907 |         0.115 |
| (2, 16, 1024, 256)  | head_bias     | torch.bfloat16 |          129.058 |             146.242 |          507.547 |            4434.251 |         0.883 |         0.114 |
| (2, 16, 4096, 64)   | noop          | torch.bfloat16 |          481.606 |             409.120 |         1440.890 |           14147.269 |         1.177 |         0.102 |
| (2, 16, 4096, 64)   | causal_mask   | torch.bfloat16 |          480.227 |             438.847 |         1434.419 |           14973.386 |         1.094 |         0.096 |
| (2, 16, 4096, 64)   | relative_bias | torch.bfloat16 |          480.831 |             458.104 |         1432.935 |           14193.253 |         1.050 |         0.101 |
| (2, 16, 4096, 64)   | head_bias     | torch.bfloat16 |          480.749 |             452.497 |         1437.040 |           14084.869 |         1.062 |         0.102 |
| (2, 16, 4096, 128)  | noop          | torch.bfloat16 |          872.534 |             848.275 |         2600.895 |           35156.849 |         1.029 |         0.074 |
| (2, 16, 4096, 128)  | causal_mask   | torch.bfloat16 |          872.647 |             868.279 |         2587.581 |           31919.531 |         1.005 |         0.081 |
| (2, 16, 4096, 128)  | relative_bias | torch.bfloat16 |          871.484 |             827.644 |         2593.989 |           34805.634 |         1.053 |         0.075 |
| (2, 16, 4096, 128)  | head_bias     | torch.bfloat16 |          871.422 |             856.437 |         2602.482 |           35708.591 |         1.017 |         0.073 |
| (2, 16, 4096, 256)  | noop          | torch.bfloat16 |         1904.497 |            1758.183 |         6122.416 |           66754.593 |         1.083 |         0.092 |
| (2, 16, 4096, 256)  | causal_mask   | torch.bfloat16 |         1911.174 |            1762.821 |         6113.207 |           72759.392 |         1.084 |         0.084 |
| (2, 16, 4096, 256)  | relative_bias | torch.bfloat16 |         1911.254 |            1727.108 |         6123.530 |           66577.988 |         1.107 |         0.092 |
| (2, 16, 4096, 256)  | head_bias     | torch.bfloat16 |         1916.977 |            1801.804 |         6118.158 |           67359.680 |         1.064 |         0.091 |
| (8, 16, 512, 64)    | noop          | torch.bfloat16 |           44.984 |              43.974 |          170.276 |             262.259 |         1.023 |         0.649 |
| (8, 16, 512, 64)    | causal_mask   | torch.bfloat16 |           45.001 |              46.265 |          170.509 |             274.893 |         0.973 |         0.620 |
| (8, 16, 512, 64)    | relative_bias | torch.bfloat16 |           45.466 |              48.211 |          170.606 |             262.759 |         0.943 |         0.649 |
| (8, 16, 512, 64)    | head_bias     | torch.bfloat16 |           45.481 |              48.435 |          170.267 |             261.265 |         0.939 |         0.652 |
| (8, 16, 512, 128)   | noop          | torch.bfloat16 |           72.565 |              74.736 |          313.220 |             773.126 |         0.971 |         0.405 |
| (8, 16, 512, 128)   | causal_mask   | torch.bfloat16 |           72.015 |              75.755 |          313.311 |             775.513 |         0.951 |         0.404 |
| (8, 16, 512, 128)   | relative_bias | torch.bfloat16 |           72.105 |              74.189 |          313.806 |             769.238 |         0.972 |         0.408 |
| (8, 16, 512, 128)   | head_bias     | torch.bfloat16 |           72.005 |              74.364 |          313.509 |             775.237 |         0.968 |         0.404 |
| (8, 16, 512, 256)   | noop          | torch.bfloat16 |          138.656 |             165.453 |          663.707 |            2672.067 |         0.838 |         0.248 |
| (8, 16, 512, 256)   | causal_mask   | torch.bfloat16 |          139.096 |             172.613 |          663.593 |            2926.538 |         0.806 |         0.227 |
| (8, 16, 512, 256)   | relative_bias | torch.bfloat16 |          139.500 |             168.417 |          663.938 |            2658.629 |         0.828 |         0.250 |
| (8, 16, 512, 256)   | head_bias     | torch.bfloat16 |          139.776 |             173.549 |          662.920 |            2667.266 |         0.805 |         0.249 |
| (8, 16, 1024, 64)   | noop          | torch.bfloat16 |          134.883 |             125.004 |          484.706 |            1195.254 |         1.079 |         0.406 |
| (8, 16, 1024, 64)   | causal_mask   | torch.bfloat16 |          134.297 |             132.875 |          485.420 |            1234.953 |         1.011 |         0.393 |
| (8, 16, 1024, 64)   | relative_bias | torch.bfloat16 |          134.839 |             139.231 |          485.470 |            1198.556 |         0.968 |         0.405 |
| (8, 16, 1024, 64)   | head_bias     | torch.bfloat16 |          133.822 |             136.449 |          485.608 |            1189.198 |         0.981 |         0.408 |
| (8, 16, 1024, 128)  | noop          | torch.bfloat16 |          235.470 |             234.765 |          886.094 |            2662.944 |         1.003 |         0.333 |
| (8, 16, 1024, 128)  | causal_mask   | torch.bfloat16 |          236.305 |             241.382 |          886.293 |            2646.984 |         0.979 |         0.335 |
| (8, 16, 1024, 128)  | relative_bias | torch.bfloat16 |          236.414 |             233.980 |          885.250 |            2642.178 |         1.010 |         0.335 |
| (8, 16, 1024, 128)  | head_bias     | torch.bfloat16 |          237.176 |             239.040 |          885.754 |            2665.242 |         0.992 |         0.332 |
| (8, 16, 1024, 256)  | noop          | torch.bfloat16 |          504.445 |             517.855 |         1978.956 |            9592.906 |         0.974 |         0.206 |
| (8, 16, 1024, 256)  | causal_mask   | torch.bfloat16 |          502.428 |             536.002 |         1978.611 |           10607.342 |         0.937 |         0.187 |
| (8, 16, 1024, 256)  | relative_bias | torch.bfloat16 |          503.396 |             523.960 |         1977.993 |            9539.284 |         0.961 |         0.207 |
| (8, 16, 1024, 256)  | head_bias     | torch.bfloat16 |          503.818 |             536.014 |         1980.131 |            9576.262 |         0.940 |         0.207 |
| (8, 16, 4096, 64)   | noop          | torch.bfloat16 |         1970.139 |            1674.930 |         5750.940 |           16724.134 |         1.176 |         0.344 |
| (8, 16, 4096, 64)   | causal_mask   | torch.bfloat16 |         1959.036 |            1775.056 |         5780.512 |           17390.350 |         1.104 |         0.332 |
| (8, 16, 4096, 64)   | relative_bias | torch.bfloat16 |         1947.198 |            1773.869 |         5780.643 |           16779.699 |         1.098 |         0.345 |
| (8, 16, 4096, 64)   | head_bias     | torch.bfloat16 |         1963.935 |            1829.502 |         5780.018 |           16703.259 |         1.073 |         0.346 |
| (8, 16, 4096, 128)  | noop          | torch.bfloat16 |         3582.711 |            3362.623 |        10436.069 |           36415.565 |         1.065 |         0.287 |
| (8, 16, 4096, 128)  | causal_mask   | torch.bfloat16 |         3581.504 |            3499.472 |        10346.869 |           36164.959 |         1.023 |         0.286 |
| (8, 16, 4096, 128)  | relative_bias | torch.bfloat16 |         3589.779 |            3337.849 |        10529.621 |           36261.696 |         1.075 |         0.290 |
| (8, 16, 4096, 128)  | head_bias     | torch.bfloat16 |         3602.265 |            3436.444 |        10458.660 |           36507.790 |         1.048 |         0.286 |
| (8, 16, 4096, 256)  | noop          | torch.bfloat16 |         7695.923 |            7126.275 |        24643.009 |          140949.081 |         1.080 |         0.175 |
| (8, 16, 4096, 256)  | causal_mask   | torch.bfloat16 |         7679.939 |            7186.252 |        24538.105 |          157156.067 |         1.069 |         0.156 |
| (8, 16, 4096, 256)  | relative_bias | torch.bfloat16 |         7681.374 |            6994.832 |        24549.713 |          140077.179 |         1.098 |         0.175 |
| (8, 16, 4096, 256)  | head_bias     | torch.bfloat16 |         7679.822 |            7212.278 |        24627.823 |          140675.003 |         1.065 |         0.175 |
| (16, 16, 512, 64)   | noop          | torch.bfloat16 |           80.126 |              78.291 |          333.719 |             541.165 |         1.023 |         0.617 |
| (16, 16, 512, 64)   | causal_mask   | torch.bfloat16 |           80.065 |              81.696 |          333.779 |             551.113 |         0.980 |         0.606 |
| (16, 16, 512, 64)   | relative_bias | torch.bfloat16 |           80.138 |              86.715 |          333.364 |             542.118 |         0.924 |         0.615 |
| (16, 16, 512, 64)   | head_bias     | torch.bfloat16 |           80.415 |              85.204 |          333.294 |             536.840 |         0.944 |         0.621 |
| (16, 16, 512, 128)  | noop          | torch.bfloat16 |          134.964 |             138.025 |          607.093 |            1333.102 |         0.978 |         0.455 |
| (16, 16, 512, 128)  | causal_mask   | torch.bfloat16 |          134.192 |             141.523 |          606.269 |            1424.318 |         0.948 |         0.426 |
| (16, 16, 512, 128)  | relative_bias | torch.bfloat16 |          135.711 |             138.639 |          606.283 |            1327.974 |         0.979 |         0.457 |
| (16, 16, 512, 128)  | head_bias     | torch.bfloat16 |          135.552 |             140.555 |          607.107 |            1347.370 |         0.964 |         0.451 |
| (16, 16, 512, 256)  | noop          | torch.bfloat16 |          275.113 |             315.144 |         1301.583 |            5268.153 |         0.873 |         0.247 |
| (16, 16, 512, 256)  | causal_mask   | torch.bfloat16 |          274.867 |             328.106 |         1302.513 |            5770.594 |         0.838 |         0.226 |
| (16, 16, 512, 256)  | relative_bias | torch.bfloat16 |          276.052 |             321.770 |         1302.904 |            5241.920 |         0.858 |         0.249 |
| (16, 16, 512, 256)  | head_bias     | torch.bfloat16 |          271.409 |             328.839 |         1302.142 |            5266.037 |         0.825 |         0.247 |
| (16, 16, 1024, 64)  | noop          | torch.bfloat16 |          260.489 |             237.463 |          955.884 |            1817.558 |         1.097 |         0.526 |
| (16, 16, 1024, 64)  | causal_mask   | torch.bfloat16 |          262.378 |             254.350 |          955.280 |            1843.807 |         1.032 |         0.518 |
| (16, 16, 1024, 64)  | relative_bias | torch.bfloat16 |          261.338 |             268.253 |          956.038 |            1820.036 |         0.974 |         0.525 |
| (16, 16, 1024, 64)  | head_bias     | torch.bfloat16 |          262.153 |             264.156 |          956.023 |            1810.076 |         0.992 |         0.528 |
| (16, 16, 1024, 128) | noop          | torch.bfloat16 |          476.475 |             461.413 |         1760.578 |            4306.521 |         1.033 |         0.409 |
| (16, 16, 1024, 128) | causal_mask   | torch.bfloat16 |          473.794 |             479.178 |         1761.277 |            4619.439 |         0.989 |         0.381 |
| (16, 16, 1024, 128) | relative_bias | torch.bfloat16 |          473.839 |             463.282 |         1758.692 |            4290.562 |         1.023 |         0.410 |
| (16, 16, 1024, 128) | head_bias     | torch.bfloat16 |          472.979 |             472.896 |         1763.086 |            4367.931 |         1.000 |         0.404 |
| (16, 16, 1024, 256) | noop          | torch.bfloat16 |         1014.184 |            1026.764 |         3922.997 |           19104.147 |         0.988 |         0.205 |
| (16, 16, 1024, 256) | causal_mask   | torch.bfloat16 |         1013.217 |            1039.046 |         3928.382 |           21086.281 |         0.975 |         0.186 |
| (16, 16, 1024, 256) | relative_bias | torch.bfloat16 |         1008.519 |            1015.278 |         3922.133 |           18980.652 |         0.993 |         0.207 |
| (16, 16, 1024, 256) | head_bias     | torch.bfloat16 |         1011.360 |            1047.542 |         3931.245 |           19069.172 |         0.965 |         0.206 |
| (16, 16, 4096, 64)  | noop          | torch.bfloat16 |         3929.850 |            3325.667 |        11411.704 |           23344.280 |         1.182 |         0.489 |
| (16, 16, 4096, 64)  | causal_mask   | torch.bfloat16 |         3885.262 |            3581.544 |        11390.515 |           23725.639 |         1.085 |         0.480 |
| (16, 16, 4096, 64)  | relative_bias | torch.bfloat16 |         3865.737 |            3537.308 |        11489.901 |           23406.330 |         1.093 |         0.491 |
| (16, 16, 4096, 64)  | head_bias     | torch.bfloat16 |         3880.530 |            3665.249 |        11484.411 |           23299.496 |         1.059 |         0.493 |
| (16, 16, 4096, 128) | noop          | torch.bfloat16 |         7030.306 |            6745.715 |        20621.264 |           57464.096 |         1.042 |         0.359 |
| (16, 16, 4096, 128) | causal_mask   | torch.bfloat16 |         7095.414 |            7034.385 |        20410.656 |           61660.511 |         1.009 |         0.331 |
| (16, 16, 4096, 128) | relative_bias | torch.bfloat16 |         7084.779 |            6686.497 |        20315.161 |           57243.969 |         1.060 |         0.355 |
| (16, 16, 4096, 128) | head_bias     | torch.bfloat16 |         7075.367 |            6863.305 |        20494.385 |           58481.953 |         1.031 |         0.350 |
| (16, 16, 4096, 256) | noop          | torch.bfloat16 |        15612.741 |           14297.482 |        55306.847 |          281161.865 |         1.092 |         0.197 |
| (16, 16, 4096, 256) | causal_mask   | torch.bfloat16 |        15326.592 |           14263.878 |        55227.806 |          313063.232 |         1.075 |         0.176 |
| (16, 16, 4096, 256) | relative_bias | torch.bfloat16 |        15297.963 |           14007.379 |        54558.029 |          279529.175 |         1.092 |         0.195 |
| (16, 16, 4096, 256) | head_bias     | torch.bfloat16 |        15216.160 |           14276.027 |        55081.581 |          280996.826 |         1.066 |         0.196 |

</details>
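
The speedup columns appear to be the eager time divided by the compiled time, rounded to three decimals, so values below 1.0 mean the compiled kernel is slower. A quick check against the first row of the table (the `speedup` helper is hypothetical):

```python
def speedup(eager_time: float, compiled_time: float) -> float:
    """Ratio above 1.0 means the compiled kernel is faster than eager."""
    return round(eager_time / compiled_time, 3)


# First row: shape (2, 16, 512, 64), score_mod noop.
fwd = speedup(19.936, 19.092)
bwd = speedup(57.851, 193.564)
```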

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125515
Approved by: https://github.com/Chillee
2024-05-16 03:14:27 +00:00
ae6fdfa539 Revert "Initial implementation of AdaRound (#126153)"
This reverts commit 175c18af818804ba8ef433c3eb8488d1a3d1dd9d.

Reverted https://github.com/pytorch/pytorch/pull/126153 on behalf of https://github.com/huydhn due to Sorry for reverting your change but the lint failure is legit because there are more than one lint issues, torch/optim/asgd.py is just the last one ([comment](https://github.com/pytorch/pytorch/pull/126153#issuecomment-2113902522))
2024-05-16 02:34:49 +00:00
e3c5d1b7d7 Revert "[optim] Fix: wrong ASGD implementation (#125440)"
This reverts commit 2c5ad9a3d7ea79ca897aec153a401f4b9175a717.

Reverted https://github.com/pytorch/pytorch/pull/125440 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it looks like there is a linter failure coming from this change ([comment](https://github.com/pytorch/pytorch/pull/125440#issuecomment-2113833108))
2024-05-16 02:12:29 +00:00
175c18af81 Initial implementation of AdaRound (#126153)
Summary:
This is an implementation of AdaRound from a paper https://arxiv.org/abs/2004.10568

This algorithm is going to be used by multiple people, hence we need to make it an official implementation.

Differential Revision: D57227565

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126153
Approved by: https://github.com/jerryzh168
2024-05-16 02:09:18 +00:00
927e631dc2 [inductor][cpp] bf16/fp16 gemm template computed with fp32 w/o epilogue fusion (#126068)
As part of #125683, this PR adds the initial bf16/fp16 gemm template support with micro-gemm implemented with fused type casting and fp32 computation. It doesn't provide epilogue fusion support yet which will be added in the next PR.
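
To illustrate "fused type casting and fp32 computation", here is a scalar sketch that emulates bf16 inputs with an fp32 accumulator using only the standard library. It is not the C++ micro-gemm from this PR: the helper names are hypothetical, the final cast back to bf16 is an assumption, and NaN inputs are not handled.

```python
import struct
from typing import List


def to_fp32(x: float) -> float:
    """Round a Python float (fp64) to the nearest fp32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round a finite value to bf16 using round-to-nearest-even."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties go to even
    bits &= 0xFFFF0000  # keep sign, exponent, and the top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def gemm_bf16_fp32_acc(
    a: List[List[float]], b: List[List[float]]
) -> List[List[float]]:
    """C = A @ B with bf16 inputs, fp32 accumulation, and a bf16 result."""
    rows, inner, cols = len(a), len(b), len(b[0])
    c = []
    for i in range(rows):
        out_row = []
        for j in range(cols):
            acc = 0.0  # emulated fp32 accumulator
            for k in range(inner):
                prod = to_fp32(to_bf16(a[i][k]) * to_bf16(b[k][j]))
                acc = to_fp32(acc + prod)
            out_row.append(to_bf16(acc))
        c.append(out_row)
    return c


result = gemm_bf16_fp32_acc([[1.0, 2.0]], [[3.0], [4.0]])
```

Accumulating in fp32 keeps the rounding error of long dot products well below what a bf16 accumulator would give, which is the usual motivation for this split.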

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126068
Approved by: https://github.com/jansel
ghstack dependencies: #126019
2024-05-16 02:05:49 +00:00
059b68fbdf [DeviceMesh] Fix hash and eq not match (#123572)
Fixes #121799

We fix the DeviceMesh hash such that two meshes are considered equal if they have the same mesh and the same parent_mesh.
Examples can be found here: https://github.com/pytorch/pytorch/issues/121799
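
The underlying rule is the usual Python contract that objects comparing equal must hash equal. A minimal sketch with a hypothetical `MeshSketch` class follows; comparing the parent by identity is an assumption made for brevity, and the actual DeviceMesh fields may differ.

```python
from typing import Optional, Tuple


class MeshSketch:
    """Hypothetical stand-in for DeviceMesh with consistent __eq__ and __hash__."""

    def __init__(
        self, mesh: Tuple[int, ...], parent: Optional["MeshSketch"] = None
    ) -> None:
        self.mesh = mesh
        self.parent = parent

    def _key(self) -> Tuple[Tuple[int, ...], int]:
        # Both __eq__ and __hash__ derive from the same key, so they cannot disagree.
        return (self.mesh, id(self.parent))

    def __eq__(self, other: object) -> bool:
        if not isinstance(other, MeshSketch):
            return NotImplemented
        return self._key() == other._key()

    def __hash__(self) -> int:
        return hash(self._key())


parent = MeshSketch((0, 1, 2, 3))
a = MeshSketch((0, 1), parent)
b = MeshSketch((0, 1), parent)
c = MeshSketch((0, 1))
```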

Also need this to unblock https://github.com/pytorch/pytorch/pull/123394

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123572
Approved by: https://github.com/xunnanxu, https://github.com/wanchaol, https://github.com/yoyoyocmu
2024-05-16 02:00:45 +00:00
1876f0fec1 [dynamo][nn module guards] Use TENSOR_MATCH, and not ID_MATCH, for numpy tensors (#126246)
Fixes speech_transformer regression here - https://hud.pytorch.org/benchmark/torchbench/inductor_no_cudagraphs?startTime=Tue%2C%2007%20May%202024%2019%3A22%3A54%20GMT&stopTime=Tue%2C%2014%20May%202024%2019%3A22%3A54%20GMT&granularity=hour&mode=training&dtype=amp&lBranch=main&lCommit=02093b6c6ae1046368e2500881d0bb5880873386&rBranch=main&rCommit=b24ad7eab55eaf660893dddae949fc714e434338

Thanks to @eellison  and @bdhirsh for isolating the regression to nn module guards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126246
Approved by: https://github.com/jansel
ghstack dependencies: #126203
2024-05-16 01:57:59 +00:00
315389bfed Revert "Remove deprecated _aminmax operator (#125995)"
This reverts commit 0116ffae7f94f35a2c712e186a0b371959b68c64.

Reverted https://github.com/pytorch/pytorch/pull/125995 on behalf of https://github.com/huydhn due to Sorry for reverting your change but we need to reland this after I get rid of all usage of _aminmax internally in Meta ([comment](https://github.com/pytorch/pytorch/pull/125995#issuecomment-2113769497))
2024-05-16 01:45:37 +00:00
6dca1e639b [TEST][Dynamo] fix test_deviceguard.py (#126240)
The `test_device_guard.py` was improperly set up, so there were failures on multi-GPU machines. By design the `DeviceGuard` should keep `idx` the same even after it was applied.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126240
Approved by: https://github.com/jansel
2024-05-16 01:44:42 +00:00
7844c202b2 [inductor][cpp] epilogue support for gemm template (#126019)
As part of #125683, this PR adds the epilogue support for c++ gemm template by reusing the c++ vector codegen on sub-slices of tensors. This is implemented by retracing the epilogue IR nodes with new ranges and offsets. The new `codegen_loop_bodies` and `codegen_functions` methods are added to c++ vector codegen for this purpose. This is leveraged by the `store_output` method of the template kernel for epilogue codegen and store to the final result.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126019
Approved by: https://github.com/jansel
2024-05-16 01:42:29 +00:00
6065a4d46e Revert "Switched from parameter in can_cast to from_. (#126030)"
This reverts commit 06d6bb4ebabc64433224970024ada1781508197d.

Reverted https://github.com/pytorch/pytorch/pull/126030 on behalf of https://github.com/huydhn due to Sorry for reverting your change but i need to revert it to avoid a diff train conflict with https://github.com/pytorch/pytorch/pull/125995.  Please help rebase and I will reland the change ([comment](https://github.com/pytorch/pytorch/pull/126030#issuecomment-2113757469))
2024-05-16 01:42:23 +00:00
5efad4ebc1 [inductor] [FX graph cache] Ignore unbacked symints in guards expression (#126251)
Summary: Found a unit test that was causing an assertion failure during an attempt to use unbacked symints in the guards expression, but it turns out unbacked symints can't affect guards anyway, so we can just filter them out. Also in this diff: test_torchinductor_dynamic_shapes.py was not configured to exercise the codecache because the TestCase setUp method was inadvertently skipping the setUp of the immediate parent class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126251
Approved by: https://github.com/peterbell10
2024-05-16 01:35:41 +00:00
bd63300bae [dynamo][inline-inbuilt-nn-modules] Add and update test_modules.py for inlining work (#126327)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126327
Approved by: https://github.com/williamwen42
ghstack dependencies: #126303, #126316, #126314
2024-05-16 01:35:09 +00:00
7aa068f350 [dynamo][inline-inbuilt-nn-modules] Change test to not depend on id of mod instance (#126314)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126314
Approved by: https://github.com/williamwen42
ghstack dependencies: #126303, #126316
2024-05-16 01:35:09 +00:00
0f8380dd65 [Inductor][Flex-attention] Make num_head support dynamic (#126342)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126342
Approved by: https://github.com/drisspg
2024-05-16 01:33:53 +00:00
f9d107af66 [optim] add fused_adagrad support for CPU device (#124905)
Add fused_adagrad kernel support for CPU.

## Bench result:
32 core/sockets ICX
Test Scripts:
https://gist.github.com/zhuhaozhe/79e842e0a6e25d6d7fa1e4598807272c
https://gist.github.com/zhuhaozhe/b4c6998a509dcea1796dd05b3005c969
```
Tensor Size: 262144, Num Tensor 4, Num Threads: 1
_single_tensor_adagrad time: 0.2500 seconds
_fused_adagrad time: 0.0933 seconds
Tensor Size: 4194304, Num Tensor 32, Num Threads: 32
_single_tensor_adagrad time: 2.8819 seconds
_fused_adagrad time: 1.7591 seconds
```
## Test Plan:
```
python test_optim.py -k test_fused_matches_forloop
python test_optim.py -k test_fused_large_tensor
python test_optim.py -k test_can_load_older_state_dict
python test_optim.py -k test_grad_scaling_autocast_fused_optimizers
python test_torch.py -k test_grad_scaling_autocast_fused
python test_torch.py -k test_params_invalidated_with_grads_invalidated_between_unscale_and_step
```

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124905
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-05-16 01:11:51 +00:00
51e9bb8783 [Export] Allow ExportedProgram to take empty decomp table (#126142)
**As title.**
Still, `ep.run_decompositions()` will use `core_aten_decompositions()` by default. Cases like `ep.run_decompositions(get_decompositions([]))` will use empty table, and go with [`aot_autograd_decompositions`](04877dc430/torch/_functorch/aot_autograd.py (L456-459)) only.

**Motivation**
We didn't have a clean way to pass in an empty decomp table. Since we've made `pre_dispatch` export as default and `ep.run_decompositions` remains with `aot_export_module(..., pre_dispatch=False)`, allowing empty table would help make blank control easier.

**Testing**
CI
Also looked through all the references in fbcode. The only concern I have is whether we should update [this example](04877dc430/torch/onnx/_internal/exporter.py (L817)) or not.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126142
Approved by: https://github.com/angelayi
2024-05-16 00:31:23 +00:00
b3f1882d17 [easy][dynamo][inline-inbuilt-nn-modules] Change test to check for params (#126316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126316
Approved by: https://github.com/williamwen42
ghstack dependencies: #126303
2024-05-16 00:20:58 +00:00
06d6bb4eba Switched from parameter in can_cast to from_. (#126030)
Fixes #126012.

`from` is a reserved keyword in Python, thus we can't make the C++ impl available with `from` as function parameter. This PR changes the name to `from_` and also adjusts the docs.

If we want to preserve backwards compatibility, we can leave the C++ name as-is and only fix the docs. However, `torch.can_cast(from_=torch.int, to=torch.int)` won't work then.
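
The keyword restriction can be checked in plain Python without torch. In this sketch `can_cast` is a hypothetical stand-in that only mirrors the parameter names; it is not `torch.can_cast`.

```python
import keyword


def can_cast(from_, to):
    """Hypothetical stand-in; it only illustrates the parameter naming."""
    return (from_, to)


def compiles(src):
    """Return True if `src` parses as a Python expression."""
    try:
        compile(src, "<example>", "eval")
        return True
    except SyntaxError:
        return False


# `from` is a reserved word, so it cannot appear as a keyword argument.
from_is_keyword = keyword.iskeyword("from")
bad_call_parses = compiles("can_cast(from=1, to=2)")
good_call_parses = compiles("can_cast(from_=1, to=2)")
```

The call is rejected by the parser before any function is looked up, so no runtime shim can make `from=` usable as a keyword argument.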

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126030
Approved by: https://github.com/albanD
2024-05-16 00:09:54 +00:00
3ae118204e Make propagate_real_tensor more safe (#126281)
Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7228787720582401/

There are a few improvements here, which luckily fix some xfails:

* In general, it can be unsafe to call operations on Tensors under a `no_dispatch()` mode that is purely trying to disable ambient modes, because this ALSO disables tensor subclass handling. So we test to see if there is a tensor subclass and don't propagate real tensors if that's the case. Another acceptable outcome might be to try to only disable the ambient fake tensor mode, this would help us propagate real tensors through more exotic tensor types, but I'm not going to do it until someone asks for it.
* We're graph breaking for wrapped tensors too late. Pull it up earlier so we do it before we try to muck around with the real tensor.
* I noticed that occasionally when I do `storage.copy_(real_storage)`, the sizes mismatch. Careful code reading suggests that I should just copy in the real data when the tensor was initially allocated, so that's what I do now, eliminating the need for a storage copy.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126281
Approved by: https://github.com/Skylion007
2024-05-15 23:57:02 +00:00
b2d9b80fba Also remove compile_time_strobelight_meta frame when generating stack (#126289)
I think I also need to fix this in fbcode, leaving that for future work.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126289
Approved by: https://github.com/yanboliang
2024-05-15 23:55:37 +00:00
9c9d0c2fab Add VariableTracker.debug_repr (#126299)
Now you can print arbitrary values at compile time with
comptime.print()

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126299
Approved by: https://github.com/jansel
ghstack dependencies: #126292
2024-05-15 23:55:29 +00:00
a7af53cec1 [FSDP2] support fully_shard(model_on_meta, cpu_offload) (#126305)
support fully_shard(model_on_meta, cpu_offload) when fully_shard is placed outside of `torch.device("meta")`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126305
Approved by: https://github.com/awgu
ghstack dependencies: #126267
2024-05-15 23:29:23 +00:00
bcdd0b11ca [dynamo][inline-inbuilt-nn-modules] Bug fix - Only unspecialized nn modules (#126303)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126303
Approved by: https://github.com/mlazos, https://github.com/laithsakka
2024-05-15 23:23:12 +00:00
5cab7a7662 [dynamo] fix https://github.com/pytorch/pytorch/issues/93624 (#125945)
Fixes https://github.com/pytorch/pytorch/issues/93624 but also requires https://github.com/jcmgray/autoray/issues/20 to be fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125945
Approved by: https://github.com/jansel
ghstack dependencies: #125882, #125943
2024-05-15 23:22:06 +00:00
56a89fcc08 [dynamo] graph break on issubclass call with non-const args (#125943)
Fixes https://github.com/pytorch/pytorch/issues/125942

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125943
Approved by: https://github.com/jansel
ghstack dependencies: #125882
2024-05-15 23:22:06 +00:00
100e3c1205 [dynamo] graph break on const dict KeyError (#125882)
Fixes https://github.com/pytorch/pytorch/issues/125866

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125882
Approved by: https://github.com/jansel
2024-05-15 23:22:06 +00:00
b5432ad5ab Fix triton codegen main do_bench_gpu import error (#126213)
Summary:
Encountered module import error when running triton kernel file.

The cause seems to be D57215950 which changed "do_bench" to "do_bench_gpu" for torch._inductor.runtime.runtime_utils

However, in the codegen, instead we have "from triton.testing import do_bench", so the line below should be reverted back to "do_bench".

Test Plan:
LOGLEVEL=DEBUG TORCH_COMPILE_DEBUG=1 TORCHINDUCTOR_MAX_AUTOTUNE=0 CUDA_VISIBLE_DEVICES=5 TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_OUTPUT='/home/adelesun/mts_profiling/outputs/profile_output.txt' TORCH_LOGS='+inductor,+schedule,output_code' TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 TORCHINDUCTOR_BENCHMARK_KERNEL=1 TORCHINDUCTOR_CACHE_DIR='/home/adelesun/mts_profiling/code' TORCHINDUCTOR_ENABLED_METRIC_TABLES=kernel_metadata buck2 run mode/opt                 -c=python.package_style=inplace                 -c fbcode.enable_gpu_sections=true                 -c fbcode.platform=platform010                 -c fbcode.nvcc_arch=v100,a100,h100                 -c fbcode.split-dwarf=true                 caffe2/torch/fb/model_transform/experimental/benchmark:mts_gpu_benchmark                 --  --local-model /home/adelesun/mts_profiling/inputs/offsite_cvr_model_526372970_793.input.predictor.disagg.gpu.merge --lower-backend AOT_INDUCTOR 2>&1 | tee /home/adelesun/mts_profiling/outputs/benchmark_output.txt

bento console --kernel=aetk --file=/home/adelesun/mts_profiling/code/op/copmbxfunzmywemwmg66lnlcx4apvn2f2vsi3glgisausgfvit4g.py

file ran successfully

Differential Revision: D57345619

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126213
Approved by: https://github.com/shunting314
2024-05-15 22:56:15 +00:00
2c5ad9a3d7 [optim] Fix: wrong ASGD implementation (#125440)
> previous: Originally, the variables `new_eta` and `new_mu` would be constructed `len(grouped_mus)` times, but each of their values is the same and won't be changed. Therefore, it can be simplified using Python list multiplication, which only constructs one tensor.

- [X] Ill assumption that every param will have the same step.
- [x] Different implementation between `foreach=True` and `foreach=False`.
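
The pitfall behind the quoted "previous" note is that list multiplication repeats a reference rather than copying the object. A minimal sketch, using a one-element list as a stand-in for a tensor; the `1 / (1 + step)` formula is made up for illustration and is not the ASGD update:

```python
def shared_values(n):
    value = [1.0]        # stand-in for a single tensor
    return [value] * n   # n references to the *same* object


def per_param_values(steps):
    # One independent object per parameter, so different steps can
    # produce different values (hypothetical formula).
    return [[1.0 / (1 + step)] for step in steps]


etas = shared_values(3)
etas[0][0] = 0.5         # an in-place update is visible through every entry
independent = per_param_values([0, 1, 3])
```

Sharing one object is only valid when every parameter really has the same step, which is the assumption the first checklist item calls out.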
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125440
Approved by: https://github.com/janeyx99
2024-05-15 22:52:15 +00:00
eqy
5af4b49285 Remove expected failure in test_eager_transforms.py (#125883)
Seems to be supported now

CC @tinglvv @nWEIdia @Aidyn-A

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125883
Approved by: https://github.com/Chillee, https://github.com/Aidyn-A
2024-05-15 22:12:07 +00:00
0ca8bf4b41 Enable UFMT on test/test_datapipe.py (#124994)
Part of: #123062

Ran lintrunner on:

- `test/test_datapipe.py`

Detail:

```bash
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

Co-authored-by: Edward Z. Yang <ezyang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124994
Approved by: https://github.com/mikaylagawarecki
2024-05-15 21:58:35 +00:00
cyy
18cbaf6dbf Remove Caffe2 python code (#126035)
Follows the recent changes of Caffe2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126035
Approved by: https://github.com/r-barnes, https://github.com/Skylion007
2024-05-15 21:51:11 +00:00
ad7316b4c2 [CI] Add AMP models in inductor cpu smoketest for performance (#125830)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125830
Approved by: https://github.com/chuanqi129, https://github.com/jgong5, https://github.com/huydhn, https://github.com/desertfire, https://github.com/atalman
2024-05-15 21:46:58 +00:00
f0d34941dd Improve Storage copy_ size mismatch error message (#126280)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126280
Approved by: https://github.com/mikaylagawarecki
2024-05-15 21:14:59 +00:00
d15920a7d0 Warn SDPA users about dropout behavior (#126294)
Fixes #124464
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126294
Approved by: https://github.com/mikaylagawarecki, https://github.com/drisspg
2024-05-15 20:58:23 +00:00
31d22858e9 [onnx.export] Avoid unnecessary copy of debug_names (#123026)
This PR is part of an effort to speed up torch.onnx.export (#121422).

- The `auto debug_names = ` infers a copy, whereas `const auto& debug_names` does not.
- However, this one requires us to be careful, since calls to `setDebugName` change `debug_names` and invalidate the `exist_name` iterator. So if we simply change `auto` to `const auto&`, then between that line and `find` we have corrupted the iterator by calling `output[i]->setDebugName`. This change aims to be functionally equivalent to the original, which is why we first get the Value pointer, then call `output[i]->setDebugName`, and finally call `setDebugName` on the found value. It is possible that functionally it is OK to simply call `output[i]->setDebugName` first and then do the find and the second `setDebugName`, but this would not be identical to current behavior.
- Resolves (2) in #121422.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123026
Approved by: https://github.com/justinchuby
2024-05-15 20:58:18 +00:00
90461d4986 [dynamo] Detect monkeypatching on nn module forward method (#126203)
An alternative was https://github.com/pytorch/pytorch/pull/124975. Though it was safer because it was adding guards for every inlined function, it was causing guard overhead for a few models of > 20%.  The overhead of this PR is minimal for the common unpatched case.
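
For intuition, instance-level monkeypatching of `forward` can be detected cheaply because the assignment lands in the instance `__dict__`. This is a plain-Python sketch with a hypothetical `Module` class; it is an assumption that the PR's check resembles this, and it is not the Dynamo implementation.

```python
class Module:
    def forward(self, x):
        return x + 1


def forward_is_patched(mod):
    # A method defined on the class is not stored on the instance, so
    # finding "forward" in the instance dict means someone assigned it.
    return "forward" in vars(mod)


m = Module()
before = forward_is_patched(m)
m.forward = lambda x: x * 2   # instance-level monkeypatch
after = forward_is_patched(m)
```

A single dictionary membership test like this is cheap in the common unpatched case, in contrast to installing a guard for every inlined function.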

Fixes an internal issue - [fb.workplace.com/groups/1075192433118967/permalink/1411067766198097](https://fb.workplace.com/groups/1075192433118967/permalink/1411067766198097/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126203
Approved by: https://github.com/ezyang
2024-05-15 20:41:13 +00:00
c8130dfe84 [FSDP2] allow meta tensors during loading state dict and cpu offloading (#126267)
unit test: ``pytest test/distributed/_composable/fsdp/test_fully_shard_state_dict.py``

With meta init and cpu offloading, we have meta tensors after `model.load_state_dict(assign=True, strict=False)`. This PR avoids calling `.cpu` on meta tensors, which would otherwise be a runtime error.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126267
Approved by: https://github.com/awgu
2024-05-15 20:35:36 +00:00
d74c89fb10 2 rocm shards on trunk.yml (#125933)
after test removal for windows cpu + avx related configs, it's going to be the long pole for trunk

Just checked: without rocm, avg tts for trunk is 2.5 hrs last week, with rocm it's about 3

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125933
Approved by: https://github.com/ZainRizvi
2024-05-15 20:22:14 +00:00
d2b2727d66 Fix public api allowlist logical merge conflict (#126321)
Skip the newly added bad API from https://github.com/pytorch/pytorch/pull/126212 to keep CI green.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126321
Approved by: https://github.com/ezyang
2024-05-15 20:21:39 +00:00
e2d18228fe [DCP] overwrites existing checkpoint by default (#125877)
Checks for existing checkpoints and overwrites, based on an `overwrite` flag

Differential Revision: [D57186174](https://our.internmc.facebook.com/intern/diff/D57186174/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125877
Approved by: https://github.com/fegin
2024-05-15 20:12:52 +00:00
b659506d82 Parametrize test_dim_reduction (#126292)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126292
Approved by: https://github.com/Skylion007
2024-05-15 19:55:37 +00:00
2086f91c4c Revert "Fix aarch64 debug build with GCC (#126290)"
This reverts commit a961e1ac83bf8831768c5a04eb7c4c18df8988d5.

Reverted https://github.com/pytorch/pytorch/pull/126290 on behalf of https://github.com/malfet due to Indeed lint is broken :/ ([comment](https://github.com/pytorch/pytorch/pull/126290#issuecomment-2113332757))
2024-05-15 19:45:57 +00:00
2978f07d0e [FSDP] Fixed docs for inter/intra node PG helpers (#126288)
1. This fixes an issue where we had 9 ranks in one node and 7 in the other.
2. This makes the notation more explicit that `[0, 7]` is `[0, 1, ..., 7]`.
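
As an illustration of the notation, the rank layout for two nodes with eight ranks each can be computed as below. The function names are hypothetical and this is not the FSDP helper code, only a sketch of the grouping the docs describe.

```python
def intra_node_groups(world_size, ranks_per_node):
    # One group per node: consecutive ranks that share a machine.
    return [
        list(range(start, start + ranks_per_node))
        for start in range(0, world_size, ranks_per_node)
    ]


def inter_node_groups(world_size, ranks_per_node):
    # One group per local rank: the same local index on every node.
    return [
        list(range(local, world_size, ranks_per_node))
        for local in range(ranks_per_node)
    ]


intra = intra_node_groups(16, 8)
inter = inter_node_groups(16, 8)
```

Here `intra[0]` spells out `[0, 1, ..., 7]` in full, and each node gets eight ranks rather than a 9/7 split.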

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126288
Approved by: https://github.com/weifengpy
2024-05-15 19:45:10 +00:00
af9acc4168 Fix public binding to actually traverse modules (#126103)
The current call passes in `['/actual/path']` to os.walk which is a string pointing to no path and thus silently leads to an empty traversal.
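
The silent failure is easy to reproduce: by default `os.walk` swallows the error for a path that does not exist and simply yields nothing. A small sketch, assuming the bad argument ended up as the string form of a list:

```python
import os
import tempfile


def walked_dirs(top):
    # Collect every directory path that os.walk visits.
    return [dirpath for dirpath, _dirnames, _filenames in os.walk(top)]


with tempfile.TemporaryDirectory() as root:
    os.mkdir(os.path.join(root, "pkg"))
    good = walked_dirs(root)         # visits root and root/pkg
    bad = walked_dirs(str([root]))   # "['/tmp/...']" names no real path
```

Passing an `onerror` callback to `os.walk` is one way to turn this kind of mistake into a visible failure.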
There is an unused function just above that handles that, so I guess this is what was supposed to be called.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126103
Approved by: https://github.com/suo
2024-05-15 19:36:03 +00:00
a961e1ac83 Fix aarch64 debug build with GCC (#126290)
By working around GCC's quirks in instantiating templates that require immediate values.
Provide alternative implementation for scaling the output if compiled without any optimizations (both GCC and clang define __OPTIMIZE__ if invoked with anything but -O0)

Fixes https://github.com/pytorch/pytorch/issues/126283

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126290
Approved by: https://github.com/atalman, https://github.com/seemethere
2024-05-15 19:02:21 +00:00
196661255f Enable UFMT format on test/test_utils.py (#125996)
Fixes some files in #123062

Run lintrunner on files:
test/test_utils.py

```bash
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125996
Approved by: https://github.com/ezyang
2024-05-15 18:22:57 +00:00
44efeac24e Beef up error message for pending assert failure (#126212)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126212
Approved by: https://github.com/Skylion007
2024-05-15 18:22:53 +00:00
26f6f98364 Forward fix failures for torch.export switch to predispatch (#126081)
Summary:
Fixes:
- executorch test
- torchrec test

Test Plan: CI

Differential Revision: D57282304

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126081
Approved by: https://github.com/angelayi
2024-05-15 18:13:06 +00:00
0d49c5cb06 Skip padding cost of fusible/planable inputs (#125780)
For mm inputs which are not inputs of the graph, assume that we can memory plan them in the aten.cat and exclude the padding cost in the benchmarking comparison. Technically we also have to do a small amount of 0s writing, but that should be relatively small and encompassed in the weighting of the padding time by `1.1`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125780
Approved by: https://github.com/shunting314
ghstack dependencies: #125772, #125773
2024-05-15 18:05:53 +00:00
4fb5d69b3b Reland '[Inductor] GEMM shape padding improvements (#118522)' (#125773)
Relanding just the pad in a single pass portion of [the pr](https://github.com/pytorch/pytorch/pull/118522). Not including
the transpose logic:

This was previously accepted and reviewed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125773
Approved by: https://github.com/shunting314
ghstack dependencies: #125772
2024-05-15 17:34:41 +00:00
a91311e7c2 [easy] Remove aot_config from pre_compile returns, rename fw_metadata in post_compile (#125854)
This field never changes so pre_compile doesn't need to return it again: remove it just for a cleaner refactor.

As @aorenste points out, the fw_metadata passed to post_compile is actually the fw_metadata after all wrappers' pre_compile calls have run. I want to make this clear in the code, so I renamed the arg in post_compile.

Wrappers that need the exact metadata that they were passed in pre_compile need to save that fw_metadata properly themselves.

Currently, wrappers come in two categories:

1. Wrappers that modify fw_metadata, but then never use fw_metadata in post compile
2. Wrappers that never modify fw_metadata, and only consume the "final" fw_metadata.

So none of the behaviors will change for the existing wrappers. That said, it might be useful to define a "SimpleCompilerWrapper" subclass which guarantees it does not modify fw_metadata. I'll do that in a separate PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125854
Approved by: https://github.com/aorenste, https://github.com/bdhirsh
2024-05-15 17:23:47 +00:00
44e47d5bd0 [onnx.export] Avoid linear loop over symbol_dim_map (#123029)
This PR is part of an effort to speed up torch.onnx.export (#121422).

- Doing a reverse look-up in `symbol_dim_map` incurs a linear cost in number of symbols. This happens for each node, so incurs a quadratic cost to the whole export.
- Add a reverse look-up `dim_symbol_map` that is kept in parallel of `symbol_dim_map`. This avoids a linear time look-up, which creates a quadratic export time complexity.
- This is a highly pragmatic solution. If someone more familiar with the code base has a better solution, I'm interested to hear about it.
- Resolves (9) in #121422.
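
The idea of keeping a reverse map in step with the forward map can be sketched as follows. The class is hypothetical (the real maps live in the C++ exporter) and it assumes each dimension value belongs to at most one symbol.

```python
class SymbolDimMap:
    """Forward and reverse maps updated together (illustrative only)."""

    def __init__(self):
        self.symbol_dim_map = {}   # symbol -> dim
        self.dim_symbol_map = {}   # dim -> symbol, kept in parallel

    def add(self, symbol, dim):
        self.symbol_dim_map[symbol] = dim
        self.dim_symbol_map[dim] = symbol

    def symbol_for_slow(self, dim):
        # Reverse lookup by scanning: linear in the number of symbols.
        for symbol, d in self.symbol_dim_map.items():
            if d == dim:
                return symbol
        return None

    def symbol_for(self, dim):
        # Reverse lookup through the parallel map: one hash lookup.
        return self.dim_symbol_map.get(dim)


m = SymbolDimMap()
for i in range(1000):
    m.add(f"s{i}", f"dim_{i}")
```

When the scan runs once per node, the total cost grows with nodes times symbols; the parallel map removes the per-lookup scan at the price of a second dictionary.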

(partial fix of #121422)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123029
Approved by: https://github.com/justinchuby
2024-05-15 17:22:30 +00:00
490d72e4e6 CMake: Improve check and report of Magma (#117858)
- Only search for magma if it is used (GPU builds)
- Don't report it was not found when it isn't searched for
- Don't report if magma is disabled (currently: "MAGMA not found. Compiling without MAGMA support" is reported)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117858
Approved by: https://github.com/malfet
2024-05-15 17:18:22 +00:00
f91cae461d [Dynamo] SizeVariable supports hasattr (#126222)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126222
Approved by: https://github.com/williamwen42, https://github.com/anijain2305
2024-05-15 17:16:36 +00:00
c1dc8bb858 [DTensor] Turn on foreach implementation of optimizer for DTensor by default (#123394)
Append DTensor to the optimizer `_foreach_supported_types` and turn on foreach implementation of optimizer for DTensor if not specified by the users.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123394
Approved by: https://github.com/wanchaol
2024-05-15 16:45:42 +00:00
4ab2c399be Faster int8 quantized (#125704)
Or my journey to learn how to write fast Metal kernels (more details would be posted [here](https://github.com/malfet/llm_experiments/tree/main/metal-perf) )

Using gpt-fast as a benchmark (by running `python generate.py --checkpoint_path checkpoints/stories110M/model_int8.pth --device mps`)

Before the change, on M2 Pro I get 50 tokens per sec
After adding a very naive
```metal
template<typename T>
kernel void int8pack_mm(
    constant T                 * A              [[buffer(0)]],
    constant char              * B              [[buffer(1)]],
    constant T                 * scales         [[buffer(2)]],
    device   T                 * outputData     [[buffer(3)]],
    constant uint3             & sizes          [[buffer(4)]],
    uint                         thread_index   [[thread_position_in_grid]]) {
    const uint lda = sizes.y;
    const uint ldc = sizes.z;
    const uint m = thread_index / sizes.z; // 0..sizes.x-1
    const uint n = thread_index % sizes.z; // 0..sizes.z-1
    constant T *A_ptr = A + m * lda;
    constant char *B_ptr = B + n * lda;

    float rc = 0.0;
    for(uint k = 0; k < sizes.y;  k++) {
      const auto a_val = float(A_ptr[k]);
      const auto b_val = float(B_ptr[k]);
      rc += a_val * b_val;
    }
    outputData[thread_index] = T(rc * float(scales[n]));
}
```
Perf dropped down to a sad 15 tokens per second.
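
For readers less familiar with Metal, the naive kernel above can be read as the following pure-Python reference, where each loop iteration plays the role of one GPU thread. It assumes `sizes` is `(M, K, N)`, `A` is `M x K`, and `B` is stored as `N x K`, which is what the indexing in the kernel implies; the function name is hypothetical.

```python
def int8pack_mm_reference(A, B, scales):
    """A: M x K floats, B: N x K int8 values, scales: N floats."""
    M, K, N = len(A), len(A[0]), len(B)
    out = [0.0] * (M * N)
    for thread_index in range(M * N):   # one iteration per "thread"
        m = thread_index // N           # output row
        n = thread_index % N            # output column
        rc = 0.0
        for k in range(K):
            rc += float(A[m][k]) * float(B[n][k])
        # Each output column has its own dequantization scale.
        out[thread_index] = rc * float(scales[n])
    return out


result = int8pack_mm_reference(
    [[1.0, 2.0], [3.0, 4.0]],
    [[1, 0], [0, 1], [1, 1]],
    [1.0, 2.0, 0.5],
)
```

Every thread walks a full row of `A` and a full row of `B` on its own, so nothing is shared between neighbouring threads; that is the inefficiency the later steps address.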
Replacing inner loop with vectorized operations
```metal
    float rc = 0.0;
    for(uint k = 0; k < sizes.y/4;  k++) {
      const auto a_val = float4(A_ptr[k]);
      const auto b_val = float4(B_ptr[k]);
      rc += dot(a_val, b_val);
    }
```
Perf jumps back up to 53 tokens per second, but it's a bit of a lie when it comes to llama2-7B perf.

The next step in unlocking the performance was to replace a 1D grid with a 2D one, but limit the thread group size to a single row, which results in much better data locality (which unfortunately is not observable with `stories110M` anymore, as its small model size and Python runtime overhead hide the perf gain).

There were several unsuccessful attempts at caching inputs in thread local memory or using `float4x4` to speed up computation. But the key to unlocking the perf was a comment in 631dfbe673/mlx/backend/metal/kernels/gemv.metal (L184)
which hinted at exploiting both SIMD groups and thread local caches, which resulted in a 5x jump in performance compared to the initial vectorization approach and a 3x perf jump in the end-to-end llama7b test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125704
Approved by: https://github.com/mikekgfb
2024-05-15 16:39:24 +00:00
719a8f42bf Forward fix lint after #125747 (#126295)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126295
Approved by: https://github.com/atalman
2024-05-15 16:37:48 +00:00
9689532106 [CI] 3 procs non cuda (#125932)
Too lazy to figure out actual time reduction here, I'll figure it out later.  Also I'd rather get an average of a couple of runs on trunk rather than just this one PR
Things got faster. Source? Trust me bro

* rel to https://github.com/pytorch/pytorch/pull/125598

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125932
Approved by: https://github.com/ZainRizvi
2024-05-15 16:18:36 +00:00
718bb9016f Revert "[Memory Snapshot] Add recordAnnotations to capture record_function annotations (#124179)"
This reverts commit 187aeaeabf612824c2d0e9be72f80ce6612760d4.

Reverted https://github.com/pytorch/pytorch/pull/124179 on behalf of https://github.com/clee2000 due to test_tensorexpr.py::TestTensorExprFuser::test_simple_add is causing a segfault https://github.com/pytorch/pytorch/actions/runs/9097383783/job/25007155440 187aeaeabf, test was skipped due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/124179#issuecomment-2112948246))
2024-05-15 16:11:47 +00:00
f9dda37a74 [export] Cover more cases to copy tensor conversions. (#125628)
Summary:
Previously we tried to convert all .to() calls to to_copy in the graph; now some users report that other methods like .float() are not covered: https://github.com/pytorch/PiPPy/issues/1104#issuecomment-2093352734

I think fundamentally .float() should look similar to .to() in export and this diff tries to expand the coverage of the tensor conversion methods here.

Test Plan: buck run mode/opt caffe2/test:test_export -- -r float_conversion

Differential Revision: D56951634

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125628
Approved by: https://github.com/tugsbayasgalan
2024-05-15 15:50:21 +00:00
c53e0ac7ba [Inductor] Generalize new introduced device-bias code. (#126261)
We found some Inductor test case failures when enabling Inductor UT for Intel GPU. The root cause is newly introduced device-biased Inductor code from recent community PRs, which causes different behaviors between Intel GPU and CUDA. This PR generalizes that code to align the behaviors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126261
Approved by: https://github.com/EikanWang, https://github.com/peterbell10
2024-05-15 15:05:07 +00:00
ba3cd6e463 Enable UFMT on test/test_fake_tensor.py, test/test_flop_counter.py and some files (#125747)
Part of: #123062

Ran lintrunner on:

- test/test_fake_tensor.py
- test/test_flop_counter.py
- test/test_function_schema.py
- test/test_functional_autograd_benchmark.py
- test/test_functional_optim.py
- test/test_functionalization_of_rng_ops.py

Detail:

```bash
$ lintrunner -a --take UFMT --all-files
ok No lint issues.
Successfully applied all patches.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125747
Approved by: https://github.com/malfet
2024-05-15 14:50:14 +00:00
187aeaeabf [Memory Snapshot] Add recordAnnotations to capture record_function annotations (#124179)
Summary: Add new traceEvents into Memory Snapshot for record_function annotations. These will capture both the profiler's step annotation as well as user annotations.

Test Plan:
CI

New Snapshot Generated:
devvm2184.cco0.facebook.com.Apr_19_13_27_14.3072800.snapshot.pickle

Snippet of Snapshot device_traces show `ProfilerStep#0`, and `## forward ##` annotations:
```
[[{'action': 'user_defined',
   'addr': 0,
   'size': 0,
   'stream': 0,
   'time_us': 1713558427168556,
   'frames': [{'name': 'START', 'filename': 'ProfilerStep#0', 'line': 0}]},
  {'action': 'user_defined',
   'addr': 0,
   'size': 0,
   'stream': 0,
   'time_us': 1713558427168738,
   'frames': [{'name': 'END', 'filename': 'ProfilerStep#0', 'line': 0}]},
  {'action': 'user_defined',
   'addr': 0,
   'size': 0,
   'stream': 0,
   'time_us': 1713558427168865,
   'frames': [{'name': 'START', 'filename': 'ProfilerStep#1', 'line': 0}]},
  {'action': 'user_defined',
   'addr': 0,
   'size': 0,
   'stream': 0,
   'time_us': 1713558427168920,
   'frames': [{'name': 'START', 'filename': '## forward ##', 'line': 0}]},
  {'action': 'alloc',
   'addr': 140166073581568,
   'size': 3211264,
   'stream': 0,
   'time_us': 1713558427172978,
   'frames': [{'name': '_conv_forward',
     'filename': '/mnt/xarfuse/uid-416185/235d4caf-seed-nspid4026531836_cgpid32884718-ns-4026531840/torch/nn/modules/conv
```
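
Given entries shaped like the snippet above, the annotation events can be separated from allocator events with ordinary dictionary handling. The helper names below are hypothetical, and the sample list copies only the first two entries shown in the snippet.

```python
def annotation_events(device_trace):
    """Yield (marker, label, time_us) for each record_function annotation."""
    for event in device_trace:
        if event["action"] != "user_defined":
            continue   # skip alloc/free and other allocator events
        for frame in event["frames"]:
            yield frame["name"], frame["filename"], event["time_us"]


def annotation_durations(device_trace):
    """Pair START/END markers by label and return {label: duration_us}."""
    starts, durations = {}, {}
    for marker, label, time_us in annotation_events(device_trace):
        if marker == "START":
            starts[label] = time_us
        elif marker == "END" and label in starts:
            durations[label] = time_us - starts.pop(label)
    return durations


sample_trace = [
    {"action": "user_defined", "addr": 0, "size": 0, "stream": 0,
     "time_us": 1713558427168556,
     "frames": [{"name": "START", "filename": "ProfilerStep#0", "line": 0}]},
    {"action": "user_defined", "addr": 0, "size": 0, "stream": 0,
     "time_us": 1713558427168738,
     "frames": [{"name": "END", "filename": "ProfilerStep#0", "line": 0}]},
]
durations = annotation_durations(sample_trace)
```

In this layout the annotation label is carried in the frame's `filename` field and the START/END marker in `name`, which is why the helper reads them in that order.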

Differential Revision: D55941362

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124179
Approved by: https://github.com/zdevito
2024-05-15 14:19:40 +00:00
ee8c1550d6 [AOTI][torchgen] Add a few more fallback ops (#126013)
Summary: They appear in some unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126013
Approved by: https://github.com/chenyang78
ghstack dependencies: #125962
2024-05-15 12:56:07 +00:00
563aa3e035 [AOTI][torchgen] Update NativeFunctionsGroup mapping (#125962)
Summary: When looking up which backend call to use for a fallback op (see get_backend_index_for_aoti), sometimes we need to search for a NativeFunction's structured delegate. The previous str:NativeFunctionsGroup dict missed some cases, such as aten.index.Tensor, and that's why aten.index.Tensor was specified in the fallback_ops list but no C shim entry was generated for it. This PR uses a more robust OperatorName:NativeFunctionsGroup mapping.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125962
Approved by: https://github.com/chenyang78
2024-05-15 12:56:07 +00:00
a0aaf56114 Don't assert about pending when we are peeking (#126239)
Internal xref https://fb.workplace.com/groups/6829516587176185/posts/7211398545654652/

In particular, when we're collecting forward metadata, we aren't going
to discharge any of the pending, so we'll be continuously collecting
more and more pending symbols that we may not be able to resolve.  This
is fine.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126239
Approved by: https://github.com/lezcano
2024-05-15 12:18:34 +00:00
8f30f367d0 [CUDA] [CI] Add cu124 docker images (#125944)
Fixes issues encountered in https://github.com/pytorch/pytorch/pull/121956

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125944
Approved by: https://github.com/atalman
2024-05-15 09:52:38 +00:00
f060b0c6e6 [inductor][cpp] GEMM template (infra and fp32) (#124021)
This PR adds the Cpp template infrastructure and the initial FP32 gemm template. See RFC https://github.com/pytorch/pytorch/issues/125683 for more background info.
1. Cpp template infrastructure
Similar template abstractions as the CUTLASS template, i.e., `CppTemplate`, `CppTemplateKernel`, `CppTemplateBuffer`. The MicroGemm micro-kernel abstraction that can be used by Cpp GEMM templates.
2. Initial FP32 gemm template
This involves a GEMM template implementation `CppPackedGemmTemplate` that supports GEMM with constant weight (`B`), requiring `N` to be a multiple of register blocking while allowing static or dynamic sizes for the `M` (batch dim) of `A`. The `B` matrix would be prepacked. This is a typical setting for inference workloads. The template handles the thread decomposition (via `thread_blocking`) and cache blocking (via `cache_blocking`). Then it invokes `CppMicroGemm` which handles register blocking, instruction selection, and other CPU architecture-specific optimizations. A `CppMicroGemmFP32Vec` micro-kernel implementation is provided for fp32 matmuls implemented with ATen vec abstraction.
3. Correctness and performance
The changes have been validated with fp32 inference on the three benchmark suites (torchbench, huggingface and timm_models) with both static and dynamic shapes. Since it is an initial implementation, we are still working on further performance improvements in follow-up PRs, including kernel optimizations as well as fusions. The perf gains are only observed on a select number of models compared to the ATen kernels, which are implemented with MKL. The perf gains are more obvious with dynamic shapes since MKL only supports packed gemm for static shapes. Below are details.

Static shapes
| Benchmark | torchbench | huggingface | timm_models |
|------------|-------------|--------------|--------------|
| Multi-threaded (baseline) | 1.47x | 1.36x | 1.91x |
| Multi-threaded (max-autotune) | 1.47x | 1.36x | 1.92x |
| Single-threaded (baseline) | 1.56x | 1.19x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.52x |

Key models being sped up:
drq: 1.14x
soft_act: 1.12
cait_m36_384: 1.18x

Dynamic shapes
| Benchmark | torchbench | huggingface | timm_models |
| --- | --- | --- | --- |
| Multi-threaded (baseline) | 1.43x | 1.28x | 1.85x |
| Multi-threaded (max-autotune) | 1.47x | 1.28x | 1.85x |
| Single-threaded (baseline) | 1.55x | 1.20x | 1.51x |
| Single-threaded (max-autotune) | 1.56x | 1.19x | 1.53x |

Key models being sped up:
BERT_pytorch: 1.22x
pyhpc_turbulent: 1.13x
soft_actor_critic: 1.77x
BlenderbotForCausalLM: 1.09x
cait_m36_384: 1.17x
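
The split described in item 2 (cache blocking in the template, with the innermost tile handed to a micro-kernel) can be illustrated with a pure-Python sketch. This is not the code that `CppPackedGemmTemplate` generates: the names `micro_gemm` and `blocked_gemm` and the block sizes are hypothetical, and thread decomposition, `B` prepacking, and vectorization are left out.

```python
# Illustrative sketch only: names and block sizes are hypothetical,
# not taken from the Inductor C++ templates.

def micro_gemm(A, B, C, m0, m1, n0, n1, k0, k1):
    """Accumulate A[m0:m1, k0:k1] @ B[k0:k1, n0:n1] into C[m0:m1, n0:n1]."""
    for i in range(m0, m1):
        for j in range(n0, n1):
            acc = C[i][j]
            for k in range(k0, k1):
                acc += A[i][k] * B[k][j]
            C[i][j] = acc


def blocked_gemm(A, B, mc=2, nc=2, kc=2):
    """Compute A @ B by walking cache-sized blocks and calling the micro-kernel."""
    M, K, N = len(A), len(B), len(B[0])
    C = [[0.0] * N for _ in range(M)]
    for m0 in range(0, M, mc):
        for n0 in range(0, N, nc):
            for k0 in range(0, K, kc):
                # Clamp each block to the matrix bounds so ragged edges work.
                micro_gemm(A, B, C,
                           m0, min(m0 + mc, M),
                           n0, min(n0 + nc, N),
                           k0, min(k0 + kc, K))
    return C
```

In this sketch the outer loops only decide which tile to visit, so the tile sizes can be tuned without touching the inner kernel.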

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124021
Approved by: https://github.com/jansel
2024-05-15 08:14:51 +00:00
79655a1321 Add force_disable_caches to the docs (#126184)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126184
Approved by: https://github.com/msaroufim
2024-05-15 07:16:08 +00:00
2d35b4564a [audio hash update] update the pinned audio hash (#126248)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126248
Approved by: https://github.com/pytorchbot
2024-05-15 05:45:16 +00:00
03467b3fed Add a few "warm start" smoketest runs to CI (#125955)
Summary:
Not sure which to choose, so my criteria were:
1) We care about huggingface as part of internal milestones
2) This handful of models seems to particularly benefit from caching
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125955
Approved by: https://github.com/desertfire
ghstack dependencies: #125917, #125953
2024-05-15 05:32:06 +00:00
c87c39d935 [benchmarking] Suppress csv creation on cold-start phase of --warm-start-latency (#125953)
Summary: It seems that most (all?) of our utilities for examining benchmark output expect single-line entries per benchmark. The way the --warm-start-latency flag is currently implemented, it means that we'll see two entries for every benchmark run (one for the warm-up run and one for the actual run). This PR adds a --disable-output flag that we can use for the first run to suppress populating the csv. This way, the existing utilities like `benchmarks/dynamo/check_accuracy.py` will function without any changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125953
Approved by: https://github.com/desertfire
ghstack dependencies: #125917
2024-05-15 05:32:06 +00:00
9f0d3f71c9 Adjust number of repeats when using --warm-start-latency benchmark flag (#125917)
Summary: In --warm-start-latency mode, we can just perform the cache-warmup run once instead of whatever was provided with --repeat

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125917
Approved by: https://github.com/desertfire
2024-05-15 05:32:06 +00:00
0dedc1aff2 Update CUDA out of memory mesage with private pool info (#124673)
Fixes https://github.com/pytorch/pytorch/issues/121932

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124673
Approved by: https://github.com/eellison, https://github.com/eqy
2024-05-15 05:30:47 +00:00
5178baefa9 use statically known instead of suppress guard for ddp stride propagation (#126234)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126234
Approved by: https://github.com/ezyang
2024-05-15 05:21:55 +00:00
e74a6f487a [Inductor] Skip test_nll_loss_backward for intel GPU. (#126157)
Skip this test case because Triton `mask_load` behaves differently on XPU than on CUDA. We submitted issue #126173 to elaborate on the root cause. We intend to skip this case for XPU first, as we need some time to fix the issue and fully validate it before updating the Triton commit pin for Intel GPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126157
Approved by: https://github.com/EikanWang, https://github.com/peterbell10, https://github.com/desertfire
2024-05-15 05:16:07 +00:00
FEI
b950217f19 Support third-party devices emit a range for each autograd operator (#125822)
Fixes #125752

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125822
Approved by: https://github.com/aaronenyeshi
2024-05-15 05:06:24 +00:00
cyy
bdea4904c1 Add some type annotations to python stream and event classes (#126171)
For recent device agnostic code changes, we need type hinting on the parent classes for better tooling support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126171
Approved by: https://github.com/ezyang
2024-05-15 04:58:07 +00:00
7dfd2949d7 Add missing type uint16, uint32, and uint64 to TensorHash in LTC. (#125972)
If I do:

```
import torch
import torch_xla.core.xla_model as xm

xla_device = xm.xla_device()
xla_tensor_0 = torch.tensor(42, dtype=torch.uint32).to(xla_device)
```

I got the error:

```
RuntimeError: false INTERNAL ASSERT FAILED at "/ansible/pytorch/torch/csrc/lazy/core/hash.h":139, please report a bug to PyTorch. Unsupported scalar type:UInt16
```

This PR intends to fix this issue.
The data type can be found in pytorch/c10/core/ScalarType.h.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125972
Approved by: https://github.com/JackCaoG
2024-05-15 04:57:08 +00:00
dfab69fdf1 [Inductor] Flex attention supports dynamic shape (#125994)
## static shapes perf
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod   | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|-------------|----------------|
| Average |     0.692 |              |             |             |             |            |             |                |
| Max     |     0.855 |           16 |          16 |        4096 |        4096 |         64 | head_bias   | torch.bfloat16 |
| Min     |     0.419 |            8 |          16 |         512 |         512 |        256 | noop        | torch.bfloat16 |
```
## dynamic shapes perf
```
| Type    |   Speedup |   batch_size |   num_heads |   q_seq_len |   k_seq_len |   head_dim | score_mod     | dtype          |
|---------|-----------|--------------|-------------|-------------|-------------|------------|---------------|----------------|
| Average |     0.670 |              |             |             |             |            |               |                |
| Max     |     0.864 |           16 |          16 |        4096 |        4096 |         64 | relative_bias | torch.bfloat16 |
| Min     |     0.376 |            8 |          16 |         512 |         512 |        256 | relative_bias | torch.bfloat16 |
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125994
Approved by: https://github.com/Chillee
2024-05-15 04:43:24 +00:00
1485621ccb [BE] Abstract out strings to top of file (#125640)
Summary:
Move const strings to top of file. This is in preparation of tooling to
make use of shared constants (e.g. version string). A non-functional change.
Ideally we want these const strings to be available from both C++ and Python - but I haven't figured out how to correctly share things in PyTorch. I'll do this in a subsequent change.

Test Plan:
python test/distributed/test_c10d_nccl.py NCCLTraceTest

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125640
Approved by: https://github.com/wconstab
2024-05-15 03:38:30 +00:00
24c30096e3 Set dtype when copying empty tensor (#126124)
Summary: Forward fix D57251348

Test Plan: `buck2 test 'fbcode//mode/dev' fbcode//executorch/kernels/test:aten_op_copy_test`

Differential Revision: D57304360

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126124
Approved by: https://github.com/bdhirsh
2024-05-15 03:25:07 +00:00
51ed4c46cf [Dynamo] Supports torch._C._is_any_autocast_enabled (#126196)
Fixes #126026

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126196
Approved by: https://github.com/anijain2305
2024-05-15 03:16:13 +00:00
314ba13f01 Support trace_subgraph in _MakefxTracer (#125363)
Adds trace_subgraph to _MakefxTracer; the motivation is in https://github.com/pytorch/pytorch/pull/122972. Also migrates all existing usage of reenter_make_fx to the new sub-tracer. Previously, the torch function mode for creating torch_fn metadata wasn't re-entered when we're in ProxyTensorMode (since it's inside of __torch_function__). This PR reconstructs the torch function mode based on the parent tracer's config and re-enters the torch function mode so the metadata is shown in the graph.

**Test Plan:**
Existing tests. We have a bunch of make_fx tests for cond, map and while_loop. Also remove expected failure for torch_fn since reenter_make_fx is able to re-construct torch function modes.

Also fixes https://github.com/pytorch/pytorch/issues/124643

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125363
Approved by: https://github.com/Chillee
ghstack dependencies: #125267
2024-05-15 03:12:24 +00:00
73d8c10f13 Refactor make_fx to better support hop subgraph tracing (#125267)
Code movement + minor rewrites. We extract the states of make_fx out and encapsulate them into a _MakefxTracer class. This allows us to create a new make_fx_tracer when tracing subgraphs, the actual logic for tracing subgraph is in the next diff.

Test Plan:
Existing tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125267
Approved by: https://github.com/Chillee
2024-05-15 03:12:24 +00:00
470723faea [pipelining] Add manual pipeline stage (#126123)
Add `ManualPipelineStage` under `_PipelineStage.py`

Fix some type hints since `args_recv_info` can contain more than one RecvInfo. Previously the hint was `Tuple[InputInfo]` which meant it is a tuple of size 1. This is different from `List[InputInfo]` which can contain any number of items. I needed to update to `Tuple[InputInfo, ...]` to make the number of items flexible.
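
A short sketch of the distinction between the three hints. The `InputInfo` class here is a hypothetical stand-in, not the real pipelining type.

```python
from typing import List, Tuple, get_args


class InputInfo:
    """Hypothetical stand-in for the pipelining InputInfo type."""


# A tuple holding exactly one InputInfo.
OneInfo = Tuple[InputInfo]
# A tuple holding any number of InputInfo items (including zero).
ManyInfos = Tuple[InputInfo, ...]
# A list holding any number of InputInfo items.
InfoList = List[InputInfo]

one: OneInfo = (InputInfo(),)
many: ManyInfos = (InputInfo(), InputInfo(), InputInfo())
```

A static type checker would reject assigning the three-element tuple to a name annotated as `OneInfo`, while `ManyInfos` accepts it.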

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126123
Approved by: https://github.com/kwen2501
2024-05-15 00:55:15 +00:00
dccb5cf7ca Allow for trailing 'a' in sm_arch (#126185)
# Summary
I was getting
``` Shell
File "/home/drisspg/meta/pytorch/torch/cuda/__init__.py", line 312, in _lazy_init
    raise DeferredCudaCallError(msg) from e
torch.cuda.DeferredCudaCallError: CUDA call failed lazily at initialization with error: invalid literal for int() with base 10: '90a'
```
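
The failure comes from passing a string such as `'90a'` straight to `int()`. One way to tolerate the suffix is sketched below; this is an illustration, not necessarily how the PR implements it. The helper name `parse_sm_arch` and the assumed `sm_<digits>[letter]` format are hypothetical.

```python
import re


def parse_sm_arch(arch: str) -> int:
    """Return the numeric part of an arch string such as 'sm_90' or 'sm_90a'."""
    # Assumed format: 'sm_' + digits + at most one trailing lowercase letter.
    match = re.fullmatch(r"sm_(\d+)[a-z]?", arch)
    if match is None:
        raise ValueError(f"unrecognized arch string: {arch!r}")
    return int(match.group(1))
```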

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126185
Approved by: https://github.com/Skylion007
2024-05-15 00:16:42 +00:00
92eb1731d4 [torch/distributed] Bugfix: wait for all child procs to exit before c… (#125969)
Observed Problem
---------------------

When `torchrun` has finished running the main trainer function (aka entrypoint/user function) successfully, I noticed that sometimes it SIGTERMS the child processes. Then `torchrun` exits successfully.

This results in misleading warning log messages towards the end of the job like the one below:

```
W0510 14:52:48.185934  672413 api.py:513] Closing process 675171 via signal SIGTERM
W0510 14:52:48.185984  672413 api.py:513] Closing process 675172 via signal SIGTERM
W0510 14:52:48.186013  672413 api.py:513] Closing process 675174 via signal SIGTERM
# <---- ^^^ ??? everything runs successfully but child still SIGTERM'ed? ^^^ --->

I0510 14:52:48.229119  672413 api.py:877] [main] worker group successfully finished. Waiting 300 seconds for other agents to finish.
I0510 14:52:48.229161  672413 api.py:922] Local worker group finished (WorkerState.SUCCEEDED). Waiting 300 seconds for other agents to finish
I0510 14:52:48.229395  672413 api.py:936] Done waiting for other agents. Elapsed: 0.0001709461212158203 seconds
I0510 14:52:48.257544  672413 dynamic_rendezvous.py:1131] The node 'localhost_672413_0' has closed the rendezvous 'torchrun_qpfd'.
I0510 14:52:48.568198  672413 distributed.py:200] Deleting temp log directory: /tmp/torchrun_udgp8zoq
I0510 14:52:48.568989  672413 distributed.py:202] Finished running `main`
```

Root Cause
------------------

I noticed that this was due to the incorrect usage of `torch.multiprocessing.ProcessContext.join()` in `torch.distributed.elastic.multiprocessing.api.MultiprocessingContext`.

`torch.multiprocessing.ProcessContext.join()` does not actually wait for ALL child procs to exit, but rather waits for **at-least-one** child proc to exit. If only a subset of the child procs have exited, it returns `False` and if all child procs have exited it returns `True`.

`torch.distributed.elastic.multiprocessing.api.MultiprocessingContext` was assuming that `torch.multiprocessing.ProcessContext.join()` blocks indefinitely until all child procs have exited.

Fix
---------

The fix is simple: loop, calling `pc.join()` until it returns `True`.

> **NOTE**: that the indefinite blocking is NOT an issue since by the time `torch.distributed.elastic.multiprocessing.api.MultiprocessingContext` calls `pc.join()` it already did all the checking to validate that the entrypoint functions either return successfully or that one of them has failed. So we are really just waiting for the unix process to exit after running the entrypoint function.

> **NOTE**: since `pc.join()` already blocks until at-least-one child proc exits, there is no need to add a polling interval in the body of the loop and the debug logging will show at most `nproc_per_node` times so no log spamming is observed.
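
A minimal sketch of that loop, assuming only the `join()` contract described above (it returns `False` while some children are still running and `True` once all have exited). `FakeProcessContext` and `wait_for_all` are hypothetical names used so the example runs without spawning processes. They are not part of `torch.multiprocessing`.

```python
class FakeProcessContext:
    """Hypothetical stand-in that mimics the described join() contract."""

    def __init__(self, num_procs: int) -> None:
        self.remaining = num_procs

    def join(self) -> bool:
        # Each call "waits" for one more child to exit, then reports
        # whether every child has exited.
        if self.remaining > 0:
            self.remaining -= 1
        return self.remaining == 0


def wait_for_all(pc) -> int:
    """Call pc.join() until it returns True; return the number of calls."""
    calls = 1
    while not pc.join():
        calls += 1
    return calls
```

With three simulated children the loop calls `join()` three times, which matches the note that the debug log appears at most `nproc_per_node` times.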

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125969
Approved by: https://github.com/d4l3k
2024-05-15 00:13:08 +00:00
e5cce35c21 Remove use of USE_C10D (#126120)
As per https://github.com/pytorch/pytorch/blob/main/torch/CMakeLists.txt#L271 the USE_DISTRIBUTED and USE_C10D are equivalent. In another PR I was cleaning this usage up so also cleaning it up here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126120
Approved by: https://github.com/aaronenyeshi
2024-05-15 00:00:26 +00:00
fd48fb9930 Revert "[CUDA] [CI] Add cu124 docker images (#125944)"
This reverts commit 5fb4a766b88bcf633a23610bd66de0f3020f7c66.

Reverted https://github.com/pytorch/pytorch/pull/125944 on behalf of https://github.com/nWEIdia due to test failure seems related 5fb4a766b8 https://github.com/pytorch/pytorch/actions/runs/9085206167/job/24972040039 ([comment](https://github.com/pytorch/pytorch/pull/125944#issuecomment-2111321724))
2024-05-14 23:29:26 +00:00
b6d8b256e6 Revert "[inductor][cpp] GEMM template (infra and fp32) (#124021)"
This reverts commit 037615b989b37b1bf5eff0c031055fc8d1fbe5ae.

Reverted https://github.com/pytorch/pytorch/pull/124021 on behalf of https://github.com/huydhn due to Sorry for reverting your change, but it is failing inductor.test_unbacked_symints.TestUnbackedSymintsCPU::test_autotuning_cpu ([comment](https://github.com/pytorch/pytorch/pull/124021#issuecomment-2111318883))
2024-05-14 23:26:15 +00:00
c1aa05f80c [easy][dynamo] Use disable_dynamo for torch.manual_seed (#126192)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126192
Approved by: https://github.com/yanboliang
ghstack dependencies: #126191
2024-05-14 23:20:32 +00:00
c6f3f1d239 [reland][dynamo][disable] Move disable impl to its own __call__ method (#126191)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126191
Approved by: https://github.com/yoyoyocmu, https://github.com/yanboliang, https://github.com/fegin
2024-05-14 23:20:32 +00:00
41fabbd93f Fanatically correct real tensor cloning for propagate_real_tensors (#126175)
Internal xref:
https://fb.workplace.com/groups/6829516587176185/posts/7211398545654652/

Previously I did it in a crappy way using clone_input in the callback,
but this results in tensors that don't have quite the same
size/stride/storage offset and there was an internal test case where
not having completely accurate information was causing a downstream
problem in propagation.  So now I make real tensors as similar to their
fake equivalents as much as possible.  Though... I don't bother with
autograd lol.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126175
Approved by: https://github.com/albanD
2024-05-14 23:14:17 +00:00
328b75d1a0 Enable epilogue fusion benchmarking internally (#125455)
Differential Revision: [D56920738](https://our.internmc.facebook.com/intern/diff/D56920738)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125455
Approved by: https://github.com/Chillee
2024-05-14 23:06:29 +00:00
e046c59e5b [export] handle aliased/unused params for unflattening (#125758)
Aliased and unused params are currently an issue for strict-mode export. For a model like this:
```
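import torch
from torch import nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        # ...
        self.alpha = nn.Parameter(torch.randn(4))
        self.beta = self.alpha   # alias of alpha
        self.gamma = self.alpha  # alias of alpha, unused in forward
    def forward(self, x):
        return x + self.beta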
```
Dynamo will trace only 1 parameter (beta) and assign a dynamo name (e.g. `L__self___beta`) which can be difficult to match to the correct FQN in the original eager module. This leads to export graph signature potentially having the incorrect target FQN for the parameter, leading to downstream issues unflattening (the parameter may be assigned to the wrong target attribute, mismatching the relevant placeholder node in the unflattened module).

This handles aliasing issues by assigning all tensors present in the state dict as module attributes, even if they're unused. Still, only the used tensors will appear in the graph's forward pass.

Another issue is that weight sharing is not maintained in unflattening (all params/buffers are re-cloned); this is handled by checking tensor ids too.
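
The idea of detecting shared weights by identity can be sketched in plain Python. This is an illustration only, not the export implementation: `group_aliases` is a hypothetical helper, and plain objects stand in for tensors.

```python
from collections import defaultdict


def group_aliases(named_params: dict) -> list:
    """Group names that refer to the same underlying object, keyed by id()."""
    groups = defaultdict(list)
    for name, param in named_params.items():
        groups[id(param)].append(name)
    return [sorted(names) for names in groups.values()]
```

Names that land in the same group can then be bound to a single object instead of being cloned separately.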
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125758
Approved by: https://github.com/zhxchen17
2024-05-14 23:00:46 +00:00
4d063c8e8a Do not print escape characters in xdoctest logs (#126219)
By invoking make with `vt100` terminal settings
Test Plan:
[Before](https://github.com/pytorch/pytorch/actions/runs/9086391859/job/24972547633)
```
2024-05-14T21:50:09.0459741Z reading sources... [ 57%] generated/torch.func.stack_module_state .. generated/torch.gradient
2024-05-14T21:50:09.2204992Z reading sources... [ 59%] generated/torch.greater .. generated/torch.jit.ignore
2024-05-14T21:50:09.9598581Z reading sources... [ 61%] generated/torch.jit.interface .. generated/torch.linalg.multi_dot
2024-05-14T21:50:10.5383853Z reading sources... [ 64%] generated/torch.linalg.norm .. generated/torch.moveaxis
```
[After](https://github.com/pytorch/pytorch/actions/runs/9086780396/job/24973727737?pr=126219)
```
2024-05-14T22:27:22.9388802Z reading sources... [ 57%] generated/torch.func.stack_module_state .. generated/torch.gradient
2024-05-14T22:27:23.5874407Z reading sources... [ 59%] generated/torch.greater .. generated/torch.jit.ignore
2024-05-14T22:27:23.7649947Z reading sources... [ 61%] generated/torch.jit.interface .. generated/torch.linalg.multi_dot
2024-05-14T22:27:24.3492981Z reading sources... [ 64%] generated/torch.linalg.norm .. generated/torch.moveaxis
2024-05-14T22:27:24.9723946Z reading sources... [ 66%] generated/torch.movedim .. generated/torch.nn.AdaptiveLogSoftmaxWithLoss
```
Fixes https://github.com/pytorch/pytorch/issues/123166

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126219
Approved by: https://github.com/clee2000
2024-05-14 22:45:55 +00:00
b522e65056 Check pointer for null before deref in Aten/native/sparse (#126163)
Fixes #126162

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126163
Approved by: https://github.com/ezyang
2024-05-14 21:55:41 +00:00
bbdbfe3661 Reland add write_record_metadata to PyTorchFileWriter (#126087)
Reland of https://github.com/pytorch/pytorch/pull/125184 with compiler warning fixed by extending `m_pWrite` rather than adding `m_pSeek` to miniz API

Differential Revision: [D57287327](https://our.internmc.facebook.com/intern/diff/D57287327)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126087
Approved by: https://github.com/albanD
2024-05-14 21:48:44 +00:00
1ba852c1dc Fix torch elastic test SimpleElasticAgentTest.test_restart_workers br… (#126002)
Failure Info:
```bash
(pt) betterman@bjys1009:/projs/framework/betterman/code/pytorch_new/test/distributed/elastic/agent/server/test$ pytest api_test.py -k test_restart_workers
=============================================================================================================================================== test session starts ================================================================================================================================================
platform linux -- Python 3.10.8, pytest-8.1.1, pluggy-1.4.0
rootdir: /projs/framework/betterman/code/pytorch_new
configfile: pytest.ini
plugins: hypothesis-6.15.0, rerunfailures-14.0, flakefinder-1.1.0, xdist-3.3.1
collecting 1 item
/projs/framework/betterman/code/pytorch_new/test/distributed/elastic/agent/server/test/api_test.py:123: PytestCollectionWarning: cannot collect test class 'TestAgent' because it has a __init__ constructor (from: test/distributed/elastic/agent/server/test/api_test.py)
  class TestAgent(SimpleElasticAgent):
collected 29 items / 28 deselected / 1 selected
Running 1 items in this shard

api_test.py F                                                                                                                                                                                                                                                                                                [100%]

===================================================================================================================================================== FAILURES =====================================================================================================================================================
___________________________________________________________________________________________________________________________________ SimpleElasticAgentTest.test_restart_workers ____________________________________________________________________________________________________________________________________
Traceback (most recent call last):
  File "/usr/local/python3.10/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
    yield
  File "/usr/local/python3.10/lib/python3.10/unittest/case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "/usr/local/python3.10/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
    method()
  File "/projs/framework/betterman/code/pytorch_new/test/distributed/elastic/agent/server/test/api_test.py", line 368, in test_restart_workers
    agent._restart_workers(worker_group)
  File "/projs/framework/betterman/code/pytorch_new/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
  File "/projs/framework/betterman/code/pytorch_new/torch/distributed/elastic/agent/server/api.py", line 728, in _restart_workers
    self._stop_workers(worker_group, is_restart=True)
TypeError: TestAgent._stop_workers() got an unexpected keyword argument 'is_restart'
============================================================================================================================================= short test summary info ==============================================================================================================================================
FAILED [0.0054s] api_test.py::SimpleElasticAgentTest::test_restart_workers - TypeError: TestAgent._stop_workers() got an unexpected keyword argument 'is_restart'
========================================================================================================================================= 1 failed, 28 deselected in 7.37s =========================================================================================================================================
```
Caused by #124819.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126002
Approved by: https://github.com/ezyang
2024-05-14 21:36:24 +00:00
3a58d40b93 [Profiler] Clean up deprecated use_cuda by default (#126180)
Summary: Should not be setting use_cuda by default anymore, since it is deprecated. Instead it will be set via use_device="cuda".

Test Plan:
CI and ran locally:

Before:
```
[INFO: pytorch_resnet_integration_test.py:  196]: step: 80, peak allocated GPU mem: 3.17GB, peak active GPU mem: 3.17GB, peak reserved GPU mem: 3.39GB.
/data/users/aaronshi/fbsource/buck-out/v2/gen/fbcode/277373c3e83d278c/kineto/libkineto/fb/integration_tests/__pytorch_resnet_integration_test__/pytorch_resnet_integration_test#link-tree/torch/autograd/profiler.py:215: UserWarning:

The attribute `use_cuda` will be deprecated soon, please use ``use_device = 'cuda'`` instead.

  Log file: /tmp/libkineto_activities_812639.json
  Trace start time: 2024-05-14 08:44:50  Trace duration: 500ms
  Warmup duration: 5s
  Max GPU buffer size: 128MB
  Enabled activities: cpu_op,user_annotation,gpu_user_annotation,gpu_memcpy,gpu_memset,kernel,external_correlation,cuda_runtime,cuda_driver,cpu_instant_event,python_function,xpu_runtime,privateuse1_runtime,privateuse1_driver
  Manifold bucket: gpu_traces
  Manifold object: tree/traces/clientAPI/0/1715701483/devvm2184.cco0/libkineto_activities_812639.json
  Trace compression enabled: 1
  TTL in seconds: 31536000 (365 days)
INFO:2024-05-14 08:44:43 812639:812639 CuptiActivityProfiler.cpp:971] Enabling GPU tracing
```

After:
```
[INFO: pytorch_resnet_integration_test.py:  196]: step: 80, peak allocated GPU mem: 3.17GB, peak active GPU mem: 3.17GB, peak reserved GPU mem: 3.39GB.
  Log file: /tmp/libkineto_activities_903554.json
  Trace start time: 2024-05-14 09:05:47  Trace duration: 500ms
  Warmup duration: 5s
  Max GPU buffer size: 128MB
  Enabled activities: cpu_op,user_annotation,gpu_user_annotation,gpu_memcpy,gpu_memset,kernel,external_correlation,cuda_runtime,cuda_driver,cpu_instant_event,python_function,xpu_runtime,privateuse1_runtime,privateuse1_driver
  Manifold bucket: gpu_traces
  Manifold object: tree/traces/clientAPI/0/1715702740/devvm2184.cco0/libkineto_activities_903554.json
  Trace compression enabled: 1
  TTL in seconds: 31536000 (365 days)
INFO:2024-05-14 09:05:40 903554:903554 CuptiActivityProfiler.cpp:971] Enabling GPU tracing
```

Differential Revision: D57337445

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126180
Approved by: https://github.com/davidberard98
2024-05-14 21:23:31 +00:00
534c34b320 Fix copy-pasted docs, reversing the load and save description (#125993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125993
Approved by: https://github.com/kwen2501, https://github.com/fegin
2024-05-14 21:14:16 +00:00
2973c9bb88 [export] add SchemaCheckMode testing for pre-dispatch export, OpInfo (#125481)
This adds a new dispatch mode, PreDispatchSchemaCheckMode, built on top of SchemaCheckMode, used for verifying op schemas for functionalization for PreDispatch IR. More specifically, the mode runs in eager mode on concrete inputs, checking if op schemas incorrectly claim to be functional, but are aliasing or mutating. This mode is pushed to the pre-dispatch mode stack, and run before decompositions.

Current testing is hooked up to OpInfo, containing 1103 tests on 600 unique ops. Below is a list of ops that fail testing. One caveat is we only raise errors on ops that claim to be functional - if an op schema admits aliasing or mutating but fails testing for the other, it still may decompose further and become functional.

List of failed ops:
```
aten.atleast_1d.default
aten.atleast_2d.default
aten.atleast_3d.default
aten.cartesian_prod.default
aten.conj_physical.default
aten.alpha_dropout.default
aten.feature_dropout.default
aten.feature_alpha_dropout.default
aten.unsafe_chunk.default
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125481
Approved by: https://github.com/tugsbayasgalan
2024-05-14 21:07:21 +00:00
534ddfa619 Move compute unbacked bindings call to track_tensor_tree (#126168)
This ensures we hit it in all the HOP proxy tensor implementations

Fixes https://github.com/pytorch/pytorch/issues/125869

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126168
Approved by: https://github.com/ydwu4
2024-05-14 21:05:05 +00:00
54131ecb25 Remove redundant spaces in CMakeLists.txt (#126042)
Fixes #126023

```diff
diff --git a/CMakeLists.txt b/CMakeLists.txt
index 79db67e735..924721d2e6 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -281,8 +281,8 @@ if(NOT DEFINED USE_VULKAN)
 endif()

 option(USE_SLEEF_FOR_ARM_VEC256 "Use sleef for arm" OFF)
-option(USE_SOURCE_DEBUG_ON_MOBILE "Enable " ON)
-option(USE_LITE_INTERPRETER_PROFILER "Enable " ON)
+option(USE_SOURCE_DEBUG_ON_MOBILE "Enable" ON)
+option(USE_LITE_INTERPRETER_PROFILER "Enable" ON)
 option(USE_VULKAN_FP16_INFERENCE "Vulkan - Use fp16 inference" OFF)
 option(USE_VULKAN_RELAXED_PRECISION "Vulkan - Use relaxed precision math in the kernels (mediump)" OFF)
 # option USE_XNNPACK: try to enable xnnpack by default.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126042
Approved by: https://github.com/r-barnes
2024-05-14 21:04:49 +00:00
7ed67cdbcc Add compile time smoketest for foreach (#126136)
Fixes [T175425693](https://www.internalfb.com/intern/tasks/?t=175425693)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126136
Approved by: https://github.com/yanboliang
2024-05-14 21:00:55 +00:00
a8eac0efa8 fix: unknown CMake command "check_function_exists" (#126165)
When building pytorch with OpenBLAS on windows I ran into this CMake issue:

```
CMake Error at cmake/Modules/FindLAPACK.cmake:137 (check_function_exists):
  Unknown CMake command "check_function_exists".
Call Stack (most recent call first):
  cmake/Dependencies.cmake:1745 (find_package)
  CMakeLists.txt:708 (include)
```

Similarly described here: https://discuss.pytorch.org/t/cmake-with-error-by-compiling-on-windows-with-mingw32-make/159140

This PR fixes this issue by adding:

```
include(CheckFunctionExists)
```

To the offending CMake file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126165
Approved by: https://github.com/ezyang
2024-05-14 20:54:06 +00:00
4a8db9d45b [dynamo] reset grad state in aotdispatch test, add failing trace functional tensor test to dynamo (#126113)
Workaround for https://github.com/pytorch/pytorch/issues/125568.

We could add additional global state to reset (e.g. autocast?) or move this setup/teardown to a more general place.

Also added a minimal repro for the linked issue - will investigate in a followup PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126113
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2024-05-14 20:42:49 +00:00
f6a00a8032 [inductor] Add abs to index_propagation (#124616)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124616
Approved by: https://github.com/lezcano
ghstack dependencies: #124119
2024-05-14 20:14:53 +00:00
c30ea3387b [inductor] Improve stability of scaled softmax (#124119)
This adds a pattern which replaces:
```python
   scale(x) - scale(x).amax(dim, keepdim=True)
```
with
```python
   scale(x - x.amax(dim, keepdim=True))
```
where `scale` can be either multiplication or division by a scalar,
or a tensor that is broadcast in the `dim` dimension.

We can find this pattern inside of the decomposed graph of:
```python
F.softmax(scale(x), dim=dim)
```

This both reduces the chance of hitting the `fma` issue and avoids
recomputing `scale(x)` inside and outside the reduction, which may be
significant if we can remove an extra division.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124119
Approved by: https://github.com/lezcano
2024-05-14 20:14:53 +00:00
352a893b0c Fast standalone symbolize for unwinding (#123966)
We've had issues using addr2line. On certain versions of
CentOS it ships in a version with a performance regression that makes it very slow,
and even normally it is not that fast, taking several seconds even when parallelized
for a typical memory trace dump.

Folly Symbolize and LLVMSymbolize are fast, but using them requires PyTorch to take a dependency on those libraries, and given the number of environments we run in, we end up hitting cases where we fall back to the slow addr2line behavior.

This adds a standalone symbolizer to PyTorch similar to the unwinder which has
no external dependencies and is ~20x faster than addr2line for unwinding PyTorch frames.

I've tested this on some memory profiling runs using all combinations of {gcc, clang} x {dwarf4, dwarf5} and it seems to do a good job at getting line numbers and function names right. It is also careful to route all reads of library data through the `CheckedLexer` object, which ensures it is not reading out of bounds of the section. Errors are routed through UnwindError so that those exceptions get caught and we produce a ?? frame rather than crash. I also added a fuzz test which gives all our symbolizer options random addresses in the process to make sure they do not crash.

Differential Revision: [D56828968](https://our.internmc.facebook.com/intern/diff/D56828968)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123966
Approved by: https://github.com/ezyang, https://github.com/aaronenyeshi
2024-05-14 19:39:17 +00:00
5fb4a766b8 [CUDA] [CI] Add cu124 docker images (#125944)
Fixes issues encountered in https://github.com/pytorch/pytorch/pull/121956

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125944
Approved by: https://github.com/atalman
2024-05-14 19:38:10 +00:00
ed327876f5 [codemod] c10:optional -> std::optional (#126135)
Generated by running the following from PyTorch root:
```
find . -regex ".*\.\(cpp\|h\|cu\|hpp\|cc\|cxx\)$" | grep -v "build/" | xargs -n 50 -P 4 perl -pi -e 's/c10::optional/std::optional/'
```

`c10::optional` is just an alias for `std::optional`. This removes usages of that alias in preparation for eliminating it entirely.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126135
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/albanD, https://github.com/aaronenyeshi
2024-05-14 19:35:51 +00:00
b55f57b7af [codemod][lowrisk] Remove extra semi colon from caffe2/c10/core/SymNodeImpl.h (#123055)
Summary:
`-Wextra-semi` or `-Wextra-semi-stmt`

If the code compiles, this is safe to land.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/123055
Approved by: https://github.com/Skylion007
2024-05-14 19:35:29 +00:00
023f05cfe6 Allow symbols to reach conv_layout stride argument #125829 (#126116)
https://github.com/pytorch/pytorch/pull/125829 was reverted. I rebased it; the error could have been
a merge error, because it is not reproducible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126116
Approved by: https://github.com/anijain2305
2024-05-14 19:22:16 +00:00
0e6462f69a [pipelining] Consolidate test models into a registry (#126114)
Resolves https://github.com/pytorch/PiPPy/issues/1062.

Also added a gradient equivalence test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126114
Approved by: https://github.com/H-Huang
ghstack dependencies: #125729, #125975
2024-05-14 19:11:54 +00:00
38b8b614a2 [ROCm] Implement forward AD for miopen_batch_norm (#125069)
Implements forward automatic differentiation support for miopen_batch_norm and unskips the associated unit tests. Also fixes a class of functorch-related unit tests that failed a contiguous tensor assertion in BatchNorm_miopen.cpp. The solution was to only pass tensors with at least 3 dimensions to miopen_batch_norm. The same restriction already existed in the cudnn path, which is why the tests in question only failed on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125069
Approved by: https://github.com/jeffdaily, https://github.com/andrewor14
2024-05-14 19:09:50 +00:00
1a28f731dc [optim] Merge the pyi files into py files of optimizer (#125452)
Continue the work of pytorch/pytorch#125153
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125452
Approved by: https://github.com/janeyx99
2024-05-14 18:24:50 +00:00
a00a99e801 [profiler] Report strides in json trace (#125851)
We already collect strides, we just don't report them anywhere.

Note: this depends on concrete input collection being enabled, which I think is currently not the case internally.

Differential Revision: [D57165421](https://our.internmc.facebook.com/intern/diff/D57165421)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125851
Approved by: https://github.com/Chillee, https://github.com/aaronenyeshi
2024-05-14 18:24:24 +00:00
50c3d58734 [onnx.export] Cache AllGraphInputsStatic (#123028)
This PR is part of an effort to speed up torch.onnx.export (#121422).

- The inputs (dynamic inputs and constants) do not change as nodes are added, and they are expensive to re-compute for every node. So we cache this value to avoid computing it for every node. Open to an entirely different solution as well.
- Resolves (5) in #121422.

(partial fix of #121545)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123028
Approved by: https://github.com/justinchuby
2024-05-14 18:19:04 +00:00
3cba50e478 [quant] Make per_group and per_token quant match torch.fake_quantize (#125781)
Summary: Follow-up to https://github.com/pytorch/ao/pull/229.
This resolves the difference between `input.div(scales)` and
`input.mul(1.0 / scales)`, which results in small numerical
discrepancies on some inputs.

Test Plan:
python test/test_quantization.py TestQuantizedTensor.test_decomposed_quantize_per_channel_group
python test/test_quantization.py TestQuantizedTensor.test_decomposed_quantize_per_token

Reviewers: jerryzh168

Subscribers: jerryzh168, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125781
Approved by: https://github.com/jerryzh168
2024-05-14 18:18:54 +00:00
7207 changed files with 346441 additions and 352023 deletions

@ -1,4 +1,4 @@
# Docker images for GitHub CI
# Docker images for GitHub CI and CD
This directory contains everything needed to build the Docker images
that are used in our CI.
@ -12,7 +12,7 @@ each image as the `BUILD_ENVIRONMENT` environment variable.
See `build.sh` for valid build environments (it's the giant switch).
## Contents
## Docker CI builds
* `build.sh` -- dispatch script to launch all builds
* `common` -- scripts used to execute individual Docker build stages
@ -21,6 +21,12 @@ See `build.sh` for valid build environments (it's the giant switch).
* `ubuntu-rocm` -- Dockerfile for Ubuntu image with ROCm support
* `ubuntu-xpu` -- Dockerfile for Ubuntu image with XPU support
### Docker CD builds
* `conda` - Dockerfile and build.sh to build Docker images used in nightly conda builds
* `manywheel` - Dockerfile and build.sh to build Docker images used in nightly manywheel builds
* `libtorch` - Dockerfile and build.sh to build Docker images used in nightly libtorch builds
## Usage
```bash

@ -0,0 +1,5 @@
0.7b
manylinux_2_17
rocm6.2
9be04068c3c0857a4cfd17d7e39e71d0423ebac2
3e9e1959d23b93d78a08fcc5f868125dc3854dece32fd9458be9ef4467982291
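
Note: assuming this five-line file is the `aotriton_version.txt` that `install_aotriton.sh` (further down in this diff) reads into `VER MANYLINUX ROCMBASE PINNED_COMMIT SHA256`, the lines are, in order, the release version, the manylinux tag, the ROCm base, the pinned commit, and the tarball's SHA256. The sketch below splits such a file with one `read` per line, which is a different technique from the single `read -d` call in the script; the temporary file and the fixed `ARCH` value are illustrative assumptions.

```shell
# Sketch: split a five-line pin file into named fields.
# The temporary file stands in for the real pin file.
pin_file=$(mktemp)
printf '%s\n' 0.7b manylinux_2_17 rocm6.2 \
  9be04068c3c0857a4cfd17d7e39e71d0423ebac2 \
  3e9e1959d23b93d78a08fcc5f868125dc3854dece32fd9458be9ef4467982291 > "$pin_file"

# One read per line; each call succeeds because every line ends in a newline.
{
  read -r VER
  read -r MANYLINUX
  read -r ROCMBASE
  read -r PINNED_COMMIT
  read -r SHA256
} < "$pin_file"
rm -f "$pin_file"

ARCH=x86_64  # fixed here for a reproducible result; the script uses $(uname -m)
echo "aotriton-${VER}-${MANYLINUX}_${ARCH}-${ROCMBASE}-shared.tar.gz"
```

With the values above this prints `aotriton-0.7b-manylinux_2_17_x86_64-rocm6.2-shared.tar.gz`, the file name that the install script appends to its release URL.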

@ -84,16 +84,16 @@ fi
# CMake 3.18 is needed to support CUDA17 language variant
CMAKE_VERSION=3.18.5
_UCX_COMMIT=00bcc6bb18fc282eb160623b4c0d300147f579af
_UCC_COMMIT=7cb07a76ccedad7e56ceb136b865eb9319c258ea
_UCX_COMMIT=7bb2722ff2187a0cad557ae4a6afa090569f83fb
_UCC_COMMIT=20eae37090a4ce1b32bcce6144ccad0b49943e0b
# It's annoying to rename jobs every time you want to rewrite a
# configuration, so we hardcode everything here rather than do it
# from scratch
case "$image" in
pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9)
CUDA_VERSION=12.1.1
CUDNN_VERSION=8
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
@ -105,9 +105,23 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9-inductor-benchmarks)
pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9)
CUDA_VERSION=12.1.1
CUDNN_VERSION=8
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
@ -120,9 +134,54 @@ case "$image" in
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda11.8-cudnn8-py3-gcc9)
pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.1.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.1-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.1.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda11.8-cudnn9-py3-gcc9)
CUDA_VERSION=11.8.0
CUDNN_VERSION=8
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
@ -134,9 +193,37 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.1-cudnn8-py3-gcc9)
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.1-cudnn9-py3-gcc9)
CUDA_VERSION=12.1.1
CUDNN_VERSION=8
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
VISION=yes
KATEX=yes
UCX_COMMIT=${_UCX_COMMIT}
UCC_COMMIT=${_UCC_COMMIT}
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
@ -149,7 +236,7 @@ case "$image" in
TRITON=yes
;;
pytorch-linux-focal-py3-clang10-onnx)
ANACONDA_PYTHON_VERSION=3.8
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=10
PROTOBUF=yes
DB=yes
@ -158,7 +245,7 @@ case "$image" in
ONNX=yes
;;
pytorch-linux-focal-py3-clang9-android-ndk-r21e)
ANACONDA_PYTHON_VERSION=3.8
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=9
LLVMDEV=yes
PROTOBUF=yes
@ -167,8 +254,8 @@ case "$image" in
GRADLE_VERSION=6.8.3
NINJA_VERSION=1.9.0
;;
pytorch-linux-focal-py3.8-clang10)
ANACONDA_PYTHON_VERSION=3.8
pytorch-linux-focal-py3.9-clang10)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=10
PROTOBUF=yes
DB=yes
@ -189,8 +276,8 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-py3.8-gcc9)
ANACONDA_PYTHON_VERSION=3.8
pytorch-linux-focal-py3.9-gcc9)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=9
PROTOBUF=yes
DB=yes
@ -221,7 +308,7 @@ case "$image" in
TRITON=yes
;;
pytorch-linux-jammy-xpu-2024.0-py3)
ANACONDA_PYTHON_VERSION=3.8
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
PROTOBUF=yes
DB=yes
@ -231,8 +318,8 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-jammy-py3.8-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.8
pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
PROTOBUF=yes
DB=yes
@ -243,10 +330,10 @@ case "$image" in
DOCS=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda11.8-cudnn8-py3.8-clang12)
ANACONDA_PYTHON_VERSION=3.8
pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-clang12)
ANACONDA_PYTHON_VERSION=3.9
CUDA_VERSION=11.8
CUDNN_VERSION=8
CUDNN_VERSION=9
CLANG_VERSION=12
PROTOBUF=yes
DB=yes
@ -268,8 +355,8 @@ case "$image" in
CONDA_CMAKE=yes
VISION=yes
;;
pytorch-linux-jammy-py3.8-gcc11)
ANACONDA_PYTHON_VERSION=3.8
pytorch-linux-jammy-py3.9-gcc11)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
PROTOBUF=yes
DB=yes
@ -286,6 +373,13 @@ case "$image" in
CONDA_CMAKE=yes
EXECUTORCH=yes
;;
pytorch-linux-jammy-py3.12-halide)
CUDA_VERSION=12.4
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=11
CONDA_CMAKE=yes
HALIDE=yes
;;
pytorch-linux-focal-linter)
# TODO: Use 3.9 here because of this issue https://github.com/python/mypy/issues/13627.
# We will need to update mypy version eventually, but that's for another day. The task
@ -293,7 +387,7 @@ case "$image" in
ANACONDA_PYTHON_VERSION=3.9
CONDA_CMAKE=yes
;;
pytorch-linux-jammy-cuda11.8-cudnn8-py3.9-linter)
pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-linter)
ANACONDA_PYTHON_VERSION=3.9
CUDA_VERSION=11.8
CONDA_CMAKE=yes
@ -313,6 +407,22 @@ case "$image" in
# from pytorch/llvm:9.0.1 is x86 specific
SKIP_LLVM_SRC_BUILD_INSTALL=yes
;;
pytorch-linux-jammy-aarch64-py3.10-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=11
ACL=yes
PROTOBUF=yes
DB=yes
VISION=yes
CONDA_CMAKE=yes
# snadampal: skipping sccache due to the following issue
# https://github.com/pytorch/pytorch/issues/121559
SKIP_SCCACHE_INSTALL=yes
# snadampal: skipping llvm src build install because the current version
# from pytorch/llvm:9.0.1 is x86 specific
SKIP_LLVM_SRC_BUILD_INSTALL=yes
INDUCTOR_BENCHMARKS=yes
;;
*)
# Catch-all for builds that are not hardcoded.
PROTOBUF=yes
@ -360,7 +470,7 @@ tmp_tag=$(basename "$(mktemp -u)" | tr '[:upper:]' '[:lower:]')
#when using cudnn version 8 install it separately from cuda
if [[ "$image" == *cuda* && ${OS} == "ubuntu" ]]; then
IMAGE_NAME="nvidia/cuda:${CUDA_VERSION}-cudnn${CUDNN_VERSION}-devel-ubuntu${UBUNTU_VERSION}"
if [[ ${CUDNN_VERSION} == 8 ]]; then
if [[ ${CUDNN_VERSION} == 9 ]]; then
IMAGE_NAME="nvidia/cuda:${CUDA_VERSION}-devel-ubuntu${UBUNTU_VERSION}"
fi
fi
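
Note: the base-image choice in the hunk above can be sketched as a small function. This is only an illustration under assumptions: `cuda_base_image` is a hypothetical helper that does not exist in `build.sh`, and the reading that cuDNN 9 gets installed by a separate step is inferred from the plain CUDA image being selected.

```shell
# Sketch of the base-image choice; cuda_base_image is a hypothetical helper.
cuda_base_image() {
  # $1 = CUDA version, $2 = cuDNN major version, $3 = Ubuntu version
  if [ "$2" = "9" ]; then
    # Start from the plain CUDA image (cuDNN presumably comes from a later step).
    echo "nvidia/cuda:$1-devel-ubuntu$3"
  else
    # Otherwise use the NVIDIA image that already bundles cuDNN.
    echo "nvidia/cuda:$1-cudnn$2-devel-ubuntu$3"
  fi
}

cuda_base_image 12.4.1 9 20.04
cuda_base_image 11.8.0 8 20.04
```

The first call prints `nvidia/cuda:12.4.1-devel-ubuntu20.04` and the second prints `nvidia/cuda:11.8.0-cudnn8-devel-ubuntu20.04`.
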
@ -403,6 +513,7 @@ docker build \
--build-arg "DOCS=${DOCS}" \
--build-arg "INDUCTOR_BENCHMARKS=${INDUCTOR_BENCHMARKS}" \
--build-arg "EXECUTORCH=${EXECUTORCH}" \
--build-arg "HALIDE=${HALIDE}" \
--build-arg "XPU_VERSION=${XPU_VERSION}" \
--build-arg "ACL=${ACL:-}" \
--build-arg "SKIP_SCCACHE_INSTALL=${SKIP_SCCACHE_INSTALL:-}" \
@ -412,7 +523,7 @@ docker build \
"$@" \
.
# NVIDIA dockers for RC releases use tag names like `11.0-cudnn8-devel-ubuntu18.04-rc`,
# NVIDIA dockers for RC releases use tag names like `11.0-cudnn9-devel-ubuntu18.04-rc`,
# for this case we will set UBUNTU_VERSION to `18.04-rc` so that the Dockerfile could
# find the correct image. As a result, here we have to replace the
# "$UBUNTU_VERSION" == "18.04-rc"

@ -77,6 +77,9 @@ RUN rm install_rocm.sh
COPY ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh
RUN rm install_rocm_magma.sh
COPY ./common/install_amdsmi.sh install_amdsmi.sh
RUN bash ./install_amdsmi.sh
RUN rm install_amdsmi.sh
ENV PATH /opt/rocm/bin:$PATH
ENV PATH /opt/rocm/hcc/bin:$PATH
ENV PATH /opt/rocm/hip/bin:$PATH
@ -105,10 +108,17 @@ ENV CMAKE_C_COMPILER cc
ENV CMAKE_CXX_COMPILER c++
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton-rocm.txt triton-rocm.txt
COPY ci_commit_pins/triton.txt triton.txt
COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton-rocm.txt triton_version.txt
RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt
# Install AOTriton (Early fail)
COPY ./aotriton_version.txt aotriton_version.txt
COPY ./common/common_utils.sh common_utils.sh
COPY ./common/install_aotriton.sh install_aotriton.sh
RUN ["/bin/bash", "-c", "./install_aotriton.sh /opt/rocm && rm -rf install_aotriton.sh aotriton_version.txt common_utils.sh"]
ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh

@ -1 +1 @@
d4b3e5cc607e97afdba79dc90f8ef968142f347c
cd1c833b079adb324871dcbbe75b43d42ffc0ade

@ -0,0 +1 @@
461c12871f336fe6f57b55d6a297f13ef209161b

@ -1 +1 @@
730b907b4d45a4713cbc425cbf224c46089fd514
ac3470188b914c5d7a5058a7e28b9eb685a62427

@ -1 +0,0 @@
bbe6246e37d8aa791c67daaf9d9d61b26c9ccfdc

@ -1 +1 @@
b8c64f64c18d8cac598b3adb355c21e7439c21de
91b14bf5593cf58a8541f3e6b9125600a867d4ef

@ -1 +1 @@
45fff310c891f5a92d55445adf8cc9d29df5841e
5fe38ffd73c2ac6ed6323b554205186696631c6f

@ -1,6 +1,6 @@
set -euo pipefail
readonly version=v23.08
readonly version=v24.04
readonly src_host=https://review.mlplatform.org/ml
readonly src_repo=ComputeLibrary

@ -0,0 +1,5 @@
#!/bin/bash
set -ex
cd /opt/rocm/share/amd_smi && pip install .

@ -0,0 +1,23 @@
#!/bin/bash
set -ex
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
TARBALL='aotriton.tar.gz'
# This read command always returns with exit code 1
read -d "\n" VER MANYLINUX ROCMBASE PINNED_COMMIT SHA256 < aotriton_version.txt || true
ARCH=$(uname -m)
AOTRITON_INSTALL_PREFIX="$1"
AOTRITON_URL="https://github.com/ROCm/aotriton/releases/download/${VER}/aotriton-${VER}-${MANYLINUX}_${ARCH}-${ROCMBASE}-shared.tar.gz"
cd "${AOTRITON_INSTALL_PREFIX}"
# Must use -L to follow redirects
curl -L --retry 3 -o "${TARBALL}" "${AOTRITON_URL}"
ACTUAL_SHA256=$(sha256sum "${TARBALL}" | cut -d " " -f 1)
if [ "${SHA256}" != "${ACTUAL_SHA256}" ]; then
echo -n "Error: The SHA256 of downloaded tarball is ${ACTUAL_SHA256},"
echo " which does not match the expected value ${SHA256}."
exit
fi
tar xf "${TARBALL}" && rm -rf "${TARBALL}"
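
Note: in the script above, the mismatch branch ends in a bare `exit`, which returns the status of the last command run (the `echo`, normally 0), so as far as I can tell a checksum mismatch would still end the script with a success status. A minimal sketch of the same check that reports failure explicitly follows; `verify_sha256` is a hypothetical helper, not something the script defines, and the demonstration file is a throwaway.

```shell
# verify_sha256 is a hypothetical helper: succeed only if the file hashes to the expected digest.
verify_sha256() {
  # $1 = file to check, $2 = expected hex digest
  actual=$(sha256sum "$1" | cut -d " " -f 1)
  if [ "$2" != "$actual" ]; then
    echo "Error: SHA256 of $1 is $actual, expected $2." >&2
    return 1  # non-zero, so callers (and set -e) see the failure
  fi
}

# Demonstration on a throwaway file.
tmp=$(mktemp)
printf 'hello\n' > "$tmp"
good=$(sha256sum "$tmp" | cut -d " " -f 1)
verify_sha256 "$tmp" "$good" && echo "checksum ok"
verify_sha256 "$tmp" "deadbeef" 2>/dev/null || echo "mismatch detected"
rm -f "$tmp"
```

In a script that runs under `set -e`, calling such a helper on its own line stops the build as soon as the digest differs.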

@ -3,7 +3,7 @@
set -ex
install_ubuntu() {
# NVIDIA dockers for RC releases use tag names like `11.0-cudnn8-devel-ubuntu18.04-rc`,
# NVIDIA dockers for RC releases use tag names like `11.0-cudnn9-devel-ubuntu18.04-rc`,
# for this case we will set UBUNTU_VERSION to `18.04-rc` so that the Dockerfile could
# find the correct image. As a result, here we have to check for
# "$UBUNTU_VERSION" == "18.04"*

@ -5,32 +5,22 @@ set -ex
# Optionally install conda
if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
BASE_URL="https://repo.anaconda.com/miniconda"
CONDA_FILE="Miniconda3-latest-Linux-x86_64.sh"
if [[ $(uname -m) == "aarch64" ]] || [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download"
CONDA_FILE="Miniforge3-Linux-$(uname -m).sh"
fi
MAJOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 1)
MINOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 2)
if [[ $(uname -m) == "aarch64" ]]; then
BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download"
case "$MAJOR_PYTHON_VERSION" in
3)
CONDA_FILE="Miniforge3-Linux-aarch64.sh"
;;
3);;
*)
echo "Unsupported ANACONDA_PYTHON_VERSION: $ANACONDA_PYTHON_VERSION"
exit 1
;;
esac
else
case "$MAJOR_PYTHON_VERSION" in
3)
CONDA_FILE="Miniconda3-latest-Linux-x86_64.sh"
;;
*)
echo "Unsupported ANACONDA_PYTHON_VERSION: $ANACONDA_PYTHON_VERSION"
exit 1
;;
esac
fi
mkdir -p /opt/conda
chown jenkins:jenkins /opt/conda
@ -78,19 +68,20 @@ fi
CONDA_COMMON_DEPS="astunparse pyyaml setuptools openblas==0.3.25=*openmp* ninja==1.11.1 scons==4.5.2"
if [ "$ANACONDA_PYTHON_VERSION" = "3.8" ]; then
conda_install numpy=1.24.4 ${CONDA_COMMON_DEPS}
NUMPY_VERSION=1.24.4
else
conda_install numpy=1.26.2 ${CONDA_COMMON_DEPS}
NUMPY_VERSION=1.26.2
fi
else
CONDA_COMMON_DEPS="astunparse pyyaml mkl=2021.4.0 mkl-include=2021.4.0 setuptools"
if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.12" ]; then
conda_install numpy=1.26.0 ${CONDA_COMMON_DEPS}
if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.12" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.13" ]; then
NUMPY_VERSION=1.26.0
else
conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS}
NUMPY_VERSION=1.21.2
fi
fi
conda_install ${CONDA_COMMON_DEPS}
# Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
# and libpython-static for torch deploy
@ -112,7 +103,7 @@ fi
# Install some other packages, including those needed for Python test reporting
pip_install -r /opt/conda/requirements-ci.txt
pip_install numpy=="$NUMPY_VERSION"
pip_install -U scikit-learn
if [ -n "$DOCS" ]; then
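
Note: as far as the hunks above show, the change stops installing NumPy through `conda_install` and instead records a `NUMPY_VERSION` that is later passed to `pip_install`. The sketch below mirrors only the second (MKL) branch; `select_numpy_version` is a hypothetical helper introduced for illustration.

```shell
# select_numpy_version is a hypothetical helper mirroring the MKL branch:
# newer Python versions get the newer NumPy pin.
select_numpy_version() {
  # $1 = Python version such as 3.12
  case "$1" in
    3.11|3.12|3.13) echo "1.26.0" ;;
    *)              echo "1.21.2" ;;
  esac
}

NUMPY_VERSION=$(select_numpy_version 3.12)
echo "numpy==${NUMPY_VERSION}"
```

For Python 3.12 this prints `numpy==1.26.0`, the requirement string that the later `pip_install` line would receive.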

@ -0,0 +1,20 @@
#!/bin/bash
# Script used only in CD pipeline
set -ex
# Anaconda
# Latest anaconda is using openssl-3 which is incompatible with all currently published versions of git
# Which are using openssl-1.1.1, see https://anaconda.org/anaconda/git/files?version=2.40.1 for example
MINICONDA_URL=https://repo.anaconda.com/miniconda/Miniconda3-py311_23.5.2-0-Linux-x86_64.sh
wget -q $MINICONDA_URL
# NB: Manually invoke bash per https://github.com/conda/conda/issues/10431
bash $(basename "$MINICONDA_URL") -b -p /opt/conda
rm $(basename "$MINICONDA_URL")
export PATH=/opt/conda/bin:$PATH
# See https://github.com/pytorch/builder/issues/1473
# Pin conda to 23.5.2 as it's the last one compatible with openssl-1.1.1
conda install -y conda=23.5.2 conda-build anaconda-client git ninja
# The cmake version here needs to match with the minimum version of cmake
# supported by PyTorch (3.18). There is only 3.18.2 on anaconda
/opt/conda/bin/pip3 install cmake==3.18.2
conda remove -y --force patchelf

@ -0,0 +1,96 @@
#!/bin/bash
# Script used only in CD pipeline
set -uex -o pipefail
PYTHON_DOWNLOAD_URL=https://www.python.org/ftp/python
PYTHON_DOWNLOAD_GITHUB_BRANCH=https://github.com/python/cpython/archive/refs/heads
GET_PIP_URL=https://bootstrap.pypa.io/get-pip.py
# Python versions to be installed in /opt/$VERSION_NO
CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.8.1 3.9.0 3.10.1 3.11.0 3.12.0 3.13.0"}
function check_var {
if [ -z "$1" ]; then
echo "required variable not defined"
exit 1
fi
}
function do_cpython_build {
local py_ver=$1
local py_folder=$2
check_var $py_ver
check_var $py_folder
tar -xzf Python-$py_ver.tgz
pushd $py_folder
local prefix="/opt/_internal/cpython-${py_ver}"
mkdir -p ${prefix}/lib
if [[ -n $(which patchelf) ]]; then
local shared_flags="--enable-shared"
else
local shared_flags="--disable-shared"
fi
if [[ -z "${WITH_OPENSSL+x}" ]]; then
local openssl_flags=""
else
local openssl_flags="--with-openssl=${WITH_OPENSSL} --with-openssl-rpath=auto"
fi
# -Wformat added for https://bugs.python.org/issue17547 on Python 2.6
CFLAGS="-Wformat" ./configure --prefix=${prefix} ${openssl_flags} ${shared_flags} > /dev/null
make -j40 > /dev/null
make install > /dev/null
if [[ "${shared_flags}" == "--enable-shared" ]]; then
patchelf --set-rpath '$ORIGIN/../lib' ${prefix}/bin/python3
fi
popd
rm -rf $py_folder
# Some python's install as bin/python3. Make them available as
# bin/python.
if [ -e ${prefix}/bin/python3 ]; then
ln -s python3 ${prefix}/bin/python
fi
${prefix}/bin/python get-pip.py
if [ -e ${prefix}/bin/pip3 ] && [ ! -e ${prefix}/bin/pip ]; then
ln -s pip3 ${prefix}/bin/pip
fi
# install setuptools since python 3.12 is required to use distutils
${prefix}/bin/pip install wheel==0.34.2 setuptools==68.2.2
local abi_tag=$(${prefix}/bin/python -c "from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag; print('{0}{1}-{2}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag()))")
ln -s ${prefix} /opt/python/${abi_tag}
}
function build_cpython {
local py_ver=$1
check_var $py_ver
check_var $PYTHON_DOWNLOAD_URL
local py_ver_folder=$py_ver
if [ "$py_ver" = "3.13.0" ]; then
PY_VER_SHORT="3.13"
check_var $PYTHON_DOWNLOAD_GITHUB_BRANCH
wget $PYTHON_DOWNLOAD_GITHUB_BRANCH/$PY_VER_SHORT.tar.gz -O Python-$py_ver.tgz
do_cpython_build $py_ver cpython-$PY_VER_SHORT
else
wget -q $PYTHON_DOWNLOAD_URL/$py_ver_folder/Python-$py_ver.tgz
do_cpython_build $py_ver Python-$py_ver
fi
rm -f Python-$py_ver.tgz
}
function build_cpythons {
check_var $GET_PIP_URL
curl -sLO $GET_PIP_URL
for py_ver in $@; do
build_cpython $py_ver
done
rm -f get-pip.py
}
mkdir -p /opt/python
mkdir -p /opt/_internal
build_cpythons $CPYTHON_VERSIONS
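
Note: `build_cpython` above picks its download source by version: 3.13.0 comes from the `3.13` branch archive on GitHub, everything else from python.org. The sketch below only prints the URL that would be fetched and downloads nothing; `cpython_source_url` is a hypothetical helper, not part of the script.

```shell
PYTHON_DOWNLOAD_URL=https://www.python.org/ftp/python
PYTHON_DOWNLOAD_GITHUB_BRANCH=https://github.com/python/cpython/archive/refs/heads

# cpython_source_url is a hypothetical helper that prints the source URL
# build_cpython would use for the given version.
cpython_source_url() {
  # $1 = full CPython version such as 3.12.0
  if [ "$1" = "3.13.0" ]; then
    # Taken from the 3.13 branch archive on GitHub rather than python.org.
    echo "$PYTHON_DOWNLOAD_GITHUB_BRANCH/3.13.tar.gz"
  else
    echo "$PYTHON_DOWNLOAD_URL/$1/Python-$1.tgz"
  fi
}

for v in 3.12.0 3.13.0; do
  cpython_source_url "$v"
done
```

This prints `https://www.python.org/ftp/python/3.12.0/Python-3.12.0.tgz` followed by `https://github.com/python/cpython/archive/refs/heads/3.13.tar.gz`.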

@ -0,0 +1,250 @@
#!/bin/bash
set -ex
NCCL_VERSION=v2.21.5-1
CUDNN_VERSION=9.1.0.70
function install_cusparselt_040 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.4.0.7-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.4.0.7-archive.tar.xz
cp -a libcusparse_lt-linux-x86_64-0.4.0.7-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-x86_64-0.4.0.7-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_cusparselt_052 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.5.2.1-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.5.2.1-archive.tar.xz
cp -a libcusparse_lt-linux-x86_64-0.5.2.1-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-x86_64-0.5.2.1-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_cusparselt_062 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.6.2.3-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.6.2.3-archive.tar.xz
cp -a libcusparse_lt-linux-x86_64-0.6.2.3-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-x86_64-0.6.2.3-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_118 {
echo "Installing CUDA 11.8 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.4.0"
rm -rf /usr/local/cuda-11.8 /usr/local/cuda
# install CUDA 11.8.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/11.8.0/local_installers/cuda_11.8.0_520.61.05_linux.run
chmod +x cuda_11.8.0_520.61.05_linux.run
./cuda_11.8.0_520.61.05_linux.run --toolkit --silent
rm -f cuda_11.8.0_520.61.05_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-11.8 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda11-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_040
ldconfig
}
function install_121 {
echo "Installing CUDA 12.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"
rm -rf /usr/local/cuda-12.1 /usr/local/cuda
# install CUDA 12.1.1 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.1.1/local_installers/cuda_12.1.1_530.30.02_linux.run
chmod +x cuda_12.1.1_530.30.02_linux.run
./cuda_12.1.1_530.30.02_linux.run --toolkit --silent
rm -f cuda_12.1.1_530.30.02_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.1 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_052
ldconfig
}
function install_124 {
echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.6.2"
rm -rf /usr/local/cuda-12.4 /usr/local/cuda
# install CUDA 12.4.1 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux.run
chmod +x cuda_12.4.1_550.54.15_linux.run
./cuda_12.4.1_550.54.15_linux.run --toolkit --silent
rm -f cuda_12.4.1_550.54.15_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.4 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz -O cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
tar xf cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive.tar.xz
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-x86_64-${CUDNN_VERSION}_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b $NCCL_VERSION --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_062
ldconfig
}
function prune_118 {
echo "Pruning CUDA 11.8 and cuDNN"
#####################################################################################
# CUDA 11.8 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-11.8/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-11.8/lib64"
export GENCODE="-gencode arch=compute_35,code=sm_35 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_35,code=sm_35 -gencode arch=compute_37,code=sm_37 -gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
# all CUDA libs except CuDNN and CuBLAS (cudnn and cublas need arch 3.7 included)
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 11.8 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-11.8/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2022.3.0 $CUDA_BASE/nsight-systems-2022.4.2/
}
function prune_121 {
echo "Pruning CUDA 12.1"
#####################################################################################
# CUDA 12.1 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.1/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.1/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.1 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.1/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2023.1.0 $CUDA_BASE/nsight-systems-2023.1.2/
}
function prune_124 {
echo "Pruning CUDA 12.4"
#####################################################################################
# CUDA 12.4 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.4/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.4/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
if [[ -n "$OVERRIDE_GENCODE_CUDNN" ]]; then
export GENCODE_CUDNN=$OVERRIDE_GENCODE_CUDNN
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.4 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.4/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/
}
# idiomatic parameter and option handling in sh
while test $# -gt 0
do
case "$1" in
11.8) install_118; prune_118
;;
12.1) install_121; prune_121
;;
12.4) install_124; prune_124
;;
*) echo "bad argument $1"; exit 1
;;
esac
shift
done
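
The static-library selection that each prune_* function performs with a chain of grep calls can be tried out without a CUDA install. The sketch below is only an illustration: select_prunable_libs is a hypothetical helper name that does not exist in the script, and it reads file names from stdin instead of listing $CUDA_LIB_DIR.

```shell
# Hypothetical helper (not in the script): reads file names on stdin and
# prints the ones the generic nvprune pass would process.
select_prunable_libs() {
  # keep names containing ".a", then drop the excluded library families
  grep '\.a' | grep -v -e culibos -e cudart -e cudnn -e cublas -e metis
}

# cuBLAS archives are filtered out here because the script prunes them
# separately with GENCODE_CUDNN.
printf '%s\n' libcufft_static.a libcudart_static.a libcublas_static.a libnvrtc.so \
  | select_prunable_libs
```

Only libcufft_static.a survives the filter in this sample input.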


@ -0,0 +1,93 @@
#!/bin/bash
# Script used only in CD pipeline
set -ex
NCCL_VERSION=v2.21.5-1
function install_cusparselt_052 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-sbsa/libcusparse_lt-linux-sbsa-0.5.2.1-archive.tar.xz
tar xf libcusparse_lt-linux-sbsa-0.5.2.1-archive.tar.xz
cp -a libcusparse_lt-linux-sbsa-0.5.2.1-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-sbsa-0.5.2.1-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_124 {
echo "Installing CUDA 12.4.1 and cuDNN 9.1 and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"
rm -rf /usr/local/cuda-12.4 /usr/local/cuda
# install CUDA 12.4.1 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux_sbsa.run
chmod +x cuda_12.4.1_550.54.15_linux_sbsa.run
./cuda_12.4.1_550.54.15_linux_sbsa.run --toolkit --silent
rm -f cuda_12.4.1_550.54.15_linux_sbsa.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.4 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn && cd tmp_cudnn
wget -q https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-sbsa/cudnn-linux-sbsa-9.1.0.70_cuda12-archive.tar.xz -O cudnn-linux-sbsa-9.1.0.70_cuda12-archive.tar.xz
tar xf cudnn-linux-sbsa-9.1.0.70_cuda12-archive.tar.xz
cp -a cudnn-linux-sbsa-9.1.0.70_cuda12-archive/include/* /usr/local/cuda/include/
cp -a cudnn-linux-sbsa-9.1.0.70_cuda12-archive/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf tmp_cudnn
# NCCL license: https://docs.nvidia.com/deeplearning/nccl/#licenses
# Follow build: https://github.com/NVIDIA/nccl/tree/master?tab=readme-ov-file#build
git clone -b ${NCCL_VERSION} --depth 1 https://github.com/NVIDIA/nccl.git
cd nccl && make -j src.build
cp -a build/include/* /usr/local/cuda/include/
cp -a build/lib/* /usr/local/cuda/lib64/
cd ..
rm -rf nccl
install_cusparselt_052
ldconfig
}
function prune_124 {
echo "Pruning CUDA 12.4"
#####################################################################################
# CUDA 12.4 prune static libs
#####################################################################################
export NVPRUNE="/usr/local/cuda-12.4/bin/nvprune"
export CUDA_LIB_DIR="/usr/local/cuda-12.4/lib64"
export GENCODE="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
export GENCODE_CUDNN="-gencode arch=compute_50,code=sm_50 -gencode arch=compute_60,code=sm_60 -gencode arch=compute_61,code=sm_61 -gencode arch=compute_70,code=sm_70 -gencode arch=compute_75,code=sm_75 -gencode arch=compute_80,code=sm_80 -gencode arch=compute_86,code=sm_86 -gencode arch=compute_90,code=sm_90"
if [[ -n "$OVERRIDE_GENCODE" ]]; then
export GENCODE=$OVERRIDE_GENCODE
fi
# all CUDA libs except CuDNN and CuBLAS
ls $CUDA_LIB_DIR/ | grep "\.a" | grep -v "culibos" | grep -v "cudart" | grep -v "cudnn" | grep -v "cublas" | grep -v "metis" \
| xargs -I {} bash -c \
"echo {} && $NVPRUNE $GENCODE $CUDA_LIB_DIR/{} -o $CUDA_LIB_DIR/{}"
# prune CuDNN and CuBLAS
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublas_static.a -o $CUDA_LIB_DIR/libcublas_static.a
$NVPRUNE $GENCODE_CUDNN $CUDA_LIB_DIR/libcublasLt_static.a -o $CUDA_LIB_DIR/libcublasLt_static.a
#####################################################################################
# CUDA 12.4 prune visual tools
#####################################################################################
export CUDA_BASE="/usr/local/cuda-12.4/"
rm -rf $CUDA_BASE/libnvvp $CUDA_BASE/nsightee_plugins $CUDA_BASE/nsight-compute-2024.1.0 $CUDA_BASE/nsight-systems-2023.4.4/
}
# idiomatic parameter and option handling in sh
while test $# -gt 0
do
case "$1" in
12.4) install_124; prune_124
;;
*) echo "bad argument $1"; exit 1
;;
esac
shift
done
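
The argument loop above can be exercised in isolation. This is a minimal sketch under stated assumptions: install_124 and prune_124 are replaced by stubs that only print, and the loop is wrapped in a hypothetical function, handle_cuda_args, so that a bad argument returns 1 instead of exiting the shell.

```shell
# Stubs standing in for the real functions (assumption: they only log here)
install_124() { echo "install 12.4"; }
prune_124() { echo "prune 12.4"; }

# Same loop shape as the script: consume one positional argument per pass.
handle_cuda_args() {
  while test $# -gt 0; do
    case "$1" in
      12.4) install_124; prune_124 ;;
      *) echo "bad argument $1"; return 1 ;;
    esac
    shift
  done
}

handle_cuda_args 12.4
```

Because the loop shifts after each case, several versions can be passed in one invocation and are handled in order.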


@ -1,20 +1,18 @@
#!/bin/bash
if [[ ${CUDNN_VERSION} == 8 ]]; then
if [[ -n "${CUDNN_VERSION}" ]]; then
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
mkdir tmp_cudnn
pushd tmp_cudnn
if [[ ${CUDA_VERSION:0:4} == "12.1" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-8.9.2.26_cuda12-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/${CUDNN_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "11.8" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-8.7.0.84_cuda11-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/redist/cudnn/v8.7.0/local_installers/11.8/${CUDNN_NAME}.tar.xz
if [[ ${CUDA_VERSION:0:2} == "12" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda12-archive"
elif [[ ${CUDA_VERSION:0:2} == "11" ]]; then
CUDNN_NAME="cudnn-linux-x86_64-9.1.0.70_cuda11-archive"
else
echo "Unsupported CUDA version ${CUDA_VERSION}"
exit 1
fi
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cudnn/redist/cudnn/linux-x86_64/${CUDNN_NAME}.tar.xz
tar xf ${CUDNN_NAME}.tar.xz
cp -a ${CUDNN_NAME}/include/* /usr/local/cuda/include/
cp -a ${CUDNN_NAME}/lib/* /usr/local/cuda/lib64/


@ -0,0 +1,25 @@
#!/bin/bash
set -ex
# cudss license: https://docs.nvidia.com/cuda/cudss/license.html
mkdir tmp_cudss && cd tmp_cudss
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[1-4]$ ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ "${TARGETARCH}" = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
arch_path='x86_64'
fi
CUDSS_NAME="libcudss-linux-${arch_path}-0.3.0.9_cuda12-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cudss/redist/libcudss/linux-${arch_path}/${CUDSS_NAME}.tar.xz
# only for cuda 12
tar xf ${CUDSS_NAME}.tar.xz
cp -a ${CUDSS_NAME}/include/* /usr/local/cuda/include/
cp -a ${CUDSS_NAME}/lib/* /usr/local/cuda/lib64/
fi
cd ..
rm -rf tmp_cudss
ldconfig
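
The architecture selection used for the download URL can be sketched as a small function. The name nvidia_arch_path is hypothetical; the mapping itself (amd64 or x86_64 gives x86_64, anything else falls back to sbsa) is taken from the script above.

```shell
# Hypothetical helper: maps a Docker-style TARGETARCH (or `uname -m` output)
# to the directory name used in NVIDIA's redist URLs.
nvidia_arch_path() {
  case "${1:-$(uname -m)}" in
    amd64|x86_64) echo x86_64 ;;
    *) echo sbsa ;;   # the script treats every other value as Arm server (sbsa)
  esac
}

arch_path=$(nvidia_arch_path "${TARGETARCH:-}")
echo "libcudss-linux-${arch_path}-0.3.0.9_cuda12-archive"
```

A case statement keeps the two accepted spellings of the x86 architecture in one place, which may be easier to extend than the chained test expressions.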


@ -5,9 +5,22 @@ set -ex
# cuSPARSELt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && cd tmp_cusparselt
if [[ ${CUDA_VERSION:0:4} == "12.1" ]]; then
CUSPARSELT_NAME="libcusparse_lt-linux-x86_64-0.5.2.1-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/${CUSPARSELT_NAME}.tar.xz
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[2-4]$ ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ "${TARGETARCH}" = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
arch_path='x86_64'
fi
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.6.2.3-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "12.1" ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ "${TARGETARCH}" = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
arch_path='x86_64'
fi
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.5.2.1-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "11.8" ]]; then
CUSPARSELT_NAME="libcusparse_lt-linux-x86_64-0.4.0.7-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/${CUSPARSELT_NAME}.tar.xz
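
The CUDA-to-cuSPARSELt version mapping in the new branches of this hunk can be summarized as a lookup. This is a sketch, not the script's code: cusparselt_version_for_cuda is a hypothetical name, and it matches on major.minor with a case pattern rather than slicing ${CUDA_VERSION:0:4}.

```shell
# Hypothetical helper: prints the cuSPARSELt release paired with a CUDA version.
# Mapping taken from the hunk: 12.2-12.4 -> 0.6.2.3, 12.1 -> 0.5.2.1, 11.8 -> 0.4.0.7
cusparselt_version_for_cuda() {
  case "$1" in
    12.[2-4]|12.[2-4].*) echo 0.6.2.3 ;;
    12.1|12.1.*)         echo 0.5.2.1 ;;
    11.8|11.8.*)         echo 0.4.0.7 ;;
    *) return 1 ;;       # unknown CUDA version: let the caller decide what to do
  esac
}

cusparselt_version_for_cuda 12.4.1
```

Note that in the script only the 12.x branches choose between sbsa and x86_64 archives; the 11.8 branch always downloads the x86_64 one.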


@ -37,6 +37,9 @@ install_conda_dependencies() {
install_pip_dependencies() {
pushd executorch/.ci/docker
# Install PyTorch CPU build beforehand to avoid installing the much bigger CUDA
# binaries later, ExecuTorch only needs CPU
pip_install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cpu
# Install all Python dependencies
pip_install -r requirements-ci.txt
popd
@ -44,13 +47,14 @@ install_pip_dependencies() {
setup_executorch() {
pushd executorch
source .ci/scripts/utils.sh
# Setup swiftshader and Vulkan SDK which are required to build the Vulkan delegate
as_jenkins bash .ci/scripts/setup-vulkan-linux-deps.sh
install_flatc_from_source
pip_install .
export PYTHON_EXECUTABLE=python
export EXECUTORCH_BUILD_PYBIND=ON
export CMAKE_ARGS="-DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"
# Make sure that all the newly generated files are owned by Jenkins
chown -R jenkins .
as_jenkins .ci/scripts/setup-linux.sh cmake
popd
}


@ -0,0 +1,46 @@
#!/bin/bash
set -ex
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
COMMIT=$(get_pinned_commit halide)
test -n "$COMMIT"
# activate conda to populate CONDA_PREFIX
test -n "$ANACONDA_PYTHON_VERSION"
eval "$(conda shell.bash hook)"
conda activate py_$ANACONDA_PYTHON_VERSION
if [ -n "${UBUNTU_VERSION}" ];then
apt update
apt-get install -y lld liblld-15-dev libpng-dev libjpeg-dev libgl-dev \
libopenblas-dev libeigen3-dev libatlas-base-dev libzstd-dev
fi
conda_install numpy scipy imageio cmake ninja
git clone --depth 1 --branch release/16.x --recursive https://github.com/llvm/llvm-project.git
cmake -DCMAKE_BUILD_TYPE=Release \
-DLLVM_ENABLE_PROJECTS="clang" \
-DLLVM_TARGETS_TO_BUILD="X86;NVPTX" \
-DLLVM_ENABLE_TERMINFO=OFF -DLLVM_ENABLE_ASSERTIONS=ON \
-DLLVM_ENABLE_EH=ON -DLLVM_ENABLE_RTTI=ON -DLLVM_BUILD_32_BITS=OFF \
-S llvm-project/llvm -B llvm-build -G Ninja
cmake --build llvm-build
cmake --install llvm-build --prefix llvm-install
export LLVM_ROOT=`pwd`/llvm-install
export LLVM_CONFIG=$LLVM_ROOT/bin/llvm-config
git clone https://github.com/halide/Halide.git
pushd Halide
git checkout ${COMMIT} && git submodule update --init --recursive
pip_install -r requirements.txt
cmake -G Ninja -DCMAKE_BUILD_TYPE=Release -S . -B build
cmake --build build
test -e ${CONDA_PREFIX}/lib/python3 || ln -s python${ANACONDA_PYTHON_VERSION} ${CONDA_PREFIX}/lib/python3
cmake --install build --prefix ${CONDA_PREFIX}
chown -R jenkins ${CONDA_PREFIX}
popd
rm -rf Halide llvm-build llvm-project llvm-install
python -c "import halide" # check for errors


@ -0,0 +1,23 @@
#!/bin/bash
# Script used only in CD pipeline
set -ex
LIBPNG_VERSION=1.6.37
mkdir -p libpng
pushd libpng
wget http://download.sourceforge.net/libpng/libpng-$LIBPNG_VERSION.tar.gz
tar -xvzf libpng-$LIBPNG_VERSION.tar.gz
pushd libpng-$LIBPNG_VERSION
./configure
make
make install
popd
popd
rm -rf libpng


@ -0,0 +1,29 @@
#!/usr/bin/env bash
# Script used only in CD pipeline
set -eou pipefail
MAGMA_VERSION="2.5.2"
function do_install() {
cuda_version=$1
cuda_version_nodot=${1/./}
MAGMA_VERSION="2.6.1"
magma_archive="magma-cuda${cuda_version_nodot}-${MAGMA_VERSION}-1.tar.bz2"
cuda_dir="/usr/local/cuda-${cuda_version}"
(
set -x
tmp_dir=$(mktemp -d)
pushd ${tmp_dir}
curl -OLs https://anaconda.org/pytorch/magma-cuda${cuda_version_nodot}/${MAGMA_VERSION}/download/linux-64/${magma_archive}
tar -xvf "${magma_archive}"
mkdir -p "${cuda_dir}/magma"
mv include "${cuda_dir}/magma/include"
mv lib "${cuda_dir}/magma/lib"
popd
)
}
do_install $1


@ -0,0 +1,137 @@
#!/bin/bash
# Script used only in CD pipeline
set -ex
ROCM_VERSION=$1
if [[ -z $ROCM_VERSION ]]; then
echo "missing ROCM_VERSION"
exit 1;
fi
# To make version comparison easier, create an integer representation.
save_IFS="$IFS"
IFS=. ROCM_VERSION_ARRAY=(${ROCM_VERSION})
IFS="$save_IFS"
if [[ ${#ROCM_VERSION_ARRAY[@]} == 2 ]]; then
ROCM_VERSION_MAJOR=${ROCM_VERSION_ARRAY[0]}
ROCM_VERSION_MINOR=${ROCM_VERSION_ARRAY[1]}
ROCM_VERSION_PATCH=0
elif [[ ${#ROCM_VERSION_ARRAY[@]} == 3 ]]; then
ROCM_VERSION_MAJOR=${ROCM_VERSION_ARRAY[0]}
ROCM_VERSION_MINOR=${ROCM_VERSION_ARRAY[1]}
ROCM_VERSION_PATCH=${ROCM_VERSION_ARRAY[2]}
else
echo "Unhandled ROCM_VERSION ${ROCM_VERSION}"
exit 1
fi
ROCM_INT=$(($ROCM_VERSION_MAJOR * 10000 + $ROCM_VERSION_MINOR * 100 + $ROCM_VERSION_PATCH))
# Install custom MIOpen + COMgr for ROCm >= 4.0.1
if [[ $ROCM_INT -lt 40001 ]]; then
echo "ROCm version < 4.0.1; will not install custom MIOpen"
exit 0
fi
# Function to retry functions that sometimes timeout or have flaky failures
retry () {
$* || (sleep 1 && $*) || (sleep 2 && $*) || (sleep 4 && $*) || (sleep 8 && $*)
}
# Build custom MIOpen to use comgr for offline compilation.
## Need a sanitized ROCM_VERSION without patchlevel; patchlevel version 0 must be added to paths.
ROCM_DOTS=$(echo ${ROCM_VERSION} | tr -d -c '.' | wc -c)
if [[ ${ROCM_DOTS} == 1 ]]; then
ROCM_VERSION_NOPATCH="${ROCM_VERSION}"
ROCM_INSTALL_PATH="/opt/rocm-${ROCM_VERSION}.0"
else
ROCM_VERSION_NOPATCH="${ROCM_VERSION%.*}"
ROCM_INSTALL_PATH="/opt/rocm-${ROCM_VERSION}"
fi
# MIOPEN_USE_HIP_KERNELS is a Workaround for COMgr issues
MIOPEN_CMAKE_COMMON_FLAGS="
-DMIOPEN_USE_COMGR=ON
-DMIOPEN_BUILD_DRIVER=OFF
"
# Pull MIOpen repo and set the -DMIOPEN_EMBED_DB flag based on ROCm version
if [[ $ROCM_INT -ge 60200 ]] && [[ $ROCM_INT -lt 60300 ]]; then
echo "ROCm 6.2 MIOpen does not need any patches, do not build from source"
exit 0
elif [[ $ROCM_INT -ge 60100 ]] && [[ $ROCM_INT -lt 60200 ]]; then
echo "ROCm 6.1 MIOpen does not need any patches, do not build from source"
exit 0
elif [[ $ROCM_INT -ge 60000 ]] && [[ $ROCM_INT -lt 60100 ]]; then
echo "ROCm 6.0 MIOpen does not need any patches, do not build from source"
exit 0
elif [[ $ROCM_INT -ge 50700 ]] && [[ $ROCM_INT -lt 60000 ]]; then
echo "ROCm 5.7 MIOpen does not need any patches, do not build from source"
exit 0
elif [[ $ROCM_INT -ge 50600 ]] && [[ $ROCM_INT -lt 50700 ]]; then
MIOPEN_BRANCH="release/rocm-rel-5.6-staging"
elif [[ $ROCM_INT -ge 50500 ]] && [[ $ROCM_INT -lt 50600 ]]; then
MIOPEN_BRANCH="release/rocm-rel-5.5-gfx11"
elif [[ $ROCM_INT -ge 50400 ]] && [[ $ROCM_INT -lt 50500 ]]; then
MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36 -DMIOPEN_USE_MLIR=Off"
MIOPEN_BRANCH="release/rocm-rel-5.4-staging"
elif [[ $ROCM_INT -ge 50300 ]] && [[ $ROCM_INT -lt 50400 ]]; then
MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36 -DMIOPEN_USE_MLIR=Off"
MIOPEN_BRANCH="release/rocm-rel-5.3-staging"
elif [[ $ROCM_INT -ge 50200 ]] && [[ $ROCM_INT -lt 50300 ]]; then
MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36 -DMIOPEN_USE_MLIR=Off"
MIOPEN_BRANCH="release/rocm-rel-5.2-staging"
elif [[ $ROCM_INT -ge 50100 ]] && [[ $ROCM_INT -lt 50200 ]]; then
MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36"
MIOPEN_BRANCH="release/rocm-rel-5.1-staging"
elif [[ $ROCM_INT -ge 50000 ]] && [[ $ROCM_INT -lt 50100 ]]; then
MIOPEN_CMAKE_DB_FLAGS="-DMIOPEN_EMBED_DB=gfx900_56;gfx906_60;gfx90878;gfx90a6e;gfx1030_36"
MIOPEN_BRANCH="release/rocm-rel-5.0-staging"
else
echo "Unhandled ROCM_VERSION ${ROCM_VERSION}"
exit 1
fi
yum remove -y miopen-hip
git clone https://github.com/ROCm/MIOpen -b ${MIOPEN_BRANCH}
pushd MIOpen
# remove .git to save disk space since CI runner was running out
rm -rf .git
# Don't build MLIR to save docker build time
# since we are disabling MLIR backend for MIOpen anyway
if [[ $ROCM_INT -ge 50400 ]] && [[ $ROCM_INT -lt 50500 ]]; then
sed -i '/rocMLIR/d' requirements.txt
elif [[ $ROCM_INT -ge 50200 ]] && [[ $ROCM_INT -lt 50400 ]]; then
sed -i '/llvm-project-mlir/d' requirements.txt
fi
## MIOpen minimum requirements
cmake -P install_deps.cmake --minimum
# clean up since CI runner was running out of disk space
rm -rf /tmp/*
yum clean all
rm -rf /var/cache/yum
rm -rf /var/lib/yum/yumdb
rm -rf /var/lib/yum/history
## Build MIOpen
mkdir -p build
cd build
PKG_CONFIG_PATH=/usr/local/lib/pkgconfig CXX=${ROCM_INSTALL_PATH}/llvm/bin/clang++ cmake .. \
${MIOPEN_CMAKE_COMMON_FLAGS} \
${MIOPEN_CMAKE_DB_FLAGS} \
-DCMAKE_PREFIX_PATH="${ROCM_INSTALL_PATH}/hip;${ROCM_INSTALL_PATH}"
make MIOpen -j $(nproc)
# Build MIOpen package
make -j $(nproc) package
# clean up since CI runner was running out of disk space
rm -rf /usr/local/cget
yum install -y miopen-*.rpm
popd
rm -rf MIOpen
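
The integer encoding of ROCM_VERSION computed near the top of this script (major * 10000 + minor * 100 + patch, with a missing patch level treated as 0) can be sketched as a standalone function. The name rocm_version_to_int is hypothetical, and the body uses a subshell instead of saving and restoring IFS as the script does.

```shell
# Hypothetical helper: "6.2" -> 60200, "5.4.2" -> 50402
rocm_version_to_int() (
  # The function body is a subshell, so the IFS change does not leak to the caller
  IFS=.
  set -- $1            # split the version string on dots into $1 $2 [$3]
  if [ $# -ne 2 ] && [ $# -ne 3 ]; then
    echo "Unhandled ROCM_VERSION" >&2
    return 1
  fi
  echo "$(( $1 * 10000 + $2 * 100 + ${3:-0} ))"
)

rocm_version_to_int 5.4.2
```

With the version as a single integer, range checks such as the per-release branch selection reduce to plain -ge and -lt comparisons.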


@ -0,0 +1,16 @@
#!/bin/bash
set -ex
# MKL
MKL_VERSION=2024.2.0
MKLROOT=/opt/intel
mkdir -p ${MKLROOT}
pushd /tmp
python3 -mpip install wheel
python3 -mpip download -d . mkl-static==${MKL_VERSION}
python3 -m wheel unpack mkl_static-${MKL_VERSION}-py2.py3-none-manylinux1_x86_64.whl
python3 -m wheel unpack mkl_include-${MKL_VERSION}-py2.py3-none-manylinux1_x86_64.whl
mv mkl_static-${MKL_VERSION}/mkl_static-${MKL_VERSION}.data/data/lib ${MKLROOT}
mv mkl_include-${MKL_VERSION}/mkl_include-${MKL_VERSION}.data/data/include ${MKLROOT}


@ -0,0 +1,13 @@
#!/bin/bash
# Script used only in CD pipeline
set -ex
mkdir -p /usr/local/mnist/
cd /usr/local/mnist
for img in train-images-idx3-ubyte.gz train-labels-idx1-ubyte.gz t10k-images-idx3-ubyte.gz t10k-labels-idx1-ubyte.gz; do
wget -q https://ossci-datasets.s3.amazonaws.com/mnist/$img
gzip -d $img
done


@ -0,0 +1,20 @@
#!/bin/bash
set -ex
function install_nvpl {
mkdir -p /opt/nvpl/lib /opt/nvpl/include
wget https://developer.download.nvidia.com/compute/nvpl/redist/nvpl_blas/linux-sbsa/nvpl_blas-linux-sbsa-0.3.0-archive.tar.xz
tar xf nvpl_blas-linux-sbsa-0.3.0-archive.tar.xz
cp -r nvpl_blas-linux-sbsa-0.3.0-archive/lib/* /opt/nvpl/lib/
cp -r nvpl_blas-linux-sbsa-0.3.0-archive/include/* /opt/nvpl/include/
wget https://developer.download.nvidia.com/compute/nvpl/redist/nvpl_lapack/linux-sbsa/nvpl_lapack-linux-sbsa-0.2.3.1-archive.tar.xz
tar xf nvpl_lapack-linux-sbsa-0.2.3.1-archive.tar.xz
cp -r nvpl_lapack-linux-sbsa-0.2.3.1-archive/lib/* /opt/nvpl/lib/
cp -r nvpl_lapack-linux-sbsa-0.2.3.1-archive/include/* /opt/nvpl/include/
}
install_nvpl


@ -15,7 +15,7 @@ pip_install \
flatbuffers==2.0 \
mock==5.0.1 \
ninja==1.10.2 \
networkx==2.0 \
networkx==2.5 \
numpy==1.24.2
# ONNXRuntime should be installed before installing
@ -30,10 +30,11 @@ pip_install \
pip_install coloredlogs packaging
pip_install onnxruntime==1.17.0
pip_install onnx==1.15.0
# pip_install "onnxscript@git+https://github.com/microsoft/onnxscript@3e869ef8ccf19b5ebd21c10d3e9c267c9a9fa729" --no-deps
pip_install onnxscript==0.1.0.dev20240315 --no-deps
pip_install onnxruntime==1.18.1
pip_install onnx==1.16.2
pip_install onnxscript==0.1.0.dev20240831 --no-deps
# required by onnxscript
pip_install ml_dtypes
# Cache the transformers model to be used later by ONNX tests. We need to run the transformers
# package to download the model. By default, the model is cached at ~/.cache/huggingface/hub/


@ -0,0 +1,22 @@
#!/bin/bash
# Script used only in CD pipeline
set -ex
cd /
git clone https://github.com/OpenMathLib/OpenBLAS.git -b v0.3.25 --depth 1 --shallow-submodules
OPENBLAS_BUILD_FLAGS="
NUM_THREADS=128
USE_OPENMP=1
NO_SHARED=0
DYNAMIC_ARCH=1
TARGET=ARMV8
CFLAGS=-O3
"
OPENBLAS_CHECKOUT_DIR="OpenBLAS"
make -j8 ${OPENBLAS_BUILD_FLAGS} -C ${OPENBLAS_CHECKOUT_DIR}
make -j8 ${OPENBLAS_BUILD_FLAGS} install -C ${OPENBLAS_CHECKOUT_DIR}


@ -0,0 +1,16 @@
#!/bin/bash
# Script used only in CD pipeline
set -ex
# Pin the version to latest release 0.17.2, building newer commit starts
# to fail on the current image
git clone -b 0.17.2 --single-branch https://github.com/NixOS/patchelf
cd patchelf
sed -i 's/serial/parallel/g' configure.ac
./bootstrap.sh
./configure
make
make install
cd ..
rm -rf patchelf


@ -39,7 +39,8 @@ install_ubuntu() {
rocm-libs \
rccl \
rocprofiler-dev \
roctracer-dev
roctracer-dev \
amd-smi-lib
if [[ $(ver $ROCM_VERSION) -ge $(ver 6.1) ]]; then
DEBIAN_FRONTEND=noninteractive apt-get install -y --allow-unauthenticated rocm-llvm-dev
@ -106,7 +107,8 @@ install_centos() {
rocm-libs \
rccl \
rocprofiler-dev \
roctracer-dev
roctracer-dev \
amd-smi-lib
# precompiled miopen kernels; search for all unversioned packages
# if search fails it will abort this script; use true to avoid case where search fails


@ -0,0 +1,150 @@
#!/bin/bash
# Script used only in CD pipeline
###########################
### prereqs
###########################
# Install Python packages depending on the base OS
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
case "$ID" in
ubuntu)
apt-get update -y
apt-get install -y libpciaccess-dev pkg-config
apt-get clean
;;
centos)
yum install -y libpciaccess-devel pkgconfig
;;
*)
echo "Unable to determine OS..."
exit 1
;;
esac
python3 -m pip install meson ninja
###########################
### clone repo
###########################
GIT_SSL_NO_VERIFY=true git clone https://gitlab.freedesktop.org/mesa/drm.git
pushd drm
###########################
### patch
###########################
patch -p1 <<'EOF'
diff --git a/amdgpu/amdgpu_asic_id.c b/amdgpu/amdgpu_asic_id.c
index a5007ffc..13fa07fc 100644
--- a/amdgpu/amdgpu_asic_id.c
+++ b/amdgpu/amdgpu_asic_id.c
@@ -22,6 +22,13 @@
*
*/
+#define _XOPEN_SOURCE 700
+#define _LARGEFILE64_SOURCE
+#define _FILE_OFFSET_BITS 64
+#include <ftw.h>
+#include <link.h>
+#include <limits.h>
+
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
@@ -34,6 +41,19 @@
#include "amdgpu_drm.h"
#include "amdgpu_internal.h"
+static char *amdgpuids_path = NULL;
+static const char* amdgpuids_path_msg = NULL;
+
+static int check_for_location_of_amdgpuids(const char *filepath, const struct stat *info, const int typeflag, struct FTW *pathinfo)
+{
+ if (typeflag == FTW_F && strstr(filepath, "amdgpu.ids")) {
+ amdgpuids_path = strdup(filepath);
+ return 1;
+ }
+
+ return 0;
+}
+
static int parse_one_line(struct amdgpu_device *dev, const char *line)
{
char *buf, *saveptr;
@@ -113,10 +133,46 @@ void amdgpu_parse_asic_ids(struct amdgpu_device *dev)
int line_num = 1;
int r = 0;
+ // attempt to find typical location for amdgpu.ids file
fp = fopen(AMDGPU_ASIC_ID_TABLE, "r");
+
+ // if it doesn't exist, search
+ if (!fp) {
+
+ char self_path[ PATH_MAX ];
+ ssize_t count;
+ ssize_t i;
+
+ count = readlink( "/proc/self/exe", self_path, PATH_MAX );
+ if (count > 0) {
+ self_path[count] = '\0';
+
+ // remove '/bin/python' from self_path
+ for (i=count; i>0; --i) {
+ if (self_path[i] == '/') break;
+ self_path[i] = '\0';
+ }
+ self_path[i] = '\0';
+ for (; i>0; --i) {
+ if (self_path[i] == '/') break;
+ self_path[i] = '\0';
+ }
+ self_path[i] = '\0';
+
+ if (1 == nftw(self_path, check_for_location_of_amdgpuids, 5, FTW_PHYS)) {
+ fp = fopen(amdgpuids_path, "r");
+ amdgpuids_path_msg = amdgpuids_path;
+ }
+ }
+
+ }
+ else {
+ amdgpuids_path_msg = AMDGPU_ASIC_ID_TABLE;
+ }
+
+ // both hard-coded location and search have failed
if (!fp) {
- fprintf(stderr, "%s: %s\n", AMDGPU_ASIC_ID_TABLE,
- strerror(errno));
+ fprintf(stderr, "amdgpu.ids: No such file or directory\n");
return;
}
@@ -132,7 +188,7 @@ void amdgpu_parse_asic_ids(struct amdgpu_device *dev)
continue;
}
- drmMsg("%s version: %s\n", AMDGPU_ASIC_ID_TABLE, line);
+ drmMsg("%s version: %s\n", amdgpuids_path_msg, line);
break;
}
@@ -150,7 +206,7 @@ void amdgpu_parse_asic_ids(struct amdgpu_device *dev)
if (r == -EINVAL) {
fprintf(stderr, "Invalid format: %s: line %d: %s\n",
- AMDGPU_ASIC_ID_TABLE, line_num, line);
+ amdgpuids_path_msg, line_num, line);
} else if (r && r != -EAGAIN) {
fprintf(stderr, "%s: Cannot parse ASIC IDs: %s\n",
__func__, strerror(-r));
EOF
###########################
### build
###########################
meson builddir --prefix=/opt/amdgpu
pushd builddir
ninja install
popd
popd


@ -1,7 +1,11 @@
#!/bin/bash
# Script used in CI and CD pipeline
set -ex
MKLROOT=${MKLROOT:-/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION}
# "install" hipMAGMA into /opt/rocm/magma by copying after build
git clone https://bitbucket.org/icl/magma.git
pushd magma
@ -11,7 +15,10 @@ git checkout a1625ff4d9bc362906bd01f805dbbe12612953f6
cp make.inc-examples/make.inc.hip-gcc-mkl make.inc
echo 'LIBDIR += -L$(MKLROOT)/lib' >> make.inc
echo 'LIB += -Wl,--enable-new-dtags -Wl,--rpath,/opt/rocm/lib -Wl,--rpath,$(MKLROOT)/lib -Wl,--rpath,/opt/rocm/magma/lib' >> make.inc
if [[ -f "${MKLROOT}/lib/libmkl_core.a" ]]; then
echo 'LIB = -Wl,--start-group -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -Wl,--end-group -lpthread -lstdc++ -lm -lgomp -lhipblas -lhipsparse' >> make.inc
fi
echo 'LIB += -Wl,--enable-new-dtags -Wl,--rpath,/opt/rocm/lib -Wl,--rpath,$(MKLROOT)/lib -Wl,--rpath,/opt/rocm/magma/lib -ldl' >> make.inc
echo 'DEVCCFLAGS += --gpu-max-threads-per-block=256' >> make.inc
export PATH="${PATH}:/opt/rocm/bin"
if [[ -n "$PYTORCH_ROCM_ARCH" ]]; then
@ -25,7 +32,7 @@ done
# hipcc with openmp flag may cause isnan() on __device__ not to be found; depending on context, compiler may attempt to match with host definition
sed -i 's/^FOPENMP/#FOPENMP/g' make.inc
make -f make.gen.hipMAGMA -j $(nproc)
LANG=C.UTF-8 make lib/libmagma.so -j $(nproc) MKLROOT=/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION
make testing/testing_dgemm -j $(nproc) MKLROOT=/opt/conda/envs/py_$ANACONDA_PYTHON_VERSION
LANG=C.UTF-8 make lib/libmagma.so -j $(nproc) MKLROOT="${MKLROOT}"
make testing/testing_dgemm -j $(nproc) MKLROOT="${MKLROOT}"
popd
mv magma /opt/rocm
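
The MKLROOT default introduced at the top of this script relies on the ${VAR:-default} expansion: a value supplied by the caller wins, otherwise the conda environment path is used. A minimal sketch, with resolve_mklroot as a hypothetical wrapper added only so the behavior can be checked:

```shell
# Hypothetical wrapper around the expansion used in the script
resolve_mklroot() {
  echo "${MKLROOT:-/opt/conda/envs/py_${ANACONDA_PYTHON_VERSION}}"
}

# Each case runs in a subshell so the assignments do not affect the caller
( MKLROOT=/opt/intel; resolve_mklroot )
( unset MKLROOT; ANACONDA_PYTHON_VERSION=3.10; resolve_mklroot )
```

This is what lets the same script serve both pipelines: the CD image can point MKLROOT at a static MKL install while CI keeps the conda default.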


@ -12,10 +12,7 @@ conda_reinstall() {
as_jenkins conda install -q -n py_$ANACONDA_PYTHON_VERSION -y --force-reinstall $*
}
if [ -n "${ROCM_VERSION}" ]; then
TRITON_REPO="https://github.com/openai/triton"
TRITON_TEXT_FILE="triton-rocm"
elif [ -n "${XPU_VERSION}" ]; then
if [ -n "${XPU_VERSION}" ]; then
TRITON_REPO="https://github.com/intel/intel-xpu-backend-for-triton"
TRITON_TEXT_FILE="triton-xpu"
else
@ -41,19 +38,33 @@ if [ -z "${MAX_JOBS}" ]; then
export MAX_JOBS=$(nproc)
fi
# Git checkout triton
mkdir /var/lib/jenkins/triton
chown -R jenkins /var/lib/jenkins/triton
chgrp -R jenkins /var/lib/jenkins/triton
pushd /var/lib/jenkins/
as_jenkins git clone ${TRITON_REPO} triton
cd triton
as_jenkins git checkout ${TRITON_PINNED_COMMIT}
cd python
# TODO: remove patch setup.py once we have a proper fix for https://github.com/triton-lang/triton/issues/4527
as_jenkins sed -i -e 's/https:\/\/tritonlang.blob.core.windows.net\/llvm-builds/https:\/\/oaitriton.blob.core.windows.net\/public\/llvm-builds/g' setup.py
if [ -n "${UBUNTU_VERSION}" ] && [ -n "${GCC_VERSION}" ] && [[ "${GCC_VERSION}" == "7" ]]; then
# Triton needs at least gcc-9 to build
apt-get install -y g++-9
CXX=g++-9 pip_install "git+${TRITON_REPO}@${TRITON_PINNED_COMMIT}#subdirectory=python"
CXX=g++-9 pip_install -e .
elif [ -n "${UBUNTU_VERSION}" ] && [ -n "${CLANG_VERSION}" ]; then
# Triton needs <filesystem> which surprisingly is not available with clang-9 toolchain
add-apt-repository -y ppa:ubuntu-toolchain-r/test
apt-get install -y g++-9
CXX=g++-9 pip_install "git+${TRITON_REPO}@${TRITON_PINNED_COMMIT}#subdirectory=python"
CXX=g++-9 pip_install -e .
else
pip_install "git+${TRITON_REPO}@${TRITON_PINNED_COMMIT}#subdirectory=python"
pip_install -e .
fi
if [ -n "${CONDA_CMAKE}" ]; then
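Review note on the `setup.py` patch in the hunk above: the `sed` call rewrites the LLVM download host from `tritonlang.blob.core.windows.net/llvm-builds` to `oaitriton.blob.core.windows.net/public/llvm-builds`. A minimal sketch of the same substitution follows, using `|` as the delimiter so the slashes need no escaping. The function name `rewrite_llvm_url` and the sample archive name in the usage line are hypothetical, chosen only for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: applies the same host rewrite as the sed call in the
# hunk above, reading from stdin and writing to stdout.
rewrite_llvm_url() {
  sed -e 's|https://tritonlang.blob.core.windows.net/llvm-builds|https://oaitriton.blob.core.windows.net/public/llvm-builds|g'
}

# Usage (the archive name is made up for the example):
echo 'https://tritonlang.blob.core.windows.net/llvm-builds/llvm-example.tar.gz' | rewrite_llvm_url
```

As in the original command, the dots in the pattern are unescaped and therefore match any character; that is harmless for this URL but worth knowing if the pattern is reused.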


@ -1,6 +1,6 @@
#!/bin/bash
set -xe
# Script used in CI and CD pipeline
# Intel® software for general purpose GPU capabilities.
# Refer to https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpus.html
@ -8,19 +8,23 @@ set -xe
# Users should update to the latest version as it becomes available
function install_ubuntu() {
. /etc/os-release
if [[ ! " jammy " =~ " ${VERSION_CODENAME} " ]]; then
echo "Ubuntu version ${VERSION_CODENAME} not supported"
exit
fi
apt-get update -y
apt-get install -y gpg-agent wget
# Set up the repository. To do this, download the key to the system keyring
# To add the online network package repository for the GPU Driver
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key \
| gpg --dearmor --output /usr/share/keyrings/intel-graphics.gpg
wget -qO - https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor --output /usr/share/keyrings/intel-for-pytorch-gpu-dev-keyring.gpg
# Add the signed entry to APT sources and configure the APT client to use the Intel repository
| gpg --yes --dearmor --output /usr/share/keyrings/intel-graphics.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] \
https://repositories.intel.com/gpu/ubuntu jammy/lts/2350 unified" \
| tee /etc/apt/sources.list.d/intel-gpu-jammy.list
https://repositories.intel.com/gpu/ubuntu ${VERSION_CODENAME}${XPU_DRIVER_VERSION} unified" \
| tee /etc/apt/sources.list.d/intel-gpu-${VERSION_CODENAME}.list
# To add the online network package repository for the Intel Support Packages
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
| gpg --dearmor > /usr/share/keyrings/intel-for-pytorch-gpu-dev-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/intel-for-pytorch-gpu-dev-keyring.gpg] \
https://apt.repos.intel.com/intel-for-pytorch-gpu-dev all main" \
| tee /etc/apt/sources.list.d/intel-for-pytorch-gpu-dev.list
@ -41,9 +45,9 @@ function install_ubuntu() {
apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev level-zero-dev
# Install Intel Support Packages
if [ -n "$XPU_VERSION" ]; then
apt-get install -y intel-for-pytorch-gpu-dev-${XPU_VERSION}
apt-get install -y intel-for-pytorch-gpu-dev-${XPU_VERSION} intel-pti-dev
else
apt-get install -y intel-for-pytorch-gpu-dev
apt-get install -y intel-for-pytorch-gpu-dev intel-pti-dev
fi
# Cleanup
@ -51,44 +55,49 @@ function install_ubuntu() {
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
}
function install_centos() {
dnf install -y 'dnf-command(config-manager)'
dnf config-manager --add-repo \
https://repositories.intel.com/gpu/rhel/8.6/production/2328/unified/intel-gpu-8.6.repo
# To add the EPEL repository needed for DKMS
dnf -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
# https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
function install_rhel() {
. /etc/os-release
if [[ "${ID}" == "rhel" ]]; then
if [[ ! " 8.6 8.8 8.9 9.0 9.2 9.3 " =~ " ${VERSION_ID} " ]]; then
echo "RHEL version ${VERSION_ID} not supported"
exit
fi
elif [[ "${ID}" == "almalinux" ]]; then
# Workaround for almalinux8, which is used by quay.io/pypa/manylinux_2_28_x86_64
VERSION_ID="8.6"
fi
# Create the YUM repository file in the /tmp directory as a normal user
tee > /tmp/oneAPI.repo << EOF
[oneAPI]
name=Intel® oneAPI repository
baseurl=https://yum.repos.intel.com/oneapi
dnf install -y 'dnf-command(config-manager)'
# To add the online network package repository for the GPU Driver
dnf config-manager --add-repo \
https://repositories.intel.com/gpu/rhel/${VERSION_ID}${XPU_DRIVER_VERSION}/unified/intel-gpu-${VERSION_ID}.repo
# To add the online network package repository for the Intel Support Packages
tee > /etc/yum.repos.d/intel-for-pytorch-gpu-dev.repo << EOF
[intel-for-pytorch-gpu-dev]
name=Intel for Pytorch GPU dev repository
baseurl=https://yum.repos.intel.com/intel-for-pytorch-gpu-dev
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
EOF
# Move the newly created oneAPI.repo file to the YUM configuration directory /etc/yum.repos.d
mv /tmp/oneAPI.repo /etc/yum.repos.d
# The xpu-smi packages
dnf install -y flex bison xpu-smi
dnf install -y xpu-smi
# Compute and Media Runtimes
dnf install -y \
dnf install --skip-broken -y \
intel-opencl intel-media intel-mediasdk libmfxgen1 libvpl2\
level-zero intel-level-zero-gpu mesa-dri-drivers mesa-vulkan-drivers \
mesa-vdpau-drivers libdrm mesa-libEGL mesa-libgbm mesa-libGL \
mesa-libxatracker libvpl-tools intel-metrics-discovery \
intel-metrics-library intel-igc-core intel-igc-cm \
libva libva-utils intel-gmmlib libmetee intel-gsc intel-ocloc hwinfo clinfo
libva libva-utils intel-gmmlib libmetee intel-gsc intel-ocloc
# Development packages
dnf install -y --refresh \
intel-igc-opencl-devel level-zero-devel intel-gsc-devel libmetee-devel \
level-zero-devel
# Install Intel® oneAPI Base Toolkit
dnf install intel-basekit -y
# Install Intel Support Packages
yum install -y intel-for-pytorch-gpu-dev intel-pti-dev
# Cleanup
dnf clean all
@ -97,6 +106,41 @@ EOF
rm -rf /var/lib/yum/history
}
function install_sles() {
. /etc/os-release
VERSION_SP=${VERSION_ID//./sp}
if [[ ! " 15sp4 15sp5 " =~ " ${VERSION_SP} " ]]; then
echo "SLES version ${VERSION_ID} not supported"
exit
fi
# To add the online network package repository for the GPU Driver
zypper addrepo -f -r \
https://repositories.intel.com/gpu/sles/${VERSION_SP}${XPU_DRIVER_VERSION}/unified/intel-gpu-${VERSION_SP}.repo
rpm --import https://repositories.intel.com/gpu/intel-graphics.key
# To add the online network package repository for the Intel Support Packages
zypper addrepo https://yum.repos.intel.com/intel-for-pytorch-gpu-dev intel-for-pytorch-gpu-dev
rpm --import https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
# The xpu-smi packages
zypper install -y lsb-release flex bison xpu-smi
# Compute and Media Runtimes
zypper install -y intel-level-zero-gpu level-zero intel-gsc intel-opencl intel-ocloc \
intel-media-driver libigfxcmrt7 libvpl2 libvpl-tools libmfxgen1 libmfx1
# Development packages
zypper install -y libigdfcl-devel intel-igc-cm libigfxcmrt-devel level-zero-devel
# Install Intel Support Packages
zypper install -y intel-for-pytorch-gpu-dev intel-pti-dev
}
# Use GPU driver LTS releases by default
XPU_DRIVER_VERSION="/lts/2350"
if [[ "${XPU_DRIVER_TYPE,,}" == "rolling" ]]; then
# Use GPU driver rolling releases
XPU_DRIVER_VERSION=""
fi
# The installation depends on the base OS
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
@ -104,8 +148,11 @@ case "$ID" in
ubuntu)
install_ubuntu
;;
centos)
install_centos
rhel|almalinux)
install_rhel
;;
sles)
install_sles
;;
*)
echo "Unable to determine OS..."
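Review note on the version gating in the hunks above: each install function checks the OS version against a space-separated allow-list with `[[ ! " a b " =~ " ${X} " ]]`, the SLES path derives `15sp4` from `15.4` via `${VERSION_ID//./sp}`, and the driver repository path is `/lts/2350` unless `XPU_DRIVER_TYPE` is `rolling` (case-insensitive). The sketch below isolates those three idioms. The helper names `is_supported`, `sles_sp`, and `driver_suffix` are hypothetical and do not appear in the script; bash 4 or newer is assumed because of `${var,,}`.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; the logic mirrors the checks in the hunks above.

# Succeed if $1 appears as a whole word in the space-separated list $2.
# Padding both sides with spaces prevents "8" from matching "8.6".
is_supported() {
  local version="$1" allowed="$2"
  [[ " ${allowed} " =~ " ${version} " ]]
}

# Convert a SLES VERSION_ID such as "15.4" into the "15sp4" form.
sles_sp() {
  local version_id="$1"
  echo "${version_id//./sp}"
}

# Map a driver type to the repository path suffix:
# empty for "rolling" (any letter case), "/lts/2350" otherwise.
driver_suffix() {
  local driver_type="${1:-}"
  if [[ "${driver_type,,}" == "rolling" ]]; then
    echo ""
  else
    echo "/lts/2350"
  fi
}

# Usage
if is_supported "9.2" "8.6 8.8 8.9 9.0 9.2 9.3"; then
  echo "supported"
fi
echo "sles/$(sles_sp 15.5)$(driver_suffix)"
```

Because the right-hand side of `=~` is quoted, bash treats it as a literal string rather than a regular expression, so the dots in version numbers are matched exactly.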

100
.ci/docker/conda/Dockerfile Normal file

@ -0,0 +1,100 @@
ARG CUDA_VERSION=10.2
ARG BASE_TARGET=cuda${CUDA_VERSION}
FROM centos:7 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ARG DEVTOOLSET_VERSION=9
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum update -y
RUN yum install -y wget curl perl util-linux xz bzip2 git patch which unzip
# Just add everything as a safe.directory for git since these will be used in multiple places with git
RUN git config --global --add safe.directory '*'
RUN yum install -y yum-utils centos-release-scl
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y devtoolset-${DEVTOOLSET_VERSION}-gcc devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran devtoolset-${DEVTOOLSET_VERSION}-binutils
# EPEL for cmake
RUN yum --enablerepo=extras install -y epel-release
# cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
RUN yum install -y autoconf aclocal automake make sudo
RUN rm -rf /usr/local/cuda-*
FROM base as patchelf
# Install patchelf
ADD ./common/install_patchelf.sh install_patchelf.sh
RUN bash ./install_patchelf.sh && rm install_patchelf.sh && cp $(which patchelf) /patchelf
FROM base as openssl
# Install openssl
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
FROM base as conda
# Install Anaconda
ADD ./common/install_conda_docker.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh
# Install CUDA
FROM base as cuda
ARG CUDA_VERSION=10.2
RUN rm -rf /usr/local/cuda-*
ADD ./common/install_cuda.sh install_cuda.sh
ENV CUDA_HOME=/usr/local/cuda-${CUDA_VERSION}
# Preserve CUDA_VERSION for the builds
ENV CUDA_VERSION=${CUDA_VERSION}
# Make things in our path by default
ENV PATH=/usr/local/cuda-${CUDA_VERSION}/bin:$PATH
FROM cuda as cuda11.8
RUN bash ./install_cuda.sh 11.8
ENV DESIRED_CUDA=11.8
FROM cuda as cuda12.1
RUN bash ./install_cuda.sh 12.1
ENV DESIRED_CUDA=12.1
FROM cuda as cuda12.4
RUN bash ./install_cuda.sh 12.4
ENV DESIRED_CUDA=12.4
# Install MNIST test data
FROM base as mnist
ADD ./common/install_mnist.sh install_mnist.sh
RUN bash ./install_mnist.sh
FROM base as all_cuda
COPY --from=cuda11.8 /usr/local/cuda-11.8 /usr/local/cuda-11.8
COPY --from=cuda12.1 /usr/local/cuda-12.1 /usr/local/cuda-12.1
COPY --from=cuda12.4 /usr/local/cuda-12.4 /usr/local/cuda-12.4
# Final step
FROM ${BASE_TARGET} as final
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=patchelf /patchelf /usr/local/bin/patchelf
COPY --from=conda /opt/conda /opt/conda
# Add jni.h for java host build.
COPY ./common/install_jni.sh install_jni.sh
COPY ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
ENV PATH /opt/conda/bin:$PATH
COPY --from=mnist /usr/local/mnist /usr/local/mnist
RUN rm -rf /usr/local/cuda
RUN chmod o+rw /usr/local
RUN touch /.condarc && \
chmod o+rw /.condarc && \
chmod -R o+rw /opt/conda

76
.ci/docker/conda/build.sh Executable file

@ -0,0 +1,76 @@
#!/usr/bin/env bash
# Script used only in CD pipeline
set -eou pipefail
image="$1"
shift
if [ -z "${image}" ]; then
echo "Usage: $0 IMAGE"
exit 1
fi
DOCKER_IMAGE_NAME="pytorch/${image}"
export DOCKER_BUILDKIT=1
TOPDIR=$(git rev-parse --show-toplevel)
CUDA_VERSION=${CUDA_VERSION:-12.1}
case ${CUDA_VERSION} in
cpu)
BASE_TARGET=base
DOCKER_TAG=cpu
;;
all)
BASE_TARGET=all_cuda
DOCKER_TAG=latest
;;
*)
BASE_TARGET=cuda${CUDA_VERSION}
DOCKER_TAG=cuda${CUDA_VERSION}
;;
esac
(
set -x
docker build \
--target final \
--progress plain \
--build-arg "BASE_TARGET=${BASE_TARGET}" \
--build-arg "CUDA_VERSION=${CUDA_VERSION}" \
--build-arg "DEVTOOLSET_VERSION=9" \
-t ${DOCKER_IMAGE_NAME} \
$@ \
-f "${TOPDIR}/.ci/docker/conda/Dockerfile" \
${TOPDIR}/.ci/docker/
)
if [[ "${DOCKER_TAG}" =~ ^cuda* ]]; then
# Test that we're using the right CUDA compiler
(
set -x
docker run --rm "${DOCKER_IMAGE_NAME}" nvcc --version | grep "cuda_${CUDA_VERSION}"
)
fi
GITHUB_REF=${GITHUB_REF:-$(git symbolic-ref -q HEAD || git describe --tags --exact-match)}
GIT_BRANCH_NAME=${GITHUB_REF##*/}
GIT_COMMIT_SHA=${GITHUB_SHA:-$(git rev-parse HEAD)}
DOCKER_IMAGE_BRANCH_TAG=${DOCKER_IMAGE_NAME}-${GIT_BRANCH_NAME}
DOCKER_IMAGE_SHA_TAG=${DOCKER_IMAGE_NAME}-${GIT_COMMIT_SHA}
if [[ "${WITH_PUSH:-}" == true ]]; then
(
set -x
docker push "${DOCKER_IMAGE_NAME}"
if [[ -n ${GITHUB_REF} ]]; then
docker tag ${DOCKER_IMAGE_NAME} ${DOCKER_IMAGE_BRANCH_TAG}
docker tag ${DOCKER_IMAGE_NAME} ${DOCKER_IMAGE_SHA_TAG}
docker push "${DOCKER_IMAGE_BRANCH_TAG}"
docker push "${DOCKER_IMAGE_SHA_TAG}"
fi
)
fi
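Review note on `.ci/docker/conda/build.sh` above: `CUDA_VERSION` selects both the Dockerfile stage (`BASE_TARGET`) and the image tag (`DOCKER_TAG`), and the branch tag is taken from `${GITHUB_REF##*/}`. The sketch below restates that mapping as small functions so the behaviour can be checked in isolation. The function names are hypothetical and not part of the script. For the CUDA check, the sketch swaps in a glob prefix test in place of the script's `=~ ^cuda*` (which, read as a regex, matches `cud` followed by any number of `a`).

```shell
#!/usr/bin/env bash
# Hypothetical helpers; names are illustrative and not part of build.sh.

# Print "BASE_TARGET DOCKER_TAG" for a CUDA_VERSION value (default 12.1).
resolve_targets() {
  local cuda_version="${1:-12.1}"
  case "${cuda_version}" in
    cpu) echo "base cpu" ;;
    all) echo "all_cuda latest" ;;
    *)   echo "cuda${cuda_version} cuda${cuda_version}" ;;
  esac
}

# Strip everything up to the last "/" from a ref, as ${GITHUB_REF##*/} does.
branch_from_ref() {
  local ref="$1"
  echo "${ref##*/}"
}

# Succeed only for tags that start with the literal prefix "cuda".
is_cuda_tag() {
  [[ "$1" == cuda* ]]
}

# Usage
read -r base_target docker_tag <<< "$(resolve_targets 12.4)"
echo "target=${base_target} tag=${docker_tag}"
echo "branch=$(branch_from_ref refs/heads/main)"
```

Note that `##*/` keeps only the final path component, so a ref such as `refs/heads/feature/foo` yields `foo` rather than `feature/foo`.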


@ -0,0 +1,107 @@
ARG BASE_TARGET=base
ARG GPU_IMAGE=ubuntu:20.04
FROM ${GPU_IMAGE} as base
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get clean && apt-get update
RUN apt-get install -y curl locales g++ git-all autoconf automake make cmake wget unzip sudo
# Just add everything as a safe.directory for git since these will be used in multiple places with git
RUN git config --global --add safe.directory '*'
RUN locale-gen en_US.UTF-8
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
# Install openssl
FROM base as openssl
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# Install python
FROM base as python
ADD common/install_cpython.sh install_cpython.sh
RUN apt-get update -y && \
apt-get install build-essential gdb lcov libbz2-dev libffi-dev \
libgdbm-dev liblzma-dev libncurses5-dev libreadline6-dev \
libsqlite3-dev libssl-dev lzma lzma-dev tk-dev uuid-dev zlib1g-dev -y && \
bash ./install_cpython.sh && \
rm install_cpython.sh && \
apt-get clean
FROM base as conda
ADD ./common/install_conda_docker.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh
FROM base as cpu
# Install Anaconda
COPY --from=conda /opt/conda /opt/conda
# Install python
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
ENV PATH=/opt/conda/bin:/usr/local/cuda/bin:$PATH
# Install MKL
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM cpu as cuda
ADD ./common/install_cuda.sh install_cuda.sh
ADD ./common/install_magma.sh install_magma.sh
ENV CUDA_HOME /usr/local/cuda
FROM cuda as cuda11.8
RUN bash ./install_cuda.sh 11.8
RUN bash ./install_magma.sh 11.8
RUN ln -sf /usr/local/cuda-11.8 /usr/local/cuda
FROM cuda as cuda12.1
RUN bash ./install_cuda.sh 12.1
RUN bash ./install_magma.sh 12.1
RUN ln -sf /usr/local/cuda-12.1 /usr/local/cuda
FROM cuda as cuda12.4
RUN bash ./install_cuda.sh 12.4
RUN bash ./install_magma.sh 12.4
RUN ln -sf /usr/local/cuda-12.4 /usr/local/cuda
FROM cpu as rocm
ARG PYTORCH_ROCM_ARCH
ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}
ENV MKLROOT /opt/intel
# Add the ROCM_PATH env var so that LoadHip.cmake (even with its logic updated for ROCm 6.0)
# can still find HIP on ROCm 5.7. Not needed for ROCm 6.0 and above.
# Remove the line below once ROCm 5.7 is no longer in the support matrix.
ENV ROCM_PATH /opt/rocm
# No need to install ROCm as base docker image should have full ROCm install
#ADD ./common/install_rocm.sh install_rocm.sh
ADD ./common/install_rocm_drm.sh install_rocm_drm.sh
ADD ./common/install_rocm_magma.sh install_rocm_magma.sh
# gfortran and python needed for building magma from source for ROCm
RUN apt-get update -y && \
apt-get install gfortran -y && \
apt-get install python -y && \
apt-get clean
RUN bash ./install_rocm_drm.sh && rm install_rocm_drm.sh
RUN bash ./install_rocm_magma.sh && rm install_rocm_magma.sh
# Install AOTriton
COPY ./common/common_utils.sh common_utils.sh
COPY ./aotriton_version.txt aotriton_version.txt
COPY ./common/install_aotriton.sh install_aotriton.sh
RUN bash ./install_aotriton.sh /opt/rocm && rm install_aotriton.sh aotriton_version.txt
ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton
FROM ${BASE_TARGET} as final
COPY --from=openssl /opt/openssl /opt/openssl
# Install patchelf
ADD ./common/install_patchelf.sh install_patchelf.sh
RUN bash ./install_patchelf.sh && rm install_patchelf.sh
# Install Anaconda
COPY --from=conda /opt/conda /opt/conda
# Install python
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
ENV PATH=/opt/conda/bin:/usr/local/cuda/bin:$PATH

93
.ci/docker/libtorch/build.sh Executable file

@ -0,0 +1,93 @@
#!/usr/bin/env bash
# Script used only in CD pipeline
set -eou pipefail
image="$1"
shift
if [ -z "${image}" ]; then
echo "Usage: $0 IMAGE"
exit 1
fi
DOCKER_IMAGE="pytorch/${image}"
TOPDIR=$(git rev-parse --show-toplevel)
GPU_ARCH_TYPE=${GPU_ARCH_TYPE:-cpu}
GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}
WITH_PUSH=${WITH_PUSH:-}
DOCKER=${DOCKER:-docker}
case ${GPU_ARCH_TYPE} in
cpu)
BASE_TARGET=cpu
DOCKER_TAG=cpu
GPU_IMAGE=ubuntu:20.04
DOCKER_GPU_BUILD_ARG=""
;;
cuda)
BASE_TARGET=cuda${GPU_ARCH_VERSION}
DOCKER_TAG=cuda${GPU_ARCH_VERSION}
GPU_IMAGE=ubuntu:20.04
DOCKER_GPU_BUILD_ARG=""
;;
rocm)
BASE_TARGET=rocm
DOCKER_TAG=rocm${GPU_ARCH_VERSION}
GPU_IMAGE=rocm/dev-ubuntu-20.04:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100"
ROCM_REGEX="([0-9]+)\.([0-9]+)[\.]?([0-9]*)"
if [[ $GPU_ARCH_VERSION =~ $ROCM_REGEX ]]; then
ROCM_VERSION_INT=$((${BASH_REMATCH[1]}*10000 + ${BASH_REMATCH[2]}*100 + ${BASH_REMATCH[3]:-0}))
else
echo "ERROR: rocm regex failed"
exit 1
fi
if [[ $ROCM_VERSION_INT -ge 60000 ]]; then
PYTORCH_ROCM_ARCH+=";gfx942"
fi
DOCKER_GPU_BUILD_ARG="--build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH}"
;;
*)
echo "ERROR: Unrecognized GPU_ARCH_TYPE: ${GPU_ARCH_TYPE}"
exit 1
;;
esac
(
set -x
DOCKER_BUILDKIT=1 ${DOCKER} build \
--target final \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--build-arg "BASE_TARGET=${BASE_TARGET}" \
-t "${DOCKER_IMAGE}" \
$@ \
-f "${TOPDIR}/.ci/docker/libtorch/Dockerfile" \
"${TOPDIR}/.ci/docker/"
)
GITHUB_REF=${GITHUB_REF:-$(git symbolic-ref -q HEAD || git describe --tags --exact-match)}
GIT_BRANCH_NAME=${GITHUB_REF##*/}
GIT_COMMIT_SHA=${GITHUB_SHA:-$(git rev-parse HEAD)}
DOCKER_IMAGE_BRANCH_TAG=${DOCKER_IMAGE}-${GIT_BRANCH_NAME}
DOCKER_IMAGE_SHA_TAG=${DOCKER_IMAGE}-${GIT_COMMIT_SHA}
if [[ "${WITH_PUSH}" == true ]]; then
(
set -x
${DOCKER} push "${DOCKER_IMAGE}"
if [[ -n ${GITHUB_REF} ]]; then
${DOCKER} tag ${DOCKER_IMAGE} ${DOCKER_IMAGE_BRANCH_TAG}
${DOCKER} tag ${DOCKER_IMAGE} ${DOCKER_IMAGE_SHA_TAG}
${DOCKER} push "${DOCKER_IMAGE_BRANCH_TAG}"
${DOCKER} push "${DOCKER_IMAGE_SHA_TAG}"
fi
)
fi
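Review note on the `rocm)` branch of `.ci/docker/libtorch/build.sh` above: the script turns `GPU_ARCH_VERSION` into an integer (`major*10000 + minor*100 + patch`) and appends `gfx942` to the architecture list when that integer is at least 60000. A minimal sketch of that computation follows. The function names `rocm_version_int` and `rocm_arch_list` are hypothetical, introduced here only so the logic can be exercised on its own.

```shell
#!/usr/bin/env bash
# Hypothetical helpers mirroring the version arithmetic in the rocm) branch.

# Print major*10000 + minor*100 + patch for a version like "6.1.2" or "5.7".
# A missing patch component counts as 0.
rocm_version_int() {
  local version="$1"
  local regex='([0-9]+)\.([0-9]+)[\.]?([0-9]*)'
  if [[ "${version}" =~ ${regex} ]]; then
    echo $(( ${BASH_REMATCH[1]} * 10000 + ${BASH_REMATCH[2]} * 100 + ${BASH_REMATCH[3]:-0} ))
  else
    echo "ERROR: rocm regex failed" >&2
    return 1
  fi
}

# Print the semicolon-separated arch list, adding gfx942 for ROCm >= 6.0.
rocm_arch_list() {
  local archs="gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100"
  local version_int
  version_int="$(rocm_version_int "$1")" || return 1
  if [[ ${version_int} -ge 60000 ]]; then
    archs+=";gfx942"
  fi
  echo "${archs}"
}

# Usage
echo "6.1.2 -> $(rocm_version_int 6.1.2)"
echo "archs for 6.0: $(rocm_arch_list 6.0)"
```

Encoding the version as a single integer lets the script compare versions with a plain `-ge` test instead of comparing components one by one.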


@ -29,7 +29,7 @@ RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/re
# Install cuda and cudnn
ARG CUDA_VERSION
RUN wget -q https://raw.githubusercontent.com/pytorch/builder/main/common/install_cuda.sh -O install_cuda.sh
COPY ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh
ENV DESIRED_CUDA ${CUDA_VERSION}
ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:$PATH


@ -0,0 +1,202 @@
# syntax = docker/dockerfile:experimental
ARG ROCM_VERSION=3.7
ARG BASE_CUDA_VERSION=11.8
ARG GPU_IMAGE=centos:7
FROM centos:7 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
ARG DEVTOOLSET_VERSION=9
# Note: This patch is required since CentOS has reached EOL;
# otherwise any yum install step will fail
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y wget curl perl util-linux xz bzip2 git patch which perl zlib-devel
# Just add everything as a safe.directory for git since these will be used in multiple places with git
RUN git config --global --add safe.directory '*'
RUN yum install -y yum-utils centos-release-scl
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
# Note: After running yum-config-manager --enable rhel-server-rhscl-7-rpms
# the patch is required once again. Somehow this step adds mirror.centos.org back
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y devtoolset-${DEVTOOLSET_VERSION}-gcc devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran devtoolset-${DEVTOOLSET_VERSION}-binutils
ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
RUN yum --enablerepo=extras install -y epel-release
# cmake-3.18.4 from pip
RUN yum install -y python3-pip && \
python3 -mpip install cmake==3.18.4 && \
ln -s /usr/local/bin/cmake /usr/bin/cmake
RUN yum install -y autoconf aclocal automake make sudo
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# EPEL for cmake
FROM base as patchelf
# Install patchelf
ADD ./common/install_patchelf.sh install_patchelf.sh
RUN bash ./install_patchelf.sh && rm install_patchelf.sh
RUN cp $(which patchelf) /patchelf
FROM patchelf as python
# build python
COPY manywheel/build_scripts /build_scripts
ADD ./common/install_cpython.sh /build_scripts/install_cpython.sh
RUN bash build_scripts/build.sh && rm -r build_scripts
FROM base as cuda
ARG BASE_CUDA_VERSION=10.2
# Install CUDA
ADD ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh
FROM base as intel
# MKL
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM base as magma
ARG BASE_CUDA_VERSION=10.2
# Install magma
ADD ./common/install_magma.sh install_magma.sh
RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
FROM base as jni
# Install java jni header
ADD ./common/install_jni.sh install_jni.sh
ADD ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
FROM base as libpng
# Install libpng
ADD ./common/install_libpng.sh install_libpng.sh
RUN bash ./install_libpng.sh && rm install_libpng.sh
FROM ${GPU_IMAGE} as common
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN yum install -y \
aclocal \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
yasm
RUN yum install -y \
https://repo.ius.io/ius-release-el7.rpm \
https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm
RUN yum swap -y git git236-core
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
# Install LLVM version
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
COPY --from=python /opt/python/cp39-cp39/bin/auditwheel /usr/local/bin/auditwheel
COPY --from=intel /opt/intel /opt/intel
COPY --from=patchelf /usr/local/bin/patchelf /usr/local/bin/patchelf
COPY --from=jni /usr/local/include/jni.h /usr/local/include/jni.h
COPY --from=libpng /usr/local/bin/png* /usr/local/bin/
COPY --from=libpng /usr/local/bin/libpng* /usr/local/bin/
COPY --from=libpng /usr/local/include/png* /usr/local/include/
COPY --from=libpng /usr/local/include/libpng* /usr/local/include/
COPY --from=libpng /usr/local/lib/libpng* /usr/local/lib/
COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/lib/pkgconfig
FROM common as cpu_final
ARG BASE_CUDA_VERSION=10.1
ARG DEVTOOLSET_VERSION=9
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y yum-utils centos-release-scl
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y devtoolset-${DEVTOOLSET_VERSION}-gcc devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran devtoolset-${DEVTOOLSET_VERSION}-binutils
ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# cmake is already installed inside the rocm base image, so remove if present
RUN rpm -e cmake || true
# cmake-3.18.4 from pip
RUN yum install -y python3-pip && \
python3 -mpip install cmake==3.18.4 && \
ln -s /usr/local/bin/cmake /usr/bin/cmake
# ninja
RUN yum install -y ninja-build
FROM cpu_final as cuda_final
RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=cuda /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=magma /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
RUN ln -sf /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda
ENV PATH=/usr/local/cuda/bin:$PATH
FROM cpu_final as rocm_final
ARG ROCM_VERSION=3.7
ARG PYTORCH_ROCM_ARCH
ENV PYTORCH_ROCM_ARCH ${PYTORCH_ROCM_ARCH}
# Add the ROCM_PATH env var so that LoadHip.cmake (even with its logic updated for ROCm 6.0)
# can still find HIP on ROCm 5.7. Not needed for ROCm 6.0 and above.
# Remove the line below once ROCm 5.7 is no longer in the support matrix.
ENV ROCM_PATH /opt/rocm
ENV MKLROOT /opt/intel
# No need to install ROCm as base docker image should have full ROCm install
#ADD ./common/install_rocm.sh install_rocm.sh
#RUN ROCM_VERSION=${ROCM_VERSION} bash ./install_rocm.sh && rm install_rocm.sh
ADD ./common/install_rocm_drm.sh install_rocm_drm.sh
RUN bash ./install_rocm_drm.sh && rm install_rocm_drm.sh
# cmake3 is needed for the MIOpen build
RUN ln -sf /usr/local/bin/cmake /usr/bin/cmake3
ADD ./common/install_rocm_magma.sh install_rocm_magma.sh
RUN bash ./install_rocm_magma.sh && rm install_rocm_magma.sh
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh
# Install AOTriton
COPY ./common/common_utils.sh common_utils.sh
COPY ./aotriton_version.txt aotriton_version.txt
COPY ./common/install_aotriton.sh install_aotriton.sh
RUN bash ./install_aotriton.sh /opt/rocm && rm install_aotriton.sh aotriton_version.txt
ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton
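Review note on the CentOS EOL patch that this Dockerfile repeats in several stages: the three `sed` commands point `mirror.centos.org` at `vault.centos.org`, uncomment the `baseurl` lines, and comment out the `mirrorlist` lines. The sketch below applies the same three substitutions to a single file so the effect can be inspected outside a container. The function name `patch_centos_repo` and the sample repo contents are assumptions made for illustration, and GNU `sed -i` is assumed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: applies the three substitutions from the Dockerfile
# to one repo file instead of /etc/yum.repos.d/*.repo.
patch_centos_repo() {
  local repo_file="$1"
  # 1. Point the retired mirror host at the vault archive.
  sed -i 's/mirror.centos.org/vault.centos.org/g' "${repo_file}"
  # 2. Re-enable baseurl lines that ship commented out.
  sed -i 's/^#.*baseurl=http/baseurl=http/g' "${repo_file}"
  # 3. Disable mirrorlist lines so yum uses baseurl.
  sed -i 's/^mirrorlist=http/#mirrorlist=http/g' "${repo_file}"
}

# Usage on a made-up repo file
sample="$(mktemp)"
printf '%s\n' \
  '[base]' \
  'mirrorlist=http://mirrorlist.centos.org/?release=7' \
  '#baseurl=http://mirror.centos.org/centos/7/os/x86_64/' > "${sample}"
patch_centos_repo "${sample}"
cat "${sample}"
rm -f "${sample}"
```

The order matters: the host rewrite runs while the `baseurl` line is still commented out, and the second substitution then removes the leading `#`.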


@ -0,0 +1,153 @@
# syntax = docker/dockerfile:experimental
ARG ROCM_VERSION=3.7
ARG BASE_CUDA_VERSION=10.2
ARG GPU_IMAGE=nvidia/cuda:${BASE_CUDA_VERSION}-devel-centos7
FROM quay.io/pypa/manylinux2014_x86_64 as base
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y wget curl perl util-linux xz bzip2 git patch which perl zlib-devel
RUN yum install -y yum-utils centos-release-scl sudo
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
RUN yum install -y devtoolset-7-gcc devtoolset-7-gcc-c++ devtoolset-7-gcc-gfortran devtoolset-7-binutils
ENV PATH=/opt/rh/devtoolset-7/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib:$LD_LIBRARY_PATH
# cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
FROM base as cuda
ARG BASE_CUDA_VERSION=10.2
# Install CUDA
ADD ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh
FROM base as intel
# MKL
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM base as magma
ARG BASE_CUDA_VERSION=10.2
# Install magma
ADD ./common/install_magma.sh install_magma.sh
RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
FROM base as jni
# Install java jni header
ADD ./common/install_jni.sh install_jni.sh
ADD ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
FROM base as libpng
# Install libpng
ADD ./common/install_libpng.sh install_libpng.sh
RUN bash ./install_libpng.sh && rm install_libpng.sh
FROM ${GPU_IMAGE} as common
RUN sed -i s/mirror.centos.org/vault.centos.org/g /etc/yum.repos.d/*.repo
RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
ENV LC_ALL en_US.UTF-8
ENV LANG en_US.UTF-8
ENV LANGUAGE en_US.UTF-8
RUN yum install -y \
aclocal \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
yasm
RUN yum install -y \
https://repo.ius.io/ius-release-el7.rpm \
https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm
RUN yum swap -y git git236-core
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
# Install LLVM version
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=base /opt/python /opt/python
COPY --from=base /opt/_internal /opt/_internal
COPY --from=base /usr/local/bin/auditwheel /usr/local/bin/auditwheel
COPY --from=intel /opt/intel /opt/intel
COPY --from=base /usr/local/bin/patchelf /usr/local/bin/patchelf
COPY --from=libpng /usr/local/bin/png* /usr/local/bin/
COPY --from=libpng /usr/local/bin/libpng* /usr/local/bin/
COPY --from=libpng /usr/local/include/png* /usr/local/include/
COPY --from=libpng /usr/local/include/libpng* /usr/local/include/
COPY --from=libpng /usr/local/lib/libpng* /usr/local/lib/
COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/lib/pkgconfig
COPY --from=jni /usr/local/include/jni.h /usr/local/include/jni.h
FROM common as cpu_final
ARG BASE_CUDA_VERSION=10.2
RUN yum install -y yum-utils centos-release-scl
RUN yum-config-manager --enable rhel-server-rhscl-7-rpms
RUN yum install -y devtoolset-7-gcc devtoolset-7-gcc-c++ devtoolset-7-gcc-gfortran devtoolset-7-binutils
ENV PATH=/opt/rh/devtoolset-7/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-7/root/usr/lib64:/opt/rh/devtoolset-7/root/usr/lib:$LD_LIBRARY_PATH
# cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
# ninja
RUN yum install -y http://repo.okay.com.mx/centos/7/x86_64/release/okay-release-1-1.noarch.rpm
RUN yum install -y ninja-build
FROM cpu_final as cuda_final
RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=cuda /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=magma /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
FROM common as rocm_final
ARG ROCM_VERSION=3.7
# Install ROCm
ADD ./common/install_rocm.sh install_rocm.sh
RUN bash ./install_rocm.sh ${ROCM_VERSION} && rm install_rocm.sh
# cmake is already installed inside the rocm base image, but both 2 and 3 exist
# cmake3 is needed for the later MIOpen custom build, so that step is last.
RUN yum install -y cmake3 && \
rm -f /usr/bin/cmake && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh


@ -0,0 +1,157 @@
# syntax = docker/dockerfile:experimental
ARG ROCM_VERSION=3.7
ARG BASE_CUDA_VERSION=11.8
ARG GPU_IMAGE=amd64/almalinux:8
FROM quay.io/pypa/manylinux_2_28_x86_64 as base
ENV LC_ALL=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
ARG DEVTOOLSET_VERSION=11
RUN yum install -y sudo wget curl perl util-linux xz bzip2 git patch which perl zlib-devel yum-utils gcc-toolset-${DEVTOOLSET_VERSION}-toolchain
ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# cmake-3.18.4 from pip
RUN yum install -y python3-pip && \
python3 -mpip install cmake==3.18.4 && \
ln -s /usr/local/bin/cmake /usr/bin/cmake3
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
FROM base as cuda
ARG BASE_CUDA_VERSION=11.8
# Install CUDA
ADD ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${BASE_CUDA_VERSION} && rm install_cuda.sh
FROM base as intel
# MKL
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM base as magma
ARG BASE_CUDA_VERSION=10.2
# Install magma
ADD ./common/install_magma.sh install_magma.sh
RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
FROM base as jni
# Install java jni header
ADD ./common/install_jni.sh install_jni.sh
ADD ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
FROM base as libpng
# Install libpng
ADD ./common/install_libpng.sh install_libpng.sh
RUN bash ./install_libpng.sh && rm install_libpng.sh
FROM ${GPU_IMAGE} as common
ARG DEVTOOLSET_VERSION=11
ENV LC_ALL=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
RUN yum -y install epel-release
RUN yum -y update
RUN yum install -y \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
gcc-toolset-${DEVTOOLSET_VERSION}-toolchain \
glibc-langpack-en
RUN yum install -y \
https://repo.ius.io/ius-release-el7.rpm \
https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm
RUN yum swap -y git git236-core
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
# Install LLVM version
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=base /opt/python /opt/python
COPY --from=base /opt/_internal /opt/_internal
COPY --from=base /usr/local/bin/auditwheel /usr/local/bin/auditwheel
COPY --from=intel /opt/intel /opt/intel
COPY --from=base /usr/local/bin/patchelf /usr/local/bin/patchelf
COPY --from=libpng /usr/local/bin/png* /usr/local/bin/
COPY --from=libpng /usr/local/bin/libpng* /usr/local/bin/
COPY --from=libpng /usr/local/include/png* /usr/local/include/
COPY --from=libpng /usr/local/include/libpng* /usr/local/include/
COPY --from=libpng /usr/local/lib/libpng* /usr/local/lib/
COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/lib/pkgconfig
COPY --from=jni /usr/local/include/jni.h /usr/local/include/jni.h
FROM common as cpu_final
ARG BASE_CUDA_VERSION=11.8
ARG DEVTOOLSET_VERSION=11
# Ensure the expected devtoolset is used
ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# cmake-3.18.4 from pip
RUN yum install -y python3-pip && \
python3 -mpip install cmake==3.18.4 && \
ln -s /usr/local/bin/cmake /usr/bin/cmake3
FROM cpu_final as cuda_final
RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=cuda /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=magma /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
FROM common as rocm_final
ARG ROCM_VERSION=3.7
# Install ROCm
ADD ./common/install_rocm.sh install_rocm.sh
RUN bash ./install_rocm.sh ${ROCM_VERSION} && rm install_rocm.sh
# cmake is already installed inside the rocm base image, but both 2 and 3 exist
# cmake3 is needed for the later MIOpen custom build, so that step is last.
RUN yum install -y cmake3 && \
rm -f /usr/bin/cmake && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh
FROM cpu_final as xpu_final
# XPU CD uses the rolling driver
ENV XPU_DRIVER_TYPE=ROLLING
# cmake-3.28.4 from pip
RUN python3 -m pip install --upgrade pip && \
python3 -mpip install cmake==3.28.4
# Install setuptools and wheel for python 3.13
RUN /opt/python/cp313-cp313/bin/python -m pip install setuptools wheel
ADD ./common/install_xpu.sh install_xpu.sh
RUN bash ./install_xpu.sh && rm install_xpu.sh
RUN pushd /opt/_internal && tar -xJf static-libs-for-embedding-only.tar.xz && popd


@ -0,0 +1,57 @@
FROM quay.io/pypa/manylinux_2_28_aarch64 as base
# Graviton needs GCC 10 or above for the build. GCC12 is the default version in almalinux-8.
ARG GCCTOOLSET_VERSION=11
# Language variables
ENV LC_ALL=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
# Install needed OS packages. This is to support all
# the binary builds (torch, vision, audio, text, data)
RUN yum -y install epel-release
RUN yum -y update
RUN yum install -y \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
less \
libffi-devel \
libgomp \
make \
openssl-devel \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
yasm \
zstd \
sudo \
gcc-toolset-${GCCTOOLSET_VERSION}-toolchain
# Ensure the expected devtoolset is used
ENV PATH=/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${GCCTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
FROM base as final
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6


@ -0,0 +1,94 @@
FROM quay.io/pypa/manylinux2014_aarch64 as base
# Graviton needs GCC 10 for the build
ARG DEVTOOLSET_VERSION=10
# Language variables
ENV LC_ALL=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
# Install needed OS packages. This is to support all
# the binary builds (torch, vision, audio, text, data)
RUN yum -y install epel-release
RUN yum -y update
RUN yum install -y \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
yasm \
less \
zstd \
libgomp \
sudo \
devtoolset-${DEVTOOLSET_VERSION}-gcc \
devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ \
devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran \
devtoolset-${DEVTOOLSET_VERSION}-binutils
# Ensure the expected devtoolset is used
ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
###############################################################################
# libgfortran.a hack
#
# libgfortran.a from quay.io/pypa/manylinux2014_aarch64 is not compiled with -fPIC.
# This causes a __stack_chk_guard@@GLIBC_2.17 error in the pytorch build. As a workaround,
# use Ubuntu's libgfortran.a, which is compiled with -fPIC.
# NOTE: Need a better way to get this library, as Ubuntu's package can be removed or changed by the vendor
###############################################################################
RUN cd ~/ \
&& curl -L -o ~/libgfortran-10-dev.deb http://ports.ubuntu.com/ubuntu-ports/pool/universe/g/gcc-10/libgfortran-10-dev_10.5.0-1ubuntu1_arm64.deb \
&& ar x ~/libgfortran-10-dev.deb \
&& tar --use-compress-program=unzstd -xvf data.tar.zst -C ~/ \
&& cp -f ~/usr/lib/gcc/aarch64-linux-gnu/10/libgfortran.a /opt/rh/devtoolset-10/root/usr/lib/gcc/aarch64-redhat-linux/10/
# install cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
FROM base as openblas
# Install openblas
ADD ./common/install_openblas.sh install_openblas.sh
RUN bash ./install_openblas.sh && rm install_openblas.sh
FROM openssl as final
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
COPY --from=openblas /opt/OpenBLAS/ /opt/OpenBLAS/
ENV LD_LIBRARY_PATH=/opt/OpenBLAS/lib:$LD_LIBRARY_PATH


@ -0,0 +1,91 @@
FROM quay.io/pypa/manylinux_2_28_aarch64 as base
# Cuda ARM build needs gcc 11
ARG DEVTOOLSET_VERSION=11
# Language variables
ENV LC_ALL=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
# Install needed OS packages. This is to support all
# the binary builds (torch, vision, audio, text, data)
RUN yum -y install epel-release
RUN yum -y update
RUN yum install -y \
autoconf \
automake \
bison \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz \
yasm \
less \
zstd \
libgomp \
sudo \
gcc-toolset-${DEVTOOLSET_VERSION}-toolchain
# Ensure the expected devtoolset is used
ENV PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/gcc-toolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
FROM openssl as final
# remove unnecessary python versions
RUN rm -rf /opt/python/cp26-cp26m /opt/_internal/cpython-2.6.9-ucs2
RUN rm -rf /opt/python/cp26-cp26mu /opt/_internal/cpython-2.6.9-ucs4
RUN rm -rf /opt/python/cp33-cp33m /opt/_internal/cpython-3.3.6
RUN rm -rf /opt/python/cp34-cp34m /opt/_internal/cpython-3.4.6
FROM base as cuda
ARG BASE_CUDA_VERSION
# Install CUDA
ADD ./common/install_cuda_aarch64.sh install_cuda_aarch64.sh
RUN bash ./install_cuda_aarch64.sh ${BASE_CUDA_VERSION} && rm install_cuda_aarch64.sh
FROM base as magma
ARG BASE_CUDA_VERSION
# Install magma
ADD ./common/install_magma.sh install_magma.sh
RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
FROM base as nvpl
# Install nvpl
ADD ./common/install_nvpl.sh install_nvpl.sh
RUN bash ./install_nvpl.sh && rm install_nvpl.sh
FROM final as cuda_final
ARG BASE_CUDA_VERSION
RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=cuda /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=magma /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=nvpl /opt/nvpl/lib/ /usr/local/lib/
COPY --from=nvpl /opt/nvpl/include/ /usr/local/include/
RUN ln -sf /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda
ENV PATH=/usr/local/cuda/bin:$PATH


@ -0,0 +1,71 @@
FROM centos:8 as base
ENV LC_ALL=en_US.UTF-8
ENV LANG=en_US.UTF-8
ENV LANGUAGE=en_US.UTF-8
ENV PATH=/opt/rh/gcc-toolset-11/root/bin/:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
# change to a valid repo
RUN sed -i 's|#baseurl=http://mirror.centos.org|baseurl=http://vault.centos.org|g' /etc/yum.repos.d/CentOS-Linux-*.repo
# enable to install ninja-build
RUN sed -i 's|enabled=0|enabled=1|g' /etc/yum.repos.d/CentOS-Linux-PowerTools.repo
RUN yum -y update
RUN yum install -y wget curl perl util-linux xz bzip2 git patch which zlib-devel sudo
RUN yum install -y autoconf automake make cmake gdb gcc-toolset-11-gcc-c++
FROM base as openssl
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
# Install python
FROM base as python
RUN yum install -y openssl-devel zlib-devel bzip2-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel libpcap-devel xz-devel libffi-devel
ADD common/install_cpython.sh install_cpython.sh
RUN bash ./install_cpython.sh && rm install_cpython.sh
FROM base as conda
ADD ./common/install_conda_docker.sh install_conda.sh
RUN bash ./install_conda.sh && rm install_conda.sh
RUN /opt/conda/bin/conda install -y cmake
FROM base as intel
# Install MKL
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
COPY --from=conda /opt/conda /opt/conda
ENV PATH=/opt/conda/bin:$PATH
ADD ./common/install_mkl.sh install_mkl.sh
RUN bash ./install_mkl.sh && rm install_mkl.sh
FROM base as patchelf
ADD ./common/install_patchelf.sh install_patchelf.sh
RUN bash ./install_patchelf.sh && rm install_patchelf.sh
RUN cp $(which patchelf) /patchelf
FROM base as jni
ADD ./common/install_jni.sh install_jni.sh
ADD ./java/jni.h jni.h
RUN bash ./install_jni.sh && rm install_jni.sh
FROM base as libpng
ADD ./common/install_libpng.sh install_libpng.sh
RUN bash ./install_libpng.sh && rm install_libpng.sh
FROM base as final
COPY --from=openssl /opt/openssl /opt/openssl
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
COPY --from=intel /opt/intel /opt/intel
COPY --from=conda /opt/conda /opt/conda
COPY --from=patchelf /usr/local/bin/patchelf /usr/local/bin/patchelf
COPY --from=jni /usr/local/include/jni.h /usr/local/include/jni.h
COPY --from=libpng /usr/local/bin/png* /usr/local/bin/
COPY --from=libpng /usr/local/bin/libpng* /usr/local/bin/
COPY --from=libpng /usr/local/include/png* /usr/local/include/
COPY --from=libpng /usr/local/include/libpng* /usr/local/include/
COPY --from=libpng /usr/local/lib/libpng* /usr/local/lib/
COPY --from=libpng /usr/local/lib/pkgconfig /usr/local/lib/pkgconfig
RUN yum install -y ninja-build


@ -0,0 +1,73 @@
FROM --platform=linux/s390x docker.io/ubuntu:24.04 as base
# Language variables
ENV LC_ALL=C.UTF-8
ENV LANG=C.UTF-8
ENV LANGUAGE=C.UTF-8
# Install needed OS packages. This is to support all
# the binary builds (torch, vision, audio, text, data)
RUN apt update ; apt upgrade -y
RUN apt install -y \
build-essential \
autoconf \
automake \
bzip2 \
curl \
diffutils \
file \
git \
make \
patch \
perl \
unzip \
util-linux \
wget \
which \
xz-utils \
less \
zstd \
cmake \
python3 \
python3-dev \
python3-setuptools \
python3-yaml \
python3-typing-extensions \
libblas-dev \
libopenblas-dev \
liblapack-dev \
libatlas-base-dev
# git236+ would refuse to run git commands in repos owned by other users
# Which causes version check to fail, as pytorch repo is bind-mounted into the image
# Override this behaviour by treating every folder as safe
# For more details see https://github.com/pytorch/pytorch/issues/78659#issuecomment-1144107327
RUN git config --global --add safe.directory "*"
FROM base as openssl
# Install openssl (this must precede `build python` step)
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
ADD ./common/install_openssl.sh install_openssl.sh
RUN bash ./install_openssl.sh && rm install_openssl.sh
ENV SSL_CERT_FILE=/opt/_internal/certs.pem
# EPEL for cmake
FROM base as patchelf
# Install patchelf
ADD ./common/install_patchelf.sh install_patchelf.sh
RUN bash ./install_patchelf.sh && rm install_patchelf.sh
RUN cp $(which patchelf) /patchelf
FROM patchelf as python
# build python
COPY manywheel/build_scripts /build_scripts
ADD ./common/install_cpython.sh /build_scripts/install_cpython.sh
RUN bash build_scripts/build.sh && rm -r build_scripts
FROM openssl as final
COPY --from=python /opt/python /opt/python
COPY --from=python /opt/_internal /opt/_internal
COPY --from=python /opt/python/cp39-cp39/bin/auditwheel /usr/local/bin/auditwheel
COPY --from=patchelf /usr/local/bin/patchelf /usr/local/bin/patchelf

.ci/docker/manywheel/build.sh Executable file

@ -0,0 +1,154 @@
#!/usr/bin/env bash
# Script used only in CD pipeline
set -euo pipefail
TOPDIR=$(git rev-parse --show-toplevel)
# Default to empty so that `set -u` does not abort before the usage message
image="${1:-}"
if [ -z "${image}" ]; then
echo "Usage: $0 IMAGE"
exit 1
fi
shift
DOCKER_IMAGE="pytorch/${image}"
DOCKER_REGISTRY="${DOCKER_REGISTRY:-docker.io}"
GPU_ARCH_TYPE=${GPU_ARCH_TYPE:-cpu}
GPU_ARCH_VERSION=${GPU_ARCH_VERSION:-}
MANY_LINUX_VERSION=${MANY_LINUX_VERSION:-}
DOCKERFILE_SUFFIX=${DOCKERFILE_SUFFIX:-}
WITH_PUSH=${WITH_PUSH:-}
case ${GPU_ARCH_TYPE} in
cpu)
TARGET=cpu_final
DOCKER_TAG=cpu
GPU_IMAGE=centos:7
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=9"
;;
cpu-manylinux_2_28)
TARGET=cpu_final
DOCKER_TAG=cpu
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11"
MANY_LINUX_VERSION="2_28"
;;
cpu-aarch64)
TARGET=final
DOCKER_TAG=cpu-aarch64
GPU_IMAGE=arm64v8/centos:7
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=10"
MANY_LINUX_VERSION="aarch64"
;;
cpu-aarch64-2_28)
TARGET=final
DOCKER_TAG=cpu-aarch64
GPU_IMAGE=arm64v8/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11"
MANY_LINUX_VERSION="2_28_aarch64"
;;
cpu-cxx11-abi)
TARGET=final
DOCKER_TAG=cpu-cxx11-abi
GPU_IMAGE=""
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=9"
MANY_LINUX_VERSION="cxx11-abi"
;;
cpu-s390x)
TARGET=final
DOCKER_TAG=cpu-s390x
GPU_IMAGE=redhat/ubi9
DOCKER_GPU_BUILD_ARG=""
MANY_LINUX_VERSION="s390x"
;;
cuda)
TARGET=cuda_final
DOCKER_TAG=cuda${GPU_ARCH_VERSION}
# Keep this up to date with the minimum version of CUDA we currently support
GPU_IMAGE=centos:7
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=9"
;;
cuda-manylinux_2_28)
TARGET=cuda_final
DOCKER_TAG=cuda${GPU_ARCH_VERSION}
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=11"
MANY_LINUX_VERSION="2_28"
;;
cuda-aarch64)
TARGET=cuda_final
DOCKER_TAG=cuda${GPU_ARCH_VERSION}
GPU_IMAGE=arm64v8/centos:7
DOCKER_GPU_BUILD_ARG="--build-arg BASE_CUDA_VERSION=${GPU_ARCH_VERSION} --build-arg DEVTOOLSET_VERSION=11"
MANY_LINUX_VERSION="aarch64"
DOCKERFILE_SUFFIX="_cuda_aarch64"
;;
rocm)
TARGET=rocm_final
DOCKER_TAG=rocm${GPU_ARCH_VERSION}
GPU_IMAGE=rocm/dev-centos-7:${GPU_ARCH_VERSION}-complete
PYTORCH_ROCM_ARCH="gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1100"
ROCM_REGEX="([0-9]+)\.([0-9]+)[\.]?([0-9]*)"
if [[ $GPU_ARCH_VERSION =~ $ROCM_REGEX ]]; then
ROCM_VERSION_INT=$((${BASH_REMATCH[1]}*10000 + ${BASH_REMATCH[2]}*100 + ${BASH_REMATCH[3]:-0}))
else
echo "ERROR: rocm regex failed"
exit 1
fi
if [[ $ROCM_VERSION_INT -ge 60000 ]]; then
PYTORCH_ROCM_ARCH+=";gfx942"
fi
DOCKER_GPU_BUILD_ARG="--build-arg ROCM_VERSION=${GPU_ARCH_VERSION} --build-arg PYTORCH_ROCM_ARCH=${PYTORCH_ROCM_ARCH} --build-arg DEVTOOLSET_VERSION=9"
;;
xpu)
TARGET=xpu_final
DOCKER_TAG=xpu
GPU_IMAGE=amd64/almalinux:8
DOCKER_GPU_BUILD_ARG=" --build-arg DEVTOOLSET_VERSION=11"
MANY_LINUX_VERSION="2_28"
;;
*)
echo "ERROR: Unrecognized GPU_ARCH_TYPE: ${GPU_ARCH_TYPE}"
exit 1
;;
esac
IMAGES=''
if [[ -n ${MANY_LINUX_VERSION} && -z ${DOCKERFILE_SUFFIX} ]]; then
DOCKERFILE_SUFFIX=_${MANY_LINUX_VERSION}
fi
(
set -x
DOCKER_BUILDKIT=1 docker build \
${DOCKER_GPU_BUILD_ARG} \
--build-arg "GPU_IMAGE=${GPU_IMAGE}" \
--target "${TARGET}" \
-t "${DOCKER_IMAGE}" \
"$@" \
-f "${TOPDIR}/.ci/docker/manywheel/Dockerfile${DOCKERFILE_SUFFIX}" \
"${TOPDIR}/.ci/docker/"
)
GITHUB_REF=${GITHUB_REF:-$(git symbolic-ref -q HEAD || git describe --tags --exact-match)}
GIT_BRANCH_NAME=${GITHUB_REF##*/}
GIT_COMMIT_SHA=${GITHUB_SHA:-$(git rev-parse HEAD)}
DOCKER_IMAGE_BRANCH_TAG=${DOCKER_IMAGE}-${GIT_BRANCH_NAME}
DOCKER_IMAGE_SHA_TAG=${DOCKER_IMAGE}-${GIT_COMMIT_SHA}
if [[ "${WITH_PUSH}" == true ]]; then
(
set -x
docker push "${DOCKER_IMAGE}"
if [[ -n ${GITHUB_REF} ]]; then
docker tag "${DOCKER_IMAGE}" "${DOCKER_IMAGE_BRANCH_TAG}"
docker tag "${DOCKER_IMAGE}" "${DOCKER_IMAGE_SHA_TAG}"
docker push "${DOCKER_IMAGE_BRANCH_TAG}"
docker push "${DOCKER_IMAGE_SHA_TAG}"
fi
)
fi
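
The `rocm)` branch of the script above packs the ROCm version string into one integer and appends `gfx942` from ROCm 6.0 onward. The script also derives the Dockerfile name from `MANY_LINUX_VERSION` and `DOCKERFILE_SUFFIX`. The following is a minimal Python sketch of that arithmetic, not part of the CD scripts. The function names (`rocm_version_int`, `rocm_arch_list`, `dockerfile_name`) are hypothetical and exist only for illustration.

```python
import re

# Same pattern as ROCM_REGEX in the script: major.minor with an optional patch.
ROCM_REGEX = re.compile(r"([0-9]+)\.([0-9]+)[\.]?([0-9]*)")

# Architectures the script lists before the version check.
BASE_ARCHS = ["gfx900", "gfx906", "gfx908", "gfx90a", "gfx1030", "gfx1100"]


def rocm_version_int(version):
    """Encode 'major.minor[.patch]' as major*10000 + minor*100 + patch."""
    match = ROCM_REGEX.search(version)  # bash =~ is an unanchored search too
    if match is None:
        raise ValueError("rocm regex failed")
    major, minor, patch = match.groups()
    # A missing patch component counts as 0, like ${BASH_REMATCH[3]:-0}.
    return int(major) * 10000 + int(minor) * 100 + int(patch or 0)


def rocm_arch_list(version):
    """Return the semicolon-separated architecture list for a ROCm version."""
    archs = list(BASE_ARCHS)
    if rocm_version_int(version) >= 60000:
        archs.append("gfx942")
    return ";".join(archs)


def dockerfile_name(many_linux_version="", suffix=""):
    """An explicit suffix wins; otherwise the manylinux version becomes the suffix."""
    if many_linux_version and not suffix:
        suffix = "_" + many_linux_version
    return "Dockerfile" + suffix
```

For example, `dockerfile_name("aarch64", "_cuda_aarch64")` keeps the explicit suffix, which matches how the `cuda-aarch64` case sets `DOCKERFILE_SUFFIX` itself.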


@ -0,0 +1,131 @@
#!/bin/bash
# Top-level build script called from Dockerfile
# Script used only in CD pipeline
# Stop at any error, show all commands
set -ex
# openssl version to build, with expected sha256 hash of .tar.gz
# archive
OPENSSL_ROOT=openssl-1.1.1l
OPENSSL_HASH=0b7a3e5e59c34827fe0c3a74b7ec8baef302b98fa80088d7f9153aa16fa76bd1
DEVTOOLS_HASH=a8ebeb4bed624700f727179e6ef771dafe47651131a00a78b342251415646acc
PATCHELF_HASH=d9afdff4baeacfbc64861454f368b7f2c15c44d245293f7587bbf726bfe722fb
CURL_ROOT=curl-7.73.0
CURL_HASH=cf34fe0b07b800f1c01a499a6e8b2af548f6d0e044dca4a29d88a4bee146d131
AUTOCONF_ROOT=autoconf-2.69
AUTOCONF_HASH=954bd69b391edc12d6a4a51a2dd1476543da5c6bbf05a95b59dc0dd6fd4c2969
# Get build utilities
MY_DIR=$(dirname "${BASH_SOURCE[0]}")
source $MY_DIR/build_utils.sh
if [ "$(uname -m)" != "s390x" ] ; then
# Dependencies for compiling Python that we want to remove from
# the final image after compiling Python
PYTHON_COMPILE_DEPS="zlib-devel bzip2-devel ncurses-devel sqlite-devel readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel libffi-devel"
# Libraries that are allowed as part of the manylinux1 profile
MANYLINUX1_DEPS="glibc-devel libstdc++-devel glib2-devel libX11-devel libXext-devel libXrender-devel mesa-libGL-devel libICE-devel libSM-devel ncurses-devel"
# Development tools and libraries
yum -y install bzip2 make git patch unzip bison yasm diffutils \
automake which file cmake28 \
kernel-devel-`uname -r` \
${PYTHON_COMPILE_DEPS}
else
# Dependencies for compiling Python that we want to remove from
# the final image after compiling Python
PYTHON_COMPILE_DEPS="zlib1g-dev libbz2-dev libncurses-dev libsqlite3-dev libdb-dev libpcap-dev liblzma-dev libffi-dev"
# Libraries that are allowed as part of the manylinux1 profile
MANYLINUX1_DEPS="libglib2.0-dev libX11-dev libncurses-dev"
# Development tools and libraries
apt install -y bzip2 make git patch unzip diffutils \
automake which file cmake \
linux-headers-virtual \
${PYTHON_COMPILE_DEPS}
fi
# Install newest autoconf
build_autoconf $AUTOCONF_ROOT $AUTOCONF_HASH
autoconf --version
# Compile the latest Python releases.
# (In order to have a proper SSL module, Python is compiled
# against a recent openssl [see env vars above], which is linked
# statically. We delete openssl afterwards.)
build_openssl $OPENSSL_ROOT $OPENSSL_HASH
/build_scripts/install_cpython.sh
PY39_BIN=/opt/python/cp39-cp39/bin
# Our openssl doesn't know how to find the system CA trust store
# (https://github.com/pypa/manylinux/issues/53)
# And it's not clear how up-to-date that is anyway
# So let's just use the same one pip and everyone uses
$PY39_BIN/pip install certifi
ln -s $($PY39_BIN/python -c 'import certifi; print(certifi.where())') \
/opt/_internal/certs.pem
# If you modify this line you also have to modify the versions in the
# Dockerfiles:
export SSL_CERT_FILE=/opt/_internal/certs.pem
# Install newest curl
build_curl $CURL_ROOT $CURL_HASH
rm -rf /usr/local/include/curl /usr/local/lib/libcurl* /usr/local/lib/pkgconfig/libcurl.pc
hash -r
curl --version
curl-config --features
# Install patchelf (latest with unreleased bug fixes)
curl -sLOk https://nixos.org/releases/patchelf/patchelf-0.10/patchelf-0.10.tar.gz
# check_sha256sum patchelf-0.9njs2.tar.gz $PATCHELF_HASH
tar -xzf patchelf-0.10.tar.gz
(cd patchelf-0.10 && ./configure && make && make install)
rm -rf patchelf-0.10.tar.gz patchelf-0.10
# Install latest pypi release of auditwheel
$PY39_BIN/pip install auditwheel
ln -s $PY39_BIN/auditwheel /usr/local/bin/auditwheel
# Clean up development headers and other unnecessary stuff for
# final image
if [ "$(uname -m)" != "s390x" ] ; then
yum -y erase wireless-tools gtk2 libX11 hicolor-icon-theme \
avahi freetype bitstream-vera-fonts \
${PYTHON_COMPILE_DEPS} > /dev/null 2>&1 || true
yum -y install ${MANYLINUX1_DEPS}
yum -y clean all > /dev/null 2>&1
yum list installed
else
apt purge -y ${PYTHON_COMPILE_DEPS} > /dev/null 2>&1 || true
fi
# we don't need libpython*.a, and they're many megabytes
find /opt/_internal -name '*.a' -print0 | xargs -0 rm -f
# Strip what we can -- and ignore errors, because this just attempts to strip
# *everything*, including non-ELF files:
find /opt/_internal -type f -print0 \
| xargs -0 -n1 strip --strip-unneeded 2>/dev/null || true
# We do not need the Python test suites, or indeed the precompiled .pyc and
# .pyo files. Partially cribbed from:
# https://github.com/docker-library/python/blob/master/3.4/slim/Dockerfile
find /opt/_internal -depth \
\( \( -type d -a \( -name test -o -name tests \) \) \
-o \( -type f -a \( -name '*.pyc' -o -name '*.pyo' \) \) \) \
-print0 | xargs -0 rm -rf
for PYTHON in /opt/python/*/bin/python; do
# Smoke test to make sure that our Pythons work, and do indeed detect as
# being manylinux compatible:
$PYTHON $MY_DIR/manylinux1-check.py
# Make sure that SSL cert checking works
$PYTHON $MY_DIR/ssl-check.py
done
# Fix libc headers to remain compatible with C99 compilers.
find /usr/include/ -type f -exec sed -i 's/\bextern _*inline_*\b/extern __inline __attribute__ ((__gnu_inline__))/g' {} +
# Now we can delete our built SSL
rm -rf /usr/local/ssl
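
The cleanup comment in the script above says the image does not need the Python test suites or the precompiled `.pyc` and `.pyo` files. As a rough illustration of that selection rule only, here is a small Python sketch that lists what such a pass would pick under a directory. The helper name `cleanup_candidates` is hypothetical, and the sketch lists paths without deleting anything.

```python
from pathlib import Path


def cleanup_candidates(root):
    """List directories named test/tests and files ending in .pyc/.pyo under root."""
    selected = []
    for path in Path(root).rglob("*"):
        if path.is_dir() and path.name in ("test", "tests"):
            selected.append(path)
        elif path.is_file() and path.suffix in (".pyc", ".pyo"):
            selected.append(path)
    # Sort by string form so the result is stable across platforms.
    return sorted(selected, key=lambda p: p.as_posix())
```

Listing candidates before removing them is a cheap way to check that the grouping of the name and type tests selects what the comment intends.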


@ -0,0 +1,91 @@
#!/bin/bash
# Helper utilities for build
# Script used only in CD pipeline
OPENSSL_DOWNLOAD_URL=https://www.openssl.org/source/old/1.1.1/
CURL_DOWNLOAD_URL=https://curl.askapache.com/download
AUTOCONF_DOWNLOAD_URL=https://ftp.gnu.org/gnu/autoconf
function check_var {
if [ -z "$1" ]; then
echo "required variable not defined"
exit 1
fi
}
function do_openssl_build {
./config no-ssl2 no-shared -fPIC --prefix=/usr/local/ssl > /dev/null
make > /dev/null
make install > /dev/null
}
function check_sha256sum {
local fname=$1
check_var ${fname}
local sha256=$2
check_var ${sha256}
echo "${sha256} ${fname}" > ${fname}.sha256
sha256sum -c ${fname}.sha256
rm -f ${fname}.sha256
}
function build_openssl {
local openssl_fname=$1
check_var ${openssl_fname}
local openssl_sha256=$2
check_var ${openssl_sha256}
check_var ${OPENSSL_DOWNLOAD_URL}
curl -sLO ${OPENSSL_DOWNLOAD_URL}/${openssl_fname}.tar.gz
check_sha256sum ${openssl_fname}.tar.gz ${openssl_sha256}
tar -xzf ${openssl_fname}.tar.gz
(cd ${openssl_fname} && do_openssl_build)
rm -rf ${openssl_fname} ${openssl_fname}.tar.gz
}
function do_curl_build {
LIBS=-ldl ./configure --with-ssl --disable-shared > /dev/null
make > /dev/null
make install > /dev/null
}
function build_curl {
local curl_fname=$1
check_var ${curl_fname}
local curl_sha256=$2
check_var ${curl_sha256}
check_var ${CURL_DOWNLOAD_URL}
curl -sLO ${CURL_DOWNLOAD_URL}/${curl_fname}.tar.bz2
check_sha256sum ${curl_fname}.tar.bz2 ${curl_sha256}
tar -jxf ${curl_fname}.tar.bz2
(cd ${curl_fname} && do_curl_build)
rm -rf ${curl_fname} ${curl_fname}.tar.bz2
}
function do_standard_install {
./configure > /dev/null
make > /dev/null
make install > /dev/null
}
function build_autoconf {
local autoconf_fname=$1
check_var ${autoconf_fname}
local autoconf_sha256=$2
check_var ${autoconf_sha256}
check_var ${AUTOCONF_DOWNLOAD_URL}
curl -sLO ${AUTOCONF_DOWNLOAD_URL}/${autoconf_fname}.tar.gz
check_sha256sum ${autoconf_fname}.tar.gz ${autoconf_sha256}
tar -zxf ${autoconf_fname}.tar.gz
(cd ${autoconf_fname} && do_standard_install)
rm -rf ${autoconf_fname} ${autoconf_fname}.tar.gz
}
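
Each `build_*` helper above downloads an archive and passes it through `check_sha256sum` before unpacking. The same check can be sketched in Python with the standard `hashlib` module. The names `sha256_of` and `verify_sha256` are hypothetical and are not used anywhere in these scripts.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives are not loaded into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256(path, expected):
    """Return True when the file's SHA-256 digest equals the expected hex string."""
    if not path or not expected:
        # Mirrors check_var: refuse to continue with an empty argument.
        raise ValueError("required variable not defined")
    return sha256_of(path) == expected.lower()
```

A caller would stop the build when `verify_sha256` returns `False`, which corresponds to `sha256sum -c` exiting with a non-zero status.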


@ -0,0 +1,60 @@
# Logic copied from PEP 513

def is_manylinux1_compatible():
    # Only Linux, and only x86-64 / i686 / s390x
    from distutils.util import get_platform
    if get_platform() not in ["linux-x86_64", "linux-i686", "linux-s390x"]:
        return False
    # Check for presence of _manylinux module
    try:
        import _manylinux
        return bool(_manylinux.manylinux1_compatible)
    except (ImportError, AttributeError):
        # Fall through to heuristic check below
        pass
    # Check glibc version. CentOS 5 uses glibc 2.5.
    return have_compatible_glibc(2, 5)

def have_compatible_glibc(major, minimum_minor):
    import ctypes
    process_namespace = ctypes.CDLL(None)
    try:
        gnu_get_libc_version = process_namespace.gnu_get_libc_version
    except AttributeError:
        # Symbol doesn't exist -> therefore, we are not linked to
        # glibc.
        return False
    # Call gnu_get_libc_version, which returns a string like "2.5".
    gnu_get_libc_version.restype = ctypes.c_char_p
    version_str = gnu_get_libc_version()
    # py2 / py3 compatibility:
    if not isinstance(version_str, str):
        version_str = version_str.decode("ascii")
    # Parse string and check against requested version.
    version = [int(piece) for piece in version_str.split(".")]
    assert len(version) == 2
    if major != version[0]:
        return False
    if minimum_minor > version[1]:
        return False
    return True

import sys
if is_manylinux1_compatible():
    print(f"{sys.executable} is manylinux1 compatible")
    sys.exit(0)
else:
    print(f"{sys.executable} is NOT manylinux1 compatible")
    sys.exit(1)
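
The glibc check above needs a live `ctypes` call, which makes the comparison itself hard to try out in isolation. The following sketch pulls the string parsing and comparison into a pure function, so the rule (same major version, minor at least the minimum) can be exercised with plain strings. The name `glibc_version_ok` is hypothetical and does not appear in the script.

```python
def glibc_version_ok(version_str, major, minimum_minor):
    """Apply the PEP 513 style rule to a version string such as '2.17'."""
    parts = [int(piece) for piece in version_str.split(".")]
    if len(parts) != 2:
        # The script asserts on this; here it is reported as a ValueError.
        raise ValueError("expected a 'major.minor' version string")
    return parts[0] == major and parts[1] >= minimum_minor
```

With the manylinux1 baseline of glibc 2.5, `glibc_version_ok("2.17", 2, 5)` passes, while a `2.4` or a `3.0` string does not.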


@ -0,0 +1,35 @@
# cf. https://github.com/pypa/manylinux/issues/53
GOOD_SSL = "https://google.com"
BAD_SSL = "https://self-signed.badssl.com"
import sys
print("Testing SSL certificate checking for Python:", sys.version)
if sys.version_info[:2] < (2, 7) or sys.version_info[:2] < (3, 4):
print("This version never checks SSL certs; skipping tests")
sys.exit(0)
if sys.version_info[0] >= 3:
from urllib.request import urlopen
EXC = OSError
else:
from urllib import urlopen
EXC = IOError
print(f"Connecting to {GOOD_SSL} should work")
urlopen(GOOD_SSL)
print("...it did, yay.")
print(f"Connecting to {BAD_SSL} should fail")
try:
urlopen(BAD_SSL)
# If we get here then we failed:
print("...it DIDN'T!!!!!11!!1one!")
sys.exit(1)
except EXC:
print("...it did, yay.")
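
Note that the version gate in this script joins the comparisons against (2, 7) and (3, 4) with `or`, so every 2.x interpreter satisfies the second comparison and skips the test. A minimal sketch of a gate that treats the two release lines separately follows, assuming the intent is to skip only interpreters older than 2.7 on the 2.x line and older than 3.4 on the 3.x line; `never_checks_ssl_certs` is a hypothetical name, not part of the script.

```python
def never_checks_ssl_certs(version_info):
    """Return True if this interpreter should skip the certificate test."""
    major_minor = tuple(version_info[:2])
    if major_minor[0] == 2:
        # 2.x line: only releases before 2.7 are skipped.
        return major_minor < (2, 7)
    # 3.x line (and later): only releases before 3.4 are skipped.
    return major_minor < (3, 4)
```

In the script this would be called as `never_checks_ssl_certs(sys.version_info)`.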

View File

@ -30,9 +30,14 @@ dill==0.3.7
#Pinned versions: 0.3.7
#test that import: dynamo/test_replay_record.py test_dataloader.py test_datapipe.py test_serialization.py
expecttest==0.1.6
expecttest==0.2.1
#Description: method for writing tests where test framework auto populates
# the expected output based on previous runs
#Pinned versions: 0.2.1
#test that import:
fbscribelogger==0.1.6
#Description: write to scribe from authenticated jobs on CI
#Pinned versions: 0.1.6
#test that import:
@ -85,10 +90,10 @@ librosa>=0.6.2 ; python_version < "3.11"
#Pinned versions:
#test that import:
mypy==1.9.0
mypy==1.10.0
# Pin MyPy version because new errors are likely to appear with each release
#Description: linter
#Pinned versions: 1.9.0
#Pinned versions: 1.10.0
#test that import: test_typing.py, test_type_hints.py
networkx==2.8.8
@ -104,7 +109,7 @@ networkx==2.8.8
#test that import: run_test.py, test_cpp_extensions_aot.py,test_determination.py
numba==0.49.0 ; python_version < "3.9"
numba==0.54.1 ; python_version == "3.9"
numba==0.55.2 ; python_version == "3.9"
numba==0.55.2 ; python_version == "3.10"
#Description: Just-In-Time Compiler for Numerical Functions
#Pinned versions: 0.54.1, 0.49.0, <=0.49.1
@ -134,9 +139,9 @@ opt-einsum==3.3
#Pinned versions: 3.3
#test that import: test_linalg.py
optree==0.11.0
optree==0.12.1
#Description: A library for tree manipulation
#Pinned versions: 0.11.0
#Pinned versions: 0.12.1
#test that import: test_vmap.py, test_aotdispatch.py, test_dynamic_shapes.py,
#test_pytree.py, test_ops.py, test_control_flow.py, test_modules.py,
#common_utils.py, test_eager_transforms.py, test_python_dispatch.py,
@ -218,7 +223,7 @@ pygments==2.15.0
#test that import:
scikit-image==0.19.3 ; python_version < "3.10"
scikit-image==0.20.0 ; python_version >= "3.10"
scikit-image==0.22.0 ; python_version >= "3.10"
#Description: image processing routines
#Pinned versions:
#test that import: test_nn.py
@ -269,6 +274,10 @@ lintrunner==0.12.5
#Pinned versions: 0.12.5
#test that import:
redis>=4.0.0
#Description: redis database
#test that import: anything that tests OSS caching/mocking (inductor/test_codecache.py, inductor/test_max_autotune.py)
rockset==1.0.3
#Description: queries Rockset
#Pinned versions: 1.0.3
@ -306,9 +315,30 @@ pywavelets==1.5.0 ; python_version >= "3.12"
#Pinned versions: 1.4.1
#test that import:
lxml==5.0.0.
lxml==5.0.0
#Description: This is a requirement of unittest-xml-reporting
# Python-3.9 binaries
PyGithub==2.3.0
sympy==1.12.1 ; python_version == "3.8"
sympy==1.13.1 ; python_version >= "3.9"
#Description: Required by coremltools, also pinned in .github/requirements/pip-requirements-macOS.txt
#Pinned versions:
#test that import:
onnx==1.16.1
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
onnxscript==0.1.0.dev20240817
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
parameterized==0.8.1
#Description: Parameterizes unittests, both the tests themselves and the entire testing class
#Pinned versions:
#test that import:

View File

@ -1 +1 @@
3.0.0
3.1.0

View File

@ -103,6 +103,14 @@ COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt
ARG HALIDE
# Build and install halide
COPY ./common/install_halide.sh install_halide.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/halide.txt halide.txt
RUN if [ -n "${HALIDE}" ]; then bash ./install_halide.sh; fi
RUN rm install_halide.sh common_utils.sh halide.txt
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh
ENV PATH /opt/cache/bin:$PATH
@ -139,7 +147,7 @@ COPY --from=pytorch/llvm:9.0.1 /opt/llvm /opt/llvm
ARG CUDNN_VERSION
ARG CUDA_VERSION
COPY ./common/install_cudnn.sh install_cudnn.sh
RUN if [ "${CUDNN_VERSION}" -eq 8 ]; then bash install_cudnn.sh; fi
RUN if [ -n "${CUDNN_VERSION}" ]; then bash install_cudnn.sh; fi
RUN rm install_cudnn.sh
# Install CUSPARSELT
@ -148,10 +156,17 @@ COPY ./common/install_cusparselt.sh install_cusparselt.sh
RUN bash install_cusparselt.sh
RUN rm install_cusparselt.sh
# Install CUDSS
ARG CUDA_VERSION
COPY ./common/install_cudss.sh install_cudss.sh
RUN bash install_cudss.sh
RUN rm install_cudss.sh
# Delete /usr/local/cuda-11.X/cuda-11.X symlinks
RUN if [ -h /usr/local/cuda-11.6/cuda-11.6 ]; then rm /usr/local/cuda-11.6/cuda-11.6; fi
RUN if [ -h /usr/local/cuda-11.7/cuda-11.7 ]; then rm /usr/local/cuda-11.7/cuda-11.7; fi
RUN if [ -h /usr/local/cuda-12.1/cuda-12.1 ]; then rm /usr/local/cuda-12.1/cuda-12.1; fi
RUN if [ -h /usr/local/cuda-12.4/cuda-12.4 ]; then rm /usr/local/cuda-12.4/cuda-12.4; fi
USER jenkins
CMD ["bash"]

View File

@ -78,6 +78,11 @@ ENV MAGMA_HOME /opt/rocm/magma
ENV LANG C.UTF-8
ENV LC_ALL C.UTF-8
# Install amdsmi
COPY ./common/install_amdsmi.sh install_amdsmi.sh
RUN bash ./install_amdsmi.sh
RUN rm install_amdsmi.sh
# (optional) Install non-default CMake version
ARG CMAKE_VERSION
COPY ./common/install_cmake.sh install_cmake.sh
@ -95,10 +100,17 @@ ARG TRITON
# try to reach out to S3, which docker build runners don't have access
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton-rocm.txt triton-rocm.txt
COPY ci_commit_pins/triton.txt triton.txt
COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton-rocm.txt triton_version.txt
RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt
# Install AOTriton
COPY ./aotriton_version.txt aotriton_version.txt
COPY ./common/common_utils.sh common_utils.sh
COPY ./common/install_aotriton.sh install_aotriton.sh
RUN ["/bin/bash", "-c", "./install_aotriton.sh /opt/rocm && rm -rf install_aotriton.sh aotriton_version.txt common_utils.sh"]
ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton
# Install ccache/sccache (do this last, so we get priority in PATH)
COPY ./common/install_cache.sh install_cache.sh

View File

@ -30,6 +30,7 @@ RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ARG DOCS
ARG BUILD_ENVIRONMENT
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
ENV DOCS=$DOCS

View File

@ -50,7 +50,7 @@ RUN bash ./install_lcov.sh && rm install_lcov.sh
# Install cuda and cudnn
ARG CUDA_VERSION
RUN wget -q https://raw.githubusercontent.com/pytorch/builder/main/common/install_cuda.sh -O install_cuda.sh
COPY ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh
ENV DESIRED_CUDA ${CUDA_VERSION}
ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:$PATH
@ -155,6 +155,14 @@ COPY ci_commit_pins/executorch.txt executorch.txt
RUN if [ -n "${EXECUTORCH}" ]; then bash ./install_executorch.sh; fi
RUN rm install_executorch.sh common_utils.sh executorch.txt
ARG HALIDE
# Build and install halide
COPY ./common/install_halide.sh install_halide.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/halide.txt halide.txt
RUN if [ -n "${HALIDE}" ]; then bash ./install_halide.sh; fi
RUN rm install_halide.sh common_utils.sh halide.txt
ARG ONNX
# Install ONNX dependencies
COPY ./common/install_onnx.sh ./common/common_utils.sh ./

View File

@ -1,42 +1 @@
This directory contains scripts for our continuous integration.
One important thing to keep in mind when reading the scripts here is
that they are all based off of Docker images, which we build for each of
the various system configurations we want to run on Jenkins. This means
it is very easy to run these tests yourself:
1. Figure out what Docker image you want. The general template for our
images looks like:
``registry.pytorch.org/pytorch/pytorch-$BUILD_ENVIRONMENT:$DOCKER_VERSION``,
where ``$BUILD_ENVIRONMENT`` is one of the build environments
enumerated in
[pytorch-dockerfiles](https://github.com/pytorch/pytorch/blob/master/.ci/docker/build.sh). The dockerfile used by jenkins can be found under the `.ci` [directory](https://github.com/pytorch/pytorch/blob/master/.ci/docker)
2. Run ``docker run -it -u jenkins $DOCKER_IMAGE``, clone PyTorch and
run one of the scripts in this directory.
The Docker images are designed so that any "reasonable" build commands
will work; if you look in [build.sh](build.sh) you will see that it is a
very simple script. This is intentional. Idiomatic build instructions
should work inside all of our Docker images. You can tweak the commands
however you need (e.g., in case you want to rebuild with DEBUG, or rerun
the build with higher verbosity, etc.).
We have to do some work to make this so. Here is a summary of the
mechanisms we use:
- We install binaries to directories like `/usr/local/bin` which
are automatically part of your PATH.
- We add entries to the PATH using Docker ENV variables (so
they apply when you enter Docker) and `/etc/environment` (so they
continue to apply even if you sudo), instead of modifying
`PATH` in our build scripts.
- We use `/etc/ld.so.conf.d` to register directories containing
shared libraries, instead of modifying `LD_LIBRARY_PATH` in our
build scripts.
- We reroute well known paths like `/usr/bin/gcc` to alternate
implementations with `update-alternatives`, instead of setting
`CC` and `CXX` in our implementations.
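
The image template and the `docker run` invocation described in the steps above can be sketched as follows. This is only an illustration: the function names are hypothetical, and the build environment and version passed in the usage line are placeholder values, not real tags.

```python
def docker_image(build_environment, docker_version,
                 registry="registry.pytorch.org"):
    """Fill in the image template quoted in step 1 of the README."""
    return f"{registry}/pytorch/pytorch-{build_environment}:{docker_version}"


def docker_run_command(image, user="jenkins"):
    """Build the argument list for the `docker run` call in step 2."""
    return ["docker", "run", "-it", "-u", user, image]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    image = docker_image("example-build-environment", "1")
    print(" ".join(docker_run_command(image)))
```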

View File

@ -44,10 +44,7 @@ if [[ "$BUILD_ENVIRONMENT" == *cuda11* ]]; then
fi
fi
if [[ ${BUILD_ENVIRONMENT} == *"paralleltbb"* ]]; then
export ATEN_THREADING=TBB
export USE_TBB=1
elif [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
if [[ ${BUILD_ENVIRONMENT} == *"parallelnative"* ]]; then
export ATEN_THREADING=NATIVE
fi
@ -179,7 +176,8 @@ fi
if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
# shellcheck disable=SC1091
source /opt/intel/oneapi/compiler/latest/env/vars.sh
export USE_XPU=1
# XPU kineto feature dependencies are not fully ready, disable kineto build as temp WA
export USE_KINETO=0
fi
# sccache will fail for CUDA builds if all cores are used for compiling
@ -233,6 +231,10 @@ if [[ "${BUILD_ENVIRONMENT}" != *android* && "${BUILD_ENVIRONMENT}" != *cuda* ]]
export BUILD_STATIC_RUNTIME_BENCHMARK=ON
fi
if [[ "$BUILD_ENVIRONMENT" == *-debug* ]]; then
export CMAKE_BUILD_TYPE=RelWithAssert
fi
# Do not change workspace permissions for ROCm CI jobs
# as it can leave workspace with bad permissions for cancelled jobs
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
@ -283,13 +285,29 @@ else
if [[ "$BUILD_ENVIRONMENT" != *rocm* &&
"$BUILD_ENVIRONMENT" != *xla* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *py3.8* ]]; then
# Install numpy-2.0 release candidate for builds
# Which should be backward compatible with Numpy-1.X
python -mpip install --pre numpy==2.0.0rc1
# Install numpy-2.0.2 for builds which are backward compatible with 1.X
python -mpip install --pre numpy==2.0.2
fi
WERROR=1 python setup.py clean
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
BUILD_LIBTORCH_WHL=1 BUILD_PYTHON_ONLY=0 python setup.py bdist_wheel
BUILD_LIBTORCH_WHL=0 BUILD_PYTHON_ONLY=1 python setup.py bdist_wheel --cmake
else
WERROR=1 python setup.py bdist_wheel
fi
WERROR=1 python setup.py bdist_wheel
else
python setup.py bdist_wheel
python setup.py clean
if [[ "$BUILD_ENVIRONMENT" == *xla* ]]; then
source .ci/pytorch/install_cache_xla.sh
fi
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
echo "USE_SPLIT_BUILD cannot be used with xla or rocm"
exit 1
else
python setup.py bdist_wheel
fi
fi
pip_install_whl "$(echo dist/*.whl)"
@ -328,9 +346,10 @@ else
CUSTOM_OP_TEST="$PWD/test/custom_operator"
python --version
SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"
mkdir -p "$CUSTOM_OP_BUILD"
pushd "$CUSTOM_OP_BUILD"
cmake "$CUSTOM_OP_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" -DPYTHON_EXECUTABLE="$(which python)" \
cmake "$CUSTOM_OP_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \
-DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"
make VERBOSE=1
popd
@ -343,7 +362,7 @@ else
SITE_PACKAGES="$(python -c 'from distutils.sysconfig import get_python_lib; print(get_python_lib())')"
mkdir -p "$JIT_HOOK_BUILD"
pushd "$JIT_HOOK_BUILD"
cmake "$JIT_HOOK_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" -DPYTHON_EXECUTABLE="$(which python)" \
cmake "$JIT_HOOK_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \
-DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"
make VERBOSE=1
popd
@ -355,7 +374,7 @@ else
python --version
mkdir -p "$CUSTOM_BACKEND_BUILD"
pushd "$CUSTOM_BACKEND_BUILD"
cmake "$CUSTOM_BACKEND_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch" -DPYTHON_EXECUTABLE="$(which python)" \
cmake "$CUSTOM_BACKEND_TEST" -DCMAKE_PREFIX_PATH="$SITE_PACKAGES/torch;$SITE_PACKAGES" -DPython_EXECUTABLE="$(which python)" \
-DCMAKE_MODULE_PATH="$CUSTOM_TEST_MODULE_PATH" -DUSE_ROCM="$CUSTOM_TEST_USE_ROCM"
make VERBOSE=1
popd

View File

@ -56,9 +56,29 @@ function assert_git_not_dirty() {
function pip_install_whl() {
# This is used to install PyTorch and other build artifacts wheel locally
# without using any network connection
python3 -mpip install --no-index --no-deps "$@"
# Convert the input arguments into an array
local args=("$@")
# Check if the first argument contains multiple paths separated by spaces
if [[ "${args[0]}" == *" "* ]]; then
# Split the string by spaces into an array
IFS=' ' read -r -a paths <<< "${args[0]}"
# Loop through each path and install individually
for path in "${paths[@]}"; do
echo "Installing $path"
python3 -mpip install --no-index --no-deps "$path"
done
else
# Loop through each argument and install individually
for path in "${args[@]}"; do
echo "Installing $path"
python3 -mpip install --no-index --no-deps "$path"
done
fi
}
function pip_install() {
# retry 3 times
# old versions of pip don't have the "--progress-bar" flag
@ -159,7 +179,7 @@ function install_torchvision() {
}
function install_tlparse() {
pip_install --user "tlparse==0.3.7"
pip_install --user "tlparse==0.3.25"
PATH="$(python -m site --user-base)/bin:$PATH"
}
@ -188,28 +208,6 @@ function clone_pytorch_xla() {
fi
}
function checkout_install_torchdeploy() {
local commit
commit=$(get_pinned_commit multipy)
pushd ..
git clone --recurse-submodules https://github.com/pytorch/multipy.git
pushd multipy
git checkout "${commit}"
python multipy/runtime/example/generate_examples.py
BUILD_CUDA_TESTS=1 pip install -e .
popd
popd
}
function test_torch_deploy(){
pushd ..
pushd multipy
./multipy/runtime/build/test_deploy
./multipy/runtime/build/test_deploy_gpu
popd
popd
}
function checkout_install_torchbench() {
local commit
commit=$(get_pinned_commit torchbench)
@ -224,6 +222,8 @@ function checkout_install_torchbench() {
# to install and test other models
python install.py --continue_on_fail
fi
echo "Print all dependencies after TorchBench is installed"
python -mpip freeze
popd
}
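
The reworked `pip_install_whl` in this file accepts either several path arguments or a single space-separated string, such as the one produced by `$(echo dist/*.whl)`. A Python sketch of the same argument normalization follows, to make the two branches easier to compare; `split_wheel_args` is a hypothetical name and the sketch does not run pip.

```python
def split_wheel_args(args):
    """Return the list of wheel paths that would each be installed."""
    args = list(args)
    if args and " " in args[0]:
        # Mirrors the first shell branch: only the first argument is split
        # on spaces, and any further arguments are not visited.
        return [path for path in args[0].split(" ") if path]
    # Mirrors the else branch: every argument is one path.
    return args
```

As in the shell version, a path that itself contains a space would be split into pieces, so this approach assumes wheel paths without spaces.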

View File

@ -6,6 +6,7 @@ from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa
from cryptography.x509.oid import NameOID
temp_dir = mkdtemp()
print(temp_dir)

View File

@ -6,4 +6,4 @@ source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
echo "Testing pytorch docs"
cd docs
make doctest
TERM=vt100 make doctest

View File

@ -0,0 +1,37 @@
#!/bin/bash
# Script for installing sccache on the xla build job, which uses xla's docker
# image and doesn't have sccache installed on it. This is mostly copied from
# .ci/docker/install_cache.sh. Changes are: removing checks that will always
# return the same thing, e.g. checks for rocm, CUDA, and changing the path
# where sccache is installed, and not changing /etc/environment.
set -ex
install_binary() {
echo "Downloading sccache binary from S3 repo"
curl --retry 3 https://s3.amazonaws.com/ossci-linux/sccache -o /tmp/cache/bin/sccache
}
mkdir -p /tmp/cache/bin
mkdir -p /tmp/cache/lib
export PATH="/tmp/cache/bin:$PATH"
install_binary
chmod a+x /tmp/cache/bin/sccache
function write_sccache_stub() {
# Unset LD_PRELOAD for ps because of asan + ps issues
# https://gcc.gnu.org/bugzilla/show_bug.cgi?id=90589
# shellcheck disable=SC2086
# shellcheck disable=SC2059
printf "#!/bin/sh\nif [ \$(env -u LD_PRELOAD ps -p \$PPID -o comm=) != sccache ]; then\n exec sccache $(which $1) \"\$@\"\nelse\n exec $(which $1) \"\$@\"\nfi" > "/tmp/cache/bin/$1"
chmod a+x "/tmp/cache/bin/$1"
}
write_sccache_stub cc
write_sccache_stub c++
write_sccache_stub gcc
write_sccache_stub g++
write_sccache_stub clang
write_sccache_stub clang++

View File

@ -18,7 +18,9 @@ time python test/run_test.py --verbose -i distributed/test_c10d_gloo
time python test/run_test.py --verbose -i distributed/test_c10d_nccl
time python test/run_test.py --verbose -i distributed/test_c10d_spawn_gloo
time python test/run_test.py --verbose -i distributed/test_c10d_spawn_nccl
time python test/run_test.py --verbose -i distributed/test_compute_comm_reordering
time python test/run_test.py --verbose -i distributed/test_store
time python test/run_test.py --verbose -i distributed/test_symmetric_memory
time python test/run_test.py --verbose -i distributed/test_pg_wrapper
time python test/run_test.py --verbose -i distributed/rpc/cuda/test_tensorpipe_agent
# FSDP tests
@ -42,14 +44,16 @@ time python test/run_test.py --verbose -i distributed/_tensor/test_dtensor_compi
time python test/run_test.py --verbose -i distributed/test_device_mesh
# DTensor/TP tests
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_ddp_2d_parallel
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_fsdp_2d_parallel
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_examples
time python test/run_test.py --verbose -i distributed/tensor/parallel/test_tp_random_state
# FSDP2 tests
time python test/run_test.py --verbose -i distributed/_composable/fsdp/test_fully_shard_training -- -k test_2d_mlp_with_nd_mesh
# ND composability tests
time python test/run_test.py --verbose -i distributed/_composable/test_composability/test_2d_composability
time python test/run_test.py --verbose -i distributed/_composable/test_composability/test_pp_composability
# Other tests
time python test/run_test.py --verbose -i test_cuda_primary_ctx
time python test/run_test.py --verbose -i test_optim -- -k test_forloop_goes_right_direction_multigpu

View File

@ -3,6 +3,7 @@ import json
import math
import sys
parser = argparse.ArgumentParser()
parser.add_argument(
"--test-name", dest="test_name", action="store", required=True, help="test name"

View File

@ -3,6 +3,7 @@ import sys
import numpy
sample_data_list = sys.argv[1:]
sample_data_list = [float(v.strip()) for v in sample_data_list]

View File

@ -1,6 +1,7 @@
import json
import sys
data_file_path = sys.argv[1]
commit_hash = sys.argv[2]

View File

@ -1,5 +1,6 @@
import sys
log_file_path = sys.argv[1]
with open(log_file_path) as f:

View File

@ -6,6 +6,9 @@
set -ex
# Suppress ANSI color escape sequences
export TERM=vt100
# shellcheck source=./common.sh
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
@ -166,7 +169,7 @@ fi
if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
# Source Intel oneAPI environment script to enable xpu runtime related libraries
# refer to https://www.intel.com/content/www/us/en/docs/oneapi/programming-guide/2024-0/use-the-setvars-and-oneapi-vars-scripts-with-linux.html
# refer to https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpu/2-5.html
# shellcheck disable=SC1091
source /opt/intel/oneapi/compiler/latest/env/vars.sh
# Check XPU status before testing
@ -249,9 +252,7 @@ fi
# This tests that the debug asserts are working correctly.
if [[ "$BUILD_ENVIRONMENT" == *-debug* ]]; then
echo "We are in debug mode: $BUILD_ENVIRONMENT. Expect the python assertion to fail"
# TODO: Enable the check after we setup the build to run debug asserts without having
# to do a full (and slow) debug build
# (cd test && ! get_exit_code python -c "import torch; torch._C._crash_if_debug_asserts_fail(424242)")
(cd test && ! get_exit_code python -c "import torch; torch._C._crash_if_debug_asserts_fail(424242)")
elif [[ "$BUILD_ENVIRONMENT" != *-bazel-* ]]; then
# Noop when debug is disabled. Skip bazel jobs because torch isn't available there yet.
echo "We are not in debug mode: $BUILD_ENVIRONMENT. Expect the assertion to pass"
@ -277,6 +278,9 @@ test_python_shard() {
# Bare --include flag is not supported and quoting for lint ends up with flag not being interpreted correctly
# shellcheck disable=SC2086
# modify LD_LIBRARY_PATH to ensure it has the conda env.
# This set of tests has been shown to be buggy without it for the split-build
time python test/run_test.py --exclude-jit-executor --exclude-distributed-tests $INCLUDE_CLAUSE --shard "$1" "$NUM_TEST_SHARDS" --verbose $PYTHON_TEST_EXTRA_OPTION
assert_git_not_dirty
@ -315,17 +319,18 @@ test_inductor_distributed() {
python test/run_test.py -i inductor/test_aot_inductor.py -k test_replicate_on_devices --verbose
python test/run_test.py -i distributed/test_c10d_functional_native.py --verbose
python test/run_test.py -i distributed/_tensor/test_dtensor_compile.py --verbose
python test/run_test.py -i distributed/tensor/parallel/test_fsdp_2d_parallel.py --verbose
python test/run_test.py -i distributed/tensor/parallel/test_micro_pipeline_tp.py --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_comm.py --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_mlp --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_hsdp --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_2d_transformer_checkpoint_resume --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_gradient_accumulation --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_dp_state_dict_save_load --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_frozen.py --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_compute_dtype --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_mixed_precision.py -k test_reduce_dtype --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_clip_grad_norm_.py -k test_clip_grad_norm_2d --verbose
python test/run_test.py -i distributed/fsdp/test_fsdp_tp_integration.py -k test_fsdp_tp_integration --verbose
# this runs on both single-gpu and multi-gpu instance. It should be smart about skipping tests that aren't supported
@ -334,26 +339,52 @@ test_inductor_distributed() {
assert_git_not_dirty
}
test_inductor() {
python tools/dynamo/verify_dynamo.py
python test/run_test.py --inductor --include test_modules test_ops test_ops_gradients test_torch --verbose
# Do not add --inductor for the following inductor unit tests, otherwise we will fail because of nested dynamo state
python test/run_test.py --include inductor/test_torchinductor inductor/test_torchinductor_opinfo inductor/test_aot_inductor --verbose
test_inductor_shard() {
if [[ -z "$NUM_TEST_SHARDS" ]]; then
echo "NUM_TEST_SHARDS must be defined to run a Python test shard"
exit 1
fi
python tools/dynamo/verify_dynamo.py
python test/run_test.py --inductor \
--include test_modules test_ops test_ops_gradients test_torch \
--shard "$1" "$NUM_TEST_SHARDS" \
--verbose
# Do not add --inductor for the following inductor unit tests, otherwise we will fail because of nested dynamo state
python test/run_test.py \
--include inductor/test_torchinductor inductor/test_torchinductor_opinfo inductor/test_aot_inductor \
--shard "$1" "$NUM_TEST_SHARDS" \
--verbose
}
test_inductor_aoti() {
# docker build uses bdist_wheel which does not work with test_aot_inductor
# TODO: need a faster way to build
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop
CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
# We need to hipify before building again
python3 tools/amd_build/build_amd.py
fi
BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop
CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference
}
test_inductor_cpp_wrapper_abi_compatible() {
export TORCHINDUCTOR_ABI_COMPATIBLE=1
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
echo "Testing Inductor cpp wrapper mode with TORCHINDUCTOR_ABI_COMPATIBLE=1"
# cpu stack allocation causes segfault and needs more investigation
TORCHINDUCTOR_STACK_ALLOCATION=0 python test/run_test.py --include inductor/test_cpu_cpp_wrapper
PYTORCH_TESTING_DEVICE_ONLY_FOR="" python test/run_test.py --include inductor/test_cpu_cpp_wrapper
python test/run_test.py --include inductor/test_cuda_cpp_wrapper
TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/timm_models.py --device cuda --accuracy --amp \
--training --inductor --disable-cudagraphs --only vit_base_patch16_224 \
--output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_training.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/inductor_timm_training.csv"
}
# "Global" flags for inductor benchmarking controlled by TEST_CONFIG
@ -364,7 +395,22 @@ test_inductor_cpp_wrapper_abi_compatible() {
# .github/workflows/inductor-perf-test-nightly.yml
DYNAMO_BENCHMARK_FLAGS=()
if [[ "${TEST_CONFIG}" == *dynamo_eager* ]]; then
pr_time_benchmarks() {
pip_install --user "fbscribelogger"
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
PYTHONPATH=$(pwd)/benchmarks/dynamo/pr_time_benchmarks source benchmarks/dynamo/pr_time_benchmarks/benchmark_runner.sh "$TEST_REPORTS_DIR/pr_time_benchmarks_after.txt" "benchmarks/dynamo/pr_time_benchmarks/benchmarks"
echo "benchmark results on current PR: "
cat "$TEST_REPORTS_DIR/pr_time_benchmarks_after.txt"
}
if [[ "${TEST_CONFIG}" == *pr_time_benchmarks* ]]; then
pr_time_benchmarks
exit 0
elif [[ "${TEST_CONFIG}" == *dynamo_eager* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--backend eager)
elif [[ "${TEST_CONFIG}" == *aot_eager* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--backend aot_eager)
@ -378,7 +424,7 @@ if [[ "${TEST_CONFIG}" == *dynamic* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--dynamic-shapes --dynamic-batch-only)
fi
if [[ "${TEST_CONFIG}" == *cpu_inductor* ]]; then
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--device cpu)
else
DYNAMO_BENCHMARK_FLAGS+=(--device cuda)
@ -402,6 +448,18 @@ test_perf_for_dashboard() {
# TODO: All the accuracy tests can be skipped once the CI accuracy checking is stable enough
local targets=(accuracy performance)
local device=cuda
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
if [[ "${TEST_CONFIG}" == *cpu_x86* ]]; then
device=cpu_x86
elif [[ "${TEST_CONFIG}" == *cpu_aarch64* ]]; then
device=cpu_aarch64
fi
test_inductor_set_cpu_affinity
elif [[ "${TEST_CONFIG}" == *cuda_a10g* ]]; then
device=cuda_a10g
fi
for mode in "${modes[@]}"; do
if [[ "$mode" == "inference" ]]; then
dtype=bfloat16
@ -417,56 +475,62 @@ test_perf_for_dashboard() {
fi
if [[ "$DASHBOARD_TAG" == *default-true* ]]; then
python "benchmarks/dynamo/$suite.py" \
$TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_no_cudagraphs_${suite}_${dtype}_${mode}_cuda_${target}.csv"
--output "$TEST_REPORTS_DIR/${backend}_no_cudagraphs_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *cudagraphs-true* ]]; then
python "benchmarks/dynamo/$suite.py" \
$TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" \
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_${suite}_${dtype}_${mode}_cuda_${target}.csv"
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *dynamic-true* ]]; then
python "benchmarks/dynamo/$suite.py" \
$TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --dynamic-shapes \
--dynamic-batch-only "$@" \
--output "$TEST_REPORTS_DIR/${backend}_dynamic_${suite}_${dtype}_${mode}_cuda_${target}.csv"
--output "$TEST_REPORTS_DIR/${backend}_dynamic_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *cppwrapper-true* ]] && [[ "$mode" == "inference" ]]; then
TORCHINDUCTOR_CPP_WRAPPER=1 python "benchmarks/dynamo/$suite.py" \
TORCHINDUCTOR_CPP_WRAPPER=1 $TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_cpp_wrapper_${suite}_${dtype}_${mode}_cuda_${target}.csv"
--output "$TEST_REPORTS_DIR/${backend}_cpp_wrapper_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *freezing_cudagraphs-true* ]] && [[ "$mode" == "inference" ]]; then
python "benchmarks/dynamo/$suite.py" \
$TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" --freezing \
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_${suite}_${dtype}_${mode}_cuda_${target}.csv"
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *freeze_autotune_cudagraphs-true* ]] && [[ "$mode" == "inference" ]]; then
TORCHINDUCTOR_MAX_AUTOTUNE=1 python "benchmarks/dynamo/$suite.py" \
TORCHINDUCTOR_MAX_AUTOTUNE=1 $TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" --freezing \
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_autotune_${suite}_${dtype}_${mode}_cuda_${target}.csv"
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_autotune_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *aotinductor-true* ]] && [[ "$mode" == "inference" ]]; then
TORCHINDUCTOR_ABI_COMPATIBLE=1 python "benchmarks/dynamo/$suite.py" \
if [[ "$target" == "accuracy" ]]; then
# Also collect Export pass rate and display as a separate row
$TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --export --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_export_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
TORCHINDUCTOR_ABI_COMPATIBLE=1 $TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --export-aot-inductor --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_aot_inductor_${suite}_${dtype}_${mode}_cuda_${target}.csv"
--output "$TEST_REPORTS_DIR/${backend}_aot_inductor_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *maxautotune-true* ]]; then
TORCHINDUCTOR_MAX_AUTOTUNE=1 python "benchmarks/dynamo/$suite.py" \
TORCHINDUCTOR_MAX_AUTOTUNE=1 $TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" \
--output "$TEST_REPORTS_DIR/${backend}_max_autotune_${suite}_${dtype}_${mode}_cuda_${target}.csv"
--output "$TEST_REPORTS_DIR/${backend}_max_autotune_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *cudagraphs_low_precision-true* ]] && [[ "$mode" == "inference" ]]; then
# TODO: This has a new dtype called quant and the benchmarks script needs to be updated to support this.
# The tentative command is as follows. It doesn't work now, but it's ok because we only need mock data
# to fill the dashboard.
python "benchmarks/dynamo/$suite.py" \
$TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --quant --backend "$backend" "$@" \
--output "$TEST_REPORTS_DIR/${backend}_cudagraphs_low_precision_${suite}_quant_${mode}_cuda_${target}.csv" || true
--output "$TEST_REPORTS_DIR/${backend}_cudagraphs_low_precision_${suite}_quant_${mode}_${device}_${target}.csv" || true
# Copy cudagraph results as mock data, easiest choice?
cp "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_${suite}_${dtype}_${mode}_cuda_${target}.csv" \
"$TEST_REPORTS_DIR/${backend}_cudagraphs_low_precision_${suite}_quant_${mode}_cuda_${target}.csv"
cp "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_${suite}_${dtype}_${mode}_${device}_${target}.csv" \
"$TEST_REPORTS_DIR/${backend}_cudagraphs_low_precision_${suite}_quant_${mode}_${device}_${target}.csv"
fi
done
done
@ -503,11 +567,19 @@ test_single_dynamo_benchmark() {
test_perf_for_dashboard "$suite" \
"${DYNAMO_BENCHMARK_FLAGS[@]}" "$@" "${partition_flags[@]}"
else
if [[ "${TEST_CONFIG}" == *aot_inductor* ]]; then
if [[ "${TEST_CONFIG}" == *aot_inductor* && "${TEST_CONFIG}" != *cpu_aot_inductor* ]]; then
# Test AOTInductor with the ABI-compatible mode on CI
# This can be removed once the ABI-compatible mode becomes default.
# For CPU device, we prefer non ABI-compatible mode on CI when testing AOTInductor.
export TORCHINDUCTOR_ABI_COMPATIBLE=1
fi
if [[ "${TEST_CONFIG}" == *_avx2* ]]; then
TEST_CONFIG=${TEST_CONFIG//_avx2/}
fi
if [[ "${TEST_CONFIG}" == *_avx512* ]]; then
TEST_CONFIG=${TEST_CONFIG//_avx512/}
fi
python "benchmarks/dynamo/$suite.py" \
--ci --accuracy --timing --explain \
"${DYNAMO_BENCHMARK_FLAGS[@]}" \
@ -524,9 +596,17 @@ test_single_dynamo_benchmark() {
test_inductor_micro_benchmark() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
test_inductor_set_cpu_affinity
fi
python benchmarks/gpt_fast/benchmark.py --output "${TEST_REPORTS_DIR}/gpt_fast_benchmark.csv"
}
test_inductor_halide() {
python test/run_test.py --include inductor/test_halide.py --verbose
assert_git_not_dirty
}
test_dynamo_benchmark() {
# Usage: test_dynamo_benchmark huggingface 0
TEST_REPORTS_DIR=$(pwd)/test/test-reports
@ -541,8 +621,16 @@ test_dynamo_benchmark() {
elif [[ "${TEST_CONFIG}" == *perf* ]]; then
test_single_dynamo_benchmark "dashboard" "$suite" "$shard_id" "$@"
else
if [[ "${TEST_CONFIG}" == *cpu_inductor* ]]; then
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --float32 "$@"
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
local dt="float32"
if [[ "${TEST_CONFIG}" == *amp* ]]; then
dt="amp"
fi
if [[ "${TEST_CONFIG}" == *freezing* ]]; then
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --"$dt" --freezing "$@"
else
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --"$dt" "$@"
fi
elif [[ "${TEST_CONFIG}" == *aot_inductor* ]]; then
test_single_dynamo_benchmark "inference" "$suite" "$shard_id" --inference --bfloat16 "$@"
else
@ -556,12 +644,16 @@ test_inductor_torchbench_smoketest_perf() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
# smoke test the cpp_wrapper mode
TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy --bfloat16 \
--inference --inductor --only hf_T5 --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_smoketest.csv"
# Test some models in the cpp wrapper mode
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy \
--bfloat16 --inference --inductor --only hf_T5 --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy \
--bfloat16 --inference --inductor --only llama --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python benchmarks/dynamo/torchbench.py --device cuda --accuracy \
--bfloat16 --inference --inductor --only moco --output "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_smoketest.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_inference.csv"
--actual "$TEST_REPORTS_DIR/inductor_cpp_wrapper_inference.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/inductor_torchbench_inference.csv"
python benchmarks/dynamo/torchbench.py --device cuda --performance --backend inductor --float16 --training \
--batch-size-file "$(realpath benchmarks/dynamo/torchbench_models_list.txt)" --only hf_Bert \
@ -588,50 +680,88 @@ test_inductor_torchbench_smoketest_perf() {
"$TEST_REPORTS_DIR/inductor_training_smoketest_$test.csv" \
--expected benchmarks/dynamo/expected_ci_perf_inductor_torchbench.csv
done
# Perform some "warm-start" runs for a few huggingface models.
for test in AlbertForQuestionAnswering AllenaiLongformerBase DistilBertForMaskedLM DistillGPT2 GoogleFnet YituTechConvBert; do
python benchmarks/dynamo/huggingface.py --accuracy --training --amp --inductor --device cuda --warm-start-latency \
--only $test --output "$TEST_REPORTS_DIR/inductor_warm_start_smoketest_$test.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/inductor_warm_start_smoketest_$test.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/inductor_huggingface_training.csv"
done
}
test_inductor_get_core_number() {
if [[ "${TEST_CONFIG}" == *aarch64* ]]; then
echo "$(($(lscpu | grep 'Cluster(s):' | awk '{print $2}') * $(lscpu | grep 'Core(s) per cluster:' | awk '{print $4}')))"
else
echo "$(($(lscpu | grep 'Socket(s):' | awk '{print $2}') * $(lscpu | grep 'Core(s) per socket:' | awk '{print $4}')))"
fi
}
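The core-count arithmetic in `test_inductor_get_core_number` (sockets multiplied by cores per socket) can be checked without a real machine. The following is a minimal sketch, not the CI implementation: `count_physical_cores` and the sample `lscpu` text are hypothetical, and the parser reads the value after the colon instead of relying on the whitespace field positions (`$2`, `$4`) used above.

```shell
# Hypothetical helper (not part of test.sh): read lscpu-style text on stdin
# and print sockets * cores-per-socket.
count_physical_cores() {
  awk -F: '
    /^Socket\(s\)/          { gsub(/[ \t]/, "", $2); sockets = $2 }
    /^Core\(s\) per socket/ { gsub(/[ \t]/, "", $2); per_socket = $2 }
    END { print sockets * per_socket }
  '
}

# Made-up lscpu output for illustration only.
sample_lscpu='Architecture:        x86_64
CPU(s):              32
Thread(s) per core:  2
Core(s) per socket:  8
Socket(s):           2'

cores=$(printf '%s\n' "$sample_lscpu" | count_physical_cores)
end_core=$((cores - 1))
echo "physical cores: $cores"
echo "taskset range: 0-$end_core"
```

With the sample text this yields 16 physical cores, so the affinity range passed to `taskset -c` would be `0-15`, leaving hyperthread siblings unused.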
test_inductor_set_cpu_affinity(){
#set jemalloc
JEMALLOC_LIB="$(find /usr/lib -name libjemalloc.so.2)"
export LD_PRELOAD="$JEMALLOC_LIB":"$LD_PRELOAD"
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
if [[ "${TEST_CONFIG}" != *aarch64* ]]; then
# Use Intel OpenMP for x86
IOMP_LIB="$(dirname "$(which python)")/../lib/libiomp5.so"
export LD_PRELOAD="$IOMP_LIB":"$LD_PRELOAD"
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
fi
cores=$(test_inductor_get_core_number)
export OMP_NUM_THREADS=$cores
end_core=$((cores-1))
export TASKSET="taskset -c 0-$end_core"
}
test_inductor_torchbench_cpu_smoketest_perf(){
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
#set jemalloc
JEMALLOC_LIB="/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"
IOMP_LIB="$(dirname "$(which python)")/../lib/libiomp5.so"
export LD_PRELOAD="$JEMALLOC_LIB":"$IOMP_LIB":"$LD_PRELOAD"
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
CORES=$(lscpu | grep Core | awk '{print $4}')
export OMP_NUM_THREADS=$CORES
end_core=$(( CORES-1 ))
test_inductor_set_cpu_affinity
MODELS_SPEEDUP_TARGET=benchmarks/dynamo/expected_ci_speedup_inductor_torchbench_cpu.csv
grep -v '^ *#' < "$MODELS_SPEEDUP_TARGET" | while IFS=',' read -r -a model_cfg
do
local model_name=${model_cfg[0]}
local data_type=${model_cfg[1]}
local speedup_target=${model_cfg[4]}
if [[ ${model_cfg[3]} == "cpp" ]]; then
local data_type=${model_cfg[2]}
local speedup_target=${model_cfg[5]}
local backend=${model_cfg[1]}
if [[ ${model_cfg[4]} == "cpp" ]]; then
export TORCHINDUCTOR_CPP_WRAPPER=1
else
unset TORCHINDUCTOR_CPP_WRAPPER
fi
local output_name="$TEST_REPORTS_DIR/inductor_inference_${model_cfg[0]}_${model_cfg[1]}_${model_cfg[2]}_${model_cfg[3]}_cpu_smoketest.csv"
if [[ ${model_cfg[2]} == "dynamic" ]]; then
taskset -c 0-"$end_core" python benchmarks/dynamo/torchbench.py \
if [[ ${model_cfg[3]} == "dynamic" ]]; then
$TASKSET python benchmarks/dynamo/torchbench.py \
--inference --performance --"$data_type" -dcpu -n50 --only "$model_name" --dynamic-shapes \
--dynamic-batch-only --freezing --timeout 9000 --backend=inductor --output "$output_name"
--dynamic-batch-only --freezing --timeout 9000 --"$backend" --output "$output_name"
else
taskset -c 0-"$end_core" python benchmarks/dynamo/torchbench.py \
$TASKSET python benchmarks/dynamo/torchbench.py \
--inference --performance --"$data_type" -dcpu -n50 --only "$model_name" \
--freezing --timeout 9000 --backend=inductor --output "$output_name"
--freezing --timeout 9000 --"$backend" --output "$output_name"
fi
cat "$output_name"
# The threshold value needs to be actively maintained to make this check useful.
python benchmarks/dynamo/check_perf_csv.py -f "$output_name" -t "$speedup_target"
done
# Add a few ABI-compatible accuracy tests for CPU. These can be removed once we turn on ABI-compatible as default.
TORCHINDUCTOR_ABI_COMPATIBLE=1 python benchmarks/dynamo/timm_models.py --device cpu --accuracy \
--bfloat16 --inference --export-aot-inductor --disable-cudagraphs --only adv_inception_v3 \
--output "$TEST_REPORTS_DIR/aot_inductor_smoke_test.csv"
TORCHINDUCTOR_ABI_COMPATIBLE=1 python benchmarks/dynamo/timm_models.py --device cpu --accuracy \
--bfloat16 --inference --export-aot-inductor --disable-cudagraphs --only beit_base_patch16_224 \
--output "$TEST_REPORTS_DIR/aot_inductor_smoke_test.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/aot_inductor_smoke_test.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/aot_inductor_timm_inference.csv"
}
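The updated loop in `test_inductor_torchbench_cpu_smoketest_perf` reads each non-comment CSV row into six positional fields: model name, backend, data type, shape mode, wrapper, and speedup target. A minimal sketch of that parsing step follows. The function name, the column labels, and the sample rows are assumptions for illustration; the real header of `expected_ci_speedup_inductor_torchbench_cpu.csv` is not shown in this diff.

```shell
# Hypothetical helper: parse speedup-target rows from stdin, skipping
# comment lines the same way the CI loop does (grep -v '^ *#').
parse_speedup_targets() {
  grep -v '^ *#' | while IFS=',' read -r name backend dtype shape wrapper target; do
    echo "$name backend=$backend dtype=$dtype shape=$shape wrapper=$wrapper target=$target"
  done
}

# Made-up rows; only the column order mirrors the indices used above
# (model_cfg[0] .. model_cfg[5]).
printf '%s\n' \
  '# name,backend,data_type,shape,wrapper,speedup' \
  'model_a,inductor,float32,static,default,1.0' \
  'model_b,inductor,amp,dynamic,cpp,1.5' | parse_speedup_targets
```

Named fields make the column shift introduced by the new `backend` column easier to audit than bare array indices, at the cost of diverging from the `read -r -a` form the script uses.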
test_torchbench_gcp_smoketest(){
@ -671,7 +801,6 @@ test_aten() {
${SUDO} ln -sf "$TORCH_LIB_DIR"/libmkldnn* "$TEST_BASE_DIR"
${SUDO} ln -sf "$TORCH_LIB_DIR"/libnccl* "$TEST_BASE_DIR"
${SUDO} ln -sf "$TORCH_LIB_DIR"/libtorch* "$TEST_BASE_DIR"
${SUDO} ln -sf "$TORCH_LIB_DIR"/libtbb* "$TEST_BASE_DIR"
ls "$TEST_BASE_DIR"
aten/tools/run_tests.sh "$TEST_BASE_DIR"
@ -696,21 +825,6 @@ test_without_numpy() {
popd
}
# pytorch extensions require including torch/extension.h which includes all.h
# which includes utils.h which includes Parallel.h.
# So you can call for instance parallel_for() from your extension,
# but the compilation will fail because of Parallel.h has only declarations
# and definitions are conditionally included Parallel.h(see last lines of Parallel.h).
# I tried to solve it #39612 and #39881 by including Config.h into Parallel.h
# But if Pytorch is built with TBB it provides Config.h
# that has AT_PARALLEL_NATIVE_TBB=1(see #3961 or #39881) and it means that if you include
# torch/extension.h which transitively includes Parallel.h
# which transitively includes tbb.h which is not available!
if [[ "${BUILD_ENVIRONMENT}" == *tbb* ]]; then
sudo mkdir -p /usr/include/tbb
sudo cp -r "$PWD"/third_party/tbb/include/tbb/* /usr/include/tbb
fi
test_libtorch() {
local SHARD="$1"
@ -724,7 +838,6 @@ test_libtorch() {
ln -sf "$TORCH_LIB_DIR"/libc10* "$TORCH_BIN_DIR"
ln -sf "$TORCH_LIB_DIR"/libshm* "$TORCH_BIN_DIR"
ln -sf "$TORCH_LIB_DIR"/libtorch* "$TORCH_BIN_DIR"
ln -sf "$TORCH_LIB_DIR"/libtbb* "$TORCH_BIN_DIR"
ln -sf "$TORCH_LIB_DIR"/libnvfuser* "$TORCH_BIN_DIR"
export CPP_TESTS_DIR="${TORCH_BIN_DIR}"
@ -861,7 +974,6 @@ test_rpc() {
# test reporting process to function as expected.
ln -sf "$TORCH_LIB_DIR"/libtorch* "$TORCH_BIN_DIR"
ln -sf "$TORCH_LIB_DIR"/libc10* "$TORCH_BIN_DIR"
ln -sf "$TORCH_LIB_DIR"/libtbb* "$TORCH_BIN_DIR"
CPP_TESTS_DIR="${TORCH_BIN_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_cpp_rpc
}
@ -963,11 +1075,113 @@ test_xla() {
assert_git_not_dirty
}
function check_public_api_test_fails {
test_name=$1
invalid_item_name=$2
invalid_item_desc=$3
echo "Running public API test '${test_name}'..."
test_output=$(python test/test_public_bindings.py -k "${test_name}" 2>&1) && ret=$? || ret=$?
# Ensure test fails correctly.
if [ "$ret" -eq 0 ]; then
cat << EOF
Expected the public API test '${test_name}' to fail after introducing
${invalid_item_desc}, but it succeeded! Check test/test_public_bindings.py
for any changes that may have broken the test.
EOF
return 1
fi
# Ensure invalid item is in the test output.
echo "${test_output}" | grep -q "${invalid_item_name}" && ret=$? || ret=$?
if [ $ret -ne 0 ]; then
cat << EOF
Expected the public API test '${test_name}' to identify ${invalid_item_desc}, but
it didn't! It's possible the test may not have run. Check test/test_public_bindings.py
for any changes that may have broken the test.
EOF
return 1
fi
echo "Success! '${test_name}' identified ${invalid_item_desc} ${invalid_item_name}."
return 0
}
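`check_public_api_test_fails` relies on the `cmd && ret=$? || ret=$?` idiom to record an exit status without aborting the script. The sketch below illustrates the idiom in isolation, assuming the surrounding script runs under `set -e` (not shown in this diff); `always_fails` is a hypothetical stand-in for the test command.

```shell
set -e

# Hypothetical command that exits with a non-zero status.
always_fails() { return 3; }

# A failing command inside an && / || list does not trigger `set -e`,
# so the status can be captured in either branch.
always_fails && ret=$? || ret=$?
echo "exit status: $ret"

# The same line also handles success: the && branch stores 0.
true && ok=$? || ok=$?
echo "exit status: $ok"

set +e
```

The first capture stores 3 and the second stores 0, which is why the function can then branch on `$ret` to decide whether the expected failure actually happened.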
# Do NOT run this test before any other tests, like test_python_shard, etc.
# Because this function uninstalls the torch built from branch and installs
# the torch built on its base commit.
test_forward_backward_compatibility() {
set -x
# First, validate public API tests in the torch built from branch.
# Step 1. Make sure the public API test "test_correct_module_names" fails when a new file
# introduces an invalid public API function.
new_filename=$(mktemp XXXXXXXX.py -p "${TORCH_INSTALL_DIR}")
BAD_PUBLIC_FUNC=$(
cat << 'EOF'
def new_public_func():
pass
# valid public API functions have __module__ set correctly
new_public_func.__module__ = None
EOF
)
echo "${BAD_PUBLIC_FUNC}" >> "${new_filename}"
invalid_api="torch.$(basename -s '.py' "${new_filename}").new_public_func"
echo "Created an invalid public API function ${invalid_api}..."
check_public_api_test_fails \
"test_correct_module_names" \
"${invalid_api}" \
"an invalid public API function" && ret=$? || ret=$?
rm -v "${new_filename}"
if [ "$ret" -ne 0 ]; then
exit 1
fi
# Step 2. Make sure that the public API test "test_correct_module_names" fails when an existing
# file is modified to introduce an invalid public API function.
EXISTING_FILEPATH="${TORCH_INSTALL_DIR}/nn/parameter.py"
cp -v "${EXISTING_FILEPATH}" "${EXISTING_FILEPATH}.orig"
echo "${BAD_PUBLIC_FUNC}" >> "${EXISTING_FILEPATH}"
invalid_api="torch.nn.parameter.new_public_func"
echo "Appended an invalid public API function to existing file ${EXISTING_FILEPATH}..."
check_public_api_test_fails \
"test_correct_module_names" \
"${invalid_api}" \
"an invalid public API function" && ret=$? || ret=$?
mv -v "${EXISTING_FILEPATH}.orig" "${EXISTING_FILEPATH}"
if [ "$ret" -ne 0 ]; then
exit 1
fi
# Step 3. Make sure that the public API test "test_modules_can_be_imported" fails when a module
# cannot be imported.
new_module_dir=$(mktemp XXXXXXXX -d -p "${TORCH_INSTALL_DIR}")
echo "invalid syntax garbage" > "${new_module_dir}/__init__.py"
invalid_module_name="torch.$(basename "${new_module_dir}")"
check_public_api_test_fails \
"test_modules_can_be_imported" \
"${invalid_module_name}" \
"a non-importable module" && ret=$? || ret=$?
rm -rv "${new_module_dir}"
if [ "$ret" -ne 0 ]; then
exit 1
fi
# Next, build torch from the merge base.
REPO_DIR=$(pwd)
if [[ "${BASE_SHA}" == "${SHA1}" ]]; then
echo "On trunk, we should compare schemas with torch built from the parent commit"
@ -1141,15 +1355,21 @@ test_executorch() {
pushd /executorch
# NB: We need to build ExecuTorch runner here and not inside the Docker image
# because it depends on PyTorch
export PYTHON_EXECUTABLE=python
export EXECUTORCH_BUILD_PYBIND=ON
export CMAKE_ARGS="-DEXECUTORCH_BUILD_XNNPACK=ON -DEXECUTORCH_BUILD_KERNELS_QUANTIZED=ON"
# NB: We need to rebuild ExecuTorch runner here because it depends on PyTorch
# from the PR
# shellcheck disable=SC1091
source .ci/scripts/utils.sh
build_executorch_runner "cmake"
source .ci/scripts/setup-linux.sh cmake
echo "Run ExecuTorch unit tests"
pytest -v -n auto
# shellcheck disable=SC1091
LLVM_PROFDATA=llvm-profdata-12 LLVM_COV=llvm-cov-12 bash test/run_oss_cpp_tests.sh
echo "Run ExecuTorch regression tests for some models"
# NB: This is a sample model, more can be added here
export PYTHON_EXECUTABLE=python
# TODO(huydhn): Add more coverage here using ExecuTorch's gather models script
# shellcheck disable=SC1091
source .ci/scripts/test.sh mv3 cmake xnnpack-quantization-delegation ''
@ -1187,7 +1407,7 @@ if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-baze
(cd test && python -c "import torch; print(torch.__config__.show())")
(cd test && python -c "import torch; print(torch.__config__.parallel_info())")
fi
if [[ "$BUILD_ENVIRONMENT" == *aarch64* ]]; then
if [[ "${BUILD_ENVIRONMENT}" == *aarch64* && "${TEST_CONFIG}" != *perf_cpu_aarch64* ]]; then
test_linux_aarch64
elif [[ "${TEST_CONFIG}" == *backward* ]]; then
test_forward_backward_compatibility
@ -1209,11 +1429,10 @@ elif [[ "$TEST_CONFIG" == distributed ]]; then
if [[ "${SHARD_NUMBER}" == 1 ]]; then
test_rpc
fi
elif [[ "$TEST_CONFIG" == deploy ]]; then
checkout_install_torchdeploy
test_torch_deploy
elif [[ "${TEST_CONFIG}" == *inductor_distributed* ]]; then
test_inductor_distributed
elif [[ "${TEST_CONFIG}" == *inductor-halide* ]]; then
test_inductor_halide
elif [[ "${TEST_CONFIG}" == *inductor-micro-benchmark* ]]; then
test_inductor_micro_benchmark
elif [[ "${TEST_CONFIG}" == *huggingface* ]]; then
@ -1225,13 +1444,14 @@ elif [[ "${TEST_CONFIG}" == *timm* ]]; then
id=$((SHARD_NUMBER-1))
test_dynamo_benchmark timm_models "$id"
elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
if [[ "${TEST_CONFIG}" == *cpu_inductor* ]]; then
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
install_torchaudio cpu
else
install_torchaudio cuda
fi
install_torchtext
install_torchvision
TORCH_CUDA_ARCH_LIST="8.0;8.6" pip_install git+https://github.com/pytorch/ao.git
id=$((SHARD_NUMBER-1))
# https://github.com/opencv/opencv-python/issues/885
pip_install opencv-python==4.8.0.74
@ -1239,9 +1459,9 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
checkout_install_torchbench hf_Bert hf_Albert nanogpt timm_vision_transformer
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_smoketest_perf
elif [[ "${TEST_CONFIG}" == *inductor_torchbench_cpu_smoketest_perf* ]]; then
checkout_install_torchbench timm_vision_transformer phlippe_densenet basic_gnn_gcn \
checkout_install_torchbench timm_vision_transformer phlippe_densenet basic_gnn_edgecnn \
llama_v2_7b_16h resnet50 timm_efficientnet mobilenet_v3_large timm_resnest \
shufflenet_v2_x1_0 hf_GPT2
functorch_maml_omniglot yolov3 mobilenet_v2 resnext50_32x4d densenet121 mnasnet1_0
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_cpu_smoketest_perf
elif [[ "${TEST_CONFIG}" == *torchbench_gcp_smoketest* ]]; then
checkout_install_torchbench
@ -1250,7 +1470,7 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
checkout_install_torchbench
# Do this after checkout_install_torchbench to ensure we clobber any
# nightlies that torchbench may pull in
if [[ "${TEST_CONFIG}" != *cpu_inductor* ]]; then
if [[ "${TEST_CONFIG}" != *cpu* ]]; then
install_torchrec_and_fbgemm
fi
PYTHONPATH=$(pwd)/torchbench test_dynamo_benchmark torchbench "$id"
@ -1258,17 +1478,24 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
elif [[ "${TEST_CONFIG}" == *inductor_cpp_wrapper_abi_compatible* ]]; then
install_torchvision
test_inductor_cpp_wrapper_abi_compatible
elif [[ "${TEST_CONFIG}" == *inductor* && "${SHARD_NUMBER}" == 1 ]]; then
elif [[ "${TEST_CONFIG}" == *inductor* ]]; then
install_torchvision
test_inductor
test_inductor_distributed
elif [[ "${TEST_CONFIG}" == *dynamo* && "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1 ]]; then
install_torchvision
test_dynamo_shard 1
test_aten
elif [[ "${TEST_CONFIG}" == *dynamo* && $SHARD_NUMBER -gt 1 && $NUM_TEST_SHARDS -gt 1 ]]; then
test_inductor_shard "${SHARD_NUMBER}"
if [[ "${SHARD_NUMBER}" == 1 ]]; then
if [[ "${BUILD_ENVIRONMENT}" != linux-jammy-py3.9-gcc11-build ]]; then
test_inductor_distributed
fi
fi
elif [[ "${TEST_CONFIG}" == *dynamo* ]]; then
install_torchvision
test_dynamo_shard "${SHARD_NUMBER}"
if [[ "${SHARD_NUMBER}" == 1 ]]; then
test_aten
fi
elif [[ "${BUILD_ENVIRONMENT}" == *rocm* && -n "$TESTS_TO_INCLUDE" ]]; then
install_torchvision
test_python_shard "$SHARD_NUMBER"
test_aten
elif [[ "${SHARD_NUMBER}" == 1 && $NUM_TEST_SHARDS -gt 1 ]]; then
test_without_numpy
install_torchvision
@ -1298,10 +1525,6 @@ elif [[ "${BUILD_ENVIRONMENT}" == *-mobile-lightweight-dispatch* ]]; then
test_libtorch
elif [[ "${TEST_CONFIG}" = docs_test ]]; then
test_docs_test
elif [[ "${BUILD_ENVIRONMENT}" == *rocm* && -n "$TESTS_TO_INCLUDE" ]]; then
install_torchvision
test_python
test_aten
elif [[ "${BUILD_ENVIRONMENT}" == *xpu* ]]; then
install_torchvision
test_python

View File

@ -24,6 +24,12 @@ call %INSTALLER_DIR%\install_sccache.bat
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
if "%USE_XPU%"=="1" (
:: Install xpu support packages
call %INSTALLER_DIR%\install_xpu.bat
if errorlevel 1 exit /b 1
)
:: Miniconda has been installed as part of the Windows AMI with all the dependencies.
:: We just need to activate it here
call %INSTALLER_DIR%\activate_miniconda3.bat
@ -43,6 +49,16 @@ if "%VC_VERSION%" == "" (
)
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
if "%USE_XPU%"=="1" (
:: Activate xpu environment - VS env is required for xpu
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
if errorlevel 1 exit /b 1
:: Reduce build time. Only have MTL self-hosted runner now
SET TORCH_XPU_ARCH_LIST=xe-lpg
SET USE_KINETO=0
)
@echo on
popd
@ -65,13 +81,6 @@ set CUDA_PATH_V%VERSION_SUFFIX%=%CUDA_PATH%
set CUDNN_LIB_DIR=%CUDA_PATH%\lib\x64
set CUDA_TOOLKIT_ROOT_DIR=%CUDA_PATH%
set CUDNN_ROOT_DIR=%CUDA_PATH%
set NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt
set PATH=%CUDA_PATH%\bin;%CUDA_PATH%\libnvvp;%PATH%
set CUDNN_LIB_DIR=%CUDA_PATH%\lib\x64
set CUDA_TOOLKIT_ROOT_DIR=%CUDA_PATH%
set CUDNN_ROOT_DIR=%CUDA_PATH%
set NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt
set PATH=%CUDA_PATH%\bin;%CUDA_PATH%\libnvvp;%PATH%
:cuda_build_end

View File

@ -0,0 +1,91 @@
@echo on
REM Description: Install Intel Support Packages on Windows
REM BKM reference: https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpu/2-5.html
set XPU_INSTALL_MODE=%~1
if "%XPU_INSTALL_MODE%"=="" goto xpu_bundle_install_start
if "%XPU_INSTALL_MODE%"=="bundle" goto xpu_bundle_install_start
if "%XPU_INSTALL_MODE%"=="driver" goto xpu_driver_install_start
if "%XPU_INSTALL_MODE%"=="all" goto xpu_driver_install_start
:arg_error
echo Illegal XPU installation mode. The value can be "bundle"/"driver"/"all"
echo If the value is left empty, the default "bundle" mode is used
exit /b 1
:xpu_driver_install_start
:: TODO Need more testing for driver installation
set XPU_DRIVER_LINK=https://downloadmirror.intel.com/830975/gfx_win_101.5972.exe
curl -o xpu_driver.exe --retry 3 --retry-all-errors -k %XPU_DRIVER_LINK%
echo "XPU Driver installing..."
start /wait "Intel XPU Driver Installer" "xpu_driver.exe"
if errorlevel 1 exit /b 1
del xpu_driver.exe
if "%XPU_INSTALL_MODE%"=="driver" goto xpu_install_end
:xpu_bundle_install_start
set XPU_BUNDLE_PARENT_DIR=C:\Program Files (x86)\Intel\oneAPI
set XPU_BUNDLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/9d1a91e2-e8b8-40a5-8c7f-5db768a6a60c/w_intel-for-pytorch-gpu-dev_p_0.5.3.37_offline.exe
set XPU_PTI_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/9d1a91e2-e8b8-40a5-8c7f-5db768a6a60c/w_intel-pti-dev_p_0.9.0.37_offline.exe
set XPU_BUNDLE_VERSION=0.5.3+31
set XPU_PTI_VERSION=0.9.0+36
set XPU_BUNDLE_PRODUCT_NAME=intel.oneapi.win.intel-for-pytorch-gpu-dev.product
set XPU_PTI_PRODUCT_NAME=intel.oneapi.win.intel-pti-dev.product
set XPU_BUNDLE_INSTALLED=0
set XPU_PTI_INSTALLED=0
set XPU_BUNDLE_UNINSTALL=0
set XPU_PTI_UNINSTALL=0
:: Check if XPU bundle is target version or already installed
if exist "%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" goto xpu_bundle_ver_check
goto xpu_bundle_install
:xpu_bundle_ver_check
"%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" --list-products > xpu_bundle_installed_ver.log
for /f "tokens=1,2" %%a in (xpu_bundle_installed_ver.log) do (
if "%%a"=="%XPU_BUNDLE_PRODUCT_NAME%" (
echo %%a Installed Version: %%b
set XPU_BUNDLE_INSTALLED=1
if not "%XPU_BUNDLE_VERSION%"=="%%b" (
start /wait "Installer Title" "%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" --action=remove --eula=accept --silent --product-id %XPU_BUNDLE_PRODUCT_NAME% --product-ver %%b --log-dir uninstall_bundle
set XPU_BUNDLE_UNINSTALL=1
)
)
if "%%a"=="%XPU_PTI_PRODUCT_NAME%" (
echo %%a Installed Version: %%b
set XPU_PTI_INSTALLED=1
if not "%XPU_PTI_VERSION%"=="%%b" (
start /wait "Installer Title" "%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" --action=remove --eula=accept --silent --product-id %XPU_PTI_PRODUCT_NAME% --product-ver %%b --log-dir uninstall_bundle
set XPU_PTI_UNINSTALL=1
)
)
)
if errorlevel 1 exit /b 1
if exist xpu_bundle_installed_ver.log del xpu_bundle_installed_ver.log
if "%XPU_BUNDLE_INSTALLED%"=="0" goto xpu_bundle_install
if "%XPU_BUNDLE_UNINSTALL%"=="1" goto xpu_bundle_install
if "%XPU_PTI_INSTALLED%"=="0" goto xpu_pti_install
if "%XPU_PTI_UNINSTALL%"=="1" goto xpu_pti_install
goto xpu_install_end
:xpu_bundle_install
curl -o xpu_bundle.exe --retry 3 --retry-all-errors -k %XPU_BUNDLE_URL%
echo "XPU Bundle installing..."
start /wait "Intel Pytorch Bundle Installer" "xpu_bundle.exe" --action=install --eula=accept --silent --log-dir install_bundle
if errorlevel 1 exit /b 1
del xpu_bundle.exe
:xpu_pti_install
curl -o xpu_pti.exe --retry 3 --retry-all-errors -k %XPU_PTI_URL%
echo "XPU PTI installing..."
start /wait "Intel PTI Installer" "xpu_pti.exe" --action=install --eula=accept --silent --log-dir install_bundle
if errorlevel 1 exit /b 1
del xpu_pti.exe
:xpu_install_end

View File

@ -4,6 +4,7 @@ import os
import subprocess
import sys
COMMON_TESTS = [
(
"Checking that torch is available",

View File

@ -40,7 +40,6 @@ set CUDA_PATH_V%VERSION_SUFFIX%=%CUDA_PATH%
set CUDNN_LIB_DIR=%CUDA_PATH%\lib\x64
set CUDA_TOOLKIT_ROOT_DIR=%CUDA_PATH%
set CUDNN_ROOT_DIR=%CUDA_PATH%
set NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt
set PATH=%CUDA_PATH%\bin;%CUDA_PATH%\libnvvp;%PATH%
set NUMBAPRO_CUDALIB=%CUDA_PATH%\bin
set NUMBAPRO_LIBDEVICE=%CUDA_PATH%\nvvm\libdevice

View File

@ -31,6 +31,6 @@ if ERRORLEVEL 1 exit /b 1
:: Run tests C++-side and load the exported script module.
cd build
set PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt\bin\x64;%TMP_DIR_WIN%\build\torch\lib;%PATH%
set PATH=%TMP_DIR_WIN%\build\torch\lib;%PATH%
test_custom_backend.exe model.pt
if ERRORLEVEL 1 exit /b 1

View File

@ -31,6 +31,6 @@ if ERRORLEVEL 1 exit /b 1
:: Run tests C++-side and load the exported script module.
cd build
set PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt\bin\x64;%TMP_DIR_WIN%\build\torch\lib;%PATH%
set PATH=%TMP_DIR_WIN%\build\torch\lib;%PATH%
test_custom_ops.exe model.pt
if ERRORLEVEL 1 exit /b 1

View File

@ -5,7 +5,7 @@ if errorlevel 1 exit /b 1
set CWD=%cd%
set CPP_TESTS_DIR=%TMP_DIR_WIN%\build\torch\bin
set PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt\bin\x64;%TMP_DIR_WIN%\build\torch\lib;%PATH%
set PATH=%TMP_DIR_WIN%\build\torch\lib;%PATH%
set TORCH_CPP_TEST_MNIST_PATH=%CWD%\test\cpp\api\mnist
python tools\download_mnist.py --quiet -d %TORCH_CPP_TEST_MNIST_PATH%

View File

@ -40,6 +40,12 @@ python -m pip install pytest-rerunfailures==10.3 pytest-cpp==2.3.0 tensorboard==
# Install Z3 optional dependency for Windows builds.
python -m pip install z3-solver==4.12.2.0
# Install tlparse for test\dynamo\test_structured_trace.py UTs.
python -m pip install tlparse==0.3.25
# Install parameterized
python -m pip install parameterized==0.8.1
run_tests() {
# Run nvidia-smi if available
for path in '/c/Program Files/NVIDIA Corporation/NVSMI/nvidia-smi.exe' /c/Windows/System32/nvidia-smi.exe; do

View File

@ -5,6 +5,7 @@ import sys
import yaml
# Need to import modules that lie on an upward-relative path
sys.path.append(os.path.join(sys.path[0], ".."))

View File

@ -46,14 +46,12 @@ if [[ "\$python_nodot" = *310* ]]; then
PROTOBUF_PACKAGE="protobuf>=3.19.0"
fi
if [[ "\$python_nodot" = *39* ]]; then
if [[ "\$python_nodot" = *39* ]]; then
# There's an issue with conda channel priority where it'll randomly pick 1.19 over 1.20
# we set a lower boundary here just to be safe
NUMPY_PIN=">=1.20"
fi
# Move debug wheels out of the package dir so they don't get installed
mkdir -p /tmp/debug_final_pkgs
mv /final_pkgs/debug-*.zip /tmp/debug_final_pkgs || echo "no debug packages to move"
@ -83,7 +81,7 @@ if [[ "$PACKAGE_TYPE" == conda ]]; then
"numpy\${NUMPY_PIN}" \
mkl>=2018 \
ninja \
sympy \
sympy>=1.12 \
typing-extensions \
${PROTOBUF_PACKAGE}
if [[ "$DESIRED_CUDA" == 'cpu' ]]; then
@ -97,8 +95,16 @@ if [[ "$PACKAGE_TYPE" == conda ]]; then
)
elif [[ "$PACKAGE_TYPE" != libtorch ]]; then
if [[ "\$BUILD_ENVIRONMENT" != *s390x* ]]; then
pip install "\$pkg" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}"
retry pip install -q numpy protobuf typing-extensions
if [[ "$USE_SPLIT_BUILD" == "true" ]]; then
pkg_no_python="$(ls -1 /final_pkgs/torch_no_python* | sort |tail -1)"
pkg_torch="$(ls -1 /final_pkgs/torch-* | sort |tail -1)"
# todo: after folder is populated use the pypi_pkg channel instead
pip install "\$pkg_no_python" "\$pkg_torch" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}_pypi_pkg"
retry pip install -q numpy protobuf typing-extensions
else
pip install "\$pkg" --index-url "https://download.pytorch.org/whl/\${CHANNEL}/${DESIRED_CUDA}"
retry pip install -q numpy protobuf typing-extensions
fi
else
pip install "\$pkg"
retry pip install -q numpy protobuf typing-extensions
@ -113,6 +119,14 @@ fi
# Test the package
/builder/check_binary.sh
if [[ "\$GPU_ARCH_TYPE" != *s390x* && "\$GPU_ARCH_TYPE" != *xpu* && "\$GPU_ARCH_TYPE" != *rocm* && "$PACKAGE_TYPE" != libtorch ]]; then
# Exclude s390x, xpu, rocm and libtorch builds from smoke testing
python /builder/test/smoke_test/smoke_test.py --package=torchonly --torch-compile-check disabled
fi
# Clean temp files
cd /builder && git clean -ffdx
# =================== The above code will be executed inside Docker container ===================
EOL
echo

View File

@ -33,9 +33,9 @@ if [[ -z "$DOCKER_IMAGE" ]]; then
if [[ "$PACKAGE_TYPE" == conda ]]; then
export DOCKER_IMAGE="pytorch/conda-cuda"
elif [[ "$DESIRED_CUDA" == cpu ]]; then
export DOCKER_IMAGE="pytorch/manylinux-cpu"
export DOCKER_IMAGE="pytorch/manylinux:cpu"
else
export DOCKER_IMAGE="pytorch/manylinux-cuda${DESIRED_CUDA:2}"
export DOCKER_IMAGE="pytorch/manylinux-builder:${DESIRED_CUDA:2}"
fi
fi
@ -75,9 +75,9 @@ export PYTORCH_BUILD_NUMBER=1
TRITON_VERSION=$(cat $PYTORCH_ROOT/.ci/docker/triton_version.txt)
# Here PYTORCH_EXTRA_INSTALL_REQUIREMENTS is already set for all the wheel builds, hence append TRITON_CONSTRAINT
TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64' and python_version < '3.13'"
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then
# Only linux Python < 3.12 are supported wheels for triton
TRITON_CONSTRAINT="platform_system == 'Linux' and platform_machine == 'x86_64' and python_version < '3.12'"
# Only linux Python < 3.13 are supported wheels for triton
TRITON_REQUIREMENT="triton==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton.txt)
@ -87,11 +87,11 @@ if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:
fi
# Set triton via PYTORCH_EXTRA_INSTALL_REQUIREMENTS for triton rocm package
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*rocm.* && $(uname) == "Linux" && "$DESIRED_PYTHON" != "3.12" ]]; then
TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}"
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*rocm.* && $(uname) == "Linux" ]]; then
TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton-rocm.txt)
TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}+${TRITON_SHORTHASH}"
TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton.txt)
TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}+${TRITON_SHORTHASH}; ${TRITON_CONSTRAINT}"
fi
if [[ -z "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${TRITON_REQUIREMENT}"
@ -100,30 +100,18 @@ if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_B
fi
fi
JAVA_HOME=
BUILD_JNI=OFF
if [[ "$PACKAGE_TYPE" == libtorch ]]; then
POSSIBLE_JAVA_HOMES=()
POSSIBLE_JAVA_HOMES+=(/usr/local)
POSSIBLE_JAVA_HOMES+=(/usr/lib/jvm/java-8-openjdk-amd64)
POSSIBLE_JAVA_HOMES+=(/Library/Java/JavaVirtualMachines/*.jdk/Contents/Home)
# Add the Windows-specific JNI path
POSSIBLE_JAVA_HOMES+=("$PWD/pytorch/.circleci/windows-jni/")
for JH in "${POSSIBLE_JAVA_HOMES[@]}" ; do
if [[ -e "$JH/include/jni.h" ]] ; then
# Skip if we're not on Windows but haven't found a JAVA_HOME
if [[ "$JH" == "$PWD/pytorch/.circleci/windows-jni/" && "$OSTYPE" != "msys" ]] ; then
break
fi
echo "Found jni.h under $JH"
JAVA_HOME="$JH"
BUILD_JNI=ON
break
# Set triton via PYTORCH_EXTRA_INSTALL_REQUIREMENTS for triton xpu package
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*xpu.* && $(uname) == "Linux" ]]; then
TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton-xpu.txt)
TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}+${TRITON_SHORTHASH}; ${TRITON_CONSTRAINT}"
fi
if [[ -z "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${TRITON_REQUIREMENT}"
else
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${PYTORCH_EXTRA_INSTALL_REQUIREMENTS} | ${TRITON_REQUIREMENT}"
fi
done
if [ -z "$JAVA_HOME" ]; then
echo "Did not find jni.h"
fi
fi
cat >"$envfile" <<EOL
@ -136,6 +124,7 @@ export DESIRED_PYTHON="${DESIRED_PYTHON:-}"
export DESIRED_CUDA="$DESIRED_CUDA"
export LIBTORCH_VARIANT="${LIBTORCH_VARIANT:-}"
export BUILD_PYTHONLESS="${BUILD_PYTHONLESS:-}"
export USE_SPLIT_BUILD="${USE_SPLIT_BUILD:-}"
if [[ "${OSTYPE}" == "msys" ]]; then
export LIBTORCH_CONFIG="${LIBTORCH_CONFIG:-}"
if [[ "${LIBTORCH_CONFIG:-}" == 'debug' ]]; then
@ -159,8 +148,6 @@ export TORCH_CONDA_BUILD_FOLDER='pytorch-nightly'
export ANACONDA_USER='pytorch'
export USE_FBGEMM=1
export JAVA_HOME=$JAVA_HOME
export BUILD_JNI=$BUILD_JNI
export PIP_UPLOAD_FOLDER="$PIP_UPLOAD_FOLDER"
export DOCKER_IMAGE="$DOCKER_IMAGE"
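
The hunks above build each Triton requirement as `name==version[+shorthash]; marker` and then either set PYTORCH_EXTRA_INSTALL_REQUIREMENTS or append to it with a ` | ` separator. A minimal sketch of that pattern follows. The helper names `make_requirement` and `append_requirement`, the version `3.0.0`, and the hash `0123456789` are hypothetical placeholders for illustration and do not appear in the script.

```shell
# Hypothetical helper (not in the original script): build
# "name==version[+shorthash]; marker" from its parts.
make_requirement() {
  local name="$1"
  local version="$2"
  local marker="$3"
  local shorthash="${4:-}"
  local req="${name}==${version}"
  if [ -n "$shorthash" ]; then
    # Dev builds pin the exact commit via a local version suffix.
    req="${req}+${shorthash}"
  fi
  printf '%s; %s\n' "$req" "$marker"
}

# Hypothetical helper: append to a " | "-separated requirement list,
# or start the list when it is empty.
append_requirement() {
  local existing="$1"
  local new="$2"
  if [ -z "$existing" ]; then
    printf '%s\n' "$new"
  else
    printf '%s | %s\n' "$existing" "$new"
  fi
}

# Example values; the version and hash are made up for this sketch.
marker="platform_system == 'Linux' and platform_machine == 'x86_64' and python_version < '3.13'"
req="$(make_requirement pytorch-triton-xpu 3.0.0 "$marker" 0123456789)"
append_requirement "" "$req"
```

Keeping the environment marker in one variable, as TRITON_CONSTRAINT does in the diff, means the supported-Python bound only has to change in one place.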

@ -25,6 +25,15 @@ if [[ "${DRY_RUN}" = "disabled" ]]; then
AWS_S3_CP="aws s3 cp"
fi
if [[ "${USE_SPLIT_BUILD:-false}" == "true" ]]; then
UPLOAD_SUBFOLDER="${UPLOAD_SUBFOLDER}_pypi_pkg"
fi
# this is special build with all dependencies packaged
if [[ ${BUILD_NAME} == *-full* ]]; then
UPLOAD_SUBFOLDER="${UPLOAD_SUBFOLDER}_full"
fi
# Sleep 5 minutes between retries for conda upload
retry () {
"$@" || (sleep 5m && "$@") || (sleep 5m && "$@") || (sleep 5m && "$@") || (sleep 5m && "$@")

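The `retry` helper in this hunk chains one initial attempt and four retries with a fixed `sleep 5m` between them. The same idea can be sketched as a loop with a configurable attempt count and delay. The name `retry_with_delay` and its argument order are assumptions made for this sketch and are not part of the upload script.

```shell
# Hypothetical generalization of the retry helper: run a command up to
# <attempts> times, sleeping <delay> between failed attempts.
retry_with_delay() {
  local attempts="$1"
  local delay="$2"
  shift 2
  local i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    # Do not sleep after the final failed attempt.
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    i=$((i + 1))
  done
  return 1
}

# Usage sketch: five attempts, no delay, on a command that succeeds.
retry_with_delay 5 0 true
```

With these assumed arguments, `retry_with_delay 5 5m some_command` would make the same number of attempts as the chained version, with the same pause between them.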
@ -10,6 +10,11 @@ export SCCACHE_BUCKET=ossci-compiler-cache
export SCCACHE_IGNORE_SERVER_IO_ERROR=1
export VC_YEAR=2019
if [[ "$DESIRED_CUDA" == 'xpu' ]]; then
export VC_YEAR=2022
export USE_SCCACHE=0
fi
echo "Free space on filesystem before build:"
df -h

@ -6,6 +6,10 @@ source "${BINARY_ENV_FILE:-/c/w/env}"
export CUDA_VERSION="${DESIRED_CUDA/cu/}"
export VC_YEAR=2019
if [[ "$DESIRED_CUDA" == 'xpu' ]]; then
export VC_YEAR=2022
fi
pushd "$BUILDER_ROOT"
./windows/internal/smoke_test.bat
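
Two different bash parameter expansions appear in these scripts to turn a value such as `cu124` into `124`: `${DESIRED_CUDA:2}` (substring from offset 2) and `${DESIRED_CUDA/cu/}` (remove the first `cu`). A small sketch of how they differ is shown below. The function names and the sample values are assumptions for illustration only, and the block requires bash rather than plain `sh`.

```shell
#!/usr/bin/env bash
# Hypothetical wrappers around the two expansions used in the scripts.

# Drop the first two characters regardless of what they are.
strip_by_offset() {
  local value="$1"
  printf '%s\n' "${value:2}"
}

# Remove the first occurrence of the literal "cu", if present.
strip_by_pattern() {
  local value="$1"
  printf '%s\n' "${value/cu/}"
}

strip_by_offset "cu124"    # prints 124
strip_by_pattern "cu124"   # prints 124
strip_by_offset "cpu"      # prints u   (offset form is blind to content)
strip_by_pattern "cpu"     # prints cpu (no "cu" substring to remove)
```

The offset form only gives a sensible result when the value really starts with a two-character prefix, which is consistent with the DOCKER_IMAGE logic checking for `cpu` before using it.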

@ -8,6 +8,7 @@ import time
import requests
AZURE_PIPELINE_BASE_URL = "https://aiinfra.visualstudio.com/PyTorch/"
AZURE_DEVOPS_PAT_BASE64 = os.environ.get("AZURE_DEVOPS_PAT_BASE64_SECRET", "")
PIPELINE_ID = "911"

@ -62,4 +62,6 @@ readability-string-compare,
'
HeaderFilterRegex: '^(aten/|c10/|torch/).*$'
WarningsAsErrors: '*'
CheckOptions:
misc-header-include-cycle.IgnoredFilesList: 'format.h;ivalue.h;custom_class.h;Dict.h;List.h'
...

@ -5,7 +5,7 @@ git submodule sync
git submodule update --init --recursive
# This takes some time
make setup_lint
make setup-lint
# Add CMAKE_PREFIX_PATH to bashrc
echo 'export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}' >> ~/.bashrc

@ -2,12 +2,12 @@
# NOTE: **Mirror any changes** to this file the [tool.ruff] config in pyproject.toml
# before we can fully move to use ruff
enable-extensions = G
select = B,C,E,F,G,P,SIM1,T4,W,B9,TOR0,TOR1,TOR2,TOR9
select = B,C,E,F,G,P,SIM1,SIM911,T4,W,B9,TOR0,TOR1,TOR2,TOR9
max-line-length = 120
# C408 ignored because we like the dict keyword argument syntax
# E501 is not flexible enough, we're using B950 instead
ignore =
E203,E305,E402,E501,E721,E741,F405,F841,F999,W503,W504,C408,E302,W291,E303,
E203,E305,E402,E501,E704,E721,E741,F405,F841,F999,W503,W504,C408,E302,W291,E303,
# shebang has extra meaning in fbcode lints, so I think it's not worth trying
# to line this up with executable bit
EXE001,
@ -55,6 +55,9 @@ per-file-ignores =
torch/distributed/_functional_collectives.py: TOR901
torch/distributed/_spmd/data_parallel.py: TOR901
torch/distributed/_tensor/_collective_utils.py: TOR901
# This is a full package that happens to live within the test
# folder, so ok to skip
test/cpp_extensions/open_registration_extension/pytorch_openreg/_aten_impl.py: TOR901
optional-ascii-coding = True
exclude =
./.git,

@ -40,3 +40,7 @@ e6ec0efaf87703c5f889cfc20b29be455885d58d
a53cda1ddc15336dc1ff0ce1eff2a49cdc5f882e
# 2024-01-02 clangformat: fused adam #116583
9dc68d1aa9e554d09344a10fff69f7b50b2d23a0
# 2024-06-28 enable UFMT in `torch/storage.py`
d80939e5e9337e8078f11489afefec59fd42f93b
# 2024-06-28 enable UFMT in `torch.utils.data`
7cf0b90e49689d45be91aa539fdf54cf2ea8a9a3

@ -1,30 +1,83 @@
self-hosted-runner:
labels:
# GitHub hosted x86 Linux runners
- linux.20_04.4x
- linux.20_04.16x
# Organization-wide AWS Linux Runners
- linux.large
- linux.large.arc
- linux.2xlarge
- linux.4xlarge
- linux.9xlarge.ephemeral
- am2.linux.9xlarge.ephemeral
- linux.12xlarge
- linux.12xlarge.ephemeral
- linux.24xlarge
- linux.24xlarge.ephemeral
- linux.arm64.2xlarge
- linux.arm64.2xlarge.ephemeral
- linux.arm64.m7g.4xlarge
- linux.arm64.m7g.4xlarge.ephemeral
- linux.4xlarge.nvidia.gpu
- linux.8xlarge.nvidia.gpu
- linux.16xlarge.nvidia.gpu
- linux.g5.4xlarge.nvidia.gpu
# Pytorch/pytorch AWS Linux Runners on Linux Foundation account
- lf.linux.large
- lf.linux.2xlarge
- lf.linux.4xlarge
- lf.linux.12xlarge
- lf.linux.24xlarge
- lf.linux.arm64.2xlarge
- lf.linux.4xlarge.nvidia.gpu
- lf.linux.8xlarge.nvidia.gpu
- lf.linux.16xlarge.nvidia.gpu
- lf.linux.g5.4xlarge.nvidia.gpu
# Organization-wide AWS Linux Runners with new Amazon 2023 AMI
- amz2023.linux.large
- amz2023.linux.2xlarge
- amz2023.linux.4xlarge
- amz2023.linux.12xlarge
- amz2023.linux.24xlarge
- amz2023.linux.arm64.2xlarge
- amz2023.linux.arm64.m7g.4xlarge
- amz2023.linux.arm64.m7g.4xlarge.ephemeral
- amz2023.linux.4xlarge.nvidia.gpu
- amz2023.linux.8xlarge.nvidia.gpu
- amz2023.linux.16xlarge.nvidia.gpu
- amz2023.linux.g5.4xlarge.nvidia.gpu
# Pytorch/pytorch AWS Linux Runners with the new Amazon 2023 AMI on Linux Foundation account
- amz2023.lf.linux.large
- amz2023.lf.linux.2xlarge
- amz2023.lf.linux.4xlarge
- amz2023.lf.linux.12xlarge
- amz2023.lf.linux.24xlarge
- amz2023.lf.linux.arm64.2xlarge
- amz2023.lf.linux.4xlarge.nvidia.gpu
- amz2023.lf.linux.8xlarge.nvidia.gpu
- amz2023.lf.linux.16xlarge.nvidia.gpu
- amz2023.lf.linux.g5.4xlarge.nvidia.gpu
# Repo-specific IBM hosted S390x runner
- linux.s390x
# Organization wide AWS Windows runners
- windows.g4dn.xlarge
- windows.g4dn.xlarge.nonephemeral
- windows.4xlarge
- windows.4xlarge.nonephemeral
- windows.8xlarge.nvidia.gpu
- windows.8xlarge.nvidia.gpu.nonephemeral
- windows.g5.4xlarge.nvidia.gpu
- bm-runner
# Organization-wide AMD hosted MI300 runners
- linux.rocm.gpu
# Repo-specific Apple hosted runners
- macos-m1-ultra
- macos-m2-14
# Org-wide AWS `mac2.metal` runners (2020 Mac mini hardware powered by Apple silicon M1 processors)
- macos-m1-stable
- macos-m1-13
- macos-m1-14
- macos-12-xl
- macos-12
- macos12.3-m1
# GitHub-hosted MacOS runners
- macos-latest-xlarge
- macos-13-xlarge
- macos-14-xlarge
# Organization-wide Intel hosted XPU runners
- linux.idc.xpu

@ -14,12 +14,14 @@ runs:
- name: Cleans up diskspace
shell: bash
run: |
set -ex
diskspace_cutoff=${{ inputs.diskspace-cutoff }}
diskspace=$(df -H / --output=pcent | sed -n 2p | sed 's/%//' | sed 's/ //')
docker_root_dir=$(docker info -f '{{.DockerRootDir}}')
diskspace=$(df -H --output=pcent ${docker_root_dir} | sed -n 2p | sed 's/%//' | sed 's/ //')
msg="Please file an issue on pytorch/pytorch reporting the faulty runner. Include a link to the runner logs so the runner can be identified"
if [[ "$diskspace" -ge "$diskspace_cutoff" ]] ; then
docker system prune -af
diskspace_new=$(df -H / --output=pcent | sed -n 2p | sed 's/%//' | sed 's/ //')
diskspace_new=$(df -H --output=pcent ${docker_root_dir} | sed -n 2p | sed 's/%//' | sed 's/ //')
if [[ "$diskspace_new" -gt "$diskspace_cutoff" ]] ; then
echo "Error: Available diskspace is less than $diskspace_cutoff percent. Not enough diskspace."
echo "$msg"
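
The cleanup step above reads the used-space percentage from `df --output=pcent`, compares it with a cutoff, prunes Docker data, and then measures again. The parsing and comparison can be sketched in isolation as below. The helper names `parse_pcent` and `needs_cleanup` are hypothetical, and `tr -d ' %'` is swapped in for the two `sed` substitutions used in the action so that all padding spaces are removed.

```shell
# Hypothetical helper: read `df --output=pcent <path>` output on stdin
# and print the bare number from the second line (the first line is the header).
parse_pcent() {
  sed -n 2p | tr -d ' %'
}

# Hypothetical helper: succeed when used space has reached the cutoff.
needs_cleanup() {
  local used="$1"
  local cutoff="$2"
  [ "$used" -ge "$cutoff" ]
}

# Example with canned input instead of a live `df` call.
used="$(printf 'Use%%\n 42%%\n' | parse_pcent)"
if needs_cleanup "$used" 80; then
  echo "cleanup needed at ${used}%"
else
  echo "no cleanup needed at ${used}%"
fi
# On a GNU system the live value could be read with:
#   df --output=pcent "$docker_root_dir" | parse_pcent
```

Measuring the filesystem that holds the Docker root directory, as the updated lines do, matters when that directory lives on a different mount than `/`.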

@ -41,6 +41,9 @@ outputs:
ci-verbose-test-logs:
description: True if ci-verbose-test-logs label was on PR or [ci-verbose-test-logs] in PR body.
value: ${{ steps.filter.outputs.ci-verbose-test-logs }}
ci-test-showlocals:
description: True if ci-test-showlocals label was on PR or [ci-test-showlocals] in PR body.
value: ${{ steps.filter.outputs.ci-test-showlocals }}
ci-no-test-timeout:
description: True if ci-no-test-timeout label was on PR or [ci-no-test-timeout] in PR body.
value: ${{ steps.filter.outputs.ci-no-test-timeout }}
@ -54,7 +57,7 @@ outputs:
runs:
using: composite
steps:
- uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
- uses: nick-fields/retry@v3.0.0
name: Setup dependencies
env:
GITHUB_TOKEN: ${{ inputs.github-token }}
@ -66,7 +69,8 @@ runs:
command: |
set -eux
# PyYAML 6.0 doesn't work with MacOS x86 anymore
python3 -m pip install requests==2.26.0 pyyaml==6.0.1
# This must run on Python-3.7 (AmazonLinux2) so can't use request=3.32.2
python3 -m pip install requests==2.27.1 pyyaml==6.0.1
- name: Parse ref
id: parse-ref

@ -1,207 +0,0 @@
name: linux-build
inputs:
build-environment:
required: true
description: Top-level label for what's being built/tested.
docker-image-name:
required: true
description: Name of the base docker image to build with.
build-generates-artifacts:
required: false
default: "true"
description: If set, upload generated build artifacts.
build-with-debug:
required: false
default: "false"
description: If set, build in debug mode.
sync-tag:
required: false
default: ""
description: |
If this is set, our linter will use this to make sure that every other
job with the same `sync-tag` is identical.
cuda-arch-list:
required: false
default: "5.2"
description: List of CUDA architectures CI build should target.
runner:
required: false
default: "linux.2xlarge"
description: |
Runner label to select worker type
test-matrix:
required: false
type: string
description: |
An optional JSON description of what test configs to run later on. This
is moved here from the Linux test workflow so that we can apply filter
logic using test-config labels earlier and skip unnecessary builds
s3-bucket:
description: S3 bucket to download artifact
required: false
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
default: ""
GITHUB_TOKEN:
description: GitHub token
required: true
HUGGING_FACE_HUB_TOKEN:
description: Hugging Face Hub token
required: false
default: ""
outputs:
docker-image:
value: ${{ steps.calculate-docker-image.outputs.docker-image }}
description: The docker image containing the built PyTorch.
test-matrix:
value: ${{ steps.filter.outputs.test-matrix }}
description: An optional JSON description of what test configs to run later on.
runs:
using: composite
steps:
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: configure aws credentials
uses: aws-actions/configure-aws-credentials@v3
if: ${{ inputs.aws-role-to-assume != '' }}
with:
role-to-assume: ${{ inputs.aws-role-to-assume }}
role-session-name: gha-linux-build
role-duration-seconds: 10800
aws-region: us-east-1
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: ${{ inputs.docker-image-name }}
- name: Use following to pull public copy of the image
id: print-ghcr-mirror
env:
ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
shell: bash
run: |
tag=${ECR_DOCKER_IMAGE##*/}
echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Parse ref
id: parse-ref
shell: bash
run: .github/scripts/parse_ref.py
- name: Get workflow job id
id: get-job-id
uses: ./.github/actions/get-workflow-job-id
if: always()
with:
github-token: ${{ inputs.GITHUB_TOKEN }}
# Apply the filter logic to the build step too if the test-config label is already there
- name: Select all requested test configurations (if the test matrix is available)
id: filter
uses: ./.github/actions/filter-test-configs
with:
github-token: ${{ inputs.GITHUB_TOKEN }}
test-matrix: ${{ inputs.test-matrix }}
job-name: ${{ steps.get-job-id.outputs.job-name }}
- name: Download pytest cache
uses: ./.github/actions/pytest-cache-download
continue-on-error: true
with:
cache_dir: .pytest_cache
job_identifier: ${{ github.workflow }}_${{ inputs.build-environment }}
s3_bucket: ${{ inputs.s3-bucket }}
- name: Build
if: steps.filter.outputs.is-test-matrix-empty == 'False' || inputs.test-matrix == ''
id: build
env:
BUILD_ENVIRONMENT: ${{ inputs.build-environment }}
BRANCH: ${{ steps.parse-ref.outputs.branch }}
# TODO duplicated
AWS_DEFAULT_REGION: us-east-1
PR_NUMBER: ${{ github.event.pull_request.number }}
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}
XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla
PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }}
TORCH_CUDA_ARCH_LIST: ${{ inputs.cuda-arch-list }}
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }}
DEBUG: ${{ inputs.build-with-debug == 'true' && '1' || '0' }}
OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
HUGGING_FACE_HUB_TOKEN: ${{ inputs.HUGGING_FACE_HUB_TOKEN }}
shell: bash
run: |
# detached container should get cleaned up by teardown_ec2_linux
container_name=$(docker run \
-e BUILD_ENVIRONMENT \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e AWS_DEFAULT_REGION \
-e PR_NUMBER \
-e SHA1 \
-e BRANCH \
-e SCCACHE_BUCKET \
-e SCCACHE_S3_KEY_PREFIX \
-e XLA_CUDA \
-e XLA_CLANG_CACHE_S3_BUCKET_NAME \
-e SKIP_SCCACHE_INITIALIZATION=1 \
-e TORCH_CUDA_ARCH_LIST \
-e PR_LABELS \
-e OUR_GITHUB_JOB_ID \
-e HUGGING_FACE_HUB_TOKEN \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
--tty \
--detach \
--user jenkins \
-v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \
-w /var/lib/jenkins/workspace \
"${DOCKER_IMAGE}"
)
docker exec -t "${container_name}" sh -c '.ci/pytorch/build.sh'
- name: Archive artifacts into zip
if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped'
shell: bash
run: |
zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .additional_ci_files
- name: Store PyTorch Build Artifacts on S3
uses: seemethere/upload-artifact-s3@v5
if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped'
with:
name: ${{ inputs.build-environment }}
retention-days: 14
if-no-files-found: error
path: artifacts.zip
s3-bucket: ${{ inputs.s3-bucket }}
- name: Upload sccache stats
if: steps.build.outcome != 'skipped'
uses: seemethere/upload-artifact-s3@v5
with:
s3-prefix: |
${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact
retention-days: 365
if-no-files-found: warn
path: sccache-stats-*.json
s3-bucket: ${{ inputs.s3-bucket }}
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()

@ -167,6 +167,7 @@ runs:
REENABLED_ISSUES: ${{ steps.keep-going.outputs.reenabled-issues }}
CONTINUE_THROUGH_ERROR: ${{ steps.keep-going.outputs.keep-going }}
VERBOSE_TEST_LOGS: ${{ steps.keep-going.outputs.ci-verbose-test-logs }}
TEST_SHOWLOCALS: ${{ steps.keep-going.outputs.ci-test-showlocals }}
NO_TEST_TIMEOUT: ${{ steps.keep-going.outputs.ci-no-test-timeout }}
NO_TD: ${{ steps.keep-going.outputs.ci-no-td }}
TD_DISTRIBUTED: ${{ steps.keep-going.outputs.ci-td-distributed }}

@ -17,7 +17,7 @@ inputs:
runs:
using: composite
steps:
- uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
- uses: nick-fields/retry@v3.0.0
name: Setup dependencies
with:
shell: bash

@ -24,7 +24,7 @@ inputs:
runs:
using: composite
steps:
- uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
- uses: nick-fields/retry@v3.0.0
name: Setup dependencies
with:
shell: bash

Some files were not shown because too many files have changed in this diff