Remove `enable_fake_mode` and `exporter_legacy` entirely. Even though this is BC-breaking, `enable_fake_mode` is no longer compatible with the latest version of transformers, so it is no longer useful.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161222
Approved by: https://github.com/titaiwangms
Summary:
Adds Python garbage collection events to Kineto traces and profiler FunctionEvents. We create a custom C++ callback in profiler_python.cpp, then define a Python function in C++ and register that callback for all Python garbage collection. We don't worry about thread safety in this case because we only do init/teardown for the main thread while holding the GIL.
For now this is hidden behind an experimental config, because Python tracing tends to be unstable, especially when any new feature is added. If it is found not to add too much overhead, we can turn it on by default. NOTE: to enable this you need both `with_stack=True` and the experimental config turned on!
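A hedged sketch of how this would be enabled; the exact `_ExperimentalConfig` field added by this change is not named in this summary, so it is left as a placeholder comment rather than invented:

```python
import gc

import torch
from torch._C._profiler import _ExperimentalConfig

# The field that turns on GC-event capture belongs in this experimental
# config; its exact name is not given here, so none is set (hypothetical).
config = _ExperimentalConfig()

with torch.profiler.profile(with_stack=True, experimental_config=config) as prof:
    gc.collect()  # induce a garbage collection inside the trace

print(prof.key_averages().table(sort_by="cpu_time_total"))
```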
Test Plan:
Ran a trace with GC induced and saw it in the trace.
Also added a test.
Differential Revision: D80491146
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161209
Approved by: https://github.com/ngimel
For comparing NativeRT and TorchScript, we add `--torchscript-jit-trace` as an option in the benchmarks. With this option, we can trace a model and run inference with the traced module using the TorchScript interpreter:
```
python ./benchmarks/dynamo/huggingface.py --performance --inference --torchscript-jit-trace
python ./benchmarks/dynamo/timm_models.py --performance --inference --torchscript-jit-trace
python ./benchmarks/dynamo/torchbench.py --performance --inference --torchscript-jit-trace
```
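For reference, a minimal sketch of what this path exercises under the hood (the model and shapes here are arbitrary placeholders, not benchmark code):

```python
import torch

# torch.jit.trace records a graph from one example run; inference then
# goes through the TorchScript interpreter rather than eager Python.
model = torch.nn.Linear(8, 8).eval()
example = torch.randn(2, 8)
traced = torch.jit.trace(model, example)

with torch.no_grad():
    out = traced(example)  # executed by the TorchScript interpreter
```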
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161223
Approved by: https://github.com/huydhn
HIPAllocatorMasqueradingAsCUDA and HIPCachingAllocatorMasqueradingAsCUDA are now proper, complete wrappers of HIPAllocator and HIPCachingAllocator, respectively. HIPAllocatorMasqueradingAsCUDA now subclasses HIPAllocator instead of Allocator. This fixes the usability of hipify's replacement of `c10::cuda::CUDACachingAllocator::get()`, where callers expect a CUDAAllocator to be returned but were instead getting a very thin Allocator shim.
This also fixes using cudagraph trees with `torch.compile`. The hip:0 device was not being replaced by the cuda:0 device in all methods.
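As a hedged illustration of the scenario this fixes (not code from the PR): cudagraph trees are what `torch.compile`'s "reduce-overhead" mode uses, and on a ROCm build the `cuda` device below is backed by the HIP allocator described above.

```python
import torch

def f(x):
    return torch.relu(x) * 2

# "reduce-overhead" mode enables cudagraph trees; on ROCm, "cuda" maps
# to a HIP device that must masquerade as cuda:0 throughout.
compiled = torch.compile(f, mode="reduce-overhead")
x = torch.randn(32, device="cuda")
for _ in range(3):
    y = compiled(x)  # early iterations record graphs, later ones replay
```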
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161221
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
## Test Result
```bash
lintrunner --take MYPY test/test_numpy_interop.py
Warning: Could not find a lintrunner config at: '.lintrunner.private.toml'. Continuing without using configuration file.
ok No lint issues.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158556
Approved by: https://github.com/soulitzer
Summary:
We use `tempfile.NamedTemporaryFile` to create a temporary pt2 file in `test_nativert.py`.
However, the resulting file is not recognized as an allowed file format, so a warning is thrown.
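A minimal sketch of the pattern, assuming the fix is to give the temporary file an explicit `.pt2` suffix (the actual test code is not shown here):

```python
import tempfile

import torch

# Hypothetical reproduction of the fixed pattern: with suffix=".pt2" the
# saved package is recognized as a valid pt2 archive and no warning fires.
with tempfile.NamedTemporaryFile(suffix=".pt2") as f:
    ep = torch.export.export(torch.nn.Linear(4, 4), (torch.randn(2, 4),))
    torch.export.save(ep, f.name)
```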
Test Plan:
CI
Differential Revision: D80740916
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161203
Approved by: https://github.com/angelayi
Fixes #158076
Basically, the gemm template generates code like
```
cpp_CppMicroGemmRef_micro_gemm<static_cast<bool>(false), static_cast<bool>(false)>(
&(X[static_cast<int64_t>(k_start + 196LL*m_start + 38416LL*ks_b_index)]),
&(W[static_cast<int64_t>(200704000LL + n_start + 80LL*k_start + 15680LL*ks_b_index)]),
&(local_acc_buf[static_cast<int64_t>(Nr*nci + ((-1LL)*Nr*nc))]),
static_cast<int64_t>(m_end + ((-1LL)*m_start)),
static_cast<int64_t>(Nr),
static_cast<int64_t>(k_end + ((-1LL)*k_start)),
static_cast<int64_t>(196LL),
static_cast<int64_t>(80LL),
static_cast<int64_t>(Nc_blocks*Nr)
);
```
However, when the input tensor W has a storage offset, this results in a double offset issue. That is, the resulting pointer is `2 * 200704000LL` away from `W.storage().data_ptr()`, which causes an out-of-bounds access.
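To see why this double-counts, note that `data_ptr()` already accounts for the tensor's storage offset; a short illustration:

```python
import torch

base = torch.randn(100)
w = base[10:]  # a view with storage_offset() == 10

# data_ptr() already includes the storage offset:
assert w.data_ptr() == w.untyped_storage().data_ptr() + 10 * w.element_size()
# Generated code that starts from the offset-adjusted pointer and then
# adds the storage offset again lands past the intended region.
```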
The storage offset of `W` is introduced by [this patch](https://github.com/pytorch/pytorch/pull/136421/files), but I think it's a reasonable fix. So `cpp_gemm_template.py` should handle input matrices with storage offsets properly.
I think a good way to fix this issue is to create a new matrix that has no storage offset.
When `should_block_weights` is true, `block_weight()` creates a clean new matrix, so that branch is not affected by this issue.
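A minimal sketch of the idea behind the fix: `clone()` materializes a copy with fresh storage and a zero offset.

```python
import torch

w = torch.randn(1000)[200:]       # view with a storage offset of 200
assert w.storage_offset() == 200
w_fixed = w.clone()               # fresh storage, no offset
assert w_fixed.storage_offset() == 0
```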
BTW I've also examined the FX IRs generated by `torch.compile()`, as well as the generated python module, and they are correct.
The newly-added test in `test_cpu_select_algorithm.py` can reproduce the issue. With this patch, the crash is fixed. It also resolves the crash reported in #158076.
I ran the CPU tests in `test_cpu_select_algorithm.py`, but many of them are skipped due to MKL and AMX requirements. I'd appreciate it if someone could help verify the test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159233
Approved by: https://github.com/leslie-fang-intel, https://github.com/swolchok
Note: Adding a unit test for this is tricky, as having errors in the specific unit test would cause test_utils.py to crash altogether.
Tested as follows:
1. Added `x = 1/0` after `guarded_code = compile_inner(code, one_graph, hooks, transform)` in `convert_frame.py`
2. Printed `exception_stack_trace` and got: `['Traceback (most recent call last):\n File "/data/users/jovian/pytorch/torch/_dynamo/convert_frame.py", line 1207, in _compile\n x = 1/0\n ~^~\nZeroDivisionError: division by zero\n']`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161096
Approved by: https://github.com/c00w
Following up on https://github.com/pytorch/pytorch/pull/152951#discussion_r2267714825, this removes a few lines added in that pull request, fixing link errors like
```
[7019/7028] Linking CXX shared library bin\torch_hip.dll
FAILED: [code=4294967295] bin/torch_hip.dll lib/torch_hip.lib
C:\Windows\system32\cmd.exe /C "cd . && D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\cmake\data\bin\cmake.exe -E vs_link_dll --msvc-ver=1942 --intdir=caffe2\CMakeFiles\torch_hip.dir --rc=C:\PROGRA~2\WI3CF2~1\10\bin\100261~1.0\x64\rc.exe --mt=C:\PROGRA~2\MICROS~2\2022\BUILDT~1\VC\Tools\Llvm\x64\bin\llvm-mt.exe --manifests -- D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\_rocm_sdk_devel\lib\llvm\bin\lld-link.exe /nologo @CMakeFiles\torch_hip.rsp /out:bin\torch_hip.dll /implib:lib\torch_hip.lib /pdb:bin\torch_hip.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO && cd ."
LINK: command "D:\projects\TheRock\external-builds\pytorch\3.12.venv\Lib\site-packages\_rocm_sdk_devel\lib\llvm\bin\lld-link.exe /nologo @CMakeFiles\torch_hip.rsp /out:bin\torch_hip.dll /implib:lib\torch_hip.lib /pdb:bin\torch_hip.pdb /dll /version:0.0 /machine:x64 /ignore:4049 /ignore:4217 /ignore:4099 /INCREMENTAL:NO /MANIFEST:EMBED,ID=2" failed (exit code 1) with the following output:
lld-link: error: undefined symbol: __declspec(dllimport) class std::tuple<class at::Tensor, class at::Tensor, class at::Tensor> __cdecl at::native::transform_bias_rescale_qkv_cuda(class at::Tensor const &, class at::Tensor const &, __int64)
>>> referenced by caffe2\CMakeFiles\torch_hip.dir\__\aten\src\ATen\RegisterCUDA_0.cpp.obj:(class std::tuple<class at::Tensor, class at::Tensor, class at::Tensor> __cdecl at::`anonymous namespace'::`anonymous namespace'::wrapper_CUDA___transform_bias_rescale_qkv(class 0xE9BF7323::Tensor const &, class 0xE9BF7323::Tensor const &, __int64))
>>> referenced by caffe2\CMakeFiles\torch_hip.dir\__\aten\src\ATen\RegisterNestedTensorCUDA_0.cpp.obj:(class std::tuple<class at::Tensor, class at::Tensor, class at::Tensor> __cdecl at::`anonymous namespace'::`anonymous namespace'::wrapper_NestedTensorCUDA___transform_bias_rescale_qkv(class 0xEFEB5304::Tensor const &, class 0xEFEB5304::Tensor const &, __int64))
```
The `native_transformers_hip_hip` and `native_transformers_hip_cpp` sources are okay to define (and are required) even if accelerated versions of these operations are not available.
I've tested downstream builds of torch with ROCm on native Windows via https://github.com/ROCm/TheRock both with and without aotriton and these changes were needed for the build to succeed in both cases. I have _not_ tested Linux, WSL, or with the HIP SDK.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160373
Approved by: https://github.com/alugorey, https://github.com/jeffdaily
Summary:
Removed `Model`; it's not used anywhere, so this is safe to remove.
Removed the `tensor_paths` and `constant_paths` fields in `ExportedProgram` (see the sketch below):
- BC: when the current deserializer loads a previously serialized EP (one that comes with empty `tensor_paths` and `constant_paths`), it will just ignore those two fields.
- FC: when the old deserializer loads a newly serialized EP (one that doesn't come with `tensor_paths` and `constant_paths`), it will also ignore those two fields in `_dict_to_dataclass()`.
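A minimal sketch (not the actual `torch._export` code) of the BC/FC pattern described above, where serialized fields unknown to the current schema are dropped when rebuilding the dataclass:

```python
from dataclasses import dataclass, fields

# Hypothetical stand-in for the real schema class.
@dataclass
class GraphModuleMeta:
    opset_version: int = 0

def dict_to_dataclass(cls, data: dict):
    # Keep only the fields this schema version knows about, so payloads
    # carrying removed fields like tensor_paths still load cleanly.
    known = {f.name for f in fields(cls)}
    return cls(**{k: v for k, v in data.items() if k in known})

meta = dict_to_dataclass(
    GraphModuleMeta,
    {"opset_version": 9, "tensor_paths": [], "constant_paths": []},
)
```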
Differential Revision: D80725094
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161185
Approved by: https://github.com/SherlockNoMad