[MPS] Fix sliced cast (#138314)
This fixes an internal crash caused by an invalid buffer size computation when the sliced API is used.
Not sure what the purpose of the following was:
```c++
IntArrayRef baseShape;
if (src.is_view()) {
  baseShape = src._base().sizes();
} else {
  baseShape = getIMPSAllocator()->getBufferShape(src.storage().data());
}
int flattenedShaped = 1;
for (const auto i : c10::irange(baseShape.size())) {
  flattenedShaped *= baseShape[i];
}
```
as `flattenedShaped` can be computed much more easily as `[srcBuf length] / src.element_size()`, which is safe to do even if `srcBuf` is padded.
When someone allocated a buffer to hold, say, uint8 and then view-casted it to float16, the attempt to compute `baseShape` returned the size of the original tensor in its original data type, rather than the size in the new dtype.
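A minimal Python sketch of the failure mode described above (hedged: assumes an MPS device is available; not the exact repro from the linked issue):
```python
import torch

# Allocate a uint8 buffer, slice it, then view-cast the slice to float16.
base = torch.arange(16, dtype=torch.uint8, device="mps")
sliced = base[8:]                    # view into the original buffer
casted = sliced.view(torch.float16)  # 8 uint8 bytes reinterpreted as 4 float16 values
print(casted.cpu())                  # previously this path could compute the wrong buffer size
```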
Fixes https://github.com/pytorch/pytorch/issues/137800
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138314
Approved by: https://github.com/albanD, https://github.com/DenisVieriu97
(cherry picked from commit de16159e565e7a08294347e31e97ca08a3468227)
Co-authored-by: Nikita Shulga <nikita.shulga@gmail.com>
[SDPA-CUDNN] Make CuDNN Attention Opt in (#138522)
# Summary
Currently we have a `cudnn_order` that says that on H100 with a new enough cuDNN backend (we ship a 9.1 version in OSS) we should try to run cuDNN attention first. We have already encountered a few bugs with the release of 2.5:
1. https://github.com/pytorch/pytorch/issues/138529
2. https://github.com/huggingface/diffusers/issues/9704
3. https://github.com/pytorch/pytorch/pull/138354
In light of the above, we are going to make the cuDNN backend opt-in by default.
Opting in can be done easily with the context manager for choosing backends, i.e.:
``` Python
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```
This PR puts the CuDNN backend as the lowest precedence in the backend list, meaning that the Math backend will always be chosen unless disabled (which is done via the context manager).
Cc @atalman
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138522
Approved by: https://github.com/ngimel, https://github.com/eqy, https://github.com/malfet
(cherry picked from commit 9a9a0abc2818d40d06eda6c0b6fdbc949474f12e)
Co-authored-by: drisspg <drisspguessous@gmail.com>
* [RELEASE-ONLY Change] Push ROCm images on RC
We usually don't need to push Docker images on an RC, but for the 2.5.0 release a cherry-pick for ROCm actually modified the Docker images.
Do this for ROCm only, to be safe.
After the release, think about what the desired behavior is and implement it in a more generic way.
* Hardcode 2.5 in the tag
[dynamo] Do not treat user defined nn module attributes as static for dynamic shape infra (#136516)
Fixes https://github.com/pytorch/pytorch/issues/136254
The regression was introduced in https://github.com/pytorch/pytorch/pull/132736, where we were originally trying to fix another regression. This PR and the offending PR together say: "treat user defined nn module attributes as automatic dynamic, but for cudagraphs they will be considered static". This avoids recompilations. It can lead to a cudagraph recording, which is OK. This also maintains the state before the inline_inbuilt_nn_modules flag was introduced.
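A hedged illustration of the intended behavior (the module and attribute names below are made up for the example):
```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = 2  # plain user-defined attribute (not a Parameter or buffer)

    def forward(self, x):
        return x * self.scale

m = M()
compiled = torch.compile(m)
compiled(torch.randn(4))  # first call compiles, specialized on scale == 2
m.scale = 3
compiled(torch.randn(4))  # treated as automatic dynamic: at most one recompile to
                          # generalize, rather than one recompile per new value
```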
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136516
Approved by: https://github.com/williamwen42
We previously only supported the same v_head dim and qk_head dim. When we allowed for different head dims, I accidentally kept the same query strides for the output. This PR fixes that bug and also ensures that we always produce output in the same stride order as the input query.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135882
Approved by: https://github.com/yanboliang, https://github.com/Chillee
(cherry picked from commit ae02d663cdf493362699d2672ed7dc9019a7033b)
* [DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725)
Fix https://github.com/pytorch/pytorch/issues/134095
This fixes the distributed state dict full_state_dict option hanging during set_state_dict. We switch `_distribute_tensors` in _state_dict_utils.py to use `DTensor.from_local` instead of `distribute_tensor` to support the FSDP2+TP 2-D strided sharding use case, as `distribute_tensor` cannot handle strided sharding yet. `distribute_tensor` incurs a scatter behind the scenes, while `DTensor.from_local` takes the local slice from the full tensor on each rank to create the DTensor (no collective). This means it's the user's responsibility to make sure the full_tensor from the full_state_dict is the same across all ranks.
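A rough sketch of the difference between the two code paths (assumes `full_tensor`, a 1-D `mesh`, and `Shard(0)` placements already exist; this is not the exact code in _state_dict_utils.py):
```python
from torch.distributed._tensor import DTensor, Shard, distribute_tensor

placements = [Shard(0)]

# distribute_tensor: scatters shards from the source rank (issues a collective).
dt_scatter = distribute_tensor(full_tensor, mesh, placements)

# DTensor.from_local: wraps the local slice each rank already holds (no collective),
# so every rank must hold the same full_tensor to slice from.
local_shard = full_tensor.chunk(mesh.size(), dim=0)[mesh.get_rank()]
dt_local = DTensor.from_local(local_shard, mesh, placements)
```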
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135725
Approved by: https://github.com/fegin
(cherry picked from commit 0cdc6a8dcd7e294b01d8914385bbe45e79c1770d)
* [DSD] Fix loading uneven full tensor into sharded state dict (#136365)
Fixes #136228.
This is a follow-up on https://github.com/pytorch/pytorch/pull/135725. We need to pass the shape and stride from the original DTensor, since for the uneven case, `from_local` would calculate the shape and stride assuming the tensor is evenly sharded based on the local tensor.
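A small sketch of the change (hedged: assumes `local_shard`, `mesh`, `placements`, and the original `dtensor` are in scope; `shape`/`stride` are keyword arguments of `DTensor.from_local`):
```python
from torch.distributed._tensor import DTensor

# Pass the global shape/stride of the original DTensor explicitly; otherwise
# from_local infers them by assuming the local tensor is an even shard.
new_dt = DTensor.from_local(
    local_shard,
    mesh,
    placements,
    shape=dtensor.shape,
    stride=dtensor.stride(),
)
```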
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136365
Approved by: https://github.com/fegin
(cherry picked from commit 637d5c4b7eb0fb82e019eed29efc0aa6ba92dc24)
[SymmetricMemory] improve multicast initialization/fallback logic (#136577)
Fixes https://github.com/pytorch/pytorch/issues/136494
Currently, CUDASymmetricMemory::rendezvous() initializes a multicast address if multicast support is present. However, if we believe multicast support is present but cuMulticastCreate still fails for some reason, we do not fall back gracefully.
- In addition to CUDART and driver version check, query CU_DEVICE_ATTRIBUTE_MULTICAST_SUPPORTED to determine multicast support for a rank/device.
- Before initializing multicast for a block, ensure all ranks/devices have multicast support.
- This is unlikely, but if cuMulticastCreate still fails on rank 0, print the corresponding driver error message as a warning, and gracefully skip multicast initialization for the block.
- Introduced an environment variable (TORCH_SYMM_MEM_DISABLE_MULTICAST) to allow users to explicitly disable multicast support as a workaround.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136577
Approved by: https://github.com/Chillee, https://github.com/eqy
(cherry picked from commit d55eef5c596b3955dd8ee43c721b1c311dbab5e0)
* Always create seed and offset tensors on GPU memory.
* Adjust fudge_factors for test_flash_attention_vs_math_ref_grads
* Skip enable_gqa=True tests
* Fix cudagraph support for FA backend
* Update the AOTriton FA API to meet hipGraph demands.
* Enable test_fused_attention_vs_math_ref_grads_cudagraph and skip seq_len_q != seq_len_k when is_causal=True
* The main FA and ME tests passed after heavily hacking the fudge factors...
* [SDPA] Add experimental support to Navi31
* Changes aotriton_version.txt to 0.7b release
* Make the fudge factors more explicit.
* Code clean up.
* Claim GQA is not supported on ROCM in can_use_flash_attention
* Switch to .gz package
* Skip failures on test/test_native_mha.py
* Skip more GQA tests
* Skip nn_functional_scaled_dot_product_attention related tests
* Disable Efficient attention on fp32 + is_casual=True
* Revert "Disable Efficient attention on fp32 + is_casual=True"
This reverts commit 36324a49d2c322146adbd678902fa32d008b8b8b.
It's not very effective and forcing MATH backend does not help. Need
further investigations.
* Add missing imports
* Disable test_transformerencoderlayer and test_transformerdecoder
* Fix two more problems
* Fix lint
* Re-enable test_transformers
* Skip some tests in test_multiheadattention_fastpath_attn_mask on ROCM
* fix lint
* skip test_pointwise_associative_scan_tuple_reverse_True_combine_mode_pointwise_cuda on ROCm
* skip more test_pointwise_associative_scan
* Fix per suggestions from Nikita
* Update skip reason of test_transformerencoderlayer
* Add missing using
skip_if_rocm is used only in the multiprocess case (when the UT test class is a child of MultiProcessTestCase). Each individual process can exit with a skip code. If it is used for a single-process UT, it will cause the UT to fail, as the process returns a non-zero exit code. Use skipIfRocm in single-process UTs instead.
To avoid the above confusion, this PR renames skip_if_rocm to skip_if_rocm_multiprocess.
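A sketch of the intended split (the import locations below are assumptions about where these helpers live):
```python
# Multi-process tests (subclasses of MultiProcessTestCase): each process may exit
# with a skip code, so use the multiprocess-aware decorator.
from torch.testing._internal.common_distributed import skip_if_rocm_multiprocess

# Single-process tests: use skipIfRocm so the test is reported as skipped instead
# of failing with a non-zero exit code.
from torch.testing._internal.common_utils import skipIfRocm
```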
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136161
Approved by: https://github.com/jithunnair-amd, https://github.com/kwen2501, https://github.com/fegin
Co-authored-by: Prachi Gupta <prachi.gupta@amd.com>
Update current maintainers (#136672)
This file didn't have an overhaul in a few years, so this is long overdue. Most of the credit goes to @orionr for gathering all of this info.
The main rules we followed:
- No code contributor is removed; they're all moved to emeritus
- Break down categories that are too big, to make this document useful for knowing who to ping
- No category whose code is still in the codebase is removed
- We did not rework the categories (for example, to be closer to module: labels) and leave that for later
- All non-emeritus names are ordered by their number of comments on issues related to their topic
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136672
Approved by: https://github.com/eqy, https://github.com/ezyang, https://github.com/seemethere, https://github.com/malfet
(cherry picked from commit 2421344d8f582084b69b7b00fe0304b1c9732f65)
Co-authored-by: albanD <desmaison.alban@gmail.com>
Don't push to https://ghcr.io/ on the release branch: we don't need it, and it fails with "unauthorized: unauthenticated: User cannot be authenticated with the token provided".
Fix hardcoded ROCm paths in `Caffe2Targets.cmake` (#136283)
Fixes #131701
Use CMake imported targets more consistently to eliminate hardcoded paths.
Here are the new relevant sections of Caffe2Targets.cmake:
```
set_target_properties(c10_hip PROPERTIES
INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include"
INTERFACE_LINK_LIBRARIES "c10;hip::amdhip64"
)
```
```
set_target_properties(torch_hip PROPERTIES
INTERFACE_COMPILE_DEFINITIONS "USE_C10D_NCCL"
INTERFACE_COMPILE_OPTIONS "-fPIC;-D__HIP_PLATFORM_AMD__=1;-DCUDA_HAS_FP16=1;-DUSE_ROCM;-D__HIP_NO_HALF_OPERATORS__=1;-D__HIP_NO_HALF_CONVERSIONS__=1;-DTORCH_HIP_VERSION=602;-Wno-shift-count-negative;-Wno-shift-count-overflow;-Wno-duplicate-decl-specifier;-DCAFFE2_USE_MIOPEN;-DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP;-std=c++17;-DHIPBLAS_V2;-DHIP_NEW_TYPE_ENUMS"
INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include"
INTERFACE_LINK_LIBRARIES "c10_hip;torch_cpu_library;hip::amdhip64;MIOpen;hiprtc::hiprtc;roc::hipblaslt;roc::hipblas;hip::hipfft;hip::hiprand;roc::hipsparse;roc::hipsolver"
)
```
The HIPCUB dependency was not actually used, which is why it is removed here; the imported target had undesirable side effects.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136283
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007, https://github.com/jithunnair-amd, https://github.com/atalman
(cherry picked from commit e8f1dd6ba0675d3e11808e5198b0d927278a6f91)
Co-authored-by: Nichols A. Romero <nick.romero@amd.com>
# Motivation
fix https://github.com/pytorch/pytorch/issues/135726
After merging two free blocks, I made the mistake of ignoring the correct size when decreasing the active memory size; it should be the original block size instead of the merged block size.
# Additional Context
Add a UT to guard this scenario.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135818
Approved by: https://github.com/EikanWang
(cherry picked from commit e6b68359d7c86aff25eefe77e0774c02b38f44b4)
[ONNX] Fix numpy method to return the correct type (#136162)
The previous implementation of the `numpy()` method returned `fp64` when the tensor was `fp32`. This is unexpected, but seems to be caused by calling `__array__(dtype=None)` on the numpy array. I updated the implementation to implement the `numpy()` method explicitly and added tests to guard the behavior.
This needs to be cherry-picked into torch 2.5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136162
Approved by: https://github.com/gramalingam, https://github.com/xadupre
(cherry picked from commit 67b14ce8bd9d4d0ad1920e57bc148644775646ac)
[inductor] [cpp] fix the input contiguous check in max-autotune (#134982)
## Description
Fixes the FP32 accuracy failure of `resmlp_12_224` and BF16 accuracy failure of `volo_d1_224` in timm.
In this PR, we check whether the input is contiguous in the following way:
If it has a `FixedLayout`, we know the accurate strides. For a `FlexibleLayout`, if its data is a `ComputedBuffer`, we can get the fill order of the buffer to decide whether it's contiguous. For the other cases, we won't use the GEMM template, as we can't infer whether the input is contiguous.
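A loose, illustrative sketch of that decision (the helper and attribute names below are stand-ins, not Inductor's actual API):
```python
def can_use_gemm_template(layout) -> bool:
    if layout.is_fixed:
        # FixedLayout: the strides are known exactly.
        return layout.stride[-1] == 1
    if layout.is_flexible and layout.data_is_computed_buffer:
        # FlexibleLayout backed by a ComputedBuffer: infer contiguity from the
        # buffer's fill order instead of trusting not-yet-frozen strides.
        return layout.fill_order_is_contiguous
    # Otherwise we can't tell, so don't use the GEMM template.
    return False
```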
## Additional context
The current GEMM template only supports this case: `input.get_stride()[-1] == 1`. In `resmlp_12_224`, when we run into this check, the layout of `input` is a `FlexibleLayout`. The reason is that when realizing the input which is a `View` IR, the `convert_to_reinterpret_view` call fails:
d14fe3ffed/torch/_inductor/ir.py (L4712-L4715)
And it finally runs into this `copy_input` and returns a `FlexibleLayout`.
d14fe3ffed/torch/_inductor/ir.py (L4722)
When checking its stride, this `FlexibleLayout` indeed satisfies `input.get_stride()[-1] == 1`, but it is later decided to be a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`, which is not supported by the GEMM template, thus causing the accuracy issue in this model.
The `FlexibleLayout` is converted to `FixedLayout` during [CppPackedGemmTemplate.add_choices](d14fe3ffed/torch/_inductor/mkldnn_lowerings.py (L1051)) which calls [slice_nd](d14fe3ffed/torch/_inductor/codegen/cpp_template_kernel.py (L150)) when rendering the kernel (`slice_nd(X)`). When creating the `SliceView` IR, [as_storage_and_layout](d14fe3ffed/torch/_inductor/ir.py (L2288)) invokes
[decide_layout](d14fe3ffed/torch/_inductor/ir.py (L2135)) and converts it to a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134982
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
reland of https://github.com/pytorch/pytorch/pull/133113
I had to create a new PR because the previous reverted PR could be neither rebased nor imported successfully :(
----
Moving DTensor to be in the public namespace, to formally add the documentation page that includes all the public APIs. This includes:
* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next PRs)
* To preserve BC for users still using torch.distributed._tensor, I added a shim script to redirect old path calls to the new module
The BC preservation is evidenced by the fact that all DTensor tests still pass without changing the public imports, so it's safe to land the changes.
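A quick sketch of what BC preservation means in practice (assuming the shim keeps the old module importable):
```python
# old, private path (kept working via the shim)
from torch.distributed._tensor import DTensor as LegacyDTensor
# new, public path
from torch.distributed.tensor import DTensor

assert LegacyDTensor is DTensor  # both names should resolve to the same class
```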
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203
Approved by: https://github.com/tianyu-l
Use oneDNN BRGEMM on packed data to get better performance on the 5th generation of Xeon where Intel® Advanced Matrix Extensions (AMX) will have fp16 support, e.g. amx-fp16.
Multiple models have achieved acceleration, for instance, FP16 stable diffusion v2.1 has achieved over 50% improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131879
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #131878
Summary: BrokenProcessPool means a parallel-compile subprocess exited, which we never expect. It's likely due to a crash, so print a more meaningful error message and instructions that it's probably easier to debug by turning off parallel compile. Output looks like:
```
...
File "/data/users/slarsen/pytorch/torch/_inductor/runtime/compile_tasks.py", line 45, in _reload_python_module
exec(code, mod.__dict__, mod.__dict__)
File "/tmp/torchinductor_slarsen/4q/c4qw7xk5lbb7whg5txnk4hwbc7z6kepak3o666tr3d64gcad5r5b.py", line 815, in <module>
async_compile.wait(globals())
File "/data/users/slarsen/pytorch/torch/_inductor/async_compile.py", line 265, in wait
raise RuntimeError(
RuntimeError: A compilation subprocess exited unexpectedly. This is likely due to a crash. To facilitate debugging, you can re-run with TORCHINDUCTOR_COMPILE_THREADS=1 to cause compilation to occur in the main process.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135120
Approved by: https://github.com/Chillee
Sync with https://github.com/justinchuby/torch-onnx/compare/v0.1.20...v0.1.21 to support FakeTensors in ONNXProgram. Specifically, this PR implements the `apply_weights` method to allow users to supply a dictionary of concrete tensors to replace FakeTensors in the exported model weights.
An error is raised when users try to serialize a FakeTensor to avoid segfaults.
Also fixed a bug in `.save()` when `keep_initializers_as_inputs` is True and `include_initializers` is False.
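A hedged usage sketch (assumes `onnx_program` was produced by exporting under FakeTensorMode and that `model` holds real weights):
```python
# Replace the FakeTensor initializers with concrete tensors before saving.
onnx_program.apply_weights(model.state_dict())
onnx_program.save("model.onnx")
```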
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135399
Approved by: https://github.com/titaiwangms
Previously, when an input contains a mixture of `Value` and python constants like `[SymbolicTensor('sym_size_int_3', type=Tensor(INT64), shape=[], producer=node_Shape_0, index=0), 512]`, we get errors like
```pytb
Traceback (most recent call last):
File "/Users/justinc/Documents/GitHub/torch-onnx/src/torch_onnx/_building.py", line 367, in _call_op
converted_named_inputs = _process_python_constants_and_sequences(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/justinc/Documents/GitHub/torch-onnx/src/torch_onnx/_building.py", line 275, in _process_python_constants_and_sequences
raise TypeError(
TypeError: Constant input '[SymbolicTensor('sym_size_int_3', type=Tensor(INT64), shape=[], producer=node_Shape_0, index=0), 512]' of type '<class 'list'>' is not supported
```
This PR updates Sequence handling to support this case, as well as variadic inputs and ONNX Sequence inputs.
Synced from https://github.com/justinchuby/torch-onnx/pull/187
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135378
Approved by: https://github.com/titaiwangms
This is the OSS component of a larger MTIA diff.
Currently, Inductor disables padding for non-GPU devices. We need to change this behavior to enable padding on MTIA.
This PR adds a config option to enable padding on the CPU, or any other non-GPU device. In the future, we might want to enable padding on all devices by default. However, that might require supporting device-dependent padding defaults, since CPUs will likely use different settings than H100 GPUs.
Differential Revision: D61038114
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135280
Approved by: https://github.com/jfix71, https://github.com/shunting314
Refactor exporter errors to combine old errors and new errors for API consistency.
This PR also
1. Removes the `_C._check_onnx_proto(proto)` call in the old exporter. We don't need the ONNX checker because it is limited.
2. Removes the `OnnxExporterError` defined in the dynamo module. This class unnecessarily stores the ONNX program object, making it very bulky. Instead, we revert to using the plain OnnxExporterError defined in the `errors` module and use it as the base class for all errors.
3. Continues to expose `OnnxExporterError` in `torch.onnx` and the rest of the errors in `torch.onnx.errors`.
4. Removes the `CheckerError` and `InvalidExportOptionsError` from `torch.onnx`. This is BC breaking but should have low impact.
5. I did not rename existing errors out of compatibility considerations, even though `ExporterError` would have been more succinct.
Fixes https://github.com/pytorch/pytorch/issues/135125
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135180
Approved by: https://github.com/titaiwangms
Before the fix, the unit test would fail during forward Dynamo tracing:
```
File "/data/users/willfeng/pytorch/test/distributed/_composable/test_replicate_with_compiler.py", line 415, in test_ddp_tp
loss = compiled_replicate_model(data).sum()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
torch._dynamo.exc.InternalTorchDynamoError: SymNodeVariable() is not a constant
from user code:
File "/data/users/willfeng/pytorch/torch/distributed/tensor/parallel/_data_parallel_utils.py", line 34, in _unflatten_tensor
result = DTensor.from_local(
```
After the fix, the compilation fails at a later step (Compiled Autograd tracing), due to needing "pre-dispatch tracing of backward graph" feature (see details at https://github.com/pytorch/pytorch/issues/127797#issuecomment-2291695474).
I believe this PR is a net improvement, because it should also fix the 1D Traceable FSDP2 failure case on internal models (https://github.com/pytorch/pytorch/issues/130978#issuecomment-2319476690), which is much harder to build a minimal unit test for.
Fixes https://github.com/pytorch/pytorch/issues/130978.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135315
Approved by: https://github.com/bdhirsh
Summary: In new export_for_training, "stack_trace" does not exist in node meta anymore.
Test Plan:
```
buck run fbcode//mode/dev-nosan fbcode//caffe2/test:quantization_pt2e -- -r test_constant_prop_preserve_metadata
```
Reviewed By: angelayi
Differential Revision: D62219974
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135268
Approved by: https://github.com/angelayi
The current test is failing because of the currently unstable state of map: torch.compile and non-strict export take two separate routes, unlike cond and while_loop. This PR fixes the test itself. We'll fix map in follow-up PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135366
Approved by: https://github.com/angelayi
This replaces the existing TCPStore counters with the new shared wait counters. There are no users of the TCPStore counters, so this should be completely safe to remove.
Test plan:
Existing tests + build
There's no OSS backend for wait counters, so we can't write any tests with them currently.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135283
Approved by: https://github.com/c-p-i-o
Hi,
I noticed the `unfold` operator was missing on MaskedTensor.
I tested that my change works when calling unfold and backward on a `MaskedTensor`, but I didn't find the tests for the dispatch of such operations. Where are they?
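For reference, a minimal sketch of the newly supported call (assuming the prototype `torch.masked` API):
```python
import torch
from torch.masked import masked_tensor

data = torch.arange(6.0)
mask = torch.tensor([True, False, True, True, False, True])
mt = masked_tensor(data, mask)
print(mt.unfold(0, 2, 2))  # previously not dispatched for MaskedTensor
```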
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262
Approved by: https://github.com/cpuhrsch
Summary:
There was a regression introduced in https://github.com/pytorch/pytorch/pull/125743 that made `local_addr` no longer used. This fixes that by passing `local_addr` to `RendezvousStoreInfo.build` everywhere it's used.
This also fixes a number of tests allowing them to be run in parallel which hugely sped up the testing cycle as this change touches many different rendezvous implementations. This required a few fixes in unrelated tests.
Test Plan:
Added tests for the common rendezvous implementations covering `local_addr`, to prevent future regressions.
```
buck2 test @//mode/dev-nosan fbcode//caffe2/test/distributed/elastic/... fbcode//caffe2/torch/distributed/elastic/... -- --stress-runs 3
```
To vet the parallelism changes I also ran with 3 stress runs each to identify flakiness caused by parallelism.
Differential Revision: D62256407
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135262
Approved by: https://github.com/fduwjj, https://github.com/wz337
Summary: In general I think it will be useful to also record the global torch version in the EP, so that we can track them in the logging in addition to the schema version.
Test Plan: CI
Reviewed By: henryoier
Differential Revision: D62252626
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135243
Approved by: https://github.com/yushangdi
Summary:
resnet152 spent about 15 minutes writing warning messages in _unlift
during `to_executorch` because they're all written to unbuffered stderr
by the `warnings` module.
These warnings are almost always about get_attr nodes referencing a
non-existent name:
```python
warnings.warn(f'Node {node} target {node.target} {atom} of {seen_qualname} does '
              'not reference an nn.Module, nn.Parameter, or buffer, which is '
              'what \'get_attr\' Nodes typically target'
              )
```
I'm not aware of a way to configure the warnings module to write this out
at most once, so I'm just going to disable the lint for now.
Test Plan:
Re-ran resnet152 with Executorch and the XNNPackBackend, it is much faster now
Differential Revision: D62156090
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135069
Approved by: https://github.com/yushangdi
By default, Inductor is allowed to manipulate the layout
(strides+storage offset) of input tensors to custom operators.
We want to change it so that the default is that Inductor should respect
the stride order of input tensors to custom operators.
This PR adds a config to toggle the behavior, in the next PR up we'll
change the default. We also make the following changes:
- We add a new operator Tag (flexible_layout), which means that
inductor is allowed to manipulate the layout. When we flip the default,
users can specify they want the old behavior by using this tag.
This is a reland of https://github.com/pytorch/pytorch/pull/126986,
which was previously reverted due to silent incorrectness. We've since
fixed the silent incorrectness
(https://github.com/pytorch/pytorch/pull/133639)
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135238
Approved by: https://github.com/albanD
Summary:
In multiprocessing, signal handling is not possible if the thread is not the main thread. This resulted in the following error:
> "ValueError('signal only works in main thread of the main interpreter')"
To address this issue, the diff checks whether the thread is the main thread and, if not, skips signal handling.
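A minimal sketch of the guard being described (plain stdlib, not the exact internal diff):
```python
import signal
import threading

def register_handlers(handlers):
    # Signal handlers can only be installed from the main thread of the main
    # interpreter; skip registration elsewhere instead of raising ValueError.
    if threading.current_thread() is not threading.main_thread():
        return
    for sig, handler in handlers.items():
        signal.signal(sig, handler)
```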
Test Plan:
Before this change, MAST job failed:
https://fburl.com/mlhub/iq2m10v8
With this change, MAST job succeeded:
https://fburl.com/mlhub/q6kb8343
Differential Revision: D62166943
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135088
Approved by: https://github.com/d4l3k
SplitScan makes use of a workspace arg that needs to be zeroed before it is used; it is then used to communicate between thread blocks in the Triton kernel implementation. It is mutated during the execution of the kernel, so it should be marked as such.
Before this PR, it is not marked as mutated; AFAIK this is fine during normal execution, but during autotuning it causes problems. The workspace starts off zeroed (as expected), but during autotuning the kernel will be executed multiple times and the workspace does not get reset between executions, resulting in incorrect data. If the data is used for indexing, then you can fail device-side asserts (and the results after the initial run (with autotuning) could be wrong). The test added in this PR repros the issue when the fix is removed.
When we mark the arg as mutated, the arg gets cloned before autotuning, so that the arg passed to the kernel during autotuning will always be zeroed as expected.
804852c1f9/torch/_inductor/runtime/triton_heuristics.py (L685-L689)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134648
Approved by: https://github.com/peterbell10, https://github.com/jansel
This PR is slightly a revival / update to the discussion from https://github.com/pytorch/pytorch/pull/98960:
Part of FSDP2's tracing strategy right now is that:
(1) it is painful/difficult to handle the case where we have multiple graph input tensors that are aliased to each other and at least one of them is duplicated
(2) we already have longstanding in logic to remove duplicate input tensors from the graph in dynamo. Morally, FSDP2 gives us duplicate input tensors in the backward graph for every `unsharded_param`, because we have (a) the `unsharded_param` being closed over by the backward hook to resize/allgather, and (b) the same `unsharded_param` being saved for backward by autograd (we now guarantee in the partitioner that we will always save the base tensor for backward and recompute views)
(3) However, we were still seeing cases where the `unsharded_param` showed up twice in the backward graph inputs, as distinct tensor objects (with different python ids) instead of being true duplicates that dynamo can de-dup.
It turns out that this was because we were `.detach()`ing the `unsharded_param` in AOTDispatcher before plumbing it through the compiled forward (and so autograd would save a detach'd version of the `unsharded_param`). This is precisely because of the logic from https://github.com/pytorch/pytorch/pull/98960.
However, re-reading the detailed comments, it seems unnecessary to do a detach() on a graph input that is a (leaf) `nn.Parameter`, even if it happens to get no gradients in the backward. Since it is a leaf, we don't have to worry about the autograd engine "continuing to backprop through the graph beyond the current tensor" (the leaf has no other grad_fn for autograd to backprop through).
So this PR makes us a bit less aggressive about calling detach() on inputs: we only do it when:
(1) our graph input statically will get a `None` gradient (and also has no metadata mutations, the existing state)
(2) **and** our graph input is a non-leaf tensor (so detach()ing is actually required to prevent autograd from incorrectly backpropping past the non-leaf).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134193
Approved by: https://github.com/yf225
Co-authored-by: Will Feng <yf225@cornell.edu>
Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7705964779531357/
This now also incorporates a test from https://github.com/pytorch/pytorch/pull/133585 (which it fixes) and the prep PR https://github.com/pytorch/pytorch/pull/134407 Including the PR desc from that:
I am trying to fix a problem reported by user in [fb.workplace.com/groups/6829516587176185/permalink/7705964779531357](https://fb.workplace.com/groups/6829516587176185/permalink/7705964779531357/) The summary of this problem is that when we do collect metadata analysis in AOTAutograd, we accumulate pending unbacked symbols which are going to be discarded at the end of the trace. However, if we do a recursive make_fx inside tracing, as occurs with torch.cond, we end up seeing that there are pending unbacked symbols that aren't associated with a binding, even though it's spurious (they've leaked into the inner make_fx call from the outer AOTAutograd analysis).
In https://github.com/pytorch/pytorch/pull/133588 I tried to just prevent adding the symbols to the pending list at all in the first place. But this itself caused some problems which were fixed in https://github.com/pytorch/pytorch/pull/124785 . The problem fixed in that PR is that when we allocate tangents that have unbacked size, something prevented them from having correct unbacked SymInts when ignore fresh unbacked SymInts was enabled. So I had patched it at the time by just not suppressing pending symbols and clearing them out some other way.
I think... I was wrong in that PR? That is to say, it was OK to avoid putting the fresh unbacked symbols in the pending list; the real problem was suppressing unbacked renamings. But there doesn't seem to be a good reason to suppress these; this PR shows that it doesn't actually fail any tests if you do these anyway. Intuitively, this makes sense, because you can't trigger renamings unless you're actually adding unbacked symbols to the pending set.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135053
Approved by: https://github.com/ydwu4
Fixes https://github.com/pytorch/pytorch/issues/114389
Previously, dynamo would attempt to trace through the `__init__` of traceable tensor subclasses, since their constructors are AOT dispatcher traceable by definition, dynamo should automatically put these in the graph like we do for any other tensors. Not doing this is difficult because dynamo would need to apply mutations post tensor subclass creation in the graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135151
Approved by: https://github.com/bdhirsh
Summary:
In the graph of the TestXNNPACKQuantizer.test_dynamic_linear_with_conv test, some quantized_decomposed.quantize_per_tensor.default ops are becoming quantized_decomposed.dequantize_per_tensor.tensor ops when using the new training IR.
This is because we lift params/buffers before calling make_fx. Previously, for the graph that's passed to make_fx, `graph.L__self___linear1.weight` was a tensor;
now, in the training IR, graph.L__self___linear1.weight is a FakeTensor. This caused the node overload to be different.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_dynamic_linear_with_conv
```
Differential Revision: D61364547
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134525
Approved by: https://github.com/tugsbayasgalan, https://github.com/jerryzh168
Summary:
D62215095 introduced an import error in arvr pipelines, as the is_fbcode() function does not work as intended.
This change makes is_fbcode() a much stricter check.
Test Plan:
```
buck2 run arvr/mode/platform010/opt-stripped //arvr/libraries/depthlink/clients/mr_replay:pipeline_runner -c bolt.use_eva3_sim=True -- --config_file arvr/libraries/depthlink/clients/mr_replay/configs/runner_config.yaml --features DEPTH
```
Differential Revision: D62237502
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135244
Approved by: https://github.com/aorenste
Migrate function calls in tests to eliminate the warning message below and reduce the chance of test failures when the deprecated methods are removed (a small migration sketch follows the warning):
- from the deprecated `save_state_dict` to `save`
- from the deprecated `load_state_dict` to `load`
Warning message:
```bash
pytorch/test/distributed/checkpoint/test_fsdp_model_state.py:37: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.
```
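A small sketch of the migration (the `writer`/`reader` objects are illustrative placeholders):
```python
import torch.distributed.checkpoint as dcp

# before (deprecated):
#   dcp.save_state_dict(state_dict, storage_writer=writer)
#   dcp.load_state_dict(state_dict, storage_reader=reader)

# after:
dcp.save(state_dict, storage_writer=writer)
dcp.load(state_dict, storage_reader=reader)
```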
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134938
Approved by: https://github.com/wz337, https://github.com/fegin
**Summary**
Fix issue https://github.com/pytorch/pytorch/issues/134998: previously, we only checked whether the `get_attr` FX node for the weight had a single user node. However, two `get_attr` nodes may share the same tensor, and the tensor should not be deleted in such cases. In this PR, we add the count of users of the tensor, along with the number of users of the nodes, to decide whether the tensor can be deleted.
**TestPlan**
```
python test/inductor/test_cpu_select_algorithm.py -k test_linear_wgt_multi_users
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135100
Approved by: https://github.com/jgong5
The error was hard to understand without a message. Render it now. See https://github.com/pytorch/pytorch/pull/135259 for it in action.
Example failure:
```
2024-09-05T20:04:45.3022000Z FAILED [5.9524s] test_public_bindings.py::TestPublicBindings::test_modules_can_be_imported - AssertionError: String comparison failed: '' != "torch._logging.scribe failed to import w[112 chars].py)"
2024-09-05T20:04:45.3025413Z + torch._logging.scribe failed to import with error ImportError: cannot import name 'TypeAlias' from 'typing' (/opt/conda/envs/py_3.9/lib/python3.9/typing.py)
2024-09-05T20:04:45.3026990Z
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135258
Approved by: https://github.com/albanD
For aarch64 Neoverse platforms there are two GEMM backends available
for the matmul operator in PyTorch: (1) Arm Compute Library and (2) OpenBLAS.
While Arm Compute Library provides better performance than OpenBLAS,
it has kernel launch time overhead, and hence we use OpenBLAS
for smaller tensor compute. The heuristic was originally implemented for
neoverse_v1. This commit extends the heuristic to other Neoverse platforms.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134548
Approved by: https://github.com/malfet
Summary: Users have recently asked that the profiler add self/total CPU and device percentages to FunctionEvents so that teams can process the data procedurally. Some of it could be done mathematically via subroutines, but since we already have the information in _build_table, let's build it there.
Test Plan: Check that we produce the same table as before, and also check that the parameters we inspect have the expected values.
Differential Revision: D62210351
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135155
Approved by: https://github.com/shanw-meta, https://github.com/kit1980
The idea behind the tracking is the following: whenever we see a tensor, if the tensor is a root tensor (it does not have any view metas), we consider it the base of all the tensors that share its storage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135141
Approved by: https://github.com/zou3519
We found a corner case where, when a tensor dimension is 1, calling `view` on it would result in an unexpected replication (see case 1 below). When the tensor dimension to shard is not 1, no matter whether the tensor dimension is evenly shardable across the mesh dimension, it won't cause an implicit replication behind the scenes as long as view doesn't change the size of the given tensor dimension (see cases 2 and 3).
When the tensor dimension to shard is of size 1, it is not being added to shardable_dims here:
https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/ops/_view_ops.py#L518
```
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Shard

# uneven case where the size of the tensor dimension to shard is 1
p = torch.randn(1, 2)
mesh = init_device_mesh("cuda", (2,))
dtensor = distribute_tensor(p, mesh, [Shard(0)])
t = dtensor.view(1, 2)
# this would result in replication, meaning t is now replicated across all ranks.

# uneven case where the size of the tensor dimension to shard is not 1
p = torch.randn(3, 2)
mesh = init_device_mesh("cuda", (2,))
dtensor = distribute_tensor(p, mesh, [Shard(0)])
t = dtensor.view(3, 2)
# this would not result in replication, meaning t stays sharded.

# even case
p = torch.randn(2, 2)
dtensor = distribute_tensor(p, mesh, [Shard(0)])
t = dtensor.view(2, 2)
# this would not result in replication, meaning t stays sharded.
```
Differential Revision: [D62155606](https://our.internmc.facebook.com/intern/diff/D62155606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135054
Approved by: https://github.com/tianyu-l, https://github.com/wanchaol
Summary:
When exporting for training with `tolist`, we do not hit `FunctionalTensor.tolist` since we do not functionalize. Unfortunately, this means we hit `FakeTensor.tolist`, which creates unbacked symints that are not backed by proxies.
Rather than trying to patch up this low-level implementation, we replace it with essentially what `FunctionalTensor.tolist` does, which is higher-level: we essentially desugar to `item()` calls and let it take care of unbacked symints.
Test Plan:
Some expected failures are gone now.
Also found a test for `tolist` that was written when `FunctionalTensor.tolist` was implemented but not really doing much; repurposed it now to exercise more modes.
Differential Revision: D62197742
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135131
Approved by: https://github.com/ezyang
See https://github.com/pytorch/pytorch/pull/135138 for a usage example. Meta only, see https://docs.google.com/document/d/1JpbAQvRhTmuxjnKKjT7qq57dsnV84nxSLpWJo1abJuE/edit#heading=h.9wi46k7np6xw for context
fbscribelogger is a library that allows us to write to Scribe, which is Meta's logging infrastructure, when you have an appropriate access token (this token is available for jobs running on main, as well as authorized jobs with the ci-scribe label). The resulting data is accessible via Scuba (a real-time in-memory database) and Hive (a more traditional SQL persisted database).
Here's the motivating use case. Suppose there is somewhere in PyTorch's codebase where you'd like to log an event, and then you'd like to find all the situations where this log is called. If PyTorch is rolled out to our internal users, we have some FB-oriented APIs (like torch._utils_internal.signpost_event) with which you can do this. But you have to actually land your PR to main, wait for it to be ingested into fbcode, and then wait for us to actually roll out this version, before you get any data. But what if you want the results within the next few hours? Instead, you can use torch._logging.scribe to directly write to our logging infrastructure *from inside CI jobs.* The most convenient approach is to log unstructured JSON blobs to `open_source_signpost` (added in this PR; you can also add your own dedicated table as described in the GDoc above). After adding logging code, you can push your PR to CI, add the 'ci-scribe' label, and in a few hours view the results in Scuba, e.g., (Meta-only) https://fburl.com/scuba/torch_open_source_signpost/z2mq8o4l If you want continuous logging on all commits on master, you can land your PR and it will continuously get logged for all CI runs that happen on main.
Eventually, if your dataset is important enough, you can consider collaborating with PyTorch Dev Infra to get the data collected in our public AWS cloud so that OSS users can view it without access to Meta's internal systems. But this facility is really good for prototyping / one-off experiments. It's entirely self-serve: just add your logging, run your PR CI with ci-scribe, get results, do analysis in Scuba.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135224
Approved by: https://github.com/Skylion007
This enables inductor micro benchmark on CPU (x86):
* Running on AWS metal runner for more accurate benchmark
* I add a new `arch` column, which will be either x86_64 or arm64 for CPU, or the GPU name for GPU. We can use this later to differentiate between different setups, i.e. cuda (a100) vs cuda (a10g), or cpu (x86_64) vs cpu (arm64)
The next step would be to run this on cpu arm64 and cuda (a10g).
### Testing
Here is the CSV results from my test run https://github.com/pytorch/pytorch/actions/runs/10709344180
```
name,metric,target,actual,dtype,device,arch,is_model
mlp_layer_norm_gelu,flops_utilization,0.8,17.36,bfloat16,cpu,x86_64,False
gather_gemv,memory_bandwidth(GB/s),990,170.80,int8,cpu,x86_64,False
gather_gemv,memory_bandwidth(GB/s),1060,204.78,bfloat16,cpu,x86_64,False
Mixtral-8x7B-v0.1,token_per_sec,175,26.68,int8,cpu,x86_64,True
Mixtral-8x7B-v0.1,memory_bandwidth(GB/s),1130,171.91,int8,cpu,x86_64,True
Mixtral-8x7B-v0.1,compilation_time(s),162,47.36,int8,cpu,x86_64,True
gemv,memory_bandwidth(GB/s),870,236.36,int8,cpu,x86_64,False
gemv,memory_bandwidth(GB/s),990,305.71,bfloat16,cpu,x86_64,False
Llama-2-7b-chat-hf,token_per_sec,94,14.01,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),1253,185.18,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,compilation_time(s),162,74.99,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,token_per_sec,144,25.09,int8,cpu,x86_64,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),957,165.83,int8,cpu,x86_64,True
Llama-2-7b-chat-hf,compilation_time(s),172,70.69,int8,cpu,x86_64,True
layer_norm,memory_bandwidth(GB/s),950,172.03,bfloat16,cpu,x86_64,False
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135042
Approved by: https://github.com/yanboliang
If an `auto_functionalized` HOP is included in backward graph due to activation checkpointing, we will run into a scenario where Compiled Autograd Dynamo tracing will need to trace through the `auto_functionalized` HOP. This PR adds support for it.
Test commands:
- `pytest -rA test/inductor/test_compiled_autograd.py::TestCompiledAutograd::test_trace_auto_functionalized`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135169
Approved by: https://github.com/zou3519
Summary:
Fixed some quantization tests for the new training IR:
Fix the batch norm node pattern matcher. In the training IR, we have an `aten.batch_norm` node instead of `aten._native_batch_norm_legit` and `aten._native_batch_norm_legit_no_training`.
Test Plan:
```
buck run fbcode//mode/dev-nosan fbcode//caffe2/test:quantization_pt2e
```
Reviewed By: tugsbayasgalan
Differential Revision: D62209819
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135184
Approved by: https://github.com/tugsbayasgalan
Summary:
Added the contextmanager `_disable_interpreter`, which is meant to be put around a call to `unflatten`. This will generate an UnflattenedModule and sub-InterpreterModules which will not use torch.fx.Interpreter to run eagerly. We want to have this as a state of the module instead of a contextmanager around running the module because it's not clear where we are calling the unflattened module.
This seems to improve the performance: https://fb.workplace.com/groups/1075192433118967/posts/1473590629945810/?comment_id=1473621763276030
Test Plan: CI
Differential Revision: D60939034
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133996
Approved by: https://github.com/pianpwk
We should not try to do ConstProp on unrecognized types (e.g. subclasses).
For those types, returning NotImplemented will jump to the next torch_dispatch.
Test:
```
python test/functorch/test_aotdispatch.py -k test_aot_test_subclasses_with_tensor_factories
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135033
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
## Semantic
The semantic is
(1) By default `torch.serialization.skip_data(materialize_fake_tensors=False)` will make `torch.save` skip writing storages (but reserve space for them in the checkpoint).
```python
import torch
import torch.nn as nn

sd = nn.Linear(3, 5).state_dict()
with torch.serialization.skip_data():
    torch.save(sd, 'foo.pt')
print(torch.load('foo.pt', weights_only=True))
```
(2) With `torch.serialization.skip_data(materialize_fake_tensors=True)`, if a FakeTensor is passed to `torch.save`, the pickler will treat these FakeTensors as being "materialized" (space will be reserved in the checkpoint for the associated storage bytes, and when loading, the type will be Tensor instead of FakeTensor).
```python
import torch
import torch.nn as nn
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    m = nn.Linear(3, 5, dtype=torch.float16, device='cuda')
    sd = m.state_dict()
with torch.serialization.skip_data(materialize_fake_tensors=True):
    torch.save(sd, 'bla.pt')
print(torch.load('bla.pt', weights_only=True))
# OrderedDict([('weight', tensor([[0., 0., 0.],
#         [0., 0., 0.],
#         [0., 0., 0.],
#         [0., 0., 0.],
#         [0., 0., 0.]], device='cuda:0', dtype=torch.float16)), ('bias', tensor([0., 0., 0., 0., 0.], device='cuda:0', dtype=torch.float16))])
```
## Follow Ups
- [ ] `torch.load` semantic for skip_data context manager
- [ ] Mechanism for getting offsets of storages saved via this method (for writing in a separate pass)
Differential Revision: [D62238610](https://our.internmc.facebook.com/intern/diff/D62238610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134504
Approved by: https://github.com/albanD
For example, if I do TORCH_LOGS=fbscribelogger I'll get:
```
I0904 17:59:07.567000 3672513 fbscribelogger/__init__.py:161] stop
```
instead of
```
I0904 12:46:15.332000 2930287 ../../../../../home/ezyang/local/a/pytorch-env/lib/python3.10/site-packages/fbscribelogger/__init__.py:161] stop
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135165
Approved by: https://github.com/Skylion007
Solve the request [here](https://github.com/pytorch/pytorch/issues/120003#issuecomment-2248805798).
Enable DTensor input in the gradient scaler's APIs, especially `.unscale_()`.
A related dispatch strategy is added to accept DTensor input.
To enable found_inf to perform the reduce action across devices, we add an allreduce at dispatch, with the args, after the dispatch strategy and kernel.
Since `aten._amp_foreach_non_finite_check_and_unscale_.default` is an in-place op, grad_scale as arg[0] will be modified in place, so redesigning a strategy or refactoring the kernel would not help.
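A hedged end-to-end sketch of what this enables (the standard AMP pattern; assumes the model, optimizer, and loss produce DTensor gradients):
```python
import torch

scaler = torch.amp.GradScaler("cuda")
scaler.scale(loss).backward()   # gradients may be DTensors
scaler.unscale_(optimizer)      # found_inf is now reduced across devices
scaler.step(optimizer)
scaler.update()
```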
The test files test two setups, 1-D (DP) and 2-D (DP, TP), and check:
1. whether the non-inf values are unscaled
2. whether all DTensors on each device can detect inf, even when it is not on their device
3. if inf is not found, whether new parameters are generated
4. if inf is found, whether the scale is updated
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132816
Approved by: https://github.com/XilunWu, https://github.com/weifengpy, https://github.com/wanchaol
Differential Revision: D61506212
Use `skipCUDAIf` from `torch.testing._internal.common_device_type` if we create the test class with `instantiate_device_type_tests`.
`instantiate_device_type_tests` would make sure the class has attr device_type, which works with`skipCUDAIf` from `torch.testing._internal.common_device_type`.
Also skipping test_vertical_pointwise_reduction_fusion for the CPU test class, since the test expects CUDA:
FAILED [0.0026s] test/inductor/test_unbacked_symints.py::TestUnbackedSymintsCPU::test_vertical_pointwise_reduction_fusion_cpu - AttributeError: 'TestUnbackedSymintsCPU' object has no attribute 'device'
repro:
```
CUDA_VISIBLE_DEVICES="" pytest test/inductor/test_unbacked_symints.py -k cpu -v
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133936
Approved by: https://github.com/ColinPeppler, https://github.com/desertfire
## Summary
At the moment, the fake impl for `masked_select` simply sets the upper range of its size-like SymInt to `sys.maxsize` (9223372036854775807, the max value for a signed int64) if there are any SymInts in the original input tensor shape. This PR constrains the range more intelligently by using the upper ranges of each SymInt in the input tensor shape.
This solves an issue where a model being lowered to Executorch errors during memory planning, because the memory allocated for `masked_select` ended up exceeding the 64-bit address space (`INT_MAX * size(dtype)`).
## Test plan
- Passes existing unit tests (tests case where upper bound is inf)
- Added unit test to verify upper bound reduction calculation
- Tested end-to-end by exporting with TORCH_LOGS="export" and ensuring that the range for `masked_select`'s SymInt size has the correct upper bound
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134899
Approved by: https://github.com/ezyang
In C++, when a floating-point literal (e.g., 3.14) is compared with a variable of type float, the literal is by default interpreted as a double.
```c++
float f = 3.14f;
if (f == 3.14) {
  // Do something
}
```
If a device does not support double, an error will occur.
This PR addresses the issue of complex64 errors on machines that do not support double operations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134647
Approved by: https://github.com/EikanWang, https://github.com/albanD
We found that currently, we only pass one input and one output tensor to the function `collective`, and this causes the NaN check, work numel stats, and FR input/output sizes to be inaccurate for all-to-all, scatter, and reduce. So we want to let the collective take in a list of tensors to ensure it works for all collectives inside PGNCCL.
This partially reverts what we did in https://github.com/pytorch/pytorch/pull/119421, and down the road we will have another round of cleanup on the collective to make it cleaner. For now, at least for the sake of correctness, we changed it back.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135049
Approved by: https://github.com/kwen2501
Adds val, and optionally stack_trace & nn_module_stack metadata back to SymInt compute nodes that we CSE, with a hook on `graph.create_node()`. Not sure if there's other metadata we want to populate here?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134516
Approved by: https://github.com/ezyang
Fixes the FP32 accuracy failure of `levit_128` in timm.
Previously, we used `Y`, which is the output of the final epilogue node, to calculate the reindexer. We actually need to use each epilogue node to calculate the reindexer from the GEMM output to that epilogue node.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134984
Approved by: https://github.com/jgong5
Adds utility functions `_dump_dynamic_shapes` and `_load_dynamic_shapes`.
- `_dump_dynamic_shapes`: dynamic shapes spec -> serialized format
  - takes in the `dynamic_shapes` pytree object you'd feed into `export()`, and dumps it into the serialized format
- `_load_dynamic_shapes`: serialized format -> dynamic shapes spec
  - takes the serialized format, and produces a `dynamic_shapes` object you can feed into `export()`
For example with dumping:
```
import torch
from torch.export import Dim
from torch.export.dynamic_shapes import _dump_dynamic_shapes  # assumed location of the private helper

dx = Dim("dx", min=4, max=16)
dy = dx + 1
inputs = (
    [
        torch.randn(4, 4),
        torch.randn(5, 4),
    ],
    torch.randn(4),
    torch.randn(4, 4),
    "hello",
)
dynamic_shapes = {
    "a": [
        (dx, 4),
        (dy, 4),
    ],
    "b": (Dim.AUTO,),
    "c": None,
    "d": None,
}
out = _dump_dynamic_shapes(dynamic_shapes, inputs)
```
would generate the following output:
```
DynamicShapesSpec(
    dynamic_shapes=(
        [
            ['dx', 4],
            ['dx + 1', 4],
        ],
        ['_DimHint.STATIC'],
        ['_DimHint.STATIC', '_DimHint.STATIC'],
        None,
    ),
    dims={
        'dx': RootDim(
            min=4,
            max=16,
            derived=['dx + 1'],
        ),
    },
)
```
The serialized format contains 2 keys, `dynamic_shapes` and `dims`.
- `dynamic_shapes` is the pytree structure matching the input to `export()`, with strings in place of Dim names and enums, and ints/Nones otherwise. Each tensor is represented with a list of shapes, non-tensors with Nones.
- `dims` contain min/max range and derived dims info for each root dim.
The test cases show some roundtrippability guarantees for these functions. Definitely taking naming suggestions for them :)
Follow up: utility function to extract serializable format from ExportedProgram.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134718
Approved by: https://github.com/avikchaudhuri
Before this PR, when traceable FSDP2 + AC is run, an error would be thrown:
```
File "/data/users/willfeng/pytorch/torch/_dynamo/variables/builtin.py", line 1449, in call_getitem
return args[0].call_method(tx, "__getitem__", args[1:], kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 435, in call_method
return super().call_method(tx, name, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 392, in call_method
return super().call_method(tx, name, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 131, in call_method
return self.getitem_const(tx, value)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 106, in getitem_const
return self.items[index]
Error: Index out of bound
from user code:
File "<eval_with_key>.5", line 105, in forward
aot0_trace_wrapped = torch__dynamo__trace_wrapped_higher_order_op_self_invoke(aot0_tangents_1, bw_state = aot0_primals_34); aot0_tangents_1 = None
File "/data/users/willfeng/pytorch/torch/_dynamo/_trace_wrapped_higher_order_op.py", line 74, in self_invoke
return _trace_wrapped_op(*args, **dyn_kwargs, **kwargs)
File "/data/users/willfeng/pytorch/torch/_dynamo/external_utils.py", line 132, in call_hook_from_backward_state
return getattr(bw_state, hook_name)(*args, **kwargs)
File "/data/users/willfeng/pytorch/torch/distributed/_composable/fsdp/_fsdp_state.py", line 271, in _pre_backward
self._fsdp_param_group.pre_backward(default_prefetch)
File "/data/users/willfeng/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 332, in pre_backward
self._backward_prefetch()
File "/data/users/willfeng/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 417, in _backward_prefetch
target_fsdp_param_group = self.comm_ctx.post_forward_order[target_index]
```
Since it's okay to rely on the compiler to recover the "prefetching" pattern, we will skip this `_backward_prefetch()` code path during tracing to avoid the error, and have a compiler pass (in future PR) to achieve the equivalent prefetching overlap.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135163
Approved by: https://github.com/awgu
This PR is a supplement to #130082. The previous PR implemented the basic codegen functionality, but we found that it fails to handle the device-sameness check in many unit tests. The current PR adds the XPU device guard code generation.
With the current PR, the code snippet generated in `RegisterXPU.cpp` is as follows, where we can see the device guard is successfully generated.
```c++
namespace {
at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) {
  std::optional<Device> common_device = std::nullopt;
  (void)common_device; // Suppress unused variable warning
  c10::impl::check_and_update_common_device(common_device, out, "wrapper_XPU_Tensor_float_out_normal_out", "out");
  c10::impl::check_and_update_common_device(common_device, mean, "wrapper_XPU_Tensor_float_out_normal_out", "mean");
  const OptionalDeviceGuard device_guard(device_of(out));
  return at::native::normal_out(mean, std, generator, out);
}
} // anonymous namespace
```
By contrast, without the current change, the generated code is:
```c++
namespace {
at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) {
  // No device check
  // DeviceGuard omitted
  return at::native::normal_out(mean, std, generator, out);
}
} // anonymous namespace
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133980
Approved by: https://github.com/EikanWang, https://github.com/malfet
Resolves: https://github.com/pytorch/pytorch/pull/135029
When enabling mixed precision, FSDP casts input args to the desired dtype by calling `_apply_to_tensors`. When the input args contain a `dataclass(frozen=True)`, we hit the following runtime error because `_apply_to_tensors` uses `setattr`:
`dataclasses.FrozenInstanceError: cannot assign to field 'some_key'`. The fix is to use the dataclasses API `dataclasses.replace` instead.
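As a minimal illustration of the failure mode and the fix (the dataclass here is hypothetical):
```python
import dataclasses
import torch

@dataclasses.dataclass(frozen=True)
class Batch:
    some_key: torch.Tensor

b = Batch(some_key=torch.randn(2, 2))
# setattr(b, "some_key", b.some_key.half())  # raises dataclasses.FrozenInstanceError
b_half = dataclasses.replace(b, some_key=b.some_key.half())  # builds a new instance instead
```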
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135067
Approved by: https://github.com/awgu
Fixes #132715
The failure in #132715 is due to `autocast_dtype` being a thread-local variable. It causes inconsistencies between `get_autocast_dtype()` among different threads.
To be exact, here is what happens: the amp dtype is set to `bfloat16` on the main thread. The `backward` call runs on a side thread, so `at::autocast::prioritize` fails because `lower_precision_fp` defaults to `float16`:
6f738d6434/aten/src/ATen/autocast_mode.h (L221-L225)
This PR makes `autocast_dtype` thread-global so it is consistent across all threads of the forward and backward passes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133938
Approved by: https://github.com/soulitzer
# Description
This pipeline enables the CI build on Windows for PRs labeled with ciflow/xpu. It builds the torch binary with Torch XPU Operators on Windows using Visual Studio Build Tools 2022.
# Changes
1. Install-XPU batch file (install_xpu.bat) - checks whether the build machine has oneAPI in its environment and whether that version is the latest; if not, installs the latest publicly released oneAPI on the machine.
2. GHA callable pipeline (_win-build.yml) - sets vc_year and use_xpu as parameters to configure the wheel build environment.
3. GHA workflow (xpu.yml) - adds a new Windows build job and passes parameters to it.
4. Build wheels script (.ci/pytorch/win-test-helpers/build_pytorch.bat) - prepares the environment for building, e.g. installs the oneAPI bundle.
# Note
1. For building wheels on Intel GPU, you need Visual Studio Build Tools version >= 2022.
2. This pipeline requires Visual Studio Build Tools 2022 to build wheels. For now, we specify "windows.4xlarge.nonephemeral" as the build machine label in the yaml file. We will request self-hosted runners with an Intel GPU and Visual Studio Build Tools 2022 installed soon.
Work for #114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133151
Approved by: https://github.com/chuanqi129, https://github.com/atalman
Co-authored-by: chuanqiw <chuanqi.wang@intel.com>
Moved all the backward functions (`stage_backward_input`, `stage_backward_weight`, `stage_backward`) under the same `backward_maybe_with_nosync` function which controls the logic of the data parallel wrappers.
FSDP was not working with zero bubble PP because there will be twice as many "backward" calls and we update the weight gradients after `autograd.grad` is called. As a result, we need to manually call the FSDP `post_backward_hook()` after the weights have the correct gradients.
Fixes the tests:
`python test/distributed/_composable/test_composability/test_pp_composability.py ComposabilityTest.test_manual_with_data_parallel_dp_type_FSDP_ScheduleClass0_use_new_runtime_False`
`python test/distributed/_composable/test_composability/test_pp_composability.py ComposabilityTest.test_manual_with_data_parallel_dp_type_DDP_ScheduleClass0_use_new_runtime_False`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134052
Approved by: https://github.com/kwen2501
Fixes https://github.com/pytorch/pytorch/issues/133858
Details: Previously, Dynamo would treat dataclasses as UserDefinedVariables. This was undesirable when we want to proxy the value into the graph, which is needed for TensorSubclassMetadata. To rectify this, frozen dataclasses can now be proxied similarly to NamedTuples. We require the object to be frozen because, if arbitrary mutation were allowed, we would need to replay those mutations in the graph after construction of the object.
For tracing construction of the variable, the generated `__init__` for the dataclass uses `object.__setattr__` because frozen dataclasses throw errors on the usual `__setattr__` invocation. With this treatment, no special handling is needed in dynamo for frozen dataclass construction.
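A small sketch of the `object.__setattr__` point above (the class is made up):
```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class SubclassMeta:
    outer_size: tuple

m = SubclassMeta((2, 3))                     # the generated __init__ assigns fields via object.__setattr__
# m.outer_size = (4, 4)                      # a normal assignment raises dataclasses.FrozenInstanceError
object.__setattr__(m, "outer_size", (4, 4))  # the escape hatch the generated __init__ relies on
```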
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134846
Approved by: https://github.com/bdhirsh, https://github.com/anijain2305
Summary:
D62008788 added an extra parameter to the RawTensorMetadata struct. For some reason this causes some corrupted accesses in other tests as described in T200685032.
Once this is removed the tests pass. Going forward we need to document how to add parameters to this portion of the code as the AppendOnlyLists seem to be very rigid.
Test Plan: Ran all the tests locally and they all passed.
Differential Revision: D62171089
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135096
Approved by: https://github.com/aaronenyeshi
It's a bit surprising that the code added in Scheduler.fusable_read_and_write would increase compilation time.
Here are some numbers I got from an H100 on BertForMaskedLM:
- without the fix, cold start compilation time is around 82s
- with the fix, cold start compilation time is around 76s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135071
Approved by: https://github.com/jansel
Fixes https://github.com/pytorch/pytorch/issues/134798
In the regular Tensor case, when you call Tensor.data, there's a check for whether inference mode is active. If it is active, then we don't set the version counter. We replicate this check for Tensor subclasses (the bug was that we were trying to set the version counter on a FakeTensor in inference_mode).
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134878
Approved by: https://github.com/bdhirsh
- The new implementation (auto_functionalized_v2) is enabled by default but can be disabled using an inductor flag.
- In export mode the old implementation is used.
**Motivation**
Previous functionalization fails to re-inplace arguments when they are views over other tensors.
see issue https://github.com/pytorch/pytorch/issues/131192
The new functionalization is easier to re-inplace for views.
**A) Functionalizations pass**
consider a program:
```
def func(t):
    x = t[0]
    y = t[1]
    foo(x, y)  # custom operator with x, y mutable
    return (x, y, t)
```
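For concreteness, here is a hedged sketch of how such a mutable custom op could be declared (the op name `mylib::foo` matches the graph below; the implementation is made up):
```python
import torch

@torch.library.custom_op("mylib::foo", mutates_args={"x", "y"})
def foo(x: torch.Tensor, y: torch.Tensor) -> None:
    # Mutates both arguments in place and returns nothing.
    x.add_(1)
    y.add_(1)
```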
- To functionalize `foo`, we generate a function that operates on the base tensors of the inputs (x.base() and y.base()) and record how to regenerate the views out of the base for each argument, e.g. for x we record ```ViewInfo = (x.base(), x.size(), x.stride(), x.storage_offset())```
- Due to some limitations on the torch.export arguments format, we have to generate a lot of arguments, but this is something we can simplify in the future. For the example above we get the following function:
```
auto_functionalized = torch.ops.higher_order.auto_functionalized(torch.ops.mylib.foo.default,
    _x_base_index = 0, _x_size = (), _x_stride = (), _x_storage_offset = 0,
    _y_base_index = 0, _y_size = (), _y_stride = (), _y_storage_offset = 1,
    _all_bases = [arg0_1])
```
- In the code above:
  - `_all_bases` (here `[t]`) refers to a unique set of bases for all of foo's arguments.
  - for each argument x we have `_x_base_index`, `_x_size`, `_x_stride`, `_x_storage_offset` that can be used to regenerate x from `_all_bases[_x_base_index]` or from a copy of the base.
  - the output of auto_functionalized is foo's output, followed by one tensor for each base in `_all_bases`: a copy of that base after observing the mutations of all the arguments that are views of it.
- For each use of a base in `_all_bases` (or a view of it) that comes after the call to foo, replace it with a view of the new output.
For the function above, after functionalization we get:
```
def forward(self, arg0_1: "f32[2][1]cpu"):
    auto_functionalized = torch.ops.higher_order.auto_functionalized(torch.ops.mylib.foo.default, _x_base_index = 0, _x_size = (), _x_stride = (), _x_storage_offset = 0, _y_base_index = 0, _y_size = (), _y_stride = (), _y_storage_offset = 1, _all_bases = [arg0_1])
    getitem_1: "f32[2][1]cpu" = auto_functionalized[1]; auto_functionalized = None
    copy_: "f32[2][1]cpu" = torch.ops.aten.copy_.default(arg0_1, getitem_1); arg0_1 = copy_ = None

    # No stacktrace found for following nodes
    select_2: "f32[][]cpu" = torch.ops.aten.select.int(getitem_1, 0, 0)
    select_3: "f32[][]cpu" = torch.ops.aten.select.int(getitem_1, 0, 1); getitem_1 = None
    return (select_2, select_3)
```
**B) Semantics of auto_functionalize**
The new semantics of auto_functionalize are the following:
1. For each base in `_all_bases`, copy the base, creating the `_all_bases` copies (if a base is re-inplaced we do not need to copy it).
2. For each arg, regenerate the arg from the copy of its base using the view information above.
3. Return the original foo output followed by the new bases.
**C) Re-inplace pass**
Since auto_functionalize now copies the bases, what we actually re-inplace is the bases (this runs just like before, but on the bases instead of the args).
1. For each base b in `_all_bases`, check whether there is any use of the base (or its aliases/views) after auto_functionalized and before it is overwritten with a copy. If there is none, inplace it (i.e. avoid copying it in step 1 above).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134409
Approved by: https://github.com/zou3519
Summary: When we process keyword arguments in the profiler today, we assume that all values will be strings. This breaks HTA because it assumes that "stream" and similar values will be ints. To fix this, we only put quotes around ivalues that are actually strings.
Test Plan: Add chrome trace export in unit tests and check that stream does not have quotes around it
Differential Revision: D62056059
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134893
Approved by: https://github.com/sanrise, https://github.com/izaitsevfb
This is a bit twisty and I don't entirely understand the situation, but here's my best explanation.
In https://github.com/pytorch/pytorch/pull/133588 I am trying to fix a problem reported by a user in https://fb.workplace.com/groups/6829516587176185/permalink/7705964779531357/ The summary of this problem is that when we do collect metadata analysis in AOTAutograd, we accumulate pending unbacked symbols which are going to be discarded at the end of the trace. However, if we do a recursive make_fx inside tracing, as occurs with torch.cond, we end up seeing that there are pending unbacked symbols that aren't associated with a binding, even though it's spurious (they've leaked into the inner make_fx call from the outer AOTAutograd analysis).
In #133588 I tried to just prevent adding the symbols to the pending list at all in the first place. But this itself caused some problems which were fixed in https://github.com/pytorch/pytorch/pull/124785 . The problem fixed in that PR is that when we allocate tangents that have unbacked size, something prevented them from having correct unbacked SymInts when ignore fresh unbacked SymInts was enabled. So I had patched it at the time by just not suppressing pending symbols and clearing them out some other way.
I think... I was wrong in that PR? That is to say, it was OK to avoid putting the fresh unbacked symbols in the pending list; the real problem was suppressing unbacked renamings. But there doesn't seem to be a good reason to suppress these; this PR shows that it doesn't actually fail any tests if you do these anyway. Intuitively, this makes sense, because you can't trigger renamings unless you're actually adding unbacked symbols to the pending set.
But I don't entirely understand all the interactions. I just know that this seems to not cause tests to fail, and it should fix the internal issue (which I need to add a UT for.)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134407
Approved by: https://github.com/ydwu4
I propose a new heuristic function to select tile size, cluster size, and transposition given M, N and K. It improves the performance across the board (on average) while remaining simple and relying only on a handful of kernels (to limit build time and binary size).
Across the shapes I benchmarked, the new heuristic gives a (geometric) mean speedup of +16.5%. Some shapes worsen, but 98.6% of the shapes retain their old performance (up to 5% to allow for noise) or improve it.
I benchmarked on over 5.4k different shapes:
- For M and N I swept across all values which are the sums of two powers of 2 (limited to multiples of 64, capped at 16,384)
- For K I only used powers of 2 between 1,024 and 8,192 (based on the intuition that the optimal config doesn't depend on K, which turned out to be the case)
Here's the detailed speedup for each shape
<details>
<summary>
This is the code I used to benchmark
</summary>
```
import torch
import torch.utils.benchmark

s = set()
for i in range(6, 15):
    s.add(2**i)
    for j in range(6, i):
        s.add(2**i + 2**j)
ms = [i for i in sorted(s) if i <= 2**14]
ns = [i for i in sorted(s) if i <= 2**14]
ks = [2**i for i in range(10, 14)]

def make_graph(n_iters, f):
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        for _ in range(n_iters):
            f()
    return g

def rowwise_scale(t, dtype_t):
    min_v, max_v = torch.finfo(dtype_t).min, torch.finfo(dtype_t).max
    scale_t = torch.clamp(t.abs().amax(dim=-1, keepdim=True).float(), min=1e-12) / max_v
    t_fp8 = (t / scale_t).clamp(min=min_v, max=max_v).to(dtype_t)
    return t_fp8, scale_t

for m in ms:
    for n in ns:
        for k in ks:
            a = torch.randn((m, k), device="cuda", dtype=torch.float)
            b_t = torch.randn((n, k), device="cuda", dtype=torch.float)
            a_fp8, scale_a = rowwise_scale(a, torch.float8_e4m3fn)
            b_t_fp8, scale_b_t = rowwise_scale(b_t, torch.float8_e4m3fn)
            func = lambda: torch._scaled_mm(
                a_fp8,
                b_t_fp8.t(),
                scale_a=scale_a,
                scale_b=scale_b_t.t(),
                bias=None,
                use_fast_accum=True,
                out_dtype=torch.bfloat16,
            )
            print(f"{m=},{n=},{k=}")
            print(torch.utils.benchmark.Timer("g.replay()", globals={"g": make_graph(1000, func)}).blocked_autorange(min_run_time=1).mean / 1000)
```
</details>
<details>
<summary>
This is the code I used for the plots
</summary>
```
from itertools import islice
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
from matplotlib.colors import FuncNorm
from mpl_toolkits.axes_grid1 import ImageGrid

def batched(iterable, n):
    iterator = iter(iterable)
    while batch := tuple(islice(iterator, n)):
        yield batch

def try_to_convert(v):
    if v == "False":
        return False
    if v == "True":
        return True
    return int(v)

def get_from_paste(filename):
    text = open(filename, "rt").read()
    headers = []
    data = []
    for config, value in batched(text.splitlines(), 2):
        config_elems = config.split(",")
        if not headers:
            headers = [e.partition("=")[0] for e in config_elems]
        data.append((*(try_to_convert(e.partition("=")[-1]) for e in config_elems), float(value)))
    return pd.DataFrame(data, columns=headers + ["latency"])

old_latencies = get_from_paste(...)
new_latencies = get_from_paste(...)
ratios = pd.merge(new_latencies, old_latencies, how="left", left_on=["m", "n", "k"], right_on=["m", "n", "k"], suffixes=("_new", "_old"))
ratios = ratios.assign(ratio=ratios.latency_old / ratios.latency_new)

fig = plt.figure(figsize=(40.0, 10.0))
grid = ImageGrid(
    fig,
    111,
    nrows_ncols=(1, 4),
    axes_pad=0.5,
    share_all=True,
    cbar_location="right",
    cbar_mode="single",
    cbar_size="7%",
    cbar_pad=0.15,
)
log_amax = np.max(np.abs(np.log(ratios.ratio.to_numpy())))
for K, ax in zip([1024, 2048, 4096, 8192], grid):
    pivoted = ratios[(ratios.k == K)].pivot_table(index="m", columns="n", values="ratio")
    im = ax.imshow(np.log(pivoted.to_numpy()), origin="lower", vmin=-log_amax, vmax=log_amax, cmap="PiYG")
    m_vals, n_vals = pivoted.axes
    ax.set_xticks(np.arange(len(n_vals)), labels=[f"N={i}" for i in n_vals.values], fontsize=12)
    ax.set_yticks(np.arange(len(m_vals)), labels=[f"M={i}" for i in m_vals.values], fontsize=12)
    plt.setp(ax.get_xticklabels(), rotation=90, ha="right", rotation_mode="anchor")
    ax.grid(False)
    ax.set_title(f"K={K}", fontsize=20)
norm = FuncNorm((lambda x: np.log(x), lambda x: np.exp(x)), np.exp(-log_amax), np.exp(log_amax))
ax.cax.colorbar(ScalarMappable(norm=norm, cmap="PiYG"))
plt.show()

counts, bins = np.histogram(np.log(ratios.ratio.to_numpy()), bins=500)
plt.stairs(counts, np.exp(bins), fill=True)
plt.xscale("function", functions=(lambda x: np.log(x), lambda x: np.exp(x)))
```
</details>
I only benchmarked fast_accum=True and out_dtype=torch.bfloat16, supposing that these are the most commonly used flags (e.g., with fast_accum=False row-wise scaling is much slower than tensor-wise scaling, hence impractical).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134781
Approved by: https://github.com/drisspg, https://github.com/eqy
ghstack dependencies: #134773
On some occasion, a column-major output layout is more efficient (it's unclear if it's because of better store coalescing for some tile shapes, or whether it's just that it's CUTLASS's default and thus it's better optimized).
At this stage I only add a flag that allows transposing, but the hardest part will be deciding on a new heuristic to turn it on selectively. This will be in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134773
Approved by: https://github.com/drisspg
Fixes an issue after updating XNNPACK where parsing the XNNPACK CMakeLists breaks. I just ignore the generated build identifier for now, since it's not used and we would need to update the buck build to generate it at build time.
Remove unused ukernels_xop XNNPACK target as it has no sources (after the recent update) and causes buck1 to complain.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134724
Approved by: https://github.com/mcr229
Adds val, and optionally stack_trace & nn_module_stack metadata back to SymInt compute nodes that we CSE, with a hook on `graph.create_node()`. Not sure if there's other metadata we want to populate here?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134516
Approved by: https://github.com/ezyang
Summary:
A bit of refactoring to prepare to remove `None` as a way to specify static dimensions in dynamic shapes, given we already have `Dim.STATIC` for the same purpose. We will now warn whenever this happens. However no tests were modified because problematic uses of `None` still need to behave as they do today, until we are ready to remove support. It should be easy to port tests by replacing the warning function to raise instead.
Note that other uses of `None`, such as for entire values (tensor or non-tensor) remain as is. Moving forward this should be the only purpose of `None` (at least externally).
Finally, there's a bit of confusion in our representation now because `AUTO` also internally transforms to `None`. Renamed dynamic_shapes to transformed_dynamic_shapes where this happens. Overall the two forms (pre and post transformation) have different properties so should probably not be represented in the same format in the future.
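A small sketch of the preferred spelling going forward, using `Dim.STATIC` instead of `None` for a static dimension (the module and shapes are made up):
```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

# batch dim is dynamic; the second dim is explicitly static via Dim.STATIC rather than None
dynamic_shapes = {"x": {0: Dim("batch", min=2, max=64), 1: Dim.STATIC}}
ep = export(M(), (torch.randn(4, 8),), dynamic_shapes=dynamic_shapes)
```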
Test Plan: existing
Differential Revision: D62040729
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134877
Approved by: https://github.com/pianpwk
Summary:
Original commit changeset: 96513cbc425f
Original Phabricator Diff: D61291210
There is some evidence that FB-FM-v4 has better NE with `ctx.set_materialize_grads(False)` set, especially when paired with prefetching.
See https://www.internalfb.com/intern/anp/view/?id=5732259
Test Plan:
```
export NUM_WORKERS=128
export BATCH_SIZE=1024
export CONFIG_FILE="mast_joint_arch_exploration_cmf_updated_fbfm_v3_fsdp2.yaml"
export ENTITLEMENT=ads_global_tc_2k_training_large_short
buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -c fbcode.platform010_cuda_version=12 -c hpc_comms.use_nccl=2.17.1 -- mode=${CONFIG_FILE} launcher.tags='[ads_ranking_taxonomy_monetization_genai]' launcher.data_project=pytorch_at_scale launcher.max_retries=10 launcher.fbl_entitlement=${ENTITLEMENT} launcher.oncall=pytorch_training_enablement launcher.hardware=GRANDTETON launcher.num_workers=${NUM_WORKERS} data_loader.dataset.batch_size=${BATCH_SIZE} training.planner.proposer=dynamic_col_dim training.planner.proposer.optim_target=hbm 2>&1 | tee ~/tmp/log.mast
```
Differential Revision: D62009163
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135059
Approved by: https://github.com/awgu
Summary: Torch-compiling a quick script can be a bit slower than it needs to be: even though we initialize the subprocess pool early, it still might not be ready by the time we try to compile the first Triton kernel. Instead, let's use the single-threaded path until the pool has successfully completed a no-op job.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133508
Approved by: https://github.com/Chillee
Summary:
1. Move the debug printer call a level lower, to here: https://www.internalfb.com/code/fbsource/[931d7bbb9e7cf2dcb926f42718f56fc940903eec]/fbcode/caffe2/torch/_inductor/codegen/cpp_wrapper_cuda.py?lines=335
2. Add a UT validating the debug printer for user-defined Triton kernel codegen.
The benefit of having the debug printer call happen in a more centralized place is that 1) it reduces the duplicated debug-printer logic scattered across the codebase, and 2) it can handle more Triton kernel codegen paths as long as they invoke `generate_kernel_call()`; for example, it automatically handles/supports debug printing for user_defined_kernel, which is a pretty common use case we encounter when debugging.
Test Plan:
```AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_user_defined_triton_kernel_abi_compatible_cuda```
Also verified that templateKernel codegen path still works
Differential Revision: D61949020
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134789
Approved by: https://github.com/ColinPeppler
Summary: We noticed that there will be a runtime error when doing the dim broadcast if the meta example value has a symbolic shape, so we skip it.
Test Plan:
```
buck2 run mode/opt //caffe2/benchmarks/dynamo/fb:torchbench_run_ads_dhen_5x_training -- -m ads_dhen_5x -t training
```
P1559019921
Differential Revision: D62115015
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134975
Approved by: https://github.com/xuzhao9
Summary:
Currently some jobs are encountering the following trace, P1539415198. This suggests that when we are parsing through tensors, the path is prone to encountering an invalid address. This is possibly occurring because, for some reason, the sizes() and strides() of a Tensor are not of the same dimensions. We assume they match when iterating through the shapes to get the IValue generator. When browsing some of the tensor implementations, I found that some of the size and stride paths are different, which could be the cause of this issue. Regardless, the profiler should be flexible enough to handle such issues without bringing down the whole main thread.
If the crashes still persist, it will still give us a data point as to where they are occurring and we can rule out the strides/sizes as the culprit
Test Plan: This change doesn't break anything in the happy path, just makes sure the bad path is not exited abruptly. We should use this in order to debug what the events are having mismatching dimensions between sizes and strides.
Differential Revision: D62008788
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134862
Approved by: https://github.com/aaronenyeshi
We keep two copies of the runner-determinator script:
1. In runner_determinator.py, for ease of testing. This however is not actually executed during CI
2. Embedded in _runner-determinator.yml. This is what CI uses.
Why the duplication? Short version: Because of how github CI works, during a given CI run the workflow yml files could actually come from the main branch, while the remaining files get read from the local commit.
This can lead to a newer version of _runner-determinator.yml trying to invoke an older version of runner_determinator.py than it was actually designed for. Chaos ensues.
We mitigate this by embedding the script into the yml file. But we still keep the script around because it's much easier to run tests against.
This workflow's job is to ensure that if one edits the script in one of those two locations then they remember to update it in the other location as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134800
Approved by: https://github.com/zxiiro, https://github.com/PaliC
ghstack dependencies: #134796
D53335860 and D56435815 added an option to torch elastic allowing users to choose a TCPStore backend type to use via
1) explicit argument passing in user code when instantiating `MastRendezvousHandler`
2) pass `--use_libuv` command line argument to `torchrun`.
The motivation was to offer a quick way to roll back to the non-libuv TCPStore backend since we were making libuv the default in `c10d` code. Now we think it's better for torch elastic to not be aware of the TCPStore backend type, and instead rely on `c10d`'s mechanism to decide which backend to use. This way, the TCPStore backend type used by torch elastic will be identical to that in PyTorch.
PyTorch TCPStore uses the environment variable `USE_LIBUV` to determine the backend type:
when `USE_LIBUV="0"`, the non-libuv backend will be used.
when `USE_LIBUV="1"`, the libuv backend will be used. And this is the default option.
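A tiny sketch of the selection mechanism described above (the variable must be set before the store/rendezvous is created):
```python
import os

os.environ["USE_LIBUV"] = "0"    # opt out of the libuv TCPStore backend
# os.environ["USE_LIBUV"] = "1"  # libuv backend; this is the default
```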
Differential Revision: [D58259590](https://our.internmc.facebook.com/intern/diff/D58259590/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134882
Approved by: https://github.com/shuqiangzhang
Summary:
The current use case is to continuously measure the total allocated and reserved CUDA memory size from CUDACachingAllocator, and export their distribution (min, max, p90 etc) over time as timeseries.
The current callback-based API does not work because the backend decides when the measurement is taken, so data points between two measurements may not be recorded. The distribution (e.g. max) as such will not be accurate.
This new API closely follow the design of the existing WaitCounter API otherwise.
This is not quite a synchronous version of DynamicCounter, as summing multiple data points does not make sense to my use case
Test Plan: CI
Differential Revision: D61837528
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134883
Approved by: https://github.com/c-p-i-o
The issue:
Const propagation only checks that the arguments do not contain FakeTensors. If an argument is a Subclass, it passes this condition.
As a result, const propagation runs without FakeTensorMode, and tensor factories inside Subclass.__torch_dispatch__ produce tensors that are not fakified.
Solution:
If we have subclass arguments, do not consider const propagation doable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134855
Approved by: https://github.com/zou3519
op_level_debug helped identify missing operators and wrongly implemented operators back when the dynamo exporter relied on nearest matching and torchlib had just been created. However, now that the dispatcher logic has improved and torchlib has matured, we no longer need it.
PS: the op-level-debug diagnostics rule is not deleted in this PR, as it auto-generates lint error codes and needs more time to fix. We can delete it when we retire SARIF.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134961
Approved by: https://github.com/justinchuby
Based on https://github.com/pytorch/pytorch/pull/130956.
Inductor already supports padding through the `config.comprehensive_padding` option, but the padding format involves a few heuristics that are specific to Nvidia GPUs:
- When we pad, it is always aligned to the next multiple of 128 bytes.
- Strides smaller than 1024 are not padded.
- Only intermediate values are padded, not outputs.
The last of these is not really GPU-specific, but there are certain cases where we may want to override it. For example, padding outputs is useful on hardware accelerators with specific memory alignment requirements, or for applications where performance is more important than conformity with eager mode.
This PR surfaces padding parameters up to Inductor's config module, so the user can control them.
- `config.pad_outputs`: choose whether to pad outputs (default: `False`)
- `config.padding_alignment_bytes`: choose the alignment size for padding (default: `128`)
- `config.padding_stride_threshold`: choose the smallest stride that we will pad. For example, setting this to 0 will pad all unaligned strides. (default: `1024`)
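A short sketch of how these knobs could be set (the values are arbitrary; the option names are the ones listed above):
```python
import torch._inductor.config as inductor_config

inductor_config.pad_outputs = True             # also pad graph outputs, not just intermediates
inductor_config.padding_alignment_bytes = 64   # align padded strides to 64 bytes instead of 128
inductor_config.padding_stride_threshold = 0   # pad all unaligned strides
```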
**Test plan**
Added a new test in `test_padding.py` which tries various combinations of these options, checking that the output strides match our expectations.
These changes should not affect perf, because the defaults are identical to Inductor's current behavior.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133939
Approved by: https://github.com/shunting314
Co-authored-by: Yueming Hao <yhao@meta.com>
Summary:
Pull the big nested function out of the middle of cached_autotune() into its own class.
Also refactor creating the autotune cache itself out - which gets shared in the next diff.
Test Plan: unit tests
Differential Revision: D60677501
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134911
Approved by: https://github.com/oulgen
The reraise is not supported and so this just gunks up our actual exception handling. You can trigger this by hitting an exception inside of an NN module that has hooks on it. You end up graph breaking on the reraise here, and losing the inner stack trace from the actual exception that was raised.
This might be kind of controversial. An alternate strategy is to support reraises in Dynamo or something but IDK this doesn't feel like the right place to apply force.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133239
Approved by: https://github.com/anijain2305
Summary:
The existing RemoteCacheBackend classes were a bit haphazard - some of them accepted bytes only, some accepted objects, some returned different types of objects than were passed in.
Update them to be more consistent:
1. RemoteCacheBackend is an implementation of a backend: Redis, Memcache, Manifold, LocalFile
2. RemoteCacheSerde is an implementation of a serde protocol - to turn structured objects (dict, list, etc) into bytes: RemoteCacheJsonSerde (json encoding), RemoteCachePassthroughSerde (strictly bytes only)
3. RemoteCache is the cache implementation itself, mixing a RemoteCacheBackend along with an RemoteCacheSerde to provide structured caching.
Other than simply reorganizing the existing cache code this also fixes the Redis autotune caching for OSS.
Test Plan: unit tests
Reviewed By: oulgen
Differential Revision: D61178859
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134032
Approved by: https://github.com/oulgen, https://github.com/bhack
Context: Adding support for the beta parameters to be tensors
Details: Similarly to the previous two PRs, addcmul_ is used with the tensor betas as the value argument. When this occurs, an item() call is invoked in the aten op. To avoid this graph break, addcmul_ is decomposed into its constituent ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134168
Approved by: https://github.com/anijain2305
ghstack dependencies: #134166, #134167
Context: Adding support for the beta parameters to be tensors
Details:
In this PR, similarly to the previous one, foreach_pow calls item() on the first argument when it is a scalar tensor. In this case, we broadcast that scalar tensor into a list of aliases of that tensor to avoid the item() call, and this results in a device copy of the scalar tensor. Once again, I don't think we can change the foreach_pow API due to BC concerns, so this op rewrite allows us to avoid a graph break, generate semantically the same code, and not affect eager.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134167
Approved by: https://github.com/anijain2305
ghstack dependencies: #134166
Context: Adding support for the beta parameters to be tensors
Details:
In order to add support for the beta params to be tensors without graph breaks in the Adam family of optimizers it is necessary to support foreach_lerp(x, y, s) where s is a scalar tensor. Today, this isn't possible because when `s` is a scalar, internally the aten op calls item() on it to extract the value and distribute it to each of the ops on the individual list indices. To support this in dynamo without graph breaks, I decompose the lerp into its constituent ops which support a scalar tensor in the list argument positions which do not result in an item() call. To be clear the item() call is more performant for eager I think and for BC I don't think we can modify that API, so this allows us to have performance in eager and no graph breaks in compile.
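As an illustration only (not the exact Dynamo rewrite), the decomposition idea looks roughly like this, with a 0-dim tensor weight so no `.item()` call is needed:
```python
import torch

def lerp_with_tensor_weight(xs, ys, weight_t):
    # lerp(x, y, w) = x + w * (y - x), applied across the tensor lists
    return [x + weight_t * (y - x) for x, y in zip(xs, ys)]

xs = [torch.randn(3) for _ in range(2)]
ys = [torch.randn(3) for _ in range(2)]
w = torch.tensor(0.9)  # scalar tensor weight stays a tensor; no .item() sync
out = lerp_with_tensor_weight(xs, ys, w)
```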
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134166
Approved by: https://github.com/anijain2305
This essentially undoes the large skips on everything but macOS Sequoia that https://github.com/pytorch/pytorch/pull/128393 applied to nn.modules tests.
Instead it uses the existing `xfail`, but guards it on the `_macos15_or_newer` boolean.
Before the change if run on MacOS 14:
```
% python3 ../test/test_modules.py -v -k Hardswish 2>&1|tail -n3
Ran 57 tests in 0.053s
OK (skipped=32)
```
After
```
% python3 ../test/test_modules.py -v -k Hardswish 2>&1|tail -n3
Ran 57 tests in 0.229s
OK (skipped=10, expected failures=2)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134858
Approved by: https://github.com/janeyx99
Add a relative path to the search paths in the benchmark. This enables users to run `torchbench.py` inside the `pytorch/benchmark/dynamo` folder when the `torchbench` repo is cloned at the same level as `pytorch`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134871
Approved by: https://github.com/FindHao
Currently, if installed, `onnxruntime` will be imported when importing `torch._inductor` (which will be imported by some other library, e.g. transformer-engine):
```
/mnt/c.py(53)<module>()
-> from torch._inductor.utils import maybe_profile
/usr/local/lib/python3.10/site-packages/torch/_inductor/utils.py(49)<module>()
-> import torch._export
/usr/local/lib/python3.10/site-packages/torch/_export/__init__.py(25)<module>()
-> import torch._dynamo
/usr/local/lib/python3.10/site-packages/torch/_dynamo/__init__.py(2)<module>()
-> from . import convert_frame, eval_frame, resume_execution
/usr/local/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py(48)<module>()
-> from . import config, exc, trace_rules
/usr/local/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py(52)<module>()
-> from .variables import (
/usr/local/lib/python3.10/site-packages/torch/_dynamo/variables/__init__.py(38)<module>()
-> from .higher_order_ops import (
/usr/local/lib/python3.10/site-packages/torch/_dynamo/variables/higher_order_ops.py(14)<module>()
-> import torch.onnx.operators
/usr/local/lib/python3.10/site-packages/torch/onnx/__init__.py(62)<module>()
-> from ._internal.onnxruntime import (
/usr/local/lib/python3.10/site-packages/torch/onnx/_internal/onnxruntime.py(37)<module>()
-> import onnxruntime # type: ignore[import]
```
This issue breaks generated Triton kernels, because they import torch and thereby pull in unexpected runtime libraries as well.
I've also added a test for this specific case under `test/onnx`, perhaps we should add more somewhere else?
Related issue: https://github.com/huggingface/accelerate/pull/3056
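The general pattern such a fix relies on (not the actual torch.onnx code) is to defer the heavy import until it is actually needed:
```python
import importlib.util

# Cheap availability check that does not import the module.
_ORT_AVAILABLE = importlib.util.find_spec("onnxruntime") is not None

def _get_onnxruntime():
    import onnxruntime  # imported lazily, so `import torch._inductor` stays cheap
    return onnxruntime
```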
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134662
Approved by: https://github.com/justinchuby
This PR adds dynamic shapes support to foreach and combo kernels for horizontal fusion.
A flag `combo_kernel_foreach_dynamic_shapes` (default False, to avoid disturbing production workflows) is added to _inductor/config.py. Setting it to True enables automatic dynamic shapes for foreach kernels. It is always enabled for combo kernel cases. Added unit tests.
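For example, the flag can be flipped like this (a sketch; the flag name is the one introduced above):
```python
import torch._inductor.config as inductor_config

inductor_config.combo_kernel_foreach_dynamic_shapes = True  # enable dynamic shapes for foreach kernels
```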
This PR also fixes a flaky test case for [T198833257](https://www.internalfb.com/intern/tasks/?t=198833257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134477
Approved by: https://github.com/mlazos
The caching autotuner caches triton configs, and it doesn't try to hash or save the pre_hook from the config if it exists. If we had a config that had a pre_hook, then we might autotune -> save the config (without the pre_hook) -> later load the saved config and try to run it, but this time without the pre_hook.
So this PR adds an assert and deletes the pre_hook handling. We can be confident that we didn't have functional pre_hooks, because the pre_hook handling tries to use `self.arg_name`, which doesn't exist.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134633
Approved by: https://github.com/shunting314, https://github.com/jansel
Summary: When we are placing nodes in the graph, we should also replace the references in module_call_graph.
Test Plan:
buck2 run 'fbcode//mode/opt' torchrec/fb/ir/tests:test_serializer -- --filter-regex test_serialize_deserialize_vlea
buck2 test 'fbcode//mode/opt' fbcode//torchrec/fb/ir/tests:test_serializer -- --exact 'torchrec/fb/ir/tests:test_serializer - torchrec.fb.ir.tests.test_serializer.TestSerializer: test_serialize_empty_value_vlea' --run-disabled
buck2 test 'fbcode//mode/opt' fbcode//torchrec/fb/ir/tests:test_serializer -- --exact 'torchrec/fb/ir/tests:test_serializer - torchrec.fb.ir.tests.test_serializer.TestSerializer: test_deserialized_device_vle' --run-disabled
Differential Revision: D62014035
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134830
Approved by: https://github.com/angelayi
This is part of a series of PRs to improve the `associative_scan` functionality. This specific PR introduces a `combine_mode`, which can be either `pointwise` (default) or `generic`. In the case of `generic`, the `associative_scan` is more flexible and also allows non-pointwise functions. This PR has been derived from https://github.com/pytorch/pytorch/pull/129307.
@ydwu4 @Chillee @zou3519
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133012
Approved by: https://github.com/ydwu4
TLDR: this PR supports exporting cond x the inline_inbuilt_nn_modules flag by inlining into tracing code in proxy_tensor.py and _symbolic_trace.py (internally, the pattern is make_fx(record_module_stack)(torch.compile(f))).
We have two special treatments for the following cases:
1. _ModuleStackTracer wraps all the nn modules into _AttrProxy. This _AttrProxy has several subtleties which make it hard to inline in dynamo, such as overriding _modules with a property method and overriding `__getattr__`, which mutates captured state when called.
The solution is to unwrap the _AttrProxy and get its corresponding nn_module (a 1-1 correspondence), so that dynamo symbolically traces the original nn module instead of tracing _AttrProxy.
2. The tracer applies a bunch of patches to the `__getattr__` and `__call__` of nn.Module for tracking reasons. This doesn't work well with dynamo. The immediate error we see is `torch._dynamo.exc.Unsupported: 'inline in skipfiles: WeakKeyDictionary.__contains__ | __contains__ /home/yidi/.conda/envs/pytorch/lib/python3.10/weakref.py` caused by a weakdict in PythonKeyTracer.
The solution is to temporarily remove the patches during dynamo symbolic convert, so that dynamo has a clean environment; make_fx will trace the transformed bytecode of dynamo and patch nn modules there instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133731
Approved by: https://github.com/anijain2305
ghstack dependencies: #134775
Fixes #131865. Addresses the issue seen when running the llama v3.1 8B parameter model on the MPS backend, where the batch matmul output size can go over the 32-bit indexing limit of MPS tensors, causing an assert.
Test case to reproduce the issue with the dimensions encountered in llama v3.1 and verify this fix works around it:
```
import torch
device='mps'
a = torch.randn([32, 20064, 128], dtype=torch.float32,device=device)
b = torch.randn([32, 128, 20064], dtype=torch.float32, device=device)
res = torch.bmm(a, b)
```
Notably, the current change only works as long as each individual output matrix in the bmm does not exceed 2**32 elements. This lets us split up the computation along the batch axis to avoid going over the limit.
Added a TORCH_CHECK to raise an error if the individual matrix dimensions are too large to handle for this op until a more general workaround tiling the matmuls is available.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133430
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Fixes issue seen in https://github.com/pytorch/pytorch/issues/132872#issuecomment-2314574656
With this API, we can mark the offending module as static in detectron2.
Today's world - Consider user defined nn module int attributes automatic dynamic. Use the API in this PR to make them static if you want.
Alternative work - Consider all int attributes of any user defined nn module class static. And then introduce an API - `torch._dynamo.mark_nn_module_attribute_dynamic`. The default being static is worrying if users have `counter` in their model which is updated in each forward invocation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134713
Approved by: https://github.com/jansel
ghstack dependencies: #134653
## Motivation
This is a follow-up to PR https://github.com/pytorch/pytorch/pull/126970, adding the facility to run this content on Intel Gaudi devices.
We intend to extend similar generalization for the rest of the content in test/dynamo which is currently being written to work specifically for cuda devices. Other devices can add onto it if support is available.
## Changes
- carve out BERT-related content into another class
- use the instantiate_device_type utility to instantiate this class for devices which support the functionality (a sketch of the pattern follows below)
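A hedged sketch of that pattern (the test and class names are made up; `instantiate_device_type_tests` is PyTorch's internal test utility):
```python
import torch
from torch.testing._internal.common_utils import TestCase, run_tests
from torch.testing._internal.common_device_type import instantiate_device_type_tests

class TestBertCommon(TestCase):
    def test_attention_matmul(self, device):
        a = torch.randn(2, 4, device=device)
        b = torch.randn(4, 2, device=device)
        self.assertEqual((a @ b).shape, torch.Size([2, 2]))

# Generates per-device classes (TestBertCommonCPU, TestBertCommonCUDA, TestBertCommonHPU, ...)
instantiate_device_type_tests(TestBertCommon, globals())

if __name__ == "__main__":
    run_tests()
```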
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130714
Approved by: https://github.com/anijain2305
Benchmarks several shapes of basic nn modules, in both eager and inductor.
```
collecting compile time instruction count for basic_modules_ListOfLinears_inductor
compile time instruction count for iteration 0 is 48602516013
compile time instruction count for iteration 1 is 20424350269
compile time instruction count for iteration 2 is 20440350455
compile time instruction count for iteration 3 is 20419269999
compile time instruction count for iteration 4 is 20430782200
compile time instruction count for iteration 5 is 20455049622
compile time instruction count for iteration 6 is 20157290712
compile time instruction count for iteration 7 is 20455324001
compile time instruction count for iteration 8 is 20450158317
compile time instruction count for iteration 9 is 20492987748
collecting compile time instruction count for basic_modules_ListOfLinears_eager
compile time instruction count for iteration 0 is 961328334
compile time instruction count for iteration 1 is 958887896
compile time instruction count for iteration 2 is 958792214
compile time instruction count for iteration 3 is 958375977
compile time instruction count for iteration 4 is 958568525
compile time instruction count for iteration 5 is 958152305
compile time instruction count for iteration 6 is 959322800
compile time instruction count for iteration 7 is 958332703
compile time instruction count for iteration 8 is 958092100
compile time instruction count for iteration 9 is 958095277
collecting compile time instruction count for basic_modules_ModuleForwardHasGraphBreak_inductor
compile time instruction count for iteration 0 is 3572145793
compile time instruction count for iteration 1 is 3503323973
compile time instruction count for iteration 2 is 3501962432
compile time instruction count for iteration 3 is 3501746084
compile time instruction count for iteration 4 is 3500687361
compile time instruction count for iteration 5 is 3822254676
compile time instruction count for iteration 6 is 3498356846
compile time instruction count for iteration 7 is 3499019157
compile time instruction count for iteration 8 is 3500780314
compile time instruction count for iteration 9 is 3500257458
collecting compile time instruction count for basic_modules_ModuleForwardHasGraphBreak_eager
compile time instruction count for iteration 0 is 1844838754
compile time instruction count for iteration 1 is 1843476862
compile time instruction count for iteration 2 is 1844761450
compile time instruction count for iteration 3 is 1845371742
compile time instruction count for iteration 4 is 1845159665
compile time instruction count for iteration 5 is 1845035802
compile time instruction count for iteration 6 is 1844895007
compile time instruction count for iteration 7 is 1844697922
compile time instruction count for iteration 8 is 1844780885
compile time instruction count for iteration 9 is 1844493990
collecting compile time instruction count for basic_modules_SequentialWithDuplicatedModule_inductor
compile time instruction count for iteration 0 is 1597839479
compile time instruction count for iteration 1 is 1348225351
compile time instruction count for iteration 2 is 1347340818
compile time instruction count for iteration 3 is 1348170800
compile time instruction count for iteration 4 is 1348637747
compile time instruction count for iteration 5 is 1678366444
compile time instruction count for iteration 6 is 1348412420
compile time instruction count for iteration 7 is 1348461578
compile time instruction count for iteration 8 is 1347420149
compile time instruction count for iteration 9 is 1349748195
collecting compile time instruction count for basic_modules_SequentialWithDuplicatedModule_eager
compile time instruction count for iteration 0 is 137721777
compile time instruction count for iteration 1 is 139065517
compile time instruction count for iteration 2 is 137130552
compile time instruction count for iteration 3 is 137506030
compile time instruction count for iteration 4 is 137089838
compile time instruction count for iteration 5 is 137477395
compile time instruction count for iteration 6 is 138550452
compile time instruction count for iteration 7 is 137568409
compile time instruction count for iteration 8 is 136968468
compile time instruction count for iteration 9 is 137481664
collecting compile time instruction count for basic_modules_ModuleComparison_inductor
compile time instruction count for iteration 0 is 917209684
compile time instruction count for iteration 1 is 899154426
compile time instruction count for iteration 2 is 898145079
compile time instruction count for iteration 3 is 899817018
compile time instruction count for iteration 4 is 899184687
compile time instruction count for iteration 5 is 898172885
compile time instruction count for iteration 6 is 899958951
compile time instruction count for iteration 7 is 899348186
compile time instruction count for iteration 8 is 897745404
compile time instruction count for iteration 9 is 899581123
collecting compile time instruction count for basic_modules_ModuleComparison_eager
compile time instruction count for iteration 0 is 113165302
compile time instruction count for iteration 1 is 112724376
compile time instruction count for iteration 2 is 112774611
compile time instruction count for iteration 3 is 114465211
compile time instruction count for iteration 4 is 112689572
compile time instruction count for iteration 5 is 112726465
compile time instruction count for iteration 6 is 112853691
compile time instruction count for iteration 7 is 112295238
compile time instruction count for iteration 8 is 114022136
compile time instruction count for iteration 9 is 112664932
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134658
Approved by: https://github.com/anijain2305
ghstack dependencies: #133834, #134635, #134649, #134652
**Summary**
reland of https://github.com/pytorch/pytorch/pull/134294. Fixes #131446, #126852, #126868, #126493.
The PR was reverted due to a CI red signal in https://github.com/pytorch/pytorch/actions/runs/10537099590/job/29201744658. It seems that the `gaussian_nll_loss` test had been flaky before my original PR #134294. Therefore this PR also removes the `xfail` mark on this specific test to make the CI signal green.
See the error message below:
```
2024-08-24T13:42:01.3228990Z ==================================== RERUNS ====================================
2024-08-24T13:42:01.3229530Z [31m[1m_ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _[0m
2024-08-24T13:42:01.3229710Z Unexpected success[90m[39;49;00m
2024-08-24T13:42:01.3230235Z [31m[1m_ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _[0m
2024-08-24T13:42:01.3230407Z Unexpected success[90m[39;49;00m
2024-08-24T13:42:01.3230594Z =================================== FAILURES ===================================
2024-08-24T13:42:01.3231128Z [31m[1m_ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _[0m
2024-08-24T13:42:01.3231296Z Unexpected success[90m[39;49;00m
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134509
Approved by: https://github.com/tianyu-l, https://github.com/wz337
# Motivation
If building XPU via oneAPI 2024.2, the build will fail because `sycl-preview.lib` exists on Windows, and linking the unexpected lib results in `error LNK2019: unresolved external symbol`.
# Solution
Explicitly use `sycl-preview` in the Linux build only.
# Additional Context
For `find_library`, please note that the variable will not be updated if it has been stored.
```
If the library is found the result is stored in the variable and the search will not be repeated unless the variable is cleared.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133845
Approved by: https://github.com/min-jean-cho, https://github.com/EikanWang, https://github.com/atalman, https://github.com/malfet
**Summary**
Fix the comment: https://github.com/pytorch/pytorch/pull/122961#issuecomment-2313930242. For all of the cases we see in the 3 test suites (TorchBench, TIMM, HuggingFace) we expect:
* `_node` is a FX Node with target in ["index_expr", "load", "store"]
* `_node.args[1 if _node.target == "index_expr" else 2]` is another FX node with target `get_index`
* `_node.args[1 if _node.target == "index_expr" else 2].args[0]` is a str for the name of this index expression
This seems not to hold in some FB-internal test cases, judging from the failure log posted in the above link. So, add a condition check (sketched below) to work around it.
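A hedged sketch of the guarded access implied by the bullets above (the helper name and exact fallback behavior are illustrative):
```python
import torch.fx

def _maybe_get_index_name(node: torch.fx.Node):
    # Follow the structure described above, but bail out gracefully if it does not hold.
    arg_idx = 1 if node.target == "index_expr" else 2
    if len(node.args) <= arg_idx:
        return None
    get_index = node.args[arg_idx]
    if not isinstance(get_index, torch.fx.Node) or get_index.target != "get_index":
        return None
    name = get_index.args[0] if get_index.args else None
    return name if isinstance(name, str) else None
```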
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134645
Approved by: https://github.com/jgong5, https://github.com/masnesral
Summary:
We found that if we init the PG in a background thread, it would block the main thread until init is complete. This is because in the pybinding we never release the GIL.
Test Plan:
existing CI on eager init
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134779
Approved by: https://github.com/c-p-i-o
This benchmark measures the cost of compiling the following function in eager and inductor; it's basically two benchmarks.
```
@torch.compile(backend=self.backend, fullgraph=True)
def f(a, b):
    result = a.clone()
    for i in range(1000):
        if i % 3 == 0:
            result = result + b
        elif i % 3 == 1:
            result = result + 8 * b
        else:
            result = result.sin()
    return result
```
PYTHONPATH=$(pwd) python benchmarks/add_loop.py out
```
collecting compile time instruction count for add_loop_eager
compile time instruction count for iteration 0 is 8286649663
compile time instruction count for iteration 1 is 2838971338
compile time instruction count for iteration 2 is 2834263023
compile time instruction count for iteration 3 is 2829447493
compile time instruction count for iteration 4 is 2830904231
compile time instruction count for iteration 5 is 2830281077
compile time instruction count for iteration 6 is 2831466595
compile time instruction count for iteration 7 is 2830732164
compile time instruction count for iteration 8 is 2831088056
compile time instruction count for iteration 9 is 2831204407
collecting compile time instruction count for add_loop_inductor
compile time instruction count for iteration 0 is 32585687849
compile time instruction count for iteration 1 is 11747553436
compile time instruction count for iteration 2 is 11746959875
compile time instruction count for iteration 3 is 11749479461
compile time instruction count for iteration 4 is 11750053711
compile time instruction count for iteration 5 is 11750793958
compile time instruction count for iteration 6 is 11751673576
compile time instruction count for iteration 7 is 11754552912
compile time instruction count for iteration 8 is 11753723127
compile time instruction count for iteration 9 is 11759059942
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134652
Approved by: https://github.com/anijain2305
ghstack dependencies: #133834, #134635, #134649
We introduced the dispatchable backend for a ProcessGroup and collectives in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up cleanup of the ProcessGroup options: users should either set the timeout or backend later on, or directly create the backend after creating a PG.
Also, PGNCCL is using the option class from ProcessGroup, but it should actually use the Option from the backend class, so this PR aligns the type and name with what we do on the cpp side. I don't change the signature of the public API, so it still uses args named "pg_options".
We also need to update the tests to align them with this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132931
Approved by: https://github.com/H-Huang
Restart the work from PR https://github.com/pytorch/pytorch/pull/100331 in this new PR, since the old one is hard to rebase. Some code is expectedly copy/pasted from the previous PR, and the main idea is the same.
Previously we saw a relatively large compilation-time increase because too many loop orders were being considered. This PR continues the work by pruning and only considering loop orders that we know for sure are relevant (i.e., doing it on demand).
Some manually created cases where loop ordering matters are added as unit tests. The PR makes sure inductor does not miss fusion opportunities for them.
This PR should solve the missed-fusion problem in https://github.com/pytorch/pytorch/issues/130015.
Right now there is still a significant increase in compilation time, so I'll disable the feature by default. Later on, once the compilation-time issue is resolved, I'll enable it by default.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126254
Approved by: https://github.com/jansel
Previously setting garbage_collection_threshold or max_split_size_mb along with expandable_segments:True could cause the allocator to hit assert failures when running nearly out of memory. This PR ensures garbage_collection and max_split freeing do not accidentally try to release expandable segments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134338
Approved by: https://github.com/ezyang
Fixes #133252
In strict mode, we have this routine for mapping traced parameters to their FQNs using tensor ids. Currently we assume there's at least 1 unique FQN for each traced parameter, but this seems to break with parameter reuse when call_module nodes are present. Adding a test case where this breaks.
Fix this by assigning the same FQN to all traced parameters with the same tensor id. This is fine because we return the original state_dict for the EP, and the unflattener has its own routine for handling aliasing: https://github.com/pytorch/pytorch/pull/125758
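A small illustrative sketch of the grouping idea (a toy module with parameter reuse; `canonical_fqn` is a hypothetical name, not the export code):
```python
import torch.nn as nn

class Shared(nn.Module):
    def __init__(self):
        super().__init__()
        self.a = nn.Linear(2, 2)
        self.b = self.a  # module (and therefore parameter) reuse

m = Shared()
canonical_fqn = {}
for fqn, p in m.named_parameters(remove_duplicate=False):
    # every parameter aliasing the same tensor gets the same (first-seen) FQN
    canonical_fqn.setdefault(id(p), fqn)

print({fqn: canonical_fqn[id(p)] for fqn, p in m.named_parameters(remove_duplicate=False)})
# {'a.weight': 'a.weight', 'a.bias': 'a.bias', 'b.weight': 'a.weight', 'b.bias': 'a.bias'}
```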
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134500
Approved by: https://github.com/angelayi
In `collective()`, `pointToPoint()` and `collectiveCoalesced()`, CUDA guards were created with an unset (default) CUDA device. This is the reason for the IMA (illegal memory access) facing the NaN checker in issue https://github.com/pytorch/pytorch/issues/134062.
With this fix, `torch.cuda.set_device(device)` is not needed to work around the IMA.
Also refactored a couple places where the guard is created -- preferably we create the guard with a known device, rather than setting the device later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134357
Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang
ghstack dependencies: #134345
Fix the `test_logs_out` UT on Windows and make all UTs in `test/dynamo/test_logging.py` pass on Windows.
Changes (see the sketch below):
1. Close the `NamedTemporaryFile` to release the file handle and avoid the `PermissionError` issue.
2. Set up the temp file with `delete=False` so it is not auto-deleted (this is what otherwise triggers the `PermissionError`).
3. Open the log file as "utf-8" to align with Linux.
4. Handle the process-wrapping difference on Windows.
5. Delete the tmp file manually.
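A minimal sketch of the Windows-safe pattern behind changes 1, 2, 3, and 5 (a hypothetical snippet, not the actual test code):
```python
# delete=False avoids the PermissionError on Windows, where a file that is
# still held open cannot be reopened or removed by another handle.
import os
import tempfile

tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".log", delete=False, encoding="utf-8")
try:
    tmp.write("log line\n")
    tmp.close()  # release the handle before the file is read elsewhere
    with open(tmp.name, encoding="utf-8") as f:  # utf-8 to align with Linux
        print(f.read())
finally:
    os.unlink(tmp.name)  # delete the temp file manually
```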
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134586
Approved by: https://github.com/jansel
Summary:
With training IR, we cannot rely on trapping `to()` in `FunctionalTensor` because the regular decomposition kicks in first, and that can cause it to be optimized away.
So instead we preserve it until we functionalize, and then replace it explicitly with `_to_copy()`.
Test Plan: expected test failures go away
Differential Revision: D61883878
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134622
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
aten.empty is almost always fusible into its consumer, so we never CSE
it. This fixes a bug that looks like the following:
```py
import torch

@torch.library.custom_op("_reinplacing::sin_cos", mutates_args={"out_sin", "out_cos"})
def sin_cos(x: torch.Tensor, out_sin: torch.Tensor, out_cos: torch.Tensor) -> None:
    out_sin.copy_(x.sin())
    out_cos.copy_(x.cos())

@torch.compile
def f(x):
    out0 = torch.empty_like(x)
    out1 = torch.empty_like(x)
    sin_cos(x, out0, out1)
    return x.clone(), out0, out1

x = torch.randn(3, requires_grad=True)
f(x)
```
- cse would de-duplicate the empty nodes
- reinplacing would add an additional clone (because it can't write to
both tensors at the same time)
- the clone lowers into a new buffer + a copy_ kernel
- the copy_ kernel is unnecessary because "empty" is special - all reinplacing needed was an additional
buffer, it doesn't matter what the values are.
We could attempt to fix this on the reinplacing side but this seemed
better as a partitioner heuristic and the reinplacing fix is a bit more
tricky (we'd need to identify that the op never reads from the empty
node).
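A hedged sketch of what such a partitioner heuristic could look like (hypothetical helper, not the actual partitioner code):
```python
import torch

# empty/empty_like produce uninitialized buffers, so duplicating them is free,
# while de-duplicating them can force extra copies during reinplacing.
_NEVER_CSE = {
    torch.ops.aten.empty.memory_format,
    torch.ops.aten.empty_like.default,
}

def should_cse(node) -> bool:
    # `node` is a torch.fx.Node; skip CSE for buffer-allocation ops.
    return node.op != "call_function" or node.target not in _NEVER_CSE
```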
Test Plan:
- new test (the old number was 27, the new number is 21, so this PR
helped).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134703
Approved by: https://github.com/yf225
ghstack dependencies: #134466, #134490, #134491
Fixes [134212](https://github.com/pytorch/pytorch/issues/134212)
Currently, when we use 2D FSDP with TP, `optimizer.step()` would fail if the model were not fully tensor parallelized. If we don't have the entire model tensor parallelized when doing 2D, we would have both 1D and 2D DTensor parameters. As foreach is turned on by default, `optimizer.step()` would fail as cross mesh op is not allowed. Error as follows:
```
NotImplementedError: aten._foreach_mul_.Scalar: DTensor does not support cross-mesh operation yet!Got meshes: DeviceMesh('cuda', [[0, 1], [2, 3]], mesh_dim_names=('dp', 'tp')) DeviceMesh('cuda', [1, 3], mesh_dim_names=('dp',))
```
In this PR, we extend implicit_replication to replicate DTensor in missing dimensions for foreach ops. If users don't want to fully tensor parallelize the model when using 2D, they have the option of using the `implicit_replication()` context manager for `optimizer.step()`. In this case, we would swap out the 1D DTensorSpec and replace it with 2D DTensorSpec. However, we don't want to turn this on by default yet, as we want the users to be aware that the tp dimension is replicated if a layer is not tp-ed.
With implicit replication turned on, replicating the DTensor spec in the missing dimensions works for most foreach cases, except when the first DTensor in the list is itself one that needs to be replicated. This is currently a limitation for which I don't have a good solution yet. With this change, we can handle most cases except the one where the first DTensor's ndim is not the largest.
```
[2D_DTensor, 1D_DTensor...] ---> Implicit_replication() can handle this.
[1D_DTensor, 2D_DTensor...] ---> Implicit_replication() can't handle this.
```
This change doesn't affect the existing default behavior, as `implicit_replication()` is not turned on by default.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134551
Approved by: https://github.com/tianyu-l
So that the tensor's lifetime management is the same as the management built for the NCCL pre- and post-kernels.
Also so that, on visualizers, these kernels show up in the NCCL stream line. Otherwise, if they show up in the compute line, the user may get confused ("my code does not have these kernels").
The check is thus moved after the point where we make the NCCL stream depend on the last compute kernel.
Also moved the declaration of `checkForNan` from Utils.hpp to NCCLUtils.hpp, and renamed Utils.cu to NCCLUtils.cu.
Differential Revision: [D61957573](https://our.internmc.facebook.com/intern/diff/D61957573)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134300
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
The previous PR https://github.com/pytorch/pytorch/pull/133532 caused a stuck-compilation issue on internal models. In this 2nd attempt PR, we gate the trace_rules.py changes with `if not torch._dynamo.config.skip_fsdp_hooks:`, so that they don't take effect for current graph-break FSDP2 (which relies on the default config value `skip_fsdp_hooks=True`), and will only take effect when we are using Traceable FSDP2 (in which case the user needs to proactively set `skip_fsdp_hooks=False`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134539
Approved by: https://github.com/ckluk2, https://github.com/yanboliang
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
There are two function variants to get accumulated dtype for a given dtype:
- Func1: `c10::ScalarType toAccumulateType(c10::ScalarType type, c10::DeviceType device)`
- Func2: `c10::ScalarType toAccumulateType(c10::ScalarType type, bool is_cuda)`
Func1 is general enough to support different devices, while Func2 only supports CUDA and CPU. This PR adds the Intel GPU path to Func1, and we expect users to invoke Func1 to ensure compatibility across different devices.
* __->__ #134465
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134465
Approved by: https://github.com/Skylion007, https://github.com/atalman
## Semantic
The semantic is
(1) By default `torch.serialization.skip_data(materialize_fake_tensors=False)` will make `torch.save` skip writing storages (but reserve space for them in the checkpoint).
```python
import torch
import torch.nn as nn
sd = nn.Linear(3, 5).state_dict()
with torch.serialization.skip_data():
    torch.save(sd, 'foo.pt')
print(torch.load('foo.pt', weights_only=True))
```
(2) With `torch.serialization.skip_data(materialize_fake_tensors=True)`, if a FakeTensor is passed to `torch.save`, the pickler will treat these FakeTensors as being "materialized" (space will be reserved in the checkpoint for the associated storage bytes, and when loading, the type will be Tensor instead of FakeTensor).
```python
import torch
import torch.nn as nn
from torch._subclasses.fake_tensor import FakeTensorMode
with FakeTensorMode():
    m = nn.Linear(3, 5, dtype=torch.float16, device='cuda')
    sd = m.state_dict()
with torch.serialization.skip_data(materialize_fake_tensors=True):
    torch.save(sd, 'bla.pt')
print(torch.load('bla.pt', weights_only=True))
# OrderedDict([('weight', tensor([[0., 0., 0.],
# [0., 0., 0.],
# [0., 0., 0.],
# [0., 0., 0.],
# [0., 0., 0.]], device='cuda:0', dtype=torch.float16)), ('bias', tensor([0., 0., 0., 0., 0.], device='cuda:0', dtype=torch.float16))])
```
## Follow Ups
- [ ] `torch.load` semantic for skip_data context manager
- [ ] Mechanism for getting offsets of storages saved via this method (for writing in a separate pass)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134504
Approved by: https://github.com/albanD
Summary:
Encountered issues related to AMD build when working on https://www.internalfb.com/diff/D60739324?dst_version_fbid=2203158110057105 (see stack trace P1545717562)
Looking at the file history, it seems that the flag is no longer used, so I propose to remove it. Alternatively, I could change the `#ifdef` to check both `USE_C10D_NCCL` and `USE_ROCM` and include the corresponding AMD header files.
Let me know which way is preferred.
Test Plan: Sandcastle
Differential Revision: D61762129
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134404
Approved by: https://github.com/malfet
A user wants to use the flop counter with meta devices. This previously caused problems for SDPA+NJT:
1. autocast check: `torch.is_autocast_enabled("meta")` fails because `meta` is not valid for autocasting. If we skip this, we run into the next error
2. math backend: conversion to NST requires getting concrete offsets in a list of python integers, which doesn't work on a meta tensor b2eb0e8c6a/torch/nested/_internal/sdpa.py (L809-L815)
3. (fixed in the previous PR, #134288) - if we force using flash attention backend for flop counting, `_flash_attention_forward` previously didn't support meta tensors.
In this PR, we check specifically for FlopCounterMode, and, if it's enabled and combined with meta tensors, (a) skip autocasting and (b) force it down the flash attention path. This isn't generally safe for tracing (e.g. if you actually care which kernels you are running), but in the absence of actual device information, we have to make some assumptions. By specifically checking for FlopCounterMode, this should reduce the chance of unintended side effects for other meta tensor users.
Note: fake tensor would solve a bunch of these issues, but it's not a viable solution right now for the user.
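A hedged sketch of flop counting on meta tensors with a plain Linear stand-in (not the SDPA+NJT path this PR fixes; shapes are made up):
```python
import torch
from torch.utils.flop_counter import FlopCounterMode

# Meta tensors carry shapes/dtypes but no data, so the flop counter can still
# attribute flops to the dispatched ops without any real device.
m = torch.nn.Linear(512, 1024, device="meta")
x = torch.randn(32, 512, device="meta")

with FlopCounterMode(display=False) as flop_counter:
    m(x)

print(flop_counter.get_total_flops())  # 2 * 32 * 512 * 1024 for the matmul
```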
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134289
Approved by: https://github.com/soulitzer
ghstack dependencies: #134288
Fixes#130394
TorchInductor doesn't respect the original strides of outputs. This opens up optimization opportunities, like changing the memory layout. But in some cases, such as the one in https://github.com/pytorch/pytorch/issues/130394, we do need the output to match the exact strides required. Correctness is the first-priority goal. So this PR adds a new API, `ir.ExternKernel.require_exact_strides(x, exact_strides, allow_padding=False)`, to fix the issue. It makes both dense and non-dense outputs follow the strides required by the semantics.
The comparison between the original code and the code after this fix for the test is below.
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 128
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 8
    x1 = (xindex // 8)
-   x2 = xindex
    tmp0 = tl.load(in_ptr0 + (x0 + (16*x1)), xmask)
    tmp1 = tmp0 + tmp0
-   tl.store(out_ptr0 + (x2), tmp1, xmask)
+   tl.store(out_ptr0 + (x0 + (16*x1)), tmp1, xmask)

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (16, 8), (16, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
-       buf1 = empty_strided_cuda((16, 8), (8, 1), torch.float32)
+       buf1 = empty_strided_cuda((16, 8), (16, 1), torch.float32)
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_copy_0.run(arg0_1, buf1, 128, grid=grid(128), stream=stream0)
        del arg0_1
    return (buf1, )
```
buf1 is created with the exact strides required by the user, and its values are written with the same strides as the input.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130956
Approved by: https://github.com/eellison, https://github.com/blaine-rister, https://github.com/desertfire
```
compile time instruction count for iteration 1 is 10732129038
compile time instruction count for iteration 2 is 10719776783
compile time instruction count for iteration 3 is 10729546868
compile time instruction count for iteration 4 is 10737655132
compile time instruction count for iteration 5 is 10732564252
compile time instruction count for iteration 6 is 10728721234
compile time instruction count for iteration 7 is 10733354271
compile time instruction count for iteration 8 is 10719588972
compile time instruction count for iteration 9 is 10706311856
```
1. add torch.manual_seed(0); inputs were not the same across iterations
2. disable gc
3. remove the loop (not needed since compilation happens only once); see the sketch below
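A hedged sketch of these stabilization steps as a tiny harness (hypothetical, not the actual benchmark code):
```python
import gc
import torch

def measure(benchmark_fn):
    torch.manual_seed(0)       # 1. identical inputs across iterations
    gc.disable()               # 2. keep GC pauses out of the measurement
    try:
        return benchmark_fn()  # 3. no repeat loop: compilation happens only once
    finally:
        gc.enable()
```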
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134649
Approved by: https://github.com/aorenste
ghstack dependencies: #133834, #134635
Summary: The default c_shim version was switched to 2 for HIP in D60674018. This results in some linking errors where shim function symbols are missing from the compiled .so file (eg. P1551186492) when building lowering benchmark scripts since the required files aren't included. Hipify the shim v2 generated header files as well since they're needed during codegen when the buck binaries are executed.
Reviewed By: frank-wei, zoranzhao, henryoier
Differential Revision: D61865202
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134689
Approved by: https://github.com/zoranzhao
Summary:
This fixes the PyTorch issue filed at https://github.com/pytorch/pytorch/issues/133010.
One way to fix this problem is to enable starting processes in parallel in mp.start_processes.
What else is in the diff:
- refactored the api_test test case, which was repeating a lot of tests due to inheritance;
- added a unit test for forkserver when parallel start is on.
Test Plan: Added unit tests
Differential Revision: D61878552
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134629
Approved by: https://github.com/d4l3k
Hi,
I noticed the `unfold` operator was missing on MaskedTensor.
I tested that my change works when calling unfold and backward on a `MaskedTensor`, but I didn't find the tests for the dispatch of such an operation. Where are they?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262
Approved by: https://github.com/cpuhrsch
- This PR generates a more useful output log for users: P1552399180.
- It also fixes the logic when we check for all-gather size mismatches.
- Add a dtype check for collective input/output.
- We store more context information for the error match_state so that we can report it in the file.
- Disable the size match for alltoall because we don't log the sizes of all inputs/outputs.
- Correct some types in the func args specification.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134528
Approved by: https://github.com/c-p-i-o
This PR adds a basic Runtime Estimator for single-device models.
It estimates the GPU runtime in milliseconds using various estimation methods under the ``FakeTensorMode``.
It provides a ``TorchDispatchMode`` based context manager that can estimate the eager runtime of PyTorch functions. It supports two estimation modes, benchmarking (`operator-level-benchmark`) and roofline cost modeling (`operator-level-cost-model`).
For modules executed under this context manager, it aggregates the forward and backward operation runtimes and records their execution order.
```
import torch
from torch import nn, optim
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.distributed._tools.runtime_estimator import RuntimeEstimator
from torch.testing._internal.distributed._tensor.common_dtensor import (
    ModelArgs,
    Transformer,
)

if __name__ == "__main__":

    def _train_step(
        model: nn.Module,
        optimizer: optim.Optimizer,
        inp: torch.Tensor,
    ):
        out = model(inp)
        loss = out.sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dev = torch.cuda.current_device()
    vocab_size = 8192
    bsz, seq_len = 32, 1024
    model_args = ModelArgs(
        n_layers=4,
        n_heads=12,
        vocab_size=vocab_size,
        max_seq_len=seq_len,
        dim=768,
        dropout_p=0.1,
    )
    runtime_estimator = RuntimeEstimator()

    with FakeTensorMode():
        with torch.device(dev):
            model = Transformer(model_args)
            optimizer = optim.Adam(model.parameters(), lr=1e-2, foreach=True)
            inp = torch.randint(0, model_args.vocab_size, (bsz, model_args.max_seq_len), device=dev)
            with runtime_estimator("operator-level-benchmark"):
                _train_step(model, optimizer, inp)
            with runtime_estimator("operator-level-cost-model"):
                _train_step(model, optimizer, inp)

    # Actual model runtime
    with torch.device(dev):
        model = Transformer(model_args)
        optimizer = optim.Adam(model.parameters(), lr=1e-2, foreach=True)
        inp = torch.randint(0, model_args.vocab_size, (bsz, model_args.max_seq_len), device=dev)
    warmup_iters, actual_iters = 2, 5
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup_iters):
        _train_step(model, optimizer, inp)
    start_event.record()
    for _ in range(actual_iters):
        _train_step(model, optimizer, inp)
    end_event.record()
    torch.cuda.synchronize()
    measured_time = start_event.elapsed_time(end_event) / actual_iters
    print(f"Actual total_time: {measured_time:.3f} ms")
```
<img width="506" alt="Screenshot 2024-08-26 at 11 27 15 PM" src="https://github.com/user-attachments/assets/04d243c9-21a6-4389-8c20-80958980788c">
@weifengpy @xuanzhang816 @gnadathur
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134243
Approved by: https://github.com/weifengpy
**Summary**
Reland of https://github.com/pytorch/pytorch/pull/134294. Fixes #131446, #126852, #126868, #126493.
The PR was reverted due to a CI red signal in https://github.com/pytorch/pytorch/actions/runs/10537099590/job/29201744658. It seems that the `gaussian_nll_loss` test had been flaky before my original PR #134294. Therefore, this PR also removes the `xfail` mark on this specific test to make the CI signal green.
See the error message below:
```
2024-08-24T13:42:01.3228990Z ==================================== RERUNS ====================================
2024-08-24T13:42:01.3229530Z [31m[1m_ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _[0m
2024-08-24T13:42:01.3229710Z Unexpected success[90m[39;49;00m
2024-08-24T13:42:01.3230235Z [31m[1m_ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _[0m
2024-08-24T13:42:01.3230407Z Unexpected success[90m[39;49;00m
2024-08-24T13:42:01.3230594Z =================================== FAILURES ===================================
2024-08-24T13:42:01.3231128Z [31m[1m_ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _[0m
2024-08-24T13:42:01.3231296Z Unexpected success[90m[39;49;00m
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134509
Approved by: https://github.com/tianyu-l, https://github.com/wz337
The original DCP doesn't flatten all the containers, which can cause issues. https://github.com/pytorch/pytorch/pull/125335 intended to solve the issue by flattening all the dictionaries.
Unfortunately, it breaks checkpoints that were saved before 2.4. This
also exposes some issues with DCP:
1. DCP should record a version in the metadata.
2. DCP should have a nice way to load old state_dicts.
3. DCP should unflatten all containers (map, list), not just maps.
This PR only addresses issue 2 to unblock users. Issues 1 and 3 need to be addressed in the future.
@pradeepfn Please let me know if this summary matches our discussion.
Fixes https://github.com/pytorch/pytorch/issues/133923
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134158
Approved by: https://github.com/wz337, https://github.com/pradeepfn
Summary: benchmarks/dynamo/ci_expected_accuracy/update_expected.py expects a benchmark run config to be named {config}_{benchmark}, and CPU tests should follow the same naming convention.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134639
Approved by: https://github.com/huydhn
Summary: Recently https://github.com/pytorch/pytorch/pull/133620 added support for automatic dynamic shapes, where a new enum, `DIM`, was introduced to provide hints like `AUTO` and `STATIC`. This PR is a nominal change where we expose the hints via the existing public `Dim` API, and remove `DIM` from the public API. The main motivation is to avoid having users need to import too many things.
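A hedged usage sketch of the exposed hints, assuming the `Dim.AUTO`/`Dim.STATIC` spelling described in the summary (shapes are made up):
```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

# Dim.AUTO asks export to infer dynamism for dim 0; Dim.STATIC pins dim 1.
ep = export(M(), (torch.randn(4, 8),), dynamic_shapes={"x": (Dim.AUTO, Dim.STATIC)})
print(ep)
```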
Test Plan: existing
Differential Revision: D61807361
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134484
Approved by: https://github.com/angelayi
Fixes #134391, #124714
The above issues reported that `dist.barrier()` could hang in some cases.
The culprit is that ProcessGroupNCCL inferred a wrong device to perform the dummy all-reduce.
After the PR, the following will be the order of device selection (see the sketch after this list):
- 1st choice: `opts.device_ids`, if provided by user via `barrier(opts)`.
- 2nd choice: bound device id, if provided to `init_process_group` via `device_id` arg.
- 3rd choice: `usedDeviceIdxs_` recorded in current PG. Will have a value from previous collectives.
- 4th choice: `globalRank() % localDeviceCount_`. This can only happen when `dist.barrier()` is the first call of the PG.
What's new:
- Added the 2nd choice.
- In the 4th choice, we use `globalRank()` instead of group-local rank, because the group-local rank can be offset wrt the device id if intra-node GPUs are sharded into multiple dimensions.
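A hedged Python re-sketch of this selection order (hypothetical helper, not the actual ProcessGroupNCCL C++ code):
```python
def pick_barrier_device(opts_device_ids, bound_device_id, used_device_idxs,
                        global_rank, local_device_count):
    if opts_device_ids:                          # 1st: barrier(opts) with device_ids
        return opts_device_ids[0]
    if bound_device_id is not None:              # 2nd: device_id bound at init_process_group
        return bound_device_id
    if used_device_idxs:                         # 3rd: devices used by previous collectives
        return sorted(used_device_idxs)[0]
    return global_rank % local_device_count      # 4th: heuristic guess for a fresh PG

print(pick_barrier_device([], None, set(), global_rank=5, local_device_count=4))  # 1
```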
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134617
Approved by: https://github.com/yifuwang, https://github.com/shuqiangzhang
This adds logs if we can't acquire locks in NCCLUtils and ProcessGroupNCCL for 30s.
This is motivated by some deadlocks we're seeing, and it's unclear whether they are in NCCL or on the PyTorch side of things.
This required replacing most `std::mutex` with `std::timed_mutex` and `std::condition_variable_any` as appropriate.
Test plan:
existing CI for regressions
will add unit tests on `C10D_LOCK_GUARD`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134131
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
Summary:
There's 2 concepts of unsupported sympy.Functions in symbolic_shapes:
1) unsupported by the export solver, meaning the solver doesn't know how to provide useful fixes for those functions
2) unsupported by the sympy interpreter - meaning we can't reify them into FX nodes because the functions aren't present in PythonReferenceAnalysis
This splits the current call into a call for each version, with the Export solver the only user of 1). For 1), we enumerate the functions in _sympy/functions.py, and subtract the functions we know we can support. For 2) there's only 3 functions we've seen pop up in test cases.
cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10
Differential Revision: D61863394
Pulled By: pianpwk
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134598
Approved by: https://github.com/angelayi
PYTHONPATH=$(pwd) python benchmarks/update_hint_benchmark.py out
As of this diff, compile_time_instruction_count counts the number of instructions from within
`convert_frame.compile_inner`.
```
update_hint_regression,compile_time_instruction_count,10522459165
```
will add result from CI once populated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133834
Approved by: https://github.com/aorenste
Summary: apparently DIM.AUTO leads to duck sizing; I didn't catch this. Doing the least intrusive fix possible by using `torch._dynamo.maybe_mark_dynamic()` under the hood.
Test Plan: added test
Differential Revision: D61809344
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134486
Approved by: https://github.com/avikchaudhuri
See #121528 for additional context.
In #120682, we moved the attention kernels from meta_registrations to fake_impls with the intent of fixing the device handling for seed/offset: these are typically on CPU. We needed to put the registrations in fake_impls to do this because meta_registrations doesn't have a way to specify device, whereas fake_impls does. But when we tried to actually fix the device types (#120839), we had to revert the PR because it broke cudagraph handling (during which seed/offset _are_ on CUDA).
Now, we want to put the registrations back in meta_registrations so that we can call these kernels with meta tensors. The use case is later in this stack - we want to be able to use the flop counter with these kernels.
Also - I specifically skip the `compare_tensor_meta()` check in test_fake / test_fake_autocast tests for the `_efficient_attention_forward` and `_flash_attention_forward` kernels, which fails because of the device mismatch from the seed/offset tensors. Then we can un-skip these opinfos. I verified that the efficient_attention_forward bug (#120842) is now caught by these opinfos if I revert the fix from this PR.
Differential Revision: [D61687369](https://our.internmc.facebook.com/intern/diff/D61687369)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134288
Approved by: https://github.com/drisspg
Maintainers have the links to their GitHub profiles, but the major contributors do not have them.
I added the links to the contributors' GitHub accounts in case anyone wants to follow them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133787
Approved by: https://github.com/albanD
Seeing failures like this:
```
#49 844.6 //build_scripts/manylinux1-check.py:6: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives
.....
[python 3/3] RUN bash build_scripts/build.sh && rm -r build_scripts:
846.9 ...it did, yay.
846.9 + for PYTHON in '/opt/python/*/bin/python'
846.9 + /opt/python/cpython-3.12.0/bin/python build_scripts/manylinux1-check.py
847.0 Traceback (most recent call last):
847.0 File "//build_scripts/manylinux1-check.py", line 55, in <module>
847.0 if is_manylinux1_compatible():
847.0 ^^^^^^^^^^^^^^^^^^^^^^^^^^
847.0 File "//build_scripts/manylinux1-check.py", line 6, in is_manylinux1_compatible
847.0 from distutils.util import get_platform
847.0 ModuleNotFoundError: No module named 'distutils'
------
```
PR: https://github.com/pytorch/pytorch/pull/134455
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134595
Approved by: https://github.com/kit1980, https://github.com/seemethere, https://github.com/malfet
**Summary**
This PR is a follow-up of #126924 to address reviewer's comments:
1) add a test case to show the use of `local_map` as a function decorator.
2) simplify the logic of handling different data types of `out_placements`.
3) correct variable naming in test cases to match math formulas.
**Test**
see #126924
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127752
Approved by: https://github.com/wanchaol
This fixes BatchNorm behavior when called with empty tensors on the MPS backend. Removed `expectedFailureMPS` in test_nn.py, deleted the expected failure in `test_mps.py`, and adjusted `skipIfMPS` to `expectedFailureMPS` in the BatchNorm2d OpInfo decorator, but restricted it to the memory-format tests only.
Test Plan: CI + `python3 -c "import torch; print(torch.nn.BatchNorm2d(3, device='mps')(torch.rand(0, 3, 2, 2, device='mps')))"`
Fixes https://github.com/pytorch/pytorch/issues/134423
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134540
Approved by: https://github.com/Skylion007, https://github.com/albanD
## Context
In some user Triton kernels, we have this set-up for whatever reason.
```
@triton.jit
def mykernel(
    param0,
    param1,
    param2,
    param3: tl.constexpr, # autotuned
    param4, # non-constexpr
):
    ...
```
This is an edge case because it's general practice to declare all constexpr params at the end.
This is an issue for AOTI because it fails to codegen all 4 params, which surfaces as a device-side error: CUDA IMA, invalid argument, ...
```
> void* kernel_args_var_0[] = {&var_0, &var_1, &var_2};
---
< CUdeviceptr var_3;
< AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_get_data_ptr(buf0, reinterpret_cast<void**>(&var_3)));
< void* kernel_args_var_0[] = {&var_0, &var_1, &var_2, &var_3};
```
## Root-cause
* `kernel.constexpr` from the Kernel side-table contains the indices for all `constexpr` params that includes autotuned params.
* `raw_args`, that gets passed to wrapper codegen, excludes autotuned args.
* In the wrapper codegen, we try to find non-constexpr args using `kernel.constexpr` & `raw_args`. This is okay unless there's a `raw_arg` after an autotuned param in the function signature.
79b7fff188/torch/_inductor/codegen/cpp_wrapper_cuda.py (L118-L126)
## Fix
We fix this by calculating the right constexpr indices with respect to `raw_args`.
An illustration
```
raw_args: [arg0, arg1, arg2, arg4]
kernel.arg_names: [param0, param1, param2, param3, param4]
kernel.constexprs: [3] # param3 is autotuned; this is correct wrt kernel.arg_names
constexpr_indices: [] # this is correct wrt raw_args
```
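A hedged sketch of the remapping, with hypothetical helper and variable names rather than the actual wrapper-codegen code:
```python
def constexpr_indices_wrt_raw_args(arg_names, constexpr_idxs, raw_arg_names):
    # kernel.constexprs indexes into kernel.arg_names; re-express it against
    # raw_args, which excludes autotuned params.
    constexpr_names = {arg_names[i] for i in constexpr_idxs}
    return [i for i, name in enumerate(raw_arg_names) if name in constexpr_names]

arg_names = ["param0", "param1", "param2", "param3", "param4"]
constexpr_idxs = [3]                                      # param3 is autotuned
raw_arg_names = ["param0", "param1", "param2", "param4"]  # autotuned args excluded
print(constexpr_indices_wrt_raw_args(arg_names, constexpr_idxs, raw_arg_names))  # []
```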
Differential Revision: [D61831625](https://our.internmc.facebook.com/intern/diff/D61831625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134520
Approved by: https://github.com/oulgen
This is designed to be a more ergonomic interface on top of justknob_feature (see https://github.com/pytorch/pytorch/pull/134151 for just the PR with the base commits).
The idea is that people stop having to think about this as much, and can just do `JustKnobsConfig("//the:thing", "FORCE_THING")` and it'll do the right thing.
Primarily sending this to see how people feel about the API, and to use it for new config changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134161
Approved by: https://github.com/ezyang
Clarify that `add_safe_globals` will allow types for these instructions.
Some types do not appear as `GLOBAL` and are only caught in `BUILD`; an example from the HF Slack is `numpy.dtypes.UInt32DType`:
```python
import torch
import numpy as np
from tempfile import TemporaryDirectory
from pathlib import Path
from codecs import encode
torch.serialization.add_safe_globals([encode, np.dtype, np.core.multiarray._reconstruct, np.ndarray])
with TemporaryDirectory() as tempdir:
    p = Path(tempdir)
    r2 = np.random.get_state()
    torch.save(r2, p / "r2.pkl")
    torch.load(p / "r2.pkl", weights_only=True)
```
Yields (error comes from BUILD)
```
UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error: Can only build Tensor, parameter or OrderedDict objects, but got <class 'numpy.dtypes.UInt32DType'>
```
The reasoning is that `numpy.dtypes.UInt32DType` is constructed via `REDUCE` with `func = <class 'numpy.dtype'>` and `args = ('u4', False, True)`, so clarify in the error message that calling `add_safe_globals` on these types will also allow them.
After this PR error message becomes
```
_pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error: Can only build Tensor, Parameter, OrderedDict or types allowlisted via `add_safe_globals`, but got <class 'numpy.dtypes.UInt32DType'>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134346
Approved by: https://github.com/albanD
Changes jobs to go back to using the default AMI.
Note: This is only a cleanup PR. It does NOT introduce any behavior changes in CI
Now that the default variant uses the Amazon 2023 AMI and has been shown to be stable for a week, it's time to remove the explicit amz2023 references and go back to using the default variant.
After a week or two, when this is rolled out to most people, we can remove the variants from scale config as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134355
Approved by: https://github.com/jeanschmidt
Summary:
Currently the warning is printed when the cat inputs have the same qparams, leading to a flood of warnings.
This diff emits the warning only when the cat inputs don't have the same qparams.
Test Plan: CI
Reviewed By: aprotopopov
Differential Revision: D60638609
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133999
Approved by: https://github.com/tarun292
Fixes#127519
Currently in torchrun rendezvous, there are only two rendezvous backends supported out of the box: `C10d` and `Etcd`. The changes in this PR enable distributed elastic users to bring their own out-of-tree rendezvous backend implementations as Python packages.
#### AUTHORING NEW PLUGIN
Any new plugin will be a python package exposing entry-points. For example, the structure of redis plugin is as follows:
```
plugin_root
|_ pyproject.toml
|_ src
|_ redis
|_ __init__.py
|_ redis_store.py
|_ redis_backend.py
```
The contents of `pyproject.toml` should indicate that the package exposes a torchrun entry point by mentioning the group name `torchrun.plugins`. The `pyproject.toml` for the redis plugin would be as follows:
```
[project]
name = "redis"
version = "0.0.1"
[project.entry-points.'torchrun.plugins']
redis = 'redis'
```
The `src/redis/__init__.py` file would contain functions that return the plugin name and plugin handler. The contents of `__init__.py` for redis would be as follows:
```
def getPluginHandler():
    def _create_redis_handler(params: RendezvousParameters):
        from redis_rendezvous_backend import create_backend
        backend, store = create_backend(params)
        return create_handler(store, backend, params)
    return _create_redis_handler
```
The files `redis_store` and `redis_backend` contain the implementation of [Store](41189b0da4/torch/_C/_distributed_c10d.pyi (L171)) and [RendezvousBackend](e782918b8e/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py (L61)) respectively.
#### USER EXPERIENCE
Before using the plugin for the first time, the user has to install the plugin packages. For example, a published package can be installed using `pip3 install <plugin-name>`, and a plugin in the local file system can be installed using `pip3 install -e <plugin-location>`.
Once installed, the new backend can be used in torchrun as follows:
```
torchrun --rdzv-backend=redis --rdzv-endpoint=redis-container:6379 --nnodes=3 --nproc-per-node=1 --max-restarts=3 --rdzv-id=1 test.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132633
Approved by: https://github.com/wconstab
Enable Windows inductor UTs for `test/inductor/test_torchinductor_codegen_dynamic_shapes.py`.
The failure depends on https://github.com/pytorch/pytorch/pull/134429; need to rebase after https://github.com/pytorch/pytorch/pull/134429 is merged.
```cmd
2024-08-25T23:57:23.2747794Z Windows CI does not have necessary dependencies for test_torchinductor_dynamic_shapes yet
2024-08-25T23:57:23.2748541Z Traceback (most recent call last):
2024-08-25T23:57:23.2749593Z File "C:\actions-runner\_work\pytorch\pytorch\test\inductor\test_torchinductor_codegen_dynamic_shapes.py", line 30, in <module>
2024-08-25T23:57:23.2750688Z from inductor.test_torchinductor_dynamic_shapes import (
2024-08-25T23:57:23.2751877Z File "C:\actions-runner\_work\pytorch\pytorch\test\inductor\test_torchinductor_dynamic_shapes.py", line 46, in <module>
2024-08-25T23:57:23.2752876Z raise unittest.SkipTest("requires sympy/functorch/filelock")
2024-08-25T23:57:23.2753545Z unittest.case.SkipTest: requires sympy/functorch/filelock
2024-08-25T23:57:23.2754077Z Got exit code 1
2024-08-25T23:57:23.2754874Z No stepcurrent file found. Either pytest didn't get to run (e.g. import error) or file got deleted (contact dev infra)
```
Local test pass:
<img width="1892" alt="image" src="https://github.com/user-attachments/assets/241ab082-6026-4f33-b3ac-7e9ef7da744d">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134428
Approved by: https://github.com/jansel
Summary:
We want to add compile IDs and frames to each Torch-Compiled Region in order to help users cross-reference the section they are checking alongside data obtained from tools such as tlparse.
This diff operates on the assumption that each graph section will enter and exit a CompileContext before it is run, either to compile the graph or to look it up in the cache. Based on this assumption, we can save the value of the graph section from the exited CompileContext in eval_frame.c using the Python C API. After this, we can create a new interface in the cpp shim to wrap the record_function in order to pass in the new keyword argument for "context".
Test Plan:
Enhance test_profiler_dynamo_compiled_region to look for kwinputs as well as a name to see that the context is now labeled. Also changed test to run graph with more contexts so that we test a wider range of profiling.
Differential Revision: D60803317
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132765
Approved by: https://github.com/anijain2305
This PR increases test coverage by including the tests in `test/test_nn.py` in the test suite of MPS.
Some of the tests are decorated with `@expectedFailureMPS` for various reasons: either the op is not implemented, or the outputs do not align. Tests with differing results should be investigated further to rule out any live bugs.
```bash
$ python test/run_test.py --mps --verbose -k TestNN
Running test batch 'tests to run' cost 84.76 seconds
```
Ref #133520
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134184
Approved by: https://github.com/albanD, https://github.com/malfet
There's 2 concepts of unsupported sympy.Functions in symbolic_shapes:
1) unsupported by the export solver, meaning the solver doesn't know how to provide useful fixes for those functions
2) unsupported by the sympy interpreter - meaning we can't reify them into FX nodes because the functions aren't present in PythonReferenceAnalysis
This splits the current call into a call for each version, with the Export solver the only user of 1). For 1), we enumerate the functions in _sympy/functions.py, and subtract the functions we know we can support. For 2) there's only 3 functions we've seen pop up in test cases.
Differential Revision: D61677956
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134271
Approved by: https://github.com/avikchaudhuri
Enable Windows inductor UTs for `test/inductor/test_binary_folding.py`.
The failing UT depends on https://github.com/pytorch/pytorch/pull/134427.
Need to rebase after https://github.com/pytorch/pytorch/pull/134427 is merged.
```cmd
2024-08-25T23:32:23.0905727Z Traceback (most recent call last):
2024-08-25T23:32:23.0906516Z File "C:\actions-runner\_work\pytorch\pytorch\test\inductor\test_binary_folding.py", line 18, in <module>
2024-08-25T23:32:23.0908200Z from inductor.test_inductor_freezing import TestCase
2024-08-25T23:32:23.0909883Z File "C:\actions-runner\_work\pytorch\pytorch\test\inductor\test_inductor_freezing.py", line 39, in <module>
2024-08-25T23:32:23.0911128Z raise unittest.SkipTest("requires sympy/functorch/filelock")
2024-08-25T23:32:23.0911801Z unittest.case.SkipTest: requires sympy/functorch/filelock
2024-08-25T23:32:23.0912370Z Got exit code 1
2024-08-25T23:32:23.0913155Z No stepcurrent file found. Either pytest didn't get to run (e.g. import error) or file got deleted (contact dev infra)
```
Local test pass:
<img width="1898" alt="image" src="https://github.com/user-attachments/assets/4a6e3f66-4bbc-4aab-8f0d-2e2318046e53">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134425
Approved by: https://github.com/ezyang, https://github.com/jansel
Windows file paths use `\` as the delimiter, which is also an escape character. We need to translate all `\` in paths to `/`, as on Linux.
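A hedged sketch of that translation (hypothetical helper, not the actual minifier code):
```python
def normalize_path_separators(path: str) -> str:
    # '\' is both the Windows separator and a Python string escape character,
    # so emit '/' when embedding paths in generated source.
    return path.replace("\\", "/")

print(normalize_path_separators(r"C:\Users\Xuhan\AppData\Local\Temp\tmpufu9t3pc"))
# C:/Users/Xuhan/AppData/Local/Temp/tmpufu9t3pc
```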
Reproduce UTs:
```cmd
pytest test\dynamo\test_minifier.py -v -k test_after_dynamo_cpu_accuracy_error
```
Error message:
```cmd
____________________________________________________________________________________________________________ MinifierTests.test_after_dynamo_cpu_accuracy_error _____________________________________________________________________________________________________________
Traceback (most recent call last):
File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_minifier.py", line 40, in test_after_dynamo_cpu_accuracy_error
self._test_after_dynamo(
File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_minifier.py", line 27, in _test_after_dynamo
self._run_full_test(run_code, "dynamo", expected_error, isolate=False)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_dynamo\test_minifier_common.py", line 235, in _run_full_test
self.assertIn(expected_error, test_proc.stderr.decode("utf-8"))
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 1112, in assertIn
self.fail(self._formatMessage(msg, standardMsg))
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 675, in fail
raise self.failureException(msg)
AssertionError: 'AccuracyError' not found in 'Traceback (most recent call last):\n File "C:\\Users\\Xuhan\\.conda\\envs\\win_mkl_static\\lib\\site-packages\\torch\\_dynamo\\test_minifier_common.py", line 114, in _maybe_subprocess_run\n exec(code, {"__name__": "__main__", "__compile_source__": code})\n File "<string>", line 9\n torch._dynamo.config.debug_dir_root = "C:\\Users\\Xuhan\\AppData\\Local\\Temp\\tmpufu9t3pc"\n ^\nSyntaxError: (unicode error) \'unicodeescape\' codec can\'t decode bytes in position 2-3: truncated \\UXXXXXXXX escape\n'
To execute this test, run the following from the base repo dir:
python test\dynamo\test_minifier.py MinifierTests.test_after_dynamo_cpu_accuracy_error
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
--------------------------------------------------------------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------------------------------------------------------------
test stdout:
test stderr: Traceback (most recent call last):
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_dynamo\test_minifier_common.py", line 114, in _maybe_subprocess_run
exec(code, {"__name__": "__main__", "__compile_source__": code})
File "<string>", line 9
torch._dynamo.config.debug_dir_root = "C:\Users\Xuhan\AppData\Local\Temp\tmpufu9t3pc"
^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape
--------------------------------------------------------------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------------------------------------------------------------
running test
```
Local test passed:
<img width="849" alt="image" src="https://github.com/user-attachments/assets/4a4eecc2-7c08-4de6-9395-546b69803b16">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134365
Approved by: https://github.com/jansel, https://github.com/jgong5
Optimize the memory cost from [PR #129635](https://github.com/pytorch/pytorch/pull/129635).
There are 2 main parts of the optimization here:
1. Optimize the tensor-distributing part: postpone the full_tensor generation, which avoids the memory overlap and saves around 50% peak memory in the 2-param test case.
2. Apply `assign=True` for `load_state_dict`, which saves memory during state_dict loading by assigning the input param, around 50% peak memory in the loading part.
Future work:
Memory optimization for the optimizer will be conducted in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134025
Approved by: https://github.com/fegin
Co-authored-by: Rachel Guo <guorachel@meta.com>
Summary: Fixes https://github.com/pytorch/pytorch/issues/134133
Test Plan:
Tested on the small repro in the linked issue with different lengths N (replacing 100), recording N vs. time taken in nanoseconds:
10 127268319
20 220839662
30 325463125
40 429259441
50 553136055
60 670799769
70 999170514
80 899014103
90 997168902
100 1168202035
110 1388556619
120 1457488235
130 1609816470
140 2177889877
150 1917560313
160 2121096113
170 2428502334
180 4117450755
190 4003068224
So N ~ 200 takes ~5s. Previously even smaller N would go for >1 min.
Didn't add a perf test because ezyang is planning to build a benchmark.
Also tested on https://www.internalfb.com/diff/D61560171, which now gets past the stuck point.
Differential Revision: D61619660
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134150
Approved by: https://github.com/ezyang
Because aten.poisson doesn't have a meta function registered, there is one additional eager execution of this op during the compilation phase of torch.compile.
There are more ops without meta registrations. Is there any reason for that?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134103
Approved by: https://github.com/ezyang
I had a nightmare rewriting tests in test_misc.py specifically:
1. Graphs can have comments that refer to my files ("/lsakka/.."); we really don't care about comments, so add an option to ignore comments.
2. Empty lines added when EXPECTTEST_ACCEPT=1 are changed by the linter, causing the tests or the linter to fail!
Add a flag to ignore empty lines.
3. EXPECTTEST_ACCEPT fails when the text has some non-readable characters. Those should not affect comparing strings, and they also cause weird diffs when tests fail. I removed the ANSI escape chars in https://github.com/pytorch/pytorch/pull/133045.
This is used in:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134248
Approved by: https://github.com/aorenste
ghstack dependencies: #133639, #134364
This UT's actual output has only a one-empty-line difference (between `linear` and `add`) on Windows vs. Linux, and the content is otherwise correct.
Reproduce UTs:
```cmd
pytest test\dynamo\test_higher_order_ops.py -v -k test_functional_call_sequential_params_and_buffers
```
We can add `empty_line_normalizer` to fix it.
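A hedged sketch of what such a normalizer could look like (hypothetical; the actual helper may differ):
```python
def empty_line_normalizer(text: str) -> str:
    # Drop blank lines so a single extra empty line on Windows vs. Linux
    # does not break string comparison of graph dumps.
    return "\n".join(line for line in text.splitlines() if line.strip())
```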
```cmd
______________________________________________________________________________________________ FuncTorchHigherOrderOpTests.test_functional_call_sequential_params_and_buffers _______________________________________________________________________________________________
Traceback (most recent call last):
File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py", line 3676, in test_functional_call_sequential_params_and_buffers
self.assertExpectedInline(
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 2871, in assertExpectedInline
return super().assertExpectedInline(actual if isinstance(actual, str) else str(actual), expect, skip + 1)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\expecttest\__init__.py", line 271, in assertExpectedInline
self.assertMultiLineEqualMaybeCppStack(expect, actual, msg=help_text)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\expecttest\__init__.py", line 292, in assertMultiLineEqualMaybeCppStack
self.assertMultiLineEqual(expect, actual, *args, **kwargs)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 1226, in assertMultiLineEqual
self.fail(self._formatMessage(msg, standardMsg))
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 675, in fail
raise self.failureException(msg)
AssertionError: 'clas[509 chars]one\n add: "f32[1, 1]" = linear + l_buf[69 chars],)\n' != 'clas[509 chars]one\n\n add: "f32[1, 1]" = linear + l_b[71 chars],)\n'
class GraphModule(torch.nn.Module):
def forward(self, L_params_l1_weight_: "f32[1, 1]", L_params_l1_bias_: "f32[1]", L_buffers_buffer_: "f32[1]", L_inputs_: "f32[1, 1]"):
l_params_l1_weight_ = L_params_l1_weight_
l_params_l1_bias_ = L_params_l1_bias_
l_buffers_buffer_ = L_buffers_buffer_
l_inputs_ = L_inputs_
linear: "f32[1, 1]" = torch._C._nn.linear(l_inputs_, l_params_l1_weight_, l_params_l1_bias_); l_inputs_ = l_params_l1_weight_ = l_params_l1_bias_ = None
+ <<<< (difference is here )
add: "f32[1, 1]" = linear + l_buffers_buffer_; linear = l_buffers_buffer_ = None
return (add,)
: To accept the new output, re-run test with envvar EXPECTTEST_ACCEPT=1 (we recommend staging/committing your changes before doing this)
To execute this test, run the following from the base repo dir:
python test\dynamo\test_higher_order_ops.py FuncTorchHigherOrderOpTests.test_functional_call_sequential_params_and_buffers
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
========================================================================================================================== short test summary info ==========================================================================================================================
FAILED [0.4275s] test/dynamo/test_higher_order_ops.py::FuncTorchHigherOrderOpTests::test_functional_call_sequential_params_and_buffers - AssertionError: 'clas[509 chars]one\n add: "f32[1, 1]" = linear + l_buf[69 chars],)\n' != 'clas[509 chars]one\n\n add: "f32[1, 1]" = linear + l_b[71 chars],)\n'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134394
Approved by: https://github.com/jansel
Co-authored-by: Jason Ansel <jansel@jansel.net>
After this I think all `using namespace` will have been eliminated from PyTorch header files. Internally, `-Wheader-hygiene` will prevent more from being added.
Test Plan: Sandcastle
Differential Revision: D61679037
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134336
Approved by: https://github.com/Skylion007
Summary:
This enables patching extern modules to provide compatibility with serialized code depending on different versions of those extern modules.
The main motivation is to enable the NumPy upgrade. In the recent release, many aliases to builtin types were deprecated and removed [1]. This breaks loading pickled modules that reference the removed aliases. While the proper solution is to re-generate the pickled modules, that is not always feasible.
This proposes a way to define a mapping to a new type for a module member. It is only set if the member is not present in the loaded module, which removes the need to check for exact versions.
https://numpy.org/doc/stable/release/1.20.0-notes.html#using-the-aliases-of-builtin-types-like-np-int-is-deprecated
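A minimal sketch of the idea (the helper name and the fallback mapping are hypothetical, not the actual API added by this diff): when a loaded extern module is missing a member that serialized code expects, fill it in; otherwise leave the module untouched.
```python
import numpy as np

# Hypothetical compatibility mapping: member name -> replacement type.
_NUMPY_ALIAS_FALLBACKS = {"int": int, "float": float, "bool": bool, "object": object}

def patch_extern_module(module, fallbacks=_NUMPY_ALIAS_FALLBACKS):
    # Only set a member if the loaded module doesn't already provide it,
    # so no exact-version check is needed.
    for name, replacement in fallbacks.items():
        if not hasattr(module, name):
            setattr(module, name, replacement)

patch_extern_module(np)
```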
Differential Revision: D61556888
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134376
Approved by: https://github.com/SherlockNoMad
If a mesh_dim_name is given, we will use the given mesh_dim_name to name the new flattened dim.
Otherwise, the default is a string concatenating the mesh_dim_names of the given submesh, with each mesh_dim_name separated by "_".
For example, if we have a 3D mesh DeviceMesh([[[0, 1], [2, 3]], [[4, 5], [6, 7]]], mesh_dim_names=("dp", "cp", "tp")), calling mesh_3d["dp", "cp"]._flatten() will create a 1D submesh DeviceMesh([0, 1, 2, 3], mesh_dim_names=("dp_cp",)) on rank 0, 1, 2, 3 and a 1D submesh DeviceMesh([4, 5, 6, 7], mesh_dim_names=("dp_cp",)) on rank 4, 5, 6, 7.
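A hedged sketch of the naming behavior described above (assuming an already-initialized 8-rank job; device type and dim names are illustrative):
```python
from torch.distributed.device_mesh import init_device_mesh

mesh_3d = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"))

# Default: the flattened dim name is the "_"-joined names of the sliced dims.
dp_cp_mesh = mesh_3d["dp", "cp"]._flatten()          # mesh_dim_names == ("dp_cp",)

# Explicit: the given mesh_dim_name is used instead.
named_mesh = mesh_3d["dp", "cp"]._flatten(mesh_dim_name="dp_cp_custom")
```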
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134048
Approved by: https://github.com/fegin
ghstack dependencies: #133838, #133839
Sympy's implementation of Min/Max displays asymptotically bad behavior on `TORCH_COMPILE_CPROFILE=1 python torchrec/distributed/tests/test_pt2_multiprocess.py TestPt2Train.test_compile_multiprocess`. Evidence profile:

On this test case, we spend 42% of all time compiling the network on ShapeEnv.replace, which in turn spends all of its time in xreplace.
The problem appears to be find_localzeros call. By vendoring the implementations of Min/Max, we can potentially reduce the cost of this operation.
The implementation is copy-pasted sympy/functions/elementary/miscellaneous.py but with some adjustments:
* I deleted logic related to differentiation, evalf and heaviside, as it's not relevant to PyTorch reasoning
* There's some massaging to appease PyTorch's linters, including a lot of noqa and type: ignore (which I could potentially refactor away with substantive changes, but that's better as its own change)
* I deleted the second loop iteration for is_connected, as an attempt at initial optimization (this also simplifies the port, since I can omit some code). I'll comment at that point what the exact difference is.
Before this change, the test in question takes 100s with 40 features; after this change, it takes only 69s.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133319
Approved by: https://github.com/Skylion007
Summary:
Today there is no good mechanism to detect progress of non-strict export line-by-line in user code. This caused some pain recently in trying to find the exact line of user code that was triggering a bug where the process appeared stuck because deep down something was calling some symbolic shapes code that was suffering some exponential blowup.
This PR adds an environment variable for extended debugging that will log the line of user code corresponding to every torch function call. It only works in non-strict export for now. Set this environment variable together with `TORCH_LOGS` enabling `export` logs at `DEBUG` level (i.e., with a `+` prefix), e.g.:
```
TORCHEXPORT_EXTENDED_DEBUG_CURRENT_LOC=1 TORCH_LOGS="+export" ...
```
This will show logs with something like:
```
...
prim::device called at .../example.py:4284 in foo
TensorBase.item called at .../example.py:4277 in bar
...
```
We already have an existing place to intercept torch functions where we process data-dependent errors in non-strict, so parking the logging there. An alternative place we could be doing this is where we add `stack_trace` metadata when generating code, but unfortunately at least the example that motivated this gets stuck before generating code, so that would be too late.
Test Plan: ran it on some sample commands
Differential Revision: D61692156
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134298
Approved by: https://github.com/angelayi
Summary: Create a simple test that checks that the FunctionEvent tree is built lazily, by checking that its metrics change before and after the call.
Test Plan: Make sure test passes in CI
Reviewed By: briancoutinho
Differential Revision: D61685429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134359
Approved by: https://github.com/briancoutinho
Fixes #133338
Test Plan:
```
TORCH_LOGS=dynamic python
import torch
torch._dynamo.config.capture_scalar_outputs = True
@torch.compile()
def f(x):
    y = x.item()
    torch._check_is_size(y)
    r = torch.arange(y, dtype=torch.float32)
    torch._check(r.size(0) == y)
    return r
f(torch.tensor([300]))
```
Before and after diff. Verify the following line
```
I0813 11:05:44.890000 652898 torch/fx/experimental/symbolic_shapes.py:5198] [0/0] runtime_assert Eq(CeilToInt(IntTrueDiv(u0, 1)), u0) [guard added] at aa.py:10 in f (_dynamo/utils.py:2092 in run_node), for more info run with TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED="Eq(CeilToInt(IntTrueDiv(u0, 1)), u0)"
```
no longer shows in the logs. Also verify CI passes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134296
Approved by: https://github.com/aorenste
The temporary directory path is currently hard-coded. Fixed by getting the temporary directory path via the standard API.
Reproduce UTs:
```cmd
python test/dynamo/test_dynamic_shapes.py -v -k test_torch_package_working_with_trace_dynamic_shapes
```
Error message:
```cmd
________________________________________________________________________________________________ DynamicShapesMiscTests.test_torch_package_working_with_trace_dynamic_shapes ________________________________________________________________________________________________
Traceback (most recent call last):
File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_misc.py", line 7199, in test_torch_package_working_with_trace
with package.PackageExporter(path) as exp:
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\package\package_exporter.py", line 237, in __init__
self.zip_file = torch._C.PyTorchFileWriter(f)
RuntimeError: Parent directory /tmp does not exist.
To execute this test, run the following from the base repo dir:
python test\dynamo\test_dynamic_shapes.py DynamicShapesMiscTests.test_torch_package_working_with_trace_dynamic_shapes
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
========================================================================================================================== short test summary info ==========================================================================================================================
FAILED [0.0080s] test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_package_working_with_trace_dynamic_shapes - RuntimeError: Parent directory /tmp does not exist.
==================================================================================================================== 1 failed, 1665 deselected in 4.00s =====================================================================================================================
```
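A minimal sketch of the fix's idea (the exact call site differs, and the file name here is illustrative): derive the scratch directory from the platform's temporary directory instead of hard-coding `/tmp`.
```python
import os
import tempfile

# Hard-coded path that breaks on Windows, where /tmp does not exist:
# path = "/tmp/MyTorchPackage.pt"

# Portable version: query the temporary directory from the standard library.
path = os.path.join(tempfile.gettempdir(), "MyTorchPackage.pt")
```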
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134397
Approved by: https://github.com/ezyang
Fixes #130394
TorchInductor doesn't respect the original strides of outputs. This opens up optimization opportunities like changing the memory layout. But in some cases, such as the one in https://github.com/pytorch/pytorch/issues/130394, we do need the output to match the exact strides as required. Correctness is the first-priority goal. So, this PR adds a new API `ir.ExternKernel.require_exact_strides(x, exact_strides, allow_padding=False)` to fix the issue. This PR makes non-dense outputs' strides follow the strides required by the semantics.
The comparison between the original code and the code after this fix for the test is below.
```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 128
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 8
    x1 = (xindex // 8)
-   x2 = xindex
    tmp0 = tl.load(in_ptr0 + (x0 + (16*x1)), xmask)
    tmp1 = tmp0 + tmp0
-   tl.store(out_ptr0 + (x2), tmp1, xmask)
+   tl.store(out_ptr0 + (x0 + (16*x1)), tmp1, xmask)

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (16, 8), (16, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
-       buf1 = empty_strided_cuda((16, 8), (8, 1), torch.float32)
+       buf1 = empty_strided_cuda((16, 8), (16, 1), torch.float32)
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_copy_0.run(arg0_1, buf1, 128, grid=grid(128), stream=stream0)
        del arg0_1
    return (buf1, )
```
buf1 is created with the exact strides required by the user, and its values are written with the same strides as the input.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130956
Approved by: https://github.com/eellison, https://github.com/blaine-rister
torch.cuda.amp.autocast / torch.cpu.amp.autocast are deprecated and spew a ton of warnings when these tests run. This PR updates the tests to just use torch.amp.autocast(device).
Note: this uncovers a bug in the test: when `device` is CUDA, it actually shows up as "cuda:0" - so previously, this test was _always_ using `torch.cpu.amp.autocast` even for the `cuda` device. This PR fixes this, and uncovers additional bugs in `pinverse` and `linalg.pinv`; `linalg.pinv` was already failing before on CPU, but now the test also catches failures on CUDA (and this PR adds them to the skipped-test list).
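A small sketch of the migration and of the device-string pitfall mentioned above (assuming a CUDA build; the exact test code differs):
```python
import torch

device = "cuda:0"  # what the test actually receives for CUDA

# Deprecated per-backend form (emits a deprecation warning):
#     with torch.cuda.amp.autocast(): ...
# Unified form; note the device *type* must be passed, so "cuda:0" has to be
# reduced to "cuda" for the context manager to target the intended backend.
with torch.amp.autocast(torch.device(device).type):
    out = torch.mm(torch.randn(4, 4, device=device), torch.randn(4, 4, device=device))
```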
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134291
Approved by: https://github.com/YuqingJ
Summary:
# context
* when fixing the graph break in _maybe_compute_kjt_to_jt_dict, we encountered this issue P1539489731:
```
[rank0]: ATTENTION: guard_size_oblivious would fix the error, evaluating expression to False.
[rank0]: Maybe you need to add guard_size_oblivious to framework code, see doc below for more guidance.
[rank0]:
[rank0]: Potential framework code culprit (scroll up for full backtrace):
[rank0]: File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/61f992c26f3f2773/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_inductor/fx_passes/post_grad.py", line 671, in slice_noop
[rank0]: if start == 0 and end >= 2**63 - 1 and step == 1:
```
* Change the condition logic to be compatible with SymInt (a sketch of the idea follows below).
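A hedged sketch of what a SymInt-friendly check can look like, following the hint in the error message. `guard_size_oblivious` is the real helper in `torch.fx.experimental.symbolic_shapes`, but the surrounding function is illustrative rather than the exact change in this diff.
```python
from torch.fx.experimental.symbolic_shapes import guard_size_oblivious

def slice_is_noop(start, end, step) -> bool:
    # start/end may be SymInts during tracing; guard_size_oblivious evaluates
    # the comparison without specializing on the exact (possibly unbacked) value.
    return (
        guard_size_oblivious(start == 0)
        and guard_size_oblivious(end >= 2**63 - 1)
        and step == 1
    )
```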
Test Plan:
# commands
* run test
```
TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_v4_mini training.pipeline_type=pt2 2>&1 | tee -a `date +"%Y.%m.%d.%H.%M"`.`sl whereami`.log
```
* tlparse
```
ls -thl /var/tmp/tt | head -9 && tlparse `ls -t /var/tmp/tt/* | head -1`
```
Differential Revision: D61677207
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134270
Approved by: https://github.com/ezyang
Summary:
This diff decomposes torch.ops._quantized.wrapped_quantized_linear into torch.ops._quantized.wrapped_linear_prepack and torch.ops._quantized.wrapped_quantized_linear_prepacked for AOTI, and adds the corresponding impl into the shim.
The way it works is similar to what we did previously for fbgemm fp16 dynamic qlinear. We constant-fold the packed weight at runtime (warm-up) to achieve the speedup.
Reviewed By: desertfire
Differential Revision: D61396144
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134368
Approved by: https://github.com/houseroad
Windows file paths use `\` as the path delimiter, but `\` is also an escape character (e.g. in regular expressions). We need to translate every `\` in such paths to `/`, like on Linux.
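A minimal sketch of the idea, not the exact patch: normalize the separators before the path is used as a regex pattern, so sequences like `\x` in `C:\Users\...` are no longer parsed as (incomplete) regex escapes. The helper name and sample strings here are illustrative.
```python
import os
import re

def strip_file_path(message: str, file: str) -> str:
    # Make the path Linux-like so it is safe to use inside a regex pattern.
    file = file.replace("\\", "/")
    message = message.replace("\\", "/")
    return re.sub(file, os.path.basename(file), message)

print(strip_file_path(r"error in D:\repo\test\dynamo\test_higher_order_ops.py line 3",
                      r"D:\repo\test\dynamo\test_higher_order_ops.py"))
```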
Reproduce UT:
```cmd
pytest test\dynamo\test_higher_order_ops.py -v -k test_vmap_grad_vmap_guard_fail
```
Error msg:
```cmd
________________________________________________________________________________________________________ HigherOrderOpVmapGuardTests.test_vmap_grad_vmap_guard_fail _________________________________________________________________________________________________________
Traceback (most recent call last):
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\logging_utils.py", line 89, in test_fn
fn(self, records)
File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py", line 2714, in test_vmap_grad_vmap_guard_fail
munge_exc(record.getMessage()),
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 5252, in munge_exc
s = re.sub(file, os.path.basename(file), s)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\re.py", line 209, in sub
return _compile(pattern, flags).sub(repl, string, count)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\re.py", line 303, in _compile
p = sre_compile.compile(pattern, flags)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_compile.py", line 788, in compile
p = sre_parse.parse(p, flags)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 955, in parse
p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 444, in _parse_sub
itemsappend(_parse(source, state, verbose, nested + 1,
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 526, in _parse
code = _escape(source, this, state)
File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 370, in _escape
raise source.error("incomplete escape %s" % escape, len(escape))
re.error: incomplete escape \x at position 2
To execute this test, run the following from the base repo dir:
python test\dynamo\test_higher_order_ops.py HigherOrderOpVmapGuardTests.test_vmap_grad_vmap_guard_fail
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
--------------------------------------------------------------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------------------------------------------------------------
frames [('total', 2), ('ok', 2)]
inductor []
inline_call []
stats [('calls_captured', 38), ('unique_graphs', 2)]
--------------------------------------------------------------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------------------------------------------------------------
V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles] Recompiling function fn in D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py:2699
V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles] triggered by the following guard failure(s):
V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles] - 0/0: torch._functorch.pyfunctorch.compare_functorch_state([('Vmap', 1, 'error')]) # _dynamo\output_graph.py:479 in init_ambient_guards
========================================================================================================================== short test summary info ==========================================================================================================================
FAILED [0.7452s] test/dynamo/test_higher_order_ops.py::HigherOrderOpVmapGuardTests::test_vmap_grad_vmap_guard_fail - re.error: incomplete escape \x at position 2
```
Local test passed:
<img width="860" alt="image" src="https://github.com/user-attachments/assets/90f0d780-0639-4c03-8d7c-6f227c93a3fc">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134348
Approved by: https://github.com/jansel
Fixes #133499
### The issue
Testing a variety of TP `requires_grad` patterns (validating maximally flexible finetuning) revealed `DTensor` sharding propagation of `aten.native_layer_norm_backward` (default) fails with an `IndexError` for certain `requires_grad` patterns (pattern 1) (e.g. `output_mask` `[True, False, False]`) and an `AssertionError` for others (pattern 2) (e.g. output mask `[False, True, *]`). Please see issue #133499 for a full description of the observed failure patterns along with reproduction.
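A small illustration of how failure pattern 1 arises (plain tensors here, no DTensor needed to see the mask): freezing the affine parameters means autograd only asks `native_layer_norm_backward` for the input gradient, i.e. an `output_mask` of `[True, False, False]`.
```python
import torch

x = torch.randn(2, 8, requires_grad=True)
ln = torch.nn.LayerNorm(8)
ln.weight.requires_grad_(False)
ln.bias.requires_grad_(False)

ln(x).sum().backward()
assert x.grad is not None and ln.weight.grad is None and ln.bias.grad is None
```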
### Use Cases and Remediation
Failure pattern 1 is potentially problematic for a variety of finetuning scenarios. Though failure pattern 2 is really an xfail right now since it's not fully supported, IMHO there are use cases (e.g. especially wrt to mechanistic interpretability research, but certain finetuning scenarios too potentially) that justify supporting this output mask (especially since supporting it is fairly straightforward I think).
In this PR I propose some modest changes that:
* Address the aforementioned failure modes.
- Add a couple tests that I'm hopeful will help ensure `DTensor` op dispatch (which is so well implemented and such a pleasure working with btw! 🚀🎉) accommodates a wide variety of (potentially unanticipated) `requires_grad` patterns as it evolves.
To address both failure modes, I'm proposing the following changes:
1. To [`torch.distributed._tensor.ops._math_ops.layer_norm_bwd_strategy`](7b269cc484/torch/distributed/_tensor/ops/_math_ops.py (L873)):
- Refactor conditional `output_mask` handling such that the input and output specs in the`PlacementStrategy`s of the returned `output_strategy.strategies` list remain aligned with the `op_schema.args_spec` (whose definition does not change at runtime based upon unused optional args).
2. To [`torch.distributed._tensor._sharding_prop.propagate_op_sharding_non_cached`](7b269cc484/torch/distributed/_tensor/_sharding_prop.py (L256-L262)):
- When iterating through the active `op_schema.args_spec` to build the relevant `expected_input_specs` list, filter any `None` `desired_specs`.
3. To [`torch/distributed/_tensor/_op_schema.OpSchema._inplace_rewrap_schema_suggestion`](7b269cc484/torch/distributed/_tensor/_op_schema.py (L418))
- When inputs need a redistribute, for runtime-unrequired (`None` arguments in the aligned `suggestion_args_schema`), ignore the associated `suggestion_args_spec`
### Implementation considerations:
- Regarding `1`, to avoid changing the op strategy return args ([`op_strategy`](cf81180007/torch/distributed/_tensor/_sharding_prop.py (L234))), the change in `1` allows `None` elements to exist temporarily in `PlacementStrategy.input_specs` (treating it as `Sequence[DTensorSpec | None] | None` when it's `Sequence[DTensorSpec] | None`. This could be addressed in any number of ways but I thought it best to leave that for a subsequent PR since it could have broader ramifications (e.g. allowing op_strategies to return an output_strategy.input_specs` mask explicitly, explicitly allowing `None`s in `PlacementStrategy.input_specs`, creating a `Null` DTensorSpec etc.). That's why I'm using an ignore arg-type directive there for now.
- Regarding `2` and `3` above, I don't introspect `op_schema.op._schema.arguments` to verify any `None` arguments are `torch.OptionalType`, leaving adherence to the schema contract the responsibility of the given op. Regarding `2`, I assume any `desired_spec` will be either a `DTensorSpec` or `None`, so only `None` can be Falsy in this context.
- I considered altering the active `args_schema`, which could be inspected and aligned with the active `output_strategy.input_specs` in some cases and avoid the changes in `3`, but I think that would rely on one of (among other possibilities):
- all supported op signatures having optional Tensors (`DTensorSpec`) args after required tensors (which isn't a planned requirement as far as I know),
- (somewhat brittle) heuristic-driven arg alignment
- only supporting kwargs etc.
### Added Tests
To facilitate detection of future `requires_grad` pattern op failure modes as `DTensor` evolves, I added the following two tests:
1. `test/distributed/_tensor/test_math_ops.py DistMathOpsTest.test_layer_norm_bwd_req_grad`
- Tests `native_layer_norm_backward` specifically with 20 subtests that sweep valid `output_mask` patterns along in different LayerNorm dimensionality and `elementwise_affine` configurations.
2. `test/distributed/tensor/parallel/test_tp_examples.py DistTensorParallelExampleTest.test_transformer_req_grad`
- Samples a subset of `requires_grad` patterns in a more realistic (relative to the `LayerNorm`-specific test) Transformer usage context with different `dtype` and `is_seq_parallel` configurations. Note since there was substantial overlap with the existing `test_transformer_training` test, I took the opportunity to refactor that test to allow relevant code-sharing. I also added an `ExpCommCounts` `NamedTuple` to facilitate the addition of additional `requires_grad` patterns that we may want to test in the future which may result in different comm counts. I created the separate `requires_grad` test to allow decoupling the multi-iteration `test_transformer_training` test and allow addition of new `requires_grad` scenarios as desired while being mindful of resources.
Thanks again to the PyTorch distributed team for your immensely valuable contributions to the open-source ML community!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133502
Approved by: https://github.com/XilunWu
For `aten.any`, we can use `reduce_op="sum"` as the linear reduction op.
When we do `all_reduce` with `reduce_op="sum"` on a bool tensor, if any rank contributes `torch.Tensor([True])`, the reduction result is `torch.Tensor([True])`. Only when all ranks contribute `torch.Tensor([False])` is the reduction result `torch.Tensor([False])`. This matches `any`'s behavior.
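A local illustration of the equivalence (no process group needed): summing boolean results and casting back to bool is exactly logical OR, which is what `any` computes.
```python
import torch

rank_results = [torch.tensor([False]), torch.tensor([True]), torch.tensor([False])]

reduced = torch.stack(rank_results).sum(dim=0).bool()   # what a "sum" all_reduce yields
assert bool(reduced) == any(bool(r) for r in rank_results)
```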
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134206
Approved by: https://github.com/tianyu-l, https://github.com/chuanhaozhuge
Add DeviceMesh slicing support such that we could do the following:
```
mesh_3d = init_device_mesh(
    self.device_type, (2, 2, 2), mesh_dim_names=("replicate", "shard", "cp")
)
shard_cp_mesh = mesh_3d["shard", "cp"]._flatten()
hsdp_mesh = mesh_3d["replicate", "shard_cp"]
# we can get the corresponding group of the flatten mesh through
group = shard_cp_mesh.get_group()
# or
group = mesh_3d["shard_cp"].get_group()
# or
mesh_3d.get_group(mesh_dim="shard_cp")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133839
Approved by: https://github.com/fegin
ghstack dependencies: #133838
### Description
This PR extends the `VecISA` class to include support for VSX on the `ppc64le` architecture within the Inductor backend. This enhancement enables vectorization support, resulting in performance improvements when using `torch.compile()` on `ppc64le`.
### Fixes
- Resolved the `test_acosh_with_negative_large_input` test case in `test_cpu_repro.py` by implementing `acosh` for VSX.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132746
Approved by: https://github.com/jansel
Summary: Pass process group info into NcclWork
Test Plan: buck2 run mode/dev-nosan kineto/libkineto/fb/integration_tests:pytorch_execution_trace_integration_test
Differential Revision: D61677160
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134269
Approved by: https://github.com/wconstab
The pattern matcher runs DCE and remove_noop_ops on the replacement
graph by default. Previously we had a switch for the DCE. This PR
changes that switch to also control if we run remove_noop_ops.
The context was that there is silent incorrectness with
auto_functionalized. We use the Pattern matcher to decompose
auto_functionalized into a mutable op + clones; remove_noop_ops were
deleting the clones.
Future: can try #134363
Test Plan:
- new test. I wasn't able to produce a silently incorrect example so I
settled for asserting that clones still exist in the post-grad graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134364
Approved by: https://github.com/eellison
ghstack dependencies: #133639
This adds logs if we can't acquire locks in NCCLUtils and ProcessGroupNCCL for 30s.
This is motivated by some deadlocks we're seeing, and it's unclear whether they are in NCCL or on the PyTorch side of things.
This required replacing most `std::mutex` with `std::timed_mutex` and `std::condition_variable_any` as appropriate.
Test plan:
existing CI for regressions
will add unit tests on `C10D_LOCK_GUARD`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134131
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
Refactors construction of ExportGraphSignature object for export & training IR, explicitly creating AOTAutograd signature for training IR. This will be helpful for upcoming refactors for placeholder naming & runtime asserts prettifying.
Changes:
- dedups `make_argument_spec` call, moved to export/graph_signature.py
- `_sig_to_specs` wrapped into new function `_convert_to_export_graph_signature`, directly converts GraphSignature -> ExportGraphSignature
- `_make_fx_helper` explicitly creates AOTAutograd GraphSignature object
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134059
Approved by: https://github.com/angelayi, https://github.com/ydwu4
**Summary**
When checking the vectorization status across the 3 test suites, we found some operators had vectorization disabled with the message `Disabled vectorization: op: remainder`. In this PR, we add vectorization support for this op.
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_remainder
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_int_div_vec
```
Differential Revision: [D61147014](https://our.internmc.facebook.com/intern/diff/D61147014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129849
Approved by: https://github.com/jgong5, https://github.com/lezcano
Starter version of automatic dynamic shapes for export.
Creates enums `DIM.AUTO`, `DIM.STATIC`, allowing user to specify `AUTO` for dims in dynamic_shapes specs, meaning that corresponding dims are treated as dynamic, and relevant guards will do what's necessary (e.g. refine ValueRanges, set replacements based on equality, or even set static) without raising ConstraintViolationErrors. Basically allows the user to say, "a bunch of these dims can be dynamic, let export do model analysis and return the program with maximum possible dynamism, without complaining".
The usage for specifying `dynamic_shapes` is now:
```
AUTO -> dynamic by default, return whatever produce_guards() says, even if it's static
None/int/STATIC -> static
Dim/DerivedDim -> same as before - will complain if the min/max range is invalid, or if dims related to this are unspecified.
```
Caveat 1: specifying `AUTO` for a dim won't guarantee it'll be dynamic:
- specifying `AUTO` for a dim will return the maximum possible dynamism given your program and other specified constraints, but this can still mean you'll get a static program. For example, with the program below, x is specified dynamic, but it's equal to y, which is specified static, and with how we currently do things we won't promote y to dynamic, but will demote(?) x to static. So this can be surprising if you don't fully know your model, and/or missed one of your other inputs when specifying auto-dynamic shapes.
```
class Foo(torch.nn.Module):
    def forward(self, x, y):
        return x + y
inputs = (torch.randn(6), torch.randn(6))
export(Foo(), inputs, dynamic_shapes={"x": (DIM.AUTO,), "y": None})
```
Caveat 2: specifying `AUTO` and Dims in the same spec is still problematic:
- The way Dims/DerivedDims are currently handled is very strict. A Dim represents a symbol, and we require a user to specify the symbol for all dims governed by the symbol - that's why we've seen errors in the past like `The values of x must always be related to y by ...`, asking the user to specify the exact relation as in the program. We also require the specified min/max range to be a subset of the valid range from model analysis. All this doesn't compose well with specifying `AUTO` just yet - for example in the program below, ideal behavior could be to return a dynamic program, where `dx = x.size(0) = y.size(0)` has range (3,6). Unfortunately this crashes, and correct behavior is to specify `dx` for both inputs. So currently we raise a UserError and crash if both Dims + `AUTO` are present in the spec.
```
class Foo(torch.nn.Module):
    def forward(self, x, y):
        return x + y
inputs = (torch.randn(6), torch.randn(6))
export(Foo(), inputs, dynamic_shapes={"x": (DIM.AUTO,), "y": {0: Dim("dx", min=3, max=6)}}) # this doesn't work, because x & y are related
```
Implementation details:
This is done by setting `assume_static_by_default=False`, and doing a transform on the `dynamic_shapes` spec to preserve semantics. `assume_static_by_default=False` will treat unspecified dims or Nones as dynamic. This is the opposite of what `export.export()` currently does - unspecified Dims/Nones are treated as static. Historically this static-by-default behavior, where the user deals with fewer guards, has been desirable, and we would like to respect that in this implementation. So an internal spec transformation, `_transform_shapes_for_default_dynamic()`, is added; it does the spec conversion necessary to be compatible with dynamic by default. Specifically, AUTOs are converted into Nones, and Nones/unspecified dims are filled in with explicitly static constraints.
For example, this would look like, for a 3-d tensor: `{0: DIM.AUTO, 1: None, 2: Dim("dx")} -> {0: None, 1: 32, 2: Dim("dx")}`
This does seem overly complicated, but it's done to preserve dynamic shapes semantics for `torch._dynamo.export()`, which already uses `assume_static_by_default=False`, and follows the same process for generating shape constraints, via `_process_dynamic_shapes`. There the semantics are:
```
None/unspecified: dynamic by default
Dim/DerivedDim: also a strict assertion
```
If we don't care about BC for `_dynamo.export(dynamic_shapes)`, then we can just modify semantics for `_process_dynamic_shapes()` and change all the relevant tests in `test/dynamo/test_export.py`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133620
Approved by: https://github.com/avikchaudhuri
The function expects a Tensor of type LongTensor. It currently throws the following error: "one_hot is only applicable to index tensor.", which, imo, does not give the user enough information about what the problem is.
This PR simply adds extra information to the error message for this specific scenario.
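A quick reproduction of the scenario (the current error text is shown for reference; the PR only extends the message):
```python
import torch
import torch.nn.functional as F

try:
    F.one_hot(torch.tensor([0.0, 1.0]))    # float input: not an index tensor
except RuntimeError as e:
    print(e)                               # currently: "one_hot is only applicable to index tensor."

print(F.one_hot(torch.tensor([0, 1])))     # int64 input works as expected
```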
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134209
Approved by: https://github.com/mikaylagawarecki
`nn_module_stack` was previously serialized to a string by adding commas between the module_path and module_type. This is error-prone when the `nn_module_stack` itself contains commas.
This PR fixes this by storing the `nn_module_stack` in a dictionary and serializing it to a string via `json.dumps()`.
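A hedged sketch of why the change matters (field names here are illustrative, not the exact serialization schema): comma-joining cannot be split back unambiguously as soon as the module type contains a comma, while `json.dumps()` round-trips cleanly.
```python
import json

module_path = "layers.0.attn"
module_type = "MyBlock(heads=8, dim=64)"   # contains commas

# Old-style encoding is ambiguous to split back:
flat = ",".join([module_path, module_type])
assert len(flat.split(",")) != 2

# JSON round-trips without ambiguity:
payload = json.dumps({"path": module_path, "type": module_type})
decoded = json.loads(payload)
assert (decoded["path"], decoded["type"]) == (module_path, module_type)
```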
Fixes #131941
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134049
Approved by: https://github.com/angelayi
Summary: Currently, for sequential mode, minimizer search terminates after a node is excluded via the user defined exclusion_fn. However, on some occasions we would like the search to continue past that for the remaining nodes. In this diff I am changing the termination criteria to respect the find_all setting, where we continue sequential search if it is set.
Test Plan: CI
Differential Revision: D61720262
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134339
Approved by: https://github.com/jfix71
Fixes #134050
### The issue
The current `DTensor` sharding propagation caching policy for `aten.scaled_dot_product_efficient_attention` (default) can result in silently incorrect gradients or trigger an IMA after cuda kernel launch in mixed `require_grad` configurations. Please see issue #134050 for a full description of the observed failure patterns along with reproduction. Note `aten.scaled_dot_product_flash_attention` presents a similar concern so this PR addresses both [as discussed here.](https://github.com/pytorch/pytorch/issues/134050#issuecomment-2299887602)
### Remediation
While there are a number of ways this could be addressed, the most straightforward remediation is to modify the sharding propagation caching policy of [`aten._scaled_dot_product_efficient_attention.default`](b03381cac2/torch/distributed/_tensor/ops/_matrix_ops.py (L337-L340)), registering it with `schema_info=RuntimeSchemaInfo(4)` to prevent cache sharing between differing `compute_log_sumexp` values i.e.
```python
@register_op_strategy(aten._scaled_dot_product_efficient_attention.default, schema_info=RuntimeSchemaInfo(4))
def scaled_dot_product_efficient_attention_strategy(
...
```
[As discussed here](https://github.com/pytorch/pytorch/issues/134050#issuecomment-2299887602), since `aten::_scaled_dot_product_flash_attention` could be affected by a similar issue wrt `return_debug_mask`, this PR adjusts the sharding propagation caching policy for that op as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134146
Approved by: https://github.com/tianyu-l
Summary:
This PR updated cuSPARSELt to v0.6.2. I think we should land
https://github.com/pytorch/pytorch/pull/128534 first though.
Most of this PR is just enabling tests to run when cuSPARSELt v0.6.2 is
available.
Unfortunately I was running into a bug with fp32 support on Hopper, so I
removed fp32 support from the cuSPARSELt backend. I think this should be
fine since almost everybody uses the bfloat/float16/int8 kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134022
Approved by: https://github.com/jerryzh168, https://github.com/malfet
ghstack dependencies: #128534
Summary:
Added support for more custom op input types, now only missing dtype,
layout, memory format as input type, since we need to add some more testing for
mapping the types to their integer values
([previous
comment](https://github.com/pytorch/pytorch/pull/126215#discussion_r1617428066)).
This PR also replaces the `DynamicArg` struct's `serialized_arg_val` with
`list_item_types`, which stores an optional list of strings, where each string
represents the type of the value within this list. This is only used for
parsing lists of optional tensors, where we need to know if a specific value in
the list should be a tensor, or a None. Replacing with a list of strings is
also better than storing the actual json format because then we don't need to
parse the json string during the runtime, and can just loop over a preprocessed
list of strings.
Test Plan: `buck2 run @//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r "test_custom_"`
Reviewed By: desertfire
Differential Revision: D60295995
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132454
Approved by: https://github.com/desertfire
Summary:
We should always emit an end event in a finally block so that if a unit test or job fails, the stack is still correct.
Also, we use thread local storage for the stack, so that in multithreaded scenarios the stack will still be correctly added.
Test Plan:
Run benchmark and see that everything still works
Run
```
TORCH_LOGS=dynamo buck run test/functorch:test_aotdispatch -- -r test_backward_mutation_on_grad_out
```
With some extra logging to see that start events with the correct stack are emitted, and the end events are also emitted even though the test fails at runtime.
Differential Revision: D61682556
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134279
Approved by: https://github.com/aorenste
Fixes #128084
The approach is option 2 of what Elias suggested in the comment
thread:
- We require tensors to have the correct stride at usage. This may
involve a clone; if there was a clone and then a mutation into it
then we copy_ back the result of the mutation.
The reason I went with this approach is that it was the easiest, and
Inductor already works really hard to remove additional clones/copy_.
There are some cases that this doesn't generate efficient code for; for
example, if the tensor is a view, we don't change the base of the view
to have the right stride order, instead we do a clone.
The view case isn't very common so I'm ignoring it for now but we could
improve this in the future.
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133639
Approved by: https://github.com/eellison
Support of effectful operations in backward:
1/ AOTD collects metadata from the forward fn only, so effectful ops can be used in backward even if they were not used in forward => allow token discovery during the joint function.
FunctionalTensorMode holds _tokens; in the joint function, after tracing forward, we memoize _tokens as `_tokens_forward_output`.
2/ Tokens are added as primals inputs (forward) in EffectTokensWrapper.
Tokens that will be used in backward are among the partitioner's saved values. We do not have control over the positions at which they are saved in the forward outputs.
3/ If new tokens are discovered in backward after tracing joint_fn, they are manually added at the end of the primals in the resulting graph.
_aot_autograd/utils.py
4/ All effectful ops during backward are marked with the 'must_be_in_backward' partitioner_tag, to prevent the partitioner from placing them in forward.
For that, functional_tensor_mode gets a new optional state `self._effects_partitioner_tag` for effectful ops, set after tracing forward.
There are additional changes in the partitioner to improve the functionality of 'must_be_in_backward'.
5/ Unlifting tokens should now run for both forward and backward.
- As tokens saved for backward are placed at non-static positions, we identify the input and output tokens to erase via the inputs and outputs of the `with_effects` operation.
- In forward we can have input tokens, discovered in backward, that are not used in with_effects ops in forward but are saved for backward. We identify them by their position in the forward inputs.
6/ Add aot debug logging for graphs before unlifting and before adding the additional primals for backward tokens.
Tests:
```
python test/higher_order_ops/test_with_effects.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132638
Approved by: https://github.com/bdhirsh
MSVC doesn't support dynamic (variable-length) arrays.
Ref: https://stackoverflow.com/questions/56555406/creating-dynamic-sized-array-using-msvc-c-compiler
We tried several solutions:
1. Use std::vector instead, in a previous PR: https://github.com/pytorch/pytorch/pull/134140, but it changed the variable's type and failed UTs.
2. Use `std::unique_ptr` instead, in PR: https://github.com/pytorch/pytorch/pull/134156; @jansel reviewed and left comments: https://github.com/pytorch/pytorch/pull/134156#pullrequestreview-2253091693. That makes sense: allocating memory may make the code run slower.
3. Use a fixed-size array instead, in PR: https://github.com/pytorch/pytorch/pull/134210, but a fixed size cannot handle the situation where the reserved size is smaller than the number of CPUs.
> a. Limiting the size with the min() function failed a local test: https://github.com/pytorch/pytorch/pull/134210#issuecomment-2304447729
> b. Dynamically selecting between a fixed-size or dynamic array: https://github.com/pytorch/pytorch/pull/134210#issuecomment-2304128666 . It makes the code too complex to maintain.
After discussing with the original PR (https://github.com/pytorch/pytorch/pull/115620) author @zhuhaozhe, we think:
1. MSVC is the only compiler that does not support VLAs.
2. MSVC has worse performance than other compilers, so use `std::unique_ptr` for MSVC and make it work.
3. For other compilers, keep using the current `VLA` code.
4. Windows users can use `clang-cl` or `icx` to get better performance than MSVC.
5. As discussed with @jansel, we need to move the compiler check to the Python side and make the output code cleaner.
Reproduce UT:
```cmd
pytest test/inductor/test_cpu_repro.py -v -k test_reduction_with_dynamic_threads
```
Error msg:
```cmd
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(13): error C2131: expression did not evaluate to a constant
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(13): note: failure was caused by a read of a variable outside its lifetime
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(13): note: see usage of 'max_threads'
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(16): error C3863: array type 'float [max_threads]' is not assignable
```
Generated code:
```c++
#include "C:/Users/Xuhan/AppData/Local/Temp/tmpt6mxcjzi/j2/cj22tgrdgh42wbunl7gdptg2lintcziox2kmr7rdbcc6n2njrhgx.h"
extern "C" __declspec(dllexport) void kernel(const float* in_ptr0,
const float* in_ptr1,
float* out_ptr0,
float* out_ptr1)
{
{
{
float tmp_acc0 = 0;
at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
int max_threads = omp_get_max_threads();
float tmp_acc0_arr[max_threads];
for (int tid = 0; tid < max_threads; tid++)
{
tmp_acc0_arr[tid] = 0;
}
at::vec::Vectorized<float> tmp_acc0_vec_arr[max_threads];
for (int tid = 0; tid < max_threads; tid++)
{
tmp_acc0_vec_arr[tid] = at::vec::Vectorized<float>(0);
}
#pragma omp parallel
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134221
Approved by: https://github.com/zhuhaozhe, https://github.com/jansel
Summary:
This diff adds two new operators torch.ops._quantized.wrapped_linear_prepack and torch.ops._quantized.wrapped_quantized_linear_prepacked. It is a decomposition of the op torch.ops._quantized.wrapped_quantized_linear added in the previous diff.
We decomposed it this way because the packed weight can be computed early, so we don't need to do it in every forward in AOTI.
Reviewed By: jerryzh168
Differential Revision: D61395887
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134232
Approved by: https://github.com/houseroad
Summary:
As title.
Add a test case in test_aot_inductor to check for codegen (i.e. `aoti_torch_print_tensor_handle` is inserted as expected for debugging printer) for both cpu and cuda based on a simple `addmm` test model.
Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_codegen_abi_compatible_{cuda/cpu}
```
Differential Revision: D61169068
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133326
Approved by: https://github.com/ColinPeppler
Summary: Add tests that check function events for dynamic activity toggling for both GPU and CPU events. Also added comments from previous GH comments
Test Plan: Make sure all tests pass
Differential Revision: D61617514
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134149
Approved by: https://github.com/aaronenyeshi
Summary: Reduce the aarch64 dashboard run to only test the default config, until we solve the timeout issue. Also increase the frequency from nightly to 6 times a day, to see if we can reproduce the perf instability Nikita has observed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134265
Approved by: https://github.com/malfet
Switch installation of the pytorch package to use our download.pytorch.org sources, which are better maintained.
Additionally, switch the miniconda installation to a miniforge installation to ensure backwards compatibility for users expecting to have the conda package manager installed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134274
Approved by: https://github.com/malfet, https://github.com/atalman
Co-authored-by: atalman <atalman@fb.com>
Summary:
Make quantization tests compatible with the new training IR.
With the new batch norm node `torch.ops.aten.batch_norm.default`, we don't need an additional getitem node after the bn node, so tests need to be fixed to not check for the getitem node.
We added a capture_pre_autograd_graph_using_training_ir() function, which returns True when we are using the training ir, and False otherwise. This way, the code supports both training ir and the old ir.
For now, we are just rolling out the training ir for fbcode internal tests.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_preserve_source_fn_stack
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_update_shared_qspec
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_conv2d
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_conv_bn_relu_fusion
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_conv_bn_fusion
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_conv_bn_fusion_literal_args
```
Reviewed By: andrewor14, tugsbayasgalan
Differential Revision: D61292102
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134259
Approved by: https://github.com/tugsbayasgalan
This patch makes two changes:
1. Whenever ncclCommSplit accepts groupRanks in its config, we should
populate it. This is independent of using PMI or not. For example,
non-PMI NCCL can also use this information, if it chooses to.
2. Provide a user flag to decide when to do a uniqueId broadcast and
when to skip it. This is a performance optimization, and not a
correctness requirement. If the user forgets to set this, we will
do the uniqueId broadcast, which is wasteful (because it will be
ignored by NCCL), but not incorrect.
@exported-using-ghexport
Differential Revision: [D60966774](https://our.internmc.facebook.com/intern/diff/D60966774/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133960
Approved by: https://github.com/shuqiangzhang
Reland of #128143 but added `alpha` and `bias` initialization to `launchTunableGemmAndBias`
Thus far TunableOp was implemented for gemm, bgemm, and scaled_mm. gemm_and_bias was notably missing. This PR closes that gap.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128919
Approved by: https://github.com/malfet
Summary:
In the new training ir, we produce `torch.ops.aten.batch_norm.default` instead of `torch.ops.aten._native_batch_norm_legit.default` or `torch.ops.aten._native_batch_norm_legit_no_training.default`.
So we need to change the pattern match to accommodate the new op.
- Add `torch.ops.aten.batch_norm.default` to pattern matcher list so it's identified as a batch norm node
- `torch.ops.aten.batch_norm.default` doesn't have a getitem user anymore, so when removing the bn norm, we need to do `bn_node.replace_all_uses_with(conv_node)` instead of `getitem_node.replace_all_uses_with(conv_node)`
The behavior of capture_pre_autograd_graph is consistent for each run.
If the run is a fbcode test, then capture_pre_autograd_graph uses training IR. This means both _get_aten_graph_module_for_pattern and replace_pattern_with_filters see the same training IR.
If the run is not a fbcode test, then both would see the old IR.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_conv2d_binary2
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_conv2d_unary
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_linear_unary
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_dynamic_quant_linear
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_dynamic_quant_linear
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_flatten_recipe
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_linear_unary
```
Reviewed By: andrewor14, tugsbayasgalan
Differential Revision: D61291077
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134157
Approved by: https://github.com/tugsbayasgalan
Part of #134054.
This corresponds to the pytorch mypy changes from D61493706. Updating takes so
long and touches so many files that it's impossible to land as a whole without conflicting with some other intermediate change.
So landing these 'type: ignore' for pytorch in advance of them actually being needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134202
Approved by: https://github.com/Skylion007
Changes:
1. Move `polyfill.py` -> `polyfills/__init__.py`. It can be used as `polyfill.xxx` -> `polyfills.xxx`.
2. Move submodule loading from `polyfills/__init__.py` to `polyfills/loader.py`.
Merge the `polyfill.py` module and the `polyfills/` package. Each polyfill module has its own namespace for better code organization.
The ultimate goal is to make `polyfills/__init__.py` empty and move all polyfill functions into their own namespaces.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133977
Approved by: https://github.com/jansel
Summary: When deepcopy a proxy, we first try the default deepcopy behavior.
Test Plan: buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r proxy_deepcopy
Differential Revision: D61398418
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133706
Approved by: https://github.com/angelayi
Summary:
This diff implements a bunch of views for internal scuba viewing.
TODOS that I might punt to another diff:
- Saving cache stats via counter is definitely sus here, but there's not really a good way to track "fx graph cache hit for this compile phase" right now. Will think about this more.
- We should definitely log frame id, compile id, etc
- We should definitely be logging configs. That way, we can A/B test based on whether a config is turned on.
- idk what I'm doing with compile_uuid yet, but it's useful when you want to look at samples for a single run. I think if we had mast job info this field is not needed, but it's nice to be able to drill down to a single run and get its chrome trace view or icicle view, so idk
Test Plan:
All of the above views are run with nanogpt benchmark:
```
buck run mode/opt caffe2/benchmarks/dynamo:torchbench -- --training --backend=inductor --only nanogpt --performance
```
Differential Revision: D61603243
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134118
Approved by: https://github.com/oulgen
As per title, this PR adds proper casting to fuse_linear_bn_weights in the same style as the conv case above. This previously caused numerical issues on my end, so that is why I am fixing it.
Also cleans up the docstring.
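For context, a usage sketch of the helper being fixed (argument order as in `torch.nn.utils.fusion`; shapes and tolerances are illustrative):
```python
import torch
from torch.nn.utils.fusion import fuse_linear_bn_weights

linear = torch.nn.Linear(8, 8)
bn = torch.nn.BatchNorm1d(8).eval()

# Fold the BatchNorm statistics and affine parameters into the Linear weights.
fused_w, fused_b = fuse_linear_bn_weights(
    linear.weight, linear.bias,
    bn.running_mean, bn.running_var, bn.eps, bn.weight, bn.bias,
)

x = torch.randn(4, 8)
torch.testing.assert_close(bn(linear(x)), torch.nn.functional.linear(x, fused_w, fused_b))
```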
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134105
Approved by: https://github.com/mikaylagawarecki
Update cudnn_frontend submodule to 1.6.1 to patch some minor bugfixes and compiler fixes.
# Bug fix
* Fixed an issue where custom dropout mask was not correctly applied.
* Added -fvisibility=hidden for the pip wheels generated to avoid symbol conflicts with other modules that use cudnn frontend.
* Fixed an issue in sdpa operation which when deserialized will lead to numerical mismatches.
* Fixed an issue in sdpa fp8 fprop operation (in inference mode).
# Samples
* Added a new sample to showcase how a custom dropout mask can be applied to a sdpa operation.
* Added a sample to showcase convolutions on large (c * d * h * w > 2 **31) tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134007
Approved by: https://github.com/eqy
Zero bubble can be expressed through `ScheduleFlexibleInterleaved1F1B` by setting `enable_zero_bubble=True`. But instead of having to include this flag in schedule initialization, we should create a separate ZeroBubbleSchedule and also transition `Interleaved1F1B` to derive from `ScheduleFlexibleInterleaved1F1B`. Then we don't need to expose `ScheduleFlexibleInterleaved1F1B`, since the naming is not obvious.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133467
Approved by: https://github.com/wconstab
ghstack dependencies: #132691
Just something I noticed while implementing a new DeviceInterface
I had to add `# type: ignore[assignment]` because mypy thinks
DeviceInterface.get_raw_stream is a `Callable` and therefore
incompatible with a `staticmethod`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134187
Approved by: https://github.com/jansel
CUTLASS automatically skips a stage in the epilogue if we provide a nullptr. Thus, instead of building a special kernel for bias=None, we can reuse one of the other ones.
This also considerably simplifies the code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134113
Approved by: https://github.com/drisspg
ghstack dependencies: #134110, #134111, #134112
The compute dtype for the bias addition was set to ElementBias. Thus, for a bf16 bias, we would cast the fp32 accum to bf16 and _then_ add the bias. It is however (slightly?) more accurate to first add the bias in fp32 and only cast at the end.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134112
Approved by: https://github.com/drisspg
ghstack dependencies: #134110, #134111
Bug fixes for PyTorch 2.5:
1. Using SYCL group algorithm API instead of old style for sub group shift utilities.
2. Add preprocess in reduction kernel for cases requiring data type cast.
3. Make group norm memory format compatible.
4. ZeroTensor: a. Remove unnecessary aten operators registration, or ZeroTensor process is bypassed. b. Align preprocess with intree implementation in aten::copy_.
5. Rebase checkIndexTensorTypes usage.
6. Align latest semantics of PyTorch foreach operators. Return multiple tensors with offset=0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133850
Approved by: https://github.com/EikanWang
As you can see, 'privateuse1' appears many times in out-of-tree extension codebases. I think that everything about the device type should work the same as for other in-tree backends after registering the privateuse1 backend.
For example, after registering a privateuse1 backend named "foo", you should allow "foo" to be passed in as a valid device type.
```diff
- instantiate_device_type_tests(TestIndexing, globals(), only_for='privateuse1')
- instantiate_device_type_tests(NumpyTests, globals(), only_for='privateuse1')
+ instantiate_device_type_tests(TestIndexing, globals(), only_for='foo')
+ instantiate_device_type_tests(NumpyTests, globals(), only_for='foo')
```
> https://github.com/Ascend/pytorch/blob/master/test/test_indexing.py#L1654-L1655
The change is to map privateuse1 backend name to 'privateuse1' when calling `filter_desired_device_types()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133082
Approved by: https://github.com/albanD
Summary:
Previously, reuse of the same `Dim` was encoded by "sharing" internal constraints among constraint targets. This kind of sharing, implemented using `shared` fields between `_Constraint`s, was originally motivated by `dynamic_dim`, specifically to support `==` between `dynamic_dim`s, but we no longer need to maintain this overcomplicated structure: we can simply use names of `Dims` to directly encode sharing information.
Thus this PR vastly simplifies the structure of `_Constraint` by removing `shared` fields. As a result, both `_Constraint` and its moral subclass, `_DerivedConstraint`, are 1-1 with `Dim` and its moral subclass, `DerivedDim`.
Note that this will break `==` over `dynamic_dim`, so an immediate follow-up will be to remove `dynamic_dim` entirely from our public API. (It's been more than 6 months since the deprecation warning anyway.) I just didn't want to deal with that process in the same PR.
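For illustration, reusing the same `Dim` object (equivalently, the same name) is now all that encodes sharing between dimensions; a minimal export-side sketch (module and sizes are illustrative):
```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x, y):
        return x + y

dx = Dim("dx", min=2, max=64)
# Using `dx` for both inputs encodes x.size(0) == y.size(0); no hidden
# "shared" constraint objects are needed for this anymore.
ep = export(M(), (torch.randn(8), torch.randn(8)),
            dynamic_shapes={"x": {0: dx}, "y": {0: dx}})
```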
Test Plan: existing
Differential Revision: D61559413
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134045
Approved by: https://github.com/pianpwk
Currently, `fully_shard` will create a new `FSDPMyModuleClass` class for each `MyModuleClass` module **object**, which causes Dynamo to guard-fail on every module object's type checking. This PR fixes the issue by caching and reusing previously created FSDP wrapper class.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134135
Approved by: https://github.com/awgu
Summary:
This PR adds in cuSPARSELt as a backend to PyTorch.
It is now possible to see if cuSPARSELt is available and the version if
it is with
```
torch.backends.cusparselt.is_available()
torch.backends.cusparselt.version()
```
Test Plan:
```
python test/test_sparse_semi_structured.py -k test_cusparselt_backend
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128534
Approved by: https://github.com/cpuhrsch, https://github.com/eqy, https://github.com/syed-ahmed
As in the title. In addition, the PR introduces `_int_bsr_dense_addmm`, which is equivalent to `bsr_dense_addmm` except that for int8 inputs the result is an int32 tensor (similar to the existing `_int_mm`).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133855
Approved by: https://github.com/cpuhrsch
Fixes #133690
The naming was added in #121170 to allow performance debugging of latency-critical threads. However, the `pt_main_thread` name gets inherited every time a new process or thread is created from the parent one, which defeats the purpose. We need a better way to name the thread that launches kernels on accelerators, but for the time being we can let users name the threads in the application code, using: `torch.multiprocessing._set_thread_name("insert_name")`
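A hedged usage example of the workaround mentioned above (the thread name string is arbitrary):
```python
import torch.multiprocessing as mp

# Label the current thread so profilers show a meaningful name.
mp._set_thread_name("data_loader_main")
```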
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134066
Approved by: https://github.com/soulitzer, https://github.com/d4l3k
The functorch partitioners use network flow to split the joint graph into a forward and backward graph. Internally, we've found that upgrading to networkx 2.8.8 (from 2.5) results in some hard-to-debug failures (internal reference: https://fburl.com/workplace/jrqwagdm). And I'm told that there's interest to remove the python dependency.
So this PR introduces a C++ implementation that mirrors the API provided by networkx. We'll need to add python bindings and do some additional testing to verify correctness.
Differential Revision: [D61550977](https://our.internmc.facebook.com/intern/diff/D61550977)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132188
Approved by: https://github.com/Chillee
Add a way of generating a FunctionSchema from example values, because a hop's schema varies even for the same hop.
We didn't use torch._C.FunctionSchema because we cannot construct those classes directly (e.g. "__init__" cannot be used for torch._C.FunctionSchema). Also, extending the basic types in C++ does not seem that easy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133521
Approved by: https://github.com/zou3519
Summary:
In export, we generate many redundant getitem nodes branching from the same source, inserted by runtime assertions or other passes. This causes issues for any downstream system relying on each value being uniquely defined by a single node.
I don't think it hurts to remove only the redundant getitem nodes, so I just added the pass to the ctor.
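A hedged sketch of the deduplication idea (illustrative only, not the exact pass): collapse getitem nodes that read the same index from the same source node.
```python
import operator
import torch.fx as fx

def dedupe_getitems(graph: fx.Graph) -> None:
    seen = {}
    for node in list(graph.nodes):
        if node.op == "call_function" and node.target is operator.getitem:
            key = (node.args[0], node.args[1])
            if key in seen:
                # Redirect users of the duplicate to the first occurrence, then drop it.
                node.replace_all_uses_with(seen[key])
                graph.erase_node(node)
            else:
                seen[key] = node
```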
Test Plan:
rebase on D61256937
```
buck2 run scripts/bearzx:pt2_export_playground
```
Differential Revision: D61351578
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133618
Approved by: https://github.com/tugsbayasgalan
Add `stage_backward_input` and `stage_backward_weight` functions to perform the weight updates for inputs and weights independently.
We still support the `self.dw_builder` argument for a custom backward, but it has become optional. It takes a separate code path and cannot be used in conjunction with the native zero-bubble backward.
Added tests:
`python test/distributed/pipelining/test_schedule_multiproc.py -k test_schedule_with_native_zero_bubble`
`python test/distributed/pipelining/test_backward.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132691
Approved by: https://github.com/wconstab
**Summary**
Implement the complete vectorization of `index_expr` functionally. We also add heuristics from a performance perspective to resolve the regressions posted below: https://github.com/pytorch/pytorch/pull/122961#issuecomment-2041336265 by disabling vectorization of specific (Fused) scheduler nodes:
- Heuristic 1: when the number of non-contiguous `index_expr/load/store` ops exceeds the threshold, we disable the vectorization.
- Heuristic 2: when the total number of elements along the vec dim is less than `tiling_factor/2`, we disable the vectorization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122961
Approved by: https://github.com/jansel
Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
Summary:
This diff adds a new operator wrapped_quantized_linear (torch.ops._quantized.wrapped_quantized_linear) that takes the following input arguments: input (in fp32), input_scale, input_zero_point, weight (in fp32), weight_scale, weight_zero_point, bias (in fp32), output_scale, output_zero_point, and out_channel. It does the following:
1. Use quantize_per_tensor(input, input_scale, input_zero_point) to quantize the input tensor to int8
2. Use quantized::linear_prepack(weight, weight_scale, weight_zero_point, bias) to pack the weight and bias
3. Use quantized::linear to perform int8 quantized linear
4. dequantize
This new op is essentially a wrapper of multiple ops. We do this because torch.export cannot handle models that use the old quantize APIs.
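A rough, hedged sketch of the composition in steps 1-4 above (the exact signature of the new op may differ; `out_channel` is omitted here):
```python
import torch

def wrapped_quantized_linear_ref(x, x_scale, x_zp, w, w_scale, w_zp,
                                 bias, out_scale, out_zp):
    qx = torch.quantize_per_tensor(x, x_scale, x_zp, torch.quint8)   # 1. quantize input
    qw = torch.quantize_per_tensor(w, w_scale, w_zp, torch.qint8)    # quantize weight
    packed = torch.ops.quantized.linear_prepack(qw, bias)            # 2. prepack weight + bias
    qy = torch.ops.quantized.linear(qx, packed, out_scale, out_zp)   # 3. int8 quantized linear
    return qy.dequantize()                                           # 4. dequantize
```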
Reviewed By: jerryzh168
Differential Revision: D61377266
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134024
Approved by: https://github.com/houseroad
Add decorator `torch.compiler.substitute_in_graph` to register a polyfill for an unsupported C++ function to avoid graph breaks. This API provides an official way to add Dynamo support for third-party C extensions. Also, it can be used to simplify our implementation of `torch._dynamo.polyfill`.
5ee070266f/torch/_dynamo/variables/builtin.py (L97-L107)
Example:
```python
>>> import operator
>>> operator.indexOf([1, 2, 3, 4, 5], 3)
2
>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
Unsupported: ...
>>> @torch.compiler.substitute_in_graph(operator.indexOf)
... def indexOf(sequence, x):
... for i, item in enumerate(sequence):
... if item is x or item == x:
... return i
... raise ValueError("sequence.index(x): x not in sequence")
>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133712
Approved by: https://github.com/jansel
Summary:
* TLDR:
`getenv` is not thread-safe w.r.t. `setenv`. Environment variables are kept as a per-process "dictionary" by libc. `setenv` can essentially realloc the whole thing and move this list to a completely different location. If there is a concurrent `getenv` happening at the same time, it is possible that it ends up reading stale memory and segfaults.
`getenv` is thread-safe w.r.t. other `getenv` calls.
* Details:
Inside PTD init:
```
ProcessGroupNCCL ctor
...
ncclCommWatchdogThread_ =
std::thread(&ProcessGroupNCCL::ncclCommWatchdog, this); (https://fburl.com/code/terf9ai7)
```
Inside ncclCommWatchdog thread:
```
...
ncclHeartbeatMonitorThread_ =
std::thread(&ProcessGroupNCCL::heartbeatMonitor, this); (https://fburl.com/code/fv9camg2)
...
```
Inside heartbeatMonitor thread:
```
...
std::optional<DumpPipe> dumpPipe = std::nullopt; (https://fburl.com/code/qdvahzbu)
dumpPipe.emplace(rank_);
...
```
Inside DumpPipe ctor (https://fburl.com/code/wvixlqcz)
```
getCvarString
getenv <=== SIGSEGV
```
On the main thread:
We go on to initialize NCCL:
Inside getNCCLComm, we call: `getNcclVersion` -> `initEnv` (https://fburl.com/code/j312pccu)
`initEnv` inside NCCL does this: `initEnv` -> `setEnvFile`
It reads the /etc/nccl.conf file and sets the values of env variables with `setenv` (https://fburl.com/code/cq4r0y0h).
This `setenv` can race with the `getenv` in the heartbeatMonitor thread.
Ideally, all `setenv` calls should be done by a single thread before launching other threads. This diff moves getNcclVersion before launching the watchdog thread to make sure all setenv calls are done beforehand.
I think we have just been getting lucky that we are not hitting it in production. IIRC we did in fact see a getenv segfault once in one of the large-scale runs, but I don't remember the details now.
Test Plan: A lot of testing done as part of D61411062 & CI
Differential Revision: D61421292
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133744
Approved by: https://github.com/wconstab, https://github.com/fduwjj
Summary:
Change ReorderConvertTest to work with the new `capture_pre_autograd_graph` implementation using D61175223.
Note that now `ReorderConvertTest` doesn't work with the old `capture_pre_autograd_graph` anymore.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/passes/tests:optimize_test -- -r ReorderConvertTest
```
Differential Revision: D61507772
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134010
Approved by: https://github.com/tugsbayasgalan
Link various classes and functions of `optim.swa_utils` to make their doc content accessible from the `torch.optim` doc.
Currently, if you click the link
https://pytorch.org/docs/stable/optim.html#module-torch.optim.swa_utils it goes to a blank section at the bottom of the `torch.optim` page.
Also,
the `torch.optim.swa_utils.AveragedModel` and `torch.optim.swa_utils.SWALR` classes, as well as `torch.optim.swa_utils.update_bn()` and `optim.swa_utils.get_ema_multi_avg_fn`, are not linked in the doc.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133393
Approved by: https://github.com/janeyx99
https://github.com/pytorch/pytorch/pull/132990 introduced a dependency on `torch.version`, which might not be imported yet, and can result in `AttributeError: partially initialized module 'torch' has no attribute 'version' (most likely due to a circular import)` if a user starts their code with `import torch.cuda`.
Fix it by importing `torch.version` explicitly.
Test Plan: CI
Differential Revision: D61549284
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134019
Approved by: https://github.com/seemethere
```
# suppose we have a 3d mesh
mesh_3d = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"))
dp_cp_mesh = mesh_3d["dp", "cp"]._flatten()
"""
then we would have
flatten_name_to_root_dims[mesh_3d]: {
"dp_cp": (0, 1)
}
"""
```
We need this information to validate the ordering of mesh slices that include a flattened mesh dim.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133838
Approved by: https://github.com/fegin
Summary:
Skip re-exporting modules with duplicated types to speed up the exportability tests.
In real models there are many duplicated modules, and they mostly have the same export issues.
Test Plan: Existing CI
Differential Revision: D61504630
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133930
Approved by: https://github.com/angelayi
Co-authored-by: bearzx <bearzx@fb.com>
It is a parallel PR to https://github.com/pytorch/pytorch/pull/133819 , and it appends changes for @jansel's comments.
1. For `torch/_inductor/codegen/cpp_wrapper_cpu.py`, revert to the original code that appends LL on MacOS and Windows: bdc14ad89a
2. For `torch/_inductor/codegen/cpp_utils.py`, append LL on MacOS and Windows for large constants, and fix its UTs: 3a56b76ce0
------------------------------
Another solution for https://github.com/pytorch/pytorch/pull/133615: use `int64_t` as the index type for all platforms.
### Development notes:
The mentioned PR (https://github.com/pytorch/pytorch/pull/133615) fixes the index type not matching the parse_arg argument types. As reviewed with @jansel, Jason thinks we need to unify `INDEX_TYPE` for all platforms.
The current code is cumbersome:
```python
INDEX_TYPE = "int64_t" if _IS_WINDOWS else "long"
```
So, I made some attempts to unify `INDEX_TYPE` as `long` or `int64_t`.
For using `long` as the index type: https://github.com/pytorch/pytorch/pull/133768
For using `int64_t` as the index type: https://github.com/pytorch/pytorch/pull/133782
After that, we discussed which type to select as the final solution.

The `long` type has different definitions and sizes on different OSs and compilers, so @jansel made the decision that we need to select `int64_t` for all platforms. I therefore continued my work based on https://github.com/pytorch/pytorch/pull/133782.
https://github.com/pytorch/pytorch/pull/133782 still had two issues:
1. std::min/std::max could not match function instances by arg types. This was fixed and validated in PR https://github.com/pytorch/pytorch/pull/133812.
2. CUDA TestMemoryPlanning::test_cpp_wrapper failed due to the wrong index type. It is fixed in this PR.
So, we made the final solution in this PR.
### Changes:
**1. Use the `int64_t` type as the index type for all OSs: `Windows`, `Linux` and `MacOS`.**
**2. Use static_cast<int64_t>(`constant`) to convert constants passed to `div_floor_integer` to the arg type (`int64_t`).**
**3. Update the `parse_arg` function signature to `int64_t`, which follows the index type.**
**4. Append a double L (`LL`) to constants on Windows and MacOS, because their int64_t is long long.**
**5. Fix `std::min/std::max` type mismatches by static_cast to `INDEX_TYPE`.**
**6. Fix UTs, including: CUDA `TestMemoryPlanning::test_cpp_wrapper`, and `test_indexing.py`.**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133892
Approved by: https://github.com/jansel
Another attempt to update NVTX to NVTX3. We now avoid changing the NVTX header inclusion of existing code. The advantage of NVTX3 over NVTX is that it is a header-only library, so linking with NVTX3 can greatly simplify our CMake and other build scripts for finding libraries in user environments. In addition, NVTX is indeed still present in the latest CUDA versions, but it's no longer a compiled library: it's now header-only. That's why there isn't a .lib file anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109843
Approved by: https://github.com/peterbell10, https://github.com/eqy
Co-authored-by: Ivan Zaitsev <108101595+izaitsevfb@users.noreply.github.com>
Summary:
- exir.capture + to_edge is deprecated. We need to use export + to_edge.
- Fix the quantization pass to be compatible with the new export IR. In the quantization pass, some nodes might have side effects, so they don't have users but still are not removed by the DCE pass. We need to account for that.
- Now export_rle_model works with the default `capture_pre_autograd_graph`; it should also work with the new training IR.
Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/export:export_rle_model -- -r export_rle_model
```
Differential Revision: D61485834
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133937
Approved by: https://github.com/tugsbayasgalan
Summary:
The existing tests didn't cover a case where we had multiple autotunes in a single graph. Add a test to demonstrate that case.
Also added a test dependency on redis and removed the "fake redis" from the previous PR (#133579)
Test Plan: unit tests
Reviewed By: oulgen
Differential Revision: D61178861
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133868
Approved by: https://github.com/oulgen
Adds guards checking whether torch function mode is in the all disabled state.
There are three torch function enablement states:
* All torch function disabled (modes + subclasses)
* Torch function subclass disabled
* All enabled
We now have guards checking whether the state is all-enabled and whether it is all-disabled.
Each of the three states above maps to a unique pair of these two flags.
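For reference, a hedged illustration of the three states using existing (private) toggles; these are internal APIs and may change:
```python
import torch

torch._C._is_torch_function_enabled()  # True only in the fully enabled state

with torch._C.DisableTorchFunctionSubclass():
    pass  # subclass __torch_function__ disabled, modes still active

with torch._C.DisableTorchFunction():
    pass  # everything disabled (modes + subclasses)
```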
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133135
Approved by: https://github.com/anijain2305
ghstack dependencies: #133130, #133729, #133131, #133132, #133133, #133134, #133136
This PR adds a C function to check if all torch function is disabled.
Recall that there are three torch function enablement states:
* All disabled
* Torch Function Subclass disabled
* All enabled
The API before this change provides two functions:
* `_is_torch_function_enabled` - returns True iff the current TF state is All enabled
* `_is_torch_function_mode_enabled` - returns True iff the state is not All disabled and the torch function mode stack is non-empty.
The crux of why a new API is needed is the following: if Dynamo enters a frame with the torch function mode stack empty and `_is_torch_function_enabled` == False, it is impossible to determine whether, after a new mode is pushed, we should enter that mode or not. This is because we don't know if the enablement state is All disabled or only subclass disabled. Adding this API to check whether All disabled is True allows us to disambiguate this case.
In the next PR, Dynamo InstructionTranslator will have clearer flags than the underlying C API:
* A flag to indicate if subclasses are disabled (ie All disabled or Subclass Disabled is the current state)
* A flag to indicate if modes are disabled (ie if All disabled is the current state)
* A symbolic stack which can be checked if any modes are present
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133136
Approved by: https://github.com/bdhirsh
ghstack dependencies: #133130, #133729, #133131, #133132, #133133, #133134
This PR adds support `torch._C._push_on_torch_function_stack()` by updating `torch.py` to push onto the symbolic torch function mode stack when a push is encountered. The same side effects infra used in the previous PR is used to track the mutation of the torch function mode stack and add bytecode to update it if it is mutated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133132
Approved by: https://github.com/williamwen42
ghstack dependencies: #133130, #133729, #133131
This PR adds support for tracing `torch._C._pop_torch_function_stack()` without graph breaking and in order to verify the state change also adds replay of mutations to the torch function mode stack via side_effects appending supplemental bytecode as we do for other python mutable objects.
Details:
To represent the torch function mode stack symbolically a deque field is added to the instruction translator. When the InstructionTranslator is initialized, all modes are read from the current torch function mode stack, and stashed in a global weak ref for later access (using existing sources) without needing to push/pop the python/cpp torch function mode stack.
During tracing, when `_pop_torch_function_stack` is encountered a value is popped from this deque and the variable tracker representing the mode is returned. To ensure the true torch function mode stack matches this state, `TorchFunctionModeStackVariable`, a singleton, is marked as mutated, this adds it to side effects, where during final codegen, side effects will codegen a call to a python helper which will update the python torch function mode stack.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133131
Approved by: https://github.com/jansel
ghstack dependencies: #133130, #133729
This PR adds a guard on the torch function mode stack state at the beginning of tracing. The way this is implemented is via a new leaf guard which is passed the initial stack state at construction and compares it to the stack state at the time the guard is run.
Details:
The stack state is extracted via popping all modes, appending them to a list, and pushing all modes back. This list is stored on the output graph and read during guard construction to pass to the stack mode guard. There the length and types of the modes are recorded. Next time the guard is run it compares this recorded state to the current mode stack state.
To implement this in python a helper function was added to utils.py and this is used if cpp guards are not enabled.
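A hedged sketch of the "pop all, record, push back" snapshot described above, built from the private hooks named in this PR stack (illustrative only):
```python
import torch

def snapshot_torch_function_mode_stack():
    modes = []
    while torch._C._len_torch_function_stack() > 0:
        modes.append(torch._C._pop_torch_function_stack())  # top of stack first
    for mode in reversed(modes):
        torch._C._push_on_torch_function_stack(mode)        # restore original order
    return modes  # the guard later compares the length and per-entry types
```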
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133130
Approved by: https://github.com/anijain2305
Summary: Defaulting TORCH_NCCL_DUMP_ON_TIMEOUT to "true" and adding a killswitch in case we need to kill this feature in production.
Test Plan: Tests pass manually but need further testing before this is rolled out fully everywhere.
Differential Revision: D61136320
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133237
Approved by: https://github.com/c00w
Summary: This diff fixed many lint issues in qlinear_prepack.cpp. I'm fixing them as I want to add more ops/funcs to this file later.
Test Plan: Sandcastle
Differential Revision: D61425436
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133797
Approved by: https://github.com/Skylion007
Summary: `_ConstraintTarget` is an internal data structure that has some redundancy: tensors are identified by their id but also carry a weak reference. The weak reference was probably useful a year back but everything is done with ids right now, and the lifetime of these tensors ensures that using their ids is OK.
Test Plan: existing tests
Differential Revision: D61488816
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133890
Approved by: https://github.com/tugsbayasgalan
Summary: When generating CUDA kernel load and launch, certain Triton kernel meta data are needed, but those meta data only exist after kernel auto-tune is done. DeferredCudaKernelLine is a deferred line which can backfill a string template after kernel auto-tune. This is to prepare for one-pass AOTI codegen implementation.
Differential Revision: [D61018114](https://our.internmc.facebook.com/intern/diff/D61018114)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129135
Approved by: https://github.com/angelayi
Summary:
Remove the early exit for padding when padding = [0, 0, 0, 0].
This prevents export from specializing when all padding=0, allowing export when all padding >= 0. Specialization will still happen for negative padding.
This change will be used to export image preprocess for multimodal models, where images of dynamic shape are padded. As images are of dynamic shape, we can't be sure if padding will be required or not. Padding is guaranteed to be non-negative.
Preprocess code: https://github.com/pytorch/torchtune/pull/1242
Note: the alternative is to wrap padding in a custom op, which isn't ideal given the custom op will contain the same impl as constant_pad_nd.
Test Plan: ci
Differential Revision: D60687727
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132679
Approved by: https://github.com/ezyang
The regex in the script is too restrictive, as it excludes examples with parentheses in args, like the following:
```
triton_poi_fused_add_0.run(arg0_1.item(), arg1_1.item(), buf0, 1, grid=grid(1), stream=streamNone)
^
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130837
Approved by: https://github.com/Chillee
Fixes the observed graph breaks in https://github.com/pytorch/pytorch/issues/121349 and https://github.com/pytorch/pytorch/issues/121350.
But there are still graph breaks since a random output is being used as a seed, e.g.
```python
import random
import torch
def fn(x):
seed = random.randint(0, 100)
rand = random.Random(seed)
return x + rand.randrange(10)
opt_fn = torch.compile(fn, backend="eager", fullgraph=True)
opt_fn(torch.ones(1))
```
fails with
```
torch._dynamo.exc.InternalTorchDynamoError: UnspecializedPythonVariable() is not a constant
```
when tracing the line
```
rand = random.Random(seed)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133725
Approved by: https://github.com/jansel
# UPDATE:
This is take 3 of https://github.com/pytorch/pytorch/pull/131863, which was landed via co-dev but did not apply correctly.
# Summary
Changes the stance of SDPA on what to do for fully masked out rows
## Current Behavior
Several PyTorch users have expressed frustration over this issue:
- https://github.com/pytorch/pytorch/issues/41508
- https://github.com/pytorch/pytorch/issues/103749
- https://github.com/pytorch/pytorch/issues/103963
These are significant issues with extensive discussion but no satisfactory resolution. The PyTorch team's consensus, as stated here:
https://github.com/pytorch/pytorch/issues/24816#issuecomment-524415617
Can be paraphrased as follows:
When passing in fully masked out rows, attention becomes ambiguous. We have two main options:
1. Uniformly attend to all values:
```python
scores[masked_out_rows] = 1 / len(row)
out[masked_out_rows] = 1 / len(row) * value
```
2. Decide that attention between no queries (masked) and no keys (masked) is meaningless:
```python
output[fully_masked_rows] = NaN
```
We went with option 2. Partially because it was easier to implement, but also people argued that users can slice the output to remove the NaNs:
``` Python
>fill_value = -float("inf")
>row0 = torch.randn(4)
>row1 = torch.tensor([fill_value for _ in range(4)])
>matrix = torch.stack([row0, row1]).requires_grad_(True)
>out = torch.softmax(matrix, 1)
>out = out[0]
>print(out)
tensor([0.5377, 0.2729, 0.0692, 0.1201])
```
Cool, problem solved. But what happens when you call backward...
```Python
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[3.0957e-08, 1.4157e-08, 7.7802e-10, 1.3713e-08],
[ nan, nan, nan, nan]])
```
Those pesky NaNs are back!
## Why do we see NaNs today?
The core of the problem revolves around using softmax function in sdpa:
```python
> row = torch.tensor([(-float("inf")) for _ in range(4)])
> torch.softmax(row, 0)
tensor([nan, nan, nan, nan])
```
## Quick Aside: Masking in Attention
Attention itself doesn't have a concept of masking. The `sdpa` function has an argument called `attn_mask`, which would be more accurately named `attn_bias`. This is because we don't actually "mask" entries when computing attention. Instead, due to implementation details([performance](https://github.com/pytorch/pytorch/issues/25110#issuecomment-524519087)), we add a value to the masked-out query/key pairs.
We use a large negative number (typically -inf) to decrease the attention weight, as softmax assigns more weight to larger values.
## Alternative Approaches
If we use a very large negative number instead of -inf:
```python
> row = torch.tensor([(-1e6) for _ in range(4)])
> torch.softmax(row, 0)
tensor([0.2500, 0.2500, 0.2500, 0.2500])
```
However if users always remembered to "slice" out their outputs i.e.:
```Python
>fill_value = -1e6
>...
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[-0.0563, -0.0564, 0.1613, -0.0486],
[ 0.0000, 0.0000, 0.0000, 0.0000]])
```
This would bring us back into a better state.
## A Third Option
We don't necessarily need to alter the behavior of softmax for -inf or very large negative numbers. The fundamental goal is to exclude certain query/key pairs from attention, regardless of the underlying implementation.
This PR implements the new semantic for masking w/ attention in fully masked-out rows:
```python
out[masked_out_rows] = 0
```
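A hedged illustration of the new semantic (shapes chosen arbitrarily): a query row whose every key is masked out now yields zeros instead of NaNs.
```python
import torch
import torch.nn.functional as F

q = torch.randn(1, 1, 2, 8)
k = torch.randn(1, 1, 2, 8)
v = torch.randn(1, 1, 2, 8)
attn_mask = torch.tensor([[True, True],
                          [False, False]])  # second query row attends to nothing
out = F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
# out[..., 1, :] is expected to be all zeros under the updated behavior
```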
**Important Note**: This idea isn't entirely new. The [MaskedTensor](https://pytorch.org/tutorials/prototype/maskedtensor_overview#safe-softmax) prototype, a tensor subclass, was designed to handle such cases. However, it remains a prototype feature and hasn't gained widespread adoption.
## Details
This PR stack does 3 things:
1. Adds a PRIVATE _safe_softmax op
2. Updates semantic for flash_cpu fused kernel
3. Updates semantic for efficient_cuda fused kernel
_safe_softmax is not supposed to be used generically and is only meant to be used within the context of SDPA. Because of this, instead of decomposing softmax and checking for -inf rows, we "cheat" and use nan_to_num.
Why do I think this is okay? (Please find a counterpoint if available.)
There are multiple ways NaNs can emerge. For the fully-masked-out-rows case, nan_to_num works. But what if there were other NaNs; wouldn't this silently remove them?
The only case where this can happen is if the input itself had a NaN or an Inf.
For example:
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = torch.finfo(torch.float16).max
print(a.softmax(-1))
```
Will return
`tensor([0., 1., 0., 0.], dtype=torch.float16)`
Where
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = float("inf")
a.softmax(-1)
```
returns:
`tensor([nan, nan, nan, nan], dtype=torch.float16)`
If we don't want to even allow for the possibility of "inf" or "NaN" attention scores being converted to 0, then we can implement it with something like this:
```Python
max = torch.max(a, dim=-1, keepdim=True)
exp = torch.exp(a - max.values)
denom = torch.sum(exp, dim=-1, keepdim=True)
softmax = exp / denom
softmax = torch.where(max.values == float('-inf'), 0.0, softmax)
```
however we would be paying for this in math performance.
## Why Now
I think one point that has substantially changed where PyTorch should lie on this argument is the fact that we have fused implementations for SDPA now. And these fused implementations allow us to easily and performantly support this new semantic.
Differential Revision: [D61418679](https://our.internmc.facebook.com/intern/diff/D61418679)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133882
Approved by: https://github.com/soulitzer
When exporting a training model for Executorch (which requires all ops to be core aten) with cross entropy loss (`torch.nn.CrossEntropyLoss`), we ran into the following error from the fx verifier in `to_edge`:
```
torch._export.verifier.SpecViolationError: Operator torch._ops.aten.nll_loss2d_forward.default is not Aten Canonical.
```
The aten [implementation](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/LossNLL.cpp#L624) of `torch.nn.CrossEntropyLoss` uses `nll_loss2d_forward` for inference and `nll_loss2d_backward` for training, so we need to add the decompositions for both (which already exist) to the list of core aten decompositions.
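A hedged repro sketch of the pattern that hits this path (shapes are hypothetical): 4D logits with 3D targets route `CrossEntropyLoss` through `nll_loss2d`.
```python
import torch

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.loss = torch.nn.CrossEntropyLoss()

    def forward(self, logits, target):
        return self.loss(logits, target)

logits = torch.randn(2, 3, 4, 4)                 # N, C, H, W
target = torch.randint(0, 3, (2, 4, 4))          # N, H, W class indices
ep = torch.export.export(M(), (logits, target))  # the downstream to_edge step hit the verifier error
```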
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133534
Approved by: https://github.com/JacobSzwejbka
## Description
Create decomposition of _unsafe_index_put (non-core aten) that turns it into index_put (core aten)
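A hedged sketch of the decomposition's shape (not the exact registration code): the "unsafe" variant only skips bounds checking, so it can be expressed with the core-aten index_put.
```python
import torch

def _unsafe_index_put_decomp(self, indices, values, accumulate=False):
    # Fall back to the checked core-aten op; behavior matches for in-bounds indices.
    return torch.ops.aten.index_put(self, indices, values, accumulate)
```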
## Testing
The Phi3 mini + LoRA model successfully passed `to_edge` after previously failing due to a non-core-aten `_unsafe_index_put` introduced by a decomposition during joint graph calculations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133365
Approved by: https://github.com/pianpwk
Summary: migrate to aten IR, `reshape` -> `view.default`; `flatten` is not covered since there are already optimizations for it in PT2, see the example here: P1506057533
Differential Revision: D60476525
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132183
Approved by: https://github.com/frank-wei
Creates a new runtime that shifts complexity from runtime to
ahead-of-time.
The existing runtime (PipelineScheduleMulti) accepts a compute-only
schedule (only forward, backward, and weight actions are specified) and
infers the communication operations at runtime.
Compared to that runtime, PipelineScheduleRuntime has less logic that
happens at runtime and relies on lowering passes to transform the
compute-only schedule to add communications.
Advantages include
- easier to verify correctness by dumping a compute+comm schedule
- possible to manually edit the compute+comm schedule if the lowering
heuristics are insufficient
Functionality included inside the PipelineScheduleRuntime is limited to
- accepting a compute-only schedule and lowering it to add comms
- executing the compute or comm operations specified by the given
schedule
- handling work.wait() automatically by calling it just before the
matching compute operation (for RECV ops) or at the end of step (for
SEND ops)
Follow ups for later PRs
- Some refactoring should be done to replace PipelineScheduleMulti with
this runtime
- Optimizer execution is not considered (e.g. for zero-bubble cases)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130488
Approved by: https://github.com/H-Huang
Summary: Previously we were mocking out FbRemoteFxGraphCacheBackend which meant that we were missing testing a whole bunch of the cache code. Cache at a lower level (CacheClient, LocalAutotuneCacheBackend, ManifoldClient, Redis) so we cover a larger amount of the caching code.
Test Plan: unit tests
Reviewed By: oulgen
Differential Revision: D60937966
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133579
Approved by: https://github.com/oulgen
This is the first step toward having a basic, functional analyzer for FR in production.
- We want to use this script to find out abnormalities in collectives and report it to users.
- We also fixed some type errors.
- [Ongoing] Also we will add more unit tests to this script and make it modularized so that we can better maintain it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133412
Approved by: https://github.com/c-p-i-o, https://github.com/atalman
This is a bugfix for an issue recently encountered in ROCm/DeepSpeed. Currently, if a library installs pynvml and runs on ROCm, PyTorch will break: _HAS_PYNVML is set to true and PyTorch will attempt to use the amdsmi library for the device_count call, which may not be installed.
This fix will set _HAS_PYNVML to false on ROCm if amdsmi is not installed.
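A hedged sketch of the guard described above (illustrative; not the exact diff):
```python
import torch

try:
    import pynvml  # noqa: F401
    _HAS_PYNVML = True
except ImportError:
    _HAS_PYNVML = False

if torch.version.hip is not None:
    # On ROCm the device-count path needs amdsmi rather than pynvml.
    try:
        import amdsmi  # noqa: F401
    except ImportError:
        _HAS_PYNVML = False
```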
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132990
Approved by: https://github.com/pruthvistony, https://github.com/eqy, https://github.com/malfet
This fixes an issue on AArch64 CPUs supporting BF16, caused when torch.set_float32_matmul_precision("highest") does not disable the bf16 down-conversion in mkldnn_matmul.
This was discovered from a unit test failure where the decorator `torch.testing._internal.common_mkldnn.bf32_on_and_off`, which internally switches the float32_matmul_precision between "medium" and "highest", was not having the desired effect.
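Usage context, as a hedged example: with "highest", matmul is expected to run in true fp32 with no implicit bf16 down-conversion on these CPUs after the fix.
```python
import torch

torch.set_float32_matmul_precision("highest")
a = torch.randn(128, 128)
b = torch.randn(128, 128)
c = a @ b  # expected to be computed in full fp32 precision
```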
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130919
Approved by: https://github.com/jgong5
Upgrades the LF scale configs to change the default AMI in accordance with the Amazon 2023 rollout plan.
This PR will be merged on Monday Aug 19 in the morning, and over the next 2-3 days as new linux runners are spun up (and old ones spun down) they'll start using this new AMI
This PR will be paired with https://github.com/pytorch/test-infra/pull/5558, which will be merged after this one
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133641
Approved by: https://github.com/jeanschmidt
FIXES https://github.com/pytorch/pytorch/issues/123949 and https://github.com/pytorch/pytorch/issues/124376
torch.cuda.memory_allocated returns the amount of memory allocated in the current process, so if it isn't 0 it means another test didn't properly clean up after itself. I'm keeping the memory check and isolating these tests in a subprocess, as we don't have a good way to test for activation refcounts
e.g. https://github.com/pytorch/pytorch/runs/28838386083
```
_______________ TestCompiledAutograd.test_free_activation_memory _______________
Traceback (most recent call last):
File "/var/lib/jenkins/workspace/test/inductor/test_compiled_autograd.py", line 1892, in test_free_activation_memory
self.assertTrue(torch.cuda.memory_allocated() == 0)
File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
raise self.failureException(msg)
AssertionError: False is not true
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133733
Approved by: https://github.com/jansel
This threads through all of the necessary parts into aot autograd from the FXGraphCache changes so that we can run cudagraphs properly on a AOTAutograd cache hit.
Specifics:
- AOTAutograd needs access to the `cudagraphs` boxedbool in order to properly set the backward to not use cudagraphs on a cache hit from the forward.
- We have lots of tests that test this already from the previous PR, so I just added an extra test and made the previous test work with both AOTAutogradCache and FXGraphCache at the same time.
```
TORCH_LOGS=torch._functorch._aot_autograd.autograd_cache,cudagraphs ENABLE_AOT_AUTOGRAD_CACHE=1 TORCHINDUCTOR_FX_GRAPH_CACHE=1 tlp python benchmarks/gpt_fast/benchmark.py --output ~/gpt_fast_benchmark.csv
```
Twice, once on a cache miss and once on a cache hit.
Here is the perfetto trace for each(FB only link):
**Cache Miss:**
Logs:
```
Loading model Llama-2-7b-chat-hf
Time to load model: 0.66 seconds
I0813 10:53:34.416000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:479] [0/0] AOTAutograd cache miss for key alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey
I0813 10:53:51.395000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:558] [0/0] Writing AOTAutograd cache entry to /tmp/torchinductor_jjwu/aotautograd/alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey/entry
I0813 10:54:17.579000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:479] [1/0] AOTAutograd cache miss for key a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt
I0813 10:54:38.636000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:558] [1/0] Writing AOTAutograd cache entry to /tmp/torchinductor_jjwu/aotautograd/a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt/entry
I0813 10:54:39.228000 911030 torch/_inductor/cudagraph_trees.py:385] [__cudagraphs] recording cudagraph tree for graph without symints
V0813 10:54:39.939000 911030 torch/_inductor/cudagraph_trees.py:2160] [__cudagraphs] Running warmup of function 0
V0813 10:55:10.615000 911030 torch/_inductor/cudagraph_trees.py:2119] [__cudagraphs] Recording function 0 of graph recording id 0
Compilation time: 101.24 seconds
Average tokens/sec: 147.96 tokens/sec
Average bandwidth achieved: 1955.22 GB/s
Memory used: 14.51 GB
```
Chromium Event(fb only):
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json&local_cache_key

**Cache Hit:**
Logs:
```
Loading model Llama-2-7b-chat-hf
Time to load model: 0.67 seconds
I0813 10:55:51.821000 944420 torch/_functorch/_aot_autograd/autograd_cache.py:474] [0/0] AOTAutograd cache hit for key alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey
I0813 10:55:55.465000 944420 torch/_functorch/_aot_autograd/autograd_cache.py:474] [1/0] AOTAutograd cache hit for key a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt
I0813 10:55:56.030000 944420 torch/_inductor/cudagraph_trees.py:385] [__cudagraphs] recording cudagraph tree for graph without symints
V0813 10:55:56.192000 944420 torch/_inductor/cudagraph_trees.py:2160] [__cudagraphs] Running warmup of function 0
V0813 10:55:56.426000 944420 torch/_inductor/cudagraph_trees.py:2119] [__cudagraphs] Recording function 0 of graph recording id 0
Compilation time: 9.40 seconds
Average tokens/sec: 147.94 tokens/sec
Average bandwidth achieved: 1954.98 GB/s
Memory used: 14.51 GB
```
Chromium Event(fb only):
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom2%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom2%2Fchromium_events.json&local_cache_key

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132294
Approved by: https://github.com/eellison
Summary:
# context
* when running an IG FM training with PT2, we found there are a few graph breaks due to the torch.diff call in [jagged_tensor.py](https://fburl.com/code/cwssxabc)
```
_length: List[int] = (
_length_per_key_from_stride_per_key(torch.diff(offsets), stride_per_key)
if variable_stride_per_key
else torch.sum(torch.diff(offsets).view(-1, stride), dim=1).tolist()
)
```
* looking into the failure, we found that the TORCH_CHECK in diff should be a TORCH_SYM_CHECK
* slice_forward error: df3d7729e, [tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpxXZ2em/index.html)
```
RestartAnalysis
Tried to use data-dependent value in the subsequent computation. This can happen when we encounter unbounded dynamic value that is unknown during tracing time. You will need to explicitly give hint to the compiler. Please take a look at torch._check OR torch._check_is_size APIs. Could not guard on data-dependent expression ((5*u37 + u38)//(u37 + u38)) < 0 (unhinted: ((5*u37 + u38)//(u37 + u38)) < 0). (Size-like symbols: u38, u37)
ATTENTION: guard_size_oblivious would fix the error, evaluating expression to False.
Maybe you need to add guard_size_oblivious to framework code, see doc below for more guidance.
Potential framework code culprit (scroll up for full backtrace):
File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/e99934938a0abe90/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_decomp/decompositions.py", line 771, in slice_forward
if end_val < 0:
```
* after this diff: [tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpAhv2Sh/failures_and_restarts.html)
Test Plan:
# command
* run model
```
TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_v4_mini training.pipeline_type=pt2
```
* generate tlparse
```
tlparse `ls -t /var/tmp/tt/* | head -1`
```
Reviewed By: ezyang
Differential Revision: D56339251
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133740
Approved by: https://github.com/ezyang
Moving DTensor to the public namespace, to formally add the
documentation page that includes all the public APIs. This includes:
* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding more in the next
PRs)
* To preserve BC for users still using `torch.distributed._tensor`,
I added a shim script to redirect old path calls to the new module
The BC preservation is evidenced by the fact that all DTensor tests still
work without changing the public imports, so it's safe to land the
changes.
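A hedged check of the BC shim described above: the old private path should resolve to the same objects as the new public one.
```python
from torch.distributed.tensor import DTensor           # new public module path
from torch.distributed._tensor import DTensor as _DT   # old path, kept as a shim

assert DTensor is _DT
```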
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113
Approved by: https://github.com/XilunWu
ghstack dependencies: #133305, #133306
Summary:
These tests aren't running internally because the outer test harness is crashing without listing the tests. To fix we need:
* Add a target for the tools/stats/ folder since this test imports it
* Add a dependence to that target so it's included in the par
* Fix up the relative import syntax, which is somehow different internally vs. fbcode (not sure why this works, but many other tests are doing it)
Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cudagraph_trees_expandable_segments -- --run-disabled`
Differential Revision: D61396711
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133698
Approved by: https://github.com/xuzhao9
**Summary**
After enabling more vectorization, we found that vectorization does not always bring performance benefits. For example, a kernel with several non-contiguous index computations or non-contiguous buffer load/store operations can experience performance regression. A typical case is what we observed in the next PR: after fully enabling vectorization of `index_expr`, we saw a performance regression of `hf_BigBird`.
In this PR, we refactor tiling selection into a standalone module to enhance its extensibility for more advanced tiling-selection heuristics. A standalone class `TilingSelect` with its method `select_tiling` has been added. `select_tiling` accepts `fn_list` and `var_sizes_list` as inputs and returns `tiling_factors`, `tiling_indices`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130892
Approved by: https://github.com/jgong5
Summary:
This diff aims to fix the GPU Test skips in the quantization tests under the `caffe2/test/quantization` directory. The changes made in the `TARGETS` files include adding the `should_use_remote_gpu` flag to enable remote GPU testing. This should help to resolve the skipped tests and improve the overall test coverage.
[This diff] Fixed skip count: 4
[Running total] Fixed skip count: 4
Note: Creating separate diffs for each test-group.
Test Plan:
**281475054644766**: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_compare_per_channel_device_numerics (caffe2.test.quantization.core.test_quantized_tensor.TestQuantizedTensor)'
https://www.internalfb.com/intern/testinfra/testrun/5629499773981783
**281475054644780**: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_compare_per_tensor_device_numerics (caffe2.test.quantization.core.test_quantized_tensor.TestQuantizedTensor)'
https://www.internalfb.com/intern/testinfra/testrun/11540474087422107
**281475054644853**: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_quant_pin_memory (caffe2.test.quantization.core.test_quantized_tensor.TestQuantizedTensor)'
https://www.internalfb.com/intern/testinfra/testrun/11540474087422477
**844425008078016**: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_cuda_quantization_does_not_pin_memory (caffe2.test.quantization.core.test_quantized_tensor.TestQuantizedTensor)'
https://www.internalfb.com/intern/testinfra/testrun/1407375259845199
Differential Revision: D60055277
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133158
Approved by: https://github.com/jovianjaison
Summary: Recently we observed in AI CMF that enabling the decompose_mm pass leads to mixed-dtype errors for aten.mm and aten.addmm. By investigation, we figured out that the error comes from torch.sum, which has an implicit type cast to avoid possible overflow (a similar discussion on GitHub: https://github.com/pytorch/pytorch/issues/115832). Thus we cast the output to avoid the error.
Test Plan:
# unit test
```
buck2 test mode/dev-nosan //caffe2/test/inductor:decompose_mem_bound_mm -- test_decompose_mm_mixed_precision
```
Buck UI: https://www.internalfb.com/buck2/00dc168e-4d65-40f8-b169-f4a58206f641
Test UI: https://www.internalfb.com/intern/testinfra/testrun/17169973624867151
Network: Up: 25KiB Down: 44KiB (reSessionID-b7e2ecc7-16ca-476d-95b2-09ea74645eb0)
Jobs completed: 19. Time elapsed: 1:07.6s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0
# e2e
ads_dper3:68464f2dc5e849ba2670482079cecaaa
training_platform:2c41d916ad5dd82f196372a8c7bd37a0
### build training_platform
```
buck2 run fbcode//fblearner/flow/projects/training_platform:training_platform
```
### register training_platform
```
buck2 run mode/opt fblearner/flow/projects/training_platform:workflow -- register-workflows --project-name training_platform --flow_version training_platform:2c41d916ad5dd82f196372a8c7bd37a0
```
### build ads_dper 3
```
fbpkg build -E ads_dper3 --yes --expire 14d
```
### register ads_dper 3
```
buck2 run //pyper/core/eval_app_utils:flow_utils_script -- register --pkg-version ads_dper3:68464f2dc5e849ba2670482079cecaaa
```
### extend package (optional)
```
fbpkg expire --extend-only training_platform:2c41d916ad5dd82f196372a8c7bd37a0 30d
```
### before fix
f591360990
### after fix
baseline
f591395056
proposal
Differential Revision: D61351815
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133626
Approved by: https://github.com/jackiexu1992
If the scalar tensor is an output tensor, it shouldn't be unwrapped (i.e. `.item()` called) since `tl.store` requires a pointer type for outputs. This issue only occurs for mutated buffers: the input tensor is also used as an output tensor.
Fixes #ISSUE_NUMBER
@yanboliang @jansel @ngimel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132859
Approved by: https://github.com/jansel
`torch.cuda.Event` objects are different from `torch.cuda.Stream` in that events are not pooled, meaning we can't look up a previously created CUDA event object by ID. This prevents a CUDA event object created outside of the Dynamo graph from being used within the graph (since Dynamo needs a way to emit a `call_function` line in the graph that retrieves the event object for downstream op use). This PR adds a simple object pool within a Dynamo utility to support looking up a CUDA event object by ID from within the Dynamo graph.
After this PR, if a user creates a CUDA event object outside of the graph and uses that event within the graph, the behavior will exactly match eager.
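An illustrative sketch of the id-keyed lookup (the helper names here are hypothetical; the real pool lives in Dynamo's utilities):
```python
import torch

_cuda_event_pool = {}

def register_cuda_event(event: torch.cuda.Event) -> int:
    _cuda_event_pool[id(event)] = event
    return id(event)

def lookup_cuda_event(event_id: int) -> torch.cuda.Event:
    return _cuda_event_pool[event_id]
```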
Test commands:
- `pytest -rA test/dynamo/test_ctx_manager.py::CtxManagerTests::test_cuda_event_created_outside_of_graph`
- `pytest -rA test/dynamo/test_ctx_manager.py::CtxManagerTests::test_cuda_event_across_graph_break`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133635
Approved by: https://github.com/yifuwang
ghstack dependencies: #133532, #133531, #133636
During Inductor lowering, layout constraints for an op are applied before the op's lowering is called. Currently `add_layout_constraint(aten._scaled_mm.default, constrain_to_fx_strides)` is called inside `aten._scaled_mm.default`'s lowering. This means that if the first `_scaled_mm` to be lowered relies on the layout constraint, the constraint won't be applied and the generated code will fail. The issue won't manifest if the first `_scaled_mm` doesn't rely on the layout constraint.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133669
Approved by: https://github.com/drisspg, https://github.com/yangsiyu007
Updates CUDNN_frontend header only library to make the most of the newest CUDNN features and decrease the overhead of the library.
Copied from commit:
New API
- Graph Slice Operation: Introduced the graph.slice operation for slicing input tensors. Refer to docs/operations/Slice.md for detailed documentation and samples/cpp/misc/slice.cpp for a C++ sample. Pybinds for this operation have also been added.
- SM Carveout Feature: Added the set_sm_count(int32_t type) graph property to support the SM Carveout feature introduced in Ampere and Hopper GPUs. Engines that do not support SM_COUNT will return NOT_SUPPORTED.
Bug Fixes
- Convolution Mode Attribute: Added the missing set_convolution_mode attribute to convolution attributes in forward propagation (fprop), data gradient (dgrad), and weight gradient (wgrad). Previously, this was hardcoded to CUDNN_CROSS_CORRELATION in the 1.x API.
- SDPA FP8 Backward Node: Fixed an issue with the deserialization of the sdpa_fp8_backward node.
Enhancements
- Graph Execution Overhead: Reduced the overhead of graph.execute() by optimizing sub-node tree traversal, collected UIDs, workspace modifications, and workspace size.
- Graph Validation Performance: Significantly improved (~10x) the performance of graph.validate() by deferring graph expansion to a later stage (build_operation_graph).
- Optional Running Stats for BatchNorm: Made the running statistics for the batch normalization operation optional, supported by cuDNN backend version 9.3.0 and later.
- Shape and Stride Inferencing: Enhanced shape and stride inferencing to preserve the stride order of the input.
- Diagnostic Error Message: Added a diagnostic error message to create_execution_plans if called without the preceding build_operation_graph.
- JSON Schema and Deserialization: Improved the JSON schema and deserialization logic with additional checks.
- Logging Overhead: Reduced logging overhead, resulting in faster graph.build() calls.
- CMake Integration: Replaced CMAKE_SOURCE_DIR with PROJECT_SOURCE_DIR in CMake files for better integration. See the relevant pull request for more details.
Samples
- Jupyter Notebooks: Added Jupyter notebooks for RMSNorm, InstanceNorm, and LayerNorm. Refer to the samples/python folder for more information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133687
Approved by: https://github.com/eqy, https://github.com/malfet
During distributed training if all ranks except one hit the cache, the rank that did not hit the cache will cause a NCCL timeout since rest of the ranks will enter the collective and start the timer. This PR uses the new PTD API to increase timeout for the ranks that hit the cache by the amount of time the cache would save.
Differential Revision: [D61363722](https://our.internmc.facebook.com/intern/diff/D61363722)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133374
Approved by: https://github.com/ezyang
This is a low-risk short-term fix for
https://github.com/pytorch/pytorch/issues/128084, for the purposes of
2.4.1. The actual fix for that issue is more risky and we'll target 2.5.
needs_fixed_stride_order is silently incorrect with args that are
mutable because it creates clones of those args, writes into them, and
doesn't update the original args.
This PR makes it so that needs_fixed_stride_order doesn't apply to
inputs that are being mutated.
This PR doesn't completely fix the problem, but it makes it less
incorrect: most of the time the input already has the correct strides
but inductor fails to recognize it, and in those cases writing directly
to the input is fine.
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133452
Approved by: https://github.com/eellison
Fix https://github.com/pytorch/pytorch/issues/132716
The triton template for convolution does not work when the stride or padding contains dynamic shapes. Use the hint and add guards to handle that. An alternative is to fall back to eager, but since I've seen the lowering rule for convolution use the hint in other cases, I'll just follow that convention.
I don't really know how to add a unit test here since I need to create symbolic strides (not the strides of a tensor, but the stride parameter for convolution) and paddings. I can try harder if reviewers want me to add unit tests.
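A hedged repro sketch (module and shapes are made up, not the exact case from the linked issue); the point is that `padding` reaches the convolution lowering as a symbolic expression rather than a constant:
```python
import torch
import torch.nn.functional as F


class M(torch.nn.Module):
    def forward(self, x, w):
        pad = x.shape[-1] // 64  # symbolic once the last dim is marked dynamic
        return F.conv2d(x, w, padding=pad)


x = torch.randn(2, 3, 64, 256, device="cuda")
w = torch.randn(8, 3, 3, 3, device="cuda")
torch._dynamo.mark_dynamic(x, 3)  # make the last input dim dynamic

m = torch.compile(M(), mode="max-autotune")  # max-autotune to exercise the conv template
out = m(x, w)
```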
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132938
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #132952
Setting `torch._dynamo.config.skip_fsdp_hooks = True` is required for graph-break compiled FSDP2, so making it the default will make this adoption easier. If users want to use Traceable FSDP2, they can set it to False manually (which will allow the FSDP2 hooks to be traced through).
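For example, to opt back into Traceable FSDP2 under the new default:
```python
import torch

# Let dynamo trace through the FSDP2 hooks instead of skipping them.
torch._dynamo.config.skip_fsdp_hooks = False
```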
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133531
Approved by: https://github.com/awgu
ghstack dependencies: #133532
Fixes #128059
I'm not sure if this is the right way, since Inductor doesn't always respect the device id set by users, so probably we should just wrap it as a null context manager and print a warning. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @jansel @anijain2305 @mlazos @williamwen42
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133385
Approved by: https://github.com/jansel
Summary:
We saw ncclCommAbort being called and hanging during NCCLComm::create. If the NCCL comm is not properly initialized, ncclCommAbort behavior is undefined; avoiding the call allows the process to properly throw an exception.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133630
Approved by: https://github.com/wconstab
This PR fixes the accuracy issues when template_buffer has users other than the epilogue nodes. This will fix the accuracy failure of the below models using max-autotune:
- MobileBertForMaskedLM
- MobileBertForQuestionAnswering
- convnext_base
- swin_base_patch4_window7_224
## Issue 1:
Previously we always added `template_buffer` as an alias of `Y`. When the `template_buffer` has users other than the epilogue nodes, we shouldn't set it as an alias of `Y`. This PR adds a check for that case.
Incorrect code before the fix: `tmp4` and `tmp9` are both stored to `Y`, while we need two different buffers for them since `tmp4` is used by nodes other than the epilogue node:
```cpp
Y[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp4; // tmp4 is the output of the template
Y[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp9; // tmp9 is the output of the epilogue node
```
Correct code after the fix:
```cpp
out_ptr2[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp4;
Y[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp9;
```
## Issue 2:
When fixing the above issue, we found a correctness issue when `bias` is `False`. The root cause is that when `bias` is `False`, the `template_buffer` has users other than the epilogue nodes, and the GEMM output buffer is localized, so we need to add an extra copy epilogue to ensure that the GEMM output (a local buffer) is stored to the `template_buffer` that will be used later by other nodes.
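A hedged repro sketch of the shape of the problem (sizes are made up, and whether the CPP GEMM template is actually selected depends on the max-autotune config): the GEMM output `y` has a user other than the epilogue because it is returned directly, and `bias` is `False`:
```python
import torch


class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(64, 32, bias=False)

    def forward(self, x):
        y = self.linear(x)
        return torch.relu(y), y  # y escapes the epilogue fusion


m = M().eval()
x = torch.randn(8, 64)
ref = m(x)
opt = torch.compile(m, mode="max-autotune")(x)
torch.testing.assert_close(opt[0], ref[0])
torch.testing.assert_close(opt[1], ref[1])
```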
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133073
Approved by: https://github.com/jgong5
ghstack dependencies: #133070
Summary: Some symbols (unbacked symints?) can have an upper bound of `sys.maxsize - 1`, but our code for runtime assertions assumes that such upper bounds come in as `sympy.oo` (like backed symints?) in order to drop them. So we weren't dropping them, which this PR fixes.
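A hedged sketch of the dropping rule (the helper name and the `INT_OO` sentinel below are assumptions for illustration, not the exact code in this PR):
```python
import sys

import sympy

INT_OO = sys.maxsize - 1  # assumed integer sentinel for "unbounded" unbacked symints


def has_trivial_upper_bound(upper) -> bool:
    # Both the symbolic infinity and the integer sentinel mean "no real bound",
    # so a runtime assertion against them can be dropped.
    return upper == sympy.oo or upper == INT_OO
```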
Test Plan: added test
Differential Revision: D61352056
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133627
Approved by: https://github.com/SherlockNoMad
Updating the source matcher to also accept pattern matching on the torch_fn metadata, which exists in both strict and non-strict export. We want to replace the use of source_fn_stack with torch_fn, as it's not possible for us to get source_fn_stack in non-strict export.
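For reference, a hedged sketch of how to inspect the torch_fn metadata on an exported graph (the exact metadata format is an implementation detail and may differ):
```python
import torch


class M(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x + 1)


ep = torch.export.export(M(), (torch.randn(2),))
for node in ep.graph.nodes:
    if node.op == "call_function":
        # torch_fn is populated in both strict and non-strict export.
        print(node.name, node.meta.get("torch_fn"))
```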
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133642
Approved by: https://github.com/ydwu4
This PR enables dynamic shapes for the CK backend for gemm max autotune (see #125453).
This is achieved via unhardcoding the problem sizes from the template body and passing them as parameters instead.
We handle passing the problem sizes for the kernel call as well as for the benchmark call.
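A hedged usage sketch (ROCm only; the config value below is an assumption based on the CK backend tests):
```python
import torch
import torch._inductor.config as inductor_config

# Include CK among the GEMM backends considered by max-autotune.
inductor_config.max_autotune_gemm_backends = "CK,ATEN"


@torch.compile(mode="max-autotune", dynamic=True)
def mm(a, b):
    return a @ b


a = torch.randn(256, 512, device="cuda", dtype=torch.float16)
b = torch.randn(512, 128, device="cuda", dtype=torch.float16)
out = mm(a, b)  # problem sizes are passed to the CK kernel at runtime
```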
# Testing
`pytest test/inductor/test_ck_backend.py [-k dynamic]`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133285
Approved by: https://github.com/ColinPeppler
Summary: Recently we observed more missing example values in nodes introduced by Optimus, which causes problems for further optimization when this node info needs to be used. Thus we add the meta for these nodes in this diff.
Test Plan:
# unit test
```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```
Buck UI: https://www.internalfb.com/buck2/c0ad506f-ce9d-4b80-947a-cb79074b72f0
Test UI: https://www.internalfb.com/intern/testinfra/testrun/2251800058834808
Network: Up: 1.4GiB Down: 2.0GiB (reSessionID-fb781425-f29b-44b5-8a5b-daffe7274f86)
Jobs completed: 300289. Time elapsed: 13:19.5s.
Cache hits: 99%. Commands: 119360 (cached: 118494, remote: 824, local: 42)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0
# benchmark
```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "cmf_shrink" --flow_id 587303213
```
P1520691492
Differential Revision: D61039772
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133414
Approved by: https://github.com/jackiexu1992
This is the first step toward having a basic, functional analyzer for FR in production.
- We want to use this script to find abnormalities in collectives and report them to users.
- We also fixed some type errors.
- [Ongoing] We will also add more unit tests to this script and modularize it so that it is easier to maintain.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133412
Approved by: https://github.com/c-p-i-o
Summary: Switch to set_proxy_slot instead of setting the proxy directly on the Tensor. We do not want to add Proxy to tensor objects, because Proxy cannot be deepcopied or pickled, which can cause problems when users want to deepcopy or pickle their models.
Test Plan: CI
Differential Revision: D61277650
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133470
Approved by: https://github.com/zou3519