Related issue: #125077
### Feature
Inductor tries to remove dimensions with stride 0 from block pointers. Rather than loading with stride 0, it's more efficient to load a smaller block pointer, then use `tl.broadcast_to` to broadcast it up to the desired size. This already worked for simpler block pointers, but it was disabled for more complex block pointers which used `tl.reshape` to change the dimensionality after loading.
This PR generalizes the approach to work for all block pointers. The idea is to first reshape, adding singleton dimensions, then broadcast those singletons up to something larger, then reshape again to the final output shape. For readability, we emit this code only if it actually does something. Simpler loads will just have `tl.load`.
Here's an example of a complicated kernel that uses the full `reshape` -> `broadcast` -> `reshape` pattern after the load. (The first reshape is actually the slice `[:, None, None]`.)
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 64
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x2 = xindex
x1 = (xindex // 8)
tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
tmp1 = tl.reshape(tl.broadcast_to(tl.load(tl.make_block_ptr(in_ptr1, shape=[8], strides=[8], block_shape=[((7 + XBLOCK) // 8)], order=[0], offsets=[(xoffset // 8)]), boundary_check=[0], eviction_policy='evict_last')[:, None, None], [((7 + XBLOCK) // 8), ((1) * ((1) <= (((7 + XBLOCK) // 8))) + (((7 + XBLOCK) // 8)) * ((((7 + XBLOCK) // 8)) < (1))), ((8) * ((8) <= (XBLOCK)) + (XBLOCK) * ((XBLOCK) < (8)))]), [XBLOCK])
tmp2 = tmp0 + tmp1
tl.store(tl.make_block_ptr(out_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tmp2.to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```
Before this PR, we would have stride-0 dimensions:
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 64
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x2 = xindex
x1 = (xindex // 8)
tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), boundary_check=[0])
tmp1 = tl.reshape(tl.load(tl.make_block_ptr(in_ptr1, shape=[8, 1, 8], strides=[8, 0, 0], block_shape=[((7 + XBLOCK) // 8), ((1) * ((1) <= (((7 + XBLOCK) // 8))) + (((7 + XBLOCK) // 8)) * ((((7 + XBLOCK) // 8)) < (1))), ((8) * ((8) <= (XBLOCK)) + (XBLOCK) * ((XBLOCK) < (8)))], order=[2, 1, 0], offsets=[(xoffset // 8), 0, xoffset % 8]), boundary_check=[0], eviction_policy='evict_last'), [XBLOCK])
tmp2 = tmp0 + tmp1
tl.store(tl.make_block_ptr(out_ptr0, shape=[64], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0])
''', device_str='cuda')
```
Here's a simpler example where we use 2D tiling. In this case we don't actually need the broadcast: it's implied by a slice adding a new singleton dimension. This code is not changed by this PR, but it's important to check that we don't accidentally insert unnecessary broadcasts.
```
@triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
ynumel = 8
xnumel = 8
yoffset = tl.program_id(1) * YBLOCK
yindex = yoffset + tl.arange(0, YBLOCK)[None, :]
ymask = yindex < ynumel
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
xmask = xindex < xnumel
x1 = xindex
y0 = yindex
tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[8, 8], strides=[1, 8], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), boundary_check=[0, 1])
tmp1 = tl.load(tl.make_block_ptr(in_ptr1, shape=[8], strides=[8], block_shape=[YBLOCK], order=[0], offsets=[yoffset]), boundary_check=[0], eviction_policy='evict_last')[None, :]
tmp2 = tmp0 + tmp1
tl.store(tl.make_block_ptr(out_ptr0, shape=[8, 8], strides=[1, 8], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), tmp2.to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')
```
### Test Plan
Added a new expecttest to check the emitted code for broadcast addition. Looking at the test, we can see that stride 0 dimensions are removed. (This test generated the example kernels in the previous section.)
This change also removed a stride-0 dimension in an existing block pointer test. I updated the expected code accordingly.
Bonus: I noticed that the test parametrization for `config.prefer_nd_tiling` wasn't working as intended. It ended up always setting this option to `True`. Fixed it so we get the intended test coverage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135557
Approved by: https://github.com/shunting314, https://github.com/jansel
Co-authored-by: Yueming Hao <yhao@meta.com>
Summary: We have an internal report of a Triton compiler error `ValueError: Cannot broadcast, rank mismatch: [1], [1, 2048]` coming from a line like this:
```
tmp25 = tl.broadcast_to(((tl.full([1], 1.00000000000000, tl.float64)) + ((ks0 // 3278).to(tl.float64))) / (((tl.full([1], 0.500000000000000, tl.float64))*(libdevice.sqrt((1 + ((ks0 // 3278)*(ks0 // 3278)) + ((-2)*(ks0 // 3278))).to(tl.float64).to(tl.float32)))) + ((tl.full([1], 0.500000000000000, tl.float64))*((1 + (ks0 // 3278)).to(tl.float64)))), [XBLOCK, RBLOCK])
```
https://github.com/pytorch/pytorch/pull/135260 is the cause, presumably because we turn a constant into a 1-element tensor with `tl.full([1], const, tl.float64)`. It looks like changing the syntax to `tl.full([], const, tl.float64)`, which produces a 0-d value instead, gives us what we want?
Differential Revision: [D63465169](https://our.internmc.facebook.com/intern/diff/D63465169)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136594
Approved by: https://github.com/mengluy0125, https://github.com/jansel
Summary:
We skip `save_gpu_kernel` if the kernel has already been saved.
This gives us a more accurate Triton profiling result. The following traces show before/after the change for a benchmarking run of a trivial addmm:
Before:
<img width="1255" alt="Screenshot 2024-09-23 at 10 26 53 AM" src="https://github.com/user-attachments/assets/5aea05ef-6ef0-464c-8da9-17b31c97b43a">
After:
<img width="910" alt="Screenshot 2024-09-23 at 10 27 03 AM" src="https://github.com/user-attachments/assets/488b7d4f-268f-41cf-8553-cb16ceeae118">
We can see that before the change, the benchmarking includes two parts:
(1) the overhead of our triton_heuristic call, which includes the save/get and the (expensive) hash computation, and
(2) the actual computation of the Triton kernel.
We see that (1) accounts for >50% of the time, which makes kernel selection for profiling often choose aten kernels over Triton kernels.
Test Plan:
Existing OSS CI
[Redacted, Some internal model results in D63441430]
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136389
Approved by: https://github.com/desertfire
Summary:
Reenable the `test_triton_wrapper.py` test again
# Why
We want this to run internally
# What
- fix python path issue on the test
- reenable the test
# Background
It appears that the parent process does not pass its entire module search path down to the child process. Namely, if there is some setup that makes the effective `sys.path` different from, say, PYTHONPATH, the child will not inherit that setup. To avoid needing to keep track of specific setups, we pass the effective `sys.path` from the parent to the child through the PYTHONPATH env variable, as sketched below.
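A minimal sketch of that propagation (illustrative, not the test harness's exact code):
```python
import os
import subprocess
import sys

# Rebuild the child's module search path from the parent's *effective* sys.path,
# so any runtime sys.path modifications in the parent survive the spawn.
env = {**os.environ, "PYTHONPATH": os.pathsep.join(p for p in sys.path if p)}
subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.path)"],
    env=env,
    check=True,
)
```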
Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:triton_wrapper
Differential Revision: D63438186
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136721
Approved by: https://github.com/henrylhtsang
## Motivation
The FSDP common code for FSDP UT execution is mostly written with the CUDA device in mind. However, other devices such as Intel Gaudi support most of the functionality. We are generalizing the base content so that the UT content can be used for non-CUDA device execution.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133209
Approved by: https://github.com/kwen2501
Move the `get-job-id` steps before running the tests and copy the environment variables from `_mac-test.yml` added in https://github.com/pytorch/pytorch/pull/113099.
Should fix the following warning during MPS test run:
```
/Users/ec2-user/runner/_work/pytorch/pytorch/tools/stats/upload_metrics.py:147: UserWarning: Not emitting metrics for td_test_failure_stats_v2. Missing job_id. Please set the JOB_ID environment variable to pass in this value.
warn(f"Not emitting metrics for {metric_name}. {e}")
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136791
Approved by: https://github.com/albanD, https://github.com/izaitsevfb
PyTorch community members have reported issues with building PyTorch from source for ROCm in an environment that doesn't have aotriton pre-installed, because aotriton is only installed in the [CI](a8ed873ba2/.ci/docker/manywheel/Dockerfile (L197)) docker images. Building aotriton from source can take ~45 minutes.
This PR fixes the issue by downloading the aotriton tarball in such scenarios, *unless the user explicitly wants to build aotriton from source using the AOTRITON_INSTALL_FROM_SOURCE=1 env var*
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136603
Approved by: https://github.com/atalman
Co-authored-by: Xinya Zhang <Xinya.Zhang@amd.com>
Summary:
With empty graphs, the `graph.inserting_before(first_user_input = None)` call turns into a `graph.inserting_after(root)` call, inverting the order of constant input nodes being inserted.
This fixes the issue by initializing the insertion anchor to the first node in the graph (still valid even if it's not a user input, since it's only used as an insertion point).
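A minimal sketch (not the PR's exact code) of why the anchor choice matters for insertion order:
```python
import torch.fx as fx

graph = fx.Graph()
graph.output(None)  # a graph with no user-input placeholders

# New behavior: anchor on the first node in the graph, which always exists here.
anchor = next(iter(graph.nodes))
with graph.inserting_before(anchor):
    graph.placeholder("const0")
    graph.placeholder("const1")
print([n.name for n in graph.nodes])  # ['const0', 'const1', 'output'] -- order preserved

# Old behavior: with no user-input placeholder the anchor was None, and
# inserting_before(None) falls back to inserting after the root, which reverses
# the order of consecutively inserted constant-input nodes.
```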
Test Plan: test_export
Differential Revision: D63403514
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136658
Approved by: https://github.com/avikchaudhuri
This file hadn't had an overhaul in a few years, so one is long overdue. Most of the credit goes to @orionr for gathering all of this info.
The main rules we followed:
- No code contributor is removed; they're all placed as emeritus
- Break down categories that are too big, to make this document useful for knowing whom to ping
- No category whose code is still in the codebase is removed
- We did not rework the categories (for example, to be closer to `module:` labels) and leave that for later
- All non-emeritus names are ordered by their number of comments on issues related to their topic
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136672
Approved by: https://github.com/eqy, https://github.com/ezyang, https://github.com/seemethere, https://github.com/malfet
Not sure why `isinf` is a composite op, but it needs to be implemented by hand. The implementation is a trivial call to
```objc
[mpsGraph equalWithPrimaryTensor:input
secondaryTensor:[mpsGraph constantWithScalar:std::numeric_limits<T>::infinity()
dataType:input.dataType]]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136689
Approved by: https://github.com/Skylion007
Prior to this PR, calling `reshape()` under `inference_mode()` would throw a `NotImplementedError`. This is because `inference_mode()` disables autograd key dispatch, incidentally preventing the decomposition of reshape for NJT.
This PR fixes this by redispatching on the `CompositeImplicitAutogradNestedTensor` key whenever a composite implicit op is encountered in `NJT.__torch_dispatch__()`. This fixes reshape and any other composite implicit ops underneath `inference_mode()`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134683
Approved by: https://github.com/soulitzer, https://github.com/albanD
ghstack dependencies: #136566
Fixes #136565
This PR makes the python fallback robust to the case where there are no active modes & no tensors with the Python key. In this case, simply redispatch with the Python key disabled.
This was found when trying to use reentrant dispatch for NJT to get decompositions under `inference_mode()` when the autograd key is disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136566
Approved by: https://github.com/bdhirsh
**Summary**
Optimize the WOQ int8 AMX performance by changing the int8 -> bf16 conversion.
Earlier, 16 int8 elements were being loaded at a time & converted to 16 BF16 elements.
With this change, 32 int8 elements will be loaded at a time, and converted to a cache-line of 32 BF16 elements more efficiently.
Performance before
```
AUTOTUNE _weight_int8pack_mm(4096x4096, 4096x4096, 4096)
cpp_packed_gemm_0 38.0439 ms 100.0%
_weight_int8pack_mm 50.2524 ms 75.7%
SingleProcess AUTOTUNE benchmarking takes 1.1087 seconds and 1.9791 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x4096, 11008x4096, 11008)
cpp_packed_gemm_4 78.2038 ms 100.0%
_weight_int8pack_mm 119.1962 ms 65.6%
SingleProcess AUTOTUNE benchmarking takes 1.9274 seconds and 1.9949 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x11008, 4096x11008, 4096)
cpp_packed_gemm_6 79.2368 ms 100.0%
_weight_int8pack_mm 118.3212 ms 67.0%
SingleProcess AUTOTUNE benchmarking takes 1.9200 seconds and 2.0015 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x4096, 32000x4096, 32000)
cpp_packed_gemm_224 225.7201 ms 100.0%
_weight_int8pack_mm 388.5588 ms 58.1%
```
Performance after this PR
```
AUTOTUNE _weight_int8pack_mm(4096x4096, 4096x4096, 4096)
cpp_packed_gemm_0 11.0086 ms 100.0%
_weight_int8pack_mm 50.2918 ms 21.9%
SingleProcess AUTOTUNE benchmarking takes 1.0837 seconds and 2.0301 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x4096, 11008x4096, 11008)
cpp_packed_gemm_4 24.3528 ms 100.0%
_weight_int8pack_mm 119.8492 ms 20.3%
SingleProcess AUTOTUNE benchmarking takes 1.8303 seconds and 1.8195 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x11008, 4096x11008, 4096)
cpp_packed_gemm_6 24.6148 ms 100.0%
_weight_int8pack_mm 119.1908 ms 20.7%
SingleProcess AUTOTUNE benchmarking takes 1.8315 seconds and 1.8352 seconds precompiling
AUTOTUNE _weight_int8pack_mm(4096x4096, 32000x4096, 32000)
cpp_packed_gemm_224 78.1369 ms 100.0%
_weight_int8pack_mm 387.6289 ms 20.2%
SingleProcess AUTOTUNE benchmarking takes 4.5059 seconds and 1.8010 seconds precompiling
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136630
Approved by: https://github.com/jgong5
ghstack dependencies: #136353
Summary:
We have a user report on a BA model that raised `AttributeError: 'SymFloat' object has no attribute 'shape'`, so we add a type check for the meta node.
See more context in the post
https://fb.workplace.com/groups/1075192433118967/permalink/1510477489590457/
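A minimal sketch of the kind of check added (hypothetical helper, not the exact diff):
```python
import torch

def node_meta_shape(node):
    # The node's "val" meta may be a FakeTensor (which has .shape) or a
    # SymFloat/SymInt (which does not); guard with an isinstance check
    # instead of assuming a tensor.
    val = node.meta.get("val", None)
    if isinstance(val, torch.Tensor):  # FakeTensor is a torch.Tensor subclass
        return val.shape
    return None
```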
Test Plan:
# local reproduce
```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode split-batch-decompose --flow_id 646303196
```
P1609807876
# E2E
before fix
f646303196
after fix
Differential Revision: D63399959
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136650
Approved by: https://github.com/ezyang
Fixes #133683 Fixes #133684 Fixes #133688
This PR introduces a new base class `_ArglessActivation` and refactors five existing activation functions to inherit from it. This change aims to improve documentation consistency and also API consistency with other activation functions that do have parameters and explicitly call `super().__init__()`
Key changes and considerations:
1. Added new class `_ArglessActivation`:
2. Refactored the following classes to inherit from `_ArglessActivation`:
- Sigmoid
- Tanh
- Softsign
- Tanhshrink
- Softmax2d
3. Performance consideration:
- This change introduces a slight overhead for creating a new stack frame and handling an additional function call on every instance creation
- The impact is expected to be minimal in most use cases
Docs view before:
<img width="425" alt="Screen Shot 2024-09-18 at 3 00 22 PM" src="https://github.com/user-attachments/assets/ca0d1000-44c5-4c52-b344-68f7e170bafe">
Docs view after:
<img width="431" alt="Screen Shot 2024-09-18 at 3 00 52 PM" src="https://github.com/user-attachments/assets/f7ceb8f3-a2a2-4fd6-a2b8-39105a02bcbd">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136296
Approved by: https://github.com/mikaylagawarecki
Fixes https://github.com/pytorch/pytorch/issues/136177
The motivation is that torch::deploy doesn't handle this well. The workaround for users is to use C++ custom ops.
All torch.library APIs ultimately go through the torch.library.Library object, so we add checks there to no-op under torch::deploy.
Test Plan:
- new test
- going to test this internally and hope nothing breaks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136645
Approved by: https://github.com/ezyang
Fix two more leaks of the same variety as #136507 (see that PR desc and attached gdoc for debug details).
This time, also add a test-time check that helped to discover new leaks and ensure we won't accidently regress.
Adds `check_tensor_leak` util which internally asserts no tensors are being kept alive by other objects involved in py ref cycles.
Uses objgraph for a nice debug utility when a leak is found.
Credit to @H-Huang for pointing out objgraph and helping debug the `param_group["intermediates"]` leak.
I manually confirmed that all 3 of the leaks identified/fixed so far are caught by the unit test and checker.
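A sketch of what the checker does (assumed shape, not the exact implementation in `common_utils.py`):
```python
import gc
import warnings

import torch

def check_tensor_leak(dump_graphs: int = 1) -> None:
    gc.set_debug(gc.DEBUG_SAVEALL)  # keep members of collected cycles in gc.garbage
    gc.collect()
    leaked = [o for o in gc.garbage if isinstance(o, torch.Tensor)]
    gc.set_debug(0)
    if not leaked:
        return
    warnings.warn(
        f"{len(leaked)} tensors were found in the garbage. "
        "Did you introduce a reference cycle?"
    )
    try:
        import objgraph  # optional: render the backref chain keeping a tensor alive
        for t in leaked[:dump_graphs]:
            objgraph.show_backrefs([t], max_depth=5)  # writes a .dot/.png under /tmp
    except ImportError:
        pass
```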
Sample output, if I re-introduce a leak by commenting out `del param_group["intermediates"]` in _backward.py,
and run `python test/distributed/pipelining/test_schedule_multiproc.py -k test_schedule_with_native_zero_bubble`:
```
warnings.warn(
/data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5341: UserWarning: 34 tensors were found in the garbage. Did you introduce a reference cycle?
warnings.warn(
/data/users/whc/pytorch/torch/testing/_internal/common_utils.py:5347: UserWarning: Dumping first 1 objgraphs of leaked tensors rendered to png
Graph written to /tmp/objgraph-ztz642h3.dot (19 nodes)
Graph viewer (xdot) not found, generating a png instead
Image generated as /tmp/objgraph-ztz642h3.png
```
rendering of ` /tmp/objgraph-ztz642h3.png`:
<img width="1671" alt="image" src="https://github.com/user-attachments/assets/9098ff29-224c-4533-935b-83c210ac2e22">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136584
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
ghstack dependencies: #136507
Co-authored-by: Howard Huang <howardhuang@fb.com>
Fixes#131701
Use CMake imported targets more consistently to eliminate hardcode paths.
Here is the new relevant sections of Caffe2Targets.cmake:
```
set_target_properties(c10_hip PROPERTIES
INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include"
INTERFACE_LINK_LIBRARIES "c10;hip::amdhip64"
)
```
```
set_target_properties(torch_hip PROPERTIES
INTERFACE_COMPILE_DEFINITIONS "USE_C10D_NCCL"
INTERFACE_COMPILE_OPTIONS "-fPIC;-D__HIP_PLATFORM_AMD__=1;-DCUDA_HAS_FP16=1;-DUSE_ROCM;-D__HIP_NO_HALF_OPERATORS__=1;-D__HIP_NO_HALF_CONVERSIONS__=1;-DTORCH_HIP_VERSION=602;-Wno-shift-count-negative;-Wno-shift-count-overflow;-Wno-duplicate-decl-specifier;-DCAFFE2_USE_MIOPEN;-DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP;-std=c++17;-DHIPBLAS_V2;-DHIP_NEW_TYPE_ENUMS"
INTERFACE_INCLUDE_DIRECTORIES "${_IMPORT_PREFIX}/include"
INTERFACE_LINK_LIBRARIES "c10_hip;torch_cpu_library;hip::amdhip64;MIOpen;hiprtc::hiprtc;roc::hipblaslt;roc::hipblas;hip::hipfft;hip::hiprand;roc::hipsparse;roc::hipsolver"
)
```
The HIPCUB dependency was not actually used, which is why it is removed here; the imported target had undesirable side effects.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136283
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007, https://github.com/jithunnair-amd, https://github.com/atalman
This reverts commit 7743149b2be4a9eba7e0997ccdc6abe552bec266.
Reverts
* https://github.com/pytorch/pytorch/pull/135503
* https://github.com/pytorch/pytorch/pull/135502
* https://github.com/pytorch/pytorch/pull/135422
This passes this test. Earlier, the getitem would stay as a getitem in the FX graph, but now fake tensor propagation fails, saying that `.item()` is called. It seems that the torch function is not getting triggered during fake tensor propagation.
```
import torch
from torch.nn.attention.flex_attention import BlockMask, _mask_mod_signature, _score_mod_signature, flex_attention
from torch._inductor.lowering import make_pointwise, register_lowering
from torch._inductor.virtualized import ops
from torch.nn.attention.flex_attention import create_block_mask
torch.set_default_device('cuda')
flex_attention = torch.compile(flex_attention, dynamic=False)
prefix_lengths = torch.arange(8)
def prefix_lm(b, h, q, kv):
return prefix_lengths[b] >= kv
mask = create_block_mask(prefix_lm, 8, None, 512, 512, _compile=True)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136590
Approved by: https://github.com/Chillee
Summary: If you actually import the module, you might end up with some import cycle situation where a module is imported too early and accesses things that are not initialized yet.
Test Plan:
sandcastle and ossci
```
TORCH_LOGS=+torch._inductor.codecache buck run mode/opt caffe2/benchmarks/dynamo:torchbench
```
Differential Revision: D63330224
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136548
Approved by: https://github.com/Skylion007
Summary: Previously the `_inline_module` helper function only worked with submodules that have args specified. This diff updates the util function to look for input arguments in submodule kwargs first, using placeholder node names, and then fall back to the list of args if the node name is not found.
Test Plan:
```
buck2 run @//mode/{opt,mtia,inplace} //glow/fb/fx/fba/tests:test_fba_inductor -- -r test_connected_fusions
```
Differential Revision: D63347675
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136631
Approved by: https://github.com/jfix71
AMD devices have a warp (wavefront) size of 64; this PR makes the handling of "ELEMENTS_PER_WARP_32" generic and uses DeviceProperties.warp_size to determine the warp size instead of hard-coding it as 32. It also renames the enum value. Added a unit test for this.
Note: I left the old enum option (ELEMENTS_PER_WARP_32) as-is instead of renaming it. I'm not sure whether we should expect caches to get invalidated here; if this concern is valid, then there's a risk that the enum would get updated but some model could still use the cached inductor code, which would reference "ELEMENTS_PER_WARP_32", which would no longer exist.
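A minimal sketch of the warp-size query this relies on (illustrative):
```python
import torch

props = torch.cuda.get_device_properties(0)
# 32 on NVIDIA GPUs, 64 (wavefront size) on most AMD GPUs; fall back to 32 if absent.
warp_size = getattr(props, "warp_size", 32)
```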
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136472
Approved by: https://github.com/jansel
Summary: Title
Test Plan: CI
This fixes some breaking tests in ExecuTorch. I think the root cause is that when we have aten::matmul, which we are not preserving, we register a meta implementation from the C++ side. It seems like the C++ kernel doesn't work well with a mix of FakeTensor and real tensors. This PR sidesteps the problem by always preferring the Python CIA decomp over the C++ CIA decomp.
Differential Revision: D63297050
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136492
Approved by: https://github.com/bdhirsh
Summary: Previously we had a very bad bug where we didn't allow any decomp on CIA ops. This never mattered before because we never had to actually push CIA decomps to the Python key level in export.
Test Plan: CI
Differential Revision: D63363749
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136600
Approved by: https://github.com/bdhirsh
Fixes #136504
If you have a tl.constexpr parameter to a triton kernel, and you pass in a SymNode, then, right now, you run into failures (see under 'constants'):
```
File "/tmp/torchinductor_dberard/na/cnax67r5zmslz7bvdfizteaepj7fajpjallb3bu2gyetjcdqtbzj.py", line 14, in <module>
triton_meta={'signature': {0: '*fp32', 1: '*fp32'}, 'device': DeviceProperties(type='cuda', index=0, cc=90, major=9, regs_per_multiprocessor=65536, max_threads_per_multi_processor=2048, multi_processor_count=132, warp_size=32), 'constants': {2: s0, 3: 256}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1), equal_to_1=())]},
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
NameError: name 's0' is not defined
```
To fix this, we specialize on the value during dynamo tracing, so that we have a real integer when we do codegen.
Alternatives: specialize somewhere else (e.g. inductor); or figure out how to actually pass the value dynamically into the user-written kernel. However, if we try to pass a dynamic value, then we wouldn't be able to precompile the triton kernels in inductor or use AOTI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136512
Approved by: https://github.com/oulgen, https://github.com/jansel, https://github.com/eellison
The test is failing in trunk atm with the following error:
```
test_serialization.py::TestSerialization::test_skip_data_serialization_materialize_fake_False - AssertionError: "Can't pickle local object 'WeakValueDictionary.__init__.<locals>.remove'" does not match "Can't get local object 'WeakValueDictionary.__init__.<locals>.remove'"
```
for example, 36f0e61166
This comes from this cpython commit a3076c734d, and manifests in python 3.12.5 currently used in CI. The failure doesn't happen when I try it out with 3.12.3 and 3.12.4. Looking at the commit logs of https://github.com/python/cpython/commits/main/Lib/pickle.py, it looks like the exception message is changing back and forth, so I guess a regex match would capture both.
Fixes the compilation error of max-autotune for `maml_omniglot` (AMP and FP32) and `soft_actor_critic` (AMP) in Torchbench for single-thread dynamic shapes case:
```
/tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp: In function ‘void kernel(const bfloat16*, const bfloat16*, const bfloat16*, bfloat16*, int64_t)’:
/tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp:279:41: error: the value of ‘Mr_blocks’ is not usable in a constant expression
279 | constexpr int64_t m_block_end = Mr_blocks;
| ^~~~~~~~~
/tmp/torchinductor_user/uv/cuvq6wenwp7us423onuvntkfx4cspmagha5beiknob7tiebzhupa.cpp:237:19: note: ‘Mr_blocks’ was not initialized with a constant expression
237 | const int64_t Mr_blocks = (M + Mr - 1) / Mr;
| ^~~~~~~~~
```
The PR also updates the UT to add a test for `BS`=512 in single thread.
The previous case has `BS`=1024, equal to the `K` and `N` values; the generated code does not have symbolic shapes and thus fails to capture the above issue.
By adding a case with `BS`=512, the generated code will have a symbolic shape for the M dim and is able to reproduce the issue that this PR is addressing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136418
Approved by: https://github.com/jgong5
https://github.com/pytorch/pytorch/pull/136087 updated pybind11 to 2.13.6, and that new release has a feature expressed by [a new function](https://pybind11.readthedocs.io/en/latest/changelog.html#version-2-13-6-september-13-2024), `_pybind11_conduit_v1_`. The presence of this function breaks the serialization mechanisms used by Triton and in PyTorch itself.
Possible errors that have been noticed due to this change:
<details>
<summary> the first error </summary>
```bash
_________ KernelTests.test_layout_constraint_needs_fixed_stride_order __________
Traceback (most recent call last):
File "/runner/_work/intel-xpu-backend-for-triton/intel-xpu-backend-for-triton/pytorch/test/inductor/test_triton_kernels.py", line 1072, in test_layout_constraint_needs_fixed_stride_order
eager_out = f(x)
File "/runner/_work/intel-xpu-backend-for-triton/intel-xpu-backend-for-triton/pytorch/test/inductor/test_triton_kernels.py", line 1068, in f
arange_out(x, y)
File "/runner/_work/intel-xpu-backend-for-triton/intel-xpu-backend-for-triton/pytorch/test/inductor/test_triton_kernels.py", line 1059, in arange_out
kernel[grid](x, out, n_elements, BLOCK_SIZE=4)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/triton/runtime/jit.py", line 330, in <lambda>
return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/triton/runtime/jit.py", line 657, in run
kernel = self.compile(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/triton/compiler/compiler.py", line 315, in compile
metadata_group[metadata_filename] = fn_cache_manager.put(json.dumps(metadata, default=vars), metadata_filename,
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/json/__init__.py", line 234, in dumps
return cls(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/json/encoder.py", line 199, in encode
chunks = self.iterencode(o, _one_shot=True)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/json/encoder.py", line 257, in iterencode
return _iterencode(o, 0)
TypeError: vars() argument must have __dict__ attribute
```
</details>
<details>
<summary> the second error </summary>
```bash
________________ TestTritonWrapper.test_wrapper_using_gpu_seed _________________
Traceback (most recent call last):
File "/cache/pytorch-c5e9d03a2da4b93481737594cbe2f5931fa569aa833f206a638189cad2c36d3c-11/test/inductor/test_triton_wrapper.py", line 40, in test_wrapper_using_gpu_seed
out = f(x, y)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 465, in _fn
return fn(*args, **kwargs)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 1292, in __call__
return self._torchdynamo_orig_callable(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 1087, in __call__
result = self._inner_convert(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 530, in __call__
return _compile(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 933, in _compile
guarded_code = compile_inner(code, one_graph, hooks, transform)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 675, in compile_inner
return _compile_inner(code, one_graph, hooks, transform)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_utils_internal.py", line 87, in wrapper_function
return function(*args, **kwargs)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 708, in _compile_inner
out_code = transform_code_object(code, transform)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/bytecode_transformation.py", line 1322, in transform_code_object
transformations(instructions, code_options)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 220, in _fn
return fn(*args, **kwargs)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 643, in transform
tracer.run()
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2776, in run
super().run()
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 979, in run
while self.step():
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 891, in step
self.dispatch_table[inst.opcode](self, inst)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2967, in RETURN_VALUE
self._return(inst)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2952, in _return
self.output.compile_subgraph(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1117, in compile_subgraph
self.compile_and_call_fx_graph(tx, list(reversed(stack_values)), root)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1369, in compile_and_call_fx_graph
compiled_fn = self.call_user_compiler(gm)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1416, in call_user_compiler
return self._call_user_compiler(gm)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1465, in _call_user_compiler
raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/output_graph.py", line 1446, in _call_user_compiler
compiled_fn = compiler_fn(gm, self.example_inputs())
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/repro/after_dynamo.py", line 130, in __call__
compiled_gm = compiler_fn(gm, example_inputs)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/__init__.py", line 2235, in __call__
return compile_fx(model_, inputs_, config_patches=self.config)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1528, in compile_fx
return aot_autograd(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/backends/common.py", line 72, in __call__
cg = aot_module_simplified(gm, example_inputs, **self.kwargs)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 1071, in aot_module_simplified
compiled_fn = dispatch_and_compile()
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 1056, in dispatch_and_compile
compiled_fn, _ = create_aot_dispatcher_function(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 522, in create_aot_dispatcher_function
return _create_aot_dispatcher_function(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/aot_autograd.py", line 759, in _create_aot_dispatcher_function
compiled_fn, fw_metadata = compiler_fn(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 179, in aot_dispatch_base
compiled_fw = compiler(fw_module, updated_flat_args)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1357, in fw_compiler_base
return _fw_compiler_base(model, example_inputs, is_inference)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 1428, in _fw_compiler_base
return inner_compile(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 479, in compile_fx_inner
return wrap_compiler_debug(_compile_fx_inner, compiler_name="inductor")(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_dynamo/repro/after_aot.py", line 85, in debug_wrapper
inner_compiled_fn = compiler_fn(gm, example_inputs)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 665, in _compile_fx_inner
compiled_graph = FxGraphCache.load(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 1341, in load
compiled_graph = compile_fx_fn(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 574, in codegen_and_compile
compiled_graph = fx_codegen_and_compile(gm, example_inputs, **fx_kwargs)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/compile_fx.py", line 882, in fx_codegen_and_compile
compiled_fn = graph.compile_to_fn()
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1952, in compile_to_fn
return self.compile_to_module().call
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1878, in compile_to_module
return self._compile_to_module()
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/graph.py", line 1906, in _compile_to_module
mod = PyCodeCache.load_by_key_path(
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2866, in load_by_key_path
mod = _reload_python_module(key, path)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/runtime/compile_tasks.py", line 45, in _reload_python_module
exec(code, mod.__dict__, mod.__dict__)
File "/tmp/tmps59zkbew/kg/ckgkb4gt5fs5pll4o7fqawppsmdezu5h52cq6nmrvi3yy6j7ddq4.py", line 45, in <module>
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/async_compile.py", line 198, in triton
kernel = TritonCodeCache.load(kernel_name, source_code)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2916, in load
return _module_to_triton_kernel(PyCodeCache.load(source_code), kernel_name)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2853, in load
return cls.load_by_key_path(key, path, linemap, attrs)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/codecache.py", line 2866, in load_by_key_path
mod = _reload_python_module(key, path)
File "/opt/hostedtoolcache/Python/3.9.20/x64/lib/python3.9/site-packages/torch/_inductor/runtime/compile_tasks.py", line 39, in _reload_python_module
raise RuntimeError(
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
RuntimeError: Failed to import /tmp/tmps59zkbew/g3/cg3zgxsidsjhdlz2lzvajvubdq6kg2x2hzd2kznfj43qwvlv33du.py
SyntaxError: invalid syntax (cg3zgxsidsjhdlz2lzvajvubdq6kg2x2hzd2kznfj43qwvlv33du.py, line 14)
```
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136280
Approved by: https://github.com/etaf, https://github.com/jansel, https://github.com/EikanWang
Co-authored-by: Henry Schreiner <HenrySchreinerIII@gmail.com>
Fix the correctness issue of https://github.com/pytorch/ao/pull/884/. The current implementation for converting between `Half/BFloat16` and `int8/uint8` incorrectly assumes that 1/4 of the int8/uint8 vector lane maps to 1/2 of the Half/BFloat16 vector lane. This assumption leads to accuracy issues after the full bit-width vectorization of the Half data type was introduced. When converting between int8 weights and the half data type, the generated code is as follows:
```
#include "/tmp/torchinductor_leslie/xw/cxww3s7wxrujoyxna7mlcjktid2uu6nntixqwm542xfkd756gl3x.h"
extern "C" void kernel(const int8_t* in_ptr0,
half* out_ptr0)
{
{
for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(2048L); x0+=static_cast<int64_t>(32L))
{
auto tmp0 = at::vec::Vectorized<int8_t>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(32));
auto tmp1 = at::vec::convert<half>(tmp0);
tmp1.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(32));
}
}
}
```
In this PR, we address the issue by changing the implementation to convert 1/2 of the int8/uint8 vector lane into a full vector lane of Half/BFloat16.
**Test Plan**
* AO: `python test/integration/test_integration.py -k test_int8_weight_only_quant_subclass_api`
* `python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_convert_int8_to_half_vec`
* Due to the CPP backend legalization pass, we are unable to create a unit test to simulate the conversion from `Half` to `int8`. Instead, we rely on a C++ test case.
* `./build/bin/vec_test_all_types_AVX512 --gtest_filter="VecConvertTestsReducedFloat/*.ConvertReduced"`
* `./build/bin/vec_test_all_types_AVX2 --gtest_filter="VecConvertTestsReducedFloat/*.ConvertReduced"`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136353
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
Seems like some other tests are holding onto memory that is not gc'able (e.g., cuBLAS workspaces), so these tests, while working in isolation, fail when run as, e.g., `python test/test_cuda.py -k able`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136496
Approved by: https://github.com/ezyang
TL;DR: found that forward activation tensors were being kept alive "forever" (or until GC ran), and tracked it down to a cycle involving `stage_backward.<locals>.extract_tensors_with_grads`.
The reference cycle in question is below. (constructed using gc.get_referrers after doing a gc.collect in gc debug mode)
tensor is kept alive by
`[(<class 'cell'>, '0x7f7360234400')]`
tuple of cell objects
`(<cell at 0x7f73602343d0: function object at 0x7f734fff0ee0>, <cell at 0x7f7360234400: list object at 0x7f734e4d9a80>, <cell at 0x7f73602a4190: list object at 0x7f734eff8b00>)`
is kept alive by
`[(<class 'function'>, '0x7f734fff0ee0')]`
`<function stage_backward.<locals>.extract_tensors_with_grads at 0x7f734fff0ee0>`
is kept alive by
`[(<class 'cell'>, '0x7f73602343d0')]`
Put into more plain terms,
```
def stage_backward(...):
...
stage_output_tensors = []
# a cell object will exist that contains the variables defined in stage_backward and used by
# both stage_backward and nested functions
# in this case, the cell object contains 'stage_output_tensors' but
# this function object will hold a reference to a 'cell' that contains any vars from
# the parent scope not explicitly passed into the function as args.
def extract_tensors_with_grads(...):
...
# extract_tensors_with_grads refers to stage_output_tensors, so stage_output_tensors
# is in the cell
stage_output_tensors.append(output_val)
...
# but extract_tensors_with_grads ALSO refers to itself (extract_tensors_with_grads),
# so `extract_tensors_with_grads` will be in the cell
extract_tensors_with_grads(...)
```
More debug details:
https://docs.google.com/document/d/1QPH1Lz0tnieIFPM2tyHrjVB-bjlnHuDgjx1p2am3cmE/edit?usp=sharing
In pdb:
```
gc.collect()
g = gc.garbage
g[-1]
[rank0]:(Pdb) [rank0]:<function
stage_backward.<locals>.extract_tensors_with_grads at 0x7fee5c3392d0>
g[-2]
[rank0]:(Pdb) [rank0]:(<cell at 0x7fee7abbcf40: function object at
0x7fee5c3392d0>, <cell at 0x7fee7abbcf70: list object at
0x7fee7ab68940>, <cell at 0x7fee5c3210c0: list object at 0x7fee5e1
d6340>)
g[-3]
[rank0]:(Pdb) [rank0]:[tensor([[[-4.1127e-06, -3.3826e-06, 2.6226e-06,
..., 6.4969e-06,
[rank0]: -4.4405e-06, -4.7684e-06],
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136507
Approved by: https://github.com/awgu, https://github.com/kwen2501
Related: #132695
This PR uses padded dense <-> jagged conversions to handle binary pointwise broadcasting of (NT, T) and (T, NT). This includes:
* `(B, j0, D) + (1, 1, 1)`
* `(B, j0, D) + (B, 1, 1)`
* `(B, j0, D) + (B, 1, D)`
* etc.
This PR also adds (hacky) support for bool inputs to the jagged <-> padded dense conversions. The underlying CUDA kernels do not support integer / bool inputs; so the following workaround is employed: `convert input -> half, run conversion kernel, convert output -> bool`. Note that this bool support is needed specifically for the backward formula of `fmax`, and likely others.
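A usage sketch of the patterns listed above (illustrative shapes; assumes a CUDA device since the padded <-> jagged conversion kernels are CUDA):
```python
import torch

# (B, j0, D) jagged nested tensor with B=2, ragged j0, D=8
nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged, device="cuda"
)
t_scalar = torch.randn(1, 1, 1, device="cuda")  # (1, 1, 1)
t_batch = torch.randn(2, 1, 1, device="cuda")   # (B, 1, 1)
t_bd = torch.randn(2, 1, 8, device="cuda")      # (B, 1, D)

# Each of these now routes through the padded dense <-> jagged conversions.
out0 = nt + t_scalar
out1 = nt + t_batch
out2 = nt + t_bd
```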
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133021
Approved by: https://github.com/cpuhrsch
More or less literal copy-n-paste of c33b0580e6/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu (L24)
and
c33b0580e6/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu (L99)
Missing `uint8` implementation mimics CUDA behavior
Initial version coded live in https://www.youtube.com/watch?v=shi6Kb5xxvk
Later refinements:
- Switch from 2D dispatch to 1D one (to match CUDA behavior)
- Added batch + channel loops
- Fixed scale computation to match align corners behavior
- Added backward implementation
The backward implementation again mimics CUDA, so it has the same precision issues for `torch.half`, as well as a somewhat slow simulation of atomic adds using atomic compare-and-exchange on a pair of adjacent values, i.e.
```metal
template <typename T>
static inline void atomic_add_helper(
device atomic<int>* data,
long offset,
float value) {
auto ptr = data + (offset >> 1);
auto old = atomic_load_explicit(ptr, memory_order_relaxed);
union {
int i;
T t[2];
} val;
do {
val.i = old;
val.t[offset & 1] += static_cast<T>(value);
} while (!atomic_compare_exchange_weak_explicit(
ptr, &old, val.i, memory_order_relaxed, memory_order_relaxed));
}
```
Bump basic Metal language version to 3.0, as it's supported on MacOS13 and that's the first version that has `atomic_float`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136123
Approved by: https://github.com/albanD
Summary: Now that we have subprocess parallel compile on by default, we can change the internal compile_threads default to > 1 with a killswitch. Some jankiness so we can avoid evaluating the justknob at import.
Test Plan: Ran codecache tests with JK on, then canaried locally with JK off
Differential Revision: D62913998
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136246
Approved by: https://github.com/eellison
- Set the new tolerances to ~= N * eps(bfloat16), which should be a comfortable upper bound for tolerances, where N is the inner dimension of the matmul.
Logic behind choice of tolerance:
The maximum error of the summation of a series of N numbers in bfloat16 should be `N * epsilon(bfloat16)`; I confirmed by sampling different random seeds that the maximum observed error doesn't exceed this value and is usually much less.
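A rough justification (the standard forward-error bound for recursive summation):

$$\left|\,\mathrm{fl}\!\Big(\sum_{i=1}^{N} x_i\Big) - \sum_{i=1}^{N} x_i\,\right| \;\le\; (N-1)\,\varepsilon \sum_{i=1}^{N} |x_i| + O(\varepsilon^2)$$

When all summands have the same sign, $\sum_i |x_i| \approx |\sum_i x_i|$, so for results of order one the absolute error is at most roughly `N * eps(bfloat16)`.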
Fixes test failures on Arm® Neoverse™ V1 (not raised as an issue, as this hardware type is not currently covered by the linux-aarch64 workflow):
```
Traceback (most recent call last):
File "/var/lib/jenkins/workspace/test/test_torch.py", line 2478, in test_cdist_large
self.assertEqual(expected, actual)
File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3885, in assertEqual
raise error_metas.pop()[0].to_error(
AssertionError: Tensor-likes are not close!
Mismatched elements: 134118 / 1000000 (13.4%)
Greatest absolute difference: 0.03829193115234375 at index (291, 726) (up to 0.005 allowed)
Greatest relative difference: 0.03519868478178978 at index (291, 726) (up to 1.3e-06 allowed)
```
@malfet @jondea
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136315
Approved by: https://github.com/albanD
Summary:
- Added TORCH_LOGS=cache to dump cache stats on exit - supported by RemoteCache.
- Split REMOTE_CACHE_VERSION - it was used for both JKs fx_graph_memcache_version and autotune_memcache_version but they really should be separate (just in case we need to change one but not the other)
- Prepare `_ManifoldCache` for use with other subpath keys
- Move create_cache to be more public and use it in codecache
- Add _InductorMetaTy alias (still just a dict)
- Cleaned up some common cached_autotune calls in triton_heuristics
Test Plan: unit tests
Reviewed By: oulgen
Differential Revision: D62648249
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136456
Approved by: https://github.com/oulgen
Original issue:
https://github.com/pytorch/ao/issues/890
The problem:
TracingContext.flat_params contains the original params, with subclasses not desugared, while the inductor freezing API works on AOT graphs, where subclasses are already desugared.
flat_params is used only for this logic, and storing desugared subclasses in it fixes the issue.
Testing:
```
python test/functorch/test_aotdispatch.py -k test_inductor_freezing_with_subclasses
```
Torch AO original failure:
```
python test/integration/test_integration.py -k test_int8_weight_only_quant_with_freeze
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136265
Approved by: https://github.com/bdhirsh
## Description
Fixes the accuracy failure of FP32 `jx_nest_base` of max-autotune.
The current epilogue fusion implementation in GEMM template assumes that the read of template buffer and the write of epilogue output in the epilogue node have the same index (the layout could be different but the index should be the same).
If the condition is not satisfied, the computation is wrong, leading to correctness issue for FP32 `jx_nest_base`.
This PR disables epilogue fusion with the GEMM template when the above condition is not satisfied.
### Unsupported epilogue:
`buf1` is the template buffer and `buf2` is the epilogue output buffer.
The store of `buf2`:
401408 * d0 + 100352 * d1 + **7168 * d2** + **1792 * d3** + 128 * d4 + d5
The load of `buf1` in the epilogue node:
401408 * d0 + 100352 * d1 + **1792 * d2** + **25088 * d3** + 128 * d4 + d5
The above two indexes are different.
```
CppTemplateBuffer(name='buf1', layout=FixedLayout('cpu', torch.float32, size=[25088, 128], stride=[128, 1]))
ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.float32, size=[8, 4, 14, 4, 14, 128], stride=[401408, 100352, 7168, 1792, 128, 1]), data=Pointwise(
'cpu',
torch.float32,
def inner_fn(index):
i0, i1, i2, i3, i4, i5 = index
tmp0 = ops.load(arg5_1, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0)
tmp1 = ops.load(buf0, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0)
tmp2 = tmp0 + tmp1
tmp3 = ops.load(buf1, i5 + 128 * i4 + 1792 * i2 + 25088 * i3 + 100352 * i1 + 401408 * i0)
tmp4 = tmp2 + tmp3
return tmp4
,
ranges=[8, 4, 14, 4, 14, 128],
origin_node=clone,
origins=OrderedSet([clone])
))
```
### Supported epilogue:
`buf1` is the template buffer and `buf2` is the epilogue output buffer.
The store of `buf2`:
d0 + 576 * d1 + 32 * d2
The load of `buf1` in the epilogue node:
d0 + 576 * d1 + 32 * d2
The above two indexes are the same.
The layout of `buf2` and `buf1` are different though which is handled by the reindexer:
`buf1`: `size=[324, 32], stride=[32, 1]`
`buf2`: `size=[1, 32, 18, 18], stride=[10368, 1, 576, 32]`
```
CppTemplateBuffer(name='buf1', layout=FixedLayout('cpu', torch.bfloat16, size=[324, 32], stride=[32, 1]))
ComputedBuffer(name='buf2', layout=FixedLayout('cpu', torch.bfloat16, size=[1, 32, 18, 18], stride=[10368, 1, 576, 32]), data=Pointwise(
'cpu',
torch.bfloat16,
def inner_fn(index):
_, i1, i2, i3 = index
tmp0 = ops.load(buf1, i1 + 32 * i3 + 576 * i2)
tmp1 = ops.to_dtype(tmp0, torch.float32, src_dtype=torch.bfloat16)
tmp2 = ops.load(_frozen_param4, i1)
tmp3 = tmp1 * tmp2
tmp4 = ops.load(arg7_1, i1 + 32 * i3 + 576 * i2)
tmp5 = tmp3 + tmp4
tmp6 = ops.to_dtype(tmp5, torch.bfloat16, src_dtype=torch.float32)
return tmp6
,
ranges=[1, 32, 18, 18],
origin_node=convert_element_type_4,
origins=OrderedSet([add, mul, convert_element_type_4])
))
```
## TODO
Add the support for fusions when the indexes are different in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135661
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
At the moment, this lowers torch._scaled_mm with tensorwise scaling and with rowwise scaling for both A and B.
We probably also want to support either combination of tensorwise and rowwise for A and B, as well as bias support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136337
Approved by: https://github.com/chenyang78
Summary:
Add a customizable loss function callback to NodeAccuracySummary to
allow users to pass in their own loss function.
Also, fix some type errors and propagate better exception messages when
unexpected tensor comparisons occur. Finally, enhance the robustness of
`generate_numeric_debug_handle` in the case where it is called multiple
times on the same model, by avoiding reuse of the same IDs.
Test Plan: Added a test for this case in `test_numeric_debugger`.
Reviewed By: jerryzh168
Differential Revision: D62898297
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136282
Approved by: https://github.com/jerryzh168
This PR changes how `reconstruct` is done for a ConstDict. As of today, it works as follows:
(1) codegen(...) each pair of key/value
(2) create a new dictionary to hold the new items
(3) clear the original dictionary
(4) update the original dict with the one created in (2)
We do a micro-optimization in the generated bytecode to (see the sketch after this list):
- Only codegen the items that changed.
- Only clear the original dictionary if a key was removed.
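A sketch of the equivalent logic in plain Python (the actual change emits bytecode; the names here are illustrative):
```python
def reconstruct_const_dict(original, traced, changed_keys, deleted_keys):
    if deleted_keys:
        # A key was removed: clear and rebuild the whole dict.
        original.clear()
        original.update(traced)
    else:
        # No deletions: only re-assign the items that actually changed.
        for k in changed_keys:
            original[k] = traced[k]
```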
Fixes: #133487
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134876
Approved by: https://github.com/zou3519
1. We want to take option 3 as discussed in https://github.com/pytorch/pytorch/issues/135712, so every time we retry, we create a new TCPStore server first so that we don't need to append the attempt count as a prefix and can avoid eventual TCPStore sync failures. (This is only for the TCPStore-sharing-enabled case.)
2. We start a new server bound to an ephemeral port (i.e. 0) so it gets assigned a free port, and then pass that downstream (trainer or c10d). By doing so, TCPStore is managed by the elastic agent rather than having a race condition on binding to a specific port in the trainer (see the sketch after this list).
3. The port is then broadcast for dynamic_rendezvous.
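A minimal sketch of the ephemeral-port flow (illustrative only; assumes the port-0 binding described above is available):
```python
import torch.distributed as dist

# Bind the server to port 0 so the OS assigns a free port; the elastic agent owns it.
server = dist.TCPStore("localhost", 0, is_master=True, wait_for_workers=False)
port = server.port  # the ephemeral port actually chosen

# The port is then handed downstream (trainer / c10d / rendezvous), where clients do:
client = dist.TCPStore("localhost", port, is_master=False)
```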
Only one more question: what do we do about the store created from `_create_tcp_store` in torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py? Are we OK with creating a duplicate TCPStore server?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135957
Approved by: https://github.com/d4l3k, https://github.com/c-p-i-o
Fixes #93843
`EmbeddingBag()` / `embedding_bag()` support 1D inputs with offsets to handle raggedness. NJT is a natural fit here as it already maintains offsets of the same form. This PR updates the python-side to support NJT and adds corresponding OpInfo-based NJT tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135888
Approved by: https://github.com/cpuhrsch
Summary:
After the previous refactor, we can now call load_with_key directly from AOTAutogradCache to use the remote FXGraphCache.
This does *not* implement a remote AOTAutogradCache. It just allows AOTAutogradCache to work with remote FXGraphCache.
Test Plan: (Meta only tests)
Reviewed By: aorenste
Differential Revision: D62384944
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136173
Approved by: https://github.com/oulgen
- Sometimes having access to the `MixedPrecisionPolicy` in the `fsdp_pre_all_gather` is useful. See [here](https://github.com/pytorch/ao/pull/748/files#r1760375325) in the torchao INT8 mixed precision training PR.
- Sometimes having access to the owning `nn.Module` allows for using it for saving state. See [here](https://github.com/pytorch/pytorch/issues/114299#issuecomment-2298692762) for an example.
The major pain point here is how to deal with backward compatibility. For now, we use `inspect.signature` to check whether the user subclass follows the old vs. new signature. However, for the new signature, the `param_dtype` in the post-all-gather is redundant: if the user needed it, they could save it from the `mp_policy` passed into the pre-all-gather now.
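A sketch of that backward-compatibility check (the parameter names are assumptions for illustration):
```python
import inspect

def uses_new_pre_all_gather_signature(fsdp_pre_all_gather) -> bool:
    params = inspect.signature(fsdp_pre_all_gather).parameters
    # Assumed convention: the new signature additionally accepts the
    # MixedPrecisionPolicy ("mp_policy") and the owning nn.Module ("module").
    return "mp_policy" in params and "module" in params
```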
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136129
Approved by: https://github.com/weifengpy
Summary: Create another wrapper codegen class to handle ArrayRef for CPU. The goal is to simplify the regular cpp wrapper codegen logic and the generated cpp code.
Test Plan: CI
Differential Revision: D62961885
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136318
Approved by: https://github.com/frank-wei
Improves and enables a commented out test originally introduced in #131912
In `test_custom_tag_metadata_re_export()`, we check the added "custom" metadata to given nodes is preserved and not copied to other nodes after re-exporting
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136048
Approved by: https://github.com/zhxchen17
**Summary**
Fix a circular import in `torch/distributed/utils.py` found when running an internal test; see D62901023. Curious why this wasn't causing any issues. Is the relevant code deprecated and no longer used?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136286
Approved by: https://github.com/Skylion007
Fixes #131337
- add `arg_type` for workspace_arg, the type is consistent with the type in `generate_workspace_allocation()`.
- do not generate example tensors for `workspace`, and use `generate_workspace_allocation()` instead.
- add workspace allocation generation code to `kernel_autotune_calls`. e.g.
```python
workspace = empty_strided_cuda((1280, ), (1, ), torch.uint8)
workspace.zero_()
.....
triton_spl_fused_add_cumprod_0.run(buf2, arg0_1, arg1_1, workspace, 1, 10000, grid=split_scan_grid(1, 10000), stream=stream0)
del buf2, arg0_1, arg1_1, workspace
```
- add `empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda` to the header of triton autotune code.
The generated cpp has lines like the ones below, so we also implement a `zero_()` for `AtenTensorHandle`.
```cpp
static constexpr int64_t int_array_0[] = {1280L, };
static constexpr int64_t int_array_1[] = {1L, };
AtenTensorHandle workspace_handle;
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_uint8, cached_torch_device_type_cuda, 0, &workspace_handle));
RAIIAtenTensorHandle workspace(workspace_handle);
workspace.zero_();
```
- Fix grid_fn handling for grid computation: pass "RBLOCK" to `split_scan_grid`.
- Fix dynamic shapes:
Without the fix, during triton autotuning we generate code like `workspace = empty_strided_cuda((32*((255 + s0) // 256), ), (1, ), torch.uint8)`, where `s0` is not defined.
The solution is to use `V.graph.sizevars.size_hint(nbytes)` to realize the workspace size for triton autotuning. Note that we only realize it for the triton autotune code, not for the cpp cuda code.
- We also generate slightly different cpp code depending on if `abi_compatible` is turned on.
```cpp
RAIIAtenTensorHandle workspace(workspace_handle);
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_zero_(workspace.get()));
```
vs
```cpp
at::Tensor workspace = at::detail::empty_strided_cuda({8L*(c10::div_floor_integer(static_cast<int64_t>((255L + s0)), static_cast<int64_t>(256L))), }, {1L, }, at::kByte, c10::DeviceType::CUDA);
workspace.zero_();
```
Test Plan:
```
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
python test/inductor/test_cuda_cpp_wrapper.py DynamicShapesCudaWrapperCudaTests.test_consecutive_split_cumprod_cuda_dynamic_shapes_cuda_wrapper
TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135552
Approved by: https://github.com/desertfire
Summary:
- Clean up cache test code a bit.
- Removed patch_fbcode() - it turned out to cause flaky issues (imagine if it set fbcode=False and then loaded, for the first time, a module that had a top-level fbcode check).
Test Plan: unit tests
Reviewed By: oulgen
Differential Revision: D62648248
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136215
Approved by: https://github.com/bobrenjc93
**Motivations**:
A topological order of the scheduler nodes that optimizes the liveness of buffers can reduce peak memory utilization. This has been observed and studied, e.g., [here](https://arxiv.org/pdf/1910.02653) and [here](https://proceedings.mlr.press/v202/steiner23a/steiner23a.pdf).
**Solutions**:
1. implement a peak memory estimator via liveness analysis (a sketch follows this list)
2. implement a few memory-aware topological sorting algorithms and pick the one with the lowest peak memory
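A minimal sketch of item 1, under the assumption that each node exposes its input/output buffers and that a buffer dies right after its last consumer runs (the names here are illustrative, not the scheduler's actual API):
```python
def estimate_peak_memory(order, buf_sizes, last_use):
    """order: nodes in topological order; buf_sizes: bytes per buffer name;
    last_use: index in `order` of each buffer's final consumer."""
    live = peak = 0
    for i, node in enumerate(order):
        for buf in node.outputs:          # a buffer becomes live when produced
            live += buf_sizes[buf]
        peak = max(peak, live)
        for buf in node.inputs:           # a buffer dies after its last consumer
            if last_use[buf] == i:
                live -= buf_sizes[buf]
    return peak
```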
**Results**:
On some models we can reduce the peak memory significantly:
| model | batch size | peak_memory baseline | peak_memory new | ratio |
|:-----------------------------:|:----------:|:--------------------:|:---------------:|:-----:|
| alexnet | 128 | 1.17 | 0.99 | 1.19 |
| vgg16 | 64 | 4.10 | 3.57 | 1.15 |
| DebertaV2ForQuestionAnswering | 1 | 11.60 | 10.56 | 1.10 |
In the presence of compiler based AC, peak memory can be further reduced:
| model | batch size | peak_memory baseline | peak_memory new | ratio |
|:------------------------------:|:----------:|:--------------------:|:---------------:|:-----:|
| AlbertForMaskedLM | 4 | 6.87 | 6.43 | 1.07 |
| AlbertForQuestionAnswering | 4 | 8.69 | 7.76 | 1.12 |
| MobileBertForQuestionAnswering | 128 | 4.67 | 3.90 | 1.20 |
[Here](https://fb.workplace.com/groups/1075192433118967/posts/1499920537312819/?comment_id=1499938843977655&reply_comment_id=1499951630643043) is an internal use case.
**Other infos:**
* neutral model runtime, because the reordering happens after fusion, so the memory saving is _for free_.
* minimal compile time overhead, as the algorithm is linear in the number of edges of the inductor graph. For all huggingface benchmark models, the additional compile time is less than 1 second.
* no peak memory regression, since we only adopt a new order if the estimator says the peak memory is reduced. The estimator is unaware of operators' working memory, but for large models the working memory should be negligible. We haven't observed any significant regressions in any of our tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134874
Approved by: https://github.com/yf225
Fixes#134848
For BF16/FP16, when a tensor is passed via the `out` parameter of mean, the mean kernel should write the output into its storage. That doesn't happen: an `at::to` in the current code allocates storage again, but the `out` tensor's storage doesn't get updated, so it ends up not holding the mean output.
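A minimal repro sketch of the affected path (pre-fix behavior; with the fix, `out` holds the row means):
```python
import torch

x = torch.randn(4, 4, dtype=torch.bfloat16)
out = torch.empty(4, dtype=torch.bfloat16)
torch.mean(x, dim=1, out=out)
# Before this fix, the kernel wrote into a freshly allocated tensor created by
# an internal at::to(), so `out` was left without the mean result.
print(out)
```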
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135174
Approved by: https://github.com/soulitzer
Avoid allocating memory or dry-running the submodule during stage init.
Save user-provided input/output metadata during stage init, to allow
lazily initializing the buffers before the first step call.
Later, we plan to build on top of this to add lazy shape inference
(#130856) so that no input/output shapes are required at stage init.
For now, we require input/output tensors for stage init, but these
should be on meta device and stage should not allocate any real memory.
Note: this needs more thorough testing and review, but it worked on the
torchtitan 3d test.
TODO:
- delete 'device' arg from PipelineStage ctor? (infer it from the arg tensors passed to the first step call?) Separate PR.
- delete 'output_args' from PipelineStage ctor? we don't actually need
it, but we use it to do shape validation, which is why I didn't remove
it in this PR. Proposal: leave it until we add lazy shape inference?
Fixes #136225, #136226
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136243
Approved by: https://github.com/H-Huang, https://github.com/kwen2501
Summary: Internal profiler behaves differently after turning on triton.autotune_at_compile_time. Needs more investigation but turning it off for this test for now.
Reviewed By: henrylhtsang
Differential Revision: D63035855
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136356
Approved by: https://github.com/henrylhtsang
Summary:
Return from functions instead of using `skipTest`.
This is mostly to make our test report happier.
Skipped tests still show up in our Broken test report.
```
OK (skipped=1)
I0917 16:14:24.749060 1018907 StorageDemandControl.cpp:572] Flushing Demand Control ODS counters
Skipped: Store doesn't support extended APIs
```
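A minimal sketch of the pattern (the capability check is a hypothetical helper, not the real store API):
```python
def store_supports_extended_apis() -> bool:  # hypothetical capability check
    return False

class StoreTest:
    def test_extended_api(self):
        if not store_supports_extended_apis():
            return  # previously: self.skipTest("Store doesn't support extended APIs")
        # ... exercise the extended APIs ...
```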
Test Plan:
Tested locally.
Test shows up as passed instead of skipped.
```
Cache hits: 99%. Commands: 125048 (cached: 124961, remote: 10, local: 77)
Tests finished: Pass 1. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Differential Revision: D62912065
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136244
Approved by: https://github.com/XilunWu
Original issue:
https://github.com/pytorch/ao/issues/890
The problem:
TracingContext.flat_params contains the original params, with subclasses not yet desugared.
Meanwhile, the inductor freezing API works on AOT graphs, where subclasses have already been desugared.
flat_params are used only for this logic, and storing the desugared subclasses in them fixes the issue.
Testing:
```
python test/functorch/test_aotdispatch.py -k test_inductor_freezing_with_subclasses
```
Torch AO original failure:
```
python test/integration/test_integration.py -k test_int8_weight_only_quant_with_freeze
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136265
Approved by: https://github.com/bdhirsh
Summary:
Add a customizable loss function callback to NodeAccuracySummary to
allow users to pass in their own loss function.
Also, fix some type errors and propagate better exception messages when
unexpected tensor comparisons occur. Finally, enhance the robustness of
`generate_numeric_debug_handle` in the case where it is called multiple
times on the same model, by avoiding reuse of the same IDs.
Test Plan: Added a test for this case in `test_numeric_debugger`.
Reviewed By: jerryzh168
Differential Revision: D62898297
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136282
Approved by: https://github.com/jerryzh168
Fixes https://github.com/pytorch/pytorch/issues/132331
We need another barrier here to ensure that the main thread doesn't stop the profiler while other threads are still using it (and crash). I can reliably reproduce the issue with `pytest -v test/profiler/test_cpp_thread.py -k test_profile_memory --flake-finder`.
### Testing
`pytest -v test/profiler/test_cpp_thread.py --flake-finder` all passes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136304
Approved by: https://github.com/briancoutinho
Summary: The change involves passing the expired timers to the log_debug_info_for_expired_timers function after to_json() has been applied. This change is made to provide a better debugging experience for the user.
Test Plan: unit tests
Reviewed By: gag1jain
Differential Revision: D62408767
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135913
Approved by: https://github.com/gag1jain
Summary:
This logs all operations when the tracing log level is enabled for the `TCPStoreLibUvBackend`. This is very useful for debugging collective operations when issues occur, as it logs all hosts and the keys that they're modifying. To minimize total data we only log the keys and not the values.
This change also makes the C10D_* macros much more efficient -- previously we would always format the log string even if it would never be printed, which is very wasteful for detailed tracing. The macros are now gated with an if statement, achieving the same behavior with no overhead.
Test Plan:
```
TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c "echo foo"
```
```
I0919 09:26:52.352013 34271 TCPStore.cpp:285] [c10d - debug] The server has started on port = 29500.
I0919 09:26:52.352246 34271 socket.cpp:783] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29500).
I0919 09:26:52.352241 36903 TCPStoreLibUvBackend.cpp:1173] [c10d - debug] Uv main loop running
I0919 09:26:52.352308 34271 socket.cpp:854] [c10d - trace] The client socket is attempting to connect to [localhost]:29500.
I0919 09:26:52.353633 34271 socket.cpp:945] [c10d] The client socket has connected to [localhost]:29500 on SocketImpl(fd=41, addr=[localhost]:45646, remote=[localhost]:29500).
I0919 09:26:52.354422 34271 TCPStore.cpp:321] [c10d - debug] TCP client connected to host 127.0.0.1:29500
I0919 09:26:52.354558 36903 TCPStoreLibUvBackend.cpp:774] [c10d - trace] validate magic:1015412686 address:[localhost]:45646
I0919 09:26:52.354638 36903 TCPStoreLibUvBackend.cpp:789] [c10d - trace] ping nonce:34271 address:[localhost]:45646
I0919 09:26:52.356122 36903 TCPStoreLibUvBackend.cpp:866] [c10d - trace] add key:init/ val:1 address:[localhost]:45646
I0919 09:26:52.356308 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646
I0919 09:26:52.356410 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:init/ address:[localhost]:45646
I0919 09:26:52.358688 36903 TCPStoreLibUvBackend.cpp:808] [c10d - trace] set key:/none/torchelastic/role_info/0 address:[localhost]:45646
I0919 09:26:52.360177 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646
I0919 09:26:52.360296 36903 TCPStoreLibUvBackend.cpp:1004] [c10d - trace] multi_get key_count:1 address:[localhost]:45646
I0919 09:26:52.362076 36903 TCPStoreLibUvBackend.cpp:1036] [c10d - trace] multi_set key_count:1 address:[localhost]:45646
I0919 09:26:52.364001 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646
I0919 09:26:52.364091 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:/none/torchelastic/assigned_ranks/0 address:[localhost]:45646
```
Differential Revision: D62924454
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136320
Approved by: https://github.com/c-p-i-o, https://github.com/XilunWu
Summary:
Add a third mode where we only print kernel names without dumping any intermediate actual tensor value info.
It can be helpful in quickly identifying the troublesome kernels in CUDA IMA issues.
thanks ColinPeppler and henrylhtsang for this "feature request".
Test Plan:
The output can look like this if we set `AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3`:
{F1871629091}
Differential Revision: D62791371
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136182
Approved by: https://github.com/henrylhtsang
Summary:
X-link: https://github.com/pytorch/benchmark/pull/2454
This adds structured logging overhead at a per compile basis to compilation metrics.
To do so, we track the frame_id_frame_compile_id that trace_structured uses to categorize compiles, and use that as the key in our timing table.
Implementation notes:
- If there are times we call trace_structured without a compile id, the time won't be measured. There's not really a good way around that today given the compile-id framework of compilation metrics. Strobelight is still the best way to measure on a per-job basis.
- We don't actually measure the time it takes to log the compilation metrics itself. Fundamentally, it's not possible to log this properly if we're storing the logging number *in* compilation metrics, since there's no way to measure it before we do it (unless we want discrepancies between dynamo_compile and tlparse, which seems suboptimal). Hopefully for a large job, the cost of structured-logging compilation metrics itself is small.
- I wanted to use frame_phase_timing here, but there's a bunch of ids to iron out, and I don't really want to deal with that headache. compilation_time_metrics is sort of what I want, but that isn't by frame/compile id, so it's also a bit off. Putting it into torch.logging as a separate thing so logging tracks its own overhead seems fine, though.
Test Plan:
Run benchmarks/nanogpt and staging logger. See that the new compilation metric is logged to the staged dynamo_compile table:
https://fburl.com/scuba/logger_staging_jjwu_30582a48f1ff9cf5f4ac50a4c40af/xazjg5xq
Note that the sum(structured_logging_overhead_s) / sum(entire_frame_compile_time) = 8.387 / 124.278 = 6%, which seems reasonable as the overhead for a small compilation like this.
You can also look at samples for a more detailed log of this.
Reviewed By: oulgen
Differential Revision: D62643611
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136142
Approved by: https://github.com/bobrenjc93
Summary:
To facilitate the PSS-2 upgrade, this uses `ndt.NDArray` instead of `np.ndarray` in type annotations. In Numpy-1.19 (PSS-1) it's an alias of `np.ndarray` -- a noop.
In Numpy-1.24, `ndt.NDArray` is a proper generic type, and without this change uses of `np.ndarray` generate this Pyre type error:
```counterexample
Invalid type parameters [24]: Generic type `np.ndarray` expects 2 type parameters.
```
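A sketch of the annotation change, assuming the `ndt` alias comes from `numpy.typing`:
```python
import numpy as np
import numpy.typing as ndt  # assumed alias for the `ndt` used above

def normalize(x: ndt.NDArray) -> ndt.NDArray:  # previously annotated as np.ndarray
    return x / np.linalg.norm(x)
```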
Test Plan: Sandcastle plus visual inspection
Differential Revision: D62977370
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136288
Approved by: https://github.com/kit1980
When tensor folding occurs during a matmul operation, the returned tensor is a view. This can cause issues when matmul is used inside a custom function and such a view is then returned as the output: it cannot be modified in place, which causes errors.
It can be especially problematic when an in-place allreduce is performed after such a function.
The issue is resolved by returning unsafe_view from matmul instead. This aligns the matmul decomposition with the eager implementation in that a non-view tensor is returned.
The test included in this PR reproduces the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134568
Approved by: https://github.com/zou3519
Fixes #127049
There's already a meta func in `meta_registrations.py` for the `add_` and `sub_` methods. I added a second meta function for error checking, i.e., `int.add_/sub_(float)` and `bool.add_/sub_(other types)`.
Also the corresponding test with Dynamo passes, removed `@xfailIfTorchDynamo`.
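For reference, the eager error case that the new meta function mirrors looks roughly like this:
```python
import torch

x = torch.tensor([1, 2, 3])   # int64 tensor
try:
    x.add_(0.5)               # in-place add of a float into an integral tensor
except RuntimeError as e:
    print(e)                  # result type can't be cast to the integral output type
```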
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135864
Approved by: https://github.com/williamwen42
Changes in this PR:
- Monkey-patching `F.scaled_dot_product_attention` with a lambda seems to not work in some cases. This PR avoids using a lambda (see the sketch after this list).
- Running `fullgraph=True` and `fullgraph=False` in the same unit test seems to cause the two cases to interfere with each other and causes errors. This PR splits them into two separate unit tests.
- The checks in the unit tests might not work with compile cache. This PR turns off the cache in order to have a more predictable compile behavior to do unit test on.
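A sketch of the first bullet's change, replacing the lambda wrapper with a named function (the bookkeeping inside is illustrative):
```python
import torch.nn.functional as F

_orig_sdpa = F.scaled_dot_product_attention
calls = []

# Instead of: F.scaled_dot_product_attention = lambda *a, **kw: _orig_sdpa(*a, **kw)
def _counting_sdpa(*args, **kwargs):
    calls.append(1)                      # record the call for the test
    return _orig_sdpa(*args, **kwargs)

F.scaled_dot_product_attention = _counting_sdpa
```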
Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor_fullgraph_True`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor_fullgraph_False`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_True`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_False`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136219
Approved by: https://github.com/yifuwang
Summary:
Quite a few times, we see the NCCL PG abort taking too long. There's no easy way to measure this, so let's add a counter to measure this across the stack.
This will help us measure how much time the NCCL abort takes.
Test Plan:
Unit tests
Reviewed By: c-p-i-o
Differential Revision: D62675010
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136067
Approved by: https://github.com/fduwjj
skip_if_rocm is used only in the multiprocess case (when the UT test class is a child of MultiProcessTestCase), where each individual process can exit with a skip code. If used for a single-process UT, it will cause the UT to fail because the process returns a non-zero exit code; use skipIfRocm in single-process UTs instead.
To avoid this confusion, this PR renames skip_if_rocm to skip_if_rocm_multiprocess.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136161
Approved by: https://github.com/jithunnair-amd, https://github.com/kwen2501, https://github.com/fegin
Continuation of https://github.com/pytorch/pytorch/pull/131909. This PR makes numpy tests compatible with numpy>=2.0.0. Specifically it deals with APIs that have been removed from numpy-2.0.
Changes in this PR:
1. Use `numpy.exceptions.ComplexWarning` if `numpy.exceptions` namespace is present. In numpy-2.0 `numpy.ComplexWarning` has been removed in favor of using `numpy.exceptions.ComplexWarning` (see [numpy-2.0 migration guide](https://numpy.org/devdocs/numpy_2_0_migration_guide.html#changes-to-namespaces)). Note that `numpy.exceptions` was introduced in numpy-1.25.0 hence does not exist in numpy<=1.24.x.
2. Do the same for `numpy.exceptions.VisibleDeprecationWarning` (see the compatibility sketch after this list)
3. Use `np.sort(...,axis=0)` over `np.msort()`(`np.msort()` removed in numpy-2.0)
4. Use `np.pad()` over `np.lib.pad()` (`np.lib` removed in numpy-2.0)
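A sketch of the compatibility pattern behind items 1 and 2 (not the exact helper used in the tests):
```python
import numpy as np

try:  # numpy >= 1.25 exposes these under numpy.exceptions
    from numpy.exceptions import ComplexWarning, VisibleDeprecationWarning
except ImportError:  # numpy <= 1.24.x
    ComplexWarning = np.ComplexWarning
    VisibleDeprecationWarning = np.VisibleDeprecationWarning
```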
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136152
Approved by: https://github.com/atalman
Summary:
Remove sleep from the `watchdogHandler` function. This sleep unnecessary slows things down during a NCCL timeout.
Flight recorder is configured to take a minute, at most, to dump out its buffer.
This sleep ends up waiting for `8` minutes before destroy is called.
Test Plan: Unit tests.
Differential Revision: D62529875
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135760
Approved by: https://github.com/fduwjj, https://github.com/shuqiangzhang
Summary: Currently we process events in the regular allocation path, calling cudaEventQuery to check on the events, and this path can take some locks in the libcuda driver. It's not strictly necessary to process events in the allocation path: we could move this to a background thread, keep processing events regularly, and put the freed blocks on the free list.
Differential Revision: D62396585
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135524
Approved by: https://github.com/zyan0
Summary:
This diff adds an option to round the non-split blocks in caching allocator so that they can be reused without causing lots of fragmentation for large memory segments.
For example, if we specify the max_split memory size as 400MB, then all allocations larger than 400MB will not be split. Say we allocated some 1024MB blocks and these are cached in the allocator. If we request a new 500MB block, we round it to the nearest power-of-two division, that's 512MB, and add the default kLargeBuffer of 20MB, which gives 532MB. Since 532MB is less than the existing 1024MB block, the 1024MB block will not be used for this allocation; instead, a new 512MB block will be created. In this diff, we make the rounding buffer configurable and expose it as an option (max_non_split_rounding_size): if 512MB + max_non_split_rounding_size is greater than 1024MB, we will use the 1024MB block and won't create a new 512MB block with cudaMalloc. This option lets us pre-allocate some large blocks so that we can reuse them as much as possible and don't stall on calling cudaMalloc.
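A plain-Python sketch of the reuse decision described above (the allocator itself is C++; the names and units here are illustrative):
```python
def can_reuse_cached_block(request_mb, cached_block_mb, rounding_mb=20):
    # Round the request up to the next power of two, e.g. 500MB -> 512MB.
    rounded = 1
    while rounded < request_mb:
        rounded *= 2
    # Reuse the cached block if the rounded request plus the configurable
    # rounding allowance exceeds its size; otherwise cudaMalloc a new block.
    return rounded + rounding_mb > cached_block_mb

print(can_reuse_cached_block(500, 1024))                   # False -> new 512MB block
print(can_reuse_cached_block(500, 1024, rounding_mb=600))  # True  -> reuse the 1024MB block
```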
Differential Revision: D62758758
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136174
Approved by: https://github.com/zyan0
Summary:
# context
* for the root cause and background please refer to this [post](https://fb.workplace.com/groups/1028545332188949/permalink/1042204770823005/)
* the basic idea of this diff is to **short circuit the pytree flatten-unflatten function pairs** between two preserved modules, i.e., EBC/fpEBC and KTRegroupAsDict.
NOTE: There could be multiple EBCs and one single KTRegroupAsDict as shown in the [pic](https://fburl.com/gslide/lcyt8eh3) {F1864810545}
* short-circuiting the EBC-KTRegroupAsDict pairs is a special case and a must in most cases due to the EBC key-order issue with distributed table lookup.
* hide all the operations behind a control flag `short_circuit_pytree_ebc_regroup` to the torchrec main api call `decapsulate_ir_modules`, which should only be visible to the infra layer, not to the users.
# details
* The `_short_circuit_pytree_ebc_regroup` function finds all the EBC/fpEBC and KTRegroupAsDict modules in an unflattened module, retrieves their FQNs, and sorts them into in_fqns (regroup_fqns) and out_fqns (ebc_fqns). Because the fpEBC is currently swapped as a whole, we do some extra FQN logic to filter out the EBCs that belong to an enclosing fpEBC.
* a util function `prune_pytree_flatten_unflatten` removes the in-coming and out-going pytree flatten/unflatten function calls in the graph module, based on the given fqns.
WARNING: The flag `short_circuit_pytree_ebc_regroup` should be turned on if EBCs are used and EBC sharding is needed. Assertions are also added for the cases where a `KTRegroupAsDict` module can't be found or `finalize_interpreter_modules` is not `True`.
# additional changes
* absorb the `finalize_interpreter_modules` process inside the torchrec main api `decapsulate_ir_modules`.
* set `graph.owning_module` in export.unflatten as required by the graph modification
* add one more layer of `sparse_module` for closely mimicking the APF model structure.
Test Plan:
# run test
* serializer
```
buck2 run fbcode//mode/opt fbcode//torchrec/ir/tests:test_serializer
```
* apf
```
buck2 run fbcode//mode/opt fbcode//aps_models/ads/gmp/tests/ne/e2e_deterministic_tests:gmp_e2e_ne_tests -- --filter-text 'test_mtml_instagram_model_562438350_single_gpu_with_ir'
```
* local mp run
```
==== Finished E2E deterministic test for mtml_instagram_model_gmp_474023725_non_kjt_unary ====
finished
test_mtml_instagram_model_562438350_single_gpu_with_ir
Imports took: 6.0s! Profile with --import-profiler. --_ |""---__
Executed 1 example in 203.1s: |'.| || . """|
Successful: 1 | || || /|\""-. |
Failed: 0 | || || | | |
Skipped: 0 | || || | \|/ |
Not executed: 8 |."| || --"" '__|
https://testslide.readthedocs.io/ --" |__---"""
```
Differential Revision: D62606738
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136045
Approved by: https://github.com/angelayi
Currently when we deserialize inputs to nodes, we deserialize arguments with default values as kwargs. So deserializing `aten.uniform`, which has the signature `uniform(Tensor(a!) self, float from=0, float to=1, *, Generator? generator=None) -> Tensor(a!)`, will become `uniform(x, from=0, to=1)`. However, this fails when running in python because `from` is a python keyword. So the solution here is to not deserialize it as a kwarg.
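A small illustration of the failure mode: `from` is a Python keyword, so generated code that passes it by name cannot even be parsed, while the positional form is fine:
```python
try:
    compile("op(x, from=0, to=1)", "<deserialized>", "eval")
except SyntaxError as e:
    print("kwarg form fails to parse:", e)

compile("op(x, 0, 1)", "<deserialized>", "eval")  # positional form parses fine
```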
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136036
Approved by: https://github.com/zhxchen17
`rms_norm()` is a nice-to-have for ViT :)
This PR:
* SymInt-ifies `rms_norm()`, allowing NJT to use the same decomp (see the usage sketch after this list).
* Adds torch_function-based input validation logic for nested-specific stuff (no normalization supported over the ragged dim for now) on the python NJT side.
* Adds multi-dim support (on non-ragged, non-batch dims) to `mean()` for NJT.
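A usage sketch of the first bullet (assumes a build with this PR; normalization is over the last, non-ragged dim):
```python
import torch
import torch.nn.functional as F

nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)
out = F.rms_norm(nt, normalized_shape=(8,))  # normalizing over the ragged dim is not supported
```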
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135872
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #125947
The previous implementation of the `numpy()` method returned `fp64` when the tensor was `fp32`. This is unexpected, but seems to be caused by calling `__array__(dtype=None)` on the numpy array. I updated the code to implement the `numpy()` method explicitly and added tests to guard the behavior.
This needs to be cherry-picked into torch 2.5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136162
Approved by: https://github.com/gramalingam, https://github.com/xadupre
When stub files (`*.pyi`) were removed from `optim` (#125556, #125452), some types that existed are no longer available. This pull request adds them back.
Just for reference, these types are used in `pytorch-lightning`'s `LightningCLI`. Command line interfaces are created automatically, and having type hints makes them nicer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136185
Approved by: https://github.com/janeyx99
## Motivation
The device reported by tensor.device, for both sharded and non-sharded tensors, is set to cuda by default. Hence, while checking the FSDP UTs we see the following errors. This change updates the actual device type based on the created tensor.
```
[rank3] File "/root/repos/pytorch-training-tests/tests/pytorch/v2.4.0/distributed_hpu/fsdp/test_fsdp_dtensor_state_dict.py", line 143, in test_dtensor_sharded_tensor_state_dict_identical
[rank3] sharded_tensor_sd = ref_model.state_dict()
[rank3] File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1944, in state_dict
[rank3] hook_result = hook(self, destination, prefix, local_metadata)
[rank3] File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank3] return func(*args, **kwargs)
[rank3] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_state_dict_utils.py", line 752, in _post_state_dict_hook
[rank3] tensor.device,
[rank3] File "/usr/local/lib/python3.10/dist-packages/typing_extensions.py", line 2853, in wrapper
[rank3] return arg(*args, **kwargs)
[rank3] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1152, in __torch_function__
[rank3] return dispatch(st_instance, func)
[rank3] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1134, in dispatch
[rank3] return _SHARDED_OPS[func](types, args, kwargs, st._process_group)
[rank3] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/op_registry_utils.py", line 33, in wrapper
[rank3] return wrapped_func(types, args, kwargs, process_group)
[rank3] File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/_ops/tensor_ops.py", line 52, in tensor_device
[rank3] dev = torch.device(torch.cuda.current_device())
[rank3] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 878, in current_device
[rank3] _lazy_init()
[rank3] File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 305, in _lazy_init
[rank3] raise AssertionError("Torch not compiled with CUDA enabled")
[rank3] AssertionError: Torch not compiled with CUDA enabled
````
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134994
Approved by: https://github.com/fegin
Fixes https://github.com/pytorch/pytorch/issues/136064
In the linked repro, this issue was that there was some code like this:
```
# x has dtype torch.float32
def f(x):
    y = x.view(torch.float32)
    y.copy_(...)
```
Where because `view.dtype` is implemented today to potentially directly return its input, we would end up directly clobbering the proxy for our graph input (replacing its FX proxy value from `arg0_1` to `view_1`). This is not desirable, because we have careful assertions in AOTDispatcher that mutations only ever happen on graph inputs - but this clobbering caused the mutation to appear, from the perspective of the FX graph, like it was happening on a view of the input.
Why is this normally not a problem? Ordinarily, the `ADInplaceOrView` kernel for `view.dtype` will take the output of the view kernel, [and detach() it](https://github.com/pytorch/pytorch/blob/main/tools/autograd/gen_inplace_or_view_type.py#L466) (properly creating a fresh `TensorImpl`).
This does **not** happen, though, if you are executing the kernel from within a `__torch_dispatch__` region: the `ADInplaceOrView` logic has already run above you, so that key will be in the TLS exclude set.
This PR changes eager behavior - at first I considered trying to only change behavior under compile. But this problem isn't technically specific to PT2: if you ever rely on tensor identity from inside of a __torch_dispatch__ call, then we need to make sure the raw `view.dtype` kernel doesn't directly return the input.
I am also making the assumption that "`view.dtype` no-op'ing when the dtype is the same" is not a case worth optimizing in eager mode, and that the overhead of the `TensorImpl` creation is relatively negligible.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136074
Approved by: https://github.com/Skylion007, https://github.com/ezyang, https://github.com/albanD
ghstack dependencies: #136041
As in the title.
Tackles https://github.com/pytorch/ao/pull/821/files#r1759821413
The PR assumes that the existing tuning parameters are good also when using scaling arguments. This needs to be verified as a follow-up task.
Also, this PR redefines triton-contiguous tensors: the tensor must have strides not larger than 1. This will now allow zero strides that previously triggered `contiguous` call although the underlying memory buffer was contiguous.
Re: "a considerable slow-down occurs because tensor data is copied element-wise rather than chunk-wise" - this note should refer to a code (torch or triton?) that implements the element/chunk-wise copy so that we could verify that allowing zero strides indeed would not trigger element-wise copies. Atm, the performance increase in ViT-H benchmarks (that involve using 0 strides) is an evidence that allowing zero strides does not lead to slow-downs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136104
Approved by: https://github.com/cpuhrsch
We introduced the dispatchable backend for a ProcessGroup and collective in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up cleanup of the options of a ProcessGroup: we ask users to either set the timeout or backend later on, or directly create the backend after creating a PG.
Also, PGNCCL is using the options class from ProcessGroup, but we should actually use the Options from the backend class. So this PR aligns the type and name with what we do on the cpp side. I don't change the signature of the public API, so it still uses args named "pg_options".
We make changes to the tests to align them with this change.
This is an attempt to reland D62008954 by fixing internal errors.
Differential Revision: [D62483294](https://our.internmc.facebook.com/intern/diff/D62483294/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135653
Approved by: https://github.com/wz337, https://github.com/H-Huang
Summary:
We refactor FxGraphCache.load into three phases:
- prepare_key, which checks that an inductor input is cacheable and bypasses otherwise
- load_with_key, which tries to lookup the key in the cache
- post compile, where we do some logging and run post compile steps
Splitting it along these lines will allow AOTAutogradCache to use load_with_key and still get access to all of the observability + remote cache logic when accessing FxGraphCache, without needing to pass key components, etc.
Differential Revision: D62314862
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135491
Approved by: https://github.com/oulgen
Fixes #136090
* Add support for isin to tensor half dtypes for CPU (just add a few extra dispatches).
* Seems like the CUDA implementation for bfloat16 was mostly compiled and available all along (internally it just calls sort and unique). To enable it, we just need to remove an assert that blocked access (sort's functionality was updated after the assert was added) and add the missing dtype support to unique.
* This unlocks more GPU functionality with minimal code bloat. I also added CPU kernels for the dtypes for parity.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136114
Approved by: https://github.com/malfet
By default inductor promotes arguments to the common highest dtype.
Having an empty token with dtype=torch.float32 results in dtype promotion for effectful ops during the lowering of with_effects.
This PR disables dtype promotion for this lowering.
It also removes the previous workaround that made the token dtype torch.bool.
Testing:
```
python test/distributed/test_c10d_functional_native.py -k test_inductor_dtypeview_memory_lea
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136039
Approved by: https://github.com/bdhirsh, https://github.com/eellison, https://github.com/zou3519
Summary: This implements a default backend proxy that tries to look up a backend via dlsym. What this enables is dynamically loading a module with a backend implementation without having it statically linked with the application.
Differential Revision: D62549295
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135967
Approved by: https://github.com/c-p-i-o
In this PR, we deprecate the _preserve_ops feature in the run_decompositions API. We can't kill this argument completely because the ExecuTorch team depends on it. As syncing between the two repos is non-trivial, I just leave this argument as deprecated for now. In the next PR, I will immediately remove it.
After this PR, run_decompositions will only decompose what's inside the decomp table and preserve the rest by default. Note that this feature is only rolled out to OSS for now. Old code path is protected under IS_FBCODE flag.
Differential Revision: [D62163161](https://our.internmc.facebook.com/intern/diff/D62163161/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135080
Approved by: https://github.com/justinchuby, https://github.com/avikchaudhuri, https://github.com/bdhirsh
> Ignore FSDP2 forward hook side-effects in AC
Under AC, FSDP2 does not rely on the forward hook to all-gather weights for recomputation; instead, it relies on the pre-backward hook to do this job:
451eaf0ff2/torch/distributed/_composable/fsdp/_fsdp_state.py (L219-L220)
So when we use `speculate_subgraph` to trace the utils.checkpoint AC region, we don't actually need to worry about the FSDP2 forward hook's side effects and can safely ignore it, because we do not (and don't expect to) re-run the FSDP2 forward hook during backward recomputation.
----
Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134997
Approved by: https://github.com/zou3519
ghstack dependencies: #135727
Running Torchbench llama with dynamic size failed with
```
File "/localdisk/leslie/torch_inductor_community/pytorch/torch/fx/experimental/symbolic_shapes.py", line 4182, in produce_guards
raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['inputs'][0].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
- Not all values of RelaxedUnspecConstraint(L['inputs'][0].size()[0]) are valid because L['inputs'][0].size()[0] was inferred to be a constant (32).
```
Skip this model for marking dynamic dim.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135960
Approved by: https://github.com/ezyang
This PR changes how `reconstruct` is done for a ConstDict. As of today, it works as follows:
(1) codegen(...) each pair of key/value
(2) create a new dictionary to hold the new items
(3) clear the original dictionary
(4) update the original dict with the one created in (2)
We do a micro optimization in the generated bytecode (see the sketch after this list) to:
- Only codegen the items that changed.
- Only clear the original dictionary if a key was removed.
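A rough sketch of the optimized decision (plain Python pseudologic, not the actual bytecode codegen):
```python
def plan_dict_reconstruction(original, traced):
    """Return a list of pseudo-ops needed to bring `original` up to date."""
    ops = []
    removed = original.keys() - traced.keys()
    if removed:
        # Only clear when a key was actually removed; then every item must be re-set.
        ops.append("CLEAR")
        changed = dict(traced)
    else:
        # Otherwise only codegen the items whose values changed (or are new).
        changed = {k: v for k, v in traced.items()
                   if k not in original or original[k] is not v}
    ops.extend(f"SET {k!r}" for k in changed)
    return ops

print(plan_dict_reconstruction({"a": 1, "b": 2}, {"a": 1, "b": 3, "c": 4}))
# ["SET 'b'", "SET 'c'"] -- no CLEAR, and the untouched 'a' is not re-emitted
```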
Fixes: #133487
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134876
Approved by: https://github.com/zou3519
Split out and modified from https://github.com/pytorch/pytorch/pull/130228. There were a bunch of subtle bugs, e.g., sometimes we need to use torch.ops.aten.{operator}.Tensor while other times we need torch.ops.aten.{operator}.default, or in the case of pow we need to use Tensor_Tensor. I figured it'd be easier to split out adding TensorReferenceAnalysis, add some tests, and do the actual integration in a separate diff.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135886
Approved by: https://github.com/ezyang
Updates the pybind11 submodule. The major patch note is an experimental new function, `cpp_conduit`, added to all pybind11 objects, which will make them more compatible across pybind11 versions, settings, and frameworks (such as nanobind). No code changes are needed on our end except updating the submodule.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136087
Approved by: https://github.com/malfet
This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode)
Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call
resume fn structure:
1. enter context
2. jump
...
3. exit context
The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).
So for torch function modes the structure of our output code is this:
1. graph call
2. mutate tf mode stack to replay mutations
3. unsupported code
4. on exception restore stack
5. resume function
Then our resume fn looks like this:
1. no-op enter torch function mode
2. jump
3. exit tf mode
To implement the no-op enter of the torch function mode, I added a torch function mode in polyfill which no-op enters but exits normally. This is needed because we still want to trace the with context in the resume function and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).
Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However, once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects, because we want to preserve the state of the mode stack as-is so that we will properly update the stack with the bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135422
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444
In preparation for tracing through DeviceContext (defb515306/torch/utils/_device.py (L66))
This PR adds support for calling the setattr of thread local objects. These objects have a slots impl, and since this doesn't appear to have any side effects, we call this setattr impl when replaying mutations; calling `object.__setattr__` on these objects results in a type error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137
For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
Approved by: https://github.com/ydwu4
All of the previous benchmarks are similar, ListOfLinears should be representative enough.
I copied the previous benchmarks from unit tests without much intention; I was just trying to create a large number of benchmarks to better observe noise.
This PR keeps only one; we can add more as we see value and regressions in the future.
Also this diff adds a GPU version.
```
collecting compile time instruction count for basic_modules_ListOfLinears_eager
compile time instruction count for iteration 0 is 6479525851
compile time instruction count for iteration 1 is 1024432680
compile time instruction count for iteration 2 is 1019417317
compile time instruction count for iteration 3 is 1013603566
compile time instruction count for iteration 4 is 1008853980
compile time instruction count for iteration 5 is 1009541481
compile time instruction count for iteration 6 is 1005025533
compile time instruction count for iteration 7 is 1004116323
compile time instruction count for iteration 8 is 1000828633
compile time instruction count for iteration 9 is 999788323
collecting compile time instruction count for basic_modules_ListOfLinears_inductor
compile time instruction count for iteration 0 is 40837529730
compile time instruction count for iteration 1 is 18411921909
compile time instruction count for iteration 2 is 18383665161
compile time instruction count for iteration 3 is 18348983522
compile time instruction count for iteration 4 is 18349276590
compile time instruction count for iteration 5 is 18353046274
compile time instruction count for iteration 6 is 18346818581
compile time instruction count for iteration 7 is 18340057998
compile time instruction count for iteration 8 is 18331267320
compile time instruction count for iteration 9 is 18328381338
collecting compile time instruction count for basic_modules_ListOfLinears_inductor_gpu
compile time instruction count for iteration 0 is 15408870979
compile time instruction count for iteration 1 is 10949520859
compile time instruction count for iteration 2 is 11058786167
compile time instruction count for iteration 3 is 11003606719
compile time instruction count for iteration 4 is 10896406770
compile time instruction count for iteration 5 is 10982875189
compile time instruction count for iteration 6 is 10931848275
compile time instruction count for iteration 7 is 10956345008
compile time instruction count for iteration 8 is 11045384499
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135730
Approved by: https://github.com/ezyang, https://github.com/anijain2305
Summary:
Move towards consolidating strobelight profiler implementations between OSS and fbcode. This change is a first step towards that.
- Created a new function to abstract out compile time profiling enablement. This function allows profiler to switch between different function profilers (e.g. Thrift based or CLI based)
- Both OSS and Fbcode now use one compile time profiler in torch/_strobelight
Test Plan:
Tested OSS with following commands:
```
python torch/_strobelight/examples/compile_time_profile_example.py
python torch/_strobelight/examples/cli_function_profiler_example.py
TORCH_COMPILE_STROBELIGHT=TRUE TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 python benchmarks/dynamo/huggingface.py --ci --accuracy --timing --explain --inductor --device cuda --training --amp --only XLNetLMHeadModel
```
See test commands for fbcode in comments.
Differential Revision: D62444551
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135953
Approved by: https://github.com/laithsakka