Summary:
The following types of objects don't need to be serialized for precompile:
1. PyCapsule because we don't guard on C binding objects in meaningful ways.
2. Code objects because we only do id matching on these, and id matches are always dropped for precompile.
3. Nested function objects since we also ban CLOSURE_MATCH.
Test Plan:
buck run mode/opt test/dynamo:test_dynamo -- -k test_skipped_objects
Rollback Plan:
Differential Revision: D78816888
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158926
Approved by: https://github.com/jamesjwu
When building the CD docker image, all CUDA dependencies must be installed into the `/usr/local/cuda/` folder.
Test plan: Look at the binary build logs, for example [here](https://github.com/pytorch/pytorch/actions/runs/16768141521/job/47477380147?pr=159907):
```
2025-08-06T05:58:00.7347471Z -- NVSHMEM_HOME set to: ''
2025-08-06T05:58:00.7348378Z -- NVSHMEM wheel installed at: ''
2025-08-06T05:58:00.7392528Z -- NVSHMEM_HOST_LIB: '/usr/local/cuda/lib64/libnvshmem_host.so'
2025-08-06T05:58:00.7393251Z -- NVSHMEM_DEVICE_LIB: '/usr/local/cuda/lib64/libnvshmem_device.a'
2025-08-06T05:58:00.7393792Z -- NVSHMEM_INCLUDE_DIR: '/usr/local/cuda/include'
2025-08-06T05:58:00.7394252Z -- NVSHMEM found, building with NVSHMEM support
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159907
Approved by: https://github.com/Skylion007, https://github.com/ngimel
# Background
After I built torch_openreg, I noticed that the wheel package contained the stub.c file under the csrc directory, which was not used in the runtime.
# Motivation
This PR aims to remove the stub.c file and any other file that is unused at runtime from the torch_openreg wheel.
**Changes:**
- Set the **include_package_data** keyword to `False` in the setup function
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159845
Approved by: https://github.com/albanD
**Summary**
This issue proposes implementing a CUDA kernel for aten._weight_int8pack_mm, a weight-only quantized (WOQ) linear operation that is currently only supported on CPU. On CUDA, the fallback path uses an unfused .mul().sum() pattern in quantization.py, which is less efficient for inference. https://github.com/pytorch/pytorch/issues/158849
**Motivation**
A fused GPU kernel for aten._weight_int8pack_mm would:
- Eliminate reliance on the .mul().sum() fallback in quantization.py
- Improve performance for quantized inference on CUDA
- Extend Inductor’s GPU quantization support across more workloads
**Implementation**
- Implement a Triton kernel for:
```
out[b, n] = sum_k(x[b, k] * w[n, k]) * scale[n]
where:
x: [B, K] float32
w: [N, K] int8
scale: [N] float32
out: [B, N] float32
```
- Integrate the kernel with register_woq_mm_ops() in torch/_inductor/quantized_lowerings.py
- Route it conditionally in quantization.py where GPU currently falls back to .mul().sum()
- Add unit tests comparing results to the reference fallback path
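For reference, the formula above can be spelled out in plain Python. This is only a sketch of the math, not the Triton kernel from this PR: `weight_int8pack_mm_ref` is a hypothetical name, and nested lists stand in for tensors.

```python
def weight_int8pack_mm_ref(x, w, scale):
    """Reference for out[b, n] = sum_k(x[b, k] * w[n, k]) * scale[n].

    x: B x K floats, w: N x K int8 values, scale: N floats.
    Hypothetical helper for illustration only; not the fused kernel.
    """
    out = []
    for x_row in x:
        out_row = []
        for w_row, s in zip(w, scale):
            # Accumulate the dot product over K, then apply the per-channel scale once.
            acc = sum(xv * wv for xv, wv in zip(x_row, w_row))
            out_row.append(acc * s)
        out.append(out_row)
    return out


x = [[1.0, 2.0], [0.5, -1.0]]  # B=2, K=2
w = [[3, -4], [127, 0]]        # N=2, K=2, values within the int8 range
scale = [0.5, 0.25]            # N=2
result = weight_int8pack_mm_ref(x, w, scale)
```

A unit test against the fallback path would compare the fused output to a reference like this one within a tolerance.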
Test Plan:
```
buck2 run 'fbcode//mode/opt' :linalg test_linalg.TestLinalgCUDA.test__int8_mm_m_64_k_64_n_64_compile_True_slice_True_cuda
```
Log: P1882799769
```
buck2 test 'fbcode//mode/opt' caffe2/test:linalg
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/6755399722424741/
Benchmark Results:
```
**[Shape B=256, K=1024, N=512]**
CPU and CUDA outputs match
Max abs diff: 2.59e-04, max rel diff: 0.75
CPU: 144.14 ms, CUDA: 303.67 µs
Speedup: ×474.6
**[Shape B=512, K=2048, N=1024]**
CPU and CUDA outputs match
Max abs diff: 5.49e-04, max rel diff: 0.15
CPU: 1173.27 ms, CUDA: 2.40 ms
Speedup: ×488.5
```
Rollback Plan:
Differential Revision: D79042656
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159325
Approved by: https://github.com/danielvegamyhre, https://github.com/jerryzh168
In some cases, MPS kernels are reused across higher-order-op subgraphs and the top-level code. Currently, we initialize the variable for an MPS kernel the first time we use it. This breaks when the first use is inside a subgraph, since the kernel is then only initialized within the subgraph's scope. For instance:
```
if ...
auto mps_lib_0_func = ...
mps_lib_0_func->run()
// since we already used mps_lib_0 once, we don't re-initialize it
mps_lib_0_func->run() // error, mps_lib_0_func not initialized
```
So the solution we took here is to initialize all the kernels at the beginning:
```
const std::shared_ptr<at::native::mps::MetalKernelFunction> get_mps_lib_0() {
    static const auto func = mps_lib_0.getKernelFunction("generated_kernel");
    return func;
}
AOTIMetalKernelFunctionHandle get_mps_lib_0_handle() {
    static const auto handle = AOTIMetalKernelFunctionHandle(get_mps_lib_0().get());
    return handle;
}
...
if ...
    get_mps_lib_0()->run()
get_mps_lib_0()->run() // success
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159753
Approved by: https://github.com/malfet
ghstack dependencies: #159456, #159695
Also migrate `test_common_rules.py` since it was a short file
`python test/distributed/tensor/test_common_rules.py`
Before:
Ran 10 tests in 91.516s
After:
Ran 10 tests in 5.604s
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159885
Approved by: https://github.com/ezyang
Summary:
This allows us to start seeing the failure rate on these models (and
potentially alert on it).
Test Plan:
```
FORCE_LOG_TRITON_BUILDS_TO_PROD=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 buck2 run @//mode/opt :compile 2>&1 | tee out
```
P1889607054
Waiting for the scuba table to generate, but manual logging shows it should show up at https://fburl.com/scuba/pt2_triton_builds_inc_archive/7852kt8h soon.
Rollback Plan:
Reviewed By: masnesral
Differential Revision: D79308333
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159897
Approved by: https://github.com/masnesral
Summary:
This feature is Meta internal only
Add a util function to put dynamic shape-related suggestions into MLHubDebugInsightService, which will then be surfaced to users in MLHub.
The rollout will be controlled by JK.
Test Plan:
MAST job aps-omnifmv3_dev_baseline_test-a34fdccf21
{F1980593060}
* If you're not able to see the insight, please add yourself to this gk 'mlhub_debugging_insights_dev_visibility'
* The URL link should route to a new Job Inspector page that will provide details and straightforward instructions on how to configure the ds. The page is still in development, so here we use the general PT2 compile JI page.
* Test fails because of the export checks. I'll export after addressing all the comments from reviewers.
Rollback Plan:
Reviewed By: pianpwk
Differential Revision: D78526522
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159391
Approved by: https://github.com/jingsh
Summary: For training jobs particularly from GenAI, NCCL trace dumps are generated in the format of `<hostname>.pci3_rank_<rank>`. For multi-node training jobs, the hostname varies across traces. The current prefix matching logic can't handle this case.
Test Plan:
Create a local folder `dumps` and several empty files: `host0.pci3_rank_0`, `host0.pci3_rank_1`, `host1.pci3_rank_0`, `host1.pci3_rank_1` inside it. Then run
```
buck2 run fbcode//caffe2/fb/flight_recorder:fr_trace -- trace_dir dumps
```
Before this diff, fr_trace cannot locate any trace files, giving the following assertion error:
```
AssertionError: no files loaded from /home/tianhaoh/dumps with prefix pci3_rank_
```
After this diff, fr_trace is able to locate the trace files, resulting in exceptions like
```
dump = pickle.load(infile)
^^^^^^^^^^^^^^^^^^^
EOFError: Ran out of input
```
(since the trace files are fake and empty).
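As an illustration of hostname-agnostic matching, here is a minimal sketch. It is not the fr_trace implementation; `find_trace_files` is a hypothetical helper that assumes the `<hostname>.pci3_rank_<rank>` naming described in the summary.

```python
import re


def find_trace_files(filenames, prefix="pci3_rank_"):
    """Group trace files by rank, allowing any hostname before the prefix.

    Hypothetical helper: accepts both `<prefix><rank>` and
    `<hostname>.<prefix><rank>`; everything else is ignored.
    """
    pattern = re.compile(r"^(?:.+\.)?" + re.escape(prefix) + r"(\d+)$")
    by_rank = {}
    for name in filenames:
        match = pattern.match(name)
        if match:
            by_rank.setdefault(int(match.group(1)), []).append(name)
    return by_rank


files = [
    "host0.pci3_rank_0",
    "host0.pci3_rank_1",
    "host1.pci3_rank_0",
    "host1.pci3_rank_1",
    "notes.txt",
]
found = find_trace_files(files)
```

A plain `startswith(prefix)` check would reject all four trace files here, which matches the assertion error shown above.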
Rollback Plan:
Differential Revision: D79224727
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159490
Approved by: https://github.com/fduwjj
Summary:
OrderedImporters is supposed to be an importer which tries out every single importer in self._importers. However, the get_name API does not follow this behavior and only uses the get_name from the basic Importer class.
This change updates the OrderedImporters get_name API so that it tries the get_name API of every single importer.
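The intended "try every importer in order" behavior can be sketched as follows. `ToyImporter` and `ToyOrderedImporter` are hypothetical stand-ins, not the real importer classes, and their `get_name` signature is simplified for illustration.

```python
class ToyImporter:
    """Hypothetical stand-in for an importer that can name some objects."""

    def __init__(self, names):
        self._names = dict(names)

    def get_name(self, obj):
        if obj not in self._names:
            raise LookupError(f"cannot name {obj!r}")
        return self._names[obj]


class ToyOrderedImporter:
    """Tries each wrapped importer in order and returns the first success."""

    def __init__(self, *importers):
        self._importers = list(importers)

    def get_name(self, obj):
        for importer in self._importers:
            try:
                return importer.get_name(obj)
            except LookupError:
                continue  # fall through to the next importer
        raise LookupError(f"no importer could name {obj!r}")


first = ToyImporter({"a": "pkg_one.a"})
second = ToyImporter({"b": "pkg_two.b"})
ordered = ToyOrderedImporter(first, second)
```

With this shape, an object known only to the second importer still resolves, which a base-class-only `get_name` would miss.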
Differential Revision: D76463252
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155743
Approved by: https://github.com/jcwchen, https://github.com/jingsh
When we set up the logging config following the guide: https://docs.pytorch.org/docs/stable/logging.html
For example:
TORCH_LOGS="+schedule,+inductor,+output_code"
On Linux, it shows as:
```bash
declare -x SSH_TTY="/dev/pts/0"
declare -x TERM="xterm"
declare -x TORCH_LOGS="+schedule,+inductor,+output_code"
declare -x USER="xu"
```
On Windows, it shows as:
```cmd
TORCHINDUCTOR_WINDOWS_TESTS=1
TORCH_LOGS="+schedule,+inductor,+output_code"
UCRTVersion=10.0.22000.0
```
On Linux, the shell shows quotes by default, whereas Windows does not.
In addition, Windows keeps the quotes as part of the value when processing the env var.
On Linux, we get the variable: "+schedule,+inductor,+output_code"
On Windows, we get the variable: '"+schedule,+inductor,+output_code"'
So we need to remove the outer quotes on Windows.
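A minimal sketch of the quote stripping, assuming it is applied only on Windows. `normalize_env_value` is a hypothetical helper, not the function changed in this PR.

```python
import sys


def normalize_env_value(value, platform=sys.platform):
    """Strip one pair of matching outer quotes on Windows.

    Hypothetical helper: other platforms are left untouched because the
    shell has already removed the quotes there.
    """
    if (
        platform == "win32"
        and len(value) >= 2
        and value[0] == value[-1]
        and value[0] in "\"'"
    ):
        return value[1:-1]
    return value


raw = '"+schedule,+inductor,+output_code"'
cleaned = normalize_env_value(raw, platform="win32")
```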
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159887
Approved by: https://github.com/angelayi
Summary:
This is needed for subprocesses that are trying to call back into torch
functionality, i.e. anything that's also setting `PYTHONPATH`. There are more
`sys.executable` subprocesses in torch/ but it seems like they're fine.
Test Plan: Local inference runs.
Reviewed By: aorenste
Differential Revision: D79124705
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159382
Approved by: https://github.com/aorenste
Originally, if the PT2 errored when loading, we would fall back to the old loader to handle BC issues. However, this hides the error message when an up-to-date PT2 fails to load for some other reason.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159881
Approved by: https://github.com/yushangdi
Summary:
- debug.py: Added log_runtime_estimates() function to dump runtime estimation data as structured tlparse artifacts in JSON format
- test_structured_trace.py: Added comprehensive test coverage for compute and collective ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159730
Approved by: https://github.com/yushangdi
ghstack dependencies: #159190
Removes unused docker images from the docker build workflow
Then removes unused definitions in build.sh
The only one I left is the vllm one because I'm pretty sure it's going to be used in the future
I assume everything not mentioned is old and we forgot to remove them
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159171
Approved by: https://github.com/yangw-dev
No functional changes, just:
- Update C++ standard to C++17
- Update `cmake` min version to 3.18
- Update `libuv` dependency to 1.51 (to move its cmake min version to 3.10)
- Replace boost optional implementation with `std::optional` wrapper
- Make it compilable with gcc-14.x and newer by including `cstddef` in a few headers
- Avoid using deprecated enums for MacOS builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159834
Approved by: https://github.com/Skylion007
With FSDP, we sometimes have multiple, non-overlapping views of a single buffer which are all mutated. Previously we considered the original buffer as an allocation and made the mutated buffer the deallocation. With multiple mutations of the same buffer, we need to consider the original buffer as deallocated only when all of its aliases die (and avoid double counting the input buffer size). See comment inline:
```
When an operation mutates a buffer in-place, the scheduler creates a new buffer name
to track the "before" and "after" states, even though they share the same memory.
The mutated buffer represents a rename with zero allocation and deallocation cost.
During dependency tracking, we transfer dependencies from the mutated name back to
the original buffer, ensuring the original memory is only freed when all aliases
are done.
This handles cases where a buffer has multiple non-overlapping aliases - rather than
trying to assign free costs to individual aliases, we forward all alias dependencies
to the original buffer.
Consider:
buf0 = op0()
buf1 = mutation_op_(buf0)
del buf0
...
op(buf1)
del buf1
The only memory events are the creation prior to op0, and the deletion following buf1.
```
As @IvanKobzarev's logs in https://github.com/pytorch/pytorch/pull/158361/files#diff-e173a1d52aff49959c9f6d17ecc09946d8a616fc5909df884e62a15e1ebd1d41R1776-R1807 show, it can be a bit of a pain to pinpoint which part of our memory calculation is incorrect.
This PR also adds a runtime verifier `config.test_configs.track_memory_lifecycle` which tracks buffer allocation and deallocation, and errors if their lifetime does not match our expectations.
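The "free only when all aliases die" accounting can be illustrated with a toy model. This is a sketch, not the scheduler code: `memory_timeline` and its `alias_of` mapping (mutated name to original buffer) are hypothetical.

```python
def memory_timeline(steps, sizes, alias_of):
    """Return live bytes after each step.

    steps: list of ("new", name) or ("del", name).
    sizes: bytes owned by each original buffer.
    alias_of: maps a mutated name to the original buffer sharing its memory.
    """
    live_names = {}  # original buffer -> names still alive on its memory
    live_bytes = 0
    timeline = []
    for op, name in steps:
        root = alias_of.get(name, name)
        if op == "new":
            if root not in live_names:
                live_names[root] = set()
                live_bytes += sizes[root]  # only the first name allocates
            live_names[root].add(name)
        else:
            live_names[root].discard(name)
            if not live_names[root]:
                # The last alias died, so the memory is released exactly once.
                del live_names[root]
                live_bytes -= sizes[root]
        timeline.append(live_bytes)
    return timeline


steps = [("new", "buf0"), ("new", "buf1"), ("del", "buf0"), ("del", "buf1")]
timeline = memory_timeline(steps, sizes={"buf0": 1024}, alias_of={"buf1": "buf0"})
```

In this model the rename to `buf1` costs nothing, and `del buf0` does not free memory because `buf1` is still alive.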
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159569
Approved by: https://github.com/IvanKobzarev
This will eventually allow me to enable more opinfo tests against the MPS device. It was supposed to be a very simple test, but it actually required minor adjustments to lots of test files, namely:
- Introduce `all_mps_types_and` that is very similar to `all_types_and`, but skips `float64`
- Decorate lots of tests with `@dtypesIfMPS(*all_mps_types())`
- Skip `test_from_dlpack_noncontinguous` as it currently crashes (needs to be fixed)
- Add lots of `expectedFailureIfMPS`
- Delete all `@onlyNativeDeviceTypesAnd("mps")`
<sarcasm> I love how well documented this variable are </sarcasm>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153835
Approved by: https://github.com/Skylion007
\# Why
- Make loop iteration simpler
- Have a common spot where to make modifications that affect
all the GEMM Triton templates, avoiding missed spots
\# What
- pull out common logic of taking the BaseConfig objects
and turning them into kwargs to feed into maybe_append_choice
for Triton GEMM templates
Differential Revision: [D79186962](https://our.internmc.facebook.com/intern/diff/D79186962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159383
Approved by: https://github.com/jansel
# Motivation
This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`.
Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156175
Approved by: https://github.com/albanD
ghstack dependencies: #159629, #150312, #156165
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a following PR and keep them only for BC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
ghstack dependencies: #159629
Summary: Inductor's 3.4 Triton release is the most commonly used variant of Triton, but if someone is working with an alternative version of Triton, this may not match. This moves the version check from 3.4 Triton to any variant that has support for the TMA APIs.
Test Plan:
Relying on CI. Should be a NFC.
Rollback Plan:
Reviewed By: davidberard98
Differential Revision: D79378792
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159777
Approved by: https://github.com/davidberard98
It can be very slow to repeatedly hit DNS resolution failures, but
it's very helpful to have DNS names in logs by default. So we try to use DNS,
but if we hit a transient failure we just disable it for the remainder of the
job, logging IP addresses instead.
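The fallback can be sketched in Python as follows. This is not the code changed in this PR: `HostNamer` is a hypothetical class, and the injectable `resolver` argument is an assumption made so the behavior can be exercised without a network.

```python
import socket


class HostNamer:
    """Resolve IPs to names, disabling DNS after the first failure."""

    def __init__(self, resolver=socket.gethostbyaddr):
        self._resolver = resolver
        self._dns_enabled = True

    def name_for(self, ip):
        if self._dns_enabled:
            try:
                # gethostbyaddr returns (hostname, aliaslist, ipaddrlist).
                return self._resolver(ip)[0]
            except OSError:
                # Stop paying for slow lookups for the rest of the job.
                self._dns_enabled = False
        return ip


calls = []


def failing_resolver(ip):
    calls.append(ip)
    raise OSError("simulated DNS failure")


namer = HostNamer(resolver=failing_resolver)
names = [namer.name_for("10.0.0.1"), namer.name_for("10.0.0.2")]
healthy = HostNamer(resolver=lambda ip: ("node-a", [], [ip]))
```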
Fixes #159007
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159596
Approved by: https://github.com/d4l3k
This PR fixes `cmake/Dependencies.cmake` to work when compiling with `USE_SYSTEM_XNNPACK=ON` by changing a lowercase `or` to an uppercase `OR`.
---
For a personal project, I was building pytorch with a customized build of XNNPACK. When trying to do so I encountered the following error:
```
CMake Error at cmake/Dependencies.cmake:566 (if):
if given arguments:
"NOT" "XNNPACK_LIBRARY" "or" "NOT" "microkernels-prod_LIBRARY"
Unknown arguments specified
Call Stack (most recent call first):
CMakeLists.txt:868 (include)
```
Upon making the change in this PR (changing `or` to `OR`), the process continued as expected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159527
Approved by: https://github.com/janeyx99
ROCm inductor benchmark builds are failing at the fbgemm build stage: https://ossci-raw-job-status.s3.amazonaws.com/log/46800456622
```
2025-07-27T08:00:32.3443858Z /var/lib/jenkins/pytorch/fbgemm/src/RowWiseSparseAdagradFused.cc:389:18: error: no matching function for call to ‘asmjit::v1_17::x86::Vec::Vec(uint32_t)’
2025-07-27T08:00:32.3444080Z 389 | x86::Xmm partial_sum_xmm(partial_sum_vreg.id());
```
It looks like asmjit fails to build. This seems to be due to the fbgemm submodules not being updated after checking out a new commit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159477
Approved by: https://github.com/pruthvistony, https://github.com/eqy
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@1f7a57](1f7a57f507), which includes:
- Add Template Parameter to the function `gpu_kernel` for Controlling Broadcasting Vectorization
- Add optional NaN checks to XCCL
- Fix NllLossForwardReduce2DKernelFunctor accuracy
- Extend the existing communication logging to include the reduction operation for collective calls
- [Reland] Install xpu codegen header to torch/include
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159621
Approved by: https://github.com/EikanWang
**Motivation / Context**: (what I _think_ is happening here)
In "eager"/just-in-time PT2 usage, dynamo/inductor will guard on whether indices fit in int32 or not. So it's generally safe in Inductor code to rely on the example values for symbolic ints in order to determine whether indices fit in int32, because the indices will be guarded on anyway; and if the inputs ever increase to `>int32_max`, dynamo will cause a recompilation.
But with AOTI, those int32 guards aren't respected; so if the example input is `< int32_max` but can be `> int32_max` during future execution, then the future execution might fail / IMA.
**Solution space**
Export allows users to specify which dimensions are dynamic, and to provide **ranges of valid sizes**.
One solution idea is to always respect the upper bound of the dynamic shape range when doing AOTI; if the index's range includes values `>int32_max`, then don't use the hint and assume that this index doesn't fit in int32.
However, the problem with this is that many users may specify dynamism without specifying a range of values - the upper bound of the range will be set to the default of `inf`. Such use cases could potentially experience a perf regression if we implemented the idea above.
To prevent any such regressions, this implementation will rely solely on the specified range only if the upper bound of the range isn't inf. In other words, we'll ignore the hints/example values for AOTI (and rely only on the specified range) only if the upper bound of the range isn't inf - if users explicitly specify a range that extends past int32, we can be fairly sure that they actually do need values `>int32_max`.
If we continue to see correctness issues even with this implementation, we could consider more aggressively relying on the ranges.
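The decision rule described above can be sketched as follows. `index_fits_int32` is a hypothetical helper, not the Inductor function touched by this PR; it only encodes the rule as stated in this description.

```python
import math

INT32_MAX = 2**31 - 1


def index_fits_int32(hint, upper_bound, aot_mode):
    """Decide whether an index can be treated as int32.

    hint: example value for the symbolic size.
    upper_bound: upper end of the user-specified range (math.inf if unset).
    aot_mode: True for AOTI, where int32 guards are not respected.
    """
    if aot_mode and upper_bound != math.inf:
        # An explicit range was given: trust it and ignore the hint.
        return upper_bound <= INT32_MAX
    # JIT (guarded), or AOTI with the default inf bound: fall back to the hint.
    return hint <= INT32_MAX


default_range = index_fits_int32(1000, math.inf, aot_mode=True)
explicit_large = index_fits_int32(1000, 2**40, aot_mode=True)
explicit_small = index_fits_int32(1000, 2**20, aot_mode=True)
jit_large = index_fits_int32(1000, 2**40, aot_mode=False)
```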
Differential Revision: [D79220301](https://our.internmc.facebook.com/intern/diff/D79220301)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159433
Approved by: https://github.com/jingsh, https://github.com/ColinPeppler
Summary: Fix https://github.com/pytorch/pytorch/issues/159612
- Fix the meta implementation of `nan_to_num`, it should preserve the stride of the input
- The DeviceCopy IR node should always preserve the input's layout, so we don't end up with a contiguous call during device copy
Test Plan:
```
buck2 run @mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_d2h_copy
```
Rollback Plan:
Differential Revision: D79411407
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159615
Approved by: https://github.com/eellison
Fixes #159601
Unfortunately #156868 introduced a couple of regressions (see #159590 and #159601). This reverts the commit while I am working on a permanent fix. This means the `in_compiled_autograd_initial_trace` global flag will be removed and the `_are_we_tracing()` will instead be replaced with the symint preprocessing step during sharding prop post init.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159671
Approved by: https://github.com/xmfan
Summary:
Currently this function relies on the assumption that we link `libnvshmem_device.a` statically and load `libnvshmem_host.so` at runtime. This fails when `libnvshmem.a` (which combines the two) is linked statically. Add a check for whether a host API symbol exists at runtime to detect that nvshmem is linked statically.
Test Plan:
CI + sample run
Rollback Plan:
Differential Revision: D79177525
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159561
Approved by: https://github.com/kwen2501
Due to the different byte order,
when copying data, the value has to be placed in the last bytes to ensure that an int32_t converted to int64_t keeps the same value. The same has to be done when it is converted back.
This change fixes test
TestLibtorchAgnosticCPU::test_my_ones_like_cpu
from
cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py on s390x.
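The byte placement issue can be demonstrated with Python's `struct` module. This is an illustration only, not the C++ fix itself; `widen_int32_in_slot` and `read_int64` are hypothetical helpers, and the sketch is limited to non-negative values so that sign extension does not come into play.

```python
import struct


def widen_int32_in_slot(value, byteorder):
    """Place a non-negative int32 into a zeroed 8-byte slot.

    On little-endian the payload goes into the first 4 bytes; on big-endian
    (e.g. s390x) it must go into the last 4 bytes to keep the same value.
    """
    fmt = ("<" if byteorder == "little" else ">") + "i"
    payload = struct.pack(fmt, value)
    slot = bytearray(8)
    if byteorder == "little":
        slot[:4] = payload
    else:
        slot[4:] = payload
    return bytes(slot)


def read_int64(slot, byteorder):
    """Interpret the 8-byte slot as a signed 64-bit integer."""
    return int.from_bytes(slot, byteorder, signed=True)


little = read_int64(widen_int32_in_slot(7, "little"), "little")
big = read_int64(widen_int32_in_slot(7, "big"), "big")
# Copying into the first bytes on big-endian shifts the value by 32 bits.
misplaced = read_int64(struct.pack(">i", 7) + bytes(4), "big")
```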
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155115
Approved by: https://github.com/huydhn
This change fixes multiple tests in
test/inductor/test_aot_inductor_arrayref.py
such as
test_cond_with_parameters_cpu_with_stack_allocation,
test_issue_140766_cpu_with_stack_allocation,
test_model_modified_weights_cpu_with_stack_allocation,
test_nested_tensor_from_jagged_cpu_with_stack_allocation.
Enable tests in test/inductor/test_aot_inductor_arrayref.py
This change is split off from https://github.com/pytorch/pytorch/pull/150116
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157784
Approved by: https://github.com/huydhn
The previous implementation was creating `n_gpu * n_tensors` intermediate tensors, which added a lot of CPU overhead, especially given that inductor was generating a number of individual tensor copy kernels for `torch.cat`.
This PR changes the implementation so that only `n_tensors` are created, making the CPU overhead proportional to the number of tensors being bucketed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159723
Approved by: https://github.com/IvanKobzarev
The output of a reduce_scatter is n_gpu times smaller than its input, while the output of an all_gather is n_gpu times larger than its input. This means that in the current heuristic for bucketing reduce_scatter, we would need to use a bucket size which is n_gpu times larger than the bucket for all_gather, making it gpu-dependent and less intuitive. This PR proposes using the max of the input and output sizes instead, so that one can use the same bucket_size value for both passes.
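A small sketch of why the max makes one `bucket_size` work for both passes. The greedy grouping in `greedy_buckets` is a hypothetical stand-in for the real bucketing pass; only the `max(input, output)` cost comes from this description, and the sizes are illustrative.

```python
def greedy_buckets(collectives, bucket_size):
    """Group (input_size, output_size) pairs, costing each by max(in, out)."""
    buckets, current, used = [], [], 0
    for inp, out in collectives:
        cost = max(inp, out)
        if current and used + cost > bucket_size:
            buckets.append(current)
            current, used = [], 0
        current.append((inp, out))
        used += cost
    if current:
        buckets.append(current)
    return buckets


n_gpu = 4
shard = 1  # illustrative size units per rank
all_gathers = [(shard, shard * n_gpu)] * 3      # output is n_gpu times larger
reduce_scatters = [(shard * n_gpu, shard)] * 3  # output is n_gpu times smaller
ag_buckets = greedy_buckets(all_gathers, bucket_size=8)
rs_buckets = greedy_buckets(reduce_scatters, bucket_size=8)
```

Both collectives cost 4 units each under this rule, so the same `bucket_size` produces the same grouping regardless of `n_gpu`.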
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159717
Approved by: https://github.com/wconstab
#158649 turned off automatic GCs during cudagraph recording. This is causing a small uptick in some internal benchmark numbers because of memory the benchmark is leaving around before the benchmark starts, so we now GC before warming up the model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159670
Approved by: https://github.com/oulgen
Updated the RuntimeError message in the is_nonzero(input) method from bool to Boolean.
**Case 1:**
t = torch.tensor([])
torch.is_nonzero(t)
**Case 2:**
t = torch.tensor([1,2])
torch.is_nonzero(t)
**Existing Error message in documentation:**
for case 1: RuntimeError: bool value of Tensor with no values is ambiguous
for case 2: RuntimeError: bool value of Tensor with more than one value is ambiguous
**Proposed Error message in documentation:**
for case 1: RuntimeError: Boolean value of Tensor with no values is ambiguous
for case 2: RuntimeError: Boolean value of Tensor with more than one value is ambiguous
Fixes #159710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159712
Approved by: https://github.com/malfet
# Motivation
This PR enables _int_mm on Intel GPU. _int_mm is used by int8 quantization in torchao.
# Model Test Result:
We ran meta-llama/Llama-3.1-8B-Instruct on Intel GPU and A100 using torchao int8-dynamic-quantization. The model configuration is as follows:
Precision : torch.bfloat16
quantization configuration : Int8DynamicActivationInt8WeightConfig
dataset : wikitext
Result:
The perplexity values for Intel GPU and A100 are 9.582953453063965 and 9.57755184173584, respectively.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157769
Approved by: https://github.com/EikanWang, https://github.com/desertfire
Summary: turns out i added this to reduce the frequency we'd call try_update_max_size_at_index when a new maximum is found before the replan is called. oops.
Test Plan:
backout
Rollback Plan:
Differential Revision: D79474114
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159677
Approved by: https://github.com/georgiaphillips
Summary:
The launch grid calculation code uses a Python trick to achieve CeilDiv() through negative integer division with FloorDiv(). This behaviour is language-dependent and does not carry over to all languages.
In the FXIR backend, we undo this trick and replace the expression with a CeilDiv() operation so the computation is correct regardless of the language used. We do not change the original computation directly, as that leads to a performance degradation.
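To illustrate why the trick is language-dependent, the sketch below compares Python's floor division with a truncating division as used by C and C++. `trunc_div` and `ceil_div` are hypothetical helpers written for this example, not the FXIR backend code.

```python
def trunc_div(a, b):
    """Integer division that rounds toward zero, as in C and C++."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def ceil_div(a, b):
    """Explicit ceiling division for non-negative a and positive b."""
    return (a + b - 1) // b


numel, block = 10, 4
python_trick = -(-numel // block)             # floor division: -(-3) == 3
same_trick_in_c = -trunc_div(-numel, block)   # truncation: -(-2) == 2
explicit = ceil_div(numel, block)
```

Expressing the grid size as an explicit CeilDiv avoids depending on how the target language rounds negative quotients.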
Test Plan:
CI
Rollback Plan:
Differential Revision: D79275534
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159497
Approved by: https://github.com/blaine-rister
This change introduces structured logging of the collective communication schedule, enabling downstream tools (e.g. TLParse) to ingest and analyze per‑rank collective‐order information for multi‑rank jobs.
- Iterates over scheduler.nodes, filters for _CollectiveKernel nodes
- Extracts each op’s python_kernel_name
- Emits a structured JSON payload under the inductor_collective_schedule artifact name
- Dumps the full schedule list to collective_schedule.json via the PyTorch trace‑structured artifact
- Added comprehensive unit tests for collective schedule tracing: Created test_collective_schedule_empty() and test_collective_schedule_real() tests to verify structured trace logging works correctly for both empty collective schedules and real collective operations (like all_reduce and wait_tensor from _c10d_functional ops).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159190
Approved by: https://github.com/yushangdi, https://github.com/xmfan
# Motivation
While refactoring the caching allocator, I noticed that the `ExpandableSegment` constructor on CUDA had an unused parameter. This change removes that unused argument to avoid potential confusion.
# Additional Context
I noticed that `ExpandableSegment` is defined in a cpp file, so it should be safe to make this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159356
Approved by: https://github.com/ngimel, https://github.com/albanD
ghstack dependencies: #159159
PyTorch with ROCm on Windows is built with clang-cl and not MSVC. This code path is specific to the MSVC compiler, so it should check for `_MSC_VER`, not just `_WIN32`. The change here is similar to https://github.com/pytorch/pytorch/pull/146606.
This fixes downstream build errors using clang-cl like https://github.com/ROCm/TheRock/actions/runs/16569646709/job/46858176812 (patched and tested downstream at https://github.com/ROCm/TheRock/pull/1140):
```
[7099/7147] Building CXX object functorch\CMakeFiles\functorch.dir\csrc\dim\dim.cpp.obj
FAILED: functorch/CMakeFiles/functorch.dir/csrc/dim/dim.cpp.obj
C:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\lib\llvm\bin\clang-cl.exe /nologo -TP -DEXPORT_AOTI_FUNCTIONS -DFUNCTORCH_BUILD_MAIN_LIB -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNOMINMAX -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DROCM_ON_WINDOWS -DROCM_USE_FLOAT16 -DROCM_VERSION=70000 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_C -DTORCH_HIP_VERSION=700 -DUSE_EXTERNAL_MZCRC -DUSE_MIMALLOC -DUSE_PROF_API=1 -DWIN32_LEAN_AND_MEAN -D_CRT_SECURE_NO_DEPRECATE=1 -D_UCRT_LEGACY_INFINITY -D__HIP_PLATFORM_AMD__ -D__HIP_PLATFORM_AMD__=1 -Dfunctorch_EXPORTS -IB:\src\torch\build\aten\src -IB:\src\torch\aten\src -IB:\src\torch\build -IB:\src\torch -IB:\src\torch\nlohmann -IB:\src\torch\moodycamel -IB:\src\torch\third_party\mimalloc\include -IB:\src\torch\functorch -IB:\src\torch\torch\csrc\api -IB:\src\torch\torch\csrc\api\include -IB:\src\torch\c10\.. -IB:\src\torch\c10\hip\..\.. -IB:\src\torch\torch\.. -IB:\src\torch\torch\..\aten\src -IB:\src\torch\torch\..\aten\src\TH -IB:\src\torch\build\caffe2\aten\src -IB:\src\torch\build\third_party -IB:\src\torch\build\third_party\onnx -IB:\src\torch\torch\..\third_party\valgrind-headers -IB:\src\torch\torch\..\third_party\gloo -IB:\src\torch\torch\..\third_party\onnx -IB:\src\torch\torch\..\third_party\flatbuffers\include -IB:\src\torch\torch\..\third_party\kineto\libkineto\include -IB:\src\torch\torch\..\third_party\cpp-httplib -IB:\src\torch\torch\..\third_party\nlohmann\include -IB:\src\torch\torch\csrc -IB:\src\torch\torch\lib -IB:\src\torch\torch\standalone -IB:\src\torch\torch\lib\libshm_windows -imsvcC:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\include -imsvcB:\src\torch\third_party\protobuf\src -imsvcB:\src\torch\third_party\XNNPACK\include -imsvcB:\src\torch\third_party\ittapi\include -imsvcB:\src\torch\cmake\..\third_party\eigen -imsvcB:\src\torch\third_party\ideep\mkl-dnn\include\oneapi\dnnl 
-imsvcB:\src\torch\third_party\ideep\include -imsvcB:\src\torch\INTERFACE -imsvcB:\src\torch\third_party\nlohmann\include -imsvcB:\src\torch\third_party\concurrentqueue -imsvcC:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\include\hiprand -imsvcC:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\include\rocrand -imsvcB:\src\torch\cmake\..\third_party\pybind11\include -imsvcC:\home\runner\_work\_tool\Python\3.11.9\x64\include /DWIN32 /D_WINDOWS /EHsc /Zc:__cplusplus /bigobj /FS /utf-8 -DUSE_PTHREADPOOL -DNDEBUG -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE /wd4624 /wd4068 /wd4067 /wd4267 /wd4661 /wd4717 /wd4244 /wd4804 /wd4273 /O2 /Ob2 /DNDEBUG /bigobj -DNDEBUG -std:c++17 -MD -Z7 -Wmissing-prototypes -Werror=missing-prototypes /permissive- /d2implyavx512upperregs- /EHsc /bigobj -fms-runtime-lib=dll -D__HIP_PLATFORM_AMD__=1 -DCUDA_HAS_FP16=1 -DUSE_ROCM -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DTORCH_HIP_VERSION=700 -Wno-shift-count-negative -Wno-shift-count-overflow -Wno-duplicate-decl-specifier -DCAFFE2_USE_MIOPEN -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP -std=c++17 -DHIPBLAS_V2 -DHIP_ENABLE_WARP_SYNC_BUILTINS -fms-extensions -Wno-ignored-attributes /showIncludes /Fofunctorch\CMakeFiles\functorch.dir\csrc\dim\dim.cpp.obj /Fdfunctorch\CMakeFiles\functorch.dir\ -c -- B:\src\torch\functorch\csrc\dim\dim.cpp
clang-cl: warning: unknown argument ignored in clang-cl: '-std=c++17' [-Wunknown-argument]
clang-cl: warning: argument unused during compilation: '/d2implyavx512upperregs-' [-Wunused-command-line-argument]
In file included from B:\src\torch\functorch\csrc\dim\dim.cpp:36:
B:\src\torch\functorch\csrc\dim\arena.h(14,21): error: functions that differ only in their return type cannot be overloaded
14 | inline unsigned int __builtin_clz(unsigned int x) {
| ~~~~~~~~~~~~ ^
C:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\lib\llvm\lib\clang\20\include\ia32intrin.h(60,15): note: '__builtin_clz' is a builtin with type 'int (unsigned int) noexcept'
60 | return 31 - __builtin_clz((unsigned int)__A);
| ^
1 error generated.
[7100/7147] Building CXX object caffe2\torch\CMakeFiles\torch_python.dir\csrc\utils\tensor_list.cpp.obj
```
> [!NOTE]
> I haven't been able to reproduce those errors locally, but we have CI jobs that consistently fail when building for Python 3.11 but not 3.12 or 3.13. I'm not sure what is different between those builds, but the code fix seems correct.
There are a few other variations on fixes to this floating around, such as:
* a97a957af0/lz4.c (L34-L43) (checking with `__has_builtin`)
* c98c55ec7e/lj92.c (L31-L46) (the same code as here, but with `_MSC_VER`)
* 2760e5a2bb/def.h (L23-L25) (using `__lzcnt` instead of a custom implementation)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159273
Approved by: https://github.com/Skylion007, https://github.com/m-gallus
The previous version of `torch._tensor_str._Formatter` did not use `PRINT_OPTS.sci_mode` when computing `max_width`, but did use it when formatting values, which led to a discrepancy between the two.
Now the code first checks whether it should be in sci_mode, then computes `max_width`.
Here is an example to test the behavior:
```python
import torch

A = torch.tensor([10, 1e-1, 1e-2])
B = torch.tensor([10, 1e-1, 1e-1])
print("================= Default =================")
print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")
print("================= sci_mode=False =================")
with torch._tensor_str.printoptions(sci_mode=False):
    print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
    print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")
print("================= sci_mode=True =================")
with torch._tensor_str.printoptions(sci_mode=True):
    print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
    print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")
```
In the current version this prints:
```
================= Default =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7
================= sci_mode=False =================
tensor([ 10.0000, 0.1000, 0.0100]) Formatter max_width: 10
tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7
================= sci_mode=True =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([1.0000e+01, 1.0000e-01, 1.0000e-01]) Formatter max_width: 7
```
One can see that with `sci_mode=False`, the values of A are prefixed with unneeded padding and A does not have the same `max_width` as B (it keeps the `max_width` from `sci_mode=None`).
Also, with `sci_mode=True`, the `max_width` for B is 7 but each value takes 10 chars. This is harmless in practice because the code that uses `max_width` does not rely much on it, but it is still misleading.
After this commit, this will print:
```
================= Default =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7
================= sci_mode=False =================
tensor([10.0000, 0.1000, 0.0100]) Formatter max_width: 7
tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7
================= sci_mode=True =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([1.0000e+01, 1.0000e-01, 1.0000e-01]) Formatter max_width: 10
```
This also aligns A with B for `sci_mode=False`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126859
Approved by: https://github.com/malfet
This refactors the pipelining schedule tests, since many of them repeat the same code:
1. Create pipelined model and reference model
2. Run reference model and pipelined model
3. Compare gradients
The repeated parts are moved into helper methods, which removes ~300 LOC. This also adds a better gradient check to resolve flakiness (fixes https://github.com/pytorch/pytorch/issues/154408).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158780
Approved by: https://github.com/wconstab
**Summary**
This PR adds an all-gather based FlexAttention and uses TorchFunctionMode to dispatch
`FlexAttentionHOP.__call__` to it.
This PR makes the following changes:
- add a user-facing API `create_cp_block_mask` for creating CP-specific `BlockMask`
which masks over the attention result of Q shard and KV global.
- add `_ContextParallelGlobalVars` to store all necessary global vars that CP FlexAttention
requires. `torch_function_mode` is critical to maintain singleton mode to avoid dynamo
recompilations.
- add a dispatch path for `FlexAttentionForwardHOP.__call__` (TorchFunctionMode dispatch
won't work correctly without this line)
What's not in this PR:
- QKV load balancing
- Test on other masking besides `causal_mask`.
- Support on small attention (i.e. qkv size is smaller than 128) because the block mask
rewrite function requires `Q_BLOCK_SIZE == KV_BLOCK_SIZE == 128`.
**Test**
`pytest test/distributed/tensor/test_attention.py -s -k test_ring_flex_attention`
**Followup**
1. create an issue to reproduce the error in `create_fw_bw_graph()` when trying to call `create_block_mask`
to re-write `block_mask` in `FlexAttentionHOP` dispatch in `TorchFunctionMode`.
2. Merge `_ContextParallelGlobalVars` and `_cp_options`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158692
Approved by: https://github.com/drisspg
The current executorch pin needs to be updated.
The next time the docker image gets rebuilt, the executorch docker build is going to fail like https://github.com/pytorch/pytorch/actions/runs/16626853655/job/47137807966
The failure is that the pin uses a version of the nightly that has been removed from the nightly index:
```
#62 72.30 ERROR: Could not find a version that satisfies the requirement torch==2.8.0.dev20250601 (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.4.0, 2.4.1, 2.5.0, 2.5.1, 2.6.0, 2.7.0, 2.7.1, 2.8.0.dev20250602+cpu, 2.8.0.dev20250603+cpu, 2.8.0.dev20250604+cpu, 2.8.0.dev20250605+cpu, 2.8.0.dev20250606+cpu, 2.8.0.dev20250607+cpu, 2.8.0.dev20250608+cpu, 2.8.0.dev20250609+cpu, 2.8.0.dev20250610+cpu, 2.8.0.dev20250611+cpu, 2.8.0.dev20250612+cpu, 2.8.0.dev20250613+cpu, 2.8.0.dev20250614+cpu, 2.8.0.dev20250615+cpu, 2.8.0.dev20250616+cpu, 2.8.0.dev20250617+cpu, 2.8.0.dev20250618+cpu, 2.8.0.dev20250619+cpu, 2.8.0.dev20250620+cpu, 2.8.0.dev20250621+cpu, 2.8.0.dev20250622+cpu, 2.8.0.dev20250623+cpu, 2.8.0.dev20250624+cpu, 2.8.0.dev20250625+cpu, 2.8.0.dev20250626+cpu, 2.8.0.dev20250627+cpu, 2.9.0.dev20250628+cpu, 2.9.0.dev20250629+cpu, 2.9.0.dev20250630+cpu, 2.9.0.dev20250701+cpu, 2.9.0.dev20250702+cpu, 2.9.0.dev20250703+cpu, 2.9.0.dev20250704+cpu, 2.9.0.dev20250705+cpu, 2.9.0.dev20250706+cpu, 2.9.0.dev20250707+cpu, 2.9.0.dev20250708+cpu, 2.9.0.dev20250709+cpu, 2.9.0.dev20250710+cpu, 2.9.0.dev20250711+cpu, 2.9.0.dev20250712+cpu, 2.9.0.dev20250713+cpu, 2.9.0.dev20250714+cpu, 2.9.0.dev20250715+cpu, 2.9.0.dev20250716+cpu, 2.9.0.dev20250717+cpu, 2.9.0.dev20250718+cpu, 2.9.0.dev20250719+cpu, 2.9.0.dev20250720+cpu, 2.9.0.dev20250722+cpu, 2.9.0.dev20250723+cpu, 2.9.0.dev20250724+cpu, 2.9.0.dev20250725+cpu, 2.9.0.dev20250726+cpu, 2.9.0.dev20250727+cpu, 2.9.0.dev20250728+cpu, 2.9.0.dev20250729+cpu, 2.9.0.dev20250730+cpu, 2.9.0.dev20250731+cpu)
#62 72.30 ERROR: No matching distribution found for torch==2.8.0.dev20250601
```
The executorch hash update currently fails due to https://github.com/pytorch/pytorch/actions/runs/16636773244/job/47079169392
```
2025-07-31T01:56:57.0249165Z + echo 'expecting triton to not be installed, but it is'
2025-07-31T01:56:57.0249614Z expecting triton to not be installed, but it is
2025-07-31T01:56:57.0249969Z + exit 1
2025-07-31T01:58:27.6764352Z ##[error]Final attempt failed. Child_process exited with error code 1
```
I believe the cause is https://github.com/pytorch/executorch/pull/11653, where the nightly pytorch is installed from our index, but then requirements-examples installs timm from pypi, which reinstalls pytorch, except it's the release build for cuda from pypi? That then causes triton to be installed.
I don't know what the intended behavior is, so I'm disabling the executorch docker build, the executorch build, and the nightly hash update. Apparently the test was already disabled because it was failing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159595
Approved by: https://github.com/malfet
**Summary**
`_weight_int8pack_mm` on CPU may cause a segmentation fault if the output shape is large (i.e., M * N is large). This is because the kernel computes the output buffer address by
```c++
auto* C_ptr = C_data + mb_start * N + nb_start;
```
where both `mb_start` and `N` are `int` and when they are large their product may overflow.
The solution is simple: declare these variables as `int64_t` so that the product won't overflow.
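To make the overflow concrete, here is a small sketch that emulates 32-bit `int` arithmetic in Python. The shapes are hypothetical and chosen only so that `mb_start * N` exceeds `2**31 - 1`; signed overflow is undefined behavior in C++, so the wraparound shown is just what commonly happens on two's-complement hardware, not a guaranteed result.

```python
# Illustration only: emulate a 32-bit signed C++ `int` in Python.
# `wrap_int32` and the values below are hypothetical, not taken from the kernel.

def wrap_int32(value: int) -> int:
    """Reduce an integer to the range of a 32-bit two's-complement int."""
    value &= 0xFFFFFFFF
    return value - 0x100000000 if value >= 0x80000000 else value

mb_start = 50_000  # row offset of the current block (hypothetical)
N = 50_000         # number of output columns (hypothetical)
nb_start = 0       # column offset of the current block (hypothetical)

# Offset as computed with `int` operands: the product wraps around.
offset_int32 = wrap_int32(wrap_int32(mb_start * N) + nb_start)
# Offset as computed with `int64_t` operands: the product fits.
offset_int64 = mb_start * N + nb_start

print("int   offset:", offset_int32)
print("int64 offset:", offset_int64)
```

With the wrapped value, `C_data + offset` points before the start of the buffer, which is consistent with the segmentation fault described above.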
**Test plan**
```
pytest -sv test/test_linalg.py -k test__int8_mm_large_shape
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158341
Approved by: https://github.com/mingfeima, https://github.com/drisspg
Fixes the typo `enought` -> `enough` in 3 places in these files:
```
aten/src/ATen/native/cuda/AdaptiveAveragePooling.cu
aten/src/ATen/native/cuda/CuFFTPlanCache.h
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159587
Approved by: https://github.com/ezyang
This PR is a big copy pasta from `c10/util/Float8*` -> `torch/headeronly/util/` which is why we are breaking PR sanity :C (sorry @albanD!).
Why is it not a clean copy paste?
- For BC reasons, we have to keep the old c10 file around so that OSS devs relying on those files can still get the same APIs
- Because we reexpose APIs that are headeronly through torch::headeronly, so there is an extra chunk of code in the new torch::headeronly files to do that.
Outside of the copy paste, I:
- changed the tests to call torch::headeronly instead of c10
- updated header_only_apis.txt
- added `// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)` to pass lint (which was previously skipped for -inl.h files)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159415
Approved by: https://github.com/albanD
- Sort strategy now supports sharding on non sorted dim.
~~- Fix histc xfail.~~
- ~~Previously `python test/distributed/tensor/test_dtensor_ops.py TestDTensorOpsCPU.test_dtensor_op_db_histc_cpu_float32` will fail with `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=18`. However, if we run `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=18 python test/distributed/tensor/test_dtensor_ops.py TestDTensorOpsCPU.test_dtensor_op_db_histc_cpu_float32`, the test will pass. This kind of error is due to DTensor reuses the strategy schema hashing. It turns out that not only the strategy, the result correctness also depends on `static_argnum` or the op will reuse the previous args from hashed schema and output wrong results. I updated the document also.~~ (fixed in https://github.com/pytorch/pytorch/pull/159289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159189
Approved by: https://github.com/XilunWu
scaled_grouped_mm's kernel only supports column-major on the second operand. I -think- this is just for efficiency reasons. But inductor treats that buffer as flexible and may tweak the strides to be row-major instead, as seen in the issue.
~Tagging the op as "needs_fixed_stride_order"/"needs_exact_strides" does not work. Inductor only considers those tags for ops that don't have registered lowering (not sure if this is intended). scaled_grouped_mm does have a lowering, so we never check its tags.~ From discussion below, the op tags are expected to work.
FIXES https://github.com/pytorch/pytorch/issues/159097
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159134
Approved by: https://github.com/eellison
This change introduces structured logging of the collective communication schedule, enabling downstream tools (e.g. TLParse) to ingest and analyze per‑rank collective‐order information for multi‑rank jobs.
- Iterates over scheduler.nodes, filters for _CollectiveKernel nodes
- Extracts each op’s python_kernel_name
- Emits a structured JSON payload under the inductor_collective_schedule artifact name
- Dumps the full schedule list to collective_schedule.json via the PyTorch trace‑structured artifact
- Added comprehensive unit tests for collective schedule tracing: Created test_collective_schedule_empty() and test_collective_schedule_real() tests to verify structured trace logging works correctly for both empty collective schedules and real collective operations (like all_reduce and wait_tensor from _c10d_functional ops).
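The filtering and dumping steps listed above can be sketched roughly as follows. This is a stand-alone illustration: `FakeKernel`, `FakeCollectiveKernel`, and `collect_schedule` are hypothetical stand-ins for the Inductor scheduler types, and the kernel names are made up.

```python
import json

# Hypothetical stand-ins for scheduler nodes; the real code inspects
# scheduler.nodes and checks for _CollectiveKernel.
class FakeKernel:
    def __init__(self, python_kernel_name: str):
        self.python_kernel_name = python_kernel_name

class FakeCollectiveKernel(FakeKernel):
    pass

def collect_schedule(nodes):
    """Return the kernel names of collective nodes, in schedule order."""
    return [
        node.python_kernel_name
        for node in nodes
        if isinstance(node, FakeCollectiveKernel)
    ]

nodes = [
    FakeKernel("mm"),
    FakeCollectiveKernel("all_reduce"),
    FakeKernel("relu"),
    FakeCollectiveKernel("wait_tensor"),
]

schedule = collect_schedule(nodes)
# The real change emits this through the trace-structured artifact;
# here the list is simply serialized to a JSON string.
payload = json.dumps(schedule)
print(payload)
```

An empty node list yields `[]`, which corresponds to the empty-schedule case covered by the tests.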
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159190
Approved by: https://github.com/yushangdi, https://github.com/xmfan
Summary:
VariadicOpConverter and FuseListUnpackConverter would introduce ops that only have CPU kernels.
Currently, the graph passes are run if static_dispatch is enabled.
As we plan to enable static_dispatch by default, this diff adds an additional check so that the graph passes only apply to nodes that have all of their inputs/outputs on CPU.
Test Plan:
CI
Rollback Plan:
Differential Revision: D79295640
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159519
Approved by: https://github.com/dolpm, https://github.com/henryoier
Summary: test_c10d_functional_native.py uses hard-coded buf names to check the generated code string. This is fragile given that Inductor can update its buffer naming implementation freely. Thus this PR uses name regex matching to find buffer names at the run time. This will solve issues like https://github.com/pytorch/pytorch/issues/147754. Currently we do name matching based on empty_strided_ calls. We can expand it later if needed.
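As a rough illustration of the idea, the sketch below pulls buffer names out of a code string by matching `empty_strided_` calls instead of hard-coding them. The `generated_code` text and the `BUF_PATTERN` regex are assumptions made up for this example, not the actual Inductor output or the pattern used in the test.

```python
import re

# Made-up snippet standing in for Inductor-generated code.
generated_code = """\
buf0 = empty_strided_cpu((4, 4), (4, 1), torch.float32)
buf7 = empty_strided_cuda((8, ), (1, ), torch.float32)
torch.ops._c10d_functional.all_reduce_.default(buf7, 'avg', '0')
"""

# Capture the name on the left-hand side of each empty_strided_* call.
BUF_PATTERN = re.compile(r"^\s*(\w+) = empty_strided_\w+\(", re.MULTILINE)
buf_names = BUF_PATTERN.findall(generated_code)
print(buf_names)

# Assertions can then be built from the discovered names rather than
# from fixed strings such as "buf0".
expected_call = f"all_reduce_.default({buf_names[1]}, "
assert expected_call in generated_code
```

Because the names are discovered at run time, the check keeps working if the buffer numbering changes.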
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159487
Approved by: https://github.com/yushangdi
ghstack dependencies: #159476
Summary: test_c10d_functional_native.py tests torch._inductor.config.cpp_wrapper as True and False. Currently torch._inductor.config.cpp_wrapper is set globally which can cause a problem when running the whole test file. This PR changes it to use patch context.
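The difference between a global assignment and a scoped patch can be sketched without torch. Here `fake_config` and `codegen_mode` are hypothetical stand-ins for the Inductor config module and the code that reads it, and `unittest.mock.patch.object` stands in for whichever patching helper the tests actually use.

```python
import types
from unittest import mock

# Hypothetical stand-in for a module-level config object.
fake_config = types.SimpleNamespace(cpp_wrapper=False)

def codegen_mode() -> str:
    """Pretend consumer that reads the flag at call time."""
    return "cpp" if fake_config.cpp_wrapper else "python"

# The override is only visible inside the `with` block and is restored on
# exit, even if the body raises, so other tests in the file are unaffected.
with mock.patch.object(fake_config, "cpp_wrapper", True):
    inside = codegen_mode()

after = codegen_mode()
print(inside, after)
```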
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159476
Approved by: https://github.com/yushangdi
\# Why
- Make loop iteration simpler
- Have a common spot where to make modifications that affect
all the GEMM Triton templates, avoiding missed spots
\# What
- pull out common logic of taking the BaseConfig objects
and turning them into kwargs to feed into maybe_append_choice
for Triton GEMM templates
Differential Revision: [D79186962](https://our.internmc.facebook.com/intern/diff/D79186962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159383
Approved by: https://github.com/jansel
Hi @desertfire, according to the latest test [results](https://github.com/pytorch/pytorch/actions/runs/15385952839) from the inductor nightly for max_autotune tests, we plan to update the baseline data:
In the latest nightly test, two models require baseline updates:
- vision_maskrcnn: This model shows improved graph breaks, so I’ve updated the baseline accordingly.
- detectron2_fcos_r_50_fpn: This model has a different number of graph breaks. However, since its accuracy result still shows fail_accuracy, I skipped the graph break check for this model.
```
vision_maskrcnn IMPROVED: graph_breaks=29, expected=30
Improvement: 1 models have fixed dynamo graph breaks:
vision_maskrcnn
```
```
detectron2_fcos_r_50_fpn XFAIL
detectron2_fcos_r_50_fpn FAIL: graph_breaks=24, expected=22
Error: 1 models have new dynamo graph breaks:
detectron2_fcos_r_50_fpn
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154973
Approved by: https://github.com/desertfire
Previously, we logged `skipping cudagraphs due to [xxx reasons]` when there were cudagraph-unsafe ops. With graph partition, we split off these ops and cudagraph the remaining parts, but the log message is also skipped.
In this PR, we add logs for graph partition reasons and the number of partitions to better understand the workload.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159425
Approved by: https://github.com/eellison
Fixes
`RuntimeError: self and mat2 must have the same dtype, but got BFloat16 and Float`
With bf16 autocast, the bias is converted into BFloat16, but `fp8_qlinear_onednn_ref` does not support a bf16 bias.
This PR converts the bias into bf16 in `fp8_qlinear_onednn_ref`.
This case is added to the unit tests. To reproduce:
`python test/test_quantization.py -k test_qlinear_fp8`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159125
Approved by: https://github.com/Xia-Weiwen, https://github.com/cyyever, https://github.com/CaoE
Summary: Fixes a clear template typo where `a_desc_ptr` was passed instead of `b_desc_ptr` to define `b_desc`.
Test Plan:
Found by inspection.
Rollback Plan:
Reviewed By: NoamPaz
Differential Revision: D79178538
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159407
Approved by: https://github.com/NikhilAPatel
Summary:
The `replace_hook` is called once for each user of the replaced node. This fix avoids adding duplicated node sources.
This also means that if there are two nested passes like:
```
with GraphTransformObserver(gm, "outer"):
    with GraphTransformObserver(gm, "inner"):
        .....
```
We'll only see the outer pass's pass name recorded for the replaced node in the "from_node" node meta. I think this is fine. In practice, the outer pass usually contains a more meaningful name, e.g. `decompose_auto_functionalized`, and the inner pass name is just a default pass name like `pattern_matcher`.
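A minimal sketch of the deduplication idea, assuming a hook that fires once per user of the replaced node. `FakeNode` and `record_source` are hypothetical, and plain tuples stand in for whatever the real "from_node" entries contain.

```python
# Hypothetical stand-in for an FX node with a meta dict.
class FakeNode:
    def __init__(self, name: str):
        self.name = name
        self.meta = {}

def record_source(new_node: FakeNode, old_name: str, pass_name: str) -> None:
    """Append (old_name, pass_name) to meta['from_node'] only once."""
    sources = new_node.meta.setdefault("from_node", [])
    entry = (old_name, pass_name)
    if entry not in sources:
        sources.append(entry)

new_node = FakeNode("new")
# The hook is invoked once per user of the replaced node, so the same
# replacement is reported several times.
for _user in ("user_a", "user_b", "user_c"):
    record_source(new_node, "old", "outer")

print(new_node.meta["from_node"])
```

Without the membership check, the list would contain one identical entry per user.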
Test Plan:
```
buck2 run @mode/dev-nosan fbcode//caffe2/test:fx -- -r test_graph_transform_observer_replace
```
Rollback Plan:
Differential Revision: D79203058
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159484
Approved by: https://github.com/angelayi
Summary:
We found that we don't set group_name correctly inside group_split, because we are setting group_name on `deviceTypeToBackend_`, which is only populated after `setBackend`. The same applies to group_desc. I added more unit tests for it.
We need to call setGroupName correctly; otherwise, this will break the DeviceMesh use case when split_group is used in DeviceMesh.
Also, ncclx needs to be aware that its Option is a subclass of BackendOption.
Test Plan:
CI
Rollback Plan:
Differential Revision: D79201132
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159429
Approved by: https://github.com/xunnanxu
This PR is a bit more involved but effectively works to drastically simplify PyObjectSlot and PyInterpreter.
1) For PyObjectSlot we now use a global pyinterpreter since there only is one. From here we change all of the call sites to rely on this assumption.
2) We also remove the "tags" of the PyInterpreter by deprecating `PyInterpreterStatus`.
For the reviewer, sadly it seems like `functorch/csrc/dim/dim.cpp` needed to get linted, so there is an unreadable amount of changes there. Fortunately, the only actual change in the file is as follows which just removes `getPyInterpreter()` from the `check_pyobj` call.
```
mpy::handle handle_from_tensor(Arena& A, TensorRef t) {
- // fast case: tensor is live in python
- std::optional<PyObject*> mb_obj =
- t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(getPyInterpreter(), /*ignore_hermetic_tls=*/false);
- if (mb_obj.has_value() && !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
- return *mb_obj;
- }
- return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
-}
-}
+ // fast case: tensor is live in python
+ std::optional<PyObject*> mb_obj =
+ t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(
+ /*ignore_hermetic_tls=*/false);
+ if (mb_obj.has_value() &&
+ !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
+ return *mb_obj;
+ }
+ return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
+}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158427
Approved by: https://github.com/albanD
Hi team,
Please help review this patch.
This PR https://github.com/pytorch/pytorch/pull/150370 tried to fix the "Empty C Call Queue" problem on Python 3.12. It added C calls for each starting Python event with a callable.
I found that the root cause is not that we cannot get C function frames via `PyFrame_GetBack` when PythonTracer is filling start frames, but a C call event loss bug in Python 3.12.0-3.12.4. That bug was fixed by 257c413cd1 in 3.12.5.
So I think https://github.com/pytorch/pytorch/pull/150370 cannot fix the problem, and this patch reverts its change.
There are ways to fix the problem correctly, such as adding a new monitoring callback to compensate for call events of methods with C functions, or overriding the callback registered by `PyEval_SetProfile`. These solutions may make the code hard to maintain.
~~Since upgrading the micro version of Python is not difficult for users, we can just ignore C functions and suggest user upgrade.~~
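For anyone who wants to check whether an interpreter falls in the affected range, a small helper might look like this. The function name is hypothetical, and the version range is taken directly from the description above rather than verified independently.

```python
import sys

def in_affected_python_range(version=None) -> bool:
    """True for Python 3.12.0 through 3.12.4, the range named above."""
    major, minor, micro = (version or sys.version_info)[:3]
    return (major, minor) == (3, 12) and micro <= 4

print(in_affected_python_range())           # current interpreter
print(in_affected_python_range((3, 12, 4)))
print(in_affected_python_range((3, 12, 5)))
```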
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155446
Approved by: https://github.com/sraikund16
Fixes #154111
Resolves an issue during compilation with dynamic shapes where `torch._inductor.decomposition.mm` evaluates the SymInt expression for the input tensor due to a for loop, and thus the output tensor is not dynamically shaped. This issue is limited to (Mx1)x(1xN) small matrix multiplications, and creates an explicit error with tensor subclasses such as DTensor.
The proposed fix replaces the loop with a simple product instead. Benchmark currently running https://hud.pytorch.org/benchmark/compilers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158998
Approved by: https://github.com/jansel, https://github.com/BoyuanFeng
Summary:
### PR Context
- Kill background process only when PG init fails or there is an explicit `TERMINATE` signal from main process.
- When a checkpoint fails to save, log and return the error but continue the serving loop.
Test Plan:
CI
Rollback Plan:
Differential Revision: D79177410
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159374
Approved by: https://github.com/sibuachu
Essence of this copypasta:
- combine Half-inl.h and Half.h in c10/util -> torch/headeronly/util/Half.h
- Add NOLINTNEXTLINE's to the portions of Half-inl.h that were previously in the ignore list of clangtidy
- Re-expose all APIs in namespaces and through includes of the original files. Ideally, we would have the APIs in torch::headeronly and reexpose them in c10, but that runs into BC issues (see D78997465) so for now we are keeping the APIs in c10 but reexposing them in torch::headeronly.
- Change test cases in test_aoti_abi_check to test torch::headeronly::Half vs c10::Half (they're the same thing but we eventually want all the tests for headeronly APIs to only import from headeronly).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159172
Approved by: https://github.com/albanD, https://github.com/desertfire
This is a follow up on the PR #154382, as the issue still persists:
```
File "/opt/pytorch/pytorch/torch/distributed/rpc/__init__.py", line 81, in <module>
from . import api, backend_registry, functions
File "/opt/pytorch/pytorch/torch/distributed/rpc/api.py", line 35, in <module>
from .constants import DEFAULT_SHUTDOWN_TIMEOUT, UNSET_RPC_TIMEOUT
File "/opt/pytorch/pytorch/torch/distributed/rpc/constants.py", line 3, in <module>
from torch._C._distributed_rpc import (
ImportError: cannot import name '_DEFAULT_NUM_WORKER_THREADS' from 'torch._C._distributed_rpc' (unknown location)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159461
Approved by: https://github.com/lw
Summary: Sometimes the call history recorded in a `nn_module_stack` does not have the stack property, where each FQN is a prefix of the next FQN. This can cause errors during `unflatten`. Instead of erroring we now drop entries from such a `nn_module_stack` to restore the stack property. This effectively leads to less unflattening: the last FQN in the call history before the stack property was broken keeps the entire flat subgraph of its call.
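The stack property and one way of restoring it can be sketched as follows. This is an interpretation of the description, not the actual `unflatten` code: `is_fqn_prefix` and `restore_stack_property` are hypothetical helpers, and truncating at the first violation is an assumption about how entries are dropped.

```python
def is_fqn_prefix(parent: str, child: str) -> bool:
    """True if `parent` is a module-path prefix of `child` ('' is the root)."""
    return parent == "" or child == parent or child.startswith(parent + ".")

def restore_stack_property(fqns):
    """Keep entries up to the first one that breaks the prefix chain."""
    kept = []
    for fqn in fqns:
        if kept and not is_fqn_prefix(kept[-1], fqn):
            break  # assumption: everything from the first violation is dropped
        kept.append(fqn)
    return kept

well_formed = ["", "encoder", "encoder.layer0"]
broken = ["", "encoder", "encoder.layer0", "decoder.proj"]

print(restore_stack_property(well_formed))
print(restore_stack_property(broken))
```

In the broken example the last surviving FQN is `encoder.layer0`, so under this sketch the nodes attributed to `decoder.proj` would stay in the flat subgraph of that call.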
Test Plan:
added test, updated another
Rollback Plan:
Differential Revision: D79204669
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159418
Approved by: https://github.com/angelayi
torch.compile of `all_to_all_vdev_2d` hits the following error:
```
torch._dynamo.exc.BackendCompilerFailed: backend='aot_eager' raised:
RuntimeError: Found a custom (non-ATen) operator whose output has alias annotations: symm_mem::all_to_all_vdev_2d(Tensor input, Tensor(a!) out, Tensor in_splits, Tensor(a!) out_splits_offsets, str group_name, int? major_align=None) -> Tensor(a!). We only support functionalizing operators whose outputs do not have alias annotations (e.g. 'Tensor(a)' is a Tensor with an alias annotation whereas 'Tensor' is a Tensor without. The '(a)' is the alias annotation). The alias annotation specifies that the output Tensor shares storage with an input that has the same annotation. Please check if (1) the output needs to be an output (if not, don't return it), (2) if the output doesn't share storage with any inputs, then delete the alias annotation. (3) if the output indeed shares storage with an input, then add a .clone() before returning it to prevent storage sharing and then delete the alias annotation. Otherwise, please file an issue on GitHub.
```
This PR selects option (1).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159435
Approved by: https://github.com/ngimel, https://github.com/xmfan
# Motivation
This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`.
Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156175
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312, #156165
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a following PR and keep them only for BC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908
# Motivation
This PR aims to generalize `AllocatorConfig` to be device-agnostic. It introduces the class `AcceleratorAllocatorConfig` to clarify its scope as a configuration manager for accelerator backends (e.g., CUDA, XPU). The name `AllocatorConfig` is now reserved for a potential future base class that can unify configuration handling for both CPU and accelerator allocators, should similar requirements arise for the CPU path.
# Design Rule
## Overall
This class configures memory allocation for both device and host memory. A single `AcceleratorAllocatorConfig` instance is shared across all accelerator backends, such as CUDA and XPU, under the assumption that relevant environment variables apply uniformly to all accelerators. Device-specific configuration extensions are supported via hooks (see `registerDeviceConfigParserHook`).
Introduce a new class `ConfigTokenizer` to help process the key-value pairs in the env variable config.
## Naming Convention:
- Public API names in `AcceleratorAllocatorConfig` should be device-generic.
- Members prefixed with `pinned_` are specific to the host/pinned allocator.
- Environment variable names should be generic across backends.
- Configuration is written as comma-separated key-value pairs in the format `key:value`. Use square brackets `[]` for list values. Example: `key1:123, key2:[val1,val2]`
## Environment Variables:
- The default environment variable for configuration is `PYTORCH_ALLOC_CONF`.
- For backward compatibility, `PYTORCH_CUDA_ALLOC_CONF` and `PYTORCH_HIP_ALLOC_CONF` are also supported with lower priority.
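The format and lookup order described above can be sketched in a few lines of Python. This is only an illustration of the `key:value` syntax and of the environment variable priority, not the C++ `ConfigTokenizer` itself. The function names `parse_alloc_conf` and `resolve_alloc_conf` are hypothetical, and the relative order of the two legacy variables is an assumption, since the text only says they have lower priority than `PYTORCH_ALLOC_CONF`.

```python
import os


def parse_alloc_conf(conf: str) -> dict:
    """Parse 'key1:123, key2:[val1,val2]' into a dict of strings and lists."""
    result = {}
    i, n = 0, len(conf)
    while i < n:
        # Skip separators and whitespace between pairs.
        while i < n and conf[i] in ", \t":
            i += 1
        if i >= n:
            break
        colon = conf.find(":", i)
        if colon == -1:
            raise ValueError(f"expected 'key:value' at position {i}")
        key = conf[i:colon].strip()
        i = colon + 1
        while i < n and conf[i] == " ":
            i += 1
        if i < n and conf[i] == "[":
            # List value: everything up to the closing bracket.
            close = conf.find("]", i)
            if close == -1:
                raise ValueError(f"missing ']' for key {key!r}")
            result[key] = [v.strip() for v in conf[i + 1:close].split(",") if v.strip()]
            i = close + 1
        else:
            # Scalar value: everything up to the next comma.
            comma = conf.find(",", i)
            end = n if comma == -1 else comma
            result[key] = conf[i:end].strip()
            i = end
    return result


def resolve_alloc_conf(environ=os.environ):
    """Return the first config string found, generic variable first."""
    # Assumption: the order of the two legacy names is illustrative only.
    for name in ("PYTORCH_ALLOC_CONF", "PYTORCH_CUDA_ALLOC_CONF", "PYTORCH_HIP_ALLOC_CONF"):
        value = environ.get(name)
        if value is not None:
            return value
    return None


parsed = parse_alloc_conf("key1:123, key2:[val1,val2]")
```

Values are kept as strings here; interpreting them (for example, as sizes or booleans) is left to whichever backend consumes the key.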
Differential Revision: [D79011786](https://our.internmc.facebook.com/intern/diff/D79011786)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149601
Approved by: https://github.com/albanD
Summary: AI system co-design team requested to add user annotation for FX graph cache key in PyTorch Kineto trace and Execution trace. With this annotation, they can know the FX graph to which the kernels belong.
Test Plan:
buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA
Rollback Plan:
Differential Revision: D79019069
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159318
Approved by: https://github.com/sraikund16, https://github.com/jansel
Essence of this copypasta:
- combine Half-inl.h and Half.h in c10/util -> torch/headeronly/util/Half.h
- Add NOLINTNEXTLINE's to the portions of Half-inl.h that were previously in the ignore list of clangtidy
- Re-expose all APIs in namespaces and through includes of the original files. Ideally, we would have the APIs in torch::headeronly and reexpose them in c10, but that runs into BC issues (see D78997465) so for now we are keeping the APIs in c10 but reexposing them in torch::headeronly.
- Change test cases in test_aoti_abi_check to test torch::headeronly::Half vs c10::Half (they're the same thing but we eventually want all the tests for headeronly APIs to only import from headeronly).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159172
Approved by: https://github.com/albanD, https://github.com/desertfire
Sphinx likes titles and complains when they are missing. This adds a title to address the following warning in the build:
```
WARNING: toctree contains reference to document 'distributed._dist2' that doesn't have a title: no link will be generated
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159385
Approved by: https://github.com/d4l3k
Straightup copy pasta. Keeps APIs in c10 and reexposes them to torch::headeronly.
It is arguable that we should just get rid of some of these unused dtypes but that is outside the scope of this PR, which is meant to build up to ScalarType moving to headeronly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159302
Approved by: https://github.com/malfet, https://github.com/albanD
Switch from guard_size_oblivious to guard_or_false. If a DDE is encountered, this avoids folding the 3d bmm into a mm.
806d9e3fe7/torch/_decomp/decompositions.py (L4506-L4512)
## DDE
```
File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4506, in matmul
elif should_fold(tensor1, tensor2, is_out):
File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4472, in should_fold
if guard_size_oblivious(t1.numel() == 0):
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(12*((u0//2)), 0) (unhinted: Eq(12*((u0//2)), 0)). (Size-like symbols: none)
Caused by: (_decomp/decompositions.py:4472 in should_fold)
```
```
File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4506, in matmul
elif should_fold(tensor1, tensor2, is_out):
File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4483, in should_fold
return all(
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(3*((u0//2)), 3) (unhinted: Eq(3*((u0//2)), 3)). (Size-like symbols: none)
Caused by: (_decomp/decompositions.py:4483 in should_fold)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159184
Approved by: https://github.com/ezyang
ghstack dependencies: #158894
This only handles AttributeError, but in general, any exception coming from
here is a user exception. Let me know if we prefer to catch all exceptions and then reraise them as observed exceptions.
```
File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/symbolic_convert.py", line 2200, in CALL_FUNCTION
self.call_function(fn, args, {})
File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/symbolic_convert.py", line 1210, in call_function
self.push(fn.call_function(self, args, kwargs)) # type: ignore[arg-type]
File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/variables/lazy.py", line 201, in realize_and_forward
return getattr(self.realize(), name)(*args, **kwargs)
File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/variables/nn_module.py", line 472, in call_function
initialize_lazy_module(tx, mod, args, kwargs)
File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/variables/nn_module.py", line 104, in initialize_lazy_module
mod._infer_parameters(mod, fake_args, fake_kwargs)
File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/nn/modules/lazy.py", line 261, in _infer_parameters
module.initialize_parameters(*args, **kwargs)
...,
File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/nn/modules/module.py", line 1962, in __getattr__
raise AttributeError(
torch._dynamo.exc.InternalTorchDynamoError: AttributeError: '...' object has no attribute '...'
```
Note that we crash with a slightly different exception trace in the other test I added. Let me know if we want this to not throw directly to the end user.
```
======================================================================
ERROR: test_lazy_module_bad_params (__main__.NNModuleTests.test_lazy_module_bad_params)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/data/users/clr/pytorch/torch/testing/_internal/common_utils.py", line 3223, in wrapper
method(*args, **kwargs)
~~~~~~^^^^^^^^^^^^^^^^^
File "/data/users/clr/pytorch/test/dynamo/test_modules.py", line 1683, in test_lazy_module_bad_params
exp_res = opt_m(x, y)
File "/data/users/clr/pytorch/torch/_dynamo/eval_frame.py", line 411, in __call__
return super().__call__(*args, **kwargs)
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/data/users/clr/pytorch/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/data/users/clr/pytorch/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
File "/data/users/clr/pytorch/torch/_dynamo/eval_frame.py", line 473, in _call_lazy_check
self._orig_mod._infer_parameters(self._orig_mod, args, kwargs)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/clr/pytorch/torch/nn/modules/lazy.py", line 261, in _infer_parameters
module.initialize_parameters(*args, **kwargs)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/data/users/clr/pytorch/test/dynamo/test_modules.py", line 711, in initialize_parameters
self.foo += 1
^^^^^^^^
File "/data/users/clr/pytorch/torch/nn/modules/module.py", line 1962, in __getattr__
raise AttributeError(
f"'{type(self).__name__}' object has no attribute '{name}'"
)
AttributeError: 'LazyModuleBadInferParams' object has no attribute 'foo'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158501
Approved by: https://github.com/williamwen42, https://github.com/jansel
# Motivation
While refactoring the caching allocator, I noticed that the `AllocParams` constructor on CUDA had an unused parameter. This change removes that unused argument to avoid potential confusion.
# Additional Context
I noticed that `AllocParams` is defined in cpp file, so it should be safe to make this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159159
Approved by: https://github.com/cyyever, https://github.com/albanD
This PR makes some less risky changes to PyObjectSlot, as there is a lot of stuff we do not need now that there is only one interpreter. Specifically, `check_interpreter` and `has_pyobj_nonhermetic` are removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158407
Approved by: https://github.com/albanD
ghstack dependencies: #158290, #158291
This PR removes the integration point torch.fx had with torch::deploy (and another minor change).
Note: This PR has some broken mypy errors, but I believe those should have been in the code base beforehand, and should be fixed in a separate PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291
Approved by: https://github.com/albanD
ghstack dependencies: #158290
The MI355 CI regression and the hiprtc kernel compilation failures are caused by duplicate definitions of traits, leading to errors like `error: redefinition of 'integral_constant'`. This seems to be the culprit: https://github.com/pytorch/pytorch/pull/158868. This PR checks whether using the HIP version instead of the ROCm version for the check resolves the issue, since the ROCm and HIP versions aren't synced: the ROCm 7.0 Alpha build used in CI is still on HIP 6.5.
Confirmed that this patch works here: https://github.com/pytorch/pytorch/actions/runs/16579227179?pr=159292
Also, for now this PR increases the frequency of this MI355 CI to twice a day so we can catch and identify regressions more easily if they happen.
Jeff is on vacation, so Jithun asked me to reach out to y'all. Please help stamp and approve, so we can resolve the recent MI355 CI regression/timeout (https://github.com/pytorch/pytorch/actions/workflows/rocm-mi355.yml) :) @huydhn @malfet @atalman @seemethere
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159292
Approved by: https://github.com/malfet
Summary: We are trying to deprecate torch deploy externally. However, a bunch of legacy stuff still uses it. This PR allows the legacy tests to still run if necessary.
Test Plan:
It's a targets change so CI should suffice
Rollback Plan:
Differential Revision: D78910653
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159307
Approved by: https://github.com/albanD
# Note - On Lambda guarding of object aliasing
# We previously installed object‑aliasing guards as relational guards,
# but that undermined the recursive‑dict guard optimization: placing the
# aliasing guard at a leaf prevented the parent dict node from
# qualifying as a recursive‑dict guard root. Because aliasing guards are
# rare, we now emit them as epilogue guards via a small Python lambda.
# This repeats the access in Python—adding a bit of work—but the
# overhead is outweighed by the gains from enabling recursive‑dict guard
# optimization.
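A minimal sketch of what an aliasing check expressed as a small Python lambda could look like. This is not Dynamo's guard code: `make_alias_guard` and the accessor lambdas are hypothetical names, and the real guards operate on guard managers rather than plain dicts. The point is only that the check re-reads both access paths in Python and compares identity.

```python
def make_alias_guard(get_a, get_b):
    """Return an epilogue check that two access paths yield the same object."""
    # Each accessor repeats the attribute/item lookups in Python.
    return lambda scope: get_a(scope) is get_b(scope)


shared = [1, 2, 3]
scope = {"x": {"buf": shared}, "y": shared}

# Guard that scope["x"]["buf"] and scope["y"] are the very same object.
guard = make_alias_guard(lambda s: s["x"]["buf"], lambda s: s["y"])

still_aliased = guard(scope)
# Equal contents but distinct objects: the aliasing guard fails.
no_longer_aliased = guard({"x": {"buf": [1, 2, 3]}, "y": [1, 2, 3]})
```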
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159288
Approved by: https://github.com/StrongerXi
Summary:
A fallback kernel's output may be neither a list nor a tuple, but a `MultiOutput` with empty indices. Allow the `FXConverter` to handle this case.
Test Plan:
Modified the fxir test for fallbacks, then ran `buck2 test mode/dev-nosan caffe2/test/inductor:fxir_backend -- test_fallback`.
Before this diff the modified test would fail with
```
File "/re_cwd/buck-out/v2/gen/fbcode/e2105f7329ead90a/caffe2/test/inductor/__fxir_backend__/fxir_backend#link-tree/torch/_inductor/codegen/wrapper_fxir.py", line 341, in generate
line.codegen_fx(self)(line)
File "/re_cwd/buck-out/v2/gen/fbcode/e2105f7329ead90a/caffe2/test/inductor/__fxir_backend__/fxir_backend#link-tree/torch/_inductor/codegen/wrapper_fxir.py", line 489, in _generate_multi_output
inds = line.indices[0][1:]
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
IndexError: list index out of range
```
(Full error paste in P1878839403)
With this diff the error is no longer present.
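The failure mode is simply indexing into an empty list. The sketch below shows one way to guard against it. It is an assumption about the shape of the fix, not the actual `FXConverter` change: `output_indices` is a hypothetical helper, and treating empty indices as "use the fallback result directly" is a guess at the intended behavior.

```python
def output_indices(indices):
    """Return the index path for a MultiOutput, or () when there is none."""
    if not indices:
        # Empty indices: there is nothing to index into.
        return ()
    # Same expression as in the traceback: drop the leading type entry.
    return tuple(indices[0][1:])


try:
    [][0][1:]  # what the unguarded code effectively did
    unguarded_failed = False
except IndexError:
    unguarded_failed = True
```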
Rollback Plan:
Differential Revision: [D79126619](https://our.internmc.facebook.com/intern/diff/D79126619)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159297
Approved by: https://github.com/blaine-rister
This PR is part of the work to deprecate torch::deploy in OSS. Effectively it does 3 things to get started.
1. Remove test_deploy_interaction as we no longer need to worry about this
2. Remove all torch._running_with_deploy checks and use the False path always (surfaced 1)
3. Remove `USE_DEPLOY` and switch to the default path always
Note: MyPy does fail on a bunch of things here as a bunch of older files are touched. It may be better to fix these things on a separate PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158288
Approved by: https://github.com/albanD
Fixes #157452
Test with
```
python test/dynamo/test_repros.py ReproTests.test_nn_parameter_ctor_graph_breaks
```
### Release Notes
Change to nn.Parameter Constructor Behavior in Dynamo
This PR introduces a semantic change to the nn.Parameter constructor. Previously, if the constructor lacked a clean source, the system would attempt to infer arguments to construct a clone and lift this synthetic proxy into the computation graph. This approach had many potential edge cases and was difficult to reason about. The new behavior defaults to graph breaking when the nn.Parameter constructor does not have a clean source. Users are advised to manually move the constructor out of the graph in such cases. This change improves clarity and reduces complexity in graph construction and debugging. If moving the constructor is not possible, users can fall back to the old semantics with `torch._dynamo.config.graph_break_on_nn_param_ctor=False`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158800
Approved by: https://github.com/anijain2305
Fixes #154111
Resolves an issue during compilation with dynamic shapes where `torch._inductor.decomposition.mm` evaluates the SymInt expression for the input tensor due to a for loop, and thus the output tensor is not dynamically shaped. This issue is limited to (Mx1)x(1xN) small matrix multiplications, and creates an explicit error with tensor subclasses such as DTensor.
The proposed fix replaces the loop with a simple product instead. Benchmark currently running https://hud.pytorch.org/benchmark/compilers
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158998
Approved by: https://github.com/jansel, https://github.com/BoyuanFeng
Summary:
Remove use of targetDevice in KernelFactory.
AOTI would infer device when creating AOTIDelegateExecutor.
Test Plan:
CI
Rollback Plan:
Reviewed By: dolpm
Differential Revision: D79007317
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159298
Approved by: https://github.com/dolpm
This adds an option for backend precompile artifacts to be *editable*, i.e. to not serialize them right away, but instead be able to apply a Callable edit_fn to them.
This allows us to support editing the precompile artifact with more updated autotune results at a later time in the next PR. The goal flow here is:
- User runs AOTAutograd -> Inductor -> Triton
- User saves to AOTAutogradCache the normal results
- User runs autotuning
- User calls serialize(), it takes the new autotuning results at runtime and saves only the necessary triton kernels.
This PR just implements the API for editing the cache artifacts. The next PR actually adds the autotuning saving support.
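The flow above can be sketched with a toy artifact class. None of these names come from the PR: `EditableArtifact`, `keep_tuned`, and the dict layout are hypothetical, and `pickle` stands in for whatever serialization the cache really uses. The sketch only shows the idea of deferring serialization so that an `edit_fn` can be applied first.

```python
import pickle
from typing import Any, Callable


class EditableArtifact:
    """Holds an unserialized artifact so it can still be edited."""

    def __init__(self, content: Any) -> None:
        self._content = content

    def edit(self, edit_fn: Callable[[Any], Any]) -> None:
        # Apply a user-supplied edit before anything is written out.
        self._content = edit_fn(self._content)

    def serialize(self) -> bytes:
        # Serialization happens only when explicitly requested.
        return pickle.dumps(self._content)


def keep_tuned(content, tuned=frozenset({"k_b"})):
    """Keep only the kernels that autotuning selected (illustrative set)."""
    return {**content, "kernels": [k for k in content["kernels"] if k in tuned]}


artifact = EditableArtifact({"kernels": ["k_a", "k_b", "k_c"]})
artifact.edit(keep_tuned)  # e.g. after autotuning has finished
payload = artifact.serialize()
```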
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158586
Approved by: https://github.com/zhxchen17
Summary: This test was using do_bench, so it was flaky because performance is non-deterministic.
Test Plan:
buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:compile_subprocess -- --exact 'caffe2/test/inductor:compile_subprocess - test_inductor_multiple_specializations_cuda (caffe2.test.inductor.test_compile_subprocess.GPUTests)' --run-disabled
Rollback Plan:
Differential Revision: D79098692
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159264
Approved by: https://github.com/jingsh
Summary:
Strengthen the matcher for StaticDispatch kernels: all input and output tensors must be on CPU, and all Device-typed attributes must be CPU.
Previously, we only checked that the output tensor was on CPU. That missed the case of a DeviceToHost aten._to_copy.
This prepares for turning on static dispatch kernels by default.
Test Plan:
I should add some tests before landing.
Rollback Plan:
Differential Revision: D78747600
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159187
Approved by: https://github.com/dolpm
**Summary**
`_weight_int8pack_mm` on CPU may cause a segmentation fault if the output shape is large (i.e., M * N is large). This is because the kernel computes the output buffer address by
```c++
auto* C_ptr = C_data + mb_start * N + nb_start;
```
where both `mb_start` and `N` are `int` and when they are large their product may overflow.
The solution is simple: declare these variables as `int64_t` so that the product won't overflow.
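To see why the wider type matters, the offset arithmetic can be mimicked in Python with `ctypes`, which truncates to the requested width. The shape values below are illustrative, not taken from the failing test. Strictly speaking, signed overflow in C++ is undefined behavior; wraparound to a negative value is just the typical outcome, and it is what this sketch models.

```python
import ctypes


def offset_int32(mb_start: int, n: int, nb_start: int) -> int:
    """Offset as it would come out of 32-bit signed arithmetic."""
    return ctypes.c_int32(mb_start * n + nb_start).value


def offset_int64(mb_start: int, n: int, nb_start: int) -> int:
    """Offset computed with 64-bit signed arithmetic."""
    return ctypes.c_int64(mb_start * n + nb_start).value


# Illustrative values: 50_000 * 50_000 = 2.5e9 exceeds INT32_MAX (2_147_483_647).
mb_start, n, nb_start = 50_000, 50_000, 0

wrapped = offset_int32(mb_start, n, nb_start)  # negative: points before the buffer
correct = offset_int64(mb_start, n, nb_start)
```

A negative offset added to `C_data` addresses memory outside the output buffer, which is consistent with the reported segmentation fault.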
**Test plan**
```
pytest -sv test/test_linalg.py -k test__int8_mm_large_shape
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158341
Approved by: https://github.com/mingfeima, https://github.com/drisspg
Fixes #158892
All custom operators should go through the graph.call_function path. The
other fallback path is for aten/prim operations that don't have support
for things (like torch.float8_e8m0fn).
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159174
Approved by: https://github.com/eellison
## Summary
This PR changes the default value of `combo_kernel_foreach_dynamic_shapes` from `False` to `True` in `torch/_inductor/config.py`.
## Context
The `combo_kernel_foreach_dynamic_shapes` configuration was introduced in PR #134477 (August 2024) to support dynamic shapes for foreach and combo kernels. It was initially disabled by default as a conservative approach to avoid disrupting production workflows.
## Why This Change?
After several months of the feature being available and stable, it's time to enable it by default. This improves the user experience for developers using `torch.compile(dynamic=True)` with foreach operations.
### Current behavior:
- Users must manually discover and enable `combo_kernel_foreach_dynamic_shapes`
- Without this flag, foreach operations may fail with dynamic shapes
- This creates friction and confusion
### With this change:
- Foreach operations work seamlessly with dynamic compilation
- No manual configuration needed
- Better "it just works" experience
## Testing
Extensive testing was performed with PyTorch 2.5.0+ and 2.7.1:
- ✅ Various tensor sizes (8, 16, 32, 64, 128)
- ✅ Multiple tensors in operations (tested up to 20)
- ✅ Nested foreach operations
- ✅ Mixed operations (foreach + standard operations)
- ✅ Both CPU and CUDA devices
- ✅ Symbolic shapes with dynamic compilation
## Impact Assessment
- **Performance**: No impact - this only affects compilation behavior
- **Backward Compatibility**: Fully maintained - users can still set to `False`
- **Risk**: Minimal - feature has been stable since August 2024
## References
- Original implementation: PR #134477 by @qchip
- This completes the feature rollout by making it available by default
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158985
Approved by: https://github.com/jansel, https://github.com/mlazos
Fixes a ZB regression (https://github.com/pytorch/torchtitan/actions/runs/16478292562/job/46585646792)
Previously we only allowed an intermediate node to have 1 gradient. Recently a torchtitan ZB test started failing, and I tracked it back to the FusedRMSNorm grad_fn having two values `(grad, None)` (see https://github.com/pytorch/pytorch/pull/153666), which started breaking our ZB tests.
This PR allows `stage_backward_weight` intermediate nodes to have multiple grads (it sums them together or if the grad value is None, then ignores it). Here is an example where the backward would have two grad values (gI1, gI2):
```python
class Func(torch.autograd.Function):
@staticmethod
def forward(ctx, x):
return x, 2
@staticmethod
def backward(ctx, gI1, gI2):
assert gI2 is None
return gI1
```
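The accumulation rule described above (sum the grads, ignore `None`) can be sketched without any tensors. `accumulate_grads` is a hypothetical helper, not the code in `stage_backward_weight`; plain floats stand in for gradient tensors, and the same logic would apply to any values that support `+`.

```python
from typing import Optional, Sequence


def accumulate_grads(grads: Sequence[Optional[float]]) -> Optional[float]:
    """Sum all non-None gradients; return None if there are none."""
    total = None
    for g in grads:
        if g is None:
            # A None gradient contributes nothing and is skipped.
            continue
        total = g if total is None else total + g
    return total


single = accumulate_grads([1.5, None])   # the (grad, None) case
summed = accumulate_grads([1.0, 2.0])    # two real gradients
empty = accumulate_grads([None, None])   # nothing to accumulate
```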
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159084
Approved by: https://github.com/tianyu-l
If `return_debug_mask` is False (which is the default value for SDPA), the attention tensor returned is an empty tensor (which has 0 dimensions). This means that the shardings passed for the batch and CP cases can yield invalid dimensions.
This PR fixes it for `scaled_dot_product_flash_attention_strategy`. Note that `scaled_dot_product_cudnn_attention_strategy` doesn't have this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159205
Approved by: https://github.com/wconstab
Switch from `guard_size_oblivious` to `guard_or_false`. If a DDE is encountered, this then falls back to computing elementwise strides.
2dccff7dcf/torch/_prims/__init__.py (L1919-L1923)
We think it's safe because Laith tested whether this fallback would fail any tests. It did not.
https://github.com/pytorch/pytorch/pull/158157
## Data-dependent exceptions (DDE)
```
File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 2139, in _to_copy
x_tensor = torch._prims.convert_element_type(x_tensor, dtype)
...
File "/data/users/colinpeppler/pytorch/torch/_prims/__init__.py", line 1920, in _convert_element_type_meta
if torch._prims_common.is_non_overlapping_and_dense(a):
File "/data/users/colinpeppler/pytorch/torch/_prims_common/__init__.py", line 494, in is_non_overlapping_and_dense
if guard_size_oblivious(length == 1):
GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(u0 - 4, 1) (unhinted: Eq(u0 - 4, 1)). (Size-like symbols: u0)
```
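The behavioral difference can be illustrated with a toy model. This is not the real `torch.fx.experimental.symbolic_shapes.guard_or_false`, which takes a symbolic expression rather than a callable; `Unbacked`, `DataDependentError`, `toy_guard_or_false`, and `pick_path` are all hypothetical stand-ins. The idea is that an undecidable condition is answered with `False`, so the caller takes its general fallback path instead of raising.

```python
class DataDependentError(RuntimeError):
    """Toy stand-in for GuardOnDataDependentSymNode."""


class Unbacked:
    """Toy stand-in for a size whose value is unknown at trace time (like u0)."""

    def __init__(self, name: str) -> None:
        self.name = name

    def __eq__(self, other):
        # Comparing an unknown value cannot be decided, so it raises.
        raise DataDependentError(f"cannot guard on Eq({self.name}, {other})")


def toy_guard_or_false(thunk) -> bool:
    """Evaluate the condition; answer False if it cannot be decided."""
    try:
        return bool(thunk())
    except DataDependentError:
        return False


def pick_path(length) -> str:
    """Choose the fast path only when the condition is known to hold."""
    if toy_guard_or_false(lambda: length == 1):
        return "fast"
    return "fallback"
```

With a concrete length the condition is evaluated as usual; with an `Unbacked` length the check quietly selects the fallback.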
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158894
Approved by: https://github.com/pianpwk, https://github.com/laithsakka
Fixes mm on B200:
Before:
```
def _addmm_nvfp4_dispatch(
a: NVFP4Tensor, b: NVFP4Tensor, aten_op, bias: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""
Core implementation shared between nvfp4_mm, nvfp4_addmm, and nvfp4_linear.
The only difference is whether bias is None or not.
"""
assert a._data.is_contiguous()
assert b._data.t().is_contiguous()
assert a._block_size == 16, f"NVFP4 requires block_size=16, got {a._block_size}"
assert b._block_size == 16, f"NVFP4 requires block_size=16, got {b._block_size}"
M, K = a.shape[0], a.shape[1]
N = b.shape[1]
# Swizzle Dizzle
if a._is_swizzled_scales:
a_scale_blocked = a._scale_e4m3 # Already swizzled
else:
a_scale = a._scale_e4m3.view(M, K // a._block_size)
a_scale_blocked = to_blocked(a_scale)
if b._is_swizzled_scales:
b_scale_blocked = b._scale_e4m3 # Already swizzled
else:
b_scale = b._scale_e4m3.view(N, K // b._block_size)
b_scale_blocked = to_blocked(b_scale)
# Merge double quant scales into 1 scale for Scale_In^D
if a._per_tensor_scale is not None:
assert b._per_tensor_scale is not None
scale_result = a._per_tensor_scale * b._per_tensor_scale
else:
assert b._per_tensor_scale is None and a._per_tensor_scale is None
scale_result = None
# THIS IS A WORKAROUND:
# RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling
# When we have per-tensor scaling, we need to apply it before bias
# since bias is not quantized
should_add_bias_separately = (scale_result is not None) and (bias is not None)
# should_add_bias_separately = bias is not None
> result = torch._scaled_mm(
a._data.view(torch.float4_e2m1fn_x2),
b._data.view(torch.float4_e2m1fn_x2),
a_scale_blocked.view(torch.float8_e4m3fn),
b_scale_blocked.view(torch.float8_e4m3fn),
bias=None if should_add_bias_separately else bias,
out_dtype=a._orig_dtype,
# scale_result=scale_result, # Not supported yet
)
E RuntimeError: Invalid scaling configuration.
E - For TensorWise scaling, a and b should be float8, scales should be float and singletons.
E - For RowWise scaling, a and b should be float8, scales should be float, scale_a should be (200, 1) and scale_b should be (1, 256), and both should be contiguous.
E - For BlockWise 1x128 scaling, a and b should be float8, scales should be float, scale_a should be (200, 1) and scale_b should be (1, 256), and both should be outer-dim-major.
E - For BlockWise 128x128 scaling, a and b should be float8, scales should be float, scale_a should be (2, 1) and scale_b should be (1, 2), and both should be near-inner-dim-major (with 16-byte aligned strides).
E - For Blockwise 1x32 scaling, a and b should be float8, scales should be float8_e8m0fnu, scale_a should have 1024 elements and scale_b should have 1024 elements, and both should be contiguous.
E - For Blockwise 1x16 scaling, a and b should be float4 (packed 2x), scales should be float8_e4m3fn, scale_a should have 3072 elements and scale_b should have 3072 elements, and both should be contiguous.
E Got a.dtype()=Float4_e2m1fn_x2, scale_a.dtype()=Float8_e4m3fn, scale_a.size()=[256, 12], scale_a.stride()=[12, 1], b.dtype()=Float4_e2m1fn_x2, scale_b.dtype()=Float8_e4m3fn, scale_b.size()=[256, 12] and scale_b.stride()=[12, 1]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159170
Approved by: https://github.com/ngimel
See docblock for details. The API here has been validated by use
in autoparallel but I'm always open to suggestions for tweaks. One
particular choice I made is to make most of the functions return dicts
by default; this isn't strictly necessary for inputs but it is very
convenient for outputs as the output desc lives on the output node,
not the argument that feeds into the node.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159005
Approved by: https://github.com/wconstab
Rewrite the bucketing of all_gather and reduce_scatter by defining a "merge graph" via a torch function:
`all_gather_merge_fn_to_trace`
`reduce_scatter_merge_fn_to_trace`
(instead of creating nodes and doing FakeTensor prop manually).
This makes it possible to experiment with the merge function.
The all_gather merge function uses `foreach_copy_`, so this PR adds an Inductor lowering for `foreach_copy_`.
It also adds a topological sort after the bucketing passes (comment in post_grad.py):
```
# Fx collectives bucketing passes require topological sort for the cases:
# when bucketed collectives have users before the last collective in the bucket
# AND when inputs of bucketed collective have ancestors after the first collective in the bucket.
#
# In this case we can not manually pick the place for bucketed collective insertion.
# But we are guaranteed by the bucketing (independent collectives in the bucket),
# that it is possible to reorder nodes to satisfy all ordering requirements.
#
# --- before bucketing ---
# in0 = ...
# wait_ag0 = ag(in0)
# user0(wait_ag0)
# ...
# pre_in1 = ...
# in1 = transform(pre_in1)
# wait_ag1 = ag(in1)
# user1(wait_ag1)
#
# --- after bucketing ---
#
# in0 = ...
# user(wait_ag0) <--- wait_ag0 is defined only after bucketed collective.
#
# pre_in1 = ...
# in1 = transform(pre_in1)
# ag_bucket(in0+in1)
# wait_bucket
# wait_ag0 = wait_bucket[0]
# wait_ag1 = wait_bucket[1]
# user1(wait_ag1)
```
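The reordering argument in the comment can be checked on the example graph with Python's standard `graphlib`. This is a sketch of the idea only, not the pass in post_grad.py: the node names are copied from the comment (with the first user written as `user0`), and the dependency sets are my reading of the "after bucketing" picture.

```python
from graphlib import TopologicalSorter

# Each node maps to the set of nodes it depends on (after bucketing).
deps = {
    "in0": set(),
    "user0": {"wait_ag0"},
    "pre_in1": set(),
    "in1": {"pre_in1"},
    "ag_bucket": {"in0", "in1"},
    "wait_bucket": {"ag_bucket"},
    "wait_ag0": {"wait_bucket"},
    "wait_ag1": {"wait_bucket"},
    "user1": {"wait_ag1"},
}

# static_order() yields every node after all of its dependencies.
order = list(TopologicalSorter(deps).static_order())
position = {name: i for i, name in enumerate(order)}
```

In the sorted order `user0` lands after `wait_ag0`, which resolves the use-before-definition shown in the comment.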
Correctness of the passes verified by loss curve for llama3 8b for simple_fsdp and for autoparallel:
<img width="1364" height="495" alt="Screenshot 2025-07-22 at 14 27 28" src="https://github.com/user-attachments/assets/67b2cabb-3206-450b-b529-e23c24292fc6" />
<img width="1355" height="509" alt="Screenshot 2025-07-22 at 14 27 56" src="https://github.com/user-attachments/assets/4d0e6b25-2eb1-47b2-8d68-dcec185239c4" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158663
Approved by: https://github.com/wconstab
Options to address the "undocumented python objects":
1. Reference the functions in the .rst via the torch.nn.modules namespace. Note that this changes the generated doc filenames / locations for most of these functions!
2. [Not an option] Monkeypatch `__module__` for these objects (broke several tests in CI due to `inspect.findsource` failing after this change)
3. Update the .rst files to also document the torch.nn.modules forms of these functions, duplicating docs.
#### [this is the docs page added](https://docs-preview.pytorch.org/pytorch/pytorch/158491/nn.aliases.html)
This PR takes option 3 by adding an rst page nn.aliases that documents the aliases in nested namespaces, removing all the torch.nn.modules.* entries from the coverage skiplist except
- NLLLoss2d (deprecated)
- Container (deprecated)
- CrossMapLRN2d (what is this?)
- NonDynamicallyQuantizableLinear
This mostly required adding docstrings to `forward`, `extra_repr` and `reset_parameters`. Since forward arguments are already part of the module docstrings I just added a very basic docstring.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158491
Approved by: https://github.com/janeyx99
Adds `c_shim_aten.{h/cpp}` and uses it for `fill_`.
This is the generated `c_shim_aten.cpp` for reference:
```cpp
// WARNING: THIS FILE IS AUTOGENERATED BY torchgen. DO NOT MODIFY BY HAND.
// See 7e86a7c015/torchgen/gen.py (L2424-L2436) for details
// This file corresponds to the aten_shimified_ops list in torchgen/aoti/fallback_ops.py
#include <torch/csrc/inductor/aoti_torch/generated/c_shim_aten.h>
#include <torch/csrc/inductor/aoti_torch/utils.h>
#ifndef AT_PER_OPERATOR_HEADERS
#include <ATen/Functions.h>
#include <ATen/CompositeExplicitAutogradFunctions.h>
#include <ATen/CompositeExplicitAutogradNonFunctionalFunctions.h>
#include <ATen/CompositeImplicitAutogradFunctions.h>
#else
#include <ATen/ops/fill.h>
#endif // AT_PER_OPERATOR_HEADERS
using namespace torch::aot_inductor;
AOTITorchError aoti_torch_aten_fill__Scalar(AtenTensorHandle self, double value) {
AOTI_TORCH_CONVERT_EXCEPTION_TO_ERROR_CODE({
at::fill_(
*tensor_handle_to_tensor_pointer(self), value
);
});
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158974
Approved by: https://github.com/albanD, https://github.com/janeyx99
Hi team,
Please help review this patch.
This PR https://github.com/pytorch/pytorch/pull/150370 tried to fix the "Empty C Call Queue" problem on Python 3.12. It added C calls for each starting Python event with a callable.
I found that the root cause is not that we cannot get C function frames via `PyFrame_GetBack` when PythonTracer is filling start frames, but a bug in Python 3.12.0-3.12.4 that drops C call events. That bug was fixed by 257c413cd1 in 3.12.5.
So I think https://github.com/pytorch/pytorch/pull/150370 cannot fix the problem, and this patch reverts its change.
There are ways to fix the problem correctly, such as adding a new monitoring callback to compensate for the missing call events of methods with C functions, or overriding the callback registered by `PyEval_SetProfile`. These solutions may make the code hard to maintain.
~~Since upgrading the micro version of Python is not difficult for users, we can just ignore C functions and suggest user upgrade.~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155446
Approved by: https://github.com/sraikund16
Summary:
Placement is leaked to too many classes!
In this diff, we consolidate all placement lookup into one place: Graph::ApplyDevicePlacement.
After applying placement, the in-memory graph, tensorMeta, weightMeta would already have the re-mapped device.
The subsequent weight loading, sample input loading, and target device inference would look up the re-mapped device from the graph's tensorMeta.
graph's tensorMeta becomes the only ground truth!
Test Plan:
Need to add some tests before landing.
This is a big change.
Rollback Plan:
Differential Revision: D78841818
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158996
Approved by: https://github.com/henryoier
Fix docstring for clip_grads_with_norm_ to reflect clamping behavior
This PR updates the docstring for torch.nn.utils.clip_grads_with_norm_ to accurately reflect the implementation behavior. The current documentation suggests that gradients are always scaled by:
grad = grad * (max_norm / (total_norm + eps))
However, the actual implementation clamps the scale coefficient to a maximum of 1.0, ensuring gradients are only scaled down, not up. This PR corrects the formula and adds a clarifying note to avoid confusion for users.
Updated the formula in the docstring to:
grad = grad * min(max_norm / (total_norm + eps), 1.0)
Added a note explaining the rationale for clamping (to prevent gradient amplification).
Ensured consistency with the behavior of clip_grad_norm_.
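As a rough illustration of the corrected formula, the sketch below computes the clamped coefficient in plain Python. The helper names (`clip_coef`, `scale_grads`) and the default `eps` are assumptions made for this example; this is not the library's implementation.

```python
import math


def clip_coef(total_norm: float, max_norm: float, eps: float = 1e-6) -> float:
    # Clamp to 1.0 so gradients are only ever scaled down, never up.
    return min(max_norm / (total_norm + eps), 1.0)


def scale_grads(grads: list[float], max_norm: float, eps: float = 1e-6) -> list[float]:
    # Hypothetical helper: treats `grads` as one flat gradient vector.
    total_norm = math.sqrt(sum(g * g for g in grads))
    coef = clip_coef(total_norm, max_norm, eps)
    return [g * coef for g in grads]


large = scale_grads([3.0, 4.0], max_norm=1.0)  # norm 5.0, scaled down
small = scale_grads([0.3, 0.4], max_norm=1.0)  # norm 0.5, left unchanged
```

Without the `min(..., 1.0)` clamp, the second call would amplify the gradient to roughly unit norm.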
Fixes #151554
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158200
Approved by: https://github.com/mikaylagawarecki
Options to address the "undocumented python objects":
1. Reference the functions in the .rst via the `torch.functional` namespace. Note that this changes the generated doc filenames / locations for most of these functions!
2. Document these functions by referencing them from the `torch.` namespace instead, in line with common usage. This would also require setting the `__module__` for these functions and moving entries from `torch.functional`'s `__all__` -> `torch`'s `__all__`, which is BC-breaking.
3. Update the .rst files to also document the `torch.functional` forms of these functions, duplicating docs.
This PR takes option (3) above and:
* Removes all 20 `torch.functional` entries from the doc ignore list
* Removes `torch.functional.align_tensors()` entirely, since we don't want to document it.
* This is technically BC-breaking, although the previous impl simply errored out. This change could be moved to a separate isolated PR for safety.
* Introduces `torch.aliases.md` as a hidden page for the `torch.functional` aliases to the `torch` analogue functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158581
Approved by: https://github.com/janeyx99
Wrapping is load bearing for things that introspect argument signatures,
but use of functools.wraps to do this is undesirable as this overrides
the name/module of the wrapping function, which is bad for tracking down
exactly what code is actually being run at runtime. simple_wraps is
like wraps but it doesn't override the name information, so you still
get an appropriate printout. To see the stack of all functions wrapping
each other, there is now a helper fn_stack.
I also make some assertions tighter in the descriptor PR. These didn't
catch any bugs but I figure might as well.
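To make the difference concrete, here is a minimal sketch of the idea. `simple_wraps_sketch` and `fn_stack_sketch` are hypothetical stand-ins written for this example and may differ from the helpers added in this PR; the sketch relies only on the standard behavior that `inspect.signature` follows `__wrapped__`.

```python
import functools
import inspect


def simple_wraps_sketch(f):
    # Hypothetical: expose f's signature via __wrapped__, but keep the
    # wrapper's own __name__ / __module__ intact.
    def decorator(wrapper):
        wrapper.__wrapped__ = f
        return wrapper
    return decorator


def fn_stack_sketch(f):
    # Hypothetical: walk the __wrapped__ chain from outermost to innermost.
    stack = [f]
    while hasattr(stack[-1], "__wrapped__"):
        stack.append(stack[-1].__wrapped__)
    return stack


def original(x, y=2):
    return x + y


@functools.wraps(original)
def with_wraps(*args, **kwargs):
    return original(*args, **kwargs)


@simple_wraps_sketch(original)
def with_simple(*args, **kwargs):
    return original(*args, **kwargs)


print(with_wraps.__name__, with_simple.__name__)
print(inspect.signature(with_simple))
print([f.__name__ for f in fn_stack_sketch(with_simple)])
```

Both wrappers report the signature of `original`, but only the `functools.wraps` version hides the name of the function that is actually running.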
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158734
Approved by: https://github.com/wconstab
ghstack dependencies: #158624, #158708
3 procs were used for sm86, but we switched to sm89, the check failed, and so it fell back to 2.
sm90 is H100; I don't know which unit tests we run there, but I assume those machines also have a lot of memory.
They use larger runners, which have more GPU memory, so it's usually OK. I think it's ~22GB -> 10GB per proc with 2 procs, 6GB per proc with 3 (the CUDA context is maybe 1GB).
I've applied skips to the ones that OOMed
Time decreases from ~2.7hr per test job -> ~2hr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158691
Approved by: https://github.com/huydhn
----
- First, we add a new expanded_def to FX, which will expand the
definitions of variables into multiple lines, one per variable
definition. This makes extremely long args/return lists much
more readable.
- Next, we extend this mechanism to also print out descriptors on
placeholders and return values, as comments, if available. This
is how we will test descriptors.
- We update tlparse for AOTAutograd to use this format.
- We update expect tests to use this format and update their expected
output, so you can inspect what it looks like. There may be other tests
I should update, open to suggestions.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158708
Approved by: https://github.com/wconstab
ghstack dependencies: #158624
One of the recurring challenges of working with FX graphs produced by
AOTAutograd is that there is a very intricate input/output calling
convention that is essentially impossible to understand without actually
reverse engineering the AOTAutograd code. It is so bad that there
is a bit of logic for stashing indices of relevant arguments/outputs
in TracingContext so Inductor can figure out what the correct arguments
are.
This PR introduces the necessary scaffolding to keep track of
"descriptors" of every input/output to a (joint) FX graph produced
by AOTAutograd. First read through descriptors.py to get a sense for
what is available: for inputs, you can figure out if you have
a plain input, tangent, parameter, or something more exotic like
one of the fields of a subclass or view base. For outputs, you can
determine if you have a plain output or grad, or something more exotic
like the contents of a mutated input or an intermediate base of several
views that were returned.
There are two distinct parts of this patch: AOTInput tracking, and
AOTOutput tracking.
**AOTInput tracking.** The way this works is that AOTAutograd starts off
with some Tensor `flat_args` that are the inputs to the graph being
traced, and then updates these arguments as it modifies the input
calling convention. Anywhere these `args` are passed around, we now add a
new argument `args_descs` which is updated in synchrony with args. Add
a new arg? Add a new AOTInput to `args_descs`.
**AOTOutput tracking.** Originally, I wanted to also add an `outs_descs`
analogous to `args_descs` tracking output metadata. However, it is
often difficult to compute what the output will be until you're actually
tracing the function for real (and are able to peek at the real
outputs). So we only compute `outs_descs` when we actually trace. To do
this, we change the calling convention of the function we trace to
return not just outputs, but a tuple of `outs` and `outs_descs`. Before
we bottom out at the `make_fx` invocation, we save `outs_descs` to a
nonlocal and bottom out.
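The pattern described above can be sketched in plain Python. Everything here is hypothetical and simplified for illustration: `PlainInputDesc` and `PlainOutputDesc` stand in for real descriptor classes, and `trace` stands in for the `make_fx` call.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PlainInputDesc:  # hypothetical stand-in for an AOTInput
    idx: int


@dataclass(frozen=True)
class PlainOutputDesc:  # hypothetical stand-in for an AOTOutput
    idx: int


def with_output_descs(fn):
    # Change the calling convention: return (outs, outs_descs), not just outs.
    def inner(*args):
        outs = fn(*args)
        return outs, [PlainOutputDesc(i) for i in range(len(outs))]
    return inner


def trace(fn_with_descs, args, args_descs):
    assert len(args) == len(args_descs)  # descriptors stay in sync with args
    saved_descs = None

    def outs_only(*a):
        # Stash the descriptors in a nonlocal so the traced function
        # returns plain outputs again.
        nonlocal saved_descs
        outs, saved_descs = fn_with_descs(*a)
        return outs

    outs = outs_only(*args)  # stand-in for the real tracing call
    return outs, saved_descs


args = [1.0, 2.0]
args_descs = [PlainInputDesc(0), PlainInputDesc(1)]
outs, outs_descs = trace(
    with_output_descs(lambda a, b: (a + b, a * b)), args, args_descs
)
```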
To actually make use of this information in a useful way, see the next PR. Potentially the two PRs could be combined together but I think it's actually clearer for them to be separate.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158624
Approved by: https://github.com/xmfan
## Fixes https://github.com/pytorch/pytorch/issues/157959
## mini repro from issue
```python
import torch
from torch import nn


class Foo(nn.Module):
    def __init__(
        self,
        use_parameter: bool
    ) -> None:
        super().__init__()
        self.b = 101
        if use_parameter:
            self.b = nn.Parameter(torch.Tensor([self.b]), requires_grad=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # return x + self.b
        # return x - self.b
        return x / self.b
        # return x * self.b


torch.manual_seed(42)
x = torch.rand((5, 5))
expected = Foo(False)(x)
models = [
    Foo(False),
    Foo(True),
    torch.compile(Foo(False), fullgraph=True),
    torch.compile(Foo(True), fullgraph=True),
]
for m in models:
    print((m(x) - expected).sum())
```
All outputs equal zero except the result of `torch.compile(Foo(False), fullgraph=True)`.
## summary:
When the divisor is a scalar, Inductor lowers div to a multiplication by the scalar's reciprocal.
This can lose precision in the C++ kernel, but not in the Triton kernel.
## why:
Generated C++ kernel; thanks to @xmfan for supplying the code.
```c++
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C" void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(25L); x0+=static_cast<int64_t>(16L))
        {
            {
                if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(16L)))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
                    auto tmp1 = static_cast<float>(0.009900990099009901);
                    auto tmp2 = at::vec::Vectorized<float>(tmp1);
                    auto tmp3 = tmp0 * tmp2;
                    tmp3.store(out_ptr0 + static_cast<int64_t>(x0));
                }
                if(C10_UNLIKELY(x0 >= static_cast<int64_t>(16L) && x0 < static_cast<int64_t>(25L)))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(9L));
                    auto tmp1 = static_cast<float>(0.009900990099009901);
                    auto tmp2 = at::vec::Vectorized<float>(tmp1);
                    auto tmp3 = tmp0 * tmp2;
                    tmp3.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(9L));
                }
            }
        }
    }
}
```
The float type in C typically has 6 to 7 significant digits, while the double type has 15 to 16 significant digits.
```c++
#include <iostream>
#include <iomanip>
int main() {
    auto tmp1 = static_cast<float>(0.009900990099009901);
    auto tmp2 = static_cast<double>(0.009900990099009901);
    std::cout << std::setprecision(20) << "tmp1 = " << tmp1 << std::endl;
    std::cout << std::setprecision(20) << "tmp2 = " << tmp2 << std::endl;
    return 0;
}
```
The output is
```bash
tmp1 = 0.0099009899422526359558
tmp2 = 0.0099009900990099011103
```
`auto tmp1 = static_cast<float>(0.009900990099009901);` rounds the reciprocal to float precision (tmp1 becomes roughly 0.00990098994), so multiplying by it can give a result that differs from a true division, and the final result will not match the expected value.
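The effect can be reproduced without PyTorch. The sketch below emulates float32 rounding with `struct` (an approach chosen for this example, not taken from the PR) and compares a float32 division by 101 against a float32 multiplication by the rounded reciprocal.

```python
import struct


def to_f32(x: float) -> float:
    """Round a Python float (a double) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


recip_f64 = 1.0 / 101.0
recip_f32 = to_f32(recip_f64)  # what static_cast<float>(...) stores


def div_f32(x: float) -> float:
    # Emulates a float32 division x / 101.
    return to_f32(x / 101.0)


def mul_recip_f32(x: float) -> float:
    # Emulates the lowered form: x * float32(1 / 101).
    return to_f32(x * recip_f32)


xs = [to_f32(i / 8.0) for i in range(1, 257)]
mismatches = [x for x in xs if div_f32(x) != mul_recip_f32(x)]
print(f"{len(mismatches)} of {len(xs)} inputs differ")
```

Only some inputs disagree, and by at most one unit in the last place, which matches the small nonzero sums printed by the repro.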
The lowering that introduces this behavior is here:
86d8af6a6c/torch/_inductor/lowering.py (L6238)
The commit states that the precision loss is expected in the CUDA implementation.
original commit
03439d4c1c
cuda implementation
0636c11811/aten/src/ATen/native/cuda/BinaryDivTrueKernel.cu (L36-L38)
What is interesting is that the Triton kernel works correctly due to the precision of float type in python.
```python
def triton_poi_fused_div_0(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 25
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex
    tmp0 = tl.load(in_ptr0 + (x0), xmask)
    tmp1 = 0.009900990099009901
    tmp2 = tmp0 * tmp1
    tl.store(out_ptr0 + (x0), tmp2, xmask)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158231
Approved by: https://github.com/eellison
The Python dispatcher is not always enabled in fake tensors and has to be enabled explicitly.
It should be always enabled, but that requires some work to get all tests passing.
I have run into several issues where I had to add enable_python_dispatcher (e.g. for
XLA, Helom, etc.). To avoid those issues for view specifically, I moved it into the fake tensor impl.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158406
Approved by: https://github.com/bobrenjc93
The original code comment:
```python
# Let's not fail if we can't clean up the temp dir. Also note that for
# Windows, we can't delete the loaded modules because the module binaries
# are open.
```
But the `ignore_errors` parameter was missing on Windows. This PR adds it.
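For reference, a minimal sketch of best-effort cleanup with the standard library. The directory and file names are made up for the example, and this is not the code touched by the PR.

```python
import os
import shutil
import tempfile

# Create a throwaway directory with one file in it.
build_dir = tempfile.mkdtemp()
open(os.path.join(build_dir, "module.bin"), "w").close()

# Best-effort cleanup: failures (e.g. files still open on Windows) are ignored.
shutil.rmtree(build_dir, ignore_errors=True)

# The directory is already gone; with ignore_errors=True this does not raise.
shutil.rmtree(build_dir, ignore_errors=True)
```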
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159025
Approved by: https://github.com/jansel
The MIOpen integration has changed over the years. In the past, the MIOpen default for benchmark was True and if it were set to False it would use MIOpen Immediate Mode. But with #145294 the MIOpen benchmark default changed to False and to activate immediate mode you would set the deterministic flag to True. This has proved too restrictive because benchmark and deterministic flags are independent from immediate mode. Thus, immediate mode needs its own flag. Though MIOpen still masquerades behind torch.backends.cudnn and its flags, it seemed inappropriate to add an miopen-exclusive flag to the set of cudnn flags. This PR adds the first miopen-only flag to control its immediate mode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158951
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Summary: The subclass can override the filtering logic to customize which frames to keep or drop.
Test Plan:
```
buck run caffe2/test:test_export -- -r test_stack_trace
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:others -- -r test_constant_random
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_custom_obj_list_out
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r class_member_back_compat
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158266
Approved by: https://github.com/ezyang, https://github.com/yushangdi
Fixes #158120
The issue was caused by populating a builtin tensor fn map at import time. If torch.export.export was called with the `meta` device before any dynamo imports, this map would not yet be populated, so it would be populated at import time. That would try to call `torch.disable`, which would not yet be initialized.
The fix is to populate this map lazily.
```
python test/dynamo/imports_non_circular_repro.py TestImports.test_circular_import_with_export_meta
```
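The lazy-population fix follows a common pattern, sketched below with hypothetical names (`_build_fn_map`, `get_builtin_fn_map`); the real map and its contents live in Dynamo and are not reproduced here.

```python
import functools

populate_calls = []


def _build_fn_map():
    # Hypothetical stand-in for work that is unsafe to do at import time.
    populate_calls.append("populate")
    return {"add": lambda a, b: a + b}


@functools.cache
def get_builtin_fn_map():
    # Built on first use, then cached for every later call.
    return _build_fn_map()


assert populate_calls == []           # nothing happens at import time
result = get_builtin_fn_map()["add"](1, 2)
get_builtin_fn_map()                  # second call reuses the cached map
```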
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158931
Approved by: https://github.com/StrongerXi, https://github.com/mlazos, https://github.com/anijain2305
TL;DR: Cuts vLLM cudagraph collection from 80s -> 24s
Stop garbage collecting by default on every cudagraph recording. The old behavior can be re-enabled by setting `TORCH_CUDAGRAPH_GC=1` or the config `force_cudagraph_gc`.
We were previously garbage collecting at the beginning of each cudagraph
capture. vLLM collects 5427 graphs and most of those garbage collections weren't
actually collecting any memory (CPU or GPU). This changes it to collect at most
once every 10s, so if we're capturing in a loop we don't burn all our cycles
looking for garbage.
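A rate-limited collector along these lines could look like the sketch below. The class name, the injectable `clock`, and the counter are assumptions for illustration, not the code in `torch.cuda.graphs`.

```python
import gc
import time


class ThrottledGC:
    """Run gc.collect() at most once per `min_interval_s` seconds."""

    def __init__(self, min_interval_s: float = 10.0, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self.clock = clock
        self.last_collect = None
        self.num_collections = 0

    def maybe_collect(self) -> bool:
        now = self.clock()
        too_soon = (
            self.last_collect is not None
            and now - self.last_collect < self.min_interval_s
        )
        if too_soon:
            return False  # skip: we collected recently
        gc.collect()
        self.last_collect = now
        self.num_collections += 1
        return True
```

Calling `maybe_collect()` at the start of every capture keeps the first collection but skips the ones that would follow within the interval.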
(These numbers have a lot of variance from run to run but give the correct
general scale)
```
| calls | total | synchronize | gcs | collect | empty cache | sys freed | cuda freed |
-------+-------+-------+-------------+------+---------+-------------+-----------+------------+
before | 5427 | 78s | 1.48s | 5427 | 53.22s | 1.21s | 145855 | 1539309568 |
-------+-------+-------+-------------+------+---------+-------------+-----------+------------+
after | 5427 | 24s | 0s | 3 | 1.53s | 0.84s | 592 | 1539309568 |
-------+-------+-------+-------------+------+---------+-------------+-----------+------------+
```
total - this is the total time reported by vLLM's "Graph capturing finished" log.
The rest of these are measured in torch.cuda.graphs.graph.__enter__():
calls - number of times torch.cuda.graphs.graph.__enter__ was called
synchronize - this is the duration taken by the cuda.synchronize call
gcs - number of times gc.collect was called
collect - this is the duration taken by the gc.collect call
empty cache - this is the duration taken by the torch.cuda.empty_cache call
sys freed - the number of bytes reported freed by gc.collect
cuda freed - the number of bytes reported freed by torch.cuda.memory_reserved
So it seems like the heavy lifting is done by torch.cuda.empty_cache() which is
fairly quick.
Cudagraph results from the TorchInductor Performance DashBoard (this is from the original version using the GC clock so the real results will be slightly better than this):
<img width="1494" height="382" alt="image" src="https://github.com/user-attachments/assets/69b705ef-47ce-4b6e-9733-1ec941cad93d" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158193
Approved by: https://github.com/ngimel
Fixes #157452
Test with
```
python test/dynamo/test_repros.py ReproTests.test_nn_parameter_ctor_graph_breaks
```
### Release Notes
Change to nn.Parameter Constructor Behavior in Dynamo
This PR introduces a semantic change in how Dynamo handles the nn.Parameter constructor. Previously, if the constructor lacked a clean source, the system would attempt to infer arguments to construct a clone and lift this synthetic proxy into the computation graph. This approach had many potential edge cases and was difficult to reason about. The new behavior defaults to graph breaking when the nn.Parameter constructor does not have a clean source. In such cases, users are advised to manually move the constructor out of the graph. This change improves clarity and reduces complexity in graph construction and debugging. If that cannot be done, users can fall back to the old semantics with `torch._dynamo.config.graph_break_on_nn_param_ctor=False`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158800
Approved by: https://github.com/anijain2305
Before:
```
.Observed exception
Explanation: Dynamo found no exception handler at the top-level compiled function when encountering an exception. Exception will propagate outside the compiled region.
Hint: Dynamo has detected that tracing the code will result in an error when running in eager. Please double check that your code doesn't contain a similar error when actually running eager/uncompiled.
Hint: It may be possible to write Dynamo tracing rules for this code. Please report an issue to PyTorch if you encounter this graph break often and it is causing performance issues.
Developer debug context:
```
After:
```
Observed exception
Explanation: Dynamo found no exception handler at the top-level compiled function when encountering an exception. Exception will propagate outside the compiled region.
Hint: Dynamo has detected that tracing the code will result in an error when running in eager. Please double check that your code doesn't contain a similar error when actually running eager/uncompiled.
Hint: It may be possible to write Dynamo tracing rules for this code. Please report an issue to PyTorch if you encounter this graph break often and it is causing performance issues.
Developer debug context: raised exception TypeError([ConstantVariable(str: "unhashable type: <class 'torch._dynamo.variables.dicts.SetVariable'>")])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158924
Approved by: https://github.com/williamwen42, https://github.com/zou3519
Previously, precompile was implemented under the assumption that dynamo always inlines the user code and generates resume functions when a graph break is hit. In cases like nanogpt training, there is a nontrivial amount of code that causes dynamo to fail the speculation and stop inlining certain types of user functions. This results in more code objects being tracked by CompilePackage.
Since these new code objects are user defined, we also need to serialize the location of this code so that we can load the precompile entries onto these code objects in another process.
With this fix, we are able to run nanogpt inference+training with precompile under torchbench.
Differential Revision: [D78691422](https://our.internmc.facebook.com/intern/diff/D78691422/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158947
Approved by: https://github.com/jamesjwu
cuBLAS added support for them in CUDA 12.9. It's rather easy to call into them; the hardest part is allowing the lhs and rhs operands to have different scaling types, as that changes the whole callstack.
The scaling format is still detected from the sizes of the scale tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158037
Approved by: https://github.com/eqy, https://github.com/drisspg
When select has a data-dependent index, we can't tell whether the actual index should be index+size or index.
To avoid throwing a data-dependent error (DDE), we allocate a new unbacked symbol to represent the storage offset of the
output view, and we compute its value dynamically at runtime in the Inductor lowering.
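In concrete terms, the value that has to be computed at runtime looks like the sketch below. The function names are hypothetical and this is ordinary Python arithmetic rather than the symbolic expression Inductor emits.

```python
def normalize_index(index: int, size: int) -> int:
    # Negative indices count from the end: the choice between
    # `index + size` and `index` depends on the runtime value.
    return index + size if index < 0 else index


def select_storage_offset(base_offset: int, stride: int, index: int, size: int) -> int:
    # Storage offset of the view returned by select along one dimension.
    return base_offset + normalize_index(index, size) * stride


offset_last = select_storage_offset(base_offset=0, stride=4, index=-1, size=3)
offset_second = select_storage_offset(base_offset=2, stride=4, index=1, size=3)
```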
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157605
Approved by: https://github.com/ColinPeppler
Add `sort`, `scatter_add` strategy. I am reusing the strategy for `scatter` related ops for quick support. The strategy can potentially be improved after we fix index related strategies.
Minor fix: fix `replicate_op_strategy` to support multiple output tensors, which is required by aten.sort.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159022
Approved by: https://github.com/XilunWu, https://github.com/wconstab
Summary: __assert_fail is declared slightly differently in the Emscripten stdlib. This may cause errors when compiling with Emscripten.
Test Plan:
N/A
Rollback Plan:
Differential Revision: D78500790
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158580
Approved by: https://github.com/JacobSzwejbka
As part of better engineering week, we would like to improve our type support to improve the dev experience in dynamo
This PR adds strict typing support to a critical set of files for dynamo, `source.py` and the base `_guards.py`
Running
```
mypy torch/_dynamo/source.py torch/_guards.py --linecount-report /tmp/coverage_log
```
| -------- | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main | 1227 | 2208 | 55.57% | 207 | 362 | 57.18% |
| This PR | 2217 | 2217 | 100.00% | 362 | 362 | 100.00% |
| Delta | +990 | +9 | +44.43% | +155 | 0 | +42.82% |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158397
Approved by: https://github.com/anijain2305
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
Putting both the dispatch API and combine API in battlefield, one following the other, i.e.
```
all_to_all_vdev_2d(inp, out, inp_splits, out_splits_offsets, ...)
all_to_all_vdev_2d_offset(
input=out,
out=combine_out,
in_splits_offsets=out_splits_offsets,
out_splits_offsets=combine_out_splits_offsets
)
```
Here the `out_splits_offsets` from dispatch perfectly serves as the `in_splits_offsets` argument for combine.
Then we assert that the output of combine is exactly the same as the original input to shuffle, and combine's output splits are exactly the same as the original input splits.
It works!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157026
Approved by: https://github.com/Skylion007, https://github.com/ngimel
ghstack dependencies: #158234, #158235, #156743, #156881
Added `all_to_all_vdev_2d_offset`, which:
Perform a 2D AllToAllv operation, with input split and offset
information provided on device. The input offsets need not be an
exact prefix sum of the input splits, i.e. paddings are allowed between the
split chunks. The paddings, however, will not be transferred to peer
ranks.
In Mixture of Experts models, this operation can be used to combine tokens
processed by experts on remote ranks. This operation can be viewed as a
"reverse" operation to the `all_to_all_vdev_2d` operation (which shuffles
tokens to experts).
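The relationship between splits, offsets and padding can be shown on a single host with plain lists. This is only an illustration of the indexing; the helper names and sample data are invented, and no communication takes place.

```python
from itertools import accumulate


def exclusive_prefix_sum(splits):
    # Offsets for tightly packed chunks: [0, s0, s0 + s1, ...]
    return [0] + list(accumulate(splits))[:-1]


def gather_chunks(buf, splits, offsets):
    # Read `splits[i]` elements starting at `offsets[i]`; anything
    # between chunks is padding and is never read.
    return [buf[off:off + n] for n, off in zip(splits, offsets)]


splits = [2, 3, 1]
packed_offsets = exclusive_prefix_sum(splits)

# Padded layout: each chunk starts on a multiple of 4, -1 marks padding.
padded_offsets = [0, 4, 8]
buf = [10, 11, -1, -1, 20, 21, 22, -1, 30]
chunks = gather_chunks(buf, splits, padded_offsets)
```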
The change may seem a bit dense, sorry. But it is mainly two changes:
1. templating existing device functions (to use provided input offset or calculate it)
2. generalizing variable names, e.g. npes, ne --> minor_size, major_size,
so that I can use the same alltoall function for matrix of (nranks, ne) as well as matrix of (ne, nranks).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156881
Approved by: https://github.com/ngimel
ghstack dependencies: #158234, #158235, #156743
This PR addresses a few small bugfixes needed to make NanoGPT inference work, and also adds a new `--caching-precompile` argument to torchbench. With `--caching-precompile`, after every benchmark we save precompile artifacts to DynamoCache, allowing us to test caching precompile on all existing benchmarks.
The following bugfixes are in this PR to make all of this work:
- Fix global variables being pruned with DUPLICATE_INPUT guards. DUPLICATE_INPUT guards have additional vars from the second input, which we track with additional_local_vars, but we never tracked additional global variables. This fixes the issue. (See torch/_dynamo/guards.py changes)
- Return None from PrecompileContext.serialize() if no new dynamo compiles occurred. There's no reason to save artifacts (i.e. autotuning artifacts, etc) if no dynamo_compile occurred, so we return None early. We may later want to support editing existing dynamo artifacts as a TODO, but that's upcoming.
- Log `dynamo_start` on CompilePackage.load: This is only needed so that tlparse doesn't ignore TORCH_TRACE logs generated when caching precompile hits. If there are no actual compiles, we never log a "dynamo_start" entry, which makes internal tlparse ignore the TORCH_TRACE file.
## Test Plan
After this PR, the following now works:
```
TORCH_LOGS=dynamo tlp python benchmarks/dynamo/torchbench.py --only nanogpt --performance --inference --backend inductor --caching-precompile --warm-start-latency
```
tlparse result (internal):
Cold Start (6 seconds):
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_vk9nkp4m.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
Warm Start (~1 s):
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_5l4iwrpm.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
The 1 second of warm start here can be improved: the costs here are mostly in starting up workers and triton and initializing CUDA, a lot of which should not be included in the compile time cost in real world scenarios where these are already loaded before training begins.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158847
Approved by: https://github.com/zhxchen17
- Prevent the inductor test for argsort/sort from wrongly failing when the argsort/sort output with stable=False differs from pytorch but is still a valid argsort output.
- Add functionality to allow alternative assert_equal functions in inductor tests for future cases.
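One way such an alternative comparison could work is to check that the returned indices form a valid sorting permutation rather than comparing them element by element. `assert_valid_argsort` below is a hypothetical helper on plain lists, not the function added to the inductor tests.

```python
def assert_valid_argsort(values, indices):
    # The indices must be a permutation of 0..n-1 ...
    assert sorted(indices) == list(range(len(values))), "not a permutation"
    # ... and gathering with them must yield the sorted values.
    gathered = [values[i] for i in indices]
    assert gathered == sorted(values), "indices do not sort the input"


values = [3, 1, 1, 2]
assert_valid_argsort(values, [1, 2, 3, 0])  # ties in one order
assert_valid_argsort(values, [2, 1, 3, 0])  # ties in the other order: still valid
```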
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146622
Approved by: https://github.com/eellison
Co-authored-by: George Wigley <georgewi@graphcore.ai>
Summary:
In general, device_ is not very useful in OpKernel. Remove it to avoid misuse.
The meaning of `device_` is also ambiguous in the OpKernel.
For StaticDispatch kernels, we always call the CPU kernel.
For C10Kernel, we rely on the input tensor's device and the dispatcher to determine which device to run on.
For ops that involve multiple devices, e.g. aten._to_copy(device), the meaning of device is ill-defined.
Test Plan:
CI
Rollback Plan:
Reviewed By: henryoier, dolpm, kqfu, zhxchen17
Differential Revision: D78704840
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158944
Approved by: https://github.com/dolpm
This PR adds some models that are currently run in TIMM and Torchbench to `cpu_skip_list`.
These models take a long time to run, which causes the benchmark runs to time out. [benchmark runs for aarch64](https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-test-nightly-aarch64.yml)
• The issue stems from unoptimized groupwise convolution (BF16/F16 dtype) kernels on aarch64 platforms, which significantly slow down execution and lead to the timeout.
**Action:**
• An optimized BF16 groupwise convolution kernel is currently being developed in oneDNN, targeted for release in Q4 2025.
To maintain dashboard consistency and signal clarity, I've skipped the affected tests in:
* timm benchmarks
* torchbench benchmarks
As suggested, the skip is applied at the CPU arch level, explicitly branching for aarch64 and listing the models that need to be skipped. This keeps the logic clean.
• An alternative considered was increasing shard counts for aarch64 runners, but given the known performance bottleneck, skipping avoids wasted compute cycles. Suggestions on this are welcome.
Benchmark does not timeout after the suggested change: https://github.com/pytorch/pytorch/actions/runs/16447200138
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158842
Approved by: https://github.com/malfet
Added `torch.hash_tensor` reduction function with a `mode` argument that defaults to reduction with xor.
- The hash is always uint64.
- Integers are cast to uint64 before performing the xor_sum reduction
- Floats are upcast to double and then bitcast to uint64 before performing the xor_sum reduction
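The casting rules above can be emulated on Python scalars as follows. This is a sketch of the described semantics only; `xor_sum_hash` and `float_bits` are names invented here, and the real operator works on tensors.

```python
import struct
from functools import reduce
from operator import xor

MASK64 = (1 << 64) - 1


def float_bits(x: float) -> int:
    # Bitcast a double to an unsigned 64-bit integer.
    return struct.unpack("<Q", struct.pack("<d", x))[0]


def xor_sum_hash(values) -> int:
    # Floats: bitcast the double; integers: wrap into the uint64 range.
    bits = [
        float_bits(v) if isinstance(v, float) else v & MASK64
        for v in values
    ]
    return reduce(xor, bits, 0)


h_float = xor_sum_hash([1.0])
h_pair = xor_sum_hash([1.0, 1.0])   # identical elements cancel under xor
h_int = xor_sum_hash([1, 2])
```

The second call shows a property worth keeping in mind: an xor reduction is order-independent and pairs of equal elements cancel out.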
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154149
Approved by: https://github.com/albanD
This PR consists of all the changes required to enable PyTorch ROCm CI on MI355X nodes.
- Rework aotriton cmake configuration to rely on `HIP_VERSION` instead of `ROCM_VERSION`, as aotriton depends on HIP. HIP loosely tracks the ROCm major version, but the two are not actually synchronized, as observed in the ROCm 7 alpha build.
- Bump composable-kernel submodule to [df6023e305f389bbf7249b0c4414e649f3ad6598](df6023e305) for mi350 compatibility.
- Extend the change docker permissions step to the MI355x runners as well. This step is included to apply the required permission change to the test folder for a successful upload of artifacts in k8s docker.
- Create new rocm-mi355 workflow to trigger core PyTorch tests on a nightly basis at 2:30 am PST.
- Successfully tested running the test suites listed in rocm-mi355.yml on MI355 runners by temporarily hacking rocm-mi300.yml: ca7d5fae11 (rocm-mi300)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158889
Approved by: https://github.com/jeffdaily
#### 1. Provide a default fallback strategy that can apply to an arbitrary operator whose output is a single tensor.
We can call register_op_strategy to register using the `fallback_op_strategy`:
- For an op without List[Tensor] as input, call:
```
register_op_strategy(op_overload)(replicate_op_strategy)
```
- For an op that takes List[Tensor] as input, call:
```
register_op_strategy(op_overload, schema_info=RuntimeSchemaInfo(needs_pytree=True))(replicate_op_strategy)
```
The strategy will force all input and output to be replicated with the corresponding redistribute_cost.
#### 2. Add a test function as a necessary condition for strategy function.
```
detect_exists_identical_opspec(*args, op, mesh, strategy_function)
```
This function detects whether identical strategies will be produced given the sample `args`. It iterates over all combinations of placements for each arg and produces the output strategy from the registered `strategy_function`.
#### 3. Provide a context manager `op_strategy_context` to easily register/unregister strategies for testing.
E.g.,
```
with op_strategy_context(test_op.default, replicate_op_strategy):
...
```
#### 4. Fix a bug where TupleStrategy never gets flattened as expected:
9df0176408/torch/distributed/tensor/_op_schema.py (L286)
Basically we need to 1) register_pytree_node for TupleStrategy, 2) propagate the schema_info to `strategy_schema` after `strategy_schema = _wrap_with_op_strategy(op_schema)`.
This is the first implementation. The plan is to next add support for sharding on the batch dim as the output strategy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158046
Approved by: https://github.com/wanchaol, https://github.com/wconstab
We don't create new PGs when slicing a DeviceMesh, so it is relatively safe to relax the requirement that one can only slice from the root mesh. This does come with a caveat when the slicing is asymmetric, for example when only some ranks have the sliced-out submesh. So aside from removing the requirement, we also add a warning here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158899
Approved by: https://github.com/wz337
This PR removes the integration point torch.fx had with torch::deploy (and another minor change).
Note: This PR has some mypy errors, but I believe those were already in the code base beforehand and should be fixed in a separate PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291
Approved by: https://github.com/albanD
ghstack dependencies: #158288, #158290
This PR is part of the work to deprecate torch::deploy in OSS. Effectively it does 3 things to get started.
1. Remove test_deploy_interaction as we no longer need to worry about this
2. Remove all torch._running_with_deploy checks and use the False path always (surfaced 1)
3. Remove `USE_DEPLOY` and switch to the default path always
Note: MyPy fails on a number of things here because several older files are touched. It may be better to fix these in a separate PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158288
Approved by: https://github.com/albanD
Hi team,
Please help review this patch.
This PR https://github.com/pytorch/pytorch/pull/150370 tried to fix the "Empty C Call Queue" problem on Python 3.12. It added C calls for each starting Python event with a callable.
I found that the root cause is not that we cannot get C function frames via `PyFrame_GetBack` when PythonTracer is filling start frames, but rather a bug that loses C call events on Python 3.12.0-3.12.4. That bug was fixed by 257c413cd1 in 3.12.5.
So I think https://github.com/pytorch/pytorch/pull/150370 cannot fix the problem, and this patch reverts its change.
There are ways to fix the problem correctly: for example, we could add a new monitoring callback to compensate for the call events of methods backed by C functions, or we could override the callback registered by `PyEval_SetProfile`. These solutions may make the code hard to maintain.
~~Since upgrading the micro version of Python is not difficult for users, we can just ignore C functions and suggest that users upgrade.~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155446
Approved by: https://github.com/sraikund16, https://github.com/cyyever
Thanks to @davidberard98 for much of the analysis here. For GEMMs with K=1, the hints `tl.multiple_of` and `tl.max_contiguous` apply completely, because the indices to the loads depend only on `offs_m` and `offs_n`. For shapes like `(97x1), (1x97)`, this results in misaligned address errors, because for all BLOCK_M and BLOCK_N sizes the last tile is not a contiguous load. In the K > 1 case, the hint is not as strict, since the load also depends on the k indices. In the K=1 case, only `offs_m` and `offs_n` are used and broadcast to the index shape.
One can say these hints are "wrong", but in various cases where the hints are wrong, such as with the shape `9999x4, 4x9999`, there is a substantial performance improvement with the hint.
For nice shapes with K=1, where M and N are multiples of 8 so that these hints are fine and there is no misaligned address, no performance regression is observed on H100:
<img width="547" height="402" alt="Screenshot 2025-07-18 at 5 05 47 PM" src="https://github.com/user-attachments/assets/fee2bbaa-784c-422e-bb8c-43c6c2607ad2" />
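To make the `97` example concrete, the sketch below only does the tile arithmetic: it computes how many valid elements fall in the final tile for a few block sizes. The block sizes listed are hypothetical candidates chosen for illustration, and this simplified model says nothing about whether a given partial tile actually triggers a misaligned address on hardware.

```python
def last_tile_extent(dim, block):
    """Number of valid elements in the final tile along one dimension."""
    remainder = dim % block
    return block if remainder == 0 else remainder


def has_partial_last_tile(dim, block):
    """True if the final tile covers fewer than `block` elements."""
    return last_tile_extent(dim, block) != block


# Hypothetical block-size candidates (assumption, not the actual autotune configs).
BLOCK_SIZES = (16, 32, 64, 128)

for block in BLOCK_SIZES:
    print(block, last_tile_extent(97, block), has_partial_last_tile(97, block))
```

Because 97 is not divisible by any of these block sizes, every candidate leaves a partial final tile along that dimension.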
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158650
Approved by: https://github.com/davidberard98
Add a test that requires weights to be packaged for torch native.
For now, we need `package_weights_in_so=True` for compile standalone. The constants are in a `.o` file and will be added as a source to the CMakeLists.txt of the model.
Once weight deduping is added, we should be able to set this config to False.
```
python test/inductor/test_aot_inductor_package.py -k test_compile_with_exporter_weights
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158750
Approved by: https://github.com/desertfire
@@ -53,7 +53,9 @@ class Vectorized<int64_t> : public Vectorizedi {
     return 8;
   }
   using Vectorizedi::Vectorizedi;
-  Vectorized() {}
+  Vectorized() {
+    values = _mm512_setzero_si512();
+  }
   Vectorized(int64_t v) {
     values = _mm512_set1_epi64(v);
   }