Summary:
The following types of objects don't need to be serialized for precompile:
1. PyCapsule, because we don't guard on C binding objects in meaningful ways.
2. Code objects, because we only do ID matching on these, and ID matches will always be dropped for precompile.
3. Nested function objects, since we also ban CLOSURE_MATCH.
Test Plan:
buck run mode/opt test/dynamo:test_dynamo -- -k test_skipped_objects
Rollback Plan:
Differential Revision: D78816888
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158926
Approved by: https://github.com/jamesjwu
When one builds the CD docker image, all CUDA dependencies must be installed into the `/usr/local/cuda/` folder.
Test plan: Look at the binary build logs, for example [here](https://github.com/pytorch/pytorch/actions/runs/16768141521/job/47477380147?pr=159907):
```
2025-08-06T05:58:00.7347471Z -- NVSHMEM_HOME set to: ''
2025-08-06T05:58:00.7348378Z -- NVSHMEM wheel installed at: ''
2025-08-06T05:58:00.7392528Z -- NVSHMEM_HOST_LIB: '/usr/local/cuda/lib64/libnvshmem_host.so'
2025-08-06T05:58:00.7393251Z -- NVSHMEM_DEVICE_LIB: '/usr/local/cuda/lib64/libnvshmem_device.a'
2025-08-06T05:58:00.7393792Z -- NVSHMEM_INCLUDE_DIR: '/usr/local/cuda/include'
2025-08-06T05:58:00.7394252Z -- NVSHMEM found, building with NVSHMEM support
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159907
Approved by: https://github.com/Skylion007, https://github.com/ngimel
# Background
After I built torch_openreg, I noticed that the wheel package contained the stub.c file under the csrc directory, which is not used at runtime.
# Motivation
This PR aims to remove the stub.c file and any other file that is unused when running torch_openreg.
**Changes:**
- Setting the **include_package_data** keyword to false in the setup function, as sketched below
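For illustration, a minimal sketch (the `setup()` arguments other than `include_package_data` are placeholders, not the real torch_openreg configuration): with `include_package_data=False`, setuptools only ships what the package configuration explicitly lists, so stray files such as `csrc/stub.c` stay out of the wheel.
```python
from setuptools import find_packages, setup

setup(
    name="torch_openreg",          # placeholder metadata
    packages=find_packages(),
    include_package_data=False,    # do not bundle stray non-Python files
)
```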
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159845
Approved by: https://github.com/albanD
**Summary**
This issue proposes implementing a CUDA kernel for aten._weight_int8pack_mm, a weight-only quantized (WOQ) linear operation that is currently only supported on CPU. On CUDA, the fallback path uses an unfused .mul().sum() pattern in quantization.py, which is less efficient for inference. https://github.com/pytorch/pytorch/issues/158849
**Motivation**
A fused GPU kernel for aten._weight_int8pack_mm would:
- Eliminate reliance on the .mul().sum() fallback in quantization.py
- Improve performance for quantized inference on CUDA
- Extend Inductor’s GPU quantization support across more workloads
**Implementation**
- Implement a Triton kernel for:
```
out[b, n] = sum_k(x[b, k] * w[n, k]) * scale[n]
where:
x: [B, K] float32
w: [N, K] int8
scale: [N] float32
out: [B, N] float32
```
- Integrate the kernel with register_woq_mm_ops() in torch/_inductor/quantized_lowerings.py
- Route it conditionally in quantization.py where GPU currently falls back to .mul().sum()
- Add unit tests comparing results to the reference fallback path (a sketch of that reference is shown below)
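For reference, a minimal eager-mode sketch of the computation (not the Triton kernel itself), useful as the comparison baseline in the unit tests:
```python
import torch

def weight_int8pack_mm_ref(x, w, scale):
    # x: [B, K] float32, w: [N, K] int8, scale: [N] float32 -> out: [B, N] float32
    # out[b, n] = sum_k(x[b, k] * w[n, k]) * scale[n]
    return (x @ w.to(x.dtype).t()) * scale

x = torch.randn(256, 1024)
w = torch.randint(-128, 128, (512, 1024), dtype=torch.int8)
scale = torch.rand(512)
out = weight_int8pack_mm_ref(x, w, scale)  # shape [256, 512]
```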
Test Plan:
```
buck2 run 'fbcode//mode/opt' :linalg test_linalg.TestLinalgCUDA.test__int8_mm_m_64_k_64_n_64_compile_True_slice_True_cuda
```
Log: P1882799769
```
buck2 test 'fbcode//mode/opt' caffe2/test:linalg
```
https://www.internalfb.com/intern/testinfra/testconsole/testrun/6755399722424741/
Benchmark Results:
```
**[Shape B=256, K=1024, N=512]**
CPU and CUDA outputs match
Max abs diff: 2.59e-04, max rel diff: 0.75
CPU: 144.14 ms, CUDA: 303.67 µs
Speedup: ×474.6
**[Shape B=512, K=2048, N=1024]**
CPU and CUDA outputs match
Max abs diff: 5.49e-04, max rel diff: 0.15
CPU: 1173.27 ms, CUDA: 2.40 ms
Speedup: ×488.5
```
Rollback Plan:
Differential Revision: D79042656
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159325
Approved by: https://github.com/danielvegamyhre, https://github.com/jerryzh168
In some cases we have mps kernels which are reused across higher-order-op subgraphs and the toplevel code. However, currently we initialize the variable for the mps kernel the first time we use it, which is a problem if we first encounter the mps kernel within a subgraph, since the kernel will then only be initialized within the subgraph scope. For instance:
```
if (...) {
  auto mps_lib_0_func = ...;
  mps_lib_0_func->run();
}
// since we already used mps_lib_0 once, we don't re-initialize it
mps_lib_0_func->run();  // error, mps_lib_0_func not initialized
```
So the solution we took here is to initialize all the kernels at the beginning:
```
const std::shared_ptr<at::native::mps::MetalKernelFunction> get_mps_lib_0() {
  static const auto func = mps_lib_0.getKernelFunction("generated_kernel");
  return func;
}
AOTIMetalKernelFunctionHandle get_mps_lib_0_handle() {
  static const auto handle = AOTIMetalKernelFunctionHandle(get_mps_lib_0().get());
  return handle;
}
...
if (...) {
  get_mps_lib_0()->run();
}
get_mps_lib_0()->run();  // success
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159753
Approved by: https://github.com/malfet
ghstack dependencies: #159456, #159695
Also migrate `test_common_rules.py` since it was a short file
`python test/distributed/tensor/test_common_rules.py`
Before:
Ran 10 tests in 91.516s
After:
Ran 10 tests in 5.604s
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159885
Approved by: https://github.com/ezyang
Summary:
This allows us to start seeing the failure rate on these models (and
potentially alert on it).
Test Plan:
```
FORCE_LOG_TRITON_BUILDS_TO_PROD=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 buck2 run @//mode/opt :compile 2>&1 | tee out
```
P1889607054
Waiting for scuba table to generate, but manual logging show it should show up at https://fburl.com/scuba/pt2_triton_builds_inc_archive/7852kt8h soon.
Rollback Plan:
Reviewed By: masnesral
Differential Revision: D79308333
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159897
Approved by: https://github.com/masnesral
Summary:
This feature is Meta internal only
Add a util function to push dynamic shape-related suggestions to MLHubDebugInsightService, which will then be surfaced to users in MLHub.
The rollout will be controlled by JK.
Test Plan:
MAST job aps-omnifmv3_dev_baseline_test-a34fdccf21
{F1980593060}
* If you're not able to see the insight, please add yourself to the GK 'mlhub_debugging_insights_dev_visibility'.
* The URL should route to a new Job Inspector page that will provide details and straightforward instructions on how to configure the dynamic shapes. The page is currently still in development, so here we use the general PT2 compile JI page.
* Test fails because of the export checks. I'll export after addressing all the comments from reviewers.
Rollback Plan:
Reviewed By: pianpwk
Differential Revision: D78526522
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159391
Approved by: https://github.com/jingsh
Summary: For training jobs particularly from GenAI, NCCL trace dumps are generated in the format of `<hostname>.pci3_rank_<rank>`. For multi-node training jobs, the hostname varies across traces. The current prefix matching logic can't handle this case.
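A hypothetical sketch of hostname-agnostic matching (not the actual fr_trace implementation), shown only to illustrate keying on the `_rank_<rank>` suffix instead of a fixed prefix:
```python
import re

RANK_RE = re.compile(r"_rank_(\d+)$")

def rank_from_filename(name: str):
    # "host0.pci3_rank_1" -> 1, regardless of the hostname portion
    m = RANK_RE.search(name)
    return int(m.group(1)) if m else None

assert rank_from_filename("host0.pci3_rank_1") == 1
assert rank_from_filename("host1.pci3_rank_0") == 0
```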
Test Plan:
Create a local folder `dumps` and several empty files: `host0.pci3_rank_0`, `host0.pci3_rank_1`, `host1.pci3_rank_0`, `host1.pci3_rank_1` inside it. Then run
```
buck2 run fbcode//caffe2/fb/flight_recorder:fr_trace -- trace_dir dumps
```
Before this diff, fr_trace cannot locate any trace files, giving the following assertion error:
```
AssertionError: no files loaded from /home/tianhaoh/dumps with prefix pci3_rank_
```
After this diff, fr_trace is able to locate the trace files, resulting in exceptions like
```
dump = pickle.load(infile)
^^^^^^^^^^^^^^^^^^^
EOFError: Ran out of input
```
(since the trace files are fake and empty).
Rollback Plan:
Differential Revision: D79224727
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159490
Approved by: https://github.com/fduwjj
Summary:
OrderedImporters is supposed to be an importer which tries out every single importer in self._importers. However, the get_name API does not follow this behavior and only uses the get_name from the basic Importer class.
This change updates the OrderedImporters get_name API so that it tries the get_name API of every importer.
Differential Revision: D76463252
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155743
Approved by: https://github.com/jcwchen, https://github.com/jingsh
When we set up the logging config as in the guide https://docs.pytorch.org/docs/stable/logging.html,
such as:
TORCH_LOGS="+schedule,+inductor,+output_code"
On Linux, it shows as:
```cmd
declare -x SSH_TTY="/dev/pts/0"
declare -x TERM="xterm"
declare -x TORCH_LOGS="+schedule,+inductor,+output_code"
declare -x USER="xu"
```
On Windows, it shows as:
```cmd
TORCHINDUCTOR_WINDOWS_TESTS=1
TORCH_LOGS="+schedule,+inductor,+output_code"
UCRTVersion=10.0.22000.0
```
Linux shows the quotes by default, while Windows does not show them.
Besides that, Windows automatically adds quotes when processing environment variables.
On Linux, we will get the variable: "+schedule,+inductor,+output_code"
On Windows, we will get the variable: '"+schedule,+inductor,+output_code"'
So we need to remove the outer quotes on Windows, as sketched below.
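A minimal sketch of the idea (`_strip_outer_quotes` is a hypothetical helper, not the exact code in this PR):
```python
import os

def _strip_outer_quotes(value: str) -> str:
    # On Windows the value may arrive as '"+schedule,+inductor,+output_code"',
    # so drop a single pair of wrapping quotes if present.
    if len(value) >= 2 and value[0] == value[-1] and value[0] in ('"', "'"):
        return value[1:-1]
    return value

torch_logs = _strip_outer_quotes(os.environ.get("TORCH_LOGS", ""))
```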
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159887
Approved by: https://github.com/angelayi
Summary:
This is needed for subprocesses that are trying to call back into torch
functionality, i.e. anything that's also setting `PYTHONPATH`. There are more
`sys.executable` subprocesses in torch/ but it seems like they're fine.
Test Plan: Local inference runs.
Reviewed By: aorenste
Differential Revision: D79124705
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159382
Approved by: https://github.com/aorenste
Originally, if the PT2 errored when loading, we would try to load it using the old loader to handle BC issues. However, this hides the error messages when an up-to-date PT2 fails to load for some other reason.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159881
Approved by: https://github.com/yushangdi
Summary:
- debug.py: Added log_runtime_estimates() function to dump runtime estimation data as structured tlparse artifacts in JSON format
- test_structured_trace.py: Added comprehensive test coverage for compute and collective ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159730
Approved by: https://github.com/yushangdi
ghstack dependencies: #159190
Removes unused docker images from the docker build workflow
Then removes unused definitions in build.sh
The only one I left is the vllm one because I'm pretty sure it's going to be used in the future
I assume everything not mentioned is old and we forgot to remove them
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159171
Approved by: https://github.com/yangw-dev
No functional changes, just:
- Update C++ standard to C++17
- Update `cmake` min version to 3.18
- Update `libuv` dependency to 1.51 (to move its cmake min version to 3.10)
- Replace boost optional implementation with `std::optional` wrapper
- Make it compilable with gcc-14.x and later by including `cstddef` in a few headers
- Avoid using deprecated enums for MacOS builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159834
Approved by: https://github.com/Skylion007
With fsdp, we sometimes have multiple, non-overlapping views of a single buffer which are all mutated. Previously we considered the original buffer as an allocation, and made the mutated buffer the deallocation. With multiple mutations of the same buffer, we need to consider the original buffer as deallocated only when all of its aliases die (and avoid double counting the input buffer size). See comment inline:
```
When an operation mutates a buffer in-place, the scheduler creates a new buffer name
to track the "before" and "after" states, even though they share the same memory.
The mutated buffer represents a rename with zero allocation and deallocation cost.
During dependency tracking, we transfer dependencies from the mutated name back to
the original buffer, ensuring the original memory is only freed when all aliases
are done.
This handles cases where a buffer has multiple non-overlapping aliases - rather than
trying to assign free costs to individual aliases, we forward all alias dependencies
to the original buffer.
Consider:
buf0 = op0()
buf1 = mutation_op_(buf0)
del buf0
...
op(buf1)
del buf1
The only memory events are the creation prior to op0, and the deletion following buf1.
```
As @IvanKobzarev's logs in https://github.com/pytorch/pytorch/pull/158361/files#diff-e173a1d52aff49959c9f6d17ecc09946d8a616fc5909df884e62a15e1ebd1d41R1776-R1807 show, it can be a bit of a pain to pinpoint which part of our memory calculation is incorrect.
This pr also adds a runtime verifier `config.test_configs.track_memory_lifecycle` which tracks buffer allocation and deallocation, and errors if their lifetime does not match our expectations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159569
Approved by: https://github.com/IvanKobzarev
This would allow me to eventually enable more opinfo tests against the MPS device. It was supposed to be a very simple change, but it actually required minor adjustments to lots of test files, namely:
- Introduce `all_mps_types_and` that is very similar to `all_types_and`, but skips `float64`
- Decorate lots of tests with `@dtypesIfMPS(*all_mps_types())`
- Skip `test_from_dlpack_noncontinguous` as it currently crashes (need to be fixed)
- Add lots of `expectedFailureIfMPS`
- Delete all `@onlyNativeDeviceTypesAnd("mps")`
<sarcasm> I love how well documented this variable is </sarcasm>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153835
Approved by: https://github.com/Skylion007
\# Why
- Make loop iteration simpler
- Have a common spot where to make modifications that affect
all the GEMM Triton templates, avoiding missed spots
\# What
- pull out common logic of taking the BaseConfig objects
and turning them into kwargs to feed into maybe_append_choice
for Triton GEMM templates
Differential Revision: [D79186962](https://our.internmc.facebook.com/intern/diff/D79186962)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159383
Approved by: https://github.com/jansel
# Motivation
This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`.
Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156175
Approved by: https://github.com/albanD
ghstack dependencies: #159629, #150312, #156165
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate those options that overlap with `AcceleratorAllocatorConfig` in a following PR and keep them only for BC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
ghstack dependencies: #159629
Summary: Inductor's Triton 3.4 release is the most commonly used variant of Triton, but if someone is working with an alternative version of Triton this may not match. This moves the version check from requiring Triton 3.4 to accepting any variant that has support for the TMA APIs.
Test Plan:
Relying on CI. Should be a NFC.
Rollback Plan:
Reviewed By: davidberard98
Differential Revision: D79378792
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159777
Approved by: https://github.com/davidberard98
It can be very slow to repeatedly hit DNS resolution failures, but it's very helpful to have DNS names in logs by default. So we try to use DNS, but if we hit a transient failure we just disable it for the remainder of the job and log IP addresses instead.
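A hypothetical sketch of the pattern (not the actual implementation):
```python
import socket

_dns_disabled = False

def host_label(addr: str) -> str:
    # Try DNS first; after one transient failure, fall back to raw IP
    # addresses for the remainder of the job.
    global _dns_disabled
    if not _dns_disabled:
        try:
            return socket.gethostbyaddr(addr)[0]
        except OSError:
            _dns_disabled = True
    return addr
```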
Fixes #159007
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159596
Approved by: https://github.com/d4l3k
This PR fixes `cmake/Dependencies.cmake` to work when compiling with `USE_SYSTEM_XNNPACK=ON` by changing a lowercase `or` to an uppercase `OR`.
---
For a personal project, I was building pytorch with a customized build of XNNPACK. When trying to do so I encountered the following error:
```
CMake Error at cmake/Dependencies.cmake:566 (if):
if given arguments:
"NOT" "XNNPACK_LIBRARY" "or" "NOT" "microkernels-prod_LIBRARY"
Unknown arguments specified
Call Stack (most recent call first):
CMakeLists.txt:868 (include)
```
Upon making the change in this PR (changing `or` to `OR`), the process continued as expected.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159527
Approved by: https://github.com/janeyx99
ROCm inductor benchmark builds failing fbgemm build stage https://ossci-raw-job-status.s3.amazonaws.com/log/46800456622
```
2025-07-27T08:00:32.3443858Z /var/lib/jenkins/pytorch/fbgemm/src/RowWiseSparseAdagradFused.cc:389:18: error: no matching function for call to ‘asmjit::v1_17::x86::Vec::Vec(uint32_t)’
2025-07-27T08:00:32.3444080Z 389 | x86::Xmm partial_sum_xmm(partial_sum_vreg.id());
```
It looks like asmjit fails to build; this seems to be due to submodules of fbgemm not being updated after checking out the new commit.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159477
Approved by: https://github.com/pruthvistony, https://github.com/eqy
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@1f7a57](1f7a57f507) includes:
- Add Template Parameter to the function `gpu_kernel` for Controlling Broadcasting Vectorization
- Add optional NaN checks to XCCL
- Fix NllLossForwardReduce2DKernelFunctor accuracy
- Extend the existing communication logging to include the reduction operation for collective calls
- [Reland] Install xpu codegen header to torch/include
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159621
Approved by: https://github.com/EikanWang
**Motivation / Context**: (what I _think_ is happening here)
In "eager"/just-in-time PT2 usage, dynamo/inductor will guard on whether indices fit in int32 or not. So it's generally safe in Inductor code to rely on the example values for symbolic ints in order to determine whether indices fit in int32, because the indices will be guarded on anyway; and if the inputs ever increase to `>int32_max`, dynamo will cause a recompilation.
But with AOTI, those int32 guards aren't respected; so if the example input is `< int32_max` but can be `> int32_max` during future execution, then the future execution might fail / IMA.
**Solution space**
Export allows users to specify which dimensions are dynamic, and to provide **ranges of valid sizes**.
One solution idea is to always respect the upper bound of the dynamic shape range when doing AOTI; if the index's range includes values `>int32_max`, then don't use the hint and assume that this index doesn't fit in int32.
However, the problem with this is that many users may specify dynamism without specifying a range of values - the upper bound of the range will be set to the default of `inf`. Such use cases could potentially experience a perf regression if we implemented the idea above.
To prevent any such regressions, this implementation relies on the specified range (and ignores the hints/example values for AOTI) only if the upper bound of the range isn't inf: if users explicitly specify a range that extends past int32_max, we can be fairly sure that they actually do need values `>int32_max`.
If we continue to see correctness issues even with this implementation, we could consider more aggressively relying on the ranges.
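For reference, a hedged sketch of how a user can communicate such a range through export (the module and sizes are made up for illustration):
```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

# An explicit upper bound past int32_max tells AOTI not to assume
# int32-safe indexing for this dimension.
batch = Dim("batch", min=2, max=2**40)
ep = export(M(), (torch.randn(1024),), dynamic_shapes={"x": {0: batch}})
```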
Differential Revision: [D79220301](https://our.internmc.facebook.com/intern/diff/D79220301)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159433
Approved by: https://github.com/jingsh, https://github.com/ColinPeppler
Summary: Fix https://github.com/pytorch/pytorch/issues/159612
- Fix the meta implementation of `nan_to_num`, it should preserve the stride of the input
- The DeviceCopy IR node should always preserve the input's layout, so we don't end up with a contiguous call during device copy
Test Plan:
```
buck2 run @mode/dev-nosan fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_d2h_copy
```
Rollback Plan:
Differential Revision: D79411407
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159615
Approved by: https://github.com/eellison
Fixes#159601
Unfortunately #156868 introduced a couple of regressions (see #159590 and #159601). This reverts the commit while I am working on a permanent fix. This means the `in_compiled_autograd_initial_trace` global flag will be removed and `_are_we_tracing()` will be replaced with the symint preprocessing step during sharding prop post-init.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159671
Approved by: https://github.com/xmfan
Summary:
Currently this function relies on the assumption that we load `libnvshmem_device.a` statically and load `libnvshmem_host.so` at runtime. When loading `libnvshmem.a` (the two combined) statically, this will fail. Add a section that checks whether a symbol from the host API exists at runtime to determine if nvshmem is loaded statically.
Test Plan:
CI + sample run
Rollback Plan:
Differential Revision: D79177525
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159561
Approved by: https://github.com/kwen2501
Due to the different byte order on big-endian systems, when copying data it has to be put into the last bytes to ensure that an int32_t converted to an int64_t keeps the same value. The same has to be done when it's converted back.
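An illustrative sketch of the byte-order issue (in Python, not the actual C++ code):
```python
import struct

# On a big-endian machine the 4 bytes of an int32 must occupy the *last*
# 4 bytes of an 8-byte buffer for the value to survive being read back
# as an int64.
value = 42
buf = b"\x00" * 4 + struct.pack(">i", value)  # pad first, payload last
assert struct.unpack(">q", buf)[0] == value
```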
This change fixes the test TestLibtorchAgnosticCPU::test_my_ones_like_cpu from cpp_extensions/libtorch_agnostic_extension/test/test_libtorch_agnostic.py on s390x.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155115
Approved by: https://github.com/huydhn
This change fixes multiple tests in test/inductor/test_aot_inductor_arrayref.py, such as test_cond_with_parameters_cpu_with_stack_allocation, test_issue_140766_cpu_with_stack_allocation, test_model_modified_weights_cpu_with_stack_allocation, and test_nested_tensor_from_jagged_cpu_with_stack_allocation.
Enable tests in test/inductor/test_aot_inductor_arrayref.py
This change is split off from https://github.com/pytorch/pytorch/pull/150116
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157784
Approved by: https://github.com/huydhn
The previous implementation was creating `n_gpu * n_tensors` intermediate tensors, which was adding a lot of CPU overhead, especially given that inductor was generating a number of individual tensor copy kernels for `torch.cat`.
This PR changes the implementation so that only `n_tensors` are created, making the CPU overhead proportional to the number of tensors being bucketed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159723
Approved by: https://github.com/IvanKobzarev
The output of a reduce_scatter is n_gpu times smaller than its input, while the output of an all_gather is n_gpu times larger than its input. This means that in the current heuristic for bucketing reduce_scatter, we would need to use a bucket size which is n_gpu times larger than the bucket for all_gather, making it gpu-dependent and less intuitive. This PR proposes to instead use the max of the input and output sizes, so that one can use the same bucket_size value for both passes (see the sketch below).
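A hedged sketch of the heuristic (`bucketing_size_bytes` is a hypothetical helper, not the actual code):
```python
def bucketing_size_bytes(input_numel: int, output_numel: int, elem_size: int) -> int:
    # Size a bucket by the larger of the collective's input and output, so a
    # single bucket_size threshold works for both all_gather (output is
    # world_size x larger) and reduce_scatter (input is world_size x larger).
    return max(input_numel, output_numel) * elem_size
```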
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159717
Approved by: https://github.com/wconstab
#158649 turned off automatic GCs during cudagraph recording. This is causing a small uptick in some internal benchmark numbers because of memory the benchmark is leaving around before the benchmark starts, so we now run a GC before warming up the model.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159670
Approved by: https://github.com/oulgen
The RuntimeError message in the is_nonzero(input) method is updated from "bool" to "Boolean".
**Case 1:**
t = torch.tensor([])
torch.is_nonzero(t)
**Case 2:**
t = torch.tensor([1,2])
torch.is_nonzero(t)
**Existing Error message in documentation:**
for case 1: RuntimeError: bool value of Tensor with no values is ambiguous
for case 2: RuntimeError: bool value of Tensor with more than one value is ambiguous
**Proposed Error message in documentation:**
for case 1: RuntimeError: Boolean value of Tensor with no values is ambiguous
for case 2: RuntimeError: Boolean value of Tensor with more than one value is ambiguous
Fixes#159710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159712
Approved by: https://github.com/malfet
# Motivation
This PR enables _int_mm on Intel GPU. _int_mm is used by int8 quantization in torchao.
# Model Test Result:
We run meta-llama/Llama-3.1-8B-Instruct on Intel GPU and A100 using torchao int8-dynamic-quantization. The model configuration is as follows:
Precision : torch.bfloat16
quantization configuration : Int8DynamicActivationInt8WeightConfig
dataset : wikitext
Result:
The perplexity values for Intel GPU and A100 are 9.582953453063965 and 9.57755184173584, respectively.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157769
Approved by: https://github.com/EikanWang, https://github.com/desertfire
Summary: Turns out I added this to reduce how frequently we'd call try_update_max_size_at_index when a new maximum is found before the replan is called. Oops.
Test Plan:
backout
Rollback Plan:
Differential Revision: D79474114
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159677
Approved by: https://github.com/georgiaphillips
Summary:
The launch grid calculation code uses a Python trick to achieve CeilDiv() through negative integer division with FloorDiv(). This is language-dependent behaviour that doesn't apply to all languages.
In the FXIR backend we undo this behaviour and replace the expression with a CeilDiv() operation so the computation is correct regardless of the language used. We are not directly changing the original computation, as that leads to a performance degradation.
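For reference, the Python idiom in question expresses ceiling division via negative floor division, which only works because of Python's floor-division semantics for negative operands:
```python
import math

def ceil_div(a: int, b: int) -> int:
    # -(-7 // 2) == 4, i.e. math.ceil(7 / 2)
    return -(-a // b)

assert ceil_div(7, 2) == 4 == math.ceil(7 / 2)
```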
Test Plan:
CI
Rollback Plan:
Differential Revision: D79275534
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159497
Approved by: https://github.com/blaine-rister
This change introduces structured logging of the collective communication schedule, enabling downstream tools (e.g. TLParse) to ingest and analyze per‑rank collective‐order information for multi‑rank jobs.
- Iterates over scheduler.nodes, filters for _CollectiveKernel nodes
- Extracts each op’s python_kernel_name
- Emits a structured JSON payload under the inductor_collective_schedule artifact name
- Dumps the full schedule list to collective_schedule.json via the PyTorch trace‑structured artifact
- Added comprehensive unit tests for collective schedule tracing: Created test_collective_schedule_empty() and test_collective_schedule_real() tests to verify structured trace logging works correctly for both empty collective schedules and real collective operations (like all_reduce and wait_tensor from _c10d_functional ops).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159190
Approved by: https://github.com/yushangdi, https://github.com/xmfan
# Motivation
While refactoring the caching allocator, I noticed that the `ExpandableSegment` constructor on CUDA had an unused parameter. This change removes that unused argument to avoid potential confusion.
# Additional Context
I noticed that `ExpandableSegment` is defined in a cpp file, so it should be safe to make this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159356
Approved by: https://github.com/ngimel, https://github.com/albanD
ghstack dependencies: #159159
PyTorch with ROCm on Windows is built with clang-cl and not MSVC. This code path is specific to the MSVC compiler so it should be checking for _MSC_VER, not just WIN32. The change here is similar to https://github.com/pytorch/pytorch/pull/146606.
This fixes downstream build errors using clang-cl like https://github.com/ROCm/TheRock/actions/runs/16569646709/job/46858176812 (patched and tested downstream at https://github.com/ROCm/TheRock/pull/1140):
```
[7099/7147] Building CXX object functorch\CMakeFiles\functorch.dir\csrc\dim\dim.cpp.obj
FAILED: functorch/CMakeFiles/functorch.dir/csrc/dim/dim.cpp.obj
C:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\lib\llvm\bin\clang-cl.exe /nologo -TP -DEXPORT_AOTI_FUNCTIONS -DFUNCTORCH_BUILD_MAIN_LIB -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DNOMINMAX -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DROCM_ON_WINDOWS -DROCM_USE_FLOAT16 -DROCM_VERSION=70000 -DTORCH_API_INCLUDE_EXTENSION_H -DTORCH_EXTENSION_NAME=_C -DTORCH_HIP_VERSION=700 -DUSE_EXTERNAL_MZCRC -DUSE_MIMALLOC -DUSE_PROF_API=1 -DWIN32_LEAN_AND_MEAN -D_CRT_SECURE_NO_DEPRECATE=1 -D_UCRT_LEGACY_INFINITY -D__HIP_PLATFORM_AMD__ -D__HIP_PLATFORM_AMD__=1 -Dfunctorch_EXPORTS -IB:\src\torch\build\aten\src -IB:\src\torch\aten\src -IB:\src\torch\build -IB:\src\torch -IB:\src\torch\nlohmann -IB:\src\torch\moodycamel -IB:\src\torch\third_party\mimalloc\include -IB:\src\torch\functorch -IB:\src\torch\torch\csrc\api -IB:\src\torch\torch\csrc\api\include -IB:\src\torch\c10\.. -IB:\src\torch\c10\hip\..\.. -IB:\src\torch\torch\.. -IB:\src\torch\torch\..\aten\src -IB:\src\torch\torch\..\aten\src\TH -IB:\src\torch\build\caffe2\aten\src -IB:\src\torch\build\third_party -IB:\src\torch\build\third_party\onnx -IB:\src\torch\torch\..\third_party\valgrind-headers -IB:\src\torch\torch\..\third_party\gloo -IB:\src\torch\torch\..\third_party\onnx -IB:\src\torch\torch\..\third_party\flatbuffers\include -IB:\src\torch\torch\..\third_party\kineto\libkineto\include -IB:\src\torch\torch\..\third_party\cpp-httplib -IB:\src\torch\torch\..\third_party\nlohmann\include -IB:\src\torch\torch\csrc -IB:\src\torch\torch\lib -IB:\src\torch\torch\standalone -IB:\src\torch\torch\lib\libshm_windows -imsvcC:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\include -imsvcB:\src\torch\third_party\protobuf\src -imsvcB:\src\torch\third_party\XNNPACK\include -imsvcB:\src\torch\third_party\ittapi\include -imsvcB:\src\torch\cmake\..\third_party\eigen -imsvcB:\src\torch\third_party\ideep\mkl-dnn\include\oneapi\dnnl -imsvcB:\src\torch\third_party\ideep\include -imsvcB:\src\torch\INTERFACE -imsvcB:\src\torch\third_party\nlohmann\include -imsvcB:\src\torch\third_party\concurrentqueue -imsvcC:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\include\hiprand -imsvcC:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\include\rocrand -imsvcB:\src\torch\cmake\..\third_party\pybind11\include -imsvcC:\home\runner\_work\_tool\Python\3.11.9\x64\include /DWIN32 /D_WINDOWS /EHsc /Zc:__cplusplus /bigobj /FS /utf-8 -DUSE_PTHREADPOOL -DNDEBUG -DUSE_FBGEMM -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE /wd4624 /wd4068 /wd4067 /wd4267 /wd4661 /wd4717 /wd4244 /wd4804 /wd4273 /O2 /Ob2 /DNDEBUG /bigobj -DNDEBUG -std:c++17 -MD -Z7 -Wmissing-prototypes -Werror=missing-prototypes /permissive- /d2implyavx512upperregs- /EHsc /bigobj -fms-runtime-lib=dll -D__HIP_PLATFORM_AMD__=1 -DCUDA_HAS_FP16=1 -DUSE_ROCM -D__HIP_NO_HALF_OPERATORS__=1 -D__HIP_NO_HALF_CONVERSIONS__=1 -DTORCH_HIP_VERSION=700 -Wno-shift-count-negative -Wno-shift-count-overflow -Wno-duplicate-decl-specifier -DCAFFE2_USE_MIOPEN -DTHRUST_DEVICE_SYSTEM=THRUST_DEVICE_SYSTEM_HIP -std=c++17 -DHIPBLAS_V2 -DHIP_ENABLE_WARP_SYNC_BUILTINS -fms-extensions -Wno-ignored-attributes /showIncludes /Fofunctorch\CMakeFiles\functorch.dir\csrc\dim\dim.cpp.obj /Fdfunctorch\CMakeFiles\functorch.dir\ -c -- B:\src\torch\functorch\csrc\dim\dim.cpp
clang-cl: warning: unknown argument ignored in clang-cl: '-std=c++17' [-Wunknown-argument]
clang-cl: warning: argument unused during compilation: '/d2implyavx512upperregs-' [-Wunused-command-line-argument]
In file included from B:\src\torch\functorch\csrc\dim\dim.cpp:36:
B:\src\torch\functorch\csrc\dim\arena.h(14,21): error: functions that differ only in their return type cannot be overloaded
14 | inline unsigned int __builtin_clz(unsigned int x) {
| ~~~~~~~~~~~~ ^
C:\home\runner\_work\_tool\Python\3.11.9\x64\Lib\site-packages\_rocm_sdk_devel\lib\llvm\lib\clang\20\include\ia32intrin.h(60,15): note: '__builtin_clz' is a builtin with type 'int (unsigned int) noexcept'
60 | return 31 - __builtin_clz((unsigned int)__A);
| ^
1 error generated.
[7100/7147] Building CXX object caffe2\torch\CMakeFiles\torch_python.dir\csrc\utils\tensor_list.cpp.obj
```
> [!NOTE]
> I haven't been able to reproduce those errors locally, but we have CI jobs that consistently fail when building for Python 3.11 but not 3.12 or 3.13. I'm not sure what is different between those builds, but the code fix seems correct.
There are a few other variations on fixes to this floating around, such as:
* a97a957af0/lz4.c (L34-L43) (checking with `__has_builtin`)
* c98c55ec7e/lj92.c (L31-L46) (the same code as here, but with `_MSC_VER`)
* 2760e5a2bb/def.h (L23-L25) (using `__lzcnt` instead of a custom implementation)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159273
Approved by: https://github.com/Skylion007, https://github.com/m-gallus
Previous version of `torch._tensor_str._Formatter` was not using `PRINT_OPTS.sci_mode` for the `max_width` computation but was using it for the formatting of values leading to a weird discrepancy.
Now, the code first checks whether it should be in sci_mode, then computes `max_width`.
Here is an example to test the behavior:
```python
A = torch.tensor([10, 1e-1, 1e-2])
B = torch.tensor([10, 1e-1, 1e-1])
print("================= Default =================")
print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")
print("================= sci_mode=False =================")
with torch._tensor_str.printoptions(sci_mode=False):
print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")
print("================= sci_mode=True =================")
with torch._tensor_str.printoptions(sci_mode=True):
print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")
```
In the current version this prints:
```
================= Default =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7
================= sci_mode=False =================
tensor([ 10.0000, 0.1000, 0.0100]) Formatter max_width: 10
tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7
================= sci_mode=True =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([1.0000e+01, 1.0000e-01, 1.0000e-01]) Formatter max_width: 7
```
One can see that in `sci_mode=False`, the values of A are prefixed with unneeded 0s and do not have the same `max_width` as B (it keeps the `max_width` from `sci_mode = None`).
Also, in `sci_mode = True`, for B, the `max_width` is 7 but each value takes 10 chars (this is fine as the code that uses `max_width` does not rely much on it, but it is still misleading).
After this commit, this will print
```
================= Default =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7
================= sci_mode=False =================
tensor([10.0000, 0.1000, 0.0100]) Formatter max_width: 7
tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7
================= sci_mode=True =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([1.0000e+01, 1.0000e-01, 1.0000e-01]) Formatter max_width: 10
```
This also allows aligning A with B for `sci_mode=False`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126859
Approved by: https://github.com/malfet
This refactors the pipelining schedule tests since a lot of them have the same repeated code of:
1. Create pipelined model and reference model
2. Run reference model and pipelined model
3. compare gradients
So this refactors those parts above into helper methods and reduces ~300 LOC. Also adds a better gradient check to resolve flakiness (fixes https://github.com/pytorch/pytorch/issues/154408).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158780
Approved by: https://github.com/wconstab
**Summary**
This PR adds an all-gather based FlexAttention and uses TorchFunctionMode to dispatch
`FlexAttentionHOP.__call__` to it.
This PR makes the following changes:
- add a user-facing API `create_cp_block_mask` for creating CP-specific `BlockMask`
which masks over the attention result of Q shard and KV global.
- add `_ContextParallelGlobalVars` to store all necessary global vars that CP FlexAttention
requires. `torch_function_mode` is critical to maintain singleton mode to avoid dynamo
recompilations.
- add a dispatch path for `FlexAttentionForwardHOP.__call__` (TorchFunctionMode dispatch
won't work correctly without this line)
What's not in this PR:
- QKV load balancing
- Test on other masking besides `causal_mask`.
- Support on small attention (i.e. qkv size is smaller than 128) because the block mask
rewrite function requires `Q_BLOCK_SIZE == KV_BLOCK_SIZE == 128`.
**Test**
`pytest test/distributed/tensor/test_attention.py -s -k test_ring_flex_attention`
**Followup**
1. create an issue to reproduce the error in `create_fw_bw_graph()` when trying to call `create_block_mask`
to re-write `block_mask` in `FlexAttentionHOP` dispatch in `TorchFunctionMode`.
2. Merge `_ContextParallelGlobalVars` and `_cp_options`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158692
Approved by: https://github.com/drisspg
The current executorch pin needs to be updated
The next time the docker image gets rebuilt, the executorch docker build is going to fail like https://github.com/pytorch/pytorch/actions/runs/16626853655/job/47137807966
The failure is that the pin uses a version of the nightly that has been removed from the nightly index
```
#62 72.30 ERROR: Could not find a version that satisfies the requirement torch==2.8.0.dev20250601 (from versions: 1.11.0, 1.12.0, 1.12.1, 1.13.0, 1.13.1, 2.0.0, 2.0.1, 2.1.0, 2.1.1, 2.1.2, 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.4.0, 2.4.1, 2.5.0, 2.5.1, 2.6.0, 2.7.0, 2.7.1, 2.8.0.dev20250602+cpu, 2.8.0.dev20250603+cpu, 2.8.0.dev20250604+cpu, 2.8.0.dev20250605+cpu, 2.8.0.dev20250606+cpu, 2.8.0.dev20250607+cpu, 2.8.0.dev20250608+cpu, 2.8.0.dev20250609+cpu, 2.8.0.dev20250610+cpu, 2.8.0.dev20250611+cpu, 2.8.0.dev20250612+cpu, 2.8.0.dev20250613+cpu, 2.8.0.dev20250614+cpu, 2.8.0.dev20250615+cpu, 2.8.0.dev20250616+cpu, 2.8.0.dev20250617+cpu, 2.8.0.dev20250618+cpu, 2.8.0.dev20250619+cpu, 2.8.0.dev20250620+cpu, 2.8.0.dev20250621+cpu, 2.8.0.dev20250622+cpu, 2.8.0.dev20250623+cpu, 2.8.0.dev20250624+cpu, 2.8.0.dev20250625+cpu, 2.8.0.dev20250626+cpu, 2.8.0.dev20250627+cpu, 2.9.0.dev20250628+cpu, 2.9.0.dev20250629+cpu, 2.9.0.dev20250630+cpu, 2.9.0.dev20250701+cpu, 2.9.0.dev20250702+cpu, 2.9.0.dev20250703+cpu, 2.9.0.dev20250704+cpu, 2.9.0.dev20250705+cpu, 2.9.0.dev20250706+cpu, 2.9.0.dev20250707+cpu, 2.9.0.dev20250708+cpu, 2.9.0.dev20250709+cpu, 2.9.0.dev20250710+cpu, 2.9.0.dev20250711+cpu, 2.9.0.dev20250712+cpu, 2.9.0.dev20250713+cpu, 2.9.0.dev20250714+cpu, 2.9.0.dev20250715+cpu, 2.9.0.dev20250716+cpu, 2.9.0.dev20250717+cpu, 2.9.0.dev20250718+cpu, 2.9.0.dev20250719+cpu, 2.9.0.dev20250720+cpu, 2.9.0.dev20250722+cpu, 2.9.0.dev20250723+cpu, 2.9.0.dev20250724+cpu, 2.9.0.dev20250725+cpu, 2.9.0.dev20250726+cpu, 2.9.0.dev20250727+cpu, 2.9.0.dev20250728+cpu, 2.9.0.dev20250729+cpu, 2.9.0.dev20250730+cpu, 2.9.0.dev20250731+cpu)
#62 72.30 ERROR: No matching distribution found for torch==2.8.0.dev20250601
```
The executorch hash update currently fails due to https://github.com/pytorch/pytorch/actions/runs/16636773244/job/47079169392
```
2025-07-31T01:56:57.0249165Z + echo 'expecting triton to not be installed, but it is'
2025-07-31T01:56:57.0249614Z expecting triton to not be installed, but it is
2025-07-31T01:56:57.0249969Z + exit 1
2025-07-31T01:58:27.6764352Z ##[error]Final attempt failed. Child_process exited with error code 1
```
I believe the cause is https://github.com/pytorch/executorch/pull/11653, where the nightly pytorch is installed from our index, but then requirements-examples installs timm from PyPI, which reinstalls pytorch, except it's the release build for CUDA from PyPI, which then causes triton to be installed.
I don't know what the intended behavior is, so I'm disabling the executorch docker build, the executorch build, and the nightly hash update; apparently the test was already disabled because it was failing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159595
Approved by: https://github.com/malfet
**Summary**
`_weight_int8pack_mm` on CPU may cause a segmentation fault if the output shape is large (i.e., M * N is large). This is because the kernel computes the output buffer address by
```c++
auto* C_ptr = C_data + mb_start * N + nb_start;
```
where both `mb_start` and `N` are `int` and when they are large their product may overflow.
The solution is simple: declare these variables as `int64_t` so that the product won't overflow.
**Test plan**
```
pytest -sv test/test_linalg.py -k test__int8_mm_large_shape
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158341
Approved by: https://github.com/mingfeima, https://github.com/drisspg
Fixes the typo `enought` → `enough` in 3 places in these files:
```
aten/src/ATen/native/cuda/AdaptiveAveragePooling.cu
aten/src/ATen/native/cuda/CuFFTPlanCache.h
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159587
Approved by: https://github.com/ezyang
This PR is a big copy pasta from `c10/util/Float8*` -> `torch/headeronly/util/` which is why we are breaking PR sanity :C (sorry @albanD!).
Why is it not a clean copy paste?
- For BC reasons, we have to keep the old c10 file around so that OSS devs relying on those files can still get the same APIs
- Because we reexpose APIs that are headeronly through torch::headeronly, so there is an extra chunk of code in the new torch::headeronly files to do that.
Outside of the copy paste, I:
- changed the tests to call torch::headeronly instead of c10
- updated header_only_apis.txt
- added `// NOLINTNEXTLINE(bugprone-narrowing-conversions,cppcoreguidelines-narrowing-conversions)` to pass lint (which was previously skipped for -inl.h files)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159415
Approved by: https://github.com/albanD
- Sort strategy now supports sharding on non sorted dim.
~~- Fix histc xfail.~~
- ~~Previously `python test/distributed/tensor/test_dtensor_ops.py TestDTensorOpsCPU.test_dtensor_op_db_histc_cpu_float32` will fail with `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=18`. However, if we run `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=18 python test/distributed/tensor/test_dtensor_ops.py TestDTensorOpsCPU.test_dtensor_op_db_histc_cpu_float32`, the test will pass. This kind of error is due to DTensor reuses the strategy schema hashing. It turns out that not only the strategy, the result correctness also depends on `static_argnum` or the op will reuse the previous args from hashed schema and output wrong results. I updated the document also.~~ (fixed in https://github.com/pytorch/pytorch/pull/159289)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159189
Approved by: https://github.com/XilunWu
scaled_grouped_mm's kernel only supports column-major on the second operand. I -think- this is just for efficiency reasons. But inductor treats that buffer as flexible and may tweak the strides to be row-major instead, as seen in the issue.
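A small illustration of what "column-major on the second operand" means in terms of strides (a hypothetical example, unrelated to the actual kernel code):
```python
import torch

b = torch.randn(64, 128)               # row-major: strides (128, 1)
b_col_major = b.t().contiguous().t()   # same shape, column-major: strides (1, 64)
print(b.stride(), b_col_major.stride())
```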
~Tagging the op as "needs_fixed_stride_order"/"needs_exact_strides" does not work. Inductor only considers those tags for ops that don't have registered lowering (not sure if this is intended). scaled_grouped_mm does have a lowering, so we never check its tags.~ From discussion below, the op tags are expected to work.
FIXES https://github.com/pytorch/pytorch/issues/159097
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159134
Approved by: https://github.com/eellison
Summary:
VariadicOpConverter and FuseListUnpackConverter would introduce ops that only have CPU kernels.
Currently, the graph passes are run if static_dispatch is enabled.
As we plan to enable static_dispatch by default, this diff adds an additional check so the graph passes only apply to nodes that have all inputs/outputs on CPU.
Test Plan:
CI
Rollback Plan:
Differential Revision: D79295640
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159519
Approved by: https://github.com/dolpm, https://github.com/henryoier
Summary: test_c10d_functional_native.py uses hard-coded buf names to check the generated code string. This is fragile given that Inductor can update its buffer naming implementation freely. Thus this PR uses name regex matching to find buffer names at run time. This will solve issues like https://github.com/pytorch/pytorch/issues/147754. Currently we do name matching based on empty_strided_ calls. We can expand it later if needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159487
Approved by: https://github.com/yushangdi
ghstack dependencies: #159476
Summary: test_c10d_functional_native.py tests torch._inductor.config.cpp_wrapper as True and False. Currently torch._inductor.config.cpp_wrapper is set globally, which can cause a problem when running the whole test file. This PR changes it to use a patch context.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159476
Approved by: https://github.com/yushangdi
Hi @desertfire, according to the latest test [results](https://github.com/pytorch/pytorch/actions/runs/15385952839) from the inductor nightly for max_autotune tests, we plan to update the baseline data:
In the latest nightly test, two models require baseline updates:
- vision_maskrcnn: This model shows improved graph breaks, so I’ve updated the baseline accordingly.
- detectron2_fcos_r_50_fpn: This model has a different number of graph breaks. However, since its accuracy result still shows fail_accuracy, I skipped the graph break check for this model.
```
vision_maskrcnn IMPROVED: graph_breaks=29, expected=30
Improvement: 1 models have fixed dynamo graph breaks:
vision_maskrcnn
```
```
detectron2_fcos_r_50_fpn XFAIL
detectron2_fcos_r_50_fpn FAIL: graph_breaks=24, expected=22
Error: 1 models have new dynamo graph breaks:
detectron2_fcos_r_50_fpn
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154973
Approved by: https://github.com/desertfire
Previously, we logged `skipping cudagraphs due to [xxx reasons]` when there are cudagraph-unsafe ops. With graph partition, we split off these ops and cudagraph the remaining parts, but the log message is then skipped.
In this PR, we add logs for graph partition reasons and the number of partitions to better understand the workload.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159425
Approved by: https://github.com/eellison
Fixes
`RuntimeError: self and mat2 must have the same dtype, but got BFloat16 and Float`
With bf16 autocast, the bias is converted into BFloat16, but fp8_qlinear_onednn_ref does not support a bf16 bias.
In this pr, convert bias into bf16 on fp8_qlinear_onednn_ref.
Added this case to the UT; reproduce with:
`python test/test_quantization.py -k test_qlinear_fp8`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159125
Approved by: https://github.com/Xia-Weiwen, https://github.com/cyyever, https://github.com/CaoE
Summary: Fixes a clear template typo where `a_desc_ptr` was passed instead of `b_desc_ptr` to define `b_desc`.
Test Plan:
Found by inspection.
Rollback Plan:
Reviewed By: NoamPaz
Differential Revision: D79178538
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159407
Approved by: https://github.com/NikhilAPatel
Summary:
The `replace_hook` is called once for each user of the replaced node. This fix avoids adding duplicated node sources.
This also means that if there are two nested passes like:
```
with GraphTransformObserver(gm, "outer"):
with GraphTransformObserver(gm, "inner"):
.....
```
We'll only see the outer pass's pass name recorded for the replaced node in the "from_node" node meta. I think this is fine. In practice, the outer pass usually contains a more meaningful name, e.g. `decompose_auto_functionalized`, and the inner pass name is just a default pass name like `pattern_matcher`.
Test Plan:
```
buck2 run @mode/dev-nosan fbcode//caffe2/test:fx -- -r test_graph_transform_observer_replace
```
Rollback Plan:
Differential Revision: D79203058
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159484
Approved by: https://github.com/angelayi
Summary:
We found that we don't really set group_name inside group_split correctly, because we are setting group_name to `deviceTypeToBackend_`, which is set after `setBackend`. Same thing for group_desc. I added more unit tests for it.
We need to set the group name correctly; otherwise, this will break the DeviceMesh use case when split_group is used in DeviceMesh.
Also, ncclx needs to be aware that its Option is a subclass of BackendOption.
Test Plan:
CI
Rollback Plan:
Differential Revision: D79201132
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159429
Approved by: https://github.com/xunnanxu
This PR is a bit more involved but effectively works to drastically simplify PyObjectSlot and PyInterpreter.
1) For PyObjectSlot we now use a global pyinterpreter since there is only one. From here we change all of the call sites to rely on this assumption.
2) We also remove the "tags" of the PyInterpreter by deprecating `PyInterpreterStatus`.
For the reviewer, sadly it seems like `functorch/csrc/dim/dim.cpp` needed to get linted, so there is an unreadable amount of changes there. Fortunately, the only actual change in the file is as follows which just removes `getPyInterpreter()` from the `check_pyobj` call.
```
mpy::handle handle_from_tensor(Arena& A, TensorRef t) {
- // fast case: tensor is live in python
- std::optional<PyObject*> mb_obj =
- t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(getPyInterpreter(), /*ignore_hermetic_tls=*/false);
- if (mb_obj.has_value() && !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
- return *mb_obj;
- }
- return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
-}
-}
+ // fast case: tensor is live in python
+ std::optional<PyObject*> mb_obj =
+ t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(
+ /*ignore_hermetic_tls=*/false);
+ if (mb_obj.has_value() &&
+ !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
+ return *mb_obj;
+ }
+ return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
+}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158427
Approved by: https://github.com/albanD
Hi team,
Please help review this patch.
This PR https://github.com/pytorch/pytorch/pull/150370 tried to fix the "Empty C Call Queue" problem on Python 3.12. It added C calls for each starting Python event with a callable.
I found that the root cause is not that we cannot get C function frames via `PyFrame_GetBack` when PythonTracer is filling start frames, but rather the C call event loss bug on Python 3.12.0-3.12.4. That problem was fixed by 257c413cd1 in 3.12.5.
So I think https://github.com/pytorch/pytorch/pull/150370 cannot fix the problem; this patch reverts its change.
There are solutions to fix the problem correctly, such as adding a new monitoring callback to compensate for call events of methods with C functions, or overriding the callback registered by `PyEval_SetProfile`. These solutions may make the code hard to maintain.
~~Since upgrading the micro version of Python is not difficult for users, we can just ignore C functions and suggest user upgrade.~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155446
Approved by: https://github.com/sraikund16
Fixes#154111
Resolves an issue during compilation with dynamic shapes where `torch._inductor.decomposition.mm` evaluates the SymInt expression for the input tensor due to a for loop, and thus the output tensor is not dynamically shaped. This issue is limited to (Mx1)x(1xN) small matrix multiplications, and creates an explicit error with tensor subclasses such as DTensor.
The proposed fix replaces the loop with a simple product instead. Benchmark currently running https://hud.pytorch.org/benchmark/compilers
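A hedged sketch of the underlying identity (not the decomposition code itself): for an (M, 1) x (1, N) matmul, a broadcasted product gives the same result as looping over the single reduction element.
```python
import torch

a = torch.randn(5, 1)
b = torch.randn(1, 3)
torch.testing.assert_close(a * b, a @ b)  # outer product == mm when K == 1
```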
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158998
Approved by: https://github.com/jansel, https://github.com/BoyuanFeng
Summary:
### PR Context
- Kill background process only when PG init fails or there is an explicit `TERMINATE` signal from main process.
- When a checkpoint fails to save, log and return the error but continue the serving loop.
Test Plan:
CI
Rollback Plan:
Differential Revision: D79177410
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159374
Approved by: https://github.com/sibuachu
Essence of this copypasta:
- combine Half-inl.h and Half.h in c10/util -> torch/headeronly/util/Half.h
- Add NOLINTNEXTLINE's to the portions of Half-inl.h that were previously in the ignore list of clangtidy
- Re-expose all APIs in namespaces and through includes of the original files. Ideally, we would have the APIs in torch::headeronly and reexpose them in c10, but that runs into BC issues (see D78997465) so for now we are keeping the APIs in c10 but reexposing them in torch::headeronly.
- Change test cases in test_aoti_abi_check to test torch::headeronly::Half vs c10::Half (they're the same thing but we eventually want all the tests for headeronly APIs to only import from headeronly).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159172
Approved by: https://github.com/albanD, https://github.com/desertfire
This is a follow-up to PR #154382, as the issue still persists:
```
File "/opt/pytorch/pytorch/torch/distributed/rpc/__init__.py", line 81, in <module>
from . import api, backend_registry, functions
File "/opt/pytorch/pytorch/torch/distributed/rpc/api.py", line 35, in <module>
from .constants import DEFAULT_SHUTDOWN_TIMEOUT, UNSET_RPC_TIMEOUT
File "/opt/pytorch/pytorch/torch/distributed/rpc/constants.py", line 3, in <module>
from torch._C._distributed_rpc import (
ImportError: cannot import name '_DEFAULT_NUM_WORKER_THREADS' from 'torch._C._distributed_rpc' (unknown location)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159461
Approved by: https://github.com/lw
Summary: Sometimes the call history recorded in a `nn_module_stack` does not have the stack property, where each FQN is a prefix of the next FQN. This can cause errors during `unflatten`. Instead of erroring we now drop entries from such a `nn_module_stack` to restore the stack property. This effectively leads to less unflattening: the last FQN in the call history before the stack property was broken keeps the entire flat subgraph of its call.
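To illustrate, here is a minimal sketch of the truncation described above; it assumes entries are an ordered list of `(key, (fqn, module_type))` pairs, which is an assumption about the representation rather than the actual code:
```python
def restore_stack_property(entries):
    # keep entries while each FQN is a prefix of the next; once the
    # property breaks, drop the rest, so the last valid FQN keeps the
    # entire flat subgraph of its call
    kept = []
    for key, (fqn, module_type) in entries:
        if kept and not fqn.startswith(kept[-1][1][0]):
            break
        kept.append((key, (fqn, module_type)))
    return kept
```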
Test Plan:
added test, updated another
Rollback Plan:
Differential Revision: D79204669
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159418
Approved by: https://github.com/angelayi
The previous version of `torch._tensor_str._Formatter` was not using `PRINT_OPTS.sci_mode` for the `max_width` computation but was using it for the formatting of values, leading to a weird discrepancy.
Now, the code first checks whether it should be in sci_mode, then computes `max_width`.
Here is an example to test the behavior:
```python
A = torch.tensor([10, 1e-1, 1e-2])
B = torch.tensor([10, 1e-1, 1e-1])
print("================= Default =================")
print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")
print("================= sci_mode=False =================")
with torch._tensor_str.printoptions(sci_mode=False):
print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")
print("================= sci_mode=True =================")
with torch._tensor_str.printoptions(sci_mode=True):
print(A, f"Formatter max_width: {torch._tensor_str._Formatter(A).max_width}")
print(B, f"Formatter max_width: {torch._tensor_str._Formatter(B).max_width}")
```
In the current version this prints:
```
================= Default =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7
================= sci_mode=False =================
tensor([ 10.0000, 0.1000, 0.0100]) Formatter max_width: 10
tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7
================= sci_mode=True =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([1.0000e+01, 1.0000e-01, 1.0000e-01]) Formatter max_width: 7
```
One can see that in `sci_mode=False`, the values of A are prefixed with unneeded zeros and do not have the same `max_width` as B (A keeps the `max_width` from `sci_mode=None`).
Also in `sci_mode=True`, for B, the `max_width` is 7 but each value takes 10 chars. (This is mostly fine since the code that uses `max_width` does not rely much on it, but it is still misleading.)
After this commit, this will print
```
================= Default =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7
================= sci_mode=False =================
tensor([10.0000, 0.1000, 0.0100]) Formatter max_width: 7
tensor([10.0000, 0.1000, 0.1000]) Formatter max_width: 7
================= sci_mode=True =================
tensor([1.0000e+01, 1.0000e-01, 1.0000e-02]) Formatter max_width: 10
tensor([1.0000e+01, 1.0000e-01, 1.0000e-01]) Formatter max_width: 10
```
This also aligns A with B for `sci_mode=False`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126859
Approved by: https://github.com/malfet
torch.compile of `all_to_all_vdev_2d` hits the following error:
```
torch._dynamo.exc.BackendCompilerFailed: backend='aot_eager' raised:
RuntimeError: Found a custom (non-ATen) operator whose output has alias annotations: symm_mem::all_to_all_vdev_2d(Tensor input, Tensor(a!) out, Tensor in_splits, Tensor(a!) out_splits_offsets, str group_name, int? major_align=None) -> Tensor(a!). We only support functionalizing operators whose outputs do not have alias annotations (e.g. 'Tensor(a)' is a Tensor with an alias annotation whereas 'Tensor' is a Tensor without. The '(a)' is the alias annotation). The alias annotation specifies that the output Tensor shares storage with an input that has the same annotation. Please check if (1) the output needs to be an output (if not, don't return it), (2) if the output doesn't share storage with any inputs, then delete the alias annotation. (3) if the output indeed shares storage with an input, then add a .clone() before returning it to prevent storage sharing and then delete the alias annotation. Otherwise, please file an issue on GitHub.
```
This PR selects option (1).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159435
Approved by: https://github.com/ngimel, https://github.com/xmfan
# Motivation
This PR moves the implementation of `torch.cuda.memory._set_allocator_settings` to `torch._C._accelerator_setAllocatorSettings`.
Since the original API was intended as a temporary/internal utility, I am not exposing the new function as a public API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156175
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908, #150312, #156165
# Motivation
Refactor `CUDAAllocatorConfig` to reuse `AcceleratorAllocatorConfig` and `ConfigTokenizer`. We will deprecate the options that overlap with `AcceleratorAllocatorConfig` in a follow-up PR and keep them only for BC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150312
Approved by: https://github.com/albanD
ghstack dependencies: #149601, #157908
# Motivation
This PR aims to generalize `AllocatorConfig` to be device-agnostic. It introduces the class `AcceleratorAllocatorConfig` to clarify its scope as a configuration manager for accelerator backends (e.g., CUDA, XPU). The name `AllocatorConfig` is reserved for a potential future base class that could unify configuration handling for both CPU and accelerator allocators, should similar requirements arise for the CPU path.
# Design Rule
## Overall
This class configures memory allocation for both device and host memory. A single `AcceleratorAllocatorConfig` instance is shared across all accelerator backends, such as CUDA and XPU, under the assumption that relevant environment variables apply uniformly to all accelerators. Device-specific configuration extensions are supported via hooks (see `registerDeviceConfigParserHook`).
Introduce a new class `ConfigTokenizer` to help process the env-variable config key-value pairs.
## Naming Convention:
- Public API names in `AcceleratorAllocatorConfig` should be device-generic.
- Members prefixed with `pinned_` are specific to the host/pinned allocator.
- Environment variable names should be generic across backends.
- Comma-separated key-value pairs in the format `key:value`. Use square brackets `[]` for list values. Example: `key1:123, key2:[val1,val2]`
## Environment Variables:
- The default environment variable for configuration is `PYTORCH_ALLOC_CONF`.
- For backward compatibility, `PYTORCH_CUDA_ALLOC_CONF` and `PYTORCH_HIP_ALLOC_CONF` are also supported with lower priority.
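For instance, a sketch of setting the variable (the `key1`/`key2` names reuse the example above; real option names depend on the allocator, and the variable must be set before the allocator is initialized):
```python
import os

# generic, device-agnostic variable shared by all accelerator backends
os.environ["PYTORCH_ALLOC_CONF"] = "key1:123, key2:[val1,val2]"

# legacy variables are still honored, with lower priority
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "key1:123")
```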
Differential Revision: [D79011786](https://our.internmc.facebook.com/intern/diff/D79011786)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149601
Approved by: https://github.com/albanD
Summary: The AI system co-design team requested that we add a user annotation for the FX graph cache key in PyTorch Kineto traces and Execution traces. With this annotation, they can tell which FX graph the kernels belong to.
Test Plan:
buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA
Rollback Plan:
Differential Revision: D79019069
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159318
Approved by: https://github.com/sraikund16, https://github.com/jansel
Sphinx likes titles and complains when they are missing, so this adds a title to address the following warning in the build:
```
WARNING: toctree contains reference to document 'distributed._dist2' that doesn't have a title: no link will be generated
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159385
Approved by: https://github.com/d4l3k
Straightup copy pasta. Keeps APIs in c10 and reexposes them to torch::headeronly.
It is arguable that we should just get rid of some of these unused dtypes but that is outside the scope of this PR, which is meant to build up to ScalarType moving to headeronly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159302
Approved by: https://github.com/malfet, https://github.com/albanD
Switch from `guard_size_oblivious` to `guard_or_false`: if a DDE is encountered, this avoids folding the 3D bmm into a mm.
806d9e3fe7/torch/_decomp/decompositions.py (L4506-L4512)
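A minimal sketch of the behavioral difference, assuming the `should_fold` context above (`guard_or_false` answers False instead of raising when the expression is undecidable):
```python
import torch
from torch.fx.experimental.symbolic_shapes import guard_or_false

def is_definitely_empty(t: torch.Tensor) -> bool:
    # guard_size_oblivious(t.numel() == 0) can raise a DDE on unbacked
    # sizes; guard_or_false assumes False in that case, so the bmm is
    # simply not folded into a mm
    return guard_or_false(t.numel() == 0)
```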
## DDE
```
File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4506, in matmul
elif should_fold(tensor1, tensor2, is_out):
File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4472, in should_fold
if guard_size_oblivious(t1.numel() == 0):
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(12*((u0//2)), 0) (unhinted: Eq(12*((u0//2)), 0)). (Size-like symbols: none)
Caused by: (_decomp/decompositions.py:4472 in should_fold)
```
```
File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4506, in matmul
elif should_fold(tensor1, tensor2, is_out):
File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 4483, in should_fold
return all(
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(3*((u0//2)), 3) (unhinted: Eq(3*((u0//2)), 3)). (Size-like symbols: none)
Caused by: (_decomp/decompositions.py:4483 in should_fold)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159184
Approved by: https://github.com/ezyang
ghstack dependencies: #158894
This only handles AttributeError, but in general, any exception coming from here is a user exception. Let me know if we would prefer to catch all exceptions and then re-raise them as observed exceptions.
```
File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/symbolic_convert.py", line 2200, in CALL_FUNCTION
self.call_function(fn, args, {})
File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/symbolic_convert.py", line 1210, in call_function
self.push(fn.call_function(self, args, kwargs)) # type: ignore[arg-type]
File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/variables/lazy.py", line 201, in realize_and_forward
return getattr(self.realize(), name)(*args, **kwargs)
File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/variables/nn_module.py", line 472, in call_function
initialize_lazy_module(tx, mod, args, kwargs)
File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/_dynamo/variables/nn_module.py", line 104, in initialize_lazy_module
mod._infer_parameters(mod, fake_args, fake_kwargs)
File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/nn/modules/lazy.py", line 261, in _infer_parameters
module.initialize_parameters(*args, **kwargs)
...,
File "/packages/aps.ads.gmp/launcher_with_publish#link-tree/torch/nn/modules/module.py", line 1962, in __getattr__
raise AttributeError(
torch._dynamo.exc.InternalTorchDynamoError: AttributeError: '...' object has no attribute '...'
```
Note that we crash with a slightly different exception trace in the other test I added. Let me know if we don't want this thrown directly to the end user.
```
======================================================================
ERROR: test_lazy_module_bad_params (__main__.NNModuleTests.test_lazy_module_bad_params)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/data/users/clr/pytorch/torch/testing/_internal/common_utils.py", line 3223, in wrapper
method(*args, **kwargs)
~~~~~~^^^^^^^^^^^^^^^^^
File "/data/users/clr/pytorch/test/dynamo/test_modules.py", line 1683, in test_lazy_module_bad_params
exp_res = opt_m(x, y)
File "/data/users/clr/pytorch/torch/_dynamo/eval_frame.py", line 411, in __call__
return super().__call__(*args, **kwargs)
~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/data/users/clr/pytorch/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/data/users/clr/pytorch/torch/nn/modules/module.py", line 1784, in _call_impl
return forward_call(*args, **kwargs)
File "/data/users/clr/pytorch/torch/_dynamo/eval_frame.py", line 473, in _call_lazy_check
self._orig_mod._infer_parameters(self._orig_mod, args, kwargs)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/data/users/clr/pytorch/torch/nn/modules/lazy.py", line 261, in _infer_parameters
module.initialize_parameters(*args, **kwargs)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
File "/data/users/clr/pytorch/test/dynamo/test_modules.py", line 711, in initialize_parameters
self.foo += 1
^^^^^^^^
File "/data/users/clr/pytorch/torch/nn/modules/module.py", line 1962, in __getattr__
raise AttributeError(
f"'{type(self).__name__}' object has no attribute '{name}'"
)
AttributeError: 'LazyModuleBadInferParams' object has no attribute 'foo'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158501
Approved by: https://github.com/williamwen42, https://github.com/jansel
# Motivation
While refactoring the caching allocator, I noticed that the `AllocParams` constructor on CUDA had an unused parameter. This change removes that unused argument to avoid potential confusion.
# Additional Context
I noticed that `AllocParams` is defined in a cpp file, so it should be safe to make this change.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159159
Approved by: https://github.com/cyyever, https://github.com/albanD
This PR makes some less risky changes to PyObjectSlot as there is a lot of stuff we do not need since there is only one interpreter. Specifically `check_interpreter` and `has_pyobj_nonhermetic` are removed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158407
Approved by: https://github.com/albanD
ghstack dependencies: #158290, #158291
This PR removes the integration point torch.fx had with torch::deploy (and another minor change).
Note: This PR surfaces some mypy errors, but I believe those existed in the code base beforehand and should be fixed in a separate PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291
Approved by: https://github.com/albanD
ghstack dependencies: #158290
The MI355 CI is regressing: hiprtc kernel compilation is failing due to duplicate definitions of traits, leading to errors like `error: redefinition of 'integral_constant'`. This seems to be the culprit: https://github.com/pytorch/pytorch/pull/158868. Using the HIP version instead of the ROCm version for the check should help here, as the ROCm and HIP versions aren't synced; the ROCm 7.0 alpha build used in CI is still on HIP 6.5.
Confirmed that this patch works here: https://github.com/pytorch/pytorch/actions/runs/16579227179?pr=159292
Also, this PR increases the frequency of the MI355 CI to twice a day so we can catch and identify regressions more easily for now.
Jeff is on vacation, so Jithun asked me to reach out to y'all. Please help stamp and approve, so we can resolve the recent MI355 CI regression/timeout (https://github.com/pytorch/pytorch/actions/workflows/rocm-mi355.yml) :) @huydhn @malfet @atalman @seemethere
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159292
Approved by: https://github.com/malfet
Summary: We are trying to deprecate torch deploy externally. However, a bunch of legacy stuff still uses it. This PR allows the legacy tests to still run if necessary.
Test Plan:
It's a targets change so CI should suffice
Rollback Plan:
Differential Revision: D78910653
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159307
Approved by: https://github.com/albanD
# Note - On Lambda guarding of object aliasing
# We previously installed object‑aliasing guards as relational guards,
# but that undermined the recursive‑dict guard optimization: placing the
# aliasing guard at a leaf prevented the parent dict node from
# qualifying as a recursive‑dict guard root. Because aliasing guards are
# rare, we now emit them as epilogue guards via a small Python lambda.
# This repeats the access in Python—adding a bit of work—but the
# overhead is outweighed by the gains from enabling recursive‑dict guard
# optimization.
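A hypothetical sketch of the epilogue-lambda idea (names and structure are assumptions):
```python
from functools import reduce

def resolve(root, path):
    # repeat the attribute accesses in Python, e.g. path = ("mod", "weight")
    return reduce(getattr, path, root)

def make_aliasing_guard(path_a, path_b):
    # emitted as an epilogue guard, so the parent dict node can still
    # qualify as a recursive-dict guard root
    def guard(root):
        return resolve(root, path_a) is resolve(root, path_b)
    return guard
```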
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159288
Approved by: https://github.com/StrongerXi
Summary:
A fallback kernel's output may be a non-list/tuple but a `MultiOutput` with empty indices. Allow the `FXConverter` to handle such a case.
Test Plan:
Modified the fxir test for fallbacks, then ran `buck2 test mode/dev-nosan caffe2/test/inductor:fxir_backend -- test_fallback`.
Before this diff the modified test would fail with
```
File "/re_cwd/buck-out/v2/gen/fbcode/e2105f7329ead90a/caffe2/test/inductor/__fxir_backend__/fxir_backend#link-tree/torch/_inductor/codegen/wrapper_fxir.py", line 341, in generate
line.codegen_fx(self)(line)
File "/re_cwd/buck-out/v2/gen/fbcode/e2105f7329ead90a/caffe2/test/inductor/__fxir_backend__/fxir_backend#link-tree/torch/_inductor/codegen/wrapper_fxir.py", line 489, in _generate_multi_output
inds = line.indices[0][1:]
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
IndexError: list index out of range
```
(Full error paste in P1878839403)
With this diff the error is no longer present.
Rollback Plan:
Differential Revision: [D79126619](https://our.internmc.facebook.com/intern/diff/D79126619)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159297
Approved by: https://github.com/blaine-rister
This PR is part of the work to deprecate torch::deploy in OSS. Effectively it does 3 things to get started.
1. Remove test_deploy_interaction as we no longer need to worry about this
2. Remove all torch._running_with_deploy checks and use the False path always (surfaced 1)
3. Remove `USE_DEPLOY` and switch to the default path always
Note: MyPy does fail on a bunch of things here, as a bunch of older files are touched. It may be better to fix these things in a separate PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158288
Approved by: https://github.com/albanD
Fixes #157452
Test with
```
python test/dynamo/test_repros.py ReproTests.test_nn_parameter_ctor_graph_breaks
```
### Release Notes
Change to nn.Parameter Constructor Behavior in Dynamo
This is a semantic change to the nn.Parameter constructor: previously, if the constructor lacked a clean source, the system would attempt to infer arguments, construct a clone, and lift this synthetic proxy into the computation graph. That approach had many potential edge cases and was difficult to reason about. The new behavior defaults to graph breaking when the nn.Parameter constructor does not have a clean source; users are now advised to manually move the constructor out of the graph in such cases. This change improves clarity and reduces complexity in graph construction and debugging. Users can escape-hatch to the old semantics with `torch._dynamo.config.graph_break_on_nn_param_ctor=False` if this cannot be done.
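For illustration, a hedged sketch of the suggested workaround, moving the constructor out of the compiled region:
```python
import torch

# construct the Parameter eagerly, outside the compiled function
param = torch.nn.Parameter(torch.randn(8, 8))

@torch.compile(fullgraph=True)
def f(x):
    return x @ param  # the compiled region only reads the Parameter
```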
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158800
Approved by: https://github.com/anijain2305
Summary:
Remove use of targetDevice in KernelFactory.
AOTI would infer device when creating AOTIDelegateExecutor.
Test Plan:
CI
Rollback Plan:
Reviewed By: dolpm
Differential Revision: D79007317
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159298
Approved by: https://github.com/dolpm
This adds an option for backend precompile artifacts to be *editable*, i.e. to not serialize them right away, but instead be able to apply a Callable edit_fn to them.
This allows us to support editing the precompile artifact with more updated autotune results at a later time in the next PR. The goal flow here is:
- User runs AOTAutograd -> Inductor -> Triton
- User saves to AOTAutogradCache the normal results
- User runs autotuning
- User calls serialize(); it takes the new autotuning results at runtime and saves only the necessary Triton kernels.
This PR just implements the API for editing the cache artifacts. The next PR actually adds the autotuning saving support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158586
Approved by: https://github.com/zhxchen17
Summary: This test was using do_bench, so it was flaky because performance is non-deterministic.
Test Plan:
buck test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:compile_subprocess -- --exact 'caffe2/test/inductor:compile_subprocess - test_inductor_multiple_specializations_cuda (caffe2.test.inductor.test_compile_subprocess.GPUTests)' --run-disabled
Rollback Plan:
Differential Revision: D79098692
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159264
Approved by: https://github.com/jingsh
Summary:
Strengthen the matcher for StaticDispatch kernels: all input and output tensors must be on CPU, and all Device-typed attributes must be CPU.
Previously, we only checked that the output tensor was on CPU. This misses the case where we do a device-to-host aten._to_copy.
This prepares for turning on static dispatch kernels by default.
Test Plan:
I should add some tests before landing.
Rollback Plan:
Differential Revision: D78747600
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159187
Approved by: https://github.com/dolpm
**Summary**
`_weight_int8pack_mm` on CPU may cause a segmentation fault if the output shape is large (i.e., M * N is large). It's because the kernel computes the output buffer address by
```c++
auto* C_ptr = C_data + mb_start * N + nb_start;
```
where both `mb_start` and `N` are `int`, and when they are large their product may overflow.
The solution is simple: declare these variables as `int64_t` so that the product won't overflow.
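To illustrate the failure mode, a hedged demo that mimics the 32-bit C `int` arithmetic with NumPy (the magnitudes are assumptions):
```python
import numpy as np

mb_start, N = np.int32(70000), np.int32(40000)  # product ~2.8e9 > INT32_MAX
print(mb_start * N)                      # 32-bit multiply wraps to a negative offset
print(np.int64(mb_start) * np.int64(N))  # 2800000000, the intended offset
```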
**Test plan**
```
pytest -sv test/test_linalg.py -k test__int8_mm_large_shape
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158341
Approved by: https://github.com/mingfeima, https://github.com/drisspg
Fixes #158892
All custom operators should go through the graph.call_function path. The
other fallback path is for aten/prim operations that don't have support
for things (like torch.float8_e8m0fn).
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159174
Approved by: https://github.com/eellison
## Summary
This PR changes the default value of `combo_kernel_foreach_dynamic_shapes` from `False` to `True` in `torch/_inductor/config.py`.
## Context
The `combo_kernel_foreach_dynamic_shapes` configuration was introduced in PR #134477 (August 2024) to support dynamic shapes for foreach and combo kernels. It was initially disabled by default as a conservative approach to avoid disrupting production workflows.
## Why This Change?
After several months of the feature being available and stable, it's time to enable it by default. This improves the user experience for developers using `torch.compile(dynamic=True)` with foreach operations.
### Current behavior:
- Users must manually discover and enable `combo_kernel_foreach_dynamic_shapes`
- Without this flag, foreach operations may fail with dynamic shapes
- This creates friction and confusion
### With this change:
- Foreach operations work seamlessly with dynamic compilation
- No manual configuration needed
- Better "it just works" experience
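As a hedged usage sketch of the behavior described above (the flag name comes from this PR; the rest is illustrative):
```python
import torch
import torch._inductor.config as inductor_config

# now True by default; set explicitly here only for emphasis
inductor_config.combo_kernel_foreach_dynamic_shapes = True

@torch.compile(dynamic=True)
def sgd_step(params, grads):
    # a foreach op that previously could fail under dynamic shapes
    torch._foreach_add_(params, grads, alpha=-0.1)

params = [torch.randn(8), torch.randn(16)]
grads = [torch.randn(8), torch.randn(16)]
sgd_step(params, grads)
```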
## Testing
Extensive testing was performed with PyTorch 2.5.0+ and 2.7.1:
- ✅ Various tensor sizes (8, 16, 32, 64, 128)
- ✅ Multiple tensors in operations (tested up to 20)
- ✅ Nested foreach operations
- ✅ Mixed operations (foreach + standard operations)
- ✅ Both CPU and CUDA devices
- ✅ Symbolic shapes with dynamic compilation
## Impact Assessment
- **Performance**: No impact - this only affects compilation behavior
- **Backward Compatibility**: Fully maintained - users can still set to `False`
- **Risk**: Minimal - feature has been stable since August 2024
## References
- Original implementation: PR #134477 by @qchip
- This completes the feature rollout by making it available by default
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158985
Approved by: https://github.com/jansel, https://github.com/mlazos
Fixes a ZB regression (https://github.com/pytorch/torchtitan/actions/runs/16478292562/job/46585646792)
Previously we only allowed an intermediate node to have one gradient. Recently a torchtitan ZB test started failing, and I tracked it back to FusedRMSNorm's grad_fn having two values `(grad, None)` (see https://github.com/pytorch/pytorch/pull/153666), which started breaking our ZB tests.
This PR allows `stage_backward_weight` intermediate nodes to have multiple grads (it sums them together, or ignores a grad value if it is None). Here is an example where the backward would have two grad values (gI1, gI2):
```python
class Func(torch.autograd.Function):
@staticmethod
def forward(ctx, x):
return x, 2
@staticmethod
def backward(ctx, gI1, gI2):
assert gI2 is None
return gI1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159084
Approved by: https://github.com/tianyu-l
If `return_debug_mask` is False (the default for SDPA), the attention tensor returned is an empty tensor (which has 0 dimensions). This means the shardings passed for the batch and CP cases can yield invalid dimensions.
This PR fixes it for `scaled_dot_product_flash_attention_strategy`. Note that `scaled_dot_product_cudnn_attention_strategy` doesn't have this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159205
Approved by: https://github.com/wconstab
Switch from `guard_size_oblivious` to `guard_or_false`: if a DDE is encountered, this falls back to computing elementwise strides.
2dccff7dcf/torch/_prims/__init__.py (L1919-L1923)
We think it's safe because Laith tested whether this fallback would fail any tests. It did not.
https://github.com/pytorch/pytorch/pull/158157
## Data-dependent exceptions (DDE)
```
File "/data/users/colinpeppler/pytorch/torch/_decomp/decompositions.py", line 2139, in _to_copy
x_tensor = torch._prims.convert_element_type(x_tensor, dtype)
...
File "/data/users/colinpeppler/pytorch/torch/_prims/__init__.py", line 1920, in _convert_element_type_meta
if torch._prims_common.is_non_overlapping_and_dense(a):
File "/data/users/colinpeppler/pytorch/torch/_prims_common/__init__.py", line 494, in is_non_overlapping_and_dense
if guard_size_oblivious(length == 1):
GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(u0 - 4, 1) (unhinted: Eq(u0 - 4, 1)). (Size-like symbols: u0)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158894
Approved by: https://github.com/pianpwk, https://github.com/laithsakka
Fixes mm on B200:
Before:
```Shell
def _addmm_nvfp4_dispatch(
a: NVFP4Tensor, b: NVFP4Tensor, aten_op, bias: Optional[torch.Tensor] = None
) -> torch.Tensor:
"""
Core implementation shared between nvfp4_mm, nvfp4_addmm, and nvfp4_linear.
The only difference is whether bias is None or not.
"""
assert a._data.is_contiguous()
assert b._data.t().is_contiguous()
assert a._block_size == 16, f"NVFP4 requires block_size=16, got {a._block_size}"
assert b._block_size == 16, f"NVFP4 requires block_size=16, got {b._block_size}"
M, K = a.shape[0], a.shape[1]
N = b.shape[1]
# Swizzle Dizzle
if a._is_swizzled_scales:
a_scale_blocked = a._scale_e4m3 # Already swizzled
else:
a_scale = a._scale_e4m3.view(M, K // a._block_size)
a_scale_blocked = to_blocked(a_scale)
if b._is_swizzled_scales:
b_scale_blocked = b._scale_e4m3 # Already swizzled
else:
b_scale = b._scale_e4m3.view(N, K // b._block_size)
b_scale_blocked = to_blocked(b_scale)
# Merge double quant scales into 1 scale for Scale_In^D
if a._per_tensor_scale is not None:
assert b._per_tensor_scale is not None
scale_result = a._per_tensor_scale * b._per_tensor_scale
else:
assert b._per_tensor_scale is None and a._per_tensor_scale is None
scale_result = None
# THIS IS A WORKAROUND:
# RuntimeError: CUDA error: CUBLAS_STATUS_INVALID_VALUE when calling
# When we have per-tensor scaling, we need to apply it before bias
# since bias is not quantized
should_add_bias_separately = (scale_result is not None) and (bias is not None)
# should_add_bias_separately = bias is not None
> result = torch._scaled_mm(
a._data.view(torch.float4_e2m1fn_x2),
b._data.view(torch.float4_e2m1fn_x2),
a_scale_blocked.view(torch.float8_e4m3fn),
b_scale_blocked.view(torch.float8_e4m3fn),
bias=None if should_add_bias_separately else bias,
out_dtype=a._orig_dtype,
# scale_result=scale_result, # Not supported yet
)
E RuntimeError: Invalid scaling configuration.
E - For TensorWise scaling, a and b should be float8, scales should be float and singletons.
E - For RowWise scaling, a and b should be float8, scales should be float, scale_a should be (200, 1) and scale_b should be (1, 256), and both should be contiguous.
E - For BlockWise 1x128 scaling, a and b should be float8, scales should be float, scale_a should be (200, 1) and scale_b should be (1, 256), and both should be outer-dim-major.
E - For BlockWise 128x128 scaling, a and b should be float8, scales should be float, scale_a should be (2, 1) and scale_b should be (1, 2), and both should be near-inner-dim-major (with 16-byte aligned strides).
E - For Blockwise 1x32 scaling, a and b should be float8, scales should be float8_e8m0fnu, scale_a should have 1024 elements and scale_b should have 1024 elements, and both should be contiguous.
E - For Blockwise 1x16 scaling, a and b should be float4 (packed 2x), scales should be float8_e4m3fn, scale_a should have 3072 elements and scale_b should have 3072 elements, and both should be contiguous.
E Got a.dtype()=Float4_e2m1fn_x2, scale_a.dtype()=Float8_e4m3fn, scale_a.size()=[256, 12], scale_a.stride()=[12, 1], b.dtype()=Float4_e2m1fn_x2, scale_b.dtype()=Float8_e4m3fn, scale_b.size()=[256, 12] and scale_b.stride()=[12, 1]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159170
Approved by: https://github.com/ngimel
See docblock for details. The API here has been validated by use
in autoparallel but I'm always open to suggestions for tweaks. One
particular choice I made is to make most of the functions return dicts
by default; this isn't strictly necessary for inputs but it is very
convenient for outputs as the output desc lives on the output node,
not the argument that feeds into the node.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159005
Approved by: https://github.com/wconstab
Rewriting the bucketing of all_gather and reduce_scatter, defining the "merge graph" via traceable torch functions:
`all_gather_merge_fn_to_trace`
`reduce_scatter_merge_fn_to_trace`
(instead of creating nodes and doing FakeTensor prop manually).
This makes it easier to experiment with the merge functions.
Used `foreach_copy_` in the merging function for all_gather; added an Inductor lowering for `foreach_copy_`.
Adding topological sort after bucketing passes (comment in post_grad.py):
```
# Fx collectives bucketing passes require topological sort for the cases:
# when bucketed collectives have users before the last collective in the bucket
# AND when inputs of bucketed collective have ancestors after the first collective in the bucket.
#
# In this case we can not manually pick the place for bucketed collective insertion.
# But we are guaranteed by the bucketing (independent collectives in the bucket),
# that it is possible to reorder nodes to satisfy all ordering requirements.
#
# --- before bucketing ---
# in0 = ...
# wait_ag0 = ag(in0)
# user0(wait_ag0)
# ...
# pre_in1 = ...
# in1 = transform(pre_in1)
# wait_ag1 = ag(in1)
# user1(wait_ag1)
#
# --- after bucketing ---
#
# in0 = ...
# user(wait_ag0) <--- wait_ag0 is defined only after bucketed collective.
#
# pre_in1 = ...
# in1 = transform(pre_in1)
# ag_bucket(in0+in1)
# wait_bucket
# wait_ag0 = wait_bucket[0]
# wait_ag1 = wait_bucket[1]
# user1(wait_ag1)
````
Correctness of the passes verified by loss curve for llama3 8b for simple_fsdp and for autoparallel:
<img width="1364" height="495" alt="Screenshot 2025-07-22 at 14 27 28" src="https://github.com/user-attachments/assets/67b2cabb-3206-450b-b529-e23c24292fc6" />
<img width="1355" height="509" alt="Screenshot 2025-07-22 at 14 27 56" src="https://github.com/user-attachments/assets/4d0e6b25-2eb1-47b2-8d68-dcec185239c4" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158663
Approved by: https://github.com/wconstab
Options to address the "undocumented python objects":
1. Reference the functions in the .rst via the torch.nn.modules namespace. Note that this changes the generated doc filenames / locations for most of these functions!
2. [Not an option] Monkeypatch `__module__` for these objects (broke several tests in CI due to `inspect.findsource` failing after this change)
3. Update the .rst files to also document the torch.nn.modules forms of these functions, duplicating docs.
#### [this is the docs page added](https://docs-preview.pytorch.org/pytorch/pytorch/158491/nn.aliases.html)
This PR takes option 3 by adding an rst page nn.aliases that documents the aliases in nested namespaces, removing all the torch.nn.modules.* entries from the coverage skiplist except
- NLLLoss2d (deprecated)
- Container (deprecated)
- CrossMapLRN2d (what is this?)
- NonDynamicallyQuantizableLinear
This mostly required adding docstrings to `forward`, `extra_repr` and `reset_parameters`. Since forward arguments are already part of the module docstrings I just added a very basic docstring.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158491
Approved by: https://github.com/janeyx99
Adds `c_shim_aten.{h/cpp}` and uses it for `fill_`.
This is the generated `c_shim_aten.cpp` for reference
```cpp
// WARNING: THIS FILE IS AUTOGENERATED BY torchgen. DO NOT MODIFY BY HAND.
// See 7e86a7c015/torchgen/gen.py (L2424-L2436) for details
// This file corresponds to the aten_shimified_ops list in torchgen/aoti/fallback_ops.py
#include <torch/csrc/inductor/aoti_torch/generated/c_shim_aten.h>
#include <torch/csrc/inductor/aoti_torch/utils.h>
#ifndef AT_PER_OPERATOR_HEADERS
#include <ATen/Functions.h>
#include <ATen/CompositeExplicitAutogradFunctions.h>
#include <ATen/CompositeExplicitAutogradNonFunctionalFunctions.h>
#include <ATen/CompositeImplicitAutogradFunctions.h>
#else
#include <ATen/ops/fill.h>
#endif // AT_PER_OPERATOR_HEADERS
using namespace torch::aot_inductor;
AOTITorchError aoti_torch_aten_fill__Scalar(AtenTensorHandle self, double value) {
AOTI_TORCH_CONVERT_EXCEPTION_TO_ERROR_CODE({
at::fill_(
*tensor_handle_to_tensor_pointer(self), value
  );
  });
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158974
Approved by: https://github.com/albanD, https://github.com/janeyx99
Summary:
Placement is leaked to too many classes!
In this diff, we consolidate all placement lookup into one place: Graph::ApplyDevicePlacement.
After applying placement, the in-memory graph, tensorMeta, and weightMeta already have the re-mapped device.
The subsequent weight loading, sample-input loading, and target-device inference look up the re-mapped device from the graph's tensorMeta.
The graph's tensorMeta becomes the only ground truth!
Test Plan:
Need to add some tests before landing.
This is a big change.
Rollback Plan:
Differential Revision: D78841818
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158996
Approved by: https://github.com/henryoier
Fix docstring for clip_grads_with_norm_ to reflect clamping behavior
This PR updates the docstring for torch.nn.utils.clip_grads_with_norm_ to accurately reflect the implementation behavior. The current documentation suggests that gradients are always scaled by:
grad = grad * (max_norm / (total_norm + eps))
However, the actual implementation clamps the scale coefficient to a maximum of 1.0, ensuring gradients are only scaled down, not up. This PR corrects the formula and adds a clarifying note to avoid confusion for users.
Updated the formula in the docstring to:
grad = grad * min(max_norm / (total_norm + eps), 1.0)
Added a note explaining the rationale for clamping (to prevent gradient amplification).
Ensured consistency with the behavior of clip_grad_norm_.
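For reference, a minimal sketch of the clamped scaling now described in the docstring (not the actual implementation):
```python
import torch

@torch.no_grad()
def clip_grads_with_norm_sketch(parameters, max_norm, total_norm, eps=1e-6):
    # the coefficient is clamped to at most 1.0, so gradients are only
    # ever scaled down, never amplified
    clip_coef = min(max_norm / (total_norm + eps), 1.0)
    for p in parameters:
        if p.grad is not None:
            p.grad.mul_(clip_coef)
```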
Fixes #151554
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158200
Approved by: https://github.com/mikaylagawarecki
Options to address the "undocumented python objects":
1. Reference the functions in the .rst via the `torch.functional` namespace. Note that this changes the generated doc filenames / locations for most of these functions!
2. Document these functions by referencing them from the `torch.` namespace instead, in line with common usage. This would also require setting the `__module__` for these functions and moving entries from `torch.functional`'s `__all__` -> `torch`'s `__all__`, which is BC-breaking.
3. Update the .rst files to also document the `torch.functional` forms of these functions, duplicating docs.
This PR takes option (3) above and:
* Removes all 20 `torch.functional` entries from the doc ignore list
* Removes `torch.functional.align_tensors()` entirely, since we don't want to document it.
* This is technically BC-breaking, although the previous impl simply errored out. This change could be moved to a separate isolated PR for safety.
* Introduces `torch.aliases.md` as a hidden page for the `torch.functional` aliases to the `torch` analogue functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158581
Approved by: https://github.com/janeyx99
Wrapping is load-bearing for things that introspect argument signatures,
but using functools.wraps to do this is undesirable, as it overrides
the name/module of the wrapping function, which is bad for tracking down
exactly what code is actually being run at runtime. simple_wraps is
like wraps but it doesn't override the name information, so you still
get an appropriate printout. To see the stack of all functions wrapping
each other, there is now a helper fn_stack.
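A hypothetical sketch of the idea (the real simple_wraps may differ):
```python
import functools

def simple_wraps(wrapped):
    # like functools.wraps, but leave the wrapper's own __name__,
    # __qualname__ and __module__ intact so runtime printouts point at
    # the code that actually runs; __wrapped__ is still set, so
    # inspect.signature keeps following the chain
    def decorator(wrapper):
        functools.update_wrapper(wrapper, wrapped, assigned=("__doc__",))
        return wrapper
    return decorator
```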
I also make some assertions tighter in the descriptor PR. These didn't
catch any bugs but I figure might as well.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158734
Approved by: https://github.com/wconstab
ghstack dependencies: #158624, #158708
3 procs were used for sm86, but we switched to sm89 and the check failed, so it switched back to 2.
sm90 is H100; idk what unittests we have running there, but I assume they also have a lot of memory.
They use larger runners, which have more GPU memory, so it's usually ok. I think it's ~22GB -> 10GB per proc if 2, 6GB per proc if 3 (cuda context maybe 1GB).
I've applied skips to the ones that OOMed
Time decreases from ~2.7hr per test job -> ~2hr
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158691
Approved by: https://github.com/huydhn
----
- First, we add a new expanded_def to FX, which will expand the
definitions of variables into multiple lines, one per variable
definition. This makes extremely long args/return lists much
more readable.
- Next, we extend this mechanism to also print out descriptors on
placeholders and return values, as comments, if available. This
is how we will test descriptors.
- We update tlparse for AOTAutograd to use this format.
- We update expect tests to use this format and update their formats,
so you can inspect what it can look at. There may be other tests
I should update, open to suggestions.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158708
Approved by: https://github.com/wconstab
ghstack dependencies: #158624
One of the recurring challenges of working with FX graphs produced by
AOTAutograd is that there is a very intricate input/output calling
convention that is essentially impossible to understand without actually
reverse engineering the AOTAutograd code. It is so bad that there
is a bit of logic for stashing indices of relevant arguments/outputs
in TracingContext so Inductor can figure out what the correct arguments
are.
This PR introduces the necessary scaffolding to keep track of
"descriptors" of every input/output to a (joint) FX graph produced
by AOTAutograd. First read through descriptors.py to get a sense for
what is available: for inputs, you can figure out if you have
a plain input, tangent, parameter, or something more exotic like
one of the fields of a subclass or view base. For outputs, you can
determine if you have a plain output or grad, or something more exotic
like the contents of a mutated input or an intermediate base of several
views that were returned.
There are two distinct parts of this patch: AOTInput tracking, and
AOTOutput tracking.
**AOTInput tracking.** The way this works is that AOTAutograd starts off with some Tensor `flat_args` that are the inputs to the graph being traced, and then updates these arguments as it modifies the input calling convention. Anywhere these `args` are passed around, we now add a new argument `args_descs`, which is updated in sync with `args`. Add a new arg? Add a new AOTInput to `args_descs`.
**AOTOutput tracking.** Originally, I wanted to also add an `outs_descs`
analogous to `args_descs` tracking output metadata. However, it is
often difficult to compute what the output will be until you're actually
tracing the function for real (and are able to peek at the real
outputs). So we only compute `outs_descs` when we actually trace. To do
this, we change the calling convention of the function we trace to
return not just outputs, but a tuple of `outs` and `outs_descs`. Before
we bottom out at the `make_fx` invocation, we save `outs_descs` to a
nonlocal and bottom out.
To actually make use of this information in a useful way, see the next PR. Potentially the two PRs could be combined together but I think it's actually clearer for them to be separate.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158624
Approved by: https://github.com/xmfan
## Fixes https://github.com/pytorch/pytorch/issues/157959
## mini repro from issue
```python
import torch
from torch import nn
class Foo(nn.Module):
def __init__(
self,
use_parameter: bool
) -> None:
super().__init__()
self.b = 101
if use_parameter:
self.b = nn.Parameter(torch.Tensor([self.b]), requires_grad=False)
def forward(self, x: torch.Tensor) -> torch.Tensor:
# return x + self.b
# return x - self.b
return x / self.b
# return x * self.b
torch.manual_seed(42)
x = torch.rand((5, 5))
expected = Foo(False)(x)
models = [
Foo(False),
Foo(True),
torch.compile(Foo(False), fullgraph=True),
torch.compile(Foo(True), fullgraph=True),
]
for m in models:
print((m(x) - expected).sum())
```
all outputs equal zero except the result of torch.compile(Foo(False), fullgraph=True)
## summary:
when the divisor is a scalar, Inductor lowers div to a multiplication by the scalar's reciprocal.
This can lose precision in the C++ kernel, but not in the Triton kernel.
## why:
Generated C++ kernel; thanks to @xmfan for supplying the code.
```c++
#include <torch/csrc/inductor/cpp_prefix.h>
extern "C" void kernel(const float* in_ptr0,
float* out_ptr0)
{
{
for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(25L); x0+=static_cast<int64_t>(16L))
{
{
if(C10_LIKELY(x0 >= static_cast<int64_t>(0) && x0 < static_cast<int64_t>(16L)))
{
auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(16));
auto tmp1 = static_cast<float>(0.009900990099009901);
auto tmp2 = at::vec::Vectorized<float>(tmp1);
auto tmp3 = tmp0 * tmp2;
tmp3.store(out_ptr0 + static_cast<int64_t>(x0));
}
if(C10_UNLIKELY(x0 >= static_cast<int64_t>(16L) && x0 < static_cast<int64_t>(25L)))
{
auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(9L));
auto tmp1 = static_cast<float>(0.009900990099009901);
auto tmp2 = at::vec::Vectorized<float>(tmp1);
auto tmp3 = tmp0 * tmp2;
tmp3.store(out_ptr0 + static_cast<int64_t>(x0), static_cast<int64_t>(9L));
}
}
}
}
}
```
The float type in C typically has 6 to 7 significant digits, while the double type has 15 to 16 significant digits.
```c++
#include <iostream>
#include <iomanip>
int main() {
auto tmp1 = static_cast<float>(0.009900990099009901);
auto tmp2 = static_cast<double>(0.009900990099009901);
std::cout << std::setprecision(20) << "tmp1 = " << tmp1 << std::endl;
std::cout << std::setprecision(20) << "tmp2 = " << tmp2 << std::endl;
return 0;
}
```
the output is
```bash
tmp1 = 0.0099009899422526359558
tmp2 = 0.0099009900990099011103
```
`auto tmp1 = static_cast<float>(0.009900990099009901);` This causes tmp1 to become 0.0099009, resulting in a loss of precision, so the final result will not match the expected value.
I also found the position where the bug occurs:
86d8af6a6c/torch/_inductor/lowering.py (L6238)
The commit states that the precision loss is expected in the CUDA implementation.
original commit
03439d4c1c
cuda implementation
0636c11811/aten/src/ATen/native/cuda/BinaryDivTrueKernel.cu (L36-L38)
What is interesting is that the Triton kernel works correctly because Python's float type is double precision.
```python
def triton_poi_fused_div_0(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 25
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x0 = xindex
tmp0 = tl.load(in_ptr0 + (x0), xmask)
tmp1 = 0.009900990099009901
tmp2 = tmp0 * tmp1
tl.store(out_ptr0 + (x0), tmp2, xmask)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158231
Approved by: https://github.com/eellison
The Python dispatcher is not always enabled in fake tensors and has to be enabled explicitly.
While it should be enabled by default, that requires some work to get all tests working.
I have run into several issues (e.g. XLA, Helom) where I had to add enable_python_dispatcher to avoid problems related to that; for the view specifically, I moved it into the fake tensor implementation.
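A hedged sketch of the explicit enabling described above:
```python
import torch
from torch._dispatch.python import enable_python_dispatcher
from torch._subclasses.fake_tensor import FakeTensorMode

# enable the Python dispatcher explicitly while running ops under fake
# tensors, so ops like view take the Python-dispatch path
with FakeTensorMode(), enable_python_dispatcher():
    x = torch.randn(3, 4)
    y = x.view(12)
```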
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158406
Approved by: https://github.com/bobrenjc93
The original code comments:
```python
# Let's not fail if we can't clean up the temp dir. Also note that for
# Windows, we can't delete the loaded modules because the module binaries
# are open.
```
But we are missing the `ignore_errors` parameter on Windows; this adds it.
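For reference, a minimal sketch of the call in question:
```python
import shutil
import tempfile

tmpdir = tempfile.mkdtemp()
# ... build and load modules from tmpdir ...
# don't fail if cleanup is impossible, e.g. when loaded module binaries
# are still open on Windows
shutil.rmtree(tmpdir, ignore_errors=True)
```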
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159025
Approved by: https://github.com/jansel
The MIOpen integration has changed over the years. In the past, the MIOpen default for benchmark was True and if it were set to False it would use MIOpen Immediate Mode. But with #145294 the MIOpen benchmark default changed to False and to activate immediate mode you would set the deterministic flag to True. This has proved too restrictive because benchmark and deterministic flags are independent from immediate mode. Thus, immediate mode needs its own flag. Though MIOpen still masquerades behind torch.backends.cudnn and its flags, it seemed inappropriate to add an miopen-exclusive flag to the set of cudnn flags. This PR adds the first miopen-only flag to control its immediate mode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158951
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Summary: The subclass can override the filtering logic to customize which frames to keep or drop.
Test Plan:
```
buck run caffe2/test:test_export -- -r test_stack_trace
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:others -- -r test_constant_random
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_custom_obj_list_out
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r class_member_back_compat
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158266
Approved by: https://github.com/ezyang, https://github.com/yushangdi
Fixes #158120
The issue was caused by populating a builtin tensor fn map at import time: if torch.export.export was called with the `meta` device before any dynamo imports, this map would not yet be populated, so it would get populated at import time, which would try to call `torch.disable` before it was initialized.
Fix is to populate this map lazily
```
python test/dynamo/imports_non_circular_repro.py TestImports.test_circular_import_with_export_meta
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158931
Approved by: https://github.com/StrongerXi, https://github.com/mlazos, https://github.com/anijain2305
TL;DR: Cuts vLLM cudagraph collection from 80s -> 24s
Stop garbage collecting by default on every cudagraph recording. The old behavior can be re-enabled by setting `TORCH_CUDAGRAPH_GC=1` or the config `force_cudagraph_gc`.
We were previously garbage collecting at the beginning of each cudagraph
capture. vLLM collects 5427 graphs and most of those garbage collections weren't
actually collecting any memory (CPU or GPU). This changes it to collect no more than once every 10s, so if we're capturing in a loop we don't burn all our cycles looking for garbage.
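A minimal sketch of the throttling idea, assuming a module-level timestamp (names are illustrative, not the actual implementation):
```python
import gc
import time

_LAST_GC = 0.0

def maybe_collect(interval_s: float = 10.0) -> None:
    # skip gc.collect() unless interval_s has elapsed since the last
    # collection, so capturing graphs in a loop doesn't burn cycles
    global _LAST_GC
    now = time.monotonic()
    if now - _LAST_GC >= interval_s:
        gc.collect()
        _LAST_GC = now
```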
(These numbers have a lot of variance from run to run but give the correct general scale.)
```
| calls | total | synchronize | gcs | collect | empty cache | sys freed | cuda freed |
-------+-------+-------+-------------+------+---------+-------------+-----------+------------+
before | 5427 | 78s | 1.48s | 5427 | 53.22s | 1.21s | 145855 | 1539309568 |
-------+-------+-------+-------------+------+---------+-------------+-----------+------------+
after | 5427 | 24s | 0s | 3 | 1.53s | 0.84s | 592 | 1539309568 |
-------+-------+-------+-------------+------+---------+-------------+-----------+------------+
```
total - this is the total time reported by vLLM's "Graph capturing finished" log.
The rest of these are measured in torch.cuda.graphs.graph.__enter__():
calls - number of times torch.cuda.graphs.graph.__enter__ was called
synchronize - this is the duration taken by the cuda.synchronize call
gcs - number of times gc.collect was called
collect - this is the duration taken by the gc.collect call
empty cache - this is the duration taken by the torch.cuda.empty_cache call
sys freed - the number of bytes reported freed by gc.collect
cuda freed - the number of bytes reported freed by torch.cuda.memory_reserved
So it seems like the heavy lifting is done by torch.cuda.empty_cache() which is
fairly quick.
Cudagraph results from the TorchInductor Performance DashBoard (this is from the original version using the GC clock so the real results will be slightly better than this):
<img width="1494" height="382" alt="image" src="https://github.com/user-attachments/assets/69b705ef-47ce-4b6e-9733-1ec941cad93d" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158193
Approved by: https://github.com/ngimel
Before:
```
.Observed exception
Explanation: Dynamo found no exception handler at the top-level compiled function when encountering an exception. Exception will propagate outside the compiled region.
Hint: Dynamo has detected that tracing the code will result in an error when running in eager. Please double check that your code doesn't contain a similar error when actually running eager/uncompiled.
Hint: It may be possible to write Dynamo tracing rules for this code. Please report an issue to PyTorch if you encounter this graph break often and it is causing performance issues.
Developer debug context:
```
After:
```
Observed exception
Explanation: Dynamo found no exception handler at the top-level compiled function when encountering an exception. Exception will propagate outside the compiled region.
Hint: Dynamo has detected that tracing the code will result in an error when running in eager. Please double check that your code doesn't contain a similar error when actually running eager/uncompiled.
Hint: It may be possible to write Dynamo tracing rules for this code. Please report an issue to PyTorch if you encounter this graph break often and it is causing performance issues.
Developer debug context: raised exception TypeError([ConstantVariable(str: "unhashable type: <class 'torch._dynamo.variables.dicts.SetVariable'>")])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158924
Approved by: https://github.com/williamwen42, https://github.com/zou3519
Previously, precompile was implemented under the assumption that Dynamo always inlines the user code and generates resume functions when a graph break is hit. In cases like nanogpt training, there is a nontrivial amount of code that causes Dynamo to fail speculation and stop inlining certain types of user functions. This results in more code objects being tracked by CompilePackage.
Since these new code objects are user defined, we also need to serialize their locations so that we can load the precompile entries onto these code objects in another process.
With this fix, we are able to run nanogpt inference+training with precompile under torchbench.
Differential Revision: [D78691422](https://our.internmc.facebook.com/intern/diff/D78691422/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158947
Approved by: https://github.com/jamesjwu
cuBLAS added support for them in CUDA 12.9. It's rather easy to call into them; the hardest part is allowing the lhs and rhs operands to have different scaling types, as that changes the whole call stack.
The scaling format is still detected from the sizes of the scale tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158037
Approved by: https://github.com/eqy, https://github.com/drisspg
When select has a data-dependent input, we can't tell whether the actual index should be index+size or index.
To avoid throwing a DDE, we allocate a new unbacked symbol to represent the storage offset of the output view and compute its value dynamically at runtime during Inductor lowering.
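A minimal sketch of the situation, assuming the index comes from `.item()` so only an unbacked symbol is visible at trace time (before this change the lowering could raise a data-dependent error; with it, the view's storage offset is computed at runtime):
```python
import torch

torch._dynamo.config.capture_scalar_outputs = True  # let .item() produce an unbacked SymInt

@torch.compile(fullgraph=True)
def f(x, idx):
    i = idx.item()           # unbacked: could be negative or non-negative
    return x.select(0, i)    # storage offset of the view is either index or index + size

x = torch.randn(8, 4)
print(f(x, torch.tensor(3)))
```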
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157605
Approved by: https://github.com/ColinPeppler
Add `sort` and `scatter_add` strategies. I am reusing the strategy for `scatter`-related ops for quick support. The strategy can potentially be improved after we fix index-related strategies.
Minor fix: fix `replicate_op_strategy` to support outputting multiple tensors, which is required by aten.sort.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159022
Approved by: https://github.com/XilunWu, https://github.com/wconstab
Summary: __assert_fail is declared slightly differently in the Emscripten stdlib. This may cause errors when compiling with Emscripten.
Test Plan:
N/A
Rollback Plan:
Differential Revision: D78500790
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158580
Approved by: https://github.com/JacobSzwejbka
As part of Better Engineering week, we would like to improve our type support to improve the dev experience in Dynamo.
This PR adds strict typing support to a critical set of files for Dynamo: `source.py` and the base `_guards.py`.
Running
```
mypy torch/_dynamo/source.py torch/_guards.py --linecount-report /tmp/coverage_log
```
|  | Lines Annotated | Lines Total | % lines covered | Funcs Annotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main | 1227 | 2208 | 55.57% | 207 | 362 | 57.18% |
| This PR | 2217 | 2217 | 100.00% | 362 | 362 | 100.00% |
| Delta | +990 | +9 | +44.43% | +155 | 0 | +42.82% |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158397
Approved by: https://github.com/anijain2305
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
Putting both the dispatch API and the combine API to the test, one following the other, i.e.
```
all_to_all_vdev_2d(inp, out, inp_splits, out_splits_offsets, ...)
all_to_all_vdev_2d_offset(
input=out,
out=combine_out,
in_splits_offsets=out_splits_offsets,
out_splits_offsets=combine_out_splits_offsets
)
```
Here the `out_splits_offsets` from dispatch perfectly serves as the `in_splits_offsets` argument for combine.
Then we assert that the output of combine is exactly the same as the original input to shuffle, and combine's output splits are exactly the same as the original input splits.
It works!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157026
Approved by: https://github.com/Skylion007, https://github.com/ngimel
ghstack dependencies: #158234, #158235, #156743, #156881
Added `all_to_all_vdev_2d_offset`, which:
Performs a 2D AllToAllv operation, with input split and offset information provided on device. The input offsets need not be the exact prefix sum of the input splits, i.e. paddings are allowed between the split chunks. The paddings, however, will not be transferred to peer ranks.
In Mixture of Experts models, this operation can be used to combine tokens processed by experts on remote ranks. It can be viewed as a "reverse" operation to the `all_to_all_vdev_2d` operation (which shuffles tokens to experts).
The change may seem a bit dense, sorry. But it is mainly two changes:
1. templating existing device functions (to use provided input offset or calculate it)
2. generalizing variable names, e.g. npes, ne --> minor_size, major_size,
so that I can use the same alltoall function for a matrix of (nranks, ne) as well as a matrix of (ne, nranks).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156881
Approved by: https://github.com/ngimel
ghstack dependencies: #158234, #158235, #156743
This PR addresses a few small bugfixes needed to make NanoGPT inference work, and also adds a new `--caching-precompile` argument to torchbench. With `--caching-precompile`, after every benchmark we save precompile artifacts to DynamoCache, allowing us to test caching precompile on all existing benchmarks.
The following bugfixes are in this PR to make all of this work:
- Fix global variables being pruned with DUPLICATE_INPUT guards. DUPLICATE_INPUT guards have additional vars from the second input, which we track with additional_local_vars, but we never tracked additional global variables. This fixes the issue. (See torch/_dynamo/guards.py changes)
- Return None from PrecompileContext.serialize() if no new dynamo compiles occurred. There's no reason to save artifacts (i.e. autotuning artifacts, etc.) if no dynamo_compile occurred, so we return None early. We may later want to support editing existing dynamo artifacts as a TODO, but that's upcoming.
- log `dynamo_start` on CompilePackage.load: This is only needed so that tlparse doesn't ignore TORCH_TRACE logs generated when caching precompile hits. If there are no actual compiles, we never log a "dynamo_start" entry, which makes internal tlparse ignore the TORCH_TRACE file.
## Test Plan
After this PR, the following now works:
```
TORCH_LOGS=dynamo tlp python benchmarks/dynamo/torchbench.py --only nanogpt --performance --inference --backend inductor --caching-precompile --warm-start-latency
```
tlparse result (internal):
Cold Start (6 seconds):
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_vk9nkp4m.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
Warm Start (~1 s):
https://manifold.edge.x2p.facebook.net/v0/read/tree/logs/.tmpAWe0zD/dedicated_log_torch_trace_5l4iwrpm.log/index.html?bucketName=tlparse_reports&apiKey=tlparse_reports-key&withPayload=1&timeoutMsec=10000
The 1 second of warm start here can be improved: the costs here are mostly in starting up workers and triton and initializing CUDA, a lot of which should not be included in the compile time cost in real world scenarios where these are already loaded before training begins.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158847
Approved by: https://github.com/zhxchen17
- Prevent the Inductor test for argsort/sort from wrongly failing when the argsort/sort output with stable=False differs from PyTorch's eager output but is still a valid argsort output.
- Add functionality to allow alternative assert_equal functions in Inductor tests for future cases.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146622
Approved by: https://github.com/eellison
Co-authored-by: George Wigley <georgewi@graphcore.ai>
Summary:
In general, device_ is not very useful in OpKernel. Remove it to avoid misuse.
The meaning of `device_` is also ambiguous in OpKernel:
For StaticDispatch kernels, we always call the CPU kernel.
For C10Kernel, we rely on the input tensor's device and the dispatcher to determine which device to run on.
For ops involving multiple devices, e.g. aten._to_copy(device), the meaning of device is ill-defined.
Test Plan:
CI
Rollback Plan:
Reviewed By: henryoier, dolpm, kqfu, zhxchen17
Differential Revision: D78704840
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158944
Approved by: https://github.com/dolpm
3 procs were used for sm86, but we switched to sm89 and the check failed, so it switched back to 2.
sm90 is H100; I don't know which unit tests we have running there, but I assume they also have a lot of memory.
They use larger runners, which have more GPU memory, so it's usually OK. I think it's ~22GB total -> 10GB per proc if 2, 6GB per proc if 3 (CUDA context maybe 1GB).
I've applied skips to the tests that OOMed.
Time decreases from ~2.7hr per test job -> ~2hr.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158691
Approved by: https://github.com/huydhn
This PR suggests adding some models to `cpu_skip_list` which are currently being run in TIMM and Torchbench.
The suggested models take a long time, which leads to the benchmark runs timing out. [benchmark runs for aarch64](https://github.com/pytorch/pytorch/actions/workflows/inductor-perf-test-nightly-aarch64.yml)
• The issue stems from unoptimized groupwise convolution (BF16/F16 dtype) kernels for aarch64 platforms, which significantly slow down execution, leading to the timeout.
**Action:**
• An optimized BF16 groupwise convolution kernel is currently being developed in oneDNN, targeted for release in Q4 2025.
To maintain dashboard consistency and signal clarity, I’ve skipped the affected tests in:
* timm benchmarks
* torchbench benchmarks
As suggested, the skip is applied at the CPU arch level, explicitly branching for aarch64 and adding the models which need to be skipped. This keeps the logic clean, but:
• An alternative considered was increasing shard counts for aarch64 runners; given the known performance bottleneck, skipping avoids wasted compute cycles. Suggestions around this will be appreciated.
The benchmark no longer times out after the suggested change: https://github.com/pytorch/pytorch/actions/runs/16447200138
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158842
Approved by: https://github.com/malfet
Added `torch.hash_tensor` reduction function with a `mode` argument that defaults to reduction with xor.
- The hash is always uint64.
- Integers will be cast to uint64 before performing the xor_sum reduction
- Floats will be upcast to double and then bitcast to uint64 before performing the xor_sum reduction (see the usage sketch below)
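A minimal usage sketch based on the description above (the default xor mode is assumed):
```python
import torch

x = torch.arange(8, dtype=torch.int32)
h = torch.hash_tensor(x)        # integers are cast to uint64, then xor-reduced
print(h.dtype)                  # expected: torch.uint64

f = torch.randn(8)
hf = torch.hash_tensor(f)       # floats are upcast to double, bitcast to uint64, then xor-reduced
print(hf)
```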
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154149
Approved by: https://github.com/albanD
This PR consists of all the changes required to enable PyTorch ROCm CI on MI355X nodes.
- Rework the aotriton cmake configuration to rely on `HIP_VERSION` instead of `ROCM_VERSION`, as aotriton depends on HIP. HIP loosely tracks the ROCm major version, but the two are not actually synchronized, as observed in the ROCm 7 alpha build.
- Bump composable-kernel submodule to [df6023e305f389bbf7249b0c4414e649f3ad6598](df6023e305) for mi350 compatibility.
- Extend the change-docker-permissions step to the MI355x runners as well. This step applies the required permission change to the test folder for a successful upload of artifacts in k8s docker.
- Create new rocm-mi355 workflow to trigger core PyTorch tests on a nightly basis at 2:30 am PST.
- Successfully tested running the test suites listed in rocm-mi355.yml on MI355 runners by temporarily hacking rocm-mi300.yml: ca7d5fae11 (rocm-mi300)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158889
Approved by: https://github.com/jeffdaily
#### 1. Provide a default fallback strategy that can apply to an arbitrary operator whose output is a single tensor.
We can call register_op_strategy to register using the `fallback_op_strategy`:
- For an op without List[Tensor] as input, call:
```
register_op_strategy(op_overload)(replicate_op_strategy)
```
- For an op that contains List[Tensor] as input, call:
```
register_op_strategy(op_overload, schema_info=RuntimeSchemaInfo(needs_pytree=True))(replicate_op_strategy)
```
The strategy will force all input and output to be replicated with the corresponding redistribute_cost.
#### 2. Add a test function as a necessary-condition check for a strategy function.
```
detect_exists_identical_opspec(*args, op, mesh, strategy_function)
```
This function detects if identical strategies will be produced given the sample `args`. It will iterate all combinations of placements for each arg and produce the output strategy from the registered `strategy_function`.
#### 3. Provide a context manager `op_strategy_context` to easily register/unregister strategies for testing.
E.g.,
```
with op_strategy_context(test_op.default, replicate_op_strategy):
...
```
#### 4. Fix a bug where TupleStrategy never gets flattened as expected:
9df0176408/torch/distributed/tensor/_op_schema.py (L286)
Basically we need to 1) register_pytree_node for TupleStrategy, 2) propagate the schema_info to `strategy_schema` after `strategy_schema = _wrap_with_op_strategy(op_schema)`.
This is the first implementation. The plan is to add support for sharding on the batch dim as the output strategy next.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158046
Approved by: https://github.com/wanchaol, https://github.com/wconstab
We don't create new PGs when slicing a DeviceMesh, so it is relatively safe to relax the requirement that one can only slice from the root mesh. But this does come with a caveat when the usage is asymmetric, for example when only some ranks have the sliced-out submesh. So aside from removing the requirement, we also add a warning here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158899
Approved by: https://github.com/wz337
This PR removes the integration point torch.fx had with torch::deploy (and another minor change).
Note: this PR surfaces some mypy errors, but I believe those were already in the code base beforehand and should be fixed in a separate PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291
Approved by: https://github.com/albanD
ghstack dependencies: #158288, #158290
This PR is part of the work to deprecate torch::deploy in OSS. Effectively it does 3 things to get started.
1. Remove test_deploy_interaction as we no longer need to worry about this
2. Remove all torch._running_with_deploy checks and use the False path always (surfaced 1)
3. Remove `USE_DEPLOY` and switch to the default path always
Note: MyPy does fail on a number of things here, as several older files are touched. It may be better to fix these in a separate PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158288
Approved by: https://github.com/albanD
Hi team,
Please help review this patch.
This PR https://github.com/pytorch/pytorch/pull/150370 tried to fix the "Empty C Call Queue" problem on Python 3.12. It added C calls for each starting Python event with a callable.
I found that the root cause is not that we cannot get C function frames via `PyFrame_GetBack` when PythonTracer is filling start frames, but rather the C call event loss bug on Python 3.12.0-3.12.4. That problem was fixed by 257c413cd1 in 3.12.5.
So I think https://github.com/pytorch/pytorch/pull/150370 cannot fix the problem; this patch reverts its change.
There are solutions to fix the problem correctly, such as adding a new monitoring callback to compensate for call events of methods with C functions, or overriding the callback registered by `PyEval_SetProfile`. These solutions may make the code harder to maintain.
~~Since upgrading the micro version of Python is not difficult for users, we can just ignore C functions and suggest user upgrade.~~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155446
Approved by: https://github.com/sraikund16, https://github.com/cyyever
Thanks to @davidberard98 for much of the analysis here. For GEMMs with K=1, the hints `tl.multiple_of` and `tl.max_contiguous` apply completely, as the indices to the loads depend only on `offs_m` and `offs_n`. For shapes like `(97x1), (1x97)`, this results in misaligned address errors, because for all BLOCK_M and BLOCK_N sizes the last tile is not a contiguous load. In the K > 1 case, the hint is not as strict given the dependency on the k indices for the load as well. In the K=1 case, only `offs_m` and `offs_n` are used and broadcast to the index shape.
One can say these hints are "wrong", but in various cases where the hints are wrong, such as with the shape `9999x4, 4x9999`, there is a substantial performance improvement with the hint.
For nice shapes with K=1, where M and N are a multiple of 8 so that these hints are fine and there is no misaligned address, no performance regression is observed on H100:
<img width="547" height="402" alt="Screenshot 2025-07-18 at 5 05 47 PM" src="https://github.com/user-attachments/assets/fee2bbaa-784c-422e-bb8c-43c6c2607ad2" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158650
Approved by: https://github.com/davidberard98
Add a test that requires weights to be packaged for torch native.
For now, we need `package_weights_in_so=True` for compile standalone. The constants are in a `.o` file and will be added as a source to the CMakeLists.txt of the model.
After we add weight deduping, we should be able to let this config be False.
```
python test/inductor/test_aot_inductor_package.py -k test_compile_with_exporter_weights
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158750
Approved by: https://github.com/desertfire
Fix up decomposeK autotuning by removing the condition on returning more than `k_splits_limit` splits and setting the default to 10 instead of 5. Allow `k_splits_limit` to be configured by the user via `TORCHINDUCTOR_NUM_DECOMPOSE_K_SPLITS`, and also allow the user to configure the threshold at which to use decompose_k via `TORCHINDUCTOR_DECOMPOSE_K_THRESHOLD`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158745
Approved by: https://github.com/eellison
An out of tree backend can have its own configuration options that the user can enable to control inductor compilation. These config options need to be taken into account when calculating the key that is used to determine cache miss / hits. This PR allows out of tree backends to specify a custom config module that has the same type as `torch._inductor.config` that can be used to control codegen (in addition to the default config), and will be used when creating the cache key.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158254
Approved by: https://github.com/eellison
Avoid failures caused by tests exiting via sys.exit instead of `unittest.skip`
In particular, it will not try to start the test (forking into a subprocess) just to stop it (killing the subprocess), which is what happens in the test setup.
Using `unittest.skip` decorators avoids starting the test in the first place.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158846
Approved by: https://github.com/Skylion007
Summary:
# Why
capture relevant data for offline lookup table generation
# What
report the hinted sizes, not just the symbolic sizes
Test Plan:
```
buck2 run mode/opt scripts/coconutruben/torchmm:experiment 2>&1 | tee /tmp/epx040
```
This only validates that this change does not break anything, as the schema is not on scuba yet (not actualized)
Rollback Plan:
Reviewed By: stashuk-olek
Differential Revision: D77837548
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158852
Approved by: https://github.com/jingsh
# Description
Add base docker image for vllm.
It seems like we use the base docker image for both the pytorch build and tests. Configure a base image for vllm against pytorch CI.
# Others
Added a readme regarding how the base docker images are used and how to add one; this also explains which file is the right one to modify.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158755
Approved by: https://github.com/seemethere, https://github.com/huydhn
### What
- Use `statically_known_true` over `guard_size_oblivious` in cases where we're checking an optimization path. Otherwise, it will DDE and we can't take the safe/slower path.
- For broadcast checks, use `fallback=False` if we encounter a DDE. Typically, unbackeds would be ≥2 and that falls in line with size-oblivious reasoning (i.e. when `size_oblivious=True`). A sketch follows the examples below.
### Example DDE
```
torch._inductor.exc.InductorError: LoweringException: GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq((u0//387), 1) (unhinted: Eq((u0//387), 1)). (Size-like symbols: u0)
Caused by: (_inductor/lowering.py:488 in broadcast_symbolic_shapes)
```
```
torch._inductor.exc.InductorError: LoweringException: GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq((u0//387), 1) (unhinted: Eq((u0//387), 1)). (Size-like symbols: u0)
Caused by: (_inductor/ir.py:2797 in create)
```
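A minimal sketch of the pattern described above, assuming the helper from `torch.fx.experimental.symbolic_shapes` (the function and shapes are illustrative):
```python
import torch
from torch.fx.experimental.symbolic_shapes import statically_known_true

def maybe_specialize(x):
    # Gate the optimization on a *provable* fact: with an unbacked size,
    # statically_known_true returns False instead of raising a data-dependent
    # error, so we silently fall back to the safe/slower path.
    if statically_known_true(x.shape[0] == 1):
        return x.squeeze(0)   # specialized path
    return x                  # safe path
```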
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155267
Approved by: https://github.com/eellison
Adds guilhermeleobas to merge_rules for Dynamo and functorch.
Guilherme has done good work on both of these subsystems and I am tired
of him approving my PRs and me not being able to merge them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158620
Approved by: https://github.com/anijain2305
Differential Revision: D78431075
For #158366
- Calls runtime asserts pass for HOP subgraphs (in reenter_make_fx)
- For while_loop only (can be expanded), clones input tensors for subgraph tracing, so unbacked memos (item, nonzero, etc.) aren't reused
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158467
Approved by: https://github.com/ydwu4
# Background
`C10_WARP_SIZE`, although always `32` on the CUDA platform, varies across different AMD GPUs.
Therefore, to refer to this value correctly, host code must use a variable instead of a literal defined by a macro or a `constexpr int`.
This PR may cause more compiler errors for third-party code on AMD GPUs, which is intentional. Having a fixed `C10_WARP_SIZE` value in host code for AMD GPUs only defers a compile-time error to runtime.
This PR is recommended to be included as part of Release Notes to describe an API change for whoever uses this macro.
Users are recommended to use `C10_WARP_SIZE` directly, which adapts to the various scenarios, or to define a macro in terms of `C10_WARP_SIZE`. Assigning this macro to symbols shared by host/device code causes problems on the ROCm platform. (See the fix in `aten/src/ATen/native/cuda/layer_norm_kernel.cu` for a concrete example.)
# Behaviors
* If compiling with HIPCC (i.e `defined(__HIPCC__)`):
+ Define `C10_WARP_SIZE` to be non-`constexpr` `at::cuda::warp_size()` for host-compilation pass (as compared to `static constexpr int C10_WARP_SIZE = 1;` set in 04bd7e6850e8efec77994963ffee87549555b9c3)
+ Define `C10_WARP_SIZE` to be a function returning `constexpr int` `64` for `__GFX9__`, and `32` otherwise, for device-compilation pass
- `__GFX8__` is also 64 but we do not support any GFX8 GPU.
* If not compiling with HIPCC:
+ Define `C10_WARP_SIZE` to be non-constexpr `at::cuda::warp_size()`
# `constexpr` variant for host code
For host-compilation cases where a `constexpr` value is needed for warp size (eg. launch bounds), use `C10_WARP_SIZE_STATIC`, which is defined as `64`. This macro follows the pre 04bd7e6850e8efec77994963ffee87549555b9c3 behavior of `C10_WARP_SIZE`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158271
Approved by: https://github.com/jeffdaily
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
Fixes#156012
This is a temporary solution that makes context parallelism work before the logsumexp behavior changes land in AOTriton.
After discussion, we are not going to release AOTriton 0.10.1 to fix this because:
* Even if the interface is not changed, changing the behavior of the returned logsumexp tensor should still be considered an ABI break. Such changes do not fall into the "ABI compatible" category and should be postponed to the next release.
* AOTriton 0.11 is scheduled to be released before the end of July, which is less than five weeks away.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156903
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Fixes https://github.com/pytorch/pytorch/issues/158164
This was fixed by applying `skip_code_recursive` to any function registered to `sys.monitoring` (via `PyThreadState_GET()->interp->monitoring_callables`). This check is done whenever we attempt to set the eval frame callback from Python.
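A minimal sketch of the scenario this fixes, assuming Python 3.12+'s `sys.monitoring` API (the profiler callback here is illustrative):
```python
import sys
import torch

def on_py_start(code, instruction_offset):
    # A real profiler would record the event here; with this fix, Dynamo skips
    # compiling this registered callable instead of tracing it.
    return None

sys.monitoring.use_tool_id(sys.monitoring.PROFILER_ID, "example-profiler")
sys.monitoring.register_callback(
    sys.monitoring.PROFILER_ID, sys.monitoring.events.PY_START, on_py_start
)
sys.monitoring.set_events(sys.monitoring.PROFILER_ID, sys.monitoring.events.PY_START)

@torch.compile
def f(x):
    return x + 1

f(torch.randn(4))
```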
Microbenchmark: `benchmarks/dynamo/microbenchmarks/overheads.py`:
BEFORE:
```
requires_grad=False
eager 7.1us (warmup=0.0s)
compiled 24.6us (warmup=10.0s)
requires_grad=True
eager 8.9us (warmup=0.0s)
compiled 57.8us (warmup=0.1s)
inference_mode()
eager 6.5us (warmup=0.0s)
compiled 23.4us (warmup=0.1s)
```
AFTER:
```
requires_grad=False
eager 7.0us (warmup=0.0s)
compiled 23.2us (warmup=15.2s)
requires_grad=True
eager 9.0us (warmup=0.0s)
compiled 55.1us (warmup=0.1s)
inference_mode()
eager 6.4us (warmup=0.0s)
compiled 22.2us (warmup=0.1s)
```
Followup thought: how do we let users know that a frame is skipped because the code object is a callable registered to sys.monitoring? (or any other reason?)
Differential Revision: [D78530528](https://our.internmc.facebook.com/intern/diff/D78530528)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158171
Approved by: https://github.com/jansel
https://github.com/pytorch/pytorch/pull/154193 got reverted due to a test failure. The root cause: an executorch pass turns int inputs into a scalar tensor in cond's subgraph. The pass has been on the critical path of executorch for two years, and changing it would be difficult. So we just allow non-fake inputs when checking input mutation and aliasing, which shouldn't affect the correctness of the analysis.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158798
Approved by: https://github.com/pianpwk
Fix PyTorch tensor copying warning in ONNX export
## Problem
PyTorch ONNX exporter was generating a warning about incorrect tensor copying method:
```
UserWarning: To copy construct from a tensor, it is recommended to use sourceTensor.clone().detach() or sourceTensor.clone().detach().requires_grad_(True), rather than torch.tensor(sourceTensor).
```
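For reference, a small sketch of the pattern the warning recommends (presumably what the export path switches to):
```python
import torch

source = torch.randn(3)

# Instead of torch.tensor(source), which triggers the UserWarning:
copy = source.clone().detach()
copy_with_grad = source.clone().detach().requires_grad_(True)
```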
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158835
Approved by: https://github.com/justinchuby
Looks like all MPS operations will crash if one of the tensor dimensions is greater than `2**31-1`.
Change it into a structured exception by checking the tensor size before attempting to create an MPS Tensor.
Add a regression test for it. Before this change, running the following would abort with an exception:
```
% python3 -c "import torch; torch.randint(0, 10, (2**31,), dtype=torch.uint8, device='mps')"
/AppleInternal/Library/BuildRoots/1c8f7852-1ca9-11f0-b28b-226177e5bb69/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:829: failed assertion `[MPSNDArray initWithDevice:descriptor:isTextureBacked:] Error: NDArray dimension length > INT_MAX'
zsh: abort python3 -c
```
Skip the test on MacOS-13, as it crashes somewhere deep in MPSGraph framework with
```
/AppleInternal/Library/BuildRoots/c651a45f-806e-11ed-a221-7ef33c48bc85/Library/Caches/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:724: failed assertion `[MPSTemporaryNDArray initWithDevice:descriptor:] Error: total bytes of NDArray > 2**32'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158824
Approved by: https://github.com/dcci
ghstack dependencies: #158690, #158823
Main changes:
- bucketing collectives only from the same process_group by group_name
- Support for groups like [0,2,4,6], [0,1,3,5] using `rank_idx_dict` for in-pass operations (slice idxs, etc.)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158632
Approved by: https://github.com/wconstab
When running BundledAOTAutogradCache with precompile, we still need to run Triton bundling so that the precompiled CompiledFxGraph has Triton CUDA kernels. We also pre-save the autotune results in the precompile artifact.
It would be even better to pre-trim the CUDA kernels on save and apply them, which we can work on later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158048
Approved by: https://github.com/zhxchen17
Summary:
for this particular instance, we're doing
from torch._inductor.config import trace
...trace.provenance_tracking...
but for all other call sites, we're doing
from torch._inductor import config
... config.trace.provenance_tracking....
Test Plan:
CI
Rollback Plan:
Differential Revision: D78699876
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158796
Approved by: https://github.com/c00w
Summary: We added group split in D78300794 and remote_group_merge in D78450094. We first want to upstream this change to PGNCCLx as well so that NCCLx can use this new API and we can continue our c10d clean up in https://github.com/pytorch/pytorch/pull/158488.
Test Plan:
CI
```
buck test -c hpc_comms.use_ncclx=stable comms/ncclx/pg/tests:test_c10d_ncclx -- test_group_split_and_merge
```
Rollback Plan:
Differential Revision: D78521060
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158790
Approved by: https://github.com/d4l3k
Summary:
We transfer stack trace in post_grad passes.
We shouldn't add "stack_trace" to _COPY_META_FIELDS because _COPY_META_FIELDS is used in proxy.py where stack_trace is explicitly set.
Since the stack_trace is being used by more and more debugging tools, we should also start testing it more rigorously. This PR starts by adding a first test verifying that the stack trace is preserved through post_grad_passes.
Test Plan:
```
buck run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing -- -r test_pattern_matcher_transfer_meta
buck run mode/dev-nosan
fbcode//caffe2/test/inductor:auto_functionalize -- --rcaffe2/test/inductor:auto_functionalize_old
```
Rollback Plan:
Differential Revision: D78669729
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158752
Approved by: https://github.com/jingsh
And prevent new ones from appearing by removing `-Wno-error=extra-semi` (not sure what the reason was behind adding the warning but not erroring on it when building with -Werror, introduced by https://github.com/pytorch/pytorch/pull/140236).
300+ violations of that rule were fixed by running `sed -i -e "s/});/})/" /` against `torch/nativert`
Other 3p deps that needs updates:
- TensorPipe
- LLVM
- FBGEMM
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158730
Approved by: https://github.com/Skylion007
Summary:
**Changes**
* Deleted function import from build definition utilities
* Removed `load("//tools/build_defs:fbsource_utils.bzl", "is_arvr_mode")`
* Replaced is_arvr_mode() function calls with direct references to configuration flags
* Changed from `is_arvr_mode()` to `"ovr_config//build_mode:arvr_mode"`
* Changed conditional expressions to Buck `select()` statements
Test Plan:
Check if CI passes
Rollback Plan:
Differential Revision: D78520947
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158682
Approved by: https://github.com/malfet
Fixes#147140
## Changes
- Add a `to` implementation in `MaskedTensor` to support moving `mask` to the target device
## Test Result
```python
In [1]: import torch
...: from torch.masked import as_masked_tensor
...: data = torch.tensor([1,2,3])
...: mask = torch.tensor([True,False,True])
...: mt = as_masked_tensor(data, mask).to('cuda')
...: mt.get_data().device, mt.get_mask().device
/home/zong/code/pytorch/torch/masked/maskedtensor/core.py:247: UserWarning: The PyTorch API of MaskedTensors is in prototype stage and will change in the near future. Please open a Github issue for features requests and see our documentation on the torch.masked module for further information about the project.
return MaskedTensor(data, mask)
/home/zong/code/pytorch/torch/masked/maskedtensor/_ops_refs.py:354: UserWarning: The PyTorch API of MaskedTensors is in prototype stage and will change in the near future. Please open a Github issue for features requests and see our documentation on the torch.masked module for further information about the project.
return MaskedTensor(new_data, _maybe_get_mask(args[0]))
Out[1]: (device(type='cuda', index=0), device(type='cuda', index=0))
In [2]: mt.sum(dim=0)
/home/zong/code/pytorch/torch/masked/maskedtensor/core.py:247: UserWarning: The PyTorch API of MaskedTensors is in prototype stage and will change in the near future. Please open a Github issue for features requests and see our documentation on the torch.masked module for further information about the project.
return MaskedTensor(data, mask)
Out[2]: MaskedTensor(4, True)
```
```bash
pytest test/test_maskedtensor.py -vv
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151205
Approved by: https://github.com/ezyang
Bumps [requests](https://github.com/psf/requests) from 2.32.2 to 2.32.4.
<details>
<summary>Release notes</summary>
<p><em>Sourced from <a href="https://github.com/psf/requests/releases">requests's releases</a>.</em></p>
<blockquote>
<h2>v2.32.4</h2>
<h2>2.32.4 (2025-06-10)</h2>
<p><strong>Security</strong></p>
<ul>
<li>CVE-2024-47081 Fixed an issue where a maliciously crafted URL and trusted
environment will retrieve credentials for the wrong hostname/machine from a
netrc file. (<a href="https://redirect.github.com/psf/requests/issues/6965">#6965</a>)</li>
</ul>
<p><strong>Improvements</strong></p>
<ul>
<li>Numerous documentation improvements</li>
</ul>
<p><strong>Deprecations</strong></p>
<ul>
<li>Added support for pypy 3.11 for Linux and macOS. (<a href="https://redirect.github.com/psf/requests/issues/6926">#6926</a>)</li>
<li>Dropped support for pypy 3.9 following its end of support. (<a href="https://redirect.github.com/psf/requests/issues/6926">#6926</a>)</li>
</ul>
<h2>v2.32.3</h2>
<h2>2.32.3 (2024-05-29)</h2>
<p><strong>Bugfixes</strong></p>
<ul>
<li>Fixed bug breaking the ability to specify custom SSLContexts in sub-classes of
HTTPAdapter. (<a href="https://redirect.github.com/psf/requests/issues/6716">#6716</a>)</li>
<li>Fixed issue where Requests started failing to run on Python versions compiled
without the <code>ssl</code> module. (<a href="https://redirect.github.com/psf/requests/issues/6724">#6724</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/psf/requests/blob/main/HISTORY.md">requests's changelog</a>.</em></p>
<blockquote>
<h2>2.32.4 (2025-06-10)</h2>
<p><strong>Security</strong></p>
<ul>
<li>CVE-2024-47081 Fixed an issue where a maliciously crafted URL and trusted
environment will retrieve credentials for the wrong hostname/machine from a
netrc file.</li>
</ul>
<p><strong>Improvements</strong></p>
<ul>
<li>Numerous documentation improvements</li>
</ul>
<p><strong>Deprecations</strong></p>
<ul>
<li>Added support for pypy 3.11 for Linux and macOS.</li>
<li>Dropped support for pypy 3.9 following its end of support.</li>
</ul>
<h2>2.32.3 (2024-05-29)</h2>
<p><strong>Bugfixes</strong></p>
<ul>
<li>Fixed bug breaking the ability to specify custom SSLContexts in sub-classes of
HTTPAdapter. (<a href="https://redirect.github.com/psf/requests/issues/6716">#6716</a>)</li>
<li>Fixed issue where Requests started failing to run on Python versions compiled
without the <code>ssl</code> module. (<a href="https://redirect.github.com/psf/requests/issues/6724">#6724</a>)</li>
</ul>
</blockquote>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="021dc729f0"><code>021dc72</code></a> Polish up release tooling for last manual release</li>
<li><a href="821770e822"><code>821770e</code></a> Bump version and add release notes for v2.32.4</li>
<li><a href="59f8aa2adf"><code>59f8aa2</code></a> Add netrc file search information to authentication documentation (<a href="https://redirect.github.com/psf/requests/issues/6876">#6876</a>)</li>
<li><a href="5b4b64c346"><code>5b4b64c</code></a> Add more tests to prevent regression of CVE 2024 47081</li>
<li><a href="7bc45877a8"><code>7bc4587</code></a> Add new test to check netrc auth leak (<a href="https://redirect.github.com/psf/requests/issues/6962">#6962</a>)</li>
<li><a href="96ba401c12"><code>96ba401</code></a> Only use hostname to do netrc lookup instead of netloc</li>
<li><a href="7341690e84"><code>7341690</code></a> Merge pull request <a href="https://redirect.github.com/psf/requests/issues/6951">#6951</a> from tswast/patch-1</li>
<li><a href="6716d7c9f2"><code>6716d7c</code></a> remove links</li>
<li><a href="a7e1c745dc"><code>a7e1c74</code></a> Update docs/conf.py</li>
<li><a href="c799b8167a"><code>c799b81</code></a> docs: fix dead links to kenreitz.org</li>
<li>Additional commits viewable in <a href="https://github.com/psf/requests/compare/v2.32.2...v2.32.4">compare view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/pytorch/pytorch/network/alerts).
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158006
Approved by: https://github.com/Skylion007
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Summary: When compiling for standalone, make embed_kernel_binary and emit_multi_arch_kernel default to True, and add a default name for model_name_for_generated_files to make the generated cpp project easier to understand. Also improved the weights object file naming to be more readable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158560
Approved by: https://github.com/yushangdi
Collects some scattershot improvements made while attempting to enable training for AOTInductor. Non-typing changes are:
1. Swapping a few custom searches for the output node in an FX graph for calling `graph.output_node()` (see the sketch after this list).
2. Removing two unused parameters from `torch.export._unlift._unlift`.
3. Switching handles to constants in `cpp_wrapper_cpu` to use C++ references for memory efficiency.
4. Cleaning out unused, unexported imports from `torch/export/__init__.py`, and adding one missing export to `__all__`.
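A small sketch of item 1, assuming `torch.fx.Graph.output_node()` (the traced function is illustrative):
```python
import torch
from torch.fx import symbolic_trace

gm = symbolic_trace(lambda x: x + 1)

# Instead of scanning `for node in gm.graph.nodes: if node.op == "output": ...`,
# ask the graph for its output node directly.
out = gm.graph.output_node()
print(out.op, out.args)
```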
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158075
Approved by: https://github.com/Skylion007
Implements collective alltoall operation for NVSHMEM Triton kernels. Enables data exchange where each PE sends unique data to every other PE in the team.
Tests: `python test/distributed/test_nvshmem_triton.py -k test_triton_alltoall`
<details>
<summary>Quick debug print for sanity check</summary>
```markdown
============================================================
[Rank 0] Starting alltoall test with world_size=2
============================================================
[Rank 0] Configuration:
- nelems_per_pe: 2
- dtype: torch.int64, element_size: 8 bytes
- nelems_bytes: 16
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/modules/transport/ibrc/ibrc.cpp:1653: NULL value get_device_list failed
/dvs/p4/build/sw/rel/gpgpu/toolkit/r12.8/main_nvshmem/src/modules/transport/ibrc/ibrc.cpp:1653: NULL value get_device_list failed
[Rank 0] Preparing source data:
[Rank 1] Preparing source data:
- Data for PE 0: [0, 0] (indices 0-1)
- Data for PE 1: [1, 1] (indices 2-3)
[Rank 0] Complete source buffer: [0, 0, 1, 1]
- Data for PE 0: [100, 100] (indices 0-1)
- Data for PE 1: [101, 101] (indices 2-3)
[Rank 1] Complete source buffer: [100, 100, 101, 101]
[Rank 1] Initial destination buffer: [-1, -1, -1, -1]
[Rank 0] Initial destination buffer: [-1, -1, -1, -1]
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
warnings.warn( # warn only once
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
warnings.warn( # warn only once
[rank0]:[W716 15:30:06.215666766 ProcessGroupNCCL.cpp:5064] [PG ID 0 PG GUID 0 Rank 0] using GPU 0 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
[rank1]:[W716 15:30:06.215752786 ProcessGroupNCCL.cpp:5064] [PG ID 0 PG GUID 0 Rank 1] using GPU 1 as device used by this process is currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect. You can specify device_id in init_process_group() to force use of a particular device.
NCCL version 2.27.5+cuda12.4
[Rank 1] Executing alltoall operation...
[Rank 0] Executing alltoall operation...
[Rank 1] alltoall operation completed
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
warnings.warn( # warn only once
[Rank 0] alltoall operation completed
/data/users/suryasub/pytorch/torch/distributed/distributed_c10d.py:4809: UserWarning: No device id is provided via `init_process_group` or `barrier `. Using the current device set by the user.
warnings.warn( # warn only once
[Rank 0] Results after alltoall:
[Rank 1] Results after alltoall:[Rank 0] Destination buffer: [0, 0, 100, 100]
[Rank 0] Verifying results:
- From PE 0 (indices 0-1):
Expected: [0, 0]
Actual: [0, 0]
[Rank 1] Destination buffer: [1, 1, 101, 101]
[Rank 1] Verifying results:
- From PE 0 (indices 0-1):
Expected: [1, 1]
Actual: [1, 1]
Match: ✓
Match: ✓
- From PE 1 (indices 2-3):
Expected: [100, 100]
- From PE 1 (indices 2-3):
Expected: [101, 101]
Actual: [100, 100]
Actual: [101, 101]
Match: ✓
Match: ✓
[Rank 0] ============================================================
[Rank 0] Summary: ALL TESTS PASSED ✓
[Rank 0] Data flow explanation:
- Each rank sends 2 elements to every other rank
[Rank 1] ============================================================
[Rank 1] Summary: ALL TESTS PASSED ✓
- Rank 0 sent: [0, 0, 1, 1]
[Rank 1] Data flow explanation:
- Each rank sends 2 elements to every other rank
- Rank 0 received: [0, 0, 100, 100]
- My data for PE 0 (0) went to PE 0's buffer
- I received PE 0's data for me (0)
- My data for PE 1 (1) went to PE 1's buffer
- Rank 1 sent: [100, 100, 101, 101]
- I received PE 1's data for me (100)
[Rank 0] ============================================================
- Rank 1 received: [1, 1, 101, 101]
- My data for PE 0 (100) went to PE 0's buffer
- I received PE 0's data for me (1)
- My data for PE 1 (101) went to PE 1's buffer
- I received PE 1's data for me (101)
[Rank 1] ============================================================
```
</details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158513
Approved by: https://github.com/fduwjj, https://github.com/mandroid6
ghstack dependencies: #158511, #158512
This PR attempts to cache:
* codegen for the cutlass backend for the same kernel, even if runtime params are different.
From some profiling, most of the time is spent in render, so we only target caching that part for now.
The output of render is `code`, and we are able to cache that easily. Also, I have to cache size_args, since it depends on `kernel.get_dynamic_shape_args()`, which depends on the state of self when we call render.
make_key is doing most of the work here: We are hashing on input node layouts, output node layout and op.configuration_name() (this is what hash(op) would do anyway).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156781
Approved by: https://github.com/ColinPeppler
Summary: This test often fails internally -- it looks like it's because autotuning sometimes chooses not to do the epilogue tuning. Turning off `benchmark_epilogue_fusion` seems to fix it.
Test Plan:
`buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:max_autotune -- --exact 'caffe2/test/inductor:max_autotune - test_cat_max_autotune_triton (caffe2.test.inductor.test_max_autotune.TestMaxAutotune)' --run-disabled`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158589
Approved by: https://github.com/eellison
**Background**:
```Shell
torch 2.5.1+cpu
torchvision 0.20.1
```
```Python
import torch
import torchvision
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/__init__.py", line 10, in <module>
from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils # usort:skip
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/_meta_registrations.py", line 164, in <module>
def meta_nms(dets, scores, iou_threshold):
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 795, in register
use_lib._register_fake(op_name, func, _stacklevel=stacklevel + 1)
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 184, in _register_fake
handle = entry.fake_impl.register(func_to_register, source)
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/_library/fake_impl.py", line 31, in register
if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
RuntimeError: operator torchvision::nms does not exist
```
**Cause**:
```
torchvision's .so file lacks some symbol definitions because these symbols come from CUDA, but the current environment does not have CUDA or a GPU. The above error message is very confusing.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157524
Approved by: https://github.com/ezyang
Fixes https://github.com/pytorch/pytorch/issues/158382
```
renamed: torch/_functorch/_aot_autograd/dispatch_and_compile_graph.py -> torch/_functorch/_aot_autograd/graph_capture.py
renamed: torch/_functorch/_aot_autograd/traced_function_transforms.py -> torch/_functorch/_aot_autograd/graph_capture_wrappers.py
renamed: torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py -> torch/_functorch/_aot_autograd/graph_compile.py
```
Everything else is ONLY import changes. I did not rename any functions, even though we probably should have.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158449
Approved by: https://github.com/jamesjwu
Change the default value of min_chunk_size from 4096 to 512 to allow more for-loops to be parallelized.
I tested the Inductor benchmark with this PR on CPU, and saw ~10% improvement in torchbench geomean speedup, and no change in huggingface/timm_models. There are about 15 torchbench models with different degrees of performance improvement, among which functorch_dp_cifar10, opacus_cifar10, hf_Reformer, and pyhpc_turbulent_kinetic_energy have more than 50% performance improvement.
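A small sketch of where this knob lives, assuming the config path `torch._inductor.config.cpp.min_chunk_size` (inferred from the description above):
```python
import torch
import torch._inductor.config as inductor_config

# 512 is the new default; 4096 was the old one. Lowering it lets smaller loops
# be considered for CPU parallelization.
inductor_config.cpp.min_chunk_size = 512

@torch.compile
def f(x):
    return (x * 2 + 1).sum(dim=-1)

print(f(torch.randn(1024, 1024)).shape)
```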
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150762
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
Adds `sync_all()` function for local store visibility synchronization in NVSHMEM Triton kernels. Provides memory ordering for local operations without remote completion guarantees.
Tests: `python test/distributed/test_nvshmem_triton.py -k test_triton_sync`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158512
Approved by: https://github.com/fduwjj
ghstack dependencies: #158511
This one
```
Compiling /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Pooling.metal to Pooling_30.air
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/kernels/Pooling.metal:172:1: warning: non-void function does not return a value in all control paths [-Wreturn-type]
}
^
1 warning generated.
```
Although functionally one is not supposed to ever hit this code path, it's better not to emit the warning.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158729
Approved by: https://github.com/Skylion007
Adds device-side barrier synchronization and PE identification functions for NVSHMEM Triton integration. Includes `barrier_all()` for collective synchronization and `my_pe()`/`n_pes()` for PE identification within kernels.
We are launching with cooperative grid launch (for all the PRs in this stack) because the `nvshmemx_collective_launch` function must be used to launch kernels on the GPU when the kernels use NVSHMEM synchronization or collective APIs, and `nvshmemx_collective_launch` essentially boils down to a CUDA cooperative group launch.
Tests: `python test/distributed/test_nvshmem_triton.py -k test_triton_barrier`
Also tested that if you remove the barrier, you get an assertion error/race conditions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158511
Approved by: https://github.com/fduwjj
This refactor ensures all registered template hooks have been finalised before accessing the code object of the template. In `simd.SimdScheduling.codegen_template` the template hooks are finalised manually with `template.finalize_hook(hook_name)` calls, so it is the responsibility of the caller to finalise all the template hooks. This PR adds:
- `RenderPartial.finalize_remaining` a function that can be called at the end to finalise the remaining active hooks after a selection of hooks have been finalised manually.
- A test with a custom template implementation that registers custom hooks that the scheduler needs to finalise. This test should fail if the scheduler does not finalise the registered custom hook.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157270
Approved by: https://github.com/eellison
Summary: export_for_training exists because we couldn't migrate internal usages of export to the final IR. Now that we have completed the migration, we should deprecate and delete this API.
Test Plan:
CI
Rollback Plan:
Differential Revision: D78240836
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158203
Approved by: https://github.com/JacobSzwejbka
Related to https://github.com/pytorch/pytorch/issues/157517
Detect when users are executing a torch build with CUDA 12.8/12.9 and running on Maxwell or Pascal architectures.
We would like to include a reference to the issue https://github.com/pytorch/pytorch/issues/157517 as well as ask people to install CUDA 12.6 builds if they are running on sm50 or sm60 architectures.
Test:
```
>>> torch.cuda.get_arch_list()
['sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120', 'compute_120']
>>> torch.cuda.init()
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:263: UserWarning:
Found <GPU Name> which is of cuda capability 5.0.
PyTorch no longer supports this GPU because it is too old.
The minimum cuda capability supported by this library is 7.0.
warnings.warn(
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:268: UserWarning:
Support for Maxwell and Pascal architectures is removed for CUDA 12.8+ builds.
Please see https://github.com/pytorch/pytorch/issues/157517
Please install CUDA 12.6 builds if you require Maxwell or Pascal support.
```
Please note I reverted the original PR https://github.com/pytorch/pytorch/pull/158301 because it broke internal users. This is a reland, with an added check for a non-empty torch.cuda.get_arch_list()
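For illustration, a minimal sketch of the check described above, using only the documented torch.cuda APIs (this is not the exact upstream code; the function name is illustrative):
```python
import warnings
import torch

def warn_if_unsupported_old_gpu(device: int = 0) -> None:
    # Only applies to builds that ship an arch list at all (the reland adds this guard).
    if not torch.cuda.get_arch_list():
        return
    major, _minor = torch.cuda.get_device_capability(device)
    if major in (5, 6):  # Maxwell (sm_50) or Pascal (sm_60)
        warnings.warn(
            "Support for Maxwell and Pascal architectures is removed for CUDA 12.8+ builds. "
            "Please see https://github.com/pytorch/pytorch/issues/157517 and install CUDA 12.6 "
            "builds if you require Maxwell or Pascal support."
        )
```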
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158700
Approved by: https://github.com/huydhn, https://github.com/Skylion007, https://github.com/eqy
This PR introduces the rest of the keyword-arguments added in DLPack
version 2023.12: `dl_device` and `copy`.
In summary, we handle these arguments in the C++ implementation of
`to_dlpack(...)` at _torch/csrc/Module.cpp_, by calling the
`maybeCopyTensor` function at _aten/src/ATen/DLConvertor.cpp_. It also
introduces the following changes:
- Add a new C++ API `torchDeviceToDLDevice()`, which is simply a
refactoring of the `getDLDevice()` function at
_aten/src/ATen/DLConvertor.cpp_.
- Add both keyword-arguments to the `from_dlpack()` function at
_torch/utils/dlpack.py_ and to the `Tensor.__dlpack__()` dunder
method.
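As a usage sketch, assuming the keywords are accepted exactly as named above:
```python
import torch
from torch.utils.dlpack import from_dlpack

x = torch.arange(4)

# Ask the producer for a fresh copy when exporting through DLPack; `dl_device`
# can similarly request a target device per the DLPack 2023.12 spec.
capsule = x.__dlpack__(copy=True)
y = from_dlpack(capsule)
```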
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150218
Approved by: https://github.com/albanD
ghstack dependencies: #150216, #150217
https://github.com/pytorch/pytorch/pull/149946 modified some checks that verify whether async-TP is "applicable" to a given collective operation in a graph. Before, the pattern-matching + replacement would just be skipped, but now these are asserts that fail and raise.
This is causing concrete issues in some graphs where 2-dimensional device meshes are being used (e.g., TP + CP) but only one dimension has symm-mem enabled. See #158569.
This PR is turning these asserts back into harmless early-exits. Note that this only needed to be done for reduce-scatters, as it was already the case for all-gathers.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158572
Approved by: https://github.com/danielvegamyhre, https://github.com/atalman
By setting the name of the domain libraries to build via the `BUILD_ADDITIONAL_PACKAGES` environment variable, the build job will build them and make them available as artifacts in the same way as the PyTorch CI wheel. To ensure that this doesn't break CI, the test job will still build them as usual if the wheels are not there. Building dependencies like FBGEMM on the test job is bad, especially for GPU jobs, because it leaves the GPU resources idle.
Fixes https://github.com/pytorch/pytorch/issues/152024
Signed-off-by: Huy Do <huydhn@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158600
Approved by: https://github.com/yangw-dev
ghstack dependencies: #158598, #158599
The previous strategy directly used 'self' input strategy for 'src'
input. The fixed strategy correctly maps the self dim to src dim
so that it works even if the src input is broadcast.
E.g. for this program, broadcasting will occur on dims 0,1,3 of self.
```
self = torch.ones((2,3,4,5))
src = torch.ones((4,1))
self.copy_(src)
```
These are the correct sharding combinations:
| self | src |
|-------|------|
| Shard(0) | Replicate() |
| Shard(1) | Replicate() |
| Shard(2) | Shard(0) |
| Shard(3) | Shard(1) |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158538
Approved by: https://github.com/zpcore, https://github.com/XilunWu, https://github.com/wanchaol
ghstack dependencies: #158490
Fixes several bugs in the original.
- foremost, fixes a serious bug where we returned incorrect strategies
by mixing input_specs that were frozen from
select_strategy.strategies[0] with output_specs that varied across
select_strategy.strategies[0..N] (e.g. we could create a nonsense
strategy like input:Shard(0) output(Replicate) for an op like clone)
- fixes the redistribute costs: they should not actually be 0, they
should be the cost of redistributing our single input from another
strategy to the current strategy, in our list of output strategies
- adds a note, wondering if we should have just literally returned the
input strategy instead of creating this new object
- Currently, using default_strategy is incorrect because it maps 'self'
tensor's strategies directly onto 'src' tensor without accounting for
the fact that copy_ supports broadcasting a smaller rank tensor into a
larger one.
Separates out copy_ op from default strategy, adds missing test case,
but does not fix the underlying issue with copy_, leaves that for future
PR
Renames to `propagate_single_input_strategy` since that's more
descriptive
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158490
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
Summary:
Fix the error that occurs in the devarm environment when compiling with Clang:
```
caffe2/torch/csrc/distributed/c10d/symm_mem/NCCLSymmetricMemory.cu:97:20: error: 'get_buffer' overrides a member function but is not marked 'override' [-Werror,-Winconsistent-missing-override]
97 | virtual at::Tensor get_buffer(int
| ^
caffe2/torch/csrc/distributed/c10d/symm_mem/SymmetricMemory.hpp:56:20: note: overridden virtual function is here
56 | virtual at::Tensor get_buffer(int rank, c10::IntArrayRef sizes, c10::ScalarType dtype, int64_t storage_offset) = 0;
| ^
1 error generated.
```
Test Plan:
See D78520305
Rollback Plan:
Differential Revision: D78517953
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158597
Approved by: https://github.com/janeyx99
I feel uneasy about touching `__warningregistry__` since it is undocumented and private surface. The only public API hook that doesn't increment warnings version seems to be https://docs.python.org/3/library/warnings.html#warnings.showwarning.
So we can whack-a-mole all the warning muters in compile to simply not display warnings, without invalidating the warnings cache. This PR adds it for torch/_dynamo, and I didn't find any warnings versioning mutation from torch/_inductor.
There is a behavior change if someone calls a compiled graph with simplefilter("error"):
```python
# e.g. test/dynamo_expected_failures/TestAutogradFallback.test_no_autograd_kernel_inplace_mode_nothing
with warnings.catch_warnings():
warnings.simplefilter("error") # turns all warnings into errors
compiled_fn() # will throw if any of the muted warnings fire
```
FIXES https://github.com/pytorch/pytorch/issues/128427
A note for the future: the warnings module doesn't offer a thread-safe way of using it. Even regular filters have this problem; directly editing `__warningregistry__` would be very bad, and this PR would mute all threads. Someone will need to build a thread-safe warnings interface.
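For illustration, a minimal sketch of the showwarning-based muting mentioned above (names are illustrative, not the exact hook added to torch/_dynamo):
```python
import warnings
from contextlib import contextmanager

@contextmanager
def hide_warning_output():
    # Swapping showwarning does not bump the filters version, so existing
    # __warningregistry__ caches stay valid; warnings still pass through the
    # filters but are not displayed.
    previous = warnings.showwarning
    warnings.showwarning = lambda *args, **kwargs: None
    try:
        yield
    finally:
        warnings.showwarning = previous
```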
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158520
Approved by: https://github.com/anijain2305, https://github.com/zou3519
Summary: This allows us to start alerting on cache failures, based on scuba data
Test Plan:
Added new tests explicitly for the Remote Cache API.
Note that we have existing tests for memcache, but not for manifold AFAICT.
There are two potential wrinkles. One, we're adding a new field (and everything uses ScubaData AFAICT, so this should just work).
The other is the implicit API contract that if the sample is None, then it will be ignored (and not crash). I believe the second one is implemented correctly (and tested). The first one is a little more nebulous, but I don't think it will cause any breakages.
Also manually ran a compile and made sure it didn't break - P1851504490 as well as forcing it to break and checking we didn't screw up the exception handling - P1851504243
Rollback Plan:
Differential Revision: D77054339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156874
Approved by: https://github.com/oulgen, https://github.com/masnesral
Summary: Small change to make the `times` and `repeat` variables controllable as command line args.
Test Plan:
Execute:
```
buck2 run <run params> <path>:inductor_benchmark -- --times=1 --repeat=1
```
Only runs once, and without passing the args it runs with default values of 10.
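For reference, a minimal argparse sketch of the kind of flags this adds (the actual wiring lives in the internal benchmark target; the help text is illustrative):
```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--times", type=int, default=10)   # per-run iteration count
parser.add_argument("--repeat", type=int, default=10)  # number of repetitions
args = parser.parse_args()
print(f"times={args.times}, repeat={args.repeat}")
```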
Rollback Plan:
Reviewed By: malfet
Differential Revision: D78458680
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158590
Approved by: https://github.com/FindHao, https://github.com/malfet
Adds a pre-commit hook (technically a pre-push hook) to the PyTorch repo.
**This is currently an opt-in feature**, which one can opt into by running `python scripts/setup_hooks.py` locally.
### Features
- **Run Lintrunner Before Push**: Before every `git push`, automatically runs lintrunner on your changes.
- Really need to skip the checks? Run `git push --no-verify`
- **Consistent, Isolated, Lintrunner Environment**: During pre-push, Lintrunner runs in its own virtual environment that contains all lintrunner dependencies, giving a consistent, isolated setup. No more lintrunner failures because you created a new .venv. (Did you know you needed to run `lintrunner init` every time you make a new .venv?)
- **Dependencies Automatically Updated**: If .lintrunner.toml is updated, this will automatically re-run `lintrunner init` to ensure you install the latest dependencies specified
### Installation
- Run `python scripts/setup_hooks.py`. Now every `git push` will first run lintrunner.
### Additional details
- The lintrunner used by the pre-push hook runs in a special per-repo virtual environment managed by the commit-hook tool located under `$USER/.cache/pre-commit`
- Does not affect your regularly used lintrunner
- Manual invocations of lintrunner will continue to depend on your local environment instead of the special pre-push one. If there's enough interest, we could explore consolidating them.
- Does not run `lintrunner -a` for you.
- You still need to manually run that (can be changed later though!)
- Have staged/unstaged changes? No worries
- This runs `git stash` before running the pre-commit hooks and pops back your changes afterwards, so only the changes actually being pushed will be tested
### Downsides
- No streaming UI updates
- While you still get the same output from lintrunner that you're used to, the commit-hook framework doesn't show any output while lintrunner is actually running. Instead, it shows the entire output after linter has completed execution, which could be a few minutes (especially if it has to run `lintrunner init` first)
- `uv` installation is required to run the setup script. The setup script will ask users to install uv if it's not available.
- This is required to be able to install the pre-commit package in a safe way that's available no matter what .venv you are running in.
### Opting out
- Disable hook for a single push: Run `git push --no-verify`
- Disable hook permanently: If something goes wrong and you need to wipe your setup:
- Delete the `$USER/.cache/pre-commit` folder and the `.git/hooks/pre-push` file in your local repo.
- You can now rerun `python scripts/setup_hooks.py` to setup your git push hook again if you want.
### Potential Future Changes
Things that could be done to make this even better if folks like these ideas:
- Automatic setup
- Our `CONTRIBUTING.md` file tells devs to run `make setup-env`. That could be a good entry point to hook the installation into
- Fix the console output streaming
- Make every lintrunner invocation (including manual ones) use the same repo-specific venv that the commit-hook uses.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158389
Approved by: https://github.com/seemethere
Fixes #156587
This sets lower bounds for fsspec and networkx in both setup.py and requirements.txt.
- fsspec >= 0.8.5 (released December 15, 2020)
- networkx >= 2.5.1 (released April 3, 2021)
These are the first stable versions released after Python 3.9 came out on October 5, 2020. Since Python 3.8 is no longer maintained, setting these minimums helps ensure PyTorch won't be installed alongside unexpectedly old versions of these packages.
Tested with these versions locally to make sure they don't break anything. Adding CI for lower-bound testing could be a follow-up later if needed.
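Illustrative requirement specifiers matching the bounds above (the real change edits setup.py and requirements.txt; this excerpt is hypothetical):
```python
# Only the version bounds matter here.
install_requires = [
    "fsspec>=0.8.5",
    "networkx>=2.5.1",
]
```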
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158565
Approved by: https://github.com/janeyx99
Summary: These tests are failing internally because the number of underlying calls to the rng differs by virtue of various library initializations that get sucked in with an internal build.
Test Plan:
```
buck test '@fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_unspec.py::UnspecTests::test_random_object' --run-disabled
buck test '@fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_unspec.py::UnspecTests::test_random_values_with_graph_break' --run-disabled
buck test '@fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_unspec.py::UnspecTests::test_feed_random_values_into_graph_only' --run-disabled
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158485
Approved by: https://github.com/williamwen42
As part of better engineering week, we would like to improve our type support to improve the dev experience in dynamo
This PR adds strict typing support to an important set of utilities in dynamo, `repro/` and the base `debug_utils.py`
Running
```
mypy torch/_dynamo/repro/ torch/_dynamo/debug_utils.py --linecount-report /tmp/coverage_log
```
| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main | 905 | 3268 | 27.69% | 22 | 81 | 27.16% |
| This PR | 3368 | 3368 | 100.00% | 81 | 81 | 100.00% |
| Delta | +2463 | +100 | +72.31% | +59 | 0 | +72.84% |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158504
Approved by: https://github.com/mlazos
Summary: For cases like caching_precompile, we almost always want to drop ID_MATCH-type guards since they will block serialization. This diff adds this behavior when the global flag is toggled on, so that ID_MATCH guards are excluded from compilation and serialization.
Test Plan:
test_dynamo -- -k test_id_match_with_config
Rollback Plan:
Differential Revision: D78363609
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158368
Approved by: https://github.com/jamesjwu
The architecture version checks are unnecessarily fine-grained in PyTorch. Considering the fact that PyTorch's Flash Attention works on all `sm_80+` machines, it makes more sense to just check for the lower bound.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158558
Approved by: https://github.com/eqy
The previous strategy directly used 'self' input strategy for 'src'
input. The fixed strategy correctly maps the self dim to src dim
so that it works even if the src input is broadcast.
E.g. for this program, broadcasting will occur on dims 0,1,3 of self.
```
self = torch.ones((2,3,4,5))
src = torch.ones((4,1))
self.copy_(src)
```
These are the correct sharding combinations:
| self | src |
|-------|------|
| Shard(0) | Replicate() |
| Shard(1) | Replicate() |
| Shard(2) | Shard(0) |
| Shard(3) | Shard(1) |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158538
Approved by: https://github.com/zpcore, https://github.com/XilunWu, https://github.com/wanchaol
ghstack dependencies: #158495, #158490
We want to do it for two reasons:
1. It's tedious for users to manually turn on capture_scalar_outputs=True when compiling map and scan with Inductor, where we decompose them into while_loop and use the index tensor's .item() to select a slice of the output buffer and write into it. This PR turns the flag on by default (see the sketch below).
2. A graph break caused by capture_scalar_outputs=False would cause the HOP to fail, so turning it on by default makes the error message more meaningful.
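A minimal sketch of what users previously had to do by hand, and of the .item() pattern the flag enables:
```python
import torch

# Previously required before compiling map/scan through Inductor; this PR makes
# it the default for those higher-order ops.
torch._dynamo.config.capture_scalar_outputs = True

@torch.compile(fullgraph=True)
def f(x, idx):
    return x * idx.item()  # .item() is captured as an unbacked symbol, no graph break
```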
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158480
Approved by: https://github.com/zou3519
Addresses the second part of #158366, where torch.tensor(0) is treated as a constant tensor and its .item() gets specialized to 0, which causes a silent specialization. The fix is to unspecialize the constant carries and make them non-constant.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158381
Approved by: https://github.com/zou3519
This PR is a bit more involved but effectively works to drastically simplify PyObjectSlot and PyInterpreter.
1) For PyObjectSlot we now use a global pyinterpreter since there only is one. From here we change all of the call sites to rely on this assumption.
2) We also remove the "tags" of the PyInterpreter by deprecating `PyInterpreterStatus`.
For the reviewer, sadly it seems like `functorch/csrc/dim/dim.cpp` needed to get linted, so there is an unreadable amount of changes there. Fortunately, the only actual change in the file is as follows which just removes `getPyInterpreter()` from the `check_pyobj` call.
```
mpy::handle handle_from_tensor(Arena& A, TensorRef t) {
- // fast case: tensor is live in python
- std::optional<PyObject*> mb_obj =
- t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(getPyInterpreter(), /*ignore_hermetic_tls=*/false);
- if (mb_obj.has_value() && !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
- return *mb_obj;
- }
- return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
-}
-}
+ // fast case: tensor is live in python
+ std::optional<PyObject*> mb_obj =
+ t->unsafeGetTensorImpl()->pyobj_slot()->check_pyobj(
+ /*ignore_hermetic_tls=*/false);
+ if (mb_obj.has_value() &&
+ !t->unsafeGetTensorImpl()->pyobj_slot()->owns_pyobj()) {
+ return *mb_obj;
+ }
+ return A.autorelease(mpy::object::checked_steal(THPVariable_Wrap(*t)));
+}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158427
Approved by: https://github.com/albanD
Fixes several bugs in the original.
- foremost, fixes a serious bug where we returned incorrect strategies
by mixing input_specs that were frozen from
select_strategy.strategies[0] with output_specs that varied across
select_strategy.strategies[0..N] (e.g. we could create a nonsense
strategy like input:Shard(0) output(Replicate) for an op like clone)
- fixes the redistribute costs: they should not actually be 0, they
should be the cost of redistributing our single input from another
strategy to the current strategy, in our list of output strategies
- adds a note, wondering if we should have just literally returned the
input strategy instead of creating this new object
- Currently, using default_strategy is incorrect because it maps 'self'
tensor's strategies directly onto 'src' tensor without accounting for
the fact that copy_ supports broadcasting a smaller rank tensor into a
larger one.
Separates out copy_ op from default strategy, adds missing test case,
but does not fix the underlying issue with copy_, leaves that for future
PR
Renames to `propagate_single_input_strategy` since that's more
descriptive
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158490
Approved by: https://github.com/wanchaol, https://github.com/XilunWu
ghstack dependencies: #158495
Summary:
- Split `create_mapping` to `create_mapping_pre_post_grad_nodes` and ` create_node_mapping_kernel_to_post_grad`
- Store a mapping from pre_grad graph node names to stack traces in `_inductor_pre_grad_node_stack_trace`
- Add `stack_traces` member to ir.Node and add it to the string representation of ir.Node
- When we create an IR node, if `inductor.config.trace.provenance_tracing=True`, we populate `stack_traces` from `origins`. The nodes in `origins` are post_grad graph nodes. If a node has `node.stack_trace`, we store the stack_trace directly. This is particularly important for backward graph nodes because they don't have a mapping to pre-grad graph nodes. If a node doesn't have `.stack_trace ` (such as `linear`-> `addmm` nodes), we use the stack trace of the pre_grad graph nodes that it maps to.
- A post grad graph node might not have a stack trace if it corresponds to multiple pre grad graph nodes, e.g. [GroupLinearFusion](a00442421a/torch/_inductor/fx_passes/group_batch_fusion.py (L299))
Example:
```
scheduling ExternKernelOut(
python_kernel_name='extern_kernels.mm',
name=buf0,
layout=FixedLayout('cuda:0', torch.float32, size=[8, 16], stride=[16, 1]),
inputs=[InputBuffer(name='arg2_1', layout=FixedLayout('cuda:0', torch.float32, size=[8, 10], stride=[10, 1])), ReinterpretView(
StorageBox(
ConstantBuffer(name='fc1_weight', layout=FixedLayout('cuda:0', torch.float32, size=[16, 10], stride=[10, 1]))
),
FixedLayout('cuda:0', torch.float32, size=[10, 16], stride=[1, 10]),
origins=OrderedSet([mm_default_1]),
stack_traces = {,
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/scripts/shangdiy/aot.py", line 29, in forward,
x = self.fc1(x),
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/torch/nn/modules/linear.py", line 125, in forward,
return F.linear(input, self.weight, self.bias),
}
)],
constant_args=(),
kwargs={},
output_view=None,
python_kernel_name=extern_kernels.mm,
cpp_kernel_name=at::mm_out,
ordered_kwargs_for_cpp_kernel=(),
op_overload=None,
arg_properties=[{}, {}],
allarg_properties={},
kwarg_properties=None,
unbacked_bindings={},
mutation_outputs=[],
origin_node=mm_default_1,
origins=OrderedSet([mm_default_1]),
stack_traces = {,
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/scripts/shangdiy/aot.py", line 29, in forward,
x = self.fc1(x),
File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/7b4b7a52e15abb17/scripts/shangdiy/__aot__/aot#link-tree/torch/nn/modules/linear.py", line 125, in forward,
return F.linear(input, self.weight, self.bias),
}
)
```
Test Plan:
```
buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing
```
Rollback Plan:
Differential Revision: D78365534
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158576
Approved by: https://github.com/angelayi
Include both the error stacktrace and the graphmodule in a new
structured trace artifact. Log the shortened version to the console,
and also log a hint to look at the tlparse for more.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158469
Approved by: https://github.com/ezyang
If you reinstall numpy after having installed pandas, it will error out sometimes if the versions are different enough (see below snippet). This change forces pandas to be reinstalled when installing numpy. It doesn't work in a separate pip call, because then pip takes the version of numpy requested by pandas as the one to install, undoing the command in the first place.
```
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip list
Package Version
------------------ -----------
attrs 25.3.0
build 1.2.2.post1
certifi 2025.7.14
charset-normalizer 3.4.2
cmake 4.0.3
exceptiongroup 1.3.0
expecttest 0.3.0
filelock 3.18.0
fsspec 2025.5.1
hypothesis 6.135.32
idna 3.10
importlib_metadata 8.7.0
Jinja2 3.1.6
lintrunner 0.12.7
MarkupSafe 2.1.5
mpmath 1.3.0
networkx 3.2.1
ninja [1.11.1.4](https://www.internalfb.com/phabricator/paste/view/1.11.1.4)
opt-einsum 3.3.0
optree 0.16.0
packaging 25.0
pip 25.1
psutil 7.0.0
pyproject_hooks 1.2.0
python-dateutil 2.9.0.post0
pytz 2025.2
PyYAML 6.0.2
requests 2.32.4
setuptools 78.1.1
six 1.17.0
sortedcontainers 2.4.0
sympy 1.14.0
tomli 2.2.1
typing_extensions 4.14.0
tzdata 2025.2
urllib3 2.5.0
uv 0.7.21
wheel 0.45.1
zipp 3.23.0
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip install numpy==1.22.4
Collecting numpy==1.22.4
Using cached numpy-1.22.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (2.0 kB)
Using cached numpy-1.22.4-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.8 MB)
Installing collected packages: numpy
Successfully installed numpy-1.22.4
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip install pandas==2.0.3
Collecting pandas==2.0.3
Using cached pandas-2.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (18 kB)
Requirement already satisfied: python-dateutil>=2.8.2 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (2.9.0.post0)
Requirement already satisfied: pytz>=2020.1 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (2025.2)
Requirement already satisfied: tzdata>=2022.1 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (2025.2)
Requirement already satisfied: numpy>=1.20.3 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from pandas==2.0.3) (1.22.4)
Requirement already satisfied: six>=1.5 in /home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages (from python-dateutil>=2.8.2->pandas==2.0.3) (1.17.0)
Using cached pandas-2.0.3-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.4 MB)
Installing collected packages: pandas
Successfully installed pandas-2.0.3
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ pip install --pre numpy==2.0.2
Collecting numpy==2.0.2
Using cached numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (60 kB)
Using cached numpy-2.0.2-cp39-cp39-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (19.5 MB)
Installing collected packages: numpy
Attempting uninstall: numpy
Found existing installation: numpy 1.22.4
Uninstalling numpy-1.22.4:
Successfully uninstalled numpy-1.22.4
Successfully installed numpy-2.0.2
(numpy_pandas) [gabeferns@devvm2497.eag0 ~/pt-envs/at (exclamaforte/just-gemm-model)]$ python
Python 3.9.23 (main, Jun 5 2025, 13:40:20)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pandas
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/__init__.py", line 22, in <module>
from pandas.compat import is_numpy_dev as _is_numpy_dev # pyright: ignore # noqa:F401
File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/compat/__init__.py", line 25, in <module>
from pandas.compat.numpy import (
File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/compat/numpy/__init__.py", line 4, in <module>
from pandas.util.version import Version
File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/util/__init__.py", line 2, in <module>
from pandas.util._decorators import ( # noqa:F401
File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/util/_decorators.py", line 14, in <module>
from pandas._libs.properties import cache_readonly
File "/home/gabeferns/.conda/envs/numpy_pandas/lib/python3.9/site-packages/pandas/_libs/__init__.py", line 13, in <module>
from pandas._libs.interval import Interval
File "pandas/_libs/interval.pyx", line 1, in init pandas._libs.interval
ValueError: numpy.dtype size changed, may indicate binary incompatibility. Expected 96 from C header, got 88 from PyObject
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158584
Approved by: https://github.com/huydhn
- Skip `test_index_put_accumulate_large_tensor_mps` as it crashes with
```
/com.apple.xbs/Sources/MetalPerformanceShaders/MPSCore/Types/MPSNDArray.mm:829: failed assertion `[MPSNDArray initWithDevice:descriptor:isTextureBacked:] Error: NDArray dimension length > INT_MAX'
```
while running `torch.ones([2**31+5], dtype=torch.int8, device='mps')`
- Adjust types for `test_index_put_src_datatype` as index_put on MPS is not implemented for complex (yet)
- Adjust `test_index` to avoid using DoubleTensors for MPS
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158582
Approved by: https://github.com/dcci, https://github.com/Skylion007, https://github.com/manuelcandales
As part of better engineering week, we would like to improve our type support to improve the dev experience in dynamo
This PR adds strict typing support to an important file in dynamo, `decorators.py`
NOTE: The remaining untyped functions are due to a conflict with `__init__.py` in compiler, so we can't type them at this time
Running
```
mypy torch/_dynamo/decorators.py --linecount-report /tmp/coverage_log
```
| -------- | Lines Unannotated | Lines Total | % lines covered | Funcs Unannotated | Funcs Total | % funcs covered |
| -------- | ------- | -------- | ------- | ------- | ------- | ------- |
| Main | 209 | 908 | 23.02% | 9 | 39 | 23.08% |
| This PR | 870 | 943 | 100.00% | 36 | 39 | 100.00% |
| Delta | +661 | +35 | +76.98% | +27 | 0 | +76.92% |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158509
Approved by: https://github.com/williamwen42
Summary:
The shims for aten ops are now generated by torchgen, but there are still some old APIs in `aoti_torch/c/shim.h`
This diff moves the old to-be-deprecated APIs for aten ops to a separate header file `shim_deprecated.h`
The to-be-deprecated APIs are determined by comparing APIs in `shim.h` and ops in `fallback_ops.py`
Test Plan:
CI
Rollback Plan:
Differential Revision: D78378373
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158400
Approved by: https://github.com/jingsh, https://github.com/desertfire
Test modules that depend on the original definition of `wrapper_set_seed` will inadvertently be affected if they import from test_torchinductor_opinfo.py. Additionally, running `pytest test_torchinductor_opinfo.py test_other_module.py` in the same process may affect the behaviour of tests in `test_other_module.py` if they depend on `wrapper_set_seed`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158548
Approved by: https://github.com/janeyx99
This modifies the lint workflow to use the new get-changed-files
workflow to optimize lint execution by only running on files
that have actually changed in pull requests.
This more closely mirrors the type of behavior that users
expect when running lint locally on their PRs.
This also leaves the default behavior as a fallback for when
you're not running on a pull request.
Since lint runs on the pull_request event I'm not really worried about
any type of ciflow shenanigans in this.
This also splits mypy into its own job since mypy needs to run on all-files all the time.
Signed-off-by: Eli Uriegas <eliuriegas@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158518
Approved by: https://github.com/huydhn
ghstack dependencies: #158517
Summary: This test is failing internally because the number of underlying calls to the rng differs by virtue of various library initializations that get sucked in with an internal build.
Test Plan: `buck test '@fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_repros.py::ReproTests::test_longtensor_list' --run-disabled`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158458
Approved by: https://github.com/jansel
Summary: I tried to add some logic that could optimize the non-row-wise sharded case and do it more efficiently, but it has some bugs, so I'm removing it for now and will find a better algorithm for the non-row-wise sharded case to find the maximum number of bytes that we can write at a time.
Test Plan:
ensure tests pass
Rollback Plan:
Differential Revision: D78366701
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158380
Approved by: https://github.com/Saiteja64
Update triton commit hash to `11ec6354315768a85da41032535e3b7b99c5f706`, which is the new release/3.4.x branch in triton-lang/triton.
Also, update HAS_WARP_SPEC handling: In triton 3.4, warp spec will have a different interface: num_consumer_groups will be determined automatically by the compiler. This breaks the current Inductor integration, so for now, update HAS_WARP_SPEC to check whether triton.Config takes num_consumer_groups and num_buffers_warp_spec as parameters.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158459
Approved by: https://github.com/atalman
### Description
This PR is to enable TF32 as fp32 internal precision for matmul/linear/conv in `mkldnn backend`. Since we have refined fp32 precision API in https://github.com/pytorch/pytorch/pull/125888, we can easily extend the API to support TF32 for `mkldnn backend`.
```
torch.backends.mkldnn.matmul.fp32_precision = 'tf32'
torch.backends.mkldnn.conv.fp32_precision = "tf32"
```
Related kernel update and UTs update are done. And the wrapper `bf32_on_and_off` is updated to `reduced_f32_on_and_off`, and it can run tests 3 times, one is reduced_f32 OFF, the other two are reduced_f32 ON (including `bf32 ON` and `tf32 ON`).
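A minimal usage sketch of the new knob (whether TF32 is actually used on a given CPU depends on oneDNN support):
```python
import torch

# Opt fp32 matmuls into TF32 internal precision on the mkldnn backend;
# conv has the analogous knob shown above.
torch.backends.mkldnn.matmul.fp32_precision = "tf32"

a = torch.randn(256, 256)
b = torch.randn(256, 256)
c = a @ b  # may now use TF32 accumulation internally
```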
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157520
Approved by: https://github.com/mingfeima, https://github.com/jansel
cuBLAS added support for them in CUDA 12.9. It's rather easy to call into them, the hardest thing is allowing the lhs and rhs operands to have different scaling types, as that changes the whole callstack.
The scaling format is still detected from the sizes of the scale tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158037
Approved by: https://github.com/eqy, https://github.com/drisspg
Summary:
In PyTorch, tensor.to("cuda") behaves differently from tensor.to("cuda:0").
tensor.to("cuda") will read from the thread-local DeviceGuard, aka cuda::current_device(), to infer the device index.
TBEPermute relies on this behavior to route the output tensor to a device specified by the current thread.
For this reason, we remove normalizeDevice() and disallow index-less cuda devices in Placement.
Device-to-device mapping must be done between concrete devices!
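A small example of the difference (requires at least two GPUs):
```python
import torch

with torch.cuda.device(1):
    a = torch.ones(2).to("cuda")    # index inferred from the current device -> cuda:1
    b = torch.ones(2).to("cuda:0")  # concrete device -> cuda:0
```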
Test Plan:
CI
Rollback Plan:
Differential Revision: D78443109
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158489
Approved by: https://github.com/henryoier
Summary: AOTI already has weights embedded in the .so file, so for the initial load there is no need to load the weights again. This allows lowered modules to have different sets of weights on different hardware.
Test Plan:
```
MODEL_TYPE=ads_mtml_offsite_cvr_oba_optout_dedicated_model
MODEL_ENTITY_ID=895279202
SNAPSHOT_ID=0
MODULE=merge
buck2 run mode/dev-nosan -c fbcode.nvcc_arch=a100,h100 -c fbcode.enable_gpu_sections=true fbcode//caffe2/torch/fb/model_transform/fx2trt/packaging:load_net_predictor -- --loadMode=Benchmark --inputNetFile=/data/users/$USER/models/${MODEL_ENTITY_ID}/${SNAPSHOT_ID}/${MODEL_ENTITY_ID}_${SNAPSHOT_ID}.predictor.disagg.gpu.${MODULE} --moduleName ${MODULE} --predictor-hardware-type 1 --submodToDevice "" --benchmarkDontRebatchSamples=true --benchmarkNumIterations 1000
```
Rollback Plan:
Differential Revision: D78383881
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158416
Approved by: https://github.com/henryoier, https://github.com/SherlockNoMad
This is related to: https://www.anaconda.com/legal/terms/terms-of-service
Trying to fix outage with docker builds.
https://github.com/pytorch/pytorch/actions/runs/16298993712/job/46033590799
Rocm and XPU builds since they use Miniforge are not affected
```
#22 ERROR: process "/bin/sh -c bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt" did not complete successfully: exit code: 1
------
> [base 14/42] RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt:
11.93 CondaToSNonInteractiveError: Terms of Service have not been accepted for the following channels. Please accept or remove them before proceeding:
11.93 • https://repo.anaconda.com/pkgs/main
11.93 • https://repo.anaconda.com/pkgs/r
11.93
11.93 To accept a channel's Terms of Service, run the following and replace `CHANNEL` with the channel name/URL:
11.93 ‣ conda tos accept --override-channels --channel CHANNEL
```
Hence the solution is either:
1. using ``conda tos accept --override-channels --channel defaults``
2. using Miniforge instead of Miniconda.
We are using solution 2.
Solutions tried that don't work:
1. Using ``CONDA_ALWAYS_YES = true``
2. Using an older version of Miniconda
```
[Miniconda3-py310_25.5.1-0-Linux-x86_64.sh](https://repo.anaconda.com/miniconda/Miniconda3-py310_25.5.1-0-Linux-x86_64.sh)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158370
Approved by: https://github.com/seemethere
Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
This PR removes the integration point torch.fx had with torch::deploy (and another minor change).
Note: This PR surfaces some mypy errors, but I believe those were already present in the code base beforehand and should be fixed in a separate PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291
Approved by: https://github.com/albanD
ghstack dependencies: #158288, #158290
This PR is part of the work to deprecate torch::deploy in OSS. Effectively it does 3 things to get started.
1. Remove test_deploy_interaction as we no longer need to worry about this
2. Remove all torch._running_with_deploy checks and use the False path always (surfaced 1)
3. Remove `USE_DEPLOY` and switch to the default path always
Note: MyPy does fail on a bunch of things here as a bunch of older files are touched. It may be better to fix these things on a separate PR
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158288
Approved by: https://github.com/albanD
When select has a data-dependent index, we can't tell whether the actual index should be index + size or index.
To avoid throwing a data-dependent error (DDE), we allocate a new unbacked symbol to represent the storage offset of the
output view, and we compute its value dynamically at runtime during Inductor lowering.
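A minimal sketch of the kind of program this unblocks (assuming capture_scalar_outputs so .item() is traced):
```python
import torch

torch._dynamo.config.capture_scalar_outputs = True

@torch.compile(fullgraph=True)
def f(x, idx):
    i = idx.item()                # unbacked: could be negative or non-negative
    return torch.select(x, 0, i)  # storage offset is resolved at runtime
```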
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157605
Approved by: https://github.com/ColinPeppler
When I modify the code located in test/cpp_extensions/open_registration_extension/torch_openreg/torch_openreg,
some unrelated lint errors occur.
```Python
Lint for torch/_inductor/fx_passes/fuse_attention.py:
Error (CODESPELL) spelling error
Failed due to ValueError:
/pytorch/pytorch/torch/_inductor/fx_passes/fuse_attention.py:587: differnt
==> different
Please either fix the error or add the word(s) to the dictionary file.
HINT: all-lowercase words in the dictionary can cover all case variations.
Lint for torch/fx/traceback.py:
Error (MYPY) [assignment]
Incompatible types in assignment (expression has type "str", variable has
type "None")
101 |
102 | def _get_action_string(self):
103 | if self._action_string is None:
104 | self._action_string = "+".join([a.name.lower() for a in self.action])
105 | return self._action_string
106 |
107 | def print_readable(self, indent=0):
Error (MYPY) [assignment]
Incompatible types in assignment (expression has type "dict[str, Any]",
variable has type "None")
121 | if self._dict is None:
122 | # Convert the object to a dictionary
123 | action_string = self._get_action_string()
124 | self._dict = {
125 | "name": self.name,
126 | "target": self.target,
127 | "graph_id": self.graph_id,
Error (MYPY) [return-value]
Incompatible return value type (got "None", expected "dict[Any, Any]")
130 | "from_node": [node.to_dict() for node in self.from_node],
131 | }
132 |
133 | return self._dict
134 |
135 | def __eq__(self, other: object):
136 | if not isinstance(other, NodeSource):
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158450
Approved by: https://github.com/Skylion007
This pull request refactors the `parse_type` function in `c10/core/Device.cpp` to improve the handling of the `PrivateUse1` device type. The main change involves reordering the logic to check for the `PrivateUse1` device type earlier in the function for better clarity and efficiency.
This helps migrate existing backends to PrivateUse1 smoothly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157609
Approved by: https://github.com/jgong5, https://github.com/albanD
If there is only one node passed to aten::cat, the argument is a single node,
rather than a list of nodes with a valid length.
Example stack
```
File "/dev/shm/uid-99/be3468a8-seed-nspid4026546656_cgpid14993614-ns-4026546628/torch/_inductor/pattern_matcher.py", line 1115, in apply
self.handler(match, *match.args, **match.kwargs)
File "/dev/shm/uid-99/be3468a8-seed-nspid4026546656_cgpid14993614-ns-4026546628/torch/_inductor/fx_passes/split_cat.py", line 1786, in merge_split_cat_aten
if len(cat_inputs) < threshold_to_cat:
torch._inductor.exc.InductorError: TypeError: object of type 'Node' has no len()
```
This has failed about 7 internal jobs in the last week, running pytorch trunk code from 06/15
I've attached a test which reproduces this issue.
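An illustrative guard for the failing pattern (not the exact upstream fix; the helper name is hypothetical):
```python
from torch.fx import Node

def cat_inputs_as_list(cat_inputs):
    # aten::cat's first argument can be a single fx.Node rather than a list,
    # and a Node has no len(); normalize before any length checks.
    return [cat_inputs] if isinstance(cat_inputs, Node) else list(cat_inputs)
```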
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157155
Approved by: https://github.com/jansel
Summary:
As inductor provenance tracking is getting more use cases, we want to separate the inductor provenance tracking guarding flag from the general `trace.enabled`, so we can enable provenance tracking without all the overhead of `trace.enabled`
- change the guard flag from `trace.enabled` to `trace.provenance_tracking`. It is turned on by either `TORCH_COMPILE_DEBUG=1` or `INDUCTOR_PROVENANCE=1`.
- Move the provenance tracking logic and variables out of DebugContext, because DebugContext is only enabled with `trace.enabled`. Since the variables are now global variables, added `reset_provenance_globals()` context manager to reset them for each `compile_fx()` call.
- Move `set_kernel_post_grad_provenance_tracing` from `util.py` to `debug.py` so now all provenance related logic is in `debug.py`.
In the future, if we want to enable it further, we can change the provenance tracking flag to be enabled when `TORCH_TRACE` is set. I think we should do that in a separate PR, so it's easier to revert if this flag change creates any problem.
See more motivation in internal Diff
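A minimal sketch of toggling the new flag programmatically (flag name per the description above):
```python
import torch._inductor.config as inductor_config

# Equivalent to setting TORCH_COMPILE_DEBUG=1 or INDUCTOR_PROVENANCE=1 for this flag.
inductor_config.trace.provenance_tracking = True
```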
Test Plan:
```
buck2 run mode/dev-nosan fbcode//caffe2/test:fx -- -r test_graph_transform_observer
buck run mode/dev-nosan fbcode//caffe2/test:fx -- -r graph_provenance
buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:provenance_tracing
```
Differential Revision: D78287976
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158399
Approved by: https://github.com/angelayi
These changes add an `eval()` API to PP schedules
## Context
Currently, you can run "Forward only" for a schedule in two ways:
1. Use a custom schedule `_ScheduleForwardOnly`
2. Do not pass in `loss_fn` in schedule constructor, and no backward computations will be executed.
However, this is still limiting because we may want to run forward through the pipeline / calculate the loss, but without backward, e.g. during validation. These changes allow for this.
```python
if self.rank == 0:
schedule.eval(x)
elif self.rank == self.world_size - 1:
losses = []
schedule.eval(target=target, losses=losses)
else:
schedule.eval()
```
TODO:
- in later PRs, we will deprecate the `_ScheduleForwardOnly`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157795
Approved by: https://github.com/wconstab
Related to https://github.com/pytorch/pytorch/issues/157517
Detect when users are executing torch build with cuda 12.8/12.9 and running on Maxwell or Pascal architectures.
We would like to include reference to the issue: https://github.com/pytorch/pytorch/issues/157517 as well as ask people to install CUDA 12.6 builds if they are running on sm50 or sm60 architectures.
Test:
```
>>> torch.cuda.get_arch_list()
['sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_100', 'sm_120', 'compute_120']
>>> torch.cuda.init()
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:263: UserWarning:
Found <GPU Name> which is of cuda capability 5.0.
PyTorch no longer supports this GPU because it is too old.
The minimum cuda capability supported by this library is 7.0.
warnings.warn(
/home/atalman/.conda/envs/py312/lib/python3.12/site-packages/torch/cuda/__init__.py:268: UserWarning:
Support for Maxwell and Pascal architectures is removed for CUDA 12.8+ builds.
Please see https://github.com/pytorch/pytorch/issues/157517
Please install CUDA 12.6 builds if you require Maxwell or Pascal support.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158301
Approved by: https://github.com/nWEIdia, https://github.com/albanD
Preview: https://docs-preview.pytorch.org/pytorch/pytorch/157750/export.html
Changes:
* Rename draft_export.md -> export.draft_export.md for consistency.
* Removed non-strict section in export, instead pointed to programming model doc.
* Extended "Expressing Dynamism" section to include Dim hints, ShapeCollection, and AdditionalInputs.
* Removed Specialization section in favor of programming model doc
* Added pt2 archive doc
* Cleaned up sidebar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157750
Approved by: https://github.com/pianpwk
It causes errors under C++20
```
/Users/runner/work/pytorch/pytorch/pytorch/aten/src/ATen/native/mps/OperationUtils.mm:330:40:
error: call to consteval function 'fmt::fstring<>::fstring<std::string, 0>' is not a constant expression
```
Indeed, the printed value is treated as a format string and may contain special chars in some cases. While this is not true in our case, it can't be determined at compile time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158436
Approved by: https://github.com/Skylion007
Fixes #71673
This fixes a bug in PyTorch indexing, that shows up when mixing multi-dimensional boolean masks with other forms of indexing. Examples:
```python
>>> import torch
>>> x = torch.ones([2, 2, 3])
>>> m = torch.tensor(((True, False), (False, False))) # (2x2 boolean mask)
>>> x[m].shape # this works fine (the boolean mask acts on the 2x2 subspace selecting one row)
torch.Size([1, 3])
>>> x[m, 0] # this should produce a tensor of shape (1,)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: The shape of the mask [2, 2] at index 1 does not match the shape of the indexed tensor [2, 3] at index 1
>>> x[m, ::2] # this should produce a tensor of shape (1, 2)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: The shape of the mask [2, 2] at index 1 does not match the shape of the indexed tensor [2, 1, 3] at index 1
>>> x[m, None] # this should produce a tensor of shape (1, 1, 3)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: The shape of the mask [2, 2] at index 1 does not match the shape of the indexed tensor [2, 1, 2, 3] at index 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158369
Approved by: https://github.com/ngimel
Summary: Test is failing internally because of the import from functorch.einops. _Maybe_ there's a way to get this dependence in the TARGETS file, but the obvious things didn't work. I'm wondering if this test is that important to have running in OSS and internally anyway?
Test Plan:
`buck test '@fbcode//mode/opt' fbcode//caffe2/test/inductor:cuda_repro -- --exact 'caffe2/test/inductor:cuda_repro - test_repeated_masked_load (caffe2.test.inductor.test_cuda_repro.CudaReproTests)' --run-disabled`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158355
Approved by: https://github.com/eellison
Fixes the following issue when building PyTorch with ROCm 7.0:
```
-- verifying file...
file='/var/lib/jenkins/pytorch/build/aotriton_external-prefix/src/aotriton-0.10b-manylinux_2_28_x86_64-rocm7.0-shared.tar.gz'
-- SHA256 hash of
/var/lib/jenkins/pytorch/build/aotriton_external-prefix/src/aotriton-0.10b-manylinux_2_28_x86_64-rocm7.0-shared.tar.gz
does not match expected value
expected: '7e29c325d5bd33ba896ddb106f5d4fc7d715274dca7fe937f724fffa82017838'
actual: '1e9b3dddf0c7fc07131c6f0f5266129e83ce2331f459fa2be8c63f4ae91b0f5b'
-- Hash mismatch, removing...
CMake Error at aotriton_external-prefix/src/aotriton_external-stamp/download-aotriton_external.cmake:163 (message):
Each download failed!
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158420
Approved by: https://github.com/jeffdaily
`libnvshmem_extension.so` creates an illusion that it is a shared library from NVSHMEM. But indeed it is built from torch source code, for symmetric tensor infrastructure and operations, though leveraging NVSHMEM APIs. Thus this PR renames `libnvshmem_extension.so` to `libtorch_nvshmem.so`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158234
Approved by: https://github.com/albanD
cuBLAS added support for them in CUDA 12.9. It's rather easy to call into them, the hardest thing is allowing the lhs and rhs operands to have different scaling types, as that changes the whole callstack.
The scaling format is still detected from the sizes of the scale tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158037
Approved by: https://github.com/eqy, https://github.com/drisspg
As popularly requested in user groups.
Test plan:
```
import torch
a = torch.randn(10000)
device = torch.device('cuda:1')
a = a.to(device)
```
Before:
```
Traceback (most recent call last):
File "/data/users/raymo/pytorch/test/cuda.py", line 6, in <module>
a = a.to(device)
^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: invalid device ordinal
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
After:
```
Traceback (most recent call last):
File "/data/users/raymo/pytorch/test/cuda.py", line 6, in <module>
a = a.to(device)
^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: invalid device ordinal
GPU device may be out of range, do you have enough GPUs?
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158395
Approved by: https://github.com/aorenste
Co-authored-by: Aaron Orenstein <aorenste@fb.com>
This is related to: https://www.anaconda.com/legal/terms/terms-of-service
Trying to fix outage with docker builds.
https://github.com/pytorch/pytorch/actions/runs/16298993712/job/46033590799
Rocm and XPU builds since they use Miniforge are not affected
```
#22 ERROR: process "/bin/sh -c bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt" did not complete successfully: exit code: 1
------
> [base 14/42] RUN bash ./install_conda.sh && rm install_conda.sh install_magma_conda.sh common_utils.sh /opt/conda/requirements-ci.txt /opt/conda/requirements-docs.txt:
11.93 CondaToSNonInteractiveError: Terms of Service have not been accepted for the following channels. Please accept or remove them before proceeding:
11.93 • https://repo.anaconda.com/pkgs/main
11.93 • https://repo.anaconda.com/pkgs/r
11.93
11.93 To accept a channel's Terms of Service, run the following and replace `CHANNEL` with the channel name/URL:
11.93 ‣ conda tos accept --override-channels --channel CHANNEL
```
Hence the solution is either:
1. using ``conda tos accept --override-channels --channel defaults``
2. using Miniforge instead of Miniconda.
We are using solution 2.
Solutions tried that don't work:
1. Using ``CONDA_ALWAYS_YES = true``
2. Using an older version of Miniconda
```
[Miniconda3-py310_25.5.1-0-Linux-x86_64.sh](https://repo.anaconda.com/miniconda/Miniconda3-py310_25.5.1-0-Linux-x86_64.sh)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158370
Approved by: https://github.com/seemethere
Co-authored-by: Eli Uriegas <1700823+seemethere@users.noreply.github.com>
Fixes #141563
In NumPy, an ellipsis always acts as a separator between advanced indices, even when the ellipsis doesn't actually match any dimensions. In PyTorch an empty ellipsis doesn't cause a separation. This leads to differing behavior between Numpy and PyTorch in this edge case.
This difference in behavior leads to a bug when using torch.compile:
```python
>>> import numpy as np
>>> f = lambda x: x[:,(0,1),...,(0,1)].shape
>>> a = np.ones((3, 4, 5))
>>> f(a)
(2, 3)
>>> torch.compile(f)(a)
(3, 2)
```
Similarly to #157676, this PR doesn't change PyTorch's behavior, but it fixes the translation layer, ensuring torch._numpy compatibility with NumPy. I am marking this PR as fixing #141563, even though PyTorch behavior isn't modified.
Notice that there are still some other bugs in PyTorch's advanced indexing, that need to be fixed (mainly regarding proper accounting of dimensions when multidimensional boolean masks are present). But those need to be fixed at the ATen operator level. Examples:
- #71673
- #107699
- #158125
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158297
Approved by: https://github.com/soumith
**Background**:
```Shell
torch 2.5.1+cpu
torchvision 0.20.1
```
```Python
import torch
import torchvision
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/__init__.py", line 10, in <module>
from torchvision import _meta_registrations, datasets, io, models, ops, transforms, utils # usort:skip
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torchvision/_meta_registrations.py", line 164, in <module>
def meta_nms(dets, scores, iou_threshold):
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 795, in register
use_lib._register_fake(op_name, func, _stacklevel=stacklevel + 1)
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/library.py", line 184, in _register_fake
handle = entry.fake_impl.register(func_to_register, source)
File "/usr/local/anaconda3/envs/test/lib/python3.10/site-packages/torch/_library/fake_impl.py", line 31, in register
if torch._C._dispatch_has_kernel_for_dispatch_key(self.qualname, "Meta"):
RuntimeError: operator torchvision::nms does not exist
```
**Cause**:
torchvision's .so file lacks some symbol definitions because those symbols come from CUDA, but the current environment has no CUDA and no GPU. The resulting error message is very confusing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157524
Approved by: https://github.com/ezyang
This is an improvement over https://github.com/pytorch/pytorch/pull/132595 . That PR improves the case where `device` is not given. This PR tries to improve the case where `device` is given but the first step of auto-infer device from `cudaPointerGetAttributes` can be wrong (undesired). See https://github.com/pytorch/pytorch/issues/158316 for more details on when this can happen.
I think this is a reasonable improvement, as people expect `torch.as_tensor` + cupy should be zero-copy as much as possible. However, it does change some behaviors, because previously it might incur a device-to-device copy.
I will leave it to pytorch developers to see if the improvement is worthwhile.
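A small sketch of the zero-copy expectation (assumes CuPy and a CUDA build of PyTorch):
```python
import cupy as cp
import torch

x = cp.arange(10)                        # lives on the current CuPy device (cuda:0)
t = torch.as_tensor(x, device="cuda:0")  # with this change, should alias x's memory rather than copy
```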
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158320
Approved by: https://github.com/ezyang
This PR adds ROCm 7.0 alpha docker builds to start testing latest ROCm in PyTorch CI and enable new MI350x hardware.
Highlights:
* Stop building `pytorch-linux-jammy-rocm-n-1-py3` docker images, as they're not currently used in any CI workflows
* Add `pytorch-linux-noble-rocm-alpha-py3` docker images that will use ROCm alpha (newer than latest official release) builds
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158390
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily
Hi team,
Please help review this trivial fix.
Without this change:
``` python
>>> import torch
>>> print(torch._C._profiler._ExperimentalConfig.__init__.__doc__)
__init__(self: torch._C._profiler._ExperimentalConfig, profiler_metrics: list[str] = [], profiler_measure_per_kernel: bool = False, verbose: bool = False, performance_events: list[str] = [], enable_cuda_sync_events: bool = False, adjust_profiler_step: bool = False, disable_external_correlation: bool = False, profile_all_threads: bool = False, capture_overload_names: bool = False) -> None
capture_overload_names (bool) : whether to include ATen overload names in the profile
```
With this change:
```python
>>> import torch
>>> print(torch._C._profiler._ExperimentalConfig.__init__.__doc__)
__init__(self: torch._C._profiler._ExperimentalConfig, profiler_metrics: list[str] = [], profiler_measure_per_kernel: bool = False, verbose: bool = False, performance_events: list[str] = [], enable_cuda_sync_events: bool = False, adjust_profiler_step: bool = False, disable_external_correlation: bool = False, profile_all_threads: bool = False, capture_overload_names: bool = False) -> None
An experimental config for Kineto features. Please note thatbackward compatibility is not guaranteed.
profiler_metrics : a list of CUPTI profiler metrics used
to measure GPU performance events.
If this list contains values Kineto runs in CUPTI profiler mode
profiler_measure_per_kernel (bool) : whether to profile metrics per kernel
or for the entire measurement duration.
verbose (bool) : whether the trace file has `Call stack` field or not.
performance_events : a list of profiler events to be used for measurement.
enable_cuda_sync_events : for CUDA profiling mode, enable adding CUDA synchronization events
that expose CUDA device, stream and event synchronization activities. This feature is new
and currently disabled by default.
adjust_profiler_step (bool) : whether to adjust the profiler step to
match the parent python event duration. This feature is new and currently disabled by default.
disable_external_correlation (bool) : whether to disable external correlation
profile_all_threads (bool) : whether to profile all threads
capture_overload_names (bool) : whether to include ATen overload names in the profile
```
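As a usage sketch (assuming a build where these experimental knobs exist), the config documented above is typically passed to the profiler via `experimental_config`:
```python
import torch
from torch.profiler import profile, ProfilerActivity

cfg = torch._C._profiler._ExperimentalConfig(capture_overload_names=True)
with profile(activities=[ProfilerActivity.CPU], experimental_config=cfg) as prof:
    torch.ones(4, 4) @ torch.ones(4, 4)
print(prof.key_averages().table(row_limit=5))
```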
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156586
Approved by: https://github.com/sraikund16, https://github.com/cyyever
The starting point for this refactor is that I need access to the fully
general joint graph representation in an export-like interface, but I
then subsequently need a way to feed this joint graph into the rest of
the compilation pipeline so I can get an actual callable that I can run
once I've finished modifying it. Previously, people had added export
capabilities to AOTAutograd by having an export flag that toggled what
exactly the functions return and triggering aot_dispatch to go to a
different "export" implementation, but I've found this difficult to
understand, and it has led to a bit of duplicate code for the export path.
So the idea here is to reorganize the structure of the function calls in AOTAutograd. Here, it is helpful to first describe how things used to work:
* Start with aot_autograd.py top level functions like aot_function, _aot_export_function and aot_module_simplified. These call:
* create_aot_dispatcher_function. This does a bunch of stuff (forward metadata collection) and adds many context managers. This calls:
* One of aot_dispatch_base, aot_dispatch_export or aot_dispatch_autograd, which:
* Call aot_dispatch_autograd_graph or aot_dispatch_base_graph to actually do the graph capture
* Do some base/export/autograd specific post-processing on the graph
Notice that this pattern of nested function invocations means there is no easy way to get the graph capture result in the autograd case; furthermore, the export path is "bolted on", forcing the entire chain of functions to have a different return result than normal, with no way to *resume* the rest of the post-processing to actually get a callable.
Here is the new structure:
* Start with aot_autograd.py top level functions like aot_function, _aot_export_function and aot_module_simplified. These now orchestrate this top level flow:
* Start a context manager (stack); this stateful context block takes care of all of the nested context managers which originally necessitated the nested call structure
* Call create_aot_state to do initial setup and set up all the context managers on the stack. These context managers do NOT exit upon return of this function.
* Call aot_stage1_graph_capture to do the graph capture
* Call aot_stage2_compile or aot_stage2_export depending on what postprocessing you want
With this new structure, it's now possible (although not done in this PR) to return the graph after aot_stage1_graph_capture and do something with it, before running aot_stage2_compile to finish the job.
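To make the new control flow concrete, here is a highly simplified, runnable sketch of the staged structure; the stage bodies below are stand-ins that only mimic the shape of the real `create_aot_state` / `aot_stage1_graph_capture` / `aot_stage2_compile` pipeline:
```python
from contextlib import ExitStack, contextmanager

@contextmanager
def _tracing_context():
    # Stand-in for the many context managers the real pipeline installs.
    yield

def create_aot_state(stack, fn, args):
    # Context managers entered on `stack` stay active for all later stages.
    stack.enter_context(_tracing_context())
    return {"fn": fn, "args": args}

def aot_stage1_graph_capture(aot_state):
    return ("captured_graph", aot_state["fn"])   # placeholder for the real FX graph

def aot_stage2_compile(aot_state, graph):
    _, fn = graph
    return fn                                    # placeholder for the compiled callable

def aot_function_like(fn, *args):
    with ExitStack() as stack:
        aot_state = create_aot_state(stack, fn, args)
        graph = aot_stage1_graph_capture(aot_state)
        # A caller could now inspect or rewrite `graph` before resuming.
        return aot_stage2_compile(aot_state, graph)

print(aot_function_like(lambda x: x + 1)(41))    # 42
```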
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158213
Approved by: https://github.com/jamesjwu
ghstack dependencies: #158149, #158150, #158173, #158176
Two main things of note:
- Review this diff without whitespace changes
- To ensure that context managers correctly propagate to later pipeline
stages, I am using the ExitStack trick: there is an ExitStack which is
in scope for the entire pipeline, and inside of the individual
pipeline stages we push context managers onto this stack when we want
  them to survive into the next pipeline stage (a minimal sketch of this
  pattern follows below). This is not obviously the best final form of the
  code, but create_aot_dispatcher_function is called from multiple locations,
  so I can't just inline the context managers into the call site.
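A minimal sketch of the ExitStack pattern described above (the stage names here are placeholders, not the real AOTAutograd entry points):
```python
from contextlib import ExitStack, contextmanager

@contextmanager
def stage_scope(name):
    print(f"enter {name}")
    try:
        yield
    finally:
        print(f"exit {name}")

def stage1_graph_capture(stack):
    # Entered on the shared stack, so it is still active during stage 2.
    stack.enter_context(stage_scope("capture-ctx"))
    return "graph"

def stage2_compile(graph):
    return f"compiled({graph})"

with ExitStack() as stack:          # in scope for the whole pipeline
    graph = stage1_graph_capture(stack)
    result = stage2_compile(graph)  # capture-ctx has not exited yet
print(result)                       # contexts unwind only when the with-block ends
```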
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158173
Approved by: https://github.com/jamesjwu, https://github.com/wconstab
ghstack dependencies: #158149, #158150
Summary: NodeSource should not be updated after creation, so we can cache its dict and string representations for better perf.
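As an illustration of the caching idea (a hypothetical stand-in, not the real `NodeSource` implementation), an object that is immutable after construction can memoize its representations:
```python
from functools import cached_property

class FrozenNodeInfo:
    """Hypothetical stand-in: never mutated after __init__, so its
    representations can be computed once and reused."""

    def __init__(self, name: str, target: str) -> None:
        self._name = name
        self._target = target

    @cached_property
    def as_dict(self) -> dict:
        # Computed on first access, then cached on the instance.
        return {"name": self._name, "target": self._target}

    @cached_property
    def as_str(self) -> str:
        return f"{self._name} -> {self._target}"

info = FrozenNodeInfo("add_1", "aten.add.Tensor")
assert info.as_dict is info.as_dict   # same cached object on repeated access
print(info.as_str)
```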
Test Plan:
ci
Rollback Plan:
Reviewed By: yushangdi
Differential Revision: D78298501
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158372
Approved by: https://github.com/yushangdi
The base Docker images in directory `.ci/docker/` are built by the `docker-builds.yml` workflow. Those images are used throughout the PyTorch CI/CD pipeline. You should only create or modify a base Docker image if you need specific environment changes or dependencies before building PyTorch on CI.
1. **Automatic Rebuilding**:
- The Docker image building process is triggered automatically when changes are made to files in the `.ci/docker/*` directory
- This ensures all images stay up-to-date with the latest dependencies and configurations
2. **Image Reuse in PyTorch Build Workflows** (example: linux-build):
- The images generated by `docker-builds.yml` are reused in `_linux-build.yml` through the `calculate-docker-image` step
- The `_linux-build.yml` workflow:
- Pulls the Docker image determined by the `calculate-docker-image` step
- Runs a Docker container with that image
- Executes `.ci/pytorch/build.sh` inside the container to build PyTorch
3. **Usage in Test Workflows** (example: linux-test):
- The same Docker images are also used in `_linux-test.yml` for running tests
- The `_linux-test.yml` workflow follows a similar pattern:
- It uses the `calculate-docker-image` step to determine which Docker image to use
- It pulls the Docker image and runs a container with that image
- It installs the wheels from the artifacts generated by PyTorch build jobs
- It executes test scripts (like `.ci/pytorch/test.sh` or `.ci/pytorch/multigpu-test.sh`) inside the container
### Understanding File Purposes
#### `.ci/docker/build.sh` vs `.ci/pytorch/build.sh`
- **`.ci/docker/build.sh`**:
- Used for building base Docker images
- Executed by the `docker-builds.yml` workflow to pre-build Docker images for CI
- Contains configurations for different Docker build environments
- **`.ci/pytorch/build.sh`**:
- Used for building PyTorch inside a Docker container
- Called by workflows like `_linux-build.yml` after the Docker container is started
- Builds PyTorch wheels and other artifacts
#### `.ci/docker/ci_commit_pins/` vs `.github/ci_commit_pins`
- **`.ci/docker/ci_commit_pins/`**:
- Used for pinning dependency versions during base Docker image building
- Ensures consistent environments for building PyTorch
- Changes here trigger base Docker image rebuilds
- **`.github/ci_commit_pins`**:
- Used for pinning dependency versions during PyTorch building and tests
- Ensures consistent dependencies for PyTorch across different builds
- Used by build scripts running inside Docker containers
### Step-by-Step Guide for Adding a New Base Docker Image
#### 1. Add Pinned Commits (If Applicable)
We use pinned commits for build stability. The `nightly.yml` workflow checks and updates pinned commits for certain repository dependencies daily.
If your new Docker image needs a library installed from a specific pinned commit or built from source:
1. Add the repository you want to track in `nightly.yml` and `merge-rules.yml`
2. Add the initial pinned commit in `.ci/docker/ci_commit_pins/`. The text filename should match the one defined in step 1
#### 2. Configure the Base Docker Image
1. **Add new Base Docker image configuration** (if applicable):
Add the configuration in `.ci/docker/build.sh`.
## Getting Started
Three-pointers to get you started:
- [Tutorials: get you started with understanding and using PyTorch](https://pytorch.org/tutorials/)
- [Examples: easy to understand PyTorch code across all domains](https://github.com/pytorch/examples)
- [The API Reference](https://pytorch.org/docs/)