Summary: fix some type errors; also, instead of manually creating a filelock when dumping dcache's imc to a file, simply use an odc (since this is the intended behavior of odc anyway)
Test Plan:
```
buck test fbcode//mode/opt caffe2/test/inductor:caching
```
Differential Revision: D86345594
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167136
Approved by: https://github.com/aorenste
Prior to this PR, the IValue <-> StableIValue conversion for `DeviceObjType` (aka c10::Device) was to pack it into the leading bits of the StableIValue (which is a uint64_t)
After this PR, the IValue <-> StableIValue conversion for `DeviceObjType` expects DeviceType to be packed into the upper 32 bits of StableIValue and DeviceIndex to be packed into the lower 32 bits
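A minimal sketch of the new packing scheme, using plain ints to stand in for `DeviceType` and `DeviceIndex` (sign handling for a negative `DeviceIndex` is elided):
```python
def pack_device(device_type: int, device_index: int) -> int:
    # DeviceType goes in the upper 32 bits, DeviceIndex in the lower 32 bits
    return ((device_type & 0xFFFFFFFF) << 32) | (device_index & 0xFFFFFFFF)

def unpack_device(stable_ivalue: int) -> tuple[int, int]:
    return stable_ivalue >> 32, stable_ivalue & 0xFFFFFFFF

assert unpack_device(pack_device(1, 0)) == (1, 0)  # e.g. (cuda, index 0)
```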
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166579
Approved by: https://github.com/janeyx99
Make the PyObject preservation scheme thread-safe with free threaded (nogil) Python. The general idea is:
* Python Tensor and Storage objects always hold a strong reference to their underlying c10 object
* c10 objects hold a strong reference to their Python objects if there's at least one other reference to the c10 object
This is implemented in `intrusive_ptr`:
* The topmost bit (`kHasPyObject`) of the weakref count is now used to indicate whether the `intrusive_ptr_target` has an associated PyObject. So `kHasPyObject` is one bit, the weakref count is now 31 bits, and the strong refcount remains 32 bits.
* When the reference count increases from one to two and `kHasPyObject` is set, we incref the associated Python object to ensure that it's kept alive.
* When the reference count decreases from two to one (i.e., there are no C++ references to the `intrusive_ptr_target` other than from the Python object), we decref the associated Python object to break the cycle (see the sketch after this list).
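A sketch of the layout and transitions described above; the exact field order within the 64-bit combined count is an assumption here:
```python
# Assumed layout: strong refcount in the low 32 bits, weakref count in the
# next 31 bits, kHasPyObject as the topmost bit.
K_HAS_PYOBJECT = 1 << 63

def unpack_counts(combined: int) -> tuple[int, int, bool]:
    strong = combined & 0xFFFFFFFF
    weak = (combined >> 32) & 0x7FFFFFFF
    return strong, weak, bool(combined & K_HAS_PYOBJECT)

# Transitions when kHasPyObject is set:
#   strong 1 -> 2: Py_INCREF(pyobj)  (keep the PyObject alive)
#   strong 2 -> 1: Py_DECREF(pyobj)  (break the cycle)
```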
Other benefits:
* We can delete a lot of the copypasta from Python's internal `subtype_dealloc`
* This fixes the weakref and GC bugs we had in the previous scheme. Python weakrefs on Tensors and Storages should just work as expected now.
Risks:
* Extra branch for reference count operations on `intrusive_ptr<TensorImpl>`, `intrusive_ptr<StorageImpl>`, and the generic `intrusive_ptr<intrusive_ptr_target>` even when we're not using Python.
* It's a big change
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166342
Approved by: https://github.com/albanD
Preferred solution to #166380
Changes:
- Moved summary print to bottom of CMakeLists.txt
- Fixed the problem that 'add_compile_options' should be called before targets are defined by opting for `append_cxx_flag_if_supported` and `append_c_flag_if_supported` (new) instead.
- Added extra verbosity so it can be seen when the linker script is added.
(Unfortunately the linker script has to be added per-target rather than globally due to ninja/cmake dependency tracking.)
Also move summary print to bottom of CMakeLists.txt and improve logging
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166407
Approved by: https://github.com/Aidyn-A, https://github.com/atalman
Summary:
https://github.com/pytorch/pytorch/pull/166609 updates `node.is_impure` to consider a submodule as impure if submodule contains impure node. This in turn changes `graph.eliminate_dead_code()` function behavior, which does not eliminate nodes with side effects, see [pytorch documentation](https://docs.pytorch.org/docs/stable/fx.html#torch.fx.Graph.eliminate_dead_code)
> Remove all dead code from the graph, based on each node’s number of users, and whether the nodes have any side effects.
While it is correct that a submodule containing side-effectful ops is itself side-effectful and should not be dead-code-eliminated, some customers rely on dead code elimination to eliminate submodules that contain impure ops, which was the behavior before the #166609 fix.
Due to production environment constraints, we have to revert https://github.com/pytorch/pytorch/pull/166609 and move the side-effectful submodule check logic to `const_fold.py`, which will correctly **not** const-fold a submodule that contains impure ops.
NOTE: other call sites that use `node.is_impure()` to make decisions are still incorrectly eliminating side-effectful submodules, but we can't safely change that today.
## This PR
- move `_subgraph_has_impure_op` into `fx/experimental/const_fold.py`; check for and prevent const-folding of an impure submodule
- added a note in `node.is_impure` to highlight the incorrect behavior and context, in case people go looking in the future.
Test Plan: run test_fx_const_fold and all tests pass
Differential Revision: D86641994
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167443
Approved by: https://github.com/jfix71
We plan to use `StridedShard` to express `shard_order`. This PR adds the function to support the conversion between `StridedShard` and `shard_order`.
I moved some test-related functions into torch/testing/_internal/common_utils.py. For the review, we may only care about **_dtensor_spec.py** and **test_utils.py** in this PR.
### How to convert shard order to StridedShard:
Consider the example:
- placements = $[x_0, x_1, x_2, x_3, x_4]$, where all $x_i$ are sharded on the same tensor dim.
Let's see how the shard order impacts the split_factor (sf). We loop from right to left over the placements to construct the split_factor under different assumed shard orders. Starting from $x_4$: this is a normal shard.
Then $x_3$. There are two possibilities: $x_3$'s order can be before $x_4$'s, in which case $x_3$'s sf=1, because $x_3$ is already before $x_4$ in the placements. Otherwise $x_3$'s order is after $x_4$'s, and $x_3$'s sf should be the mesh dim size of $x_4$, which is $T(x_4)$:
<img width="820" height="431" alt="image" src="https://github.com/user-attachments/assets/f53b4b24-2523-42cc-ad6f-41f3c280db70" />
We can use this method to decide the split factor for $x_2$, $x_1$, and so on, as sketched below.
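A sketch of this rule under the setup above, where `mesh_sizes[i]` is $T(x_i)$ and `order[i]` is $x_i$'s position in the shard order (both names hypothetical):
```python
from math import prod

def split_factors(mesh_sizes: list[int], order: list[int]) -> list[int]:
    # x_i's split factor is the product of T(x_j) over every x_j that comes
    # after x_i in the placements but before x_i in the shard order
    n = len(mesh_sizes)
    return [
        prod(mesh_sizes[j] for j in range(i + 1, n) if order[j] < order[i])
        for i in range(n)
    ]

# Sharding in placement order gives plain shards (sf == 1 everywhere):
assert split_factors([2, 4], order=[0, 1]) == [1, 1]
# Reversed order: x_0 is applied after x_1, so its sf is T(x_1) = 4:
assert split_factors([2, 4], order=[1, 0]) == [4, 1]
```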
### How to convert StridedShard to shard order:
This follows the same method as above. We check all possible paths and use the real split_factor to see which path matches. If nothing matches, the StridedShard cannot be converted to a shard order.
---
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166740
Approved by: https://github.com/ezyang
Pass `dim_map` to `_requires_data_exchange` and return False if both spatial and channels dimensions are replicated
Modify `test_conv1d` and `test_conv3d` to check values rather than just shape, and replicate `conv3d` across batch dimension
In general, it feels like the current Convolution implementation was written to work only if the tensor is sharded across the last dimension
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167402
Approved by: https://github.com/ezyang
- The logger name in test_fully_shard_logging.py was wrong, so the logs were never emitted.
- The `device` variable in test_fully_shard_logging is expected to be a string, so quote it
- `unittest.skipIf` is used so importing `unittest` instead of `unittest.mock` is required
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167312
Approved by: https://github.com/Skylion007, https://github.com/cyyever
## Summary
- Fixes #163548 by replacing the quadratic chunk-overlap scan in `_validate_global_plan` with a sweep-line pass that sorts chunk intervals and keeps an active set via `bisect_right`, giving O(n log n) behavior for metadata validation (a 1-D sketch follows this list).
- Add focused tests in `TestValidateGlobalPlan` covering overlapping and non-overlapping shard layouts to lock in the faster path.
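A minimal 1-D sketch of the sweep-line idea; the actual pass keeps an active set via `bisect_right` and works on per-tensor chunk metadata:
```python
def has_overlap(chunks: list[tuple[int, int]]) -> bool:
    # chunks: (offset, length) intervals along one flattened dimension
    intervals = sorted((off, off + length) for off, length in chunks)
    prev_end = float("-inf")
    for start, end in intervals:
        if start < prev_end:  # this chunk begins before an earlier one ended
            return True
        prev_end = max(prev_end, end)
    return False

assert not has_overlap([(0, 4), (4, 4), (8, 2)])
assert has_overlap([(0, 4), (3, 4)])
```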
## Testing
- python test/distributed/checkpoint/test_planner.py -k ValidateGlobalPlan
## Benchmarks
| chunks | old runtime | new runtime |
|--------|-------------|-------------|
| 1024   | 0.121 s     | 0.0014 s    |
| 2048   | 0.486 s     | 0.0027 s    |
| 4096   | 2.474 s     | 0.0058 s    |
| 8192   | 8.014 s     | 0.0126 s    |
| 16384  | 32.740 s    | 0.026 s     |
@ezyang
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166820
Approved by: https://github.com/LucasLLC, https://github.com/Skylion007
Summary:
It seems D80948073 has caused some issues in a lowering pkg built on trunk: https://fburl.com/mlhub/o6p60pno
Error log: P2001933683
We were able to lower the same model successfully in an older ien pkg: https://fburl.com/mlhub/1ro094zo
D85172267 fixed this issue for the if branch, but the issue still exists for the else branch. The logic is moved to right before the if-else to cover both cases.
Test Plan:
checkout D85605372
buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.nvcc_arch=a100,h100 -c fbcode.split-dwarf=true -c fbcode.dwp=true -c fbcode.enable_distributed_thinlto=true -c fbcode.use_link_groups=true fbcode//inference_enablement/model_processing/infra/components/lowering/re:re_cinder -- -r "$(cat ./fbcode/minimal_viable_ai/umia_v1/ig/ss_omni_exp/re_lower_aoti.json)"
with the diff, no issue was encountered.
Reviewed By: tissue3
Differential Revision: D86474796
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167279
Approved by: https://github.com/pianpwk
This adds zero-bubble / DualPipeV support for (S)AC
Before:
- AC will always retrigger recompute upon every distinct backward.
After:
- Any checkpointed regions encountered by backward under the same instance of this context manager will only trigger recompute at most once, even if there are multiple calls to backward.
- Backward calls under the same instance of this context manager must execute over non-overlapping regions of the backward graph even if retain_graph=True.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166536
Approved by: https://github.com/albanD
**Summary:** While the implementation is correct, these checks are just a subset of the Partial placement checks that are done in https://github.com/pytorch/pytorch/pull/165962. This means for ops aten.linalg_vector_norm.default and aten._foreach_norm.Scalar, we're unnecessarily checking for Partial placements twice.
**Test Cases**
1. pytest test/distributed/tensor/test_math_ops.py -k test_vector_norm_partial
2. pytest test/distributed/tensor/test_math_ops.py -k test_foreach_norm_partial
3. pytest test/distributed/tensor/test_math_ops.py -k test_partial_reduction_ops
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167247
Approved by: https://github.com/XilunWu
# Problem
The FX backend currently allocates tensors on an exact device index, such as `"cuda:0"`. In contrast, the Python backend allocates on a device type, such as `"cuda"`. This avoids edge cases where fake tensor propagation can fail due to mismatched devices.
# Fix
Allocate tensors on `device.type` instead of the device.
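For instance (illustrative only; assumes a CUDA build):
```python
import torch

dev = torch.device("cuda:0")
assert dev.type == "cuda"
# Allocating on the device type rather than the indexed device keeps fake
# tensor propagation from seeing a "cuda:0" vs "cuda" mismatch.
buf = torch.empty(4, device=dev.type)
```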
# Test plan
Added a CI test passing in sample inputs on an indexed device, and checking that the output device in the generated FX graph is not indexed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167358
Approved by: https://github.com/mlazos, https://github.com/nandesuka, https://github.com/eellison
Not sure why, but running some tests (for example `test_weights_only_safe_globals_build`) with `pytest` on 3.14 makes the global name `test_serialization.ClassThatUsesBuildInstruction` instead of the expected `__main__.ClassThatUsesBuildInstruction`
Also, change expected exception type from `AttributeError` to `PicklingError`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167333
Approved by: https://github.com/atalman
Some important notes:
a) Just like IValues steal the ownership of ArrayRefs and any std::vectors in order to convert the inner elements into IValues, we do the same thing with StableIValue. This O(N) traversal is unavoidable.
b) As a result, since StableIValues are owning and our contract is that to<T>(StableIValue) transfers ownership, you cannot ever convert from StableIValue to a nonowning HeaderOnlyArrayRef<V>.
We handle memory similar to AtenTensorHandle, but we have a StableListHandle!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165953
Approved by: https://github.com/malfet
ghstack dependencies: #164991, #165152, #165153
Summary:
If quiesce ends up called twice (which is likely not possible with the timer based implementation, but possible with either manual calls, or with the context manager implementation), this assertion fires.
Instead, make this assertion tolerant of re-entrant calls to quiesce.
Test Plan: Added an explicit test which calls quiesce twice.
Differential Revision: D86251534
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167025
Approved by: https://github.com/masnesral
Provides type coverage to torch/_dynamo/variables/dicts.py
Coverage report:
`mypy torch/_dynamo/variables/functions.py --linecount-report /tmp/coverage_log`
Compare before to after - we go from 0 lines and 0 funcs covered to 2698 lines and 166 funcs covered
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167103
Approved by: https://github.com/mlazos, https://github.com/fxdawnn
This fixes 2 issues with the DTensor data-dependent test case:
1) ShapeEnv not found when doing shard prop on data-dependent ops - fix was to detect the outer tracing fake mode. Maybe ShardingPropagator should just own a FakeMode & ShapeEnv for these purposes? The previous behavior was to initialize a new fake mode on every call.
2) Pending unbacked symbols not found. This happens because DTensor dispatch runs fake prop twice: once while figuring out the output sharding (2bba37309b/torch/distributed/tensor/_sharding_prop.py (L175)), and again to actually get the resulting local tensor (2bba37309b/torch/distributed/tensor/_dispatch.py (L254-L255)). With data-dependent ops, both calls will produce an unbacked symbol, but symbols from the first invocation are never surfaced, producing this error, so we ignore pending symbols from that site.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166989
Approved by: https://github.com/ezyang
TORCH_CHECK is non-fatal by design, but the TORCH_CHECK_{COND} macros are fatal. This is confusing, and we should limit fatality to the set of debug macros.
Differential Revision: D86168955
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167004
Approved by: https://github.com/malfet
Fixes #163971
Problem:
PyTorch's inductor compiler crashed with IndexError: list index out of range when compiling code that uses 0-dimensional tensors with operations like torch.softmax(scalar_tensor, dim=0).
A 0-dim tensor has shape = torch.Size([]) (empty shape)
```
# For a 0-dim tensor:
#   ndim = 0        (zero dimensions)
#   len(shape) = 0  (no indices to access)

# Line 972: pad other_shape to match inp's number of dimensions
other_shape = [1] * (inp_ndim - len(other_shape)) + list(other_shape)
# For scalar tensors:
#   inp_ndim = 0   (input is scalar)
#   other_shape = []
#   Result: [1] * (0 - 0) + [] = []  (still empty!)

dim = match.kwargs["dim"]  # dim = 0
if isinstance(dim, int):
    dim = (dim,)

# The crash happens here:
return all(statically_known_true(other_shape[d] == 1) for d in dim)
#                                ^^^^^^^^^^^^^^^^
# Tries other_shape[0], but other_shape = [] (empty!)
# -> IndexError: list index out of range
```
The function _other_is_broadcasted_in_dim() is an optimization check for a softmax fusion pattern. It verifies whether it's safe to rewrite:
```
# From
scaled = inp * other
result = scaled - scaled.amax(dim, keepdim=True)
# To this more stable form:
result = (inp - inp.amax(dim, keepdim=True)) * other
```
The optimization is only valid if other is constant across the reduction dimension (i.e., broadcasted to size 1 in that dimension). Otherwise, scaling changes which element is the maximum.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166547
Approved by: https://github.com/jansel
- Since the MultiProcContinuous class spawns one process per GPU and runs a UT in each process, we need to ensure we propagate the exit code associated with a skip all the way to the main worker thread that spawned all the child processes.
- This commit also updates several UTs that are meant for 4 GPUs but incorrectly calls skip_if_lt_x_gpu with 2 as an input. Examples:
- test_replicate_with_fsdp.py
- test_dtensor_resharding.py
- test_state_dict.py
- test_functional_api.py: Fix typo: multi-accelerator doesn't exist; replaced with multi-gpu
- test_op_strategy.py: world_size was hardcoded
- test_math_ops.py: UT written for 4 GPUs, so skipping for anything less
- test_schedule_multiproc.py: All UTs in this suite are required to run on 2+ GPUs; therefore, adding skips if fewer than 4 GPUs are supplied
Fixes https://github.com/pytorch/pytorch/issues/166875
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167281
Approved by: https://github.com/jeffdaily
A recursive function call creates a reference cycle: closure <- function <- cell inside closure.
Capturing self (the PyCodegen instance) in the same closure prolongs its life until the next gc.collect(), which might result in worse resource management.
After the introduction of e9209e0, OOM issues have been observed. While looking for reference cycles, one was uncovered that prolongs the lifetime of tensors, which can result in OOMs. The following dependency chain has been uncovered:
<img width="1059" height="540" alt="image" src="https://github.com/user-attachments/assets/359a8534-e7cd-491f-be40-547c2af5cbbc" />
At the end of it is a reference cycle consisting of the closure for the function collect_temp_source, the function itself, and a cell object inside the closure that points to the function due to the recursive call.
This issue can be resolved either by removing the recursion or by removing the PyCodegen instance from the closure.
Another precaution is to explicitly empty the f_locals dict. This way we cut the tensors from the chain leading to the reference cycle.
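A self-contained demonstration of the cycle pattern (not the Dynamo code itself):
```python
def outer():
    def collect(n):
        # `collect` refers to itself, so its closure holds a cell that points
        # back at the function: closure <- function <- cell inside closure
        return n if n <= 0 else collect(n - 1)
    return collect

fn = outer()
assert fn.__closure__[0].cell_contents is fn  # the reference cycle
del fn  # freed only when gc.collect() runs, not by plain refcounting
```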
Fixes #166721
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166714
Approved by: https://github.com/Lucaskabela, https://github.com/Skylion007, https://github.com/jeromean, https://github.com/williamwen42, https://github.com/mlazos
Fixes T243967987
Move `enrich_profiler_metadata` from `torch._dynamo.config` to `torch.fx.experimental._config`.
We cannot import anything inside recompile(); doing so caused some perf regressions internally. We move the config so we can import it at the top of `graph_module.py` without causing any circular import.
We also cannot delete the old config right now because some internal tests rely on copies of the old `graph_module.py` cpp file in unit tests. But I think we should be able to delete the old config soon after this PR lands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167114
Approved by: https://github.com/angelayi
Summary: Before calling the tritonparse hook with `config_args`, ensure that we set `config_args` within `inductor_meta`. This way, even if it is not set, the hook still gets run and we can at least get the launch arguments.
Test Plan: Tritonparse tests
Differential Revision: D86463732
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167261
Approved by: https://github.com/FindHao
This PR is meant to swap the PR-based ciflow tags from the mi200 nodes (less stable) to the mi300 nodes (more stable). This will ensure that developers see consistent testing on their PRs as well as on main. This PR does all of the following:
- Rename rocm.yml to rocm-mi200.yml : for clarity
- Add ciflow/rocm-mi200 trigger to rocm-mi200.yml : for devs who want to opt-in to single-GPU unit tests on MI200
- Move ciflow/rocm trigger from rocm-mi200.yml to rocm-mi300.yml : so PRs target MI300 runners by default
- Rename inductor-rocm.yml to inductor-rocm-mi200.yml : for clarity
- Remove ciflow/inductor-rocm trigger from inductor-rocm-mi200.yml : prevent MI200 inductor config unit tests being triggered by default
- Add ciflow/inductor-rocm-mi200 trigger to inductor-rocm-mi200.yml : for devs who want to opt-in to inductor config unit tests on MI200
- Move ciflow/periodic trigger from periodic-rocm-mi200.yml to periodic-rocm-mi300.yml : so PRs target MI300 runners by default
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167225
Approved by: https://github.com/jeffdaily, https://github.com/huydhn
Co-authored-by: Jithun Nair <jithun.nair@amd.com>
A few improvements for autotuning
- While testing mix order reduction on internal workloads, Paul found that tuning num-stages could be very helpful for the triton kernel. The idea is illustrated in his diff: https://www.internalfb.com/diff/D86341591
- When rnumel is small, a larger XBLOCK could be helpful for perf
This PR adds the ability to autotune num-stages and XBLOCK. This brings a further 19% speedup for RMSNorm BWD on B200.
Testing result:
- eager: 11 data points
- compiled: 11 data points, 17.07x speedup (was 14.39x before this PR; the PR brings a further 19% speedup)
- quack: 11 data points, 12.72x speedup
- liger: 11 data points, 11.75x speedup
- compiled-no-fusion: 11 data points, 9.93x speedup
<img width="3564" height="2368" alt="RMSNormBackward_bench" src="https://github.com/user-attachments/assets/3e415242-a988-42bf-8a47-4ed5f11148a3" />
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167161
Approved by: https://github.com/jansel
ghstack dependencies: #166669, #166938
The PR includes a misc list of fixes for the regressions I see from the dashboard:
1. The dashboard may use very small shapes for rmsnorm backward. The data set can be fully cached in L2, so mix order reduction does not show much benefit and may even have worse perf. Disable mix order reduction for small workloads.
2. Disable the autotuning of split size by default to avoid the compilation time hit.
3. Avoid mix order reduction if there is non-contiguous memory access. Previously the check was only done for shared buffers accessed by both reductions. It turns out to be necessary to extend the check to buffers accessed by only one reduction. See test test_avoid_non_coalesced_access, which is simplified from a TIMM model. Note that a larger XBLOCK could fix the perf problem and make mix order reduction still applicable, but I don't think that's high priority: with a larger XBLOCK the kernel would consume much more shared memory/registers, which could also cause perf issues.
Dashboard result [here](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Wed%2C%2029%20Oct%202025%2003%3A40%3A22%20GMT&stopTime=Wed%2C%2005%20Nov%202025%2004%3A40%3A22%20GMT&granularity=hour&mode=training&dtype=amp&deviceName=cuda%20(h100)&lBranch=gh/shunting314/257/head&lCommit=b6f4a24ea5f7574d6b1d3b854022aa09d70593db&rBranch=main&rCommit=22a745737a09b0600bb0b85b4c0bbb9fb627f137).
<img width="1484" height="531" alt="Screenshot 2025-11-04 at 10 58 48 PM" src="https://github.com/user-attachments/assets/60cda211-3cc4-4fe1-9eaf-d2fb2c7d15a1" />
- the perf drop for TIMM (default) is not real; it's due to one more model passing the accuracy test
- the perf drop for HF (cudagraphs) is not real. I checked each individual model that showed as regressed on the dashboard, and they fall into the following categories:
  - showed as regressed, but absolute execution time is reduced, e.g. OPTForCausalLM
  - showed as regressed, but has a slight speedup on an h100 dev server: MobileBertForMaskedLM (speedup from 57.847709 ms to 56.711640 ms)
  - showed as regressed, but the PR does not change the kernels generated (mix order reduction skipped due to small workload or other reasons), e.g. XGLMForCausalLM, AlbertForMaskedLM
Note that the neutral result on the dashboard is expected due to small workload size. For large workload, we see about 1.5x geomean for rmsnorm/layernorm backward on average and 2.2x for some shapes used by internal model. For 8GPU torchtitan training on llama3, we see 4% TPS (tokens per second) improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166938
Approved by: https://github.com/jansel
ghstack dependencies: #166669
Fixes #165873
# Title
Fix load_state_dict: raise error for 1D (size > 1) -> 0D parameter loads
## Summary
This PR fixes a bug where loading a 1D tensor (size > 1) into a scalar (0D) parameter would silently take the first element instead of raising an error. The fix preserves backward compatibility for 1D tensors of size 1 while catching genuine shape mismatches.
## Motivation
Previously, loading a 1D tensor like torch.randn(32000) into a 0D scalar parameter would silently slice the first element, leading to silent data loss and potential bugs. This change ensures users get a clear error when there's a genuine shape mismatch.
## Behavior change
Before:
1D tensor (any length) -> 0D scalar -> silently coerced using input_param[0]
After:
- 1D tensor (size == 1) -> 0D scalar -> allowed (backward compatibility)
- 1D tensor (size > 1) -> 0D scalar -> raises RuntimeError with size mismatch message
In torch/nn/modules/module.py's `_load_from_state_dict`, added an `input_param.shape[0] == 1` check to the backward-compatibility condition so that only single-element 1D tensors are allowed.
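A sketch of the resulting behavior (hypothetical module):
```python
import torch
import torch.nn as nn

class Scalar(nn.Module):
    def __init__(self):
        super().__init__()
        self.weight = nn.Parameter(torch.tensor(1.0))  # 0-D scalar parameter

m = Scalar()
m.load_state_dict({"weight": torch.randn(1)})      # allowed: size-1 1-D (BC)
m.load_state_dict({"weight": torch.randn(32000)})  # now raises RuntimeError
```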
## Tests
Added test_scalar_param_1d_tensor_raises to verify that loading 1D tensors of size > 1 raises an error, while size 1 loads successfully.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166335
Approved by: https://github.com/mikaylagawarecki
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@9aac5a](9aac5a1ddf), which includes:
- Enable FP8 concat/where/flip/index_put/index.Tensor on XPU backend
- Remove BUILD_SPLIT_KERNEL_LIB flag
- Fix the initialization order of ProcessGroupXCCL
- Separates communication initialization logic from getXCCLComm
- Fix segmentation fault in NLLLoss kernel
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166945
Approved by: https://github.com/EikanWang
* `c10::fetch_and_cast` and `c10::cast_and_store` produce branchy code since it supports all datatypes
* So, we do special handling for binary elementwise broadcast with mixed dtypes of float/bfloat16/half
* This improves performance
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167233
Approved by: https://github.com/jeffdaily
# Motivation
This PR aims to introduce two variables (`TEST_ACCELERATOR` and `TEST_MULTIACCELERATOR`) to simplify UT generalization. Since out-of-tree backends may be imported later, these variables are defined as lazy values.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167196
Approved by: https://github.com/albanD
Problem: the migration of `AT_DISPATCH_V2` macros to headeronly cannot be a simple copy-paste of macro definitions from one header file to another because the macros `AT_DISPATCH_SWITCH` and `AT_DISPATCH_CASE` may use functions that cannot be migrated to headeronly, e.g. when a selective build feature is enabled, there will be functions that are generated. On the other hand, when not using selective build, the dtype-dispatch macros are perfectly suitable for migrating to headeronly.
In this PR, the migration problem above is tackled by refactoring `AT_DISPATCH` related macros into headeronly macros and non-headeronly macros while preserving the current API and semantics. For instance, consider the current V2 macro definitions:
```c++
#define AT_DISPATCH_V2(TYPE, NAME, BODY, ...) \
AT_DISPATCH_SWITCH(TYPE, NAME, AT_AP_VAR(AT_WRAP(BODY), TYPE, __VA_ARGS__))
#define AT_AP_VAR(N, T, ...) \
AT_EXPAND(AT_CONCAT(AT_AP, AT_NUM_ARGS(__VA_ARGS__))(AT_WRAP(N), __VA_ARGS__))
#define AT_AP1(N, _1) AT_DISPATCH_CASE(_1, N)
...
```
where the headeronly-migration-problematic parts are using the AT_DISPATCH_SWITCH and AT_DISPATCH_CASE macros (defined in ATen/Dispatch.h). In this PR, we introduce parametric versions of the `AT_DISPATCH_V2` and `AT_AP1` macros that have `_TMPL` suffixes, take DISPATCH_SWITCH and DISPATCH_CASE arguments, and are defined in `torch/headeronly/core/Dispatch_v2.h`:
```c++
#define THO_DISPATCH_V2_TMPL( \
DISPATCH_SWITCH, DISPATCH_CASE, TYPE, NAME, BODY, ...) \
DISPATCH_SWITCH( \
TYPE, \
NAME, \
THO_AP_VAR_TMPL(DISPATCH_CASE, AT_WRAP(BODY), TYPE, __VA_ARGS__))
#define THO_AP_VAR_TMPL(C, N, T, ...) \
AT_EXPAND( \
AT_CONCAT(THO_AP, AT_NUM_ARGS(__VA_ARGS__))(C, AT_WRAP(N), __VA_ARGS__))
#define THO_AP1(C, N, _1) C(_1, N)
...
```
so that original V2 macro definition, defined in ATen/Dispatch_v2.h, becomes:
```c++
#define AT_DISPATCH_V2(TYPE, NAME, BODY, ...) \
THO_DISPATCH_V2_TMPL( \
AT_DISPATCH_SWITCH, \
AT_DISPATCH_CASE, \
TYPE, \
NAME, \
AT_WRAP(BODY), \
__VA_ARGS__)
```
that has exactly the same API and semantics as the original definition.
Note 1: ~we have changed the definition of `AT_AP1(N, _1) ...` to `AT_AP1(C, N, _1) ...` without renaming `AT_AP1` because `AT_AP1` is a helper macro that is not a part of public API (for instance, nothing in pytorch explicitly uses `AT_AP1`).~ UPDATE: restored the original `AT_AP` macros and introduced new `THO_AP` macros.
Note 2: this PR introduces a new API macro THO_DISPATCH_V2_TMPL that will be available to stable ABI users, who can use it by providing custom versions of the `AT_DISPATCH_SWITCH` and `AT_DISPATCH_CASE` macros, say, with selective build features removed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165856
Approved by: https://github.com/janeyx99
Fixes #166405.
I've fixed this by adding epsilon validation in ```torch.nn.functional.batch_norm``` to reject non-positive values before they cause undefined behavior. Also added a test case ```test_batchnorm_invalid_eps``` to verify the fix works correctly.
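A quick illustration of the added check:
```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 3)
mean, var = x.mean(dim=0), x.var(dim=0)
F.batch_norm(x, mean, var, eps=1e-5)   # fine
F.batch_norm(x, mean, var, eps=-1e-5)  # now raises instead of silently
                                       # producing undefined results
```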
While working on this, I noticed that ```layer_norm```, ```group_norm```, and ```instance_norm``` also don't validate epsilon and could have the same issue. Should I add validation for those in this PR as well?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166756
Approved by: https://github.com/mikaylagawarecki
Summary:
The SIMD path is using the SLEEF version of pow, which is slightly different from std::pow. The fix is to use the same vectorized code (with partial load and store) for the trailing data as well, to ensure consistency between results.
Deploy:
Need to make a hotfix in waas to monitor release signals, since this diff can cause testing failures in veloski and waas release correctness tests.
Test Plan: Sandcastle.
Differential Revision: D86218207
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166986
Approved by: https://github.com/swolchok
This is the initial prototype of distributed autotuning. It's intended to be a basis for iteration rather than the final end product.
Currently when we run an SPMD program we compile the ranks independently. As a result the autotuning is repeated on every rank. So for an 8-GPU program with 8 matmul operators we'll autotune 64 (8*8) times.
Distributed autotuning uses collectives to distribute the autotuning across the ranks so each rank autotunes 1/world_size of the total operators. So in our 8-GPU example we would only perform 8 autotunes total (one on each rank) rather than 64.
There are several advantages:
1. Faster autotuning times - each CPU/GPU does less work total
2. Better determinism - currently it's possible for two ranks to choose different algorithms for the same operator. With distributed autotuning we choose the algorithm once for the entire program.
Results:
In testing using llama3 8B on torchtitan max-autotune time was reduced from 52s -> 26s and exhaustive-autotuning was reduced from 2009s -> 613s.
Usage:
The feature is controlled by the environment variable TORCHINDUCTOR_DISTRIBUTED_AUTOTUNE.
Co-authored-by: @PaulZhang12
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163369
Approved by: https://github.com/PaulZhang12
Handle `torch._check` in `TorchInGraphFunctionVariable.call_function`. Basically, it has two arguments: a predicate (bool) and a message (callable). If the predicate is a constant, evaluate `torch._check`: if the predicate is true, the program just compiles and nothing happens; if it is false, `torch._check` will raise an exception.
If the predicate is not constant, we manually emit a proxy. I tried to build as_proxy() inside NestedUserFunctionVariable but failed, which is why I create it here. I try to extract the message: if it's a function, I retrieve it; if not, I set it to None. Maybe we could extract it if the message is a closure, but I'm not sure how.
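A minimal sketch of the constant-predicate case (with static shapes the predicate folds to a constant at trace time, so nothing is emitted into the graph):
```python
import torch

@torch.compile
def f(x):
    torch._check(x.shape[0] > 0, lambda: "batch dimension must be non-empty")
    return x * 2

f(torch.ones(3))  # predicate holds at trace time; compiles and runs normally
```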
Fixes #163668
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164676
Approved by: https://github.com/williamwen42, https://github.com/mlazos
Co-authored-by: William Wen <william.wen42@gmail.com>
Summary: The current delta update has a strong assumption that the non-lowered weights share the same tensor dtype as the lowered version. This is not true by design. When the dtype mismatches, data loading will load the data into an unexpected dtype, which introduces undefined behavior. This diff closes the gap by always loading the tensor as its original dtype first, then casting to the desired dtype.
Test Plan:
No more NaN values!
{P2022339213}
Reviewed By: kqfu
Differential Revision: D86181685
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167039
Approved by: https://github.com/henryoier
This PR enables ROCm/HIP support for PyTorch's StaticCudaLauncher, which provides static compilation and launching of Triton kernels. The implementation has been tested on AMD MI300 and MI200 hardware.
**Changes**
**Python (torch/_inductor/runtime/)**
- static_cuda_launcher.py: Added ROCm detection, .hsaco binary support, and ROCm-specific scratch parameter handling
- triton_heuristics.py: Updated device type checks to support both cuda and hip
**C++ (torch/csrc/)**
- Module.cpp: Enabled StaticCudaLauncher for ROCm builds
- inductor/static_cuda_launcher.cpp: Added HIP API equivalents for all CUDA driver calls
- inductor/static_cuda_launcher.h: Updated header guard
**Tests (test/inductor/)**
- test_static_cuda_launcher.py: Removed @skipIfRocm decorators and updated binary file handling
**Enabled Unit Tests**
All tests in test/inductor/test_static_cuda_launcher.py now pass on ROCm:
1. test_basic
2. test_unsigned_integers
3. test_signed_integers
4. test_basic_1arg
5. test_constexpr
6. test_implied_constant
7. test_kernel_no_args
8. test_high_shared_mem
9. test_too_high_shared_mem
10. test_kernel_empty_tensor
11. test_kernel_many_args
12. test_basic_compile
13. test_incompatible_code
14. test_static_launch_user_defined_triton_kernels
15. test_empty_tensor
16. test_any
17. test_disable_static_cuda_launcher
In addition to this, the following tests from test/inductor/test_codecache.py also pass:
1. test_remote_cache_load_function_device_cuda_float32_dynamic_False_bundle_triton_False_use_static_cuda_launcher_False
2. test_remote_cache_load_function_device_cuda_float32_dynamic_False_bundle_triton_True_use_static_cuda_launcher_False
3. test_remote_cache_load_function_device_cuda_float32_dynamic_False_bundle_triton_True_use_static_cuda_launcher_True
4. test_remote_cache_load_function_device_cuda_bfloat16_dynamic_False_bundle_triton_False_use_static_cuda_launcher_False
5. test_remote_cache_load_function_device_cuda_bfloat16_dynamic_False_bundle_triton_True_use_static_cuda_launcher_False
6. test_remote_cache_load_function_device_cuda_bfloat16_dynamic_False_bundle_triton_True_use_static_cuda_launcher_True
The following tests are skipped since triton bundling is necessary for StaticCudaLauncher:
1. test_remote_cache_load_function_device_cuda_float32_dynamic_False_bundle_triton_False_use_static_cuda_launcher_True
2. test_remote_cache_load_function_device_cuda_bfloat16_dynamic_False_bundle_triton_False_use_static_cuda_launcher_True
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166492
Approved by: https://github.com/jeffdaily
## Summary
This PR adds multi-architecture kernel compilation support for ROCm in PyTorch's AOT Inductor module, enabling a single compiled model to run across multiple AMD GPU architectures (MI200, MI300, MI350, etc.) without recompilation.
## Implementation
- **Multi-arch compilation pipeline**: Compiles LLVM IR to multiple GPU architectures and bundles them using `clang-offload-bundler`
- **Architecture detection**: Automatically detects target architectures from `torch.cuda.get_arch_list()`, with overrides via `PYTORCH_ROCM_ARCH` environment variable
- **ROCm-specific utilities**: New `rocm_multiarch_utils.py` module handles ROCm toolchain integration
- **Test infrastructure**: Adapted AOT Inductor tests to support both CUDA and ROCm compilation paths
## Testing
Successfully tested on:
- MI200
- MI300
**Enabled tests:**
- `test_simple_multi_arch`
- `test_compile_after_package_multi_arch`
- `test_compile_with_exporter`
- `test_compile_with_exporter_weights`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166357
Approved by: https://github.com/jeffdaily
Fixes #166906
```
python test/export/test_export.py -k test_annotate_on_assert
```
The assertions are not marked with annotation because these nodes are created in `apply_runtime_assertion_pass`. Currently the annotation will only be added if the nodes are created during tracing. So we need to manually add the annotation.
Nodes added in `apply_runtime_assertion_pass` will have the same annotation as the input node to the assertion.
Output graph:
Note that `_assert_scalar_default_1` is not annotated because it's an assertion on the size of `x`, which is not annotated.
```
ExportedProgram:
class GraphModule(torch.nn.Module):
def forward(self, x: "f32[s77]", y: "i64[]"):
# No stacktrace found for following nodes
sym_size_int_1: "Sym(s77)" = torch.ops.aten.sym_size.int(x, 0)
# Annotation: {'moo': 0} File: /data/users/shangdiy/pytorch/test/export/test_export.py:729 in forward, code: x = torch.cat([x, x])
cat: "f32[2*s77]" = torch.ops.aten.cat.default([x, x]); x = None
# Annotation: {'moo': 0} File: /data/users/shangdiy/pytorch/test/export/test_export.py:730 in forward, code: b = y.item()
item: "Sym(u0)" = torch.ops.aten.item.default(y); y = None
ge_1: "Sym(u0 >= 4)" = item >= 4
_assert_scalar_default = torch.ops.aten._assert_scalar.default(ge_1, "Runtime assertion failed for expression u0 >= 4 on node 'ge_1'"); ge_1 = _assert_scalar_default = None
# No stacktrace found for following nodes
mul_1: "Sym(2*s77)" = 2 * sym_size_int_1; sym_size_int_1 = None
le: "Sym(2*s77 <= u0)" = mul_1 <= item; mul_1 = None
_assert_scalar_default_1 = torch.ops.aten._assert_scalar.default(le, "Runtime assertion failed for expression 2*s77 <= u0 on node 'le'"); le = _assert_scalar_default_1 = None
# Annotation: {'moo': 0} File: /data/users/shangdiy/pytorch/test/export/test_export.py:732 in forward, code: return x * b
mul: "f32[2*s77]" = torch.ops.aten.mul.Tensor(cat, item); cat = item = None
return (mul,)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167171
Approved by: https://github.com/angelayi
## Summary
Fix outdated unbiased parameter references in normalization module documentation. Replace deprecated torch.var(input, unbiased=False/True) with modern torch.var(input, correction=0/1) API throughout BatchNorm, InstanceNorm, LayerNorm, and GroupNorm docstrings.
## Changes
- torch/nn/modules/batchnorm.py: Updated 4 instances across BatchNorm1d, BatchNorm2d, BatchNorm3d, and SyncBatchNorm
- torch/nn/modules/instancenorm.py: Updated 3 instances across InstanceNorm1d, InstanceNorm2d, and InstanceNorm3d
- torch/nn/modules/normalization.py: Updated 2 instances in LayerNorm and GroupNorm
## Test plan
Mathematical behavior remains identical: unbiased=False ≡ correction=0 (biased estimator), unbiased=True ≡ correction=1 (unbiased estimator). Documentation now uses consistent modern API terminology with no functional changes to code behavior.
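The equivalence being documented:
```python
import torch

x = torch.randn(10)
assert torch.allclose(torch.var(x, unbiased=False), torch.var(x, correction=0))
assert torch.allclose(torch.var(x, unbiased=True), torch.var(x, correction=1))
```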
Fixes #166804
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167209
Approved by: https://github.com/albanD
In a trunk failure today, we saw the same test running on both trunk and slow shards. The reason is that this test didn't invoke `super().setUp()`, so test features like slow and disabled tests didn't apply to it.
I used Claude to find all test classes with a `setUp()` method that didn't call `super().setUp()` and patched all of them.
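The pattern being enforced:
```python
from torch.testing._internal.common_utils import TestCase

class MyTest(TestCase):
    def setUp(self):
        super().setUp()  # without this, slow/disabled-test handling is skipped
        # ... test-specific setup ...
```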
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167163
Approved by: https://github.com/malfet
We have some logic to figure out, given which inputs have static indices in the pre-subclass-desugaring graph, the static indices in the post-subclass-desugaring graph, and it was busted for training.
Separately, we should probably not have to do this logic at all: as @eellison mentioned, inputs/outputs in the graph are less likely to be tweaked by graph passes, so it would be more convenient and less hassle if we just stashed whether a given input is static directly on its Descriptor. I did not end up doing that in this PR, though.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167127
Approved by: https://github.com/ezyang
[Flight Recorder] Reverted to include stack traces for dump pipe triggered FR dump (#167023)
Summary:
We should also retry if including stack traces failed. The change was introduced in https://github.com/pytorch/pytorch/pull/164591
Test Plan: eyes
Reviewed By: fduwjj
Differential Revision: D86248484
This pull request updates the PyTorch documentation build system to support newer versions of Sphinx and its related dependencies, improves coverage checking for undocumented objects, and adds configuration enhancements to the docs build. The most important changes are grouped below.
**Dependency Upgrades and Compatibility:**
* Upgraded `sphinx` to version 7.2.6 and updated related documentation dependencies (`breathe`, `exhale`, `docutils`, `myst-nb`, `sphinx-design`, `myst-parser`, and others) in `.ci/docker/requirements-docs.txt` to ensure compatibility with Python 3.13 and improve documentation generation.
* Replaced the editable install of `pytorch_sphinx_theme2` with a pinned version for stability in documentation builds.
**Documentation Coverage and Build Improvements:**
* Updated the coverage check logic in `.ci/pytorch/python_doc_push_script.sh` to parse the new Sphinx 7.2.6+ coverage report format, extracting the undocumented count from the statistics table for more reliable coverage validation.
**Configuration and Formatting Enhancements:**
* Introduced `autosummary_filename_map` in `docs/source/conf.py` to resolve duplicated autosummary output filenames for functions and classes with the same name, improving documentation clarity.
**Minor Documentation Formatting:**
* Removed an unused `:template:` directive from `docs/source/quantization-support.md` for cleaner autosummary output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164901
Approved by: https://github.com/albanD
As per title.
It seems safe to generalize to arbitrary contiguous inputs since `at::matmul` is likely to do the flattening to avoid `baddmm`.
Additionally, we guard for bias being 1D and contiguous, which is guaranteed to be fused with no copies.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166071
Approved by: https://github.com/ngimel
I.e. remove the distinction between the two cases, and always preload the full set of libraries.
For some reason, when one uses `virtualenv` instead of `venv`,
preloading `cudart` works, but it fails to find cudnn or cublasLT later on.
Fix it by getting rid of the partial preload logic for one of the cases and always preloading the full set of libraries.
Test plan on stock Ubuntu:
```
pip install virtualenv
virtualenv --symlinks -p python3.11 --prompt virtv venv-virt
source venv-virt/bin/activate
pip install torch
python -c 'import torch'
```
Fixes https://github.com/pytorch/pytorch/issues/165812
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167046
Approved by: https://github.com/atalman
torch.cuda.memory.set_per_process_memory_fraction allows setting
an upper bound on how much device memory is allocated. This PR
exposes this setting via an environment variable.
For example, PYTORCH_CUDA_ALLOC_CONF="per_process_memory_fraction:0.5"
will limit the device memory to half of the available memory.
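The environment variable mirrors the existing Python API (on a CUDA-enabled build):
```python
import torch

# equivalent to PYTORCH_CUDA_ALLOC_CONF="per_process_memory_fraction:0.5"
torch.cuda.memory.set_per_process_memory_fraction(0.5)
```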
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161035
Approved by: https://github.com/ngimel, https://github.com/eqy
Add Inline Fusion Support for Custom Op Autotuning
--------------------------------------------------
This PR extends PyTorch Inductor's custom op autotuning with inline fusion capabilities, enabling the winning decomposition to be inlined directly into the computation graph for fusion with surrounding operations.
### Usage
```python
def decompose_k_implementation(
a: torch.Tensor, b: torch.Tensor, k_splits: int = 4
) -> torch.Tensor:
"""Matrix multiply with k-way decomposition."""
...
@torch.library.custom_op("my_lib::matmul_relu", mutates_args={})
def custom_matmul_relu_dk(
a: torch.Tensor, b: torch.Tensor, k_splits: int
) -> torch.Tensor:
return torch.relu(decompose_k_implementation(a, b, k_splits))
register_custom_op_autotuning(
custom_op=custom_matmul_relu_dk,
configs=[
CustomOpConfig(k_splits=2),
CustomOpConfig(k_splits=4),
CustomOpConfig(k_splits=8),
CustomOpConfig(k_splits=32),
CustomOpConfig(k_splits=64),
],
name="decompose_k_autotuned",
input_gen_fns={
"a": lambda fake: torch.randn_like(fake, device='cuda'),
"b": lambda fake: torch.randn_like(fake, device='cuda'),
}
)
```
### How It Works
Enable optimizations from Inductor by inlining the best decomposition, allowing fusion with surrounding elementwise operations and other graph-level optimizations. This potentially provides better performance and memory efficiency.
During the custom op autotuning phase, we still benchmark all CustomOpConfigs to find the fastest implementation. Then, during inline fusion, Inductor inlines the winning decomposition into the main graph, converting it to individual ComputedBuffer IR nodes (fusable). At the end, Inductor automatically fuses the inlined operations with surrounding elementwise ops (e.g., bias add, ReLU, scaling). Note that the winning choice must be a SubgraphChoiceCaller (decomposition-based) rather than an ExternKernelChoice for inlining to work; if an ExternKernelChoice is returned, no inlining happens.
Performance Results
Benchmarked on matmul+relu workload with decompose-k fusion (H100 GPU, 15 test shapes):
<img width="782" height="377" alt="Screenshot 2025-11-04 at 12 43 11 AM" src="https://github.com/user-attachments/assets/22131d4c-a8ce-4f55-bdcd-ac758ddad8cd" />
Metric | Result
-- | --
Average Speedup vs ATen | 1.28x
Max Speedup vs ATen | 1.41x
The performance comparison is detailed in the plots below. We see that in most cases inline fusion achieves better performance than the ATen baseline and the current torch.compile.
<img width="4874" height="3545" alt="image" src="https://github.com/user-attachments/assets/190a1233-412f-4f34-84cd-9b7cb582f504" />
**Test**: `test_decompose_k_with_fusion` demonstrates decompose-k with inline fusion enabled.
--------------
### Integration into mm.py decomposeK with a config flag enable_inline_subgraph_fusion=True (deprecated to avoid breaking async compilation; already removed from this PR)
FP32:
<img width="738" height="357" alt="Screenshot 2025-11-04 at 12 05 08 AM" src="https://github.com/user-attachments/assets/ee421d22-c426-42f2-8dcd-4dcc547d6219" />
FP16:
<img width="769" height="403" alt="Screenshot 2025-11-04 at 12 13 49 AM" src="https://github.com/user-attachments/assets/346d1ffc-15af-40b0-9378-cf9b297711c2" />
The TCF column represents torch compile fusion, which is close to custom_op decomposek. The difference might be due to different candidate k values.
#### Usage:
Note: this only happens when we don't benchmark_epilogue_fusion, i.e., not using multi_template_buffer.
```python
# Define the matmul+relu function
def matmul_relu(x, y):
return torch.nn.functional.relu(torch.matmul(x, y))
# Compile with inline subgraph fusion enabled
@torch.compile
def compiled_matmul_relu(x, y):
return matmul_relu(x, y)
# Reset dynamo to ensure clean compilation
torch._dynamo.reset()
with config.patch(
{
"max_autotune": True,
# CRITICAL: These two flags enable inline subgraph fusion
"benchmark_epilogue_fusion": False, # Must be False for inline fusion!
"enable_inline_subgraph_fusion": True, # Enable inline fusion
}
):
# Compile and run
result = compiled_matmul_relu(a, b)
torch.cuda.synchronize()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165952
Approved by: https://github.com/PaulZhang12, https://github.com/eellison
### Summary
Adds a debug-level logging statement to torch.fx.Interpreter.run_node, as proposed in [#117351](https://github.com/pytorch/pytorch/issues/117351), to make FX graph execution traceable when debugging or instrumenting model transformations.
When debug logging is enabled, each executed node emits a single structured log line formatted via `LazyString(lambda: n.format_node())`, deferring string construction unless logging is active.
### Example Output
With `logging.DEBUG` enabled:
```
run_node x = x()
run_node add = _operator.add(x, 1)
run_node clamp = torch.clamp(add, min=0.0, max=5.0)
run_node output = output(clamp)
```
With `logging.DEBUG` disabled, no additional output is produced (unchanged default behavior).
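One way to surface these lines, assuming the `torch.fx` module logger propagates to the root handler:
```python
import logging
logging.basicConfig(level=logging.DEBUG)

import torch
import torch.fx

def f(x):
    return torch.clamp(x + 1, min=0.0, max=5.0)

gm = torch.fx.symbolic_trace(f)
torch.fx.Interpreter(gm).run(torch.randn(3))  # logs one run_node line per node
```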
### Test Plan
Verified locally with Python 3.11 on macOS using a PyTorch build from source.
- With `logging.DEBUG` enabled: each node emits a debug log via LazyString.
- With `logging.DEBUG` disabled: no additional output.
- Confirmed all `Interpreter` tests pass locally:
`pytest test/test_fx.py -k "Interpreter"`
Updated the example output to reflect the new `_format_fx_node` helper and inclusion of `kwargs`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166622
Approved by: https://github.com/aorenste
Intended to make it easier to reuse this logic for processing operator arguments as IValues in following PR(s).
Testing: python test/test_python_dispatch.py (broke during development, seems to work now)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166368
Approved by: https://github.com/albanD
Summary: Inside the group_split API, we share the reference to the PG option with the parent PG if a PG option is not explicitly specified. This is bad because if we split the parent PG multiple times, we will run into errors.
Test Plan: UT + internal test.
Differential Revision: D86225394
Pull Request resolved: https://github.com/pytorch/pytorch/pull/167125
Approved by: https://github.com/Skylion007
Fixes #160513
## The Problem Summary
The issue boiled down to data type promotion logic. The code base has two different functions that deal with dtype promotion. For purely multi-dimensional tensor operations, the cpp code gets triggered, and that follows the numpy dtype promotion logic. That is why in #160513 NDim tensors are fine, as NDim dtypes get precedence. The issue came with python scalars and 0-dim tensors. When it detects "scalars", a python implementation of the dtype promotion logic gets triggered (torch/_prims_common/__init__.py:1544). Since this is in python, the implementation can't distinguish a wrapped number from a 0-dim tensor and thus will just take the highest dtype, which is that of the python-double wrapped number.
## The Fix
The python implementation for dtype promotion had to know where the scalar came from. Once the scalar can be distinguished, the appropriate dtype can be set. The first approach was to try to expose the `is_wrapped_number` method, but this came with a big issue: during forward AD, the derivatives of those scalars turn out to be `ZeroTensor`s. The `ZeroTensor` internally uses a hack to initialize a meta dtype tensor, which skips expensive dispatch operations, but the copy would not grab everything, especially the `is_number_wrapped_` property. I thought about modifying the copy, but that seemed to go against the spirit of what the copy was intended for; plus, the check for `is_wrapped_number_` requires `dim > 0`, and a scalar `ZeroTensor` is a meta dtype tensor, which complicates things.
So I chose the route of creating a new property called `was_wrapped_number` and exposed this property to the python tensor API. I had to modify the autograd code generation to set `was_wrapped_number` in the mul, add, and div operations in `VariableType.cpp`. Once this property was set, the dtype promotion logic could be updated to consider wrapped numbers and 0Dim numbers. Once that hierarchy was taken care of, the buggy behavior was fixed.
I wrote a new ops testing module, `TestForwardADWithScalars`, as this bug was unique and required a new testing paradigm. It only tests multiply, add, and divide; I chose these because the affected operations boil down to these three.
[edit]: Just used the `efficientzerotensor` meta and converted that to a python number. Since the wrapped number is converted back to a python number, dtype promotion is preserved. This was achieved by setting the forward grad zero tensor of a wrapped number with the wrapped-number flag, since the tangent of a wrapped number should still be a wrapped number. That specific zerotensor is then sent through as a meta type in `BinaryOps.cpp` to get the appropriate dtype for the resulting arithmetic.
@ezyang @OihanJoyot
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165784
Approved by: https://github.com/ezyang
Fixes #166746 by removing squeezes that caused shape mismatches when calling backwards through `BCELoss(reduction='none')`.
Based on running these tests, it seems MPSGraph can handle inputs without squeezing.
```
python test/test_mps.py TestMPS -k test_bce
python test/test_mps.py TestConsistency -k binary_cross
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166786
Approved by: https://github.com/malfet
Slice knows how to handle unbacked start: we do not need to offset start before calling slice; we can leave that to slice.
The only edge case is when start < 0 and start + length == 0; in that case slice and narrow would deviate.
For that case we pass dim_size instead of start + length.
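Concretely:
```python
import torch

x = torch.arange(5)
print(torch.narrow(x, 0, -2, 2))  # tensor([3, 4])
# A naive slice end of start + length == 0 would be empty:
print(x[-2:0])                    # tensor([], dtype=torch.int64)
# so the end passed to slice must be dim_size (5) instead:
print(x[-2:5])                    # tensor([3, 4])
```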
Pull Request resolved: https://github.com/pytorch/pytorch/pull/166361
Approved by: https://github.com/aorenste
```c++
  // Symbolic case: use sym_ite to convert bool to int (0 or 1)
  auto node = toSymNodeImpl();
  auto one_node = node->wrap_int(1);
  auto zero_node = node->wrap_int(0);
  return SymInt(node->sym_ite(one_node, zero_node));
}
} // namespace c10
```