The normal pytest/unittest failure patterns also match flaky tests (specifically, I think, tests that fail and then succeed on rerun in a new subprocess).
So print something specific for the log classifier to match against.
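A minimal sketch of the idea (the marker string and helper are hypothetical, not the actual string added by this PR):
```python
# Minimal sketch (names hypothetical): emit a distinct marker when a test
# fails and then succeeds on rerun, so the log classifier can match it
# instead of the generic pytest/unittest failure pattern.
FLAKY_MARKER = "FLAKY_TEST_DETECTED"  # hypothetical marker string

def report_rerun_result(test_id: str, failed_first: bool, passed_on_rerun: bool) -> None:
    if failed_first and passed_on_rerun:
        # The log classifier greps for this exact line.
        print(f"{FLAKY_MARKER}: {test_id} failed initially but passed on rerun")
```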
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165163
Approved by: https://github.com/izaitsevfb
Summary: While saving state_dict tensors, deduping is done to reduce the amount of tensor data, using the storage pointer as the key. But when a tensor is empty, its storage pointer is 0 even though the dtypes of such tensors can differ, so the existing logic treats all empty tensors as the same. This fails the model later when a different dtype is expected. This change includes the dtype in the dedup key as well. Non-empty tensors are unaffected, since their storage pointers are unique.
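A minimal sketch of the dedup-key change described above (the helper name is hypothetical; the real logic lives in the state_dict saving path):
```python
import torch

def dedup_key(t: torch.Tensor):
    # Hypothetical helper: previously the key was just the storage pointer,
    # which is 0 for every empty tensor; adding the dtype keeps empty tensors
    # of different dtypes from being deduped into one entry.
    return (t.untyped_storage().data_ptr(), t.dtype)

a = torch.empty(0, dtype=torch.float32)
b = torch.empty(0, dtype=torch.int64)
# Same (zero) storage pointer, but the keys now differ because of the dtype.
assert dedup_key(a) != dedup_key(b)
```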
Test Plan: TBD
Differential Revision: D84243094
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165090
Approved by: https://github.com/yiming0416
https://github.com/pytorch/pytorch/issues/162858 describes the feature implemented here.
This extends the existing graph break log with the latest 20 bytecode instructions (or as many as the viable user frame provides). The targeted scenario is a graph break that happens without errors, i.e. when the user calls torch._dynamo.graph_break().
In testing, the frame generated via step() turned out not to be deterministic: sometimes it reaches the maximum count and sometimes it generates fewer instructions, and the bytecode generation is Python-version dependent. The test plan therefore excludes the bytecode output itself and instead checks the total bytecode line count.
Working through this is a helpful way to understand bytecode transformation, symbolic_convert, and convert_frame, and it provides hands-on experience with the Dynamo workflow.
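A minimal repro of the scenario described (an error-free graph break via torch._dynamo.graph_break()); enabling graph-break logging through torch._logging.set_logs is one way to surface the extended log output:
```python
import torch

# Surface graph-break logs; the bytecode tail added by this PR should appear
# in this output when the break is error-free.
torch._logging.set_logs(graph_breaks=True)

def fn(x):
    x = x + 1
    torch._dynamo.graph_break()  # error-free graph break described above
    return x * 2

torch.compile(fn)(torch.randn(4))
```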
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164422
Approved by: https://github.com/williamwen42, https://github.com/mlazos
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Splits the training and inference paths for aot stage2 compile.
1. Split `aot_stage2_autograd` into `_aot_stage2a_partition`, `_aot_stage2b_fw_compile`, `_aot_stage2b_bw_compile`, and the rest.
2. Split `aot_stage2_inference` into `_aot_stage2b_inference_compile` and the rest.
I'm leaving these as functions with underscore names since the I/O interfaces and the exact boundaries of these splits are still somewhat up in the air.
Differential Revision: D84028203
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164808
Approved by: https://github.com/SherlockNoMad
Some more context at https://github.com/pytorch/pytorch/pull/164939
The basic point here is that Python decomps are guaranteed to be functional, whereas C++ ones are not. If we have a Python decomp, we should prefer it over the C++ one. This currently doesn't matter too much as CIA decomps will get functionalized, but it matters after the quoted PR because we now run these decompositions very late (to make it easy for things like aot_eager to get the fused versions of operators in proxy tensor).
Signed-off-by: Edward Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164970
Approved by: https://github.com/bdhirsh
Adding bf16 support for the `torch._fake_quantize_learnable_per_channel_affine()` op by relaxing the type check on scale.
TODO: need to add bf16 support to `per_tensor_affine_` as `torch._fake_quantize_learnable_per_tensor_affine_backward` gets called in the backward pass
**Test**
Modified unit test in `test_workflow_ops.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165098
Approved by: https://github.com/jerryzh168, https://github.com/andrewor14
This is a follow-up of #165037. It is generally recommended to use `is`/`is not` to compare types, so this series of changes applies that suggestion across the code base, with the aim of finally enabling the related linter checks.
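A small illustration of the lint suggestion:
```python
class Base: ...
class Child(Base): ...

x = Child()

# Types are singletons, so identity comparison is the idiomatic exact-type
# check; `==` also works here but reads as a value comparison and can be
# overridden by a metaclass, which is why the linter prefers `is` / `is not`.
assert type(x) is Child
assert type(x) is not Base      # exact-type check: subclasses don't match
assert isinstance(x, Base)      # use isinstance() when subclasses should match
```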
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165142
Approved by: https://github.com/albanD
This PR enables a number of distributed unit tests and applies necessary fixes to ensure they pass on ROCm platforms. The changes have been successfully tested on both MI200 and MI300 hardware.
This work addresses the following issues:
**https://github.com/ROCm/frameworks-internal/issues/13586**
**https://github.com/ROCm/frameworks-internal/issues/13578**
**Enabled Tests**
The following tests have been enabled and are now passing:
1. test_compiled_autograd_ctx
2. test_simple_mlp_fullgraph_backend_aot_eager
3. test_simple_mlp_fullgraph_backend_aot_eager_decomp_partition
4. test_simple_mlp_fullgraph_backend_inductor
5. test_nested_fully_shard_backend_aot_eager
6. test_nested_fully_shard_backend_aot_eager_decomp_partition
7. test_nested_fully_shard_backend_inductor_fullgraph_True
8. test_nested_fully_shard_backend_inductor_fullgraph_True_graph_partition
9. test_transformer_backend_aot_eager
10. test_transformer_backend_aot_eager_decomp_partition
11. test_storage_resize_zero_gpu
12. test_storage_resize_nonzero_gpu
13. test_fake_distributed_inductor
**Tests skipped due to upstream issues:**
1. test_nested_fully_shard_backend_inductor_fullgraph_False
2. test_transformer_backend_inductor_fullgraph_True
3. test_transformer_backend_inductor_fullgraph_True_graph_partition
4. test_transformer_backend_inductor_fullgraph_False
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165011
Approved by: https://github.com/jeffdaily
It is generally recommended to use `is`/`is not` to compare types, so this series of changes applies that suggestion across the code base, with the aim of finally enabling the related linter checks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165037
Approved by: https://github.com/mlazos
## Description:
This PR refactors the autocast context manager in `autocast_mode.py` to simplify and centralize the logic for checking supported dtypes for each device. The previous implementation repeated similar checks for multiple device types. Now, a single mapping `device_supported_dtypes` is used to associate device types with their supported dtypes, and the validation logic is unified.
In my view, this makes the code easier to maintain and extend for new devices.
Please share any suggestions and comments with me.
BTW, in the original `xla` branch the supported dtypes are `[torch.float16, torch.bfloat16]` (5d8a226e23/torch/amp/autocast_mode.py (L358-L363)), but the warning message mentions only `torch.bfloat16`.
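A minimal sketch of the centralized approach described above (the device list and dtypes here are placeholders, not the exact table in `autocast_mode.py`):
```python
import warnings
import torch

# Placeholder mapping; the real one in autocast_mode.py covers more devices.
device_supported_dtypes = {
    "cpu": (torch.bfloat16, torch.float16),
    "cuda": (torch.bfloat16, torch.float16),
    "xla": (torch.bfloat16, torch.float16),
}

def check_autocast_dtype(device_type: str, dtype: torch.dtype) -> None:
    supported = device_supported_dtypes.get(device_type)
    if supported is None:
        raise RuntimeError(f"Unsupported autocast device_type: {device_type!r}")
    if dtype not in supported:
        # Single validation/warning path instead of one branch per device.
        warnings.warn(
            f"In {device_type} autocast, but the target dtype {dtype} is not "
            f"supported. Supported dtypes: {list(supported)}."
        )

check_autocast_dtype("cpu", torch.bfloat16)  # OK, no warning
```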
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163446
Approved by: https://github.com/FFFrog, https://github.com/albanD
As title
On Windows, we cannot modify the .dll to append weights at the end; the Windows .dll loader will complain that it's not a valid .dll file. So we store the weight blob as a separate file.
1. We add the following APIs, which allow getting the size of the weight blob and passing in a pointer to its contents.
```cpp
AOTI_API AOTIRuntimeError AOTInductorModelContainerGetConstantsBlobSize(
AOTInductorModelContainerHandle container_handle,
uint64_t* ret_size);
// Load weights from a single blob in weight_blob_ptr
AOTI_API AOTIRuntimeError AOTInductorModelUpdateConstantsFromBlob(
AOTInductorModelContainerHandle container_handle,
const uint8_t* weight_blob_ptr);
```
2. We also add a method in ModelContainerRunner to load the weights:
If the runner sees that there is a `.blob` file in the package, it will mmap the .blob file and use its contents to load the constants.
3. We also add the `USE_MMAP_EXTERNAL` macro. When this macro is defined, the model expects to load its weights from the external mmap'd weight file.
Test Plan:
```
buck run @mode/dev-nosan caffe2/test/inductor:test_aot_inductor -- -r test_large_mmaped_weights_on_disk
```
Also tested Windows cross-compilation with 6542566585/demo/main_voxtral.cpp
```
Loaded model.dll
audio_encoder loaded
C:\Users\shangdiy\source\repos\torchnative\demo\token_embedding\data\aotinductor\model\model.wrapper.so
Loaded model.dll
token_embedding loaded
C:\Users\shangdiy\source\repos\torchnative\demo\text_decoder\data\aotinductor\model\model.wrapper.so
Loaded model.dll
Loading weights from C:\Users\shangdiy\source\repos\torchnative\demo\text_decoder\data\aotinductor\model\model.wrapper_weights.blob
text_decoder loaded
Load latency (ms):
audio_encoder: 1011.234
archive extraction: 0.000
.so loading: 1011.197
token_embedding: 525.773
archive extraction: 0.000
.so loading: 525.704
text_decoder: 3324.130
archive extraction: 0.000
.so loading: 3323.979
Run latency (ms):
audio_encoder: 285.958
audio_encoder output: dtype=bfloat16, shape=[1, 1125, 3072], numel=3456000
token_embedding: 6.676
token_embedding output: dtype=bfloat16, shape=[1, 1138, 3072], numel=3495936
text_decoder: 576.519
text_decoder output: dtype=bfloat16, shape=[1, 1138, 131072], numel=149159936
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164526
Approved by: https://github.com/desertfire
Instead of collecting local results with all_gather_object followed by a local reduction, this change switches to a single all_reduce with the MIN reduction op to compute the final equals result.
This change is needed to enable the LocalTensor work (all_gather_object introduces challenges for DTensor and LocalTensor integration).
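A minimal sketch of the new pattern, assuming a per-rank boolean `local_equal` (device placement and the exact call site in DTensor are simplified):
```python
import torch
import torch.distributed as dist

def all_ranks_equal(local_equal: bool, group=None) -> bool:
    # Old pattern: all_gather_object() the per-rank booleans, then reduce
    # locally. New pattern: one all_reduce with MIN, so the result is 1 only
    # if every rank reported equality.
    flag = torch.tensor([1 if local_equal else 0], dtype=torch.int64)
    dist.all_reduce(flag, op=dist.ReduceOp.MIN, group=group)
    return bool(flag.item())
```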
topic: not user facing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164999
Approved by: https://github.com/ezyang
Context is in https://www.internalfb.com/excalidraw/EX519691 and https://docs.google.com/document/d/1qnuXLZk_GYt_PksHTwkn7L2ELRDnYlIRPkHAlXTyuhw/edit?tab=t.0.
So for the Autoparallel initial trace, we want to trace the graph with global shapes. But for the local_map region, we are forced to trace with the expected local tensors. To the tracers this looks weird, because it's a plain tensor input (representing DTensor's full tensor .to_local()) that we need to "redistribute".
After hacking up a miserable version that had cross-key dependencies, @ydwu4 proposed this simpler approach of overriding the fake key. This means the shape conversion is invisible to all dispatch keys above fake, which covers all current tracing mechanisms. It manifests as the joint graph for the HOP body being traced with local shapes:
```python
# HOP forward, note local shapes (10, 80)
class GraphModule(torch.nn.Module):
def forward(self, primals_0: "f32[10, 80]"):
# No stacktrace found for following nodes
view: "f32[800]" = torch.ops.aten.view.default(primals_0, [-1]); primals_0 = None
add: "f32[800]" = torch.ops.aten.add.Tensor(view, 10); view = None
view_1: "f32[10, 80]" = torch.ops.aten.view.default(add, [10, 80]); add = None
return (view_1,)
# HOP backward, note local shapes (10, 80)
class GraphModule(torch.nn.Module):
def forward(self, tangents_0: "f32[10, 80]"):
# No stacktrace found for following nodes
clone: "f32[10, 80]" = torch.ops.aten.clone.default(tangents_0); tangents_0 = None
return (clone,)
```
while the rest of the graph is still traced with global shapes:
```python
# Parent graph joint, note global shapes (80, 80)
class inner_f(torch.nn.Module):
def forward(self, primals, tangents):
primals_1: "f32[80, 80]"; tangents_1: "f32[80, 80]";
primals_1, tangents_1, = fx_pytree.tree_flatten_spec([primals, tangents], self._in_spec)
# File: /home/xmfan/core/a/pytorch/test/higher_order_ops/test_local_map.py:597 in forward, code: return fn(x)
call_local_map = torch._higher_order_ops.local_map.call_local_map(primals_1); primals_1 = None
getitem: "f32[80, 80]" = call_local_map[0]; call_local_map = None
call_local_map_1 = torch._higher_order_ops.local_map.call_local_map(tangents_1); tangents_1 = None
getitem_1: "f32[80, 80]" = call_local_map_1[0]; call_local_map_1 = None
return pytree.tree_unflatten([getitem, getitem_1], self._out_spec)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164340
Approved by: https://github.com/ydwu4
ghstack dependencies: #164296, #164321, #164419, #164420
Reviewed GPT-5 Summary:
**Summary / Goal**
Add validation that partitioned forward/backward graphs respect placements.
**Details**
- Validates placement alignment in local_map.
- The HOP's autograd key gets called when we are tracing the joint; there we need to validate:
- the inputs to the HOP's fwd gm (typically this is the dynamo rewritten inputs)
- the inputs to the HOP partitioned fwd/bwd gm
- the outputs of the HOP partitioned fwd/bwd gm
**Motivation**
Catch mismatch errors earlier, improve debugging.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164420
Approved by: https://github.com/ezyang
ghstack dependencies: #164296, #164321, #164419
Reviewed GPT5 summary:
**Summary / Goal**
Fix inconsistent variable naming for forward/backward graphs.
**Details**
- Those methods are actually for both fw and bw graphs now that we reuse the same op for fw/bw
**Motivation**
Improves clarity, avoids confusion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164419
Approved by: https://github.com/bdhirsh
ghstack dependencies: #164296, #164321
Reviewed GPT5 summary:
**Summary / Goal**
Improve error reporting when local_map subgraph input/output counts mismatch placement info.
**Details**
- Adds descriptive runtime error messages.
**Motivation**
Helps debug local_map misalignments.
```python
AssertionError: Expecting 2 inputs to local_map function based on placements, but found 1. If the count matches for eager, Dynamo may have flattened inputs to the function or found additional tensors used via closures. Please adjust the input placements to match what the traced graph sees:
class GraphModule(torch.nn.Module):
def forward(self, l_args_0_: "f32[8, 8, 16]"):
# File: /home/xmfan/core/a/pytorch/test/higher_order_ops/test_local_map.py:523 in mismatch_input, code: return x + scalar, scalar
child: "f32[8, 8, 16]" = l_args_0_ + 10; l_args_0_ = None
return (child,)
.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164321
Approved by: https://github.com/ezyang, https://github.com/mlazos
ghstack dependencies: #164296
Reviewed GPT5 summary:
**Summary / Goal**
Add a utility to compute expected local tensor sizes and strides under *even sharding* in dtensor.
**Details**
- New function in `torch/distributed/tensor/_utils.py`.
- Computes local sizes/strides given global shape, mesh, and placements.
- Enforces divisibility of global dimension by mesh size (strict even sharding).
- Complements `compute_global_tensor_info`.
**Motivation**
Ensures correctness for stride/layout computations in distributed tensors.
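A simplified sketch of the size computation under strict even sharding (the real utility in `torch/distributed/tensor/_utils.py` also handles strides and general placements):
```python
def expected_local_size_even_sharding(global_size, mesh_sizes, shard_dims):
    """Simplified sketch: divide each sharded dim by its mesh size, enforcing
    even divisibility. mesh_sizes[i] shards dimension shard_dims[i]."""
    local = list(global_size)
    for mesh_size, dim in zip(mesh_sizes, shard_dims):
        if local[dim] % mesh_size != 0:
            raise ValueError(
                f"dim {dim} of size {local[dim]} is not evenly divisible by "
                f"mesh size {mesh_size}"
            )
        local[dim] //= mesh_size
    return tuple(local)

# e.g. an (8, 16) tensor sharded on dim 0 over a mesh of size 4 -> (2, 16)
assert expected_local_size_even_sharding((8, 16), [4], [0]) == (2, 16)
```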
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164296
Approved by: https://github.com/ezyang
This PR fixes the condition
```
if arg_signatures is None and self.kernel_type == "cpp" or "extern"
```
which is interpreted as
```
if (arg_signatures is None and self.kernel_type == "cpp") or ("extern"):
```
and it always evaluates to `True`. According to the context, the intention was
```
if arg_signatures is None and (self.kernel_type == "cpp" or self.kernel_type == "extern"):
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165033
Approved by: https://github.com/Skylion007
## TODO
Check on multi indices
```Python
@cute.jit
def score_mod(tSrS_ssa, b_idx, h_idx, q_idx, kv_idx, buffers):
in_ptr4 = buffers[0]
tmp0 = tSrS_ssa
tmp1 = b_idx
tmp2 = h_idx
tmp3 = cute.make_fragment(1, cutlass.Int32)
tmp4 = tmp3.store(32*tmp1 + tmp2)
tmp5 = cute.make_fragment(1, cutlass.BFloat16)
tmp6 = tmp3[0]
tmp7 = tmp5[0] = (in_ptr4[tmp6])
tmp8 = (tmp5.load()).to(cutlass.Float32)
tmp9 = (tmp0 + tmp8)
tSrS_ssa = tmp9
return tSrS_ssa
```
I don't think that
```
tmp4 = tmp3.store(32*tmp1 + tmp2)
tmp5 = cute.make_fragment(1, cutlass.BFloat16)
tmp6 = tmp3[0]
tmp7 = tmp5[0] = (in_ptr4[tmp6])
```
is right, since this tmp6 value will be larger than the actual index dim (in this case it's B) -> see if it's possible to 1d index
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162031
Approved by: https://github.com/v0i0
ghstack dependencies: #161118
This fixes AOTAutograd rms_norm not being bitwise equivalent to
eager, because it avoids a decomposition. You can force the
decomposition by having the decomposition in the dispatch table,
but if eager mode wouldn't have decomposed (because it went to the fused
one), we now preserve the fused call by default.
This largely reverts https://github.com/pytorch/pytorch/pull/103275/ for view ops. This means that in inference mode we could hit the wrong C++ kernel; if this occurs we should just SymInt'ify the C++ kernel.
Another neat side effect of this change is that Inductor's generated kernels for rms_norm now have rms_norm in their name.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164939
Approved by: https://github.com/bdhirsh
Previously, when torch.are_deterministic_algorithms_enabled() is True, Inductor would
- skip autotuning pointwise kernels
- pick a fixed (and quite arbitrary) config for reductions
This PR changes the behavior to:
- for pointwise kernels, we still do autotuning
- for reduction kernels, we use the recently added heuristic to pick a config (see the sketch below)
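A small sketch of exercising this path, assuming a CUDA device is available:
```python
import torch

# Inductor checks torch.are_deterministic_algorithms_enabled(); when True,
# the pointwise work below is still autotuned, while the reduction gets its
# config from the heuristic described above instead of autotuning.
torch.use_deterministic_algorithms(True)

@torch.compile
def fn(x):
    return (x * 2).sum(dim=-1)   # pointwise op feeding a reduction

if torch.cuda.is_available():
    fn(torch.randn(1024, 1024, device="cuda"))
```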
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164905
Approved by: https://github.com/jansel, https://github.com/v0i0
ghstack dependencies: #164801, #164532, #164904
Verify the deterministic mode with torch.compile benchmark scripts.
Here is what my testing script does (pasted at the end):
- run a model in default mode and save its result
- run the model again in default mode, but distort the benchmarking results; compare with the saved result
- do the above again in deterministic mode
I tested a few models:
- BertForMaskedLM and GoogleFnet: I can repro the numeric change by distorting the benchmark result in the default mode. The non-determinism is gone in deterministic mode.
- DistillGPT2: I cannot repro the numeric change by distorting the benchmarking result in the default mode. This does not surprise me much: a reduction order change does not always cause a numeric change.
```
model=GoogleFnet
export TORCHINDUCTOR_WRITE_ARE_DETERMINISTIC_ALGORITHMS_ENABLED=0
export TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 # disable autotune cache
export TORCHINDUCTOR_FX_GRAPH_REMOTE_CACHE=0
export TORCHINDUCTOR_FX_GRAPH_CACHE=0
export TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor_shunting/
export TORCHINDUCTOR_BENCHMARK_KERNEL=1
export TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1
export INDUCTOR_TEST_DISABLE_FRESH_CACHE=1
# Non deterministic mode
# --float32 rather than --amp to make it easier to repro non-deterministic
echo "Save results for non-deterministic mode"
python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --save-model-outputs-to=/tmp/saved-non-deterministic.pkl
echo "Compare results with distorted benchmarking in non-deterministic mode"
TORCHINDUCTOR_DISTORT_BENCHMARKING_RESULT=inverse python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --compare-model-outputs-with=/tmp/saved-non-deterministic.pkl
echo "Save results for deterministic mode"
TORCHINDUCTOR_DETERMINISTIC=1 python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --save-model-outputs-to=/tmp/saved-deterministic.pkl
echo "Compare results with distorted benchmarking in deterministic mode"
TORCHINDUCTOR_DETERMINISTIC=1 TORCHINDUCTOR_DISTORT_BENCHMARKING_RESULT=inverse python benchmarks/dynamo/huggingface.py --backend inductor --float32 --accuracy --only $model --training --disable-cudagraphs --compare-model-outputs-with=/tmp/saved-deterministic.pkl
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164904
Approved by: https://github.com/jansel, https://github.com/v0i0
ghstack dependencies: #164801, #164532
After lean export, we might want to be able to restore the original FQNs. This PR refactors a util function in export that mostly does this. Note that strict_export also has some complicated logic for updating the graph signature, which we don't want. I think we can gradually refine this util by handling constants, non-persistent buffers, etc., and change how strict_export does it today.
Differential Revision: [D83687844](https://www.internalfb.com/diff/D83687844)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164401
Approved by: https://github.com/avikchaudhuri
Previously we hardcoded the assumption in cuDNN that the inputs would be dense, which breaks when, e.g., the user chunks tensors, yielding noncontiguous inputs.
New test added to check this when `TORCH_CUDNN_SDPA_NESTED_TENSOR_ENABLED=1` is set in `test/test_transformers.py`
One issue I noticed was that the old gating of nested tensors in `sdp_utils.cpp` seems to be a no-op: all of the inputs are reported as "dense" by the time that function is called in the nested tensor tests in `test/test_nestedtensor.py -k sdpa`.
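A minimal sketch of the failure pattern described (chunking a packed projection yields non-contiguous q/k/v); whether the cuDNN backend is actually selected depends on the usual SDPA dispatch:
```python
import torch
import torch.nn.functional as F

if torch.cuda.is_available():
    # Chunking the packed qkv projection produces non-contiguous q/k/v,
    # violating the old "inputs are dense" assumption in the cuDNN path.
    qkv = torch.randn(2, 8, 128, 3 * 64, device="cuda", dtype=torch.float16)
    q, k, v = qkv.chunk(3, dim=-1)
    assert not q.is_contiguous()
    out = F.scaled_dot_product_attention(q, k, v)
```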
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164958
Approved by: https://github.com/Skylion007, https://github.com/drisspg
Summary: We have an internal user for whom caching broke because the paths being unzipped are likely different per host. We can't think of a use case where a path change matters when the file content has not changed, so we are removing this part.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165020
Approved by: https://github.com/oulgen
In aot_stage2_autograd:
Before calling fw_compiler, we run pre_compile for the following wrappers:
* FakifiedOutWrapper
* FunctionalizedRngRuntimeWrapper
After, we run post_compile for the following wrappers:
* EffectTokensWrapper
* AOTDispatchSubclassWrapper
* FunctionalizedRngRuntimeWrapper
* FakifiedOutWrapper
In aot_stage2_inference:
Before calling inference compiler, we run pre_compile for the following wrappers (same as above):
* FakifiedOutWrapper
* FunctionalizedRngRuntimeWrapper
After, we run post_compile for the following wrappers (different than above):
* FunctionalizedRngRuntimeWrapper
* FakifiedOutWrapper
* EffectTokensWrapper
* AOTDispatchSubclassWrapper
This PR makes both paths run the post_compiles in the same order.
Differential Revision: D84213657
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165016
Approved by: https://github.com/zhxchen17, https://github.com/bdhirsh
This commit makes several cleanup changes to MHA.cpp, the main
one of which is removal of shared_ptr from MHAGraphCache as the
cache does not actually intend to share ownership. The changes are:
1. Remove shared_ptr from MHAGraphCache
2. Remove template arguments from MHAGraphCache
3. Remove unnecessary optional<shared_ptr<...>> vars
4. Change some functions with auto return type to the actual type
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164895
Approved by: https://github.com/eqy
The Windows cpp tests take ~1 hour according to logs. run_test was called on each of them individually, so I batched them together into a single run_test call for all of them. I believe it now takes 30 min. I turned off TD since I don't think cpp tests are included in TD.
As always with batch scripts, I'm not sure whether the errorlevel/error-surfacing handling is correct.
This code was written with a lot of help from ChatGPT and Copilot.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164861
Approved by: https://github.com/huydhn
This PR updates build jobs that currently use linux.12xlarge to the c7i variant, which should improve build times by 15%-20% depending on the job and reduce the costs of these jobs by 10%-15%.
Signed-off-by: Thanh Ha <thanh.ha@linuxfoundation.org>
Summary:
* Add `torch.nn.functional.scaled_mm` as an abstraction around the C++
methods
* Wraps `torch._scaled_mm_v2` API by default, but user can force use of
the older `torch._scaled_mm` interface.
* Scaled MM tests now run on the new API
Test Plan:
`pytest test/test_scaled_matmul_cuda.py`
Reviewers:
Subscribers:
Tasks:
Tags:
Signed-off-by: Simon Layton <simonlaytonmeta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164142
Approved by: https://github.com/drisspg
ghstack dependencies: #164141
Summary:
* Add new scaled-MM API to future-proof / clean-up existing code.
* Scaling is explicitly described rather than inferred
* Swizzling of scales must now be defined (vs. inferred)
* Adds API support for multi-level scaling
* Refactor dispatch logic to make it easier to add new implementations
Test Plan:
Reviewers:
Subscribers:
Tasks:
Tags:
Signed-off-by: Simon Layton <simonlaytonmeta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164141
Approved by: https://github.com/drisspg
This PR includes a couple of changes to extend the FlightRecorder dump triggered by the PyTorch watchdog:
- New knobs to control the FR dump, as suggested in the public documentation, even for the watchdog
(TORCH_INCLUDE_STACK_TRACE, TORCH_INCLUDE_ONLY_ACTIVE)
- Trigger the flight recorder dump on exceptions, which could be raised by any CUDA / host-side error
(TORCH_NCCL_EXTRA_DUMP_ON_EXEC)
-> Can be used as a snapshot of the workload's progress for post-mortem analysis
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164591
Approved by: https://github.com/fduwjj
This fixes AOTAutograd rms_norm not being bitwise equivalent to
eager, because it avoids a decomposition. You can force the
decomposition by having the decomposition in the dispatch table,
but if eager mode wouldn't have decomposed (because it went to the fused
one), we now preserve the fused call by default.
This largely reverts https://github.com/pytorch/pytorch/pull/103275/ for view ops. This means that in inference mode we could hit the wrong C++ kernel; if this occurs we should just SymInt'ify the C++ kernel.
Another neat side effect of this change is that Inductor's generated kernels for rms_norm now have rms_norm in their name.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164939
Approved by: https://github.com/bdhirsh
ghstack dependencies: #164573
In https://github.com/pytorch/pytorch/pull/106824, export decided to take the slow path for the MultiHeadAttention module (see that PR's description for why). But that PR eventually caused a divergence between Dynamo and export.
Today, strict-export does not inline into builtin modules (like MultiHeadAttention), so make_fx sees the original nn.Module and takes the slow path. But compile inlines into the nn.Module, and at that point the condition `_is_make_fx_tracing` is False. As a result, Dynamo takes the fast path, so a different op is called.
This divergence is undesirable. There are two ways to fix it:
1) Make export take the fast path - as explained in https://github.com/pytorch/pytorch/pull/106824, this might be difficult. So, we go with (2).
2) Make compile take the slow path as well - this is easy to implement. The con is that PyTorch eager and compile will use different operators, which can cause numerics issues, etc.
Since (2) is easy to do, we will follow this path. We are tracking the issue in https://github.com/pytorch/pytorch/issues/164062
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164721
Approved by: https://github.com/avikchaudhuri, https://github.com/tugsbayasgalan
Summary: Noticed that the combo kernel partition will sometimes contain an empty group. Skip kernel generation in this case to unblock head model launching. The change in this diff is safe, but it would be better to root-cause why an empty group is being created.
Test Plan:
Lowering passed after applying the diff
Differential Revision: D84134471
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164918
Approved by: https://github.com/mlazos
We also want a Python-side API for users to reset FR recording of FR entries. We don't need to reset PGNCCL's member counter since we are creating a new PGNCCL anyway, but FR is a global ring buffer, so we do need to reset it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164988
Approved by: https://github.com/tushar00jain
ghstack dependencies: #164752
Builds on top of https://github.com/pytorch/pytorch/pull/163673 and https://github.com/pytorch/pytorch/pull/164174. This will be used in the follow-up PRs to apply regional inductor compilation.
The existing implementation let Dynamo trace into `torch.fx.traceback.annotate`, but that's not what we want. We want Dynamo to essentially run the torch.fx.traceback.annotate function eagerly, so that every FX node created in the Dynamo FX graph carries the custom metadata (a usage sketch follows the list below).
What does not work?
* We still have to set the context manager `torch.fx.traceback.preserve_node_meta()` in the user code because CI was unhappy. This can be fixed, but with some perseverance.
* This does not work with graph breaks yet. But we can solve that problem, if needed, in a separate PR.
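A hedged usage sketch of the intended behavior (the dict payload passed to `torch.fx.traceback.annotate` and the backend choice are assumptions):
```python
import torch
import torch.fx.traceback as fx_traceback

def fn(x):
    # Dynamo runs annotate() eagerly instead of tracing into it, so the FX
    # nodes created for this region carry the custom metadata.
    with fx_traceback.annotate({"region": "block_0"}):   # payload form assumed
        return torch.relu(x) + 1

# Per the caveat above, preserve_node_meta() still has to wrap the call site.
with fx_traceback.preserve_node_meta():
    torch.compile(fn, backend="aot_eager")(torch.randn(4))
```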
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164678
Approved by: https://github.com/SherlockNoMad, https://github.com/jansel, https://github.com/xmfan
- Update the Memory Estimator to use node storages for analysis, as opposed to manually looking at operator schemas, which simplifies the bookkeeping. This will also allow me to reuse this component elsewhere.
- Factor it out into a separate class so that the same logic can be used in scheduling (node allocations / aliasing / uses).
- Add tests for correctness - right now only on fwd/bwd individually, not with both together.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164783
Approved by: https://github.com/ruisizhang123
ghstack dependencies: #164738
Turns out codegen'ing a nested step graph break is significantly more complicated than first thought. The optimized function should actually do:
- call graph/load values/do side effects etc.
- call into the leaf's resume function, but skipped (this is essentially a step graph break function for just the leaf function)
- call into all the other resume functions, traced.
This PR also adds `torch._dynamo.step_unsupported()`, which can be used for internal testing purposes to better test step graph break handling.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162737
Approved by: https://github.com/Lucaskabela
ghstack dependencies: #160601
This is needed because if we codegen cells for nested frames AFTER side effects, then reconstruction could get messed up. From below:
>The added test case demonstrates the reconstruction failure if we kept cell codegen at the original place (only happens with nested graph breaks since we reconstruct nested frame cells from VariableTracker rather than directly using LOAD_CLOSURE).
>At a high level, what happened before this change was that side_effects was pruning the cells (I don't recall exactly why this happens), and because cells were codegen'd after the side effects were applied, we were unable to properly reconstruct the cell. The error I was seeing was a list/tuple IndexError.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160601
Approved by: https://github.com/mlazos
## MOTIVATION
To generalize Distributed checkpoint test cases for non-CUDA devices
## CHANGES
18 test files in test/distributed/checkpoint/ updated with minimal device-abstraction changes:
- Use device_type from DTensorTestBase wherever appropriate
- Replaced hard-coded device names with torch.accelerator.current_accelerator()
- Extended the multi-GPU decorator to other devices
test/distributed/checkpoint/test_state_dict_stager.py has a large diff only because I renamed cuda_obj to gpu_obj; the functional change is minimal.
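A small sketch of the substitution pattern used in the tests (the CPU fallback for accelerator-less hosts is an assumption):
```python
import torch

# Instead of hard-coding "cuda", derive the device type from the current
# accelerator so the same test runs on CUDA, XPU, etc.
acc = torch.accelerator.current_accelerator()
device_type = acc.type if acc is not None else "cpu"

t = torch.ones(4, device=device_type)
```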
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159242
Approved by: https://github.com/guangyey, https://github.com/d4l3k
**Summary:** Created a test so that we can verify that a model that has been pipelined + replicated has the same gradients as a reference model. To do this, I mapped the layers and their parameters in each partial model to the original full model and then compared the gradients.
**Test Case**
1. pytest test/distributed/_composable/test_composability/test_pp_composability.py -k test_replicate_pp_grads
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164890
Approved by: https://github.com/H-Huang
This is a mostly mechanical change that makes all DeviceMesh members private and exposes public property APIs instead. This is not a BC-breaking change since the new API still guarantees BC.
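A tiny sketch of the pattern (the attribute name is chosen for illustration; this is not the full DeviceMesh API):
```python
class DeviceMesh:
    def __init__(self, device_type: str):
        self._device_type = device_type   # member is now private

    @property
    def device_type(self) -> str:
        # Public read-only accessor keeps the old attribute access working,
        # so existing callers stay BC.
        return self._device_type
```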
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164954
Approved by: https://github.com/fegin
ghstack dependencies: #164750
This is going to be used in https://github.com/pytorch/torchtitan/issues/1682
Add a `register_custom_function` API to the `_PipelineScheduleRuntime` which allows users to supply a custom function that dynamically replaces a runtime operation.
The signature of the callback should look like:
```python
class _CustomFunctionProtocol(Protocol):
def __call__(self, action: _Action, ctx: _PipelineContext) -> None: ...
```
`_PipelineContext` contains a reference to the schedule which is executing the operations.
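A hedged sketch of a callback matching the protocol above; the `ctx.schedule` attribute name and the exact key accepted by `register_custom_function` are assumptions, so the registration call is left commented out:
```python
def my_forward(action, ctx):
    # Matches _CustomFunctionProtocol: `action` is the scheduled _Action and
    # `ctx` carries a reference to the executing schedule (attribute name assumed).
    schedule = ctx.schedule
    print(f"custom FORWARD replacing {action} on {type(schedule).__name__}")

# Hypothetical registration on a _PipelineScheduleRuntime instance:
# runtime.register_custom_function(FORWARD, my_forward)
```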
### Testing
Added a test that registers custom functions for `FORWARD` and `OVERLAP_F_B`, which are just the same implementations as those used in the default schedule runtime, and checks that the schedule can still run, numerics are correct, and the callbacks are executed the correct number of times.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162016
Approved by: https://github.com/fegin
We allowed passing in a PG option via https://github.com/pytorch/pytorch/pull/159371 and cleaned up the Meta-internal usage of `_set_mesh_dim_group_options`. Since this is a private API with no BC guarantee, we want to remove it directly so that people use the new behavior from now on.
Also, since we now allow passing a PG in both the DeviceMesh constructor and the flatten API, we also want to get rid of the global PG option override variable.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164750
Approved by: https://github.com/lw, https://github.com/fegin
## Summary
- add a CuBLASReductionOption enum so the CUDA context can track reduced-precision and split-K options
- extend the Python bindings, backend helpers, and docs to accept an optional allow_splitk argument for fp16/bf16 matmul controls
- update cuBLAS/cuBLASLt call sites plus dynamo guards and tests to respect the new combinations
## Testing
- python test/test_cuda.py TestCuda.test_cublas_allow_fp16_reduced_precision_reduction_get_set -v *(fails: ModuleNotFoundError: No module named 'psutil')*
------
https://chatgpt.com/codex/tasks/task_e_68e404623178832f8a3e1d34e1e175da
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164766
Approved by: https://github.com/malfet, https://github.com/albanD