We don't care about the Dynamo x TorchScript composition, so I'm
disabling these tests (so they don't get reported as flaky). Not
disabling all of the TorchScript tests yet because they have been useful
to catch random bugs.
Test Plan:
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128731
Approved by: https://github.com/williamwen42
FIXES #113263. Same idea as in https://github.com/pytorch/pytorch/pull/113417, but we need a more intrusive C API to silently no-op default saved tensor hooks, in order to support user code that uses torch.autograd.disable_saved_tensors_hooks (see test_unpack_hooks_can_be_disabled). We mock the output of get_hooks while leaving push/pop untouched.
For compiled autograd, we're firing pack hooks once and unpack hooks twice right now; I'll look into this separately from this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123196
Approved by: https://github.com/soulitzer
These models are really flaky. I went into the CI machine and ran the model many times; sometimes it fails, sometimes it passes. Even PyTorch eager results change from run to run, so the accuracy comparison is fundamentally broken/non-deterministic. I am hitting these issues more frequently in the inlining work. There is nothing wrong with inlining; I think these models are on the edge of an already-broken accuracy measurement, and inlining just pushes them further in that broken direction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128715
Approved by: https://github.com/eellison
This is an extension of [ModuleTracker](https://github.com/pytorch/pytorch/blob/main/torch/utils/module_tracker.py) with added features and bug fixes.
1. Allows installing user-defined hooks to be called at the pre-fw, post-fw, pre-bw and post-bw points of the ``ModTracker`` (see the sketch after this list).
2. Adds a function ``get_known_fqn`` that retrieves the FQN of the module as tracked by the ``ModTracker``.
3. Only registers the multi-grad hooks if we are in the forward pass. This is important because a module's pre-fw and post-fw hooks get called during the backward pass under AC, and we do not want to register multi-grad hooks in that case.
4. Sets the kwarg ``always_call=True`` for post-fw hooks, so that they are called post AC.
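A minimal sketch of installing such hooks, assuming ``ModTracker`` lives under ``torch.distributed._tools.mod_tracker`` and exposes the ``register_user_hooks`` helper described in point 1 (the keyword names and hook signatures are assumptions, so the hooks below accept ``*args`` loosely):
```python
import torch
from torch.distributed._tools.mod_tracker import ModTracker  # assumed location

tracker = ModTracker()

def pre_fw(module, *args):
    print("pre-forward:", tracker.get_known_fqn(module))

def post_fw(module, *args):
    print("post-forward:", tracker.get_known_fqn(module))

tracker.register_user_hooks(pre_fw_hook=pre_fw, post_fw_hook=post_fw)

model = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.ReLU())
with tracker:
    model(torch.randn(2, 4)).sum().backward()
```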
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128508
Approved by: https://github.com/wanchaol
Adds a `C10_UBSAN_ENABLED` macro and uses it to disable `SymIntTest::Overflows` (which fails under the `signed-integer-overflow` UBSAN check).
Also cleans up UBSAN guard in `jit/test_misc.cpp` to use `C10_UBSAN_ENABLED` and the existing `C10_ASAN_ENABLED` instead of locally defining `HAS_ASANUBSAN`.
> NOTE: This should fix `SymIntTest::Overflows` failing under ubsan in fbcode too...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127967
Approved by: https://github.com/atalman, https://github.com/d4l3k, https://github.com/malfet
This PR renames the implementation details of register_fake to align
more with the new name. It is in its own PR because this is risky
(torch.package sometimes depends on private library functions and
implementation details).
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123938
Approved by: https://github.com/williamwen42
Tries to fix #127677.
# Context
Just as @peterbell10 pointed out, we have the following scenario:
```
a = ops.indirect_indexing(...)
b = ops.index_expr(a, ...)
c = ops.indirect_indexing(b, ...)
```
We can repro this as:
```
def forward(self, arg0_1, arg1_1, arg2_1):
iota = torch.ops.prims.iota.default(arg0_1, start = 0, step = 1, index=0);
repeat_interleave = torch.ops.aten.repeat_interleave.Tensor(arg1_1);
index = torch.ops.aten.index.Tensor(iota, [repeat_interleave]);
index_1 = torch.ops.aten.index.Tensor(arg2_1, [index]);
return (index_1,)
```
which should generate a JIT py file like this:
```
def triton_poi_fused_index_select_0(in_ptr0, in_ptr1, out_ptr0, ks0, xnumel, XBLOCK : tl.constexpr):
...
tmp0 = tl.load(in_ptr0 + (x1), xmask, eviction_policy='evict_last')
tmp1 = ks0
tmp2 = tmp0 + tmp1
tmp3 = tmp0 < 0
tmp4 = tl.where(tmp3, tmp2, tmp0)
# check_bounds()
tl.device_assert(((0 <= tmp4) & (tmp4 < ks0)) | ~(xmask), "index out of bounds: 0 <= tmp4 < ks0")
def call():
arg0_1, arg1_1, arg2_1 = args
buf1 = aten.repeat_interleave.Tensor(arg1_1)
buf4 = empty_strided_cuda((u0, 64), (64, 1))
triton_poi_fused_index_select_0.run(
buf1, arg2_1, buf4, s0,
triton_poi_fused_index_select_0_xnumel,
grid=grid(triton_poi_fused_index_select_0_xnumel),
stream=stream0)
```
# Issue
In our `IndexPropagation.indirect_indexing()` call we have `expr=indirect0` which is spawned in `LoopBodyBlock.indirect_indexing()`.
3b555ba477/torch/_inductor/ir.py (L8154-L8160)
When we try to see if we can prove its bounds, we fail because `indirect0` isn't in `var_ranges`.
# Approach
When creating `indirect` symbols from the fallback, specify their range to be `[-size, size - 1]` to avoid a lookup error with `indirectX`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128378
Approved by: https://github.com/lezcano, https://github.com/peterbell10
**Summary**
Currently, the comm_mode_feature_examples suite does not have an example of printing sharding information for a model with nested modules. While adding the new example to the suite, I recognized a way to refactor the existing examples to make them more readable for users. The expected output can be found below:
<img width="354" alt="Screenshot 2024-06-11 at 5 41 14 PM" src="https://github.com/pytorch/pytorch/assets/50644008/68cef7c7-cb1b-4e51-8b60-85123d96ca92">
**Test Plan**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128461
Approved by: https://github.com/XilunWu
ghstack dependencies: #128369, #128451
**Summary**
I have added comments to address previous readability concerns in comm_mode.py and comm_mode_features_example.py. I also renamed files and test cases to better reflect what they are about. I removed a non-distributed test case and other lines of code that do not contribute to the example of how comm_mode can be used. Finally, I've added the expected output for each example function so users are not forced to run the code.
**Test Plan**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128451
Approved by: https://github.com/XilunWu
ghstack dependencies: #128369
**Summary**
Currently, CommDebugMode only allows displaying collective tracing at the model level, whereas a user may require a more detailed breakdown. In order to make this possible, I have changed ModuleParamaterShardingTracker by adding a string variable to track the current sub-module as well as a dictionary keeping track of the depths of the submodules in the model tree. The CommDebugMode class was changed by adding a new dictionary keeping track of the per-module collective counts as well as a function that displays the counts in a way that is easy for the user to read. Two examples using MLPModule and Transformer have been added to showcase the new changes. The expected output of the simpler MLPModule example is:
<img width="255" alt="Screenshot 2024-06-10 at 4 58 50 PM" src="https://github.com/pytorch/pytorch/assets/50644008/cf2161ef-2663-49c1-a8d5-9f97e96a1791">
**Test Plan**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/display_sharding_example.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128369
Approved by: https://github.com/XilunWu
Summary:
Added `set_module_name_qconfig` support to allow users to set configurations based on module name in `X86InductorQuantizer`.
For example, only quantize the `sub`:
```python
# Imports and the Sub submodule are filled in for completeness (the original snippet assumed them).
import torch
import torch.ao.quantization.quantizer.x86_inductor_quantizer as xiq
from torch.ao.quantization.quantizer.x86_inductor_quantizer import X86InductorQuantizer
class Sub(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(5, 5)
    def forward(self, x):
        return self.linear(x)
class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(5, 5)
        self.sub = Sub()
    def forward(self, x):
        x = self.linear(x)
        x = self.sub(x)
        return x
m = M().eval()
example_inputs = (torch.randn(3, 5),)
# Set config for a specific submodule.
quantizer = X86InductorQuantizer()
quantizer.set_module_name_qconfig("sub", xiq.get_default_x86_inductor_quantization_config())
```
- Added `set_module_name_qconfig` to allow users to set the configuration at the `module_name` level.
- Unified the annotation process to follow this order: `module_name_qconfig`, `operator_type_qconfig`, and `global_config`.
- Added `config_checker` to validate all user configurations and prevent mixing of static/dynamic or QAT/non-QAT configs.
- Moved `_get_module_name_filter` from `xnnpack_quantizer.py` into `utils.py` as it is common to all quantizers.
Test Plan:
```bash
python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_set_module_name
```
@Xia-Weiwen @leslie-fang-intel @jgong5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126044
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jerryzh168
When we don't call dynamo.reset(), we don't recompile on different dynamic shapes.
Also, some of the returned views were tuples, so when we `* 2`, we actually just copy all the inputs twice in the tuple. I changed it so that it returns just one of the values from the return tuple.
Additionally, this exposes a bug that fails with the slice operation, so I skipped it when we're testing with dynamic shapes:
```
File "/home/dberard/local/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3996, in produce_guards
sexpr = ShapeGuardPrinter(symbol_to_source, source_ref, self.var_to_sources).doprint(expr)
File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 292, in doprint
return self._str(self._print(expr))
File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 331, in _print
return printmethod(expr, **kwargs)
File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 56, in _print_Add
t = self._print(term)
File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 331, in _print
return printmethod(expr, **kwargs)
File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 366, in _print_Mul
a_str = [self.parenthesize(x, prec, strict=False) for x in a]
File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 366, in <listcomp>
a_str = [self.parenthesize(x, prec, strict=False) for x in a]
File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 37, in parenthesize
return self._print(item)
File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 331, in _print
return printmethod(expr, **kwargs)
File "/home/dberard/local/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1494, in _print_Symbol
assert self.symbol_to_source.get(expr), (
AssertionError: s3 (could be from ['<ephemeral: symint_visitor_fn>', '<ephemeral: symint_visitor_fn>']) not in {s0: ["L['x'].a.size()[1]", "L['x'].b.size()[1]", "L['x'].size()[1]", "L['x'].a.size()[1]", "L['x'].b.size()[1]", "L['x'].a.size()[1]", "L['x'].b.size()[1]"], s1: ["L['x'].a.stride()[0]", "L['x'].b.stride()[0]", "L['x'].stride()[0]", "L['x'].a.stride()[0]", "L['x'].b.stride()[0]", "L['x'].a.stride()[0]", "L['x'].b.stride()[0]"], s2: ["L['x'].a.storage_offset()", "L['x'].b.storage_offset()", "L['x'].a.storage_offset()", "L['x'].b.storage_offset()"]}. If this assert is failing, it could be due to the issue described in https://github.com/pytorch/pytorch/pull/90665
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128659
Approved by: https://github.com/YuqingJ
Summary:
GraphTransformObserver saves the SVG file of the input/output graph in each inductor pass. In my test with the CMF model, when the graph is large, Graphviz took forever to convert DOT to SVG. That is NOT acceptable.
This diff saves a DOT file instead of an SVG file to speed it up. A DOT file is also an order of magnitude smaller than an SVG.
To view these graphs, users can run `dot -Txxx input.dot` to convert the DOT file to any other format they want. Users can control how many iterations are used to lay out the graph properly. Refer to https://web.archive.org/web/20170507095019/http://graphviz.org/content/attrs#dnslimit for details.
Test Plan: buck2 test mode/dev-sand caffe2/test:fx -- fx.test_fx_xform_observer.TestGraphTransformObserver
Differential Revision: D58539182
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128634
Approved by: https://github.com/mengluy0125
It has come to my attention that some of our licenses are incorrect, so I attempted to rectify a few of them based on given recommendations for:
clog - BSD-3
eigen - MPL-2.0
ffnvcodec - LGPL-2.1
-> **hungarian - Permissive (free to use)**
irrlicht - The Irrlicht Engine License (zlib/libpng)
-> **pdcurses - Public Domain for core**
-> **sigslot - Public Domain**
test - BSD-3
Vulkan - Apache-2.0 or MIT
fb-only: more context is here https://fb.workplace.com/groups/osssupport/posts/26333256012962998/?comment_id=26333622989592967
This PR addresses the manual licensing mismatches mentioned above (the two bolded; one is being addressed in #128085), but as everything else is generated by pulling from other files, I did not address those. It is unclear what needs to be updated for the remaining entries to be accurate, or whether they are inaccurate today.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128630
Approved by: https://github.com/malfet
This matches our autograd logic for PyTorch native operators. There's no
need to invoke an autograd.Function if we're under torch.no_grad() or
if none of the inputs have requires_grad=True (invoking an
autograd.Function results in noticeable overhead).
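A minimal sketch of the dispatch condition being described, under the assumption that it boils down to a grad-mode plus requires_grad check (names here are illustrative, not the actual custom-op internals):
```python
import torch

def call_op(op_fn, autograd_fn, *tensors):
    # Only pay the autograd.Function overhead when gradients can actually flow.
    needs_autograd = torch.is_grad_enabled() and any(
        isinstance(t, torch.Tensor) and t.requires_grad for t in tensors
    )
    if needs_autograd:
        return autograd_fn.apply(*tensors)
    return op_fn(*tensors)
```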
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127976
Approved by: https://github.com/williamwen42
Fixes #127896
### Description
Add docstring to `torch/jit/frontend.py:get_default_args` function
### Checklist
- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128408
Approved by: https://github.com/malfet
In fused_all_gather_matmul, each rank copies their shard into their
local p2p buffer, performs a barrier, then performs (copy -> matmul) for
each remote shard. The (copy -> matmul)s for remote shards run on two
streams without synchronization. This not only allows for
computation/communication overlapping, but also computation/computation
overlapping which alleviates the wave quantization effect caused by
computation decomposition.
However, the synchronization-free approach doesn't work well with
fused_matmul_reduce_scatter, in which there's a barrier in every step.
Without synchronization between the two streams, a matmul in one stream
can delay a barrier in the other stream, further delaying the copy
waiting for the barrier.
This PR addresses the issue by adding synchronization between the two
streams such that the matmul of step i can only start after the barrier
of step i-1 completes. With this approach, we lose the
computation/computation overlapping, but avoid slowdown due to delayed
barrier.
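A toy sketch of the added inter-stream dependency (the real logic lives in the fused kernels; the barrier below is a stubbed placeholder and a CUDA device is assumed): the matmul of step i on the compute stream waits on an event recorded after the barrier of step i-1 on the communication stream.
```python
import torch

def barrier(step: int) -> None:
    pass  # stand-in for the per-step p2p barrier

a_shards = [torch.randn(128, 256, device="cuda") for _ in range(4)]
b = torch.randn(256, 64, device="cuda")
outs = [None] * len(a_shards)

comm_stream = torch.cuda.Stream()
compute_stream = torch.cuda.Stream()
prev_barrier_done = None

for i in range(len(a_shards)):
    with torch.cuda.stream(comm_stream):
        barrier(i)
        barrier_done = torch.cuda.Event()
        barrier_done.record(comm_stream)
    with torch.cuda.stream(compute_stream):
        if prev_barrier_done is not None:
            # matmul of step i waits for the barrier of step i-1
            compute_stream.wait_event(prev_barrier_done)
        outs[i] = torch.mm(a_shards[i], b)
    prev_barrier_done = barrier_done

torch.cuda.synchronize()
```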
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127455
Approved by: https://github.com/Chillee
ghstack dependencies: #127454
This PR changes the traced_tangents field of ViewAndMutationMeta to be cache-safe. Specifically, at runtime, the only time we need the fw_metadata's traced_tangents field is for Tensor subclass metadata from __tensor_flatten__. So instead of storing an entire FakeTensor, which has many fields that can be unserializable, we only store the result of __tensor_flatten__() on any FakeTensors representing subclasses.
That said, there's no guarantee that the result of `__tensor_flatten__` is actually serializable: if we fail to pickle it, we won't save to the cache.
To do this, we also make a small change to `__coerce_same_metadata_as_tangent__`, so that it takes in the return value of __tensor_flatten__() instead of an entire FakeTensor. Let me know if we should change the name of the function.
By doing this, we can now run the dynamic shapes cache test with autograd turned on.
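For context, here is a toy subclass showing the ``__tensor_flatten__`` protocol whose (attribute names, metadata) result is what gets cached instead of the FakeTensor itself (the class below is purely illustrative):
```python
import torch

class TwoTensor(torch.Tensor):  # illustrative wrapper subclass
    @staticmethod
    def __new__(cls, a, b):
        return torch.Tensor._make_wrapper_subclass(cls, a.shape, dtype=a.dtype, device=a.device)

    def __init__(self, a, b):
        self.a, self.b = a, b

    def __tensor_flatten__(self):
        # Only this pair (inner tensor attr names, constant metadata) needs caching.
        return ["a", "b"], {"meta": "constant metadata"}

    @staticmethod
    def __tensor_unflatten__(inner_tensors, meta, outer_size, outer_stride):
        return TwoTensor(inner_tensors["a"], inner_tensors["b"])
```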
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127618
Approved by: https://github.com/bdhirsh
Summary: I've seen this issue once in the wild and oulgen was able to repro in a unit test. The problem is this:
- We're using pickle to turn everything related to the FX graph cache key into a byte stream, then hashing the bytes to compute the cache key.
- Pickle is optimized to avoid serializing the same ID more than once; it instead drops a reference to a previously-pickled object if it encounters the same ID.
- That pickle behavior means that we can see different cache keys depending on whether an object id appears more than once in the hashed objects vs. the objects being functionally equivalent but distinct.
The cases I've investigated only involve the torch.device objects in the tensor graph args. That is, we may compile a graph with two tensor args, each referencing `torch.device('cpu')`. In one run, those devices may reference the same object; in another, they may reference distinct (but equivalent) objects. In practice, my observation is that the compiler is largely deterministic and this situation is rare. I've seen cache misses on a real benchmark only when enabling/disabling FakeTensor caching in order to introduce different code paths that otherwise produce the same fx graph. But the failing unit test seems to be enough motivation for a remediation?
I don't really love this solution, but I've failed to find another way to make the pickling phase robust to these kinds of changes, e.g., by changing the protocol version or by overriding internal methods (which would also be gross). But I'm definitely open to other creative ideas.
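For concreteness, a small standalone illustration of the pickle memoization behavior described above, which is what perturbs the hashed bytes (this assumes `torch.device` objects are not interned, so each call creates a distinct object):
```python
import hashlib
import pickle

import torch

d = torch.device("cpu")
shared = pickle.dumps([d, d])  # second occurrence becomes a memo reference
distinct = pickle.dumps([torch.device("cpu"), torch.device("cpu")])  # pickled in full twice

print(shared == distinct)  # False: same "value", different bytes
print(hashlib.sha256(shared).hexdigest() == hashlib.sha256(distinct).hexdigest())  # False
```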
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128366
Approved by: https://github.com/oulgen, https://github.com/eellison
Summary: The feature was previously disabled in fbcode because it broke the deterministic NE unit tests. It has now been on in OSS for quite a while and we verified that it has no NE impact on CMF, so we want to update the unit test and enable the feature.
Test Plan:
```
time buck2 test 'fbcode//mode/opt' fbcode//aps_models/ads/icvr/tests/ne/e2e_deterministic_tests:fm_tests -- --exact 'aps_models/ads/icvr/tests/ne/e2e_deterministic_tests:fm_tests - aps_models.ads.icvr.tests.ne.e2e_deterministic_tests.icvr_fm_test.ICVR_FM_DeterministicTest: test_icvr_fm_pt2_fsdp_multi_gpus'
```
Differential Revision: D58425432
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128555
Approved by: https://github.com/eellison
Summary: When calling a fallback op in the minimal_arrayref_interface mode with an optional tensor, a temporary RAIIAtenTensorHandle needs to be explicitly created in order to pass a pointer to the tensor as the optional tensor parameter.
Test Plan: CI
Differential Revision: D58528575
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128613
Approved by: https://github.com/hl475
When performing fused_all_gather_matmul/fused_matmul_reduce_scatter and gather_dim/scatter_dim != 0, a copy of the lhs operand (A_shard/A) is needed for layout transformation.
This copy can be avoided if the lhs operand already has the following stride order:
lhs.movedim(gather_dim, 0).contiguous().movedim(0, gather_dim).stride()
In `micro_pipeline_tp` passes, we enforce the lhs operand to have such stride order via `inductor_prims.force_stride_order`. This way if the lhs operand has a flexible layout, the copy is avoided.
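For illustration, the favorable stride order can be computed eagerly like this (the shapes and `gather_dim` are arbitrary):
```python
import torch

A = torch.randn(16, 32, 64)
gather_dim = 1
# Layout for which the extra copy can be skipped:
wanted = A.movedim(gather_dim, 0).contiguous().movedim(0, gather_dim)
print(A.stride())       # (2048, 64, 1) for a contiguous A
print(wanted.stride())  # (64, 1024, 1): the enforced stride order
```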
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127454
Approved by: https://github.com/Chillee
This PR introduces naive CPU impls for:
* `_jagged_to_padded_dense_forward()`
* `_padded_dense_to_jagged_forward()`
On the CUDA side, these are backed by lifted FBGEMM kernels. We may want to revisit the CPU versions with higher-performance implementations at a later time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127007
Approved by: https://github.com/davidberard98
We should be able to remove this as, with the new canonicalisation, we
have that `a < b` and `-a > -b` should be canonicalised to the same
expression (if SymPy does not interfere too much).
nb. I thought this would further cut compilation time, but I was running
the benchmarks wrong (not removing Triton's cache, oops). It turns out that
after the first PR in this stack, https://github.com/pytorch/pytorch/issues/128398 is fully fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128500
Approved by: https://github.com/ezyang
ghstack dependencies: #128410, #128411
https://github.com/pytorch/pytorch/pull/126717 will skip the tests in both ABI compatible and non-ABI compatible mode.
It's not expected to skip them in non-ABI-compatible mode since they can actually run successfully in that mode and only have issues in ABI-compatible mode.
We leverage the existing `xfail_list` for those that will only fail in ABI compatible mode.
- `test_qlinear_add` is already in the `xfail_list`.
- `test_linear_packed` doesn't fail either in my local run (running with `TORCHINDUCTOR_ABI_COMPATIBLE=1`) or in the CI of this PR so I didn't add it into `xfail_list`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128506
Approved by: https://github.com/jgong5, https://github.com/desertfire
Fixes https://github.com/pytorch/pytorch/issues/128544
Fixes https://github.com/pytorch/pytorch/issues/128535
We had a problem with multithreading where the nonlocals were being
clobbered. In the first place, we stored these nonlocals because we
wanted to ferry information from an autograd.Function.apply to
autograd.Function.forward.
Our new approach is:
- pass the information directly as an input to the
autograd.Function.apply. This means that the autograd.Function.forward
will receive the information too.
- this messes up ctx.needs_input_grad, which has an element per input to
forward. The user should not see the additional information we passed.
We fix this by temporarily overriding ctx.needs_input_grad to the
right thing.
- this exposed a bug in that ctx.needs_input_grad wasn't correct for
TensorList inputs. This PR fixes that too.
Test Plan:
- existing and new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128547
Approved by: https://github.com/williamwen42, https://github.com/soulitzer
https://github.com/pytorch/pytorch/issues/127572
Allow mutations in backward on forward inputs, if:
1/ not mutating metadata (enforced at compilation time);
2/ if create_graph=True: the mutated input does not require_grad (enforced at runtime, where create_graph mode can be detected by checking torch.is_grad_enabled()).
Adding input_joint_info to track mutations of inputs during joint.
Created a separate field in ViewAndMutationMeta as it is filled only after joint fn tracing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128409
Approved by: https://github.com/bdhirsh
As part of #125683, this PR adds epilogue fusion support for bf16/fp16 gemms. The key changes are as follows:
1. bf16 linear w/ epilogue fusion of some ops was originally supported via ATen oneDNN linear pointwise ops. In order to match the ATen op semantics, in-template epilogue support is added to the cpp gemm template so that we would have: "gemm + in-template epilogues -> template buffer". If the template is chosen for codegen, the in-template epilogues will be concatenated with the out-of-template epilogues that are appended during the scheduling.
2. Support bf16/fp16 legalization for `codegen_loop_bodies` which is used to generate the epilogue loops.
3. We used to leverage the in-place buffer mechanism to handle the in-place buffers in the epilogue codegen, in particular, for the reuses for output buffers of GEMM, template and epilogues. This is not correct since the output buffer is an "output" not an "in-place" buffer of the template kernel itself. Now, we use a dedicated "aliases" dict to manage such buffer reuses and the intermediate aliasing buffers are removed after codegen.
4. Add `localize_buffer` method to `LocalBufferScope` to allow the replacement of a global buffer with a local one in the given inductor IR nodes. This helps the fused loops to work on smaller-sized local buffers for better data locality.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126545
Approved by: https://github.com/jansel
Minor tweak of the comparison, as using `assert` on `torch.allclose` prevents the mismatches from being logged. Also bumps a few tolerances that seem to be causing failures on sm86/sm90.
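For reference, the kind of change being described (the tolerances here are illustrative):
```python
import torch

a = torch.randn(4)
b = a + 1e-5
# Before: a failure only reports "AssertionError", hiding where the tensors differ.
assert torch.allclose(a, b, rtol=1e-3, atol=1e-3)
# After: failures report mismatched elements and max abs/rel differences.
torch.testing.assert_close(a, b, rtol=1e-3, atol=1e-3)
```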
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128553
Approved by: https://github.com/jcaip
## Context
This PR ports the GGML int8 per-channel matrix multiplication and matrix-vector multiplication Metal shaders into the ATen library.
llama.cpp LICENSE: https://github.com/ggerganov/llama.cpp/blob/master/LICENSE
## Key Changes
Made the following changes to the original code:
* Memory layout of weight and scales is different than llama.cpp.
* Weight dequantization (scales multiplication) is done after MM is finished.
* Following PyTorch naming convention (M, K, N and assuming row major).
## Benchmark
When M = 1, the mv shader improves on the existing ATen int8mm by 40%.
When M > 4, the mm shader outperforms the existing ATen int8mm by up to 10x for large M, as shown below.

Hence the kernel chooses different shaders based on M.
## Test Plan
Tests are passing:
```
❯ python test/test_mps.py -v -k _int8_
/Users/larryliu/CLionProjects/pytorch/venv/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'dlopen(/Users/larryliu/CLionProjects/pytorch/venv/lib/python3.8/site-packages/torchvision/image.so, 0x0006): Symbol not found: __ZN3c1017RegisterOperatorsD1Ev
Referenced from: <A770339A-37C9-36B2-84FE-4125FBE26FD6> /Users/larryliu/CLionProjects/pytorch/venv/lib/python3.8/site-packages/torchvision/image.so
Expected in: <5749F98A-0A0C-3F89-9CBF-277B3C8EA00A> /Users/larryliu/CLionProjects/pytorch/torch/lib/libtorch_cpu.dylib'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
test__int8_mm_m_1_k_32_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_1_k_32_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_1_k_64_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_1_k_64_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_32_k_32_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_32_k_32_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_32_k_64_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_32_k_64_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_64_k_32_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_64_k_32_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_64_k_64_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_64_k_64_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
----------------------------------------------------------------------
Ran 12 tests in 1.180s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127646
Approved by: https://github.com/malfet
In a previous life, we used sympy.oo to represent the lower/upper bounds of integer ranges. Later, we changed this to be sys.maxsize - 1 for a few reasons: (1) sometimes we do tests on a value being exactly sys.maxsize, and we wanted to avoid a data dependent guard in this case, (2) sympy.oo corresponds to floating point infinity, so you get incorrect types for value ranges with oo, and (3) you can do slightly better reasoning if you assume that input sizes fall within representable 64-bit integer range.
After working in the sys.maxsize regime for a bit, I've concluded that this was actually a bad idea. Specifically, the problem is that you end up with sys.maxsize in your upper bound, and then whenever you do any sort of size-increasing computation like size * 2, you end up with 2 * sys.maxsize, and you end up doing a ton of arbitrary precision int computation that is totally unnecessary. A symbolic bound is better.
But especially after #126905, we can't go back to using sympy.oo, because that advertises that it's not an integer, and now your ValueRanges is typed incorrectly. So what do we do? We define a new numeric constant `int_oo`, which is like `sympy.oo` but it advertises `is_integer`. **test/test_sympy_utils.py** describes some basic properties of the number, and **torch/utils/_sympy/numbers.py** has the actual implementation.
The rest of the changes of the PR are working out the implications of this change. I'll give more commentary as inline comments.
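A few expected properties of `int_oo`, sketched from this description (see torch/utils/_sympy/numbers.py for the actual implementation):
```python
import sys

from torch.utils._sympy.numbers import int_oo

assert int_oo.is_integer           # unlike sympy.oo, it advertises integrality
assert int_oo > sys.maxsize        # strictly above any finite bound
assert int_oo + 1 == int_oo        # arithmetic saturates instead of blowing up
assert -int_oo < -sys.maxsize      # and symmetrically for the lower bound
```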
Fixes https://github.com/pytorch/pytorch/issues/127396
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127693
Approved by: https://github.com/lezcano
ghstack dependencies: #126905
Summary: There are clang errors in profiler_kineto. It would probably be a good idea to fix them as the file is already quite dense.
Test Plan: Make sure all tests under static_tests/lint_root on Phabricator pass
Differential Revision: D58431005
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128464
Approved by: https://github.com/aaronenyeshi
Related doc: https://docs.google.com/document/d/1BKyizkZPdri9mHqdDOLAUpkI7SbbKfLHRFVVpK9ZWqo/edit
Memory considerations:
- As with the existing SAC, cached values are cleared upon first use.
- We error if the user wishes to backward a second time on a region forwarded with SAC enabled.
In-place:
- We use version counting to detect whether any cached tensor has been mutated, and error if so. In-place operations that do not mutate cached tensors are allowed.
- `allow_cache_entry_mutation=True` can be passed to disable this check (useful in the case of auto AC where the user cleverly also saves the output of the in-place op).
Randomness, views
- Currently in this PR, we don't do anything special for randomness or views; the author of the policy function is expected to handle them properly. (Would it be beneficial to error? We either want to save all or recompute all random tensors.)
Tensor object preservation
- We guarantee that if a tensor does not require grad and it is saved, then what you get out is the same tensor object. If the tensor does require grad, we must detach to avoid creating a reference cycle. This is a nice guarantee for nested tensors, which care about the object identity of the offsets tensor.
Policy function
- Enum values are `{MUST,PREFER}_{SAVE,RECOMPUTE}` (bikeshed welcome). Alternatively there was `{SAVE,RECOMPUTE}_{NON_,}OVERRIDABLE`. The former was preferred because it seemed clearer that two clashing `MUST`s should error, whereas it is ambiguous whether two stacked `NON_OVERRIDABLE`s should silently ignore or error.
- On the usage of the Enum today: there is actually NO API to stack SAC policies today. The only thing the Enum should matter for in the near term is the compiler. Stacking SAC policies would be useful if someone wants to implement something like simple FSDP, but it is not perfect because with a policy of `PREFER_SAVE` you are actually saving more than autograd would save normally (this would be fixed with AC v3).
- The number of times we call the policy_fn is documented as part of the public API. We call the policy function for all ops except detach, because detach is itself called a different number of times by AC between forward and recompute.
- The policy function can be a stateful object (we do NOT make separate copies of this object for forward/recompute, the user is expected to handle that via is_recompute see below).
Tensors guaranteed to be the same tensor as-is
- Policy function signature takes ctx object as its first argument. The ctx function is an object encapsulating info that may be useful to the user, it currently only holds "is_recompute". Adding this indirection gives us flexibility to add more attrs later if necessary.
"bc-breaking" for existing users of the private API:
- Existing policy functions must now change their return value to use the Enum.
- Existing calls to `_pt2_selective_checkpoint_context_fn_gen` must be renamed to `gen_selective_checkpoint_context_fn`. The way you use the API remains the same. It would've been nice to do something different (not make the user have to use functools.partial?), but this was the easiest to compile (idk if this should actually be a constraint).
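As a rough sketch, a policy object under the API described above might look like the following (the enum class name and import location are assumptions; only the `MUST_`/`PREFER_` values are taken from this description):
```python
import torch
from torch.utils.checkpoint import CheckpointPolicy  # assumed class name and location

class SaveMatmulsPolicy:
    """Stateful policy object: save matmuls, let everything else be recomputed."""

    def __call__(self, ctx, op, *args, **kwargs):
        # ctx currently exposes is_recompute; op is the ATen overload being run.
        if op is torch.ops.aten.mm.default:
            return CheckpointPolicy.MUST_SAVE
        return CheckpointPolicy.PREFER_RECOMPUTE
```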
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125795
Approved by: https://github.com/Chillee, https://github.com/fmassa
Summary: prim::dtype has the signature `(Tensor a) -> int`, where it gets the dtype of the tensor and returns the integer corresponding to this dtype based on the enum in ScalarType.h. Previously we were converting prim::dtype by returning the actual dtype of the tensor (ex. torch.float32). This causes some incorrect control flow behavior, specifically where it checks if `prim::dtype(tensor) in [3, 5, 7]`, where [3, 5, 7] correspond to torch.int32, torch.float16, torch.float64. This control flow would always return False because we would be comparing torch.float32 against the integers [3, 5, 7], which is a type mismatch.
Test Plan: 7/22 internal models are now convertible and runnable in eager and sigmoid! P1410243909
Reviewed By: jiashenC
Differential Revision: D58469232
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128517
Approved by: https://github.com/jiashenC
`at::detail::computeStorageNbytesContiguous` does fewer data-dependent tests than `at::detail::computeStorageNbytes`. Therefore, use of the former is more likely to succeed with dynamic shapes. This PR detects is_contiguous and dispatches to the appropriate function. This should be helpful in unblocking aot_eager for torchrec. As an aside, this is an alternative to the unsound solution I had first proposed in another [PR](#128141).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128515
Approved by: https://github.com/ezyang
This PR lifts internal lowerings written for FBGEMM kernels that do jagged <-> padded dense conversions. In particular, this PR provides lowerings and meta registrations for the following ATen ops:
* `_jagged_to_padded_dense_forward()`
* `_padded_dense_to_jagged_forward()`
* NB: if `total_L` is not provided, the output shape is data-dependent. An unbacked SymInt is used for this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125968
Approved by: https://github.com/davidberard98
This PR intends to support the aten operations with the `out` tensor.
Currently, the AOT compile always does **NOT** keep input tensor mutations. According to the comments, this is because it has not encountered such a use case.
> For now there's no use case involving keeping input mutations in the graph (which we can only do in the inference case anyway). We can add this later if we need to.
However, for aten operations it is common for the `out` tensor to be an input parameter that needs to be mutated. This PR intends to support this by adding a `keep_inference_input_mutations` flag (`aot_inductor.keep_inference_input_mutations`). This flag gives the callee the flexibility to decide whether the AOT compile needs to keep input tensor mutations in the graph.
Take `clamp` as an example as follows.
```python
out_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(-2.0)
inp_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(1.0)
min_tensor = inp_tensor - 0.05
max_tensor = inp_tensor + 0.05
torch.clamp(input=inp_tensor, min=min_tensor, max=max_tensor, out=out_tensor)
```
W/O this PR
```python
def forward(self):
    arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";
    arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
    clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1); arg0_1 = arg1_1 = None
    clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1); clamp_min = arg2_1 = None
    return (clamp_max, clamp_max)
```
W/ this PR
```python
def forward(self):
    arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";
    arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
    clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1); arg0_1 = arg1_1 = None
    clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1); clamp_min = arg2_1 = None
    copy_: "f32[128]" = torch.ops.aten.copy_.default(arg3_1, clamp_max); arg3_1 = clamp_max = None
    return (copy_,)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124926
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/angelayi
Summary:
D56907877 modified the OSS commSplit. However, commSplit requires every rank to make the call even if it is no-color. ncclCommSplit will not create a communicator for no-color ranks, hence this line of code can potentially throw an error like `NCCL WARN CommUserRank : comm argument is NULL`.
Revert this change from D56907877
Test Plan: CI
Differential Revision: D58436088
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128459
Approved by: https://github.com/shuqiangzhang
Summary:
The previous side effect pruning algorithm would keep many dead cell
variables alive. For example, in
https://github.com/pytorch/pytorch/issues/125078, the compiled function
has one return but there were three in the Dynamo graph due to two
dead cell variables not being pruned away.
This PR adds a corrected algorithm. "new cell variables" are alive if
they can be reached from one of the following:
1. any of the tx.symbolic_locals or tx.stack (that is, if they are
involved in a return from the function or intermediate variable
during a graph break). Example: an alive NestedUserFunctionVariable
2. "mutations to pre-existing objects". Example: appending a
NestedUserFunctionVariable to a global list
The new algorithm reflects this, but please let me know if there are
more cases to handle.
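A hypothetical repro shape for the dead-cell situation (simplified and illustrative, not the actual code from issue #125078):
```python
import torch

@torch.compile(backend="eager", fullgraph=True)
def f(x):
    y = x + 1

    def unused_closure():  # dead: never returned or stored, so its cell should be pruned
        return y

    return y.sin()  # the Dynamo graph should end up with a single output

f(torch.randn(4))
```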
Test Plan:
- existing tests (afaict, test/dynamo/test_python_autograd is the best
SideEffects test case we have)
- see in test/dynamo/test_higher_order_ops that the expecttests changed
-- the functorch dynamo graphs no longer return dead cellvars.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128028
Approved by: https://github.com/jansel
This PR makes it so that we lazily save to the cache when backward is called, instead of always saving ahead of time. We have to pass a closure to post_compile to prevent cyclic dependencies.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126999
Approved by: https://github.com/bdhirsh
ghstack dependencies: #126791
This is a short-term fix (for 2.4). In the longer term we should
fix https://github.com/pytorch/pytorch/issues/128430
The problem is that warnings.warn calls inside Dynamo print
every time. Python warnings are supposed to print once, unless their
cache is reset: Dynamo ends up resetting that cache every time it runs.
As a workaround we provide our own warn_once cache that is keyed on the
warning msg. I am not worried about this increasing memory usage because
that's effectively what python's warnings.warn cache does.
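A minimal sketch of the workaround (illustrative, not the actual Dynamo code): a process-lifetime cache keyed on the message, so repeated warnings are dropped.
```python
import warnings

_warned_msgs = set()

def warn_once(msg, category=UserWarning):
    if msg in _warned_msgs:
        return
    _warned_msgs.add(msg)
    warnings.warn(msg, category)
```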
Test Plan:
- fix tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128456
Approved by: https://github.com/anijain2305
Fix https://github.com/pytorch/pytorch/issues/128287.
Previously, the assertions in `linear_add_bias` were pretty fragile:
```
assert packed_weight_node.name == "_reorder_linear_weight"
assert transpose_weight_node.name == "permute_default"
```
because the `name` can change to `_reorder_linear_weight_<id>`/`permute_default_<id>` if we have more than one reorder/permute.
Checking `target` instead of `name` solves this issue.
The UT is also updated to match more than one `linear_add_bias` pattern to cover this case.
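A runnable toy showing why matching on `name` is fragile while `target` is stable (a plain FX trace, not the actual Inductor pattern):
```python
import torch
from torch.fx import symbolic_trace

def f(w):
    # Two permutes: FX de-duplicates node names by appending a numeric suffix.
    return w.permute(1, 0) + w.permute(1, 0).relu()

gm = symbolic_trace(f)
permutes = [n for n in gm.graph.nodes if n.op == "call_method" and n.target == "permute"]
print([n.name for n in permutes])    # e.g. ['permute', 'permute_1'] -- names shift
print({n.target for n in permutes})  # {'permute'} -- the target stays stable
```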
Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128473
Approved by: https://github.com/jgong5
Summary:
A number of features rely on TCPStore as a control plane. By default the TCPStore server is started on the rank0 trainer, and this can create a race condition where rank0 may exit (error or graceful exit) and any other ranks reading/writing will fail.
Solution: the TCPStore server should outlive all the trainer processes. Moving the ownership of the TCPStore to the torchelastic agent naturally fixes the lifecycle of the server.
Static rendezvous in torchelastic already supports sharing of the TCPStore server. We are extending this to the more commonly used c10d rendezvous handler.
Any handler that would like to manage the TCP store has to:
- Return true on `use_agent_store` property
- `RendezvousInfo`.`RendezvousStoreInfo`#[`master_addr/master_port`] values refer to managed TCPStore (those are returned on `next_rendezvous` call)
Note: in some instances users may want to use non-TCPStore based stores for the torchelastic rendezvous process, so the handler will need to create and hold a reference to TCPStore (as done in this change)
Test Plan:
`cat ~/workspace/dist-demo/stores.py`
~~~
import torch
import logging
import sys
import torch.distributed as dist
import torch
import os
import time
logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler(sys.stderr))
logger.setLevel(logging.INFO)
def _run_test(store):
    if dist.get_rank() == 1:
        logger.info("Rank %s is sleeping", dist.get_rank())
        time.sleep(5)
        key = "lookup_key"
        logger.info("Checking key %s in store on rank %s", key, dist.get_rank())
        store.check([key])
    else:
        logger.info("rank %s done", dist.get_rank())
def main() -> None:
    use_gpu = torch.cuda.is_available()
    dist.init_process_group(backend="nccl" if use_gpu else "gloo")
    dist.barrier()
    logger.info(f"Hello World from rank {dist.get_rank()}")
    host = os.environ['MASTER_ADDR']
    port = os.environ['MASTER_PORT']
    world_size = os.environ['WORLD_SIZE']
    logger.info("testing TCPStore")
    store = dist.TCPStore(
        host_name=host, port=int(port), world_size=int(world_size),
    )
    _run_test(store)
if __name__ == "__main__":
    main()
~~~
With the fix (TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 or just drop the option)
~~~
(pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 python -m torch.distributed.run --rdzv-backend c10d --nproc-per-node 3 ~/workspace/dist-demo/stores.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Hello World from rank 1
Hello World from rank 2
Hello World from rank 0
testing TCPStore
testing TCPStore
testing TCPStore
rank 2 done
Rank 1 is sleeping
rank 0 done
Checking key lookup_key in store on rank 1
~~~
TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1
~~~
(pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 python -m torch.distributed.run --rdzv-backend c10d --nproc-per-node 3 ~/workspace/dist-demo/stores.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Hello World from rank 0
Hello World from rank 2
Hello World from rank 1
testing TCPStore
testing TCPStore
testing TCPStore
rank 0 done
rank 2 done
Rank 1 is sleeping
Checking key lookup_key in store on rank 1
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 46, in <module>
[rank1]: main()
[rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 42, in main
[rank1]: _run_test(store)
[rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 22, in _run_test
[rank1]: store.check([key])
[rank1]: torch.distributed.DistNetworkError: Connection reset by peer
E0605 17:40:22.853277 140249136719680 torch/distributed/elastic/multiprocessing/api.py:832] failed (exitcode: 1) local_rank: 1 (pid: 2279237) of binary: /home/kurman/.conda/envs/pytorch_38/bin/python
Traceback (most recent call last):
File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data/users/kurman/pytorch/torch/distributed/run.py", line 904, in <module>
main()
File "/data/users/kurman/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/data/users/kurman/pytorch/torch/distributed/run.py", line 900, in main
run(args)
File "/data/users/kurman/pytorch/torch/distributed/run.py", line 891, in run
elastic_launch(
File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/kurman/workspace/dist-demo/stores.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-06-05_17:40:22
host : devgpu011.cln5.facebook.com
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2279237)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
~~~
Differential Revision: D58180193
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128096
Approved by: https://github.com/shuqiangzhang
With inlining of NN modules, these tests no longer raise runtime errors because changing static ptrs induces a re-recording instead of a runtime error. The solution is to run the test with inlining disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128529
Approved by: https://github.com/anijain2305
ghstack dependencies: #128528
With inlining of NN modules, these tests no longer raise runtime errors because changing static ptrs induces a re-recording instead of a runtime error. The solution is to run the test with inlining disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128528
Approved by: https://github.com/anijain2305
This PR implements "V0" of AOTAutogradCache. Given an input to AOTAutograd, we calculate a cache key, then save an AOTAutogradCacheEntry.
Each AOTAutogradCacheEntry has:
- A CompiledForward and optionally a CompiledBackward
- A bunch of metadata.
CompiledForward and CompiledBackward each save the *key* to the FXGraphCache associated with the compiled object. FXGraphCache populates this key field as long as it's able to return a compiled graph given a set of inputs. We then load the same object from the FXGraphCache on an AOTAutogradCache hit.
On cache miss:
- Run AOTAutograd, up to AOTAutogradDispatch.post_compile.
- Save an AOTAutogradCacheEntry to the cache after compiling the necessary portions and receiving a cache key from FXGraphCache. In this we *always* compile the backwards ahead of time. The PR above this one implements backward lazy caching, so that we only save to the cache after compiling the backward in a lazy backward scenario.
- Return the resulting object
On cache hit:
- Run AOTAutogradCacheEntry.post_compile() on the cache key.
- This attempts to load the forward and backward graphs from FXGraphCache
- As long as we successfully load from FXGraphCache, it's a hit. We then rewrap the callable with post compile wrappers using our saved metadata.
For now, we ignore the fakified out and debug wrappers. We only save to the cache if Fakified out is turned off.
V0 Guards behavior:
FXGraphCache serializes guards that are needed in the shape_env based on the symint inputs to the graph. The invariant that AOTAutograd uses here is that the sources for symints given to it by dynamo are exactly the same as the ones it passes to inductor, for both the forward and backward passes. (This does *not* mean that the tensor values passed in are the same: only that their symints are). That is, AOTAutograd and Inductor never create new guards based on symints with *different sources* than those passed to it by inductor.
We don't currently store any AOTAutograd specific guards: my hypothesis is that FXGraphCache already stores these, as any guards generated by AOTAutograd should already be in the shape_env before calling into inductor, and we don't generate new guards post inductor. If this is needed, I'll add it in another diff.
Testing:
We'll start with some basic unit tests, but I'll be adding more and more complicated testing as the next step.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126791
Approved by: https://github.com/bdhirsh
Before: `softmax` definition uses `jagged_unary_pointwise()` (wrong)
After: `softmax` impl adjusts the `dim` arg to account for the difference in dimensionality between the outer NT and the NT's `_values`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119459
Approved by: https://github.com/soulitzer
Fixes #126950
`ptd_state_dict` with `broadcast_from_rank0=False` might miss two condition checks in `set_optimizer_state_dict`.
Here we add another condition, `full_state_dict=True`, with the corresponding tensor distribution without broadcasting when `broadcast_from_rank0=False`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128004
Approved by: https://github.com/fegin
Fixes #120570
## Description
Update torch.nanmean() docstring to mention input dtype requirement as either floating point type or complex.
Previously, the torch.mean() docstring had been updated in #120208 in a similar manner, but the torch.nanmean() docstring was not updated.
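A quick illustration of the documented dtype requirement (the exact error text may vary across versions):
```python
import torch

print(torch.nanmean(torch.tensor([1.0, float("nan"), 3.0])))  # tensor(2.): floating point is fine
try:
    torch.nanmean(torch.tensor([1, 2, 3]))  # integer input is rejected
except RuntimeError as e:
    print(e)
```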
## Checklist
- [X] The issue that is being fixed is referred in the description.
- [X] Only one issue is addressed in this pull request.
- [x] Labels from the issue that this PR is fixing are added to this pull request.
- [X] No unnecessary issues are included into this pull request.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128155
Approved by: https://github.com/malfet
Thus far TunableOp was implemented for gemm, bgemm, and scaled_mm. gemm_and_bias was notably missing. This PR closes that gap.
This PR also fixes a regression after #124362 disabled the numerical check by default. The env var to enable it no longer worked.
CC @xw285cornell
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128143
Approved by: https://github.com/Skylion007
Not requiring all functions to have types allows a lot of 'Any' types to slip in, which poison types and make mypy unable to properly typecheck the code. I want to flip the default so that new files are required to have fully typed defs and we can have a burndown list of files that fail to require full types.
The preceding stack of PRs (cut up simply to keep the number of file changes per PR reasonable) adds `# mypy: allow-untyped-defs` to any file which didn't immediately pass mypy with the flag flipped. Due to changing files and merge conflicts, it will probably be necessary to make several passes before landing this final PR, which turns the option on.
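For reference, the migration mechanics look roughly like this (the file contents are hypothetical): a not-yet-typed file opts out with the per-file comment, while new files must have fully typed defs once the default is flipped.
```python
# Per-file opt-out added to not-yet-typed files by the preceding PRs:
# mypy: allow-untyped-defs

def legacy_helper(x):  # untyped def: tolerated only because of the marker above
    return x + 1

def new_helper(x: int) -> int:  # what fully typed defs look like once the default flips
    return x + 1
```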
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127836
Approved by: https://github.com/oulgen, https://github.com/Skylion007
Summary: I admit I'm not 100% sure what I'm doing here. I'm hitting a bug in the FX graph cache when we try to evaluate a guards expression. We're creating guards that look like this:
```
Ne(CeilToInt(FloatTrueDiv(ToFloat(8*L['t0']) - 4.0, 8.0))*CeilToInt(FloatTrueDiv(ToFloat(8*L['t1']) - 4.0, 8.0)), CeilToInt(FloatTrueDiv(ToFloat(8*L['t1']) - 4.0, 8.0))) and ...
```
It looks like we have a facility to define these operators in the SYMPY_INTERP map and we're just missing FloatTrueDiv and ToFloat. What's surprising to me is that we're only hitting this problem with the FX graph cache enabled. We can create such guards, but we've never actually evaluated any?
Test Plan:
`TORCHINDUCTOR_FX_GRAPH_CACHE=1 python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --only detectron2_fcos_r_50_fpn`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128418
Approved by: https://github.com/ezyang
Summary: Improve the convert fp32 to fp16 fx pass to use to_dtype node and const folding instead of inplace conversion.
Test Plan:
```
buck2 test @//mode/{opt,inplace} //glow/fb/fx/fba/tests:test_fba_pass_manager_builder
```
Differential Revision: D57803843
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127829
Approved by: https://github.com/Skylion007
Fixes #127905
### Description
Add docstring to torch/onnx/symbolic_opset9.py:sigmoid function
### Checklist
- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128082
Approved by: https://github.com/titaiwangms
Summary:
Pass parameters from the request to the dump_nccl_trace_pickle handler.
The supported parameter names and values are all lowercase:
includecollectives={true, false}
includestacktraces={true, false}
onlyactive={true, false}
Example post is:
/handler/dump_nccl_trace_pickle?includecollectives=true&includestacktraces=false&onlyactive=true
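A hedged sketch of issuing such a request; the host, port, and HTTP verb are assumptions, and only the handler path and query parameters come from this PR:
```python
import urllib.request

# Sketch only: host, port, and the HTTP verb are assumptions; the path and
# query parameters are the ones documented above.
url = (
    "http://localhost:1234/handler/dump_nccl_trace_pickle"
    "?includecollectives=true&includestacktraces=false&onlyactive=true"
)
req = urllib.request.Request(url, data=b"", method="POST")
with urllib.request.urlopen(req) as resp:
    pickled_trace = resp.read()  # pickled flight-recorder dump
```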
Test Plan:
unit tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128307
Approved by: https://github.com/d4l3k
ghstack dependencies: #128191
Summary:
Add a unit test for the only_active flag of the _dump_nccl_trace API call.
With this flag, we only expect active records to be returned.
Test Plan:
Unit test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128191
Approved by: https://github.com/d4l3k
Fixes #127897
### Description
Add docstring to torch/onnx/symbolic_opset9.py:sigmoid function
### Checklist
- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128171
Approved by: https://github.com/titaiwangms
Summary:
The previous side effect pruning algorithm would keep many dead cell
variables alive. For example, in
https://github.com/pytorch/pytorch/issues/125078, the compiled function
has one return but there were three in the Dynamo graph due to two
dead cell variables not being pruned away.
This PR adds a corrected algorithm: "new cell variables" are alive if
they can be reached from one of the following:
1. any of the tx.symbolic_locals or tx.stack (that is, if they are
involved in a return from the function or are an intermediate variable
during a graph break). Example: an alive NestedUserFunctionVariable
2. "mutations to pre-existing objects". Example: appending a
NestedUserFunctionVariable to a global list
The new algorithm reflects this, but please let me know if there are
more cases to handle.
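A hedged illustration of the kind of closure that produces a dead cell variable (not the exact reproducer from the linked issue; names are made up):
```python
import torch

def fn(x):
    y = x + 1

    # `helper` closes over `y`, creating a new cell variable. It is never
    # returned, stored on a pre-existing object, or left on the stack at a
    # graph break, so the corrected pruning keeps it out of the graph outputs.
    def helper():
        return y * 2

    return x.sin()

compiled = torch.compile(fn)
print(compiled(torch.randn(3)))
```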
Test Plan:
- existing tests (afaict, test/dynamo/test_python_autograd is the best
SideEffects test case we have)
- see in test/dynamo/test_higher_order_ops that the expecttests changed
-- the functorch dynamo graphs no longer return dead cellvars.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128028
Approved by: https://github.com/jansel
We observed a significant compile time regression in torchtitan when turning on 2D parallel + torch.compile recently, so I decided to get a deeper understanding of why.
It turns out this is affecting **all the trainings** that have functional collectives captured in the graph, not only 2D parallel (2D parallel was just the job that happened to have collectives captured in the TP region).
The root cause is that during inductor lowering we call the comm analysis pass to get an estimated collective time for each collective node in the graph, and for every collective node checked we call `get_gpu_type()`, which under the hood calls `torch.utils.collect_env.run` to get the GPU info. However, this call is super expensive! It effectively spawns a new process and calls `nvidia-smi`, so the cost is **linear** in the number of collective nodes in the graph.
see https://github.com/pytorch/pytorch/blob/main/torch/utils/collect_env.py#L75
The fix is to add an lru cache to the function, so that we only call it once and reuse the cached result afterwards.
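A minimal sketch of the shape of the fix, assuming a `get_gpu_type`-style helper that shells out to `nvidia-smi` (the real function lives in inductor's comm analysis path; details here are illustrative):
```python
import functools
import subprocess

# Sketch only: the real helper lives in inductor's comm analysis code; this
# shows why caching turns N subprocess spawns into one per process.
@functools.lru_cache(maxsize=1)
def get_gpu_type() -> str:
    # Spawning nvidia-smi is expensive; with the cache it runs once instead of
    # once per collective node in the graph. (Assumes nvidia-smi is available.)
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
        capture_output=True, text=True,
    )
    return out.stdout.strip()
```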
torchtitan benchmark shows:
* before this fix: 2D parallel + fp8 compile time: 6min +
* after this fix: 2D parallel + fp8 compile time: 2min 48s (more than 100% improvement)
There's more room to improve the compile time, but this PR fixes the biggest regression I've found so far.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128363
Approved by: https://github.com/yf225
### Motivation
The Intel Gaudi accelerator (device name hpu) is seen to have a good pass rate with the pytorch framework UTs; however, being an out-of-tree device, we face challenges in adapting the device to natively run the existing pytorch UTs under pytorch/test. The UTs are nevertheless a good indicator of the device stack health, and as such we run them regularly with adaptations.
Although we can add the Gaudi/HPU device to generate the device-specific tests using the TORCH_TEST_DEVICES environment variable, we miss out on a lot of features such as executing for specific dtypes, skipping, and overriding opInfo. With significant changes introduced every Pytorch release, maintaining these adaptations becomes difficult and time consuming.
Hence with this PR we introduce the Gaudi device in the common_device_type framework, so that the tests are instantiated for Gaudi when the library is loaded.
The eventual goal is to introduce Gaudi out-of-tree support as an equivalent to in-tree devices.
### Changes
Add HPUTestBase, a subclass of DeviceTypeTestBase, specifying appropriate attributes for Gaudi/HPU.
Include code to check whether the Intel Gaudi software library is loaded and, if so, add the device to the list of devices considered for instantiation of device-type tests.
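A rough sketch of the shape of such a test base class; the attribute and hook names are assumptions, and the real change lives in torch/testing/_internal/common_device_type.py:
```python
# Rough sketch: attribute/hook names are assumptions; in the real change the
# class is registered only when the Intel Gaudi software library is detected.
from torch.testing._internal.common_device_type import DeviceTypeTestBase

class HPUTestBase(DeviceTypeTestBase):
    device_type = "hpu"
    primary_device = "hpu:0"

    @classmethod
    def get_primary_device(cls):
        return cls.primary_device
```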
### Additional Context
Please refer to the following RFC: https://github.com/pytorch/rfcs/pull/63/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126970
Approved by: https://github.com/albanD
gen_static_runtime_ops hasn't been updated in a while. In preparation for https://github.com/pytorch/pytorch/pull/127675 in which I need to re-run the codegen step for cumprod, I want to land these changes beforehand in case there are any other issues that arise.
I added a number of ops to the blocklist:
```
+ "_nested_tensor_storage_offsets",
+ "_nested_get_values", # no CPU backend
+ "_nested_get_values_copy", # no CPU backend
+ "_nested_view_from_jagged", # testing needs to be patched
+ "_nested_view_from_jagged_copy", # testing needs to be patched
+ "_nested_view_from_buffer", # testing needs to be patched
+ "_nested_view_from_buffer_copy", # testing needs to be patched
+ "_int_mm", # testing needs to be patched
+ "_to_sparse_csc", # testing needs to be patched
+ "_to_sparse_csr", # testing needs to be patched
+ "segment_reduce", # testing needs to be patched
```
Most of these are added just because testing doesn't work right now.
Additionally, a few `fft` ops seem to have been removed from native_functions.yaml; I'm guessing it's unlikely FFT would have been used in many real models though.
Differential Revision: [D58329403](https://our.internmc.facebook.com/intern/diff/D58329403/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128299
Approved by: https://github.com/YuqingJ
Some features of third-party devices depend on TraceUtils.h, so some of the CUDA code was removed and split into NCCLUtils files.
In addition, some common functions still remain in TraceUtils.h since I'm not sure if other devices will use them later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126969
Approved by: https://github.com/c-p-i-o
This PR properly registers the tensor used in the module compute as a parameter. This bug was hidden previously because all tensors on nn modules were considered constant by dynamo; with inlined NN modules, this is no longer the case.
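For context, a hedged minimal example of the distinction; the module and attribute names are illustrative:
```python
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        # A plain tensor attribute does not show up in .parameters() and, with
        # inlined NN modules, is no longer treated as a constant by dynamo.
        self.scale_tensor = torch.ones(4)
        # Registering it as a Parameter makes it a proper module parameter.
        self.scale_param = nn.Parameter(torch.ones(4))

    def forward(self, x):
        return x * self.scale_param + self.scale_tensor

print(list(dict(M().named_parameters())))  # ['scale_param']
```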
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128356
Approved by: https://github.com/anijain2305
ghstack dependencies: #128355
Today, inlining built-in nn modules is not compatible with parameter freezing. Freezing parameters and then constant-folding them through the graph relies on the assumption that they will not be inputs and will be static across calls to the same graph. When inlining built-in nn modules this assumption is broken and we reuse the same graph for different instances of the same nn module. There are three options: 1) abandon constant folding, 2) create a dispatcher layer (like cudagraphs) which will dispatch to the correct constant-folded graph for each distinct set of parameters, or 3) recompile.
This PR implements option 3 by introducing guards on the parameter pointers, since freezing is relatively rare and performance-sensitive. Option 2 had many more unknowns, and option 1 is not viable due to the drop in performance.
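An illustration of the pointer-guard idea (not dynamo's actual guard machinery): record the parameter pointers at compile time and treat any change as a reason to recompile.
```python
import torch.nn as nn

# Illustration only: the frozen, constant-folded graph stays valid only while
# these pointers are unchanged.
def param_signature(mod: nn.Module):
    return tuple(p.data_ptr() for p in mod.parameters())

m1, m2 = nn.Linear(4, 4), nn.Linear(4, 4)
print(param_signature(m1) == param_signature(m1))  # True: guard holds, reuse graph
print(param_signature(m1) == param_signature(m2))  # False: new instance -> recompile
```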
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128355
Approved by: https://github.com/anijain2305
We implemented a lowering for the avg_pool3d_backward operation and created tests for it.
We ran some benchmarks and achieved the following results:
```
[-------------- avgpool_3d_backwards --------------]
| Decomposed | Eager
16 threads: ----------------------------------------
(3, 5, 400, 200, 200) | 6061 | 11160
(3, 5, 300, 200, 200) | 4547 | 8372
(3, 5, 200, 200, 200) | 3032 | 5585
(3, 5, 300, 300, 300) | 10100 | 18840
(3, 5, 100, 100, 100) | 381 | 703
(3, 5, 100, 300, 200) | 2270 | 4190
(8, 8, 128, 128, 128) | 3397 | 6253
(2, 3, 150, 150, 150) | 520 | 947
(1, 3, 128, 128, 128) | 161 | 299
(8, 16, 64, 64, 64) | 851 | 1569
(1, 1, 50, 50, 50) | 17 | 11
(3, 5, 20, 40, 40) | 17 | 30
(3, 5, 10, 20, 20) | 17 | 11
(1, 1, 10, 10, 10) | 16 | 11
(3, 5, 5, 10, 10) | 17 | 11
(3, 5, 2, 5, 5) | 17 | 11
```
These were run on an RTX 3050, so we were not able to allocate larger tensors due to memory limitations.
We believe it would be beneficial to benchmark this on more recent hardware, just to check if the performance holds up with larger sizes.
Furthermore, we also refactored code from adaptive_avg_pool2d and adaptive_max_pool2d, to reduce code duplication.
We diffed the kernels and they are identical.
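A hedged sketch of how such a comparison might be reproduced; the shapes, kernel size, and iteration count are assumptions:
```python
import torch
from torch.utils.benchmark import Timer

# Sketch only: shapes, kernel size, and iteration count are assumptions.
x = torch.randn(3, 5, 100, 100, 100, device="cuda", requires_grad=True)

def run_backward(pool, inp):
    out = pool(inp)
    torch.autograd.grad(out, inp, torch.ones_like(out))

eager = lambda t: torch.nn.functional.avg_pool3d(t, kernel_size=3)
decomposed = torch.compile(eager)
run_backward(decomposed, x)  # warm-up: compile forward and backward

for name, fn in [("eager", eager), ("decomposed", decomposed)]:
    timer = Timer(stmt="run_backward(fn, x)",
                  globals={"run_backward": run_backward, "fn": fn, "x": x})
    print(name, timer.timeit(10))
```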
Fixes #127101
Co-authored-by: Martim Mendes <martimccmendes@tecnico.ulisboa.pt>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127722
Approved by: https://github.com/jansel
This PR intends to support aten operations with an `out` tensor.
Currently, AOT compile always does **NOT** keep input tensor mutations. According to the comments, this is because it has not encountered such a use case.
> For now there's no use case involving keeping input mutations in the graph (which we can only do in the inference case anyway). We can add this later if we need to.
However, for aten operations it is common for the `out` tensor to be an input parameter that needs to be mutated. This PR intends to support this by adding a `keep_inference_input_mutations` flag (`aot_inductor.keep_inference_input_mutations`). This flag gives the callee the flexibility to decide whether the AOT compile needs to keep input tensor mutations in the graph.
Take `clamp` as an example as follows.
```python
out_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(-2.0)
inp_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(1.0)
min_tensor = inp_tensor - 0.05
max_tensor = inp_tensor + 0.05
torch.clamp(input=inp_tensor, min=min_tensor, max=max_tensor, out=out_tensor)
```
W/O this PR
```python
def forward(self):
arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";
arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1); arg0_1 = arg1_1 = None
clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1); clamp_min = arg2_1 = None
return (clamp_max, clamp_max)
```
W/ this PR
```python
def forward(self):
arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";
arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1); arg0_1 = arg1_1 = None
clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1); clamp_min = arg2_1 = None
copy_: "f32[128]" = torch.ops.aten.copy_.default(arg3_1, clamp_max); arg3_1 = clamp_max = None
return (copy_,)
```
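A hedged sketch of toggling the flag; the exact config path is an assumption based on the flag name mentioned above:
```python
import torch._inductor.config as inductor_config

# Assumption: the flag is surfaced as aot_inductor.keep_inference_input_mutations
# on the inductor config; with it set, the copy_ into the out tensor (arg3_1
# above) is kept in the exported graph.
inductor_config.aot_inductor.keep_inference_input_mutations = True
```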
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124926
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/angelayi
Beware that none of the topics under [Using Pytorch Securely](#using-pytorch-securely) are considered vulnerabilities of Pytorch.
PyTorch can be used for distributed computing, and as such there is a `torch.distributed` package. PyTorch Distributed features are intended for internal communication only. They are not built for use in untrusted environments or networks.
For performance reasons, none of the PyTorch Distributed primitives (including c10d, RPC, and TCPStore) include any authorization protocol and will send messages unencrypted. They accept connections from anywhere, and execute the workload sent without performing any checks. Therefore, if you run a PyTorch Distributed program on your network, anybody with access to the network can execute arbitrary code with the privileges of the user running PyTorch.
## CI/CD security principles
_Audience_: Contributors and reviewers, especially if modifying the workflow files/build system.
PyTorch CI/CD security philosophy is based on finding a balance between open and transparent CI pipelines while keeping the environment efficient and safe.
PyTorch testing requirements are complex, and a large part of the code base can only be tested on specialized powerful hardware, such as GPU, making it a lucrative target for resource misuse. To prevent this, we require workflow run approval for PRs from non-member contributors. To keep the volume of those approvals relatively low, we easily extend write permissions to the repository to regular contributors.
More widespread write access to the repo presents challenges when it comes to reviewing changes, merging code into trunk, and creating releases. [Protected branches](https://docs.github.com/en/repositories/configuring-branches-and-merges-in-your-repository/managing-protected-branches/about-protected-branches) are used to restrict the ability to merge to the trunk/release branches only to the repository administrators and merge bot. The merge bot is responsible for mechanistically merging the change and validating reviews against the path-based rules defined in [merge_rules.yml](https://github.com/pytorch/pytorch/blob/main/.github/merge_rules.yaml). Once a PR has been reviewed by person(s) mentioned in these rules, leaving a `@pytorchbot merge` comment on the PR will initiate the merge process. To protect merge bot credentials from leaking, merge actions must be executed only on ephemeral runners (see definition below) using a specialized deployment environment.
To speed up the CI system, build steps of the workflow rely on the distributed caching mechanism backed by [sccache](https://github.com/mozilla/sccache), making them susceptible to cache corruption compromises. For that reason, binary artifacts generated during CI should not be executed in an environment that has access to any sensitive/non-public information and should not be published for use by a general audience. One should not have any expectation about the lifetime of those artifacts, although in practice they likely remain accessible for about two weeks after the PR has been closed.
To speed up CI system setup, PyTorch relies heavily on Docker to pre-build and pre-install the dependencies. To prevent a potentially malicious PR from altering images that were published in the past, ECR has been configured to use immutable tags.
To improve runner availability and more efficient resource utilization, some of the CI runners are non-ephemeral, i.e., workflow steps from completely unrelated PRs could be scheduled sequentially on the same runner, making them susceptible to reverse shell attacks. For that reason, PyTorch does not rely on the repository secrets mechanism, as these can easily be compromised in such attacks.
### Release pipelines security
To ensure safe binary releases, PyTorch release pipelines are built on the following principles:
- All binary builds/upload jobs must be run on ephemeral runners, i.e., on a machine that is allocated from the cloud to do the build and released back to the cloud after the build is finished. This protects those builds from interference from external actors, who potentially can get reverse shell access to a non-ephemeral runner and wait there for a binary build.
- All binary builds are cold-start builds, i.e., distributed caching/incremental builds are not permitted. This renders builds much slower than incremental CI builds but isolates them from potential compromises of the intermediate artifacts caching systems.
- All upload jobs are executed in [deployment environments](https://docs.github.com/en/actions/deployment/targeting-different-environments/using-environments-for-deployment) that are restricted to protected branches.
- Security credentials needed to upload binaries to PyPI/conda or stable indexes `download.pytorch.org/whl` are never uploaded to repo secrets storage/environment. This requires an extra manual step to publish the release but ensures that access to those would not be compromised by deliberate/accidental leaks of secrets stored in the cloud.
- No binary artifacts should be published to GitHub releases pages, as these are overwritable by anyone with write permission to the repo.