Changes:
1. Switch `AotCodeCompiler` to new cpp_builder.
2. Only use `deprecated_cpp_compile_command` for `fb_code`, due to I can't debug anymore on no Meta internal environment access.
3. Add `TODO` comments for further some Meta employee help on contine to do this work.
4. Due to item 3, we only remaining `deprecated_cpp_compile_command` for `fb_code` to be fix, let's remove `validate_new_cpp_commands`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127
Approved by: https://github.com/jgong5, https://github.com/jansel
This PR refactors placeholders in cudagraphs to be serializable. We define a new PlaceholderInfo object which only has the necessary parts of placeholders for logging/debugging, and use that instead of `torch.fx.Node` directly. This allows us to then save PlaceholderInfo into the FXGraphCache/AOTAutogradCache later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130252
Approved by: https://github.com/eellison, https://github.com/masnesral
ghstack dependencies: #129384
Resubmit of #128979
`WeakDep`s force readers to have completed before a mutation overwrites the
buffer, but we want to allow fusions to occur for inplace mutations where the
same index is read and written.
Currently this is achieved by:
1. Identifying the buffers used by the mutating op in its `dep_closure`
2. Not creating `WeakDep`s for buffers in the `dep_closure`
3. Fixing up any bad fusions that might occur by an extra check in `can_fuse_vertical`
So we are first over-agressive in removing `WeakDep`, then add an ad-hoc fixup.
This PR instead emits all `WeakDep`s and adds a `fusable_weak_dep` check to
`can_fuse_vertical` which selectively allows inplace operation to fuse.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130835
Approved by: https://github.com/lezcano
Moves cudagraphs stuff into a post_compile function that I can later call when loading from AOTAutogradCache. On a cache hit, we only need to save any reasons for disabling cudagraphs along with some metadata needed to run cudagraphify. The arguments to cudagraphs_post_compile should be the set of parameters I'll need to reconstruct on a warm start.
No actual behavioral change should result from this: I'm moving the behavior into separate functions, but every operation should be the same pre and post PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129384
Approved by: https://github.com/eellison
https://github.com/pytorch/pytorch/issues/127561
Mutations of inputs in backward are emitted manually, after joint_fn tracing.
With default partitioner logic they will be moved to "forward" graph, as this is operation on forward inputs.
To keep those mutations in backward:
- Introduce "subgraph" node key, that can be specified with contextmanager. When we do manual `copy_` in backward on forward input - we know that his is for backward - set subgraph="backward"
In partitioner:
Introducing optional argument subgraph, to filter out nodes with specified subgraph (node_subgraph) and not to add them to subgraph if node_subgraph is different.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129130
Approved by: https://github.com/Chillee
Summary:
This pr fixes all the places in strict export stack where the output node's meta is not preserved correctly. However, we're getting a new error for the test we intend to fix: `buck2 run caffe2/test/quantization:test_quantization -- -r "test_re_export_preserve_handle"`:
The `get_attr` nodes has wrong metadata. I guess there are more things need to be fixed to get it working but it's beyond the scope of this PR.
Test Plan: buck2 run caffe2/test/quantization:test_quantization -- -r "test_re_export_preserve_handle"
Differential Revision: D60198221
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131706
Approved by: https://github.com/yushangdi
Changes:
1. Switch `AotCodeCompiler` to new cpp_builder.
2. Only use `deprecated_cpp_compile_command` for `fb_code`, due to I can't debug anymore on no Meta internal environment access.
3. Add `TODO` comments for further some Meta employee help on contine to do this work.
4. Due to item 3, we only remaining `deprecated_cpp_compile_command` for `fb_code` to be fix, let's remove `validate_new_cpp_commands`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127
Approved by: https://github.com/jgong5, https://github.com/jansel
In cases where the program takes in a constant, export will specialize on the constant and embed the constant into the graph, with the graph containing a placeholder node with no users. However, inductor errors further down as typically in torch.compile, these constants don't show up as inputs. Since these constants are already embedded in the graph, we will just ignore these inputs while compiling with AOTI, and filter out the non-tensor inputs during the runtime.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131594
Approved by: https://github.com/desertfire
Fixes https://github.com/pytorch/pytorch/issues/103602.
This PR implements the idea of "if someone creates a string and then ends up not using it, we would prefer to NOT have specialized." mentioned in above issue. Specifically, we create a lazy variable tracker instead of ConstantVariable when we're in FORMAT_VALUE, and when the lazy variable tracker is realized (i.e. it's going to be used), we create a ConstantVariable and the specialization/guarding happens at the time of realization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131529
Approved by: https://github.com/ezyang
Summary: Instead of embedding the user_defined TraceEntry inside of device_traces, which causes issues when some threads may not have the proper device id set, save them into an external_annotations field by using a RingBuffer<AnnotationEntry> called annotation_buffer owned by the NativeCachingAllocator.
Test Plan: CI, resnet run, and FBR model.
Differential Revision: D59703213
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130964
Approved by: https://github.com/zdevito
This PR provides the initial support for k-slicing (i.e. parallel reduction along k-dim) of CPP GEMM template. Only static shapes are supported now. When k-slicing is enabled, there would be extra temporary buffers allocated to hold the intermediate results and an extra barrier after initial GEMM compute by each thread, i.e. each thread first stores the GEMM result to temporary accumulation buffers (pointed by `local_buf_ptrs` which is an array of pointers pointing to accumulation buffers), followed by a reduction along k-slices, epilogue computes and store to the final output `Y`. In each k-slicing thread group, the reduction along k-slices and epilogue computes are conducted in parallel along M-dim. The algorithm is designed to reduce the synchronization overhead as much as possible.
The k-slicing is enabled when blocking on M and N is unable to occupy all threads. Since k-slicing doesn't always bring benefit, an extra configuration is added to enable it (disable by default). We need to identify a good heuristics in the future to enable k-slicing by default.
Performance numbers with 64x4096x64, 64x10000x64, 64x20000x64 as examples on 60-core SPR as examples. As you can see, the perf of k-slicing is only better than non-k-slicing when K is large enough.
Without k-slicing
AUTOTUNE linear_unary(64x4096, 64x4096, 64)
cpp_packed_gemm_0 0.0108 ms 100.0%
_linear_pointwise 0.0431 ms 25.1%
AUTOTUNE linear_unary(64x10000, 64x10000, 64)
cpp_packed_gemm_0 0.0272 ms 100.0%
_linear_pointwise 0.0892 ms 30.5%
AUTOTUNE linear_unary(64x20000, 64x20000, 64)
cpp_packed_gemm_0 0.0781 ms 100.0%
_linear_pointwise 0.1693 ms 46.1%
With k-slicing:
AUTOTUNE linear_unary(64x4096, 64x4096, 64)
cpp_packed_gemm_0 0.0260 ms 100.0%
_linear_pointwise 0.0444 ms 58.5%
AUTOTUNE linear_unary(64x10000, 64x10000, 64)
cpp_packed_gemm_0 0.0275 ms 100.0%
_linear_pointwise 0.0893 ms 30.8%
AUTOTUNE linear_unary(64x20000, 64x20000, 64)
cpp_packed_gemm_0 0.0284 ms 100.0%
_linear_pointwise 0.1686 ms 16.8%
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130821
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #131024
#109581
At this point, the vanilla implementation (the default) is good.
Docs: https://docs-preview.pytorch.org/pytorch/pytorch/129905/generated/torch.optim.Adafactor.html#torch.optim.Adafactor
Specifically, the impl in this PR, which attempts to replicate the paper,
```
optim = torch.optim.Adafactor([weight])
```
is close enough to https://pytorch-optimizers.readthedocs.io/en/latest/optimizer/#pytorch_optimizer.AdaFactor
```
optim_c = AdaFactor([weight], betas=(0, 0.999), scale_parameter=False)
```
is close enough to https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adafactor
```
optim = keras.optimizers.Adafactor(learning_rate=0.01)
```
The three results respectively for the same randomly generated weights:
```
# ours
tensor([[ 0.3807594, -0.3912092],
[ 0.0762539, 0.5377805],
[ 0.2459473, 0.4662207]])
# pytorch-optimizer
tensor([[ 0.3807592, -0.3912172],
[ 0.0762507, 0.5377818],
[ 0.2459457, 0.4662213]])
# keras
array([[ 0.38076326, -0.39121315],
[ 0.0762547 , 0.5377859 ],
[ 0.24594972, 0.46622536]], dtype=float32)
```
This gives me confidence to move forward in speeding up the implementation now that a baseline has been established. If you're curious about differences:
* keras assigns step_size (rho_t in their code) to `min(lr, 1 / sqrt(step)` whereas the OG impl uses a hardcoded 0.01 instead of lr. We do the same thing as keras, but our lr default is 0.01.
* We differ from the pytorch-optimizers default in that our default will not track momentum (thus `beta1=0`) and we do not apply parameter scaling.
<details>
Keras collab: https://colab.research.google.com/drive/1i3xF8ChL7TWKJGV_5v_5nMhXKnYmQQ06?usp=sharing
My script repro:
```
import torch
from pytorch_optimizer import AdaFactor
torch.set_printoptions(precision=7)
weight = torch.tensor([[ 0.37697506, -0.39500135],
[ 0.07246649, 0.53399765],
[ 0.24216151, 0.46243715]], dtype=torch.float32)
# bias = torch.tensor([0, 0], dtype=torch.float32)
weight.grad = torch.tensor([[-0.5940447, -0.7743838],
[-0.5940447, -0.7743838],
[-0.5940447, -0.7743838]], dtype=torch.float32)
# bias.grad = torch.tensor([-2.5027974, 1.5422692], dtype=torch.float32)
weight_c = weight.clone()
weight_c.grad = weight.grad.clone()
optim = torch.optim.Adafactor([weight])
optim.step()
print(weight)
optim_c = AdaFactor([weight_c], betas=(0, 0.999), scale_parameter=False)
optim_c.step()
print(weight_c)
```
<details>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129905
Approved by: https://github.com/albanD
Extension of the condition allowing the cpu scalar to be moved to specific devices.
This fixes an HPU specific error:
`torch._dynamo.exc.BackendCompilerFailed: backend='aot_hpu_training_backend' raised:
RuntimeError: Expected `value` to be on same device as `a`While executing %masked_fill : [num_users=1] = call_method[target=masked_fill](args = (%matmul, %expand_as, %tensor), kwargs = {})`
On the HPU in eager mode the problem doesn't occur because the pytorch's implementation is not used then.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127871
Approved by: https://github.com/jgong5, https://github.com/ezyang
This fixes a couple errors that come up when multi-kernel is used with
split-scan.
1. The split-scan was being marked as a persistent kernel, which allowed
a multi-kernel to be created but this isn't supported. Fix is to
never mark split-scan as persistent.
2. Benchmark codegen was not handling WorkspaceArg, and would raise a
KeyError during codegen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131044
Approved by: https://github.com/shunting314
Changes:
- Add `-C REPO` in `git` commands to allow the tool can be run everywhere not only the repo dir
- Use `pathlib.Path` as many as possible
- Replace `subprocess.run(..., check=True)` with `subprocess.check_{call,output}(...)`
- Add `encoding='utf-8'` for files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131134
Approved by: https://github.com/ezyang
------
As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI.
Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily.
Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361
```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000
@classmethod
def eval(cls, base, divisor):
# python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
# Assert triggered by inequality solver
# assert base.is_integer, base
# assert divisor.is_integer, divisor
# We don't provide the same error message as in Python because SymPy
# makes it difficult to check the types.
if divisor.is_zero:
raise ZeroDivisionError("division by zero")
if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
int_oo,
-int_oo,
sympy.oo,
-sympy.oo,
):
return sympy.nan
if base is sympy.nan or divisor is sympy.nan:
return sympy.nan
if base.is_zero:
return sympy.S.Zero
if base.is_integer and divisor == 1:
return base
if base.is_integer and divisor == -1:
return sympy.Mul(base, -1)
if (
isinstance(base, sympy.Number)
and isinstance(divisor, sympy.Number)
and (
base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
)
):
r = float(base) / float(divisor)
if r == math.inf:
return int_oo
elif r == -math.inf:
return -int_oo
elif math.isnan(r):
return sympy.nan
else:
return sympy.Integer(math.floor(r))
if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
return sympy.Integer(int(base) // int(divisor))
if isinstance(base, FloorDiv):
return FloorDiv(base.args[0], base.args[1] * divisor)
# Expands (x + y) // b into x // b + y // b.
# This only works if floor is an identity, i.e. x / b is an integer.
for term in sympy.Add.make_args(base):
quotient = term / divisor
if quotient.is_integer and isinstance(divisor, sympy.Integer):
# NB: this is correct even if the divisor is not an integer, but it
# creates rational expressions that cause problems with dynamic
# shapes.
return FloorDiv(base - term, divisor) + quotient
try:
gcd = sympy.gcd(base, divisor)
if gcd != 1:
> return FloorDiv(
sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
)
base = -1.00000000000000
cls = FloorDiv
divisor = -1.00000000000000
gcd = 1.00000000000000
quotient = 1.00000000000000
term = -1.00000000000000
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}
@wraps(func)
def wrapper(*args, **kwargs):
try:
> retval = cfunc(*args, **kwargs)
E RecursionError: maximum recursion depth exceeded in comparison
E
E To execute this test, run the following from the base repo dir:
E python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
args = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func = <function Function.__new__ at 0x7fc530317280>
kwargs = {}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
**Summary**
I have added a new noise level between the existing levels of 1 and 2, such that the noise level controls are now:
0. prints module-level collective counts
1. prints dTensor operations not included in trivial operations (new noise level)
2. prints operations not included in trivial operations
3. prints all operations
This gives the user more flexibility in controlling what information they want to use. The noise levels are used both for creating the console/file log and the json dump. In the example file, I have changed the module_tracing examples to noise level 0 and have changed my transformer examples to show off the new noise level.
**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump
2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing
3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing
4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131592
Approved by: https://github.com/XilunWu
ghstack dependencies: #131419, #130996
Summary:
When run internally in multiple parallel processes, the `test_debug_trace` hits the cache and skips writing all the expected outputs. Here we force-disable inductor cache to circumvent the problem. Ideally, we should switch to using a cleaner `fresh_inductor_cache` decorator approach, but it doesn't work at the moment.
Additionally, the debug trace dir is now generated by `tempfile.mkdtemp` to avoid a (rather unlikely) race condition.
Test Plan: Tested internally.
Differential Revision: D60207586
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131722
Approved by: https://github.com/eellison
Fix static `py::object`s with `py::gil_safe_call_once_and_store`.
The following code will leak a `py::object` which will call its destructor when shutdown the program. The destructor will call `Py_DECREF(obj.m_ptr)` which may raise a segmentation fault.
```c++
void func() {
static py::object obj = py::module_::import("foo").attr("bar");
...
}
```
The correct code is to use raw pointers rather than the instance.
```c++
void func() {
static py::object* obj_ptr = new py::object{py::module_::import("foo").attr("bar")};
py::object obj = *obj_ptr;
...
}
```
This PR uses the `py::gil_safe_call_once_and_store` function from `pybind11`, which can run arbitrary initialization code only once under the Python GIL thread safely.
```c++
void func() {
PYBIND11_CONSTINIT static py::gil_safe_call_once_and_store<py::object> storage;
py::object obj = storage
.call_once_and_store_result(
[]() -> py::object {
return py::module_::import("foo").attr("bar");
}
)
.get_stored();
...
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130341
Approved by: https://github.com/ezyang, https://github.com/malfet
Add a new option `--cuda` to `tools/nightly.py` to pull the nightly packages with CUDA support.
```bash
# installs pytorch-nightly with cpuonly
tools/nightly.py pull
# The following only available on Linux and Windows
# installs pytorch-nightly with latest CUDA we support
tools/nightly.py pull --cuda
# installs pytorch-nightly with CUDA 12.1
tools/nightly.py pull --cuda 12.1
```
Also add targets in `Makefile` and instructions in constribution guidelines.
```bash
# setup conda environment with pytorch-nightly
make setup-env
# setup conda environment with pytorch-nightly with CUDA support
make setup-env-cuda
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131133
Approved by: https://github.com/ezyang
Summary: Internally, the ABI-compatible mode is [enabled by default](eb54ca7abe/torch/_inductor/config.py (L53)). As a result, when the `abi_compatible: False` flag is not specified explitictly in the tests assuming non-ABI-compatible C++ codegen, those are failing internally. Here we fix one such test in `test_memory_planning.py`.
Test Plan: Tested internally.
Differential Revision: D60197327
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131703
Approved by: https://github.com/eellison
Bumps [setuptools](https://github.com/pypa/setuptools) from 69.5.1 to 70.0.0.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a href="https://github.com/pypa/setuptools/blob/main/NEWS.rst">setuptools's changelog</a>.</em></p>
<blockquote>
<h1>v70.0.0</h1>
<h2>Features</h2>
<ul>
<li>Emit a warning when <code>[tools.setuptools]</code> is present in <code>pyproject.toml</code> and will be ignored. -- by :user:<code>SnoopJ</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4150">#4150</a>)</li>
<li>Improved <code>AttributeError</code> error message if <code>pkg_resources.EntryPoint.require</code> is called without extras or distribution
Gracefully "do nothing" when trying to activate a <code>pkg_resources.Distribution</code> with a <code>None</code> location, rather than raising a <code>TypeError</code>
-- by :user:<code>Avasam</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4262">#4262</a>)</li>
<li>Typed the dynamically defined variables from <code>pkg_resources</code> -- by :user:<code>Avasam</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4267">#4267</a>)</li>
<li>Modernized and refactored VCS handling in package_index. (<a href="https://redirect.github.com/pypa/setuptools/issues/4332">#4332</a>)</li>
</ul>
<h2>Bugfixes</h2>
<ul>
<li>In install command, use super to call the superclass methods. Avoids race conditions when monkeypatching from _distutils_system_mod occurs late. (<a href="https://redirect.github.com/pypa/setuptools/issues/4136">#4136</a>)</li>
<li>Fix finder template for lenient editable installs of implicit nested namespaces
constructed by using <code>package_dir</code> to reorganise directory structure. (<a href="https://redirect.github.com/pypa/setuptools/issues/4278">#4278</a>)</li>
<li>Fix an error with <code>UnicodeDecodeError</code> handling in <code>pkg_resources</code> when trying to read files in UTF-8 with a fallback -- by :user:<code>Avasam</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4348">#4348</a>)</li>
</ul>
<h2>Improved Documentation</h2>
<ul>
<li>Uses RST substitution to put badges in 1 line. (<a href="https://redirect.github.com/pypa/setuptools/issues/4312">#4312</a>)</li>
</ul>
<h2>Deprecations and Removals</h2>
<ul>
<li>
<p>Further adoption of UTF-8 in <code>setuptools</code>.
This change regards mostly files produced and consumed during the build process
(e.g. metadata files, script wrappers, automatically updated config files, etc..)
Although precautions were taken to minimize disruptions, some edge cases might
be subject to backwards incompatibility.</p>
<p>Support for <code>"locale"</code> encoding is now <strong>deprecated</strong>. (<a href="https://redirect.github.com/pypa/setuptools/issues/4309">#4309</a>)</p>
</li>
<li>
<p>Remove <code>setuptools.convert_path</code> after long deprecation period.
This function was never defined by <code>setuptools</code> itself, but rather a
side-effect of an import for internal usage. (<a href="https://redirect.github.com/pypa/setuptools/issues/4322">#4322</a>)</p>
</li>
<li>
<p>Remove fallback for customisations of <code>distutils</code>' <code>build.sub_command</code> after long
deprecated period.
Users are advised to import <code>build</code> directly from <code>setuptools.command.build</code>. (<a href="https://redirect.github.com/pypa/setuptools/issues/4322">#4322</a>)</p>
</li>
<li>
<p>Removed <code>typing_extensions</code> from vendored dependencies -- by :user:<code>Avasam</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4324">#4324</a>)</p>
</li>
<li>
<p>Remove deprecated <code>setuptools.dep_util</code>.
The provided alternative is <code>setuptools.modified</code>. (<a href="https://redirect.github.com/pypa/setuptools/issues/4360">#4360</a>)</p>
</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a href="5cbf12a9b6"><code>5cbf12a</code></a> Workaround for release error in v70</li>
<li><a href="9c1bcc3417"><code>9c1bcc3</code></a> Bump version: 69.5.1 → 70.0.0</li>
<li><a href="4dc0c31644"><code>4dc0c31</code></a> Remove deprecated <code>setuptools.dep_util</code> (<a href="https://redirect.github.com/pypa/setuptools/issues/4360">#4360</a>)</li>
<li><a href="6c1ef5748d"><code>6c1ef57</code></a> Remove xfail now that test passes. Ref <a href="https://redirect.github.com/pypa/setuptools/issues/4371">#4371</a>.</li>
<li><a href="d14fa0162c"><code>d14fa01</code></a> Add all site-packages dirs when creating simulated environment for test_edita...</li>
<li><a href="6b7f7a18af"><code>6b7f7a1</code></a> Prevent <code>bin</code> folders to be taken as extern packages when vendoring (<a href="https://redirect.github.com/pypa/setuptools/issues/4370">#4370</a>)</li>
<li><a href="69141f69f8"><code>69141f6</code></a> Add doctest for vendorised bin folder</li>
<li><a href="2a53cc1200"><code>2a53cc1</code></a> Prevent 'bin' folders to be taken as extern packages</li>
<li><a href="720862807d"><code>7208628</code></a> Replace call to deprecated <code>validate_pyproject</code> command (<a href="https://redirect.github.com/pypa/setuptools/issues/4363">#4363</a>)</li>
<li><a href="96d681aa40"><code>96d681a</code></a> Remove call to deprecated validate_pyproject command</li>
<li>Additional commits viewable in <a href="https://github.com/pypa/setuptools/compare/v69.5.1...v70.0.0">compare view</a></li>
</ul>
</details>
<br />
[](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)
Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`.
[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)
---
<details>
<summary>Dependabot commands and options</summary>
<br />
You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
- `@dependabot show <dependency name> ignore conditions` will show all of the ignore conditions of the specified dependency
- `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)
You can disable automated security fix PRs for this repo from the [Security Alerts page](https://github.com/pytorch/pytorch/network/alerts).
</details>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130893
Approved by: https://github.com/kit1980
We are considering consolidating data source for logging and flight recorder so that we don't build multiple paths for debugging information. Before we do any merging, we want to first ensure that the PG status is also included in flight recorder. Also, we can leverage this information to validate our FR dump as well. Because the dump is not synced so we might potentially see some variants in the dump.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131268
Approved by: https://github.com/shuqiangzhang
Summary:
When tunable ops load selected kernels from csv file, it will validate hipblaslt version defined in hipblaslt-version.h
This PR changes the validator to fetch hipblaslt version and revision from hipblaslt runtime instead of the header file, as in our environment we might rollout a new version of the run time prior to updating the header file fleet wide.
Test Plan:
Verified generated tunableops kernel selection has the correct hipblaslt version from runtime:
```
Validator,PT_VERSION,2.5.0
Validator,ROCBLAS_VERSION,4.0.0-72e57364-dirty
Validator,HIPBLASLT_VERSION,800-bf2c3184
Validator,ROCM_VERSION,6.0.0.0-12969-1544e39
Validator,GCN_ARCH_NAME,gfx942:sramecc+:xnack-
GemmTunableOp_BFloat16_TN,tn_8192_2_3584,Gemm_Hipblaslt_TN_572,0.0240676
GemmTunableOp_BFloat16_TN,tn_7168_2_8192,Gemm_Hipblaslt_TN_482,0.0359019
GemmTunableOp_BFloat16_TN,tn_8192_2_1024,Default,0.0173723
GemmTunableOp_BFloat16_TN,tn_1280_2_8192,Gemm_Hipblaslt_TN_491,0.0191047
```
Differential Revision: D59889043
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131078
Approved by: https://github.com/jeffdaily, https://github.com/xw285cornell
Summary:
When importing `_trace.py`, put `torch._dynamo.exc.Unsupported` in the global variable ``_ALLOW_LIST`` can cause import to ``export/_trace.py`` to fail with error:
ValueError: Artifact name: 'graph_breaks' not registered, please call register_artifact('graph_breaks') in torch._logging.registrations.
The error is directly raise on line `graph_breaks_log = torch._logging.getArtifactLogger(__name__, "graph_breaks")` in `_dynamo/exc.py`. I've checked that ``register_artifact('graph_breaks')`` does already exist in torch._logging.registrations.
Explicitly call `import torch._logging` doesn't fix the issue.
(see T196719676)
We move ``_ALLOW_LIST`` to be a local variable.
Test Plan:
buck2 test 'fbcode//mode/opt' fbcode//aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test -- --exact 'aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test - test_serialized_model_for_disagg_acc (aiplatform.modelstore.publish.utils.tests.fc_transform_utils_test.PrepareSerializedModelTest)'
buck2 test 'fbcode//mode/opt' fbcode//aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test -- --exact 'aiplatform/modelstore/publish/utils/tests:fc_transform_utils_test - test_serialized_test_dsnn_module (aiplatform.modelstore.publish.utils.tests.fc_transform_utils_test.PrepareSerializedModelTest)'
Differential Revision: D60136706
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131523
Approved by: https://github.com/zhxchen17
Regression introduced by https://github.com/pytorch/pytorch/pull/126376
Before this change, compiling torch_cpu on my MacBook prints tons of warnings every time HooksInterface is included
```
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/src/optim/adamw.cpp:1:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/optim/adamw.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/module.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/container/any_module_holder.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/container/any_value.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/detail/static.h:4:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/types.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/ATen.h:7:
In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/Context.h:13:
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/detail/HIPHooksInterface.h:27:11: warning: '~HIPHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
virtual ~HIPHooksInterface() = default;
^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:16:11: note: overridden virtual function is here
virtual ~AcceleratorHooksInterface() = default;
^
1 warning generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131204
Approved by: https://github.com/albanD, https://github.com/seemethere
Summary:
We removed references to _export/exported_program.py in executorch
in D60052318. Now we can remove this file.
Update the pin to executorch.
Test Plan: contbuild & OSS CI:
Differential Revision: D60072980
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131597
Approved by: https://github.com/avikchaudhuri
- Add a `kwargs` option; add the `dynamic_shapes` option so users can supply it directly to `torch.export`.
- Make the options keyword-only arguments (bc-breaking)
- Deprecate the `training` and `operator_export_type` options and include a warning message. The exact time for removal is TBD but the message should discourage users from using the options.
- Deprecate two functions `exportable_ops` and pretty print export
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131501
Approved by: https://github.com/titaiwangms
`inductor` and `rocm` workflows are the major contributors to the CI load on ROCm CI at the moment, resulting in huge backlogs: https://github.com/pytorch/pytorch/pull/131489#issue-2425804464
* Move rocm.yml to cron frequency
* Move ROCm CI jobs from inductor.yml to inductor-rocm.yml
* Introduce `ciflow/inductor-rocm` as PR label to manually invoke inductor jobs for ROCm (no automatic invoking to limit CI load)
* After this PR, only `trunk` workflow jobs for ROCm will run on every commit and PR merge, but since they take 45min*3 time on average, I decided to leave them as-is since it will provide us some basic insulation against ROCm breakage.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131637
Approved by: https://github.com/clee2000, https://github.com/atalman, https://github.com/huydhn
Summary: Since WaitCounter frontend itself has minimal depdendencies it's fine to be moved into c10. Specific backends can be registered/linked separately.
Test Plan: unit test
Reviewed By: jamesperng, asiab4, c-p-i-o
Differential Revision: D59842868
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131021
Approved by: https://github.com/asiab4
The problem was we were shoving SymInts into the constant_args side
table. The root problem is that torch.fx.node.base_types, which we use
to determine what can be put in the graph, doesn't actually have SymInt
in it. This PR fixes base_types to include SymInt.
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131363
Approved by: https://github.com/oulgen, https://github.com/justinchuby
This PR adds an API `FSDPModule.set_reduce_scatter_divide_factor` to allow setting a custom gradient divide factor for reduce-scatter. This can be useful when using parallelisms in combination with FSDP (e.g. expert parallelism), where gradients need to be divided by a custom factor (e.g. an extra `EP` factor).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129286
Approved by: https://github.com/weifengpy
This PR implements an opt-in configuration option for synchronizing compilation across all ranks at the end of Dynamo tracing (and potentially, other places in the future). There are two pieces to this PR:
1. Implementing infrastructure for compiler collectives (DistributedState/LocalState, the actual collective)
2. Using this infrastructure to synchronize automatic dynamic choices across all ranks
The infrastructure in part one can be used for other purposes, just add more (serializable) fields to LocalState.
Here is how automatic dynamic synchronization works:
1. Preflight in "torch/_dynamo/variables/builder.py": On the first Dynamo trace run, we trace without automatic dynamic at all; we assume all Tensor inputs that are not otherwise marked are static. This run is purely to collect all Tensor input sizes in the program.
2. torch/_dynamo/output_graph.py: At the end of the first Dynamo trace run, we perform a compiler collective to distribute all Tensor input sizes to all ranks. Then, we restart Dynamo
3. Apply the updates in "torch/_dynamo/variables/builder.py": Now that we have all sizes for every rank, we now update frame state with the observed sizes for all ranks, in rank order. Under the assumption that frame state is consistent on all ranks, this series of updates will preserve consistency.
For future work, it would be safer if we force a consistent hint on all ranks; this is more involved as we have to interpose in fakification.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130935
Approved by: https://github.com/jansel
High level goals:
- Cover the all-gather and reduce-scatter pattern matchers with unit tests
- Make it easier to exclude certain collectives as async-tp candidates
- Make it easier to match other all-gather and reduce-scatter variants (e.g. fp8 collectives)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131409
Approved by: https://github.com/weifengpy
Resubmit of #129325
Previously each mutation was represented by a `MutationOutput` operation which
was a new scheduler node that must be scheduled immediately afterwards.
Now we have a single scheduler node, which produces mutiple `MutationOutput`
buffers as its output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130832
Approved by: https://github.com/lezcano
This should prevent regressions like the ones fixed by https://github.com/pytorch/pytorch/pull/131204
- Remove global `-Wno-error=inconsistent-missing-override`
- Wrap offending includes (protobuf and asmjit) with `C10_DIAGNOSTIC_PUSH_AND_IGNORE` and `C10_DIAGNOSTIC_POP_AND_IGNORED`
- Add `override` keyword to `at::namespace::tunable::StreamTimer` and `LLVMCodeGenImpl`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131524
Approved by: https://github.com/atalman
Add example `NestedTensor`s with inner dimension of size `1` to `_get_example_tensor_lists` with `include_inner_dim_size_1=True`. This diff creates `NestedTensor`s of sizes `(B, *, 1)` and `(B, *, 5, 1)`, ensuring that the current implementations of jagged reductions for `sum` and `mean` hold for tensors of effective shape `(B, *)` and `(B, *, 5)`.
Differential Revision: [D59846023](https://our.internmc.facebook.com/intern/diff/D59846023/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131516
Approved by: https://github.com/davidberard98
Summary: We currently don't support some of the `@triton.autotune` arguments when compiling user-written Triton kernels with PT2. In this PR, we're adding a flag to circumvent it. This is to unblock internal compilation in some cases. The flag is supplied with the docs mentioning why it is not a good idea to set it.
Test Plan:
```
python test/inductor/test_triton_kernels.py -k test_triton_kernel_
autotune_with_unsupported_args
...
----------------------------------------------------------------------
Ran 3 tests in 3.636s
OK
```
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131431
Approved by: https://github.com/oulgen, https://github.com/zou3519
The issue addressed is that compiled autograd changes the calling convention of the FX graph to only have a single placeholder which contains a list of inputs. In this case, the meta of the tensor input nodes don't contain the `tensor_dict` meta. This adds them.
The context is that `tensor_dict` is used to convey if a tensor is an input with a static address.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131556
Approved by: https://github.com/anijain2305
In gen.py, the code for generating CompositeViewCopyKernels.cpp includes *_native.h headers for "view_groups" but not "structured_native_functions". However, this results in the TORCH_API in the headers being ineffective and presents such functions being used outside libtorch_cpu.so
This patch ensures that gen.py includes the native headers for "structured_native_functions" in the same way as for "view_groups".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131208
Approved by: https://github.com/bdhirsh
Summary: In the script of testing different families of models, when the conversion failed, we switch to use output from the explain function to provide more meaningful information.
Test Plan:
Manual testing with attatched log information.
```
buck2 run mode/dev-nosan sigmoid/inference/ts_migration:main -- --mode test_all --test_suites ads_merge --model_id 440779101
```
```
Processing 440779101_5455.predictor.disagg.gpu.merge
model_name: 440779101_5455.predictor.disagg.gpu.merge
has_ts_model: True
has_sample_inputs: True
ops_maybe_missing_meta: set()
ts_can_run: True
ts_run_exception: None
can_convert: False
convert_exception: Unsupported nodes are found in the following list:
0. prim::Loop [%14259 : int = prim::Loop(%14258, %1129, %1126), scope: torch.fx.graph_module.GraphModule:: # <torch_package_1>.caffe2/torch/fb/predictor/modules/tensors_to_device_module.py💯19]
1. prim::Loop [%14326 : int = prim::Loop(%1115, %1129, %14259), scope: torch.fx.graph_module.GraphModule:: # <torch_package_1>.caffe2/torch/fb/predictor/modules/tensors_to_device_module.py💯19]
ep_result_correct: None
ep_run_exception: None
can_package: None
package_exception: None
sigmoid_can_run: None
sigmoid_run_exception: None
sigmoid_result_correct: None
```
Reviewed By: SherlockNoMad
Differential Revision: D59971446
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131214
Approved by: https://github.com/angelayi
This PR improves the thread blocking heuristics to favor full occupancy as much as possible. Also, the "m x n" block size is made as squared as possible for better data reuse.
Take the shape M=20000, N=64, K=128 as an example, the original heuristics couldn't use up all the threads when the number of threads is large, say 60:
AUTOTUNE linear_unary(200000x128, 64x128, 64)
_linear_pointwise 0.1010 ms 100.0%
cpp_packed_gemm_0 0.8303 ms 12.2%
0722 02:26:39.220660 302553 torch/_inductor/codegen/cpp_gemm_template.py:503] [0/0] Register blocking: GemmBlocking(block_m=32, block_n=32, block_k=32)
V0722 02:26:39.221042 302553 torch/_inductor/codegen/cpp_gemm_template.py:507] [0/0] Cache blocking: GemmBlocking(block_m=625, block_n=1, block_k=4)
V0722 02:26:39.221118 302553 torch/_inductor/codegen/cpp_gemm_template.py:509] [0/0] Thread blocking: GemmBlocking(block_m=625, block_n=1, block_k=4)
V0722 02:26:39.221252 302553 torch/_inductor/codegen/cpp_gemm_template.py:526] [0/0] Number of threads: 60, occupancy: (10, 2, 1)
After this PR:
AUTOTUNE linear_unary(200000x128, 64x128, 64)
_linear_pointwise 0.1143 ms 100.0%
cpp_packed_gemm_0 0.1228 ms 93.1%
V0722 02:29:49.261794 304201 torch/_inductor/codegen/cpp_gemm_template.py:309] [0/0] Register blocking: GemmBlocking(block_m=32, block_n=32, block_k=32)
V0722 02:29:49.262860 304201 torch/_inductor/codegen/cpp_gemm_template.py:313] [0/0] Cache blocking: GemmBlocking(block_m=64, block_n=1, block_k=8)
V0722 02:29:49.262951 304201 torch/_inductor/codegen/cpp_gemm_template.py:315] [0/0] Thread blocking: GemmBlocking(block_m=69, block_n=79, block_k=8)
V0722 02:29:49.263075 304201 torch/_inductor/codegen/cpp_gemm_template.py:332] [0/0] Number of threads: 60, occupancy: (15, 4, 1)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131024
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w
**Summary**
While trying to integrate CommDebugMode with TorchTitan, I realized that the forward_hooks were being registered even though it was in the backward pass. After investigating, I realized that it was activation checkpointing that was causing this. In order to prevent users from being confused, I edited CommDebugMode so that it could differentiate between backward pass operations and activation checkpointing operations. I have also added an example case showing that CommDebugMode is able to successfully differentiate between the backward pass and activation checkpointing. The output for the example can be seen below.
**Test Case**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e activation_checkpointing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130996
Approved by: https://github.com/XilunWu
ghstack dependencies: #131419
**Summary**
I switched the module tracker I had been inheriting from PyTorch’s all purpose one to the one written by Sanket in the distributed tools folder. I did this because the original one messed up activation checkpointing by adding itself to the parent set in the backward_pre_hook and then in the forward_pre_hook for the activation_checkpointing.
**Test Case**
pytest test/distributed/_tensor/debug/test_comm_mode_features.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131419
Approved by: https://github.com/XilunWu
```python
# NOTE [low-contention collectives]
# When a collective is overlapped with abundant compute, it makes sense to
# prioritize reducing the contention between the collective and the overlapped
# compute, even at the cost of a slightly slower collective.
#
# Common collective implementations (e.g., NCCL without user buffer
# registration) optimize for throughput with no ambient compute. However, such
# implementations may not be optimal when they are overlapped with compute:
# - These impls typically fuse the entire collective into a single kernel and
# reserve SM resources based on the most demanding portion of the collective,
# even when a large portion of the collective does not require this much
# resource.
# - These implementations typically fuse the entire collective into a single
# kernel and reserve SM resources based on the most demanding portion of the
# collective, even when a large portion of the collective does not require this
# much resource.
# - These implementations often use SM-based P2P copy as opposed to copy
# engine-based P2P copy. Copy engine-based P2P copy may not have a significant
# advantage when there's no ambient compute. However, it may significantly
# improve overall resource utilization in the presence of ambient compute.
#
# When overlapped with intensive compute (e.g., persistent matmul kernels), the
# SM-usage of a collective can lead to inefficient overlapping.
#
# Low-contention collectives achieve their goals with the following strategies:
# - Use copy engine-based copy whenever possible.
# - Break down portions of a collective with different resource requirements
# into multiple kernels. This improves the overlapping efficiency at the cost
# of additional launching overhead.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130583
Approved by: https://github.com/weifengpy
The test fail internally [T195592444](https://www.internalfb.com/intern/tasks/?t=195592444) (This is meta internal link). But we don't see the failure in OSS.
It turns out that there are 2 issues:
1. `run_test('cuda')` is improperly handled since it tries to import a module named 'cuda' if cuda is available. Since the import fails, all tests in the file are skipped. This hides the failure in OSS. The failure is exposed in internal tests since the main block which runs `run_test('cuda')` is skipped sometimes.
2. fix the real issue that incompatible inputs are provided to `do_bench`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131402
Approved by: https://github.com/eellison
Regression introduced by https://github.com/pytorch/pytorch/pull/126376
Before this change, compiling torch_cpu on my MacBook prints tons of warnings every time HooksInterface is included
```
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/src/optim/adamw.cpp:1:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/optim/adamw.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/module.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/container/any_module_holder.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/nn/modules/container/any_value.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/detail/static.h:4:
In file included from /Users/nshulga/git/pytorch/pytorch/torch/csrc/api/include/torch/types.h:3:
In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/ATen.h:7:
In file included from /Users/nshulga/git/pytorch/pytorch/aten/src/ATen/Context.h:13:
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/detail/HIPHooksInterface.h:27:11: warning: '~HIPHooksInterface' overrides a destructor but is not marked 'override' [-Winconsistent-missing-destructor-override]
virtual ~HIPHooksInterface() = default;
^
/Users/nshulga/git/pytorch/pytorch/aten/src/ATen/detail/AcceleratorHooksInterface.h:16:11: note: overridden virtual function is here
virtual ~AcceleratorHooksInterface() = default;
^
1 warning generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131204
Approved by: https://github.com/albanD, https://github.com/seemethere
Summary:
Remove operator_benchmark caffe2 build due to the removal of caffe2: 2fd75667b4
Plus, we are deleting the TARGETS file from broken benchmarks that we do not intend to maintain.
Test Plan: Sandcastle CI
Differential Revision: D60086216
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131460
Approved by: https://github.com/vmpuri
Summary:
Modify the existing `mean` operator in PyTorch, invoked by `torch.mean`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff enables PyTorch users to invoke `torch.mean` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` nested tensor.
Parametrize unit tests from `sum` to verify the accuracy of the ragged reduction implementation for `torch.mean`. Add unit tests and parametrize `sum` unit tests to verify error handling for unsupported features in `NestedTensor` `torch.mean`.
Test Plan:
Verify that the new unit test passes via the following command:
```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_mean
```
```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_jagged_op
```
Differential Revision: D59654668
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131132
Approved by: https://github.com/davidberard98, https://github.com/jbschlosser
# Summary
While debugging CI failures for flash_attention tests I stumbled across 2 IMAs for the split-kv variant of flash attention.
1. Illegal global memory writes during the writing of softmax_lse_accum. This was pinpointed to the temporary liftime of these out_accum and softmax_lse_accum. These were likely getting their refcount dropped **before** the kernel launch that used, them allowing them to potentially get used for other allocations.
2. After debugging this there was illegal writes of the combine kernel. I was able to pinpoint this to the writing to the reduce LSE. From my understanding it was making assumption that kBlocKM evenly divided the global number of rows and wasn't masking out these writes.
### History
My line of thinking for this:
We create the temporary split accum + LSE stats tensors to store the data for each split. We then launch a follow up kernel to do the reduction.
Under ordinary non roofline memory usage the cuda memory caching allocator will keep these allocations alive even though the tensors were created within a temporary scope and no longer have any live references.
On CI we often run near max memory usage. We change/add tests and suddenly we get close to oom threshold. The memory allocator will reap these segments and we get write after free errors.
After that fix I did get further past the splitkv_flash kernel and then got the following error:
``` Shell
❯ TORCH_DISABLE_ADDR2LINE=1 PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer --show-backtrace=device --tool memcheck --log-file ima.txt python ima.py
softmax_lseaccum_ptr =0x7f5ebb208a00
oaccum_ptr =0x7f5ebb208c00
softmax_lse_ptr = 0x7f5ebb208800
❯
❯ head ima.txt -n 10
========= COMPUTE-SANITIZER
========= Invalid __global__ write of size 4 bytes
========= at void pytorch_flash::flash_fwd_splitkv_combine_kernel<pytorch_flash::Flash_fwd_kernel_traits<(int)32, (int)64, (int)256, (int)4, (bool)0, (bool)0, cutlass::bfloat16_t, pytorch_flash::Flash_kernel_traits<(int)32, (int)64, (int)256, (int)4, cutlass::bfloat16_t>>, (int)16, (int)1, (bool)1>(pytorch_flash::Flash_fwd_params)+0x630
========= by thread (2,0,0) in block (0,0,0)
========= Address 0x7f5ebb208804 is out of bounds
========= and is 1 bytes after the nearest allocation at 0x7f5ebb208800 of size 4 bytes
```
Okay I looked at the address and it looks like we are writing consective bytes past the softmax_lse_ptr in from the combine func: I tried padding out the softmax_lse to q_padded and no more illegal memory errors on my repro:
```
========= COMPUTE-SANITIZER
========= ERROR SUMMARY: 0 errors
```
Fixes https://github.com/pytorch/pytorch/issues/131240
Fixes https://github.com/pytorch/pytorch/issues/131227
Fixes https://github.com/pytorch/pytorch/issues/131221
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131277
Approved by: https://github.com/malfet
Summary:
Previously it was unclear what `_convert_input_to_fake` actually does (used in strict), and in particular how it is different from `make_fake_inputs` (used in non-strict).
This PR splits that function to work purely on user inputs, then renames it to `extract_fake_inputs` and adds a comment clarifying what it does—namely, it extracts fake inputs from a given graph module instead of "converting inputs to fake inputs" (as suggested by the current name) or "making fake inputs" (as happens in non-strict, where no tracing has taken place yet).
The remainder of that function used to also fakify params and buffers. It turns out that this part is identical to what happens in non-strict, hence we also pull `make_fake_inputs` out from `non_strict_utils` into `_trace`, merge it with another util, and make both modes call it.
Test Plan: existing tests
Differential Revision: D60084442
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131421
Approved by: https://github.com/zhxchen17
Summary:
Newer versions of the MKL library return `SPARSE_STATUS_INVALID_VALUE` when badly formed non-triangular matrices are passed to the `mkl_sparse_?_trsv`/`mkl_sparse_?_mrsv` functions. This would start aborting (badly written) tests that worked with the old version which just filled the result tensor with `-NaN` instead of returning an error status.
This changes the code to fill the result tensor with `-NaN` on `SPARSE_STATUS_INVALID_VALUE` so we get the same behavior regardless of the MKL version in use.
Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test:sparse -- --run-disabled`
Differential Revision: D59542023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130382
Approved by: https://github.com/malfet
Summary:
- Log export errors to Scuba and mark them with "classified" and "unclassified"
- Classify errors by exception type (ALLOW_LIST) and a `case_name` attribute
- Add `case_name` for some exceptions.
Test Plan:
Running the code below logs a classified error to `torch_export_usage` table in Scuba.
```
import torch
from torch._export.db.case import SupportLevel
class TorchSymMin(torch.nn.Module):
"""
torch.sym_min operator is not supported in export.
"""
def forward(self, x):
return x.sum() + torch.sym_min(x.size(0), 100)
example_args = (torch.randn(3, 2),)
tags = {"torch.operator"}
support_level = SupportLevel.NOT_SUPPORTED_YET
model = TorchSymMin()
torch.export.export(model, example_args)
``
Differential Revision: D59981459
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131327
Approved by: https://github.com/zhxchen17
Fixes#130284Fixes#130653
- Add `torch.library.register_vmap` to custom ops
- Add `register_vmap` for operators in ops in custom_op_db.
- Make `torch.autograd.Function` support kwarg-only kwargs for vmap
- test operators in op_db with `tests/test_vmap`.
- change `test_vmap` to allow custom `out_dim` and allow "None" in `out_dim` when testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589
Approved by: https://github.com/zou3519
Fixes#126338
## Issue Summary
When torchinductor compiles the combination `functional_collective -> view.dtype -> wait`, a memory leak occurs. This happens because `view.dtype` is compiled into an out-of-place Triton kernel that copies the input data to a new tensor, even if the data hasn't completed collection via the wait operation. The tensor used by `collective` is only freed when the `wait` operation triggers the garbage collector, see [~WorkRegistry](https://github.com/pytorch/pytorch/blob/main/torch/csrc/distributed/c10d/Functional.cpp#L41). However, since `wait` now waits for a new tensor, the previous one is never freed. The `view.dtype` should only check the metadata instead of creating a new tensor. The current lowering is against its semantics and causes memory leaks.
See more great discussions in the #126338
This kind of lowering also generates unnecessary triton kernels for `view.dtype` when it can't be fused with other operations.
## Fix
The function `aten.view.dtype` is a CPU operation that changes the metadata of its input. After discussions with @eellison and @bdhirsh, we decided to change the lowering of `aten.view.dtype` to ensure it fallback properly to the correct `aten.view.dtype` instead of generating a Triton kernel in some cases. This approach also preserves the same semantics of the view operation.
When the model calls `aten.view.dtype` with a data type whose bit width matches the input's original data type, we lower it to the newly added `DtypeView` in IR, acting like a `ReinterpretView`. When the operation can be fused, its `make_loader` is called to maintain the correct type conversion for each load instruction. When the operation can't be fused, it falls back to `aten.view.dtype` to avoid Triton kernel generation.
## Example
```python
@torch.compile
def fn(x, y):
x = x.view(torch.float16)
y = y.view(torch.float16) + 1
return x @ y
x = torch.randn((2, 2), device=self.device, dtype=torch.bfloat16)
y = torch.randn((2, 2), device=self.device, dtype=torch.bfloat16)
fn(x, y)
```
The output code generated before this fix is like the following.
```python
triton_poi_fused_add_view_0...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 4
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x0 = xindex
tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)
tl.store(out_ptr0 + (x0), tmp1, xmask)
triton_poi_fused_add_view_1...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 4
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x0 = xindex
tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)
tmp2 = 1.0
tmp3 = tmp1 + tmp2
tl.store(out_ptr0 + (x0), tmp3, xmask)
def call(args):
...
triton_poi_fused_view_0.run(arg0_1, buf0, 4, grid=grid(4), stream=stream0)
del arg0_1
buf1 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
# Source Nodes: [view_1, y], Original ATen: [aten.add, aten.view]
triton_poi_fused_add_view_1.run(arg1_1, buf1, 4, grid=grid(4), stream=stream0)
del arg1_1
buf2 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
# Source Nodes: [matmul, view_1, x, y], Original ATen: [aten.add, aten.mm, aten.view]
extern_kernels.mm(buf0, buf1, out=buf2)
```
As you can see, the two `view` operations are compiled to two kernels `triton_poi_fused_view_0` nad `triton_poi_fused_add_view_1`. Both of them has a line `tmp1 = tmp0.to(tl.bfloat16).to(tl.float32, bitcast=True).to(tl.float32)` which does the type conversion.
The main issue is that the first `view` operation didn't do anything to the actual data. But it generates a triton kernel with a new output tensor. Another small issue is that this triton kernel can't be compiled because `bitcast=True` only support type converstion with same bidwidth.
The following are output code generated after this PR.
```python
triton_poi_fused_add_0...
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 4
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
x0 = xindex
tmp0 = tl.load(in_ptr0 + (x0), xmask).to(tl.float32)
tmp1 = tmp0.to(tl.bfloat16).to(tl.float32)
tmp2 = 1.0
tmp3 = tmp1 + tmp2
tl.store(out_ptr0 + (x0), tmp3, xmask)
def call(args):
...
triton_poi_fused_add_0.run(arg1_1, buf0, 4, grid=grid(4), stream=stream0)
del arg1_1
buf1 = empty_strided_cuda((2, 2), (2, 1), torch.float16)
# Source Nodes: [matmul, y], Original ATen: [aten.add, aten.mm]
extern_kernels.mm(aten.view.dtype(arg0_1, torch.float16), buf0, out=buf1)
```
The first `view` operation has been replaced with the `aten.view.dtype` and it is directly passed as an argument. The second one is still there because it is fused with the following add operation. The invalid bitcast operation is removed too.
The following two code snippets is for the upcasts and downcasts. For dtype in `torch.float16, torch.bfloat16`, each load will be upcasted to float32, then downcast to its original dtype to ensure use values with the right precision.
7bda23ef84/torch/_inductor/codegen/triton.py (L1725-L1726)7bda23ef84/torch/_inductor/codegen/triton.py (L629-L642)
Huge thanks to @eellison, @bdhirsh, @shunting314, and @desertfire .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128883
Approved by: https://github.com/eellison
Migrate all pull jobs to the new Amazon 2023 AMI runner type.
Exceptions:
- Distributed tests are still on the old AMI since they had some weird [test failures](https://github.com/pytorch/pytorch/actions/runs/10047579686/job/27770963175). Will debug those separately.
- Ported over a couple trunk and slow jobs that had `sync-tag`s set with the pull jobs and so needed to be on the same AMI
Revert plan, in case something starts breaking when we run these new AMIs at a larger scale:
- If specific jobs start failing consistently, we bring those jobs back to the old AMI
- If the failure is more widespread, revert this PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131250
Approved by: https://github.com/malfet, https://github.com/atalman
Summary: `test/distributed/_composable/test_replicate_with_compiler.py` torch.compiles. This change introduces a version of MultiProcessTestCase that derives from the inductor TestCase class to make sure we always get a clean cache dir.
Test Plan: `python test/distributed/_composable/test_replicate_with_compiler.py`
Differential Revision: [D59925519](https://our.internmc.facebook.com/intern/diff/D59925519)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131053
Approved by: https://github.com/eellison
Now that remote caching has evolved into various parts of PT2, we want to separate triton and pt2 caching as changes to one have caused SEVs to the other.
Differential Revision: D60047752
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131345
Approved by: https://github.com/aorenste
The problem was we were shoving SymInts into the constant_args side
table. The root problem is that torch.fx.node.base_types, which we use
to determine what can be put in the graph, doesn't actually have SymInt
in it. This PR fixes base_types to include SymInt.
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131363
Approved by: https://github.com/oulgen
When pin_memory and share_memory both are set to True in _create_cpu_state_dict, the memory is pinned using cudaHostRegister but is never unpinned. So, once tensor is created and freed, when a new tensor is created the caching allocator is allocating the same memory. This fails with below error.
```
obj = <[RuntimeError('CUDA error: part or all of the requested memory range is already mapped\nCUDA kernel errors might be a...pile with `TORCH_USE_CUDA_DSA` to enable device-side assertions.\n') raised in repr()] Tensor object at 0x7f0028a4d6c0> pg = None, device = None, _ = None
```
This PR fixes this by unregistering this memory on tensor free by attaching a hook.
This is easily reproducible with xlformers checkpointing unit tests and the fix is verified with the same.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131270
Approved by: https://github.com/LucasLLC
I regretted the decision in
https://github.com/pytorch/pytorch/pull/130606. Most user
torch_dispatchs don't have enough to actually handle the HOP correctly,
so for now I'd prefer that users explicitly define the interaction
between the HOP and their torch_dispatch class.
An example is FlopCounterMode: if we allow HOPs to get passed to it, it
will ignore auto_functionalized(mm) by default but it will record flops
for mm, which is weird.
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131370
Approved by: https://github.com/ydwu4
Summary:
Inductor will aggressively try to decompose and lower ops into a smaller opset. However, sometimes it may not align with kernel coverage (or perf preference) on different backends. (eg. Inductor will decompose Gelu into primitive ops, but certain backends already has a Gelu op) Therefore, we need a mechanism to allow customization of decomp for trace function so that Inductor will simply pass this op through.
Test Plan:
Reviewers:
@eellison
Subscribers:
Tasks:
Tags:
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131329
Approved by: https://github.com/eellison
There were some miscellaneous issues I found:
* The WrapperCodeGen subclass constructors don't accept any arguments, which doesn't mesh with how Inductor can try to construct them.
* A DeviceInterface subclass for Triton doesn't implement `triton_supported() == True`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130933
Approved by: https://github.com/eellison, https://github.com/jansel
For some non-contiguous tensors, tensor.view would trigger the following
runtime error:
"RuntimeError: view size is not compatible with input tensor’s size and stride
(at least one dimension spans across two contiguous subspaces).
Use .reshape(…) instead"
So, let's use reshape instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131302
Approved by: https://github.com/muchulee8, https://github.com/desertfire
The bug causing the correctness problem will be fixed in future OS release. Root cause of the problem is in a bug in an optimization to MPSGraph reshape operation in MacOS 14_4 that results in a correctness issue with the shapes the LSTM gradient operation has when num_layers > 2.
Solves silentness of issue #125803.
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130038
Approved by: https://github.com/malfet
Summary:
This diff reverts D59561509
D59561509: [FX][export] DCE pass, check schema for node impurity (#130395) by yushangdi causes the following test failure:
Tests affected:
- [cogwheel:cogwheel_mtia_cmf_m5_shrunk_test#test_flow_with_verification](https://www.internalfb.com/intern/test/844425041436985/)
Here's the Multisect link:
https://www.internalfb.com/multisect/6533402
Here are the tasks that are relevant to this breakage:
T191383430: 10+ tests unhealthy for ads_mtia_inference
The backout may land if someone accepts it.
If this diff has been generated in error, you can Commandeer and Abandon it.
Test Plan: NA
Differential Revision: D60029318
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131341
Approved by: https://github.com/angelayi
we have seen stacktrace samples showing that a lot of compilation time is spent in exceptions raised in `OpOverloadPacket.__getattr__`. It's not entirely clear why/how this happens, but I spot-checked a few places in `_inductor.graph.py` where we previously may have been calling `hasattr(OpOverloadPacket, ...)`, that can be avoided (hasattr will go through getattr, which, for OpOverloadPacket, will do a lookup in the dispatch table for all overload names of the packet).
Test Plan: CI
Differential Revision: D60048270
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131348
Approved by: https://github.com/davidberard98
SequenceParallel style assumes the input torch.Tensor ALREADY sharded on
the sequence dimension if not passing in DTensor. Since it causes some
user confusion on the documentation, this PR:
1. for the case where input passed in is already a DTensor, we check the
input placements and redistribute if it's not sharded on the sequence
dimension
2. update the doc to make it more explicit about the case when user
passed in a torch.Tensor and DTensor
This would fix https://github.com/pytorch/pytorch/issues/129355
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131346
Approved by: https://github.com/awgu
This PR re-implements pin memory aiming to get rid of the optional `device` argument and makes all related APIs to be device-agnostic. We add two new abstract APIs in [AcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/detail/AcceleratorHooksInterface.h#L12) and redefine pin memory as: "Pin memory is always pinned for the current accelerator device". In detail, it uses [getAcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/Context.h#L61) in pin_memory/is_pinned to get an appropriate device and invoke the corresponding overridden interfaces, instead of using BackendSelect and then dispatching to CUDA or other specific backends' implement methods.
Note: For new backends who want to implement and use pin memory, just inherit AcceleratorHooksInterface and overwrite the `isPinnedPtr` and `getPinnedMemoryAllocator` methods.
Additional context: To avoid BC-breaking, this PR just preserves the `device` arg of related APIs and would throw a deprecation warning if `device` arg is passed. Another PR will be submitted to update all PT callers (`Tensor.is_pinned()`, `Tensor.pin_memory()`...) not to pass this arg based on this PR. In future, `device` arg will be actually removed.
Relates #124908
Relates #14560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126376
Approved by: https://github.com/albanD
Fixes#130284Fixes#130653
- Add `torch.library.register_vmap` to custom ops
- Add `register_vmap` for operators in ops in custom_op_db.
- Make `torch.autograd.Function` support kwarg-only kwargs for vmap
- test operators in op_db with `tests/test_vmap`.
- change `test_vmap` to allow custom `out_dim` and allow "None" in `out_dim` when testing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130589
Approved by: https://github.com/zou3519
This is an updated PR to equip cond with the autograd feature and replaces the old [PR](https://github.com/pytorch/pytorch/pull/126007)
@ydwu4 I tried to incorporate your requests already.
Currently there are two problems that I struggle with solving:
1. There seems to be an import issue when trying to import cond in `torch/__init__.py`, see [here](8a704035c9/torch/__init__.py (L1914-L1916)). Therefore, I had to comment those lines, which resolved the import issues, but I believe cond is not proberly exposed as torch.cond.
2. I am not entirely sure how to deal with the opinfo test in `hop_db.py`
Co-authored-by: Yidi Wu <yidi@meta.com>
Co-authored-by: Xuehai Pan <XuehaiPan@outlook.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126911
Approved by: https://github.com/ydwu4
test_public_bindings should be run on anything that changes the public API - need to figure out in the future what is part of the public api, currently I'm using anything in torch/
flex_attention should be run on anything involving autograd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130397
Approved by: https://github.com/malfet
Summary:
X-link: https://github.com/pytorch/benchmark/pull/2388
We can enable accuracy checks at Diff time since it is not a performance metric.
* Refactor the existing diff time test to use the new PT2 Benchmark Runner.
* Deprecate the speedup tests and enable the accuracy tests only. We rely on ServiceLab to perform performance testing and regression detection.
Test Plan:
Sandcastle CI
Or buck test command:
```
buck2 test 'fbcode//mode/opt' fbcode//pytorch/benchmark/fb/test_gpu:run_test_gpu -- test_training_resnet50_accuracy
```
Test UI: https://www.internalfb.com/intern/testinfra/testrun/1688850102375429
Reviewed By: oulgen
Differential Revision: D59825601
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131266
Approved by: https://github.com/oulgen
Summary:
It is a long known pain point that if other users are running things, the call of `torch.cuda.memory.list_gpu_processes()` will error out:
```
torch.cuda.memory.list_gpu_processes()
File "torch/cuda/memory.py", line 647, in list_gpu_processes
procs = amdsmi.amdsmi_get_gpu_process_list(handle) # type: ignore[attr-defined]
File "amdsmi/py_interface/amdsmi_interface.py", line 1946, in amdsmi_get_gpu_process_list
_check_res(
File "amdsmi/py_interface/amdsmi_interface.py", line 510, in _check_res
raise AmdSmiLibraryException(ret_code)
amdsmi.py_interface.amdsmi_exception.AmdSmiLibraryException: Error code:
10 | AMDSMI_STATUS_NO_PERM - Permission Denied
```
So just catch this error
Test Plan: torch.cuda.memory.list_gpu_processes() no longer fails
Differential Revision: D59901053
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131018
Approved by: https://github.com/eqy, https://github.com/clee2000
Also bold certain text in the error message as suggested
<img width="3000" alt="Screenshot 2024-07-19 at 5 56 48 PM" src="https://github.com/user-attachments/assets/378f20c5-c6b2-4e53-8eaf-0bd26c3a6b60">
With a GLOBAL like `os.execv` the error message is now as such
```python
File "/data/users/mg1998/pytorch/torch/serialization.py", line 1256, in load
raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Trying to load unsupported GLOBAL posix.execv whose module posix is blocked.
Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131259
Approved by: https://github.com/malfet, https://github.com/albanD
test_public_bindings should be run on anything that changes the public API - need to figure out in the future what is part of the public api, currently I'm using anything in torch/
flex_attention should be run on anything involving autograd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130397
Approved by: https://github.com/malfet
Adds support for SymInts in the FakeTensor cache.
A couple notes:
1. When a SymInt is present in the input key for a FakeTensor operation we cache on the ShapeEnv instead of using the FakeTensorMode cache. This is necessary so we don't have to remember and check the guards. It reduces the cache hits but there's diminishing return on how much work we can do before the cache becomes more of a burden than a gain.
2. We need to be careful that when we cache an output SymInt that is a direct copy from the input that when we have a cache-hit we copy the SymNode from the input to the output. This is important because the fx-graph building code actually uses SymNode ids in the process of building the graph so constructing a same-content-but-different-id SymNode will fail.
3. In the cache key we store SymInts as a _PySymInputStub. These represent SymInt (and friends) but support `__hash__` and `__eq__` (which SymInt do not).
4. In the cache entry we store SymInts as a _SymIntOutputStub.
Perf example:
```
python benchmarks/dynamo/timm_models.py --ci --accuracy --timing
--explain --inductor --dynamic-shapes --dynamic-batch-only --device cuda
--training --amp --total-partitions 2 --partition-id 0 --output
/tmp/training_timm_models.csv --filter crossvit_9_240
```
fake tensor cache before:
```
INFO: FakeTensor cache stats:
INFO: cache_hits: 68137
INFO: cache_misses: 837
INFO: cache_bypasses:
INFO: symbolic shape: 48224
INFO: CompositeImplicitAutograd: 917
INFO: non-fake tensor: 70
INFO: non-FakeTensor output: 62
INFO: non-builtin: 8
INFO: dynamic output shape: 1
```
and after:
```
INFO: FakeTensor cache stats:
INFO: cache_hits: 88187
INFO: cache_misses: 14233
INFO: cache_bypasses:
INFO: CompositeImplicitAutograd: 1037
INFO: non-FakeTensor output: 602
INFO: non-fake tensor: 70
INFO: unsafe view: 36
INFO: non-builtin: 8
INFO: dynamic output shape: 1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127596
Approved by: https://github.com/eellison
ghstack dependencies: #131014, #129780
This is part of #127596, pulled out to make reviewing a little easier.
Flatten the FakeTensor cache key - so it's a list of singular elements and pointing at one requires a single index rather than a PyTree path. This is used in the next PR to allow us to have the cache entry refer to an input SymInt that it needs to copy directly into the output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129780
Approved by: https://github.com/oulgen, https://github.com/eellison
ghstack dependencies: #131014
Python 3.10 adds `@dataclass(slots=True)` to auto-build the `__slots__` for a dataclass. This is really useful but we can't use it until 3.10 becomes our minimum version.
Copied the code for that functionality from python into a new decorator and ported it to use 3.8 syntax (removed use of `match`).
Usage:
```
@dataclass_slots
@dataclass
class X:
pass
```
is the same as (in py3.10):
```
@dataclass(slots=True)
class X:
pass
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131014
Approved by: https://github.com/oulgen, https://github.com/eellison
Summary: The JK disables dynamo by passing None to set_eval_frame.
Test Plan:
Ran buck test mode/opt caffe2/test/dynamo:test_dynamo
Buck UI: https://www.internalfb.com/buck2/1fec33b4-c95a-4bdf-b47b-7c0b8ab9e24a
Test UI: https://www.internalfb.com/intern/testinfra/testrun/2814750010105363
Network: Up: 0B Down: 0B
Jobs completed: 9596. Time elapsed: 28:54.5s.
Tests finished: Pass 4796. Fail 0. Fatal 0. Skip 17. Build failure 0
Also manually write a small local test with torch.compile and toggles the code to see if PT2 can be disabled. Validated with running the test and observing the log.
PT2 enabled: P1486847242. Can see dynamo log about graph breaks.
PT2 disabled: P1486847727. No dynamo log. The newly added warning printed.
Reviewed By: ezyang
Differential Revision: D59968925
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131258
Approved by: https://github.com/c00w
1) Add skip undefined tensor in cpu fallback when call _copy_from_and_resize;
2) Modify to_cpu function support optional tensor;
3) Add copy back to origin optional tensor when alias_info isWrite is true.
@ezyang @bdhirsh
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130237
Approved by: https://github.com/ezyang
**Summary**
Fixed issue with updating the current module when transitioning between child module to parent module and in the backward pass. The first issue is caused because the prehook is not called again when we go back to the parent module and that the hook being used was a register_module_forward_hook, which runs before the register_module_hook used in redistribute, causing the collective call to be assigned to the incorrect module. In order to do this, I updated the current module to be the parent module in a register_forward_hook in the module tracker. The second issue was caused by the parent set in the module tracker I inherit from being incorrect. I fixed this issue by saving the parents of each module and using them in collective counter instead of the incorrect set. I have updated the example in module_operation_tracing to reflect the correct output. In addition, I changed the test cases that used the incompatible old CommDebugMode.
**Test Case**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing
2. pytest test/distributed/_tensor/debug/test_comm_mode_features.py -s -k test_transformer_module_tracing
3. python test/distributed/_composable/fsdp/test_fully_shard_training.py -k TestFullyShardGradientAccumulation.test_gradient_accumulation
4. python test/distributed/_tensor/test_math_ops.py -k DistMathOpsTest.test_layer_norm_bwd
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130995
Approved by: https://github.com/XilunWu
ghstack dependencies: #130410
…with large index
Fixes#130806
When an output size of 2147483648 (=131072*16384) is expected in the above issue, it throwed out the following error:
RuntimeError: HIP error: invalid configuration argument
What happened was that the second parameter passed to hipLaunchKernel was crazy {2147483648,1,1}.
Found two issues in the Indexing.cu:
1: ptrdiff_t was used but it is signed int, outTotalSize >= 2147483648 can cause overflow when doing [this](39493aa934/aten/src/ATen/native/cuda/Indexing.cu (L1367)):
2: On ROCm, std::min -> ::min did not work as expected when outTotalSize>=2147483648
As the result, 2147483648 was sent to hipLaunchKernel which the GPU does not support such a huge number since this number specifies the number of threads per block. The original code intended to set 128 threads per block, though this is debatable as the perf would not good for latest powerful GPUs (a TODO item to update for perf maybe?) , but at least it would not cause `invalid configuration argument` error.
[Test]
Run the same code snippet in the [issue](https://github.com/pytorch/pytorch/issues/130806), and print the output, its dim and numel(), which looks like below now:
```
output=tensor([[ 0.4044, -0.0244, -0.6865, ..., -0.7800, 0.1175, 1.6726],
[-1.0866, -0.1609, 0.3538, ..., 1.9105, 0.7882, 1.1583],
[-2.2079, 0.3736, 0.3610, ..., -0.2658, -0.0459, 1.3077],
...,
[ 0.8753, -0.7482, -0.1978, ..., 0.9016, 1.1501, -0.5178],
[-1.5845, -0.6277, 1.4520, ..., 0.5733, -2.1198, -0.0915],
[-0.6310, -1.0239, -0.1910, ..., 0.4309, 0.1630, 0.3239]],
device='cuda:0'), dim=2, numel=2147483648
```
Added a large tensor unit test too.
```
/pytorch# pytest test/nn/test_embedding.py -k test_large_tensors
================================================================================== test session starts ===================================================================================
platform linux -- Python 3.9.19, pytest-7.3.2, pluggy-1.4.0
rootdir: /dockerx/development/pytorch
configfile: pytest.ini
plugins: flakefinder-1.1.0, rerunfailures-14.0, xdist-3.3.1, xdoctest-1.1.0, cpp-2.3.0, hypothesis-5.35.1
collected 288 items / 287 deselected / 1 selected
Running 1 items in this shard
test/nn/test_embedding.py . [100%]
=========================================================================== 1 passed, 287 deselected in 3.16s ============================================================================
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130994
Approved by: https://github.com/jeffdaily, https://github.com/xw285cornell
## Description
For single thread case, this PR improves the cache blocking in CPP GEMM template with the CPU info (the L1 and L2 cache size). `Mc_blocks` and `Kc_blocks` are calculated based on the below condition:
- size_of_B < L1
- size_of_A < 0.5 * L2
For multi-thread, we need to tune the task decomposition among threads together with cache blocking. We disabled the cache blocking change for now and will submit a follow-up PR for multi-thread optimizations.
## Performance
No regressions. Models with > 3% performance speedup are listed below:
### BF16 single thread (measured on CPU with AMX support)
- static shape
| Model Family | Model Name | Speedup |
|--------------|------------|---------|
torchbench | detectron2_fasterrcnn_r_101_dc5| 4%
- dynamic shape
| Model Family | Model Name | Speedup |
|--------------|------------|---------|
torchbench | detectron2_fasterrcnn_r_101_dc5| 4%
### FP32 single thread (measured on Ice Lake)
- static shape
| Model Family | Model Name | Speedup |
|--------------|------------|---------|
torchbench | basic_gnn_edgecnn| 10%
- dynamic shape
| Model Family | Model Name | Speedup |
|--------------|------------|---------|
torchbench | basic_gnn_edgecnn| 10%
### Next step
The E2E level improvement is limited due to the below reasons:
- For several HF models, we can observe performance improvement at kernel level for the gemm template kernel but since the performance is either still worse than ATen kernel (thus won't be selected during autotune) or improved from worse than ATen to similar to ATen, so we don't see E2E level performance change.
- There're models where the gemm template kernel could get > 10% performance improvement with this PR but since the kernel time is only about 3% of the E2E time, we don't observe significant E2E level improvement.
We will continue to find possible optimizations in the gemm template kernel in follow-up PRs.
Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129348
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #130675, #130690
Currently we require `n % register_block_n == 0` which typically bring good perf when `n` is a multiply of 8, 16, 32 etc. while will fall back to the reference micro gemm otherwise (where `register_block_n == 1`). This PR optimizes this by padding `n` to the multiple of `register_block_n` which is 8, 16, 32 etc. for packed weight. Therefore, the micro-gemm can work as is on the padded `n`. When the weight is padded, we will use the local accumulation buffer to get the result from micro-gemm and then unpadded (sliced) before storing back to the output buffer.
Performance numbers measured on "Intel (R) Xeon (R) CPU Max 9480", single core, bf16.
Before
AUTOTUNE linear_unary(512x768, 3073x768, 3073)
_linear_pointwise 2.3563 ms 100.0%
cpp_packed_gemm_0 710.5902 ms 0.3%
After
AUTOTUNE linear_unary(512x768, 3073x768, 3073)
cpp_packed_gemm_0 1.8909 ms 100.0%
_linear_pointwise 2.1016 ms 90.0%
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130690
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #130675
0.12.0 Major Updates:
- Add context manager to temporarily set the dictionary sorting mode
- Add accessor APIs
- Use `stable` tag for `pybind11` for Python 3.13 support
- Fix potential segmentation fault for pickling support
0.12.1 Updates:
- Fix warning regression during import when launch with strict warning filters
Closes#130155
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130139
Approved by: https://github.com/zou3519
ghstack dependencies: #130895
Fixed TrainingIRToRunDecomp failures for test_tensor_attribute_zero_args and also a few re-tracability failures because run_decomposition does a retracing.
**edit:** also remove the eliminate_dead_code() in _unlift because of one onnx test failure:
a constant tensor attr was lifted as constant_tensor input but it's not used in the graph after aot_autograd due to a short cut in its decomposition. This causes the setattr to be removed by eliminate_dead_code but the graph signature still contains the name of that buffer, which causes an inconsitency between the transformed graph and ep's original signature after _unlift. And it seems that this has happened a few times where some nodes are accidentally removed and we're in an inconsistent state.
The alternative of removing it would be: every time we call elimiate_dead_code, we verify the consistency of the graph with 1. the graph before transformation and 2. all the meta datas but i think this deserves a complete design.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130990
Approved by: https://github.com/pianpwk
Resubmit of #129325
Previously each mutation was represented by a `MutationOutput` operation which
was a new scheduler node that must be scheduled immediately afterwards.
Now we have a single scheduler node, which produces mutiple `MutationOutput`
buffers as its output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130832
Approved by: https://github.com/lezcano
ghstack dependencies: #130831
------
The opposite of #130836. Pin `sympy >= 1.13.0` for Python >= 3.9 and `sympy == 1.12.1` for Python 3.8.
- #130836
See the PR description of #130836 for more details.
`sympy` 1.13.0 introduces some breaking changes which break our tests. More specifically:
- Ref [Backwards compatibility breaks and deprecations](https://github.com/sympy/sympy/wiki/release-notes-for-1.13.0#backwards-compatibility-breaks-and-deprecations)
> BREAKING CHANGE: Float and Integer/Rational no longer compare equal with a == b. From now on Float(2.0) != Integer(2). Previously expressions involving Float would compare unequal e.g. x*2.0 != x*2 but an individual Float would compare equal to an Integer. In SymPy 1.7 a Float will always compare unequal to an Integer even if they have the same "value". Use sympy.numbers.int_valued(number) to test if a number is a concrete number with no decimal part. ([#25614](https://github.com/sympy/sympy/pull/25614) by [@smichr](https://github.com/smichr))
`sympy >= 1.13.0` is required to enable Python 3.13 support. This should be part of #130689.
- #130689
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130895
Approved by: https://github.com/ezyang
There's no reason to ban them for vmap or jvp, because without the
{grad, vjp} transforms those just act above PyTorch autograd, which will
end up saving regular Tensors.
Test Plan:
- some tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131191
Approved by: https://github.com/drisspg
Summary: Since WaitCounter frontend itself has minimal depdendencies it's fine to be moved into c10. Specific backends can be registered/linked separately.
Test Plan: unit test
Reviewed By: jamesperng, asiab4, c-p-i-o
Differential Revision: D59842868
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131021
Approved by: https://github.com/asiab4
All the changes brought by the original PR have been addressed in alternative ways in the stack. Why the original PR has to be reverted requires more effort because there is some bad interaction with export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131058
Approved by: https://github.com/williamwen42
#### Issue
Model parameters sometime do not appear in the `named_parameters()` function. For example, when trying to jit.trace an already jit.scripted model. This PR fixes that by relying on `state_dict` to get both parameters`requires_grad=True` and buffers.
#### Test Plan
* `pytest test/export/test_converter.py -s -k test_convert_retrace_nested_scripted_modules`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129787
Approved by: https://github.com/angelayi
Summary:
In export workflow, we always have a lifted graph which doesn't fetch constants through get_attr nodes. This cause some compatibility issue when we're trying to use inductor's split_const_gm function with a lifted graph.
This diff make an additive change to split_const_gm's interface, such that, when the pass sees a placeholder node is present in the lifted_constants table, it will also use that as the source of constness.
This change won't break the existing code and the lifted_constants table can be used orthogonal to the existing const folding mechanisms.
Also as required from MTIA team, we want to introduce a small callback function used to skip certain nodes during const folding.
For the internal followup counterpart, see D59685145
Test Plan: buck run mode/opt caffe2/test:test_export -- -r split_const_gm
Differential Revision: D59692790
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130743
Approved by: https://github.com/desertfire, https://github.com/SherlockNoMad
… addition, fix device assignment for newly created variables in method
Fix an example: Resolve broadcasting error in attn_bias and attn_mask addition, fix device assignment for newly created variables in method
1. `attn_bias += attn_mask` would cause a broadcasting error. Because the shape of `attn_bias` is (L, S), the shape of the output would be expected as (L, S) too. When the shape of input is (N, num_heads, L, S), a broadcasting should be triggered. Then, the shape of the output would be (N, num_heads, L, S), which is unexpected.
2. `attn_bias` is a newly created variables in method, which is not assigned device.
**This is my retry of #130200 .** I used a wrong account in that pr.
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130209
Approved by: https://github.com/mikaylagawarecki
Summary:
`fr_trace.py` is used to analyze flight recorder dump files.
This script was taken from @wconstab and @zdevito.
Only minor changes made were to make the linter happy and add a few odd new fields that I added in version `2.2` of the collector portions.
Test Plan:
Tested manually on some flight recorder data and it seems to run.
TODO:
Address 15 odd `#type: ignore` that I put in there to make the linter happy for now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130764
Approved by: https://github.com/fduwjj
Earlier the signature of dequantize ops for decomposed quantized Tensor was changed for wider use-cases where the output dtype can be different from torch.float and needs to be passed during dequantization.
Please refer: https://github.com/pytorch/pytorch/pull/121450
However, setting of correct output dtype for dequantize ops was still missing in convert_pt2e flow.
This change enables the users to use PT2E quantization flow with non torch.float unquantized dtype, such as torch.bfloat16.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128953
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
The current code assumes that indirect variables will be created by the
same `IndexPropagation` instance, however that isn't true in the case of
masked sub-blocks where we take in variables from the parent block.
This fixes the issue by moving the var range information up to the
`LoopBody` object where it can be shared by all sub-blocks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130984
Approved by: https://github.com/lezcano
Regular update.
1. New 90 ATen operators and their variants are supported for XPU.
2. Bugfixing: a. Fixing out-of-bound memory access in index_put kernel b. Fixing debug build error
3. Binary change. Split device AOT code of SYCL kernel into multiple libraries to avoid linkage failure.
4. torch-xpu-ops test case enhancement: a. Hook PyTorch testing ob_db to align opInfo configuration with CUDA b. Hook _check_arg_device2 and freeze_rng_state to make XPU happy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131015
Approved by: https://github.com/EikanWang
Speedup bias-add compute by moving it to the epilogue. Performance numbers measured on "Intel (R) Xeon (R) CPU Max 9480", single core, bf16.
Before
AUTOTUNE linear_unary(512x768, 3072x768, 3072)
cpp_packed_gemm_0 1.9200 ms 100.0%
_linear_pointwise 1.9345 ms 99.3%
After
AUTOTUNE linear_unary(512x768, 3072x768, 3072)
cpp_packed_gemm_0 1.8321 ms 100.0%
_linear_pointwise 1.9246 ms 95.2%
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130675
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
beartype has served us well in identifying type errors and ensuring we call internal functions with the correct arguments (thanks!). However, the value of having beartype is diminished because of the following:
1. When beartype improves support for better Dict[] type checking, it discovered typing mistakes in some functions that were previously uncaught. This caused the exporter to fail with newer versions beartype when it used to succeed. Since we cannot fix PyTorch and release a new version just because of this, it creates confusion for users that have beartype in their environment from using torch.onnx
2. beartype adds an additional call line in the traceback, which makes the already thick dynamo stack even larger, affecting readability when users diagnose errors with the traceback.
3. Since the typing annotations need to be evaluated, we cannot use new syntaxes like `|` because we need to maintain compatibility with Python 3.8. We don't want to wait for PyTorch take py310 as the lowest supported Python before using the new typing syntaxes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130484
Approved by: https://github.com/titaiwangms
While for optimizations like pad_mm, there are always only two possible choices, for other decision procedures, like kernel choice selection, the set of "available" choices depends on the input. Instead of storing the choices as metadata, we can instead take a look at all choices for which we have collected data (i.e. `df[CHOICE_COL].unique()`).
In this PR, I also try to replace "choice" and "feedback" with global constants CHOICE_COL and FEEDBACK_COL.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130304
Approved by: https://github.com/eellison
This is an updated PR to equip cond with the autograd feature and replaces the old [PR](https://github.com/pytorch/pytorch/pull/126007)
@ydwu4 I tried to incorporate your requests already.
Currently there are two problems that I struggle with solving:
1. There seems to be an import issue when trying to import cond in `torch/__init__.py`, see [here](8a704035c9/torch/__init__.py (L1914-L1916)). Therefore, I had to comment those lines, which resolved the import issues, but I believe cond is not proberly exposed as torch.cond.
2. I am not entirely sure how to deal with the opinfo test in `hop_db.py`
Co-authored-by: Yidi Wu <yidi@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126911
Approved by: https://github.com/ydwu4
Summary:
Add three top level APIs for numeric debugger in pt2e flow that can log intermediate output in the model
and calculate summary for metric comparisons between nodes in two graphs
* `prepare_for_propagation_comparison`
* `extract_results_from_loggers`
* `compare_results`
Test Plan:
python test/test_quantization.py -k test_prepare_for_propagation_comparison
python test/test_quantization.py -k test_extract_results_from_loggers
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130643
Approved by: https://github.com/dulinriley, https://github.com/tarun292
We currently can't generate split scans when there are multiple scan
values, so we normally fall back to ATen. However, for the higher order
scan op, we can't fallback so it makes sense to just generate the slower
kernel anyway. This avoids having special shapes where we fail to
codegen.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130936
Approved by: https://github.com/lezcano
Sets `prefer_deferred_runtime_asserts_over_guards=True` for export, so any guards emitted from `SymNode.expect_true` (for example, guards that are implicitly required to be true for an op to succeed) won't lead to constraint violations. Instead these should appear in the graph as runtime asserts, or potentially as replacement expressions for placeholder shapes.
For example, this reshape op should emit s0 * s1 = s2, deferred as a runtime assert.
```
x = torch.randn(4, 8) # [s0, s1]
y = torch.randn(32) # [s2]
out = x.reshape(-1) + y
# this emits Eq(s0 * s1, s2), and we represent y's shape as [s0*s1] in the graph.
```
However, other complex guards can still cause export to fail, for instance guards emitted from `SymNode.guard_bool/guard_size_oblivious` (e.g. explicit if-else conditions in user code or lower-level op implementations hit during tracing) can still raise constraint violations. These can be deferred with `allow_complex_guards_as_runtime_asserts=True`. We don't yet make this default, because while this makes export more likely to succeed, it results in non-trivial asserts being emitted that often represent specialization to a variant of the op, or checks related to 0/1 specialization.
We also remove forced specializations for export and kill the `_disable_forced_specializations` flag - now any guard we can't express with Dims/DerivedDims either are handled with Hybrid SymInts, or should be resolved with rewriting or deferring.
Follow up:
Currently, `ShapeEnv._set_replacement()` is called for complex equality expressions (e.g. s2 -> s0*s1 in the example above), and the ExportedProgram stores `s0*s1` in the input placeholder. This isn't checked for validity when the program is run, so an option is to avoid replacement and/or runtime assert on equality.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130775
Approved by: https://github.com/avikchaudhuri
The #130912 error happens since `operator.mul` does not have `_schema`.
So why do we have `operator.mul` and why is it not dispatched to `torch.ops.aten.mul`? This op comes from %mul_3.
%mul_3 : [num_users=50] = call_function[target=operator.mul](args = (%arg689_1, 4096), kwargs = {})
`%arg689_1` is a placeholder with `meta[‘val’] = s0`. It comes form dynamic shapes and represents the batch size since it’s also used in many other nodes such as:
%view_1 : [num_users=1] = call_function[target=torch.ops.aten.view.default](args = (%mm, [%arg689_1, 4096, 320]), kwargs = {})
and
%native_group_norm_2 : [num_users=1] = call_function[target=torch.ops.aten.native_group_norm.default](args = (%div_1, %arg16_1, %arg17_1, %arg689_1, 320, 4096, 32, 1e-06), kwargs = {})
To fix the issue, we can add `operator.mul` to skip list.
Fixes#130912
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130986
Approved by: https://github.com/eellison
Fixes#128745
Solve the issue with conflicts when users use full_state_dict while the model is FSDP.
Current solve the issue for `full_state_dict=True`, with error
`'aten.copy_.default: got mixed torch.Tensor and DTensor, need to convert all torch.Tensor to DTensor before calling distributed operators!',).`
TODO: for` broadcast_from_rank0=True, full_state_dict=True`, the error is
`NotImplementedError: c10d::broadcast_: attempted to run this operator with Meta tensors`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129635
Approved by: https://github.com/fegin
Summary: Finishing up the mechanism to "register" certain types of operators to a registry so that the serializer can handle them correctly. This is expected to be firstly used by executorch.
Test Plan: buck run mode/opt caffe2/test:test_export -- -r test_export_with_extension_op_serialization
Differential Revision: D59825148
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130851
Approved by: https://github.com/angelayi
- More conservative estimation of plannable inputs
- Consider constant_pad_nd as pointwise node in concat lowering
- Use aten.cat instead of constant pad ndwhen padding just a single dimension because it can be memory-planned away
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128909
Approved by: https://github.com/Chillee
**Summary**
Currently, the output of CommDebugMode contains a lot of noise, such as operations that usually won’t tell the user much information such as aten.detach.default. I have created a set of these trivial operations and added a user argument noise_level for users to choose how much information they would want to receive.
noise_level = 1 prints module-level collective counts
noise_level = 2 prints operations not included in trivial operations and module information
noise_level = 3 prints all operations
In addition, I have removed the generate_module_tracing_table since noise_level = 1 essentially replaces it. Finally, I have updated the examples and test cases.
**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_json_dump
2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump
3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing
4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing
5. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing
6. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130410
Approved by: https://github.com/XilunWu
Might fix#127660, need to test some more cases.
We update the reinplacing pass. If we have something like the following,
where "sin" is a custom op (this situation should also apply to triton
kernels)
```py
def graph(x):
y = sin(x)
z = sin(y)
x.copy_(z)
```
then the reinplacer used to produce the following:
```py
"""step 1: reinplaces the first sin"""
def graph(x):
x_clone = x.clone()
sin_out(x, out=x_clone)
z = sin(x_clone)
x.copy_(z)
"""step 2: reinplaces the second sin"""
def graph(x):
x_clone = x.clone()
sin_out(x, out=x_clone)
sin_out(x_clone, out=x_clone)
x.copy_(x_clone)
```
However, the first clone is unnecessary. It is safe to reinplace
the first sin into the following:
```py
def graph(x):
sin_out(x, out=x)
z = sin(x)
x.copy_(z)
```
because there are no users of `x`'s original value (the copy_ node
doesn't actually use the original value of x!)
This PR updates the reinplacing pass to ignore copy_ in its computation
of if the original value of the mutated argument is still needed.
NB: this also applies to triton kernels, but it was easier for me to
reason about custom ops (and my repros were all for custom ops).
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130866
Approved by: https://github.com/oulgen
# Summary
- This removes a bunch of example score mods that were primarily used for testing and places them directly in the test file. We should follow up with merging test_flex_decode and test_flash when the velocity slows down a little
- Fixes a bug with indexing on block mask
- Adds some doc strings to helper funcs and fixes some misc typing things
- Forces functions passed to `create_block_mask` to mask_mods and updates tests files
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130871
Approved by: https://github.com/joydddd, https://github.com/Chillee
This PR re-implements pin memory aiming to get rid of the optional `device` argument and makes all related APIs to be device-agnostic. We add two new abstract APIs in [AcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/detail/AcceleratorHooksInterface.h#L12) and redefine pin memory as: "Pin memory is always pinned for the current accelerator device". In detail, it uses [getAcceleratorHooksInterface](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/Context.h#L61) in pin_memory/is_pinned to get an appropriate device and invoke the corresponding overridden interfaces, instead of using BackendSelect and then dispatching to CUDA or other specific backends' implement methods.
Note: For new backends who want to implement and use pin memory, just inherit AcceleratorHooksInterface and overwrite the `isPinnedPtr` and `getPinnedMemoryAllocator` methods.
Additional context: To avoid BC-breaking, this PR just preserves the `device` arg of related APIs and would throw a deprecation warning if `device` arg is passed. Another PR will be submitted to update all PT callers (`Tensor.is_pinned()`, `Tensor.pin_memory()`...) not to pass this arg based on this PR. In future, `device` arg will be actually removed.
Relates #124908
Relates #14560
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126376
Approved by: https://github.com/albanD
We should be able to create multiple CUDAPluggableAllocators in the same pytorch program (see https://github.com/pytorch/pytorch/issues/124807, https://github.com/pytorch/pytorch/pull/125722 for context). When mixing CUDAPluggableAllocators in the same pytorch program, we need to make sure that the deleter passed in through the CUDAPluggableAllocator gets "attached" to the data_ptr and persist until program exit (when it's called to free the memory).
Currently, CUDAPluggableAllocator maintains a global `current_custom_allocator`. When creating the `DataPtr`, `raw_deleter` attaches `custom_raw_deleter` to the DataPtr which calls `current_custom_allocator->raw_delete(...)`. This approach is fine when using only one allocator, however for multiple allocator use case, DataPtr would be using the deleter of whatever is in the `current_custom_allocator`. For example, if allocation 1 was done with `cudaMalloc` and allocation 2 was done with `ncclMemAlloc`, and if `current_custom_allocator` is currently pointing to the CUDAPluggableAllocator with `ncclMemAlloc` - when cleaning up the allocation 1, we'd be using `ncclMemFree` instead of `cudaFree`.
In this PR, we solve the above problem by remembering the `free_fn_` using a deleter context. Hence, there is no need to go through an allocator object to find the deleter.
CC: @zdevito @ptrblck @eqy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130472
Approved by: https://github.com/eqy, https://github.com/ezyang
Summary: Modify the existing `sum` operator in PyTorch, invoked by `torch.sum`, to allow for reductions along the ragged dimension of a nested tensor. This diff enables PyTorch users to invoke `torch.sum` on a nested tensor with `dim=1`, where `ragged_idx=1`.
Functions modified in `caffe2/torch/nested/_internal/ops.py`:
- `sum_dim_IntList()`: The function assumes that `ragged_idx=1`; in the case that `dim=1` as well, where `dim` is the dimension on which we reduce, this diff invokes the PyTorch benchmark found in D58423489. Specifically, this diff pads a nested tensor, e.g. of logical shape `(B, *, M)`, using [`torch.ops.aten._jagged_to_padded_dense_forward`](https://www.internalfb.com/code/fbsource/[92c2a067ab04e3eebc999254fed4ae2fbea6def3]/fbcode/deeplearning/fbgemm/fbgemm_gpu/fb/inductor_lowerings/elementwise_ops.py?lines=26), then reduces across the `*` dimension (`dim == 1`) to a `(B, M)` output tensor.
- `_wrap_jagged_dims()`: This diff adds special handling to allow for the case where `dim` contains `1` and not `0`, but to continue disallowing the case where `dim` contains `0` and not `1`. In this function's creation, I created a helper function, `_get_condition_for_invalid_jagged_reductions()`, which makes it clearer which conditions apply to which operators. Specifically, operators which are enabled with jagged reductions are specified at the top of the file in `SUPPORTED_JAGGED_REDUCTIONS` and have a different set of conditions that need to be tested, as reducing along `dim == 1` without `dim == 0` is now possible.
Functions modified in `caffe2/test/test_nestedtensor.py`:
- `test_sum_int_DimList()`: This diff adds special handling in the `sum` unit test to allow for the case where `dim` contains `1` and not `0`, but to continue disallowing the case where `dim` contains `0` and not `1`.
- `test_sum_int_DimList_ragged_dim_1()`: This diff adds a new unit test which verifies the accuracy and feasibility of reducing along the jagged dimension of a nested tensor.
Notes:
- This diff solely adds functionality for the case in which we reduce only along the ragged dimension. Cases in which we reduce along both the ragged and another dimension, like `dim == (1, 2)`, are not permitted, as this set of diffs focuses primarily on the former.
- The `sum` operator is the only operator which uses the function `_wrap_jagged_dims()`; all other operators use `_wrap_jagged_dim()`. I would like to later look into why this is the case and if we can consolidate this!
- I modified some of the comments in the `sum` function as well as the unit tests for more clarity.
Test Plan:
Verify that existing (`test_sum_int_DimList`) and new (`test_sum_int_DimList_ragged_dim_1`) unit tests pass via the following command:
```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_sum_int_DimList
```
Differential Revision: D59571209
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130425
Approved by: https://github.com/davidberard98
Reland of: https://github.com/pytorch/pytorch/pull/128016
Summary from previous PR:
We assume only two possible mutually exclusive scenarios:
Running compiled region for training (Any of inputs has requires_grad)
Produced differentiable outputs should have requires_grad.
Running compiled region for inference (None of inputs has requires_grad)
All outputs do not have requires_grad.
Even if user runs the region under no_grad(), but has an input Tensor with requires_grad - we go Training scenario (1).
With current state that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only that any of inputs requires_grad
2/ if needs_autograd => trace_joint (We are in training scenario 1.) => always run compiled region under with.enable_grad()
Changes in partitioner?
Inference and Training graphs had difference in return container, list/tuple.
The changes in partitioner are done to unify and return always tuple.
As a result - some changes in test_aotdispatch.py for graph contents list -> tuple.
Why was revert?
There was a regression of hf_Reformer model on inference.
```
TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode
```
Because one of the compiled graphs contained outputs, which are aliases to the inputs that are nn.Parameter(requires_grad=True).
Even if inference bencharmsk torchbench runs inside with` torch.no_grad()` - alias (specifically for hf_Reformer - expand) ops preserve requires_grad.
As a result we started compiling training graph instead of inference.
Fix for view ops:
If we have outputs, that are aliases to inputs that requires_grad, those outputs requires grad is not a reason to generate training graph.
This is handled in aot_autograd.py, where output_and_mutation_safe are calculated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890
Approved by: https://github.com/bdhirsh
Summary:
This diff introduces a much more flexible model for WaitCounter backend:
1. Backend can be installed dynamically (even if not linked with pytorch) instead of relying on macros and swapping implementation at compile time
2. Multiple backends are supported at the same time.
Test Plan: unit test
Reviewed By: jamesperng
Differential Revision: D59795863
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130934
Approved by: https://github.com/asiab4
Summary: `test/distributed/_composable/test_replicate_with_compiler.py` exercises inductor. This change introduces a version of MultiProcessTestCase that derives from the inductor TestCase class to make sure we always get a clean cache dir.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129494
Approved by: https://github.com/eellison
This PR resolves several sets of `_scaled_mm` test failures:
- `scale_a` and `scale_b` are now required arguments, so the function `sample_inputs_scaled_mm` must supply them
- `_scaled_mm` does not support `"meta"` device, so it should be skipped in `test_meta.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130897
Approved by: https://github.com/drisspg
Taking inspiration from `GraphModule.print_readable` (aka I copied its [code](17b45e905a/torch/fx/graph_module.py (L824))), I added a `print_readable` to the unflattened module, because it's kind of nontrivial to print the contents of this module.
Example print from `python test/export/test_unflatten.py -k test_unflatten_nested`
```
class UnflattenedModule(torch.nn.Module):
def forward(self, x: "f32[2, 3]"):
# No stacktrace found for following nodes
rootparam: "f32[2, 3]" = self.rootparam
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:99 in forward, code: x = x * self.rootparam
mul: "f32[2, 3]" = torch.ops.aten.mul.Tensor(x, rootparam); x = rootparam = None
# No stacktrace found for following nodes
foo: "f32[2, 3]" = self.foo(mul); mul = None
bar: "f32[2, 3]" = self.bar(foo); foo = None
return (bar,)
class foo(torch.nn.Module):
def forward(self, mul: "f32[2, 3]"):
# No stacktrace found for following nodes
child1param: "f32[2, 3]" = self.child1param
nested: "f32[2, 3]" = self.nested(mul); mul = None
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:79 in forward, code: return x + self.child1param
add: "f32[2, 3]" = torch.ops.aten.add.Tensor(nested, child1param); nested = child1param = None
return add
class nested(torch.nn.Module):
def forward(self, mul: "f32[2, 3]"):
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:67 in forward, code: return x / x
div: "f32[2, 3]" = torch.ops.aten.div.Tensor(mul, mul); mul = None
return div
class bar(torch.nn.Module):
def forward(self, add: "f32[2, 3]"):
# No stacktrace found for following nodes
child2buffer: "f32[2, 3]" = self.child2buffer
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:87 in forward, code: return x - self.child2buffer
sub: "f32[2, 3]" = torch.ops.aten.sub.Tensor(add, child2buffer); add = child2buffer = None
return sub
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128617
Approved by: https://github.com/zhxchen17, https://github.com/pianpwk
Previously, it was only possible to collect data or use a heuristic regardless of where autoheuristic is used. This PR makes it possible to collect data for some optimizations while using a learned heuristic for other optimizations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130245
Approved by: https://github.com/shunting314
FSDP2 eager pre-allocates the output buffer for AllGather and the AllGather just writes into that buffer. However, under compile, by default we use out-of-place AllGather, which means in Traceable FSDP2 case we will be unnecessarily using more memory than eager. We want to re-inplace that AllGather instead.
This PR adds a post_grad pass to re-inplace all_gather_into_tensor (i.e. changing it from `all_gather_into_tensor.default` out-of-place op to `all_gather_into_tensor_out.default` out-variant op).
One thing to note is that since with this pass we are introducing a mutable op into the post_grad FX graph, we must do this pass after `reinplace_inplaceable_ops` (at which point we are okay again with having mutable ops in the graph). To facilitate this, this PR adds a `post_grad_custom_post_reinplace_pass` extension point to allow user-defined post-reinplace FX passes.
---
Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_fullgraph_backend_inductor`
---
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129773
Approved by: https://github.com/eellison
This PR allows `fully_shard`'s first argument to be `List[nn.Module]` instead of strictly `nn.Module`. This allows more flexible grouping of modules/parameters for communication, which can lead to memory savings and/or more efficient communication.
**Approach**
At a high level, we can think of a model as a tree of modules. Previously, we could only select specific module nodes in this tree as representing one FSDP parameter group. With this PR, we can select a group of module nodes, effectively becoming a single super node.
To implement the runtime schedule, we define new forward hooks that run based on the following semantics:
- If a module is the first to run the pre-hook, actually run the given pre-hook. Otherwise, the pre-hook is no-op.
- If a module is the last to run the post-hook, actually run the given post-hook. Otherwise, the post-hook is a no-op.
- First and last are determined by scoreboarding against a set of the modules.
- This set must get cleared at the end of backward in the case that >=1 module in the list is never used, in which case we still want the forward hooks to run in the next forward after this backward.
Beyond these new forward hooks, everything else is some simple generalization from `Module` to `List[Module]` or `Tuple[Module, ...]`.
**Examples**
This PR enables wrapping Llama models more efficiently by grouping the final norm and output linear together: https://github.com/pytorch/torchtitan/pull/382.
If at least one of the modules in the list does not run forward before backward, then there will be a warning message like:
```
1 of the 2 modules passed to fully_shard did not run forward before backward, which is error-prone since FSDP post-forward/pre-backward logic will not run for these modules. We recommend passing only modules that run forward together. Modules that did not run forward: [FSDPLinear(in_features=1, out_features=1, bias=True)]
```
---
**Changes for reland:** none since breakage was from PR below
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130949
Approved by: https://github.com/weifengpy
ghstack dependencies: #130947
This PR relaxes `@contract` to allow the 1st argument to be `Sequence[nn.Module]` instead of strictly `nn.Module`. This is required for the next PR, which allows `fully_shard` to take in `List[nn.Module]`.
---
**Changes for reland:**
- The previous PR assumed that any `func` decorated with `@contract` would return the same input `module` as output (which is true for PT-D composable APIs).
- However, TorchRec `shard` returns a different module as output (though that module _does_ satisfy the `@contract` FQN check).
- This PR removes the assumption and instead only enforces the FQN check following the input module order. In other words, if calling `func([x1, ..., xN])` for `N` modules `x1, ..., xN` that returns `[y1, ..., yM]` for `M` modules, we require that `N = M` and that FQNs are preserved coordinate-wise: `xi` and `yi` have same FQNs for all `i = 1, ..., N`.
Differential Revision: [D59863438](https://our.internmc.facebook.com/intern/diff/D59863438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130947
Approved by: https://github.com/weifengpy, https://github.com/atalman
Fixes#127666.
Other std math functions are replaced with those in the global namespace during hipify. HIP does not claim to support every function in the C++ standard library. std::clamp is not yet supported and we have been relying on the std implementation. For Fedora 40 + gcc 14, a host-side assert is used which is not supported. Work-around this by replacing std::clamp with min and max. Using #ifndef USE_ROCM to differentiate between CUDA using std::clamp and the ROCm replacement broke Windows builds. The replacement generates the same PTX as std::clamp, so using the replacement unconditionally. The replacement generates the same PTX as std::clamp. See https://godbolt.org/z/Wde9KW3v4 for a sample.
Original patch comes from @lamikr. Modified to improve efficiency.
https://github.com/lamikr/rocm_sdk_builder/pull/37
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127812
Approved by: https://github.com/hongxiayang, https://github.com/malfet
Summary: Uses original ExportedProgram constants and graph signature to inform decompositions, so that constant tensors and non-persistent buffers are respected for training IR. Removes 7 test failures for training IR.
Test Plan: test_export
Differential Revision: D59820909
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130864
Approved by: https://github.com/angelayi
Summary: `collect_defined_kernels()` is essentially patching deep inside to see if a specific codegen is happening. We could also patch somewhere in the cache path to make sure it's called, but I'm not sure that's really testing anything interesting. I suggest it's better to just disable the remote cache here.
Test Plan: `buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:metrics -- --exact 'caffe2/test/inductor:metrics - test_kernel_args_num_gb (caffe2.test.inductor.test_metrics.TestMetrics)' --run-disabled --stress-runs 10`
Differential Revision: D59825899
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130853
Approved by: https://github.com/oulgen
In this PR, I added support for packaging the AOTI generated files into a zipfile, and loading it in python.
`compile_so` takes the path to the package, a device, and a desired so_path location, and compiles package into a .so, and saves to the specified location.
`load_package` takes a path to the package and device, calls _extract_so, and then creates a callable to run the compiled model.
The zipfile generated looks like the following:
```
|- version
|- archive_format
|- data
|- aotinductor
|- cbtnafqaqrhvwztv7xudlal4xs6sofxa5oxccyuaqtrt6aozaklx.cubin # AOTI cuda generated cubin files
|- cskkqtna23bty2v3aq7g2q37cxrgufehlkuaaolhlgug5zg6fuwe.cpp # AOTI generated cpp file
|- cskkqtna23bty2v3aq7g2q37cxrgufehlkuaaolhlgug5zg6fuwe_compile_flags # Flags for compiling the .o
|- c6qqtnpgwfi3dv5nb76ai773kt45ezoxfwdmd7q37lvq6fs2tnoi.o # AOTI saved const.o
|- cskkqtna23bty2v3aq7g2q37cxrgufehlkuaaolhlgug5zg6fuwe_linker_flags # Flags for linking the files to form the .so
|- constants
|- constants.pt # Constants saved using torch.save, can be loaded using mmap
```
The workflow is something like:
```
with torch.no_grad():
ep = torch.export.export(
model,
example_inputs,
dynamic_shapes=dynamic_shapes,
strict=False,
)
gm = ep.module()
package_path = torch._inductor.aot_compile(
gm,
example_inputs,
options= {
"aot_inductor.output_path": "my_path.pt2", # or a directory
"aot_inductor.package": True,
}
)
compiled_model = torch._inductor.package.load_package(package_path, device)
return compiled_model
```
I tried turning on loading the weights using mmap by default, but had some trouble with it, so that is just left as a todo
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129895
Approved by: https://github.com/malfet
Summary:
This diff does a minor cleanup of WaitCounters:
1. Fixes some singleton use to ensure one instance of WaitCounterImpl per counter per process
2. Updates API to enable measuring duration of individual wait operations
Test Plan: unit test
Differential Revision: D59709324
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130664
Approved by: https://github.com/c-p-i-o, https://github.com/asiab4
Enables a few extra ruff rules, most of which do not have any violations as I already cleaned them with earlier PRs, these just turns them on to enforce them. Adds 1 noqa as we want the suboptimal lambda generation + call kept as a test. Also enables the test in flake8
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130700
Approved by: https://github.com/justinchuby, https://github.com/ezyang
Adds better error messages when a socket fails to bind in libuv.
New format:
```
The server socket has failed to bind. port: 1, useIpv6: 0, code: -13, name: EACCES, message: permission denied
```
Old format:
```
The server socket has failed to listen on any local network address. useIpv6: 0, code: -98, name: EADDRINUSE, message: address already in use
```
Test plan:
Added test in `test_store.py`
```
python test/distributed/test_store.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130797
Approved by: https://github.com/kurman
This is useful for splitting grad to run in two parts while preserving intermediates:
<details>
<summary>
Click to see code
</summary>
```python
import collections
import weakref
from torch.autograd.graph import GradientEdge
def _get_grad_fn_or_grad_acc(t):
if t.requires_grad and t.grad_fn is None:
return t.view_as(t).grad_fn.next_functions[0][0]
else:
return t.grad_fn
def reverse_closure(roots, target_nodes):
# Recurse until we reach a target node
closure = set()
actual_target_nodes = set()
q: Deque = collections.deque()
for node in roots:
if node is not None and node not in closure:
closure.add(node)
q.append(node)
while q:
node = q.popleft()
reverse_edges = node.metadata.get("reverse_edges", [])
for holder_ref, idx in reverse_edges:
ref = holder_ref()
if ref is not None:
raise RuntimeError("Reverse graph is no longer alive")
fn = ref.node
if fn in closure or fn is None:
continue
if fn in target_nodes:
actual_target_nodes.add(fn)
continue
closure.add(fn)
q.append(fn)
return closure, actual_target_nodes
# Enable weak pointer
class Holder():
def __init__(self, node):
self.node = node
# TODO: use weak references to avoid reference cycle
def construct_reverse_graph(roots):
q: Deque = collections.deque()
root_seen = set()
reverse_graph_refs = []
for node in roots:
if node is not None and node not in root_seen:
q.append(node)
root_seen.add(node)
while q:
node = q.popleft()
for fn, idx in node.next_functions:
if fn is not None:
# Don't necessarily need to store on the graph
reverse_edges = fn.metadata.get("reverse_edges", [])
if len(reverse_edges) == 0:
q.append(fn)
holder = Holder(node)
holder_ref = weakref.ref(holder)
reverse_graph_refs.append(holder)
reverse_edges.append((holder_ref, idx))
fn.metadata["reverse_edges"] = reverse_edges
return reverse_graph_refs
def get_param_groups(inputs, params):
inputs_closure, _ = reverse_closure(inputs, set())
param_groups = dict() # keyed on intermediates
for i, param in enumerate(params):
closure, intersected = reverse_closure([param], inputs_closure)
param_group = {
"params": set([param]),
"intermediates": set(intersected),
}
for input_node in intersected:
existing = param_groups.get(input_node, None)
if existing is not None:
existing["params"] = existing["params"].union(param_group["params"])
existing["intermediates"] = existing["intermediates"].union(param_group["intermediates"])
param_group = existing
else:
param_groups[input_node] = param_group
# Sanity check: union of all param_groups params should be equal to all params
union_params = set()
seen_ids = set()
unique_param_groups = []
for param_group in param_groups.values():
if id(param_group) not in seen_ids:
seen_ids.add(id(param_group))
unique_param_groups.append(param_group)
union_params = union_params.union(param_group["params"])
assert union_params == set(params)
return unique_param_groups
def compute_grads_only_inputs2(roots, inps, weights):
root_grad_fns = list(map(_get_grad_fn_or_grad_acc, roots))
inp_grad_fns = list(map(_get_grad_fn_or_grad_acc, inps))
weight_grad_fns = list(map(_get_grad_fn_or_grad_acc, weights))
reverse_graph_refs = construct_reverse_graph(root_grad_fns)
param_groups = get_param_groups(inp_grad_fns, weight_grad_fns)
del reverse_graph_refs
for param_group in param_groups:
for i, intermediate in enumerate(param_group["intermediates"]):
def get_hook(param_group, i):
def hook(grad_inputs):
if param_group.get("grads", None) is None:
param_group["grads"] = [None] * len(param_group["intermediates"])
param_group["grads"][i] = grad_inputs
return hook
# These are always "split" nodes that we need to recompute, so
# save their inputs.
intermediate.register_prehook(get_hook(param_group, i))
dinputs = torch.autograd.grad((out,), inputs=tuple(inps), grad_outputs=(torch.ones_like(out),), retain_graph=True)
return dinputs, param_groups
def compute_grads_only_weights2(user_weights, param_groups):
all_dweights = dict()
for param_group in param_groups:
# TODO: Handle case where intermediate can have multiple outputs
intermediate_edges = tuple(GradientEdge(i, 0) for i in param_group["intermediates"])
weights_edges = tuple(GradientEdge(w, 0) for w in param_group["params"])
assert all(len(g) == 1 for g in param_group["grads"])
# [NEW!] Able to pass a GradientEdge to autograd.grad as output
# We do not need to retain_graph because... guarantee no overlap?
print("trying to execute: ", intermediate_edges, weights_edges)
dweights = torch.autograd.grad(intermediate_edges, weights_edges, grad_outputs=sum(param_group["grads"], tuple()))
for w, dw in zip(param_group["params"], dweights):
all_dweights[w] = dw
# return grads in the original order weights were provided in
out = []
for w in user_weights:
grad_acc = _get_grad_fn_or_grad_acc(w)
out.append(all_dweights[grad_acc])
return tuple(out)
```
</details>
```python
import torch.nn as nn
# Setup
mod1 = nn.Linear(10, 10)
mod2 = nn.Linear(10, 10)
a = torch.rand(10, requires_grad=True)
weights = tuple(mod1.parameters()) + tuple(mod2.parameters())
inps = (a,)
out = mod2(mod1(a))
class LoggingTensorMode(torch.utils._python_dispatch.TorchDispatchMode):
def __torch_dispatch__(self, func, types, args=(), kwargs=None):
if kwargs is None:
kwargs = {}
rs = func(*args, **kwargs)
print(f"{func.__module__}.{func.__name__}")
return rs
print(" -- SPLIT -- ")
# Compute gradients in two parts
with LoggingTensorMode():
print("PART 1")
dinputs, state = compute_grads_only_inputs2((out,), inps, weights)
print("PART 2")
dweights = compute_grads_only_weights2(weights, state)
out = mod2(mod1(a))
print(" -- REF -- ")
# Compare with reference
with LoggingTensorMode():
ref_all_gradients = torch.autograd.grad(out, inputs=tuple(inps) + weights, grad_outputs=(torch.ones_like(out),))
for actual, ref in zip(dinputs + dweights, ref_all_gradients):
print(torch.allclose(actual, ref))
```
<img width="598" alt="image" src="https://github.com/pytorch/pytorch/assets/13428986/3681b8a7-3ab4-4d1d-a836-abef6913e671">
```
PART 1
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.ones_like.default
V0603 10:17:21.590878 8300067520 torch/autograd/graph.py:751] Executing: <ViewBackward0 object at 0x12a1ee160> with grad_outputs: [f32[10]]
torch._ops.aten.view.default
V0603 10:17:21.591204 8300067520 torch/autograd/graph.py:751] Executing: <AddmmBackward0 object at 0x12a1ee0d0> with grad_outputs: [f32[1, 10]]
torch._ops.aten.t.default
torch._ops.aten.mm.default
V0603 10:17:21.591578 8300067520 torch/autograd/graph.py:751] Executing: <ViewBackward0 object at 0x100d7ae50> with grad_outputs: [f32[1, 10]]
torch._ops.aten.view.default
V0603 10:17:21.591747 8300067520 torch/autograd/graph.py:751] Executing: <ViewBackward0 object at 0x12a1e4a60> with grad_outputs: [f32[10]]
torch._ops.aten.view.default
V0603 10:17:21.591834 8300067520 torch/autograd/graph.py:751] Executing: <AddmmBackward0 object at 0x12a1e4bb0> with grad_outputs: [f32[1, 10]]
torch._ops.aten.t.default
torch._ops.aten.mm.default
V0603 10:17:21.591922 8300067520 torch/autograd/graph.py:751] Executing: <ViewBackward0 object at 0x12a1e4a90> with grad_outputs: [f32[1, 10]]
torch._ops.aten.view.default
PART 2
trying to execute: (GradientEdge(node=<AddmmBackward0 object at 0x12a1e4bb0>, output_nr=0),) (GradientEdge(node=<AccumulateGrad object at 0x12a21b130>, output_nr=0), GradientEdge(node=<AccumulateGrad object at 0x12a21b7c0>, output_nr=0))
V0603 10:17:21.592223 8300067520 torch/autograd/graph.py:751] Executing: <AddmmBackward0 object at 0x12a1e4bb0> with grad_outputs: [f32[1, 10]]
torch._ops.aten.t.default
torch._ops.aten.mm.default
torch._ops.aten.t.default
torch._ops.aten.sum.dim_IntList
torch._ops.aten.view.default
V0603 10:17:21.592421 8300067520 torch/autograd/graph.py:751] Executing: <TBackward0 object at 0x12a1cad60> with grad_outputs: [f32[10, 10]]
torch._ops.aten.t.default
trying to execute: (GradientEdge(node=<AddmmBackward0 object at 0x12a1ee0d0>, output_nr=0),) (GradientEdge(node=<AccumulateGrad object at 0x12a1e41c0>, output_nr=0), GradientEdge(node=<AccumulateGrad object at 0x12a21b670>, output_nr=0))
V0603 10:17:21.593481 8300067520 torch/autograd/graph.py:751] Executing: <AddmmBackward0 object at 0x12a1ee0d0> with grad_outputs: [f32[1, 10]]
torch._ops.aten.t.default
torch._ops.aten.mm.default
torch._ops.aten.t.default
torch._ops.aten.sum.dim_IntList
torch._ops.aten.view.default
V0603 10:17:21.593750 8300067520 torch/autograd/graph.py:751] Executing: <TBackward0 object at 0x12a21b2b0> with grad_outputs: [f32[10, 10]]
torch._ops.aten.t.default
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.view.default
torch._ops.aten.view.default
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127766
Approved by: https://github.com/albanD
Summary:
We should log compile ID as well for easier comparison.
Currently going through some of this data, I think we should make few more changes as well.
Reland for D59725870
Test Plan: Sandcastle and Pytorch
Differential Revision: D59789110
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130801
Approved by: https://github.com/oulgen
FSDP2 eager pre-allocates the output buffer for AllGather and the AllGather just writes into that buffer. However, under compile, by default we use out-of-place AllGather, which means in Traceable FSDP2 case we will be unnecessarily using more memory than eager. We want to re-inplace that AllGather instead.
This PR adds a post_grad pass to re-inplace all_gather_into_tensor (i.e. changing it from `all_gather_into_tensor.default` out-of-place op to `all_gather_into_tensor_out.default` out-variant op).
One thing to note is that since with this pass we are introducing a mutable op into the post_grad FX graph, we must do this pass after `reinplace_inplaceable_ops` (at which point we are okay again with having mutable ops in the graph). To facilitate this, this PR adds a `post_grad_custom_post_reinplace_pass` extension point to allow user-defined post-reinplace FX passes.
---
Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_fullgraph_backend_inductor`
---
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129773
Approved by: https://github.com/eellison
Summary: Similar to the handling of metrics, save inductor counter deltas in the FX graph cache entry and increment the counters appropriately on a cache hit
Test Plan: new unit test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130635
Approved by: https://github.com/eellison
beartype has served us well in identifying type errors and ensuring we call internal functions with the correct arguments (thanks!). However, the value of having beartype is diminished because of the following:
1. When beartype improves support for better Dict[] type checking, it discovered typing mistakes in some functions that were previously uncaught. This caused the exporter to fail with newer versions beartype when it used to succeed. Since we cannot fix PyTorch and release a new version just because of this, it creates confusion for users that have beartype in their environment from using torch.onnx
2. beartype adds an additional call line in the traceback, which makes the already thick dynamo stack even larger, affecting readability when users diagnose errors with the traceback.
3. Since the typing annotations need to be evaluated, we cannot use new syntaxes like `|` because we need to maintain compatibility with Python 3.8. We don't want to wait for PyTorch take py310 as the lowest supported Python before using the new typing syntaxes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130484
Approved by: https://github.com/titaiwangms
Summary: Adds non-strict implementation of training IR export. Any expected non-strict training IR failures are also either existing strict training IR or non-strict failures (no new failures added). 4 strict training IR failures also resolved.
Refraining from unifying export/export_for_training, per @ydwu4's feedback :)
Test Plan: added test_export_training_ir_to_run_decomp_non_strict.py for non-strict training IR
Differential Revision: D59349454
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130062
Approved by: https://github.com/ydwu4, https://github.com/zhxchen17
Summary:
Move the alloc_trace logic into a separate class, to reduce risk of deadlocks when mixing with CCA's lock. Switch to an std::mutex instead of std::recursive_mutex.
Let's us re-use the logic in TraceEntryRingBuffer class for later diffs.
Test Plan: CI, resnet run, and FBR model.
Differential Revision: D59690408
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130741
Approved by: https://github.com/davidberard98
Uses `dict.fromkeys` whenever possible as covered by flake8-comprehensions rule C420. While the ruff rule RUF025 is still in preview, flake8-comprehensions have added a new rule which covers this. Use dict.fromkeys is faster when the value being added to the dictionary is the same at every iteration and is immutable, it also removes an unnecessary dict comprehension.
This rule will be enabled with our current ruleset in RUF in 0.6 as C420.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130699
Approved by: https://github.com/lezcano, https://github.com/ezyang
Summary:
By default, performance tests (speedup experiments) will run the baseline and test backend alternately.
However, this does not work for the torchao backend, which will change the model in-place, therefore the baseline run will also run with torchao backend since the model has already been quantized.
Add a new experiment "latency_experiment" to run performance tests non-alternately (first run baseline for a few iterations, then run the test backend).
Test Plan:
```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16
```
```
buck2 run mode/opt //pytorch/benchmark:pt2 -- --only AlbertForMaskedLM --quantization autoquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
```
Differential Revision: D59332736
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130136
Approved by: https://github.com/jerryzh168
The conversion cache used for fixing https://github.com/pytorch/pytorch/issues/115260 depended on "store" which might be removed and ignored. This would lead to inconsistent code generated between vec and scalar kernels since we generate scalar kernel first followed by the vector kernel and the store buffer might be removed by the scalar and impacts the vector kernel codegen. This PR move the caching from "store" to the "to_dtype" calls which won't be impacted by the removed buffers.
`pytest -k test_consistent_remove_buffers test/inductor/test_cpu_repro.py`
before
```c++
extern "C" void kernel(const bfloat16* in_ptr0,
bfloat16* out_ptr1)
{
{
for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(16L))
{
auto tmp0 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>(x0), 16);
auto tmp1 = at::vec::convert<float>(tmp0);
auto tmp2 = tmp1 + tmp1;
auto tmp3 = at::vec::convert<bfloat16>(tmp2);
auto tmp4 = at::vec::convert<float>(tmp3);
auto tmp5 = tmp1 + tmp4;
auto tmp6 = at::vec::convert<bfloat16>(tmp5);
tmp6.store(out_ptr1 + static_cast<long>(x0), 16);
}
#pragma omp simd simdlen(8)
for(long x0=static_cast<long>(64L); x0<static_cast<long>(65L); x0+=static_cast<long>(1L))
{
auto tmp0 = in_ptr0[static_cast<long>(x0)];
auto tmp1 = c10::convert<float>(tmp0);
auto tmp2 = decltype(tmp1)(tmp1 + tmp1);
auto tmp3 = c10::convert<bfloat16>(tmp2);
auto tmp4 = decltype(tmp1)(tmp1 + tmp2);
auto tmp5 = c10::convert<bfloat16>(tmp4);
out_ptr1[static_cast<long>(x0)] = tmp5;
}
}
}
```
after
```c++
extern "C" void kernel(const bfloat16* in_ptr0,
bfloat16* out_ptr1)
{
{
for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(16L))
{
auto tmp0 = at::vec::Vectorized<bfloat16>::loadu(in_ptr0 + static_cast<long>(x0), 16);
auto tmp1 = at::vec::convert<float>(tmp0);
auto tmp2 = tmp1 + tmp1;
auto tmp3 = at::vec::convert<bfloat16>(tmp2);
auto tmp4 = tmp1 + tmp2;
auto tmp5 = at::vec::convert<bfloat16>(tmp4);
tmp5.store(out_ptr1 + static_cast<long>(x0), 16);
}
#pragma omp simd simdlen(8)
for(long x0=static_cast<long>(64L); x0<static_cast<long>(65L); x0+=static_cast<long>(1L))
{
auto tmp0 = in_ptr0[static_cast<long>(x0)];
auto tmp1 = c10::convert<float>(tmp0);
auto tmp2 = decltype(tmp1)(tmp1 + tmp1);
auto tmp3 = c10::convert<bfloat16>(tmp2);
auto tmp4 = decltype(tmp1)(tmp1 + tmp2);
auto tmp5 = c10::convert<bfloat16>(tmp4);
out_ptr1[static_cast<long>(x0)] = tmp5;
}
}
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130677
Approved by: https://github.com/leslie-fang-intel
# Motivation
I found a difference between sympy 1.12 and 1.13.
```python
# for 1.12
>>> import sympy
>>> a = sympy.Number(0.0)
>>> a == 0
True
```
```python
# for 1.13
>>> import sympy
>>> a = sympy.Number(0.0)
>>> a == 0
False
```
The different behavior will impact the result of [safe_mul](6beec34b1c/torch/utils/_sympy/value_ranges.py (L521-L528)), resulting in an incorrect results when `a = sympy.Number(0.0)`, `b = inf` and the result is `nan` if sympy version is 1.13. (the expected result is **0**)
```python
def safe_mul(a, b):
# Make unknown() * wrap(0.0) == wrap(0.0)
if a == 0.0:
return a
elif b == 0.0:
return b
else:
return a * b
```
In different sympy versions, `sympy.Number(0)` always has the same behavior that equals to 0.0.
```python
>>> import sympy
>>> a = sympy.Number(0)
>>> a == 0.0
True # for different sympy versions
```
So, use 0.0 when checking zero in safe_mul to keep compatible with different sympy versions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130729
Approved by: https://github.com/lezcano, https://github.com/EikanWang
as titled, fixed a case when passing ord as 2 (default value), the op
dispatching does not receive the default value case
We simply check if the args schema receiving a `ord` field or not
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130753
Approved by: https://github.com/awgu
My attempt at a fix for https://github.com/pytorch/pytorch/issues/130335, see issue for more details / internal xref. Any feedback from inductor folks is appreciated. I attempted to make the move-constructors-to-cuda pass a bit less aggressive by detecting when the movement would incur a H2D sync for `aten.index_put_`. I'm not sure if there are any other ops that inductor falls back to eager on, that may-or-may-not incur a H2D sync if we change any of their inputs from cpu to cuda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130338
Approved by: https://github.com/eellison
This PR marks all buffers and parameters of an NNModule as static using the `mark_static_address` API. As a result, when tensors are passed to AOT, the `tensor_dict` metadata of placeholder nodes will contain the `static_address_type` key, indicating which graph argument positions are static for cudagraphs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130391
Approved by: https://github.com/anijain2305
Extend constant folding for dynamic shape node, only support pointwise op and some restricted ops
We support dynamic shapes by limiting constant folding of ops that are guaranteed to have uniform values (full, pointwise ops, and views) and running these operators with tensors of shape 1. This also eliminates the possibility of memory overhead of constant folding.
Taken over from https://github.com/pytorch/pytorch/pull/128937
joint work with @imzhuhl
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129686
Approved by: https://github.com/Chillee
ghstack dependencies: #130367
This PR allows `fully_shard`'s first argument to be `List[nn.Module]` instead of strictly `nn.Module`. This allows more flexible grouping of modules/parameters for communication, which can lead to memory savings and/or more efficient communication.
**Approach**
At a high level, we can think of a model as a tree of modules. Previously, we could only select specific module nodes in this tree as representing one FSDP parameter group. With this PR, we can select a group of module nodes, effectively becoming a single super node.
To implement the runtime schedule, we define new forward hooks that run based on the following semantics:
- If a module is the first to run the pre-hook, actually run the given pre-hook. Otherwise, the pre-hook is no-op.
- If a module is the last to run the post-hook, actually run the given post-hook. Otherwise, the post-hook is a no-op.
- First and last are determined by scoreboarding against a set of the modules.
- This set must get cleared at the end of backward in the case that >=1 module in the list is never used, in which case we still want the forward hooks to run in the next forward after this backward.
Beyond these new forward hooks, everything else is some simple generalization from `Module` to `List[Module]` or `Tuple[Module, ...]`.
**Examples**
This PR enables wrapping Llama models more efficiently by grouping the final norm and output linear together: https://github.com/pytorch/torchtitan/pull/382.
If at least one of the modules in the list does not run forward before backward, then there will be a warning message like:
```
1 of the 2 modules passed to fully_shard did not run forward before backward, which is error-prone since FSDP post-forward/pre-backward logic will not run for these modules. We recommend passing only modules that run forward together. Modules that did not run forward: [FSDPLinear(in_features=1, out_features=1, bias=True)]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127786
Approved by: https://github.com/yf225, https://github.com/weifengpy
ghstack dependencies: #127773
This PR relaxes `@contract` to allow the 1st argument to be `Sequence[nn.Module]` instead of strictly `nn.Module`. This is required for the next PR, which allows `fully_shard` to take in `List[nn.Module]`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127773
Approved by: https://github.com/weifengpy
This PR adds support to use expandable segments with private memory pools which should unblock using it with cuda graphs and cuda graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool due to checkpoint saving/restoring not meshing well with how we keep track of unmapped blocks.
The PR itself is pretty short, most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work.
Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial due to the fact that each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together.
The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it's split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned back to Cuda when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments is similar to that for normal blocks and does all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to Cuda.
With Cuda Graph Trees and private memory pools, we need the ability to take checkpoints of the current state of the memory allocator after each graph capture as well as reapplying the state before capturing a new graph after replaying a captured graph so that the new cuda graph capture has access to the state of the allocator at the point after replaying a previously captured graph so it can reuse empty blocks and allocate new ones.
As mentioned in a below comment, memory in a private pool is cached until the private pool is destroyed and allocations can only grow from extra graph captures, any freeing of memory would result in invalid memory addresses and would break cuda graphs.
One implementation detail to note for unmapped blocks with expandable segments is that unmapped blocks are kept track in a member variable `unmapped` of a `BlockPool`. `unmapped` is *not* part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints since we never free/unmap memory back to cuda and is persisted across graph captures / replays.
Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved and kept for when it is needed to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint.
Reapplying the checkpoints works by freeing all allocated blocks and merging them into a single block per segment, then for each segment, we manually split and allocate all blocks from the checkpoint and then free the blocks marked as unallocated in the checkpoint state. For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068
Approved by: https://github.com/eqy, https://github.com/eellison
Summary: The scalar tensor by default is on CPU, which failed the cuda graph capture. To fix the issue, we put the scalar tensor on GPU
Test Plan: buck2 test 'fbcode//mode/opt' fbcode//gen_ai/llm_inference/fb/tests:test_llama2_multimodal_generator -- --exact 'gen_ai/llm_inference/fb/tests:test_llama2_multimodal_generator - gen_ai.llm_inference.fb.tests.test_llama2_multimodal_generator.TestGenerator: test_multimodal_decode_gen2'
Differential Revision: D59740639
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130712
Approved by: https://github.com/Skylion007, https://github.com/chenyang78
This PR introduces AutoHeuristic, a framework to collect results from autotuning, learn a heuristic as a machine learning model (a regression tree), and then ship the learned heuristic by generating the regression tree to code.
The heuristics have been learned on artificial/random data that has been collected with the `gen_data_pad_mm.py` script. The `gen_pad_mm_a100.sh` scripts can then be used to learn a heuristic and generate it to code.
The best model is decided by doing a grid search over various values for `max_depth` and `min_samples_leaf` and choosing the model with the highest number of correct predicitons on the validation set.
The heuristic can return "unsure" which means that it is not sure which choice is the best choice and as a result autotuning will happen.
On A100 only tensors where each dimension is >= 512 are considered. For smaller tensors the heuristics that I learned returned "unsure" too often.
The results for randomly generated data and huggingface look as follows:
`max_wrong_speedup` is max(`wrong_speedups`) where `wrong_speedups` contains all the speedups one could have achieved for those examples where the heuristic made a wrong choice, i.e. a `max_wrong_speedup` of 1.37 means that the heuristic selected a choice, but the other choice would have been 1.37x faster. `gman_wrong_speedup` is the geomean of `wrong_speedups`.
The heuristic is learned as a regression tree, that returns higher values for better choices. The threshold decides how much better the better choice has to be for it to be returned, i.e. on A100 if the better choice is less than 1.702530x better than the other choice, "unsure" will be returned. This threshold is determined using the validation set.
A100
```
max_depth min_samples_leaf dataset correct wrong unsure total max_wrong_speedup gman_wrong_speedup threshold
15 5.0 10 train 2730 4 3023 5757 1.372220 1.193873 1.702530
16 5.0 10 val 878 0 1042 1920 NaN NaN 1.702530
17 5.0 10 test 925 2 993 1920 1.741708 1.354954 1.702530
18 5.0 10 hf-train 14 0 22 36 NaN NaN 1.702530
19 5.0 10 hf-inf 7 0 1 8 NaN NaN 1.702530
```
The numbers for huggingface only include tensors where each dim is >=512. If all tensors would have been included there would have been the following number of matmuls, where at least one dimension is unaligned:
A100 hf-train: 60
A100 hf-inf: 10
## Results on running huggingface locally
This only includes models where the learned heuristic made at least one decision. For the examples here, it takes around 0.25-0.3 seconds to perform autotuning for the padded and unpadded version, so each decision that the heuristic makes saves around 0.25-0.3 seconds.
#pad_mm_autotuning is the number of times autotuning happened in pad_mm and #heuristic_made_decision is the number of times the heuristic made a decision (i.e. it didn't return "unsure").
I ran huggingface locally, each model 5 times and took the median speedup and compilation_latency.
Results on huggingface training
```
name speedup_heuristic speedup_baseline speedup_diff compilation_latency_heuristic compilation_latency_baseline compilation_latency_diff comp_latency_reduction% #pad_mm_autotuning #heuristic_made_decision
BartForCausalLM 1.19 (+/- 0.00) 1.19 (+/- 0.00) -0.00 40.33 (+/- 1.13) 40.95 (+/- 0.78) -0.62 1.52 3 2
BartForConditionalGeneration 1.53 (+/- 0.06) 1.47 (+/- 0.05) 0.06 81.93 (+/- 5.20) 82.23 (+/- 1.92) -0.30 0.36 3 1
BlenderbotSmallForCausalLM 1.86 (+/- 0.04) 1.86 (+/- 0.00) 0.00 36.76 (+/- 0.49) 37.62 (+/- 1.33) -0.87 2.31 3 2
CamemBert 2.36 (+/- 0.01) 2.35 (+/- 0.01) 0.01 97.60 (+/- 1.91) 98.69 (+/- 1.35) -1.09 1.11 2 1
DistillGPT2 2.57 (+/- 0.01) 2.57 (+/- 0.01) 0.00 57.33 (+/- 0.77) 58.26 (+/- 1.41) -0.93 1.59 3 2
PLBartForCausalLM 2.07 (+/- 0.01) 2.06 (+/- 0.01) 0.01 32.54 (+/- 0.83) 34.65 (+/- 0.71) -2.11 6.10 3 2
PLBartForConditionalGeneration 1.87 (+/- 0.00) 1.88 (+/- 0.00) -0.01 58.45 (+/- 1.24) 58.95 (+/- 1.92) -0.50 0.85 3 1
RobertaForCausalLM 2.39 (+/- 0.01) 2.40 (+/- 0.01) -0.01 97.38 (+/- 1.52) 97.69 (+/- 1.18) -0.31 0.32 2 1
TrOCRForCausalLM 1.70 (+/- 0.00) 1.70 (+/- 0.00) -0.00 44.79 (+/- 1.33) 45.25 (+/- 1.08) -0.46 1.01 3 2
Mean difference in speedup: 0.01
Mean compilation latency saved: -0.80s
Mean compilation latency reduction: 1.68%
```
Results on huggingface inference
```
name speedup_heuristic speedup_baseline speedup_diff compilation_latency_heuristic compilation_latency_baseline compilation_latency_diff comp_latency_reduction% #pad_mm_autotuning #heuristic_made_decision
BartForCausalLM 1.11 (+/- 0.00) 1.11 (+/- 0.00) 0.00 19.02 (+/- 0.28) 19.40 (+/- 0.35) -0.38 1.95 3 2
BartForConditionalGeneration 1.26 (+/- 0.01) 1.23 (+/- 0.03) 0.03 36.84 (+/- 0.40) 36.55 (+/- 0.75) 0.30 -0.81 3 1
BlenderbotSmallForCausalLM 1.87 (+/- 0.02) 1.87 (+/- 0.01) 0.00 17.53 (+/- 0.31) 18.03 (+/- 0.43) -0.49 2.74 3 2
DistillGPT2 2.50 (+/- 0.02) 2.50 (+/- 0.01) 0.00 16.16 (+/- 0.29) 16.40 (+/- 0.18) -0.24 1.46 3 2
PLBartForCausalLM 1.93 (+/- 0.01) 1.94 (+/- 0.01) -0.00 15.30 (+/- 0.22) 16.01 (+/- 0.71) -0.71 4.43 3 2
PLBartForConditionalGeneration 1.98 (+/- 0.01) 1.98 (+/- 0.01) 0.00 25.90 (+/- 0.32) 26.58 (+/- 0.62) -0.67 2.53 3 1
TrOCRForCausalLM 1.61 (+/- 0.00) 1.62 (+/- 0.00) -0.01 21.38 (+/- 0.37) 21.85 (+/- 0.16) -0.47 2.16 3 2
Mean difference in speedup: 0.00
Mean compilation latency saved: -0.38s
Mean compilation latency reduction: 2.07%
```
For now, the heuristic can only be applied to decide whether to pad for mm. One could also learn heuristics for bmm and addmm.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128643
Approved by: https://github.com/Chillee, https://github.com/eellison
Summary: Looks like "spawn" is broken. Since we have "subprocess", I don't think we need it any more, so just remove as an option.
Test Plan: Verified that we get: `AssertionError: Invalid start method: spawn`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130746
Approved by: https://github.com/Skylion007
#### Issue
Fix two issues related to inputs lifting when there are sub-blocks.
* Some inputs may appear in the nested sub-blocks, which need a recursive search to identify which arguments need to be lifted / passed in the top-level block.
* Some inputs to the sub-block are intermediate results, meaning their names are only number. This will cause issue during code generation (i.e., invalid argument name). We rename those to valid names.
#### Test Plan
* `pytest test/export/test_converter.py -s -k test_convert_nn_module_with_nested_if_and_param`
* `test/export/test_converter.py -s -k test_hidden_input_name`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128386
Approved by: https://github.com/angelayi
This is the implementation following the RFC: https://github.com/pytorch/pytorch/issues/130407
ncclCommSplit
Summary:
In current Pytorch/c10d, the new_group API is used to create a new
process group from the default pg. When device_id is specified in
init_process_group and nccl is used as the backend, the new_group call
will use ncclCommSplit to create the nccl communicators to save
communicator resources. It has a few drawbacks:
Redundant calls
Suppose the default group has 256 ranks, we need to have 32 children PGs
and each child PG has 8 ranks. in this case, each rank needs to call
new_group and ncclCommSplit 32 times because of how we implement
new_group API and the collective requirement of ncclCommSplit. For a
specific global rank, 31 calls of ncclCommSplit would be no_color split,
and only 1 of them is colored split. With the proposed new split_group
API, we expect only 1 call of split_group/ncclCommSplit is needed per
rank in the above example case
new_group can only split from default_pg
Ideally, a new pg should be able to be split from any pg
With the new split_group API, users can create new PGs using
ncclCommSplit with less number of calls and initialize the PG eagerly.
This is also useful in the cases of creating many P2P communicators.
Test Plan:
New UTs:
e.g., python test/distributed/test_c10d_nccl.py -k
test_comm_split_group_larger_scale
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130507
Approved by: https://github.com/wconstab
Summary:
As a followup to https://github.com/pytorch/pytorch/pull/130454, users are hitting the cross-mesh operation error because the DeviceMesh thread ID differs between the saved vs. loaded DTensor due to thread id being different.
This is a hot fix to only consider the real thread_id in DeviceMesh hash under threaded backend, but set it to None for all other cases.
As a follow up, we need to look at the following test failures to better root cause specific DeviceMesh related failures related to MTPG, if thread_id is not included as part of the hash.
```
test/distributed/_composable/fsdp/test_fully_shard_training.py::TestFullyShardRegisteredParams::test_param_registration_after_forward
test/distributed/_tensor/test_dtensor_ops.py::TestDTensorOpsCPU::test_dtensor_op_db_column_stack_cpu_float32
```
Adding an additional is_initialized() check since APF has a test mocking the backend without pg initialized. Therefore, we need to add the is_initialized() check to avoid test failure. In real use case, we should have a pg initialized before the get_backend() check. Not sure if we want to add this specifically for the test, but temporarily adding it to unblock APF conveyor runs.
Test Plan:
```
[irisz@devgpu051.cln3 /data/users/irisz/fbsource/fbcode (38e4a0a3b)]$ buck2 test 'fbcode//mode/opt' fbcode//apf/distributed/tests:pipeline_parallel_test_cpu -- --exact 'apf/distributed/tests:pipeline_parallel_test_cpu - apf.distributed.tests.pipeline_parallel_test_cpu.PipelineParallelContextTestCPU: test_stage_pg_creation_with_different_backends'
```
Reviewed By: gag1jain
Differential Revision: D59725924
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130685
Approved by: https://github.com/gag1jain
0.12.0 Major Updates:
- Add context manager to temporarily set the dictionary sorting mode
- Add accessor APIs
- Use `stable` tag for `pybind11` for Python 3.13 support
- Fix potential segmentation fault for pickling support
0.12.1 Updates:
- Fix warning regression during import when launch with strict warning filters
Closes#130155
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130139
Approved by: https://github.com/zou3519
Made the following changes:
- mutates_args is now keyword-only and mandatory. This is to align with
torch.library.custom_op (which makes it mandatory because it's easy to
miss)
- op_name is now keyword-only. This helps the readability of the API
- updated all usages of infer_schema
This change is not BC-breaking because we introduced
torch.library.infer_schema a couple of days ago.
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130705
Approved by: https://github.com/yushangdi
# Motivation
Before this PR, device construction was `cuda` type when only a device index was given. It also returns the `PrivateUser1` type if a `PrivateUser1` type is registered.
```bash
>>> import torch
>>> device = torch.device(0)
>>> device.type
'cuda'
>>> a = torch.tensor([1, 2])
>>> b = a.to(0)
>>> b
tensor([1, 2], device='cuda:0')
```
It works well on CUDA GPU. But it will raise unexpected information and error running on XPU.
```bash
>>> import torch
>>> device = torch.device(0)
>>> device.type
'cuda'
>>> a = torch.tensor([1, 2])
>>> b = a.to(0)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/xxx/pytorch/torch/cuda/__init__.py", line 302, in _lazy_init
raise AssertionError("Torch not compiled with CUDA enabled")
AssertionError: Torch not compiled with CUDA enabled
```
With this PR, refine the logic to use the currently available device type instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129119
Approved by: https://github.com/albanD, https://github.com/gujinghui, https://github.com/EikanWang
ghstack dependencies: #129463, #129205, #129363
This is the initial version of an API to create custom operators whose
implementations are backed by triton kernels. While user-defined triton
kernels work out-of-the-box with triton kernels, you may wish to
construct a custom operator if you need to compose with other PyTorch
subsystems, like Tensor subclasses or vmap.
I'm hoping to get design feedback on this and ship it so that we can
begin experimenting with customers.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130637
Approved by: https://github.com/albanD
Reduces the guard overhead from 2.1k units to 1k units. Compared to no-inlining (0.4k units), this reduces the slowdown from 5x to 2.5x.
This introduces unsoundness, but only for hooks for inbuilt nn modules (user defined nn module hooks are fine).
Each builtin nn module adds 4 empty ordered dict checks in the check_fn. This blows up for models with large numbers of builtin nn modules. With this PR, we skip those guards. There is no other easy way I can think of right now to control the guard overhead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130420
Approved by: https://github.com/jansel
ghstack dependencies: #130654
**Summary**
Support more than 1 Local Buffer in an outer loop fused node and also the case when multi global buffers sharing usage of same local buffer.
**TestPlan**
```
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_two_local_buffers_in_outer_loop_fusion
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_share_local_buffers_in_outer_loop_fusion
```
**Next Step**
- [✓] Support more than one Local Buffer/Global Buffer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129121
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #126967
Summary:
CUDAGraph Trees previously relies on an assumption that static inputs (parameters and buffers) does not change tensor addresses across multiple function invocations. This assumption can be used to reduce the number of tensor copies to improve performance. We also use `check_static_inputs_are_stable()` to check whether this assumption holds at runtime.
While this assumption is True in most cases, we recently observe a few cases that this assumption is not valid:
- [Inline inbuilt nn modules](https://github.com/pytorch/pytorch/pull/126822): the same function (a nn module) is used in multiple places and different parameters and buffers are passed to this function with different tensor addresses
- Some user code changes tensor addresses of parameters/buffers. See [internal example]( https://www.internalfb.com/mlhub/pipelines/runs/mast/sw-935450288-OfflineTraining_08ba1cf0?job_attempt=1&version=0&env=PRODUCTION)
- Compiled Autograd may also pass parameters/buffers with different tensor addresses across runs.
Previous PR [#126822](https://github.com/pytorch/pytorch/pull/126822) (by @mlazos) allows detecting static tensor address changes during runtime and re-recording a cudagraph if that happened. However, if the same function is re-recorded too many times, it may introduce large overhead and hurt performance. This PR adds `torch._inductor.config.triton.cudagraph_max_recording` (=5) to fallback to eager if a function has been recorded more than `cudagraph_max_recording` times for a specific node in the CUDAGraph Trees.
A summary on how static tensor address changes are handled now:
- For each child node, check the assumption via `check_invariants`. If this holds, execute node with the assumption.
- If the assumption does not hold for all child nodes, re-record if the function_id has not been recorded too many times for the current_node.
- If the function_id has been re-recorded too many times, fallback to eager function and warning.
Test Plan: CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129349
Approved by: https://github.com/eellison
Summary: On the autograd side of things, we are currently saving the kwinputs but we aren't doing anything with them on the profiler side. This diff enables the use of the kwinputs for both FunctionEvents and Chrome Traces.
Test Plan: Added unit testing for both chrome traces and FunctionEvents. Used RecordFunctionFast to test kwinputs since test already had kwargs being passed in but not tested.
Differential Revision: D59472345
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130373
Approved by: https://github.com/davidberard98
Fixes#125224
For large ranges, calls to CUDA `randint` use a different `unroll_factor` to generate random ints. This `unroll_factor` was not considered correctly in the calculation of the Philox offsets. Thus, some of the random states were reused, resulting in lower entropy (see #125224).
This also affects multiple other random functions, such as `torch.rand` and `torch.randn`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126066
Approved by: https://github.com/eqy, https://github.com/lezcano
Preventing match across mutations should always be the safe thing to do. This will be especially important for Traceable FSDP2 because in that case we do have mutation ops (`.set_` and `.resize_(0)`) in the middle of the graph for both joint-graph and post-grad graph, so making sure the pattern matcher passes work well with middle-of-graph mutation ops is important.
Q: Why can't we move these mutation ops to the end of graph, to make pass writing easier?
A: We attempted to do that in https://github.com/pytorch/pytorch/pull/129852, but the custom FX passes (in `torch/_functorch/_aot_autograd/fx_passes.py`) for the re-functionalization is complicated to maintain, and the changes to partitioner (in `torch/_functorch/partitioners.py`) also feels hacky. Hence we want to preserve these mutation ops in the middle of graph to avoid the complexity.
Test commands:
- `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_uint4x2_mixed_mm`
- `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_serialized_patterns_up_to_date`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130584
Approved by: https://github.com/jansel
# Flex Decoding
tl;dr This PR adds `flex_decoding` kernel to higher-order-op: `flex_attention` as the backend for multi-head attention decoding.
Higher-order-op `flex_attention` was introduced in (https://github.com/pytorch/pytorch/pull/121845) to accept a user defined score modification callable (`score_mod`) and through `torch.compile`to create an efficient fused flash attention kernel instatiation. The `flex_attention` kernel is efficient for long queries (>512 tokens) attention. This PR introduces `flex_decoding` kernel as an alternative backend for `flex_attention` HOP to handle LLM inference where short queries (<32 tokens) attends to long key/value sequences.
### Details
LLM decoding iteratively attends each newly generated token ( query length = 1 ) to a long key/value context (up to 132k). `flex_attention` kernel only parallelizes attention along query length (M), batch size (B) and number of heads (H) dimension. LLM decoding lacks enough parallelism in the M dimension to fill up all SMs on the modern GPUs.
`flex_decoding` adds parallelization along key/value sequence length (N). The key/value cache of a single head are split into multiple blocks and the query tokens attends to them in parallel. The results for the same head are then reduced across KV blocks to generate a global output.
## Examples
Consider a Group Query Attention (GQA) decoding case, where a query token of 16 query heads (Hq) attends to 2 kv head (Hkv). Assume a batch size of 2 (B=2) and kv cache length of 4096 (N=4096). The attention kernel iteratively attends to newly generated query token (Mq = 1).
We transform this problem into a Multiheaded Attention (MHA) problem by assuming a query length equal to number of query heads per kv heads, i.e. M=Hq//Hkv.
The inputs to `flex_attention` HOP is thus a query of shape (B=2, H=Hkv=2, M=Hq//Hkv=8, D=64), key,value of shape (B=2, H=Hkv=2, N=4096, D=64, which lead to an intermediate attention score matrix of shape (2, 2, 8, 4096) and an output of shape (2, 2, 8, 64).
```Python
import torch
from torch.nn.attention._flex_attention import _flex_attention as flex_attention
torch.manual_seed(0)
# Lets create some input tensors
# query of shape (B, Hkv, Hq//Hkv, D)
# key/value of shape (B, Hkv, N, D)
query = torch.randn(2, 2, 8, 64, device="cuda", dtype=torch.float32)
key = torch.randn(2, 2, 4096, 64, device="cuda", dtype=torch.float32)
value = torch.randn(2, 2, 4096, 64, device="cuda", dtype=torch.float32)
# Lets create a new score_modification checkerboard.
def checkerboard(score, batch, head, token_q, token_kv):
score = torch.where(torch.abs(token_kv - token_q) == 1, score * 0.5, score)
score = torch.where(torch.abs(token_kv - token_q) == 2, score * 2.0, score)
return score
# Lets call flex_attention with this new score modification for decoding.
# The flex_attention HOP will chose flex_decoding as its backend since our query length (M) is only 8.
output = flex_attention(query, key, value, score_mod=checkerboard)
compiled_flex_attention = torch.compile(flex_attention)
out_compiled = compiled_flex_attention (query, key, value, score_mod=checkerboard)
torch.testing.assert_close(output, out_compiled, atol=2e-2, rtol=2e-2)
```
## Future Plans
- This PR does not implement load mask for score_mod function. This means if the score_mod functions takes a captured buffer along the M dimension , it must be padded to q length of 16, or next 2^n of query length if q_len > 16.
i.e.
```python
q_scale = torch.randn(Hq//Hkv, device="cuda")
q_scale = torch.nn.functional.pad(q_scale, (0, 16-Hq//Hkv)) # Pad captured buffer
def bias_mod(score, batch, head, q, kv):
score = score + q_scale[token_q]
return score
```
- Backward path for short queries (<128 token) currently does not work because the `flex_attention_backward` kernel is lacking mask support and only takes query length of a multiple of 128.
- Dynamic shape and max_autotuning is currently not working
- Add block sparse mask support (#129216 is a draft for flex_attention kernel)
- Add explicit GQA support. (#130076 is a draft for GQA support on flex_attention kernel)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129415
Approved by: https://github.com/Chillee
beartype has served us well in identifying type errors and ensuring we call internal functions with the correct arguments (thanks!). However, the value of having beartype is diminished because of the following:
1. When beartype improves support for better Dict[] type checking, it discovered typing mistakes in some functions that were previously uncaught. This caused the exporter to fail with newer versions beartype when it used to succeed. Since we cannot fix PyTorch and release a new version just because of this, it creates confusion for users that have beartype in their environment from using torch.onnx
2. beartype adds an additional call line in the traceback, which makes the already thick dynamo stack even larger, affecting readability when users diagnose errors with the traceback.
3. Since the typing annotations need to be evaluated, we cannot use new syntaxes like `|` because we need to maintain compatibility with Python 3.8. We don't want to wait for PyTorch take py310 as the lowest supported Python before using the new typing syntaxes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130484
Approved by: https://github.com/titaiwangms
Since the raise_comms and sink_waits passes are also scheduling-based, we can now implement reorder_compute_for_overlap as an optional step in the same pass. Merging them into the same pass greatly simplifies the logic and makes it easier to reason about the synergy between different passes.
- The unit tests are now fixed and re-enabled.
- Verified that the pass produces good schedulling w/ Llama3 70B in torchtitan (the scheduling was sub-optimal before this PR).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130573
Approved by: https://github.com/Chillee
ghstack dependencies: #129980
This involved beefing up the Python dispatcher to handle torch_dispatch.
Given a HOP and a torch_dispatch Tensor subclass:
- the HOP will show up in the subclass's `__torch_dispatch__`
- you can also use HOP.py_impl to register a rule for the HOP x
subclass interaction
- (coming soon) we'll offer a way to open register HOP x subclass
interaction without needing to touch the subclass's
`__torch_dispatch__` or the HOP's .py_impl.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130606
Approved by: https://github.com/ydwu4
Fixes#129403
Create a separate printing function to debug SymNode, since we can't easily change `__repr__` that is used by GraphModule.recompile() to create a pythonic version of a graph
This is my first contribution, please let me know if there is anything that I should look into in further details
Thank you for you guidance! 🙏 I hope to contribute more in the future!
@aorenste
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129925
Approved by: https://github.com/aorenste
Reland https://github.com/pytorch/pytorch/pull/128709.
When the input predicate is a python constant, we specialize into one of the branches and warn users that torch.cond is not preserving the dynamism. The previous behavior is that we baked in True/False in the cond operator. This can be confusing. In this PR, we change it to be specializing into one of the branches when the inputs are constants.
We additionally change the naming of cond operator to default one without overriding its name. This allows better testing on de-serialized graph.
Test Plan:
The predicate in some existing tests is the result of a shape comparison. When no dynamic shape is involved, the predicate is a python bool. To fix them, we either change the predicate to be some data-dependent tensor or change the test to check cond is specialized as one of the branches,
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130493
Approved by: https://github.com/BoyuanFeng
Sometimes, it could be difficult to write a fake class e.g. when the original implementation is using some third-party libraries or users are certain that the class is safe to trace with the real object.
This PR allows user to specify their intention by implementing a "safe_to_trace_with_real_obj" method on their script class.
Test Plan:
`pytest test/export/test_torchbind.py -k safe`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129586
Approved by: https://github.com/zou3519
Construct frame localsplus in 3.12+ using our own simplified way rather than copypasting from CPython.
This is necessary for 3.13 since we can no longer generate frame `f_locals` before executing the interpreter frame.
We also enable this for 3.12 since the `f_locals` construction between 3.12 and 3.13 is the same, so we can test for correctness with 3.12.
This is also one of the first steps to completing https://github.com/pytorch/pytorch/issues/93753 - we will implement simplified f_locals generation of previous Python versions in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129185
Approved by: https://github.com/jansel
Summary:
This PR changes two implementations to make CP (CP8) lose curve be on par with TP (TP8).
1. Making key and value contiguous before doing ring attention. It is unclear why this is a requirement as SDPA does not have this requirement.
2. Use the out, grad_out, softmax_lse passed by autograd to do the backward. This implementation is similar to the implementation in transformer engine. The original implementation reruns the SDPA to get the output and logsumexp and uses that reculcated results to infer the corrected softmax_lse. But that implementation does not give a better accuracy or lose curve. Instead, that implementation converges slower.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129515
Approved by: https://github.com/d4l3k, https://github.com/wanchaol
ghstack dependencies: #129512, #129514
Fixes the failure in `test/export/test_export_training_ir_to_run_decomp.py ` caused by dead code elimination removing node with side effects.
For background, in export, we may want to export higher-level IRs that are not functional, so we need to check for side effects more carefully.
A call_function node is impure if it has at least one mutable argument.
Fixed the tests below:
test_to_module_with_mutated_buffer_multiple_update_sub_later
test_export_input_mutation_static_shape
test_buffer_util
Another attempt modifying the original DCE pass is made in PR #130395, but it breaks some other tests, so here we add a flag and use it for export only.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130552
Approved by: https://github.com/pianpwk
Workaround bug in `reductionAndWithTensor:` that kills app with the
following assert if 5+D tensor as an input
```
Assertion failed: (0 <= mpsAxis && mpsAxis < 4 && "Runtime canonicalization must simplify reduction axes to minor 4 dimensions."), function encodeNDArrayOp, file GPUReductionOps.mm, line 76.
```
by reshaping the tensor to 2D/3D one before running the reduction.
Refactored common code into `all_any_common_impl_mps` as both `reductionOrWithTensor:` and `reductionAndWithTensor:` suffer from the same issue
Enabled `test_reduction_ops_5D` and added regression test to it
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130542
Approved by: https://github.com/Skylion007, https://github.com/albanD
ghstack dependencies: #130541
Summary: If we attempt to precompile sets of different choices (e.g. Triton vs Cutlass) that have the same key, the cached pool of futures doesn't work, since it only includes the first set of configs. Add the config's hashes to the key to avoid this problem.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130350
Approved by: https://github.com/eellison
We add torch.library.Library._register_torch_dispatch_rule. Here, a user
can provide us a specific rule to run for a specific
(torch_dispatch_class, operator) pair. The motivation is that a user
might want to extend a subclass/mode but may not have access to the
source code of the subclass/mode.
I'll make this public in a follow-up PR if we think the approach and API
is good.
Keep in mind that many subclasses will likely deliver their own open
registration solution (DTensor has register_sharding_prop_rule and NJT
has register_jagged_op); _register_torch_dispatch_rule is meant as a
catch-all open registration mechanism for when the subclass hasn't
provided anything more specific.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130064
Approved by: https://github.com/albanD
Summary:
Integrate scope tracking with `checkpoint/fb/logging_handlers.py`.
Add a map of uuid -> tracker context manager. when logging handler has following events:
* `start`: create scope_tracker object, call `__enter__`, add to map with uuid
* `end`: retrieve scope_tracker object by uuid, call `__exit__`.
* `exception`: retrieve scope_tracker object by uuid, call `__exit__` with current exception info.
Test Plan:
Test with bento notebook (attached).
with a runtime_error in finish_checkpoint method.
scuba records:
https://fburl.com/scuba/workflow_signpost/ddttgmv2
Differential Revision: D56654417
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130452
Approved by: https://github.com/LucasLLC
Summary: `quantization_tag` is a first class citizen metadata in quantization flows that is preserved by it. As we'll want to store the quantized exported graphs we also need to preserve this metadata as it's used in later flows. Only json supported metadata will be allowed to be serialized.
Test Plan: Added test case
Differential Revision: D57939282
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127473
Approved by: https://github.com/angelayi
Extend constant folding for dynamic shape node, only support pointwise op and some restricted ops
We support dynamic shapes by limiting constant folding of ops that are guaranteed to have uniform values (full, pointwise ops, and views) and running these operators with tensors of shape 1. This also eliminates the possibility of memory overhead of constant folding.
Taken over from https://github.com/pytorch/pytorch/pull/128937
joint work with @imzhuhl
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129686
Approved by: https://github.com/Chillee
ghstack dependencies: #130367
Summary:
full context in D59385876
Based on the offline discussion with PT2 folks, we switched to change the SDPA impl to mitigate the AOTI lowering issue
Test Plan: PYTORCH_TEST_FBCODE=1 buck2 run mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true caffe2/test/inductor:test_inductor -- -r test_sdpa_inference_mode_aot_compile
Differential Revision: D59495634
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130281
Approved by: https://github.com/drisspg, https://github.com/zou3519, https://github.com/Skylion007, https://github.com/justinchuby
When run some test cases on the privateuse1 device, the device_suffix in a test_name is 'privateuse1' sometimes.
For examples, a test_name is 'test_Dropout1d_npu', while it would be 'test_Dropout1d_privateuse1' sometimes.
When setUpClass() didn't set it, the device_suffix would be "privateuse1".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130091
Approved by: https://github.com/zou3519
The tests for `raise_comms` and `sink_waits` passes were not enabled in CI. The passes are now broken due to functional collective v2 and possibly other changes.
Correctness issues:
- The original passes did not take mutation into consideration and may yield semantically different scheduling order. This may be due to the recent changes to how mutations are expressed in Inductor IR (e.g., MutationOutput).
Effectiveness issues:
- The original passes only moved the comm/wait nodes themselves. However, comm nodes can come with prologues (e.g., clone for all_reduce_, split-cat for non-zero dim all-gather). Whenever there are any prologues, the comms won't be raised at all.
- The prologues are often horizontally fused with other pointwise nodes. This can severely delay the scheduling of the comm node.
This PR:
- Make the passes handle mutation correctly.
- Instead of moving individual comm/wait nodes, schedule all node using a scored method. This way the comm nodes can be optimally raised even in the presence of prologues.
- The horizontal fusion of prolofues often severely delays the scheduling of the comm node. Horizontally fusing this clone can almost never out-perform scheduling the comm node earlier. Also in most cases, this clone is eliminated via in-place reuse. Therefore, we tell the scheduler to not fuse it.
- Enable the tests in CI.
Co-authored-by: Will Feng <yf225@cornell.edu>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129980
Approved by: https://github.com/yf225
As discussed with @mlazos and @Chillee in the Inductor group chat, we need the concept of `GroupedSchedulerNode` to be able to express nodes that must be scheduled together one-after-another (i.e. no other node is allowed to fuse into them or schedule in-between them).
This is particularly important for comm reordering and fine-grained control of peak memory. For Traceable FSDP2, there are two very important requirements:
- At any time, there must be only one AllGather in flight. However, our existing comm reordering pass will naturally raise **all** of AllGather ops to the beginning of the graph, which will clearly blow up memory usage. Instead, we leverage GroupedScheduleNode which provides simple connection points to build the "chaining" on. i.e. we use it to express the schedule `(copyin + AllGather1) -> (AllGather1Wait+copyout) -> (copyin + AllGather2) -> (AllGather2Wait+copyout) ...` by setting up fake dep between the GroupedScheduleNode, which is a very clean and easy-to-understand way to express this schedule.
- The "comms" in FSDP2 are not just comms, but a combination of compute and comm. We must prevent other nodes from being scheduled in-between that set of nodes, otherwise we are artificially delaying the release of comm buffer memory which makes the peak memory usage quite bad. This is particularly pronounced for `AllGatherWait+copyout`.
From these two requirements, we derive the behavior of `GroupedSchedulerNode`: it contains nodes that must be scheduled together one-after-another (i.e. no other node is allowed to fuse into them or schedule in-between them).
----
Q: Can we leverage `ir.Subgraph`?
A: I looked into the possibility of using `ir.Subgraph` to implement this, but realized that:
1. `ir.Subgraph` requires defining the subgraph in FX IR.
2. There is no guarantee that the Inductor IR nodes that we want to group together will all have a corresponding FX IR node, because some of those Inductor IR nodes can potentially be dynamically generated by a custom pass in the scheduler (e.g. for merging multiple all-gathers into one big all-gather, and later we want to group that big all-gather with some other op). Dynamically generated Inductor IR node doesn't have a corresponding upstream FX IR node.
3. For the above reasons, we can't use the `ir.Subgraph`, and need to define a new (and more lightweight) concept of `GroupedSchedulerNode` to achieve the behavior we need (this PR).
----
Test commands:
- `pytest -rA test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc::test_grouped_scheduler_node`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128568
Approved by: https://github.com/eellison, https://github.com/mlazos
Summary: Users have been confused why user annotations on GPU tracks do not show when doing GPU only tracing. This comment should help users understand that to use this function they need to have CPU activies enabled.
Test Plan: N/A it is just updating a comment
Differential Revision: D59649390
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130561
Approved by: https://github.com/aaronenyeshi
Before the PR, custom ops that don't return outputs will get eliminated after calling `.module()` because the effect_token that keeps the operator alive is removed in remove_effect_token pass. The reason why we want to remove_effect_token is because we don't want the token to be part of input. However, this causes DCE calls in remove_effect_token itself and the dce calls in unlift to remove the custom op in the graph causing an error in the exported graph.
This PR calls has_side_effect in with_effect to make sure graph.eliminate_dead_code doesn't remove the calls by accident.
Test Plan:
Add a new test pytest test/export/test_torchbind.py -k test_export_inplace_custom_op
Differential Revision: [D59498728](https://our.internmc.facebook.com/intern/diff/D59498728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129680
Approved by: https://github.com/angelayi
Take intersection of all the tags for corresponding aten op overloads. Previously, some of the rng ops not having tags caused issues with constant folding (they should get decomposed but thats a separate issue).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130367
Approved by: https://github.com/ezyang
Summary: This diff updates the ExportedProgram class in PyTorch to allow for multiple verifiers to be attached to it. This is done by adding a new field to the ExportedProgram schema called "verifiers" which is a list of strings representing the names of the verifiers to be attached to the program. The verifiers are loaded using the "load_verifier" function which is defined in the "torch._export.serde.serialize" module. The "exported_program.dialect" field is also deprecated in favor of the "verifiers" field.
Test Plan: CI
Differential Revision: D59408546
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130364
Approved by: https://github.com/angelayi, https://github.com/ydwu4
Fixes#129865
Currently, new_group will call ncclCommSplit in some cases. In theory, ncclCommSplit will bring performance and memory benefits. However, the config parameter of the ncclCommSplit function in pytorch does not set "splitShare=1", which results in the optimization of ncclCommSplit being turned off and the benefits being invalid.
This PR turn on splitShare=1 to make the optimization of comm_split effective.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129929
Approved by: https://github.com/shuqiangzhang
Summary: Previously, remove_effect_tokens pass didn't pass kwargs to the internal nodes. This PR fix it and add a test for it.
Test Plan: buck2 run caffe2/test:test_export -- -r test_remove_effect_token_kwargs
Reviewed By: angelayi
Differential Revision: D59603147
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130491
Approved by: https://github.com/angelayi
Fixes #ISSUE_NUMBER
As a followup to https://github.com/pytorch/pytorch/pull/130454, users are hitting the cross-mesh operation error because the DeviceMesh thread ID differs between the saved vs. loaded DTensor due to thread id being different.
This is a hot fix to only consider the real thread_id in DeviceMesh hash under threaded backend, but set it to None for all other cases.
As a follow up, we need to look at the following test failures to better root cause specific DeviceMesh related failures related to MTPG, if thread_id is not included as part of the hash.
```
test/distributed/_composable/fsdp/test_fully_shard_training.py::TestFullyShardRegisteredParams::test_param_registration_after_forward
test/distributed/_tensor/test_dtensor_ops.py::TestDTensorOpsCPU::test_dtensor_op_db_column_stack_cpu_float32
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130495
Approved by: https://github.com/awgu, https://github.com/wanchaol
This PR fixes profiler/test_profiler.py::.TestProfiler::test_oom_tracing
Test expects OOM by allocating huge tensor. But MI300X has enough memory to allocate such a tensor.
This PR increases tensor size with a large margin to force OutOfMemory exception on MI300X and future GPU generations
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130334
Approved by: https://github.com/jeffdaily, https://github.com/janeyx99
And move `using namespace mps` outside of every function as there are no
need to repeat it
Use `getTensorsStringKey` instead of explicit
`getMPSShapeString(getMPSShape(t)) + getMPSDataTypeString(t)`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130541
Approved by: https://github.com/Skylion007
Fixes https://github.com/pytorch/pytorch/issues/130456
When we mark_unbacked a size, we actually DO have a hint for it
(because we have a real, input tensor) for it, and previously, we were
accidentally putting it into the hint field of SymNode. If marked
unbacked size is zero or one, this can lead to inconsistency between
hint compute and static evaluation compute under guard size oblivious,
since that's the whole point of size oblivious. Answer is to scrub out
hints on mark unbacked ints.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130483
Approved by: https://github.com/lezcano
This PR makes it so that all tensors are reduced to their metadata in AOTAutogradCache. Because dynamo always embeds constant tensors into the FXgraph directly, there's no risk of a constant tensor whose values are semantically important being lost here. AOTAutograd itself may take a constant tensor and set it as an attribute on an FXGraph for inductor, but Dynamo never does this.
One other thing that this diff does is add `[pickler.fast](https://docs.python.org/3/library/pickle.html#pickle.Pickler.fast)` to our pickling algorithm for cache key generation. Pickle will often memoize/intern strings when pickling, leading to false cache misses due to inconsistent memoization. Turning on pickler.fast removes this behavior.
Technically `fast` is a "deprecated" feature according to python docs. But it's still supported in py3.8-3.12, and if it ever is removed, the only downside will just be a few more cache misses, so I think it's worth just adding here (and removing later as needed)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128583
Approved by: https://github.com/oulgen
ghstack dependencies: #128335
This PR is to update the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8`, both for CPU, CUDA and MPS, which can help decouple int4 model checkpoint with different ISAs and different platforms in `gpt-fast`. The advantage is int4 model checkpoint can be shared in different test machines, without re-generating in one certain platform. Meanwhile, the size of input `weight` can be reduced to `1 / 8`.
Before this PR, packed weight stored in CUDA specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. CPU packed weight viewed as the SAME shape but stored in different layout: `[n/64][k][32]`, dtype uint8. Weight is strongly coupled with platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar). And users cannot use a generated weight in another different ISA or platform, because when loading weight into devices, the compute format is different.

Now, we use common serialized layout (`[n][k/2] uint8`) for different devices or ISAs as input `weight` of `_convert_weight_to_int4pack`, and each back chooses how to interpret as compute layout.

### Performance
Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores)
There is no obvious regression of this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima
In particular, when creating the PyTorch wheel, we use setuptools find_packages 551b3c6dca/setup.py (L1055) which explicitly skips packages without `__init__.py` files (namespace packages) https://setuptools.pypa.io/en/latest/userguide/package_discovery.html#finding-simple-packages.
So this PR is reverting the change to stop skipping these namespace packages as, even though they are in the codebase, they are not in the published binaries and so we're ok relaxing the public API and importability rules for them.
A manual diff of the two traversal methods:
```
torch._inductor.kernel.bmm
torch._inductor.kernel.conv
torch._inductor.kernel.flex_attention
torch._inductor.kernel.mm
torch._inductor.kernel.mm_common
torch._inductor.kernel.mm_plus_mm
torch._inductor.kernel.unpack_mixed_mm
torch._strobelight.examples.cli_function_profiler_example
torch._strobelight.examples.compile_time_profile_example
torch.ao.pruning._experimental.data_sparsifier.benchmarks.dlrm_utils
torch.ao.pruning._experimental.data_sparsifier.benchmarks.evaluate_disk_savings
torch.ao.pruning._experimental.data_sparsifier.benchmarks.evaluate_forward_time
torch.ao.pruning._experimental.data_sparsifier.benchmarks.evaluate_model_metrics
torch.ao.pruning._experimental.data_sparsifier.lightning.tests.test_callbacks
torch.ao.quantization.experimental.APoT_tensor
torch.ao.quantization.experimental.adaround_fake_quantize
torch.ao.quantization.experimental.adaround_loss
torch.ao.quantization.experimental.adaround_optimization
torch.ao.quantization.experimental.apot_utils
torch.ao.quantization.experimental.fake_quantize
torch.ao.quantization.experimental.fake_quantize_function
torch.ao.quantization.experimental.linear
torch.ao.quantization.experimental.observer
torch.ao.quantization.experimental.qconfig
torch.ao.quantization.experimental.quantizer
torch.csrc.jit.tensorexpr.codegen_external
torch.csrc.jit.tensorexpr.scripts.bisect
torch.csrc.lazy.test_mnist
torch.distributed._tensor.examples.checkpoint_example
torch.distributed._tensor.examples.comm_mode_features_example
torch.distributed._tensor.examples.comm_mode_features_example_argparser
torch.distributed._tensor.examples.convnext_example
torch.distributed._tensor.examples.torchrec_sharding_example
torch.distributed._tensor.examples.visualize_sharding_example
torch.distributed.benchmarks.benchmark_ddp_rpc
torch.distributed.checkpoint.examples.async_checkpointing_example
torch.distributed.checkpoint.examples.fsdp_checkpoint_example
torch.distributed.checkpoint.examples.stateful_example
torch.distributed.examples.memory_tracker_example
torch.fx.experimental.shape_inference.infer_shape
torch.fx.experimental.shape_inference.infer_symbol_values
torch.include.fp16.avx
torch.include.fp16.avx2
torch.onnx._internal.fx.analysis.unsupported_nodes
torch.onnx._internal.fx.passes._utils
torch.onnx._internal.fx.passes.decomp
torch.onnx._internal.fx.passes.functionalization
torch.onnx._internal.fx.passes.modularization
torch.onnx._internal.fx.passes.readability
torch.onnx._internal.fx.passes.type_promotion
torch.onnx._internal.fx.passes.virtualization
torch.utils._strobelight.examples.cli_function_profiler_example
torch.utils.benchmark.examples.sparse.compare
torch.utils.benchmark.examples.sparse.fuzzer
torch.utils.benchmark.examples.sparse.op_benchmark
torch.utils.tensorboard._convert_np
torch.utils.tensorboard._embedding
torch.utils.tensorboard._onnx_graph
torch.utils.tensorboard._proto_graph
torch.utils.tensorboard._pytorch_graph
torch.utils.tensorboard._utils
torch.utils.tensorboard.summary
torch.utils.tensorboard.writer
```
These are all either namespace packages (which we want to remove) or package that are not importable (and tagged as such in the test).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130497
Approved by: https://github.com/aorenste
In this PR, we abstracted the different types of aten operation parameters as `ParameterMetadata`. This structure intends to be used to represent and store the metadata of each aten operation parameter. Currently, it only supports `Tensor`, `TensorList`, and `Scalar`.
```C++
using ParameterMetadataValue = std::variant<TensorMetadata, std::vector<TensorMetadata>, c10::Scalar>;
```
With this PR, we can extend other parameter-type support in a more modularize way, like `string`, `int`, `double`.
Differential Revision: [D59399546](https://our.internmc.facebook.com/intern/diff/D59399546)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125308
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/atalman
This PR adds support to use expandable segments with private memory pools which should unblock using it with cuda graphs and cuda graph trees. Currently, the allocator silently avoids using expandable segments when allocating in a private pool due to checkpoint saving/restoring not meshing well with how we keep track of unmapped blocks.
The PR itself is pretty short, most of the logic for checkpointing and reapplying state for non-expandable segments transfers over without much work.
Expandable segments reserve a virtual address space of size equal to the amount of physical memory on the GPU. Every time we want to `malloc()` or `free()` memory in a memory pool with expandable segments turned on, we map/unmap pages of physical GPU memory under the hood to create a new block that we return to the caller. This is beneficial due to the fact that each memory pool functions as a single segment of memory with a contiguous block of memory addresses that can grow and shrink as needed, avoiding fragmentation from allocating multiple non-contiguous segments that may not be merged together.
The caching allocator handles this by creating an unmapped block for the entire reserved virtual address space at init, which is treated similarly to an unallocated block in a free pool. When callers call `malloc()`, it's split and mapped to create allocated blocks, and calling `free()` similarly caches and merges free blocks in a free pool to be used later. Expandable blocks are unmapped and returned back to Cuda when they are cleaned up, or when we hit an OOM and the allocator attempts to remap cached free blocks. The code paths to map, free, and unmap blocks in expandable segments is similar to that for normal blocks and does all the same work of updating stats on memory usage, moving blocks between active and free pools, and returning memory to Cuda.
With Cuda Graph Trees and private memory pools, we need the ability to take checkpoints of the current state of the memory allocator after each graph capture as well as reapplying the state before capturing a new graph after replaying a captured graph so that the new cuda graph capture has access to the state of the allocator at the point after replaying a previously captured graph so it can reuse empty blocks and allocate new ones.
As mentioned in a below comment, memory in a private pool is cached until the private pool is destroyed and allocations can only grow from extra graph captures, any freeing of memory would result in invalid memory addresses and would break cuda graphs.
One implementation detail to note for unmapped blocks with expandable segments is that unmapped blocks are kept track in a member variable `unmapped` of a `BlockPool`. `unmapped` is *not* part of the checkpointed state of the caching allocator and isn't restored when reapplying checkpoints since we never free/unmap memory back to cuda and is persisted across graph captures / replays.
Checkpointing the current state of the memory allocator works as expected with expandable segments. Checkpointing grabs the first block of every segment in the active and free pools of the private pool and traverses the linked list of blocks in the segment to capture the state of every segment, which is then saved and kept for when it is needed to be reapplied. For expandable blocks, the last block in every segment will be an unallocated unmapped block containing the remaining amount of unmapped memory at graph capture time, and this too is saved in the checkpoint.
Reapplying the checkpoints works by freeing all allocated blocks and merging them into a single block per segment, then for each segment, we manually split and allocate all blocks from the checkpoint and then free the blocks marked as unallocated in the checkpoint state. For expandable segments, we need to make some modifications to not split unmapped blocks and avoid manually mapping then freeing unmapped blocks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128068
Approved by: https://github.com/zdevito, https://github.com/eqy
When I play with DCP for distributed inference, I found that we are still using deprecated APIs for DCP even in unit test. So this PR is using the new API with unified small letters "dcp".
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130475
Approved by: https://github.com/wz337
This is kind of a short-sighted workaround and we should actually come
up with a way to fix this in general, but I got annoyed that I can't use
-k to filter tests in test_schedule, and realized it's because we jam
tests using the new MultiProcContinuousTest fixture together with
old-style tests.
For now I separate the two types of tests so -k works again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130294
Approved by: https://github.com/H-Huang
Summary:
* Added support for preserving it during deepcopy, need to remap the args since _numeric_debug_handle refers
to the nodes in the graph
TODO: need to fully support re-export, currently the metadata for output node is not preserved
Test Plan:
python test/test_quantization.py -k test_deepcopy_preserve_handle
python test/test_quantization.py -k test_copy_preserve_handle
all related tests:
python test/test_quantization.py -k TestGenerateNumericDebugHandle
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129287
Approved by: https://github.com/zhxchen17
**Summary**
When check the vectorization status among 3 test suit, we found some operators disabled vectorization with message `Disabled vectorization: op: remainder`. In this PR, we add vectorization support of this op.
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_remainder
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_int_div_vec
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129849
Approved by: https://github.com/jgong5, https://github.com/lezcano
ghstack dependencies: #130405
`max_norm=True` is currently written in the note, but `max_norm` can be a `float`, NOT a `bool` (as the [docstring](ec284d3a74/torch/nn/modules/sparse.py (L30)) says).
That note was created in #45595
The current pull request cleans it up.
The value `True` in the note can confuse the users to think it can be a boolean.
In fact, a counter-intuitive behavior will happen if users try to set it to `False`:
it will be interpreted as 0, so the values of the embedding will become 0 - not what the users were expecting by setting it to `False`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129687
Approved by: https://github.com/mikaylagawarecki, https://github.com/malfet
Summary:
If triton is available, but we can't import triton.compiler.compiler.triton_key, then we see some annoying behavior:
1) If we don't actually need to compile triton, the subprocess pool will still spew error messages about the import failure; it's unclear to users if this is an actual problem.
2) If we do need to compile triton, we a) see the error messages from above and b) get a vanilla import exception without the helpful "RuntimeError: Cannot find a working triton installation ..."
Test Plan: Ran with and without torch.compile for a) recent version of triton, b) triton 2.2, and c) no triton. In all cases, verified expected output (success or meaningful error message)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130403
Approved by: https://github.com/eellison
original PR: https://github.com/pytorch/pytorch/pull/128599 (re-created after revert + poisoned diff train)
Summary:
This PR adds deduplication and CSE for runtime asserts. Existing size computation in the graph is CSE'd along with added runtime asserts, and redundant asserts are removed. Shape calls on intermediate tensors are also turned into compute on input sizes if possible, allowing intermediate tensors to be freed earlier. For example:
```
z = torch.cat([x, x], dim=0) # 2*s0
w = z.repeat(y.shape[0]) # 2*s0*s1
_w = w.shape[0]
s0 = x.shape[0]
s1 = y.shape[0]
_w0 = 2 * s0
_w = _w0 * s1
```
Additionally, constrain_range calls are deduplicated. Single-symbol bound checks for unbacked symbols (e.g. u0 >= 0, u0 <= 5) and sym_constrain_range.default calls are also removed, since they accumulate range info in the ShapeEnv, and are replaced with two _assert_scalar.default calls that check the min/max bounds. For example:
```
torch.sym_constrain_range_for_size(n, min=2, max=16)
torch.sym_constrain_range(n, min=4, max=20)
torch._check(n >= 0)
torch._check(n >= 3)
torch._check(n <= 14)
torch.sym_constrain_range_for_size(n)
torch._check(n >= 4)
torch._check(n <= 14)
```
Test Plan:
contbuild & OSS CI, see 940e4477ab
Original Phabricator Test Plan:
Imported from GitHub, without a `Test Plan:` line.
Differential Revision: D59543603
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130380
Approved by: https://github.com/izaitsevfb
This is kind of a short-sighted workaround and we should actually come
up with a way to fix this in general, but I got annoyed that I can't use
-k to filter tests in test_schedule, and realized it's because we jam
tests using the new MultiProcContinuousTest fixture together with
old-style tests.
For now I separate the two types of tests so -k works again.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130294
Approved by: https://github.com/H-Huang
- Fix C++20 forward compatibility warnings, namely
```
warning: use of function template name with no prior declaration in function call with explicit template arguments is a C++20 extension [-Wc++20-extensions]
multi_tensor_apply_for_fused_optimizer<2, 512>(kernel_name,
```
- Use nested namespaces
- Do not explicitly specify `at::` namespace for functions already implemented inside of that namespace
- Use more convenience methods (rather than call by hand)
- Use C++14 `return f();` for void functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130453
Approved by: https://github.com/Skylion007
Reland of: https://github.com/pytorch/pytorch/pull/128016
Summary from previous PR:
We assume only two possible mutually exclusive scenarios:
Running compiled region for training (Any of inputs has requires_grad)
Produced differentiable outputs should have requires_grad.
Running compiled region for inference (None of inputs has requires_grad)
All outputs do not have requires_grad.
Even if user runs the region under no_grad(), but has an input Tensor with requires_grad - we go Training scenario (1).
With current state that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only that any of inputs requires_grad
2/ if needs_autograd => trace_joint (We are in training scenario 1.) => always run compiled region under with.enable_grad()
Changes in partitioner?
Inference and Training graphs had difference in return container, list/tuple.
The changes in partitioner are done to unify and return always tuple.
As a result - some changes in test_aotdispatch.py for graph contents list -> tuple.
Why was revert?
There was a regression of hf_Reformer model on inference.
```
TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode
```
Because one of the compiled graphs contained outputs, which are aliases to the inputs that are nn.Parameter(requires_grad=True).
Even if inference bencharmsk torchbench runs inside with` torch.no_grad()` - alias (specifically for hf_Reformer - expand) ops preserve requires_grad.
As a result we started compiling training graph instead of inference.
Fix for view ops:
If we have outputs, that are aliases to inputs that requires_grad, those outputs requires grad is not a reason to generate training graph.
This is handled in aot_autograd.py, where output_and_mutation_safe are calculated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890
Approved by: https://github.com/bdhirsh
Summary:
Currently we have an issue where CPU User annotations can overlap with python events in the event that a python event calls step() within the function itself. To combat this, we can move the left side of the user annotation to the beginning of the parent python function. We do this because when instantiating the profiler we already start on step 0.
To implement this, we start by collecting all instances of ProfilerStep during post processing. Since TorchOps and Python events are sorted already, we can easily check if the current python event partially overlaps with the current ProfilerStep and, if so, alter the start time of the current ProfilerStep. We then move to the next ProfilerStep and continue iterating through all the python events. This keeps the time complexity of adding events to 'out' at O(s + n) -> O(n) post sorting, where "s" is the number of ProfilerSteps and "n" is the length of all events.
Test Plan:
Added unit test in which step() is called midway through a function. Afterwards, we print out a trace and then load the json to check that there are no overlaps. Also make sure that there is no regression in performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129599
Approved by: https://github.com/aaronenyeshi
When the input predicate is a python constant, we specialize into one of the branches and warn users that torch.cond is not preserving the dynamism. The previous behavior is that we baked in True/False in the cond operator. This can be confusing. In this PR, we change it to be specializing into one of the branches when the inputs are constants.
We additionally change the naming of cond operator to default one without overriding its name. This allows better testing on de-serialized graph.
Test Plan:
The predicate in some existing tests is the result of a shape comparison. When no dynamic shape is involved, the predicate is a python bool. To fix them, we either change the predicate to be some data-dependent tensor or change the test to check cond is specialized as one of the branches,
Differential Revision: [D59589709](https://our.internmc.facebook.com/intern/diff/D59589709)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128709
Approved by: https://github.com/zou3519
This fixes MSVC build regression introduced by https://github.com/pytorch/pytorch/pull/129710 as VC++ fails to unroll nested defines in the specific order and fails with
```
C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\aten\src\ATen\native\cuda\int4mm.cu(984): error: "#" not expected here
do { const cudaError_t __err = cudaFuncGetAttributes( &funcAttr, #if defined(USE_ROCM) (void *)func #else func #endif ); c10::cuda::c10_cuda_check_implementation( static_cast<int32_t>(__err), "C:\\actions-runner\\_work\\pytorch\\pytorch\\builder\\windows\\pytorch\\aten\\src\\ATen\\native\\cuda\\int4mm.cu", __func__, static_cast<uint32_t>(991), true); } while (0);
```
Fixes https://github.com/pytorch/pytorch/issues/130437
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130441
Approved by: https://github.com/Skylion007, https://github.com/malfet
Fixes the example in #118304 for `torch._functorch.aot_autograd.aot_export_module` and `torch.export.export`.
On a high level, the issue is caused by not detecting fake_mode when there's no input.
Change plan:
1) we add a `dynamic_shapes: Union[bool, None] = None` arg to `aot_export_module` and `_aot_export_function`.
2) if the input is not a graph module, then we can only rely on this `dynamic_shapes` input arg.
3) If the input is a graph module, then we can traverse the graph and check.
4) So we check if the input mod is a graph module or just a module, and do 2) or 3) depending on the type.
Fixes#129927
Bug source: dynamo's fake_mode is not detected correctly in `_convert_input_to_fake` in `_traced.py` when there’s no input to the graph). So in ` _strict_export_lower_to_aten_ir`, we create another fake_mode. `dynamo_fake_mode` is not the same as the fake_mode used by dynamo.
Change plan:
check `gm_torch_level` graph's node meta "example_value" for fake mode in addition.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129928
Approved by: https://github.com/angelayi
Needle has moved quite a bit on the ROCm backend front. This PR intended to examine the tests referenced in the following issue: https://github.com/pytorch/pytorch/issues/96560
This a follow-up PR to https://github.com/pytorch/pytorch/pull/125069
unskipping the next batch of tests referenced by the aforementioned issue. No explicit changes needed for source as they worked immediately after unskipping.
The tests previously marked with xfail have now been modified to not expect a failure iff running on ROCm as they now pass. Behavior is unchanged for them on other architectures.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127966
Approved by: https://github.com/malfet
`aten._to_copy` can receive a python number as input. This occurs in
torch.compile support for vmap (see #130188). Previously, this would
raise an assertion error. This PR changes it so that if we see a python
number, we call torch.scalar_tensor on it first (h/t @bdhirsh).
Fixes#130362Fixes#130188
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130381
Approved by: https://github.com/Chillee
Summary:
1. Fixed#130201 by adding type promotion.
2. Added proper tests.
3. Found torch's type promotion is different from numpy as follows:
```python
import torch
import numpy as np
np.clip(np.array([1], dtype=np.float32), np.array([1], dtype=np.int32), None).dtype # dtype('float64')
torch.clamp(torch.tensor([1], dtype=torch.float32), torch.tensor([1], dtype=torch.int32)).dtype # torch.float32
```
~Not sure the proper way to handle it, it causes numpy ref tests to fail.~
Reason here, so think I'm gonna xfail it:
3c1cf03fde/test/test_ops.py (L260-L264)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130226
Approved by: https://github.com/malfet
This PR is to update the input `weight` of `_convert_weight_to_int4pack` from `[n][k] int32` to `[n][k / 2] uint8`, both for CPU, CUDA and MPS, which can help decouple int4 model checkpoint with different ISAs and different platforms in `gpt-fast`. The advantage is int4 model checkpoint can be shared in different test machines, without re-generating in one certain platform. Meanwhile, the size of input `weight` can be reduced to `1 / 8`.
Before this PR, packed weight stored in CUDA specific layout: `[n/8][k/(InnerKTiles*16)][32][InnerKTiles/2]`, dtype int32, where InnerKTiles = 2, 4, 8. CPU packed weight viewed as the SAME shape but stored in different layout: `[n/64][k][32]`, dtype uint8. Weight is strongly coupled with platforms (CPU/CUDA) and ISAs (AVX512/AVX2/scalar). And users cannot use a generated weight in another different ISA or platform, because when loading weight into devices, the compute format is different.

Now, we use common serialized layout (`[n][k/2] uint8`) for different devices or ISAs as input `weight` of `_convert_weight_to_int4pack`, and each back chooses how to interpret as compute layout.

### Performance
Intel (R) Xeon (R) CPU Max 9480, single socket (56 cores)
There is no obvious regression of this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129940
Approved by: https://github.com/jgong5, https://github.com/lezcano, https://github.com/mingfeima
Fix static `py::object`s with `py::gil_safe_call_once_and_store`.
The following code will leak a `py::object` which will call its destructor when shutdown the program. The destructor will call `Py_DECREF(obj.m_ptr)` which may raise a segmentation fault.
```c++
void func() {
static py::object obj = py::module_::import("foo").attr("bar");
...
}
```
The correct code is to use raw pointers rather than the instance.
```c++
void func() {
static py::object* obj_ptr = new py::object{py::module_::import("foo").attr("bar")};
py::object obj = *obj_ptr;
...
}
```
This PR uses the `py::gil_safe_call_once_and_store` function from `pybind11`, which can run arbitrary initialization code only once under the Python GIL thread safely.
```c++
void func() {
PYBIND11_CONSTINIT static py::gil_safe_call_once_and_store<py::object> storage;
py::object obj = storage
.call_once_and_store_result(
[]() -> py::object {
return py::module_::import("foo").attr("bar");
}
)
.get_stored();
...
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130341
Approved by: https://github.com/ezyang
When applied to a triton kernel, capture_triton allows the triton kernel
to be captured when tracing with make_fx. It does this by transforming the
call to the triton kernel into a call to the
triton_kernel_wrapper_mutation HOP, which can actually be traced into a
graph via make_fx.
We have two main uses cases for this:
- non-strict export doesn't use Dynamo, but people want to use
non-strict export to export programs with triton kernels.
non-strict export uses make_fx tracing, so this is a necessary step in
that direction.
- People want to write inductor passes that replace a sequence of
operators with a call to a function that may contain a triton kernel.
The way these passes work today is that we have a FX graph and want to
replace a subgraph of it with a new subgraph. We obtain said subgraph
from calling make_fx on the function; this won't work on raw triton
kernels but will work if one uses capture_triton.
Test Plan:
- I wrote some manual tests to run make_fx over two of the triton
kernels in test_triton_kernels. It would be nice to be able to run
make_fx through all of the tests in the file but I'm not sure how to
do that refactor right now.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130178
Approved by: https://github.com/oulgen
ghstack dependencies: #130177
TritonKernelVariable's logic tells us how to go from a user-defined
triton kernel and a grid to a call to the triton_kernel_wrapper_mutation
HOP. We want to re-use this in a setting without Dynamo; in the next PR
up, we create a new decorator (capture_triton) that, when applied to a
triton kernel, transforms a call to the triton kernel into a call
to the triton_kernel_wrapper_mutation HOP.
Test Plan:
- existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130177
Approved by: https://github.com/oulgen, https://github.com/ydwu4
Summary:
When writing out Graphviz files for graphs, sometimes the arguments are all
in a row and it's unclear which is which. Like for `aten.conv2d`, someone might not
remember the stride, padding, dilation order.
Add an option `normalize_args` (defaults to False) to normalize all args into kwargs.
This should help the readability of a graph.
Differential Revision: D59529417
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130348
Approved by: https://github.com/mcremon-meta
Summary: This diff fixes a bug, where all record_annotations will save a TraceEntry to each of the device_traces. Instead, we should only save annotations to the current device_trace that is being called by the thread calling the native allocator's recordAnnotation.
Test Plan: CI and ran workloads on MVAI WPR FBR.
Reviewed By: zdevito
Differential Revision: D59477339
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130315
Approved by: https://github.com/zdevito
We add torch.library.Library._register_torch_dispatch_rule. Here, a user
can provide us a specific rule to run for a specific
(torch_dispatch_class, operator) pair. The motivation is that a user
might want to extend a subclass/mode but may not have access to the
source code of the subclass/mode.
I'll make this public in a follow-up PR if we think the approach and API
is good.
Keep in mind that many subclasses will likely deliver their own open
registration solution (DTensor has register_sharding_prop_rule and NJT
has register_jagged_op); _register_torch_dispatch_rule is meant as a
catch-all open registration mechanism for when the subclass hasn't
provided anything more specific.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130064
Approved by: https://github.com/albanD
- Add AMD support for int4 kernel
- Only supports CDNA2 and CDNA3 gpus for now
- Uses `mfma_f32_16x16x16bf16` instruction for matrix multiply
- Uses `v_and_or_b32` instruction and `__hfma2` instrinsic for unpacking bf16 values
- Enable hipify for `__nv_bfloat16` and `__nv_bfloat162` data types
- Enable int4 unit tests for CDNA2 and CDNA3 AMD gpus
- Fix torchscript issues due to hipify for `__nv_bfloat16` type
- TorchScript has its own implementation for bfloat16 type
- Implemented in `__nv_bloat16` structure at [resource_strings.h](https://github.com/pytorch/pytorch/blob/main/torch/csrc/jit/codegen/fuser/cuda/resource_strings.h)
- So, we shouldn't hipify any reference of `__nv_bfloat16` in the torchscript implementation
- Hence moved the `__nv_bfloat16` direct references in `codegen.cpp` and `cuda_codegen.cpp` to `resource_strings.h` which is already exempted from hipify
Fixes#124699
Fixes pytorch-labs/gpt-fast/issues/154
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129710
Approved by: https://github.com/malfet
Summary: Previously, when we inline the subgraphs that doesn't have a different require_grad environment, we didn't clean up the nodes's users in subgraph and direcly used them to to replace the output of the call_modules. This records dead depencies in node.users. This PR fixes this.
Test Plan:
Added a new test.
Also see the torchrec tests:
Step 1:
buck run mode/dev-nosan //aimp/experimental/pt2:pt2_export -- --model-entity-id 934687114 --output /tmp/934687114.zip --use-torchrec-eager-mp --use-manifold
Step 2:
buck run mode/opt -c python.package_style=inplace -c fbcode.enable_gpu_sections=true aimp/cli:cli -- --platform=aps --template=disagg_gpu_aps_pt2 --pt2 --model-entity-id=934687114 non-request-only-tagging torchrec-shard-and-quantize gpu-disagg-split assign-device materialize-weights script-and-save
Differential Revision: D59132214
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129716
Approved by: https://github.com/angelayi
Threads inside the thread pools are not named, so they inherit the main process name or the name of the first thread. In our case if we set `pt_main_thread` as the thread name when a thread does `import torch`, this name will be inherited by all the threads in the created pools.
This PR names the threads in the pools I was able to find. There are other pools created, like OpenMP ones and we need to follow-up on those.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130270
Approved by: https://github.com/d4l3k, https://github.com/albanD
Summary: I actually don't grok why this pattern works; I guess pytest expects a different import syntax for these relative imports?? But this pattern is used in many other tests here (notably `test_aot_inductor.py`), so it must be right ;)
Test Plan:
Ran both ways:
* `python test/inductor/test_memory_planning.py`
* `pytest test/inductor/test_memory_planning.py`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130275
Approved by: https://github.com/zou3519
----
- We now record on CacheEntry what the compile id that populated it was, so now we can say why a specific frame was rejected
- Add structured log for recompiles under name artifact "recompile_reasons". As it stands, it's not terribly structured, but this was the easiest thing I could do to start
- Slightly reformat multi-reason printing; since we only report one guard failure seems better to have it as a single line
Example output:
```
V0703 10:34:13.273000 140345997743104 torch/_dynamo/guards.py:2590] [0/1] [__recompiles] Recompiling function f in /data/users/ezyang/a/pytorch/b.py:3
V0703 10:34:13.273000 140345997743104 torch/_dynamo/guards.py:2590] [0/1] [__recompiles] triggered by the following guard failure(s):
V0703 10:34:13.273000 140345997743104 torch/_dynamo/guards.py:2590] [0/1] [__recompiles] - 0/0: tensor 'L['x']' size mismatch at index 0. expected 4, actual 5
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130043
Approved by: https://github.com/anijain2305
Previously, subgraph input names were whatever the input proxies were,
which were confusing. This PR changes those names to be
whatever the names of the arguments the functions being
speculate_subgraph'ed are. This is best-effort: if we can't figure it
out then we go back to the previous strategy.
Test Plan:
- existing expecttests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130255
Approved by: https://github.com/ydwu4
Auto slow test detection is marking and then un marking these as slow, so permanently mark them as slow on windows.
These tests take >500s on windows.
This is part of the reason why test_decomp keeps failing on windows (ex da66e50e6e)
The other part is something to do with reruns + thresholds that I am still investigating
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130260
Approved by: https://github.com/huydhn, https://github.com/malfet
Previously, jobs would log lines like this due to interpreteting an int8 value as a signed char when streaming out.
"ProcessGroupNCCL created ncclComm_ 0x94960120 on CUDA device: ^@"
We need a better solution for avoiding this systematically, but at least
for now fix the spot we know about.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130184
Approved by: https://github.com/eeggl, https://github.com/Skylion007
Summary:
Previously we store edge id in numeric_debug_handle to support operator fusion and operator decomposition throughout the stack,
but according to feedback from customers, people prefer the simpler per-node id, and they are fine with not having the additional
support for numerical debugging for inputs and willing to hack around to achieve this.
This PR changes the structure of numeric_debug_handle to store unique_id for each node instead.
e.g.
graph:
```
node = op(input_node, weight_node)
```
Before:
```
node.meta[NUMERIC_DEBUG_HANDLE_KEY] = {input_node: id1, weight_node: id2, "output": id3}
```
After:
```
node.meta[NUMERIC_DEBUG_HANDLE_KEY] = id1
```
Test Plan:
python test/test_quantization.py -k TestGenerateNumericDebugHandle
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129811
Approved by: https://github.com/tarun292
This PR updates the public API for NJT construction `torch.nested.nested_tensor_from_jagged()` to accept values for min / max sequence length. It's useful to provide these ahead of time to avoid GPU -> CPU syncs from on-demand computation later on.
NB: The test changes are extensive because I reworked the existing `_validate_nt()` helper function used throughout our NJT construction tests to verify more (specifically: expected cached min / max seq len and contiguity).
API design question: should we additionally provide an option to compute these from `offsets` at construction time? I can think of three possible cases during construction:
1. Min / max seq len has already been obtained from *somewhere* (manual calculation, static values, etc.) and they should be used in the cache
2. Min / max seq len should be computed immediately at construction time for use in the cache (ideally, the caller wouldn't have to do this computation manually)
3. Min / max seq len are not needed at all (i.e. SDPA isn't ever called) and computation should be skipped
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130175
Approved by: https://github.com/davidberard98, https://github.com/soulitzer
We add torch.library.Library._register_torch_dispatch_rule. Here, a user
can provide us a specific rule to run for a specific
(torch_dispatch_class, operator) pair. The motivation is that a user
might want to extend a subclass/mode but may not have access to the
source code of the subclass/mode.
I'll make this public in a follow-up PR if we think the approach and API
is good.
Keep in mind that many subclasses will likely deliver their own open
registration solution (DTensor has register_sharding_prop_rule and NJT
has register_jagged_op); _register_torch_dispatch_rule is meant as a
catch-all open registration mechanism for when the subclass hasn't
provided anything more specific.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130064
Approved by: https://github.com/albanD
**Summary**
In order to give users more information, I have added the deviceMesh for operations with DTensor inputs, and module parameter sharding and FQN. These changes have only been placed in operation tracing log. In the future, I plan to just have one logging function with an argument to show how detailed a user wants the log to be, and will get rid of the module tracing log function. This information has also been added to the JSON dump and can be seen in the browser visual. I have also edited the test case file as the module_depth dictionary has been replaced with module_helper_dict and have edited the example output for the MLP operation tracing which can be seen below:
**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_json_dump
2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump
3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing
4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing
5. pytest test/distributed/_tensor/debug/test_comm_mode_features.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130072
Approved by: https://github.com/XilunWu
ghstack dependencies: #129994
Summary: We call `.get` in the elastic store barrier operation but we don't need the result. This switches it to use `.wait` instead which eliminates one network round trip as `get` internally does a wait first.
Test Plan:
CI + existing tests -- no behavior change
Differential Revision: D59396199
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130148
Approved by: https://github.com/kurman, https://github.com/wconstab
for checkpoint optimizer, tensors are created on CUDA when other backends are used. This is because by default torch.device() constructed via a single device ordinal is treated as a cuda device.
In _alloc_tensor, empty tensor are created using device = cast(torch.device, _get_device_module(device_type).current_device()). above will return only the index which will create the empty tensor on CUDA by the default behavior. So, change it to use torch.device(device_type,device_module(device_type).current_device()) to get the device with the index.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129110
Approved by: https://github.com/fegin
This PR makes it so that we don't try to serialize FunctionalTensorWrappers. FunctionalTensorWrappers don't pickle well because they have no underlying storage. This should be fixable at a later point, but I might not be the right author for implementing the serialization for it. If there's a way to avoid actually saving the FunctionalTensorWrappers themselves and just saving the ViewMetadata so we can replay it, that would also work.
To do this, we disable view_replay_input_mutations when using AOTAutogradCache, and then only keep the functional tensor in the ViewAndMutationMeta if we need it for view_replay_input_mutations (i.e. the cache is off).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128335
Approved by: https://github.com/bdhirsh
In https://www.internalfb.com/intern/sevmanager/view/s/429861/, a downstream consuming buffer `buf486_buf526` had two read dependencies; `buf373` and `buf394`, both of which were at separate indices of the upstream foreach op. `buf486_buf526` was fused into `buf373` because in the usual fused case, this is completely fine if all dependencies are met in the upstream fused buffer. However in the foreach case and this case specifically it is possible for foreach ops to be partitioned if there are many arguments in order to stay under CUDA driver arg limits. As a result, this large foreach op was split into two, and the latter had `buf394` in its node schedule for allocation, while the earlier split did not, even though `buf486_buf526` uses the `buf394`, as a result we would hit the unbound local error.
@eellison provided this repro to help debug the issue (https://www.internalfb.com/phabricator/paste/view/P1453035092)
To fix this, we no longer return a valid producer subnode if there are multiple producer subnodes for a downstream consuming op. In short we should not fuse if there are dependencies on multiple foreach subkernels because 1) their execution order is non-deterministic and 2) (this issue) we may not properly handle dependencies in the presence of foreach partitioning.
Co-authored-by: David Berard <dberard@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130046
Approved by: https://github.com/eellison
This PR is needed to resolve usability issues with PyTorch ROCm nightly wheels on non-gfx90a/gf94x architectures as a result of https://github.com/pytorch/pytorch/pull/127944.
Addresses https://github.com/pytorch/pytorch/issues/119081#issuecomment-2166504992
### With this PR's changes, I get the following on a gfx908 (unsupported by hipblasLT) architecture:
_Using setter function:_
```
>>> torch.backends.cuda.preferred_blas_library(backend="cublaslt")
[W617 19:58:58.286088851 Context.cpp:280] Warning: torch.backends.cuda.preferred_blas_library is an experimental feature. If you see any error or unexpected behavior when this flag is set please file an issue on GitHub. (function operator())
[W617 19:59:02.125161985 Context.cpp:291] Warning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (function operator())
<_BlasBackend.Cublas: 0>
```
_Using `TORCH_BLAS_PREFER_HIPBLASLT` env var:_
```
root@9d47bf40d4d4:/tmp/pytorch# TORCH_BLAS_PREFER_CUBLASLT=1 python
>>> import torch
>>> torch.backends.cuda.preferred_blas_library()
[W619 06:14:11.627715807 Context.cpp:274] Warning: Attempting to use hipBLASLt on an unsupported architecture! Overriding blas backend to hipblas (function operator())
<_BlasBackend.Cublas: 0>
```
### and the following on a gfx90a (supported by hipblasLT) architecture:
_Using setter function:_
```
>>> import torch
>>> torch.backends.cuda.preferred_blas_library()
<_BlasBackend.Cublaslt: 1>
>>> torch.backends.cuda.preferred_blas_library(backend="cublas")
<_BlasBackend.Cublas: 0>
>>> torch.backends.cuda.preferred_blas_library(backend="cublaslt")
[W620 18:38:29.404265518 Context.cpp:293] Warning: torch.backends.cuda.preferred_blas_library is an experimental feature. If you see any error or unexpected behavior when this flag is set please file an issue on GitHub. (function operator())
<_BlasBackend.Cublaslt: 1>
```
_Using `TORCH_BLAS_PREFER_HIPBLASLT` env var:_
```
root@9d47bf40d4d4:/tmp/pytorch# TORCH_BLAS_PREFER_HIPBLASLT=1 python
>>> import torch
>>> torch.backends.cuda.preferred_blas_library()
<_BlasBackend.Cublaslt: 1>
```
(Same result for _Using `TORCH_BLAS_PREFER_CUBLASLT` env var:_)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128753
Approved by: https://github.com/malfet
Construct frame localsplus in 3.12+ using our own simplified way rather than copypasting from CPython.
This is necessary for 3.13 since we can no longer generate frame `f_locals` before executing the interpreter frame.
We also enable this for 3.12 since the `f_locals` construction between 3.12 and 3.13 is the same, so we can test for correctness with 3.12.
This is also one of the first steps to completing https://github.com/pytorch/pytorch/issues/93753 - we will implement simplified f_locals generation of previous Python versions in the future.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129185
Approved by: https://github.com/jansel
**Summary**
Currently, users have 2 options to view the tracing data. The first is through console where colored text is used to help users read the information. The second is they can log the information to a text file to view the log, which is useful in instances where the log is too long to fit in the console. However, depending on the model complexity, these logs could go on for thousands of lines making it difficult for the user to find specific information. In order to fix this, I have added the functionality to convert the log into a JSON file, which will be used to create a tree view in a browser, allowing the user to collapse parts of the log that will not be useful to them. I have given the user the option to pass their own file path, but have a default one in the event that none is provided. The expected output of the beginning json file and the browser view for the MLP model are shown below:
<img width="542" alt="Screenshot 2024-07-02 at 3 40 41 PM" src="https://github.com/pytorch/pytorch/assets/50644008/b9570540-e1d2-4777-b643-db4801b60ed8">
<img width="777" alt="Screenshot 2024-07-02 at 3 41 43 PM" src="https://github.com/pytorch/pytorch/assets/50644008/9296e255-c3ae-48a4-8be7-4273f69ee178">
**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_json_dump
2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129994
Approved by: https://github.com/XilunWu
Summary:
use &= instead of |= since |= ignores incorrect scale/zp
change scale to use float comparison, instead of int comparison
Issue warning instead of error for backward compatibility: ex: P1204628034
Test Plan: see warning in: P1204628034
Reviewed By: jerryzh168
Differential Revision: D55699212
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123769
Approved by: https://github.com/jerryzh168
This PR:
* Sets a random seed before generating each sample for an OpInfo test. It does this by intercepting the sample input iterator via `TrackedInputIter`, optionally setting the seed to a test name specific seed before each iterator call (default is to set the seed).
* Some quick and dirty benchmarking shows (hopefully) negligible overhead from setting the random seed before each sample input generation. For a trivial (single assert) test that uses `@ops`:
* Uncovered a bunch of test issues:
* Test breakdown (>100 total)
* A lot of tolerance issues (tweaked tolerance values to fix)
* 1 broken OpInfo (`sample_inputs_masked_fill` was generating a sample of the wrong dtype)
* 3 actually broken semantics (for masked tensor; added xfails)
* 4 Jacobian mismatches (added xfails)
* 2 nan results (skip for now, need fixing)
* 3 results too far from reference result (add xfails)
* Skips MPS tests for now (there are so many failures!). Those will default to the old behavior.
**before (no seed setting):**
```
real 0m21.306s
user 0m19.053s
sys 0m5.192s
```
**after (with seed setting):**
```
real 0m21.905s
user 0m19.578s
sys 0m5.390s
```
* Utilizing the above for reproducible sample input generation, adds support for restricting the iterator to a single sample input. This is done via an env var `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX` and its usage is included in the repro command.
```
======================================================================
ERROR: test_bar_add_cuda_uint8 (__main__.TestFooCUDA.test_bar_add_cuda_uint8)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 971, in test_wrapper
return test(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/jbschlosser/branches/testing_updates/test/test_ops.py", line 2671, in test_bar
self.assertFalse(True)
AssertionError: True is not false
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 2816, in wrapper
method(*args, **kwargs)
File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 2816, in wrapper
method(*args, **kwargs)
File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 419, in instantiated_test
result = test(self, **param_kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_utils.py", line 1426, in wrapper
fn(*args, **kwargs)
File "/home/jbschlosser/branches/testing_updates/torch/testing/_internal/common_device_type.py", line 982, in test_wrapper
raise new_e from e
Exception: Caused by sample input at index 3: SampleInput(input=Tensor[size=(10, 5), device="cuda:0", dtype=torch.uint8], args=TensorList[Tensor[size=(), device="cuda:0", dtype=torch.uint8]], kwargs={}, broadcasts_input=False, name='')
To execute this test, run the following from the base repo dir:
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=3 python test/test_ops.py -k TestFooCUDA.test_bar_add_cuda_uint8
This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
----------------------------------------------------------------------
Ran 1 test in 0.037s
FAILED (errors=1)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128238
Approved by: https://github.com/janeyx99, https://github.com/justinchuby
If we have dynamic shapes, the heuristic in mixed_mm will cause a crash, because it cannot compare m, k and n to integer values. This PR makes it so that the heuristic only runs if we have static shapes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130081
Approved by: https://github.com/Chillee
Summary:
We have the cache to guarantee the `sym` is codegen only once, see the following code
```
def ensure_size_computed(self, sym: sympy.Symbol):
if isinstance(sym, sympy.Symbol) and symbol_is_type(sym, SymT.PRECOMPUTED_SIZE):
if sym in self.computed_sizes:
return
self.computed_sizes.add(sym)
expr = V.graph.sizevars.inv_precomputed_replacements[sym]
self.writeline(
f"{self.declare}{sym} = {self.expr_printer(expr)}{self.ending}"
)
```
However, we don't consider the case when same `sym`s need to be codegen in both conditions (true branch and false branch), which caused the issue of `undefined symbols`: P1441378833
To fix the issue, we use a stack to capture the state before doing the condition codegen and restore the state after doing the codegen
Test Plan:
TORCH_LOGS="+inductor" buck2 run mode/dev-nosan -c fbcode.nvcc_arch=h100 -c fbcode.enable_gpu_sections=true --config 'cxx.extra_cxxflags=-g1' -c fbcode.platform010_cuda_version=12 //scripts/hhh:repro_cond_torch_compile
PYTORCH_TEST_FBCODE=1 TORCH_COMPILE_DEBUG=1 buck2 run mode/opt -c=python.package_style=inplace -c fbcode.enable_gpu_sections=true -c fbcode.platform=platform010 -c fbcode.split-dwarf=true //caffe2/test/inductor:control_flow -- -r test_cond_control_flow_with_precomputed_size
Differential Revision: D58973730
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129492
Approved by: https://github.com/aakhundov
This PR adds deduplication and CSE for runtime asserts. Existing size computation in the graph is CSE'd along with added runtime asserts, and redundant asserts are removed. Shape calls on intermediate tensors are also turned into compute on input sizes if possible, allowing intermediate tensors to be freed earlier. For example:
```
z = torch.cat([x, x], dim=0) # 2*s0
w = z.repeat(y.shape[0]) # 2*s0*s1
_w = w.shape[0]
# something with _w ...
# turns into ->
s0 = x.shape[0]
s1 = y.shape[0]
_w0 = 2 * s0
_w = _w0 * s1
```
Additionally, constrain_range calls are deduplicated. Single-symbol bound checks for unbacked symbols (e.g. u0 >= 0, u0 <= 5) and sym_constrain_range.default calls are also removed, since they accumulate range info in the ShapeEnv, and are replaced with two _assert_scalar.default calls that check the min/max bounds. For example:
```
torch.sym_constrain_range_for_size(n, min=2, max=16)
torch.sym_constrain_range(n, min=4, max=20)
torch._check(n >= 0)
torch._check(n >= 3)
torch._check(n <= 14)
# turns into
torch.sym_constrain_range_for_size(n)
torch._check(n >= 4)
torch._check(n <= 14)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128599
Approved by: https://github.com/ezyang
**Summary**
Support more than 1 Local Buffer in an outer loop fused node and also the case when multi global buffers sharing usage of same local buffer.
**TestPlan**
```
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_two_local_buffers_in_outer_loop_fusion
python -u -m pytest -s -v inductor/test_cpu_repro.py -k test_share_local_buffers_in_outer_loop_fusion
```
**Next Step**
- [✓] Support more than one Local Buffer/Global Buffer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129121
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #126967
Compiling the `create_block_mask` function allows us to "materialize" extremely large masks. This would have been a 1 *trillion* element tensor if fully materialized.
```
print(do_bench(lambda: create_block_mask(causal_mask, 1, 1, 2**20, 2**20, _compiled=True)))
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130106
Approved by: https://github.com/yanboliang
ghstack dependencies: #130160
Summary:
Multiple threads can be calling the alloc_trace std::vector, which will result in SIGSEGVs when objects are double freed, accessed after free, or two inserts at the same time.
We need to lock when inserting, accessing or removing TraceEntry in alloc_trace.
Test Plan:
This is a rare crash, which was exposed when we introduced recordAnnotations, which saves record_function annotations into the snapshot files. Saving a lot of annotations can trigger this bug. Here are a few jobs that crashed before, and this diff fixes.
Differential Revision: D59380507
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130180
Approved by: https://github.com/eqy, https://github.com/kit1980
There is one huge problem this fixes: today, sympify(symint)
produces a float(!!) because Sympy attempts to see if you can
coerce the symint to float in sympify and of course this works on
SymInt.
However, this also has another nontrivial effect: anywhere in Inductor
where sympy expressions are passed around, it is also valid to pass
around a SymInt now. I'm ambivalent about this: it's currently a
mistake to be passing around a SymInt when a sympy expression is
expected. But maybe this is fine?
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130166
Approved by: https://github.com/yf225
This PR adds deduplication and CSE for runtime asserts. Existing size computation in the graph is CSE'd along with added runtime asserts, and redundant asserts are removed. Shape calls on intermediate tensors are also turned into compute on input sizes if possible, allowing intermediate tensors to be freed earlier. For example:
```
z = torch.cat([x, x], dim=0) # 2*s0
w = z.repeat(y.shape[0]) # 2*s0*s1
_w = w.shape[0]
# something with _w ...
# turns into ->
s0 = x.shape[0]
s1 = y.shape[0]
_w0 = 2 * s0
_w = _w0 * s1
```
Additionally, constrain_range calls are deduplicated. Single-symbol bound checks for unbacked symbols (e.g. u0 >= 0, u0 <= 5) and sym_constrain_range.default calls are also removed, since they accumulate range info in the ShapeEnv, and are replaced with two _assert_scalar.default calls that check the min/max bounds. For example:
```
torch.sym_constrain_range_for_size(n, min=2, max=16)
torch.sym_constrain_range(n, min=4, max=20)
torch._check(n >= 0)
torch._check(n >= 3)
torch._check(n <= 14)
# turns into
torch.sym_constrain_range_for_size(n)
torch._check(n >= 4)
torch._check(n <= 14)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128599
Approved by: https://github.com/ezyang
Summary:
## Context
TL;DR: aot_export failed for SDPA memory efficient backend when using `inference_mode`
The CMF AOTI lowering started to fail on the trunk. We have the script (https://fburl.com/code/kfk64i5s) to reproduce the issue quickly (log: P1469307638). By bisecting the stack, we found the issue starting from the D58701607
## Root Cause
In the `inference_mode()`,
the `aten::scaled_dot_product_attention` was not decomposed before the `functionalization` and the op it-self was an out-place op, so the `functionalization` doesn't make change and then was decomposed into `masked_fill_.`, then decomposed to the `copy_`
So it's `aten::sdpa` --- (functionalization) ---> `aten::sdpa` --- (decompose) ---> `masked_fill_` --- (decompose) ---> `copy_` ---> failure
In the `torch.no_grad()`,
`aten::sdpa` was decomposed before `functionalization`, so the story is
`aten::sdpa` --- (decompose) ---> `masked_fill_` --- (functionalization) ---> `masked_fill` --- (decompose) ---> `out-place ops` ---> good
## How to fix
Long-term:
The issue was tracked in the ticket (https://github.com/pytorch/pytorch/issues/129418). The long-term fix could be we do one more round of `functionalization` after the `decompose`, like
`aten::sdpa` --- (functionalization) ---> `aten::sdpa` --- (decompose) ---> `masked_fill_` --- (functionalization) ---> `masked_fill` ---> good
Short-term:
It would be a big change I guess. To unblock the production use-case, I marked the `aten::sdpa` should be decomposed in this diff
Test Plan:
local repro works now
buck run mode/opt scripts/sijiac/prototypes:sdpa_aoti
Differential Revision: D59385876
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130164
Approved by: https://github.com/zou3519
The real root cause of the issue is that the current stream on a given CUDA device may be the legacy default stream, which doesn't seem to have a device associated with it. If the current CUDA device as reported by `cudaGetDevice` doesn't match the device of the intended legacy default stream's device (this happens if a user is running distributed code without e.g., `torch.cuda.set_device(mylocalrank)`) then the stream synchronize will not have the intended effect. Previous stream sync code here correctly inserted a `DeviceGuard` to ensure that this legacy-default-stream-sync with a mismatched current device didn't happen, but the check is elided here. The simplest fix is to just use the `CUDAStream` wrapper's `synchronize()` call, which already correctly uses a `DeviceGuard` internally:
a21d4363d2/c10/cuda/CUDAStream.h (L132)
OUTDATED below:
The current behavior of `barrier`'s `synchronizeInternal` seems to be a bit counterintuitive, as it is synchronizing on a device's current `CUDAStream` rather than the one used for the actual `allreduce` (the `ncclStream`). In practice this results in a script like the following:
```
import logging
import os
import time
import torch
import torch.distributed as dist
def main():
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")
backend = 'nccl'
group = torch.distributed.init_process_group(backend=backend)
rank = torch.distributed.get_rank(group=group)
for i in range(4):
time.sleep(rank)
logging.info(f"Rank {rank}: enter barrier {i}")
dist.barrier()
logging.info(f"Rank {rank}: exit barrier {i}")
dist.destroy_process_group()
if __name__ == "__main__":
main()
```
appearing to show that ranks can exit barrier(s) before other ranks have entered. Note that the device-side ordering should still be correct in this case, but the host is free to run ahead.
The issue can be worked-around by adding a `torch.cuda.synchronize(rank)` after the `barrier`, but this seems to be against the spirit of the stream synchronization which deliberately tried to avoid a device synchronization.
This PR does a sync on the `allreduce`'s stream so that a device synchronization is not needed to align the host's output with the device.
CC @wujingyue @Aidyn-A @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129908
Approved by: https://github.com/kwen2501
Removes extraneous .a, .so, and .py files from the split build. From here we can also clean up the builder script which produces the binary to do this. That pr is https://github.com/pytorch/builder/pull/1912
Verification:
The built wheel with BUILD_LIBTORCH_WHL=1 has the following files only (with .a, .so, and .py extensions)
```
sahanp@devgpu086 ~/p/dist (viable/strict)> pwd (pytorch-3.10)
/home/sahanp/pytorch/dist
sahanp@devgpu086 ~/p/dist (viable/strict)> find . -type f \( -name "*.py" -o -name "*.a" -o -name "*.so" \) (pytorch-3.10)
./torch/__init__.py
./torch/lib/libbackend_with_compiler.so
./torch/lib/libc10.so
./torch/lib/libjitbackend_test.so
./torch/lib/libtorch.so
./torch/lib/libtorch_cpu.so
./torch/lib/libtorch_global_deps.so
./torch/lib/libtorchbind_test.so
sahanp@devgpu086 ~/p/dist (viable/strict)>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130053
Approved by: https://github.com/atalman
Summary: The explain function does a conversion dry run to provide feedback on which operators are not supported / fail the conversion to the users.
Test Plan: * `pytest test/export/test_converter.py`
Differential Revision: D59251934
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129968
Approved by: https://github.com/angelayi
To avoid outage on HUD, I plan to migrate perf stats to dynamoDB as follows:
1. Upload perf stats to both Rockset and dynamoDB
2. Copy all the existing content from Rockset to dynamoDB
3. Create new Rockset tables to map to dynamoDB
4. Switch HUD to use the new Rockset tables (temporarily)
5. Delete the existing tables
This depends on https://github.com/pytorch-labs/pytorch-gha-infra/pull/422
### Testing
```
python3 -m tools.stats.upload_dynamo_perf_stats --workflow-run-id 9770217910 --workflow-run-attempt 1 --repo "pytorch/pytorch" --head-branch "gh/shunting314/162/head" --rockset-collection torch_dynamo_perf_stats --rockset-workspace inductor --dynamodb-table torchci-dynamo-perf-stats --match-filename "^inductor_"
...
Writing 1607 documents to DynamoDB torchci-dynamo-perf-stats
```
And confirm the same number of documents is on the table

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129544
Approved by: https://github.com/clee2000
Summary:
Title. This way, both FXGraphCache and AOTAutogradCache use the same torch_key, and we don't need to only hash specific files.
There's an argument to be made to only hash *.py and *.cpp files. Maybe we can fix the glob to do that.
We use a buck_filegroup because otherwise $SRCs gets too large. By using `$(location :torch_sources)`, we make the genrule implicitly depend on all files globbed by torch_sources.
Test Plan:
Unit tests still pass on OSS
For torch_key:
```
buck2 build caffe2:src_hash.txt -v 2 --show-output
```
See the output, then make any change to any torch file. See that the hash changes.
Reviewed By: oulgen
Differential Revision: D58875785
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129250
Approved by: https://github.com/oulgen
This model's accuracy test recently regressed. I have a quite smooth debugging process to figure out the cause. So I'd like to write it down just in case it can be helpful.
Clicking the model name beit_base_patch16_224 on the dashboard, we are able to see the pass status of the model in e.g. the past month. For this model, we can see that it starts to fail on June 08:
<img width="1118" alt="Screenshot 2024-07-02 at 5 36 35 PM" src="https://github.com/pytorch/pytorch/assets/52589240/32f27ccd-3ec7-4431-88b3-febeff831f8e">
What's nice is the dashboard shows the nightly commits for each run.
Running
```
git log --oneline a448b3ae9537c0ae233fb9199a4a221fdffbb..0e6c204642a571d5a7cd60be0caeb9b50faca030 torch/_inductor/
```
Gives us the list of Inductor PRs between the good and bad commit: https://gist.github.com/shunting314/eb57965688fc9e1746fcfa9b7b6b02df
Roughly looking thru the PRs, I feel
```
ffc202a1b91 Added remove_noop_ops to joint_graph_passes (#124451)
```
can change numerics so I disable it locally by this one line change: https://gist.github.com/shunting314/13aec768bda986056d0fb40dce53396e . And then the accuracy test pass. (Command: time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only beit_base_patch16_224 )
Horace's PR (https://github.com/pytorch/pytorch/pull/124451) itself is valid. It removes no-op ops in joint-graph. I think maybe the graph get changed and cause the partitioner do different recomputation decisions. That can cause some numerics change.
Since this is not a real issue, I'll raise the tolerance to make it pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130005
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #129996, #129941
This PR batch the fix for a few accuracy failures issues during training by raising tolerance. I do that only for models that I think it fails not due to real issue.
## sebotnet33ts_256
The accuracy test for this model start to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256).
I can not repro locally, but from the log from the dashboard:
```
RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
raising the tolerance should fix it.
## DebertaForQuestionAnswering
This model fails accuracy test on the dashboard only in max-autotune mode. I can not repro locally by command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering
```
From error message on the dashboard:
```
RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
```
0.02 tolerance should suppress this error.
## gluon_inception_v3
This model fail on the dashboard in max-autotune mode. I can not repro locally by command
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3
```
From error message on the dashboard
```
RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var
```
raising tolerance should suppress this error.
# mobilenetv3_large_100
Fail in MA model. I can not repro locally by command
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only
```
The error message on the dashboard is
```
RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
The tensor is so small that the noise can be high. I use larger multiplier for smaller tensor in torch._dynamo.utils.same.
# yolov3
Fail on dashboard with error
```
Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
Fix it by using a larger multiplier for smaller tensors and raising the tolereance.
# timm_efficientdet
Fail on the dashboard with error
```
E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
But I can not repro locally with command
```
time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet --training
```
Raise the tolerance should fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941
Approved by: https://github.com/jansel
ghstack dependencies: #129996
I'm debugging the accuracy failure for training vision_maskrcnn.
Unfortunately I could not succeed to run it locally (I've check pined commits for torchbenchmars/torchvision are correct, and reinstalled torchbenchmark for mask_rcnn). I get this error:
```
eager run fail: AssertionError: targets should not be none when in training mode
```
(Command: time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --training --only vision_maskrcnn )
But look at the log from the dashboard
```
E0623 19:17:59.085000 140114670171328 torch/_dynamo/utils.py:1468] RMSE (res-fp64): nan, (ref-fp64): nan and shape=torch.Size([1024, 256, 1, 1]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
We can see both the reference number and the pt2 number are NaN. I change torch._dynamo.utils.same to return true if both RMSE values are NaN.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129996
Approved by: https://github.com/jansel
The real root cause of the issue is that the current stream on a given CUDA device may be the legacy default stream, which doesn't seem to have a device associated with it. If the current CUDA device as reported by `cudaGetDevice` doesn't match the device of the intended legacy default stream's device (this happens if a user is running distributed code without e.g., `torch.cuda.set_device(mylocalrank)`) then the stream synchronize will not have the intended effect. Previous stream sync code here correctly inserted a `DeviceGuard` to ensure that this legacy-default-stream-sync with a mismatched current device didn't happen, but the check is elided here. The simplest fix is to just use the `CUDAStream` wrapper's `synchronize()` call, which already correctly uses a `DeviceGuard` internally:
a21d4363d2/c10/cuda/CUDAStream.h (L132)
OUTDATED below:
The current behavior of `barrier`'s `synchronizeInternal` seems to be a bit counterintuitive, as it is synchronizing on a device's current `CUDAStream` rather than the one used for the actual `allreduce` (the `ncclStream`). In practice this results in a script like the following:
```
import logging
import os
import time
import torch
import torch.distributed as dist
def main():
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(message)s")
backend = 'nccl'
group = torch.distributed.init_process_group(backend=backend)
rank = torch.distributed.get_rank(group=group)
for i in range(4):
time.sleep(rank)
logging.info(f"Rank {rank}: enter barrier {i}")
dist.barrier()
logging.info(f"Rank {rank}: exit barrier {i}")
dist.destroy_process_group()
if __name__ == "__main__":
main()
```
appearing to show that ranks can exit barrier(s) before other ranks have entered. Note that the device-side ordering should still be correct in this case, but the host is free to run ahead.
The issue can be worked-around by adding a `torch.cuda.synchronize(rank)` after the `barrier`, but this seems to be against the spirit of the stream synchronization which deliberately tried to avoid a device synchronization.
This PR does a sync on the `allreduce`'s stream so that a device synchronization is not needed to align the host's output with the device.
CC @wujingyue @Aidyn-A @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129908
Approved by: https://github.com/kwen2501
Summary: This is to forward fix D59140215 from a PyTorch open source contributor T194074371. On PyTorch side, we need to use isinstance instead of type when checking for nn.Module. This is the same way get_submodule is currently implemented.
Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//dper3/dper3/core/tests:module_test`
Differential Revision: D59254638
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130075
Approved by: https://github.com/mikaylagawarecki
# Changes
* small fix in stage error message
* Move `format_pipeline_order` and `_validate_pipeline_order` out of `test_schedule.py` into `schedules.py`.
* Wrap the execution runtime in a try-except which on error will log the timestep and schedule plan before re-raising the exception.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129369
Approved by: https://github.com/wconstab
ghstack dependencies: #129368
188 new ATen operators/variants are added in the pin update, involving eager and torch.compile usage on HuggingFace, TIMM and TorchBench models. 16 new unit tests ported to enhance functionality coverage. Aligned source file directory structure with ATen native. Fixed corner case failures in aten::resize, aten::index_add and aten::index_put.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129353
Approved by: https://github.com/EikanWang
**Summary**
As the CommModeFeature example file grew, there were to many LOC that was repeated for setting up the models used. I created two functions, one to handle MLP and MLPStacked models and the other for transformer models. The output of the examples will not have changed.
**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_distributed_sharding_display
2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLPStacked_distributed_sharding_display
3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing
4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing
5. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing
6. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129613
Approved by: https://github.com/XilunWu
ghstack dependencies: #129602
**Summary**
Currently, comm_mode only allowed users to differentiate between forward and backward passes at the operational level. I modified the code so that users can now see the collective counts for the passes at a module level. I decided to slightly change how the output was formatted making it easier to differentiate between a collective count and an operation. I have designed the operational trace table function so that in the future, a user can use command line arguments in order to determine the level of information they want to display instead of having two similar functions. Finally, I have updated the new output and test cases for comm_mode example and test files. The expected output for the first 3 examples are shown below:
<img width="320" alt="Screenshot 2024-06-26 at 2 30 25 PM" src="https://github.com/pytorch/pytorch/assets/50644008/b8e88075-a07f-4e84-b728-a08959df3661">
<img width="497" alt="Screenshot 2024-06-26 at 2 29 15 PM" src="https://github.com/pytorch/pytorch/assets/50644008/5ef4bea7-1355-4089-bfb0-c7e3f588ac77">
<img width="615" alt="Screenshot 2024-06-26 at 2 31 05 PM" src="https://github.com/pytorch/pytorch/assets/50644008/feacae51-76f7-403b-b6cd-dd15e981770e">
**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing
2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing
3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing
4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing
5. pytest test/distributed/_tensor/debug/test_comm_mode_features.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129602
Approved by: https://github.com/XilunWu, https://github.com/wz337
This moves a bunch of runtime inspection of the `output_info` for alias handling into the construction of fixed output handlers that are created during compilation and captured by the runtime wrapper.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128188
Approved by: https://github.com/bdhirsh
This model's accuracy test recently regressed. I have a quite smooth debugging process to figure out the cause. So I'd like to write it down just in case it can be helpful.
Clicking the model name beit_base_patch16_224 on the dashboard, we are able to see the pass status of the model in e.g. the past month. For this model, we can see that it starts to fail on June 08:
<img width="1118" alt="Screenshot 2024-07-02 at 5 36 35 PM" src="https://github.com/pytorch/pytorch/assets/52589240/32f27ccd-3ec7-4431-88b3-febeff831f8e">
What's nice is the dashboard shows the nightly commits for each run.
Running
```
git log --oneline a448b3ae9537c0ae233fb9199a4a221fdffbb..0e6c204642a571d5a7cd60be0caeb9b50faca030 torch/_inductor/
```
Gives us the list of Inductor PRs between the good and bad commit: https://gist.github.com/shunting314/eb57965688fc9e1746fcfa9b7b6b02df
Roughly looking thru the PRs, I feel
```
ffc202a1b91 Added remove_noop_ops to joint_graph_passes (#124451)
```
can change numerics so I disable it locally by this one line change: https://gist.github.com/shunting314/13aec768bda986056d0fb40dce53396e . And then the accuracy test pass. (Command: time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only beit_base_patch16_224 )
Horace's PR (https://github.com/pytorch/pytorch/pull/124451) itself is valid. It removes no-op ops in joint-graph. I think maybe the graph get changed and cause the partitioner do different recomputation decisions. That can cause some numerics change.
Since this is not a real issue, I'll raise the tolerance to make it pass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130005
Approved by: https://github.com/eellison, https://github.com/jansel
ghstack dependencies: #129996, #129941
This PR batch the fix for a few accuracy failures issues during training by raising tolerance. I do that only for models that I think it fails not due to real issue.
## sebotnet33ts_256
The accuracy test for this model start to fail around June 05 [link](https://hud.pytorch.org/benchmark/timm_models/inductor_with_cudagraphs?dashboard=torchinductor&startTime=Sun%2C%2002%20Jun%202024%2007%3A19%3A38%20GMT&stopTime=Tue%2C%2002%20Jul%202024%2007%3A19%3A38%20GMT&granularity=day&mode=training&dtype=amp&lBranch=main&lCommit=04a0d856207d83c2031e4b9cb6825ba3e0092850&rBranch=main&rCommit=e62925930f6a62f6aeeb1fe1a661a9bd3352b53d&model=sebotnet33ts_256).
I can not repro locally, but from the log from the dashboard:
```
RMSE (res-fp64): 0.09441, (ref-fp64): 0.02971 and shape=torch.Size([1536]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
raising the tolerance should fix it.
## DebertaForQuestionAnswering
This model fails accuracy test on the dashboard only in max-autotune mode. I can not repro locally by command:
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/huggingface.py --accuracy --no-translation-validation --training --amp --backend inductor --device cuda --only DebertaForQuestionAnswering
```
From error message on the dashboard:
```
RMSE (res-fp64): 0.01803, (ref-fp64): 0.00537 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
```
0.02 tolerance should suppress this error.
## gluon_inception_v3
This model fail on the dashboard in max-autotune mode. I can not repro locally by command
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only gluon_inception_v3
```
From error message on the dashboard
```
RMSE (res-fp64): 0.02798, (ref-fp64): 0.00730 and shape=torch.Size([384]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.010000
Accuracy failed for key name Mixed_7c.branch3x3dbl_3a.bn.running_var
```
raising tolerance should suppress this error.
# mobilenetv3_large_100
Fail in MA model. I can not repro locally by command
```
TORCHINDUCTOR_MAX_AUTOTUNE=1 time python benchmarks/dynamo/timm_models.py --accuracy --training --amp --backend inductor --disable-cudagraphs --device cuda --only
```
The error message on the dashboard is
```
RMSE (res-fp64): 0.29754, (ref-fp64): 0.05205 and shape=torch.Size([]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.040000
```
The tensor is so small that the noise can be high. I use larger multiplier for smaller tensor in torch._dynamo.utils.same.
# yolov3
Fail on dashboard with error
```
Error on the dashboard: RMSE (res-fp64): 0.01278, (ref-fp64): 0.00246 and shape=torch.Size([256]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
Fix it by using a larger multiplier for smaller tensors and raising the tolereance.
# timm_efficientdet
Fail on the dashboard with error
```
E0623 18:37:43.638000 139924418725056 torch/_dynamo/utils.py:1468] RMSE (res-fp64): 0.00096, (ref-fp64): 0.00009 and shape=torch.Size([2]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
But I can not repro locally with command
```
time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --only timm_efficientdet --training
```
Raise the tolerance should fix.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129941
Approved by: https://github.com/jansel
ghstack dependencies: #129996
I'm debugging the accuracy failure for training vision_maskrcnn.
Unfortunately I could not succeed to run it locally (I've check pined commits for torchbenchmars/torchvision are correct, and reinstalled torchbenchmark for mask_rcnn). I get this error:
```
eager run fail: AssertionError: targets should not be none when in training mode
```
(Command: time python benchmarks/dynamo/torchbench.py --backend inductor --amp --performance --training --only vision_maskrcnn )
But look at the log from the dashboard
```
E0623 19:17:59.085000 140114670171328 torch/_dynamo/utils.py:1468] RMSE (res-fp64): nan, (ref-fp64): nan and shape=torch.Size([1024, 256, 1, 1]). res.dtype: torch.float32, multiplier: 3.000000, tol: 0.001000
```
We can see both the reference number and the pt2 number are NaN. I change torch._dynamo.utils.same to return true if both RMSE values are NaN.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129996
Approved by: https://github.com/jansel
Fixes#129389
If a user registers a device-specific implementation for an operator that accepts no Tensors, then we require the operator to have a "device: torch.device argument"
We switch on the device argument to select the correct backend to dispatch to.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129978
Approved by: https://github.com/zou3519
Summary:
Previously we store edge id in numeric_debug_handle to support operator fusion and operator decomposition throughout the stack,
but according to feedback from customers, people prefer the simpler per-node id, and they are fine with not having the additional
support for numerical debugging for inputs and willing to hack around to achieve this.
This PR changes the structure of numeric_debug_handle to store unique_id for each node instead.
e.g.
graph:
```
node = op(input_node, weight_node)
```
Before:
```
node.meta[NUMERIC_DEBUG_HANDLE_KEY] = {input_node: id1, weight_node: id2, "output": id3}
```
After:
```
node.meta[NUMERIC_DEBUG_HANDLE_KEY] = id1
```
Test Plan:
python test/test_quantization.py -k TestGenerateNumericDebugHandle
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129811
Approved by: https://github.com/tarun292
This PR modifies `_embedding_bag_backward` item inside _native_functions.yaml_, so that it
dispatches to CPU and CUDA directly, instead of `CompositeImplicitAutograd`.
*Context:* PyTorch operations that have the `CompositeImplicitAutograd` dispatch do not
allow third party backends (e.g. XLA) to modify its implementation, since this dispatch
key has higher priority. When calling `_embedding_bag_backward` operation using XLA, a
dispatch error will be thrown, since PyTorch/XLA doesn't support sparse tensors.
*Problem:* `_embedding_bag_backward` has a `sparse` parameter that controls whether the
operation should return a sparse or dense tensor. However, at the moment, PyTorch/XLA does
not support sparse tensors. In order to fallback that execution to dense, i.e. change the
flag at runtime, we need to be able to modify its implementation.
*Solution:* we have changed the dispatch of `_embedding_bag_backward` to CPU and CUDA,
which allowed us to introduce our own kernel for it.
Additionally, this PR refactored the representation of its mode from constant integers
into an enum class. It also introduces two additional operators: `int == EmbeddingBagMode`
and `int != EmbeddingBagMode`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129691
Approved by: https://github.com/lezcano
More context [here](https://github.com/pytorch/pytorch/issues/129682#issuecomment-2195463838), but this change was enough to get this AOTI + float8 repro running for me (below).
Previously, it would fail an assertion [here](https://github.com/pytorch/pytorch/blob/main/torch/_meta_registrations.py#L5387) at inductor lowering time. It looks like during lowering, we were supposed to pass `param.transpose(1, 0)` as the second argument to the scaled_mm kernel. But in the inductor IR, this object is a `ReinterpretView` with `get_name()` equal to one of the param constants, so we would end up passing the constant directly into the kernel, instead of performing the view first.
I'm not totally sure if this is the right place to make the change, so interested in any thoughts from inductor folks (cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @peterbell10 @ipiszy @yf225 @chenyang78 @kadeng @muchulee8 @ColinPeppler @amjames @desertfire @chauhang @eellison )
```
import torch
from torch.export import export
from torch.export._trace import _export
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD 3-Clause license found in the
# LICENSE file in the root directory of this source tree.
import copy
import io
import random
import unittest
import pytest
import torch
import torch.nn as nn
import torch.nn.functional as F
from float8_experimental.float8_dynamic_linear import Float8DynamicLinear
from float8_experimental.float8_linear_utils import swap_linear_with_float8_linear
from float8_experimental.float8_tensor import Float8Tensor
from float8_experimental.float8_utils import compute_error
random.seed(0)
torch.manual_seed(0)
is_H100 = torch.cuda.is_available() and torch.cuda.get_device_capability() >= (9, 0)
import torch.nn.utils.parametrize as parametrize
# NOTE: we should upstream this directly into export and make it more automatic!
class UnwrapTensorSubclass(torch.nn.Module):
def forward(self, *tensors):
todo = list(tensors)
for tp, meta, inner_tensors in reversed(self.rebuild_stack):
nb_tensor = len(inner_tensors)
inner_tensors = {a: b for a, b in zip(inner_tensors, todo[-nb_tensor:])}
todo = todo[nb_tensor:]
rebuilt = tp.__tensor_unflatten__(inner_tensors, meta, None, None)
todo.append(rebuilt)
assert len(todo) == 1
return todo[0]
def right_inverse(self, tensor):
assert type(tensor) is not torch.Tensor
rebuild_stack = []
plain_tensors = []
todo = [tensor]
while todo:
obj = todo.pop()
inner_tensors, metadata = obj.__tensor_flatten__()
rebuild_stack.append((type(obj), metadata, inner_tensors))
for attr_name in inner_tensors:
val = getattr(obj, attr_name)
if type(val) is torch.Tensor:
plain_tensors.append(val)
else:
assert isinstance(val, torch.Tensor)
todo.append(val)
self.rebuild_stack = rebuild_stack
return plain_tensors
def unwrap_tensor_subclass(model, filter_fn=None):
for name, child in model.named_children():
if (
isinstance(child, Float8DynamicLinear) and
hasattr(child, "weight") and
type(child.weight) is not torch.Tensor and
isinstance(child.weight, torch.Tensor)
):
parametrize.register_parametrization(child, "weight", UnwrapTensorSubclass())
unwrap_tensor_subclass(child)
return model
class FeedForward(nn.Module):
def __init__(self) -> None:
super().__init__()
self.w1 = nn.Linear(4096, 14336, bias=False)
self.w3 = nn.Linear(4096, 14336, bias=False)
self.w2 = nn.Linear(14336, 4096, bias=False)
def forward(self, x: torch.Tensor) -> torch.Tensor:
return self.w2(F.silu(self.w1(x)) * self.w3(x))
def reset_parameters(self):
for m in self.modules():
if isinstance(m, nn.Linear):
m.reset_parameters()
export_model = FeedForward().to("cuda")
swap_linear_with_float8_linear(
export_model,
Float8DynamicLinear,
from_float_kwargs={"pre_quantize_weight": True},
)
export_model = unwrap_tensor_subclass(export_model)
batch_size = 4
num_tokens = 1024
embedding_dim = 4096
input_tensor = torch.randn(
batch_size, num_tokens, embedding_dim, device="cuda", dtype=torch.float32
)
example_args = (input_tensor,)
# NOTE: this breaks unless we use strict=False, pre_dispatch=False!
exported_program: torch.export.ExportedProgram = _export(
export_model,
example_args,
strict=False,
pre_dispatch=False,
)
with torch.no_grad():
so_path = torch._inductor.aot_compile(exported_program.module(), example_args)
print(so_path)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129688
Approved by: https://github.com/eellison
Ops in torch, torch.functional, and torch.nn.functional are cache safe by default (at least, based on my cursory audit of the ops). This fixes a few tests that use these ops with the cache.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128329
Approved by: https://github.com/bdhirsh
The default value of `rot90()` in the schema registry is `[0,1]` because we split the function schema by `", "`. There should be no space after `,` in `[0,1]`.
5c9d5272e4/aten/src/ATen/native/native_functions.yaml (L6120-L6126)
Then the the default value is formatted to `(0,1)` in `pyi` files. This PR manually adds an extra whitespace when rerendering the default value to a string.
```python
", ".join(string.split(","))
```
```python
# before
def rot90(input: Tensor, k: _int = 1, dims: _size = (0,1)) -> Tensor: ...
# after
def rot90(input: Tensor, k: _int = 1, dims: _size = (0, 1)) -> Tensor: ...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129884
Approved by: https://github.com/ezyang
Summary: The test is from D59181111, but I couldn't figure out a way to make it pass on FBCODE because loading PyTorch C++ extension requires Ninja which is not going to work with BUCK
Test Plan: `buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test:transformers`
Differential Revision: D59304327
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129997
Approved by: https://github.com/drisspg
Fixes#128510.
https://github.com/pytorch/pytorch/pull/124451 makes LayoutLMForSequenceClassification hit the SDPA pattern 1 and then encounter the accuracy issue. The issue only happens with BF16 inference single thread. This PR tends to increase the model tolerance and make the check pass. Note that even the math-version SDPA could have the issue because of some small implementation diff.
The test log:
Single thread
```
correct_result: SequenceClassifierOutput(loss=tensor(0.5998), logits=tensor([[0.3301, 0.1338]], dtype=torch.bfloat16), hidden_states=None, attentions=None)
new_result: SequenceClassifierOutput(loss=tensor(0.6016), logits=tensor([[0.3281, 0.1357]], dtype=torch.bfloat16), hidden_states=None, attentions=None)
E0627 01:09:16.762789 140281313759104 torch/_dynamo/utils.py:1476] RMSE (res-fp64): 0.00151, (ref-fp64): 0.00046 and shape=torch.Size([1, 2]). res.dtype: torch.bfloat16, multiplier: 3.000000, tol: 0.001000
E0627 01:09:16.762972 140281313759104 torch/_dynamo/utils.py:1390] Accuracy failed for key name logits
fail_accuracy
```
Multiple threads
```
correct_result: SequenceClassifierOutput(loss=tensor(0.6007), logits=tensor([[0.3301, 0.1357]], dtype=torch.bfloat16), hidden_states=None, attentions=None)
new_result: SequenceClassifierOutput(loss=tensor(0.6016), logits=tensor([[0.3281, 0.1357]], dtype=torch.bfloat16), hidden_states=None, attentions=None)
pass
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129728
Approved by: https://github.com/jgong5, https://github.com/jansel
This does a round trip request on socket connect -- this allows for detecting connection resets etc and retrying before the non-retryable application requests are sent.
This adds support for PING to both the libuv and legacy backend.
Example error:
```
[trainer85612|12]:W0701 13:41:43.421574 4776 TCPStore.cpp:182] [c10d] recvValue failed on SocketImpl(fd=24, ...): Connection reset by peer
[trainer85612|12]:Exception raised from recvBytes at /mnt/code/pytorch/torch/csrc/distributed/c10d/Utils.hpp:669 (most recent call first):
...
[trainer85612|12]:#9 c10d::TCPStore::incrementValueBy(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, long) from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84809637
[trainer85612|12]:#10 c10d::TCPStore::waitForWorkers() from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84812868
[trainer85612|12]:#11 c10d::TCPStore::TCPStore(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, c10d::TCPStoreOptions const&) from /packages/.../conda/lib/python3.10/site-packages/torch/lib/libtorch_cpu.so:84814775
```
Test plan:
```
python test/distributed/test_store.py -v
```
```
tristanr@devvm4382 ~/pytorch (d4l3k/tcpstore_ping)> python ~/pt_tests/tcpstore_large_test.py
starting pool
started 90000
started 30000
started 70000
started 20000
started 80000
started 60000
started 0
[W702 16:16:25.301681870 TCPStore.cpp:343] [c10d] Starting store with 100000 workers but somaxconn is 4096.This might cause instability during bootstrap, consider increasing it.
init 20000
set 20000
init 80000
set 80000
init 70000
set 70000
init 60000
set 60000
init 30000
set 30000
init 90000
set 90000
started 40000
init 40000
set 40000
started 50000
init 50000
set 50000
started 10000
init 10000
set 10000
init 0
set 0
run finished 617.2992351055145
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129985
Approved by: https://github.com/rsdcastro, https://github.com/kurman
Rerun the failing test singly with the env var set. If it succeeds, start a new process without the cpp stack traces env var
We don't want to waste time generating these if we don't have to
They can also show up in assertion errors, which may cause unexpected failures if a test wants to check these
Adds new --rs (run single) to be used the same way --scs and --sc are. It will only run the single test in the step current file
https://hud.pytorch.org/pytorch/pytorch/pull/129004?sha=2c349d3557d399020bf1f6a8b7045e2e4957ba46 has some examples of logs
In the above:
* test_checkpoint_valid failed, then passed in another subprocess. The testing continued in a different new subprocess from the test right after it (test_checkpointing_without_reentrant_early_free)
* test_format_traceback_short failed consistently, but it continued to run because keep-going was set
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129004
Approved by: https://github.com/PaliC
Summary:
1. add one more model lib dep.
2. add error message when torchscript failed to find a class in python compilation unit.
Test Plan: CI
Reviewed By: jingsh
Differential Revision: D59243250
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129897
Approved by: https://github.com/jingsh
Previously each mutation was represented by a `MutationOutput` operation which
was a new scheduler node that must be scheduled immediately afterwards.
Now we have a single scheduler node, which produces mutiple `MutationOutput`
buffers as its output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129325
Approved by: https://github.com/lezcano
ghstack dependencies: #128893
Currently a buffer represents both a tensor with physical storage and a
computation that produces the tensor as a result.
This PR attempts to split these into two different concepts in the scheduler.
This should allow us to have multiple outputs from a single operation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128893
Approved by: https://github.com/lezcano
I run into this a lot. I can imagine that it would look opaque to users,
so made it more friendly
Old error message: "ValueError: infer_schema(func): Return has unsupported type <class 'inspect._empty'>."
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129896
Approved by: https://github.com/yushangdi
The revert of #127199 seems to surface an additional failure on A100---small tolerance bump to account for this.
I did find what appears to be a race condition in the one of the kernels used in this workload but I'm not sure it's related here...
CC @nWEIdia
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129902
Approved by: https://github.com/ezyang
Summary:
The inputs to grid function are varying argument, it can be one number, two numbers, or three numbers. The current implementation captured it as a tuple. For example "grid((16,))". The fix is to change it to varying number of elements. In the previous example, it is changed to "grid(16,)".
PARAM et-replay code will be modified to reflect this change in a following up DIFF.
Test Plan: buck2 test mode/dev-nosan caffe2/test:profiler -- -- test_execution_trace_with_pt2
Differential Revision: D59195933
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129832
Approved by: https://github.com/Skylion007, https://github.com/davidberard98
# Error
```
File "/data/users/colinpeppler/pytorch/torch/_meta_registrations.py", line 704, in sym_constrain_range
constrain_range(size, min=min, max=max)
File "/data/users/colinpeppler/pytorch/torch/fx/experimental/symbolic_shapes.py", line 898, in constrain_range
a.node.shape_env._constrain_range(a.node.expr, min, max)
File "/data/users/colinpeppler/pytorch/torch/fx/experimental/recording.py", line 245, in wrapper
return fn(*args, **kwargs)
File "/data/users/colinpeppler/pytorch/torch/fx/experimental/symbolic_shapes.py", line 2813, in _constrain_range
assert isinstance(a, sympy.Symbol), f"constraining non-Symbols NYI, {a} is {type(a)}"
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
AssertionError: constraining non-Symbols NYI, s1 + s2 is <class 'sympy.core.add.Add'>
```
# Context
I ran into the following scenario:
```
getitem = ...
sym_size_int = torch.ops.aten.sym_size.int(getitem, 0) # this is u0 = s0 + s1
_check_is_size = torch._check_is_size(sym_size_int)
# we fail at this guy
sym_constrain_range_default = torch.ops.aten.sym_constrain_range.default(sym_size_int, min = 4, max = 1234)
# runtime assertion
add = sym_size_int + sym_size_int_1
eq = add == sym_size_int
_assert_scalar_default = torch.ops.aten._assert_scalar(eq, "Runtime assertion failed for expression Eq(s0 + s1, u0) on node 'eq'")
```
everything but getitem was asserted into the FX graph by insert_deferred_runtime_asserts()
7e4329c258/torch/fx/passes/runtime_assert.py (L38-L52)
In the above scenario, we fail trying to constraint the range on `s0 + s1` which is not a `sympy.Symbol`.
And why exactly are we constraining the range on `s0 + s1`? Because it's the replacement for `u0`.
# Approach
Whenever we try to constrain the range on the replacement of ~~an unbacked symint~~ a non-symbol, just ignore it.
In the scenario above, we'll be okay to ignore it because whenever there's a replacement on an unbacked symint, we will update its range. Hence, no need to constrain the range on `s1 + s1`. We can confirm this with `TORCH_LOGS="+dynamic"`.
```
torch/fx/experimental/symbolic_shapes.py:4737: _update_var_to_range u0 = VR[4, 198] (update)
torch/fx/experimental/symbolic_shapes.py:4856: set_replacement u0 = s1 + s2 (trivial_lhs) VR[4, 198]
```
600bf978ba/torch/fx/experimental/symbolic_shapes.py (L4759-L4764)
Differential Revision: [D59257079](https://our.internmc.facebook.com/intern/diff/D59257079)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129907
Approved by: https://github.com/jingsh
Some profiling suggests that the repeated maybe evaluate static calls are expensive.
Ref: https://github.com/pytorch/pytorch/issues/123964
With test script:
```
import torch
import torch._dynamo.config
torch._dynamo.config.capture_scalar_outputs = True
@torch.compile(fullgraph=True)
def f(a, b):
xs = b.tolist()
for x in xs:
torch._check_is_size(x)
torch._check(x <= 20)
return a.split(xs)
N = 20
splits = torch.randint(10, (N,))
sz = splits.sum().item()
f(torch.randn(sz), splits)
```
Before:
```
real 0m18.526s
user 0m16.555s
sys 0m11.031s
```
After:
```
real 0m13.831s
user 0m12.152s
sys 0m10.941s
```
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129893
Approved by: https://github.com/lezcano
**Summary**
I have added an even more detailed module tracker that now includes the collective counts and operations that happen in each submodule making it easier for users to debug. The tracing now includes the operation's DTensor arguements' input shape and sharding. Like the module collective tracing, the user also has the option to log the tracing table to output.txt file. I have decided not to include the example output for transformer as it is too many lines. The expected output for the MLP_operation_tracing is shown below:
<img width="574" alt="Screenshot 2024-06-25 at 3 33 16 PM" src="https://github.com/pytorch/pytorch/assets/50644008/a09e2504-19d5-4c69-96e8-f84e852d7786">
<img width="467" alt="Screenshot 2024-06-25 at 3 33 45 PM" src="https://github.com/pytorch/pytorch/assets/50644008/55c07d2d-6cb6-410f-82ac-2849bb7bfbbb">
**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_operation_tracing
2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129017
Approved by: https://github.com/XilunWu
# Changes
* small fix in stage error message
* Move `format_pipeline_order` and `_validate_pipeline_order` out of `test_schedule.py` into `schedules.py`.
* Wrap the execution runtime in a try-except which on error will log the timestep and schedule plan before re-raising the exception.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129369
Approved by: https://github.com/wconstab
ghstack dependencies: #129368
Re-organize ```block_mask``` related arguments a tuple to reduce the individual argument number. I was trying to use named tuple, but aot autograd doesn't work well with named tuple. The only downside of using tuple rather than named tuple is we need to use index to access its element. But we only need this at one place, it should be fine.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129831
Approved by: https://github.com/Chillee, https://github.com/drisspg
Summary:
There were two problems with the HistogramObserver:
1. It does not work when someone passes a batch_size 1, tensor_size 1 data-point.
2. The Histogram doesn't seem to actually update if the range of the new x falls within the old one
These issues were both fixed.
On top of this, I greatly simplified the logic for the histogram updating. Now, it doesn't do the downsampling anymore, which saves a ton of memory and code. The accuracy can still be controlled with the upsampling ratio. This ratio was also too high for the accuracy we generally need here, I reduced the default for this.
Also the code is cleaner now, much easier to follow what's happening.
test_histogram_observer_same_inputs was likely wrong - If I pass 0s and 1s to my histogramobserver, I want them to actually count! The current test now thinks it's good to discard and ignore these values.
Test Plan: You can run the included tests.
Differential Revision: D58931336
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129387
Approved by: https://github.com/jerryzh168
In the "layout()" method of "TensorImpl" defined in the file core/TensorImpl.h, the following code and documentation can be found:
```
Layout layout() const {
...
if .. {
...
} else if (is_sparse_compressed()) {
// Typically, the tensor dispatch keys define the tensor layout
// uniquely. This allows using non-virtual layout method for
// better performance. However, when tensor's layout depends,
// say, on tensor attributes, one must use this execution path
// where the corresponding tensor impl class overwrites virtual
// layout_impl() method.
return layout_impl();
} else {
...
}
}
```
However, this override was never implemented. This PR put the override in place, to prepare for sparsity propagation in another PR.
https://github.com/pytorch/pytorch/issues/117188
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129930
Approved by: https://github.com/ezyang
Background: this bug was triggering DEBUG=1 asserts in the backward for `unbind()`, which calls `empty_like()`. I found that the NJT implementation of `empty_like()` was redispatching on `values` while blindly passing along all kwargs. This resulted in `empty_like(values, ..., layout=torch.jagged)`, which is incorrect since `values` is strided, tripping the debug assert here:
433b691f98/aten/src/ATen/EmptyTensor.cpp (L305)
This PR explicitly sets `layout=torch.strided` when redispatching `*_like()` factories on `values`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129879
Approved by: https://github.com/soulitzer
Fixes#111884
In the minimised reproducer, we have a loop with the index expression `-q0*q1`
for which in the merge tester we get:
```
expr1 = - 0 * (_merge_tester * 16) = 0
expr2 = - _merge_tester * 0 = 0
```
so it decides we can merge the dimensions and `q0` is set to `0`, meaning `-q0*q1` is always zero!
Here I change the test so we have at least one case where no zeros are
substituted so we can catch this situation. In the normal strided case we get
e.g.
```
expr = 16 * q0 + q1
expr1 = 16 * _merge_tester2 + (16 * _merge_tester1)
expr2 = 16 * (_merge_tester2 + _merge_tester1)
```
which are still equivalent expressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129806
Approved by: https://github.com/lezcano
**Summary**
This PR mainly refactor 2 things:
1. Passing in weight's data type explicitly in `create_micro_gemm` as `input2.dtype`. When registering `CppMicroGemmConfig`, we will reuse `input.dtype` if `input2.dtype` is not explicitly registered.
2. Add an util function to get the output data type and compute data type from input data type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129221
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #128825, #129048, #129049, #129103, #129220
**Summary**
We change the schema of QLinear Binary, so it will be easier to enable the corresponding gemm template.
- Extra input of binary post-op is a tensor which needs to be an input node of autotuning, we need to move it at front of `output_scale` which is a scalar.
- We also move it at front of `bias`, since `bias` is optional tensor for this fusion, but `other` is a must to have for linear binary fusion.
**Test Plan**
```
python -u -m pytest -s -v test/quantization/core/test_quantized_op.py -k qlinear
python -u -m pytest -s -v test/inductor/test_mkldnn_pattern_matcher.py -k qlinear
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129049
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #128825, #129048
In nvidia internal testing, for slower devices such as Orin NX, on large dtypes like complex128, test_linalg_solve_triangular_large is taking multiple hours to complete and timing out CI. This PR adds a slowTest marker so it can be skipped due to speed issues. cc @nWEIdia
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129903
Approved by: https://github.com/lezcano
_Action.__repr__ gets rearranged so it doesn't require an underscore or
a 's' prefix, but still keeps multi-digit stage and microbatch indices
separated by an alpha character indicating the action type.
to/from CSV methods allow dumping a generated schedule to CSV format for
offline visualization or manual editing in a spreadsheet and reloading
to use at runtime.
Co-authored-by: Howard Huang <howardhuang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129264
Approved by: https://github.com/H-Huang
Fixes #ISSUE_NUMBER
Gonna fill in the RFC but just want to run CI to see if anything else breaks.
Test:
```
python test/distributed/_composable/fsdp/test_fully_shard_training.py -k test_raise_not_implemented_state_dict_if_2d
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129519
Approved by: https://github.com/awgu
As this is a min CMake version supported by top level PyTorch
Hides
```
CMake Deprecation Warning at aten/src/ATen/native/quantized/cpu/qnnpack/deps/clog/CMakeLists.txt:7 (cmake_minimum_required):
Compatibility with CMake < 3.5 will be removed from a future version of
CMake.
Update the VERSION argument <min> value or use a ...<max> suffix to tell
CMake that the project does not need compatibility with older versions.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129906
Approved by: https://github.com/kit1980
In this PR:
- Ensure that if a tensor not requiring grad is saved for backward unpacking does not trigger a detach (unless the user installs a saved tensor pack hook that returns a tensor requiring grad).
- Update non-reentrant checkpoint to also no longer detach for this case.
Alternatives:
- For custom autograd Function, you could directly save on ctx to work around this, but that would not work for when we switch to using custom ops.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127959
Approved by: https://github.com/YuqingJ
ghstack dependencies: #125795, #128545, #129262
This PR enables using AOTriton as a shared library dependency instead of a static one.
Resolves the issue of linker errors when trying to build PyTorch for a lot of (>7 or so) gfx archs due to huge size of aotriton static library.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129094
Approved by: https://github.com/malfet
Fixes#127666.
Other std math functions are replaced with those in the global namespace during hipify. HIP does not claim to support every function in the C++ standard library. std::clamp is not yet supported and we have been relying on the std implementation. For Fedora 40 + gcc 14, a host-side assert is used which is not supported. Work-around this by replacing std::clamp with min and max for USE_ROCM builds.
Patch comes from @lamikr. Modified to use #ifndef USE_ROCM.
https://github.com/lamikr/rocm_sdk_builder/pull/37
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127812
Approved by: https://github.com/hongxiayang, https://github.com/malfet
Hard to write tests. This PR makes many test pass in the stack such as
`PYTORCH_TEST_WITH_DYNAMO=1 pytest test/test_ao_sparsity.py::TestComposability::test_convert_without_squash_mask`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129858
Approved by: https://github.com/mlazos
ghstack dependencies: #129830
Extra period at the end throws off pip:
```
root@f04177cab5af:/data/pytorch# pip install -r .ci/docker/requirements-ci.txt
ERROR: Invalid requirement: 'lxml==5.0.0.': Expected end or semicolon (after version specifier)
lxml==5.0.0.
~~~~~~~^ (from line 309 of .ci/docker/requirements-ci.txt)
```
Not sure why CI docker builds do not have an issue with this period.
Typo comes from f73b1b9388
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129695
Approved by: https://github.com/huydhn
Summary: Fix index issues in torch.fx.interpreter by changing range from `[:i]` to `[:i+1]`. Because if there are `n` elements, the last index `i` of the `for` loop is `n-1` and `[:i]` can only get access to elements from index `0` to index `n-2` and miss the last element. `[:i+1]` can get access to all elements correctly.
Test Plan: Test with Node API
Differential Revision: D59028395
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129527
Approved by: https://github.com/dulinriley
**Summary:**
Increase riscv implementation in DepthwiseConvKernel.
**Compile:**
export USE_CUDA=0
export USE_DISTRIBUTED=0
export USE_MKLDNN=0
export MAX_JOBS=4
export CMAKE_CXX_COMPILER=clang++
export CMAKE_C_COMPILER=clang
export CMAKE_C_FLAGS=-march=rv64gcv
export CMAKE_CXX_FLAGS=-march=rv64gcv
python3 setup.py develop --cmake
**Test Plan:**
**Correctness** - Check the results of the run before and after test_convolution.py
python3 test/run_test.py --include nn/test_convolution --keep-going
**Before:**
===== 9 passed, 13 skipped, 564 deselected in 46.55s =====
The following tests failed consistently:
test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_backward_twice
test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_inconsistent_types
test/nn/test_convolution.py::TestConvolutionNN::test_conv_modules_raise_error_on_incorrect_input_size
test/nn/test_convolution.py::TestConvolutionNN::test_conv_shapecheck
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv1d
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv2d
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv3d
test/nn/test_convolution.py::TestConvolutionNN::test_mismatch_shape_conv2d
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_complex64
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_float32
**After:**
===== 9 passed, 13 skipped, 564 deselected in 48.13s =====
The following tests failed consistently:
test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_backward_twice
test/nn/test_convolution.py::TestConvolutionNN::test_Conv2d_inconsistent_types
test/nn/test_convolution.py::TestConvolutionNN::test_conv_modules_raise_error_on_incorrect_input_size
test/nn/test_convolution.py::TestConvolutionNN::test_conv_shapecheck
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv1d
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv2d
test/nn/test_convolution.py::TestConvolutionNN::test_invalid_conv3d
test/nn/test_convolution.py::TestConvolutionNN::test_mismatch_shape_conv2d
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_complex64
test/nn/test_convolution.py::TestConvolutionNNDeviceTypeCPU::test_conv_empty_channel_cpu_float32
**Performance** - Compare the results before and after mobilenet_v2
python3 run.py mobilenet_v2 -d cpu -t eval
**Before:**
Running eval method from mobilenet_v2 on cpu in eager mode with input batch size 16 and precision fp32.
CPU Wall Time per batch: 19590.647 milliseconds
CPU Wall Time: 19590.647 milliseconds
Time to first batch: 5271.3518 ms
CPU Peak Memory: 0.3809 GB
**After:**
Running eval method from mobilenet_v2 on cpu in eager mode with input batch size 16 and precision fp32.
CPU Wall Time per batch: 13523.530 milliseconds
CPU Wall Time: 13523.530 milliseconds
Time to first batch: 2696.0304 ms
CPU Peak Memory: 0.3408 GB
**Versions:**
Clang version: 17.0.2
Platform: CanMV-K230
Architecture: riscv64
OS: Ubuntu 23.10
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127867
Approved by: https://github.com/malfet
To make debugging regressions like ones happened last Wed when new version of torchao was released, that resulted in TorchBench downgrading pytorch version to 2.3.1
Test plan: Look at the log output for example https://github.com/pytorch/pytorch/actions/runs/9720408234/job/26832794157?pr=129809#step:20:1158 contains
```
+ echo 'Print all dependencies after TorchBench is installed'
Print all dependencies after TorchBench is installed
+ python -mpip freeze
absl-py==2.1.0
accelerate==0.31.0
aiohttp==3.9.5
aiosignal==1.3.1
astunparse==1.6.3
async-timeout==4.0.3
attrs==23.2.0
audioread==3.0.1
beautifulsoup4==4.12.3
boto3==1.19.12
botocore==1.22.12
bs4==0.0.2
cachetools==5.3.3
certifi==2024.6.2
cffi==1.16.0
charset-normalizer==3.3.2
click==8.1.7
...
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129809
Approved by: https://github.com/kit1980, https://github.com/atalman
Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes.
What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here...
Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343
Approved by: https://github.com/Skylion007
**Summary**
When check the vectorization status among 3 test suit, we found some operators disabled vectorization with message `Disabled vectorization: op: bitwise_and`. In this PR, we add vectorization support of 6 bitwise functions.
In this PR, we also remove `bitwise_xor` from `ops_to_bool` list which sets output data type as bool in data type propagation. It seems wrong since according to this doc
https://pytorch.org/docs/stable/generated/torch.bitwise_xor.html, it should return the same integral data type with input and the testcase `test_bitwise3` failed due to this issue.
**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_bitwise
python -u -m pytest -s -v test/inductor/test_torchinductor.py -k test_bitwise3
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129733
Approved by: https://github.com/jgong5, https://github.com/Skylion007
This PR follows https://github.com/pytorch/pytorch/pull/129374#pullrequestreview-2136555775 cc @malfet:
> Lots of formatting changes unrelated to PR goal, please keep them as part of separate PR (and please add lint rule if you want to enforce those, or at least cite one)
`usort` allows empty lines within import segments. For example, `usort` do not change the following code:
```python
import torch.aaa
import torch.bbb
import torch.ccc
x = ... # some code
```
```python
import torch.aaa
import torch.bbb
import torch.ccc
x = ... # some code
```
```python
import torch.aaa
import torch.bbb
import torch.ccc
x = ... # some code
```
This PR first sort imports via `isort`, then re-sort the file using `ufmt` (`usort` + `black`). This enforces the following import style:
1. no empty lines within segments.
2. single empty line between segments.
3. two spaces after import statements.
All the code snippets above will be formatted to:
```python
import torch.aaa
import torch.bbb
import torch.ccc
x = ... # some code
```
which produces a consistent code style.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129751
Approved by: https://github.com/malfet
Fixes based on discussion in https://github.com/pytorch/pytorch/issues/128665
Our previous assumption was that for looped schedules `stage_ids = range(rank, total_stages, num_local_stages)`. This is not true for all schedules. This change relaxes that assumptions and allows arbitrary ordering of stages. For example in the added test we do, rank 0: [stage0, stage3], rank 1: [stage1, stage2]. The test also adds a schedule registry (for testing) which performs 1 microbatch through this schedule
```
F0_0 None None F0_3 B0_3 None None B0_0
None F0_1 F0_2 None None B0_2 B0_1 None
```
Co-authored-by: Will Constable <whc@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128976
Approved by: https://github.com/wconstab
ghstack dependencies: #128983
- Lifting Tensor Constant attributes to buffers: TorchScript does not automatically lift tensor constant attributes to buffers. So previous converter cannot access tensor constant attributes. This PR fixed the issue.
- Add SetAttr support for tensor attributes by copy_.
- Add SetAttr support for non-tensor attributes. In particular, we maintain the current value of non-tensor attributes in `name_to_non_tensor_attribute_node`, similar to an interpreter pass on non-tensor attributes. So we can support the following use case:
```python
def forward(self, x):
c1 = self.count
self.count += 1
c2 = self.count
return x + c1 + c2
```
- Fixed a bug in GetAttr to support the following use case:
```python
def forward(self, inp):
x = self.buffer
self.buffer += 1
y = self.buffer
return x + y + inp
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129440
Approved by: https://github.com/angelayi
FSDP implements the following logic but its missing from DDP.
This PR adds an equivalent function for the same.
```python
def __getattr__(self, name: str) -> Any:
"""Forward missing attributes to the wrapped module."""
try:
return super().__getattr__(name) # defer to nn.Module's logic
except AttributeError:
return getattr(self._fsdp_wrapped_module, name)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128620
Approved by: https://github.com/awgu
Summary:
In a C++ program, if we have child threads doing GPU work, it would be nice to get traces of those threads as well. The problem is, pushProfilingCallbacks() is not called on child threads, therefore, no observer traces are collected on these threads, entirely missing in the final output.
This diff provides a new API that a child thread may elect to call to register itself onto the profiler that was started in main thread (or whatever the Python thread that manages the profiler).
Test Plan:
```
buck2 test @mode/opt //caffe2/test:profiler_test_cpp_thread
```
Reviewed By: aaronenyeshi
Differential Revision: D56669942
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128743
Approved by: https://github.com/aaronenyeshi
This PR runs the reduce-scatter copy-in in the default stream, allowing the reduce-scatter input (large allocation proportional to unsharded gradients) to be allocated in the default stream to avoid fragmenting that memory across stream memory pools.
- In general, minimizing memory usage spikes in non-default-stream memory pools helps because otherwise, that memory cannot be reused by the default stream outside of that spike. This reduce-scatter input allocation represents one such spike. The reduce-scatter outputs are still allocated in the separate `reduce_scatter` stream since they are small and have a non-spiky allocation/free pattern (we iteratively allocate them through backward and free them altogether after optimizer).
- This PR should not have any impact on overlap (I sanity checked Llama3-8B traces from torchtitan; plus we have the `test_fully_shard_overlap.py` unit tests).
**Experiment**
**(Before)** Llama3-8B, 1D FSDP, 8 H100s, bf16/fp32 mixed precision, no AC, local batch size 1:
```
[rank0]:2024-06-27 16:38:56,620 - root - INFO - step: 1 loss: 12.2764 memory: 71.99GiB(75.75%) wps: 1,436 mfu: 8.41%
[rank0]:2024-06-27 16:38:56,620 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:2024-06-27 16:38:57,943 - root - INFO - step: 2 loss: 12.1001 memory: 79.82GiB(83.98%) wps: 6,195 mfu: 36.28%
[rank0]:2024-06-27 16:38:59,266 - root - INFO - step: 3 loss: 11.7697 memory: 79.82GiB(83.98%) wps: 6,193 mfu: 36.27%
[rank0]:2024-06-27 16:39:00,587 - root - INFO - step: 4 loss: 11.2807 memory: 79.82GiB(83.98%) wps: 6,203 mfu: 36.32%
[rank0]:2024-06-27 16:39:01,910 - root - INFO - step: 5 loss: 10.9494 memory: 79.82GiB(83.98%) wps: 6,198 mfu: 36.30%
```
**(After)** Llama3-8B, 1D FSDP, 8 H100s, bf16/fp32 mixed precision, no AC, local batch size 1:
```
[rank0]:2024-06-27 16:41:12,106 - root - INFO - step: 1 loss: 12.2560 memory: 69.46GiB(73.08%) wps: 1,158 mfu: 6.78%
[rank0]:2024-06-27 16:41:12,106 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40
[rank0]:2024-06-27 16:41:13,502 - root - INFO - step: 2 loss: 12.0949 memory: 77.29GiB(81.32%) wps: 5,870 mfu: 34.37%
[rank0]:2024-06-27 16:41:14,839 - root - INFO - step: 3 loss: 11.7770 memory: 77.29GiB(81.32%) wps: 6,130 mfu: 35.90%
[rank0]:2024-06-27 16:41:16,154 - root - INFO - step: 4 loss: 11.3188 memory: 77.29GiB(81.32%) wps: 6,230 mfu: 36.48%
[rank0]:2024-06-27 16:41:17,474 - root - INFO - step: 5 loss: 10.9443 memory: 77.29GiB(81.32%) wps: 6,211 mfu: 36.37%
```
**2.53 GiB reduction in peak reserved memory.**
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129721
Approved by: https://github.com/weifengpy, https://github.com/yifuwang
Update ruff to 0.5.0 so we can enable all the some of the new checks I've been wanting to add to the codebase. First just updating the code to comply with some rule changes and a couple minor API changes / deprecations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129744
Approved by: https://github.com/ezyang
I think there is a typo in the first example of the `torch.func.stack_module_state` documentation. The first parameter in the function call in the `wrapper` return is missing an 's'.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129126
Approved by: https://github.com/zou3519
This PR follows https://github.com/pytorch/pytorch/pull/129374#pullrequestreview-2136555775 cc @malfet:
> Lots of formatting changes unrelated to PR goal, please keep them as part of separate PR (and please add lint rule if you want to enforce those, or at least cite one)
`usort` allows empty lines within import segments. For example, `usort` do not change the following code:
```python
import torch.aaa
import torch.bbb
import torch.ccc
x = ... # some code
```
```python
import torch.aaa
import torch.bbb
import torch.ccc
x = ... # some code
```
```python
import torch.aaa
import torch.bbb
import torch.ccc
x = ... # some code
```
This PR first sort imports via `isort`, then re-sort the file using `ufmt` (`usort` + `black`). This enforces the following import style:
1. no empty lines within segments.
2. single empty line between segments.
3. two spaces after import statements.
All the code snippets above will be formatted to:
```python
import torch.aaa
import torch.bbb
import torch.ccc
x = ... # some code
```
which produces a consistent code style.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129751
Approved by: https://github.com/malfet
Fixes the silent correctness issue in #129207 by preventing the user from calling the convolution op on MPS device with an unsupported value.
The fix for the missing support is coming in later as that requires work on the kernel side so it'll take some more time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129484
Approved by: https://github.com/kulinseth
As @vmoens pointed out, the current error message does not make the "either/or" between setting `weights_only=False` and using `add_safe_globals` clear enough, and should print the code for the user to call `add_safe_globals`
New formatting looks like such
In the case that `add_safe_globals` can be used
```python
>>> import torch
>>> from torch.testing._internal.two_tensor import TwoTensor
>>> torch.save(TwoTensor(torch.randn(2), torch.randn(2)), "two_tensor.pt")
>>> torch.load("two_tensor.pt", weights_only=True)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/users/mg1998/pytorch/torch/serialization.py", line 1225, in load
raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. This file can still be loaded, to do so you have two options
(1) Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
(2) Alternatively, to load with `weights_only=True` please check the recommended steps in the following error message.
WeightsUnpickler error: Unsupported global: GLOBAL torch.testing._internal.two_tensor.TwoTensor was not an allowed global by default. Please use `torch.serialization.add_safe_globals([TwoTensor])` to allowlist this global if you trust this class/function.
Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```
For other issues (unsupported bytecode)
```python
>>> import torch
>>> t = torch.randn(2, 3)
>>> torch.save(t, "protocol_5.pt", pickle_protocol=5)
>>> torch.load("protocol_5.pt", weights_only=True)
/data/users/mg1998/pytorch/torch/_weights_only_unpickler.py:359: UserWarning: Detected pickle protocol 5 in the checkpoint, which was not the default pickle protocol used by `torch.load` (2). The weights_only Unpickler might not support all instructions implemented by this protocol, please file an issue for adding support if you encounter this.
warnings.warn(
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/users/mg1998/pytorch/torch/serialization.py", line 1225, in load
raise pickle.UnpicklingError(_get_wo_message(str(e))) from None
_pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error: Unsupported operand 149
Check the documentation of torch.load to learn more about types accepted by default with weights_only https://pytorch.org/docs/stable/generated/torch.load.html.
```
Old formatting would have been like:
```python
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/data/users/mg1998/pytorch/torch/serialization.py", line 1203, in load
raise pickle.UnpicklingError(UNSAFE_MESSAGE + str(e)) from None
_pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you get the file from a trusted source. Alternatively, to load with `weights_only` please check the recommended steps in the following error message. WeightsUnpickler error: Unsupported global: GLOBAL torch.testing._internal.two_tensor.TwoTensor was not an allowed global by default. Please use `torch.serialization.add_safe_globals` to allowlist this global if you trust this class/function.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129705
Approved by: https://github.com/albanD, https://github.com/vmoens
ghstack dependencies: #129239, #129396, #129509
Summary:
There is a small cosmetic issue in the C++ wrapper file generated by AOTInductor - The launchKernel() call isn't properly indented.
Added indentation for launchKernel() code block call when there's a "if" condition. a.k.a when `grid_uses_symbolic_shapes` is `True`.
Test Plan:
Test cmd ran (in pytorch oss):
`TORCH_LOGS="output_code" TORCH_COMPILE_DEBUG=1 python test/inductor/test_aot_inductor.py -k test_zero_grid_with_backed_symbols_abi_compatible_cuda`
And then manually verified the output code generated in a path like
`/tmp/torchinductor_guorachel/coraisesuchpl3qabrazn7ydydszcit6lwpn7ckd3b4wej4rep5l/cba5g5ajeh5sym3tp5iqn7kkokimj7qqd4krs2rruhupbfqgppge.cpp`
Similarly, also verified for test case:`test_zero_grid_with_unbacked_symbols_abi_compatible_cuda`
Differential Revision: D58897157
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129616
Approved by: https://github.com/ColinPeppler
TorchDynamo guard mechanism guards on the key order on the dictionaries if the user iterates over the dictionary. For standard dict, we can write a fast C++ implementation by using PyDict_Next. But with OrderedDict, we have to rely on `keys` Python API to get the key ordering. This makes guard evaluation slow.
With Dynamo inlining into inbuilt nn modules, I am seeing many guards over the OrderedDict on `_modules`, `_parameters`. From reading the code, I don't see any reason to not use standard dicts. I think OrderedDict was preferred over dict because of the ordering, but dicts are now ordered. With this PR, I am observing ~20% reduction in guard overhead of a HF model.
Functionality impact
- The only difference between dict and OrdedeDict is `move_to_end` method for OrderedDict ([link](https://stackoverflow.com/questions/34305003/difference-between-dictionary-and-ordereddict)). But the changes here are internal to nn module, and we do not use `move_to_end` for `_parameters`, `_modules` and `_buffers`. We use `move_to_end` for hooks but this PR keeps the OrderedDict for hooks untouched (we should still followup with hooks but in a separate PR).
Perf impact
- I dont anticipate any perf impact. `dict` is completely implemented in C. OrderedDict is Python wrapper over dict with only few method overridden ([link](https://stackoverflow.com/questions/34305003/difference-between-dictionary-and-ordereddict)).
Typing impact
- I dont anticipate any. For all the user visible methods for nn.Module, we don't expose the underlying `_modules` etc. We have iterators like `named_parameters` which return an Iterator of Parameter. So, no typing changes required.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129164
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #129163
Fixes#129601
Background: it's possible that a traceable wrapper subclass will have an optional inner tensor constituent (e.g. NJT's cached min / max sequence lengths). To specify this, the subclass's `__tensor_flatten__()` impl should leave out any unspecified optional inner tensors in the returned list of `attrs`.
This PR guards on the list of inner tensor `attrs` returned in `subclass.__tensor_flatten__()[0]`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129618
Approved by: https://github.com/anijain2305
Fixes#122978
## Summary
To fix compilation error for test test_dynamic_scalar_abi_compatible_cpu_with_stack_allocation
- Error 1
```
error: no matching function for call to ‘torch::aot_inductor::ArrayRefTensor<float>::ArrayRefTensor(float [1], const int64_t [0], const int64_t [0], int&, int32_t&)’
613 | ArrayRefTensor<float> buf3(buf3_storage, int_array_6, int_array_6, cached_torch_device_type_cpu, this->device_idx_);
| ^
...
torch/include/torch/csrc/inductor/aoti_runtime/arrayref_tensor.h:188:35: note: no known conversion for argument 2 from ‘const int64_t [0]’ {aka ‘const long int [0]’} to ‘torch::aot_inductor::MiniArrayRef<const long int>’
188 | MiniArrayRef<const int64_t> sizes,
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~
```
Fix: added constructor for empty array in arrayref_tensor.h
- Error 2
```
error: cannot convert ‘torch::aot_inductor::ArrayRefTensor<float>’ to ‘AtenTensorHandle’ {aka ‘AtenTensorOpaque*’}
625 | AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float32(buf3, &zuf0_raw));
| ^~~~
| |
| torch::aot_inductor::ArrayRefTensor<float>
```
Fix: in cpp_wrapper_cpu.py, added codegen to call convert ArrayRefTensor to AtenTensorHandle first.
## Test Plan
```
python test/inductor/test_aot_inductor.py -k AOTInductorTestABICompatibleCpuWithStackAllocation.test_dynamic_scalar_abi_compatible_cpu_with_stack_allocation
```
Before the fix, detailed in #122978:
```
| AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float32(buf3, &zuf0_raw));
| ^~~~
| |
| torch::aot_inductor::ArrayRefTensor<float>
/home/yingzhaoseattle/pytorch/torch/include/torch/csrc/inductor/aoti_runtime/utils.h:34:8: note: in definition of macro ‘AOTI_TORCH_ERROR_CODE_CHECK’
Ran 1 test in 4.377s
FAILED (errors=1)
```
After the fix
```
/home/yingzhaoseattle/pytorch/torch/backends/cudnn/__init__.py:107: UserWarning: PyTorch was compiled without cuDNN/MIOpen support. To use cuDNN/MIOpen, rebuild PyTorch making sure the library is visible to the build system.
warnings.warn(
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('extern_calls', 1)]
.
----------------------------------------------------------------------
Ran 1 test in 9.633s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129173
Approved by: https://github.com/chenyang78
FSDP2 accumulates gradients for sharded parameters outside of the autograd engine's normal accumulation logic. We can respect registered post-accumulate-grad hooks by running them manually.
**Discussion**
Discussing with @soulitzer, changing FSDP2 to make the sharded parameters autograd leaves requires nontrivial changes to FSDP and some changes to the autograd engine (around forward vs. backward streams) where the changes may not preserve eager-mode performance and/or add some complexity.
Under the FSDP2 design, the sharded parameters never participate in autograd, so calling `register_post_accumulate_grad_hook` on them would otherwise be a no-op. In other words, there is virtually no chance for FSDP2 incorrectly re-running the hook when it should not.
Given these, a reasonable near-term solution is for FSDP2 to run the post-accumulate-grad hooks manually.
**Caveats**
- Running `foreach=False` optimizer _per parameter tensor_ incurs significantly higher CPU overhead compared to `foreach=True` (partially due to `DTensor` being a `__torch_dispatch__` tensor subclass).
- On preliminary benchmarking on Llama3-8B on 8 GPUs, this CPU overhead is mostly tolerable, but on smaller # of GPUs or a less compute-intensive model, this may not be.
- One solution for native Adam/AdamW is to use `fused=True`, which makes both the CPU overhead lower and GPU compute faster. However, this is generally not an option for user-defined optimizers.
- If this CPU overhead blocks adoption of this feature, then we should seriously consider an FSDP-specific API like `register_post_backward_hook(params: List[nn.Parameter]) -> None` that allows the user to see all parameters in the `FSDPParamGroup` together for the hook so that the user can still run a `foreach=True` optimizer step on that `List[nn.Parameter]`.
- The post-accumulate-grad hook runs in the reduce-scatter stream. Our current stream handling logic does not have the default stream wait for the reduce-scatter stream until the end of backward. Unless we add that, we cannot simply run the post-accumulate-grad hook in the default stream.
- This means that optimizer compute will overlap with backward compute, which may slowdown end-to-end execution slightly (e.g. due to SM contention or wave quantization effects). For example, on Llama3-8B, we see about ~3% decrease in MFU when running optimizer in backward even though the optimizer steps are fully overlapped and there are no CPU boundedness issues.
- This PR's goal is only to run the hook manually. State dict etc. for optimizer-in-backward is out of scope.
**Experiments (torchtitan)**
- Llama3-8B on 2 GPUs, local batch size 1, with full activation checkpointing, and bf16/fp32 mixed precision:
- Without optimizer-in-backward: 82.03 GiB reserved memory; 28.1% MFU
- With optimizer-in-backward (`foreach=False`): 72.84 GiB reserved memory; 28.9% MFU (speedup from more of optimizer step overlapped)
- With optimizer-in-backward (`fused=True`): 70.84 GiB reserved memory; 30.4% MFU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129450
Approved by: https://github.com/weifengpy, https://github.com/yf225
It was deprecated since CMake-3.0 in favor of `execute_process`, see https://cmake.org/cmake/help/v3.18/command/exec_program.html
This makes the following warning disappear:
```
CMake Warning (dev) at cmake/Modules/FindARM.cmake:5 (EXEC_PROGRAM):
Policy CMP0153 is not set: The exec_program command should not be called.
Run "cmake --help-policy CMP0153" for policy details. Use the cmake_policy
command to set the policy and suppress this warning.
Use execute_process() instead.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129714
Approved by: https://github.com/kit1980
Fixes#129685
After matching a pattern, we currently try to remove all the nodes of that
pattern, which doesn't work if any intermediate node has users outside of the
pattern. In which case we can't delete those particular nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129689
Approved by: https://github.com/shunting314
I am building PyTorch with the Intel oneAPI 2024.0.0 compiler, and encountered this compile error:
```
[ 85%] Building CXX object caffe2/CMakeFiles/cpu_rng_test.dir/__/aten/src/ATen/test/cpu_rng_test.cpp.o
In file included from /home/src/pytorch/aten/src/ATen/test/cpu_rng_test.cpp:2:
/home/src/pytorch/aten/src/ATen/test/rng_test.h:119:41: error: loop variable 'to' creates a copy from type 'const ::std::optional<int64_t>' (aka 'const optional<long>') [-Werror,-Wrange-loop-construct]
119 | for (const ::std::optional<int64_t> to : tos) {
| ^
/home/src/pytorch/aten/src/ATen/test/rng_test.h:119:10: note: use reference type 'const ::std::optional<int64_t> &' (aka 'const optional<long> &') to prevent copying
119 | for (const ::std::optional<int64_t> to : tos) {
| ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
| &
1 error generated.
```
This change makes the compiler happy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129589
Approved by: https://github.com/colesbury
Before the PR, custom ops that don't return outputs will get eliminated after calling `.module()` because the effect_token that keeps the operator alive is removed in remove_effect_token pass. The reason why we want to remove_effect_token is because we don't want the token to be part of input. However, this causes DCE calls in remove_effect_token itself and the dce calls in unlift to remove the custom op in the graph causing an error in the exported graph.
This PR calls has_side_effect in with_effect to make sure graph.eliminate_dead_code doesn't remove the calls by accident.
Test Plan:
Add a new test pytest test/export/test_torchbind.py -k test_export_inplace_custom_op
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129680
Approved by: https://github.com/angelayi
Fixes [#129370](https://github.com/pytorch/pytorch/issues/129370)
Suggest correct a List type annotation when input is in Tuple type. To avoid confusion, we only suggest a type if the type is supported.
Example:
Tuple[int, int] -> List[int]
Tuple[Tensor, Tensor, Optional[Tensor]] -> List[Optional[Tensor]]
Tuple[int, ...] -> List[int]
ValueError: infer_schema(func): Parameter y has unsupported type typing.Tuple[torch.Tensor, torch.Tensor, typing.Optional[torch.Tensor]]. Tuple type annotation is not supported. Please try to use a List instead. For example, typing.List[typing.Optional[torch.Tensor]].
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129417
Approved by: https://github.com/zou3519
Changes by apply order:
1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.
`.parent{...}.absolute()` -> `.absolute().parent{...}`
4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)
`.parent.parent.parent.parent` -> `.parents[3]`
5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~
~`.parents[3]` -> `.parents[4 - 1]`~
6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
This PR introduces `_detect_dma_connectivity` - a utility for detecting DMA connectivity among devices.
The "DMA connectivity" in this context is more stringent than the ability to perform memory copy without CPU involvement. We define it as the ability for a device to issue load/store instructions and perform atomic operations on memory that resides on connected devices. The ability translates to the ability to run most aten GPU operations with operands backed by remote memory. `_detect_dma_connectivity` can help PyTorch and its users to determine whether certain DMA-based optimizations are possible.
`_detect_dma_connectivity` takes a `(device_type, connection_type)` pair and returns a matrix describing the connectivity. Connectivity detectors are statically registered on a `(device_type, connection_type)` basis. This PR implements the detector for `(CUDA, "nvlink")`. Later, detectors for pairs such as `(ROCM, "infinity_fabric")` can be introduced.
Example:
```python3
>>> from torch._C._autograd import DeviceType
>>> from torch._C._distributed_c10d import _detect_dma_connectivity
>>> connectivity = _detect_dma_connectivity(DeviceType.CUDA, "nvlink")
>>> for row in connectivity.matrix:
... print(row)
...
[0, 18, 18, 18, 18, 18, 18, 18]
[18, 0, 18, 18, 18, 18, 18, 18]
[18, 18, 0, 18, 18, 18, 18, 18]
[18, 18, 18, 0, 18, 18, 18, 18]
[18, 18, 18, 18, 0, 18, 18, 18]
[18, 18, 18, 18, 18, 0, 18, 18]
[18, 18, 18, 18, 18, 18, 0, 18]
[18, 18, 18, 18, 18, 18, 18, 0]
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129510
Approved by: https://github.com/weifengpy
Upload test stats when workflow always so that we can get status for cancelled workflows (especially ones that were cancelled manually)
There aren't that many workflow conclusions, so might as well as always run it, and we can see what happens
Undos [this old PR](https://togithub.com/pytorch/pytorch/pull/79180)
Notable pitfalls from the above:
Might cause noise if things can't be downloaded, but since this workflow doesn't show up on PRs, I think it's ok to slowly deal with what comes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129694
Approved by: https://github.com/huydhn
When the input predicate is a python constant, we specialize into one of the branches and warn users that torch.cond is not preserving the dynamism. The previous behavior is that we baked in True/False in the cond operator. This can be confusing. In this PR, we change it to be specializing into one of the branches when the inputs are constants.
We additionally change the naming of cond operator to default one without overriding its name. This allows better testing on de-serialized graph.
Test Plan:
The predicate in some existing tests is the result of a shape comparison. When no dynamic shape is involved, the predicate is a python bool. To fix them, we either change the predicate to be some data-dependent tensor or change the test to check cond is specialized as one of the branches,
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128709
Approved by: https://github.com/zou3519
In this PR, we implement the first version of training_ir.run_decomp functionality. Since we don't return the modified buffers as extra output in training IR, our previous strategy of reusing graph signature won't work. In fact, this run_decomp is more similar to retracing. So i reuse some of export steps here. After this PR:
export_for_training().run_decomp({}, _preserve_ops=[all 183 ops]) == export_for_predispatch() - autograd_manipulating_ops.
Differential Revision: [D59069090](https://our.internmc.facebook.com/intern/diff/D59069090)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129249
Approved by: https://github.com/zhxchen17
ghstack dependencies: #128077, #129092
Summary: Somehow the delegate returns a real tensor result even though we pass in fake tensors. So here we need to convert the result to fake.
Test Plan: `buck2 run @//mode/dev-nosan //on_device_ai/helios/multi_zion:multi_zion_test -- -r test_single_delegate_dsp_only`
Differential Revision: D58617091
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128752
Approved by: https://github.com/ydwu4
Currently the runner determinator is buggy and doesn't let anyone's workflows run against the LF runners (it prefixes a "@" to the user names in the issue instead of either stripping it or prefixing it to the incoming names)
This PR fixes the bug so that people opted in to using LF runners can actually use them. It also puts the python code back into the repo. Even though the code isn't directly invoked, having it there makes testing and linting easier/possible
Also includes lint fixes
Note: if you just review the .yml file you'll see all the relevant diffs
### Testing:
#### Before
```
python .github/scripts/runner_determinator.py --github-token $GH_KEY --github-issue 5132 --github-actor ZainRizvi --github-issue-owner ZainRizvi --github-branch foo
{"label_type": "", "message": "LF Workflows are disabled for ZainRizvi, ZainRizvi. Using meta runners."}
```
#### After
```
python .github/scripts/runner_determinator.py --github-token $GH_KEY --github-issue 5132 --github-actor ZainRizvi --github-issue-owner ZainRizvi --github-branch foo
{"label_type": "lf.", "message": "LF Workflows are enabled for ZainRizvi, ZainRizvi. Using LF runners."}
```
Aside: updated test case after rebase:
```
python .github/scripts/runner_determinator.py --github-token $GH_KEY --github-issue 5132 --github-actor ZainRizvi --github-issue-owner ZainRizvi2 --github-branch foo --github-repo python/pythonss --github-ref-type branch
{"label_type": "lf.", "message": "LF Workflows are enabled for ZainRizvi. Using LF runners."}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129612
Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt
Summary: LLVM-15 has a warning `-Wno-return` which can be used to identify functions that do not return. Qualifying these functions with `[[noreturn]]` is a perf optimization.
Test Plan: Sandcastle
Reviewed By: dmm-fb
Differential Revision: D59003594
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129575
Approved by: https://github.com/Skylion007
Fixes: #128478
In backward() implementation checkpointing code was quering device type from the rng_state tensors saved on forward(). These tensors are CPU only tensors and don't carry device information with them. As a result CUDA device was assumed as a default. Which is not correct if user runs on some other device. For example, on XPU.
This patch saves full device information on forward() and uses it on backward() to get device type. Previously forward save only device index.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128671
Approved by: https://github.com/guangyey, https://github.com/soulitzer
Summary: [Here](ea588d7fd3/torch/_inductor/kernel/conv.py (L252)) in the `conv` lowering `dilation` is not `size_hint`-ed. This breaks if `dilation` is a symbolic expression (which we see in some internal models). The PR fixes it by adding a `size_hints`.
Test Plan:
```
$ python test/inductor/test_torchinductor.py -k test_convolution5
...
----------------------------------------------------------------------
Ran 2 tests in 7.329s
OK
```
Differential Revision: D59097019
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129631
Approved by: https://github.com/chenyang78
Summary: Somehow, using underscore alias of some builtin types breaks pyre
Test Plan:
All failed tests from D58983461 are passing:
```
buck2 test 'fbcode//mode/opt' fbcode//caffe2/torch/fb/training_toolkit/utils/tests:gpu_memory_utils_test-type-checking
buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/lib:device_util-type-checking
buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/lib:thompson_samplers_gpu-type-checking
buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/modules/retrieval/diversity/tests:combined_sampling_diversifier_test-type-checking
buck2 test 'fbcode//mode/dev-nosan' fbcode//dper_lib/silvertorch/modules/retrieval/diversity/tests:submodular_opt_test-type-checking
```
Differential Revision: D59029768
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129525
Approved by: https://github.com/XuehaiPan, https://github.com/clee2000, https://github.com/malfet
Summary: Add Shivam to the list of code owners for the profiler code paths, so that Shivam gets added to reviewers for PRs too.
Test Plan: CI
Differential Revision: D59072152
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129591
Approved by: https://github.com/sraikund16
Summary:
To fix the following failure cases:
For example, when `M, K, N = 245760, 656, 6560`, fp8 with compile fails due to `RuntimeError: mat2 must be col_major`.
---------
From the inductor generated code (https://fburl.com/everpaste/epcagkrd)
```
V0625 01:38:55.551000 140329914449920 torch/_inductor/scheduler.py:1623] [0/0] scheduling ComputedBuffer(name='buf12', layout=FixedLayout('cuda', torch.float8_e4m3fn, size=[656, 6560], stride=[6656, 1]),
... ...
V0625 01:38:56.194000 140329914449920 torch/_inductor/graph.py:1680] [0/0] [__output_code] buf12 = empty_strided_cuda((656, 6560), (6656, 1), torch.float8_e4m3fn)
... ...
V0625 01:38:56.194000 140329914449920 torch/_inductor/graph.py:1680] [0/0] [__output_code] return (buf10, buf2, buf5, buf6, reinterpret_tensor(buf11, (245760, 656), (1, 245760), 0), reinterpret_tensor(buf12, (6560, 656), (1, 6656), 0), )
... ...
V0625 01:39:12.098000 140312968167424 torch/_inductor/graph.py:1680] [1/0_1] [__output_code] assert_size_stride(permute_10, (6560, 656), (1, 6656))
... ...
V0625 01:39:12.098000 140312968167424 torch/_inductor/graph.py:1680] [1/0_1] [__output_code] buf8 = aten._scaled_mm.default(buf6, permute_10, buf7, reciprocal_3, None, None, torch.bfloat16)
```
Inductor gives the mat2 (`permute_10`) a different stride (`6656`) instead of using its shape[0] (`(6560, 656)`).
Therefore, the `stride[1] == shape[0]` condition fails.
To fix the issue, simply modify the `is_col_major` check to exclude this condition as it doesn't hold for all valid cases.
Test Plan:
Run the failed case again. It works with the fix.
-----
Sandcastle / GitHub CI will make sure the existing tests could still pass.
Reviewed By: vkuzo
Differential Revision: D58994704
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129521
Approved by: https://github.com/drisspg
This PR does 3 things:
1. Adds a copy-free strided->jagged layout conversion for NT
2. Adds a copy-free jagged->strided layout conversion for NT
3. Modifies and expands the .to() API to support the layout argument for the specific case of NT layout conversion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115749
Approved by: https://github.com/jbschlosser
Summary:
Use JK to control the release instead of using env variable to toggle the feature.
Note: sharing the store reduces shutdown races asn the TCPStore lifecycle is managed outside of trainer rank execution time.
Test Plan: CI
Differential Revision: D59071544
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129603
Approved by: https://github.com/d4l3k
As titled. Previously, __obj_flatten__ can run in a fake tensor mode, e.g. in process_input of aot_autograd, which is surrounded by a fake tensor mode. This causes the tensor ops inside __obj_flatten__ to run under fake tensor mode. However, tensors inside of script obejct are real tensors, this causes the fake tensor mode to error out saying that we need to first fakify fall the tensors (because allow_non_fake_inputs is set to True).
In this PR, we disable all the dispatch modes when running to_fake_obj.
Note that, the output of `__obj_flatten__` will be fakified and filled inside of the corresponding FakeScriptObject. So during traicng, we'll be using FakeScriptObject that has fake tensor contents.
Test Plan:
Add a new test: pytest test/export/test_torchbind.py -k test_compile_tensor_op_in_tensor_flatten
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129605
Approved by: https://github.com/angelayi
By installing torchao explicitly, as torchao-0.3.0 that was release recently to pypi introduced hard dependency to torch-2.3.1, which results in following cryptic error: `RuntimeError: operator torchvision::nms does not exist`
TODOs:
- Figure out what installs torchao from pypi rather than builds from source
- Add proper CI pin for torchao
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129620
Approved by: https://github.com/kit1980, https://github.com/huydhn
Recently we decided to split export IR into two different IRs (training vs inference). In the inference IR, one major change we decided to introduce was we wanted to keep the composite ops that user specified in the IR. This PR does that by overriding the CompositeImplicitAutograd decomp in export inference path.
Differential Revision: [D58701607](https://our.internmc.facebook.com/intern/diff/D58701607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128077
Approved by: https://github.com/bdhirsh
Summary:
Expose nlohmann json library so that it can be used from inside Pytorch. The library already exists in the `third_party` directory. This PR is making `nlohmann/json.hpp` header available to be used from `torch.distributed`.
The next PR makes actual use of this header.
imported-using-ghimport
Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D59035246
Pulled By: c-p-i-o
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129570
Approved by: https://github.com/d4l3k, https://github.com/malfet
Summary: We need to add the Rank information to the NCCL debug data so that kineto can infer all the necessary process group info such that on-demand can create distributedInfo metadata. Kineto portion will be added in a follow up diff
Test Plan: Tested in D58736045, this diff just splits the kineto and profiler instances
Differential Revision: D59028819
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129528
Approved by: https://github.com/aaronenyeshi
In case of ciflow, runs are triggered by a tag which is created by @pytorchbot, which breaks the logic of the runner determinator.
In case of tag triggers, extract the pr number from the tag name, fetch the pr and extract the user login from it.
Both the inline and standalone python scripts have been updated for consistency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129500
Approved by: https://github.com/malfet, https://github.com/zxiiro
* created fb internal implementation in `caffe2/torch/csrc/monitor/fb/instrumentation.cpp`
* uses `facebook::data_preproc::WaitCounterUs` under the hood by having `WaitCounterImpl` trivially subclass it.
* this makes `WaitCounterHandle` a glorified pointer to `facebook::data_preproc::WaitCounterUs` which is statically defined in the `STATIC_WAIT_COUNTER` macro making these pointers Meyer's singletons.
* `facebook::data_preproc::WaitCounterUs` uses 3 singletons:
1. `std::unique_ptr<DynamicCounter::State>` map — leaky singleton
2. `std::weak_ptr<WaitCounterUs::State>` map — leaky singleton
3. publisherSingleton — normal singleton since it manages resources (threads)
* `facebook::data_preproc::WaitCounterUs` actually owns shared pointers to the state and its destructor will remove it from the `std::weak_ptr<WaitCounterUs::State>` map when the reference count for the state hits 0.
* linked `caffe2/torch/csrc/monitor/fb/instrumentation.cpp` and added `//data_preproc/common:counters` (dpp dependency) to `caffe2/fb/fbcode/target_definitions.bzl`
* wrapped OSS null implementation in `#ifndef FBCODE_CAFFE2` so that internally we use the fb internal implementation.
as a follow-up I might move the counter implementation out of the data_preproc/counters library to a more common ai infra library?
Differential Revision: [D58458751](https://our.internmc.facebook.com/intern/diff/D58458751/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128605
Approved by: https://github.com/c-p-i-o
ghstack dependencies: #128466
Small improvements on runner determinator script:
* Don't do splitting of the issue comment, unless necessary;
* Match username against a set over a list;
* Match both triggering_actor and issue owner over only actor (to avoid edge cases, where we get `pytorch-bot[bot]`)
* Add stripping, to remove potential breaking and not visible whitespaces;
* Don't use linux.4xlarge as a runner: it should not depend on meta runners, for reliability;
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129462
Approved by: https://github.com/zxiiro, https://github.com/ZainRizvi
In `oneShotAllReduce`, ranks read data from peers in a round-robin fashion to load-balance NVLinks. However, the following reduction is also performed in the this order which is different across ranks. This can results in slight numerical differences across ranks, which can lead to a hang in data dependent applications like speculative decoding.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129501
Approved by: https://github.com/Chillee
TorchDynamo guard mechanism guards on the key order on the dictionaries if the user iterates over the dictionary. For standard dict, we can write a fast C++ implementation by using PyDict_Next. But with OrderedDict, we have to rely on `keys` Python API to get the key ordering. This makes guard evaluation slow.
With Dynamo inlining into inbuilt nn modules, I am seeing many guards over the OrderedDict on `_modules`, `_parameters`. From reading the code, I don't see any reason to not use standard dicts. I think OrderedDict was preferred over dict because of the ordering, but dicts are now ordered. With this PR, I am observing ~20% reduction in guard overhead of a HF model.
Functionality impact
- The only difference between dict and OrdedeDict is `move_to_end` method for OrderedDict ([link](https://stackoverflow.com/questions/34305003/difference-between-dictionary-and-ordereddict)). But the changes here are internal to nn module, and we do not use `move_to_end` for `_parameters`, `_modules` and `_buffers`. We use `move_to_end` for hooks but this PR keeps the OrderedDict for hooks untouched (we should still followup with hooks but in a separate PR).
Perf impact
- I dont anticipate any perf impact. `dict` is completely implemented in C. OrderedDict is Python wrapper over dict with only few method overridden ([link](https://stackoverflow.com/questions/34305003/difference-between-dictionary-and-ordereddict)).
Typing impact
- I dont anticipate any. For all the user visible methods for nn.Module, we don't expose the underlying `_modules` etc. We have iterators like `named_parameters` which return an Iterator of Parameter. So, no typing changes required.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129164
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #129163
Fixes#129079
Currently, the tensor object is loading correctly in-place, but the non-tensor object such as learning rate is not load correctly after f518cf811d, which is a regression introduced in 2.3.
This PR replaces tree_map_only and manual replacement of the state dict items with _tree_map_only and fixes the regression of non-tensor loading.
Test:
```
# test to make sure lr is loading correctly
python3 test/distributed/checkpoint/e2e/test_e2e_save_and_load.py -k test_init_state_dict
# test to make sure load on meta device model still works
python3 test/distributed/checkpoint/test_tp_checkpoint.py -k test_tp_checkpoint_load_on_meta_device
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129398
Approved by: https://github.com/fegin
Summary:
_decompose_exported_program() ran into an issue with trace_joint, where trace_joint() produces values with mismatching FakeModes. Adding fake mode context to aot_export_module() so this doesn't happen.
#thanks to tugsbayasgalan for the fix!
Test Plan: test_experimental
Differential Revision: D58977694
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129421
Approved by: https://github.com/tugsbayasgalan, https://github.com/zhxchen17
**Summary**
The logic for CommDebugMode module collective tracing is incorrect as it only worked for leaf module nodes on the model's module tree. If we had a sub-module that had a collective call along with a nested module inside it, the sub-module was not removed from the module_tracker parent set leading to double-counting collectives. This problem was addressed by checking to make sure the current sub-module was not already in the parent set. The output of the below test cases should remain the same.
**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing
2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128887
Approved by: https://github.com/XilunWu
ghstack dependencies: #128729
**Summary**
Currently, there is only an example file for comm_mode and its features. I have created test cases that mirror the examples while the more complicated test cases also ensure that comm_mode resets all variables when used multiple times in the same function. This test case suite will also help developers ensure that new code they add to comm_mode does not affect correctness of old features.
#128536
**Test Plan**
pytest test/distributed/_tensor/debug/test_comm_mode_features.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128729
Approved by: https://github.com/XilunWu
Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes.
What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here...
Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343
Approved by: https://github.com/Skylion007
Upload sccache stats to s3 instead of rockset
I don't think we use these anywhere, so it's ok to cut off the ingest into rockset right now.
We should consider deleting this entirely if we don't plan on using it
I will work on copying existing data over from rockset to s3
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129490
Approved by: https://github.com/huydhn, https://github.com/ZainRizvi
This PR removes `ProcessGroupCudaP2P` and changes async-TP to use `SymmetricMemory`. The async-TP implementation is still workspace-based, but it now doesn't require a buffer size to be specified upfront.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128762
Approved by: https://github.com/wanchaol
Inductor currently materialize a large sparse matrix in the backward pass for CrossEntropyLoss and load that to compute gradients of Softmax input. If we could fuse the sparse matrix computation to the consumer sides, we gonna have both perf and memory usage wins.
The Fx graph snippets that construct this aforementioned sparse matrix looks like:
```
full_default_3: "bf16[32768, 50257]" = torch.ops.aten.full.default([32768, 50257], 0, dtype = torch.bfloat16, layout = torch.strided, device = device(type='cuda', index=0), pin_memory = False)
scatter: "bf16[32768, 50257]" = torch.ops.aten.scatter.value(full_default_3, 1, where_2, -1.0); full_default_3 = where_2 = None
```
Leveraging the following observations:
- the scatter is applied upon a all zero (or more generally a const tensor)
- the index tensor for the scatter has a single element on the scatter dimension. In this case it's the label tensor
allow us to lower this 'scatter_upon_const_tensor' pattern to a pointwise kernel that can be easily fused with downstream kernels:
```
def inner_fn(idx):
selector_idx = list(idx)
selector_idx[dim] = 0 # can do this since the index tensor has a single element on the scatter dimension
selector = selector_loader(selector_idx)
return ops.where(
selector == ops.index_expr(idx[dim], torch.int64),
ops.constant(val, dtype),
ops.constant(background_val, dtype),
)
```
## Test result on microbenchmark
For the microbenchmark added as `test_cross_entropy_loss`, we improve latency from 47.340ms to 42.768ms, memory footprint from 10.524GB to 7.227GB on A100. (on H100, we improve latency from 27.54ms to 23.51ms, memory footprint from 10.574GB to 7.354GB).
The saving matches the back-of-envelope calculation. We avoid storing a BF16 tensor with shape [30K, 50K] which is about 3GB in size. On A100, avoid loading and storing such a tensor can roughly save 3GB x 2 / 1.5TBGS = 4ms
## Test result on llm.c
We also test this on llm.c and the saving is much larger especially for memory footprint. The reason is due to autotuning that allocates extra memory for benchmarking. (Check https://github.com/pytorch/pytorch/issues/129258 and https://github.com/pytorch/pytorch/pull/129399 for more details).
For llm.c PyTorch implementation on A100, we improve from
171K tokens/s , 33.6G peak memory usage to
180K tokens/s, 18.6G peak memory usage. (A **45%** saving of peak memory)
## Test on PyTorch 2.0 Dashboard
The optimization is quite general especially for transformers. We tested this on PyTorch2.0 dashboard. Here is the [result](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Mon%2C%2017%20Jun%202024%2018%3A07%3A51%20GMT&stopTime=Mon%2C%2024%20Jun%202024%2018%3A07%3A51%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/shunting314/158/head&lCommit=c62c55e29c65497d495217b6574bb36b0c4da7d4&rBranch=main&rCommit=0d25f096c1beaf8749932a3d6083ad653405ed71).
TLDR, for Huggingface benchmark suite, we get **6%** geomean perf improvement and **10%** geomean memory footprint improvement.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129043
Approved by: https://github.com/jansel, https://github.com/Chillee
If users pass `device_id` to init_process_group, they enable eager init
for the default group. Then if they subsequently call `new_group`, the
device_id argument is not required as it should be assumed to match the
one used for init_process_group.
However, both `init_process_group` and `new_group` apis share a helper
function, which expects a `device_id` value that defaults to None. When
it's None, eager initialization is disabled.
This PR ensures that if a device_id was passed to init_process_group,
the same device_id will automatically be fed into the helper function
for any new_group calls that follow.
**Test plan**
I found an existing test in CI `test_comm_split_subgroup` that failed after my change, because it was asserting that backend comm_split counter did not increment eagerly, and its behavior had changed to increment eagerly. I updated the test in the PR to pass with my change.
I also tested locally via simple program with TORCH_CPP_LOG_LEVEL=INFO and
observed eager initialization of the 'lows' and 'highs' PGs before the
'Here' print.
```
import torch
import torch.distributed as dist
dist.init_process_group(backend="nccl", device_id =torch.device(f"cuda:{torch.distributed.get_node_local_rank(0)}"))
dist.new_group([0, 1], group_desc="lows")
dist.new_group([2, 3], group_desc="highs")
print("Here")
torch.distributed.destroy_process_group()
```
Output:
https://gist.github.com/wconstab/88a5ba0b970244ca1f79133f989e0349
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129284
Approved by: https://github.com/pavanbalaji, https://github.com/fduwjj, https://github.com/d4l3k, https://github.com/nvcastet
This PR adds an alternative backend for Inductor, adding Composable Kernel Universal GEMM instances to the autotune instance selection.
The implementation is heavily influenced by the series of PRs which adds CUTLASS backend (https://github.com/pytorch/pytorch/issues/106991). The main differences are
(1) customizing compiler for the ROCm platform
(2) customizing template code generation for Composable Kernel Universal GEMM instances.
We provide config tuning knobs for balancing between instance sources compilation time and finding the best instance.
### Testing
Install the ck library
```
pip install git+https://github.com/rocm/composable_kernel@develop
```
Run the test
```
TORCH_LOGS=+torch._inductor \
pytest --capture=tee-sys test/inductor/test_ck_backend.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125453
Approved by: https://github.com/eellison, https://github.com/jansel
#### Issue
In jit.trace, torch.numel() is automatically cast to a `LongTensor`. But during conversion, we lost the casting part. `prim::NumToTensor` was previously converted to `torch.ops.aten.scalar_tensor`, which uses the same `dtype` as the input tensor instead of `LongTensor`. in this PR, we add a casting to convert it to the correct `dtype`.
#### Test Plan
We activate previously failing test case.
* `pytest test/export/test_converter.py -s -k test_implicit_constant_to_tensor_handling`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128761
Approved by: https://github.com/angelayi
We've been facing issues where TCPStore can successfully connect but then fail in the validate() function due to resets from listen backlog queue overflow when combined with reset enabled as well as long init times.
This PR does a few things:
* Retry that connect and validate up to the specified timeout.
* Use exponential backoff for the retry logic with jitter instead of a fixed 1s sleep.
* Eliminate the `sleep(std::chrono::milliseconds(numWorkers))` on init which can add significant delays to startup. This is no longer necessary per @XilunWu https://github.com/pytorch/pytorch/pull/116141
Test plan:
```
python test/distributed/test_store.py -v
./build/bin/BackoffTest
```
Will do internal testing with some large scale jobs to ensure TCPStore works correctly.
At 4k scale: 4x improvement
```
tristanr@devvm4382 ~/pt_tests [SIGABRT]> time TORCH_SHOW_CPP_STACKTRACES=1 python tcpstore_large_test.py (pytorch-3.10)
started 0
init 0
set 0
joined all
________________________________________________________
Executed in 1.98 secs fish external
usr time 0.93 secs 91.00 micros 0.93 secs
sys time 1.98 secs 954.00 micros 1.97 secs
tristanr@devvm4382 ~/pt_tests> conda activate torchdrive-3.10 (pytorch-3.10)
tristanr@devvm4382 ~/pt_tests> time TORCH_SHOW_CPP_STACKTRACES=1 python tcpstore_large_test.py (torchdrive-3.10)
started 0
init 0
set 0
joined all
________________________________________________________
Executed in 8.20 secs fish external
usr time 2.15 secs 0.00 micros 2.15 secs
sys time 2.76 secs 843.00 micros 2.76 secs
```
```py
import time
import os
import threading
from multiprocessing import Pool
WORLD_SIZE = 10000
import torch.distributed as dist
def run(rank):
should_log = rank % (WORLD_SIZE // 10) == 0
if should_log:
print(f"started {rank}")
store = dist.TCPStore(
host_name="devvm4382.nao0.facebook.com",
port=29500,
world_size=WORLD_SIZE,
is_master=rank == 0,
use_libuv=True,
)
if should_log:
print(f"init {rank}")
store.set(f"key{rank}", "1234")
if should_log:
print(f"set {rank}")
del store
def noop(rank):
pass
print("starting pool")
with Pool(WORLD_SIZE) as pool:
pool.map(noop, range(WORLD_SIZE), 1)
print("pool hot")
start = time.time()
pool.map(run, range(WORLD_SIZE), 1)
print("run finished", time.time()-start)
```
```
tristanr@devvm4382 ~/pt_tests> python tcpstore_large_test.py (pytorch-3.10)
starting pool
pool hot
started 0
[W624 16:58:09.086081750 TCPStore.cpp:343] [c10d] Starting store with 10000 workers but somaxconn is 4096.This might cause instability during bootstrap, consider increasing it.
started 1000
init 1000
set 1000
started 2000
init 2000
set 2000
started 3000
init 3000
set 3000
started 4000
init 4000
set 4000
started 5000
init 5000
set 5000
started 6000
init 6000
set 6000
started 7000
init 7000
set 7000
started 8000
init 8000
set 8000
started 9000
init 9000
set 9000
init 0
set 0
run finished 0.705092191696167
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129261
Approved by: https://github.com/rsdcastro, https://github.com/wconstab, https://github.com/kurman, https://github.com/XilunWu, https://github.com/c-p-i-o
FSDP2 accumulates gradients for sharded parameters outside of the autograd engine's normal accumulation logic. We can respect registered post-accumulate-grad hooks by running them manually.
**Discussion**
Discussing with @soulitzer, changing FSDP2 to make the sharded parameters autograd leaves requires nontrivial changes to FSDP and some changes to the autograd engine (around forward vs. backward streams) where the changes may not preserve eager-mode performance and/or add some complexity.
Under the FSDP2 design, the sharded parameters never participate in autograd, so calling `register_post_accumulate_grad_hook` on them would otherwise be a no-op. In other words, there is virtually no chance for FSDP2 incorrectly re-running the hook when it should not.
Given these, a reasonable near-term solution is for FSDP2 to run the post-accumulate-grad hooks manually.
**Caveats**
- Running `foreach=False` optimizer _per parameter tensor_ incurs significantly higher CPU overhead compared to `foreach=True` (partially due to `DTensor` being a `__torch_dispatch__` tensor subclass).
- On preliminary benchmarking on Llama3-8B on 8 GPUs, this CPU overhead is mostly tolerable, but on smaller # of GPUs or a less compute-intensive model, this may not be.
- One solution for native Adam/AdamW is to use `fused=True`, which makes both the CPU overhead lower and GPU compute faster. However, this is generally not an option for user-defined optimizers.
- If this CPU overhead blocks adoption of this feature, then we should seriously consider an FSDP-specific API like `register_post_backward_hook(params: List[nn.Parameter]) -> None` that allows the user to see all parameters in the `FSDPParamGroup` together for the hook so that the user can still run a `foreach=True` optimizer step on that `List[nn.Parameter]`.
- The post-accumulate-grad hook runs in the reduce-scatter stream. Our current stream handling logic does not have the default stream wait for the reduce-scatter stream until the end of backward. Unless we add that, we cannot simply run the post-accumulate-grad hook in the default stream.
- This means that optimizer compute will overlap with backward compute, which may slowdown end-to-end execution slightly (e.g. due to SM contention or wave quantization effects). For example, on Llama3-8B, we see about ~3% decrease in MFU when running optimizer in backward even though the optimizer steps are fully overlapped and there are no CPU boundedness issues.
- This PR's goal is only to run the hook manually. State dict etc. for optimizer-in-backward is out of scope.
**Experiments (torchtitan)**
- Llama3-8B on 2 GPUs, local batch size 1, with full activation checkpointing, and bf16/fp32 mixed precision:
- Without optimizer-in-backward: 82.03 GiB reserved memory; 28.1% MFU
- With optimizer-in-backward (`foreach=False`): 72.84 GiB reserved memory; 28.9% MFU (speedup from more of optimizer step overlapped)
- With optimizer-in-backward (`fused=True`): 70.84 GiB reserved memory; 30.4% MFU
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129450
Approved by: https://github.com/weifengpy
**Performance mode Issue**: When dynamo benchmarks performance warm-up failed, the result will be not written into csv file. But the accuracy will be written as `fail_to_run` even when dynamo pass failed. So the accuracy model number is not aligned with performance model number for each of their csv files.

- **Fix**: The warm-up failed models will be recorded into csv file shown as following:

**Accuracy mode issue**: `detectron2_fasterrcnn_r` models failed on accuracy mode, but was tested successfully on performance mode. The accuracy failure is same as PR ee557d8f61.
```
Dynamic Shape:
Traceback (most recent call last):
File "benchmarks/dynamo/torchbench.py", line 449, in <module>
torchbench_main()
File "benchmarks/dynamo/torchbench.py", line 445, in torchbench_main
main(TorchBenchmarkRunner(), original_dir)
File "/workspace/pytorch/benchmarks/dynamo/common.py", line 3650, in main
process_entry(0, runner, original_dir, args)
File "/workspace/pytorch/benchmarks/dynamo/common.py", line 3582, in process_entry
return run(runner, args, original_dir)
File "/workspace/pytorch/benchmarks/dynamo/common.py", line 4163, in run
assert marked, f"nothing in example_inputs had a dim with {batch_size}"
AssertionError: nothing in example_inputs had a dim with 4
```

- **Fix**: same as PR ee557d8f61, the batch_size will be skipped to set as 4 when testing dynamic shapes.
Dynamic shapes passrate improved from 89% -> **95%**
| Comp Item | Compiler | suite | before | After fix |
|-----------|----------|------------|------------|------------|
| Pass Rate | Inductor | torchbench | 89%, 73/82 | 95%, 79/83 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126764
Approved by: https://github.com/jansel
Current logic to set the HAS_SBGEMM flag is ignored in case the BLAS libraries are found already, ie, if set from environment variable BLAS=OpenBLAS . If BLAS_LIBRARIES are already set the code to find if BLAS_LIBRARY has sbgemm is never executed. The following commit brings out this logic outside unconditionally.
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125227
Approved by: https://github.com/malfet
MacOS uses case-insensitive filesystem by default, but it's better to specify include path using proper capitalization
Should fix
```
MultiTensorApply.h:4:10: warning: non-portable path to file '<ATen/native/mps/operations/FusedOptimizerOps.h>'; specified path differs in case from file name on disk [-Wnonportable-include-path]
#include <Aten/native/mps/operations/FusedOptimizerOps.h>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129474
Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/qqaatw
Fixes#125745
Bug source: When addition requires broadcasting, adding complex numbers is not implemented correctly in `torch/_inductor/decomposition.py` because `x.view(x.real.dtype)` would multiply the last dimension by 2, and then broadcasting wouldn't work.
Fix: re-shape the complex tensors after view and before broadcasting.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129044
Approved by: https://github.com/zou3519, https://github.com/lezcano
For #125323
* Fixes typing for python < 3.10
* Fixes#129390
For #124688
* Improved attribution by registering `register_hook` and `post_accumulate_grad_hook` on params.
* Fixed pre-mature per module bw peak state initialization for AC.
* This improves per-module stats, global `peak_mem` was already accurate and remains unaffected.
For #128508
* When AC is applied to a `mod (nn.Module)` the backward order of execution is `pre-bw -> pre-fw -> post-fw -> post-bw`. Since the `ModTracker` maintains the `parents` attribute as set, the `post-fw` during backward was prematurely removing it from parents.
* With the fix we now maintain a per-module counter and only remove a module from `parents` when its counter goes to 0.
* Added tests to ensure this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129400
Approved by: https://github.com/awgu, https://github.com/huydhn
Changes by apply order:
1. Replace all `".."` and `os.pardir` usage with `os.path.dirname(...)`.
2. Replace nested `os.path.dirname(os.path.dirname(...))` call with `str(Path(...).parent.parent)`.
3. Reorder `.absolute()` ~/ `.resolve()`~ and `.parent`: always resolve the path first.
`.parent{...}.absolute()` -> `.absolute().parent{...}`
4. Replace chained `.parent x N` with `.parents[${N - 1}]`: the code is easier to read (see 5.)
`.parent.parent.parent.parent` -> `.parents[3]`
5. ~Replace `.parents[${N - 1}]` with `.parents[${N} - 1]`: the code is easier to read and does not introduce any runtime overhead.~
~`.parents[3]` -> `.parents[4 - 1]`~
6. ~Replace `.parents[2 - 1]` with `.parent.parent`: because the code is shorter and easier to read.~
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129374
Approved by: https://github.com/justinchuby, https://github.com/malfet
On Jetson IGX, `python test/test_cuda.py -k test_graph_capture_oom` fails with the following error:
```
RuntimeError: NVML_SUCCESS == r INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":841, please report a bug to PyTorch.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.10/unittest/case.py", line 59, in testPartExecutor
yield
File "/usr/lib/python3.10/unittest/case.py", line 591, in run
self._callTestMethod(testMethod)
File "/usr/lib/python3.10/unittest/case.py", line 549, in _callTestMethod
method()
File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2759, in wrapper
method(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2759, in wrapper
method(*args, **kwargs)
File "/opt/pytorch/pytorch/test/test_cuda.py", line 2255, in test_graph_capture_oom
with self.assertRaisesRegex(RuntimeError, oom_regex):
File "/usr/lib/python3.10/unittest/case.py", line 239, in __exit__
self._raiseFailure('"{}" does not match "{}"'.format(
File "/usr/lib/python3.10/unittest/case.py", line 163, in _raiseFailure
raise self.test_case.failureException(msg)
AssertionError: "out of memory" does not match "NVML_SUCCESS == r INTERNAL ASSERT FAILED at "/opt/pytorch/pytorch/c10/cuda/CUDACachingAllocator.cpp":841, please report a bug to PyTorch. "
```
This is a known issue as nvml support on Jetson is limited, and the OOM reporting in CUDACachingAllocator.cpp requires nvml to be properly loaded, which fails on Jetson.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128661
Approved by: https://github.com/eqy, https://github.com/atalman
cuDNN v8.x added a graph-capturable CTCLoss, which slots "neatly" into the `Tensor` variant
~~WIP as cuDNN has a restriction on the max target length (255), but this is not checkable in the graph-capture case, so the UX around warnings/error-messages here might need to be tuned...~~
Currently checks restriction on max target length during warmup run(s), and bails out during capture if this constraint was violated during warmup.
CC @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128271
Approved by: https://github.com/ezyang, https://github.com/malfet
Since we use [`DEFAULT_PROTOCOL=2`](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L62), some functions/classes that were renamed from python 2-->3 will be pickled with their python2 name. This PR ensures that when a mod `GLOBAL <python2_mod>.<python2_name> ` is encountered, [following the strategy used by pickle](https://github.com/python/cpython/blob/main/Lib/pickle.py#L1590C13-L1593C63) it is properly mapped to `<python3_mod>.<python3_name>`.
This fix ensures that `add_safe_globals` works properly for such functions/classes (i.e. users will allowlist the python3 func and the weights_only unpickler will do the appropriate translation when checking whether a class was allowlisted).
An example is as follows:
`__builtin__` was named to `builtins`, see the [release notes for Python 3.0](https://docs.python.org/3/whatsnew/3.0.html)
> Renamed module `__builtin__` to [`builtins`](https://docs.python.org/3/library/builtins.html#module-builtins) (removing the underscores, adding an ‘s’). The __builtins__ variable found in most global namespaces is unchanged. To modify a builtin, you should use [builtins](https://docs.python.org/3/library/builtins.html#module-builtins), not `__builtins__`!
However, since we use [`DEFAULT_PROTOCOL=2`](https://github.com/pytorch/pytorch/blob/main/torch/serialization.py#L62), builtins will be pickled with their module string as `__builtin__`.
```python
>>> import pickle
>>> import pickletools
>>> print.__module__
'builtins'
>>> with open('print.pkl', 'wb') as f:
>>> pickle.dump(print, f, protocol=2) # 2 because this is the default protocol used by pytorch
>>> with open('print.pkl', 'rb') as f:
>>> pickletools.dis(f)
0: \x80 PROTO 2
2: c GLOBAL '__builtin__ print' # pickle saves the module string as __builtin__ !!! :(
21: q BINPUT 0
23: . STOP
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129244
Approved by: https://github.com/albanD
Expect the username in the runner rollover issue (https://github.com/pytorch/test-infra/issues/5132) to be prefixed with a "@".
This will make typos way less likely since github's autocomplete/autoformating will help out
For now, I've updated the issue to have usernames both with and without the @ while this change rolls out
Testing:
Ran the script locally on both this issue and a new test issue and verified they both had the expected output:
```
(venv) (base) ➜ ~/pytorch git:(zainr/improve-get-workflow-type)
python .github/scripts/get_workflow_type.py --github-token github_pat_*** --github-issue 5132 --github-user ZainRizvi --github-branch "zainr/stuff"
{"label_type": "lf.", "message": "LF Workflows are enabled for ZainRizvi. Using LF runners."}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129246
Approved by: https://github.com/zxiiro, https://github.com/huydhn
# Compile time for eager backend
## AlbertForMaskedLM
No inlining - 3.65 seconds
Inlining on main - 7.48 seconds
Inlining + this PR - 6.70 seconds
## MobileBertForMaskedLM
No inlining - 26.90 seconds
Inlining on main - 48.21 seconds
Inlining + this PR - 43.85 seconds
*Next PR in the stack makes the total compile time better/comparable to no inlining*
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129315
Approved by: https://github.com/jansel
ghstack dependencies: #129316
It's embarrassing that there is a hidden double clone bug in coordinate descent tuning.
In `CachingAutotuner.coordinate_descent_tuning`, we clone mutated args to make sure benchmarking does not cause numerical problems. But latter on in `CachingAutotuner.bench` we do that again.
This double clone is fine if
- the tensor is small
- the allocation of the tensor is not on the critical path for memory footprint.
But neither holds for quite common usage of cross entropy loss.
This is related to the memory usage debugging in https://github.com/pytorch/pytorch/pull/129043 . Note that the general issue that peak memory usage increasing due to autotuning still exists. This bug just makes it worse (since we double allocate).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129399
Approved by: https://github.com/Chillee, https://github.com/jansel
Volta(sm_7x) do not have a HW support for bfloat16 datatype, and while it is is emulated to ted in software, so PyTorch eager can use bfloat16 tensors, but not in Triton. So if graph with either CUDA bf16 input or output tensors is used, raise warnings and skip the frame.
Add optional parameter `including_emulation` to `torch.cuda.is_bf16_supported` method and call it from `torch._inductor.compile_fx. _check_triton_bf16_support`.
Test plan: Modify `is_bf16_supported` to return False and see that warning is generated
Fixes https://github.com/pytorch/pytorch/issues/118122 and https://github.com/pytorch/pytorch/issues/118581
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129288
Approved by: https://github.com/eqy, https://github.com/jansel
This PR does two things:
1. it duplicates the fake script object because aot_export trace the program twice. The result of tracing in the first time would cause the tracing result of second time be wrong.
2. Also add a new test for methods that return constant outputs. Before the PR, there's is no meta["val"] for these nodes because fx won't track these constants. We still need to preserve these constant return operators in the graph because torchbind objects are stateful and deleting it would remove the implicit state mutation inside of the object.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128844
Approved by: https://github.com/angelayi
Changes:
1. Make some arguments positional-only as we only support Python 3.8+
2. Clean up `torch.typename(obj)` implementation.
3. Update type annotations., especially `is_tensor()` and `is_masked_tensor()` using `TypeGuard`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129001
Approved by: https://github.com/malfet
Changes:
1. Make some arguments positional-only as we only support Python 3.8+
2. Clean up `torch.typename(obj)` implementation.
3. Update type annotations., especially `is_tensor()` and `is_masked_tensor()` using `TypeGuard`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129001
Approved by: https://github.com/malfet
Previously, the `FSDPCommContext` only defines the stream attributes when `FSDPCommContext.init` is called from lazy initialization. This means that if the user calls `module.unshard()` before lazy init (e.g. first forward pass), then it would error in `wait_for_unshard()`. This PR fixes this by making sure that the stream attributes are defined, only with the default stream, at construction time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129241
Approved by: https://github.com/Skylion007, https://github.com/weifengpy
**Summary**
In int8 GEMM Template, we will view the input from 3D to 2D and view the output back to 3D for QLinear which makes the output of this QLinear as `view`. So, if this output view inputs to a QLinear-Binary fusion which breaks the assumption of QLinear-Binary with post op inplace `sum`. We change the postop name from inplace `sum` to outplace `add` for this case which is similar as FP32/BF16 Linear Inplace as in 1208347d09/torch/_inductor/fx_passes/mkldnn_fusion.py (L541-L543).
**TestPlan**
```
clear && numactl -C 56-111 -m 1 python -u -m pytest -s -v inductor/test_mkldnn_pattern_matcher.py -k test_qlinear_dequant_promotion_cpu_input_dim_exceeds_2
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128808
Approved by: https://github.com/jgong5
ghstack dependencies: #128804
This is a copy of Brian's PR https://github.com/pytorch/pytorch/pull/128754, with some changes in the test_distributed_patterns.py unit tests to more closely reflect FSDP2 patterns. Also disabled two tests `test_input_mutation_storage_resize_up_down` and `test_input_mutation_storage_resize_not_supported` in test_aotdispatch.py until we figure out the right behavior for them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129203
Approved by: https://github.com/bdhirsh
* __->__ #125323
### Why do we need the FSDP Memory Tracker?
**Tuning Decisions**
1. What is the expected peak memory with current configuration?
2. If I change my FSDP wrapping, how much effect will it have on peak memory?
3. What is the best batch size to use?
4. What is the maximum sequence length that one can run with current configuration?
5. How does increasing/decreasing the “DP” world size affect peak memory?
6. How much memory do I save if I move the optimizer to the CPU?
7. Which activation checkpointing policy should I use?
8. If I have various SAC policies, How do they compare against each other?
9. What happens if I apply different SAC policies to different FSDP units?
10. If I make my gradient reduction in fp32, what effect will it have on memory?
11. If I want to use a custom mixed precision policy, how will it affect the peak memory?
12. When does it make sense to use HSDP?
13. Can I reshard to a smaller mesh without increasing peak memory substantially?
14. Can safely disable post forward reshard without causing an OOM?
**Debugging**
1. Which module contributes most to activation memory?
2. Which FSDP unit is holding a lot of unsharded memory?
3. AC is not releasing memory?
The FSDP2 Memory Tracker addresses all of the above. It is based on:
* #124688
* #128508
Example and Output:
```
if __name__== "__main__":
from contextlib import nullcontext
from functools import partial
import torch
from torch.distributed._composable import checkpoint
from torch.distributed._composable.fsdp import (
CPUOffloadPolicy,
fully_shard,
MixedPrecisionPolicy,
)
from torch.distributed._tensor import DeviceMesh
from torch.distributed._tools.fsdp2_mem_tracker import FSDPMemTracker
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.testing._internal.distributed._tensor.common_dtensor import (
ModelArgs,
Transformer,
TransformerBlock,
)
from torch.testing._internal.distributed.fake_pg import FakeStore
dev = torch.device("cuda:0")
torch.cuda.set_device(dev)
world_size = 4
store = FakeStore()
torch.distributed.init_process_group(
"fake", rank=0, world_size=world_size, store=store
)
mesh = DeviceMesh("cuda", torch.arange(0, world_size))
torch.cuda.empty_cache()
torch.manual_seed(42)
use_fake_mode = False
with FakeTensorMode() if use_fake_mode else nullcontext():
vocab_size = 8192
bsz, seq_len = 32, 1024
with torch.device(dev):
model_args = ModelArgs(
n_layers=2,
n_heads=16,
vocab_size=vocab_size,
max_seq_len=seq_len,
dropout_p=0.1,
)
model = Transformer(model_args)
foreach = True
mp_policy = MixedPrecisionPolicy(param_dtype=torch.bfloat16, reduce_dtype=torch.float32)
offload_policy = CPUOffloadPolicy(pin_memory=not use_fake_mode)
reshard_after_forward = True
fsdp_config = {
}
fully_shard_fn = partial(
fully_shard,
mesh=mesh,
reshard_after_forward=reshard_after_forward,
offload_policy=offload_policy,
mp_policy=mp_policy,
)
for module in model.modules():
if isinstance(module, TransformerBlock):
checkpoint(module, preserve_rng_state=not use_fake_mode)
fully_shard_fn(module)
fully_shard_fn(model)
optim = torch.optim.Adam(model.parameters(), lr=1e-2, foreach=foreach)
torch.manual_seed(42)
inp = torch.randint(0, vocab_size, (bsz, seq_len), device=dev)
torch.cuda.reset_accumulated_memory_stats()
torch.cuda.reset_peak_memory_stats()
fmt = FSDPMemTracker(model, optim)
fmt.track_inputs((inp,))
with fmt:
for iter_idx in range(2):
loss = model(inp).sum()
loss.backward()
optim.step()
optim.zero_grad()
if iter_idx == 0:
fmt.reset_mod_stats()
mem_stats = torch.cuda.memory_stats()
tracker_peak = fmt.get_tracker_snapshot("peak")[dev]["Total"]
cuda_peak_active = mem_stats["active_bytes.all.peak"]
fmt.display_modulewise_snapshots(depth=4, units="MiB", tabulate=True)
fmt.display_snapshot("peak", units="MiB", tabulate=True)
print(
f"peak active: {cuda_peak_active / (1024**3)} GiB | "
f"Tracker Max: {tracker_peak / (1024 ** 3)} GiB"
)
if not use_fake_mode:
print(f"Accuracy: {tracker_peak/cuda_peak_active}")
try:
torch.distributed.destroy_process_group()
except Exception as e:
print(e)
```
<img width="1236" alt="Screenshot 2024-06-21 at 5 16 49 PM" src="https://github.com/pytorch/pytorch/assets/12934972/9be40b8b-e635-4112-b111-418413e6b959">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125323
Approved by: https://github.com/awgu
This code is unused because we just inline the `.parameters` call. The code was also wrong because side-effects only track the first level of mutations. An object might not marked mutated if one of the child objects (like a dict) is mutated.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129316
Approved by: https://github.com/jansel
This PR:
- moves some of the dtype-string utilities into ScalarType.{h, cpp}
- adds a new utility to get a mapping from dtype name to the C++ dtype
- the perser now checks if the string is a dtype name; if it is then it
pulls the c++ dtype from the mapping.
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129189
Approved by: https://github.com/albanD
ghstack dependencies: #129177, #129178, #129179
Currently if `x` is a CUDA tensor, calling `x.untyped_storage().resize_()` seems to always go into the `built without cuda` branch of `resize_storage_bytes_()` regardless of whether PyTorch is built with CUDA. I suspect this is because `inductor_ops.cpp` is only included in `libtorch_cpu.so` thus doesn't have the `USE_CUDA` information or ability to link to CUDA-related functions.
This PR moves `resize_storage_bytes_()` related custom op functions out of `inductor_ops.cpp` into its standalone file `resize_storage_bytes.cpp` to be included in `libtorch_python.so` instead. This mimics the setup for `StorageMethods.cpp`. This way, `resize_storage_bytes_()` can have access to the CUDA-related functions, which passes the CUDA unit test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129215
Approved by: https://github.com/jansel
**Summary**
Currently, only way for users to view the module tracing table is to print in the console which could be hard to read. I have added the functionality to comm_debug_mode for a user to log the module tracing table to output.txt file giving the user more options to view module tracing. I have implemented the use case in the module tracing examples. The expected output is shown below for MLPModule tracing:
<img width="349" alt="Screenshot 2024-06-14 at 10 39 07 AM" src="https://github.com/pytorch/pytorch/assets/50644008/a05288a9-3cdb-483b-8e27-daab50da6251">
**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing
2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128721
Approved by: https://github.com/tianyu-l, https://github.com/XilunWu
ghstack dependencies: #128720
**Summary**
The previous example file would run all examples at the same time, leading to confusing output as the 4 processors would mix up the order. In order to fix this, I have added the functionality to choose which example to run to make it easier for users to read the output. Due to importing from torch.testing._internal.distributed._tensor.common_dtensor, the argparser from a file in the dependency tree would overwrite the argparser that I attempted to place in the example file. As a result, I created an argparser in a different file and imported it above previously mentioned import.
**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_distributed_sharding_display
2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLPStacked_distributed_sharding_display
3. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e MLP_module_tracing
4. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_module_tracing
5. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -h
The first four outputs will be the same as the outputs seen in previous PRs. The expected output for help argument is seen below:
<img width="931" alt="Screenshot 2024-06-14 at 10 25 06 AM" src="https://github.com/pytorch/pytorch/assets/50644008/547ca112-1e7a-4769-857a-558292c6fe7b">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128720
Approved by: https://github.com/XilunWu
Significant bytecode generation API change!
The new suggested convention to generating bytecode to call a function is now to wrap instructions that push a callable to the stack with `add_push_null`, then that callable is called with `create_call_function` with `push_null=False` (see diff for examples).
In Python 3.13, NULL is now expected to be pushed after the callable. In <=3.12, the NULL was pushed before the callable. This change abstracts away the exact placement of the NULL, but the developer must be aware that a NULL may be needed when codegen'ing a callable.
This abstraction also reduces the need for the `push_null=True` option in `create_call_function`, which removes the need to rotate a NULL to the right place on the stack with a sequence of `SWAP` instructions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129172
Approved by: https://github.com/jansel
As titled. If `expr1` `expr2` are int, don't need to do `.xreplace`.
See example error:
```
UserError: L['args'][0][0].size()[1] = 35 is not equal to L['args'][0][2].size()[1] = 23
```
Summary:
Test Plan:
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129271
Approved by: https://github.com/lezcano
`WeakDep`s force readers to have completed before a mutation overwrites the
buffer, but we want to allow fusions to occur for inplace mutations where the
same index is read and written.
Currently this is achieved by:
1. Identifying the buffers used by the mutating op in its `dep_closure`
2. Not creating `WeakDep`s for buffers in the `dep_closure`
3. Fixing up any bad fusions that might occur by an extra check in `can_fuse_vertical`
So we are first over-agressive in removing `WeakDep`, then add an ad-hoc fixup.
This PR instead emits all `WeakDep`s and adds a `fusable_weak_dep` check to
`can_fuse_vertical` which selectively allows inplace operation to fuse.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128979
Approved by: https://github.com/lezcano
ghstack dependencies: #129082, #129083
The nodes are already topologically sorted by this point, so DCEing a chain of
nodes will take one full iteration per node. Simply reversing the iteration
order means all users will be removed before checking a node.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129082
Approved by: https://github.com/lezcano
Summary:
This forward fixes this diff:
D58699985
Since we have a few things in flight it would be much better to forward fix this test
Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:test_inductor_cuda -- --exact 'caffe2/test/inductor:test_inductor_cuda - test_red_followed_by_transposed_pointwise (caffe2.test.inductor.test_torchinductor.TritonCodeGenTests)'
Differential Revision: D58767577
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129037
Approved by: https://github.com/vkuzo
Summary: Make the recordAnnotations' Record function callback lazily initialize when record memory history starts. This will help reduce the impact on Time To First Batch metric.
Test Plan: CI and ran locally.
Differential Revision: D58875576
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129242
Approved by: https://github.com/zdevito
This diff introduces AOTAutogradTestWithCache, which runs AOTAutogradTests with both dynamo and AOTAutogradCache.
To do this, for any verify_aot_autograd() calls in the original tests, we run compiled_f an extra time. We also turn on a new strict mode that throws any time a cache is missed due to weird reasons, like BypassAOTAutogradCache or FxGraphCacheMiss.
We use a mocked version of FXGraphCache to decrease the number of variables for these tests. The normal tests in test_aot_autograd_cache.py will still run with FXGraphCache. I might change my mind and unmock these in the future.
In total, 87 of the tests pass naturally. None of the tests fail in non strict cache mode, so the cache never crashes, it just misses more often than we'd like. The remaining 27 tests fail due to relatively simple (though not necessarily easy to fix) reasons. I'll fix the remaining test failures in the next few PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128222
Approved by: https://github.com/bdhirsh
Summary:
Add '`TORCH_LOGS=+fsdp`' in the CLI to print fsdp logs
Example:
`TORCH_LOGS=+fsdp torchrun --standalone --nproc_per_node=2 run_fsdp.py`
Description:
Add logging to `FSDPParamGroup.pre_forward`, `FSDPParamGroup.post_forward`, `FSDPParamGroup.pre_backward`, and `FSDPParamGroup.post_backward`, `FSDPState._root_pre_forward` if is the root, and `FSDPState._root_post_backward_final_callback`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128663
Approved by: https://github.com/weifengpy, https://github.com/awgu
cuDNN v8.x added a graph-capturable CTCLoss, which slots "neatly" into the `Tensor` variant
~~WIP as cuDNN has a restriction on the max target length (255), but this is not checkable in the graph-capture case, so the UX around warnings/error-messages here might need to be tuned...~~
Currently checks restriction on max target length during warmup run(s), and bails out during capture if this constraint was violated during warmup.
CC @ptrblck
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128271
Approved by: https://github.com/ezyang
Fixes https://github.com/pytorch/pytorch/issues/128611
We detach using tensor_data, which already preserves the version counter, so there is no reason to save it prior to unpacking:
```
at::TensorBase VariableHooks::tensor_data(const at::TensorBase& self) const {
TORCH_CHECK(self.defined(), "cannot call tensor_data() on undefined tensor");
auto self_impl_copy = self.unsafeGetTensorImpl()->shallow_copy_and_detach(
/*version_counter=*/self.unsafeGetTensorImpl()->version_counter(),
/*allow_tensor_metadata_change=*/
self.unsafeGetTensorImpl()->allow_tensor_metadata_change());
return at::Tensor(self_impl_copy);
}
```
This changes the behavior when hooks are involved:
- Previously, if you had a hook that replaced the saved tensor with an entirely new tensor, we would've smashed the saved version counter onto that during unpack, which is not quite correct because the tensor returned by user's pack hook is not necessarily aliased to the tensor originally being saved (unlikely), and even if it were, the version counter would already be shared, if the user did their operations not in inference mode (unlikely).
- In this PR, we restore the version counter using the version counter from the unpack hook's output.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128545
Approved by: https://github.com/albanD
ghstack dependencies: #125795
Summary: We need to redefine RE_PYTORCH_PREPROCESSOR here since in hipify_torch, it will apply positive lookbehind (?<=\W) and lookahead (?=\W) to the pattern to avoid matching keyword at the beginning and end of code line. However, this can happen in codegen, which will cause the pattern to not match.
Test Plan:
```
buck2 run //caffe2/test/inductor:test_cpp_wrapper_hipify
```
```
File changed: fbcode//caffe2/test/inductor/test_cpp_wrapper_hipify.py
Buck UI: https://www.internalfb.com/buck2/395155fa-b2dc-4892-8c71-74e52c65fa2f
Note: Using experimental modern dice
Network: Up: 0B Down: 0B (reSessionID-8fcfc520-755c-48f9-bacc-507c62f59231)
Jobs completed: 10947. Time elapsed: 0.5s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
BUILD SUCCEEDED
/data/users/zhuoran/fbsource/buck-out/v2/gen/fbcode/15b7034708b669be/caffe2/test/inductor/__test_cpp_wrapper_hipify__/test_cpp_wrapper_hipify#link-tree/torch/_utils_internal.py:282: NCCL_DEBUG env var is set to None
/data/users/zhuoran/fbsource/buck-out/v2/gen/fbcode/15b7034708b669be/caffe2/test/inductor/__test_cpp_wrapper_hipify__/test_cpp_wrapper_hipify#link-tree/torch/_utils_internal.py:300: NCCL_DEBUG is forced to WARN from None
test_hipify_aoti_driver_header (caffe2.test.inductor.test_cpp_wrapper_hipify.TestCppWrapperHipify) ... ok
test_hipify_basic_declaration (caffe2.test.inductor.test_cpp_wrapper_hipify.TestCppWrapperHipify) ... ok
test_hipify_cross_platform (caffe2.test.inductor.test_cpp_wrapper_hipify.TestCppWrapperHipify) ... ok
----------------------------------------------------------------------
Ran 3 tests in 0.262s
OK
```
e2e test:
```
TORCH_LOGS="output_code,graph_code" buck2 run mode/{opt,amd-gpu,inplace} -c fbcode.triton_backend=amd -c fbcode.enable_gpu_sections=true //aiplatform/modelstore/model_generation/gpu_lowering_service:gpu_lowering_cli -- --model_input_path="ads_storage_fblearner/tree/user/facebook/fblearner/predictor/936383960/0/gpu_lowering/input.merge" --model_output_path="ads_storage_fblearner/tree/user/facebook/fblearner/predictor/936383960/0/gpu_lowering/mi300_inductor_output.merge" --lowering_backend AOT_INDUCTOR --is_ads_model False --aot_inductor_lowering_settings_json='{"use_scripting":true,"preset_lowerer":"standalone_hstu_cint;disable_new_lowering_weights;disable_dper_passes:passes=fuse_parallel_linear_no_weight_change","precision":4,"output_precision":4, "remove_unexpected_type_cast":false, "sample_input_tile_factor":32}' 2>&1 | tee local_benchmark_log.txt
```
Differential Revision: D58705216
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128912
Approved by: https://github.com/desertfire
Summary: Currently AOTI does a two-pass compilation for the CUDA backend. In the first pass AOTI generates Python code, runs the generated code once with real example inputs to trigger Triton kernel compilation and tuning, and then AOTI runs the second pass to generate cpp code and compiles that into a shared library.
There are several problems with this approach when we want to enable the cpp wrapper mode for JIT Inductor:
* Compilation time: JIT compilation is more sensitive to compilation time than AOT compilation. The two-pass approach does add extra overhead for compilation.
* Peak memory size: when executing the first-pass generated code with real inputs, some inputs need to be cloned to avoid side effect coming from input mutation. This can raise the high-water mark for memory consumption.
* Missing triton kernel autotuning: Because kernel autotune depends on the kernel being executed in the two-pass approach, some kernels will not be autotuned when a model contains control flow such as torch.if or torch.while.
This PR is the first step towards solving these problems by moving Triton kernel autotuning to the compile time and use random inputs for tuning. The cpp wrapper codegen still has two passes, but in the first pass, Inductor will generate a separate code just for kernel autotuning, with https://gist.github.com/desertfire/606dc772b3e989b5e2edc66d76593070 as an example, and we no longer need to execute the model after the first-pass finishes. After that we rerun a second pass to generate cpp code. This reduces peak memory consumption and enables kernel autotuning when there is control flow. Truly making the codegen into one-pass will come later once this solution is proven stable and generates as performant kernels as before.
Differential Revision: [D58782766](https://our.internmc.facebook.com/intern/diff/D58782766)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129057
Approved by: https://github.com/jansel, https://github.com/eellison
Previously `linear_add_bias` only support the added tensor is `bfloat16`.
```
class M(torch.nn.Module):
def __init__(self, dtype):
super().__init__()
self.linear1 = torch.nn.Linear(10, 64, bias=False)
self.bias1 = torch.randn(64).bfloat16() # if the bias is not bf16, we will crash
def forward(self, x):
return self.linear1(x) + self.bias1
```
For `Autocast(bf16)` cases, `self.bias1` will not be converted to bf16. And we also not checked the dtype for weight and bias in the pattern matcher, this will lead to error if weight is bfl6 while bias is fp32.
We have 2 options to resolve this:
- Check bias/weight dtype, only fold the bias when they are same dtype
- We will fold them even they are not same dtype. By inserting to_dtypes for `bias node` to enforce it have same dtype with weight.
This PR chose option1, since we can't implicitly cast bias to bf16 here which would lose precision.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129138
Approved by: https://github.com/jgong5
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them.
### SymmetricMemory
`SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for **op-level custom communication patterns** (via the get_buffer APIs and the synchronization primitives), as well as **custom communication kernels** (via the buffer and signal_pad device pointers).
### Python API Example
```python
from torch._C.distributed_c10d import _SymmetricMemory
# Set a store for rendezvousing symmetric allocations on a group of devices
# identified by group_name. The concept of groups is logical; users can
# utilize predefined groups (e.g., a group of device identified by a
# ProcessGroup) or create custom ones. Note that a SymmetricMemoryAllocator
# backends might employ a more efficient communication channel for the actual
# rendezvous process and only use the store for bootstrapping purposes.
_SymmetricMemory.set_group_info(group_name, rank, world_size, store)
# Identical to empty_strided, but allows symmetric memory access to be
# established for the allocated tensor via _SymmetricMemory.rendezvous().
# This function itself is not a collective operation.
t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name)
# Users can write Python custom ops that leverages the symmetric memory access.
# Below are examples of things users can do (assuming the group's world_size is 2).
# Establishes symmetric memory access on tensors allocated via
# _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process,
# and the mapping between a local memory region and the associated SymmetricMemory
# object is unique. Subsequent calls to rendezvous() with the same tensor will receive
# the cached SymmetricMemory object.
#
# The function has a collective semantic and must be invoked simultaneously
# from all rendezvous participants.
symm_mem = _SymmetricMemory.rendezvous(t)
# This represents the allocation on rank 0 and is accessible from all devices.
buf = symm_mem.get_buffer(0, (64, 64), torch.float32)
if symm_mem.rank == 0:
symm_mem.wait_signal(src_rank=1)
assert buf.eq(42).all()
else:
# The remote buffer can be used as a regular tensor
buf.fill_(42)
symm_mem.put_signal(dst_rank=0)
symm_mem.barrier()
if symm_mem.rank == 0:
symm_mem.barrier()
assert buf.eq(43).all()
else:
new_val = torch.empty_like(buf)
new_val.fill_(43)
# Contiguous copies to/from a remote buffer utilize copy engines
# which bypasses SMs (i.e. no need to load the data into registers)
buf.copy_(new_val)
symm_mem.barrier()
```
### Custom CUDA Comm Kernels
Given a tensor, users can access the associated `SymmetricMemory` which provides pointer to remote buffers/signal_pads needed for custom communication kernels.
```cpp
TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory(
const at::Tensor& tensor);
class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target {
public:
...
virtual std::vector<void*> get_buffer_ptrs() = 0;
virtual std::vector<void*> get_signal_pad_ptrs() = 0;
virtual void** get_buffer_ptrs_dev() = 0;
virtual void** get_signal_pad_ptrs_dev() = 0;
virtual size_t get_buffer_size() = 0;
virtual size_t get_signal_pad_size() = 0;
virtual int get_rank() = 0;
virtual int get_world_size() = 0;
...
};
```
### Limitations of IntraNodeComm and ProcessGroupCudaP2p
Both `IntraNodeComm` (used by `ProcessGroupCudaP2p`) manages a single fixed-size workspace. This approach:
- Leads to awkward UX in which the required workspace needs to be specified upfront.
- Can not avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather).
- Prevents torch.compile from eliminating all copies.
In addition, they only offer out-of-the-box communication kernels and don't expose required pointers for user-defined, custom CUDA comm kernels.
* __->__ #128582
Differential Revision: [D58849033](https://our.internmc.facebook.com/intern/diff/D58849033)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582
Approved by: https://github.com/wanchaol
current implementation of compiled_autograd_enabled_count affects the entire region under the context manager. so if the context manager wraps torch.compile calls unrelated to the backward, they are affected too:
- no lazy compile for compiled fw
- no aot autograd cache for inference graphs
we instead maintain a flag when we execute the compiled backward callable, to isolate the special handling to the compiled backward graph
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128982
Approved by: https://github.com/jansel
ghstack dependencies: #127960, #128905
**Summary**
The previous stack op strategy was causing the input to be resharded, resulting in list index out of range error. I delayed the resharding for after the input_specs were created so that the new dimension could be inserted, preventing the error above. I have also ran all the other test cases to ensure changes did not introduce any new bugs
**Test Plan**
pytest test/distributed/_tensor/test_tensor_ops.py -s -k test_stack
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129018
Approved by: https://github.com/XilunWu
https://github.com/pytorch/pytorch/pull/126717 will skip the tests in both ABI compatible and non-ABI compatible mode.
It's not expected to skip them in non-ABI compatible mode since they can actually run successfully in such mode but only have issues in ABI compatible mode.
We leverage the existing `xfail_list` for those that will only fail in ABI compatible mode.
- `test_qlinear_add` is already in the `xfail_list`.
- `test_linear_packed` doesn't fail either in my local run (running with `TORCHINDUCTOR_ABI_COMPATIBLE=1`) or in the CI of this PR so I didn't add it into `xfail_list`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128506
Approved by: https://github.com/jgong5, https://github.com/desertfire
We present a utility MemTracker, that tracks the module-wise memory for the code executed under its context. The core features that this tool aims to provide are:
1. Capturing 'snapshots' of memory for each module during its execution. Specifically, at 8 points, during pre-forward, post-forward, pre-backward, 2nd pre-forward (if AC is applied), 2nd post-forward (if AC is applied), post-backward. Also capturing peak memory snapshot during forward and backward.
2. Each such snapshot provides the per device (cpu, cuda etc) memory breakdown in terms of the global parameters, gradients, activations, optimizer states and temporary memory.
3. A summary for each module (that can be analyzed or processed later), in terms of the memory occupied by its own parameters, buffers, inputs and outputs. The remaining components can be derived from these per module attributes and its corresponding captured snapshots.
4. Record the global peak memory consumption per device and their respective breakdowns.
5. Ability to do all of this under the FakeTensorMode so that all these statistics can be obtained without executing code on real data.
6. Ability to register and track modules, optimizers and any other tensors that are created outside the context of MemTracker.
7. Ability to capture a custom memory snapshot at any point during program execution execution.
8. Utility functions to display all of these statistics in user-friendly and human readable manner.
These features will enable users to anticipate OOMs, debug and pinpoint where majority of memory comes from, experiment with different activation checkpointing policies, batch sizes, mixed precision, model architecture features (ex. number of layers, hidden dimensions, number of attention heads etc.) and inter-device memory movement (ex. CPU off-loading) among others. Basically anything and everything related to device memory.
* __->__ #128508
Example:
> import torch
> import torchvision.models as models
> from torch.distributed._tools.mem_tracker import MemTracker
> device, dtype = "cuda", torch.float32
> with torch.device(device):
> model = models.resnet18().to(dtype=dtype)
> optim = torch.optim.Adam(model.parameters(), foreach=True)
> mem_tracker = MemTracker()
> mem_tracker.track_external(model, optim)
> with mem_tracker as mt:
> for i in range(2):
> input_batch = torch.randn(256, 3, 224, 224, device=device, dtype=dtype)
> model(input_batch).sum().backward()
> optim.step()
> optim.zero_grad()
> if i == 0:
> # to account for lazy init of optimizer state
> mt.reset_mod_stats()
> mt.display_snapshot("peak", units="MiB", tabulate=True)
> mt.display_modulewise_snapshots(depth=2, units="MiB", tabulate=True)
> # Check for accuracy of peak memory
> tracker_max = mt.get_tracker_snapshot('peak')[device]['Total']
> cuda_max = torch.cuda.max_memory_allocated()
> accuracy = tracker_max / cuda_max
> print(f"Tracker Max: {tracker_max}, CUDA Max: {cuda_max}, Accuracy: {accuracy}")
Output
<img width="1197" alt="Screenshot 2024-06-15 at 12 10 12 AM" src="https://github.com/pytorch/pytorch/assets/12934972/83e953db-43dc-4094-90eb-9f1d2ca8e758">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124688
Approved by: https://github.com/awgu
In the split build we end up with an incorrect RPATH for `libtorch_python.so`. This PR fixes said RPATH.
What the rpath should look like:
```
sahanp@devgpu086 ~/pytorch ((636de71c…))> objdump -p ~/main_so_files/libtorch_python.so | grep "RPATH" (pytorch-3.10)
RPATH /lib/intel64:/lib/intel64_win:/lib/win-x64:/home/sahanp/pytorch/build/lib:/home/sahanp/.conda/envs/pytorch-3.10/lib:
```
Before
```
sahanp@devgpu086 ~/pytorch ((636de71c…))> objdump -p ~/split_so_files/libtorch_python.so | grep "RPATH" (pytorch-3.10)
RPATH /home/sahanp/pytorch/torch/lib:/home/sahanp/pytorch/build/lib:
```
After
```
sahanp@devgpu086 ~/pytorch ((636de71c…))> objdump -p build/lib/libtorch_python.so | grep "RPATH" (pytorch-3.10)
RPATH /lib/intel64:/lib/intel64_win:/lib/win-x64:/home/sahanp/pytorch/build/lib:/home/sahanp/pytorch/torch/lib:/home/sahanp/.conda/envs/pytorch-3.10/lib:
```
Testing that this works is in the above PR. Similarly, after running ciflow/binaries the output of objdump -p should not change https://www.diffchecker.com/14PRmCNz/ (checked manywheel py 3.10 cuda 12.1)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129088
Approved by: https://github.com/malfet
Summary:
Same as D57688538, recreated because of GH issues
This diff introduces LocalShardsWrapper which is crucial to migrating from using ShardedTensor to DTensor in TRec state dict representation. As well as any changes needed in PT-D and ModelStore to support this.
It allows us to extend DTensor to support multiple shards on a rank as well as empty shards on a rank as needed by TRec sharding logic.
This diff also extends the support for LocalShardsWrapper to be used in conjunction with DTensor in checkpointing cases (ModelStore and DCP)
See D54375878 for how it is used.
**LocalShardsWrapper supports the following torch ops:**
+ torch.ops._c10d_functional.all_gather_into_tensor.default
+ aten._to_copy.default
+ aten.view.default
+ aten.equal.default
+ aten.detach.default
With extensibility to add more as required by use cases.
See https://docs.google.com/document/d/16Ptl50mGFJW2cljdF2HQ6FwsiA0scwbAbjx_4dhabJw/edit?usp=drivesdk for more info regarding design and approach.
NOTE: This version of LocalShardsWrapper does not support empty shards, that is added in the next diff enabling CW. D57063512
Test Plan:
` buck test mode/opt -c python.package_style=inplace aiplatform/modelstore/client/tests_gpu:dist_checkpoint_save_load_with_stateful_tests -- --print-passing-details`
`buck2 test 'fbcode//mode/dev-nosan' fbcode//torchrec/distributed/tests:test_tensor_configs -- --print-passing-details`
Sandcastle
Reviewed By: XilunWu, wanchaol
Differential Revision: D58570479
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129150
Approved by: https://github.com/XilunWu
Summary:
Export, through AOTAutograd, [deduplicates](11ff5345d2/torch/fx/experimental/proxy_tensor.py (L198)) sym_size calls, which can cause issues during unflattening when the sym_size node is used in multiple submodules.
If preserve_call_module_signature is set, these nodes can't be passed between submodules as placeholders, so the calls (and any downstream un-duplicated nodes) must be copied. Adding this to unflattener
Test Plan: export unflatten test case
Reviewed By: TroyGarden, angelayi
Differential Revision: D58697231
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129153
Approved by: https://github.com/angelayi
Idea: close over min / max sequence length in the main NJT view func (`_nested_view_from_jagged`) so that view replay during fake-ification propagates these correctly in torch.compile.
For dynamic shapes support for min / max sequence length, this PR uses a hack that stores the values in `(val, 0)` shaped tensors.
**NB: This PR changes SDPA to operate on real views instead of using `buffer_from_jagged()` / `ViewNestedFromBuffer`, which may impact the internal FIRST model. That is, it undoes the partial revert from #123215 alongside a fix to the problem that required the partial revert. We need to verify that there are no regressions there before landing.**
Differential Revision: [D55448636](https://our.internmc.facebook.com/intern/diff/D55448636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122836
Approved by: https://github.com/soulitzer
So how come this PR fixes any flakiness?
Well, following my investigation (read pt 1 in the linked ghstack PR below), I had realized that this test only consistently errors after another test was found flaky.
Why? Because TORCH_SHOW_CPP_STACKTRACES=1 gets turned on for _every_ test after _any_ test reruns, following this PR https://github.com/pytorch/pytorch/pull/119408. And yea, this test checked for exact error message matching, which no longer would match since the stacktrace for a foreach function is obviously going to be different from a nonforeach.
So we improve the test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129003
Approved by: https://github.com/soulitzer
Summary:
(1) Make code work when a first layer does not have a bias.
(2) Make it possible to provide both modules and module names as input
(3) Allow sequences of contiguous layers as input, that then get split into pairs
(4) fix documentation to be more clear on inputs to be provided
Test Plan:
Run this new version of the algorithm on a network and see if it throws errors.
There's also this notebook to run and test N5199827
It you tell me where I can find the tests for this code, I can add some simple unit tests as well.
Differential Revision: D55895862
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124632
Approved by: https://github.com/jerryzh168
When caching is enabled, an internal model fails with
```
assert_size_stride(bmm_9, (17, s0, 512), (54784, 512, 1))
AssertionError: expected size 17==17, stride 57344==54784 at dim=0
```
looking at this model, the exact problem is when the cache is hit on the forward graph, the generated code for backward fails since the strides of the outputs of forward, passed to backward as inputs, are not what we expected.
This PR changes the evaluation logic so that we defer evaluation of output stride exprs to load path as opposed to eagerly doing it on save path.
I have not been able to come up with a unit test repro for this problem.
Differential Revision: [D58796503](https://our.internmc.facebook.com/intern/diff/D58796503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128997
Approved by: https://github.com/ezyang
<!--
copilot:summary
-->
### <samp>🤖 Generated by Copilot at d75cde1</samp>
Added MPS support and autograd formulas for LU factorization of tensors. Implemented the `linalg_lu_factor` and `linalg_lu_factor.out` functions for the MPS backend in `LinearAlgebra.mm` and added tests in `test_mps.py`. Added the corresponding dispatch entries in `native_functions.yaml` and the backward and forward formulas in `derivatives.yaml`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99269
Approved by: https://github.com/kulinseth, https://github.com/lezcano
Fix erfinv codegen when ISA could not be detected
Manual test plan (on MacOS):
- Modify `valid_vec_isa_list` to return empty list
- Run `python3 inductor/test_torchinductor_opinfo.py -v -k test_comprehensive_erfinv_cpu_bool`
Before this change, abovementioned test will fail with
```
Output:
/var/folders/rk/fxg20zvx6vvb5bk7cplq4xrc0000gn/T/tmpgic60b6c/ns/cnsp7snp7fyclkm5lsfiyiv3m6c3svevkbhcb3v7pijdfjwlyaij.cpp:11:25: error: use of undeclared identifier 'calc_erfinv'
auto tmp2 = calc_erfinv(tmp1);
^
1 error generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129090
Approved by: https://github.com/jgong5, https://github.com/jansel
Summary: Found during testing with remote caching: Use the same output logger object between graph.py and codecache.py since it's patched in `run_and_get_cpp_code`. That allows us to capture any logging produced from the codecache path when using `run_and_get_cpp_code`. I'm also fixing a few tests that were passing mistakenly because logging was missing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128794
Approved by: https://github.com/oulgen, https://github.com/leslie-fang-intel
Move operators from member functions to free functions. This is needed to fix torch inductor on s390x.
This change fixes tests like
DynamicShapesMiscTests::test_numpy_min_dynamic_shapes from test/dynamo/test_dynamic_shapes.py
This change also fixes recently intorduced build failure on s390x.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129066
Approved by: https://github.com/malfet
Summary:
Add new traceEvents into Memory Snapshot for record_function annotations. These will capture both the profiler's step annotation as well as user annotations.
Test Plan:
CI
Pulled By:
aaronenyeshi
Differential Revision: D55941362
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129072
Approved by: https://github.com/zdevito
Summary:
WARNING: This API is highly unstable and will be subject to change in the future.
Add a protoype to "decompose" an ExportedProgram into a joint graph form, so that we can compute the gradients on this graph.
Test Plan: buck test mode/opt caffe2/torch/fb/export:test_experimental
Differential Revision: D55657917
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128847
Approved by: https://github.com/tugsbayasgalan
I'd like to discuss the criteria that we regard an implementation as stable. If there is no existing standard, my initial proposal would be a 6 month period after the commit to regard it as stable. As a result, now Adam and AdamW on CUDA would be considered as stable, while the rest are of beta.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129006
Approved by: https://github.com/malfet
This PR builds the split build in the pull workflow and runs the appropriate tests against them. A single linux cpu and single gpu build were chosen arbitrarily to not add too many tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126813
Approved by: https://github.com/atalman
ghstack dependencies: #127934
This PR removes the second separate package we were using for the libtorch wheel.
In terms of testing that this works we will look use the PRs above this in the stack.
As for sanity checking these are the wheels that are produced by running
```
python setup.py clean && BUILD_LIBTORCH_WHL=1 with-proxy python setup.py bdist_whee
l && BUILD_PYTHON_ONLY=1 with-proxy python setup.py bdist_wheel --cmake
```
```
sahanp@devgpu086 ~/pytorch ((5f15e171…))> ls -al dist/ (pytorch-3.10)
total 677236
drwxr-xr-x 1 sahanp users 188 Jun 4 12:19 ./
drwxr-xr-x 1 sahanp users 1696 Jun 4 12:59 ../
-rw-r--r-- 1 sahanp users 81405742 Jun 4 12:19 torch-2.4.0a0+gitca0a73c-cp310-cp310-linux_x86_64.whl
-rw-r--r-- 1 sahanp users 612076919 Jun 4 12:19 libtorch-2.4.0a0+gitca0a73c-py3-none-any.whl
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127934
Approved by: https://github.com/atalman
Some features of third-party devices depend on TraceUtils.h, so some of the CUDA code was removed and split into NCCLUtils files.
In addition, some common functions still remain in TraceUtils.h since I'm not sure if other devices will use them later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126969
Approved by: https://github.com/c-p-i-o
Summary:
The default input device for is_pinned function is Cuda. This can unnecessarily create Cuda context for CPU tensors when just generating TensorProperties, bloating memory usage. Passing the device to the is_pinned call site inside def create_from_tensor solves this issue.
This also fixes Model Store test
https://www.internalfb.com/intern/test/844425019931542?ref_report_id=0
which is currently broken on memory usage assertions.
Test Plan: UT
Differential Revision: D58695006
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128896
Approved by: https://github.com/fegin
# Summary
First PR got reverted and needed a redo
This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met:
- `x`'s scale should be a 1-dimensional tensor of length `M`.
- `y`'s scale should be a 1-dimensional tensor of length `N`.
It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row".
The following two PRs were required to enable local builds:
- [PR #126185](https://github.com/pytorch/pytorch/pull/126185)
- [PR #125523](https://github.com/pytorch/pytorch/pull/125523)
### Todo
We still do not build our Python wheels with this architecture.
@ptrblck @malfet, should we replace `sm_90` with `sm_90a`?
The NVRTC TMA shadowing feels wrong, but I a not sure the right way to spoof the symbol for this compilation unit:
https://github.com/pytorch/pytorch/pull/125204/files#r1586986954
#### ifdef
I tried to use : `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && \
defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate the building of the kernel. I was having a hell of a time with this.. so I am not really sure the right way to do this
Kernel Credit:
@jwfromm
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128989
Approved by: https://github.com/yangsiyu007, https://github.com/vkuzo
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them.
### SymmetricMemory
`SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for **op-level custom communication patterns** (via the get_buffer APIs and the synchronization primitives), as well as **custom communication kernels** (via the buffer and signal_pad device pointers).
### Python API Example
```python
from torch._C.distributed_c10d import _SymmetricMemory
# Set a store for rendezvousing symmetric allocations on a group of devices
# identified by group_name. The concept of groups is logical; users can
# utilize predefined groups (e.g., a group of device identified by a
# ProcessGroup) or create custom ones. Note that a SymmetricMemoryAllocator
# backends might employ a more efficient communication channel for the actual
# rendezvous process and only use the store for bootstrapping purposes.
_SymmetricMemory.set_group_info(group_name, rank, world_size, store)
# Identical to empty_strided, but allows symmetric memory access to be
# established for the allocated tensor via _SymmetricMemory.rendezvous().
# This function itself is not a collective operation.
t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name)
# Users can write Python custom ops that leverages the symmetric memory access.
# Below are examples of things users can do (assuming the group's world_size is 2).
# Establishes symmetric memory access on tensors allocated via
# _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process,
# and the mapping between a local memory region and the associated SymmetricMemory
# object is unique. Subsequent calls to rendezvous() with the same tensor will receive
# the cached SymmetricMemory object.
#
# The function has a collective semantic and must be invoked simultaneously
# from all rendezvous participants.
symm_mem = _SymmetricMemory.rendezvous(t)
# This represents the allocation on rank 0 and is accessible from all devices.
buf = symm_mem.get_buffer(0, (64, 64), torch.float32)
if symm_mem.rank == 0:
symm_mem.wait_signal(src_rank=1)
assert buf.eq(42).all()
else:
# The remote buffer can be used as a regular tensor
buf.fill_(42)
symm_mem.put_signal(dst_rank=0)
symm_mem.barrier()
if symm_mem.rank == 0:
symm_mem.barrier()
assert buf.eq(43).all()
else:
new_val = torch.empty_like(buf)
new_val.fill_(43)
# Contiguous copies to/from a remote buffer utilize copy engines
# which bypasses SMs (i.e. no need to load the data into registers)
buf.copy_(new_val)
symm_mem.barrier()
```
### Custom CUDA Comm Kernels
Given a tensor, users can access the associated `SymmetricMemory` which provides pointer to remote buffers/signal_pads needed for custom communication kernels.
```cpp
TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory(
const at::Tensor& tensor);
class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target {
public:
...
virtual std::vector<void*> get_buffer_ptrs() = 0;
virtual std::vector<void*> get_signal_pad_ptrs() = 0;
virtual void** get_buffer_ptrs_dev() = 0;
virtual void** get_signal_pad_ptrs_dev() = 0;
virtual size_t get_buffer_size() = 0;
virtual size_t get_signal_pad_size() = 0;
virtual int get_rank() = 0;
virtual int get_world_size() = 0;
...
};
```
### Limitations of IntraNodeComm and ProcessGroupCudaP2p
Both `IntraNodeComm` (used by `ProcessGroupCudaP2p`) manages a single fixed-size workspace. This approach:
- Leads to awkward UX in which the required workspace needs to be specified upfront.
- Can not avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather).
- Prevents torch.compile from eliminating all copies.
In addition, they only offer out-of-the-box communication kernels and don't expose required pointers for user-defined, custom CUDA comm kernels.
* __->__ #128582
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582
Approved by: https://github.com/wanchaol
https://github.com/pytorch/pytorch/pull/126717 will skip the tests in both ABI compatible and non-ABI compatible mode.
It's not expected to skip them in non-ABI compatible mode since they can actually run successfully in such mode but only have issues in ABI compatible mode.
We leverage the existing `xfail_list` for those that will only fail in ABI compatible mode.
- `test_qlinear_add` is already in the `xfail_list`.
- `test_linear_packed` doesn't fail either in my local run (running with `TORCHINDUCTOR_ABI_COMPATIBLE=1`) or in the CI of this PR so I didn't add it into `xfail_list`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128506
Approved by: https://github.com/jgong5, https://github.com/desertfire
The print statements for the get_workflow_type script is problematic because the shell script calling this script is expecting the output to only be JSON. This PR resolves this by removing all print statements to covert them to a message field in the JSON return output so that the output can continue to expect to be JSON while giving us the debug data we are looking for.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128969
Approved by: https://github.com/tylertitsworth, https://github.com/ZainRizvi
The hope that lives in this PR: I am currently trying to debug why the foreach tests are so flaky. It looks like every flaky test falls under this pattern:
- a test is flaky due to the mta_called assertion, which gathers data from the profiler regarding whether the multi_tensor_apply_kernel has been called.
- then, a later test fails deterministically, usually failing to compare two results.
```
================== 1 failed, 241 deselected, 2 rerun in 1.76s ==================
Got exit code 1
Stopping at first consistent failure
The following tests failed and then succeeded when run in a new process ['test/test_foreach.py::TestForeachCUDA::test_binary_op_float_inf_nan__foreach_add_cuda_bfloat16']
The following tests failed consistently: ['test/test_foreach.py::TestForeachCUDA::test_binary_op_list_error_cases__foreach_add_cuda_bfloat16']
```
So my suspicion is that the first causes the second, but what causes the first? Idk! So it would be nice to have the error message tell us what the profiler actually saw in case it's getting muddled. This change would help mostly because I have not been able to repro this flakiness locally.
Also undo the useless changes in #128220 which are actually redundant as Joel and I realized that we set the seed during the setUp of every test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128991
Approved by: https://github.com/clee2000
This PR adds `set_post_optim_event` that allows power users to provide their own CUDA event that is recorded after the optimizer step for the FSDP root module to wait the all-gather streams on.
```
def set_post_optim_event(self, event: torch.cuda.Event) -> None:
```
By default, the root would have the all-gather streams wait on the current stream (`wait_stream`), which may introduce false dependencies if there is unrelated computation after the optimizer step and before the wait. For example, this pattern can appear in recommendation models.
To avoid those false dependencies while preserving the correctness guarantee, we provide this API so that the user can provide their own CUDA event to wait the all-gather streams on.
We include both correctness test (`test_fully_shard_training.py`) and overlap test (`test_fully_shard_overlap.py`).
---
One possible way to use the API is to register a post-step hook on the optimizer. For example:
12e8d1399b/test/distributed/_composable/fsdp/test_fully_shard_training.py (L546-L552)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128975
Approved by: https://github.com/sanketpurandare, https://github.com/weifengpy
ghstack dependencies: #128884
Summary:
use_mtia should instead set use_device='mtia' similar to cuda, xpu, and privateuseone. Avoid an ever-growing list of use_* arguments.
Since use_mtia is specific to FBCode, we don't need a deprecation warning.
Test Plan: CI.
Differential Revision: D57338005
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126284
Approved by: https://github.com/fenypatel99
### bc-breaking for existing users of the private API:
- Existing policy functions must now change their return value to be [CheckpointPolicy](c0b40ab42e/torch/utils/checkpoint.py (L1204-L1230)) Enum instead of bool.
- To restore previous behavior, return `PREFER_RECOMPUTE` instead of `False` and `{PREFER,MUST}_SAVE` instead of `True` depending whether you prefer the compiler to override your policy.
- Policy function now accepts a `ctx` object instead of `mode` for its first argument.
- To restore previous behavior, `mode = "recompute" if ctx.is_recompute else "forward"`.
- Existing calls to `_pt2_selective_checkpoint_context_fn_gen` must be renamed to `create_selective_checkpoint_contexts `. The way you use the API remains the same. It would've been nice to do something different (not make the user have to use functools.partial?), but this was the easiest to compile (idk if this should actually be a constraint).
Related doc: https://docs.google.com/document/d/1BKyizkZPdri9mHqdDOLAUpkI7SbbKfLHRFVVpK9ZWqo/edit
Memory considerations:
- As with the existing SAC, cached values are cleared upon first use.
- We error if the user wishes to backward a second time on a region forwarded with SAC enabled.
In-place:
- We use version counting to enforce that if any cached tensor has been mutated. In-place operations not mutating cached tensors are allowed.
- `allow_cache_entry_mutation=True` can be passed to disable this check (useful in the case of auto AC where the user is cleverly also saves the output of the in-place)
Randomness, views
- Currently in this PR, we don't do anything special for randomness or views, the author of the policy function is expected to handle them properly. (Would it would be beneficial to error? - we either want to save all or recompute all random tensors)
Tensor object preservation
- ~We guarantee that if a tensor does not requires grad, and it is saved, then what you get out is the same tensor object.~ UPDATE: We guarantee that if a tensor is of non-differentiable dtype AND it is not a view, and it is saved, then what you get out is the same tensor object. This is a nice guarantee for nested tensors which care about the object identity of of the offsets tensor.
Policy function
- Enum values are `{MUST,PREFER}_{SAVE,RECOMPUTE}` (bikeshed welcome). Alternatively there was `{SAVE,RECOMPUTE}_{NON_,}OVERRIDABLE`. The former was preferred bc it seemed clearer that two `MUST` clashing should error, versus it is ambiguous whether two `NON_OVERRIDABLE` being stacked should silently ignore or error.
- The usage of Enum today. There actually is NO API to stack SAC policies today. The only thing the Enum should matter for in the near term is the compiler. The stacking SAC policy would be useful if someone wants to implement something like simple FSDP, but it is not perfect because with a policy of `PREFER_SAVE` you are actually saving more than autograd would save normally (would be fixed with AC v3).
- The number of times we call the policy_fn is something that should be documented as part of public API. We call the policy function for all ops except ~~detach~~ UPDATE : metadata ops listed in `torch.utils.checkpoint.SAC_IGNORED_OPS`) because these ops may be called a different number of times by AC itself between forward and recompute.
- The policy function can be a stateful object (we do NOT make separate copies of this object for forward/recompute, the user is expected to handle that via is_recompute see below).
Tensors guaranteed to be the same tensor as-is
- Policy function signature takes ctx object as its first argument. The ctx function is an object encapsulating info that may be useful to the user, it currently only holds "is_recompute". Adding this indirection gives us flexibility to add more attrs later if necessary.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125795
Approved by: https://github.com/Chillee, https://github.com/fmassa
Summary:
Unblocks a test that's failing.
`codegen` can be unset until `compile` is called. If `codegen` is not set, then just use the kernel name directly.
Test Plan:
```
buck2 run //caffe2/test:tensorexpr -- --regex test_simple_add
```
Differential Revision: D58727391
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128951
Approved by: https://github.com/aaronenyeshi
This PR adds two APIs `set_modules_to_forward_prefetch` and `set_modules_to_backward_prefetch` to enable explicit forward/backward all-gather prefetching, respectively.
```
def set_modules_to_forward_prefetch(self, modules: List[FSDPModule]): -> None
def set_modules_to_backward_prefetch(self, modules: List[FSDPModule]): -> None
```
**Motivation**
FSDP2 implements _reasonable defaults_ for forward and backward prefetching. In forward, it uses implicit prefetching and allows two all-gather output tensors to be alive at once (so that the current all-gather copy-out can overlap with the next all-gather). In backward, it uses explicit prefetching based on the reverse post-forward order.
However, there may be cases where with expert knowledge, we can reduce communication bubbles by moving all-gathers manually. One way to expose such behavior is to expose _prefetching limits_, i.e. integers that configure how many outstanding all-gathers/all-gather output tensors can be alive at once. IMIHO, this leans toward _easy_, not _simple_ (see [PyTorch design principles](https://pytorch.org/docs/stable/community/design.html#principle-2-simple-over-easy)).
The crux of the problem is that there may be special cases where manual intervention can give better performance. Exposing a prefetching limit and allowing users to pass a value >1 just smooths over the problem since such a limit would generally apply over the entire model even though it possibly should not. Then, expert users will see a specific all-gather that they want to deviate from this limit, and there is little we can do.
Thus, we instead choose to expose the most primitive extension point: namely, every `FSDPModule` gives an opportunity to prefetch other all-gathers in forward and in backward. How to leverage this extension point is fully up to the user. Implementing the prefetch limit can be done using this extension point (e.g. record the post-forward order yourself using forward hooks, iterate over that order, and call the `set_modules_to_forward_prefetch` / `set_modules_to_backward_prefetch` APIs).
Differential Revision: [D58700346](https://our.internmc.facebook.com/intern/diff/D58700346)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128884
Approved by: https://github.com/ckluk2, https://github.com/weifengpy
Adds support for `Variable._execution_engine.queue_callback()`, which is used in FSDP2.
Important tests:
- `pytest -rA test/inductor/test_compiled_autograd.py::TestCompiledAutograd::test_callback_graph_break_throws_error`
- `pytest -rA test/inductor/test_compiled_autograd.py::TestAutogradWithCompiledAutograd::test_callback_adds_callback`
- `PYTORCH_TEST_WITH_DYNAMO=1 python test/test_autograd.py -k TestAutograd.test_callback_adds_callback`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126366
Approved by: https://github.com/xmfan
Summary
Pass parameters from request to dump_nccl_trace_pickle handler.
The supported parameters + value are all lowercase.
includecollectives={true, false}
includestacktraces={true, false}
onlyactive={true, false}
Example post is:
/handler/dump_nccl_trace_pickle?includecollectives=true&includestacktraces=false&onlyactive=true
Test Plan:
unit tests
Differential Revision: [D58640474](https://our.internmc.facebook.com/intern/diff/D58640474)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128781
Approved by: https://github.com/d4l3k
This adds a `dump_traceback` handler so you can see all running threads for a job. This uses a temporary file as a buffer when calling `faulthandler.dump_traceback` and requires the GIL to be held during dumping.
Test plan:
```
python test/distributed/elastic/test_control_plane.py -v -k traceback
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128904
Approved by: https://github.com/c-p-i-o
In NVIDIA internal CI, on Jetson devices we are seeing this failure for `python test/inductor/test_cuda_cpp_wrapper.py -k test_addmm_cuda_cuda_wrapper -k test_linear_relu_cuda_cuda_wrapper`:
```
/usr/local/lib/python3.10/dist-packages/torch/_inductor/compile_fx.py:132: UserWarning: TensorFloat32 tensor cores for float32 matrix multiplication available but not enabled. Consider setting `torch.set_float32_matmul_precision('high')` for better performance.
warnings.warn(
W0613 20:57:17.722000 281473279256672 torch/_inductor/utils.py:902] [0/0] Not enough SMs to use max_autotune_gemm mode
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 2), ('unique_graphs', 1)]
inductor [('extern_calls', 2), ('fxgraph_cache_miss', 1), ('pattern_matcher_count', 1), ('pattern_matcher_nodes', 1)]
aot_autograd [('total', 1), ('ok', 1)]
F
======================================================================
FAIL: test_linear_relu_cuda_cuda_wrapper (__main__.TestCudaWrapper)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 2759, in wrapper
method(*args, **kwargs)
File "/opt/pytorch/pytorch/test/inductor/test_torchinductor.py", line 9818, in new_test
return value(self)
File "/usr/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/pytorch/pytorch/test/inductor/test_cuda_cpp_wrapper.py", line 152, in fn
_, code = test_torchinductor.run_and_get_cpp_code(
File "/opt/pytorch/pytorch/test/inductor/test_torchinductor.py", line 356, in run_and_get_cpp_code
result = fn(*args, **kwargs)
File "/opt/pytorch/pytorch/test/inductor/test_select_algorithm.py", line 43, in wrapped
return fn(*args, **kwargs)
File "/usr/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/usr/lib/python3.10/unittest/mock.py", line 1379, in patched
return func(*newargs, **newkeywargs)
File "/usr/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/usr/lib/python3.10/contextlib.py", line 79, in inner
return func(*args, **kwds)
File "/opt/pytorch/pytorch/test/inductor/test_select_algorithm.py", line 62, in test_linear_relu_cuda
self.assertEqual(counters["inductor"]["select_algorithm_autotune"], 1)
File "/usr/local/lib/python3.10/dist-packages/torch/testing/_internal/common_utils.py", line 3642, in assertEqual
raise error_metas.pop()[0].to_error(
AssertionError: Scalars are not equal!
Expected 1 but got 0.
Absolute difference: 1
Relative difference: 1.0
```
Looking into it, we see the failure is from https://github.com/pytorch/pytorch/blob/main/test/inductor/test_select_algorithm.py#L62. The warning `W0613 20:57:17.722000 281473279256672 torch/_inductor/utils.py:902] [0/0] Not enough SMs to use max_autotune_gemm ` is triggered from https://github.com/pytorch/pytorch/blob/main/torch/_inductor/utils.py#L973. Printing torch.cuda.get_device_properties(0).multi_processor_count returns 16 on the computelab AGX Orin; thus it makes sense that this check is failing, since the min_required_sms is 68, thus not letting it pick the autotune algorithm. Looking at the main for test_select_algorithm.py, we see that these tests should only be run if is_big_gpu(0) is true: https://github.com/pytorch/pytorch/blob/main/test/inductor/test_select_algorithm.py#L344. Thus this PR adds a similar check to the invocation of these tests in test_cuda_cpp_wrapper.py.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128652
Approved by: https://github.com/soulitzer, https://github.com/eqy
Fixes#105157
Bug source: `from __future__ import annotations` converts type annotation to strings to make forwards references easier. However, existing custom ops do not consider strings to be valid types.
Fix: We check if the argument and return type annotation is string type. If so, we try to use `eval` to convert it to a type.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128809
Approved by: https://github.com/zou3519
Related to: https://github.com/pytorch/pytorch/issues/125879
Would check if we are compiled with CUDA before publishing CUDA Docker nightly image
Test
```
#18 [conda-installs 5/5] RUN IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())'); echo "Is torch compiled with cuda: ${IS_CUDA}"; if test "${IS_CUDA}" != "True" -a ! -z "12.4.0"; then exit 1; fi
#18 1.656 Is torch compiled with cuda: False
#18 ERROR: process "/bin/sh -c IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())'); echo \"Is torch compiled with cuda: ${IS_CUDA}\"; if test \"${IS_CUDA}\" != \"True\" -a ! -z \"${CUDA_VERSION}\"; then \texit 1; fi" did not complete successfully: exit code: 1
------
> [conda-installs 5/5] RUN IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())'); echo "Is torch compiled with cuda: ${IS_CUDA}"; if test "${IS_CUDA}" != "True" -a ! -z "12.4.0"; then exit 1; fi:
1.656 Is torch compiled with cuda: False
------
Dockerfile:80
--------------------
79 | RUN /opt/conda/bin/pip install torchelastic
80 | >>> RUN IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())');\
81 | >>> echo "Is torch compiled with cuda: ${IS_CUDA}"; \
82 | >>> if test "${IS_CUDA}" != "True" -a ! -z "${CUDA_VERSION}"; then \
83 | >>> exit 1; \
84 | >>> fi
85 |
--------------------
ERROR: failed to solve: process "/bin/sh -c IS_CUDA=$(python -c 'import torch ; print(torch.cuda._is_compiled())'); echo \"Is torch compiled with cuda: ${IS_CUDA}\"; if test \"${IS_CUDA}\" != \"True\" -a ! -z \"${CUDA_VERSION}\"; then \texit 1; fi" did not complete successfully: exit code: 1
(base) [ec2-user@ip-172-30-2-248 pytorch]$ docker buildx build --progress=plain --platform="linux/amd64" --target official -t ghcr.io/pytorch/pytorch:2.5.0.dev20240617-cuda12.4-cudnn9-devel --build-arg BASE_IMAGE=nvidia/cuda:12.4.0-devel-ubuntu22.04 --build-arg PYTHON_VERSION=3.11 --build-arg CUDA_VERSION= --build-arg CUDA_CHANNEL=nvidia --build-arg PYTORCH_VERSION=2.5.0.dev20240617 --build-arg INSTALL_CHANNEL=pytorch --build-arg TRITON_VERSION= --build-arg CMAKE_VARS="" .
#0 building with "default" instance using docker driver
```
Please note looks like we are installing from pytorch rather then nighlty channel on PR hence cuda 12.4 is failing since its not in pytorch channel yet:
https://github.com/pytorch/pytorch/actions/runs/9555354734/job/26338476741?pr=128852
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128852
Approved by: https://github.com/malfet
Summary:
For PointToPoint(sendrecv), the deviceId is lower_rank:higher_rank. This means a p2p group cannot be created through commSplit since it cannot find a parent.
Fix this by using the right device key of current rank.
Differential Revision: D58631639
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128803
Approved by: https://github.com/shuqiangzhang
Fixes#127908
## Description
Created docs to document the torch.cuda.cudart function to solve the issue #127908.
I tried to stick to the [guidelines to document a function](https://github.com/pytorch/pytorch/wiki/Docstring-Guidelines#documenting-a-function) but I was not sure if there is a consensus on how to handle the docs of a function that calls an internal function. So I went ahead and tried what the function will raise, etc. from the user endpoint and documented it (i.e. I am giving what actually _lazy_init() will raise).
Updated PR from #128298 since I made quite a big mistake in my branch. I apologize for the newbie mistake.
### Summary of Changes
- Added docs for torch.cuda.cudart
- Added the cudart function in the autosummary of docs/source/cuda.rst
## Checklist
- [X] The issue that is being fixed is referred in the description
- [X] Only one issue is addressed in this pull request
- [X] Labels from the issue that this PR is fixing are added to this pull request
- [X] No unnecesary issues are included into this pull request
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128741
Approved by: https://github.com/msaroufim
# Summary
The primary reason for the change was lack of current use case and the need to work around an two Inductor issue.
- Tensor arguments as kwarg only
- multiple outputs from triton templates
If the need for the amax return type arises we can consider either adding it, more likely creating a separate op.
In principle PyTorch is moving away from ops that bundle lots of functionality into "mega ops". We instead rely upon the compiler to generate appropriate fused kernels.
### Changes:
- This removes the amax return type from scaled_mm. We have found that the common use case is to return in "high-precision" ( a type with more precision than fp8). This is only relevant when returning in low-precision.
- We currently still allow for fp8 returns and scaled result. Perhaps we should also ban this as well...
New signature:
```Python
def meta_scaled_mm(
self: torch.Tensor,
mat2: torch.Tensor,
scale_a: torch.Tensor,
scale_b: torch.Tensor,
bias: Optional[torch.Tensor] = None,
scale_result: Optional[torch.Tensor] = None,
out_dtype: Optional[torch.dtype] = None,
use_fast_accum: bool = False,
) -> torch.Tensor:
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128683
Approved by: https://github.com/vkuzo
#### Issue
Tensor constant was previously lifted directly as an input in the fx graph, which results errors for multiple test cases with tensor constant. This PR introduces a fix to convert tensor constant to a `GetAttr` in the fx graph.
This PR also introduces other fixes to maintain a valid `state_dict` for exported program when there are tensor constants. In short, after tensor constants are converted as `GetAttr`, they are treated as buffers during retracing. The fix will convert those back from buffer to constant.
#### Test Plan
Add new test cases that generate tensor constants
* `pytest test/export/test_converter.py -s -k test_implicit_constant_to_tensor_handling`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128442
Approved by: https://github.com/angelayi
Summary: Today meta['val'] on placeholder nodes doesn't preserve the consistent requires_grad information with the original inputs. Seems there's no easy way to fix this directly at proxy tensor layer. This is useful for reexporting joint graph.
Test Plan: test_preserve_requires_grad_placeholders
Differential Revision: D58555651
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128656
Approved by: https://github.com/tugsbayasgalan
Idea: close over min / max sequence length in the main NJT view func (`_nested_view_from_jagged`) so that view replay during fake-ification propagates these correctly in torch.compile.
For dynamic shapes support for min / max sequence length, this PR uses a hack that stores the values in `(val, 0)` shaped tensors.
**NB: This PR changes SDPA to operate on real views instead of using `buffer_from_jagged()` / `ViewNestedFromBuffer`, which may impact the internal FIRST model. That is, it undoes the partial revert from #123215 alongside a fix to the problem that required the partial revert. We need to verify that there are no regressions there before landing.**
Differential Revision: [D55448636](https://our.internmc.facebook.com/intern/diff/D55448636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122836
Approved by: https://github.com/soulitzer
ghstack dependencies: #127007, #128057
Summary: We lose traceback info when an exception occurs in a subprocess because Python traceback objects don't pickle. In the subprocess-based parallel compile, we _are_ logging an exception in the subprocess, but a) those messages are easy to miss because they're not in the traceback output, and b) it seems that logging in the subproc is swallowed by default in internal builds. This PR captures the traceback in the subprocess and makes it available in the exception thrown in the main process. Users now see failures that look like this:
```
...
File "/home/slarsen/.conda/envs/pytorch-3.10_3/lib/python3.10/concurrent/futures/_base.py", line 458, in result
return self.__get_result()
File "/home/slarsen/.conda/envs/pytorch-3.10_3/lib/python3.10/concurrent/futures/_base.py", line 403, in __get_result
raise self._exception
torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
SubprocException: An exception occurred in a subprocess:
Traceback (most recent call last):
File "/data/users/slarsen/pytorch-3.10_3/torch/_inductor/compile_worker/subproc_pool.py", line 270, in do_job
result = SubprocMain.foo()
File "/data/users/slarsen/pytorch-3.10_3/torch/_inductor/compile_worker/subproc_pool.py", line 263, in foo
SubprocMain.bar()
File "/data/users/slarsen/pytorch-3.10_3/torch/_inductor/compile_worker/subproc_pool.py", line 260, in bar
SubprocMain.baz()
File "/data/users/slarsen/pytorch-3.10_3/torch/_inductor/compile_worker/subproc_pool.py", line 257, in baz
raise Exception("an error occurred")
Exception: an error occurred
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128775
Approved by: https://github.com/jansel
Test CI
This fixes issues like this where I don't even intend to use the fuzzer. this way if someone is calling functions from the fuzzer numpy will be imported otherwise the import should not happen at the top of the file
```
>>> import torchao
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/__init__.py", line 26, in <module>
from torchao.quantization import (
File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/quantization/__init__.py", line 7, in <module>
from .smoothquant import * # noqa: F403
File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/quantization/smoothquant.py", line 18, in <module>
import torchao.quantization.quant_api as quant_api
File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/quantization/quant_api.py", line 23, in <module>
from torchao.utils import (
File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torchao/utils.py", line 2, in <module>
import torch.utils.benchmark as benchmark
File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torch/utils/benchmark/__init__.py", line 4, in <module>
from torch.utils.benchmark.utils.fuzzer import * # noqa: F403
File "/home/marksaroufim/anaconda3/envs/fresh/lib/python3.10/site-packages/torch/utils/benchmark/utils/fuzzer.py", line 5, in <module>
import numpy as np
ModuleNotFoundError: No module named 'numpy'
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128759
Approved by: https://github.com/Skylion007
Improve Dynamo to support the FSDP2 `use_training_state()` context manager.
Test command:
`
pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_dynamo_trace_use_training_state
`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127854
Approved by: https://github.com/yanboliang
**Summary**
Inductor currently uses modulo and division to compute indices into certain multi-dimensional tensors, such as those arising from row padding. This PR matches on that indexing pattern, replacing it with an N-D block pointer. This should be more efficient than computing indices with division and modulo, and it can easily map to DMAs on non-GPU hardware targets.
Because the 1D block size needs to map to an integer block shape in ND, we need to know that the ND block size evenly divides the size of the iteration range. This PR only generates ND block pointers when it can guarantee that the iteration order and number of elements loaded are unchanged. This means that the number of elements in a slice of the iteration range must either be:
- Powers of 2. Since Triton block sizes are powers of 2, any integer power of 2 either divides the block size, or is greater than the block size. In the latter case, `CielDiv(x, y)` rounds up to 1.
- Multiples of the maximum block size. Since block sizes are powers of 2, the maximum block size is a multiple of every possible block size.
Note that a *slice* of the iteration range does not include the leading dimension. Thus we can support arbitrary leading dimensions like `(5,8)`.
Feature proposal and discussion: https://github.com/pytorch/pytorch/issues/125077
Example kernel:
```
triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
xnumel = 4096
xoffset = tl.program_id(0) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:]
xmask = xindex < xnumel
tmp0 = tl.reshape(tl.load(tl.make_block_ptr(in_ptr0, shape=[32, 16, 8], strides=[1024, 32, 1], block_shape=[32 * (32 <= ((127 + XBLOCK) // 128)) + ((127 + XBLOCK) // 128) * (((127 + XBLOCK) // 128) < 32), 16 * (16 <= ((7 + XBLOCK) // 8)) + ((7 + XBLOCK) // 8) * (((7 + XBLOCK) // 8) < 16), 8 * (8 <= XBLOCK) + XBLOCK * (XBLOCK < 8)], order=[0, 1, 2], offsets=[(xoffset // 128), (xoffset // 8) % 16, xoffset % 8]), boundary_check=[0, 1, 2]), [XBLOCK])
tmp1 = tmp0 + tmp0
tl.store(tl.make_block_ptr(out_ptr0, shape=[4096], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp1, [XBLOCK]).to(tl.float32))
''', device_str='cuda')
```
**Test Plan**
This PR adds a new CI test script to cover this feature. The tests can be grouped into a few main categories:
- Can we generate strided block pointers for the appropriate shapes?
- Powers of 2
- Non-power of 2, but multiple of the maximum block size
- Arbitrary leading dimensions, with power of 2 inner dimensions
- Weird strides and offsets
- Reductions
- Symbolic shapes that are multiples of the maximum block size (wasn't able to trace this through dynamo)
- Broadcasts (some variables are missing from the indexing expression)
- Do we still compile other cases correctly, even if we don't expect to be able to generate block pointers?
- Unsupported static shapes
- Unsupported symbolic shapes
- Mixing and matching these cases:
- Pointwise and reduction in the same kernel
- Sanity check the test harness
- Do we raise an exception if the expected number of block pointers and the actual number are different?
**Follow-ups**
There are a few important cases which this PR can't handle. I'm hoping these can be deferred to follow-up PRs:
- Handle non-divisible shapes
- Change the tiling algorithm to generate a 2D (X,Y) blocking, if doing so enables block pointers to be emitted.
- Pad unsupported loads up to the nearest divisible size, then mask/slice out the extra elements? This is probably the best solution, but I'm not yet sure how to go about it in triton.
- Take advantage of this analysis when `triton.use_block_ptr=False`. I'm guessing we can still avoid `%` and `/` without requiring block pointers. Maybe we could compute block indices with arange and broadcast instead?
Differential Revision: D56739375
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127342
Approved by: https://github.com/jansel, https://github.com/shunting314
The following are all constrained under the ONNX exporter project scope.
- `personal_of_interest.rst`
- Moving folks no longer working on the project to emeritus.
- Adding @justinchuby, @titaiwangms, @shubhambhokare1 and @xadupre,
who have all made countless contributions to this project.
- `CODEOWNERS`
- Removing folks no longer working on the project.
- Updating new owners who will now be notified with PRs related to
the specific file paths.
- `merge_rules.yaml`
- Removing folks no longer working on the project.
🫡
Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126364
Approved by: https://github.com/titaiwangms, https://github.com/justinchuby, https://github.com/albanD
Summary: If any subprocess in the pool crashes, we get a BrokenProcessPool exception and the whole pool becomes unusable. Handle crashes by recreating the pool.
Test Plan:
* New unit test
* Started a long-running test (`test/inductor/test_torchinductor.py`), periodically killed subprocess manually, made sure the test run recovers and makes progress.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128757
Approved by: https://github.com/jansel
When handling an input to dynamo that's a view of a subclass, dynamo does some handling to reconstruct the view. Part of this is to construct symints for the input parameters to the view.
Previously, the code would just call `create_symbol()` which by default specifies a _positive_ symint (>= 0); this fails in the case where you have an aten::view that was called with a -1.
Fix: just specify `positive=None` when calling `create_symbol()`, to avoid restricting the symint to >= 0 or <= 0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128662
Approved by: https://github.com/jbschlosser
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):
This PR introduces a prototype for `SymmetricMemory` (including a CUDA implementation) - a remote-memory access-based communication primitive. It allows for user-defined communication patterns/kernels and is designed to be torch.compile-friendly. It addresses the major limitations of `IntraNodeComm` and `ProcessGroupCudaP2p` and serves as a replacement for them.
### SymmetricMemory
`SymmetricMemory` represents symmetric allocations across a group of devices. The allocations represented by a `SymmetricMemory` object are accessible by all devices in the group. The class can be used for **op-level custom communication patterns** (via the get_buffer APIs and the synchronization primitives), as well as **custom communication kernels** (via the buffer and signal_pad device pointers).
### Python API Example
```python
from torch._C.distributed_c10d import _SymmetricMemory
# Set a store for rendezvousing symmetric allocations on a group of devices
# identified by group_name. The concept of groups is logical; users can
# utilize predefined groups (e.g., a group of device identified by a
# ProcessGroup) or create custom ones. Note that a SymmetricMemoryAllocator
# backends might employ a more efficient communication channel for the actual
# rendezvous process and only use the store for bootstrapping purposes.
_SymmetricMemory.set_group_info(group_name, rank, world_size, store)
# Identical to empty_strided, but allows symmetric memory access to be
# established for the allocated tensor via _SymmetricMemory.rendezvous().
# This function itself is not a collective operation.
t = _SymmetricMemory.empty_strided_p2p((64, 64), (64, 1), torch.float32, group_name)
# Users can write Python custom ops that leverages the symmetric memory access.
# Below are examples of things users can do (assuming the group's world_size is 2).
# Establishes symmetric memory access on tensors allocated via
# _SymmetricMemory.empty_strided_p2p(). rendezvous() is a one-time process,
# and the mapping between a local memory region and the associated SymmetricMemory
# object is unique. Subsequent calls to rendezvous() with the same tensor will receive
# the cached SymmetricMemory object.
#
# The function has a collective semantic and must be invoked simultaneously
# from all rendezvous participants.
symm_mem = _SymmetricMemory.rendezvous(t)
# This represents the allocation on rank 0 and is accessible from all devices.
buf = symm_mem.get_buffer(0, (64, 64), torch.float32)
if symm_mem.rank == 0:
symm_mem.wait_signal(src_rank=1)
assert buf.eq(42).all()
else:
# The remote buffer can be used as a regular tensor
buf.fill_(42)
symm_mem.put_signal(dst_rank=0)
symm_mem.barrier()
if symm_mem.rank == 0:
symm_mem.barrier()
assert buf.eq(43).all()
else:
new_val = torch.empty_like(buf)
new_val.fill_(43)
# Contiguous copies to/from a remote buffer utilize copy engines
# which bypasses SMs (i.e. no need to load the data into registers)
buf.copy_(new_val)
symm_mem.barrier()
```
### Custom CUDA Comm Kernels
Given a tensor, users can access the associated `SymmetricMemory` which provides pointer to remote buffers/signal_pads needed for custom communication kernels.
```cpp
TORCH_API c10::intrusive_ptr<SymmetricMemory> get_symmetric_memory(
const at::Tensor& tensor);
class TORCH_API SymmetricMemory : public c10::intrusive_ptr_target {
public:
...
virtual std::vector<void*> get_buffer_ptrs() = 0;
virtual std::vector<void*> get_signal_pad_ptrs() = 0;
virtual void** get_buffer_ptrs_dev() = 0;
virtual void** get_signal_pad_ptrs_dev() = 0;
virtual size_t get_buffer_size() = 0;
virtual size_t get_signal_pad_size() = 0;
virtual int get_rank() = 0;
virtual int get_world_size() = 0;
...
};
```
### Limitations of IntraNodeComm and ProcessGroupCudaP2p
Both `IntraNodeComm` (used by `ProcessGroupCudaP2p`) manages a single fixed-size workspace. This approach:
- Leads to awkward UX in which the required workspace needs to be specified upfront.
- Can not avoid extra copies for some algorithms in eager mode (e.g., custom/multimem all-reduce, reduce-scatter, all-gather).
- Prevents torch.compile from eliminating all copies.
In addition, they only offer out-of-the-box communication kernels and don't expose required pointers for user-defined, custom CUDA comm kernels.
* __->__ #128582
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128582
Approved by: https://github.com/wanchaol
In this PR, we abstracted the different types of aten operation parameters as `ParameterMetadata`. This structure intends to be used to represent and store the metadata of each aten operation parameter. Currently, it only supports `Tensor`, `TensorList`, and `Scalar`.
```C++
using ParameterMetadataValue = std::variant<TensorMetadata, std::vector<TensorMetadata>, c10::Scalar>;
```
With this PR, we can extend other parameter-type support in a more modularize way, like `string`, `int`, `double`, and other different types to be summarized as the following list. The list is collected from all aten operations and ordered by the number of being used.
- `Tensor`
- `bool`
- `int64_t`
- `TensorList`
- `Scalar`
- `c10::SymIntArrayRef`
- `::std::optional<Tensor>`
- `IntArrayRef`
- `double`
- `c10::SymInt`
- `::std::optional<ScalarType>`
- `::std::optional<double>`
- `::std::optional<bool>`
- `::std::optional<Layout>`
- `::std::optional<Device>`
- `::std::optional<int64_t>`
- `Dimname`
- `::std::optional<Generator>`
- `c10::string_view`
- `::std::optional<c10::string_view>`
- `OptionalIntArrayRef`
- `::std::optional<Scalar>`
- `OptionalSymIntArrayRef`
- `::std::optional<MemoryFormat>`
- `::std::optional<c10::SymInt>`
- `ScalarType`
- `ArrayRef<Scalar>`
- `DimnameList`
- `::std::optional<ArrayRef<double>>`
- `::std::array<bool,3>`
- `::std::optional<DimnameList>`
- `c10::List<::std::optional<Tensor>>`
- `::std::array<bool,2>`
- `Storage`
- `::std::array<bool,4>`
- `Device`
- `DeviceIndex`
- `ITensorListRef`
- `Stream`
- `Layout`
- `MemoryFormat`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125308
Approved by: https://github.com/jgong5, https://github.com/jansel
This adjusts the settings of the libuv backend to match the older TCPStore.
* DEFAULT_BACKLOG: setting this to -1 will enable using the host somaxconn value instead of a hardcoded 16k value. When going over this limit with `tcp_abort_on_overflow` set it results in connections being reset.
* TCP_NODELAY: Since TCPStore primarily sends small messages there's no benefit to using Nargle's algorithm and it may add additional latency for store operations.
Test plan:
```
python test/distributed/test_store.py -v -k LibUv
```
Benchmark script:
```
import time
import os
import torch.distributed as dist
rank = int(os.environ["RANK"])
store = dist.TCPStore(
host_name="<server>",
port=29500,
world_size=2,
is_master=(rank == 0),
use_libuv=True,
)
if rank == 1:
total_iters = 0
total_dur = 0
for iter in range(10):
iters = 500000
start = time.perf_counter()
for i in range(iters):
store.set(f"key_{i}", f"value_{i}")
dur = time.perf_counter() - start
print(f"{iter}. {iters} set, qps = {iters/dur}")
total_iters += iters
total_dur += dur
print(f"overall qps = {total_iters/total_dur}")
else:
print("sleeping")
time.sleep(1000000000)
```
Performance seems to be negligible difference between TCP_NODELAY and not for a single host
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128739
Approved by: https://github.com/rsdcastro, https://github.com/kurman, https://github.com/c-p-i-o
Fix docstrings in Learning Rate Scheduler.
The fix can be verified by running pydocstyle path-to-file --count
Related #112593
**BEFORE the PR:**
pydocstyle torch/optim/lr_scheduler.py --count
92
**AFTER the PR:**
pydocstyle torch/optim/lr_scheduler.py --count
0
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128679
Approved by: https://github.com/janeyx99
Changes:
1. Add memory align macro support on Windows.
2. Fix `#pragma unroll` not support on MSVC cl compiler.
`#pragma unroll` occur error on msvc `cl` compiler, but it would be supported on Windows `clang`.
We'd better disable it only on `__msvc_cl__` compiler, and get better performance if we enabled `clang`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128686
Approved by: https://github.com/jgong5, https://github.com/jansel
Fixes https://github.com/pytorch/pytorch/issues/125720
I was earlier worried that DELETE_* or STORE_* on referent values should result in a graph break, because they could invalidate the weak ref. But then @zou3519 pointed out that weakref invalidation will happen EVENTUALLY, CPython provides no guarantees when the weakref will be invalidated (even when the user calls del x and x is the last reference).
So any code that relies on del x to invalidate the weakref of x right away is BAD code. CPython provide no guarantees. Therefore we can (ab)use this nuance, and can just ignore DELETE_* or STORE_* on the referent objects.
The only corner case is when Dynamo is reconstructing the weakref object. Dynamo will have a hard time being correct here, so just SKIP_FRAME on such a case. This is rare.
Cpython notes
1) https://docs.python.org/3/library/weakref.html
2) https://docs.python.org/3/reference/datamodel.html#index-2
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128533
Approved by: https://github.com/jansel
Taking inspiration from `GraphModule.print_readable` (aka I copied its [code](17b45e905a/torch/fx/graph_module.py (L824))), I added a `print_readable` to the unflattened module, because it's kind of nontrivial to print the contents of this module.
Example print from `python test/export/test_unflatten.py -k test_unflatten_nested`
```
class UnflattenedModule(torch.nn.Module):
def forward(self, x: "f32[2, 3]"):
# No stacktrace found for following nodes
rootparam: "f32[2, 3]" = self.rootparam
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:99 in forward, code: x = x * self.rootparam
mul: "f32[2, 3]" = torch.ops.aten.mul.Tensor(x, rootparam); x = rootparam = None
# No stacktrace found for following nodes
foo: "f32[2, 3]" = self.foo(mul); mul = None
bar: "f32[2, 3]" = self.bar(foo); foo = None
return (bar,)
class foo(torch.nn.Module):
def forward(self, mul: "f32[2, 3]"):
# No stacktrace found for following nodes
child1param: "f32[2, 3]" = self.child1param
nested: "f32[2, 3]" = self.nested(mul); mul = None
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:79 in forward, code: return x + self.child1param
add: "f32[2, 3]" = torch.ops.aten.add.Tensor(nested, child1param); nested = child1param = None
return add
class nested(torch.nn.Module):
def forward(self, mul: "f32[2, 3]"):
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:67 in forward, code: return x / x
div: "f32[2, 3]" = torch.ops.aten.div.Tensor(mul, mul); mul = None
return div
class bar(torch.nn.Module):
def forward(self, add: "f32[2, 3]"):
# No stacktrace found for following nodes
child2buffer: "f32[2, 3]" = self.child2buffer
# File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:87 in forward, code: return x - self.child2buffer
sub: "f32[2, 3]" = torch.ops.aten.sub.Tensor(add, child2buffer); add = child2buffer = None
return sub
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128617
Approved by: https://github.com/zhxchen17, https://github.com/pianpwk
Summary:
Firstly, this does not change any existing behaviour, since all the
default values for kwargs were hardcoded into the ``_checkpoint_without_reentrant_generator`` call.
Secondly, this is needed for unlocking the full potential of composable
checkpointing making it equivalent to ``torch.utils.checkpoint.checkpoint(use_reentrant=False)``.
Finally, an added benefit is now composable checkpointing can be used under ``FakeTensorMode`` by
passing ``preserve_rng_state=False``.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128516
Approved by: https://github.com/awgu
This PR is enough to get this test to pass when using `TORCHDYNAMO_INLINE_INBUILT_NN_MODULES`:
```
TORCHDYNAMO_INLINE_INBUILT_NN_MODULES=1 python test/inductor/test_group_batch_fusion.py -k TestPostGradBatchLinearFusion.test_batch_linear_post_grad_fusion
```
inductor has a pre-grad pass to swap out multiple `linear` layers with with `addbmm`, but it also needs to insert an `unbind()` at the end. If that unbind is then followed by a mutation (like `add_()`), the autograd engine will complain (autograd does not let you mutate the output of multiple-out-view ops like unbind).
I made a tweak to the pattern matching logic to avoid matching if the output of the linear is used in an op that mutates its input. My hope is that:
(1) this situation is rare enough that it won't materially impact pattern matching in real world code
(2) I had to use a heuristic for "is an op a mutable op", since the graph we get is from dynamo, so it can contain code like `operator.iadd` in it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128570
Approved by: https://github.com/eellison, https://github.com/mlazos
ghstack dependencies: #127927
Fixes https://github.com/pytorch/pytorch/issues/127374
The error in the linked repro is:
```
AssertionError: Please convert all Tensors to FakeTensors first or instantiate FakeTensorMode with 'allow_non_fake_inputs'. Found in aten.sym_storage_offset.default(_to_functional_tensor(FakeTensor(..., device='cuda:0', size=(16, 4), dtype=torch.uint8),
device='cuda:0'))
```
Where we hit FakeTensor.__torch_dispatch__, but our input is a C++ `FunctionalTensorWrapper`.
What should actually have happened is that the call to `aten.sym_storage_offset` hits the `Functionalize` dispatch key, which should remove the `FunctionalTensorWrapper` and redispatch. I spent some time debugging and haven't actually figured out why this isn't happening. Instead, this PR just skips that step completely, and asks `FunctionalTensor` to directly unwrap the C++ `FunctionalTensorWrapper` when querying tensor metadata.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127927
Approved by: https://github.com/tugsbayasgalan
This adds better logging of errors to the socket and TCPStore classes.
All socket operations should now include the local and remote addresses and we actually log errors from the TCPStoreBackend::run as well as TCPStoreBackendUV which were previously INFO messages and not actually logged.
It also overhauls test_wait in test_store.py as it had a race condition causing it to be flaky.
Test plan:
```
python test/distributed/test_store.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128673
Approved by: https://github.com/c-p-i-o
We introduced AOTI_TORCH_CHECK in #119220 to resolve slow-compilation
time issues. Unfortunately, it caused perf regressions for CPU
, as described in issue #126665. After some investigation, it turned
out the slow compilation was caused by the use of the builtin
function __builtin_expect provided by gcc/clang. Moreover,
nuking __builtin_expect doesn't seem to cause any performance penalty,
even though its purpose is to improve performance by providing the
compiler with branch prediction information.
abs latency numbers using the script shared by #126665:
before the fix after the fix
T5Small 1019.055694 917.875027
T5ForConditionalGeneration 1009.825196 916.369239
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128402
Approved by: https://github.com/desertfire
This PR enables specific axe to be dynamic with calling torch.export.export and torch.export.Dim.
Features:
(1) Turn dynamic_axes to dynamic_shapes
(2) Dim constraints remain the same (see test case with hitting constraints). This might give different user experience, since we didn't have any constraints in torchscript-onnx exporting.
(3) If input_names is used in dynamic_axes, ValueError will be raised, as input_names is currently not supported.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128371
Approved by: https://github.com/justinchuby
Fixes#113124.
## Description
I modified the installing.rst file to address the system requirements and troubleshooting steps for using LibTorch with different GLIBC versions.
### Summary of Changes
- Added system requirements specifying the GLIBC version needed for both the cxx11 ABI version and the pre-cxx11 ABI version of LibTorch.
- Included a troubleshooting section with instructions on how to check the dependencies of the LibTorch libraries and identify the required GLIBC version using the `ldd lib/libtorch.so` command.
## Checklist
- [X] The issue that is being fixed is referred in the description
- [X] Only one issue is addressed in this pull request
- [X] Labels from the issue that this PR is fixing are added to this pull request
- [X] No unnecesary issues are included into this pull request
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128135
Approved by: https://github.com/jbschlosser
We don't care about the Dynamo x TorchScript composition, so I'm
disabling these tests (so they don't get reported as flaky). Not
disabling all of the TorchScript tests yet because they have been useful
to catch random bugs.
Test Plan:
- CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128731
Approved by: https://github.com/williamwen42
FIXES#113263. Same idea as in https://github.com/pytorch/pytorch/pull/113417, but we need a more intrusive C API to silently nop default saved tensor hooks, in order to support user-code that use torch.autograd.disable_saved_tensors_hooks (see test_unpack_hooks_can_be_disabled). We mock the output of get_hooks while leaving push/pop untouched.
For compiled autograd, we're firing pack hooks once and unpack hooks twice right now, I'll look into this separately from this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123196
Approved by: https://github.com/soulitzer
These models are really flaky. I went into the CI machine and ran the model many times, sometime it fails, sometimes it passes. Even Pytorch-eager results change from run to run, so the accuracy comparison is fundamentally broken/non-deterministic. I am hitting these issues more frequently in inlining work. There is nothing wrong with inlining, I think these models are on the edge of already-broken accuracy measurement, and inlining is just pushing it in more broken direction.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128715
Approved by: https://github.com/eellison
This is an extension of [ModuleTracker](https://github.com/pytorch/pytorch/blob/main/torch/utils/module_tracker.py) with added features and bug fixes.
1. Allows installing user-defined hooks to be called in pre-fw, post-fw, pre-bw and post-bw hooks of the ``ModTracker``.
2. Adds a function ``get_known_fqn`` that retrieves the fqn of the module as tracked by the ``ModTracker``.
3. Only registers the multi-grad hooks if we are in the forward pass. This is important because, a module's pre-fw and post-fw hooks get called in the backward during AC and we do not want to register multi-grad hooks in this case.
4. Sets the kwarg ``always_call=True`` for post-fw hooks, so that they are called post AC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128508
Approved by: https://github.com/wanchaol
Adds `C10_UBSAN_ENABLED` macro and use it to disable `SymIntTest::Overflows` (fails under `signed-integer-overflow` UBSAN check).
Also cleans up UBSAN guard in `jit/test_misc.cpp` to use `C10_UBSAN_ENABLED` and the existing `C10_ASAN_ENABLED` instead of locally defining `HAS_ASANUBSAN`.
> NOTE: This should fix `SymIntTest::Overflows` failing under ubsan in fbcode too...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127967
Approved by: https://github.com/atalman, https://github.com/d4l3k, https://github.com/malfet
This PR renames the implementation details of register_fake to align
more with the new name. It is in its own PR because this is risky
(torch.package sometimes depends on private library functions and
implementation details).
Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123938
Approved by: https://github.com/williamwen42
This is an extension of [ModuleTracker](https://github.com/pytorch/pytorch/blob/main/torch/utils/module_tracker.py) with added features and bug fixes.
1. Allows installing user-defined hooks to be called in pre-fw, post-fw, pre-bw and post-bw hooks of the ``ModTracker``.
2. Adds a function ``get_known_fqn`` that retrieves the fqn of the module as tracked by the ``ModTracker``.
3. Only registers the multi-grad hooks if we are in the forward pass. This is important because, a module's pre-fw and post-fw hooks get called in the backward during AC and we do not want to register multi-grad hooks in this case.
4. Sets the kwarg ``always_call=True`` for post-fw hooks, so that they are called post AC.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128508
Approved by: https://github.com/wanchaol
Tries to fix#127677.
# Context
Just as @peterbell10 pointed out, we have the following scenario:
```
a = ops.indirect_indexing(...)
b = ops.index_expr(a, ...)
c = ops.indirect_indexing(b, ...)
```
We can repro this as:
```
def forward(self, arg0_1, arg1_1, arg2_1):
iota = torch.ops.prims.iota.default(arg0_1, start = 0, step = 1, index=0),
repeat_interleave = torch.ops.aten.repeat_interleave.Tensor(arg1_1);
index = torch.ops.aten.index.Tensor(iota, [repeat_interleave]);
index_1 = torch.ops.aten.index.Tensor(arg2_1, [index]);
return (index_1,)
```
which should generate a JIT py file like this:
```
def triton_poi_fused_index_select_0(in_ptr0, in_ptr1, out_ptr0, ks0, xnumel, XBLOCK : tl.constexpr):
...
tmp0 = tl.load(in_ptr0 + (x1), xmask, eviction_policy='evict_last')
tmp1 = ks0
tmp2 = tmp0 + tmp1
tmp3 = tmp0 < 0
tmp4 = tl.where(tmp3, tmp2, tmp0)
# check_bounds()
tl.device_assert(((0 <= tmp4) & (tmp4 < ks0)) | ~(xmask), "index out of bounds: 0 <= tmp4 < ks0")
def call():
arg0_1, arg1_1, arg2_1 = args
buf1 = aten.repeat_interleave.Tensor(arg1_1)
buf4 = empty_strided_cuda((u0, 64), (64, 1))
triton_poi_fused_index_select_0.run(
buf1, arg2_1, buf4, s0,
triton_poi_fused_index_select_0_xnumel,
grid=grid(triton_poi_fused_index_select_0_xnumel),
stream=stream0)
```
# Issue
In our `IndexPropagation.indirect_indexing()` call we have `expr=indirect0` which is spawned in `LoopBodyBlock.indirect_indexing()`.
3b555ba477/torch/_inductor/ir.py (L8154-L8160)
When we try to see if we can prove its bounds, we fail because `indirect0` isn't in `var_ranges`.
# Approach
When creating `indirect` symbols from fallback, specify its range to be `[-size, size -1]` to avoid a lookup error with `indirectX`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128378
Approved by: https://github.com/lezcano, https://github.com/peterbell10
**Summary**
Currently, the comm_mode_feature_examples does not have an example for printing sharding information for a model with nested module. While adding the new example to the suite, I recognized a way to refactor existing examples in order to make them more readable for users. The expected output can be found below:
<img width="354" alt="Screenshot 2024-06-11 at 5 41 14 PM" src="https://github.com/pytorch/pytorch/assets/50644008/68cef7c7-cb1b-4e51-8b60-85123d96ca92">
**Test Plan**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128461
Approved by: https://github.com/XilunWu
ghstack dependencies: #128369, #128451
**Summary**
I have added comments to address previous readability concerns in comm_mode.py and comm_mode_features_example.py. I also renamed files and test cases in order to better reflect what they are about. Removed non-distributed test case and other lines of code that do not contribute to the example of how comm_mode can be used. Finally, I've added the expected output for each example function so users are not forced to run code.
**Test Plan**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128451
Approved by: https://github.com/XilunWu
ghstack dependencies: #128369
**Summary**
Currently, CommDebugMode only allows displaying collective tracing at a model level whereas a user may require a more detailed breakdown. In order to make this possible, I have changed the ModuleParamaterShardingTracker by adding a string variable to track the current sub-module as well as a dictionary keeping track of the depths of the submodules in the model tree. CommModeDebug class was changed by adding a new dictionary keeping track of the module collective counts as well as a function that displays the counts in a way that is easy for the user to read. Two examples using MLPModule and Transformer have been added to showcase the new changes. The expected output of the simpler MLPModule example is:
<img width="255" alt="Screenshot 2024-06-10 at 4 58 50 PM" src="https://github.com/pytorch/pytorch/assets/50644008/cf2161ef-2663-49c1-a8d5-9f97e96a1791">
**Test Plan**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/display_sharding_example.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128369
Approved by: https://github.com/XilunWu
Summary:
Added `set_module_name_qconfig` support to allow users to set configurations based on module name in `X86InductorQuantizer`.
For example, only quantize the `sub`:
```python
class M(torch.nn.Module):
def __init__(self):
super().__init__()
self.linear = torch.nn.Linear(5, 5)
self.sub = Sub()
def forward(self, x):
x = self.linear(x)
x = self.sub(x)
return x
m = M().eval()
example_inputs = (torch.randn(3, 5),)
# Set config for a specific submodule.
quantizer = X86InductorQuantizer()
quantizer.set_module_name_qconfig("sub", xiq.get_default_x86_inductor_quantization_config())
```
- Added `set_module_name_qconfig` to allow user set the configuration at the `module_name` level.
- Unified the annotation process to follow this order: `module_name_qconfig`, `operator_type_qconfig`, and `global_config`.
- Added `config_checker` to validate all user configurations and prevent mixing of static/dynamic or QAT/non-QAT configs.
- Moved `_get_module_name_filter` from `xnnpack_quantizer.py` into `utils.py` as it common for all quantizer.
Test Plan
```bash
python -m pytest quantization/pt2e/test_x86inductor_quantizer.py -k test_set_module_name
```
@Xia-Weiwen @leslie-fang-intel @jgong5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126044
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jerryzh168
When we don't dynamo.reset(), we don't recompile on different dynamic shapes.
Also, some of the returned views were tuples - so when we `* 2`, we actually just copy all the inputs twice in the tuple. I changed it so that it would just return one of the values from the return tuple.
Additionally, this exposes a bug that fails with the slice operation, so I skipped it when we're testing with dynamic shapes:
```
File "/home/dberard/local/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3996, in produce_guards
sexpr = ShapeGuardPrinter(symbol_to_source, source_ref, self.var_to_sources).doprint(expr)
File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 292, in doprint
return self._str(self._print(expr))
File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 331, in _print
return printmethod(expr, **kwargs)
File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 56, in _print_Add
t = self._print(term)
File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 331, in _print
return printmethod(expr, **kwargs)
File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 366, in _print_Mul
a_str = [self.parenthesize(x, prec, strict=False) for x in a]
File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 366, in <listcomp>
a_str = [self.parenthesize(x, prec, strict=False) for x in a]
File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/str.py", line 37, in parenthesize
return self._print(item)
File "/home/dberard/local/miniconda3/envs/pytorch/lib/python3.10/site-packages/sympy/printing/printer.py", line 331, in _print
return printmethod(expr, **kwargs)
File "/home/dberard/local/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1494, in _print_Symbol
assert self.symbol_to_source.get(expr), (
AssertionError: s3 (could be from ['<ephemeral: symint_visitor_fn>', '<ephemeral: symint_visitor_fn>']) not in {s0: ["L['x'].a.size()[1]", "L['x'].b.size()[1]", "L['x'].size()[1]", "L['x'].a.size()[1]", "L['x'].b.size()[1]", "L['x'].a.size()[1]", "L['x'].b.size()[1]"], s1: ["L['x'].a.stride()[0]", "L['x'].b.stride()[0]", "L['x'].stride()[0]", "L['x'].a.stride()[0]", "L['x'].b.stride()[0]", "L['x'].a.stride()[0]", "L['x'].b.stride()[0]"], s2: ["L['x'].a.storage_offset()", "L['x'].b.storage_offset()", "L['x'].a.storage_offset()", "L['x'].b.storage_offset()"]}. If this assert is failing, it could be due to the issue described in https://github.com/pytorch/pytorch/pull/90665
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128659
Approved by: https://github.com/YuqingJ
Summary:
GraphTransformObserver saves the SVG file of the input/output graph in each inductor pass. In my test with CMF model, if the graph is large, GraphViz took forever to convert DOT to SVG. That is NOT acceptable.
This DIFF is to save DOT file instead of SVG file to speed it up. Also DOT file size is order of mangitude smaller than SVG.
To view these graphs, user can run dot -Txxx inpout.dot to convert DOT to any other format you want. User can control how many iterations to layout the graph properly. Refer to https://web.archive.org/web/20170507095019/http://graphviz.org/content/attrs#dnslimit for details.
Test Plan: buck2 test mode/dev-sand caffe2/test:fx -- fx.test_fx_xform_observer.TestGraphTransformObserver
Differential Revision: D58539182
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128634
Approved by: https://github.com/mengluy0125
It has come to my attention that some of our licenses are incorrect, so I attempted to rectify a few of them based on given recommendations for:
clog - BSD-3
eigen - MPL-2.0
ffnvcodec - LGPL-2.1
-> **hungarian - Permissive (free to use)**
irrlicht - The Irrlicht Engine License (zlib/libpng)
-> **pdcurses - Public Domain for core**
-> **sigslot - Public Domain**
test - BSD-3
Vulkan - Apache-2.0 or MIT
fb-only: more context is here https://fb.workplace.com/groups/osssupport/posts/26333256012962998/?comment_id=26333622989592967
This PR addressed the manual mismatches of licensing mentioned above (the two bolded, one is getting addressed in #128085, but as everything else is generated by pulling through other files, I did not address those. It is unclear what needs to be updated for the remaining to be accurate/if they're inaccurate today.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128630
Approved by: https://github.com/malfet
This matches our autograd logic for pytorch native operators. There's no
need to invoke an autograd.Function if we're under a torch.no_grad() or
if none of the inputs have requires_grad=True (invoking an
autograd.Function results in (noticeable) overhead).
Test Plan:
- new test
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127976
Approved by: https://github.com/williamwen42
Fixes#127896
### Description
Add docstring to `torch/jit/frontend.py:get_default_args` function
### Checklist
- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128408
Approved by: https://github.com/malfet
In fused_all_gather_matmul, each rank copies their shard into their
local p2p buffer, performs a barrier, then performs (copy -> matmul) for
each remote shard. The (copy -> matmul)s for remote shards run on two
streams without synchronization. This not only allows for
computation/communication overlapping, but also computation/computation
overlapping which alleviates the wave quantization effect caused by
computation decomposition.
However, the synchronization-free approach doesn't work well with
fused_matmul_reduce_scatter, in which there's a barrier in every step.
Without synchronization between the two streams, a matmul in one stream
can delay a barrier in the other stream, further delaying the copy
waiting for the barrier.
This PR addresss the issue by adding synchronization between the two
streams such that the matmul of step i can only start after the barrier
of step i-1 completes. With this approach, we lose the
computation/computation overlapping, but avoid slowdown due to delayed
barrier.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127455
Approved by: https://github.com/Chillee
ghstack dependencies: #127454
This PR changes the traced_tangents field of ViewAndMutationMeta to be cache safe. Specifically, at runtime, the only time we need the fw_metadata's traced_tangent's field is for Tensor subclass metadata from __tensor_flatten__. So instead of storing an entire FakeTensor, which has many fields that can be unserializable, only store the result of __tensor_flatten__() on any FakeTensors representing subclasses.
That said, there's no guarantee that `__tensor_flatten__` is actually serializable: if we fail to pickle the result of __tensor_flatten__ we won't save to the cache.
To do this, we also make a small change to `__coerce_same_metadata_as_tangent__`, so that it takes in the return value of tensor_flatten() instead of an entire FakeTensor. Let me know if we should change the name of the function.
By doing this, we can now run the dynamic shapes cache test with autograd turned on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127618
Approved by: https://github.com/bdhirsh
Summary: I've seen this issue once in the wild and oulgen was able to repro in a unit test. The problem is this:
- We're using pickle to turn everything related to the FX graph cache key into a byte stream, then hashing the bytes to compute the cache key.
- Pickle is optimized to avoid serializing the same ID more than once; it instead drops a reference to a previously-pickled object if it encounters the same ID.
- That pickle behavior means that we can see different cache keys if an object id appears more than once in the hashed objects vs. being functionally equivalent but distinct objects.
The cases I've investigated only involve the torch.device objects in the tensor graph args. That is, we may compile a graph with two tensor args, each referencing `torch.device('cpu')`. In one run, those devices may reference the same object; in another, they may reference distinct (but equivalent) objects. In practice, my observation is that the compiler is largely deterministic and this situation is rare. I've seen cache misses on a real benchmark only when enabling/disabling FakeTensor caching in order to introduce different code paths that otherwise produce the same fx graph. But the failing unit test seems to be enough motivation for a remediation?
I don't really love this solution, but I've failed to find another way to make the pickling phase robust to these kinds of changes, e.g., by changing the protocol version or by overriding internal methods (which would also be gross). But I'm definitely open to other creative ideas.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128366
Approved by: https://github.com/oulgen, https://github.com/eellison
Summary: The feature was previously disabled in fbcode due to breaking the deterministic NE unit tests. Now it has been on in OSS for quite a while and we verified that it has no NE impact on CMF, we want to update the unit test and enable the feature.
Test Plan:
```
time buck2 test 'fbcode//mode/opt' fbcode//aps_models/ads/icvr/tests/ne/e2e_deterministic_tests:fm_tests -- --exact 'aps_models/ads/icvr/tests/ne/e2e_deterministic_tests:fm_tests - aps_models.ads.icvr.tests.ne.e2e_deterministic_tests.icvr_fm_test.ICVR_FM_DeterministicTest: test_icvr_fm_pt2_fsdp_multi_gpus'
```
Differential Revision: D58425432
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128555
Approved by: https://github.com/eellison
Summary: When calling a fallback op in the minimal_arrayref_interface mode with an optional tensor, a temporary RAIIAtenTensorHandle needes to be explicitly created in order to pass a pointer of tensor as the optional tensor parameter.
Test Plan: CI
Differential Revision: D58528575
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128613
Approved by: https://github.com/hl475
When performing fused_all_gather_matmul/fused_matmul_reduce_scatter and gather_dim/scatter_dim != 0, a copy of the lhs operand (A_shard/A) is needed for layout transformation.
This copy can be avoided if the lhs operand already has the following stride order:
lhs.movedim(gather_dim, 0).contiguous().movedim(0, gather_dim).stride()
In `micro_pipeline_tp` passes, we enforce the lhs operand to have such stride order via `inductor_prims.force_stride_order`. This way if the lhs operand has a flexible layout, the copy is avoided.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127454
Approved by: https://github.com/Chillee
This PR introduces naive CPU impls for:
* `_jagged_to_padded_dense_forward()`
* `_padded_dense_to_jagged_forward()`
On the CUDA side, these are backed by lifted FBGEMM kernels. We may want to revisit the CPU versions with higher-performance implementations at a later time.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127007
Approved by: https://github.com/davidberard98
We should be able to remove this as, with the new canonicalisation, we
have that `a < b` and `-a > -b` should be canonicalised to the same
expression (if SymPy does not interfere too much).
nb. I thought this would cut further the compilation time, but I was running
the benchmarks wrong (not removing triton's cache oops). It turns out that
after the first PR in this stack, https://github.com/pytorch/pytorch/issues/128398 is fully fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128500
Approved by: https://github.com/ezyang
ghstack dependencies: #128410, #128411
https://github.com/pytorch/pytorch/pull/126717 will skip the tests in both ABI compatible and non-ABI compatible mode.
It's not expected to skip them in non-ABI compatible mode since they can actually run successfully in such mode but only have issues in ABI compatible mode.
We leverage the existing `xfail_list` for those that will only fail in ABI compatible mode.
- `test_qlinear_add` is already in the `xfail_list`.
- `test_linear_packed` doesn't fail either in my local run (running with `TORCHINDUCTOR_ABI_COMPATIBLE=1`) or in the CI of this PR so I didn't add it into `xfail_list`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128506
Approved by: https://github.com/jgong5, https://github.com/desertfire
Fixes https://github.com/pytorch/pytorch/issues/128544
Fixes https://github.com/pytorch/pytorch/issues/128535
We had a problem with multithreading where the nonlocals were being
clobbered. In the first place, we stored these nonlocals because we
wanted to ferry information from an autograd.Function.apply to
autograd.Function.forward.
Our new approach is:
- pass the information directly as an input to the
autograd.Function.apply. This means that the autograd.Function.forward
will receive the information too.
- this messes up ctx.needs_input_grad, which has an element per input to
forward. The user should not see the additional information we passed.
We fix this by temporarily overriding ctx.needs_input_grad to the
right thing.
- this exposed a bug in that ctx.needs_input_grad wasn't correct for
TensorList inputs. This PR fixes that too.
Test Plan:
- existing and new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128547
Approved by: https://github.com/williamwen42, https://github.com/soulitzer
https://github.com/pytorch/pytorch/issues/127572
Allow mutations in backward on forward inputs, if
1/ not mutationg metadata
Enforced at compilation time.
2/ if create_graph=True: mutated input does not require_grad
Enforced in runtime, when create_graph mode can be detected by checking torch.is_grad_enabled()
Adding input_joint_info to track mutations of inputs during joint.
Created a separate field in ViewAndMutationMeta as it is filled only after joint fn tracing.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128409
Approved by: https://github.com/bdhirsh
As part of #125683, this PR adds epilogue fusion support for bf16/fp16 gemms. The key changes are as follows:
1. bf16 linear w/ epilogue fusion of some ops was originally supported via ATen oneDNN linear pointwise ops. In order to match the ATen op semantics, in-template epilogue support is added to the cpp gemm template so that we would have: "gemm + in-template epilogues -> template buffer". If the template is chosen for codegen, the in-template epilogues will be concatenated with the out-of-template epilogues that are appended during the scheduling.
2. Support bf16/fp16 legalization for `codegen_loop_bodies` which is used to generate the epilogue loops.
3. We used to leverage the in-place buffer mechanism to handle the in-place buffers in the epilogue codegen, in particular, for the reuses for output buffers of GEMM, template and epilogues. This is not correct since the output buffer is an "output" not an "in-place" buffer of the template kernel itself. Now, we use a dedicated "aliases" dict to manage such buffer reuses and the intermediate aliasing buffers are removed after codegen.
4. Add `localize_buffer` method to `LocalBufferScope` to allow the replacement of a global buffer with a local one in the given inductor IR nodes. This helps the fused loops to work on smaller-sized local buffers for better data locality.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126545
Approved by: https://github.com/jansel
Minor tweak of comparison as using `assert` on `torch.allclose` prevents the mismatches from being logged. Also bump a few tolerances that seem to be causing failures on sm86/sm90
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128553
Approved by: https://github.com/jcaip
## Context
This PR ported GGML int8 per channel matrix multiplication and matrix vector multiplication metal shaders into ATen library.
llama.cpp LICENSE: https://github.com/ggerganov/llama.cpp/blob/master/LICENSE
## Key Changes
Made the following changes to the original code:
* Memory layout of weight and scales is different than llama.cpp.
* Weight dequantization (scales multiplication) is done after MM is finished.
* Following PyTorch naming convention (M, K, N and assuming row major).
## Benchmark
When M = 1, mv shader improves existing ATen int8mm by 40%.
When M > 4, mm shader outperforms existing ATen int8mm up to 10x for a large M, as show blow.

Hence the kernel chooses different shaders based on M.
## Test Plan
Tests are passing:
```
❯ python test/test_mps.py -v -k _int8_
/Users/larryliu/CLionProjects/pytorch/venv/lib/python3.8/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: 'dlopen(/Users/larryliu/CLionProjects/pytorch/venv/lib/python3.8/site-packages/torchvision/image.so, 0x0006): Symbol not found: __ZN3c1017RegisterOperatorsD1Ev
Referenced from: <A770339A-37C9-36B2-84FE-4125FBE26FD6> /Users/larryliu/CLionProjects/pytorch/venv/lib/python3.8/site-packages/torchvision/image.so
Expected in: <5749F98A-0A0C-3F89-9CBF-277B3C8EA00A> /Users/larryliu/CLionProjects/pytorch/torch/lib/libtorch_cpu.dylib'If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
warn(
test__int8_mm_m_1_k_32_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_1_k_32_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_1_k_64_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_1_k_64_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_32_k_32_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_32_k_32_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_32_k_64_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_32_k_64_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_64_k_32_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_64_k_32_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_64_k_64_n_32_mps (__main__.TestLinalgMPSMPS) ... ok
test__int8_mm_m_64_k_64_n_64_mps (__main__.TestLinalgMPSMPS) ... ok
----------------------------------------------------------------------
Ran 12 tests in 1.180s
OK
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127646
Approved by: https://github.com/malfet
In a previous life, we used sympy.oo to represent the lower/upper bounds of integer ranges. Later, we changed this to be sys.maxsize - 1 for a few reasons: (1) sometimes we do tests on a value being exactly sys.maxsize, and we wanted to avoid a data dependent guard in this case, (2) sympy.oo corresponds to floating point infinity, so you get incorrect types for value ranges with oo, and (3) you can do slightly better reasoning if you assume that input sizes fall within representable 64-bit integer range.
After working in the sys.maxsize regime for a bit, I've concluded that this was actually a bad idea. Specifically, the problem is that you end up with sys.maxsize in your upper bound, and then whenever you do any sort of size-increasing computation like size * 2, you end up with 2 * sys.maxsize, and you end up doing a ton of arbitrary precision int computation that is totally unnecessary. A symbolic bound is better.
But especially after #126905, we can't go back to using sympy.oo, because that advertises that it's not an integer, and now your ValueRanges is typed incorrectly. So what do we do? We define a new numeric constant `int_oo`, which is like `sympy.oo` but it advertises `is_integer`. **test/test_sympy_utils.py** describes some basic properties of the number, and **torch/utils/_sympy/numbers.py** has the actual implementation.
The rest of the changes of the PR are working out the implications of this change. I'll give more commentary as inline comments.
Fixes https://github.com/pytorch/pytorch/issues/127396
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127693
Approved by: https://github.com/lezcano
ghstack dependencies: #126905
Summary: There are clang errors in profiler_kineto. It would probably be a good idea to fix them as the file is already quite dense.
Test Plan: Make sure all on Phabricator all tests under static_tests/lint_root pass
Differential Revision: D58431005
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128464
Approved by: https://github.com/aaronenyeshi
Related doc: https://docs.google.com/document/d/1BKyizkZPdri9mHqdDOLAUpkI7SbbKfLHRFVVpK9ZWqo/edit
Memory considerations:
- As with the existing SAC, cached values are cleared upon first use.
- We error if the user wishes to backward a second time on a region forwarded with SAC enabled.
In-place:
- We use version counting to enforce that if any cached tensor has been mutated. In-place operations not mutating cached tensors are allowed.
- `allow_cache_entry_mutation=True` can be passed to disable this check (useful in the case of auto AC where the user is cleverly also saves the output of the in-place)
Randomness, views
- Currently in this PR, we don't do anything special for randomness or views, the author of the policy function is expected to handle them properly. (Would it would be beneficial to error? - we either want to save all or recompute all random tensors)
Tensor object preservation
- We guarantee that if a tensor does not requires grad, and it is saved, then what you get out is the same tensor object. If the tensor does require grad, we must detach to avoid creating a reference cycle. This is a nice guarantee for nested tensors which care about the object identity of of the offsets tensor.
Policy function
- Enum values are `{MUST,PREFER}_{SAVE,RECOMPUTE}` (bikeshed welcome). Alternatively there was `{SAVE,RECOMPUTE}_{NON_,}OVERRIDABLE`. The former was preferred bc it seemed clearer that two `MUST` clashing should error, versus it is ambiguous whether two `NON_OVERRIDABLE` being stacked should silently ignore or error.
- The usage of Enum today. There actually is NO API to stack SAC policies today. The only thing the Enum should matter for in the near term is the compiler. The stacking SAC policy would be useful if someone wants to implement something like simple FSDP, but it is not perfect because with a policy of `PREFER_SAVE` you are actually saving more than autograd would save normally (would be fixed with AC v3).
- The number of times we call the policy_fn is something documented part of public API. We call the policy function for all ops except detach because detach is itself called a different number of times by AC between forward and recompute.
- The policy function can be a stateful object (we do NOT make separate copies of this object for forward/recompute, the user is expected to handle that via is_recompute see below).
Tensors guaranteed to be the same tensor as-is
- Policy function signature takes ctx object as its first argument. The ctx function is an object encapsulating info that may be useful to the user, it currently only holds "is_recompute". Adding this indirection gives us flexibility to add more attrs later if necessary.
"bc-breaking" for existing users of the private API:
- Existing policy functions must now change their return value to use the Enum.
- Existing calls to `_pt2_selective_checkpoint_context_fn_gen` must be renamed to `gen_selective_checkpoint_context_fn`. The way you use the API remains the same. It would've been nice to do something different (not make the user have to use functools.partial?), but this was the easiest to compile (idk if this should actually be a constraint).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125795
Approved by: https://github.com/Chillee, https://github.com/fmassa
Summary: prim::dtype has the signature `(Tensor a) -> int`, where it gets the dtype of the tensor and returns the integer corresponding to this dtype based on the enum in ScalarType.h. Previously we were converting prim::dtype by returning the actual dtype of the tensor (ex. torch.float32). This causes some incorrect control flow to behavior, specifically where it checks if `prim::dtype(tensor) in [3, 5, 7]`, where [3, 5, 7] correspond to torch.int32, torch.float16, torch.float64. This control flow would always returns False because we would be comparing torch.float32 against the integers [3, 5, 7], which is a type mismatch.
Test Plan: 7/22 internal models now are convertable and runnable in eager and sigmoid! P1410243909
Reviewed By: jiashenC
Differential Revision: D58469232
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128517
Approved by: https://github.com/jiashenC
```at::detail::computeStorageNbytesContiguous``` does fewer data-dependent tests compared to ```at::detail::computeStorageNbytes```. Therefore, use of former is more likely to succeed with dynamic shapes. This PR detects is_contiguous and dispatches to the appropriate function. This should be helpful in unblocking aot_eager for torchrec. As an aside, this is an alternative solution to the unsound solution I had first proposed in another [PR](#128141).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128515
Approved by: https://github.com/ezyang
This PR lifts internal lowerings written for FBGEMM kernels that do jagged <-> padded dense conversions. In particular, this PR provides lowerings and meta registrations for the following ATen ops:
* `_jagged_to_padded_dense_forward()`
* `_padded_dense_to_jagged_forward()`
* NB: if `total_L` is not provided, the output shape is data-dependent. An unbacked SymInt is used for this case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125968
Approved by: https://github.com/davidberard98
This PR intends to support the aten operations with the `out` tensor.
Currently, the AOT compile always does **NOT** keep input tensor mutations. According to the comments, this is because it has not encountered such a use case.
> For now there's no use case involving keeping input mutations in the graph (which we can only do in the inference case anyway). We can add this later if we need to.
However, for aten operations, it is popular that the `out` tensor is an input parameter and needs to be mutated. This PR intends to support it by adding a `keep_inference_input_mutations` flag to `aot_inductor.keep_inference_input_mutations`. This flag can provide flexibility to the callee in deciding whether the AOT compile needs to keep input tensor mutations in the graph.
Take `clamp` as an example as follows.
```python
out_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(-2.0)
inp_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(1.0)
min_tensor = inp_tensor - 0.05
max_tensor = inp_tensor + 0.05
torch.clamp(input=inp_tensor, min=min_tensor, max=max_tensor, out=out_tensor)
```
W/O this PR
```python
def forward(self):
arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";
arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1); arg0_1 = arg1_1 = None
clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1); clamp_min = arg2_1 = None
return (clamp_max, clamp_max)
```
W/ this PR
```python
def forward(self):
arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";
arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1); arg0_1 = arg1_1 = None
clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1); clamp_min = arg2_1 = None
copy_: "f32[128]" = torch.ops.aten.copy_.default(arg3_1, clamp_max); arg3_1 = clamp_max = None
return (copy_,)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124926
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/angelayi
Summary:
D56907877 modified OSS commSplit. However, commSplit requires every rank being called even though it is no-color. ncclCommSplit will not create a communicator for nocolor ranks hence this line of code will potentially throw error like `NCCL WARN CommUserRank : comm argument is NULL`
Revert this change from D56907877
Test Plan: CI
Differential Revision: D58436088
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128459
Approved by: https://github.com/shuqiangzhang
Summary:
The previous side effect pruning algorithm would keep many dead cell
variables alive. For example, in
https://github.com/pytorch/pytorch/issues/125078, the compiled function
has one return but there were three in the Dynamo graph due to two
dead cell variables not being pruned away.
This PR adds a corrected algorithm. "new cell variables" are alive if
they can be reached from one of the following:
1. any of the tx.symbolic_locals or tx.stack (that is, if they are
involved in a return from the function or intermediate variable
during a graph break). Example: an alive NestedUserFunctionVariable
2. "mutations to pre-existing objects". Example: appending a
NestedUserFunctionVariable to a global list
The new algorithm reflects this, but please let me know if there are
more cases to handle.
Test Plan:
- existing tests (afaict, test/dynamo/test_python_autograd is the best
SideEffects test case we have)
- see in test/dynamo/test_higher_order_ops that the expecttests changed
-- the functorch dynamo graphs no longer return dead cellvars.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128028
Approved by: https://github.com/jansel
This PR makes it so we lazily save to the cache on backward call instead of saving ahead of time always. We have to pass a closure to post_compile to prevent cyclic dependencies.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126999
Approved by: https://github.com/bdhirsh
ghstack dependencies: #126791
This is a short-term fix (for 2.4). In the longer term we should
fix https://github.com/pytorch/pytorch/issues/128430
The problem is that warnings.warn that are inside Dynamo print
all the time. Python warnings are supposed to print once, unless their
cache is reset: Dynamo ends up resetting that cache everytime it runs.
As a workaround we provide our own warn_once cache that is keyed on the
warning msg. I am not worried about this increasing memory usage because
that's effectively what python's warnings.warn cache does.
Test Plan:
- fix tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128456
Approved by: https://github.com/anijain2305
Fix https://github.com/pytorch/pytorch/issues/128287.
Previous the assertion in `linear_add_bias` are pretty bad
```
assert packed_weight_node.name == "_reorder_linear_weight"
assert transpose_weight_node.name == "permute_default"
```
because the `name` can be changed to `_reorder_linear_weight_id, permute_default_id` if we have more than 1 reorder/permute.
Check `target` instead `name` can solve this issue.
UT is also updated to have match more than 1 `linear_add_bias` pattern to cover this case.
Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128473
Approved by: https://github.com/jgong5
Summary:
Number of features rely on TCP store as a control plane. By default TCPStore server is started on rank0 trainer and this can create a a race condition when rank0 may exit (error and graceful exit) and any other ranks reading/writing will fail.
Solution: TCPStore server should outlive all the trainer processes. By moving the ownership TCPStore to torchelastic agent it naturally fixes the lifecycle of the server.
Static rendezvous in torchelastic does already support sharing of the TCPStore server. We are extending this to more commonly used c10d rendezvous handler.
Any handler would like to manage tcp store has to:
- Return true on `use_agent_store` property
- `RendezvousInfo`.`RendezvousStoreInfo`#[`master_addr/master_port`] values refer to managed TCPStore (those are returned on `next_rendezvous` call)
Note: in some instances users may want to use non-TCPStore based stores for the torchelastic rendezvous process, so the handler will need to create and hold a reference to TCPStore (as done in this change)
Test Plan:
`cat ~/workspace/dist-demo/stores.py`
~~~
import torch
import logging
import sys
import torch.distributed as dist
import torch
import os
import time
logger = logging.getLogger(__name__)
logger.addHandler(logging.StreamHandler(sys.stderr))
logger.setLevel(logging.INFO)
def _run_test(store):
if dist.get_rank() == 1:
logger.info("Rank %s is sleeping", dist.get_rank())
time.sleep(5)
key = "lookup_key"
logger.info("Checking key %s in store on rank %s", key, dist.get_rank())
store.check([key])
else:
logger.info("rank %s done", dist.get_rank())
def main() -> None:
use_gpu = torch.cuda.is_available()
dist.init_process_group(backend="nccl" if use_gpu else "gloo")
dist.barrier()
logger.info(f"Hello World from rank {dist.get_rank()}")
host = os.environ['MASTER_ADDR']
port = os.environ['MASTER_PORT']
world_size = os.environ['WORLD_SIZE']
logger.info("testing TCPStore")
store = dist.TCPStore(
host_name=host, port=int(port), world_size=int(world_size),
)
_run_test(store)
if __name__ == "__main__":
main()
~~~
With the fix (TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 or just drop the option)
~~~
(pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=0 python -m torch.distributed.run --rdzv-backend c10d --nproc-per-node 3 ~/workspace/dist-demo/stores.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Hello World from rank 1
Hello World from rank 2
Hello World from rank 0
testing TCPStore
testing TCPStore
testing TCPStore
rank 2 done
Rank 1 is sleeping
rank 0 done
Checking key lookup_key in store on rank 1
~~~
TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1
~~~
(pytorch_38) [kurman@devgpu011.cln5 ~/local/pytorch (main)]$ TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 python -m torch.distributed.run --rdzv-backend c10d --npro
c-per-node 3 ~/workspace/dist-demo/stores.py
master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
WARNING:__main__:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
Hello World from rank 0
Hello World from rank 2
Hello World from rank 1
testing TCPStore
testing TCPStore
testing TCPStore
rank 0 done
rank 2 done
Rank 1 is sleeping
Checking key lookup_key in store on rank 1
[rank1]: Traceback (most recent call last):
[rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 46, in <module>
[rank1]: main()
[rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 42, in main
[rank1]: _run_test(store)
[rank1]: File "/home/kurman/workspace/dist-demo/stores.py", line 22, in _run_test
[rank1]: store.check([key])
[rank1]: torch.distributed.DistNetworkError: Connection reset by peer
E0605 17:40:22.853277 140249136719680 torch/distributed/elastic/multiprocessing/api.py:832] failed (exitcode: 1) local_rank: 1 (pid: 2279237) of binary: /home/kurman/.conda/envs/pytorch_38/bin/python
Traceback (most recent call last):
File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/home/kurman/.conda/envs/pytorch_38/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/data/users/kurman/pytorch/torch/distributed/run.py", line 904, in <module>
main()
File "/data/users/kurman/pytorch/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/data/users/kurman/pytorch/torch/distributed/run.py", line 900, in main
run(args)
File "/data/users/kurman/pytorch/torch/distributed/run.py", line 891, in run
elastic_launch(
File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 132, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/data/users/kurman/pytorch/torch/distributed/launcher/api.py", line 263, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/kurman/workspace/dist-demo/stores.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-06-05_17:40:22
host : devgpu011.cln5.facebook.com
rank : 1 (local_rank: 1)
exitcode : 1 (pid: 2279237)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
~~~
Differential Revision: D58180193
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128096
Approved by: https://github.com/shuqiangzhang
With inilining NN modules these tests no longer raise runtime errors because changing static ptrs induces a rerecording instead of a runtime error. The solution is to run the test with inlining disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128529
Approved by: https://github.com/anijain2305
ghstack dependencies: #128528
With inilining NN modules these tests no longer raise runtime errors because changing static ptrs induces a rerecording instead of a runtime error. The solution is to run the test with inlining disabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128528
Approved by: https://github.com/anijain2305
This PR implements "V0" of AOTAutogradCache. Given an input to AOTAutograd, we calculate a cache key, then save an AOTAutogradCacheEntry.
Each AOTAutogradCacheEntry has:
- A CompiledForward and optionally a CompiledBackward
- A bunch of metadata.
CompiledForward and CompiledBackward each save the *key* to the FXGraphCache associated with the compiled object. FXGraphCache populates this key field as long as it's able to return a compiled graph given a set of inputs. We then load the same object from the FXGraphCache on an AOTAutogradCache hit.
On cache miss:
- Run AOTAutograd, up to AOTAutogradDispatch.post_compile.
- Save an AOTAutogradCacheEntry to the cache after compiling the necessary portions and receiving a cache key from FXGraphCache. In this we *always* compile the backwards ahead of time. The PR above this one implements backward lazy caching, so that we only save to the cache after compiling the backward in a lazy backward scenario.
- Return the resulting object
On cache hit:
- Run AOTAutogradCacheEntry.post_compile() on the cache key.
- This attempts to load the forward and backward graphs from FXGraphCache
- As long as we successfully load from FXGraphCache, it's a hit. We then rewrap the callable with post compile wrappers using our saved metadata.
For now, we ignore the fakified out and debug wrappers. We only save to the cache if Fakified out is turned off.
V0 Guards behavior:
FXGraphCache serializes guards that are needed in the shape_env based on the symint inputs to the graph. The invariant that AOTAutograd uses here is that the sources for symints given to it by dynamo are exactly the same as the ones it passes to inductor, for both the forward and backward passes. (This does *not* mean that the tensor values passed in are the same: only that their symints are). That is, AOTAutograd and Inductor never create new guards based on symints with *different sources* than those passed to it by inductor.
We don't currently store any AOTAutograd specific guards: my hypothesis is that FXGraphCache already stores these, as any guards generated by AOTAutograd should already be in the shape_env before calling into inductor, and we don't generate new guards post inductor. If this is needed, I'll add it in another diff.
Testing:
We'll start with some basic unit tests, but I'll be adding more and more complicated testing as the next step.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126791
Approved by: https://github.com/bdhirsh
Before: `softmax` definition uses `jagged_unary_pointwise()` (wrong)
After: `softmax` impl adjusts the `dim` arg to account for the difference in dimensionality between the outer NT and the NT's `_values`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119459
Approved by: https://github.com/soulitzer
Fixes#126950
`ptd_state_dict` with `broadcast_from_rank0=False` might miss 2 condition checks in the `set_optimizer_state_dict`
Here we add another condition `full_state_dict=True` with corresponding tensor distribution without broadcasting if broadcast_from_rank0=False
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128004
Approved by: https://github.com/fegin
Fixes#120570
## Description
Update torch.nanmean() docstring to mention input dtype requirement as either floating point type or complex.
Previously, the torch.mean() docstring had been updated in #120208 in a similar manner, but the torch.nanmean() docstring was not updated.
## Checklist
- [X] The issue that is being fixed is referred in the description.
- [X] Only one issue is addressed in this pull request.
- [x] Labels from the issue that this PR is fixing are added to this pull request.
- [X] No unnecessary issues are included into this pull request.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128155
Approved by: https://github.com/malfet
Thus far TunableOp was implemented for gemm, bgemm, and scaled_mm. gemm_and_bias was notably missing. This PR closes that gap.
This PR also fixes a regression after #124362 disabled the numerical check by default. The env var to enable it no longer worked.
CC @xw285cornell
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128143
Approved by: https://github.com/Skylion007
Not requiring all functions to have types allows a lot of 'Any' types to slip in - which poison types and make mypy unable to properly typecheck the code. I want to flip the default so that new files are required to have fully typed defs and we can have a burndown list of files that fail to require full types.
The preceding stack of PRs (cut up simply to limit the number of file changes per PR "reasonable") adds `# mypy: allow-untyped-defs` to any file which didn't immediately pass mypy with the flag flipped. Due to changing files and merge conflicts it will probably be necessary to have several passes through before landing this final PR which turns the option on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127836
Approved by: https://github.com/oulgen, https://github.com/Skylion007
This PR implements "V0" of AOTAutogradCache. Given an input to AOTAutograd, we calculate a cache key, then save an AOTAutogradCacheEntry.
Each AOTAutogradCacheEntry has:
- A CompiledForward and optionally a CompiledBackward
- A bunch of metadata.
CompiledForward and CompiledBackward each save the *key* to the FXGraphCache associated with the compiled object. FXGraphCache populates this key field as long as it's able to return a compiled graph given a set of inputs. We then load the same object from the FXGraphCache on an AOTAutogradCache hit.
On cache miss:
- Run AOTAutograd, up to AOTAutogradDispatch.post_compile.
- Save an AOTAutogradCacheEntry to the cache after compiling the necessary portions and receiving a cache key from FXGraphCache. In this we *always* compile the backwards ahead of time. The PR above this one implements backward lazy caching, so that we only save to the cache after compiling the backward in a lazy backward scenario.
- Return the resulting object
On cache hit:
- Run AOTAutogradCacheEntry.post_compile() on the cache key.
- This attempts to load the forward and backward graphs from FXGraphCache
- As long as we successfully load from FXGraphCache, it's a hit. We then rewrap the callable with post compile wrappers using our saved metadata.
For now, we ignore the fakified out and debug wrappers. We only save to the cache if Fakified out is turned off.
V0 Guards behavior:
FXGraphCache serializes guards that are needed in the shape_env based on the symint inputs to the graph. The invariant that AOTAutograd uses here is that the sources for symints given to it by dynamo are exactly the same as the ones it passes to inductor, for both the forward and backward passes. (This does *not* mean that the tensor values passed in are the same: only that their symints are). That is, AOTAutograd and Inductor never create new guards based on symints with *different sources* than those passed to it by inductor.
We don't currently store any AOTAutograd specific guards: my hypothesis is that FXGraphCache already stores these, as any guards generated by AOTAutograd should already be in the shape_env before calling into inductor, and we don't generate new guards post inductor. If this is needed, I'll add it in another diff.
Testing:
We'll start with some basic unit tests, but I'll be adding more and more complicated testing as the next step.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126791
Approved by: https://github.com/bdhirsh
Summary: I admit I'm not 100% sure what I'm doing here. I'm hitting a bug in the FX graph cache when we try to evaluate a guards expression. We're creating guards that look like this:
```
Ne(CeilToInt(FloatTrueDiv(ToFloat(8*L['t0']) - 4.0, 8.0))*CeilToInt(FloatTrueDiv(ToFloat(8*L['t1']) - 4.0, 8.0)), CeilToInt(FloatTrueDiv(ToFloat(8*L['t1']) - 4.0, 8.0))) and ...
```
It looks like we have a facility to define these operators in the SYMPY_INTERP map and we're just missing FloatTrueDiv and ToFloat. What's surprsing to me is that we're only hitting this problem with the FX graph enabled. We can create such guards, but we've never actually evaluated any?
Test Plan:
`TORCHINDUCTOR_FX_GRAPH_CACHE=1 python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --inference --bfloat16 --only detectron2_fcos_r_50_fpn`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128418
Approved by: https://github.com/ezyang
Summary: Improve the convert fp32 to fp16 fx pass to use to_dtype node and const folding instead of inplace conversion.
Test Plan:
```
buck2 test @//mode/{opt,inplace} //glow/fb/fx/fba/tests:test_fba_pass_manager_builder
```
Differential Revision: D57803843
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127829
Approved by: https://github.com/Skylion007
Fixes#127905
### Description
Add docstring to torch/onnx/symbolic_opset9.py:sigmoid function
### Checklist
- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128082
Approved by: https://github.com/titaiwangms
Summary:
Pass parameters from request to dump_nccl_trace_pickle handler.
The supported parameters + value are all lowercase.
includecollectives={true, false}
includestacktraces={true, false}
onlyactive={true, false}
Example post is:
/handler/dump_nccl_trace_pickle?includecollectives=true&includestacktraces=false&onlyactive=true
Test Plan:
unit tests
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128307
Approved by: https://github.com/d4l3k
ghstack dependencies: #128191
Summary:
Add a unit test for the only_active flag to _dump_nccl_trace API call.
With this flag, we only expect active records to be returned.
Test Plan:
Unit test.
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128191
Approved by: https://github.com/d4l3k
Fixes#127897
### Description
Add docstring to torch/onnx/symbolic_opset9.py:sigmoid function
### Checklist
- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128171
Approved by: https://github.com/titaiwangms
Summary:
The previous side effect pruning algorithm would keep many dead cell
variables alive. For example, in
https://github.com/pytorch/pytorch/issues/125078, the compiled function
has one return but there were three in the Dynamo graph due to two
dead cell variables not being pruned away.
This PR adds a corrected algorithm. "new cell variables" are alive if
they can be reached from one of the following:
1. any of the tx.symbolic_locals or tx.stack (that is, if they are
involved in a return from the function or intermediate variable
during a graph break). Example: an alive NestedUserFunctionVariable
2. "mutations to pre-existing objects". Example: appending a
NestedUserFunctionVariable to a global list
The new algorithm reflects this, but please let me know if there are
more cases to handle.
Test Plan:
- existing tests (afaict, test/dynamo/test_python_autograd is the best
SideEffects test case we have)
- see in test/dynamo/test_higher_order_ops that the expecttests changed
-- the functorch dynamo graphs no longer return dead cellvars.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128028
Approved by: https://github.com/jansel
We observed signficant compile time regression in torchtitan when turning
on 2D parallel + torch.compile recently. So I decided to get a deeper
understanding why.
It turns out this is affecting **all the trainings** that have functional collectives
captured in the graph, not only 2D parallel (2D parallel was just the
job that happen to have collectives captured in the TP region).
The root cause is because when doing inductor lowering, we are calling
the comm analysis pass to get a estimated collective time for each
collective node in the graph, for each call to check the collective
node, we are calling `get_gpu_type()`, which under the hood calls a
`torch.utils.collect_env.run` to get the GPU info. However, this call is
super expensive! The reason is that this call effectively spawns a new
process and call `nvidia-smi` to get the GPU info, so the cost is **linear**
to the number of collective nodes in the graph.
see https://github.com/pytorch/pytorch/blob/main/torch/utils/collect_env.py#L75
The fix is to add a lru cache to the function, so that we only call this
once and reuse the cached results afterwards
torchtitan benchmark shows:
* before this fix: 2D parallel + fp8 compile time: 6min +
* after this fix: 2D parallel + fp8 compile time: 2min 48s (more than 100% improvement)
There're more room to improve the compile time, but this PR is trying to fix the biggest regression I found so far.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128363
Approved by: https://github.com/yf225
### Motivation
Intel Gaudi accelerator (device name hpu) is seen to have good pass rate with the pytorch framework UTs , however being an out-of-tree device, we face challenges in adapting the device to natively run the existing pytorch UTs under pytorch/test. The UTs however is a good indicator of the device stack health and as such we run them regularly with adaptations.
Although we can add Gaudi/HPU device to generate the device specific tests using the TORCH_TEST_DEVICES environment variable, we miss out on lot of features such as executing for specific dtypes, skipping and overriding opInfo. With significant changes introduced every Pytorch release maintaining these adaptations become difficult and time consuming.
Hence with this PR we introduce Gaudi device in common_device_type framework, so that the tests are instantiated for Gaudi when the library is loaded.
The eventual goal is to introduce Gaudi out-of-tree support as equivalent to in-tree devices
### Changes
Add HPUTestBase of type DeviceTypeTestBase specifying appropriate attributes for Gaudi/HPU.
Include code to check if intel Gaudi Software library is loaded and if so, add the device to the list of devices considered for instantiation of device type tests
### Additional Context
please refer the following RFC : https://github.com/pytorch/rfcs/pull/63/
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126970
Approved by: https://github.com/albanD
gen_static_runtime_ops hasn't been updated in a while. In preparation for https://github.com/pytorch/pytorch/pull/127675 in which I need to re-run the codegen step for cumprod, I want to land these changes beforehand in case there are any other issues that arise.
I added a number of ops to the blocklist:
```
+ "_nested_tensor_storage_offsets",
+ "_nested_get_values", # no CPU backend
+ "_nested_get_values_copy", # no CPU backend
+ "_nested_view_from_jagged", # testing needs to be patched
+ "_nested_view_from_jagged_copy", # testing needs to be patched
+ "_nested_view_from_buffer", # testing needs to be patched
+ "_nested_view_from_buffer_copy", # testing needs to be patched
+ "_int_mm", # testing needs to be patched
+ "_to_sparse_csc", # testing needs to be patched
+ "_to_sparse_csr", # testing needs to be patched
+ "segment_reduce", # testing needs to be patched
```
Most of these are added just because testing doesn't work right now.
Additionally, a few `fft` ops seem to have been removed from native_functions.yaml; I'm guessing it's unlikely FFT would have been used in many real models though.
Differential Revision: [D58329403](https://our.internmc.facebook.com/intern/diff/D58329403/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128299
Approved by: https://github.com/YuqingJ
Some features of third-party devices depend on TraceUtils.h, so some of the CUDA code was removed and split into NCCLUtils files.
In addition, some common functions still remain in TraceUtils.h since I'm not sure if other devices will use them later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126969
Approved by: https://github.com/c-p-i-o
This PR properly registers the tensor used in the module compute as a parameter. This bug was hidden previously because all tensors on the nn modules would be considered constant by dynamo, with inlining NN modules, this is no longer the case.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128356
Approved by: https://github.com/anijain2305
ghstack dependencies: #128355
Today inlining builtin nn modules is not compatible with parameter freezing. Freezing parameters and then constant folding them through the graph relies on the assumption that they will not be inputs and will be static across calls to the same graph. When inlining builtin nn modules this assumption is broken and we reuse the same graph for different instances of the same nn module. There are three options 1) abandon constant folding, 2) create a dispatcher layer (like cudagraphs) which will dispatch to the correct constant-folded graph for each distinct set of parameters or 3) recompile
This PR implements 3 by introducing guards on the parameter pointers. This was due to freezing being relatively rare and performance sensistive. 2 Had many more unknowns and 1 is not a viable option due to the drop in performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128355
Approved by: https://github.com/anijain2305
We implemented a lowering for the avg_pool3d_backward operation and created tests for it.
We ran some benchmarks and achieved the following results:
```
[-------------- avgpool_3d_backwards --------------]
| Decomposed | Eager
16 threads: ----------------------------------------
(3, 5, 400, 200, 200) | 6061 | 11160
(3, 5, 300, 200, 200) | 4547 | 8372
(3, 5, 200, 200, 200) | 3032 | 5585
(3, 5, 300, 300, 300) | 10100 | 18840
(3, 5, 100, 100, 100) | 381 | 703
(3, 5, 100, 300, 200) | 2270 | 4190
(8, 8, 128, 128, 128) | 3397 | 6253
(2, 3, 150, 150, 150) | 520 | 947
(1, 3, 128, 128, 128) | 161 | 299
(8, 16, 64, 64, 64) | 851 | 1569
(1, 1, 50, 50, 50) | 17 | 11
(3, 5, 20, 40, 40) | 17 | 30
(3, 5, 10, 20, 20) | 17 | 11
(1, 1, 10, 10, 10) | 16 | 11
(3, 5, 5, 10, 10) | 17 | 11
(3, 5, 2, 5, 5) | 17 | 11
```
These were run on an RTX 3050, so we were not able to allocate larger tensors due to memory limitations.
We believe it would be beneficial to benchmark this on more recent hardware, just to check if the performance holds up with larger sizes.
Furthermore, we also refactored code from adaptive_avg_pool2d and adaptive_max_pool2d, to reduce code duplication.
We diffed the kernels and they are identical.
Fixes#127101
Co-authored-by: Martim Mendes <martimccmendes@tecnico.ulisboa.pt>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127722
Approved by: https://github.com/jansel
This PR intends to support the aten operations with the `out` tensor.
Currently, the AOT compile always does **NOT** keep input tensor mutations. According to the comments, this is because it has not encountered such a use case.
> For now there's no use case involving keeping input mutations in the graph (which we can only do in the inference case anyway). We can add this later if we need to.
However, for aten operations, it is popular that the `out` tensor is an input parameter and needs to be mutated. This PR intends to support it by adding a `keep_inference_input_mutations` flag to `aot_inductor.keep_inference_input_mutations`. This flag can provide flexibility to the callee in deciding whether the AOT compile needs to keep input tensor mutations in the graph.
Take `clamp` as an example as follows.
```python
out_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(-2.0)
inp_tensor = torch.randn(128, dtype=torch.float, device=device).fill_(1.0)
min_tensor = inp_tensor - 0.05
max_tensor = inp_tensor + 0.05
torch.clamp(input=inp_tensor, min=min_tensor, max=max_tensor, out=out_tensor)
```
W/O this PR
```python
def forward(self):
arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";
arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1); arg0_1 = arg1_1 = None
clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1); clamp_min = arg2_1 = None
return (clamp_max, clamp_max)
```
W/ this PR
```python
def forward(self):
arg0_1: "f32[128]"; arg1_1: "f32[128]"; arg2_1: "f32[128]"; arg3_1: "f32[128]";
arg0_1, arg1_1, arg2_1, arg3_1, = fx_pytree.tree_flatten_spec([], self._in_spec)
clamp_min: "f32[128]" = torch.ops.aten.clamp_min.Tensor(arg0_1, arg1_1); arg0_1 = arg1_1 = None
clamp_max: "f32[128]" = torch.ops.aten.clamp_max.Tensor(clamp_min, arg2_1); clamp_min = arg2_1 = None
copy_: "f32[128]" = torch.ops.aten.copy_.default(arg3_1, clamp_max); arg3_1 = clamp_max = None
return (copy_,)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124926
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/angelayi
Fixes#127898
### Description
Add docstring to torch/onnx/symbolic_opset9.py:sigmoid function
### Checklist
- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128083
Approved by: https://github.com/titaiwangms
Since there are such cycles in libfmt and PyTorch, which are detected by clang-tidy.
```
/home/cyy/pytorch/third_party/fmt/include/fmt/format-inl.h:25:10: error: circular header file dependency detected while including 'format.h', please check the include path [misc-header-include-cycle,-warnings-as-errors]
25 | #include "format.h"
| ^
/home/cyy/pytorch/third_party/fmt/include/fmt/format.h:4530:12: note: 'format-inl.h' included from here
4530 | # include "format-inl.h"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127233
Approved by: https://github.com/ezyang
autograd already marks nodes as needed or not before calling calling compiled autograd. so our worklist already skips nodes not specified in the `inputs` kwarg.
For the .backward(inputs=) case, I'm keeping the grads as outputs, just like for .grad(inputs=), this is to still guard on graph_output when we collect the nodes. This does not get DCE'd rn, and is ignored in the post graph bytecode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128252
Approved by: https://github.com/jansel
Fixes https://github.com/pytorch/pytorch/issues/75287 and https://github.com/pytorch/pytorch/issues/117437
- `nn.Module._register_state_dict_hook` --> add public `nn.Module.register_state_dict_post_hook`
- Add a test as this API was previously untested
- `nn.Module._register_load_state_dict_pre_hook` --> add public `nn.Module.register_load_state_dict_pre_hook` (remove the `with_module` flag, default it to `True`
~- For consistency with optimizer `load_state_dict_pre_hook` raised by @janeyx99, allow the pre-hook to return a new `state_dict`~
- Document issue pointed out by https://github.com/pytorch/pytorch/issues/117437 regarding `_register_state_dict_hook` semantic of returning a new state_dict only being respected for the root for private hook
- Remove this for the public `register_state_dict_post_hook`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126704
Approved by: https://github.com/albanD
ghstack dependencies: #126906
Original issue: https://github.com/pytorch/pytorch/issues/114338
We assume only two possible mutually exclusive scenarios:
1. Running compiled region for training (Any of inputs has requires_grad)
- Produced differentiable outputs should have requires_grad.
2. Running compiled region for inference (None of inputs has requires_grad)
- All outputs do not have requires_grad.
Even if user runs the region under no_grad(), but has an input Tensor with requires_grad - we go Training scenario (1).
With current state that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only that any of inputs requires_grad
2/ if needs_autograd => trace_joint (We are in training scenario 1.) => always run compiled region under with.enable_grad()
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128016
Approved by: https://github.com/bdhirsh
In a previous life, we used sympy.oo to represent the lower/upper bounds of integer ranges. Later, we changed this to be sys.maxsize - 1 for a few reasons: (1) sometimes we do tests on a value being exactly sys.maxsize, and we wanted to avoid a data dependent guard in this case, (2) sympy.oo corresponds to floating point infinity, so you get incorrect types for value ranges with oo, and (3) you can do slightly better reasoning if you assume that input sizes fall within representable 64-bit integer range.
After working in the sys.maxsize regime for a bit, I've concluded that this was actually a bad idea. Specifically, the problem is that you end up with sys.maxsize in your upper bound, and then whenever you do any sort of size-increasing computation like size * 2, you end up with 2 * sys.maxsize, and you end up doing a ton of arbitrary precision int computation that is totally unnecessary. A symbolic bound is better.
But especially after #126905, we can't go back to using sympy.oo, because that advertises that it's not an integer, and now your ValueRanges is typed incorrectly. So what do we do? We define a new numeric constant `int_oo`, which is like `sympy.oo` but it advertises `is_integer`. **test/test_sympy_utils.py** describes some basic properties of the number, and **torch/utils/_sympy/numbers.py** has the actual implementation.
The rest of the changes of the PR are working out the implications of this change. I'll give more commentary as inline comments.
Fixes https://github.com/pytorch/pytorch/issues/127396
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127693
Approved by: https://github.com/lezcano
ghstack dependencies: #126905
Summary: We've updated the sort_index in Kineto chrome traces to support device ids up to 16 devices. This should make chrome trace rows be ordered in the same way as CUDA. We need to update the unit test as well.
Test Plan:
Ran locally the changing test:
```
$ buck2 test 'fbcode//mode/opt' fbcode//caffe2/test:test_profiler_cuda -- --exact 'caffe2/test:test_profiler_cuda - test_basic_chrome_trace (profiler.test_profiler.TestProfiler)'
File changed: fbcode//caffe2/third_party/kineto.submodule.txt
Buck UI: https://www.internalfb.com/buck2/f4fd1e9a-99f1-4422-aeed-b54903c64146
Test UI: https://www.internalfb.com/intern/testinfra/testrun/16888498639845776
Network: Up: 5.4KiB Down: 8.6KiB (reSessionID-0329120e-7fa2-4bc0-b539-7e58058f8fce)
Jobs completed: 6. Time elapsed: 1:01.2s.
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Differential Revision: D58362964
Pulled By: aaronenyeshi
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128333
Approved by: https://github.com/Skylion007
The FX graphs for some PT2 models are very complicated, Inductor usually goes through many passes of graph optimization to generate the final FX graph. It’s very difficult to see the change in each pass, and check if the optimized graph is correct and optimal.
GraphTransformObserver is an observer listening to all add/erase node events on GraphModule during a graph transform pass, and save the changed nodes. When the pass is done and if there is any change in the graph, GraphTransformObserver will save the SVG files of the input graph and the output graph for that pass.
This PR is to enable GraphTransformObserver for inductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127962
Approved by: https://github.com/jansel
Going through the dispatcher + pybind11 + torch.ops adds about 2 us overhead
per call compared to `PyArgParser`.
Note that views of inputs are reconstructed by AOTAutograd before being returned
to the python code, so dispatching for autograd's sake shouldn't be required
here.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128185
Approved by: https://github.com/lezcano
ghstack dependencies: #128183, #128184
Summary:
Data from PyTorch distributed is mostly useful during initial stages of model development.
Provide options to reduce data sent/dumped.
`_dump_nccl_trace` takes 3 optional switches. Default as before returns everything
- `includeCollectives`: option to also include collectives: Default is True.
- `includeStacktraces`: option to include stack traces in collectives. Default is True.
- `onlyActive`: option to only send active collective work - i.e. not completed. Default is
False (i.e. send everything)
Test Plan:
Unit tests
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127651
Approved by: https://github.com/wconstab
Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes.
What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here...
Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343
Approved by: https://github.com/Skylion007
At a high level, the idea behind this PR is:
* Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc in sympy, instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.)
* Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers.
The story begins in **torch/utils/_sympy/functions.py**. Here, I make some changes to how we represent certain operations in sympy expressions:
* FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing).
* ModularIndexing, LShift, RShift now assert they are given integer inputs.
* Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver
* TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows for us to eventually generate accurate code for Python semantics IntTrueDiv, which is written in a special way to preserve precision when the inputs are >= 2**53 beyond what first coercing the integer to floats and then doing true division.
* Trunc is split to TruncToFloat and TruncToInt.
* Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result.
* RoundDecimal updated to consistently only ever return a float
* Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing)
In **torch/__init__.py**, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations. Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information.
We also need to introduce some new op handlers in **torch/_inductor/ops_handler.py**:
* `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy
* `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv`
These changes have consequences. First, we need to make some administrative changes:
* Actually wire up these Sympy functions from SymInt/SymFloat in **torch/fx/experimental/sym_node.py**, including the new promotion rules (promote2)
* Add support for new Sympy functions in **torch/utils/_sympy/interp.py**, **torch/utils/_sympy/reference.py**
* In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function
* TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency this to fix tests here
* Add printer support for the Sympy functions in **torch/_inductor/codegen/common.py**, **torch/_inductor/codegen/cpp_utils.py**, **torch/_inductor/codegen/triton.py**. `int_truediv` and mixed precision equality is currently not implemented soundly, so we will lose precision in codegen for large values. TODO: The additions here are not exhaustive yet
* Update ValueRanges logic to use new sympy functions in **torch/utils/_sympy/value_ranges.py**. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions.
In **torch/fx/experimental/symbolic_shapes.py** we need to make some symbolic reasoning adjustments:
* Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. This simplification then triggers an addition simplification rule `(x + y) / c --> x / c + y / c` which is bad because x / c is a rational number now
* `_assert_bound_is_rational` is no more, we no longer generate rational bounds
* Don't intersect non-int value ranges with the `int_range`
* Support more sympy Functions for guard SYMPY_INTERP
* Assert the type of value range is consistent with the variable type
The new asserts uncovered necessary bug fixes:
* **torch/_inductor/codegen/cpp.py**, **torch/_inductor/select_algorithm.py**, **torch/_inductor/sizevars.py** - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions
* **torch/_inductor/utils.py** - make sure you actually pass in sympy.Expr to these functions
* **torch/_inductor/ir.py** - make_contiguous_strides_for takes int/SymInt, not sympy.Expr!
* **torch/export/dynamic_shapes.py** - don't use infinity to represent int ranges, instead use sys.maxsize - 1
Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at **test/test_proxy_tensor.py**
**Reland notes.** This requires this internal fbcode diff https://www.internalfb.com/phabricator/paste/view/P1403322587 but I cannot prepare the diff codev due to https://fb.workplace.com/groups/osssupport/posts/26343544518600814/
It also requires this Executorch PR https://github.com/pytorch/executorch/pull/3911 but the ET PR can be landed prior to this landing.
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905
Approved by: https://github.com/xadupre, https://github.com/lezcano
**Summary**
Example code to display distributed model parameters and verify them against ground truth. Also prints sharding information.
**Test Plan**
torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/display_sharding_example.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127987
Approved by: https://github.com/XilunWu
ghstack dependencies: #127358, #127360, #127630
This patch implements `with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):` by reusing AOTriton's accelerated SDPA implementation
Known limitations:
- Only supports MI200/MI300X GPUs
- Does not support varlen
- Does not support `CausalVariant`
- Optional arguments `causal_diagonal` and `seqlen_k` in `_efficient_attention_forward/backward` must be null
- Does not work well with inductor's SDPA rewriter. The rewriter has been updated to only use math and flash attention on ROCM.
This PR also uses a different approach of installing AOTriton binary instead of building it from source in the base docker image. More details on motivation: https://github.com/pytorch/pytorch/pull/124885#issuecomment-2153229129
`PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda" python test/test_transformers.py` yields "55028 passed, 20784 skipped" results with this change. [Previous result](https://hud.pytorch.org/pr/127528) of `test_transformers.py` was 0 error, 0 failure, 55229 skipped out of 75517 tests in total (the XML report does not contain total number of passed tests).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124885
Approved by: https://github.com/malfet
If any inputs are mutated that require grad, even if all the outputs don't require grad, we should still run autograd with a backwards graph. This fixes two tests: test_input_mutation_alias_everything and test_view_detach.
Fixes#128035
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128229
Approved by: https://github.com/aorenste
In case user modified stage module out of place, such as
mod = DDP(mod)
mod = torch.compile(mod)
They need a stage builder else than `pipe.build_stage()`.
This PR provides an API to do so:
```
def build_stage(
stage_module,
stage_index,
pipe.info(),
...
)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128273
Approved by: https://github.com/wconstab
https://github.com/pytorch/pytorch/pull/124272 set the alignment to the `consts_o` but if there're `data_size` of tensor in the `consts_o` non divisible by the alignment, the following tensors are not aligned anymore, resulting in poor performance on CPU.
We align the `data_size` as well in this PR and pad the serialized bytes. Since `size` of the tensor instead of the `data_size` is used when creating tensor from the serialized bytes ([link](f4d7cdc5e6/torch/csrc/inductor/aoti_runtime/model.h (L236-L259))), there won't be correctness issue. `data_size` is only used to record the [bytes_read](f4d7cdc5e6/torch/csrc/inductor/aoti_runtime/model.h (L217)).
This PR will improve the performance on CPU for 4 models in HF, 7 models in TIMM and 1 model in Torchbench.
For the unit test, I add a bias value the original `data_size` of which is not divisible by the alignment to test the correctness:
```
constants_info_[0].dtype = static_cast<int32_t>(at::kFloat);
constants_info_[0].data_size = 64; # was 40 before this PR
constants_info_[0].shape = {10};
constants_info_[1].dtype = static_cast<int32_t>(at::kFloat);
......
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127610
Approved by: https://github.com/jgong5, https://github.com/desertfire
----
- Bring PipelineStage/Schedule more front-and-center
- provide details on how to manually construct PipelineStage
- move tracer example and manual example below so the high-level flow
(e2e) is closer to the top
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128236
Approved by: https://github.com/H-Huang
ghstack dependencies: #128201, #128228
Fixes#127913
### Description
Add docstring to `torch/onnx/symbolic_opset9.py`:`sigmoid` function
### Checklist
- [x] The issue that is being fixed is referred in the description
- [x] Only one issue is addressed in this pull request
- [x] Labels from the issue that this PR is fixing are added to this pull request
- [x] No unnecessary issues are included into this pull request
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127983
Approved by: https://github.com/xadupre
Fixes this on macOS 12:
```
/Users/qqaatw/Forks/pytorch/aten/src/ATen/native/mps/operations/FastFourierTransform.mm:108:60: error: use of undeclared identifier 'MPSDataTypeComplexFloat16'; did you mean 'MPSDataTypeFloat16'?
(inputTensor.dataType == MPSDataTypeFloat16) ? MPSDataTypeComplexFloat16 : MPSDataTypeComplexFloat32;
^~~~~~~~~~~~~~~~~~~~~~~~~
MPSDataTypeFloat16
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127859
Approved by: https://github.com/kulinseth
For mixed mm with small sizes of m, such as in the example provided in #127056, being able to set BLOCK_M to 16 leads to better performance. This PR introduces kernel configs that are specific to mixed mm by extending the mm configs with two configs that work well for the example provided in #127056.
I am excluding configs with (BLOCK_M=16, BLOCK_K=16, BLOCK_N=64) because triton crashes when this config is used.
For the example in #127056:
- Without my changes, skip_triton is evaluated to true which disables autotuning. On my machine I achieve 146GB/s.
- If autotuning is enabled, but BLOCK_M>=32, I achieve 614 GB/s.
- With the changes in this PR (i.e. autotuning enabled and BLOCK_M=16), I achieve 772 GB/s.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127663
Approved by: https://github.com/Chillee
https://github.com/pytorch/pytorch/pull/127825
The majority of the g5 runner usage comes from inductor (its something like 2x everything else)
in the past week, inductor ran 1300 ish times on PRs and 300 times on main. Inductor-periodic ran 50 times on main, so the previous move from inductor -> inductor-periodic only results in 250 fewer runs.
I was under the impression that cu124 is experimental currently and eventually we'll need to switch to it, so this will stay until we switch or inductor uses much fewer runners
Are we expected to be able to handle two versions of cuda in CI? Because currently we cannot, at least not comfortably
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128250
Approved by: https://github.com/huydhn
Updates ruff to 0.4.8. Some minor fixes, but noticably is 10% faster on microbenchmark and should further reduce local and CI runtime of the linter. Also includes a few bugfixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128214
Approved by: https://github.com/ezyang
This is a short term fix for: https://github.com/pytorch/pytorch/issues/124002
We found the cause of bad perf for the int8_unpack kernel is due to sub-optimal indexing. In this PR we introduce 2 indexing optimizations:
1. expand FloorDiv to the entire expression when feasible. E.g. `x1 * 1024 + x2 // 2` will be transformed to `(x1 * 2048 + x2) // 2`. The motivation is that we have more chance to simplify loops for `x1 * 2048 + x2`.
2. merge ModularIndexing pairs: `ModularIndexing(ModularIndex(x, 1, a), 1, b)`, can be simplified to `ModularIndexing(x, 1, b)` if a is a multiple of b.
With both indexing optimizations, we improve int8_unpack perf by 1.54x (183us -> 119us).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127661
Approved by: https://github.com/jansel
**Summary**
This PR switches the default TCPStore server backend to a new implementation that utilizes [`libuv`](https://github.com/libuv/libuv) for significantly lower initialization time and better scalability:
<img width="714" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/18503011-da5d-4104-8ba9-abc456438b02">
We hope this improvement would benefit users from a much shorter startup time in large-scale jobs. Eventually, we hope to fully replace the old TCPStore backend implementation with the libuv one.
**What it changes**
This PR changes the underlying TCPStore server backend to `libuv` if users don't explicitly specify to use the old TCPStore server. This change is not supposed to cause any user notice except significant faster TCPStore startup for large-scale jobs.
One thing to note is, we do not support the initialization approach where user passes in a socket for libuv backend. We plan to support it as a next step but we choose to disable it before fully testing. If you are initializing TCPStore in this approach, you can see the next section to remain using the old TCPStore server.
**Fallback/Remain using the old TCPStore server**
For users who want to stay with the old TCPStore backend, there're 3 ways:
1. If user is directly instantiating TCPStore object, user can pass in argument `use_libuv=False` to use the old TCPStore server backend e.g. `store = torch.distributed.TCPStore(..., use_libuv=False)`.
2. Or, specify the TCPStore backend option in `init_method` when calling default ProcessGroup init, e.g. `torch.distributed.init_process_group(..., init_method="{YOUR_RENDEZVOUS_METHOD}://{YOUR_HOSTNAME}:{YOUR_PORT}?use_libuv=0")`
3. Or, user can set environment variable `USE_LIBUV` to `"0"` when launching.
These 3 approach are in order of precedence. That being said, if user specifies `use_libuv=0` in `init_method` and also sets environment var `USE_LIBUV="1"`, the former will take effect and the TCPStore backend instantiated will be the old one instead of the one using libuv.
**Operating Systems Compatibility**
From the CI signals, we believe the new implementation has the same behavior as the old TCPStore server on all supported platforms. If you notice any behavior discrepancy, please file an issue with `oncall: distributed` label.
**Test Plan**
`pytest test/distributed/test_store.py`
<img width="2548" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/dc0aebeb-6d5a-4daa-b98c-e56bd39aa588">
note: `TestMultiThreadedWait::test_wait` is a broken test that has been there for some time.
`test/distributed/elastic/utils/distributed_test.py`
<img width="2558" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/a6a3266d-b798-41c4-94d2-152056a034f6">
**TODO**
1. Update the doc at
- https://pytorch.org/docs/stable/distributed.html#distributed-key-value-store
- https://pytorch.org/docs/stable/distributed.html#tcp-initialization
2. Make torch elastic rendezvous to use libuv TCPStore as well. See `torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py` cc @mrshenli @pritamdamania87 @zhaojuanmao @satgera @gqchen @aazzolini @osalpekar @jiayisuse @H-Huang @kwen2501 @awgu @penguinwu @fegin @wanchaol @fduwjj @wz337 @tianyu-l @wconstab @yf225 @chauhang @d4l3k @kurman
3. Test if libuv backend is okay with initialization with socket. Change `LibUvTCPStoreTest::test_take_over_listen_socket`.
**Test Plan**
`pytest test/distributed/test_store.py`
<img width="2548" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/dc0aebeb-6d5a-4daa-b98c-e56bd39aa588">
note: `TestMultiThreadedWait::test_wait` is a broken test that has been there for some time.
`test/distributed/elastic/utils/distributed_test.py`
<img width="2558" alt="image" src="https://github.com/pytorch/pytorch/assets/12968408/a6a3266d-b798-41c4-94d2-152056a034f6">
Differential Revision: [D58259591](https://our.internmc.facebook.com/intern/diff/D58259591)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127957
Approved by: https://github.com/kurman
ghstack dependencies: #127956
# Issues
Currently two issues need to be fixed with LoopedBFS:
1. The wrap around send operation to the looped around stage blocks will cause a hang. For some reason this doesn't surface on single node, but on multihost this surfaces in a hang.
<img width="1311" alt="image" src="https://github.com/pytorch/pytorch/assets/14858254/210d9d18-455f-4f65-8a11-7ce2c1ec73fd">
2. When microbatches are popped off in `backward_one_chunk` will automatically use the `bwd_chunk_id` starting from 0. This works for interleaved 1f1b and 1f1b, but for loopedBFS we want to pop from starting at `num_microbatches - 1`. Same needs to be fixed for gpipe?
# Changes
- Update LoopedBFS implementation to share `_step_microbatches` with `Interleaved1F1B`
- Also share the tests between the two schedules for varying num_microbatches, local_stages, and world_sizes
- Update `backward_one_chunk` to optionally take a `bwd_chunk_id` argument.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127796
Approved by: https://github.com/wconstab
Summary:
1. Integrate NaN and INF checker with existing config, controllable by env var.
2. Move inject point of NaN & INF checker earlier, this could prevent buffer freeing before check.
3. Inject debugging code in Kernel level, which prevents us trying to read buffers that are fused inplace and into a single kernel.
Test Plan:
Debugging utility.
Test and check by existing tests with env var:
```
TORCHINDUCTOR_NAN_ASSERTS=1 TORCHINDUCTOR_MAX_AUTOTUNE=0 python test/inductor/test_aot_inductor.py -k AOTInductorTestNonABICompatibleCuda.test_seq_non_abi_compatible_cuda
```
Reviewed By: ColinPeppler
Differential Revision: D57989176
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127574
Approved by: https://github.com/chenyang78, https://github.com/desertfire
Default values were added to Params in order to eliminate CUDA warnings like
```
and the implicitly-defined constructor does not initialize ‘PyTorchMemEffAttention::AttentionKernel<float, cutlass::arch::Sm80, true, 64, 64, 64, true, true>::accum_t PyTorchMemEffAttention::AttentionKernel<float, cutlass::arch::Sm80, true, 64, 64, 64, true, true>::Params::scale’
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112215
Approved by: https://github.com/eqy, https://github.com/ezyang
Changed the API of `pipeline()` to take microbatch instead of full batch as example args.
Main purpose is to:
- make this API more atomic;
- decouple tracing frontend from runtime info like `num_chunks`.
Side effects:
- Creates opportunity for varying `num_chunks` of schedules with the same `pipe` object.
- User has to create example microbatch input.
- Chunk spec stuff are now all moved to runtime side.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128163
Approved by: https://github.com/H-Huang
My goal is to run these tests with the autograd cache on, but first I want them running with dynamo. These tests already caught an interesting issue so I thought it would be helpful to just have them.
Next up I'll have a second subclass of these tests, run them twice, and expect a cache hit the second time from autograd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128047
Approved by: https://github.com/ezyang
Renaming ManualPipelineStage to remove the "Manual" part. I needed to replace the existing `PipelineStage` which takes in the `pipe` argument, so I have renamed that to `TracerPipelineStage`. @kwen2501 will remove this entirely in favor of adding a util to `Pipe` to just create the stage directly.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128157
Approved by: https://github.com/wconstab
Dynamo doesn't support `RegisterPostBackwardFunction` very well yet. This PR skips it and rely on `root_post_backward_callback` under compile. We will improve `RegisterPostBackwardFunction` support in Q3.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127864
Approved by: https://github.com/awgu
Most commonly CPU scalars used for philox random seed. Right now, any cpu input will skip cudagraphing the entire graph. We need both the traced graph and the runtime inputs to be cudaified.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125382
Approved by: https://github.com/jansel
As reported in https://github.com/pytorch/pytorch/issues/119434, `hf_T5_generate` failed with dynamic shape testing, we propose to skip the dynamic batch size testing of this model in this PR.
* Error msg is
```
File "/home/jiayisun/pytorch/torch/_dynamo/guards.py", line 705, in SHAPE_ENV
guards = output_graph.shape_env.produce_guards(
File "/home/jiayisun/pytorch/torch/fx/experimental/symbolic_shapes.py", line 3253, in produce_guards
raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['inputs_tensor'].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
- Not all values of RelaxedUnspecConstraint(L['inputs_tensor'].size()[0]) are valid because L['inputs_tensor'].size()[0] was inferred to be a constant (4).
```
* Root Cause is
This error happens while creating guard for this [model script line](https://github.com/huggingface/transformers/blob/main/src/transformers/models/t5/modeling_t5.py#L561): `scores += position_bias_masked`
I run it with TORCH_LOGS="+dynamic" and got the key line : `I0305 00:21:00.849974 140376923287424 torch/fx/experimental/symbolic_shapes.py:3963] [6/0_1] eval Eq(s0, 4) [guard added] at miniconda3/envs/pt2/lib/python3.9/site-packages/transformers/models/t5/modeling_t5.py:561 in forward (_refs/__init__.py:403 in _broadcast_shapes)`
The reason for this error is that the batch dimension of `inputs_tensor` in the dynamic batch size test is marked as dynamic shape `s0`, so the batch dimension of `scores` generated by a series of operations with `inputs_tensor` is also `s0`. However, because the function of creating `attention_mask` is not in Dynamo but in python. The batch dimension of `attention_mask` is the real shape `4`, and the batch dimension of `position_bias_masked` generated by a series of operations with `attention_mask` is also the real shape `4`, not the dynamic shape `s0`. The current line of `scores += position_bias_masked` requires creating a guard and check whether the batch dimension of `scores` is always equal to the batch dimension of `position_bias_masked`, Eq(s0, 4), the error happens.
So the root cause of this error is that the function of creating `attention_mask` not in Dynamo but in python. The reason why the function of `attention_mask` not in Dynamo is that Dynamo has a graph break on this function (happened in the [model script line](https://github.com/huggingface/transformers/blob/main/src/transformers/generation/utils.py#L476): `is_pad_token_in_inputs = (pad_token_id is not None) and (pad_token_id in inputs)`) due to the following error:
`torch._dynamo.exc.Unsupported: Tensor.item`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121129
Approved by: https://github.com/leslie-fang-intel, https://github.com/ezyang
see the issue https://github.com/pytorch/pytorch/issues/127636 to for details about the issue, TLDR is that
when inlining is enabled, we create a fake tensor while tracing in dynamo and try to perform aten.add.Tensor between
two tensor of different types, with out inlining we do not hit that operation during tracing.
```
Failed running call_function <built-in function add>(*(FakeTensor(..., size=(20, 20), grad_fn=<AddBackward0>), FakeTensor(..., device='cuda:0', size=(20, 20))), **{}):
Unhandled FakeTensor Device Propagation for aten.add.Tensor, found two different devices cpu, cuda:0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128023
Approved by: https://github.com/anijain2305
ghstack dependencies: #127487, #127553
Summary: I got reports of intermittent failures in CI and the logs show errors like this:
```
CRITICAL:concurrent.futures:Future 139789013754560 in unexpected state: FINISHED
```
I can't repro locally, but seems clear that we should initialize the future _before_ sending work to the subprocess pool since it could finish before we call set_running_or_notify_cancel()
Differential Revision: [D58239829](https://our.internmc.facebook.com/intern/diff/D58239829)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128086
Approved by: https://github.com/jansel
ghstack dependencies: #128037
The fist DDP bucket is always being created of the size of `dist._DEFAULT_FIRST_BUCKET_BYTES` (1 MiB) by default regardless of `bucket_cap_mb`. The proposal is to set `bucket_cap_mb` as the one main bucket size if it was supplied by the user.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121640
Approved by: https://github.com/wanchaol
We used to have two backward chunk id counting systems, one at schedule level, the other at stage level.
(Which makes safety dependent on the two advancing hand-in-hand.)
This PR consolidates the counting system to the schedule side only, which would pass `mb_index` to the following stage calls:
`forward_one_chunk`
`backward_one_chunk`
`get_bwd_send_ops`
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127935
Approved by: https://github.com/H-Huang
Summary:
Part of the work helping export's automatic dynamic shapes / dynamic shapes refining based on suggested fixes.
Introduces a util function refine_dynamic_shapes_from_suggested_fixes() that takes the error message from a ConstraintViolationError message containing suggested dynamic shapes fixes, along with the original dynamic shapes spec, and returns the new spec. Written so that the suggested fixes from export can be directly parsed and used.
Example usage for the automatic dynamic shapes workflow:
```
# export, fail, parse & refine suggested fixes, re-export
try:
export(model, inps, dynamic_shapes=dynamic_shapes)
except torch._dynamo.exc.UserError as exc:
new_shapes = refine_dynamic_shapes_from_suggested_fixes(exc.msg, dynamic_shapes)
export(model, inps, dynamic_shapes=new_shapes)
```
For examples of behavior, see the added test and docstring. Will take suggestions for renaming the function to something else 😅
Test Plan: test_export tests
Differential Revision: D57409142
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127436
Approved by: https://github.com/avikchaudhuri
This is a short term fix for: https://github.com/pytorch/pytorch/issues/124002
We found the cause of bad perf for the int8_unpack kernel is due to sub-optimal indexing. In this PR we introduce 2 indexing optimizations:
1. expand FloorDiv to the entire expression when feasible. E.g. `x1 * 1024 + x2 // 2` will be transformed to `(x1 * 2048 + x2) // 2`. The motivation is that we have more chance to simplify loops for `x1 * 2048 + x2`.
2. merge ModularIndexing pairs: `ModularIndexing(ModularIndex(x, 1, a), 1, b)`, can be simplified to `ModularIndexing(x, 1, b)` if a is a multiple of b.
With both indexing optimizations, we improve int8_unpack perf by 1.54x (183us -> 119us).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127661
Approved by: https://github.com/jansel
Tracing through `__init__` is important because it initializes (calls STORE_ATTR) on members. By doing that, we kick in the mutation tracking for these objects. So, things like mutating `_modules` etc is tracked automatically.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126578
Approved by: https://github.com/jansel
ghstack dependencies: #128001
#### Description
Handle custom ops during TorchScript to ExportedProgram covnersion
```python
torch.library.define(
"mylib::foo",
"(Tensor x) -> Tensor",
lib=lib,
)
# PyTorch custorm op implementation
@torch.library.impl(
"mylib::foo",
"CompositeExplicitAutograd",
lib=lib,
)
def foo_impl(x):
return x + x
# Meta function of the custom op.
@torch.library.impl_abstract(
"mylib::foo",
lib=lib,
)
def foo_meta(x):
return x + x
class M(torch.nn.Module):
def forward(self, x):
return torch.ops.mylib.foo(x)
```
#### Test Plan
* Add a test case where custom op is called and converted. `pytest test/export/test_converter.py -s -k test_ts2ep_converter_custom_op`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127580
Approved by: https://github.com/angelayi
I'm currently locked into jsonargparse version 4.19.0, and it complains when used in combination with LightningCLI (v2.0.8). This is because it cares about the types declared in google style docstrings. This causes a problem when it tries to parse how it should cast arguments to construct an instance of an LRScheduler class because the docstrings declare the "verbose" parameter as a bool, but the defaults recently changed to a string "deprecated". This means the type should really be `bool | str`.
This PR adds a `| str` to the docstring type in each learning rate scheduler class. This will prevent jsonargparse from complaining.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127943
Approved by: https://github.com/janeyx99
Summary:
Data from PyTorch distributed is mostly useful during initial stages of model development.
Provide options to reduce data sent/dumped.
`_dump_nccl_trace` takes 3 optional switches. Default as before returns everything
- `includeCollectives`: option to also include collectives: Default is True.
- `includeStacktraces`: option to include stack traces in collectives. Default is True.
- `onlyActive`: option to only send active collective work - i.e. not completed. Default is
False (i.e. send everything)
Test Plan:
Unit tests
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127651
Approved by: https://github.com/wconstab
https://github.com/pytorch/pytorch/pull/128038#issuecomment-2150802030
In the above, pending unstable jobs get put into the ok_failed_checks list, and because there are a lot of unstable jobs, it exceeds the threshold and merge fails.
I don't think unstable jobs should be considered in the ok failed checks threshold, only flaky and broken trunk jobs should be considered there.
Change looks big, but main thing is that unstable jobs don't get included in the check for how many flaky failures there are. The other changes are mostly renames so things are clearer
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128080
Approved by: https://github.com/huydhn
There is still ongoing discussion on how this API should work.
Current approach:
- The pre-all-gather ops run in the default stream and the all-gather is called from the default stream with `async_op=True`.
- Pros:
- The all-gather input and output tensors are allocated in the default stream, so there is no increased memory fragmentation across stream pools.
- There is no need for additional CUDA synchronization. The API is self-contained.
- Cons:
- The pre-all-gather ops (e.g. cast from fp32 -> bf16 and all-gather copy-in device copies) cannot overlap with other default stream compute. The biggest concern here is for CPU offloading, the H2D copies cannot overlap.
Alternative approach:
- Follow the default implicit prefetching approach, where the pre-all-gather ops and all-gather run in separate streams.
- Pros:
- The pre-all-gather ops can overlap with default stream compute.
- Cons:
- We require an API that should be called after the last optimizer step (namely, last op that modified sharded parameters) and before the first `unshard` call that has the all-gather streams wait for the default stream. The API is no longer self-contained and now has a complementary API.
- The all-gather input and output tensors are allocated in separate streams (not the default stream), so there can be increased memory fragmentation across pools.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128138
Approved by: https://github.com/wanchaol
ghstack dependencies: #128100
# Motivation
Previously, the default dtype for AMP on XPU was aligned with the CPU. To align with other GPUs, we intend to change the default dtype for AMP to `torch.float16`. This change aims to save users the effort of converting models from `torch.float16` to `torch.bfloat16`, or vice versa when they want to run the model on different types of GPUs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127741
Approved by: https://github.com/EikanWang, https://github.com/albanD
Fix bug in `_update_process_group` DDP API where we didn't correctly reset `local_used_map_` and a few other variables. This resulted in errors like `Encountered gradient which is undefined, but still allreduced by...`
Added a unit test as well that reproduced the issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128092
Approved by: https://github.com/awgu, https://github.com/fegin
as titled, given that our DTensorSpec is immutable, we can always reuse
the spec if the input/output have the same tensor metadata. this helps two fold:
1. We don't need to re-calculate the hash everytime we produce a
DTensorSpec, reduce runtime operator overhead
2. reduce the DTensor construction overhead.
Some local benchmark on a 800 parameter clip_grad_norm shows that for
foreach_norm the CPU overhead reduces from 11ms -> 7.8ms (around 30% improvement)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128112
Approved by: https://github.com/awgu
### Introduction/Problem
Today when dynamo traces a builtin nn module (nn.Linear for example) it will specially handle parameters of that module by storing them as constant attributes of the graph. This requires that dynamo guard on the ID of the NNModule because if the instance of the module changes, we need to retrace and recollect the new parameters as attributes of the graph. This creates a 1:1 compiled graph to cudagraph relationship.
With hierarchical compilation, dynamo will treat builtin nn modules like any other code. This reduces complexity and critically, if there are multiple identical layers in a model, we only need to compile one of those layers once, and reuse the same compiled artifact for each layer. This introduces a problem for the current approach to parameter handling. Since the parameters could now possibly change across calls to the compiled artifact, these need to be inputs to the graph instead of attributes. This introduces a problem for cudagraphs - previously cudagraphs was guaranteed that the parameters of builtin NN Modules would be constant across calls, but now since the compiled artifact needs to be agnostic to the actual instance of the NN module being used these parameter memory locations may vary. Previously cudagraphs simply copies varying inputs to cudagraph owned memory, but since the parameters are quite large, this is catastrophic for performance.
### Solution
To avoid this performance cliff, this PR allows cudagraphs to re-record a new cudagraph if only parameters change. Metadata about which arguments are parameters are propagated from AOT Autograd to compile_fx, and these indices are passed to cudagraphs. If these memory locations change, a new graph is recorded vs previously where this would be an error (because this previously should not happen). This enables a 1:many compiled graph to cudagraph relationship. Across similar modules we will re-record cudagraphs and dispatch the correct graph if parameter pointers match when the cudagraph is executed.
### Next steps (if needed)
It is theoretically possible that a user passes Parameters that change frequently as inputs to model code - if this is a common issue this design allows for dynamo to pass metadata indicating which parameters were created in a builtin NN Module context to only permit those parameters to have the multi-cudagraph behavior, but this PR does not implement this.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126822
Approved by: https://github.com/eellison
ghstack dependencies: #126820, #126821
Collect the indices of the static parameters to pass down to cudagraphs in order to re-record if necessary.
This location was chosen in order to allow us to restrict this (if needed) in the future by setting metadata in dynamo.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126820
Approved by: https://github.com/bdhirsh
#### Description
Add support for converting `prim::__contains__` from TorchScript IR to ExportedProgram, e.g.,
```python
class MIn(torch.nn.Module):
def forward(self, x: torch.Tensor):
return x.dtype in [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
```
#### Test Plan
* Add test cases to cover both contains IR resulted from primitive types or Tensor. `pytest test/export/test_converter.py -s -k test_ts2ep_converter_contains`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127544
Approved by: https://github.com/angelayi
At a high level, the idea behind this PR is:
* Make it clearer what the promotion and int/float rules for various Sympy operations are. Operators that previously were polymorphic over int/float are now split into separate operators for clarity. We never do mixed int/float addition/multiplication etc in sympy, instead, we always promote to the appropriate operator. (However, equality is currently not done correctly.)
* Enforce strict typing on ValueRanges: if you have a ValueRange for a float, the lower and upper MUST be floats, and so forth for integers.
The story begins in **torch/utils/_sympy/functions.py**. Here, I make some changes to how we represent certain operations in sympy expressions:
* FloorDiv now only supports integer inputs; to do float floor division, do a truediv and then a trunc. Additionally, we remove the divide out addition by gcd optimization, because sympy gcd is over fields and is willing to generate rationals (but rationals are bad for ValueRange strict typing).
* ModularIndexing, LShift, RShift now assert they are given integer inputs.
* Mod only supports integer inputs; eventually we will support FloatMod (left for later work, when we build out Sympy support for floating operations). Unfortunately, I couldn't assert integer inputs here, because of a bad interaction with sympy's inequality solver that is used by the offline solver
* TrueDiv is split into FloatTrueDiv and IntTrueDiv. This allows for us to eventually generate accurate code for Python semantics IntTrueDiv, which is written in a special way to preserve precision when the inputs are >= 2**53 beyond what first coercing the integer to floats and then doing true division.
* Trunc is split to TruncToFloat and TruncToInt.
* Round is updated to return a float, not an int, making it consistent with the round op handler in Inductor. To get Python-style conversion to int, we call TruncToInt on the result.
* RoundDecimal updated to consistently only ever return a float
* Add ToFloat for explicit coercion to float (required so we can enforce strict ValueRanges typing)
In **torch/__init__.py**, we modify SymInt and SymFloat to appropriately call into new bindings that route to these refined sympy operations. Also, we modify `torch.sym_min` and `torch.sym_max` to have promotion semantics (if one argument is a float, the return result is always a float), making them inconsistent with builtins.min/max, but possible to do type analysis without runtime information.
We also need to introduce some new op handlers in **torch/_inductor/ops_handler.py**:
* `to_int` for truncation to int64, directly corresponding to TruncToInt; this can be implemented by trunc and dtype, but with a dedicated handler it is more convenient for roundtripping in Sympy
* `int_truediv` for Python-style integer true division, which has higher precision than casting to floats and then running `truediv`
These changes have consequences. First, we need to make some administrative changes:
* Actually wire up these Sympy functions from SymInt/SymFloat in **torch/fx/experimental/sym_node.py**, including the new promotion rules (promote2)
* Add support for new Sympy functions in **torch/utils/_sympy/interp.py**, **torch/utils/_sympy/reference.py**
* In particular, in torch.utils._sympy.reference, we have a strong preference to NOT do nontrivial compute, instead, everything in ops handler should map to a singular sympy function
* TODO: I chose to roundtrip mod back to our Mod function, but I think I'm going to have to deal with the C/Python inconsistency this to fix tests here
* Add printer support for the Sympy functions in **torch/_inductor/codegen/common.py**, **torch/_inductor/codegen/cpp_utils.py**, **torch/_inductor/codegen/triton.py**. `int_truediv` and mixed precision equality is currently not implemented soundly, so we will lose precision in codegen for large values. TODO: The additions here are not exhaustive yet
* Update ValueRanges logic to use new sympy functions in **torch/utils/_sympy/value_ranges.py**. In general, we prefer to use the new Sympy function rather than try to roll things by hand, which is what was done previously for many VR analysis functions.
In **torch/fx/experimental/symbolic_shapes.py** we need to make some symbolic reasoning adjustments:
* Avoid generation of rational subexpressions by removing simplification of `x // y` into `floor(x / y)`. This simplification then triggers an addition simplification rule `(x + y) / c --> x / c + y / c` which is bad because x / c is a rational number now
* `_assert_bound_is_rational` is no more, we no longer generate rational bounds
* Don't intersect non-int value ranges with the `int_range`
* Support more sympy Functions for guard SYMPY_INTERP
* Assert the type of value range is consistent with the variable type
The new asserts uncovered necessary bug fixes:
* **torch/_inductor/codegen/cpp.py**, **torch/_inductor/select_algorithm.py**, **torch/_inductor/sizevars.py** - Ensure Wild/Symbol manually allocated in Inductor is marked `is_integer` so it's accepted to build expressions
* **torch/_inductor/utils.py** - make sure you actually pass in sympy.Expr to these functions
* **torch/_inductor/ir.py** - make_contiguous_strides_for takes int/SymInt, not sympy.Expr!
* **torch/export/dynamic_shapes.py** - don't use infinity to represent int ranges, instead use sys.maxsize - 1
Because of the removal of some symbolic reasoning that produced rationals, some of our symbolic reasoning has gotten worse and we are unable to simplify some guards. Check the TODO at **test/test_proxy_tensor.py**
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126905
Approved by: https://github.com/xadupre, https://github.com/lezcano
Summary:
Add the following error checks for the `unbind` operator on `NestedTensor`s when `ragged_idx != 1`:
- The current implementation allows the creation of `NestedTensor` instances from the class definition with an `offsets` tensor that applies to a dimension other than the jagged dimension. This diff ensures that `unbind` fails when the `offsets` exceed the length of the jagged dimension.
Test Plan:
Added the following unit tests:
`test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu` verifies that `unbind` fails when there is a mismatch between the offsets and the jagged dimension, for `NestedTensor`s with `lengths`.
```
test_unbind_with_lengths_ragged_idx_equals_2_bad_dim_cpu (test_nestedtensor.TestNestedTensorSubclassCPU) ... ok
```
Reviewed By: davidberard98
Differential Revision: D57989082
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128058
Approved by: https://github.com/davidberard98
The fist DDP bucket is always being created of the size of `dist._DEFAULT_FIRST_BUCKET_BYTES` (1 MiB) by default regardless of `bucket_cap_mb`. The proposal is to set `bucket_cap_mb` as the one main bucket size if it was supplied by the user.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/121640
Approved by: https://github.com/wanchaol
This adds support for the WorkerServer binding to TCP as well as the existing unix socket support.
```py
server = _WorkerServer("", 1234)
```
Test plan:
Added unit test
```
python test/distributed/elastic/test_control_plane.py
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127986
Approved by: https://github.com/c-p-i-o
where ``$BUILD_ENVIRONMENT`` is one of the build environments
enumerated in
[pytorch-dockerfiles](https://github.com/pytorch/pytorch/blob/master/.ci/docker/build.sh). The dockerfile used by jenkins can be found under the `.ci` [directory](https://github.com/pytorch/pytorch/blob/master/.ci/docker)
2. Run ``docker run -it -u jenkins $DOCKER_IMAGE``, clone PyTorch and
run one of the scripts in this directory.
The Docker images are designed so that any "reasonable" build commands
will work; if you look in [build.sh](build.sh) you will see that it is a
very simple script. This is intentional. Idiomatic build instructions
should work inside all of our Docker images. You can tweak the commands
however you need (e.g., in case you want to rebuild with DEBUG, or rerun
the build with higher verbosity, etc.).
We have to do some work to make this so. Here is a summary of the
mechanisms we use:
- We install binaries to directories like `/usr/local/bin` which
are automatically part of your PATH.
- We add entries to the PATH using Docker ENV variables (so
they apply when you enter Docker) and `/etc/environment` (so they
continue to apply even if you sudo), instead of modifying
`PATH` in our build scripts.
- We use `/etc/ld.so.conf.d` to register directories containing
shared libraries, instead of modifying `LD_LIBRARY_PATH` in our
build scripts.
- We reroute well known paths like `/usr/bin/gcc` to alternate
implementations with `update-alternatives`, instead of setting
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.