I suspected that I would need to repack vLLM wheels from https://github.com/pytorch/pytorch/pull/162000 because I renamed the wheel, and it turns out to be true. The error is as follows:
```
$ uv pip install --pre xformers --index-url https://download.pytorch.org/whl/nightly/cu129
Using Python 3.12.11+meta environment at: venv/py3.12
Resolved 28 packages in 759ms
error: Failed to install: xformers-0.0.33.dev20250901+cu129-cp39-abi3-manylinux_2_27_x86_64.manylinux_2_28_x86_64.whl (xformers==0.0.33.dev20250901+cu129)
Caused by: Wheel version does not match filename: 0.0.33+5d4b92a5.d20250907 != 0.0.33.dev20250901+cu129
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162371
Approved by: https://github.com/atalman
Summary: This adds a `_rank` field to DeviceMesh init that allows for instantiating a DeviceMesh without depending on `dist.get_rank()` which requires a global PG to be instantiated.
Test Plan:
```
buck2 test mode/opt -c fbcode.enable_gpu_sections=true //caffe2/test/distributed:device_mesh -- init_backend
```
Rollback Plan:
Differential Revision: D81981777
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162439
Approved by: https://github.com/kwen2501, https://github.com/fduwjj
Fixing issue introduced in https://github.com/pytorch/pytorch/pull/158538
where `aten.copy_.default` is registered as a pointwise op, but without linearity.
In particular, when both `src` and `dst` tensors have same `Partial` placements, direct copy should happen without redistribute, instead of redistributing both to `Replicate` before making the copy.
This was discovered from silent incorrect results e.g. on `torch.einsum` backward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162460
Approved by: https://github.com/zpcore
Summary: Avoid multiple storage writer resets in async save. Currently the reset gets called by the async_save method and then again in the save method. In the async path, async_save should only do the staging and the reset should only happen in the synchronous save path.
Test Plan:
```
buck test 'fbcode//mode/opt' //aiplatform/modelstore/experimental/DCP/tests:checkpoint_dist_client_test
```
https://www.internalfb.com/intern/testinfra/testrun/15199648841705052
Rollback Plan:
Differential Revision: D79230339
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159448
Approved by: https://github.com/meetv18
Fixes#154849
This change addresses the request to add support for SIGUSR1 and SIGUSR2 signals in torchrun for SLURM environments. Changes supports these signals through the configurable `TORCHELASTIC_SIGNALS_TO_HANDLE` environment variable and signals_to_handle parameter from laucher api
Tests:
For validations purpose:
test_signal_handling.py,
simple_test_api_signal_handling.py,
Unit Tests:
for launcher changes:launcher/test_api.py
for api changes: multiprocessing/test_api.py
E2E: test_run.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160690
Approved by: https://github.com/fduwjj
I confirmed that the tracing was correct i.e. NamedTupleVariable had the correct dynamic attribute added to it.
The problem was that NamedTupleVariable was always marked as immutable. This does not reflect the behavior of namedtuple.
Subclasses of namedtuple may be mutable, so when a NamedTupleVariable is derived from a subclass that is mutable, I made NamedTupleVariable mutable as well. Then side_effects correctly updates the returned object.
Fixes#161610
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161645
Approved by: https://github.com/anijain2305, https://github.com/StrongerXi
This pull request enhances the PyTorch operator benchmarking suite by introducing support for benchmarking with `torch.compile` mode, in addition to existing Eager and JIT. It also adds peak memory measurement (fwd/bwd pass); improves the output format in JSON to be used by dashboard for reporting; and introduce some more CLI options. The new CLI flags introduced are:
- Added `--use-compile` CLI argument and corresponding logic to run benchmarks using `torch.compile`, including mutual exclusivity with `--use-jit`
- Added `--benchmark-name` argument for customizing the benchmark name in output
- Updated default value for `--output-json-for-dashboard` to `benchmark-results.json` for more predictable output file name
Sample command to run a single operator:
`python -m pt.mm_test --use-compile`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161394
Approved by: https://github.com/jbschlosser
My goal right now is to try to make the "vanilla" AccumulateGrad path for DTensor (that just calls detach) fast. I'm doing this in two steps:
(1) [this PR]: hardcode aten.detach in DTensor to re-use the input tensor's DTensorSpec, instead of running "real" sharding prop.
(2) [assuming success of 1]: move the detach() call into C++, try adding a DTensor dispatch key, and avoid dispatching back to python entirely (except for some code that probably needs to allocate a pyobject for the output DTensor, from C++)
I'm pushing this PR first to confirm that I don't break anything with my detach fastpath. I did some manual local testing to confirm that for normal usages of detach, the input and output DTensor have equal DTensorSpec objects. Technically, we previously would allocate a fresh DTensorSpec, and with this change we are just re-using the input tensor's DTensorSpec. So I'm mostly hoping that DTensorSpecs don't generally get mutated
This by itself does seem to speed up `alias` by quite a bit (roughly 2.5x speedup, from ~336us -> 133us):
**aten.detach(plain_tensor)**
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f8da2921790>
_ = x.detach()
4.80 us
1 measurement, 100000 runs , 1 thread
```
**aten.detach(DTensor) [before this PR]**
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f47cd68e750>
_ = x_dt.detach()
336.40 us
1 measurement, 1000 runs , 1 thread
```
**aten.detach(DTensor) [after this PR]**
```
<torch.utils.benchmark.utils.common.Measurement object at 0x7f0a34c05520>
_ = x_dt.detach()
Median: 133.45 us
2 measurements, 1000 runs per measurement, 1 thread
```
benchmark script:
```
import torch
import torch.distributed as dist
from torch.distributed.tensor import DeviceMesh, DTensor, Partial, Replicate, Shard
from torch.testing._internal.distributed.fake_pg import FakeStore
import torch.utils.benchmark as benchmark
fake_store = FakeStore()
dist.init_process_group("fake", store=fake_store, rank=0, world_size=2)
mesh = torch.distributed.device_mesh.init_device_mesh('cuda', (2,))
x = torch.randn(4, 4, requires_grad=True)
x_dt = DTensor.from_local(x, mesh, [Shard(0)], run_check=False)
t0 = benchmark.Timer(
stmt='_ = x_dt.detach()',
globals={'x_dt': x_dt},
)
print(t0.blocked_autorange())
dist.destroy_process_group()
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160580
Approved by: https://github.com/ezyang
# why
- unnecessary as we only ever need to know the dtype and maybe the
device
- we already take in the kernel inputs which have the device
- enable us to specify the layout after finding all the configs
but before generating the ChoiceCallers
# what
- replace all calls in template_heuristics that used to take Layout
with now just taking out_dtype
# testing
ci
Differential Revision: [D81820115](https://our.internmc.facebook.com/intern/diff/D81820115)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162238
Approved by: https://github.com/eellison
ghstack dependencies: #161347, #161348, #161349
# why
- enable us to override the default configs, or fall back to them
through subclassing InductorChoices
# what
- override (private) function
- default implementationt takes the kernel template choice (ktc)
generator for every template and just executes the generator
- future overrides can decide to replace those generators, or filter
out choices
- the 2nd expensive step (maybe_append_choices, choice_or_none) is
handled outside this function, in the main V.choices.get_mm_configs
this means that any overriding benefits from not generating expensive
templates that aren't going to be used
# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520570](https://our.internmc.facebook.com/intern/diff/D81520570)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161349
Approved by: https://github.com/eellison
ghstack dependencies: #161347, #161348
\# why
- every callsite just executes the generator on the spot
- previous pr adds the ability to add an override before expensive
generators are executed, so we don't need this generator anymore
\# what
- rather than yielding the ChoiceCaller, just return the list of all
valid ChoiceCallers
\# testing
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520574](https://our.internmc.facebook.com/intern/diff/D81520574)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161348
Approved by: https://github.com/eellison
ghstack dependencies: #161347
# why
- gather everything up to make choices, without running
potentially expensive generators
- enables overrides where we toss the entire list of configs
from inductor, without having to enumrate it (expensive)
# what
- add a holding class that just gets all the components necessary
to generate a ChoiceCaller
- use that class to generate ChoiceCallers
- this does not (yet) add the override function, but just prepares
the scene
```
python3 -bb -m pytest test/inductor/test_max_autotune.py -v
```
Differential Revision: [D81520569](https://our.internmc.facebook.com/intern/diff/D81520569)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161347
Approved by: https://github.com/eellison
Our compiler is generating inefficient code for the offsetCalc in certain situations.
The root-cause for this needs to be identified. For now specialized unrolling based on 'dims' notably helps perf.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161700
Approved by: https://github.com/jeffdaily