pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 12:54:11 +08:00

Author	SHA1	Message	Date
Xuehai Pan	17ab99463a	[Easy] Add notes for setting up dev venv with specific Python version (#164214 ) Resolves https://github.com/pytorch/pytorch/issues/164010#issuecomment-3340751377 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164214 Approved by: https://github.com/ezyang ghstack dependencies: #162324	2025-10-01 08:25:13 +00:00
Xuehai Pan	eca6ac2293	[BE][Easy] update CUDA and ROCm sources in nightly tool (#162324 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162324 Approved by: https://github.com/ezyang	2025-10-01 08:25:13 +00:00
Xuanteng Huang	12d4cb0122	Suppress `FutureWarning`s in `torch.distributed.algorithms.ddp_comm_hooks` (#163939 ) Fixes #163938 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163939 Approved by: https://github.com/cyyever, https://github.com/kwen2501	2025-10-01 07:51:12 +00:00
Haifeng Jin	590224f83c	Improve repeat op to a single copy (#163842 ) In #163455 , the `reshape` was not a pure view op. The `permute` before it created a non-contiguous tensor, which would trigger a data copy during the reshape. This PR improves the implementation by removing the `urtensor` intermediate tensor completely. Simply expanding the `xtensor` achieves the `repeat` effect. Before this PR, there were two data copies (in `urtensor.copy_` and `urtensor.reshape`). Now, there is only one data copy in the `.copy_()`. Reshape does not copy data because it operates on a contiguous tensor. One more note: we do want at least one copy because we want to duplicate the elements for the repeats, so that users can modify single elements in place without affecting others. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163842 Approved by: https://github.com/Skylion007 Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2025-10-01 06:27:53 +00:00
Yuanyuan Chen	cc8b14d09a	[2/N] Simplify "in" operation for containers of a single item (#164323 ) These issues are detected by ruff [FURB171](https://docs.astral.sh/ruff/rules/single-item-membership-test/#single-item-membership-test-furb171). Pull Request resolved: https://github.com/pytorch/pytorch/pull/164323 Approved by: https://github.com/justinchuby, https://github.com/Skylion007	2025-10-01 05:39:11 +00:00
Animesh Jain	96c3b9e275	[dynamo] Use strings instead of modules for fqn info tracking (#164272 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164272 Approved by: https://github.com/Skylion007, https://github.com/williamwen42, https://github.com/mlazos	2025-10-01 04:22:57 +00:00
Nikita Shulga	9ddfc59b9b	[BE] Delete stale non-ephemeral runners workarounds (#164285 ) As all Win runners are ephemeral, no need to cleanup leftover processes or uninstall PyTorch at the end of the test Pull Request resolved: https://github.com/pytorch/pytorch/pull/164285 Approved by: https://github.com/Skylion007	2025-10-01 03:47:36 +00:00
Nikita Shulga	6d4dfa0878	[CI] Push `viable/strict/${time}` tags (#164183 ) Every time viable strict is updated Pull Request resolved: https://github.com/pytorch/pytorch/pull/164183 Approved by: https://github.com/seemethere	2025-10-01 03:41:10 +00:00
Banit Agrawal	11ccb95ccb	[PyTorch Pinned Allocator] Pinned memory stats and perf fixes around allocating blocks (#163777 ) Summary: This diff adds bucket stats for pinned memory and also a perf fix to not check for sizes when background thread is enabled Differential Revision: D83162186 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163777 Approved by: https://github.com/bbus	2025-10-01 03:28:58 +00:00
Nikita Shulga	bd0907dc4c	[BE][CI] Unify requirements (#163396 ) Linux, Windows, and macOS CI workflows should all use `.ci/docker/requirements-ci.txt` TODOS: - Investigate why `choco install cmake` is needed to successfully detect MKL - Move `psutil` installation from specific scripts into requirements-ci.txt Pull Request resolved: https://github.com/pytorch/pytorch/pull/163396 Approved by: https://github.com/Skylion007	2025-10-01 03:28:48 +00:00
Alexander Grund	8bb71c07c4	Skip symmetric memory tests calling `_scaled_mm` on CCC < 8.9 (#164251 ) This avoids them failing on e.g. A100 GPUs with > RuntimeError: torch._scaled_mm is only supported on CUDA devices with compute capability >= 9.0 or 8.9, or ROCm MI300+ Pull Request resolved: https://github.com/pytorch/pytorch/pull/164251 Approved by: https://github.com/Skylion007, https://github.com/kwen2501	2025-10-01 03:26:21 +00:00
Yuanyuan Chen	fa90090735	Use dataclass features in two classes (#164221 ) This PR completes two TODO items by using features of `dataclass`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164221 Approved by: https://github.com/Skylion007, https://github.com/mlazos Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>	2025-10-01 03:20:39 +00:00
Aaron Gokaslan	591997490a	[BE][Easy]: Add prims common TypeGuard (#164263 ) Slightly improves typing by adding a TypeGuard. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164263 Approved by: https://github.com/albanD	2025-10-01 03:13:10 +00:00
mansiag05	531f3bf5e1	Adding check for square matrix for input tensor in matrix_exp backward op. (#163357 ) Fixes #146796 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163357 Approved by: https://github.com/lezcano	2025-10-01 03:12:30 +00:00
ankushwahaRH	2a5ce2feb4	Add algorithm in header (#164295 ) Fixes #163307. Added ```#include <algorithm>``` to vulkan QueryPool for the std::for_each call Pull Request resolved: https://github.com/pytorch/pytorch/pull/164295 Approved by: https://github.com/Skylion007	2025-10-01 03:09:50 +00:00
Yiming Zhou	3787a5a60e	[export] Explicitly passing requires_grad to nn.Parameter() in deserialization (#164290 ) Summary: `nn.Parameter()` by default has `requires_grad=True` and would cause issues when there are non-float parameters. Test Plan: buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_non_float_weight Differential Revision: D83598796 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164290 Approved by: https://github.com/angelayi	2025-10-01 02:55:20 +00:00
Animesh Jain	c66d18d24d	[dynamo][sac] Support functools partial context_fn for sac (#164308 ) Fixes https://github.com/pytorch/pytorch/issues/164300 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164308 Approved by: https://github.com/Lucaskabela, https://github.com/soulitzer	2025-10-01 02:47:55 +00:00
eellison	e0f118585f	skip non memory deps in memory estimator (#164294 ) Differential Revision: [D83601030](https://our.internmc.facebook.com/intern/diff/D83601030) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164294 Approved by: https://github.com/mlazos	2025-10-01 02:44:58 +00:00
bobrenjc93	10a005e87f	[torchfuzz] add layout operators (#164210 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164210 Approved by: https://github.com/pianpwk ghstack dependencies: #164034, #164209, #164211	2025-10-01 02:33:19 +00:00
bobrenjc93	1f3995cdc8	[torchfuzz] raise if Operator abstract method is not implemented (#164211 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164211 Approved by: https://github.com/pianpwk ghstack dependencies: #164034, #164209	2025-10-01 02:33:19 +00:00
bobrenjc93	abfcce58a4	[torchfuzz] remove erroneous can_produce check (#164209 ) can_produce is an abstract method that always returns false Pull Request resolved: https://github.com/pytorch/pytorch/pull/164209 Approved by: https://github.com/pianpwk ghstack dependencies: #164034	2025-10-01 02:33:19 +00:00
Jane Xu	5b1c39f5a1	Add smoke tests to verify that stable ABI FA3 wheel runs w/ newer torch (#163782 ) Passing CI: https://github.com/pytorch/pytorch/actions/runs/18141589975/job/51635340255?pr=163782 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163782 Approved by: https://github.com/huydhn, https://github.com/mikaylagawarecki	2025-10-01 02:30:38 +00:00
Simon Layton	8df3f2fa98	Revert new-test part of #163829 (#164259 ) Summary: New test sizes for `test_scaled_mm_vs_emulated_block_wise` all fail with ``` RuntimeError: Invalid scaling configuration ``` Disable these new tests for now (the remaining test is a parametrized version of the original test case) Test Plan: `pytest test/test_scaled_matmul_cuda.py` Reviewers: Subscribers: Tasks: Tags: Signed-off-by: Simon Layton <simonlayton@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/164259 Approved by: https://github.com/jananisriram ghstack dependencies: #164266	2025-10-01 02:23:21 +00:00
Simon Layton	7a9119948e	Split scaled-mm tests into separate file (#164266 ) Summary: * Split scaled-mm-specific tests into `test/test_scaled_matmul.py` Test Plan: ``` pytest test/test_matmul_cuda.py pytest test/test_scaled_matmul_cuda.py ``` Reviewers: Subscribers: Tasks: Tags: Signed-off-by: Simon Layton <simonlayton@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/164266 Approved by: https://github.com/Skylion007, https://github.com/albanD	2025-10-01 02:23:21 +00:00
Shangdi Yu	28c1d2f81b	[aoti] AOTI mingw cross compilation (#163188 ) To run this, you need to install `mingw64-gcc-c++` and download windows cuda library toolkit. See design doc and demo instructions in https://docs.google.com/document/d/1iDaChqA5nNKkBFTzsdkmoomvQlXHbnlb1Z4yEp7xaJA/edit?tab=t.0 If cross_platform_target is windows, we do the following: - do not link to `sleef`. This can be improved in the future if we need it. Currently I avoid it because that requires extra setup on the linux side - Use `mingw64-gcc-c++` to compile - Use `WINDOWS_CUDA_HOME` instead of `CUDA_HOME` when linking to cuda ``` python test/inductor/test_aot_inductor_windows.py -k so ``` Other changes: - de-couples compile_standalone config and dynamic link flag - create a new aot_inductor_mode config module, which is used to control configs in aot_inductor. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163188 Approved by: https://github.com/desertfire	2025-10-01 02:22:06 +00:00
Banit Agrawal	c4bbc6433e	[PyTorch CCA] Add an API to get expandable segment sizes (#163771 ) Summary: This diff adds an API to query the expandable segment size for each stream so that we can use this info to warm up the segment in advance, so we don't incur any performance penalty during steady-state inference for new CUDA memory allocations. Differential Revision: D76447308 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163771 Approved by: https://github.com/bbus	2025-10-01 02:16:58 +00:00
Jeff Daily	ad7e3c93b1	[ROCm][CD] librocroller.so missing from ROCm 7 wheel (#164244 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164244 Approved by: https://github.com/jeffdaily, https://github.com/Skylion007 Co-authored-by: Jeff Daily <jeff.daily@amd.com>	2025-10-01 00:02:34 +00:00
Jane Xu	7f3dc45300	Migrate DeviceType to torch/headeronly (#163999 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163999 Approved by: https://github.com/mikaylagawarecki	2025-09-30 23:13:27 +00:00
PyTorch UpdateBot	ff715366aa	[vllm hash update] update the pinned vllm hash (#164190 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164190 Approved by: https://github.com/pytorchbot	2025-09-30 22:43:49 +00:00
Sherlock Huang	60a4961ff4	[DTensor] Allow redistribute to Partial if src matches (#164253 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164253 Approved by: https://github.com/zpcore	2025-09-30 22:42:49 +00:00
Frank Lin	bec6541d84	[CUDA][CUDAGraph] Reduce capture overhead in CUDA Graph memory reuse (#162186 ) Previous work #158352 delivered CUDAGraph memory footprint reduction with no replay-time impact, but capture time regressed (up to 20× slower) due to repeated full-graph traversals. See previous benchmark results [here](https://github.com/pytorch/pytorch/pull/158352#issuecomment-3215947565) This PR removes capture/replay overhead while preserving the memory savings: 1. Terminals as free markers We stop inserting empty nodes and instead record the current stream terminals as free markers. This avoids mutating the user’s graph and keeps semantics unchanged. 2. Incremental, cached reachability We add a per-graph reuse context that caches reverse-traversal state: * `graph_reuse_context[graph].visited[stream]` tracks nodes already seen from that stream’s terminal frontier. * On each allocation during capture, we resume traversal from the latest terminals and only visit unseen nodes. * A block is freed when all its recorded markers are in the visited set of its allocation stream, i.e., all markers are proven predecessors of future work. See [the performance results here](https://docs.google.com/spreadsheets/d/e/2PACX-1vRPvdd9Xa8W87ixbiA0da_qvOhrUAjUpFz0G-_j-MsDnoeRyhEa4_ut_W3rqcg1VVZVFJ-gucwov-3b/pubhtml?gid=1468302443&single=true). We sweep synthetic multi-stream CUDA Graphs built by `capture_benchmark.py` (same as before, we generate random interleavings of alloc/free/join with given probabilities; see [gist here](https://gist.github.com/eee4017/e2092d215b1d4bd46534148939af39e3)) and compare median capture/replay times and memory. On an NVIDIA H100 PCIe across 24 configs, the optimization preserves reserved memory reduction at ~24–98%, leaves allocated memory unchanged, and brings capture time back to baseline (range 0.96–1.04× vs. baseline) with replay time unchanged (range 0.97–1.11×). Pull Request resolved: https://github.com/pytorch/pytorch/pull/162186 Approved by: https://github.com/eqy, https://github.com/ngimel	2025-09-30 22:28:46 +00:00
fduwjj	1f1de20ba9	[c10d][BE][ez] Update tensor ptr inside nccl.cpp (#164276 ) This is mostly a cosmetic change which replaces the deprecated `data_ptr` API with the mutable or const one. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164276 Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/kwen2501	2025-09-30 22:05:12 +00:00
Anshul Sinha	2810977d3a	[FSDP][Replicate] tests replicate type casting behavior and edge cases in mixed precision (#162861 ) Summary: Ensures that replicate can handle the same type casting behavior and edge cases that fully shard can when mixed precision is used Test Cases 1. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_float16_on_one_submodule 2. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_submodules_with_external_inputs 3. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_norm_modules_bf16 4. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_norm_modules_fp16 5. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_clamp_reduce_dtype 6. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k test_dataclass_input Pull Request resolved: https://github.com/pytorch/pytorch/pull/162861 Approved by: https://github.com/mori360 ghstack dependencies: #162830, #162836, #162839, #162851, #162853, #162855	2025-09-30 22:03:23 +00:00
Wei Feng	ae4fd4ea75	[FSDP2] support AC(FSDP) for torchtitan's MOE (#164009 ) for fsdp2 + EP, titan has fully_shard(AC(layer)) and fully_shard(layer.moe.experts): https://github.com/pytorch/torchtitan/issues/1624 for implicit prefetching, backward order is * _pre_backward unshard (norm, output) * _backward_prefetch unshard layers.6 * post_backward reshard (norm, output) * _pre_backward unshard layers.6 (no-op, unsharded already) * _backward_prefetch unshard layers.6.moe.experts * recompute_fn pre_forward unshard layers.6.moe.experts (no-op, unsharded already) * ~~recompute_fn post_forward reshard layers.6.moe.experts~~ <----- this PR make it a no-op * _pre_backward unshard layers.6.moe.experts (no-op, unsharded already) * _backward_prefetch unshard layers.5 * post_backward reshard layers.6.moe.experts * post_backward reshard layers.6 unit test: `pytest -s test/distributed/_composable/fsdp/test_fully_shard_comm.py -k test_set_modules_to_backward_prefetch_inside_ac` before fix: `NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh --parallelism.expert_parallel_degree=2` ``` [rank0]:[titan] 2025-09-30 11:43:01,714 - root - INFO - step: 1 loss: 12.0162 grad_norm: 1.7315 memory: 45.64GiB(48.05%) tps: 1,028 tflops: 10.87 mfu: 1.10% [rank0]:[titan] 2025-09-30 11:43:01,714 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40 [rank0]:[titan] 2025-09-30 11:43:35,233 - root - INFO - [GC] Performing periodical GC collection 0.06 seconds [rank0]:[titan] 2025-09-30 11:43:35,987 - root - INFO - step: 50 loss: 6.9302 grad_norm: 0.9985 memory: 59.66GiB(62.80%) tps: 11,712 tflops: 123.89 mfu: 12.53% ``` after fix: `NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh --parallelism.expert_parallel_degree=2` ``` [rank0]:[titan] 2025-09-30 11:38:57,377 - root - INFO - step: 1 loss: 12.0134 grad_norm: 1.6916 memory: 38.42GiB(40.45%) tps: 805 tflops: 8.51 
mfu: 0.86% [rank0]:[titan] 2025-09-30 11:38:57,377 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40 [rank0]:[titan] 2025-09-30 11:39:28,541 - root - INFO - [GC] Performing periodical GC collection 0.06 seconds [rank0]:[titan] 2025-09-30 11:39:29,279 - root - INFO - step: 50 loss: 6.9346 grad_norm: 1.1875 memory: 52.58GiB(55.36%) tps: 12,583 tflops: 133.10 mfu: 13.46% ``` for explicit prefetching, layers.6 backward prefetch layers.5 and layers.5.moe.experts. layers.6.moe.experts does not have explicit prefetch. backward order is like this * _pre_backward unshard (norm, output) * _prefetch_unshard layers.6 * post_backward reshard (norm, output) * _pre_backward unshard layers.6 (no-op, unsharded already) * _prefetch_unshard layers.5 * _prefetch_unshard layers.5.moe.experts * recompute_fn pre_forward unshard layers.6.moe.experts * ~~recompute_fn post_forward reshard layers.6.moe.experts~~ <----- this PR makes it a no-op * _pre_backward unshard layers.6.moe.expert (no-op, unsharded already) * post_backward reshard layers.6.moe.expert * post_backward reshard layers.6 before fix: `NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh --parallelism.expert_parallel_degree=2` ``` [rank0]:[titan] 2025-09-30 11:53:24,574 - root - INFO - step: 1 loss: 12.0180 grad_norm: 1.6948 memory: 45.77GiB(48.18%) tps: 849 tflops: 8.98 mfu: 0.91% [rank0]:[titan] 2025-09-30 11:53:24,574 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40 [rank0]:[titan] 2025-09-30 11:53:57,768 - root - INFO - [GC] Performing periodical GC collection 0.07 seconds [rank0]:[titan] 2025-09-30 11:53:58,515 - root - INFO - step: 50 loss: 6.9358 grad_norm: 1.0528 memory: 59.80GiB(62.95%) tps: 11,827 tflops: 125.10 mfu: 12.65%``` ``` after fix: `NGPU=4 CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_16b.toml" ./run_train.sh --parallelism.expert_parallel_degree=2` ``` 
[rank0]:[titan] 2025-09-30 12:08:39,404 - root - INFO - step: 1 loss: 12.0143 grad_norm: 1.7030 memory: 38.55GiB(40.58%) tps: 988 tflops: 10.45 mfu: 1.06% [rank0]:[titan] 2025-09-30 12:08:39,404 - root - INFO - Synchronizing and adjusting timeout for all ProcessGroups to 0:01:40 [rank0]:[titan] 2025-09-30 12:09:10,482 - root - INFO - [GC] Performing periodical GC collection 0.06 seconds [rank0]:[titan] 2025-09-30 12:09:11,168 - root - INFO - step: 50 loss: 6.9356 grad_norm: 0.9911 memory: 52.81GiB(55.59%) tps: 12,637 tflops: 133.68 mfu: 13.52% ``` Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: Pull Request resolved: https://github.com/pytorch/pytorch/pull/164009 Approved by: https://github.com/soulitzer	2025-09-30 22:02:24 +00:00
Animesh Jain	adc11a7634	[export] avoid checks during tracing of export verification (#164219 ) Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/164219 Approved by: https://github.com/Lucaskabela	2025-09-30 21:46:59 +00:00
Anshul Sinha	99e28ffab3	[FSDP][Replicate] tests replicate core functionality with mixed precision (#162855 ) Summary: Ensures that replicate functionality works the same as fully shard's when mixed precision is used Test Cases 1. pytest test/distributed/_composable/test_replicate_mixed_precision.py -k TestReplicateMixedPrecisionTraining Pull Request resolved: https://github.com/pytorch/pytorch/pull/162855 Approved by: https://github.com/mori360 ghstack dependencies: #162830, #162836, #162839, #162851, #162853	2025-09-30 21:45:58 +00:00
Anshul Sinha	01dd2c2b42	[FSDP][Replicate] tests replicate is composable with tp (#162853 ) Summary: Proof that new replicate API is composable with TP Test Case 1. pytest test/distributed/_composable/test_replicate_training.py -k test_replicate_tp Pull Request resolved: https://github.com/pytorch/pytorch/pull/162853 Approved by: https://github.com/mori360 ghstack dependencies: #162830, #162836, #162839, #162851	2025-09-30 21:29:54 +00:00
Anshul Sinha	d3bdf8c32e	[FSDP][Replicate] tests replicate with custom forward method (#162851 ) Summary: tests replicate works when users use custom forward methods Test Cases 1. pytest test/distributed/_composable/test_replicate_training.py -k test_register_fsdp_forward_method Pull Request resolved: https://github.com/pytorch/pytorch/pull/162851 Approved by: https://github.com/mori360 ghstack dependencies: #162830, #162836, #162839	2025-09-30 21:15:34 +00:00
Anshul Sinha	1ce9563ff6	[FSDP][Replicate] tests replicate gradient accumulation and 1f1b microbatching (#162839 ) Summary: In order to ensure that replicate acts as intended (a specialized version of hsdp) we need to make sure that it can pass the same tests that fully_shard can for training. The first test verifies Replicate works with gradient accumulation properly. The second verifies that replicate works correctly with a One-Forward-One-Backward (1F1B) pipeline parallelism schedule Test Cases 1. pytest test/distributed/_composable/test_replicate_training.py -k test_gradient_accumulation 2. pytest test/distributed/_composable/test_replicate_training.py -k test_1f1b_microbatching Pull Request resolved: https://github.com/pytorch/pytorch/pull/162839 Approved by: https://github.com/mori360 ghstack dependencies: #162830, #162836	2025-09-30 21:00:16 +00:00
xadupre	9e631392dc	Missing lambda in torch._check (#164225 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164225 Approved by: https://github.com/Skylion007	2025-09-30 20:32:38 +00:00
PaulZhang12	1cce6efdb8	Fix silent incorrectness for bmm/baddmm out_dtype overload (#164095 ) Add input checks like meta functions for standard ops in `ATen/native/LinearAlgebra.cpp` for the `out_dtype` variants. Fixes silent incorrectness in https://github.com/pytorch/pytorch/issues/163816 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164095 Approved by: https://github.com/ngimel	2025-09-30 20:13:13 +00:00
Nikita Shulga	5a93f00c79	[CI] Delete binary smoke workflows (#164260 ) Those were very useful in the past, because: - CI builder jobs did not generate wheels, but rather ran `python setup.py develop` and shared docker layers, which is no longer the case: all CI jobs produce wheels - CD jobs were targeting pre-CXX11 ABI, but this is no longer the case after manylinux2_28 migration Existing, but acceptable gaps: - Windows libtorch debug builds sometimes might fail, but IMO it's ok not to be able to produce those for a few days, as the number of libtorch users is somewhat small - All CD jobs are based on AlmaLinux, while CI are based on Ubuntu, but this could be adjusted if needed; besides, AlmaLinux-9 and Ubuntu-22.04 are pretty close in terms of glibc and gcc versions - CD jobs build for all GPU architectures, while CI only for the one being tested, but there are now periodic H100 and B200 jobs, and not a lot of development happens for Voltas or Pascals Besides, there are better tools to alert about the nightly failures Pull Request resolved: https://github.com/pytorch/pytorch/pull/164260 Approved by: https://github.com/seemethere, https://github.com/atalman	2025-09-30 20:00:07 +00:00
Yuanyuan Chen	e30f01b5b5	[1/N] Simplify "in" operation for containers of a single item (#164224 ) These issues are detected by ruff [FURB171](https://docs.astral.sh/ruff/rules/single-item-membership-test/#single-item-membership-test-furb171). Pull Request resolved: https://github.com/pytorch/pytorch/pull/164224 Approved by: https://github.com/rec, https://github.com/Skylion007	2025-09-30 19:59:43 +00:00
Jeff Daily	ffc645c870	half support for fused_moving_avg_obs_fake_quant() op (#164175 ) Follow up to https://github.com/pytorch/pytorch/pull/162620. Add half support, as well. This fixes some failures in inductor benchmarks such as from this log https://github.com/pytorch/pytorch/actions/runs/18051942373/job/51376749459. `NotImplementedError: "aminmax_kernel" not implemented for 'Half'` Pull Request resolved: https://github.com/pytorch/pytorch/pull/164175 Approved by: https://github.com/malfet, https://github.com/jerryzh168	2025-09-30 19:35:17 +00:00
Han Qi	60f0a356fd	Update persons of interest for XLA. The previous one is out of date. (#158652 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/158652 Approved by: https://github.com/JackCaoG, https://github.com/albanD	2025-09-30 19:21:18 +00:00
Kohaku-Blueleaf	d2c5f231f6	Fix the shape check inside gnll loss (#147522 ) Fixes #147521 This modification allows users to pass a var of any size to GaussianNLLLoss as long as the var is broadcastable (to input/target's size). Therefore, the demo code in #147521 will result in the expected behaviour and correct output. This allows all input sizes that match: `input.size = (..., n, ...), var.size = (..., 1, ...)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/147522 Approved by: https://github.com/mikaylagawarecki	2025-09-30 18:40:15 +00:00
PyTorch MergeBot	cc5d74c366	Revert "[BE] Remove HermeticPyObjectTLS and Simplify PythonOpRegistrationTrampoline (#163464 )" This reverts commit 94195a37ae4eae9c486a81b0f67725c8970f74d6. Reverted https://github.com/pytorch/pytorch/pull/163464 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/163464#issuecomment-3353307034))	2025-09-30 18:20:20 +00:00
Markus Hoehnerbach	a707042353	fix: inductor non_blocking test - warmup events to make test pass whether it is the first run or not (#164188 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164188 Approved by: https://github.com/williamwen42	2025-09-30 18:20:17 +00:00
Pian Pawakapan	d615f6b935	[inductor] use hint_override in kernel benchmark args (#164207 ) Summary: forward fix T239259207 Test Plan: test_multi_kernel Differential Revision: D83539263 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164207 Approved by: https://github.com/bobrenjc93, https://github.com/mlazos	2025-09-30 18:09:29 +00:00
Nick Riasanovsky	719b64ee8b	Fix TMA transpose logic to handle 1D shapes + string differences (#163966 ) Fixes #163702. This fixes 2 issues: 1. The value may inconsistently be a shape or string. This normalizes to handle both of these. 2. 1D shapes should not transpose data. This fixes the order of operations to prevent this. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163966 Approved by: https://github.com/eellison	2025-09-30 17:51:37 +00:00
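
Note on bec6541d84 (#162186): the "incremental, cached reachability" idea in that commit message can be sketched in plain Python. This is only an illustration of the technique under stated assumptions, not the allocator's actual code: the `ReuseContext` class, its method names, and the dictionary-based graph are hypothetical, and the real implementation works on CUDA graph nodes in C++.

```python
from collections import defaultdict


class ReuseContext:
    """Hypothetical per-graph cache of reverse-traversal state, keyed by stream."""

    def __init__(self, predecessors):
        # predecessors: node -> list of nodes it depends on (assumed layout)
        self.predecessors = predecessors
        # stream -> nodes already proven to precede that stream's terminals
        self.visited = defaultdict(set)
        # counts newly visited nodes, to show that work is not repeated
        self.steps = 0

    def advance(self, stream, terminals):
        """Resume the reverse traversal from the latest terminals,
        visiting only nodes that were not seen before."""
        seen = self.visited[stream]
        stack = [n for n in terminals if n not in seen]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            self.steps += 1
            stack.extend(p for p in self.predecessors.get(node, ()) if p not in seen)

    def can_free(self, stream, markers):
        """A block is reusable once every free marker is a proven predecessor."""
        return all(m in self.visited[stream] for m in markers)


# Toy graph: a -> b -> c, and d depends on both c and x.
graph = {"a": [], "b": ["a"], "c": ["b"], "x": [], "d": ["c", "x"]}
ctx = ReuseContext(graph)

ctx.advance("s0", ["c"])                  # visits c, b, a
first_pass = ctx.steps
free_b = ctx.can_free("s0", ["b"])        # b precedes the terminal c
free_x_before = ctx.can_free("s0", ["x"])  # x has not been reached yet

ctx.advance("s0", ["d"])                  # visits only d and x
second_pass = ctx.steps - first_pass
free_x_after = ctx.can_free("s0", ["x"])
```

In this toy run the second `advance` call touches only the two nodes that were not already cached, which is the property the commit message credits for bringing capture time back to baseline.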


93804 Commits
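
Note on e30f01b5b5 (#164224) and cc8b14d09a (#164323): the ruff FURB171 rewrite those commits apply replaces a membership test against a one-element container with a direct equality comparison. The sketch below shows the pattern on made-up functions (`is_cuda_before` and `is_cuda_after` are hypothetical and do not come from the PyTorch sources), plus one edge case where the two forms can differ.

```python
def is_cuda_before(device_type):
    # Flagged by FURB171: membership test against a single-item list.
    return device_type in ["cuda"]


def is_cuda_after(device_type):
    # Suggested form: compare against the single item directly.
    return device_type == "cuda"


# For ordinary values both forms agree.
same_for_match = is_cuda_before("cuda") == is_cuda_after("cuda")
same_for_miss = is_cuda_before("cpu") == is_cuda_after("cpu")

# Edge case: `in` checks identity before equality, so a NaN object is
# "in" a list that holds that same object even though NaN != NaN.
nan = float("nan")
nan_membership = nan in [nan]
nan_equality = nan == nan
```

For strings, integers, and enum members, which are the usual targets of this cleanup, the rewrite preserves behaviour; the NaN case shows why objects with unusual `__eq__` semantics need a closer look before the rule is applied mechanically.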