pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-11-19 10:04:58 +08:00

Author	SHA1	Message	Date
Jithun Nair	c555f241a6	Use GITHUB_TOKEN	2025-11-11 18:35:37 +00:00
Jithun Nair	742408cf7f	Remove github.token	2025-11-11 18:26:29 +00:00
Jithun Nair	d6c644b97e	Add actions: read permissions to be able to access artifacts for other workflow run	2025-11-11 18:24:47 +00:00
Jithun Nair	2ef8a3a2fc	try another way of setting token	2025-11-11 18:13:33 +00:00
Jithun Nair	68f019947d	Use secrets.GITHUB_TOKEN	2025-11-11 17:00:17 +00:00
Jithun Nair	99ec56b8bd	syntax	2025-11-11 16:53:30 +00:00
Jithun Nair	fc1ebb1a3f	Remove invalid workflow input	2025-11-11 16:52:15 +00:00
Jithun Nair	d7091e5e8e	Add workflow_dispatch trigger	2025-11-11 16:50:44 +00:00
Jithun Nair	4a71e2ebae	syntax	2025-11-11 16:49:05 +00:00
Jithun Nair	53e74a3ac0	Hardcode values for testing	2025-11-11 16:46:45 +00:00
Jithun Nair	b572cd15de	Try to use docker-builds artifacts	2025-11-11 16:24:56 +00:00
Jithun Nair	ada16deeaa	Name of artifact must be unique	2025-11-11 15:29:52 +00:00
Jithun Nair	a937421281	Upload artifacts named for each matrix element	2025-11-11 15:25:55 +00:00
Jithun Nair	e314ea03c3	Upload artifacts for docker-builds	2025-11-11 15:03:11 +00:00
Jithun Nair	b89f0103a4	Check if caching benefits are seen on test runners	2025-11-11 02:06:44 +00:00
Jithun Nair	e02e639c8c	Restore code to skip ghcr.io step if not push	2025-11-11 00:43:10 +00:00
Jithun Nair	4ba1846d8a	Pull ghcr.io image and tag it as ECR image	2025-11-11 00:34:06 +00:00
Jithun Nair	42dd417bc7	Run ghcr.io push step but don't actually push	2025-11-10 21:14:12 +00:00
Jithun Nair	1e34a45463	Merge branch 'main' into add_rocm_docker_caching	2025-11-10 15:06:10 -06:00
Jithun Nair	6e2c62ac17	Use ghcr.io image to see if docker pull time decreases	2025-11-10 21:02:06 +00:00
William Wen	fe0bb7cf60	[export, 3.14] handle patching methods with functools.partial correctly in non-strict export (#167396 ) Note: dynamo is not affected by this since patching class methods are not supported right now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/167396 Approved by: https://github.com/angelayi ghstack dependencies: #167382, #167383, #167384, #167387	2025-11-10 20:52:05 +00:00
William Wen	cf63b212e3	[3.14, dataloader] handle forkserver default mp start method in 3.14 (#167387 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/167387 Approved by: https://github.com/malfet ghstack dependencies: #167382, #167383, #167384	2025-11-10 20:52:05 +00:00
William Wen	17e70ae459	[dynamo, 3.14] enable dynamo in 3.14 (#167384 ) dynamo tests are passing in the CI PR above - so we could probably just enable dynamo right now. Pull Request resolved: https://github.com/pytorch/pytorch/pull/167384 Approved by: https://github.com/Skylion007, https://github.com/mlazos ghstack dependencies: #167382, #167383	2025-11-10 20:52:05 +00:00
William Wen	ad7db3617e	[inductor, 3.14] catch pickle.PicklingError exceptions (#167383 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/167383 Approved by: https://github.com/aorenste ghstack dependencies: #167382	2025-11-10 20:52:04 +00:00
William Wen	5320ca3725	[inductor, 3.14] fix itertools.product pickle error in test_cpu_repro (#167382 ) `inductor/test_cpu_cpp_wrapper` was failing since it was attempting to pickle `itertools.product`, and that is no longer picklable in 3.14. We work around this by eagerly generating a list. Pull Request resolved: https://github.com/pytorch/pytorch/pull/167382 Approved by: https://github.com/atalman, https://github.com/malfet	2025-11-10 20:52:04 +00:00
Malay Bag	3e4faca130	[torch.export] Refactor placeholder_naming_pass to reduce CCN (#166600 ) Summary: Reduced CCN from 37 to 28 of placeholder_naming_pass method Test Plan: Existing tests Differential Revision: D85820388 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166600 Approved by: https://github.com/angelayi viable/strict/1762823531	2025-11-10 20:44:18 +00:00
Sean McGovern	0c2f206ded	Typo fix - baddbmm_strategy (#166963 ) This is called by registration with decorator, so function not called directly. For clarity, add the "b" for "batch" in function name. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166963 Approved by: https://github.com/janeyx99 viable/strict/1762821623	2025-11-10 20:35:42 +00:00
Robert Hardwick	6cf21fa331	Fix -ffunction-sections, -fdata-sections not being added on aarch64. (#166407 ) Preferred solution to #166380 Changes: - Moved summary print to bottom of CMakeLists.txt - Fix the problem that `add_compile_options` must be called before targets are defined, so opted for `append_cxx_flag_if_supported` and `append_c_flag_if_supported` (new). - Added extra verbosity so it can be seen when the linker script is added (unfortunately the linker script has to be added per-target rather than globally due to ninja/cmake dependency tracking). Also move summary print to bottom of CMakeLists.txt and improve logging Pull Request resolved: https://github.com/pytorch/pytorch/pull/166407 Approved by: https://github.com/Aidyn-A, https://github.com/atalman	2025-11-10 20:32:08 +00:00
Thanh Ha	cdc8460f2c	Use c7i.2xlarge for H100 build (#167466 ) The build system may be oversized for what is necessary. Reduce the size to optimize costs. The default workflow runner is linux.c7i.2xlarge, so we are just removing the runner definition in the workflow so that it uses the default. Relates to pytorch/test-infra#7175. Pull Request resolved: https://github.com/pytorch/pytorch/pull/167466 Approved by: https://github.com/seemethere	2025-11-10 20:20:54 +00:00
Shangdi Yu	86130aa2ca	Fix flaky memory profiler test [2] (#167268 ) Fixes #167037 Move the module definition outside of the unit test so when we run the unit test multiple times, the module is not re-compiled. Pull Request resolved: https://github.com/pytorch/pytorch/pull/167268 Approved by: https://github.com/angelayi viable/strict/1762818582	2025-11-10 19:51:38 +00:00
Jazlyn Li	9491830c79	move subgraph_has_impure_ops from `node.is_impure` into const_fold to unblock production (#167443 ) Summary: https://github.com/pytorch/pytorch/pull/166609 updates `node.is_impure` to consider a submodule as impure if submodule contains impure node. This in turn changes `graph.eliminate_dead_code()` function behavior, which does not eliminate nodes with side effects, see [pytorch documentation](https://docs.pytorch.org/docs/stable/fx.html#torch.fx.Graph.eliminate_dead_code) > Remove all dead code from the graph, based on each node’s number of users, and whether the nodes have any side effects. While this is correct that a submodule containing side-effectful ops is side-effectful and should not be dead code eliminated, some customers rely on the dead code elimination to eliminate submodules that contain impure ops which is the behavior before #166609 fix. Due to production environment constraints, we have to revert https://github.com/pytorch/pytorch/pull/166609 and move the side-effectful submodule check logic to `const_fold.py`, which will correctly not const-fold a submodule that contains impure ops. NOTE other call sites that use `node.is_impure()` to make decisions are still incorrectly eliminating side-effectful submodules, but we can't safely change that today. ## This pr - move `_subgraph_has_impure_op` into `fx/experimental/const_fold.py`, check and prevent const-folding an impure submodule - added a note in `node.is_impure` to highlight the incorrect behavior and context in case people go looking in the future. Test Plan: run test_fx_const_fold and all tests pass Differential Revision: D86641994 Pull Request resolved: https://github.com/pytorch/pytorch/pull/167443 Approved by: https://github.com/jfix71 viable/strict/1762817094	2025-11-10 19:29:54 +00:00
Aaron Orenstein	04a85b4c21	[compile-on-one-rank] Step 1: DeviceId (#166680 ) Add a "--virtual-local-rank" mode to torchrun. When used instead of passing the local rank in LOCAL_RANK it uses a LOCAL_RANK of "0" and adjusts CUDA_VISIBLE_DEVICES to reflect the desired GPU index. Testing: (tweaked run_train.sh to use `--log-dir`) ``` export NGPU=8 export CONFIG_FILE="./torchtitan/models/llama3/train_configs/debug_model.toml" with-proxy ./run_train.sh --model.name compiler_toolkit.llama3 --compile.enable --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4 ``` And then comparing ranks: Without --virtual-local-rank gives a lot of differences like: ``` [rank#]: mul_1: "f32[8, 512, 256]" = torch.ops.aten.mul.Tensor(mul, view_9); mul = None -[rank#]: _to_copy_3: "bf16[8, 512, 256]" = torch.ops.aten._to_copy.default(mul_1, dtype = torch.bfloat16, layout = torch.strided, device = device(type='cuda', index=0)); mul_1 = None +[rank#]: _to_copy_3: "bf16[8, 512, 256]" = torch.ops.aten._to_copy.default(mul_1, dtype = torch.bfloat16, layout = torch.strided, device = device(type='cuda', index=1)); mul_1 = None [rank#]: detach: "f32[8, 512, 1]" = torch.ops.aten.detach.default(rsqrt); rsqrt = None ``` With --virtual-local-rank makes those differences go away. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166680 Approved by: https://github.com/ezyang	2025-11-10 18:47:31 +00:00
Catherine Lee	a4437d76f0	Add some labeler rules that used to be in the autolabel bot (#167330 ) See https://github.com/pytorch/test-infra/pull/7446 for the paths Pull Request resolved: https://github.com/pytorch/pytorch/pull/167330 Approved by: https://github.com/huydhn	2025-11-10 18:38:42 +00:00
Shangdi Yu	3ea829a337	Fix torch.cond HOP device in inductor (#167354 ) Fixes #166918 The output device may not be on the same device as the predicate device. ``` python test/inductor/test_control_flow.py -k test_output_on_different_device ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/167354 Approved by: https://github.com/ydwu4, https://github.com/zou3519 viable/strict/1762815047	2025-11-10 18:19:38 +00:00
Nikita Shulga	3966b5ad05	[BE] Fix out-of-bounds index_put in test_mps.py (#167444 ) Discovered while enabling assertions on out-of-bounds accesses. Otherwise test fails with ``` ERROR: test_sdpa_mask_fp16_L6_S17_NH23_HS121 (__main__.TestSDPA.test_sdpa_mask_fp16_L6_S17_NH23_HS121) ---------------------------------------------------------------------- Traceback (most recent call last): File "/Users/malfet/git/pytorch/pytorch/torch/testing/_internal/common_utils.py", line 3334, in wrapper method(*args, **kwargs) ~~~~~~^^^^^^^^^^^^^^^^^ File "/Users/malfet/git/pytorch/pytorch/build/../test/test_mps.py", line 9494, in test_sdpa_mask_fp16_L6_S17_NH23_HS121 self._test_sdpa_mask(torch.float16, 7, 17, 23, 121) ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/Users/malfet/git/pytorch/pytorch/build/../test/test_mps.py", line 9478, in _test_sdpa_mask y_ref = F.scaled_dot_product_attention(q.cpu(), k.cpu(), v.cpu(), attn_mask=mask.cpu(), dropout_p=0.0, is_causal=False) ~~~~~^^ torch.AcceleratorError: index out of range ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/167444 Approved by: https://github.com/Skylion007, https://github.com/manuelcandales viable/strict/1762813841	2025-11-10 18:19:28 +00:00
Jason Ansel	f6a79b2a4a	[inductor] Wrap pallas_call in jax.jit (#167441 ) My understanding is this is needed for performance. Pull Request resolved: https://github.com/pytorch/pytorch/pull/167441 Approved by: https://github.com/oulgen viable/strict/1762811384	2025-11-10 17:29:56 +00:00
albanD	2fcf41dd8e	Add the ruff rule and skip everything for now (#167360 ) Part of https://github.com/pytorch/pytorch/issues/164878 We can start narrowing the skips and remove them as PRs keep landing. This PR is just to setup the scaffolding, fix will be in follow up Pull Request resolved: https://github.com/pytorch/pytorch/pull/167360 Approved by: https://github.com/janeyx99 viable/strict/1762809894	2025-11-10 17:10:15 +00:00
Bin Bao	31ccd8f13e	[AOTI] Fix a mixed-device bug for scatter_add (#167341 ) Summary: Fix https://github.com/pytorch/pytorch/issues/166841. AOTI incorrectly generates a call to aoti_torch_cuda_scatter_reduce_two_out while the op should actually run on CPU. Fix by using the correct device when calling _generate_scatter_fallback in the wrapper codegen. Pull Request resolved: https://github.com/pytorch/pytorch/pull/167341 Approved by: https://github.com/yushangdi	2025-11-10 16:59:44 +00:00
Angel Li	59307ca1bc	[BE] adding documentation (#167334 ) `torch.ao.quantization` and `torch.fx.experimental` <img width="833" height="518" alt="Screenshot 2025-11-07 at 3 20 54 PM" src="https://github.com/user-attachments/assets/47b72f28-29bd-4bab-b41f-24d97419e411" /> <img width="892" height="560" alt="Screenshot 2025-11-07 at 3 20 45 PM" src="https://github.com/user-attachments/assets/129825ab-6706-41f2-964d-8774debab18c" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/167334 Approved by: https://github.com/janeyx99 viable/strict/1762800711	2025-11-10 14:46:42 +00:00
PyTorch UpdateBot	c28475db7c	Update slow tests (#166844 ) This PR is auto-generated weekly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/weekly.yml). Update the list of slow tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/166844 Approved by: https://github.com/pytorchbot viable/strict/1762793628	2025-11-10 12:39:27 +00:00
PyTorch UpdateBot	74aec83841	[xla hash update] update the pinned xla hash (#167452 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned xla hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/167452 Approved by: https://github.com/pytorchbot	2025-11-10 12:03:01 +00:00
zpcore	52e744d68a	[DTensor] Support convert StridedShard to shard order and vice versa (#166740 ) We plan to use `StridedShard` to express `shard_order`. This PR adds the function to support the conversion between `StridedShard` and `shard_order`. I moved some test-related functions into torch/testing/_internal/common_utils.py. We may only care about _dtensor_spec.py and test_utils.py in this PR for the review. ### How to convert shard order to StridedShard: Consider the example: - placements = $[x_0, x_1, x_2, x_3, x_4]$, where all $x_?$ are sharded on the same tensor dim. Let's see how the shard order will impact the split_factor (sf). We loop from right to left in the placements to construct the split_factor by assuming different shard orders. Starting from $x_4$, this should be a normal shard. Then $x_3$. There are two possibilities. $x_3$'s order can be before $x_4$. If so, $x_3$'s sf=1, because $x_3$ is before $x_4$ in the placements. Otherwise $x_3$'s order is after $x_4$, and then $x_3$'s sf should be the mesh dim size of $x_4$, which is $T(x_4)$: <img width="820" height="431" alt="image" src="https://github.com/user-attachments/assets/f53b4b24-2523-42cc-ad6f-41f3c280db70" /> We can use this method to decide on the split factor for $x_2$, $x_1$ and so on. ### How to convert StridedShard to shard order: This follows the same method above. We check all possible paths and use the real split_factor to see which path matches the split_factor. If there is no such match, the StridedShard cannot be converted to shard order. --- Pull Request resolved: https://github.com/pytorch/pytorch/pull/166740 Approved by: https://github.com/ezyang viable/strict/1762781259	2025-11-10 09:35:10 +00:00
Yu, Guangye	3cfbf98ea9	[xpu][feature] Add XPU support on torch.accelerator.get_memory_info (#162564 ) # Motivation Support XPU for `torch.accelerator.get_memory_info`. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162564 Approved by: https://github.com/albanD ghstack dependencies: #156812	2025-11-10 05:34:49 +00:00
Isalia20	47db55258b	[MPS] sparse sparse mm (#167013 ) Sparse sparse mm op implementation Pull Request resolved: https://github.com/pytorch/pytorch/pull/167013 Approved by: https://github.com/malfet	2025-11-10 05:27:49 +00:00
Isalia20	50af6f3393	[MPS] erfinv for sparse mps (#166711 ) Should be merged after #166708 Pull Request resolved: https://github.com/pytorch/pytorch/pull/166711 Approved by: https://github.com/Skylion007, https://github.com/malfet viable/strict/1762779774	2025-11-10 05:25:31 +00:00
Nikita Shulga	e545ba2d34	[DTensor] Fix Conv behavior for replicate strategy (#167402 ) Pass `dim_map` to `_requires_data_exchange` and return False if both spatial and channels dimensions are replicated. Modify `test_conv1d` and `test_conv3d` to check values rather than just shape, and replicate `conv3d` across batch dimension. In general, it feels like the current Convolution implementation was written to work only if the tensor is sharded across the last dimension. Pull Request resolved: https://github.com/pytorch/pytorch/pull/167402 Approved by: https://github.com/ezyang viable/strict/1762777661	2025-11-10 05:13:42 +00:00
Wang, Chuanqi	a058bbdd6f	[xpu][test] Enable profiler test for XPU (#165423 ) Fixes #165130 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165423 Approved by: https://github.com/EikanWang, https://github.com/atalman, https://github.com/mlazos viable/strict/1762775374	2025-11-10 04:02:59 +00:00
Slawomir Siwek	2c78080ec0	Register functorch XPU/HPU dispatch keys (#167095 ) Fixes TestOperatorsXPU.test_data_write_errors_under_transform_xpu https://github.com/intel/torch-xpu-ops/issues/2237 Tests on other devices throw runtime error "_mutating directly with `.data` inside functorch transform is not allowed._", but XPU/HPU fails earlier on `_has_compatible_shallow_copy_type`. This check is not met only when calling tensor.data inside functorch call. ```cpp bool _has_compatible_shallow_copy_type(const Tensor& self, const Tensor& from) { return self.unsafeGetTensorImpl()->has_compatible_shallow_copy_type( from.key_set()); } ``` ### t.data \| Tensor \| Device \| Dispatch Keys \| \|--------\|---------\|---------------\| \| `self` \| `xpu` \| `XPU, ADInplaceOrView, AutogradXPU, AutocastXPU` \| \| `from` \| `cpu` \| `CPU, ADInplaceOrView, AutogradCPU, AutocastCPU` \| ### t.data inside functorch transform \| Tensor \| Device \| Dispatch Keys \| \|--------\|---------\|---------------\| \| `self` \| `xpu` \| `ADInplaceOrView, AutogradOther, FuncTorchGradWrapper` \| \| `from` \| `cpu` \| `CPU, ADInplaceOrView, AutogradCPU, AutocastCPU, FuncTorchGradWrapper` \| ### t.data inside functorch transform + XPU dispatch key \| Tensor \| Device \| Dispatch Keys \| \|--------\|---------\|---------------\| \| `self` \| `xpu` \| `XPU, ADInplaceOrView, AutogradXPU, AutocastXPU, FuncTorchGradWrapper` \| \| `from` \| `cpu` \| `CPU, ADInplaceOrView, AutogradCPU, AutocastCPU, FuncTorchGradWrapper` \| Pull Request resolved: https://github.com/pytorch/pytorch/pull/167095 Approved by: https://github.com/guangyey, https://github.com/albanD viable/strict/1762760973	2025-11-10 03:10:22 +00:00
Oguz Ulgen	fe6615e397	Swap pallas test shard to 12.8 (#167428 ) Getting some weird failures building cuda13, let's stick to what we know works Pull Request resolved: https://github.com/pytorch/pytorch/pull/167428 Approved by: https://github.com/jansel viable/strict/1762759504	2025-11-10 02:42:35 +00:00
Yu, Guangye	abf31db2cc	Introduce a new API torch.accelerator.get_memory_info (#156812 ) # Motivation `torch.cuda.mem_get_info` and `torch.xpu.mem_get_info` are widely used in other popular repos, such as `076313bd09/python/sglang/srt/utils.py (L378)`, `7ecc2d7f39/src/accelerate/utils/modeling.py (L822)`, and `7ba34b1241/vllm/worker/worker.py (L150)`. This PR introduces a unified API `torch.accelerator.get_memory_info` to cover this scenario. Pull Request resolved: https://github.com/pytorch/pytorch/pull/156812 Approved by: https://github.com/albanD	2025-11-10 01:57:39 +00:00

95784 Commits