This PR introduces a way to compile a region of an FX graph using `fx.traceback.annotate`.
### UX
1) In the user code, mark the region that you want compiled with Inductor using `with fx_traceback.annotate({"compile_with_inductor": 0})`. As of now, we rely only on the string key `compile_with_inductor` and ignore the integer value. As needs arise, we can update the logic.
Example:
```
def fn(x, y):
    sin = torch.sin(x)
    with fx_traceback.annotate({"compile_with_inductor": 0}):
        mul = sin * y
        add = mul + 1
    return torch.sin(add)
```
2) You have to instruct the compiler to use the annotations via the `compile_fx_annotated_nodes_with_inductor` transformation. This is somewhat controversial, and a user might expect that setting the annotation alone is enough, but for now we require this explicit step to control the blast radius. One such example is:
```
# Set the fw and bw compiler of aot_autograd to `compile_fx_annotated_nodes_with_inductor`
def aot_eager_regional_inductor():
    return aot_autograd(
        fw_compiler=compile_fx_annotated_nodes_with_inductor,
        bw_compiler=compile_fx_annotated_nodes_with_inductor,
    )
```
3) Fixable in the short term - you have to wrap the user code in `torch.fx.traceback.preserve_node_meta()` to ensure that the annotations are propagated to the compiler. This is fixable; we just need to make CI happy. A minimal end-to-end sketch follows.
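Putting the three UX points together, a minimal sketch of the end-to-end flow (it reuses `fn` from (1) and `aot_eager_regional_inductor` from (2), both assumed to be defined in scope; the exact wiring in the PR's tests may differ):
```
import torch
import torch.fx.traceback as fx_traceback

def fn(x, y):
    sin = torch.sin(x)
    with fx_traceback.annotate({"compile_with_inductor": 0}):
        mul = sin * y
        add = mul + 1
    return torch.sin(add)

# Point (3): preserve node meta so the annotations reach the compiler.
with fx_traceback.preserve_node_meta():
    # `aot_eager_regional_inductor` is the backend helper from point (2).
    opt_fn = torch.compile(fn, backend=aot_eager_regional_inductor(), fullgraph=True)
    x = torch.randn(10, requires_grad=True)
    y = torch.randn(10, requires_grad=True)
    out = opt_fn(x, y)
    out.sum().backward()
```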
### Implementation
1) Relies on `CapabilityBasedPartitioner` to "scoop" out regions based on the annotations and create subgraphs in the main graph.
2) Calls `torch._inductor.standalone_compile` on these subgraphs and splices the returned callable back into the FX graph in place of the `call_module` node (a hedged sketch of such a pass follows).
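For illustration, here is a hedged sketch of what such a pass might look like using the public partitioner APIs. The names `AnnotatedNodeSupport` and `compile_annotated_regions` are hypothetical, and it assumes the annotation lands in `node.meta["custom"]`:
```
import torch
from torch.fx.passes.infra.partitioner import CapabilityBasedPartitioner
from torch.fx.passes.operator_support import OperatorSupport

class AnnotatedNodeSupport(OperatorSupport):
    # Treat any node carrying the "compile_with_inductor" annotation as part of
    # a region that should be scooped out into its own submodule.
    def is_node_supported(self, submodules, node):
        return "compile_with_inductor" in node.meta.get("custom", {})

def compile_annotated_regions(gm: torch.fx.GraphModule, example_inputs):
    partitioner = CapabilityBasedPartitioner(
        gm, AnnotatedNodeSupport(), allows_single_node_partition=True
    )
    # Rewrites each annotated region as a call_module node in the main graph.
    partitioned = partitioner.partition_and_fuse()
    # Each fused submodule would then be compiled with
    # torch._inductor.standalone_compile and the returned callable spliced back
    # in place of the call_module node (splicing omitted here for brevity).
    return partitioned
```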
After this pass, the resulting graph looks something like this (search for `torch__inductor_standalone_compile_inner`).
Forward graph:
```
class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[10]", primals_2: "f32[10]"):
        # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:64 in fn, code: sin = torch.sin(x)
        sin: "f32[10]" = torch.ops.aten.sin.default(primals_1)

        # No stacktrace found for following nodes
        inner = torch__inductor_standalone_compile_inner(sin, primals_2)

        # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:68 in fn, code: add = mul + 1
        getitem: "f32[10]" = inner[0]; inner = None

        # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:70 in fn, code: return torch.sin(add)
        sin_1: "f32[10]" = torch.ops.aten.sin.default(getitem)
        return (sin_1, primals_1, primals_2, sin, getitem)
```
Backward graph:
```
class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[10]", primals_2: "f32[10]", sin: "f32[10]", add: "f32[10]", tangents_1: "f32[10]"):
        # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:64 in fn, code: sin = torch.sin(x)
        cos_1: "f32[10]" = torch.ops.aten.cos.default(primals_1); primals_1 = None

        # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:70 in fn, code: return torch.sin(add)
        cos: "f32[10]" = torch.ops.aten.cos.default(add); add = None
        mul_1: "f32[10]" = torch.ops.aten.mul.Tensor(tangents_1, cos); tangents_1 = cos = None

        # No stacktrace found for following nodes
        inner = torch__inductor_standalone_compile_inner(mul_1, sin, primals_2); mul_1 = sin = primals_2 = None

        # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:67 in fn, code: mul = sin * y
        getitem: "f32[10]" = inner[0]
        getitem_1: "f32[10]" = inner[1]; inner = None

        # File: /data/users/anijain/pytorch2/test/dynamo/test_regional_inductor.py:64 in fn, code: sin = torch.sin(x)
        mul_4: "f32[10]" = torch.ops.aten.mul.Tensor(getitem_1, cos_1); getitem_1 = cos_1 = None
        return (mul_4, getitem)
```
### Some issues raised in the HOP meeting
1) CSE will not differentiate between nodes with different custom metadata and may do the wrong thing.
2) SAC - the recomputed forward will be smaller than the original forward. Will we then compile a smaller region?
3) What happens if there is an op in the middle that does not disturb the topology - is it still one subgraph?
4) What happens with nesting of `fx_traceback.annotate`? Are there any ordering requirements?
5) What are we going to use the annotations for?
a) compile flex
b) streams
c) nn.Module info to organize MoE components for pipelining
d) PP stages
e) Renaming graph nodes for easier debugging
f) No nested regional compile
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164776
Approved by: https://github.com/SherlockNoMad
ghstack dependencies: #165188
This PR introduces the QuantizedHuggingFaceReader component, which enables reading and dequantizing the quantized tensors in a SafeTensors checkpoint. The following capabilities are introduced:
- Configuration of the target dtype and the block size.
- Multi-threaded dequantization for efficiency (a generic sketch of the technique follows below).
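For intuition, here is a generic, hedged sketch of block-wise, multi-threaded dequantization of the kind described above. This is not the `QuantizedHuggingFaceReader` API, just an illustration of the technique (one scale per block, dequantized on a thread pool):
```
import torch
from concurrent.futures import ThreadPoolExecutor

def dequantize_blockwise(qtensor: torch.Tensor, scales: torch.Tensor,
                         block_size: int, target_dtype=torch.float32,
                         num_threads: int = 4) -> torch.Tensor:
    # One scale per block along dim 0; each thread dequantizes disjoint blocks.
    out = torch.empty(qtensor.shape, dtype=target_dtype)

    def _dequant_block(block_idx: int) -> None:
        start = block_idx * block_size
        end = min(start + block_size, qtensor.shape[0])
        out[start:end] = qtensor[start:end].to(target_dtype) * scales[block_idx]

    num_blocks = (qtensor.shape[0] + block_size - 1) // block_size
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        list(pool.map(_dequant_block, range(num_blocks)))
    return out
```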
Test Plan:
```
buck test //caffe2/test/distributed/checkpoint\:test_quantized_hf_storage
Time elapsed: 2:34.1s
Tests finished: Pass 31. Fail 0. Fatal 0. Skip 0. Build failure 0
```
Differential Revision: D80174674
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160682
Approved by: https://github.com/ankitageorge
# Context
In #161183, we added NUMA-binding support for `Callable` entrypoints to `elastic_launch`.
However, we would raise an exception if the subprocesses were spawned in parallel via `ThreadPoolExecutor`, an option configurable via the `TORCH_MP_PARALLEL_START` environment variable (see diff).
The logic here was that `os.sched_setaffinity`, which we used to set CPU affinities, is [per process](https://docs.python.org/3/library/os.html#os.sched_setaffinity), so there could be a race condition during a parallel start:
> Restrict the process with PID pid (or the current process if zero) to a set of CPUs. mask is an iterable of integers representing the set of CPUs to which the process should be restricted.
But on further reading, the Linux docs say [`sched_setaffinity` is per *thread*.](https://man7.org/linux/man-pages/man2/sched_setaffinity.2.html) As it turns out, the Python documentation is misleading here.
I [verified that `sched_setaffinity` only affects the calling thread, not the entire calling process.](https://gist.github.com/pdesupinski/7e2de3cbe5bb48d489f257b83ccddf07)
The upshot is that we actually *can* safely use the inheritance trick from #161183 even with parallel start, since the setting will be inherited from the calling thread, and `os.sched_setaffinity` only affects the calling thread.
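To make the per-thread semantics concrete, here is a minimal sketch along the lines of that verification (assumes Linux, where `os.sched_setaffinity` and `os.sched_getaffinity` are available):
```
import os
import threading

def restrict_worker():
    # pid=0 targets the *calling thread* at the syscall level.
    os.sched_setaffinity(0, {0})
    print("worker affinity:", os.sched_getaffinity(0))

main_before = os.sched_getaffinity(0)
t = threading.Thread(target=restrict_worker)
t.start()
t.join()
# The main thread's affinity is unaffected by the worker thread's call.
print("main unchanged:", os.sched_getaffinity(0) == main_before)  # expected: True
```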
# This PR
Remove restrictions against parallel start for NUMA binding.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161576
Approved by: https://github.com/d4l3k
# Context
In #160163, we added support for NUMA binding for `Callable` entrypoints to `elastic_launch`. This requires special consideration, because they go through a different path to spawn subprocesses compared to `str` entrypoints, a path which does not provide a straightforward way to utilize `numactl` CLI. See #160006 for a full description of the challenges.
Although #160163 worked in initial local experiments, we ran into some linker errors in other environments when we tried to call `numactl`. This appeared to be due to interactions with how the `LD_PRELOAD` environment variable was being set.
# This PR
On further thought, the most straightforward, foolproof solution here is to use [the trick that @d4l3k suggested.](https://github.com/pytorch/pytorch/issues/160006#issuecomment-3162018836)
Specifically, for each local rank `i`:
1. The parent process sets its own CPU affinity to what local rank `i`'s should be.
2. Then, the parent spawns the subprocess for local rank `i`.
3. Finally, the parent resets its own CPU affinity to what it was originally.
There were other solutions that would work just for `Callable` entrypoints, but I believe this is the simplest approach that works for *both* `str` and `Callable` entrypoints.
This required a bit of refactoring:
1. Turn all the `_get_.*_numactl_options` into functions which return a set of logical CPUs to bind to, rather than options like `--cpunodebind=0`.
2. Instead of wrapping commands with `numactl`, use `os.sched_setaffinity` to bind to the CPUs from (1.).
3. Put this all inside a context manager which encapsulates applying and restoring the bindings in the parent process.
4. Use the context manager for both the `str` and `Callable` paths (a sketch of such a context manager follows below).
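A hedged sketch of what such a context manager might look like; the names are illustrative, not the actual helpers in `torch.distributed.elastic`:
```
import os
from contextlib import contextmanager

@contextmanager
def bind_child_cpus(target_cpus):
    # Narrow the parent's (calling thread's) affinity to the CPUs chosen for
    # this local rank; a child spawned inside the block inherits it, and the
    # original affinity is restored afterwards.
    original = os.sched_getaffinity(0)
    os.sched_setaffinity(0, target_cpus)
    try:
        yield
    finally:
        os.sched_setaffinity(0, original)

# Usage sketch for local rank i (cpus_for_local_rank and
# spawn_subprocess_for_rank are hypothetical helpers):
# with bind_child_cpus(cpus_for_local_rank(i)):
#     proc = spawn_subprocess_for_rank(i)
```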
# Test Plan
## Automated
`$ pytest test/test_numa_binding.py`
## Manual
See [doc.](https://docs.google.com/document/d/1vxD-OKYBTT27jbBwtW9iz9g0tNM0u-i0tiTJg_ieQA8/edit?tab=t.0) Meta only, but TL;DR: I tried every combination of `str`/`Callable` entrypoint with binding disabled and enabled on the same model, and saw 2x SM utilization with binding enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161183
Approved by: https://github.com/d4l3k
This PR removes the integration point torch.fx had with torch::deploy (and another minor change).
Note: this PR surfaces some mypy errors, but I believe they existed in the code base beforehand and should be fixed in a separate PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158291
Approved by: https://github.com/albanD
ghstack dependencies: #158290
Options to address the "undocumented python objects":
1. Reference the functions in the .rst via the torch.nn.modules namespace. Note that this changes the generated doc filenames / locations for most of these functions!
2. [Not an option] Monkeypatch `__module__` for these objects (broke several tests in CI due to `inspect.findsource` failing after this change)
3. Update the .rst files to also document the torch.nn.modules forms of these functions, duplicating docs.
#### [this is the docs page added](https://docs-preview.pytorch.org/pytorch/pytorch/158491/nn.aliases.html)
This PR takes option 3 by adding an .rst page, nn.aliases, that documents the aliases in nested namespaces, removing all the torch.nn.modules.* entries from the coverage skiplist except:
- NLLLoss2d (deprecated)
- Container (deprecated)
- CrossMapLRN2d (what is this?)
- NonDynamicallyQuantizableLinear
This mostly required adding docstrings to `forward`, `extra_repr` and `reset_parameters`. Since forward arguments are already part of the module docstrings, I just added a very basic docstring.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158491
Approved by: https://github.com/janeyx99
Options to address the "undocumented python objects":
1. Reference the functions in the .rst via the `torch.functional` namespace. Note that this changes the generated doc filenames / locations for most of these functions!
2. Document these functions by referencing them from the `torch.` namespace instead, in line with common usage. This would also require setting the `__module__` for these functions and moving entries from `torch.functional`'s `__all__` -> `torch`'s `__all__`, which is BC-breaking.
3. Update the .rst files to also document the `torch.functional` forms of these functions, duplicating docs.
This PR takes option (3) above and:
* Removes all 20 `torch.functional` entries from the doc ignore list
* Removes `torch.functional.align_tensors()` entirely, since we don't want to document it.
  * This is technically BC-breaking, although the previous impl simply errored out. This change could be moved to a separate isolated PR for safety.
* Introduces `torch.aliases.md` as a hidden page for the `torch.functional` aliases to the `torch` analogue functions
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158581
Approved by: https://github.com/janeyx99
Summary: Change the HF classes to not have a leading underscore, thereby making them public. We will add documentation for them following this.
Test Plan:
Ensure existing tests pass.
Differential Revision: D76364024
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155837
Approved by: https://github.com/saumishr