pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
Dzmitry Huba	2cd5fd1588	Enable local tensor mode on DTensor view ops test (#165596 ) While enabling this test, discovered a lack of support for sub meshes. Added limited support for sub meshes by properly computing rank coordinates for a given sub mesh. The implementation follows a similar approach to collectives. We infer all sub meshes for the given dimensions and compute each rank's coordinates with respect to its sub mesh. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165596 Approved by: https://github.com/ezyang	2025-10-16 20:52:06 +00:00
Oguz Ulgen	7d0f872cb3	Use union syntax in torch/_inductor runtime and fx_passes (#165652 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165652 Approved by: https://github.com/aorenste	2025-10-16 20:51:59 +00:00
PyTorch MergeBot	fb06e49ce8	Revert "[inductor] print 0.0 as 0 for triton (#164291 )" This reverts commit 99b32a6750bfd0cfe2bc84a47823e1da34802b7b. Reverted https://github.com/pytorch/pytorch/pull/164291 on behalf of https://github.com/malfet due to Broke slow job, see `aba8c43594/1` ([comment](https://github.com/pytorch/pytorch/pull/164291#issuecomment-3412768915))	2025-10-16 20:44:29 +00:00
PyTorch MergeBot	27a98e6ae9	Revert "[DeviceMesh] Prefer using _layout over _mesh for all sorts of things (#165554 )" This reverts commit d61a9b88cf3be04a29c5a7d6e9622ae5e8d51de3. Reverted https://github.com/pytorch/pytorch/pull/165554 on behalf of https://github.com/malfet due to Looks like it broke serialization test, see `aba8c43594/1` ([comment](https://github.com/pytorch/pytorch/pull/165554#issuecomment-3412765681))	2025-10-16 20:41:37 +00:00
PyTorch MergeBot	b10f463b1a	Revert "[DeviceMesh] Introduce private constructor instead of _create_mesh_from_ranks (#165555 )" This reverts commit 99097b6d89c927c15180ff4683c38be01f9955f6. Reverted https://github.com/pytorch/pytorch/pull/165555 on behalf of https://github.com/malfet due to Looks like it broke serialization test, see `aba8c43594/1` ([comment](https://github.com/pytorch/pytorch/pull/165554#issuecomment-3412765681))	2025-10-16 20:41:37 +00:00
PyTorch MergeBot	431c13cf61	Revert "[DeviceMesh] Simplify unflatten method (#165556 )" This reverts commit 86fd4fc23e697e275d37c36e3cbe521f156434fd. Reverted https://github.com/pytorch/pytorch/pull/165556 on behalf of https://github.com/malfet due to Looks like it broke serialization test, see `aba8c43594/1` ([comment](https://github.com/pytorch/pytorch/pull/165554#issuecomment-3412765681))	2025-10-16 20:41:37 +00:00
Janani Sriram	9bf5b38c14	[Inductor][Triton][FP8] Refactor scaled_mm template to accept scaling mode (#164318 ) Summary: Refactor `scaled_mm` Inductor template to support template choice based on scaling mode. This modification sets up the infrastructure for adding new templates based on new scaling modes, such as deepseek-style scaling (a follow-up diff), as new scaling modes (deepseek, block, group) scale before the accumulation (as opposed to per-tensor and per-row scaling, which apply scaling after accumulation). This modification also further enables Inductor to infer a scaling type based on the shape of the scaling tensors, which makes existing infrastructure more extensible to new scaling modes. Test Plan: ``` TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor CUDA_LAUNCH_BLOCKING=1 TORCH_USE_CUDA_DSA=1 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/{opt,inplace} pytorch/tritonbench:run -- --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm --metrics tflops,accuracy --m 256 --n 768 --k 512 --output="/home/jananisriram/personal/random_bench.csv" --scaling_rowwise --atol=20 --rtol=2 2>&1 | tee ~/personal/random.log ``` Differential Revision: D83591083 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164318 Approved by: https://github.com/drisspg, https://github.com/slayton58	2025-10-16 20:40:45 +00:00
linhaifeng	37f3ba274a	[Fix] Use sys.executable instead of hardcoded python (#165633 ) Replace hardcoded "python" string with sys.executable to ensure correct Python interpreter is used. This fixes failures on systems with multiple Python runtimes or where "python" is not in PATH. Similar to pytorch/pytorch#155918 Fixes #ISSUE_NUMBER Pull Request resolved: https://github.com/pytorch/pytorch/pull/165633 Approved by: https://github.com/Skylion007	2025-10-16 20:26:10 +00:00
IvanKobzarev	585b9dbb5e	[async_tp] Support ag+mm with gather_dim lastdim of mat_A (#163068 ) Adding ag+mm support for the case, when gather_dim is last dim of matmul (reduction dim). When we decompose matmul by reduction dimension we result in partials that needs additional reduction, we allocate memory for accumulator. Decomposition should not produce small (thin) mms that can not efficiently load the GPU. Limiting for minimal size of the shard 1024 (found empirically by testing in torchtitan). scaled_mm is not supported yet for this case. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163068 Approved by: https://github.com/ngimel	2025-10-16 20:14:39 +00:00
Maggie Moss	d795fb225a	[RFC] Add pyrefly to lintrunner (#165179 ) This will add pyrefly to lint runner as a warning only, and allow us to collect feedback about the tool before switching to pyrefly as the main type checker. References the steps outlined here: https://github.com/pytorch/pytorch/issues/163283 Test plan: `lintrunner init` `lintrunner` confirm when pyrefly errors are present results look like: https://gist.github.com/maggiemoss/e6cb2d015dd1ded560ae1329098cf33f Pull Request resolved: https://github.com/pytorch/pytorch/pull/165179 Approved by: https://github.com/ezyang	2025-10-16 20:07:09 +00:00
Shunting Zhang	a303d6dda9	[inductor] don't try to reorder loops for template (#165601 ) fix https://github.com/pytorch/pytorch/issues/165579 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165601 Approved by: https://github.com/yushangdi	2025-10-16 19:05:21 +00:00
Luca Wehrstedt	86fd4fc23e	[DeviceMesh] Simplify unflatten method (#165556 ) By adding a few small helpers (e.g., a `splice` method to `_MeshLayout`, and making `_init_process_groups` static and thus stateless) we can substantially shorten the definition of the unflatten method, and help readability. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165556 Approved by: https://github.com/fduwjj ghstack dependencies: #165554, #165555	2025-10-16 18:36:16 +00:00
Luca Wehrstedt	99097b6d89	[DeviceMesh] Introduce private constructor instead of _create_mesh_from_ranks (#165555 ) The refactoring of DeviceMesh is heavily constrained by the signature of its constructor, which is a public API which contains some "legacy" concepts which we'd love to get rid of, such as an explicit/materialized `mesh` Tensor. In other languages the solution to this would be to add a private overload of the constructor. Python doesn't natively allow this, but in this PR I managed to build something that approximates it. This new private constructor basically only takes `_layout`, `_global_rank_permutation`, and `mesh_dim_names`. With such a constructor we can effectively simplify a lot of callsites and get rid of the `_create_mesh_from_ranks` helper method. That's a good thing because it was instantiating many DeviceMeshes in a for loop, which always felt unnecessary. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165555 Approved by: https://github.com/fduwjj, https://github.com/fegin ghstack dependencies: #165554	2025-10-16 18:36:16 +00:00
IvanKobzarev	7d87d7052e	[inductor][bucketing] Fx collectives bucketing of multiple dtypes (#162470 ) Bucketing of multiple dtypes to be processed in one bucketed collective. First target is to bucket bf16 and f32, but already can be used with other dtypes. For now multidtype bucketing is only supported with "custom_ops" mode. Non custom_ops needs additional work on inductor side. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162470 Approved by: https://github.com/eellison	2025-10-16 18:31:43 +00:00
arkadip-maitra	1a34ff4e04	Fixing get_local_rank() variable missing when compiled (#165432 ) Fixes #165215 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165432 Approved by: https://github.com/bdhirsh	2025-10-16 18:20:34 +00:00
Luca Wehrstedt	d61a9b88cf	[DeviceMesh] Prefer using _layout over _mesh for all sorts of things (#165554 ) The goal of this PR is to avoid storing the explicit `mesh` Tensor inside each DeviceMesh, and instead compute it on-the-fly when the end user needs it, and try to replace all of its internal usages with `_layout` and the newly-introduced `_global_rank_permutation` Tensor. The name of this attribute is up for debate. The advantage of the `_global_rank_permutation` Tensor is that it is _the same_ Tensor for the root mesh and all its children, so it doesn't need to be copied/reallocated. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165554 Approved by: https://github.com/fduwjj	2025-10-16 17:01:44 +00:00
Isuru Fernando	99b32a6750	[inductor] print 0.0 as 0 for triton (#164291 ) Fixes https://github.com/pytorch/pytorch/issues/164157 Fixes https://github.com/pytorch/pytorch/issues/164086 Pull Request resolved: https://github.com/pytorch/pytorch/pull/164291 Approved by: https://github.com/bobrenjc93	2025-10-16 16:37:50 +00:00
Brian Hirsh	ed74dc054d	add the option to disable functionalization in AOTDispatcher (#164577 ) I'm cleaning this PR up as a proper way of disabling functionalization via config in AOTDispatcher. I removed the non-functionalization related changes from the original version: (1) preventing proxy mode (and functionalization) from incorrectly decomposing CIA ops (Ed has a PR for it here: https://github.com/pytorch/pytorch/pull/164939) (2) preventing python-dispatcher-based decomps above autograd from running. I'm not doing this for now, will likely do it in a followup Pull Request resolved: https://github.com/pytorch/pytorch/pull/164577 Approved by: https://github.com/ezyang ghstack dependencies: #165372	2025-10-16 15:44:11 +00:00
Brian Hirsh	f33c7e1a43	add and fix OpInfo tests for the default partitioner (#165372 ) I noticed the default partitioner was breaking in some dynamic shape tests, so prior to turning off functionalization I want to tweak it to pass all of our OpInfo tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/165372 Approved by: https://github.com/ezyang	2025-10-16 15:44:11 +00:00
Lucas Kabela	e6d9d68598	[Bugfix][Dynamo] Fix Sparse tensors by graph break in Dynamo (#164873 ) Fixes #164823 by making lack of support for sparse tensors very explicit (in fake tensor, inductor, and lowering code) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164873 Approved by: https://github.com/williamwen42, https://github.com/eellison, https://github.com/mlazos	2025-10-16 15:06:20 +00:00
IvanKobzarev	9272437cde	Fx collectives bucketing: add bucket all_reduce (#165351 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165351 Approved by: https://github.com/eellison	2025-10-16 13:27:33 +00:00
lichuyang	f06e669f6c	refactor: replace runtime_error with TORCH_CHECK for better error handling (#163628 ) Fixes some parts of issue #148114 @pytorchbot label "topic: not user facing" @FFFrog PTAL Pull Request resolved: https://github.com/pytorch/pytorch/pull/163628 Approved by: https://github.com/albanD	2025-10-16 11:09:48 +00:00
Nan Zhang	00afa06800	Add cse for make_block_ptr in Triton codegen (#163399 ) Summary: per title Test Plan: added test cases Differential Revision: D82648215 Pull Request resolved: https://github.com/pytorch/pytorch/pull/163399 Approved by: https://github.com/jansel, https://github.com/njriasan	2025-10-16 05:29:48 +00:00
Oguz Ulgen	5d0b22008d	Codemod inductor/fx_passes from Optional to union none (#165606 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165606 Approved by: https://github.com/aorenste ghstack dependencies: #165604, #165605	2025-10-16 04:59:47 +00:00
Oguz Ulgen	ab6014a903	Codemod inductor/runtime from Optional to union none (#165605 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165605 Approved by: https://github.com/aorenste ghstack dependencies: #165604	2025-10-16 04:59:47 +00:00
Oguz Ulgen	f6daffc54d	Codemod codecache.py from Optional to union none (#165604 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165604 Approved by: https://github.com/aorenste	2025-10-16 04:59:37 +00:00
Simon Fan	21697feff2	[hop] run local_map with interpreter to preserve fx_traceback annotations (#165336 ) We have an issue when using fx_traceback.annotate and HOPs that trace joint graphs. HOPs have bodies that have already been traced by Dynamo, and after Animesh's PR, does have the annotations. But when we lower that Dynamo HOP body to aten in either pre-dispatch or post-dispatch, we need to propagate the annotations to the aten nodes. AOTAutograd does this indirectly by piggybacking off the `PropagateUnbackedSymInts` fx.Interpreter. I'm not sure if all HOPs should be using it to trace their joints or not. This PR adds an interpreter to local_map's implementation. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165336 Approved by: https://github.com/yushangdi	2025-10-16 02:53:17 +00:00
Xilun Wu	12fa4192c5	[ContextParallel] add process-time based Round-Robin load-balance to CP (#163617 ) Summary The load-balancing problem can be modeled as [identical-machines scheduling](https://en.wikipedia.org/wiki/Identical-machines_scheduling) problem. We already provided an easy-to-extend interface in #161062 for implementing load-balancing and in this PR we start with adding a Round-Robin solution as an example and also a verification. This can be easily adapted to other solutions like Shortest-processing-time-first/ Longest-processing-time-first with extra padding added for collectives. - Added a new type of `_LoadBalancer` implementation `_PTRRLoadBalancer` which is designed for `flex_attention()`. This load-balance strategy analyzes the `BlockMask` sparsity info and perform Round-Robin (unlike traditional Round-Robin doing it in circular order, we do in zig-zag order). - Make `_context_parallel_buffers` and `context_parallel_unshard` handle batched load-balance index (previously it can only handle non-batched load-balance index), like in `create_cp_block_mask`. Test `pytest test/distributed/tensor/test_attention.py` Pull Request resolved: https://github.com/pytorch/pytorch/pull/163617 Approved by: https://github.com/fegin	2025-10-16 02:20:27 +00:00
Shangdi Yu	19ba506ca3	Support libtorch and posix mingw flavor (#165574 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165574 Approved by: https://github.com/desertfire	2025-10-16 01:31:50 +00:00
jmaczan	003dd13073	[dynamo, guards] Better error messages when generated guard fails on the same frame (#165242 ) Not sure what exactly we want to have in the message, but that's easy to adjust. I tried to find a reliable test to reproduce this message (happens only when a guard fails right after it's created), but I ended up mocking a `guard_manager.check` function to return `False` to trigger this behavior. I think that's fine, because any other case that we pick (like datetime.now()), we want to patch one day anyway, so every time we make the next patch, will need to chase for another repro test @williamwen42 Fixes #164990 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165242 Approved by: https://github.com/williamwen42	2025-10-16 01:05:31 +00:00
Pearu Peterson	ca8bd5dbed	Move toString(ScalarType) and ScalarType ostream operator to headeronly (#164405 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164405 Approved by: https://github.com/Skylion007, https://github.com/janeyx99 ghstack dependencies: #164350, #164354	2025-10-16 00:55:43 +00:00
Pearu Peterson	26f3803433	Remove workaround to old CUDA bug (#164354 ) As in the title. A check for https://github.com/pytorch/pytorch/issues/164348 to see if the workaround can be removed. Pull Request resolved: https://github.com/pytorch/pytorch/pull/164354 Approved by: https://github.com/janeyx99, https://github.com/ngimel, https://github.com/malfet, https://github.com/jeffdaily ghstack dependencies: #164350	2025-10-16 00:55:43 +00:00
Pearu Peterson	48064acf37	Move AT_FORALL_... macros and ScalarTypeToCPPTypeT to headeronly (#164350 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164350 Approved by: https://github.com/janeyx99	2025-10-16 00:55:42 +00:00
xinan.lin	e5a9c247bc	[Fix XPU CI] [Inductor UT] Fix test cases broken by community. (#165406 ) Fixes #163159, Fixes #164098, Fixes #164097, Fixes #164099, Fixes #165025 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165406 Approved by: https://github.com/EikanWang, https://github.com/jansel	2025-10-16 00:53:32 +00:00
PaulZhang12	901bbcba12	Gate division bitwise numerics under a flag (#165566 ) https://github.com/pytorch/pytorch/pull/164144 ensures that division for compile is bitwise equivalent with eager. However, in https://github.com/pytorch/pytorch/issues/164301, the kernel performance is regressed. On B200: With standard triton `/`: 6511 GB/s With triton `div_rn`: 4692 GB/s Further investigation is required for the generated PTX to see why there is such a large slowdown. For now, enable bitwise equivalent results under `TORCHINDUCTOR_EMULATE_DIVISION_ROUNDING` similar to emulate_precision_cast Pull Request resolved: https://github.com/pytorch/pytorch/pull/165566 Approved by: https://github.com/ngimel, https://github.com/eellison	2025-10-15 23:41:01 +00:00
Nikhil Patel	febb603230	[Inductor][CuTeDSL] Move load_template up two directories (#165347 ) (#165576 ) Summary: Moves the function used to load CuTeDSL Jinja templates up one level out of the flex attention folder. This way it can be used for more general Inductor templates in the future. Test Plan: `INDUCTOR_TEST_DISABLE_FRESH_CACHE=1 TORCHINDUCTOR_CACHE_DIR=~/cutetest buck2 run mode/opt //caffe2/test/inductor:cutedsl_grouped_mm -c fbcode.nvcc_arch=b200a -c fbcode.enable_gpu_sections=true -c fbcode.platform010_cuda_version=12.8` Reviewed By: drisspg Differential Revision: D84527470 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165576 Approved by: https://github.com/jananisriram	2025-10-15 23:37:55 +00:00
Xiao Fu	568d2f3ae7	[Dynamo][Logging] Add sources/types to LazyVariableTracker logging (#165402 ) Fixes #162860 This change adds variable source attribution to LazyVariableTracker when outputting trace bytecode. Test plan -- test/dynamo/test_error_messages.py ErrorMessagesTest.test_variable_tracker_source_attribution The output is as specified in the previously mentioned GitHub issue. This is specifically for the log setup with ``TORCH_LOGS=trace_bytecode`` Pull Request resolved: https://github.com/pytorch/pytorch/pull/165402 Approved by: https://github.com/Lucaskabela, https://github.com/williamwen42 Co-authored-by: William Wen <williamwen@meta.com>	2025-10-15 23:23:09 +00:00
James Wu	b54e466fd0	Megacache integration (#163533 ) This diff adds megacache integration for DynamoCache. Because DynamoCache requires lazy serialization, i.e. it can only be serialized once all relevant backends have been compiled and we're ready for a save, we actually do the DynamoCache saving only on a call to `torch.compiler.save_cache_artifacts`. Differential Revision: [D82735763](https://our.internmc.facebook.com/intern/diff/D82735763/) Pull Request resolved: https://github.com/pytorch/pytorch/pull/163533 Approved by: https://github.com/oulgen, https://github.com/zhxchen17	2025-10-15 22:49:15 +00:00
Sarthak Tandon	66ea76ec44	[ROCm][tunableop] Improvements to tunableop Numerical Check (#163079 ) Modified the flag PYTORCH_TUNABLEOP_NUMERICAL_CHECK, so that it accepts the numerical tolerances in the format atol_rtol as compared to the previous 0 and 1. Retains previous functionality with default values as well. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163079 Approved by: https://github.com/naromero77amd, https://github.com/jeffdaily	2025-10-15 22:26:47 +00:00
Richard Zou	e787d532b6	tmp fix for compile internal logger issue (#165568 ) Summary: Catch runtime exception when parsing and scrubbing uninteresting configs from inductor config Test Plan: tested locally Differential Revision: D84727788 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165568 Approved by: https://github.com/luccafong, https://github.com/oulgen	2025-10-15 22:03:16 +00:00
eellison	b3f6d49b69	Overlap scheduler improvements (#165318 ) Bucketing a number of smallish improvements: - Account for bucketing in overlap calculation: if an in-flight collective exists with the same bucket key, reduce the new collective's estimated time by its latency time - Update compute domination so we are ordering based on compute idx, as opposed to compute depth, so we never reorder compute. This makes it a bit easier to reason about memory and pre-fetching, although we can explore reordering in the future. - When we wait on a collective, force all collectives on the same process group as it that were enqueued prior to the collective to wait as well. Better Memory Handling: - Pre-fetch limiting - when scheduling collectives for overlap, only pre-fetch up to a certain distance, then schedule off-path collectives (which are typically memory reducing). - When we are above peak memory, schedule waits. TODO: - for each compute node, we know its original memory in the graph. We could limit pre-fetching that goes across peak memory - By scheduling off-path collectives for overlap, we reduce memory, but if there weren't enough compute for overlap, we need to proactively schedule them. Not an issue yet on examples. - config some hard coded constants, clean up enablement (can do in subsequent pr) On small llama 2d backward: 578 of 618 potentially hideable collectives hidden, original mem 14.4GB, rescheduled mem 15.9GB. On forward: 254/256 potentially hideable collectives hidden, original mem 5.8GB, rescheduled mem 5.8GB. WIP: adding tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/165318 Approved by: https://github.com/ezyang, https://github.com/IvanKobzarev ghstack dependencies: #164738, #164783, #164944, #164945, #165059	2025-10-15 21:58:47 +00:00
Howard Huang	bc1f2108d7	[PP] Update backward_counter and fsdp util to schedule class (#165513 ) Fixed one issue with FSDP last reshard not being called. Rest is mostly refactoring, changing some variables to be class variables so they can be used in https://github.com/pytorch/torchtitan/pull/1721 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165513 Approved by: https://github.com/fegin	2025-10-15 21:58:16 +00:00
Boyuan Feng	f071f17911	[Graph Partition] fix partition x memory plan issue (#165514 ) For `test_graph_partition_with_memory_plan_reuse`, before this PR, when using graph partition, it would error ([P1992728479](https://www.internalfb.com/phabricator/paste/view/P1992728479)): ``` def partition_0(args): ... del buf0 return (buf3, buf4, buf5, buf2, primals_4, ) ... File "/tmp/torchinductor_boyuan/ww/cwwc7ukfqscg2vy6ankby2fizdb377tvgyx3fwdgddrxe3g47jg6.py", line 132, in partition_0 return (buf3, buf4, buf5, buf2, primals_4, ) ^^^^ NameError: name 'buf2' is not defined. Did you mean: 'buf0'? ``` When not using graph partition, it would work and give the following code ([P1992997521](https://www.internalfb.com/phabricator/paste/view/P1992997521)): ``` def call(self, args): ... buf2 = buf0; del buf0 # reuse ... ``` Note that the issue is buf0 is not reused for buf2 when using graph partition. Why? Because the codegen runs `run_wrapper_ir_passes` and `memory_plan_reuse`, which pops tailing `MemoryPlanningLine` unless it is in graph output by checking `V.graph.get_output_names()`. However, for graph partition, we should check the output of the current partition instead of the graph before partition. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165514 Approved by: https://github.com/ProExpertProg, https://github.com/eellison	2025-10-15 21:52:16 +00:00
Avik Chaudhuri	fa1539594b	consolidate fw and inference compile paths (#165457 ) By design, fw compile and inference compile stages should share a bunch of code; just consolidating the duplication here. Differential Revision: D84628978 Pull Request resolved: https://github.com/pytorch/pytorch/pull/165457 Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan	2025-10-15 21:33:50 +00:00
zpcore	dfc8a1c5dd	Fix `_StridedShard` incorrect split (#165533 ) https://github.com/pytorch/pytorch/pull/164820 introduced a bug that `_StridedShard` will call parent class `Shard`'s `split_tensor` method, thus results in incorrect data locality. (I think @ezyang spotted this issue, but we have no test to capture this) Meanwhile, I notice another bug that when we normalize a `_StridedShard`'s placement, it will also trigger parent class `Shard`'s `split_tensor` method because it will create a Shard class [here](`0c14f55de6/torch/distributed/tensor/_api.py (L783)`). I think we never test `distribute_tensor` for `_StridedShard` before. So I added a test here to compare against ordered shard. Using classmethod because the _split_tensor logic is different between `Shard` and `_StridedShard`. Basically I want to shard on local tensors without initializing the Shard object: ``` local_tensor = _StridedShard._make_shard_tensor(dim, tensor, mesh, mesh_dim, split_factor=split_factor) local_tensor = Shard._make_shard_tensor(dim, tensor, mesh, mesh_dim) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/165533 Approved by: https://github.com/XilunWu	2025-10-15 20:52:41 +00:00
Sarthak Tandon	7f9b745494	[ROCm][tunableop] Modified Online Tuning Mode to add Instant Logging (#163965 ) - Added instant logging in online tuning mode, so that each tuned GEMM is instantly written - Allows us to have saved tuning configs, in cases of crashes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/163965 Approved by: https://github.com/naromero77amd, https://github.com/jeffdaily	2025-10-15 20:02:31 +00:00
Lucas Kabela	83f9baf413	[Bugfix][Precompile][vLLM] Support for pickling einops for aot_autograd serialization in vLLM (#165359 ) Fixes issue with compiling `Qwen2_5_vl` in https://github.com/vllm-project/vllm/pull/23207 (issue happens with `aot_autograd_cache`) Pull Request resolved: https://github.com/pytorch/pytorch/pull/165359 Approved by: https://github.com/jamesjwu	2025-10-15 20:00:24 +00:00
Angel Li	78f5a1ec60	varlen api (#164502 ) Summary Today, the only way to have variable sequence length support in PyTorch attention is through nested tensors [here](https://docs.pytorch.org/tutorials/intermediate/scaled_dot_product_attention_tutorial.html#nestedtensor-and-dense-tensor-support). We also want to add an explicit lower-level API that provides variable sequence length support without padding/masking in SDPA. This PR builds out `varlen_attn`, the public API that users can call for the forward method, and `_varlen_attn`, the private API that calls into the Flash Attention/cuDNN backend. Benchmarking To benchmark, we compare runtime and TFLOPs against the current SDPA approach with padding. Settings: - 1 H100 machine - `batch_size=8`, `max_seq_len=2048`, `embed_dim=1024`, `num_heads=16` - dtype `torch.bfloat16` - `is_causal=False` - for variable length, we set sequences to be random multiples of 64 up to `max_seq_len` - 100 runs \| \| Variable Length API \| SDPA \| \|--------\|--------------------\|----------\| \| Runtime \| 0.21750560760498047 ms \| 0.43171775817871094 ms \| \| TFLOPs \| 231.812 \| 320.840 \| The sparsity is 0.453 which we can see matches the speedup we get from Varlen (approx 50%). TFLOPs remains around the same, with SDPA slightly larger due to potential higher overhead and total flops scaling with sequence length. Testing Run `python test/test_varlen_attention.py` for unit tests where we verify basic functionality and confirm numerical match between varlen outputs vs SDPA. Next steps Next steps from this PR (higher in the stack) include registering the private API `_varlen_attn` as a custom op, implementing backward support, and enabling cuDNN with correct numerics. (This stack builds on top of #162326) Pull Request resolved: https://github.com/pytorch/pytorch/pull/164502 Approved by: https://github.com/v0i0, https://github.com/drisspg	2025-10-15 19:45:55 +00:00
eellison	2b71b62045	Add Memory Estimation Tracker (#165059 ) Add Memory Tracker utility, which will track live memory given alternate ordering of nodes. Pull Request resolved: https://github.com/pytorch/pytorch/pull/165059 Approved by: https://github.com/ezyang, https://github.com/IvanKobzarev ghstack dependencies: #164738, #164783, #164944, #164945	2025-10-15 19:44:29 +00:00
PyTorch MergeBot	8c4b528403	Revert "[Inductor][CuTeDSL] Move load_template up two directories (#165347 )" This reverts commit 815d6415996d5b32b569fd2a8206f1e57c75bfe3. Reverted https://github.com/pytorch/pytorch/pull/165347 on behalf of https://github.com/pytorch-auto-revert due to Reverted automatically by pytorch's autorevert, to avoid this behaviour add the tag autorevert: disable ([comment](https://github.com/pytorch/pytorch/pull/165347#issuecomment-3407958496))	2025-10-15 19:30:46 +00:00
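
The entry "[Fix] Use sys.executable instead of hardcoded python (#165633 )" above describes replacing a hardcoded "python" string with sys.executable. A minimal sketch of that pattern follows; the helper name `run_python_snippet` is hypothetical and is not taken from the PyTorch change itself.

```python
import subprocess
import sys


def run_python_snippet(code: str) -> str:
    """Run `code` in a child process using the current interpreter.

    Hypothetical helper for illustration only. Passing sys.executable
    avoids relying on a "python" entry in PATH, which may be missing or
    may point at a different Python runtime.
    """
    result = subprocess.run(
        [sys.executable, "-c", code],  # instead of ["python", "-c", code]
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    return result.stdout.strip()


if __name__ == "__main__":
    print(run_python_snippet("print(1 + 1)"))
```

Because the child is launched through the absolute path of the running interpreter, it should see the same installation and environment as the parent, which is the failure mode the commit message mentions for systems with multiple Python runtimes.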


52660 Commits