Summary:
# Context:
When a memory leak happens, it usually triggers an OOM in a later iteration, and a snapshot of the full iteration is huge and hard to interpret.
On the CUDA side, there is an OOM observer that generates a snapshot with the latest 1,500,000 entries when an OOM happens, for debugging.
In this diff, we want to implement the same feature on the MTIA side.
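For reference, the CUDA-side workflow being mirrored here looks roughly like the following sketch, using the public `torch.cuda.memory` snapshot helpers (the OOM-triggered dump itself is performed by the observer internally):
```python
import torch

# Record a bounded history of allocator events so an OOM can be debugged from a
# snapshot of the latest entries instead of a huge full-iteration trace.
torch.cuda.memory._record_memory_history(max_entries=1_500_000)

# ... run training iterations; on OOM, the observer dumps the latest entries ...

# A snapshot can also be dumped manually for inspection in the memory visualizer.
torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
```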
Test Plan:
Run this test with the last diff in the stack.
```
buck run @//mode/opt kineto/libkineto/fb/mtia/integration_tests:mtia_memory_auto_trace_test
```
As shown, the memory snapshot is generated when the OOM happens.
Log: P1794792326
Snapshot: https://fburl.com/pytorch_memory_visualizer/lx73y6s3 {F1977402355}
Differential Revision: D71993315
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152160
Approved by: https://github.com/sraikund16
Fixes #151522
This PR fixes an issue where Dynamo fails to trigger a graph break for sparse tensors in certain code paths. I added an additional check to handle this case, which resolves the original problem.
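A generic, hedged illustration of the intended behavior (not the exact repro from #151522; the function and shapes here are made up):
```python
import torch

def f(x):
    return x.sum()

compiled = torch.compile(f)

# Sparse tensors are not traceable by Dynamo; the expected behavior is a
# graph break (fall back to eager) rather than a hard failure.
s = torch.randn(8, 8).to_sparse()
print(compiled(s))
```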
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151897
Approved by: https://github.com/jansel
Fix: #135099
This PR changes how we map the original inputs to the new set of
inputs that take the tensor inputs' bases instead of their aliases.
**Problem:** in order to create this mapping, we had a dictionary that
mapped the hashed arguments into their respective indices. However, if
there's a group of equal arguments, we will have only one mapping for
such an argument. This breaks the assumption that there will be one
mapping for each argument.
**Solution:** map the hashed arguments into a list of indices. Then, we
will be able to correctly reconstruct the parameters for the new calling
convention.
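A minimal standalone sketch of the idea (not the actual code; the argument names are illustrative):
```python
# Hypothetical hashed arguments where two entries are equal (aliases of the same base).
hashed_args = ["base_tensor", "base_tensor", "scalar"]

# Before: a plain dict keeps only one index per hashed argument, so the
# second occurrence of "base_tensor" is lost.
single_index = {arg: i for i, arg in enumerate(hashed_args)}  # {"base_tensor": 1, "scalar": 2}

# After: map each hashed argument to a list of indices, so every occurrence
# can be recovered when reconstructing the new calling convention.
index_lists: dict[str, list[int]] = {}
for i, arg in enumerate(hashed_args):
    index_lists.setdefault(arg, []).append(i)  # {"base_tensor": [0, 1], "scalar": [2]}
```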
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146275
Approved by: https://github.com/bdhirsh
Although torch.cuda.Event and torch.xpu.Event have cuda_event and sycl_event fields respectively, the event_id exposed from the base class torch.Event is always 0, which can confuse users.
The memory of torch.Event is not useful to torch.cuda.Event or torch.xpu.Event, but we still need to inherit from torch.Event because CPython will check it.
Repro with CUDA:
```
>>> import torch
>>> event = torch.cuda.Event()
>>> event.cuda_event
0
>>> event.event_id
0
>>> event.record()
>>> event.cuda_event
127982096
>>> event.event_id
0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151226
Approved by: https://github.com/albanD, https://github.com/guangyey
ghstack dependencies: #151404, #151221, #151411
MemPool is a separate pool of memory handled by the caching allocator. This PR adds an option to let the caching allocator try to use this pool as a last resort instead of OOMing, by associating a use_on_oom bool with each MemPool.
Usage:
Users can optionally specify a ``use_on_oom`` bool (which is False by default) during MemPool creation. If true, then the CUDACachingAllocator will be able to use memory in this pool as a last resort instead of OOMing.
```
pool = torch.cuda.MemPool(allocator, use_on_oom=True)
with torch.cuda.use_mem_pool(pool):
    a = torch.randn(40 * 1024 * 1024, dtype=torch.uint8, device="cuda")
del a
# at the memory limit, this will succeed by using pool's memory in order to avoid the oom
b = torch.randn(40 * 1024 * 1024, dtype=torch.uint8, device="cuda")
```
Testing:
```
python test/test_cuda.py -k test_mempool_limited_memory_with_allocator
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151487
Approved by: https://github.com/eqy, https://github.com/syed-ahmed, https://github.com/ngimel
Summary: I'm investigating differences in total torch.compile overhead between our two main internal sources: dynamo_compile and pt2_compile_events. One source of discrepancy is cudagraphs overhead. Currently, we have a context manager that optionally attributes a dynamo_timed region to a cudagraph-related column logged to dynamo_compile, but _all_ dynamo_timed regions show up in pt2_compile_events (hence the discrepancy; pt2_compile_events is overcounting). We could filter out these specific events from pt2_compile_events when measuring overall overhead, but I'm going to argue that the timed regions we DO NOT consider compiler-related overhead aren't worth logging in the first place. So I'm suggesting we just remove those instances.
Here's the production job with the discrepancy:
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/3604eypl
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/c2dv8sty
Test Plan:
torchbench nanogpt:
* tlparse: https://fburl.com/h1n2ascc
* dynamo_compile: https://fburl.com/scuba/dynamo_compile/sandbox/u37yrynp
* pt2_compile_events: https://fburl.com/scuba/pt2_compile_events/s7avd0di
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152136
Approved by: https://github.com/BoyuanFeng
By instantiating it implicitly; otherwise, attempts to run something like
```
% python3 -c "import torch; print(torch.special.entr(torch.testing.make_tensor(10, dtype=torch.bool, device='mps')))"
```
will fail with
```
Failed to created pipeline state object, error: Error Domain=AGXMetalG14X Code=3 "Compiler encountered an internal error"
```
Similar in spirit to https://github.com/pytorch/pytorch/pull/149123
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152204
Approved by: https://github.com/dcci
This is a proof-of-concept of how we could serialize a guard and deserialize it back from the bytes.
The main behavioral change introduced in this diff is on CheckFunctionManager:
```
check_fn_manager = CheckFunctionManager(code, output_graph, guards_serialization_mode="save")
guards_state: bytes = check_fn_manager.guards_state
```
Once `guards_serialization_mode` is set to `save`, CheckFunctionManager will return an additional `bytes` object called `guards_state`, which should contain all the information needed for deserializing guards later.
When we load the guards state back, we set `guards_serialization_mode` to `load`:
```
output_graph_state = pickle.loads(guards_state)
check_fn_manager = CheckFunctionManager(code, output_graph_state, guards_serialization_mode="load")
```
# TENSOR_MATCH
Since we have many types of guards to support, we will break the work into small diffs instead of supporting every guard in a single diff.
We kick off the work with TENSOR_MATCH in this diff.
# Testing
For each type of guard, we will test it as follows:
1. Use guard_filter_fn to select 1 type of guard each time.
2. Call InstructionTranslator directly on an example function to get OutputGraph and CheckFunctionManager (reference guard manager)
3. Serialize->deserialize the output graph state and re-build the guards with a new CheckFunctionManager (loaded guard manager)
4. Feed a set of example inputs to both the reference and the loaded guard manager to see if their behavior matches.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151318
Approved by: https://github.com/jansel, https://github.com/anijain2305
# Motivation
We propose adding support for the Python with statement on `torch.accelerator.device_index` to enable device switching functionality. This enhancement would simplify writing device-agnostic code and provide benefits across all accelerators. Its device-specific counterparts include [`torch.cuda.device`](00199acdb8/torch/cuda/__init__.py (L482)) and [`torch.cuda._DeviceGuard`](00199acdb8/torch/cuda/__init__.py (L469)).
**Design Philosophy**
It accepts either an `int` or `None` as input. When `None` is passed, no device switch is performed. Supporting `None` is important for compatibility, as it's possible to encounter `None` values from `torch.device.index`.
Therefore, with this PR, we can do the following:
```python
import torch

src = 0
dst = 1
# Set src as the current device
torch.accelerator.set_device_index(src)
with torch.accelerator.device_index(dst):
    # Inside the with statement, dst is the current device
    assert torch.accelerator.get_device_index() == dst
# Back outside the with statement, the current device is src again
assert torch.accelerator.get_device_index() == src
```
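And a small illustration of the `None` case described in the design philosophy above (no device switch is performed):
```python
idx = torch.device("cuda").index  # index is None when no index was specified
with torch.accelerator.device_index(idx):
    # With idx == None, the current device is left unchanged.
    pass
```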
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148864
Approved by: https://github.com/albanD
Adding `torch.ops.fbgemm` to GraphPickler's allowlist. Otherwise, an fx graph module containing an `fbgemm` node will return an "Unable to pickle non-standard op" error.
The validation was done on the model, and the difference appears only in the graph name, not the node.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152079
Approved by: https://github.com/aorenste
**Summary**
For the int8 GEMM template, the micro GEMM computes in u8s8s32, and we do the scale/zero-point compensation in the epilogue. In general, it is calculated as:
```
temp = micro_gemm_output * x_scale * w_scale
temp = temp - (x_scale * w_scale * x_zp) * sum(w, 0)
```
For the case where `x_scale`, `w_scale`, and `x_zp` are constant, we can pre-calculate the compensation to save runtime computation.
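A small standalone sketch of this folding (plain PyTorch rather than the Inductor template code; shapes and scales are made up):
```python
import torch

M, N, K = 4, 8, 16
x_q = torch.randint(0, 256, (M, K), dtype=torch.int64)     # u8 activation (widened for the demo)
w_q = torch.randint(-128, 128, (K, N), dtype=torch.int64)  # s8 weight (widened for the demo)
x_scale, w_scale, x_zp = 0.02, 0.01, 128                   # constant scale / zero point

micro_gemm_output = x_q @ w_q  # integer accumulation (s32 in the real kernel)

# Epilogue as written above: compensation computed at runtime.
ref = micro_gemm_output * x_scale * w_scale - (x_scale * w_scale * x_zp) * w_q.sum(0)

# With constant x_scale / w_scale / x_zp, the compensation only depends on the
# weight, so it can be pre-calculated once and reused across runtime calls.
comp = (x_scale * w_scale * x_zp) * w_q.sum(0)
out = micro_gemm_output * (x_scale * w_scale) - comp

torch.testing.assert_close(ref, out)
```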
**Performance**
Tested with 4 cores of XEON-5 and shapes from the ViT model.
Before
```
GEMM(M=197,N=768,K=768) compile: 0.0939 ms (2.48 TOPS, 18.13 GB/s)
GEMM(M=197,N=3072,K=768) compile: 0.4275 ms (2.17 TOPS, 13.90 GB/s)
GEMM(M=197,N=768,K=3072) compile: 0.2677 ms (3.47 TOPS, 22.20 GB/s)
GEMM(M=1,N=1000,K=768) compile: 0.0148 ms (0.10 TOPS, 99.10 GB/s)
```
After
```
GEMM(M=197,N=768,K=768) compile: 0.0597 ms (3.90 TOPS, 28.53 GB/s)
GEMM(M=197,N=3072,K=768) compile: 0.2126 ms (4.37 TOPS, 27.95 GB/s)
GEMM(M=197,N=768,K=3072) compile: 0.2282 ms (4.07 TOPS, 26.04 GB/s)
GEMM(M=1,N=1000,K=768) compile: 0.0149 ms (0.10 TOPS, 98.71 GB/s)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152000
Approved by: https://github.com/Xia-Weiwen, https://github.com/CaoE, https://github.com/jansel