pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Author	SHA1	Message	Date
can-gaa-hou	b3ad8f4a9c	[BUG] Fix nonzero_static crash on CUDA when the input is an empty tensor (#162578 ) Fixes #162473 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162578 Approved by: https://github.com/ngimel	2025-09-15 05:44:15 +00:00
Edward Yang	755cf90672	Redirect all use of filesystem to c10/utils/FileSystem.h (#162914 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/162914 Approved by: https://github.com/Skylion007, https://github.com/dcci, https://github.com/cyyever	2025-09-15 04:30:41 +00:00
Nikita Shulga	76e5df3866	[BE] Use `fmt::format` to define Conv key (#162925 ) Also use `getArrayRefString` instead of having separate cases for 2D and 3D Conv Pull Request resolved: https://github.com/pytorch/pytorch/pull/162925 Approved by: https://github.com/Skylion007 ghstack dependencies: #162921	2025-09-15 02:44:12 +00:00
Nikita Shulga	7fe1f5ea49	[BE] Delete [Ventura\|Sonoma]Ops header (#162921 ) Was a temp solution to make PyTorch+MPS buildable on MacOS-12, but it's no longer needed, as in 2.9+ MPS is only supported on MacOS Sonoma+ Pull Request resolved: https://github.com/pytorch/pytorch/pull/162921 Approved by: https://github.com/Skylion007, https://github.com/dcci	2025-09-15 02:44:12 +00:00
James Wu	e156a07171	[Precompile] [RFC] Implement aot_compile_module (#162171 ) This PR adds a new interface _aot_compile to `OptimizedModule`, so that the following is possible: ``` mod = SimpleLinearModule() inputs = [ ModelInput( args=(torch.randn(3, 3),), kwargs={}, contexts=[torch.no_grad(), eval_mode(model)], ), ModelInput( args=(torch.randn(3, 3),), kwargs={}, contexts=[train_mode(model)] ), ] assert isinstance(model, torch._dynamo.eval_frame.OptimizedModule) model._aot_compile( inputs, ) ``` After this PR, you can AOT precompile NanoGPT and use it to train directly. I'll share my fork of the repo to make this work. ## ModelInput The `ModelInput` API is a work in progress; for now it represents a set of inputs and contexts to instruct the compiler to compile. Most commonly, this is "compile an eval mode with no grad, and a training mode with grad", but it also contains things like autocasting contexts, etc. ## Dispatch Dispatching is super simple here: we just iterate through all the precompiled fullgraphs and check guards for each one until there's one that passes. I'm a bit worried that having this in python code is going to be too expensive. The guard checks are happening in C++ anyway, though, so the only python bottlenecked step here is just the for loop, so perhaps the overhead will not be high. I'll work on measuring this, though. ## TODOs This PR does not support `mod.compile()`, only `torch.compile(mod)`. In order to support `mod.compile()`, we'll need to update torch.nn.Module with an updated implementation; I can add that frontend later. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162171 Approved by: https://github.com/zhxchen17	2025-09-14 23:32:28 +00:00
Isalia20	ba5ca31676	[MPS] sparse mps any (#162885 ) Add SparseMPS key for any op Pull Request resolved: https://github.com/pytorch/pytorch/pull/162885 Approved by: https://github.com/malfet, https://github.com/Skylion007	2025-09-14 18:57:53 +00:00
Isalia20	8e1db46493	[MPS] enable empty like and unsqueeze for SparseMPS (#162910 ) Enable empty like and unsqueeze for SparseMPS Pull Request resolved: https://github.com/pytorch/pytorch/pull/162910 Approved by: https://github.com/malfet, https://github.com/Skylion007	2025-09-14 17:47:06 +00:00
Edward Yang	aff2438554	QoL: add pip to requirements-build.txt (#162896 ) uv venvs by default don't come with pip, but for example setup.py assumes it is available. Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/162896 Approved by: https://github.com/Skylion007	2025-09-14 17:08:05 +00:00
Shen Zhang	3f8a2e62ea	Fix rebind_unbacked in torch.fx.experimental.symbolic_shapes (#162788 ) ## Description Fix float type handling in `rebind_unbacked` in `torch.fx.experimental.symbolic_shapes`. [#162480](https://github.com/pytorch/pytorch/issues/162480) ## Issue When I use AOTInductor to compile YOLOv10, I encounter the bug `'float' object has no attribute 'node'`. [Torch AOTInductor Ahead-Of-Time Compilation Fail](https://github.com/opendatalab/DocLayout-YOLO/issues/177) The problem is due to missing float type handling. https://github.com/pytorch/pytorch/blob/main/torch/fx/experimental/symbolic_shapes.py#L597 ``` if isinstance(u1, int): log.info( "rebind_unbacked: discard %s %s %s -> %s", n.target, raw_u0, path, u1, ) continue ``` ## Solution Change the code `if isinstance(u1, int)` to `if isinstance(u1, (int, float))` Pull Request resolved: https://github.com/pytorch/pytorch/pull/162788 Approved by: https://github.com/ezyang	2025-09-14 17:07:14 +00:00
Clark Kang	6d64bc3990	[data foundation][vizard] Prevent checking the device type of numpy object in Tensorboard logger (#162888 ) Summary: The check is introduced in D82262053 - `scalar_value` could be a numpy object - Move the check of `device.type` into `make_np` method where it happens only when it's a `torch.Tensor`. Test Plan: ``` vizard launch -j 1x8 --launch=flow --config-path=pkg://vizard_projects.image_classification.configs --config-name=resnet50 ++flow.secure_group=ml_sensors ++flow.entitlement=ai_frameworks_pnb ++max_train_steps_per_epoch=10 ++max_epochs=5 ++log_every_n_steps=10 ++profiler=null ++max_eval_steps_per_epoch=10 ``` Rollback Plan: Differential Revision: D82383428 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162888 Approved by: https://github.com/xush6528	2025-09-14 08:09:08 +00:00
angelayi	972140b7e9	[benchmark] Add HF LLM benchmarks (#156967 ) Results in https://docs.google.com/spreadsheets/d/1xXOPg9JjEmPx0zc5QBNdyXQq8-K2_r4ybHaiS-q7pZ0/edit?gid=88695043#gid=88695043 Pull Request resolved: https://github.com/pytorch/pytorch/pull/156967 Approved by: https://github.com/huydhn Co-authored-by: Huy Do <huydhn@gmail.com>	2025-09-14 07:41:06 +00:00
Thien Tran	84186c39ed	[NVRTC] Enable compiling templated kernels (#162875 ) Per NVRTC doc - https://docs.nvidia.com/cuda/nvrtc/index.html#accessing-lowered-names, we can compile a templated kernel (e.g. `kernel<float>`) with the following steps NVRTC side - (new) `nvrtcAddNameExpression` -> C++ template e.g. `f<float>` - `nvrtcCompileProgram` - (new) `nvrtcGetLoweredName` -> get mangled name. need to do a copy since later this string is freed after NVRTC program is destroyed - `nvrtcDestroyProgram` CUDA side - use mangled name instead of normal name -> profit - `extern "C"` is not even needed Pull Request resolved: https://github.com/pytorch/pytorch/pull/162875 Approved by: https://github.com/msaroufim	2025-09-14 06:17:36 +00:00
Nick Riasanovsky	74a35c6344	[Triton] [Inductor] Enable TMA store for TMA mm templates (#160480 ) Summary: Adds support for TMA store in all TMA matmul templates (notably persistent_tma including addmm and scaled_mm). This works by requiring a template be registered with `tma_store=True` and, when that is met, constructing indices/range_trees to hook into the existing code base's TMA store support. This also includes a couple of notable changes: - Adds support in the TMA template support for checking the output layout. - Adds support for "hoisting" the tensor descriptor to the top of the kernel. This is currently only used by template code, but in principle it can be generalized to other implementations. - Supports considering multiple indices as the "contiguous" index. This is handled with support for transposing the input data when the alignment is no longer consistent. In general, since the TMA support is derived from the index, it doesn't seem reasonable that the 1D index math forces a certain alignment depending on index ordering so long as the layout matches. Test Plan: Tested with test_max_autotune.py unit tests. Pull Request resolved: https://github.com/pytorch/pytorch/pull/160480 Approved by: https://github.com/NikhilAPatel	2025-09-14 04:56:49 +00:00
PyTorch UpdateBot	d2f6daf6a7	[audio hash update] update the pinned audio hash (#162892 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162892 Approved by: https://github.com/pytorchbot	2025-09-14 04:27:37 +00:00
PyTorch UpdateBot	e74b21d66a	[vllm hash update] update the pinned vllm hash (#162891 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162891 Approved by: https://github.com/pytorchbot	2025-09-14 04:27:35 +00:00
Laith Sakka	f01bf0f64b	Do not use // but use CleanDiv or FloorDiv instead (#162869 ) Summary: When rewriting sympy expressions in the compiler codebase we want to generate FloorDiv(a, b) or CleanDiv(a, b) directly and not a//b, since the latter becomes floor(a*pow(b, -1)). For symnodes we automatically handle that conversion in the symnode op dispatch. I will follow up with an issue to track all other usages of //. Block internal Model. Test Plan: add test run existing tests. dakechen1993 testing on the model. Rollback Plan: Differential Revision: D82362241 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162869 Approved by: https://github.com/ezyang	2025-09-14 01:30:33 +00:00
Ben Niu	886699bc5c	Port shared_ptr optimization in std::shared_ptr to intrusive_ptr (#162784 ) Summary: Please see D21021645 for details about the optimization and why it's beneficial. A similar change has been added to libstdc++ as well, see `dbf8bd3c2f` Rollback Plan: Reviewed By: yfeldblum Differential Revision: D81960754 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162784 Approved by: https://github.com/swolchok	2025-09-13 21:01:00 +00:00
Varun Patil	72b5159782	[flatbuffer] Fix compile error due to discarded result (#162767 ) Summary: One of our builds fails because the return value of fread is discarded. Explicit cast to void fixes the build. ```log In file included from fbcode/caffe2/torch/csrc/jit/mobile/import.cpp:15: fbcode/caffe2/torch/csrc/jit/mobile/file_format.h:156:3: error: ignoring return value of function declared with 'warn_unused_result' attribute [-Werror,-Wunused-result] 156 \| fread(data.get(), size, 1, f); \| ^~~~~ ~~~~~~~~~~~~~~~~~~~~~~ 1 error generated. ... BUILD FAILED Failed to build 'fbcode//caffe2:libtorch (cfg:opt-linux-x86_64-clang19-no-san-opt-by-default#fef256f7ee896871)' ``` Test Plan: No runtime behavior change. CI. Rollback Plan: Differential Revision: D82265002 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162767 Approved by: https://github.com/Skylion007	2025-09-13 20:24:43 +00:00
Nyakku Shigure	f37eaebed1	Add missing `tags` parameter to `custom_op` overload signatures (#162047 ) It appears to be an omission in #149782. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162047 Approved by: https://github.com/zou3519, https://github.com/BoyuanFeng Co-authored-by: Boyuan Feng <fby.1994@gmail.com>	2025-09-13 19:57:23 +00:00
PyTorch MergeBot	5b9114bf19	Revert "[ROCm/Windows] Support aotriton for scaled_dot_product_attention on Windows. (#162330 )" This reverts commit 62843c14bbf694f5722fd6e1075da4792507fe42. Reverted https://github.com/pytorch/pytorch/pull/162330 on behalf of https://github.com/atalman due to Sorry reverting looks like broke windows nightlies see https://github.com/pytorch/pytorch/issues/162881 ([comment](https://github.com/pytorch/pytorch/pull/162330#issuecomment-3288544921))	2025-09-13 15:43:50 +00:00
PyTorch MergeBot	deb7ebe0a3	Revert "[Reland] Use std::string_view in torchgen (#158625 )" This reverts commit 972e409829343cc2062aeee0994a9c1c735d216a. Reverted https://github.com/pytorch/pytorch/pull/158625 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break a couple of ExecuTorch tests for Vulkan backend ([comment](https://github.com/pytorch/pytorch/pull/158625#issuecomment-3287754275))	2025-09-13 07:52:50 +00:00
PyTorch MergeBot	9c93dc8123	Revert "Return NoOpDeviceGuardImpl in replace of CudaDeviceGuard when device is not available, or cpu-only build (#160532 )" This reverts commit a956c4ab1cb13079203a8f07eb26218724f54dc8. Reverted https://github.com/pytorch/pytorch/pull/160532 on behalf of https://github.com/huydhn due to Reverted internally ([comment](https://github.com/pytorch/pytorch/pull/160532#issuecomment-3287745165))	2025-09-13 07:42:12 +00:00
PyTorch MergeBot	31040b6357	Revert "port some distributed tensor test files for Intel GPU (#161703 )" This reverts commit 179f10621b418427fc6e92f58ea2b0bbe4cc9c52. Reverted https://github.com/pytorch/pytorch/pull/161703 on behalf of https://github.com/huydhn due to Sorry for reverting your change but these tests are failing internally ([comment](https://github.com/pytorch/pytorch/pull/161703#issuecomment-3287720713))	2025-09-13 07:22:14 +00:00
Edward Yang	aa41d3e49c	Claude loves making these files in top level, ignore them for sanity. (#162806 ) Signed-off-by: Edward Yang <ezyang@meta.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/162806 Approved by: https://github.com/albanD	2025-09-13 04:59:00 +00:00
PyTorch UpdateBot	f0fcf436c5	[audio hash update] update the pinned audio hash (#162864 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned audio hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162864 Approved by: https://github.com/pytorchbot	2025-09-13 04:17:21 +00:00
PyTorch UpdateBot	5663910472	[vllm hash update] update the pinned vllm hash (#162751 ) This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml). Update the pinned vllm hash. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162751 Approved by: https://github.com/pytorchbot	2025-09-13 04:16:51 +00:00
Xuan Zhang	da669d51bf	fusion of large accumulated reads only at ir level (#161978 ) This is to revert some of the changes in https://github.com/pytorch/pytorch/pull/158667 In particular, we only disallow fusion of large accumulate read at IR level and not at scheduler level, as users can create their own custom fusion logics for the scheduler level. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161978 Approved by: https://github.com/yf225	2025-09-13 04:07:25 +00:00
Georgia Phillips	783985e9fe	kjt pytree registration (#161114 ) Differential Revision: D80656182 Pull Request resolved: https://github.com/pytorch/pytorch/pull/161114 Approved by: https://github.com/henryoier	2025-09-13 03:57:43 +00:00
Jimmy Lu	49d30f9a23	Fix boxcox to return same result for same input in one batch (#162772 ) Summary: The SIMD path is using SLEEF version of `pow` which is slightly different from `std::pow`. The fix is to use the same vectorized code (with partial load and store) for the trailing data as well to ensure consistency between results. Rollback Plan: Differential Revision: D82265247 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162772 Approved by: https://github.com/swolchok	2025-09-13 03:57:35 +00:00
Huy Do	66133b1ab7	Build vLLM aarch64 nightly wheels (#162664 ) PyTorch has published its aarch64 nightly wheels for all CUDA versions after https://github.com/pytorch/pytorch/pull/162364 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162664 Approved by: https://github.com/atalman	2025-09-13 03:43:55 +00:00
Chen	543d50db2b	Fix torch export with dict input nested in args (#162618 ) Investigated together with @pyemma and @taotaohuang001 ## Problem When calling an exported module with a dict nested in the args tuple, it raises the following complaint ``` Traceback (most recent call last): File "/home/chzhu/infinitrain/test_torch_export.py", line 32, in <module> print(exported_model({"a2": torch.randn(10), "a1": torch.randn(10)})) File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 848, in call_wrapped return self._wrapped_call(self, *args, **kwargs) File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 424, in __call__ raise e File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 411, in __call__ return super(self.cls, obj).__call__(*args, **kwargs) # type: ignore[misc] File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1773, in _wrapped_call_impl return self._call_impl(*args, **kwargs) File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1879, in _call_impl return inner() File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1806, in inner args_kwargs_result = hook(self, args, kwargs) # type: ignore[misc] File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py", line 929, in _fn return fn(*args, **kwargs) File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_unlift.py", line 81, in _check_input_constraints_pre_hook 
flat_args_with_path = _check_inputs_match(args, kwargs, self._in_spec) File "/home/chzhu/infinitrain/build/infinitrain/environments/development-venv/lib/python3.10/site-packages/torch/export/_unlift.py", line 64, in _check_inputs_match raise ValueError( # noqa: B904 ValueError: Trying to flatten user inputs with exported input tree spec: TreeSpec(tuple, None, [TreeSpec(tuple, None, [TreeSpec(dict, ['a1', 'a2'], [*, *])]), TreeSpec(dict, [], [])]) but actually got inputs with tree spec of: TreeSpec(tuple, None, [TreeSpec(tuple, None, [TreeSpec(dict, ['a2', 'a1'], [*, *])]), TreeSpec(dict, [], [])]). Please check that the inputs have the same number and type of args and kwargs as the ones you used when tracing. ``` ## How to reproduce the issue ```python import torch # create a nn.Module with data_batch as input and output as output class MyModel(torch.nn.Module): def __init__(self): super(MyModel, self).__init__() self.linear = torch.nn.Linear(10, 1) def forward(self, data_batch): h1 = self.linear(data_batch["a1"]) h2 = self.linear(data_batch["a2"]) return h1 + h2 # torch export this module model = MyModel() example_args_forward = ( { "a1": torch.randn(10), "a2": torch.randn(10), }, ) exported_model = torch.export.export(model, example_args_forward, strict=True) # save the exported model torch.export.save(exported_model, "exported_model.pt2") # load the exported model exported_model = torch.export.load("exported_model.pt2").module() # run the exported model print(exported_model({"a2": torch.randn(10), "a1": torch.randn(10)})) ``` ## Root Cause Input spec is encoded as [TreeSpec](`582d278983/torch/utils/_pytree.py (L1059)`) in torch export, with (args, kwargs) at the top level. When we call the exported model, it has a pre-execution [hook](`582d278983/torch/export/_unlift.py (L66)`) to check that the input TreeSpec matches the received TreeSpec, where in TreeSpec, the dict key order is preserved. 
Something like TreeSpec(dict, ['a2', 'a1'], [*, *]) To work around this, the input check reorders [kwargs](`582d278983/torch/export/_unlift.py (L67)`), which is why kwargs can be out of order. But the dict nested in the args is not re-ordered, so any re-ordering of the keys will throw errors. ## Solution Update eq_spec to handle the dict case, where we only guarantee that the key set is the same without ordering constraints. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162618 Approved by: https://github.com/angelayi	2025-09-13 03:24:30 +00:00
PyTorch MergeBot	7dd5f7b125	Revert "python fastpath for DTensor detach(), confirm that aliasing DTensorSpec is ok (#160580 )" This reverts commit 4b2d297eec425475a82934a52e0edd96805524a1. Reverted https://github.com/pytorch/pytorch/pull/160580 on behalf of https://github.com/bdhirsh due to this broke shampoo, yanking ([comment](https://github.com/pytorch/pytorch/pull/160580#issuecomment-3287372891))	2025-09-13 02:04:36 +00:00
Sherlock Huang	a956c4ab1c	Return NoOpDeviceGuardImpl in replace of CudaDeviceGuard when device is not available, or cpu-only build (#160532 ) Summary: To support exporting a cuda model on a CPU-only machine under fake tensor mode. Users commonly need to move sample inputs to the cuda device with a .to("cuda:0") or .to("cuda") call. This diff supports this. I expect the following pattern to work ``` with FakeTensorMode(allow_non_fake_inputs=True): cuda_module = module.to("cuda:0") cuda_sample_inputs = tuple([x.to("cuda:0") for x in sample_inputs]) with torch.no_grad(): ep = torch.export.export(cuda_module, cuda_sample_inputs) ``` Test Plan: CI Rollback Plan: Differential Revision: D80181887 Pull Request resolved: https://github.com/pytorch/pytorch/pull/160532 Approved by: https://github.com/henryoier, https://github.com/ezyang	2025-09-13 01:50:51 +00:00
Kevin Tang	0925c644ed	[DCP] Decrease checkpoint background process Gloo pg init timeout (#162760 ) Summary: Sometimes checkpoint background process creation times out during gloo pg init. Attempting to destroy the process during that time can block the trainer thread until the timeout completes. This diff reduces the pg init timeout from 30m -> 10m to reduce the cleanup time. Test Plan: CI Rollback Plan: Differential Revision: D81724668 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162760 Approved by: https://github.com/meetv18	2025-09-13 01:50:40 +00:00
Xu Han	b2553a6ec4	[AOTI] raise PyTorchStreamWriter open failed error code on windows (#162799 ) When I debugged the AOTI UT `TestAOTInductorPackage_cpu::test_add`, I found it didn't output the verbose error code when PyTorchStreamWriter open failed. This PR adds the verbose error code output for debugging. Local test shows as below: <img width="1124" height="653" alt="image" src="https://github.com/user-attachments/assets/01cb1a51-2982-4106-8b5b-c608ac26a075" /> The error code is 32; we can check the Windows error code 32 at https://learn.microsoft.com/en-us/windows/win32/debug/system-error-codes--0-499- ``` ERROR_SHARING_VIOLATION 32 (0x20) The process cannot access the file because it is being used by another process. ``` This issue is caused by the file being opened by another process. I fixed the same issue in zip open in PR: https://github.com/pytorch/pytorch/pull/162617 But I still have no idea how to open a file with shared access in `std::ofstream`. I will continue researching it. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162799 Approved by: https://github.com/jansel	2025-09-13 01:41:14 +00:00
Parshant Sharma	a749c40342	[Bilinear] move check to reset_parameters (#160952 ) Fixes #160407 ### Summary: Moved the check to reset_parameters to make the `Bilinear` module lazy. Lazy modules have in_features initialized to 0 and a pre-forward hook that initializes these to the appropriate shape, then calls reset_parameters. ### Impact: module: nn, linear.py ### Test: <img width="903" height="182" alt="Screenshot From 2025-08-19 13-27-12" src="https://github.com/user-attachments/assets/bc04b0d6-5174-4dc9-8b21-9e019b3822a5" /> Pull Request resolved: https://github.com/pytorch/pytorch/pull/160952 Approved by: https://github.com/mikaylagawarecki	2025-09-13 01:17:10 +00:00
Nick Riasanovsky	595e13feb7	[BE] [Inductor] Update NoValidChoicesError logic (#162814 ) Summary: Updates the NoValidChoicesError logic to include some additional context on whether no choices exist or no choices compiled. Test Plan: NFC. Depending on CI. Rollback Plan: Differential Revision: D82312035 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162814 Approved by: https://github.com/mlazos	2025-09-13 00:45:50 +00:00
Xuan Zhang	ddc5107601	An improved heuristic for operator reordering for peak memory + debugging logs (#161810 ) Revisiting the idea in https://github.com/pytorch/pytorch/pull/140195 For the lpmf algorithm in the memory reorder pass, in some cases, when all the nodes that can be scheduled are quite large, it is beneficial to switch the scheduling strategy. So instead of using size as the criterion, we choose a node that can unlock more nodes to become schedulable by analyzing their successor nodes. For an internal use case, we observe up to 20 GiB memory difference and here are the before and after memory snapshot. More information can be found in [D81270682](https://www.internalfb.com/diff/D81270682) (internal only). <img width="348" height="227" alt="image" src="https://github.com/user-attachments/assets/fb71e840-1508-44ed-bc9d-5eb4d364607d" /> In addition, add the functionality to upload the graph to tlparse for offline debugging. The format of the json is in consistency with the simulator [here](https://fburl.com/code/3l3d3qi4) (internal only). Pull Request resolved: https://github.com/pytorch/pytorch/pull/161810 Approved by: https://github.com/yf225	2025-09-13 00:42:32 +00:00
FFFrog	a94ddd9b00	[OpenReg] Fix the docs of Accelerator Integration (#162826 ) ---- - Fixed the redirect link about step 1 - Formatted the autoload and added necessary links Pull Request resolved: https://github.com/pytorch/pytorch/pull/162826 Approved by: https://github.com/albanD ghstack dependencies: #161917, #161918, #160101	2025-09-12 23:53:17 +00:00
FFFrog	29f84b0f61	[OpenReg] Improve the Event and Stream capabilities of DeviceGuardImplInterface (#160101 ) Changes: - Based on `OpenRegStream` and `OpenRegEvent`, we improve the implementation of Device Guard for `OpenReg` - Add some related testcases Pull Request resolved: https://github.com/pytorch/pytorch/pull/160101 Approved by: https://github.com/albanD ghstack dependencies: #161917, #161918	2025-09-12 23:53:17 +00:00
FFFrog	27daa6af6a	[OpenReg] Strengthen Openreg's execution limits to minimize the waste of computing resources (#161918 ) Currently, OpenReg supports Linux, Windows, and OS X, ensuring stability and ease of integration with third-party devices across all three platforms. It also doesn't rely on any other accelerators (such as CUDA or MPS). Therefore, to minimize computational resource usage, `test_openreg` can be added to certain BLOCKLISTS to prevent its execution, limiting OpenReg's execution to only necessary scenarios. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161918 Approved by: https://github.com/albanD ghstack dependencies: #161917	2025-09-12 23:53:17 +00:00
FFFrog	9b429846e8	[OpenReg] Migrate OpenReg Tests from tests/test_openreg.py into torch_openreg/tests (#161917 ) Background: Almost all the tests in `test/test_openreg.py` are designed for `torch_openreg`, so placing these testcases in the test directory is not a good idea. Instead, they should be moved to the `tests` directory under `torch_openreg`, coordinating these tests with their corresponding functional logic. How to do: So how do we verify the quality of the third-party device integration mechanism? We will maintain a `test_openreg` entrypoint in `test/run_test.py`. This entrypoint will install `torch_openreg` and run all the testcases located in `torch_openreg`. As long as all testcases pass, we can guarantee that the out-of-tree backend integration mechanism is available. Next: We will also improve `torch_openreg's` test coverage in the future. Pull Request resolved: https://github.com/pytorch/pytorch/pull/161917 Approved by: https://github.com/albanD	2025-09-12 23:53:17 +00:00
PyTorch MergeBot	cdfa298a3b	Revert "[MTIA Runtime] Add foreach_div ops to native_functions.yaml (#162732 )" This reverts commit a3f01f6418667f791f36d928f7e912eb89be2e67. Reverted https://github.com/pytorch/pytorch/pull/162732 on behalf of https://github.com/huydhn due to Reverted internally ([comment](https://github.com/pytorch/pytorch/pull/162732#issuecomment-3287163750))	2025-09-12 23:52:43 +00:00
Nikita Shulga	d25c35d2b2	[MPS] Fix `[nan]median` output for empty tensors (#162846 ) It should be `NaN` rather than 0 Added respective checks to `test_empty_tensor` Fixes https://github.com/pytorch/pytorch/issues/162798 Pull Request resolved: https://github.com/pytorch/pytorch/pull/162846 Approved by: https://github.com/dcci	2025-09-12 22:26:29 +00:00
Dmitry Rogozhkin	ee53ad2dd0	xpu: test py_limited_api with SyclExtension (#162546 ) Commit extends existing CUDA test to cover XPU SyclExtension case for the same feature - `py_limited_api`. Commit required a fix for xpu to install some Aten header files (#145902) which got resolved after the merge of #159621. See: https://github.com/pytorch/pytorch/issues/145902 Requires: https://github.com/pytorch/pytorch/pull/159621 Requires: https://github.com/intel/torch-xpu-ops/pull/1743 CC: @guangyey, @EikanWang Pull Request resolved: https://github.com/pytorch/pytorch/pull/162546 Approved by: https://github.com/guangyey, https://github.com/EikanWang, https://github.com/janeyx99	2025-09-12 21:57:01 +00:00
Haifeng Jin	0dcd9304aa	fix high=0 bug in nll_loss test (#162763 ) Minor bug fix for the `nll_loss` test. Before this PR, it runs `torch.randint(high=0)`, which will fail because it would try to generate a number that is >= low and < high, i.e. x>=0 and x<0. The test did not fail because that line is not run when testing on CPU: the test fails earlier because of an unsupported dtype. However, as we support TPUs at Google, this line is reached before the dtype check, which triggers the bug. To my understanding, these OpInfos should be general enough to support different hardware. Fixing this obvious bug would make them more general across different hardware. Pull Request resolved: https://github.com/pytorch/pytorch/pull/162763 Approved by: https://github.com/soulitzer	2025-09-12 21:48:18 +00:00
Ruben Rodriguez Buchillon	25f1a5d8d1	[inductor][ez] add src_hash property for Templates (#161468 ) # why enable caching/overriding/filtering based on src hash later # what - KernelTemplate has a src_hash that is None by default - sha256 on TritonTemplate of the template src code - None on ExternKernelChoice to have same API # testing n/a (not in use in this change) Differential Revision: [D81821149](https://our.internmc.facebook.com/intern/diff/D81821149) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161468 Approved by: https://github.com/eellison ghstack dependencies: #161351, #161350, #162293	2025-09-12 21:10:45 +00:00
Ruben Rodriguez Buchillon	269c9907a0	[inductor][choices] rename get_mm_configs to get_template_configs (#162293 ) # why - eventually we want all templates to go through this - we're exposing this through diode as a sort of interface/API - avoid later renaming # what - rename get_mm_configs to get_template_configs - rename _finalize_mm_configs to _finalize_template_configs # testing - lintrunner - ci Differential Revision: [D81820641](https://our.internmc.facebook.com/intern/diff/D81820641) Pull Request resolved: https://github.com/pytorch/pytorch/pull/162293 Approved by: https://github.com/eellison ghstack dependencies: #161351, #161350	2025-09-12 21:10:45 +00:00
Ruben Rodriguez Buchillon	a326ef37e6	[inductor] leverage template stacking in V.choices.get_mm_configs (#161350 ) # why - now everything is in place to just gather templates and run the V.choices.get_mm_configs once per op - enables any overrides inside V.choices.get_mm_configs to have a full view of the options for an op, not just for one template # what - replace multiple calls to V.choices.get_mm_configs with calls to gather the active templates, and then using those in a single call # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D81520571](https://our.internmc.facebook.com/intern/diff/D81520571) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161350 Approved by: https://github.com/eellison, https://github.com/jansel ghstack dependencies: #161351	2025-09-12 21:10:38 +00:00
Ruben Rodriguez Buchillon	cdb2d1838a	[inductor] FlexibleLayout for ExternKernelChoice for mms (#161351 ) # why - if we only use ExternKernelChoice we're not doing any codegen - if we're not doing any codegen, we can use a FlexibleLayout here, and provide deeper passes more chances to change it # what - if all the kernel template choices (KTC) are with an ExternKernelChoice template, we switch to a FlexibleLayout before generating the choice - add a test to make sure that works as intended (FlexibleLayout for only extern, and FixedLayout if Triton is involved) - caveats: - because CPP, CUTLASS, and CK are not using V.choices.get_mm_configs yet, we turn off the optimization if any of those backends is in use. This will be relaxed once they support this too - because Triton templates are still using their own calls (not a single call) to get_mm_configs, it's also turned off there. The next diff unifies Triton + ATEN to a single call to get_mm_configs and that in turn allows the optimization there too # testing ``` python3 -bb -m pytest test/inductor/test_max_autotune.py -v ``` Differential Revision: [D81520584](https://our.internmc.facebook.com/intern/diff/D81520584) Pull Request resolved: https://github.com/pytorch/pytorch/pull/161351 Approved by: https://github.com/eellison, https://github.com/jansel	2025-09-12 21:10:31 +00:00
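
Note on 543d50db2b (#162618): the fix is described as updating eq_spec so that dict nodes are compared by key set rather than key order. A minimal sketch of that idea follows, assuming a simplified stand-in for a tree spec. `Spec`, `to_spec`, and this `eq_spec` are hypothetical names written for illustration; they are not the code in `torch/utils/_pytree.py` or `torch/export/_unlift.py`.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional


@dataclass
class Spec:
    """Hypothetical, simplified stand-in for a pytree spec node."""
    kind: Optional[type]                  # dict, tuple, list, or None for a leaf
    context: Optional[List[Any]] = None   # ordered dict keys; None for other nodes
    children: List["Spec"] = field(default_factory=list)


def to_spec(obj: Any) -> Spec:
    """Build a Spec from nested tuples, lists and dicts (illustrative only)."""
    if isinstance(obj, dict):
        return Spec(dict, list(obj.keys()), [to_spec(v) for v in obj.values()])
    if isinstance(obj, (tuple, list)):
        return Spec(type(obj), None, [to_spec(v) for v in obj])
    return Spec(None)  # leaf


def eq_spec(a: Spec, b: Spec) -> bool:
    """Compare two specs, ignoring key order on dict nodes."""
    if a.kind is not b.kind:
        return False
    if a.kind is dict:
        # Only the key set has to match; children are paired up by key.
        if len(a.context) != len(b.context) or set(a.context) != set(b.context):
            return False
        b_by_key = dict(zip(b.context, b.children))
        return all(
            eq_spec(child, b_by_key[key])
            for key, child in zip(a.context, a.children)
        )
    # Tuples and lists stay positional, so order still matters here.
    if a.context != b.context or len(a.children) != len(b.children):
        return False
    return all(eq_spec(x, y) for x, y in zip(a.children, b.children))


traced = to_spec(({"a1": 1.0, "a2": 2.0},))   # key order seen when tracing
called = to_spec(({"a2": 2.0, "a1": 1.0},))   # key order seen at call time
print(eq_spec(traced, called))
```

In this sketch the strict dataclass comparison `traced == called` still treats the two specs as different, which mirrors the order-sensitive check that produced the ValueError quoted in the commit message.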


93017 Commits