When `with_export=True`, `aot_export_joint_with_descriptors` should take the graph produced by `_dynamo_graph_capture_for_export`.
```
python test/functorch/test_aot_joint_with_descriptors.py -k test_preserve_annotate_simple
python test/functorch/test_aot_joint_with_descriptors.py -k test_preserve_annotate_flex_attention
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165660
Approved by: https://github.com/yushangdi
Improve FakeTensor cache to handle SymNode and tracing properly.
For now, when we're proxy tracing, just don't bother caching operations that have SymNodes in the output. The problem is that the proxy tracer relies on SymNode identity and our cache doesn't preserve it. It can be fixed (I left some notes in _validate_symbolic_output_for_caching() on how), but it's not worth it for now.
If we aren't proxy tracing then caching is fine.
Thus these changes:
1. Our cache key needs to include whether we were actively tracing or not - this way, if we create a cache entry when we weren't tracing and then try to use it when we ARE tracing, the operation gets rerun.
2. If there's a SymNode in the output, bypass the cache.
3. Some general cleanup of the output validation - we were unnecessarily doing it as a two-step process when it could just be a single step (it's still two parts internally but only a single outer try/except).
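A minimal sketch of points (1) and (2), with hypothetical helper names rather than the actual FakeTensor cache code:
```python
import torch
from torch.utils._pytree import tree_leaves

def _output_has_symbolic(output) -> bool:
    # SymInt/SymFloat/SymBool wrap SymNodes, whose identity matters to the proxy tracer.
    return any(
        isinstance(leaf, (torch.SymInt, torch.SymFloat, torch.SymBool))
        for leaf in tree_leaves(output)
    )

def cached_dispatch(cache, make_key, run_op, op, args, kwargs, proxy_tracing: bool):
    # (1) The tracing state is part of the key, so an entry created while not
    #     tracing is never reused while tracing (it simply misses and reruns).
    key = (make_key(op, args, kwargs), proxy_tracing)
    if key in cache:
        return cache[key]
    out = run_op(op, args, kwargs)
    # (2) While proxy tracing, don't cache outputs containing symbolic values:
    #     a later hit would hand the tracer SymNodes it has never seen.
    if not (proxy_tracing and _output_has_symbolic(out)):
        cache[key] = out
    return out
```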
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164718
Approved by: https://github.com/bobrenjc93
ghstack dependencies: #165266, #164717
In a training library we hit a weird conflict between dtensor, dynamic shapes, and proxy tensor.
The problem occurs because in sharding_prop we use FakeTensors to compute an operation's output size (so we don't have to use the full "real" data). We turn off proxy tracing while we're doing that because we don't want the FakeTensor ops to end up in the graph. We then use that size in later operations.
Normally this is no problem - but when those sizes are dynamic shapes we have a problem: the proxy tracer wants to track the provenance of all shape operations (`s1*s2`), but since tracing is disabled it doesn't see the operation, so when we use the resulting shape later the proxy tracer gets confused (the SymNode appeared out of nowhere).
At first we considered never disabling shape tracing - but that caused a slew of other downstream problems (lots of code actually needs shape tracing to be disabled), so instead we add a "sym tracing override": when we surgically disable proxy tracing, we leave shape tracing enabled.
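Schematically (hypothetical names, not the real proxy_tensor/sharding_prop hooks):
```python
import contextlib
from dataclasses import dataclass

@dataclass
class TracingState:  # hypothetical stand-in for the real proxy-tracing state
    proxy_tracing: bool = True
    sym_tracing_override: bool = False

@contextlib.contextmanager
def fake_prop_no_proxy(state: TracingState):
    # Turn off tensor-op proxy tracing so the FakeTensor propagation in
    # sharding_prop doesn't land in the graph, but leave the "sym tracing
    # override" on so shape arithmetic (e.g. s1*s2) is still recorded and the
    # resulting SymNode has known provenance when the size is used later.
    prev = (state.proxy_tracing, state.sym_tracing_override)
    state.proxy_tracing, state.sym_tracing_override = False, True
    try:
        yield
    finally:
        state.proxy_tracing, state.sym_tracing_override = prev
```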
After this change the dtensor embedding is "fixed" but then runs afoul of a FakeTensor cache bug - which is fixed in the next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164717
Approved by: https://github.com/bobrenjc93, https://github.com/ezyang
ghstack dependencies: #165266
Moving some code around in proxy_tensor in preparation for the next PR. There are
no actual changes (other than simple relabeling such as `self.tracer` ->
`tracer`):
- Move _compute_proxy() out of ProxyTorchDispatchMode.
- Give `sympy_expr_tracker` a structured type instead of `object`.
- Split SymNode registration out of ProxyTorchDispatchMode.__sym_dispatch__() so
it can be reused.
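For illustration only, a structured entry along these lines (hypothetical fields) is what "structured type instead of `object`" is getting at:
```python
from dataclasses import dataclass

import sympy
from torch.fx import Proxy

@dataclass
class SymExprEntry:
    # Hypothetical structured entry: the tracked sympy expression and the
    # proxy that produced it, instead of an untyped `object` placeholder.
    expr: sympy.Expr
    proxy: Proxy
```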
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165266
Approved by: https://github.com/ezyang, https://github.com/mlazos
While enabling this test we discovered a lack of support for sub meshes. Added limited support
for sub meshes by properly computing rank coordinates for a given sub mesh. The implementation
follows a similar approach to collectives. We infer all sub meshes for the given dimensions and
compute each rank's coordinates with respect to its sub mesh.
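Conceptually the coordinate computation is just an unravel of the rank's linear position within its own sub mesh (a simplified sketch, not the DeviceMesh implementation):
```python
def submesh_coordinates(rank: int, sub_mesh_ranks: list[int], sub_mesh_shape: tuple[int, ...]):
    """Return `rank`'s coordinates within the given sub mesh, or None if it isn't a member."""
    if rank not in sub_mesh_ranks:
        return None                      # rank belongs to a different sub mesh
    idx = sub_mesh_ranks.index(rank)     # linear (row-major) position inside the sub mesh
    coords = []
    for dim_size in reversed(sub_mesh_shape):
        coords.append(idx % dim_size)
        idx //= dim_size
    return tuple(reversed(coords))

# e.g. a 2x2 sub mesh made of global ranks [4, 5, 6, 7]:
# submesh_coordinates(6, [4, 5, 6, 7], (2, 2)) == (1, 0)
```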
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165596
Approved by: https://github.com/ezyang
Summary:
Currently `get_c2_fbandroid_xplat_compiler_flags()` reads the `caffe2.strip_glog` buckconfig, which we want to get rid of.
This diff removes the `fbandroid_compiler_flags` arg and merges it into `compiler_flags` using a nested select and the select version of the method.
The goal is to get rid of all usages of `get_c2_fbandroid_xplat_compiler_flags()` so that we can get rid of the `caffe2.strip_glog` buckconfig.
Test Plan: CI
Differential Revision: D84626885
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165558
Approved by: https://github.com/malfet
Summary: Refactor the `scaled_mm` Inductor template to support template choice based on scaling mode. This sets up the infrastructure for adding new templates for new scaling modes, such as deepseek-style scaling (a follow-up diff), since the newer scaling modes (deepseek, block, group) scale before accumulation (as opposed to per-tensor and per-row scaling, which apply scaling after accumulation). It also enables Inductor to infer the scaling type from the shapes of the scale tensors, making the existing infrastructure more extensible to new scaling modes.
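A rough sketch of the shape-based inference (simplified and hypothetical; the actual logic lives in the Inductor template and covers both operands):
```python
import torch

def infer_scaling_mode(a: torch.Tensor, scale_a: torch.Tensor) -> str:
    """Guess the scaling mode for operand `a` (M x K) from the scale tensor's shape."""
    m, k = a.shape
    if scale_a.numel() == 1:
        return "per-tensor"                  # scaling applied after accumulation
    if tuple(scale_a.shape) in ((m,), (m, 1)):
        return "per-row"                     # scaling applied after accumulation
    if scale_a.ndim == 2 and scale_a.shape[0] == m and k % scale_a.shape[1] == 0:
        return "group/block"                 # scaling applied before accumulation
    raise ValueError(f"unrecognized scale shape {tuple(scale_a.shape)} for input {tuple(a.shape)}")
```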
Test Plan:
```
TORCHINDUCTOR_CACHE_DIR=~/personal/cache_dir_inductor CUDA_LAUNCH_BLOCKING=1 TORCH_USE_CUDA_DSA=1 TRITON_PRINT_AUTOTUNING=1 TRITON_ALWAYS_COMPILE=1 TORCH_LOGS=+inductor TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 ENABLE_PERSISTENT_TMA_MATMUL=1 TORCHINDUCTOR_MAX_AUTOTUNE_GEMM=1 buck2 run mode/{opt,inplace} pytorch/tritonbench:run -- --op fp8_gemm --only torch_fp8_gemm,pt2_fp8_gemm --metrics tflops,accuracy --m 256 --n 768 --k 512 --output="/home/jananisriram/personal/random_bench.csv" --scaling_rowwise --atol=20 --rtol=2 2>&1 | tee ~/personal/random.log
```
Differential Revision: D83591083
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164318
Approved by: https://github.com/drisspg, https://github.com/slayton58
Adding ag+mm support for the case when gather_dim is the last dim of the matmul (the reduction dim).
When we decompose the matmul by the reduction dimension we end up with partials that need an additional reduction, so we allocate memory for an accumulator.
Decomposition should not produce small (thin) matmuls that cannot efficiently load the GPU, so we limit the minimal shard size to 1024 (found empirically by testing in torchtitan).
scaled_mm is not supported yet for this case.
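A simplified single-process illustration of the decomposition (the helper and its exact form are hypothetical; the 1024 threshold is the one mentioned above):
```python
import torch

def ag_mm_over_reduction_dim(a_shards: list[torch.Tensor], b: torch.Tensor,
                             min_shard_size: int = 1024) -> torch.Tensor:
    """a_shards: per-rank shards of A along K (the reduction dim); b: K x N."""
    k_shard = a_shards[0].shape[-1]
    if k_shard < min_shard_size:
        # Thin matmuls can't load the GPU efficiently; fall back to gathering
        # the full A and doing a single matmul instead of decomposing.
        return torch.cat(a_shards, dim=-1) @ b
    m, n = a_shards[0].shape[0], b.shape[1]
    acc = torch.zeros(m, n, dtype=a_shards[0].dtype, device=b.device)  # accumulator for partials
    for i, a_i in enumerate(a_shards):
        b_i = b[i * k_shard:(i + 1) * k_shard]  # matching K-slice of B
        acc += a_i @ b_i                        # partial result, reduced into the accumulator
    return acc
```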
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163068
Approved by: https://github.com/ngimel
AOTriton uses prebuilt runtime binaries if the user's ROCm version matches the ones used to generate the prebuilt runtime. However, since there's no prebuilt runtime available for Windows, this check needs to be bypassed for Windows. This PR does so by changing the condition to always build the AOTriton runtime from source on Windows.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165538
Approved by: https://github.com/xinyazhang, https://github.com/jeffdaily
Fixes #158232
The autocast caching heuristic in `aten/src/ATen/autocast_mode.cpp:139` did not account for gradient mode state when deciding whether to cache. FSDP2 is not directly related.
~~This PR adds a `GradMode::is_enabled()` check to the caching condition. Caching is now disabled in `no_grad()` contexts to prevent storing tensors with an incorrect gradient state. This ensures correctness at the cost of cache hits.~~
This PR proposes separate caches for gradient-enabled and gradient-disabled modes.
Adds tests.
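A minimal repro sketch of the failure mode being guarded against (assuming CPU autocast with bf16; the point is only that a weight cast cached under `no_grad()` must not be reused when gradients are needed):
```python
import torch

lin = torch.nn.Linear(8, 8)
x = torch.randn(2, 8)

with torch.autocast("cpu", dtype=torch.bfloat16):
    with torch.no_grad():
        lin(x)          # the bf16 weight cast may be cached here, with grad disabled
    out = lin(x)        # same cast requested again, now with grad enabled; with a
                        # shared cache the no-grad cast could be reused and break
                        # the backward graph
    out.sum().backward()
```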
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165068
Approved by: https://github.com/ngimel, https://github.com/janeyx99
Add mx fp4 support in Blas.cpp.
Updated the scale_kernel_dispatch array and ScaledGemmImplementation enum to include MXFP4 support.
Modify the tests under test_scaled_matmul_cuda accordingly.
```
PYTORCH_TEST_WITH_ROCM=1 python test/test_scaled_matmul_cuda.py -v -k test_blockwise
```
115 tests passed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165528
Approved by: https://github.com/jeffdaily
By adding a few small helpers (e.g., a `splice` method to `_MeshLayout`, and making `_init_process_groups` static and thus stateless) we can substantially shorten the definition of the unflatten method, and help readability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165556
Approved by: https://github.com/fduwjj
ghstack dependencies: #165554, #165555
The refactoring of DeviceMesh is heavily constrained by the signature of its constructor, which is a public API containing some "legacy" concepts we'd love to get rid of, such as an explicit/materialized `mesh` Tensor.
In other languages the solution to this would be to add a private overload of the constructor. Python doesn't natively allow this, but in this PR I managed to build something that approximates it.
This new private constructor basically only takes `_layout`, `_global_rank_permutation`, and `mesh_dim_names`.
With such a constructor we can effectively simplify a lot of callsites and get rid of the `_create_mesh_from_ranks` helper method. That's a good thing because it was instantiating many DeviceMeshes in a for loop, which always felt unnecessary.
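The general pattern being approximated looks roughly like this (a generic sketch with a hypothetical `_from_layout` name, not the actual DeviceMesh code):
```python
class DeviceMesh:
    def __init__(self, device_type, mesh, *, mesh_dim_names=None):
        # Public constructor: keeps the legacy signature with a materialized mesh.
        ...

    @classmethod
    def _from_layout(cls, _layout, _global_rank_permutation, mesh_dim_names=None):
        # "Private overload": bypass __init__ and populate only the new internals.
        self = cls.__new__(cls)
        self._layout = _layout
        self._global_rank_permutation = _global_rank_permutation
        self._mesh_dim_names = mesh_dim_names
        return self
```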
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165555
Approved by: https://github.com/fduwjj, https://github.com/fegin
ghstack dependencies: #165554
Bucketing of multiple dtypes to be processed in one bucketed collective.
The first target is to bucket bf16 and f32, but it can already be used with other dtypes.
For now, multi-dtype bucketing is only supported in "custom_ops" mode.
The non-custom_ops path needs additional work on the Inductor side.
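The gist of multi-dtype bucketing is byte-level packing so mixed dtypes can share one copy-style collective such as all_gather (a hand-written sketch, not the custom op itself):
```python
import torch

def pack_bucket(tensors: list[torch.Tensor]) -> torch.Tensor:
    # View each tensor's data as raw bytes and concatenate, so bf16 and f32
    # tensors can travel in a single collective.
    return torch.cat([t.contiguous().view(-1).view(torch.uint8) for t in tensors])

def unpack_bucket(bucket: torch.Tensor, metas: list[tuple[torch.dtype, torch.Size]]) -> list[torch.Tensor]:
    out, offset = [], 0
    for dtype, shape in metas:
        nbytes = dtype.itemsize * shape.numel()
        # clone() makes the byte slice safe to reinterpret; a real implementation
        # would lay chunks out with proper alignment instead of copying.
        chunk = bucket[offset:offset + nbytes].clone()
        out.append(chunk.view(dtype).view(shape))
        offset += nbytes
    return out

# Round-trip example:
# a, b = torch.randn(4, dtype=torch.bfloat16), torch.randn(2, 2)
# a2, b2 = unpack_bucket(pack_bucket([a, b]), [(a.dtype, a.shape), (b.dtype, b.shape)])
```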
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162470
Approved by: https://github.com/eellison
Adding bf16 support for the backward pass of `torch._fake_quantize_learnable_per_tensor_affine()`.
Note that for testing, we modified the seed to avoid increasing the tolerance due to cases where differences between Python and C++ downcasting cause tensor mismatches (e.g. 27.87704 vs 27.8408 before downcasting, 27.7500 vs 27.8750 after downcasting for the Python vs C++ op).
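A hedged usage sketch of the new path (the CUDA device and the dtype combination here are assumptions, not requirements stated in this PR):
```python
import torch

x = torch.randn(16, dtype=torch.bfloat16, device="cuda", requires_grad=True)
scale = torch.tensor([0.1], device="cuda", requires_grad=True)
zero_point = torch.tensor([0.0], device="cuda", requires_grad=True)

y = torch._fake_quantize_learnable_per_tensor_affine(
    x, scale, zero_point, 0, 255, 1.0  # quant_min, quant_max, grad_factor
)
y.sum().backward()  # exercises the bf16 backward path added here
print(x.grad.dtype, scale.grad.item(), zero_point.grad.item())
```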
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165362
Approved by: https://github.com/andrewor14
Use linux.c7i.2xlarge as the default runner for the _linux-build.yml workflow. In testing we found that switching from c5 to c7i gives 15-20% faster build times despite c7i costing 5% more. This should reduce the cost of jobs using _linux-build.yml.
Relates to pytorch/test-infra#7175.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164747
Approved by: https://github.com/atalman
The goal of this PR is to avoid storing the explicit `mesh` Tensor inside each DeviceMesh, and instead compute it on-the-fly when the end user needs it, and try to replace all of its internal usages with `_layout` and the newly-introduced `_global_rank_permutation` Tensor. The name of this attribute is up for debate. The advantage of the `_global_rank_permutation` Tensor is that it is _the same_ Tensor for the root mesh and all its children, so it doesn't need to be copied/reallocated.
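Schematically, the materialized `mesh` can be recovered on demand from the layout plus the shared permutation (hypothetical helper, not the actual DeviceMesh property):
```python
import itertools
import torch

def materialize_mesh(global_rank_permutation: torch.Tensor,
                     sizes: tuple[int, ...], strides: tuple[int, ...],
                     offset: int = 0) -> torch.Tensor:
    # Walk the layout (sizes/strides) and gather the corresponding entries of
    # the shared rank-permutation tensor; the permutation itself is never
    # copied or reallocated per mesh.
    flat = [
        offset + sum(c * s for c, s in zip(coord, strides))
        for coord in itertools.product(*(range(n) for n in sizes))
    ]
    return global_rank_permutation[torch.tensor(flat)].reshape(sizes)
```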
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165554
Approved by: https://github.com/fduwjj
I'm cleaning this PR up as a proper way of disabling functionalization via config in AOTDispatcher. I removed the non-functionalization related changes from the original version:
(1) preventing proxy mode (and functionalization) from incorrectly decomposing CIA ops (Ed has a PR for it here: https://github.com/pytorch/pytorch/pull/164939)
(2) preventing python-dispatcher-based decomps above autograd from running. I'm not doing this for now; I will likely do it in a follow-up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164577
Approved by: https://github.com/ezyang
ghstack dependencies: #165372
Commit f4d8bc46c7706f872abcb4ec41f0b32207d5d826 added TF32 support for x86 CPUs,
which causes build failures on PowerPC systems with mkldnn.
This patch disables TF32 paths on PowerPC while keeping x86 TF32 support intact,
allowing PyTorch to build successfully on PowerPC.
I have run the mkldnn tests on PowerPC, and they passed successfully.
```
pytest test/test_mkldnn.py
87 passed, 2 skipped in 1709.02s (0:28:29)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163454
Approved by: https://github.com/jgong5, https://github.com/malfet