Compare commits

...

2508 Commits

Author SHA1 Message Date
e1b3219c81 Update and rename gpu_test.py to api_test.py 2024-10-02 09:26:12 -07:00
f2203b6876 [mosaic_gpu] Add a basic unit test
Summary:
Forked from 611ad63060/tests/mosaic/gpu_test.py
2024-09-19 10:12:29 -07:00
8f891f4017 [mosaic_gpu] Start a repo for exploring PyTorch - Mosaic GPU integration 2024-09-19 09:56:11 -07:00
803ce507f1 Log structured logging overhead to dynamo compile (kinda) (#136142)
Summary:
X-link: https://github.com/pytorch/benchmark/pull/2454

This adds structured logging overhead on a per-compile basis to compilation metrics.

To do so, we track the frame_id_frame_compile_id that trace_structured uses to categorize compiles, and use that as the key in our timing table.

Implementation notes:
- If we call trace_structured without a compile id, the time won't be measured. There's not really a good way around that today, given the compile-id framework of compilation metrics. Strobelight is still the best way to measure on a per-job basis.
- We don't actually measure the time it takes to log the compilation metrics themselves. Fundamentally, it's not possible to log this properly if we're storing the logging number *in* compilation metrics, since there's no way to measure it before we do it (unless we want discrepancies between dynamo_compile and tlparse, which seems suboptimal). Hopefully, for a large job, the cost of structured-logging the compilation metrics themselves is small.
- I wanted to use frame_phase_timing here, but there's a bunch of ids to iron out, and I don't really want to deal with that headache. compilation_time_metrics is sort of what I want, but that isn't keyed by frame/compile id, so it's also a bit off. Putting it into torch.logging as a separate thing so logging tracks its own overhead seems fine, though. A rough sketch of the idea follows this list.
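
A rough sketch of the accounting described above (the helper and table names are hypothetical, not the actual torch._logging implementation):

```python
import time
from collections import defaultdict

# Accumulate wall-clock time spent emitting structured logs, keyed by compile id.
structured_logging_overhead = defaultdict(float)

def timed_trace_structured(compile_id, emit, *args, **kwargs):
    start = time.time()
    try:
        return emit(*args, **kwargs)
    finally:
        if compile_id is not None:  # without a compile id, the time is not attributed
            structured_logging_overhead[compile_id] += time.time() - start
```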

Test Plan:
Run benchmarks/nanogpt and staging logger. See that the new compilation metric is logged to the staged dynamo_compile table:

https://fburl.com/scuba/logger_staging_jjwu_30582a48f1ff9cf5f4ac50a4c40af/xazjg5xq

Note that sum(structured_logging_overhead_s) / sum(entire_frame_compile_time) = 8.387 / 124.278 ≈ 6.7%, which seems reasonable as the overhead for a small compilation like this.

You can also look at samples for a more detailed log of this.

Reviewed By: oulgen

Differential Revision: D62643611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136142
Approved by: https://github.com/bobrenjc93
2024-09-19 16:11:38 +00:00
65df26f615 [FSDP2] Fixed 2D mismatched grad placements (#136237)
```
CUDA_VISIBLE_DEVICES=2,3,6,7 pytest test/distributed/_composable/test_composability/test_2d_composability.py -k test_train_parity_2d_transformer
```

Differential Revision: [D62964658](https://our.internmc.facebook.com/intern/diff/D62964658)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136237
Approved by: https://github.com/weifengpy
2024-09-19 14:35:15 +00:00
4ea741d24f Revert "Reland D62220158 (#136213)"
This reverts commit 083c9149b75cd918b6fb2795050d7173923a3629.

Reverted https://github.com/pytorch/pytorch/pull/136213 on behalf of https://github.com/jeanschmidt due to Seems to have introduced regressions in rocm signals ([comment](https://github.com/pytorch/pytorch/pull/136213#issuecomment-2360885064))
2024-09-19 12:44:54 +00:00
bce52d0b60 [CODEMOD][caffe2] use npt.NDArray instead of np.ndarray in type annotations (#136288)
Summary:
To facilitate the PSS-2 upgrade, this uses `npt.NDArray` instead of `np.ndarray` in type annotations. In Numpy-1.19 (PSS-1) it's an alias to `np.ndarray` -- a noop.
In Numpy-1.24, `npt.NDArray` is a proper generic type, and without this change uses of `np.ndarray` generate this Pyre type error:
```counterexample
 Invalid type parameters [24]: Generic type `np.ndarray` expects 2 type parameters.
```
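
For illustration, the annotation style this codemod moves to (the function itself is hypothetical):

```python
import numpy as np
import numpy.typing as npt

# Under numpy >= 1.24, the generic npt.NDArray form type-checks, whereas a bare
# np.ndarray annotation triggers the Pyre error quoted above.
def normalize(x: npt.NDArray[np.float64]) -> npt.NDArray[np.float64]:
    return x / np.linalg.norm(x)
```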

Test Plan: Sandcastle plus visual inspection

Differential Revision: D62977370

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136288
Approved by: https://github.com/kit1980
2024-09-19 12:40:36 +00:00
908a5689eb Return unsafe_view instead of view from matmul when folding occurs (#134568)
When tensor folding occurs during a matmul operation, the returned tensor is a view. This can cause issues when matmul is used inside a custom function and such a view is then returned as output: it can no longer be modified in place, which causes errors.
This is especially problematic when an in-place allreduce is performed after such a function.
The issue is resolved by returning unsafe_view from matmul instead. This aligns the matmul decomposition with the eager implementation in such a way that a non-view tensor is returned.

Test included in this PR reproduces the issue.
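
A rough illustration of the folding case (not the PR's regression test):

```python
import torch

# A 3D @ 2D matmul takes the folding path: batch dims are flattened for the mm.
a = torch.randn(2, 3, 4)
b = torch.randn(4, 5)
out = torch.matmul(a, b)

# Eager returns a non-view result (via unsafe_view); the decomposition now matches
# this, so the output can safely be mutated in place (e.g. by an allreduce).
assert not out._is_view()
out.add_(1.0)
```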

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134568
Approved by: https://github.com/zou3519
2024-09-19 11:52:16 +00:00
db80b98ec4 XFAIL test_segfault (#136252)
Fixes https://github.com/pytorch/pytorch/issues/128551

As this has been failing in trunk for a while and there is no owner yet to fix it properly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136252
Approved by: https://github.com/andrewkho
2024-09-19 04:17:06 +00:00
775517693a Add type checks for Tensor.add_ (#135864)
Fixes  #127049

There's already a meta func in `meta_registrations.py` for the `add_` and `sub_` methods. I added a second meta function for error checking, i.e. `int.add/sub_(float)` and `bool.add/sub_(other types)`.

The corresponding test now also passes with Dynamo, so `@xfailIfTorchDynamo` was removed.
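
For reference, a sketch of the eager-mode behavior these meta checks mirror (not the PR's test; error messages paraphrased):

```python
import torch

x = torch.ones(3, dtype=torch.int64)
x.add_(1)        # ok: integer += integer
try:
    x.add_(1.5)  # expected to raise: float result can't be cast to the int tensor
except RuntimeError as e:
    print(e)

b = torch.ones(3, dtype=torch.bool)
try:
    b.add_(1)    # expected to raise: non-bool result can't be cast to a bool tensor
except RuntimeError as e:
    print(e)
```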

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135864
Approved by: https://github.com/williamwen42
2024-09-19 03:09:36 +00:00
e037bb326f [dynamo] fix crash in InspectSignatureVariable (#136010)
Fix crash that was happening in https://github.com/pytorch/pytorch/issues/128095, because we were trying to extract a constant incorrectly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136010
Approved by: https://github.com/yanboliang, https://github.com/anijain2305, https://github.com/jansel
2024-09-19 00:23:00 +00:00
f2b0fc89f2 Add uint16 support for observer (#136238)
Summary:
att

Test Plan:
python test/test_quantization.py -k TestObserver

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D62909821](https://our.internmc.facebook.com/intern/diff/D62909821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136238
Approved by: https://github.com/tarun292
2024-09-18 23:52:18 +00:00
068c80e6b6 [BE][MPS] Fix deprecation warnings on MacOS 15.0 (#136292)
[reverseSquareRootWithTensor:](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/reversesquareroot(with:name:)?changes=__8&language=objc) was deprecated in favor of [reciprocalSquareRootWithTensor:](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/reciprocalsquareroot(_:name:)?changes=__8&language=objc)

Without this change, the following warnings are generated when compiling on the recently released macOS Sequoia:
```
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:720:35: warning: 'reverseSquareRootWithTensor:name:' is deprecated: first deprecated in macOS 15.0 [-Wdeprecated-declarations]
  720 |           rsqrtTensor = [mpsGraph reverseSquareRootWithTensor:varianceEpsTensor name:nil];
      |                                   ^~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                   reciprocalSquareRootWithTensor
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/invoke.h:341:10: note: in instantiation of function template specialization 'at::native::batch_norm_backward_mps(const Tensor &, const Tensor &, const std::optional<Tensor> &, const std::optional<Tensor> &, const std::optional<Tensor> &, const std::optional<Tensor> &, const std::optional<Tensor> &, bool, double, std::array<bool, 3>)::(anonymous class)::operator()<MPSGraph *, CachedGraph *>' requested here
  341 | decltype(std::declval<_Fp>()(std::declval<_Args>()...))
      |          ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/invoke.h:351:19: note: while substituting deduced template arguments into function template '__invoke' [with _Fp = (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &, _Args = <MPSGraph *, CachedGraph *>]
  351 |   static decltype(std::__invoke(std::declval<_XFp>(), std::declval<_XArgs>()...)) __try_call(int);
      |                   ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/invoke.h:357:28: note: while substituting deduced template arguments into function template '__try_call' [with _XFp = (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &, _XArgs = (no value)]
  357 |   using _Result = decltype(__try_call<_Fp, _Args...>(0));
      |                            ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/conjunction.h:27:32: note: in instantiation of template class 'std::__invokable_r<void, (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &, MPSGraph *, CachedGraph *>' requested here
   27 | __expand_to_true<__enable_if_t<_Pred::value>...> __and_helper(int);
      |                                ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/conjunction.h:38:39: note: while substituting explicitly-specified template arguments into function template '__and_helper'
   38 | using _And _LIBCPP_NODEBUG = decltype(std::__and_helper<_Pred...>(0));
      |                                       ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:828:20: note: (skipping 1 context in backtrace; use -ftemplate-backtrace-limit=0 to see all)
  828 |             bool = _And< _IsNotSame<__remove_cvref_t<_Fp>, function>, __invokable<_Fp, _ArgTypes...> >::value>
      |                    ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:841:49: note: in instantiation of default argument for '__callable<(lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &>' required here
  841 |   using _EnableIfLValueCallable = __enable_if_t<__callable<_Fp&>::value>;
      |                                                 ^~~~~~~~~~~~~~~~
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:851:32: note: in instantiation of template type alias '_EnableIfLValueCallable' requested here
  851 |   template <class _Fp, class = _EnableIfLValueCallable<_Fp>>
      |                                ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:852:25: note: in instantiation of default argument for 'function<(lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68)>' required here
  852 |   _LIBCPP_HIDE_FROM_ABI function(_Fp);
      |                         ^~~~~~~~~~~~~
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68: note: while substituting deduced template arguments into function template 'function' [with _Fp = (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68), $1 = (no value)]
  623 |     auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
      |                                                                    ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:24: note: while substituting deduced template arguments into function template 'LookUpOrCreateCachedGraph' [with T = CachedGraph]
  623 |     auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) {
      |                        ^
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks/MetalPerformanceShadersGraph.framework/Headers/MPSGraphArithmeticOps.h:123:1: note: 'reverseSquareRootWithTensor:name:' has been explicitly marked deprecated here
  123 | -(MPSGraphTensor *) reverseSquareRootWithTensor:(MPSGraphTensor *) tensor
      | ^
/Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:745:37: warning: 'reverseSquareRootWithTensor:name:' is deprecated: first deprecated in macOS 15.0 [-Wdeprecated-declarations]
  745 |             rsqrtTensor = [mpsGraph reverseSquareRootWithTensor:varianceEpsTensor name:nil];
      |                                     ^~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                     reciprocalSquareRootWithTensor
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks/MetalPerformanceShadersGraph.framework/Headers/MPSGraphArithmeticOps.h:123:1: note: 'reverseSquareRootWithTensor:name:' has been explicitly marked deprecated here
  123 | -(MPSGraphTensor *) reverseSquareRootWithTensor:(MPSGraphTensor *) tensor
      | ^
2 warnings generated.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136292
Approved by: https://github.com/kit1980
2024-09-18 23:38:31 +00:00
b9a197df77 [BE][MPS] Delete duplicated code in View.mm (#136295)
After https://github.com/pytorch/pytorch/pull/135706 `getGatherScatterScalarType` returns exactly the same results as `scalarToMetalTypeString` , so delete the function and call `scalarToMetalTypeString`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136295
Approved by: https://github.com/kit1980
2024-09-18 22:44:43 +00:00
f1ad680818 [dynamo]Remove stream hardcoding in dynamo VariableBuilder (#131763)
Fixes #ISSUE_NUMBER

A recent change from PR #123487 used torch.cuda.Stream directly, which causes failures for other backends. This PR generalizes the stream handling for all backends (cuda/hpu/xpu).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131763
Approved by: https://github.com/yanboliang, https://github.com/yf225
2024-09-18 22:32:34 +00:00
bc9597b7d8 [Traceable FSDP2] Minor refactor to traceable FSDP2 unit tests (#136219)
Changes in this PR:
- Monkey-patching `F.scaled_dot_product_attention` with a lambda seems not to work in some cases. This PR avoids using a lambda (a small sketch of the alternative follows this list).
- Running `fullgraph=True` and `fullgraph=False` in the same unit test seems to make the two cases interfere with each other and causes errors. This PR splits them into two separate unit tests.
- The checks in the unit tests might not work with the compile cache. This PR turns off the cache in order to get more predictable compile behavior for the unit tests.
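
A hypothetical sketch of the first bullet's change (the unit tests' actual wrapper differs): patch SDPA with a named, module-level function instead of a lambda.

```python
import torch.nn.functional as F

_orig_sdpa = F.scaled_dot_product_attention

def _sdpa_wrapper(*args, **kwargs):
    # e.g. record that the call happened, then delegate to the original op
    return _orig_sdpa(*args, **kwargs)

F.scaled_dot_product_attention = _sdpa_wrapper
```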

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor_fullgraph_True`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor_fullgraph_False`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_True`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor_fullgraph_False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136219
Approved by: https://github.com/yifuwang
2024-09-18 22:30:23 +00:00
1a86d8aa29 Fix calling Add._from_args and Mul._from_args (#136143)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136143
Approved by: https://github.com/ezyang
2024-09-18 20:51:04 +00:00
aae68e2976 Add wait counter for nccl abort (#136067)
Summary:
Quite a few times, we see the NCCL PG abort taking too long. There's no easy way to measure this, so let's add a counter to measure this across the stack.

This will help us measure how much time the NCCL abort takes.
Test Plan:
Unit tests

Reviewed By: c-p-i-o

Differential Revision: D62675010

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136067
Approved by: https://github.com/fduwjj
2024-09-18 20:14:10 +00:00
eqy
68a7246f13 [cuDNN][conv][A100] Bump tolerances for vmap_autograd_grad conv2d on A100 (#136178)
Likely due to a cuDNN heuristics update.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136178
Approved by: https://github.com/Skylion007
2024-09-18 19:42:13 +00:00
5a6ddbcc3b Extending the Pytorch vec backend for SVE (ARM) (#119571)
**Motivation:**
In PyTorch, ATen vectorization supports multiple platforms, including x86 and Arm, as well as multiple data types. It provides a generic implementation of the Vector (Vec) type that allows the programmer to write code packing various primitives (such as floats) within 256-bit and 512-bit registers. It can be extended to support other ISAs easily by adding more VecISA sub-classes.

**Reference Link:** https://github.com/pytorch/pytorch/tree/main/aten/src/ATen/cpu/vec

**This PR:**

* Our goal with this contribution is to add an SVE backend for Vec in the ATen CPU vectorization, which can benefit any Arm CPU that supports SVE.

* More about SVE ISA for ARM: [https://developer.arm.com/Architectures/Scalable Vector Extensions](https://developer.arm.com/Architectures/Scalable%20Vector%20Extensions)

* We are using the ARM C Language Extensions for SVE (https://developer.arm.com/documentation/102699/0100/Optimizing-with-intrinsics ) to accelerate performance for various operators in the SVE backend for Vec.

* Currently we are adding support only for the SVE ISA with a vector length of 256 bits (SVE 256). In the future, we plan to extend this SVE support to other vector lengths as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119571
Approved by: https://github.com/malfet, https://github.com/snadampal

Co-authored-by: Divya Kotadiya <divya.kotadiya@fujitsu.com>
2024-09-18 18:59:10 +00:00
bad69044d8 [ROCm] upgrade ROCm CI builds to py3.10 (#134108)
Upgrade ROCm CI builds to py3.10

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134108
Approved by: https://github.com/jeffdaily, https://github.com/jithunnair-amd, https://github.com/atalman
2024-09-18 17:39:34 +00:00
3efaa016b1 [c10d] Make test compatible for new pytest (#136158)
Temporary fix to the issue in https://github.com/pytorch/pytorch/issues/127517.

Short-term fix following CPython: 51aefc5bf9/Lib/unittest/case.py (L419-L426)

Differential Revision: [D62878083](https://our.internmc.facebook.com/intern/diff/D62878083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136158
Approved by: https://github.com/fegin
2024-09-18 17:10:55 +00:00
605f2d802a [PyTorch] Remove unnecessary include of c10/util/Exception.h in irange.h (#136202)
Manually audited and can't figure out why this would be needed.

Differential Revision: [D62879500](https://our.internmc.facebook.com/intern/diff/D62879500/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136202
Approved by: https://github.com/malfet
2024-09-18 16:57:15 +00:00
6a6f5b20c5 Add _addmm_activation to lower precision cast policy on AutocastCPU (#135936)
Fixes #132613.
Add `_addmm_activation` to lower precision cast policy on AutocastCPU.
`_addmm_activation` (https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/transformers/transformer.cpp#L39), called in `transformer_encoder_layer_forward`, may throw `RuntimeError: mat1 and mat2 must have the same dtype, but got BFloat16 and Float` when autocast is enabled, because `_native_multi_head_attention` was put in the lower-precision cast policy in https://github.com/pytorch/pytorch/pull/107674, so `_addmm_activation` may encounter mixed data types.
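
A rough repro sketch of the failure mode, with illustrative shapes (the fused fast path is typically only taken in eval mode without autograd):

```python
import torch

layer = torch.nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True).eval()
x = torch.randn(2, 8, 64)
with torch.inference_mode(), torch.autocast("cpu", dtype=torch.bfloat16):
    out = layer(x)  # previously could raise: mat1 and mat2 must have the same dtype
```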

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135936
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-09-18 16:31:27 +00:00
c8d152cb0e Fix fast_expand recursion error (#136163)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136163
Approved by: https://github.com/ezyang
2024-09-18 13:58:45 +00:00
701ba5203f [Inductor] Increase multiplier to 3 for Inductor AMP FP16 benchmark correctness check (#135932)
Fix https://github.com/pytorch/pytorch/issues/135657.
Aligned with AMP BF16, use a multiplier of 3 for the Inductor AMP FP16 benchmark correctness check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135932
Approved by: https://github.com/CaoE, https://github.com/jgong5, https://github.com/jansel
2024-09-18 13:03:45 +00:00
b5be4d8c05 Fix ROCm skip decorator for test_ddp_tp and multiprocess UTs (#136161)
skip_if_rocm is used only in the multiprocess case (when the UT test class is a child of MultiProcessTestCase), where each individual process can exit with a skip code. If used for a single-process UT, it will cause the UT to fail because the process returns a non-zero exit code. Use skipIfRocm in single-process UTs.

To avoid the above confusion, this PR renamed skip_if_rocm to skip_if_rocm_multiprocess.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136161
Approved by: https://github.com/jithunnair-amd, https://github.com/kwen2501, https://github.com/fegin
2024-09-18 11:01:23 +00:00
083c9149b7 Reland D62220158 (#136213)
Summary: We fix the unit test test_pad_mm and reland the diff

Test Plan: See in D62220158

Differential Revision: D62891584

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136213
Approved by: https://github.com/dshi7
2024-09-18 07:33:41 +00:00
a0207c8471 [dynamo] Fix support for classmethod(property(...)) (#134968)
Fixes #134451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134968
Approved by: https://github.com/yanboliang
2024-09-18 04:47:51 +00:00
9aa22eabe7 [CI] Make linux-aarch64 shards actually running different tests (#136208)
Non-functional sharding was introduced in https://github.com/pytorch/pytorch/pull/125255, but each shard in that case was running the same tests...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136208
Approved by: https://github.com/seemethere, https://github.com/ZainRizvi, https://github.com/atalman
2024-09-18 03:10:21 +00:00
8895f69d12 [torch/numpy][numpy2.0 compat] Additional changes for tests to run under numpy-2.0 (#136152)
Continuation of https://github.com/pytorch/pytorch/pull/131909. This PR makes numpy tests compatible with numpy>=2.0.0. Specifically it deals with APIs that have been removed from numpy-2.0.

Changes in this PR:
1. Use `numpy.exceptions.ComplexWarning` if `numpy.exceptions` namespace is present. In numpy-2.0 `numpy.ComplexWarning` has been removed in favor of using `numpy.exceptions.ComplexWarning` (see [numpy-2.0 migration guide](https://numpy.org/devdocs/numpy_2_0_migration_guide.html#changes-to-namespaces)). Note that `numpy.exceptions` was introduced in numpy-1.25.0 hence does not exist in numpy<=1.24.x.
2. Do the same for `numpy.exceptions.VisibleDeprecationWarning`
3. Use `np.sort(..., axis=0)` over `np.msort()` (`np.msort()` was removed in numpy-2.0)
4. Use `np.pad()` over `np.lib.pad()` (`np.lib` removed in numpy-2.0)
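
Along those lines, a minimal compatibility sketch (the shim itself is illustrative, not part of this PR):

```python
import numpy as np

# numpy >= 1.25 exposes warnings under numpy.exceptions; older versions do not.
if hasattr(np, "exceptions"):
    ComplexWarning = np.exceptions.ComplexWarning
else:
    ComplexWarning = np.ComplexWarning

# np.msort was removed in numpy 2.0; np.sort(..., axis=0) is the replacement.
a = np.array([[3, 1], [2, 4]])
sorted_a = np.sort(a, axis=0)

# np.lib.pad was removed in numpy 2.0; use np.pad instead.
padded = np.pad(np.arange(4), (1, 1), mode="constant")
```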

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136152
Approved by: https://github.com/atalman
2024-09-18 02:11:22 +00:00
6682327c75 [BE] Make NestedTensorTransformerFunctions.cu compilable without warnings (#136222)
Before this change, compilation produced the following warnings:
```
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In function ‘std::tuple<dim3, dim3, at::native::StackArray<long int> > at::native::check_shape_and_partition_(const at::Tensor&, const std::vector<at::Tensor>&, const at::Tensor&)’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:584:22: warning: comparison of integer expressions of different signedness: ‘const int’ and ‘const size_t’ {aka ‘const long unsigned int’} [-Wsign-compare]
  584 |   TORCH_CHECK(num_jagged_dim <= kStackArrayMaxDims);
      |       ~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In lambda function:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1224:1061: warning: comparison of integer expressions of different signedness: ‘long unsigned int’ and ‘int’ [-Wsign-compare]
 1224 |   AT_DISPATCH_INDEX_TYPES(
      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                     ^
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In lambda function:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1224:1985: warning: comparison of integer expressions of different signedness: ‘long unsigned int’ and ‘int’ [-Wsign-compare]
 1224 |   AT_DISPATCH_INDEX_TYPES(
      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ^
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu: In instantiation of ‘void at::native::jagged_dense_elementwise_jagged_output_opt_(const at::Tensor&, const std::vector<at::Tensor>&, const at::Tensor&, const at::Tensor&, F) [with scalar_t = c10::Half; F = __nv_dl_wrapper_t<__nv_dl_trailing_return_tag<at::Tensor (*)(const at::Tensor&, c10::ArrayRef<at::Tensor>, std::optional<c10::SymInt>), at::native::_fbgemm_dense_to_jagged_forward_symint, c10::Half, 1> >]’:
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1515:1:   required from here
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1336:2006: warning: comparison of integer expressions of different signedness: ‘size_t’ {aka ‘long unsigned int’} and ‘int’ [-Wsign-compare]
 1336 |     AT_DISPATCH_INDEX_TYPES(
      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      ^
/home/nshulga/git/pytorch/pytorch/aten/src/ATen/native/nested/cuda/NestedTensorTransformerFunctions.cu:1336:2113: warning: comparison of integer expressions of different signedness: ‘size_t’ {aka ‘long unsigned int’} and ‘int’ [-Wsign-compare]
 1336 |     AT_DISPATCH_INDEX_TYPES(
      |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 ^
```
After the change, it compiles without warnings.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136222
Approved by: https://github.com/PaliC, https://github.com/kit1980
2024-09-18 01:24:05 +00:00
b18ba9419e [AO][Inductor] Enable WOQ fusion pattern with permute (#135928)
**Summary**
Fix https://github.com/pytorch/pytorch/issues/135831 and https://github.com/pytorch/ao/issues/890. The root cause of the numerical failure was that the customized woq-int8 kernel was not triggered due to changes in the pattern. After re-adding the fusion pattern, the accuracy check now passes. I will open a separate TorchAO PR to enable these unit tests in TorchAO.

**Test Plan**
```
python test/inductor/test_mkldnn_pattern_matcher.py -k test_woq_int8
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135928
Approved by: https://github.com/jgong5, https://github.com/eellison
2024-09-18 00:56:16 +00:00
cccf500193 [c10d] remove sleep from watchdogHandler (#135760)
Summary:
Remove the sleep from the `watchdogHandler` function. This sleep unnecessarily slows things down during an NCCL timeout.
Flight recorder is configured to take a minute, at most, to dump out its buffer.
This sleep ends up waiting for `8` minutes before destroy is called.

Test Plan: Unit tests.

Differential Revision: D62529875

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135760
Approved by: https://github.com/fduwjj, https://github.com/shuqiangzhang
2024-09-18 00:55:01 +00:00
f6f1504d39 [MPS] Fix 5D+ reductions over negative dimensions (#136198)
This fixes a bug introduced by https://github.com/pytorch/pytorch/pull/99856 that attempts to speed up reductions for 5D+ tensors if trailing dimensions are all ones, but introduces crashes/off-by-one errors for wrapped (negative) dimensions.

Added a regression test case to `TestMPS.test_sum`

Fixes https://github.com/pytorch/pytorch/issues/136132

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136198
Approved by: https://github.com/albanD
2024-09-17 21:53:31 +00:00
a575ce0dc6 [PyTorch Pinned Allocator] Add support of background thread to process events (#135524)
Summary: Currently we process events in the regular allocation path and call cudaEventQuery to check on the events, and this path can take locks in the libcuda driver. It is not strictly necessary to process events in the allocation path: we can move this to a background thread that keeps processing events regularly and puts freed blocks back on the free list.

Differential Revision: D62396585

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135524
Approved by: https://github.com/zyan0
2024-09-17 21:08:10 +00:00
48d18fbd4c [PyTorch CUDA Allocator] Allow reuse of non-split blocks with better rounding (#136174)
Summary:
This diff adds an option to round the non-split blocks in caching allocator so that they can be reused without causing lots of fragmentation for large memory segments.

For example, if we specify the max_split memory size as 400MB, then allocations larger than 400MB will not be split. Say we have allocated some 1024MB blocks and these are cached by the allocator. If we request a new 500MB block, we round it to the nearest power-of-2 division, that is 512MB, and add the default kLargeBuffer of 20MB, giving 532MB. Since 532MB is less than the existing 1024MB block, the 1024MB block will not be used for this allocation; instead a new 512MB block will be created. In this diff, we make this rounding buffer configurable and expose it as an option (max_non_split_rounding_size): if 512MB + max_non_split_rounding_size is greater than 1024MB, we will use the existing 1024MB block and won't create a new 512MB block via cudaMalloc. This option is added so that we can pre-allocate some large blocks, reuse them as much as possible, and avoid stalling on cudaMalloc.
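
A rough, illustrative sketch of the reuse decision described above (the real allocator logic lives in C++ and differs in detail):

```python
MB = 1024 * 1024

def round_to_power2_division(size):
    # illustrative rounding: a 500MB request rounds up to 512MB
    p = 1
    while p < size:
        p *= 2
    return p

def reuses_cached_block(request, cached_block, max_non_split_rounding):
    rounded = round_to_power2_division(request)
    return rounded + max_non_split_rounding > cached_block

print(reuses_cached_block(500 * MB, 1024 * MB, 20 * MB))   # False -> new 512MB cudaMalloc
print(reuses_cached_block(500 * MB, 1024 * MB, 600 * MB))  # True  -> reuse the 1024MB block
```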

Differential Revision: D62758758

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136174
Approved by: https://github.com/zyan0
2024-09-17 19:08:44 +00:00
eqy
e3aa5e2f64 [NCCL] Don't override waitUntilInitialized's setting of comm->initialized_ (#136155)
#133630 sets `initialized_` to `true`, which causes previous wait codepaths to skip necessary waits; see also https://github.com/pytorch/pytorch/issues/136151

CC @shuqiangzhang @wconstab

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136155
Approved by: https://github.com/fduwjj, https://github.com/kwen2501, https://github.com/c-p-i-o, https://github.com/shuqiangzhang
2024-09-17 18:50:12 +00:00
a4e9a1c90b [TorchRec][PT2 IR][APF] short circuit the flatten/unflatten between EBC and KTRegroupAsDict modules (#136045)
Summary:
# context
* for the root cause and background please refer to this [post](https://fb.workplace.com/groups/1028545332188949/permalink/1042204770823005/)
* the basic idea of this diff is to **short circuit the pytree flatten-unflatten function pairs** between two preserved modules, i.e., EBC/fpEBC and KTRegroupAsDict.
NOTE: There could be multiple EBCs and one single KTRegroupAsDict as shown in the [pic](https://fburl.com/gslide/lcyt8eh3) {F1864810545}
* short-circuiting the EBC-KTRegroupAsDict pairs is very special and a must in most cases due to the EBC key-order issue with distributed table lookup.
* hide all the operations behind a control flag `short_circuit_pytree_ebc_regroup` on the torchrec main api call `decapsulate_ir_modules`, which should only be visible to the infra layer, not to the users.

# details
* The `_short_circuit_pytree_ebc_regroup` function finds all the EBC/fpEBC and KTRegroupAsDict modules in an unflattened module, retrieves their FQNs, and sorts them into in_fqns (regroup_fqns) and out_fqns (ebc_fqns). Because the fpEBC is currently swapped as a whole, we do some extra FQN logic to filter out the EBCs that belong to an upper-level fpEBC.
* a util function `prune_pytree_flatten_unflatten` removes the incoming and outgoing pytree flatten/unflatten function calls in the graph module, based on the given FQNs.

WARNING: The flag `short_circuit_pytree_ebc_regroup` should be turned on if EBCs are used and EBC sharding is needed. Assertions are also added for the cases where a `KTRegroupAsDict` module can't be found, or `finalize_interpreter_modules` is not `True`.

# additional changes
* absorb the `finalize_interpreter_modules` process inside the torchrec main api `decapsulate_ir_modules`.
* set `graph.owning_module` in export.unflatten as required by the graph modification
* add one more layer of `sparse_module` to closely mimic the APF model structure.

Test Plan:
# run test
* serializer
```
buck2 run fbcode//mode/opt fbcode//torchrec/ir/tests:test_serializer
```
* apf
```
buck2 run fbcode//mode/opt fbcode//aps_models/ads/gmp/tests/ne/e2e_deterministic_tests:gmp_e2e_ne_tests -- --filter-text 'test_mtml_instagram_model_562438350_single_gpu_with_ir'
```
* local mp run
```
==== Finished E2E deterministic test for mtml_instagram_model_gmp_474023725_non_kjt_unary ====
finished
  test_mtml_instagram_model_562438350_single_gpu_with_ir
Imports took: 6.0s! Profile with --import-profiler.            --_ |""---__
Executed 1 example in 203.1s:                               |'.|  ||  .    """|
  Successful: 1                                             | ||  || /|\""-.  |
  Failed: 0                                                 | ||  ||  |    |  |
  Skipped: 0                                                | ||  ||  |   \|/ |
  Not executed: 8                                           |."|  ||  --"" '__|
https://testslide.readthedocs.io/                              --" |__---"""
```

Differential Revision: D62606738

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136045
Approved by: https://github.com/angelayi
2024-09-17 18:42:56 +00:00
ea10c072f3 [export] Deserialize args with python keyword names (#136036)
Currently, when we deserialize inputs to nodes, we deserialize arguments with default values as kwargs. So deserializing `aten.uniform`, which has the signature `uniform(Tensor(a!) self, float from=0, float to=1, *, Generator? generator=None) -> Tensor(a!)`, will become `uniform(x, from=0, to=1)`. However, this fails when running in Python because `from` is a Python keyword. So the solution here is to not deserialize it as a kwarg.
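A minimal illustration of the Python-keyword problem (the strings are only compiled, never executed):

```python
bad = "torch.ops.aten.uniform(x, from=0, to=1)"
good = "torch.ops.aten.uniform(x, 0, 1)"

compile(good, "<node>", "eval")          # ok: positional form
try:
    compile(bad, "<node>", "eval")       # `from` as a kwarg name is invalid syntax
except SyntaxError as e:
    print("kwarg form fails:", e)
```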
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136036
Approved by: https://github.com/zhxchen17
2024-09-17 18:13:14 +00:00
a8382847f4 Support rms_norm() for NJT (#135872)
`rms_norm()` is a nice-to-have for ViT :)

This PR:
* SymInt-ifies `rms_norm()`, allowing NJT to use the same decomp.
* Adds torch_function-based input validation logic for nested-specific stuff (no normalization supported over the ragged dim for now) on the python NJT side.
* Adds multi-dim support (on non-ragged, non-batch dims) to `mean()` for NJT.
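
A rough usage sketch, assuming this PR's support (shapes illustrative):

```python
import torch
import torch.nn.functional as F

# rms_norm over the last, non-ragged dim of a jagged (NJT) nested tensor
nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)
out = F.rms_norm(nt, normalized_shape=(8,), weight=torch.ones(8))
```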
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135872
Approved by: https://github.com/mikaylagawarecki
ghstack dependencies: #125947
2024-09-17 18:09:20 +00:00
785e98783b Delete links to non-existing run_plan_mpi.cc (#136204)
The file was deleted by https://github.com/pytorch/pytorch/pull/125092

Fixes https://github.com/pytorch/pytorch/issues/136199

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136204
Approved by: https://github.com/albanD, https://github.com/seemethere
2024-09-17 17:51:56 +00:00
cc365fdd7b [MTIA] Support torch.cuda.get_device_capability equivalent API on MTIA (#135889)
Summary:
Mirror `get_device_capability` on MTIA per https://fburl.com/gdoc/p4lo5avn

At the moment, both the major and minor version are just 0

Test Plan:
Unit test: `buck2 test //mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api`

https://www.internalfb.com/intern/testinfra/testconsole/testrun/1688850109958190/

Differential Revision: D62595296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135889
Approved by: https://github.com/egienvalue
2024-09-17 17:42:56 +00:00
8e5bb356e0 [PT2] Port merge_concats_pass to PT2 pre_grad passes (#135527)
Summary: as title

Test Plan: new UT

Differential Revision: D62398390

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135527
Approved by: https://github.com/frank-wei
2024-09-17 17:26:53 +00:00
63dc5dff10 [Fix]: Update CPUINFO submodule to fix support for NON-SVE ARM Hardware (#135857)
Regression PR: https://github.com/pytorch/cpuinfo/pull/255

Change-Id: I56cec061072be11ec33ccb661114360b979fc7aa

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135857
Approved by: https://github.com/digantdesai, https://github.com/malfet
2024-09-17 16:50:17 +00:00
67b14ce8bd [ONNX] Fix numpy method to return the correct type (#136162)
The previous implementation of the `numpy()` method returned `fp64` when the tensor was `fp32`. This is unexpected but seems to be caused by calling `__array__(dtype=None)` on the numpy array. I updated the implementation to define the `numpy()` method explicitly and added tests to guard the behavior.

This needs to be cherry-picked into torch 2.5
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136162
Approved by: https://github.com/gramalingam, https://github.com/xadupre
2024-09-17 15:51:00 +00:00
ece8267d2c Add back optim type hints that were lost when *.pyi files were removed (#136185)
When stub files (`*.pyi`) were removed from `optim` (#125556, #125452), some types that existed are no longer available. This pull request adds them back.

Just for reference, these types are used in `pytorch-lightning`'s `LightningCLI`. Command line interfaces are created automatically, and having type hints makes them nicer.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136185
Approved by: https://github.com/janeyx99
2024-09-17 15:45:15 +00:00
913f97e878 Don't run reshape pattern match on dynamic shape size tensor (#136100)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136100
Approved by: https://github.com/mengluy0125
2024-09-17 15:08:55 +00:00
462b727d1e Revert "Add decomposition for permute_copy (#130944)"
This reverts commit ab9a7eadd34aee59fc67e29237610b7562cc4ff0.

Reverted https://github.com/pytorch/pytorch/pull/130944 on behalf of https://github.com/jeanschmidt due to Broke internal signal executorch.backends.xnnpack.test.ops.permute.TestPermute, more details on D62737086. @eellison could you please help get this PR merged to main? ([comment](https://github.com/pytorch/pytorch/pull/130944#issuecomment-2355846394))
2024-09-17 13:42:55 +00:00
2c4ae81494 Revert "Add decomposition for squeeze_copy (#130941)"
This reverts commit c33b0580e6a702be0cd5be691b3b465da012aa34.

Reverted https://github.com/pytorch/pytorch/pull/130941 on behalf of https://github.com/jeanschmidt due to Need to revert in order to be able to revert https://github.com/pytorch/pytorch/pull/130944, after fixing any merge conflicts, feel free to merge it back ([comment](https://github.com/pytorch/pytorch/pull/130941#issuecomment-2355831480))
2024-09-17 13:39:07 +00:00
3b5e2689a1 Revert "Optimize dict reconstruct to not codegen untouched values (#134876)"
This reverts commit a1a57a424dc992f4dc2d44bdc1e4e7e500881a9c.

Reverted https://github.com/pytorch/pytorch/pull/134876 on behalf of https://github.com/jeanschmidt due to new introduced test test_reconstruct.py::ReconstructTest::test_functional_call_reconstruct is breaking internally. @zou3519 may you help get those changes merged back to main? ([comment](https://github.com/pytorch/pytorch/pull/134876#issuecomment-2355697685))
2024-09-17 13:00:01 +00:00
e248c1d7eb Update real device in FSDP state_dict_utils (#134994)
## Motivation
The device reported by tensor.device, both for sharded and non-sharded tensors, is set to cuda by default. Hence, while checking the FSDP UTs, we see the following errors. This change derives the actual device type from the created tensor.

```
[rank3]   File "/root/repos/pytorch-training-tests/tests/pytorch/v2.4.0/distributed_hpu/fsdp/test_fsdp_dtensor_state_dict.py", line 143, in test_dtensor_sharded_tensor_state_dict_identical
[rank3]     sharded_tensor_sd = ref_model.state_dict()
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1944, in state_dict
[rank3]     hook_result = hook(self, destination, prefix, local_metadata)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank3]     return func(*args, **kwargs)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/fsdp/_state_dict_utils.py", line 752, in _post_state_dict_hook
[rank3]     tensor.device,
[rank3]   File "/usr/local/lib/python3.10/dist-packages/typing_extensions.py", line 2853, in wrapper
[rank3]     return arg(*args, **kwargs)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1152, in __torch_function__
[rank3]     return dispatch(st_instance, func)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/api.py", line 1134, in dispatch
[rank3]     return _SHARDED_OPS[func](types, args, kwargs, st._process_group)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/op_registry_utils.py", line 33, in wrapper
[rank3]     return wrapped_func(types, args, kwargs, process_group)
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/distributed/_shard/sharded_tensor/_ops/tensor_ops.py", line 52, in tensor_device
[rank3]     dev = torch.device(torch.cuda.current_device())
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 878, in current_device
[rank3]     _lazy_init()
[rank3]   File "/usr/local/lib/python3.10/dist-packages/torch/cuda/__init__.py", line 305, in _lazy_init
[rank3]     raise AssertionError("Torch not compiled with CUDA enabled")
[rank3] AssertionError: Torch not compiled with CUDA enabled
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134994
Approved by: https://github.com/fegin
2024-09-17 04:39:08 +00:00
408fe41a45 [DSD][EZ] Minor update in _state_dict_utils.py (#136165)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136165
Approved by: https://github.com/kwen2501
ghstack dependencies: #135725, #135763
2024-09-17 04:32:43 +00:00
dc82d274e6 make view.dtype always return an alias (#136074)
Fixes https://github.com/pytorch/pytorch/issues/136064

In the linked repro, the issue was that there was some code like this:
```
# x has dtype torch.float32
def f(x):
    y = x.view(torch.float32)
    y.copy_(...)
```

Because `view.dtype` is implemented today to potentially return its input directly, we would end up clobbering the proxy for our graph input (replacing its FX proxy value from `arg0_1` to `view_1`). This is not desirable, because we have careful assertions in AOTDispatcher that mutations only ever happen on graph inputs - but this clobbering caused the mutation to appear, from the perspective of the FX graph, like it was happening on a view of the input.

Why is this normally not a problem? Ordinarily, the `ADInplaceOrView` kernel for `view.dtype` will take the output of the view kernel, [and detach() it](https://github.com/pytorch/pytorch/blob/main/tools/autograd/gen_inplace_or_view_type.py#L466) (properly creating a fresh `TensorImpl`).

This does **not** happen, though, if you are executing the kernel from with a `__torch_dispatch__` region: the `ADInplaceOrView` logic has already run above you, so that key will be in the TLS exclude set.

This PR changes eager behavior - at first I considered trying to only change behavior under compile. But this problem isn't technically specific to PT2: if you ever rely on tensor identity from inside of a __torch_dispatch__ call, then we need to make sure the raw `view.dtype` kernel doesn't directly return the input.

I am also making the assumption that "`view.dtype` no-op'ing when the dtype is the same" is not a case worth optimizing in eager mode, and that the overhead of the `TensorImpl` creation is relatively negligible.
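
As a quick illustrative check of the post-change contract (in plain eager mode the ADInplaceOrView detach already gave a distinct object; this PR makes it hold at the kernel level too, e.g. under `__torch_dispatch__`):

```python
import torch

x = torch.randn(4, dtype=torch.float32)
y = x.view(torch.float32)               # same dtype, but still a fresh alias
assert y is not x                       # a new TensorImpl, not the input itself
assert y.data_ptr() == x.data_ptr()     # aliasing the same storage
```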

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136074
Approved by: https://github.com/Skylion007, https://github.com/ezyang, https://github.com/albanD
ghstack dependencies: #136041
2024-09-17 03:40:54 +00:00
d463a81c27 inductor: dont use default_dtype during rng functionalization (#136041)
Fixes https://github.com/pytorch/pytorch/issues/119162

See context at https://github.com/pytorch/pytorch/issues/119162#issuecomment-2349849469

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136041
Approved by: https://github.com/eellison
2024-09-17 03:40:54 +00:00
3f74310784 Back out "Flip triton kernel default layout constraint to "needs_fixed_stride_order" (#135581)" (#136160)
Test Plan: make train-hstu-cint-publish-bf16-tgif-local

Differential Revision: D62766335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136160
Approved by: https://github.com/muchulee8
2024-09-17 01:06:10 +00:00
37a08b33bb Revert "fix compiled_autograd deadlock throw (#135795)"
This reverts commit 00dc7d435652ad66e9d2feb2660928b632281a98.

Reverted https://github.com/pytorch/pytorch/pull/135795 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/135795#issuecomment-2354233619))
2024-09-16 23:59:56 +00:00
071da87cd7 use csv extention for test report in order for it to be uploaded to s3 (#136128)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136128
Approved by: https://github.com/clee2000
2024-09-16 21:47:46 +00:00
c12536b3c0 [ONNX] Treat CompositeImplicitAutograd ops as normal ops in decomp (#136153)
Since https://github.com/pytorch/pytorch/pull/135080, the CompositeImplicitAutograd (CIA) ops are only decomposed when a decomp function is provided in a table. There is no longer a need to distinguish CIA ops like Upsample and preserve them explicitly. On the ONNX Script torchlib side I will unregister some ops from the following list to make sure some CIA ops are still decomposed.

```
<OpOverload(op='aten.__and__', overload='Scalar')>,
 <OpOverload(op='aten.__and__', overload='Tensor')>,
 <OpOverload(op='aten.__or__', overload='Scalar')>,
 <OpOverload(op='aten.__or__', overload='Tensor')>,
 <OpOverload(op='aten.__xor__', overload='Scalar')>,
 <OpOverload(op='aten.__xor__', overload='Tensor')>,
 <OpOverload(op='aten._add_batch_dim', overload='default')>,
 <OpOverload(op='aten._assert_tensor_metadata', overload='default')>,
 <OpOverload(op='aten._backward', overload='default')>,
 <OpOverload(op='aten._batch_norm_impl_index_backward', overload='default')>,
 <OpOverload(op='aten._cast_Byte', overload='default')>,
 <OpOverload(op='aten._cast_Char', overload='default')>,
 <OpOverload(op='aten._cast_Double', overload='default')>,
 <OpOverload(op='aten._cast_Float', overload='default')>,
 <OpOverload(op='aten._cast_Half', overload='default')>,
 <OpOverload(op='aten._cast_Int', overload='default')>,
 <OpOverload(op='aten._cast_Long', overload='default')>,
 <OpOverload(op='aten._cast_Short', overload='default')>,
 <OpOverload(op='aten._choose_qparams_per_tensor', overload='default')>,
 <OpOverload(op='aten._convolution', overload='deprecated')>,
 <OpOverload(op='aten._convolution_double_backward', overload='default')>,
 <OpOverload(op='aten._convolution_mode', overload='default')>,
 <OpOverload(op='aten._cufft_clear_plan_cache', overload='default')>,
 <OpOverload(op='aten._cufft_get_plan_cache_max_size', overload='default')>,
 <OpOverload(op='aten._cufft_get_plan_cache_size', overload='default')>,
 <OpOverload(op='aten._cufft_set_plan_cache_max_size', overload='default')>,
 <OpOverload(op='aten._debug_has_internal_overlap', overload='default')>,
 <OpOverload(op='aten._dim_arange', overload='default')>,
 <OpOverload(op='aten._embedding_bag_sparse_backward', overload='default')>,
 <OpOverload(op='aten._gather_sparse_backward', overload='default')>,
 <OpOverload(op='aten._grid_sampler_2d_cpu_fallback_backward', overload='default')>,
 <OpOverload(op='aten._has_compatible_shallow_copy_type', overload='default')>,
 <OpOverload(op='aten._is_zerotensor', overload='default')>,
 <OpOverload(op='aten._lu_with_info', overload='default')>,
 <OpOverload(op='aten._nnpack_available', overload='default')>,
 <OpOverload(op='aten._pack_padded_sequence_backward', overload='default')>,
 <OpOverload(op='aten._pad_circular', overload='default')>,
 <OpOverload(op='aten._pad_enum', overload='default')>,
 <OpOverload(op='aten._pad_packed_sequence', overload='default')>,
 <OpOverload(op='aten._propagate_xla_data', overload='default')>,
 <OpOverload(op='aten._remove_batch_dim', overload='default')>,
 <OpOverload(op='aten._reshape_from_tensor', overload='default')>,
 <OpOverload(op='aten._rowwise_prune', overload='default')>,
 <OpOverload(op='aten._saturate_weight_to_fp16', overload='default')>,
 <OpOverload(op='aten._scaled_dot_product_attention_math', overload='default')>,
 <OpOverload(op='aten._shape_as_tensor', overload='default')>,
 <OpOverload(op='aten._sobol_engine_draw', overload='default')>,
 <OpOverload(op='aten._sparse_bsc_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_bsr_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_compressed_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_coo_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_csc_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_csr_tensor_unsafe', overload='default')>,
 <OpOverload(op='aten._sparse_log_softmax', overload='Dimname')>,
 <OpOverload(op='aten._sparse_log_softmax', overload='int')>,
 <OpOverload(op='aten._sparse_mm', overload='default')>,
 <OpOverload(op='aten._sparse_mm', overload='reduce')>,
 <OpOverload(op='aten._sparse_softmax', overload='Dimname')>,
 <OpOverload(op='aten._sparse_softmax', overload='int')>,
 <OpOverload(op='aten._sparse_sum', overload='default')>,
 <OpOverload(op='aten._sparse_sum', overload='dim_dtype')>,
 <OpOverload(op='aten._sparse_sum', overload='dtype')>,
 <OpOverload(op='aten._test_ambiguous_defaults', overload='a')>,
 <OpOverload(op='aten._test_ambiguous_defaults', overload='b')>,
 <OpOverload(op='aten._test_autograd_multiple_dispatch', overload='ntonly')>,
 <OpOverload(op='aten._test_check_tensor', overload='default')>,
 <OpOverload(op='aten._test_serialization_subcmul', overload='default')>,
 <OpOverload(op='aten._test_string_default', overload='default')>,
 <OpOverload(op='aten._thnn_differentiable_gru_cell_backward', overload='default')>,
 <OpOverload(op='aten._thnn_differentiable_lstm_cell_backward', overload='default')>,
 <OpOverload(op='aten._thnn_fused_lstm_cell_backward', overload='default')>,
 <OpOverload(op='aten._to_cpu', overload='default')>,
 <OpOverload(op='aten._upsample_bicubic2d_aa', overload='vec')>,
 <OpOverload(op='aten._upsample_bilinear2d_aa', overload='vec')>,
 <OpOverload(op='aten._upsample_nearest_exact1d', overload='default')>,
 <OpOverload(op='aten._upsample_nearest_exact1d', overload='vec')>,
 <OpOverload(op='aten._upsample_nearest_exact2d', overload='default')>,
 <OpOverload(op='aten._upsample_nearest_exact2d', overload='vec')>,
 <OpOverload(op='aten._upsample_nearest_exact3d', overload='default')>,
 <OpOverload(op='aten._upsample_nearest_exact3d', overload='vec')>,
 <OpOverload(op='aten._use_cudnn_rnn_flatten_weight', overload='default')>,
 <OpOverload(op='aten._validate_sparse_bsc_tensor_args', overload='default')>,
 <OpOverload(op='aten._validate_sparse_bsr_tensor_args', overload='default')>,
 <OpOverload(op='aten._validate_sparse_compressed_tensor_args', overload='default')>,
 <OpOverload(op='aten._validate_sparse_coo_tensor_args', overload='default')>,
 <OpOverload(op='aten._validate_sparse_csc_tensor_args', overload='default')>,
 <OpOverload(op='aten._validate_sparse_csr_tensor_args', overload='default')>,
 <OpOverload(op='aten._version', overload='default')>,
 <OpOverload(op='aten._weight_norm', overload='default')>,
 <OpOverload(op='aten._weight_norm_differentiable_backward', overload='default')>,
 <OpOverload(op='aten.absolute', overload='default')>,
 <OpOverload(op='aten.adaptive_avg_pool1d', overload='default')>,
 <OpOverload(op='aten.adaptive_avg_pool2d', overload='default')>,
 <OpOverload(op='aten.adaptive_avg_pool3d', overload='default')>,
 <OpOverload(op='aten.adaptive_max_pool1d', overload='default')>,
 <OpOverload(op='aten.affine_grid_generator_backward', overload='default')>,
 <OpOverload(op='aten.align_as', overload='default')>,
 <OpOverload(op='aten.align_tensors', overload='default')>,
 <OpOverload(op='aten.all', overload='dimname')>,
 <OpOverload(op='aten.any', overload='dimname')>,
 <OpOverload(op='aten.arccos', overload='default')>,
 <OpOverload(op='aten.arccosh', overload='default')>,
 <OpOverload(op='aten.arcsin', overload='default')>,
 <OpOverload(op='aten.arcsinh', overload='default')>,
 <OpOverload(op='aten.arctan', overload='default')>,
 <OpOverload(op='aten.arctan2', overload='default')>,
 <OpOverload(op='aten.arctanh', overload='default')>,
 <OpOverload(op='aten.argsort', overload='default')>,
 <OpOverload(op='aten.argsort', overload='dimname')>,
 <OpOverload(op='aten.argsort', overload='stable')>,
 <OpOverload(op='aten.argwhere', overload='default')>,
 <OpOverload(op='aten.atleast_1d', overload='Sequence')>,
 <OpOverload(op='aten.atleast_2d', overload='Sequence')>,
 <OpOverload(op='aten.atleast_3d', overload='Sequence')>,
 <OpOverload(op='aten.avg_pool1d', overload='default')>,
 <OpOverload(op='aten.bilinear', overload='default')>,
 <OpOverload(op='aten.broadcast_tensors', overload='default')>,
 <OpOverload(op='aten.can_cast', overload='default')>,
 <OpOverload(op='aten.cat', overload='names')>,
 <OpOverload(op='aten.cdist', overload='default')>,
 <OpOverload(op='aten.chain_matmul', overload='default')>,
 <OpOverload(op='aten.chalf', overload='default')>,
 <OpOverload(op='aten.choose_qparams_optimized', overload='default')>,
 <OpOverload(op='aten.clip', overload='Tensor')>,
 <OpOverload(op='aten.clip', overload='default')>,
 <OpOverload(op='aten.column_stack', overload='default')>,
 <OpOverload(op='aten.combinations', overload='default')>,
 <OpOverload(op='aten.concat', overload='default')>,
 <OpOverload(op='aten.concat', overload='names')>,
 <OpOverload(op='aten.concatenate', overload='default')>,
 <OpOverload(op='aten.concatenate', overload='names')>,
 <OpOverload(op='aten.conv1d', overload='default')>,
 <OpOverload(op='aten.conv1d', overload='padding')>,
 <OpOverload(op='aten.conv2d', overload='default')>,
 <OpOverload(op='aten.conv2d', overload='padding')>,
 <OpOverload(op='aten.conv3d', overload='default')>,
 <OpOverload(op='aten.conv3d', overload='padding')>,
 <OpOverload(op='aten.conv_tbc_backward', overload='default')>,
 <OpOverload(op='aten.conv_transpose1d', overload='default')>,
 <OpOverload(op='aten.conv_transpose2d', overload='input')>,
 <OpOverload(op='aten.conv_transpose3d', overload='input')>,
 <OpOverload(op='aten.corrcoef', overload='default')>,
 <OpOverload(op='aten.cosine_embedding_loss', overload='default')>,
 <OpOverload(op='aten.cosine_similarity', overload='default')>,
 <OpOverload(op='aten.cov', overload='default')>,
 <OpOverload(op='aten.cross', overload='default')>,
 <OpOverload(op='aten.cross_entropy_loss', overload='default')>,
 <OpOverload(op='aten.ctc_loss', overload='IntList')>,
 <OpOverload(op='aten.ctc_loss', overload='Tensor')>,
 <OpOverload(op='aten.cudnn_is_acceptable', overload='default')>,
 <OpOverload(op='aten.cummax', overload='dimname')>,
 <OpOverload(op='aten.cummaxmin_backward', overload='default')>,
 <OpOverload(op='aten.cummin', overload='dimname')>,
 <OpOverload(op='aten.cumprod', overload='dimname')>,
 <OpOverload(op='aten.cumprod_backward', overload='default')>,
 <OpOverload(op='aten.cumsum', overload='dimname')>,
 <OpOverload(op='aten.cumulative_trapezoid', overload='dx')>,
 <OpOverload(op='aten.cumulative_trapezoid', overload='x')>,
 <OpOverload(op='aten.data', overload='default')>,
 <OpOverload(op='aten.det', overload='default')>,
 <OpOverload(op='aten.diag', overload='default')>,
 <OpOverload(op='aten.diagflat', overload='default')>,
 <OpOverload(op='aten.diff', overload='default')>,
 <OpOverload(op='aten.divide', overload='Scalar')>,
 <OpOverload(op='aten.divide', overload='Scalar_mode')>,
 <OpOverload(op='aten.divide', overload='Tensor')>,
 <OpOverload(op='aten.divide', overload='Tensor_mode')>,
 <OpOverload(op='aten.dstack', overload='default')>,
 <OpOverload(op='aten.einsum', overload='default')>,
 <OpOverload(op='aten.embedding_backward', overload='default')>,
 <OpOverload(op='aten.embedding_bag', overload='default')>,
 <OpOverload(op='aten.embedding_bag', overload='padding_idx')>,
 <OpOverload(op='aten.embedding_sparse_backward', overload='default')>,
 <OpOverload(op='aten.fake_quantize_per_channel_affine', overload='default')>,
 <OpOverload(op='aten.fake_quantize_per_channel_affine_cachemask_backward', overload='default')>,
 <OpOverload(op='aten.fake_quantize_per_tensor_affine', overload='default')>,
 <OpOverload(op='aten.fake_quantize_per_tensor_affine', overload='tensor_qparams')>,
 <OpOverload(op='aten.fake_quantize_per_tensor_affine_cachemask_backward', overload='default')>,
 <OpOverload(op='aten.fbgemm_linear_fp16_weight', overload='default')>,
 <OpOverload(op='aten.fbgemm_linear_fp16_weight_fp32_activation', overload='default')>,
 <OpOverload(op='aten.fbgemm_linear_int8_weight', overload='default')>,
 <OpOverload(op='aten.fbgemm_linear_int8_weight_fp32_activation', overload='default')>,
 <OpOverload(op='aten.fbgemm_linear_quantize_weight', overload='default')>,
 <OpOverload(op='aten.fbgemm_pack_gemm_matrix_fp16', overload='default')>,
 <OpOverload(op='aten.fbgemm_pack_quantized_matrix', overload='KN')>,
 <OpOverload(op='aten.fbgemm_pack_quantized_matrix', overload='default')>,
 <OpOverload(op='aten.fft_fft', overload='default')>,
 <OpOverload(op='aten.fft_fft2', overload='default')>,
 <OpOverload(op='aten.fft_fftn', overload='default')>,
 <OpOverload(op='aten.fft_fftshift', overload='default')>,
 <OpOverload(op='aten.fft_hfft', overload='default')>,
 <OpOverload(op='aten.fft_hfft2', overload='default')>,
 <OpOverload(op='aten.fft_hfftn', overload='default')>,
 <OpOverload(op='aten.fft_ifft', overload='default')>,
 <OpOverload(op='aten.fft_ifft2', overload='default')>,
 <OpOverload(op='aten.fft_ifftn', overload='default')>,
 <OpOverload(op='aten.fft_ifftshift', overload='default')>,
 <OpOverload(op='aten.fft_ihfft', overload='default')>,
 <OpOverload(op='aten.fft_ihfft2', overload='default')>,
 <OpOverload(op='aten.fft_ihfftn', overload='default')>,
 <OpOverload(op='aten.fft_irfft', overload='default')>,
 <OpOverload(op='aten.fft_irfft2', overload='default')>,
 <OpOverload(op='aten.fft_irfftn', overload='default')>,
 <OpOverload(op='aten.fft_rfft', overload='default')>,
 <OpOverload(op='aten.fft_rfft2', overload='default')>,
 <OpOverload(op='aten.fft_rfftn', overload='default')>,
 <OpOverload(op='aten.fix', overload='default')>,
 <OpOverload(op='aten.flatten_dense_tensors', overload='default')>,
 <OpOverload(op='aten.fliplr', overload='default')>,
 <OpOverload(op='aten.flipud', overload='default')>,
 <OpOverload(op='aten.float_power', overload='Scalar')>,
 <OpOverload(op='aten.float_power', overload='Tensor_Scalar')>,
 <OpOverload(op='aten.float_power', overload='Tensor_Tensor')>,
 <OpOverload(op='aten.frobenius_norm', overload='dim')>,
 <OpOverload(op='aten.gather', overload='dimname')>,
 <OpOverload(op='aten.gather_backward', overload='default')>,
 <OpOverload(op='aten.ger', overload='default')>,
 <OpOverload(op='aten.gradient', overload='array')>,
 <OpOverload(op='aten.gradient', overload='scalararray')>,
 <OpOverload(op='aten.gradient', overload='scalarint')>,
 <OpOverload(op='aten.gradient', overload='scalarrayarray')>,
 <OpOverload(op='aten.gradient', overload='scalarrayint')>,
 <OpOverload(op='aten.gradient', overload='tensorarray')>,
 <OpOverload(op='aten.gradient', overload='tensorarrayint')>,
 <OpOverload(op='aten.greater', overload='Scalar')>,
 <OpOverload(op='aten.greater', overload='Tensor')>,
 <OpOverload(op='aten.greater_equal', overload='Scalar')>,
 <OpOverload(op='aten.greater_equal', overload='Tensor')>,
 <OpOverload(op='aten.grid_sampler', overload='default')>,
 <OpOverload(op='aten.group_norm', overload='default')>,
 <OpOverload(op='aten.gru', overload='data')>,
 <OpOverload(op='aten.gru', overload='input')>,
 <OpOverload(op='aten.gru_cell', overload='default')>,
 <OpOverload(op='aten.hinge_embedding_loss', overload='default')>,
 <OpOverload(op='aten.histogramdd', overload='TensorList_bins')>,
 <OpOverload(op='aten.histogramdd', overload='default')>,
 <OpOverload(op='aten.histogramdd', overload='int_bins')>,
 <OpOverload(op='aten.hstack', overload='default')>,
 <OpOverload(op='aten.index_add', overload='dimname')>,
 <OpOverload(op='aten.index_copy', overload='dimname')>,
 <OpOverload(op='aten.index_fill', overload='Dimname_Scalar')>,
 <OpOverload(op='aten.index_fill', overload='Dimname_Tensor')>,
 <OpOverload(op='aten.index_select', overload='dimname')>,
 <OpOverload(op='aten.index_select_backward', overload='default')>,
 <OpOverload(op='aten.infinitely_differentiable_gelu_backward', overload='default')>,
 <OpOverload(op='aten.inner', overload='default')>,
 <OpOverload(op='aten.instance_norm', overload='default')>,
 <OpOverload(op='aten.inverse', overload='default')>,
 <OpOverload(op='aten.is_complex', overload='default')>,
 <OpOverload(op='aten.is_conj', overload='default')>,
 <OpOverload(op='aten.is_distributed', overload='default')>,
 <OpOverload(op='aten.is_floating_point', overload='default')>,
 <OpOverload(op='aten.is_inference', overload='default')>,
 <OpOverload(op='aten.is_leaf', overload='default')>,
 <OpOverload(op='aten.is_neg', overload='default')>,
 <OpOverload(op='aten.is_nonzero', overload='default')>,
 <OpOverload(op='aten.is_signed', overload='default')>,
 <OpOverload(op='aten.is_vulkan_available', overload='default')>,
 <OpOverload(op='aten.isclose', overload='default')>,
 <OpOverload(op='aten.isfinite', overload='default')>,
 <OpOverload(op='aten.isreal', overload='default')>,
 <OpOverload(op='aten.istft', overload='default')>,
 <OpOverload(op='aten.item', overload='default')>,
 <OpOverload(op='aten.kl_div', overload='default')>,
 <OpOverload(op='aten.kron', overload='default')>,
 <OpOverload(op='aten.kthvalue', overload='dimname')>,
 <OpOverload(op='aten.l1_loss', overload='default')>,
 <OpOverload(op='aten.layer_norm', overload='default')>,
 <OpOverload(op='aten.ldexp', overload='Tensor')>,
 <OpOverload(op='aten.less', overload='Scalar')>,
 <OpOverload(op='aten.less', overload='Tensor')>,
 <OpOverload(op='aten.less_equal', overload='Scalar')>,
 <OpOverload(op='aten.less_equal', overload='Tensor')>,
 <OpOverload(op='aten.linalg_cholesky', overload='default')>,
 <OpOverload(op='aten.linalg_cond', overload='default')>,
 <OpOverload(op='aten.linalg_cond', overload='p_str')>,
 <OpOverload(op='aten.linalg_det', overload='default')>,
 <OpOverload(op='aten.linalg_eigh', overload='default')>,
 <OpOverload(op='aten.linalg_eigvals', overload='default')>,
 <OpOverload(op='aten.linalg_eigvalsh', overload='default')>,
 <OpOverload(op='aten.linalg_inv', overload='default')>,
 <OpOverload(op='aten.linalg_ldl_factor', overload='default')>,
 <OpOverload(op='aten.linalg_lu_factor', overload='default')>,
 <OpOverload(op='aten.linalg_matmul', overload='default')>,
 <OpOverload(op='aten.linalg_matrix_norm', overload='default')>,
 <OpOverload(op='aten.linalg_matrix_norm', overload='str_ord')>,
 <OpOverload(op='aten.linalg_matrix_power', overload='default')>,
 <OpOverload(op='aten.linalg_matrix_rank', overload='atol_rtol_float')>,
 <OpOverload(op='aten.linalg_matrix_rank', overload='atol_rtol_tensor')>,
 <OpOverload(op='aten.linalg_matrix_rank', overload='default')>,
 <OpOverload(op='aten.linalg_matrix_rank', overload='tol_tensor')>,
 <OpOverload(op='aten.linalg_multi_dot', overload='default')>,
 <OpOverload(op='aten.linalg_norm', overload='default')>,
 <OpOverload(op='aten.linalg_norm', overload='ord_str')>,
 <OpOverload(op='aten.linalg_pinv', overload='atol_rtol_float')>,
 <OpOverload(op='aten.linalg_pinv', overload='default')>,
 <OpOverload(op='aten.linalg_pinv', overload='rcond_tensor')>,
 <OpOverload(op='aten.linalg_slogdet', overload='default')>,
 <OpOverload(op='aten.linalg_solve', overload='default')>,
 <OpOverload(op='aten.linalg_solve_ex', overload='default')>,
 <OpOverload(op='aten.linalg_svd', overload='default')>,
 <OpOverload(op='aten.linalg_svdvals', overload='default')>,
 <OpOverload(op='aten.linalg_tensorinv', overload='default')>,
 <OpOverload(op='aten.linalg_tensorsolve', overload='default')>,
 <OpOverload(op='aten.linalg_vander', overload='default')>,
 <OpOverload(op='aten.linalg_vecdot', overload='default')>,
 <OpOverload(op='aten.linear', overload='default')>,
 <OpOverload(op='aten.log_sigmoid', overload='default')>,
 <OpOverload(op='aten.log_softmax', overload='Dimname')>,
 <OpOverload(op='aten.log_softmax', overload='int')>,
 <OpOverload(op='aten.logcumsumexp', overload='dimname')>,
 <OpOverload(op='aten.logdet', overload='default')>,
 <OpOverload(op='aten.logsumexp', overload='names')>,
 <OpOverload(op='aten.lstm', overload='data')>,
 <OpOverload(op='aten.lstm', overload='input')>,
 <OpOverload(op='aten.lstm_cell', overload='default')>,
 <OpOverload(op='aten.lu_solve', overload='default')>,
 <OpOverload(op='aten.margin_ranking_loss', overload='default')>,
 <OpOverload(op='aten.masked_select_backward', overload='default')>,
 <OpOverload(op='aten.matmul', overload='default')>,
 <OpOverload(op='aten.matrix_exp', overload='default')>,
 <OpOverload(op='aten.matrix_exp_backward', overload='default')>,
 <OpOverload(op='aten.matrix_power', overload='default')>,
 <OpOverload(op='aten.max', overload='names_dim')>,
 <OpOverload(op='aten.max', overload='other')>,
 <OpOverload(op='aten.max_pool1d', overload='default')>,
 <OpOverload(op='aten.max_pool1d_with_indices', overload='default')>,
 <OpOverload(op='aten.max_pool2d', overload='default')>,
 <OpOverload(op='aten.max_pool3d', overload='default')>,
 <OpOverload(op='aten.mean', overload='names_dim')>,
 <OpOverload(op='aten.median', overload='names_dim')>,
 <OpOverload(op='aten.meshgrid', overload='default')>,
 <OpOverload(op='aten.meshgrid', overload='indexing')>,
 <OpOverload(op='aten.min', overload='names_dim')>,
 <OpOverload(op='aten.min', overload='other')>,
 <OpOverload(op='aten.mish_backward', overload='default')>,
 <OpOverload(op='aten.mode', overload='dimname')>,
 <OpOverload(op='aten.msort', overload='default')>,
 <OpOverload(op='aten.multilabel_margin_loss', overload='default')>,
 <OpOverload(op='aten.multiply', overload='Scalar')>,
 <OpOverload(op='aten.multiply', overload='Tensor')>,
 <OpOverload(op='aten.nanmean', overload='default')>,
 <OpOverload(op='aten.nanmedian', overload='names_dim')>,
 <OpOverload(op='aten.nanquantile', overload='default')>,
 <OpOverload(op='aten.nanquantile', overload='scalar')>,
 <OpOverload(op='aten.native_channel_shuffle', overload='default')>,
 <OpOverload(op='aten.negative', overload='default')>,
 <OpOverload(op='aten.nested_to_padded_tensor', overload='default')>,
 <OpOverload(op='aten.nll_loss', overload='default')>,
 <OpOverload(op='aten.nll_loss2d', overload='default')>,
 <OpOverload(op='aten.nll_loss_nd', overload='default')>,
 <OpOverload(op='aten.nonzero_numpy', overload='default')>,
 <OpOverload(op='aten.norm', overload='names_ScalarOpt_dim')>,
 <OpOverload(op='aten.norm', overload='names_ScalarOpt_dim_dtype')>,
 <OpOverload(op='aten.norm_except_dim', overload='default')>,
 <OpOverload(op='aten.not_equal', overload='Scalar')>,
 <OpOverload(op='aten.not_equal', overload='Tensor')>,
 <OpOverload(op='aten.nuclear_norm', overload='default')>,
 <OpOverload(op='aten.nuclear_norm', overload='dim')>,
 <OpOverload(op='aten.one_hot', overload='default')>,
 <OpOverload(op='aten.orgqr', overload='default')>,
 <OpOverload(op='aten.outer', overload='default')>,
 <OpOverload(op='aten.output_nr', overload='default')>,
 <OpOverload(op='aten.pad', overload='default')>,
 <OpOverload(op='aten.pad_sequence', overload='default')>,
 <OpOverload(op='aten.pairwise_distance', overload='default')>,
 <OpOverload(op='aten.pdist', overload='default')>,
 <OpOverload(op='aten.pinverse', overload='default')>,
 <OpOverload(op='aten.poisson_nll_loss', overload='default')>,
 <OpOverload(op='aten.prelu', overload='default')>,
 <OpOverload(op='aten.prod', overload='dim_Dimname')>,
 <OpOverload(op='aten.promote_types', overload='default')>,
 <OpOverload(op='aten.qr', overload='default')>,
 <OpOverload(op='aten.quantile', overload='default')>,
 <OpOverload(op='aten.quantile', overload='scalar')>,
 <OpOverload(op='aten.quantized_gru_cell', overload='default')>,
 <OpOverload(op='aten.quantized_lstm_cell', overload='default')>,
 <OpOverload(op='aten.quantized_rnn_relu_cell', overload='default')>,
 <OpOverload(op='aten.quantized_rnn_tanh_cell', overload='default')>,
 <OpOverload(op='aten.relu6', overload='default')>,
 <OpOverload(op='aten.repeat_interleave', overload='self_Tensor')>,
 <OpOverload(op='aten.repeat_interleave', overload='self_int')>,
 <OpOverload(op='aten.result_type', overload='Scalar')>,
 <OpOverload(op='aten.result_type', overload='Scalar_Scalar')>,
 <OpOverload(op='aten.result_type', overload='Scalar_Tensor')>,
 <OpOverload(op='aten.result_type', overload='Tensor')>,
 <OpOverload(op='aten.retains_grad', overload='default')>,
 <OpOverload(op='aten.rms_norm', overload='default')>,
 <OpOverload(op='aten.rnn_relu', overload='data')>,
 <OpOverload(op='aten.rnn_relu', overload='input')>,
 <OpOverload(op='aten.rnn_relu_cell', overload='default')>,
 <OpOverload(op='aten.rnn_tanh', overload='data')>,
 <OpOverload(op='aten.rnn_tanh', overload='input')>,
 <OpOverload(op='aten.rnn_tanh_cell', overload='default')>,
 <OpOverload(op='aten.row_stack', overload='default')>,
 <OpOverload(op='aten.rrelu', overload='default')>,
 <OpOverload(op='aten.scaled_dot_product_attention', overload='default')>,
 <OpOverload(op='aten.scatter', overload='dimname_src')>,
 <OpOverload(op='aten.scatter', overload='dimname_value')>,
 <OpOverload(op='aten.scatter_add', overload='dimname')>,
 <OpOverload(op='aten.selu', overload='default')>,
 <OpOverload(op='aten.silu_backward', overload='default')>,
 <OpOverload(op='aten.size', overload='Dimname')>,
 <OpOverload(op='aten.size', overload='int')>,
 <OpOverload(op='aten.slogdet', overload='default')>,
 <OpOverload(op='aten.slow_conv3d', overload='default')>,
 <OpOverload(op='aten.smm', overload='default')>,
 <OpOverload(op='aten.softmax', overload='Dimname')>,
 <OpOverload(op='aten.softmax', overload='int')>,
 <OpOverload(op='aten.sort', overload='dimname')>,
 <OpOverload(op='aten.sort', overload='dimname_stable')>,
 <OpOverload(op='aten.sparse_bsc_tensor', overload='ccol_row_value')>,
 <OpOverload(op='aten.sparse_bsc_tensor', overload='ccol_row_value_size')>,
 <OpOverload(op='aten.sparse_bsr_tensor', overload='crow_col_value')>,
 <OpOverload(op='aten.sparse_bsr_tensor', overload='crow_col_value_size')>,
 <OpOverload(op='aten.sparse_coo_tensor', overload='indices')>,
 <OpOverload(op='aten.sparse_coo_tensor', overload='indices_size')>,
 <OpOverload(op='aten.sparse_csc_tensor', overload='ccol_row_value')>,
 <OpOverload(op='aten.sparse_csc_tensor', overload='ccol_row_value_size')>,
 <OpOverload(op='aten.sparse_csr_tensor', overload='crow_col_value')>,
 <OpOverload(op='aten.sparse_csr_tensor', overload='crow_col_value_size')>,
 <OpOverload(op='aten.special_digamma', overload='default')>,
 <OpOverload(op='aten.special_erf', overload='default')>,
 <OpOverload(op='aten.special_erfc', overload='default')>,
 <OpOverload(op='aten.special_erfinv', overload='default')>,
 <OpOverload(op='aten.special_exp2', overload='default')>,
 <OpOverload(op='aten.special_expit', overload='default')>,
 <OpOverload(op='aten.special_expm1', overload='default')>,
 <OpOverload(op='aten.special_gammainc', overload='default')>,
 <OpOverload(op='aten.special_gammaincc', overload='default')>,
 <OpOverload(op='aten.special_gammaln', overload='default')>,
 <OpOverload(op='aten.special_i0', overload='default')>,
 <OpOverload(op='aten.special_log1p', overload='default')>,
 <OpOverload(op='aten.special_log_softmax', overload='default')>,
 <OpOverload(op='aten.special_logit', overload='default')>,
 <OpOverload(op='aten.special_logsumexp', overload='default')>,
 <OpOverload(op='aten.special_multigammaln', overload='default')>,
 <OpOverload(op='aten.special_ndtr', overload='default')>,
 <OpOverload(op='aten.special_polygamma', overload='default')>,
 <OpOverload(op='aten.special_psi', overload='default')>,
 <OpOverload(op='aten.special_round', overload='default')>,
 <OpOverload(op='aten.special_sinc', overload='default')>,
 <OpOverload(op='aten.special_softmax', overload='default')>,
 <OpOverload(op='aten.special_xlogy', overload='default')>,
 <OpOverload(op='aten.special_xlogy', overload='other_scalar')>,
 <OpOverload(op='aten.special_xlogy', overload='self_scalar')>,
 <OpOverload(op='aten.square', overload='default')>,
 <OpOverload(op='aten.sspaddmm', overload='default')>,
 <OpOverload(op='aten.std', overload='correction_names')>,
 <OpOverload(op='aten.std', overload='default')>,
 <OpOverload(op='aten.std', overload='dim')>,
 <OpOverload(op='aten.std', overload='names_dim')>,
 <OpOverload(op='aten.std_mean', overload='correction_names')>,
 <OpOverload(op='aten.std_mean', overload='default')>,
 <OpOverload(op='aten.std_mean', overload='dim')>,
 <OpOverload(op='aten.std_mean', overload='names_dim')>,
 <OpOverload(op='aten.stft', overload='center')>,
 <OpOverload(op='aten.stft', overload='default')>,
 <OpOverload(op='aten.stride', overload='Dimname')>,
 <OpOverload(op='aten.stride', overload='int')>,
 <OpOverload(op='aten.subtract', overload='Scalar')>,
 <OpOverload(op='aten.subtract', overload='Tensor')>,
 <OpOverload(op='aten.sum', overload='dim_DimnameList')>,
 <OpOverload(op='aten.sum_to_size', overload='default')>,
 <OpOverload(op='aten.svd', overload='default')>,
 <OpOverload(op='aten.sym_size', overload='int')>,
 <OpOverload(op='aten.sym_stride', overload='int')>,
 <OpOverload(op='aten.take_along_dim', overload='default')>,
 <OpOverload(op='aten.tensordot', overload='default')>,
 <OpOverload(op='aten.thnn_conv2d', overload='default')>,
 <OpOverload(op='aten.tile', overload='default')>,
 <OpOverload(op='aten.to_dense', overload='default')>,
 <OpOverload(op='aten.to_dense_backward', overload='default')>,
 <OpOverload(op='aten.to_mkldnn_backward', overload='default')>,
 <OpOverload(op='aten.to_sparse', overload='default')>,
 <OpOverload(op='aten.to_sparse', overload='sparse_dim')>,
 <OpOverload(op='aten.to_sparse_bsc', overload='default')>,
 <OpOverload(op='aten.to_sparse_bsr', overload='default')>,
 <OpOverload(op='aten.to_sparse_csc', overload='default')>,
 <OpOverload(op='aten.to_sparse_csr', overload='default')>,
 <OpOverload(op='aten.trace_backward', overload='default')>,
 <OpOverload(op='aten.trapezoid', overload='dx')>,
 <OpOverload(op='aten.trapezoid', overload='x')>,
 <OpOverload(op='aten.trapz', overload='dx')>,
 <OpOverload(op='aten.trapz', overload='x')>,
 <OpOverload(op='aten.triplet_margin_loss', overload='default')>,
 <OpOverload(op='aten.true_divide', overload='Scalar')>,
 <OpOverload(op='aten.true_divide', overload='Tensor')>,
 <OpOverload(op='aten.type_as', overload='default')>,
 <OpOverload(op='aten.unflatten_dense_tensors', overload='default')>,
 <OpOverload(op='aten.upsample_bicubic2d', overload='vec')>,
 <OpOverload(op='aten.upsample_bilinear2d', overload='vec')>,
 <OpOverload(op='aten.upsample_linear1d', overload='vec')>,
 <OpOverload(op='aten.upsample_nearest1d', overload='default')>,
 <OpOverload(op='aten.upsample_nearest1d', overload='vec')>,
 <OpOverload(op='aten.upsample_nearest2d', overload='default')>,
 <OpOverload(op='aten.upsample_nearest2d', overload='vec')>,
 <OpOverload(op='aten.upsample_nearest3d', overload='default')>,
 <OpOverload(op='aten.upsample_nearest3d', overload='vec')>,
 <OpOverload(op='aten.upsample_trilinear3d', overload='vec')>,
 <OpOverload(op='aten.value_selecting_reduction_backward', overload='default')>,
 <OpOverload(op='aten.vander', overload='default')>,
 <OpOverload(op='aten.var', overload='correction_names')>,
 <OpOverload(op='aten.var', overload='default')>,
 <OpOverload(op='aten.var', overload='dim')>,
 <OpOverload(op='aten.var', overload='names_dim')>,
 <OpOverload(op='aten.var_mean', overload='correction_names')>,
 <OpOverload(op='aten.var_mean', overload='default')>,
 <OpOverload(op='aten.var_mean', overload='dim')>,
 <OpOverload(op='aten.var_mean', overload='names_dim')>,
 <OpOverload(op='aten.vstack', overload='default')>,
 <OpOverload(op='aten.where', overload='Scalar')>,
 <OpOverload(op='aten.where', overload='ScalarOther')>,
 <OpOverload(op='aten.where', overload='ScalarSelf')>,
 <OpOverload(op='aten.where', overload='default')>,
 <OpOverload(op='aten.wrapped_linear_prepack', overload='default')>,
 <OpOverload(op='aten.wrapped_quantized_linear_prepacked', overload='default')>
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136153
Approved by: https://github.com/xadupre, https://github.com/gramalingam
2024-09-16 21:28:54 +00:00
b76d1b79e6 Add scaling arguments to bsr_dense_addmm (#136104)
As in the title.

Tackles https://github.com/pytorch/ao/pull/821/files#r1759821413

The PR assumes that the existing tuning parameters are good also when using scaling arguments. This needs to be verified as a follow-up task.

Also, this PR redefines triton-contiguous tensors: the tensor must have strides not larger than 1. This now allows zero strides that previously triggered a `contiguous` call even though the underlying memory buffer was contiguous.

Re: "a considerable slow-down occurs because tensor data is copied element-wise rather than chunk-wise" - this note should point to the code (in torch or triton?) that implements the element/chunk-wise copy so that we can verify that allowing zero strides indeed does not trigger element-wise copies. For now, the performance increase in the ViT-H benchmarks (which involve 0-stride tensors) is evidence that allowing zero strides does not lead to slow-downs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136104
Approved by: https://github.com/cpuhrsch
2024-09-16 20:26:54 +00:00
bfbcdf4967 Revert "[dynamo] Fix support for classmethod(property(...)) (#134968)"
This reverts commit c64ae601ba9eb3ad2cd3402a14f6ac83c0ab7eba.

Reverted https://github.com/pytorch/pytorch/pull/134968 on behalf of https://github.com/jeanschmidt due to Breaking internal signals, we need to skip the new tests on py3.10 ([comment](https://github.com/pytorch/pytorch/pull/134968#issuecomment-2353909010))
2024-09-16 20:26:35 +00:00
3c97b0ab00 Use ncclAlltoAllv and ncclAlltoAll API when supported (#134499)
NCCL does not always provide an API for ncclAllToAll and ncclAllToAllv, so PyTorch falls back to point-to-point send/recv. Expose and use these APIs when they are supported.
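
For reference, a hedged sketch of the Python-level collective affected here; it assumes a process group has already been initialized with the NCCL backend (e.g. via torchrun), which is why the call is left inside a function.

```
import torch
import torch.distributed as dist

def alltoall_demo():
    # Assumes dist.init_process_group("nccl", ...) has already run.
    world = dist.get_world_size()
    rank = float(dist.get_rank())
    inp = [torch.full((4,), rank, device="cuda") for _ in range(world)]
    out = [torch.empty(4, device="cuda") for _ in range(world)]
    # Backed by ncclAllToAll / ncclAllToAllv when the NCCL build exposes them,
    # otherwise by point-to-point send/recv as before.
    dist.all_to_all(out, inp)
    return out
```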

Differential Revision: [D61683836](https://our.internmc.facebook.com/intern/diff/D61683836/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134499
Approved by: https://github.com/shuqiangzhang, https://github.com/eqy
2024-09-16 20:08:06 +00:00
abd16a8c64 [torch/multiprocessing] Use multiprocessing.reduction.register instead of ForkingPickler.register to register custom tensor and storage reductions (#135030)
Right now `multiprocessing.reduction.register()` is simply an alias for `multiprocessing.reduction.ForkingPickler.register()`
(https://github.com/python/cpython/blame/main/Lib/multiprocessing/reduction.py#L56), but the top-level `register()` function exposes fewer of the internal details of the `multiprocessing.reduction` module.
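A minimal sketch of the two registration styles (the `MyBuffer` class and its reduce function are hypothetical, just for illustration):

```
import multiprocessing.reduction as reduction

class MyBuffer:
    def __init__(self, data):
        self.data = data

def reduce_mybuffer(buf):
    # Returns (callable, args) used to rebuild the object on the receiving side.
    return (MyBuffer, (buf.data,))

# Top-level helper, which hides the pickler internals:
reduction.register(MyBuffer, reduce_mybuffer)
# Equivalent today, but reaches into the internal class directly:
# reduction.ForkingPickler.register(MyBuffer, reduce_mybuffer)
```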
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135030
Approved by: https://github.com/albanD
2024-09-16 20:07:29 +00:00
a0c7029a75 [c10d][Reland] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931) (#135653)
We introduced the dispatchable backend for a ProcessGroup and its collectives in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up cleanup that removes the Options from ProcessGroup and asks users to either set the timeout or backend later on, or to create the backend directly after creating a PG.

Also, PGNCCL was using the Options class from ProcessGroup, but it should use the Options from the backend class, so this PR aligns the type and name with what we do on the C++ side. The signature of the public API is unchanged, so it still uses an argument named "pg_options".
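
A hedged sketch of what this looks like from Python (assuming a build with NCCL available): the backend-specific Options object comes from the backend class, while the public argument name stays `pg_options`.

```
import torch.distributed as dist

if dist.is_nccl_available():
    opts = dist.ProcessGroupNCCL.Options()   # Options from the backend class
    opts.is_high_priority_stream = True
    # In a real launch (e.g. under torchrun):
    # dist.init_process_group("nccl", pg_options=opts)
```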

The tests are updated to align with this change.

This is an attempt to reland D62008954 with the internal errors fixed.

Differential Revision: [D62483294](https://our.internmc.facebook.com/intern/diff/D62483294/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135653
Approved by: https://github.com/wz337, https://github.com/H-Huang
2024-09-16 19:56:42 +00:00
7537f74277 Refactor FxGraphCache.load into separate functions, so that AOTAutogradCache may access it correctly later (#135491)
Summary:
We refactor FxGraphCache.load into three phases:
- prepare_key, which checks that an inductor input is cacheable and bypasses otherwise
- load_with_key, which tries to lookup the key in the cache
- post compile, where we do some logging and run post compile steps

Splitting it along these lines will allow AOTAutogradCache to use load_with_key and still get access to all of the observability + remote cache logic when accessing FxGraphCache, without needing to pass key components, etc.
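
A rough, self-contained sketch of the three-phase control flow described above (stub functions, not the real torch._inductor API):

```
def prepare_key(graph):
    # Bypass the cache entirely for non-cacheable inputs.
    return ("cache-key", ["debug line"]) if graph.get("cacheable", True) else None

def load_with_key(key, local_cache):
    return local_cache.get(key)          # None on a cache miss

def post_compile(artifact, hit):
    print("cache hit" if hit else "cache miss")   # logging + post-compile steps
    return artifact

def load(graph, local_cache, compile_fn):
    prepared = prepare_key(graph)
    if prepared is None:
        return compile_fn(graph)         # bypass path
    key, _debug_lines = prepared
    artifact = load_with_key(key, local_cache)
    hit = artifact is not None
    if not hit:
        artifact = compile_fn(graph)
        local_cache[key] = artifact
    return post_compile(artifact, hit)

print(load({"cacheable": True}, {}, lambda g: "compiled artifact"))
```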

Differential Revision: D62314862

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135491
Approved by: https://github.com/oulgen
2024-09-16 19:48:08 +00:00
31715be72a [BE]: Update mypy to 1.11.2 (#133816)
Updates mypy to 1.11.2 to improve type inference

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133816
Approved by: https://github.com/ezyang
2024-09-16 19:44:11 +00:00
38caf10411 [EZ] Fix spelling typo (#136157)
s/toosl/tools/ (spotted by @louie-tsai)
Also, capitalize CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136157
Approved by: https://github.com/kit1980
2024-09-16 19:30:30 +00:00
c977bb7d03 [Distributed] fix FileSystemWriter __init__ (#136135)
Fixes #135608.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136135
Approved by: https://github.com/Skylion007
2024-09-16 19:11:08 +00:00
717fca2cac Drop outdated section 'Running clang-tidy' in CONTRIBUTING.md (#136146)
Fixes #125920

[Running clang-tidy](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#running-clang-tidy) section is misleading and outdated. C++ lint is done with lintrunner and covered in [local-linting](https://github.com/pytorch/pytorch/blob/main/CONTRIBUTING.md#local-linting) section.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136146
Approved by: https://github.com/janeyx99
2024-09-16 19:02:21 +00:00
f89ce4dfbb torch.nn.MultiheadAttention: docs: improvement (#136111)
`torch.nn.MultiheadAttention`: docs: improvement
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136111
Approved by: https://github.com/janeyx99
2024-09-16 18:52:20 +00:00
d3647d15e6 Remove accidentally committed code (#136154)
Accidentally left out during rebase

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136154
Approved by: https://github.com/kit1980, https://github.com/albanD
2024-09-16 18:34:20 +00:00
d0cebedb31 Revert "Add Triton CPU as an Inductor backend (#133408)"
This reverts commit e498b02b472e45cfd6b7a08db0d6c1babec655c5.

Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))
2024-09-16 18:33:33 +00:00
7fe004f7cf Revert "Add CI for Triton CPU backend (#135342)"
This reverts commit 426580a67db15ec17b2b861a09667bf59927e033.

Reverted https://github.com/pytorch/pytorch/pull/135342 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816))
2024-09-16 18:33:33 +00:00
23c0d2689e [BE][Ez]: Fix missing float16 coverage for adaptive_pool3d_cpu (#136091)
Testing if op info coverage has issues

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136091
Approved by: https://github.com/ezyang
2024-09-16 18:22:16 +00:00
5193f23469 [Pytorch] Cleanup Strobelight URL and shorten for readability (#136102)
Summary:
- Converted the Strobelight URL prefix to more readable and editable JSON
- Dump shortened URLs when possible for easier readability

Test Plan:
```
python ./torch/_strobelight/examples/compile_time_profile_example.py
python torch/_strobelight/examples/cli_function_profiler_example.py
```

Differential Revision: D62690292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136102
Approved by: https://github.com/laithsakka
2024-09-16 18:10:33 +00:00
0199fd4d7e Revert "[inductor] More fixes on the keys of constants and signature dictionaries (#135406)"
This reverts commit e54b559e8860e343692bb5534777b2384a57a613.

Reverted https://github.com/pytorch/pytorch/pull/135406 on behalf of https://github.com/jeanschmidt due to Reverting as it is breaking triton_mtia internal signals @jansel could you have a look and help get those changes merged? ([comment](https://github.com/pytorch/pytorch/pull/135406#issuecomment-2353557481))
2024-09-16 17:58:02 +00:00
b491e2974c [BE][Ez]: Add full half/bfloat16 dtype for unique and isin (#136114)
Fixes #136090

* Add isin support for half dtypes on CPU (just a few extra dispatches).
* The CUDA implementation for bfloat16 was mostly compiled and available all along (it just calls sort and unique internally). To enable it, we only need to remove an assert (sort's functionality has been updated since the assert was added) and add the missing dtype support to unique.
* This unlocks more GPU functionality with minimal code bloat. CPU kernels for these dtypes are also added for parity (see the quick check below).
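
A quick check of the new coverage, assuming a build that includes these dispatches:

```
import torch

x = torch.tensor([1.0, 2.0, 2.0, 3.0], dtype=torch.float16)
print(torch.isin(x, torch.tensor([2.0], dtype=torch.float16)))  # half isin on CPU (added here)
print(torch.unique(x))                                          # half unique on CPU
if torch.cuda.is_available():
    print(torch.unique(x.to("cuda", torch.bfloat16)))           # bfloat16 unique on CUDA (unlocked here)
```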

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136114
Approved by: https://github.com/malfet
2024-09-16 17:49:12 +00:00
0aa41eb52f [ONNX] Run type promotion test in CI and update the table (#135915)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135915
Approved by: https://github.com/gramalingam, https://github.com/xadupre
2024-09-16 16:46:13 +00:00
090046b936 [effects] Turn off dtype promotion for with_effects lowering (#136039)
By default, Inductor promotes arguments to the highest common dtype.
Having an empty token with dtype=torch.float32 results in dtype promotion of effectful ops during the lowering of with_effects.

Disabling dtype promotion for this lowering.

This removes the previous workaround of making the token dtype torch.bool.
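
A small illustration of the kind of promotion the lowering must avoid: a float32 token mixed with a lower-precision argument would drag the effectful op up to float32.

```
import torch

token = torch.empty(0, dtype=torch.float32)   # the empty token threaded through with_effects
x = torch.ones(4, dtype=torch.bfloat16)
print(torch.result_type(token, x))            # torch.float32 -> the op's dtype would be changed
```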

Testing:

```
python test/distributed/test_c10d_functional_native.py -k test_inductor_dtypeview_memory_lea
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136039
Approved by: https://github.com/bdhirsh, https://github.com/eellison, https://github.com/zou3519
2024-09-16 16:14:05 +00:00
c33b0580e6 Add decomposition for squeeze_copy (#130941)
* Extracted from #128416

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130941
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-16 15:46:57 +00:00
13bd1256f9 Delete stable prototype (#135911)
This project ended up going in an entirely different direction, so we can close out all this
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135911
Approved by: https://github.com/izaitsevfb, https://github.com/malfet
2024-09-16 15:32:17 +00:00
d833f49602 [reland][Inductor] Rename cpp_wrapper_cuda.py as cpp_wrapper_gpu.py (#136046)
Summary: Reland https://github.com/pytorch/pytorch/pull/135313 after fixing internal build issues

Test Plan: CI

Differential Revision: D62658837

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136046
Approved by: https://github.com/chenyang78, https://github.com/etaf, https://github.com/jansel
2024-09-16 14:35:19 +00:00
a803cb0531 [AOTI] Refactor how cpp_wrapper specific options are set (#136035)
Summary:
1) When cpp-wrapper is turned on, certain Triton-specific options need to be set, for both forward and backward. This PR consolidates those settings in one place.
2) Change config.triton.autotune_at_compile_time to default to None. If the flag is not explicitly set by the user, it defaults to True for cpp-wrapper.

Differential Revision: [D62689940](https://our.internmc.facebook.com/intern/diff/D62689940)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136035
Approved by: https://github.com/chenyang78
2024-09-16 14:32:13 +00:00
bbc3fdbbde Add python 3.13.0t build to Docker images (#136001)
Adds 3.13t python to Docker images
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136001
Approved by: https://github.com/albanD
2024-09-16 12:49:36 +00:00
3117f2cf67 Revert "[BE]: Update mypy to 1.11.2 (#133816)"
This reverts commit 55299cfc223fa838aadd8d6d6fa3ed541fa5acd1.

Reverted https://github.com/pytorch/pytorch/pull/133816 on behalf of https://github.com/jeanschmidt due to seems to have broken https://github.com/pytorch/pytorch/actions/runs/10865710499/job/30155699792 on main ([comment](https://github.com/pytorch/pytorch/pull/133816#issuecomment-2352377684))
2024-09-16 09:11:16 +00:00
951c21d679 [dynamo] simplify implementation for builtins.sum (#133779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133779
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #133778
2024-09-16 04:53:06 +00:00
9961aaa601 [dynamo] simplify implementation for functools.reduce (#133778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133778
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-09-16 04:53:06 +00:00
d2207c57f7 [Distributed] add pack-check method for float8_e5m2 (#136115)
Add support for Float8_e5m2, following the same algorithm used for Float8_e4m3fn (i.e., an overflow check).

Made `HasNanFP8x8` a template so that it can be extended per dtype.
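
A rough Python illustration of the overflow-check idea for float8_e5m2 (the actual kernel is vectorized C++/CUDA): for e5m2, NaN/Inf is exactly the case where all five exponent bits are set.

```
import torch

vals = torch.tensor([1.0, float("inf"), float("nan")]).to(torch.float8_e5m2)
bits = vals.view(torch.uint8)        # reinterpret the raw bytes
print((bits & 0x7C) == 0x7C)         # tensor([False, True, True]) -- exponent bits all set
```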

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136115
Approved by: https://github.com/Skylion007
ghstack dependencies: #135891, #135961
2024-09-15 21:37:43 +00:00
e501ed71d4 Update link in distributed.tensor.parallel.rst (#136103)
dtensor folder was moved

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136103
Approved by: https://github.com/kwen2501, https://github.com/fegin
2024-09-15 19:36:29 +00:00
ab9a7eadd3 Add decomposition for permute_copy (#130944)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130944
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-15 19:35:14 +00:00
a141c6bb0d [pytorch][monitoring] Dynamic backend for WaitCounter (#135967)
Summary: This implements a default backend proxy that tries to look up a backend via dlsym. What this enables is dynamically loading a module with a backend implementation without having it statically linked with the application.

Differential Revision: D62549295

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135967
Approved by: https://github.com/c-p-i-o
2024-09-15 18:07:49 +00:00
dec3403b24 Add some doc for export_for_training (#135918)
Differential Revision: [D62610491](https://our.internmc.facebook.com/intern/diff/D62610491)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135918
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #135080, #135912
2024-09-15 17:08:12 +00:00
1904b09e61 Create export_for_inference API and expose core_aten as public facing API (#135912)
Differential Revision: [D62606908](https://our.internmc.facebook.com/intern/diff/D62606908)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135912
Approved by: https://github.com/avikchaudhuri
ghstack dependencies: #135080
2024-09-15 17:05:07 +00:00
382fad58b3 Deprecate _preserve_ops and consolidate with decomp_table (#135080)
In this PR, we deprecate the _preserve_ops feature in the run_decompositions API. We can't remove it completely because the ExecuTorch team depends on it, and as syncing between the two repos is non-trivial, I leave this argument as deprecated for now and will remove it in the next PR.

After this PR, run_decompositions will only decompose what's inside the decomp table and preserve the rest by default. Note that this behavior is only rolled out to OSS for now; the old code path is kept under the IS_FBCODE flag.
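
A hedged sketch of the post-change default (names and behavior as described above; the exact graph contents may vary by version):

```
import torch
from torch.export import export

class M(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(3, 3)

    def forward(self, x):
        return self.lin(x)

ep = export(M(), (torch.randn(2, 3),))
# Empty decomp table: nothing is decomposed, CIA ops such as aten.linear stay preserved.
print(ep.run_decompositions(decomp_table={}).graph)
```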

Differential Revision: [D62163161](https://our.internmc.facebook.com/intern/diff/D62163161/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135080
Approved by: https://github.com/justinchuby, https://github.com/avikchaudhuri, https://github.com/bdhirsh
2024-09-15 17:01:58 +00:00
357b7fb579 Revert "[Pytorch] Consolidate Strobelight compile time profiler between OSS and fbcode (#135953)"
This reverts commit b8637503c036abb898f6b880b325aeffe6f09c03.

Reverted https://github.com/pytorch/pytorch/pull/135953 on behalf of https://github.com/kollasb due to Broke internal module factory compatibility, revert from Phabricator failed ([comment](https://github.com/pytorch/pytorch/pull/135953#issuecomment-2351381777))
2024-09-15 05:32:38 +00:00
cyy
31e42a45dd Fix redundant move warnings by g++ (#134987)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134987
Approved by: https://github.com/ezyang
2024-09-15 05:28:19 +00:00
e1abd346a3 [audio hash update] update the pinned audio hash (#136106)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136106
Approved by: https://github.com/pytorchbot
2024-09-15 04:31:35 +00:00
386884e553 [Traceable FSDP2] Ignore FSDP2 forward hook side-effects in AC; Support FSDP2 + AC (#134997)
> Ignore FSDP2 forward hook side-effects in AC

Under AC, FSDP2 does not rely on the forward hook to all-gather weights for recomputation; instead it relies on the pre-backward hook to do this job:
451eaf0ff2/torch/distributed/_composable/fsdp/_fsdp_state.py (L219-L220)

So when we use `speculate_subgraph` to trace the utils.checkpoint AC region, we don't actually need to worry about the FSDP2 forward hook's side effects and can safely ignore them, because we do not (and do not expect to) re-run the FSDP2 forward hook during backward recomputation.

----

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134997
Approved by: https://github.com/zou3519
ghstack dependencies: #135727
2024-09-15 02:00:17 +00:00
8072ebc36c SKIP llama for dynamic size testing (#135960)
Running Torchbench llama with dynamic size failed with
```
  File "/localdisk/leslie/torch_inductor_community/pytorch/torch/fx/experimental/symbolic_shapes.py", line 4182, in produce_guards
    raise ConstraintViolationError(
torch.fx.experimental.symbolic_shapes.ConstraintViolationError: Constraints violated (L['inputs'][0].size()[0])! For more information, run with TORCH_LOGS="+dynamic".
  - Not all values of RelaxedUnspecConstraint(L['inputs'][0].size()[0]) are valid because L['inputs'][0].size()[0] was inferred to be a constant (32).
```
Skip marking dynamic dims for this model.
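
A hedged sketch of the pattern behind this failure: a dimension is marked dynamic but the model forces it to specialize to a constant.

```
import torch
import torch._dynamo

def f(x):
    if x.size(0) == 32:      # forces size(0) to specialize to a constant
        return x + 1
    return x - 1

x = torch.randn(32, 8)
torch._dynamo.mark_dynamic(x, 0)   # request a dynamic batch dimension
try:
    torch.compile(f)(x)
except Exception as e:
    print(type(e).__name__)        # a ConstraintViolationError is expected in this scenario
```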

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135960
Approved by: https://github.com/ezyang
2024-09-15 00:06:49 +00:00
a1a57a424d Optimize dict reconstruct to not codegen untouched values (#134876)
This PR changes how `reconstruct` is done for a ConstDict. As of today, it works as follows:
(1) codegen(...) each pair of key/value
(2) create a new dictionary to hold the new items
(3) clear the original dictionary
(4) update the original dict with the one created in (2)

We do a micro optimization in the generated bytecode to:
- Only codegen the items that changed.
- Only clear the original dictionary if a key was removed.
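
A self-contained illustration of the reconstruction semantics described above (plain Python, not the actual Dynamo bytecode): values are re-assigned only when they changed, and the dict is cleared only when a key was removed.

```
def reconstruct(original, traced):
    removed = original.keys() - traced.keys()
    if removed:                         # a key was deleted -> clear and rebuild
        original.clear()
        original.update(traced)
        return
    for k, v in traced.items():         # otherwise, touch only new/changed entries
        if k not in original or original[k] is not v:
            original[k] = v

d = {"a": 1, "b": 2}
reconstruct(d, {"a": 1, "b": 3, "c": 4})
print(d)  # {'a': 1, 'b': 3, 'c': 4}
```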

Fixes: #133487

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134876
Approved by: https://github.com/zou3519
2024-09-14 23:25:28 +00:00
a5eb43d8b4 Add TensorReferenceAnalysis and some tests (#135886)
Split out and modified from https://github.com/pytorch/pytorch/pull/130228. There were a bunch of subtle bugs, e.g. sometimes we need to use torch.ops.aten.{operator}.Tensor and other times torch.ops.aten.{operator}.default, or in the case of pow we need Tensor_Tensor. I figured it'd be easier to split out adding TensorReferenceAnalysis, add some tests, and do the actual integration in a separate diff.
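
A quick illustration of the overload-naming subtlety mentioned above:

```
import torch

print(torch.ops.aten.add.Tensor)         # tensor + tensor overload
print(torch.ops.aten.add.Scalar)         # tensor + scalar overload
print(torch.ops.aten.pow.Tensor_Tensor)  # the overload needed for pow(Tensor, Tensor)
```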

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135886
Approved by: https://github.com/ezyang
2024-09-14 23:09:40 +00:00
391f2d6d50 use a fast expand algorithm (#135999)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135999
Approved by: https://github.com/ezyang
2024-09-14 23:09:34 +00:00
5b21d91197 Fix dividing Mul by factor (#136079)
Fixes https://github.com/pytorch/pytorch/issues/136032

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136079
Approved by: https://github.com/ezyang
2024-09-14 22:14:27 +00:00
426580a67d Add CI for Triton CPU backend (#135342)
Where possible, I have marked failing tests (which we intend to fix or triage) as `@xfail_if_triton_cpu`. This will help us track progress of the Triton CPU backend over time. Tests that I don't think we need to address, or that are flaky, have been marked as skips.

Successful CI run: https://github.com/pytorch/pytorch/actions/runs/10822238062/job/30028284549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135342
Approved by: https://github.com/jansel
ghstack dependencies: #133408
2024-09-14 21:45:19 +00:00
e498b02b47 Add Triton CPU as an Inductor backend (#133408)
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel
2024-09-14 21:45:19 +00:00
55299cfc22 [BE]: Update mypy to 1.11.2 (#133816)
Updates mypy to 1.11.2 to improve type inference

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133816
Approved by: https://github.com/ezyang
2024-09-14 21:40:36 +00:00
c64ae601ba [dynamo] Fix support for classmethod(property(...)) (#134968)
Fixes #134451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134968
Approved by: https://github.com/yanboliang
2024-09-14 21:00:41 +00:00
7f5abb44af [BE][Ez]: Update pybind11 to 2.13.6. Exposes new conduit cross-compat API (#136087)
Updates the pybind11 submodule. The major patch note is an experimental new function, cpp_conduit, added to all pybind11 objects, which makes them more compatible across pybind11 versions, settings, and frameworks (such as nanobind). No code changes are needed on our end except updating the submodule.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136087
Approved by: https://github.com/malfet
2024-09-14 20:48:44 +00:00
8df01c8258 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135503
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502
2024-09-14 18:52:22 +00:00
860838e9be [Dynamo] Remove ignored modes workaround (#135502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135502
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422
2024-09-14 18:52:22 +00:00
1b9daeb240 [Dynamo] Trace enter/exit of TorchFunctionModes (#135422)
This PR implements tracing of `with` contexts for TorchFunction modes that have the default enter/exit behavior (i.e. pushing/popping the mode).

Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call

resume fn structure:
1. enter context
2. jump
...
3. exit context

The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).

So for torch function modes the structure of our output code is this:

1. graph call
2. mutate tf mode stack to replay mutations
3. unsupported code
4. on exception restore stack
5. resume function

Then our resume fn looks like this:

1. no-op enter torch function mode
2. jump
3.  exit tf mode

To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).

Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However, once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects, because we want to preserve the state of the mode stack as is so that we will properly update the stack with the bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.
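
A minimal sketch of the user-visible pattern this enables (the no-op mode is assumed here just for illustration; whether it traces without a graph break depends on the Dynamo version, but the snippet runs either way):

```
import torch
from torch.overrides import TorchFunctionMode

class NoopMode(TorchFunctionMode):
    def __torch_function__(self, func, types, args=(), kwargs=None):
        return func(*args, **(kwargs or {}))

@torch.compile(backend="eager")
def f(x):
    with NoopMode():            # enter/exit of the mode inside the compiled region
        return torch.sin(x) + 1

print(f(torch.randn(3)))
```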

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135422
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444
2024-09-14 18:52:22 +00:00
06caa2d560 [Dynamo] Simplify torch function mode stack guard (#135444)
The semantics of ignored modes previously had edge cases; this eliminates them by, in essence, filtering any ignored modes out of both the reference stack and the current torch function mode stack. This is purely to reduce complexity ahead of #135422. The ignored-modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts instead of inserting them into the graph, which is what needed these extra workarounds for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443
2024-09-14 18:52:22 +00:00
14cabdf626 [Dynamo] Support thread local setattr (#135443)
In preparation for tracing through DeviceContext (defb515306/torch/utils/_device.py (L66))
This PR adds support for calling the setattr of thread-local objects. These objects have a slots-based impl, and since it doesn't appear to have any side effects, we call this setattr impl when replaying mutations; calling `object.__setattr__` on these objects results in a TypeError.
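
A minimal illustration of why the special path is needed: the normal slots-based setattr works on a threading.local object, but `object.__setattr__` raises a TypeError on it.

```
import threading

tls = threading.local()
tls.value = 1                              # normal (slots-based) setattr works
try:
    object.__setattr__(tls, "value", 2)    # what a generic mutation replay would attempt
except TypeError as e:
    print("object.__setattr__ rejected:", e)
```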

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137
2024-09-14 18:52:22 +00:00
5c5c33ac32 [Dynamo] Trace torch function modes entered outside of torch.compile (#133137)
This PR adds initial tracing for torch function modes.

Details:
In essence, this adds tracing into the torch function of modes entered outside of the torch.compile call.
This does not yet support tracing enter/exit of a torch function mode or tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and handling graph breaks like we do for other torch.* context managers.
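
A hedged sketch of the scenario this PR targets: a torch function mode that is active around, rather than inside, the compiled call (the mode here is hypothetical and only logs intercepted calls for illustration).

```
import torch
from torch.overrides import TorchFunctionMode

class LoggingMode(TorchFunctionMode):
    # A hypothetical mode that reports which torch API calls it sees.
    def __torch_function__(self, func, types, args=(), kwargs=None):
        print("intercepted:", func.__name__)
        return func(*args, **(kwargs or {}))

compiled = torch.compile(lambda x: torch.mul(x, 2), backend="eager")

with LoggingMode():                 # mode entered outside of torch.compile
    out = compiled(torch.ones(3))   # with this PR, Dynamo traces into the mode's __torch_function__
print(out)
```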

Previously landed:
https://github.com/pytorch/pytorch/pull/133135
https://github.com/pytorch/pytorch/pull/133136
https://github.com/pytorch/pytorch/pull/133134
https://github.com/pytorch/pytorch/pull/133133
https://github.com/pytorch/pytorch/pull/133132
https://github.com/pytorch/pytorch/pull/133131
https://github.com/pytorch/pytorch/pull/133729
https://github.com/pytorch/pytorch/pull/133130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #134732
2024-09-14 18:52:22 +00:00
228760b945 [Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)
For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
Approved by: https://github.com/ydwu4
2024-09-14 18:52:22 +00:00
b4c84c3167 [AOTI] Fix a fallback op returning None issue (#135997)
Summary: Fixes https://github.com/pytorch/pytorch/issues/135781. In some cases, a fallback can return None in the place of a tensor.

Differential Revision: [D62659039](https://our.internmc.facebook.com/intern/diff/D62659039)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135997
Approved by: https://github.com/chenyang78
2024-09-14 18:12:06 +00:00
b82122beef Only keep ListOfLinears module in basic_modules_benchmarks and add gpu version. (#135730)
All of the previous benchmarks are similar, ListOfLinears should be representative enough.
I copied the previous benchmarks from unit tests without a specific intention; I was just trying to create a large
number of benchmarks to better observe noise.

This PR keeps only one, we can add more as we see value and regressions in the future.
Also this diff adds a GPU version.
```
collecting compile time instruction count for basic_modules_ListOfLinears_eager
compile time instruction count for iteration 0 is 6479525851
compile time instruction count for iteration 1 is 1024432680
compile time instruction count for iteration 2 is 1019417317
compile time instruction count for iteration 3 is 1013603566
compile time instruction count for iteration 4 is 1008853980
compile time instruction count for iteration 5 is 1009541481
compile time instruction count for iteration 6 is 1005025533
compile time instruction count for iteration 7 is 1004116323
compile time instruction count for iteration 8 is 1000828633
compile time instruction count for iteration 9 is 999788323
collecting compile time instruction count for basic_modules_ListOfLinears_inductor
compile time instruction count for iteration 0 is 40837529730
compile time instruction count for iteration 1 is 18411921909
compile time instruction count for iteration 2 is 18383665161
compile time instruction count for iteration 3 is 18348983522
compile time instruction count for iteration 4 is 18349276590
compile time instruction count for iteration 5 is 18353046274
compile time instruction count for iteration 6 is 18346818581
compile time instruction count for iteration 7 is 18340057998
compile time instruction count for iteration 8 is 18331267320
compile time instruction count for iteration 9 is 18328381338
collecting compile time instruction count for basic_modules_ListOfLinears_inductor_gpu
compile time instruction count for iteration 0 is 15408870979
compile time instruction count for iteration 1 is 10949520859
compile time instruction count for iteration 2 is 11058786167
compile time instruction count for iteration 3 is 11003606719
compile time instruction count for iteration 4 is 10896406770
compile time instruction count for iteration 5 is 10982875189
compile time instruction count for iteration 6 is 10931848275
compile time instruction count for iteration 7 is 10956345008
compile time instruction count for iteration 8 is 11045384499
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135730
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-09-14 16:45:52 +00:00
b8637503c0 [Pytorch] Consolidate Strobelight compile time profiler between OSS and fbcode (#135953)
Summary:
Move towards consolidating strobelight profiler implementations between OSS and fbcode. This change is a first step towards that.

- Created a new function to abstract out compile time profiling enablement. This function allows profiler to switch between different function profilers (e.g. Thrift based or CLI based)
- Both OSS and Fbcode now use one compile time profiler in torch/_strobelight

Test Plan:
Tested OSS with following commands:
```
python torch/_strobelight/examples/compile_time_profile_example.py
python torch/_strobelight/examples/cli_function_profiler_example.py

TORCH_COMPILE_STROBELIGHT=TRUE TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 python benchmarks/dynamo/huggingface.py --ci --accuracy --timing --explain --inductor --device cuda --training --amp  --only XLNetLMHeadModel
```

See test commands for fbcode in comments.

Differential Revision: D62444551

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135953
Approved by: https://github.com/laithsakka
2024-09-14 16:35:22 +00:00
f97cccf62a [3.13] fix 3.13 pickle error in torch/package (#136049)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136049
Approved by: https://github.com/albanD
ghstack dependencies: #136034
2024-09-14 14:28:09 +00:00
db393fb95e Add Half support for reflection and replication padding on CPU (#135931)
Fixes #135680

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135931
Approved by: https://github.com/Skylion007
2024-09-14 14:18:55 +00:00
23dec79cef Revert "[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)"
This reverts commit 731b178b56c83966d6e8cdfb0015d22d8f91b4d2.

Reverted https://github.com/pytorch/pytorch/pull/134732 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:55 +00:00
8c8a3086a7 Revert "[Dynamo] Trace torch function modes entered outside of torch.compile (#133137)"
This reverts commit 4528777e034b157a8329d1879daf52290eea199a.

Reverted https://github.com/pytorch/pytorch/pull/133137 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:55 +00:00
46f5037007 Revert "[Dynamo] Support thread local setattr (#135443)"
This reverts commit 149d0b716173787df4543186ff74b605aca54e3e.

Reverted https://github.com/pytorch/pytorch/pull/135443 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:55 +00:00
7975ec3a29 Revert "[Dynamo] Simplify torch function mode stack guard (#135444)"
This reverts commit ce3c74f2744cbc134b95cf8bd53ae5e3fbc67c29.

Reverted https://github.com/pytorch/pytorch/pull/135444 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:55 +00:00
f3180f0088 Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422)"
This reverts commit 7743149b2be4a9eba7e0997ccdc6abe552bec266.

Reverted https://github.com/pytorch/pytorch/pull/135422 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:55 +00:00
838c912502 Revert "[Dynamo] Remove ignored modes workaround (#135502)"
This reverts commit 5c67cf180ee53d696f95d7c45dd99a35399e4450.

Reverted https://github.com/pytorch/pytorch/pull/135502 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:55 +00:00
72b868d034 Revert "[Dynamo] Remove ignored modes from torch function mode stack guard (#135503)"
This reverts commit e77bd0ebd20e96990ccd40518e68bbcfe7fda855.

Reverted https://github.com/pytorch/pytorch/pull/135503 on behalf of https://github.com/mlazos due to broke python test/quantization/pt2e/test_numeric_debugger.py TestNumericDebugger.test_re_export_preserve_handle modified yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2350937008))
2024-09-14 10:02:54 +00:00
41b58a1bec OpenReg: Fix issue when copying on the same device (#135956)
Current copy gets wrong value when src and dst are both openreg.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135956
Approved by: https://github.com/albanD
2024-09-14 09:57:45 +00:00
f96a073c9d Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232)
Use `_amp_foreach_non_finite_check_and_unscale_` instead of the fallback version for CPU grads of `ShardedGradScaler`, as `_amp_foreach_non_finite_check_and_unscale_` is supported on CPU since https://github.com/pytorch/pytorch/pull/109281.
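A small sketch of what the op does for a list of CPU grads (assuming a build where the CPU path from #109281 is available):
```python
import torch

grads = [torch.tensor([2.0, float("inf")]), torch.tensor([4.0, 6.0])]
found_inf = torch.zeros(1)
inv_scale = torch.full((1,), 0.5)  # 1 / loss_scale

# Unscales every grad in place and flags whether any non-finite value was seen.
torch._amp_foreach_non_finite_check_and_unscale_(grads, found_inf, inv_scale)

print(found_inf)  # tensor([1.]) -- the inf in grads[0] was detected
print(grads[1])   # tensor([2., 3.]) -- unscaled in place
```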

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135232
Approved by: https://github.com/ezyang
2024-09-14 09:53:17 +00:00
a815611db9 [Traceable FSDP2][Partitioner] Must save AC output if output has a backward hook (#135727)
If a node is an AC region output and has a backward hook on it, we intentionally choose to save it.
This is to work around circular dependencies in Traceable FSDP2+AC.
Example:
```
out = fully_shard(utils.checkpoint(module))(x)
norm_out = layer_norm(out)
```
and there is a circular dependency:
1. In backward, grad_input of layer_norm aka. `out_grad` is actually dependent on `out`.
2. `out` depends on `out`'s backward hook created by FSDP2 (which does all-gather for `module` weights) in order to be recomputed.
3. `out`'s FSDP2 backward hook, as is the case for all eager backward hooks, depends on `out_grad`  -> circular dependency with (1)!

Solution: check whether `out` has a backward hook, and if so, intentionally save `out` in forward graph outputs. With this, we can break the above circular dependency.

----

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135727
Approved by: https://github.com/Chillee
2024-09-14 08:45:58 +00:00
3352c9ac94 Add higher order operator name to the cache bypass exception (#135876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135876
Approved by: https://github.com/jamesjwu, https://github.com/zou3519
2024-09-14 07:05:29 +00:00
5a2be192d1 [Traceable FSDP2] Don't register RegisterPostBackwardFunction if user intends to use Traceable FSDP2, and assert that compiled autograd is not used when entering RegisterPostBackwardFunction (#135824)
During enablement of Traceable FSDP2 on internal models, sometimes the user only applies torch.compile to some of the FSDP2 instances but not all of them. Such a mixed usage pattern is not supported by compiled autograd. Here we try to catch such usage and throw an error, so that the user can fix the usage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135824
Approved by: https://github.com/awgu
2024-09-14 06:30:12 +00:00
a9bef85263 [CI] Increase open file handles limit to 16K on MacOS (#136061)
Maybe it will help with the flaky failures tracked in https://github.com/pytorch/pytorch/issues/135885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136061
Approved by: https://github.com/clee2000, https://github.com/kit1980, https://github.com/huydhn, https://github.com/ZainRizvi
2024-09-14 06:16:12 +00:00
44dd218a61 Disable garbage collection during compile_time_instructions count in benchmark base by default. (#135768)
When we measure compile time instruction count, in most cases we probably do not want to measure GC instructions,
so garbage collection is disabled here by default.
If it is needed, we can add an option to allow it, or someone can use the regular total instruction count instead of the compile time instruction count.
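A minimal sketch of the idea (illustrative, not the actual benchmark-base code):
```python
import gc
from contextlib import contextmanager

@contextmanager
def gc_disabled():
    # Keep the collector out of the measured region so its instructions
    # don't show up in the compile time instruction count.
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        yield
    finally:
        if was_enabled:
            gc.enable()
```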

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135768
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-09-14 06:15:28 +00:00
1a67e2b680 [MPS] Add native im2col (#135706)
It's called from `torch.unfold` and is one of the few remaining vestiges in `MPSFallback.mm`.

Strongly inspired by CUDA implementation from 09519eb195/aten/src/ATen/native/cuda/im2col.cuh (L40-L61)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135706
Approved by: https://github.com/albanD
2024-09-14 06:09:36 +00:00
b9b6094793 [ROCm] Skip pointwise associative scan tests due to regression (#135995)
https://github.com/pytorch/pytorch/pull/133012 caused a regression on ROCm causing pointwise scan tests to fail

```
ERROR: test_pointwise_associative_scan_tuple_reverse_True_combine_mode_pointwise_cuda
ERROR: test_pointwise_associative_scan_tuple_reverse_False_combine_mode_pointwise_cuda
ERROR: test_pointwise_associative_scan_complex_pytree_reverse_True_combine_mode_pointwise_cuda
ERROR: test_pointwise_associative_scan_complex_pytree_reverse_False_combine_mode_pointwise_cuda
ERROR: test_pointwise_associative_scan_binary_operator_reverse_True_combine_mode_pointwise_cuda
ERROR: test_pointwise_associative_scan_binary_operator_reverse_False_combine_mode_pointwise_cuda
```

Skipping temporarily while triage is underway.

Full log: https://ossci-raw-job-status.s3.amazonaws.com/log/30067645445

```
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_inductor/graph.py", line 1020, in call_function
    out = lowerings[target](*args, **kwargs)  # type: ignore[index]
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_inductor/lowering.py", line 363, in wrapped
    out = decomp_fn(*args, **kwargs)
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_inductor/lowering.py", line 6245, in associative_scan
    raise RuntimeError("Unable to generate code for associative_scan op")
torch._inductor.exc.LoweringException: RuntimeError: Unable to generate code for associative_scan op
```

NOTE: even "eager" backend fails
```
  File "/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/_higher_order_ops/associative_scan.py", line 338, in associative_scan_op_dense
    raise NotImplementedError("associative_scan is not implemented for eager")
NotImplementedError: associative_scan is not implemented for eager
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135995
Approved by: https://github.com/malfet
2024-09-14 05:40:10 +00:00
911a43f930 [TCPStore] Remove deprecated constructor (#136004)
While looking at the TCPStore code again, I found it confusing that we still keep the deprecated constructor for TCPStore in cpp while we no longer expose it in python via pybind. I checked both internal and external usage; all use cases in cpp (aside from the unit test fixed in this PR) have already moved to using options. So let's remove this legacy constructor to avoid confusion.

Differential Revision: [D62653634](https://our.internmc.facebook.com/intern/diff/D62653634)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136004
Approved by: https://github.com/Skylion007, https://github.com/XilunWu
2024-09-14 04:25:47 +00:00
e77bd0ebd2 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135503
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502
2024-09-14 02:41:16 +00:00
5c67cf180e [Dynamo] Remove ignored modes workaround (#135502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135502
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422
2024-09-14 02:41:16 +00:00
7743149b2b [Dynamo] Trace enter/exit of TorchFunctionModes (#135422)
This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode)

Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call

resume fn structure:
1. enter context
2. jump
...
3. exit context

The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).

So for torch function modes the structure of our output code is this:

1. graph call
2. mutate tf mode stack to replay mutations
3. unsupported code
4. on exception restore stack
5. resume function

Then our resume fn looks like this:

1. no-op enter torch function mode
2. jump
3.  exit tf mode

To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).
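A rough illustration of that polyfilled behavior (not the actual polyfill; `mode` stands in for any TorchFunctionMode instance):
```python
class NoOpEnterMode:
    """Entering is a no-op because the side-effects bytecode already pushed
    the mode onto the torch function mode stack; exiting pops it as usual."""

    def __init__(self, mode):
        self.mode = mode

    def __enter__(self):
        return self.mode  # no push here

    def __exit__(self, exc_type, exc, tb):
        return self.mode.__exit__(exc_type, exc, tb)  # normal pop
```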

Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However, once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects because we want to preserve the state of the mode stack as is, so that we will properly update the stack with the bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack, and not update the true python torch function mode stack with the suffix bytecode. All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135422
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444
2024-09-14 02:41:08 +00:00
ce3c74f274 [Dynamo] Simplify torch function mode stack guard (#135444)
The semantics of ignored modes previously had edge cases, this eliminates these by in essence filtering any ignored modes out of both the ref stack and the current torch function mode stack. This is purely to fix complexity in #135422.  The ignored modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts vs inserting them into the graph which needed these extra workarounds for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443
2024-09-14 02:40:59 +00:00
149d0b7161 [Dynamo] Support thread local setattr (#135443)
In preparation for tracing through DeviceContext (defb515306/torch/utils/_device.py (L66))
This PR adds support for calling the setattr of thread local objects. These objects have a slots-style setattr impl which doesn't appear to have any side effects, so we call that impl when replaying mutations; calling `object.__setattr__` on these objects results in a type error instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137
2024-09-14 02:40:52 +00:00
4528777e03 [Dynamo] Trace torch function modes entered outside of torch.compile (#133137)
This PR adds initial tracing for torch function modes.

Details:
In essence, this adds tracing into the torch function of modes entered outside of the torch.compile call.
This does not yet support tracing enter/exit of a torch function mode/ tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and handling graph breaks like we do other torch.* context managers.

Previously landed:
https://github.com/pytorch/pytorch/pull/133135
https://github.com/pytorch/pytorch/pull/133136
https://github.com/pytorch/pytorch/pull/133134
https://github.com/pytorch/pytorch/pull/133133
https://github.com/pytorch/pytorch/pull/133132
https://github.com/pytorch/pytorch/pull/133131
https://github.com/pytorch/pytorch/pull/133729
https://github.com/pytorch/pytorch/pull/133130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #134732
2024-09-14 02:40:43 +00:00
731b178b56 [Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)
For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
Approved by: https://github.com/ydwu4
2024-09-14 02:40:32 +00:00
1786a17fed Revert "Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232)"
This reverts commit 51c52061339069a2162e921e5b464fad5a411522.

Reverted https://github.com/pytorch/pytorch/pull/135232 on behalf of https://github.com/CaoE due to wrong commit ([comment](https://github.com/pytorch/pytorch/pull/135232#issuecomment-2350792806))
2024-09-14 02:31:06 +00:00
51c5206133 Use _amp_foreach_non_finite_check_and_unscale_ for CPU grads of ShardedGradScaler (#135232)
Use `_amp_foreach_non_finite_check_and_unscale_` instead of the fallback version for CPU grads of `ShardedGradScaler`, as `_amp_foreach_non_finite_check_and_unscale_` is supported on CPU since https://github.com/pytorch/pytorch/pull/109281.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135232
Approved by: https://github.com/ezyang
2024-09-14 02:20:58 +00:00
2e8d431a8f Fix tensor.data_ptr() representation overflow (#135567)
# Motivation
fix https://github.com/pytorch/pytorch/issues/135550
In PyTorch, [`tensor.data_ptr()`](e889252493/tools/autograd/templates/python_variable_methods.cpp (L204)) is reinterpreted by a [signed int64](e889252493/torch/csrc/autograd/utils/wrap_outputs.h (L50)) data type, which could result in an **overflow issue**, like below:
```python
import torch
a = torch.randn(2).to('xpu')
a.data_ptr()
# one possible output is
-23453392437248
# this is inconsistent with storage.data_ptr()
a.untyped_storage().data_ptr()
# one possible output is
18446720620317114368
```
This PR aims to fix this representation overflow issue to make `tensor.data_ptr()` consistent with [`tensor.untyped_storage().data_ptr()`](c0d2f991b1/torch/csrc/StorageMethods.cpp (L62)). With this PR, the output will become:
```python
import torch
a = torch.randn(2).to('xpu')
a.data_ptr()
# one possible output is
18446720620317114368
# this is consistent with storage.data_ptr()
a.untyped_storage().data_ptr()
# one possible output is
18446720620317114368
```

# Solution
Use `PyLong_FromVoidPtr` to prevent the overflow issue and fit the semantics of `wrap`.
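The two outputs above are the same 64-bit pattern read as signed vs. unsigned; a quick check using the numbers from the example:
```python
signed = -23453392437248            # what the old tensor.data_ptr() reported
unsigned = signed % (1 << 64)       # reinterpret as an unsigned 64-bit value
assert unsigned == 18446720620317114368  # matches untyped_storage().data_ptr()
```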
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135567
Approved by: https://github.com/dvrogozh, https://github.com/EikanWang, https://github.com/albanD
2024-09-14 01:52:04 +00:00
95496e4855 [CI] Check that PyTorch is built with OpenMP (#136060)
The restriction to x86-only builds should have been removed a long time ago.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136060
Approved by: https://github.com/clee2000, https://github.com/kit1980, https://github.com/ZainRizvi
2024-09-14 01:51:36 +00:00
5de4cb8cd8 [Inductor UT] Generalize inductor UT for intel GPU (Part 3) (#135827)
[Inductor UT] Reuse Inductor test case for Intel GPU.
Reuse `test/inductor/test_compiled_autograd.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135827
Approved by: https://github.com/etaf, https://github.com/desertfire
2024-09-14 01:43:05 +00:00
06bc717410 Fix sum() forward for NJT (#131945)
This PR solves two problems with `sum()` support in NJT:
* `sum()` over a dim with `keepdim=True` returns the wrong shape (i.e. it'll keep the wrong dim). This is a long-standing bug from way back in #112519.
* Historically, we've only supported `sum()` over a dim and not a full reduction. This PR adds the full reduction form (forward only, backward still fails).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131945
Approved by: https://github.com/davidberard98, https://github.com/jananisriram
2024-09-14 00:58:03 +00:00
081c4a966d [BE] Use squeeze/unsqueeze in im2col (#136006)
And move unsqueeze out of the dispatch, as it's dtype agnostic.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136006
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-09-14 00:35:37 +00:00
4237592b8f [Distributed] add pack-check method for float8_e4m3fn (#135961)
We check 8 x FP8 values simultaneously, i.e. 8 bytes at a time.
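A hedged sketch of such a packed check in Python (the real implementation is device-side; `float8_e4m3fn` has no infinity and encodes NaN as the low 7 bits all set):
```python
MASK64 = (1 << 64) - 1

def word_has_e4m3fn_nan(word: int) -> bool:
    # Each of the 8 bytes is NaN iff (byte & 0x7F) == 0x7F; the XOR turns NaN
    # bytes into 0x00, then the classic "has a zero byte" SWAR test applies.
    x = ((word & 0x7F7F7F7F7F7F7F7F) ^ 0x7F7F7F7F7F7F7F7F) & MASK64
    return bool(((x - 0x0101010101010101) & ~x & 0x8080808080808080) & MASK64)

assert word_has_e4m3fn_nan(0x00000000000000FF)      # 0xFF is a NaN byte
assert not word_has_e4m3fn_nan(0x4040404040404040)  # ordinary finite values
```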

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135961
Approved by: https://github.com/yifuwang, https://github.com/Skylion007
ghstack dependencies: #135891
2024-09-14 00:32:27 +00:00
a00faf4408 [3.13] fix 3.13 pickle error in serialization.py (#136034)
Error encountered when adding dynamo 3.13 support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136034
Approved by: https://github.com/albanD
2024-09-14 00:02:40 +00:00
b608ff3bea [Easy] Dont match to mm_plus_mm if not in max autotune (#135929)
It's only an optimization when we tune the triton template.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135929
Approved by: https://github.com/FindHao
2024-09-13 23:38:02 +00:00
b8eef500a6 Fix attr check for quantization spec (#135736)
Summary:
Previously we only checked dtype and is_dynamic to decide if two quantization specs are equivalent.
This may not work in some cases, e.g. when people use a different qscheme or quant_min/quant_max.

This PR added checks for other fields as well
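A sketch of the stricter comparison (assuming the specs are dataclasses; field names beyond dtype/is_dynamic are illustrative):
```python
from dataclasses import fields, is_dataclass

def specs_equivalent(a, b) -> bool:
    # Compare every field of the two specs (qscheme, quant_min, quant_max, ...)
    # instead of only dtype and is_dynamic.
    if type(a) is not type(b) or not is_dataclass(a):
        return False
    return all(getattr(a, f.name) == getattr(b, f.name) for f in fields(a))
```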

Test Plan:
regression tests

Reviewers:

Subscribers:

Tasks:

Tags:

Differential Revision: [D62530974](https://our.internmc.facebook.com/intern/diff/D62530974)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135736
Approved by: https://github.com/sxu
2024-09-13 23:01:22 +00:00
aad556a0b5 [PT2][Inductor][Optimus] Fix a corner case in remove_split_with_size_one (#135962)
Summary: see context in https://fb.workplace.com/groups/1075192433118967/permalink/1501768230461383/

Test Plan:
# local reproduce
```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "mai" --flow_id 642153776
```
P1586356950

# e2e

before fix

f642153776

after fix

Differential Revision: D62625318

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135962
Approved by: https://github.com/jackiexu1992
2024-09-13 22:53:08 +00:00
3c5d44dda5 Cleanup unused runner variants (#136058)
Cleaning up unused runner variants, leaving behind only the few that are actually referenced by workflows

For more details see description in the PR that generated these code changes:
- https://github.com/pytorch/test-infra/pull/5665
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136058
Approved by: https://github.com/wdvr, https://github.com/malfet
2024-09-13 22:50:07 +00:00
e2d3af405f [ONNX] Remove logging apis from public (#133825)
Remove

- torch.onnx.enable_log
- torch.onnx.disable_log
- torch.onnx.set_log_stream
- torch.onnx.log

Because they are not meant for public consumption and has been marked for deprecation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133825
Approved by: https://github.com/titaiwangms
2024-09-13 22:19:52 +00:00
baff86dafb [MTIA tensor] allow shallow copy between CPU and MTIA tensors (#135871)
Reviewed By: egienvalue, hanzlfs

Differential Revision: D61662214

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135871
Approved by: https://github.com/egienvalue, https://github.com/nautsimon
2024-09-13 22:13:58 +00:00
db5e1b44d2 Fix inductor-micro-benchmark results upload (take 2) (#136052)
I had a brain freeze when I wrote the original fix.  The parameters were in the wrong order.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136052
Approved by: https://github.com/clee2000, https://github.com/kit1980, https://github.com/malfet
2024-09-13 22:05:10 +00:00
a30d5ba16c Fix bug in split-build workflows codegen (#136043)
By just deleting a few rogue lines left over from https://github.com/pytorch/pytorch/pull/135510.
If a file in the workflows folder does not have a `.yml` extension, it will not be launched at all, will it?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136043
Approved by: https://github.com/kit1980, https://github.com/atalman
2024-09-13 21:29:06 +00:00
46935c8241 Reduce default iterations to 5 . (#135773)
Running all benchmarks currently takes around 15 mins; this is the data:
https://www.internalfb.com/phabricator/paste/view/P1583590240
The data looks mostly stable, and 5 iterations should be good, especially with our 1.5% threshold.
That said, the diff also adds a way to increase the number of iterations for a specific benchmark.

After the change, the results are:
https://www.internalfb.com/phabricator/paste/view/P1583618969
Time is down to half (7 mins).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135773
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-09-13 21:16:38 +00:00
4f407c1884 Only measure compile time instruction count for sum_floordiv benchmark (#135785)
There was a recent strange noise of +5%/-5%.
Using only compile time instruction count:
1) avoids GC time.
2) avoids other operations that are not what we are trying to measure. ==> less probable noise.
```
collecting compile time instruction count for sum_floordiv_regression
compile time instruction count for iteration 0 is 8899290248
compile time instruction count for iteration 1 is 1188830489
compile time instruction count for iteration 2 is 1180579615
compile time instruction count for iteration 3 is 1176263131
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135785
Approved by: https://github.com/avikchaudhuri, https://github.com/anijain2305
2024-09-13 21:14:10 +00:00
2e461e54e8 Add gpu and gpu_dynamic versions of add_loop (#135809)
I am thinking maybe 3 iterations are enough for this one?
- I am keeping eager and inductor since inductor is 2X eager time.
- Eager dynamic is 2X eager, so keeping this as well.
- Inductor has three tests (dynamic gpu, gpu and cpu).
I am unsure if I am over-profiling here; happy to trim if anyone has suggestions.
```
collecting compile time instruction count for add_loop_eager
compile time instruction count for iteration 0 is 8213664211
compile time instruction count for iteration 1 is 2798628246
compile time instruction count for iteration 2 is 2796811362
compile time instruction count for iteration 3 is 2794438188
compile time instruction count for iteration 4 is 2794634117
collecting compile time instruction count for add_loop_eager_dynamic
compile time instruction count for iteration 0 is 5724108021
compile time instruction count for iteration 1 is 5499908609
compile time instruction count for iteration 2 is 5569101366
compile time instruction count for iteration 3 is 5493806364
compile time instruction count for iteration 4 is 5493169851
collecting compile time instruction count for add_loop_inductor
compile time instruction count for iteration 0 is 49789381222
compile time instruction count for iteration 1 is 25769347393
compile time instruction count for iteration 2 is 25772594322
compile time instruction count for iteration 3 is 25768695952
compile time instruction count for iteration 4 is 25768032314
collecting compile time instruction count for add_loop_inductor_gpu
compile time instruction count for iteration 0 is 23966942581
compile time instruction count for iteration 1 is 23771950919
compile time instruction count for iteration 2 is 23770784286
compile time instruction count for iteration 3 is 23780160875
compile time instruction count for iteration 4 is 23774634465
collecting compile time instruction count for add_loop_inductor_dynamic_gpu
compile time instruction count for iteration 0 is 41505055086
compile time instruction count for iteration 1 is 41293654089
compile time instruction count for iteration 2 is 41301016100
compile time instruction count for iteration 3 is 41306056207
compile time instruction count for iteration 4 is 41308171566
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135809
Approved by: https://github.com/ezyang, https://github.com/anijain2305
2024-09-13 20:42:31 +00:00
a3d827a28c Use python 3.11 for Large Wheel build (#136042)
Use Python 3.11 in nightly Large wheel builds. Required for Colab testing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136042
Approved by: https://github.com/kit1980, https://github.com/malfet

Co-authored-by: Sergii Dymchenko <kit1980@gmail.com>
2024-09-13 20:27:11 +00:00
4312794b92 [reland][export] fix re-export custom metadata (#135720)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/134778

The previous D62304294 broke some executorch tests. It has already been reverted.

In this diff, `_collect_param_buffer_metadata()` is modified so that when a `call_function` node is encountered and its input nodes include `get_attr`, we skip the fields that have been collected previously and only collect the rest of the fields. This prevents over-writing.

Test Plan:
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//executorch/backends/xnnpack/test:test_xnnpack_ops

buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_re_export_preserve_handle

buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_run_decompositions_preserve_handle
```

Differential Revision: D62514208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135720
Approved by: https://github.com/zhxchen17, https://github.com/jerryzh168
2024-09-13 20:15:15 +00:00
b856f3539b Fix script name in the comments (#135507)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135507
Approved by: https://github.com/atalman
2024-09-13 19:59:47 +00:00
835e7bb077 fix requirements.txt installation failure issue on Windows (#134567)
Fixes #134564

Root cause:

The `lintrunner` wheel released on [pypi.org](https://pypi.org/project/lintrunner/#files) only supports Windows 32-bit and Linux 64-bit. Since compilation of PyTorch requires a 64-bit env, on Windows `lintrunner` has to be compiled from the source distribution. `Rust` is its dependency for compilation, as indicated in the error message. Meanwhile, a Visual Studio environment is needed for linking libraries.

![image](https://github.com/user-attachments/assets/180cd899-8886-43b5-b42f-031f41e81683)

Issue when performing `pip install lintrunner` without a Visual Studio environment activated is shown below.

```bash
>python -m pip install lintrunner
Collecting lintrunner
  Downloading lintrunner-0.12.5.tar.gz (62 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: lintrunner
  Building wheel for lintrunner (pyproject.toml) ... error
  error: subprocess-exited-with-error

  × Building wheel for lintrunner (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [137 lines of output]
      Running `maturin pep517 build-wheel -i C:\Users\\miniforge3\envs\py310\python.exe --compatibility off`
      📡 Using build options bindings from pyproject.toml
         Compiling proc-macro2 v1.0.79
         Compiling unicode-ident v1.0.12
         Compiling version_check v0.9.4
         Compiling windows_x86_64_msvc v0.52.4
         Compiling winapi v0.3.9
         Compiling serde v1.0.197
         Compiling autocfg v1.2.0
         Compiling syn v1.0.109
         Compiling lazy_static v1.4.0
         Compiling libc v0.2.153
         Compiling equivalent v1.0.1
         Compiling hashbrown v0.14.3
         Compiling memchr v2.7.2
         Compiling yansi v1.0.1
         Compiling unicode-width v0.1.11
         Compiling regex-syntax v0.8.3
         Compiling encode_unicode v0.3.6
         Compiling cfg-if v1.0.0
         Compiling winnow v0.6.5
         Compiling cc v1.0.92
      error: could not compile `windows_x86_64_msvc` (build script) due to 2 previous errors
      warning: build failed, waiting for other jobs to finish...
      error: could not compile `serde` (build script) due to 2 previous errors
      error: could not compile `proc-macro2` (build script) due to 2 previous errors
      error: could not compile `syn` (build script) due to 2 previous errors
      error: could not compile `libc` (build script) due to 2 previous errors
      error: could not compile `winapi` (build script) due to 2 previous errors
      💥 maturin failed
        Caused by: Failed to build a native library through cargo
        Caused by: Cargo build finished with "exit code: 101": `cargo rustc --manifest-path Cargo.toml --message-format json --release --bins --`
      📦 Including license file "LICENSE"
      🔗 Found bin bindings
      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      error: linker `link.exe` not found
        |
        = note: program not found

      note: the msvc targets depend on the msvc linker but `link.exe` was not found

      note: please ensure that Visual Studio 2017 or later, or Build Tools for Visual Studio were installed with the Visual C++ option.

      note: VS Code is a different product, and is not sufficient.

      error: aborting due to 1 previous error

      Error: command ['maturin', 'pep517', 'build-wheel', '-i', 'C:\\Users\\\\miniforge3\\envs\\py310\\python.exe', '--compatibility', 'off'] returned non-zero exit status 1
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
  ERROR: Failed building wheel for lintrunner
Failed to build lintrunner
ERROR: ERROR: Failed to build installable wheels for some pyproject.toml based projects (lintrunner)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134567
Approved by: https://github.com/malfet
2024-09-13 18:43:55 +00:00
b6d6aa49b8 Revert "Validate input types for torch.nn.Linear and torch.nn.Bilinear (#135596)"
This reverts commit e157ce3ebbb3f30d008c15914e82eb74217562f0.

Reverted https://github.com/pytorch/pytorch/pull/135596 on behalf of https://github.com/malfet due to It's too restrictive, should allow other int-like types, such as `numpy.int64` ([comment](https://github.com/pytorch/pytorch/pull/135596#issuecomment-2349714104))
2024-09-13 18:06:56 +00:00
deee21cb78 Revert "[Inductor] Rename cpp_wrapper_cuda.py as cpp_wrapper_gpu.py (#135313)"
This reverts commit 16b37b309f64ddd4e498c57a99191e1d9b3dfdac.

Reverted https://github.com/pytorch/pytorch/pull/135313 on behalf of https://github.com/izaitsevfb due to breaks internal builds ([comment](https://github.com/pytorch/pytorch/pull/135313#issuecomment-2349662091))
2024-09-13 17:53:21 +00:00
3f69410976 [gpu-profiler] Expose active and repeat in os env var (#135757)
Summary: https://fb.workplace.com/groups/ai.efficiency.tools.users/permalink/1855136444971825/

Test Plan:
`buck2 test mode/opt caffe2/test:profiler -- -r test_kineto_profiler_api `

eyes

Differential Revision: D62529249

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135757
Approved by: https://github.com/Yuzhen11
2024-09-13 17:48:27 +00:00
18f9331e5d Revert "[aoti] Fix workspace generation for triton (#135552)"
This reverts commit d3833253928f29ed760b2dccac2b730028a868ca.

Reverted https://github.com/pytorch/pytorch/pull/135552 on behalf of https://github.com/izaitsevfb due to blocks revert of #135313, internal failures, see D62511427 ([comment](https://github.com/pytorch/pytorch/pull/135552#issuecomment-2349641372))
2024-09-13 17:47:36 +00:00
bc0f330169 [trymerge] Manually close merged PR when Github fails (#135890)
Manually close merged PR when Github fails to do it.

Consequences of current design:
Sleeping for 1 min uses up the machine and might result in race conditions; it also means the merging label is removed a bit later, and the PR is still left open if this API fails too (i.e. there is no async clean-up job).

Tested in https://github.com/malfet/deleteme/pull/92 by removing the part of the commit message that has "resolved #pr num"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135890
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-09-13 17:29:24 +00:00
7834c0bb2c [AOTI][Tooling] Add stats summary (mean/min/max, etc) for jit inductor tensor value printing (#135887)
Summary:
As title. Follow up to add stats summary (mean/min/max, etc) for jit inductor tensor value printing as well.

The inductor python wrapper code level printing would look something like this:

 {F1859224287}

Test Plan: CI

Reviewed By: chenyang78

Differential Revision: D62415575

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135887
Approved by: https://github.com/chenyang78
2024-09-13 17:19:25 +00:00
6ef49fe8f1 Revert "Pass ideep:lowp_kind to matmul_forward::compute on cache misses (#135058)"
This reverts commit 3d2431380999252d5401f83d5010b398a32e7597.

Reverted https://github.com/pytorch/pytorch/pull/135058 on behalf of https://github.com/malfet due to It regresses x86 performance ([comment](https://github.com/pytorch/pytorch/pull/135058#issuecomment-2349480861))
2024-09-13 17:09:45 +00:00
a15774563b [ROCm] Enable ROCm support for inductor's dynamic_rblock_scaling (#129663)
As of ROCm 6.1 [hipDeviceProp_t::regsPerMultiprocessor](https://rocm.docs.amd.com/projects/HIP/en/latest/doxygen/html/structhip_device_prop__t.html#a7390d5b180d63978c81aa971060270b4) is now available allowing us to enable this attribute on ROCm.
```
>>> torch.cuda.get_device_properties(0)
_CudaDeviceProperties(name='AMD Instinct MI250X/MI250', major=9, minor=0, gcnArchName='gfx90a:sramecc+:xnack-', total_memory=65520MB, multi_processor_count=104)
>>> torch.cuda.get_device_properties(0).regs_per_multiprocessor
65536
```

With https://github.com/triton-lang/triton/pull/3962 we can extract n_regs and n_spells from a triton binary with the AMD backend, allowing us to enable inductor's dynamic_rblock_scaling on ROCm, initially implemented in https://github.com/pytorch/pytorch/pull/115094.

Leaving this in draft until following PRs have landed:
- https://github.com/pytorch/pytorch/pull/129361 to bump the triton commit pin
- https://github.com/pytorch/pytorch/pull/128449 to allow us to grab warp_size from device properties instead of hard coding 64 on ROCm.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129663
Approved by: https://github.com/jansel, https://github.com/shunting314
2024-09-13 16:45:39 +00:00
564d00f364 Revert "Fix clang-tidy warnings in Caffe2 code (#134935)"
This reverts commit 7cfd23636c8fa6fcbb8bf3ea34e15b847ec9ad9d.

Reverted https://github.com/pytorch/pytorch/pull/134935 on behalf of https://github.com/izaitsevfb due to breaks internal builds, caffe2 is still used internally ([comment](https://github.com/pytorch/pytorch/pull/134935#issuecomment-2349368152))
2024-09-13 16:42:37 +00:00
ae02d663cd [FlexAttention] Fix output layout (#135882)
We previously only supported the same v_head dim and qk_head dim. When we allowed for different head dims, I accidentally kept the same query strides for the output. This PR fixes this bug, and it also ensures that we always produce output in the same stride order as the input query.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135882
Approved by: https://github.com/yanboliang, https://github.com/Chillee
2024-09-13 16:36:05 +00:00
ad2f0e9f81 Add remote cache time saved to compilation metrics (#135490)
Summary:
Record remote cache time saved via frame_phase_timing

We add to the "phase" when remote cache hits and saves us time, so that we have a 1:1 correspondence between a frame and time saved.

Test Plan:
Internally run benchmark, see that it's populated in sandbox table after previous diff lands and logger config is actualized.

Show that column exists in table:

https://fburl.com/scuba/logger_staging_jjwu_30582a48f1ff9cf5f4ac50a4c40af/fp2te0ff

Note that an earlier version of D62105258 had the column as a string so the staging table is a bit messed up. But you can see the most recent samples have the column populates as a float.

Reviewed By: aorenste

Differential Revision: D62106921

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135490
Approved by: https://github.com/aorenste
2024-09-13 16:35:51 +00:00
21ffa18ad1 Fix "expand: SymIntArrayRef expected to contain only concrete integers" in AOTInductor (#135933)
Internal xref:
https://fb.workplace.com/groups/1075192433118967/permalink/1501860707118802/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135933
Approved by: https://github.com/angelayi
2024-09-13 15:23:42 +00:00
2519e5a8de [CUDA][FP8] Skip rowwise scaling test on sm89 (#135718)
Same reason as https://github.com/pytorch/pytorch/pull/133612: the rowwise scaling implementation is sm90+ specific (e.g., uses TMA).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135718
Approved by: https://github.com/Skylion007
2024-09-13 15:07:20 +00:00
ba6e0f31ab Remove cycle dependency by localizing the import. (#135926)
Summary:
Since https://www.internalfb.com/diff/D62215095 landed, there have been many silent errors due to the dependency between functional_tensor and config.

```
 File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/export/__init__.py", line 64, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/export/dynamic_shapes.py", line 23, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/export/exported_program.py", line 26, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_higher_order_ops/__init__.py", line 1, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_higher_order_ops/cond.py", line 6, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_subclasses/functional_tensor.py", line 9, in <module>
  File "/tmp/torch_deploy_zip5YRJC1/torch_python_modules.zip/torch/_inductor/config.py", line 44, in <module>
```

https://fburl.com/logarithm/ol5kx0ee
complaining about a cyclic dependency.

This fixes it.
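A minimal sketch of the fix pattern (the function name is illustrative):
```python
def _get_inductor_config():
    # Importing inside the function keeps torch._subclasses.functional_tensor
    # importable without pulling in torch._inductor.config at module-import
    # time, breaking the cycle shown in the traceback above.
    from torch._inductor import config
    return config
```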

Test Plan: buck test multipy/runtime:test_deploy_embedded_cuda_interp_without_cuda_available -- --run-disabled TorchpyTest.AcquireMultipleSessionsInDifferentPackages

Reviewed By: aorenste

Differential Revision: D62616765

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135926
Approved by: https://github.com/aorenste, https://github.com/oulgen, https://github.com/Skylion007
2024-09-13 15:05:41 +00:00
7ed0563cad Revert "[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)"
This reverts commit e504fb70693d4a3741c3380b6a989d441e84f737.

Reverted https://github.com/pytorch/pytorch/pull/134732 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:58 +00:00
eb7dd91dd1 Revert "[Dynamo] Trace torch function modes entered outside of torch.compile (#133137)"
This reverts commit fafdd588f27e1d56090c6d260d0382c255eaf9eb.

Reverted https://github.com/pytorch/pytorch/pull/133137 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:58 +00:00
3f30360d05 Revert "[Dynamo] Support thread local setattr (#135443)"
This reverts commit 30b007bea329f512af3dc4fd4e6c7d145e807b71.

Reverted https://github.com/pytorch/pytorch/pull/135443 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:58 +00:00
4734e356d6 Revert "[Dynamo] Simplify torch function mode stack guard (#135444)"
This reverts commit 0c080cb2c78a85a5320fbeadbbb9a2cc640fd89d.

Reverted https://github.com/pytorch/pytorch/pull/135444 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:57 +00:00
ac169795a9 Revert "[Dynamo] Trace enter/exit of TorchFunctionModes (#135422)"
This reverts commit 2af3b8ffd84e36b91279174e9106f84b2d2a11f2.

Reverted https://github.com/pytorch/pytorch/pull/135422 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:57 +00:00
fca58bfda1 Revert "[Dynamo] Remove ignored modes workaround (#135502)"
This reverts commit 7d5e0dd4b1a8d20fc8624b3085a6f5ddedd89a2e.

Reverted https://github.com/pytorch/pytorch/pull/135502 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:57 +00:00
dc71e7a7d4 Revert "[Dynamo] Remove ignored modes from torch function mode stack guard (#135503)"
This reverts commit c56728b643e2b7d796abd7ec45803319e1c5967d.

Reverted https://github.com/pytorch/pytorch/pull/135503 on behalf of https://github.com/albanD due to Broke tests on main ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2348886378))
2024-09-13 12:52:57 +00:00
1cdf658f4a Revert "[PT2][inductor][Optimus] Add pad_aten_mm_pass pattern to resolve long computation kernel in LCE (#135167)"
This reverts commit eb0fe029337b31bcb3d4b2d1e539895393975d68.

Reverted https://github.com/pytorch/pytorch/pull/135167 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI eg. https://github.com/pytorch/pytorch/actions/runs/10845542664/job/30097957154 ([comment](https://github.com/pytorch/pytorch/pull/135167#issuecomment-2348847595))
2024-09-13 12:35:05 +00:00
b5c52e96e8 Revert "[dynamo] Fix support for classmethod(property(...)) (#134968)"
This reverts commit bf68e16e94fc05f10d434cdc162a14d02c6ad23c.

Reverted https://github.com/pytorch/pytorch/pull/134968 on behalf of https://github.com/jithunnair-amd due to Broke ROCm CI: eg. https://github.com/pytorch/pytorch/actions/runs/10845542664/job/30097956613 ([comment](https://github.com/pytorch/pytorch/pull/134968#issuecomment-2348837553))
2024-09-13 12:29:03 +00:00
ea2ecab15b [AOTI][reland] Fix assert_function call in cpu autotune template (#135920)
Summary: Reland https://github.com/pytorch/pytorch/pull/135086. In the ABI-compatible mode, assert_function should be AOTI_TORCH_CHECK.

Test Plan: CI

Differential Revision: D62500592

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135920
Approved by: https://github.com/chenyang78
2024-09-13 12:21:57 +00:00
2f53d570fe Update document for autocast on CPU (#135299)
Update document for autocast on CPU due to the support of float16 and changes in the operator list.
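For example, float16 autocast on CPU now looks like this (minimal illustration, assuming a build where CPU float16 autocast is supported, which is what the updated document describes):
```python
import torch

with torch.autocast(device_type="cpu", dtype=torch.float16):
    a = torch.randn(8, 8)
    b = torch.randn(8, 8)
    c = a @ b  # matmul runs in the lower-precision autocast dtype

print(c.dtype)  # torch.float16
```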

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135299
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/svekars
2024-09-13 09:11:47 +00:00
31007cf200 [Distributed] add FP8 support to NaN checker (#135891)
Adding support for `torch.float8_e4m3fn` and `torch.float8_e5m2`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135891
Approved by: https://github.com/wconstab
2024-09-13 08:43:54 +00:00
c56728b643 [Dynamo] Remove ignored modes from torch function mode stack guard (#135503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135503
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422, #135502
2024-09-13 08:41:32 +00:00
7d5e0dd4b1 [Dynamo] Remove ignored modes workaround (#135502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135502
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137, #135443, #135444, #135422
2024-09-13 08:41:32 +00:00
2af3b8ffd8 [Dynamo] Trace enter/exit of TorchFunctionModes (#135422)
This PR implements tracing of with contexts with TorchFunction modes which have the default enter/exit behavior (ie pushing/popping the mode)

Typically the bytecode for a context manager looks like this during a graph break:
1. graph call
2. enter context
3. unsupported code
4. exit context
5. resume call

resume fn structure:
1. enter context
2. jump
...
3. exit context

The issue with torch function modes is that side effects will replay any mutations to the torch function stack performed during tracing. So, we do not need to enter and exit around the unsupported code in the original function (doing so would result in a duplicate torch function mode entry during execution of the unsupported code), and we don't need to enter again in the resume function (the mode that was pushed from the side effects bytecode would still be on the stack).

So for torch function modes the structure of our output code is this:

1. graph call
2. mutate tf mode stack to replay mutations
3. unsupported code
4. on exception restore stack
5. resume function

Then our resume fn looks like this:

1. no-op enter torch function mode
2. jump
3.  exit tf mode

To implement the no-op enter of the torch function mode I added torch function mode in polyfill which no-op enters, but normally exits. This is needed because we still want to trace the with context in the resume function, and exit properly (the exit instructions will still be in the function, so we need to generate instructions to set up the context).

Separately from the bytecode, dynamo also tracks contexts on the block stack, which is how the SETUP_* instructions are implemented. Naturally at a graph break, we exit these block stacks to properly reset the contexts entirely, so that we can re-enter around the unsupported code soundly. However once again, in the torch function mode case, in the event of a graph break we do not want to perform any exit side effects, because we want to preserve the state of the mode stack as is so that we will properly update the stack with the bytecode mentioned in the first section. If we exited here, dynamo would pop the mode off of the symbolic stack and not update the true python torch function mode stack with the suffix bytecode.

All in all, for torch function modes we enter exactly once, update the global torch function mode stack with side effects bytecode, re-read this stack when compiling the resume function, and exit exactly once in the resume function. This matches the semantics of eager exactly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135422
Approved by: https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443, #135444
2024-09-13 08:41:24 +00:00
0c080cb2c7 [Dynamo] Simplify torch function mode stack guard (#135444)
The semantics of ignored modes previously had edge cases; this eliminates them by, in essence, filtering any ignored modes out of both the ref stack and the current torch function mode stack. This is purely to fix complexity in #135422. The ignored modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts instead of inserting them into the graph, which needed these extra workarounds for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443
2024-09-13 08:41:17 +00:00
30b007bea3 [Dynamo] Support thread local setattr (#135443)
In preparation for tracing through DeviceContext (defb515306/torch/utils/_device.py (L66))
This PR adds support for calling the setattr of thread-local objects. These objects have a slots-based implementation, and since their setattr does not appear to have any side effects, we call that setattr impl when replaying mutations; calling `object.__setattr__` on these objects results in a type error.
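
A standalone illustration of the behavior described above (plain Python, not the Dynamo code from this PR); the `TypeError` claim is taken from the PR description:

```python
import threading

loc = threading.local()
loc.x = 1  # the type's own __setattr__ works
# object.__setattr__(loc, "y", 2)  # per the PR description, this raises a TypeError,
#                                  # which is why Dynamo calls the type's setattr instead
```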

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137
2024-09-13 08:41:07 +00:00
fafdd588f2 [Dynamo] Trace torch function modes entered outside of torch.compile (#133137)
This PR adds initial tracing for torch function modes.

Details:
In essence, this adds tracing into the `__torch_function__` of modes entered outside of the torch.compile call.
This does not yet support tracing enter/exit of a torch function mode or tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and handling graph breaks like we do for other torch.* context managers.

Previously landed:
https://github.com/pytorch/pytorch/pull/133135
https://github.com/pytorch/pytorch/pull/133136
https://github.com/pytorch/pytorch/pull/133134
https://github.com/pytorch/pytorch/pull/133133
https://github.com/pytorch/pytorch/pull/133132
https://github.com/pytorch/pytorch/pull/133131
https://github.com/pytorch/pytorch/pull/133729
https://github.com/pytorch/pytorch/pull/133130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #134732
2024-09-13 08:41:00 +00:00
e504fb7069 [Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)
For tracing cond/while in eager, we trace the HOP with the eager backend with metadata torchfunction mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it is not able to be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
Approved by: https://github.com/ydwu4
2024-09-13 08:40:50 +00:00
b346e99376 remove fast_flush arguments (#135387)
I've removed them from upstream Triton in https://github.com/triton-lang/triton/pull/4485. It looks like most places in the code use the default value of `fast_flush=True` anyway, though there are two PRs from @pearu that use `False`. To my knowledge, there's no reason to use the `False` value.

Differential Revision: [D62325778](https://our.internmc.facebook.com/intern/diff/D62325778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135387
Approved by: https://github.com/nmacchioni, https://github.com/jansel
2024-09-13 08:13:46 +00:00
7dc1788396 [inductor] Remove the batch fusion passes from being a default (#135922)
The Ads team does a search internally to figure out which fusion passes to use.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135922
Approved by: https://github.com/eellison, https://github.com/yanboliang
ghstack dependencies: #135819
2024-09-13 06:07:33 +00:00
9fd54d787d [Inductor UT] Generalize device-bias code in test_triton_kernels.py introduced in #135530 (#135656)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135656
Approved by: https://github.com/EikanWang, https://github.com/zou3519
2024-09-13 05:27:56 +00:00
b38be727eb [Inductor UT] Generalize inductor UT for intel GPU (Part 2) (#134556)
[Inductor UT] Reuse Inductor test case for Intel GPU.
Reuse `test/inductor/test_torchinductor_opinfo.py`
Reuse `test/inductor/test_minifier_isolate.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134556
Approved by: https://github.com/etaf, https://github.com/eellison
2024-09-13 05:16:28 +00:00
e54b559e88 [inductor] More fixes on the keys of constants and signature dictionaries (#135406)
The previous PR forgot to change two other places that also create `constants` and `signature`. https://github.com/pytorch/pytorch/pull/135170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135406
Approved by: https://github.com/jansel
2024-09-13 04:10:41 +00:00
eea5e6ff0f [DCP][DSD] Add a test case to demonstrate the workaround to load full state dict into a 2D model (#135763)
Fix https://github.com/pytorch/pytorch/issues/134095

This is a workaround for loading full state dict into a FSDP1+TP 2D model.
Since named_parameters() in FSDP1 does not return DTensor, we don't have the information to shard the full_state_dict and load it directly into the 2D model. In order to load a full state dict into an FSDP1+TP 2D model, we need to:
- load the full state dict into a 1D FSDP model
- dcp.save the full/shard state dict into storage
- initialize a 2D FSDP1+TP model
- get the default sharded state dict for the 2D model (full_state_dict=False)
- dcp.load the state dict from storage
- load the state dict into the 2D model
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135763
Approved by: https://github.com/fegin
ghstack dependencies: #135725
2024-09-13 03:51:14 +00:00
6df91b5917 real tensor prop for composite ops (#135717)
Fixes #135632

Adds real tensor propagation for decompositions, checking any symbols on their outputs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135717
Approved by: https://github.com/ezyang
2024-09-13 03:35:16 +00:00
0cdc6a8dcd [DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725)
Fix https://github.com/pytorch/pytorch/issues/134095
This fixes the distributed state dict full_state_dict option hanging during set_state_dict. We switch `_distribute_tensors` in _state_dict_utils.py to use `DTensor.from_local` instead of `distribute_tensor` to support the FSDP2+TP 2D strided sharding use case, as `distribute_tensor` cannot handle strided sharding yet. `distribute_tensor` incurs a scatter behind the scenes, while `DTensor.from_local` takes the local slice from the full tensor on each rank to create the DTensor (no collective). This means it's the user's responsibility to make sure the full_tensor from the full_state_dict is the same across all ranks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135725
Approved by: https://github.com/fegin
2024-09-13 03:26:36 +00:00
6cdc70bccd [ROCm] skip test_fp8_cast_and_t on non-MI300 machines (#135917)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135917
Approved by: https://github.com/malfet
2024-09-13 02:46:48 +00:00
e6b68359d7 Fix xpu memory stats error (#135818)
# Motivation
fix https://github.com/pytorch/pytorch/issues/135726
After merging two free blocks, I mistakenly decreased the active memory size by the merged block size instead of the original block size; this PR uses the correct (original) size.

# Additional Context
Add a UT to guard this scenario.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135818
Approved by: https://github.com/EikanWang
2024-09-13 02:41:21 +00:00
1c04cbfba6 [BE] Use C10_UNUSED (#135914)
Instead of `(void)foo; // Suppress unused variable`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135914
Approved by: https://github.com/huydhn, https://github.com/eqy
2024-09-13 02:27:07 +00:00
062681a0ed [Profiler] Torch Profiler distributed info is not JSON serializable (#135548)
Summary: To fix https://github.com/pytorch/pytorch/issues/133308 we must create an encoder for numpy values so we can serialize the distributed metadata to JSON.
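
A generic sketch of such an encoder (assumed structure; the encoder actually added in the PR may differ):

```python
import json
import numpy as np

class NumpyJSONEncoder(json.JSONEncoder):
    def default(self, obj):
        if isinstance(obj, np.integer):
            return int(obj)
        if isinstance(obj, np.floating):
            return float(obj)
        if isinstance(obj, np.ndarray):
            return obj.tolist()
        return super().default(obj)

# distributed metadata containing numpy scalars now serializes cleanly
json.dumps({"world_size": np.int64(8), "rank": np.int32(0)}, cls=NumpyJSONEncoder)
```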

Test Plan: Added unit test to check that numpy values can be serialized

Differential Revision: D62411619

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135548
Approved by: https://github.com/aaronenyeshi, https://github.com/albanD
2024-09-13 02:22:33 +00:00
8c356ce3da Fix lint errors in fbcode (#135614)
Summary: Fixed a bunch of fbcode imports that happened to work but confused autodeps.  After this autodeps still suggests "improvements" to TARGETS (which breaks our builds) but at least it can find all the imports.

Test Plan:
```
fbpython fbcode/tools/build/buck/linters/lint_autoformat.py --linter=autodeps --default-exec-timeout=1800 -- fbcode/caffe2/TARGETS fbcode/caffe2/test/TARGETS
```
Before:
```
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/testing.py:229) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fbur$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export.py:87) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_serdes.py:9) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fb$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_serdes.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https://fburl$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_retraceability.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See https:$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_retraceability.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See ht$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_nonstrict.py:7) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See http$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_nonstrict.py:6) when processing rule "test_export". Please make sure it's listed in the srcs parameter of another rule. See $
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "test_export" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:8) when processing rule "test_export". Please make sure it's listed in the srcs parameter of an$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "testing" (from caffe2/test/export/test_export_training_ir_to_run_decomp.py:10) when processing rule "test_export". Please make sure it's listed in the srcs parameter of anoth$
ERROR while processing caffe2/test/TARGETS: Found "//python/typeshed_internal:typeshed_internal_library" owner for "cv2" but it is protected by visibility rules: [] (from caffe2/test/test_bundled_images.py:7) when processing rule "test_bundled_$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "caffe2.test.profiler_test_cpp_thread_lib" (from caffe2/test/profiler/test_cpp_thread.py:29) when processing rule "profiler_test_cpp_thread". Please make sure it's listed in t$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_custom_ops.py:23) when processing rule "custom_ops". Please make sure it's listed in the srcs parameter of anoth$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._utils_internal.get_file_path_2" (from caffe2/test/test_public_bindings.py:13) when processing rule "public_bindings". Please make sure it's listed in the srcs paramete$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.symbolize_tracebacks" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another $
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for "torch._C._profiler.gather_traceback" (from caffe2/test/test_cuda.py:3348) when processing rule "test_cuda". Please make sure it's listed in the srcs parameter of another rule$
ERROR while processing caffe2/test/TARGETS: Cannot find an owner for include <torch/csrc/autograd/profiler_kineto.h> (from caffe2/test/profiler/test_cpp_thread.cpp:2) when processing profiler_test_cpp_thread_lib.  Some things to try:
```

Differential Revision: D62049222

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135614
Approved by: https://github.com/oulgen, https://github.com/laithsakka
2024-09-13 02:04:34 +00:00
bf68e16e94 [dynamo] Fix support for classmethod(property(...)) (#134968)
Fixes #134451

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134968
Approved by: https://github.com/yanboliang
2024-09-13 01:14:18 +00:00
eqy
d732df7e56 [Inductor] Disable TF32 in test_slice_scatter_reinplace (#135709)
TF32 linear/matmul numerics seem unrelated to the test's functionality, so disable TF32 here to abate noisy failures.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135709
Approved by: https://github.com/eellison
2024-09-13 00:30:45 +00:00
c9de2efde6 [Docs] fix inconsistent docs in conv1d, conv2d, and conv3d (#135894)
Addresses https://github.com/pytorch/pytorch/issues/135880
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135894
Approved by: https://github.com/mikaylagawarecki, https://github.com/malfet
2024-09-13 00:19:42 +00:00
1f15c0c7a5 [fx] Replace _snake_case with a regexp (#135822)
~2x speedup on this function, though saves <0.5s overall
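
For illustration, a regexp-based camel-to-snake conversion of this general shape (hypothetical; not necessarily the exact pattern used in the PR):

```python
import re

_CAMEL_BOUNDARY = re.compile(r"(?<=[a-z0-9])(?=[A-Z])")

def snake_case(name: str) -> str:
    # one compiled regex instead of a character-by-character loop
    return _CAMEL_BOUNDARY.sub("_", name).lower()

assert snake_case("LayerNorm") == "layer_norm"
```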

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135822
Approved by: https://github.com/oulgen
ghstack dependencies: #135787, #135788, #135820, #135821
2024-09-13 00:18:41 +00:00
a72124add9 [fx] Minor optimization in create_arg (#135821)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135821
Approved by: https://github.com/oulgen
ghstack dependencies: #135787, #135788, #135820
2024-09-13 00:18:41 +00:00
10ca4c0564 [inductor] Use TracerBase directly in LoopBody (#135820)
This skips some unneeded work in the subclass.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135820
Approved by: https://github.com/oulgen
ghstack dependencies: #135787, #135788
2024-09-13 00:18:41 +00:00
d3aab9642b [inductor] Optimize can_fuse_vertical() (#135788)
An O(n^2) to O(n) improvement by not comparing all pairs of deps.

Before:
![image](https://github.com/user-attachments/assets/797cd1bd-5d53-4374-8e76-ffce4232d7f9)

After:
![image](https://github.com/user-attachments/assets/1e61bf29-adba-41a4-839e-f028130fa979)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135788
Approved by: https://github.com/oulgen
ghstack dependencies: #135787
2024-09-13 00:18:41 +00:00
67a929eea8 [inductor] Remove unused check (#135787)
I think this is unreachable code because mode is always None on reads.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135787
Approved by: https://github.com/oulgen
2024-09-13 00:18:41 +00:00
f576960bbc do not expand in replace/simplify if no changes (#135863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135863
Approved by: https://github.com/ezyang
2024-09-13 00:12:01 +00:00
1aba224cfd Update nightly PyTorch version to 2.6.0 (#135916)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135916
Approved by: https://github.com/kit1980
2024-09-13 00:08:52 +00:00
d383325392 [aoti] Fix workspace generation for triton (#135552)
Fixes #131337

- add `arg_type` for workspace_arg, the type is consistent with the type in `generate_workspace_allocation()`.
- do not generate example tensors for `workspace`, and use `generate_workspace_allocation()` instead.
- add workspace allocation generation code to `kernel_autotune_calls`. e.g.
```python
    workspace = empty_strided_cuda((1280, ), (1, ), torch.uint8)
    workspace.zero_()
    .....
    triton_spl_fused_add_cumprod_0.run(buf2, arg0_1, arg1_1, workspace, 1, 10000, grid=split_scan_grid(1, 10000), stream=stream0)
    del buf2, arg0_1, arg1_1, workspace
```
-  add `empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda` to the header of triton autotune code.

The generated cpp has lines like the ones below, so we also implement a `zero_()` for `AtenTensorHandle`.

```cpp
    static constexpr int64_t int_array_0[] = {1280L, };
    static constexpr int64_t int_array_1[] = {1L, };
    AtenTensorHandle workspace_handle;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_empty_strided(1, int_array_0, int_array_1, cached_torch_dtype_uint8, cached_torch_device_type_cuda,  0, &workspace_handle));

        RAIIAtenTensorHandle workspace(workspace_handle);
        workspace.zero_();
```

- Fix handling of grid_fn for grid computation. Pass in "RBLOCK" to `split_scan_grid`.
- Fix dynamic shapes:
Without the fix we generate code that looks like `workspace = empty_strided_cuda((32*((255 + s0) // 256), ), (1, ), torch.uint8)` when doing triton autotune, and `s0` is not defined.

The solution approach is to use `V.graph.sizevars.size_hint(nbytes)` to realize the workspace size for triton autotune. Note that we only realize it for triton autotune code, but not for the cpp cuda code.

- We also generate slightly different cpp code depending on if `abi_compatible` is turned on.
```cpp
RAIIAtenTensorHandle workspace(workspace_handle);
AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_zero_(workspace.get()));
```
vs

```cpp
    at::Tensor workspace = at::detail::empty_strided_cuda({8L*(c10::div_floor_integer(static_cast<int64_t>((255L + s0)), static_cast<int64_t>(256L))), }, {1L, }, at::kByte, c10::DeviceType::CUDA);
    workspace.zero_();
```

Test Plan:

```
TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1  python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
python test/inductor/test_cuda_cpp_wrapper.py DynamicShapesCudaWrapperCudaTests.test_consecutive_split_cumprod_cuda_dynamic_shapes_cuda_wrapper
TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_consecutive_split_cumprod_cuda_cuda_wrapper
TORCHINDUCTOR_CPP_WRAPPER=1  python test/inductor/test_torchinductor.py -k GPUTests.test_consecutive_split_cumprod_cuda
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135552
Approved by: https://github.com/desertfire
2024-09-12 23:53:09 +00:00
00dc7d4356 fix compiled_autograd deadlock throw (#135795)
Fixes #135298

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135795
Approved by: https://github.com/xmfan
2024-09-12 23:24:57 +00:00
1760bbc259 [FlexAttention] Ensure q/k/v and block_mask on excact the same device (#135823)
Fixes #134739

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135823
Approved by: https://github.com/BoyuanFeng
2024-09-12 23:11:01 +00:00
fb9d8e3248 [ROCm] Use ieee precision for fp32 in flex attention (#135702)
3bebc09be9

That commit brought in a change to flex_attention allowing TF32 precision; TF32 largely lacks support on the ROCm side, so we should use ieee instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135702
Approved by: https://github.com/jeffdaily, https://github.com/drisspg
2024-09-12 23:00:48 +00:00
aaabfc8930 [Easy] Check if quant registered in constant folding (#135875)
Belated fix for https://github.com/pytorch/pytorch/issues/110904

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135875
Approved by: https://github.com/shunting314
2024-09-12 22:16:39 +00:00
63d6cd351a [dynamo] support torch.nn.attention.sdpa_kernel context manager (#135404)
Fixes https://github.com/pytorch/pytorch/issues/134608
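
A minimal sketch of the now-supported pattern (shapes and the backend choice are illustrative):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

@torch.compile
def attn(q, k, v):
    with sdpa_kernel(SDPBackend.MATH):  # context manager no longer forces a graph break
        return F.scaled_dot_product_attention(q, k, v)

q = k = v = torch.randn(2, 4, 16, 8)
attn(q, k, v)
```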

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135404
Approved by: https://github.com/jansel, https://github.com/drisspg
2024-09-12 22:04:48 +00:00
3de9e474df Revert "Check function declarations of Core ML code (#135467)"
This reverts commit bc1b8f094d24de27432f4c29f0729e85a6b5ba63.

Reverted https://github.com/pytorch/pytorch/pull/135467 on behalf of https://github.com/malfet due to This breaks ios periodic jobs, see https://github.com/pytorch/pytorch/actions/runs/10797026668/job/29947377532 ([comment](https://github.com/pytorch/pytorch/pull/135467#issuecomment-2347322784))
2024-09-12 22:04:35 +00:00
3e1a4ea132 Revert "[DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725)"
This reverts commit 83c594ebd6dfa517fdd67ae23929cc60d5fa325d.

Reverted https://github.com/pytorch/pytorch/pull/135725 on behalf of https://github.com/ZainRizvi due to This is breaking lint. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/10835983999/job/30068709508) [HUD commit link](83c594ebd6) ([comment](https://github.com/pytorch/pytorch/pull/135725#issuecomment-2347303272))
2024-09-12 21:47:38 +00:00
e157ce3ebb Validate input types for torch.nn.Linear and torch.nn.Bilinear (#135596)
Adding validation checks to check the input types and display better error messages for the same.
Fixes https://github.com/pytorch/pytorch/issues/135463
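
For illustration, the kind of misuse the new checks are meant to catch (the exact exception type and message are assumptions, not quoted from the PR):

```python
import torch

try:
    torch.nn.Linear("128", 64)  # in_features passed as a string
except TypeError as e:
    print(e)  # with the validation, the offending argument is reported up front
```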

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135596
Approved by: https://github.com/malfet
2024-09-12 21:28:37 +00:00
b897ab0540 [export] ignore mark_dynamic() in export (#135536)
Previously we were accommodating `torch._dynamo.mark_dynamic()` for export's dynamic shapes. Here we clean things up and ignore it, requiring users to specify an export input for `dynamic_shapes`.

Note: there are 4 decorators relevant to export: `mark_dynamic, maybe_mark_dynamic, mark_static, mark_unbacked`. User calls that involve export have only been `mark_dynamic()`, and we use `maybe_mark_dynamic` under the hood for `Dim.AUTO`, but we could start using others. One reason I decided not to warn and just silently ignore is that these decorators cause the tensors to carry dynamic info, and it'll be hard to tell whether the markers are from export or user calls when re-exporting with the same inputs.
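
A minimal sketch of the export-native workflow this change points users to (toy module; names are placeholders): specify dynamism via `dynamic_shapes` instead of relying on `mark_dynamic()`:

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

x = torch.randn(4, 8)
# torch._dynamo.mark_dynamic(x, 0) is now ignored by export; pass the spec explicitly:
ep = export(M(), (x,), dynamic_shapes={"x": {0: Dim("batch")}})
```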

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135536
Approved by: https://github.com/avikchaudhuri
2024-09-12 21:22:19 +00:00
3d24313809 Pass ideep:lowp_kind to matmul_forward::compute on cache misses (#135058)
Optimized dynamic quantization for aarch64 was enabled by #126687 and #134897

This PR fixes an issue for aarch64 where on a [cache miss](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/quantized/cpu/qlinear_dynamic.cpp#L592) (e.g. if input dimensions change) [ideep::matmul_forward::compute ](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L160) (wrongly) runs with the [default lowp_kind (u8s8)](https://github.com/intel/ideep/blob/pytorch-rls-v3.5.3-2/include/ideep/operators/matmul.hpp#L174) which is not supported by oneDNN+ACL (Arm Compute Library), causing the workload to fall back to a much slower oneDNN gemm:jit kernel

Example:
```python
import torch

DIM = 4096
INPUT_SIZE1 = 32
INPUT_SIZE2 = 16

class LinearNet(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = torch.nn.Linear(DIM, DIM, bias=False)

    def forward(self, x):
        x = self.fc1(x)
        return x

input1 = torch.randn(size=(INPUT_SIZE1, DIM))
input2 = torch.randn(size=(INPUT_SIZE2, DIM))

with torch.no_grad():
    model = LinearNet()
    model = torch.ao.quantization.quantize_dynamic(model, {torch.nn.Linear})

    model(input1)   # this goes to ACL lowp_gemm
    print("="*50)
    model(input2)   # this goes to gemm:jit without this PR, and to ACL with this PR
```
In the code snippet above:
- The matmul from `model(input1)` goes to oneDNN+ACL (in both cases, with and without the PR)
- The matmul from `model(input2)`: **Without this PR**: there's a cache miss (different input shapes) and matmul_forward::compute is run with the default lowp_kind (u8s8). Hence the matmul falls back to gemm:jit in oneDNN. However, **With this PR** the matmul goes to oneDNN+ACL which is around 10x faster than oneDNN+jit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135058
Approved by: https://github.com/jondea, https://github.com/malfet
2024-09-12 20:30:20 +00:00
cd472bb1e3 [torch][fx] Add new replacement_callback to materialize a replacement just in time (#135553)
Summary:
Sometimes we only want to generate a replacement for a matched pattern
once we know some information about the nodes in the pattern.

So far, we have found this most useful for matching based on the specific shapes of tensors flowing into functions.
Use a callback function similar to `match_filters`. By default this isn't used.

Had to make `replacement` a None-able parameter because Callable was
already used to detect a case where a graph needed to be traced.

Differential Revision: D62412628

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135553
Approved by: https://github.com/SherlockNoMad
2024-09-12 18:52:14 +00:00
f032135bbf Add batching rule for torch.scatter_reduce (#135547)
Fixes #134797
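
For illustration, the kind of call a batching rule unlocks (a hypothetical per-sample reduction, vmapped over the leading dimension):

```python
import torch

def bucket_sums(src, index):
    # scatter-add per-sample values into 4 buckets
    return torch.zeros(4).scatter_reduce(0, index, src, reduce="sum")

src = torch.rand(8, 5)
index = torch.randint(0, 4, (8, 5))
out = torch.vmap(bucket_sums)(src, index)  # -> shape (8, 4)
```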

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135547
Approved by: https://github.com/zou3519
2024-09-12 18:51:21 +00:00
525bec804c NJT <-> padded dense conversions (#125947)
This PR:
* Implements the pre-existing `nt.to_padded_tensor(padding_val)` ATen op via the FBGEMM kernel + appropriate view gymnastics (since that kernel only handles 2D values)
* Introduces a new `_nested_from_padded_tensor` op for the reverse conversion, implemented via the reverse FBGEMM kernel + view gymnastics
    * Note: there is currently no public API for this; design booted to a future PR
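
A usage sketch for the `to_padded_tensor` direction in the first bullet above, assuming a jagged-layout nested tensor (the reverse path has no public API, as noted):

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 4), torch.randn(3, 4)], layout=torch.jagged
)
padded = nt.to_padded_tensor(0.0)  # shape (2, 3, 4), short rows padded with 0.0
```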

TODO:
* ~~Propagate min / max sequence length via the new factory function `_nested_from_padded_tensor`~~
* ~~Verify that Inductor does computation fusion via test logic~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125947
Approved by: https://github.com/soulitzer
2024-09-12 17:54:25 +00:00
83c594ebd6 [DSD] Fix distributed state dict full_state_dict option hang during set_state_dict (#135725)
Fix https://github.com/pytorch/pytorch/issues/134095
This fixes the distributed state dict full_state_dict option hanging during set_state_dict. We switch `_distribute_tensors` in _state_dict_utils.py to use `DTensor.from_local` instead of `distribute_tensor` to support the FSDP2+TP 2D strided sharding use case, as `distribute_tensor` cannot handle strided sharding yet. `distribute_tensor` incurs a scatter behind the scenes, while `DTensor.from_local` takes the local slice from the full tensor on each rank to create the DTensor (no collective). This means it's the user's responsibility to make sure the full_tensor from the full_state_dict is the same across all ranks.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135725
Approved by: https://github.com/fegin
2024-09-12 17:43:57 +00:00
c1277945d3 [AOTI][Tooling] Support debug printing for inductor level extern kernel call such as externkernel.addmm, bmm, etc. (#135731)
Summary:
As title.

Effect after merging this diff would look something like this:

```
        print('inductor: before_launch - triton_poi_fused_0 - buf0', buf0)
        triton_poi_fused_0.run(buf0, 6, grid=grid(6), stream=stream0)
        print('inductor: after_launch - triton_poi_fused_0 - buf0', buf0)
        buf1 = empty_strided_cuda((16, 6), (6, 1), torch.float32)
        # Topologically Sorted Source Nodes: [linear], Original ATen: [aten.addmm]
        print('inductor: before_launch - extern_kernels.addmm - buf0', buf0)
        extern_kernels.addmm(buf0, reinterpret_tensor(arg2_1, (16, 16), (16, 1), 0), reinterpret_tensor(L__self___weight, (16, 6), (1, 16), 0), alpha=1, beta=1, out=buf1)
        print('inductor: after_launch - extern_kernels.addmm - buf0', buf0)
```

Context: D62272588 only support major triton kernel jit inductor debug printing codegen

Test Plan: CI & OSS CI

Reviewed By: chenyang78, ColinPeppler

Differential Revision: D62397017

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135731
Approved by: https://github.com/ColinPeppler
2024-09-12 17:31:10 +00:00
dab7d646d5 Use a better decomposition for split_with_sizes (#135728)
This decomposition has less checks and improves the performance
of torch.compile.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135728
Approved by: https://github.com/ezyang
2024-09-12 16:38:51 +00:00
7647c398ff Allow optional positional arguments for torch.func.functional_call (#134643)
This PR resolves #134408. Add an additional test and have passed the local test.

Do you think we should add a post-check to ensure `args` and `kwargs` are not both `None`? It seems to be possible to have modules without inputs.

This PR does not include any such post-check.
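
A minimal sketch of what the change enables, assuming a module whose forward takes only keyword arguments:

```python
import torch
from torch.func import functional_call

class KwOnly(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.scale = torch.nn.Parameter(torch.tensor(2.0))

    def forward(self, *, x):
        return self.scale * x

m = KwOnly()
# positional args can now be omitted; kwargs alone are enough
out = functional_call(m, dict(m.named_parameters()), kwargs={"x": torch.ones(3)})
```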

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134643
Approved by: https://github.com/zou3519
2024-09-12 15:22:06 +00:00
d67cc58181 [ONNX] Fix symbolic values and numpy implementation (#135786)
1. Remove `__eq__` to make `SymbolicTensor` hashable and test for that (see the generic illustration below)
2. Update the `__array__` method so that it works for tensors on GPU
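
The hashability point in item 1 follows from standard Python semantics, shown generically below (this is not the ONNX code): defining `__eq__` on a class sets `__hash__` to `None` unless a hash is also provided.

```python
class WithEq:
    def __eq__(self, other):  # defining __eq__ implicitly sets __hash__ = None
        return isinstance(other, WithEq)

class Plain:
    pass

hash(Plain())     # fine
# hash(WithEq())  # TypeError: unhashable type: 'WithEq'
```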

Fixes https://github.com/pytorch/pytorch/issues/135700
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135786
Approved by: https://github.com/titaiwangms
2024-09-12 14:24:43 +00:00
dddaadac6c [dynamo] Dont graph break on inner torch.compile (#135819)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135819
Approved by: https://github.com/jansel
2024-09-12 11:39:09 +00:00
02169364e1 [inductor] Split reduction loops when there is no shared reads (#134307)
Fixes #129102

![image](https://github.com/user-attachments/assets/0d00f75b-2bb9-4ce6-a0d9-2daceaff539c)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134307
Approved by: https://github.com/shunting314
2024-09-12 09:45:08 +00:00
c30042fbeb [GPT-fast] Update compilation time target for Llama & Mixtral (#135817)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135817
Approved by: https://github.com/xmfan, https://github.com/huydhn
2024-09-12 07:13:44 +00:00
6700175531 [Inductor] simplify indexing_exprs in LoopBody._init_with_copy (#135574)
This PR uses `var_ranges` information to simplify `indexing_exprs` in `LoopBody._init_with_copy` to reduce occurrences of `FloorDiv` and `ModularIndexing` in the `indexing_exprs`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135574
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-09-12 06:56:34 +00:00
de8a8653c0 [dtensor][BE] replace compute_local_shape with compute_local_shape_and_global_offset (#135554)
**Summary**
1. This PR removes the public API `compute_local_shape` and replace its use with the more general API `compute_local_shape_and_global_offset`.
2. To keep `compute_local_shape_and_global_offset` consistent with `compute_local_shape` on empty shards, it now returns local tensor shape `(0,)` for empty shards which is more aligned with DTensor's semantics on non-participating ranks.

**Test**
`pytest test/distributed/_tensor/test_dtensor.py`
`pytest test/distributed/_tensor/test_init.py`
`pytest test/distributed/_tensor/test_tensor_ops.py`

Differential Revision: [D62415591](https://our.internmc.facebook.com/intern/diff/D62415591)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135554
Approved by: https://github.com/tianyu-l, https://github.com/wz337
2024-09-12 06:30:09 +00:00
86335e9135 [reland 3/3][fx] Bypass custom __setattr__ in Node.__init__ (#135735)
Relands #135079 which was reverted by #135562

I broke this up into three parts to test internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135735
Approved by: https://github.com/oulgen
2024-09-12 05:50:39 +00:00
14e3f3c062 [aoti] Remove nlohmann/json.hpp from header (#135765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135765
Approved by: https://github.com/malfet
2024-09-12 05:38:51 +00:00
9852c6d236 xpu: fix 3rd party builds on systems with cmake<3.25 (#135767)
The CMake `LINUX` variable is only available starting from CMake 3.25. Better to use `CMAKE_SYSTEM_NAME` instead to relax the CMake version requirement.

See: https://cmake.org/cmake/help/v3.25/variable/LINUX.html
Fixes: #135766
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135767
Approved by: https://github.com/malfet, https://github.com/guangyey
2024-09-12 05:31:01 +00:00
6354271178 [inductor] Skip unused call to get_estimated_runtime() (#135776)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135776
Approved by: https://github.com/oulgen
ghstack dependencies: #135445, #135446
2024-09-12 05:22:23 +00:00
12902f6ecf [inductor] Cache get_operation_names/get_buffer_names (#135446)
Before:
![image](https://github.com/user-attachments/assets/db5b6fce-d849-4512-a21d-7a09efc72311)

After:
![image](https://github.com/user-attachments/assets/097e340c-03b2-491e-ad36-132350b37892)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135446
Approved by: https://github.com/oulgen
ghstack dependencies: #135445
2024-09-12 05:22:23 +00:00
3decb676aa [inductor] Optimize cache_on_self (#135445)
This is a small compile time win, but also makes profiles more readable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135445
Approved by: https://github.com/oulgen
2024-09-12 05:22:23 +00:00
8d68a02905 OpenReg: Split the daemon into driver/executor (#135646)
Split the daemon into a proper user-process driver vs device-process executor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135646
Approved by: https://github.com/albanD
2024-09-12 05:03:46 +00:00
28330a8a39 [reland 1/3][fx] Bypass custom __setattr__ in Node.__init__ (#135733)
Relands #135079 which was reverted by #135562

I broke this up into three parts to test internally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135733
Approved by: https://github.com/oulgen
2024-09-12 04:29:37 +00:00
eaba287adb [dynamo] Bug fix for _torchdynamo_inline source handling (#135612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135612
Approved by: https://github.com/drisspg
2024-09-12 04:05:08 +00:00
cyy
f5f1d0a753 Fix build warnings for torch_python (#134981)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134981
Approved by: https://github.com/ezyang
2024-09-12 03:59:34 +00:00
5bc238c73e torch.hub: add get_dir/set_dir type hints (#134906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134906
Approved by: https://github.com/Skylion007
2024-09-12 03:53:29 +00:00
79223114db Avoid inserting extra transpose when the input to group norm is NHWC (#135575)
When the input format for group norm is NHWC and the device is privateuseone, it introduces an additional transpose operation. To avoid this issue, a check for the privateuseone device needs to be added here.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135575
Approved by: https://github.com/ezyang
2024-09-12 03:36:05 +00:00
cyy
7cfd23636c Fix clang-tidy warnings in Caffe2 code (#134935)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134935
Approved by: https://github.com/ezyang
2024-09-12 03:27:09 +00:00
0d1d69fd25 Update torch-xpu-ops pin (ATen XPU implementation) (#135647)
Release cycle for PyTorch 2.5
1. Fix a runtime error on Windows: torch_xpu_ops_unary_binary_kernels.dll fails to load because the binary size is too large.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135647
Approved by: https://github.com/EikanWang
2024-09-12 03:16:08 +00:00
21a64d57b1 [BE] typing for decorators - masked/_ops (#135108)
Differential Revision: D62184735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135108
Approved by: https://github.com/Skylion007
2024-09-12 01:34:09 +00:00
1a74952925 "Remove BLOCK_LIST" (#135729)
Summary:
Skip test_prepare_qat_conv_bn_fusion_getitem_placeholder when we use training IR, since it only targets the bn-getitem pattern, which doesn't exist in training IR.

Remove BLOCK_LIST since it's empty.
Now all internal unittests will use training ir.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan'  caffe2/test/quantization:test_quantization -- -r test_prepare_qat_conv_bn_fusion_getitem_placeholder
buck2 run 'fbcode//mode/dev-nosan'  caffe2/test:quantization_pt2e_qat -- -r test_prepare_qat_conv_bn_fusion_getitem_placeholder
```

Differential Revision: D62387987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135729
Approved by: https://github.com/tugsbayasgalan
2024-09-12 01:22:06 +00:00
a130ed828a Fix the upload of x86 micro benchmark results (#135780)
The upload-stats workflow currently skips this https://github.com/pytorch/pytorch/actions/runs/10807251335/job/29977650639, which was missed in https://github.com/pytorch/pytorch/pull/135042. So the workflow is running, but nothing has been uploaded yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135780
Approved by: https://github.com/atalman
2024-09-12 01:16:38 +00:00
eb0fe02933 [PT2][inductor][Optimus] Add pad_aten_mm_pass pattern to resolve long computation kernel in LCE (#135167)
Summary:
We observed another long computation issue for OBA_AFOC pyper model, thus adding a pattern to avoid the perf regression

- Only happens in A100
- Do not want to use force_shape_pad since it will pad all GEMMs, which may not be optimal. The Optimus pass has more flexibility to customize GEMM shapes and do the corresponding padding
- To enable, we pass the pass to config, where "k_threshold_to_pad" can be customized

inductor_config.patch(post_grad_fusion_options={"pad_aten_mm_pass": {"k_threshold_to_pad" : 8388608}})

Test Plan:
# unit test

```
buck2 test mode/opt //caffe2/test/inductor:pad_mm
```
Buck UI: https://www.internalfb.com/buck2/58b0f272-f405-45be-bc8d-aec2dc4d5841
Test UI: https://www.internalfb.com/intern/testinfra/testrun/10133099209954651
Network: Up: 9.0KiB  Down: 142B  (reSessionID-8eb71a37-a5ca-4aff-a4f1-93ade3e47e4e)
Jobs completed: 9. Time elapsed: 3:18.0s.
Cache hits: 0%. Commands: 3 (cached: 0, remote: 0, local: 3)
Tests finished: Pass 17. Fail 0. Fatal 0. Skip 0. Build failure 0

# e2e test
see [D62388582](https://www.internalfb.com/diff/D62388582)

Differential Revision: D62220158

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135167
Approved by: https://github.com/jackiexu1992
2024-09-12 00:51:34 +00:00
d270e2d240 [FSDP2] better error msg for cpu offloading (#135156)
When CPU offloading is enabled, if the user loads a GPU state dict, FSDP2 will throw a less obvious error at backward:
```
RuntimeError: attempting to assign a gradient with device type 'cpu' to a tensor with device type 'cuda'. Please ensure that the gradient and the tensor are on the same device
```

This PR throws the error more explicitly by specifying which parameters should be moved because of CPU offloading:

```
FSDP parameters should be materialized on cpu when enabling cpu offloading. For example, load cpu state dict or call module.to_empty(device="cpu"). Found following parameters on non-cpu device: ['0.weight']
```

`pytest -s test/distributed/_composable/fsdp/test_fully_shard_state_dict.py -k test_dp_state_dict_cpu_offload`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135156
Approved by: https://github.com/awgu
2024-09-12 00:05:07 +00:00
16b37b309f [Inductor] Rename cpp_wrapper_cuda.py as cpp_wrapper_gpu.py (#135313)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135313
Approved by: https://github.com/jansel, https://github.com/desertfire
ghstack dependencies: #135312
2024-09-11 23:59:54 +00:00
13ee85ca5e [Inductor] Generalize cuda cpp wrapper as common triton based GPU cpp wrapper, will be reused by xpu in next PR. (#135312)
[Inductor] Generalize cuda cpp wrapper as common triton based GPU cpp wrapper, will be reused by xpu in next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135312
Approved by: https://github.com/jansel, https://github.com/desertfire, https://github.com/eellison
2024-09-11 23:59:54 +00:00
94d2471d1f [Traceable FSDP2] Use .copy_ instead of .set_ for unsharded_param inplace update; Replace unsharded_param graph input usage with graph intermediate; Support FSDP2+LoRA (#133730)
Using `fsdp.set_` for the unsharded_param in-place update causes difficult-to-debug errors when enabling Traceable FSDP2 on TorchTune models. In this PR, we change it to use `fsdp.copy_`, which fixes the error and also strictly follows eager semantics (i.e. if the user explicitly stores an alias of the unsharded_param during execution of the user's module code, that alias will get updated correctly when the unsharded_param is copy_'d into; whereas if we just swap out the unsharded_param's storage via set_, that user-saved alias will not get updated, which is not good).

This PR also implements the graph pass to remove the resizes and copy if there is a resize_(full) -> copy_ -> resize_(0) pattern.

------

Test commands:
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_trace_fsdp_copy_`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_partitioner_cse_respects_mutation_boundaries`
- `pytest -rA test/dynamo/test_repros.py::ReproTests::test_fsdp_set_input_mutation_applied_when_input_gets_no_gradients`
- `pytest -rA test/inductor/test_pattern_matcher.py::TestPatternMatcher::test_mutation_op_matching`
- `python test/inductor/test_distributed_patterns.py DistributedPatternTests.test_fake_distributed_aot_eager`
- `PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=1 PYTORCH_TEST_WITH_CROSSREF=1 python test/functorch/test_aotdispatch.py TestEagerFusionOpInfoCPU.test_aot_autograd_exhaustive_norm_cpu_float32`
- `python test/distributed/test_inductor_collectives.py TestCollectivesInductor.test_backwards`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133730
Approved by: https://github.com/bdhirsh
2024-09-11 23:01:05 +00:00
5ca46be15e Fix/torch cat doc attr (#135698)
The argument name for tensors in the `torch.cat` docs differs from the method signature, unlike other methods.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135698
Approved by: https://github.com/albanD

Co-authored-by: Alexander Jipa <azzhipa@amazon.com>
2024-09-11 22:32:55 +00:00
9a04cfbeff fix for fp16 (#134106)
This PR is a replacement for https://github.com/pytorch/pytorch/pull/133085 for pushing a quick fix for RMSNorm.
The original author is @kkontny

Previous PR summary:
Since FP16 has quite a small dynamic range, it is very easy to overflow while computing `at::pow(input, 2)`, and it happens in real-world computation.

I've tried to use the `nn.RMSNorm` fused implementation instead of `LlamaRMSNorm` inside the `transformers` implementation of Llama (`src/transformers/models/llama/modeling_llama.py`). It started to give wrong answers in FP16 while still giving good results in FP32. I figured out that this happens due to overflow while computing the square of the input tensor.

The original `LlamaRMSNorm` implementation upcasts the input to fp32 to prevent this and give better numerical stability.

```
class LlamaRMSNorm(nn.Module):
    def __init__(self, hidden_size, eps=1e-6):
        """
        LlamaRMSNorm is equivalent to T5LayerNorm
        """
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states):
        input_dtype = hidden_states.dtype
        hidden_states = hidden_states.to(torch.float32)
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states.to(input_dtype)
```

The proposed commit fixes the issue. FP16 in RMSNorm has to be treated in a special way to be usable in real-world implementations.
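
A tiny numeric illustration of the overflow, independent of this PR's kernel changes: the FP16 maximum is roughly 65504, so even modest activations overflow when squared.

```python
import torch

x = torch.tensor([300.0], dtype=torch.float16)
print(x.pow(2))          # tensor([inf], dtype=torch.float16) -- 90000 > 65504
print(x.float().pow(2))  # tensor([90000.]) -- upcasting avoids the overflow
```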

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134106
Approved by: https://github.com/mikaylagawarecki, https://github.com/eqy
2024-09-11 22:02:07 +00:00
66db61f0d1 [ONNX] Update fake mode usage in onnx docs (#135512)
Update fake mode usage in onnx docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135512
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-09-11 21:29:04 +00:00
c025f7becc Revert "[Partitioner] Reuse partition to check whether nodes exist (#135317)"
This reverts commit e004d539da3335d97a8134c9081245628f18eb67.

Reverted https://github.com/pytorch/pytorch/pull/135317 on behalf of https://github.com/izaitsevfb due to BC-breaking, breaks executorch and internal meta builds ([comment](https://github.com/pytorch/pytorch/pull/135317#issuecomment-2344730294))
2024-09-11 21:27:53 +00:00
8c4e1148b8 Refactoring byte_order (#135558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135558
Approved by: https://github.com/mikaylagawarecki
2024-09-11 21:06:43 +00:00
e20ee39558 Expand bitwise ops to unsigned types (#135525)
Fixes https://github.com/pytorch/pytorch/issues/135436
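
For illustration (using `torch.uint16` as an assumed example of the unsigned dtypes covered by the linked issue):

```python
import torch

a = torch.tensor([0b1010], dtype=torch.uint16)
b = torch.tensor([0b0110], dtype=torch.uint16)
print(a & b, a | b, a ^ b)  # bitwise ops on unsigned dtypes now work
```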

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135525
Approved by: https://github.com/ezyang
2024-09-11 20:48:52 +00:00
74fd1bf965 [ROCm] Update to AOTriton 0.7b (#134498)
Notable changes:
1. Enable CudaGraph related tests
2. Fix UT problems
3. EXPERIMENTAL Navi31 support. User should enable Navi31 support with Env Var `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1`

Known Problems:
1. `test/test_transformers.py` will show massive failures and/or NaN outputs with `--use-pytest`
    + Update: Confirmed that skipping `class TestSDPAPrivateUse1Only` fixes the problem with `--use-pytest`

Note:
AOTriton 0.7b adds support for nested tensors + SDPA but needs more work (and consequently a separate PR) to enable it.

Fixes #133540

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134498
Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet
2024-09-11 20:34:01 +00:00
5d964a5eb7 [Export] Fix SDPA decomposition (#135297)
Summary: Update the SDPA decomposition to match the updated strides from D62009189, which aligns strides with `aten._scaled_dot_product_attention_math.default` and makes `t.permute().contiguous().permute()` no longer necessary.

Test Plan: CI

Differential Revision: D62278378

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135297
Approved by: https://github.com/drisspg
2024-09-11 20:21:59 +00:00
118d7e1480 [Inductor] add _dynamo.reset to test_cat_slice_cat_cuda (#135694)
Summary: test_cat_slice_cat_cuda runs inductor multiple times and checks counters["inductor"] in between, and thus we need to reset properly.

Differential Revision: D62500331

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135694
Approved by: https://github.com/masnesral
2024-09-11 20:07:11 +00:00
dd47f6f623 Simplify expr before getting implications in _maybe_evaluate_static (#135499)
Fixes #134268

Previously we weren't simplifying these expressions before calling get_implications, resulting in inconsistent application of FloorDiv/CleanDiv. See #134268  for more details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135499
Approved by: https://github.com/ezyang
2024-09-11 19:48:29 +00:00
e05ea2b179 Add decomposition for transpose_copy (#130943)
* Extracted from #128416
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130943
Approved by: https://github.com/amjames, https://github.com/eellison
2024-09-11 19:45:22 +00:00
ad75b09d89 Replace capture_pre_autograd_graph with export_for_training in torch tests (#135623)
Summary: as title

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r test_conv_dynamic
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r matcher
 buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r x86
```

CI

Differential Revision: D62448302

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135623
Approved by: https://github.com/tugsbayasgalan
2024-09-11 19:23:08 +00:00
a2cb9b7331 Flip triton kernel default layout constraint to "needs_fixed_stride_order" (#135581)
This is to match the default layout constraint for custom operators. By
default, Inductor should match the stride order of inputs to a triton
kernel.

Test Plan:
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135581
Approved by: https://github.com/eellison
ghstack dependencies: #135530
2024-09-11 18:43:18 +00:00
451eaf0ff2 Log full exception trace when error raised in Dynamo (#135697)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135697
Approved by: https://github.com/Skylion007
2024-09-11 18:14:33 +00:00
09519eb195 Support rolling over a percentage of workflows (#134816)
In order to support adding a rollover percentage, this ended up being a complete rewrite of runner_determinator.py.

Details of the new format are in the comments up top.

On the plus side, this now includes some unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134816
Approved by: https://github.com/PaliC, https://github.com/zxiiro
2024-09-11 18:01:26 +00:00
5314ae2660 Don't use exception chaining for BackendCompilerFailed (#135545)
Commandeered from https://github.com/pytorch/pytorch/pull/135496 as I'm now helping @ezyang ship dynamic float arguments in PT2.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135545
Approved by: https://github.com/ezyang
2024-09-11 17:49:18 +00:00
da587de9cb [ROCm] [BUGFIX] Re-enable rocm-specific tuning parameters v2 (#133852)
Small bug fix - https://github.com/pytorch/pytorch/pull/124592 replaced the torch.version.hip with device_props but made a mistake in porting the original logic.

The original code was:
`if torch.version.hip is not None:`

Which was incorrectly replaced by:
`if self.device_props.type != "hip":`

Another occurence of https://github.com/pytorch/pytorch/pull/130617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133852
Approved by: https://github.com/masnesral, https://github.com/malfet
2024-09-11 17:21:40 +00:00
82a4df2d5f [CI] [ROCm] Run rocm workflow on every push to main branch (#135644)
Dial the frequency back up from https://github.com/pytorch/pytorch/pull/131637

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135644
Approved by: https://github.com/huydhn
2024-09-11 17:21:05 +00:00
18a9030952 [CI] Fix update slow tests (#135390)
* Add pytorchbot to list of approvers for file
* Add labels to the auto created PR

The auto generated PR is currently not merging due to some failing tests on slow workflow that were supposed to be moved back to normal

I don't know if this has much value; clearly we've been managing without the update.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135390
Approved by: https://github.com/ZainRizvi
2024-09-11 17:02:17 +00:00
03f23d07b4 Optimize ShapeEnv.replace (#135652)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135652
Approved by: https://github.com/ezyang
ghstack dependencies: #135621, #135622
2024-09-11 16:50:59 +00:00
8c738c9270 Improve performance of sympy_generic_le (#135622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135622
Approved by: https://github.com/ezyang
ghstack dependencies: #135621
2024-09-11 16:20:03 +00:00
7ddacaf40a Improve performance of canonicalize_bool_expr (#135621)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135621
Approved by: https://github.com/ezyang
2024-09-11 16:20:03 +00:00
183c32fd3b Revert "[Dynamo] Trace torch function modes entered outside of torch.compile (#133137)"
This reverts commit 0d15122092c27fec1143b800bab7c996d126b547.

Reverted https://github.com/pytorch/pytorch/pull/133137 on behalf of https://github.com/clee2000 due to something in this stack broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](444b52ff40), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/133137#issuecomment-2344054339))
2024-09-11 15:57:00 +00:00
3ab12e2596 Revert "[Dynamo] Support thread local setattr (#135443)"
This reverts commit 160c228a4bd60ceffa62b045a6b0a6f9413835c5.

Reverted https://github.com/pytorch/pytorch/pull/135443 on behalf of https://github.com/clee2000 due to something in this stack broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](444b52ff40), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/135443#issuecomment-2344042800))
2024-09-11 15:53:55 +00:00
596e93b506 Revert "[dynamo] Bug fix for _torchdynamo_inline source handling (#135612)"
This reverts commit 5c3d0a2dedbc0e85f3b256ce56ac674078a5fae1.

Reverted https://github.com/pytorch/pytorch/pull/135612 on behalf of https://github.com/clee2000 due to broke inductor/test_cpu_select_algorithm.py::TestSelectAlgorithmCPU::test_linear_input_transpose_bias_True_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10805518363/job/29982386304) [HUD commit link](5c3d0a2ded), bad TD ([comment](https://github.com/pytorch/pytorch/pull/135612#issuecomment-2344039370))
2024-09-11 15:51:12 +00:00
f96e8041b1 Revert "[Dynamo] Simplify torch function mode stack guard (#135444)"
This reverts commit 444b52ff40cf4afce7bc3fdcf021a88eab3b954c.

Reverted https://github.com/pytorch/pytorch/pull/135444 on behalf of https://github.com/clee2000 due to something in this stack broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](444b52ff40), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/135444#issuecomment-2344036843))
2024-09-11 15:48:27 +00:00
7cf9c81918 Revert "[Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)"
This reverts commit 6a3edfcc1e474e6ebd0c06624000a6d6bf1a0dee.

Reverted https://github.com/pytorch/pytorch/pull/134732 on behalf of https://github.com/clee2000 due to broke functorch/test_control_flow.py::TestControlFlow::test_scan_simple_graph [GH job link](https://github.com/pytorch/pytorch/actions/runs/10804912306/job/29980571390) [HUD commit link](444b52ff40), newly added test yesterday ([comment](https://github.com/pytorch/pytorch/pull/134732#issuecomment-2344016694))
2024-09-11 15:39:21 +00:00
49e0b88aab Fix test_triton_kernel_float64_constant (#135583)
Summary: Landed https://github.com/pytorch/pytorch/pull/135260 too soon, and the test in that PR doesn't do exactly what I tested (i.e., actually testing different dtypes).

Test Plan: `python test/inductor/test_triton_kernels.py -k float64_constant`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135583
Approved by: https://github.com/isuruf, https://github.com/eellison, https://github.com/Skylion007
2024-09-11 15:16:23 +00:00
ee8c5cc1cc For S444023: Back out "deprecate search_autotune_cache (#133628)" (#135186)
Summary: For S444023

Test Plan:
Revert prevented the NaN errors - f639391901
Training job ran for 7767 iterations. NaN errors show up within the first 1k.

Reviewed By: nmacchioni

Differential Revision: D62224747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135186
Approved by: https://github.com/kit1980
2024-09-11 14:08:40 +00:00
ce4d146f56 ATen | Fix MPSCNNNeuron creation on Mac Catalyst. (#135595)
Summary:
These are still used directly for relu/sigmoid/tanh tensors from here: https://fburl.com/code/k6n7ofzd
However, on Mac Catalyst we were always returning `nil`, which in most cases rendered the entire graph useless and usually left only stray `MPSTemporaryImage` references that were never written into.

This fixes the issue by making sure we always return valid kernels, so they can be executed.

Test Plan: Tested with a segmentation net that uses a combination of relu and other tensors, run via a Mac Catalyst build - it works! {F1858576745}

Reviewed By: MichaelTay

Differential Revision: D62430010

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135595
Approved by: https://github.com/MichaelTay
2024-09-11 11:12:23 +00:00
0226fcaacf Disable cuda specific restrictions in _scaled_mm for other devices (#135579)
Fixes #135576

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135579
Approved by: https://github.com/drisspg
2024-09-11 11:05:38 +00:00
4cde5096c4 [Inductor][FlexAttention] Supports dynamic shapes with block mask (#135629)
Fixes #134560 and #135206

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135629
Approved by: https://github.com/drisspg
2024-09-11 08:10:50 +00:00
443c015393 [Distributed] Improve efficiency of NaN checker (#135414)
Some customers would like to run the NaN checks on the fly, so we are improving their efficiency.

## Benchmarking
Allreduce 2G floats. `TORCH_NCCL_NAN_CHECK=1`
Red kernel: ncclAllreduce
Blue kernel: Nan check

<img width="1093" alt="Screenshot 2024-09-06 at 10 00 05 PM" src="https://github.com/user-attachments/assets/5501bc31-024f-4115-adb2-dd66eb4025d3">

## Comparison with torch ops:
Let's say a user manually checks for NaNs with the following torch ops before all-reduce:
```
torch.any(torch.isnan(x))
```
<img width="1091" alt="Screenshot 2024-09-06 at 10 14 53 PM" src="https://github.com/user-attachments/assets/1f8b5f63-c955-4612-bb96-241b6c69959b">

So our perf is on-par with torch ops.

## Changes
- Load from vidmem using "big packs" of 16 bytes
- Bump `blockDim.x` from 256 to 512
- Separate loads and checks into two loops, each of 8 iterations
- Unroll the loops
- Templated functions for checking NaN in a "big pack" based on dtype

Special thanks to @jbachan from NCCL!
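As a rough usage sketch, the check is enabled via the env var mentioned above; the distributed setup below is ordinary boilerplate and assumes a multi-GPU torchrun launch:
```python
import os
os.environ.setdefault("TORCH_NCCL_NAN_CHECK", "1")  # enable the on-the-fly NaN check

import torch
import torch.distributed as dist

dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

x = torch.randn(2**20, device="cuda")
dist.all_reduce(x)            # the NaN check runs on the input buffer before the NCCL kernel
dist.destroy_process_group()
```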
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135414
Approved by: https://github.com/wconstab
2024-09-11 07:53:42 +00:00
4ae6d7c18f Back out "[pytorch][PR] [export] fix re-export custom metadata" (#135634)
Summary: Broke some tests. Revert this diff

Test Plan: CI

Differential Revision: D62474337

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135634
Approved by: https://github.com/tugsbayasgalan
2024-09-11 06:16:26 +00:00
3084b7b5c0 [cuDNN][SDPA] Support attn_bias in cuDNN (#130482)
CC @drisspg
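A hedged sketch of exercising the new path; whether the cuDNN backend is actually selected depends on the build, and `sdpa_kernel`/`SDPBackend` are used here only as the standard backend-selection helpers:
```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q, k, v = (torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16) for _ in range(3))
bias = torch.randn(2, 8, 128, 128, device="cuda", dtype=torch.float16)  # additive attn_bias

with sdpa_kernel(SDPBackend.CUDNN_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, attn_mask=bias)
```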

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130482
Approved by: https://github.com/drisspg, https://github.com/Skylion007, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-11 05:59:25 +00:00
5c3d0a2ded [dynamo] Bug fix for _torchdynamo_inline source handling (#135612)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135612
Approved by: https://github.com/drisspg
ghstack dependencies: #135588
2024-09-11 05:23:42 +00:00
c608b17f60 [PTD][BE][c10d] Add some code documents for TCPStore code and cosmetic changes to libUVStore code (#130496)
While designing something else that needs TCPStore, I spent some time digging into the TCPStore codebase and found the code a little challenging to understand without proper documentation. Although people in the OSS community are surely smarter than me, I still want to document my findings in the code so that devs and users can use them as a reference down the road.

Also, for libuv we need to prefix private variables with a "_", so there is a pure renaming of private variables such as `tcpServer`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130496
Approved by: https://github.com/wconstab
2024-09-11 04:42:25 +00:00
444b52ff40 [Dynamo] Simplify torch function mode stack guard (#135444)
The semantics of ignored modes previously had edge cases; this eliminates them by, in essence, filtering any ignored modes out of both the ref stack and the current torch function mode stack. This is purely to reduce complexity in #135422. The ignored-modes handling will be removed in a future PR after https://github.com/pytorch/pytorch/pull/135422 lands, since we will then trace through DeviceContexts instead of inserting them into the graph, which is what required these extra workarounds for correctness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135444
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
ghstack dependencies: #134732, #133137, #135443
2024-09-11 04:18:22 +00:00
160c228a4b [Dynamo] Support thread local setattr (#135443)
In preparation for tracing through DeviceContext (defb515306/torch/utils/_device.py (L66)), this PR adds support for calling the setattr of thread-local objects. These objects have a slots-based implementation, and since this setattr doesn't appear to have any side effects, we call that setattr impl when replaying mutations, because calling `object.__setattr__` on these objects results in a type error.
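A small illustration of the behavior referenced above (the exact error wording varies by CPython version):
```python
import threading

tls = threading.local()
tls.x = 1                 # the type's own (slot-based) __setattr__ works fine

try:
    object.__setattr__(tls, "y", 2)   # bypassing the type's __setattr__ is rejected
except TypeError as e:
    print(e)              # e.g. "can't apply this __setattr__ to _thread._local object"
```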

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135443
Approved by: https://github.com/anijain2305
ghstack dependencies: #134732, #133137
2024-09-11 04:18:22 +00:00
0d15122092 [Dynamo] Trace torch function modes entered outside of torch.compile (#133137)
This PR adds initial tracing for torch function modes.

Details:
In essence, this adds tracing into the torch function of modes entered outside of the torch.compile call.
This does not yet support tracing enter/exit of a torch function mode or tracing set_default_device properly using the new mode infra (this will be a very good stress test for modes). I am adding more PRs to this stack to support these. The overall plan is to support tracing enter/exit and handling graph breaks like we do for other torch.* context managers.
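As a rough sketch of the scenario this enables, using a toy `TorchFunctionMode` purely for illustration:
```python
import torch
from torch.overrides import TorchFunctionMode

class ScaleOutputs(TorchFunctionMode):
    """Toy mode for illustration: double every tensor result."""
    def __torch_function__(self, func, types, args=(), kwargs=None):
        out = func(*args, **(kwargs or {}))
        return out * 2 if isinstance(out, torch.Tensor) else out

@torch.compile
def f(x):
    return torch.sin(x) + 1

x = torch.randn(4)
with ScaleOutputs():      # mode entered *outside* the torch.compile call
    y = f(x)              # the mode's __torch_function__ now participates in tracing
```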

Previously landed:
https://github.com/pytorch/pytorch/pull/133135
https://github.com/pytorch/pytorch/pull/133136
https://github.com/pytorch/pytorch/pull/133134
https://github.com/pytorch/pytorch/pull/133133
https://github.com/pytorch/pytorch/pull/133132
https://github.com/pytorch/pytorch/pull/133131
https://github.com/pytorch/pytorch/pull/133729
https://github.com/pytorch/pytorch/pull/133130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133137
Approved by: https://github.com/jansel, https://github.com/zou3519
ghstack dependencies: #134732
2024-09-11 04:18:22 +00:00
6a3edfcc1e [Dynamo] Use custom backend to reenter metadata tf mode when tracing while/cond (#134732)
For tracing cond/while in eager, we trace the HOP with the eager backend with the metadata torch function mode enabled. HOPs disallow the mutation that occurs in this torch function mode, so it cannot be traced. As a result, we use a custom backend which enters this mode for tracing these HOPs. Thanks to @ydwu4 for the help with implementing this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134732
Approved by: https://github.com/ydwu4
2024-09-11 04:18:22 +00:00
356f14e7b7 Fix the output of FileCheck when not run and add unit tests (#135345)
When FileCheck is destructed without execution, it should output all rules.
For example:
```
>>> fc = FileCheck().check("test")
>>> del fc
You have not run this instance of FileCheck!
FileCheck checks:
        CHECK: test
```

Additionally, unit tests for the Python interface of FileCheck will be added.
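For contrast, a quick sketch of the executed path using the same Python-facing FileCheck API:
```python
from torch.testing import FileCheck

# When run() is called, no "You have not run this instance" warning is printed on deletion.
FileCheck().check("test").run("this is a test string")
```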

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135345
Approved by: https://github.com/eellison
2024-09-11 04:13:24 +00:00
34dc8f69a1 Adding entry-point based support for out-of-tree rendezvous plugins (#132633)
Fixes #127519

Currently in torchrun rendezvous, there are only two rendezvous backends supported out of the box: `C10d` and `Etcd`. The changes in this PR enable distributed elastic users to bring their out-of-tree rendezvous backend implementations as Python packages.

#### AUTHORING NEW PLUGIN
Any new plugin will be a Python package exposing entry points. For example, the structure of the redis plugin is as follows:

```
plugin_root
|_ pyproject.toml
|_ src
   |_ redis
      |_ __init__.py
      |_ redis_store.py
      |_ redis_backend.py
```

The contents of `pyproject.toml` should indicate that the package exposes a torchrun entry point by mentioning the group name `torchrun.plugins`. The `pyproject.toml` for the redis plugin would be as follows:

```
[project]
name = "redis"
version = "0.0.1"

[project.entry-points.'torchrun.plugins']
redis = 'redis'
```

The `src/redis/__init__.py` file would contain functions that return the plugin name and plugin handler. The contents of `__init__.py` for redis would be as follows:

```
from torch.distributed.elastic.rendezvous import RendezvousParameters
from torch.distributed.elastic.rendezvous.dynamic_rendezvous import create_handler

def getPluginHandler():
    def _create_redis_handler(params: RendezvousParameters):
        from .redis_backend import create_backend  # implemented in redis_backend.py above
        backend, store = create_backend(params)
        return create_handler(store, backend, params)
    return _create_redis_handler
```

The files `redis_store` and `redis_backend` contain the implementation of [Store](41189b0da4/torch/_C/_distributed_c10d.pyi (L171)) and [RendezvousBackend](e782918b8e/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py (L61)) respectively.

#### USER EXPERIENCE
Before using the plugin for the first time, the user has to install the plugin packages. For example, published packages can be installed using `pip3 install <plugin-name>`, and a plugin in the local file system can be installed using `pip3 install -e <plugin-location>`.

Once installed, the new backend can be used in torchrun as follows:

```
torchrun --rdzv-backend=redis --rdzv-endpoint=redis-container:6379 --nnodes=3 --nproc-per-node=1 --max-restarts=3 --rdzv-id=1 test.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132633
Approved by: https://github.com/fduwjj
2024-09-11 03:35:02 +00:00
cd9ee49a69 [aoti] Add cpp loader (#135374)
* Added a cpp loader, AOTIModelPackageLoader, which can load the .pt2, build the .so, and create a runner. The python-facing API is that users can directly call the `run` function, whereas in cpp users can directly access the `runner_` if they are more familiar with that. I couldn't figure out how to bind the `get_runner()` function to python...
* Added a new config, `aot_inductor.package_cpp_only` which will **not** package the so. This means that whenever the package is loaded, we will need to build the so. This is turned off by default so that new environments do not need to rebuild their so. The `package_cpp_only` is a feature which torchchat intends to use to provide flexibility to users.
* Added a new config, `aot_inductor.metadata` which stores user-provided metadata, serialized to the pt2 as a json file. It also stores the device used when exporting, "cuda" or "cpu", so that during load time, we can use that data to determine which AOTIModelContainerRunner to use. The metadata can be accessed through `loader.get_metadata()`. TODO is to move this metadata to the toplevel `package_aoti` function so that we can remove the metadata as a config.
* Separated out `package_aoti` as a standalone function, instead of it automatically being called in inductor. This is to prepare for the case where users will compile multiple models, and want to bundle it in one package. The specific use case is in torchchat, where we want to package the separately-exported encoder and decoder layers. An example of how to use this is in `test_multiple_methods`.
* `load_package` will load a singular model, given the model name.
* The loader doesn't support Windows for now; I think I need to add some more casing to make the build commands work on Windows.

Differential Revision: [D62329906](https://our.internmc.facebook.com/intern/diff/D62329906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135374
Approved by: https://github.com/desertfire, https://github.com/malfet
2024-09-11 03:00:01 +00:00
26e5572dd2 Bump triton xpu pin and release version (#135638)
Similar with https://github.com/pytorch/pytorch/pull/135627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135638
Approved by: https://github.com/atalman
2024-09-11 00:56:15 +00:00
693897df42 [dynamo] Missing guard source keys for corner case of NNModuleVariabl… (#135041)
Potentially fixes - https://fb.workplace.com/groups/1286739428954016/permalink/1319662695661689/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135041
Approved by: https://github.com/ezyang
2024-09-11 00:43:26 +00:00
3bf6be457d [MPS] Add missing dispatch to rshift.Tensor (#135607)
Missed it while working on https://github.com/pytorch/pytorch/pull/131813
Test plan: `python -c "import torch;print(torch.randint(100, 500, (64,), device='mps') >> torch.tensor([3,], device='mps'))"`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135607
Approved by: https://github.com/manuelcandales
2024-09-11 00:20:53 +00:00
492f064f15 [ONNX] Add assertion nodes to ignoring list (#135591)
Fixes #135419

PS: there are 104 empty-output nodes; I suggest we add them one by one as we run into them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135591
Approved by: https://github.com/justinchuby
2024-09-11 00:18:17 +00:00
29408ea81a Add option to tweak inductor stride settings for user-defined triton kernels (#135530)
Previously, Inductor was allowed to modify the stride/storage_offset
(layout) for inputs to user-defined triton kernels. This can cause
silent incorrectness because most triton kernels are written for a
specific striding pattern (usually contiguous).

This PR adds a config to allow the user to choose Inductor's behavior on
this. The options are:
- "flexible_layout" (default): Inductor can modify the layout for inputs
  to user-defined triton kernels as much as it wants.
- "needs_fixed_stride_order": Inductor must preserve the stride order
  (when compared to tracing) for inputs to user-defined triton kernels.

This matches our handling for custom operators. In the future, we'll
want a "needs_exact_strides" option (this is the safest option).

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135530
Approved by: https://github.com/FindHao, https://github.com/oulgen
2024-09-11 00:11:17 +00:00
02dcb07765 Add boolean support in pack segments ops for both cpu and cuda impls (#132897) (#135620)
Summary:

Same as int types, forward only.

bypass-github-export-checks diff has been synced to github

Test Plan:
buck test mode/dev-nosan //caffe2/torch/fb/sparsenn:test -- test_pack_segments
https://www.internalfb.com/intern/testinfra/testconsole/testrun/16888498646804437/

Reviewed By: garroud

Differential Revision: D60785563

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135620
Approved by: https://github.com/kit1980

Co-authored-by: Haoming Lu <haominglu@meta.com>
2024-09-11 00:03:17 +00:00
5c38aa72c0 [dynamo][dicts][nv-embed] Support update with kwargs (#135588)
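Roughly the pattern this enables (a toy sketch, not the original failing case):
```python
import torch

@torch.compile
def f(x):
    d = {"a": x}
    d.update(b=x + 1)          # dict.update called with keyword arguments
    return d["a"] * d["b"]

print(f(torch.randn(3)))
```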
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135588
Approved by: https://github.com/yanboliang
2024-09-10 23:50:23 +00:00
5134ba7458 Bump triton pin and release version (#135627)
Update the pin and release version to sync with https://github.com/triton-lang/triton/tree/release/3.1.x

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135627
Approved by: https://github.com/Chillee, https://github.com/drisspg, https://github.com/malfet
2024-09-10 23:46:36 +00:00
e48ee2cf50 [ONNX] Fix scaled_dot_product_attention with float scale (#135594)
Fixes #125158

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135594
Approved by: https://github.com/justinchuby
2024-09-10 23:04:02 +00:00
eb38ee21ba [ROCm] slow torch.sum optimization by increasing max_values_per_thread in reduce config (#135397)
Fixes #132964

This change optimizes torch.sum() performance by increasing max_values_per_thread in setReduceConfig() for the ROCm platform.
Increasing this parameter uses fewer thread blocks and improves performance.

Test:
Tested on MI300x and H100; the MI300x perf improved from ~1690 GByte/s to 3205 GByte/s for the test case, slightly better than H100 (3136 GByte/s).

Also tested with other tensor sizes and saw perf improvements there as well.

```python
import torch
from triton.testing import do_bench

x = torch.randn(2**30, device='cuda')

ms = do_bench(lambda: x.sum(dim=-1))

bandwidth_gbyte = x.numel() * x.dtype.itemsize / (10**9)

time_s = ms / 1000

bw_per_second = bandwidth_gbyte / time_s

print(bw_per_second)
```

Co-author: @carlobertolli

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135397
Approved by: https://github.com/eqy, https://github.com/malfet
2024-09-10 21:03:01 +00:00
8057b72763 [ez][inductor] don't benchmark cloning if there are no mutated args (#135533)
When a kernel does not have mutated args (which is quite common), benchmarking the cost of cloning actually benchmarks a no-op. This still takes >100ms since triton.testing.do_bench allocates a 100 ms budget to run the kernel.
Skipping this benchmarking can save quite a bit of compilation time if the code path is hit many times; for example, if the code path is hit 100 times on a large graph, we would save >10s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135533
Approved by: https://github.com/jansel
ghstack dependencies: #135531
2024-09-10 20:54:31 +00:00
7b17918dc9 [inductor] fix a device sync issue for benchmarking fusion (#135531)
Fix https://github.com/pytorch/pytorch/issues/134768 .

When we benchmark the latency for a fused node set, we do benchmarking twice:
1. benchmark the latency of the kernel including cloning mutated args
2. benchmark the latency of cloning mutated args without running the kernel

We subtract result 2 from result 1 to get the latency of the kernel itself.

But when the tensors are not on cuda device 0, we get equal numbers for result 1 and result 2 no matter how much work the kernel does. The root cause is that in `triton.testing.do_bench` the `torch.cuda.synchronize` call syncs the current cuda device (which is device 0 if not overridden). Since the tensors and kernels live on another device, the sync actually does nothing (unless there happen to be other kernels on device 0).

The fix is to set the correct current device in our benchmarking code.
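A minimal illustration of the root cause and the fix (assumes at least two visible GPUs):
```python
import torch

a = torch.randn(4096, 4096, device="cuda:1")
b = a @ a                          # work queued on cuda:1

torch.cuda.synchronize()           # syncs the *current* device (cuda:0); may not wait for b

with torch.cuda.device(a.device):  # the fix: make the tensor's device current first
    torch.cuda.synchronize()       # now actually waits for the matmul on cuda:1
```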

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135531
Approved by: https://github.com/jansel
2024-09-10 20:54:31 +00:00
66c45f3ed9 [export] fix re-export custom metadata (#135282)
Fixes #134778

When a model is exported and debug handles are added to the "custom" field of non-placeholder and non-output nodes in the graph, re-exporting it will change the metadata of placeholder nodes (the "custom" field will be added or copied to these nodes, depending on whether `ExportedProgram` or `ExportedProgram.module()` is passed to `generate_numeric_debug_handle()`).

This occurs because when we re-export the model, `placeholder` nodes are unlifted to `get_attr` nodes. These nodes remain as `get_attr` after being exported to `gm_torch_level`. Their metadata is modified [here](https://github.com/pytorch/pytorch/blob/main/torch/export/_trace.py#L1347) based on `params_buffers_to_node_meta`, which is collected [here](https://github.com/pytorch/pytorch/blob/main/torch/export/_trace.py#L1312).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135282
Approved by: https://github.com/jerryzh168, https://github.com/zhxchen17, https://github.com/tugsbayasgalan
2024-09-10 20:15:02 +00:00
0a9d55d2ee Revert "[AOTI] Fix assert_function call in cpu autotune template (#135086)"
This reverts commit 16c3b8f87cfa9cb5acee8104820baa389e7ee2bd.

Reverted https://github.com/pytorch/pytorch/pull/135086 on behalf of https://github.com/izaitsevfb due to breaks internal tests, see D62405818 ([comment](https://github.com/pytorch/pytorch/pull/135086#issuecomment-2341889428))
2024-09-10 19:51:16 +00:00
4ca65d3323 [CI] Increase sharding for jobs that are timing out (#135582)
Increase sharding for
* slow grad check
* slow cuda tests slow / linux-focal-cuda12.1-py3.10-gcc9-sm86 / test
* avx

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135582
Approved by: https://github.com/huydhn, https://github.com/malfet
2024-09-10 19:45:13 +00:00
c932b39739 [FSDP2] Added _set_unshard_async_op (#135523)
This PR adds a private API `_set_unshard_async_op` that allows for running pre-forward and pre-backward all-gathers using the `async_op=True` path so that all-gather allocations happen in the default stream to avoid inter-stream fragmentation.

If using this option, forward requires explicit prefetching e.g. via the `unshard(async_op=True)` API for overlap. fp32 -> bf16 casts and the all-gather copy-in will not overlap with compute.

Differential Revision: [D62401551](https://our.internmc.facebook.com/intern/diff/D62401551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135523
Approved by: https://github.com/weifengpy
2024-09-10 19:28:02 +00:00
1f15973657 [AOTI][Tooling][7/n] Add debug printing support for JIT inductor codegen path as well (#135285)
Summary:
1. Add the debug printer call one level lower, in the Triton kernel Python wrapper codegen path.
2. Add `torch.save()` for JIT Inductor as well.
3. This also fixes the issue introduced in D61949020 (the Triton kernel not printing at the Python wrapper code level).

Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1  TORCHINDUCTOR_FORCE_DISABLE_CACHES=1 TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_abi_compatible_cuda
```

Differential Revision: D62272588

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135285
Approved by: https://github.com/chenyang78
2024-09-10 19:24:58 +00:00
fc88ba260f [amdsmi][torch] Update amdsmi API usages (#135504)
Summary: In ROCm 6.2.0 there were API name changes; in this diff we check whether the new APIs exist and use them if so. See 7b2463abe0 for the changes.

Test Plan: CI

Differential Revision: D62325661

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135504
Approved by: https://github.com/eqy, https://github.com/houseroad
2024-09-10 19:15:39 +00:00
bf8d0e3107 [inductor] Enable subprocess parallel compile internally with killswitch (#132467)
Differential Revision: [D60629630](https://our.internmc.facebook.com/intern/diff/D60629630)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132467
Approved by: https://github.com/eellison
2024-09-10 19:05:46 +00:00
3a1239a248 [Profiler] Harden Record Function Kwargs (#135365)
Summary:
In S445839, we had HTA break because of the "stream" parameter that was added to GPU traces. This brought up discussions about hardening our post-processing of these inputs so as not to break the JSON schema or downstream tools. For this reason, this diff does the following.

1. Only allow int, double, bool and string values to be processed as kwinputs for JSON output. We can handle lists if needed in the future.
2. Make sure that any boolean is lowercased when converted to a string so that parsing the JSON does not break.
3. Force the stream parameter to be an int.

Test Plan: Added unit tests to ensure that the list of requirements above is true for kwargs only.

Differential Revision: D62304843

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135365
Approved by: https://github.com/aaronenyeshi
2024-09-10 18:44:05 +00:00
4f9f1775d8 Fix flaky TestCudaWrapper.test_randint_cuda_cuda_wrapper (#135370)
Summary: This test is flaky when run after `test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_cuda_wrapper` because the TestCase sets config options globally in its setUp() that stick around for subsequent tests. For test isolation, we use a contextlib.ExitStack pattern in other tests to patch the config options and restore them in tearDown(). Update all TestCases in `test/inductor/test_combo_kernels.py` to use that pattern.

Test Plan:
```
python test/inductor/test_combo_kernels.py
python test/inductor/test_cuda_cpp_wrapper.py TestCudaWrapper.test_dynamic_shapes_persistent_reduction_mixed_x_dim_cuda_cuda_wrapper TestCudaWrapper.test_randint_cuda_cuda_wrapper
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135370
Approved by: https://github.com/jansel
2024-09-10 18:43:14 +00:00
5e0788befb Migrate remaining jobs to use runner determinator (#134867)
At this point all self-hosted runner jobs should be using the runner determinator to switch between LF and Meta runners. This change updates the remaining jobs that have not yet been migrated over.

Issue: https://lf-pytorch.atlassian.net/browse/PC-25

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134867
Approved by: https://github.com/ZainRizvi
2024-09-10 18:14:00 +00:00
440f8f57af Revert "[fx] Bypass custom __setattr__ in Node.__init__ (#135079)" (#135562)
This reverts commit 66da3b3b2acacb116a9b23e91b24934830eaf6b8.

#135079 breaks internal tests and needs to be reverted. Reverting with mergebot doesn't work as this PR is technically part of the stack, but, according to @jansel, it should be possible to revert it individually.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135562
Approved by: https://github.com/jansel, https://github.com/seemethere
2024-09-10 18:07:11 +00:00
e004d539da [Partitioner] Reuse partition to check whether nodes exist (#135317)
Checking whether a node is in a NodeList is O(n). Reuse the partition to speed this up, since partition.nodes is a hash table containing the same elements.
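In plain-Python terms, the change relies on the usual membership-cost difference:
```python
nodes_list = list(range(100_000))
nodes_set = set(nodes_list)     # partition.nodes plays this hash-based role in the PR

99_999 in nodes_list            # O(n): scans the list
99_999 in nodes_set             # O(1) on average: hash lookup
```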

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135317
Approved by: https://github.com/ezyang
2024-09-10 17:45:29 +00:00
c4b84a46a9 Add more logging to TunableOp validators (#135396)
Summary: Add more logging to TunableOp validators

Test Plan:
Verified additional logging when loading kernel selections:
```
ROCBLAS_VERSION validation: expect 4.0.0-72e57364-dirty to match 4.0.0-72e57364-dirty
GCN_ARCH_NAME validation: expect gfx942:sramecc+:xnack- to match gfx942:sramecc+:xnack-
HIPBLASLT_VERSION validation: expect 800-a15e4178 to match 800-a15e4178
ROCM_VERSION validation: expect 6.0.0.0-12969-1544e39 to match 6.0.0.0-12969-1544e39
PT_VERSION validation: expect 2.5.0 to match 2.5.0
```

```
[qizixi@devgpu039.atn3 /data/users/qizixi/fbsource/fbcode (f9305317d|remote/master)]$ PYTORCH_TUNABLEOP_VERBOSE=1 buck2 run mode/{opt,amd-gpu} -c fbcode.e
nable_gpu_sections=true //scripts/xdwang/example:fc_llama -- --enable-tuning
File changed: fbcode//hipblas_tuning_pt_llama0.csv
Buck UI: https://www.internalfb.com/buck2/1ed2fac4-743e-49ef-805f-7fb6b9300022
Network: Up: 0B  Down: 0B
Jobs completed: 4189. Time elapsed: 0.2s.
BUILD SUCCEEDED
Enabled tuning
- Run Linear (matmul) 2 x 1280 x 8192, dtype = torch.bfloat16
INFO:2024-09-06 14:38:07 2834864:2835138 CuptiActivityProfiler.cpp:260] HIP versions. Roctracer: 4.1; Runtime: 60032830; Driver: 60032830
INFO:2024-09-06 14:38:07 2834864:2836083 DynoConfigLoader.cpp:61] Setting communication fabric enabled = 0
reading tuning results from hipblas_tuning_pt_llama0.csv
Validator PT_VERSION=2.5.0
Validator ROCM_VERSION=6.0.0.0-12969-1544e39
Validator HIPBLASLT_VERSION=800-a15e4178
Validator GCN_ARCH_NAME=gfx942:sramecc+:xnack-
Validator ROCBLAS_VERSION=4.0.0-72e57364-dirty
ROCBLAS_VERSION validation: expect 4.0.0-72e57364-dirty to match 4.0.0-72e57364-dirty
GCN_ARCH_NAME validation: expect gfx942:sramecc+:xnack- to match gfx942:sramecc+:xnack-
HIPBLASLT_VERSION validation: expect 800-a15e4178 to match 800-a15e4178
ROCM_VERSION validation: expect 6.0.0.0-12969-1544e39 to match 6.0.0.0-12969-1544e39
PT_VERSION validation: expect 2.5.0 to match 2.5.0
Loading results
Avg time: 13.165860176086426 us, Achieved 3.19 TFLOPS, 1598.24 GB/s

- Run Linear (matmul) 2 x 8192 x 1024, dtype = torch.bfloat16
Avg time: 13.230760097503662 us, Achieved 2.54 TFLOPS, 1271.14 GB/s

- Run Linear (matmul) 2 x 7168 x 8192, dtype = torch.bfloat16
Avg time: 26.804399490356445 us, Achieved 8.76 TFLOPS, 4384.90 GB/s

- Run Linear (matmul) 2 x 8192 x 3584, dtype = torch.bfloat16
Avg time: 13.407809734344482 us, Achieved 8.76 TFLOPS, 4384.14 GB/s

2x1280x8192-torch.bfloat16,13.165860176086426,3.18574247630113,1598.237845349412
2x8192x1024-torch.bfloat16,13.230760097503662,2.536092541374924,1271.1420867780075
2x7168x8192-torch.bfloat16,26.804399490356445,8.762778814892096,4384.9040543618985
2x8192x3584-torch.bfloat16,13.407809734344482,8.759112362638383,4384.138585247748
```

Reviewed By: leitian

Differential Revision: D62322830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135396
Approved by: https://github.com/eqy
2024-09-10 17:20:59 +00:00
cyy
bc1b8f094d Check function declarations of Core ML code (#135467)
Relax the restrictions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135467
Approved by: https://github.com/ezyang
2024-09-10 16:05:22 +00:00
f65a564fa2 [inductor] Flip custom_op_default_layout_constraint (#135239)
By default, Inductor should respect the stride order of input Tensors to
custom operators.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135239
Approved by: https://github.com/albanD
ghstack dependencies: #135391
2024-09-10 14:27:43 +00:00
386b313028 Handle KeyError for compiler collective in scalars too (#135385)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135385
Approved by: https://github.com/jansel
2024-09-10 12:33:04 +00:00
6d7cbc20d2 Add dynamo itertools.pairwise support (#135416)
Fixes #133766
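A toy sketch of the pattern this enables (requires Python 3.10+ for `itertools.pairwise`):
```python
import itertools
import torch

@torch.compile
def successive_sums(xs):
    return [a + b for a, b in itertools.pairwise(xs)]

print(successive_sums([torch.ones(2) * i for i in range(4)]))
```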

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135416
Approved by: https://github.com/XuehaiPan, https://github.com/jansel

Co-authored-by: Xuehai Pan <XuehaiPan@pku.edu.cn>
2024-09-10 11:37:59 +00:00
ca16956b20 [Inductor] Generalize device guard codegen for cpp_wrapper mode. (#134761)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134761
Approved by: https://github.com/jansel, https://github.com/EikanWang
ghstack dependencies: #134693
2024-09-10 10:11:52 +00:00
67735d1ee8 [Inductor] Generalize is_cuda to specific device_type to make cpp_wrapper mode be extensible (#134693)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134693
Approved by: https://github.com/ezyang, https://github.com/EikanWang, https://github.com/jansel
2024-09-10 10:11:13 +00:00
6e13f5eb38 [FlexAttention] Add broadcast support for kv batch dimension (#135505)
This PR adds broadcast support for KV batch dimension.

## Details
Consider Q of shape `[Bq, Hq, Q_LEN, D]`, and K, V of shape `[Bkv, Hkv, KV_LEN, D]`. Prior to this diff, we require `Bq == Bkv`. However, for some use cases, we may have Bkv < Bq. For example, in paged attention, we provide K, V of shape `[1, Hkv, MAX_LEN, D]`, while still providing Q of shape `[Bq, Hq, Q_LEN, D]`. Here, MAX_LEN is the maximal number of tokens supported by paged attention.

This PR relaxes the requirement to `Bq == Bkv or (Bq > 1 and Bkv == 1)`. This support covers flex decoding as well as the flex attention forward and backward passes.
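A hedged sketch of the newly allowed shape combination (using the public `torch.nn.attention.flex_attention` entry point; perf numbers are in the tables below):
```python
import torch
from torch.nn.attention.flex_attention import flex_attention

# Q has batch 4, while K/V share a single broadcast batch (Bkv == 1).
q = torch.randn(4, 8, 256, 64, device="cuda", dtype=torch.float16)  # [Bq, Hq, Q_LEN, D]
k = torch.randn(1, 8, 512, 64, device="cuda", dtype=torch.float16)  # [Bkv, Hkv, KV_LEN, D]
v = torch.randn(1, 8, 512, 64, device="cuda", dtype=torch.float16)

out = flex_attention(q, k, v)   # the KV batch dim broadcasts across the query batch
```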

## Benchmark
GPU: H100

We see negligible (1%~2%) performance change from this PR when `Bq == Bkv`.

```
python benchmarks/transformer/score_mod.py --calculate-bwd
```
### Perf before this PR

**FWD**

| Type    |   Speedup | score_mod     | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)        |
|---------|-----------|---------------|------------|----------------|------------------------------|
| Average |     0.743 |               |            |                |                              |
| Max     |     0.955 | head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)   |
| Min     |     0.548 | relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128) |

**BWD**

| Type    |   Speedup | score_mod   | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)       |
|---------|-----------|-------------|------------|----------------|-----------------------------|
| Average |     0.834 |             |            |                |                             |
| Max     |     1.261 | head_bias   | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)   |
| Min     |     0.456 | None        | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128) |

<details>
<summary> Full performance sweep </summary>

| score_mod     | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)         |   fwd_eager_time |   fwd_compiled_time |   bwd_eager_time |   bwd_compiled_time |   fwd_speedup |   bwd_speedup |
|---------------|------------|----------------|-------------------------------|------------------|---------------------|------------------|---------------------|---------------|---------------|
| None          | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.264 |              17.184 |          107.040 |             140.800 |         0.888 |         0.760 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.840 |              19.744 |          112.576 |             140.064 |         0.802 |         0.804 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.232 |              17.344 |           87.744 |             142.496 |         0.878 |         0.616 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.264 |              17.184 |          108.192 |             143.328 |         0.888 |         0.755 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.904 |              22.400 |          106.432 |             136.512 |         0.889 |         0.780 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.424 |              26.752 |           91.712 |             106.688 |         0.726 |         0.860 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.808 |              22.432 |           89.024 |             101.920 |         0.883 |         0.873 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.840 |              22.272 |           88.896 |             102.592 |         0.891 |         0.867 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           30.240 |              32.416 |          116.768 |             112.256 |         0.933 |         1.040 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           29.536 |              37.024 |          113.664 |             102.688 |         0.798 |         1.107 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           30.656 |              32.800 |          116.992 |             127.008 |         0.935 |         0.921 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           30.592 |              32.480 |          116.928 |             112.160 |         0.942 |         1.043 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           40.448 |              61.920 |          198.656 |             204.512 |         0.653 |         0.971 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           37.760 |              62.528 |          189.536 |             170.624 |         0.604 |         1.111 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           40.896 |              62.368 |          198.304 |             205.824 |         0.656 |         0.963 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           40.448 |              61.952 |          198.432 |             203.648 |         0.653 |         0.974 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          318.528 |             355.904 |          947.232 |            1162.496 |         0.895 |         0.815 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          199.776 |             252.128 |          677.792 |             813.184 |         0.792 |         0.834 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          316.512 |             363.328 |          947.712 |            1361.984 |         0.871 |         0.696 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          317.984 |             356.864 |          947.264 |            1165.024 |         0.891 |         0.813 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          446.656 |             734.656 |         1664.288 |            2172.960 |         0.608 |         0.766 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          278.688 |             467.648 |         1182.624 |            1339.296 |         0.596 |         0.883 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          447.872 |             744.096 |         1662.944 |            2196.544 |         0.602 |         0.757 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          448.128 |             732.928 |         1663.072 |            2156.800 |         0.611 |         0.771 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           15.648 |              16.640 |          107.520 |             143.008 |         0.940 |         0.752 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           15.776 |              18.240 |          129.056 |             141.920 |         0.865 |         0.909 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           15.168 |              16.640 |          103.616 |             139.648 |         0.912 |         0.742 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           15.616 |              16.640 |          128.608 |             164.448 |         0.938 |         0.782 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.776 |              21.952 |          125.344 |             170.304 |         0.901 |         0.736 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.776 |              23.712 |          104.288 |             196.896 |         0.834 |         0.530 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.072 |              21.952 |          102.080 |             177.056 |         0.869 |         0.577 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.648 |              21.920 |          109.920 |             170.848 |         0.896 |         0.643 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           30.464 |              31.936 |          127.808 |             228.832 |         0.954 |         0.559 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           29.472 |              33.856 |          113.152 |             215.072 |         0.871 |         0.526 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           30.496 |              32.160 |          116.576 |             231.744 |         0.948 |         0.503 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           30.464 |              31.904 |          116.320 |             229.824 |         0.955 |         0.506 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           40.480 |              61.440 |          176.448 |             345.312 |         0.659 |         0.511 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           38.304 |              59.424 |          169.312 |             371.360 |         0.645 |         0.456 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           40.960 |              61.760 |          176.512 |             358.912 |         0.663 |         0.492 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           40.352 |              61.696 |          176.512 |             344.928 |         0.654 |         0.512 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          316.224 |             357.728 |          905.728 |            1668.448 |         0.884 |         0.543 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          199.904 |             248.416 |          636.544 |            1109.088 |         0.805 |         0.574 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          314.880 |             363.616 |          906.304 |            1658.176 |         0.866 |         0.547 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          316.160 |             354.368 |          906.080 |            1649.024 |         0.892 |         0.549 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          446.912 |             739.840 |         1555.808 |            2521.952 |         0.604 |         0.617 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          279.776 |             463.904 |         1068.928 |            1849.888 |         0.603 |         0.578 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          446.080 |             748.960 |         1553.504 |            2629.888 |         0.596 |         0.591 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          446.208 |             740.608 |         1558.880 |            2524.960 |         0.602 |         0.617 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           33.568 |              41.280 |          170.016 |             147.584 |         0.813 |         1.152 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           30.688 |              43.040 |          159.552 |             146.720 |         0.713 |         1.087 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           34.112 |              41.504 |          170.112 |             152.672 |         0.822 |         1.114 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           34.240 |              41.152 |          170.272 |             134.976 |         0.832 |         1.261 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           48.672 |              76.416 |          295.296 |             263.648 |         0.637 |         1.120 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           45.088 |              72.576 |          281.920 |             237.664 |         0.621 |         1.186 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           48.032 |              76.672 |          295.520 |             265.248 |         0.626 |         1.114 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           48.096 |              76.096 |          295.456 |             262.112 |         0.632 |         1.127 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           93.920 |             111.232 |          401.568 |             382.944 |         0.844 |         1.049 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           68.192 |              95.232 |          338.752 |             326.816 |         0.716 |         1.037 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           93.984 |             111.840 |          401.856 |             444.224 |         0.840 |         0.905 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           94.176 |             110.496 |          401.600 |             383.136 |         0.852 |         1.048 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          131.488 |             227.040 |          727.424 |             739.712 |         0.579 |         0.983 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |           95.616 |             169.760 |          616.864 |             574.112 |         0.563 |         1.074 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          131.680 |             228.672 |          727.616 |             746.048 |         0.576 |         0.975 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          131.104 |             225.696 |          727.904 |             735.392 |         0.581 |         0.990 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1227.296 |            1386.656 |         3720.192 |            4539.904 |         0.885 |         0.819 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |          691.360 |             831.712 |         2515.872 |            3067.808 |         0.831 |         0.820 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1228.192 |            1403.136 |         3715.520 |            5309.280 |         0.875 |         0.700 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1229.024 |            1384.992 |         3715.904 |            4550.368 |         0.887 |         0.817 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1784.832 |            2865.888 |         6539.840 |            8460.224 |         0.623 |         0.773 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1017.408 |            1660.480 |         4369.824 |            5056.992 |         0.613 |         0.864 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1792.448 |            2904.864 |         6546.080 |            8537.024 |         0.617 |         0.767 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1795.552 |            2856.864 |         6544.672 |            8400.160 |         0.629 |         0.779 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           34.240 |              38.880 |          148.832 |             179.936 |         0.881 |         0.827 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           31.168 |              38.080 |          138.528 |             167.552 |         0.818 |         0.827 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           34.240 |              39.168 |          148.512 |             181.248 |         0.874 |         0.819 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           34.240 |              38.784 |          148.864 |             180.224 |         0.883 |         0.826 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           48.832 |              76.352 |          253.632 |             295.968 |         0.640 |         0.857 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           45.760 |              65.792 |          239.040 |             290.752 |         0.696 |         0.822 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           48.768 |              76.576 |          253.312 |             304.032 |         0.637 |         0.833 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           48.768 |              76.192 |          253.600 |             296.096 |         0.640 |         0.856 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           93.728 |             109.728 |          357.696 |             498.912 |         0.854 |         0.717 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           68.704 |              92.288 |          295.616 |             386.240 |         0.744 |         0.765 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           93.632 |             111.392 |          357.408 |             512.448 |         0.841 |         0.697 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           93.280 |             109.952 |          357.696 |             501.440 |         0.848 |         0.713 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          131.392 |             230.496 |          612.224 |             807.552 |         0.570 |         0.758 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |           96.512 |             165.184 |          502.624 |             672.384 |         0.584 |         0.748 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          131.360 |             232.608 |          612.064 |             832.320 |         0.565 |         0.735 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          131.008 |             230.528 |          612.640 |             804.320 |         0.568 |         0.762 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1227.968 |            1377.408 |         3477.920 |            5324.384 |         0.892 |         0.653 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |          695.264 |             824.544 |         2268.224 |            3210.208 |         0.843 |         0.707 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1228.640 |            1404.576 |         3476.832 |            5463.456 |         0.875 |         0.636 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1228.416 |            1378.752 |         3478.048 |            5367.712 |         0.891 |         0.648 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1788.736 |            2867.712 |         6039.520 |            8616.256 |         0.624 |         0.701 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1021.952 |            1653.824 |         3866.208 |            5306.848 |         0.618 |         0.729 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1786.752 |            2896.352 |         6044.128 |            8871.360 |         0.617 |         0.681 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1786.080 |            2868.672 |         6040.160 |            8550.144 |         0.623 |         0.706 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           57.504 |              71.552 |          312.768 |             255.040 |         0.804 |         1.226 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           49.472 |              71.104 |          285.696 |             243.520 |         0.696 |         1.173 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           58.112 |              72.896 |          312.768 |             288.256 |         0.797 |         1.085 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           57.952 |              71.680 |          312.768 |             255.552 |         0.808 |         1.224 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           82.336 |             144.256 |          580.128 |             500.160 |         0.571 |         1.160 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           76.160 |             123.712 |          552.544 |             447.648 |         0.616 |         1.234 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           82.400 |             145.184 |          580.032 |             504.032 |         0.568 |         1.151 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           82.368 |             143.904 |          580.192 |             499.936 |         0.572 |         1.161 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          177.216 |             209.568 |          787.872 |             747.712 |         0.846 |         1.054 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          121.984 |             168.256 |          651.968 |             628.256 |         0.725 |         1.038 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          177.088 |             211.488 |          788.320 |             864.352 |         0.837 |         0.912 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          177.440 |             208.576 |          787.424 |             749.120 |         0.851 |         1.051 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          249.472 |             441.376 |         1405.440 |            1431.648 |         0.565 |         0.982 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          172.960 |             312.064 |         1172.064 |            1096.448 |         0.554 |         1.069 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          249.632 |             446.336 |         1405.408 |            1448.480 |         0.559 |         0.970 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          250.944 |             440.128 |         1406.624 |            1421.952 |         0.570 |         0.989 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2418.720 |            2747.936 |         7330.432 |            9023.712 |         0.880 |         0.812 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         1353.696 |            1608.480 |         4941.696 |            6078.752 |         0.842 |         0.813 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2427.456 |            2746.816 |         7329.792 |           10539.968 |         0.884 |         0.695 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2426.688 |            2763.168 |         7336.256 |            9057.536 |         0.878 |         0.810 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3554.240 |            5634.400 |        12919.872 |           16843.489 |         0.631 |         0.767 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         2003.648 |            3250.784 |         8610.144 |           10015.424 |         0.616 |         0.860 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3582.080 |            5710.944 |        12923.328 |           17011.871 |         0.627 |         0.760 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3581.920 |            5618.144 |        12934.528 |           16745.888 |         0.638 |         0.772 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           57.120 |              71.232 |          269.760 |             295.680 |         0.802 |         0.912 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           49.408 |              65.312 |          242.304 |             253.952 |         0.756 |         0.954 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           57.504 |              72.544 |          269.632 |             298.976 |         0.793 |         0.902 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           57.760 |              71.040 |          269.600 |             296.640 |         0.813 |         0.909 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           82.336 |             147.168 |          466.080 |             487.456 |         0.559 |         0.956 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           76.704 |             115.040 |          435.392 |             453.248 |         0.667 |         0.961 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           81.856 |             147.424 |          465.920 |             499.552 |         0.555 |         0.933 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           81.760 |             146.656 |          466.176 |             485.984 |         0.557 |         0.959 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          176.608 |             206.976 |          678.080 |             866.976 |         0.853 |         0.782 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          121.664 |             164.768 |          538.240 |             636.160 |         0.738 |         0.846 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          176.608 |             209.664 |          677.696 |             883.424 |         0.842 |         0.767 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          177.440 |             207.840 |          677.248 |             868.288 |         0.854 |         0.780 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          250.272 |             449.536 |         1163.424 |            1420.832 |         0.557 |         0.819 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          173.472 |             305.376 |          929.408 |            1104.544 |         0.568 |         0.841 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          249.376 |             454.976 |         1163.648 |            1455.296 |         0.548 |         0.800 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          250.368 |             450.144 |         1163.520 |            1409.984 |         0.556 |         0.825 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2416.576 |            2726.208 |         6835.520 |           10442.784 |         0.886 |         0.655 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         1357.440 |            1590.752 |         4433.664 |            5975.296 |         0.853 |         0.742 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2427.360 |            2747.040 |         6853.056 |           10670.784 |         0.884 |         0.642 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2441.120 |            2718.944 |         6836.640 |           10433.792 |         0.898 |         0.655 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3555.392 |            5620.960 |        11944.000 |           16504.801 |         0.633 |         0.724 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         2010.848 |            3241.152 |         7636.064 |            9870.464 |         0.620 |         0.774 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3557.440 |            5688.352 |        11935.744 |           17090.496 |         0.625 |         0.698 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3562.720 |            5630.432 |        11939.168 |           16392.033 |         0.633 |         0.728 |

</details>

### Perf after this PR

**FWD**

| Type    |   Speedup | score_mod     | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)      |
|---------|-----------|---------------|------------|----------------|----------------------------|
| Average |     0.776 |               |            |                |                            |
| Max     |     1.006 | None          | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64) |
| Min     |     0.566 | relative_bias | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128) |

**BWD**

| Type    |   Speedup | score_mod   | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)       |
|---------|-----------|-------------|------------|----------------|-----------------------------|
| Average |     0.817 |             |            |                |                             |
| Max     |     1.150 | None        | causal     | torch.bfloat16 | (16, 16, 512, 16, 512, 128) |
| Min     |     0.454 | None        | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128) |

<details>
<summary> Full performance sweep </summary>

| score_mod     | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)         |   fwd_eager_time |   fwd_compiled_time |   bwd_eager_time |   bwd_compiled_time |   fwd_speedup |   bwd_speedup |
|---------------|------------|----------------|-------------------------------|------------------|---------------------|------------------|---------------------|---------------|---------------|
| None          | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.680 |              17.056 |           64.544 |              73.376 |         0.919 |         0.880 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           15.712 |              19.872 |           65.408 |              72.864 |         0.791 |         0.898 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           16.160 |              17.280 |           64.896 |              73.888 |         0.935 |         0.878 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 64)     |           16.192 |              17.120 |           64.896 |              75.424 |         0.946 |         0.860 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.648 |              22.496 |           89.184 |              82.592 |         0.873 |         1.080 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           20.320 |              26.816 |           91.264 |              82.880 |         0.758 |         1.101 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           20.096 |              22.528 |           89.184 |              83.776 |         0.892 |         1.065 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 16, 512, 128)    |           19.680 |              22.432 |           89.184 |             120.096 |         0.877 |         0.743 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           32.384 |              32.512 |          119.232 |             128.960 |         0.996 |         0.925 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           30.176 |              37.248 |          113.664 |             119.520 |         0.810 |         0.951 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           32.512 |              32.928 |          119.264 |             131.456 |         0.987 |         0.907 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 64)   |           32.448 |              32.704 |          119.200 |             128.352 |         0.992 |         0.929 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           41.952 |              62.176 |          199.040 |             214.304 |         0.675 |         0.929 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           39.744 |              62.880 |          189.504 |             179.968 |         0.632 |         1.053 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           41.472 |              62.784 |          199.136 |             217.664 |         0.661 |         0.915 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 16, 1024, 128)  |           42.048 |              61.952 |          199.168 |             214.496 |         0.679 |         0.929 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          341.184 |             357.632 |          980.256 |            1328.896 |         0.954 |         0.738 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          212.576 |             252.960 |          673.888 |             824.864 |         0.840 |         0.817 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          340.000 |             363.296 |          980.768 |            1375.808 |         0.936 |         0.713 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 64)   |          340.768 |             356.832 |          980.960 |            1326.272 |         0.955 |         0.740 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          459.392 |             737.120 |         1678.240 |            2205.248 |         0.623 |         0.761 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          292.672 |             468.096 |         1178.016 |            1371.584 |         0.625 |         0.859 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          462.144 |             745.312 |         1680.000 |            2252.512 |         0.620 |         0.746 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 16, 4096, 128)  |          462.112 |             736.576 |         1679.008 |            2216.480 |         0.627 |         0.758 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           16.064 |              16.704 |          105.120 |             120.768 |         0.962 |         0.870 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           15.552 |              18.144 |          107.136 |             121.696 |         0.857 |         0.880 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           16.096 |              16.768 |          102.688 |             120.864 |         0.960 |         0.850 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 64)      |           16.032 |              16.576 |          104.736 |             124.672 |         0.967 |         0.840 |
| None          | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.392 |              21.952 |          104.736 |             174.656 |         0.883 |         0.600 |
| None          | causal     | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           20.128 |              23.712 |          105.216 |             199.008 |         0.849 |         0.529 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.904 |              21.888 |          103.744 |             179.520 |         0.909 |         0.578 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 512, 2, 512, 128)     |           19.968 |              21.952 |          104.640 |             177.312 |         0.910 |         0.590 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           32.096 |              31.904 |          118.720 |             231.968 |         1.006 |         0.512 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           30.528 |              33.952 |          112.480 |             218.304 |         0.899 |         0.515 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           32.160 |              32.224 |          118.752 |             237.312 |         0.998 |         0.500 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 64)    |           32.128 |              32.032 |          118.240 |             233.120 |         1.003 |         0.507 |
| None          | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           41.312 |              61.280 |          177.408 |             350.688 |         0.674 |         0.506 |
| None          | causal     | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           39.552 |              59.360 |          168.832 |             371.488 |         0.666 |         0.454 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           41.984 |              61.696 |          177.376 |             360.416 |         0.680 |         0.492 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 1024, 2, 1024, 128)   |           41.312 |              61.760 |          177.184 |             355.744 |         0.669 |         0.498 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          339.744 |             357.888 |          939.712 |            1665.376 |         0.949 |         0.564 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          212.608 |             248.832 |          633.280 |            1122.848 |         0.854 |         0.564 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          339.712 |             363.232 |          940.448 |            1689.440 |         0.935 |         0.557 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 64)    |          341.056 |             355.264 |          940.128 |            1641.152 |         0.960 |         0.573 |
| None          | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          460.736 |             741.024 |         1569.824 |            2559.552 |         0.622 |         0.613 |
| None          | causal     | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          293.856 |             464.192 |         1066.240 |            1840.416 |         0.633 |         0.579 |
| relative_bias | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          460.704 |             753.152 |         1570.112 |            2641.088 |         0.612 |         0.594 |
| head_bias     | None       | torch.bfloat16 | (2, 16, 4096, 2, 4096, 128)   |          460.832 |             745.536 |         1570.144 |            2602.560 |         0.618 |         0.603 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           35.680 |              41.280 |          171.840 |             158.176 |         0.864 |         1.086 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           31.360 |              42.976 |          158.912 |             139.264 |         0.730 |         1.141 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           35.168 |              41.600 |          171.648 |             161.344 |         0.845 |         1.064 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 64)     |           35.136 |              41.152 |          171.808 |             158.336 |         0.854 |         1.085 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           48.832 |              76.384 |          295.680 |             277.696 |         0.639 |         1.065 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           45.632 |              72.512 |          281.760 |             250.752 |         0.629 |         1.124 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           49.504 |              76.608 |          295.584 |             279.712 |         0.646 |         1.057 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 16, 512, 128)    |           48.864 |              75.904 |          295.456 |             277.568 |         0.644 |         1.064 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           99.392 |             111.232 |          408.640 |             442.656 |         0.894 |         0.923 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           71.392 |              95.168 |          338.784 |             341.760 |         0.750 |         0.991 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |           99.808 |             112.256 |          408.608 |             456.160 |         0.889 |         0.896 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 64)   |          100.032 |             110.816 |          408.512 |             444.192 |         0.903 |         0.920 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          135.040 |             226.112 |          726.880 |             774.176 |         0.597 |         0.939 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |           99.904 |             169.696 |          616.448 |             607.104 |         0.589 |         1.015 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          135.488 |             228.384 |          727.776 |             782.368 |         0.593 |         0.930 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 16, 1024, 128)  |          135.744 |             225.664 |          728.000 |             773.600 |         0.602 |         0.941 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1324.192 |            1387.808 |         3866.944 |            5217.184 |         0.954 |         0.741 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |          738.464 |             832.608 |         2507.392 |            3146.688 |         0.887 |         0.797 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1326.016 |            1404.256 |         3867.872 |            5382.624 |         0.944 |         0.719 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 64)   |         1326.144 |            1386.688 |         3867.552 |            5203.264 |         0.956 |         0.743 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1847.488 |            2866.336 |         6612.704 |            8597.696 |         0.645 |         0.769 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1066.592 |            1660.640 |         4357.696 |            5174.016 |         0.642 |         0.842 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1850.464 |            2905.408 |         6616.928 |            8793.280 |         0.637 |         0.752 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 16, 4096, 128)  |         1848.896 |            2834.720 |         6623.872 |            8637.920 |         0.652 |         0.767 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           36.384 |              38.656 |          150.336 |             182.624 |         0.941 |         0.823 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           31.360 |              38.112 |          137.664 |             171.840 |         0.823 |         0.801 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           36.608 |              39.040 |          150.528 |             183.872 |         0.938 |         0.819 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 64)      |           36.064 |              38.656 |          150.560 |             183.520 |         0.933 |         0.820 |
| None          | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           49.344 |              76.352 |          253.920 |             301.440 |         0.646 |         0.842 |
| None          | causal     | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           46.720 |              65.824 |          239.424 |             296.384 |         0.710 |         0.808 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           49.248 |              76.416 |          253.728 |             307.808 |         0.644 |         0.824 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 512, 2, 512, 128)     |           49.376 |              76.288 |          253.728 |             304.736 |         0.647 |         0.833 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           99.264 |             110.144 |          364.960 |             503.072 |         0.901 |         0.725 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           71.136 |              92.384 |          294.432 |             393.056 |         0.770 |         0.749 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           99.200 |             111.360 |          365.152 |             512.640 |         0.891 |         0.712 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 64)    |           99.264 |             110.240 |          365.088 |             504.224 |         0.900 |         0.724 |
| None          | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          135.680 |             230.336 |          613.472 |             816.896 |         0.589 |         0.751 |
| None          | causal     | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          100.256 |             165.088 |          502.144 |             676.480 |         0.607 |         0.742 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          135.008 |             232.480 |          613.184 |             836.672 |         0.581 |         0.733 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 1024, 2, 1024, 128)   |          135.232 |             230.624 |          613.536 |             827.136 |         0.586 |         0.742 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1324.064 |            1378.688 |         3631.808 |            5308.384 |         0.960 |         0.684 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |          731.776 |             826.688 |         2263.168 |            3241.344 |         0.885 |         0.698 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1316.128 |            1403.200 |         3625.088 |            5550.688 |         0.938 |         0.653 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 64)    |         1311.904 |            1378.880 |         3616.320 |            5353.696 |         0.951 |         0.675 |
| None          | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1837.856 |            2887.392 |         6121.632 |            8586.656 |         0.637 |         0.713 |
| None          | causal     | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1066.976 |            1654.368 |         3843.136 |            5291.040 |         0.645 |         0.726 |
| relative_bias | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1854.208 |            2896.832 |         6130.112 |            8745.984 |         0.640 |         0.701 |
| head_bias     | None       | torch.bfloat16 | (8, 16, 4096, 2, 4096, 128)   |         1860.512 |            2889.344 |         6135.648 |            8750.592 |         0.644 |         0.701 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           60.640 |              71.552 |          315.968 |             296.512 |         0.847 |         1.066 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           50.784 |              71.040 |          284.288 |             258.880 |         0.715 |         1.098 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           61.312 |              72.704 |          315.680 |             302.016 |         0.843 |         1.045 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 64)    |           60.800 |              71.776 |          316.320 |             297.152 |         0.847 |         1.065 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           84.576 |             144.416 |          580.576 |             535.936 |         0.586 |         1.083 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           76.064 |             123.648 |          553.344 |             481.376 |         0.615 |         1.150 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           84.160 |             145.248 |          581.024 |             540.000 |         0.579 |         1.076 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 16, 512, 128)   |           84.512 |             143.552 |          581.088 |             535.776 |         0.589 |         1.085 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          189.152 |             209.408 |          798.400 |             868.704 |         0.903 |         0.919 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          127.552 |             168.800 |          650.816 |             663.328 |         0.756 |         0.981 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          189.376 |             211.360 |          798.080 |             895.552 |         0.896 |         0.891 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 64)  |          189.440 |             208.576 |          797.888 |             873.152 |         0.908 |         0.914 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          257.536 |             441.760 |         1408.960 |            1514.720 |         0.583 |         0.930 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          179.328 |             312.096 |         1170.368 |            1177.472 |         0.575 |         0.994 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          259.264 |             446.944 |         1408.768 |            1530.400 |         0.580 |         0.921 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 16, 1024, 128) |          258.080 |             440.480 |         1408.864 |            1514.144 |         0.586 |         0.930 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2595.808 |            2771.456 |         7616.704 |           10405.248 |         0.937 |         0.732 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         1435.744 |            1610.336 |         4927.520 |            6220.000 |         0.892 |         0.792 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2595.264 |            2745.056 |         7611.232 |           10631.392 |         0.945 |         0.716 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 64)  |         2576.256 |            2735.456 |         7626.400 |           10346.976 |         0.942 |         0.737 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3679.744 |            5634.816 |        13077.056 |           17182.528 |         0.653 |         0.761 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         2099.360 |            3250.176 |         8589.664 |           10236.672 |         0.646 |         0.839 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3676.800 |            5716.288 |        13073.088 |           17311.071 |         0.643 |         0.755 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 16, 4096, 128) |         3679.136 |            5570.496 |        13070.720 |           17192.863 |         0.660 |         0.760 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           61.600 |              71.008 |          272.320 |             300.000 |         0.868 |         0.908 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           50.176 |              65.344 |          241.568 |             258.912 |         0.768 |         0.933 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           61.120 |              72.512 |          272.672 |             305.408 |         0.843 |         0.893 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 64)     |           61.248 |              71.136 |          272.640 |             301.120 |         0.861 |         0.905 |
| None          | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           83.872 |             146.784 |          466.912 |             496.832 |         0.571 |         0.940 |
| None          | causal     | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           76.704 |             115.072 |          435.584 |             462.112 |         0.667 |         0.943 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           83.392 |             147.392 |          466.656 |             504.448 |         0.566 |         0.925 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 512, 2, 512, 128)    |           83.360 |             146.688 |          466.656 |             499.040 |         0.568 |         0.935 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          189.024 |             207.584 |          684.768 |             873.568 |         0.911 |         0.784 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          126.944 |             164.288 |          536.192 |             645.984 |         0.773 |         0.830 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          188.768 |             209.760 |          684.096 |             897.504 |         0.900 |         0.762 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 64)   |          189.408 |             207.776 |          685.024 |             876.384 |         0.912 |         0.782 |
| None          | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          259.168 |             449.536 |         1167.936 |            1433.280 |         0.577 |         0.815 |
| None          | causal     | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          180.000 |             305.312 |          928.000 |            1113.920 |         0.590 |         0.833 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          258.464 |             455.136 |         1167.808 |            1462.848 |         0.568 |         0.798 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 1024, 2, 1024, 128)  |          257.824 |             450.208 |         1167.744 |            1448.000 |         0.573 |         0.806 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2598.368 |            2729.120 |         7134.400 |           10381.632 |         0.952 |         0.687 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         1435.456 |            1591.040 |         4424.768 |            6035.808 |         0.902 |         0.733 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2594.752 |            2725.952 |         7128.384 |           10822.496 |         0.952 |         0.659 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 64)   |         2597.888 |            2716.960 |         7101.568 |           10385.440 |         0.956 |         0.684 |
| None          | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3647.648 |            5581.632 |        12089.952 |           16667.233 |         0.654 |         0.725 |
| None          | causal     | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         2093.952 |            3241.440 |         7579.392 |            9847.936 |         0.646 |         0.770 |
| relative_bias | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3650.528 |            5650.688 |        12105.568 |           16963.680 |         0.646 |         0.714 |
| head_bias     | None       | torch.bfloat16 | (16, 16, 4096, 2, 4096, 128)  |         3680.064 |            5585.312 |        12117.504 |           16935.040 |         0.659 |         0.716 |

</details>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135505
Approved by: https://github.com/Chillee
2024-09-10 09:30:02 +00:00
23b1486185 [MPS] Allow nan mean reduction in nll_loss (#135434)
This PR allows results from `nll_loss` to be `nan`, which is the same behavior as with CUDA and CPU https://github.com/pytorch/pytorch/pull/64572#issuecomment-926504162.

Fixes #134431

Ref #64572 #119108
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135434
Approved by: https://github.com/malfet
2024-09-10 08:37:59 +00:00
9902b349cb [Inductor] Make static_input_idxs a set for faster lookup (#135314)
`static_input_idxs` is only used for lookups. With large models, this is a large list. This takes over a millisecond in some cases.
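
As a rough illustration of why this matters (an illustrative micro-benchmark, not the inductor code; the names and sizes below are made up), membership tests on a Python `set` are O(1) on average, while on a `list` they are O(n):

```python
# Illustrative only: list vs. set membership cost for a large collection of indices.
import timeit

static_input_idxs = list(range(50_000))   # hypothetical size of the index list
static_input_set = set(static_input_idxs)
queries = list(range(49_000, 51_000))     # hits near the end of the list plus misses

list_time = timeit.timeit(lambda: [i in static_input_idxs for i in queries], number=5)
set_time = timeit.timeit(lambda: [i in static_input_set for i in queries], number=5)
print(f"list: {list_time:.4f}s  set: {set_time:.4f}s")  # the set is orders of magnitude faster
```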

Profile before change:
<img width="824" alt="image" src="https://github.com/user-attachments/assets/002a0775-fd2f-4d27-8cf2-812b502d7d5e">

Profile after change: gaps are smaller, 1ms speedup before launching the cuda graph
<img width="794" alt="image" src="https://github.com/user-attachments/assets/12a0a0b9-2cc1-4d53-ac87-9bd5140a46f5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135314
Approved by: https://github.com/oulgen
2024-09-10 07:27:55 +00:00
5a9ac83e94 Fix doc (#135551)
Differential Revision: [D62412667](https://our.internmc.facebook.com/intern/diff/D62412667/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135551
Approved by: https://github.com/yushangdi
ghstack dependencies: #135549
2024-09-10 07:18:44 +00:00
1adf28a5c0 [inductor] print triton float64 constants correctly (#135260)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135260
Approved by: https://github.com/jansel
2024-09-10 07:05:02 +00:00
c18052da0e Add some minor doc improvement and ban using training IR for unflattener (#135549)
Title

Differential Revision: [D62412490](https://our.internmc.facebook.com/intern/diff/D62412490/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135549
Approved by: https://github.com/yushangdi
2024-09-10 06:48:42 +00:00
c0d2f991b1 Increase TRITON_MAX_BLOCK['X'] (#135181)
Fixes #135028

As title, increase `TRITON_MAX_BLOCK['X']` to 4096 and fix an error, thanks to @Chillee: https://github.com/pytorch/pytorch/pull/133300/files#r1744706189

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135181
Approved by: https://github.com/jansel
2024-09-10 05:54:37 +00:00
e889252493 Implementation of scan (#134102)
This operation is intended to be the counterpart to `associative_scan`, but it can operate with non-associative functions.
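
As a rough sketch of the semantics (not the actual PyTorch operator or its signature), a scan threads a carry through the inputs with an arbitrary, possibly non-associative, combine function:

```python
# Minimal sketch of scan semantics; the real operator's API may differ.
from typing import Callable, List, Tuple, TypeVar

Carry = TypeVar("Carry")
X = TypeVar("X")
Y = TypeVar("Y")

def scan(fn: Callable[[Carry, X], Tuple[Carry, Y]], init: Carry, xs: List[X]) -> Tuple[Carry, List[Y]]:
    carry, ys = init, []
    for x in xs:                  # strictly sequential, so fn need not be associative
        carry, y = fn(carry, x)
        ys.append(y)
    return carry, ys

# Example with a non-associative combine function (subtraction):
final, outs = scan(lambda c, x: (c - x, c - x), 10, [1, 2, 3])
print(final, outs)  # 4 [9, 7, 4]
```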

@ydwu4

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134102
Approved by: https://github.com/ydwu4
2024-09-10 04:51:16 +00:00
6546c6186d do not raise when flatten_fn_with_keys not found when suggesting fixes (#135518)
Test Plan: added test

Differential Revision: D62395371

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135518
Approved by: https://github.com/zhxchen17
2024-09-10 03:47:36 +00:00
1d9fefff19 [DCP] Fixes the stateless optimizer issue of distributed state_dict (#135535)
Some optimizers don't have state, which can cause get_state_dict/set_state_dict to behave incorrectly. This PR fixes the issue.

fixes: https://github.com/pytorch/pytorch/issues/133415

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135535
Approved by: https://github.com/wz337
2024-09-10 03:10:00 +00:00
7ec17b49cf Fix dynamo benchmark skip logic for cpu device (#135193)
Fixes #132380. Adjusts the torchbench and huggingface skip-model lists so that `--no-skip` can be removed when running benchmarks on the 3 suites.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135193
Approved by: https://github.com/chuanqi129, https://github.com/jansel
2024-09-10 03:02:19 +00:00
146921007a [inductor] [cpp] fix the input contiguous check in max-autotune (#134982)
## Description
Fixes the FP32 accuracy failure of `resmlp_12_224` and BF16 accuracy failure of `volo_d1_224` in timm.

In this PR, we check whether the input is contiguous as follows:
If it has a `FixedLayout`, we know the exact strides. For a `FlexibleLayout`, if its data is a `ComputedBuffer`, we can use the fill order of the buffer to decide whether it's contiguous. For the other cases, we don't use the GEMM template, as we can't infer whether the input is contiguous.
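
A self-contained sketch of that decision (the dataclasses below are simplified stand-ins, not inductor's actual `FixedLayout`/`FlexibleLayout` classes):

```python
# Illustrative only: the shape of the contiguity decision described above.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class FixedLayout:
    stride: Tuple[int, ...]                          # exact strides are known

@dataclass
class FlexibleLayout:
    fill_order_contiguous: Optional[bool] = None     # known only when the data is a ComputedBuffer

def can_use_gemm_template(layout) -> bool:
    if isinstance(layout, FixedLayout):
        return layout.stride[-1] == 1                # require a unit stride on the last dim
    if isinstance(layout, FlexibleLayout) and layout.fill_order_contiguous is not None:
        return layout.fill_order_contiguous          # decide from the buffer's fill order
    return False                                     # can't infer, so don't use the GEMM template

assert can_use_gemm_template(FixedLayout(stride=(3072, 1)))
assert not can_use_gemm_template(FixedLayout(stride=(1, 3072)))  # the problematic case described below
```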

## Additional context
The current GEMM template only supports this case: `input.get_stride()[-1] == 1`. In `resmlp_12_224`, when we run into this check, the layout of `input` is a `FlexibleLayout`. The reason is that when realizing the input which is a `View` IR, the `convert_to_reinterpret_view` call fails:
d14fe3ffed/torch/_inductor/ir.py (L4712-L4715)

And it finally runs into this `copy_input` and returns a `FlexibleLayout`.
d14fe3ffed/torch/_inductor/ir.py (L4722)

When checking its stride, this `FlexibleLayout` indeed satisfies `input.get_stride()[-1] == 1` but it is later decided as a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`, which is not supported by the GEMM template, thus causing accuracy issue in this model.
The `FlexibleLayout` is converted to `FixedLayout` during [CppPackedGemmTemplate.add_choices](d14fe3ffed/torch/_inductor/mkldnn_lowerings.py (L1051)) which calls [slice_nd](d14fe3ffed/torch/_inductor/codegen/cpp_template_kernel.py (L150)) when rendering the kernel (`slice_nd(X)`). When creating the `SliceView` IR, [as_storage_and_layout](d14fe3ffed/torch/_inductor/ir.py (L2288)) invokes
[decide_layout](d14fe3ffed/torch/_inductor/ir.py (L2135)) and converts it to a `FixedLayout` with `size = (3072, 196), stride = (1, 3072)`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134982
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-09-10 02:47:38 +00:00
a71e5509bc [inductor]Add profiler to operatorbench (#135515)
Add profiling to operatorbench. A new `--profile` argument is added, and the profiling trace looks like the following figure.
<img width="954" alt="image" src="https://github.com/user-attachments/assets/5b00d6e3-4905-4a77-a5e9-9f62620a5fd5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135515
Approved by: https://github.com/shunting314
2024-09-10 02:33:30 +00:00
136e28f616 Enable forward AD in functional.affine_grid (#135494)
Fixes #121411
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135494
Approved by: https://github.com/zou3519, https://github.com/soulitzer
2024-09-10 00:07:07 +00:00
39a61795e3 remove amax_ptr from scaled_gemm (#135421)
amax was removed from _scaled_mm by #128683. Remove it from the internal at::cuda::blas::scaled_gemm, as well.  This allows hipBLASLt to find additional solutions rather than forcing amax to be used and then discarding the result.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135421
Approved by: https://github.com/drisspg, https://github.com/eqy
2024-09-09 23:04:36 +00:00
b4feec9782 [xplat][XNNPACK] don't prefer static linkage in xplat for main target (#135529)
Building XNNPACK as a static library has some issues because of multiple global params floating around.

Let's try to get rid of it in xplat and see how it fares.

Differential Revision: [D60776152](https://our.internmc.facebook.com/intern/diff/D60776152/)

**NOTE FOR REVIEWERS**: This PR has internal Meta-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D60776152/)!
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135529
Approved by: https://github.com/kimishpatel, https://github.com/mcr229, https://github.com/kirklandsign
2024-09-09 22:47:01 +00:00
d81731615f [Dynamo] Adding CallFunctionNoArgsSource and CallFunctionNoArgsGuardAccessor to support torch.cuda.current_device() (#135425)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135425
Approved by: https://github.com/anijain2305
2024-09-09 22:46:00 +00:00
e2f9a83b85 [ONNX] Drop final None values as inputs for nodes in exporter graph (#135520)
When a value for an optional input is not provided, it defaults to `None`, which gets translated to "" in the ONNX graph. To avoid this, if we have a list of inputs and the final few are all `None`, strip them from the graph.
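
A minimal sketch of the idea (not the exporter's actual code): trim only the trailing `None` entries from a node's input list before emitting it.

```python
# Hypothetical helper: drop trailing None inputs so they don't become "" in the ONNX graph.
from typing import Any, List, Optional

def strip_trailing_nones(inputs: List[Optional[Any]]) -> List[Optional[Any]]:
    end = len(inputs)
    while end > 0 and inputs[end - 1] is None:
        end -= 1
    return inputs[:end]

# Interior Nones are kept, since they still need an explicit empty input to preserve positions.
print(strip_trailing_nones(["x", None, "scale", None, None]))  # ['x', None, 'scale']
```
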
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135520
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-09-09 22:28:41 +00:00
70a65a8bd5 Revert "NJT <-> padded dense conversions (#125947)"
This reverts commit 09a5e88bef04d5485b70d8f65f46a675aaa52942.

Reverted https://github.com/pytorch/pytorch/pull/125947 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing dynamo test 09a5e88bef, maybe a landrace ([comment](https://github.com/pytorch/pytorch/pull/125947#issuecomment-2339228570))
2024-09-09 22:01:09 +00:00
689d278543 Revert "Add __init__.py to shape inference folder. (#135461)"
This reverts commit dced0d6d9f05f0962f74a3c6227f774111c15715.

Reverted https://github.com/pytorch/pytorch/pull/135461 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it exposes some public function without appropriate doc. I will reopen the issue with hi-prio so that it can be fixed properly ([comment](https://github.com/pytorch/pytorch/pull/135461#issuecomment-2339218382))
2024-09-09 21:55:13 +00:00
9b764491e3 Use upload-artifact@v4.4.0 for create_release.yml (#135528)
Fixes failure: https://github.com/pytorch/pytorch/actions/runs/10780281005/job/29895846007

Due to a broken sync between
```
actions/upload-artifact@v2
and
actions/download-artifact@v4.1.7
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135528
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-09-09 20:48:52 +00:00
cbc6b30a24 Fix broken E2E tests on Linux machines (#135394)
Summary:
I'm not entirely sure why this is failing with an `ImportError` (which, according to lastnameye, is a superclass of `ModuleNotFoundError`), but in our E2E tests on Linux machines (but not Macs?), we're seeing the import failure not getting caught --
`ImportError: cannot import name 'parutil' from 'libfb.py' (/data/sandcastle/boxes/eden-trunk-hg-full-fbsource/buck-out/v2/gen/fbsource/d0c916ec8d40ce11/arvr/libraries/ctrl/studies/replay/__ctrl-r__/ctrl-r#link-tree/libfb/py/__init__.py)` from this test run https://www.internalfb.com/sandcastle/workflow/2522015791331601269, an instance of this job:  https://www.internalfb.com/intern/test/844425085172858?ref_report_id=0 is the overall job
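
For context, `ModuleNotFoundError` is a subclass of `ImportError`, so a handler written for the subclass will not catch a plain `ImportError` such as a failed `from ... import name`; a small standalone illustration (not the test's actual code):

```python
# Illustration only: catching the subclass does not catch the parent exception.
try:
    # Importing a missing *name* from an existing module raises ImportError,
    # not ModuleNotFoundError (which is raised when the module itself is missing).
    from os import does_not_exist  # noqa: F401
except ModuleNotFoundError:
    print("caught")  # never reached
except ImportError:
    print("ImportError propagated past the ModuleNotFoundError handler")
```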

Test Plan:
`arc skycastle schedule tools/skycastle/workflows2/ctrl/js_tests.sky:test_js_e2e_replay_tests --sandcastle-spec-overrides '{"type": "fbcode", "unicastle_size": "I1_MEDIUM"}'`
->
https://www.internalfb.com/sandcastle/workflow/256705178764255769

Differential Revision: D62321167

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135394
Approved by: https://github.com/laithsakka
2024-09-09 20:18:08 +00:00
5b368de7f7 Revert "[ONNX] Update fake mode usage in onnx docs (#135512)"
This reverts commit a13c118994b4f118388d97a35abcb91a396cd437.

Reverted https://github.com/pytorch/pytorch/pull/135512 on behalf of https://github.com/davidberard98 due to failing test  https://github.com/pytorch/pytorch/actions/runs/10778813316/job/29891679127 ([comment](https://github.com/pytorch/pytorch/pull/135512#issuecomment-2338999090))
2024-09-09 20:15:12 +00:00
09a5e88bef NJT <-> padded dense conversions (#125947)
This PR:
* Implements the pre-existing `nt.to_padded_tensor(padding_val)` ATen op via the FBGEMM kernel + appropriate view gymnastics (since that kernel only handles 2D values); a usage sketch follows this list
* Introduces a new `_nested_from_padded_tensor` op for the reverse conversion, implemented via the reverse FBGEMM kernel + view gymnastics
    * Note: there is currently no public API for this; design booted to a future PR
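
A small usage sketch of the padded-conversion direction via the public `torch.nested` API (assuming a build that includes this PR's jagged-layout support; the private `_nested_from_padded_tensor` reverse op is not shown since it has no public API):

```python
# Sketch only: convert a jagged nested tensor to a padded dense tensor.
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)],   # two sequences with different lengths
    layout=torch.jagged,
)
padded = torch.nested.to_padded_tensor(nt, padding=0.0)
print(padded.shape)  # torch.Size([2, 5, 8]); shorter sequences are padded with 0.0
```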

TODO:
* ~~Propagate min / max sequence length via the new factory function `_nested_from_padded_tensor`~~
* ~~Verify that Inductor does computation fusion via test logic~~

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125947
Approved by: https://github.com/soulitzer
2024-09-09 19:37:32 +00:00
a4e6a0b240 [split build] move periodic split builds into own concurrency group (#135510)
To avoid nightly workflows cancelling each other
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135510
Approved by: https://github.com/clee2000, https://github.com/huydhn, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-09 19:35:57 +00:00
4ab232d0c4 Fix symbolic number's type and tensor's dtype mismatch bug in Tensor ctor (#135433)
Fixes #135432

In the current implementation, if we try to store a symbolic number in Tensor's constructor, it assumes that the tensor's dtype and the symbolic number's type are matched, which is not the case.

In other words, if we try to store a `SymInt`, the current implementation assumes the tensor's dtype is `torch.int32`, `torch.int64`, or similar. And if we try to store a `SymFloat`, it assumes the tensor's dtype is `torch.float32` or `torch.float64`. However, the tensor's dtype could also be `torch.float32` or something else when we try to store a `SymInt`, which would be wrong.

This PR stores symbolic numbers according to the tensor's scalar type by wrapping the guarded number of the `SymInt` or `SymFloat` into a PyObject.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135433
Approved by: https://github.com/ezyang
2024-09-09 19:32:18 +00:00
2032f107d7 Don't try to tag s390x docker images (#135509)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135509
Approved by: https://github.com/atalman
2024-09-09 19:07:48 +00:00
5f7d956362 Fix bugs blocking flipping the default layout constraint for custom ops (#135391)
Fixes two things:
- For regular PyTorch ops, the default layout constraint tag is always
flexible_layout. This was a bug with #135238
- Mark the new quantized _wrapped_linear_prepack ops as flexible_layout.
  The metas for these are incorrect, I didn't want to fix them (and
  changing the default requires the metas actually be correct).

Test Plan:
- The next PR up in the stack. The PRs are split because the next one is
  riskier.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135391
Approved by: https://github.com/albanD
2024-09-09 18:24:21 +00:00
a13c118994 [ONNX] Update fake mode usage in onnx docs (#135512)
Update fake mode usage in onnx docs
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135512
Approved by: https://github.com/justinchuby
2024-09-09 18:10:37 +00:00
21241bfeee [CP] Extend CP to support load-balancing shards (#132442)
This PR extends the current ring attention to support load-balancing shards -- the context/sequence is divided into `2 * world_size` shards, and each rank gets shards `rank` and `(world_size * 2 - rank - 1)`. The data re-shuffling is done in the `context_parallel` API.
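
For illustration, the shard-to-rank assignment described above (indices only, not the actual implementation):

```python
# Illustrative only: which of the 2 * world_size sequence shards each rank owns.
def shards_for_rank(rank, world_size):
    return rank, 2 * world_size - rank - 1

world_size = 4
for rank in range(world_size):
    print(rank, shards_for_rank(rank, world_size))
# 0 -> (0, 7), 1 -> (1, 6), 2 -> (2, 5), 3 -> (3, 4)
# Pairing an early shard (cheap under causal masking) with a late shard (expensive)
# keeps the attention work roughly balanced across ranks.
```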

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132442
Approved by: https://github.com/wconstab
2024-09-09 18:04:38 +00:00
73a6fc6e30 Revert "[Inductor] Make static_input_idxs a set for faster lookup (#135314)"
This reverts commit 011cae9570fb3c44b7f6f0c8004c470579ed21da.

Reverted https://github.com/pytorch/pytorch/pull/135314 on behalf of https://github.com/ZainRizvi due to Lint is failing on this file in trunk. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/10777258770/job/29885960050) [HUD commit link](011cae9570) ([comment](https://github.com/pytorch/pytorch/pull/135314#issuecomment-2338678219))
2024-09-09 17:33:01 +00:00
09287e3af4 [MPS] Add regression test for fft.fftfreq (#135440)
The issue reported in #135223 was already solved in #128393. This PR adds a regression test for it.

Fixes #135223

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135440
Approved by: https://github.com/ezyang
2024-09-09 17:12:36 +00:00
16c3b8f87c [AOTI] Fix assert_function call in cpu autotune template (#135086)
Summary: In the ABI-compatible mode, assert_function should be AOTI_TORCH_CHECK.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135086
Approved by: https://github.com/chenyang78, https://github.com/angelayi
ghstack dependencies: #134857
2024-09-09 16:54:12 +00:00
9c6dff4941 [AOTI] Add C shim for aten.mkldnn_rnn_layer in cpp wrapper (#134857)
Summary: Support aten.mkldnn_rnn_layer in the ABI-compatible mode. Because aten.mkldnn_rnn_layer is an aten op, it is easier to add a C shim function for it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134857
Approved by: https://github.com/angelayi
2024-09-09 16:54:12 +00:00
0eb425a563 [Release] Apply Release changes scripts after release 2.4 (#135495)
Based on additional changes required for https://github.com/pytorch/pytorch/pull/128347
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135495
Approved by: https://github.com/kit1980
2024-09-09 16:49:04 +00:00
011cae9570 [Inductor] Make static_input_idxs a set for faster lookup (#135314)
`static_input_idxs` is only used for lookups. With large models, this is a large list. This takes over a millisecond in some cases.

Profile before change:
<img width="824" alt="image" src="https://github.com/user-attachments/assets/002a0775-fd2f-4d27-8cf2-812b502d7d5e">

Profile after change: gaps are smaller, 1ms speedup before launching the cuda graph
<img width="794" alt="image" src="https://github.com/user-attachments/assets/12a0a0b9-2cc1-4d53-ac87-9bd5140a46f5">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135314
Approved by: https://github.com/oulgen
2024-09-09 16:24:58 +00:00
dfb2b661f7 Use float data type for Half var_sum in batchnorm stats updating on CPU (#126525)
Use the float data type for the Half `var_sum` when updating batchnorm stats on CPU, to avoid `var_sum` overflow, since the representable range of Half is small.
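
A quick illustration of the overflow (an illustrative snippet, not the kernel code): float16's largest finite value is about 65504, so a per-channel `var_sum` accumulated in Half can overflow even for moderate inputs, while a float accumulator stays finite.

```python
# Illustrative only: accumulating squared values in float16 overflows quickly.
import torch

x = torch.full((100_000,), 4.0, dtype=torch.float16)
var_sum_half = (x * x).sum(dtype=torch.float16)   # 1.6e6 does not fit in float16 -> inf
var_sum_float = (x * x).sum(dtype=torch.float32)  # stays finite
print(var_sum_half, var_sum_float)                # tensor(inf, dtype=torch.float16) tensor(1600000.)
```
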
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126525
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-09 15:31:38 +00:00
5a69e0ebbe [MPS] Update decorator comments with issue ref (#135448)
Updating the comments with references to better places for context now that the bugs have been identified.

xref #135442 #135447 #134184

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135448
Approved by: https://github.com/ezyang
2024-09-09 15:18:52 +00:00
5e145861f2 [ONNX] Improves documentation of ONNX exporter (#135372)
The PR updates the documentation to reflect the changes introduced in PyTorch 2.5 related to the ONNX exporter.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135372
Approved by: https://github.com/justinchuby

Co-authored-by: Justin Chu <justinchuby@users.noreply.github.com>
2024-09-09 15:09:01 +00:00
c35b953531 Fix wrong error msg (#135423)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135423
Approved by: https://github.com/ezyang
2024-09-09 13:28:31 +00:00
dced0d6d9f Add __init__.py to shape inference folder. (#135461)
Fixes #135196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135461
Approved by: https://github.com/ezyang
2024-09-09 13:27:58 +00:00
c0436c5701 [inductor][cpp][gemm] fix perf regression xcit_large_24_p8_224 (#134686) (#135438)
Fix #134686.

PR https://github.com/pytorch/pytorch/pull/132729 makes GEMM template faster for one of the GEMMs in xcit_large_24_p8_224:
SingleProcess AUTOTUNE benchmarking takes 1.7088 seconds and 1.9207 seconds precompiling
AUTOTUNE linear_unary(12544x3072, 768x3072, 768)
  cpp_packed_gemm_2 2.9371 ms 100.0%
  _linear_pointwise 3.1584 ms 93.0%

But it is slower than ATen in the e2e run due to different cache behavior. Access to the input data (12544x3072) is LLC latency bound, and bottlenecks are seen due to memory synchronization (data transfers and coherence updates across processors). This PR tries to mitigate the problem by cooperatively loading different chunks of the input data from the different processors that share it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135438
Approved by: https://github.com/leslie-fang-intel
2024-09-09 05:16:02 +00:00
cyy
60e8dc4374 Check function declarations in Caffe2 code (#134925)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134925
Approved by: https://github.com/ezyang
2024-09-09 05:03:29 +00:00
e6c3f58584 Fix example: Address broadcasting error in the addition of `attn_bias` and `attn_mask`, and correct device assignment for newly created variables in the method (#135427)

Fix example: Address broadcasting error in the addition of `attn_bias` and `attn_mask`, and correct device assignment for newly created variables in the method.

1. Adding `attn_bias += attn_mask` results in a broadcasting error. The expected shape of `attn_bias` is (L, S), so the output should also have the shape (L, S). However, when the `attn_mask` shape is (N, num_heads, L, S), broadcasting occurs, leading to an output shape of (N, num_heads, L, S), which is not desired (see the sketch after this list).
2. `attn_bias` is a newly created variable within the method, but it is not assigned to the correct device.
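
A small reproduction of the broadcasting pitfall in point 1 (shapes only, illustrative):

```python
# Illustrative only: adding a 4D mask to a 2D bias broadcasts out-of-place and errors in-place.
import torch

L, S, N, num_heads = 4, 6, 2, 8
attn_bias = torch.zeros(L, S)
attn_mask = torch.zeros(N, num_heads, L, S)

print((attn_bias + attn_mask).shape)  # torch.Size([2, 8, 4, 6]) -- broadcast to 4D, not (L, S)
try:
    attn_bias += attn_mask            # in-place: result cannot be written back into shape (L, S)
except RuntimeError as e:
    print("RuntimeError:", e)
```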

**This is my retry of PR #130209 . The PR has been merged into commit `d4a79d4a7c746068d25fe5cf9333495561f4ce1f`, but the modifications were overwritten by subsequent commits.**

Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
@mikaylagawarecki  provided a more elegant implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135427
Approved by: https://github.com/ezyang
2024-09-09 03:47:34 +00:00
90e12cf63d Fix return type of nansum example. (#135435)
One of the examples in the documentation of `torch.nansum` contains a wrong return type. This fixes it.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135435
Approved by: https://github.com/ezyang
2024-09-09 03:34:52 +00:00
44c08f4984 [Partitioner] Query whether nodes exist in graph faster (#135316)
Finding whether a node exists in graph.nodes (a linked list) takes too long. Use graph._find_nodes_lookup_table (a hash table) instead to speed it up.
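A toy sketch of why this helps: membership checks against a list are O(n) scans, while a hash-based lookup is O(1). The names below are generic stand-ins, not the partitioner's real data structures.

```python
import time

nodes = list(range(200_000))   # stand-in for graph.nodes (scanned linearly)
lookup = set(nodes)            # stand-in for the hash-based lookup table
target = nodes[-1]             # worst case for the linear scan

t0 = time.perf_counter()
_ = target in nodes            # O(n) scan
t1 = time.perf_counter()
_ = target in lookup           # O(1) hash lookup
t2 = time.perf_counter()

print(f"list scan:   {t1 - t0:.6f}s")
print(f"hash lookup: {t2 - t1:.6f}s")
```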
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135316
Approved by: https://github.com/ezyang
2024-09-09 03:34:02 +00:00
b6186353c6 enable lazy_init for hpu (#135203)
enables lazy_init for hpu device
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135203
Approved by: https://github.com/ezyang
2024-09-09 03:32:20 +00:00
b7eb7256fb docs: torch.nn.utils.rnn.pack_padded_sequence: docs improve (#135417)
docs: `torch.nn.utils.rnn.pack_padded_sequence`: docs improve

/cc @mikaylagawarecki
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135417
Approved by: https://github.com/ezyang
2024-09-09 03:16:11 +00:00
c1ae78be92 [inductor] calibration inductor windows uts (18/N) (#135449)
skip test_quantized_* UTs of `test/inductor/test_cpu_select_algorithm.py`.
Windows inductor does not support quantization so far.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135449
Approved by: https://github.com/ezyang
2024-09-09 03:10:54 +00:00
defb515306 [NJT]Add permute ops support (#135336)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135336
Approved by: https://github.com/davidberard98
2024-09-08 21:00:41 +00:00
31c4e0d37d [inductor] Cleanup analysis done at lowering time (#135412)
Before this we would take multiple passes over the body of each IRNode as we did lowering.  This combines most analysis into `OpCounterCSE` so it can be done in a single pass.

Before:
![image](https://github.com/user-attachments/assets/0047db09-4258-4491-a9a6-b078e183092a)

After:
![image](https://github.com/user-attachments/assets/1e03adcb-8303-4bb1-8bbb-cc42dacd44d7)

This stack:
![image](https://github.com/user-attachments/assets/d6b50b24-c30c-4d23-8b1a-344b3ba65d7a)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135412
Approved by: https://github.com/oulgen
ghstack dependencies: #135286, #135306, #135377, #135400
2024-09-08 18:02:36 +00:00
53290ca00b [inductor] Refactor BaseSchedulerNode.__init__ (#135400)
Might be a small compile time improvement since we remove a call to extract_read_writes().

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135400
Approved by: https://github.com/oulgen
ghstack dependencies: #135286, #135306, #135377
2024-09-08 18:02:36 +00:00
16f5155992 [inductor] Fast path for extract_read_writes without tracing (#135377)
Before (bottom of stack):
![image](https://github.com/user-attachments/assets/13060ff9-b31d-42a9-8e8f-c50b2bf3dc2f)

After (this PR):
![image](https://github.com/user-attachments/assets/7d190821-b614-46b7-9e9e-9087443df654)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135377
Approved by: https://github.com/oulgen
ghstack dependencies: #135286, #135306
2024-09-08 18:02:32 +00:00
37144be03d [inductor] Remove ReadWrites.op_counts (#135306)
This was (almost) unused.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135306
Approved by: https://github.com/oulgen
ghstack dependencies: #135286
2024-09-08 18:02:28 +00:00
3bdc54ed18 [inductor] Refactor LoopBody.memory_usage (#135286)
This is preparing for some other changes where I speed up extract_read_writes tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135286
Approved by: https://github.com/oulgen
2024-09-08 18:02:24 +00:00
cyy
2196f32475 [22/N] Fix clang-tidy warnings in jit (#135319)
Follows #134537
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135319
Approved by: https://github.com/titaiwangms
2024-09-08 17:18:29 +00:00
cfc227ad43 [reland][dtensor] move DTensor to public namespace (#134203)
reland of https://github.com/pytorch/pytorch/pull/133113

I have to create a new PR because the previous reverted PR could not either be rebased, or imported successfully :(

----

Moving DTensor to be in the public namespace, to formally add the documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next PRs)
* To preserve BC for users still on torch.distributed._tensor, I added a shim script to redirect calls from the old path to the new module

BC preservation is evidenced by the fact that all DTensor tests still pass without changing the public imports, so it is safe to land the changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134203
Approved by: https://github.com/tianyu-l
2024-09-08 17:08:40 +00:00
20cab91a12 [dynamo] Remove skip from jit freeze tests (#135281)
Fixes https://github.com/pytorch/pytorch/issues/119781
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135281
Approved by: https://github.com/zou3519
2024-09-08 15:11:12 +00:00
a6fae2e811 Use BRGEMM for Half flash attention forward kernel (#131879)
Use oneDNN BRGEMM on packed data to get better performance on the 5th generation of Xeon where Intel® Advanced Matrix Extensions (AMX) will have fp16 support, e.g. amx-fp16.
Multiple models have achieved acceleration, for instance, FP16 stable diffusion v2.1 has achieved over 50% improvement.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131879
Approved by: https://github.com/jgong5, https://github.com/peterbell10
ghstack dependencies: #131878
2024-09-08 12:32:23 +00:00
042f2f7746 [ONNX] Re-raise the exception if the dynamic shapes cannot be refined (#135418)
Improve error reporting. Otherwise users will mostly just see that the shapes could not be refined, without the underlying cause.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135418
Approved by: https://github.com/titaiwangms
2024-09-08 05:30:34 +00:00
fd494dd426 Change wrapped_linear_prepack and wrapped_quantized_linear_prepacked to private by adding _ as prefix (#135401)
Summary: In https://github.com/pytorch/pytorch/pull/134232, we added two new ops wrapped_linear_prepack and wrapped_quantized_linear_prepacked. From the review comments and offline discussion, we are changing them to private by adding `_` as prefix

Differential Revision: D62325142

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135401
Approved by: https://github.com/houseroad
2024-09-08 04:16:24 +00:00
8334cb2fb9 remove commented out breakpoints (#135363)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135363
Approved by: https://github.com/oulgen
2024-09-08 02:15:45 +00:00
e72ed4717e [Dynamo] Fix Huggingface PretrainedConfig get non const attr (#135413)
Fixes #135329

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135413
Approved by: https://github.com/anijain2305
2024-09-07 19:16:29 +00:00
3bebc09be9 [FlexAttention] Align the matmul tensorcore usage (#135168)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135168
Approved by: https://github.com/Chillee
2024-09-07 16:33:41 +00:00
a2db22e6bb [inductor] Catch BrokenProcessPool and print a more helpful message. (#135120)
Summary: BrokenProcessPool means a parallel-compile subprocess exited, which we never expect. It's likely due to a crash, so print a more meaningful error message and instructions that it's probably easier to debug by turning off parallel compile. Output looks like:
```
...
  File "/data/users/slarsen/pytorch/torch/_inductor/runtime/compile_tasks.py", line 45, in _reload_python_module
    exec(code, mod.__dict__, mod.__dict__)
  File "/tmp/torchinductor_slarsen/4q/c4qw7xk5lbb7whg5txnk4hwbc7z6kepak3o666tr3d64gcad5r5b.py", line 815, in <module>
    async_compile.wait(globals())
  File "/data/users/slarsen/pytorch/torch/_inductor/async_compile.py", line 265, in wait
    raise RuntimeError(
RuntimeError: A compilation subprocess exited unexpectedly. This is likely due to a crash. To facilitate debugging, you can re-run with TORCHINDUCTOR_COMPILE_THREADS=1 to cause compilation to occur in the main process.
```
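A condensed sketch of the pattern described above, using a plain concurrent.futures pool; the message text and the actual call site (async_compile.wait) differ in the real change.

```python
import os
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

def _crash():
    os._exit(1)   # simulate a subprocess dying abruptly (e.g. a native crash)

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=1) as pool:
        future = pool.submit(_crash)
        try:
            future.result()
        except BrokenProcessPool as e:
            raise RuntimeError(
                "A compilation subprocess exited unexpectedly. This is likely due to a "
                "crash. Re-run with TORCHINDUCTOR_COMPILE_THREADS=1 to compile in the "
                "main process and get a clearer traceback."
            ) from e
```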

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135120
Approved by: https://github.com/Chillee
2024-09-07 16:33:37 +00:00
eac5e12548 [inductor] Move LoopBody to its own file (#135257)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135257
Approved by: https://github.com/oulgen
2024-09-07 16:29:15 +00:00
18479c5f70 [Doc] update max-autotune for CPU (#134986)
The current doc for `max-autotune` is applicable only for GPU. This PR adds the corresponding content for CPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134986
Approved by: https://github.com/jgong5, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-07 13:42:40 +00:00
f7c0c06692 Add oneDNN BRGEMM support on CPU (#131878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131878
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-07 13:22:30 +00:00
b53d97c7be [Intel GPU] Add XPU memory-related APIs (#129919)
# Motivation
According to https://github.com/pytorch/pytorch/issues/116322, we plan to unify the device allocator, so we first introduce a simple XPU device allocator with only the key functionality, and expect to add memory-statistics functionality after the unification.
However, some memory-statistics APIs listed in https://github.com/pytorch/pytorch/issues/127929 have now been requested, and we need more time to unify the device allocator. To improve the user experience, we intend to support these memory-statistics APIs before the unification.

# Additional Context
Fixes: #127929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129919
Approved by: https://github.com/dvrogozh, https://github.com/abhilash1910, https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #130923
2024-09-07 11:15:17 +00:00
6c1da66407 [Reland] Refactor caching device allocator utils (#130923)
# Motivation
Following [[RFC] Intel GPU Runtime Upstreaming for Allocator ](https://github.com/pytorch/pytorch/issues/116322), this PR aims to refactor the caching device allocator utils to improve code reuse.
This is the first PR; follow-up PRs will continue refactoring the device caching allocator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130923
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD, https://github.com/eqy
2024-09-07 11:14:17 +00:00
d7c97e7245 [inductor][cpp][gemm] cache blocking config for dynamic shapes (#133538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133538
Approved by: https://github.com/leslie-fang-intel
ghstack dependencies: #135277, #133447

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
2024-09-07 11:09:30 +00:00
be9f4ffe88 [inductor][cpp][gemm] enable dynamic M for k-slicing (#133447)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133447
Approved by: https://github.com/leslie-fang-intel
ghstack dependencies: #135277

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
2024-09-07 11:09:30 +00:00
692faa9bc6 [inductor][cpp][gemm] reduce memory alloc overhead by allocating local acc once per thread (#135277)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135277
Approved by: https://github.com/leslie-fang-intel

Co-authored-by: Wu, Chunyuan <chunyuan.wu@intel.com>
2024-09-07 11:09:25 +00:00
32f3af72b7 [ONNX] Support FakeTensor in ONNXProgram (#135399)
Sync with https://github.com/justinchuby/torch-onnx/compare/v0.1.20...v0.1.21 to support FakeTensors in ONNXProgram. Specifically, this PR implements the `apply_weights` method to allow users to supply a dictionary of concrete tensors to replace FakeTensors in the exported model weights.

An error is raised when users try to serialize a FakeTensor to avoid segfaults.

Also fixed a bug in `.save()` when `keep_initializers_as_inputs` is True and `include_initializers` is False.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135399
Approved by: https://github.com/titaiwangms
2024-09-07 04:48:18 +00:00
ebab5c85c4 [FlexAttention] Skip very small block size unit tests on H100 due to Triton bug (#135393)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135393
Approved by: https://github.com/BoyuanFeng
2024-09-07 04:35:22 +00:00
3d734d837b [ONNX] Handle mixed sequence inputs properly (#135378)
Previously, when an input contains a mixture of `Value` and python constants like `[SymbolicTensor('sym_size_int_3', type=Tensor(INT64), shape=[], producer=node_Shape_0, index=0), 512]`, we get errors like

```pytb
Traceback (most recent call last):
  File "/Users/justinc/Documents/GitHub/torch-onnx/src/torch_onnx/_building.py", line 367, in _call_op
    converted_named_inputs = _process_python_constants_and_sequences(
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/justinc/Documents/GitHub/torch-onnx/src/torch_onnx/_building.py", line 275, in _process_python_constants_and_sequences
    raise TypeError(
TypeError: Constant input '[SymbolicTensor('sym_size_int_3', type=Tensor(INT64), shape=[], producer=node_Shape_0, index=0), 512]' of type '<class 'list'>' is not supported
```

This PR updates Sequence handling to support this case, as well as variadic inputs and ONNX Sequence inputs.

Synced from https://github.com/justinchuby/torch-onnx/pull/187
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135378
Approved by: https://github.com/titaiwangms
2024-09-07 03:07:39 +00:00
c92227c41a [quant][pt2e] fix placeholder typo and related quantization tests (#135379)
A previous typo on "placeholder" and related tests in quantization are fixed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135379
Approved by: https://github.com/jerryzh168
2024-09-07 02:31:43 +00:00
e6a0221fc6 [Inductor] Optionally allow padding on non-GPU devices (#135280)
This is the OSS component of a larger MTIA diff.

Currently, Inductor disables padding for non-GPU devices. We need to change this behavior to enable padding on MTIA.

This PR adds a config option to enable padding on the CPU, or any other non-GPU device. In the future, we might want to enable padding on all devices by default. However, that might require supporting device-dependent padding defaults, since CPUs will likely use different settings than H100 GPUs.

Differential Revision: D61038114

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135280
Approved by: https://github.com/jfix71, https://github.com/shunting314
2024-09-07 02:19:14 +00:00
a6b9d444fb [ONNX] Refactor exporter errors (#135180)
Refactor exporter errors to combine old errors and new errors for API consistency.

This PR also

1. Removes the `_C._check_onnx_proto(proto)` call in the old exporter. We don't need the ONNX checker because it is limited.
2. Removes the `OnnxExporterError` defined in the dynamo module. This class unnecessarily stores the onnx program object, making it very bulky. Instead, we revert to use the plain OnnxExporterError defined in the `errors` module and use it as the base class for all errors.
3. Continues to expose `OnnxExporterError` in `torch.onnx` and the rest of the errors in `torch.onnx.errors`.
4. Removes the `CheckerError` and `InvalidExportOptionsError` from `torch.onnx`. This is BC breaking but should have low impact.
5. I did not rename existing errors out of compatibility considerations, even though `ExporterError` would have been more succinct.

Fixes https://github.com/pytorch/pytorch/issues/135125
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135180
Approved by: https://github.com/titaiwangms
2024-09-07 00:50:15 +00:00
d42b0c8f22 Add release matrix for 2.5 (#135383)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135383
Approved by: https://github.com/huydhn
2024-09-07 00:49:53 +00:00
941d094dd1 [Dynamo][DTensor] Fixes SymNodeVariable() is not a constant error in Compiled DDP + TP unit test (#135315)
Before the fix, the unit test will fail at forward Dynamo tracing:
```
  File "/data/users/willfeng/pytorch/test/distributed/_composable/test_replicate_with_compiler.py", line 415, in test_ddp_tp
    loss = compiled_replicate_model(data).sum()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
...
torch._dynamo.exc.InternalTorchDynamoError: SymNodeVariable() is not a constant

from user code:
   File "/data/users/willfeng/pytorch/torch/distributed/tensor/parallel/_data_parallel_utils.py", line 34, in _unflatten_tensor
    result = DTensor.from_local(
```
After the fix, the compilation fails at a later step (Compiled Autograd tracing), due to needing "pre-dispatch tracing of backward graph" feature (see details at https://github.com/pytorch/pytorch/issues/127797#issuecomment-2291695474).

I believe this PR is a net improvement, because it should also fix the 1D Traceable FSDP2 failure case on internal models (https://github.com/pytorch/pytorch/issues/130978#issuecomment-2319476690), which is much harder to build a minimal unit test for.

Fixes https://github.com/pytorch/pytorch/issues/130978.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135315
Approved by: https://github.com/bdhirsh
2024-09-07 00:11:25 +00:00
b1a934741e Change test_constant_prop_preserve_metadata (#135268)
Summary: In new export_for_training, "stack_trace" does not exist in node meta anymore.

Test Plan:
```
buck run fbcode//mode/dev-nosan fbcode//caffe2/test:quantization_pt2e -- -r test_constant_prop_preserve_metadata
```

Reviewed By: angelayi

Differential Revision: D62219974

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135268
Approved by: https://github.com/angelayi
2024-09-07 00:02:35 +00:00
0c661f3e1a [Split Build] Refactor split build binary builds into their own workflows and move split build binary builds to periodic (#134624)
As we need to move split build binary tests from trunk to periodic this pr, refactors those jobs out into its own workflow to achieve this.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134624
Approved by: https://github.com/malfet
2024-09-06 23:57:56 +00:00
2c7e314803 [Inductor][CPP] Fix the issue of view dtype (#135301)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/135160, it's a regression introduced by https://github.com/pytorch/pytorch/pull/134569, where the dtype of `to_dtype_bitcast` was incorrectly handled when using the scalarize implementation.

**TestPlan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_view_dtype
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135301
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-06 23:36:44 +00:00
ead4407f57 [inductor] Fix loop split optimization (#135303)
Fix https://github.com/pytorch/pytorch/issues/135274.

Improve the check of whether the div expr matches: add a check that `split_var` is in `original_body.iter_vars`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135303
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-09-06 23:06:25 +00:00
2f5b40c099 [aoti test] Disable FP8 funz dtypes in fp8 runtime check test (#135373)
Fixing https://github.com/pytorch/pytorch/issues/126734

Key is the funz FP8 types are for AMD only.

source: https://github.com/openxla/stablehlo/blob/main/rfcs/20230321-fp8_fnuz.md

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135373
Approved by: https://github.com/chenyang78
2024-09-06 23:05:47 +00:00
993b5647ab [export] fix placeholder name collision tests by removing map call (#135366)
The current test is failing because of the unstable state of map: torch.compile and non-strict export take two separate routes, unlike cond and while_loop. This PR fixes the test itself; we'll fix map in follow-up PRs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135366
Approved by: https://github.com/angelayi
2024-09-06 22:02:50 +00:00
2ab26806f1 Require tlparse for failing tests in test_structured_trace.py (#135376)
Summary: These tests are currently failing internally. Per discussion, skip if tlparse is unavailable

Test Plan:
```
feature remove tlparse
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --run-disabled --regex test_structured_trace.py
feature install tlparse
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --run-disabled --regex test_structured_trace.py
```

Differential Revision: D62310342

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135376
Approved by: https://github.com/ezyang
2024-09-06 21:53:41 +00:00
b1612569f6 [BE] Clarify defaulting behavior in optimizer (#135384)
Fixes #135340

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135384
Approved by: https://github.com/drisspg, https://github.com/jainapurva
2024-09-06 21:52:55 +00:00
dc0e818738 [FR] Automatically infer a common filename prefix (#135158)
Save the annoyance of specifying this on the command line each time
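A minimal sketch of what inferring a common prefix can look like, using only the standard library; the flight recorder's actual implementation may differ and the file names below are hypothetical.

```python
import os

def infer_prefix(trace_dir: str) -> str:
    """Guess the filename prefix shared by all dump files in trace_dir."""
    names = sorted(os.listdir(trace_dir))
    if not names:
        raise ValueError(f"no trace files found in {trace_dir!r}")
    prefix = os.path.commonprefix(names)
    if not prefix:
        raise ValueError("trace files share no common prefix; pass one explicitly")
    return prefix

# e.g. files 'nccl_trace_rank_0', 'nccl_trace_rank_1' -> 'nccl_trace_rank_'
```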
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135158
Approved by: https://github.com/fduwjj, https://github.com/c-p-i-o
ghstack dependencies: #135157
2024-09-06 21:44:27 +00:00
06e414d7fe [FR] Make trace_dir a required argument (#135157)
Ensures users get a clean error if they forget to specify the dir, and
improves the help message.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135157
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
2024-09-06 21:44:27 +00:00
a681260caf Revert "[ONNX] Refactor exporter errors (#135180)"
This reverts commit 5eebd9315a72422d59b6f8d8ca8e4e573e231d5c.

Reverted https://github.com/pytorch/pytorch/pull/135180 on behalf of https://github.com/clee2000 due to I think this broke test_public_bindings.py::TestPublicBindings::test_correct_module_names [GH job link](https://github.com/pytorch/pytorch/actions/runs/10743909338/job/29800779403) [HUD commit link](5eebd9315a), possibly a landrace with the PR that landed before it ([comment](https://github.com/pytorch/pytorch/pull/135180#issuecomment-2334844191))
2024-09-06 21:39:18 +00:00
95e976a63f [dynamo] recursively skip frames when Dynamo cache limit is hit (#135144)
Fixes https://github.com/pytorch/pytorch/pull/135144 and [T197117723](https://www.internalfb.com/intern/tasks/?t=197117723).

In general, adds `SkipCodeRecursiveException` to Dynamo - when raised in Dynamo, convert_frame will return a `skip_code_recursive_flag` back to C Dynamo, signaling it to skip the current frame and all recursive calls.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135144
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-09-06 21:38:53 +00:00
306ac44eaa [ez][TD] Fix request for issue body returns None (#135389)
I assumed it would be an empty string if the body is empty, but it's just None.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135389
Approved by: https://github.com/malfet
2024-09-06 21:02:01 +00:00
a7643baceb Revert expectFailureIf condition on tests with torch.compile on Windows (#134759)
Fixes #134716

This PR reverts some changes introduced in 6eae569546 (#133987)

torch.compile is not available on Windows, tests should be expected to fail.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134759
Approved by: https://github.com/malfet
2024-09-06 20:51:55 +00:00
a4030e37be [dynamo] reland map/zip iterator related changes (#135074)
Differential Revision: [D62211019](https://our.internmc.facebook.com/intern/diff/D62211019)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135074
Approved by: https://github.com/jansel, https://github.com/anijain2305, https://github.com/mlazos
2024-09-06 20:38:02 +00:00
22e1fb6faa [test][easy] Add debug utils for cpu select algorithm test (#135038)
Summary: Add debug utils to debug a flaky test in fbcode ci.

Some context: https://github.com/pytorch/pytorch/pull/126545

Test Plan: ci

Differential Revision: D62005445

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135038
Approved by: https://github.com/jgong5, https://github.com/XuehaiPan
2024-09-06 20:30:49 +00:00
2a4890e315 [ONNX] Clean up the missed lines from previous PRs (#135368)
Some missed deleted lines

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135368
Approved by: https://github.com/justinchuby
2024-09-06 20:27:52 +00:00
3ce433aef2 [TCPStore] use wait counters (#135283)
This replaces the existing TCPStore counters with the new shared wait counters. There are no users of the TCPStore counters, so they should be completely safe to remove.

Test plan:

Existing tests + build

There's no OSS backend for wait counters so can't write any tests with them currently.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135283
Approved by: https://github.com/c-p-i-o
2024-09-06 19:54:25 +00:00
7f2d20e687 Run all autograd node post hooks (#134728)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134728
Approved by: https://github.com/albanD, https://github.com/soulitzer
2024-09-06 19:44:28 +00:00
32fd29c1ea [ONNX] Properly handle Attributes in traceable functions (#135367)
Previously the attributes were passed in as Attr objects even when the function is called as a plain Python function. This turns them into Python objects.

From https://github.com/justinchuby/torch-onnx/pull/186
Related https://github.com/microsoft/onnxscript/issues/1846

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135367
Approved by: https://github.com/justinchuby
2024-09-06 19:35:22 +00:00
5eebd9315a [ONNX] Refactor exporter errors (#135180)
Refactor exporter errors to combine old errors and new errors for API consistency.

This PR also

1. Removes the `_C._check_onnx_proto(proto)` call in the old exporter. We don't need the ONNX checker because it is limited.
2. Removes the `OnnxExporterError` defined in the dynamo module. This class unnecessarily stores the onnx program object, making it very bulky. Instead, we revert to use the plain OnnxExporterError defined in the `errors` module and use it as the base class for all errors.
3. Continues to expose `OnnxExporterError` in `torch.onnx` and the rest of the errors in `torch.onnx.errors`.
4. Removes the `CheckerError` and `InvalidExportOptionsError` from `torch.onnx`. This is BC breaking but should have low impact.
5. I did not rename existing errors out of compatibility considerations, even though `ExporterError` would have been more succinct.

Fixes https://github.com/pytorch/pytorch/issues/135125
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135180
Approved by: https://github.com/titaiwangms
2024-09-06 19:10:56 +00:00
a15aabc975 Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262)
Hi,
I noticed the `unfold` operator was missing on MaskedTensor.

I tested that my change works when calling unfold and backward on a `MaskedTensor`, but I didn't find the tests for the dispatch of such an operation. Where are they?
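A small usage sketch of the newly passed-through op, assuming a build that includes this change and the prototype torch.masked API; shapes are illustrative.

```python
import torch
from torch.masked import masked_tensor

data = torch.arange(6.0)
mask = torch.tensor([True, True, False, True, True, True])

mt = masked_tensor(data, mask)
windows = mt.unfold(0, 2, 2)   # three windows of size 2; passthrough added by this PR
print(windows)
```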
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262
Approved by: https://github.com/cpuhrsch
2024-09-06 19:06:23 +00:00
b143426db3 [Inductor] Use argument names as the key for the constants dict and the signature dict (#135170)
Referencing how triton constructs these dictionaries

ca3fb5f6fa/python/triton/runtime/jit.py (L639)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135170
Approved by: https://github.com/htyu
2024-09-06 19:05:00 +00:00
13ba0a2e5c Run bypassed graph compile outside the except block to avoid chaining of exceptions (#135175)
Fixes #135172
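A generic sketch of the pattern named in the title: run the fallback compile after the except block, so a failure in the fallback is not chained onto the bypassed exception. Function names here are illustrative, not Inductor's actual API.

```python
import logging

log = logging.getLogger(__name__)

def compile_with_bypass(compile_fn, fallback_fn, *args):
    try:
        return compile_fn(*args)
    except Exception as exc:   # cache bypass / compile failure detected
        log.warning("bypassing compiled path: %s", exc)
    # The fallback runs outside the except block, so if it raises, the error is
    # not chained onto the bypassed exception ("During handling of the above
    # exception, another exception occurred").
    return fallback_fn(*args)
```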

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135175
Approved by: https://github.com/masnesral, https://github.com/ezyang
2024-09-06 19:03:57 +00:00
8520ce5f78 Fix incorrect trace of post-accumulate grad hook on tensor with zero dims (#135226)
Fix incorrect trace of post-accumulate grad hook on tensor with zero dimensions

Fixes #135207
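A small usage sketch of the hook on a zero-dimensional tensor, the case covered by this fix; this is plain eager usage, not the compiled-autograd trace itself.

```python
import torch

p = torch.tensor(2.0, requires_grad=True)   # zero-dim leaf tensor
seen = []

def hook(param):
    # Fires after the gradient has been accumulated into param.grad.
    seen.append(param.grad.clone())

p.register_post_accumulate_grad_hook(hook)
(p * 3).backward()
print(seen)   # [tensor(3.)]
```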

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135226
Approved by: https://github.com/xmfan
2024-09-06 18:19:54 +00:00
196748d491 [elastic] support local_addr across all rendezvous impls (#135262)
Summary:
There was a regression introduced in https://github.com/pytorch/pytorch/pull/125743 that made `local_addr` no longer used. This fixes that by passing `local_addr` to `RendezvousStoreInfo.build` everywhere it's used.

This also fixes a number of tests allowing them to be run in parallel which hugely sped up the testing cycle as this change touches many different rendezvous implementations. This required a few fixes in unrelated tests.

Test Plan:
Added tests covering `local_addr` for the common rendezvous implementations, to prevent future regressions.

```
buck2 test @//mode/dev-nosan fbcode//caffe2/test/distributed/elastic/... fbcode//caffe2/torch/distributed/elastic/... -- --stress-runs 3
```

To vet the parallelism changes I also ran with 3 stress runs each to identify flakiness caused by parallelism.

Differential Revision: D62256407

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135262
Approved by: https://github.com/fduwjj, https://github.com/wz337
2024-09-06 17:55:43 +00:00
177e4f4218 remove _check call on item() for torch.istft (#135234)
Fixes #135014

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135234
Approved by: https://github.com/tugsbayasgalan
2024-09-06 17:31:25 +00:00
3988b3468b [aoti][easy] remove breakpoint() in wrapper.py (#134807)
Differential Revision: D61687146

Remove an unintended breakpoint in code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134807
Approved by: https://github.com/YUNQIUGUO
2024-09-06 17:25:05 +00:00
04118d8617 [export] Record the global torch version in serialization. (#135243)
Summary: In general I think it will be useful to also record the global torch version in the EP, so that we can track them in the logging in addition to the schema version.

Test Plan: CI

Reviewed By: henryoier

Differential Revision: D62252626

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135243
Approved by: https://github.com/yushangdi
2024-09-06 17:02:06 +00:00
24482e5c68 [torch][fx] Set maximum warning count during fx.Graph.lint (#135069)
Summary:
resnet152 spent about 15 minutes writing warning messages in _unlift
during `to_executorch` because they're all written to unbuffered stderr
by the `warnings` module.

These warnings are almost always about get_attr nodes referencing a
non-existent name:
```lang=py
warnings.warn(f'Node {node} target {node.target} {atom} of {seen_qualname} does '
  'not reference an nn.Module, nn.Parameter, or buffer, which is '
  'what \'get_attr\' Nodes typically target'
)
```
I'm not aware of a way to configure the warnings module to write this out
at most once, so I'm just going to disable the lint for now.

Test Plan:
Re-ran resnet152 with Executorch and the XNNPackBackend, it is much faster now

Differential Revision: D62156090

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135069
Approved by: https://github.com/yushangdi
2024-09-06 16:41:59 +00:00
c0ec599f27 Update submodule ideep to include aarch64 change (#134897)
This PR is per ARM request, which is in https://github.com/intel/ideep/issues/334.

Context for the request: the Arm team has upstreamed the dynamic quantization changes and all the PRs were merged (torch, ideep, oneDNN), but without this ideep submodule update the feature will not work. The change is isolated to the matmul operator and the quantization path alone.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134897
Approved by: https://github.com/jgong5, https://github.com/atalman, https://github.com/snadampal
2024-09-06 16:40:26 +00:00
7074de43c0 Porting to GCC 15 (#135188)
uint8_t is found in the cstdint header

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135188
Approved by: https://github.com/Skylion007
2024-09-06 16:16:53 +00:00
771dcce11d [AOTI][Tooling][6/n] Fix long dtype input tensors calling mean() in aoti_torch_print_tensor_handle (#135072)
Differential Revision: D61635232

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135072
Approved by: https://github.com/hl475, https://github.com/ColinPeppler
2024-09-06 15:59:32 +00:00
de74aafff4 error on exporting ScriptModule (#135302)
Test Plan: added test

Differential Revision: D62279179

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135302
Approved by: https://github.com/yushangdi
2024-09-06 15:12:40 +00:00
ad29a2c0dc Add Inductor config for default stride behavior (#135238)
By default, Inductor is allowed to manipulate the layout
(strides+storage offset) of input tensors to custom operators.

We want to change it so that the default is that Inductor should respect
the stride order of input tensors to custom operators.

This PR adds a config to toggle the behavior; in the next PR up, we'll
change the default. We also make the following changes:
- We add a new operator Tag (flexible_layout), which means that
inductor is allowed to manipulate the layout. When we flip the default,
users can specify they want the old behavior by using this tag.
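A hedged sketch of opting a custom op into the old behavior via the new tag, assuming a build that includes torch.Tag.flexible_layout and that Library.define accepts a tags argument; the op name and schema are hypothetical.

```python
import torch
from torch.library import Library

lib = Library("mylib", "DEF")
# Tag the op as tolerant of Inductor changing input strides/storage offset.
lib.define("my_op(Tensor x) -> Tensor", tags=(torch.Tag.flexible_layout,))

def my_op_impl(x):
    return x.clone()

lib.impl("my_op", my_op_impl, "CompositeExplicitAutograd")
```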

This is a reland of https://github.com/pytorch/pytorch/pull/126986,
which was previously reverted due to silent incorrectness. We've since
fixed the silent incorrectness
(https://github.com/pytorch/pytorch/pull/133639)

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135238
Approved by: https://github.com/albanD
2024-09-06 14:48:24 +00:00
3a9e33dca8 [torchelastic] Don't do signal handling when off the main thread (#135088)
Summary:
In multiprocessing, signal handling is not possible if the thread is not the main thread. This resulted in the following error:
> "ValueError('signal only works in main thread of the main interpreter')"

To address this issue, the diff checks whether the thread is the main thread and, if not, skips signal handling.
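A minimal sketch of the check described above, using only the standard library.

```python
import signal
import threading

def maybe_install_handlers(handler) -> bool:
    # signal.signal() may only be called from the main thread of the main
    # interpreter; skip installation elsewhere instead of raising ValueError.
    if threading.current_thread() is not threading.main_thread():
        return False
    signal.signal(signal.SIGTERM, handler)
    signal.signal(signal.SIGINT, handler)
    return True
```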

Test Plan:
Before this change, MAST job failed:
https://fburl.com/mlhub/iq2m10v8

With this change, MAST job succeeded:
https://fburl.com/mlhub/q6kb8343

Differential Revision: D62166943

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135088
Approved by: https://github.com/d4l3k
2024-09-06 14:47:03 +00:00
a086882d72 [inductor][triton] mark workspace args as mutated (#134648)
SplitScan makes use of a workspace arg that needs to be zeroed before it is used - then, it is used to communicate between thread blocks during the triton kernel implementation. It is mutated during the execution of the kernel, so it should be marked as such.

Before this PR, it is not marked as mutated; AFAIK this is fine during normal execution, but during autotuning it causes problems. The workspace starts off zeroed (as expected), but during autotuning the kernel will be executed multiple times and the workspace does not get re-set between executions, resulting in incorrect data. If the data is used for indexing, then you can fail device-side asserts (and the results after the initial run (with autotuning) could be wrong). The test added in this PR repros the issue when the fix is removed.

When we mark the arg as mutated, then the arg gets cloned before autotuning, so that the arg passed to the kernel during autotuning will always be zeroed as expected.
804852c1f9/torch/_inductor/runtime/triton_heuristics.py (L685-L689)
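A plain-Python analogy (not Triton) of why mutated args must be cloned per autotuning run: a kernel that assumes a zeroed workspace breaks on the second benchmark run unless it gets a fresh copy.

```python
import torch

def split_scan_like_kernel(x, workspace):
    # The kernel relies on the workspace starting at zero, then mutates it.
    assert torch.all(workspace == 0), "workspace was not re-zeroed"
    workspace.add_(1)
    return x.cumsum(0)

x = torch.randn(8)
ws = torch.zeros(4)

split_scan_like_kernel(x, ws)        # first run: fine
try:
    split_scan_like_kernel(x, ws)    # naive re-run: precondition is broken
except AssertionError as e:
    print(e)

split_scan_like_kernel(x, torch.zeros_like(ws))   # fresh copy, as cloning provides
```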

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134648
Approved by: https://github.com/peterbell10, https://github.com/jansel
2024-09-06 14:23:37 +00:00
84ae6b7d6b AOTDispatcher: limit cases when we detach() graph inputs to non-leaves (#134193)
This PR is slightly a revival / update to the discussion from https://github.com/pytorch/pytorch/pull/98960:

Part of FSDP2's tracing strategy right now is that:

(1) it is painful/difficult to handle the case where we have multiple graph input tensors that are aliased to each other and at least one of them is duplicated

(2) we already have longstanding logic in dynamo to remove duplicate input tensors from the graph. Morally, FSDP2 gives us duplicate input tensors in the backward graph for every `unsharded_param`, because we have (a) the `unsharded_param` being closed over by the backward hook to resize/allgather, and (b) the same `unsharded_param` being saved for backward by autograd (we now guarantee in the partitioner that we will always save the base tensor for backward and recompute views)

(3) However, we were still seeing cases where the `unsharded_param` showed up twice in the backward graph inputs, as distinct tensor objects (with different python ids) instead of being true duplicates that dynamo can de-dup.

It turns out that this was because we were `.detach()`ing the `unsharded_param` in AOTDispatcher before plumbing it through the compiled forward (and so autograd would save a detach'd version of the `unsharded_param`). This is precisely because of the logic from https://github.com/pytorch/pytorch/pull/98960.

However, re-reading the detailed comments, it seems unnecessary to do a detach() on a graph input that is a (leaf) `nn.Parameter`, even if it happens to get no gradients in the backward. Since it is a leaf, we don't have to worry about the autograd engine "continuing to backprop through the graph beyond the current tensor" (the leaf has no other grad_fn for autograd to backprop through).

So this PR makes us a bit less aggressive about calling detach() on inputs: we only do it when:

(1) our graph input statically will get a `None` gradient (and also has no metadata mutations, the existing state)

(2) **and** our graph input is a non-leaf tensor (so detach()ing is actually required to prevent autograd from incorrectly backpropping past the non-leaf).
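A condensed sketch of the revised rule, assuming the two conditions above are already available as booleans; the real logic lives in AOTDispatcher and the function name is illustrative.

```python
import torch

def should_detach_input(inp: torch.Tensor, gets_none_grad: bool,
                        has_metadata_mutation: bool) -> bool:
    # Old rule: detach whenever the input statically gets a None gradient
    # (and has no metadata mutations).
    # New rule: additionally require the input to be a non-leaf; detaching a
    # leaf nn.Parameter is unnecessary since autograd cannot backprop past a
    # leaf anyway.
    return gets_none_grad and not has_metadata_mutation and not inp.is_leaf
```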

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134193
Approved by: https://github.com/yf225

Co-authored-by: Will Feng <yf225@cornell.edu>
2024-09-06 14:06:48 +00:00
60a097a071 [CD] Update binary_linux_test.sh to include calling builder smoke test (#133869)
Run smoke test

Fixes #1969

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133869
Approved by: https://github.com/atalman

Co-authored-by: Andrey Talman <atalman@fb.com>
2024-09-06 13:27:24 +00:00
13bae39e22 [inductor] [cpp] improve cache blocking for is_dynamic_M (#131306)
## Performance
Models with >= 3% performance speedup are listed below:

### AMP single-thread dynamic shape (measured on CPU with AMX support)
No regressions

| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | soft_actor_critic | 3% |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131306
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
ghstack dependencies: #135275

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
2024-09-06 13:21:24 +00:00
4ef6c05f65 [inductor][cpp][gemm] fix autotune runtime error from linear_binary fusion (#135275)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135275
Approved by: https://github.com/leslie-fang-intel
2024-09-06 13:21:23 +00:00
d6b9bd3e60 Also handle compiler collective when input variable doesn't exist on all ranks (#135147)
Internal xref:
https://fb.workplace.com/groups/3095840833991792/permalink/3810738595835342/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135147
Approved by: https://github.com/jansel
2024-09-06 13:18:36 +00:00
d0591f4658 Ignore fresh unbacked when doing recursive make_fx inside HOPs (#135053)
Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7705964779531357/

This now also incorporates a test from https://github.com/pytorch/pytorch/pull/133585 (which it fixes) and the prep PR https://github.com/pytorch/pytorch/pull/134407 Including the PR desc from that:

I am trying to fix a problem reported by user in [fb.workplace.com/groups/6829516587176185/permalink/7705964779531357](https://fb.workplace.com/groups/6829516587176185/permalink/7705964779531357/) The summary of this problem is that when we do collect metadata analysis in AOTAutograd, we accumulate pending unbacked symbols which are going to be discarded at the end of the trace. However, if we do a recursive make_fx inside tracing, as occurs with torch.cond, we end up seeing that there are pending unbacked symbols that aren't associated with a binding, even though it's spurious (they've leaked into the inner make_fx call from the outer AOTAutograd analysis).

In https://github.com/pytorch/pytorch/pull/133588 I tried to just prevent adding the symbols to the pending list at all in the first place. But this itself caused some problems which were fixed in https://github.com/pytorch/pytorch/pull/124785 . The problem fixed in that PR is that when we allocate tangents that have unbacked size, something prevented them from having correct unbacked SymInts when ignore fresh unbacked SymInts was enabled. So I had patched it at the time by just not suppressing pending symbols and clearing them out some other way.

I think... I was wrong in that PR? That is to say, it was OK to avoid putting the fresh unbacked symbols in the pending list; the real problem was suppressing unbacked renamings. But there doesn't seem to be a good reason to suppress these; this PR shows that it doesn't actually fail any tests if you do these anyway. Intuitively, this makes sense, because you can't trigger renamings unless you're actually adding unbacked symbols to the pending set.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135053
Approved by: https://github.com/ydwu4
2024-09-06 13:13:15 +00:00
b5dea061c8 check compilation status before query cudnn version in conv (#135332)
This PR fixes https://github.com/pytorch/pytorch/issues/135322. The cuDNN compilation status should be checked first, before querying the version; otherwise conv may trigger a RuntimeError before any check in other non-CUDA backends.
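A Python-level sketch that mirrors the ordering in the fix (the actual change is in the C++ convolution path): confirm cuDNN support is compiled in before relying on its version.

```python
import torch

def cudnn_version_or_none():
    if not torch.backends.cudnn.is_available():
        return None                           # no cuDNN support compiled in
    return torch.backends.cudnn.version()     # safe to query now

print(cudnn_version_or_none())
```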

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135332
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-09-06 12:50:04 +00:00
041960a1ce [Dynamo] Automatically in-graph traceable tensor subclass ctors (#135151)
Fixes https://github.com/pytorch/pytorch/issues/114389

Previously, dynamo would attempt to trace through the `__init__` of traceable tensor subclasses, since their constructors are AOT dispatcher traceable by definition, dynamo should automatically put these in the graph like we do for any other tensors. Not doing this is difficult because dynamo would need to apply mutations post tensor subclass creation in the graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135151
Approved by: https://github.com/bdhirsh
2024-09-06 12:23:38 +00:00
67c7924ea1 [inductor] Fix gen_transposed_tile_load_store (#135307)
The recent PR https://github.com/pytorch/pytorch/pull/131745 brings new VLA logic into the cpp codegen, which raises a build failure on MSVC with error code `Compiler Error C2131`: https://learn.microsoft.com/en-us/cpp/error-messages/compiler-errors-1/compiler-error-c2131?view=msvc-170

reproduce UT:
```cmd
pytest test\inductor\test_torchinductor_dynamic_shapes.py -v -k test_large_block_sizes_dynamic_shapes_cpu
```

Original generated code:
```c++
alignas(16) float tmp1[static_cast<int64_t>(((-256LL)*(c10::div_floor_integer(static_cast<int64_t>(ks1), static_cast<int64_t>(16LL)))) + (16LL*ks1))];
```

Changes:
allocate a large-enough fixed-sized buffer.

New generated code:
```c++
alignas(16) float tmp1[16*16];
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135307
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-06 10:44:08 +00:00
217ba7b2ab [Docs] Update FileCheck doc (#135199)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135199
Approved by: https://github.com/soulitzer
2024-09-06 08:18:38 +00:00
758d515d98 [Inductor][CPP] Select tiling factor for lower precision data types (#133830)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133830
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-06 08:12:37 +00:00
60d98b4cfb Update torch-xpu-ops pin (ATen XPU implementation) (#135300)
Release cycle for PyTorch 2.5
1. Bugfixing: correct reduction logic in cdist kernel.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135300
Approved by: https://github.com/EikanWang
2024-09-06 07:30:09 +00:00
590a3e9f8a [export][training ir migration] quantized_decomposed.quantize_per_tensor decomposition (#134525)
Summary:
In the graph of the TestXNNPACKQuantizer.test_dynamic_linear_with_conv test, some quantized_decomposed.quantize_per_tensor.default ops become quantized_decomposed.dequantize_per_tensor.tensor ops when using the new training IR.

This is because we lift params/buffers before calling make_fx. Previously, in the graph passed to make_fx, `graph.L__self___linear1.weight` was a tensor;
now, in the training IR, graph.L__self___linear1.weight is a FakeTensor. This caused the node overload to be different.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_dynamic_linear_with_conv
```

Differential Revision: D61364547

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134525
Approved by: https://github.com/tugsbayasgalan, https://github.com/jerryzh168
2024-09-06 07:06:06 +00:00
764ee6e3f9 [FlexAttention] Specify padding_value for boundary checked loads (#134573)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134573
Approved by: https://github.com/Chillee
2024-09-06 06:47:26 +00:00
67f98a99a4 [DeviceMesh][Easy] Make RuntimeError a bit more descriptive by including the actual world_size (#135271)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135271
Approved by: https://github.com/fduwjj
2024-09-06 06:23:20 +00:00
e020a8755a [Fix][FR][ez] Remove debugging logs (#135308)
Removing the print added during debugging process.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135308
Approved by: https://github.com/wz337
2024-09-06 06:14:33 +00:00
7ffb3b201c [inductor] Remove LoopBody.reads,writes,other (#135256)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135256
Approved by: https://github.com/oulgen
ghstack dependencies: #135070, #135076, #135082, #135084, #135079, #135235
2024-09-06 06:11:55 +00:00
f946bf88c4 [inductor] Skip retracing an existing LoopBody (#135235)
This is roughly a 7% speedup in inductor compile time for hf_Bert_large.  The time spent in `LoopBody.__init__` improves from 15% to 8% of `fx_codegen_and_compile`.

Before
![image](https://github.com/user-attachments/assets/7de0f28e-35bd-472f-b4be-b52733d2a85c)

After
![image](https://github.com/user-attachments/assets/5f0cf11a-43c5-43ae-b13c-f32383a75a7f)

Overall
![image](https://github.com/user-attachments/assets/6a369d8c-fb5e-4ad2-9504-0fc745ad6568)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135235
Approved by: https://github.com/oulgen
ghstack dependencies: #135070, #135076, #135082, #135084, #135079
2024-09-06 06:11:55 +00:00
66da3b3b2a [fx] Bypass custom __setattr__ in Node.__init__ (#135079)
Before:
![image](https://github.com/user-attachments/assets/5f0a6ae6-6049-44d0-b5f2-a549a23ad97f)

After:
![image](https://github.com/user-attachments/assets/51c9f91b-f8a0-4043-8362-65813feec823)
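A generic sketch of the technique in the title: a constructor can bypass an expensive custom __setattr__ with object.__setattr__. The class below is illustrative, not torch.fx.Node itself.

```python
class TrackedNode:
    def __setattr__(self, name, value):
        # Expensive bookkeeping that runs on every ordinary attribute write.
        self.__dict__.setdefault("_writes", []).append(name)
        super().__setattr__(name, value)

    def __init__(self, name, target):
        # Bypass the custom __setattr__ for hot construction-time writes.
        object.__setattr__(self, "name", name)
        object.__setattr__(self, "target", target)

n = TrackedNode("add_1", "operator.add")
print(n.name, n.target, getattr(n, "_writes", []))   # bookkeeping skipped: []
```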

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135079
Approved by: https://github.com/oulgen
ghstack dependencies: #135070, #135076, #135082, #135084
2024-09-06 06:11:46 +00:00
41e653456e [RDP] Fix "No module named 'libfb’" (#135244)
Summary:
D62215095 Introduced an import error to arvr pipelines as the is_fbcode() function does not work as intended.

This changes is_fbcode() to be a much stricter check.

Test Plan:
```
buck2 run arvr/mode/platform010/opt-stripped //arvr/libraries/depthlink/clients/mr_replay:pipeline_runner -c bolt.use_eva3_sim=True -- --config_file arvr/libraries/depthlink/clients/mr_replay/configs/runner_config.yaml --features DEPTH
```

Differential Revision: D62237502

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135244
Approved by: https://github.com/aorenste
2024-09-06 04:52:31 +00:00
e40a0a9359 Add randomness checking for sdpa vmap (#135176)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135176
Approved by: https://github.com/zou3519
2024-09-06 04:50:49 +00:00
c05a7adb36 [inductor][debug] fix draw_buffers (#135266)
**Before:**
![image](https://github.com/user-attachments/assets/aac756f3-1349-4647-9da3-87cf105cf647)

**After:**
<img width="791" alt="image" src="https://github.com/user-attachments/assets/d72c663c-e598-42fa-ac40-9e58956f1ec1">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135266
Approved by: https://github.com/yf225
2024-09-06 04:12:41 +00:00
5f57be7571 [Distributed] Change function call in test to non-deprecated to eliminate warning (#134938)
Migrate function calls in the test to eliminate the warning message below and reduce the chance of test failures when the deprecated methods are removed (a migration sketch follows the warning message).

-  from deprecated `save_state_dict` change to `save`
-  from deprecated `load_state_dict` change to `load`

Warning message:
```bash
pytorch/test/distributed/checkpoint/test_fsdp_model_state.py:37: FutureWarning: `save_state_dict` is deprecated and will be removed in future versions.Please use `save` instead.

```
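A hedged sketch of the migration in the bullets above, assuming the torch.distributed.checkpoint module layout; a real multi-rank run would initialize a process group and use a shared checkpoint directory.

```python
import torch
import torch.distributed.checkpoint as dcp

model = torch.nn.Linear(4, 4)
state_dict = {"model": model.state_dict()}

# Deprecated (emits FutureWarning):  dcp.save_state_dict(...) / dcp.load_state_dict(...)
# Replacements used by the migrated test:
dcp.save(state_dict, checkpoint_id="ckpt_dir")
dcp.load(state_dict, checkpoint_id="ckpt_dir")
```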

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134938
Approved by: https://github.com/wz337, https://github.com/fegin
2024-09-06 03:25:09 +00:00
29d72c1100 [inductor] check intel compiler minimal version (#135209)
On Windows, early versions of icx have a `-print-file-name` issue and cannot preload correctly for inductor. Add a minimal version check for the Intel compiler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135209
Approved by: https://github.com/ezyang
2024-09-06 03:21:07 +00:00
3b1a334c0f [Inductor][CPP] Avoid mistake wgt tensor delete (#135100)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/134998: Previously, we only checked whether the `get_attr` FX node for the weight had a single user node. However, two `get_attr` nodes may share the same tensor, and the tensor should not be deleted in such cases. In this PR, we consider the number of users of the tensor in addition to the number of users of the node to decide whether the tensor can be deleted.

**TestPlan**
```
 python test/inductor/test_cpu_select_algorithm.py -k test_linear_wgt_multi_users
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135100
Approved by: https://github.com/jgong5
2024-09-06 03:13:36 +00:00
07689a38bf [Inductor] Fix AOT weight alignment issue on CPU (#135205)
**Summary**
Fix issue: https://github.com/pytorch/pytorch/issues/135027. On CPU, the `consts_size` used to generate `_binary_constants_bin_start` is not padded to `ALIGN_BYTES`, while `serialized_weights` is, causing a failure in the 16K alignment check.
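The padding rule implied by the fix, as a small worked sketch; ALIGN_BYTES is written as 16384 here purely to match the 16K figure mentioned above.

```python
ALIGN_BYTES = 16384  # 16K, per the alignment check mentioned above

def pad_to_alignment(size: int, align: int = ALIGN_BYTES) -> int:
    """Round size up to the next multiple of align."""
    return (size + align - 1) // align * align

consts_size = 100_001
padded = pad_to_alignment(consts_size)
assert padded % ALIGN_BYTES == 0
print(padded)   # 114688
```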

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135205
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-09-06 03:06:51 +00:00
06a7dc21c1 Remove dead expect_rational (#135105)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135105
Approved by: https://github.com/malfet
2024-09-06 02:57:27 +00:00
d9a18173fa Report qualname of exception type rather than <class 'RuntimeError'> (#135146)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135146
Approved by: https://github.com/Skylion007, https://github.com/albanD, https://github.com/yanboliang
ghstack dependencies: #135148, #135145
2024-09-06 02:56:50 +00:00
d8543e3162 Include exception type qualname when rewrapping InternalTorchDynamoError (#135145)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135145
Approved by: https://github.com/drisspg, https://github.com/anijain2305
ghstack dependencies: #135148
2024-09-06 02:56:50 +00:00
ad01fc194d Consolidate raise and rewrap raise error branches (#135148)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135148
Approved by: https://github.com/anijain2305, https://github.com/albanD, https://github.com/yanboliang, https://github.com/malfet
2024-09-06 02:56:46 +00:00
e162414963 add instrumentation of CCA stats for reserved and allocated memory size (#135231)
As titled
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135231
Approved by: https://github.com/c-p-i-o
2024-09-06 02:48:56 +00:00
9e5a797771 Improve test_public_bindings import module error reporting (#135258)
Error was hard to understand without message. Render it now. See https://github.com/pytorch/pytorch/pull/135259 for it in action.

Example failure:

```
2024-09-05T20:04:45.3022000Z FAILED [5.9524s] test_public_bindings.py::TestPublicBindings::test_modules_can_be_imported - AssertionError: String comparison failed: '' != "torch._logging.scribe failed to import w[112 chars].py)"
2024-09-05T20:04:45.3025413Z + torch._logging.scribe failed to import with error ImportError: cannot import name 'TypeAlias' from 'typing' (/opt/conda/envs/py_3.9/lib/python3.9/typing.py)
2024-09-05T20:04:45.3026990Z
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135258
Approved by: https://github.com/albanD
2024-09-06 02:40:03 +00:00
b46a1b9e2d Use Python 3.9 on all libtorch jobs (#135245)
Part of the migration py3.8->3.9

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135245
Approved by: https://github.com/izaitsevfb
2024-09-06 02:27:22 +00:00
9688014820 aarch64: extend matmul heuristic checks to all neoverse platforms (#134548)
For aarch64 Neoverse platforms there are two GEMM backends available for the matmul operator in PyTorch: (1) Arm Compute Library and (2) OpenBLAS. While Arm Compute Library provides better performance than OpenBLAS, it has kernel launch-time overhead, so we use OpenBLAS for smaller tensor compute (a toy sketch of this dispatch follows). The heuristic was originally implemented for neoverse_v1; this commit extends it to other Neoverse platforms.
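A toy sketch of the kind of dispatch heuristic described above; the threshold and the max-dimension check are placeholders, not the tuned values used for Neoverse.

```python
def pick_matmul_backend(m: int, n: int, k: int, small_gemm_threshold: int = 64) -> str:
    # Small GEMMs: OpenBLAS avoids Arm Compute Library's kernel-launch overhead.
    # Large GEMMs: ACL's throughput wins once the launch cost is amortized.
    if max(m, n, k) <= small_gemm_threshold:
        return "OpenBLAS"
    return "ACL"

print(pick_matmul_backend(8, 8, 8))        # OpenBLAS
print(pick_matmul_backend(512, 512, 512))  # ACL
```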

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134548
Approved by: https://github.com/malfet
2024-09-06 01:40:50 +00:00
8f6e73f068 [ONNX] Enable experimental exporter logic to dynamo_export and support refine dynamic_shapes (#134976)
(1) Enable experimental exporter logic to dynamo_export
(2) Refine dynamic shapes and retry export in export strategies
(3) Delete `torch_export_graph_extractor` and use the new export logic
(4) Disable ExportedProgram test in `test_fx_onnx_with_onnxruntime.py`, as ONNXProgram is different now.

Fixes https://github.com/pytorch/pytorch/issues/126479
Fixes #135183
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134976
Approved by: https://github.com/justinchuby
2024-09-06 01:29:56 +00:00
1e57ef08fa [AOTI] Support MKLDNN qconv ops in cpp wrapper (#134795)
Summary: Similar to https://github.com/pytorch/pytorch/pull/134475, support qconv in the ABI-compatible mode for cpp-wrapper Inductor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134795
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/angelayi
ghstack dependencies: #134475, #134783
2024-09-06 01:01:53 +00:00
614b86d602 [AOTI] Support MKLDNN qlinear ops in cpp wrapper (#134783)
Summary: Similar to https://github.com/pytorch/pytorch/pull/134475, support qlinear in the ABI-compatible mode for cpp-wrapper Inductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134783
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/angelayi
ghstack dependencies: #134475
2024-09-06 01:01:53 +00:00
0b96dfb736 [AOTI] Support MKLDNN conv ops in cpp wrapper (#134475)
Summary: Partially fix https://github.com/pytorch/pytorch/issues/123040. In the ABI-compatible mode, MKLDNN fallback ops do not have C shim implementations and thus need to go through the custom ops launch path. Other MLKDNN ops will be fixed in following PRs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134475
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/angelayi
2024-09-06 01:01:53 +00:00
62b221d5cc Add Percentages to Function Events (#135155)
Summary: Users have recently asked that the profiler add self/total CPU and device percentages to FunctionEvents so that teams can process the data procedurally. Some of it could be computed mathematically via subroutines, but since we already have the information in the _build_table, let's build it there.

Test Plan: Check that we have the same table as before, and that the parameters we check have the expected values

Differential Revision: D62210351

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135155
Approved by: https://github.com/shanw-meta, https://github.com/kit1980
2024-09-06 00:39:11 +00:00
66dd4577b1 Track base of FunctionalTensor in inference mode. (#135141)
The idea behind the tracking is the following: whenever we see a tensor, if the tensor is a root tensor (it does not have any view metas), we consider it as the base of all the tensors that share its storage.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135141
Approved by: https://github.com/zou3519
2024-09-06 00:10:25 +00:00
cyy
cc28634172 [Submodule] Bump pybind11 to v2.13.5 (#135202)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135202
Approved by: https://github.com/Skylion007
2024-09-06 00:09:00 +00:00
c83cdf068b [DTensor] Fix view op replicating on tensor dim when the size of the tensor dim = 1 (#135054)
We found a corner case where, when a tensor dimension is 1, calling `view(1)` would result in an unexpected replication (see case 1 below). When the tensor dimension to shard is not 1, no matter whether the tensor dimension is evenly shardable across the mesh dimension, it won't cause an implicit replication behind the scenes if view doesn't change the size of the given tensor dimension (see cases 2 and 3).

When the tensor dimension to shard is of size 1, it is not being added to shardable_dims here:
https://github.com/pytorch/pytorch/blob/main/torch/distributed/_tensor/ops/_view_ops.py#L518

```
# uneven case where the size of the tensor dimension to shard is 1
p = torch.randn(1, 2)
mesh = init_device_mesh("cuda", (2,))
dtensor = distribute_tensor(p, mesh, [Shard(0)])
t = dtensor.view(1, 2)
# this would result in replication, meaning t is now replicated across all ranks.

# uneven case where the size of the tensor dimension to shard is not 1
p = torch.randn(3, 2)
mesh = init_device_mesh("cuda", (2,))
dtensor = distribute_tensor(p, mesh, [Shard(0)])
t = dtensor.view(3, 2)
# this would not result in replication, meaning t stays as sharded.

# even case
p = torch.randn(2, 2)
dtensor = distribute_tensor(p, mesh, [Shard(0)])
t = dtensor.view(2, 2)
# this would not result in replication, meaning t stays as sharded.
```

Differential Revision: [D62155606](https://our.internmc.facebook.com/intern/diff/D62155606)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135054
Approved by: https://github.com/tianyu-l, https://github.com/wanchaol
2024-09-06 00:03:54 +00:00
28ccfba248 [ONNX] Delete ONNXProgramSerializer (#135261)
Fixes #135182

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135261
Approved by: https://github.com/justinchuby
2024-09-05 23:52:51 +00:00
b2386bdca1 [debug] Add helper to run cProfile on a function (#135084)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135084
Approved by: https://github.com/oulgen
ghstack dependencies: #135070, #135076, #135082
2024-09-05 23:41:30 +00:00
bdfc8d9f96 [fx] Don't use generators in map_aggregate (#135082)
While the generators avoid a copy, they are slow.

Before:
![image](https://github.com/user-attachments/assets/70a55a9a-0595-4105-b0ab-22cf77c7409c)

After:
![image](https://github.com/user-attachments/assets/cecb9c59-ae36-47de-8b08-cab2c7cb3d57)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135082
Approved by: https://github.com/oulgen
ghstack dependencies: #135070, #135076
2024-09-05 23:41:30 +00:00
70779dded8 [fx] Compile time optimization in Node.__update_args_kwargs (#135076)
Before this we took two passes over all of the args.

Before:
![image](https://github.com/user-attachments/assets/24ce5628-03f4-4983-9f2d-5ddf0ca5816e)

After:
![image](https://github.com/user-attachments/assets/c9681aa2-32f0-4f6b-a598-fc6f90ffafb5)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135076
Approved by: https://github.com/Chillee
ghstack dependencies: #135070
2024-09-05 23:41:30 +00:00
ea231300d1 [inductor] Improve compile time regression from MemoryDep.normalize (#135070)
Possible fix for #135056

Before
![image](https://github.com/user-attachments/assets/3962cb85-e808-4fd4-991f-471ff5ef7eae)

After
![image](https://github.com/user-attachments/assets/2322d48d-6518-4518-baca-336027b5cda8)

Measured based on:
```
python benchmarks/dynamo/torchbench.py --ci --accuracy --timing --explain --inductor --device cuda --training --only hf_Bert_large --stats -n1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135070
Approved by: https://github.com/Chillee
2024-09-05 23:41:30 +00:00
8f66995459 Revert "Support rolling over a percentage of workflows (#134816)"
This reverts commit fc890b55b51098437b6149abf1026a8b2aaee389.

Reverted https://github.com/pytorch/pytorch/pull/134816 on behalf of https://github.com/malfet due to Causes lint to intermittently fail ([comment](https://github.com/pytorch/pytorch/pull/134816#issuecomment-2332902609))
2024-09-05 23:39:41 +00:00
144fde4fd2 [MPS] Add support for autocast in MPS (#99272)
Fixes https://github.com/pytorch/pytorch/issues/88415

Need to run inductor/test_cpu_select_algorithm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/99272
Approved by: https://github.com/malfet

Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Roy Hvaara <roy@lightyear.no>
2024-09-05 23:23:17 +00:00
43f4947d44 fix fake tensor tolist implementation (#135131)
Summary:
When exporting for training with `tolist`, we do not hit `FunctionalTensor.tolist` since we do not functionalize. Unfortunately, this means we hit `FakeTensor.tolist`, which creates unbacked symints that are not backed by proxies.

Rather than trying to patch up this low-level implementation, we replace it with essentially what `FunctionalTensor.tolist` does, which is higher-level: we essentially desugar to `item()` calls and let it take care of unbacked symints.
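As a rough sketch of the desugaring described above (illustrative only, not the actual implementation), `tolist()` turns into per-element `item()` calls, which already know how to handle unbacked symints:

```python
import torch

def tolist_via_item(t: torch.Tensor):
    # keep the sketch to 1-D tensors; the real desugaring recurses over dimensions
    assert t.dim() == 1
    return [t[i].item() for i in range(t.shape[0])]

print(tolist_via_item(torch.tensor([1, 2, 3])))  # [1, 2, 3]
```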

Test Plan:
Some expected failures are gone now.
Also found a test for `tolist` that was written when `FunctionalTensor.tolist` was implemented but not really doing much; repurposed it now to exercise more modes.

Differential Revision: D62197742

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135131
Approved by: https://github.com/ezyang
2024-09-05 23:20:31 +00:00
65e1c34061 [rfc] scuba for flight recorder (#134794)
Summary: Record flight recorder status in a scuba table.

Test Plan: Testing with timing out a job. Will post results soon.

Differential Revision: D61729221

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134794
Approved by: https://github.com/fduwjj
2024-09-05 23:18:10 +00:00
830247c355 [Intel Triton] Update Intel Triton to release/2.5.0 (#134074)
This PR relands https://github.com/pytorch/pytorch/pull/134053

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134074
Approved by: https://github.com/EikanWang
2024-09-05 22:46:31 +00:00
4262755b5a [cond] fix typo in cond codegen (#134708)
As titled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134708
Approved by: https://github.com/jansel
2024-09-05 22:38:24 +00:00
3825607144 Add torch._logging.scribe (#135224)
See https://github.com/pytorch/pytorch/pull/135138 for a usage example. Meta only, see https://docs.google.com/document/d/1JpbAQvRhTmuxjnKKjT7qq57dsnV84nxSLpWJo1abJuE/edit#heading=h.9wi46k7np6xw for context

fbscribelogger is a library that allows us to write to scribe, which is Meta's logging infrastructure, when you have an appropriate access token (this token is available for jobs running on main, as well as authorized jobs with the ci-scribe label). The resulting data is accessible via Scuba (a real-time in-memory database) and Hive (a more traditional SQL persisted database).

Here's the motivating use case. Suppose there is somewhere in PyTorch's codebase where you'd like to log an event, and then you'd like to find all the situations where this log is called. If PyTorch is rolled out to our internal users, we have some FB-oriented APIs (like torch._utils_internal.signpost_event) with which you can do this. But you have to actually land your PR to main, wait for it to be ingested to fbcode, and then wait for us to actually roll out this version, before you get any data. But what if you want the results within the next few hours? Instead, you can use torch._logging.scribe to directly write to our logging infrastructure *from inside CI jobs.* The most convenient approach is to log unstructured JSON blobs to `open_source_signpost` (added in this PR; you can also add your own dedicated table as described in the GDoc above). After adding logging code, you can push your PR to CI, add the 'ci-scribe' label, and in a few hours view the results in Scuba, e.g., (Meta-only) https://fburl.com/scuba/torch_open_source_signpost/z2mq8o4l. If you want continuous logging on all commits on master, you can land your PR and it will continuously get logged for all CI runs that happen on main.

Eventually, if your dataset is important enough, you can consider collaborating with PyTorch Dev Infra to get the data collected in our public AWS cloud so that OSS users can view it without access to Meta's internal users. But this facility is really good for prototyping / one-off experiments. It's entirely self serve: just add your logging, run your PR CI with ci-scribe, get results, do analysis in Scuba.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135224
Approved by: https://github.com/Skylion007
2024-09-05 22:37:13 +00:00
eqy
3c8f71ff93 [cuDNN][64-bit indexing] cuDNN v9.3+ supports non-batch-splittable convolutions with > 2**31 elements (#134890)
For longstanding issues such as #95024

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134890
Approved by: https://github.com/Skylion007
2024-09-05 22:22:45 +00:00
fc890b55b5 Support rolling over a percentage of workflows (#134816)
In order to support adding a rollover percentage, this ended up being a complete rewrite of runner_determinator.py.

Details of the new format are in the comments up top.

On the plus side, this now includes some unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134816
Approved by: https://github.com/PaliC, https://github.com/zxiiro
2024-09-05 22:21:45 +00:00
058a69d91a [fbcode][dynamo] Turn on guard_nn_modules using justknobs_check (#134928)
As Title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134928
Approved by: https://github.com/ezyang
2024-09-05 22:05:54 +00:00
6c5920d515 Tune int8 AMX WoQ micro-kernel for CPU (#134832)
This patch prevents performance regression against the default ATen implementation for LLaMA 3.1 int8 GPTQ WoQ workload.

Uses AMX micro-kernel only if `M` >= `block_m`
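A minimal sketch of that dispatch condition (hypothetical helper name, not the Inductor code):

```python
def use_amx_microkernel(m: int, block_m: int) -> bool:
    # AMX tiles process block_m rows at a time; for skinny GEMMs (m < block_m)
    # the default ATen path is faster, so keep it as the fallback.
    return m >= block_m
```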

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134832
Approved by: https://github.com/jgong5
2024-09-05 22:01:14 +00:00
116fd474da [export] Expand coverage to more copied sym ops for unflattener. (#135119)
Test Plan:
buck2 test 'fbcode//mode/opt' fbcode//torchrec/ir/tests:test_serializer -- --run-disabled

```
File changed: fbcode//caffe2/torch/export/unflatten.py
Buck UI: https://www.internalfb.com/buck2/2e0377e7-e2b6-4bd0-8133-a787245165a0
Test UI: https://www.internalfb.com/intern/testinfra/testrun/5066549824883887
Network: Up: 0B  Down: 0B
Jobs completed: 16. Time elapsed: 10.2s.
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D62190172

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135119
Approved by: https://github.com/yushangdi
2024-09-05 21:58:20 +00:00
a5d70cf545 [PyTorch] Add isfinite to BFloat16-math.h (#135052)
Missing function from <cmath>.

Differential Revision: [D62148884](https://our.internmc.facebook.com/intern/diff/D62148884/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135052
Approved by: https://github.com/PaliC, https://github.com/albanD
ghstack dependencies: #135031
2024-09-05 21:50:36 +00:00
7fe819d917 [PyTorch] Fix -Wshadow -Werror build in BFloat16-inl.h (#135031)
`float_t` is required to exist in C99 math.h, which causes -Wshadow to fire. We don't need the alias, fortunately.

Differential Revision: [D62135908](https://our.internmc.facebook.com/intern/diff/D62135908/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135031
Approved by: https://github.com/albanD
2024-09-05 21:48:21 +00:00
f63571060c Revert "Use actions/upload-artifact@v4.4.0 for rest of workflows (#135264)"
This reverts commit 9c0b03020b7204ca5d5dbe18174bab005f79c47b.

Reverted https://github.com/pytorch/pytorch/pull/135264 on behalf of https://github.com/atalman due to broke CI ([comment](https://github.com/pytorch/pytorch/pull/135264#issuecomment-2332674607))
2024-09-05 21:43:05 +00:00
38fead8f7c [hop] preserve metadata in re-tracing hop subgraph by running with interpreter (#135159)
This way, `Interpreter.run` can correctly preserve the current metadata of the subgraphs when tracing them.
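For intuition, a tiny standalone example of running a traced subgraph through `fx.Interpreter` (illustrative only, not the HOP re-tracing code itself):

```python
import torch
import torch.fx as fx

def f(x):
    return torch.sin(x) + 1

gm = fx.symbolic_trace(f)
# Interpreter re-executes the graph node by node, which gives a re-tracer the
# chance to carry over each node's existing metadata as it goes.
out = fx.Interpreter(gm).run(torch.randn(3))
print(out)
```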

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135159
Approved by: https://github.com/tugsbayasgalan
2024-09-05 21:36:56 +00:00
24a223c49d Run inductor micro benchmark on x86 metal runner (#135042)
This enables inductor micro benchmark on CPU (x86):

* Running on AWS metal runner for more accurate benchmark
* I add a new `arch` column, which will be either x86_64 or arm64 for CPU, or the GPU name for GPU.  We can use this later to differentiate between different setups, e.g. cuda (a100) vs cuda (a10g) or cpu (x86_64) vs cpu (arm64)

The next step would be to run this on cpu arm64 and cuda (a10g).

### Testing
Here is the CSV results from my test run https://github.com/pytorch/pytorch/actions/runs/10709344180

```
name,metric,target,actual,dtype,device,arch,is_model
mlp_layer_norm_gelu,flops_utilization,0.8,17.36,bfloat16,cpu,x86_64,False
gather_gemv,memory_bandwidth(GB/s),990,170.80,int8,cpu,x86_64,False
gather_gemv,memory_bandwidth(GB/s),1060,204.78,bfloat16,cpu,x86_64,False
Mixtral-8x7B-v0.1,token_per_sec,175,26.68,int8,cpu,x86_64,True
Mixtral-8x7B-v0.1,memory_bandwidth(GB/s),1130,171.91,int8,cpu,x86_64,True
Mixtral-8x7B-v0.1,compilation_time(s),162,47.36,int8,cpu,x86_64,True
gemv,memory_bandwidth(GB/s),870,236.36,int8,cpu,x86_64,False
gemv,memory_bandwidth(GB/s),990,305.71,bfloat16,cpu,x86_64,False
Llama-2-7b-chat-hf,token_per_sec,94,14.01,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),1253,185.18,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,compilation_time(s),162,74.99,bfloat16,cpu,x86_64,True
Llama-2-7b-chat-hf,token_per_sec,144,25.09,int8,cpu,x86_64,True
Llama-2-7b-chat-hf,memory_bandwidth(GB/s),957,165.83,int8,cpu,x86_64,True
Llama-2-7b-chat-hf,compilation_time(s),172,70.69,int8,cpu,x86_64,True
layer_norm,memory_bandwidth(GB/s),950,172.03,bfloat16,cpu,x86_64,False
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135042
Approved by: https://github.com/yanboliang
2024-09-05 21:31:36 +00:00
e4920a1364 [Traceable FSDP2][Dynamo] allow tracing through auto_functionalized HOP (#135169)
If an `auto_functionalized` HOP is included in the backward graph due to activation checkpointing, we will run into a scenario where Compiled Autograd Dynamo tracing needs to trace through the `auto_functionalized` HOP. This PR adds support for it.

Test commands:
- `pytest -rA test/inductor/test_compiled_autograd.py::TestCompiledAutograd::test_trace_auto_functionalized`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135169
Approved by: https://github.com/zou3519
2024-09-05 21:22:45 +00:00
bc5ecf83d7 [training ir migration] Fix quantization tests (#135184)
Summary:
Fixed some quantization tests for the new training IR:

Fixed the batch norm node pattern matcher: in the training IR, we have an `aten.batch_norm` node instead of `aten._native_batch_norm_legit` and `aten._native_batch_norm_legit_no_training`.

Test Plan:
```
buck run fbcode//mode/dev-nosan fbcode//caffe2/test:quantization_pt2e
```

Reviewed By: tugsbayasgalan

Differential Revision: D62209819

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135184
Approved by: https://github.com/tugsbayasgalan
2024-09-05 21:19:28 +00:00
e55c0f59e5 Revert "[Reland] Refactor caching device allocator utils (#130923)"
This reverts commit 9809080b9ed657a8c0ea0383be7cbdce3a26e05e.

Reverted https://github.com/pytorch/pytorch/pull/130923 on behalf of https://github.com/kit1980 due to breaking internal builds - Error: Relocation overflow has occured ([comment](https://github.com/pytorch/pytorch/pull/130923#issuecomment-2332640961))
2024-09-05 21:16:14 +00:00
a4cf9653ee Revert "Remove Caffe2 code from tool scripts (#134941)"
This reverts commit c818ecd1698a28d9fadf4a81453a89914b18374a.

Reverted https://github.com/pytorch/pytorch/pull/134941 on behalf of https://github.com/kit1980 due to breaking internal builds - The path `caffe2/operators/hip/gather_op.cuh` does not exist ([comment](https://github.com/pytorch/pytorch/pull/134941#issuecomment-2332636624))
2024-09-05 21:12:54 +00:00
9c0b03020b Use actions/upload-artifact@v4.4.0 for rest of workflows (#135264)
To be consistent with https://github.com/pytorch/pytorch/pull/135263 and rest of workflows. Use v4.4.0.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135264
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-09-05 21:05:06 +00:00
034717a029 [ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133438
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2024-09-05 20:36:45 +00:00
9c38b00999 [export] Add ability to run eagerly on UnflattenedModule (#133996)
Summary:
Added the context manager `_disable_interpreter`, which is meant to be put around a call to `unflatten`. This will generate an UnflattenedModule and sub-InterpreterModules which will not use torch.fx.Interpreter to run eagerly. We want to have this as a state of the module instead of a context manager around running the module because it's not clear where we are calling the unflattened module.

This seems to improve the performance: https://fb.workplace.com/groups/1075192433118967/posts/1473590629945810/?comment_id=1473621763276030

Test Plan: CI

Differential Revision: D60939034

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133996
Approved by: https://github.com/pianpwk
2024-09-05 20:28:42 +00:00
8efe547046 Use actions/upload-artifact@v4.4.0 for triton builds (#135263)
Same as: https://github.com/pytorch/pytorch/pull/135139
Fixes upload failure: https://github.com/pytorch/pytorch/actions/runs/10722567217/job/29748125015
fix regression introduced by https://github.com/pytorch/pytorch/pull/135068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135263
Approved by: https://github.com/kit1980, https://github.com/huydhn
2024-09-05 20:03:39 +00:00
82d00acfee Allow cross-device copies for cpu scalars in refs (#135140)
This copies our eager-mode behavior where someone can do torch.add(a, b, out=c)
where a and b are CPU scalar tensors and c is a CUDA tensor.
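For reference, a minimal eager-mode example of that pattern (assumes a CUDA device is available):

```python
import torch

a = torch.tensor(2.0)               # CPU scalar tensor
b = torch.tensor(3.0)               # CPU scalar tensor
c = torch.empty((), device="cuda")  # CUDA output tensor
torch.add(a, b, out=c)              # eager mode permits this cross-device write
print(c)                            # tensor(5., device='cuda:0')
```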

Fixes https://github.com/pytorch/pytorch/issues/121619 by side effect (we get into a situation where we're writing a CPU scalar into a FakeTensor that is actually a meta tensor)

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135140
Approved by: https://github.com/williamwen42, https://github.com/yanboliang
2024-09-05 19:08:48 +00:00
098431a29d Update Resize.cpp with new device type (#135117)
Update Resize.cpp with new device type

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135117
Approved by: https://github.com/egienvalue
2024-09-05 18:53:13 +00:00
be660ea2d3 [PT2] Directly set meta.val in group_batch_fusion_aten (#135078)
Summary: instead of using FakeTensorProp after the pass

Differential Revision: D62162640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135078
Approved by: https://github.com/frank-wei
2024-09-05 18:17:06 +00:00
52c7c89ea4 [Inductor][CPP] Leverage full bits for BF16/FP16 vectorization (#126502)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126502
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-05 17:17:46 +00:00
1efd341d15 [fake_tensor] Move unrecognized_type NotImplemented before ConstProp (#135033)
We should not try to do ConstProp on the unrecognized types (e.g. Subclasses).
In case of those types throwing NotImplemented will jump to the next torch_dispatch.

Test:
```
 python test/functorch/test_aotdispatch.py -k test_aot_test_subclasses_with_tensor_factories
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135033
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2024-09-05 17:09:41 +00:00
a096f2899d Add torch.serialization.skip_data context manager (#134504)
## Semantic

The semantic is
(1) By default `torch.serialization.skip_data(materialize_fake_tensors=False)` will make `torch.save` skip writing storages (but reserve space for them in the checkpoint).

```python
import torch
import torch.nn as nn

sd = nn.Linear(3, 5).state_dict()
with torch.serialization.skip_data():
    torch.save(sd, 'foo.pt')
print(torch.load('foo.pt', weights_only=True))
```

(2) With `torch.serialization.skip_data(materialize_fake_tensors=True)`, if a FakeTensor is passed to `torch.save`, the pickler will treat these FakeTensors as being "materialized": space will be reserved in the checkpoint for the associated storage bytes, and when loading, the type will be Tensor instead of FakeTensor.

```python
import torch
import torch.nn as nn
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    m = nn.Linear(3, 5, dtype=torch.float16, device='cuda')

sd = m.state_dict()
with torch.serialization.skip_data(materialize_fake_tensors=True):
    torch.save(sd, 'bla.pt')
print(torch.load('bla.pt', weights_only=True))
# OrderedDict([('weight', tensor([[0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.]], device='cuda:0', dtype=torch.float16)), ('bias', tensor([0., 0., 0., 0., 0.], device='cuda:0', dtype=torch.float16))])

```

## Follow Ups

- [ ] `torch.load` semantic for skip_data context manager
- [ ] Mechanism for getting offsets of storages saved via this method (for writing in a separate pass)

Differential Revision: [D62238610](https://our.internmc.facebook.com/intern/diff/D62238610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134504
Approved by: https://github.com/albanD
2024-09-05 16:53:39 +00:00
dbeb8a1691 Render log filepaths that are not anchored in torch's directory in a reasonable way (#135165)
For example, if I do TORCH_LOGS=fbscribelogger I'll get:

```
I0904 17:59:07.567000 3672513 fbscribelogger/__init__.py:161] stop
```

instead of

```
I0904 12:46:15.332000 2930287 ../../../../../home/ezyang/local/a/pytorch-env/lib/python3.10/site-packages/fbscribelogger/__init__.py:161] stop
```

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135165
Approved by: https://github.com/Skylion007
2024-09-05 16:48:09 +00:00
b1f72e2984 Gradient scaler for DTensor (#132816)
Solve the request [here](https://github.com/pytorch/pytorch/issues/120003#issuecomment-2248805798).
Enable DTensor input in the gradient scaler's APIs, especially in `.unscale_()`.
A related dispatch strategy is added to accept DTensor input (a sketch of the overall scaler flow is shown after the list below).

To enable found_inf to conduct a reduce action across devices, we add an allreduce at dispatch over the args, after the dispatch strategy and kernel.
Since `aten._amp_foreach_non_finite_check_and_unscale_.default` is an in-place op, grad_scale as arg[0] will be modified in place, so redesigning a strategy or refactoring the kernel would not help.

The test files test the following under the 1-d (dp) and 2-d (dp, tp) cases:
1. whether the non-inf values are unscaled
2. whether all DTensors on each device can find inf, even when it is not on their device
3. if inf is not found, whether new parameters are generated
4. if inf is found, whether the scale is updated
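For context, a minimal sketch of the standard scaler flow these changes plug into (plain tensors shown for brevity; with this PR the same `unscale_()` path accepts DTensor gradients; assumes a CUDA device):

```python
import torch

model = torch.nn.Linear(8, 8).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
scaler = torch.amp.GradScaler("cuda")

x = torch.randn(4, 8, device="cuda")
with torch.autocast("cuda", dtype=torch.float16):
    loss = model(x).sum()
scaler.scale(loss).backward()
scaler.unscale_(opt)   # checks for inf/nan and unscales the grads in place
scaler.step(opt)
scaler.update()
```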

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132816
Approved by: https://github.com/XilunWu, https://github.com/weifengpy, https://github.com/wanchaol
2024-09-05 16:44:32 +00:00
bb3c2408f4 [inductor][test] in test_unbacked_symints, replace inductor's skipCUDAIf with common device type's skipcudaif (#133936)
Differential Revision: D61506212

Use `skipCUDAIf` from `torch.testing._internal.common_device_type` if we create the test class with `instantiate_device_type_tests`.

`instantiate_device_type_tests` would make sure the class has the attr device_type, which works with `skipCUDAIf` from `torch.testing._internal.common_device_type`.

Also skipping test_vertical_pointwise_reduction_fusion for cpu test class, since the test expects cuda.

FAILED [0.0026s] test/inductor/test_unbacked_symints.py::TestUnbackedSymintsCPU::test_vertical_pointwise_reduction_fusion_cpu - AttributeError: 'TestUnbackedSymintsCPU' object has no attribute 'device'

repro:
```
CUDA_VISIBLE_DEVICES="" pytest test/inductor/test_unbacked_symints.py -k cpu -v
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133936
Approved by: https://github.com/ColinPeppler, https://github.com/desertfire
2024-09-05 16:40:14 +00:00
2c99f17a32 Implement VariableTracker.python_type() (#134215)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134215
Approved by: https://github.com/amjames, https://github.com/jansel
2024-09-05 16:35:47 +00:00
0043dcd79e Switch torch pt2e xnnpack tests to use export_for_training (#134788)
Migrate all the callsites inside the pt2e XNNPACK tests to use export_for_training.

Differential Revision: D61994553

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134788
Approved by: https://github.com/mergennachin
2024-09-05 16:11:18 +00:00
2e2fb668fa Upgrade expecttest to 0.2.1 (#135136)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135136
Approved by: https://github.com/albanD, https://github.com/atalman, https://github.com/Skylion007
2024-09-05 16:05:35 +00:00
9d24f945ba [CI] Use larger instance for building triton whl (#135201)
CI jobs of "Build Triton Wheels" were failing due to a lack of resources. This PR uses a larger runner to avoid these issues.

The failure message is like:

```
Process completed with exit code 137.
```

Related running actions:
Failed actions: https://github.com/pytorch/pytorch/actions/runs/10714445036
Success actions: https://github.com/pytorch/pytorch/actions/runs/10716710830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135201
Approved by: https://github.com/chuanqi129, https://github.com/atalman
2024-09-05 14:36:23 +00:00
ecbd715363 [Intel GPU][Windows] Fix overriding default CMAKE_CXX_FLAGS (#135093)
The root cause is that `/EHsc` is part of the default `CMAKE_CXX_FLAGS` in CMake.
The fix is to not override the default `CMAKE_CXX_FLAGS`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135093
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-09-05 12:52:43 +00:00
58f2477a26 [Dynamo] Support builtin function frozenset (#134563)
Support builtin function frozenset in dynamo
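An illustrative example of the kind of code this enables under `torch.compile` (a sketch, not a test from this PR):

```python
import torch

@torch.compile(fullgraph=True)
def f(x):
    allowed = frozenset({"mean", "sum"})  # builtin frozenset, now traceable by Dynamo
    return x.sum() if "sum" in allowed else x.mean()

print(f(torch.ones(3)))  # tensor(3.)
```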

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134563
Approved by: https://github.com/anijain2305, https://github.com/EikanWang, https://github.com/jansel
2024-09-05 12:15:10 +00:00
43dcb4bb61 Revise CPU vectorization ISA support API (#135075)
Revising (mostly renaming) CPU vectorization ISA support API (non-frontend-user-facing). Also added AVX512_BF16 ISA detection API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135075
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/ezyang
2024-09-05 12:14:56 +00:00
50d1e37079 [AOTI] Fix a unbacked symint retrieve bug (#134670)
Summary: Fix https://github.com/pytorch/pytorch/issues/134081. When an unbacked symint is computed as the shape of a tensor from a tuple, the generated C++ code needs to use std::get<> to extract the tensor.

Differential Revision: [D62142113](https://our.internmc.facebook.com/intern/diff/D62142113)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134670
Approved by: https://github.com/angelayi, https://github.com/22quinn, https://github.com/chenyang78
2024-09-05 11:34:14 +00:00
b99ef1a02e Update torch-xpu-ops pin (ATen XPU implementation) (#135185)
Release cycle for PyTorch 2.5
1. Update specific AOT targets for Windows. On Windows, the AOT target list prefers Intel client GPUs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135185
Approved by: https://github.com/EikanWang
2024-09-05 10:05:23 +00:00
8a5c8e5db9 Update unbacked symints in masked_select more precisely (#134899)
## Summary
At the moment, the fake impl for `masked_select` simply sets the upper bound of its size-like SymInt to `sys.maxsize` (9223372036854775807, the max value for a signed 64-bit integer) if there are any SymInts in the original input tensor shape. This PR constrains the range more intelligently by using the upper ranges of each SymInt in the input tensor shape.

This solves an issue where a model being lowered to Executorch errors during memory planning because the memory allocated for `masked_select` ended up exceeding the 64-bit address space (`INT_MAX * size(dtype)`).
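A rough sketch of the bound being computed (hypothetical helper for illustration, not the fake-impl code):

```python
import math

def masked_select_numel_upper_bound(dim_upper_bounds):
    # masked_select can return at most numel(input) elements, so the product of
    # the per-dim upper bounds is a far tighter bound than sys.maxsize.
    return math.prod(dim_upper_bounds)

print(masked_select_numel_upper_bound([8, 128, 512]))  # 524288
```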

## Test plan
- Passes existing unit tests (tests case where upper bound is inf)
- Added unit test to verify upper bound reduction calculation
- Tested end-to-end by exporting with TORCH_LOGS="export" and ensuring that the range for `masked_select`'s SymInt size has the correct upper bound
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134899
Approved by: https://github.com/ezyang
2024-09-05 09:01:06 +00:00
c7328dff7f Enhance the stability of the complex divide code (#134647)
In C++, when a floating-point literal (e.g., 3.14) is compared with a variable of type float, the literal is by default interpreted as a double.
```c++
float f = 3.14f;
if (f == 3.14) {
    // Do something
}
```
If a device does not support double, an error will occur.
This PR addresses the issue of complex64 errors on machines that do not support double operations.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134647
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-09-05 08:36:37 +00:00
749dc6ceda [inductor] [cpp] use_local_acc if template_buffer_has_other_users (#135081)
Fix the compilation error of `coat_lite_mini` in timm and `YituTechConvBert` in HF:
```
/tmp/tmpuu94adg_/nf/cnf3zm677wbfjzzll522zvjp57g44udzfnj66ac2t5b2odvfqpts.cpp:239:33: error: invalid conversion from ‘const float*’ to ‘float*’ [-fpermissive]
  239 |                                 &(in_ptr2[static_cast<int64_t>(n_start + (192L*m_start) + (Nr*nci) + ((-1L)*Nr*nc))]),
      |                                 ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
      |                                 |
      |                                 const float*
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135081
Approved by: https://github.com/jgong5
ghstack dependencies: #134984
2024-09-05 08:31:31 +00:00
eaeae0ac95 [c10d] Change collective to take in a list of tensors so it work fully for all collectives (#135049)
We found that currently we only pass one input and output tensor to the function `collective`, and this causes the NaN check, work numel stats, and FR input/output sizes to be inaccurate for all-to-all, scatter, and reduce. So we want to let the collective take in a list of tensors to ensure it works for all collectives inside PGNCCL.

This partially reverts what we did in https://github.com/pytorch/pytorch/pull/119421, and down the road we will have another round of cleanup on the collective to make it cleaner. For now, at least for the sake of correctness, we changed it back.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135049
Approved by: https://github.com/kwen2501
2024-09-05 07:56:56 +00:00
5a0e7a408f restore CSE'd node metadata in runtime asserts pass (#134516)
Adds val, and optionally stack_trace & nn_module_stack metadata back to SymInt compute nodes that we CSE, with a hook on `graph.create_node()`. Not sure if there's other metadata we want to populate here?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134516
Approved by: https://github.com/ezyang
2024-09-05 07:50:04 +00:00
81a8624296 [Intel GPU] Customized XPU behaviour in indexing, group norm (#134453)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134453
Approved by: https://github.com/EikanWang, https://github.com/albanD
ghstack dependencies: #133980
2024-09-05 07:41:57 +00:00
731fd3172a [inductor] [cpp] generate reindexer for each epilogue_node (#134984)
Fixes the FP32 accuracy failure of `levit_128` in timm.

Previously, we used `Y` which is the output of the final epilogue node to calculate the reindexer. We actually need to use each epilogue node to calculate the reindexer from the GEMM output to the epilogue node.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134984
Approved by: https://github.com/jgong5
2024-09-05 07:08:31 +00:00
9d705605dd Fix decomp behaviour in export training IR (#134801)
Subset of changes in https://github.com/pytorch/pytorch/pull/132901, can't land the previous one because it is too complicated. Rest of the change will be implemented as follow up after export design meeting. This part just makes the training IR -> inference IR decomp to have the same path as normal export.

Differential Revision: [D62000525](https://our.internmc.facebook.com/intern/diff/D62000525)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134801
Approved by: https://github.com/avikchaudhuri, https://github.com/angelayi
2024-09-05 06:37:44 +00:00
05feb6e4ed [Inductor] support masked vectorization for the tail_loop for dynamic shapes (#131745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131745
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-09-05 06:17:48 +00:00
7b280c31ba [export] dynamic_shapes serialization, load/dump (#134718)
Adds utility functions `_dump_dynamic_shapes` and `_load_dynamic_shapes`.

- `_dump_dynamic_shapes`: dynamic shapes spec -> serialized format:
    - takes in the `dynamic_shapes` pytree object you'd feed into `export()`, and dumps into serialized format
- `_load_dynamic_shapes`: serialized format -> dynamic shapes spec
    - takes the serialized format, and produces a `dynamic_shapes` object you feed into `export()`

For example with dumping:
```
dx = Dim("dx", min=4, max=16)
dy = dx + 1

inputs = (
    [
        torch.randn(4, 4),
        torch.randn(5, 4),
    ],
    torch.randn(4),
    torch.randn(4, 4),
    "hello",
)
dynamic_shapes = {
    "a": [
        (dx, 4),
        (dy, 4),
    ],
    "b": (Dim.AUTO,),
    "c": None,
    "d": None,
}
out = _dump_dynamic_shapes(dynamic_shapes, inputs)
```

would generate the following output:
```
DynamicShapesSpec(
    dynamic_shapes=(
        [
            ['dx', 4],
            ['dx + 1', 4],
        ],
        ['_DimHint.STATIC'],
        ['_DimHint.STATIC', '_DimHint.STATIC'],
        None,
    ),
    dims={
        'dx': RootDim(
            min=4,
            max=16,
            derived=['dx + 1'],
        ),
    },
)
```

The serialized format contains 2 keys, `dynamic_shapes` and `dims.`
- `dynamic_shapes` is the pytree structure matching the input to `export()`, with strings in place of Dim names and enums, and ints/Nones otherwise. Each tensor is represented with a list of shapes, non-tensors with Nones.
- `dims` contain min/max range and derived dims info for each root dim.

The test cases show some roundtrippability guarantees for these functions. Definitely taking naming suggestions for them :)

Follow up: utility function to extract serializable format from ExportedProgram.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134718
Approved by: https://github.com/avikchaudhuri
2024-09-05 05:39:44 +00:00
f2a7228aed [executorch hash update] update the pinned executorch hash (#135162)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135162
Approved by: https://github.com/pytorchbot
2024-09-05 04:21:51 +00:00
8fb1281db9 [Traceable FSDP2] Skip _backward_prefetch under compile, and rely on compiler pass to have prefetching (#135163)
Before this PR, when traceable FSDP2 + AC is run, an error would be thrown:
```
  File "/data/users/willfeng/pytorch/torch/_dynamo/variables/builtin.py", line 1449, in call_getitem
    return args[0].call_method(tx, "__getitem__", args[1:], kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 435, in call_method
    return super().call_method(tx, name, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 392, in call_method
    return super().call_method(tx, name, args, kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 131, in call_method
    return self.getitem_const(tx, value)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/data/users/willfeng/pytorch/torch/_dynamo/variables/lists.py", line 106, in getitem_const
    return self.items[index]
Error: Index out of bound

from user code:
   File "<eval_with_key>.5", line 105, in forward
    aot0_trace_wrapped = torch__dynamo__trace_wrapped_higher_order_op_self_invoke(aot0_tangents_1, bw_state = aot0_primals_34);  aot0_tangents_1 = None
  File "/data/users/willfeng/pytorch/torch/_dynamo/_trace_wrapped_higher_order_op.py", line 74, in self_invoke
    return _trace_wrapped_op(*args, **dyn_kwargs, **kwargs)
  File "/data/users/willfeng/pytorch/torch/_dynamo/external_utils.py", line 132, in call_hook_from_backward_state
    return getattr(bw_state, hook_name)(*args, **kwargs)
  File "/data/users/willfeng/pytorch/torch/distributed/_composable/fsdp/_fsdp_state.py", line 271, in _pre_backward
    self._fsdp_param_group.pre_backward(default_prefetch)
  File "/data/users/willfeng/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 332, in pre_backward
    self._backward_prefetch()
  File "/data/users/willfeng/pytorch/torch/distributed/_composable/fsdp/_fsdp_param_group.py", line 417, in _backward_prefetch
    target_fsdp_param_group = self.comm_ctx.post_forward_order[target_index]
```

Since it's okay to rely on the compiler to recover the "prefetching" pattern, we will skip this `_backward_prefetch()` code path during tracing to avoid the error, and have a compiler pass (in future PR) to achieve the equivalent prefetching overlap.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135163
Approved by: https://github.com/awgu
2024-09-05 03:32:04 +00:00
a7a53b796b [Intel GPU]device guard codegen for XPU (#133980)
This PR is a supplement to #130082. The previous PR #130082 fulfilled the basic functionality of the codegen, but we found that it fails to handle the device sameness check in lots of UTs. The current PR is aimed at facilitating XPU device guard code generation.

With the current PR, the code snippet in `RegisterXPU.cpp` is as follows, where we can see the device guard is successfully generated.
```c++
namespace {
at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) {
  std::optional<Device> common_device = std::nullopt;
(void)common_device; // Suppress unused variable warning
  c10::impl::check_and_update_common_device(common_device, out, "wrapper_XPU_Tensor_float_out_normal_out", "out");
  c10::impl::check_and_update_common_device(common_device, mean, "wrapper_XPU_Tensor_float_out_normal_out", "mean");
  const OptionalDeviceGuard device_guard(device_of(out));
  return at::native::normal_out(mean, std, generator, out);
}
} // anonymous namespace
```
Nevertheless, without the current change, the generated code is:
```c++
namespace {
at::Tensor & wrapper_XPU_Tensor_float_out_normal_out(const at::Tensor & mean, double std, ::std::optional<at::Generator> generator, at::Tensor & out) {
    // No device check
  // DeviceGuard omitted
  return at::native::normal_out(mean, std, generator, out);
}
} // anonymous namespace
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133980
Approved by: https://github.com/EikanWang, https://github.com/malfet
2024-09-05 01:53:31 +00:00
30b98940b8 Fix typo in comment (#135111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135111
Approved by: https://github.com/aorenste, https://github.com/oulgen
2024-09-05 01:39:04 +00:00
724faac260 [FSDP] casting input args with dataclass(frozen=True) (#135067)
resolve: https://github.com/pytorch/pytorch/pull/135029

When enabling mixed precision, FSDP casts input args to the desired dtype by calling `_apply_to_tensors`. When an input arg is a `dataclass(frozen=True)`, we hit the following runtime error because of the use of `setattr` in `_apply_to_tensors`:

`dataclasses.FrozenInstanceError: cannot assign to field 'some_key'`. The fix is to use the dataclasses API `dataclasses.replace`.
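A minimal illustration of the difference (standalone sketch, not the FSDP code):

```python
import dataclasses

@dataclasses.dataclass(frozen=True)
class Inputs:
    x: float

inp = Inputs(x=1.0)
# setattr(inp, "x", 2.0)                    # would raise dataclasses.FrozenInstanceError
new_inp = dataclasses.replace(inp, x=2.0)   # builds a new instance with the cast value
print(new_inp)                              # Inputs(x=2.0)
```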

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135067
Approved by: https://github.com/awgu
2024-09-05 01:19:53 +00:00
04e11c7eed Update current scripts used for setting up s390x runners (#129866)
Update current scripts used for setting up s390x runners

Just a documentation update.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129866
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-09-05 01:17:54 +00:00
a3e0d4bf07 [FlexAttention] Fix mismatched backward strides for eager impl (#135152)
# Fixes:
The first repro from: https://github.com/pytorch/pytorch/issues/134888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135152
Approved by: https://github.com/Chillee
2024-09-05 01:14:53 +00:00
27d86f93fe Remove redundant code (#134955)
Remove GetPrivateUse1HooksInterface
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134955
Approved by: https://github.com/Skylion007
2024-09-05 01:11:32 +00:00
32f45f01a9 [dynamo] Retire CompileProfiler (#135133)
Fixes confusion in https://github.com/pytorch/pytorch/issues/113443

We have TORCH_LOGS that supersedes CompileProfiler

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135133
Approved by: https://github.com/ezyang
ghstack dependencies: #135039, #135121, #135129, #135130
2024-09-05 01:08:40 +00:00
4a661e089a [FR] Add version based logic to FR script and make traces print can be filtered (#135154)
This PR passes the version around so that we can have different behaviors for different versions of the FR dump. This PR also adds logic to filter traces to certain PGs (by desc) and ranks.

Some minor refactors make the names more accurate and the util functions work.

<img width="1180" alt="image" src="https://github.com/user-attachments/assets/4ef8a2d6-1296-4a45-b9a7-6d3b48fbe233">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135154
Approved by: https://github.com/wconstab
2024-09-05 00:59:32 +00:00
105ac2418c Fix binary builds artifact download (#135139)
By upgrading upload-artifacts action to v4.4.0

As the artifact store layout is different between v3 and v4 actions, artifacts uploaded by v3 cannot be downloaded by v4.

Should fix `Unable to download artifact(s): Artifact not found for name: libtorch-cpu-shared-with-deps-release`, which could be seen for example [here](https://github.com/pytorch/pytorch/actions/runs/10707740040/job/29690137218#step:7:29)

I.e. fix regression introduced by https://github.com/pytorch/pytorch/pull/135068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135139
Approved by: https://github.com/atalman, https://github.com/huydhn
2024-09-05 00:43:34 +00:00
560f449d8f Fix: use clone_preserve_strides in auto_functionalized_v2 (#135142)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135142
Approved by: https://github.com/zou3519
ghstack dependencies: #134409
2024-09-05 00:39:48 +00:00
956da79bda [CUDA][AMP] Fix autocast_dtype (#133938)
Fixes #132715

The failure in #132715 is due to `autocast_dtype` being a thread-local variable, which causes `get_autocast_dtype()` to return inconsistent values across threads.

To be exact, here is what happens: the amp dtype is set to `bfloat16` on the main thread. The `backward` call runs on a side thread, so `at::autocast::prioritize` fails because `lower_precision_fp` defaults to `float16`:
6f738d6434/aten/src/ATen/autocast_mode.h (L221-L225)

This PR makes `autocast_dtype` thread-global so it is consistent among all threads of the forward and backward passes.
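A minimal repro sketch of the scenario above (distilled from the issue, not the test added here; assumes a CUDA device):

```python
import torch

a = torch.randn(8, 8, device="cuda", requires_grad=True)
b = torch.randn(8, 8, device="cuda")
with torch.autocast("cuda", dtype=torch.bfloat16):
    out = a @ b          # forward runs in bfloat16 under autocast on the main thread
out.sum().backward()     # autograd runs this on a worker thread, which must see the same dtype
```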

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133938
Approved by: https://github.com/soulitzer
2024-09-05 00:07:32 +00:00
977a909250 [CI] Build pytorch wheel with Torch XPU Operators on Windows (#133151)
# Description
This pipeline enables the CI build on Windows for PRs labeled with ciflow/xpu. It builds the torch binary with Torch XPU Operators on Windows using Visual Studio Build Tools 2022.

# Changes
1. Install xpu batch file (install_xpu.bat) - Checks if the build machine has oneAPI in its environment and whether it is the latest version. If not, installs the latest publicly released oneAPI on the machine.
2. GHA callable pipeline (_win-build.yml) - Set vc_year and use_xpu as parameter to set build wheel environment.
3.  GHA workflow (xpu.yml) - Add a new windows build job and pass parameters to it.
4.  Build wheels script (.ci/pytorch/win-test-helpers/build_pytorch.bat) - Prepare environment for building, e.g. install oneAPI bundle.

# Note
1. For building wheels on Intel GPU, you need Visual Studio Build Tools version >= 2022
2. This pipeline requires Visual Studio Build Tools 2022 to build wheels. For now, we specify "windows.4xlarge.nonephemeral" as the build machine label in the yaml file. We will request to add self-hosted runners with Intel GPU and Visual Studio Build Tools 2022 installed soon.

Work for #114850

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133151
Approved by: https://github.com/chuanqi129, https://github.com/atalman

Co-authored-by: chuanqiw <chuanqi.wang@intel.com>
2024-09-05 00:02:46 +00:00
b3ef0c99f5 [PP] Fix zero bubble composability with DP (#134052)
Moved all the backward functions (`stage_backward_input`, `stage_backward_weight`, `stage_backward`) under the same `backward_maybe_with_nosync` function which controls the logic of the data parallel wrappers.

FSDP was not working with zero bubble PP because there will be twice as many "backward" calls and we update the weight gradients after `autograd.grad` is called. As a result, we need to manually call the FSDP `post_backward_hook()` after the weights have the correct gradients.

Fixes the tests:
`python test/distributed/_composable/test_composability/test_pp_composability.py ComposabilityTest.test_manual_with_data_parallel_dp_type_FSDP_ScheduleClass0_use_new_runtime_False`

`python test/distributed/_composable/test_composability/test_pp_composability.py ComposabilityTest.test_manual_with_data_parallel_dp_type_DDP_ScheduleClass0_use_new_runtime_False`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134052
Approved by: https://github.com/kwen2501
2024-09-04 23:46:29 +00:00
43c9b4e0e6 Fix unintentional deduplication of returned tensors (#134726)
When CSE was used, returned tensors that had gone through identical
processing steps but were distinct from a data perspective were pruned
out of the graph.  This commit protects tensors which are directly
output from being pruned, and adds a test for this behavior.
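A minimal illustration of the pattern being protected (eager code; under compilation, CSE previously collapsed the two outputs into one):

```python
import torch

def f(x):
    a = x + 1   # identical processing steps...
    b = x + 1   # ...but a and b are distinct tensors from the caller's perspective
    return a, b

a, b = f(torch.zeros(2))
a.add_(10)      # mutating one output must not affect the other
print(a, b)     # tensor([11., 11.]) tensor([1., 1.])
```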

Closes #88813 and #114344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134726
Approved by: https://github.com/amjames, https://github.com/zou3519, https://github.com/bdhirsh
2024-09-04 23:42:56 +00:00
00a8666708 [ONNX] Support output_names in dynamic_axes when dynamo=True (#135134)
Prior to this PR, if an output name appeared in dynamic_axes, it errored when we converted it to torch.export's dynamic_shapes, as we only recognized input_names.
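A sketch of the kind of call this fixes (kwargs assumed from the PR title and description, not copied from the PR's tests):

```python
import torch

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

# dynamic_axes keyed by an *output* name ("y") is now translated into the
# torch.export dynamic_shapes spec instead of erroring.
torch.onnx.export(
    M(),
    (torch.randn(2, 3),),
    "m.onnx",
    input_names=["x"],
    output_names=["y"],
    dynamic_axes={"x": {0: "batch"}, "y": {0: "batch"}},
    dynamo=True,
)
```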
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135134
Approved by: https://github.com/justinchuby
2024-09-04 23:42:13 +00:00
eqy
4f70b3cfae [CUDA][complex][TF32] Update test_noncontiguous_samples tolerances for complex64 (#134526)
A recent cuDNN heuristics change surfaces the same TF32 issue as `float32`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134526
Approved by: https://github.com/ezyang
2024-09-04 23:37:16 +00:00
359077fa43 [export] Fix indentation (#135128)
Summary: as title

Test Plan: CI

Differential Revision: D62195680

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135128
Approved by: https://github.com/tugsbayasgalan
2024-09-04 23:26:36 +00:00
9810ce9ca7 [PP] Go back to export instead of _export (#134299)
Reverts https://github.com/pytorch/pytorch/pull/130998 because FakeTensor + real device suffice to work around the autocast issue in HF.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134299
Approved by: https://github.com/lessw2020
2024-09-04 23:25:17 +00:00
804852c1f9 [dynamo] Search for _torchdynamo_inline only for functions (#135130)
Issue seen in https://github.com/pytorch/pytorch/issues/93633

Fixes https://github.com/pytorch/pytorch/issues/93633

Unable to create a testcase

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135130
Approved by: https://github.com/williamwen42, https://github.com/yanboliang
ghstack dependencies: #135039, #135121, #135129
2024-09-04 23:02:59 +00:00
13a4a0c60d [Inductor] Apply loop split optimization in codegen_node (#132389)
This PR applies a loop-split optimization in codegen_node to avoid non-contiguous loads: when a vector is loaded in a non-contiguous manner due to a division in the index, we eliminate the division by splitting the loop.

Example:
```
import torch
import torch.nn as nn

class GNReLU(torch.nn.Module):
    def __init__(self, num_groups, num_channels):
        super(GNReLU, self).__init__()
        self.gn = nn.GroupNorm(num_groups, num_channels)

    def forward(self, x):
        return torch.nn.functional.relu(self.gn(x))

input = torch.randn(2, 960, 96, 96).to(memory_format=torch.channels_last)
m = GNReLU(32, 960).eval()
compiled_m = torch.compile(m)

with torch.no_grad():
    compiled_m(input)
```

Generated code:

- Before:
```
cpp_fused_native_group_norm_relu_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/vu/cvuckxaygqfovv2zu2byqhcmiejbke7mdhf2rpgpr5mlscdev2hg.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(56)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<long>(17280L));
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L))
                        {
                            for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 16);
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                            }
                            for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 14);
                                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, 14, &wrecps0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L))
                {
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(960L); x2+=static_cast<long>(16L))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)), 16);
                        auto tmp1 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr0[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp3 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr1[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x2), 16);
                        auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x2), 16);
                        auto tmp2 = tmp0 - tmp1;
                        auto tmp4 = static_cast<float>(276480.0);
                        auto tmp5 = at::vec::Vectorized<float>(tmp4);
                        auto tmp6 = tmp3 / tmp5;
                        auto tmp7 = static_cast<float>(1e-05);
                        auto tmp8 = at::vec::Vectorized<float>(tmp7);
                        auto tmp9 = tmp6 + tmp8;
                        auto tmp10 = tmp9.rsqrt();
                        auto tmp11 = tmp2 * tmp10;
                        auto tmp13 = tmp11 * tmp12;
                        auto tmp15 = tmp13 + tmp14;
                        auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0));
                        tmp16.store(out_ptr2 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)));
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg2_1, = args
    args.clear()
    assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960))
    buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32)
    cpp_fused_native_group_norm_relu_0(arg2_1, _frozen_param3, _frozen_param2, buf0, buf1, buf3)
    del arg2_1
    return (buf3, )
```

- After:
```
cpp_fused_native_group_norm_relu_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/vu/cvuckxaygqfovv2zu2byqhcmiejbke7mdhf2rpgpr5mlscdev2hg.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(56)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<long>(17280L));
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L))
                        {
                            for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 16);
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                            }
                            for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 14);
                                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, 14, &wrecps0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L))
                {
                    #pragma GCC ivdep
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(32L); x2+=static_cast<long>(1L))
                    {
                        for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)), 16);
                            auto tmp1 = out_ptr0[static_cast<long>(x2 + (32L*x0))];
                            auto tmp4 = out_ptr1[static_cast<long>(x2 + (32L*x0))];
                            auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x3 + (30L*x2)), 16);
                            auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x3 + (30L*x2)), 16);
                            auto tmp2 = at::vec::Vectorized<float>(tmp1);
                            auto tmp3 = tmp0 - tmp2;
                            auto tmp5 = static_cast<float>(276480.0);
                            auto tmp6 = tmp4 / tmp5;
                            auto tmp7 = static_cast<float>(1e-05);
                            auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
                            auto tmp9 = 1 / std::sqrt(tmp8);
                            auto tmp10 = at::vec::Vectorized<float>(tmp9);
                            auto tmp11 = tmp3 * tmp10;
                            auto tmp13 = tmp11 * tmp12;
                            auto tmp15 = tmp13 + tmp14;
                            auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0));
                            tmp16.store(out_ptr2 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)));
                        }
                        for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)), 14);
                            auto tmp1 = out_ptr0[static_cast<long>(x2 + (32L*x0))];
                            auto tmp4 = out_ptr1[static_cast<long>(x2 + (32L*x0))];
                            auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x3 + (30L*x2)), 14);
                            auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x3 + (30L*x2)), 14);
                            auto tmp2 = at::vec::Vectorized<float>(tmp1);
                            auto tmp3 = tmp0 - tmp2;
                            auto tmp5 = static_cast<float>(276480.0);
                            auto tmp6 = tmp4 / tmp5;
                            auto tmp7 = static_cast<float>(1e-05);
                            auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
                            auto tmp9 = 1 / std::sqrt(tmp8);
                            auto tmp10 = at::vec::Vectorized<float>(tmp9);
                            auto tmp11 = tmp3 * tmp10;
                            auto tmp13 = tmp11 * tmp12;
                            auto tmp15 = tmp13 + tmp14;
                            auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0));
                            tmp16.store(out_ptr2 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)), 14);
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg2_1, = args
    args.clear()
    assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960))
    buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32)
    cpp_fused_native_group_norm_relu_0(arg2_1, _frozen_param3, _frozen_param2, buf0, buf1, buf3)
    del arg2_1
    return (buf3, )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132389
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
2024-09-04 22:42:46 +00:00
87842cc658 [dynamo][super] Corner case where the class is not present in the __mro__ (#135129)
I could not come up with a testcase. This was seen in https://github.com/pytorch/pytorch/issues/93633

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135129
Approved by: https://github.com/yanboliang
ghstack dependencies: #135039, #135121
2024-09-04 22:30:09 +00:00
d9ae92cd6e [Dynamo] Support for proxying frozen dataclasses (#134846)
Fixes https://github.com/pytorch/pytorch/issues/133858

Details: Previously, Dynamo would treat dataclasses as UserDefinedVariables. This was undesirable when we wanted to proxy the value into the graph, which is needed for TensorSubclassMetadata. To rectify this, frozen dataclasses can now be proxied similarly to NamedTuples. We require the object to be frozen because, if arbitrary mutation were allowed, we would need to replay those mutations in the graph after construction of the object.

For tracing construction of the variable, the generated `__init__` for the dataclass uses `object.__setattr__` because frozen dataclasses throw errors on the usual `__setattr__` invocation. With this treatment, no special handling is needed in dynamo for frozen dataclass construction.
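
A minimal standalone sketch of the plain-Python mechanics relied on above (nothing Dynamo-specific): the generated `__init__` of a frozen dataclass assigns fields via `object.__setattr__`, while ordinary attribute assignment raises.

```
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class Meta:
    shape: tuple
    requires_grad: bool

m = Meta(shape=(2, 3), requires_grad=False)    # generated __init__ uses object.__setattr__
try:
    m.requires_grad = True                     # normal __setattr__ is blocked
except FrozenInstanceError:
    pass
object.__setattr__(m, "requires_grad", True)   # the escape hatch the dataclass machinery uses
```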

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134846
Approved by: https://github.com/bdhirsh, https://github.com/anijain2305
2024-09-04 22:17:00 +00:00
ed06772e35 [TorchElastic] add warning when users try to pass a "use_libuv" argument to create_c10d_store (#135062)
**Summary**
Extend the warning message to be more self-explained

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135062
Approved by: https://github.com/shuqiangzhang
2024-09-04 22:05:51 +00:00
fb1c580892 [BE][optim] Make pyright recognize exported symbols (#135043)
Follows pattern introduced by https://github.com/pytorch/pytorch/pull/80955 which [pyright](https://github.com/microsoft/pyright) prefers over `__all__` symbol, see https://github.com/microsoft/pylance-release/issues/2953#issuecomment-1168956296
Fixes https://github.com/pytorch/pytorch/issues/134985

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135043
Approved by: https://github.com/janeyx99
2024-09-04 21:53:46 +00:00
2276940f8c Make Dynamo inline through torch._library.custom_ops.autograd (#135066)
Fixes https://github.com/pytorch/pytorch/issues/135057

The bug was: in the situation that Dynamo graph breaks in the forward
and Compiled Autograd uses Dynamo to introspect the backward, we end up
running into a "Unsupported: inlining through SKIPFILES" error. The
solution is to mark the entirety of this module as inlineable.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135066
Approved by: https://github.com/bdhirsh, https://github.com/williamwen42, https://github.com/yanboliang
2024-09-04 21:48:28 +00:00
4e6df83d19 [PT] Add out variant for avg_pool1d and adaptive_avg_pool1d (#135051)
Test Plan: CI

Differential Revision: D62148410

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135051
Approved by: https://github.com/SS-JIA
2024-09-04 21:20:01 +00:00
a8611da86f [dynamo][backend match] Optimize backend match for common case (#135121)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135121
Approved by: https://github.com/williamwen42
ghstack dependencies: #135039
2024-09-04 21:02:29 +00:00
09a339fc06 [Flex Attention] update __getitem__ without tree_map_only to support compile (#134627)
Adds a helper function for getting the block mask for a specific row index during decoding. We need this change to avoid the pytree + torch.compile issue #134731. Tested in gpt-fast [pr](https://github.com/pytorch-labs/gpt-fast/pull/196).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134627
Approved by: https://github.com/Chillee
2024-09-04 20:09:41 +00:00
741d52c69f Revert "Add support for 32KB multi_tensor_apply kernel arguments (#134373)"
This reverts commit 08184aa85cf183198ebdf2fd7a49fe7bc4842c13.

Reverted https://github.com/pytorch/pytorch/pull/134373 on behalf of https://github.com/drisspg due to See https://github.com/pytorch/pytorch/issues/135126 for more details ([comment](https://github.com/pytorch/pytorch/pull/134373#issuecomment-2329839011))
2024-09-04 19:44:29 +00:00
dd7cd182ab [AIInfra][DCP] All gather keys checkpoint utils bug fix (#135045)
Summary: All gather keys checkpoint utils bug fix. Dist. get_world_size should have the process group passed in to avoid inconsistent world size in case the process group has changed. This is common in the tests.

Test Plan: UTs

Reviewed By: Saiteja64

Differential Revision: D61578832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135045
Approved by: https://github.com/MeetVadakkanchery, https://github.com/LucasLLC
2024-09-04 18:49:34 +00:00
eb0fd17bc4 [Profiler] Fix Raw Metadata Iterator (#135096)
Summary:
D62008788 added an extra parameter to the RawTensorMetadata struct. For some reason this causes some corrupted accesses in other tests as described in T200685032.

Once this is removed the tests pass. Going forward we need to document how to add parameters to this portion of the code as the AppendOnlyLists seem to be very rigid.

Test Plan: Ran all the tests locally and they all passed.

Differential Revision: D62171089

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135096
Approved by: https://github.com/aaronenyeshi
2024-09-04 18:41:50 +00:00
c88c19c6de Revert "restore CSE'd node metadata in runtime asserts pass (#134516)"
This reverts commit 1dfb1052395d908ed6e67288c9357e16022da272.

Reverted https://github.com/pytorch/pytorch/pull/134516 on behalf of https://github.com/pianpwk due to breaking NestedTensor test ([comment](https://github.com/pytorch/pytorch/pull/134516#issuecomment-2329738450))
2024-09-04 18:41:21 +00:00
873abfc18e [inductor] fix compile time regression due the (disabled) loop ordering after fusion (#135071)
It's a bit surprising that the code added in Scheduler.fusable_read_and_write would increase compilation time.

Here are some numbers I got from an H100 on BertForMaskedLM:
- without the fix, cold start compilation time is around 82s
- with the fix, cold start compilation time is around 76s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135071
Approved by: https://github.com/jansel
2024-09-04 18:36:59 +00:00
d7b57c4d63 Fix tensor.data access under inference_mode and compile (#134878)
Fixes https://github.com/pytorch/pytorch/issues/134798

In the regular Tensor case, when you call Tensor.data, there's a check
for whether inference mode is active. If it is active, then we don't set the
version counter. We replicate this check for Tensor Subclasses (the bug
was we were trying to set the version counter on a FakeTensor in
inference_mode).
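
Roughly the pattern from the linked issue, as a sketch (the actual fix lives in the subclass version-counter handling, not in user code):

```
import torch

@torch.compile
def f(x):
    return x.data + 1   # .data access that used to trip the version-counter path

with torch.inference_mode():
    out = f(torch.randn(4))
```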

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134878
Approved by: https://github.com/bdhirsh
2024-09-04 17:55:41 +00:00
0d193a0adf Add ExecuTorch warning to mobile_optimizer (#134697)
Preview: https://docs-preview.pytorch.org/pytorch/pytorch/134697/mobile_optimizer.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134697
Approved by: https://github.com/ali-khosh, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-09-04 17:47:14 +00:00
193c547461 [inductor] Refactor simplify erase_nodes() (#134822)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134822
Approved by: https://github.com/shunting314
ghstack dependencies: #134748, #134749
2024-09-04 17:32:07 +00:00
2ddf3ed707 [inductor] Allow cudagraphs with unused CPU inputs (#134749)
This pattern was preventing cudagraphs from kicking in on torch_multimodal_clip, resulting in `1.6529 → 3.3471` speedup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134749
Approved by: https://github.com/shunting314
ghstack dependencies: #134748
2024-09-04 17:32:07 +00:00
cff1158200 [inductor] Pass to fix device on index(..., [iota]) (#134748)
This pattern was preventing cudagraphs from kicking in on torch_multimodal_clip, resulting in `1.6529 → 3.3471` speedup.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134748
Approved by: https://github.com/shunting314
2024-09-04 17:31:58 +00:00
7858045491 Revert "Fix set_unbacked_bindings when list of Tensors is returned (#133585)"
This reverts commit 2a49296d7563150d67bb00bd4c97bc5aafaa77df.

Reverted https://github.com/pytorch/pytorch/pull/133585 on behalf of https://github.com/ezyang due to fails torchrec tests ([comment](https://github.com/pytorch/pytorch/pull/133585#issuecomment-2329602983))
2024-09-04 17:21:32 +00:00
8759ed2ac5 Revert "Compute and do renamings even when ignoring fresh unbacked symbols (#134407)"
This reverts commit 46cb2af7d822681298370bab9d49b3cba5546dd5.

Reverted https://github.com/pytorch/pytorch/pull/134407 on behalf of https://github.com/ezyang due to need to back out https://github.com/pytorch/pytorch/pull/133585 ([comment](https://github.com/pytorch/pytorch/pull/134407#issuecomment-2329597388))
2024-09-04 17:18:21 +00:00
fc07e6bf56 Revert "Ignore fresh unbacked when doing recursive make_fx inside HOPs (#135053)"
This reverts commit a178a053ad2c8e42d1b684ed38385b9646ec3b74.

Reverted https://github.com/pytorch/pytorch/pull/135053 on behalf of https://github.com/ezyang due to need to back out https://github.com/pytorch/pytorch/pull/133585 ([comment](https://github.com/pytorch/pytorch/pull/134407#issuecomment-2329597388))
2024-09-04 17:18:21 +00:00
c8ab9b06a2 Redesign custom op functionalization for better re-inplace (#134409)
- The new implementation (auto_functionalized_v2) is enabled by default but can be disabled using an inductor flag.
- In export mode the old implementation is used.

**Motivation**
Previous functionalization fails to re-inplace arguments when they are views over other tensors;
see issue https://github.com/pytorch/pytorch/issues/131192.
The new functionalization makes re-inplacing views easier.

**A) Functionalizations pass**
consider a program:

```

def func(t):
    x = t[0]
    y = t[1]
    foo(x, y) # custom operator with x, y mutable
    return (x, y, t)
```

- To functionalize `foo` we generate a function that operates on the base tensors of the inputs (x.base() and y.base())
and record how to regenerate the views out of the base for argument x by recording ```ViewInfo=(x.base(), x.size(), x.stride(), x.storage_offset())```

- Due to some limitations of the torch.export arguments format, we have to generate a lot of arguments, but this is something we can simplify in the future. For the example above we get the following function.

   ```
   auto_functionalized = torch.ops.higher_order.auto_functionalized(torch.ops.mylib.foo.default,
     _x_base_index = 0, _x_size = (), _x_stride = (), _x_storage_offset = 0 ,
     _y_base_index = 0,_y_size = (), _y_stride = (), _y_storage_offset = 1   ,
     _all_bases = [arg0_1])
   ```
 -  In the code above:
        - _all_bases refers to a unique set of bases for all foo arguments.
        - for each argument x we have _x_base_index, _x_size, _x_stride, _x_storage_offset that can be used to regenerate x from _all_bases[_x_base_index] or from a copy of that base.

-  the output of auto_functionalized is foo's output, followed by one tensor for each base in _all_bases: a copy of the base tensor after observing the mutations of all the arguments that are views of that base.

-  for each use of a base in _all_bases (or a view of it) after the call to foo, replace it with a view of the new output

 For the function above, after functionalization we get:
 ```
    def forward(self, arg0_1: "f32[2][1]cpu"):
        auto_functionalized = torch.ops.higher_order.auto_functionalized(torch.ops.mylib.foo.default, _x_base_index = 0, _x_size = (), _x_stride = (), _x_storage_offset = 0, _y_base_index = 0, _y_size = (), _y_stride = (), _y_storage_offset = 1, _all_bases = [arg0_1])
        getitem_1: "f32[2][1]cpu" = auto_functionalized[1];  auto_functionalized = None
        copy_: "f32[2][1]cpu" = torch.ops.aten.copy_.default(arg0_1, getitem_1);  arg0_1 = copy_ = None

        # No stacktrace found for following nodes
        select_2: "f32[][]cpu" = torch.ops.aten.select.int(getitem_1, 0, 0)
        select_3: "f32[][]cpu" = torch.ops.aten.select.int(getitem_1, 0, 1);  getitem_1 = None
        return (select_2, select_3)
```

**B) Semantics of auto_functionalize**
The new semantics of auto_functionalize are as follows:
1. For each base in _all_bases, copy the base, producing one copy per base. (If a base is in-placed, we do not need to copy it.)
2. For each arg, regenerate the arg from the copy of its base using the view information above (see the sketch after this list).
3. Return the original foo output followed by the new bases.
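
A rough sketch of steps 1-2 (illustrative helper names, not the actual implementation):

```
import torch

def regenerate_args(all_bases, view_infos):
    # step 1: copy each base once (skipped when a base can be in-placed)
    base_copies = [b.clone() for b in all_bases]
    # step 2: rebuild each argument as a view over the copy of its base
    return [base_copies[idx].as_strided(size, stride, offset)
            for (idx, size, stride, offset) in view_infos]

t = torch.arange(4.0)
x, y = regenerate_args([t], [(0, (2,), (1,), 0), (0, (2,), (1,), 2)])
```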

**C) Re-inplace pass**
Since auto_functionalize copies the bases rather than the args, what we actually in-place are the bases
 (the pass runs just like before, but on the bases instead of the args).

1. For each base b in _all_bases, check if there is any use of the base (or its aliases/views) after auto_functionalize (before it is overwritten with a copy); if there is none, then in-place it (i.e., avoid copying it in step 1 above).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134409
Approved by: https://github.com/zou3519
2024-09-04 17:08:58 +00:00
195ac85fb6 [Profiler] Allow kwinputs to be non-string values (#134893)
Summary: When we process keyword arguments in the profiler today, we assume that all values will be strings. This breaks HTA because it assumes that "stream" and other similar values will be ints. To fix this, we will only put quotes around string IValues.
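
The gist of the fix, as a tiny illustrative sketch (hypothetical helper name, not the profiler's actual code):

```
def format_kwinput(value):
    # quote only string values; leave ints/bools/floats unquoted so tools
    # like HTA can parse fields such as "stream" as numbers
    return f'"{value}"' if isinstance(value, str) else str(value)

assert format_kwinput("cuda") == '"cuda"'
assert format_kwinput(7) == "7"
```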

Test Plan: Add chrome trace export in unit tests and check that stream does not have quotes around it

Differential Revision: D62056059

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134893
Approved by: https://github.com/sanrise, https://github.com/izaitsevfb
2024-09-04 16:34:10 +00:00
60dfe1b35e Fix lint after Bump actions/download-artifact update (#135109)
Fixes lint after auto-generated PR: 367a78495f

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135109
Approved by: https://github.com/ezyang, https://github.com/huydhn
2024-09-04 15:26:17 +00:00
8bfd4916d6 fast path for sympy gcd in floordiv (#134880)
Summary:
Re-implementation of https://github.com/pytorch/pytorch/pull/134150, which was reverted because of some internal tests hanging (case B). The original motivation was to get some other internal test unstuck (case A).

The root cause is that sympy.gcd is both very clever and prone to blowing up in some cases. This PR introduces a fast path, with an appropriate fallback to sympy.gcd, that ensures both cases A and B go through.
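
A sketch of the idea (not the PR's actual code): decide the trivial integer-only case with `math.gcd` and only fall back to the general, potentially slow `sympy.gcd`.

```
import math
import sympy

def quick_gcd(a, b):
    # fast path: plain integer operands never need the symbolic machinery
    if a.is_Integer and b.is_Integer:
        return sympy.Integer(math.gcd(int(a), int(b)))
    return sympy.gcd(a, b)  # fallback: full symbolic gcd

x = sympy.Symbol("x", integer=True, positive=True)
print(quick_gcd(sympy.Integer(12), sympy.Integer(18)))  # 6, via math.gcd
print(quick_gcd(6 * x, sympy.Integer(4)))               # 2, via the sympy fallback
```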

Test Plan:
See the included test for specific examples.
Also https://fb.workplace.com/groups/1075192433118967/posts/1491493248155548/?comment_id=1491938994777640&reply_comment_id=1492622821375924

Differential Revision: D62043315

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134880
Approved by: https://github.com/ezyang
2024-09-04 14:56:49 +00:00
67208f08bd [CD] Enable XPU nightly build on Windows (#134312)
Depends on https://github.com/pytorch/builder/pull/1975 land. Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134312
Approved by: https://github.com/atalman
2024-09-04 14:46:36 +00:00
6c5669903f Fix Invalid NaN comparison due to infinity-zero multiply on latest sympy (#135044)
Fixes https://github.com/pytorch/pytorch/issues/133735

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135044
Approved by: https://github.com/zou3519
2024-09-04 14:13:09 +00:00
a178a053ad Ignore fresh unbacked when doing recursive make_fx inside HOPs (#135053)
Internal xref: https://fb.workplace.com/groups/6829516587176185/posts/7705964779531357/

I'm not sure this is the right approach though...

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135053
Approved by: https://github.com/ydwu4
ghstack dependencies: #134407
2024-09-04 13:25:08 +00:00
46cb2af7d8 Compute and do renamings even when ignoring fresh unbacked symbols (#134407)
This is a bit twisty and I don't entirely understand the situation, but here's my best explanation.

In https://github.com/pytorch/pytorch/pull/133588 I am trying to fix a problem reported by user in https://fb.workplace.com/groups/6829516587176185/permalink/7705964779531357/ The summary of this problem is that when we do collect metadata analysis in AOTAutograd, we accumulate pending unbacked symbols which are going to be discarded at the end of the trace. However, if we do a recursive make_fx inside tracing, as occurs with torch.cond, we end up seeing that there are pending unbacked symbols that aren't associated with a binding, even though it's spurious (they've leaked into the inner make_fx call from the outer AOTAutograd analysis).

In #133588 I tried to just prevent adding the symbols to the pending list at all in the first place. But this itself caused some problems which were fixed in https://github.com/pytorch/pytorch/pull/124785 . The problem fixed in that PR is that when we allocate tangents that have unbacked size, something prevented them from having correct unbacked SymInts when ignore fresh unbacked SymInts was enabled. So I had patched it at the time by just not suppressing pending symbols and clearing them out some other way.

I think... I was wrong in that PR? That is to say, it was OK to avoid putting the fresh unbacked symbols in the pending list; the real problem was suppressing unbacked renamings. But there doesn't seem to be a good reason to suppress these; this PR shows that it doesn't actually fail any tests if you do these anyway. Intuitively, this makes sense, because you can't trigger renamings unless you're actually adding unbacked symbols to the pending set.

But I don't entirely understand all the interactions. I just know that this seems to not cause tests to fail, and it should fix the internal issue (which I need to add a UT for.)

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134407
Approved by: https://github.com/ydwu4
2024-09-04 13:25:07 +00:00
5690f003a6 C10_DIAGNOSTIC_PUSH_AND_IGNORED_IF_DEFINED and C10_DIAGNOST should be used in pairs (#135004)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135004
Approved by: https://github.com/aaronenyeshi
2024-09-04 13:14:23 +00:00
dcf05fcb14 Fix stale job using non-existent ARC runner (#134863)
The ARC CI system has been shutdown so this job is currently using a runner that doesn't exist.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134863
Approved by: https://github.com/ZainRizvi
2024-09-04 12:57:10 +00:00
a8467c17c3 Remove specific lazy initialization of PrivateUse1 (#135002)
As the title stated, lazy initialization of PrivateUse1 can been
removed because maybe_initialize_device have supported PrivateUse1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135002
Approved by: https://github.com/albanD
2024-09-04 11:45:45 +00:00
80a6d60829 Moving _run_autocast_outofplace to a basic class named TestAutocast to reduce redundancy (#134460)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134460
Approved by: https://github.com/EikanWang, https://github.com/ezyang
2024-09-04 10:48:58 +00:00
c2ff9fe042 [fp8 rowwise] Retune the tile heuristics to increase perf (#134781)
I propose a new heuristic function to select the tile size, cluster size, and transposition given M, N and K. It improves the performance across the board (on average) while remaining simple and relying only on a handful of kernels (to limit build time and binary size).

Across the shapes I benchmarked, the new heuristic gives a (geometric) mean speedup of +16.5%. Some shapes worsen, but 98.6% of the shapes retain their old performance (up to 5% to allow for noise) or improve it.
![image](https://github.com/user-attachments/assets/bca30583-ac32-4af6-a4f9-37164bdb2430)

I benchmarked on over 5.4k different shapes:
- For M and N I swept across all values which are the sums of two powers of 2 (limited to multiples of 64, capped at 16,384)
- For K I only used powers of 2 between 1,024 and 8,192 (based on the intuition that the optimal config doesn't depend on K, which turned out to be the case)

Here's the detailed speedup for each shape
![image](https://github.com/user-attachments/assets/acac4318-9ee0-455d-861b-c764b8c13d22)

<details>
<summary>
This is the code I used to benchmark
</summary>

```
import torch
import torch.utils.benchmark

s = set()

for i in range(6, 15):
    s.add(2**i)
    for j in range(6, i):
        s.add(2**i + 2**j)

ms = [i for i in sorted(s) if i <= 2**14]
ns = [i for i in sorted(s) if i <= 2**14]
ks = [2**i for i in range(10, 14)]

def make_graph(n_iters, f):
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g):
        for _ in range(n_iters):
            f()
    return g

def rowwise_scale(t, dtype_t):
    min_v, max_v = torch.finfo(dtype_t).min, torch.finfo(dtype_t).max
    scale_t = torch.clamp(t.abs().amax(dim=-1, keepdim=True).float(), min=1e-12) / max_v
    t_fp8 = (t / scale_t).clamp(min=min_v, max=max_v).to(dtype_t)
    return t_fp8, scale_t

for m in ms:
    for n in ns:
        for k in ks:
            a = torch.randn((m, k), device="cuda", dtype=torch.float)
            b_t = torch.randn((n, k), device="cuda", dtype=torch.float)
            a_fp8, scale_a = rowwise_scale(a, torch.float8_e4m3fn)
            b_t_fp8, scale_b_t = rowwise_scale(b_t, torch.float8_e4m3fn)
            func = lambda: torch._scaled_mm(
                a_fp8,
                b_t_fp8.t(),
                scale_a=scale_a,
                scale_b=scale_b_t.t(),
                bias=None,
                use_fast_accum=True,
                out_dtype=torch.bfloat16
            )
            print(f"{m=},{n=},{k=}")
            print(torch.utils.benchmark.Timer("g.replay()", globals={"g": make_graph(1000, func)}).blocked_autorange(min_run_time=1).mean / 1000)
```
</details>

<details>
<summary>
This is the code I used for the plots
</summary>

```
from itertools import islice

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from matplotlib.cm import ScalarMappable
from matplotlib.colors import FuncNorm
from mpl_toolkits.axes_grid1 import ImageGrid

def batched(iterable, n):
    iterator = iter(iterable)
    while batch := tuple(islice(iterator, n)):
        yield batch

def try_to_convert(v):
    if v == "False":
        return False
    if v == "True":
        return True
    return int(v)

def get_from_paste(filename):
    text = open(filename, "rt").read()
    headers = []
    data = []
    for config, value in batched(text.splitlines(), 2):
        config_elems = config.split(",")
        if not headers:
            headers = [e.partition("=")[0] for e in config_elems]
        data.append((*(try_to_convert(e.partition("=")[-1]) for e in config_elems), float(value)))
    return pd.DataFrame(data, columns=headers + ["latency"])

old_latencies = get_from_paste(...)
new_latencies = get_from_paste(...)

ratios = pd.merge(new_latencies, old_latencies, how="left", left_on=["m", "n", "k"], right_on=["m", "n", "k"], suffixes=("_new", "_old"))
ratios = ratios.assign(ratio=ratios.latency_old / ratios.latency_new)

fig = plt.figure(figsize=(40.0, 10.0))
grid = ImageGrid(
    fig,
    111,
    nrows_ncols=(1, 4),
    axes_pad=0.5,
    share_all=True,
    cbar_location="right",
    cbar_mode="single",
    cbar_size="7%",
    cbar_pad=0.15,
)

log_amax = np.max(np.abs(np.log(ratios.ratio.to_numpy())))

for K, ax in zip([1024, 2048, 4096, 8192], grid):
    pivoted = ratios[(ratios.k == K)].pivot_table(index="m", columns="n", values="ratio")
    im = ax.imshow(np.log(pivoted.to_numpy()), origin="lower", vmin=-log_amax, vmax=log_amax, cmap="PiYG")
    m_vals, n_vals = pivoted.axes
    ax.set_xticks(np.arange(len(n_vals)), labels=[f"N={i}" for i in n_vals.values], fontsize=12)
    ax.set_yticks(np.arange(len(m_vals)), labels=[f"M={i}" for i in m_vals.values], fontsize=12)
    plt.setp(ax.get_xticklabels(), rotation=90, ha="right", rotation_mode="anchor")
    ax.grid(False)
    ax.set_title(f"K={K}", fontsize=20)

norm = FuncNorm((lambda x: np.log(x), lambda x: np.exp(x)), np.exp(-log_amax), np.exp(log_amax))
ax.cax.colorbar(ScalarMappable(norm=norm, cmap="PiYG"))
plt.show()

counts, bins = np.histogram(np.log(ratios.ratio.to_numpy()), bins=500)
plt.stairs(counts, np.exp(bins), fill=True)
plt.xscale("function", functions=(lambda x: np.log(x), lambda x: np.exp(x)))
```
</details>

I only benchmarked fast_accum=True and out_dtype=torch.bfloat16, supposing that these are the most commonly used flags (e.g., with fast_accum=False row-wise scaling is much slower than tensor-wise scaling and hence impractical).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134781
Approved by: https://github.com/drisspg, https://github.com/eqy
ghstack dependencies: #134773
2024-09-04 09:17:28 +00:00
eec8fa038e [fp8 rowwise] Support transposing operands in order to change output layout (#134773)
On some occasion, a column-major output layout is more efficient (it's unclear if it's because of better store coalescing for some tile shapes, or whether it's just that it's CUTLASS's default and thus it's better optimized).

At this stage I only add a flag that allows to transpose, but the hardest will be deciding on a new heuristic to turn it on selectively. This will be in a follow-up PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134773
Approved by: https://github.com/drisspg
2024-09-04 09:17:28 +00:00
679b8fe426 Update generate-xnnpack-wrappers.py parsing to handle build identifier (#134724)
Fixes an issue after updating XNNPACK where parsing the XNNPACK CMakeLists breaks. I'm just ignoring the generated build identifier for now, since it's not used and we would need to update the buck build to generate it at build time.

Remove unused ukernels_xop XNNPACK target as it has no sources (after the recent update) and causes buck1 to complain.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134724
Approved by: https://github.com/mcr229
2024-09-04 08:45:46 +00:00
1dfb105239 restore CSE'd node metadata in runtime asserts pass (#134516)
Adds val, and optionally stack_trace & nn_module_stack metadata back to SymInt compute nodes that we CSE, with a hook on `graph.create_node()`. Not sure if there's other metadata we want to populate here?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134516
Approved by: https://github.com/ezyang
2024-09-04 05:56:28 +00:00
9f00317997 rationalize STATIC vs. None (#134877)
Summary:
A bit of refactoring to prepare to remove `None` as a way to specify static dimensions in dynamic shapes, given we already have `Dim.STATIC` for the same purpose. We will now warn whenever this happens. However, no tests were modified, because problematic uses of `None` still need to behave as they do today until we are ready to remove support. It should be easy to port tests by changing the warning function to raise instead.

Note that other uses of `None`, such as for entire values (tensor or non-tensor) remain as is. Moving forward this should be the only purpose of `None` (at least externally).

Finally, there's a bit of confusion in our representation now because `AUTO` also internally transforms to `None`. Renamed dynamic_shapes to transformed_dynamic_shapes where this happens. Overall the two forms (pre and post transformation) have different properties so should probably not be represented in the same format in the future.
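
For illustration, a minimal sketch of the two spellings of a static dimension discussed above (after this change, `Dim.STATIC` is the preferred form and `None` warns):

```
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

x = torch.randn(4, 8)
# dim 0 is dynamic; dim 1 is explicitly static via Dim.STATIC rather than None
ep = export(M(), (x,), dynamic_shapes={"x": {0: Dim("batch"), 1: Dim.STATIC}})
```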

Test Plan: existing

Differential Revision: D62040729

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134877
Approved by: https://github.com/pianpwk
2024-09-04 05:34:26 +00:00
9809080b9e [Reland] Refactor caching device allocator utils (#130923)
# Motivation
Following [[RFC] Intel GPU Runtime Upstreaming for Allocator ](https://github.com/pytorch/pytorch/issues/116322), this PR aims to refactor the caching device allocator utils to improve code reuse.
This is the first PR, we could prepare some follow-up PRs continuing to refactor the device caching allocator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130923
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD, https://github.com/eqy
2024-09-04 05:31:08 +00:00
6448d351db [inductor] clean up cpp_builder code. (#134909)
Clean up duplicated code in cpp_builder.

Hi @henrylhtsang, could you please help land this internally?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134909
Approved by: https://github.com/henrylhtsang
2024-09-04 05:29:08 +00:00
2c9b4d2052 [executorch hash update] update the pinned executorch hash (#135077)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135077
Approved by: https://github.com/pytorchbot
2024-09-04 05:17:29 +00:00
6b05aafc57 Add specializations for VecMaskLoad and VecMaskCast (#126501)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126501
Approved by: https://github.com/jgong5, https://github.com/jansel
ghstack dependencies: #126500
2024-09-04 05:12:52 +00:00
ffd1e214df Back out "[FSDP2] Set ctx.set_materialize_grads(False) for post-backward (#133498)" (#135059)
Summary:
Original commit changeset: 96513cbc425f

Original Phabricator Diff: D61291210

There is some evidence that FB-FM-v4 has better NE with ctx.set_materialize_grads(False) set, especially when paired with prefetching.

See https://www.internalfb.com/intern/anp/view/?id=5732259

Test Plan:
export NUM_WORKERS=128
export BATCH_SIZE=1024
export CONFIG_FILE="mast_joint_arch_exploration_cmf_updated_fbfm_v3_fsdp2.yaml"

export ENTITLEMENT=ads_global_tc_2k_training_large_short
buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher -c fbcode.platform010_cuda_version=12 -c hpc_comms.use_nccl=2.17.1 -- mode=${CONFIG_FILE} launcher.tags='[ads_ranking_taxonomy_monetization_genai]' launcher.data_project=pytorch_at_scale launcher.max_retries=10 launcher.fbl_entitlement=${ENTITLEMENT} launcher.oncall=pytorch_training_enablement launcher.hardware=GRANDTETON launcher.num_workers=${NUM_WORKERS} data_loader.dataset.batch_size=${BATCH_SIZE} training.planner.proposer=dynamic_col_dim training.planner.proposer.optim_target=hbm 2>&1 | tee ~/tmp/log.mast

Differential Revision: D62009163

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135059
Approved by: https://github.com/awgu
2024-09-04 04:50:32 +00:00
cyy
c818ecd169 Remove Caffe2 code from tool scripts (#134941)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134941
Approved by: https://github.com/ezyang
2024-09-04 03:47:58 +00:00
9e6f4f3f77 [dynamo] Use __eq__ for backend match (#135039)
Fixes https://github.com/pytorch/pytorch/issues/131150

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135039
Approved by: https://github.com/jansel
2024-09-04 03:35:18 +00:00
367a78495f Bump actions/download-artifact from 2 to 4.1.7 in /.github/workflows (#135068)
Bumps [actions/download-artifact](https://github.com/actions/download-artifact) from 2 to 4.1.7.
- [Release notes](https://github.com/actions/download-artifact/releases)
- [Commits](https://github.com/actions/download-artifact/compare/v2...v4.1.7)

---
updated-dependencies:
- dependency-name: actions/download-artifact
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-09-03 20:33:57 -07:00
362ecd9817 [inductor] Skip the sub-process pool until it's ready (#133508)
Summary: Torch-compiling a quick script can be a bit slower than it needs to be: even though we initialize the subprocess pool early, it still might not be ready by the time we try to compile the first Triton kernel. Instead, let's use the single-threaded path until the pool has successfully completed a no-op job.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133508
Approved by: https://github.com/Chillee
2024-09-04 03:26:55 +00:00
7600e9b36f [ONNX] Use the stable APIs in onnxscript and sync the latest logic (#134782)
Use the stable apis from onnxscript: https://github.com/microsoft/onnxscript/issues/1827
Sync with torch-onnx at https://github.com/justinchuby/torch-onnx/compare/v0.1.12...v0.1.15.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134782
Approved by: https://github.com/titaiwangms
2024-09-04 03:10:20 +00:00
982e27e532 [halide-backend] Update CI pin (#130258)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130258
Approved by: https://github.com/eellison
2024-09-04 03:08:49 +00:00
ae3aa8ff73 [AOTI][Tooling][5/n] Refactor the debug printer call to a level lower (#134789)
Summary:
1. Move the debug printer call a level lower, to here: https://www.internalfb.com/code/fbsource/[931d7bbb9e7cf2dcb926f42718f56fc940903eec]/fbcode/caffe2/torch/_inductor/codegen/cpp_wrapper_cuda.py?lines=335
2. Add UT for validating debug printer for user defined triton kernel codegen

The benefit of having the debug printer call happen in a more centralized place is that it 1) reduces the duplicated debug-printer logic scattered throughout the codebase, and 2) can handle more Triton kernel codegen paths: as long as a path invokes `generate_kernel_call()`, it automatically handles/supports, for example, a user_defined_kernel's debug printing, which is a pretty common use case we encounter in debugging.

Test Plan:
```AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=2 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_user_defined_triton_kernel_abi_compatible_cuda```

Also verified that templateKernel codegen path still works

Differential Revision: D61949020

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134789
Approved by: https://github.com/ColinPeppler
2024-09-04 02:41:30 +00:00
ea89f01281 Remove unused comment (#135034)
As part of my rampup I've been reading through some of @ezyang's diffs. I noticed in https://github.com/pytorch/pytorch/pull/133439 there was a comment that he forgot to remove. This diff removes that comment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135034
Approved by: https://github.com/albanD
2024-09-04 02:32:26 +00:00
175485097a [EASY] Typofix (#135022)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135022
Approved by: https://github.com/albanD
2024-09-04 01:59:40 +00:00
15c25c4580 Fix dim mismatch logic automatic dynamic not working with compiler collectives (#135025)
Fixes
https://fb.workplace.com/groups/3095840833991792/permalink/3810738595835342/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135025
Approved by: https://github.com/albanD
2024-09-04 01:50:21 +00:00
4ebf6b04a8 Turn on expanded index path for Half on CPU (#133553)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133553
Approved by: https://github.com/yanbing-j, https://github.com/jgong5, https://github.com/peterbell10
2024-09-04 00:56:56 +00:00
e000cf0ad9 Fix license metadata in setup.py (#129219)
Package metadata in setup.py lists the license as BSD-3, which is not a valid SPDX id. The correct id is BSD-3-Clause.

Specifying an SPDX id is beneficial to license compliance scanning.
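
For reference, the relevant metadata field looks like this (illustrative snippet, other arguments omitted):

```
from setuptools import setup

setup(
    name="example-package",
    version="0.0.1",
    license="BSD-3-Clause",  # valid SPDX identifier; plain "BSD-3" is not
)
```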

*Taking up #129123 from my personal account.*
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129219
Approved by: https://github.com/malfet, https://github.com/kit1980
2024-09-04 00:21:22 +00:00
45743019cf [PT2][Optimus] Skip meta update on symbolic shape (#134975)
Summary: We noticed that there will be a runtime error when doing the dim broadcast if the meta example value has a symbolic shape, so we skip it.

Test Plan:
```
buck2 run mode/opt //caffe2/benchmarks/dynamo/fb:torchbench_run_ads_dhen_5x_training -- -m ads_dhen_5x -t training
```

P1559019921

Differential Revision: D62115015

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134975
Approved by: https://github.com/xuzhao9
2024-09-04 00:05:51 +00:00
9ffcca7060 [Profiler] Handle Tensor Sizes/Strides Parsing Error (#134862)
Summary:
Currently some jobs are encountering the following trace, P1539415198. This suggests that when we are parsing through tensors, the path is prone to encountering an invalid address. This is possibly occurring because, for some reason, the sizes() and strides() of a Tensor are not of the same dimensions; we assume they are when iterating through the shapes to get the IValue generator. When browsing some of the tensor implementations, I found that some of the size and stride paths are different, which could be the cause of this issue. Regardless, the profiler should be flexible enough to handle such issues without bringing down the whole main thread.

If the crashes still persist, this will still give us a data point as to where they are occurring, and we can rule out the strides/sizes as the culprit.

Test Plan: This change doesn't break anything in the happy path, just makes sure the bad path is not exited abruptly. We should use this in order to debug what the events are having mismatching dimensions between sizes and strides.

Differential Revision: D62008788

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134862
Approved by: https://github.com/aaronenyeshi
2024-09-03 23:46:38 +00:00
f05b716d6d Add validator to ensure runner determinator script is kept in sync (#134800)
We keep two copies of the runner-determinator script:
1. In runner_determinator.py, for ease of testing.  This however is not actually executed during CI
2. Embedded in _runner-determinator.yml.  This is what CI uses.

Why the duplication? Short version: Because of how github CI works, during a given CI run the workflow yml files could actually come from the main branch, while the remaining files get read from the local commit.
This can lead to a newer version of _runner-determinator.yml trying to invoke an older version of runner_determinator.py than it was actually designed for. Chaos ensues.

We mitigate this by embedding the script into the yml file.  But we still keep the script around because it's much easier to run tests against.

This workflow's job is to ensure that if one edits the script in one of those two locations then they remember to update it in the other location as well
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134800
Approved by: https://github.com/zxiiro, https://github.com/PaliC
ghstack dependencies: #134796
2024-09-03 23:29:04 +00:00
469429b959 Refactor runner determinator (#134796)
Some minor refactorings to make the code easier to parse and easier to add unit tests for.  Keeping this as a separate PR for ease of review, since it should have zero functional behavior changes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134796
Approved by: https://github.com/zxiiro, https://github.com/PaliC
2024-09-03 23:29:04 +00:00
c044deb9ce Revert "c10d/logging: add C10D_LOCK_GUARD (#134131)"
This reverts commit f33bcbe5fd67e6b18be259ad2f0dc11c74157075.

Reverted https://github.com/pytorch/pytorch/pull/134131 on behalf of https://github.com/kit1980 due to See D61985186 ([comment](https://github.com/pytorch/pytorch/pull/134131#issuecomment-2327556381))
2024-09-03 22:35:14 +00:00
2fd36086bc Revert "Add torch.serialization.skip_data context manager (#134504)"
This reverts commit 94db935749b8de99d8c3ab23fb880c67c8f3e67a.

Reverted https://github.com/pytorch/pytorch/pull/134504 on behalf of https://github.com/kit1980 due to See D62082697 ([comment](https://github.com/pytorch/pytorch/pull/134504#issuecomment-2327542276))
2024-09-03 22:21:27 +00:00
85fa019697 [Docs] Fix call to deprecated function (#135037)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135037
Approved by: https://github.com/janeyx99, https://github.com/jbschlosser
2024-09-03 20:57:11 +00:00
14c8ef5198 autolabel aotinductor->export (#135040)
"module: aotinductor" will automatically add "oncall: export".

Test Plan:
- none
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135040
Approved by: https://github.com/ydwu4
2024-09-03 20:17:51 +00:00
c40e622966 [inductor] add openmp config for intel compiler on Linux. (#134973)
Configure `openmp` for the Intel Compiler on Linux.

Based on this PR, we can confirm that the Intel optimized libraries are built and work well.
<img width="1039" alt="image" src="https://github.com/user-attachments/assets/838d5114-c778-4961-9cfe-39a814647089">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134973
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-03 20:10:21 +00:00
272f3b9fe1 [FlexAttention] Update tolerance for failing test (#135035)
Summary: Address: T198937061

Test Plan: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:flex_attention -- --exact 'caffe2/test/inductor:flex_attention - test_no_q_info_compile_False (caffe2.test.inductor.test_flex_attention.TestBlockMask)' --run-disabled

Differential Revision: D62137797

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135035
Approved by: https://github.com/Chillee
2024-09-03 20:09:21 +00:00
e7731b3f8a [TorchElastic] make torch elastic not have to realize TCPStore backend type and rely on c10d to decide which backend to use (#134882)
D53335860 and D56435815 added an option to torch elastic allowing users to choose a TCPStore backend type to use via
1) explicit argument passing in user code when instantiating `MastRendezvousHandler`
2) pass `--use_libuv` command line argument to `torchrun`.

The motivation was to offer a quick way to roll back to the non-libuv TCPStore backend since we were making libuv the default in `c10d` code. Now we think it's better for torch elastic to not be aware of the TCPStore backend type and instead rely on `c10d`'s mechanism to decide which backend to use for torch elastic as well. This way, the TCPStore backend type used by torch elastic will be identical to the one used in PyTorch.

PyTorch TCPStore uses the environment variable `USE_LIBUV` to determine the backend type:
when `USE_LIBUV="0"`, the non-libuv backend will be used.
when `USE_LIBUV="1"`, the libuv backend will be used. And this is the default option.

Differential Revision: [D58259590](https://our.internmc.facebook.com/intern/diff/D58259590/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134882
Approved by: https://github.com/shuqiangzhang
2024-09-03 19:43:21 +00:00
71383dd3da [MPS] Fix batchnorm_2d for channels last (#134618)
By skipping the gather of the input tensor if the memory layout is channels_last, which is a first step towards fixing https://github.com/pytorch/pytorch/issues/134580

Though the underlying problem is much more interesting, i.e. MPS does not have generic support for channels last, but `c10::is_contiguous()` is true for the channels-last layout.
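
Roughly the failing pattern from the linked issue, as a sketch (requires an MPS device); with this change the channels-last input is no longer gathered:

```
import torch

bn = torch.nn.BatchNorm2d(3).to("mps")
x = torch.randn(2, 3, 8, 8, device="mps").to(memory_format=torch.channels_last)
y = bn(x)
```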

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134618
Approved by: https://github.com/albanD
2024-09-03 19:20:11 +00:00
758d787901 Added complex support for torch.logsumexp (#133187)
Added complex support for `torch.logsumexp`. Implemented complex backward pass for `torch.logsumexp`.
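
A minimal usage sketch after this change (complex inputs previously errored):

```
import torch

z = torch.randn(4, dtype=torch.complex64, requires_grad=True)
out = torch.logsumexp(z, dim=0)   # complex input now supported
out.real.backward()               # exercises the new complex backward pass
```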

Fixes #133047

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133187
Approved by: https://github.com/amjames, https://github.com/lezcano
2024-09-03 17:28:36 +00:00
6c3767452d Move auto functionalize tests in their own test file (#134834)
title + use `with torch.library._scoped_library as lib` when needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134834
Approved by: https://github.com/zou3519
ghstack dependencies: #134831
2024-09-03 17:09:03 +00:00
2e0b114c06 add a new Gauge API with an empty backend to PyTorch core (#134883)
Summary:
The current use case is to continuously measure the total allocated and reserved CUDA memory size from CUDACachingAllocator, and export their distribution (min, max, p90 etc) over time as timeseries.

The current callback-based API does not work because the backend decides when the measurement is taken, so data points between two measurements may not be recorded. The distribution (e.g. max) as such will not be accurate.

This new API closely follow the design of the existing WaitCounter API otherwise.

This is not quite a synchronous version of DynamicCounter, as summing multiple data points does not make sense for my use case.

Test Plan: CI

Differential Revision: D61837528

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134883
Approved by: https://github.com/c-p-i-o
2024-09-03 17:08:47 +00:00
7804c089c6 [BE] Update numpy version to 2.0.2 (#134875)
It's long past time to abandon the pre-release version

Partially addresses https://github.com/pytorch/pytorch/issues/134868
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134875
Approved by: https://github.com/justinchuby, https://github.com/clee2000, https://github.com/kit1980, https://github.com/atalman, https://github.com/Skylion007
2024-09-03 17:02:35 +00:00
1b9f51bd88 [ONNX] Bump onnxscript version in CI; temporarily remove op test (#133748)
Bump onnxscript version in CI to 0.1.0.dev20240831, and temporarily remove the fx consistency test. We will add a better version back later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133748
Approved by: https://github.com/titaiwangms
2024-09-03 16:30:07 +00:00
27677ead7c Revert "[ONNX] Bump onnxscript version in CI; temporarily remove op test (#133748)"
This reverts commit 6eed63c8b9c4f54a573bb51960d252cd42bfab0c.

Reverted https://github.com/pytorch/pytorch/pull/133748 on behalf of https://github.com/ZainRizvi due to The version bump appears to be pulling in an unavailable numpy version? [GH job link](https://github.com/pytorch/pytorch/actions/runs/10686076754/job/29620426371) [HUD commit link](6eed63c8b9) ([comment](https://github.com/pytorch/pytorch/pull/133748#issuecomment-2326932868))
2024-09-03 16:19:47 +00:00
a258844a32 Properly handle empty CPUINFO variable (#134916)
Fixes https://github.com/pytorch/pytorch/issues/134915

But I did not root cause why CPUINFO is totally empty to begin with...

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134916
Approved by: https://github.com/Skylion007
2024-09-03 15:59:59 +00:00
f927bcb934 Revert "[Inductor] Apply loop split optimization in codegen_node (#132389)"
This reverts commit 3cb5d251224b3fb59b5a10c6fefbb4c84eb565a6.

Reverted https://github.com/pytorch/pytorch/pull/132389 on behalf of https://github.com/ZainRizvi due to Hi, this seems to be breaking in trunk. See test_dataloader.py::TestDataLoader::test_segfault [GH job link](https://github.com/pytorch/pytorch/actions/runs/10660461216/job/29556282081) [HUD commit link](de3a641476) ([comment](https://github.com/pytorch/pytorch/pull/132389#issuecomment-2326843129))
2024-09-03 15:40:45 +00:00
6eed63c8b9 [ONNX] Bump onnxscript version in CI; temporarily remove op test (#133748)
Bump onnxscript version in CI to 0.1.0.dev20240831, and temporarily remove the fx consistency test. We will add a better version back later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133748
Approved by: https://github.com/titaiwangms
2024-09-03 15:33:09 +00:00
33ba952e31 [subclasses] Do not fakeTensor const prop subclass args (#134855)
The issue:

Const propagation only checks that arguments do not contain a FakeTensor. If an argument is a Subclass, it passes this condition.

As a result, const propagation executes without FakeTensorMode, and tensor factories inside Subclass.__torch_dispatch__ produce tensors that are not fakified.

Solution:

If any of the arguments are subclasses, do not consider const propagation doable.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134855
Approved by: https://github.com/zou3519
2024-09-03 13:31:49 +00:00
2a49296d75 Fix set_unbacked_bindings when list of Tensors is returned (#133585)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133585
Approved by: https://github.com/albanD
2024-09-03 12:23:31 +00:00
2443507acc Update torch-xpu-ops pin (ATen XPU implementation) (#134983)
Release cycle for PyTorch 2.5
1. Enable Windows build in latest torch-xpu-ops. Resolved large bin issue.
2. Refine test infrastructure for compatibility on different HW platforms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134983
Approved by: https://github.com/EikanWang
2024-09-03 12:14:37 +00:00
39935e0fde Update cpuinfo submodule (#134891)
Last time it was done in June by https://github.com/pytorch/pytorch/pull/127505
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134891
Approved by: https://github.com/Skylion007
2024-09-03 09:29:59 +00:00
23a2161ad1 Changed addmv to be a decomposition and not a fallback (#134823)
Overall seems to be faster

![image](https://github.com/user-attachments/assets/0cbea76e-fb78-4634-9265-047de0291549)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134823
Approved by: https://github.com/jansel
ghstack dependencies: #134813, #134818, #134819
2024-09-03 06:33:31 +00:00
9856bc50a2 Switch nanmedian to not cuda synchronize (#134819)
Generally, this seems to be faster.

![image](https://github.com/user-attachments/assets/43a86c6f-236d-4ba1-aae0-14e3d88ae401)

And as an added benefit, it works great with cudagraphs and such :)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134819
Approved by: https://github.com/Skylion007, https://github.com/eqy
ghstack dependencies: #134813, #134818
2024-09-03 06:33:31 +00:00
6fce1faa10 change multinomial to use async asserts instead of a synchronization (#134818)
Fixes https://github.com/pytorch/pytorch/issues/134442

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134818
Approved by: https://github.com/ezyang
ghstack dependencies: #134813
2024-09-03 06:33:24 +00:00
db193d1e29 add msg to _assert_async (#134813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134813
Approved by: https://github.com/ezyang, https://github.com/eqy, https://github.com/albanD
2024-09-03 06:33:18 +00:00
d14fe3ffed [Inductor][CPP] Turns on inline_inbuilt_nn_modules for CPP GEMM template testing (#132487)
**Summary**
The CPP GEMM template testing was skipped when `inline_inbuilt_nn_modules` was turned on, as in https://github.com/pytorch/pytorch/issues/131929. Since https://github.com/pytorch/pytorch/pull/132334 has landed to fix the issues, turn this flag back on since it's the default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132487
Approved by: https://github.com/anijain2305, https://github.com/jgong5
2024-09-03 05:05:50 +00:00
a00fad0177 Add specializations for vectorized conversion between float and BF16/FP16 (#126500)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126500
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-09-03 02:09:12 +00:00
45f11094b6 [ONNX] Delete op_level_debug from torch.onnx.ExportOptions (#134961)
op_level_debug helped identify missing operators and wrongly implemented operators back when the dynamo exporter relied on nearest matching and torchlib had just been created. However, now that the dispatcher logic has improved and torchlib has matured, we no longer need it.

PS: the op-level-debug diagnostics rule is not deleted in this PR, as it involves auto-generated lint error codes and needs more time to fix. We can delete it when we retire SARIF.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134961
Approved by: https://github.com/justinchuby
2024-09-02 23:38:39 +00:00
4c1dd13ba3 [BE] better type annotation for torch.types (#129559)
Closes #129525

- #129525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129559
Approved by: https://github.com/ezyang
2024-09-02 15:35:32 +00:00
76710d4f95 Corrected docstring of `solve_triangular` (#129766)
**Description**
The arguments docstring of [torch.linalg.solve_triangular](https://pytorch.org/docs/stable/generated/torch.linalg.solve_triangular.html#torch.linalg.solve_triangular) incorrectly describes the shape of the ``A`` argument if the option ``left=True``.

The argument ``A`` should have shape $k \times k$ if ``left=False`` in line with the rest of the docstring and the implementation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129766
Approved by: https://github.com/lezcano
2024-09-02 13:30:30 +00:00
ee03530fd9 Add a test to avoid decorator based regression for cprofile traces (#133086)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133086
Approved by: https://github.com/aorenste
2024-09-02 12:53:34 +00:00
FEI
16de25b1dc fix tensor_repr(at::Tensor) (#134762) (#134764)
Fixes #134762
@ezyang @antocuni
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134764
Approved by: https://github.com/ezyang

Co-authored-by: Edward Z. Yang <ezyang@meta.com>
2024-09-02 06:05:10 +00:00
3daca187aa [Inductor] Allow customizing the padding format (#133939)
Based on https://github.com/pytorch/pytorch/pull/130956.

Inductor already supports padding through the `config.comprehensive_padding` option, but the padding format involves a few heuristics that are specific to Nvidia GPUs:
  - When we pad, it is always aligned to the next multiple of 128 bytes.
  - Strides smaller than 1024 are not padded.
  - Only intermediate values are padded, not outputs.

 The last of these is not really GPU-specific, but there are certain cases where we may want to override it. For example, padding outputs is useful on hardware accelerators with specific memory alignment requirements, or for applications where performance is more important than conformity with eager mode.

 This PR surfaces padding parameters up to Inductor's config module, so the user can control them.
   - `config.pad_outputs`: choose whether to pad outputs (default: `False`)
   - `config.padding_alignment_bytes`: choose the alignment size for padding (default: `128`)
   - `config.padding_stride_threshold`:  choose the smallest stride that we will pad. For example, setting this to 0 will pad all unaligned strides. (default: `1024`)

 **Test plan**
 Added a new test in `test_padding.py` which tries various combinations of these options, checking that the output strides match our expectations.

  These changes should not affect perf, because the defaults are identical to Inductor's current behavior.
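
A usage sketch of the new knobs, assuming they are set as plain attributes of `torch._inductor.config` like other Inductor options (the option names come from the description above):

```python
import torch._inductor.config as inductor_config

inductor_config.comprehensive_padding = True    # existing option: enable padding
inductor_config.pad_outputs = True              # new: also pad graph outputs
inductor_config.padding_alignment_bytes = 64    # new: align padded strides to 64 bytes
inductor_config.padding_stride_threshold = 0    # new: pad all unaligned strides
```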

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133939
Approved by: https://github.com/shunting314

Co-authored-by: Yueming Hao <yhao@meta.com>
2024-09-02 05:56:33 +00:00
de3a641476 [executorch hash update] update the pinned executorch hash (#134914)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134914
Approved by: https://github.com/pytorchbot
2024-09-02 03:52:40 +00:00
3cb5d25122 [Inductor] Apply loop split optimization in codegen_node (#132389)
This PR applies a loop split optimization in codegen_node to avoid non-contiguous loads. When a vector is loaded in a non-contiguous manner due to a division in the index, we eliminate the division by splitting the loop.

Example:
```
import torch
import torch.nn as nn

class GNReLU(torch.nn.Module):
    def __init__(self, num_groups, num_channels):
        super(GNReLU, self).__init__()
        self.gn = nn.GroupNorm(num_groups, num_channels)

    def forward(self, x):
        return torch.nn.functional.relu(self.gn(x))

input = torch.randn(2, 960, 96, 96).to(memory_format=torch.channels_last)
m = GNReLU(32, 960).eval()
compiled_m = torch.compile(m)

with torch.no_grad():
    compiled_m(input)
```

Generated code:

- Before:
```
cpp_fused_native_group_norm_relu_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/vu/cvuckxaygqfovv2zu2byqhcmiejbke7mdhf2rpgpr5mlscdev2hg.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(56)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<long>(17280L));
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L))
                        {
                            for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 16);
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                            }
                            for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 14);
                                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, 14, &wrecps0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L))
                {
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(960L); x2+=static_cast<long>(16L))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)), 16);
                        auto tmp1 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr0[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp3 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr1[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x2), 16);
                        auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x2), 16);
                        auto tmp2 = tmp0 - tmp1;
                        auto tmp4 = static_cast<float>(276480.0);
                        auto tmp5 = at::vec::Vectorized<float>(tmp4);
                        auto tmp6 = tmp3 / tmp5;
                        auto tmp7 = static_cast<float>(1e-05);
                        auto tmp8 = at::vec::Vectorized<float>(tmp7);
                        auto tmp9 = tmp6 + tmp8;
                        auto tmp10 = tmp9.rsqrt();
                        auto tmp11 = tmp2 * tmp10;
                        auto tmp13 = tmp11 * tmp12;
                        auto tmp15 = tmp13 + tmp14;
                        auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0));
                        tmp16.store(out_ptr2 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)));
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg2_1, = args
    args.clear()
    assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960))
    buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32)
    cpp_fused_native_group_norm_relu_0(arg2_1, _frozen_param3, _frozen_param2, buf0, buf1, buf3)
    del arg2_1
    return (buf3, )
```

- After:
```
cpp_fused_native_group_norm_relu_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/vu/cvuckxaygqfovv2zu2byqhcmiejbke7mdhf2rpgpr5mlscdev2hg.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(56)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<long>(17280L));
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L))
                        {
                            for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 16);
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                            }
                            for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 14);
                                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, 14, &wrecps0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L))
                {
                    #pragma GCC ivdep
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(32L); x2+=static_cast<long>(1L))
                    {
                        for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)), 16);
                            auto tmp1 = out_ptr0[static_cast<long>(x2 + (32L*x0))];
                            auto tmp4 = out_ptr1[static_cast<long>(x2 + (32L*x0))];
                            auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x3 + (30L*x2)), 16);
                            auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x3 + (30L*x2)), 16);
                            auto tmp2 = at::vec::Vectorized<float>(tmp1);
                            auto tmp3 = tmp0 - tmp2;
                            auto tmp5 = static_cast<float>(276480.0);
                            auto tmp6 = tmp4 / tmp5;
                            auto tmp7 = static_cast<float>(1e-05);
                            auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
                            auto tmp9 = 1 / std::sqrt(tmp8);
                            auto tmp10 = at::vec::Vectorized<float>(tmp9);
                            auto tmp11 = tmp3 * tmp10;
                            auto tmp13 = tmp11 * tmp12;
                            auto tmp15 = tmp13 + tmp14;
                            auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0));
                            tmp16.store(out_ptr2 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)));
                        }
                        for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L))
                        {
                            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)), 14);
                            auto tmp1 = out_ptr0[static_cast<long>(x2 + (32L*x0))];
                            auto tmp4 = out_ptr1[static_cast<long>(x2 + (32L*x0))];
                            auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x3 + (30L*x2)), 14);
                            auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x3 + (30L*x2)), 14);
                            auto tmp2 = at::vec::Vectorized<float>(tmp1);
                            auto tmp3 = tmp0 - tmp2;
                            auto tmp5 = static_cast<float>(276480.0);
                            auto tmp6 = tmp4 / tmp5;
                            auto tmp7 = static_cast<float>(1e-05);
                            auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
                            auto tmp9 = 1 / std::sqrt(tmp8);
                            auto tmp10 = at::vec::Vectorized<float>(tmp9);
                            auto tmp11 = tmp3 * tmp10;
                            auto tmp13 = tmp11 * tmp12;
                            auto tmp15 = tmp13 + tmp14;
                            auto tmp16 = at::vec::clamp_min(tmp15, decltype(tmp15)(0));
                            tmp16.store(out_ptr2 + static_cast<long>(x3 + (30L*x2) + (960L*x1) + (8847360L*x0)), 14);
                        }
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg2_1, = args
    args.clear()
    assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960))
    buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32)
    cpp_fused_native_group_norm_relu_0(arg2_1, _frozen_param3, _frozen_param2, buf0, buf1, buf3)
    del arg2_1
    return (buf3, )
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132389
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
2024-09-02 00:28:34 +00:00
c140fa1426 Reorg cache code to make it simpler (#134911)
Summary:
Pull the big nested function out of the middle of cached_autotune() into its own class.

Also factor out the creation of the autotune cache itself, which gets shared in the next diff.

Test Plan: unit tests

Differential Revision: D60677501

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134911
Approved by: https://github.com/oulgen
2024-09-02 00:27:40 +00:00
0cbcef12bd Stop adding useless prefix to error message here, you're pushing the important info off the screen. (#133108)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133108
Approved by: https://github.com/Skylion007
2024-09-01 23:11:17 +00:00
208442ea18 Don't setup try-except handler when Dynamo compiling (#133239)
The reraise is not supported and so this just gunks up our actual exception handling. You can trigger this by hitting an exception inside of an NN module that has hooks on it. You end up graph breaking on the reraise here, and losing the inner stack trace from the actual exception that was raised.

This might be kind of controversial.  An alternate strategy is to support reraises in Dynamo or something but IDK this doesn't feel like the right place to apply force.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133239
Approved by: https://github.com/anijain2305
2024-09-01 22:26:46 +00:00
ea01aec8b1 Move FunctionSchema implementations to cpp file (#133856)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133856
Approved by: https://github.com/bdhirsh, https://github.com/albanD
2024-09-01 19:50:35 +00:00
2dadc2c8fc Log fx graph cache bypass reasons (#134792)
Summary: Let's track when we bypass and why

Test Plan: unit tests

Differential Revision: D61994739

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134792
Approved by: https://github.com/jamesjwu
2024-09-01 19:02:09 +00:00
cyy
1595e755af [Reland] [Torchgen] Pass mutable to cpp.valuetype_type (#134549)
Reland of #121415

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134549
Approved by: https://github.com/ezyang
2024-09-01 15:15:38 +00:00
eqy
b1a00b7b6d Abate -Wsign-compare warning spam in Indexing.cu (#134805)
Fix for warning spam like
```
 warning: comparison of integer expressions of different signedness: ‘long int’ and ‘uint64_t’ {aka ‘long unsigned int’} [-Wsign-compare]
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134805
Approved by: https://github.com/janeyx99
2024-09-01 10:48:07 +00:00
cyy
d03f767cae Check function declarations of Vulkan code (#134550)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134550
Approved by: https://github.com/ezyang
2024-09-01 09:38:35 +00:00
c25b64a057 expose host_emptyCache to python, fix a bug in freeing cudaHostRegistered memory (#134919)

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134919
Approved by: https://github.com/eqy
2024-09-01 09:07:25 +00:00
caa04e0cae [ET] codegen: bool array as array ref (#134886)
Test Plan: CI

Differential Revision: D62046959

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134886
Approved by: https://github.com/larryliu0820
2024-09-01 01:33:43 +00:00
29b7852dc1 drop gil in couple places (leads to deadlocks) (#134910)
Per title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134910
Approved by: https://github.com/eqy
2024-09-01 00:05:53 +00:00
7239b8a4f1 Clean up RemoteCache classes (#134032)
Summary:
The existing RemoteCacheBackend classes were a bit haphazard - some of them accepted bytes only, some accepted objects, some returned different types of objects than were passed in.

Update them to be more consistent:

1. RemoteCacheBackend is an implementation of a backend: Redis, Memcache, Manifold, LocalFile

2. RemoteCacheSerde is an implementation of a serde protocol - to turn structured objects (dict, list, etc) into bytes: RemoteCacheJsonSerde (json encoding), RemoteCachePassthroughSerde (strictly bytes only)

3. RemoteCache is the cache implementation itself, mixing a RemoteCacheBackend along with an RemoteCacheSerde to provide structured caching.

Other than simply reorganizing the existing cache code this also fixes the Redis autotune caching for OSS.
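
A self-contained sketch of the three-layer composition described above; the class bodies and method names (`get`/`put`, `encode`/`decode`) are assumptions for illustration, not the exact signatures in this PR.

```python
import json

class InMemoryBackend:
    """Stands in for a RemoteCacheBackend (Redis, Memcache, Manifold, LocalFile): raw bytes only."""
    def __init__(self):
        self._store = {}
    def get(self, key):
        return self._store.get(key)
    def put(self, key, data):
        self._store[key] = data

class JsonSerde:
    """Stands in for RemoteCacheJsonSerde: structured objects <-> bytes."""
    def encode(self, obj):
        return json.dumps(obj).encode()
    def decode(self, data):
        return json.loads(data)

class Cache:
    """Stands in for RemoteCache: a backend plus a serde gives structured caching."""
    def __init__(self, backend, serde):
        self.backend, self.serde = backend, serde
    def get(self, key):
        data = self.backend.get(key)
        return None if data is None else self.serde.decode(data)
    def put(self, key, obj):
        self.backend.put(key, self.serde.encode(obj))

cache = Cache(InMemoryBackend(), JsonSerde())
cache.put("autotune/key", {"best_config": "cfg0", "time_ms": 1.2})
assert cache.get("autotune/key") == {"best_config": "cfg0", "time_ms": 1.2}
```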

Test Plan: unit tests

Reviewed By: oulgen

Differential Revision: D61178859

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134032
Approved by: https://github.com/oulgen, https://github.com/bhack
2024-08-31 20:18:59 +00:00
590d96be64 [inductor] move test_fuse_large_params to slow test. (#134900)
Move `test_fuse_large_params` to the slow tests. This case takes about 1.5 minutes.

<img width="855" alt="image" src="https://github.com/user-attachments/assets/adf16dcf-d398-4d66-8dda-0c9cafc4e351">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134900
Approved by: https://github.com/jansel
2024-08-31 18:08:11 +00:00
f4641ca481 [Inductor] Remove VecChecker and fallback non-supported Vec op to Scalar impl with a for loop (#134569)
Fall back non-supported vectorized ops to a scalar implementation wrapped in a for loop.

Example code:
```
cpp_fused_igammac_0 = async_compile.cpp_pybinding(['const double*', 'const double*', 'double*'], '''
#include "/tmp/torchinductor_root/z4/cz4j2mmotlx3z2b7u4fbjtdt4x6plhd67ljwzg5bk7ekv4xz6y7q.h"
extern "C"  void kernel(const double* in_ptr0,
                       const double* in_ptr1,
                       double* out_ptr0)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(48L); x0+=static_cast<int64_t>(8L))
        {
            auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), 8);
            auto tmp1 = in_ptr1[static_cast<int64_t>(0L)];
            auto tmp2 = at::vec::VectorizedN<double,2>(tmp1);
            auto tmp3 =
            [&]()
            {
                __at_align__ std::array<double, 8> tmpbuf0;
                tmp0.store(tmpbuf0.data(), 8);
                __at_align__ std::array<double, 8> tmpbuf1;
                tmp2.store(tmpbuf1.data(), 8);
                __at_align__ std::array<double, 8> tmpbuf_out;
                for (int i = 0; i < 8; i++)
                {
                    tmpbuf_out[i] = calc_igammac(tmpbuf0[i], tmpbuf1[i]);
                }
                return at::vec::VectorizedN<double, 2>::loadu(tmpbuf_out.data(), 8);
            }
            ()
            ;
            tmp3.store(out_ptr0 + static_cast<int64_t>(x0), 8);
        }
        #pragma omp simd simdlen(4)
        for(int64_t x0=static_cast<int64_t>(48L); x0<static_cast<int64_t>(50L); x0+=static_cast<int64_t>(1L))
        {
            auto tmp0 = in_ptr0[static_cast<int64_t>(x0)];
            auto tmp1 = in_ptr1[static_cast<int64_t>(0L)];
            auto tmp2 = calc_igammac(tmp0, tmp1);
            out_ptr0[static_cast<int64_t>(x0)] = tmp2;
        }
    }
}
''')

```

`frexp` is difficult to handle with the common `fallback` since it returns two `cse_var`s: 2ba60a1618/torch/_inductor/codegen/cpp.py (L752-L766)
So we added a special function to do that.
```
cpp_fused_frexp_0 = async_compile.cpp_pybinding(['const double*', 'double*', 'int32_t*'], '''
#include "/tmp/torchinductor_root/z4/cz4j2mmotlx3z2b7u4fbjtdt4x6plhd67ljwzg5bk7ekv4xz6y7q.h"
extern "C"  void kernel(const double* in_ptr0,
                       double* out_ptr0,
                       int32_t* out_ptr1)
{
    {
        for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(16L); x0+=static_cast<int64_t>(8L))
        {
            auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<int64_t>(x0), 8);
            at::vec::Vectorized<int32_t> tmp1;
            at::vec::VectorizedN<double, 2> tmp2;
            [&]()
            {
                __at_align__ std::array<double, 8> tmpbuf;
                tmp0.store(tmpbuf.data(), 8);
                __at_align__ std::array<int32_t, 8> tmpbuf_exponent;
                __at_align__ std::array<double, 8> tmpbuf_mantissa;
                for (int i = 0; i < 8; i++)
                {
                    tmpbuf_mantissa[i] = std::frexp(tmpbuf[i], &tmpbuf_exponent[i]);
                }
                tmp1 = at::vec::Vectorized<int32_t>::loadu(tmpbuf_exponent.data(), 8);
                tmp2 = at::vec::VectorizedN<double, 2>::loadu(tmpbuf_mantissa.data(), 8);
            }
            ();
            tmp2.store(out_ptr0 + static_cast<int64_t>(x0), 8);
            tmp1.store(out_ptr1 + static_cast<int64_t>(x0), 8);
        }
        #pragma omp simd simdlen(4)
        for(int64_t x0=static_cast<int64_t>(16L); x0<static_cast<int64_t>(20L); x0+=static_cast<int64_t>(1L))
        {
            auto tmp0 = in_ptr0[static_cast<int64_t>(x0)];
            int32_t tmp1;
            auto tmp2 = std::frexp(tmp0, &tmp1);
            out_ptr0[static_cast<int64_t>(x0)] = tmp2;
            out_ptr1[static_cast<int64_t>(x0)] = tmp1;
        }
    }
}
''')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134569
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-08-31 11:19:57 +00:00
16f119e62a Update compiled optimizer tests for tensor betas (#134169)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134169
Approved by: https://github.com/anijain2305, https://github.com/eellison
ghstack dependencies: #134166, #134167, #134168
2024-08-31 10:24:39 +00:00
4e71418566 [dynamo] rewrite addcmul_ to remove graph break (#134168)
Context: Adding support for the beta parameters to be tensors

Details: Similarly to the previous two PRs, addcmul_ is used with the tensor betas as the value argument. When this occurs, an item() call is invoked in the aten op. To avoid this graph break, addcmul_ is decomposed into its constituent ops, as sketched below.
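
A rough sketch of the decomposition, using a hypothetical helper name (the actual rewrite happens inside dynamo's tracing rules):

```python
import torch

def addcmul__decomposed(acc, t1, t2, value):
    # addcmul_(acc, t1, t2, value=v) is equivalent to acc += v * t1 * t2.
    # Keeping `value` as a 0-dim tensor avoids the .item() call (and graph break)
    # that the fused aten op would otherwise trigger under dynamo.
    return acc.add_(t1.mul(t2).mul(value))

acc = torch.zeros(3)
addcmul__decomposed(acc, torch.ones(3), torch.full((3,), 2.0), torch.tensor(0.5))
# acc is now tensor([1., 1., 1.]), matching acc.addcmul_(t1, t2, value=0.5)
```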

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134168
Approved by: https://github.com/anijain2305
ghstack dependencies: #134166, #134167
2024-08-31 10:24:39 +00:00
3fb4c6bc38 [dynamo] Rewrite foreach pow to broadcast scalar argument (#134167)
Context: Adding support for the beta parameters to be tensors

Details:
In this PR, similarly to the previous one, foreach_pow calls item() on the first argument when it is a scalar tensor. In this case, we broadcast that scalar tensor into a list of aliases of that tensor to avoid the item() call; this results in a device copy of the scalar tensor. Once again, I don't think we can change the foreach_pow API due to BC concerns, so this op rewrite allows us to avoid a graph break, generate semantically the same code, and not affect eager. A sketch of the idea follows.
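
A minimal sketch of the rewrite, with a hypothetical helper name (the actual rewrite happens inside dynamo):

```python
import torch

def foreach_pow_scalar_tensor(base, exponents):
    # Broadcast the 0-dim `base` tensor into a list of aliases, one per exponent,
    # so the TensorList overload of _foreach_pow is used and .item() is never called.
    return torch._foreach_pow([base] * len(exponents), exponents)

base = torch.tensor(2.0)
exps = [torch.tensor([1.0, 2.0]), torch.tensor([3.0])]
out = foreach_pow_scalar_tensor(base, exps)
# out contains two tensors: [2., 4.] and [8.]
```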

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134167
Approved by: https://github.com/anijain2305
ghstack dependencies: #134166
2024-08-31 10:24:35 +00:00
471c33f007 [dynamo] Rewrite foreach_lerp to avoid aten item call (#134166)
Context: Adding support for the beta parameters to be tensors

Details:
In order to add support for the beta params to be tensors without graph breaks in the Adam family of optimizers, it is necessary to support foreach_lerp(x, y, s) where s is a scalar tensor. Today, this isn't possible because when `s` is a scalar tensor, the aten op internally calls item() on it to extract the value and distribute it to each of the ops on the individual list indices. To support this in dynamo without graph breaks, I decompose the lerp into its constituent ops, which accept a scalar tensor in the list argument positions without an item() call (see the sketch below). To be clear, the item() call is more performant for eager, I think, and for BC reasons I don't think we can modify that API, so this keeps performance in eager and avoids graph breaks in compile.
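
A minimal sketch of the decomposition, with a hypothetical helper name (the actual rewrite happens inside dynamo):

```python
import torch

def foreach_lerp_decomposed(xs, ys, weight):
    # lerp(x, y, w) == x + w * (y - x); keeping `weight` as a 0-dim tensor means
    # no .item() call is needed, unlike the fused foreach_lerp with a scalar.
    diffs = torch._foreach_sub(ys, xs)
    scaled = torch._foreach_mul(diffs, weight)
    return torch._foreach_add(xs, scaled)

xs = [torch.zeros(2), torch.zeros(3)]
ys = [torch.ones(2), torch.full((3,), 4.0)]
out = foreach_lerp_decomposed(xs, ys, torch.tensor(0.5))
# out[0] is [0.5, 0.5]; out[1] is [2., 2., 2.]
```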

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134166
Approved by: https://github.com/anijain2305
2024-08-31 10:24:31 +00:00
eed0d76682 [dynamo][itertools] refactor itertools.islice to use polyfill (#133876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133876
Approved by: https://github.com/jansel
ghstack dependencies: #133864, #133894
2024-08-31 10:08:07 +00:00
ec660c383e [dynamo] reduce overhead for PolyfilledFunctionVariable.call_function (#134842)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134842
Approved by: https://github.com/jansel
2024-08-31 09:12:46 +00:00
d9cc693719 [jit] Change argument names (#134828)
It seems like a bug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134828
Approved by: https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2024-08-31 08:42:30 +00:00
136badae64 [inductor] preload icx built in math libs (#134870)
The Intel Compiler implements more math libraries than clang, which helps performance.
We need to preload them like the OpenMP library.

reproduce UT:
```cmd
pytest test/inductor/test_cpu_cpp_wrapper.py -v -k test_silu_cpu_dynamic_shapes_cpp_wrapper
```

Depends of module:
<img width="804" alt="Image" src="https://github.com/user-attachments/assets/9a672e03-ebf5-4ebb-b182-09180e6f7841">

Local test pass:
<img width="857" alt="image" src="https://github.com/user-attachments/assets/afbb8c1c-8fcc-4d64-a3ad-c8521b137d2d">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134870
Approved by: https://github.com/jansel
2024-08-31 04:50:31 +00:00
090d9cf410 [Dynamo][autograd.Function][vmap] support torch._C._are_functorch_transforms_active (#134889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134889
Approved by: https://github.com/jansel
2024-08-31 04:39:09 +00:00
34b85d301f [executorch hash update] update the pinned executorch hash (#134894)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134894
Approved by: https://github.com/pytorchbot
2024-08-31 04:16:41 +00:00
64fad53b50 [Inductor] Support passing module map parameter to Triton make_ir API (#134774)
In https://github.com/triton-lang/triton/pull/4539 the `make_ir` API was modified to accept a new `module_map` parameter. Update the Inductor callsite accordingly, preserving backwards compatibility following the existing code.

Fixes #134674

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134774
Approved by: https://github.com/EikanWang, https://github.com/zou3519, https://github.com/jansel
2024-08-31 03:38:08 +00:00
aef5da50f4 Cleanup unused pytorch.version (#134381)
This file doesn't seem to be used anywhere? checking CI...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134381
Approved by: https://github.com/zou3519
2024-08-31 02:50:05 +00:00
86e03a64e1 Revert "[Inductor] Allow customizing the padding format (#133939)"
This reverts commit 8b258b3b14408986a1d4142cff5a153c798ceecc.

Reverted https://github.com/pytorch/pytorch/pull/133939 on behalf of https://github.com/ZainRizvi due to sorry but this PR is causing issues with diff train imports reverting it for now but it can be merged back in as-is ([comment](https://github.com/pytorch/pytorch/pull/133939#issuecomment-2322635388))
2024-08-31 00:38:30 +00:00
f95085fd91 [BE][MPS] Prefer xfail to skip (#134858)
This essentially undoes the large skips applied to nn.modules on everything but macOS Sequoia, introduced by https://github.com/pytorch/pytorch/pull/128393

Instead, it uses the existing `xfail`, but guards it on the `_macos15_or_newer` boolean

Before the change if run on MacOS 14:
```
 % python3 ../test/test_modules.py -v -k Hardswish 2>&1|tail -n3
Ran 57 tests in 0.053s

OK (skipped=32)
```
After
```
% python3 ../test/test_modules.py -v -k Hardswish 2>&1|tail -n3
Ran 57 tests in 0.229s

OK (skipped=10, expected failures=2)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134858
Approved by: https://github.com/janeyx99
2024-08-31 00:29:48 +00:00
050ad925f3 [benchmark] Add to torchbench relative path search (#134871)
Add to the relative path search in the benchmark. This enables users to run `torchbench.py` inside the `pytorch/benchmark/dynamo` folder when the `torchbench` repo is cloned at the same level as `pytorch`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134871
Approved by: https://github.com/FindHao
2024-08-31 00:28:22 +00:00
a854c3a25e [dynamo] refactor builtins.enumerate to use polyfill (#133894)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133894
Approved by: https://github.com/jansel
ghstack dependencies: #133864
2024-08-31 00:17:27 +00:00
ebbdeeede1 [dynamo][itertools] refactor itertools.chain and itertools.chain.from_iterable to use polyfills (#133864)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133864
Approved by: https://github.com/jansel
2024-08-31 00:11:54 +00:00
5dad6a5a84 [ONNX][DORT] Lazy-import onnxruntime (#134662)
Currently, if installed, `onnxruntime` will be imported when importing `torch._inductor` (which will be imported by some other library, e.g. transformer-engine):

```
  /mnt/c.py(53)<module>()
-> from torch._inductor.utils import maybe_profile
  /usr/local/lib/python3.10/site-packages/torch/_inductor/utils.py(49)<module>()
-> import torch._export
  /usr/local/lib/python3.10/site-packages/torch/_export/__init__.py(25)<module>()
-> import torch._dynamo
  /usr/local/lib/python3.10/site-packages/torch/_dynamo/__init__.py(2)<module>()
-> from . import convert_frame, eval_frame, resume_execution
  /usr/local/lib/python3.10/site-packages/torch/_dynamo/convert_frame.py(48)<module>()
-> from . import config, exc, trace_rules
  /usr/local/lib/python3.10/site-packages/torch/_dynamo/trace_rules.py(52)<module>()
-> from .variables import (
  /usr/local/lib/python3.10/site-packages/torch/_dynamo/variables/__init__.py(38)<module>()
-> from .higher_order_ops import (
  /usr/local/lib/python3.10/site-packages/torch/_dynamo/variables/higher_order_ops.py(14)<module>()
-> import torch.onnx.operators
  /usr/local/lib/python3.10/site-packages/torch/onnx/__init__.py(62)<module>()
-> from ._internal.onnxruntime import (
  /usr/local/lib/python3.10/site-packages/torch/onnx/_internal/onnxruntime.py(37)<module>()
-> import onnxruntime  # type: ignore[import]
```

This issue breaks the generated Triton kernel because it imports torch, and thus pulls in unexpected runtime libraries as well.

I've also added a test for this specific case under `test/onnx`, perhaps we should add more somewhere else?
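
A minimal sketch of the lazy-import pattern, purely illustrative (the actual change lives in torch/onnx/_internal/onnxruntime.py and may be structured differently):

```python
_onnxruntime = None

def _get_onnxruntime():
    # Defer the heavyweight import until a DORT backend actually needs it, so
    # `import torch._inductor` no longer drags onnxruntime in at import time.
    global _onnxruntime
    if _onnxruntime is None:
        import onnxruntime  # type: ignore[import]
        _onnxruntime = onnxruntime
    return _onnxruntime
```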

Related issue: https://github.com/huggingface/accelerate/pull/3056
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134662
Approved by: https://github.com/justinchuby
2024-08-31 00:06:28 +00:00
2384f77d76 [XPU] Fix Windows XPU build (#134276)
The linker flag check doesn't work correctly with MSVC, and linking torch_xpu with torch_cpu_library on Windows with MSVC works without any errors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134276
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-08-30 23:51:40 +00:00
e688b78791 [Dynamo][autograd.Function] Trace fwd graph under no_grad mode (#134872)
Fixes #134820

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134872
Approved by: https://github.com/zou3519
2024-08-30 22:24:18 +00:00
8b258b3b14 [Inductor] Allow customizing the padding format (#133939)
Based on https://github.com/pytorch/pytorch/pull/130956.

Inductor already supports padding through the `config.comprehensive_padding` option, but the padding format involves a few heuristics that are specific to Nvidia GPUs:
  - When we pad, it is always aligned to the next multiple of 128 bytes.
  - Strides smaller than 1024 are not padded.
  - Only intermediate values are padded, not outputs.

 The last of these is not really GPU-specific, but there are certain cases where we may want to override it. For example, padding outputs is useful on hardware accelerators with specific memory alignment requirements, or for applications where performance is more important than conformity with eager mode.

 This PR surfaces padding parameters up to Inductor's config module, so the user can control them.
   - `config.pad_outputs`: choose whether to pad outputs (default: `False`)
   - `config.padding_alignment_bytes`: choose the alignment size for padding (default: `128`)
   - `config.padding_stride_threshold`:  choose the smallest stride that we will pad. For example, setting this to 0 will pad all unaligned strides. (default: `1024`)

 **Test plan**
 Added a new test in `test_padding.py` which tries various combinations of these options, checking that the output strides match our expectations.

  These changes should not affect perf, because the defaults are identical to Inductor's current behavior.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133939
Approved by: https://github.com/shunting314

Co-authored-by: Yueming Hao <yhao@meta.com>
2024-08-30 20:34:11 +00:00
a1ba8e61d1 Revert "[ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438)"
This reverts commit 5e8bf29148a590318f678620f84be8f4d5ffff5c.

Reverted https://github.com/pytorch/pytorch/pull/133438 on behalf of https://github.com/ZainRizvi due to This still breaks linux binary builds. Added the appropriate labels to ensure tests can pass. See [GH job link](https://github.com/pytorch/pytorch/actions/runs/10626427003/job/29460479554) [HUD commit link](5e8bf29148) ([comment](https://github.com/pytorch/pytorch/pull/133438#issuecomment-2322246198))
2024-08-30 20:00:41 +00:00
f6398eb0fa dynamic shapes for combo_kenel/foreach_kernel (#134477)
This PR adds dynamic shapes support to foreach and combo kernels for horizontal fusion.
A flag `combo_kernel_foreach_dynamic_shapes` (default False, to avoid disturbing production workflows) is added to _inductor/config.py. Setting it to True enables automatic dynamic shapes for foreach kernels; it is always enabled for combo kernel cases. Unit tests were added.
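
A usage sketch, assuming the flag is set as a plain attribute of `torch._inductor.config` like other Inductor options:

```python
import torch._inductor.config as inductor_config

# Opt in to automatic dynamic shapes for foreach (horizontally fused) kernels;
# combo kernel cases have this behavior enabled regardless of the flag.
inductor_config.combo_kernel_foreach_dynamic_shapes = True
```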

This PR also fixes a flaky test case for [T198833257](https://www.internalfb.com/intern/tasks/?t=198833257)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134477
Approved by: https://github.com/mlazos
2024-08-30 19:58:20 +00:00
db17a9898d regenerate ci workflows for binary builds with new g4dn runners (#133404)
Fixes #103104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133404
Approved by: https://github.com/ZainRizvi
2024-08-30 19:53:22 +00:00
98b813d0d4 Enable cudagraphs in cpp wrapper (#133885)
Fixes https://github.com/pytorch/pytorch/issues/130878

Summary: Enables cudagraphs in cpp wrapper by clearing inputs.

Generated, non-cpp wrapper code:
```python
def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (10, ), (1, ))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
        buf0 = empty_strided_cuda((10, ), (1, ), torch.float32)
        # Topologically Sorted Source Nodes: [sin], Original ATen: [aten.sin]
        stream0 = get_raw_stream(0)
        triton_poi_fused_sin_0.run(arg0_1, buf0, 10, grid=grid(10), stream=stream0)
        del arg0_1
    return (buf0, )
```
vs generated cpp wrapper code:
```python
def _wrap_func(f):
    def g(args):
        input_tensors = [arg if isinstance(arg, torch.Tensor) else torch.tensor(arg) for arg in args]
        input_handles = torch._C._aoti.unsafe_alloc_void_ptrs_from_tensors(input_tensors)
        # new:
        args.clear()
        # end new

        output_handles = f(input_handles)
        output_tensors = torch._C._aoti.alloc_tensors_by_stealing_from_void_ptrs(output_handles)
        return output_tensors

    return g

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133885
Approved by: https://github.com/eellison, https://github.com/desertfire
2024-08-30 18:48:37 +00:00
bdfa94b787 [RFC] Make fr trace script a console scripts (#134729)
We want to make the fr analyzer script available as a command after users `pip install torch`; that's why we mimic what torchrun does.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134729
Approved by: https://github.com/c-p-i-o, https://github.com/malfet
ghstack dependencies: #134528, #134780
2024-08-30 18:17:06 +00:00
a0d0c6b7e6 Used torch.equal in test_foreach_copy_with_multi_dtypes (#134861)
`self.assertEqual` allows some tolerance, but here, we want to show that `_foreach_copy_` gives bitwise equivalent results. Let us use `torch.equal` then.
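
To illustrate the difference, here is a small standalone example (not part of the PR); `torch.testing.assert_close` is used as a stand-in for the tolerance-based comparison behind `self.assertEqual`:

```python
import torch

a = torch.tensor([1.0])
b = a + 1e-7                      # differs from `a` only in the last bits

torch.testing.assert_close(a, b)  # passes: comparison allows rtol/atol tolerance
print(torch.equal(a, b))          # False: requires bitwise-identical values
```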
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134861
Approved by: https://github.com/Skylion007, https://github.com/janeyx99, https://github.com/crcrpar
2024-08-30 18:04:41 +00:00
1993a2aa9e [FR] Make pg_name unique, show P2P collective status and fix bugs when running the script as command (#134780)
Fixes a bunch of bugs in the script when running with the generated command and 3D parallelism.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134780
Approved by: https://github.com/c-p-i-o
ghstack dependencies: #134528
2024-08-30 18:03:17 +00:00
15f5a4858b [inductor] enable Intel Compiler(icx-cl) for inductor windows (#134772)
This PR enables the Intel Compiler (`icx-cl`) for Windows Inductor, like the previous PR https://github.com/pytorch/pytorch/pull/134444, which enabled clang.

Changes:
1. Fix an icx-cl crash caused by decoding args incorrectly; the right decoding is "utf-8".
2. Add an Intel compiler check, and an Intel compiler Windows driver check (icx-cl).
3. Add Intel compiler OpenMP args config.
4. Add Intel compiler OpenMP binary preload.

For intel compiler openmp binary path:
<img width="788" alt="image" src="https://github.com/user-attachments/assets/54c76356-018d-4bef-a9b7-0ea150fd7aba">

For performance, the Intel compiler (`icx-cl`) is much better than MSVC (`cl`):
<img width="875" alt="image" src="https://github.com/user-attachments/assets/67865faf-b1de-4535-917a-486b72527204">

Append `clang-cl` performance data:
<img width="821" alt="image" src="https://github.com/user-attachments/assets/476f4568-bf58-457f-b73d-4e57f49be384">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134772
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-08-30 17:51:46 +00:00
9e0ddc0e14 [inductor] don't allow triton config pre_hook (#134633)
The caching autotuner caches triton configs, and it doesn't try to hash or save the pre_hook from the config if it exists. If we had a config that had a pre_hook, then we might autotune -> save the config (without the pre_hook) -> later load the saved config and try to run it, but this time without the pre_hook.

So this PR adds an assert and deletes the pre_hook handling. We can be confident that we didn't have functional pre_hooks, because the pre_hook handling tries to use `self.arg_name`, which doesn't exist.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134633
Approved by: https://github.com/shunting314, https://github.com/jansel
2024-08-30 17:39:37 +00:00
e21d7b77ce Update ForeachfuncInfo.sample_inputs_func to yield scalars & scalarlists that are more friendly to test_meta (#134552)
for `test_meta.py` to see more "PASSED" instead of "XFAIL".

`pytest test_meta.py -k "_foreach_"` ran 6400 test cases and:
- This PR: 4702 passed, 260 skipped, 73732 deselected, 1698 xfailed
- main (92c4771853892193d73d87bd60eca4dc7efc51d8): 3906 passed, 260 skipped, 73732 deselected, 2494 xfailed
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134552
Approved by: https://github.com/janeyx99
2024-08-30 17:30:50 +00:00
577a93514f [dynamo][dynamic][heuristic] Mark tuple getitem integers as static (#134734)
Fixes issue seen in https://github.com/pytorch/pytorch/issues/132872#issuecomment-2314574656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134734
Approved by: https://github.com/jansel
ghstack dependencies: #134653, #134713
2024-08-30 17:06:57 +00:00
08184aa85c Add support for 32KB multi_tensor_apply kernel arguments (#134373)
## Benchmark

On H100 SXM (HBM2e, 500W TDP), CUDA Toolkit=12.2, Driver Version=535.154.05, with [this script](https://gist.github.com/yifuwang/178c1f4bf951c5794ea79c04d90e44fa) (`torch._foreach_copy_`):

**Baseline**
```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmp0g_x4sys
device ms: 0.891, cpu ms: 7.200
memory bandwidth: 1457.727 GB/s
```

Single iteration trace:
<img width="1432" alt="image" src="https://github.com/user-attachments/assets/8ef54365-0265-4281-a0f0-d4c2f448300e">

**This PR**
```
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html#!/?url=https://interncache-all.fbcdn.net/manifold/perfetto_internal_traces/tree/shared_trace/yifu_tmp3jqiugli
device ms: 0.683, cpu ms: 6.745
memory bandwidth: 1902.010 GB/s
```

Single iteration trace:
<img width="1074" alt="image" src="https://github.com/user-attachments/assets/e52acad1-d09b-492c-9611-6d69e339f3ac">

## Binary Size and Kernel Specialization
The binary size of `libtorch_cuda.so` increased by 6MB (243MB -> 249MB).

```
// NOTE: [32KB kernel argument size support]
// 32KB kernel argument size support has three requirements:
// - CUDART_VERSION >= 12010
// - Driver version >= 530
// - GPU arch >= VOLTA
//
// Due to minor version compatibility, it is possible for binaries built with
// CUDART_VERSION >= 12010 to run with driver version < 530. Since driver
// version can only be checked at runtime, if CUDART_VERSION >= 12010, we have
// to build both 4KB and 32KB kernels and determine the appropriate kernel to
// dispatch at runtime.
//
// - If CUDART_VERSION < 12010, only 4KB kernels will be instantiated.
//
// - If CUDART_VERSION >= 12010:
//   - Host code:
//     - We always instantiate the launching stub for both 4KB and 32KB kernels.
//   - Device code:
//     - If __CUDA_ARCH__ >= 700, we always instantiate both 4KB and 32KB
//     kernels.
//     - If __CUDA_ARCH__ < 700, it's not possible to even compile an empty
//     32KB kernel (formal parameter space overflowed). Thus, we only
//     instantiate a declaration for 32KB kernels. This is valid as long as the
//     declaration-only kernel is not launched.
//
// - At runtime, we dispatch to the 32KB kernel if driver version >= 530 and
// GPU arch >= VOLTA.
//
// - TODO(yifu): once there's a CUDART version that is not compatible with any
// driver version below 530, we can determine at compile time to not compile
// the kernels for 4KB kernel argument size.
//
// https://developer.nvidia.com/blog/cuda-12-1-supports-large-kernel-parameters/
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134373
Approved by: https://github.com/eqy, https://github.com/crcrpar, https://github.com/janeyx99
2024-08-30 16:52:28 +00:00
a19a7524f6 [export] Make sure getitem replacement are synced with module call graph. (#134830)
Summary: When we are placing nodes in the graph, we should also replace the references in module_call_graph.

Test Plan:
buck2 run 'fbcode//mode/opt' torchrec/fb/ir/tests:test_serializer -- --filter-regex test_serialize_deserialize_vlea
buck2 test 'fbcode//mode/opt' fbcode//torchrec/fb/ir/tests:test_serializer -- --exact 'torchrec/fb/ir/tests:test_serializer - torchrec.fb.ir.tests.test_serializer.TestSerializer: test_serialize_empty_value_vlea' --run-disabled

buck2 test 'fbcode//mode/opt' fbcode//torchrec/fb/ir/tests:test_serializer -- --exact 'torchrec/fb/ir/tests:test_serializer - torchrec.fb.ir.tests.test_serializer.TestSerializer: test_deserialized_device_vle' --run-disabled

Differential Revision: D62014035

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134830
Approved by: https://github.com/angelayi
2024-08-30 16:47:05 +00:00
f5b0caee71 Rewrite unsafe_remove_auto_functionalized_pass using decompose_auto_functionalized (#134831)
`unsafe_remove_auto_functionalized_pass` can be written using `decompose_auto_functionalized`; this way we do not have to update it each time we change `auto_functionalize` (e.g. https://github.com/pytorch/pytorch/pull/134409), and we avoid duplicating the same logic in two different ways.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134831
Approved by: https://github.com/zou3519
2024-08-30 16:27:53 +00:00
351ba3e67c Revert "[c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931)"
This reverts commit 65864d01341d006955579b145f78547314ceb14b.

Reverted https://github.com/pytorch/pytorch/pull/132931 on behalf of https://github.com/ZainRizvi due to This PR is breaking builds internally due to the removal of ProcessGroup::Options ([comment](https://github.com/pytorch/pytorch/pull/132931#issuecomment-2321862402))
2024-08-30 16:27:40 +00:00
994438040c Improvements for associative_scan - combine_mode (#133012)
This is part of a series of PRs to improve the `associative_scan` functionality. This specific PR introduces a `combine_mode`, which can be either `pointwise` (default) or `generic`. In the `generic` case, `associative_scan` is more flexible and also allows non-pointwise combine functions. This PR has been derived from https://github.com/pytorch/pytorch/pull/129307.

@ydwu4 @Chillee @zou3519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133012
Approved by: https://github.com/ydwu4
2024-08-30 16:09:53 +00:00
c6ecf57dd2 Revert "[dynamo] simplify implementation for functools.reduce (#133778)"
This reverts commit b5f1ffa7ab0988184497788f2738e1769888ab7d.

Reverted https://github.com/pytorch/pytorch/pull/133778 on behalf of https://github.com/ZainRizvi due to This is still failing internally with the same error about 'Graph break due to unsupported builtin _functools.reduce' ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2321787968))
2024-08-30 16:06:10 +00:00
7a85c488a8 Revert "[dynamo] simplify implementation for builtins.sum (#133779)"
This reverts commit eaa449fbf0fe528a0827ee9b5bcfcd307a7c658d.

Reverted https://github.com/pytorch/pytorch/pull/133779 on behalf of https://github.com/ZainRizvi due to This is still failing internally with the same error about 'Graph break due to unsupported builtin _functools.reduce' ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2321787968))
2024-08-30 16:06:10 +00:00
1ad08c7a5b Revert "[dynamo][itertools] refactor itertools.chain and itertools.chain.from_iterable to use polyfills (#133864)"
This reverts commit 1b703669576223024eb84a76c53b7ec5ed8bb270.

Reverted https://github.com/pytorch/pytorch/pull/133864 on behalf of https://github.com/ZainRizvi due to This is still failing internally with the same error about 'Graph break due to unsupported builtin _functools.reduce' ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2321787968))
2024-08-30 16:06:10 +00:00
8aa44e14cf Revert "[dynamo] refactor builtins.enumerate to use polyfill (#133894)"
This reverts commit a2566adfb6064235db6d950568435fb6ef58a598.

Reverted https://github.com/pytorch/pytorch/pull/133894 on behalf of https://github.com/ZainRizvi due to This is still failing internally with the same error about 'Graph break due to unsupported builtin _functools.reduce' ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2321787968))
2024-08-30 16:06:09 +00:00
10c31e96df Revert "[dynamo][itertools] refactor itertools.islice to use polyfill (#133876)"
This reverts commit 7d12e6dceb94a221288f21c0e79ce8ca667d657a.

Reverted https://github.com/pytorch/pytorch/pull/133876 on behalf of https://github.com/ZainRizvi due to This is still failing internally with the same error about 'Graph break due to unsupported builtin _functools.reduce' ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2321787968))
2024-08-30 16:06:09 +00:00
d261a1751a [HOP] fix export x inline_inbuilt_nn_modules (#133731)
TL;DR: this PR supports exporting cond with the inline_inbuilt_nn_modules flag by inlining into the tracing code in proxy_tensor.py and _symbolic_trace.py (internally, the pattern is make_fx(record_module_stack)(torch.compile(f))).

We have two special treatments for following cases:

1. _ModuleStackTracer wraps all the nn modules into _AttrProxy. This _AttrProxy has several subtleties that make it hard to inline in dynamo, such as overriding _modules with a property method and overriding `__getattr__`, which mutates captured state when called.

The solution is to unwrap the _AttrProxy and get its corresponding nn module (a 1-1 correspondence), so that dynamo symbolically traces the original nn module instead of tracing _AttrProxy.

2. The tracer applies a bunch of patches to the `__getattr__` and `__call__` of nn.Module for tracking reasons. This doesn't work well with dynamo. The immediate error we see is `torch._dynamo.exc.Unsupported: 'inline in skipfiles: WeakKeyDictionary.__contains__ | __contains__ /home/yidi/.conda/envs/pytorch/lib/python3.10/weakref.py`, caused by a weakdict in PythonKeyTracer.

The solution is to remove the patches temporarily during dynamo symbolic conversion, so that dynamo has a clean environment. make_fx will then trace the bytecode transformed by dynamo and patch nn modules there instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133731
Approved by: https://github.com/anijain2305
ghstack dependencies: #134775
2024-08-30 15:58:20 +00:00
932c4ca5a0 make make_fx collective test single threaded (#134775)
make_fx is not thread-safe because it mutates and patches global state. It's difficult and low-ROI to make it thread-safe, so just turn the tracing test into a single-threaded test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134775
Approved by: https://github.com/yifuwang
2024-08-30 15:58:20 +00:00
eqy
c07e566baf [CUDA][P2P] Check device capability in requires_cuda_p2p_access (#134523)
Tests seem to fail on e.g. Volta without this, given the compile-time macros used, e.g., in 79b7fff188/torch/csrc/distributed/c10d/intra_node_comm.cu (L487).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134523
Approved by: https://github.com/yifuwang, https://github.com/Skylion007
2024-08-30 14:08:55 +00:00
92f282ca52 Enable batch matmul for result sizes > 2**32 when the tensor can be split along the batch axis (#133430)
Fixes #131865. Addresses the issue seen when running the Llama v3.1 8B parameter model on the MPS backend, where the batch matmul output size can go over the 32-bit indexing limit of MPS tensors, causing an assert.

Test case to reproduce the issue with the dimensions encountered in llama v3.1 and verify this fix works around it:

```
import torch
device='mps'
a = torch.randn([32, 20064, 128], dtype=torch.float32,device=device)
b = torch.randn([32, 128, 20064], dtype=torch.float32, device=device)
res = torch.bmm(a, b)
```

Notably, the current change only works as long as the individual output matrix in the bmm does not exceed 2**32 elements. This lets us split up the computation along the batch axis to avoid going over the limit.

Added a TORCH_CHECK to raise an error if the individual matrix dimensions are too large for this op to handle, until a more general workaround that tiles the matmuls is available.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133430
Approved by: https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-08-30 14:08:43 +00:00
50efbb9f1e [DeviceMesh][Test] Add a unit test for get_local_rank for flattened mesh (#134603)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134603
Approved by: https://github.com/fduwjj
ghstack dependencies: #133838, #133839, #134048
2024-08-30 08:13:37 +00:00
0f8bec4399 [dynamo] mark_static_nn_module (#134713)
Fixes issue seen in https://github.com/pytorch/pytorch/issues/132872#issuecomment-2314574656

With this API, we can mark the offending module as static in detectron2.

Today's world - user-defined nn module int attributes are considered automatic dynamic. Use the API in this PR to make them static if you want.

Alternative design - consider all int attributes of any user-defined nn module class static, and then introduce an API - `torch._dynamo.mark_nn_module_attribute_dynamic`. Defaulting to static is worrying if users have a `counter` in their model that is updated on each forward invocation.
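
A minimal usage sketch, assuming the API is exposed as `torch._dynamo.mark_static` accepting an nn.Module (as the PR title suggests); the exact entry point may differ:

```python
import torch
import torch.nn as nn

class Counter(nn.Module):
    def __init__(self):
        super().__init__()
        self.step = 0  # int attribute that would otherwise become automatic-dynamic

    def forward(self, x):
        return x + self.step

m = Counter()
torch._dynamo.mark_static(m)  # assumption: marks the module's int attributes as static
out = torch.compile(m)(torch.randn(4))
```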

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134713
Approved by: https://github.com/jansel
ghstack dependencies: #134653
2024-08-30 07:01:06 +00:00
a5630239ad [dynamo] Improve minifier error message when fp64 not supported (#134737)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134737
Approved by: https://github.com/anijain2305
2024-08-30 06:42:32 +00:00
1011e0ae98 Generalize devices specific UTs for dynamo (#130714)
## Motivation
This is a follow-up to PR https://github.com/pytorch/pytorch/pull/126970, adding the facility to run the content on Intel Gaudi devices.
We intend to extend similar generalization to the rest of the content in test/dynamo, which is currently written to work specifically for CUDA devices. Other devices can build on it if support is available.

## Changes
- carve out BERT-related content into another class
- use the instantiate_device_type utility to instantiate this class for devices which support the functionality

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130714
Approved by: https://github.com/anijain2305
2024-08-30 05:02:47 +00:00
7a694f6683 [justknobs] Override __bool__ method (#134799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134799
Approved by: https://github.com/ezyang
2024-08-30 04:54:02 +00:00
75b86b1554 [executorch hash update] update the pinned executorch hash (#134736)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134736
Approved by: https://github.com/pytorchbot
2024-08-30 04:11:51 +00:00
5e8bf29148 [ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133438
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet

Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com>
2024-08-30 03:38:35 +00:00
1f1e2eeb9d [inductor] Install tlparse for test\dynamo\test_structured_trace.py UTs. (#134806)
Install tlparse for test\dynamo\test_structured_trace.py UTs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134806
Approved by: https://github.com/ezyang
2024-08-30 03:16:03 +00:00
0d5f978795 add basic nn modules diff time benchmarks (#134658)
Benchmarks several shapes of basic nn modules, in both eager and inductor.
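
A hypothetical sketch of the kind of module being measured (names, depth, and shapes are assumed here; the real definitions live in the benchmark script), compiled once per backend:

```python
import torch
import torch.nn as nn

class ListOfLinears(nn.Module):
    def __init__(self, n=10, dim=64):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n))

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

x = torch.randn(8, 64)
torch.compile(ListOfLinears(), backend="eager")(x)      # "eager" variant: dynamo only
torch.compile(ListOfLinears(), backend="inductor")(x)   # "inductor" variant: full compile
```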

```
collecting compile time instruction count for basic_modules_ListOfLinears_inductor
compile time instruction count for iteration 0 is 48602516013
compile time instruction count for iteration 1 is 20424350269
compile time instruction count for iteration 2 is 20440350455
compile time instruction count for iteration 3 is 20419269999
compile time instruction count for iteration 4 is 20430782200
compile time instruction count for iteration 5 is 20455049622
compile time instruction count for iteration 6 is 20157290712
compile time instruction count for iteration 7 is 20455324001
compile time instruction count for iteration 8 is 20450158317
compile time instruction count for iteration 9 is 20492987748
collecting compile time instruction count for basic_modules_ListOfLinears_eager
compile time instruction count for iteration 0 is 961328334
compile time instruction count for iteration 1 is 958887896
compile time instruction count for iteration 2 is 958792214
compile time instruction count for iteration 3 is 958375977
compile time instruction count for iteration 4 is 958568525
compile time instruction count for iteration 5 is 958152305
compile time instruction count for iteration 6 is 959322800
compile time instruction count for iteration 7 is 958332703
compile time instruction count for iteration 8 is 958092100
compile time instruction count for iteration 9 is 958095277
collecting compile time instruction count for basic_modules_ModuleForwardHasGraphBreak_inductor
compile time instruction count for iteration 0 is 3572145793
compile time instruction count for iteration 1 is 3503323973
compile time instruction count for iteration 2 is 3501962432
compile time instruction count for iteration 3 is 3501746084
compile time instruction count for iteration 4 is 3500687361
compile time instruction count for iteration 5 is 3822254676
compile time instruction count for iteration 6 is 3498356846
compile time instruction count for iteration 7 is 3499019157
compile time instruction count for iteration 8 is 3500780314
compile time instruction count for iteration 9 is 3500257458
collecting compile time instruction count for basic_modules_ModuleForwardHasGraphBreak_eager
compile time instruction count for iteration 0 is 1844838754
compile time instruction count for iteration 1 is 1843476862
compile time instruction count for iteration 2 is 1844761450
compile time instruction count for iteration 3 is 1845371742
compile time instruction count for iteration 4 is 1845159665
compile time instruction count for iteration 5 is 1845035802
compile time instruction count for iteration 6 is 1844895007
compile time instruction count for iteration 7 is 1844697922
compile time instruction count for iteration 8 is 1844780885
compile time instruction count for iteration 9 is 1844493990
collecting compile time instruction count for basic_modules_SequentialWithDuplicatedModule_inductor
compile time instruction count for iteration 0 is 1597839479
compile time instruction count for iteration 1 is 1348225351
compile time instruction count for iteration 2 is 1347340818
compile time instruction count for iteration 3 is 1348170800
compile time instruction count for iteration 4 is 1348637747
compile time instruction count for iteration 5 is 1678366444
compile time instruction count for iteration 6 is 1348412420
compile time instruction count for iteration 7 is 1348461578
compile time instruction count for iteration 8 is 1347420149
compile time instruction count for iteration 9 is 1349748195
collecting compile time instruction count for basic_modules_SequentialWithDuplicatedModule_eager
compile time instruction count for iteration 0 is 137721777
compile time instruction count for iteration 1 is 139065517
compile time instruction count for iteration 2 is 137130552
compile time instruction count for iteration 3 is 137506030
compile time instruction count for iteration 4 is 137089838
compile time instruction count for iteration 5 is 137477395
compile time instruction count for iteration 6 is 138550452
compile time instruction count for iteration 7 is 137568409
compile time instruction count for iteration 8 is 136968468
compile time instruction count for iteration 9 is 137481664
collecting compile time instruction count for basic_modules_ModuleComparison_inductor
compile time instruction count for iteration 0 is 917209684
compile time instruction count for iteration 1 is 899154426
compile time instruction count for iteration 2 is 898145079
compile time instruction count for iteration 3 is 899817018
compile time instruction count for iteration 4 is 899184687
compile time instruction count for iteration 5 is 898172885
compile time instruction count for iteration 6 is 899958951
compile time instruction count for iteration 7 is 899348186
compile time instruction count for iteration 8 is 897745404
compile time instruction count for iteration 9 is 899581123
collecting compile time instruction count for basic_modules_ModuleComparison_eager
compile time instruction count for iteration 0 is 113165302
compile time instruction count for iteration 1 is 112724376
compile time instruction count for iteration 2 is 112774611
compile time instruction count for iteration 3 is 114465211
compile time instruction count for iteration 4 is 112689572
compile time instruction count for iteration 5 is 112726465
compile time instruction count for iteration 6 is 112853691
compile time instruction count for iteration 7 is 112295238
compile time instruction count for iteration 8 is 114022136
compile time instruction count for iteration 9 is 112664932
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134658
Approved by: https://github.com/anijain2305
ghstack dependencies: #133834, #134635, #134649, #134652
2024-08-30 02:13:52 +00:00
a645a18d2e [reland][dtensor][MTPG] make sharding prop lru cache not shared among threads (#134509)
**Summary**
reland of https://github.com/pytorch/pytorch/pull/134294

Fixes #131446
Fixes #126852
Fixes #126868
Fixes #126493

The PR was reverted due to CI red signal in https://github.com/pytorch/pytorch/actions/runs/10537099590/job/29201744658. It seems that the `gaussian_nll_loss` test had been flaky before my original PR #134294 . Therefore this PR also removes the `xfail` mark on this specific test to make CI signal green.

See the error message below:
```
2024-08-24T13:42:01.3228990Z ==================================== RERUNS ====================================
2024-08-24T13:42:01.3229530Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3229710Z Unexpected success
2024-08-24T13:42:01.3230235Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3230407Z Unexpected success
2024-08-24T13:42:01.3230594Z =================================== FAILURES ===================================
2024-08-24T13:42:01.3231128Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3231296Z Unexpected success
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134509
Approved by: https://github.com/tianyu-l, https://github.com/wz337
2024-08-30 02:13:45 +00:00
27ffa67984 Support __class__ attr for tuple and list variables (#134099)
Fixes #134086

This supports the __class__ attribute for TupleVariable and ListVariable, and allows constructing a tuple or list by using the __class__ attribute. This patch also fixes a bug in NamedTupleVariable, which was missing a return when calling the super var_getattr.
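
An assumed illustration (not taken from the PR's test suite) of the pattern this enables inside a compiled function:

```python
import torch

@torch.compile(backend="eager", fullgraph=True)
def f(x):
    t = (x, x + 1)
    cls = t.__class__          # reading __class__ on a tuple variable
    return cls((x - 1,)) + t   # constructing a new tuple via __class__

print(f(torch.randn(3)))
```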

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134099
Approved by: https://github.com/anijain2305, https://github.com/jansel
2024-08-30 01:57:49 +00:00
cf11fc0dcb dynamo: Only log if we've disabled eval_frame once. (#134529)
This spams logs pretty badly otherwise

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134529
Approved by: https://github.com/chuanhaozhuge, https://github.com/oulgen
2024-08-30 00:35:25 +00:00
8b68912dfc Correctly detect "Rate limit exceeded" error (#134785)
Currently all 403 errors are treated as "Rate limit exceeded":
https://github.com/pytorch/pytorch/actions/runs/10622019167/job/29445336924

[Github docs](https://docs.github.com/en/rest/using-the-rest-api/rate-limits-for-the-rest-api?apiVersion=2022-11-28#exceeding-the-rate-limit) claim:
> If you exceed your primary rate limit, you will receive a 403 or 429 response, and the x-ratelimit-remaining header will be 0. You should not retry your request until after the time specified by the x-ratelimit-reset header.

After this change:
https://github.com/pytorch/pytorch/actions/runs/10622365327/job/29446456395

Note, the 403 error in the jobs above is a separate issue, this PR addresses only the logging.
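
A minimal illustrative sketch (not the repo's actual CI tooling) of distinguishing a genuine rate-limit response from other 403s using the documented header:

```python
import requests

resp = requests.get("https://api.github.com/repos/pytorch/pytorch")
if resp.status_code in (403, 429) and resp.headers.get("x-ratelimit-remaining") == "0":
    # Only this combination indicates an exceeded rate limit per the GitHub docs
    print("Rate limit exceeded; retry after", resp.headers.get("x-ratelimit-reset"))
elif not resp.ok:
    print("Request failed with status", resp.status_code)
```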
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134785
Approved by: https://github.com/clee2000
2024-08-29 23:58:15 +00:00
3402a5d865 fix windows xpu build issue (#133845)
# Motivation
If XPU is built via oneAPI 2024.2, the build will fail because `sycl-preview.lib` exists on Windows, and linking the unexpected lib results in `error LNK2019: unresolved external symbol`.

# Solution
Explicitly use `sycl-preview` in the Linux build only.

# Additional Context
For `find_library`, please note that the variable will not be updated if it has been stored.
```
If the library is found the result is stored in the variable and the search will not be repeated unless the variable is cleared.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133845
Approved by: https://github.com/min-jean-cho, https://github.com/EikanWang, https://github.com/atalman, https://github.com/malfet
2024-08-29 23:53:32 +00:00
3775fc982d [Inductor][CPP] Fix Index name error (#134645)
**Summary**

Fix the issue from the comment: https://github.com/pytorch/pytorch/pull/122961#issuecomment-2313930242. For all of the cases we see in the 3 test suites (TorchBench, TIMM, HuggingFace) we expect:

* `_node` is a FX Node with target in ["index_expr", "load", "store"]
* `_node.args[1 if _node.target == "index_expr" else 2]` is another FX node with target `get_index`
* `_node.args[1 if _node.target == "index_expr" else 2].args[0]` is a str for the name of this index expression

This does not seem to hold for some FB-internal test case, based on the failure log posted in the above link. So, add a condition check to work around it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134645
Approved by: https://github.com/jgong5, https://github.com/masnesral
2024-08-29 23:33:15 +00:00
d13ce2e2b5 [c10d] release gil lock during eager init (#134779)
Summary:
We found that if we init the PG in a background thread, it blocks
the main thread until init is complete. This is because in the pybinding
we never release the GIL.
Test Plan:
existing CI on eager init

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134779
Approved by: https://github.com/c-p-i-o
2024-08-29 23:25:33 +00:00
71ff168dbb pytorch: llvm_codegen: prefix JIT generated functions with 8B of data so jitted code can be called from ASAN+UBSAN on LLVM17 (llvm/llvm-project#65253) (#134572)
Summary:
A similar workaround was already applied elsewhere in pytorch: https://github.com/pytorch/pytorch/pull/133623 {D61348865}

LLVM17 UBSAN change discussion https://github.com/llvm/llvm-project/issues/104505

Here we also have to associate the data with the function via `setPrefixData(dummyPrefixData)` to prevent this workaround from being disabled by the `optimize(*module_);` call, which could change the layout / remove the unused variable / etc.

Differential Revision: D61845799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134572
Approved by: https://github.com/atalman
2024-08-29 23:15:13 +00:00
496e57283d add add_loop benchmarks (#134652)
This benchmark measures the cost of compiling the following function in eager and inductor;
it's basically two benchmarks.

```
@torch.compile(backend=self.backend, fullgraph=True)
def f(a, b):
    result = a.clone()
    for i in range(1000):
        if i % 3 == 0:
            result = result + b
        elif i % 3 == 1:
            result = result + 8 * b
        else:
            result = result.sin()
    return result
```

 PYTHONPATH=$(pwd) python benchmarks/add_loop.py out
 ```
collecting compile time instruction count for add_loop_eager
compile time instruction count for iteration 0 is 8286649663
compile time instruction count for iteration 1 is 2838971338
compile time instruction count for iteration 2 is 2834263023
compile time instruction count for iteration 3 is 2829447493
compile time instruction count for iteration 4 is 2830904231
compile time instruction count for iteration 5 is 2830281077
compile time instruction count for iteration 6 is 2831466595
compile time instruction count for iteration 7 is 2830732164
compile time instruction count for iteration 8 is 2831088056
compile time instruction count for iteration 9 is 2831204407

collecting compile time instruction count for add_loop_inductor
compile time instruction count for iteration 0 is 32585687849
compile time instruction count for iteration 1 is 11747553436
compile time instruction count for iteration 2 is 11746959875
compile time instruction count for iteration 3 is 11749479461
compile time instruction count for iteration 4 is 11750053711
compile time instruction count for iteration 5 is 11750793958
compile time instruction count for iteration 6 is 11751673576
compile time instruction count for iteration 7 is 11754552912
compile time instruction count for iteration 8 is 11753723127
compile time instruction count for iteration 9 is 11759059942
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134652
Approved by: https://github.com/anijain2305
ghstack dependencies: #133834, #134635, #134649
2024-08-29 23:04:01 +00:00
65864d0134 [c10d] Remove Option for ProcessGroup and Expose backend Options to reflect the correct code structure (#132931)
We introduced the dispatchable backend for a ProcessGroup and collectives in https://github.com/pytorch/pytorch/issues/86225. This PR is a follow-up cleanup that removes the Options from ProcessGroup and asks users to either set the timeout or backend later on, or to directly create a backend after creating a PG.

Also, PGNCCL is using the Options class from ProcessGroup, but we actually should use the Options from the Backend class. So this PR aligns the type and name with what we are doing on the C++ side. I don't change the signature of the public API, so it still uses args named "pg_options".

We need to make changes to the test to make it aligned with the change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132931
Approved by: https://github.com/H-Huang
2024-08-29 22:40:12 +00:00
8b4c487581 Fix AOTInductor compilation on ROCm (#134522)
Summary:
The original PR (https://github.com/pytorch/pytorch/pull/124123) was broken by the cpp_builder refactoring.

So resubmit it to fix the issue.

Test Plan: Test with command here: https://www.internalfb.com/phabricator/paste/view/P1549765548

Differential Revision: D61827208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134522
Approved by: https://github.com/frank-wei
2024-08-29 21:59:04 +00:00
1e92d7b688 [inductor] move loop ordering after fusion (#126254)
Restart the work from PR https://github.com/pytorch/pytorch/pull/100331 in this new PR since it's hard to rebase. It is expected that some code is copy/pasted from the previous PR and the main idea is the same.

Previously we saw a relatively large compilation time increase due to too many loop orders being considered. This PR continues the work by pruning and only considering loop orders that we know for sure are relevant (i.e., doing it on demand).

Some manually created cases where loop ordering matters are added as unit tests. The PR makes sure inductor does not miss fusion opportunities for them.

This PR should solve the unable-to-fuse problem in https://github.com/pytorch/pytorch/issues/130015.

Right now there is still a significant increase in compilation time, so I'll disable the feature by default. Later on, after the compilation time issue is resolved, I'll enable it by default.
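
A minimal sketch of opting in while the feature is off by default (the config name is assumed from this PR's description; check torch/_inductor/config.py for the exact flag):

```python
import torch
import torch._inductor.config as inductor_config

# assumption: flag controlling the loop-reordering-after-fusion pass
inductor_config.loop_ordering_after_fusion = True

@torch.compile
def f(x):
    return (x.transpose(0, 1) + 1).sum(dim=0)

f(torch.randn(64, 128))
```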

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126254
Approved by: https://github.com/jansel
2024-08-29 21:50:07 +00:00
416a7894fe [Windows][XPU] Disable Kineto PTI on Windows only (#134620)
Disable Kineto + XPU PTI on Windows only.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134620
Approved by: https://github.com/guangyey, https://github.com/malfet

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-08-29 20:58:55 +00:00
7d12e6dceb [dynamo][itertools] refactor itertools.islice to use polyfill (#133876)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133876
Approved by: https://github.com/jansel
ghstack dependencies: #133769, #133778, #133779, #133864, #133894
2024-08-29 20:56:16 +00:00
a2566adfb6 [dynamo] refactor builtins.enumerate to use polyfill (#133894)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133894
Approved by: https://github.com/jansel
ghstack dependencies: #133769, #133778, #133779, #133864
2024-08-29 20:56:16 +00:00
1b70366957 [dynamo][itertools] refactor itertools.chain and itertools.chain.from_iterable to use polyfills (#133864)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133864
Approved by: https://github.com/jansel
ghstack dependencies: #133769, #133778, #133779
2024-08-29 20:56:16 +00:00
eaa449fbf0 [dynamo] simplify implementation for builtins.sum (#133779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133779
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #133769, #133778
2024-08-29 20:56:16 +00:00
b5f1ffa7ab [dynamo] simplify implementation for functools.reduce (#133778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133778
Approved by: https://github.com/jansel, https://github.com/anijain2305
ghstack dependencies: #133769
2024-08-29 20:56:16 +00:00
e09324e7da [dynamo] simplify polyfill registration for builtins.all and builtins.any (#133769)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133769
Approved by: https://github.com/jansel
2024-08-29 20:56:16 +00:00
b977abd5de [Inductor] Fix error checking for scaled_mm lowering (#134765)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134765
Approved by: https://github.com/Skylion007
2024-08-29 20:18:42 +00:00
6180574771 Move py 3.8->3.9 pull, trunk, inductor, periodic CI tests (#133624)
Part of the deprecation of Python 3.8 and the move to 3.9. Related to: https://github.com/pytorch/pytorch/issues/120718
XPU and ROCm jobs are excluded.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133624
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/ZainRizvi
2024-08-29 19:15:59 +00:00
202e5cc87d [inductor] Fix error in debug_str_extra (#134747)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134747
Approved by: https://github.com/Skylion007, https://github.com/shunting314
2024-08-29 19:09:50 +00:00
43e1df64f8 register all entry_point backends on first attempt (#132546)
fixes: https://github.com/pytorch/pytorch/issues/131360

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132546
Approved by: https://github.com/jansel
2024-08-29 18:59:29 +00:00
5470fcd5b9 [5/N] Reconcile barrier and NaN checker (#134707)
By using a zeros() tensor instead of empty() tensor.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134707
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
ghstack dependencies: #134345, #134357, #134701
2024-08-29 18:51:12 +00:00
d91b49dbaa expandable_segments <-> other allocator options (#134338)
Previously, setting garbage_collection_threshold or max_split_size_mb along with expandable_segments:True could cause the allocator to hit assert failures when running nearly out of memory. This PR ensures garbage-collection and max-split freeing do not accidentally try to release expandable segments.
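
A minimal sketch of the option combination this makes safe to use together (illustrative threshold/size values; the variable must be set before the first CUDA allocation):

```python
import os
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = (
    "expandable_segments:True,"
    "garbage_collection_threshold:0.8,"
    "max_split_size_mb:128"
)
import torch

# With this PR, GC and max-split freeing no longer try to release expandable segments.
x = torch.empty(1024, 1024, device="cuda")
```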

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134338
Approved by: https://github.com/ezyang
2024-08-29 18:43:59 +00:00
3fc6e47d42 [AOTI] Fix cosmetic indentation issue in cuda cpp wrapper codegen for DeferredCudaKernelLine/GridLine (#134705)
Summary:
Follow-up fix for D61018114, D61800622.

Increase indentation for the `loadKernel`, `launchKernel`, and `Grid` lines.

Test Plan:
```
TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_zero_grid_with_unbacked_symbols_abi_compatible_cuda
```
```
TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_zero_grid_with_backed_symbols_abi_compatible_cuda
```

Differential Revision: D61927248

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134705
Approved by: https://github.com/ColinPeppler
2024-08-29 18:38:45 +00:00
5573c17877 [BE][Ez]: Update ruff to 0.6.3 (#134769)
Mostly a bugfix release; updating because it fixes an edge case with a rule we are using.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134769
Approved by: https://github.com/albanD
2024-08-29 18:35:47 +00:00
ce96146623 [PT2] Fix node metadata setting in group_batch_fusion_aten (#134543)
Summary: The current impl results in `meta` missing fields like `val`; use `FakeTensorProp` to update the information.

Differential Revision: D61832932

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134543
Approved by: https://github.com/frank-wei
2024-08-29 18:32:04 +00:00
348d02a983 Changed masked out rows logsumexp to be -inf and not zero (#134650)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134650
Approved by: https://github.com/yanboliang, https://github.com/BoyuanFeng, https://github.com/drisspg
2024-08-29 17:22:52 +00:00
36a6516290 [export] use single FQN for param_buffer_mapping (#134500)
Fixes #133252

In strict mode, we have this routine for mapping traced parameters to their FQNs using tensor ids. Currently we assume there's at least 1 unique FQN for each traced parameter, but this seems to break with parameter reuse when call_module nodes are present. Adding a test case where this breaks.

Fixes this by assigning the same FQN to all traced parameters with the same tensor id. This is fine because we return the original state_dict for the EP, and the unflattener has its own routine of handling aliasing: https://github.com/pytorch/pytorch/pull/125758
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134500
Approved by: https://github.com/angelayi
2024-08-29 17:06:31 +00:00
d9d95dc55e [4/N] Test NaN checker against broadcast (#134701)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134701
Approved by: https://github.com/wconstab
ghstack dependencies: #134345, #134357
2024-08-29 17:00:07 +00:00
ab646cd805 Revert "[reland][dtensor][MTPG] make sharding prop lru cache not shared among threads (#134509)"
This reverts commit ba5aec88c678fe4b9ad101602c29726724f56e21.

Reverted https://github.com/pytorch/pytorch/pull/134509 on behalf of https://github.com/ZainRizvi due to Sorry but this fails internally. For details see D61953754 ([comment](https://github.com/pytorch/pytorch/pull/134509#issuecomment-2318323161))
2024-08-29 16:39:19 +00:00
26aea277f7 [3/N] Set correct device to CUDA guards (#134357)
In `collective()`, `pointToPoint()` and `collectiveCoalesced()`, CUDA guards were created with an unset (default) CUDA device. This is the reason for the IMA hit by the NaN checker in issue https://github.com/pytorch/pytorch/issues/134062.

With this fix, `torch.cuda.set_device(device)` is not needed to work around the IMA.

Also refactored a couple places where the guard is created -- preferably we create the guard with a known device, rather than setting the device later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134357
Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang
ghstack dependencies: #134345
2024-08-29 16:25:27 +00:00
d503217ea4 [inductor] calibration inductor windows uts (15/N) (#134586)
Fix the `test_logs_out` UT on Windows; make all UTs in `test/dynamo/test_logging.py` pass on Windows.

Changes:
1. Close `NamedTemporaryFile` to release the file handle and avoid a PermissionError.
2. Create the temp file with `delete=False` so it is not auto-deleted (also to avoid the PermissionError).
3. Open the log file as "utf-8" to align with Linux.
4. Handle the process wrap difference on Windows.
5. Delete the tmp file manually.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134586
Approved by: https://github.com/jansel
2024-08-29 16:18:40 +00:00
9953f55f4c [2/N] Add flag to control which rank should perform NaN check (#134345)
Fixes https://github.com/pytorch/pytorch/issues/134062.
For example, in case of broadcast / scatter, only the root rank should perform the NaN check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134345
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
2024-08-29 16:13:15 +00:00
387d3fc296 [AOTI] Switch benchmarking to use export non-strict mode (#130977)
Summary: Switch the export part used by AOTInductor benchmarking from strict to non-strict, and switch it from producing torch IR to aten IR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130977
Approved by: https://github.com/angelayi
ghstack dependencies: #134639
2024-08-29 16:08:52 +00:00
0dbc72887b [CPU][flash attention] make the stride of output align with input (#134656)
Fixes #133671

Currently, the output of CPU flash attention has a fixed layout, no matter what the input is. This PR makes the output stride align with the input q/k/v, which matches the behavior of the math backend.
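
A minimal sketch (illustrative shapes) of the expected behavior after this change, where the output layout follows the inputs rather than a fixed contiguous layout:

```python
import torch
import torch.nn.functional as F

# q/k/v with non-default strides, e.g. produced by a transpose
q = torch.randn(2, 16, 4, 8).transpose(1, 2)
k = torch.randn(2, 16, 4, 8).transpose(1, 2)
v = torch.randn(2, 16, 4, 8).transpose(1, 2)

out = F.scaled_dot_product_attention(q, k, v)
print(q.stride(), out.stride())  # expected to match after this change
```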

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134656
Approved by: https://github.com/jgong5, https://github.com/drisspg
2024-08-29 16:04:25 +00:00
4fcd15a667 Fix test_sgd_weight_decay_xpu accuracy error (#134744)
Fixes #134743

This PR adds `test_sgd_weight_decay_xpu` to `KERNEL_COUNT_OVERRIDES` as an override.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134744
Approved by: https://github.com/EikanWang, https://github.com/desertfire
2024-08-29 15:12:40 +00:00
594162f7ab [dynamo] Support reading attributes from pybind objects (#134630)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134630
Approved by: https://github.com/jansel
2024-08-29 15:06:52 +00:00
92e38a476f preserve aten::to device in export training (#134622)
Summary:
With the training IR, we cannot rely on trapping `to()` in `FunctionalTensor` because the regular decomposition kicks in first, and that can cause it to be optimized away.

So instead we preserve it until we functionalize, and then replace it explicitly with `_to_copy()`.

Test Plan: expected test failures go away

Differential Revision: D61883878

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134622
Approved by: https://github.com/zhxchen17, https://github.com/tugsbayasgalan
2024-08-29 14:53:30 +00:00
092349dcdd Never CSE aten.empty in the partitioner (#134703)
aten.empty is almost always fusible into its consumer, so we never CSE
it. This fixes a bug that looks like the following:

```py
@torch.library.custom_op("_reinplacing::sin_cos", mutates_args={"out_sin", "out_cos"})
def sin_cos(x: torch.Tensor, out_sin: torch.Tensor, out_cos: torch.Tensor) -> None:
    out_sin.copy_(x.sin())
    out_cos.copy_(x.cos())

@torch.compile
def f(x):
    out0 = torch.empty_like(x)
    out1 = torch.empty_like(x)
    sin_cos(x, out0, out1)
    return x.clone(), out0, out1

x = torch.randn(3, requires_grad=True)
f(x)
```

- cse would de-duplicate the empty nodes
- reinplacing would add an additional clone (because it can't write to
  both tensors at the same time)
- the clone lowers into a new buffer + a copy_ kernel
- the copy_ kernel is unnecessary because "empty" is special - all reinplacing needed was an additional
  buffer, it doesn't matter what the values are.

We could attempt to fix this on the reinplacing side but this seemed
better as a partitioner heuristic and the reinplacing fix is a bit more
tricky (we'd need to identify that the op never reads from the empty
node).

Test Plan:
- new test (the old number was 27, the new number is 21, so this PR
  helped).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134703
Approved by: https://github.com/yf225
ghstack dependencies: #134466, #134490, #134491
2024-08-29 13:51:19 +00:00
70853b792a [dynamo][itertools] support itertools.tee (#133771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133771
Approved by: https://github.com/jansel
ghstack dependencies: #133801
2024-08-29 13:36:52 +00:00
9e806c1a60 [dynamo] simplify implementation for os.fspath (#133801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133801
Approved by: https://github.com/anijain2305
2024-08-29 13:36:52 +00:00
d01a7a9faa [dynamo] Graph break on FSDP flat_param inconsistent tensor and grad dtype (#134614)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134614
Approved by: https://github.com/awgu, https://github.com/yf225
ghstack dependencies: #134610, #134590, #134621
2024-08-29 09:14:42 +00:00
fb35d1e01f [raland][dynamo][exceptions] Support raise from None (#134621)
The PR was reverted because this PR traced more code and surfaced a latent bug. Resubmitting w/o any changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134621
Approved by: https://github.com/jansel
ghstack dependencies: #134610, #134590
2024-08-29 09:14:42 +00:00
2bf622685d [dynamo][dicts] Support hasattr on dicts (#134590)
Fixes - https://github.com/pytorch/pytorch/issues/134577
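
An assumed minimal repro of the now-supported pattern (not copied from the linked issue):

```python
import torch

@torch.compile(backend="eager", fullgraph=True)
def f(x, d):
    if hasattr(d, "keys"):  # hasattr on a dict no longer causes a graph break
        return x + len(d)
    return x

print(f(torch.randn(2), {"a": 1, "b": 2}))
```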

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134590
Approved by: https://github.com/Skylion007
ghstack dependencies: #134610
2024-08-29 09:14:42 +00:00
2446dead35 [dynamo][exceptions] Use exception subclass whenever possible (#134610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134610
Approved by: https://github.com/drisspg, https://github.com/jansel
2024-08-29 09:14:42 +00:00
cfb642bb6b [DTensor] Extend implicit replication to replicate DTensor for foreach ops so model doesn't have to be fully tp-ed when using 2D (#134551)
Fixes [134212](https://github.com/pytorch/pytorch/issues/134212)

Currently, when we use 2D FSDP with TP, `optimizer.step()` fails if the model is not fully tensor-parallelized. If the entire model is not tensor-parallelized when doing 2D, we have both 1D and 2D DTensor parameters. As foreach is turned on by default, `optimizer.step()` fails because cross-mesh ops are not allowed. The error is as follows:

```
NotImplementedError: aten._foreach_mul_.Scalar: DTensor does not support cross-mesh operation yet!Got meshes: DeviceMesh('cuda', [[0, 1], [2, 3]], mesh_dim_names=('dp', 'tp')) DeviceMesh('cuda', [1, 3], mesh_dim_names=('dp',))
```

In this PR, we extend implicit_replication to replicate DTensor in missing dimensions for foreach ops. If users don't want to fully tensor parallelize the model when using 2D, they have the option of using the `implicit_replication()` context manager for `optimizer.step()`. In this case, we would swap out the 1D DTensorSpec and replace it with 2D DTensorSpec. However, we don't want to turn this on by default yet, as we want the users to be aware that the tp dimension is replicated if a layer is not tp-ed.

With implicit replication turned on, trying to replicate the DTensor spec in the missing dimension works for most foreach cases, except when the first DTensor in the list is one that also needs to be replicated. This is currently a limitation for which I don't have a good solution yet. Currently, with this change, we can handle most of the cases except when the first DTensor's ndim is not the largest.
```
[2D_DTensor, 1D_DTensor...] ---> Implicit_replication() can handle this.
[1D_DTensor, 2D_DTensor...] ---> Implicit_replication() can't handle this.
```

This change doesn't affect the existing default behavior, as `implicit_replication()` is not turned on by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134551
Approved by: https://github.com/tianyu-l
2024-08-29 09:01:31 +00:00
3645634f3c [1/N] Move NaN check onto NCCL stream (#134300)
This way, the tensor's lifetime management is the same as the management built for the NCCL pre- and post-kernels.
Also, on visualizers the checks then show up in the NCCL stream line; otherwise, if they showed up in the compute line, users might get confused ("my code does not have these kernels").

The check is thus moved to after the point where we make the NCCL stream depend on the last compute kernel.

Also moved declaration of `checkForNan` from Utils.hpp to NCCLUtils.hpp, and renamed Utils.cu to NCCLUtils.cu.

Differential Revision: [D61957573](https://our.internmc.facebook.com/intern/diff/D61957573)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134300
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
2024-08-29 08:28:49 +00:00
578b8d75e5 [2nd try][Traceable FSDP2] Allow tracing through FSDP2 impl in trace_rules.py (#134539)
The previous PR https://github.com/pytorch/pytorch/pull/133532 caused a stuck-compilation issue on internal models. In this 2nd attempt PR, we gate the trace_rules.py changes with `if not torch._dynamo.config.skip_fsdp_hooks:`, so that they don't take effect for current graph-break FSDP2 (which relies on the default config value `skip_fsdp_hooks=True`), and will only take effect when we are using Traceable FSDP2 (in which case the user needs to proactively set `skip_fsdp_hooks=False`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134539
Approved by: https://github.com/ckluk2, https://github.com/yanboliang
2024-08-29 06:28:16 +00:00
834d8b0965 [Inductor][mkldnn] Bug fix: incorrect codegen arg order for qconv (#134579)
Fixes #133448

The arg order for mkldnn qconv IR became incorrect after PR #132367 . This PR fixes the bug.

**Test plan**
`python test/inductor/test_mkldnn_pattern_matcher.py -k qconv`
`python test/inductor/test_cpu_cpp_wrapper.py -k qconv`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134579
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-08-29 06:20:52 +00:00
b0a6d9ad27 [DTensor] Add pointwise ops strategy for aten.isinf, aten.isneginf, aten.isposinf (#134699)
Fixes #ISSUE_NUMBER

Need it for https://github.com/facebookresearch/optimizers/blob/main/distributed_shampoo/utils/shampoo_preconditioner_list.py#L671

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134699
Approved by: https://github.com/tianyu-l
2024-08-29 06:01:12 +00:00
da9e61ef70 Get accumulate dtype for Intel GPU (#134465)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

There are two function variants to get accumulated dtype for a given dtype:

- Func1: `c10::ScalarType toAccumulateType(c10::ScalarType type, c10::DeviceType device)`
- Func2: `c10::ScalarType toAccumulateType(c10::ScalarType type, bool is_cuda)`

Func1 is general enough to support different devices, while Func2 only supports CUDA and CPU. This PR intends to add the Intel GPU path to Func1, and we expect users to invoke Func1 to ensure compatibility across devices.

* __->__ #134465

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134465
Approved by: https://github.com/Skylion007, https://github.com/atalman
2024-08-29 05:27:57 +00:00
94db935749 Add torch.serialization.skip_data context manager (#134504)
## Semantic

The semantic is
(1) By default `torch.serialization.skip_data(materialize_fake_tensors=False)` will make `torch.save` skip writing storages (but reserve space for them in the checkpoint).

```python
import torch
import torch.nn as nn

sd = nn.Linear(3, 5).state_dict()
with torch.serialization.skip_data():
    torch.save(sd, 'foo.pt')
print(torch.load('foo.pt', weights_only=True))
```

(2) With `torch.serialization.skip_data(materialize_fake_tensors=True)`, if a FakeTensor is passed to `torch.save`, the pickler will treat these FakeTensors as being "materialized": space will be reserved in the checkpoint for the associated storage bytes, and when loading, the type will be Tensor instead of FakeTensor.

```python
import torch
import torch.nn as nn
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    m = nn.Linear(3, 5, dtype=torch.float16, device='cuda')

sd = m.state_dict()
with torch.serialization.skip_data(materialize_fake_tensors=True):
    torch.save(sd, 'bla.pt')
print(torch.load('bla.pt', weights_only=True))
# OrderedDict([('weight', tensor([[0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.]], device='cuda:0', dtype=torch.float16)), ('bias', tensor([0., 0., 0., 0., 0.], device='cuda:0', dtype=torch.float16))])

```

## Follow Ups

- [ ] `torch.load` semantic for skip_data context manager
- [ ] Mechanism for getting offsets of storages saved via this method (for writing in a separate pass)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134504
Approved by: https://github.com/albanD
2024-08-29 04:52:52 +00:00
297b42012d [PyTorch] Use pinned memory for zero_cuda_out (#134712)
Summary: This diff creates a pinned tensor for copying from device for the zero_out op.

Differential Revision: D61759262

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134712
Approved by: https://github.com/zyan0
2024-08-29 04:46:08 +00:00
a32255481b [caffe2][hipify] remove un-used flag from pybind_utils.h (#134404)
Summary:
Encountered issues related to AMD build when working on https://www.internalfb.com/diff/D60739324?dst_version_fbid=2203158110057105 (see stack trace P1545717562)

Looking at the file history, it seems that the flag is no longer used, so I propose to remove it. Alternatively, I could change the `#ifdef` to check both `USE_C10D_NCCL` and `USE_ROCM` and include the corresponding AMD header files.

Let me know which way is preferred.

Test Plan: Sandcastle

Differential Revision: D61762129

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134404
Approved by: https://github.com/malfet
2024-08-29 04:09:44 +00:00
4655eb3ee2 Uses MemPoolContext to route allocations from CUDACachingAllocator (#134685)
Re-open of https://github.com/pytorch/pytorch/pull/133599 that was mistakenly closed by issuing `ghstack land`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134685
Approved by: https://github.com/ezyang
2024-08-29 03:56:31 +00:00
4b4ba7ab06 [NJT] Support NJT SDPA + meta-device flop counting (#134289)
A user wants to use the flop counter with meta devices. This previously caused problems for SDPA+NJT:

1. autocast check: `torch.is_autocast_enabled("meta")` fails because `meta` is not valid for autocasting. If we skip this, we run into the next error
2. math backend: conversion to NST requires getting concrete offsets in a list of python integers, which doesn't work on a meta tensor b2eb0e8c6a/torch/nested/_internal/sdpa.py (L809-L815)
3. (fixed in the previous PR, #134288) - if we force using flash attention backend for flop counting, `_flash_attention_forward` previously didn't support meta tensors.

In this PR, we check specifically for FlopCounterMode, and, if it's enabled and combined with meta tensors, (a) skip autocasting and (b) force it down the flash attention path. This isn't generally safe for tracing (e.g. if you actually care which kernels you are running), but in the absence of actual device information, we have to make some assumptions. By specifically checking for FlopCounterMode, this should reduce the chance of unintended side effects for other meta tensor users.

Note: fake tensor would solve a bunch of these issues, but it's not a viable solution right now for the user.
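
A minimal sketch of the general pattern with dense meta tensors (the PR specifically fixes the nested-jagged path; shapes here are illustrative):

```python
import torch
from torch.utils.flop_counter import FlopCounterMode

q = torch.randn(2, 8, 128, 64, device="meta", dtype=torch.float16)
k = torch.randn(2, 8, 128, 64, device="meta", dtype=torch.float16)
v = torch.randn(2, 8, 128, 64, device="meta", dtype=torch.float16)

with FlopCounterMode(display=False) as counter:
    torch.nn.functional.scaled_dot_product_attention(q, k, v)
print(counter.get_total_flops())  # counted without any real device work
```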

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134289
Approved by: https://github.com/soulitzer
ghstack dependencies: #134288
2024-08-29 03:43:42 +00:00
17e9c2d1e7 Add oneDNN support for Half LSTM on CPU (#132607)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132607
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-08-29 03:40:10 +00:00
41e36e2b46 Reflect check_labels status as a signal (#134711)
Fixes the workflow when meta-exported diff (co-dev) doesn't have the required labels, but the signal is suppressed due to job failure (e.g. [see this run](https://github.com/pytorch/pytorch/actions/runs/10590994706/job/29347663526?pr=134484)).

With this change the workflow status correctly reflects the status of the check.

# Testing
* [illegal pr_num](https://github.com/pytorch/pytorch/actions/runs/10603163898/job/29386843591)
* [successful run](https://github.com/pytorch/pytorch/actions/runs/10603279052/job/29387230110) (topic label present)
* no labels: [check fails](https://github.com/pytorch/pytorch/actions/runs/10603310368/job/29387333864)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134711
Approved by: https://github.com/clee2000
2024-08-29 03:11:16 +00:00
4f9c68454a [inductor]Let output or input_as_strided match exact strides (#130956)
Fixes #130394

TorchInductor doesn't respect the original strides of outputs, which opens up optimization opportunities like changing the memory layout. But in some cases, such as the one in https://github.com/pytorch/pytorch/issues/130394, we do need the output to match the exact required strides. Correctness is the first-priority goal. So, this PR adds a new API `ir.ExternKernel.require_exact_strides(x, exact_strides, allow_padding=False)` to fix the issue. This PR makes dense and non-dense outputs' strides follow the strides required by the semantics.

The comparison of the generated code before and after this fix for the test is below.

```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 128
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 8
    x1 = (xindex // 8)
-   x2 = xindex
    tmp0 = tl.load(in_ptr0 + (x0 + (16*x1)), xmask)
    tmp1 = tmp0 + tmp0
-   tl.store(out_ptr0 + (x2), tmp1, xmask)
+   tl.store(out_ptr0 + (x0 + (16*x1)), tmp1, xmask)

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (16, 8), (16, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
-       buf1 = empty_strided_cuda((16, 8), (8, 1), torch.float32)
+       buf1 = empty_strided_cuda((16, 8), (16, 1), torch.float32)
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_copy_0.run(arg0_1, buf1, 128, grid=grid(128), stream=stream0)
        del arg0_1
    return (buf1, )
```

buf1 is created with the exact stride required by the user, and its values are written with the same stride as the input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130956
Approved by: https://github.com/eellison, https://github.com/blaine-rister, https://github.com/desertfire
2024-08-29 03:06:58 +00:00
4811dc3de9 Revert "[dynamo] simplify polyfill registration for builtins.all and builtins.any (#133769)"
This reverts commit cc3a76edbac4a48381db6ccc44a83927f80c545b.

Reverted https://github.com/pytorch/pytorch/pull/133769 on behalf of https://github.com/ZainRizvi due to Sorry but this has been discovered to be causing a performance regression internally ([comment](https://github.com/pytorch/pytorch/pull/133769#issuecomment-2316620213))
2024-08-29 03:00:47 +00:00
f65df5edae Revert "[dynamo][itertools] support itertools.tee (#133771)"
This reverts commit 1dbd3476de07d7f07489e243cb7a43073e8c25c1.

Reverted https://github.com/pytorch/pytorch/pull/133771 on behalf of https://github.com/ZainRizvi due to Sorry, have to revert this in order to be able to revert https://github.com/pytorch/pytorch/pull/133769 ([comment](https://github.com/pytorch/pytorch/pull/133771#issuecomment-2316611158))
2024-08-29 02:49:30 +00:00
eaec9e80b8 Revert "[dynamo] simplify implementation for os.fspath (#133801)"
This reverts commit 74341e1150f10b8aaddd33a165e686724424071f.

Reverted https://github.com/pytorch/pytorch/pull/133801 on behalf of https://github.com/ZainRizvi due to Sorry, have to revert this in order to be able to revert https://github.com/pytorch/pytorch/pull/133769 ([comment](https://github.com/pytorch/pytorch/pull/133771#issuecomment-2316611158))
2024-08-29 02:49:30 +00:00
76f975948e [inductor] Cleanup generate_node_schedule (#134306)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134306
Approved by: https://github.com/shunting314
2024-08-29 02:45:14 +00:00
cccb121d4e [Inductor] add inductor config: masked_vec (#134566)
This PR adds the inductor config `masked_vec` to enable/disable masked vectorization for the tail loop; it is enabled by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134566
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-08-29 02:29:06 +00:00
c5f114747e fix flakiness in update_hint_benchmark.py (#134649)
```
compile time instruction count for iteration 1 is 10732129038
compile time instruction count for iteration 2 is 10719776783
compile time instruction count for iteration 3 is 10729546868
compile time instruction count for iteration 4 is 10737655132
compile time instruction count for iteration 5 is 10732564252
compile time instruction count for iteration 6 is 10728721234
compile time instruction count for iteration 7 is 10733354271
compile time instruction count for iteration 8 is 10719588972
compile time instruction count for iteration 9 is 10706311856
```
1. Add torch.manual_seed(0); inputs were not the same across iterations.
2. Disable gc.
3. Remove the loop (not needed since compilation happens only once).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134649
Approved by: https://github.com/aorenste
ghstack dependencies: #133834, #134635
2024-08-29 02:22:05 +00:00
f0fceed432 Revert "[dynamo][exceptions] Use exception subclass whenever possible (#134610)"
This reverts commit 880e3d18a406777dbea6aeaf14443b0e3a8b441c.

Reverted https://github.com/pytorch/pytorch/pull/134610 on behalf of https://github.com/ZainRizvi due to Sorry, I had to revert this in order to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/134610#issuecomment-2316568553))
2024-08-29 02:02:12 +00:00
67d7040fce Revert "[dynamo][dicts] Support hasattr on dicts (#134590)"
This reverts commit c566f2465f41b8081caed205fcf5fe973fd970b3.

Reverted https://github.com/pytorch/pytorch/pull/134590 on behalf of https://github.com/ZainRizvi due to Sorry, I had to revert this in order to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/134610#issuecomment-2316568553))
2024-08-29 02:02:12 +00:00
40cebde3bc Revert "[raland][dynamo][exceptions] Support raise from None (#134621)"
This reverts commit e96dc3665a1d48434c02e17f7faed41f779cee2c.

Reverted https://github.com/pytorch/pytorch/pull/134621 on behalf of https://github.com/ZainRizvi due to Sorry, I had to revert this in order to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/134610#issuecomment-2316568553))
2024-08-29 02:02:12 +00:00
c35d1f7b3a Revert "[dynamo] Graph break on FSDP flat_param inconsistent tensor and grad dtype (#134614)"
This reverts commit e4a5958ab58e2f9b5b9c336a1d2a6449784d88d3.

Reverted https://github.com/pytorch/pytorch/pull/134614 on behalf of https://github.com/ZainRizvi due to Sorry, I had to revert this in order to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/134610#issuecomment-2316568553))
2024-08-29 02:02:12 +00:00
25531eb735 Revert "[2nd try][Traceable FSDP2] Allow tracing through FSDP2 impl in trace_rules.py (#134539)"
This reverts commit 26e392132d3039345de6aaf8643e7330f7fc3cbc.

Reverted https://github.com/pytorch/pytorch/pull/134539 on behalf of https://github.com/ZainRizvi due to Sorry, I had to revert this in order to revert another PR ([comment](https://github.com/pytorch/pytorch/pull/134539#issuecomment-2316568257))
2024-08-29 01:59:02 +00:00
cbf5ba1e97 Revert "[1/N] Move NaN check onto NCCL stream (#134300)"
This reverts commit 94caba4899096f160eca9628acddba6032755b3b.

Reverted https://github.com/pytorch/pytorch/pull/134300 on behalf of https://github.com/kwen2501 due to This is breaking builds of MTIA ([comment](https://github.com/pytorch/pytorch/pull/134300#issuecomment-2316559704))
2024-08-29 01:50:22 +00:00
33d0c11b26 Revert "[2/N] Add flag to control which rank should perform NaN check (#134345)"
This reverts commit 2fe7e332c7a61f025ccbcdbbb4875c6bf0b9afdf.

Reverted https://github.com/pytorch/pytorch/pull/134345 on behalf of https://github.com/kwen2501 due to This is breaking builds of MTIA ([comment](https://github.com/pytorch/pytorch/pull/134300#issuecomment-2316559704))
2024-08-29 01:50:22 +00:00
43dc17fd00 Revert "[3/N] Set correct device to CUDA guards (#134357)"
This reverts commit afc76c6f2d46d7726012507ec5c67b4c04e21723.

Reverted https://github.com/pytorch/pytorch/pull/134357 on behalf of https://github.com/kwen2501 due to This is breaking builds of MTIA ([comment](https://github.com/pytorch/pytorch/pull/134300#issuecomment-2316559704))
2024-08-29 01:50:22 +00:00
503c0dd923 Revert "Add MaskedTensor support to *_like API (#128637)"
This reverts commit b6e51711a0ea6174806e75ab6e208d2d910b45f5.

Reverted https://github.com/pytorch/pytorch/pull/128637 on behalf of https://github.com/ZainRizvi due to Actually, seems like it was this commit that introduced the failure: test_maskedtensor.py::TestOperatorsCUDA::test_like_empty_like_layout1_cuda_bool [GH job link](https://github.com/pytorch/pytorch/actions/runs/10604690725/job/29392898277) [HUD commit link](b6e51711a0) ([comment](https://github.com/pytorch/pytorch/pull/128637#issuecomment-2316554188))
2024-08-29 01:42:52 +00:00
1285443994 Revert "Add torch.serialization.skip_data context manager (#134504)"
This reverts commit 202600bc2384cb19a29b8fca503bafc289158c32.

Reverted https://github.com/pytorch/pytorch/pull/134504 on behalf of https://github.com/mikaylagawarecki due to This is breaking Windows docs tests due to NamedTemporaryFile on Windows not working well ([comment](https://github.com/pytorch/pytorch/pull/134504#issuecomment-2316543901))
2024-08-29 01:30:49 +00:00
e7711d6c7d [MPS] Fix SDP training (#134719)
Check whether the input tensors require grad. If they do, we don't take the fast path and instead fall back to the composite implicit implementation.

Fixes #134678
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134719
Approved by: https://github.com/malfet
2024-08-29 01:28:53 +00:00
ca03a14cf7 hang dim hint constants off Dim (#134702)
Summary: Retry landing https://github.com/pytorch/pytorch/pull/134484

Test Plan: (see original)

Differential Revision: D61925860

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134702
Approved by: https://github.com/pianpwk
2024-08-29 01:02:01 +00:00
7a554e96b4 [AOTI][Tooling] Follow up to print location of saved file path for torch.pickle_save() (#134651)
Summary:
- Follow-up to add a torch.pickle_save() log for the saved file path

- Minor debug printer code refinement

Test Plan: CI

Differential Revision: D61883239

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134651
Approved by: https://github.com/muchulee8
2024-08-28 23:58:37 +00:00
202600bc23 Add torch.serialization.skip_data context manager (#134504)
## Semantic

The semantic is
(1) By default `torch.serialization.skip_data(materialize_fake_tensors=False)` will make `torch.save` skip writing storages (but reserve space for them in the checkpoint).

```python
import torch
import torch.nn as nn

sd = nn.Linear(3, 5).state_dict()
with torch.serialization.skip_data():
    torch.save(sd, 'foo.pt')
print(torch.load('foo.pt', weights_only=True))
```

(2) With `torch.serialization.skip_data(materialize_fake_tensors=True)`, if a FakeTensor is passed to `torch.save`, the pickler will treat these FakeTensors as being "materialized": space will be reserved in the checkpoint for the associated storage bytes, and when loading, the type will be Tensor instead of FakeTensor.

```python
import torch
import torch.nn as nn
from torch._subclasses.fake_tensor import FakeTensorMode

with FakeTensorMode():
    m = nn.Linear(3, 5, dtype=torch.float16, device='cuda')

sd = m.state_dict()
with torch.serialization.skip_data(materialize_fake_tensors=True):
    torch.save(sd, 'bla.pt')
print(torch.load('bla.pt', weights_only=True))
# OrderedDict([('weight', tensor([[0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.],
#        [0., 0., 0.]], device='cuda:0', dtype=torch.float16)), ('bias', tensor([0., 0., 0., 0., 0.], device='cuda:0', dtype=torch.float16))])

```

## Follow Ups

- [ ] `torch.load` semantic for skip_data context manager
- [ ] Mechanism for getting offsets of storages saved via this method (for writing in a separate pass)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134504
Approved by: https://github.com/albanD
2024-08-28 23:53:17 +00:00
f997b2b8e6 Revert "Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262)"
This reverts commit f685018ea9d08f98cbd7106028db134f967f74d3.

Reverted https://github.com/pytorch/pytorch/pull/125262 on behalf of https://github.com/ZainRizvi due to Hi, this PR appears to be calling maskedtensor tests to fail on main. Please rebase your changes onto the latest trunk build to repro the failure. test_maskedtensor.py::TestOperatorsCUDA::test_like_empty_like_layout1_cuda_bool [GH job link](https://github.com/pytorch/pytorch/actions/runs/10604716811/job/29393256312) [HUD commit link](f685018ea9) ([comment](https://github.com/pytorch/pytorch/pull/125262#issuecomment-2316387447))
2024-08-28 23:10:07 +00:00
6dd3f81aaf Add export_for_training as public API (#134677)
Differential Revision: [D61912084](https://our.internmc.facebook.com/intern/diff/D61912084)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134677
Approved by: https://github.com/avikchaudhuri, https://github.com/zhxchen17
2024-08-28 22:32:10 +00:00
a7933acd5a Improve custom ops aliasing error message (#134688)
Fixes https://github.com/pytorch/pytorch/issues/134278

Test Plan:
- tested locally
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134688
Approved by: https://github.com/yushangdi
ghstack dependencies: #134466, #134490, #134491, #134690, #134692
2024-08-28 22:22:04 +00:00
dd443f418a Improve opcheck docs. (#134692)
Fixes https://github.com/pytorch/pytorch/issues/134119
From user feedback, it's difficult to understand what the tests do. We
clarify the docs more.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134692
Approved by: https://github.com/albanD
ghstack dependencies: #134466, #134490, #134491, #134690
2024-08-28 22:22:04 +00:00
afc76c6f2d [3/N] Set correct device to CUDA guards (#134357)
In `collective()`, `pointToPoint()` and `collectiveCoalesced()`, CUDA guards were created with an unset (default) CUDA device. This is the cause of the illegal memory access (IMA) hit by the NaN checker in issue https://github.com/pytorch/pytorch/issues/134062.

With this fix, `torch.cuda.set_device(device)` is not needed to work around the IMA.

Also refactored a couple places where the guard is created -- preferably we create the guard with a known device, rather than setting the device later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134357
Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang
ghstack dependencies: #134300, #134345
2024-08-28 22:17:11 +00:00
5ff97e79ee Skip test_mutable_custom_op_fixed_layout2 on ROCM (#134690)
ROCM doesn't trigger the layout optimization that makes the test case
valid so we're going to skip the checks.

Should fix the following (I'll close them later)
- https://github.com/pytorch/pytorch/issues/134481
- https://github.com/pytorch/pytorch/issues/134519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134690
Approved by: https://github.com/FindHao
ghstack dependencies: #134466, #134490, #134491
2024-08-28 22:12:24 +00:00
2fe7e332c7 [2/N] Add flag to control which rank should perform NaN check (#134345)
Fixes https://github.com/pytorch/pytorch/issues/134062.
For example, in case of broadcast / scatter, only the root rank should perform the NaN check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134345
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
ghstack dependencies: #134300
2024-08-28 21:53:39 +00:00
26ec06e45d [amd][lowering] hipify shim v2 headers (#134689)
Summary: The default c_shim version was switched to 2 for HIP in D60674018. This results in some linking errors where shim function symbols are missing from the compiled .so file (eg. P1551186492) when building lowering benchmark scripts since the required files aren't included. Hipify the shim v2 generated header files as well since they're needed during codegen when the buck binaries are executed.

Reviewed By: frank-wei, zoranzhao, henryoier

Differential Revision: D61865202

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134689
Approved by: https://github.com/zoranzhao
2024-08-28 21:53:24 +00:00
7b3da5f297 Revert "[dynamo] Cache _dynamo.disable results (#134272)"
This reverts commit dbef2b05b4d81e891f7497f92f730a22bebe445d.

Reverted https://github.com/pytorch/pytorch/pull/134272 on behalf of https://github.com/anijain2305 due to Peak mem increase detected internally ([comment](https://github.com/pytorch/pytorch/pull/134272#issuecomment-2316308170))
2024-08-28 21:51:43 +00:00
20b62fed21 Create processes in parallel in mp.start_processes for forkserver (#134629)
Summary:
This is to fix the PyTorch issue filed at https://github.com/pytorch/pytorch/issues/133010;
one way to fix this problem is to enable parallel process start in mp.start_processes.
What else is in the diff:
refactored the api_test test case, which was repeating a lot of tests due to inheritance;
added a unit test for forkserver with parallel start on (see the sketch below).
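
A rough sketch of launching workers through `mp.start_processes` with the forkserver start method; whether parallel start is on by default or gated behind a flag is not shown here and is an assumption:

```python
import torch.multiprocessing as mp

def worker(rank: int) -> None:
    print(f"worker {rank} started")

if __name__ == "__main__":
    # With this change, the forkserver path can create the child processes in parallel.
    mp.start_processes(worker, nprocs=4, start_method="forkserver")
```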

Test Plan: Added unit tests

Differential Revision: D61878552

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134629
Approved by: https://github.com/d4l3k
2024-08-28 21:34:32 +00:00
f685018ea9 Add MaskedTensor passthrough: unfold, F.Unfold, F.Fold, stack (#125262)
Hi,
I noticed the `unfold` operator was missing on MaskedTensor.

I tested that my change works when calling unfold and backward on a `MaskedTensor`, but I didn't find the tests for the dispatch of such an operation. Where are they?
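
A minimal sketch of the new passthrough (shapes and values are illustrative):

```python
import torch
from torch.masked import masked_tensor

data = torch.arange(6.0).reshape(2, 3)
mask = torch.tensor([[True, False, True], [True, True, False]])
mt = masked_tensor(data, mask)

# Previously unsupported on MaskedTensor; now both the data and the mask pass through.
out = mt.unfold(1, 2, 1)  # dimension=1, size=2, step=1
print(out)
```
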
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125262
Approved by: https://github.com/cpuhrsch
2024-08-28 21:30:39 +00:00
b6e51711a0 Add MaskedTensor support to *_like API (#128637)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128637
Approved by: https://github.com/cpuhrsch
2024-08-28 21:28:23 +00:00
4c16797e71 [c10d FR analyzer] Output a meaningful debug report for users (#134528)
- This PR generates a more useful output log for users: P1552399180.
- It also fixes the logic when we check the all-gather size mismatch.
- Add dtype check for collective input/output
- We store more context information for the error match_state so that we can report it in the file.
- Disable the size match for alltoall because we don't log the sizes for all inputs/outputs.
- Correct some types in the func args specification.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134528
Approved by: https://github.com/c-p-i-o
2024-08-28 21:22:47 +00:00
de35d3062f Runtime Estimator for estimating GPU compute time (#134243)
This PR adds a basic Runtime Estimator for single-device models.
It estimates the GPU runtime in milliseconds using various estimation methods under the ``FakeTensorMode``.
It provides a ``TorchDispatchMode`` based context manager that can estimate the eager runtime of PyTorch functions. It supports two estimation modes, benchmarking (`operator-level-benchmark`) and roofline cost modeling (`operator-level-cost-model`).
For modules executed under this context manager, it aggregates the forward and backward operation runtimes and records their execution order.

```
import torch
from torch import nn, optim
from torch._subclasses.fake_tensor import FakeTensorMode
from torch.distributed._tools.runtime_estimator import RuntimeEstimator
from torch.testing._internal.distributed._tensor.common_dtensor import (
    ModelArgs,
    Transformer,
)

if __name__ == "__main__":
    def _train_step(
        model: nn.Module,
        optimizer: optim.Optimizer,
        inp: torch.Tensor,
    ):
        out = model(inp)
        loss = out.sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

    dev = torch.cuda.current_device()
    vocab_size = 8192
    bsz, seq_len = 32, 1024
    model_args = ModelArgs(
        n_layers=4,
        n_heads=12,
        vocab_size=vocab_size,
        max_seq_len=seq_len,
        dim=768,
        dropout_p=0.1,
    )
    runtime_estimator = RuntimeEstimator()

    with FakeTensorMode():
        with torch.device(dev):
            model = Transformer(model_args)
        optimizer = optim.Adam(model.parameters(), lr=1e-2, foreach=True)
        inp = torch.randint(0, model_args.vocab_size, (bsz, model_args.max_seq_len), device=dev)
        with runtime_estimator("operator-level-benchmark"):
            _train_step(model, optimizer, inp)
        with runtime_estimator("operator-level-cost-model"):
            _train_step(model, optimizer, inp)

    # Actual model runtime
    with torch.device(dev):
        model = Transformer(model_args)
    optimizer = optim.Adam(model.parameters(), lr=1e-2, foreach=True)
    inp = torch.randint(0, model_args.vocab_size, (bsz, model_args.max_seq_len), device=dev)
    warmup_iters, actual_iters = 2, 5
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup_iters):
        _train_step(model, optimizer, inp)
    start_event.record()
    for _ in range(actual_iters):
        _train_step(model, optimizer, inp)
    end_event.record()
    torch.cuda.synchronize()
    measured_time = start_event.elapsed_time(end_event) / actual_iters
    print(f"Actual total_time: {measured_time:.3f} ms")
  ```

<img width="506" alt="Screenshot 2024-08-26 at 11 27 15 PM" src="https://github.com/user-attachments/assets/04d243c9-21a6-4389-8c20-80958980788c">

@weifengpy @xuanzhang816 @gnadathur

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134243
Approved by: https://github.com/weifengpy
2024-08-28 20:06:54 +00:00
cae817c862 [ET][CodeGen] Remove TORCH_API from NativeFunctions.h declarations (#134245)
Summary:
Remove TORCH_API from the generated executorch/kernels/portable/NativeFunctions.h declarations

These generated declarations are using ET tensors. They don't need to have the TORCH_API macro prefixed to them, since in this case TORCH_API is just empty. See [codegen/macros.h](https://www.internalfb.com/code/fbsource/[d12d7d3accfb12932368e0216124f2d735c51d73]/fbcode/executorch/codegen/macros.h)

Test Plan: CI

Differential Revision: D61490943

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134245
Approved by: https://github.com/larryliu0820
2024-08-28 19:58:37 +00:00
b07d0a22f5 [hop] require hops to override __call__. (#134352)
Fixes https://github.com/pytorch/pytorch/issues/133719 by making `__call__` of hops an abstractmethod.
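
An illustrative sketch of what the requirement looks like for a HigherOrderOperator subclass; the class and names here are hypothetical, not from the PR:

```python
from torch._ops import HigherOrderOperator

class MyHop(HigherOrderOperator):
    def __init__(self):
        super().__init__("my_hop")

    # __call__ is now an abstractmethod on HigherOrderOperator, so every
    # subclass must provide its own override (even a thin one like this).
    def __call__(self, fn, *args, **kwargs):
        return super().__call__(fn, *args, **kwargs)

my_hop = MyHop()  # would fail to instantiate without the __call__ override
```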

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134352
Approved by: https://github.com/zou3519
2024-08-28 19:56:40 +00:00
66c33d5989 Revert "[2/N] Add flag to control which rank should perform NaN check (#134345)"
This reverts commit be7752ead3824e79f5ede6a2f59715b415a2f776.

Reverted https://github.com/pytorch/pytorch/pull/134345 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/134345#issuecomment-2316133024))
2024-08-28 19:51:59 +00:00
23e26b84af Revert "[3/N] Set correct device to CUDA guards (#134357)"
This reverts commit 13114da4ef9d14978ea1dfc0fefb236cb4000435.

Reverted https://github.com/pytorch/pytorch/pull/134357 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/134357#issuecomment-2316121423))
2024-08-28 19:44:55 +00:00
3b40b07efb Update PyTorch for XNNPACK 87ee0b4 (#134518)
Summary: Update XNNPACK library version.

Test Plan: Combined diff CI is clean: D61586079 (all changes, has to be split out for export).

Differential Revision: D61822610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134518
Approved by: https://github.com/mcr229
2024-08-28 19:24:04 +00:00
042b733ddd [dynamo][freezing] Set is_static_type to false after marking an input static (#134653)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134653
Approved by: https://github.com/mlazos
2024-08-28 19:22:37 +00:00
aa31e7019a [FSDP] Made clip_grad_norm_ norm compute order deterministic (#134673)
Fixes https://github.com/pytorch/pytorch/issues/134393

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134673
Approved by: https://github.com/weifengpy
ghstack dependencies: #134152
2024-08-28 18:44:11 +00:00
47ba47a81f [compiled autograd] error instead of deadlock on reentrant autograd (#134530)
A reentrant backward calls autograd multiple times on the same thread, so it passes all the thread checks and then hangs waiting for the lock it already holds in another scope.
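
An illustrative sketch of the reentrant pattern in question (not the PR's test case): a custom autograd Function whose backward re-enters the autograd engine on the same thread, which compiled autograd now rejects with an error instead of deadlocking.

```python
import torch

class Reentrant(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        with torch.enable_grad():
            ctx.x = x.detach().requires_grad_()
            ctx.y = ctx.x * 2
        return ctx.y.detach()

    @staticmethod
    def backward(ctx, grad_output):
        # Re-enters the autograd engine from within a backward call.
        with torch.enable_grad():
            torch.autograd.backward(ctx.y, grad_output)
        return ctx.x.grad

x = torch.randn(3, requires_grad=True)
Reentrant.apply(x).sum().backward()  # fine in eager; errors under compiled autograd
```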

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134530
Approved by: https://github.com/jansel
ghstack dependencies: #134514
2024-08-28 17:54:31 +00:00
c352b6aaaf [compiled autograd][cpp node] point c++ custom autograd functions tracing error to google doc (#134514)
`RuntimeError: Attempting to trace a potentially unsafe C++ autograd function: torch::autograd::CppNode<CustomOpAutogradFunction>. It may be possible to trace it safely, please refer to the instructions in: https://docs.google.com/document/d/11VucFBEewzqgkABIjebZIzMvrXr3BtcY1aGKpX61pJY/.`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134514
Approved by: https://github.com/yf225
2024-08-28 17:54:31 +00:00
ba5aec88c6 [reland][dtensor][MTPG] make sharding prop lru cache not shared among threads (#134509)
**Summary**
reland of https://github.com/pytorch/pytorch/pull/134294

Fixes #131446
Fixes #126852
Fixes #126868
Fixes #126493

The PR was reverted due to CI red signal in https://github.com/pytorch/pytorch/actions/runs/10537099590/job/29201744658. It seems that the `gaussian_nll_loss` test had been flaky before my original PR #134294 . Therefore this PR also removes the `xfail` mark on this specific test to make CI signal green.

See the error message below:
```
2024-08-24T13:42:01.3228990Z ==================================== RERUNS ====================================
2024-08-24T13:42:01.3229530Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3229710Z Unexpected success
2024-08-24T13:42:01.3230235Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3230407Z Unexpected success
2024-08-24T13:42:01.3230594Z =================================== FAILURES ===================================
2024-08-24T13:42:01.3231128Z _ TestDTensorOpsCPU.test_dtensor_op_db_nn_functional_gaussian_nll_loss_cpu_float32 _
2024-08-24T13:42:01.3231296Z Unexpected success
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134509
Approved by: https://github.com/tianyu-l, https://github.com/wz337
2024-08-28 17:51:44 +00:00
310eb6d8c6 [AOTI] Fix test_aoti_inference CPU build issue (#134675)
Summary: Fixes https://github.com/pytorch/pytorch/issues/130311. We need to guard CUDA-only code in test_aoti_inference with macros so that it won't fail for CPU-only platform.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134675
Approved by: https://github.com/atalman, https://github.com/chunyuan-w
2024-08-28 17:42:19 +00:00
633a9a3b13 add back sum_floordiv benchmark. (#134635)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134635
Approved by: https://github.com/avikchaudhuri, https://github.com/oulgen
ghstack dependencies: #133834
2024-08-28 17:38:24 +00:00
b8859dc4b8 [PyTorch Pin Memory Allocator] Optimize the free list implementation and add lock sharding (#134154)
Summary: This diff addresses the lock contention issue in the free-list implementation of the CachingHost/Pinned allocator. We add a different data structure for the free list and also add lock sharding based on allocation size.

Differential Revision: D61623367

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134154
Approved by: https://github.com/guangyey, https://github.com/jgong5, https://github.com/zyan0, https://github.com/EikanWang, https://github.com/jiayisuse
2024-08-28 17:12:01 +00:00
40de63be09 parameterized test_graph_optims and test_graph_scaling_fused_optimizers (#133749)
Fixes #123451

This is a rework of a reverted pull request, https://github.com/pytorch/pytorch/pull/125127.
The test failure is fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133749
Approved by: https://github.com/janeyx99
2024-08-28 16:34:06 +00:00
c7338f457c [DCP] Fixes the BC issue where the traversal doesn't support versions before 2.4 (#134158)
The original DCP doesn't flatten all the containers, which can cause issues; https://github.com/pytorch/pytorch/pull/125335 intends to solve the issue by flattening all the dictionaries.

Unfortunately, it breaks the checkpoints that are saved before 2.4. This
also shows some issues of the DCP:

1. DCP should record version in the metadata.
2. DCP should have a nice way to load old state_dict.
3. DCP should unflatten all containers (map, list) not just map.

This PR only addresses issue 2 to unblock users. Issue 1 and issue 3 need to be addressed in the future.

@pradeepfn Please let me know if this summary matches our discussion.

Fixes https://github.com/pytorch/pytorch/issues/133923

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134158
Approved by: https://github.com/wz337, https://github.com/pradeepfn
2024-08-28 16:31:44 +00:00
13d40f6fc5 Revert "hang dim hint constants off Dim (#134484)"
This reverts commit c142af7209a423a05504fdec50680333f5a37629.

Reverted https://github.com/pytorch/pytorch/pull/134484 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/134484#issuecomment-2315749549))
2024-08-28 16:05:42 +00:00
2c88a923a7 Revert "Refactor caching device allocator utils (#130923)"
This reverts commit c45ca8092dddf718563a1a754de798ad25eae1ee.

Reverted https://github.com/pytorch/pytorch/pull/130923 on behalf of https://github.com/ZainRizvi due to Sorry but this appears to be causing internal tests to fail with errors like `error: no type named 'DeviceStats' in namespace 'xxx::xxx:xxxAllocator'; did you mean 'DeviceStatus'?` ([comment](https://github.com/pytorch/pytorch/pull/130923#issuecomment-2315730155))
2024-08-28 15:56:08 +00:00
d52aff3e73 Revert "Adding entry-point based support for out-of-tree rendezvous plugins (#132633)"
This reverts commit 136b19b062f62c81ea3ed8fb306debe9d7720e93.

Reverted https://github.com/pytorch/pytorch/pull/132633 on behalf of https://github.com/ZainRizvi due to Sorry but this is causing internal tests to fail with the error `ImportError: cannot import name '_register_out_of_tree_handlers' from 'torch.distributed.elastic.rendezvous.registry'` ([comment](https://github.com/pytorch/pytorch/pull/132633#issuecomment-2315716201))
2024-08-28 15:49:18 +00:00
85d9946001 [CI] change conda to miniforge for XPU images (#134455)
A `.ci/docker` change with the `ciflow/xpu` label triggers a docker image rebuild on the XPU runner, but the XPU runner can't use miniconda, so change to miniforge. Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134455
Approved by: https://github.com/atalman
2024-08-28 15:16:14 +00:00
208b922327 [Intel GPU] Remove special dispatch logic for xpu in adaptive_avg_pooling (#132217)
We now align the dispatch logic for XPU with CUDA in the adaptive average pooling operation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132217
Approved by: https://github.com/EikanWang, https://github.com/atalman, https://github.com/albanD, https://github.com/malfet
2024-08-28 15:06:35 +00:00
e6bf1710ff [Inductor][Refactor] Rename CPU benchmark test configs (#134639)
Summary: benchmarks/dynamo/ci_expected_accuracy/update_expected.py expects a benchmark run config to be named {config}_{benchmark}, and CPU tests should follow the same naming convention.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134639
Approved by: https://github.com/huydhn
2024-08-28 14:49:55 +00:00
c142af7209 hang dim hint constants off Dim (#134484)
Summary: Recently https://github.com/pytorch/pytorch/pull/133620 added support for automatic dynamic shapes, where a new enum, `DIM`, was introduced to provide hints like `AUTO` and `STATIC`. This PR is a nominal change where we expose the hints via the existing public `Dim` API, and remove `DIM` from the public API. The main motivation is to avoid having users need to import too many things.
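
A hedged sketch of the exposed hints; the spellings `Dim.AUTO` / `Dim.STATIC` and the export call below are assumptions based on this summary rather than the patch itself:

```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

ep = export(
    M(),
    (torch.randn(4, 8),),
    # dim 0 is inferred automatically; dim 1 is pinned static
    dynamic_shapes={"x": {0: Dim.AUTO, 1: Dim.STATIC}},
)
print(ep)
```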

Test Plan: existing

Differential Revision: D61807361

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134484
Approved by: https://github.com/angelayi
2024-08-28 14:35:40 +00:00
3e42f21eee Bucketize fix to include number and tensor inputs (#133652)
Fixes #132222
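
A small sketch of both input forms now handled (values are illustrative):

```python
import torch

boundaries = torch.tensor([1, 3, 5, 7, 9])
print(torch.bucketize(torch.tensor([2, 6]), boundaries))  # tensor input -> tensor([1, 3])
print(torch.bucketize(4, boundaries))                     # plain number input -> tensor(2)
```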

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133652
Approved by: https://github.com/ezyang
2024-08-28 13:35:41 +00:00
bb22132c8d [aotd] Make effects op registry WeakKeyDictionary (#134470)
The op is used as a dictionary key, but the op can be deregistered; as a result, this key would keep holding the op and prevent its deallocation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134470
Approved by: https://github.com/zou3519
2024-08-28 12:12:00 +00:00
97c8a0739e [Dynamo] Support inspect.signature.Parameter getattr (#134636)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134636
Approved by: https://github.com/Chillee, https://github.com/anijain2305
2024-08-28 09:59:41 +00:00
26e392132d [2nd try][Traceable FSDP2] Allow tracing through FSDP2 impl in trace_rules.py (#134539)
The previous PR https://github.com/pytorch/pytorch/pull/133532 caused a stuck-compilation issue on internal models. In this 2nd attempt PR, we gate the trace_rules.py changes with `if not torch._dynamo.config.skip_fsdp_hooks:`, so that they don't take effect for the current graph-break FSDP2 (which relies on the default config value `skip_fsdp_hooks=True`), and only take effect when we are using Traceable FSDP2 (in which case the user needs to proactively set `skip_fsdp_hooks=False`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134539
Approved by: https://github.com/ckluk2, https://github.com/yanboliang
2024-08-28 08:57:56 +00:00
8693322ef0 [Dynamo][autograd.Function] Support mark_non_differentiable (#134087)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134087
Approved by: https://github.com/zou3519
2024-08-28 08:12:37 +00:00
d01415409b [PGNCCL] Improve logic to infer device for barrier (#134617)
Fixes #134391, #124714

The above issues reported that `dist.barrier()` could hang in some cases.
The culprit is that ProcessGroupNCCL inferred a wrong device to perform the dummy all-reduce.

After the PR, the following will be the order of device selection:
- 1st choice: `opts.device_ids`, if provided by user via `barrier(opts)`.
- 2nd choice: bound device id, if provided to `init_process_group` via `device_id` arg.
- 3rd choice: `usedDeviceIdxs_` recorded in current PG. Will have a value from previous collectives.
- 4th choice: `globalRank() % localDeviceCount_`. This can only happen when `dist.barrier()` is the first call of the PG.

What's new:
- Added the 2nd choice.
- In the 4th choice, we use `globalRank()` instead of group-local rank, because the group-local rank can be offset wrt the device id if intra-node GPUs are sharded into multiple dimensions.
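
A hedged sketch of the 1st and 2nd choices above in user code; launcher details are omitted and the `LOCAL_RANK` handling is an assumption about the launch environment:

```python
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])
device = torch.device("cuda", local_rank)

# 2nd choice: binding a device id at init time lets barrier() infer the device.
dist.init_process_group("nccl", device_id=device)
dist.barrier()

# 1st choice: pass the device explicitly to the barrier itself.
dist.barrier(device_ids=[local_rank])

dist.destroy_process_group()
```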

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134617
Approved by: https://github.com/yifuwang, https://github.com/shuqiangzhang
2024-08-28 08:12:09 +00:00
e4a5958ab5 [dynamo] Graph break on FSDP flat_param inconsistent tensor and grad dtype (#134614)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134614
Approved by: https://github.com/awgu, https://github.com/yf225
ghstack dependencies: #134610, #134590, #134621
2024-08-28 07:35:24 +00:00
e96dc3665a [raland][dynamo][exceptions] Support raise from None (#134621)
The PR was reverted because this PR traced more code and surfaced a latent bug. Resubmitting w/o any changes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134621
Approved by: https://github.com/jansel
ghstack dependencies: #134610, #134590
2024-08-28 07:35:23 +00:00
c566f2465f [dynamo][dicts] Support hasattr on dicts (#134590)
Fixes - https://github.com/pytorch/pytorch/issues/134577
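
A minimal illustrative sketch (not the PR's test):

```python
import torch

@torch.compile(fullgraph=True)
def f(d):
    # hasattr on a plain dict no longer needs a graph break
    if hasattr(d, "items"):
        return d["x"] + 1
    return d["x"]

print(f({"x": torch.ones(2)}))
```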

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134590
Approved by: https://github.com/Skylion007
ghstack dependencies: #134610
2024-08-28 07:35:18 +00:00
880e3d18a4 [dynamo][exceptions] Use exception subclass whenever possible (#134610)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134610
Approved by: https://github.com/drisspg, https://github.com/jansel
2024-08-28 07:35:12 +00:00
bf7db4e4f9 [Inductor UT] Generalize inductor UT for intel GPU (#133309)
[Inductor UT] Generalize Inductor test case for Intel GPU.

- Reuse `test/inductor/test_decompose_mem_bound_mm.py`
- Reuse `test/inductor/test_inplacing_pass.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133309
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/etaf
2024-08-28 06:17:43 +00:00
2ba60a1618 fix torch.prod vectorized path for bool (#128009)
Fix https://github.com/pytorch/pytorch/issues/127866.
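
An illustrative repro-style sketch (the exact failing shape from the issue is not reproduced here):

```python
import torch

t = torch.ones(64, dtype=torch.bool)
t[10] = False
print(torch.prod(t))          # expected 0: one False element zeroes the product
print(torch.prod(t.long()))   # reference result on the promoted integer tensor
```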

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128009
Approved by: https://github.com/jgong5, https://github.com/albanD
2024-08-28 05:27:50 +00:00
89929d9abc [AOTI][Tooling][4/n] Add torch.save() for individual intermediate tensor (#133871)
Differential Revision: D61415304

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133871
Approved by: https://github.com/ColinPeppler
2024-08-28 04:48:00 +00:00
ca77f0a986 [executorch hash update] update the pinned executorch hash (#133386)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133386
Approved by: https://github.com/pytorchbot
2024-08-28 04:16:42 +00:00
e3308d835d [audio hash update] update the pinned audio hash (#134632)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134632
Approved by: https://github.com/pytorchbot
2024-08-28 04:16:25 +00:00
cyy
bb4dfe90b8 [Reland] [1/N] Fix clang-tidy warnings in inductor (#134544)
Reland #131979 and exclude aoti_torch_index_put_out changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134544
Approved by: https://github.com/ColinPeppler
2024-08-28 04:05:06 +00:00
71d0eff6e7 Back out "[pytorch][PR] [export] Schematize nn_module_stack serialization" (#134628)
Summary: Breaking backward compatibilities for serialization and deserialization

Differential Revision: D61888223

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134628
Approved by: https://github.com/angelayi
2024-08-28 03:45:46 +00:00
cyy
ec3f52dd27 [21/N] Fix clang-tidy warnings in jit (#134537)
Follows #133399

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134537
Approved by: https://github.com/Skylion007
2024-08-28 03:22:01 +00:00
5beb859e74 [BE] no need to print stream in comm abort (#134362)
Strictly speaking, the NCCL communicator has nothing to do with CUDA streams, so we don't need to print the stream in comm abort's message.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134362
Approved by: https://github.com/fduwjj, https://github.com/wconstab
2024-08-28 02:14:18 +00:00
f33bcbe5fd c10d/logging: add C10D_LOCK_GUARD (#134131)
This adds logs if we can't acquire locks in NCCLUtils and ProcessGroupNCCL for 30s.

This is motivated by some deadlocks we're seeing, and it's unclear whether they're in NCCL or on the PyTorch side of things.

This required replacing most `std::mutex` with `std::timed_mutex` and `std::condition_variable_any` as appropriate.

Test plan:

existing CI for regressions

will add unit tests on `C10D_LOCK_GUARD`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134131
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
2024-08-28 01:40:42 +00:00
c45ca8092d Refactor caching device allocator utils (#130923)
# Motivation
Following [[RFC] Intel GPU Runtime Upstreaming for Allocator ](https://github.com/pytorch/pytorch/issues/116322), this PR aims to refactor the caching device allocator utils to improve code reuse.
This is the first PR, we could prepare some follow-up PRs continuing to refactor the device caching allocator.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130923
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD, https://github.com/eqy
2024-08-28 01:35:23 +00:00
d96254631e [CD] Fix docker builds by installing setuptools after python build (#134631)
Follow up after https://github.com/pytorch/pytorch/pull/134595

The same error happens silently before the error addressed in the above PR (the build continues and produces an invalid Docker image):
```
#47 457.5 Traceback (most recent call last):
#47 457.5   File "<string>", line 1, in <module>
#47 457.5   File "/opt/_internal/cpython-3.12.0/lib/python3.12/site-packages/wheel/pep425tags.py", line 3, in <module>
#47 457.5     import distutils.util
#47 457.5 ModuleNotFoundError: No module named 'distutils'
#47 457.5 + local abi_tag=
#47 457.5 + ln -s /opt/_internal/cpython-3.12.0 /opt/python/
#47 457.5 + rm -f Python-3.12.0.tgz
```

The fix in  https://github.com/pytorch/pytorch/pull/134595 is no longer needed since we will install setuptools right after python installation.

Link: https://github.com/pytorch/pytorch/actions/runs/10584642913/job/29329366729#step:6:6041
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134631
Approved by: https://github.com/kit1980
2024-08-28 01:17:41 +00:00
2b95da7ef4 allow conv_bn mixed dtype folding in post-grad (#133968)
This PR relaxes the condition to allow conv_bn mixed dtype folding in post-grad.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133968
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
2024-08-28 01:02:09 +00:00
f7467c3b95 using new device-agnostic api instead of old api like torch.cpu or torch.cuda (#134448)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134448
Approved by: https://github.com/guangyey, https://github.com/shink, https://github.com/albanD
2024-08-28 01:01:49 +00:00
0c7856973b [export] enumerate unsupported sympy.Functions (#134271) (#134598)
Summary:
There's 2 concepts of unsupported sympy.Functions in symbolic_shapes:
1) unsupported by the export solver, meaning the solver doesn't know how to provide useful fixes for those functions
2) unsupported by the sympy interpreter - meaning we can't reify them into FX nodes because the functions aren't present in PythonReferenceAnalysis

This splits the current call into a call for each version, with the Export solver the only user of 1). For 1), we enumerate the functions in _sympy/functions.py, and subtract the functions we know we can support. For 2) there's only 3 functions we've seen pop up in test cases.

cc jgong5 mingfeima XiaobingSuper sanchitintel ashokei jingxu10

Differential Revision: D61863394

Pulled By: pianpwk

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134598
Approved by: https://github.com/angelayi
2024-08-28 00:34:38 +00:00
3b33f26513 Add device daemon (#131814)
Base implementation aiming towards https://github.com/pytorch/rfcs/pull/64

Details of the implementation and next steps in https://github.com/pytorch/pytorch/blob/gh/albanD/3/head/test/cpp_extensions/open_registration_extension/README.md

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131814
Approved by: https://github.com/ezyang
2024-08-27 23:32:07 +00:00
d6091c8726 Add compile time instruction count metric (#133834)
PYTHONPATH=$(pwd) python benchmarks/update_hint_benchmark.py out
As of this diff, compile_time_instruction_count counts the number of instructions executed within convert_frame.compile_inner.
```
update_hint_regression,compile_time_instruction_count,10522459165
```
 will add result from CI once populated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133834
Approved by: https://github.com/aorenste
2024-08-27 23:29:02 +00:00
ef0f5919c7 [ROCm][Inductor][CK] Fix codegen after ck signature change (#134483)
MakeArgument signature was changed in https://github.com/ROCm/composable_kernel/pull/1453 adding splitK argument to universal gemm templates which are used to codegen addmm and matmul

(part of the series started at #125453 )

# Testing
`pytest test/inductor/test_ck_backend.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134483
Approved by: https://github.com/ColinPeppler
2024-08-27 23:25:42 +00:00
5ead965026 [export] don't duck size for DIM.AUTO (#134486)
Summary: Apparently DIM.AUTO leads to duck sizing; I didn't catch this. This does the least intrusive fix possible by using `torch._dynamo.maybe_mark_dynamic()` under the hood.

Test Plan: added test

Differential Revision: D61809344

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134486
Approved by: https://github.com/avikchaudhuri
2024-08-27 23:00:26 +00:00
30094bedbc Revert "[dynamo][dicts] Support hasattr on dicts (#134590)"
This reverts commit d23c0150f3ba5fd1162358e9e7b0e72e7308c87e.

Reverted https://github.com/pytorch/pytorch/pull/134590 on behalf of https://github.com/anijain2305 due to causing trunk CI failures ([comment](https://github.com/pytorch/pytorch/pull/134590#issuecomment-2313705582))
2024-08-27 22:52:52 +00:00
d966d91e37 [FlexAttention] Fix Sparse block multiple to ceildiv instead for floor div (#134538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134538
Approved by: https://github.com/yanboliang
ghstack dependencies: #134507, #134511
2024-08-27 22:04:57 +00:00
f5c67917d3 [FlexAttention] Remove unused code (#134511)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134511
Approved by: https://github.com/yanboliang
ghstack dependencies: #134507
2024-08-27 22:04:57 +00:00
856a8410f2 [FlexAttention] Create new variables for the subgraphs (#134507)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134507
Approved by: https://github.com/yanboliang, https://github.com/BoyuanFeng
2024-08-27 22:04:57 +00:00
41e512a4cd [EZ] Restore test_unicode_comments (#134589)
This reverts the changes introduced to test_jit.py by 43737bd78a and adds a lint suppression for it

As the test name suggests, it should have a unicode comment to make sure our parser can handle it

Part of the fix for https://github.com/pytorch/pytorch/issues/134422
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134589
Approved by: https://github.com/aorenste, https://github.com/Skylion007
2024-08-27 21:51:06 +00:00
1ba39ec1d0 Add test case test_arange_length_with_float32_dtype (#134415)
Adding a test as a followup from https://github.com/pytorch/pytorch/pull/134296

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134415
Approved by: https://github.com/ezyang
2024-08-27 21:36:23 +00:00
b58a0c3c4d [split build] fix distributed problems (#134502)
Should fix the issue where USE_C10D_NCCL was not getting propagated to libtorch_python.so
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134502
Approved by: https://github.com/yifuwang
2024-08-27 21:12:58 +00:00
289486d007 Move attention kernels back from fake_impls to meta_registrations (#134288)
See #121528 for additional context.

In #120682, we moved the attention kernels from meta_registrations to fake_impls with the intent of fixing the device handling for seed/offset: these are typically on CPU. We needed to put the registrations in fake_impls to do this because meta_registrations doesn't have a way to specify device, whereas fake_impls does. But when we tried to actually fix the device types (#120839), we had to revert the PR because it broke cudagraph handling (during which seed/offset _are_ on CUDA).

Now, we want to put the registrations back in meta_registrations so that we can call these kernels with meta tensors. The use case is later in this stack - we want to be able to use the flop counter with these kernels.

Also - I specifically skip the `compare_tensor_meta()` check in test_fake / test_fake_autocast tests for the `_efficient_attention_forward` and `_flash_attention_forward` kernels, which fails because of the device mismatch from the seed/offset tensors. Then we can un-skip these opinfos. I verified that the efficient_attention_forward bug (#120842) is now caught by these opinfos if I revert the fix from this PR.

Differential Revision: [D61687369](https://our.internmc.facebook.com/intern/diff/D61687369)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134288
Approved by: https://github.com/drisspg
2024-08-27 21:10:36 +00:00
39ca96398b Update label_to_label with oncall: pt2 hierarchy. (#134582)
Test Plan:
- None
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134582
Approved by: https://github.com/clee2000
2024-08-27 21:05:40 +00:00
cyy
b567ca0f51 Remove unused imported names in python files (#134438)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134438
Approved by: https://github.com/zou3519
2024-08-27 20:44:04 +00:00
d23c0150f3 [dynamo][dicts] Support hasattr on dicts (#134590)
Fixes - https://github.com/pytorch/pytorch/issues/134577

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134590
Approved by: https://github.com/Skylion007
ghstack dependencies: #134039
2024-08-27 20:43:40 +00:00
16b8146c9e Exclude test_transformers and unit tests which require recent GPU arch (#132895)
This PR is to exclude test_transformers on ROCm temporarily and skip some unit tests which require recent GPU arch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132895
Approved by: https://github.com/jithunnair-amd, https://github.com/pruthvistony, https://github.com/malfet
2024-08-27 20:40:53 +00:00
44dadf2506 [Fix] Check name when registering privateuse1 backend (#134071)
Do some checks when registering the privateuse1 backend to avoid using in-tree device names.
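
A hedged sketch of the intent; whether the check lives in `rename_privateuse1_backend` or elsewhere in the registration path is an assumption:

```python
import torch

# Registering a custom name for the privateuse1 backend is fine.
torch.utils.rename_privateuse1_backend("my_device")

# Reusing an in-tree device name such as "cuda" or "cpu" should now be rejected.
# torch.utils.rename_privateuse1_backend("cuda")  # expected to raise after this PR
```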

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134071
Approved by: https://github.com/albanD
2024-08-27 20:28:30 +00:00
f754c0ae1b [easy] rm duplicate definition for inductor in TORCH_LOGS documentation (#134480)
already defined in
2eb9339b71/torch/_logging/_internal.py (L286-L287)

Test Plan: Sandcastle run

Differential Revision: D61806088

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134480
Approved by: https://github.com/eellison, https://github.com/mlazos
2024-08-27 20:15:10 +00:00
fe6d0e3a04 Do not compute unnecessary tensor!=0 for bool tensors in count_nonzero (#134254)
Updated aten/src/ATen/native/TensorAdvancedIndexing.cpp to only reduce non-bool tensors before computing a sum.

Since I have no expertise in MPS, I left the MPS backend untouched. Also, in `count_nonzero_impl` for CPU, I assumed the comparison can be optimized by the compiler for boolean values (can it?): 90c821814e/aten/src/ATen/native/TensorAdvancedIndexing.cpp (L2262-L2264). Fixes #133983
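
A small sketch of the path in question (sizes are illustrative):

```python
import torch

t = torch.rand(1024, 1024) > 0.5      # bool tensor
# For bool inputs, count_nonzero can sum directly instead of first computing (t != 0).
print(torch.count_nonzero(t))
print(t.sum())                        # same count, for comparison
```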

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134254
Approved by: https://github.com/albanD
2024-08-27 20:09:29 +00:00
b744ed6816 Add a cpu_dispatch_key parameter to the cpu_fallback function (#134321)
Fixes #134322
Add a cpu_dispatch_key parameter to the cpu_fallback function to support fallback, for example, to SparseCPU.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134321
Approved by: https://github.com/albanD
2024-08-27 19:57:57 +00:00
adf401f822 Links to contributors' GitHub accounts (#133787)
Maintainers have the links to their GitHub profiles, but the major contributors do not have them.
I added the links to the contributors' GitHub accounts in case anyone wants to follow them.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133787
Approved by: https://github.com/albanD
2024-08-27 19:56:08 +00:00
534f43ddce [Doc] Fix rendering of the unicode characters (#134597)
https://github.com/pytorch/pytorch/pull/124771 introduced unicode escape sequences inside raw strings, which were not rendered correctly. Also fix typo in `\uue0 ` escape sequence (should have been `\u00e0`)
Fix it by relying on [string literal concatenation](https://docs.python.org/3/reference/lexical_analysis.html#string-literal-concatenation) to join raw and regular strings together during the lexical analysis stage

Fixes https://github.com/pytorch/pytorch/issues/134422

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134597
Approved by: https://github.com/aorenste, https://github.com/Skylion007
2024-08-27 19:52:46 +00:00
3ef4c27ab3 Update pt2e numeric debugger to use node.meta["custom"] field (#134040)
Summary:
With https://github.com/pytorch/pytorch/pull/131912 we now have a "custom" field in node.meta that can be preserved
in

* copy/deepcopy
* run_decompositions()
* serialization
* re-exporting

So we refactored numeric debugger to use this.

Test Plan:
python test/test_quantization.py TestNumericDebugger

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134040
Approved by: https://github.com/tarun292
2024-08-27 19:51:03 +00:00
ed494603c7 [inductor] calibration inductor windows uts (16/N) (#134587)
skip UT for `test/inductor/test_compiled_autograd.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134587
Approved by: https://github.com/jansel
2024-08-27 19:45:02 +00:00
b094972051 [inductor] calibration inductor windows uts (17/N) (#134588)
skip UTs for `test/inductor/test_minifier_isolate.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134588
Approved by: https://github.com/jansel
2024-08-27 19:41:17 +00:00
9d0e0e6f1d [inductor] calibration inductor windows uts (14/N) (#134585)
skip UT for `test/dynamo/test_exc.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134585
Approved by: https://github.com/jansel
2024-08-27 19:40:56 +00:00
05ac7cd760 [MPS] Remove superfluous label/link (#134090)
This was probably intended to be a comment. I removed it since the issue is already linked in the warning below.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134090
Approved by: https://github.com/albanD
2024-08-27 19:37:33 +00:00
d5aefadb17 [CD] Fix docker builds by installing setuptools (#134595)
Seeing failures like this:
```
#49 844.6 //build_scripts/manylinux1-check.py:6: DeprecationWarning: The distutils package is deprecated and slated for removal in Python 3.12. Use setuptools or check PEP 632 for potential alternatives
.....
[python 3/3] RUN bash build_scripts/build.sh && rm -r build_scripts:
846.9 ...it did, yay.
846.9 + for PYTHON in '/opt/python/*/bin/python'
846.9 + /opt/python/cpython-3.12.0/bin/python build_scripts/manylinux1-check.py
847.0 Traceback (most recent call last):
847.0   File "//build_scripts/manylinux1-check.py", line 55, in <module>
847.0     if is_manylinux1_compatible():
847.0        ^^^^^^^^^^^^^^^^^^^^^^^^^^
847.0   File "//build_scripts/manylinux1-check.py", line 6, in is_manylinux1_compatible
847.0     from distutils.util import get_platform
847.0 ModuleNotFoundError: No module named 'distutils'
------
```
PR: https://github.com/pytorch/pytorch/pull/134455

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134595
Approved by: https://github.com/kit1980, https://github.com/seemethere, https://github.com/malfet
2024-08-27 19:31:44 +00:00
a4b44dd2ef [AOTI] Introduce DeferredCudaGridLine for cuda cpp wrapper (#129268)
Summary: Similar to https://github.com/pytorch/pytorch/pull/129135, use DeferredCudaGridLine to create a deferred grid computation line when generating cpp wrapper.

Differential Revision: [D61800622](https://our.internmc.facebook.com/intern/diff/D61800622)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129268
Approved by: https://github.com/angelayi
2024-08-27 19:23:25 +00:00
5fd670e0ef [ROCM] Properly disable Flash Attention/Efficient Attention with environment variables (#133866)
Now `USE_FLASH_ATTENTION=0 USE_MEM_EFF_ATTENTION=0 python setup.py` can compile correctly

Fixes #125230

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133866
Approved by: https://github.com/jithunnair-amd, https://github.com/jeffdaily, https://github.com/malfet
2024-08-27 18:24:29 +00:00
5b392d22c6 Revert "fix stuck floordiv (#134150)"
This reverts commit 92c4771853892193d73d87bd60eca4dc7efc51d8.

Reverted https://github.com/pytorch/pytorch/pull/134150 on behalf of https://github.com/anijain2305 due to compile time regression internal ([comment](https://github.com/pytorch/pytorch/pull/134150#issuecomment-2313230404))
2024-08-27 18:23:44 +00:00
0159ebb654 [dtensor] add test for local_map decorator (#127752)
**Summary**
This PR is a follow-up of #126924 to address reviewer's comments:
1) add a test case to show the use of `local_map` as a function decorator.
2) simplify the logic of handling different data types of `out_placements`.
3) correct variable naming in test cases to match math formulas.

**Test**
see #126924

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127752
Approved by: https://github.com/wanchaol
2024-08-27 18:22:23 +00:00
8de0d7690c Use newer toAccumulateType signature in Normalization.cpp (#134540)
This fixes BatchNorm behavior when called with empty tensors on the MPS backend. Removed `expectedFailureMPS` in test_nn.py, deleted the expected failure in `test_mps.py`, and adjusted `skipIfMPS` to `expectedFailureMPS` in the BatchNorm2d OpInfo decorator, but restricted it to only the memory format tests.

Test Plan: CI + `python3 -c "import torch; print(torch.nn.BatchNorm2d(3, device='mps')(torch.rand(0, 3, 2, 2, device='mps')))"`

Fixes https://github.com/pytorch/pytorch/issues/134423

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134540
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-08-27 18:09:20 +00:00
68b1a09422 Integrate device agnostic APIs in FSDP library [1/n] (#134337)
Summary: For MTIA FSDP support, we need to ensure the FSDP library code handles accelerator devices not limited to CUDA.

Test Plan: CI

Reviewed By: hanzlfs

Differential Revision: D60587415

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134337
Approved by: https://github.com/LucasLLC, https://github.com/awgu
2024-08-27 17:31:11 +00:00
13049cd6e5 [aotinductor][UserDefinedTritonKernel] fix case with non-constexpr params declared after autotuned params (#134520)
## Context
In some user Triton kernels, we have this set-up for whatever reason.
```
@triton.jit
def mykernel(
  param0,
  param1,
  param2,
  param3: tl.constexpr,   # autotuned
  param4,                 # non-constexpr
):
  ...
```

This is an edge case because it's general practice to declare all constexpr params at the end.

And this will be an issue for AOTI because it fails to codegen all 4 params. That will surface as a device-side error: CUDA IMA, invalid argument...

```
>     void* kernel_args_var_0[] = {&var_0, &var_1, &var_2};
---
<     CUdeviceptr var_3;
<     AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_get_data_ptr(buf0, reinterpret_cast<void**>(&var_3)));
<     void* kernel_args_var_0[] = {&var_0, &var_1, &var_2, &var_3};
```

## Root-cause
* `kernel.constexpr` from the Kernel side-table contains the indices for all `constexpr` params that includes autotuned params.
* `raw_args`, that gets passed to wrapper codegen, excludes autotuned args.
* In the wrapper codegen, we try to find non-constexpr args using `kernel.constexpr` & `raw_args`. This is okay unless there's a `raw_arg` after an autotuned param in the function signature.

79b7fff188/torch/_inductor/codegen/cpp_wrapper_cuda.py (L118-L126)

## Fix
We fix this by calculating the right constexprs wrt `raw_args`.

An illustration
```
         raw_args: [arg0, arg1, arg2, arg4]
 kernel.arg_names: [param0, param1, param2, param3, param4]
kernel.constexprs: [3]                      # param3 is autotuned; this is correct wrt kernel.arg_names
constexpr_indices: []                       # this is correct wrt raw_args
```

Differential Revision: [D61831625](https://our.internmc.facebook.com/intern/diff/D61831625)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134520
Approved by: https://github.com/oulgen
2024-08-27 17:20:27 +00:00
13114da4ef [3/N] Set correct device to CUDA guards (#134357)
In `collective()`, `pointToPoint()` and `collectiveCoalesced()`, CUDA guards were created with an unset (default) CUDA device. This is the cause of the illegal memory access (IMA) hit by the NaN checker in issue https://github.com/pytorch/pytorch/issues/134062.

With this fix, `torch.cuda.set_device(device)` is not needed to work around the IMA.

Also refactored a couple places where the guard is created -- preferably we create the guard with a known device, rather than setting the device later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134357
Approved by: https://github.com/wconstab, https://github.com/shuqiangzhang
ghstack dependencies: #134300, #134345
2024-08-27 16:38:15 +00:00
be7752ead3 [2/N] Add flag to control which rank should perform NaN check (#134345)
Fixes https://github.com/pytorch/pytorch/issues/134062.
For example, in case of broadcast / scatter, only the root rank should perform the NaN check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134345
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
ghstack dependencies: #134300
2024-08-27 16:33:59 +00:00
9dc4bd7466 Create a JustknobConfig for use in config (#134161)
This is designed to be a more ergonomic interface on top of justknob_feature (see https://github.com/pytorch/pytorch/pull/134151 for just the PR with the base commits).

The idea is that people stop having to think about this as much, and can just do JustknobConfig("//the:thing", "FORCE_THING") and it'll do the right thing.

Primarily sending this to see how people feel about the API, and using it for new config changes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134161
Approved by: https://github.com/ezyang
2024-08-27 16:07:33 +00:00
94caba4899 [1/N] Move NaN check onto NCCL stream (#134300)
So that the tensor's lifetime management is the same as the management built for the NCCL pre- and post-kernels.
Also, so that on visualizers they show up in the NCCL stream line. Otherwise, if they show up in the compute line, the user may get confused ("my code does not have these kernels").

The check is thus moved after the point where we depend NCCL stream from the last compute kernel.

Also moved declaration of `checkForNan` from Utils.hpp to NCCLUtils.hpp, and renamed Utils.cu to NCCLUtils.cu.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134300
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
2024-08-27 16:02:27 +00:00
c582602245 Update partitioner's is_fusible heuristic to respect triton kernels (#134491)
Mutated arguments to triton kernels are fusible into the triton kernel.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134491
Approved by: https://github.com/Chillee
ghstack dependencies: #134364, #134466, #134490
2024-08-27 15:57:32 +00:00
761cf91e3c [DeviceMesh] Add get_all_submeshes in _MeshEnv (#134275)
Adding a private helper method for Shampoo HSDP use cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134275
Approved by: https://github.com/XilunWu
2024-08-27 14:51:19 +00:00
d028b810fe Fix flaky GroupNorm ModuleInfo test (#133899)
Fixes https://github.com/pytorch/pytorch/issues/98677

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133899
Approved by: https://github.com/albanD
2024-08-27 14:45:51 +00:00
2033934ff8 Clarify error messages for NEWOBJ and BUILD in weights_only unpickler (#134346)
Clarify that `add_safe_globals` will allow types for these instructions

Some types do not appear as `GLOBAL` and are only caught in `BUILD`, example from hf slack is `numpy.dtypes.UInt32DType`

```python
import torch
import numpy as np
from tempfile import TemporaryDirectory
from pathlib import Path
from codecs import encode

torch.serialization.add_safe_globals([encode, np.dtype, np.core.multiarray._reconstruct, np.ndarray])

with TemporaryDirectory() as tempdir:
    p = Path(tempdir)
    r2 = np.random.get_state()
    torch.save(r2, p / "r2.pkl")
    torch.load(p / "r2.pkl", weights_only=True)
```

Yields (error comes from BUILD)
```
UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
 Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error: Can only build Tensor, parameter or OrderedDict objects, but got <class 'numpy.dtypes.UInt32DType'>
```

The reasoning is that `numpy.dtypes.UInt32DType` is constructed via `REDUCE` with `func=<class 'numpy.dtype'>` and `args=('u4', False, True)`, so the error message is clarified to state that calling `add_safe_globals` on these types will also allow them.

After this PR error message becomes

```
_pickle.UnpicklingError: Weights only load failed. Re-running `torch.load` with `weights_only` set to `False` will likely succeed, but it can result in arbitrary code execution. Do it only if you got the file from a trusted source.
Please file an issue with the following so that we can make `weights_only=True` compatible with your use case: WeightsUnpickler error: Can only build Tensor, Parameter, OrderedDict or types allowlisted via `add_safe_globals`, but got <class 'numpy.dtypes.UInt32DType'>
```
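For illustration, a hedged sketch of how a user could act on the new message for the example above (the exact allowlist depends on what the pickle contains; `numpy.dtypes.UInt32DType` requires a recent NumPy):

```python
import numpy as np
import torch

# Assumed to be the missing piece for the example above, on top of the globals
# already allowlisted there.
torch.serialization.add_safe_globals([np.dtypes.UInt32DType])
```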

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134346
Approved by: https://github.com/albanD
2024-08-27 14:45:39 +00:00
2ac710e667 Make torch.serialization.set_default_mmap_options usable as a context manager (#134371)
As title
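A minimal usage sketch of the context-manager form this enables (the flag value and file name are assumptions for illustration):

```python
import mmap

import torch

# Scope a non-default mmap flag to this block instead of setting it globally.
with torch.serialization.set_default_mmap_options(mmap.MAP_SHARED):
    state_dict = torch.load("checkpoint.pt", mmap=True, weights_only=True)
```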

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134371
Approved by: https://github.com/albanD
2024-08-27 14:45:29 +00:00
0fa0ac80e4 Do not use <filesystem> on Linux (#134494)
Because right now it leads to symbol conflicts in binary builds.
Use of `std::filesystem::file_exists` was introduced by https://github.com/pytorch/pytorch/pull/126601 and in this PR it is replaced with a very straightforward implementation that calls `stat` on the given path, the classic C way of checking for file existence.

This PR should be reverted once we figure out how to keep the `std::filesystem` methods linked into the binary private.

Fixes symptoms of https://github.com/pytorch/pytorch/issues/133437

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134494
Approved by: https://github.com/atalman, https://github.com/d4l3k
2024-08-27 14:44:10 +00:00
3418708abf Revert "[FlexAttention] Create new variables for the subgraphs (#134507)"
This reverts commit 4d0a44d34a46af6dcc764d55269b30ac537822a0.

Reverted https://github.com/pytorch/pytorch/pull/134507 on behalf of https://github.com/albanD due to Broke lint due to too long line ([comment](https://github.com/pytorch/pytorch/pull/134507#issuecomment-2312505955))
2024-08-27 13:05:27 +00:00
87a3f664e1 Revert "[FlexAttention] Remove unused code (#134511)"
This reverts commit 767c47d3c0ee3fc7804918a08de3f94874143a03.

Reverted https://github.com/pytorch/pytorch/pull/134511 on behalf of https://github.com/albanD due to Broke lint due to too long line ([comment](https://github.com/pytorch/pytorch/pull/134507#issuecomment-2312505955))
2024-08-27 13:05:27 +00:00
3e10a1eb5a Revert "[FlexAttention] Fix Sparse block multiple to ceildiv instead for floor div (#134538)"
This reverts commit a34320a6f225061a3b5fe130a5a8fe35ed7a40f9.

Reverted https://github.com/pytorch/pytorch/pull/134538 on behalf of https://github.com/albanD due to Broke lint due to too long line ([comment](https://github.com/pytorch/pytorch/pull/134507#issuecomment-2312505955))
2024-08-27 13:05:27 +00:00
c7cbcdad76 Update partitioner's is_fusible heuristic to respect auto_functionalized (#134490)
We say Node a is fusible into node b if node b is an auto_functionalized
node that may reinplace node a later on.

This PR also changes aten.empty to be recomputable w.r.t the Partitioner
(it is, like aten.zeros, cheap to recompute and fusible into other ops).

Fixes https://github.com/pytorch/pytorch/issues/134468

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134490
Approved by: https://github.com/Chillee
ghstack dependencies: #134364, #134466
2024-08-27 13:05:01 +00:00
dde5974b13 Implementation for rng ops on hpu and xpu (#133068)
Implementation for high_order_op::run_and_save_rng_state and high_order_op::run_with_rng_state on HPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133068
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/jansel, https://github.com/anijain2305
2024-08-27 11:34:37 +00:00
FEI
ef8236f12b Provide default value None for the attn_bias parameter(#133981) (#133986)
Fixes #133981

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133986
Approved by: https://github.com/ezyang
2024-08-27 11:10:43 +00:00
a34320a6f2 [FlexAttention] Fix Sparse block multiple to ceildiv instead for floor div (#134538)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134538
Approved by: https://github.com/yanboliang
ghstack dependencies: #134495, #134507, #134511
2024-08-27 09:53:19 +00:00
767c47d3c0 [FlexAttention] Remove unused code (#134511)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134511
Approved by: https://github.com/yanboliang
ghstack dependencies: #134495, #134507
2024-08-27 09:53:19 +00:00
4d0a44d34a [FlexAttention] Create new variables for the subgraphs (#134507)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134507
Approved by: https://github.com/yanboliang, https://github.com/BoyuanFeng
ghstack dependencies: #134495
2024-08-27 09:53:13 +00:00
f480385277 Remove explicit Amz2023 reference from jobs (#134355)
Changes jobs to go back to using the default AMI.

Note: This is only a cleanup PR. It does NOT introduce any behavior changes in CI

Now that the default variant uses the Amazon 2023 AMI and has been shown to be stable for a week, it's time to remove the explicit amz2023 references and go back to using the default variant.

After a week or two, when this is rolled out to most people, we can remove the variants from scale config as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134355
Approved by: https://github.com/jeanschmidt
2024-08-27 08:51:42 +00:00
0916d72e99 Fix the warning for cat operators with same qparams (#133999)
Summary:
Currently the warning is printed when the cat inputs have the same qparams, leading to a flood of warnings.
This diff emits the warning only when cat inputs don't have the same qparams.

Test Plan: CI

Reviewed By: aprotopopov

Differential Revision: D60638609

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133999
Approved by: https://github.com/tarun292
2024-08-27 08:21:39 +00:00
3515090006 Fix TypeError when iterating NoneType in instantiate_device_type_tests() (#134457)
Fixes #134454

Fix the TypeError introduced by https://github.com/pytorch/pytorch/pull/133082, which called iter() on the NoneType default args ``except_for`` and ``only_for``.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134457
Approved by: https://github.com/shink, https://github.com/albanD
2024-08-27 07:13:36 +00:00
136b19b062 Adding entry-point based support for out-of-tree rendezvous plugins (#132633)
Fixes #127519

Currently in torchrun rendezvous, there are only two rendezvous backends supported out of the box: `C10d` and `Etcd`. The changes in this PR enable distributed elastic users to bring their out-of-tree rendezvous backend implementations as Python packages.

#### AUTHORING NEW PLUGIN
Any new plugin will be a python package exposing entry-points. For example, the structure of redis plugin is as follows:

```
plugin_root
|_ pyproject.toml
|_ src
   |_ redis
      |_ __init__.py
      |_ redis_store.py
      |_ redis_backend.py
```

The contents of the `pyproject.toml` should indicate that this package exposes a torchrun entry-point by mentioning the group name `torchrun.plugins`. The `pyproject.toml` for the redis plugin would be as follows:

```
[project]
name = "redis"
version = "0.0.1"

[project.entry-points.'torchrun.plugins']
redis = 'redis'
```

The `src/redis/__init__.py` file would contain functions that return the plugin name and plugin handler. The contents of `__init__.py` for redis would be as follows:

```
# Imports needed to make the snippet self-contained.
from torch.distributed.elastic.rendezvous import RendezvousParameters
from torch.distributed.elastic.rendezvous.dynamic_rendezvous import create_handler


def getPluginHandler():
    def _create_redis_handler(params: RendezvousParameters):
        from redis_rendezvous_backend import create_backend
        backend, store = create_backend(params)
        return create_handler(store, backend, params)
    return _create_redis_handler
```

The files `redis_store` and `redis_backend` contain the implementation of [Store](41189b0da4/torch/_C/_distributed_c10d.pyi (L171)) and [RendezvousBackend](e782918b8e/torch/distributed/elastic/rendezvous/dynamic_rendezvous.py (L61)) respectively.

#### USER EXPERIENCE
Before using the plugin for the first time, the user has to install the plugin packages. For example, published packages can be installed using `pip3 install <plugin-name>`, and a plugin in the local file system can be installed using `pip3 install -e <plugin-location>`.

Once installed, the new backend can be used in torchrun as follows:

```
torchrun --rdzv-backend=redis --rdzv-endpoint=redis-container:6379 --nnodes=3 --nproc-per-node=1 --max-restarts=3 --rdzv-id=1 test.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132633
Approved by: https://github.com/wconstab
2024-08-27 07:09:41 +00:00
4a18fcf7af [inductor] calibration inductor windows uts (12/N) (#134428)
enable Windows inductor UTs for `test/inductor/test_torchinductor_codegen_dynamic_shapes.py`

The failures depend on https://github.com/pytorch/pytorch/pull/134429; need to rebase after https://github.com/pytorch/pytorch/pull/134429 is merged.
```cmd
2024-08-25T23:57:23.2747794Z Windows CI does not have necessary dependencies for test_torchinductor_dynamic_shapes yet
2024-08-25T23:57:23.2748541Z Traceback (most recent call last):
2024-08-25T23:57:23.2749593Z   File "C:\actions-runner\_work\pytorch\pytorch\test\inductor\test_torchinductor_codegen_dynamic_shapes.py", line 30, in <module>
2024-08-25T23:57:23.2750688Z     from inductor.test_torchinductor_dynamic_shapes import (
2024-08-25T23:57:23.2751877Z   File "C:\actions-runner\_work\pytorch\pytorch\test\inductor\test_torchinductor_dynamic_shapes.py", line 46, in <module>
2024-08-25T23:57:23.2752876Z     raise unittest.SkipTest("requires sympy/functorch/filelock")
2024-08-25T23:57:23.2753545Z unittest.case.SkipTest: requires sympy/functorch/filelock
2024-08-25T23:57:23.2754077Z Got exit code 1
2024-08-25T23:57:23.2754874Z No stepcurrent file found. Either pytest didn't get to run (e.g. import error) or file got deleted (contact dev infra)
```

Local test pass:
<img width="1892" alt="image" src="https://github.com/user-attachments/assets/241ab082-6026-4f33-b3ac-7e9ef7da744d">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134428
Approved by: https://github.com/jansel
2024-08-27 05:43:07 +00:00
0b81f700aa [PT2/Profiler] Add Context Info to Torch-Compiled Regions (#132765)
Summary:
We want to add compile IDs and frames to each Torch-Compiled Region in order to help users cross reference the section they are checking alongside data obtained from tools, such as tlparse.
This diff operates on the assumption that each graph section will enter and exit a CompileContext before it is run, to either compile the graph or look it up in the cache. Based on this assumption, we can save the value of the graph section from the exited CompileContext in eval_frame.c using a Python C API. After this, we can create a new interface in the cpp shim to wrap around record_function in order to pass in the new keyword argument for "context".

Test Plan:
Enhance test_profiler_dynamo_compiled_region to look for kwinputs as well as a name to see that the context is now labeled. Also changed test to run graph with more contexts so that we test a wider range of profiling.

Differential Revision: D60803317

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132765
Approved by: https://github.com/anijain2305
2024-08-27 04:55:04 +00:00
de57a6e806 Back out "[dynamo][exception] Support raise exception from None (#134028)" (#134513)
Summary:
The original diff is causing the error "attempting to assign a gradient with dtype 'c10::BFloat16' to a tensor with dtype 'float'".

The context is in: https://fb.workplace.com/groups/1075192433118967/permalink/1491357138169159/

Test Plan: After reverting, the above issue is gone, details are in https://fb.workplace.com/groups/1075192433118967/permalink/1491357138169159/

Differential Revision: D61820520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134513
Approved by: https://github.com/anijain2305
2024-08-27 02:57:14 +00:00
02b0b524b5 [inductor] Turn on UT: test_randint_int64_mod (#134510)
It was fixed by https://github.com/pytorch/pytorch/pull/134229, so turn it on.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134510
Approved by: https://github.com/ezyang
2024-08-27 02:33:07 +00:00
d0147290d8 [BE][Easy][dynamo] ensure trace_rules.MOD_INLINELIST in alphabetical order (#134246)
Stack from [ghstack](https://github.com/ezyang/ghstack) (oldest at bottom):

* __->__ #134246
* #133987

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134246
Approved by: https://github.com/yanboliang
2024-08-27 02:29:43 +00:00
cyy
2ee201a7d0 [CMake] Remove BUILDING_WITH_TORCH_LIBS (#134434)
Since BUILDING_WITH_TORCH_LIBS is not used now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134434
Approved by: https://github.com/ezyang
2024-08-27 01:48:21 +00:00
bdfc1d3987 Remove unnecessary expect_true in split_with_sizes (#133439)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133439
Approved by: https://github.com/albanD
2024-08-27 01:34:00 +00:00
c7ca89a11a Improve print stack/locals printing in comptime (#133651)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133651
Approved by: https://github.com/anijain2305
2024-08-27 01:29:50 +00:00
58771315d3 Unify lowerings for auto_functionalized and triton_kernel_wrapper_functional (#134466)
Fixes https://github.com/pytorch/pytorch/issues/134372

The triton_kernel_wrapper_functional lowering was causing problems (it
was generating small kernels with NaNs in them, probably from realizing
aten.empty nodes). Instead of having its own manual lowering, we change
triton_kernel_wrapper_functional to go the same route as
auto_functionalized, where we decompose the node into clone + mutation
nodes.

Test Plan:
- new test
- existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134466
Approved by: https://github.com/oulgen, https://github.com/eellison
ghstack dependencies: #134364
2024-08-27 00:53:05 +00:00
141a9c7204 Revert "[export] enumerate unsupported sympy.Functions (#134271)"
This reverts commit ddd71e34797f3bb56a048058e007a2df87c5755f.

Reverted https://github.com/pytorch/pytorch/pull/134271 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/134271#issuecomment-2311353460))
2024-08-27 00:45:00 +00:00
4df10a6340 [FlexAttention] Fix bug when checking whether to return LSE (#134495)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134495
Approved by: https://github.com/yanboliang, https://github.com/Chillee, https://github.com/BoyuanFeng
2024-08-27 00:31:46 +00:00
b98d33c155 [inductor] calibration inductor windows uts (13/N) (#134429)
enable Windows inductor UTs for `test/inductor/test_torchinductor_dynamic_shapes.py`

Local test pass:
<img width="1885" alt="image" src="https://github.com/user-attachments/assets/4b96b6d9-715f-4c94-8059-9ee0afaaa574">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134429
Approved by: https://github.com/jansel
2024-08-27 00:16:16 +00:00
74341e1150 [dynamo] simplify implementation for os.fspath (#133801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133801
Approved by: https://github.com/anijain2305
ghstack dependencies: #133771
2024-08-27 00:08:04 +00:00
1dbd3476de [dynamo][itertools] support itertools.tee (#133771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133771
Approved by: https://github.com/jansel
2024-08-27 00:08:04 +00:00
43bbd781f2 Back out "[Traceable FSDPS] Allow tracing through FSDP2 impl in trace_rules.py (#133532)" (#134478)
Summary:
Original commit changeset: 0215a41433e9

Original Phabricator Diff: D61432583

D61432583 causes FSDP2 to get stuck in PT2 compilation when applied to FB-FM-v4.

With D61432583:
https://www.internalfb.com/mast/job/aps-ckluk-745e763d6a

After backing out D61432583:
https://www.internalfb.com/mast/job/aps-ckluk-f9604ea1f9

Test Plan:
hg graft D61774888
scripts/ckluk/aps/mast_joint_arch_exploration_cmf_updated_fbfm_v3_fsdp2_qps.sh

Differential Revision: D61802689

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134478
Approved by: https://github.com/yf225
2024-08-27 00:07:28 +00:00
46ecc673ae [ROCm] Prevent accidental enablement of efficient attention. (#133331)
Currently Efficient attention and Flash attention share the same set of GPU
kernels on ROCM and have common limitations on head sizes.

Fixes https://github.com/pytorch/pytorch/issues/132004

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133331
Approved by: https://github.com/malfet, https://github.com/jithunnair-amd
2024-08-27 00:03:45 +00:00
0be6584203 [Inductor UT] Refine test case test_codegen_upcast_to_fp32_upcast to pass on XPU. (#134474)
[Inductor UT] Refine test case test_codegen_upcast_to_fp32_upcast to pass on XPU.
Fix issue: #134476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134474
Approved by: https://github.com/jansel
2024-08-26 23:59:26 +00:00
1565940114 [MPS] Add test/test_nn.py to test suite (#134184)
This PR increases test coverage by including the tests in `test/test_nn.py` in the test suite of MPS.

Some of the tests are decorated with `@expectedFailureMPS` for various reasons: either the op is not implemented, or the outputs do not align. The tests with differing results should be investigated further to rule out any live bugs.

```bash
$ python test/run_test.py --mps --verbose -k TestNN
Running test batch 'tests to run' cost 84.76 seconds
```

Ref #133520

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134184
Approved by: https://github.com/albanD, https://github.com/malfet
2024-08-26 23:48:23 +00:00
79b7fff188 Fix docstring for torch.signal.windows.nuttall (#134512)
This partially fixes a regression introduced by https://github.com/pytorch/pytorch/pull/124771, but also just improves the `z_n` rendering by using MathML.
In 2.3 it was [rendered](https://pytorch.org/docs/2.3/generated/torch.signal.windows.nuttall.html#torch.signal.windows.nuttall)
as
<img width="177" alt="image" src="https://github.com/user-attachments/assets/2c15d1f9-13ad-483f-bb66-41fa3fa4ba9c">

With this change it'll be [rendered](https://docs-preview.pytorch.org/pytorch/pytorch/134512/generated/torch.signal.windows.nuttall.html#torch.signal.windows.nuttall) as
```math
z_n = \frac{2 \pi n}{M}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134512
Approved by: https://github.com/kit1980, https://github.com/aorenste, https://github.com/atalman
2024-08-26 22:51:43 +00:00
ddd71e3479 [export] enumerate unsupported sympy.Functions (#134271)
There are 2 concepts of unsupported sympy.Functions in symbolic_shapes:
1) unsupported by the export solver, meaning the solver doesn't know how to provide useful fixes for those functions
2) unsupported by the sympy interpreter - meaning we can't reify them into FX nodes because the functions aren't present in PythonReferenceAnalysis

This splits the current call into a call for each version, with the Export solver the only user of 1). For 1), we enumerate the functions in _sympy/functions.py, and subtract the functions we know we can support. For 2) there's only 3 functions we've seen pop up in test cases.

Differential Revision: D61677956

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134271
Approved by: https://github.com/avikchaudhuri
2024-08-26 22:44:12 +00:00
55236d0cb7 TestForeach::test_parity: Remove check for error message text (#134251)
Previously, error messages were expected to be string equivalent to
error messages thrown by the ref function.  This check fails for dozens
of torch functions, and doesn't appear to add much value for the end
user.  This commit removes this check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134251
Approved by: https://github.com/amjames, https://github.com/janeyx99
ghstack dependencies: #134253, #134344
2024-08-26 22:40:54 +00:00
ef8c474fcf Add the fast path for bfloat16 lgamma (#134344)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134344
Approved by: https://github.com/amjames, https://github.com/janeyx99
ghstack dependencies: #134253
2024-08-26 22:40:54 +00:00
3c5883e550 Fix test_parity xfail for sigmoid (#134253)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134253
Approved by: https://github.com/amjames, https://github.com/janeyx99
2024-08-26 22:40:54 +00:00
a23dae22d5 Update AC pass use_reentrant message (#134472)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134472
Approved by: https://github.com/albanD
2024-08-26 21:57:38 +00:00
dbef2b05b4 [dynamo] Cache _dynamo.disable results (#134272)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134272
Approved by: https://github.com/yf225, https://github.com/jansel
2024-08-26 21:04:15 +00:00
28a4db84f2 [ARM] Fix infinite recursion in unwind (#134387)
Fixes #119905

The `TORCH_SHOW_CPP_STACKTRACES=1` setting on ARM causes an infinite recursive unwind because, on failure, a `StackTraceFetcher` attempts to unwind the <ins>failed instruction</ins>: 5ad759ca33/torch/csrc/profiler/combined_traceback.cpp (L25)
then the unwind itself fails:
5ad759ca33/torch/csrc/profiler/unwind/unwind.cpp (L10-L12)
which triggers yet another attempt to unwind the failure in `unwind()`...

In summary, the executed instruction is equivalent to:
```C++
std::vector<void*> unwind() {
  // some instructions ...
  return unwind();
}
```
This PR replaces `TORCH_CHECK` with `TORCH_WARN_ONCE`, which does not cause an uncontrolled recursion. The only side effect is an empty back-trace.

Huge thanks to @nWEIdia who found the root cause!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134387
Approved by: https://github.com/eqy, https://github.com/nWEIdia, https://github.com/malfet
2024-08-26 21:02:31 +00:00
900c5083ed [inductor] calibration inductor windows uts (9/N) (#134425)
enable Windows inductor UTs of `test/inductor/test_binary_folding.py`

The failing UT depends on https://github.com/pytorch/pytorch/pull/134427.
Need to rebase after https://github.com/pytorch/pytorch/pull/134427 is merged.
```cmd
2024-08-25T23:32:23.0905727Z Traceback (most recent call last):
2024-08-25T23:32:23.0906516Z   File "C:\actions-runner\_work\pytorch\pytorch\test\inductor\test_binary_folding.py", line 18, in <module>
2024-08-25T23:32:23.0908200Z     from inductor.test_inductor_freezing import TestCase
2024-08-25T23:32:23.0909883Z   File "C:\actions-runner\_work\pytorch\pytorch\test\inductor\test_inductor_freezing.py", line 39, in <module>
2024-08-25T23:32:23.0911128Z     raise unittest.SkipTest("requires sympy/functorch/filelock")
2024-08-25T23:32:23.0911801Z unittest.case.SkipTest: requires sympy/functorch/filelock
2024-08-25T23:32:23.0912370Z Got exit code 1
2024-08-25T23:32:23.0913155Z No stepcurrent file found. Either pytest didn't get to run (e.g. import error) or file got deleted (contact dev infra)
```

Local test pass:
<img width="1898" alt="image" src="https://github.com/user-attachments/assets/4a6e3f66-4bbc-4aab-8f0d-2e2318046e53">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134425
Approved by: https://github.com/ezyang, https://github.com/jansel
2024-08-26 20:57:41 +00:00
68624cf089 [dynamo][guards] De-dupe DUPLICATE_INPUT guard (#134354)
Hard to write a test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134354
Approved by: https://github.com/jansel
2024-08-26 20:48:57 +00:00
af82dc816a Fix lint failures (#134488)
Introduced by https://github.com/pytorch/pytorch/pull/131000

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134488
Approved by: https://github.com/Skylion007, https://github.com/msaroufim, https://github.com/albanD, https://github.com/atalman
2024-08-26 20:13:21 +00:00
2588b5e51a Move module_tracker to logging for confused hierarchy (#134467)
Fixes https://github.com/pytorch/pytorch/issues/134242

Make sure to never raise an error when confused. Logs for confusion can be enabled with `TORCH_LOGS="torch.utils.module_tracker"` or the usual python systems.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134467
Approved by: https://github.com/malfet
2024-08-26 19:39:08 +00:00
a0e062c6f1 Add mean.dtype_out (#133506)
Give it a try and see if CI is happy
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133506
Approved by: https://github.com/bdhirsh
2024-08-26 19:26:11 +00:00
eqy
3541e450af Support larger page sizes with use_mmap_weights (#131000)
Fixes e.g., `test_large_mmaped_weights_non_abi_compatible_cuda` on machines with 64K page size

CC @malfet @tinglvv @nWEIdia

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131000
Approved by: https://github.com/malfet
2024-08-26 18:35:55 +00:00
3322ee236d [aoti] remove c_shim_version v1 logic (#134283)
Summary: Previously, https://github.com/pytorch/pytorch/pull/132750 and https://github.com/pytorch/pytorch/pull/133105 set c_shim_version to 2 for all cases. So removing c_shim_version logic.

Test Plan: ci

Differential Revision: D61574695

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134283
Approved by: https://github.com/desertfire
2024-08-26 18:29:40 +00:00
1d231ff8ba [HOO] add hints_wrapper to support passing context hints (#132860)
Fixes #126393

The implementation code is based on feedback here (https://github.com/pytorch/pytorch/pull/121639#issuecomment-2223948842).

Hints are passed as kwargs of hints_wrapper op. It also supports nested hints.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132860
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2024-08-26 18:21:22 +00:00
1ccc8f0200 [dynamo][super] Improve handling of getattr on super (#134039)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134039
Approved by: https://github.com/yanboliang, https://github.com/jansel
2024-08-26 18:20:39 +00:00
1dd4b9221b [inductor] enable clang for Windows inductor (#134444)
Changes:
1. Add Windows clang-cl compiler check.
2. Add openmp config for clang-cl.
3. Preload libomp.dll when using clang.
4. Add compiler flags syntax check for `clang` and `clang++`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134444
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/malfet
2024-08-26 18:19:59 +00:00
0a3c064c12 [inductor] fix _maybe_subprocess_run not support Windows path (#134365)
Windows file paths use `\` as the delimiter, which is also an escape character. We need to translate every `\` in paths to `/`, as on Linux (see the sketch below).
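A hedged sketch of the idea (the helper name is an assumption; the actual change lives in the repro-code handling of `_maybe_subprocess_run`):

```python
def normalize_path_for_codegen(path: str) -> str:
    # Backslashes in a Windows path are treated as escape sequences once the
    # path is embedded in generated Python source, so switch to forward
    # slashes, which Windows also accepts.
    return path.replace("\\", "/")

print(normalize_path_for_codegen(r"C:\Users\Xuhan\AppData\Local\Temp\tmpufu9t3pc"))
# C:/Users/Xuhan/AppData/Local/Temp/tmpufu9t3pc
```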

Reproduce UTs:
```cmd
pytest test\dynamo\test_minifier.py -v -k test_after_dynamo_cpu_accuracy_error
```

Error message:
```cmd
____________________________________________________________________________________________________________ MinifierTests.test_after_dynamo_cpu_accuracy_error _____________________________________________________________________________________________________________
Traceback (most recent call last):
  File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_minifier.py", line 40, in test_after_dynamo_cpu_accuracy_error
    self._test_after_dynamo(
  File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_minifier.py", line 27, in _test_after_dynamo
    self._run_full_test(run_code, "dynamo", expected_error, isolate=False)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_dynamo\test_minifier_common.py", line 235, in _run_full_test
    self.assertIn(expected_error, test_proc.stderr.decode("utf-8"))
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 1112, in assertIn
    self.fail(self._formatMessage(msg, standardMsg))
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 675, in fail
    raise self.failureException(msg)
AssertionError: 'AccuracyError' not found in 'Traceback (most recent call last):\n  File "C:\\Users\\Xuhan\\.conda\\envs\\win_mkl_static\\lib\\site-packages\\torch\\_dynamo\\test_minifier_common.py", line 114, in _maybe_subprocess_run\n    exec(code, {"__name__": "__main__", "__compile_source__": code})\n  File "<string>", line 9\n    torch._dynamo.config.debug_dir_root = "C:\\Users\\Xuhan\\AppData\\Local\\Temp\\tmpufu9t3pc"\n                                                                                         ^\nSyntaxError: (unicode error) \'unicodeescape\' codec can\'t decode bytes in position 2-3: truncated \\UXXXXXXXX escape\n'

To execute this test, run the following from the base repo dir:
    python test\dynamo\test_minifier.py MinifierTests.test_after_dynamo_cpu_accuracy_error

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
--------------------------------------------------------------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------------------------------------------------------------
test stdout:
test stderr: Traceback (most recent call last):
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_dynamo\test_minifier_common.py", line 114, in _maybe_subprocess_run
    exec(code, {"__name__": "__main__", "__compile_source__": code})
  File "<string>", line 9
    torch._dynamo.config.debug_dir_root = "C:\Users\Xuhan\AppData\Local\Temp\tmpufu9t3pc"
                                                                                         ^
SyntaxError: (unicode error) 'unicodeescape' codec can't decode bytes in position 2-3: truncated \UXXXXXXXX escape

--------------------------------------------------------------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------------------------------------------------------------
running test
```
Local test passed:
<img width="849" alt="image" src="https://github.com/user-attachments/assets/4a4eecc2-7c08-4de6-9395-546b69803b16">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134365
Approved by: https://github.com/jansel, https://github.com/jgong5
2024-08-26 17:48:11 +00:00
78128cbdd8 [CD] Use ephemeral arm64 runners for nightly and docker builds (#134473)
Follow up after adding linux arm64 ephemeral instances: https://github.com/pytorch/pytorch/pull/134469
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134473
Approved by: https://github.com/malfet
2024-08-26 17:47:20 +00:00
0f5b052dba [inductor] calibration inductor windows uts (11/N) (#134427)
enable Windows inductor UTs of `test/inductor/test_inductor_freezing.py`

Local test pass:
<img width="1891" alt="image" src="https://github.com/user-attachments/assets/f3a873b4-abb5-4047-92f8-8e6da7c67315">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134427
Approved by: https://github.com/jansel
2024-08-26 17:43:58 +00:00
cyy
73604eed0c [20/N] Fix clang-tidy warnings in jit (#133399)
Follows #133067

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133399
Approved by: https://github.com/Skylion007
2024-08-26 17:43:52 +00:00
019b80855f [inductor] calibration inductor windows uts (10/N) (#134426)
enable Windows inductor UT of `test/inductor/test_efficient_conv_bn_eval.py`

Local test pass:
<img width="1892" alt="image" src="https://github.com/user-attachments/assets/8a94c5e4-68bf-4a6f-8a1b-60d6ede14882">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134426
Approved by: https://github.com/jansel
2024-08-26 17:43:36 +00:00
7ff576072f [inductor] calibration inductor windows uts (8/N) (#134424)
enable Windows inductor UTs of `test/inductor/test_benchmark_fusion.py`

Local test pass:
<img width="1912" alt="image" src="https://github.com/user-attachments/assets/5be34b0c-9411-4430-927e-3313245f7c13">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134424
Approved by: https://github.com/ezyang
2024-08-26 17:38:53 +00:00
adcce538b7 Revert "Allow mp.start_processes to create processes in parallel (#133707)"
This reverts commit 3546628a2a167ace6060737eeccf8ee8fd87ddc0.

Reverted https://github.com/pytorch/pytorch/pull/133707 on behalf of https://github.com/ZainRizvi due to sorry but trunk has been consistently broken since this PR was merged. See: [GH job link](https://github.com/pytorch/pytorch/actions/runs/10529617600/job/29191757055) [HUD commit link](3546628a2a) ([comment](https://github.com/pytorch/pytorch/pull/133707#issuecomment-2310709523))
2024-08-26 17:31:10 +00:00
d0ac5d55ba Memory optimization for DSD for TorchTune LoRA (#134025)
Optimize memory cost at [PR#129635](https://github.com/pytorch/pytorch/pull/129635)

There are 2 main parts to the optimization here:
1. Optimize the tensor distribution: postpone the full_tensor generation, which avoids the memory overlap and saves around 50% peak memory in the 2-param test case.
2. Apply `assign=True` in `load_state_dict`, which saves memory during state dict loading by assigning the input params instead of copying them, again around 50% peak memory in the loading part (see the sketch below).
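A minimal sketch of the `assign=True` loading path from item 2 (the model and state dict here are placeholders):

```python
import torch
import torch.nn as nn

# Parameters on the meta device are not materialized, so nothing is allocated
# up front; assign=True then adopts the loaded tensors directly instead of
# copying into pre-allocated storage, avoiding a second full copy in memory.
model = nn.Linear(4, 4, device="meta")
state_dict = {"weight": torch.randn(4, 4), "bias": torch.randn(4)}
model.load_state_dict(state_dict, assign=True)
```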

Future work:
Memory optimization for the optimizer will be conducted in the next PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134025
Approved by: https://github.com/fegin

Co-authored-by: Rachel Guo <guorachel@meta.com>
2024-08-26 17:24:25 +00:00
fc61aae70f Remove color in CI (#133517)
Remove color by default to make CI logs easier to read

Example of color
<img width="569" alt="image" src="https://github.com/user-attachments/assets/0da13544-98b1-47be-8383-64a5b3fd8951">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133517
Approved by: https://github.com/ZainRizvi
2024-08-26 16:58:06 +00:00
42955e04f1 Revert "[dynamo] Cache _dynamo.disable results (#134272)"
This reverts commit a699bd11551e9755bb9238c6b82c369880789397.

Reverted https://github.com/pytorch/pytorch/pull/134272 on behalf of https://github.com/ZainRizvi due to Fails internal tests ([comment](https://github.com/pytorch/pytorch/pull/134272#issuecomment-2310649115))
2024-08-26 16:57:53 +00:00
e94bdc7876 Revert "[dynamo][guards] De-dupe DUPLICATE_INPUT guard (#134354)"
This reverts commit cdb9df5efe78142b7a612ae9c938ddf8a8850d10.

Reverted https://github.com/pytorch/pytorch/pull/134354 on behalf of https://github.com/ZainRizvi due to Fails internal tests ([comment](https://github.com/pytorch/pytorch/pull/134272#issuecomment-2310649115))
2024-08-26 16:57:53 +00:00
a6fac0e969 Use ephemeral runners for windows nightly builds (#134463)
This is definition of windows.4xlarge:

```
  windows.4xlarge:
    disk_size: 256
    instance_type: c5d.4xlarge
    is_ephemeral: true
    max_available: 420
    os: windows
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134463
Approved by: https://github.com/jeanschmidt
2024-08-26 16:33:19 +00:00
b417e32da2 [CD] fix xpu nightly wheel test env (#134395) (#134464)
Because https://github.com/pytorch/builder/pull/1972 landed, the xpu env would be sourced twice in the nightly wheel test.
Works for https://github.com/pytorch/pytorch/issues/114850

Reland of #134395, to be landed with pytorchmergebot.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134464
Approved by: https://github.com/jeanschmidt

Co-authored-by: Wang, Chuanqi <chuanqi.wang@intel.com>
2024-08-26 15:35:48 +00:00
c507f402f1 Add linux arm64 ephemeral runners (#134469)
Should be landed with: https://github.com/pytorch/test-infra/pull/5593

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134469
Approved by: https://github.com/jeanschmidt, https://github.com/clee2000
2024-08-26 15:32:45 +00:00
17e8a51ff2 Revert "[inductor]Let output or input_as_strided match exact strides (#130956)"
This reverts commit a63efee5cd422db0aabe5d02d2fe35fef9be7978.

Reverted https://github.com/pytorch/pytorch/pull/130956 on behalf of https://github.com/ZainRizvi due to sorry but this seems to cause internal tests to fail. Please see D61771533 for details ([comment](https://github.com/pytorch/pytorch/pull/130956#issuecomment-2310490049))
2024-08-26 15:31:23 +00:00
1c4780e69a Revert "c10d/logging: add C10D_LOCK_GUARD (#134131)"
This reverts commit 4c28a0eb0ba437c1b7db559f63f8bec17bd48f69.

Reverted https://github.com/pytorch/pytorch/pull/134131 on behalf of https://github.com/ZainRizvi due to Sorry but this causes formatting errors internally which make it fail to build. See D61759282 ([comment](https://github.com/pytorch/pytorch/pull/134131#issuecomment-2310455878))
2024-08-26 15:19:27 +00:00
50e90d7203 Revert "[dynamo] simplify implementation for functools.reduce (#133778)"
This reverts commit 6c0b15e3828b8e2a0bd726a3e5d4e98c8ced5efe.

Reverted https://github.com/pytorch/pytorch/pull/133778 on behalf of https://github.com/ZainRizvi due to Sorry, but this breaks internal tests because of using functools ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2310445169))
2024-08-26 15:16:17 +00:00
472c7cf962 Revert "[dynamo] simplify implementation for builtins.sum (#133779)"
This reverts commit 8d90392fb02ce5e6854e6b4dbcdc4a7bbd55f8e2.

Reverted https://github.com/pytorch/pytorch/pull/133779 on behalf of https://github.com/ZainRizvi due to Sorry, but this breaks internal tests because of using functools ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2310445169))
2024-08-26 15:16:17 +00:00
3d7f3f6a55 Revert "[dynamo][itertools] support itertools.tee (#133771)"
This reverts commit 0e49b2f18e78386c8ed9ce540a8017411c7ab0cd.

Reverted https://github.com/pytorch/pytorch/pull/133771 on behalf of https://github.com/ZainRizvi due to Sorry, but this breaks internal tests because of using functools ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2310445169))
2024-08-26 15:16:17 +00:00
e1fc4362fb Revert "[dynamo] simplify implementation for os.fspath (#133801)"
This reverts commit c5f6b72041144c00e240bcfdc783a5597c3d8928.

Reverted https://github.com/pytorch/pytorch/pull/133801 on behalf of https://github.com/ZainRizvi due to Sorry, but this breaks internal tests because of using functools ([comment](https://github.com/pytorch/pytorch/pull/133778#issuecomment-2310445169))
2024-08-26 15:16:17 +00:00
bb67ff2ba7 Migrate Windows bin jobs to runner determinator (#134231)
Update Windows binary workflows to use the runner determinator script.

Closes: pytorch/ci-infra#262

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134231
Approved by: https://github.com/ZainRizvi
2024-08-26 14:56:00 +00:00
27d97b9649 Remove unnecessary test skip (#134250)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134250
Approved by: https://github.com/amjames, https://github.com/janeyx99
2024-08-26 14:34:53 +00:00
be96ccf77c Revert "[CD] fix xpu nightly wheel test env (#134395)" (#134461)
This reverts commit 96738c9d756fbd64e6f2eba67f711d3e18f1630c.

Merged without pytorchmergebot command by mistake

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134461
Approved by: https://github.com/jeanschmidt
2024-08-26 13:40:17 +00:00
96738c9d75 [CD] fix xpu nightly wheel test env (#134395) 2024-08-26 08:53:15 -04:00
1ff226d88c [inductor] support vec for atomic add (#131314)
Depends on https://github.com/pytorch/pytorch/pull/130827 to have correct `index_expr` dtype

Support vec for atomic add via a scalar implementation.
Test Plan:
```
python test/inductor/test_cpu_repro.py -k test_scatter_using_atomic_add_vec
```
Generated code for `test_scatter_using_atomic_add_vec`
```
cpp_fused_scatter_0 = async_compile.cpp_pybinding(['const float*', 'const int64_t*', 'const float*', 'float*'], '''
#include "/tmp/torchinductor_root/nn/cnnpkaxivwaa5rzng6qsyc4ao42vschogi3yk33ukwv3emlvxeqq.h"
extern "C"  void kernel(const float* in_ptr0,
                       const int64_t* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0)
{
    {
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(16L); x0+=static_cast<long>(16L))
        {
            auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x0), 16);
            tmp0.store(out_ptr0 + static_cast<long>(x0));
        }
        #pragma omp simd simdlen(8)
        for(long x0=static_cast<long>(16L); x0<static_cast<long>(25L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = in_ptr0[static_cast<long>(x0)];
            out_ptr0[static_cast<long>(x0)] = tmp0;
        }
    }
    {
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(16L); x0+=static_cast<long>(16L))
        {
            auto tmp0 = at::vec::VectorizedN<int64_t,2>::loadu(in_ptr1 + static_cast<long>(x0), 16);
            auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x0), 16);
            auto tmp1 = 25L;
            auto tmp2 = c10::convert<int64_t>(tmp1);
            auto tmp3 = at::vec::VectorizedN<int64_t,2>(tmp2);
            auto tmp4 = tmp0 + tmp3;
            auto tmp5 = static_cast<int64_t>(0);
            auto tmp6 = at::vec::VectorizedN<int64_t,2>(tmp5);
            auto tmp7 = at::vec::VecMask<int64_t,2>(tmp0 < tmp6);
            auto tmp8 = decltype(tmp4)::blendv(tmp0, tmp4, tmp7.template cast<int64_t,2>());
            auto tmp9 =
            [&]
            {
                __at_align__ std::array<int64_t, 16> tmpbuf;
                tmp8.store(tmpbuf.data());
                return tmpbuf;
            }
            ()
            ;
            auto tmp10 =
            [&]
            {
                __at_align__ std::array<int64_t, 16> tmpbuf;
                #pragma GCC unroll 16
                for (long x0_inner = 0; x0_inner < 16; x0_inner++)
                {
                    tmpbuf[x0_inner] = static_cast<long>(tmp9[x0_inner]);
                }
                return at::vec::VectorizedN<int64_t,2>::loadu(tmpbuf.data(), 16);
            }
            ()
            ;
            TORCH_CHECK((at::vec::VecMask<int64_t,2>((at::vec::VectorizedN<int64_t,2>(0) <= tmp10) & (tmp10 < at::vec::VectorizedN<int64_t,2>(25L)))).all_masked(), "index out of bounds: 0 <= tmp10 < 25L");
            atomic_add_vec(out_ptr0, tmp8, tmp12);
        }
        #pragma omp simd simdlen(8)
        for(long x0=static_cast<long>(16L); x0<static_cast<long>(20L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = in_ptr1[static_cast<long>(x0)];
            auto tmp9 = in_ptr2[static_cast<long>(x0)];
            auto tmp1 = 25L;
            auto tmp2 = c10::convert<int64_t>(tmp1);
            auto tmp3 = decltype(tmp0)(tmp0 + tmp2);
            auto tmp4 = tmp0 < 0;
            auto tmp5 = tmp4 ? tmp3 : tmp0;
            auto tmp6 = tmp5;
            auto tmp7 = c10::convert<int64_t>(tmp6);
            TORCH_CHECK((0 <= tmp7) & (tmp7 < 25L), "index out of bounds: 0 <= tmp7 < 25L");
            atomic_add(&out_ptr0[static_cast<long>(tmp5)], static_cast<float>(tmp9));
        }
    }
}
''')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131314
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2024-08-26 10:36:51 +00:00
bf5c7bf06d [FR] Fix the bug in FR script (e.g., checking all ranks dump check) (#134383)
We somehow converted the ranks to strings, which makes the ranks check fail. This fix converts them all to int.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134383
Approved by: https://github.com/c-p-i-o
2024-08-26 08:21:14 +00:00
92c4771853 fix stuck floordiv (#134150)
Summary: Fixes https://github.com/pytorch/pytorch/issues/134133

Test Plan:
Tested on the small repro in the linked issue with different lengths N (replacing 100), recording N vs. time taken in nanoseconds:
10 127268319
20 220839662
30 325463125
40 429259441
50 553136055
60 670799769
70 999170514
80 899014103
90 997168902
100 1168202035
110 1388556619
120 1457488235
130 1609816470
140 2177889877
150 1917560313
160 2121096113
170 2428502334
180 4117450755
190 4003068224

So N ~ 200 takes ~5s. Previously even smaller N would go for >1 min.

Didn't add a perf test because ezyang is planning to build a benchmark.

Also tested on https://www.internalfb.com/diff/D61560171, which now gets past the stuck point.

Differential Revision: D61619660

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134150
Approved by: https://github.com/ezyang
2024-08-26 07:27:59 +00:00
c5f6b72041 [dynamo] simplify implementation for os.fspath (#133801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133801
Approved by: https://github.com/anijain2305
ghstack dependencies: #133769, #133778, #133779, #133771
2024-08-26 07:12:15 +00:00
38f97ec8e3 [pt2] Add meta for poisson (#134103)
Because aten.poisson doesn't have a meta function registered, there is one additional eager execution of this op during the compilation phase of torch.compile.

There are more ops without meta registration. Is there any reason for it?
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134103
Approved by: https://github.com/ezyang
2024-08-26 06:14:38 +00:00
ed86ac2f25 [BE] typing for decorators - fx/_compatibility (#134054)
Summary: See #131429

Test Plan: unit tests pass

Differential Revision: D61493706

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134054
Approved by: https://github.com/oulgen
2024-08-26 04:00:27 +00:00
7b6b10417d Remove ansi escape chars in assertExpectedInline and add options to skip comments and to skip empty lines (#134248)
I had a nightmare rewriting tests in test_misc.py specifically:
1. Graphs can have comments that refer to my files ("/lsakka/..."); we really don't care about comments, so add an option to ignore them.
2. Empty lines added when EXPECTTEST_ACCEPT=1 are changed by the linter, causing the tests or the linter to fail! Add a flag to ignore empty lines.
3. EXPECTTEST_ACCEPT fails when the text has some non-readable characters. Those should not affect string comparison, and they also cause weird diff comments when tests fail. I removed the ANSI escape chars in https://github.com/pytorch/pytorch/pull/133045

this is used in

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134248
Approved by: https://github.com/aorenste
ghstack dependencies: #133639, #134364
2024-08-26 02:03:44 +00:00
2ec149cd3e [inductor] fix test_functional_call_sequential_params_and_buffers expectation on Windows (#134394)
This UT's actual output differs only by one empty line (between `linear` and `add`) on Windows vs. Linux, and the content is otherwise correct.
Reproduce UTs:
```cmd
pytest test\dynamo\test_higher_order_ops.py -v -k test_functional_call_sequential_params_and_buffers
```

We can add `empty_line_normalizer` to fix it.

```cmd
______________________________________________________________________________________________ FuncTorchHigherOrderOpTests.test_functional_call_sequential_params_and_buffers _______________________________________________________________________________________________
Traceback (most recent call last):
  File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py", line 3676, in test_functional_call_sequential_params_and_buffers
    self.assertExpectedInline(
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 2871, in assertExpectedInline
    return super().assertExpectedInline(actual if isinstance(actual, str) else str(actual), expect, skip + 1)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\expecttest\__init__.py", line 271, in assertExpectedInline
    self.assertMultiLineEqualMaybeCppStack(expect, actual, msg=help_text)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\expecttest\__init__.py", line 292, in assertMultiLineEqualMaybeCppStack
    self.assertMultiLineEqual(expect, actual, *args, **kwargs)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 1226, in assertMultiLineEqual
    self.fail(self._formatMessage(msg, standardMsg))
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 675, in fail
    raise self.failureException(msg)
AssertionError: 'clas[509 chars]one\n        add: "f32[1, 1]" = linear + l_buf[69 chars],)\n' != 'clas[509 chars]one\n\n        add: "f32[1, 1]" = linear + l_b[71 chars],)\n'
  class GraphModule(torch.nn.Module):
      def forward(self, L_params_l1_weight_: "f32[1, 1]", L_params_l1_bias_: "f32[1]", L_buffers_buffer_: "f32[1]", L_inputs_: "f32[1, 1]"):
          l_params_l1_weight_ = L_params_l1_weight_
          l_params_l1_bias_ = L_params_l1_bias_
          l_buffers_buffer_ = L_buffers_buffer_
          l_inputs_ = L_inputs_

          linear: "f32[1, 1]" = torch._C._nn.linear(l_inputs_, l_params_l1_weight_, l_params_l1_bias_);  l_inputs_ = l_params_l1_weight_ = l_params_l1_bias_ = None
+ <<<< (difference is here )
          add: "f32[1, 1]" = linear + l_buffers_buffer_;  linear = l_buffers_buffer_ = None
          return (add,)
 : To accept the new output, re-run test with envvar EXPECTTEST_ACCEPT=1 (we recommend staging/committing your changes before doing this)

To execute this test, run the following from the base repo dir:
    python test\dynamo\test_higher_order_ops.py FuncTorchHigherOrderOpTests.test_functional_call_sequential_params_and_buffers

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
========================================================================================================================== short test summary info ==========================================================================================================================
FAILED [0.4275s] test/dynamo/test_higher_order_ops.py::FuncTorchHigherOrderOpTests::test_functional_call_sequential_params_and_buffers - AssertionError: 'clas[509 chars]one\n        add: "f32[1, 1]" = linear + l_buf[69 chars],)\n' != 'clas[509 chars]one\n\n        add: "f32[1, 1]" = linear + l_b[71 chars],)\n'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134394
Approved by: https://github.com/jansel

Co-authored-by: Jason Ansel <jansel@jansel.net>
2024-08-26 01:41:20 +00:00
7af38eb98b Fix unexpected inference_mode interaction with torch.autograd.functional.jacobian (#130307)
Fixes #128264

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130307
Approved by: https://github.com/soulitzer
2024-08-25 22:14:02 +00:00
dc1959e6a7 [inductor] calibration inductor windows uts (7/N) (#134420)
Disable UTs on Windows: `test/dynamo/test_misc.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134420
Approved by: https://github.com/jansel
2024-08-25 20:39:54 +00:00
97fd087cdb [inductor] calibration inductor windows uts (6/N) (#134419)
Disable UTs for Windows: `test/dynamo/test_aot_autograd_cache.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134419
Approved by: https://github.com/jansel
2024-08-25 20:39:34 +00:00
b5dd60fa75 Fix namespace issues with qnnpack (#134336)
After this I think all `using namespace` will have been eliminated from PyTorch header files. Internally, `-Wheader-hygiene` will prevent more from being added.

Test Plan: Sandcastle

Differential Revision: D61679037

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134336
Approved by: https://github.com/Skylion007
2024-08-25 19:50:01 +00:00
7940f2428f [torch/package_importer] add compatibility name mapping (#134376)
Summary:
This enables patching extern modules to provide compatibility with serialized code depending on different versions of those extern modules.

The main motivation is to enable the Numpy upgrade. In the recent release many aliases to builtin types were deprecated and removed [1]. This breaks loading pickled modules that reference the removed aliases. While the proper solution is to re-generate the pickled modules, that's not always feasible.

This proposes a way to define a mapping to a new type for a module member. It is only applied if the member is not present in the loaded module, which removes the need to check for exact versions.

https://numpy.org/doc/stable/release/1.20.0-notes.html#using-the-aliases-of-builtin-types-like-np-int-is-deprecated

Differential Revision: D61556888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134376
Approved by: https://github.com/SherlockNoMad
2024-08-25 19:34:46 +00:00
816061843a [Distributed/Profiler] Fix input/output dimension overflow (#134360)
Summary: When using ParamCommsDebugInfo, the input elements and output elements are stored in `int` instead of `int64_t`

Test Plan: Run HTA with new outputted values and make sure overflow does not occur

Reviewed By: fengxizhou

Differential Revision: D61728747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134360
Approved by: https://github.com/fengxizhou, https://github.com/jeanschmidt
2024-08-25 16:25:56 +00:00
eqy
e93ca12c88 [CUDNN][SDPA] Fix unsupported trivial stride-1 transpose case (#134031)
Fixes #134001
Incorrect assumption that two same-shape tensors being contiguous meant that they would have the same stride

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134031
Approved by: https://github.com/drisspg, https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-08-25 14:31:30 +00:00
08d111250a [ez][c10d] change ERROR to WARNING (#134349)
Summary:
Change error to warning because TCPStore can be torn down during a normal shutdown. It's OK if we're unable to access TCPStore. Should not be an error.

Test Plan:
Ran locally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134349
Approved by: https://github.com/fduwjj, https://github.com/wconstab
2024-08-25 14:22:55 +00:00
4648848696 Revert "[ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438)"
This reverts commit f71c3d265ab52589f983dd252d61461db4e7dbbd.

Reverted https://github.com/pytorch/pytorch/pull/133438 on behalf of https://github.com/jeanschmidt due to seems to have introduced breakages in linux binary builds ([comment](https://github.com/pytorch/pytorch/pull/133438#issuecomment-2308787310))
2024-08-25 11:20:30 +00:00
e5563f7ad7 Revert "[dtensor][MTPG] make sharding prop lru cache not shared among threads (#134294)"
This reverts commit eb15b1a016c6facaf8605dde2c20b5de1586542d.

Reverted https://github.com/pytorch/pytorch/pull/134294 on behalf of https://github.com/jeanschmidt due to seems to have introduced https://github.com/pytorch/pytorch/actions/runs/10537099590/job/29201744658 ([comment](https://github.com/pytorch/pytorch/pull/134294#issuecomment-2308785949))
2024-08-25 11:16:04 +00:00
268092db83 [DeviceMesh] Allow _flatten() to take in an optional mesh_dim_name (#134048)
If a mesh_dim_name is given, we will use the given mesh_dim_name to name the new flattened dim.
Otherwise, the default is a string concatentaing the mesh_dim_names of the given submesh with each mesh_dim_name separated by "_".

For example, if we have a 3D mesh DeviceMesh([[[0, 1], [2, 3]], [[4, 5], [6, 7]]], mesh_dim_names=("dp", "cp", "tp")), calling mesh_3d["dp", "cp"]._flatten() will create a 1D submesh DeviceMesh([0, 1, 2, 3], mesh_dim_names=("dp_cp",)) on rank 0, 1, 2, 3 and a 1D submesh DeviceMesh([4, 5, 6, 7], mesh_dim_names=("dp_cp",)) on rank 4, 5, 6, 7.
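A hedged usage sketch of the behavior described above (`_flatten` is a private API; this assumes an 8-GPU job launched with torchrun):

```python
from torch.distributed.device_mesh import init_device_mesh

mesh_3d = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"))

# With this change an explicit mesh_dim_name can be passed; otherwise the
# default name is the concatenation of the flattened dims, i.e. "dp_cp".
dp_cp = mesh_3d["dp", "cp"]._flatten(mesh_dim_name="dp_cp")
```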

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134048
Approved by: https://github.com/fegin
ghstack dependencies: #133838, #133839
2024-08-25 10:36:01 +00:00
326db8af4c Replace sympy Min/Max with reimplementations (#133319)
Sympy's implementation of Min/Max displays asymptotically bad behavior on `TORCH_COMPILE_CPROFILE=1 python torchrec/distributed/tests/test_pt2_multiprocess.py TestPt2Train.test_compile_multiprocess`. Evidence profile:

![image](https://github.com/user-attachments/assets/142301e9-3a18-4370-b9db-19b32ece7ee8)

On this test case, we spend 42% of all time compiling the network on ShapeEnv.replace, which in turn spends all of its time in xreplace.

The problem appears to be find_localzeros call. By vendoring the implementations of Min/Max, we can potentially reduce the cost of this operation.

The implementation is copy-pasted sympy/functions/elementary/miscellaneous.py but with some adjustments:

* I deleted logic related to differentiation, evalf and heaviside, as it's not relevant to PyTorch reasoning
* There's some massaging to appease PyTorch's linters, including a lot of noqa and type: ignore (which I could potentially refactor away with substantive changes, but that's better as its own change)
* I deleted the second loop iteration for is_connected, as an attempt at initial optimization (this also simplifies the port, since I can omit some code). I'll comment at that point what the exact difference is.

Before this change, the test in question takes 100s with 40 features; after this change, it takes only 69s.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133319
Approved by: https://github.com/Skylion007
2024-08-25 05:05:59 +00:00
8db8ac700d line by line logging (#134298)
Summary:
Today there is no good mechanism to detect the progress of non-strict export line-by-line in user code. This caused some pain recently when trying to find the exact line of user code that was triggering a bug: the process appeared stuck because, deep down, something was calling symbolic shapes code that was suffering an exponential blowup.

This PR adds an environment variable for extended debugging that logs the line of user code corresponding to every torch function call. It only works in non-strict export for now. Combine setting this environment variable with enabling `TORCH_LOGS` for `export` at `DEBUG` level (i.e., with a `+` prefix):

```
TORCHEXPORT_EXTENDED_DEBUG_CURRENT_LOC=1 TORCH_LOGS="+export" ...
```

This will show logs with something like:
```
...
prim::device called at .../example.py:4284 in foo
TensorBase.item called at .../example.py:4277 in bar
...
```

We already have an existing place to intercept torch functions where we process data-dependent errors in non-strict, so we park the logging there. An alternative place to do this is where we add `stack_trace` metadata when generating code, but unfortunately at least the example that motivated this gets stuck before generating code, so that would be too late.

Test Plan: ran it on some sample commands

Differential Revision: D61692156

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134298
Approved by: https://github.com/angelayi
2024-08-25 02:57:11 +00:00
907c32faac [inductor] calibration inductor windows uts (4/N) (#134401)
skip failed UTs of `test/dynamo/test_unspec.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134401
Approved by: https://github.com/ezyang
2024-08-25 00:32:29 +00:00
74ef74be36 [inductor] calibration inductor windows uts (3/N) (#134400)
skip Windows UT of `test/dynamo/test_trace_rules.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134400
Approved by: https://github.com/ezyang
2024-08-24 23:48:50 +00:00
d33d68e326 [Profiler] Add test to make sure FunctionEvents are processed lazily (#134359)
Summary: Create a simple test that checks that the FunctionEvent tree build happens lazily, by checking that its metrics change before and after the call.

Test Plan: Make sure test passes in CI

Reviewed By: briancoutinho

Differential Revision: D61685429

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134359
Approved by: https://github.com/briancoutinho
2024-08-24 23:03:19 +00:00
af4c87953e [inductor] calibration inductor windows uts (5/N) (#134402)
skip UTs of `test/dynamo/test_repros.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134402
Approved by: https://github.com/ezyang
2024-08-24 23:00:11 +00:00
94f92fbd88 Use integer division in arange length calculation when start/end/step are integral (#134296)
Fixes #133338

Test Plan:

```
TORCH_LOGS=dynamic python
import torch

torch._dynamo.config.capture_scalar_outputs = True

@torch.compile()
def f(x):
    y = x.item()
    torch._check_is_size(y)
    r = torch.arange(y, dtype=torch.float32)
    torch._check(r.size(0) == y)
    return r

f(torch.tensor([300]))
```

Run before and after this diff. Verify that the following line

```
I0813 11:05:44.890000 652898 torch/fx/experimental/symbolic_shapes.py:5198] [0/0] runtime_assert Eq(CeilToInt(IntTrueDiv(u0, 1)), u0) [guard added] at aa.py:10 in f (_dynamo/utils.py:2092 in run_node), for more info run with TORCHDYNAMO_EXTENDED_DEBUG_GUARD_ADDED="Eq(CeilToInt(IntTrueDiv(u0, 1)), u0)"
```

no longer shows in the logs. Also verify CI passes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134296
Approved by: https://github.com/aorenste
2024-08-24 21:09:28 +00:00
1a0d00f1f4 [traced-graph][sparse] enable to_dense() for compressed (#133371)
Fixes https://github.com/pytorch/pytorch/issues/133174

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133371
Approved by: https://github.com/ezyang
2024-08-24 20:33:23 +00:00
050aa67e41 [traced-graph][sparse] fix restrictive assert for sparse add (#134037)
Exporting sparse addition can be CPU/Meta; this fixes the overly restrictive assert and adds an export test.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134037
Approved by: https://github.com/ezyang
2024-08-24 20:26:47 +00:00
90fb83749e [inductor] fix test torch package working with trace on windows (#134397)
The temporary directory path is currently hard-coded. Fixed by getting the temporary directory path via the platform API.
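
A minimal sketch of the fix described above: build the path from the platform's temporary directory instead of hard-coding "/tmp" (the file name is illustrative):

```python
import os
import tempfile

# tempfile.gettempdir() resolves to /tmp on Linux and a user temp dir on Windows.
path = os.path.join(tempfile.gettempdir(), "exported_package.pt")
```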

Reproduce UTs:
```cmd
python test/dynamo/test_dynamic_shapes.py -v -k test_torch_package_working_with_trace_dynamic_shapes
```

Error message:
```cmd
________________________________________________________________________________________________ DynamicShapesMiscTests.test_torch_package_working_with_trace_dynamic_shapes ________________________________________________________________________________________________
Traceback (most recent call last):
  File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_misc.py", line 7199, in test_torch_package_working_with_trace
    with package.PackageExporter(path) as exp:
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\package\package_exporter.py", line 237, in __init__
    self.zip_file = torch._C.PyTorchFileWriter(f)
RuntimeError: Parent directory /tmp does not exist.

To execute this test, run the following from the base repo dir:
    python test\dynamo\test_dynamic_shapes.py DynamicShapesMiscTests.test_torch_package_working_with_trace_dynamic_shapes

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
========================================================================================================================== short test summary info ==========================================================================================================================
FAILED [0.0080s] test/dynamo/test_dynamic_shapes.py::DynamicShapesMiscTests::test_torch_package_working_with_trace_dynamic_shapes - RuntimeError: Parent directory /tmp does not exist.
==================================================================================================================== 1 failed, 1665 deselected in 4.00s =====================================================================================================================
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134397
Approved by: https://github.com/ezyang
2024-08-24 20:25:44 +00:00
9cd53b3212 Add Arm copyright line to LICENSE (#133982)
Some historical commits from arm:
- 2021 664126bab5f3f2a275e82b7bde127132cff7f34e
- 2023 2630144786e906b40abbe017294d404bcfe3c6ae
- 2024 ce6130014156fa9555ce3d16c5f9a84cbdadf8f4

See https://github.com/pytorch/pytorch/pull/126687 for initial discussion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133982
Approved by: https://github.com/malfet
2024-08-24 18:41:06 +00:00
50d5aa8c10 Enable optimized dynamic quantization on aarch64 (#126687)
oneDNN+ACL has optimized kernels for s8s8 matmul, so input is signed. This change leaves behaviour on all other platforms the same. This change requires https://github.com/intel/ideep/pull/313 to go in, and oneDNN 3.5 for the optimized kernels. This change speeds up dynamic quantized linear by ~10x.

Also, do you have a policy on copyright headers? Arm's usual policy when contributing to open source projects is to include a copyright header on any file which is modified. Would this be acceptable? If not, is there somewhere else suitable to note copyright?

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126687
Approved by: https://github.com/jgong5, https://github.com/malfet, https://github.com/snadampal

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
2024-08-24 18:40:12 +00:00
f71c3d265a [ROCm] remove triton-rocm commit pin and merge pins with triton.txt (#133438)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133438
Approved by: https://github.com/jithunnair-amd, https://github.com/malfet
2024-08-24 18:26:49 +00:00
6245d5b87b [CI] Update XPU ci test python version to 3.9 (#134214)
Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134214
Approved by: https://github.com/EikanWang, https://github.com/malfet
2024-08-24 18:11:36 +00:00
a63efee5cd [inductor]Let output or input_as_strided match exact strides (#130956)
Fixes #130394

TorchInductor doesn't respect the original strides of outputs. This opens up optimization opportunities such as changing the memory layout. But in some cases, such as the one in https://github.com/pytorch/pytorch/issues/130394, we do need the output to match the exact strides required. Correctness is the first priority. So, this PR adds a new API `ir.ExternKernel.require_exact_strides(x, exact_strides, allow_padding=False)` to fix the issue. This PR makes non-dense outputs' strides follow the strides required by semantics.

A comparison of the generated code for the test before and after this fix is below.

```python
@triton.jit
def triton_(in_ptr0, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 128
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 8
    x1 = (xindex // 8)
-   x2 = xindex
    tmp0 = tl.load(in_ptr0 + (x0 + (16*x1)), xmask)
    tmp1 = tmp0 + tmp0
-   tl.store(out_ptr0 + (x2), tmp1, xmask)
+   tl.store(out_ptr0 + (x0 + (16*x1)), tmp1, xmask)

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (16, 8), (16, 1))
    with torch.cuda._DeviceGuard(0):
        torch.cuda.set_device(0)
-       buf1 = empty_strided_cuda((16, 8), (8, 1), torch.float32)
+       buf1 = empty_strided_cuda((16, 8), (16, 1), torch.float32)
        stream0 = get_raw_stream(0)
        triton_poi_fused_add_copy_0.run(arg0_1, buf1, 128, grid=grid(128), stream=stream0)
        del arg0_1
    return (buf1, )
```

`buf1` is created with the exact strides required by the user, and its values are written with the same strides as the input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130956
Approved by: https://github.com/eellison, https://github.com/blaine-rister
2024-08-24 17:04:05 +00:00
cdb9df5efe [dynamo][guards] De-dupe DUPLICATE_INPUT guard (#134354)
Hard to write a test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134354
Approved by: https://github.com/jansel
ghstack dependencies: #134272
2024-08-24 15:17:56 +00:00
d433a603af [BE] use torch.amp.autocast instead of torch.cuda.amp.autocast (#134291)
torch.cuda.amp.autocast / torch.cpu.amp.autocast are deprecated and spew a ton of warnings when these tests run. This PR updates the tests to just use torch.amp.autocast(device).

Note: this uncovers a bug in the test: when `device` is CUDA, it actually shows up as "cuda:0", so previously this test was _always_ using `torch.cpu.amp.autocast` even for the `cuda` device. This PR fixes this, and uncovers additional bugs in `pinverse` and `linalg.pinv`; `linalg.pinv` was already failing before on CPU, but now the test also catches failures on CUDA (and this PR adds them to the skipped-test list).
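
A hedged sketch of the replacement described above: pass the device type to the generic `torch.amp.autocast` instead of the deprecated per-backend variants (the tensor shapes are illustrative):

```python
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
# One context manager for both backends; no torch.cuda.amp / torch.cpu.amp needed.
with torch.amp.autocast(device):
    x = torch.randn(8, 8, device=device)
    y = x @ x
```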
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134291
Approved by: https://github.com/YuqingJ
2024-08-24 15:07:49 +00:00
a1061009c9 [PT2] use statically_known_true in slice_noop (#134270)
Summary:
# context
* when fixing the graph break in _maybe_compute_kjt_to_jt_dict, we encountered this issue P1539489731:
```
[rank0]:   ATTENTION: guard_size_oblivious would fix the error, evaluating expression to False.
[rank0]:   Maybe you need to add guard_size_oblivious to framework code, see doc below for more guidance.
[rank0]:
[rank0]:   Potential framework code culprit (scroll up for full backtrace):
[rank0]:     File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/61f992c26f3f2773/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_inductor/fx_passes/post_grad.py", line 671, in slice_noop
[rank0]:       if start == 0 and end >= 2**63 - 1 and step == 1:
```
* change the condition logic to be compatible with SymInt
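
A hedged sketch of the SymInt-friendly check (the helper name and wrapping are illustrative, not necessarily the exact code landed in post_grad.py):

```python
from torch.fx.experimental.symbolic_shapes import statically_known_true

def is_noop_slice(start, end, step) -> bool:
    # statically_known_true avoids guarding when start/end are SymInts whose
    # values cannot be decided symbolically; it simply returns False instead.
    return (
        statically_known_true(start == 0)
        and statically_known_true(end >= 2**63 - 1)
        and step == 1
    )
```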

Test Plan:
# commands
* run test
```
TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_v4_mini training.pipeline_type=pt2 2>&1 | tee -a `date +"%Y.%m.%d.%H.%M"`.`sl whereami`.log
```
* tlparse
```
ls -thl /var/tmp/tt | head -9 && tlparse `ls -t /var/tmp/tt/* | head -1`
```

Differential Revision: D61677207

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134270
Approved by: https://github.com/ezyang
2024-08-24 13:58:51 +00:00
ff77c67d16 Use ephemeral runners for linux nightly builds (#134367)
Should be landed with https://github.com/pytorch/test-infra/pull/5590
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134367
Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/seemethere
2024-08-24 12:49:07 +00:00
ff7d94c67e [compiled autograd] fix saved tensor hook firing count (#134361)
The SavedVariable constructor calls the pack hooks; we don't want to call them for the proxy tensor, since it is proxying a tensor whose pack hook was already called during forward.

Using the same fix as https://github.com/pytorch/pytorch/pull/123196
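
For context, a minimal sketch of the pack/unpack hook mechanism whose firing count is being fixed (this is not the compiled-autograd fix itself, just an illustration of when pack hooks fire):

```python
import torch

def pack(t):
    print("pack hook fired")  # fires when autograd saves a tensor for backward
    return t

def unpack(t):
    return t

x = torch.randn(3, requires_grad=True)
with torch.autograd.graph.saved_tensors_hooks(pack, unpack):
    y = (x * x).sum()  # the tensors saved for backward trigger the pack hook
y.backward()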

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134361
Approved by: https://github.com/jansel
ghstack dependencies: #134186, #134200, #134205, #134286, #134290, #134162, #134163
2024-08-24 12:06:36 +00:00
929de1d0d4 Re-enable skipped compiled autograd eager tests (#134163)
Originally disabled in: https://github.com/pytorch/pytorch/pull/131700#discussion_r1727153445, but the failure is no longer in CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134163
Approved by: https://github.com/soulitzer
ghstack dependencies: #134186, #134200, #134205, #134286, #134290, #134162
2024-08-24 12:06:36 +00:00
ad8bdfae1e add compiled_autograd to programmatic set_logs API (#134162)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134162
Approved by: https://github.com/yf225, https://github.com/jansel
ghstack dependencies: #134186, #134200, #134205, #134286, #134290
2024-08-24 12:06:36 +00:00
1431663693 [compiled autograd] finish classifying tests (#134290)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134290
Approved by: https://github.com/yf225
ghstack dependencies: #134186, #134200, #134205, #134286
2024-08-24 12:06:36 +00:00
0b228a2af8 [compiled autograd] match eager behavior for ctx.saved_variables (#134286)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134286
Approved by: https://github.com/jansel
ghstack dependencies: #134186, #134200, #134205
2024-08-24 12:06:36 +00:00
6cc57c64b2 [compiled autograd] match eager behavior for post acc grad hooks (#134205)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134205
Approved by: https://github.com/jansel
ghstack dependencies: #134186, #134200
2024-08-24 12:06:36 +00:00
d7a25e1d8c [compiled autograd] add config patching for certain eager tests (#134200)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134200
Approved by: https://github.com/jansel
ghstack dependencies: #134186
2024-08-24 12:06:36 +00:00
0d9208a398 [compiled autograd] match eager behavior for inplace detached activations (#134186)
Fixes `TestAutograd.test_saved_variable_saved_original_inplace_detach` when ran under compiled autograd

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134186
Approved by: https://github.com/jansel
2024-08-24 12:06:36 +00:00
ccafc93be5 [AOTI][CPU] Make int8 qlinear work (#134368)
Summary:
This diff will decompose torch.ops._quantized.wrapped_quantized_linear into torch.ops._quantized.wrapped_linear_prepack and torch.ops._quantized.wrapped_quantized_linear_prepacked for AOTI, and adds the corresponding impl into the shim.

The way it works is similar to what we did previously for the fbgemm fp16 dynamic qlinear. We do constant folding for the packed weight at runtime (warm-up) to achieve the speedup.

Reviewed By: desertfire

Differential Revision: D61396144

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134368
Approved by: https://github.com/houseroad
2024-08-24 08:25:25 +00:00
eb15b1a016 [dtensor][MTPG] make sharding prop lru cache not shared among threads (#134294)
**Summary**
Before this PR, the `sharding propagator` was shared among threads. As a result, the cached result of rank 0 would be accessible to other ranks, e.g. rank 1, and this could lead to wrong DTensor resharding. This PR fixes it by making the cache a thread-local variable, and it fixes the `dstack` test (#126493), `inner` (https://github.com/pytorch/pytorch/issues/126852), and `vstack` (https://github.com/pytorch/pytorch/issues/126868). It also fixes `poisson_nll` (https://github.com/pytorch/pytorch/issues/131446) as a by-product.

**Test**
`pytest test/distributed/_tensor/test_dtensor_ops.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134294
Approved by: https://github.com/wz337, https://github.com/awgu
2024-08-24 05:56:45 +00:00
1034f456ef [inductor] fix munge_exc not support windows path (#134348)
Windows file paths use `\` as the delimiter, which is also an escape character in regular expressions. We need to translate every `\` in the path to `/`, like on Linux.
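
A hedged sketch of the normalization described above (the paths are illustrative): rewriting `\` to `/` in both the path and the message makes the path regex-safe and matches on every platform.

```python
import os
import re

file = r"D:\repo\test\dynamo\test_misc.py".replace("\\", "/")
msg = r'File "D:\repo\test\dynamo\test_misc.py", line 1'.replace("\\", "/")
shortened = re.sub(file, os.path.basename(file), msg)  # 'File "test_misc.py", line 1'
```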

Reproduce UT:
```cmd
pytest test\dynamo\test_higher_order_ops.py -v -k test_vmap_grad_vmap_guard_fail
```
Error msg:
```cmd
________________________________________________________________________________________________________ HigherOrderOpVmapGuardTests.test_vmap_grad_vmap_guard_fail _________________________________________________________________________________________________________
Traceback (most recent call last):
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\logging_utils.py", line 89, in test_fn
    fn(self, records)
  File "D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py", line 2714, in test_vmap_grad_vmap_guard_fail
    munge_exc(record.getMessage()),
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 5252, in munge_exc
    s = re.sub(file, os.path.basename(file), s)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\re.py", line 209, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\re.py", line 303, in _compile
    p = sre_compile.compile(pattern, flags)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_compile.py", line 788, in compile
    p = sre_parse.parse(p, flags)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 955, in parse
    p = _parse_sub(source, state, flags & SRE_FLAG_VERBOSE, 0)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 444, in _parse_sub
    itemsappend(_parse(source, state, verbose, nested + 1,
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 526, in _parse
    code = _escape(source, this, state)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\sre_parse.py", line 370, in _escape
    raise source.error("incomplete escape %s" % escape, len(escape))
re.error: incomplete escape \x at position 2

To execute this test, run the following from the base repo dir:
    python test\dynamo\test_higher_order_ops.py HigherOrderOpVmapGuardTests.test_vmap_grad_vmap_guard_fail

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
--------------------------------------------------------------------------------------------------------------------------- Captured stdout call ----------------------------------------------------------------------------------------------------------------------------
frames [('total', 2), ('ok', 2)]
inductor []
inline_call []
stats [('calls_captured', 38), ('unique_graphs', 2)]
--------------------------------------------------------------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------------------------------------------------------------
V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles] Recompiling function fn in D:\xu_git\dnnl_cb\pytorch\test\dynamo\test_higher_order_ops.py:2699
V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles]     triggered by the following guard failure(s):
V0824 01:29:00.148000 27840 torch\_dynamo\guards.py:2787] [0/1] [__recompiles]     - 0/0: torch._functorch.pyfunctorch.compare_functorch_state([('Vmap', 1, 'error')])  # _dynamo\output_graph.py:479 in init_ambient_guards
========================================================================================================================== short test summary info ==========================================================================================================================
FAILED [0.7452s] test/dynamo/test_higher_order_ops.py::HigherOrderOpVmapGuardTests::test_vmap_grad_vmap_guard_fail - re.error: incomplete escape \x at position 2
```
Local test passed:
<img width="860" alt="image" src="https://github.com/user-attachments/assets/90f0d780-0639-4c03-8d7c-6f227c93a3fc">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134348
Approved by: https://github.com/jansel
2024-08-24 05:51:35 +00:00
0694918aeb [export] Temporarily bypass torch_fn in partitioner (#134292)
Summary:
"torch_fn" is not correct for the decomposed add node from batch norm. This is a temporary workaround to bypass torch fn.

For example, for the graph below (test_qat_conv2d_unary graph):
```
graph():
    %conv_weight : [num_users=1] = get_attr[target=conv.weight]
    %bn_weight : [num_users=1] = get_attr[target=bn.weight]
    %bn_bias : [num_users=1] = get_attr[target=bn.bias]
    %bn_running_mean : [num_users=1] = get_attr[target=bn.running_mean]
    %bn_running_var : [num_users=1] = get_attr[target=bn.running_var]
    %bn_num_batches_tracked : [num_users=1] = get_attr[target=bn.num_batches_tracked]
    %x : [num_users=1] = placeholder[target=x]
    %conv2d : [num_users=1] = call_function[target=torch.ops.aten.conv2d.default](args = (%x, %conv_weight, None, [1, 1], [1, 1]), kwargs = {})
    %add_ : [num_users=0] = call_function[target=torch.ops.aten.add_.Tensor](args = (%bn_num_batches_tracked, 1), kwargs = {})
    %batch_norm : [num_users=1] = call_function[target=torch.ops.aten.batch_norm.default](args = (%conv2d, %bn_weight, %bn_bias, %bn_running_mean, %bn_running_var, True, 0.1, 1e-05, True), kwargs = {})
    %relu : [num_users=1] = call_function[target=torch.ops.aten.relu.default](args = (%batch_norm,), kwargs = {})
    %max_pool2d : [num_users=1] = call_function[target=torch.ops.aten.max_pool2d.default](args = (%relu, [3, 3], [3, 3]), kwargs = {})
    return (max_pool2d,)
```

the add_ node has `'torch_fn': ('add__1', 'method_descriptor.add_'),` in its meta.

If we run the line below in `_annotate_qat_conv2d_bn_binary_unary`, we'll have a partition without output nodes.

```
 find_sequential_partitions(
            gm, [torch.nn.Conv2d, torch.nn.BatchNorm2d, operator.add, torch.nn.ReLU]
        )
```

```
partition_list
[
SourcePartition(nodes=[conv_weight, conv2d], source=<class 'torch.nn.modules.conv.Conv2d'>, input_nodes=[x], output_nodes=[conv2d], params=[conv_weight]),

SourcePartition(nodes=[bn_weight, bn_bias, bn_running_mean, bn_running_var, bn_num_batches_tracked, add_, batch_norm], source=<class 'torch.nn.modules.batchnorm.BatchNorm2d'>, input_nodes=[conv2d], output_nodes=[batch_norm], params=[bn_num_batches_tracked, bn_running_var, bn_bias, bn_weight, bn_running_mean]),

SourcePartition(nodes=[add_], source='add_', input_nodes=[bn_num_batches_tracked], output_nodes=[], params=[])
]
```
We should not have the last partition.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_conv2d
```

Differential Revision: D61569049

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134292
Approved by: https://github.com/angelayi
2024-08-24 05:50:18 +00:00
f260cc2edf Enable DTensor sharding propagation of native_layer_norm_backward to more fully accommodate optional args (#133502)
Fixes #133499

### The issue

Testing a variety of TP `requires_grad` patterns (validating maximally flexible finetuning) revealed `DTensor` sharding propagation of `aten.native_layer_norm_backward` (default) fails with an `IndexError` for certain `requires_grad` patterns (pattern 1) (e.g. `output_mask` `[True, False, False]`) and an `AssertionError` for others (pattern 2) (e.g. output mask `[False, True, *]`). Please see issue #133499 for a full description of the observed failure patterns along with reproduction.

### Use Cases and Remediation

Failure pattern 1 is potentially problematic for a variety of finetuning scenarios. Though failure pattern 2 is really an xfail right now since it's not fully supported, IMHO there are use cases (e.g. especially wrt to mechanistic interpretability research, but certain finetuning scenarios too potentially) that justify supporting this output mask (especially since supporting it is fairly straightforward I think).

In this PR I propose some modest changes that:
  * Address the aforementioned failure modes.
  * Add a couple tests that I'm hopeful will help ensure `DTensor` op dispatch (which is so well implemented and such a pleasure working with btw! 🚀 🎉) accommodates a wide variety of (potentially unanticipated) `requires_grad` patterns as it evolves.

To address both failure modes, I'm proposing the following changes:
1. To [`torch.distributed._tensor.ops._math_ops.layer_norm_bwd_strategy`](7b269cc484/torch/distributed/_tensor/ops/_math_ops.py (L873)):
  - Refactor conditional `output_mask` handling such that the input and output specs in the`PlacementStrategy`s of the returned `output_strategy.strategies` list remain aligned with the `op_schema.args_spec` (whose definition does not change at runtime based upon unused optional args).
2. To [`torch.distributed._tensor._sharding_prop.propagate_op_sharding_non_cached`](7b269cc484/torch/distributed/_tensor/_sharding_prop.py (L256-L262)):
  - When iterating through the active `op_schema.args_spec` to build the relevant `expected_input_specs` list, filter any `None` `desired_specs`.
3. To [`torch/distributed/_tensor/_op_schema.OpSchema._inplace_rewrap_schema_suggestion`](7b269cc484/torch/distributed/_tensor/_op_schema.py (L418))
  - When inputs need a redistribute, for runtime-unrequired (`None` arguments in the aligned `suggestion_args_schema`), ignore the associated `suggestion_args_spec`

### Implementation considerations:

- Regarding `1`, to avoid changing the op strategy return args ([`op_strategy`](cf81180007/torch/distributed/_tensor/_sharding_prop.py (L234))), the change in `1` allows `None` elements to exist temporarily in `PlacementStrategy.input_specs` (treating it as `Sequence[DTensorSpec | None] | None` when it's `Sequence[DTensorSpec] | None`. This could be addressed in any number of ways but I thought it best to leave that for a subsequent PR since it could have broader ramifications (e.g. allowing op_strategies to return an output_strategy.input_specs` mask explicitly, explicitly allowing `None`s in `PlacementStrategy.input_specs`, creating a `Null` DTensorSpec etc.). That's why I'm using an ignore arg-type directive there for now.
- Regarding `2` and `3` above, I don't introspect `op_schema.op._schema.arguments` to verify any `None` arguments are `torch.OptionalType`, leaving adherence to the schema contract the responsibility of the given op. Regarding `2`, I assume any `desired_spec` will be either a `DTensorSpec` or `None`, so only `None` can be Falsy in this context.
- I considered altering the active `args_schema`, which could be inspected and aligned with the active `output_strategy.input_specs` in some cases and avoid the changes in `3`, but I think that would rely on one of (among other possibilities):
    - all supported op signatures having optional Tensors (`DTensorSpec`) args after required tensors (which isn't a planned requirement as far as I know),
    -  (somewhat brittle) heuristic-driven arg alignment
    -  only supporting kwargs etc.

### Added Tests

To facilitate detection of future `requires_grad` pattern op failure modes as `DTensor` evolves, I added the following two tests:

1. `test/distributed/_tensor/test_math_ops.py DistMathOpsTest.test_layer_norm_bwd_req_grad`
    - Tests `native_layer_norm_backward` specifically with 20 subtests that sweep valid `output_mask` patterns along in different LayerNorm dimensionality and `elementwise_affine` configurations.

2. `test/distributed/tensor/parallel/test_tp_examples.py DistTensorParallelExampleTest.test_transformer_req_grad`
    - Samples a subset of `requires_grad` patterns in a more realistic (relative to the `LayerNorm`-specific test) Transformer usage context with different `dtype` and `is_seq_parallel` configurations. Note since there was substantial overlap with the existing `test_transformer_training` test, I took the opportunity to refactor that test to allow relevant code-sharing. I also added an `ExpCommCounts` `NamedTuple` to facilitate the addition of additional `requires_grad` patterns that we may want to test in the future which may result in different comm counts. I created the separate `requires_grad` test to allow decoupling the multi-iteration `test_transformer_training` test and allow addition of new `requires_grad` scenarios as desired while being mindful of resources.

Thanks again to the PyTorch distributed team for your immensely valuable contributions to the open-source ML community!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133502
Approved by: https://github.com/XilunWu
2024-08-24 05:49:54 +00:00
8d3c6494ae [Inductor][FlexAttention] Rename IS_LAST_BLOCK to CHECK_BLOCK_BOUNDARY (#134378)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134378
Approved by: https://github.com/drisspg
2024-08-24 04:40:01 +00:00
5ad759ca33 [inductor] calibration inductor windows uts (2/N) (#134358)
skip unsupported UTs of `test\inductor\test_compile_worker.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134358
Approved by: https://github.com/jansel
2024-08-24 04:08:59 +00:00
5ae9c01794 [DTensor] Add naive replicate strategy for aten._linalg_eigh.default (#134284)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134284
Approved by: https://github.com/awgu
2024-08-24 03:50:05 +00:00
962e1f6ca7 [DTensor] Add aten.any.default,dim,out to linear_reduction_strategy (#134206)
For `aten.any`, we can use `reduce_op="sum"` as the linear reduction op.

When we do `all_reduce` with `reduce_op="sum"` on a bool tensor, if one rank returns `torch.Tensor([True])`, then the reduction result is `torch.Tensor([True])`. Only when all ranks return `torch.Tensor([False])` would the reduction result be `torch.Tensor([False])`. This matches `any`'s behavior.
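
A local illustration of that reduction semantics (no process group needed; the per-rank values are made up): summing boolean results reproduces `any` exactly.

```python
import torch

per_rank = [torch.tensor([False]), torch.tensor([True]), torch.tensor([False])]
reduced = sum(t.to(torch.int64) for t in per_rank).bool()   # sum > 0 -> True
assert reduced.item() == any(t.item() for t in per_rank)
```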

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134206
Approved by: https://github.com/tianyu-l, https://github.com/chuanhaozhuge
2024-08-24 03:49:46 +00:00
5d39b14b68 [DeviceMesh] Add DeviceMesh slicing support for flatten mesh dim (#133839)
Add DeviceMesh slicing support so that we can do the following:
```
mesh_3d = init_device_mesh(
    self.device_type, (2, 2, 2), mesh_dim_names=("replicate", "shard", "cp")
)
shard_cp_mesh = mesh_3d["shard", "cp"]._flatten()
hsdp_mesh = mesh_3d["replicate", "shard_cp"]
# we can get the corresponding group of the flatten mesh through

group = shard_cp_mesh.get_group()
# or
group = mesh_3d["shard_cp"].get_group()
# or
mesh_3d.get_group(mesh_dim="shard_cp")
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133839
Approved by: https://github.com/fegin
ghstack dependencies: #133838
2024-08-24 03:49:29 +00:00
195abdb85c ppc64le: VSX Support for Inductor (#132746)
### Description

This PR extends the `VecISA` class to include support for VSX on the `ppc64le` architecture within the Inductor backend. This enhancement enables vectorization support, resulting in performance improvements when using `torch.compile()` on `ppc64le`.

### Fixes

- Resolved the `test_acosh_with_negative_large_input` test case in `test_cpu_repro.py` by implementing `acosh` for VSX.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132746
Approved by: https://github.com/jansel
2024-08-24 03:36:09 +00:00
519342962d Pass process group info into NcclWork (#134269)
Summary: Pass process group info into NcclWork

Test Plan: buck2 run mode/dev-nosan kineto/libkineto/fb/integration_tests:pytorch_execution_trace_integration_test

Differential Revision: D61677160

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134269
Approved by: https://github.com/wconstab
2024-08-24 01:04:43 +00:00
e2a87fb1e9 [ONNX] Update exporter logic (#134304)
Sync the exporter logic with torch-onnx at https://github.com/justinchuby/torch-onnx/compare/v0.1.12...v0.1.15.

https://github.com/pytorch/pytorch/issues/129277

- Create a `testing` module to facilitate testing model accuracy. The model is internal
- Improve decomp table
- Improve model verification logic
- Add tests

The next PRs will enable OpInfo tests and clean up existing code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134304
Approved by: https://github.com/titaiwangms
2024-08-24 00:49:54 +00:00
a1d0b4d568 Add option to skip functional passes in the pattern matcher's replacement graph (#134364)
The pattern matcher runs DCE and remove_noop_ops on the replacement
graph by default. Previously we had a switch for the DCE. This PR
changes that switch to also control if we run remove_noop_ops.

The context was that there is silent incorrectness with
auto_functionalized. We use the Pattern matcher to decompose
auto_functionalized into a mutable op + clones; remove_noop_ops were
deleting the clones.

Future: can try #134363

Test Plan:
- new test. I wasn't able to produce a silently incorrect example so I
  settled for asserting that clones still exist in the post-grad graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134364
Approved by: https://github.com/eellison
ghstack dependencies: #133639
2024-08-24 00:38:55 +00:00
2c8fc3f4ce [inductor] Move imports to top of file in generated code (#134195)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134195
Approved by: https://github.com/eellison
ghstack dependencies: #134194
2024-08-24 00:35:57 +00:00
1aa0e35a04 [inductor] Remove dead code in multi_kernel.py (#134194)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134194
Approved by: https://github.com/eellison
2024-08-24 00:35:57 +00:00
4ff1a4dd0f [export] support set_grad_enabled hop in dynamo to enable re-tracing (#134281)
As titled. We added dynamo support for the wrap_with_set_grad_enabled hop to support re-tracing an exported program.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134281
Approved by: https://github.com/tugsbayasgalan
2024-08-24 00:35:53 +00:00
9dc47f5e62 [FlexAttention]Fix how we realize input buffers (#134351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134351
Approved by: https://github.com/Chillee
2024-08-24 00:31:00 +00:00
4c28a0eb0b c10d/logging: add C10D_LOCK_GUARD (#134131)
This adds logs if we can't acquire locks in NCCLUtils and ProcessGroupNCCL for 30s.

This is motivated by some deadlocks we're seeing, and it's unclear if they are in NCCL or on the PyTorch side of things.

This required replacing most `std::mutex` with `std::timed_mutex` and `std::condition_variable_any` as appropriate.

Test plan:

existing CI for regressions

will add unit tests on `C10D_LOCK_GUARD`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134131
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
2024-08-24 00:27:39 +00:00
e52e93e8fd Update scale-config files with linux.24xlarge.ephemeral (#134380)
Add linux.24xlarge.ephemeral  to scale config
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134380
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi
2024-08-24 00:01:39 +00:00
54ff320519 [export] refactor ExportGraphSignature construction (#134059)
Refactors construction of ExportGraphSignature object for export & training IR, explicitly creating AOTAutograd signature for training IR. This will be helpful for upcoming refactors for placeholder naming & runtime asserts prettifying.

Changes:
- dedups `make_argument_spec` call, moved to export/graph_signature.py
- `_sig_to_specs` wrapped into new function `_convert_to_export_graph_signature`, directly converts GraphSignature -> ExportGraphSignature
- `_make_fx_helper` explicitly creates AOTAutograd GraphSignature object
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134059
Approved by: https://github.com/angelayi, https://github.com/ydwu4
2024-08-23 23:29:28 +00:00
aa9f4cc733 [Inductor][CPP] Support vectorization of remainder (#129849)
**Summary**
When checking the vectorization status across 3 test suites, we found that some operators disabled vectorization with the message `Disabled vectorization: op: remainder`. In this PR, we add vectorization support for this op.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_remainder
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_int_div_vec
```

Differential Revision: [D61147014](https://our.internmc.facebook.com/intern/diff/D61147014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129849
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-08-23 23:26:51 +00:00
286f2dba9f [2/N refactor NCCLPG error logs][c10d] Make msg in monitoring thread in NCCLPG more accurate and simpler (#134036)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134036
Approved by: https://github.com/wconstab
2024-08-23 23:21:28 +00:00
2cfc2da527 [export] Make move_to_device_pass function public (#134263)
Summary:
This is a follow-up of https://github.com/pytorch/pytorch/pull/133660

Here we make the `move_to_device_pass()` function public so users can import it via `from torch.export.passes import move_to_device_pass`.
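
A hedged usage sketch, assuming the pass takes an ExportedProgram and a target device string (the model is illustrative and the "cuda" target needs a CUDA build to actually run):

```python
import torch
from torch.export import export
from torch.export.passes import move_to_device_pass

class Small(torch.nn.Module):
    def forward(self, x):
        return x + 1

ep = export(Small(), (torch.randn(4),))
ep_cuda = move_to_device_pass(ep, "cuda")  # retarget the exported program
```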

Test Plan: CI

Differential Revision: D61671310

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134263
Approved by: https://github.com/angelayi
2024-08-23 23:18:30 +00:00
c638a40a93 [Caffe2] Remove unused AVX512 code (#133160)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133160
Approved by: https://github.com/albanD
2024-08-23 23:16:16 +00:00
1f19ccb5b3 [Inductor/Triton] Customize triton codegen to optionally preserve input dtype on tl.load (#132406)
Differential Revision: D60536337

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132406
Approved by: https://github.com/jfix71, https://github.com/blaine-rister
2024-08-23 22:58:43 +00:00
8ff3a5be1b [export] basic auto dynamic shapes (#133620)
Starter version of automatic dynamic shapes for export.

Creates enums `DIM.AUTO`, `DIM.STATIC`, allowing the user to specify `AUTO` for dims in dynamic_shapes specs, meaning that the corresponding dims are treated as dynamic, and relevant guards will do what's necessary (e.g. refine ValueRanges, set replacements based on equality, or even set static) without raising ConstraintViolationErrors. Basically allows the user to say, "a bunch of these dims can be dynamic, let export do model analysis and return the program with maximum possible dynamism, without complaining".

The usage for specifying `dynamic_shapes` is now:
```
AUTO -> dynamic by default, return whatever produce_guards() says, even if it's static
None/int/STATIC -> static
Dim/DerivedDim -> same as before - will complain if the min/max range is invalid, or if dims related to this are unspecified.
```

Caveat 1: specifying `AUTO` for a dim won't guarantee it'll be dynamic:

- specifying `AUTO` for a dim will return the maximum possible dynamism given your program and other specified constraints, but this can still mean you'll get a static program. For example, with the program below, x is specified dynamic, but it's equal to y, which is specified static, and with how we currently do things we won't promote y to dynamic, but will demote(?) x to static. So this can be surprising if you don't fully know your model, and/or missed one of your other inputs when specifying auto-dynamic shapes.
```
class Foo(torch.nn.Module):
    def forward(self, x, y):
        return x + y
inputs = (torch.randn(6), torch.randn(6))
export(Foo(), inputs, dynamic_shapes={"x": (DIM.AUTO,), "y": None})
```

Caveat 2: specifying `AUTO` and Dims in the same spec is still problematic:

- The way Dims/DerivedDims are currently handled is very strict. A Dim represents a symbol, and we require a user to specify the symbol for all dims governed by the symbol - that's why we've seen errors in the past like `The values of x must always be related to y by ...`, asking the user to specify the exact relation as in the program. We also require the specified min/max range to be a subset of the valid range from model analysis. All this doesn't compose well with specifying `AUTO` just yet - for example in the program below, ideal behavior could be to return a dynamic program, where `dx = x.size(0) = y.size(0)` has range (3,6). Unfortunately this crashes, and correct behavior is to specify `dx` for both inputs. So currently we raise a UserError and crash if both Dims + `AUTO` are present in the spec.
```
class Foo(torch.nn.Module):
    def forward(self, x, y):
        return x + y
inputs = (torch.randn(6), torch.randn(6))
export(Foo(), inputs, dynamic_shapes={"x": (DIM.AUTO,), "y": {0: Dim("dx", min=3, max=6)}})  # this doesn't work, because x & y are related
```

Implementation details:

This is done by setting `assume_static_by_default=False`, and doing a transform on the `dynamic_shapes` spec to preserve semantics. `assume_static_by_default=False` will treat unspecified dims or Nones as dynamic. This is the opposite of what `export.export()` currently does - unspecified Dims/Nones are treated as static. Historically this static-by-default behavior, where the user deals with fewer guards, has been desirable, and we would like to respect that in this implementation. So an internal spec transformation, `_transform_shapes_for_default_dynamic()`, is added; it does the spec conversion necessary to be compatible with dynamic-by-default. Specifically, AUTOs are converted into Nones, and Nones/unspecified dims are filled in with explicitly static constraints.

For example, this would look like, for a 3-d tensor: `{0: DIM.AUTO, 1: None, 2: Dim("dx")} -> {0: None, 1: 32, 2: Dim("dx")}`

This does seem overly complicated, but it's done to preserve dynamic shapes semantics for `torch._dynamo.export()`, which already uses `assume_static_by_default=False`, and follows the same process for generating shape constraints, via `_process_dynamic_shapes`. There the semantics are:
```
None/unspecified: dynamic by default
Dim/DerivedDim: also a strict assertion
```

If we don't care about BC for `_dynamo.export(dynamic_shapes)`, then we can just modify semantics for `_process_dynamic_shapes()` and change all the relevant tests in `test/dynamo/test_export.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133620
Approved by: https://github.com/avikchaudhuri
2024-08-23 22:56:39 +00:00
f5a2a22dc4 [export] Fix unflattener to respect nn.Parameter requires_grad (#134353)
Summary: Fixes P1539870235

Test Plan: CI

Differential Revision: D61726403

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134353
Approved by: https://github.com/pianpwk
2024-08-23 22:49:34 +00:00
eaa2c0e009 Improves error message when passing wrong tensor type to torch.nn.functional.one_hot (#134209)
The function expects a Tensor of type LongTensor. It currently throws the following error: "one_hot is only applicable to index tensor.", which, IMO, does not give the user enough information about what the problem is.

The PR simply adds extra information to the error message for this specific scenario.
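
An illustration of the scenario the clearer message targets (the values are made up): `one_hot` requires an index (int64 / LongTensor) input.

```python
import torch
import torch.nn.functional as F

idx = torch.tensor([0, 2, 1], dtype=torch.float32)
# F.one_hot(idx, num_classes=3)              # raises: not an index tensor
onehot = F.one_hot(idx.long(), num_classes=3)  # works: shape (3, 3)
```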
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134209
Approved by: https://github.com/mikaylagawarecki
2024-08-23 22:40:05 +00:00
09a82f3d24 [EZ][BE] Delete references to non-existing AWS_SCCACHE secrets (#134370)
First of all, none of the binary builds should be using sccache for security and reliability reasons (as a distributed cache can become corrupted/compromised), but even if they do, all authentication to the AWS service should be done via OIDC.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134370
Approved by: https://github.com/seemethere, https://github.com/atalman
2024-08-23 22:23:48 +00:00
adf0f50cc7 [Compile] Add NEON implementation for bf16->fp32 cast (#134297)
This changes assembly generated for the following routine
```cpp
void bfloat16tofloat(c10::BFloat16* in, float* out) {
        auto tmp0 = at::vec::Vectorized<c10::BFloat16>::loadu(in, 8);
        auto tmp1 = at::vec::convert<float>(tmp0);
        tmp1.store(out);
}
```
from
```asm
bfloat16tofloat(c10::BFloat16*, float*):
0000000000000034        stp     x29, x30, [sp, #-0x10]!
0000000000000038        mov     x29, sp
000000000000003c        sub     x9, sp, #0x90
0000000000000040        and     sp, x9, #0xffffffffffffffe0
0000000000000044        mov     x8, #0x0
0000000000000048        adrp    x9, 0 ; 0x0
000000000000004c        ldr     x9, [x9]
0000000000000050        ldr     x9, [x9]
0000000000000054        str     x9, [sp, #0x88]
0000000000000058        stp     xzr, xzr, [sp, #0x10]
000000000000005c        ldr     q0, [x0]
0000000000000060        str     q0, [sp]
0000000000000064        ldr     q1, [sp, #0x10]
0000000000000068        stp     q0, q1, [sp, #0x20]
000000000000006c        add     x9, sp, #0x40
0000000000000070        add     x10, sp, #0x20
0000000000000074        add     x11, x10, x8
0000000000000078        ldp     d0, d1, [x11]
000000000000007c        shll.4s v0, v0, #16
0000000000000080        shll.4s v1, v1, #16
0000000000000084        stp     q0, q1, [x9], #0x20
0000000000000088        add     x8, x8, #0x10
000000000000008c        cmp     x8, #0x20
0000000000000090        b.ne    0x74
0000000000000094        add     x8, sp, #0x40
0000000000000098        ld1.4s  { v0, v1 }, [x8]
000000000000009c        st1.4s  { v0, v1 }, [x1]
00000000000000a0        ldr     x8, [sp, #0x88]
00000000000000a4        adrp    x9, 0 ; 0x0
00000000000000a8        ldr     x9, [x9]
00000000000000ac        ldr     x9, [x9]
00000000000000b0        cmp     x9, x8
00000000000000b4        b.ne    0xc4
00000000000000b8        mov     sp, x29
00000000000000bc        ldp     x29, x30, [sp], #0x10
00000000000000c0        ret
00000000000000c4        bl      0xc4
```
to
```asm
bfloat16tofloat(c10::BFloat16*, float*):
0000000000000034        ldr     q0, [x0]
0000000000000038        shll.4s v1, v0, #16
000000000000003c        shll2.4s        v2, v0, #16
0000000000000040        st1.4s  { v1, v2 }, [x1]
0000000000000044        ret
```

And as result speeds up `python3 torchchat.py generate stories110M --num-samples 3 --compile --device cpu --dtype bfloat16` from 33 to 90 tokens/sec

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134297
Approved by: https://github.com/kimishpatel
2024-08-23 22:22:59 +00:00
69813dbbfd [export] Schematize nn_module_stack serialization (#134049)
`nn_module_stack` was previously serialized to string by adding commas between the module_path and module_type. This is error-prone when the `nn_module_stack` itself contains commas.

This PR fixes this by creating a dictionary to store the `nn_module_stack` and serializing it to a string via `json.dumps()`.
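
A hedged sketch of the serialization change (the keys and values below are illustrative): a dict round-trips cleanly even when module paths or type names contain commas.

```python
import json

nn_module_stack = {
    "L__self___blocks_0": ("blocks.0", "torch.nn.modules.container.Sequential"),
}
serialized = json.dumps(nn_module_stack)
# JSON turns tuples into lists, but the content survives unambiguously.
assert json.loads(serialized) == {k: list(v) for k, v in nn_module_stack.items()}
```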

Fixes #131941

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134049
Approved by: https://github.com/angelayi
2024-08-23 21:50:01 +00:00
78d69bfe11 [SymmetricMemory] introduce multicast support, multimem_all_reduce_ and multimem_one_shot_all_reduce (#133424)
### Summary
- Added multicast support to SymmetricMemory. If the cuda runtime and cuda driver have multicast support, SymmetricMemory associates all peer buffers with a multicast object and exposes the multicast virtual address.
- Implemented `multimem_all_reduce_` and `multimem_one_shot_all_reduce` based on the multicast support. The two variants show different performance characteristics for different message sizes. We plan to use Inductor for collective algo selection (and the required symmetric memory buffer allocation).

### Benchmark

8xH100 (non-standard version with HBM2e at 650W). NVSwitch V3 with NVLS support.

![image](https://github.com/user-attachments/assets/4998a16b-c2c0-4797-9dd0-1da2303df947)

![image](https://github.com/user-attachments/assets/278ad361-52cb-4864-82c6-bb67e8d0a3fe)

Differential Revision: [D61682507](https://our.internmc.facebook.com/intern/diff/D61682507)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133424
Approved by: https://github.com/yf225, https://github.com/weifengpy
2024-08-23 20:09:20 +00:00
2ca7f0fc5c [Minimizer] for sequential mode, respect find_all setting (#134339)
Summary: Currently, for sequential mode, the minimizer search terminates after a node is excluded via the user-defined exclusion_fn. However, on some occasions we would like the search to continue past that node for the remaining nodes. In this diff I am changing the termination criterion to respect the find_all setting: we continue the sequential search if it is set.

Test Plan: CI

Differential Revision: D61720262

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134339
Approved by: https://github.com/jfix71
2024-08-23 19:59:43 +00:00
58e2cf364b Make DTensor sharding propagation for scaled_dot_product_efficient_attention and scaled_dot_product_flash_attention more conservatively cached (#134146)
Fixes #134050

### The issue

The current `DTensor` sharding propagation caching policy for  `aten.scaled_dot_product_efficient_attention` (default) can result in silently incorrect gradients or trigger an IMA after cuda kernel launch in mixed `require_grad` configurations. Please see issue #134050 for a full description of the observed failure patterns along with reproduction. Note `aten.scaled_dot_product_flash_attention` presents a similar concern so this PR addresses both [as discussed here.](https://github.com/pytorch/pytorch/issues/134050#issuecomment-2299887602)

### Remediation

While there are a number of ways this could be addressed, the most straightforward remediation is to modify the sharding propagation caching policy of [`aten._scaled_dot_product_efficient_attention.default`](b03381cac2/torch/distributed/_tensor/ops/_matrix_ops.py (L337-L340)), registering it with `schema_info=RuntimeSchemaInfo(4)` to prevent cache sharing between differing `compute_log_sumexp` values i.e.

```python
@register_op_strategy(aten._scaled_dot_product_efficient_attention.default, schema_info=RuntimeSchemaInfo(4))
def scaled_dot_product_efficient_attention_strategy(
...
```

[As discussed here](https://github.com/pytorch/pytorch/issues/134050#issuecomment-2299887602),  since `aten::_scaled_dot_product_flash_attention` could be affected by a similar issue wrt `return_debug_mask`, this PR adjusts the sharding propagation caching policy for that op as well.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134146
Approved by: https://github.com/tianyu-l
2024-08-23 19:43:30 +00:00
157de30f53 [sparse] Update cuSPARSELt to v0.6.2 (#134022)
Summary:

This PR updates cuSPARSELt to v0.6.2. I think we should land
https://github.com/pytorch/pytorch/pull/128534 first though.

Most of this PR is just enabling tests to run when cuSPARSELt v0.6.2 is
available.

Unfortunately I was running into a bug with fp32 support on Hopper, so I
removed fp32 support from the cuSPARSELt backend. I think this should be
fine since almost everybody uses the bfloat/float16/int8 kernels.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134022
Approved by: https://github.com/jerryzh168, https://github.com/malfet
ghstack dependencies: #128534
2024-08-23 19:34:53 +00:00
74a9001ada [aoti] Add additional custom op input type support (#132454)
Summary:
Added support for more custom op input types; now only dtype, layout, and
memory format are missing as input types, since we need to add some more
testing for mapping those types to their integer values
([previous
comment](https://github.com/pytorch/pytorch/pull/126215#discussion_r1617428066)).

This PR also replaces the `DynamicArg` struct's `serialized_arg_val` with
`list_item_types`, which stores an optional list of strings, where each string
represents the type of the value within this list. This is only used for
parsing lists of optional tensors, where we need to know if a specific value in
the list should be a tensor, or a None. Replacing with a list of strings is
also better than storing the actual json format because then we don't need to
parse the json string during the runtime, and can just loop over a preprocessed
list of strings.

Test Plan: `buck2 run @//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r "test_custom_"`

Reviewed By: desertfire

Differential Revision: D60295995

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132454
Approved by: https://github.com/desertfire
2024-08-23 19:11:36 +00:00
f8fbfe5846 Always emit end events even on failure, use thread local storage for stack (#134279)
Summary:
We should always emit an end event in a finally block so that if a unit test or job fails, the stack is still correct.

Also, we use thread-local storage for the stack, so that in multithreaded scenarios the stack is still maintained correctly.
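
A minimal sketch of both points above (the `emit_start`/`emit_end` helpers are hypothetical stand-ins for the logging calls): end events are emitted in a finally block, and each thread keeps its own stack.

```python
import threading

_local = threading.local()  # each thread gets its own event stack

def timed(event, emit_start, emit_end, fn, *args, **kwargs):
    stack = getattr(_local, "stack", None)
    if stack is None:
        stack = _local.stack = []
    stack.append(event)
    emit_start(event)
    try:
        return fn(*args, **kwargs)
    finally:
        emit_end(event)  # runs even if fn raises, so the stack stays balanced
        stack.pop()
```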

Test Plan:
Run benchmark and see that everything still works
Run
```
TORCH_LOGS=dynamo buck run test/functorch:test_aotdispatch -- -r test_backward_mutation_on_grad_out
```
With some extra logging to see that start events with the correct stack are emitted, and the end events are also emitted even though the test fails at runtime.

Differential Revision: D61682556

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134279
Approved by: https://github.com/aorenste
2024-08-23 18:13:13 +00:00
a23d86c178 [hop] ban creating hop by directly instantiating HigherOrderOperator. (#133645)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133645
Approved by: https://github.com/zou3519
2024-08-23 17:28:02 +00:00
3546628a2a Allow mp.start_processes to create processes in parallel (#133707)
Summary:
Background discussion in https://fb.workplace.com/groups/319878845696681/posts/1226087421742481

and pytorch issue filed https://github.com/pytorch/pytorch/issues/133010

One way to fix this problem is to add an option to start processes in parallel on the PyTorch side.

Test Plan: Tested aps run in problem and things are in parallel now (next diff)

Differential Revision: D61301603

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133707
Approved by: https://github.com/d4l3k, https://github.com/ezyang
2024-08-23 17:11:20 +00:00
afd081c9d4 [inductor] Fix needs_fixed_stride_order silent incorrectness (#133639)
Fixes #128084

The approach is option 2 of what Elias suggested in the comment
thread:
- We require tensors to have the correct stride at usage. This may
  involve a clone; if there was a clone and then a mutation into it
  then we copy_ back the result of the mutation.

The reason why I went this approach was because it was the easiest and
Inductor already works really hard to remove additional clones/copy_.

There are some cases that this doesn't generate efficient code for; for
example, if the tensor is a view, we don't change the base of the view
to have the right stride order, instead we do a clone.
The view case isn't very common so I'm ignoring it for now but we could
improve this in the future.

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133639
Approved by: https://github.com/eellison
2024-08-23 17:07:58 +00:00
2553278bae .github/merge_rules.yaml: added multiprocessing to Distributed (#134262)
This allows the Distributed team to approve changes to torch.multiprocessing which is used by torchelastic/run.

Example PR: https://github.com/pytorch/pytorch/pull/133707

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134262
Approved by: https://github.com/wconstab, https://github.com/PaliC
2024-08-23 17:07:20 +00:00
6eae569546 [dynamo][fix] always use POSIX-style path in trace_rule.py (#133987)
We are hardcoding some paths as strings in POSIX style. This leads to different results on Windows. This PR forces all paths to be in POSIX style.
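A minimal illustrative sketch (not the PR's actual code) of normalizing path strings to POSIX style so comparisons behave the same on Windows and Linux; the helper name is hypothetical:

```python
from pathlib import PureWindowsPath

def to_posix_style(p: str) -> str:
    # PureWindowsPath understands both "\" and "/" separators;
    # as_posix() renders the result with forward slashes only.
    return PureWindowsPath(p).as_posix()

assert to_posix_style("torch\\_dynamo\\trace_rules.py") == "torch/_dynamo/trace_rules.py"
assert to_posix_style("torch/_dynamo/trace_rules.py") == "torch/_dynamo/trace_rules.py"
```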

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133987
Approved by: https://github.com/jansel
2024-08-23 16:28:57 +00:00
2eef749b31 [Inductor][FlexAttention] Fix IS_DIVISIBLE bug and add unit tests (#134055)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134055
Approved by: https://github.com/Chillee
2024-08-23 16:11:09 +00:00
8ae4f82243 [aotd] Support HOP effects in backward (#132638)
Support for effectful operations in backward:

1/ AOTD collects metadata from the forward fn only, so we can have usages of effectful ops in backward that were not used in forward => allowing token discovery during the joint function.

FunctionalTensorMode holds _tokens; in the joint function, after tracing forward, we memoize _tokens as `_tokens_forward_output`.

2/ Tokens are added as primal inputs (forward) in EffectTokensWrapper.
Tokens that will be used in backward are among the partitioner's saved values. We do not have control over which positions they are saved at in the forward outputs.

3/ If new tokens are discovered in backward after tracing joint_fn, they are manually added at the end of the primals in the resulting graph.
(_aot_autograd/utils.py)

4/ All effectful ops during backward are marked with the 'must_be_in_backward' partitioner_tag, to prevent the partitioner from placing them in forward.

For that, functional_tensor_mode got a new optional state `self._effects_partitioner_tag` for effectful ops, set after tracing forward.

There are additional changes in the partitioner to improve the functionality of 'must_be_in_backward'.

5/ Unlifting tokens now has to run for both forward and backward.
- As tokens saved for backward are placed at non-static positions, we identify the input and output tokens to erase by the inputs and outputs of the `with_effects` operation.
- In forward we can have input tokens, discovered in backward, that are not used in with_effects ops in forward but are saved for backward. We identify them by their position in the forward inputs.

6/ Adding aot debug logging for graphs before unlifting and before adding the additional primals for backward tokens.

Tests:
```
python test/higher_order_ops/test_with_effects.py
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132638
Approved by: https://github.com/bdhirsh
2024-08-23 15:30:58 +00:00
7fd3b69886 Revert "[dynamo][super] Improve handling of getattr on super (#134039)"
This reverts commit 1da3a049dac3c78554506d5ef9ede55b7c2b774d.

Reverted https://github.com/pytorch/pytorch/pull/134039 on behalf of https://github.com/jeanschmidt due to broke internal torchrec signals, see [D61670727](https://www.internalfb.com/diff/D61670727) ([comment](https://github.com/pytorch/pytorch/pull/134039#issuecomment-2307151643))
2024-08-23 13:57:04 +00:00
09127b096c Revert "[inductor] Fix needs_fixed_stride_order silent incorrectness (#133639)"
This reverts commit 8604c0a150b12e0ba3f9a6faaf52498370f21368.

Reverted https://github.com/pytorch/pytorch/pull/133639 on behalf of https://github.com/jeanschmidt due to Broke internal fbgemm signals, see [D61670495](https://www.internalfb.com/diff/D61670495) ([comment](https://github.com/pytorch/pytorch/pull/133639#issuecomment-2307133060))
2024-08-23 13:48:04 +00:00
75c22dd8bf Revert "[dynamo][fix] always use POSIX-style path in trace_rule.py (#133987)"
This reverts commit b23779ef0af8d4f06e667da460c43d264359f1f0.

Reverted https://github.com/pytorch/pytorch/pull/133987 on behalf of https://github.com/albanD due to This breaks windows trunk jobs ([comment](https://github.com/pytorch/pytorch/pull/133987#issuecomment-2306956764))
2024-08-23 12:08:56 +00:00
0e49b2f18e [dynamo][itertools] support itertools.tee (#133771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133771
Approved by: https://github.com/jansel
ghstack dependencies: #133769, #133778, #133779
2024-08-23 10:13:12 +00:00
8d90392fb0 [dynamo] simplify implementation for builtins.sum (#133779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133779
Approved by: https://github.com/jansel
ghstack dependencies: #133769, #133778
2024-08-23 10:10:19 +00:00
6c0b15e382 [dynamo] simplify implementation for functools.reduce (#133778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133778
Approved by: https://github.com/jansel
ghstack dependencies: #133769
2024-08-23 09:10:44 +00:00
cc3a76edba [dynamo] simplify polyfill registration for builtins.all and builtins.any (#133769)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133769
Approved by: https://github.com/jansel
2024-08-23 09:05:24 +00:00
ca3f48dd5b [XPU] Set make triton install pre-built whl by default (#130313)
Now the user could install the pre-built `triton` for xpu by calling the following:

```Bash
export USE_XPU=1
make triton
```

[Dev Only]: If the user wishes to build it from source, one could set an additional flag:

```Bash
export TRITON_XPU_BUILD_FROM_SOURCE=1
export USE_XPU=1
make triton
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130313
Approved by: https://github.com/chuanqi129, https://github.com/EikanWang, https://github.com/atalman
2024-08-23 07:36:34 +00:00
55cdcef0f7 [fp8 rowwise] Work around CUDA Invalid Memory Access bug (#134227)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134227
Approved by: https://github.com/drisspg, https://github.com/eqy
ghstack dependencies: #134223, #134224, #134225, #134226
2024-08-23 07:27:55 +00:00
9d81767d43 [fp8 rowwise] Rework dispatch logic (#134226)
It's likely a matter of opinion, but I find this new version to have less duplication, even if it might have more boilerplate.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134226
Approved by: https://github.com/drisspg
ghstack dependencies: #134223, #134224, #134225
2024-08-23 07:27:55 +00:00
0afb4872aa [fp8 rowwise] Support non-contiguous inputs and clarify checks (#134225)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134225
Approved by: https://github.com/drisspg
ghstack dependencies: #134223, #134224
2024-08-23 07:27:52 +00:00
9f8d3f511f [fp8 rowwise] Some clean-up (#134224)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134224
Approved by: https://github.com/drisspg
ghstack dependencies: #134223
2024-08-23 07:27:48 +00:00
2f198605ac [fp8 rowwise] Simplify epilogue visitor tree via common blocks (#134223)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134223
Approved by: https://github.com/drisspg
2024-08-23 07:27:41 +00:00
25b2e46573 [dynamo] add max iterator limit while inlining generators (#134233)
Related:

- #133879

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134233
Approved by: https://github.com/jansel
2024-08-23 07:03:31 +00:00
673b9bd561 [WIP] [Inductor UT] Reuse inductor UT for intel GPU test/inductor/test_multi_kernel.py (#133943)
[Inductor UT] Reuse Inductor test case for Intel GPU.
Reuse `test/inductor/test_multi_kernel.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133943
Approved by: https://github.com/EikanWang, https://github.com/jansel

Co-authored-by: Justin Chu <justinchu@microsoft.com>
Co-authored-by: Jesse Cai <jcjessecai@gmail.com>
Co-authored-by: Sahdev Zala <spzala@us.ibm.com>
Co-authored-by: rzou <zou3519@gmail.com>
Co-authored-by: FFFrog <ljw1101.vip@gmail.com>
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: yanbing-j <yanbing.jiang@intel.com>
Co-authored-by: Will Feng <yf225@cornell.edu>
Co-authored-by: Bin Bao <binbao@meta.com>
Co-authored-by: Yiming Zhou <yimingzhou@meta.com>
Co-authored-by: Yanbo Liang <ybliang8@gmail.com>
2024-08-23 05:52:29 +00:00
80846caa8c [inductor] fix dynamic size array(vla) build error on msvc v4 (#134221)
MSVC doesn't support dynamic-size arrays (VLAs).
Ref: https://stackoverflow.com/questions/56555406/creating-dynamic-sized-array-using-msvc-c-compiler

We tried these solutions:
1. Use std::vector instead, in a previous PR: https://github.com/pytorch/pytorch/pull/134140, but it changed the variable's type and failed UTs.
2. Use `std::unique_ptr` instead, in PR: https://github.com/pytorch/pytorch/pull/134156; @jansel reviewed and left comments: https://github.com/pytorch/pytorch/pull/134156#pullrequestreview-2253091693. That makes sense: allocating memory may make the code run slower.
3. Use a fixed-size array instead, in PR: https://github.com/pytorch/pytorch/pull/134210; a fixed size cannot handle the situation where the reserved size is smaller than the CPU count.
> a. Limiting with a min() function failed the local test: https://github.com/pytorch/pytorch/pull/134210#issuecomment-2304447729
> b. Dynamically selecting between a fixed-size and a dynamic array: https://github.com/pytorch/pytorch/pull/134210#issuecomment-2304128666. It makes the code too complex to maintain.

After discussing with the original PR (https://github.com/pytorch/pytorch/pull/115620) author @zhuhaozhe, we think:
1. MSVC is the only compiler that does not support VLAs.
2. MSVC has worse performance than the other compilers, so use `std::unique_ptr` for MSVC and make it work.
3. For the other compilers, keep using the current `VLA` code.
4. Windows users can use `clang-cl` or `icx` to get better performance than MSVC.
5. As discussed with @jansel, we need to move the compiler check to the Python side and make the output code cleaner.

Reproduce UT:
```cmd
pytest test/inductor/test_cpu_repro.py -v -k test_reduction_with_dynamic_threads
```

Error msg:
```cmd
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(13): error C2131: expression did not evaluate to a constant
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(13): note: failure was caused by a read of a variable outside its lifetime
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(13): note: see usage of 'max_threads'
C:/Users/Xuhan/AppData/Local/Temp/tmpncykej5v/a4/ca4534cazplidnf7vopaaxaifqkjiyhxm3h2gsylgztputbaeybx.cpp(16): error C3863: array type 'float [max_threads]' is not assignable
```
Generated code:
```c++

#include "C:/Users/Xuhan/AppData/Local/Temp/tmpt6mxcjzi/j2/cj22tgrdgh42wbunl7gdptg2lintcziox2kmr7rdbcc6n2njrhgx.h"
extern "C" __declspec(dllexport) void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       float* out_ptr0,
                       float* out_ptr1)
{
    {
        {
            float tmp_acc0 = 0;
            at::vec::Vectorized<float> tmp_acc0_vec = at::vec::Vectorized<float>(0);
            int max_threads = omp_get_max_threads();
            float tmp_acc0_arr[max_threads];
            for (int tid = 0; tid < max_threads; tid++)
            {
                tmp_acc0_arr[tid] = 0;
            }
            at::vec::Vectorized<float> tmp_acc0_vec_arr[max_threads];
            for (int tid = 0; tid < max_threads; tid++)
            {
                tmp_acc0_vec_arr[tid] = at::vec::Vectorized<float>(0);
            }
            #pragma omp parallel
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134221
Approved by: https://github.com/zhuhaozhe, https://github.com/jansel
2024-08-23 05:40:08 +00:00
49b9f2d8b0 [inductor] fix signbit build fail on Windows. (#134229)
Reproduce UT:
```cmd
pytest test/inductor/test_torchinductor.py -v -k test_randint_int64_mod_cpu
```

Error message:
```cmd
cl : Command line warning D9025 : overriding '/openmp' with '/openmp:experimental'
c6airoloxwj4prmlejdyo5ybp43xa2yo5rbnpk4ttw3oifu6noor.cpp
C:/Users/Xuhan/AppData/Local/Temp/tmpx1fj2bd4/6a/c6airoloxwj4prmlejdyo5ybp43xa2yo5rbnpk4ttw3oifu6noor.cpp(23): error C2668: 'signbit': ambiguous call to overloaded function
C:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt\corecrt_math.h(309): note: could be 'bool signbit(float) noexcept'
C:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt\corecrt_math.h(314): note: or       'bool signbit(double) noexcept'
C:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt\corecrt_math.h(319): note: or       'bool signbit(long double) noexcept'
C:/Users/Xuhan/AppData/Local/Temp/tmpx1fj2bd4/6a/c6airoloxwj4prmlejdyo5ybp43xa2yo5rbnpk4ttw3oifu6noor.cpp(23): note: while trying to match the argument list '(__int64)'
C:/Users/Xuhan/AppData/Local/Temp/tmpx1fj2bd4/6a/c6airoloxwj4prmlejdyo5ybp43xa2yo5rbnpk4ttw3oifu6noor.cpp(24): error C2668: 'signbit': ambiguous call to overloaded function
C:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt\corecrt_math.h(309): note: could be 'bool signbit(float) noexcept'
C:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt\corecrt_math.h(314): note: or       'bool signbit(double) noexcept'
C:\Program Files (x86)\Windows Kits\10\include\10.0.22000.0\ucrt\corecrt_math.h(319): note: or       'bool signbit(long double) noexcept'
C:/Users/Xuhan/AppData/Local/Temp/tmpx1fj2bd4/6a/c6airoloxwj4prmlejdyo5ybp43xa2yo5rbnpk4ttw3oifu6noor.cpp(24): note: while trying to match the argument list '(int64_t)'
```

Generated code:
```c++

#include "C:/Users/Xuhan/AppData/Local/Temp/tmpcjnxnvkl/4f/c4ff4q4pxgo3yprbo2nkfopkt3qgi6rmptfpgpl2iylgtunvizwn.h"
extern "C" __declspec(dllexport) void kernel(const int64_t* in_ptr0,
                       int64_t* out_ptr0)
{
    #pragma omp parallel num_threads(8)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for
            for(int64_t x0=static_cast<int64_t>(0LL); x0<static_cast<int64_t>(20LL); x0+=static_cast<int64_t>(1LL))
            {
                auto tmp0 = in_ptr0[static_cast<int64_t>(0LL)];
                auto tmp1 = x0;
                auto tmp2 = c10::convert<int32_t>(tmp1);
                auto tmp3 = static_cast<int64_t>(-5);
                auto tmp4 = static_cast<int64_t>(5);
                auto tmp5 = randint64_cpu(tmp0, tmp2, tmp3, tmp4);
                auto tmp6 = static_cast<int64_t>(10);
                auto tmp7 = mod(tmp5, tmp6);
                auto tmp8 = static_cast<int32_t>(0);
                auto tmp9 = tmp7 != tmp8;
                auto tmp10 = std::signbit(tmp7);
                auto tmp11 = std::signbit(tmp6);
                auto tmp12 = tmp10 != tmp11;
                auto tmp13 = tmp9 & tmp12;
                auto tmp14 = decltype(tmp7)(tmp7 + tmp6);
                auto tmp15 = tmp13 ? tmp14 : tmp7;
                out_ptr0[static_cast<int64_t>(x0)] = tmp15;
            }
        }
    }
}
```

Fixed by casting the `std::signbit` argument to `long double`: https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/signbit?view=msvc-170

Local test passed:
![image](https://github.com/user-attachments/assets/e4467256-a068-40ef-a6ff-19b442e9116d)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134229
Approved by: https://github.com/jansel
2024-08-23 05:40:05 +00:00
311af3b988 Add new ops wrapped_linear_prepack and wrapped_quantized_linear_prepacked (#134232)
Summary:
This diff adds two new operators, torch.ops._quantized.wrapped_linear_prepack and torch.ops._quantized.wrapped_quantized_linear_prepacked. Together they form a decomposition of the op torch.ops._quantized.wrapped_quantized_linear added in the previous diff.

We decomposed it this way because the packed weight can be computed early, so we don't need to recompute it in every forward pass in AOTI.

Reviewed By: jerryzh168

Differential Revision: D61395887

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134232
Approved by: https://github.com/houseroad
2024-08-23 04:54:26 +00:00
b23779ef0a [dynamo][fix] always use POSIX-style path in trace_rule.py (#133987)
We are hardcoding some paths as strings in POSIX style. This leads to different results on Windows. This PR forces all paths to be in POSIX style.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133987
Approved by: https://github.com/jansel
2024-08-23 04:33:05 +00:00
a699bd1155 [dynamo] Cache _dynamo.disable results (#134272)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134272
Approved by: https://github.com/yf225, https://github.com/jansel
2024-08-23 04:20:50 +00:00
b454c51060 remove dynamic_dim (#134211)
Summary: As promised in https://github.com/pytorch/pytorch/pull/134045.

Test Plan: existing

Differential Revision: D61646937

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134211
Approved by: https://github.com/angelayi
2024-08-23 04:13:03 +00:00
058302494c [AOTI][Tooling] Add a test case where config.debug_intermediate_value_printer=True to check codegen (#133326)
Summary:
As title.

Add a test case in test_aot_inductor to check for codegen (i.e. `aoti_torch_print_tensor_handle` is inserted as expected for debugging printer) for both cpu and cuda based on a simple `addmm` test model.

Test Plan:
```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_aoti_debug_printer_codegen_abi_compatible_{cuda/cpu}
```

Differential Revision: D61169068

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133326
Approved by: https://github.com/ColinPeppler
2024-08-23 02:12:21 +00:00
d2c60749ac [Inductor][FlexAttention] Respect user's input kernel_options (#134065)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134065
Approved by: https://github.com/Chillee
2024-08-23 01:21:05 +00:00
8301add833 [4/N] Further refactor FR script to make it more modulized (#134196)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134196
Approved by: https://github.com/c-p-i-o
2024-08-23 01:15:29 +00:00
bcfc560aea [Profiler/CPU] Add Test for Dynamic Activity Toggling [4/n] (#134149)
Summary: Add tests that check function events for dynamic activity toggling for both GPU and CPU events. Also added comments from previous GH comments

Test Plan: Make sure all tests pass

Differential Revision: D61617514

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134149
Approved by: https://github.com/aaronenyeshi
2024-08-23 01:13:42 +00:00
bf5addb613 [FlexAttention] Enable different qk and v head-dims (#134043)
# Summary
Adds the option for the head dims to be different between QK and V tensors.

Fixes issue: https://github.com/pytorch/pytorch/issues/133674

V_DIM > QK_DIM is blocked by landing: https://github.com/triton-lang/triton/pull/4138 / https://github.com/triton-lang/triton/pull/4540

Into PyTorch's triton branch.
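
A hedged usage sketch of the shape combination this enables (assuming the `torch.nn.attention.flex_attention.flex_attention` entry point and a CUDA device; sizes and dtypes are illustrative, not taken from the PR):

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

B, H, S = 2, 4, 256
qk_dim, v_dim = 128, 64  # per the note above, v_dim > qk_dim still needs newer Triton

q = torch.randn(B, H, S, qk_dim, device="cuda", dtype=torch.float16)
k = torch.randn(B, H, S, qk_dim, device="cuda", dtype=torch.float16)
v = torch.randn(B, H, S, v_dim, device="cuda", dtype=torch.float16)

out = torch.compile(flex_attention)(q, k, v)
print(out.shape)  # torch.Size([2, 4, 256, 64]) -- output takes V's head dim
```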

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134043
Approved by: https://github.com/Chillee
2024-08-23 01:06:57 +00:00
7c93c4f8cf [CI][dashboard] Change aarch64 perf run (#134265)
Summary: Reduce the aarch64 dashboard run to only test the default config, until we solve the timeout issue. Also increase the frequency from nightly to 6 times a day, to see if we can reproduce the perf instability Nikita has observed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134265
Approved by: https://github.com/malfet
2024-08-23 00:40:28 +00:00
b3821f1da1 [dynamo][guards][logs] Generate code_parts for debugging (#134181)
Fixes https://github.com/pytorch/pytorch/issues/132692

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134181
Approved by: https://github.com/youkaichao, https://github.com/jansel
ghstack dependencies: #133742, #134016, #134039
2024-08-22 23:40:37 +00:00
edbadc904b Do not broadcast uniqueId during a split (#133962)
When using split, we do not need to exchange the NCCL uniqueID at all.
This would avoid connecting to the TCPStore on each split operation.
@exported-using-ghexport

Differential Revision: [D60966980](https://our.internmc.facebook.com/intern/diff/D60966980/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133962
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #133960, #133961
2024-08-22 23:23:32 +00:00
b2eb0e8c6a docker: Use miniforge, install from pip (#134274)
Switch installation of the pytorch package to our download.pytorch.org sources, which are better maintained.

As well, switch the miniconda installation over to a miniforge installation in order to ensure backwards compatibility for users expecting to have the conda package manager installed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134274
Approved by: https://github.com/malfet, https://github.com/atalman

Co-authored-by: atalman <atalman@fb.com>
2024-08-22 23:20:22 +00:00
30d7e7a1cd [XPU] Fix patch for old llvm package error for triton xpu (#134204)
Fixes #134199

PR #133694 works around the issue by replacing the string `"https://tritonlang.blob.core.windows.net/llvm-builds/"` with `"https://oaitriton.blob.core.windows.net/public/llvm-builds/"` in `triton/python/setup.py`. However, in a [newer version of Triton](06e6799f4e), it has already been changed to `"https://oaitriton.blob.core....` and doesn't need to be replaced; previously, that case would throw a runtime error.

This PR makes the `check_and_replace` logic not fail in such a scenario. Both the old link and the newer link now work.

Also note that the `.ci/docker/common/install_triton.sh` does not need the fix, because its `sed` command won't be in effect if there is no such pattern.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134204
Approved by: https://github.com/chuanqi129, https://github.com/EikanWang, https://github.com/atalman
2024-08-22 23:18:44 +00:00
629bd6f718 Update FlexAttention with masking semantic (#133373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133373
Approved by: https://github.com/yanboliang
2024-08-22 22:50:33 +00:00
e7929809f3 [c10d][ez] Add comments to CudaEventCache class (#134172)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134172
Approved by: https://github.com/d4l3k, https://github.com/kwen2501
2024-08-22 22:44:12 +00:00
b319fa3fd9 [ONNX] Opt into ruff fmt (#134120)
Add ONNX directory to use ruff format.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134120
Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007
2024-08-22 22:44:03 +00:00
25499de814 Remove ncclIdToCommMap_. (#133961)
There is no purpose for this map structure, and it is incorrect in
some cases, for example when the uniqueID is not broadcast to the
other processes.
@exported-using-ghexport

Differential Revision: [D60966882](https://our.internmc.facebook.com/intern/diff/D60966882/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133961
Approved by: https://github.com/shuqiangzhang
ghstack dependencies: #133960
2024-08-22 22:06:25 +00:00
b0cf287b46 [export][training ir migration] Fix getitem not exist (#134259)
Summary:
Make quantization tests compatible with the new training IR.

With the new batch norm node `torch.ops.aten.batch_norm.default`, we don't need an additional getitem node after the bn node, so tests need to be fixed to not check for the getitem node.

We added a capture_pre_autograd_graph_using_training_ir() function, which returns True when we are using the training ir, and False otherwise. This way, the code supports both training ir and the old ir.

For now, we are just rolling out the training ir for fbcode internal tests.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_preserve_source_fn_stack
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_update_shared_qspec
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_conv2d
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_conv_bn_relu_fusion

buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_conv_bn_fusion
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_conv_bn_fusion_literal_args
```

Reviewed By: andrewor14, tugsbayasgalan

Differential Revision: D61292102

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134259
Approved by: https://github.com/tugsbayasgalan
2024-08-22 22:00:14 +00:00
f0ba309d78 [CI][dashboard] Add jemalloc back for aarch64 (#134189)
Forward fix based on https://github.com/pytorch/pytorch/pull/133997#discussion_r1726004220
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134189
Approved by: https://github.com/malfet, https://github.com/huydhn
2024-08-22 21:08:39 +00:00
1b6bbaa016 Remove PMI dependencies in PyTorch (#133960)
This patch makes two changes:
1. Whenever ncclCommSplit accepts groupRanks in its config, we should
populate it.  This is independent of using PMI or not.  For example,
non-PMI NCCL can also use this information, if it chooses to.
2. Provide a user flag to decide when to do a uniqueId broadcast and
when to skip it.  This is a performance optimization, and not a
correctness requirement.  If the user forgets to set this, we will
do the uniqueId broadcast, which is wasteful (because it will be
ignored by NCCL), but not incorrect.
@exported-using-ghexport

Differential Revision: [D60966774](https://our.internmc.facebook.com/intern/diff/D60966774/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133960
Approved by: https://github.com/shuqiangzhang
2024-08-22 20:34:43 +00:00
ff61f55387 [Dynamo][autograd.Function] Supports ctx.set_materialize_grads (#133978)
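A hedged sketch of the user-level pattern this lets Dynamo trace; the autograd.Function below is illustrative, not code from the PR:

```python
import torch

class MulTwo(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.set_materialize_grads(False)  # unused outputs get None grads in backward
        return x * 2, x + 1

    @staticmethod
    def backward(ctx, grad_main, grad_aux):
        gx = grad_main * 2
        if grad_aux is not None:  # only materialized if the second output was used
            gx = gx + grad_aux
        return gx

@torch.compile
def f(x):
    main, _aux = MulTwo.apply(x)
    return main.sum()

x = torch.randn(4, requires_grad=True)
f(x).backward()
print(x.grad)  # tensor([2., 2., 2., 2.])
```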
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133978
Approved by: https://github.com/zou3519
2024-08-22 20:06:17 +00:00
5633773188 Convert various jobs to be Linux Foundation fleet compatible (#134128)
Migrates a batch of workflows over to LF
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134128
Approved by: https://github.com/zxiiro, https://github.com/jeanschmidt
2024-08-22 19:23:07 +00:00
0eb9c870fd [reland][ROCm] TunableOp for gemm_and_bias (#128919)
Reland of #128143 but added `alpha` and `bias` initialization to `launchTunableGemmAndBias`

Thus far TunableOp was implemented for gemm, bgemm, and scaled_mm. gemm_and_bias was notably missing. This PR closes that gap.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128919
Approved by: https://github.com/malfet
2024-08-22 18:27:50 +00:00
978c5a80a0 [export][training ir migration] fix batch norm pattern match in quantization (#134157)
Summary:
In the new training ir, we produce `torch.ops.aten.batch_norm.default` instead of `torch.ops.aten._native_batch_norm_legit.default` or `torch.ops.aten._native_batch_norm_legit_no_training.default`.

So we need to change the pattern match to accomodate the new op.

- Add `torch.ops.aten.batch_norm.default` to the pattern matcher list so it's identified as a batch norm node
- `torch.ops.aten.batch_norm.default` doesn't have a getitem user anymore, so when removing the bn node we need to do `bn_node.replace_all_uses_with(conv_node)` instead of `getitem_node.replace_all_uses_with(conv_node)` (see the sketch below)

The behavior of capture_pre_autograd_graph is consistent for each run.

If the run is a fbcode test, then capture_pre_autograd_graph uses training IR. This means both _get_aten_graph_module_for_pattern and  replace_pattern_with_filters see the same training IR.

If the run is not a fbcode test, then both would see the old IR.
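
A small illustrative sketch (not the PR's code) of handling both IRs when rewiring a conv-bn pattern in an FX graph, as described above:

```python
import operator
import torch

def rewire_bn_to_conv(conv_node, bn_node):
    if bn_node.target is torch.ops.aten.batch_norm.default:
        # New training IR: batch_norm returns the tensor directly, no getitem user.
        bn_node.replace_all_uses_with(conv_node)
    else:
        # Old IR: _native_batch_norm_legit* returns a tuple; output 0 flows through getitem.
        for user in list(bn_node.users):
            if user.target is operator.getitem and user.args[1] == 0:
                user.replace_all_uses_with(conv_node)
```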

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_conv2d_binary2
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_conv2d_unary
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_linear_unary
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_dynamic_quant_linear
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_qat_dynamic_quant_linear
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_flatten_recipe
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r test_linear_unary
```

Reviewed By: andrewor14, tugsbayasgalan

Differential Revision: D61291077

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134157
Approved by: https://github.com/tugsbayasgalan
2024-08-22 18:25:45 +00:00
fee677eeb6 [fbode-testing][dynamo][reland][inline-inbuilt-nn-modules] Mark attri… (#134136)
Shuai wants to test this internally before https://github.com/pytorch/pytorch/pull/133713 can go in. Creating a separate PR for ghimport.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134136
Approved by: https://github.com/yanboliang
2024-08-22 17:54:58 +00:00
8f7d66f0c3 Enable dynamic rollout for Linux binary workflows (#131472)
Enables dynamic migration of jobs to the LF AWS account for binary workflows.

The new runners are only given to people specified in this issue: pytorch/test-infra#5132

This closes pytorch/ci-infra#251.

Depends-On: pytorch/pytorch#132870
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131472
Approved by: https://github.com/ZainRizvi
2024-08-22 17:12:50 +00:00
d95aedf5fd [BE] typing for decorators - fx/_compatibility (part 1) (#134202)
Part of #134054.

This corresponds to the pytorch mypy changes from D61493706. Updating takes so
long and touches so many files that it's impossible to land as a whole without conflicting with some other intermediate change.
So landing these 'type: ignore' for pytorch in advance of them actually being needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134202
Approved by: https://github.com/Skylion007
2024-08-22 17:07:33 +00:00
44fa9f991c [NJT] add aten.to.dtype support (#134164)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134164
Approved by: https://github.com/davidberard98
2024-08-22 16:59:38 +00:00
b6abac68ec [BE][dynamo] reorganize polyfill module hierarchy (#133977)
Changes:

1. Move `polyfill.py` -> `polyfills/__init__.py`. It can be used as `polyfill.xxx` -> `polyfills.xxx`.
2. Move submodule loading from `polyfills/__init__.py` to `polyfills/loader.py`.

This merges the `polyfill.py` module and the `polyfills/` package. Each polyfill module has its own namespace for better code organization.

The ultimate goal is to make `polyfills/__init__.py` empty and move all polyfill functions into their own namespaces.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133977
Approved by: https://github.com/jansel
2024-08-22 16:42:29 +00:00
c95ddd4bf2 [dynamo] ensure polyfill function has the same signature as the original function in substitute_in_graph (#133813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133813
Approved by: https://github.com/jansel
2024-08-22 16:38:06 +00:00
240467adfe [fx] Implement deepcopy for Proxy (#133706)
Summary: When deepcopying a proxy, we first try the default deepcopy behavior.

Test Plan: buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:fx -- -r  proxy_deepcopy

Differential Revision: D61398418

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133706
Approved by: https://github.com/angelayi
2024-08-22 16:37:30 +00:00
b0171c3920 Revert "[ONNX] Opt into ruff fmt (#134120)"
This reverts commit 0870398fa8c3e097640f31cb8a8e2e2d3e522d33.

Reverted https://github.com/pytorch/pytorch/pull/134120 on behalf of https://github.com/albanD due to Breaks main branch lint ([comment](https://github.com/pytorch/pytorch/pull/134120#issuecomment-2305089756))
2024-08-22 15:48:14 +00:00
828ab84e19 Improve error msg on _lazy_init() error (#134159)
Reviewed By: hanzlfs

Differential Revision: D61627609

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134159
Approved by: https://github.com/hanzlfs
2024-08-22 15:10:50 +00:00
3c5485fb7f [Retry] Log chromium events to scuba (#134118)
Summary:
This diff implements a bunch of views for internal scuba viewing.

TODOS that I might punt to another diff:
- Saving cache stats via counter is definitely sus here, but there's not really a good way to track "fx graph cache hit for this compile phase" right now. Will think about this more.
- We should definitely log frame id, compile id, etc
- We should definitely be logging configs. That way, we can A/B test based on whether a config is turned on.
- idk what I'm doing with compile_uuid yet, but it's useful when you want to look at samples for a single run. I think if we had mast job info this field is not needed, but it's nice to be able to drill down to a single run and get its chrome trace view or icicle view, so idk

Test Plan:
All of the above views are run with nanogpt benchmark:

```
buck run mode/opt caffe2/benchmarks/dynamo:torchbench -- --training --backend=inductor --only nanogpt --performance
```

Differential Revision: D61603243

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134118
Approved by: https://github.com/oulgen
2024-08-22 14:59:45 +00:00
1b10a5c652 Allow SymInts and SymFloats as other in div_softmax_pattern (#133989)
Fixes https://github.com/pytorch/pytorch/issues/133759

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133989
Approved by: https://github.com/ezyang
2024-08-22 14:36:01 +00:00
afc2615d33 Add proper casting to fuse_linear_bn_weights (#134105)
As per title, this PR adds proper casting to fuse_linear_bn_weights in the same style as the conv case above. This previously caused numerical issues on my end, so that is why I am fixing it.

Also cleans up the docstring.
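
For reference, a minimal sketch of linear+BN weight fusion with the math done in float32 before casting back (parameter names here are assumptions, not the utility's exact signature):

```python
import torch

def fuse_linear_bn(w, b, bn_rm, bn_rv, bn_eps, bn_w, bn_b):
    # Fold BN statistics into the linear weight/bias in float32, then cast back.
    out_dtype = w.dtype
    scale = bn_w.float() / torch.sqrt(bn_rv.float() + bn_eps)
    fused_w = w.float() * scale.unsqueeze(-1)
    fused_b = (b.float() - bn_rm.float()) * scale + bn_b.float()
    return fused_w.to(out_dtype), fused_b.to(out_dtype)
```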
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134105
Approved by: https://github.com/mikaylagawarecki
2024-08-22 14:26:12 +00:00
b459ca78eb [NJT]Add unit tests that cover the internal use cases using new NJT API (#133513)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133513
Approved by: https://github.com/davidberard98, https://github.com/soulitzer
2024-08-22 13:54:40 +00:00
1a7e8e5780 Revert "Update FlexAttention with masking semantic (#133373)"
This reverts commit 5a7b544e5c3e37bea62c6a231f6230c004a33d38.

Reverted https://github.com/pytorch/pytorch/pull/133373 on behalf of https://github.com/jeanschmidt due to Broke internal test/inductor signals, see D61611729 ([comment](https://github.com/pytorch/pytorch/pull/133373#issuecomment-2304714503))
2024-08-22 13:47:26 +00:00
88c973005d Revert "[FlexAttention] Enable different qk and v head-dims (#134043)"
This reverts commit e847b6bb9ba281b0db83fcdd79c328252403e9e8.

Reverted https://github.com/pytorch/pytorch/pull/134043 on behalf of https://github.com/jeanschmidt due to Need to revert, in order to be able to revert https://github.com/pytorch/pytorch/pull/133373, feel free to reland this after solving conflicts ([comment](https://github.com/pytorch/pytorch/pull/134043#issuecomment-2304708996))
2024-08-22 13:44:17 +00:00
83b5d449a3 Add full float16/bfloat16 support to MaxUnPool (#133774)
It already supported half so might as well add bfloat16 support for parity
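
A small usage sketch of the kind of call this enables (shapes and dtype are illustrative):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.randn(1, 3, 8, 8, dtype=torch.bfloat16)
pooled, indices = pool(x)
restored = unpool(pooled, indices, output_size=x.shape)
print(restored.dtype, restored.shape)  # torch.bfloat16 torch.Size([1, 3, 8, 8])
```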

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133774
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-08-22 13:34:43 +00:00
c9c84ae3ee [BE][Ez]: Update CUDNN_frontend submodule to 1.6.1 (#134007)
Update cudnn_frontend submodule to 1.6.1 to patch some minor bugfixes and compiler fixes.
# Bug fix
* Fixed an issue where custom dropout mask was not correctly applied.
* Added -fvisibility=hidden for the pip wheels generated to avoid symbol conflicts with other modules that use cudnn frontend.
* Fixed an issue in sdpa operation which when deserialized will lead to numerical mismatches.
* Fixed an issue in sdpa fp8 fprop operation (in inference mode).
# Samples
* Added a new sample to showcase how a custom dropout mask can be applied to a sdpa operation.
* Added a sample to showcase convolutions on large (c * d * h * w > 2 **31) tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134007
Approved by: https://github.com/eqy
2024-08-22 13:34:17 +00:00
108a75b454 [PP] Add ZeroBubble schedule (#133467)
Zero bubble can be expressed through `ScheduleFlexibleInterleaved1F1B` by setting `enable_zero_bubble=True`. But instead of having to include this flag in schedule initialization, we should create a separate ZeroBubbleSchedule and also transition `Interleaved1F1B` to derive from `ScheduleFlexibleInterleaved1F1B`. Then we don't need to expose `ScheduleFlexibleInterleaved1F1B`, since the naming is not obvious.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133467
Approved by: https://github.com/wconstab
ghstack dependencies: #132691
2024-08-22 13:32:15 +00:00
cedfac20c7 Revert "[SymmetricMemory] introduce multicast support, multimem_all_reduce_ and multimem_one_shot_all_reduce (#133424)"
This reverts commit 66d3eb783c3b3d7087988dd29bfb619b7f4306b7.

Reverted https://github.com/pytorch/pytorch/pull/133424 on behalf of https://github.com/jeanschmidt due to Broke internal ADS builds, see D61611517 ([comment](https://github.com/pytorch/pytorch/pull/133424#issuecomment-2304676328))
2024-08-22 13:29:27 +00:00
592a172910 [FSDP2] Resolved strided sharding todo in clipping tests (#134152)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134152
Approved by: https://github.com/XilunWu, https://github.com/weifengpy, https://github.com/wz337
2024-08-22 12:45:13 +00:00
4c645c04d8 Fix type of get_raw_stream (#134187)
Just something I noticed while implementing a new DeviceInterface

I had to add `# type: ignore[assignment]` because mypy thinks
DeviceInterface.get_raw_stream is a `Callable` and therefore
incompatible with a `staticmethod`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134187
Approved by: https://github.com/jansel
2024-08-22 12:00:08 +00:00
5fb8754434 [inductor] write cpp code with encoding utf-8 (#134027)
Windows is different from Linux: each Windows version with a different language pack has a different code page. Inductor on Windows writes the generated cpp code using that code page, which can fail on characters it cannot decode.

For this situation, Microsoft suggests using Unicode instead of a specific code page. Ref: https://learn.microsoft.com/en-us/windows/win32/intl/code-page-identifiers

Changes:
1. Use `utf-8` as the encoding for cpp code.
2. Only the encoding for cpp code changes, not for binary data; the binary type is for the AoT binary context.

It works on https://github.com/pytorch/pytorch/issues/122094#issuecomment-2299592942.
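
In spirit, the change amounts to something like this hedged sketch (the helper name is illustrative, not Inductor's actual function):

```python
from pathlib import Path

def write_generated_cpp(path: str, source_code: str) -> None:
    # Always write generated C++ sources as UTF-8 instead of the platform code page.
    Path(path).write_text(source_code, encoding="utf-8")
```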

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134027
Approved by: https://github.com/desertfire, https://github.com/jgong5, https://github.com/jansel
2024-08-22 11:54:32 +00:00
aea1148d56 [fp8 rowwise] Clarify dtypes (#134114)
Disambiguate some of the dtypes (e.g., for the scales), move the "constant" ones out of the function, and use safe casting functions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134114
Approved by: https://github.com/drisspg
ghstack dependencies: #134110, #134111, #134112, #134113
2024-08-22 11:07:39 +00:00
72586ccd14 [fp8 rowwise] Don't build separate kernel for no bias (#134113)
CUTLASS automatically skips a stage in the epilogue if we provide a nullptr. Thus, instead of building a special kernel for bias=None, we can reuse one of the other ones.

This also considerably simplifies the code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134113
Approved by: https://github.com/drisspg
ghstack dependencies: #134110, #134111, #134112
2024-08-22 11:07:39 +00:00
d64fa11095 [fp8 rowwise] Fix bias calculation being done in low precision (#134112)
The compute dtype for the bias addition was set to ElementBias. Thus, for a bf16 bias, we would cast the fp32 accum to bf16 and _then_ add the bias. It is however (slightly?) more accurate to first add the bias in fp32 and only cast at the end.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134112
Approved by: https://github.com/drisspg
ghstack dependencies: #134110, #134111
2024-08-22 11:07:34 +00:00
15faed60ca [fp8 rowwise] Make schedule selection more readable (#134111)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134111
Approved by: https://github.com/drisspg
ghstack dependencies: #134110
2024-08-22 11:07:30 +00:00
b8ea5b01c9 [fp8 rowwise] Allocate workspace as a PyTorch Tensor (#134110)
This makes us pass through the CUDA caching allocator which is safer e.g. in case of CUDA graphs.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134110
Approved by: https://github.com/drisspg
2024-08-22 11:07:26 +00:00
cyy
4c8193b8f0 [14/N] Fix clang-tidy warnings in aten/src/ATen (#132733)
Follows #133807

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132733
Approved by: https://github.com/ezyang
2024-08-22 10:09:15 +00:00
90c821814e SparseCsrCUDA: cuDSS backend for linalg.solve (#129856)
This PR switches to the cuDSS library and has the same purpose as #127692, which is to add Sparse CSR tensor support to linalg.solve.
Fixes #69538

Minimal example of usage:
```
import torch

if __name__ == '__main__':
    spd = torch.rand(4, 3)
    A = spd.T @ spd
    b = torch.rand(3).to(torch.float64).cuda()
    A = A.to_sparse_csr().to(torch.float64).cuda()

    x = torch.linalg.solve(A, b)
    print((A @ x - b).norm())

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129856
Approved by: https://github.com/amjames, https://github.com/lezcano, https://github.com/huydhn

Co-authored-by: Zihang Fang <zhfang1108@gmail.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
2024-08-22 07:57:30 +00:00
64cfcbd8a3 Tune _int_bsr_dense_addmm for int8 inputs on A100 (#134035)
As in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134035
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #133855
2024-08-22 06:43:11 +00:00
b7baa062fc Update torch-xpu-ops pin (ATen XPU implementation) (#133850)
Bug fixes for PyTorch 2.5:
1. Use the SYCL group algorithm API instead of the old style for sub-group shift utilities.
2. Add preprocessing in the reduction kernel for cases requiring a data type cast.
3. Make group norm memory-format compatible.
4. ZeroTensor: a. Remove unnecessary aten operator registrations, or the ZeroTensor process is bypassed. b. Align preprocessing with the in-tree implementation in aten::copy_.
5. Rebase checkIndexTensorTypes usage.
6. Align with the latest semantics of PyTorch foreach operators: return multiple tensors with offset=0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133850
Approved by: https://github.com/EikanWang
2024-08-22 06:27:03 +00:00
cdb9c7d228 Add support for using privateuse1 backend name in instantiate_device_type_tests() (#133082)
As you can see, 'privateuse1' appears many times in out-of-tree extension codebases. I think that everything about the device type should be the same as for other in-tree backends after registering the privateuse1 backend.

For example, after registering a privateuse1 backend named "foo", you should allow "foo" to be passed in as a valid device type.

```diff
- instantiate_device_type_tests(TestIndexing, globals(), only_for='privateuse1')
- instantiate_device_type_tests(NumpyTests, globals(), only_for='privateuse1')
+ instantiate_device_type_tests(TestIndexing, globals(), only_for='foo')
+ instantiate_device_type_tests(NumpyTests, globals(), only_for='foo')
```

> https://github.com/Ascend/pytorch/blob/master/test/test_indexing.py#L1654-L1655

The change is to map the privateuse1 backend name to 'privateuse1' when calling `filter_desired_device_types()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133082
Approved by: https://github.com/albanD
2024-08-22 06:17:21 +00:00
24c2dd2002 Migrate fuse_chunk_reshape_concat_pass to PT2 (#134026)
Summary:
This is part of the work of dper pass migration https://fburl.com/gdoc/wxwykxns
This pass has ~2.4% perf impact for adfinder_reels_ctr_model

Test Plan: Still in test

Differential Revision: D60789747

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134026
Approved by: https://github.com/huxintong
2024-08-22 06:13:52 +00:00
938f37b745 Added batching rule for sdpa_math, sdpa_efficient_attention forward, cudnn, and flash attention (#133964)
Fixes https://github.com/pytorch/pytorch/issues/117016, https://github.com/pytorch/pytorch/issues/102457, https://github.com/pytorch/pytorch/issues/110525, https://github.com/pytorch/pytorch/issues/108065,

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133964
Approved by: https://github.com/Skylion007
2024-08-22 05:29:49 +00:00
e2ff094008 [inductor] calibration inductor windows uts (1/N) (#134033)
Changes:
1. Re-open fixed UTs.
2. Mark skiped reasons for failed UTs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134033
Approved by: https://github.com/jansel
2024-08-22 05:21:28 +00:00
0d7ac1966a kill sharing of constraints (#134045)
Summary:
Previously, reuse of the same `Dim` was encoded by "sharing" internal constraints among constraint targets. This kind of sharing, implemented using `shared` fields between `_Constraint`s, was originally motivated by `dynamic_dim`, specifically to support `==` between `dynamic_dim`s, but we no longer need to maintain this overcomplicated structure: we can simply use names of `Dims` to directly encode sharing information.

Thus this PR vastly simplifies the structure of `_Constraint` by removing `shared` fields. As a result, both `_Constraint` and its moral subclass, `_DerivedConstraint`, are 1-1 with `Dim` and its moral subclass, `DerivedDim`.

Note that this will break `==` over `dynamic_dim`, so an immediate follow-up will be to remove `dynamic_dim` entirely from our public API. (It's been more than 6 months since the deprecation warning anyway.) I just didn't want to deal with that process in the same PR.

Test Plan: existing

Differential Revision: D61559413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134045
Approved by: https://github.com/pianpwk
2024-08-22 04:40:47 +00:00
de06345e9b Avoid Host & Device Sync In LR Scheduler (#133663)
Fixes #133662.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133663
Approved by: https://github.com/janeyx99, https://github.com/eqy

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2024-08-22 03:52:43 +00:00
e847b6bb9b [FlexAttention] Enable different qk and v head-dims (#134043)
# Summary
Adds the option for the head dims to be different between QK and V tensors.

Fixes issue: https://github.com/pytorch/pytorch/issues/133674

V_DIM > QK_DIM is blocked by landing: https://github.com/triton-lang/triton/pull/4138 / https://github.com/triton-lang/triton/pull/4540

Into PyTorch's triton branch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134043
Approved by: https://github.com/Chillee
2024-08-22 03:42:17 +00:00
7868b65c4d [Dynamo] Support dict.setdefault (#134083)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134083
Approved by: https://github.com/williamwen42
2024-08-22 01:57:33 +00:00
7b20514f8e [export] Device remapping in export (#133660)
Implemented `move_to_device_pass()` function in `torch._export.passes`.

The user has to explicitly call this method to move the exported program from one torch.device to another one.

Fixes https://github.com/pytorch/pytorch/issues/121761
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133660
Approved by: https://github.com/angelayi
2024-08-22 01:03:35 +00:00
df467f8746 [CI] Do not set Intel OMP for aarch64 (#133997)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133997
Approved by: https://github.com/angelayi
2024-08-22 00:55:46 +00:00
6bddfb9546 [FSDP2] Add cache for FSDP wrapper class (#134135)
Currently, `fully_shard` will create a new `FSDPMyModuleClass` class for each `MyModuleClass` module **object**, which causes Dynamo to guard-fail on every module object's type checking. This PR fixes the issue by caching and reusing previously created FSDP wrapper class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134135
Approved by: https://github.com/awgu
2024-08-22 00:41:30 +00:00
2a73ba298c Upgrade submodule oneDNN to v3.5.3 (#131620)
This PR upgrades the oneDNN submodule to v3.5.3.

## Improvements

- [experimental] Introduced [microkernel API](https://oneapi-src.github.io/oneDNN/ukernels.html) for Intel Architecture Processors. This API exposes internal mechanisms used in matmul and convolution implementation to expert users.
- Improved performance of matmul primitive with sum post-op for batched cases on processors with Intel AMX instruction set support.
- Introduced fp64 matmul support. This functionality is currently implemented on Intel GPUs with hardware acceleration for fp64 math only.

## Validation results on CPU
No regression was found.

1. NLP models accuracy/inference/training

Model Name | Mode Name | Precision | OneDNN | Baseline | OneDNN/Baseline
-- | -- | -- | -- | -- | --
bert-large | realtime | bf16 | 192.498 | 189.664 | 1.014942214
bert-large | throughput | bf16 | 202.424 | 202.156 | 1.001325709
bert-large | train_phase2 | bf16 | 15.955 | 16.029 | 0.995383368
LCM | throughput | bf16 | 1.01983 | 1.06632 | 0.956401455
stable-diffusion | throughput | bf16 | 0.10313 | 0.10184 | 1.012666929
ViT | realtime | bf16 | 1086.48 | 928.43 | 1.17023362
ViT | throughput | bf16 | 1419.07 | 1393.81 | 1.018122987
yolov7 | realtime | bf16 | 413.468682 | 415.16503 | 0.995914039
yolov7 | throughput | bf16 | 369.697 | 366.789 | 1.007928264
bert-large | realtime | fp32 | 46.685 | 46.652 | 1.000707365
bert-large | throughput | fp32 | 47.766 | 48.007 | 0.994979899
bert-large | train_phase2 | fp32 | 7.101 | 7.104 | 0.999577703
LCM | throughput | fp32 | 0.5501 | 0.55023 | 0.999763735
stable-diffusion | throughput | fp32 | 0.04012 | 0.04002 | 1.002498751
ViT | realtime | fp32 | 337.27 | 335.19 | 1.006205436
ViT | throughput | fp32 | 346.52 | 350.08 | 0.989830896
yolov7 | realtime | fp32 | 107.138054 | 107.242747 | 0.999023775
yolov7 | throughput | fp32 | 103.383 | 104.301 | 0.99119855
bert-large | realtime | int8 | 283.541 | 289.569 | 0.979182855
LCM | throughput | int8 | 1.09864 | 1.08998 | 1.0079451
stable-diffusion | throughput | int8 | 0.10617 | 0.10604 | 1.001225952
ViT | realtime | int8 | 1562.11 | 1554.68 | 1.004779119
ViT | throughput | int8 | 1904.38 | 1903.39 | 1.000520125
yolov7 | realtime | int8 | 540.489493 | 539.902488 | 1.001087243
yolov7 | throughput | int8 | 499.999 | 500.757 | 0.998486292

Device | Dtype | Geomean Higher is better
-- | -- | --
All | all | 101.17%
All | fp32 | 99.83%
All | bf16 | 102.24%
All | int8 | 99.91%
All | fp16 | 103.61%
SPR | all | 100.54%
SPR | fp32 | 99.82%
SPR |bf16 | 101.78%
SPR |int8 | 99.90%
GNR | all | 101.58%
GNR | fp32 | 99.85%
GNR | bf16 | 102.66%
GNR | int8 | 99.93%
GNR | fp16 | 103.61%

2. Torchbench cpu userbenchmark inference & training

Perf_Geomean | Ratio (oneDNN/baseline)
-- | --
eager_throughtput_bf16_infer | 1.00x
eager_throughtput_fp32_infer | 1.00x
jit_llga_throughtput_amp_bf16 | 1.00x
jit_llga_throughtput_fp32 | 1.00x
eager_throughtput_fx_int8 | 0.99x
eager_throughtput_bf16_train | 1.01x
eager_throughtput_fp32_train | 1.00x

3. Inductor quantization

Static quant:
Perf_Geomean | Ratio (oneDNN/baseline)
-- | --
PTQ | 1.00x
PTQ_CPP_WRAPPER | 1.00x
QAT | 1.00x

ACC_Geomean | Ratio (oneDNN/baseline)
-- | --
PTQ | 1.00x
PTQ_CPP_WRAPPER | 1.00x
QAT | 1.00x

Dynamic quant:

  | Ratio (oneDNN/baseline)
-- | --
Performance | 1.04x
Accuracy | 1.00x

4. Dynamo benchmarks
GEOMEAN summary
![image](https://github.com/user-attachments/assets/82fc4b76-50f6-4f06-9ba9-034b932f1158)

FP32 Static shape, default wrapper
![image](https://github.com/user-attachments/assets/9335268e-3e99-426b-91f8-f9df90a2007c)

FP32 Dynamic shape, default wrapper
![image](https://github.com/user-attachments/assets/e7cf3f4f-2a62-4b58-9461-5e5ba254d822)

AMP Static shape, default wrapper
![image](https://github.com/user-attachments/assets/12392c88-e44f-4c95-904a-4fa5fc9f34a2)

AMP Dynamic shape, default wrapper
![image](https://github.com/user-attachments/assets/13930b0d-9bb2-46de-9ecb-5d2585d5c2f6)

## Validation results on XPU
Category | Eager | Inductor
-- | -- | --
huggingface_amp_fp16_training | 1.002456 | 0.999998
huggingface_bfloat16_inference | 1.005386 | 1.003511
huggingface_float32_training | 1.002533 | 1.003098
torchbench_amp_fp16_training | 1.009065 | 1.01323
torchbench_bfloat16_inference | 1.003371 | 1.001534
torchbench_float32_training | 1.012102 | 1.011596
timm_models_amp_fp16_training | 1.005511 | 1.010329
timm_models_bfloat16_inference | 1.000935 | 1.000538
timm_models_float32_training | 0.991873 | 0.99721

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131620
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-08-21 23:40:02 +00:00
5f0bd98767 Increase max total number of dynamo partitions to 15 (#134153)
Needed to be able to split some of the aarch64 workflows to 15 shards

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134153
Approved by: https://github.com/seemethere, https://github.com/kit1980, https://github.com/ZainRizvi
2024-08-21 23:10:12 +00:00
a5ef04a3b8 add relevant function (#133946)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133946
Approved by: https://github.com/ezyang
2024-08-21 23:04:59 +00:00
8604c0a150 [inductor] Fix needs_fixed_stride_order silent incorrectness (#133639)
Fixes #128084

The approach is option 2 of what Elias suggested in the comment
thread:
- We require tensors to have the correct stride at usage. This may
  involve a clone; if there was a clone and then a mutation into it
  then we copy_ back the result of the mutation.

The reason why I went this approach was because it was the easiest and
Inductor already works really hard to remove additional clones/copy_.

There are some cases that this doesn't generate efficient code for; for
example, if the tensor is a view, we don't change the base of the view
to have the right stride order, instead we do a clone.
The view case isn't very common so I'm ignoring it for now but we could
improve this in the future.

Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133639
Approved by: https://github.com/eellison
2024-08-21 22:54:16 +00:00
d2204d4f0f Remove skip ci recommendation (#134134)
Using `skip ci` is no longer a recommended practice.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134134
Approved by: https://github.com/soulitzer
2024-08-21 22:42:25 +00:00
255cd75a97 [sparse] Add cuSPARSELt as a backend (#128534)
Summary:

This PR adds in cuSPARSELt as a backend to PyTorch.

It is now possible to see if cuSPARSELt is available and the version if
it is with
```
torch.backends.cusparselt.is_available()
torch.backends.cusparselt.version()
```

Test Plan:
```
python test/test_sparse_semi_structured.py -k test_cusparselt_backend
```

Reviewers:

Subscribers:

Tasks:

Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128534
Approved by: https://github.com/cpuhrsch, https://github.com/eqy, https://github.com/syed-ahmed
2024-08-21 22:06:07 +00:00
0870398fa8 [ONNX] Opt into ruff fmt (#134120)
Add ONNX directory to use ruff format.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134120
Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007
2024-08-21 21:43:55 +00:00
96dfe95ed0 Fix DDPLoadBalancingPlanner docstring (#134044)
Summary:
1. Indentation in chunk function was wrong.
1. The previous logic missed a level of zip.

This diff uses the chunking idiom from the Python `zip` documentation: https://docs.python.org/3/library/functions.html#zip (a small sketch of the idiom follows below).
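
For illustration, a minimal example of that idiom (the list and chunk size here are made up, not taken from the diff):

```python
# Chunk a flat list into fixed-size groups; zip(*[iter(items)] * n) walks the
# same iterator n times per output tuple, so any trailing remainder is dropped.
items = list(range(10))
chunks = list(zip(*[iter(items)] * 3))
print(chunks)  # [(0, 1, 2), (3, 4, 5), (6, 7, 8)]
```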

Test Plan: Run the docstring locally

Differential Revision: D61548758

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134044
Approved by: https://github.com/fegin
2024-08-21 21:28:22 +00:00
5d5a45dc85 [CI][dashboard] Collect Export pass rate separately (#134076)
Summary: Collect Export pass rate separately when running AOTInduction, so that we can have a better isolated signal.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134076
Approved by: https://github.com/angelayi
2024-08-21 21:18:55 +00:00
b3eef3deaf Triple number of shards for aarch64 cpu inductor tests (#134123)
Let's see if this will work.

Alas, other than linting I can only test it after it lands
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134123
Approved by: https://github.com/clee2000
2024-08-21 20:52:23 +00:00
345578afb4 Add int8 support to bsr_dense_addmm and bsr_dense_mm Triton kernels (#133855)
As in the title. In addition, the PR introduces `_int_bsr_dense_addmm`, which is equivalent to `bsr_dense_addmm` except that for int8 inputs the result is an int32 tensor (similar to the existing `_int_mm`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133855
Approved by: https://github.com/cpuhrsch
2024-08-21 20:44:40 +00:00
a3e1416c05 Fix out_tensor device in diag_test.py (#134020)
This benchmark fails if device='cuda' but out_tensor is on cpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134020
Approved by: https://github.com/soulitzer
2024-08-21 20:43:39 +00:00
6c1e2d2462 [easy] Force inline_inbuilt_nn_modules to remove divergence (#134122)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134122
Approved by: https://github.com/williamwen42, https://github.com/mlazos
2024-08-21 20:42:15 +00:00
865facda44 [pytorch] Remove thread naming when torch is imported (#134066)
Fixes #133690

The naming was added in #121170 to allow performance debugging of latency-critical threads. However, the `pt_main_thread` name gets inherited every time a new process or thread is created from the parent one, which defeats the purpose. We need a better way to name the thread that launches kernels on accelerators, but for the time being we can let users name the threads in application code using: `torch.multiprocessing._set_thread_name("insert_name")`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134066
Approved by: https://github.com/soulitzer, https://github.com/d4l3k
2024-08-21 20:34:35 +00:00
1491a61769 Revert "[hop] ban creating hop by directly instantiating HigherOrderOperator. (#133645)"
This reverts commit 696107efcb83f9359aa669ab343c2cfa2a111372.

Reverted https://github.com/pytorch/pytorch/pull/133645 on behalf of https://github.com/ydwu4 due to breaking ci. probably due to land race ([comment](https://github.com/pytorch/pytorch/pull/133645#issuecomment-2302866106))
2024-08-21 19:33:14 +00:00
5fcfccefc6 [export] Migrate capture_pre_autograd_graph to _export_for_training (#132815)
Summary: as title

Test Plan: CI

Differential Revision: D60860909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132815
Approved by: https://github.com/tugsbayasgalan
2024-08-21 19:00:41 +00:00
18aaceb7be Update conda-env-iOS.txt (#134068)
Follow-up to https://github.com/pytorch/pytorch/pull/133814. To fix periodic build failures, update `typing-extensions` to 4.11.0, as 4.10 is missing in conda.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134068
Approved by: https://github.com/wdvr, https://github.com/atalman, https://github.com/Skylion007
2024-08-21 18:47:14 +00:00
84b3f1900a C++ network flow implementation in c10 (#132188)
The functorch partitioners use network flow to split the joint graph into a forward and backward graph. Internally, we've found that upgrading to networkx 2.8.8 (from 2.5) results in some hard-to-debug failures (internal reference: https://fburl.com/workplace/jrqwagdm). And I'm told that there's interest in removing the Python dependency.

So this PR introduces a C++ implementation that mirrors the API provided by networkx. We'll need to add python bindings and do some additional testing to verify correctness.
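
For context, a small example of the kind of networkx min-cut/max-flow call being mirrored (the graph here is made up for illustration):

```python
import networkx as nx

# Build a tiny capacitated graph and compute its minimum s-t cut.
G = nx.DiGraph()
G.add_edge("source", "a", capacity=3.0)
G.add_edge("source", "b", capacity=1.0)
G.add_edge("a", "sink", capacity=2.0)
G.add_edge("b", "sink", capacity=4.0)

cut_value, (reachable, non_reachable) = nx.minimum_cut(G, "source", "sink")
print(cut_value)  # 3.0
```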

Differential Revision: [D61550977](https://our.internmc.facebook.com/intern/diff/D61550977)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132188
Approved by: https://github.com/Chillee
2024-08-21 18:40:54 +00:00
05304f59f0 [Doc] Fix typo in torch/fx/passes/README.md (#134078)
Fix typo, `utis` to `utils`, in the utility name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134078
Approved by: https://github.com/soulitzer, https://github.com/malfet
2024-08-21 18:35:50 +00:00
32e057636c Enable scribe environment for compile-time benchmarks if requested. (#133891)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133891
Approved by: https://github.com/malfet
2024-08-21 18:02:54 +00:00
750d68ff70 Use amazon linux2 for Docker builds, fix build-docker-conda condition (#134116)
1. Switches failing jobs to Amazon Linux 2:
- CUDA, CPU, and ROCm jobs are failing
2. Fix the trigger condition for build-docker-conda to be the same as manywheel and libtorch
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134116
Approved by: https://github.com/ZainRizvi, https://github.com/nWEIdia
2024-08-21 18:01:16 +00:00
696107efcb [hop] ban creating hop by directly instantiating HigherOrderOperator. (#133645)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133645
Approved by: https://github.com/zou3519
ghstack dependencies: #133521
2024-08-21 17:34:21 +00:00
6835f20d20 [HOP] support generating schema for hop (#133521)
Add a way of generating a FunctionSchema from example values because hop's schema varies even for the same hop.

We didn't use torch._C.FunctionSchema because we cannot construct the classes directly (e.g. "__init__" cannot be used for torch._C.FunctionSchema). Also, extending the basic types in C++ does not seem easy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133521
Approved by: https://github.com/zou3519
2024-08-21 17:34:21 +00:00
dd5a7c8397 [PT2] Add a pass to convert stack to unsqueeze cat (#133966)
Summary: so that we can optimize with `fuse_chunk_reshape_unsqueeze_concat_pass`

Test Plan: new UT

Reviewed By: frank-wei

Differential Revision: D61220221

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133966
Approved by: https://github.com/frank-wei
2024-08-21 17:31:26 +00:00
1da3a049da [dynamo][super] Improve handling of getattr on super (#134039)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134039
Approved by: https://github.com/yanboliang
ghstack dependencies: #133742, #134016
2024-08-21 16:50:35 +00:00
3ef1cc8583 [export] Implement common_getitem_elimination pass. (#133618)
Summary:
In export, we generate many redundant getitem nodes branching from the same source, inserted by runtime assertions or other passes. This causes issues for any downstream system that relies on each value being uniquely defined by a single node.

I don't think it hurts to only remove the redundant getitem nodes, so I just added the pass to the ctor.
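
A minimal sketch of what such a pass could look like on a torch.fx graph (illustrative only; the actual pass in export may differ):

```python
import operator
import torch.fx as fx

def dedup_getitem(gm: fx.GraphModule) -> None:
    # Map (source node, index) -> first getitem node seen for that pair,
    # then redirect later duplicates to it and erase them.
    seen = {}
    for node in list(gm.graph.nodes):
        if node.op == "call_function" and node.target is operator.getitem:
            key = (node.args[0], node.args[1])
            if key in seen:
                node.replace_all_uses_with(seen[key])
                gm.graph.erase_node(node)
            else:
                seen[key] = node
    gm.graph.lint()
    gm.recompile()
```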

Test Plan:
rebase on D61256937
```
buck2 run scripts/bearzx:pt2_export_playground
```

Differential Revision: D61351578

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133618
Approved by: https://github.com/tugsbayasgalan
2024-08-21 16:48:24 +00:00
2db28a9611 Revert "[BE]: Update Typeguard to TypeIs for better type inference (#133814)"
This reverts commit bce0caba7804b0787684dbf1f4e1c4d9e3acded5.

Reverted https://github.com/pytorch/pytorch/pull/133814 on behalf of https://github.com/ezyang due to root cause of internal failures not addressed ([comment](https://github.com/pytorch/pytorch/pull/133814#issuecomment-2302466444))
2024-08-21 16:13:34 +00:00
57625bacea [partitioner] Fix must_be_in_backward corner cases (#134002)
Preparation PR for https://github.com/pytorch/pytorch/pull/132638

"must_be_in_backward" fails the partitioner, if partitioner picks this node as saved_values.

The fix is to prevent partitioner to pick those nodes during nodes classification.

It's hard to make a test without making effectful ops in backward "must_be_in_backward", which will be testing this ( https://github.com/pytorch/pytorch/pull/132638 )

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134002
Approved by: https://github.com/bdhirsh
ghstack dependencies: #134003
2024-08-21 15:58:49 +00:00
68425e68fe Revert "[dynamo][reland][inline-inbuilt-nn-modules] Mark attributes of nn mod… (#133714)"
This reverts commit e8d3c4be3629582294b5944754009fae60f42f6d.

Reverted https://github.com/pytorch/pytorch/pull/133714 on behalf of https://github.com/anijain2305 due to fails internally ([comment](https://github.com/pytorch/pytorch/pull/133714#issuecomment-2302171472))
2024-08-21 14:21:06 +00:00
32e052e468 [docs] improve torch.stack example code to be reproducible (#133857)
Improve the sample code so that it produces the expected results after copying and executing it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133857
Approved by: https://github.com/soulitzer
2024-08-21 14:07:02 +00:00
585c049fa3 Fix Extension attribute name in CppExtension example (#134046)
Hi! It seems there's a typo in `CppExtension` example. I think it should say `extra_link_args` instead of `extra_link_flags`. Not that I spent a few hours debugging missing kernels inside a library's fatbin or anything :D.

Please see `Extension` definition inside setuptools:
ebddeb36f7/setuptools/_distutils/extension.py (L62)

Thanks!
Błażej

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134046
Approved by: https://github.com/soulitzer
2024-08-21 13:58:16 +00:00
afaa5fcecb [BE][Ez]: FURB142,FURB92 misc preview fixes (#133880)
Fixes some miscellaneous code quality issues with some refurb rules that have not been enabled yet.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133880
Approved by: https://github.com/soulitzer, https://github.com/malfet
2024-08-21 13:54:51 +00:00
683609c631 Skip cpp_extension test internally (#134011)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134011
Approved by: https://github.com/masnesral
2024-08-21 13:51:05 +00:00
4b1fb3b0ed [PP] pt-native input/weight grad split (#132691)
Add `stage_backward_input` and `stage_backward_weight` functions to compute the input gradients and weight gradients independently.

We still support the `self.dw_builder` argument for a custom backward, but it has become optional. It takes a separate code path and cannot be used in conjunction with the native zero-bubble backward.
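
As a rough illustration of splitting backward into an input-grad phase and a weight-grad phase (this uses plain `torch.autograd.grad` and is not the pipelining implementation itself):

```python
import torch

w = torch.randn(4, 4, requires_grad=True)   # stand-in for a stage's weight
x = torch.randn(2, 4, requires_grad=True)   # stand-in for the stage input
loss = (x @ w).sum()

# Phase 1: only the input gradient, keeping the graph alive for phase 2.
(dinput,) = torch.autograd.grad(loss, x, retain_graph=True)

# Phase 2: the weight gradient, which can be deferred to fill pipeline bubbles.
(dweight,) = torch.autograd.grad(loss, w)
```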

Added tests:
`python test/distributed/pipelining/test_schedule_multiproc.py -k test_schedule_with_native_zero_bubble`
`python test/distributed/pipelining/test_backward.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132691
Approved by: https://github.com/wconstab
2024-08-21 13:37:54 +00:00
2bffbe06bd [Inductor][CPP] Support vectorization of load_seed and randn (#130317)
**Summary**
Enable the vectorization of `load_seed` and `randn`. For now, `randn` is using the reference implementation.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_randn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130317
Approved by: https://github.com/jgong5
ghstack dependencies: #122961
2024-08-21 13:20:43 +00:00
313bc11963 [inductor][cpp] complete vectorization for int32/int64 (#122961)
**Summary**
Implement the complete vectorization of `index_expr` functionally. We also add performance heuristics to resolve the regressions posted below (https://github.com/pytorch/pytorch/pull/122961#issuecomment-2041336265) by disabling vectorization of specific (Fused) scheduler Nodes:

- Heuristic 1: when the num of non-contiguous `index_expr/load/store` exceeds the threshold, we disable the vectorization.
- Heuristic 2: when the total number of elements along the vec dim is less than `tiling_factor/2`, we disable the vectorization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122961
Approved by: https://github.com/jansel

Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
2024-08-21 13:12:38 +00:00
539be0a769 [dynamo] support ClassMethodDescriptorType (#133862)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133862
Approved by: https://github.com/jansel
2024-08-21 12:56:19 +00:00
0d79f67a25 [dynamo][exception] Support raise exception from None (#134028)
Fixes https://github.com/pytorch/pytorch/issues/132362

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134028
Approved by: https://github.com/yanboliang
2024-08-21 12:48:35 +00:00
bd0db490bf [dynamo][set] Fix EQUALS_MATCH guard for constant sets and lists (#134016)
Fixes https://github.com/pytorch/pytorch/issues/133509

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134016
Approved by: https://github.com/laithsakka, https://github.com/jansel
ghstack dependencies: #133742
2024-08-21 12:41:52 +00:00
c929e1e11f [dynamo] fix polyfill for user defined constructor __new__ (#133822)
In `cls->tp_call`, if `cls->tp_new` does not return an instance of class `cls`, then `cls->tp_init` is not called on the new instance.
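
The same behavior is easy to observe from Python (a made-up example to illustrate the CPython rule above):

```python
class Weird:
    def __new__(cls):
        # Returning something that is not an instance of `cls`
        # means CPython skips calling __init__ entirely.
        return 42

    def __init__(self):
        print("never reached")

obj = Weird()
print(obj)  # 42, and __init__ was not called
```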

Related PR:

- #132977

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133822
Approved by: https://github.com/jansel
2024-08-21 12:41:19 +00:00
695291be2f Fix test flakiness due to not resetting state (#134058)
Fixes https://github.com/pytorch/pytorch/issues/133994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134058
Approved by: https://github.com/yanboliang
2024-08-21 11:54:08 +00:00
30dc6338c1 [effects] Prevent inductor dtype promotions for HOP effects tokens (#134003)
Preparation for https://github.com/pytorch/pytorch/pull/132638 and https://github.com/pytorch/pytorch/pull/132755

Inductor promotes argument dtypes to the highest dtype; as a result, an additional token tensor argument with float32 dtype incurred dtype promotion for lower types, e.g. int32.

The solution is to use the lowest dtype for tokens: torch.bool.
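
A quick way to see why `torch.bool` is the safe choice (standard type-promotion behavior, shown here only for illustration):

```python
import torch

# A float32 token would widen integer operands...
print(torch.result_type(torch.tensor(1, dtype=torch.int32),
                        torch.tensor(0.0, dtype=torch.float32)))  # torch.float32

# ...while a bool token sits at the bottom of the promotion lattice
# and leaves the other operand's dtype unchanged.
print(torch.result_type(torch.tensor(1, dtype=torch.int32),
                        torch.tensor(False)))                     # torch.int32
```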

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134003
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2024-08-21 11:42:10 +00:00
19eb14493a [Inductor] Moves intermediary tensors which are constructed on the cpu to XPU when safe, align with CUDA. (#132843)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132843
Approved by: https://github.com/EikanWang, https://github.com/eellison
ghstack dependencies: #132740, #132748
2024-08-21 11:28:09 +00:00
6535f11259 [Inductor] Support _check_triton_bf16_support on XPU. (#132748)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132748
Approved by: https://github.com/EikanWang, https://github.com/eellison
ghstack dependencies: #132740
2024-08-21 11:28:09 +00:00
c2e2602ecd [Inductor] Move GPU_TYPE(The runtime avaliable gpu type, cuda or xpu) from (#132740)
Move GPU_TYPE (the runtime-available GPU type, cuda or xpu) from `testing/_internal/inductor_utils.py` to `_inductor/utils.py`, so that we can use it in Inductor and not only in test cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132740
Approved by: https://github.com/EikanWang, https://github.com/jansel
2024-08-21 11:18:00 +00:00
3d8db41337 Add new op wrapped_quantized_linear (#134024)
Summary:
This diff adds a new operator wrapped_quantized_linear (torch.ops._quantized.wrapped_quantized_linear), which takes the following input arguments: input (in fp32), input_scale, input_zero_point, weight (in fp32), weight_scale, weight_zero_point, bias (in fp32), output_scale, output_zero_point, and out_channel. It does the following:

1. Use quantize_per_tensor(input, input_scale, input_zero_point) to quantize the input tensor to int8
2. Use quantized::linear_prepack(weight, weight_scale, weight_zero_point, bias) to pack the weight and bias
3. Use quantized::linear to perform int8 quantized linear
4. dequantize

This new op is essentially a wrapper around multiple ops. We do this because torch.export cannot handle models that use the old quantize APIs.
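
A rough eager-mode sketch of the four steps above using existing quantized ops (illustrative only; the actual operator's signature and internals may differ, e.g. it takes fp32 weights and packs them itself):

```python
import torch

def wrapped_quantized_linear_ref(x, x_scale, x_zp, w, w_scale, w_zp,
                                 bias, out_scale, out_zp):
    qx = torch.quantize_per_tensor(x, x_scale, x_zp, torch.quint8)   # 1. quantize input
    qw = torch.quantize_per_tensor(w, w_scale, w_zp, torch.qint8)
    packed = torch.ops.quantized.linear_prepack(qw, bias)            # 2. pack weight + bias
    qy = torch.ops.quantized.linear(qx, packed, out_scale, out_zp)   # 3. int8 quantized linear
    return qy.dequantize()                                           # 4. dequantize
```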

Reviewed By: jerryzh168

Differential Revision: D61377266

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134024
Approved by: https://github.com/houseroad
2024-08-21 09:26:58 +00:00
022cd7c9aa [RFC][dynamo] add decorator to register polyfill for unsupported C++ function to avoid graph break (#133712)
Add decorator `torch.compiler.substitute_in_graph` to register a polyfill for an unsupported C++ function and avoid a graph break. This API provides an official way to add Dynamo support for third-party C extensions. Also, it can be used to simplify our implementation of `torch._dynamo.polyfill`.

5ee070266f/torch/_dynamo/variables/builtin.py (L97-L107)

Example:

```python
>>> import operator
>>> operator.indexOf([1, 2, 3, 4, 5], 3)
2

>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
Unsupported: ...

>>> @torch.compiler.substitute_in_graph(operator.indexOf)
... def indexOf(sequence, x):
...     for i, item in enumerate(sequence):
...         if item is x or item == x:
...             return i
...     raise ValueError("sequence.index(x): x not in sequence")

>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133712
Approved by: https://github.com/jansel
2024-08-21 06:36:41 +00:00
843fdf81c2 Fix a getenv segfault due to a race (#133744)
Summary:
* TLDR:

`getenv` is not thread safe w.r.t. `setenv`. Environment variables are kept as a per-process "dictionary" by libc. `setenv` can essentially realloc the whole thing and move this list to a completely different location. If there is a concurrent `getenv` happening at the same time, it may end up reading stale memory and segfault.
`getenv` is thread safe w.r.t. other `getenv` calls.

* Details:

Inside PTD init:
```
ProcessGroupNCCL ctor
	...
	ncclCommWatchdogThread_ =
      std::thread(&ProcessGroupNCCL::ncclCommWatchdog, this); (https://fburl.com/code/terf9ai7)
```

Inside ncclCommWatchdog thread:
```
	...
	ncclHeartbeatMonitorThread_ =
        std::thread(&ProcessGroupNCCL::heartbeatMonitor, this);  (https://fburl.com/code/fv9camg2)
    ...
```

Inside heartbeatMonitor thread:
```
	...
	std::optional<DumpPipe> dumpPipe = std::nullopt; (https://fburl.com/code/qdvahzbu)
	dumpPipe.emplace(rank_);
	...
```

Inside DumpPipe ctor (https://fburl.com/code/wvixlqcz)
```
	getCvarString
		getenv <=== SIGSEGV
```

On the main thread:

We go on to initialize NCCL:

Inside getNCCLComm, we call: `getNcclVersion` -> `initEnv` (https://fburl.com/code/j312pccu)

`initEnv` inside NCCL does this: `initEnv` -> `setEnvFile`

This guy, reads the /etc/nccl.conf file, and sets values of env variables with "setenv" (https://fburl.com/code/cq4r0y0h)
This "setenv" can race with "getenv" in heartbeatMonitor thread.

Ideally, all `setenv` should be done by a single thread before launching other threads. This diff moves getNCCLVersion before launching watchdog thread to make sure all setenvs are done beforehand.

I think we are just getting lucky that we are not hitting it in production. IIRC we did in fact see a getenv segfault once in one of the large-scale runs, but now I don't remember the details.

Test Plan: A lot of testing done as part of D61411062 & CI

Differential Revision: D61421292

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133744
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2024-08-21 06:27:31 +00:00
af664882dd Safely infer device type + docstrings + tests (#133668)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133668
Approved by: https://github.com/eellison
2024-08-21 05:27:31 +00:00
b39ec7fbe9 [1/N] Make NCCL PG error messages more accurate and simpler (#134017)
We did a thorough review of all the error messages we log inside PGNCCL, and we want to make the log messages simpler and more accurate; this is the first PR of this effort.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134017
Approved by: https://github.com/wconstab
2024-08-21 05:21:24 +00:00
66d3eb783c [SymmetricMemory] introduce multicast support, multimem_all_reduce_ and multimem_one_shot_all_reduce (#133424)
### Summary
- Added multicast support to SymmetricMemory. If the CUDA runtime and CUDA driver have multicast support, SymmetricMemory associates all peer buffers with a multicast object and exposes the multicast virtual address.
- Implemented `multimem_all_reduce_` and `multimem_one_shot_all_reduce` based on the multicast support. The two variants show different performance characteristics for different message sizes. We plan to use Inductor for collective algo selection (and the required symmetric memory buffer allocation).

### Benchmark

8xH100 (non-standard version with HBM2e at 650W). NVSwitch V3 with NVLS support.

![image](https://github.com/user-attachments/assets/4998a16b-c2c0-4797-9dd0-1da2303df947)

![image](https://github.com/user-attachments/assets/278ad361-52cb-4864-82c6-bb67e8d0a3fe)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133424
Approved by: https://github.com/yf225, https://github.com/weifengpy
2024-08-21 05:11:21 +00:00
8337b4d96e [training ir migration] Fix ReorderConvertTest (#134010)
Summary:
Change ReorderConvertTest to work with the new `capture_pre_autograd_graph` implementation using D61175223.

Note that now `ReorderConvertTest` doesn't work with the old `capture_pre_autograd_graph` anymore.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/passes/tests:optimize_test -- -r ReorderConvertTest
```

Differential Revision: D61507772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134010
Approved by: https://github.com/tugsbayasgalan
2024-08-21 04:48:43 +00:00
e8fc1e0118 [ONNX] New export logic leveraging ExportedProgram and ONNX IR (#132530)
1/n PR to

- Move code from torch-onnx at commit 395495e566 into torch.onnx and fix imports.
- Integrate the new export logic with the torch.onnx.export API and include basic set of tests.
- Refactor the API for the change.
- Improve documentation.

Next PRs will be more tests and docs.

Fix https://github.com/pytorch/pytorch/issues/129277
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132530
Approved by: https://github.com/titaiwangms, https://github.com/malfet
2024-08-21 01:08:42 +00:00
06cc2e83f0 Make optim.swa.util content accessible from the torch.optim doc (#133393)
Link various classes and functions of `optim.swa_utils` to make their doc content accessible from the `torch.optim` doc.

Currently, if you click the link
https://pytorch.org/docs/stable/optim.html#module-torch.optim.swa_utils, it goes to a blank section at the bottom of the `torch.optim` page.
Also, the `torch.optim.swa_utils.AveragedModel` and `torch.optim.swa_utils.SWALR` classes, as well as `torch.optim.swa_utils.update_bn()` and `optim.swa_utils.get_ema_multi_avg_fn`, are not linked in the doc.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133393
Approved by: https://github.com/janeyx99
2024-08-21 00:43:46 +00:00
d1abd6241a [CI][BE] Update retry action to v3.0.0 (#119403)
To reduce number of
```
 Node.js 16 actions are deprecated. Please update the following actions to use Node.js 20
```

Finally can land this one as all nodes has been migrated to AmazonLinux2023
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119403
Approved by: https://github.com/clee2000, https://github.com/Skylion007
2024-08-20 23:56:37 +00:00
c42ac54d9e [inductor] prune unused constants in graph scheduling (#132208)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132208
Approved by: https://github.com/leslie-fang-intel

Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
2024-08-20 23:40:11 +00:00
5f3d22a609 Avoid GPU syncs by reusing Pre-allocated Zero Tensor (#128069)
This commit improves the FullyShardedDataParallel (FSDP) class in PyTorch by reusing a pre-allocated zero tensor, reducing unnecessary GPU synchronizations.
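
A minimal sketch of the general pattern (caching one zero tensor instead of materializing a new one each step); this is illustrative and not the FSDP code itself:

```python
import torch

_zero_cache = {}

def get_zero_tensor(shape, dtype, device):
    # Reuse a single pre-allocated zero tensor per (shape, dtype, device)
    # so repeated calls avoid fresh allocations on the GPU.
    key = (tuple(shape), dtype, str(device))
    if key not in _zero_cache:
        _zero_cache[key] = torch.zeros(shape, dtype=dtype, device=device)
    return _zero_cache[key]
```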

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128069
Approved by: https://github.com/awgu
2024-08-20 22:51:33 +00:00
5a7b544e5c Update FlexAttention with masking semantic (#133373)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133373
Approved by: https://github.com/yanboliang
2024-08-20 22:38:10 +00:00
bc785c2d9a [Inductor][FlexAttention] Don't trigger dynamic shape on building empty block mask (#133836)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133836
Approved by: https://github.com/Chillee
2024-08-20 22:36:53 +00:00
f7c1f32803 Fix partially initialized module error (#134019)
https://github.com/pytorch/pytorch/pull/132990 introduced a dependency on `torch.version`, which might not be imported yet, and can result in `AttributeError: partially initialized module 'torch' has no attribute 'version' (most likely due to a circular import)` if a user starts their code with `import torch.cuda`

Fix it by importing `torch.version` explicitly

Test Plan: CI

Differential Revision: D61549284

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134019
Approved by: https://github.com/seemethere
2024-08-20 22:20:02 +00:00
41fab40be7 [report_exportability] Avoid re-exporting duplicated modules (#133930)
Summary:
Skip re-exporting modules with duplicated types to speed up the exportability tests.

In real models, there are many duplicated modules, and they mostly have the same export issues.

Test Plan: Existing CI

Differential Revision: D61504630

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133930
Approved by: https://github.com/angelayi
2024-08-20 22:11:57 +00:00
1ae5d5bb62 [dynamo][user-defined] Improve getattr_static for user_defined objects (#133742)
Fixes https://github.com/pytorch/pytorch/issues/133607

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133742
Approved by: https://github.com/Skylion007, https://github.com/jansel
2024-08-20 21:51:03 +00:00
a36739f36a Cherry-Picking don't resolve conflicts (#134047)
During cherry-picking we want to use the default setting and fail if there is a merge conflict.
Here is an example of an invalid conflict resolution:
https://github.com/pytorch/pytorch/pull/131194
and cherry-pick
https://github.com/pytorch/pytorch/pull/133590

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134047
Approved by: https://github.com/kit1980
2024-08-20 21:48:05 +00:00
2e1830c7c8 Implement 2D version of masked_select for nestedtensors (#133889)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133889
Approved by: https://github.com/soulitzer
2024-08-20 21:46:32 +00:00
15b5a0b67f Revert "[RFC][dynamo] add decorator to register polyfill for unsupported C++ function to avoid graph break (#133712)"
This reverts commit 71dd52f51a05d110c06e83f74cef165f64627842.

Reverted https://github.com/pytorch/pytorch/pull/133712 on behalf of https://github.com/ZainRizvi due to breaking main windows cpu tests - this stack still causes that windows test to fail ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2299776241))
2024-08-20 21:14:45 +00:00
88ead0afc6 Revert "[dynamo] simplify polyfill registration for builtins.all and builtins.any (#133769)"
This reverts commit 178e8563b8a44243a6f69f3d257d9a3aab71b2c5.

Reverted https://github.com/pytorch/pytorch/pull/133769 on behalf of https://github.com/ZainRizvi due to breaking main windows cpu tests - this stack still causes that windows test to fail ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2299776241))
2024-08-20 21:14:45 +00:00
3fa874abbe Revert "[dynamo] simplify implementation for functools.reduce (#133778)"
This reverts commit 37b4bc60a4ec65858044983a36577912fb9b4651.

Reverted https://github.com/pytorch/pytorch/pull/133778 on behalf of https://github.com/ZainRizvi due to breaking main windows cpu tests - this stack still causes that windows test to fail ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2299776241))
2024-08-20 21:14:45 +00:00
98e6a1d8ff Revert "[dynamo] simplify implementation for builtins.sum (#133779)"
This reverts commit 3f58a8051a92470dbd254859322a7eb085a8f243.

Reverted https://github.com/pytorch/pytorch/pull/133779 on behalf of https://github.com/ZainRizvi due to breaking main windows cpu tests - this stack still causes that windows test to fail ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2299776241))
2024-08-20 21:14:44 +00:00
2540ee372a Revert "[dynamo][itertools] support itertools.tee (#133771)"
This reverts commit 28ce3c0227830c78c0b5d4ec592f5c3879bc61a3.

Reverted https://github.com/pytorch/pytorch/pull/133771 on behalf of https://github.com/ZainRizvi due to breaking main windows cpu tests - this stack still causes that windows test to fail ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2299776241))
2024-08-20 21:14:44 +00:00
ccc0aa69ce [ONNX] Remove torch.onnx._export (#133824)
- Remove the deprecated torch.onnx._export function
- Remove test/onnx/test_export_modes.py because export modes are no longer supported
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133824
Approved by: https://github.com/titaiwangms
2024-08-20 20:54:48 +00:00
b03381cac2 [dynamo] support cls.__flags__ (#133970)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133970
Approved by: https://github.com/jansel
ghstack dependencies: #133969
2024-08-20 20:03:31 +00:00
5229b52bf2 [dynamo] support cls.__base__ (#133969)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133969
Approved by: https://github.com/jansel
2024-08-20 20:03:31 +00:00
bb0bf09aff [easy] skip test_sdpa_autocast on windows (#134009)
test is failing because torch.compile doesn't work on windows
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134009
Approved by: https://github.com/YuqingJ, https://github.com/Skylion007, https://github.com/ZainRizvi
2024-08-20 19:51:55 +00:00
28ce3c0227 [dynamo][itertools] support itertools.tee (#133771)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133771
Approved by: https://github.com/jansel
ghstack dependencies: #133712, #133769, #133778, #133779
2024-08-20 19:48:57 +00:00
3f58a8051a [dynamo] simplify implementation for builtins.sum (#133779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133779
Approved by: https://github.com/jansel
ghstack dependencies: #133712, #133769, #133778
2024-08-20 19:48:57 +00:00
37b4bc60a4 [dynamo] simplify implementation for functools.reduce (#133778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133778
Approved by: https://github.com/jansel
ghstack dependencies: #133712, #133769
2024-08-20 19:48:57 +00:00
178e8563b8 [dynamo] simplify polyfill registration for builtins.all and builtins.any (#133769)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133769
Approved by: https://github.com/jansel
ghstack dependencies: #133712
2024-08-20 19:48:57 +00:00
71dd52f51a [RFC][dynamo] add decorator to register polyfill for unsupported C++ function to avoid graph break (#133712)
Add decorator `torch.compiler.substitute_in_graph` to register a polyfill for an unsupported C++ function and avoid a graph break. This API provides an official way to add Dynamo support for third-party C extensions. Also, it can be used to simplify our implementation of `torch._dynamo.polyfill`.

5ee070266f/torch/_dynamo/variables/builtin.py (L97-L107)

Example:

```python
>>> import operator
>>> operator.indexOf([1, 2, 3, 4, 5], 3)
2

>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
Unsupported: ...

>>> @torch.compiler.substitute_in_graph(operator.indexOf)
... def indexOf(sequence, x):
...     for i, item in enumerate(sequence):
...         if item is x or item == x:
...             return i
...     raise ValueError("sequence.index(x): x not in sequence")

>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133712
Approved by: https://github.com/jansel
2024-08-20 19:48:57 +00:00
49430bfd5c [DeviceMesh] Add a _MeshEnv attr to record the mapping of flatten mesh_dim_name to its mesh dim index in root mesh (#133838)
```
# suppose we have a 3d mesh
mesh_3d = init_device_mesh("cuda", (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"))
dp_cp_mesh = mesh_3d["dp", "cp"]._flatten()

"""
then we would have
flatten_name_to_root_dims[mesh_3d]: {
    "dp_cp": (0, 1)
}
"""
```

We need this information to validate the order of a mesh slice that includes a flattened mesh dim.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133838
Approved by: https://github.com/fegin
2024-08-20 19:43:45 +00:00
c188d419db [BE] [EZ] Allow linux-build workflows to run on the default runner type (#133640)
Replace usage of `runner` with the new `runner_prefix` input, which allows the workflows to use the default runner type (linux.2xlarge) specified by the reusable workflow.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133640
Approved by: https://github.com/clee2000, https://github.com/jeanschmidt, https://github.com/malfet
2024-08-20 19:37:14 +00:00
81a822ddc9 Back out "[1/N] Fix clang-tidy warnings in inductor (#131979)" (#133922)
Summary:
Original commit changeset: cc9392e5fce2

Original Phabricator Diff: D60464909

Differential Revision: D61501052

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133922
Approved by: https://github.com/22quinn
2024-08-20 19:16:29 +00:00
49f6ea6dd9 Revert "[report_exportability] Avoid re-exporting duplicated modules (#133930)"
This reverts commit 278bc985d71f1ee09a499fba2ea5032b7baf2567.

Reverted https://github.com/pytorch/pytorch/pull/133930 on behalf of https://github.com/izaitsevfb due to breaks lint ([comment](https://github.com/pytorch/pytorch/pull/133930#issuecomment-2299513046))
2024-08-20 18:44:09 +00:00
43f78bf37a [MPS] Gather sliced inputs to batch norm (#133610)
This PR removes the `executeGatherOp` flag from batch norm in favor of relying on the logic in 4aa66f68a8/aten/src/ATen/native/mps/OperationUtils.mm (L372) to decide if gathering is necessary.

It's not the most efficient way to solve this issue, but it assures correctness for sliced inputs.

### Performance impact

#### With fix

```
python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)"
100 loops, best of 5: 282 usec per loop

python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])"
100 loops, best of 5: 448 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)"
1000 loops, best of 5: 705 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])"
1000 loops, best of 5: 1.11 msec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x)"
1000 loops, best of 5: 7.16 msec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x[5:])"
1000 loops, best of 5: 11.7 msec per loop
```

#### Without fix

```
python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)"
100 loops, best of 5: 284 usec per loop

python -m timeit -n 100 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])"
100 loops, best of 5: 265 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x)"
1000 loops, best of 5: 715 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(100, 100, 35, 45).to('mps')" "bn(x[5:])"
1000 loops, best of 5: 675 usec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x)"
1000 loops, best of 5: 7.19 msec per loop

python -m timeit -n 1000 -s "import torch; import torch.nn as nn; bn = nn.BatchNorm2d(100, affine=False, device='mps');x = torch.randn(1000, 100, 35, 45).to('mps')" "bn(x[5:])"
1000 loops, best of 5: 7.13 msec per loop
```

Please feel free to push back or request changes.

Fixes #133520
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133610
Approved by: https://github.com/malfet
2024-08-20 18:24:48 +00:00
278bc985d7 [report_exportability] Avoid re-exporting duplicated modules (#133930)
Summary:
Skip re-exporting modules with duplicated types to speed up the exportability tests.

In real models, there are many duplicated modules, and they mostly have the same export issues.

Test Plan: Existing CI

Differential Revision: D61504630

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133930
Approved by: https://github.com/angelayi

Co-authored-by: bearzx <bearzx@fb.com>
2024-08-20 18:20:49 +00:00
333890b701 Enable CUDA 12.4.1 (#132202)
Trying to keep a record of the steps before I lose track of it.

- 1st Commit: Similar to https://github.com/pytorch/builder/pull/1720
- 2nd Commit:  Update CUDA 12.4 CI CUDA versions from 12.4.0 to 12.4.1 mapping to changes in https://github.com/pytorch/pytorch/pull/125944/files
- 3rd Commit: update for aarch64 install_cuda_aarch64.sh docker step
- 4th Commit: aaa456e3e6 Related https://github.com/pytorch/pytorch/pull/121684
- Synchronization point: Meta helps uploading pypi cuda dependencies specified in .github/scripts/generate_binary_build_matrix.py
- The above pypi upload is done (thanks Andrey!), restarted jobs like https://github.com/pytorch/pytorch/actions/runs/10188203670/job/28369471321
- 77532344e3, use temporary docker containers (generated from a previous successful container build). If merged, these containers would be rebuilt, therefore testing them now.  (5th commit)
- 6th commit 5f93c625b5: revert the 5th commit. Update, done but have to debug seemingly irrelevant failures (rocm/xpu/mps)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132202
Approved by: https://github.com/Skylion007, https://github.com/eqy, https://github.com/atalman
2024-08-20 17:52:50 +00:00
e41b520ee3 [3/N] Refactor FR script - Add a processor module (#133933)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133933
Approved by: https://github.com/c-p-i-o
ghstack dependencies: #133927, #133929
2024-08-20 17:36:49 +00:00
bce0caba78 [BE]: Update Typeguard to TypeIs for better type inference (#133814)
Uses TypeIs instead of TypeGuard for better inference. See https://peps.python.org/pep-0742/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133814
Approved by: https://github.com/ezyang
2024-08-20 17:19:57 +00:00
fbf3fc2a30 [inductor] Use int64_t as index type for all platfroms 4 (#133892)
This is a parallel PR to https://github.com/pytorch/pytorch/pull/133819, and it appends changes for @jansel's comments.
1. For `torch/_inductor/codegen/cpp_wrapper_cpu.py`, revert to the original code to append LL on macOS and Windows: bdc14ad89a
2. For `torch/_inductor/codegen/cpp_utils.py`, append LL on macOS and Windows for large constants, and fix its UTs: 3a56b76ce0

------------------------------
Another solution for https://github.com/pytorch/pytorch/pull/133615: use `int64_t` as the index type for all platforms.

### Development notes:
The mentioned PR (https://github.com/pytorch/pytorch/pull/133615) fixes the index type not matching the parse_arg arg types. As reviewed with @jansel, Jason thinks we need to unify `INDEX_TYPE` across all platforms.
The current code is cumbersome:
```python
INDEX_TYPE = "int64_t" if _IS_WINDOWS else "long"
```

So I made some attempts to unify `INDEX_TYPE` as either `long` or `int64_t`.
Using `long` as the index type: https://github.com/pytorch/pytorch/pull/133768
Using `int64_t` as the index type: https://github.com/pytorch/pytorch/pull/133782

After that, we discussed which type to select as the final solution.
![image](https://github.com/user-attachments/assets/b23fa577-2d40-4bd6-b934-fb7994fe0bb0)

The `long` type has different definitions and sizes on different OSs and with different compilers. So @jansel decided that we need to select `int64_t` for all platforms, and I continued my work based on https://github.com/pytorch/pytorch/pull/133782.

https://github.com/pytorch/pytorch/pull/133782 still has two issues:
1. std::min/std::max could not match function instances by arg types. This was fixed and validated in PR: https://github.com/pytorch/pytorch/pull/133812
2. A CUDA TestMemoryPlanning::test_cpp_wrapper issue caused by a wrong index type. It is fixed in this PR.

So, we made final solution in this PR.

### Changes:
**1. Use the `int64_t` type as the index type for all OSs: `Windows`, `Linux`, and `MacOS`.**
**2. Use static_cast<int64_t>(`constant`) to convert constants passed to `div_floor_integer`, whose args are of type `int64_t`.**
**3. Update the `parse_arg` function signature to `int64_t`, following the index type.**
**4. Append a double L (`LL`) to constants on Windows and MacOS, because their int64_t is long long.**
**5. Fix `std::min/std::max` type mismatches via static_cast to `INDEX_TYPE`.**
**6. Fix UTs, including the CUDA `TestMemoryPlanning::test_cpp_wrapper` and `test_indexing.py`.**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133892
Approved by: https://github.com/jansel
2024-08-20 16:54:12 +00:00
3caf3baabb [inductor] enable inductor backend for dynamo on Windows. (#133921)
Changes:
Enable inductor backend for dynamo on Windows.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133921
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-08-20 16:46:19 +00:00
cyy
c3d02fa390 [Reland2] Update NVTX to NVTX3 (#109843)
Another attempt to update NVTX to NVTX3. We now avoid changing the NVTX header inclusion of existing code. The advantage of NVTX3 over NVTX is that it is a header-only library, so linking with NVTX3 can greatly simplify our CMake and other build scripts for finding libraries in user environments. In addition, NVTX is indeed still present in the latest CUDA versions, but it's no longer a compiled library: it's now header-only. That's why there isn't a .lib file anymore.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109843
Approved by: https://github.com/peterbell10, https://github.com/eqy

Co-authored-by: Ivan Zaitsev <108101595+izaitsevfb@users.noreply.github.com>
2024-08-20 16:33:26 +00:00
33f1ee036e [dynamo][user-defined] Simplify call_hasattr (#133935)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133935
Approved by: https://github.com/williamwen42, https://github.com/jansel
ghstack dependencies: #133745, #133747, #133746, #133799, #133800
2024-08-20 16:27:44 +00:00
cyy
8d93fe510e Remove NestedTensorFactories.h (#133809)
Since it has no code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133809
Approved by: https://github.com/ezyang
2024-08-20 16:16:30 +00:00
187d55018a [BE] Fix MYPY issues (#133872)
Fix some mypy issues that have crept in to the trunk.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133872
Approved by: https://github.com/oulgen, https://github.com/Skylion007
2024-08-20 16:12:04 +00:00
52dfe99dbf Skip test_custom_op_add_abi_compatible_cpu_with_stack_allocation internally (#133704)
Summary: This test is segfaulting internally. Skip for now so we can get the internal tests green.

Differential Revision: D61399618

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133704
Approved by: https://github.com/desertfire
2024-08-20 16:01:39 +00:00
3a2f7192c3 Revert "return state dict without optimized module (#132626)"
This reverts commit e37eef8a7bd5915fa2961d688fd8b02df5cc5fd7.

Reverted https://github.com/pytorch/pytorch/pull/132626 on behalf of https://github.com/ZainRizvi due to Sorry but it seems like this PR broke trunk. distributed/checkpoint/test_state_dict.py::TestStateDict::test_fsdp2 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10458281674/job/28969008325) [HUD commit link](da69a28c6f) ([comment](https://github.com/pytorch/pytorch/pull/132626#issuecomment-2299190664))
2024-08-20 15:54:54 +00:00
f2b57d8831 Fix torch._C submodules population (#133919)
This fixes regression introduced by https://github.com/pytorch/pytorch/pull/132216 that on some Python runtimes failed with
```
>   from torch._C._dynamo.guards import GlobalStateGuard
E   ModuleNotFoundError: No module named 'torch._C._dynamo.guards'; 'torch._C._dynamo' is not a package

c:\users\malfet\git\pytorch\torch\_dynamo\convert_frame.py:28: ModuleNotFoundError
```

Simplify it by always registering submodules by their primary name and not trying to add submodules that are not part of the same namespace as the parent. Otherwise a module can be registered by an alias rather than by its primary name.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133919
Approved by: https://github.com/atalman, https://github.com/izaitsevfb, https://github.com/XuehaiPan, https://github.com/albanD, https://github.com/Skylion007
2024-08-20 15:38:32 +00:00
b02695d65f [export] training ir migration, fix export_rle_model (#133937)
Summary:
- exir.capture + to_edge is deprecated. We need to use export + to_edge.
- Fix the quantization pass to be compatible with the new export IR. In the quantization pass, some nodes might have side effects, so they don't have users but are still not removed by the DCE pass. We need to account for that.
- export_rle_model now works with the default `capture_pre_autograd_graph`; it should also work with the new training IR

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/export:export_rle_model  -- -r export_rle_model
```

Differential Revision: D61485834

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133937
Approved by: https://github.com/tugsbayasgalan
2024-08-20 15:35:25 +00:00
6590f4fb0e [CD] Enable python 3.13 for xpu nightly build (#133670)
Enable Python 3.13 for the XPU nightly build; it depends on https://github.com/pytorch/pytorch/pull/133454 landing. Also update the XPU nightly wheel test env.

Works for https://github.com/pytorch/pytorch/issues/114850
Fixes #130543
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133670
Approved by: https://github.com/atalman, https://github.com/malfet
2024-08-20 15:05:20 +00:00
36376efd06 [2/N] Refactor FR script - add a loader module (#133929)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133929
Approved by: https://github.com/c-p-i-o
ghstack dependencies: #133927
2024-08-20 14:27:40 +00:00
2bd02e0c82 Revert "[RFC][dynamo] add decorator to register polyfill for unsupported C++ function to avoid graph break (#133712)"
This reverts commit 641724ed1daad1e6fc2525cc6858d199e576d5cd.

Reverted https://github.com/pytorch/pytorch/pull/133712 on behalf of https://github.com/jeanschmidt due to breaking main windows cpu tests - reverting them all, so we can identify the culprit with more calmness ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2298528797))
2024-08-20 10:34:41 +00:00
91fd270535 Revert "[dynamo] simplify polyfill registration for builtins.all and builtins.any (#133769)"
This reverts commit 59ca56e56ca3e2f6dd80db57079725cf61f06810.

Reverted https://github.com/pytorch/pytorch/pull/133769 on behalf of https://github.com/jeanschmidt due to breaking main windows cpu tests - reverting them all, so we can identify the culprit with more calmness ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2298528797))
2024-08-20 10:34:41 +00:00
5109c5ef23 Revert "[dynamo] simplify implementation for functools.reduce (#133778)"
This reverts commit ff9be0eda99c59cdbcc269853168657de93043c7.

Reverted https://github.com/pytorch/pytorch/pull/133778 on behalf of https://github.com/jeanschmidt due to breaking main windows cpu tests - reverting them all, so we can identify the culprit with more calmness ([comment](https://github.com/pytorch/pytorch/pull/133712#issuecomment-2298528797))
2024-08-20 10:34:41 +00:00
241df7e7f8 Add multi-cache autotune test (#133868)
Summary:
The existing tests didn't cover a case where we had multiple autotunes in a single graph.  Add a test to demonstrate that case.

Also added a test dependency on redis and removed the "fake redis" from the previous PR (#133579)

Test Plan: unit tests

Reviewed By: oulgen

Differential Revision: D61178861

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133868
Approved by: https://github.com/oulgen
2024-08-20 10:26:45 +00:00
11af423eca [SymmetricMemory] make buffer_ptrs_dev, signal_pad_ptrs_dev, buffer_size, and signal_pad_size accessible in python (#133680)
These allow us to experiment with creative applications with Triton.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133680
Approved by: https://github.com/Chillee
2024-08-20 10:15:35 +00:00
08b5e07e6c Revert "[dynamo] simplify implementation for builtins.sum (#133779)"
This reverts commit 1fdeb4e32918017ee3a712e0bba86e8482fa293b.

Reverted https://github.com/pytorch/pytorch/pull/133779 on behalf of https://github.com/jeanschmidt due to breaking main windows cpu tests ([comment](https://github.com/pytorch/pytorch/pull/133779#issuecomment-2298285206))
2024-08-20 08:33:29 +00:00
68570fca69 Revert "Add MaskedTensor support to *_like API (#128637)"
This reverts commit 8de56e29581fa2706d44f8c4b0827830c9351470.

Reverted https://github.com/pytorch/pytorch/pull/128637 on behalf of https://github.com/jeanschmidt due to Introduced API linting errors ([comment](https://github.com/pytorch/pytorch/pull/128637#issuecomment-2298270307))
2024-08-20 08:26:28 +00:00
42097f0ec1 Revert "[BE]: Update Typeguard to TypeIs for better type inference (#133814)"
This reverts commit cf60fe53a83bafec0857d5b49c2054de6ba4cddc.

Reverted https://github.com/pytorch/pytorch/pull/133814 on behalf of https://github.com/jeanschmidt due to Broke 12k internal signals/jobs, @ezyang please help get those changes merged. More details check D61488368 ([comment](https://github.com/pytorch/pytorch/pull/133814#issuecomment-2298210309))
2024-08-20 08:02:49 +00:00
25d5a815f7 [Dynamo] Guard on torch function mode global state (#133135)
Adds guards checking whether torch function mode is in the all disabled state.

There are three torch function enablement states:
* All torch function disabled (modes + subclasses)
* Torch function subclass disabled
* All enabled

We now have guards checking if the state is All enabled and if the state is All disabled.
Each of the three states above maps to a unique pair of these two flags.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133135
Approved by: https://github.com/anijain2305
ghstack dependencies: #133130, #133729, #133131, #133132, #133133, #133134, #133136
2024-08-20 07:15:04 +00:00
48ee0984ac Add C API to return all torch function disablement status (#133136)
This PR adds a C function to check if all torch function is disabled.
Recall that there are three torch function enablement states:
* All disabled
* Torch Function Subclass disabled
* All enabled

The API before this change provides two functions:
* `_is_torch_function_enabled` - returns True iff the current TF state is All enabled
* `_is_torch_function_mode_enabled` - returns True iff the state is not All disabled and the torch function mode stack is non-empty.

The crux of why a new API is needed is the following: if Dynamo enters a frame with the torch function mode stack empty and `_is_torch_function_enabled` == False, it is impossible to determine whether we should enter a newly pushed mode or not. This is because we don't know if the enablement state is All disabled or only Subclass disabled. Adding this API to check whether All disabled is True allows us to disambiguate this case.
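
A small illustration of that ambiguity with the pre-existing API (both context managers are existing private APIs; the printed values follow from the state descriptions above and are shown only to illustrate the point):

```python
import torch

# Subclass-only disablement: modes could still run.
with torch._C.DisableTorchFunctionSubclass():
    print(torch._C._is_torch_function_enabled())  # False

# Full disablement: nothing should run.
with torch._C.DisableTorchFunction():
    print(torch._C._is_torch_function_enabled())  # False, indistinguishable from above
```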

In the next PR, Dynamo InstructionTranslator will have clearer flags than the underlying C API:
* A flag to indicate if subclasses are disabled (ie All disabled or Subclass Disabled is the current state)
* A flag to indicate if modes are disabled (ie if All disabled is the current state)
* A symbolic stack which can be checked if any modes are present

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133136
Approved by: https://github.com/bdhirsh
ghstack dependencies: #133130, #133729, #133131, #133132, #133133, #133134
2024-08-20 07:15:04 +00:00
d97ca968cd [Dynamo] Test intermediate tf mode construction (#133134)
Ensures that constructing a torch function mode in the middle of a function is supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133134
Approved by: https://github.com/williamwen42
ghstack dependencies: #133130, #133729, #133131, #133132, #133133
2024-08-20 07:14:56 +00:00
626acaeb16 [Dynamo] Support torch function stack len (#133133)
Adds support for `torch._C._len_torch_function_stack()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133133
Approved by: https://github.com/williamwen42
ghstack dependencies: #133130, #133729, #133131, #133132
2024-08-20 07:14:52 +00:00
d1fdf984c3 [Dynamo] Support push torch function mode stack (#133132)
This PR adds support `torch._C._push_on_torch_function_stack()` by updating `torch.py` to push onto the symbolic torch function mode stack when a push is encountered. The same side effects infra used in the previous PR is used to track the mutation of the torch function mode stack and add bytecode to update it if it is mutated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133132
Approved by: https://github.com/williamwen42
ghstack dependencies: #133130, #133729, #133131
2024-08-20 07:14:47 +00:00
c0b4aaa8c5 [Dynamo] Support pop torch function mode stack (#133131)
This PR adds support for tracing `torch._C._pop_torch_function_stack()` without graph breaking and in order to verify the state change also adds replay of mutations to the torch function mode stack via side_effects appending supplemental bytecode as we do for other python mutable objects.

Details:
To represent the torch function mode stack symbolically a deque field is added to the instruction translator. When the InstructionTranslator is initialized, all modes are read from the current torch function mode stack, and stashed in a global weak ref for later access (using existing sources) without needing to push/pop the python/cpp torch function mode stack.

During tracing, when `_pop_torch_function_stack` is encountered a value is popped from this deque and the variable tracker representing the mode is returned. To ensure the true torch function mode stack matches this state, `TorchFunctionModeStackVariable`, a singleton, is marked as mutated, this adds it to side effects, where during final codegen, side effects will codegen a call to a python helper which will update the python torch function mode stack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133131
Approved by: https://github.com/jansel
ghstack dependencies: #133130, #133729
2024-08-20 07:14:42 +00:00
f147349568 Fix DeviceContext bug (#133729)
Fixes https://github.com/pytorch/pytorch/issues/133666

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133729
Approved by: https://github.com/bdhirsh
ghstack dependencies: #133130
2024-08-20 07:14:37 +00:00
09e366cb57 [Dynamo] Add torch function mode stack guard to dynamo (#133130)
This PR adds a guard on the torch function mode stack state at the beginning of tracing. The way this is implemented is via a new leaf guard which is passed the initial stack state at construction and compares it to the stack state at the time the guard is run.

Details:
The stack state is extracted via popping all modes, appending them to a list, and pushing all modes back. This list is stored on the output graph and read during guard construction to pass to the stack mode guard. There the length and types of the modes are recorded. Next time the guard is run it compares this recorded state to the current mode stack state.

To implement this in Python, a helper function was added to utils.py; it is used if C++ guards are not enabled.
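A rough sketch of that helper, assuming the private `torch._C` stack APIs referenced elsewhere in this stack (the real code lives in `torch._dynamo.utils`):

```python
import torch

def get_torch_function_mode_stack():
    # Pop every mode to read the stack, then push them back in the same
    # order so the real stack is left untouched.
    stack_len = torch._C._len_torch_function_stack()
    modes = [torch._C._pop_torch_function_stack() for _ in range(stack_len)]
    for mode in reversed(modes):
        torch._C._push_on_torch_function_stack(mode)
    return modes
```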

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133130
Approved by: https://github.com/anijain2305
2024-08-20 07:14:33 +00:00
7492da804f Mark disabled tests as fixed (#133940)
Fixes #132552, #133900, #133901, #133902, #133903, #133904, #133905, #133906, #133908, #133910, #133911, #133912, #133913, #133914, #133915, #133916, #133917

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133940
Approved by: https://github.com/oulgen
2024-08-20 06:58:11 +00:00
e8d3c4be36 [dynamo][reland][inline-inbuilt-nn-modules] Mark attributes of nn mod… (#133714)
Relands https://github.com/pytorch/pytorch/pull/132539
Relands https://github.com/pytorch/pytorch/pull/132736

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133714
Approved by: https://github.com/jansel
2024-08-20 05:57:52 +00:00
f08d484702 Add itertools.islice support in dynamo (#133893)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133893
Approved by: https://github.com/oulgen
2024-08-20 05:55:53 +00:00
b6891f4002 [1/N] Refactor fr trace script to make it modulized - config (#133927)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133927
Approved by: https://github.com/c-p-i-o
2024-08-20 05:47:17 +00:00
15addb00e6 Update test_control_flow.py to device-agnostic. (#133843)
Fixes #133841

This PR makes the `test_pointwise_associative_scan_CUDA_flip` also work on XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133843
Approved by: https://github.com/EikanWang, https://github.com/desertfire, https://github.com/malfet, https://github.com/jansel, https://github.com/atalman
2024-08-20 05:05:43 +00:00
994fcb9acd Killswitch based rollout for flight recorder (#133237)
Summary: Defaulting TORCH_NCCL_DUMP_ON_TIMEOUT to "true" and adding a killswitch in case we need to kill this feature in production.

Test Plan: Tests pass manually but need further testing before this is rolled out fully everywhere.

Differential Revision: D61136320

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133237
Approved by: https://github.com/c00w
2024-08-20 04:27:55 +00:00
32f57ac627 [BE] Fix lint issues in qlinear_prepack.cpp (#133797)
Summary: This diff fixes many lint issues in qlinear_prepack.cpp. I'm fixing them because I want to add more ops/funcs to this file later.

Test Plan: Sandcastle

Differential Revision: D61425436

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133797
Approved by: https://github.com/Skylion007
2024-08-20 04:23:25 +00:00
b0bafd2be5 remove tensor weak ref from constraint target (#133890)
Summary: `_ConstraintTarget` is an internal data structure that has some redundancy: tensors are identified by their id but also carry a weak reference. The weak reference was probably useful a year back but everything is done with ids right now, and the lifetime of these tensors ensures that using their ids is OK.

Test Plan: existing tests

Differential Revision: D61488816

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133890
Approved by: https://github.com/tugsbayasgalan
2024-08-20 03:03:05 +00:00
188cb5e67b Bump scikit-image to 0.22.0 (#133932)
Fixes: https://github.com/pytorch/pytorch/issues/133926

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133932
Approved by: https://github.com/malfet
2024-08-20 02:37:16 +00:00
6c82a1c68c [AOTI] Introduce DeferredCudaKernelLine for cuda cpp wrapper (#129135)
Summary: When generating the CUDA kernel load and launch code, certain Triton kernel metadata is needed, but that metadata only exists after kernel auto-tuning is done. DeferredCudaKernelLine is a deferred line which can backfill a string template after kernel auto-tuning. This prepares for the one-pass AOTI codegen implementation.
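An illustrative sketch of the idea only (the class and field names below are assumptions, not the actual Inductor implementation):

```python
class DeferredKernelLine:
    """A wrapper-code line whose final text is only known after autotuning."""

    def __init__(self, kernel_name, template, metadata_store):
        self.kernel_name = kernel_name
        self.template = template  # e.g. "launchKernel({name}, {grid_x}, {shared_mem});"
        self.metadata_store = metadata_store

    def render(self):
        # Called during the final write-out, once autotuning has filled the store.
        meta = self.metadata_store[self.kernel_name]
        return self.template.format(**meta)
```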

Differential Revision: [D61018114](https://our.internmc.facebook.com/intern/diff/D61018114)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129135
Approved by: https://github.com/angelayi
2024-08-20 02:15:44 +00:00
cyy
c51fc7e98e Enable clang-tidy in aten/src/ATen/native/nested/ (#133829)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133829
Approved by: https://github.com/Skylion007
2024-08-20 01:52:15 +00:00
c6ea7b3f21 Update xpu CD used driver to rolling version (#133454)
The main purpose of this PR is to change the XPU CD to use the rolling driver, supporting AOT builds for more client GPUs and enabling Kineto. We also plan to enable Python 3.13 for the XPU CD.

Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133454
Approved by: https://github.com/atalman
2024-08-20 01:45:45 +00:00
c7af2728d3 Remove aten dispatch to empty in foreach_norm cuda kernel (#133897)
Saves significant time on aten dispatch. For 2k tensors, goes from 38ms to 58us.
Should shave some overhead mentioned in https://github.com/pytorch/pytorch/issues/133586

Before PR:
![image](https://github.com/user-attachments/assets/7813f059-0f7f-4d44-a9f0-1aaf94ae849f)

After:
![image](https://github.com/user-attachments/assets/ad0855b1-2743-432a-ad31-b574c620e2fd)

script:
```
import torch

# warm up caching allocator
a = torch.rand(200, 10, device="cuda")
b = torch.rand(200, 10, device="cuda")
c = a + b
del a, b, c

ts = [torch.rand(2, 3, device="cuda") for _ in range(2000)]

with torch.profiler.profile(
    activities=[
        torch.profiler.ProfilerActivity.CPU,
        torch.profiler.ProfilerActivity.CUDA,
    ]
) as p:
    torch._foreach_norm(ts)

print(p.key_averages().table(sort_by="cpu_time_total"))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133897
Approved by: https://github.com/albanD, https://github.com/drisspg
2024-08-20 01:27:09 +00:00
874ae854eb [c10d] Land CudaEventCache with roll out flags (#133727)
@zdevito added a cache for CudaEvent in https://github.com/pytorch/pytorch/pull/122732. And we want to productionize it with a flag in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133727
Approved by: https://github.com/shuqiangzhang, https://github.com/eqy
2024-08-20 01:08:00 +00:00
cfcb9e388d [PT2][Optimus] Add move reshape out of split stack pass (#133710)
Summary: We observed a new pattern in CMF where reshape nodes sit in the middle of the split-stack pattern, introducing massive triton_fused_stack_xxx kernels and increasing compilation time. We thus move the reshape outside of the pattern and eliminate such split-stack nodes.

Test Plan:
# unit test
```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```

Buck UI: https://www.internalfb.com/buck2/2fb51ae7-832e-436b-b6b7-a81599390182
Test UI: https://www.internalfb.com/intern/testinfra/testrun/14918173811074971
Network: Up: 10MiB  Down: 5.4GiB  (reSessionID-96a20105-fdc6-4b4f-b465-813a84a71eba)
Jobs completed: 304618. Time elapsed: 25:24.7s.
Cache hits: 99%. Commands: 120772 (cached: 120410, remote: 357, local: 5)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0

# benchmark

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "cmf_shrink" --flow_id 587303213
```
P1529578588
graph diffing: https://www.internalfb.com/intern/diffing/?paste_number=1529577762

Counter({'pattern_matcher_nodes': 2123, 'pattern_matcher_count': 1715, 'normalization_pass': 404, 'remove_split_with_size_one_pass': 269, 'extern_calls': 193, 'merge_splits_pass': 74, 'normalization_aten_pass': 47, 'fxgraph_cache_miss': 9, 'batch_aten_mul': 6, 'scmerge_split_sections_removed': 5, 'scmerge_split_removed': 4, 'scmerge_cat_removed': 4, 'unbind_stack_pass': 4, 'batch_sigmoid': 2, 'batch_linear': 2, 'move_reshape_out_of_split_stack_pass': 2, 'batch_aten_sub': 2, 'batch_layernorm': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'split_stack_to_cats_pass': 1, 'split_cat_to_slices_pass': 1, 'batch_aten_add': 1, 'batch_relu': 1})

Trace link: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Ftest%2Fcmf_shrink.Aug_15_10_55_41_trace.json.gz&bucket=pyper_traces

The triton_fused_stack_xxx kernels have been reduced significantly; we can see from the trace that the green part becomes smaller
{F1806406290}

# e2e
ads_dper3:68464f2dc5e849ba2670482079cecaaa
training_platform:8643db0c3453f2658aa7be7d73974ea0

baseline:
f588719502

proposal:
f592116164

Differential Revision: D61249205

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133710
Approved by: https://github.com/jackiexu1992
2024-08-20 00:50:07 +00:00
6f738d6434 Remove early exit in constant_pad_nd for export (#132679)
Summary:
Remove the early exit for padding when padding = [0, 0, 0, 0].

This prevents export from specializing when all padding=0, allowing export when all padding >= 0. Specialization will still happen for negative padding.

This change will be used to export image preprocess for multimodal models, where images of dynamic shape are padded. As images are of dynamic shape, we can't be sure if padding will be required or not. Padding is guaranteed to be non-negative.

Preprocess code: https://github.com/pytorch/torchtune/pull/1242

Note: the alternative is to wrap padding in a custom op, which isn't ideal given the custom op will contain the same impl as constant_pad_nd.

Test Plan: ci

Differential Revision: D60687727

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132679
Approved by: https://github.com/ezyang
2024-08-20 00:07:41 +00:00
9a998d98f1 Fix edge case in inductor triton clean script (#130837)
The regex in the script is too restrictive, as it excludes examples with parentheses in args, like the following:
```
triton_poi_fused_add_0.run(arg0_1.item(), arg1_1.item(), buf0, 1, grid=grid(1), stream=streamNone)
                                       ^
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130837
Approved by: https://github.com/Chillee
2024-08-19 23:46:11 +00:00
65b3e42074 Warn on fx graph cache bypass and log it to tlparse (#133826)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133826
Approved by: https://github.com/aorenste
2024-08-19 23:39:55 +00:00
2ec95ffe57 [cond] support unbacked symbool inputs (#133589)
Fixes https://github.com/pytorch/pytorch/issues/133577.

In Dynamo, when we receive an unbacked symbool input, we create an unbacked symint to replace it.

The alternative approach of `not realizing the pred LazyVariable in cond` doesn't work because we need to get the proxy of the symbool input.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133589
Approved by: https://github.com/ezyang
2024-08-19 23:36:48 +00:00
3f525c9d5d Upgrade nightly wheels to rocm6.2 - 2 of 2 (binaries) (#133238)
Depends on https://github.com/pytorch/pytorch/pull/132875

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133238
Approved by: https://github.com/atalman
2024-08-19 22:35:33 +00:00
2b95007d12 [dynamo] support random.Random (#133725)
Fixes the observed graph breaks in https://github.com/pytorch/pytorch/issues/121349 and https://github.com/pytorch/pytorch/issues/121350.

But there are still graph breaks since a random output is being used as a seed, e.g.
```python
import random
import torch

def fn(x):
    seed = random.randint(0, 100)
    rand = random.Random(seed)
    return x + rand.randrange(10)

opt_fn = torch.compile(fn, backend="eager", fullgraph=True)
opt_fn(torch.ones(1))
```

fails with
```
torch._dynamo.exc.InternalTorchDynamoError: UnspecializedPythonVariable() is not a constant
```

when tracing the line
```
rand = random.Random(seed)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133725
Approved by: https://github.com/jansel
2024-08-19 22:34:44 +00:00
06faa15194 [pytorch][counters] add pytorch.wait_counter.fx_codgen_and_compile (#133107)
as titled

Differential Revision: [D60876629](https://our.internmc.facebook.com/intern/diff/D60876629/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133107
Approved by: https://github.com/asiab4
2024-08-19 22:29:16 +00:00
afb3e5ed6a Add onnx and onnxscript to CI requirements (#133647)
Add onnx and onnxscript to requirements-ci.txt to allow for `test_public_bindings` and mypy to function when checking `torch.onnx._internal` code as @malfet suggested.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133647
Approved by: https://github.com/titaiwangms, https://github.com/kit1980
2024-08-19 22:15:07 +00:00
1fdeb4e329 [dynamo] simplify implementation for builtins.sum (#133779)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133779
Approved by: https://github.com/jansel
ghstack dependencies: #133712, #133769, #133778
2024-08-19 22:14:34 +00:00
ff9be0eda9 [dynamo] simplify implementation for functools.reduce (#133778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133778
Approved by: https://github.com/jansel
ghstack dependencies: #133712, #133769
2024-08-19 22:14:33 +00:00
59ca56e56c [dynamo] simplify polyfill registration for builtins.all and builtins.any (#133769)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133769
Approved by: https://github.com/jansel
ghstack dependencies: #133712
2024-08-19 22:14:33 +00:00
641724ed1d [RFC][dynamo] add decorator to register polyfill for unsupported C++ function to avoid graph break (#133712)
Add decorator `torch.compiler.substitute_in_graph` to register a polyfill for an unsupported C++ function and avoid a graph break. This API provides an official way to add Dynamo support for third-party C extensions. Also, it can be used to simplify our implementation of `torch._dynamo.polyfill`.

5ee070266f/torch/_dynamo/variables/builtin.py (L97-L107)

Example:

```python
>>> import operator
>>> operator.indexOf([1, 2, 3, 4, 5], 3)
2

>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
Unsupported: ...

>>> @torch.compiler.substitute_in_graph(operator.indexOf)
... def indexOf(sequence, x):
...     for i, item in enumerate(sequence):
...         if item is x or item == x:
...             return i
...     raise ValueError("sequence.index(x): x not in sequence")

>>> torch.compile(operator.indexOf, fullgraph=True)([1, 2, 3, 4, 5], 3)
2
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133712
Approved by: https://github.com/jansel
2024-08-19 22:14:33 +00:00
8de56e2958 Add MaskedTensor support to *_like API (#128637)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128637
Approved by: https://github.com/cpuhrsch
2024-08-19 22:13:59 +00:00
14ddd932fd Add MaskedTensor support to _is_any_true (#128574)
Fixes #128557

If there is a better way to detect autograd anomalies consistently, feel free to share your ideas. This is a dirty check.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128574
Approved by: https://github.com/cpuhrsch
2024-08-19 21:34:31 +00:00
432638f521 Remove useless environment in reusable workflow (#133659)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133659
Approved by: https://github.com/Skylion007
2024-08-19 20:44:17 +00:00
d131048056 Change install_triton to do git checkout, apply patch, pip install (#133878)
Fixes Docker builds: https://github.com/pytorch/pytorch/actions/runs/10458684809/job/28961048777

Follow up after https://github.com/pytorch/pytorch/pull/133694 to apply same patch to Docker build.

Rather than doing:
```
pip_install "git+${TRITON_REPO}@${TRITON_PINNED_COMMIT}#subdirectory=python"
```

We now use 4 steps: git clone, git checkout, apply patch, pip install.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133878
Approved by: https://github.com/malfet, https://github.com/ZainRizvi
2024-08-19 20:42:50 +00:00
66d6d8b1b9 Support TORCH_COMPILER_COLLECTIVES envvar (#133696)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133696
Approved by: https://github.com/Skylion007, https://github.com/c-p-i-o
2024-08-19 20:13:04 +00:00
0d4eacb9d2 [fake tensor] unbacked symint support for binary op fast path (#133584)
Addresses https://github.com/pytorch/pytorch/issues/133525

We have an unbacked symint in `final_shape` and it's a tuple... So, add `guard_size_oblivious` to do size oblivious checks + `sym_eq` for list equality.

```
op.shape
> torch.Size([1])
final_shape
> (u0 + 1,)
```
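A minimal sketch of that check, assuming the helpers named above (it mirrors the kind of comparison the fast path needs, not the exact fake-tensor code):

```python
from torch.fx.experimental.symbolic_shapes import guard_size_oblivious, sym_eq

def shapes_match(op_shape, final_shape):
    # sym_eq compares the (possibly symbolic) entries element-wise;
    # guard_size_oblivious evaluates the result without guarding on
    # unbacked symbols such as u0.
    return guard_size_oblivious(sym_eq(tuple(op_shape), tuple(final_shape)))
```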

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133584
Approved by: https://github.com/ezyang
2024-08-19 20:03:05 +00:00
565e2ea019 Scale XBLOCK in triton for pointwise (#133300)
Adjust https://github.com/pytorch/pytorch/pull/128826 to also cover `triton_heuristics.pointwise`.

An example we encountered during training qwen-7b with rocm 6.1:

Note: this kernel also hit the limit of `TRITON_MAX_BLOCK['X']`, shall we increase it from 2048 to 4096?

```

import torch

aten = torch.ops.aten
inductor_ops = torch.ops.inductor
assert_size_stride = torch._C._dynamo.guards.assert_size_stride
empty_strided_cpu = torch._C._dynamo.guards._empty_strided_cpu
empty_strided_cuda = torch._C._dynamo.guards._empty_strided_cuda
reinterpret_tensor = torch._C._dynamo.guards._reinterpret_tensor
alloc_from_pool = torch.ops.inductor._alloc_from_pool

import triton
import triton.language as tl
from triton.compiler.compiler import AttrsDescriptor

from torch._inductor.runtime import triton_heuristics
from torch._inductor.runtime.hints import DeviceProperties

@triton_heuristics.pointwise(
    size_hints=[8589934592],
    filename=__file__,
    triton_meta={'signature': {0: '*bf16'}, 'device': DeviceProperties(type='hip', index=2, cc='gfx942', major=None, regs_per_multiprocessor=None, max_threads_per_multi_processor=None, multi_processor_count=None), 'constants': {}, 'configs': [AttrsDescriptor(divisible_by_16=(0, 1), equal_to_1=())]},
    inductor_meta={'autotune_hints': set(), 'kernel_name': 'triton_poi_fused_nll_loss_backward_0', 'mutated_arg_names': [], 'no_x_dim': False, 'num_load': 0, 'num_reduction': 0, 'backend_hash': None, 'are_deterministic_algorithms_enabled': False, 'assert_indirect_indexing': True, 'autotune_local_cache': True, 'autotune_pointwise': True, 'autotune_remote_cache': False, 'force_disable_caches': False, 'dynamic_scale_rblock': True, 'max_autotune': False, 'max_autotune_pointwise': False, 'min_split_scan_rblock': 256, 'spill_threshold': 16, 'store_cubin': False, 'is_hip': True},
    min_elem_per_thread=0
)
@triton.jit
def triton_(out_ptr0, XBLOCK : tl.constexpr):
    xoffset = tl.program_id(0).to(tl.int64) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:].to(tl.int64)
    x0 = xindex
    tmp0 = 0.0
    tl.store(out_ptr0 + (x0), tmp0, None)

import triton
import triton.language as tl
from torch._inductor.runtime.triton_heuristics import grid
from torch._C import _cuda_getCurrentRawStream as get_raw_stream

if __name__ == "__main__":
    with torch.cuda._DeviceGuard(2):
        torch.cuda.set_device(2)
        buf0 = empty_strided_cuda((32752, 151936), (151936, 1), torch.bfloat16)
        stream2 = get_raw_stream(2)
        triton_.run(buf0, grid=grid(4976207872), stream=stream2)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133300
Approved by: https://github.com/jansel
2024-08-19 19:41:55 +00:00
fb26b84390 Update fused kernels and call _safe_softmax from SDPA (#133882)
# UPDATE:
This is take 3 of https://github.com/pytorch/pytorch/pull/131863, which was landed via co-dev but did not apply correctly.

# Summary
Changes the stance of SDPA on what to do for fully masked out rows

## Current Behavior
Several PyTorch users have expressed frustration over this issue:
- https://github.com/pytorch/pytorch/issues/41508
- https://github.com/pytorch/pytorch/issues/103749
- https://github.com/pytorch/pytorch/issues/103963

These are significant issues with extensive discussion but no satisfactory resolution. The PyTorch team's consensus, as stated here:
https://github.com/pytorch/pytorch/issues/24816#issuecomment-524415617

Can be paraphrased as follows:

When passing in fully masked out rows, attention becomes ambiguous. We have two main options:

1. Uniformly attend to all values:
   ```python
   scores[masked_out_rows] = 1 / len(row)
   out[masked_out_rows] = 1 / len(row) * value
   ```

2. Decide that attention between no queries (masked) and no keys (masked) is meaningless:
   ```python
   output[fully_masked_rows] = NaN
   ```

We went with option 2. Partially because it was easier to implement, but also people argued that users can slice the output to remove the NaNs:
``` Python
>fill_value = -float("inf")
>row0 = torch.randn(4)
>row1 = torch.tensor([fill_value for _ in range(4)])
>matrix = torch.stack([row0, row1]).requires_grad_(True)
>out = torch.softmax(matrix, 1)
>out = out[0]
>print(out)
tensor([0.5377, 0.2729, 0.0692, 0.1201])
```
Cool, problem solved. But what happens when you call backward...
```Python
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[3.0957e-08, 1.4157e-08, 7.7802e-10, 1.3713e-08],
        [       nan,        nan,        nan,        nan]])
```
Those pesky NaNs are back!

## Why do we see NaNs today?

The core of the problem revolves around using softmax function in sdpa:

```python
> row = torch.tensor([(-float("inf")) for _ in range(4)])
> torch.softmax(row, 0)
tensor([nan, nan, nan, nan])
```

## Quick Aside: Masking in Attention

Attention itself doesn't have a concept of masking. The `sdpa` function has an argument called `attn_mask`, which would be more accurately named `attn_bias`. This is because we don't actually "mask" entries when computing attention. Instead, due to implementation details([performance](https://github.com/pytorch/pytorch/issues/25110#issuecomment-524519087)), we add a value to the masked-out query/key pairs.

We use a large negative number (typically -inf) to decrease the attention weight, as softmax assigns more weight to larger values.

## Alternative Approaches

If we use a very large negative number instead of -inf:

```python
> row = torch.tensor([(-1e6) for _ in range(4)])
> torch.softmax(row, 0)
tensor([0.2500, 0.2500, 0.2500, 0.2500])
```
However, if users always remembered to "slice" out their outputs, i.e.:
```Python
>fill_value = -1e6
>...
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[-0.0563, -0.0564,  0.1613, -0.0486],
        [ 0.0000,  0.0000,  0.0000,  0.0000]])
```
This would bring us back into a better state.

## A Third Option

We don't necessarily need to alter the behavior of softmax for -inf or very large negative numbers. The fundamental goal is to exclude certain query/key pairs from attention, regardless of the underlying implementation.

This PR implements the new semantic for masking w/ attention in fully masked-out rows:
```python
out[masked_out_rows] = 0
```

**Important Note**: This idea isn't entirely new. The [MaskedTensor](https://pytorch.org/tutorials/prototype/maskedtensor_overview#safe-softmax) prototype, a tensor subclass, was designed to handle such cases. However, it remains a prototype feature and hasn't gained widespread adoption.

## Details
This PR stack does 3 things:
1. Adds a PRIVATE _safe_softmax op
2. Updates semantic for flash_cpu fused kernel
3. Updates semantic for efficient_cuda fused kernel

_safe_softmax is not supposed to be used generically and is only meant to be used within the context of SDPA. Due to this fact, instead of decomposing softmax and checking for -inf rows, we "cheat" and use nan_to_num.

Why do I think this is okay? (Please find a counterpoint if available.)
There are multiple ways NaNs can emerge. For the fully masked out rows case nan_to_num works. But what if there were other NaNs, wouldn't this silently remove them?

The only case that this can happen is if the input itself had a NaN or an Inf
For example:
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = torch.finfo(torch.float16).max
print(a.softmax(-1))
```
Will return
`tensor([0., 1., 0., 0.], dtype=torch.float16)`

Where
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = float("inf")
a.softmax(-1)
```
returns:
`tensor([nan, nan, nan, nan], dtype=torch.float16)`

If we don't want to allow even the possibility of "inf" or "NaN" attention scores being converted to 0, then we can implement it something like this:

```Python
max = torch.max(a, dim=-1, keepdim=True)
exp = torch.exp(a - max.values)
denom = torch.sum(exp, dim=-1, keepdim=True)
softmax = exp / denom
softmax = torch.where(max.values == float('-inf'), 0.0, softmax)
```
however we would be paying for this in math performance.

## Why Now
I think one point that has substantially changed where PyTorch should lie on this argument is the fact that we have fused implementations for SDPA now. And these fused implementations allow us to easily and performantly support this new semantic.

Differential Revision: [D61418679](https://our.internmc.facebook.com/intern/diff/D61418679)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133882
Approved by: https://github.com/soulitzer
2024-08-19 18:53:11 +00:00
f1dc3b108a Back out "[export] fix test for training ir migration" (#133697)
Summary:
Original commit changeset: 0a1cb57e0338

Original Phabricator Diff: D61223356

Test Plan: buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/export:export_rle_model -- -r  test_export_rle_model

Reviewed By: tugsbayasgalan

Differential Revision: D61395818

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133697
Approved by: https://github.com/tugsbayasgalan
2024-08-19 18:30:42 +00:00
a8619c9a1d Add nitpicker, which allows adding comments to PRs when they match a file pattern (#133861)
This message would have helped avoid https://www.internalfb.com/sevmanager/view/440895

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133861
Approved by: https://github.com/albanD, https://github.com/izaitsevfb
2024-08-19 18:29:59 +00:00
64d9afd8a7 Register nll_loss2d decompositions for core aten (#133534)
When exporting a training model for Executorch (which requires all ops to be core aten) with cross entropy loss (`torch.nn.CrossEntropyLoss`), we ran into the following error from the fx verifier in `to_edge`:

```
torch._export.verifier.SpecViolationError: Operator torch._ops.aten.nll_loss2d_forward.default is not Aten Canonical.
```
The aten [implementation](https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/LossNLL.cpp#L624) of `torch.nn.CrossEntropyLoss` uses `nll_loss2d_forward` for inference and `nll_loss2d_backward` for training, so we need to add the decompositions for both (which already exist) to the list of core aten decompositions.
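A hedged consumer-side sketch of what this enables (exact ExecuTorch plumbing omitted): after this change, the nll_loss2d decompositions should appear in the core ATen decomposition table applied before `to_edge`.

```python
import torch
from torch._decomp import core_aten_decompositions

aten = torch.ops.aten
decomp_table = core_aten_decompositions()
assert aten.nll_loss2d_forward.default in decomp_table
assert aten.nll_loss2d_backward.default in decomp_table
```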
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133534
Approved by: https://github.com/JacobSzwejbka
2024-08-19 18:26:48 +00:00
ad7dda7b32 [CI] Bump up TIMM pin (#133528)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133528
Approved by: https://github.com/angelayi
2024-08-19 18:13:57 +00:00
773a782249 Decompose _unsafe_index_put into index_put (#133365)
## Description
Create decomposition of _unsafe_index_put (non-core aten) that turns it into index_put (core aten)
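A hedged sketch of what such a decomposition can look like (registered into a local table for illustration; the actual registration lives in PyTorch's decomposition modules):

```python
import torch
from torch._decomp import register_decomposition

aten = torch.ops.aten
my_decomp_table = {}

@register_decomposition(aten._unsafe_index_put.default, registry=my_decomp_table)
def _unsafe_index_put_decomp(self, indices, values, accumulate=False):
    # "_unsafe" only skips bounds checking, so the core-aten index_put is a
    # behaviorally valid replacement.
    return aten.index_put.default(self, indices, values, accumulate)
```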

## Testing
Phi3 mini + LoRA model successfully passed `to_edge` after failing due to a non-core aten `unsafe_index_put` getting introduced in a decomposition during joint graph calculations.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133365
Approved by: https://github.com/pianpwk
2024-08-19 18:07:23 +00:00
517aee5369 [torchscript] Add a sampled logging integration point. (#133484)
Test Plan:
test script:
```
    def test_zhxchen17(self):
        from libfb.py.pyinit import initFacebook

        initFacebook()

        class M(torch.nn.Module):
            def forward(self, x):
                return torch.add(x, x)

        def tmptmp(x, y):
            return torch.mul(x, y)

        m = M()
        n = torch.jit.script(m)
        print(n(torch.tensor(1)))
        print(torch.jit.script(tmptmp)(torch.tensor(1), torch.tensor(2)))
```

```
I0802 12:01:23.932929 4079081 init.cc:407] Logging to scuba: run __torch__.caffe2.test.export.test_export.M.forward sample rate: 1000000
```

Differential Revision: D60920867

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133484
Approved by: https://github.com/davidberard98
2024-08-19 18:04:45 +00:00
6564e746ed [PT2] Port remove_noop to PT2 pre_grad passes (#132183)
Summary: migrate to aten IR, `reshape` -> `view.default`; not covering `flatten` since there is already optimization done for it in PT2, see the example here P1506057533

Differential Revision: D60476525

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132183
Approved by: https://github.com/frank-wei
2024-08-19 17:46:51 +00:00
da69a28c6f [pipelining] Add schedule runtime for lowered schedule (#130488)
Creates a new runtime that shifts complexity from runtime to
ahead-of-time.

The existing runtime (PipelineScheduleMulti) accepts a compute-only schedule
(only forward, backward, and weight actions are specified) and infers the
communication operations at runtime.
Compared to that runtime, PipelineScheduleRuntime has less logic that
happens at runtime and relies on lowering passes to transform the
compute-only schedule to add communications.

Advantages include
- easier to verify the correctness by dumping a compute+comm schedule
- possible to manually edit the compute+comm schedule if the lowering
  heuristics are insufficient

Functionality included inside the PipelineScheduleRuntime is limited to
- accepting a compute-only schedule and lowering it to add comms
- executing the compute or comm operations specified by the given
  schedule
- handling work.wait() automatically by calling it just before the
  matching compute operation (for RECV ops) or at the end of step (for
  SEND ops)

Follow ups for later PRs
- Some refactoring should be done to replace PipelineScheduleMulti with
  this runtime
- Optimizer execution is not considered (e.g. for zero-bubble cases)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130488
Approved by: https://github.com/H-Huang
2024-08-19 17:44:24 +00:00
f31404ba6f Revert "Update xpu CD used driver to rolling version (#133454)"
This reverts commit 32ed4a3beb746c94c702c80c79c812e45ab3b2f4.

Reverted https://github.com/pytorch/pytorch/pull/133454 on behalf of https://github.com/ZainRizvi due to Sorry, there's [an outage](https://github.com/triton-lang/triton/issues/4527) that's preventing triton from being installed correctly, which has the side effect of breaking our docker builds. Reverting this PR since it requires a docker rebuild (which now fails) to give us more time to properly fix the docker builds. ([comment](https://github.com/pytorch/pytorch/pull/133454#issuecomment-2297073937))
2024-08-19 17:28:50 +00:00
6ca68357b3 [dynamo] Save class vt in UserDefinedObjectVariable (#133800)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133800
Approved by: https://github.com/jansel
ghstack dependencies: #133745, #133747, #133746, #133799
2024-08-19 17:21:48 +00:00
08f14d5492 [refactor][dynamo][side-effects] Helper function for __new__ for user defined class (#133799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133799
Approved by: https://github.com/jansel
ghstack dependencies: #133745, #133747, #133746
2024-08-19 17:21:48 +00:00
d6f30b91e5 Add a smaller default config option for decode (#133646)
## Before
A100
| Type    |   Speedup | score_mod   | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)     |
|---------|-----------|-------------|------------|----------------|---------------------------|
| Average |     0.461 |             |            |                |                           |
| Max     |     0.996 | None        | causal     | torch.bfloat16 | (16, 16, 1, 16, 1024, 64) |
| Min     |     0.188 | None        | causal     | torch.bfloat16 | (2, 16, 1, 16, 512, 128)  |

H100
| Type    |   Speedup | score_mod   | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)     |
|---------|-----------|-------------|------------|----------------|---------------------------|
| Average |     4.528 |             |            |                |                           |
| Max     |    16.710 | None        | offset     | torch.bfloat16 | (2, 16, 1, 2, 4096, 64)   |
| Min     |     1.612 | None        | offset     | torch.bfloat16 | (16, 16, 1, 16, 512, 128) |

## After

A100:
| Type    |   Speedup | score_mod   | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)     |
|---------|-----------|-------------|------------|----------------|---------------------------|
| Average |     0.472 |             |            |                |                           |
| Max     |     1.110 | None        | causal     | torch.bfloat16 | (16, 16, 1, 16, 1024, 64) |
| Min     |     0.182 | None        | causal     | torch.bfloat16 | (2, 16, 1, 16, 4096, 128) |

H100:
| Type    |   Speedup | score_mod   | mask_mod   | dtype          | shape(B,Hq,M,Hkv,N,D)     |
|---------|-----------|-------------|------------|----------------|---------------------------|
| Average |     4.535 |             |            |                |                           |
| Max     |    16.691 | None        | offset     | torch.bfloat16 | (2, 16, 1, 2, 4096, 64)   |
| Min     |     1.607 | None        | offset     | torch.bfloat16 | (16, 16, 1, 16, 512, 128) |

### Failing example code

``` Python
import torch
import torch.nn as nn
import functools
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

class AttentionModel(nn.Module):
    def __init__(self, initial_kv_len):
        super().__init__()
        self.kv_len = initial_kv_len
        self.q_len = 1

    def causal_mask_decode(self, b, h, q_idx, kv_idx):
        offset = self.kv_len - self.q_len
        return offset + q_idx >= kv_idx

    def forward(self, queries, keys, values, mask):
        self.kv_len = keys.shape[-2]
        bs, nh, seq_len, _ = queries.shape

        attention = functools.partial(flex_attention, block_mask=mask, enable_gqa=True)
        attention = torch.compile(attention)
        attn_output = attention(queries, keys, values)

        return attn_output

# Driver code
def main():
    # Set up parameters
    d_model = 256
    q_heads = 32
    kv_heads = 8
    kv_len = 128
    q_len = 1
    batch_size = 1

    # Initialize the model
    model = AttentionModel(kv_len)
    mask = create_block_mask(
        lambda a, b, c, d: model.causal_mask_decode(a, b, c, d), 1, 1, q_len, kv_len
    )

    # Create sample input tensors
    queries = torch.randn(batch_size, q_heads, q_len, d_model, device="cuda")
    keys = torch.randn(batch_size, kv_heads, kv_len, d_model, device="cuda")
    values = torch.randn(batch_size, kv_heads, kv_len, d_model, device="cuda")

    # Forward pass
    output = model(queries, keys, values, mask)

    print(f"Input shapes:")
    print(f"  Queries: {queries.shape}")
    print(f"  Keys: {keys.shape}")
    print(f"  Values: {values.shape}")
    print(f"Output shape: {output.shape}")

if __name__ == "__main__":
    main()

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133646
Approved by: https://github.com/Chillee, https://github.com/joydddd
2024-08-19 17:13:26 +00:00
e37eef8a7b return state dict without optimized module (#132626)
Fixes #123625

We should consider changing the current behaviour and making it similar to 1fb498d6e3/torch/distributed/algorithms/_checkpoint/checkpoint_wrapper.py (L69-L101)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132626
Approved by: https://github.com/williamwen42
2024-08-19 16:58:41 +00:00
8d404581fc Revert "[ONNX] New export logic leveraging ExportedProgram and ONNX IR (#132530)"
This reverts commit 5fab35d77c7d1db7dbb9d5c516254a510b4f4f64.

Reverted https://github.com/pytorch/pytorch/pull/132530 on behalf of https://github.com/ZainRizvi due to Sorry but it seems like Dr. CI incorrectly flagged the [pull / linux-docs / build-docs-python-false](https://hud.pytorch.org/pr/pytorch/pytorch/132530#28918577682) failure as being flaky. The job started failing consistently on CI once your PR was merged. [GH job link](https://github.com/pytorch/pytorch/actions/runs/10454830880/job/28949386844) [HUD commit link](5fab35d77c) ([comment](https://github.com/pytorch/pytorch/pull/132530#issuecomment-2297001423))
2024-08-19 16:47:15 +00:00
68fcd54226 Lower cache mocking to test more pytorch code (#133579)
Summary: Previously we were mocking out FbRemoteFxGraphCacheBackend, which meant we were not testing a whole bunch of the cache code. Mock at a lower level (CacheClient, LocalAutotuneCacheBackend, ManifoldClient, Redis) so we cover a larger amount of the caching code.

Test Plan: unit tests

Reviewed By: oulgen

Differential Revision: D60937966

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133579
Approved by: https://github.com/oulgen
2024-08-19 16:32:36 +00:00
32ed4a3beb Update xpu CD used driver to rolling version (#133454)
The main purpose of this PR is to change the XPU CD to use the rolling driver, supporting AOT builds for more client GPUs and enabling Kineto. We also plan to enable Python 3.13 for the XPU CD.

Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133454
Approved by: https://github.com/atalman
2024-08-19 16:01:47 +00:00
df6831562c [Flight Recorder] Add more basic analysis to the script (#133412)
This is the first step toward making sure we have a basic, functional analyzer for FR in production.

- We want to use this script to find out abnormalities in collectives and report it to users.
- We also fixed some type errors.

- [Ongoing] Also we will add more unit tests to this script and make it modularized so that we can better maintain it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133412
Approved by: https://github.com/c-p-i-o, https://github.com/atalman
2024-08-19 15:55:00 +00:00
76b0284744 Revert "[inductor][cpp] complete vectorization for int32/int64 (#122961)"
This reverts commit 99b3b58f39507bb8ad5b4bb1b9bedf7f47b64fa3.

Reverted https://github.com/pytorch/pytorch/pull/122961 on behalf of https://github.com/atalman due to Breaks slow jobs: inductor/test_cpu_repro.py::CPUReproTests::test__adaptive_avg_pool2d [GH job link](https://github.com/pytorch/pytorch/actions/runs/10432403692/job/28893704833) [HUD commit link](a0ef8888e6) ([comment](https://github.com/pytorch/pytorch/pull/122961#issuecomment-2296852418))
2024-08-19 15:29:15 +00:00
318d3b39c4 Revert "[Inductor][CPP] Support vectorization of load_seed and randn (#130317)"
This reverts commit a0ef8888e60d934ae7e4ddaec1c1274b12d0d39d.

Reverted https://github.com/pytorch/pytorch/pull/130317 on behalf of https://github.com/atalman due to Breaks slow jobs: inductor/test_cpu_repro.py::CPUReproTests::test__adaptive_avg_pool2d [GH job link](https://github.com/pytorch/pytorch/actions/runs/10432403692/job/28893704833) [HUD commit link](a0ef8888e6) ([comment](https://github.com/pytorch/pytorch/pull/130317#issuecomment-2296819045))
2024-08-19 15:13:39 +00:00
5153550e4b [CI] Add FP32 dynamic, AMP static, AMP dynamic for AOT inductor accuracy CPU CI test (#132836)
This PR adds 3 more accuracy tests for the AOT Inductor CPU side.
1. FP32 dynamic shape accuracy test, torchbench suite
2. AMP static shape accuracy test, torchbench suite
3. AMP dynamic shape accuracy test, torchbench suite

**Test Time cost:**
| Precision 	| Shape Type 	| Suite      	| Time cost 	|
|-----------	|------------	|------------	|-----------	|
| FP32      	|    dynamic 	| Torchbench 	|  1h40m         	|
| AMP       	|     Static 	| Torchbench 	|  1h38m        	|
| AMP       	|    dynamic 	| Torchbench 	|  1h48m        	|

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132836
Approved by: https://github.com/desertfire
2024-08-19 14:26:48 +00:00
5fab35d77c [ONNX] New export logic leveraging ExportedProgram and ONNX IR (#132530)
1/n PR to

- Move code from torch-onnx at commit 395495e566 into torch.onnx and fix imports.
- Integrate the new export logic with the torch.onnx.export API and include basic set of tests.
- Refactor the API for the change.
- Improve documentation.

Next PRs will be more tests and docs.

Fix https://github.com/pytorch/pytorch/issues/129277
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132530
Approved by: https://github.com/titaiwangms, https://github.com/malfet
2024-08-19 14:01:07 +00:00
92151c814b [ROCm] Set _HAS_PYNVML to false if amdsmi not installed (#132990)
This is a bugfix for an issue recently encountered in ROCm/DeepSpeed. Currently, if a library installs pynvml and runs on ROCm, PyTorch breaks because _HAS_PYNVML is set to true and PyTorch attempts to use the amdsmi library for the device_count call even though it is not installed.

This fix will set _HAS_PYNVML to false on ROCm if amdsmi is not installed.
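A minimal sketch of the guarded import this fix implies (simplified relative to the actual torch.cuda logic):

```python
import torch

try:
    import pynvml  # noqa: F401
    _HAS_PYNVML = True
except ImportError:
    _HAS_PYNVML = False

if torch.version.hip is not None:
    # On ROCm the device queries go through amdsmi, so without it we must
    # fall back to the non-pynvml code path.
    try:
        import amdsmi  # noqa: F401
    except ImportError:
        _HAS_PYNVML = False
```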

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132990
Approved by: https://github.com/pruthvistony, https://github.com/eqy, https://github.com/malfet
2024-08-19 09:45:58 +00:00
0a976b8899 Enable bf16 float32 mkldnn matmul when float32 precision is 'medium' (#130919)
This fixes an issue on AArch64 CPUs supporting BF16, where torch.set_float32_matmul_precision("highest") did not disable the bf16 downconversion in mkldnn_matmul.

This was discovered from a unit test failure where the decorator `torch.testing._internal.common_mkldnn.bf32_on_and_off`, which internally switches the float32_matmul_precision between "medium" and "highest", was not having the desired effect.
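Usage sketch of the knob that this fix makes mkldnn_matmul respect on BF16-capable AArch64 CPUs:

```python
import torch

a = torch.randn(64, 64)
b = torch.randn(64, 64)

torch.set_float32_matmul_precision("medium")   # bf16 downconversion allowed
c_fast = a @ b

torch.set_float32_matmul_precision("highest")  # must stay in full fp32
c_exact = a @ b
```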

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130919
Approved by: https://github.com/jgong5
2024-08-19 09:18:12 +00:00
8b6b1721c8 remove StrobelightCompileTimeProfiler.profile_compile_time from stacktrace when strobelight profiling not enabled (#133831)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133831
Approved by: https://github.com/oulgen
2024-08-19 09:14:52 +00:00
4bae7ae3d9 [DeviceMesh][Easy] Fix typo (#133790)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133790
Approved by: https://github.com/Skylion007
2024-08-19 05:20:22 +00:00
35f36363ec Revert "[dtensor] move DTensor to public namespace (#133113)"
This reverts commit 2ee6b97464d17fcf4c1fc67c29868fa30d0c16e1.

Reverted https://github.com/pytorch/pytorch/pull/133113 on behalf of https://github.com/wanchaol due to looks like it break some internal type imports ([comment](https://github.com/pytorch/pytorch/pull/133113#issuecomment-2295670911))
2024-08-19 05:00:19 +00:00
42e61c783c [Inductor][CPP] Align Half load with BFloat16 load (#132011)
Remove `static_cast<float>` for Half load to align with BFloat16.
Before:
```
extern "C"  void kernel(const half* in_ptr0,
                       half* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(20L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = static_cast<float>(in_ptr0[static_cast<long>(x0)]);
            out_ptr0[static_cast<long>(x0)] = tmp0;
        }
    }
}
```

After:
```
extern "C"  void kernel(const half* in_ptr0,
                       half* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(20L); x0+=static_cast<long>(1L))
        {
            auto tmp0 = in_ptr0[static_cast<long>(x0)];
            out_ptr0[static_cast<long>(x0)] = tmp0;
        }
    }
}

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132011
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-08-19 04:52:39 +00:00
ae00063570 Change default runner's AMI to Amazon 2023 AMI - Part 1 (#133641)
Upgrades the LF scale configs to change the default AMI in accordance with the Amazon 2023 rollout plan.

This PR will be merged on Monday Aug 19 in the morning, and over the next 2-3 days as new linux runners are spun up (and old ones spun down) they'll start using this new AMI

This PR will be paired with https://github.com/pytorch/test-infra/pull/5558, which will be merged after this one
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133641
Approved by: https://github.com/jeanschmidt
2024-08-19 01:32:25 +00:00
e72e924eb5 Add correct typing annotations to rsample() for all distributions (#133516)
Fixes #133514
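A small sketch of the annotation shape this gives callers (simplified; the real signatures live on each Distribution subclass):

```python
import torch
from torch.distributions import Normal

def draw(dist: Normal, shape: torch.Size = torch.Size()) -> torch.Tensor:
    sample: torch.Tensor = dist.rsample(shape)  # rsample() is now annotated to return Tensor
    return sample

draw(Normal(0.0, 1.0), torch.Size([3]))
```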
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133516
Approved by: https://github.com/Skylion007
2024-08-18 20:31:54 +00:00
eqy
c0c82a5f6a [CUDA][SDPA] Bump tolerances for test_mem_efficient_attention_attn_mask_vs (#133738)
Same thing as #133051 but for efficient attention

CC @drisspg @nWEIdia

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133738
Approved by: https://github.com/drisspg, https://github.com/nWEIdia, https://github.com/Skylion007
2024-08-18 19:14:29 +00:00
cf60fe53a8 [BE]: Update Typeguard to TypeIs for better type inference (#133814)
Uses TypeIs instead of TypeGuard for better inference. See https://peps.python.org/pep-0742/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133814
Approved by: https://github.com/ezyang
2024-08-18 19:10:16 +00:00
cyy
0d4cedaa47 [13/N] Fix clang-tidy warnings in aten/src/ATen (#133807)
Follows #133425

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133807
Approved by: https://github.com/ezyang, https://github.com/Skylion007
2024-08-18 17:54:12 +00:00
cyy
47ed5f57b0 [12/N] Fix clang-tidy warnings in aten/src/ATen (#133425)
Follows  #133758

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133425
Approved by: https://github.com/ezyang
2024-08-18 11:03:55 +00:00
fbd020fce6 Add new prop to _XpuDevicePropertie for triton gemm optimization (#131738)
# Motivation
This PR aims to add new properties to `_XpuDevicePropertie` for triton gemm optimization.

# Additional Context
`ext_oneapi_supports_cl_extension` is not an ABI-neutral API. It depends on compiler 2025.0. For more details, see https://github.com/intel/llvm/pull/13212

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131738
Approved by: https://github.com/gujinghui
2024-08-18 08:32:30 +00:00
fed6096e73 [dynamo] Support object.__new__ call (#133746)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133746
Approved by: https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #133745, #133747
2024-08-18 07:18:52 +00:00
d56a395971 [dynamo] Support os.fspath (#133747)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133747
Approved by: https://github.com/yanboliang, https://github.com/Skylion007, https://github.com/jansel
ghstack dependencies: #133745
2024-08-18 07:18:52 +00:00
27dfd63ee8 remove unnecessary slicing in EffectTokensWrapper (#133737)
In cases where `outs` is a tensor, `[0:]` causes an additional slicing op that is unnecessary and fails some of XLA's unit tests.
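A minimal illustration (not the original repro) of why the `[0:]` mattered: on a Tensor it is not a no-op in the traced graph but shows up as a slice op.

```python
import torch
from torch.fx.experimental.proxy_tensor import make_fx

def f(outs):
    return outs[0:]

gm = make_fx(f)(torch.randn(3))
print(gm.graph)  # contains an aten.slice node even though the data is unchanged
```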
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133737
Approved by: https://github.com/IvanKobzarev
2024-08-18 05:52:48 +00:00
d717df2071 [compiled autograd] fix flaky tests due to torch.cuda.memory_allocated() != 0 (#133733)
FIXES https://github.com/pytorch/pytorch/issues/123949 https://github.com/pytorch/pytorch/issues/124376
torch.cuda.memory_allocated returns the amount of memory allocated in the current process, so if it isn't 0 it means another test didn't properly clean up after itself. I'm keeping the memory check and isolating these tests in subprocess as we don't have a good way to test for activation refcount

e.g. https://github.com/pytorch/pytorch/runs/28838386083
```
_______________ TestCompiledAutograd.test_free_activation_memory _______________
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/inductor/test_compiled_autograd.py", line 1892, in test_free_activation_memory
    self.assertTrue(torch.cuda.memory_allocated() == 0)
  File "/opt/conda/envs/py_3.10/lib/python3.10/unittest/case.py", line 687, in assertTrue
    raise self.failureException(msg)
AssertionError: False is not true
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133733
Approved by: https://github.com/jansel
2024-08-18 05:43:35 +00:00
cyy
fb9d2dc641 Remove Wno-invalid-partial-specialization from CMake (#133398)
The code base is clean enough that Winvalid-partial-specialization can be enabled.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133398
Approved by: https://github.com/ezyang
2024-08-18 04:06:21 +00:00
cyy
f8cf1829b5 [Reland] [11/N] Fix clang-tidy warnings in aten/src/ATen (#133758)
Reland of #133298. Remove possible changes that may increase the build time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133758
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-08-17 23:09:44 +00:00
0bde3c4f2f Run cudagraphs on AOTAutograd cache hit (#132294)
This threads through all of the necessary parts into AOT autograd from the FXGraphCache changes so that we can run cudagraphs properly on an AOTAutograd cache hit.

Specifics:
- AOTAutograd needs access to the `cudagraphs` boxedbool in order to properly set the backward to not use cudagraphs on a cache hit from the forward.
- We have lots of tests that test this already from the previous PR, so I just added an extra test and made the previous test work with both AOTAutogradCache and FXGraphCache at the same time.

```
TORCH_LOGS=torch._functorch._aot_autograd.autograd_cache,cudagraphs ENABLE_AOT_AUTOGRAD_CACHE=1 TORCHINDUCTOR_FX_GRAPH_CACHE=1 tlp python benchmarks/gpt_fast/benchmark.py --output ~/gpt_fast_benchmark.csv
```
Run twice: once on a cache miss and once on a cache hit.

Here is the perfetto trace for each(FB only link):

**Cache Miss:**
Logs:
```
Loading model Llama-2-7b-chat-hf
Time to load model: 0.66 seconds
I0813 10:53:34.416000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:479] [0/0] AOTAutograd cache miss for key alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey
I0813 10:53:51.395000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:558] [0/0] Writing AOTAutograd cache entry to /tmp/torchinductor_jjwu/aotautograd/alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey/entry
I0813 10:54:17.579000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:479] [1/0] AOTAutograd cache miss for key a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt
I0813 10:54:38.636000 911030 torch/_functorch/_aot_autograd/autograd_cache.py:558] [1/0] Writing AOTAutograd cache entry to /tmp/torchinductor_jjwu/aotautograd/a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt/entry
I0813 10:54:39.228000 911030 torch/_inductor/cudagraph_trees.py:385] [__cudagraphs] recording cudagraph tree for graph without symints
V0813 10:54:39.939000 911030 torch/_inductor/cudagraph_trees.py:2160] [__cudagraphs] Running warmup of function 0
V0813 10:55:10.615000 911030 torch/_inductor/cudagraph_trees.py:2119] [__cudagraphs] Recording function 0 of graph recording id 0
Compilation time: 101.24 seconds
Average tokens/sec: 147.96 tokens/sec
Average bandwidth achieved: 1955.22 GB/s
Memory used: 14.51 GB
```

Chromium Event(fb only):
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom%2Fchromium_events.json&local_cache_key

![image](https://github.com/user-attachments/assets/47fdd77e-3cc1-437e-8e68-7901646269bb)

**Cache Hit:**
Logs:
```
Loading model Llama-2-7b-chat-hf
Time to load model: 0.67 seconds
I0813 10:55:51.821000 944420 torch/_functorch/_aot_autograd/autograd_cache.py:474] [0/0] AOTAutograd cache hit for key alqchc7zw6ynsxj2bzktcsngu4cajwcb3tmhvwlyqkuinx3zhmey
I0813 10:55:55.465000 944420 torch/_functorch/_aot_autograd/autograd_cache.py:474] [1/0] AOTAutograd cache hit for key a3nq2ywjxku342c6ag7rsqkalnxfshlcgve3tb2bigg7a45uz6pt
I0813 10:55:56.030000 944420 torch/_inductor/cudagraph_trees.py:385] [__cudagraphs] recording cudagraph tree for graph without symints
V0813 10:55:56.192000 944420 torch/_inductor/cudagraph_trees.py:2160] [__cudagraphs] Running warmup of function 0
V0813 10:55:56.426000 944420 torch/_inductor/cudagraph_trees.py:2119] [__cudagraphs] Recording function 0 of graph recording id 0
Compilation time: 9.40 seconds
Average tokens/sec: 147.94 tokens/sec
Average bandwidth achieved: 1954.98 GB/s
Memory used: 14.51 GB
```
Chromium Event(fb only):
https://interncache-all.fbcdn.net/manifold/perfetto-artifacts/tree/ui/index.html?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom2%2Fchromium_events.json#!/viewer?url=https%3A%2F%2Finterncache-all.fbcdn.net%2Fmanifold%2Ftlparse_reports%2Ftree%2Flogs%2Fjjwu%2Fcustom2%2Fchromium_events.json&local_cache_key

![image](https://github.com/user-attachments/assets/9bdd14ec-d12a-4c89-8705-135c999ac746)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132294
Approved by: https://github.com/eellison
2024-08-17 21:24:54 +00:00
d6368985af [BE]: Fix setuptools not installed with Python 3.12 (#133561)
setuptools is not installed correctly for Python 3.12.
See https://github.com/python-poetry/poetry/issues/9630#issuecomment-2291114885

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133561
Approved by: https://github.com/Skylion007

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
2024-08-17 17:42:04 +00:00
b4a1673a67 profiler/unwind: include <dlfcn.h> for dladdr (#133582)
This fixes a compilation error on linux systems using the musl c library.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133582
Approved by: https://github.com/Skylion007, https://github.com/aaronenyeshi
2024-08-17 16:15:18 +00:00
215b14530a Add Half for sparse.mm reduce (#133672)
This PR adds Half support for sparse.mm reduce in the CPU backend.
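Usage sketch (CPU), assuming the existing `torch.sparse.mm` reduce overload:

```python
import torch

a = torch.randn(4, 6, dtype=torch.half).to_sparse_csr()
b = torch.randn(6, 3, dtype=torch.half)
out = torch.sparse.mm(a, b, "sum")  # reduce in {"sum", "mean", "amax", "amin"}
```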

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133672
Approved by: https://github.com/Skylion007
2024-08-17 15:20:39 +00:00
1c6fbae579 [Easy][dynamo] fix builtin function names for itertools (#133711)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133711
Approved by: https://github.com/Skylion007
2024-08-17 15:12:01 +00:00
a0ef8888e6 [Inductor][CPP] Support vectorization of load_seed and randn (#130317)
**Summary**
Enable the vectorization of `load_seed` and `randn`. For now, `randn` is using the reference implementation.

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_randn
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130317
Approved by: https://github.com/jgong5
ghstack dependencies: #122961
2024-08-17 07:15:57 +00:00
99b3b58f39 [inductor][cpp] complete vectorization for int32/int64 (#122961)
**Summary**
Implement complete vectorization of `index_expr` functionally. We also add heuristics, from a performance perspective, to resolve the regressions posted in https://github.com/pytorch/pytorch/pull/122961#issuecomment-2041336265 by disabling vectorization of specific (Fused) scheduler nodes:

- Heuristic 1: when the number of non-contiguous `index_expr/load/store` ops exceeds the threshold, we disable vectorization.
- Heuristic 2: when the total number of elements along the vec dim is less than `tiling_factor/2`, we disable vectorization.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122961
Approved by: https://github.com/jansel

Co-authored-by: leslie-fang-intel <leslie.fang@intel.com>
2024-08-17 07:07:49 +00:00
d5f6d68d68 [PT2] Resolve PT2 compatibility issue in slice and diff (#133740)
Summary:
# context
* when running IG FM training with PT2, we found a few graph breaks due to the torch.diff call in [jagged_tensor.py](https://fburl.com/code/cwssxabc)
```
_length: List[int] = (
    _length_per_key_from_stride_per_key(torch.diff(offsets), stride_per_key)
    if variable_stride_per_key
    else torch.sum(torch.diff(offsets).view(-1, stride), dim=1).tolist()
)
```
* looking into the failure, we found that the TORCH_CHECK in diff should be a TORCH_SYM_CHECK
* slice_forward error: df3d7729e, [tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpxXZ2em/index.html)
```
RestartAnalysis
Tried to use data-dependent value in the subsequent computation. This can happen when we encounter unbounded dynamic value that is unknown during tracing time.  You will need to explicitly give hint to the compiler. Please take a look at torch._check OR torch._check_is_size APIs.  Could not guard on data-dependent expression ((5*u37 + u38)//(u37 + u38)) < 0 (unhinted: ((5*u37 + u38)//(u37 + u38)) < 0).  (Size-like symbols: u38, u37)

ATTENTION: guard_size_oblivious would fix the error, evaluating expression to False.
Maybe you need to add guard_size_oblivious to framework code, see doc below for more guidance.

Potential framework code culprit (scroll up for full backtrace):
  File "/data/users/hhy/fbsource/buck-out/v2/gen/fbcode/e99934938a0abe90/aps_models/ads/icvr/__icvr_launcher_live__/icvr_launcher_live#link-tree/torch/_decomp/decompositions.py", line 771, in slice_forward
    if end_val < 0:
```
* after this diff: [tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpAhv2Sh/failures_and_restarts.html)
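
The error message above points at `torch._check` / `torch._check_is_size`. A minimal, hedged illustration of that workaround pattern (standalone example, not the actual jagged_tensor.py code):
```python
import torch
import torch._dynamo

torch._dynamo.config.capture_scalar_outputs = True  # let .item() produce an unbacked symint

@torch.compile(fullgraph=True)
def f(lengths: torch.Tensor):
    n = lengths.sum().item()     # data-dependent (unbacked) value
    torch._check_is_size(n)      # hint to the compiler: n is a valid, non-negative size
    return torch.ones(n)

f(torch.tensor([2, 3], dtype=torch.int64))
```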

Test Plan:
# command
* run model
```
TORCH_SHOW_CPP_STACKTRACES=1 TORCHDYNAMO_EXTENDED_DEBUG_CPP=1 TORCH_LOGS="+graph_code,output_code,dynamic,aot,guards,verbose_guards,recompiles,graph_breaks" TORCH_TRACE=/var/tmp/tt buck2 run fbcode//mode/opt fbcode//aps_models/ads/icvr:icvr_launcher_live -- mode=fmc/local_ig_fm_v4_mini training.pipeline_type=pt2
```
* generate tlparse
```
tlparse `ls -t /var/tmp/tt/* | head -1`
```

Reviewed By: ezyang

Differential Revision: D56339251

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133740
Approved by: https://github.com/ezyang
2024-08-17 06:07:21 +00:00
cd89bf77c8 [inductor][cpp][gemm] easy: adjust indentation of template, var renaming etc. (#133312)
Indent the template instructions separately from the generated code, for readability. Also, rename M0, N0, K0 to Mr, Nr, Kr ("r" meaning "register") for consistent naming.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133312
Approved by: https://github.com/Skylion007, https://github.com/leslie-fang-intel
ghstack dependencies: #132729, #132730
2024-08-17 05:49:14 +00:00
4dc9795ebf [refactor][easy] Directly call var_getattr method for PythonModuleVariable (#133745)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133745
Approved by: https://github.com/yanboliang
2024-08-17 05:30:01 +00:00
2ee6b97464 [dtensor] move DTensor to public namespace (#133113)
Moving DTensor to be in the public namespace, to formally add the
documentation page that includes all the public APIs. This includes:

* many path renames and path import fixes
* a dedicated doc page without too much content yet (adding in the next
  PRs)
* To preserve BC for users still using `torch.distributed._tensor`,
  I added a shim script to redirect old path calls to the new module

BC preservation is evidenced by the fact that all DTensor tests still pass
without changing the public imports, so it's safe to land the changes.
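
A hedged sketch of what the shim implies for user code (assuming the new public module is `torch.distributed.tensor`; the exact path may differ):
```python
from torch.distributed.tensor import DTensor                    # new public path
from torch.distributed._tensor import DTensor as LegacyDTensor  # old path, redirected by the shim

assert DTensor is LegacyDTensor
```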

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133113
Approved by: https://github.com/XilunWu
ghstack dependencies: #133305, #133306
2024-08-17 05:09:52 +00:00
1a4709cef5 [dtensor] add more documentations (#133306)
This PR adds more documentation for the DTensor APIs, to prepare for making
the module public

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133306
Approved by: https://github.com/XilunWu, https://github.com/tianyu-l, https://github.com/wz337
ghstack dependencies: #133305
2024-08-17 05:09:52 +00:00
addee9f4d1 [dtensor] add missing __all__ to public modules (#133305)
As titled, some submodules are missing __all__ for API exposure. This PR adds
the necessary __all__ to those modules and explicitly makes some non-public
APIs private.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133305
Approved by: https://github.com/XilunWu, https://github.com/tianyu-l, https://github.com/wz337
2024-08-17 05:09:48 +00:00
702c810780 move param's device check to _init_group for fused (#131153)
There can be cases where the params are on the meta device when the optimizer's __init__ is called and are materialized only in the first computation. This change allows that situation.
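
A hedged sketch of the scenario described above (params live on the meta device at construction time; the fused-path device check now happens in `_init_group`, so construction is expected to succeed):
```python
import torch

model = torch.nn.Linear(8, 8, device="meta")           # params not yet materialized
opt = torch.optim.AdamW(model.parameters(), fused=True)  # previously rejected in __init__
```
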
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131153
Approved by: https://github.com/mlazos, https://github.com/janeyx99

Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com>
2024-08-17 04:49:47 +00:00
12b8e29203 Add a fudge factor to ephemeral NCCL timeout increase (#133722)
Differential Revision: [D61422431](https://our.internmc.facebook.com/intern/diff/D61422431)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133722
Approved by: https://github.com/c00w, https://github.com/aorenste
ghstack dependencies: #133504
2024-08-17 03:08:40 +00:00
695d7db2d6 remove dead code for suggesting legacy dynamic shapes fixes (#133700)
Summary: `dynamic_dim` based dynamic shapes are long gone, so pretty-printing suggested fixes for them is dead code.

Test Plan: existing tests

Differential Revision: D61398303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133700
Approved by: https://github.com/zhxchen17
2024-08-17 01:59:34 +00:00
455f6bda56 Add cache timings info to tlparse (#133504)
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpLR1T85/rank_1/0_0_0/fx_graph_cache_hash_11.json

Differential Revision: [D61422432](https://our.internmc.facebook.com/intern/diff/D61422432)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133504
Approved by: https://github.com/jamesjwu
2024-08-17 01:37:53 +00:00
dcfa415e6e [Inductor UT] Reuse inductor UT for intel GPU test/inductor/test_compiled_optimizers.py (#133083)
[Inductor UT] Reuse Inductor test case for Intel GPU.
Reuse `test/inductor/test_compiled_optimizers.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133083
Approved by: https://github.com/etaf, https://github.com/jansel, https://github.com/mlazos
2024-08-17 01:15:26 +00:00
983bea399d [compiled autograd] move non-hot path logs into default logger (#133541)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133541
Approved by: https://github.com/yf225, https://github.com/bdhirsh
ghstack dependencies: #133115, #133148
2024-08-17 00:46:52 +00:00
0a6cc15079 [compiled autograd] use same graph node names as AOTDispatcher (#133148)
FIXES https://github.com/pytorch/pytorch/issues/132939

Compiled autograd's trace of the AOT backward may result in some additional ops (e.g. a clone to make tensors contiguous, trace_wrapped HOPs), so the graphs may be slightly offset from each other

hf_Whisper example: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpNv89Pu/index.html
fsdp2 example: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpPdKssS/rank_0/index.html
Unit test example: https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpvoQsnl/index.html
```python
 ===== Compiled autograd graph =====
 <eval_with_key>.14 class CompiledAutograd(torch.nn.Module):
    def forward(self, inputs, sizes, scalars, hooks):
        # No stacktrace found for following nodes
        getitem: "f32[]cpu" = inputs[0]
        aot1_primals_1: "f32[4]cpu" = inputs[1]
        aot1_primals_2: "f32[4]cpu" = inputs[2]
        aot0_sin: "f32[4]cpu" = inputs[3]
        aot0_cos: "f32[4]cpu" = inputs[4]
        getitem_5: "f32[4]cpu" = inputs[5];  inputs = None

         # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: SumBackward0 (NodeCall 1)
        expand: "f32[4]cpu" = torch.ops.aten.expand.default(getitem, [4]);  getitem = None

         # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: CompiledFunctionBackward1 (NodeCall 2)
        aot1_tangents_1: "f32[4]cpu" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format);  expand = None
        aot1_sin_1: "f32[4]cpu" = torch.ops.aten.sin.default(aot1_primals_2);  aot1_primals_2 = None
        aot1_neg: "f32[4]cpu" = torch.ops.aten.neg.default(aot1_sin_1);  aot1_sin_1 = None
        aot0_tangents_2: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot1_tangents_1, aot1_neg);  aot1_neg = None
        aot1_cos_1: "f32[4]cpu" = torch.ops.aten.cos.default(aot1_primals_1);  aot1_primals_1 = None
        aot0_tangents_1: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot1_tangents_1, aot1_cos_1);  aot1_tangents_1 = aot1_cos_1 = None

         # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: CompiledFunctionBackward0 (NodeCall 3)
        aot0_neg: "f32[4]cpu" = torch.ops.aten.neg.default(aot0_sin);  aot0_sin = None
        aot0_mul: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot0_tangents_2, aot0_neg);  aot0_tangents_2 = aot0_neg = None
        aot0_mul_1: "f32[4]cpu" = torch.ops.aten.mul.Tensor(aot0_tangents_1, aot0_cos);  aot0_tangents_1 = aot0_cos = None
        aot0_add: "f32[4]cpu" = torch.ops.aten.add.Tensor(aot0_mul, aot0_mul_1);  aot0_mul = aot0_mul_1 = None

         # File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:444 in set_node_origin, code: torch::autograd::AccumulateGrad (NodeCall 4)
        accumulate_grad_ = torch.ops.inductor.accumulate_grad_.default(getitem_5, aot0_add);  getitem_5 = aot0_add = accumulate_grad_ = None
        _exec_final_callbacks_stub = torch__dynamo_external_utils__exec_final_callbacks_stub();  _exec_final_callbacks_stub = None
        return []
```

where aot1 is
```python
class GraphModule(torch.nn.Module):
    def forward(self, primals_1: "f32[4][1]cpu", primals_2: "f32[4][1]cpu", tangents_1: "f32[4][1]cpu"):
         # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2233 in torch_dynamo_resume_in_f_at_2232, code: return tmp1.sin() + tmp2.cos()
        sin_1: "f32[4][1]cpu" = torch.ops.aten.sin.default(primals_2);  primals_2 = None
        neg: "f32[4][1]cpu" = torch.ops.aten.neg.default(sin_1);  sin_1 = None
        mul: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_1, neg);  neg = None
        cos_1: "f32[4][1]cpu" = torch.ops.aten.cos.default(primals_1);  primals_1 = None
        mul_1: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_1, cos_1);  tangents_1 = cos_1 = None
        return (mul_1, mul)
```

and aot0 is
```python
class GraphModule(torch.nn.Module):
    def forward(self, sin: "f32[4][1]cpu", cos: "f32[4][1]cpu", tangents_1: "f32[4][1]cpu", tangents_2: "f32[4][1]cpu"):
         # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2231 in f, code: tmp2 = x.cos()
        neg: "f32[4][1]cpu" = torch.ops.aten.neg.default(sin);  sin = None
        mul: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_2, neg);  tangents_2 = neg = None

         # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2230 in f, code: tmp1 = x.sin()
        mul_1: "f32[4][1]cpu" = torch.ops.aten.mul.Tensor(tangents_1, cos);  tangents_1 = cos = None

         # File: /data/users/xmfan/a/pytorch/test/inductor/test_compiled_autograd.py:2230 in f, code: tmp1 = x.sin()
        add: "f32[4][1]cpu" = torch.ops.aten.add.Tensor(mul, mul_1);  mul = mul_1 = None
        return (add,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133148
Approved by: https://github.com/jansel
ghstack dependencies: #133115
2024-08-17 00:46:52 +00:00
4b3ed8bc52 [compiled autograd] log aot id for CompiledFunctionBackward (#133115)
Partially addresses https://github.com/pytorch/pytorch/issues/132939. Adds the AOT ID after the CompiledFunctionBackward annotation in verbose compiled autograd logging

default (no change):
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp8WCSLf/dedicated_log_torch_trace_xw3ktsi_.log/index.html

TORCH_LOGS="compiled_autograd_verbose":
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmp8WCSLf/dedicated_log_torch_trace_gsc9q_43.log/index.html

```python
# File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:361 in set_node_origin, code: CompiledFunctionBackward1 (NodeCall 2)
clone: "f32[4]" = torch.ops.aten.clone.default(expand, memory_format = torch.contiguous_format);  expand = None
cos: "f32[4]" = torch.ops.aten.cos.default(getitem_1);  getitem_1 = None
mul: "f32[4]" = torch.ops.aten.mul.Tensor(clone, cos);  clone = cos = None

# File: /data/users/xmfan/a/pytorch/torch/_dynamo/compiled_autograd.py:361 in set_node_origin, code: CompiledFunctionBackward0 (NodeCall 3)
cos_1: "f32[4]" = torch.ops.aten.cos.default(getitem_2)
mul_1: "f32[4]" = torch.ops.aten.mul.Tensor(mul, cos_1);  mul = cos_1 = None
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133115
Approved by: https://github.com/jansel
2024-08-17 00:46:52 +00:00
b0803129e8 Added meta registration for _fused_adamw_ (#133728)
See https://github.com/pytorch/pytorch/issues/123461#issuecomment-2294335273

<img width="1463" alt="Screenshot 2024-08-16 at 5 38 25 PM" src="https://github.com/user-attachments/assets/fe940c0e-775f-4047-bf69-34a3677d539b">
Same signature, so it should be OK to just add the op to the decorator.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133728
Approved by: https://github.com/janeyx99, https://github.com/fegin
2024-08-17 00:28:31 +00:00
ec28121017 [inductor] Fix test_cudagraph_trees_expandable_segments.py for internal (#133698)
Summary:
These tests aren't running internally because the outer test harness is crashing without listing the tests. To fix we need:
* Add a target for the tools/stats/ folder since this test imports it
* Add a dependency on that target so it's included in the par
* Fix up the relative import syntax, which is somehow different internally vs. fbcode (not sure why this works, but many other tests are doing it)

Test Plan: `buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cudagraph_trees_expandable_segments -- --run-disabled`

Differential Revision: D61396711

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133698
Approved by: https://github.com/xuzhao9
2024-08-17 00:09:32 +00:00
648fc6c9c1 [Inductor][CPP] Refactor the tiling select into a standalone module to enhance its extensibility (#130892)
**Summary**
After enabling more vectorization, we found that vectorization does not always bring performance benefits. For example, a kernel with several non-contiguous index computations or non-contiguous buffer load/store operations can experience performance regression. A typical case is what we observed in the next PR: after fully enabling vectorization of `index_expr`, we saw a performance regression in `hf_BigBird`.

In this PR, we refactor the tiling selection into a standalone module to enhance its extensibility for further advanced tiling-selection heuristics. A standalone class `TilingSelect` with a `select_tiling` method has been added. `select_tiling` accepts `fn_list` and `var_sizes_list` as inputs and returns `tiling_factors` and `tiling_indices`.
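
An interface sketch based only on the names above (the real implementation lives in Inductor's CPP backend and its internals differ):
```python
class TilingSelect:
    def select_tiling(self, fn_list, var_sizes_list):
        tiling_factors = []   # e.g. vector lengths to tile by
        tiling_indices = []   # which loop variables are tiled
        # ... heuristics (possibly more advanced ones in the future) go here ...
        return tiling_factors, tiling_indices
```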

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130892
Approved by: https://github.com/jgong5
2024-08-16 23:55:38 +00:00
d04cd7f3ba Improvements for associative_scan - Reverse feature (#133011)
This is part of a series of PRs to improve the `associative_scan` functionality. This specific PR introduces a `reverse` flag to `associative_scan` to establish an interface similar to `jax.associative_scan`. This PR has been derived from https://github.com/pytorch/pytorch/pull/129307.

@ydwu4 @Chillee @zou3519

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133011
Approved by: https://github.com/ydwu4
2024-08-16 23:06:31 +00:00
19ff9059eb Revert "[Inductor][CPP] Support vectorization of remainder (#129849)"
This reverts commit 8624a571b4eecd11547867591d70992843265e97.

Reverted https://github.com/pytorch/pytorch/pull/129849 on behalf of https://github.com/izaitsevfb due to ptedge_executorch_benchmark build failed again with LLVM crash ([comment](https://github.com/pytorch/pytorch/pull/129849#issuecomment-2294408526))
2024-08-16 22:41:05 +00:00
98d6a6eb7d [inductor] clean up TODO comments. (#133718)
clean up TODO comments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133718
Approved by: https://github.com/henrylhtsang
2024-08-16 22:12:01 +00:00
271ee90851 [easy] Fix type annotation for ExportedProgram.run_decompositions (#133720)
Fix the tuple type annotation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133720
Approved by: https://github.com/Skylion007
2024-08-16 22:11:42 +00:00
99e789b52b [Fix 1/n] GPU Test skips - fbcode/ caffe2/test/quantization (#133158)
Summary:
This diff aims to fix the GPU Test skips in the quantization tests under the `caffe2/test/quantization` directory. The changes made in the `TARGETS` files include adding the `should_use_remote_gpu` flag to enable remote GPU testing. This should help to resolve the skipped tests and improve the overall test coverage.

[This diff] Fixed skip count: 4
[Running total] Fixed skip count: 4

Note: Creating separate diffs for each test-group.

Test Plan:
**281475054644766**: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_compare_per_channel_device_numerics (caffe2.test.quantization.core.test_quantized_tensor.TestQuantizedTensor)'
https://www.internalfb.com/intern/testinfra/testrun/5629499773981783

**281475054644780**: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_compare_per_tensor_device_numerics (caffe2.test.quantization.core.test_quantized_tensor.TestQuantizedTensor)'
https://www.internalfb.com/intern/testinfra/testrun/11540474087422107

**281475054644853**: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_quant_pin_memory (caffe2.test.quantization.core.test_quantized_tensor.TestQuantizedTensor)'
https://www.internalfb.com/intern/testinfra/testrun/11540474087422477

**844425008078016**: buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/quantization:test_quantization -- --exact 'caffe2/test/quantization:test_quantization - test_cuda_quantization_does_not_pin_memory (caffe2.test.quantization.core.test_quantized_tensor.TestQuantizedTensor)'
https://www.internalfb.com/intern/testinfra/testrun/1407375259845199

Differential Revision: D60055277

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133158
Approved by: https://github.com/jovianjaison
2024-08-16 22:00:57 +00:00
fd33499b0c [PT2][Optimus] Fix mixed precision training problem in decompose mem bound (#133626)
Summary: Recently we observed in AI CMF that enabling the decompose_mm pass leads to mixed-dtype errors for aten.mm and aten.addmm. Upon investigation, we found that the error comes from torch.sum, which has an implicit type cast to avoid possible overflow (a similar discussion on GitHub: https://github.com/pytorch/pytorch/issues/115832). Thus we cast the output to avoid the error.

Test Plan:
# unit test
```
buck2 test mode/dev-nosan //caffe2/test/inductor:decompose_mem_bound_mm -- test_decompose_mm_mixed_precision
```
Buck UI: https://www.internalfb.com/buck2/00dc168e-4d65-40f8-b169-f4a58206f641
Test UI: https://www.internalfb.com/intern/testinfra/testrun/17169973624867151
Network: Up: 25KiB  Down: 44KiB  (reSessionID-b7e2ecc7-16ca-476d-95b2-09ea74645eb0)
Jobs completed: 19. Time elapsed: 1:07.6s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 6. Fail 0. Fatal 0. Skip 0. Build failure 0

# e2e
ads_dper3:68464f2dc5e849ba2670482079cecaaa
training_platform:2c41d916ad5dd82f196372a8c7bd37a0
### build training_platform
```
buck2 run fbcode//fblearner/flow/projects/training_platform:training_platform
```

### register training_platform
```
buck2 run mode/opt fblearner/flow/projects/training_platform:workflow -- register-workflows --project-name training_platform --flow_version training_platform:2c41d916ad5dd82f196372a8c7bd37a0
```

### build ads_dper 3

```
fbpkg build -E ads_dper3 --yes --expire 14d
```

### register ads_dper 3
```
 buck2 run //pyper/core/eval_app_utils:flow_utils_script -- register --pkg-version ads_dper3:68464f2dc5e849ba2670482079cecaaa
```

### extend package (optional)
```
fbpkg expire --extend-only training_platform:2c41d916ad5dd82f196372a8c7bd37a0 30d
```

### before fix
f591360990

### after fix

baseline
f591395056
proposal

Differential Revision: D61351815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133626
Approved by: https://github.com/jackiexu1992
2024-08-16 21:53:12 +00:00
be207af6e1 Disable unwrapping scalar tensors when used as outputs (#132859)
If the scalar tensor is an output tensor, it shouldn't be unwrapped (i.e. `.item()` called) since `tl.store` requires a pointer type for outputs. This issue only occurs for mutated buffers: the input tensor is also used as an output tensor.
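
A hedged illustration of the mutated-buffer case: `out` is both an input and an output of the compiled region, so it must remain a tensor for `tl.store` rather than being unwrapped via `.item()`:
```python
import torch

@torch.compile
def accumulate(x, out):
    out.add_(x.sum())   # 0-d tensor mutated in place
    return out

out = torch.zeros((), device="cuda")
accumulate(torch.ones(4, device="cuda"), out)
```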

Fixes #ISSUE_NUMBER

@yanboliang @jansel @ngimel

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132859
Approved by: https://github.com/jansel
2024-08-16 21:40:45 +00:00
861bdf96f4 [MPS] Add native strided API for MPSNDArray starting with macOS 15 (#128393)
Add support for native strides in MPS starting with macOS Sequoia. This will get rid of the additional gather and scatter operations needed to solve the strides or storage offsets of the tensors.

Summary of changes (starting with macOS 15):
- Add support for **MPS strided API** (strides/storage offsets etc):
   - [initWithBuffer:offset:descriptor:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/4391636-initwithbuffer?language=objc)
   - [arrayViewWithCommandBuffer:descriptor:aliasing:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/3114040-arrayviewwithcommandbuffer?language=objc)
   - [arrayViewWithShape:strides:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarray/4408694-arrayviewwithshape?language=objc)
   - [reshapeWithCommandBuffer:sourceArray:shape:destinationArray:](https://developer.apple.com/documentation/metalperformanceshaders/mpsndarrayidentity/4438557-reshapewithcommandbuffer?language=objc)
- Add native support for NHWC convolutions (without incurring any extra copy from NCHW -> NHWC -> NCHW).
- Add support for strided output buffers (previously we would create a contiguous buffer)

OSes older than macOS 15 will run the old gather/scatter code path to solve strides/storage offsets.

---

Couple performance stats collected from torchbench comparing macOS 15 vs macOS 14:
```
- test_train[functorch_maml_omniglot-mps]: 27% faster
- test_train[timm_vision_transformer-mps]: 12% faster
- test_train[hf_T5-mps]: 9.46% faster
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128393
Approved by: https://github.com/albanD

Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
2024-08-16 21:07:50 +00:00
447f428d6d [ROCm] Fix text_export cudnn_attention UT (#133234)
On ROCm, sdpa should decompose to flash_attention instead of cudnn_attention. This needs additional conditionalization in the code.

Issue observed: https://hud.pytorch.org/failure?name=rocm%20%2F%20linux-focal-rocm6.1-py3.8%20%2F%20test%20(default%2C%203%2C%206%2C%20linux.rocm.gpu.2)&jobName=undefined&failureCaptures=%5B%22export%2Ftest_export.py%3A%3ATestOneOffModelExportResult%3A%3Atest_scaled_dot_product_attention_cuda%22%5D

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133234
Approved by: https://github.com/malfet
2024-08-16 20:49:13 +00:00
f57b00704e [Traceable FSDP2][Dynamo] Support reconstructing CUDA event object within Dynamo graph (#133635)
`torch.cuda.Event` objects are different from `torch.cuda.Stream` in that events are not pooled, meaning we can't look up a previously created CUDA event object by ID. This prevents a CUDA event object created outside of the Dynamo graph from being used within the graph (since Dynamo needs a way to emit a `call_function` line in the graph that retrieves the event object for downstream op use). This PR adds a simple object pool within a Dynamo utility to support looking up CUDA event objects by ID from within the Dynamo graph.

After this PR, if a user creates a CUDA event object outside of the graph and uses that event within the graph, the behavior will exactly match eager.
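
A hedged example of the supported pattern (an event created in eager code, recorded from inside a compiled region):
```python
import torch

event = torch.cuda.Event(enable_timing=True)   # created outside the graph

@torch.compile
def step(x):
    event.record()   # Dynamo looks the event object up by ID inside the graph
    return x * 2

step(torch.ones(4, device="cuda"))
```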

Test commands:
- `pytest -rA test/dynamo/test_ctx_manager.py::CtxManagerTests::test_cuda_event_created_outside_of_graph`
- `pytest -rA test/dynamo/test_ctx_manager.py::CtxManagerTests::test_cuda_event_across_graph_break`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133635
Approved by: https://github.com/yifuwang
ghstack dependencies: #133532, #133531, #133636
2024-08-16 20:40:46 +00:00
bc9e20b927 Move the layout constraint registration of aten._scaled_mm.default to module scope (#133669)
During Inductor lowering, layout constraints for an op is applied before the op's lowering is called. Currently `add_layout_constraint(aten._scaled_mm.default, constrain_to_fx_strides)` is called inside `aten._scaled_mm.default`'s lowering. This means that if the first `_scaled_mm` to be lowered relies on the layout constraint, it won't be applied and the generated code would fail. The issue won't manifest if the first `_scaled_mm` doesn't rely on the layout constraint.
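
A hedged sketch of the change (assuming `add_layout_constraint` and `constrain_to_fx_strides` live in `torch._inductor.lowering`; exact locations may differ):
```python
import torch
from torch._inductor.lowering import add_layout_constraint, constrain_to_fx_strides

# At module scope, the constraint is registered before any _scaled_mm lowering runs,
# so even the first _scaled_mm node has its FX strides fixed.
add_layout_constraint(torch.ops.aten._scaled_mm.default, constrain_to_fx_strides)
```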

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133669
Approved by: https://github.com/drisspg, https://github.com/yangsiyu007
2024-08-16 20:30:13 +00:00
88ba50279c Consolidate the format for --max-acc-splits flag (#133724)
fixes the partial export of [lowering] Add max_acc_splits (#133041) ([D60133589](https://www.internalfb.com/diff/D60133589))

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133724
Approved by: https://github.com/kit1980
2024-08-16 20:28:55 +00:00
3ac527ac5f [BE][Ez]: Update cudnn_frontend submodule to 1.6.0 (#133687)
Updates CUDNN_frontend header only library to make the most of the newest CUDNN features and decrease the overhead of the library.

Copied from commit:
New API
- Graph Slice Operation: Introduced the graph.slice operation for slicing input tensors. Refer to docs/operations/Slice.md for detailed documentation and samples/cpp/misc/slice.cpp for a C++ sample. Pybinds for this operation have also been added.
- SM Carveout Feature: Added the set_sm_count(int32_t type) graph property to support the SM Carveout feature introduced in Ampere and Hopper GPUs. Engines that do not support SM_COUNT will return NOT_SUPPORTED.
Bug Fixes
- Convolution Mode Attribute: Added the missing set_convolution_mode attribute to convolution attributes in forward propagation (fprop), data gradient (dgrad), and weight gradient (wgrad). Previously, this was hardcoded to CUDNN_CROSS_CORRELATION in the 1.x API.
- SDPA FP8 Backward Node: Fixed an issue with the deserialization of the sdpa_fp8_backward node.
Enhancements
- Graph Execution Overhead: Reduced the overhead of graph.execute() by optimizing sub-node tree traversal, collected UIDs, workspace modifications, and workspace size.
- Graph Validation Performance: Significantly improved (~10x) the performance of graph.validate() by deferring graph expansion to a later stage (build_operation_graph).
- Optional Running Stats for BatchNorm: Made the running statistics for the batch normalization operation optional, supported by cuDNN backend version 9.3.0 and later.
- Shape and Stride Inferencing: Enhanced shape and stride inferencing to preserve the stride order of the input.
- Diagnostic Error Message: Added a diagnostic error message to create_execution_plans if called without the preceding build_operation_graph.
- JSON Schema and Deserialization: Improved the JSON schema and deserialization logic with additional checks.
- Logging Overhead: Reduced logging overhead, resulting in faster graph.build() calls.
- CMake Integration: Replaced CMAKE_SOURCE_DIR with PROJECT_SOURCE_DIR in CMake files for better integration. See the relevant pull request for more details.
Samples
- Jupyter Notebooks: Added Jupyter notebooks for RMSNorm, InstanceNorm, and LayerNorm. Refer to the samples/python folder for more information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133687
Approved by: https://github.com/eqy, https://github.com/malfet
2024-08-16 20:27:23 +00:00
41e6619509 [codemod] Del un at::native::metal @ MPSCNNFullyConnectedOp.h:6 (export D59157302) (#133515)
Manual export of D59157302

Original description:
Removes a using namespace from the global namespace in pursuit of enabling -Wheader-hygiene. Qualifies instances that relied on the using namespace.

@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133515
Approved by: https://github.com/kit1980, https://github.com/malfet
2024-08-16 19:59:07 +00:00
a0cb54ab46 Revert "C++ network flow implementation in c10 (#132188)"
This reverts commit e6272acaec63c960486b3ac558d0199cd65d7b97.

Reverted https://github.com/pytorch/pytorch/pull/132188 on behalf of https://github.com/izaitsevfb due to breaks aps models and builds internally ([comment](https://github.com/pytorch/pytorch/pull/132188#issuecomment-2294120234))
2024-08-16 19:48:54 +00:00
fb59440791 Use dedicated docker-build environment for manywheel, libtorch and conda Docker builds - 2 (#133709)
Follow up after https://github.com/pytorch/pytorch/pull/133699. 2 more placed where we need to pass these env vars.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133709
Approved by: https://github.com/Skylion007, https://github.com/seemethere
2024-08-16 19:41:11 +00:00
678a8f9e66 [Inductor][FlexAttention] Small cleanup for FlexAttention kernel template (#133664)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133664
Approved by: https://github.com/drisspg
2024-08-16 19:33:36 +00:00
611c104370 [MPS] Add workaround for nonzero with large/complex inputs (#126188)
Fixes Issue #122916

Resolves correctness issue seen with large inputs to the mps nonzero op by using a different scatter mode. Native nonzero op is still used with smaller inputs for better performance.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126188
Approved by: https://github.com/kulinseth, https://github.com/malfet
2024-08-16 19:04:04 +00:00
0063e56949 Make FX Graph Cache work with distributed training (#133374)
During distributed training, if all ranks except one hit the cache, the rank that did not hit the cache will cause an NCCL timeout, since the rest of the ranks will enter the collective and start the timer. This PR uses the new PTD API to increase the timeout for the ranks that hit the cache by the amount of time the cache would save.

Differential Revision: [D61363722](https://our.internmc.facebook.com/intern/diff/D61363722)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133374
Approved by: https://github.com/ezyang
2024-08-16 18:51:14 +00:00
5ee070266f Workaround ASAN failure (#133623)
Summary:
ASAN in LLVM 17.x and newer reads 8 bytes in front of every function called. This means the JIT must not place a function immediately at the beginning of a freshly `mmap`ed page. This adds an 8-byte dummy variable at the start of the page to work around the problem.

See also:
- https://reviews.llvm.org/D148665
- https://github.com/llvm/llvm-project/issues/65253

Test Plan:
- `servicelab create cogwheel_adfinder_ubsan_multi_trial_test --local-commit`: https://www.internalfb.com/servicelab/experiment/3701354882
- sandcastle

Differential Revision: D61348865

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133623
Approved by: https://github.com/Skylion007
2024-08-16 18:48:10 +00:00
cyy
90c3669cd9 Make sure T::is_traceable is bool (#133673)
Add static_assert to C++ templates in custom_function
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133673
Approved by: https://github.com/Skylion007
2024-08-16 18:28:02 +00:00
eb3d517605 [Test] Add SkipIfRocm to test_grad_acc_cpu_offload (#132975)
Fixes #123726

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132975
Approved by: https://github.com/malfet
2024-08-16 18:26:20 +00:00
e5baf43b61 [Inductor] short-term fix for needs_fixed_stride_order silent incorrectness (#133452)
This is a low-risk short-term fix for
https://github.com/pytorch/pytorch/issues/128084, for the purposes of
2.4.1. The actual fix for that issue is more risky and we'll target 2.5.

needs_fixed_stride_order is silently incorrect with args that are
mutable because it creates clones of those args, writes into them, and
doesn't update the original args.

This PR makes it so that needs_fixed_stride_order doesn't apply to
inputs that are being mutated.

This PR doesn't completely fix the problem, but it makes it less
incorrect: most of the time the input already has the correct strides
but inductor fails to recognize it, and in those cases writing directly
to the input is fine.

Test Plan:
- new test

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133452
Approved by: https://github.com/eellison
2024-08-16 18:14:57 +00:00
caaa339e0f Use dedicated docker-build environment for manywheel, libtorch and conda Docker builds (#133699)
BE change. Apply logic simiar to: https://github.com/pytorch/pytorch/blob/main/.github/workflows/docker-builds.yml

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133699
Approved by: https://github.com/seemethere
2024-08-16 18:10:43 +00:00
b833990a8f Revert "[CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade (#131493)"
This reverts commit 4aa66f68a803927ddd127ceaaa1521b8d6e90e5f.

Reverted https://github.com/pytorch/pytorch/pull/131493 on behalf of https://github.com/izaitsevfb due to breaks internal builds with identifier "std::numeric_limits< ::cutlass::half_t> ::infinity" is undefined in device code ([comment](https://github.com/pytorch/pytorch/pull/131493#issuecomment-2293939390))
2024-08-16 18:09:33 +00:00
4ee65c7e4e Add message text to BypassFxGraphCache exceptions. (#133505)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133505
Approved by: https://github.com/oulgen
2024-08-16 18:02:59 +00:00
1df1d00ffc [Traceable FSDP2] Remove usage of tuple() generator and simplify code (#133636)
Dynamo doesn't support passing a generator to `tuple()`, and this change also simplifies the code a bit.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133636
Approved by: https://github.com/awgu
ghstack dependencies: #133532, #133531
2024-08-16 17:47:28 +00:00
374c61cc82 [inductor] make conv template work with symbolic stride/padding (#132938)
Fix https://github.com/pytorch/pytorch/issues/132716

The triton template for convolution does not work when the stride or padding contains dynamic shapes. Use the hint and add guards to handle that. An alternative is to fall back to eager, but since I've seen the lowering rule for convolution use the hint in other cases, I'll just follow the convention.

I don't really know how to add a unit test here since I need to create symbolic strides (not strides of a tensor but the stride parameter for convolution) and paddings. I can try harder if reviewers want me to add unit tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132938
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #132952
2024-08-16 17:45:12 +00:00
2cffe82dea Fix triton build failure due to tritonlang.blob.core.windows.net not available (#133694)
This should mitigate https://github.com/triton-lang/triton/issues/4527
We should also remove this once our triton pin moves past: https://github.com/triton-lang/triton/pull/4216

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133694
Approved by: https://github.com/Skylion007, https://github.com/kit1980, https://github.com/malfet
2024-08-16 17:34:30 +00:00
f735038c8f [PT2][Optimus] Add unbind_stack_to_slices pass (#133420)
Summary: We found another pattern that can be optimized in AI CMF, so we add the new pattern.

Test Plan:
# unit test

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```

Buck UI: https://www.internalfb.com/buck2/b0b9bdf6-1bd1-45db-ba2c-a6892d9d557e
Test UI: https://www.internalfb.com/intern/testinfra/testrun/1125900285323964
Network: Up: 595KiB           Down: 1.7MiB           (reSessionID-e527c3b3-03ac-45f8-bd08-3eb9a28b7dc0)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0

# benchmark

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ai_cmf" --flow_id 558295195 -n
```
P1520513078

Counter({'pattern_matcher_nodes': 1756, 'pattern_matcher_count': 936, 'normalization_pass': 280, 'merge_splits_pass': 250, 'scmerge_cat_removed': 14, 'scmerge_cat_added': 12, 'scmerge_split_removed': 7, 'unbind_stack_pass': 7, 'split_stack_to_cats_pass': 4, 'scmerge_split_sections_removed': 3, 'split_cat_pass': 2, 'scmerge_split_added': 2, 'split_cat_to_slices_pass': 2, 'unbind_stack_to_slices_pass': 1}

# e2e (OBA AFOC)

baseline
f590253290
proposal
f591051921

### QPS and NE
{F1804187079}

### trace analysis
baseline trace link: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2Ff590283096-TrainingApplication%2F4%2Frank-1.Aug_12_08_52_03.3628.pt.trace.json.gz&bucket=pyper_traces

proposal trace link:
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2Ff591081210-TrainingApplication%2F0%2Frank-1.Aug_12_22_23_35.3401.pt.trace.json.gz&bucket=pyper_traces

{F1804227687}{F1804227675}
Based on the traces, the green part has been shrinked due to optimus transformation.

Differential Revision: D61039466

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133420
Approved by: https://github.com/jackiexu1992
2024-08-16 17:30:35 +00:00
6790eb52f9 [Traceable FSDP2] Set torch._dynamo.config.skip_fsdp_hooks to True by default (#133531)
Setting `torch._dynamo.config.skip_fsdp_hooks = True` is required for graph-break compiled FSDP2, so making it the default will ease adoption. If users want to use Traceable FSDP2, they can set it to False manually (which will allow FSDP2 hooks to be traced through).
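
The flag name comes from this commit; `True` is the new default (graph-break compiled FSDP2). To opt into Traceable FSDP2 instead:
```python
import torch._dynamo

torch._dynamo.config.skip_fsdp_hooks = False  # trace through FSDP2 hooks
```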

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133531
Approved by: https://github.com/awgu
ghstack dependencies: #133532
2024-08-16 17:18:42 +00:00
6d85077168 [Traceable FSDP2] Allow tracing through FSDP2 impl in trace_rules.py (#133532)
Test commands:
- `python test/distributed/_composable/fsdp/test_fully_shard_training.py TestFullyShard1DTrainingCompose.test_train_parity_with_activation_checkpointing`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133532
Approved by: https://github.com/yanboliang
2024-08-16 17:13:47 +00:00
18705e371d S390x nightly binaries for python 3.13 (#132984)
Enable building python 3.13 nightly binaries for s390x
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132984
Approved by: https://github.com/malfet
2024-08-16 17:07:27 +00:00
770086fe39 [Dynamo] Support torch.cuda.device ctx manager (#133385)
Fixes #128059

I'm not sure if this is the right way, since Inductor doesn't always respect the device id set by users, so probably we should just wrap it as a null context manager and print a warning. cc @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames @jansel @anijain2305 @mlazos @williamwen42
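
A hedged example of the newly supported pattern (a `torch.cuda.device` context manager inside a compiled function, rather than forcing a graph break):
```python
import torch

@torch.compile
def f(x):
    with torch.cuda.device(0):
        return x + 1

f(torch.ones(4, device="cuda"))
```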

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133385
Approved by: https://github.com/jansel
2024-08-16 17:05:55 +00:00
38e5ee1a34 mixed_mm: add more extensive dtype testing (#133292)
This PR adds a test that covers more dtype combinations. The bfloat16 and uint8 combination causes a crash somewhere in triton during the generation of LLVM code. Tests like these would have also prevented segfaults like this one https://github.com/pytorch/pytorch/pull/133173.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133292
Approved by: https://github.com/shunting314
2024-08-16 16:49:27 +00:00
9c2d119194 [Profiler/CPU] Add API for Dynamic Activity Toggling [3/n] (#133353)
Summary:
In this diff, we add the CPU-activity implementation for dynamically toggling profiling between steps. To do this we remove the callbacks for Torch Ops and add them back when an enable call is made.

This diff also adds some support code for doing the same in Python; however, the Python stack comes with its own set of complications when enabling this feature. For one, we get into a scenario where the Python stack during the toggle never gets an exit because the tracing gets turned off, which makes for some tricky post-processing. For this reason, we leave the Python dynamic toggling off for now and will revisit if there is enough demand.

Test Plan: Got the following tracing by disabling torch and cuda ops: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Aug_13_13_03_02.606577.pt.trace.json.gz&bucket=gpu_traces

Differential Revision: D61221497

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133353
Approved by: https://github.com/sanrise, https://github.com/aaronenyeshi
2024-08-16 16:36:57 +00:00
46af996ce7 [c10d] Do not call ncclCommAbort if comm is not initialized (#133630)
Summary:
We saw ncclCommAbort being called and hanging during NCCLComm::create.
If the NCCL comm is not properly initialized, ncclCommAbort's behavior is
undefined; avoiding the call allows the process to properly throw an
exception.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133630
Approved by: https://github.com/wconstab
2024-08-16 16:25:07 +00:00
8b8b4e5ae9 AutoHeuristic: documentation for mm (#133611)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133611
Approved by: https://github.com/eellison
ghstack dependencies: #131705, #131710, #131714, #133608
2024-08-16 16:20:38 +00:00
0e0077f3b6 AutoHeuristic: mm ranking heuristic h100 (#133608)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133608
Approved by: https://github.com/eellison
ghstack dependencies: #131705, #131710, #131714
2024-08-16 16:20:38 +00:00
e51c8ad369 AutoHeuristic: Heuristic that ranks choices for mm (#131714)
This PR adds a heuristic for tuned_mm that predicts the top 10 best choices. To be safe, aten.mm is always included.

Perf run: https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2008%20Aug%202024%2020%3A20%3A28%20GMT&stopTime=Thu%2C%2015%20Aug%202024%2020%3A20%3A28%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&deviceName=cuda%20(a100)&lBranch=gh/AlnisM/22/head&lCommit=905826f4ab5344efb0bcaa87e3b27a25299927ab&rBranch=main&rCommit=79ca596dc6ea16b6cdd0f2517451e19840717d37

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131714
Approved by: https://github.com/eellison
ghstack dependencies: #131705, #131710
2024-08-16 16:20:38 +00:00
51e13745be [BE]: Update ruff to 0.6.0 (#133609)
Updates ruff and fixes a couple false negatives it discovered.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133609
Approved by: https://github.com/malfet
2024-08-16 14:11:01 +00:00
eca8b4220f [inductor][cpp][gemm] fix k-slicing bug and add thread blocking config (#132730)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132730
Approved by: https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #132729
2024-08-16 13:50:19 +00:00
a6aa451bde Move python 3.8 to 3.9 for linux-binary-manywheel workflow (#133621)
Part of Deprecation of python 3.8 and moving to 3.9. Related to: https://github.com/pytorch/pytorch/issues/120718
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133621
Approved by: https://github.com/Skylion007, https://github.com/kit1980, https://github.com/malfet
2024-08-16 13:49:26 +00:00
e1b9b89d94 Revert "[Flight Recorder] Add more basic analysis to the script (#133412)"
This reverts commit fcc2fc1a70c35628939611b496b209fa0a1d19bf.

Reverted https://github.com/pytorch/pytorch/pull/133412 on behalf of https://github.com/atalman due to New test: distributed/flight_recorder/test_fr_analysis is constantly failing ([comment](https://github.com/pytorch/pytorch/pull/133412#issuecomment-2293506539))
2024-08-16 13:26:25 +00:00
b444343087 Fix printing symfloat pow in triton (#133614)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133614
Approved by: https://github.com/Skylion007
2024-08-16 13:08:29 +00:00
762b1b4c17 [inductor] [cpp] fix accuracy when template_buffer has users other than the epilogue nodes (#133073)
This PR fixes the accuracy issues when template_buffer has users other than the epilogue nodes. This will fix the accuracy failure of the below models using max-autotune:

- MobileBertForMaskedLM
- MobileBertForQuestionAnswering
- convnext_base
- swin_base_patch4_window7_224

## Issue 1:
Previously we always add `template_buffer` as an alias of `Y`. In case the `template_buffer` has users other than the epilogue nodes, we shouldn't set it as an alias of `Y`. This PR adds the check in such case.

Wrong code before the fix where `tmp4` and `tmp9` are both stored to `Y` while we need 2 different buffers for them since `tmp4` will be used by nodes other than the epilogue node:
```cpp
Y[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp4; // tmp4 is the output of the template
Y[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp9; // tmp9 is the output of the epilogue node
```

Correct code after the fix:
```cpp
out_ptr2[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp4;
Y[static_cast<long>(n_start + x1 + (32L*m_start) + (32L*x0))] = tmp9;
```

## Issue 2:
When fixing the above issue, we found that there's correctness issue when `bias` is `False`. The root cause is that in the case where `bias` is `False`, the `template_buffer` has users other than the epilogue nodes and the GEMM output buffer is localized, we need to add an extra copy epilogue to ensure that the GEMM output (a local buffer) is stored to the `template_buffer` that will be used later by other nodes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133073
Approved by: https://github.com/jgong5
ghstack dependencies: #133070
2024-08-16 12:13:10 +00:00
dd69013c7a deprecate search_autotune_cache (#133628)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133628
Approved by: https://github.com/oulgen
2024-08-16 09:29:39 +00:00
15183f5ebf overestimate time_taken_ns for autotuning (#133633)
tl;dr: in `autotune_to_one_config` we now include the precompile time, and in coordesc tuning we include the time from `autotune_to_one_config`, since it is a precursor

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133633
Approved by: https://github.com/oulgen, https://github.com/eellison
2024-08-16 09:28:49 +00:00
30fbf5b19c Remove AMD restrictions on triton hashing (#133616)
Summary: When we added these functions, AMD's triton checkout was very old; it appears to have caught up. Remove the restrictions.

Test Plan: unit tests

Differential Revision: D61351473

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133616
Approved by: https://github.com/mxz297, https://github.com/nmacchioni, https://github.com/eellison
2024-08-16 08:02:48 +00:00
5ed3b70d09 remove redundant upper bound check at runtime (#133627)
Summary: Some symbols (unbacked symints?) can have an upper bound of `sys.maxsize - 1`, but our code for runtime assertions assumes that such upper bounds come in as `sympy.oo` (like backed symints?) in order to drop them. So we weren't dropping them, which this PR fixes.
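
A hedged sketch of the described check (names and structure are hypothetical, not the actual implementation):
```python
import sys
import sympy

INT_OO = sys.maxsize - 1  # "unbounded" upper bound seen on unbacked symints

def should_emit_upper_bound_assert(upper_bound) -> bool:
    # Drop the runtime assertion both for sympy.oo (backed symints) and for
    # sys.maxsize - 1 (unbacked symints): neither is a real constraint.
    return upper_bound not in (sympy.oo, INT_OO)
```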

Test Plan: added test

Differential Revision: D61352056

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133627
Approved by: https://github.com/SherlockNoMad
2024-08-16 06:57:12 +00:00
f64146aff0 Update source matcher to use torch_fn (#133642)
Updating the source matcher to also accept pattern matching on the torch_fn metadata, which exists in both strict and non-strict export. We want to replace the use of source_fn_stack with torch_fn, as it's not possible for us to get source_fn_stack in non-strict export.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133642
Approved by: https://github.com/ydwu4
2024-08-16 06:42:52 +00:00
d12bbcd785 Add auto-tuning for sparse semi-structured MM operator (#123742)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123742
Approved by: https://github.com/kadeng
2024-08-16 06:40:24 +00:00
3d45717219 [ROCm][CK][Inductor] enable dynamic shapes for CK backend to gemm max autotune (#133285)
This PR enables dynamic shapes for the CK backend for gemm max autotune (see #125453).

This is achieved via unhardcoding the problem sizes from the template body and passing them as parameters instead.

We handle passing the problem sizes for the kernel call as well as for the benchmark call.

# Testing

`pytest test/inductor/test_ck_backend.py [-k dynamic]`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133285
Approved by: https://github.com/ColinPeppler
2024-08-16 06:05:23 +00:00
8ea5b572a6 [PT2][Optimus] Add missing example value for the nodes introduced in group batch fusion (#133414)
Summary: Recently we observed more missing example values on nodes introduced by Optimus, which causes problems for further optimization when this node info needs to be used. Thus we add the metadata for these nodes in this diff.

Test Plan:
# unit test

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```

Buck UI: https://www.internalfb.com/buck2/c0ad506f-ce9d-4b80-947a-cb79074b72f0
Test UI: https://www.internalfb.com/intern/testinfra/testrun/2251800058834808
Network: Up: 1.4GiB  Down: 2.0GiB  (reSessionID-fb781425-f29b-44b5-8a5b-daffe7274f86)
Jobs completed: 300289. Time elapsed: 13:19.5s.
Cache hits: 99%. Commands: 119360 (cached: 118494, remote: 824, local: 42)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0

# benchmark

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "cmf_shrink" --flow_id 587303213
```

P1520691492

Differential Revision: D61039772

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133414
Approved by: https://github.com/jackiexu1992
2024-08-16 04:52:16 +00:00
8a2b064236 [dynamo][user_defined][stable-diffusion] Raise ObservedAttributeError on UserDefinedObject var_getattr (#132806)
Fixes https://github.com/pytorch/pytorch/issues/132551

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132806
Approved by: https://github.com/williamwen42
2024-08-16 04:30:06 +00:00
fcc2fc1a70 [Flight Recorder] Add more basic analysis to the script (#133412)
This is the first step toward a basic, functional analyzer for FR in production.

- We want to use this script to find abnormalities in collectives and report them to users.
- We also fixed some type errors.

- [Ongoing] We will also add more unit tests to this script and modularize it so that we can better maintain it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133412
Approved by: https://github.com/c-p-i-o
2024-08-16 03:53:12 +00:00
d9f17cf4e4 [fx] Do not add Proxy on Tensor (#133470)
Summary: Switch to set_proxy_slot instead of setting the proxy directly on the Tensor. We do not want to add Proxy to tensor objects, because Proxy cannot be deepcopied or pickled and can cause problems when users want to deepcopy or pickle models.

Test Plan: CI

Differential Revision: D61277650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133470
Approved by: https://github.com/zou3519
2024-08-16 03:39:50 +00:00
8a5708ba3d [dynamo] Support object creation of classes with custom __new__ (#132977)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132977
Approved by: https://github.com/jansel
2024-08-16 03:09:23 +00:00
a1a869f2f5 [ts_converter][reland] Add support for LinearOpContext and Conv2dOpContext in quantization pass (#133622)
Summary: Reland of D60871242

Test Plan: CI

Differential Revision: D61352600

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133622
Approved by: https://github.com/SherlockNoMad
2024-08-16 01:55:45 +00:00
1653f7786d Fix type promotion for ldexp (#133519)
According to the documentation, ldexp of half and int should return a half tensor, and ldexp of double should not overflow for a 64-bit exponent.
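
A hedged check of that documented behavior (expected outputs noted in comments):
```python
import torch

x = torch.ones(3, dtype=torch.float16)
e = torch.tensor([1, 2, 3], dtype=torch.int32)
print(torch.ldexp(x, e).dtype)   # expected: torch.float16 (no promotion to float32)

d = torch.tensor([1.0], dtype=torch.float64)
big = torch.tensor([200], dtype=torch.int64)
print(torch.ldexp(d, big))       # expected: 2**200, representable in float64
```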

Introduce a `_pow2` helper function that does not follow the scalar-to-float32 promotion pattern if `self` is a reduced-precision float or a double.

Add regression tests to `test_ldexp` and enable it to run on both CPU and GPU

Fixes https://github.com/pytorch/pytorch/issues/133267

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133519
Approved by: https://github.com/janeyx99, https://github.com/Skylion007
2024-08-16 01:26:26 +00:00
3a904d1163 AutoHeuristic: Enable explicit support for ranking (#131710)
This PR adds support for heuristics that rank choices in AutoHeuristic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131710
Approved by: https://github.com/eellison
ghstack dependencies: #131705
2024-08-16 01:20:52 +00:00
add0f0085c AutoHeuristic: Support ranking/pruning choices (#131705)
This PR adds support in train_decision for learning a heuristic for ranking. The main idea is that the user provides the number of choices the heuristic should return. I added a way to prune the learned decision tree such that it always returns the number of choices provided by the user.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131705
Approved by: https://github.com/eellison
2024-08-16 01:20:52 +00:00
cyy
929d2f8253 [3/N] Fix clang-tidy warnings in torch/csrc/autograd (#133389)
Follows #133295
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133389
Approved by: https://github.com/Skylion007
2024-08-16 00:57:54 +00:00
c22f51ce7c [inductor][cpp][gemm] improve large bs perf with better cache blocking (#132729)
Improve the cache blocking by reducing Mc_blocks to make A reside in L2 and be reused by B as much as possible. This improves large-bs perf for both scenarios: 1) N is large and K is of medium size; 2) K is large. Different strategies are used to handle these scenarios. Check the notes in `get_cache_blocking` in the changes.

Measured with 56-core Intel (R) Xeon (R) CPU Max 9480, jemalloc 5.1 and intel omp, bf16. Run with code cache of B matrix (weights).

Model Shapes | Before Optimization | After Optimization | Speedup | onednn linear | Speedup over onednn
-- | -- | -- | -- | -- | --
M=1024, N=12288, K=4096 (Llama2-8b) | 5.69 ms | 3.71 ms | 1.53 | 4.53 ms | 1.22
M=1024, N=4096, K=4096 (Llama2-8b) | 1.69 ms | 1.63 ms | 1.04 | 2.05 ms | 1.26
M=1024, N=22016, K=4096 (Llama2-8b) | 10.32 ms | 6.57 ms | 1.57 | 8.46 ms | 1.29
M=1024, N=4096, K=11008 (Llama2-8b) | 5.21 ms | 3.26 ms | 1.60 | 4.65 ms | 1.43
M=1024, N=5120, K=4096 (Llama3-8b) | 1.99 ms | 1.78 ms | 1.12 | 2.31 ms | 1.30
M=1024, N=28672, K=4096 (Llama3-8b) | 13.41 ms | 8.56 ms | 1.57 | 10.96 ms | 1.28
M=1024, N=4096, K=14336 (Llama3-8b) | 6.93 ms | 4.31 ms | 1.61 | 6.24 ms | 1.45

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132729
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/jansel
2024-08-16 00:57:51 +00:00
cyy
8f7cf796ea [14/N] Use std::optional (#133417)
Follows #132527
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133417
Approved by: https://github.com/ezyang
2024-08-16 00:48:34 +00:00
d9576c9440 Fix failures when default is flipped for weights_only (#127627)
Tests on XLA shard not fixed yet but there is an issue here https://github.com/pytorch/xla/issues/7799

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127627
Approved by: https://github.com/albanD
ghstack dependencies: #132349
2024-08-16 00:22:43 +00:00
c8ad5e37e8 Fix all RuntimeErrors during weights_only load from being erroneously reported with the weights_only message (#132349)
Caught in above PR #127627

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132349
Approved by: https://github.com/albanD
2024-08-16 00:22:43 +00:00
0d2be06d94 [export] fix test for training ir migration (#133587)
Summary:
Fix quantization pass to be compatible with the new export IR.

Some nodes might have side-effects, so they don't have users, but still are not removed by the DCE pass.

Test Plan:
CI

buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/export:export_rle_model  -- -r export_rle_model

Differential Revision: D61223356

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133587
Approved by: https://github.com/tugsbayasgalan
2024-08-15 23:55:09 +00:00
eqy
7ad3108ef2 [CUTLASS][FP8] Skip scaled_mm rowwise test on sm89 (#133612)
Rowwise implementation currently uses sm90-specific features incl. TMA
CC @drisspg

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133612
Approved by: https://github.com/Skylion007
2024-08-15 23:43:30 +00:00
413416cf33 [PT2] Consolidate args and kwargs usage in pre_grad passes (#133518)
Summary: With acc_tracer disabled, the generated nodes use `args` instead of `kwargs` as before. The current passes mix usage of `args` and `kwargs`, and normalizing nodes to switch between them can cause subsequent passes to break. In this diff we create a pass that normalizes all nodes to use `kwargs` at the beginning and change all the passes to follow the same convention.
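A minimal sketch of this kind of normalization using FX's schema-based helper (illustrative; not the exact pass in this diff):

```
import torch
import torch.fx

def normalize_to_kwargs(gm: torch.fx.GraphModule) -> torch.fx.GraphModule:
    # Rewrite call_function nodes so that all arguments are passed as kwargs.
    for node in gm.graph.nodes:
        if node.op != "call_function":
            continue
        # normalized_arguments returns an ArgsKwargsPair when the target's schema
        # is known, or None when the node cannot be normalized.
        pair = node.normalized_arguments(gm, normalize_to_only_use_kwargs=True)
        if pair is not None:
            node.args = pair.args
            node.kwargs = pair.kwargs
    gm.recompile()
    return gm
```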

Reviewed By: frank-wei

Differential Revision: D61049898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133518
Approved by: https://github.com/frank-wei
2024-08-15 23:41:39 +00:00
f347174d61 Hipify Pytorch3D (#133343)
Summary:
X-link: https://github.com/fairinternal/pytorch3d/pull/45

X-link: https://github.com/facebookresearch/pytorch3d/pull/1851

Very minor change to extend hipification to a missing hipcub constant. This is needed to hipify some of the kernels in pytorch3d.

Differential Revision: D61171993

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133343
Approved by: https://github.com/houseroad
2024-08-15 23:39:07 +00:00
29c4b4ea5a [executorch] Refactor delegation code (#132773)
Summary: Refactoring partitioner-based delegation to prepare for allowing buffer mutations in the delegate (following diff).

Test Plan: CI

Differential Revision: D60813405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132773
Approved by: https://github.com/ydwu4, https://github.com/cccclai
2024-08-15 22:52:12 +00:00
86aa327e4a [FSDP2] Added eager fast-path for fp32->bf16 param cast (#133369)
Some recommendation models have a high number of `nn.Parameter`s. This exacerbates per-tensor CPU overheads in FSDP2 compared to FSDP1.

This PR adds a fast path for the common bf16/fp32 mixed-precision case of casting the parameters from fp32 to bf16, to reduce CPU overhead and possibly get a more efficient copy.
- Old: `for` loop + `.to(torch.bfloat16)`, incurring dispatcher overhead per parameter
- New: `torch.empty` + `torch.split` + `torch._foreach_copy_`, incurring three dispatches
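A minimal sketch of the two strategies (parameter shapes are illustrative, not the FSDP2 code itself):

```
import torch

params_fp32 = [torch.randn(4, 4), torch.randn(8)]

# Old path: one dispatch per parameter.
casted_old = [p.to(torch.bfloat16) for p in params_fp32]

# New path: allocate once, split into views, then a single fused copy.
numels = [p.numel() for p in params_fp32]
flat_bf16 = torch.empty(sum(numels), dtype=torch.bfloat16)
views = torch.split(flat_bf16, numels)
torch._foreach_copy_(list(views), [p.reshape(-1) for p in params_fp32])
casted_new = [v.view(p.shape) for v, p in zip(views, params_fp32)]
```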

---

Example on Llama3-8B which does not have many `nn.Parameter`s (compared to recommendation models):

(Old) on Llama3-8B (0.46 ms CPU overhead for all-gather):
![Screenshot 2024-08-13 at 6 19 39 PM](https://github.com/user-attachments/assets/e6390e9f-ee54-4208-9d60-9451a4142efa)

(New) on Llama3-8B (0.37 ms CPU overhead for all-gather):
![Screenshot 2024-08-13 at 6 20 32 PM](https://github.com/user-attachments/assets/a5dc1d38-53d2-4984-b3cc-85ce5a538ede)

---

Same example as above but now with float8 all-gather:

(Old) on Llama3-8B with float8 (0.996 ms CPU overhead for all-gather):
![Screenshot 2024-08-15 at 11 27 46 AM](https://github.com/user-attachments/assets/2b7e9c9c-56ea-4375-851e-a2a704689d8d)

(New) on Llama3-8B with float8 (1.014 ms CPU overhead for all-gather):
![Screenshot 2024-08-15 at 11 26 33 AM](https://github.com/user-attachments/assets/160cf8f6-bb97-4633-b802-baeae74e3262)

The times are relatively comparable for float8, with the new approach possibly slightly slower, but this is mainly because for Llama's transformer blocks there are only two norm weights that need to be cast to bf16. These screenshots are mainly to show that the optimization still works in the mixed case.

Differential Revision: [D61236983](https://our.internmc.facebook.com/intern/diff/D61236983)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133369
Approved by: https://github.com/weifengpy
ghstack dependencies: #133498
2024-08-15 22:27:20 +00:00
90d2593b3e Revert #132806, #132736, #132539, #132487 (#133570)
This reverts commit 25df063f044202899ab92d6f3d77950af5de482f.
This reverts commit de00c7958301ce81b9716bdef5731ed40d4d14ca.
This reverts commit 419b76c4ac80c8b1c95120cd52db622333a3a688.
This reverts commit bc57d5b6ff8725bbe93f0e67db72459720c750cf.

Differential Revision: [D61335013](https://our.internmc.facebook.com/intern/diff/D61335013)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133570
Approved by: https://github.com/albanD, https://github.com/jansel, https://github.com/anijain2305
2024-08-15 20:54:21 +00:00
5f1470d45d [export] Add InterpreterModule to trace_rules (#132949)
Summary: Added InterpreterModule to trace_rules so that it can be torch.compiled. Fixes https://github.com/pytorch/pytorch/issues/132921

Test Plan: CI

Differential Revision: D60426372

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132949
Approved by: https://github.com/zhxchen17
2024-08-15 20:46:13 +00:00
09a489b177 Fix serialization for tensor list output (#133539)
Summary: Some elements of a tensor list output do not have a user. In such cases, create a name `{node_name}_unused_{index}` for them.

Test Plan: OSS CI

Differential Revision: D61309011

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133539
Approved by: https://github.com/zhxchen17
2024-08-15 20:31:44 +00:00
cdf217cda1 Disable distributed nccl tests to unblock Amazon2023 ami upgrade (#133355)
These tests keep failing on the Linux Amazon 2023 AMI.  The distributed team is looking into them, but until then, disabling the tests in order to unblock the AMI upgrade

Examples of the failures:
Failure 1: https://github.com/pytorch/pytorch/actions/runs/10047579686/job/27770963175
```
FAILED [90.0880s] distributed/test_c10d_nccl.py::NCCLTraceTestDumpOnTimeout::test_timeout_dumps_timing_enabled_False - AssertionError: None mismatch: None is not -6
```

Failure 2: https://github.com/pytorch/pytorch/actions/runs/10047579686/job/27770963494
```
____ NCCLTraceTestTimeoutDumpOnStuckRanks.test_timeout_dumps_on_stuck_ranks ____
Traceback (most recent call last):
  File "/var/lib/jenkins/workspace/test/distributed/test_c10d_nccl.py", line 4214, in test_timeout_dumps_on_stuck_ranks
    self.assertEqual(self._wait_process(0, timeout=90), -6)
  File "/opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3721, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: None mismatch: None is not -6
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133355
Approved by: https://github.com/kit1980, https://github.com/wconstab
2024-08-15 20:15:00 +00:00
161cc137d2 [DTensor] Add naive replicate strategy for aten.triu.default and aten.tril.default (#133545)
Shampoo uses triu and tril [here](https://github.com/facebookresearch/optimizers/blob/main/matrix_functions.py#L63). As the matrix input is replicated, we register the naive replicate strategy to unblock.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133545
Approved by: https://github.com/awgu
2024-08-15 20:05:03 +00:00
99cf567714 Make SCRIBE_GRAPHQL_ACCESS_TOKEN available to test jobs running on main (#133536)
It is possible to write to Meta's internal in-memory database Scuba via the Scribe Graph API: https://www.internalfb.com/intern/wiki/Scribe/users/Knowledge_Base/Interacting_with_Scribe_categories/Graph_API/ This is currently being used by pytorch/benchmark repo to upload torchbench performance results.

I want to make this API generally available to all jobs running on CI in a semi-trusted context. To talk to Scribe, you need a secret access token. I have initially configured an environment prod-branch-main which contains `SCRIBE_GRAPHQL_ACCESS_TOKEN`, and switched a single class of jobs (linux-test) to use this environment when they are running on the main branch. Because we require approvals for running CI on untrusted contributions, we could potentially allow all jobs to run in this environment, including jobs on PRs, but I don't need this for my use case (per-PR benchmark result reporting, and miscellaneous statistics on main.)

If this works, I'll push out this environment to the rest of our test jobs.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133536
Approved by: https://github.com/xuzhao9, https://github.com/malfet, https://github.com/albanD
2024-08-15 19:53:17 +00:00
5dfb22d4c8 AutoHeuristic: tests (#133496)
This PR adds tests to AutoHeuristic that ensure that when existing heuristics are re-generated, the generated code stays the same.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133496
Approved by: https://github.com/eellison
2024-08-15 19:22:44 +00:00
7673ee5456 remove benchmarks/__init__.py (#133390)
trying to address https://github.com/pytorch/pytorch/issues/133377

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133390
Approved by: https://github.com/kit1980, https://github.com/malfet, https://github.com/ezyang
2024-08-15 19:08:10 +00:00
dff388491b Fix docs for L1Loss and MSELoss (#133501)
The total number of elements is `N` not `n`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133501
Approved by: https://github.com/mikaylagawarecki
2024-08-15 18:56:55 +00:00
cyy
27538671ae Enable clang-tidy coverage on torch/*.h (#133422)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133422
Approved by: https://github.com/albanD, https://github.com/Skylion007
2024-08-15 18:52:08 +00:00
4aa66f68a8 [CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade (#131493)
Unblocks/unbreaks against newer CUTLASS (3.5+)

CC @nWEIdia @xwang233 @ptrblck @thakkarV

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131493
Approved by: https://github.com/Skylion007
2024-08-15 18:33:22 +00:00
41d6cabca1 [c10d]Control logging c++ traces with a flag (#133490)
Summary:
Logging C++ stack traces occasionally races with shutdown processes on exception. It isn't safe and we've seen SIGSEGVs in the field.
These crashes prevent flight recorder dumps from completing.

For now, default this dumping to `true` and provide a knob if we need to control things in production.

Test Plan:
Tested locally on a job named `torchx-chirag_test_run` to make sure that the JK was honored by the code.
It was correctly disabled on my test job.
see (TORCH_NCCL_LOG_CPP_STACK_ON_EXCEPTION: 0) below.

```
] [trainer2]:I0814 11:21:20.152419  3708 ProcessGroupNCCL.cpp:874] [PG ID 0PG GUID 0 Rank 10] ProcessGroupNCCL environments: NCCL version: 2.20.3, TORCH_NCCL_ASYNC_ERROR_HANDLING: 1, TORCH_NCCL_DUMP_ON_TIMEOUT: 1, TORCH_NCCL_WAIT_TIMEOUT_DUMP_MILSEC: 60000, TORCH_NCCL_DESYNC_DEBUG: 0, TORCH_NCCL_ENABLE_TIMING: 0, TORCH_NCCL_BLOCKING_WAIT: 0, TORCH_DISTRIBUTED_DEBUG: OFF, TORCH_NCCL_USE_TENSOR_REGISTER_ALLOCATOR_HOOK: 0, TORCH_NCCL_ENABLE_MONITORING: 0, TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC: 480, TORCH_NCCL_TRACE_BUFFER_SIZE: 2000, TORCH_NCCL_COORD_CHECK_MILSEC: 1000, TORCH_NCCL_NAN_CHECK: 0, TORCH_NCCL_LOG_CPP_STACK_ON_EXCEPTION: 0
```

Differential Revision: D61283335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133490
Approved by: https://github.com/fduwjj
2024-08-15 18:25:02 +00:00
546c53b784 Bump max runners for linux.24xlarge to 500 (#133569)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133569
Approved by: https://github.com/ZainRizvi
2024-08-15 18:02:46 +00:00
59b3f5911d [sigmoid] Support custom obj deserialization. (#133463)
Summary:
It seems we have multiple places deserializing torchbind objects. Moving the code around so that every load essentially share the same implementation.

Also added a test case "package_reader_testing" which load back the archive file in Python and eagerly validate the numerical result.

Test Plan: buck test mode/opt sigmoid/inference/test:e2e_test_cpu

Reviewed By: SherlockNoMad

Differential Revision: D61235770

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133463
Approved by: https://github.com/ydwu4
2024-08-15 17:58:44 +00:00
5ec9c0bc4a Fix linearize(grad(...)) call (#133364)
Fixes #124550

Also moves `graph.eliminate_dead_code()` call to a few lines after
`_inline_module(...)` in `const_fold.py`

* Test plan:

Add a new test on `test_eager_transforms.py` to ensure the reported
issue was indeed fixed

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133364
Approved by: https://github.com/zou3519
2024-08-15 17:55:36 +00:00
cfec69e2a1 Revert "Update fused kernels and call _safe_softmax from SDPA (#131863)"
This reverts commit caba37e99b03d2199848197de4e452b78c8c2a23.

Reverted https://github.com/pytorch/pytorch/pull/131863 on behalf of https://github.com/izaitsevfb due to breaks executorch test executorch/backends/apple/coreml:test - test_vit_skip_conv (executorch.backends.apple.coreml.test.test_coreml_partitioner.TestCoreMLPartitioner) ([comment](https://github.com/pytorch/pytorch/pull/131863#issuecomment-2291855634))
2024-08-15 17:55:07 +00:00
d3b458e603 [export] Do not use export.export for capture_pre_autograd_graph (#133370)
Summary:
Do not use export.export for `capture_pre_autograd_graph` in unittests anymore.

#buildall

Test Plan: CI

Reviewed By: tugsbayasgalan

Differential Revision: D60996041

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133370
Approved by: https://github.com/tugsbayasgalan
2024-08-15 17:37:45 +00:00
2236194c6b [traced-graph][sparse] cleanup test guards (#133375)
Rather than repeating the same guard for every test, simply express it once on the test fixture instead.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133375
Approved by: https://github.com/ezyang
2024-08-15 17:32:06 +00:00
a7c6e30a3f [c10d][ez] Add space between PG ID and PG UID (#133497)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133497
Approved by: https://github.com/shengbao-zheng, https://github.com/wz337
2024-08-15 17:20:12 +00:00
018e48c337 [Reland] Add wrappers for synchronous GPUDirect Storage APIs (#133489)
Reland #130633

USE_CUFILE turned off by default in this version
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133489
Approved by: https://github.com/albanD
2024-08-15 17:11:52 +00:00
c23dceb8f1 Add Adafactor foreach impl (#132336)
This PR adds the foreach impl for Adafactor knowing that there are many ways to improve its runtime perf today (by adding more foreach support). After this PR:
- we have a foreach flag for Adafactor
- It is NOT the default. Why not? It is only slightly faster + uses O(n) more memory where n is the number of params in your max param group. People tend to use Adafactor for memory efficiency.
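A minimal usage sketch (model and inputs are illustrative; assumes the optimizer is exposed as `torch.optim.Adafactor` with the new `foreach` flag):

```
import torch

model = torch.nn.Linear(8, 8)
# Opt in explicitly; foreach trades O(n) extra memory (n = number of params in
# the largest param group) for a modest speedup, so it is not the default.
optim = torch.optim.Adafactor(model.parameters(), foreach=True)

model(torch.randn(4, 8)).sum().backward()
optim.step()
```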

Next steps:
- make torch.compile possible on it
- make it faster (by adding more foreach apis)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132336
Approved by: https://github.com/albanD
ghstack dependencies: #133360
2024-08-15 17:00:33 +00:00
3434a54fba [CP] Rewrite ring attention backward algorithm and enablement APIs (#131351)
**What does this PR achieve**
1. This PR rewrites the ring attention backward algorithm to fuse the alltoall and overlap the gradient communication with computation.

2. Enables memory-efficient attention with CP by templating the ring attention backward; verifying the accuracy in fp32 gives us higher confidence in the correctness of the implementation.

3. Provides some experimental APIs to enable context parallelism.

4. Ensures CP works with torch.compiler. The combination of causal masking and torch.compiler has not yet worked.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131351
Approved by: https://github.com/wanchaol
2024-08-15 16:41:51 +00:00
7470ae85e4 Fix triton codegen with math.trunc (#133354)
Fixes https://github.com/pytorch/pytorch/issues/133172

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133354
Approved by: https://github.com/ezyang, https://github.com/jansel
2024-08-15 16:38:26 +00:00
fc5aa24a6e Rewording doc string for clip_grad_norm_ (#133406)
The docstring for `torch.nn.utils.clip_grad_norm_` needed some clarity; it was previously unclear that the total norm is computed over the norms of the individual parameter gradients.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133406
Approved by: https://github.com/mikaylagawarecki
2024-08-15 16:21:27 +00:00
a75248528f [export] refactor _process_dynamic_shapes (#133391)
Sorryyyyy for another refactor. This splits `_process_dynamic_shapes` into 3 parts:
1. `_combine_args` - mostly the same thing
2. `_check_dynamic_shapes`, which is responsible for raising 99% of UserErrors if the dynamic shapes spec is invalid (minus 1 UserError with DerivedDims)
3.  `_process_dynamic_shapes`, which for now, is the same thing, minus the stuff in 2.

This refactor is helpful for incoming automatic dynamic shapes work, because, we're switching to `assume_static_by_default=False`, which is what `_dynamo.export` currently does. This means any unspecified dims are allocated a symbol, in contrast to export today which keeps unspecified dims static. Historically this has been desirable - export users don't want too much dynamism. So we want to change how the spec is translated into constraints.

This means when we switch over to automatic dynamic shapes, we want to plug in something in between steps 2. and 3. which patches up the spec for `assume_static_by_default=False`, filling in static shapes for any unspecified dims, and potentially clearing out the auto-dynamic dims (since they're no-ops). We would do this in-between 2. and 3. to keep `_process_dynamic_shapes` semantically the same, since it's used with `_dynamo.export`.

We could do this without a refactor, plugging in this transform before `_process_dynamic_shapes`, but since that function's responsible for both spec checking + constraint production, moving spec checking to before we transform the specs helps guarantee we're raising errors on what the user's specified, and not an internal export bug.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133391
Approved by: https://github.com/avikchaudhuri
2024-08-15 16:21:21 +00:00
dd6ce2fe7c Restore mixed dtypes GEMM auto-tuning for Ampere (#129058)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129058
Approved by: https://github.com/kadeng
2024-08-15 15:56:09 +00:00
758a0a88a2 [BE][Easy] enable ruff rule PIE790: unnecessary pass statement (#133200)
This PR removes unnecessary `pass` statements. This is semantically safe because the bytecode for the Python code does not change.

Note that if there is a docstring in the function, an empty function does not need a `pass` statement as a placeholder.
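For example:

```
def noop() -> None:
    """The docstring alone is a valid function body, so no trailing `pass` is needed."""
```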

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133200
Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/kit1980
2024-08-15 15:50:19 +00:00
57d1ffc512 Ignore torch.onnx._internal in test_circular_dependencies (#133110)
Ignore the whole `_internal` module as code will depend on onnxscript and onnx.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133110
Approved by: https://github.com/titaiwangms, https://github.com/malfet
2024-08-15 15:37:24 +00:00
a6ad834fa8 Fix counting execution time in run_test.py (#133199)
Computing `elapsed_time` immediately after `start_time` does not reflect the real execution time of `test_batch`.

Move `elapsed_time` and the print call to after the `run_tests` method call to fix it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133199
Approved by: https://github.com/clee2000
2024-08-15 15:29:44 +00:00
ec49ce5f8e [CUDA]: Add frexp CUDA bfloat16 support (#133313)
Fixes #133263. Add CUDA bfloat16 support to cuda_frexp.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133313
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-08-15 15:20:00 +00:00
59e33cd1f4 [FSDP2] Set ctx.set_materialize_grads(False) for post-backward (#133498)
https://pytorch.org/docs/stable/generated/torch.autograd.function.FunctionCtx.set_materialize_grads.html
This avoids unnecessary `aten::zeros` calls for the inputs in the post-backward custom autograd backward. We do not need the gradient values for the post-backward logic.
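A minimal sketch of the pattern (an illustrative custom Function, not the FSDP2 hook itself):

```
import torch

class PassThrough(torch.autograd.Function):
    @staticmethod
    def forward(ctx, *inputs):
        # Tell autograd not to materialize zero-filled grads for missing grad_outputs.
        ctx.set_materialize_grads(False)
        return inputs

    @staticmethod
    def backward(ctx, *grad_outputs):
        # Entries may now be None instead of freshly allocated aten::zeros tensors.
        return grad_outputs
```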

Differential Revision: [D61291210](https://our.internmc.facebook.com/intern/diff/D61291210)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133498
Approved by: https://github.com/weifengpy
2024-08-15 14:58:26 +00:00
07adae3dac Revert "Make FX Graph Cache work with distributed training (#133374)"
This reverts commit dcdb25453e0ddc6a83e0052fffc544d4d03cdffd.

Reverted https://github.com/pytorch/pytorch/pull/133374 on behalf of https://github.com/albanD due to Broke trunk ([comment](https://github.com/pytorch/pytorch/pull/133374#issuecomment-2291289260))
2024-08-15 13:43:16 +00:00
32d890745d Revert "Add cache timings info to tlparse (#133504)"
This reverts commit 7eb31e5023fa16c51a984257ee7ee4e17fb3c682.

Reverted https://github.com/pytorch/pytorch/pull/133504 on behalf of https://github.com/albanD due to Broke trunk ([comment](https://github.com/pytorch/pytorch/pull/133374#issuecomment-2291289260))
2024-08-15 13:43:16 +00:00
bbddde311a Migrate inductor jobs to runner determinator (#133457)
Updates inductor jobs to use the runner determinator script.

Depends-On: pytorch/pytorch#133352
Closes: pytorch/ci-infra#257
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133457
Approved by: https://github.com/ZainRizvi
2024-08-15 12:16:39 +00:00
9876aa39c0 AutoHeuristic: pad_mm documentation (#133411)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133411
Approved by: https://github.com/Chillee
ghstack dependencies: #133409, #133410
2024-08-15 10:49:56 +00:00
f32a9e953f AutoHeuristic: mixed_mm documentation (#133410)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133410
Approved by: https://github.com/Chillee
ghstack dependencies: #133409
2024-08-15 10:49:56 +00:00
142353eca3 AutoHeuristic: util scripts (#133409)
This PR introduces scripts that make it easier to use autoheuristic:
- `collect_data.sh`: The user can specify things like the number of GPUs to be used and the number of training samples to collect. This script will open one tmux pane per GPU and collect num_training_samples/num_gpus samples per GPU.
- `merge_data.py`: This script can be used to merge multiple training data files into a single file.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133409
Approved by: https://github.com/Chillee
2024-08-15 10:49:56 +00:00
b0fc6aa412 fix a typo in the householder_product docs (#124279)
The function argument is A, not V.

The remaining inconsistency is the matrix $A$ with columns $v_i$.
It seems a better solution would be to rename the argument $A \rightarrow V$, but this might lead to backward-compatibility issues.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124279
Approved by: https://github.com/lezcano
2024-08-15 09:34:17 +00:00
b6335cfeab Add an option to use do_bench_using_profiling in TORCHINDUCTOR_PROFILE (#133523)
When I did profiling using the "TORCHINDUCTOR_PROFILE" option, some kernels showed less bandwidth than expected, so I added an option to exclude the CPU overhead from the profiling time:

```
# With the option:
(pytorch-3.10) [shuqiyangdevgpu001.lla3 ~/local/pytorch (gh/shunting314/144/head)]$ TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_WITH_DO_BENCH_USING_PROFILING=1 TORCHINDUCTOR_PROFILE_OUTPUT=/tmp/profile.txt python ../test_pt/a.py
0.038ms         0.067 GB         1777.11GB/s     triton_poi_fused__to_copy_clamp_clone_mul_0
SUMMARY (/tmp/torchinductor_shuqiyang/tmp03wdg8e4/m6/cm6vdqp62ofwsone3u3fmb42vs3fti5omseo3qn4ddh2bhalsvbn.py)
0.04ms           0.07 GB         1777.11GB/s

# Without the option:
(pytorch-3.10) [shuqiyangdevgpu001.lla3 ~/local/pytorch (gh/shunting314/144/head)]$ TORCHINDUCTOR_PROFILE=1 TORCHINDUCTOR_PROFILE_OUTPUT=/tmp/profile.txt python ../test_pt/a.py
0.040ms         0.067 GB         1663.09GB/s     triton_poi_fused__to_copy_clamp_clone_mul_0
SUMMARY (/tmp/torchinductor_shuqiyang/tmpwr6rraao/s4/cs4npkh77myatwpcmsizyduyfm6ne6o4pg4n3eodejdvvg2j3xzd.py)
0.04ms           0.07 GB         1663.09GB/s
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133523
Approved by: https://github.com/nmacchioni
2024-08-15 09:27:11 +00:00
cf1fc07bd4 [DTensor][Easy] Minor fix to Partial Placement Docstring (#133149)
Minor doc fix: The reduce op string for product should be "product" instead of "prod".
https://github.com/pytorch/pytorch/blob/main/torch/distributed/_functional_collectives.py#L1045

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133149
Approved by: https://github.com/XilunWu, https://github.com/tianyu-l
2024-08-15 08:09:30 +00:00
e6272acaec C++ network flow implementation in c10 (#132188)
The functorch partitioners use network flow to split the joint graph into a forward and backward graph. Internally, we've found that upgrading to networkx 2.8.8 (from 2.5) results in some hard-to-debug failures (internal reference: https://fburl.com/workplace/jrqwagdm). And I'm told that there's interest to remove the python dependency.

So this PR introduces a C++ implementation that mirrors the API provided by networkx. We'll need to add python bindings and do some additional testing to verify correctness.

Differential Revision: [D61284135](https://our.internmc.facebook.com/intern/diff/D61284135)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132188
Approved by: https://github.com/Chillee
2024-08-15 07:32:51 +00:00
c88174df95 typing for remote_cache (#133446)
Summary:
typing annotations for remote_cache
Redo of #133299 with fixed annotations.

Test Plan: unit tests

Differential Revision: D61271883

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133446
Approved by: https://github.com/oulgen
2024-08-15 06:36:13 +00:00
7eb31e5023 Add cache timings info to tlparse (#133504)
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpLR1T85/rank_1/0_0_0/fx_graph_cache_hash_11.json

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133504
Approved by: https://github.com/jamesjwu
ghstack dependencies: #133362, #133363, #133374
2024-08-15 05:53:00 +00:00
448d54ee92 AutoHeuristic: instructions (#132894)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132894
Approved by: https://github.com/Chillee
2024-08-15 04:54:54 +00:00
8624a571b4 [Inductor][CPP] Support vectorization of remainder (#129849)
**Summary**
When checking the vectorization status among 3 test suites, we found that some operators disabled vectorization with the message `Disabled vectorization: op: remainder`. In this PR, we add vectorization support for this op.
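A minimal repro-style sketch of the op that now vectorizes in the CPP backend (shapes are illustrative):

```
import torch

@torch.compile
def f(x, y):
    # Previously reported as "Disabled vectorization: op: remainder".
    return torch.remainder(x, y)

out = f(torch.randn(1024), torch.full((1024,), 3.0))
```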

**Test Plan**
```
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_vec_remainder
python -u -m pytest -s -v test/inductor/test_cpu_repro.py -k test_int_div_vec
```

Differential Revision: [D61147014](https://our.internmc.facebook.com/intern/diff/D61147014)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129849
Approved by: https://github.com/jgong5, https://github.com/lezcano
2024-08-15 02:06:30 +00:00
1120b5ab55 Revert "[CI] Change inductor-perf-test-nightly naming (#131476)"
This reverts commit 86cb24e6ebf1b85840568fbc62d22629abaf5739.

Reverted https://github.com/pytorch/pytorch/pull/131476 on behalf of https://github.com/desertfire due to manually trigged dashboard run failed ([comment](https://github.com/pytorch/pytorch/pull/131476#issuecomment-2290224084))
2024-08-15 01:18:06 +00:00
c2b2969b5d made some args optional in create_mask (#133413)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133413
Approved by: https://github.com/yanboliang, https://github.com/drisspg
2024-08-15 00:34:55 +00:00
8676401707 [MPS] Enable MPS mm from macOS >= 14.4 (#133494)
Summary of changes:
- [MPS] Enable MPS `mm` op from macOS >= 14.4. Previously this was disabled in https://github.com/pytorch/pytorch/pull/117549 as it was causing crashes with large matrices.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133494
Approved by: https://github.com/malfet
2024-08-15 00:25:22 +00:00
dcdb25453e Make FX Graph Cache work with distributed training (#133374)
During distributed training, if all ranks except one hit the cache, the rank that did not hit the cache will cause an NCCL timeout, since the rest of the ranks will enter the collective and start the timer. This PR uses the new PTD API to increase the timeout for the ranks that hit the cache by the amount of time the cache would save.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133374
Approved by: https://github.com/ezyang
ghstack dependencies: #133362, #133363
2024-08-14 22:58:48 +00:00
6d4287419c [ONNX] Disable op_level_debug tests (#133485)
op_level_debug is being deprecated. So we disable the tests.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133485
Approved by: https://github.com/titaiwangms
2024-08-14 22:02:12 +00:00
7a74294786 [sparse] enable meta tests (#133379)
The skip for dynamo is no longer needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133379
Approved by: https://github.com/ezyang
2024-08-14 21:58:23 +00:00
3965f11837 Minor type annotation updates following up D60954888 (#133382)
Summary: As title.

Test Plan:
CI

Ran lintrunner locally but might have to continue to keep an eye on more oss linting issue if comes up.

Differential Revision: D61240900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133382
Approved by: https://github.com/ColinPeppler
2024-08-14 21:36:42 +00:00
d8c494910b [EZ] Enable explicitly opting into the old Linux Amazon 2 ami - Pt 1 (#133469)
For the next phase of the Amazon 2023 migration we'll be bulk migrating the remaining jobs over to the new AMI by changing the default AMI that we use.

In preparation for that, we're adding the old Linux Amazon 2 ami as a fixed variant for runners, so that if any of the less frequently jobs breaks on Amazon 2023 AMI then they can shift to explicitly using the Amazon 2 AMI temporarily while the underlying problem is debugged and fixed.

This PR is part 1, and there's a corresponding scale config PR in test-infra: https://github.com/pytorch/test-infra/pull/5551
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133469
Approved by: https://github.com/clee2000
2024-08-14 21:33:02 +00:00
3fc9ee5a31 [DeviceMesh] Directly retrieve flattened mesh if already created (#133195)
Add mapping to keep track of root_to_flatten relationship and directly retrieve the flattened mesh if already created (no pg creation).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133195
Approved by: https://github.com/fegin, https://github.com/wanchaol
ghstack dependencies: #133193
2024-08-14 21:11:04 +00:00
44eaef46d0 [DCP] Fix meta tensor loading (#133256)
We realized the fix in https://github.com/pytorch/pytorch/pull/129683 for loading the learning rate in place actually broke meta tensor initialization. After PR #129683, the learning rate loads correctly, but params with meta tensors are still uninitialized.

We cannot use `tree_map_only_` to iterate over the state_dict for in-place initialization, as `empty_like` and `to("cuda")` are both not in-place operations. More context in https://github.com/pytorch/pytorch/issues/130709. Therefore, with the changes in https://github.com/pytorch/pytorch/pull/129683, the tensors after loading are still meta tensors. We previously did not catch that because `self.assertEqual()` does not distinguish a DTensor from a meta DTensor.

In this PR, we added an _iterate_state_dict() function to implement in-place updates for the state_dict and updated the test to ensure that the params are no longer meta tensors after loading.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133256
Approved by: https://github.com/fegin
2024-08-14 21:07:11 +00:00
c0be0105c7 [aarch64] Replace OpenBLAS with NVPL in cuda arm docker (#132811)
Add NVPL to CUDA ARM docker build

original https://github.com/pytorch/builder/pull/1928 moving to pytorch/pytorch repo now

Need to go with builder repo change https://github.com/pytorch/builder/pull/1950

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132811
Approved by: https://github.com/atalman
2024-08-14 21:01:50 +00:00
2e8c1be947 Update date for 2.5 in RELEASE.md (#133503)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133503
Approved by: https://github.com/atalman
2024-08-14 20:45:58 +00:00
86cb24e6eb [CI] Change inductor-perf-test-nightly naming (#131476)
Summary: To make it consistent with inductor-perf-test-nightly-x86
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131476
Approved by: https://github.com/huydhn, https://github.com/zou3519
2024-08-14 20:42:15 +00:00
bedf96d7ff [AOTI] Switch fbcode HIP to C shim version v2 (#133105)
Summary: Completely switch over the default value of c_shim_version to 2

Test Plan: CI

Differential Revision: D60674018

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133105
Approved by: https://github.com/ColinPeppler, https://github.com/zoranzhao
2024-08-14 19:39:10 +00:00
6980e9e569 [AOTI] Disable split_cat_aten passes (#133014)
Summary: disable passes with negative performance impact

Test Plan: run UT

Differential Revision: D60970288

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133014
Approved by: https://github.com/frank-wei
2024-08-14 19:36:17 +00:00
63e5b09218 Add unit test for asymmetric compilation (#133363)
Unit test for asymmetric compilation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133363
Approved by: https://github.com/jamesjwu
ghstack dependencies: #133362
2024-08-14 19:32:18 +00:00
6f51782a59 Add comptime.sleep (#133362)
Add comptime.sleep for NCCL timeout testing. The unit test is not great.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133362
Approved by: https://github.com/jamesjwu
2024-08-14 19:32:18 +00:00
cf81180007 allow SubConfigProxy of arbitrary depth (#133418)
Before, having arbitrary depth nested configs like

```
class Foo:
    foo: List[int] = [1, 2, 3]
    class Bar:
        bar: str = "1"
        class Baz:
            baz: int = 1
```

would cause problems beyond the first layer. For example, if we tried

```
from torch._inductor import config as inductor_config

print(inductor_config.Foo)
print(repr(inductor_config.Foo.foo))
print(inductor_config.Foo.Bar)
print(repr(inductor_config.Foo.Bar.bar))
print(inductor_config.Foo.Bar.Baz)
print(repr(inductor_config.Foo.Bar.Baz.baz))
```

we would get some output like

```
<torch.utils._config_module.SubConfigProxy object at 0x7fac65de00a0>
[1, 2, 3]
...
AttributeError: torch._inductor.config.Foo.Bar does not exist
```

Obviously, this is not what we want. With these changes, we get the right values

```
<torch.utils._config_module.SubConfigProxy object at 0x7f840d05bf40>
[1, 2, 3]
<torch.utils._config_module.SubConfigProxy object at 0x7f840cedc940>
'1'
<torch.utils._config_module.SubConfigProxy object at 0x7f840cedc100>
1
```

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133418
Approved by: https://github.com/oulgen
2024-08-14 18:43:00 +00:00
d46e0761ca Revert "[11/N] Fix clang-tidy warnings in aten/src/ATen (#133298)"
This reverts commit 35785984013a74469de8c1d29eaecb25aa0c141e.

Reverted https://github.com/pytorch/pytorch/pull/133298 on behalf of https://github.com/izaitsevfb due to causes build time regression in aten/src/ATen/native/cpu/ReduceOpsKernel.cpp ([comment](https://github.com/pytorch/pytorch/pull/133298#issuecomment-2289453440))
2024-08-14 17:47:12 +00:00
07c73a964b [MPS][BE] Delete MacOS-12.3 specific checks (#133141)
And make MPS device unavailable on Sonoma releases, as lots of those checks are 2 years old, are no longer validated in CI, and probably many more such checks are missing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133141
Approved by: https://github.com/kulinseth, https://github.com/clee2000, https://github.com/atalman
2024-08-14 17:42:40 +00:00
7b269cc484 [TD] llm retrieval to not use bash -l {0} (#133464)
https://github.com/pytorch/pytorch/pull/129720 swapped the action used to setup miniconda from [conda incubator](https://github.com/conda-incubator/setup-miniconda) to the [custom action](2aba8f107a/.github/actions/setup-miniconda/action.yml (L1)) we have in test-infra that comes with caching.

The original miniconda [relies on bash profiles](e5293c8fd2/README.md (L746)) to set the environment variables needed to run conda, but the test infra version relies on the user using the env vars that are set during the step.

This PR changes the job to not use `bash -l {0}` to see if not activating the bash profile has an effect on the run. Unfortunately this failure happens rarely on main, so I'm not sure I will be able to see if this has an effect. On the plus side, changing this doesn't seem to have a negative effect on the job, so it should be a noop at worst.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133464
Approved by: https://github.com/kit1980
2024-08-14 16:53:41 +00:00
4bb1650ca3 Bump maxinum num warps (#132458)
Fix for https://github.com/pytorch/pytorch/issues/129104

Our heuristic for num_warps was giving the optimal number, but we were capping maximum num_warps at 8. Gives 1% speedup on HF and TIMM in inference, 2% speedup in TIMM training, neutral otherwise.

Ultimately, I think we want live-variable analysis for register usage; still worth landing this now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132458
Approved by: https://github.com/Chillee, https://github.com/shunting314
2024-08-14 16:51:05 +00:00
d114fd78bd [FSDP2] Enable HSDP + TP (#133335)
This PR enables HSDP + TP

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133335
Approved by: https://github.com/awgu
2024-08-14 16:34:04 +00:00
7f40ac9be2 Migrate periodic jobs to use runner determinator (#133124)
This updates the Linux & Windows jobs in periodic.yml to use the runner determinator script.

Closes: pytorch/ci-infra#261

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133124
Approved by: https://github.com/ZainRizvi
2024-08-14 16:04:15 +00:00
118b2a4139 Convert inductor jobs to Linux Amazon 2023 (#133352)
A continuation of the migration started in
- https://github.com/pytorch/pytorch/pull/131250
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133352
Approved by: https://github.com/zxiiro, https://github.com/seemethere
2024-08-14 15:59:33 +00:00
62cd065de2 Validate that node TK_ASSIGN have field initialized (#127878)
Fixes segmentation fault during model load via C++ API.

An `Assign` statement (`TK_ASSIGN` type) has 3 fields: `lhs`, `rhs` and `type`. Field `type` is of type `Maybe`, which means it may not be present. During model load in `import_source.cpp`, field `type` is dereferenced without validation.

This is similar to the error fixed in #106041.

Fixes #127877

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127878
Approved by: https://github.com/malfet
2024-08-14 15:27:58 +00:00
e554f71d7e Implement filter in dynamo (#131674)
Fixes https://github.com/pytorch/pytorch/issues/128944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131674
Approved by: https://github.com/amjames, https://github.com/jansel
2024-08-14 14:54:13 +00:00
854a5ba958 [lint] fix lint broken by #131912 (#133428)
lint

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133428
Approved by: https://github.com/aaronenyeshi
2024-08-14 14:50:18 +00:00
378b12f3ad Improve namespace for c10::MemoryFormat::Contiguous in torchgen/api/cpp.py (#131622)
Top-level namespaces are more convenient for out-of-tree device extensions.

For example, now we have a patch for it in `torch_npu`:

98c50ced16/codegen/gen_backend_stubs.py (L772-L778)

```python
JIT_TO_CPP_DEFAULT["contiguous_format"] = "c10::MemoryFormat::Contiguous"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131622
Approved by: https://github.com/zou3519
2024-08-14 14:41:01 +00:00
efc6e8457a [inductor] [cpp] fix the reindexer from template_buffer to Y (#133070)
This PR fixes the accuracy of jx_nest_base and part of the accuracy issue of convnext_base on the max-autotune path. Another fix (https://github.com/pytorch/pytorch/pull/133073 in this ghstack) is needed to make convnext_base fully pass the accuracy check.

The index calculated via the reindexer was wrong before this PR. Both the shape of the reshape reindexer and the stride order of the stride reindexer need to be fixed.

Index calculated before this PR:
```
# in_ptr4 points to arg4_1: size = (1, 32, 18, 18), stride = (10368, 1, 576, 32))
auto tmp7 = in_ptr4[static_cast<long>((32L*(static_cast<long>((n_start + x1 + (32L*m_start) + (32L*x0))) % static_cast<long>(18L))) + (576L*(static_cast<long>(c10::div_floor_integer((n_start + x1 + (32L*m_start) + (32L*x0)), 324L)) % static_cast<long>(32L))))];
```

The correct one after the fix is:
```
auto tmp7 = in_ptr4[static_cast<long>(n_start + x1 + (32L*(static_cast<long>((m_start + x0)) % static_cast<long>(324L))))];
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133070
Approved by: https://github.com/jgong5
2024-08-14 11:42:03 +00:00
52741043e7 [Inductor][FlexAttention] Support non-divisible sequence lengths (#133019)
Perf benchmark script: https://gist.github.com/yanboliang/7c34a82df611d4ea8869cb9e041bfbfc
* Update ```Q_LEN``` and ```KV_LEN``` to ```8192-9``` for testing non-divisible cases.

Run ```python perf_bench.py --partial-mask```.

* Before this PR

| Sequence length        | Forward | Backward |
|---------------------|-----------------|------------------|
| **Divisible(8192)**       | 0.87            | 0.85             |
| **Non-divisible(8192-9)**   | N/A            | N/A             |

* After this PR

| Sequence length        | Forward | Backward |
|---------------------|-----------------|------------------|
| **Divisible(8192)**       | 0.87            | 0.85             |
| **Non-divisible(8192-9)**   | 0.83            | 0.78             |

Memory out of bounds check passed:
* ```PYTORCH_NO_CUDA_MEMORY_CACHING=1 compute-sanitizer --tool memcheck python perf_bench.py --partial-mask```
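A minimal sketch of the newly supported non-divisible case (shapes and dtypes are illustrative; assumes the `flex_attention` API from `torch.nn.attention.flex_attention`):

```
import torch
from torch.nn.attention.flex_attention import flex_attention

B, H, S, D = 1, 4, 8192 - 9, 64   # sequence length not divisible by the block size
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.float16) for _ in range(3))

out = torch.compile(flex_attention)(q, k, v)
```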

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133019
Approved by: https://github.com/Chillee
2024-08-14 10:27:39 +00:00
b5711297a0 Add support for SetVariable.discard (#133317)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
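A minimal sketch of what this enables (illustrative; `fullgraph=True` assumes `discard` now traces without a graph break):

```
import torch

@torch.compile(fullgraph=True)
def f(x):
    skip = {0, 1}
    skip.discard(1)       # handled by dynamo's SetVariable after this change
    return x + len(skip)

print(f(torch.zeros(2)))  # tensor([1., 1.])
```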

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133317
Approved by: https://github.com/Skylion007
2024-08-14 09:10:36 +00:00
ef580a0e5c [DeviceMesh] Restrict slicing to be a contiguous or non-contiguous subsequence of the root mesh_dim_names (#133193)
This PR adds restrictions for DeviceMesh slicing: no out-of-order subsequence slicing is allowed. To create flattened mesh_dim_names, only in-order slicing is allowed.

```
mesh_3d = init_device_mesh(
    self.device_type, (2,2,2), mesh_dim_names=("dp", "cp", "tp"),
)

# valid 2d slicing
mesh_2d = mesh_3d["dp", "cp"]
mesh_2d = mesh_3d["dp", "tp"]
mesh_2d = mesh_3d["cp", "tp"]

# invalid 2d slicing
mesh_2d = mesh_3d["cp", "dp"]
mesh_2d = mesh_3d["tp", "cp"]
mesh_2d = mesh_3d["tp", "dp"]

# valid way to create dp_cp flatten slice
dp_cp_mesh = mesh_3d["dp", "cp"]._flatten()
# invalid way to create dp_cp flatten slice
dp_cp_mesh = mesh_3d["cp", "dp"]._flatten()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133193
Approved by: https://github.com/fegin, https://github.com/wanchaol
2024-08-14 07:18:41 +00:00
d143f879e2 [DTensor] Add more aten._foreach ops to _pointwise_ops.py (#133271)
Fixes #ISSUE_NUMBER

Follow up for https://github.com/pytorch/pytorch/pull/132056. Added the missing foreach ops pointed out by @ad8e.

```
_foreach_sub.Scalar
_foreach_exp
_foreach_exp_
_foreach_cos_
_foreach_log_
```

As @ad8e mentioned, since the list of _foreach ops at https://pytorch.org/cppdocs/api/library_root.html is long and overload-heavy, it could be annoying to manually keep this file updated. We might need to come up with a way to update the list and add associated tests systematically.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133271
Approved by: https://github.com/awgu
2024-08-14 07:14:29 +00:00
a6413d2924 Regression test for S429861 (#133376)
Adds repro test to verify that https://www.internalfb.com/sevmanager/view/429861 does not occur again.

I haven't been able to reduce the size of the repro further; if I remove any buffers, the error disappears!

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133376
Approved by: https://github.com/eellison
2024-08-14 06:55:05 +00:00
a30504b2a2 fix silly error when printing diff (#133345)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/133336

When we fail to suggest fixes for a data dependent error because some symbols couldn't be mapped to sources, we print out those symbols but there was a silly bug in the printing code.

New error:
```
...
    raise self._make_data_dependent_error(
torch.fx.experimental.symbolic_shapes.GuardOnDataDependentSymNode: Could not guard on data-dependent expression Eq(u0 + 1, CeilToInt(IntTrueDiv(u0 + 1, 1))) (unhinted: Eq(u0 + 1, CeilToInt(IntTrueDiv(u0 + 1, 1)))).  (Size-like symbols: u0)

Potential framework code culprit (scroll up for full backtrace):
  File "/data/users/avik/fbsource/buck-out/v2/gen/fbcode/6ef5f323b6193f0f/pyspeech/fb/tools/__export_speech_llama__/export_speech_llama#link-tree/torch/_refs/__init__.py", line 2972, in expand
    guard_size_oblivious(requested_length == x)

For more information, run with TORCH_LOGS="dynamic"
For extended logs when we create symbols, also add TORCHDYNAMO_EXTENDED_DEBUG_CREATE_SYMBOL="u0"
If you suspect the guard was triggered from C++, add TORCHDYNAMO_EXTENDED_DEBUG_CPP=1
For more debugging help, see https://docs.google.com/document/d/1HSuTTVvYH1pTew89Rtpeu84Ht3nQEFTYhAX3Ypa_xJs/edit?usp=sharing

For C++ stack trace, run with TORCHDYNAMO_EXTENDED_DEBUG_CPP=1

The following call raised this error:
  File "/data/users/avik/fbsource/buck-out/v2/gen/fbcode/6ef5f323b6193f0f/pyspeech/fb/tools/__export_speech_llama__/export_speech_llama#link-tree/pyspeech/nn/utils.py", line 271, in lengths_to_padding_mask
    ).expand(batch_size, max_length)
```

Test Plan: Repro gets past reported error, hits new error

Differential Revision: D61221994

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133345
Approved by: https://github.com/ezyang
2024-08-14 06:52:55 +00:00
4d11a9b783 [CI] Fix rowwise scaling tests on h100 (#133384)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133384
Approved by: https://github.com/malfet, https://github.com/nWEIdia
2024-08-14 05:58:33 +00:00
7aee3376e2 [aotd] HOP effect tokens wrapper above SubclassWrapper (#131672)
Original issue:
https://github.com/pytorch/pytorch/issues/129486

Previously, subclass_wrapper() got inputs containing additional effect tokens and failed because this did not match the SubclassMeta indexes.

This happened because functionalization was responsible for adding/removing those tokens.

Functionalization cannot be run above Subclasses, as args/outputs are duplicated in case of mutations.

The main design idea is to keep the EffectTokens, Subclasses, and Functionalization logic knowing as little as possible about each other's transformations.

To achieve that, EffectTokens manipulation is extracted into a separate wrapper, which is processed above SubclassWrapper, while functionalization happens below SubclassWrapper as before.

In that case, subclass wrap/unwrap works without knowledge of the additional arguments.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131672
Approved by: https://github.com/bdhirsh, https://github.com/zou3519
2024-08-14 05:57:17 +00:00
2a4304329b [wip][lowering] Add max_acc_splits (#133041)
Summary: Model owners can set lower_settings with max_acc_splits=2, and lowering will fail during model iteration to alert them of possible performance degradation from increased fragmentation.

Test Plan: Added unit tests

Differential Revision: D60133589

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133041
Approved by: https://github.com/hl475
2024-08-14 03:50:31 +00:00
f951fcd1d7 Inductor-CPU WoQ int8 GEMM micro-kernel with scale epilogue (#131887)
## Summary

As part of #125683, this PR modifies existing CPU GEMM cpp template & micro-kernel template to enable int8 WoQ GEMM auto-tuning with AVX2, AVX512 & AMX ISAs (the latter is only available on Xeon 4th generation & beyond).

WoQ GEMM takes FP16/BF16 activations, int8 weights, and scale of the same dtype as activations.
The operation is equivalent to `torch.nn.functional.linear(x, w.to(x.dtype)) * scale`, which is essentially what the ATen op `torch.ops.aten._weight_int8pack_mm` currently does (except that weights are not cached by it). Weights will be considered constant & cached, so this implementation is suitable for inference, and not QAT. `scale` is supported as a `mul` epilogue.
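A minimal sketch of the reference semantics described above (shapes are illustrative):

```
import torch
import torch.nn.functional as F

x = torch.randn(4, 16, dtype=torch.bfloat16)               # activations
w_int8 = torch.randint(-8, 8, (8, 16), dtype=torch.int8)   # quantized weight
scale = torch.rand(8, dtype=torch.bfloat16)                 # per-output-channel scale

# Dequantize the weight to the activation dtype, run the GEMM, then apply
# `scale` as a `mul` epilogue.
out = F.linear(x, w_int8.to(x.dtype)) * scale
```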

Only BF16 activations have been supported in this PR because for FP16 & FP32, weight is dequantized during constant-folding pass of freezing, and then after auto-tuning, performance with a large `M` dimension may be better than either torch.ops.aten._weight_int8pack_mm, or the WoQ micro-kernel support introduced in this PR, which dequantizes `w` within the micro-kernel.
While even BF16 activations with a large `M` dimension may benefit from dequantizing `w` beforehand, for now, they would  use WoQ support in GEMM templates for auto-tuning, and then a subsequent PR would add logic for deciding whether or not to dequantize weights beforehand.

### Performance
#### AMX
Op-level speedup due to AMX micro-kernel (selected during auto-tuning) on 32 physical cores of Intel(R) Xeon(R) Platinum 8468H (of Xeon 4th generation series, codenamed Sapphire Rapids) vs. ATen kernel `torch.ops.aten._weight_int8pack_mm`. Intel OpenMP & tcmalloc were preloaded.

In a few cases with an odd `K`, the implementation being added in this PR may not perform as well as the ATen kernel, which is unrelated to this PR, though, since `test_linear_amx` also exhibits similar datapoints. In those cases, the AMX micro-kernel might be slower than AVX512 micro-kernel, so if such sets of shapes are used for auto-tuning, either the AVX512 micro-kernel implementation, or the ATen kernel would be chosen instead.

Benchmarked with unit-tests.

Tabular data at https://gist.github.com/sanchitintel/294811a86c8ff6b867c668ae2107c405?permalink_comment_id=5142442#gistcomment-5142442

The AVX512 micro-kernel was disabled to collect data for AMX micro-kernel.

#### AVX2/AVX512 micro-kernels

Tabular data at at https://gist.github.com/sanchitintel/52b5fa9c66f791be19e48e2aa6423dc4?permalink_comment_id=5142437#gistcomment-5142437

### Follow-up
1. int4 WoQ GEMM micro-kernel will also be added in a separate PR.
2. A subsequent PR would add logic for deciding whether or not to dequantize weights beforehand.

E2E perf measurement should be done with #131310.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131887
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-08-14 03:14:45 +00:00
918367ebb0 Add new runner: G4DN Extra Large with T4 for windows binary builds (#133229)
Prep for #103104

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133229
Approved by: https://github.com/ZainRizvi
2024-08-14 03:08:49 +00:00
1206958d89 [Dynamo] add EventVariable reconstruct (#133236)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133236
Approved by: https://github.com/yifuwang
2024-08-14 02:56:11 +00:00
d1d6b370ce Upgrade nightly wheels to rocm6.2 - 1 of 2 (docker images) (#132875)
Fixes https://github.com/pytorch/pytorch/issues/132570

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132875
Approved by: https://github.com/atalman
2024-08-14 02:46:48 +00:00
14750dd737 Correct return type of grouping helper function in Optimizer (#133360)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133360
Approved by: https://github.com/albanD
2024-08-14 01:56:02 +00:00
5fff960004 [PT2][Optimus] Extend split_stack_to_cats when split and stack have different dims (#133060)
Summary: We observed a special case in AI CMF where the split and stack nodes have different dims, thus we extend our current implementation to include the special case.

Test Plan:
# unit test

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```

Buck UI: https://www.internalfb.com/buck2/6d0502bc-c840-425e-b686-b00b0b7da5f5
Test UI: https://www.internalfb.com/intern/testinfra/testrun/17732923577411786
Network: Up: 353KiB  Down: 611KiB  (reSessionID-1f80d74b-543f-4856-b3bf-181283c0f7e3)
Jobs completed: 29. Time elapsed: 5:36.7s.
Cache hits: 0%. Commands: 4 (cached: 0, remote: 1, local: 3)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0

# benchmark

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ai_cmf" --flow_id 558295195 -n
```

 Counter({'pattern_matcher_nodes': 2321, 'pattern_matcher_count': 1320, 'normalization_pass': 280, 'merge_splits_pass': 250, 'extern_calls': 95, 'normalization_aten_pass': 28, 'scmerge_cat_removed': 14, 'scmerge_cat_added': 12, 'scmerge_split_removed': 7, 'unbind_stack_pass': 7, 'split_stack_to_cats_pass': 4, 'scmerge_split_sections_removed': 3, 'batch_aten_add': 3, 'batch_aten_mul': 3, 'split_cat_pass': 2, 'scmerge_split_added': 2, 'split_cat_to_slices_pass': 2, 'fxgraph_cache_miss': 2, 'batch_linear_post_grad': 1})

torch graph
https://www.internalfb.com/intern/everpaste/?color=0&handle=GK5kwRZRtEMCZTAJAJlRpekhPhp0br0LAAAz

# e2e

Differential Revision: D60998945

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133060
Approved by: https://github.com/jackiexu1992
2024-08-14 01:45:12 +00:00
4af4910b1a Reland "Construct NJT without graph breaks" (#133196)
This reverts commit 154d40ca488e6979ce9c2de89d8a35b53129ebea.

and adds changes from https://github.com/pytorch/pytorch/pull/133061

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133196
Approved by: https://github.com/ezyang
ghstack dependencies: #133145
2024-08-14 01:11:13 +00:00
f23dbefe52 [export] Support "custom" metadata field. (#131912)
Summary:
Add a special field in Graph- and Node-level metadata called "custom", which should be mapped to a JSON-serializable object; we guarantee this field is always preserved across the following transformations (see the sketch after this list):
1. copy/deepcopy
2. run_decompositions()
3. serialization
4. re-exporting
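A minimal illustration of tagging a node (assuming an exported program `ep`; the payload is illustrative):

```
import torch
from torch.export import export

class M(torch.nn.Module):
    def forward(self, x):
        return x + 1

ep = export(M(), (torch.randn(2),))
for node in ep.graph.nodes:
    if node.op == "call_function":
        # Any JSON-serializable payload; preserved across copy/deepcopy,
        # run_decompositions(), serialization, and re-export.
        node.meta["custom"] = {"quant_tag": "example"}
```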

Test Plan: :test_export -- -r custom_tag

Reviewed By: angelayi

Differential Revision: D60291839

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131912
Approved by: https://github.com/angelayi
2024-08-14 01:09:01 +00:00
cyy
c2eeda5da0 [structural binding][12/N] Replace std::tie with structural binding (#131031)
Follows #130830
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131031
Approved by: https://github.com/ezyang
2024-08-14 00:51:34 +00:00
7666ef9d9b [GHF] Fix co-authors attribution (#133372)
According to https://docs.github.com/en/pull-requests/committing-changes-to-your-project/creating-and-editing-commits/creating-a-commit-with-multiple-authors, co-authors must be mentioned at the very end of the commit message and separated by 2 newlines.
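
A small sketch of the required layout, assembled in Python for illustration (names and emails are placeholders):

```python
body = "Fix co-authors attribution (#133372)\n\nSome description of the change."
coauthors = [
    "Co-authored-by: Jane Doe <jane@example.com>",
    "Co-authored-by: John Roe <john@example.com>",
]
# Trailers must come at the very end, separated from the body by a blank line
commit_message = body + "\n\n" + "\n".join(coauthors)
print(commit_message)
```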

Test plan:
```python
from trymerge import GitHubPR
pr = GitHubPR("pytorch", "pytorch", 133189)
print(pr.gen_commit_message())
```

Fixes https://github.com/pytorch/pytorch/issues/133310

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133372
Approved by: https://github.com/kit1980
2024-08-14 00:48:24 +00:00
cyy
3578598401 [11/N] Fix clang-tidy warnings in aten/src/ATen (#133298)
Follows #133155

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133298
Approved by: https://github.com/ezyang
2024-08-14 00:29:38 +00:00
fbb0adbc32 [TunableOp] lazy init TuningContext singleton (#133347)
Forward fix after #132464 because TuningContext had been created during static library init, which creates the TuningResultsValidator, which tries to query HIP device properties before the HIP runtime has initialized.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133347
Approved by: https://github.com/zixi-qi
2024-08-14 00:20:11 +00:00
5947169c9d Add missing endline in exception message (#133240)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133240
Approved by: https://github.com/Skylion007
2024-08-14 00:11:39 +00:00
c91bc499f7 [CI] Do not emit color escape sequence during testing (#133350)
By forcing TERM to vt100.

Fixes the problem reported in https://github.com/pytorch/pytorch/issues/133330, but more broadly it should be fixed on the Nova/Infra side.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133350
Approved by: https://github.com/zou3519
2024-08-13 23:39:16 +00:00
caba37e99b Update fused kernels and call _safe_softmax from SDPA (#131863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131863
Approved by: https://github.com/jbschlosser, https://github.com/Chillee
2024-08-13 23:37:50 +00:00
9de023d44d [Dynamo] Make torch.Size can be reconstructed by LOAD_CONST (#133342)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133342
Approved by: https://github.com/mlazos, https://github.com/jansel
2024-08-13 23:18:38 +00:00
c17d26c3c1 [AOTI][Tooling] A couple fixes / minor updates for initial debug printer (#133016)
Summary:
Follow-up small diff to fix a couple of issues:
- add a condition for the cuda/gpu case to only print the kernel name list in the second pass, i.e. when we do the cpp wrapper codegen

- other minor fixes around the `AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT` option

Test Plan:
```
AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT="triton_poi_fused_0" AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+graph, inductor, +schedule, output_code" buck2 run -c fbcode.enable_gpu_sections=true -c fbcode.nvcc_arch=h100 @//mode/opt fbcode//caffe2/test/inductor:test_aot_inductor -- -r test_addmm_abi_compatible_cuda
```

Differential Revision: D60954888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133016
Approved by: https://github.com/ColinPeppler
2024-08-13 23:00:29 +00:00
41da528378 [BE] Skip inductor+profiler test for templates if we didn't run select_autotune (#133344)
Sometimes we don't have enough SMs to do autotuning and then we fall back to aten, in which case we won't run the template kernel and it won't show up in the profile trace.

Differential Revision: [D61222101](https://our.internmc.facebook.com/intern/diff/D61222101/)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133344
Approved by: https://github.com/masnesral
2024-08-13 22:58:24 +00:00
8e074c4583 [ROCm] skip SymmetricMemory related UTs for ROCm (#133241)
This feature is not yet supported on ROCm.
Skipping:
distributed/test_symmetric_memory.py::SymmetricMemoryTest::test_low_contention_all_gather_symm_mem_input_False
With the errors:
RuntimeError: CUDASymmetricMemory requires PYTORCH_C10_DRIVER_API_SUPPORTED

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133241
Approved by: https://github.com/pruthvistony, https://github.com/malfet
2024-08-13 22:33:51 +00:00
5a1d4f7ddc Migrate lint.yml to runner determinator (#133320)
Update the jobs in lint.yml to use the runner determinator.

Closes: pytorch/ci-infra#258

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133320
Approved by: https://github.com/Skylion007
2024-08-13 22:16:32 +00:00
a9d34138df [traced-graph][sparse] add to_dense() operation to sparse export test (#133175)
This works for sparse COO but surprisingly still fails for the other compressed sparse cases. I filed the following bug report:

https://github.com/pytorch/pytorch/issues/133174
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133175
Approved by: https://github.com/ezyang
2024-08-13 20:36:40 +00:00
69de9e78e9 Revert "typing for remote_cache (#133299)"
This reverts commit 2fde1934f9efc418cc5a398bd0b09b29551cc091.

Reverted https://github.com/pytorch/pytorch/pull/133299 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/133299#issuecomment-2287067434))
2024-08-13 20:26:24 +00:00
fa7ae6cdbc can't infer device on benchmarked function with no args or kwargs (#133290)
When we call benchmarker.benchmark(fn, (), {}), it attempts to infer the device from the args and kwargs, which are both empty. In this case the default behavior is to assume CPU, since `is_cpu_device` is implemented as `all([x.device == "cpu" for x in ... if x is Tensor])`, and `all([]) == True`. I've added a PR that makes this raise an error, but we should just fix this one callsite first.
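
A minimal sketch of the inference pitfall, using a simplified stand-in for `is_cpu_device` (not the actual inductor implementation):

```python
import torch

def is_cpu_device(args, kwargs):
    # all() over an empty iterable is True, so "no tensors at all" silently looks like CPU
    tensors = [a for a in (*args, *kwargs.values()) if isinstance(a, torch.Tensor)]
    return all(t.device.type == "cpu" for t in tensors)

print(is_cpu_device((), {}))  # True: nothing to inspect, so the benchmark defaults to CPU
if torch.cuda.is_available():
    print(is_cpu_device((torch.randn(2, device="cuda"),), {}))  # False once a GPU tensor appears
```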

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133290
Approved by: https://github.com/eellison
2024-08-13 20:13:44 +00:00
dfc7c860e4 Allow SymInt input for torch.fx reinplace pass (#133178)
Fixes #133176

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133178
Approved by: https://github.com/ezyang
2024-08-13 20:07:17 +00:00
61625a18ef [profiler] Only parse kineto requests and build tree when required (#132713)
To avoid the high overhead of constructing data structures in Python when the user is simply saving the trace to a file, we only process things lazily.

## Details
1. Delay function event parsing; add a flag to denote when it is needed.
1. Make profiler.function_events a computed property so code using `prof.function_events` does not need to change (a rough sketch of this pattern follows after this list).
1. Fix coverage for `str(prof)` in profiler tests.
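
A rough sketch of the lazy computed-property pattern described above (class and method names here are illustrative, not the actual profiler internals):

```python
class ProfileResult:
    def __init__(self, kineto_results):
        self._kineto_results = kineto_results
        self._function_events = None  # not parsed yet

    def _parse_function_events(self):
        # stand-in for the expensive Kineto parsing / tree building
        return list(self._kineto_results)

    @property
    def function_events(self):
        # parse only on first access; exporting the raw trace never triggers this
        if self._function_events is None:
            self._function_events = self._parse_function_events()
        return self._function_events
```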

## Test run
Test program
```
import torch
from torch.profiler import profile, record_function, ProfilerActivity

def payload(use_cuda=False):
    x = torch.randn(10, 10)
    if use_cuda:
        x = x.cuda()
    y = torch.randn(10, 10)
    if use_cuda:
        y = y.cuda()
    z = torch.mm(x, y)
    z = z + y
    if use_cuda:
        z = z.cpu()

with profile(activities=[ProfilerActivity.CPU], record_shapes=True) as prof:
    with record_function("model_inference"):
        payload()

prof.export_chrome_trace("/tmp/test_trace.json")
#print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

The print "this is computing function events" only happens lazily, when function events are actually built.

```
>]$ python3 profiler_test.py
Brian: this is computing function events
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
                  Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
       model_inference         6.77%     441.628us       100.00%       6.523ms       6.523ms             1
           aten::randn         1.86%     121.108us        46.93%       3.061ms       1.530ms             2
              aten::mm        45.36%       2.959ms        45.44%       2.964ms       2.964ms             1
         aten::normal_        44.72%       2.917ms        44.72%       2.917ms       1.458ms             2
             aten::add         0.87%      56.646us         0.87%      56.646us      56.646us             1
           aten::empty         0.35%      22.808us         0.35%      22.808us      11.404us             2
    aten::resolve_conj         0.08%       5.173us         0.08%       5.173us       1.724us             3
----------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 6.523ms

$> python3 profiler_test.py
(pytorch) [bcoutinho@devgpu038.ftw6 /data/users/bcoutinho/pytorch (profiler_optimize_parsing)]$
$>ls -a profiler_test.py
$> ls -l /tmp/test_trace.json
-rw-r--r-- 1 bcoutinho users 16471 Aug  5 16:10 /tmp/test_trace.json
```
## Unit test
Updates some tests and they all pass now.
`pytest test/profiler/test_profiler.py`

Also
`python test/test_autograd.py TestAutogradWithCompiledAutograd.test_record_function`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132713
Approved by: https://github.com/sraikund16
2024-08-13 18:58:20 +00:00
657d58bbd8 [inductor] Fix test_codecache::test_inductor_counters (#133244)
Summary: This test is flaky internally, but it's not a great test in the first place since it relies on the max-autotune step to bump a related counter. Instead of doing that, directly install a mock that bumps a counter specifically for this test. Additionally, test that the caching logic correctly accommodates an arbitrary counter delta (previously the relevant counter was only bumped by +1).
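
A rough sketch of the mock-based approach (the class and counter names here are hypothetical; the real test patches inductor internals):

```python
from unittest import mock

import torch._dynamo.utils as dynamo_utils

class Autotuner:
    def tune(self):
        return "real autotuning"

def fake_tune(self):
    # bump a dedicated counter with an arbitrary delta instead of relying on
    # max-autotune side effects
    dynamo_utils.counters["inductor"]["fake_counter"] += 3
    return "mocked"

tuner = Autotuner()
with mock.patch.object(Autotuner, "tune", fake_tune):
    tuner.tune()

print(dynamo_utils.counters["inductor"]["fake_counter"])  # 3
```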

Differential Revision: [D61141164](https://our.internmc.facebook.com/intern/diff/D61141164)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133244
Approved by: https://github.com/eellison
2024-08-13 18:52:27 +00:00
2fde1934f9 typing for remote_cache (#133299)
Summary: typing annotations for remote_cache

Test Plan: unit tests

Reviewed By: oulgen

Differential Revision: D60937968

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133299
Approved by: https://github.com/Skylion007
2024-08-13 18:28:41 +00:00
a1ca4dfe0b [ONNX] Fix onnx conversion scaled_dot_product_attention (#133314)
Fixes the error raised by torch>=2.5, "A mismatch between the number of arguments (8) and their descriptors (7) was found at symbolic function 'scaled_dot_product_attention'", by adding the newly introduced use_gqa parameter.

From https://github.com/pytorch/pytorch/pull/132689
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133314
Approved by: https://github.com/Skylion007, https://github.com/justinchuby
2024-08-13 18:22:24 +00:00
19416bf38b Reland "[2/2] PT2 Inductor ComboKernels - automatic horizontal fusing (#131675)" (#133291)
Reland by reverting commit 844103197d3e8cf6b4b59176e473365113f4f962. #131675 failed a few internal tests because it imported a diff version which wasn't rebased on the proper dependent diffs. Reland from OSS only to avoid the out-of-sync issue.

Original description from #131675
Summary:
A ComboKernel combines independent Inductor Triton kernels into a single one.
This is the part-2 pull request, which 1) adds automatic horizontal fusion at the end of the inductor operator fusion process, and 2) adds type annotations for triton_combo_kernel.py.

ComboKernel is used in two cases: 1) for existing foreach kernels, combo kernels are used as the backend kernel, and the front-end kernel generation logic remains the same; 2) an extra optimization phase is added at the end of the scheduler to generate extra combo kernels if combo_kernels is True in config.py.

This part-2 pull request deals with the 2nd case above:

The combo kernel generation in the added optimization phase is done in two steps: 1) in the front end, inside the scheduler, it topologically sorts the schedule nodes to find all the nodes with no data dependency and creates a front-end schedule node for them. We currently limit the maximal number of sub-nodes for each combo kernel to 8 (but we still need to determine the optimal number). 2) Then, these sub-nodes are combined in the codegen phase to generate the combo kernel code based on a few rules. For example, 1d and 2d kernels are separated into different combo kernels, as mixing them is not supported yet. Note that the algorithms we provide are very basic, and users can register their customized combo kernel generation algorithms for both steps.

Performance-wise, combining small kernels almost always yields a performance gain. However, combining very large kernels may not see any perf gain, and can sometimes even regress, possibly due to improper block sizes. Thus, a benchmark function is implemented to avoid such perf regressions, and it is recommended to turn it on by setting benchmark_combo_kernels to True whenever combo_kernels is True.

Please refer to part 1 pull request https://github.com/pytorch/pytorch/pull/124969 for more details.

Test Plan: buck2 test mode/dev-nosan caffe2/test/inductor:combo_kernels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133291
Approved by: https://github.com/wdvr
2024-08-13 18:18:12 +00:00
dadb20a9d6 [Memory Snapshot][Viz] Add Allocator Settings Tab (#132518)
Summary: Since we have been storing the allocator settings in the snapshot files for a while now (since https://github.com/pytorch/pytorch/pull/119404), we can expose this to users with a new tab in the visualizer.

Test Plan:
Ran it locally:
![image](https://github.com/user-attachments/assets/5f79ccd0-fe1c-4e42-bb58-106d8f3cccd6)

Differential Revision: D60673548

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132518
Approved by: https://github.com/tianfengfrank, https://github.com/zdevito
2024-08-13 17:35:12 +00:00
7172c732d9 [Memory Snapshot] Skip C++ warmup unwind() call if context is not set (#133038)
Summary: We should skip the C++ warmup `unwind::unwind();` call if no context is set. This call sometimes causes hangs since C++ stack collection is not robust.

Test Plan: CI

Differential Revision: D60965985

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133038
Approved by: https://github.com/eqy
2024-08-13 17:25:24 +00:00
be400ee2b4 [inductor][test] Fix test_vertical_pointwise_reduction_fusion (#133276)
Summary: Fix after https://github.com/pytorch/pytorch/pull/131649 changes behavior for fusion.

Test Plan: ci

Differential Revision: D61165949

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133276
Approved by: https://github.com/ColinPeppler
2024-08-13 17:18:43 +00:00
89795da5e3 [inductor] process compile_only case in all build options class. (#129975)
Optimize the `compile_only` logic. The original code only applied to `CppTorchCudaOptions`; this PR makes it apply to all build option classes.
Changes:
1. Remove the `libraries_dirs` and `libraries` settings when `compile_only` is set.
2. Remove compile_only from CppTorchCudaOptions.
3. Make `compile_only` apply to all classes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129975
Approved by: https://github.com/henrylhtsang
2024-08-13 16:45:27 +00:00
19270cff61 Add a reference for the LRScheduler class (#133243)
The `LRScheduler` class provides methods to adjust the learning rate during optimization (as updated in this PR). Also, as a note, all of the lr_scheduler classes are already listed in the `How to adjust learning rate` section.

Fixes #127884

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133243
Approved by: https://github.com/janeyx99
2024-08-13 16:20:22 +00:00
aa4fbba42d Make q info optional in prep for inference (#133261)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133261
Approved by: https://github.com/Chillee
ghstack dependencies: #132969
2024-08-13 16:09:39 +00:00
660436d843 Convert Periodic to use Amazon2023 runners (#133036)
Fixes #ISSUE_NUMBER

Co-authored-by: clee2000 <44682903+clee2000@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133036
Approved by: https://github.com/clee2000, https://github.com/zxiiro
2024-08-13 15:59:50 +00:00
cyy
2f30473fba [19/N] Fix clang-tidy warnings in jit (#133067)
Follows  #132963
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133067
Approved by: https://github.com/Skylion007
2024-08-13 15:59:43 +00:00
2e7d67e6af Migrate slow.yml jobs to use runner determinator (#133232)
Update the jobs in slow.yml to use the runner determinator script.

Closes: pytorch/ci-infra#259

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133232
Approved by: https://github.com/ZainRizvi
2024-08-13 15:44:55 +00:00
c518b50c4c Remove functorch dispatch keys in legacyExtractDispatchKey (#133018)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133018
Approved by: https://github.com/zou3519
2024-08-13 15:32:01 +00:00
cd565bc455 Refactor process_inputs outside of create_aot_dispatcher_function (#130962)
This PR refactors process_inputs so that it occurs earlier outside of create_aot_dispatcher_function for the purpose of calculating a cache key with the inputs after they have been processed.

This way, if tensors have symint sizes/strides, we successfully factor that into the cache key instead of specializing on every possible size and stride. A test that utilizes this is incoming.

# Guard behavior
Note that it's technically possible for tensors with symint arguments to introduce guards in aot_dispatch, if they trace through decompositions that branch on tensor size/stride. This can result in multiple graph modules with differing guards having the same key in the cache.

FXGraphCache has this same issue, and the remote FXGraphCache intentionally does not handle this: instead it only saves the first result in the cache, and cache misses if guards miss. The local FXGraphCache does handle this by storing multiple files and iterating through them, but we opt not to introduce that complexity just yet for AOTAutogradCache until we deem it necessary (i.e., models appear where saving multiple cache results with different guards but the same cache key becomes important). Instead, AOTAutogradCache will save a single entry per result, overriding it if it cache misses due to guards.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130962
Approved by: https://github.com/bdhirsh
2024-08-13 14:56:00 +00:00
4cca18d5b6 Revert "Update fused kernels and call _safe_softmax from SDPA (#131863)"
This reverts commit e61def65d5c6268e79f52776f75277ee60f01462.

Reverted https://github.com/pytorch/pytorch/pull/131863 on behalf of https://github.com/albanD due to Broke forward AD tests in main ([comment](https://github.com/pytorch/pytorch/pull/131863#issuecomment-2286432628))
2024-08-13 14:44:08 +00:00
095c5ccf9f [CD] Change XPU nightly build back to ABI=0 (#132854)
Works for https://github.com/pytorch/pytorch/issues/114850
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132854
Approved by: https://github.com/atalman
2024-08-13 13:46:29 +00:00
cyy
e0a5536cc9 [2/N] Fix clang-tidy warnings in torch/csrc/autograd (#133295)
Follows #133180
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133295
Approved by: https://github.com/Skylion007
2024-08-13 13:23:46 +00:00
7756175273 Add Sleef Implementation for maximum Kernel for ARM (#131642)
The NEON Vectorized<float> implementation does not use SLEEF functions for the maximum implementation, so we updated the maximum function with Sleef calls for better performance on Graviton3. It showed a good performance improvement in LLM models.
The results were taken on a Graviton3 machine and are as follows:
<img width="268" alt="perf_result" src="https://github.com/user-attachments/assets/8c575873-b985-44e1-ba8e-880fe6494c5f">

This maximum kernel is used in softmax. The performance timing of softmax with the default and the Sleef change is as below (Graviton3 machine):
<img width="265" alt="softmax" src="https://github.com/user-attachments/assets/3be22c0e-7c99-407e-a8d1-891cb1e035ad">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131642
Approved by: https://github.com/snadampal, https://github.com/jgong5
2024-08-13 11:08:14 +00:00
40061bd61e [export] overwrite placeholder names when deepcopying (#133269)
In joint-graph export we have a `copy.deepcopy(ep.graph_module)` call. This turns out to be an imperfect deepcopy, because deepcopy allows objects to overwrite their `__deepcopy__` methods. For fx.Graph, this ends up deferring to `Graph.create_node()`, which checks the graph namespace and can avoid copying the exact name in niche cases, like where the name collides with a Python builtin or keyword (e.g. `input` gets renamed to `input_1`).

Names like `input` happen because export's placeholder naming pass overwrites what the namespace creates, based on the model's `forward()` signature. So we can either 1) avoid overwriting such cases, which requires rewriting the naming pass logic, or 2) force another overwrite after deepcopying. This PR goes with 2).
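
A minimal repro sketch of the renaming behavior described above (whether `input` actually gets renamed can vary by torch version):

```python
import copy
import torch
from torch.export import export

class M(torch.nn.Module):
    def forward(self, input):  # name collides with the Python builtin
        return input + 1

ep = export(M(), (torch.randn(2),))
orig = [n.name for n in ep.graph_module.graph.nodes if n.op == "placeholder"]
copied = [n.name for n in copy.deepcopy(ep.graph_module).graph.nodes if n.op == "placeholder"]
print(orig, copied)  # e.g. ['input'] vs ['input_1'] when the namespace renames on copy
```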

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133269
Approved by: https://github.com/zhxchen17, https://github.com/dvorjackz, https://github.com/ydwu4
2024-08-13 10:20:43 +00:00
947a446be4 [executorch hash update] update the pinned executorch hash (#131420)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned executorch hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131420
Approved by: https://github.com/pytorchbot
2024-08-13 08:30:51 +00:00
9f17037e8b [dtensor] move tensor constructors to the api module (#133129)
This is to ensure __init__.py only contains public APIs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133129
Approved by: https://github.com/awgu, https://github.com/tianyu-l
2024-08-13 06:09:56 +00:00
cyy
50e837d9c2 [10/N] Fix clang-tidy warnings in aten/src/ATen (#133155)
Follows  #132842

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133155
Approved by: https://github.com/janeyx99, https://github.com/ezyang
2024-08-13 03:48:58 +00:00
cyy
af7830e353 [1/N] Fix clang-tidy warnings in torch/csrc/autograd (#133180)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133180
Approved by: https://github.com/albanD
2024-08-13 03:36:10 +00:00
4671e98656 [export] fix node.users when inlining HOOs (#133144)
The process of inlining HOO subgraphs (e.g. set_grad_enabled) seems to break node.users when a node is present in multiple subgraphs, for example:
```
import torch

class SetGradCase(torch.nn.Module):
    def forward(self, x):
        _x = x.shape[0] + 2
        _xx = _x + 2
        with torch.no_grad():
            y = _x * 4
        return _xx, y
```

The `_x` node should contain 2 users (`_xx` and `y`) after being inlined, but upon inspection it only contains `y` as a user.

Previously we were completely clearing node.users for output nodes in HOO subgraphs before inlining them; we should instead just delete the subgraph output nodes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133144
Approved by: https://github.com/larryliu0820, https://github.com/ydwu4
2024-08-13 03:21:42 +00:00
fa36eba77d Turn off remote caching in unit tests unless explicitly on (#133258)
Summary: This PR turns off remote caching in unit tests unless the unit test explicitly turns it on.

Test Plan: existing tests

Differential Revision: D61152154

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133258
Approved by: https://github.com/masnesral
2024-08-13 02:49:43 +00:00
1e9bedf688 Add _codecs.encode and builtins.bytearray to _get_allowed_globals to support bytes and bytearray serialization (#133189)
Fixes #133163

Debugged in collaboration with @hariveliki

The `bytes` type requires the global `_codecs.encode`. That means the following currently works:
```python
import _codecs

import torch

torch.save(b'hello', '/tmp/dummy.pth')

torch.serialization.add_safe_globals([_codecs.encode])
torch.load('/tmp/dummy.pth', weights_only=True)
```

Similarly, `bytearray` needs `builtins.bytearray`.

Following the `torch.load` docs' promise, both types should be supported without `add_safe_globals`, as they are both primitive types:
>         weights_only: Indicates whether unpickler should be restricted to
>            loading only tensors, primitive types, dictionaries
>           and any types added via :func:`torch.serialization.add_safe_globals`.

This PR adds both `_codecs.encode` and `builtins.bytearray` to `_get_allowed_globals` and test for saving and loading of both types with and without `weights_only`.

Co-authored-by: hariveliki <98284163+hariveliki@users.noreply.github.com>
Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133189
Approved by: https://github.com/mikaylagawarecki
2024-08-13 02:20:28 +00:00
f1c439cbed AutoHeuristic: refactoring (#133170)
This PR refactors train_decision.py and adds some basic logging, which I'll extend in another PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133170
Approved by: https://github.com/Chillee
2024-08-13 01:46:53 +00:00
cyy
e76f0e0646 Remove QNNPACK reference from setup.py (#133177)
QNNPACK has been removed from third party
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133177
Approved by: https://github.com/albanD
2024-08-13 01:19:12 +00:00
7be77658e9 [Inductor] support masked vectorization for the tail_loop for INT8 datatype (#131155)
This PR supports masked vectorization for the tail_loop for the torch.uint8 and torch.int8 datatypes to improve performance.
BTW, I fixed the UT of `byte` by setting the range of the sample inputs to [0, 255], since the range of `torch.uint8` is [0, 255].
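
A minimal example of the kind of uint8 pointwise kernel that would now hit the masked tail-loop path (sizes are arbitrary; whether the masked path is actually chosen depends on the target ISA):

```python
import torch

def fn(a):
    # simple uint8 pointwise op; the trailing 13 elements (125 % 16) exercise the tail loop
    return a // 2 + 1

a = torch.randint(0, 255, (2, 125), dtype=torch.uint8)
compiled_fn = torch.compile(fn)
with torch.no_grad():
    out = compiled_fn(a)
torch.testing.assert_close(out, fn(a))
```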

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131155
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
ghstack dependencies: #130724
2024-08-13 01:12:05 +00:00
370b072d8d [Inductor] support masked vectorization for the tail_loop of the 2d tiles kernel (#130724)
This PR supports masked vectorization for the tail_loop of the 2d tiles kernel to improve the performance.

Example:
```
import torch

def fn(a):
    return torch.permute(a, (2, 0, 1)).contiguous()

input = torch.randn(2, 20, 40)
compiled_fn = torch.compile(fn)

with torch.no_grad():
    for _ in range(3):
        compiled_fn(input)
```

Generated code:
- Before:
```
cpp_fused_clone_0 = async_compile.cpp_pybinding(['const float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/z2/cz2ry4ghylembzwx7hkbanur76fi3mkiu7s6jm3zdi2amy5egq4b.h"
extern "C"  void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(32L); x0+=static_cast<long>(16L))
        {
            #pragma GCC ivdep
            for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(16L))
            {
                float tmp0[16*16] __attribute__ ((aligned (16)));
                at::vec::transpose_mxn<float,16,16>(in_ptr0 + static_cast<long>(x0 + (40L*x1)), static_cast<long>(40L), tmp0, 16);
                for (long x0_inner = 0; x0_inner < 16; x0_inner++)
                {
                    auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + static_cast<long>(16L*x0_inner), 16);
                    tmp1.store(out_ptr0 + static_cast<long>(x1 + (40L*x0) + (40L*x0_inner)));
                }
            }
            #pragma GCC ivdep
            for(long x1=static_cast<long>(32L); x1<static_cast<long>(40L); x1+=static_cast<long>(1L))
            {
                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x0 + (40L*x1)), 16);
                [&]
                {
                    __at_align__ std::array<float, 16> tmpbuf;
                    tmp0.store(tmpbuf.data(), 16);
                    #pragma GCC unroll 16
                    for (long x0_inner = 0; x0_inner < 16; x0_inner++)
                    {
                        out_ptr0[static_cast<long>(x1 + (40L*x0) + (40L*x0_inner))] = tmpbuf[x0_inner];
                    }
                }
                ()
                ;
            }
        }
        #pragma GCC ivdep
        for(long x0=static_cast<long>(32L); x0<static_cast<long>(40L); x0+=static_cast<long>(1L))
        {
            #pragma GCC ivdep
            for(long x1=static_cast<long>(0L); x1<static_cast<long>(40L); x1+=static_cast<long>(1L))
            {
                auto tmp0 = in_ptr0[static_cast<long>(x0 + (40L*x1))];
                out_ptr0[static_cast<long>(x1 + (40L*x0))] = tmp0;
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (2, 20, 40), (800, 40, 1))
    buf0 = empty_strided_cpu((40, 2, 20), (40, 20, 1), torch.float32)
    cpp_fused_clone_0(arg0_1, buf0)
    del arg0_1
    return (buf0, )
```

- After:
```
cpp_fused_clone_0 = async_compile.cpp_pybinding(['const float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/z2/cz2ry4ghylembzwx7hkbanur76fi3mkiu7s6jm3zdi2amy5egq4b.h"
extern "C"  void kernel(const float* in_ptr0,
                       float* out_ptr0)
{
    {
        #pragma GCC ivdep
        for(long x0=static_cast<long>(0L); x0<static_cast<long>(32L); x0+=static_cast<long>(16L))
        {
            #pragma GCC ivdep
            for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(16L))
            {
                float tmp0[16*16] __attribute__ ((aligned (16)));
                at::vec::transpose_mxn<float,16,16>(in_ptr0 + static_cast<long>(x0 + (40L*x1)), static_cast<long>(40L), tmp0, 16);
                for (long x0_inner = 0; x0_inner < 16; x0_inner++)
                {
                    auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + static_cast<long>(16L*x0_inner), 16);
                    tmp1.store(out_ptr0 + static_cast<long>(x1 + (40L*x0) + (40L*x0_inner)));
                }
            }
            #pragma GCC ivdep
            for(long x1=static_cast<long>(32L); x1<static_cast<long>(40L); x1+=static_cast<long>(8L))
            {
                float tmp0[16*8] __attribute__ ((aligned (16)));
                at::vec::transpose_mxn<float,8,16>(in_ptr0 + static_cast<long>(x0 + (40L*x1)), static_cast<long>(40L), tmp0, 8);
                for (long x0_inner = 0; x0_inner < 16; x0_inner++)
                {
                    auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + static_cast<long>(8L*x0_inner), 8);
                    tmp1.store(out_ptr0 + static_cast<long>(x1 + (40L*x0) + (40L*x0_inner)), 8);
                }
            }
        }
        #pragma GCC ivdep
        for(long x0=static_cast<long>(32L); x0<static_cast<long>(40L); x0+=static_cast<long>(8L))
        {
            #pragma GCC ivdep
            for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(16L))
            {
                float tmp0[8*16] __attribute__ ((aligned (16)));
                at::vec::transpose_mxn<float,16,8>(in_ptr0 + static_cast<long>(x0 + (40L*x1)), static_cast<long>(40L), tmp0, 16);
                for (long x0_inner = 0; x0_inner < 8; x0_inner++)
                {
                    auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + static_cast<long>(16L*x0_inner), 16);
                    tmp1.store(out_ptr0 + static_cast<long>(x1 + (40L*x0) + (40L*x0_inner)));
                }
            }
            #pragma GCC ivdep
            for(long x1=static_cast<long>(32L); x1<static_cast<long>(40L); x1+=static_cast<long>(8L))
            {
                float tmp0[8*8] __attribute__ ((aligned (16)));
                at::vec::transpose_mxn<float,8,8>(in_ptr0 + static_cast<long>(x0 + (40L*x1)), static_cast<long>(40L), tmp0, 8);
                for (long x0_inner = 0; x0_inner < 8; x0_inner++)
                {
                    auto tmp1 = at::vec::Vectorized<float>::loadu(tmp0 + static_cast<long>(8L*x0_inner), 8);
                    tmp1.store(out_ptr0 + static_cast<long>(x1 + (40L*x0) + (40L*x0_inner)), 8);
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, = args
    args.clear()
    assert_size_stride(arg0_1, (2, 20, 40), (800, 40, 1))
    buf0 = empty_strided_cpu((40, 2, 20), (40, 20, 1), torch.float32)
    cpp_fused_clone_0(arg0_1, buf0)
    del arg0_1
    return (buf0, )
```

Co-authored-by: CaoE <e.cao@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130724
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-08-13 01:02:24 +00:00
e61def65d5 Update fused kernels and call _safe_softmax from SDPA (#131863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131863
Approved by: https://github.com/jbschlosser
2024-08-13 00:51:55 +00:00
00aa086298 Revert "[dtensor] move tensor constructors to a separate module (#133129)"
This reverts commit e890d888d916b4f38b383a59e0e9445513c67313.

Reverted https://github.com/pytorch/pytorch/pull/133129 on behalf of https://github.com/fbgheith due to breaking internal tests ([comment](https://github.com/pytorch/pytorch/pull/133129#issuecomment-2285090400))
2024-08-12 23:55:08 +00:00
89670d5bdd Revert "Inductor-CPU WoQ int8 GEMM micro-kernel with scale epilogue (#131887)"
This reverts commit 8fbd7d92a81b61d41363edb1b3902ba7701d5a27.

Reverted https://github.com/pytorch/pytorch/pull/131887 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/131887#issuecomment-2285082401))
2024-08-12 23:45:46 +00:00
844103197d Revert "[2/2] PT2 Inductor ComboKernels - automatic horizontal fusing (#131675)"
This reverts commit bb6eef8ed1de0eb48bde10a07da57b6acc82fb05.

Reverted https://github.com/pytorch/pytorch/pull/131675 on behalf of https://github.com/fbgheith due to breaking internal tests ([comment](https://github.com/pytorch/pytorch/pull/131675#issuecomment-2285069508))
2024-08-12 23:31:16 +00:00
656465fc77 Revert "Conversions between strided and jagged layouts for Nested Tensors (#115749)"
This reverts commit ed97fb77f9a9d9d815f4975caccbc961ebbcb714.

Reverted https://github.com/pytorch/pytorch/pull/115749 on behalf of https://github.com/izaitsevfb due to fails internal jobs, see [S440348](https://www.internalfb.com/sevmanager/view/440348) ([comment](https://github.com/pytorch/pytorch/pull/115749#issuecomment-2285051164))
2024-08-12 23:14:19 +00:00
d4b31f7bcf Refactor BlockMask constructorr and add Factory func (#132969)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132969
Approved by: https://github.com/Chillee
2024-08-12 22:38:42 +00:00
e553ef69d0 [BE] Fix typo (#133247)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133247
Approved by: https://github.com/c-p-i-o, https://github.com/zxiiro
2024-08-12 21:58:55 +00:00
8585dee85d [inductor] Add some more reinplacing tests (#132839)
Also add some tests around the counters we added in a previous PR.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132839
Approved by: https://github.com/eellison
2024-08-12 21:34:45 +00:00
592682fe22 Migrate nightly.yml to use runner determinator (#133225)
Updates the nightly.yml jobs to use the runner determinator script.

Closes: pytorch/ci-infra#260

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133225
Approved by: https://github.com/ZainRizvi
2024-08-12 21:25:55 +00:00
80ed3e9ccd s/dipatch/dispatch/g (#133192)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133192
Approved by: https://github.com/albanD
2024-08-12 20:26:58 +00:00
4f0d5f6551 Pin sympy to 1.13.1 (#133235)
Sympy 1.13.2 was released yesterday, and it results in test failures on Windows and Mac.

454713fe9d/1

Hopefully these are the places it needs to get pinned
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133235
Approved by: https://github.com/atalman, https://github.com/ZainRizvi
2024-08-12 20:10:09 +00:00
36c4ed8e49 [inductor] add FreeLibrary to DLLWrapper for Windows. (#133184)
For the previous PR https://github.com/pytorch/pytorch/pull/132630, we found the `DLLWrapper` class doesn't have a `_dlclose` implementation for Windows.

I wrote a small test project to figure out how to make it work on Windows: https://github.com/xuhancn/ctypes_all_lifecycle/blob/main/pysrc/module_manage.py#L30-L61
Test result: https://github.com/xuhancn/ctypes_all_lifecycle/tree/main?tab=readme-ov-file#ctypes_cyclepy

So, I have ported the Windows FreeLibrary implementation to pytorch's DLLWrapper in this PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133184
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-08-12 19:55:48 +00:00
cdcc7dc891 update commit pin for xla (#133120)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133120
Approved by: https://github.com/janeyx99
2024-08-12 19:38:37 +00:00
cc1cc71c46 [MPS] Fix relu for 0-element input case (#133191)
Fixes #133182

Should already be tested by `test/test_mps.py::MPSReluTest::testNumbersGPU`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133191
Approved by: https://github.com/albanD
2024-08-12 19:24:17 +00:00
666362865c [test/profiler] Make test_profiler_pattern_matcher_json_report write … (#133009)
Makes it possible to run `test/profiler/test_profiler.py#test_profiler_pattern_matcher_json_report` on CI environments where the test runner doesn't have write permissions to the current working directory.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133009
Approved by: https://github.com/zou3519
2024-08-12 18:56:50 +00:00
fa1d7b0262 Revert "Remove unused Caffe2 macros (#132979)"
This reverts commit da65cfbdea4f1f2176f6242004bda940a24f9ddb.

Reverted https://github.com/pytorch/pytorch/pull/132979 on behalf of https://github.com/ezyang due to these are apparently load bearing internally ([comment](https://github.com/pytorch/pytorch/pull/132979#issuecomment-2284666332))
2024-08-12 18:34:56 +00:00
afb73d253c [custom_ops] torch.library.{custom_op, register_kernel} disable Dynamo (#133125)
We promise the user that these custom ops (and their kernels) are black
boxes w.r.t. torch.compile. Unfortunately Dynamo can turn itself back
on in the implementation of the custom operator, so we force it off by
disabling Dynamo

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133125
Approved by: https://github.com/ezyang
2024-08-12 18:29:18 +00:00
d53dfa4680 [BE] Raise when the target model has scalar parameters (#132934)
Addresses issue https://github.com/pytorch/pytorch/issues/130810.

Both FSDP1 and FSDP2 do not support scalar parameters. For FSDP1, the issue happens during state_dict operations, while FSDP2 fails during initialization. This PR adds exceptions to help users debug the issue and change scalar parameters to 1D parameters.
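
A minimal sketch of the user-side fix (module and parameter names are made up):

```python
import torch
import torch.nn as nn

class MyModule(nn.Module):
    def __init__(self):
        super().__init__()
        # scalar (0-D) parameter: not supported by FSDP1/FSDP2
        # self.alpha = nn.Parameter(torch.tensor(0.5))
        # 1-D parameter of size 1: supported
        self.alpha = nn.Parameter(torch.tensor([0.5]))

    def forward(self, x):
        return x * self.alpha

print(MyModule().alpha.dim())  # 1
```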

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132934
Approved by: https://github.com/awgu, https://github.com/wz337
2024-08-12 18:28:02 +00:00
0e4c0ef29f fix type of eta_min parameter in CosineAnnealing (int -> float) (#132482)
This fixes errors with type checkers such as `pyright`.
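
A small usage example; `eta_min` has always accepted floats at runtime, and the fix just updates the annotation so type checkers agree:

```python
import torch

model = torch.nn.Linear(4, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10, eta_min=1e-5)

for _ in range(10):
    optimizer.step()
    scheduler.step()
```
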
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132482
Approved by: https://github.com/janeyx99
2024-08-12 18:22:26 +00:00
e7d8d73582 [minor] Correct in-code documentation for complex numbers and LBFGS (#133020)
To @lezcano's credit, this should be associative, as floating point add is actually commutative per IEEE754.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133020
Approved by: https://github.com/soulitzer, https://github.com/lezcano
2024-08-12 18:04:11 +00:00
d51e5467fd TunableOp unconditionally add all validators (#132464)
For workloads that only exercised scaled_mm, the csv result file would not contain the same set of validators as a gemm workload.  Trying to reuse the same csv file between workloads would discard the file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132464
Approved by: https://github.com/zixi-qi
2024-08-12 17:35:00 +00:00
d61815cb7d [torch][ao] Use returned model from Quantizer.transform_for_annotation in prepare_pt2e (#132893)
Summary:
The Quantizer subclass can return a new model from `transform_for_annotation`,
and this is common if it uses any ExportPass subclass which does not mutate in-place.

Use the returned model instead of assuming it's the same.

Differential Revision: D60869676

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132893
Approved by: https://github.com/jerryzh168
2024-08-12 17:23:19 +00:00
1371c420c3 Migrate binary builds to use Amazon2023 runners (#131826)
A continuation of the migration started in
- https://github.com/pytorch/pytorch/pull/131250

Migrates all linux binary builds.

The failures are windows jobs which aren't touched by this PR

prev runs (for tracking):
- https://hud.pytorch.org/pytorch/pytorch/pull/131826?sha=e1ee074b1e7b17008e3f3774e4842b5e1d4c1355
- https://hud.pytorch.org/pytorch/pytorch/pull/131826?sha=50a3488ae776f86bd6bead8b048b051c49a25ec7
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131826
Approved by: https://github.com/malfet
2024-08-12 17:18:55 +00:00
b06959e614 [export] change deepcopy to copy in _replace_with_hop passes (#133142)
Summary:
Add back the change in 19897a1647.

The change was lost in refactoring due to a bad rebase.

Test Plan:
CI

```
buck2 run 'fbcode//mode/dev-nosan'  fbcode//torchrec/distributed/tests:test_pt2 -- --filter-text test_sharded_quant_fpebc_non_strict_export
```

Differential Revision: D61052687

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133142
Approved by: https://github.com/ydwu4
2024-08-12 17:15:04 +00:00
3128640c31 [Memory Snapshot][Viz] Show event timestamps if collected (#132523)
Summary: Since we've been capturing timestamps for a while (since https://github.com/pytorch/pytorch/pull/112266), we can surface this in the UI. This can be useful for correlating with the timing of other events.

Test Plan:
Ran it locally.

![image](https://github.com/user-attachments/assets/8b3922e8-1ae2-4b09-aa13-20b2b8237064)

Differential Revision: D60673800

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132523
Approved by: https://github.com/tianfengfrank, https://github.com/zdevito
2024-08-12 16:12:04 +00:00
454713fe9d Add inductor-cu124, inductor-rocm to upload test stats (#133143)
Forgot to add them in https://github.com/pytorch/pytorch/issues/128250 and https://github.com/pytorch/pytorch/issues/131637

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133143
Approved by: https://github.com/huydhn
2024-08-12 15:51:51 +00:00
9641abe97a Revert "[export] change deepcopy to copy in _replace_with_hop passes (#133142)"
This reverts commit 2d71f03db124bd1517627d34896dd2d9248227af.

Reverted https://github.com/pytorch/pytorch/pull/133142 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/133142#issuecomment-2284327241))
2024-08-12 15:48:11 +00:00
e9eb8795bb Revert "[Memory Snapshot][Viz] Show event timestamps if collected (#132523)"
This reverts commit 27c44c884e28c9378677fb295a528c36c429c3f7.

Reverted https://github.com/pytorch/pytorch/pull/132523 on behalf of https://github.com/clee2000 due to broke some tests on mac ex export/test_retraceability.py::RetraceExportTestExport::test_disable_forced_specializations_ok_retraceability [GH job link](https://github.com/pytorch/pytorch/actions/runs/10344621336/job/28630686528) [HUD commit link](27c44c884e) Possibly a landrace since I see that some of the failing tests ran on the PR ([comment](https://github.com/pytorch/pytorch/pull/132523#issuecomment-2284312426))
2024-08-12 15:42:07 +00:00
26b0a0c2f3 Fix fsdp_state_dict_type_without_warnings (#132621)
Do actually ignore the warnings. Otherwise this is a no-op.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132621
Approved by: https://github.com/fegin
2024-08-12 10:33:09 +00:00
f5e704a6f2 Add instruction count benchmark to run on pull requests (#131475)
This PR only adds the execution of the benchmarks on this PR and prints the results; follow-up diffs will add checking out head~1, running it, and comparing.

To access the results, go to the test pr_time_benchmarks and inspect the logs:
You should see
```
+ echo 'benchmark results on current PR: '
benchmark results on current PR:
+ cat /var/lib/jenkins/workspace/test/test-reports/pr_time_benchmarks_before.txt
update_hint_regression,instruction_count,27971461254
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131475
Approved by: https://github.com/ezyang
2024-08-12 05:20:26 +00:00
27c44c884e [Memory Snapshot][Viz] Show event timestamps if collected (#132523)
Summary: Since we've been capturing timestamps for a while (since https://github.com/pytorch/pytorch/pull/112266), we can surface this in the UI. This can be useful for correlating with the timing of other events.

Test Plan:
Ran it locally.

![image](https://github.com/user-attachments/assets/8b3922e8-1ae2-4b09-aa13-20b2b8237064)

Differential Revision: D60673800

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132523
Approved by: https://github.com/tianfengfrank, https://github.com/zdevito
2024-08-12 01:48:23 +00:00
7f08b73980 Revert "[Memory Snapshot][Viz] Show event timestamps if collected (#132523)"
This reverts commit 456909e5d350810e941290ee61c1dfc3315a9a69.

Reverted https://github.com/pytorch/pytorch/pull/132523 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/132523#issuecomment-2282925079))
2024-08-11 23:33:37 +00:00
456909e5d3 [Memory Snapshot][Viz] Show event timestamps if collected (#132523)
Summary: Since we've been capturing timestamps for a while (since https://github.com/pytorch/pytorch/pull/112266), we can surface this in the UI. This can be useful for correlating with the timing of other events.

Test Plan:
Ran it locally.

![image](https://github.com/user-attachments/assets/8b3922e8-1ae2-4b09-aa13-20b2b8237064)

Differential Revision: D60673800

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132523
Approved by: https://github.com/tianfengfrank, https://github.com/zdevito
2024-08-11 23:27:48 +00:00
2d71f03db1 [export] change deepcopy to copy in _replace_with_hop passes (#133142)
Summary:
Add back the change in 19897a1647.

The change was lost in refactoring due to a bad rebase.

Test Plan:
CI

```
buck2 run 'fbcode//mode/dev-nosan'  fbcode//torchrec/distributed/tests:test_pt2 -- --filter-text test_sharded_quant_fpebc_non_strict_export
```

Differential Revision: D61052687

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133142
Approved by: https://github.com/ydwu4
2024-08-11 21:47:52 +00:00
e7b870c88b mixed_mm: fix segfault when allow_tf32=True (#133173)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133173
Approved by: https://github.com/Chillee
2024-08-11 15:02:24 +00:00
04f37ed57d Add support for returning LSE from FlexAttention (and also differentiating through it) (#133159)
This PR changes the "contract" of `flex_attention_hop` to return LSE in base 2. However, we undo that and return LSE in base e from the `flex_attention` frontend.
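
A small sketch of the base conversion implied above (the exact frontend code may differ; this just shows the math):

```python
import math
import torch

x = torch.randn(4, 8)
lse_e = torch.logsumexp(x, dim=-1)  # natural-log LSE, what the frontend returns
lse_2 = lse_e / math.log(2)         # base-2 LSE, the HOP's internal contract
torch.testing.assert_close(lse_2 * math.log(2), lse_e)  # convert back: multiply by ln(2)
```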

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133159
Approved by: https://github.com/yanboliang
2024-08-11 10:29:16 +00:00
78ccbad678 [inductor] remove dtype check/assert for reduction vec and support bool for min/max (#132473)
This PR removes the dtype check/assert for vectorized reduction and supports bool for min/max reduction.

After removing the dtype check and assertion, the following UT failed:
```
PYTORCH_OPINFO_SAMPLE_INPUT_INDEX=0 python test/inductor/test_torchinductor_opinfo.py -k TestInductorOpInfoCPU.test_comprehensive_max_reduction_no_dim_cpu_bool
```
Now it is supported; generated code:
```
cpp_fused_max_0 = async_compile.cpp_pybinding(['const bool*', 'bool*'], '''
#include "/tmp/torchinductor_root/xf/cxf75ftbahznonqovnsugw7v6sldrabizgtx3j4rhgdmu3r36wlu.h"
extern "C"  void kernel(const bool* in_ptr0,
                       bool* out_ptr0)
{
    {
        {
            bool tmp_acc0 = std::numeric_limits<bool>::min();
            at::vec::VecMask<float,1> tmp_acc0_vec = at::vec::VecMask<float,1>::from(std::numeric_limits<bool>::min());
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(112L); x0+=static_cast<long>(16L))
            {
                auto tmp0 = at::vec::VecMask<float,1>::from(in_ptr0 + static_cast<long>(x0));
                tmp_acc0_vec = tmp_acc0_vec | tmp0;
            }
            #pragma omp simd simdlen(8)
            for(long x0=static_cast<long>(112L); x0<static_cast<long>(125L); x0+=static_cast<long>(1L))
            {
                auto tmp0 = in_ptr0[static_cast<long>(x0)];
                tmp_acc0 = max_propagate_nan(tmp_acc0, tmp0);
            }
            tmp_acc0 = max_propagate_nan(tmp_acc0, tmp_acc0_vec.all_zero());
            out_ptr0[static_cast<long>(0L)] = static_cast<bool>(tmp_acc0);
        }
    }
}
''')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132473
Approved by: https://github.com/jgong5
2024-08-11 08:37:54 +00:00
79ca596dc6 Optimize test_transformers.py (#133049)
- Reduced number of skipped test cases
- Merged redundant test cases

**Benchmark:**

| | Original | New |
| ----- | ----- | ----- |
| Run time | 60 mins | 35 mins |
| Total tests | 75k | 18k |
| Skipped tests | 20k | 4k |

_These are approximate numbers from running test_transformers.py on a single H100, and can change based on the device._

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133049
Approved by: https://github.com/drisspg
2024-08-11 05:20:58 +00:00
a7912bf9dc Make step != 0 test in slice irrefutable (#133091)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133091
Approved by: https://github.com/bdhirsh
2024-08-10 23:56:45 +00:00
cyy
5b7b3e4af0 Fix some issues detected by static analyzer (#132970)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132970
Approved by: https://github.com/ezyang
2024-08-10 16:02:46 +00:00
92f650c5b3 [Inductor][Intel GPU] Support codegen empty_strided_xpu, align with #118255. (#126678)
[Inductor][Intel GPU] Support codegen empty_strided_xpu, align with #118255.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126678
Approved by: https://github.com/EikanWang, https://github.com/jansel, https://github.com/eellison
2024-08-10 14:33:39 +00:00
4a3a30c36e [inductor] remove deprecated cpp_builder implementation. (#133161)
I have worked with @henrylhtsang to switch the cpp_builder to the new one, and we have removed the dependency on the old implementation.
So it is time to remove the old implementation now; this PR makes that change.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133161
Approved by: https://github.com/ezyang
2024-08-10 14:21:22 +00:00
cyy
32be3e942c Remove -Wno-error=pedantic from CMake (#133074)
The codebase is largely clean so that we can turn it on.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133074
Approved by: https://github.com/ezyang
2024-08-10 13:11:21 +00:00
b9922f7a5a [compiled autograd][cpp node] No recaptures from saved float scalars (#133048)
Partially addresses https://github.com/pytorch/pytorch/issues/130170 for float scalars saved from forward pass of a custom c++ autograd function. With this PR, compiled autograd no longer recaptures when the float value changes, but downstream support isn't there yet: 4bdb4bbd86/torch/_dynamo/config.py (L58-L61)

Currently, any non-tensors passed in ctx->saved_data are specialized on by compiled autograd. To stop specializing on float values, we lift the float. We also require user code to use IValue::toSymFloat instead of IValue::toDouble in order to swap the SymFloat to a proxy during compiled autograd tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133048
Approved by: https://github.com/jansel
ghstack dependencies: #132771
2024-08-10 11:05:44 +00:00
c860889a65 [compiled autograd][cpp node] No recompiles from saved int scalars (#132771)
Addresses https://github.com/pytorch/pytorch/issues/130170 for int scalars saved from forward pass of a custom c++ autograd function

Currently, any non-tensors passed in ctx->saved_data are specialized on by compiled autograd. To stop specializing on int values, we lift the ints. We also require user code to use IValue::toSymInt instead of IValue::toInt in order to swap the SymInt to a proxy during compiled autograd tracing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132771
Approved by: https://github.com/jansel
2024-08-10 11:05:44 +00:00
2ad011ca73 [inductor] remove debug code of AotCodeCompiler (#132823)
Since we switched AotCodeCompiler to the new cpp_builder (https://github.com/pytorch/pytorch/pull/132766),
we can remove the debug code of AotCodeCompiler.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132823
Approved by: https://github.com/henrylhtsang
2024-08-10 08:04:48 +00:00
343071cd96 Fix privateuse1 backend name case (#132980)
### Problem

`get_privateuse1_backend(bool lower_case)` always returns a lower case name and `lower_case` is not used.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132980
Approved by: https://github.com/albanD
2024-08-10 07:39:54 +00:00
c8275e25a7 fix requirement for error classification (#133122)
Test Plan: none

Differential Revision: D61039300

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133122
Approved by: https://github.com/yushangdi
2024-08-10 04:59:09 +00:00
9f0d90655d [inductor] cpp_builder add dynamo time trace for compile_file (#133103)
trace `compile_file` time for cpp_builder.
Ref: https://github.com/pytorch/pytorch/pull/132328/files#diff-c9b517f8db609ffa866804dfa2689188a4fee20abacaa0b0dca91625c1b5cb8dR2224

<img width="994" alt="image" src="https://github.com/user-attachments/assets/862c7943-79dc-4d06-b398-a09595ad1295">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133103
Approved by: https://github.com/ezyang
2024-08-10 04:55:02 +00:00
cc5a57d185 Return from monitoring thread on TCPStore failure (#133150)
Summary: Pessimistically assume that things are being torn down if TCPStore is not available, and do not attempt to dump stack traces.

Test Plan:
Seeing crashes in production when Flight Recorder is enabled.
Here's the relevant mast link: https://fburl.com/mlhub/qia257xh

Reviewed By: fduwjj

Differential Revision: D61055124

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133150
Approved by: https://github.com/fduwjj
2024-08-10 03:45:00 +00:00
e888f401c5 Fix autotuning for flex_decoding (#132157)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132157
Approved by: https://github.com/drisspg, https://github.com/yanboliang
ghstack dependencies: #131559
2024-08-10 03:39:48 +00:00
05de2b2d0f Revert "Construct NJT without graph breaks" (#133145)
This reverts commit 911154271309667b55dfb963ec6384bd0048019b.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133145
Approved by: https://github.com/YuqingJ
2024-08-10 03:11:16 +00:00
e890d888d9 [dtensor] move tensor constructors to a separate module (#133129)
This is to ensure __init__.py only contains public APIs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133129
Approved by: https://github.com/awgu, https://github.com/tianyu-l
2024-08-10 02:51:42 +00:00
8fbd7d92a8 Inductor-CPU WoQ int8 GEMM micro-kernel with scale epilogue (#131887)
## Summary

As part of #125683, this PR modifies existing CPU GEMM cpp template & micro-kernel template to enable int8 WoQ GEMM auto-tuning with AVX2, AVX512 & AMX ISAs (the latter is only available on Xeon 4th generation & beyond).

WoQ GEMM takes FP16/BF16 activations, int8 weights, and scale of the same dtype as activations.
The operation is equivalent to `torch.nn.functional.linear(x, w.to(x.dtype)) * scale`, which is essentially what the ATen op `torch.ops.aten._weight_int8pack_mm` currently does (except that weights are not cached by it). Weights will be considered constant & cached, so this implementation is suitable for inference, and not QAT. `scale` is supported as a `mul` epilogue.
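
A quick numerical sketch of that equivalence (shapes chosen arbitrarily; availability and exact input constraints of the ATen op depend on the build):

```python
import torch

M, K, N = 4, 64, 32
x = torch.randn(M, K, dtype=torch.bfloat16)
w = torch.randint(-128, 127, (N, K), dtype=torch.int8)
scale = torch.rand(N, dtype=torch.bfloat16)

ref = torch.nn.functional.linear(x, w.to(x.dtype)) * scale
out = torch.ops.aten._weight_int8pack_mm(x, w, scale)
print((out.float() - ref.float()).abs().max())  # should be small, up to bf16 rounding
```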

Only BF16 activations are supported in this PR because, for FP16 & FP32, the weight is dequantized during the constant-folding pass of freezing, and then after auto-tuning, performance with a large `M` dimension may be better than either torch.ops.aten._weight_int8pack_mm or the WoQ micro-kernel support introduced in this PR (which dequantizes `w` within the micro-kernel).
While even BF16 activations with a large `M` dimension may benefit from dequantizing `w` beforehand, for now they use the WoQ support in the GEMM templates for auto-tuning; a subsequent PR will add logic for deciding whether or not to dequantize weights beforehand.

### Performance
#### AMX
Op-level speedup due to AMX micro-kernel (selected during auto-tuning) on 32 physical cores of Intel(R) Xeon(R) Platinum 8468H (of Xeon 4th generation series, codenamed Sapphire Rapids) vs. ATen kernel `torch.ops.aten._weight_int8pack_mm`. Intel OpenMP & tcmalloc were preloaded.

In a few cases with an odd `K`, the implementation added in this PR may not perform as well as the ATen kernel; this is unrelated to this PR, though, since `test_linear_amx` also exhibits similar datapoints. In those cases, the AMX micro-kernel might be slower than the AVX512 micro-kernel, so if such sets of shapes are used for auto-tuning, either the AVX512 micro-kernel implementation or the ATen kernel would be chosen instead.

Benchmarked with unit-tests.

Tabular data at https://gist.github.com/sanchitintel/294811a86c8ff6b867c668ae2107c405?permalink_comment_id=5142442#gistcomment-5142442

The AVX512 micro-kernel was disabled to collect data for AMX micro-kernel.

#### AVX2/AVX512 micro-kernels

Tabular data at https://gist.github.com/sanchitintel/52b5fa9c66f791be19e48e2aa6423dc4?permalink_comment_id=5142437#gistcomment-5142437

### Follow-up
1. int4 WoQ GEMM micro-kernel will also be added in a separate PR.
2. A subsequent PR would add logic for deciding whether or not to dequantize weights beforehand.

E2E perf measurement should be done with #131310.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131887
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-08-10 02:01:04 +00:00
eqy
c89936eaa0 [CUDA][SDPA] Bump grad_key fudge factor in test_flash_attention_vs_math_ref_grads (#133051)
Abates failures like `ValueError: grad_key Test error 1.592235639691353e-05 is greater than threshold 1.5236437320709229e-05!` that we've seen when bringing up newer versions of CUDA

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133051
Approved by: https://github.com/drisspg, https://github.com/Skylion007
2024-08-10 01:49:30 +00:00
f037803290 Add ChromiumEventLogger, log FXGraphCache and AOTAutogradCache (#132864)
This PR implements ChromiumEventLogger in all @dynamo_timed events. For each dynamo timed call, we log:
- A start event before starting the function execution
- An end event after finishing the function execution
- An extra pair of start/end events for any phase names included in dynamo.

Separately, this also gives us the ability to log instant events. I use them to log cache hits/misses as a first step. The little arrows on the bottom of the UI are cache hits/misses, and you can look at cache details by clicking each triangle.

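As a rough illustration of the event shape (a minimal sketch, not this PR's implementation; field names follow the JSON Event Format document linked below):

```python
import json, os, time

def _event(name, ph, ts_us, args=None):
    # Chromium trace event: "B" = begin, "E" = end, "i" = instant.
    return {"name": name, "ph": ph, "ts": ts_us, "pid": os.getpid(), "tid": 0,
            "cat": "dynamo_timed", "args": args or {}}

events = []
events.append(_event("inductor_compile", "B", time.time_ns() // 1_000))
time.sleep(0.01)  # stand-in for the timed phase
events.append(_event("fx_graph_cache_hit", "i", time.time_ns() // 1_000, {"key": "abc"}))
events.append(_event("inductor_compile", "E", time.time_ns() // 1_000))

with open("trace.json", "w") as f:
    json.dump(events, f)  # load this file in Perfetto / chrome://tracing
```
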
The outputted chromium trace events can be viewed in perfetto for a timeline of an execution. Here's what it looks like for a run of nanogpt:
![image](https://github.com/user-attachments/assets/cb9e6c7a-1acf-45e6-8a27-6651d9ae6132)

And another with warm start:
![image](https://github.com/user-attachments/assets/cd9709bc-59ef-4da1-a7dd-10b1a0ab9b8f)

Trace events are based around the JSON Event format: https://docs.google.com/document/d/1CvAClvFfyA5R-PhYUmn5OOQtYMH4h6I0nSsKchNAySU/preview

We may want to switch to the less deprecated Protobuf format later, but so far I don't see any features we care about supported there.

Internal FB employees can see a link to this in the tlparse output:
https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpVi1FIl/dedicated_log_torch_trace_bb4zl_bc.log/index.html

I'll also work on logging these

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132864
Approved by: https://github.com/aorenste
2024-08-10 01:15:53 +00:00
de48d54042 [TorchRec] Add Support for FakeProcessGroup (#133039)
Summary:
# context
* use FakeProcessGroup to mimic the multi-process tests
* can use `_test_compile_fake_pg_fn` as the single-process VB compile test
```
from torchrec.distributed.tests.test_pt2_multiprocess import _test_compile_fake_pg_fn
_test_compile_fake_pg_fn(
    rank=0,
    world_size=2,
)
```

reference: D59637444

Test Plan:
# run test
* run command and results: P1519228952, [tlparse](https://interncache-all.fbcdn.net/manifold/tlparse_reports/tree/logs/.tmpwMCK1E/index.html)
```
TORCH_TRACE=/var/tmp/tt TORCH_SHOW_CPP_STACKTRACES=1 TORCH_LOGS="+all" buck2 run fbcode//mode/opt fbcode//torchrec/distributed/tests:test_pt2_multiprocess
```

Differential Revision: D56124045

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133039
Approved by: https://github.com/ezyang
2024-08-10 01:10:47 +00:00
3899465268 relax unification checks when size-like symbols can be 0 (#133112)
Test Plan: Fixes test failure in https://www.internalfb.com/diff/D51127481

Differential Revision: D61031307

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133112
Approved by: https://github.com/angelayi
2024-08-10 00:57:49 +00:00
72f2b29bb0 [CI] disable xpu kineto build (#133069)
The xpu kineto support PR https://github.com/pytorch/pytorch/pull/130811 has landed, but the xpu CI infra is not ready yet, so disable the kineto build as a temporary workaround.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133069
Approved by: https://github.com/seemethere
2024-08-09 23:58:50 +00:00
21302d5891 AutoHeuristic: script to generate data for mm (#131617)
This PR introduces a script that can be used to generate training data for tuned_mm in order to learn a heuristic with AutoHeuristic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131617
Approved by: https://github.com/eellison
ghstack dependencies: #131615, #131616
2024-08-09 23:49:29 +00:00
e7512ab752 inductor mm autotuning: add back previously pruned configs (#131616)
This PR adds back 10 configs for tuned_mm that were previously removed in https://github.com/pytorch/pytorch/pull/126570. The main idea is that we use 30 configs to autotune only when data is collected with AutoHeuristic. The learned heuristic will prune these 30 configs down to 10 configs, which reduces compilation time and at the same time might improve performance.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131616
Approved by: https://github.com/eellison
ghstack dependencies: #131615
2024-08-09 23:49:29 +00:00
e5fa190e01 AutoHeuristic: tuned_mm (#131615)
This PR enables AutoHeuristic to be used for `tuned_mm`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131615
Approved by: https://github.com/eellison
2024-08-09 23:49:29 +00:00
3b440f358c [elastic collectives API] add missing rank tracing support (#132818)
Adds an optional way to detect missing ranks (which can be mapped to host info via the `rank_tracing_decoder` lambda argument) in the store barrier operation.

This approach has been used in some form already, moving it to collectives API.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132818
Approved by: https://github.com/d4l3k
2024-08-09 22:55:04 +00:00
6beb2be2ed Fix _dynamo.variables.torch_function.global_mangled_class_name (#132744)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132744
Approved by: https://github.com/zou3519
2024-08-09 22:19:01 +00:00
d2ecdcb2f7 [Profiler] Add API for Dynamic Activity Toggling [2/n] (#133035)
Summary: During PT2 there are many GPU/CPU events that are unnecessary to profile in between a given step. To remedy this, we can add an API that takes in a list of activities and an arg for whether to toggle said activities or not. For this diff we are adding the profiler API to propagate down to kineto (and in the future the collection.cpp logic). Subsequent diffs will be added for CPU toggling and e2e testing.

Test Plan: Tested by toggling backward gpu traces off and got following trace: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Jul_31_13_40_55.3251726.pt.trace.json.gz&bucket=gpu_traces

Reviewed By: aaronenyeshi

Differential Revision: D60541767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133035
Approved by: https://github.com/aaronenyeshi
2024-08-09 21:54:54 +00:00
b0b4723062 [c10d] Rename PG name and PG ID attribute (#132915)
As discussed in https://github.com/pytorch/pytorch/pull/132058, we think pg_uid and local_uid might be better names for pg_name and pg_id, so this PR renames them. More PRs are needed to update the logging and other places.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132915
Approved by: https://github.com/fegin
ghstack dependencies: #132058
2024-08-09 21:26:56 +00:00
4110cb6ba7 Add explicit GQA support. (#131559)
### tl;dr
This PR adds GQA support to higher order op `flex_attention`.

## Details
When `enable_gqa` is set to True, HOP `flex_attention(score_mod, query, key, value, block_mask, enable_gqa)` runs Group Query Attention (GQA), where the number of query heads (Hq) is a multiple of the number of key/value heads (Hkv). Each group of query heads (`Hq//Hkv` heads) attends to a shared kv head.
Otherwise, `flex_attention` assumes Multi Head Attention (MHA), where the number of query heads equals the number of kv heads.

The `score_mod` and `mask_mod` APIs are adapted accordingly to take `q_head` as the head index.
```
def score_mod(score: torch.Tensor, batch: torch.Tensor, q_head: torch.Tensor, token_q: torch.Tensor, token_kv: torch.Tensor) -> torch.Tensor

def mask_mod(batch: torch.Tensor, q_head: torch.Tensor, token_q: torch.Tensor, token_kv: torch.Tensor) -> torch.Tensor
```

## Example
```python
import torch
from torch.nn.attention.flex_attention import flex_attention
from torch.nn.attention.flex_attention import create_block_mask

torch.manual_seed(0)

def query_key_value_clones(
    query: torch.Tensor,
    key: torch.Tensor,
    value: torch.Tensor,
    dtype: torch.dtype = None,
):
    """Clones the query, key, and value tensors and moves them to the specified dtype."""
    if dtype is None:
        dtype = query.dtype
    query_ref = query.clone().detach().to(dtype).requires_grad_(query.requires_grad)
    key_ref = key.clone().detach().to(dtype).requires_grad_(key.requires_grad)
    value_ref = value.clone().detach().to(dtype).requires_grad_(value.requires_grad)
    return query_ref, key_ref, value_ref

# Let's create some input tensors
# The input tensor has shape (batch_size, num_heads, seq_len, head_dim).
# query and key/value can have different num_heads and seq_len
# Here 8 query heads share one KV head.
query = torch.randn(2, 8, 2048, 64, device="cuda", dtype=torch.float32, requires_grad=True)
key = torch.randn(2, 2, 2048, 64, device="cuda", dtype=torch.float32, requires_grad=True)
value = torch.randn(2, 2, 2048, 64, device="cuda", dtype=torch.float32, requires_grad=True)

query1, key1, value1 = query_key_value_clones(query, key, value)

# Let's create a score_modification. We take alibi_bias as an example.
# score_mod takes batch index, query head index, query index, and key/value index.
def _generate_alibi_bias(num_kv_heads: int, num_q_heads: int):
    def _alibi_bias(
        score: torch.Tensor,
        b: torch.Tensor,
        hq: torch.Tensor,
        token_q: torch.Tensor,
        token_kv: torch.Tensor,
    ) -> torch.Tensor:
        # Let's calculate kv head from query head index
        group = num_q_heads // num_kv_heads
        hkv = hq // group

        scale = torch.exp2(-((hkv + 1) * 8.0 / num_kv_heads))
        return score + (token_kv - token_q) * scale

    return _alibi_bias

# Let's apply a causal mask on top of it
def causal_mask(b, h, q, kv):
    return q >= kv

# Generate a block mask for our new mask_mod function.
# The mask is broadcast along the head & batch dimensions.
block_mask = create_block_mask(causal_mask, B=1, H=1, Q_LEN=2048, KV_LEN=2048)

# Let's call flex_attention with our new score modification and block mask under eager mode.
output = flex_attention(query, key, value, score_mod=_generate_alibi_bias(2, 8), block_mask=block_mask, enable_gqa=True)

# Now let's compile flex_attention and run the flex_attention kernel.
compiled_flex_attention = torch.compile(flex_attention)
out_compiled = compiled_flex_attention(query1, key1, value1, score_mod=_generate_alibi_bias(2, 8), block_mask=block_mask, enable_gqa=True)

torch.testing.assert_close(output, out_compiled, atol=5e-2, rtol=2e-2)

```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131559
Approved by: https://github.com/drisspg
2024-08-09 21:25:35 +00:00
dc8bb2636c [c10d][doc] Add docs for ENV variables TORCH_NCCL_ASYNC_ERROR_HANDLING TORCH_NCCL_TRACE_CPP_STACK and TORCH_NCCL_COORD_CHECK_MILSEC (#132920)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132920
Approved by: https://github.com/fegin, https://github.com/wconstab
2024-08-09 21:08:20 +00:00
78fa32a77b Turn off Function Event Accumulation by Default (#133095)
Summary: D56956245 added the ability to accumulate FunctionEvents across multiple cycles in order to perform statistical analysis on them all together. Although this can be useful, it uses too many CPU resources, especially for long-running jobs. For this reason, let's add a flag to the profiler to turn off this behavior by default, but still allow users to turn it on if they wish.

Test Plan: Changed function count test to have acc_events passed in and check the amount of function events based on if flag is true or not

Differential Revision: D61021490

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133095
Approved by: https://github.com/briancoutinho, https://github.com/LucasLLC, https://github.com/aaronenyeshi
2024-08-09 20:47:20 +00:00
c44cb89e06 [export] detach constant tensors when they're not registered as buffer or parameter in unlift (#133031)
Summary:
Fixes T198245910.

In the previous diff D60532628, which caused the test failure, we fixed the inconsistency caused by constant tensors accidentally being registered as buffers, by deleting the buffers and reassigning them as constants.

However, this broke several existing tests in pyspeech: when the exported program is re-traced with torch.jit.trace (an anti-pattern we should probably align on), the jit tracer finds this constant tensor requiring grad and errors out.

This PR forces constant attributes to not require grad, which is the correct behavior. A better fix would be to find out where the constants are created in user code and why they require grad, but this has low ROI, so we warn the user about it.

Test Plan: See failures in T198245910.

Differential Revision: D60974869

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133031
Approved by: https://github.com/angelayi
2024-08-09 20:33:52 +00:00
cd307fb0b1 [FSDP2] reset FSDPParam.sharded_param in lazy_init (#132954)
motivated by FSDP2 + DoRA https://github.com/pytorch/pytorch/issues/132721

After meta init, we need a user-defined function to move DoRALinear.magnitude from device=meta to device=cuda.
The problem is how to trigger reset_sharded_param or _apply to update FSDPParam; otherwise lazy_init complains that DoRALinear.magnitude is still on device=meta.

credit to @awgu for chasing after DDP lazy_init to unblock the PR

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132954
Approved by: https://github.com/awgu
ghstack dependencies: #133059
2024-08-09 20:26:10 +00:00
78cf8df4a0 [aoti] forward fix of [inductor] switch AotCodeCompiler to new cpp_builder. (take 3) (#133042)
Summary:
Forward fix of a test failure caused by D60773405.

The idea of D60773405 is that we need to use absolute paths, so we want to keep the older definitions of the paths for output_so and output_o.

However, when I was copying the older definitions of output_so and output_o, I thought it was okay to simplify it a bit. See https://github.com/pytorch/pytorch/pull/131304#issuecomment-2270016609

Turns out I was wrong.

Test Plan: ci

Differential Revision: D60990594

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133042
Approved by: https://github.com/hl475, https://github.com/desertfire
2024-08-09 18:53:27 +00:00
472b0daeaa [DDP][FSDP2] keep DTensor params for replicate(fully_shard) (#133059)
Current status: for `replicate(fully_shard)`, DDP lazy_init converts DTensor into a local tensor, and that breaks FSDP unshard.

This PR keeps FSDP params untouched during DDP lazy_init.
I came across it because of a CI error in FSDP2's unit test #132978.
Thanks @awgu for the fix proposal.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133059
Approved by: https://github.com/Skylion007, https://github.com/fegin
2024-08-09 18:38:05 +00:00
e66084f9bf [BUG FIX] Refactor _scale_attn_mask_fusion_kernel to Use Runtime Argument Instead of Template Parameter (#132434)
**Description**

**_[BUG FIX]_**
This PR fixes a bug which happens during compilation with the GCC 11.4 compiler in the FlashAttentionKernel.cpp file. This issue doesn't seem to exist on the PyTorch main branch but gets introduced with our SVE PR changes (https://github.com/pytorch/pytorch/pull/119571) + PyTorch main.

See the CI Pipeline failing in our PR:
https://github.com/pytorch/pytorch/actions/runs/9895714768/job/27336251795?pr=119571

```
/var/lib/jenkins/workspace/build/aten/src/ATen/native/cpu/FlashAttentionKernel.cpp.SVE256.cpp
during RTL pass: expand
In file included from /var/lib/jenkins/workspace/build/aten/src/ATen/native/cpu/FlashAttentionKernel.cpp.SVE256.cpp:1:
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/FlashAttentionKernel.cpp: In lambda function:
/var/lib/jenkins/workspace/aten/src/ATen/native/cpu/FlashAttentionKernel.cpp:290:57: internal compiler error: in emit_move_insn, at expr.c:3821
  290 |   at::parallel_for(0, batchSize * num_head * qSlice, 1, [&](int64_t begin, int64_t end) {
      |                                                         ^
0xffffb03f73fb __libc_start_call_main
	../sysdeps/nptl/libc_start_call_main.h:58
0xffffb03f74cb __libc_start_main_impl
	../csu/libc-start.c:392
Please submit a full bug report,
with preprocessed source if appropriate.
Please include the complete backtrace with any bug report.
See <file:///usr/share/doc/gcc-11/README.Bugs> for instructions.

[5731/6839] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/CatKernel.cpp.SVE256.cpp.o
[5732/6839] Building CXX object caffe2/CMakeFiles/torch_cpu.dir/__/aten/src/ATen/native/cpu/GridSamplerKernel.cpp.SVE256.cpp.o
```

This compilation issue only happens with GCC 11.4 and works fine with the latest GCC 12.3 compiler and also the Clang compiler. The issue is related to the check for `is_b_stride_zero`, introduced as a template parameter (compile-time check complexity) in commit 5da428d9eb, which was recently added to the FlashAttentionKernel.cpp file.

This PR fixes the above compilation failure with GCC 11.4 compiler.

cc : @Valentine233 @yanbing-j @mingfeima @malfet @jgong5 @r-barnes

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132434
Approved by: https://github.com/jgong5
2024-08-09 18:34:42 +00:00
b41d62a3a2 Fix typo in docs of all_gather (#133066)
Fix a typo of docs:
```
def all_gather(tensor_list, tensor, group=None, async_op=False):
...
        [tensor([0, 0], device='cuda:0'), tensor([0, 0], device='cuda:1')] # Rank 1
```
`cuda:0` should be `cuda:1`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133066
Approved by: https://github.com/awgu
2024-08-09 18:21:26 +00:00
f3eab23c42 Fix typo in mypy.ini (#133097)
A missing comma in the file list currently leads to errors when running mypy, introduced in #113745

Fixes #133096

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133097
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-08-09 18:19:51 +00:00
31ef900a65 Revert "added persistent option to buffers and namedbuffers (#132994)"
This reverts commit 8707c6dfacaed293ddc40cbb5ecf5841568df0e6.

Reverted https://github.com/pytorch/pytorch/pull/132994 on behalf of https://github.com/PaliC due to breaking internal pyre tests ([comment](https://github.com/pytorch/pytorch/pull/132994#issuecomment-2278487672))
2024-08-09 18:14:53 +00:00
6c012f7217 [c10d][Log] Use pg_id instead of pg_name for logging prefix (#132058)
When checking the c10d logs, I found they showed "[PG 7 rank 7]" when they actually meant "[PG 1 rank 7]". So we need to use pg_id (aka uid_) rather than pg_name_: when creating sub-PGs we currently need to call it multiple times, so PG names end up based on bumped-up numbers (e.g., 7 rather than 1). Using pg_id is more accurate and consistent with other logging tools.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132058
Approved by: https://github.com/shengbao-zheng, https://github.com/shuqiangzhang
2024-08-09 18:14:10 +00:00
655ec07525 [ROCm] TunableOp logging improvements (#132173)
Summary:
TunableOp logging improvements:
1. PYTORCH_TUNABLEOP_VERBOSE=1: print out the expected value vs actual value for TunableOp validators, so that if validation fails, we know exactly how to fix it
2. PYTORCH_TUNABLEOP_VERBOSE=3: print out the exact kernel signature for both successful and failure cases in kernel lookup
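
A hedged usage sketch (on a ROCm build with TunableOp; the verbose variable is the one named above, and PYTORCH_TUNABLEOP_ENABLED is the standard enable switch):

```python
import os
# Set before running any GEMMs so TunableOp picks the settings up.
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"
os.environ["PYTORCH_TUNABLEOP_VERBOSE"] = "3"   # print kernel signatures on every lookup

import torch
a = torch.randn(2, 8192, device="cuda", dtype=torch.bfloat16)
b = torch.randn(8192, 1024, device="cuda", dtype=torch.bfloat16)
c = a @ b   # GEMM routed through TunableOp; validator and lookup logs are printed
```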

Test Plan:
> PYTORCH_TUNABLEOP_VERBOSE=3 buck
2 run mode/{opt,amd-gpu} -c fbcode.enable_gpu_sections=true //scripts/xdwang/example:fc_llama -- --enab
le-tuning

```
reading tuning results from hipblas_tuning_pt_llama0.csv
Validator PT_VERSION=2.5.0
Validator ROCBLAS_VERSION=4.0.0-72e57364-dirty
Validator HIPBLASLT_VERSION=800-a15e4178
Validator ROCM_VERSION=6.0.0.0-12969-1544e39
Validator GCN_ARCH_NAME=gfx942:sramecc+:xnack-
GCN_ARCH_NAME validation: expect gfx942:sramecc+:xnack- to match gfx942:sramecc+:xnack-
ROCM_VERSION validation: expect 6.0.0.0-12969-1544e39 to match 6.0.0.0-12969-1544e39
HIPBLASLT_VERSION validation: expect 800-a15e4178 to match 800-a15e4178
ROCBLAS_VERSION validation: expect 4.0.0-72e57364-dirty to match 4.0.0-72e57364-dirty
PT_VERSION validation: expect 2.5.0 to match 2.5.0
Loading results
GemmTunableOp_BFloat16_TN(tn_8192_2_1024) -> Gemm_Hipblaslt_TN_61169,0.0171694
GemmTunableOp_BFloat16_TN(tn_7168_2_8192) -> Gemm_Hipblaslt_TN_61089,0.036138
GemmTunableOp_BFloat16_TN(tn_8192_2_3584) -> Gemm_Hipblaslt_TN_61169,0.0240673
missing params_signature, returning null ResultEntry for GemmTunableOp_BFloat16_TN,tn_1280_2_8192
finding fastest for GemmTunableOp_BFloat16_TN(tn_1280_2_8192) out of 2818 candidates
Rotating buffer 4 MiB. Needed Size: 20 MiB. Needed number of param copies: 1
├──tuning using warmup iters 0 [0 ms] and tuning iters 1 [0.208254 ms] instance id=0, GemmTunableOp_BFloat16_TN(tn_1280_2_8192) Default
├──offset at 3
......
ResultEntry found for GemmTunableOp_BFloat16_TN,tn_8192_2_3584
ResultEntry found for GemmTunableOp_BFloat16_TN,tn_8192_2_3584
ResultEntry found for GemmTunableOp_BFloat16_TN,tn_8192_2_3584
Avg time: 16.42832040786743 us, Achieved 7.15 TFLOPS, 3578.07 GB/s

2x1280x8192-torch.bfloat16,16.260499954223633,2.5794434438103107,1294.0669757533708
2x8192x1024-torch.bfloat16,16.15394949913025,2.0771658350056508,1041.11852032876
2x7168x8192-torch.bfloat16,25.691540241241455,9.14234887416194,4574.841325057144
2x8192x3584-torch.bfloat16,16.42832040786743,7.1486621324818085,3578.0709494714856
```

Differential Revision: D60468273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132173
Approved by: https://github.com/mxz297, https://github.com/jeffdaily, https://github.com/eqy
2024-08-09 17:55:21 +00:00
d13e72fd6a [c10d] set a shorter heartbeat detect timeout to avoid race with NCCL timeout (#133028)
What we found recently is that:
1. Monitoring detects a watchdog hang (no heartbeat) at the same time as the NCCL timeout. This race leads to less useful debug info getting dumped to the logs (such as CudaEventDestroy and the GIL checker).
2. We don't kill the program if the monitoring thread has not been enabled but is somehow still silently running. Also, users who feel this is too short should configure TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC themselves.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133028
Approved by: https://github.com/shuqiangzhang, https://github.com/wconstab
2024-08-09 17:48:34 +00:00
574cdf1232 [export] Merge functions in replace set_grad/autocast with HOO (#132724)
Summary: as title

Test Plan: CI

Differential Revision: D60701648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132724
Approved by: https://github.com/ydwu4
2024-08-09 17:25:07 +00:00
2dbe5cb979 [C10D] Clarify warning for concurrent PG usage (#131895)
Addresses a common misconception about safety of using multiple NCCL
process groups from PyTorch.

Notably, it IS safe to use multiple process groups, so long as
communication operations from different groups are not allowed to
overlap.  (Overlap of communication operations from one group with
compute operations IS ok).

TODO: after getting feedback on the text, update other copies of the warning on other APIs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131895
Approved by: https://github.com/fduwjj
2024-08-09 17:06:46 +00:00
bc57d5b6ff [Inductor][CPP] Turns on inline_inbuilt_nn_modules for CPP GEMM template testing (#132487)
**Summary**
The CPP GEMM template testing had been skipped when `inline_inbuilt_nn_modules` was turned on, as described in https://github.com/pytorch/pytorch/issues/131929. Since https://github.com/pytorch/pytorch/pull/132334 has landed to fix the issues, turn this flag back on, since it's the default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132487
Approved by: https://github.com/anijain2305, https://github.com/jgong5
2024-08-09 16:56:57 +00:00
23b877cb54 [inductor] a less ambitious way to solve the scalar tensor (#132702)
Fixes #121374

The previous PR https://github.com/pytorch/pytorch/pull/131775 tried to convert the 0-dim CPU tensor to a DynamicScalar in the lowering stage, but too many lowering rules are incompatible with that approach. So this PR does the conversion in the codegen stage instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132702
Approved by: https://github.com/eellison
2024-08-09 16:29:36 +00:00
50595ecef4 Revert "[BE] Raise when the target model has scalar parameters (#132934)"
This reverts commit ea00036841b225330396f8d8f6ecf796f4826786.

Reverted https://github.com/pytorch/pytorch/pull/132934 on behalf of https://github.com/clee2000 due to I think this broke distributed/_composable/fsdp/test_fully_shard_init.py::TestFullyShardShardedParameterTensor::test_raise_scalar_parameter [GH job link](https://github.com/pytorch/pytorch/actions/runs/10314920655/job/28563430905) [HUD commit link](ea00036841).  Dr CI is wrong, it is not flaky ([comment](https://github.com/pytorch/pytorch/pull/132934#issuecomment-2278208789))
2024-08-09 15:30:34 +00:00
065f7aa44b [inductor] tensor_is_align fallbacking False if unbacked expr not comptime evaled (#132423)
Currently, if storage_offset is an unbacked symbol and is_align cannot be computed at compile time, it hard-fails.

Doing the best we can: adding guard_size_oblivious and falling back to False if it cannot be evaluated at compile time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132423
Approved by: https://github.com/ezyang
2024-08-09 15:07:42 +00:00
4bdb4bbd86 Fix fbcode AOTI GPU lowering for ARM64 hosts (#133017)
Summary: Fix fbcode AOTI GPU lowering for ARM64 hosts

Reviewed By: hl475

Differential Revision: D60969898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133017
Approved by: https://github.com/hl475
2024-08-09 14:05:13 +00:00
f2bacd908a [BE] Move function definitions to .cpp (#132927)
Summary:
Non-functional change.

Move function definitions for NCCLTraceBuffer to .cpp files.

Test Plan:
Unit tests.

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132927
Approved by: https://github.com/Skylion007, https://github.com/d4l3k
ghstack dependencies: #132916
2024-08-09 13:52:29 +00:00
465e071898 Revert "[CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade (#131493)"
This reverts commit 927b4c11143e047eb6e3430e4c7c912064572f1b.

Reverted https://github.com/pytorch/pytorch/pull/131493 on behalf of https://github.com/nmacchioni due to breaking many tests ([comment](https://github.com/pytorch/pytorch/pull/131493#issuecomment-2277738114))
2024-08-09 11:30:23 +00:00
f565d16acb Fix work-around item non-sync issue on AMD (#133054)
Summary: Otherwise it will break FSDP code paths

Test Plan:
unit test

see next diff for print message
```
sh ./scripts/lufang/amd/small_repro.sh
ROCM_GET_SCALAR_ITEM_SYNC=1 sh ./scripts/lufang/amd/small_repro.sh
```

It will log "====== Async mode ======" or "====== Sync mode ======" correspondingly

Differential Revision: D60995134

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133054
Approved by: https://github.com/houseroad
2024-08-09 09:22:29 +00:00
927b4c1114 [CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade (#131493)
Unblocks/unbreaks against newer CUTLASS (3.5+)

CC @nWEIdia @xwang233 @ptrblck @thakkarV

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131493
Approved by: https://github.com/Skylion007
2024-08-09 07:35:38 +00:00
7b8ab7eb3e [dynamo] Partially support random.Random class (#133037)
This partially fixes the graph break issue when instantiating a `random.Random` class in Python.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133037
Approved by: https://github.com/anijain2305
2024-08-09 07:15:42 +00:00
ea00036841 [BE] Raise when the target model has scalar parameters (#132934)
Address the issue, https://github.com/pytorch/pytorch/issues/130810.

Both FSDP1 and FSDP2 do not support scalar parameters. For FSDP1, the issue happens during state_dict operations while FSDP2 fails during the initialization. This PR adds exceptions to help users debug the issue and change the scalar parameters to 1D parameters.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132934
Approved by: https://github.com/awgu
ghstack dependencies: #132908, #132933
2024-08-09 06:45:48 +00:00
5707c6e952 [Fake tensor] Align the appearance of device_put op in fx_graph generated for CUDA and XPU, which is exposed in the issue #130823 (#132479)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132479
Approved by: https://github.com/EikanWang, https://github.com/zou3519, https://github.com/eellison
2024-08-09 05:31:00 +00:00
cyy
da65cfbdea Remove unused Caffe2 macros (#132979)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132979
Approved by: https://github.com/ezyang
2024-08-09 04:48:20 +00:00
cyy
05e8e87a69 [Submodule] Remove foxi (#132976)
It is not used after removal of Caffe2 code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132976
Approved by: https://github.com/ezyang
2024-08-09 03:46:52 +00:00
bb6eef8ed1 [2/2] PT2 Inductor ComboKernels - automatic horizontal fusing (#131675)
Summary:
A ComboKernel combines independent Inductor Triton kernels into a single one.
This is the part-2 pull request, which 1) adds automatic horizontal fusion at the end of the Inductor operator fusion process and 2) adds type annotations for triton_combo_kernel.py.

ComboKernel is used in two cases: 1) for existing foreach kernels, combo kernels are used as the backend kernel; the front-end kernel generation logic remains the same. 2) An extra optimization phase is added at the end of the scheduler to generate extra combo kernels if combo_kernels is True in config.py.

This part-2 pull request deals with the 2nd case above:

- The combo kernel generation in the added optimization phase is done in two steps: 1) in the front end, inside the scheduler, it topologically sorts the schedule nodes to find all the nodes with no data dependency and creates a front-end schedule node for them. We currently limit the maximal number of sub-nodes for each combo kernel to 8 (but we still need to find the optimal number). 2) Then, these sub-nodes are combined in the codegen phase to generate the combo kernel code for them based on a few rules. For example, 1d and 2d kernels are separated into different combo kernels, as mixing them is not supported yet. Note that the algorithms we provide are very basic, and users can register their customized combo kernel generation algorithms for both steps.

- Performance-wise, combining small kernels almost always yields a performance gain. However, combining very large kernels may not see any perf gain, and can sometimes even regress, possibly due to improper block sizes. Thus, a benchmark function is implemented to avoid such perf regressions, and it is recommended to turn it on by setting benchmark_combo_kernels to True whenever combo_kernels is True (a usage sketch follows).
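
A short usage sketch, assuming the config attribute names mentioned in this summary (combo_kernels and benchmark_combo_kernels under torch._inductor.config):

```python
import torch
import torch._inductor.config as inductor_config

inductor_config.combo_kernels = True            # enable the extra horizontal-fusion phase
inductor_config.benchmark_combo_kernels = True  # benchmark candidates to avoid perf regressions

def fn(a, b):
    # Two independent pointwise kernels with no data dependency: combo-kernel candidates.
    return a.sin(), b.cos()

compiled = torch.compile(fn)
out = compiled(torch.randn(1024, device="cuda"), torch.randn(1024, device="cuda"))
```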

Please refer to part 1 pull request https://github.com/pytorch/pytorch/pull/124969 for more details.

Test Plan: buck2 test mode/dev-nosan caffe2/test/inductor:combo_kernels

Differential Revision: D60067757

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131675
Approved by: https://github.com/mlazos
2024-08-09 03:14:16 +00:00
8875226d62 [dtensor] multi-dim mesh redistribute follow up (#133023)
follow up from https://github.com/pytorch/pytorch/pull/131210

and added one test case from user in

https://github.com/pytorch/pytorch/issues/132751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133023
Approved by: https://github.com/tianyu-l
ghstack dependencies: #133022
2024-08-09 02:26:23 +00:00
3b7edc12c6 [dtensor] more refactor to imports/paths (#133022)
as titled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133022
Approved by: https://github.com/XilunWu, https://github.com/wz337
2024-08-09 02:26:23 +00:00
22ea248aa8 dynamic shapes mismatch errors (#132982)
Summary: When PyTree detects a structural mismatch between inputs and dynamic shapes, the error messages are quite horrible. This PR fixes these error messages by adding, for each kind of error, the path to the point where the error happens and an actionable reason for the error.

Test Plan: added test with several cases

Differential Revision: D60956976

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132982
Approved by: https://github.com/yushangdi
2024-08-09 02:22:32 +00:00
cyy
8967d55b01 [18/N] Fix clang-tidy warnings in jit (#132963)
Follows #132753

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132963
Approved by: https://github.com/Skylion007
2024-08-09 01:27:32 +00:00
313aa151da Revert "[ROCm] TunableOp logging improvements (#132173)"
This reverts commit 9cca0494b9d5c89c0a1100aee9477ed8ca7d527b.

Reverted https://github.com/pytorch/pytorch/pull/132173 on behalf of https://github.com/PaliC due to reverted internally ([comment](https://github.com/pytorch/pytorch/pull/132173#issuecomment-2276966242))
2024-08-09 01:04:57 +00:00
4101dd14c2 Make debugging backends accept and ignore options kwargs from torch.compile (#132892)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132892
Approved by: https://github.com/anijain2305, https://github.com/jansel
2024-08-09 00:49:45 +00:00
0ff0bf3d31 [Replicate] Fix replicate with DeviceMesh initialization (#133024)
A follow up on https://github.com/pytorch/pytorch/pull/132339.

`get_parent_mesh` is replaced by `get_root_mesh`. In addition, modify a few places that parent mesh is mentioned in test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133024
Approved by: https://github.com/Skylion007, https://github.com/fegin
2024-08-09 00:45:47 +00:00
10c2168b31 [pt2-bench] use larger multiplier for smaller tensors for a few models (#132952)
Fix https://github.com/pytorch/pytorch/issues/132922  and https://github.com/pytorch/pytorch/issues/132924

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132952
Approved by: https://github.com/eellison, https://github.com/jansel
2024-08-09 00:09:21 +00:00
3c5b246d3c [export] Remove Proxy from exported programs and modules (#132956)
Summary: Remove Proxy from exported programs and modules because they cannot be deepcopied or pickled.

Test Plan:
CI

```
buck2 run 'fbcode//mode/dev-nosan'  fbcode//caffe2/test/quantization:test_quantization -- -r  qat_conv2d
buck2 run 'fbcode//mode/dev-nosan' fbcode//modai/test:test_modai -- -r test_qat_stinson_htp_export
buck2 run 'fbcode//mode/dev-nosan' fbcode//vizard_projects/ml_depth/tests:test_model -- -r test_qat_model_et
buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/backends/tests:qnn_test -- -r test_qat_bias=False,use_3d_input=False
buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/backends/tests:qnn_test -- -r test_qat_bias=True,use_3d_input=False
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r  test_fold_bn_erases_bn_node
```

Differential Revision: D60940832

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132956
Approved by: https://github.com/angelayi
2024-08-09 00:00:20 +00:00
e2b94923ba [PyTorch] Speed up decomposed quantize_per_channel (#133029)
Similar to D60871396 (#132828).

Differential Revision: [D60978385](https://our.internmc.facebook.com/intern/diff/D60978385/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133029
Approved by: https://github.com/cccclai
2024-08-08 23:48:34 +00:00
fa8c34301a [ts-migration]: Quantized ops to standard ops pass. (#133026)
#### Description
Transform quantized operations properly by adding de/quantization before and after each quantized operation.

#### Test Plan
`pytest test/export/test_converter.py -s -k test_ts2ep_convert_quantized_model`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133026
Approved by: https://github.com/angelayi
2024-08-08 23:10:17 +00:00
45cf8ef557 add impls for required for nt ops (#132710)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132710
Approved by: https://github.com/jbschlosser
ghstack dependencies: #131060
2024-08-08 23:09:38 +00:00
1434e0b121 Add a private _safe_softmax (#131060)
# Summary
Changes the stance of SDPA on what to do for fully masked out rows

## Current Behavior
Several PyTorch users have expressed frustration over this issue:
- https://github.com/pytorch/pytorch/issues/41508
- https://github.com/pytorch/pytorch/issues/103749
- https://github.com/pytorch/pytorch/issues/103963

These are significant issues with extensive discussion but no satisfactory resolution. The PyTorch team's consensus, as stated here:
https://github.com/pytorch/pytorch/issues/24816#issuecomment-524415617

Can be paraphrased as follows:

When passing in fully masked out rows, attention becomes ambiguous. We have two main options:

1. Uniformly attend to all values:
   ```python
   scores[masked_out_rows] = 1 / len(row)
   out[masked_out_rows] = 1 / len(row) * value
   ```

2. Decide that attention between no queries (masked) and no keys (masked) is meaningless:
   ```python
   output[fully_masked_rows] = NaN
   ```

We went with option 2, partially because it was easier to implement, but also because people argued that users can slice the output to remove the NaNs:
``` Python
>fill_value = -float("inf")
>row0 = torch.randn(4)
>row1 = torch.tensor([fill_value for _ in range(4)])
>matrix = torch.stack([row0, row1]).requires_grad_(True)
>out = torch.softmax(matrix, 1)
>out = out[0]
>print(out)
tensor([0.5377, 0.2729, 0.0692, 0.1201])
```
Cool, problem solved. But what happens when you call backward...
```Python
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[3.0957e-08, 1.4157e-08, 7.7802e-10, 1.3713e-08],
        [       nan,        nan,        nan,        nan]])
```
Those pesky NaNs are back!

## Why do we see NaNs today?

The core of the problem revolves around using softmax function in sdpa:

```python
> row = torch.tensor([(-float("inf")) for _ in range(4)])
> torch.softmax(row, 0)
tensor([nan, nan, nan, nan])
```

## Quick Aside: Masking in Attention

Attention itself doesn't have a concept of masking. The `sdpa` function has an argument called `attn_mask`, which would be more accurately named `attn_bias`. This is because we don't actually "mask" entries when computing attention. Instead, due to implementation details([performance](https://github.com/pytorch/pytorch/issues/25110#issuecomment-524519087)), we add a value to the masked-out query/key pairs.

We use a large negative number (typically -inf) to decrease the attention weight, as softmax assigns more weight to larger values.

## Alternative Approaches

If we use a very large negative number instead of -inf:

```python
> row = torch.tensor([(-1e6) for _ in range(4)])
> torch.softmax(row, 0)
tensor([0.2500, 0.2500, 0.2500, 0.2500])
```
However, if users always remembered to "slice" out their outputs, i.e.:
```Python
>fill_value = -1e6
>...
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[-0.0563, -0.0564,  0.1613, -0.0486],
        [ 0.0000,  0.0000,  0.0000,  0.0000]])
```
This would bring us back into a better state.

## A Third Option

We don't necessarily need to alter the behavior of softmax for -inf or very large negative numbers. The fundamental goal is to exclude certain query/key pairs from attention, regardless of the underlying implementation.

This PR implements the new semantic for masking w/ attention in fully masked-out rows:
```python
out[masked_out_rows] = 0
```

**Important Note**: This idea isn't entirely new. The [MaskedTensor](https://pytorch.org/tutorials/prototype/maskedtensor_overview#safe-softmax) prototype, a tensor subclass, was designed to handle such cases. However, it remains a prototype feature and hasn't gained widespread adoption.

## Details
This PR stack does 3 things:
1. Adds a PRIVATE _safe_softmax op
2. Updates semantic for flash_cpu fused kernel
3. Updates semantic for efficient_cuda fused kernel

_safe_softmax is not supposed to be used generically and is only meant to be used within the context of SDPA. Due to this fact, instead of decomposing softmax and checking for -inf rows, we "cheat" and use nan_to_num.
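
A minimal eager-mode sketch of that "cheat" (an illustration of the described semantics, not the fused kernels' code):

```python
import torch

def safe_softmax_sketch(scores: torch.Tensor, dim: int = -1) -> torch.Tensor:
    # Plain softmax: a fully masked (-inf) row produces NaNs...
    out = torch.softmax(scores, dim=dim)
    # ...which nan_to_num maps to the desired out[masked_out_rows] = 0.
    return torch.nan_to_num(out, nan=0.0)

row = torch.tensor([float("-inf")] * 4)
print(safe_softmax_sketch(row))  # tensor([0., 0., 0., 0.]) instead of NaNs
```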

Why do I think this is okay? (Please find a counterpoint if available.)
There are multiple ways NaNs can emerge. For the fully-masked-out-rows case, nan_to_num works. But what if there were other NaNs; wouldn't this silently remove them?

The only case that this can happen is if the input itself had a NaN or an Inf
For example:
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = torch.finfo(torch.float16).max
print(a.softmax(-1))
```
Will return
`tensor([0., 1., 0., 0.], dtype=torch.float16)`

Where
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = float("inf")
a.softmax(-1)
```
returns:
`tensor([nan, nan, nan, nan], dtype=torch.float16)`

If we don't want to even allow for the possibility of "inf" or "NaN" attention scores being converted to 0, then we can implement it something like this:

```Python
max = torch.max(a, dim=-1, keepdim=True)
exp = torch.exp(a - max.values)
denom = torch.sum(exp, dim=-1, keepdim=True)
softmax = exp / denom
softmax = torch.where(max.values == float('-inf'), 0.0, softmax)
```
however we would be paying for this in math performance.

## Why Now
I think one point that has substantially changed where PyTorch should lie on this argument is the fact that we have fused implementations for SDPA now. And these fused implementations allow us to easily and performantly support this new semantic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131060
Approved by: https://github.com/jbschlosser
2024-08-08 23:09:38 +00:00
1f66487c69 [BE] Reroute all uses of proxy_tensor.maybe_disable_fake_tensor_mode to fake_tensor.unset_fake_temporarily (#132770)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132770
Approved by: https://github.com/bdhirsh
2024-08-08 23:07:23 +00:00
f25df31008 TunableOp more unit test follow-up (#130065)
More unit tests for preventing TunableOp regressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130065
Approved by: https://github.com/jeffdaily, https://github.com/malfet
2024-08-08 22:42:16 +00:00
3d0de6e1cd [Inductor] Add config option to force higher-dimensional tiling (#132937)
Fixes #125077

**Feature**

This PR creates a new Inductor config, `config.triton.prefer_nd_tiling`, which is disabled by default. When enabled, this encourages the Triton code to use as many tiling dimensions as possible. This simplifies indexing expressions for discontiguous tensors, resulting in expressions like `5 * x + 8 * y` as opposed to `5 * (x // 7) + 8 * (y % 9)`. This allows us to find more block pointers than we normally would. We should now see simplified indexing expressions as long as:
 1. All discontiguous reads/writes have the same shape.
 2. The number of discontiguous dimensions is less than `config.triton.max_tiles`.

 Here's an example kernel (elementwise add of views) with ND tiling disabled:
 ```
 @triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, xnumel, XBLOCK : tl.constexpr):
    xnumel = 21
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:]
    xmask = xindex < xnumel
    x0 = xindex % 7
    x1 = (xindex // 7)
    x2 = xindex
    tmp0 = tl.load(in_ptr0 + (x0 + (9*x1)), xmask)
    tmp1 = tl.load(in_ptr1 + (x0 + (9*x1)), xmask)
    tmp2 = tmp0 + tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[21], strides=[1], block_shape=[XBLOCK], order=[0], offsets=[xoffset]), tl.broadcast_to(tmp2, [XBLOCK]).to(tl.float32), boundary_check=[0])
''', device_str='cuda')
 ```

 And here's the version with it enabled:
 ```
 @triton.jit
def triton_(in_ptr0, in_ptr1, out_ptr0, ynumel, xnumel, YBLOCK : tl.constexpr, XBLOCK : tl.constexpr):
    ynumel = 3
    xnumel = 7
    yoffset = tl.program_id(1) * YBLOCK
    yindex = yoffset + tl.arange(0, YBLOCK)[None, :]
    ymask = yindex < ynumel
    xoffset = tl.program_id(0) * XBLOCK
    xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
    xmask = xindex < xnumel
    x1 = xindex
    y0 = yindex
    tmp0 = tl.load(tl.make_block_ptr(in_ptr0, shape=[7, 3], strides=[1, 9], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), boundary_check=[0, 1], eviction_policy='evict_last')
    tmp1 = tl.load(tl.make_block_ptr(in_ptr1, shape=[7, 3], strides=[1, 9], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), boundary_check=[0, 1], eviction_policy='evict_last')
    tmp2 = tmp0 + tmp1
    tl.store(tl.make_block_ptr(out_ptr0, shape=[7, 3], strides=[1, 7], block_shape=[XBLOCK, YBLOCK], order=[1, 0], offsets=[xoffset, yoffset]), tl.broadcast_to(tmp2, [XBLOCK, YBLOCK]).to(tl.float32), boundary_check=[0, 1])
''', device_str='cuda')
 ```

 With this feature enabled, we get a discontiguous strided block pointer. Previously, this would only have worked for specific shapes, like powers of 2 or multiples of the maximum block size. With this PR, we can support arbitrary shapes so long as we have enough tiles to cover all discontiguous dimensions.
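
For completeness, a short usage sketch (the config path is taken from this summary; treat it as an assumption, and the example shapes are illustrative):

```python
import torch
import torch._inductor.config as inductor_config

inductor_config.triton.prefer_nd_tiling = True  # opt in to the extra tiling dimensions

def add_views(a, b):
    # Discontiguous strided views exercise the multi-dimensional tiling path.
    return a[:, ::3] + b[:, ::3]

compiled = torch.compile(add_views)
out = compiled(torch.randn(9, 21, device="cuda"), torch.randn(9, 21, device="cuda"))
```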

**Test plan**

This PR adds some tests for pointwise ops with discontiguous tensors.
 - Test that we can generate block pointers for views with odd shapes like `(5,7)`, `(9,3,5)`, etc.
 - Test that we can generate block pointers for a single discontiguous dim in 3D and 4D tensors.
 - Test that we generate a 2D tiling for a 5D tensor with two discontiguous dims. This case doesn't generate a block pointer, but it checks that the output code is at least correct.

This PR also parametrizes some existing tests to run with and without `triton.prefer_nd_tiling`. That way, we ensure this feature doesn't break existing usage.

Since this setting isn't enabled on most tests, I also created https://github.com/pytorch/pytorch/pull/132935 to test what happens when `triton.prefer_nd_tiling=True` by default. None of the failures seem related to invalid tiling, so I think this feature is safe to merge.

**Limitations and follow-ups**

I can see two main improvements which would expand the usefulness of this feature:

1. This feature currently only works for pointwise kernels, since reductions are never tiled. As a follow-up, we could enable tiled reductions to extend these benefits to reduction kernels.

2. The usefulness of this feature depends on `triton.config.max_tiles`. This is currently restricted to 2 by default, although it can be increased to 3 in certain cases. To support more discontiguous dims, we might consider expanding support for 3D tiling, or even supporting ND tiling, by mapping an ND "virtual" launch grid onto Triton's 3D launch grid.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132937
Approved by: https://github.com/jansel, https://github.com/eellison
2024-08-08 22:11:56 +00:00
8707c6dfac added persistent option to buffers and namedbuffers (#132994)
Fixes #85235

Alternative to PR https://github.com/pytorch/pytorch/pull/129655, implements 3-valued option (None or bool).

- adds a keyword-only argument `persistent: Optional[bool] = None` to `nn.Module.buffers` (filtering semantics sketched below)
- updated docstrings slightly.
- added test.
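
A minimal sketch of the 3-valued filtering described above (not the PR's code; it leans on the private `_buffers` and `_non_persistent_buffers_set` attributes purely for illustration):

```python
from typing import Optional
import torch
import torch.nn as nn

def named_buffers_filtered(module: nn.Module, persistent: Optional[bool] = None):
    # persistent=None -> all buffers, True -> persistent only, False -> non-persistent only.
    for mod_name, submodule in module.named_modules():
        for buf_name, buf in submodule._buffers.items():
            if buf is None:
                continue
            is_persistent = buf_name not in submodule._non_persistent_buffers_set
            if persistent is None or persistent == is_persistent:
                yield (f"{mod_name}.{buf_name}" if mod_name else buf_name), buf

m = nn.Module()
m.register_buffer("running", torch.zeros(3))                    # persistent
m.register_buffer("scratch", torch.zeros(3), persistent=False)  # non-persistent
print([n for n, _ in named_buffers_filtered(m, persistent=True)])   # ['running']
print([n for n, _ in named_buffers_filtered(m, persistent=False)])  # ['scratch']
```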

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132994
Approved by: https://github.com/mikaylagawarecki
2024-08-08 21:39:01 +00:00
9cca0494b9 [ROCm] TunableOp logging improvements (#132173)
Summary:
TunableOp logging improvements:
1. PYTORCH_TUNABLEOP_VERBOSE=1: print out the expected value vs actual value for TunableOp validators, so that if validation fails, we know exactly how to fix it
2. PYTORCH_TUNABLEOP_VERBOSE=3: print out the exact kernel signature for both successful and failure cases in kernel lookup

Test Plan:
> PYTORCH_TUNABLEOP_VERBOSE=3 buck
2 run mode/{opt,amd-gpu} -c fbcode.enable_gpu_sections=true //scripts/xdwang/example:fc_llama -- --enab
le-tuning

```
reading tuning results from hipblas_tuning_pt_llama0.csv
Validator PT_VERSION=2.5.0
Validator ROCBLAS_VERSION=4.0.0-72e57364-dirty
Validator HIPBLASLT_VERSION=800-a15e4178
Validator ROCM_VERSION=6.0.0.0-12969-1544e39
Validator GCN_ARCH_NAME=gfx942:sramecc+:xnack-
GCN_ARCH_NAME validation: expect gfx942:sramecc+:xnack- to match gfx942:sramecc+:xnack-
ROCM_VERSION validation: expect 6.0.0.0-12969-1544e39 to match 6.0.0.0-12969-1544e39
HIPBLASLT_VERSION validation: expect 800-a15e4178 to match 800-a15e4178
ROCBLAS_VERSION validation: expect 4.0.0-72e57364-dirty to match 4.0.0-72e57364-dirty
PT_VERSION validation: expect 2.5.0 to match 2.5.0
Loading results
GemmTunableOp_BFloat16_TN(tn_8192_2_1024) -> Gemm_Hipblaslt_TN_61169,0.0171694
GemmTunableOp_BFloat16_TN(tn_7168_2_8192) -> Gemm_Hipblaslt_TN_61089,0.036138
GemmTunableOp_BFloat16_TN(tn_8192_2_3584) -> Gemm_Hipblaslt_TN_61169,0.0240673
missing params_signature, returning null ResultEntry for GemmTunableOp_BFloat16_TN,tn_1280_2_8192
finding fastest for GemmTunableOp_BFloat16_TN(tn_1280_2_8192) out of 2818 candidates
Rotating buffer 4 MiB. Needed Size: 20 MiB. Needed number of param copies: 1
├──tuning using warmup iters 0 [0 ms] and tuning iters 1 [0.208254 ms] instance id=0, GemmTunableOp_BFloat16_TN(tn_1280_2_8192) Default
├──offset at 3
......
ResultEntry found for GemmTunableOp_BFloat16_TN,tn_8192_2_3584
ResultEntry found for GemmTunableOp_BFloat16_TN,tn_8192_2_3584
ResultEntry found for GemmTunableOp_BFloat16_TN,tn_8192_2_3584
Avg time: 16.42832040786743 us, Achieved 7.15 TFLOPS, 3578.07 GB/s

2x1280x8192-torch.bfloat16,16.260499954223633,2.5794434438103107,1294.0669757533708
2x8192x1024-torch.bfloat16,16.15394949913025,2.0771658350056508,1041.11852032876
2x7168x8192-torch.bfloat16,25.691540241241455,9.14234887416194,4574.841325057144
2x8192x3584-torch.bfloat16,16.42832040786743,7.1486621324818085,3578.0709494714856
```

Differential Revision: D60468273

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132173
Approved by: https://github.com/mxz297, https://github.com/jeffdaily
2024-08-08 21:24:16 +00:00
cd30861857 [PT2][Optimus] Update unbind_cat_to_view pass to include more complicated cases (#132831)
Summary: We found that recent CMF and IGCTR have more complicated patterns to optimize in order to remove as many stack/cat nodes as possible, so we designed such patterns.

Test Plan:
# unit test

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```
Test UI: https://www.internalfb.com/intern/testinfra/testrun/3659174939423652
Network: Up: 113KiB  Down: 112KiB  (reSessionID-11c9b598-af3a-4727-8f02-ccb1471d092b)
Jobs completed: 27. Time elapsed: 5:45.8s.
Cache hits: 0%. Commands: 2 (cached: 0, remote: 0, local: 2)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0

# benchmark

### cmf
```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "cmf_shrink" --flow_id 587303213 -n
```
P1515072258

Counter({'pattern_matcher_nodes': 2170, 'pattern_matcher_count': 1766, 'normalization_pass': 402, 'remove_split_with_size_one_pass': 269, 'extern_calls': 193, 'merge_splits_pass': 74, 'normalization_aten_pass': 51, 'fxgraph_cache_miss': 9, 'batch_aten_mul': 6, 'scmerge_split_sections_removed': 5, 'scmerge_split_removed': 3, 'scmerge_cat_removed': 3, 'unbind_stack_pass': 3, 'batch_sigmoid': 2, 'batch_linear': 2, 'batch_aten_sub': 2, 'batch_layernorm': 1, 'scmerge_split_added': 1, 'scmerge_cat_added': 1, 'split_stack_to_cats_pass': 1, 'split_cat_to_slices_pass': 1, 'batch_aten_add': 1, 'batch_relu': 1})

### ig_ctr

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ig_ctr" --flow_id 584880697 -n
```
P1515087739

Counter({'pattern_matcher_nodes': 1832, 'pattern_matcher_count': 1564, 'extern_calls': 378, 'normalization_pass': 345, 'normalization_aten_pass': 49, 'fxgraph_cache_miss': 18, 'batch_aten_mul': 6, 'scmerge_cat_removed': 5, 'scmerge_cat_added': 4, 'batch_linear_post_grad': 4, 'scmerge_split_removed': 3, 'unbind_stack_pass': 3, 'unbind_cat_to_view_pass': 3, 'batch_tanh': 2, 'scmerge_split_sections_removed': 2, 'scmerge_split_added': 2, 'split_stack_to_cats_pass': 2, 'split_cat_to_slices_pass': 1})

# e2e

testing the following new patterns
```
                "split_stack_to_cats_pass": {},
                "split_cat_to_slices_pass": {},
                "unbind_cat_to_view_pass": {},
```
Note that you can tune the hyper-parameter "threshold_to_cat" for these patterns; the minimum value you give should be at least 2. The larger the value, the less aggressively nodes are sliced (keeping the cat instead); the default value is 10. You can tune the parameters by setting threshold_to_cat. For example:

```
"split_stack_to_cats_pass": {"threshold_to_cat": 10},
"split_cat_to_slices_pass": {"threshold_to_cat": 10},
"unbind_cat_to_view_pass": {"threshold_to_cat": 10},
```

Note that the default value may not be optimal; it's based on my experiments on CMF and IGCTR, and you are more than welcome to tune the value to find the best threshold for you. For example, in the cmf local run:
- when "threshold_to_cat" is 2
P1515072258
=============Print full analysis for cmf_shrink================
| Metric             | Value           |
|:-------------------|:----------------|
| Batch size         | 10              |
| Latency            | 156.07 ms       |
| Model size         | 844357184 bytes |
| Flops/example      | 583.53 G        |
| TFLOPS             | 37.39           |
| MFU                | 4.67%           |
| Activation/example | 1707.49 MB      |

- when "threshold_to_cat" is 10
P1515912635
=============Print full analysis for cmf_shrink================
| Metric             | Value           |
|:-------------------|:----------------|
| Batch size         | 10              |
| Latency            | 155.09 ms       |
| Model size         | 844357184 bytes |
| Flops/example      | 583.53 G        |
| TFLOPS             | 37.63           |
| MFU                | 4.70%           |
| Activation/example | 1707.49 MB      |

ads_dper3:164562cbe29f6c5aea4546cf3d463b87
training_platform:5e455c643c52940bb4567017f4c7ba83

## cmf
baseline
f588717948
proposal
f588719502

### QPS and NE results
{F1793304642}
{F1793304664}
{F1793304689}
{F1793304683}

### Compilation time reduction

zoomer link: https://www.internalfb.com/intern/zoomer/?profiling_run_fbid=1045728747213538&tab=pt2_metrics

Compile time for that frame is reduced to 1 min from 9 min.

### trace analysis
baseline trace link
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2Ff588722004-TrainingApplication%2F0%2Frank-1.Aug_06_00_03_46.3617.pt.trace.json.gz&bucket=pyper_traces

proposal trace link
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree%2Ftraces%2Fdynocli%2Ff588723545-TrainingApplication%2F0%2Frank-1.Aug_05_23_54_56.3647.pt.trace.json.gz&bucket=pyper_traces

{F1793312804} {F1793312867}

From the trace, we can see that the green part (introduced by split cat) has been reduced significantly with our new patterns.

Differential Revision: D60750275

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132831
Approved by: https://github.com/jackiexu1992
2024-08-08 21:18:01 +00:00
40767e8468 [BE] rename testHelperPrefix test (#132916)
Summary:
Re-enable testHelperPrefix test that was erroneously disabled in CI.
Fixes #50701

Test Plan:
Test passes locally:
```
❯ ./TCPStoreTest --gtest_filter=TCPStoreTest.testHelperPrefix
Running main() from
/data/users/cpio/pytorch/third_party/googletest/googletest/src/gtest_main.cc
Note: Google Test filter = TCPStoreTest.testHelperPrefix
[==========] Running 1 test from 1 test suite.
[----------] Global test environment set-up.
[----------] 1 test from TCPStoreTest
[ RUN      ] TCPStoreTest.testHelperPrefix
[W807 12:01:31.531576727 socket.cpp:462] [c10d] waitForInput: poll for
socket SocketImpl(fd=6, addr=[localhost]:37984,
remote=[localhost]:37171) returned 0, likely a timeout
[W807 12:01:31.531663710 socket.cpp:487] [c10d] waitForInput: socket
SocketImpl(fd=6, addr=[localhost]:37984, remote=[localhost]:37171) timed
out after 100ms
[       OK ] TCPStoreTest.testHelperPrefix (314 ms)
[----------] 1 test from TCPStoreTest (314 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test suite ran. (314 ms total)
[  PASSED  ] 1 test.
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132916
Approved by: https://github.com/Skylion007
2024-08-08 20:54:52 +00:00
7bd0732cbd Fix flaky internal mixed_mm tests (#133015)
This PR fixes flaky internal tests:
- The AutoHeuristic test was sometimes failing because it required autotuning to happen for mixed_mm which didn't end up happening when there was a fx graph cache hit.
- The tests inside pattern_matcher failed because in some cases pad_mm decided to pad, which made the mixed_mm pattern no longer match (instead of cast -> mm, it became cast -> pad -> mm); the tests also fail when is_big_gpu is false (which I haven't found an explanation for).

Differential Revision: [D60972176](https://our.internmc.facebook.com/intern/diff/D60972176)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133015
Approved by: https://github.com/Chillee, https://github.com/eellison
2024-08-08 20:32:12 +00:00
a9954d22f8 Raise exception if torch.func.* calls torch.compile functions (#128736)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128736
Approved by: https://github.com/zou3519
2024-08-08 20:21:44 +00:00
b845068db2 [dtensor] refactor examples folder (#132914)
as titled:

1. remove checkpoint example as it's not maintained
2. refactor convnext example to use torchrun
3. refactor comm mode feature example to sit in one file

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132914
Approved by: https://github.com/wz337
2024-08-08 20:03:14 +00:00
c326533999 [ROCm][Inductor] Enable AOT Inductor CPP UTs for ROCm (#131521)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131521
Approved by: https://github.com/jataylo, https://github.com/pruthvistony, https://github.com/malfet
2024-08-08 19:49:56 +00:00
de288e2203 Fix inf value reduction in non persistent reduction for scans (#132293)
Fixes https://github.com/pytorch/pytorch/issues/132107

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132293
Approved by: https://github.com/peterbell10
2024-08-08 19:02:32 +00:00
322c9d03a0 [FSDP][dtensor] use _StridedShard to represent nested sharding for correct full_tensor() result (#130760)
Fixes issue #129229 #129206
**Summary**

1. Have `FSDP` choose `_StridedShard` placement for FSDP+TP sharding
2. Added a parity test to FSDP to ensure that FSDP+TP sharding (i.e. strided) and plain TP sharding (i.e. non-strided) have the same `full_tensor()` result
3. Re-enabled the tests that were disabled in #129519

**test**
`pytest test/distributed/_composable/fsdp/`
`pytest test/distributed/_composable/test_composability/test_2d_composability.py`
`pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py`

Differential Revision: [D60606114](https://our.internmc.facebook.com/intern/diff/D60606114)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130760
Approved by: https://github.com/wanchaol, https://github.com/fegin, https://github.com/wz337
ghstack dependencies: #126697, #130239, #132391, #131408
2024-08-08 18:15:29 +00:00
21906ddaba [AOTI] Fix complex64 not defined (#132810)
Partially fixes #122980

- change cpp type mapping for complex64 to std::complex<float>
- add `aoti_torch_item_complex64` and `aoti_torch_scalar_to_tensor_complex64`.
- add `expensiveCopyToTensor()` to convert `ArrayRefTensor<T>` type to `AtenTensorHandle` type.

- if we want to fully fix #122980, we still need to let ArrayRef and MiniArrayRef to consider underlying storage number of elements. See more details in https://github.com/pytorch/pytorch/pull/132347 (#132347  broke some internal tests, so we need more work before landing it).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132810
Approved by: https://github.com/desertfire
2024-08-08 18:08:23 +00:00
ac95b2a2f2 Migrate slow self-hosted jobs to Amazon2023 AMI (#131771)
A continuation of the migration started in
- https://github.com/pytorch/pytorch/pull/131250

(for tracking: signal on Aug 6: https://hud.pytorch.org/pytorch/pytorch/pull/131771?sha=38bc4755567527fad5279203ddef534ac132ea94)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131771
Approved by: https://github.com/seemethere
2024-08-08 17:33:57 +00:00
75eb66afc0 Support 'non-contiguous with holes' NJTs for contiguous clone() (#132776)
It's possible to construct an NJT with "holes" by specifying both `offsets` and `lengths` metadata. When `nt.clone(memory_format=torch.contiguous_format)` is called on such an NJT, the result should be an NJT without holes.

This PR fixes this in a simplistic way using `unbind()`, which isn't really supported in `torch.compile`. The longer-term solution involves writing a proper kernel to support this.

NB: Another limitation is that the returned NJT does not have the same ragged structure as the input. While we could manually hack the nested int registry (or update the union find when that lands), this is the first instance where a NJT with holes and an NJT without holes could have the same ragged structure, and getting those to play nicely together requires some fairly involved updates. For now, this PR punts on these updates until we can clean this up.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132776
Approved by: https://github.com/ani300, https://github.com/soulitzer
ghstack dependencies: #131898, #131704, #131937
2024-08-08 17:08:11 +00:00
3ec9ec03a8 Revert "[pipelining] Add schedule runtime for lowered schedule (#130488)"
This reverts commit b73d4b6555dd6b5a39d70d741099b83190eb31f0.

Reverted https://github.com/pytorch/pytorch/pull/130488 on behalf of https://github.com/PaliC due to breaking distributed tests internally (that should be running in OSS) ([comment](https://github.com/pytorch/pytorch/pull/130488#issuecomment-2276266909))
2024-08-08 16:57:50 +00:00
942ffd1b2d Make the __module__ name of HOO to be always "torch.ops.higher_order" (#132775)
Summary: It seems that we can just make this the default so that in the future all the ops printed in the graph should be like torch.ops.higher_order

Test Plan: CI

Differential Revision: D60530900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132775
Approved by: https://github.com/ydwu4, https://github.com/zou3519
2024-08-08 16:55:09 +00:00
eeb6ad0744 [quant] Speed up dequantize_per_channel (#132828)
Tensor-wise operations are much faster than looping over tensor elements. Rewrite the loop in dequantize_per_channel to use whole-Tensor operations accordingly.
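
A minimal sketch of the idea (illustrative only, not the actual kernel; it assumes int8 data with per-channel scales and zero points along dim 0):

```python
import torch

def dequantize_per_channel_vectorized(qdata, scales, zero_points, axis=0):
    # Reshape scales/zero_points so they broadcast along the channel axis.
    shape = [1] * qdata.dim()
    shape[axis] = qdata.size(axis)
    scales = scales.reshape(shape)
    zero_points = zero_points.reshape(shape)
    # One broadcasted expression replaces the per-element Python loop.
    return (qdata.to(torch.float32) - zero_points.to(torch.float32)) * scales

q = torch.randint(-128, 127, (4, 8), dtype=torch.int8)
out = dequantize_per_channel_vectorized(q, torch.rand(4), torch.zeros(4))
```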

Differential Revision: [D60871396](https://our.internmc.facebook.com/intern/diff/D60871396/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132828
Approved by: https://github.com/cccclai
2024-08-08 16:44:41 +00:00
dfc5bb0099 Login to Meta's ECR when using non-meta runner (#132870)
The project depends on fetching container images from Meta's ECR repo, so when running on non-Meta runners we need to ensure that we also log in to Meta's ECR.

Closes pytorch/ci-infra#252.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132870
Approved by: https://github.com/ZainRizvi
2024-08-08 16:34:46 +00:00
4a4dc9d6d9 [inductor] Disable remote caching in failing test_cpu_repro tests (#132955)
Summary: These tests are failing stress tests internally because of remote caching. Most already have local cache disabled; disable remote cache as well

Test Plan: Ran stress tests locally for each of the affected tests

Differential Revision: D60940081

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132955
Approved by: https://github.com/leslie-fang-intel
2024-08-08 16:20:56 +00:00
9d5c85c499 Move exir.delegate to PyTorch core to enforce no out-of-tree HOPs (#132525)
Summary: When HOPs live out of tree, it makes it impossible to make breaking changes to the HOP API. But HOP implementations are deeply entwined with PyTorch internals. Move the HOP into PyTorch tree so that changes are possible.

Test Plan: sandcastle, ossci

Differential Revision: D60674615

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132525
Approved by: https://github.com/zou3519, https://github.com/Skylion007
2024-08-08 16:06:56 +00:00
4ee5547b37 [triton_op] Skip HOP dispatch when possible (#132822)
The capture_triton decorator returns a function that goes through the
triton kernel wrapper HOP. This is useful for make_fx tracing and
non-strict export. However, the HOP dispatch is slow (~1ms) and not
necessary in certain situations.

This PR skips going through the HOP dispatch for any
capture_triton-wrapped triton kernels that are registered as
implementations to a `@triton_op` custom operator. We do this by
creating a new thread-local flag that controls if the
capture_triton-wrapped triton kernel goes through HOP dispatch or not.
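
The thread-local-toggle pattern, sketched generically (the names below are illustrative, not the actual PyTorch internals):

```python
import threading

_state = threading.local()

def set_capture_triton_hop_dispatch(enabled: bool) -> None:
    _state.hop_dispatch = enabled

def capture_triton_hop_dispatch_enabled() -> bool:
    # Defaults to True so behavior is unchanged unless explicitly disabled.
    return getattr(_state, "hop_dispatch", True)
```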

Test Plan:
- new test and existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132822
Approved by: https://github.com/SherlockNoMad
2024-08-08 15:56:40 +00:00
b885ad8fce Revert "[Inductor][CPP] Turns on inline_inbuilt_nn_modules for CPP GEMM template testing (#132487)"
This reverts commit 73c083e02cb6093bb3adf06b7ccdf5c4a2e7591c.

Reverted https://github.com/pytorch/pytorch/pull/132487 on behalf of https://github.com/PaliC due to this pr is breaking inductor tests internally ([comment](https://github.com/pytorch/pytorch/pull/132487#issuecomment-2276142742))
2024-08-08 15:47:04 +00:00
0ca8f66e3a [NestedTensor] Modify softmax on ragged dimension to allow for 2D nested tensors (#132812)
Summary:
Modify `softmax` on the ragged dimension, where `ragged_idx == 1`, to allow for 2D nested tensors. This diff now enables a `softmax` operation on tensors of shape `(B, *)`, where `*` is the ragged dimension.
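
A hedged example of what this enables (assuming this diff is applied; the ragged dim `*` is dim 1 of the jagged-layout tensor):

```python
import torch

# 2D jagged nested tensor of shape (B, *): two rows of lengths 3 and 5.
nt = torch.nested.nested_tensor(
    [torch.randn(3), torch.randn(5)], layout=torch.jagged
)
out = torch.nn.functional.softmax(nt, dim=1)  # softmax over the ragged dim
```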

Extend existing `softmax` unit tests to include 2D nested tensors using the `include_2d_tensor=True` keyword argument.

Test Plan:
Verify that existing and modified unit tests pass using the following commands:

```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_softmax
```

```
buck2 run mode/{opt,inplace} //caffe2/test:nested -- --regex test_jagged_op
```

Reviewed By: davidberard98

Differential Revision: D60780975

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132812
Approved by: https://github.com/davidberard98
2024-08-08 15:41:28 +00:00
c4071c4707 Remove noqa: G004 warnings (#132917)
Remove logging messages with f-strings (G004), https://docs.astral.sh/ruff/rules/logging-f-string/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132917
Approved by: https://github.com/Skylion007, https://github.com/c-p-i-o, https://github.com/fduwjj, https://github.com/fegin
ghstack dependencies: #132888
2024-08-08 15:18:53 +00:00
9db5bfccdc [inductor] disable test_torchinductor failed UTs on Windows (#132973)
Disable failing UTs of `test/inductor/test_torchinductor.py` on Windows.

**TODO:**
Debug and enable these UTs after CI is ready.

Local test:
<img width="857" alt="image" src="https://github.com/user-attachments/assets/3d9da274-f147-474e-92f1-a6d3ed8aa003">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132973
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-08-08 14:56:10 +00:00
51ddcde110 [BE] Introduces runner variants for amzn2023 to simplify lf-scale-config.yml and lf-canary-scale-config.yml (#132918)
Depends on https://github.com/pytorch/test-infra/pull/5541 to be deployed on LF and Meta infra

Test for this changes are in this PR: https://github.com/pytorch/test-infra/pull/5542
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132918
Approved by: https://github.com/zxiiro, https://github.com/ZainRizvi
2024-08-08 14:38:34 +00:00
6f99e97f0a Revert "[ts-migration]: Support quantized operation transformation (#131915)"
This reverts commit 0e8541766fe5ed58c54aa530eee8e34832539199.

Reverted https://github.com/pytorch/pytorch/pull/131915 on behalf of https://github.com/ezyang due to test broken on windows 0e8541766f ([comment](https://github.com/pytorch/pytorch/pull/131915#issuecomment-2275974907))
2024-08-08 14:30:35 +00:00
42cd397a0e Loads .pyd instead of .so in MemPool test for windows (#132749)
Fixes #132650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132749
Approved by: https://github.com/albanD
2024-08-08 14:29:56 +00:00
d1f73fd844 Revert "[BE] Reroute all uses of proxy_tensor.maybe_disable_fake_tensor_mode to fake_tensor.unset_fake_temporarily (#132770)"
This reverts commit 902c6f3a191fb2ecb1976895b3e9eaae4b257b89.

Reverted https://github.com/pytorch/pytorch/pull/132770 on behalf of https://github.com/ezyang due to Removed API was recommitted ([comment](https://github.com/pytorch/pytorch/pull/132770#issuecomment-2275749689))
2024-08-08 12:54:34 +00:00
902c6f3a19 [BE] Reroute all uses of proxy_tensor.maybe_disable_fake_tensor_mode to fake_tensor.unset_fake_temporarily (#132770)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132770
Approved by: https://github.com/bdhirsh
ghstack dependencies: #132674, #132675, #132421, #132062, #132767, #132769
2024-08-08 12:03:25 +00:00
0e43175e22 [BE] Get rid of unnecessary inner_torch_dispatch method (#132769)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132769
Approved by: https://github.com/Skylion007, https://github.com/bdhirsh
ghstack dependencies: #132674, #132675, #132421, #132062, #132767
2024-08-08 12:03:25 +00:00
35fd4391bc Format torch.fx.experimental.proxy_tensor.py (#132767)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132767
Approved by: https://github.com/bdhirsh
ghstack dependencies: #132674, #132675, #132421, #132062
2024-08-08 12:03:18 +00:00
b4e2411f6f Big enough count to trigger stack overflow (#132062)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132062
Approved by: https://github.com/bdhirsh
ghstack dependencies: #132674, #132675, #132421
2024-08-08 12:03:12 +00:00
aec6332356 Only thunkify proxies in some situations (#132421)
The goal of this PR is to avoid stack overflow when we create extremely long chains of thunks, and then evaluate them (e.g., as occurs if you sum(long list of symint)). The basic idea behind this PR is to only thunkify proxies if they're being created in places where they may or may not be used--crucially, symint operations that occur in user code we are tracing are eagerly placed into the graph, even if they may eventually be dead.
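
A toy illustration of the failure mode (plain Python, not PyTorch internals): forcing a long chain of lazily built thunks recurses once per link, while evaluating eagerly as you go does not.

```python
def lazy_sum(values):
    thunk = lambda: 0
    for v in values:
        thunk = (lambda prev, x: (lambda: prev() + x))(thunk, v)
    return thunk  # forcing this recurses len(values) frames deep

def eager_sum(values):
    total = 0
    for v in values:
        total += v  # evaluated immediately; no deep recursion
    return total

print(eager_sum(range(100_000)))   # fine
# lazy_sum(range(100_000))()       # RecursionError (the stack-overflow analogue)
```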

I annotated the PR with explanation of changes.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132421
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #132674, #132675
2024-08-08 12:03:06 +00:00
54efd43022 [BE] Simplify code interacting with get_proxy_mode/enable_tracing (#132675)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132675
Approved by: https://github.com/Skylion007, https://github.com/ydwu4, https://github.com/zou3519
ghstack dependencies: #132674
2024-08-08 12:03:00 +00:00
361db32d47 Consolidate SymDispatchMode into ProxyTensorMode (#132674)
Instead of having a separate context variable for SymDispatchMode, we
now simply delegate to the current active proxy tensor mode when we
need to trace a SymInt.  We maintain a separate `__sym_dispatch__` magic
method as the calling convention is different than `__torch_dispatch__`.

Consolidating the modes in this way means that we can consistently
disable both of these modes in tandem simply by removing the mode
from the proxy mode infra slot.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132674
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2024-08-08 12:02:54 +00:00
0f19d4150b Revert "[inductor]a less ambitious way to slove the scalar tensor (#132702)"
This reverts commit b483ca05a91f2876b0f1f5a435fa264f5467762d.

Reverted https://github.com/pytorch/pytorch/pull/132702 on behalf of https://github.com/ezyang due to breaks trunk jobs ([comment](https://github.com/pytorch/pytorch/pull/132702#issuecomment-2275642109))
2024-08-08 11:59:38 +00:00
ec49796b8f [Inductor] Support use_libdevice_for_f64 for pointwise ops on XPU, align with CUDA. (#132739)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132739
Approved by: https://github.com/malfet, https://github.com/EikanWang
2024-08-08 11:50:10 +00:00
24dee99cb7 Populate submodules of torch._C to sys.modules recursively (#132216)
See comment:

e9d1c26275/torch/__init__.py (L938-L950)

This PR recursively sets the submodules in the C extension to `sys.modules` (e.g., `_C._dynamo.eval_frame`).
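
A rough sketch of the recursive registration (illustrative; the real logic lives in torch/__init__.py and may differ in details):

```python
import sys
import types

def _register_submodules(mod: types.ModuleType, name: str) -> None:
    sys.modules.setdefault(name, mod)
    for attr in dir(mod):
        child = getattr(mod, attr)
        # Only descend into genuine submodules of this extension module.
        if isinstance(child, types.ModuleType) and child.__name__.startswith(name + "."):
            _register_submodules(child, f"{name}.{attr}")

# e.g. _register_submodules(torch._C, "torch._C")
```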

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132216
Approved by: https://github.com/ezyang
2024-08-08 10:20:25 +00:00
7f71f2a997 [dtensor] improve docs and comments (#132683)
as titled, fixed typos in various comments and improved the
public documentation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132683
Approved by: https://github.com/XilunWu
ghstack dependencies: #131210, #132682
2024-08-08 09:24:58 +00:00
9e37e73e01 [dtensor] refactor and improve readability of _dispatch.py (#132682)
as titled. It also changes some comments of _op_schema.py to make them
up to date

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132682
Approved by: https://github.com/XilunWu
ghstack dependencies: #131210
2024-08-08 09:24:58 +00:00
ac960dced1 Skip Reformer for Dynamic size testing (#132468)
**Summary**

As discussed in https://github.com/pytorch/pytorch/issues/132286, `Reformer` has specialized the batch size dim, which fails the API `mark_dynamic` 3a355c1891/torch/_dynamo/decorators.py (L228-L230)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132468
Approved by: https://github.com/ezyang
2024-08-08 08:25:53 +00:00
9c5e0d47fe Add xpu_cmake_macros.h to xpu build (#132847)
# Motivation

fix https://github.com/pytorch/pytorch/issues/132971

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132847
Approved by: https://github.com/EikanWang
2024-08-08 08:06:49 +00:00
751c744ad0 Optimize sort kernel for contiguous tensors (#132236)
Introduces an enhancement to SortingKernel.cpp for cases where both the values and indices tensors have stride 1, indicating contiguous memory layouts.

The changes include:
1. A new function `sort_kernel_impl`, encapsulating the core sorting logic for distinct types of tensor accessors.
2. Modifications to the `sort_kernel` function to utilize `sort_kernel_impl`. It now checks for tensor strides and optimally handles contiguous and non-contiguous tensor scenarios.
3. The optimization aims to improve cache locality and efficiency in memory access for contiguous tensor sorts.
4. Enhanced Code Readability and Structure: The restructuring of the sorting process improves clarity and maintenance by clearly defining how different stride scenarios are handled, making the code more transparent and easier to understand.

Tests have been conducted across various tensor sizes and shapes to ensure stability and reliability of the change.

The result of running the `test/test_sort_and_select.py` test suite is consistent between the main branch, and this modified branch.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132236
Approved by: https://github.com/jgong5
2024-08-08 07:01:25 +00:00
83e4af203d [dtensor] rewrite redistribute algorithm for multi-dim mesh (#131210)
As titled, this PR rewrites the current redistribute algorithm to make
the multi-mesh-dim redistribute logic more sound. The previous algorithm
works numerically but could incur additional unnecessary steps
when transforming shardings in a multi-dimension device mesh, i.e.

Let's say we want to transform from (S(1), S(1)) -> (S(1), S(2)). The
previous algorithm yield the following steps:

* mesh_dim 1: S(1) -> R, mesh_dim 0: S(1) -> R
* mesh_dim 0: R -> S(1), mesh_dim 1: R -> S(2)

Although it works semantically, it incurs two allgather
transformations, where it should really only incur an S(1) -> S(2) on
mesh dim 1.

The rewritten algorithm basically takes a more principled approach:

1. We check if src_spec has sharding; if not, we don't need to worry about the nested sharding case, as sharding would always be in order, so we just go from left to right in the placements and add the transform steps.
2. If src_spec has sharding, there could be either nested or mis-aligned shardings. So we first traverse from right to left to check whether there's a mis-aligned sharding as the above example showed; if there is, we replicate that mesh dimension so that it unshards the nested sharding.
3. We traverse again from left to right to generate the transformation
   after we unshard the nested sharding.

should also fix https://github.com/pytorch/pytorch/issues/132751

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131210
Approved by: https://github.com/tianyu-l
2024-08-08 06:50:30 +00:00
479d460471 [DeviceMesh] Add a private _flatten() API for device_mesh (#132632)
Adds a new private API to flatten a DeviceMesh to a 1D DeviceMesh such that:
```
mesh_3d = init_device_mesh(
    self.device_type, (2, 2, 2), mesh_dim_names=("dp", "cp", "tp"),
)

dp_cp_mesh = mesh_3d["dp", "cp"]
# flattened_mesh on rank 0, 2, 4, 6 is DeviceMesh([0, 2, 4, 6], mesh_dim_names=('dp_cp',))
# flattened_mesh on rank 1, 3, 5, 7 is DeviceMesh([1, 3, 5, 7], mesh_dim_names=('dp_cp',))
flattened_dp_cp_mesh = dp_cp_mesh._flatten()
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132632
Approved by: https://github.com/fegin, https://github.com/wanchaol
ghstack dependencies: #132310, #132311, #132339
2024-08-08 06:46:42 +00:00
0e8541766f [ts-migration]: Support quantized operation transformation (#131915)
#### Description
Transform quantized operation properly. Add de/quantization before and after the quantized operation.

#### Test Plan
`pytest test/export/test_converter.py -s -k test_ts2ep_convert_quantized_model`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131915
Approved by: https://github.com/angelayi
2024-08-08 06:34:53 +00:00
9e584d0c05 [BE] Test foreach optimizer for FSDP1 optimizer state_dict (#132933)
Summary:
When fixing https://github.com/pytorch/pytorch/issues/130810, we suspected that FSDP1 optimizer state_dict cannot handle the foreach optimizer, which is not correct. For FSDP1, whether the optimizer uses foreach or not does not matter. Since we already have tests for the non-foreach version of the optimizer, this PR changes the distributed state_dict tests for FSDP1 to use the foreach optimizer.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132933
Approved by: https://github.com/c-p-i-o
ghstack dependencies: #132908
2024-08-08 06:13:10 +00:00
a270800f0b [export][reland] Add print_readable to unflattened module (#132817)
Reland https://github.com/pytorch/pytorch/pull/128617

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132817
Approved by: https://github.com/pianpwk
2024-08-08 06:05:30 +00:00
745665d8b5 [BE] Using with_temp_dir for test_distributed_checkpoint (#132908)
Fixes https://github.com/pytorch/pytorch/issues/113936
Fixes https://github.com/pytorch/pytorch/issues/113937

The original way to broadcast the path seems to cause desync issues.  `with_temp_dir` has been used for other checkpoint related tests without problems. Change the tests to use `with_temp_dir`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132908
Approved by: https://github.com/awgu, https://github.com/Skylion007
2024-08-08 05:42:19 +00:00
aff48f7378 Autoselect default device in FSDP construction. (#127609)
There are still some differences between CUDA and non-CUDA custom devices when
constructing FSDP because CUDA is selected as the default device. For example,
when constructing FSDP from a CPU model and device_id is not passed, device_handle
will choose CUDA as the default device. This PR autoselects the real device
as the default device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127609
Approved by: https://github.com/awgu
2024-08-08 05:25:17 +00:00
4a1edbe475 Disable SymDispatchMode when torch.compile'ing (#132433)
Partially addresses https://github.com/pytorch/pytorch/issues/132417

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132433
Approved by: https://github.com/ydwu4
2024-08-08 05:02:43 +00:00
5ae979ab10 [Dynamo] Support torch.autograd._is_checkpoint_valid (#132611)
Hi, we got `torch._dynamo.exc.Unsupported: torch.* op returned non-Tensor bool call_function <function _is_checkpoint_valid at 0x7f0b0d22e290>` while tracing the activation [checkpointing function in deepspeed](324ee65cb0/deepspeed/runtime/activation_checkpointing/checkpointing.py (L630)). Consider adding it to the constant_folding list, similar to https://github.com/pytorch/pytorch/pull/126196

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132611
Approved by: https://github.com/anijain2305, https://github.com/williamwen42
2024-08-08 04:05:08 +00:00
4fd0d594a1 [sym_shapes] Not eval sym expression for printing storage_offset (#132911)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132911
Approved by: https://github.com/ezyang
2024-08-08 03:49:29 +00:00
b483ca05a9 [inductor]a less ambitious way to slove the scalar tensor (#132702)
Fixes #121374

The previous https://github.com/pytorch/pytorch/pull/131775 tried to convert the 0-dim CPU tensor to a DynamicScalar in the lowering stage, but too many lowering rules are incompatible with that approach. So this PR does the conversion in the codegen stage instead.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132702
Approved by: https://github.com/eellison
2024-08-08 03:42:21 +00:00
ac6398b630 [FSDP2] Follow-up fix to correct relaxed overlap test (#132953)
The previous PR forgot to include dummy all-gathers before backward, so the reference time was too short, causing the test to still fail.

I verified the test passes locally.

This should close https://github.com/pytorch/pytorch/issues/120961 (again).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132953
Approved by: https://github.com/weifengpy
ghstack dependencies: #132869
2024-08-08 03:24:46 +00:00
636a7c4859 [13/N] Use std::optional (#132527)
Follows #132361

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132527
Approved by: https://github.com/ezyang
2024-08-08 03:16:28 +00:00
fd874b799f [AOTI][refactor] Update MKLDNN ops cpp wrapper support (#132367)
Summary: Set op_overload for MKLDNN ops so that cpp_kernel_name and python_kernel_name are constructed from there. This is an important step towards support those MKLDNN ops in the ABI-compatible mode, because we will need to read schema from op_overload for generating correct fallback op call in C++.

Differential Revision: [D60909798](https://our.internmc.facebook.com/intern/diff/D60909798)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132367
Approved by: https://github.com/leslie-fang-intel, https://github.com/angelayi
2024-08-08 03:02:29 +00:00
c69b2d24e3 [dynamo] Support remove method of set (#132943)
Fixes https://github.com/pytorch/pytorch/issues/132800

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132943
Approved by: https://github.com/anijain2305
2024-08-08 02:43:19 +00:00
194ec49d27 [dynamo][lists][stable diffusion] Do not add source on list slice (#132912)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132912
Approved by: https://github.com/williamwen42
ghstack dependencies: #132806, #132899
2024-08-08 02:23:07 +00:00
45d0e90bd3 [export] Allow str outputs (#132808)
Summary: Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1478413606130179/

Test Plan: CI

Differential Revision: D60850712

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132808
Approved by: https://github.com/ydwu4
2024-08-08 02:20:59 +00:00
4ca616e6d4 Disable sparse tests in export (#132824)
Summary: Dynamo doesn't trace through sparse tensors in fbcode. So we should disable tests that run sparse tensors in export. We should do this to make the CI green internally.

Test Plan:
Before:
Tests finished: Pass 1409. Fail 71. Fatal 0. Skip 90. Build failure 0
After:
Tests finished: Pass 1408. Fail 0. Fatal 0. Skip 162. Build failure 0

Differential Revision: D60870543

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132824
Approved by: https://github.com/BoyuanFeng
2024-08-08 01:45:12 +00:00
fb6b001cde Disable expandable segments IPC in fbcode, because some jobs
seem to be failing. (#132890)


https://fb.workplace.com/groups/1405155842844877/permalink/8867182216642165/

Differential Revision: [D60912371](https://our.internmc.facebook.com/intern/diff/D60912371/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132890
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-08-08 01:42:32 +00:00
5709375d56 [AOTI][tooling][1/n] Add intermediate value debug printer (#132323)
Summary:
**Context:**

Currently we have a helper to print out AtenTensor in [shim_common.cpp](https://github.com/pytorch/pytorch/blob/v2.4.0-rc4/torch/csrc/inductor/aoti_torch/shim_common.cpp#L866)

The way we were using this function was a “manual” process: we injected this function into the generated output.cpp file, then recompiled and reloaded the file. This diff automates the value-printing process.

**Changes:**

1. Added a simple initial debug printer helper to print out tensor values

2. Added a filter option to selectively dump tensor values.

**Usage:**

Sample cmd :

```
AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor, +schedule, output_code"  python test/inductor/test_aot_inductor.py -k test_addmm_abi_compatible_cuda
```

Sample outputs :
```
[  before_launch - triton_poi_fused_0 - buf0  ]:
 0.6331
 1.6358
-0.3459
 1.0196
-0.4122
 1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0

[  after_launch - triton_poi_fused_0 - buf0  ]:
 0.6331
 1.6358
-0.3459
 1.0196
-0.4122
 1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0

[ before_launch - aoti_torch_cuda_addmm_out - buf1  ]:
Min value: -2.25655
Max value: 2.32996
Device: cuda:0
Size: [16, 6]
Stride: [6, 1]
Dtype: float
Layout: Strided
Number of elements: 96
Is contiguous: 1
Requires grad: 0

[  before_launch - aoti_torch_cuda_addmm_out - buf0  ]:
 0.6331
 1.6358
-0.3459
 1.0196
-0.4122
 1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0

[  after_launch - aoti_torch_cuda_addmm_out - buf1  ]:
Min value: -12.0839
Max value: 11.6878
Device: cuda:0
Size: [16, 6]
Stride: [6, 1]
Dtype: float
Layout: Strided
Number of elements: 96
Is contiguous: 1
Requires grad: 0

[  after_launch - aoti_torch_cuda_addmm_out - buf0  ]:
 0.6331
 1.6358
-0.3459
 1.0196
-0.4122
 1.4279
[ CUDAFloatType{6} ]
Min value: -0.412198
Max value: 1.63582
Device: cuda:0
Size: [6]
Stride: [1]
Dtype: float
Layout: Strided
Number of elements: 6
Is contiguous: 1
Requires grad: 0

stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('pattern_matcher_count', 2), ('pattern_matcher_nodes', 2), ('extern_calls', 2)]
.
----------------------------------------------------------------------
Ran 1 test in 10.867s

OK

```

The user can filter which kernels' values are printed by specifying the env var `AOT_INDUCTOR_FILTERED_KERNELS_TO_PRINT`, and can see the available kernel names in a log message like the one below:
```
torch/_inductor/graph.py:1642] Finished codegen for all nodes. The list of kernel names available: ['triton_poi_fused_0', 'aoti_torch_cuda_addmm_out']

```

In the follow-up diff, we will add `torch.save()` to dump the intermediate tensors into individual `.pt` files that can later be loaded with `torch.load()`.

Test Plan:
Run Unit Tests in OSS: (similar cmd as mentioned above in the usage part)

 `AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=1 TORCHINDUCTOR_FORCE_DISABLE_CACHES=1  TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCH_COMPILE_DEBUG=1 TORCH_LOGS="+inductor, output_code"  python test/inductor/test_aot_inductor.py -k test_addmm_abi_compatible_cuda`

Differential Revision: D60538496

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132323
Approved by: https://github.com/ColinPeppler
2024-08-08 01:39:59 +00:00
59f4725b49 [NJT] manually autocast in SDPA handling (#132835)
When autocasting is turned on, right now SDPA w/ NJT won't be autocasted. This PR adds manual "autocasting" logic in sdpa.py - at the beginning, it just checks if autocasting is enabled, and if so, it casts the inputs in the way you would expect if autocasting was actually running.
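
Roughly, the added check looks like this (a simplified sketch, not the exact code in sdpa.py):

```python
import torch

def _maybe_autocast_inputs(*tensors):
    # Mirror what autocast would have done to the SDPA inputs.
    if not torch.is_autocast_enabled():
        return tensors
    dtype = torch.get_autocast_gpu_dtype()  # e.g. torch.bfloat16 / torch.float16
    return tuple(t.to(dtype) if t.is_floating_point() else t for t in tensors)
```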

Why normal autocasting won't work:
* NJT intercepts the `__torch_function__` call for scaled_dot_product_attention, which, AFAIK, happens before we get to any dispatcher logic, and then calls efficient attention or flash attention. So autocasting the scaled_dot_product_attention op won't work; we never call the aten op for scaled_dot_product_attention, so we won't ever run autocasting for it.
* If we try to add autocasting handling for `_flash_attention_forward` or `_efficient_attention_forward`, then autocasting will _run_, but it will have the wrong semantics: sdpa.py's handling will run first, and it will do backend selection based on the uncasted inputs to SDPA. This also means that if the inputs to the SDPA call don't have uniform types, the sdpa.py implementation will fail checks (this is the specific issue we're targeting).

Alternative: "just change the backend selection logic for NJT to be autocast aware, but don't actually do the autocast; then, add `_(flash|efficient)_attention_forward` to autocasting rules". I think this would work too. But it's arguably better to make the backend-selection logic and actual-autocast-behavior use the same implementation, in case the implementations are different.

Differential Revision: [D60879916](https://our.internmc.facebook.com/intern/diff/D60879916)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132835
Approved by: https://github.com/soulitzer
2024-08-08 01:36:57 +00:00
bbf568aac8 Split of "[reland] [export] fix zero arg export in training_ir and constant tensor handling" (#132307)
Summary:
A re-land of D60006710.
Fixed TrainingIRToRunDecomp failures for test_tensor_attribute_zero_args and also a few re-traceability failures, because run_decomposition does a retracing.

edit: also removed the eliminate_dead_code() call in _unlift because of one onnx test failure:
a constant tensor attr was lifted as a constant_tensor input but isn't used in the graph after aot_autograd due to a shortcut in its decomposition. This causes the setattr to be removed by eliminate_dead_code, but the graph signature still contains the name of that buffer, which causes an inconsistency between the transformed graph and the ep's original signature after _unlift. It seems that this has happened a few times, where some nodes are accidentally removed and we end up in an inconsistent state.
The alternative to removing it would be: every time we call eliminate_dead_code, verify the consistency of the graph with 1. the graph before the transformation and 2. all the metadata, but I think this deserves a complete design

edit 2: Also fix the inconsistency of graph signatures when param_constant is marked as lifted_tensor_constants but it's registered as parameters in the output of ep.module().

Differential Revision: D60532628

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132307
Approved by: https://github.com/zhxchen17
2024-08-08 01:36:16 +00:00
0f90ffe94a Remove ProcessGroupRoundRobin (#132888)
`_round_robin_process_groups` is deprecated and should be removed.

258f47fc0b/torch/csrc/distributed/c10d/ProcessGroupRoundRobin.cpp (L10-L12)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132888
Approved by: https://github.com/Skylion007, https://github.com/wanchaol, https://github.com/c-p-i-o, https://github.com/fduwjj
2024-08-08 01:07:40 +00:00
5cb05a82b4 [BC breaking] move benchmarking + prefer inductor path (#132827)
move benchmarking out of `torch._inductor.runtime.runtime_utils` and into `torch._inductor.runtime.benchmarking`, and prefer this path over directly accessing Triton's benchmarking

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132827
Approved by: https://github.com/eellison
2024-08-08 00:47:45 +00:00
a9036e1cf8 [inductor] raise unsupport msg in capture_pre_autograd_graph on Windows (#132841)
Debugged with @leslie-fang-intel, we found that https://github.com/pytorch/pytorch/issues/132561 and https://github.com/pytorch/pytorch/issues/132569 both fail because `capture_pre_autograd_graph` does not work well on Windows.

So we added some code to raise a message and let the end user know that.

Detailed:
For https://github.com/pytorch/pytorch/issues/132561
```cmd
Traceback (most recent call last):
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 59, in testPartExecutor
    yield
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 549, in _callTestMethod
    method()
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 2918, in wrapper
    method(*args, **kwargs)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 1515, in wrapper
    fn(*args, **kwargs)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_quantization.py", line 399, in wrapper
    fn(*args, **kwargs)
  File "D:\xu_git\dnnl_cb\pytorch\test\quantization\pt2e\test_x86inductor_quantizer.py", line 1737, in test_qat_conv2d
    self._test_quantizer(
  File "D:\xu_git\dnnl_cb\pytorch\test\quantization\pt2e\test_x86inductor_quantizer.py", line 553, in _test_quantizer
    m = capture_pre_autograd_graph(
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_export\__init__.py", line 121, in capture_pre_autograd_graph
    raise RuntimeError("capture_pre_autograd_graph not yet supported on Windows")
RuntimeError: capture_pre_autograd_graph not yet supported on Windows

To execute this test, run the following from the base repo dir:
    python test\quantization\pt2e\test_x86inductor_quantizer.py -k TestQuantizePT2EX86Inductor.test_qat_conv2d

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
```

For https://github.com/pytorch/pytorch/issues/132569
```cmd
Traceback (most recent call last):
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 59, in testPartExecutor
    yield
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 591, in run
    self._callTestMethod(testMethod)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\unittest\case.py", line 549, in _callTestMethod
    method()
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_utils.py", line 2918, in wrapper
    method(*args, **kwargs)
  File "D:\xu_git\dnnl_cb\pytorch\test\inductor\test_torchinductor.py", line 11218, in new_test
    return value(self)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_dynamo\testing.py", line 312, in _fn
    return fn(*args, **kwargs)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "D:\xu_git\dnnl_cb\pytorch\test\inductor\test_cpu_cpp_wrapper.py", line 155, in fn
    _, code = test_torchinductor.run_and_get_cpp_code(
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_inductor\utils.py", line 1863, in run_and_get_cpp_code
    result = fn(*args, **kwargs)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_quantization.py", line 415, in wrapper
    fn(*args, **kwargs)
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_quantization.py", line 367, in wrapper
    fn(*args, **kwargs)
  File "D:\xu_git\dnnl_cb\pytorch\test\inductor\test_mkldnn_pattern_matcher.py", line 1668, in test_qlinear_gelu_cpu
    self._qlinear_unary_cpu_test_helper((torch.randn((2, 4)),), gelu)
  File "D:\xu_git\dnnl_cb\pytorch\test\inductor\test_mkldnn_pattern_matcher.py", line 1615, in _qlinear_unary_cpu_test_helper
    self._test_common(
  File "D:\xu_git\dnnl_cb\pytorch\test\inductor\test_mkldnn_pattern_matcher.py", line 165, in _test_common
    convert_model = _generate_qdq_quantized_model(
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\testing\_internal\common_quantization.py", line 2949, in _generate_qdq_quantized_model
    export_model = capture_pre_autograd_graph(
  File "C:\Users\Xuhan\.conda\envs\win_mkl_static\lib\site-packages\torch\_export\__init__.py", line 121, in capture_pre_autograd_graph
    raise RuntimeError("capture_pre_autograd_graph not yet supported on Windows")
RuntimeError: capture_pre_autograd_graph not yet supported on Windows

To execute this test, run the following from the base repo dir:
    python test\inductor\test_cpu_cpp_wrapper.py -k DynamicShapesCppWrapperCpuTests.test_qlinear_gelu_cpu_dynamic_shapes_cpp_wrapper

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
--------------------------------------------------------------------------------------------------------------------------- Captured stderr call ----------------------------------------------------------------------------------------------------------------------------
W0807 13:24:34.291000 11228 torch\_export\__init__.py:64] +============================+
W0807 13:24:34.291000 11228 torch\_export\__init__.py:65] |     !!!   WARNING   !!!    |
W0807 13:24:34.291000 11228 torch\_export\__init__.py:66] +============================+
W0807 13:24:34.291000 11228 torch\_export\__init__.py:67] capture_pre_autograd_graph() is deprecated and doesn't provide any function guarantee moving forward.
W0807 13:24:34.291000 11228 torch\_export\__init__.py:68] Please switch to use torch.export instead.
```

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132841
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-08-08 00:28:07 +00:00
441c1c03d5 Prevent an unnecessary device -> host copy for CuPy arrays when not explicitly setting a device in torch.as_tensor. (#132595)
See title. Until now, calling `torch.as_tensor` on a CuPy array without providing a device would return a CPU tensor. This is most likely not desired.

Fixes #132553

```python3
import torch
import cupy as cp

cupy_arr = cp.asarray([1, 2, 3])

# Default case
t = torch.as_tensor(cupy_arr)
# New behavior, same device as cupy_arr now, was cpu before
print(t.device)  # cuda:0

# Explicitly set device
t = torch.as_tensor(cupy_arr, device='cpu')
print(t.device)  # cpu

# Implicit default device
torch.set_default_device('cpu')
t = torch.as_tensor(cupy_arr)
print(t.device)  # cpu

# Default device via context manager
torch.set_default_device('cuda')
with torch.device('cpu'):
    t = torch.as_tensor(cupy_arr)
    print(t.device)  # cpu

# Unset default device
torch.set_default_device(None)
t = torch.as_tensor(cupy_arr)
# New behavior, same device as cupy_arr now, was cpu before
print(t.device)  # cuda:0
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132595
Approved by: https://github.com/ezyang
2024-08-08 00:26:58 +00:00
374747818d Run performance test non-alternately (#131935)
Summary:
By default, performance tests (speedup experiments) will run the baseline and test backend alternately.

However, this does not work for the torchao backend, which changes the model in-place; therefore the baseline run would also run with the torchao backend, since the model has already been quantized.

Add a new experiment "latency_experiment" to run performance tests non-alternately (first run baseline for a few iterations, then run the test backend).

other changes:

need to add torch.compiler.cudagraph_mark_step_begin() to avoid the slowdown from "Unable to hit fast path of CUDAGraphs because of pending, uninvoked backwards"

also updated the torchao APIs to the current versions

X-link: https://github.com/pytorch/benchmark/pull/2394

Test Plan:
python run_benchmark.py torchao --only AlbertForMaskedLM --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
python run_benchmark.py torchao --only BartForCausalLM --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune
python run_benchmark.py torchao --only timm_efficientnet --quantization noquant --performance --inference --bfloat16 --inductor-compile-mode max-autotune

(should all be ~1.0)
0.997x
1.006x
0.994x

Reviewed By: xuzhao9

Differential Revision: D60252821

Pulled By: HDCharles

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131935
Approved by: https://github.com/xuzhao9
2024-08-08 00:23:20 +00:00
f16d87eeff Print where raw cprofile lives (#132866)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132866
Approved by: https://github.com/albanD
2024-08-08 00:13:29 +00:00
b73d4b6555 [pipelining] Add schedule runtime for lowered schedule (#130488)
Creates a new runtime that shifts complexity from runtime to
ahead-of-time.

The existing runtime (PipelineScheduleMulti) accepts a
compute-only schedule (only forward, backward, and weight actions are
specified), and it infers the communication operations at runtime.
Compared to that runtime, PipelineScheduleRuntime has less logic that
happens at runtime and relies on lowering passes to transform the
compute-only schedule to add communications.

Advantages include
- easier to verify the correctness by dumping a compute+comm schedule
- possible to manually edit the compute+comm schedule if the lowering
  heuristics are insufficient

Functionality included inside the PipelineScheduleRuntime is limited to
- accepting a compute-only schedule and lowering it to add comms
- executing the compute or comm operations specified by the given
  schedule
- handling work.wait() automatically by calling it just before the
  matching compute operation (for RECV ops) or at the end of step (for
  SEND ops)

Follow ups for later PRs
- Some refactoring should be done to replace PipelineScheduleMulti with
  this runtime
- Optimizer execution is not considered (e.g. for zero-bubble cases)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130488
Approved by: https://github.com/H-Huang
2024-08-08 00:08:03 +00:00
9282e6ca78 Don't use _disable_current_modes as decorator (#132809)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132809
Approved by: https://github.com/albanD
ghstack dependencies: #132801, #132802, #132804
2024-08-07 23:59:46 +00:00
42226ca3a3 Don't use use_lazy_graph_module as decorator (#132804)
See https://github.com/pytorch/pytorch/pull/132073 for motivation

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132804
Approved by: https://github.com/albanD
ghstack dependencies: #132801, #132802
2024-08-07 23:59:46 +00:00
5e4d8eb831 Don't generate stack entry for DebugContext.wrap (#132802)
See https://github.com/pytorch/pytorch/pull/132073 for motivation

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132802
Approved by: https://github.com/albanD
ghstack dependencies: #132801
2024-08-07 23:59:38 +00:00
708a99e52a Stop using with_fresh_cache_if_config as decorator (#132801)
See https://github.com/pytorch/pytorch/pull/132073 for motivation

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132801
Approved by: https://github.com/albanD
2024-08-07 23:59:32 +00:00
c3e51c09ed [PP] Add get_schedule_class util (#132768)
Add a function to map a string to a class instance for schedules. This allows users to select a schedule based on a string command line argument and removes the need for glue code (e.g. in torchtitan)
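
A hedged sketch of such a lookup (the schedule names and classes shown are assumptions for illustration, not necessarily the ones the util covers):

```python
from torch.distributed.pipelining import Schedule1F1B, ScheduleGPipe

_NAME_TO_SCHEDULE = {
    "1F1B": Schedule1F1B,
    "GPipe": ScheduleGPipe,
}

def get_schedule_class(name: str):
    try:
        return _NAME_TO_SCHEDULE[name]
    except KeyError:
        raise ValueError(f"Unknown schedule: {name!r}") from None

# e.g. schedule_cls = get_schedule_class("1F1B")
```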

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132768
Approved by: https://github.com/c-p-i-o, https://github.com/fduwjj
2024-08-07 23:51:03 +00:00
383f2ac914 AutoHeuristic: mixed_mm H100 heuristic (#132685)
H100 heuristic for mixed_mm. Performance looks similar to A100 heuristic.
```
  set     crit  max_depth  min_samples_leaf  correct  wrong  unsure  total  wrong_max_spdup  wrong_gman_spdup  max_spdup_default  gman_spdup_default  max_slowdown_default  non_default_preds  default_better
train  entropy          5              0.01     1562    604     145   2311         1.522201          1.077722          10.399141            3.134170              1.034802               2061               2
 test  entropy          5              0.01      361    164      24    549         1.443590          1.079169           8.159173            3.105360              1.197973                500               2
```

gpt-fast speedups
|batch size|prompt length| fallback    |  heuristic  | speedup |
|----------|-------------|------------:|------------:|--------:|
|     1    |      7      |      109.95  |       220.63|  2      |
|     1    |     11      |      109.65  | 	    210.92|  1.92   |
|     4    |      7      |       149.04 |       625.80|  4.19   |
|     4    |     11      |       149.56 |       494.64|  3.30   |
|     8    |      7      |       293.68 |       956.72|  3.25   |
|     8    |     11      |       294.48 |       925.60|  3.14   |

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132685
Approved by: https://github.com/eellison
2024-08-07 23:48:01 +00:00
c327710a87 [export] Publicize validate function (#132777)
as titled

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132777
Approved by: https://github.com/zhxchen17
2024-08-07 23:10:05 +00:00
21d4c48059 Allow distributed breakpoint to skip the first few calls (#129511)
Summary:
PDB allows conditional breakpoints, but that ability doesn't work in a distributed environment. We can still emulate a conditional breakpoint by doing the following:

```
counter = 0

# inside the function where the conditional breakpoint is needed:
global counter
counter += 1
if counter > 100:
  dist.breakpoint()
```

This PR makes dist.breakpoint() support this feature as syntactic sugar.
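
A hedged usage sketch — assuming the new keyword argument is named `skip` (check the PR for the exact signature):

```python
import torch.distributed as dist

# Skip the first 100 hits; only drop into the debugger afterwards.
dist.breakpoint(skip=100)
```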

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129511
Approved by: https://github.com/wconstab, https://github.com/c-p-i-o
2024-08-07 21:57:37 +00:00
acad2050c1 [easy][dynamo] Add tx as an arg in getitem_const (#132899)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132899
Approved by: https://github.com/yanboliang
ghstack dependencies: #132806
2024-08-07 21:35:41 +00:00
700a11fdd4 Make inductor kernel metadata comments more descriptive (#126698)
Summary:

A couple of improvements to the generated comments in inductor kernels:

1. Makes the nodes in the comment topologically sorted; I think having them
   alphabetically sorted is a gotcha. I was always confused about why the
   sorting in the comments did not match the code.
2. Adds a printout of the aten graph fragment corresponding to the
   current inductor kernel, to make it easier to map from aten
   code to inductor code

Example float8-overhead-related inductor kernel comment after this PR:

```
# kernel path: /tmp/torchinductor_vasiliy/27/c27ts3rdw56ns7od5j6ovdnhxphished2lcu3adclzzixoo7khg5.py
# Source Nodes: [weight_fp8], Original ATen: [aten.mul, aten.clamp, aten._to_copy]
# Source node to ATen node mapping:
#   weight_fp8 => clamp_max_1, clamp_min_3, convert_element_type_10, convert_element_type_11, convert_element_type_9, mul_3
# Graph fragment:
#   %mul_3 : [num_users=1] = call_function[target=torch.ops.aten.mul.Tensor](args = (%primals_2, %convert_element_type_8), kwargs = {})
#   %convert_element_type_9 : [num_users=1] = call_function[target=torch.ops.prims.convert_element_type.default](args = (%mul_3, torch.float32), kwargs = {})
#   %clamp_min_3 : [num_users=1] = call_function[target=torch.ops.aten.clamp_min.default](args = (%convert_element_type_9, -448.0), kwargs = {})
#   %clamp_max_1 : [num_users=1] = call_function[target=torch.ops.aten.clamp_max.default](args = (%clamp_min_3, 448.0), kwargs = {})
#   %convert_element_type_10 : [num_users=1] = call_function[target=torch.ops.prims.convert_element_type.default](args = (%clamp_max_1, torch.bfloat16), kwargs = {})
#   %convert_element_type_11 : [num_users=1] = call_function[target=torch.ops.prims.convert_element_type.default](args = (%convert_element_type_10, torch.float8_e4m3fn), kwargs = {})
triton_poi_fused__to_copy_clamp_mul_5 = async_compile.triton('triton_', '''
```


Pull Request resolved: https://github.com/pytorch/pytorch/pull/126698
Approved by: https://github.com/ezyang
ghstack dependencies: #126573
2024-08-07 21:25:09 +00:00
48f7bdbbe1 aot_autograd: copy metadata from fw to bw nodes (#126573)
Summary:

Uses the `seq_nr` field (introduced to aot_autograd nodes in
https://github.com/pytorch/pytorch/pull/103129) to map the aot_autograd
fx bw nodes to the corresponding fw nodes, and copy the metadata over.

I am trusting the `seq_nr` mapping in the linked PR here. I did
some validation with a toy LLaMa 3 8b training run and the mapping seemed
correct.

I am also trusting that the forward is single threaded, since `seq_nr` is thread local.  If this isn't always true, we'll need to also plumb `thread_id` through the same machinery which is populating `seq_nr`.

I'd like to use this data in a future PR to make inductor kernels easily
attributable to the nn.Module path in modeling land, to make it easier
to do performance debugging.

Test Plan:

```
// 1. unit test
python test/dynamo/test_aot_autograd.py -k test_aot_sequence_nr

// 2. manual test
// run LLaMa 3 8B fw + bw with torch.compile, print out the inductor graphs
// seen in `torch/_inductor/utils.py::get_kernel_metadata`, they seemed
// right to me.
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126573
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2024-08-07 21:25:09 +00:00
260e7cb143 Make CUDA device properties's __repr__ output actually printable (#132863)
Previously we would write the UUID bytes directly, leading to 'invalid
UTF-8 sequence' errors.
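For illustration, a short snippet that exercises the repr (standard `torch.cuda` API only):

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    # the uuid field is now rendered as readable text instead of raw bytes
    print(repr(props))
```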
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132863
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-08-07 21:08:43 +00:00
525fdc0f95 [docs] fix incorrect example in convert_conv3d_weight_memory_format (#129318)
The current example fails when using `torch.channels_last`, and the docs are slightly incorrect for the 3d case.
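A hedged sketch of the corrected 3d usage (not the exact doc example; it only shows that the 3d memory format is the one that applies here):

```python
import torch
import torch.nn as nn

conv = nn.Conv3d(8, 4, kernel_size=3)
# for Conv3d weights, channels_last_3d (not channels_last) is the applicable format
conv = torch.nn.utils.convert_conv3d_weight_memory_format(conv, torch.channels_last_3d)
```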
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129318
Approved by: https://github.com/albanD
2024-08-07 20:06:59 +00:00
6a348e5e57 [CUDAGraph] Warn once if too many distinct sizes (#132832)
Warn once if there are too many distinct sizes for cudagraph, so we can avoid spamming logs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132832
Approved by: https://github.com/eellison
2024-08-07 19:48:06 +00:00
e76bd0b603 [BE] put "show_dispatch_trace()" print logic in .cpp file (#132717)
I find myself occasionally trying to modify this to get additional debug info. Recompiling takes forever after modifying these lines, because the .h file is depended on by a huge number of files.

If we move this logic into a helper function and put it in the .cpp file, recompilation will be a lot faster when adding debug here.

Tested with a local DEBUG=1 build (which is needed to use `TORCH_SHOW_DISPATCH_TRACE=1`) and verified basic sanity - i.e. it still prints `[call]`, etc.

Differential Revision: [D60804331](https://our.internmc.facebook.com/intern/diff/D60804331)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132717
Approved by: https://github.com/soulitzer, https://github.com/bdhirsh
2024-08-07 19:43:29 +00:00
7830373662 Update owner for BC test (#132891)
Add @larryliu0820 to `/test/forward_backward_compatibility/check_forward_backward_compatibility.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132891
Approved by: https://github.com/albanD
2024-08-07 19:42:04 +00:00
59bbaea3a7 [inductor] disable capture_pre_autograd_graph related UTs on Windows (#132848)
Continued from https://github.com/pytorch/pytorch/pull/132841

We disable the `capture_pre_autograd_graph`-related UTs on Windows, and also disable the `test_lstm_packed_change_input_sizes` and `test_multihead_attention` UTs on Windows.

**TODO:**
Turn them back on after fixing the `capture_pre_autograd_graph` issue on Windows.

## Local Test:
Linux is not skipped:
<img width="1387" alt="image" src="https://github.com/user-attachments/assets/28dfbb4b-d9c0-4d5b-be84-d7b3697bcd3f">

And they are skipped on Windows:
<img width="853" alt="image" src="https://github.com/user-attachments/assets/e96ebcf8-9bf3-43aa-93fd-fb33d3743573">

Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132848
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-08-07 19:38:03 +00:00
7ea8374c0e nn.ModuleList.__getitem__ overloads (#132834)
Overloads so that you can get more specific type info based on how you are indexing.

```python
from torch import nn

module_list = nn.ModuleList(32 * [nn.Linear(2, 2)])

# before:
reveal_type(module_list[0])  # Type of "module_list[0]" is "Module | ModuleList"
reveal_type(module_list[:1])  # Type of "module_list[: 1]" is "Module | ModuleList"

# now:
reveal_type(module_list[0])  # Type of "module_list[0]" is "Module"
reveal_type(module_list[:1])  # Type of "module_list[: 1]" is "ModuleList"
```
Co-authored-by: Skylion007 <Skylion007@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132834
Approved by: https://github.com/Skylion007, https://github.com/albanD
2024-08-07 19:25:23 +00:00
83fa7f871f Work around item non-sync issue on AMD (#132772)
Differential Revision: D59669714

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132772
Approved by: https://github.com/ZhengkaiZ, https://github.com/izaitsevfb
2024-08-07 18:58:11 +00:00
ff81ca8e0c Revert "Populate submodules of torch._C to sys.modules recursively (#132216)"
This reverts commit 672ce4610e41386da9763e07375b0879dc351905.

Reverted https://github.com/pytorch/pytorch/pull/132216 on behalf of https://github.com/PaliC due to was breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/132216#issuecomment-2274112397))
2024-08-07 18:45:00 +00:00
4fe6a5dc34 Move slow tests to be in repo (#132379)
Move the slow test JSON into the pytorch/pytorch repo and add a job that updates it weekly.  The job uses the same environment as the commit hash update and similar code, but the hash update contains a lot of code specific to that workflow, so I only pulled out the parts that are relevant here.

Remove references to the old file and set up testing to read from the new file instead.

The old update cadence was daily; the new one is weekly.

The auto slow test infra, plus the lack of pinning between pytorch and test-infra, makes it hard to tell whether a test started failing because of a code change or because the slow test JSON changed.  While this can have benefits, like disable-test issues taking effect everywhere immediately, it can also be very confusing, especially since we don't have the same insight into slow tests that we do for disable issues.

Example PR made: https://github.com/pytorch/pytorch/pull/132383 (with all the changes from this PR because it was working on top of this)

We should just get rid of this at some point in favor of the slowTest decorator, but there are some tests that take 5+ minutes to run and I don't want to track them down right now
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132379
Approved by: https://github.com/huydhn
2024-08-07 18:42:56 +00:00
26b0011fb8 [XPU][Kineto Submodule] Introduce kineto-based XPU profiler (#130811)
As XPU has become a PyTorch built-in device, profiler support is an indispensable part of functionality completeness. This PR is associated with the PR that introduces the XPU profiler plugin into kineto. When USE_XPU is enabled, the LIBKINETO_NOXPUPTI option is suppressed accordingly, which allows kineto to build with the XPU profiler plugin.

Associated PR to introduce kineto-based XPU profiler into kineto:
https://github.com/pytorch/kineto/pull/961

Also updates the Kineto Submodule to include XPU changes.

Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130811
Approved by: https://github.com/aaronenyeshi
2024-08-07 18:41:37 +00:00
07551887b8 Revert "Disable SymDispatchMode when torch.compile'ing (#132433)"
This reverts commit 63eb06c0512b636a34caf041eab6fbc0726fc7ee.

Reverted https://github.com/pytorch/pytorch/pull/132433 on behalf of https://github.com/PaliC due to We need to now revert https://github.com/pytorch/pytorch/pull/132216 in OSS and there is a dependency on this pr ([comment](https://github.com/pytorch/pytorch/pull/132433#issuecomment-2274105080))
2024-08-07 18:41:28 +00:00
ca713b8393 llvm update for backward-breaking APIs in 18 and 19 (#132825)
Related to #130661, #129797.  Based on the LLVM tagged releases, these LLVM_VERSION_MAJOR guards are accurate.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132825
Approved by: https://github.com/dcci, https://github.com/Skylion007
2024-08-07 18:31:40 +00:00
a9ff190867 Revert "Consolidate SymDispatchMode into ProxyTensorMode (#132674)"
This reverts commit ffdf48e63b94930c81f05b06444721109d0b243d.

Reverted https://github.com/pytorch/pytorch/pull/132674 on behalf of https://github.com/PaliC due to We need to now revert https://github.com/pytorch/pytorch/pull/132216 in OSS and there is a dependency on this pr ([comment](https://github.com/pytorch/pytorch/pull/132674#issuecomment-2274062785))
2024-08-07 18:25:33 +00:00
9d476fee53 Revert "[BE] Simplify code interacting with get_proxy_mode/enable_tracing (#132675)"
This reverts commit c2bccfd4311fe905ff78c0977281b8e642bb10d6.

Reverted https://github.com/pytorch/pytorch/pull/132675 on behalf of https://github.com/PaliC due to We need to now revert https://github.com/pytorch/pytorch/pull/132216 in OSS and there is a dependency on this pr ([comment](https://github.com/pytorch/pytorch/pull/132674#issuecomment-2274062785))
2024-08-07 18:25:33 +00:00
f2ad3c89b0 fix dtype mismatch in lobpcg eigen solver (#132762)
Fixes #132761

If the rerr value is complex, test against its real part. Since the rerr variable holds a norm calculation, the imaginary part will be 0.0.

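A minimal, hypothetical sketch of that guard (not the actual lobpcg internals; the variable names are made up):

```python
import torch

rerr = torch.tensor(1e-3 + 0j)  # stand-in for a residual-norm value that came out complex
tol = 1e-2
# compare against the real part; the imaginary part of a norm is 0.0 anyway
converged = (rerr.real if rerr.is_complex() else rerr) < tol
print(bool(converged))  # True
```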
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132762
Approved by: https://github.com/albanD
2024-08-07 18:20:46 +00:00
1749025081 Revert "Fix infinite recursion while walking to submodules (#132763)"
This reverts commit 063a45ed27c3001bba44ea2161d188ec2314d428.

Reverted https://github.com/pytorch/pytorch/pull/132763 on behalf of https://github.com/PaliC due to We need to now revert https://github.com/pytorch/pytorch/pull/132216 in OSS and there is a dependency on this pr ([comment](https://github.com/pytorch/pytorch/pull/132763#issuecomment-2274059792))
2024-08-07 18:20:27 +00:00
25df063f04 [dynamo][user_defined][stable-diffusion] Raise ObservedAttributeError on UserDefinedObject var_getattr (#132806)
Fixes https://github.com/pytorch/pytorch/issues/132551

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132806
Approved by: https://github.com/williamwen42
2024-08-07 18:19:49 +00:00
40ce0a53bb [FSDP][dtensor] add FSDP2+TP distributed state dict test (#131408)
**Test**
`pytest test/distributed/_composable/fsdp/test_fully_shard_training.py`
`pytest test/distributed/_composable/fsdp/test_fully_shard_state_dict.py`
`pytest test/distributed/checkpoint/fsdp/test_fsdp_dsd.py`
`pytest test/distributed/_composable/fsdp/test_fully_shard_init.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131408
Approved by: https://github.com/fegin
ghstack dependencies: #126697, #130239, #132391
2024-08-07 18:17:12 +00:00
ad0ce89050 [3/N][dtensor] Strided Sharding offset calculation util (#132391)
**Summary**
1. change `compute_local_shape_and_global_offset` to correctly compute shape and offset for strided sharding placement (currently it only handles 2D and some 3D+ sharding).
2. Add a new property `num_shards_map` to `DTensorSpec` denoting how many shards each tensor dimension has. This is necessary for constructing `_StridedShard` placement when we call `distribute_tensor(dtensor_tp, dp_device_mesh, [Shard(0)])` and the `split_factor` argument will just be the number of shards on that sharding tensor dim.

**Test**
`test/distributed/_tensor/test_utils.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132391
Approved by: https://github.com/wanchaol
ghstack dependencies: #126697, #130239
2024-08-07 18:17:12 +00:00
0b0c660c02 [2/N][dtensor] Strided Sharding shard_to_replicate (#130239)
**Summary**
This PR adds the necessary util function to `_StridedShard` for correct shard-to-replicate resharding.

**Test**
`pytest test/distributed/_tensor/test_utils.py -s -k strided_sharding`
`pytest test/distributed/_tensor/test_utils.py -s -k test_fsdp2_tp_2d_dtensor_local_shards_and_offsets`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130239
Approved by: https://github.com/wanchaol
ghstack dependencies: #126697
2024-08-07 18:17:06 +00:00
92a17f454a [1/N][dtensor] introduce StridedShard placement type and _split_tensor() logic (#126697)
**Summary**
This PR adds a new private placement type `_StridedShard` for FSDP2 + TP style tensor sharding. The previously used `Shard` placement type cannot produce correct `full_tensor()` result because it assumes the tensor to be first sharded over `dp` mesh dimension then `tp` mesh dimension which does not hold true in FSDP2 + TP case.

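To make the setting concrete, here is a hedged sketch of the FSDP2 + TP sharding order the summary refers to (a toy 2x2 mesh on a 4-rank job; `_StridedShard` itself is private and not shown):

```python
import torch
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Shard

# assumes a 4-rank launch; dims named dp (FSDP2) and tp (tensor parallel)
mesh_2d = init_device_mesh("cuda", (2, 2), mesh_dim_names=("dp", "tp"))

w = torch.randn(8, 8)
# TP shards first along the tp sub-mesh, then FSDP2 re-shards that result along dp,
# so the global layout is not the one plain Shard placements on the 2D mesh assume
w_tp = distribute_tensor(w, mesh_2d["tp"], [Shard(0)])
```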
**Test**
`pytest test/distributed/_tensor/test_utils.py -s -k strided_sharding`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126697
Approved by: https://github.com/wanchaol
2024-08-07 18:17:02 +00:00
123d9ec5bf Revert "Loads .pyd instead of .so in MemPool test for windows (#132749)"
This reverts commit 37ab0f33854fafdf9bf4f575260329ffcd960d13.

Reverted https://github.com/pytorch/pytorch/pull/132749 on behalf of https://github.com/syed-ahmed due to Seems like periodic is still failing: 7c79e89bc5 ([comment](https://github.com/pytorch/pytorch/pull/132749#issuecomment-2274041302))
2024-08-07 18:08:44 +00:00
a62710c820 [FSDP2] Relaxed overlap test to address CI flakiness (#132869)
This tries to fix https://github.com/pytorch/pytorch/issues/120961.

This is a similar situation as https://github.com/pytorch/pytorch/pull/132116. The overlap tests were written strictly based on a precise calculation of what compute/communication should be non-overlapped vs. overlapped. This is done via `torch.cuda._sleep()`, which takes inputs in cycles, so we must convert from milliseconds to cycles via `get_cycles_per_ms()`, which is computed once and cached. Variation in CI can cause this `get_cycles_per_ms()` value to be inaccurate when the FSDP overlap tests run. Thus, we decide to relax the overlap tests to just make sure the overlapped runs are faster than a baseline without overlap.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132869
Approved by: https://github.com/weifengpy
2024-08-07 17:37:03 +00:00
cyy
32a284c275 [9/N] Fix clang-tidy warnings in aten/src/ATen (#132842)
Follows #132728

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132842
Approved by: https://github.com/Skylion007
2024-08-07 16:54:21 +00:00
ffd0d92c18 fix autotuning init issues (#132837)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132837
Approved by: https://github.com/yanboliang
2024-08-07 16:36:47 +00:00
8b50d5398f [DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709)
More context in [#132471](https://github.com/pytorch/pytorch/issues/132471) and https://github.com/pytorch/pytorch/issues/132366.

TLDR:
When cuda is available and users move tensors to cuda, we cannot really reuse the default pg if the default pg is gloo, as lots of collectives are not supported on gloo for cuda tensors. For example, `dtensor.full_tensor()` would result in a mysterious SIGTERM when all_gathering a cuda tensor using gloo. Without the change in this PR, users would have to know this context and explicitly move the cuda tensor to cpu before invoking most collectives, which is not ideal UX.

Therefore, given most collectives are not supported on gloo for cuda tensors, we should init a new pg if the default pg is gloo when torch.cuda.is_available() and device_type is cuda.

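A hedged sketch of the scenario being fixed (assumes a multi-process launch, e.g. via torchrun, where the default process group uses gloo but tensors live on cuda):

```python
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Shard

dist.init_process_group("gloo")  # default pg is gloo
mesh = init_device_mesh("cuda", (dist.get_world_size(),))  # with this PR: a fresh pg instead of the gloo default

x = torch.randn(8, 8, device="cuda")
dt = distribute_tensor(x, mesh, [Shard(0)])
full = dt.full_tensor()  # previously the all_gather of a cuda tensor over gloo could SIGTERM
```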
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132709
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-08-07 16:13:11 +00:00
258f47fc0b Add padding_side to pad_sequence with "left" and "right" options ("right" as default) (#131884)
Fixes #10536

Reattempt of #61467. Thank you so much to @mskoh52 for your excellent work!

As I was trying to create a more efficient LLM data collator, I realized that `pad_sequence` only supports right padding, even though left padding is a very common format for LLMs, like Llama and Mistral.

The proposed alternative implementation was to use multiple flips, which tends to be 1.5x-2x slower. Instead we can add a [`padding_side` parameter as there is for for Hugging Face tokenizers](9d6c0641c4/src/transformers/tokenization_utils_base.py (L1565)), which requires only a very small change in the C++ code.

Here are the benchmarks of the new implementation!

`float32`:

![eaaa95ef-9384-45d2-be56-6898bc1d3514](https://github.com/user-attachments/assets/3b0eb309-e5a0-4a4d-97bb-4e3298783dbb)

`bool`:

![892f32da-8d9a-492b-9507-18d3f0a41e8e](https://github.com/user-attachments/assets/6824ea15-7d4e-4b89-95f0-8546635f0c2e)

Code:

```python
from __future__ import annotations

import random
import time
from typing import Literal

import numpy as np
import torch

def pad_sequence_with_flips(
    sequences: list[torch.Tensor],
    batch_first: bool = False,
    padding_value: int | float | bool = 0.0,
    padding_side: Literal["left", "right"] | str = "left",
) -> torch.Tensor:
    if padding_side == 'right':
        padded_sequence = torch._C._nn.pad_sequence([t.flatten() for t in sequences], batch_first=batch_first, padding_value=padding_value)
    elif padding_side=='left':
        padded_sequence = torch._C._nn.pad_sequence([t.flatten().flip(0) for t in sequences], batch_first=batch_first, padding_value=padding_value)  # pyright: ignore[reportArgumentType]
        padded_sequence = padded_sequence.flip(int(batch_first))
    else:
        raise ValueError(f"padding_side should be either 'right' or 'left', but got {padding_side}")

    return padded_sequence

sequence_lengths: list[int] = []

flip_left_pad_times: list[float] = []
flip_left_pad_times_std: list[float] = []

left_pad_times: list[float] = []
left_pad_times_std: list[float] = []

RUNS_PER_LOOP: int = 100

for i in range(1, 7):
    sequence_length = i * int(1e6) // 6
    sequence_lengths.append(sequence_length)

    sequences = [torch.randint(0, 2, (random.randint(1, sequence_length),), dtype=torch.bool) for _ in range(64)]

    inner_left_pad_times: list[float] = []
    inner_right_pad_times: list[float] = []

    inner_flip_left_pad_times: list[float] = []
    inner_flip_right_pad_times: list[float] = []

    for _ in range(RUNS_PER_LOOP):

        start = time.perf_counter()
        torch._C._nn.pad_sequence(sequences, batch_first=True, padding_value=False, padding_side="left")
        end = time.perf_counter()
        inner_left_pad_times.append(end - start)

        start = time.perf_counter()
        pad_sequence_with_flips(sequences, batch_first=True, padding_value=False, padding_side="left")
        end = time.perf_counter()
        inner_flip_left_pad_times.append(end - start)

    left_pad_times.append(sum(inner_left_pad_times) / len(inner_left_pad_times))
    left_pad_times_std.append(np.std(inner_left_pad_times))

    flip_left_pad_times.append(sum(inner_flip_left_pad_times) / len(inner_flip_left_pad_times))
    flip_left_pad_times_std.append(np.std(inner_flip_left_pad_times))

    print(f"Sequence Length: {sequence_length}, Left Pad Time: {left_pad_times[-1]}, Left with Flips Pad Time: {flip_left_pad_times[-1]}")

import matplotlib.pyplot as plt

plt.plot(sequence_lengths, left_pad_times, label="new pad_sequence left")
plt.scatter(sequence_lengths, left_pad_times)
plt.errorbar(sequence_lengths, left_pad_times, yerr=left_pad_times_std, linestyle='None', marker='^')

plt.plot(sequence_lengths, flip_left_pad_times, label="old pad_sequence left (2 flips)")
plt.scatter(sequence_lengths, flip_left_pad_times)
plt.errorbar(sequence_lengths, flip_left_pad_times, yerr=flip_left_pad_times_std, linestyle='None', marker='^')

plt.xlabel("Sequence Length")
plt.ylabel("Time (s)")
plt.legend(loc="upper right")

# Sequence Length: 166666, Left Pad Time: 0.06147645162009212, Left with Flips Pad Time: 0.09842291727001794
# Sequence Length: 333333, Left Pad Time: 0.08933195920990329, Left with Flips Pad Time: 0.15597836187991562
# Sequence Length: 500000, Left Pad Time: 0.08863158334006585, Left with Flips Pad Time: 0.15224887342999863
# Sequence Length: 666666, Left Pad Time: 0.10524682551997103, Left with Flips Pad Time: 0.18177212480995877
# Sequence Length: 833333, Left Pad Time: 0.11801802741003485, Left with Flips Pad Time: 0.20821274195001024
# Sequence Length: 1000000, Left Pad Time: 0.131894061660023, Left with Flips Pad Time: 0.23223503091008751
```

Co-authored-by: mskoh52 <mskoh52@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131884
Approved by: https://github.com/ezyang
2024-08-07 15:53:07 +00:00
780310fed7 Revert "Only thunkify proxies in some situations (#132421)"
This reverts commit bb99008c9e7c357b88047bcd6971dc2078341484.

Reverted https://github.com/pytorch/pytorch/pull/132421 on behalf of https://github.com/clee2000 due to I think this broke dynamo/test_subclasses.py::TestNestedTensor::test_in_graph_construction_from_input [GH job link](https://github.com/pytorch/pytorch/actions/runs/10283744685/job/28459340678) [HUD commit link](bb99008c9e).  Test got added in f50621989b which is before your merge base ([comment](https://github.com/pytorch/pytorch/pull/132421#issuecomment-2273742960))
2024-08-07 15:29:54 +00:00
de9b8a42c1 Revert "Add support for other backends in get_preferred_device (#132118)"
This reverts commit c184ac0f6b6d2482cf300d852fde6370a1c1e086.

Reverted https://github.com/pytorch/pytorch/pull/132118 on behalf of https://github.com/clee2000 due to I think this broke distributed/checkpoint/test_file_system_checkpoint_cpu.py::TestDistributedReshardOnLoad::test_load_rowwise_to_colwise_thread_count_1 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10279901233/job/28456599072) [HUD commit link](c184ac0f6b).  Dr CI classification is wrong, the failure is not flaky ([comment](https://github.com/pytorch/pytorch/pull/132118#issuecomment-2273729288))
2024-08-07 15:22:42 +00:00
cyy
13fa59580e Enable clang-tidy on aten/src/ATen/cpu (#132830)
Expands code coverage of clang-tidy to aten/src/ATen/cpu

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132830
Approved by: https://github.com/Skylion007
2024-08-07 14:44:17 +00:00
ed97fb77f9 Conversions between strided and jagged layouts for Nested Tensors (#115749)
This PR does 3 things:
1. Adds a copy-free strided->jagged layout conversion for NT
2. Adds a copy-free jagged->strided layout conversion for NT
3. Modifies and expands the .to() API to support the layout argument for the specific case of NT layout conversion.

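A hedged sketch of item 3, the layout keyword on `.to()`; the construction call is standard `torch.nested` API, but treat the exact conversion call as an assumption taken from this PR description rather than verified documentation:

```python
import torch

nt = torch.nested.nested_tensor(
    [torch.randn(2, 4), torch.randn(3, 4)], layout=torch.strided
)
# strided -> jagged layout conversion via .to(), per this PR (copy-free)
nt_jagged = nt.to(layout=torch.jagged)
```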
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115749
Approved by: https://github.com/jbschlosser
2024-08-07 14:18:53 +00:00
fb146fc3c6 Only store necessary tensor_dict fields in node meta (#132805)
Fixes #132290

This PR attempts a more invasive / complete solution than the one from #132338, which removes immediate tensor fields from the `tensor_dict` copy stored in node meta. The approach taken here is to store only those fields of the `tensor_dict` which are absolutely utilized somewhere else.

So far, this appears to be limited to:
* `_dynamo_static_input_type`
* `tag` (at least in the tests). Discussion at #94080 appears to indicate this is depended on for export

(CI may point out more)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132805
Approved by: https://github.com/mlazos
2024-08-07 13:35:16 +00:00
7c79e89bc5 Stop using clear_frame as decorator (#132778)
See https://github.com/pytorch/pytorch/pull/132073 for motivation

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132778
Approved by: https://github.com/albanD
ghstack dependencies: #132774
2024-08-07 11:53:18 +00:00
bb99008c9e Only thunkify proxies in some situations (#132421)
The goal of this PR is to avoid stack overflow when we create extremely long chains of thunks, and then evaluate them (e.g., as occurs if you sum(long list of symint)). The basic idea behind this PR is to only thunkify proxies if they're being created in places where they may or may not be used--crucially, symint operations that occur in user code we are tracing are eagerly placed into the graph, even if they may eventually be dead.

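A toy illustration of the `sum(long list of symint)` case mentioned above (the model code is made up; the point is only that a long chain of symint additions occurs in traced user code):

```python
import torch

@torch.compile(dynamic=True)
def total_length(tensors):
    # each x.shape[0] is a SymInt under dynamic shapes; summing many of them
    # used to build a very deep chain of lazy thunks and could overflow the stack
    return sum([x.shape[0] for x in tensors])

out = total_length([torch.randn(i + 1) for i in range(200)])
```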
I annotated the PR with explanation of changes.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132421
Approved by: https://github.com/Skylion007, https://github.com/zou3519
ghstack dependencies: #132674, #132675
2024-08-07 11:51:17 +00:00
32f9a809c7 Replace [[unlikely]] with unlikely(x) (#130816)
Do not use `[[unlikely]]` as it is a C++20 language feature; see https://en.cppreference.com/w/cpp/language/attributes/likely

Fixes https://github.com/pytorch/pytorch/issues/130815

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130816
Approved by: https://github.com/jgong5, https://github.com/jansel, https://github.com/malfet
2024-08-07 10:38:13 +00:00
8c8eb9670a [CI] Enable inductor UT test on avx512 (#132645)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132645
Approved by: https://github.com/desertfire
2024-08-07 10:22:40 +00:00
37ab0f3385 Loads .pyd instead of .so in MemPool test for windows (#132749)
Fixes #132650

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132749
Approved by: https://github.com/albanD
2024-08-07 09:58:52 +00:00
8333ecf085 Support hasattr tracing for more PythonModuleVariable (#132731)
Fixes #132237

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132731
Approved by: https://github.com/EikanWang, https://github.com/yanboliang
2024-08-07 09:15:17 +00:00
c8c964f950 [inductor] check best templates first for fusions (#132829)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132829
Approved by: https://github.com/eellison
2024-08-07 07:48:00 +00:00
c184ac0f6b Add support for other backends in get_preferred_device (#132118)
Currently get_preferred_device supports only cuda and cpu. Add support for other backends using the backend config.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132118
Approved by: https://github.com/awgu
2024-08-07 07:19:20 +00:00
87053132ea [DeviceMesh] Remove parent mesh concept from _MeshEnv and replace by root mesh (#132339)
Previously, when we slice out a submesh from a mesh, we assign the mesh as the parent mesh of the submesh. In this case, when we have a 3D mesh topology, the parent mesh of a 1D mesh sliced out from the 3D mesh is different from the parent mesh of the same 1D mesh sliced out from the 2D submesh of the 3D mesh. For example:
```
mesh_3d = init_device_mesh("cuda", (2,2,2), ("dim0", "dim1", "dim2"))
mesh_dim0 = mesh_3d["dim0"]

mesh_2d = mesh_3d["dim0", "dim1"]
mesh_dim0_2 = mesh_2d["dim0"]

# This would evaluate to be True
print(_mesh_resources.get_parent_mesh(mesh_dim0) != _mesh_resources.get_parent_mesh(mesh_dim0_2))
```

We can always reconstruct the mesh needed from the mesh dim names, as long as two dims come from the same root. For simplicity, we do not see the necessity of building a tree structure to represent child-parent relationship. Therefore, we are replacing the parent mesh concept with a root mesh concept in `_MeshEnv` so we would have:

```
mesh_3d = init_device_mesh("cuda", (2,2,2), ("dim0", "dim1", "dim2"))
mesh_dim0 = mesh_3d["dim0"]

mesh_2d = mesh_3d["dim0", "dim1"]
mesh_dim0_2 = mesh_2d["dim0"]

# This would evaluate to be True
print(_mesh_resources.get_root_mesh(mesh_dim0) == _mesh_resources.get_root_mesh(mesh_dim0_2))
```
With this change, we will have two types of meshes in an environment.
1. `device_mesh != _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh is created by slicing.
2. `device_mesh == _mesh_resources.get_root_mesh(device_mesh)` means that the device_mesh is a root mesh not created through slicing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132339
Approved by: https://github.com/wanchaol
ghstack dependencies: #132310, #132311
2024-08-07 07:01:12 +00:00
dc00eeb0f4 [Dynamo] fix incorrect kwargs in create_proxy (#132723)
## Summary
Fix https://github.com/pytorch/pytorch/issues/132642; the implementation of `create_proxy` requires passing in `kwargs` explicitly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132723
Approved by: https://github.com/aorenste
2024-08-07 06:26:24 +00:00
2206a3de00 [Compile] Speedup int8-to-float conversion on aarch64 (#132676)
With this change, the following snippet:
```cpp
#include <ATen/cpu/vec/vec.h>

void int8tofloat(int8_t* in, float* out) {
        auto tmp0 = at::vec::Vectorized<int8_t>::loadu(in, 8);
        auto tmp1 = at::vec::convert<float>(tmp0);
        tmp1.store(out);
}
```

which is the core of the algorithm generated by cpu_inductor for the following compiled function:
```python
@torch.compile
def to_float(x):
  return x.to(torch.float)
```

changes from
```assembly
int8tofloat(signed char*, float*):
0000000000000000	stp	x29, x30, [sp, #-0x10]!
0000000000000004	mov	x29, sp
0000000000000008	sub	x9, sp, #0x30
000000000000000c	and	sp, x9, #0xffffffffffffffe0
0000000000000010	adrp	x8, 0 ; 0x0
0000000000000014	ldr	x8, [x8]
0000000000000018	ldr	x8, [x8]
000000000000001c	str	x8, [sp, #0x28]
0000000000000020	ldr	s0, [x0]
0000000000000024	sshll.8h	v0, v0, #0x0
0000000000000028	sshll.4s	v0, v0, #0x0
000000000000002c	scvtf.4s	v0, v0
0000000000000030	str	q0, [sp]
0000000000000034	ldr	s0, [x0, #0x4]
0000000000000038	sshll.8h	v0, v0, #0x0
000000000000003c	sshll.4s	v0, v0, #0x0
0000000000000040	scvtf.4s	v0, v0
0000000000000044	str	q0, [sp, #0x10]
0000000000000048	mov	x8, sp
000000000000004c	ld1.4s	{ v0, v1 }, [x8]
0000000000000050	st1.4s	{ v0, v1 }, [x1]
0000000000000054	ldr	x8, [sp, #0x28]
0000000000000058	adrp	x9, 0 ; 0x0
000000000000005c	ldr	x9, [x9]
0000000000000060	ldr	x9, [x9]
0000000000000064	cmp	x9, x8
0000000000000068	b.ne	0x78
000000000000006c	mov	sp, x29
0000000000000070	ldp	x29, x30, [sp], #0x10
0000000000000074	ret
0000000000000078	bl	0x78
```
to
```assembly
0000000000000000	ldr	d0, [x0]
0000000000000004	sshll.8h	v0, v0, #0x0
0000000000000008	sshll.4s	v1, v0, #0x0
000000000000000c	scvtf.4s	v1, v1
0000000000000010	sshll2.4s	v0, v0, #0x0
0000000000000014	scvtf.4s	v2, v0
0000000000000018	st1.4s	{ v1, v2 }, [x1]
000000000000001c	ret
```

and improves the performance of `python3 torchchat.py generate stories110M --num-samples 3 --quantize '{"linear:int8" : {"groupsize" : 0}}' --compile --device cpu` from 56 to 98 tokens per second on a MacBook M1 Pro.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132676
Approved by: https://github.com/desertfire
2024-08-07 06:26:05 +00:00
4faa0e3efb [Inductor] support masked vectorization for the tail_loop (#126526)
Currently the tail_loop always uses the scalar kernel. This PR adds masked vectorization support for the tail_loop to improve performance.

Example:
```
import torch
import torch.nn as nn

class GN(nn.Module):
    def __init__(self, num_groups, num_channels):
        super(GN, self).__init__()
        self.gn = nn.GroupNorm(num_groups, num_channels)

    def forward(self, x):
        return self.gn(x)

input = torch.randn(2, 960, 96, 96).to(memory_format=torch.channels_last)
m = GN(32, 960).eval()
compiled_m = torch.compile(m)

with torch.no_grad():
    for _ in range(3):
        compiled_m(input)

```

Generated code:
- Before:
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/ky/cky2bufythacofebk7ujv36e4pxyqcqbpsy5r4vojoprjiwcwfxf.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(112)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> weight_recps(static_cast<long>(17280L));
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L))
                        {
                            for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 16);
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &weight_recps);
                            }
                            #pragma omp simd simdlen(8)
                            for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(1L))
                            {
                                auto tmp0 = in_ptr0[static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0))];
                                tmp_acc0 = welford_combine(tmp_acc0, tmp0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L))
                {
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(960L); x2+=static_cast<long>(16L))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)), 16);
                        auto tmp1 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr0[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp3 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr1[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x2), 16);
                        auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x2), 16);
                        auto tmp2 = tmp0 - tmp1;
                        auto tmp4 = static_cast<float>(276480.0);
                        auto tmp5 = at::vec::Vectorized<float>(tmp4);
                        auto tmp6 = tmp3 / tmp5;
                        auto tmp7 = static_cast<float>(1e-05);
                        auto tmp8 = at::vec::Vectorized<float>(tmp7);
                        auto tmp9 = tmp6 + tmp8;
                        auto tmp10 = tmp9.rsqrt();
                        auto tmp11 = tmp2 * tmp10;
                        auto tmp13 = tmp11 * tmp12;
                        auto tmp15 = tmp13 + tmp14;
                        tmp15.store(out_ptr2 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)));
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1 = args
    args.clear()
    assert_size_stride(arg0_1, (960, ), (1, ))
    assert_size_stride(arg1_1, (960, ), (1, ))
    assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960))
    buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32)
    cpp_fused_native_group_norm_0(arg2_1, arg0_1, arg1_1, buf0, buf1, buf3)
    del arg0_1
    del arg1_1
    del arg2_1
    return (buf3, )
```

- After:
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/em/cemtujj65j5txpqlxc7w4pcunpmvz3qtiudkc5ocxxhcmdlknw2m.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       const float* in_ptr2,
                       float* out_ptr0,
                       float* out_ptr1,
                       float* out_ptr2)
{
    #pragma omp parallel num_threads(112)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(32L); x1+=static_cast<long>(1L))
                {
                    {
                        Welford<float> tmp_acc0 = Welford<float>();
                        Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
                        static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<long>(17280L));
                        for(long x2=static_cast<long>(0L); x2<static_cast<long>(9216L); x2+=static_cast<long>(1L))
                        {
                            for(long x3=static_cast<long>(0L); x3<static_cast<long>(16L); x3+=static_cast<long>(16L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 16);
                                tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
                            }
                            for(long x3=static_cast<long>(16L); x3<static_cast<long>(30L); x3+=static_cast<long>(14L))
                            {
                                auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x3 + (30L*x1) + (960L*x2) + (8847360L*x0)), 14);
                                masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, tmp0, 14, &wrecps0);
                            }
                        }
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
                        tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
                        out_ptr0[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.mean);
                        out_ptr1[static_cast<long>(x1 + (32L*x0))] = static_cast<float>(tmp_acc0.m2);
                    }
                }
            }
        }
        {
            #pragma omp for collapse(2)
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(2L); x0+=static_cast<long>(1L))
            {
                for(long x1=static_cast<long>(0L); x1<static_cast<long>(9216L); x1+=static_cast<long>(1L))
                {
                    for(long x2=static_cast<long>(0L); x2<static_cast<long>(960L); x2+=static_cast<long>(16L))
                    {
                        auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)), 16);
                        auto tmp1 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr0[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp3 =
                        [&]
                        {
                            __at_align__ std::array<float, 16> tmpbuf;
                            #pragma GCC unroll 16
                            for (long x2_inner = 0; x2_inner < 16; x2_inner++)
                            {
                                tmpbuf[x2_inner] = out_ptr1[static_cast<long>((32L*x0) + (c10::div_floor_integer((x2 + x2_inner), 30L)))];
                            }
                            return at::vec::Vectorized<float>::loadu(tmpbuf.data(), 16);
                        }
                        ()
                        ;
                        auto tmp12 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x2), 16);
                        auto tmp14 = at::vec::Vectorized<float>::loadu(in_ptr2 + static_cast<long>(x2), 16);
                        auto tmp2 = tmp0 - tmp1;
                        auto tmp4 = static_cast<float>(276480.0);
                        auto tmp5 = at::vec::Vectorized<float>(tmp4);
                        auto tmp6 = tmp3 / tmp5;
                        auto tmp7 = static_cast<float>(1e-05);
                        auto tmp8 = at::vec::Vectorized<float>(tmp7);
                        auto tmp9 = tmp6 + tmp8;
                        auto tmp10 = tmp9.rsqrt();
                        auto tmp11 = tmp2 * tmp10;
                        auto tmp13 = tmp11 * tmp12;
                        auto tmp15 = tmp13 + tmp14;
                        tmp15.store(out_ptr2 + static_cast<long>(x2 + (960L*x1) + (8847360L*x0)));
                    }
                }
            }
        }
    }
}
''')

async_compile.wait(globals())
del async_compile

def call(args):
    arg0_1, arg1_1, arg2_1 = args
    args.clear()
    assert_size_stride(arg0_1, (960, ), (1, ))
    assert_size_stride(arg1_1, (960, ), (1, ))
    assert_size_stride(arg2_1, (2, 960, 96, 96), (8847360, 1, 92160, 960))
    buf0 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf1 = empty_strided_cpu((2, 32, 1, 1), (32, 1, 64, 64), torch.float32)
    buf3 = empty_strided_cpu((2, 960, 96, 96), (8847360, 1, 92160, 960), torch.float32)
    cpp_fused_native_group_norm_0(arg2_1, arg0_1, arg1_1, buf0, buf1, buf3)
    del arg0_1
    del arg1_1
    del arg2_1
    return (buf3, )
```

Co-authored-by: CaoE <e.cao@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126526
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel, https://github.com/jansel
2024-08-07 06:00:12 +00:00
8bc5ef563e Grouped Query Attention (#132689)
### Approach: Using the current function declaration

**Constraint:** Q_Heads % KV_Heads == 0

**Major change:**
- Added a new argument `enable_gqa: bool` to the sdpa function call
- It gives additional meaning to the third-to-last (num_heads) dimension.

Sample use cases this would enable:
LLama3

```
# LLama3 8b call to SDPA
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)

output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)

# Output Shape
(batch, 32, seq_len_q, D)
```

### Design Choice:

- Check if Query.size(-3) == Key.size(-3) == Value.size(-3), or if Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their number of heads are not equal, facilitating correct and efficient computation in attention mechanisms.
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.

### Benchmarks:

- **sdpa.py: #130634**
For different batch sizes, enable_gqa=True shows a substantial improvement in the runtime of sdpa

 | batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True   |   forward_time when enable_gqa=False    |
| ------------ | ------------- | -------------- | ----------- | ------------ | ----------- | ----------- | ---------------- |
|     1      |     32      |      8       |   2048    |    2048    |   2048    |   100.71  |  119.70  |
|     8      |     32      |      8       |   2048    |    2048    |   2048    |   539.78  |  628.83  |
|     16     |     32      |      8       |   2048    |    2048    |   2048    |   1056.81  |  1225.48  |
|     32      |     32      |      8       |   2048    |    2048    |   2048    |   2099.54  |  2440.45  |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)

- **TorchTitan: https://github.com/pytorch/torchtitan/pull/458**

Differential Revision: D60772086

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132689
Approved by: https://github.com/drisspg
2024-08-07 05:35:36 +00:00
527f104a69 add L2 cache size to device properties (#132819)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132819
Approved by: https://github.com/eellison
2024-08-07 04:55:06 +00:00
cyy
bfeb45e46b [17/N] Fix clang-tidy warnings in jit (#132753)
Follows #132604
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132753
Approved by: https://github.com/Skylion007
2024-08-07 03:47:54 +00:00
cyy
03480213de [8/N] Fix clang-tidy warnings in aten/src/ATen (#132728)
Follows  #132727
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132728
Approved by: https://github.com/ezyang
2024-08-07 02:44:17 +00:00
919e384247 [PT2][Optimus] Add unbind_stack_to_cat_pass (#132542)
Summary: We observe that the stack node can be transformed into a cat node to eliminate split nodes, which could further enable the unbind-cat optimization; thus we add a more advanced pattern to do this graph transformation.

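As a loose, eager-mode illustration of the kind of split/stack round-trip these passes simplify (this is not the Optimus pass itself, just the tensor-level identity it relies on):

```python
import torch

x = torch.randn(6, 4)
# an unbind whose outputs are immediately re-stacked is a no-op round-trip,
# so the intermediate nodes can be eliminated from the graph
restacked = torch.stack(torch.unbind(x, dim=0), dim=0)
assert torch.equal(restacked, x)
```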
Test Plan:
# unit test

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```
Buck UI: https://www.internalfb.com/buck2/de6c1cda-3d74-4a30-8980-7b209b6fe5dc
Test UI: https://www.internalfb.com/intern/testinfra/testrun/12103424042268125
Network: Up: 485KiB  Down: 728KiB  (reSessionID-2f2c01c3-79bb-4e37-b5be-fb77ec09b264)
Jobs completed: 29. Time elapsed: 5:19.8s.
Cache hits: 0%. Commands: 4 (cached: 0, remote: 0, local: 4)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0

# benchmark

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ig_ctr" --flow_id 584880697
```
P1503698962

before and after graph transformation
https://www.internalfb.com/intern/diffing/?paste_number=1504050718

Differential Revision: D60411560

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132542
Approved by: https://github.com/jackiexu1992
2024-08-07 02:26:40 +00:00
063a45ed27 Fix infinite recursion while walking to submodules (#132763)
Fixes https://github.com/pytorch/pytorch/pull/132216#issuecomment-2271555873

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132763
Approved by: https://github.com/ezyang
2024-08-07 02:20:17 +00:00
73c083e02c [Inductor][CPP] Turns on inline_inbuilt_nn_modules for CPP GEMM template testing (#132487)
**Summary**
The CPP GEMM template testing was skipped when `inline_inbuilt_nn_modules` was turned on, as in https://github.com/pytorch/pytorch/issues/131929. Since https://github.com/pytorch/pytorch/pull/132334 has landed to fix the issues, turn this flag back on since it's the default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132487
Approved by: https://github.com/anijain2305, https://github.com/jgong5
2024-08-07 02:18:51 +00:00
ed224554eb [BE] Don't unnecessarily suggest -k for rerunning tests locally (#132807)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132807
Approved by: https://github.com/malfet
2024-08-07 02:15:18 +00:00
837898d9c8 Stop using preserve_rng_state as decorator (#132774)
See https://github.com/pytorch/pytorch/pull/132073 for motivation

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132774
Approved by: https://github.com/albanD
2024-08-07 01:07:12 +00:00
cyy
b01402b0a4 [7/N] Fix clang-tidy warnings in aten/src/ATen (#132727)
Follows  #132620
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132727
Approved by: https://github.com/Skylion007
2024-08-07 00:29:03 +00:00
178dc0c9c7 various doc fixes (#132803)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132803
Approved by: https://github.com/Chillee, https://github.com/joydddd, https://github.com/BoyuanFeng
ghstack dependencies: #132799
2024-08-07 00:19:42 +00:00
cb4d1bfb71 Clean up some tflop calc and add option for saving (#132799)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132799
Approved by: https://github.com/BoyuanFeng
2024-08-07 00:19:42 +00:00
cbee9c1fd2 Revert "Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)"
This reverts commit 0e7e61f7cec82a43f2de52b83eff152d703be7a3.

Reverted https://github.com/pytorch/pytorch/pull/127690 on behalf of https://github.com/kit1980 due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/127690#issuecomment-2272370386))
2024-08-07 00:05:20 +00:00
e98eac76b3 [inductor] switch AotCodeCompiler to new cpp_builder. (take 3) (#132766)
Summary: This is basically https://github.com/pytorch/pytorch/pull/131304 together with https://github.com/pytorch/pytorch/pull/132594 and absolute path fix for fbcode.

Test Plan: ci

Differential Revision: D60773405

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132766
Approved by: https://github.com/xuhancn, https://github.com/chenyang78, https://github.com/desertfire
2024-08-06 23:56:34 +00:00
c7113a6186 Revert "[DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709)"
This reverts commit 1a23ef2ece1c667ee46cd34deb70df2b91bffa32.

Reverted https://github.com/pytorch/pytorch/pull/132709 on behalf of https://github.com/clee2000 due to I think this broke distributed/test_distributed_spawn.py::TestDistBackendWithSpawn::test_ddp_device_mesh_initialization [GH job link](https://github.com/pytorch/pytorch/actions/runs/10274519791/job/28432469987) [HUD commit link](1a23ef2ece).  Test not run due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/132709#issuecomment-2272350923))
2024-08-06 23:47:53 +00:00
0d6caeb259 Add logging + counter for missed reinplacing opportunities (#132758)
Summary:
- We add Inductor logs for what tensors we tried to reinplace, what
  tensors we were unable to reinplace, and of those tensors, which of
  those might be bugs (the "missed reinplacing opportunities"). You can
  tell this by reading the Inductor output graph but the logs make it
  easier to figure out.
- Add a dynamo_compile counter for missed reinplacing opportunities. The
  goal is to see how widespread existing problems (if any) are. We've had
  trouble getting all of the edge cases for the reinplacing pass; the
  counter will help us hunt down issues.

Test Plan:
- tested locally

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132758
Approved by: https://github.com/eellison
2024-08-06 23:44:24 +00:00
cd7f527c59 [3/3] 3D Composability - move tp dp tests (#129802)
pytorch (fsdp, tp, pp) -> pytorch (composable)
Move (fsdp, tp, pp) tests under pytorch into a composable folder

FSDP:
test/distributed/_composable/fsdp/test_fully_shard_training.py
-TestFullyShard2DTraining
**DP:
test/distributed/tensor/parallel/test_ddp_2d_parallel.py
TP:
test/distributed/tensor/parallel/test_fsdp_2d_parallel.py**
PP:
test/distributed/pipelining/test_composability.py

=>
**distributed/_composable/test_composability/test_2d_composability.py**
distributed/_composable/test_composability/test_pp_composability.py

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129802
Approved by: https://github.com/fduwjj
ghstack dependencies: #129801
2024-08-06 23:07:07 +00:00
179b572fd9 [2/3] 3D Composability - move pp tests (#129801)
pytorch (fsdp, tp, pp) -> pytorch (composable)
Move (fsdp, tp, pp) tests under pytorch into a composable folder

FSDP:
test/distributed/_composable/fsdp/test_fully_shard_training.py
-TestFullyShard2DTraining
DP:
test/distributed/tensor/parallel/test_ddp_2d_parallel.py
TP:
test/distributed/tensor/parallel/test_fsdp_2d_parallel.py
**PP:
test/distributed/pipelining/test_composability.py**

=>
distributed/_composable/test_composability/test_2d_composability.py
**distributed/_composable/test_composability/test_pp_composability.py**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129801
Approved by: https://github.com/wconstab, https://github.com/atalman
2024-08-06 23:07:07 +00:00
825002c9c6 [export][fx] More robust DCE pass (#132764)
Summary:
- make default DCE pass check schema,
- need to rebase onto https://github.com/pytorch/pytorch/pull/131651 after it's in phabricator (for now the change is manually added).

- mark Proxy dump as NotImplemented for better error msg

- Remove Proxy from tensors when dumping models, as Proxy cannot be dumped.

More details in https://docs.google.com/document/d/1G5vmTXjzxoyVGRI2kpA1gQukK_Glyg2NrE0Oh6Nlg9A/edit?usp=sharing.

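A small sketch of what a schema-aware DCE has to get right, assuming (as the summary suggests) that the schema check is about keeping side-effecting ops such as in-place mutations alive:

```python
import torch
from torch import fx

def f(x):
    x.add_(1)      # mutating op: looks unused, but must survive DCE
    return x * 2

gm = fx.symbolic_trace(f)
gm.graph.eliminate_dead_code()  # a schema-aware DCE keeps the add_ node
gm.recompile()
print(gm.code)
```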
Test Plan:
CI
```
- buck2 run 'fbcode//mode/dev-nosan'  fbcode//caffe2/test/quantization:test_quantization -- -r  qat_conv2d
- test_export.py
- buck2 run 'fbcode//mode/dev-nosan' fbcode//modai/test:test_modai -- -r test_qat_stinson_htp_export
- buck2 run 'fbcode//mode/dev-nosan' fbcode//vizard_projects/ml_depth/tests:test_model -- -r test_qat_model_et
- buck2 run 'fbcode//mode/dev-nosan'  fbcode//caffe2/test:fx -- -r dce
- buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/backends/tests:qnn_test -- -r test_qat_bias=False,use_3d_input=False
- buck2 run 'fbcode//mode/dev-nosan' fbcode//bolt/nn/executorch/backends/tests:qnn_test -- -r test_qat_bias=True,use_3d_input=False
- buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test/quantization:test_quantization -- -r  test_fold_bn_erases_bn_node
```

Reviewed By: angelayi

Differential Revision: D60319175

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132764
Approved by: https://github.com/angelayi
2024-08-06 22:27:22 +00:00
073cee531c [Test][Easy] Remove print in test_device_mesh.py (#132780)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132780
Approved by: https://github.com/XilunWu
2024-08-06 22:04:39 +00:00
1a23ef2ece [DeviceMesh] Create new group for 1D mesh when default backend is 'gloo' and 'cuda' is available (#132709)
More context in [#132471](https://github.com/pytorch/pytorch/issues/132471) and https://github.com/pytorch/pytorch/issues/132366.

TLDR:
When cuda is available and users move tensors to cuda, we cannot really reuse the default pg if the default pg is gloo, as lots of collectives are not supported on gloo for cuda tensors. For example, `dtensor.full_tensor()` would result in a mysterious SIGTERM when all_gathering a cuda tensor using gloo. Without the change in this PR, users would have to know this context and explicitly move the cuda tensor to cpu before invoking most collectives, which is not ideal UX.

Therefore, given most collectives are not supported on gloo for cuda tensors, we should init a new pg if the default pg is gloo when torch.cuda.is_available() and device_type is cuda.
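A minimal sketch of the scenario described above (module paths are from this era of the codebase — DTensor APIs still lived under `torch.distributed._tensor` — and the script assumes a multi-GPU torchrun launch):

```python
# Hedged sketch: default PG backend is gloo, but tensors live on CUDA.
# Launch with: torchrun --nproc_per_node=<num_gpus> this_script.py
import torch
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed._tensor import distribute_tensor, Shard  # assumed module path

dist.init_process_group("gloo")                              # default backend is gloo
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
mesh = init_device_mesh("cuda", (dist.get_world_size(),))    # 1D cuda mesh
dt = distribute_tensor(torch.randn(8, 8, device="cuda"), mesh, [Shard(0)])

# Before this change, the all_gather inside full_tensor() ran on the gloo default pg
# and could crash, since gloo does not support most collectives on CUDA tensors;
# with this change the 1D mesh creates a new (NCCL) group instead.
full = dt.full_tensor()
```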

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132709
Approved by: https://github.com/awgu, https://github.com/wanchaol
2024-08-06 22:00:09 +00:00
18b678082e [Easy] log output code path on cache hit (#132718)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132718
Approved by: https://github.com/oulgen, https://github.com/masnesral
2024-08-06 21:59:30 +00:00
3c1033eeb0 Don't auto request review for reopened PRs (#132681)
This will clobber previous approves.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132681
Approved by: https://github.com/albanD, https://github.com/malfet
2024-08-06 21:36:18 +00:00
2073ddfd1c Actually report the HOP and subclass/mode when there isn't a registration (#132550)
Test Plan:
- tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132550
Approved by: https://github.com/ydwu4
2024-08-06 21:33:10 +00:00
623d0204f0 [NJT] Support Chunk backward for simple cases (#132193)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132193
Approved by: https://github.com/soulitzer
2024-08-06 21:20:09 +00:00
2f908ffa4a [traced-graph][sparse] sparsity propagation for all current tests (#132690)
This PR makes sure all current tests in the sparsity export test suite pass. Note that there will probably be anecdotal cases that need fixing after this, but the general idea of preserving sparsity metadata has been completed.

Fixes: https://github.com/pytorch/pytorch/issues/117188

```
$ PYTORCH_TEST_WITH_DYNAMO=0 python test/export/test_sparse.py ........................................................................................................................................................
 ----------------------------------------------------------------------
Ran 152 tests
OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132690
Approved by: https://github.com/ezyang
2024-08-06 21:18:13 +00:00
029f8fc701 Bump rexml from 3.2.8 to 3.3.3 in /ios/TestApp (#132469)
Bumps [rexml](https://github.com/ruby/rexml) from 3.2.8 to 3.3.3.
Release notes (from rexml's releases):

REXML 3.3.3 - 2024-08-01

Improvements:
- Added support for detecting invalid XML that has unsupported content before the root element (GH-184, patch by NAITOH Jun)
- Added support for `REXML::Security.entity_expansion_limit=` and `REXML::Security.entity_expansion_text_limit=` in SAX2 and pull parsers (GH-187, patch by NAITOH Jun)
- Added more tests for invalid XMLs (GH-183, patch by Watson)
- Added more performance tests (patch by Watson)
- Improved parse performance (GH-186, patch by tomoya ishida)

Thanks: NAITOH Jun, Watson, tomoya ishida

REXML 3.3.2 - 2024-07-16

Improvements:
- Improved parse performance (GH-160, patch by NAITOH Jun)
- Improved parse performance (GH-169, GH-170, GH-171, GH-172, GH-173, GH-174, GH-175, GH-176)

... (truncated)

Changelog (from rexml's NEWS.md): same 3.3.3 and 3.3.2 entries as the release notes above.

Commits:
- e4a067e Add 3.3.3 entry
- 17ff3e7 test: add a performance test for attribute list declaration
- be86b3d test: fix wrong test name
- b93d790 test: use double quote for string literal
- 0fbe7d5 test: don't use abbreviated name
- 1599e87 test: add a performance test for PI with many tabs
- e2546e6 parse pi: improve invalid case detection
- 73661ef test: fix a typo
- 850488a test: use double quote for string literal
- 46c6397 test: add performance tests for entity declaration
- Additional commits viewable in the compare view: https://github.com/ruby/rexml/compare/v3.2.8...v3.3.3

[![Dependabot compatibility score](https://dependabot-badges.githubapp.com/badges/compatibility_score?dependency-name=rexml&package-manager=bundler&previous-version=3.2.8&new-version=3.3.3)](https://docs.github.com/en/github/managing-security-vulnerabilities/about-dependabot-security-updates#about-compatibility-scores)

Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132469
Approved by: https://github.com/ezyang
2024-08-06 21:17:24 +00:00
e47b684c33 Revert "Temp disable MKL in DistributionKernels.cpp (#132532)"
This reverts commit 7b2664ece6a961ce9e4557be913c2cead09c7390.

Reverted https://github.com/pytorch/pytorch/pull/132532 on behalf of https://github.com/PaliC due to causing numerical instability issues internally ([comment](https://github.com/pytorch/pytorch/pull/132532#issuecomment-2272136210))
2024-08-06 20:57:09 +00:00
94155ce31b [Torch] Support meta device in checkpoint (#132684)
Summary:
## Why
utils.checkpoint doesn't support meta device:

```
  File "/Users/lyu1/torchdev/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 490, in checkpoint
    next(gen)
  File "/Users/lyu1/torchdev/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 1359, in _checkpoint_without_reentrant_generator
    device_module = _get_device_module(device)
  File "/Users/lyu1/torchdev/lib/python3.9/site-packages/torch/utils/checkpoint.py", line 98, in _get_device_module
    device_module = getattr(torch, device)
  File "/Users/lyu1/torchdev/lib/python3.9/site-packages/torch/__init__.py", line 1938, in __getattr__
    raise AttributeError(f"module '{__name__}' has no attribute '{name}'")
AttributeError: module 'torch' has no attribute 'meta'
```

This blocks us from running model with checkpoint enabled in meta mode.
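A minimal sketch of the failure path (illustrative only; the actual fix is in checkpoint.py's device-module lookup):

```python
import torch
from torch.utils.checkpoint import checkpoint

# Build a module and an input under the meta device.
with torch.device("meta"):
    lin = torch.nn.Linear(8, 8)
    x = torch.randn(4, 8)

# Before this fix, the non-reentrant path called _get_device_module("meta"),
# which did getattr(torch, "meta") and raised the AttributeError shown above.
out = checkpoint(lin, x, use_reentrant=False)
```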

## What
This diff handles the case of meta device in checkpoint.py.

(In checkpoint.py, the device module is mainly used when preserve_rng_state=True, which doesn't apply to the meta case. So a more elegant fix might be to set preserve_rng_state=False when detecting that the args are on a meta device, but I didn't find where to do this check in a minimal way. Let me know if you have ideas.)

Test Plan: Tested with toy model which has checkpoint on its module: P1513716944

Differential Revision: D60749427

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132684
Approved by: https://github.com/kit1980
2024-08-06 20:45:50 +00:00
de00c79583 [dynamo][inline_inbuilt_nn_modules] Mark nn module tensor static for cudagraphs (#132736)
Fixes https://github.com/pytorch/pytorch/issues/132714

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132736
Approved by: https://github.com/mlazos
ghstack dependencies: #132538
2024-08-06 20:13:28 +00:00
1954bfacda [Inductor] Small performance, precision, and dependency updates to B2B-GEMM (#132354)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132354
Approved by: https://github.com/masnesral
2024-08-06 20:01:27 +00:00
775c310c0c Preserve source_fn_stack in the training IR decomp (#132033)
Title

Differential Revision: [D60377712](https://our.internmc.facebook.com/intern/diff/D60377712/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132033
Approved by: https://github.com/angelayi
ghstack dependencies: #131988, #131995, #131999
2024-08-06 19:45:40 +00:00
4faa5804f6 [c10d] Used float tensor for PG NCCL barrier all-reduce (#132701)
This helps avoid a CUDA illegal memory access in the NCCL all-reduce part of `barrier()` when the CUDA caching allocator is disabled. NCCL all-reduce seems to assume reading at least 4 bytes. See https://github.com/pytorch/pytorch/issues/132640 for more context.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132701
Approved by: https://github.com/wanchaol, https://github.com/fegin
2024-08-06 19:35:37 +00:00
1e65ccc3de [inductor] export kernel for gemm template. (#132580)
Changes:
1. Move `get_export_declaration` to global scope.
2. Export kernel for gemm template.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132580
Approved by: https://github.com/ezyang
2024-08-06 18:52:22 +00:00
81a5a7a30a [Quantizer] Fix getattr for quantizing constants (#132705)
Mobilebert quantization was failing because there were embedding constants that could not be accessed through getattr().

It seems that we have to search the submodule for the embeddings, which we do here. This is just to help get around looking at unlifted attrs to check whether they are large scalars.

Differential Revision: [D60492338](https://our.internmc.facebook.com/intern/diff/D60492338/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132705
Approved by: https://github.com/jerryzh168
ghstack dependencies: #132704
2024-08-06 18:16:27 +00:00
c2bccfd431 [BE] Simplify code interacting with get_proxy_mode/enable_tracing (#132675)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132675
Approved by: https://github.com/Skylion007, https://github.com/ydwu4, https://github.com/zou3519
ghstack dependencies: #132674
2024-08-06 18:13:22 +00:00
1de4ebc85d [Quantizer] Fix Maxpool2d share q params (#132704)
There seems to be a bug in the code for sharing q params for maxpool2d. This case occurs when output_node = maxpool_node. When this happens we overwrite the node's "quantization_annotation" metadata. This fix ensures that qparams are indeed shared across input and output

Differential Revision: [D60492341](https://our.internmc.facebook.com/intern/diff/D60492341/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132704
Approved by: https://github.com/jerryzh168
2024-08-06 18:13:16 +00:00
db0bd04151 [AOTI] Switch to use shim v2 for fbcode (#132750)
Summary: As title

Test Plan: CI

Reviewed By: hl475, ColinPeppler

Differential Revision: D57899065

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132750
Approved by: https://github.com/angelayi
2024-08-06 17:57:32 +00:00
8d2c272e5a properly register conjugate/neg fallthroughs to prim ops (#132699)
A few aten ops (like `clone` and `copy_` get fallthrough registrations to the Conjugate/Negative keys. We haven't been giving the same treatment to their corresponding `prims` variants, which can cause infinite loops in some cases.

Fixes an infinite loop that showed up in tests from https://github.com/pytorch/pytorch/pull/132563

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132699
Approved by: https://github.com/albanD
2024-08-06 17:57:04 +00:00
c6582f11cd Add get_optin_feature() to allow opt-in to amz2023 (#131792)
This extends the runner determinator to be able to opt-in to keywords
to provide additional options when determining which systems to run
jobs on. This enables us to support opt-in users to Amazon Linux 2023.

This change creates a generic get_optin_feature() which hopefully will
be useful to handle additional future features that we might want to
experiment with.

This change keeps backwards compatibility with the existing issue
userlist format and adds support for the comma-separated list of users
in a backwards-compatible way.

The user list has the following rules:

- Users are GitHub usernames with the @ prefix
- If the first line is a "*" then all users will use the new runners
- If the first line is a "!" then all users will use the old runners
- Each user is also a comma-separated list of features/experiments to enable
- A "#" prefix indicates the user is opted out of the new runners but is opting
  into features/experiments.

Example user list:

```
@User1
@User2,amz2023
#@UserOptOutOfNewRunner,amz2023
```
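An illustrative parser for one line of that list (a sketch only, not the repo's actual runner_determinator code):

```python
def parse_user_line(line: str):
    """Return (username, features, use_new_runners) for one opt-in line."""
    line = line.strip()
    opted_out = line.startswith("#")   # "#": stay on old runners but enable features
    user, *features = line.lstrip("#").split(",")
    return user.lstrip("@"), features, not opted_out

print(parse_user_line("#@UserOptOutOfNewRunner,amz2023"))
# ('UserOptOutOfNewRunner', ['amz2023'], False)
```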

This closes pytorch/ci-infra#249.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131792
Approved by: https://github.com/jeanschmidt, https://github.com/ZainRizvi
2024-08-06 17:54:20 +00:00
e3394e5548 torch.autograd.graph.increment_version: accept List[Tensor], use in AOTDispatcher (#132652)
The regression from https://github.com/pytorch/pytorch/issues/132281 pinpoints e4ace1a396 as the cause. The main delta that commit introduces is that we now manually check `is_inference()` and call `increment_version()` (a pybind call) on every mutated input tensor to the graph.

This PR attempts to reduce overhead a bit by bundling up all of those checks into a single pybind call, by:

(1) updating `torch.autograd.graph.increment_version()` to accept a `Union[Tensor, List[Tensor]]`

(2) updating its semantics to no-op if you pass in a tensor with no version counter, instead of erroring
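A minimal sketch of the updated call, per point (1) above (the no-op behavior for tensors without a version counter is assumed as described in (2)):

```python
import torch

a, b = torch.randn(3), torch.randn(3)
versions = (a._version, b._version)

# One pybind call bumps the version counter of every tensor in the list.
torch.autograd.graph.increment_version([a, b])

assert (a._version, b._version) == (versions[0] + 1, versions[1] + 1)
```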

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132652
Approved by: https://github.com/albanD
2024-08-06 17:46:48 +00:00
af67b8df6d [export] Fix exportdb test (#132678)
Summary:
Fix exportdb test for tensor_setattr.

copy.deepcopy of the inputs can fail if tensor inputs have extra attributes (i.e. a non-empty __dict__).

We remove it before deepcopy.

Before the fix, we have

```
inputs[0].__dict__
{'attr': FakeTensor(..., size=(3, 2))}
```

the test errors out with

```
======================================================================
ERROR: test_exportdb_supported_case_tensor_setattr (caffe2.test.export.test_serialize.TestDeserialize)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/a915c8ae5cba5b70/caffe2/test/__test_export__/test_export#link-tree/torch/testing/_internal/common_utils.py", line 529, in instantiated_test
    test(self, **param_kwargs)
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/a915c8ae5cba5b70/caffe2/test/__test_export__/test_export#link-tree/caffe2/test/export/test_serialize.py", line 878, in test_exportdb_supported
    self.check_graph(model, case.example_args, _check_meta=_check_meta)
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/a915c8ae5cba5b70/caffe2/test/__test_export__/test_export#link-tree/caffe2/test/export/test_serialize.py", line 548, in check_graph
    _check_graph(pre_dispatch=True)
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/a915c8ae5cba5b70/caffe2/test/__test_export__/test_export#link-tree/caffe2/test/export/test_serialize.py", line 506, in _check_graph
    copy.deepcopy(inputs),
  File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 211, in _deepcopy_tuple
    y = [deepcopy(a, memo) for a in x]
  File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 211, in <listcomp>
    y = [deepcopy(a, memo) for a in x]
  File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 153, in deepcopy
    y = copier(memo)
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/a915c8ae5cba5b70/caffe2/test/__test_export__/test_export#link-tree/torch/_tensor.py", line 206, in __deepcopy__
    new_tensor.__dict__ = deepcopy(self.__dict__, memo)
  File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 146, in deepcopy
    y = copier(x, memo)
  File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 231, in _deepcopy_dict
    y[deepcopy(key, memo)] = deepcopy(value, memo)
  File "/usr/local/fbcode/platform010/lib/python3.10/copy.py", line 153, in deepcopy
    y = copier(memo)
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/a915c8ae5cba5b70/caffe2/test/__test_export__/test_export#link-tree/torch/_tensor.py", line 108, in __deepcopy__
    or (type(self) is not Tensor and self.data_ptr() == 0)
RuntimeError: Cannot access data pointer of Tensor (e.g. FakeTensor, FunctionalTensor). If you're using torch.compile/export/fx, it is likely that we are erroneously tracing into a custom kernel. To fix this, please wrap the custom kernel into an opaque custom op. Please see the following for details: https://pytorch.org/tutorials/advanced/custom_ops_landing_page.html
```
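A minimal sketch of the workaround described above (illustrative names; in the failing test the attribute held a FakeTensor, which is what made the deepcopy blow up):

```python
import copy
import torch

x = torch.randn(3, 2)
x.attr = torch.randn(3, 2)     # tensor input carrying an extra attribute
print(x.__dict__)              # {'attr': tensor(...)}

# Clear the attribute dict so deepcopy doesn't recurse into it, then restore it.
saved = dict(x.__dict__)
x.__dict__.clear()
inputs_copy = copy.deepcopy((x,))
x.__dict__.update(saved)
```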

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r  test_exportdb_supported_case_tensor_setattr
```

Differential Revision: D60610860

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132678
Approved by: https://github.com/zhxchen17
2024-08-06 17:45:10 +00:00
e6eee04875 dynamo: use equality guards instead of id guards for Placement/DeviceMesh (#124401)
After talking to @anijain2305, we probably can't land this since it won't work for C++ guards. But we should still be able to do better than ID_MATCH

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124401
Approved by: https://github.com/anijain2305
2024-08-06 17:14:44 +00:00
f50621989b Construct NJT without graph breaks (#130292)
Combines contributions from https://github.com/pytorch/pytorch/pull/130505

Some context can be found in this large comment block:

a5b64d39fd/test/dynamo/test_subclasses.py (L1667-L1681)

Changes in this PR
- For each tensor fakified, check the nested int registry in eager, and eagerly symbolicize if that tensor has already been associated with nested int in eager.
- Adds a separate counter stored on FakeTensorMode as a fake analog to _tensor_id_counter (which keeps track of unique tensors). This counter is initialized to the global eager tensor id counter upon creation of the FakeTensorMode, and needs to be reset when the same FakeTensorMode is reused to trace again (in this PR, we piggyback on the epoch incrementing logic).
- (refactor) Today, we store FakeTensor -> symbolic nested int in the global registry. With this PR, symbolic nested int is stored directly on the FakeTensor. (Eager still caches nested int in the registry, though we should avoid this at some point.)

Basically unchanged, but worth noting:
- `__tensor_unflatten__` is still responsible for determining whether we should cache for now. The logic is somewhat simplified.
- to_copy is still using the trick of updating two different tensors in the registry to point to the same nested int. This is kind of broken, but we try to leave it as is, and plan a better fix with the UnionFind stack.

Differential Revision: [D60406772](https://our.internmc.facebook.com/intern/diff/D60406772)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130292
Approved by: https://github.com/bdhirsh
ghstack dependencies: #131916, #131803
2024-08-06 17:03:39 +00:00
406b50835b Use FakeTensor cache for subclass inner tensors (#131803)
Rewrite of original PR in https://github.com/pytorch/pytorch/pull/130291

To answer review comments from https://github.com/pytorch/pytorch/pull/130291#pullrequestreview-2166671953:

> At a higher level, do we need this?

Today, this should not change the behavior of anything. But an invariant of "same tensor always corresponds to the same FakeTensor" is nice (from discussion with @bdhirsh).

> Why does this happen?

Today, both dynamo and meta_utils do some recursion when it comes to FakeTensors. So whenever we fakify a subclass, the process looks roughly like:

```
wrap_to_fake (subclass)
   meta_utils (subclass)
      meta_utils (values) -> not cached because we use callback
      meta_utils(offsets) -> not cached because we use callback
  wrap_to_fake (values)
  wrap_to_fake (offsets) -> cached because we rely on top-level meta_utils
```

However, we know that:
- Caching only occurs at the top-level of meta_utils.
- The return value of the top-level wrap_to_fake is returned.

This means that after all of this:
- The fakified subclass holds inner FakeTensors that are NOT part of the cache
- values/offsets are Fakified a second time, and those instances are cached.

Differential Revision: [D60406773](https://our.internmc.facebook.com/intern/diff/D60406773)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131803
Approved by: https://github.com/ezyang
ghstack dependencies: #131916
2024-08-06 17:03:39 +00:00
a94c441e48 Fix symbolic nested int printing (#131916)
Differential Revision: [D60406775](https://our.internmc.facebook.com/intern/diff/D60406775)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131916
Approved by: https://github.com/Skylion007, https://github.com/jbschlosser
2024-08-06 17:03:39 +00:00
ffdf48e63b Consolidate SymDispatchMode into ProxyTensorMode (#132674)
Instead of having a separate context variable for SymDispatchMode, we
now simply delegate to the current active proxy tensor mode when we
need to trace a SymInt.  We maintain a separate `__sym_dispatch__` magic
method as the calling convention is different than `__torch_dispatch__`.

Consolidating the modes in this ways means that we can consistently
disable both of these modes in tandem simply by removing the mode
from the proxy mode infra slot.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132674
Approved by: https://github.com/zou3519, https://github.com/bdhirsh
2024-08-06 17:03:17 +00:00
7045bc5a77 [export] change error message for specializations (#132698)
https://github.com/pytorch/pytorch/pull/130775 recently killed forced specializations for export on complex guards, so the only way we now get a specialized value is if we're able to solve for it. For example, if we have guards `s0 * 2 = s1`, `s0 + 6 = s1`, we specialize `s0 = 6; s1 = 12`.

That might look like this:
```
class Foo(torch.nn.Module):
    def forward(self, x, y):
        return x.reshape([-1]) + y

dy = Dim("dy", min=6)
x, y = torch.randn(6, 2), torch.randn(12)
dynamic_shapes = {
    "x": (dy - 6, 2),
    "y": (dy,),
}
```

Our current error message is:
`{symbol} must be specialized to {value} because the guards generated for it are too complex`
This is now misleading, so we change it to:
`solving the guards generated for {symbol} resulted in a specialized value of {value}`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132698
Approved by: https://github.com/avikchaudhuri
2024-08-06 16:59:53 +00:00
ca7ce2fca1 [ts-migration][1/N]: Add prim::Loop for constant number of iterations and condition (#131418)
#### Description
This PR adds prim::Loop support for the simplest case, where the number of iterations is constant and the loop termination condition is also a constant.

[PR by stages](https://docs.google.com/document/d/1q6OprW3HBHbYPwEyE_DikBn-uzmhnN284Cmen_CnlhI/edit?usp=sharing)

#### Test Plan
Add a repro example.
* `pytest test/export/test_converter.py -s -k test_ts2ep_with_loop`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131418
Approved by: https://github.com/angelayi
2024-08-06 16:51:08 +00:00
c803e35c4b Reduce number of guards introduced by check_cudnn_tensor_shapes when cudnn version is higher enough (#132384)
I found that when using TorchDynamo (torch.compile) with dynamic shapes on H100, some extra guards are added to check that the sequence length of the inputs to `scaled_dot_product_attention` is divisible by 64. These guards cause unwanted recompilations when the input shape changes.

In fact, these guards are not necessary if our cuDNN version is high enough, so I change the order of those checks to use short-circuit evaluation to skip them and avoid unnecessary guards.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132384
Approved by: https://github.com/eqy, https://github.com/Skylion007
2024-08-06 16:48:13 +00:00
fc7849b93f [pt2e][quant] Ensure BN node is erased after convert (#131651)
Summary: Previously, when folding BN into conv, we rely on DCE
to clean up the unused BN node from the graph. This works if
the model is already in eval mode, but fails if the model is
still in train mode because DCE doesn't remove nodes with
potential side effects (in this case `_native_batch_norm_legit`).
This required users to move the model to eval mode before calling
convert in order to get a properly DCE'd graph.

To solve this, we manually erase the BN node after folding
instead of relying on DCE. This relaxes the ordering constraints
between `move_exported_model_to_eval` and `convert_pt2e`.

Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn1d.test_fold_bn_erases_bn_node
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn2d.test_fold_bn_erases_bn_node

Reviewers: jerryzh168, yushangdi

Subscribers: jerryzh168, yushangdi, supriyar

Differential Revision: [D60520149](https://our.internmc.facebook.com/intern/diff/D60520149)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131651
Approved by: https://github.com/yushangdi, https://github.com/leslie-fang-intel
2024-08-06 16:37:39 +00:00
679cdf606a Converted __all__ literal tuple to literal list. (#132404)
Partial Fix for #131765.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132404
Approved by: https://github.com/soulitzer
2024-08-06 15:12:32 +00:00
6753ee127c Allow torch.cuda.memory.mem_get_info to take a device str argument with an unspecified device index. (#132616)
`torch.cuda.memory.mem_get_info` allows device strings given the current type hints. However, `device = torch.device('cuda')` leads to `device.index = None`, which results in downstream problems. Setting `optional=True` will insert the default device index in such cases.
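A small sketch of the call that used to trip over the missing index (behavior per this description):

```python
import torch

if torch.cuda.is_available():
    # "cuda" without an index leaves device.index = None; with optional=True the
    # current device's index is filled in instead of failing downstream.
    free, total = torch.cuda.memory.mem_get_info("cuda")
    print(f"{free / 1e9:.2f} GB free of {total / 1e9:.2f} GB")
```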

Fixes #132583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132616
Approved by: https://github.com/soulitzer
2024-08-06 13:19:46 +00:00
7100c36c8a Revert "[inductor] export kernel for gemm template. (#132580)"
This reverts commit 87d46d70d7754e32eb0e6689688f4336e4e7c955.

Reverted https://github.com/pytorch/pytorch/pull/132580 on behalf of https://github.com/PaliC due to sys is not defined in torch/_inductor/codegen/cpp_utils.py ([comment](https://github.com/pytorch/pytorch/pull/132580#issuecomment-2271264974))
2024-08-06 13:15:15 +00:00
cyy
656a4d1408 [6/N] Fix clang-tidy warnings in aten/src/ATen (#132620)
Follows #132565

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132620
Approved by: https://github.com/Skylion007
2024-08-06 13:07:16 +00:00
a8f0979962 Add cudagraph static inputs logging (#132726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132726
Approved by: https://github.com/anijain2305
2024-08-06 12:01:20 +00:00
da320214e6 Format tensor (#127992)
Align tensor display
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127992
Approved by: https://github.com/janeyx99
2024-08-06 07:10:16 +00:00
728374d7f7 Changed create_block_mask to just accept BLOCK_SIZE (#132697)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132697
Approved by: https://github.com/drisspg
2024-08-06 04:37:15 +00:00
91df66ee74 [caffe2] Wrap constexpr with preprocessor statements (#132582)
Summary: When the preprocessor check fails we leave an unused constexpr around, so when `-Wunused-const-variable` is enabled we get an error. Let's inline these values, since they're not used anywhere else, in order to avoid this.

Test Plan: CI

Differential Revision: D60723823

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132582
Approved by: https://github.com/houseroad
2024-08-06 04:35:06 +00:00
4260f365ba [inductor] Replace torch.allclose with torch.testing.assert_close in test_fx_fusion (#130618)
Preventative fix for a test failure with the oneDNN v3.5 upgrade, where the order of float32 arithmetic may change in torch.addmm (the bias term can be at the start or end of the arithmetic), resulting in slightly different output due to float32 precision loss.

Replaced occurrences of torch.allclose with ~~torch._dynamo.testing.same~~ torch.testing.assert_close, which is the recommended approach per https://github.com/pytorch/pytorch/issues/56544; its default tolerance is more relaxed than torch.allclose, which satisfies the test with the upcoming oneDNN change.

This should fix aarch64 ci failures in #129932

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130618
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-08-06 03:58:43 +00:00
4e610924d4 [c10d] Add a new API for adding ephemeral timeout for one local rank and the timeout will reset when the first collective finishes (#130905)
We provide an API for users to add an ephemeral timeout across all PGs within one rank; the timeout resets when the first collective issued after the timeout was added finishes.

Each extension only covers collectives issued after the API call and before the first collective finishes. The diagram below shows how the timeout changes:

<img width="1174" alt="image" src="https://github.com/user-attachments/assets/354923b7-581c-40de-ae0f-1cd3da273ccc">

While this feature provides flexibility in specific scenarios, it introduces statefulness to timeout setting. Therefore, it is advisable to use this API sparingly and consider alternative approaches, such as directly setting the timeout or utilizing a barrier collective (one can set any timeout to the barrier), whenever feasible.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130905
Approved by: https://github.com/ezyang
2024-08-06 03:47:58 +00:00
39c9b75a68 Add registration mechanism for aoti model runner (#131638)
The current AOTI model runner supports CUDA and CPU. However, for a particular out-of-tree backend it is not easy to support this feature.

This PR provides a registration mechanism to support this case via two APIs: `RegisterAOTIModelRunner` and `getAOTIModelRunnerRegistry`.

- `RegisterAOTIModelRunner` is used to register a function(`AOTIModelRunnerABC`) to create a `AOTIModelContainerRunner`. The function signature is as follows.

    ```C++
    using AOTIModelRunnerABC = std::shared_ptr<AOTIModelContainerRunner> (*)(
        const std::string& model_so_path,
        size_t num_models,
        const std::string& device_str,
        const std::string& bin_dir);
    ```
- `getAOTIModelRunnerRegistry` is used to get all the registered backends.

A new backend needs to define its `AOTIModelContainerRunner` class and then register an `AOTIModelRunnerABC` function with `aoti` to create its `AOTIModelContainerRunner`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131638
Approved by: https://github.com/desertfire, https://github.com/jansel
2024-08-06 02:47:35 +00:00
345bea01dc Refactor thunkify to return proper thunk abstraction (#132407)
This is superior to lru_cache because (1) it's more explicit and (2) it
doesn't leak the original function after it's been forced.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132407
Approved by: https://github.com/albanD
2024-08-06 02:35:45 +00:00
93fad2f0f2 [export] Fix import in D60427208 (#132707)
Summary:
D60427208 broke the APS release by failing our NE deterministic test. https://www.internalfb.com/intern/test/562950111197340/

This Diff fixes it.

Test Plan:
```
buck2 run 'fbcode//mode/dev-nosan' fbcode//aps_models/ads/gmp/tests/ne/e2e_deterministic_tests:gmp_e2e_ne_tests -- --filter-text test_mtml_instagram_model_474023725_single_gpu_with_ir
```

Differential Revision: D60790203

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132707
Approved by: https://github.com/ydwu4
2024-08-06 02:35:17 +00:00
2f16e68cab [Intel GPU] Allow XPU device in copy, cdist, index_put_impl (#130088)
# Motivation
The `copy`, `cdist`, and `index_put_impl` operators use `op_stub` for runtime dispatching inside the operators. An extra device list inside them ensures accuracy, but XPU is not in that list. This PR adds XPU to them as a supported device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130088
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
ghstack dependencies: #130019, #130082
2024-08-06 01:55:50 +00:00
38674bcb45 Revert "Conversions between strided and jagged layouts for Nested Tensors (#115749)"
This reverts commit eca0cb0fbe84bb0a34fa94afe261bceecd52c436.

Reverted https://github.com/pytorch/pytorch/pull/115749 on behalf of https://github.com/izaitsevfb due to breaks test_overrides.py::TestTorchFunctionWarning::test_warn_on_invalid_torch_function_tensor_subclass ([comment](https://github.com/pytorch/pytorch/pull/115749#issuecomment-2270213988))
2024-08-06 01:55:41 +00:00
d6a24b3b92 Removed duplicate __all__ declarations. (#132405)
Partial Fix for #131765.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132405
Approved by: https://github.com/soulitzer
2024-08-06 01:17:44 +00:00
96471ea47c [inductor] support vectorization for torch.any(bool) -> bool (#132472)
Support the `any` reduction from `bool` to `bool`.
Test Plan:
```
python test/inductor/test_cpu_repro.py -k test_any_bool_vec
```

Generated code for `test_any_bool_vec`
```
cpp_fused_any_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'bool*', 'bool*'], '''
#include "/tmp/torchinductor_root/ky/cky2bufythacofebk7ujv36e4pxyqcqbpsy5r4vojoprjiwcwfxf.h"
extern "C"  void kernel(const float* in_ptr0,
                       const float* in_ptr1,
                       bool* out_ptr0,
                       bool* out_ptr1)
{
    {
        {
            bool tmp_acc0 = 0;
            at::vec::VecMask<float,1> tmp_acc0_vec = at::vec::VecMask<float,1>::from(0);
            bool tmp_acc0_arr[64];
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_arr[tid] = 0;
            }
            at::vec::VecMask<float,1> tmp_acc0_vec_arr[64];
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_vec_arr[tid] = at::vec::VecMask<float,1>::from(0);
            }
            #pragma omp parallel num_threads(64)
            {
                int tid = omp_get_thread_num();
                bool tmp_acc0_local = 0;
                at::vec::VecMask<float,1> tmp_acc0_vec_local = at::vec::VecMask<float,1>::from(0);
                #pragma omp for
                for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(16L))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<long>(x0), 16);
                    auto tmp1 = at::vec::VecMask<float,1>::from<float,1>(tmp0);
                    tmp_acc0_vec_local = tmp_acc0_vec_local | tmp1;
                }
                tmp_acc0_arr[tid] = tmp_acc0_local;
                tmp_acc0_vec_arr[tid] = tmp_acc0_vec_local;
            }
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0 = tmp_acc0 || tmp_acc0_arr[tid];
            }
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_vec = tmp_acc0_vec | tmp_acc0_vec_arr[tid];
            }
            tmp_acc0 = tmp_acc0 || at::vec::vec_reduce_all<bool>([](at::vec::Vectorized<bool>& x, at::vec::Vectorized<bool>& y) { return x | y; }, tmp_acc0_vec.to<bool, 1>());
            out_ptr0[static_cast<long>(0L)] = static_cast<bool>(tmp_acc0);
        }
    }
    {
        {
            bool tmp_acc0 = 0;
            at::vec::VecMask<float,1> tmp_acc0_vec = at::vec::VecMask<float,1>::from(0);
            bool tmp_acc0_arr[64];
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_arr[tid] = 0;
            }
            at::vec::VecMask<float,1> tmp_acc0_vec_arr[64];
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_vec_arr[tid] = at::vec::VecMask<float,1>::from(0);
            }
            #pragma omp parallel num_threads(64)
            {
                int tid = omp_get_thread_num();
                bool tmp_acc0_local = 0;
                at::vec::VecMask<float,1> tmp_acc0_vec_local = at::vec::VecMask<float,1>::from(0);
                #pragma omp for
                for(long x0=static_cast<long>(0L); x0<static_cast<long>(64L); x0+=static_cast<long>(16L))
                {
                    auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr1 + static_cast<long>(x0), 16);
                    auto tmp1 = at::vec::VecMask<float,1>::from<float,1>(tmp0);
                    tmp_acc0_vec_local = tmp_acc0_vec_local | tmp1;
                }
                tmp_acc0_arr[tid] = tmp_acc0_local;
                tmp_acc0_vec_arr[tid] = tmp_acc0_vec_local;
            }
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0 = tmp_acc0 || tmp_acc0_arr[tid];
            }
            for (int tid = 0; tid < 64; tid++)
            {
                tmp_acc0_vec = tmp_acc0_vec | tmp_acc0_vec_arr[tid];
            }
            tmp_acc0 = tmp_acc0 || at::vec::vec_reduce_all<bool>([](at::vec::Vectorized<bool>& x, at::vec::Vectorized<bool>& y) { return x | y; }, tmp_acc0_vec.to<bool, 1>());
            out_ptr1[static_cast<long>(0L)] = static_cast<bool>(tmp_acc0);
        }
    }
}
''')
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132472
Approved by: https://github.com/jgong5
2024-08-06 01:03:51 +00:00
26c6786109 return_and_correct_aliasing: skip dispatcher when swapping storage (#132524)
`return_and_correct_aliasing` is used by FunctionalTensor today to ensure that when we call view/inplace ops, the input and output `FunctionalTensors` share the same storage.

This was previously done with a dispatcher call to `aten.set_`. In this PR I swap it out with a util that just manually does the storage swap. Benefits:

(1) we know this is safe in the specific way it is used by FunctionalTensor: avoiding the extra assertions in `aten.set_` is necessary to avoid some unbacked symint errors

(2) this should improve compile times a bit

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132524
Approved by: https://github.com/ezyang
ghstack dependencies: #132243, #132337, #132322
2024-08-06 00:44:35 +00:00
eca0cb0fbe Conversions between strided and jagged layouts for Nested Tensors (#115749)
This PR does 3 things:
1. Adds a copy-free strided->jagged layout conversion for NT
2. Adds a copy-free jagged->strided layout conversion for NT
3. Modifies and expands the .to() API to support the layout argument for the specific case of NT layout conversion.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115749
Approved by: https://github.com/jbschlosser
2024-08-05 23:45:48 +00:00
4306eebab1 [DeviceMesh] Update slicing documentation to include nD and non-continuous slicing (#132311)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132311
Approved by: https://github.com/wanchaol
ghstack dependencies: #132310
2024-08-05 23:44:23 +00:00
1add8c5f1c [Easy][DTensor] Rename args_sharding to args_schema for OpSchema __str__ (#132187)
Looks like we don't use the name `args_sharding` anywhere else so just changing it to `args_schema` for naming consistency

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132187
Approved by: https://github.com/wanchaol
2024-08-05 23:40:19 +00:00
cyy
3ef45e5669 Fix ODR (#131032)
Fixes ODR violation

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131032
Approved by: https://github.com/ezyang
2024-08-05 23:19:49 +00:00
a74e5abda4 Fix issues in activation_memory_budget for float8 (#132687)
Summary:
When using activation_memory_budget for float8 training, two issues were noticed:

- When `aggressive_options` (https://fburl.com/code/m1yoskxw) is called, all fp8 gemms (the scaled_mm op) are saved for recomputation.
- After adding "scaled_mm" to `compute_intensive_ops`, we got the following error from `estimate_runtime`: `mat2 must be col_major` (raised by `meta_scaled_mm`).
To fix it, we modified `materialize_arg` to also include the stride of the original tensor.
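A minimal sketch of enabling the budget knob referenced in the test plan below (the config and backend names are taken from the partitioner tests elsewhere in this log, not from the internal float8 setup):

```python
import torch
import torch._functorch.config as functorch_config

def f(x, w):
    return torch.nn.functional.relu(x @ w).sum()

# 1.0 keeps the default set of saved activations; smaller budgets force more recomputation.
functorch_config.activation_memory_budget = 0.5
compiled = torch.compile(f, backend="aot_eager_decomp_partition")
loss = compiled(torch.randn(64, 64, requires_grad=True), torch.randn(64, 64))
loss.backward()
```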

Test Plan: Run float8 training with `activation_memory_budget`.

Differential Revision: D60777297

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132687
Approved by: https://github.com/Chillee
2024-08-05 23:01:35 +00:00
a4ed8eeb33 [hop] makes compiled hops not share code objects (#132427)
Fixes code object sharing issue in https://github.com/pytorch/pytorch/issues/132417.

Before this PR, compiled hops such as cond and flex_attention are wrapped by _dynamo/external_utils.py:wrap_inline. This causes them to share the same code object. There is a condition surrounding the wrap_inline call, and it currently passes.

We make hops fail the check so that they don't share code objects by adding them to LEGACY_MOD_INLINELIST. Adding them to MOD_INLINELIST doesn't work because trace_rules.check(fn) doesn't check MOD_INLINELIST by default.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132427
Approved by: https://github.com/jansel, https://github.com/anijain2305
2024-08-05 22:59:05 +00:00
4a2cf50edf [export][reland] Convert autocast to HOO (#132677)
Summary:
Reland of D60206382.

Suggested in https://github.com/pytorch/pytorch/issues/128394.

If there's an autocast context manager, the predispatch (strict) graph can look something like:

```
class <lambda>(torch.nn.Module):
    def forward(self, x: "f32[1]"):
        ...
        _enter_autocast = torch.amp.autocast_mode._enter_autocast('cuda', torch.bfloat16, True, None)
        mm: "f32[8, 8]" = torch.ops.aten.mm.default(rand, rand_1);  rand = rand_1 = None
        _exit_autocast = torch.amp.autocast_mode._exit_autocast(_enter_autocast);  _enter_autocast = None
        return (mm_1,)
```

But the operator `torch.amp.autocast_mode._enter_autocast` is not a valid ATen op. We remove these nodes by turning autocast into a higher order operator and make a submodule for the blocks between `_enter_autocast` and `_exit_autocast`.

Some potential followup improvement:
1) Merge some of the duplicated logic with `replace_set_grad_with_hop_pass.py`
2) Check the current autocast status (any enabled? dtype?) and not create a submodule if the autocast args matches current autocast status.

Test Plan:
CI

```
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r "test_predispatch_autocast"
buck2 run 'fbcode//mode/dev-nosan' fbcode//caffe2/test:test_export -- -r "test_predispatch_set_grad"
```

Verified that now we can export the llama model in  gh issue 128394 and the gemma model in  gh issue 131829 without error.

Differential Revision: D60770038

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132677
Approved by: https://github.com/angelayi
2024-08-05 22:34:52 +00:00
ea42027e0e [micro_pipeline_tp] support all _scaled_mm args (#131984)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131984
Approved by: https://github.com/weifengpy
2024-08-05 21:44:37 +00:00
2b5e31d099 Move sigmoid run_const_graph HOP to PyTorch core (#132526)
Summary: When HOPs live out of tree, it makes it impossible to make breaking changes to the HOP API. But HOP implementations are deeply entwined with PyTorch internals. Move the HOP into PyTorch tree so that changes are possible.

Test Plan: sandcastle and oss ci

Differential Revision: D60674861

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132526
Approved by: https://github.com/SherlockNoMad
2024-08-05 21:40:56 +00:00
af8b8a47cb fsdp.set_: convey to functionalization that it mutates storage (#132322)
Fixes https://github.com/pytorch/pytorch/issues/132197

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132322
Approved by: https://github.com/albanD, https://github.com/yf225
ghstack dependencies: #132243, #132337
2024-08-05 21:28:59 +00:00
1a0db29932 move torch._functionalize APIs to pybind. add one for marking storage mutations (#132337)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132337
Approved by: https://github.com/albanD, https://github.com/justinchuby
ghstack dependencies: #132243
2024-08-05 21:28:59 +00:00
4db368a475 make functorch CSE respect mutations as barriers (like fsdp.set_) (#132243)
Fixes https://github.com/pytorch/pytorch/issues/132200

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132243
Approved by: https://github.com/albanD, https://github.com/zou3519, https://github.com/yf225
2024-08-05 21:28:55 +00:00
ee0ae11b34 Fix a typo in the example code. (#132601)
Since the backward multiplies the gradient by `n`, we must change the forward function to multiply the input tensor by `n`.
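A self-contained sketch of the consistency being fixed (an illustrative custom Function, not the doc's exact example):

```python
import torch

class MulN(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, n):
        ctx.n = n
        return x * n                      # forward scales by n ...

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out * ctx.n, None     # ... so the backward's factor of n matches

x = torch.randn(3, requires_grad=True)
MulN.apply(x, 4.0).sum().backward()
print(x.grad)                             # every entry is 4.0
```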

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132601
Approved by: https://github.com/soulitzer
2024-08-05 21:04:20 +00:00
9a1ad3345f Fix periodic windows test (#132648)
This test fails to clean up folders on windows for the past week, see 27f61eba58 for example

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132648
Approved by: https://github.com/janeyx99, https://github.com/zou3519, https://github.com/malfet
2024-08-05 20:54:20 +00:00
cyy
6b12dc0224 [Reland] [11/N] Use std::nullopt and std::optional (#132622)
Reland of #132396, which was reverted due to dependency reversion.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132622
Approved by: https://github.com/ezyang
2024-08-05 20:36:33 +00:00
6f4dc56735 [inductor] Default to 1 compile thread for internal (#132540)
Summary: The historical default here is "1", i.e., no parallel compilation. In order to prepare for rolling out the subprocess-based parallel compile, I had previously modified this code to allow parallelism when worker_start_method="subprocess". I realize this probably isn't the best rollout strategy. Rather than opting all internal usages into both a) parallel compile _and_ b) a new implementation of parallel compile, let's put the default back to "1" and then start rolling out the new parallel-compile implementation only to those usages that have already opted in by explicitly setting compile_threads > 1.
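A sketch of the explicit opt-in this summary refers to (config names assumed from torch._inductor.config; rollout behavior per the description):

```python
import torch._inductor.config as inductor_config

inductor_config.compile_threads = 8                 # >1 opts into parallel compile
inductor_config.worker_start_method = "subprocess"  # use the subprocess-based workers
```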

Differential Revision: [D60686105](https://our.internmc.facebook.com/intern/diff/D60686105)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132540
Approved by: https://github.com/c00w
2024-08-05 20:23:16 +00:00
1471473b84 Add tests to bsr_dense_addmm_meta. Tune bsr_dense_addmm kernel for ViT shapes. (#132646)
As in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132646
Approved by: https://github.com/cpuhrsch
2024-08-05 20:22:33 +00:00
b7bcfdaff2 Change deprecate warning on dispatch_on_subclass to warn once (#132374)
Summary:
# Problem

`TORCH_WARN` can cause massive log spam.

I output the logs for before and after adding this change.

*Before:*

* The log file size was ~61.15 MB(61148028 bytes).

*After:*

* The log filesize was ~56.44 MB(56444057) bytes.

# Context

Looks like we tried to land this change earlier but it was reverted:

* D59413413
* Reverted https://github.com/pytorch/pytorch/pull/130047 on behalf of https://github.com/clee2000 due to broke test_overrides.py::TestTorchFunctionWarning::test_warn_on_invalid_torch_function

# Testing Update

`test_warn_on_invalid_torch_function` would fail because the warning would not be called on the handling of the second torch function class since `TORCH_WARN_ONCE` stops repeats globally.

Updated so that it runs separate programs. (I was not able to actually run the test; could someone help me with that?)

Test Plan: Need help with this...

Differential Revision: D60561181

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132374
Approved by: https://github.com/ezyang
2024-08-05 20:02:33 +00:00
2764bee942 Revert "[MPS] Add support for autocast in MPS (#99272)"
This reverts commit 6919e8baaba391ced7b4acaa553d6ea1f3b30e79.

Reverted https://github.com/pytorch/pytorch/pull/99272 on behalf of https://github.com/clee2000 due to Broke test/inductor/test_cpu_select_algorithm.py::TestSelectAlgorithmCPU::test_quantized_linear_amx_batch_size_3_in_features_128_out_features_64_bias_False_cpu on sm86 jobs [GH job link](https://github.com/pytorch/pytorch/actions/runs/10252979157/job/28367091621) [HUD commit link](6919e8baab) Not caught on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/99272#issuecomment-2269808857))
2024-08-05 19:59:04 +00:00
a3ea96b762 Revert "[export] Convert autocast to HOO (#131914)"
This reverts commit aec948adfc224e49213c4bc49586d4e4ba65fbbb.

Reverted https://github.com/pytorch/pytorch/pull/131914 on behalf of https://github.com/davidberard98 due to PR shouldn't have been relanded by the bot, phabricator diff did not have any recent changes and is still internally reverted ([comment](https://github.com/pytorch/pytorch/pull/131914#issuecomment-2269797388))
2024-08-05 19:52:09 +00:00
1d34f33d00 Scale XBLOCK in triton reduction configs to avoid hitting max grid (#128826)
Scale XBLOCK size in triton_config_reduction to avoid hitting maxGridSize limits.

This issue was observed in gpt-fast examples with large sequence length:
Reproducer: https://gist.github.com/jataylo/8a0ba922fbf68e345d360a418b48b9f1

`RuntimeError: Triton Error [HIP]:  Code: 9, Messsage: invalid configuration argument`

Co-authored-by: Jason Ansel <jansel@jansel.net>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128826
Approved by: https://github.com/jansel, https://github.com/nmacchioni
2024-08-05 19:34:38 +00:00
e1c2bdac2f [easy] fix f-string messages in torch/_ops.py (#132531)
I encountered these when making this change:

```
diff --git a/test/functorch/test_ac.py b/test/functorch/test_ac.py
index 3a2e07fa147..a4d003399e7 100644
--- a/test/functorch/test_ac.py
+++ b/test/functorch/test_ac.py
@@ -259,15 +259,8 @@ class MemoryBudgetTest(TestCase):

         expected = call()
         for budget in range(0, 11):
-            memory_budget = budget / 10
-            torch._dynamo.reset()
-            with config.patch(activation_memory_budget=memory_budget):
-                if memory_budget is not None:
-                    f_compile = torch.compile(
-                        call, backend="aot_eager_decomp_partition"
-                    )
-
-                self.assertEqual(expected, f_compile())
+            get_mem_and_flops(call, memory_budget=budget / 10)
+

     def test_prioritize_cheaper_matmul(self):
         def f(xs, ws):
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132531
Approved by: https://github.com/Skylion007
2024-08-05 18:58:33 +00:00
aec948adfc [export] Convert autocast to HOO (#131914)
Summary:
Suggested in https://github.com/pytorch/pytorch/issues/128394.

If there's an autocast context manager, the predispatch (strict) graph can look something like:

```
class <lambda>(torch.nn.Module):
    def forward(self, x: "f32[1]"):
        ...
        _enter_autocast = torch.amp.autocast_mode._enter_autocast('cuda', torch.bfloat16, True, None)
        mm: "f32[8, 8]" = torch.ops.aten.mm.default(rand, rand_1);  rand = rand_1 = None
        _exit_autocast = torch.amp.autocast_mode._exit_autocast(_enter_autocast);  _enter_autocast = None
        return (mm_1,)
```

But the operator `torch.amp.autocast_mode._enter_autocast` is not a valid ATen op. We remove these nodes by turning autocast into a higher order operator and make a submodule for the blocks between `_enter_autocast` and `_exit_autocast`.

Some potential followup improvement:
1) Merge some of the duplicated logic with `replace_set_grad_with_hop_pass.py`
2) Check the current autocast status (any enabled? dtype?) and not create a submodule if the autocast args matches current autocast status.

Test Plan:
CI

```
parsh --build-flags fbcode//mode/dev-nosan  fbcode//caffe2/test:test_export
run_tests("test_predispatch_autocast")
```

Reviewed By: angelayi

Differential Revision: D60206382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131914
Approved by: https://github.com/angelayi
2024-08-05 18:52:12 +00:00
8d9c3a71f6 Support IPC for Expandable Segments (#130890)
This reapplication commit is the same as before except it resolves a build error in an internal build where `handle` was shadowed.

Differential Revision: [D60547506](https://our.internmc.facebook.com/intern/diff/D60547506)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130890
Approved by: https://github.com/dsjohns2
2024-08-05 18:48:13 +00:00
618e2c9de4 fix torch rec test failure (#132437)
Summary: Fixes T192448049. The module call forms an unusual call stack for the nodes: https://www.internalfb.com/phabricator/paste/view/P1507230978. This is currently not supported by the unflattener and needs some extra design to make it work.

Test Plan: buck2 run 'fbcode//mode/opt' torchrec/distributed/tests:test_pt2 -- --filter-text "test_sharded_quant_fpebc_non_strict_export"

Reviewed By: zhxchen17

Differential Revision: D60528900

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132437
Approved by: https://github.com/Skylion007
2024-08-05 18:06:07 +00:00
1c7dc335f7 [ROCm][CK][Inductor] Enable addmm for CK backend to gemm max autotune (#130576)
Add functional support for torch.addmm with CK backend. See also #125453

# Implementation details
1. It turns out we can use the same template between addmm and matmul; essentially, matmul is addmm with empty bias
2. The Python generator in CK was updated to generate the shared cpp template. The pip package can be installed from `pip install git+https://github.com/rocm/composable_kernel@add-addmm` and will be merged into `develop` branch after this PR lands to avoid breaking the current matmul

# Testing
`pytest test/inductor/test_ck_backend.py -k addmm`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130576
Approved by: https://github.com/chenyang78
2024-08-05 17:49:09 +00:00
7b2664ece6 Temp disable MKL in DistributionKernels.cpp (#132532)
Until https://github.com/pytorch/pytorch/issues/132395 is addressed

Test plan: Add test based on the script below (taken from https://discuss.pytorch.org/t/bug-in-torch-multinomial-generated-distribution-is-modestly-incorrect-edit-this-is-a-regression-and-appears-to-be-due-to-an-analogous-bug-in-tensor-exponential )
```python
import torch

high_bits_for_seed = 16000000000000000000           # to use "good quality" seed
_ = torch.manual_seed (high_bits_for_seed + 2024)

prob = torch.ones (26)
dups_mult = 0
perm_counts_mult = {}
for _ in range (1_000_000):
    p = tuple (torch.multinomial (prob, prob.numel(), replacement=False).tolist())
    if  p in perm_counts_mult:
        dups_mult += 1
        perm_counts_mult[p] += 1
    else:
        perm_counts_mult[p] = 1

print ('duplicate multinomial perms: ', dups_mult)
print ('multiple multinomial perms:  ', (torch.tensor (list (perm_counts_mult.values())) > 1).sum().item())
print ('max of perm_counts_mult:     ', torch.tensor (list (perm_counts_mult.values())).max().item())
print ('len (perm_counts_mult):      ', len (perm_counts_mult))
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132532
Approved by: https://github.com/albanD
2024-08-05 17:40:57 +00:00
baa2483cea Revert "Refactor thunkify to return proper thunk abstraction (#132407)"
This reverts commit c65cb37657ef4f7fcd070a7e8e5121eb299919fd.

Reverted https://github.com/pytorch/pytorch/pull/132407 on behalf of https://github.com/ezyang due to td strikes again ([comment](https://github.com/pytorch/pytorch/pull/132407#issuecomment-2269577711))
2024-08-05 17:39:54 +00:00
cyy
d5045cceff [16/N] Fix clang-tidy warnings in jit (#132604)
Follows #132564

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132604
Approved by: https://github.com/Skylion007
2024-08-05 17:36:22 +00:00
e8645fa2b9 [Doc] fix some typos (found by codespell and typos) (#132544)
Applying doc fixes from PR https://github.com/pytorch/pytorch/pull/127267 - with CLA
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132544
Approved by: https://github.com/kit1980
2024-08-05 17:21:56 +00:00
3d87dfc088 Add basic OpenReg module scaffolding with autograd (#131708)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131708
Approved by: https://github.com/ezyang
2024-08-05 17:07:11 +00:00
df59084012 Drop GIL around cudart APIs (#132520)
Noticed a hang where the stuck thread blocked on cudaHostUnregister
call, probably due to an internal cuda deadlock caused by something
else, but was holding the GIL at the time and blocked other python
threads.

As far as I can tell, none of the cudart APIs require the GIL to be held, nor are they marked as thread-unsafe.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132520
Approved by: https://github.com/LucasLLC, https://github.com/kirtiteja
2024-08-05 17:04:01 +00:00
6919e8baab [MPS] Add support for autocast in MPS (#99272)
Fixes https://github.com/pytorch/pytorch/issues/88415

Co-authored-by: Siddharth Kotapati <skotapati@apple.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99272
Approved by: https://github.com/malfet
2024-08-05 17:02:30 +00:00
d532c00c81 [test/torch_np] Fix usages of deprecated NumPy 2.0 APIs in numpy_tests (#131909)
Migrates usages of deprecated APIs in NumPy-2.0 per [numpy-2.0 migration guide](https://numpy.org/devdocs/numpy_2_0_migration_guide.html#numpy-2-0-migration-guide).

I did a grep on the old API usages (see list below) and these were used only referenced in test files under `test/torch_np/numpy_tests/**/*.py`.

Specifically, migrates the usages of the following APIs:

1. `np.sctypes` &rarr; Access dtypes explicitly instead
2. `np.float_` &rarr; `np.float64`
3. `np.complex_` &rarr; `np.complex128`
4. `np.longcomplex` &rarr; `np.clongdouble`
5. `np.unicode_` &rarr; `np.str_`
6. `np.product` &rarr; `np.prod`
7. `np.cumproduct` &rarr; `np.cumprod`
8. `np.alltrue` &rarr; `np.all`
9. `np.sometrue` &rarr; `np.any`
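
For illustration, a tiny before/after sketch of such a migration (my own example, not taken from the test files):

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0])
mask = values > 0

# Before (NumPy 1.x aliases removed in NumPy 2.0):
#   total = np.product(values)
#   all_ok = np.alltrue(mask)
#   x = np.float_(1.5)

# After (NumPy 2.0-compatible spellings):
total = np.prod(values)    # np.product -> np.prod
all_ok = np.all(mask)      # np.alltrue -> np.all
x = np.float64(1.5)        # np.float_  -> np.float64
```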

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131909
Approved by: https://github.com/rgommers, https://github.com/Skylion007, https://github.com/atalman
2024-08-05 16:21:08 +00:00
a672f6c84e [inductor] unify SUBPROCESS_DECODE_ARGS variable in cpp_builder.py (#132615)
Unifies the SUBPROCESS_DECODE_ARGS variable in cpp_builder.py.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132615
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-08-05 16:00:35 +00:00
9945caec65 [inductor] Fix autotune non-close attr crash on Windows (#132630)
When I enabled `autotune`-related UTs on Windows:
<img width="1364" alt="Image" src="https://github.com/user-attachments/assets/b0c9c516-419d-47d0-a4c1-e90c98109d02">

I found the missing `close` attr issue on Windows. The DLL type here is `CDLL`, which doesn't have a `close` attr.
This PR checks for the `close` attr before doing the close operation.
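
A minimal sketch of the guard (my own illustration; the PR's actual change lives in the inductor benchmarking path):

```python
def close_dll(dll) -> None:
    # ctypes.CDLL objects expose no `close` attribute, so calling dll.close()
    # unconditionally raises AttributeError on Windows; guard the call instead.
    if hasattr(dll, "close"):
        dll.close()
```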

<img width="1624" alt="Image" src="https://github.com/user-attachments/assets/14093900-4ad8-4673-839e-7ba1410c5656">

After this fix, the UTs passed.

Here are some existing issues:
1. `CDLL` doesn't have a `close` attr, so the DLL is never closed, though this didn't crash on Linux.
2. This PR just avoids the crash on Windows; it still doesn't actually close the DLL.

**TODO:**
We need to replace `CDLL` with `DLLWrapper` in `CppBenchmarkRequest`, like in `CUDABenchmarkRequest`. I have added a tracking task: https://github.com/pytorch/pytorch/issues/124245 , and will follow up on this change in a further PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132630
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-08-05 16:00:27 +00:00
a8490a0762 [traced-graph][sparse] propagate sparsity in fx graph (#131920)
This PR proceeds with implementing the feature request #117188 by generalizing more cases that already work with COO to work with the compressed sparse formats as well.

Feature request:
https://github.com/pytorch/pytorch/issues/117188

Rebranch of older PRs (for history):
https://github.com/pytorch/pytorch/pull/131474
https://github.com/pytorch/pytorch/pull/128549

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131920
Approved by: https://github.com/ezyang
2024-08-05 15:49:53 +00:00
14edd986b3 Fix missing include file (#132647)
This error only appears with newer gcc releases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132647
Approved by: https://github.com/Skylion007
2024-08-05 15:49:49 +00:00
70cb16b316 [DTensor] Added naive replicate strategy for more diagonal ops (#132201)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132201
Approved by: https://github.com/wz337
ghstack dependencies: #132104
2024-08-05 15:18:56 +00:00
c65cb37657 Refactor thunkify to return proper thunk abstraction (#132407)
This is superior to lru_cache because (1) it's more explicit and (2) it
doesn't leak the original function after it's been forced.
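
A minimal sketch of the thunk abstraction (my own illustration, not the PR's code), caching the result and dropping the original function once forced:

```python
from typing import Callable, Generic, Optional, TypeVar

T = TypeVar("T")

class Thunk(Generic[T]):
    """Delay a computation; once forced, the result is cached and the
    original callable is dropped (unlike lru_cache, which keeps it alive)."""

    def __init__(self, fn: Callable[[], T]) -> None:
        self._fn: Optional[Callable[[], T]] = fn
        self._forced = False
        self._value: Optional[T] = None

    def force(self) -> T:
        if not self._forced:
            assert self._fn is not None
            self._value = self._fn()
            self._fn = None  # release the function after forcing
            self._forced = True
        return self._value  # type: ignore[return-value]

def thunkify(fn: Callable[[], T]) -> "Thunk[T]":
    return Thunk(fn)

assert thunkify(lambda: 2 + 2).force() == 4
```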

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132407
Approved by: https://github.com/albanD
ghstack dependencies: #131649
2024-08-05 14:42:40 +00:00
b465a5843b DTensor: add more foreach ops to supported sharding prop list (#132066)
fixes https://github.com/pytorch/pytorch/issues/132016.

Right now if you run an op that DTensor has no sharding prop rule, **and** that op accepts non-trivial pytrees of inputs tensors as arguments, DTensor can end up infinite looping before it has the chance to error due to not having a sharding prop rule.

This PR doesn't fix the problem, but adds rules for the culprit ops (missing foreach ops)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132066
Approved by: https://github.com/wanchaol
2024-08-05 13:51:59 +00:00
c3ee07c71c add missing profiler include in cpp code generation (#132419)
Summary:
When a user sets config.profiler_mark_wrapper_call, RECORD_FUNCTION annotations are added to the code. This requires importing the header <ATen/record_function.h>, but the conditional for doing so didn't check config.profiler_mark_wrapper_call.

Test Plan:
This case is already covered in test_profiler_mark_wrapper_call.

```
(pytorch-3.10) [gabeferns@devvm2252.cco0 ~/pytorch (missing-profile-include)]$ TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k CpuTests.test_profiler_mark_wrapper_call_cpu
stats [('calls_captured', 1), ('unique_graphs', 1)]
inductor [('fxgraph_cache_miss', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.
----------------------------------------------------------------------
Ran 1 test in 8.080s

OK
```

Fixes https://github.com/pytorch/pytorch/issues/131339

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132419
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-08-05 13:40:47 +00:00
b30d0916d9 [FSDP2] Added missing event wait (for future) (#132568)
Nothing is actually wrong currently, but we should add this in case we land https://github.com/pytorch/pytorch/pull/127032 in the future.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132568
Approved by: https://github.com/weifengpy, https://github.com/Skylion007
2024-08-05 12:44:46 +00:00
fb87796d4f [DeviceMesh] Add supports for non-continuous slicing (#132310)
Removes the constraint of continuous slicing to allow non-continuous slicing, and adds a unit test for 3D non-continuous slicing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132310
Approved by: https://github.com/wanchaol
2024-08-05 09:30:07 +00:00
27f61eba58 serde sympy functions (#132493)
Summary: Sympy functions appearing in symbolic expressions inside tensor metadata were not being deserialized properly.

Test Plan: updated test

Differential Revision: D60573150

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132493
Approved by: https://github.com/pianpwk
2024-08-05 08:08:50 +00:00
55b0c39d82 Reland "[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969)" (#132182)
Summary:
Reland #124969 by backing out D60397377 "Back out "[1/2] PT2 Inductor ComboKernels - Foreach cases  (#124969)""

The original diff D54134695 was reverted because of failure of ads nightly cogwheel tests.

The root cause: the logic for generating the mask in the Triton kernel needed an update after a recent refactoring of triton.py. This diff includes the fix for the root cause.

See D54134695 or #124969 for more details.

Test Plan:
Originally failed tests
f585704630
f585733786

Diff patched:
f586664028
f586663820

Differential Revision: D60458597

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132182
Approved by: https://github.com/Yuzhen11
2024-08-05 06:57:30 +00:00
ae44b8f410 [inductor] support vectorization for torch.argmax/min(float/int64_t)-> int64_t (#131016)
Support reduction argmin/max by scalar implementation.
TestPlan:
```
python test/inductor/test_cpu_repro.py -k test_argmax_argmin_with_nan_value
python test/inductor/test_cpu_repro.py -k test_argmin
python test/inductor/test_cpu_repro.py -k test_reduction_cpu_only
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131016
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-08-05 04:31:53 +00:00
1fb498d6e3 Add try except for _maybe_evaluate_static call in IndexPropagation (#132128)
Fixes the Inductor max-autotune mode failures of the below models:
- GPT2ForSequenceClassification
- PegasusForConditionalGeneration
- XGLMForCausalLM
- hf_GPT2
- tnt_s_patch16_224
```log
  File "/pytorch/torch/_inductor/index_propagation.py", line 329, in statically_true
    evaluated = self.shape_env._maybe_evaluate_static(
  File "/pytorch/torch/fx/experimental/symbolic_shapes.py", line 1499, in wrapper
    return fn_cache(self, *args, **kwargs)
  File "/pytorch/torch/fx/experimental/symbolic_shapes.py", line 4539, in _maybe_evaluate_static
    vr = var_ranges[k]
torch._dynamo.exc.BackendCompilerFailed: backend='compile_fx_wrapper' raised:
KeyError: m_start
```

The `_maybe_evaluate_static` call in `IndexPropagation` may fail. This PR adds a try/except, following the approach in `torch/_inductor/sizevars.py`, by adding a common utility function.
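
A rough sketch of such a helper (hypothetical name and shape, purely to illustrate the pattern):

```python
def try_maybe_evaluate_static(shape_env, expr, **kwargs):
    # Fall back to "unknown" (None) instead of crashing when static evaluation
    # hits a symbol with no recorded value range (e.g. KeyError: m_start).
    try:
        return shape_env._maybe_evaluate_static(expr, **kwargs)
    except Exception:
        return None
```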

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132128
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-08-05 01:02:51 +00:00
c7cfa51721 Always use high precision for SDPA math backend (#128922)
Summary:
feikou observed big numerical gaps when using the math backend on AMD and NV GPUs. This is mainly because we are not using higher-precision FP32 for the intermediate accumulated/materialized parts.

Since the math backend is expected to be slower anyway, and we expect it to generate the correct reference result, it is worth upcasting FP16/BF16 inputs to FP32, doing FP32/TF32 computations, and then downcasting the FP32 output back to FP16/BF16.
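
A schematic sketch of the upcast/compute/downcast idea (illustrative only, not the actual kernel):

```python
import torch

def sdpa_math_reference(q, k, v):
    orig_dtype = q.dtype
    if orig_dtype in (torch.float16, torch.bfloat16):
        # Upcast half-precision inputs so intermediates accumulate in FP32.
        q, k, v = q.float(), k.float(), v.float()
    scale = q.shape[-1] ** -0.5
    attn = torch.softmax((q @ k.transpose(-2, -1)) * scale, dim=-1)
    out = attn @ v
    # Downcast the FP32 output back to the original dtype.
    return out.to(orig_dtype)
```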

Differential Revision: D58710805

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128922
Approved by: https://github.com/xw285cornell, https://github.com/drisspg
2024-08-04 23:58:14 +00:00
01cdcbf7c8 [dynamo] revert map/zip iterator related changes (#132528)
Need to revert due to internal hangs: S437700

This reverts commit b6c1490cc02316ffe85e5ae74651d80f0158ba64.

Revert "[dynamo] implement IteratorVariable and polyfill fallbacks for enumerate (#131725)"

This reverts commit 2576dbbc35d66e8e9ed6cb12216ccc424cb87ec3.

Revert "[dynamo] add itertools repeat/count bytecode reconstruction (#131716)"

This reverts commit 35b4de32fafc5ad024c20ef1275711bffc557ae9.

Revert "[dynamo] add lazy IteratorVariable implementations for map and zip (#131413)"

This reverts commit 7d282d87550787d8269593093519c2ad7c5032cd.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132528
Approved by: https://github.com/ZainRizvi
2024-08-04 18:46:55 +00:00
09f9c256ad Add basic mypy annotations to inductor (#132416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132416
Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu
ghstack dependencies: #132415
2024-08-04 18:43:37 +00:00
6e79932543 Add basic mypy annotations to dynamo (#132415)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132415
Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu
2024-08-04 18:43:36 +00:00
3558a8cf4a Revert "Add basic mypy annotations to dynamo (#132415)"
This reverts commit 71e22e0959eb8d5a66833bf5c6b5903536a5bef1.

Reverted https://github.com/pytorch/pytorch/pull/132415 on behalf of https://github.com/ZainRizvi due to Sorry, this PR has entered a weird state in the diff train. Trying to revert it to skip it, and then we can try relanding it ([comment](https://github.com/pytorch/pytorch/pull/132415#issuecomment-2267631785))
2024-08-04 18:39:29 +00:00
f2ddd5e9e0 Revert "Add basic mypy annotations to inductor (#132416)"
This reverts commit 78927d37f6085a0b30269cceb731d8097302c091.

Reverted https://github.com/pytorch/pytorch/pull/132416 on behalf of https://github.com/ZainRizvi due to Sorry, this PR has entered a weird state in the diff train. Trying to revert it to skip it, and then we can try relanding it ([comment](https://github.com/pytorch/pytorch/pull/132415#issuecomment-2267631785))
2024-08-04 18:39:29 +00:00
9be33bc584 Revert "[inductor] Add type hints to functions in mkldnn_fusion.py (#131820)"
This reverts commit 6c65fd03942415b68040e102c44cf5109d2d851e.

Reverted https://github.com/pytorch/pytorch/pull/131820 on behalf of https://github.com/ZainRizvi due to Sorry, had to revert this to revert another PR that depends on this change ([comment](https://github.com/pytorch/pytorch/pull/131820#issuecomment-2267629534))
2024-08-04 18:30:59 +00:00
0a25666f92 Revert "[dynamo] revert map/zip iterator related changes (#132528)"
This reverts commit e81e74ca6cb45e1ab831ddfe9a2ba5c7e17fa03f.

Reverted https://github.com/pytorch/pytorch/pull/132528 on behalf of https://github.com/ZainRizvi due to This stack entered a weird state in the diff train. Reverting and relanding to clean the state ([comment](https://github.com/pytorch/pytorch/pull/132528#issuecomment-2267628475))
2024-08-04 18:26:09 +00:00
fd4b649e6c [BE]: Simplify some list comps to generators C419 (#132578)
Simplifies some list comprehensions to generators, which is more efficient. Diffs were mostly applied automatically with ruff.
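
A tiny example of the kind of rewrite (my own illustration):

```python
data = [1, 2, 3]

# Before: a list comprehension materializes an intermediate list
found = any([x > 2 for x in data])

# After: a generator expression avoids the intermediate allocation
found = any(x > 2 for x in data)
```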

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132578
Approved by: https://github.com/ezyang
2024-08-04 17:46:26 +00:00
4226ed1585 [BE] Format uncategorized Python files with ruff format (#132576)
Remove patterns `**`, `test/**`, and `torch/**` in `tools/linter/adapters/pyfmt_linter.py` and run `lintrunner`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132576
Approved by: https://github.com/ezyang, https://github.com/Skylion007
ghstack dependencies: #132574
2024-08-04 17:13:31 +00:00
c35061c542 Migrate Python code formatter from black to ruff format (#132574)
See also:

- #124845
- #123062

Closes #124845
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132574
Approved by: https://github.com/ezyang
2024-08-04 17:13:31 +00:00
09fcd792eb [Fix]: ScriptObject lifting issue (#130952)
#### Issue
ScriptObject was previously treated as a normal attribute by the converter. This PR lifts it to be a constant and converts it directly to a GetAttr fx node. ScriptObject can also trigger `CallMethod`, and this PR adds that support as well.

#### Test Plan
Add test case for ScriptObject.
`pytest test/export/test_converter.py -s -k test_convert_script_object`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130952
Approved by: https://github.com/angelayi
2024-08-04 16:52:45 +00:00
5dac4d2c78 Revert "[easy] fix f-string messages in torch/_ops.py (#132531)"
This reverts commit 908d2a153b14cbb7a39c1f4ef9a77534cf2c71bf.

Reverted https://github.com/pytorch/pytorch/pull/132531 on behalf of https://github.com/davidberard98 due to still breaks tests ([comment](https://github.com/pytorch/pytorch/pull/132531#issuecomment-2267584289))
2024-08-04 15:41:56 +00:00
cyy
105ba7b58c [5/N] Fix clang-tidy warnings in aten/src/ATen (#132565)
Follows #132001

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132565
Approved by: https://github.com/Skylion007
2024-08-04 14:39:16 +00:00
908d2a153b [easy] fix f-string messages in torch/_ops.py (#132531)
I encountered these when making this change:

```
diff --git a/test/functorch/test_ac.py b/test/functorch/test_ac.py
index 3a2e07fa147..a4d003399e7 100644
--- a/test/functorch/test_ac.py
+++ b/test/functorch/test_ac.py
@@ -259,15 +259,8 @@ class MemoryBudgetTest(TestCase):

         expected = call()
         for budget in range(0, 11):
-            memory_budget = budget / 10
-            torch._dynamo.reset()
-            with config.patch(activation_memory_budget=memory_budget):
-                if memory_budget is not None:
-                    f_compile = torch.compile(
-                        call, backend="aot_eager_decomp_partition"
-                    )
-
-                self.assertEqual(expected, f_compile())
+            get_mem_and_flops(call, memory_budget=budget / 10)
+

     def test_prioritize_cheaper_matmul(self):
         def f(xs, ws):
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132531
Approved by: https://github.com/Skylion007
ghstack dependencies: #132356, #132466
2024-08-04 14:30:42 +00:00
87d46d70d7 [inductor] export kernel for gemm template. (#132580)
Changes:
1. Move `get_export_declaration` to `cpp_utils.py` as basic function.
2. Export kernel for gemm template.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132580
Approved by: https://github.com/ezyang
2024-08-04 11:17:19 +00:00
d2dc173664 Remove lint dependency ufmt (#132573)
`ufmt` is a combination of `black + usort`.

This PR removes `ufmt` and run `black` and `usort` separately.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132573
Approved by: https://github.com/ezyang
ghstack dependencies: #129769, #132572
2024-08-04 10:24:09 +00:00
f7aeb394b6 [BE][Easy] Remove empty ISORT_SKIPLIST (#132572)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132572
Approved by: https://github.com/ezyang, https://github.com/justinchuby
ghstack dependencies: #129769
2024-08-04 10:24:09 +00:00
f3fce597e9 [BE][Easy][17/19] enforce style for empty lines in import segments in torch/[a-c]*/ and torch/[e-n]*/ (#129769)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129769
Approved by: https://github.com/ezyang
2024-08-04 10:24:09 +00:00
2714adce20 [caffe2] Fix compiling ATen-hip in non-opt mode (#132581)
Summary:
It looks like https://github.com/pytorch/pytorch/pull/131894 accidentally broke non-opt hip builds. I.e. `is_flash_attention_available` doesn't get inlined in non-opt mode, so all of `can_use_flash_attention` is compiled into the final object file. This includes a reference to `aotriton::v2::flash::check_gpu`, which we haven't set up yet for HIP builds.

Test Plan:
CI

Differential Revision: D60720707

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132581
Approved by: https://github.com/jianyuh, https://github.com/xw285cornell
2024-08-04 07:51:18 +00:00
cyy
522fa03e91 [Submodule] Bump ONNX to v1.16.2 (#132566)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132566
Approved by: https://github.com/justinchuby
2024-08-04 07:01:54 +00:00
2a8e94347f [TP] verify numeric parity on Transformers for multiple iterations (#132543)
Before setting up the float8 numeric parity test, I have to set up a regular TP numeric parity test, preferably testing 10 iterations.

This PR sets a baseline for TP numerics; I can then verify fp8 on top of it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132543
Approved by: https://github.com/tianyu-l
ghstack dependencies: #132350
2024-08-04 06:43:27 +00:00
8ff310392e add __torch_function__ handler to get_device cpp (#132567)
From the issue:
```
import torch

class CustomParameter(torch.nn.Parameter):
    @classmethod
    def __torch_function__(cls, func, types, args=(), kwargs=None):
         return func.__name__

x = CustomParameter(torch.rand(2))

print(x.square()) # 'square'
print(torch.square(x)) # 'square'
print(x.get_device()) # 'get_device'
print(torch.get_device(x)) # -1
```
after fix:
```
$ python repro.py
square
square
get_device
get_device
```

Fixes: https://github.com/pytorch/pytorch/issues/131944

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132567
Approved by: https://github.com/ezyang
2024-08-04 04:26:30 +00:00
7f8a384a8f [inductor] add msvc_cl compiler check (#132571)
add `msvc_cl` compiler check.
Local test:
<img width="880" alt="image" src="https://github.com/user-attachments/assets/fe4da5e0-dd52-4dbc-831e-c32479e27a29">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132571
Approved by: https://github.com/ezyang
2024-08-04 03:48:25 +00:00
81b8d3586f Update torch-xpu-ops pin (ATen XPU implementation) (#132390)
Regular update.
1. New 69 ATen operators and variants are added. See https://github.com/intel/torch-xpu-ops/blob/main/yaml/xpu_functions.yaml.
2. Align with PyTorch in-tree to use safe data pointer access APIs.
3. Enable FP64 conversion emulation for some platforms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132390
Approved by: https://github.com/EikanWang
2024-08-04 02:22:46 +00:00
6ec4af6865 [Inductor][CPP] Add vectorization support for double (#131886)
Before:
```
extern "C"  void kernel(const double* in_ptr0, double* out_ptr0)
{
     #pragma omp parallel num_threads(112)
     {
         int tid = omp_get_thread_num();
         {
             #pragma omp for
             for(long x0=static_cast<long>(0L); x0<static_cast<long>(1024L); x0+=static_cast<long>(1L))
             {
                 auto tmp0 = in_ptr0[static_cast<long>(x0)];
                 auto tmp1 = decltype(tmp0)(tmp0 * tmp0);
                 out_ptr0[static_cast<long>(x0)] = tmp1;
             }
         }
     }
 }
```

After:
```
extern "C"  void kernel(const double* in_ptr0, double* out_ptr0)
{
    #pragma omp parallel num_threads(112)
    {
        int tid = omp_get_thread_num();
        {
            #pragma omp for
            for(long x0=static_cast<long>(0L); x0<static_cast<long>(1024L); x0+=static_cast<long>(16L))
            {
                auto tmp0 = at::vec::VectorizedN<double,2>::loadu(in_ptr0 + static_cast<long>(x0), 16);
                auto tmp1 = tmp0 * tmp0;
                tmp1.store(out_ptr0 + static_cast<long>(x0), 16);
            }
        }
    }
}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131886
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-08-04 02:13:21 +00:00
d984105748 Revert "[export] Convert autocast to HOO (#131914)"
This reverts commit b28c01d90d6575522d2240ce485d7dd87a7242aa.

Reverted https://github.com/pytorch/pytorch/pull/131914 on behalf of https://github.com/ezyang due to Failing lint, but was covered up by master failure on lint ([comment](https://github.com/pytorch/pytorch/pull/131914#issuecomment-2267248773))
2024-08-04 02:10:35 +00:00
6c65fd0394 [inductor] Add type hints to functions in mkldnn_fusion.py (#131820)
Summary: ATT

Test Plan: lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131820
Approved by: https://github.com/eellison
2024-08-03 22:11:47 +00:00
cyy
bc46f205c4 [15/N] Fix clang-tidy warnings in jit (#132564)
Follows  #132477

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132564
Approved by: https://github.com/Skylion007
2024-08-03 19:33:24 +00:00
00097f3458 Revert "C++ network flow implementation in c10 (#132188)"
This reverts commit dccce77935bb023f225b9972929fd9213e754e84.

Reverted https://github.com/pytorch/pytorch/pull/132188 on behalf of https://github.com/ZainRizvi due to Sorry but this appears to be failing internal tests. Please see D60702564 to investigate ([comment](https://github.com/pytorch/pytorch/pull/132188#issuecomment-2267098420))
2024-08-03 18:44:28 +00:00
e3387c6712 [inductor] use uint64_t replace long to add Windows support. (#132491)
The `long` type is different between `Windows` and `Linux`.
This PR uses `int64_t` instead of `long` on Windows. The `LL` suffix is used to initialize `int64_t` values.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132491
Approved by: https://github.com/malfet
2024-08-03 18:38:30 +00:00
bbce517221 [Inductor][FlexAttention] TestFlexAttention -> TestFlexDecoding (#132547)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132547
Approved by: https://github.com/Chillee
ghstack dependencies: #132015
2024-08-03 17:26:44 +00:00
21d02f8b4b Revert "[easy] fix f-string messages in torch/_ops.py (#132531)"
This reverts commit 25903f3932b3a24d4edf323484d2159f3ac92999.

Reverted https://github.com/pytorch/pytorch/pull/132531 on behalf of https://github.com/davidberard98 due to broke lint and tests due to conflict with 132377 ([comment](https://github.com/pytorch/pytorch/pull/132531#issuecomment-2266743391))
2024-08-03 14:49:07 +00:00
a896fb1b36 check unsupported sympy functions for runtime asserts (#132457)
Some sympy Functions aren't supported by sympy_interp(); we can't turn them into FX nodes, so currently the runtime asserts CSE pass avoids CSE'ing on any expression containing a sympy Function. https://github.com/pytorch/pytorch/pull/132325 started tracking unsupported functions, so we switch the check to that to be more precise. We also check for and skip unsupported functions when adding asserts - previously we only did the check for CSE, and not when adding new expressions.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132457
Approved by: https://github.com/avikchaudhuri
2024-08-03 10:17:25 +00:00
0e7e61f7ce Deprecate torch._utils.is_compiling() and torch._dynamo.external_utils.is_compiling() (#127690)
This PR is split from PR #126898.

- #126898

------

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127690
Approved by: https://github.com/Skylion007, https://github.com/malfet
2024-08-03 09:43:38 +00:00
159d508f03 [Fix]: prim::If with multiple outputs and input return directly (#131779)
#### Issue
The test was not working for prim::Loop with multiple outputs. Additionally, this fixes an issue where an input is directly returned, which is not supported by HigherOrderOp.

#### Test Plan
`pytest test/export/test_converter.py -s -k test_convert_if_multiple_out`
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131779
Approved by: https://github.com/angelayi, https://github.com/SherlockNoMad
2024-08-03 08:07:21 +00:00
36ec0fdf10 [inductor] check compiler exist on Windows. (#132533)
In the current Windows env, if we have not activated the MSVC env, it does not raise a clear error pointing at the compiler:
<img width="904" alt="image" src="https://github.com/user-attachments/assets/725ea608-d181-40b1-8930-42fe2b32643a">

With this PR, we help users see that the issue comes from the compiler.
<img width="1034" alt="image" src="https://github.com/user-attachments/assets/8515a796-e3e9-4909-a68f-8a14d4864951">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132533
Approved by: https://github.com/jansel
2024-08-03 07:47:11 +00:00
8ad9f89ccc [inductor] Reland: Add flag to ignore unsupported @triton.autotune args in user-written kernel compilation (#132562)
Summary:
This is a reland attempt of [#131431](https://github.com/pytorch/pytorch/pull/131431), as, in its original form, the PR has caused issues internally.

We currently don't support some of the `triton.autotune` arguments when compiling user-written Triton kernels with PT2. In this PR, we're adding a flag to circumvent it. This is to unblock internal compilation in some cases. The flag is supplied with the docs mentioning why it is not a good idea to set it.

Test Plan:
```
python test/inductor/test_triton_kernels.py -k test_triton_kernel_autotune_with_unsupported_args
...
----------------------------------------------------------------------
Ran 3 tests in 3.636s

OK
```

Differential Revision: D60701839

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132562
Approved by: https://github.com/chenyang78
2024-08-03 06:31:28 +00:00
06581c277a [dynamo][stable-diffusion] Support dict(obj) on constrained subclasses of dict and OrderedDict (#132558)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132558
Approved by: https://github.com/jansel
2024-08-03 06:31:00 +00:00
b28c01d90d [export] Convert autocast to HOO (#131914)
Summary:
Suggested in https://github.com/pytorch/pytorch/issues/128394.

If there's an autocast context manager, the predispatch (strict) graph can look something like:

```
class <lambda>(torch.nn.Module):
    def forward(self, x: "f32[1]"):
        ...
        _enter_autocast = torch.amp.autocast_mode._enter_autocast('cuda', torch.bfloat16, True, None)
        mm: "f32[8, 8]" = torch.ops.aten.mm.default(rand, rand_1);  rand = rand_1 = None
        _exit_autocast = torch.amp.autocast_mode._exit_autocast(_enter_autocast);  _enter_autocast = None
        return (mm_1,)
```

But the operator `torch.amp.autocast_mode._enter_autocast` is not a valid ATen op. We remove these nodes by turning autocast into a higher order operator and make a submodule for the blocks between `_enter_autocast` and `_exit_autocast`.

Some potential followup improvement:
1) Merge some of the duplicated logic with `replace_set_grad_with_hop_pass.py`
2) Check the current autocast status (any enabled? dtype?) and not create a submodule if the autocast args matches current autocast status.

Test Plan:
CI

```
parsh --build-flags fbcode//mode/dev-nosan  fbcode//caffe2/test:test_export
run_tests("test_predispatch_autocast")
```

Reviewed By: angelayi

Differential Revision: D60206382

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131914
Approved by: https://github.com/angelayi
2024-08-03 05:48:57 +00:00
ed4493de0e dim name is identifier (#132557)
Summary: Dim names appear in suggested fixes, so they should be valid Python identifiers.

Test Plan: none

Differential Revision: D60696854

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132557
Approved by: https://github.com/pianpwk
2024-08-03 05:28:50 +00:00
1f5dfe00da Subtracer should always be real to inherit fake/real tensors from parent config (#132488)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132488
Approved by: https://github.com/zou3519
2024-08-03 04:55:42 +00:00
6966d44eda [ONNX] Rename _internal/exporter to _exporter_legacy (#132429)
The next PR will be creating an `exporter` directory to house logic from `torch-onnx`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132429
Approved by: https://github.com/titaiwangms
2024-08-03 04:23:05 +00:00
5973aec671 [fx] python_code(verbose=True): show size/strides for all tensors (#132192)
python_code(verbose=True) (or print_readable()) generates a string with the code representing the fx graph, with extra annotations indicating the size or stride of the tensor. Currently, it only shows sizes/strides for FakeTensors provided in metadata. For subclass tensors like NestedTensor, the outer class (provided in the node metadata) will be a non-FakeTensor and the inner tensors will be fake. This PR expands the conditional to show sizes/strides for all tensors, not just FakeTensors.

Testing: I ran this test script (below), ran it with `TORCH_LOGS=+dynamo` and found in the logs the graph shown below - we see that the input nested tensor has sizes and strides associated with it. Also, I stacked a diff on top of this one that forces the readable graph to be generated whenever PT2 is in use in tests, which should hopefully find any issues; https://github.com/pytorch/pytorch/pull/132195 shows no significant failures except for preexisting failures.

test script:
```python
import torch

def fn(x):
    return x.cos()

nt = torch.nested.nested_tensor_from_jagged(
    torch.randn(10, 10),
    torch.tensor([0, 1, 3, 6, 10]),
)

torch.compile(fn)(nt)
```

logs excerpt:
```
[0/0] [__graph_code] TRACED GRAPH
[0/0] [__graph_code]  ===== __compiled_fn_1 =====
[0/0] [__graph_code]  /data/users/dberard/pytorch/torch/fx/_lazy_graph_module.py class GraphModule(torch.nn.M

[0/0] [__graph_code]     def forward(self, L_x_: "f32[4, zf1, 10][10*zf1, 10, 1]cpu", zf1: "Sym(zf1)"):
[0/0] [__graph_code]         l_x_ = L_x_
[0/0] [__graph_code]
[0/0] [__graph_code]          # File: /data/users/dberard/scripts/nt_print_graph.py:4 in fn, code: return x.c

[0/0] [__graph_code]         cos: "f32[4, zf1, 10][10*zf1, 10, 1]cpu" = l_x_.cos();  l_x_ = None
[0/0] [__graph_code]         return (cos,)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132192
Approved by: https://github.com/Chillee
2024-08-03 02:54:32 +00:00
0b571b1058 [codemod][pyre] Add missing Pyre mode headers (#132548)
Reviewed By: connernilsen

Differential Revision: D59849027

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132548
Approved by: https://github.com/kit1980, https://github.com/ZainRizvi
2024-08-03 02:32:53 +00:00
373e9be457 [Inductor][FlexAttention] Add kwarg to top level for users to specify kernel params (#132015)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132015
Approved by: https://github.com/Chillee
2024-08-03 02:27:02 +00:00
25903f3932 [easy] fix f-string messages in torch/_ops.py (#132531)
I encountered these when making this change:

```
diff --git a/test/functorch/test_ac.py b/test/functorch/test_ac.py
index 3a2e07fa147..a4d003399e7 100644
--- a/test/functorch/test_ac.py
+++ b/test/functorch/test_ac.py
@@ -259,15 +259,8 @@ class MemoryBudgetTest(TestCase):

         expected = call()
         for budget in range(0, 11):
-            memory_budget = budget / 10
-            torch._dynamo.reset()
-            with config.patch(activation_memory_budget=memory_budget):
-                if memory_budget is not None:
-                    f_compile = torch.compile(
-                        call, backend="aot_eager_decomp_partition"
-                    )
-
-                self.assertEqual(expected, f_compile())
+            get_mem_and_flops(call, memory_budget=budget / 10)
+

     def test_prioritize_cheaper_matmul(self):
         def f(xs, ws):
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132531
Approved by: https://github.com/Skylion007
ghstack dependencies: #132356, #132466
2024-08-03 02:23:44 +00:00
419b76c4ac [dynamo] Reland 132308, 132314, 132318, 132334 - Make builtin nn modules attributes static (#132539)
Relanding 4 PRs ending at https://github.com/pytorch/pytorch/pull/132334

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132539
Approved by: https://github.com/Skylion007, https://github.com/yanboliang, https://github.com/mlazos
2024-08-03 02:08:22 +00:00
841cadd555 Fix discrepancies from 129973 (#132545)
#129973 ([D59132793](https://www.internalfb.com/diff/D59132793)) was exported missing changes in `test/cpp/jit/CMakeLists.txt` this PR remediates that.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132545
Approved by: https://github.com/kit1980
2024-08-03 01:57:49 +00:00
243a763e1b ci: Remove split-build CUDA testing from pull.yml (#132537)
This is already represented in trunk.yml so it seems a bit redundant to include this level of testing in pull.yml.

I've been observing a large spike in our usage of `g3.4xlarge` which seems to correspond to these builds in particular so removing these from `pull.yml` since they are already covered in `trunk.yml`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132537
Approved by: https://github.com/ZainRizvi, https://github.com/malfet
2024-08-03 01:24:17 +00:00
a503136583 [export] Detect whether case_name is registered in exportdb (#132420)
Summary:
- moves logging functionalities into `torch/_export/db/logging.py` file.
- add a check in `_dynamo/eval_frame.py` to check for optional input and error out with `UnsupportedError`
- change the case name of `torch_sym_int` to `unsupported_operator`
- Check if the case name is registered in exportdb, if so, we give a link to the case in exportdb.
- TODO: add test

Test Plan:
CI

Running the example in https://pytorch.org/docs/main/generated/exportdb/index.html#optional-input gives the following error logging:

```
E0730 10:53:33.687000 4155538 torch/_dynamo/eval_frame.py:1086] Parameter y is optional with a default value of tensor([[-0.1633,  1.2414, -0.1071],
E0730 10:53:33.687000 4155538 torch/_dynamo/eval_frame.py:1086]         [-0.1936, -0.9425, -0.0824]])
E0730 10:53:33.688000 4155538 torch/export/_trace.py:1043] See optional_input in exportdb for unsupported case.                 https://pytorch.org/docs/main/generated/exportdb/index.html#optional-input
......
  File "/data/users/shangdiy/fbsource/buck-out/v2/gen/fbcode/389acaeb40d57230/tutorials/pytorch/nntest/__torchtest__/torchtest#link-tree/torch/_dynamo/eval_frame.py", line 1091, in produce_matching
    raise Unsupported(
torch._dynamo.exc.Unsupported: Tracing through optional input is not supported yet
```

It also logs a `export.error.classified` event in Scuba.

Reviewed By: zhxchen17

Differential Revision: D60427208

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132420
Approved by: https://github.com/zhxchen17
2024-08-03 01:08:48 +00:00
64720f3b89 Introduce checks to validate public API tests (#131390)
This PR introduces a new sanity check for the public API tests in `.ci/pytorch/test.sh`.
* Validates two public API tests:
    1. Ensures `test_correct_module_names` fails when a new file OR an existing file adds an invalid public API function (e.g. one whose `__module__` is unset).
    2. Ensures `test_modules_can_be_imported` fails when a module underneath `torch/` cannot be imported.
* Runs this in CI as part just before the pre-existing FC / BC checks.

I've verified that re-introducing the bug that #131386 fixed causes the new check to fail:
![public_api_failure](https://github.com/user-attachments/assets/376ddef3-d14a-41f6-93e2-f935deb6555a)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131390
Approved by: https://github.com/albanD
2024-08-03 00:29:00 +00:00
cyy
fcef6cc6d1 [13/N] Fix clang-tidy warnings in jit (#132477)
Follows  #132209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132477
Approved by: https://github.com/Skylion007
2024-08-03 00:13:18 +00:00
705ac311aa Fix Distributed EventList usage (#132448)
Summary: Summarized here: https://github.com/pytorch/pytorch/issues/132227

Test Plan: Use suggestion in issue, should see test passing again

Differential Revision: D60614690

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132448
Approved by: https://github.com/aaronenyeshi
2024-08-02 23:55:31 +00:00
e3513fb2af [ts_converter]handle python list append, list add, aten.to.dtype+mutation_op pattern (#132529)
Summary:
#### Description
Add support for aten::append by replacing it with a Python function that returns a new list with the appended element. We then update the `fx_node` in the `name_to_node` mapping.
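
A minimal sketch of such a replacement function (hypothetical, purely to illustrate the idea):

```python
def list_append(lst, el):
    # Functional replacement for aten::append: no in-place mutation; the caller
    # rebinds its name (the fx_node in name_to_node) to the returned list.
    return lst + [el]
```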

aten::append contributed by Jiashen Cao <jiashenc@meta.com>

Fix conversion for csr_ranker_test

```
    model_name: csr_ranker_test_4.ptl
    has_ts_model: True
    has_sample_inputs: True
    ops_maybe_missing_meta: set()
    script_objects: set()
    ts_can_run: True
    ts_run_exception: None
    can_convert: True
    convert_exception: None
    ep_result_correct: True
    ep_run_exception: None
    can_package: True
    package_exception: None
    sigmoid_can_run: False
    sigmoid_run_exception: RuntimeError('not for symbolics')
    sigmoid_result_correct: None
```

Test Plan:
test_aten_add_t
test_aten_append_t
test_aten_to_dtype_with_mutating_storage

buck2 run mode/opt sigmoid/inference/ts_migration:main -- --mode test_one --model_name csr_ranker_test

Differential Revision: D60635893

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132529
Approved by: https://github.com/jiashenC
2024-08-02 23:32:37 +00:00
85f19ce14a Support meta["val"] that is a dict, for triton kernels and for the partitioner (#132466)
Internally there's a model that's using memory_budget with the partitioner, and using custom triton kernels. The partitioner fails when encountering the triton ops because they don't have `meta["val"]`. This PR adds `meta["val"]`  to these fx graph nodes and then adds handling for `meta["val"]` being a dict in the partitioner.

Differential Revision: [D60627813](https://our.internmc.facebook.com/intern/diff/D60627813)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132466
Approved by: https://github.com/zou3519
ghstack dependencies: #132356
2024-08-02 23:24:29 +00:00
bcac71517c [Profiler] Test Logging for Empty Traces (#132444)
Summary: Tests D60311331. Please see that diff for explanation

Test Plan: This diff is adding a test itself

Reviewed By: aaronenyeshi

Differential Revision: D60311555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132444
Approved by: https://github.com/aaronenyeshi
2024-08-02 22:04:15 +00:00
1962f9475f [NJT][flop counter] attention: if offsets are fake, use max seqlen (#132356)
The flop counter is used by the partitioner, in which case the tensors passed in can be fake.

The flop computations for nested attention use the offsets to determine the actual amount of compute that will be done. But when the offsets are fake, we end up with unbacked symints (from `(offsets[1:] - offsets[:-1]).tolist()`). If we find that the offsets are fake or functional tensors, we use the max sequence length instead.
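
A rough sketch of the fallback (illustrative names, not the flop counter's actual code; functional tensors are omitted for brevity):

```python
from torch._subclasses.fake_tensor import FakeTensor

def seq_lengths_for_flops(offsets, max_seqlen, batch_size):
    if isinstance(offsets, FakeTensor):
        # Differencing fake offsets would yield unbacked symints, so pretend
        # every sequence in the batch is max_seqlen long instead.
        return [max_seqlen] * batch_size
    return (offsets[1:] - offsets[:-1]).tolist()
```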

Repro: https://gist.github.com/davidberard98/903fb3e586edb6d1d466786e1a610eba

Differential Revision: [D60597463](https://our.internmc.facebook.com/intern/diff/D60597463)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132356
Approved by: https://github.com/soulitzer
2024-08-02 20:42:29 +00:00
37c3d503b7 [pipelining] Make test_schedule quiet (#132369)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132369
Approved by: https://github.com/H-Huang
ghstack dependencies: #129810, #130378
2024-08-02 20:38:17 +00:00
7c1cca9fda [pipelining] Add schedule send/recv pass (#130378)
Inserts send/recv ops where needed in a compute-only pipeline schedule.

Any F or B action will require a recv op for its input and a send op
for its output, except for at the ends of the pipeline.

To avoid hangs caused by mixed-up orderings of sends/recvs across ranks,
we pick one compute action at a time and insert both its send op (on
that rank's schedule), and the matching recv op for the recipient stage
(on the schedule for the rank for that stage).

TODO
Currently ignores a couple of edge cases
- ignores batching (which is an optimization)
- ignores cases where a stage sends to another stage on the same rank,
  and should skip the send/recv and directly access memory

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130378
Approved by: https://github.com/H-Huang
ghstack dependencies: #129810
2024-08-02 20:38:17 +00:00
625f494619 [Pipelining] Add schedule unshard/reshard pass (#129810)
Adds fsdp unshard/reshard ops to a compute-only schedule.

Operates on one pp-rank's schedule at a time, since there is no
cross-pp-rank coordination needed for FSDP.  (Unshard/Reshard is across
DP ranks within a PP group).

Uses a heuristic based on examining the next N stages to run compute
operations on this rank, evicting (resharding) and fetching (unsharding)
ahead of time to give unshard operations a chance to overlap with
compute and PP comms.
- this heuristic has not been validated and may not be optimal
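
A rough sketch of the look-ahead idea (all names and the action encoding are illustrative, not the actual pass):

```python
def add_fsdp_actions(compute_actions, lookahead=2):
    # compute_actions: list of (op, stage) pairs for one pp-rank, e.g. ("F", 0).
    out, resident = [], set()
    for i, (op, stage) in enumerate(compute_actions):
        upcoming = {s for _, s in compute_actions[i : i + lookahead]}
        for s in sorted(resident - upcoming):
            out.append(("RESHARD", s))      # evict stages not needed soon
            resident.discard(s)
        for s in sorted(upcoming - resident):
            out.append(("UNSHARD", s))      # prefetch so the all-gather can overlap
            resident.add(s)
        out.append((op, stage))
    return out

print(add_fsdp_actions([("F", 0), ("F", 1), ("B", 1), ("B", 0)]))
```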

Makes the assumption that it's fine to add the UNSHARD/RESHARD actions
to the schedule regardless of if FSDP will actually be used.
- this way, users do not have to tell us at PP schedule creation time if
  they plan to use FSDP or DDP
- it is trivial to implement UNSHARD/RESHARD as no-ops inside the
  runtime, if FSDP is not detected on the stage module

TODO
- also add FSDP's reduce-scatter? or is it sufficient to leave this
  handled by PipelineStage at 'last backward' time
- validate 'next N stages' heuristic and expose an API if needed
- add an e2e test

Co-authored-by: Howard Huang <howardhuang@fb.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129810
Approved by: https://github.com/kwen2501, https://github.com/H-Huang
2024-08-02 20:38:17 +00:00
f379bbd46d [dynamo] support inspect.signature.bind (#132330)
Fixes https://github.com/pytorch/pytorch/issues/93760.

This was not that small of a task...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132330
Approved by: https://github.com/jansel
ghstack dependencies: #132329
2024-08-02 20:37:05 +00:00
642257db1a Update the FQN for auto_functionalized HOO. (#132171)
Summary:
as title.

torch._higher_order_ops.auto_functionalize.auto_functionalized is a Python FQN which should NOT be used to talk to the backends; we should use the standard FQN torch.ops.higher_order.auto_functionalized instead.

Test Plan: buck test mode/opt caffe2/test:test_export -- -r test_custom_op_auto_functionalize_pre_dispatch

Differential Revision: D60468759

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132171
Approved by: https://github.com/SherlockNoMad
2024-08-02 20:34:50 +00:00
dccce77935 C++ network flow implementation in c10 (#132188)
The functorch partitioners use network flow to split the joint graph into a forward and backward graph. Internally, we've found that upgrading to networkx 2.8.8 (from 2.5) results in some hard-to-debug failures (internal reference: https://fburl.com/workplace/jrqwagdm). And I'm told that there's interest in removing the python dependency.

So this PR introduces a C++ implementation that mirrors the API provided by networkx. We'll need to add python bindings and do some additional testing to verify correctness.
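
For context, a tiny example of the kind of networkx call being mirrored (a toy graph, not the partitioner's actual joint graph):

```python
import networkx as nx

g = nx.DiGraph()
g.add_edge("source", "a", capacity=1.0)
g.add_edge("a", "sink", capacity=2.0)

# Min-cut / max-flow is what the partitioner uses to split the joint graph.
cut_value, (reachable, non_reachable) = nx.minimum_cut(g, "source", "sink")
print(cut_value)  # 1.0 -- the bottleneck edge source -> a
```
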
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132188
Approved by: https://github.com/Chillee
2024-08-02 20:30:59 +00:00
f49d5e30eb Change owners of test/test_transformers.py to module: multi-headed-attention (#132519)
So flaky tests get tagged with `module: multi-headed-attention` instead of `module: nn`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132519
Approved by: https://github.com/Skylion007
2024-08-02 20:12:33 +00:00
e81e74ca6c [dynamo] revert map/zip iterator related changes (#132528)
Need to revert due to internal hangs: S437700

This reverts commit b6c1490cc02316ffe85e5ae74651d80f0158ba64.

Revert "[dynamo] implement IteratorVariable and polyfill fallbacks for enumerate (#131725)"

This reverts commit 2576dbbc35d66e8e9ed6cb12216ccc424cb87ec3.

Revert "[dynamo] add itertools repeat/count bytecode reconstruction (#131716)"

This reverts commit 35b4de32fafc5ad024c20ef1275711bffc557ae9.

Revert "[dynamo] add lazy IteratorVariable implementations for map and zip (#131413)"

This reverts commit 7d282d87550787d8269593093519c2ad7c5032cd.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132528
Approved by: https://github.com/ZainRizvi
2024-08-02 19:40:57 +00:00
b71cd149ce Fix file lock issue in AotCodeCompiler (#132343)
Summary:
It looks like there are several places in AotCodeCompiler that write files in ways that aren't safe for concurrency. There's a filelock to cope with that, but it seems like the lock path isn't quite robust enough to prevent races. We have an internal stress test failing when executing multiple concurrent versions of the test. It seems as though there's some variability in the content we write to the cpp file, which means we can get a different 'key' across different runs. The lock path includes that key in the lock path name, but the path for the "consts_path" is computed separately. Therefore, I see things like this:

- The computed 'key' is `cp5tgbuxuegvg5g2j7oi6u74nkf3v7mx5w3qzl6qbedtmw5tq77z`
- The lock_path (based on the key) is: `/tmp/torchinductor_slarsen/locks/cp5tgbuxuegvg5g2j7oi6u74nkf3v7mx5w3qzl6qbedtmw5tq77z.lock`
- The cpp path is (also includes the key) is: `/tmp/torchinductor_slarsen/cenzkqfnhu53mrhrdhzjtnblzyma2hgmeo7hai5yqsxzirdavurh/cp5tgbuxuegvg5g2j7oi6u74nkf3v7mx5w3qzl6qbedtmw5tq77z.cpp`
- The consts_path (not based on the key) is: `/tmp/torchinductor_slarsen/cenzkqfnhu53mrhrdhzjtnblzyma2hgmeo7hai5yqsxzirdavurh/cifbshkqkbsurzldsyi2vl5bsnhvejmavys4kktpwrzmpo4ysuoy.bin`

So we have different test instances using different lock paths, but touching the same consts_path and therefore stomping on each other's consts_path. To fix, include the key in the consts_paths.
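
A schematic sketch of the fix (hypothetical helper and paths, not the PR's actual code): derive every per-compilation artifact path, including the consts `.bin`, from the same content key that names the lock.

```python
import os
from filelock import FileLock  # third-party 'filelock' package used by inductor

def artifact_paths(cache_dir: str, key: str):
    os.makedirs(os.path.join(cache_dir, "locks"), exist_ok=True)
    lock_path = os.path.join(cache_dir, "locks", f"{key}.lock")
    cpp_path = os.path.join(cache_dir, f"{key}.cpp")
    consts_path = os.path.join(cache_dir, f"{key}.consts.bin")  # keyed, so concurrent runs can't collide
    return lock_path, cpp_path, consts_path

lock_path, cpp_path, consts_path = artifact_paths("/tmp/torchinductor_demo", "cp5tgbexamplekey")
with FileLock(lock_path):
    pass  # write cpp_path / consts_path while holding the lock
```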

Test Plan: Ran internal stress test. Repro'd failure and verified this change fixes it.

Differential Revision: D60552021

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132343
Approved by: https://github.com/desertfire
2024-08-02 19:01:37 +00:00
bcb4f7c172 Revert "Grouped Query Attention (#128898)"
This reverts commit 6b28af1b79eaa63e2f423d925bbd42330582983f.

Reverted https://github.com/pytorch/pytorch/pull/128898 on behalf of https://github.com/ZainRizvi due to Sorry, this broke a bunch of tests internally. See D60638265 ([comment](https://github.com/pytorch/pytorch/pull/128898#issuecomment-2265961038))
2024-08-02 18:58:46 +00:00
afca6f5b47 [PT2][Optimus] Add missing example value for introduced nodes (#132297)
Summary:
We observed that many nodes introduced during split-cat and batch fusion pattern optimization did not have example value metadata, which causes problems in our follow-up pattern optimizations, so we add all missing values.

We also fix bugs in some meta updates and a corner-case bug in the old pattern, which caused problems in the follow-up pattern optimization.

We delete the merge_stack_tahn_unbind_pass pattern, which was designed for the cmf model and can be replaced by the more advanced pattern we added; removing it makes maintenance easier.

Test Plan:
# unit test
```
buck2 test //caffe2/test/inductor:split_cat_fx_passes
```

Test UI: https://www.internalfb.com/intern/testinfra/testrun/15481123762720165
Network: Up: 230KiB  Down: 702KiB  (reSessionID-756346bf-6da3-4fa0-8d03-1b4fd61e0a7a)
Jobs completed: 30. Time elapsed: 7:23.9s.
Cache hits: 20%. Commands: 5 (cached: 1, remote: 0, local: 4)
Tests finished: Pass 9. Fail 0. Fatal 0. Skip 1. Build failure 0

```
buck2 test @mode/opt pytorch/diff_train_tests/ads/optimus:local_pt2_runner
```

Network: Up: 1.3GiB  Down: 84MiB  (reSessionID-ff135cdd-e42c-4ab5-8217-907ada465f01)
Jobs completed: 61. Time elapsed: 21:56.5s.
Cache hits: 0%. Commands: 39 (cached: 0, remote: 0, local: 39)
Tests finished: Pass 8. Fail 0. Fatal 0. Skip 0. Build failure 0

# benchmark

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run @mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ig_ctr" --flow_id 584880697
```

Counter({'pattern_matcher_nodes': 752, 'pattern_matcher_count': 732, 'normalization_pass': 328, 'normalization_aten_pass': 12, 'scmerge_cat_removed': 5, 'scmerge_cat_added': 4, 'scmerge_split_removed': 3, 'unbind_stack_pass': 3, 'batch_tanh': 2, 'scmerge_split_sections_removed': 2, 'scmerge_split_added': 2, 'optimize_cat_inputs_pass': 1, 'unbind_cat_to_view_pass': 1, 'fxgraph_cache_miss': 1})

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132297
Approved by: https://github.com/jackiexu1992
2024-08-02 18:57:12 +00:00
24d0a32f98 Revert "[dynamo] Wrap unspecialized nn module getattr with UnspecializedNNModuleSource (#132308)"
This reverts commit aa0ed2496f5bf38768c9eda13112fd43359548bb.

Reverted https://github.com/pytorch/pytorch/pull/132308 on behalf of https://github.com/anijain2305 due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/132308#issuecomment-2265959993))
2024-08-02 18:55:51 +00:00
e696f17467 Revert "[dynamo] Track builtin nn modules with UnspecializedBuiltinNNModuleVariable (#132314)"
This reverts commit d6a82ce39bd8e705a4cc2cebb886f4476a7250cf.

Reverted https://github.com/pytorch/pytorch/pull/132314 on behalf of https://github.com/anijain2305 due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/132314#issuecomment-2265953367))
2024-08-02 18:52:38 +00:00
e4e3575fb0 Revert "[11/N] Use std::nullopt and std::optional (#132396)"
This reverts commit d7d61904936617a6a43782868d0b1004cb70dfc0.

Reverted https://github.com/pytorch/pytorch/pull/132396 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR has a dependency on another PR (https://github.com/pytorch/pytorch/pull/128898) that has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/132396#issuecomment-2265952528))
2024-08-02 18:49:42 +00:00
59b73079a0 Revert "Always use high precision for SDPA math backend (#128922)"
This reverts commit fbf3bc0a602b4ec1eab169202d5b1158fe2c1def.

Reverted https://github.com/pytorch/pytorch/pull/128922 on behalf of https://github.com/ZainRizvi due to Sorry, but this PR has a dependency on another PR (https://github.com/pytorch/pytorch/pull/128898) that has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/128922#issuecomment-2265949958))
2024-08-02 18:46:50 +00:00
193a19ee91 Revert "[dynamo] Treat attr of unspecialized buiitin nn modules as static (#132318)"
This reverts commit 7b816d7d6d5d521f913c78f897790f66112c7d84.

Reverted https://github.com/pytorch/pytorch/pull/132318 on behalf of https://github.com/anijain2305 due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/132318#issuecomment-2265945433))
2024-08-02 18:43:32 +00:00
b8f7019df0 Revert "[dynamo] Track params/buffers and mark them as static (#132334)"
This reverts commit babb249a89b51931afe16db8b498ff72cd433afc.

Reverted https://github.com/pytorch/pytorch/pull/132334 on behalf of https://github.com/anijain2305 due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/132334#issuecomment-2265942261))
2024-08-02 18:41:19 +00:00
e0514a5b99 [AOTI][refactor] Consolidate how python_kernel_name is set (#132320)
Summary: Similar to the refactoring of set_cpp_kernel, consolidate the ways of setting python_kernel_name

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132320
Approved by: https://github.com/angelayi, https://github.com/chenyang78
ghstack dependencies: #132319
2024-08-02 18:34:25 +00:00
a9e1133faa [AOTI][refactor] Move set_cpp_kernel to base class (#132319)
Summary: Consolidate how cpp_kernel_name is set and make it a method in the base ExternKernel class.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132319
Approved by: https://github.com/angelayi, https://github.com/chenyang78
2024-08-02 18:34:24 +00:00
df781343e2 Link libc10 to pthreads (#132484)
It gets linked as a transitive dependency of `libmkl` on x86_64, but it must be specified explicitly on s390x.

The linking issue only appears when using gcc-13 with the gold linker.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132484
Approved by: https://github.com/malfet
2024-08-02 18:03:44 +00:00
19897a1647 [export] change deepcopy to copy in _replace_set_grad_with_hop pass.. (#132181)
Summary:
Fixes T197371132.

Previously, we called copy.deepcopy to avoid mutating the original signature. However, this causes errors when the signature references a FakeScriptObject, which in turn references a real torch.ScriptObject, failing with "The tensor has a non-zero number of elements, but its data is not allocated yet."

We therefore just change it to a shallow copy. This should be good enough for guarding the signature.
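For illustration only, a tiny sketch (with made-up data, not the real export signature type) of why a shallow copy is enough here: the top-level container can be mutated safely while the referenced objects are shared rather than deep-copied.
```
import copy

signature = {"args": ["x", "obj"], "script_obj": object()}

# copy.deepcopy(signature) would also try to copy the referenced objects,
# which fails for fake script objects backed by unallocated tensors.
guarded = copy.copy(signature)
guarded["args"] = guarded["args"] + ["extra"]             # mutate the copy only

assert signature["args"] == ["x", "obj"]                   # original left untouched
assert guarded["script_obj"] is signature["script_obj"]    # reference shared, not copied
```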

Test Plan: buck2 run 'fbcode//mode/opt' torchrec/distributed/tests:test_pt2 -- --filter-text "test_sharded_quant_ebc_non_strict_export"

Differential Revision: D60476839

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132181
Approved by: https://github.com/BoyuanFeng
2024-08-02 17:57:09 +00:00
cyy
87d58cc81f [4/N] Fix clang-tidy warnings in aten/src/ATen/native/ (#132001)
Follows #132000
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132001
Approved by: https://github.com/Skylion007
2024-08-02 17:42:02 +00:00
cyy
207e24ff83 Enable clang-tidy on aten/src/ATen/cudnn/* (#130133)
Continued work of applying clang-tidy

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130133
Approved by: https://github.com/eqy, https://github.com/Skylion007
2024-08-02 17:39:37 +00:00
0c491702c4 [ONNX] Define the TORCH_ONNX_USE_EXPERIMENTAL_LOGIC flag (#132299)
Define the `TORCH_ONNX_USE_EXPERIMENTAL_LOGIC` flag to allow for enabling the new torch.onnx logic and hiding them during migration and testing. The actual logic migration will happen after.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132299
Approved by: https://github.com/titaiwangms
2024-08-02 17:06:11 +00:00
9167113c16 [easy][MPS] add torch.mps.is_available() (#132426)
Just return "torch.mps.device_count() > 0", which, based on the implementation of device_count(), seems to be equivalent.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132426
Approved by: https://github.com/malfet
2024-08-02 17:05:49 +00:00
fc32732596 Don't attempt to compute hints for unbacked expressions (#132060)
This breaks the inference we made that if you cat an N-D tensor with a 1-D tensor of size (u0,), the u0 must be zero, but no one really wanted that anyway...

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132060
Approved by: https://github.com/Skylion007
2024-08-02 16:39:14 +00:00
8fff976355 Revert "Refactor thunkify to return proper thunk abstraction (#132407)"
This reverts commit d903e664c6b70ad17e0b316ef39d71be5edddc87.

Reverted https://github.com/pytorch/pytorch/pull/132407 on behalf of https://github.com/ezyang due to test_correct_module_names ([comment](https://github.com/pytorch/pytorch/pull/132407#issuecomment-2265754857))
2024-08-02 16:32:43 +00:00
1197550876 Revert "Don't attempt to compute hints for unbacked expressions (#132060)"
This reverts commit d342dc0179944dd317b509b3432da81701836444.

Reverted https://github.com/pytorch/pytorch/pull/132060 on behalf of https://github.com/ezyang due to test_correct_module_names ([comment](https://github.com/pytorch/pytorch/pull/132407#issuecomment-2265754857))
2024-08-02 16:32:43 +00:00
296c339f98 Ensure compiler collective is called even when no graph is compiled (#132163)
It's very important to make sure we always run the compiler collective, because if we don't, we will fail to apply automatic dynamic at all.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132163
Approved by: https://github.com/jansel
2024-08-02 16:31:54 +00:00
82b6480b0a Update SavedTensorHooks TLS stack to use SafePyObject (#131700)
Previously, we had to manually manage refcounting when updating the TLS saved-variable stack. With this PR, things should be handled automatically by the SafePyObject.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131700
Approved by: https://github.com/albanD
2024-08-02 16:27:16 +00:00
9eeb5eebab Revert "Ensure compiler collective is called even when no graph is compiled (#132163)"
This reverts commit 0d9c9716b2db52281f6f10a113e07936deeb6e0a.

Reverted https://github.com/pytorch/pytorch/pull/132163 on behalf of https://github.com/ezyang due to test_correct_module_names ([comment](https://github.com/pytorch/pytorch/pull/132163#issuecomment-2265729449))
2024-08-02 16:16:31 +00:00
fca2dba7ca [pytorch][counters] Pybind for WaitCounter (#132357)
Summary:
Basic pybind integration for WaitCounter providing a guard API.
Also fixes broken copy/move constructor in WaitGuard (it wasn't really used with the macro-based C++ API).

Test Plan: unit test

Differential Revision: D60557660

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132357
Approved by: https://github.com/jamesperng, https://github.com/asiab4
2024-08-02 16:08:10 +00:00
d224857b3a Revert "Change signature of CompilerFn for register_backend decorator (#131880)"
This reverts commit ccf9ce8e8c3c86269003547d976da5ed1fc9511b.

Reverted https://github.com/pytorch/pytorch/pull/131880 on behalf of https://github.com/albanD due to Breaking lint ([comment](https://github.com/pytorch/pytorch/pull/131880#issuecomment-2265682757))
2024-08-02 15:49:09 +00:00
63eb06c051 Disable SymDispatchMode when torch.compile'ing (#132433)
Partially addresses https://github.com/pytorch/pytorch/issues/132417

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132433
Approved by: https://github.com/ydwu4
2024-08-02 15:23:49 +00:00
cyy
5aafdc2f87 [3/N] Fix clang-tidy warnings in aten/src/ATen/native/ (#132000)
Follows #131834

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132000
Approved by: https://github.com/ezyang
2024-08-02 15:00:38 +00:00
78f4a3919f Remove duplicate XPU switch case in DispatchStub (#132480)
This PR fixes the issue mentioned in https://github.com/pytorch/pytorch/issues/132481. Duplicated XPU switch cases exist in `DispatchStub.cpp` and this PR removes it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132480
Approved by: https://github.com/nautsimon, https://github.com/malfet
2024-08-02 14:39:00 +00:00
ccf9ce8e8c Change signature of CompilerFn for register_backend decorator (#131880)
## Description
Add `...` to show that the CompilerFn for a custom backend can take additional options.

Re: Recreated closed PR https://github.com/pytorch/pytorch/pull/110006
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131880
Approved by: https://github.com/jansel
2024-08-02 14:30:58 +00:00
053e5080f6 Enable exception chaining in call_user_compiler (#131186)
Enable exception chaining of the BackendCompilerFailed exception in call_user_compiler. This prevents the original exception and traceback, which are often the most useful for debugging, from being discarded.
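A minimal sketch of the chaining pattern this enables; the class and backend below are simplified stand-ins for the real dynamo types, not the actual implementation.
```
class BackendCompilerFailed(RuntimeError):
    def __init__(self, backend_name, inner_exc):
        super().__init__(f"backend={backend_name!r} raised: {inner_exc}")

def call_user_compiler(compiler_fn, graph):
    try:
        return compiler_fn(graph)
    except Exception as e:
        # `from e` keeps the original exception as __cause__, so its traceback
        # is printed instead of being discarded.
        raise BackendCompilerFailed(compiler_fn.__name__, e) from e

def failing_backend(graph):
    raise RuntimeError("shape error in scatter op")

try:
    call_user_compiler(failing_backend, graph=None)
except BackendCompilerFailed as err:
    assert isinstance(err.__cause__, RuntimeError)
```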

Example output without the patch
> Traceback (most recent call last):
> [Traceback from test_slice_scatter_issue122291 to raise BackendCompilerFailed(self.compiler_fn, e).with_traceback(]
> [Trace back from call_user_compiler to  _inplace_generalized_scatter raise RuntimeError]
>  torch._dynamo.exc.BackendCompilerFailed: backend='inductor' raised:
>  RuntimeError: shape error in scatter op, can not broadcast torch.Size([16, 2]) to torch.Size([16, 6])
> Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

Example output with the patch
> Traceback (most recent call last):
> [Traceback from_inplace_generalized_scatter to raise error_type(message_evaluated)]
> RuntimeError: expand: attempting to expand a dimension of length 2!
> The above exception was the direct cause of the following exception:
> Traceback (most recent call last):
> [Traceback from  call_user_compiler to  _inplace_generalized_scatter raise RuntimeError]
> RuntimeError: shape error in scatter op, can not broadcast torch.Size([16, 2]) to torch.Size([16, 6])
> The above exception was the direct cause of the following exception:
> Traceback (most recent call last):
> [Traceback from test_slice_scatter_issue122291 to raise BackendCompilerFailed(self.compiler_fn, e) with e]
> RuntimeError: shape error in scatter op, can not broadcast torch.Size([16, 2]) to torch.Size([16, 6])
> Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131186
Approved by: https://github.com/jansel
2024-08-02 14:07:06 +00:00
48929184e9 AutoHeuristic: mixed_mm heuristic for A100 (#131613)
This PR introduces changes to AutoHeuristic that allow one to learn a heuristic as a decision tree. I used this to learn a heuristic for mixed_mm on A100 that consistently performs better than the default choice (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L402).

This is how the results look like:
Explanation of columns:
**wrong_max_spdup**: In the worst case, how much better would the best choice have been
**wrong_gman_spdup**: For inputs where the heuristic is wrong, how much better is the best choice on average (geomean)
**max_spdup_default**: Highest speedup achieved by the learned heuristic over the default choice
**gman_spdup_default**: Geomean speedup achieved by the learned heuristic over the default choice
**max_slowdown_default**: If the default choice is better than the choice predicted by the learned heuristic, how much is it better in the worst case
**non_default_preds**: Number of times the learned heuristic predicted a choice that is not the default choice
**default_better**: Number of times the default choice is better than the choice made by the heuristic
```
  set     crit  max_depth  min_samples_leaf  correct  wrong  unsure  total  wrong_max_spdup  wrong_gman_spdup    max_spdup_default  gman_spdup_default  max_slowdown_default  non_default_preds  default_better
train  entropy          5              0.01     2376    740     323   3439         1.855386          1.063236            11.352318            3.438279              1.022164               3116               2
 test  entropy          5              0.01      563    183      71    817         1.622222          1.060897            10.084181            3.507741              1.017039                746               2
```

While the number of wrong predictions is high, on average the best choice is only around 6% better. What is important is that the choice predicted by the learned heuristic performs better than the default choice.

I evaluated my heuristic on gpt-fast `meta-llama/Llama-2-7b-chat-hf` with int8 weight quantization. To get the `tuned_mixed_mm` to trigger, I had to replace `F.linear()` in https://github.com/pytorch-labs/gpt-fast/blob/main/quantize.py#L355 with `torch.matmul(input, self.weight.t().to(dtype=input.dtype))` because the mixed_mm pattern does not match if there is a transpose between a cast and the matmul.
|batch size|prompt length| fallback    |  heuristic  | speedup |
|----------|-------------|------------:|------------:|--------:|
|     1    |      7      | 75.31 tok/s | 148.83 tok/s|  1.97   |
|     1    |     11      | 75.99 tok/s | 148.15 tok/s|  1.94   |
|     4    |      7      | 103.48 tok/s | 472.00 tok/s|  4.56   |
|     4    |     11      | 103.56 tok/s |  371.36 tok/s|  3.58   |
|     8    |      7      | 201.92 tok/s | 813.44 tok/s|  4.02   |
|     8    |     11      | 201.76 tok/s |  699.36 tok/s|  3.46   |

Currently, the heuristic only applies to the following inputs:
- m <= 128, k >= 1024, n >= 1024 (For these sizes, one of the triton kernels wins in most cases, but the heuristic still has to be careful to not choose a config that performs worse than the fallback)
- k % 256 == 0 (If k is not a multiple of the block size, some choices perform extremely bad. In one case one config, that usually performs very well, was 130x slower.)
- mat1 not transposed
- mat2 transposed (In some cases, it was hard for the learned heuristic to detect some cases where it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131613
Approved by: https://github.com/eellison
2024-08-02 13:54:37 +00:00
cyy
b9cb1abf65 [12/N] Use std::optional (#132361)
Follows #132396

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132361
Approved by: https://github.com/eqy
2024-08-02 13:46:46 +00:00
56f2917bef [dynamo] Bugfix for recently added str handler (#132461)
There is probably more work to do to improve support, but this is a hot fix so we don't fail on `.__func__`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132461
Approved by: https://github.com/williamwen42
ghstack dependencies: #132425
2024-08-02 13:16:39 +00:00
0d9c9716b2 Ensure compiler collective is called even when no graph is compiled (#132163)
It's very important to make sure we always run the compiler collective, because if we don't, we will fail to apply automatic dynamic at all.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132163
Approved by: https://github.com/jansel
2024-08-02 12:18:34 +00:00
d342dc0179 Don't attempt to compute hints for unbacked expressions (#132060)
This breaks the inference we made that if you cat an N-D tensor with a 1-D tensor of size (u0,), the u0 must be zero, but no one really wanted that anyway...

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132060
Approved by: https://github.com/Skylion007
ghstack dependencies: #131649, #132407
2024-08-02 12:09:37 +00:00
d903e664c6 Refactor thunkify to return proper thunk abstraction (#132407)
This is superior to lru_cache because (1) it's more explicit and (2) it
doesn't leak the original function after it's been forced.
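A minimal sketch of such a thunk (a generic illustration, not the torch-internal version): the wrapped callable is dropped once forced, so it cannot be kept alive the way an lru_cache-wrapped function would be.
```
class Thunk:
    """Delay a zero-argument computation; drop the callable after forcing."""

    __slots__ = ("_fn", "_value", "_forced")

    def __init__(self, fn):
        self._fn = fn
        self._value = None
        self._forced = False

    def force(self):
        if not self._forced:
            self._value = self._fn()
            self._fn = None          # release the original function
            self._forced = True
        return self._value

t = Thunk(lambda: 2 + 2)
assert t.force() == 4
assert t.force() == 4 and t._fn is None   # memoized, and the callable is gone
```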

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132407
Approved by: https://github.com/albanD
ghstack dependencies: #131649
2024-08-02 12:09:37 +00:00
290f09f829 Ban decorator usage of dynamo_timed (#132328)
This is a more manual version of https://github.com/pytorch/pytorch/pull/132073 that just manually creates the new function at each call site instead of magicking it with clone. Review with whitespace diffs off.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132328
Approved by: https://github.com/albanD
2024-08-02 12:00:46 +00:00
8668bc279d [inductor] contine to fix restrict keyword. (#132463)
This continues the work from https://github.com/pytorch/pytorch/pull/132394; all `restrict` keywords in `cpp_micro_gemm.py` are now fixed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132463
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-08-02 11:09:17 +00:00
d2e9a8bf6d [Reland] Fix inlining module-scoped store global (#132439)
Reland https://github.com/pytorch/pytorch/pull/132224

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132439
Approved by: https://github.com/anijain2305
2024-08-02 09:13:52 +00:00
a4ea776881 Add pinned memory support to sparse COO/CSR/CSC/BSR/BSC tensors (#129645)
As in the title:

To register the indices/values of a sparse XYZ tensor with CUDA (i.e. pin them in page-locked memory), the following methods are supported (see the usage sketch after the list):
- `sparse_xyz_tensor(indices, values, pin_memory=True)`
- `sparse_xyz_tensor(indices, values).pin_memory()`
- `sparse_xyz_tensor(indices.pin_memory(), values.pin_memory())`
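A usage sketch of the COO case, assuming a CUDA-enabled build that includes this change; the CSR/CSC/BSR/BSC constructors follow the same pattern.
```
import torch

indices = torch.tensor([[0, 1], [1, 0]])
values = torch.tensor([3.0, 4.0])

# The three equivalent spellings from the list above:
a = torch.sparse_coo_tensor(indices, values, (2, 2), pin_memory=True)
b = torch.sparse_coo_tensor(indices, values, (2, 2)).pin_memory()
c = torch.sparse_coo_tensor(indices.pin_memory(), values.pin_memory(), (2, 2))

assert a.is_pinned() and b.is_pinned()
```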

Fixes https://github.com/pytorch/pytorch/issues/115330

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129645
Approved by: https://github.com/amjames, https://github.com/cpuhrsch, https://github.com/eqy
2024-08-02 08:55:55 +00:00
babb249a89 [dynamo] Track params/buffers and mark them as static (#132334)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132334
Approved by: https://github.com/ezyang, https://github.com/mlazos
2024-08-02 08:55:43 +00:00
2ee9895304 Support optimizer capturable on hpu and xpu (#132119)
as title
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132119
Approved by: https://github.com/jgong5, https://github.com/janeyx99
2024-08-02 08:19:52 +00:00
f936e68506 [CI] Update CPU inductor smoke test model list and target (#132221)
Fixes #132097

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132221
Approved by: https://github.com/desertfire
2024-08-02 07:09:54 +00:00
eqy
e5560d10f4 [CUDA][SDPA] Fix expect export on sm90+ (#132194)
CC @drisspg not sure what is causing the scale=0.125 to be omitted here...

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132194
Approved by: https://github.com/drisspg
2024-08-02 05:43:58 +00:00
7d8b95e8fb [easy] more debug in partitioner assert (#132456)
Print the name of the node that didn't have a good meta['val']. An internal model is failing with this assert; we need this info to debug further.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132456
Approved by: https://github.com/Chillee
2024-08-02 05:07:01 +00:00
cyy
35d14d22a0 Fix some issues detected by static analysis tools (#131989)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131989
Approved by: https://github.com/ezyang
2024-08-02 04:18:57 +00:00
5ea0f51187 [Dynamo] Support abc.MutableMapping.get (#132363)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132363
Approved by: https://github.com/anijain2305, https://github.com/mlazos
2024-08-02 04:17:35 +00:00
2b86a7fcc7 fix printing of scores and mods names (#132424)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132424
Approved by: https://github.com/Skylion007
2024-08-02 03:30:23 +00:00
cyy
07fe1dd58f [13/N] Fix clang-tidy warnings in jit (#132411)
Follows  #132209

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132411
Approved by: https://github.com/Skylion007
2024-08-02 03:14:09 +00:00
1250171866 Use fresh inductor cache on unit tests (#132432)
Summary: This makes it so that stress tests running in separate processes on the same machine don't clobber each other's directories. InductorTestCase automatically makes a fresh tmpdir for each unit test.

Test Plan:
```
buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_aot_autograd_cache.py::AOTAutogradCacheTests::test_nn_module_with_params_global_constant' --run-disabled --stress-runs 10 --record-results
```

Now passes

Differential Revision: D60604811

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132432
Approved by: https://github.com/masnesral
2024-08-02 03:02:36 +00:00
6c4ce4331c [dynamo][exception] Raise Observed KeyError exception for dict __getitem__ (#132425)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132425
Approved by: https://github.com/yanboliang, https://github.com/Skylion007
2024-08-02 02:58:31 +00:00
cd5452aace [CUDA] is_bf16_supported() should not crash if there are no GPUs (#132313)
`False` is the correct answer on a system that does not have any CUDA GPUs.
- Added regression test to TestTorch.
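Usage sketch of the behavior being fixed; torch.cuda.is_bf16_supported() is the existing public API, and after this fix it simply returns False on a machine without CUDA GPUs instead of crashing.
```
import torch

# Safe even when no CUDA GPUs are present: returns False instead of raising.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float32
x = torch.zeros(4, dtype=dtype)
```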

Fixes https://github.com/pytorch/pytorch/issues/132303

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132313
Approved by: https://github.com/eqy, https://github.com/syed-ahmed
2024-08-02 02:50:43 +00:00
3a355c1891 Correct sample creation of torch.histogram in UT op_db to align PyTorch defined operator semantics (#131630)
Fixes #130916
Per the semantics defined in [torch.histogram](https://pytorch.org/docs/stable/generated/torch.histogram.html#torch-histogram), the bins tensor must be an increasing sequence, so random input doesn't make sense for torch.histogram.
The test case compares the CPU backend against another backend. When the input is random, the kernel implementations in other backends have to match the CPU kernel exactly, or the case fails.
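For reference, a well-formed call per those semantics, using an explicitly increasing bins tensor:
```
import torch

x = torch.randn(1000)
bins = torch.linspace(-3.0, 3.0, steps=11)   # 10 equal-width, increasing bin edges
hist, bin_edges = torch.histogram(x, bins)
assert hist.sum() <= x.numel()               # values outside [-3, 3] are not counted
```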

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131630
Approved by: https://github.com/EikanWang, https://github.com/albanD
2024-08-02 01:51:09 +00:00
bc510916fa Only make wait_tensor as a side_effect op (#132341)
Summary:
https://github.com/pytorch/pytorch/pull/131023 added all the collective ops to the side-effect list, but we should only mark wait_tensor as a side-effect op, because every collective op should have a corresponding wait_tensor.

We should switch to using the higher-order effect token.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132341
Approved by: https://github.com/yf225
2024-08-02 01:24:40 +00:00
ef426d5183 [nccl] Wrap nccl code update with version check (#130419)
Fixes the issue where PyTorch cannot be built with nccl < 2.13 after https://github.com/pytorch/pytorch/issues/128756.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130419
Approved by: https://github.com/eqy, https://github.com/malfet
2024-08-02 01:22:07 +00:00
50ed6ce277 Support built-in id function for TensorVariable on parameters (#130100)
Fixes #130087

This patch tries to provide a built-in id function implementation for TensorVariable when the id function is called on tensors like module parameters. The id function call on intermediate tensors is not supported.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130100
Approved by: https://github.com/anijain2305
2024-08-02 01:19:25 +00:00
64235c6a71 Skip test_fp8 in test_aot_inductor to temporarily (#132453)
https://github.com/pytorch/pytorch/pull/130422 caused the test `test.inductor.test_aot_inductor.AOTInductorTestABICompatibleCuda. test_fp8_abi_compatible_cuda` to fail (unclear why it was not run in GitHub) with `torch/csrc/inductor/aoti_torch/c/shim.h:390:34: note: candidate function not viable: requires 9 arguments, but 6 were provided`. We suspect that the kernel produced by the lowering function, which is no longer a fallback choice, has a schema issue at codegen. Fp8 is not used through AOTI currently and it is difficult to revert the PR (BE week), so we'll skip the test temporarily while making the new lowering compatible with AOTI.

Testing: the failed test on internal diff is now skipped.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132453
Approved by: https://github.com/henrylhtsang
2024-08-02 01:18:03 +00:00
cyy
56334c854c [2/N] Fix clang-tidy warnings in aten/src/ATen/native/*.{cpp,h} (#131834)
Follows #130798

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131834
Approved by: https://github.com/ezyang
2024-08-02 00:49:30 +00:00
ee1ef066fd add src map to data-dependent errors (#132393)
Summary: Currently, suggested fixes pick a map from symbols to user variables. However, it is possible that many user variables point to the same symbol, and some may be preferred over others, so we dump this info as well.

Test Plan: updated test

Sample error with new format:
```
Could not guard on data-dependent expression u2 >= 0 (unhinted: u2 >= 0).  (Size-like symbols: none)

<snip>

The following call raised this error:
  File "test/export/test_export.py", line 1950, in forward
    return r.view(items[0], items[2])

To fix the error, insert one of the following checks before this call:
  1. torch._check(items[2] >= 0)
  2. torch._check(items[2] < 0)

(These suggested fixes were derived by replacing `u2` with items[2] in u2 >= 0 and its negation.)
```

Differential Revision: D60574478

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132393
Approved by: https://github.com/BoyuanFeng
2024-08-02 00:31:12 +00:00
625af2d27c [dynamo] fix add_push_null callsites with CALL_FUNCTION_EX (#132329)
Also fix a bug in `PyCodegen.add_push_null` where in Python <= 3.12, we may accidentally duplicate a NULL instead of the object on the stack before it.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132329
Approved by: https://github.com/anijain2305
2024-08-02 00:29:21 +00:00
0016be8051 [Docker] Replace epel release rpm by yum install (#132449)
The URL https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm is no longer available, so we replace it with a yum epel-release install.

As a backup, the package is still available here: https://archives.fedoraproject.org/pub/archive/epel/7/x86_64/Packages/e/epel-release-7-14.noarch.rpm

Saved on our s3 path, just in case: https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm

Please note, we still use it for installs like this:
```
RUN yum install -y \
    https://repo.ius.io/ius-release-el7.rpm \
	https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm
```

Test in CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132449
Approved by: https://github.com/kit1980, https://github.com/seemethere, https://github.com/malfet
2024-08-02 00:16:03 +00:00
3855ac5a5d Revert "[export] Add print_readable to unflattener (#128617)"
This reverts commit ab9791c0e342753013181eeeab300a05774fc456.

Reverted https://github.com/pytorch/pytorch/pull/128617 on behalf of https://github.com/angelayi due to never got landed internally due to weird flow... sorry ([comment](https://github.com/pytorch/pytorch/pull/128617#issuecomment-2264224466))
2024-08-01 23:47:29 +00:00
0c3ac428a2 [BE][typing] fix types in common pruning (#132309)
BE task. Add typings and remove mypy errors in torch/testing/_internal/common_pruning.py
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132309
Approved by: https://github.com/ColinPeppler
2024-08-01 23:34:33 +00:00
87ddf70fc6 Set weights_only=False in export deserialize_torch_artifact (#132348)
Context:

We are planning to make a BC breaking change to `torch.load` by flipping the default for `weights_only` from `False` --> `True` in a future release. With `weights_only=True`, a custom unpickler is used that limits what can be loaded to state_dicts containing tensors (there is also a way for the user to allowlist specific things to be loaded). The goal of this is to attempt to prevent remote execution of arbitrary code when using `torch.load`.

To my understanding, in export, `torch.load` is used internally to load arbitrary objects, so we should set `weights_only=False` here to prevent the flip from breaking export.
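An illustration of the flag's effect: plain tensors load fine either way, while arbitrary pickled objects require weights_only=False.
```
import io
import torch

buf = io.BytesIO()
torch.save({"weight": torch.ones(2, 2)}, buf)

buf.seek(0)
state = torch.load(buf, weights_only=True)    # OK: tensors/state_dict-like content only

buf.seek(0)
state = torch.load(buf, weights_only=False)   # also OK, but permits arbitrary unpickling
```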

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132348
Approved by: https://github.com/angelayi
2024-08-01 23:25:07 +00:00
1362d51e7d [AOTI] Fix number type for AOTI (#132180)
Fixes #131338

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132180
Approved by: https://github.com/desertfire
2024-08-01 22:43:28 +00:00
35400f750f [torchbind] don't warning for certain skippable methods. (#132306)
Summary:
Skip the warning if the fake script object doesn't implement a fake method for:
1. __obj_flatten__: for real script object only.
2. __set_state__ and __get_state__ for serialization. Don't expect it to be used during tracing.

Test Plan: Existing tests.

Reviewed By: angelayi

Differential Revision: D60478460

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132306
Approved by: https://github.com/angelayi
2024-08-01 22:40:42 +00:00
2f54c38594 [AOTI] Fix bfloat16 in CPU (#132150)
Fixes #122986

- add "typedef at::BFloat16 bfloat16;" to the header of generated cpp file

- Suppress warning: comparison of integer expressions of different signedness: ‘long unsigned int’ and ‘int64_t’ {aka ‘long int’} [-Wsign-compare]
  436 |   if (tensor.numel() != numel) {

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132150
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-08-01 22:26:30 +00:00
a356a03f4a Fix DEBUG=1 asserts for mvlgamma backward with NJT (#132422)
mvlgamma backward trips DEBUG=1 asserts when trying to construct an empty tensor with `layout=torch.jagged`. This happens due to passing `self.options()` to `arange()` in `mvlgamma_backward()`. The fix in this PR unconditionally constructs the `arange()` with the strided layout.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132422
Approved by: https://github.com/albanD
2024-08-01 21:53:16 +00:00
92bebb46fa Support XPU ABI=0 build (#130110)
# Motivation
This PR intends to support ABI=0 build for XPU backend.

# Additional Context
The major change is adding the compilation option `-D__INTEL_PREVIEW_BREAKING_CHANGES` for the host compiler (gcc) and `-fpreview-breaking-changes` for the XPU device kernel compiler (icpx). Why?
Because we use
- gcc to compile host code and link the SYCL runtime, so we need to pass `-D__INTEL_PREVIEW_BREAKING_CHANGES` to tell the host compiler to invoke the ABI-neutral API included in SYCL; and
- icpx to compile device kernel code and link the SYCL runtime, so we need to pass `-fpreview-breaking-changes` to tell the device kernel compiler to build ABI-neutral code.
Also note that `libsycl-preview.so` is an ABI-neutral library but `libsycl.so` is not.

This PR depends on https://github.com/pytorch/pytorch/pull/131643.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130110
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
2024-08-01 21:42:14 +00:00
997f64af38 fastpath FunctionalTensor sizes() (#132084)
Another attempt at fast-pathing sizes() in FunctionalTensor, since it appears to improve compile time perf by up to ~10%. See the investigation from https://github.com/pytorch/pytorch/issues/125977#issuecomment-2122915602.

After looking at some failing tests locally, I realized that we need to handle metadata mutations manually now, since the previous "smarter" size dispatch was handling those updates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132084
Approved by: https://github.com/ezyang
2024-08-01 21:09:22 +00:00
c8958f8f84 Revert "Ban decorator usage of dynamo_timed (#132328)"
This reverts commit 9853c048eb53946eb505424b17ac42ce46b66ac1.

Reverted https://github.com/pytorch/pytorch/pull/132328 on behalf of https://github.com/clee2000 due to seems to have broken functorch/test_aotdispatch.py::TestAOTAutograd::test_input_data_and_metadata_mutation_aliases_other_input [GH job link](https://github.com/pytorch/pytorch/actions/runs/10204547165/job/28233976446) [HUD commit link](9853c048eb).  Test passed on PR, probably a landrace, base is only 10 hours old ([comment](https://github.com/pytorch/pytorch/pull/132328#issuecomment-2263909337))
2024-08-01 20:20:28 +00:00
78927d37f6 Add basic mypy annotations to inductor (#132416)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132416
Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu
ghstack dependencies: #132415
2024-08-01 20:14:25 +00:00
71e22e0959 Add basic mypy annotations to dynamo (#132415)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132415
Approved by: https://github.com/XuehaiPan, https://github.com/jamesjwu
2024-08-01 20:14:25 +00:00
12f61e65eb [mtia][sdpa] MTIA SDPA dispatch via _fused_sdp_choice_stub (#132008)
Summary: as title

Differential Revision: D59823335

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132008
Approved by: https://github.com/mortzur
2024-08-01 20:01:40 +00:00
596f568592 [dtensor][debug] adding js script to pytorch github so that i can host the browser visualizer on pytorch (#132185)
**Summary**
This is the javascript portion that is used in CommDebugMode's visual browser. I have placed it here so that I can host the browser on PyTorch. I am following the same procedure used to host memory_viz: https://github.com/pytorch/pytorch.github.io/blob/site/memory_viz.html

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132185
Approved by: https://github.com/XilunWu
ghstack dependencies: #132070
2024-08-01 19:50:23 +00:00
9853c048eb Ban decorator usage of dynamo_timed (#132328)
This is a more manual version of https://github.com/pytorch/pytorch/pull/132073 that just manually creates the new function at each call site instead of magicking it with clone. Review with whitespace diffs off.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132328
Approved by: https://github.com/albanD
2024-08-01 19:27:58 +00:00
40c8f73099 Revert "Fix inlining module-scoped store global (#132224)"
This reverts commit c3a31d90e7d10a9b89b11396b6f8b20ed52bf394.

Reverted https://github.com/pytorch/pytorch/pull/132224 on behalf of https://github.com/ZainRizvi due to Looks like the new import mock_store_global_crossfile_inline fails internally. Please see D60567756 for details ([comment](https://github.com/pytorch/pytorch/pull/132224#issuecomment-2263768729))
2024-08-01 19:06:36 +00:00
93979e7063 Skip frame if torch dispatch mode enabled (#131828)
Fixes https://github.com/pytorch/pytorch/issues/105929

We now skip frames if a dispatch mode is enabled.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131828
Approved by: https://github.com/bdhirsh, https://github.com/anijain2305
2024-08-01 19:06:20 +00:00
fbf3bc0a60 Always use high precision for SDPA math backend (#128922)
Summary:
feikou observed large numerical gaps when using the math backend on AMD and NV GPUs. This is mainly because we are not using higher-precision FP32 for the intermediate accumulated/materialized parts.

Since the math backend is expected to be slower anyway, and we expect it to generate the correct reference result, it is worth upcasting FP16/BF16 inputs to FP32, doing the computation in FP32/TF32, and then downcasting the FP32 output back to FP16/BF16.
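A simplified sketch of that upcast-compute-downcast pattern (a plain reference attention, not the actual SDPA math backend):
```
import torch

def attention_math_ref(q, k, v):
    orig_dtype = q.dtype
    q, k, v = q.float(), k.float(), v.float()                # upcast FP16/BF16 -> FP32
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    out = torch.softmax(scores, dim=-1) @ v                   # FP32 intermediates
    return out.to(orig_dtype)                                  # downcast at the boundary

q = k = v = torch.randn(2, 4, 8, dtype=torch.float16)
assert attention_math_ref(q, k, v).dtype == torch.float16
```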

Differential Revision: D58710805

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128922
Approved by: https://github.com/xw285cornell, https://github.com/drisspg
2024-08-01 18:55:48 +00:00
0eea2b3947 Cast inputs to low precision kernels in emulate low precision mode (#132345)
Together with https://github.com/pytorch/pytorch/pull/132238, this is sufficient to eliminate the divergence in https://github.com/pytorch/pytorch/issues/132301:

Although we should discuss that issue more at length.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132345
Approved by: https://github.com/zou3519
2024-08-01 18:02:10 +00:00
Ryo
ce61300141 Enable oneDNN for tanh based GELU on aarch64 (#130925)
Provides a speedup for GELU on aarch64 compared to the native PyTorch implementation, e.g. an 8.5x speedup over the native implementation for a 1x1x16384 input on 32 threads on Graviton 3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130925
Approved by: https://github.com/malfet
2024-08-01 17:54:48 +00:00
97eba8e174 [AOTI] Fix a typo in ExternKernel.codegen_const_args (#132191)
Differential Revision: [D60513923](https://our.internmc.facebook.com/intern/diff/D60513923)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132191
Approved by: https://github.com/chenyang78
2024-08-01 17:46:25 +00:00
f467d55329 Disable remote cache on test_aot_autograd_cache (#132409)
Summary:
AOTAutogradCache currently only checks the local directory instead of both local and remote when saving/loading from the cache, so if the remote cache is turned on, it will miss the cache.

Disable remote caching for now on these tests: when I work on remote caching compatibility, I'll re-enable them here.

Test Plan:
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/dynamo:test_dynamo -- --exact 'caffe2/test/dynamo:test_dynamo - test_aot_autograd_cache.py::AOTAutogradCacheTests::test_nn_module_with_params_global_constant' --run-disabled
passes

Differential Revision: D60588615

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132409
Approved by: https://github.com/masnesral
2024-08-01 17:26:11 +00:00
010fc7858a [export] Fix serialization of OpOverload w/ SymInt outputs (#132126)
Fixes https://fb.workplace.com/groups/1075192433118967/permalink/1473575486613991/

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132126
Approved by: https://github.com/ydwu4
2024-08-01 17:22:04 +00:00
ff4ca0d02a [Easy] Fix argument name collision in HigherOrderOperator dispatched functions (#132377)
Share the same spirit of #129562

- #129562

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132377
Approved by: https://github.com/zou3519
2024-08-01 17:13:37 +00:00
7b816d7d6d [dynamo] Treat attr of unspecialized buiitin nn modules as static (#132318)
This fixes the huge increase in compile time with +dynamic with inline_inbuilt_nn_modules.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132318
Approved by: https://github.com/yanboliang, https://github.com/mlazos, https://github.com/ezyang
ghstack dependencies: #132302, #132304, #132312, #132308, #132314
2024-08-01 17:11:18 +00:00
69cbf05529 Fix recent build error on ppc64le (#129736)
This PR will fix the recent build issue observed on ppc64le.
Fixes #128130

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129736
Approved by: https://github.com/albanD, https://github.com/malfet
2024-08-01 17:09:42 +00:00
30293319a8 [BE][Easy][19/19] enforce style for empty lines in import segments in torch/[o-z]*/ (#129771)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129771
Approved by: https://github.com/justinchuby, https://github.com/janeyx99
2024-08-01 17:07:14 +00:00
c59f3fff52 [PP] Forward only schedule (#132177)
`python test/distributed/pipelining/test_schedule_multiproc.py -k test_forward_only`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132177
Approved by: https://github.com/lessw2020
2024-08-01 16:35:56 +00:00
ee09d066d3 [dynamo] Add line number to _warn_capture_scalar_outputs() (#132333)
Fixes #127667.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132333
Approved by: https://github.com/anijain2305
2024-08-01 16:11:21 +00:00
35fcd59fd8 [inductor] make restrict_keyword cross OSs. (#132394)
Error Msg:
<img width="862" alt="image" src="https://github.com/user-attachments/assets/51fef188-bce8-42a5-8ed4-d11802c6ca89">

<img width="347" alt="image" src="https://github.com/user-attachments/assets/0eafe38e-1c7c-427d-82f5-16a31bccc476">

Handle `restrict` keyword the by OS, ref: https://learn.microsoft.com/en-us/cpp/cpp/extension-restrict?view=msvc-170

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132394
Approved by: https://github.com/desertfire
2024-08-01 16:03:10 +00:00
920f0426ae Add None return type to init -- tests rest (#132376)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132376
Approved by: https://github.com/jamesjwu
ghstack dependencies: #132335, #132351, #132352
2024-08-01 15:44:51 +00:00
221350e3a4 Add None return type to init -- tests (#132352)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132352
Approved by: https://github.com/ezyang
ghstack dependencies: #132335, #132351
2024-08-01 15:44:51 +00:00
a6985c09cb Add None return type to init -- functorch and torchgen (#132351)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132351
Approved by: https://github.com/jamesjwu
ghstack dependencies: #132335
2024-08-01 15:26:45 +00:00
72d2dba992 Add None return type to init (#132335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132335
Approved by: https://github.com/albanD
2024-08-01 15:26:45 +00:00
30d7f0b15a Remove wget call to builder install_cuda.sh (#132410)
This file ``install_cuda.sh`` now lives in ``.ci/docker/common`` and will be removed from builder repo.
Here is PR that removes it from builder: https://github.com/pytorch/builder/pull/1949
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132410
Approved by: https://github.com/Skylion007
2024-08-01 15:22:08 +00:00
cyy
c99adce9a1 [12/N] Fix clang-tidy warnings in jit (#132209)
Follows #132131

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132209
Approved by: https://github.com/Skylion007
2024-08-01 15:12:12 +00:00
0d88dd0f77 [TS2E] Remove reference to torch.onnx internals (#132186)
Instead, this PR moves the code to the converter to avoid the dependency. Feel free to refactor it afterward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132186
Approved by: https://github.com/angelayi
2024-08-01 15:08:02 +00:00
cyy
d7d6190493 [11/N] Use std::nullopt and std::optional (#132396)
Follows #132364
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132396
Approved by: https://github.com/ezyang
2024-08-01 14:46:33 +00:00
a4013e8b72 [inductor] cpp codegen alignas for all OSs. (#132387)
Changes:
1. Make cpp codegen alignas works for all OSs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132387
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-08-01 14:30:09 +00:00
6c1f1563e1 [inductor] fix UndefinedTensorImpl singleton can't export on Windows. (#132326)
This PR fixes the issue that `UndefinedTensorImpl::_singleton` can't be exported on Windows.
Snapshot:
<img width="1346" alt="image" src="https://github.com/user-attachments/assets/b34256ac-a0ae-473b-89e6-10d755eaad24">

The reason is that MSVC can't export static class data with external linkage; ref: https://learn.microsoft.com/en-us/cpp/cpp/using-dllimport-and-dllexport-in-cpp-classes?view=msvc-170#_pluslang_using_dllimport_and_dllexport_in_c2b2bselectivememberimportexport

I use a different singleton implementation on Windows to avoid the issue.

With this PR, cpp_wrapper starts to work on Windows.
<img width="1916" alt="image" src="https://github.com/user-attachments/assets/c1d7d7e7-64ca-4c6d-9fb7-e3b91e675b58">

Next step, I will enable the cpp_wrapper UTs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132326
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-08-01 13:37:12 +00:00
6ff1e43a41 [BE][Easy][13/19] enforce style for empty lines in import segments in test/j*/ (#129764)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129764
Approved by: https://github.com/ezyang
2024-08-01 12:13:42 +00:00
672ce4610e Populate submodules of torch._C to sys.modules recursively (#132216)
See comment:

e9d1c26275/torch/__init__.py (L938-L950)

This PR recursively registers the submodules of the C extension in `sys.modules` (e.g., `_C._dynamo.eval_frame`).
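A minimal, generic sketch of the idea; the module names below are illustrative, not the real torch._C layout.
```
import sys
import types

def register_submodules(module, qualified_name):
    """Recursively expose nested extension submodules via sys.modules."""
    sys.modules[qualified_name] = module
    for attr_name, attr in vars(module).items():
        if isinstance(attr, types.ModuleType) and not attr_name.startswith("__"):
            register_submodules(attr, f"{qualified_name}.{attr_name}")

# Hypothetical usage on a fake nested module tree:
inner = types.ModuleType("inner")
outer = types.ModuleType("outer")
outer.inner = inner
register_submodules(outer, "fake_ext")
assert sys.modules["fake_ext.inner"] is inner
```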

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132216
Approved by: https://github.com/ezyang
2024-08-01 12:04:59 +00:00
d95756f6a5 [Quantizer][Add] Fix add annotation with constant (#132092)
Summary:
Occasionally we run into a partition that looks like this for Add:

```
SourcePartition(nodes=[_constant2, add_2], source=<built-in function add>, input_nodes=[x], output_nodes=[_constant2, add_2], params=[_constant2])
```

In this case we are adding a constant to an input, and reusing the constant later down the line. This causes our constant to be an output in our SourcePartition. The assumption then that:

```
        add_node = add_partition.output_nodes[0]
```
Will not necessarily hold. As a result we must check that the output node is indeed a call function and not a constant.
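A hedged sketch of that guard; the helper name is illustrative and not the actual quantizer code.
```
def pick_add_node(add_partition):
    # Prefer an actual op node; constants that ended up in output_nodes are skipped.
    for node in add_partition.output_nodes:
        if node.op == "call_function":
            return node
    raise AssertionError("partition has no call_function output node")
```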

Test Plan: buck test mode/dev-nosan //executorch/backends/xnnpack/test:test_xnnpack_ops -- test_qs8_add_constant

Differential Revision: D60413221

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132092
Approved by: https://github.com/jerryzh168
2024-08-01 09:57:43 +00:00
bdd83c4c7f Add Full block support to flex_decoding (#131404)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131404
Approved by: https://github.com/yanboliang
2024-08-01 07:28:52 +00:00
cyy
043e41f4f4 [10/N] Use std::nullopt and std::make_optional (#132364)
Follows #130674
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132364
Approved by: https://github.com/ezyang
2024-08-01 07:02:35 +00:00
d6a82ce39b [dynamo] Track builtin nn modules with UnspecializedBuiltinNNModuleVariable (#132314)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132314
Approved by: https://github.com/yanboliang
ghstack dependencies: #132302, #132304, #132312, #132308
2024-08-01 06:21:05 +00:00
aa0ed2496f [dynamo] Wrap unspecialized nn module getattr with UnspecializedNNModuleSource (#132308)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132308
Approved by: https://github.com/yanboliang
ghstack dependencies: #132302, #132304, #132312
2024-08-01 06:21:05 +00:00
612ea35395 [dynamo] Introduce UnspecializedBuiltinNNModuleSource (#132312)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132312
Approved by: https://github.com/yanboliang
ghstack dependencies: #132302, #132304
2024-08-01 06:21:05 +00:00
4c29c1a96a [EZ] adjust test to accept training IR input (#131999)
When we do predispatch functional export, we sometimes get harmless additional detach calls. The new training IR actually outputs a slightly different (arguably more correct) result.

Differential Revision: [D60348764](https://our.internmc.facebook.com/intern/diff/D60348764/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131999
Approved by: https://github.com/bdhirsh
ghstack dependencies: #131988, #131995
2024-08-01 06:20:38 +00:00
7a779b5257 Add functions from torch.masked._ops to __all__ for torch.masked (#131288)
Add the non-private operations imported in this file to `__all__` so that pyright considers them to be publicly exported. Solves this error:

```
"mean" is not exported from module "torch.masked" Pylance[reportPrivateImportUsage]
```

Related: https://github.com/pytorch/pytorch/pulls?q=pyright+export

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131288
Approved by: https://github.com/ezyang
2024-08-01 05:45:08 +00:00
928adb7cc2 Fix empty fake mode problem (#131995)
Title

Differential Revision: [D60348541](https://our.internmc.facebook.com/intern/diff/D60348541/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131995
Approved by: https://github.com/angelayi
ghstack dependencies: #131988
2024-08-01 04:55:37 +00:00
f32ab3b9e3 Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004)
Python's set iteration order is non-deterministic. We recently ran into an internal failure that did not fail consistently.

See the repro here: P1453035092.

Now, with these changes, it fails consistently. In follow-ups we could also consider adding a lint rule for uses of either set() or set literals.
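For illustration, a generic insertion-ordered set built on dict keys (not Inductor's actual OrderedSet), showing the determinism this migration buys:
```
class OrderedSet:
    def __init__(self, items=()):
        self._data = dict.fromkeys(items)

    def add(self, item):
        self._data[item] = None

    def __contains__(self, item):
        return item in self._data

    def __iter__(self):
        return iter(self._data)          # always insertion order

s = OrderedSet(["relu", "add", "cat"])
s.add("mm")
assert list(s) == ["relu", "add", "cat", "mm"]
```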

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
2024-08-01 04:37:15 +00:00
bcd1d2e832 [dynamo] Introduce UnspecializedNNModule guard source (#132304)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132304
Approved by: https://github.com/yanboliang
ghstack dependencies: #132302
2024-08-01 04:35:43 +00:00
e772547d70 [dynamo][rename/refactor] Rename guard_source NN_MODULE to SPECIALIZED_NN_MODULE (#132302)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132302
Approved by: https://github.com/yanboliang
2024-08-01 04:35:43 +00:00
90fa64bd7e [torch][take2] Implement BFloat16 __hip_bfloat16 overloads (#132234)
Summary:
In D60024830 I attempted to define these overloads, but gated the implementation on the wrong macros. Namely I used `__CUDACC__` instead of `__HIPCC__` (facepalm).

It might be worth merging this with the nvidia case via typedefs (e.g. `typedef __hip_bfloat16 __gpu_bfloat16` and `typedef __nv_bfloat16 __gpu_bfloat16`), but that seems like an entirely new paradigm for torch, so I'll punt that change to the future so we can focus on supporting `BFloat16(__hip_bfloat16)` here

Test Plan: CI

Differential Revision: D60362079

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132234
Approved by: https://github.com/houseroad
2024-08-01 04:25:46 +00:00
7911b7bfb7 [inductor][cpp] stabilize do_bench_cpu (#131873)
This PR stabilizes `do_bench_cpu` by using milliseconds for warmup and benchmark runs, aligning with Triton's do_bench.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131873
Approved by: https://github.com/leslie-fang-intel, https://github.com/chunyuan-w, https://github.com/eellison
2024-08-01 04:25:31 +00:00
b25ef91bf1 [BE][Easy][18/19] enforce style for empty lines in import segments in torch/d*/ (#129770)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129770
Approved by: https://github.com/wconstab
2024-08-01 04:22:50 +00:00
bc7ed1fbdc [FSDP2] add __repr__ to FSDPParamGroup and FSDPParam (#132350)
In pdb, it's pretty common to print `FSDPParamGroup` and `FSDPParam`; this makes sure they are human readable.

print `FSDPParam` in pdb
```
FSDPParam(fqn=layers.6._checkpoint_wrapped_module.attention.wq.weight, orig_size=torch.Size([128, 256]))
```
print `FSDPParamGroup` in pdb
```
FSDPParamGroup(fqn=layers.6)
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132350
Approved by: https://github.com/awgu
2024-08-01 04:21:57 +00:00
46ed33b207 add decomposition_table as an arg to get_isolated_graphmodule (#130886)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130886
Approved by: https://github.com/wanchaol
2024-08-01 04:21:43 +00:00
073430ebea Don't check for autograd state when lowering to inference IR (#131988)
When lowering to inference IR, we shouldn't error on autograd state changes because we will have preserved the autograd state change at the training level. I think the more correct way of implementing it would be to wrap autograd ops in HOP before decomposing, but that seems low ROI.

Differential Revision: [D60346235](https://our.internmc.facebook.com/intern/diff/D60346235/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131988
Approved by: https://github.com/angelayi
2024-08-01 04:15:37 +00:00
81db69278d unsupported sympy functions in export solver (#132325)
Summary:
A bunch of issues around support for sympy functions like `TruncToInt` and `ToFloat` were uncovered by https://github.com/pytorch/pytorch/issues/131897. This PR addresses only one of them (as the title suggests). Another issue is deserialization, filed as a task: T197567691.

However, the most important issue is that adding runtime assertions is broken right now: specifically, sympy_interp with `PythonReferenceAnalysis` currently doesn't work because the implementations of some of these sympy functions in `PythonReferenceAnalysis` (or falling through to its base class) do not expect proxies. This means things like `math.trunc`, `math.floor`, `round`, etc. don't work, and this can be easily reproduced by using them inside `torch._check`. According to ezyang, these implementations need to point to new torch functions that can accept proxies (see how minimum and maximum are implemented, for example).
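A hedged repro sketch of the pattern that exercises this path; the module below is illustrative, and the exact failure mode depends on the export/compile pipeline it runs through.
```
import math
import torch

class M(torch.nn.Module):
    def forward(self, x):
        n = x.sum().to(torch.int64).item()     # unbacked, data-dependent value
        torch._check(math.trunc(n / 2) >= 0)   # math.* applied to an unbacked symbol
        return x.new_zeros(n)
```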

Test Plan: added test (original repro provided)

Differential Revision: D60540951

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132325
Approved by: https://github.com/ezyang
2024-08-01 04:11:52 +00:00
10344d76bd Revert "[AOTI] Fix bfloat16 in CPU (#132150)"
This reverts commit a488113062b7231197ace8522ab3cab535c77d0b.

Reverted https://github.com/pytorch/pytorch/pull/132150 on behalf of https://github.com/clee2000 due to I think this broke inductor/test_cuda_cpp_wrapper.py::DynamicShapesCudaWrapperCudaTests::test_unspec_inputs_cuda_dynamic_shapes_cuda_wrapper [GH job link](https://github.com/pytorch/pytorch/actions/runs/10189155341/job/28189531216) [HUD commit link](a488113062). Test was not run on PR due to being skipped for being slow ([comment](https://github.com/pytorch/pytorch/pull/132150#issuecomment-2261895048))
2024-08-01 03:35:39 +00:00
a28cda11ef Revert "AutoHeuristic: mixed_mm heuristic for A100 (#131613)"
This reverts commit 344c15a0bb66409ec5e576992090d127cbfa2cff.

Reverted https://github.com/pytorch/pytorch/pull/131613 on behalf of https://github.com/AlnisM due to lintrunner issues ([comment](https://github.com/pytorch/pytorch/pull/131613#issuecomment-2261884149))
2024-08-01 03:22:11 +00:00
589aef4bb0 Fix py codegen to delete values that don't have any users (#131028)
Fixes #131025

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131028
Approved by: https://github.com/ezyang
2024-08-01 03:18:37 +00:00
718c13cd39 [inductor] Reinplacing should not allow an op to mutate the same input multiple times (#132238)
Fixes #132196

Let's say we have:
- op(x, y) that mutates both x and y
- new_x, new_y = functional_op(x, y) is the functional variant

If we are presented with functional_op(x, x), we must not reinplace
this into op(x, x), because then it would be writing to the same Tensor.
Instead, it's OK to reinplace one of them and to clone the other:
```
>>> y = x.clone()
>>> op(x, y)
```
This also applies if we have views: functional_op(x, x[0])
should not reinplace into op(x, x[0]).

The fix is to avoid reinplacing an arg if a view of it already has been
reinplaced.

Test Plan:
- new and existing tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132238
Approved by: https://github.com/oulgen, https://github.com/eellison
2024-08-01 02:37:03 +00:00
344c15a0bb AutoHeuristic: mixed_mm heuristic for A100 (#131613)
This PR introduces changes to AutoHeuristic that allow one to learn a heuristic as a decision tree. I used this to learn a heuristic for mixed_mm on A100 that consistently performs better than the default choice (https://github.com/pytorch/pytorch/blob/main/torch/_inductor/kernel/mm.py#L402).

This is how the results look like:
Explanation of columns:
**wrong_max_spdup**: In the worst case, how much better would the best choice have been
**wrong_gman_spdup**: For inputs where the heuristic is wrong, how much better is the best choice on average (geomean)
**max_spdup_default**: Highest speedup achieved by the learned heuristic over the default choice
**gman_spdup_default**: Geomean speedup achieved by the learned heuristic over the default choice
**max_slowdown_default**: If the default choice is better than the choice predicted by the learned heuristic, how much is it better in the worst case
**non_default_preds**: Number of times the learned heuristic predicted a choice that is not the default choice
**default_better**: Number of times the default choice is better than the choice made by the heuristic
```
  set     crit  max_depth  min_samples_leaf  correct  wrong  unsure  total  wrong_max_spdup  wrong_gman_spdup    max_spdup_default  gman_spdup_default  max_slowdown_default  non_default_preds  default_better
train  entropy          5              0.01     2376    740     323   3439         1.855386          1.063236            11.352318            3.438279              1.022164               3116               2
 test  entropy          5              0.01      563    183      71    817         1.622222          1.060897            10.084181            3.507741              1.017039                746               2
```

While the number of wrong predictions is high, on average the best choice is only around 6% better. What is important is that the choice predicted by the learned heuristic performs better than the default choice.

I evaluated my heuristic on gpt-fast `meta-llama/Llama-2-7b-chat-hf` with int8 weight quantization. To get the `tuned_mixed_mm` to trigger, I had to replace `F.linear()` in https://github.com/pytorch-labs/gpt-fast/blob/main/quantize.py#L355 with `torch.matmul(input, self.weight.t().to(dtype=input.dtype))` because the mixed_mm pattern does not match if there is a transpose between a cast and the matmul.
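
A rough sketch of that replacement (the module and attribute names below are stand-ins for illustration; the actual change is a one-line edit in gpt-fast's quantize.py):

```
import torch

class WeightOnlyInt8Linear(torch.nn.Module):  # hypothetical stand-in for the gpt-fast module
    def __init__(self, in_features, out_features):
        super().__init__()
        self.register_buffer(
            "weight", torch.randint(-128, 127, (out_features, in_features), dtype=torch.int8)
        )

    def forward(self, input):
        # before: F.linear(input, self.weight.to(dtype=input.dtype))
        # after: keep the cast directly before the matmul so the mixed_mm pattern can match
        return torch.matmul(input, self.weight.t().to(dtype=input.dtype))
```
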
|batch size|prompt length| fallback    |  heuristic  | speedup |
|----------|-------------|------------:|------------:|--------:|
|     1    |      7      | 75.31 tok/s | 148.83 tok/s|  1.97   |
|     1    |     11      | 75.99 tok/s | 148.15 tok/s|  1.94   |
|     4    |      7      | 103.48 tok/s | 472.00 tok/s|  4.56   |
|     4    |     11      | 103.56 tok/s |  371.36 tok/s|  3.58   |
|     8    |      7      | 201.92 tok/s | 813.44 tok/s|  4.02   |
|     8    |     11      | 201.76 tok/s |  699.36 tok/s|  3.46   |

Currently, the heuristic only applies to the following inputs:
- m <= 128, k >= 1024, n >= 1024 (For these sizes, one of the triton kernels wins in most cases, but the heuristic still has to be careful to not choose a config that performs worse than the fallback)
- k % 256 == 0 (If k is not a multiple of the block size, some choices perform extremely badly. In one case, a config that usually performs very well was 130x slower.)
- mat1 not transposed
- mat2 transposed (In some cases, it was hard for the learned heuristic to detect some cases where it

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131613
Approved by: https://github.com/eellison
ghstack dependencies: #131610, #131611
2024-08-01 02:25:54 +00:00
2276d9045a [cpu] add more VecConvert for 8bits (#131876)
Adds more intrinsic specializations for 8-bit conversions, in order to speed up 8-bit SDPA in the future.
- u8 -> i16
- i32 -> f32
- f32 -> i32
- i32 -> i8 (only added for vec512 because vec256 lacks avx512vl)
- i16 -> i8 (only added for vec512 because vec256 lacks avx512vl)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131876
Approved by: https://github.com/jgong5, https://github.com/leslie-fang-intel
2024-08-01 01:38:39 +00:00
7c89ec0f7c Implements torch.cuda.MemPool() API (#131152)
In this PR:
- Pool id creation logic is refactored and moved to a MemPool class. `graph_pool_handle()` API now uses `torch.cuda.MemPool()` to get a unique id for a pool. Existing tests should cover this change.
- MemPool holds a pointer to a CUDAAllocator as proposed in https://github.com/pytorch/pytorch/issues/124807#issuecomment-2077506997. Tests are added to show usage with CUDAPluggableAllocator.
- MemPoolContext API makes a mempool active. Tests are added to show usage of this API. This API will be used in CUDACachingAllocator to route allocations to a user provided allocator. See draft here: https://github.com/pytorch/pytorch/pull/125722/ (see the usage sketch below)
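
A minimal usage sketch of the APIs above (the shared-library path and symbol names for the pluggable allocator are placeholders; routing of allocations through the active pool is left to the follow-up PR linked above):

```
import torch

# Build a pluggable allocator from a user-provided shared library (placeholder names).
allocator = torch.cuda.memory.CUDAPluggableAllocator(
    "libcustom_cuda_alloc.so", "custom_malloc", "custom_free"
)

# MemPool gets a unique pool id and holds a pointer to the allocator.
pool = torch.cuda.MemPool(allocator.allocator())

# MemPoolContext makes the pool active for the current thread.
ctx = torch.cuda.MemPoolContext(pool)
```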

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131152
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-08-01 01:29:30 +00:00
4e966e8a1c Update inference_mode doc (#132321)
Fix https://github.com/pytorch/pytorch/issues/132288
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132321
Approved by: https://github.com/awgu, https://github.com/soulitzer
2024-07-31 23:50:03 +00:00
a488113062 [AOTI] Fix bfloat16 in CPU (#132150)
Fixes #122986

- add "typedef at::BFloat16 bfloat16;" to the header of generated cpp file

- Suppress warning: comparison of integer expressions of different signedness: ‘long unsigned int’ and ‘int64_t’ {aka ‘long int’} [-Wsign-compare]
  436 |   if (tensor.numel() != numel) {

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132150
Approved by: https://github.com/chenyang78, https://github.com/desertfire
2024-07-31 23:28:24 +00:00
6b28af1b79 Grouped Query Attention (#128898)
### Approach: Using the current function declaration

**Constraint:** Q_Heads % KV_Heads == 0

**Major change:**
- Added a new argument enable_gqa: bool to sdpa function call
- It gives meaning to the third-to-last dimension (the number of heads).

Sample use cases this would enable:
LLama3

```
# LLama3 8b call to SDPA
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)

output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)

# Output Shape
(batch, 32, seq_len_q, D)
```

### Design Choice:

- Check that Query.size(-3) == Key.size(-3) == Value.size(-3), or that Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their numbers of heads are not equal, facilitating correct and efficient computation in attention mechanisms (see the sketch below).
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.
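
A minimal standalone sketch of that adjustment (not the internal implementation):

```
import torch

batch, seq_len, D = 2, 128, 64
query = torch.rand(batch, 32, seq_len, D)  # 32 query heads
key = torch.rand(batch, 8, seq_len, D)     # 8 KV heads
value = torch.rand(batch, 8, seq_len, D)

# Q_Heads % KV_Heads == 0, so each KV head is shared by n_rep query heads.
n_rep = query.size(-3) // key.size(-3)
key = key.repeat_interleave(n_rep, dim=-3)      # -> (batch, 32, seq_len, D)
value = value.repeat_interleave(n_rep, dim=-3)  # -> (batch, 32, seq_len, D)
```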

### Benchmarks:

- **sdpa.py: #130634**
For different batch sizes, enable_gqa=True shows a substantial improvement in the runtime of sdpa

 | batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True   |   forward_time when enable_gqa=False    |
| ------------ | ------------- | -------------- | ----------- | ------------ | ----------- | ----------- | ---------------- |
|     1      |     32      |      8       |   2048    |    2048    |   2048    |   100.71  |  119.70  |
|     8      |     32      |      8       |   2048    |    2048    |   2048    |   539.78  |  628.83  |
|     16     |     32      |      8       |   2048    |    2048    |   2048    |   1056.81  |  1225.48  |
|     32      |     32      |      8       |   2048    |    2048    |   2048    |   2099.54  |  2440.45  |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)

- **TorchTitan: https://github.com/pytorch/torchtitan/pull/458**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128898
Approved by: https://github.com/drisspg
2024-07-31 22:58:51 +00:00
f0da167ce5 Add fx graph runnable to tl parse (#130976)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130976
Approved by: https://github.com/ezyang
2024-07-31 22:19:35 +00:00
645c1052a6 Refactor local autotune remote cache to make the code less error prone (#132289)
Fixes #132241

This PR refactors local autotune cache so that disabling it is easier and cleaner.

Differential Revision: [D60537196](https://our.internmc.facebook.com/intern/diff/D60537196)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132289
Approved by: https://github.com/aorenste
ghstack dependencies: #132285
2024-07-31 22:12:22 +00:00
b0e06d9d6a Make config.autotune_remote_cache be a three-way option (#132285)
Similar to fx_graph_cache config, make autotune config be three-way so we can hard enable/disable via config options.

Differential Revision: [D60537105](https://our.internmc.facebook.com/intern/diff/D60537105)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132285
Approved by: https://github.com/aorenste
2024-07-31 22:12:22 +00:00
260c991e20 [inductor] Fix unsoundness with negative-valued indexing expressions (#131761)
This fixes a few instances where we assumed indexing expressions were
non-negative. This is not valid when we have more complicated
expressions involving masking e.g. pointwise cat.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131761
Approved by: https://github.com/ezyang
2024-07-31 21:32:20 +00:00
e74ba1b34a [BE][Easy][15/19] enforce style for empty lines in import segments in torch/_d*/ (#129767)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129767
Approved by: https://github.com/anijain2305
2024-07-31 21:18:11 +00:00
ad9826208c Remove string length limit in ET (#132169)
Summary: ET sets the length limit of a string input variable to 8192 characters. However, the node process_group::init has more than 8192 characters for an Ads 128-rank job. This diff temporarily removes the limit, so ET can capture the complete information of the process group.

Test Plan: buck2 test mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTrace

Reviewed By: sanrise

Differential Revision: D60341306

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132169
Approved by: https://github.com/sraikund16, https://github.com/sanrise
2024-07-31 20:54:39 +00:00
d3cefc9e3a AutoHeuristic: Collect data for mixed_mm (#131611)
This PR introduces a script that can be used to collect data for mixed_mm to learn a heuristic with AutoHeuristic. This PR also includes the following things:

- Move pad_mm related AutoHeuristic files into a subdirectory.
- Introduce an interface, benchmark_runner.py, that can be subclassed to introduce new scripts to run benchmarks in order to collect data with AutoHeuristic (see gen_data_pad_mm.py and gen_data_mixed_mm.py).

The idea behind the interface is that, in the end, it hopefully makes it easier to collect data for new optimizations, and thus makes it easier to learn a heuristic.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131611
Approved by: https://github.com/eellison
ghstack dependencies: #131610
2024-07-31 20:45:45 +00:00
f8b6e91840 Add sequoia runner to mac-mps (#132190)
Adds MacOS 15 runners to GitHub actions for Mac-mps test suite

Co-authored-by: Joona Havukainen <jhavukainen@apple.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132190
Approved by: https://github.com/malfet
2024-07-31 20:26:04 +00:00
d72e863b3e Fix lint after PR #130572 (#132316)
Fix lint after https://github.com/pytorch/pytorch/pull/130572

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132316
Approved by: https://github.com/Skylion007, https://github.com/malfet, https://github.com/ZainRizvi
2024-07-31 20:00:31 +00:00
aeb78c9849 [TD] More files for test_public_bindings (#132284)
It relies on that file

Also we care about .cpp files too apparently
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132284
Approved by: https://github.com/ZainRizvi
2024-07-31 19:53:40 +00:00
cb4c107d70 [pytorch][counters] DynamicCounter (#132166)
Summary:
Implement a callback-based dynamic counter with pluggable backends.
The backend API and integration is similar to WaitCounter. Note that this counter should only be used with C++ callbacks, since making it safe to be used for GIL-requiring callbacks would be pretty challenging and may defeat the whole purpose of this counter (since the duration of the callback can no longer be guaranteed).

Test Plan: unit test

Differential Revision: D60464055

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132166
Approved by: https://github.com/asiab4
2024-07-31 19:52:51 +00:00
dc38646c58 Revert "[pytorch][counters] Pybind for WaitCounter (#132167)"
This reverts commit 2c7bd61afa4b762e00b26bbde43685de080af32a.

Reverted https://github.com/pytorch/pytorch/pull/132167 on behalf of https://github.com/clee2000 due to broke test_public_bindings.py::TestPublicBindings::test_correct_module_names [GH job link](https://github.com/pytorch/pytorch/actions/runs/10183687967/job/28172929836) [HUD commit link](2c7bd61afa) not tested on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/132167#issuecomment-2261328275))
2024-07-31 19:51:56 +00:00
6955bc170d Some updates to merge rules (#132296)
The added people from metamates don't actually make a material
difference right now but I added some for fun.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132296
Approved by: https://github.com/albanD, https://github.com/malfet
2024-07-31 19:49:08 +00:00
2138a710eb enable test_max_pool2d6 after resolving empty array (#132219)
Related to Issue: https://github.com/pytorch/pytorch/issues/131335
Resolving PR: https://github.com/pytorch/pytorch/pull/132023

Test output:
```
(pytorch-3.10) [gabeferns@devvm2252.cco0 ~/pytorch (enable-test-max-pool2d6)]$ TORCHINDUCTOR_ABI_COMPATIBLE=1 python test/inductor/test_cpu_cpp_wrapper.py -k test_max_pool2d6
inline_call []
stats [('calls_captured', 3), ('unique_graphs', 1)]
inductor [('extern_calls', 3), ('fxgraph_cache_miss', 1)]
aot_autograd [('total', 1), ('ok', 1)]
.inline_call []
stats [('calls_captured', 3), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
inductor [('extern_calls', 3), ('fxgraph_cache_miss', 1)]
.
----------------------------------------------------------------------
Ran 2 tests in 8.668s

OK
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132219
Approved by: https://github.com/desertfire
2024-07-31 19:13:54 +00:00
cfe61e84ac Add a 'to' method for moving to and from device for BlockMask (#132087)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132087
Approved by: https://github.com/yanboliang
2024-07-31 19:05:30 +00:00
898a431a46 Dump files that look like FX graphs to structured log (#132100)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132100
Approved by: https://github.com/oulgen
2024-07-31 18:45:28 +00:00
f9e4d05c15 Save and run post compilation steps within FXGraphCache (#130572)
This PR mostly refactors by putting code into utils files so that they can be shared between codecache.py and compile_fx.py. Afterwards, it then changes compile_fx so that:
- When saving to FXGraphCache, we save onto the CompiledFXGraph all the necessary metadata for running post compile steps (realigning inputs, cudagraphification).
- When loading from FXGraphCache, we use the saved information directly, instead of calculating them from scratch.

What this does is make it so that `FXGraphCache.load()` is a perfect cache on compile_fx_inner, in that it **returns exactly what compile_fx_inner returns**. This also makes it possible for AOTAutogradCache, given a key to the fx graph cache and example inputs, to get back the full return value of compile_fx_inner.

## What's a post compile step?
We define a **post-compile** to be the set of actions that need to run after FXGraphCache either loads from the cache or misses and runs compilation. These steps include:
- Setting the tracing context's output strides
- Running cudagraphs if enabled
- Maybe realign inputs if cudagraphs didn't run

To run these steps, we save all the necessary metadata in CompiledFxGraph, and use them on a cache hit to reconstruct the object.

## Splitting cudagraphs work into pre/post compile
Cudagraphs does a lot of work on the input graph module to determine if cudagraphs can be enabled. This is the code that involves cudagraph_tests and stack traces. This will work in a world where we have access to the input graph module, but with AOTAutograd warm start, we won't have access to that information anymore. Therefore we can split cudagraphs work into two parts: on a cache miss (and therefore a full compile), we do the cudagraphs testing work, and save cudagraph_fail_reasons into the cache. Then on a cache hit, we know whether or not we can run cudagraphs, and if we can't, we can emit the correct error messages.

Implementation notes:
- We save `fx_kwargs` directly onto the CompiledFXGraph. `fx_kwargs` is already, by definition, part of the cache key, so this is safe to do when it comes to cache correctness.
- ^ Why do we do above even though FXGraphCache.load takes fx_kwargs as an argument? Because AOTAutogradCache **doesn't** have access to fx_kwargs: they're annoyingly encoded in the functools.partial() of the fw_compiler, so *only* inductor knows about these options. They're fully captured by the AOTAutogradCache key (since every key to fx_kwargs is either a global config, or a field that's deterministic based on an input graph module), but their values are still needed to run cudagraphs/postprocessing. Therefore, it's easier/safer to store it on the cached result.
- Willing to hear other approaches here if we think saving these extra fields is not reasonable, though I can't think of another way to do this that's less complicated to explain.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130572
Approved by: https://github.com/eellison
2024-07-31 18:32:40 +00:00
b40249b462 propagate XLA's metadata after functional sync (#131076)
Fixes https://github.com/pytorch/xla/issues/7174

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131076
Approved by: https://github.com/bdhirsh
2024-07-31 18:20:00 +00:00
7eb2a99585 Fix to support unary pointwise ops when an NJT is not the first arg (#131937)
**Background:** NJT utilizes a `jagged_unary_pointwise()` fallback that historically has assumed blindly that the first arg is an NJT. This assumption breaks certain ops; for example `pow(scalar, Tensor)` has an NJT as the second arg.

This PR expands `jagged_unary_pointwise()` and the associated schema validation logic to handle an NJT in args other than the first position.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131937
Approved by: https://github.com/soulitzer
ghstack dependencies: #131898, #131704
2024-07-31 17:51:03 +00:00
c3a31d90e7 Fix inlining module-scoped store global (#132224)
Fixes https://github.com/pytorch/pytorch/issues/132165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132224
Approved by: https://github.com/anijain2305
2024-07-31 17:37:43 +00:00
6214b5388b typing ir.py - part 1 (#131845)
See #131852

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131845
Approved by: https://github.com/Skylion007, https://github.com/eellison
2024-07-31 17:37:14 +00:00
144639797a Improve side effects error message (#132223)
As title

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132223
Approved by: https://github.com/anijain2305
2024-07-31 17:29:26 +00:00
784a6ec5a3 Revert "Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004)"
This reverts commit 13d744464f10e35c0de50feb4e2340d4dae8e05f.

Reverted https://github.com/pytorch/pytorch/pull/130004 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/10183945999/job/28170099930) [HUD commit link](13d744464f) probably a landrace, the base is 21 hours old ([comment](https://github.com/pytorch/pytorch/pull/130004#issuecomment-2260946562))
2024-07-31 16:49:21 +00:00
9826c542f0 [inductor] skip remote fx caching in failing pattern matcher tests (#132206)
Summary: These tests are failing internally with remote caching enabled because the installed pattern increments a nonlocal counter, which we skip with a cache hit.

Test Plan:
```
buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:pattern_matcher -- --exact 'caffe2/test/inductor:pattern_matcher - test_match_with_mutation (caffe2.test.inductor.test_pattern_matcher.TestPatternMatcher)' --run-disabled --stress-runs 10
buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:pattern_matcher -- --exact 'caffe2/test/inductor:pattern_matcher - test_match_equivalent_function_invocations1 (caffe2.test.inductor.test_pattern_matcher.TestPatternMatcher)' --run-disabled --stress-runs 10
buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:pattern_matcher -- --exact 'caffe2/test/inductor:pattern_matcher - test_match_equivalent_function_invocations2 (caffe2.test.inductor.test_pattern_matcher.TestPatternMatcher)' --run-disabled --stress-runs 10
buck2 test -j 18 'fbcode//mode/opt' fbcode//caffe2/test/inductor:pattern_matcher -- --exact 'caffe2/test/inductor:pattern_matcher - test_match_equivalent_function_invocations3 (caffe2.test.inductor.test_pattern_matcher.TestPatternMatcher)' --run-disabled --stress-runs 10
```

Differential Revision: D60491503

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132206
Approved by: https://github.com/oulgen
2024-07-31 16:41:04 +00:00
bdd7a0322d [Dynamo] Fix - str handler for UserDefinedObjectVariable (#130506)
Fixes #130301

Adjusted the call_str method to handle str conversion for UserDefinedObjectVariable.
This is a re-attempt in a clean branch due to unrelated test errors.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130506
Approved by: https://github.com/oulgen, https://github.com/anijain2305
2024-07-31 16:39:59 +00:00
fe4f8e97cd [Intel GPU] xpu-ops codegen via backend whitelist (#130082)
# Motivation

This PR intends to enhance the codegen to allow generate codes for XPU backend.

XPU operators currently need to be registered in a hand-written way. Developers cannot take advantage of shared code to handle tensor meta setting (like strides, proxy output, structured kernels). Manually porting code is error-prone and may lead to high maintenance effort.

We utilize the backend_whitelist argument in `gen.py` to generate the headers and source code that XPU needs.

# Usage
XPU ops live in `third_party/torch-xpu-ops`; the codegen process is triggered before the compilation of `torch-xpu-ops`.

We use the following commands to generate XPU operators

` python -m torchgen.gen --source-path path/to/yaml/of/xpu   --install-dir  build/xpu    --per-operator-headers    --static-dispatch-backend     --backend-whitelist=XPU`

The difference lies in `backend-whitelist=XPU`. The backend-whitelist key is an existing argument in torchgen.

The inputs of `gen.py` are code templates and an operators yaml. We share the same templates as `aten`. A simplified yaml lives in `third_party/torch-xpu-ops`, which only includes the supported XPU operators. This yaml is a copy-and-modify of `native_functions.yaml`. No extra entry is added; the format is the same as the one in `aten`.

# Result

All operator headers are generated in `build/xpu/ATen/ops` independently, which does not affect operators declared/defined by CPU/CUDA or any other backend. XPU operators only include headers from this folder.

# Verification

* In `third-party/torch-xpu-ops`, we migrate all supported kernels to structured kernels style, where they are registered through `REGISTER_XPU_DISPATCH` or `TORCH_IMPL_FUNC`, and we have UT verification based on `test_ops.py`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130082
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/atalman
ghstack dependencies: #130019
2024-07-31 16:31:38 +00:00
aec8bc5e4c [easy] fix type annotation on constraint_violations variable (#127064)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127064
Approved by: https://github.com/jananisriram
2024-07-31 16:27:10 +00:00
c85088b1f9 [ROCm] performance optimization for index select (#131713)
As observed during working on this fix (https://github.com/pytorch/pytorch/pull/130994), 128 threads per block seems quite low. This PR is to increase the default to improve the performance, and also slightly refactoring the code to replace the hard-coded 128 for better maintenance.

By increasing the default max threads per block from 128 to 256, I saw for `aten::index_select`,  its "CUDA total" time drop from 44.820ms to 33.608ms by profiling below embedding script:
```
import torch
from torch.autograd import profiler

input = torch.randint(low=0, high=16032, size=[131072], device="cuda")
w = torch.randn([16032, 16384], device="cuda")

with profiler.profile(record_shapes=True) as prof:
    x = torch.nn.functional.embedding(input, w)

```
I tested changing the default from 128 to 256, 512, and 1024 on several different types of devices, and observed the "CUDA total" time dropping further and latency improving more as the number increases. Below is one example of the latency improvement ratio:

| max threads per block | speedup |
|----------------------:|--------:|
| 128 | 1x |
| 256 | 1.33x |
| 512 | 1.44x |
| 1024 | 1.49x |

Using 512 as the new default max for non-mi300x to be conservative, which is 1.44x faster than using 128 with the above profiling script.

Using 1024 for mi300x is 1.61x faster than using 128 with the same profiling script, and using 512 is 1.57x faster.

Co-authored-by: Jeff Daily <jeff.daily@amd.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131713
Approved by: https://github.com/jeffdaily, https://github.com/syed-ahmed, https://github.com/malfet
2024-07-31 16:24:01 +00:00
13d744464f Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet (#130004)
Python's set is non-deterministic. We recently ran into an internal failure that did not fail consistently.
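
A small illustration of the non-determinism (with hash randomization enabled, running this in two separate interpreter processes can print different orders):

```
# Iteration order of a plain set of strings depends on hashing, so any codegen
# that iterates over set() can produce differently ordered output between runs.
buffers = {"buf0", "buf1", "buf2", "arg0_1"}
print(list(buffers))
```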

See, repro here: P1453035092.

Now, with these changes, it fails consistently. In follow-ups we could also consider adding a lint rule for uses of either set() or set literals.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
2024-07-31 16:22:11 +00:00
2c7bd61afa [pytorch][counters] Pybind for WaitCounter (#132167)
Summary:
Basic pybind integration for WaitCounter providing a guard API.
Also fixes broken copy/move constructor in WaitGuard (it wasn't really used with the macro-based C++ API).

Test Plan: unit test

Reviewed By: asiab4

Differential Revision: D60463979

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132167
Approved by: https://github.com/asiab4
2024-07-31 16:04:40 +00:00
39a3c98aa6 [inductor] fix scalar miss constuctor for long type. (#132117)
Fix the `long` to `c10::Scalar` conversion issue.

![image](https://github.com/user-attachments/assets/fc44a170-e293-4688-a185-d189484f6638)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132117
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-07-31 15:40:48 +00:00
b2118573d6 [BE] Unify PG assignments (#132230)
python's `or` operator returns `bar` in cases of
`foo = None or bar`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132230
Approved by: https://github.com/Skylion007, https://github.com/wconstab
2024-07-31 15:28:25 +00:00
9c52013559 [subclasses] Fix nested subclasses flattened tensors ordering (#132096)
get_plain_tensors() should result in DFS of leaves.
The error was that plain tensors (leaves) at the same level were returned before the plain tensors of subclasses, even if the subclasses come earlier in the "flatten" list.

Original issue from AO: https://github.com/pytorch/ao/issues/515

Test: TBD, need to make an asymmetric subclass with dense tensors and subclasses
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132096
Approved by: https://github.com/bdhirsh
2024-07-31 14:12:51 +00:00
5406e46b00 Revert "Add fx graph runnable to tl parse (#130976)"
This reverts commit 52c3af62d6fa4a0a4e22764a89f1877f3b1b28f9.

Reverted https://github.com/pytorch/pytorch/pull/130976 on behalf of https://github.com/albanD due to Broke trunk ([comment](https://github.com/pytorch/pytorch/pull/130976#issuecomment-2260579485))
2024-07-31 13:53:57 +00:00
3d7f541597 [BE][TP] Check module has bias before access (#132137)
Some linear modules, such as the ones reconstructed by `torch.export.unflatten()`, may not have the `bias` attribute, if the original linear module has `bias=None`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132137
Approved by: https://github.com/wanchaol
2024-07-31 13:45:28 +00:00
dad125a64b Address clang-tidy nits in BFloat16 (#132203)
Summary: In https://github.com/pytorch/pytorch/pull/131359 I forgot to amend with clang-tidy fixes before merging. This addresses that.

Test Plan: CI

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132203
Approved by: https://github.com/houseroad
2024-07-31 13:41:56 +00:00
45e6a364ee Avoid autocast deprecation warning (#132207)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132207
Approved by: https://github.com/awgu
2024-07-31 13:13:39 +00:00
f4f7aba75d Expose function to probe whether PyTorch was built with FlashAttention (#131894)
This is needed by downstream projects (e.g., xFormers) to determine whether they can count on FlashAttention in PyTorch or whether they need to build it themselves.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131894
Approved by: https://github.com/drisspg, https://github.com/eqy
2024-07-31 11:33:09 +00:00
548c460bf1 [BE][Easy][7/19] enforce style for empty lines in import segments in test/[a-c]*/ and test/[q-z]*/ (#129758)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129758
Approved by: https://github.com/ezyang
2024-07-31 10:54:03 +00:00
46994e753b [NestedTensor] Integrate the layer normalization operator along the jagged dimension into NestedTensor (#132172)
Modify the existing `layer normalization` operator in PyTorch, invoked by `torch.layer_norm`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the `aten` padding operator, enables PyTorch users to invoke `torch.nn.functional.layer_norm` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` or `(B, *, M, N)` nested tensor.

Write unit tests based on the `softmax` jagged operator to verify the accuracy of the ragged reduction implementation for `torch.nn.functional.layer_norm`. Add unit tests to verify error handling for unsupported features.

Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. The layer normalization operator also requires an operation on a 2-dimensional layer; for nested tensors with 4 or more dimensions, I flatten the extra dimensions, then unflatten them after performing layer normalization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132172
Approved by: https://github.com/davidberard98
ghstack dependencies: #132170
2024-07-31 10:51:46 +00:00
89053e382a [NestedTensor] Integrate the softmax operator along the jagged dimension into NestedTensor (#132170)
Modify the existing `softmax` operator in PyTorch, invoked by `torch.softmax`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the aten padding operator, enables PyTorch users to invoke `torch.softmax` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` nested tensor.

Write unit tests based on the `sum` and `mean` jagged operators to verify the accuracy of the ragged reduction implementation for `torch.softmax`. Add unit tests to verify error handling for unsupported features in `NestedTensor` `torch.softmax`.

Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. In addition, the `softmax` operator is required to take in as input an integer for the reduction dimension `dim`, requiring new unit tests heavily inspired by the `sum` and `mean` jagged operator unit tests. `Softmax` also allows for reducing along the batch dimension.
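
A minimal usage sketch of the new capability (assuming the jagged layout with `ragged_idx == 1`):

```
import torch

# (B, *, M) nested tensor with B=2 components of lengths 3 and 5
nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)
out = torch.softmax(nt, dim=1)  # reduce along the ragged dimension
```
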
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132170
Approved by: https://github.com/davidberard98
2024-07-31 10:51:46 +00:00
e7eeee473c [BE][Easy][14/19] enforce style for empty lines in import segments in torch/_[a-c]*/ and torch/_[e-h]*/ and torch/_[j-z]*/ (#129765)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129765
Approved by: https://github.com/ezyang
2024-07-31 10:42:50 +00:00
9e473fd868 Make adding Buffers more like adding Parameters (#125971)
Add semantics for creating a buffer object similar to creating a parameter. This is done by introducing a new Buffer class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same, as the register_buffer method has not been changed. The persistent parameter in the Buffer type indicates whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new Buffer type recognized by inductor and dynamo. The remaining changes are test changes to make sure that the Buffer type can be used as a drop-in replacement for register_buffer, as it just leads to register_buffer being called. The addition of this new functionality still allows normal tensors to be used as buffers, so these changes are intended to be backwards compatible.
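
A minimal sketch of the new semantics (assuming the class is exposed as `torch.nn.Buffer`):

```
import torch
from torch import nn

class Norm(nn.Module):
    def __init__(self):
        super().__init__()
        # Assigning a Buffer registers it, analogous to assigning an nn.Parameter;
        # equivalent to self.register_buffer("running_mean", torch.zeros(8)).
        self.running_mean = nn.Buffer(torch.zeros(8), persistent=True)

m = Norm()
print(dict(m.named_buffers()))
```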

Fixes #35735

Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125971
Approved by: https://github.com/albanD, https://github.com/anijain2305, https://github.com/mlazos
2024-07-31 10:32:40 +00:00
a94e507c39 [aota] Needs autograd if an input requires_grad, agnostic to enable_grad (#128890)
Original issue: https://github.com/pytorch/pytorch/issues/114338

Reland of:  https://github.com/pytorch/pytorch/pull/128016

Summary from previous PR:
We assume only two possible mutually exclusive scenarios:

1. Running the compiled region for training (any of the inputs has requires_grad): produced differentiable outputs should have requires_grad.
2. Running the compiled region for inference (none of the inputs has requires_grad): all outputs do not have requires_grad.

Even if the user runs the region under no_grad() but has an input Tensor with requires_grad, we go with training scenario (1).
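
A minimal illustration of that rule (using `torch.compile` as the entry point; illustrative only):

```
import torch

@torch.compile
def f(x):
    return x * 2

p = torch.nn.Parameter(torch.randn(4))  # an input with requires_grad=True

with torch.no_grad():
    # needs_autograd looks only at the inputs, not torch.is_grad_enabled(),
    # so this region is traced as a training (joint) graph: scenario (1).
    out = f(p)
```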

With current state that means:
1/ needs_autograd should not check torch.is_grad_enabled(), only that any of inputs requires_grad
2/ if needs_autograd => trace_joint (We are in training scenario 1.) => always run compiled region under with.enable_grad()

Changes in partitioner?

Inference and training graphs differed in their return container (list vs. tuple).
The changes in the partitioner unify this so that a tuple is always returned.
As a result, there are some changes in test_aotdispatch.py for graph contents (list -> tuple).

Why was this reverted?

There was a regression of hf_Reformer model on inference.
```
TORCHINDUCTOR_FX_GRAPH_CACHE=0 python benchmarks/dynamo/torchbench.py --performance --inference --bfloat16 --backend inductor --device cuda --only hf_Reformer --cold-start-latency --use-eval-mode
```

Because one of the compiled graphs contained outputs, which are aliases to the inputs that are nn.Parameter(requires_grad=True).

Even though the inference torchbench benchmarks run inside `torch.no_grad()`, alias ops (specifically for hf_Reformer, `expand`) preserve requires_grad.

As a result we started compiling training graph instead of inference.

Fix for view ops:

If we have outputs that are aliases of inputs that require grad, the fact that those outputs require grad is not a reason to generate a training graph.

This is handled in aot_autograd.py, where output_and_mutation_safe are calculated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128890
Approved by: https://github.com/bdhirsh
2024-07-31 07:25:19 +00:00
e9d1c26275 fix uniform op in dynamo (#132160)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132160
Approved by: https://github.com/anijain2305
2024-07-31 06:48:43 +00:00
ae708e9791 [ONNX] Remove the deprecated SymbolicContext (#132184)
Remove the deprecated SymbolicContext class from torch.onnx
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132184
Approved by: https://github.com/titaiwangms
2024-07-31 04:24:32 +00:00
cyy
89da94594e [11/N] Fix clang-tidy warnings in jit (#132131)
Follows #132122

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132131
Approved by: https://github.com/Skylion007
2024-07-31 03:45:52 +00:00
91299c95ec Revert "Add functions from torch.masked._ops to __all__ for torch.masked (#131288)"
This reverts commit 78020ea55d1bc06898577887b80c15d6d2b967dc.

Reverted https://github.com/pytorch/pytorch/pull/131288 on behalf of https://github.com/kit1980 due to Broke test_public_bindings.py::TestPublicBindings::test_correct_module_names [GH job link](https://github.com/pytorch/pytorch/actions/runs/10172945925/job/28136657243) [HUD commit link](78020ea55d) ([comment](https://github.com/pytorch/pytorch/pull/131288#issuecomment-2259581854))
2024-07-31 03:45:09 +00:00
27c9262d29 Fix stdout / stderr typing in SubprocessHandler (#132071)
Summary: Fix stdout / stderr typing in SubprocessHandler. Stdout and Stderr should be `Optional[str]` instead of `str`.

Test Plan: CI

Differential Revision: D60319648

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132071
Approved by: https://github.com/Skylion007
2024-07-31 02:51:11 +00:00
52c3af62d6 Add fx graph runnable to tl parse (#130976)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130976
Approved by: https://github.com/ezyang
2024-07-31 02:27:22 +00:00
deb788f6cc Merge torch.nn.utils.rnn type stubs (#131872)
I want to re-attempt:

* #61467

See:

* https://github.com/pytorch/pytorch/issues/10536#issuecomment-2251948730

and this is one of the files I would touch.

quoting @ezyang:

* https://github.com/pytorch/pytorch/issues/91648#issuecomment-1372010129

> The back story here is that in https://github.com/pytorch/pytorch/pull/19089 we added pyi stubs for nn modules, but when we got off Python 2 we started merging the pyi stubs directly into the py files, e.g., as in https://github.com/pytorch/pytorch/pull/43044. But not all the modules got the treatment.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131872
Approved by: https://github.com/Skylion007, https://github.com/ezyang
2024-07-31 02:24:59 +00:00
78020ea55d Add functions from torch.masked._ops to __all__ for torch.masked (#131288)
Add the non-private operations imported in this file to `__all__` so that pyright considers them to be publicly exported. Solves this error:

```
"mean" is not exported from module "torch.masked" Pylance[reportPrivateImportUsage]
```

Related: https://github.com/pytorch/pytorch/pulls?q=pyright+export

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131288
Approved by: https://github.com/ezyang
2024-07-31 02:16:38 +00:00
df0494bbba Clean redundant link libraries for XPU (#131322)
`torch_xpu` should link to `libtorch_cpu.so` instead of `torch_cpu_library`, otherwise redundant link libraries will contaminate `torch_xpu`, especially when there are MKL in both CPU and XPU.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131322
Approved by: https://github.com/cyyever, https://github.com/ezyang
2024-07-31 02:15:15 +00:00
c07aa1c9c9 [Easy] reorder functions in torch._jit_internal (#130531)
Split from #128633.

- #128633

Move commonly used functions (e.g. `is_scripting`) to the top of the module to avoid circular dependency.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130531
Approved by: https://github.com/EikanWang, https://github.com/ezyang
2024-07-31 02:12:29 +00:00
fbe6f42dcf [BE][Easy][8/19] enforce style for empty lines in import segments in test/[k-p]*/ (#129759)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129759
Approved by: https://github.com/justinchuby, https://github.com/ezyang
2024-07-31 02:09:20 +00:00
914577569d Remove python 3.8 nightly builds (#132138)
Removing python 3.8 support in nightly builds. As per PR: https://github.com/pytorch/pytorch/issues/120718
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132138
Approved by: https://github.com/albanD, https://github.com/malfet, https://github.com/huydhn
2024-07-31 01:50:03 +00:00
05317cd8f7 [dtensor][be] improving readability and reducing repeating code (#132070)
**Summary**
I created functions that reduce repeated code in the console and JSON APIs, which also improves their readability for future developers.

**Test Plan**
1. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_json_dump

2. torchrun --standalone --nnodes=1 --nproc-per-node=4 torch/distributed/_tensor/examples/comm_mode_features_example.py -e transformer_operation_tracing

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132070
Approved by: https://github.com/XilunWu
2024-07-31 00:53:36 +00:00
f85feef127 [DTensor] add support for custom op registration (#131108)
`register_sharding` is an experimental API that allows users to register sharding strategies for an operator when the tensor inputs and outputs are `DTensor`s. It can be useful when: (1) there doesn't exist a default sharding strategy for `op`, e.g. when `op` is a custom operator that is not supported by `DTensor`; (2) when users would like to overwrite default sharding strategies of existing operators.

Here's an example:

        @register_sharding(aten._softmax.default)
        def custom_softmax_sharding(x, dim, half_to_float):
            softmax_dim = dim if dim >= 0 else dim + x.ndim
            acceptable_shardings = []

            all_replicate = ([Replicate()], [Replicate(), None, None])
            acceptable_shardings.append(all_replicate)

            for sharding_dim in range(x.ndim):
                if sharding_dim != softmax_dim:
                    all_sharded = (
                        [Shard(sharding_dim)],
                        [Shard(sharding_dim), None, None],
                    )
                    acceptable_shardings.append(all_sharded)

            return acceptable_shardings

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131108
Approved by: https://github.com/wanchaol
2024-07-31 00:51:16 +00:00
31205d5198 [Inductor][CPP] Fix Local Buffer issue with inplace result line (#132018)
**Summary**
If a `global buffer` has been replaced by a `local buffer`, we will add this `global buffer` into `removed_buffers` to avoid unnecessary allocation. However, a special case is when this `global buffer` can reuse a previous buffer. We didn't handle this case previously, which causes a functional failure in f151f25c0b/torch/_inductor/codegen/wrapper.py (L440)

In this PR, we resolve this issue by avoiding adding this global buffer into `V.kernel.inplace_update_buffers` when the buffer has been marked as `removed`.

**Test Plan**
```
python test/inductor/test_cpu_repro.py -k test_local_buffer_with_line_reuse
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132018
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-07-31 00:38:17 +00:00
882d80fd92 Add lowering for updated _scaled_mm (fixing submodules) (#130422)
Add the Inductor lowering for `torch._scaled_mm`, whose API was last updated in https://github.com/pytorch/pytorch/pull/128683.

The lowering does:
- for tensor-wise scaling, auto-tune between the default ATen kernel (cuBLAS) and Triton kernel configurations.
- for row-wise scaling, auto-tune between the default ATen kernel (CUTLASS kernel added in https://github.com/pytorch/pytorch/pull/125204) and Triton kernel configurations (see the sketch below).
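
A minimal sketch of exercising the lowering with tensor-wise scales (shapes, scale values, and keyword arguments are assumptions for illustration):

```
import torch

def scaled_mm(a, b, scale_a, scale_b):
    return torch._scaled_mm(a, b, scale_a=scale_a, scale_b=scale_b, out_dtype=torch.bfloat16)

compiled = torch.compile(scaled_mm, mode="max-autotune")

a = torch.randn(256, 256, device="cuda").to(torch.float8_e4m3fn)
b = torch.randn(256, 256, device="cuda").to(torch.float8_e4m3fn).t()  # column-major mat2
scale = torch.tensor(1.0, device="cuda")  # tensor-wise scales
out = compiled(a, b, scale, scale)        # max-autotune chooses between ATen and Triton
```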

The Triton kernel template is based on 3ad9031d02 (D56337896) by @choutim, without using SPLIT_K, and that of mm `torch/_inductor/kernel/mm.py`

## Testing:
- Logging shows max-autotune tuning (`AUTOTUNE scaled_mm`) for both tensor-wise and row-wise scaling when called with the two scaling types.
- Row-wise scaling allows operator fusion between preceding pointwise/reduction op and amax/cast:
    - output code Evaluating m=256, n=256, k=256, fusion_case='pointwise', scaling_mode='row'
        - P1477224245 - 2 kernels
    - output code Evaluating m=2048, n=256, k=2048, fusion_case='reduction', scaling_mode='row'
        - P1477227340 - 2 kernels

- UT `python test/inductor/test_fp8.py -- TestFP8Lowering`

## Benchmarking

Eager/compiled tensor-wise/row-wise scaling for various shapes:
https://docs.google.com/spreadsheets/d/1VfWEVuyrwoWysfbS0_u2VHJ-PsdWkF1qIsiD60AzTes/edit?gid=2113587669#gid=2113587669
- Some of the “compiled” cases are slightly slower than “eager”. It’s because max-autotune selected the ATen kernel in the compiled case, and I think the discrepancy is variance.

Eager/compiled tensor-wise/row-wise scaling with pointwise/reduction preceding op for various shapes:
https://docs.google.com/spreadsheets/d/1Nv07NrdffQIoDeMjo9E0V-E-EYrEN0WysO_bn1bc6ns/edit?gid=1715488446#gid=1715488446

## Questions for reviewers:
- Should the type of the accumulator `ACC_TYPE` always be in float32? If not, where is this type set (output layout?)?

## Todo:
- Make the Triton template use the improved persistent kernel version (https://github.com/pytorch/FBGEMM/pull/2735 by @htyu)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130422
Approved by: https://github.com/ipiszy
2024-07-30 23:48:48 +00:00
fdcd2f0dd1 [PT2][Optimus] Add unbind cat to view pass (#132152)
Summary: We observed a new graph transformation opportunity in IG_CTR, which can further remove the cat node.

Test Plan:
# unit test

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 test //caffe2/test/inductor:split_cat_fx_passes
```

Buck UI: https://www.internalfb.com/buck2/5061a3fe-b788-4031-b3af-66d48564a2df
Test UI: https://www.internalfb.com/intern/testinfra/testrun/9007199298289131
Network: Up: 2.5GiB  Down: 5.7GiB  (reSessionID-a49b1234-c02c-4a2d-a9ad-9f5b23557522)
Jobs completed: 294061. Time elapsed: 13:47.8s.
Cache hits: 68%. Commands: 106996 (cached: 72904, remote: 33875, local: 217)
Tests finished: Pass 10. Fail 0. Fatal 0. Skip 1. Build failure 0

# benchmark

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ig_ctr" --flow_id 584880697
```

Counter({'pattern_matcher_nodes': 1649, 'pattern_matcher_count': 1538, 'normalization_pass': 343, 'extern_calls': 160, 'normalization_aten_pass': 39, 'merge_splits_pass': 19, 'fxgraph_cache_miss': 9, 'scmerge_cat_added': 4, 'scmerge_cat_removed': 4, 'scmerge_split_removed': 3, 'unbind_stack_pass': 3, 'batch_tanh': 2, 'scmerge_split_sections_removed': 2, 'scmerge_split_added': 2, 'merge_stack_tahn_unbind_pass': 1, 'optimize_cat_inputs_pass': 1, 'unbind_cat_to_view_pass': 1})

before vs after graph diffing: https://www.internalfb.com/intern/diffing/?paste_number=1497865201

Differential Revision: D60325668

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132152
Approved by: https://github.com/jackiexu1992
2024-07-30 23:27:18 +00:00
afb04d78c8 Don't try hard to compute alignment of unbacked expressions (#131649)
Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131649
Approved by: https://github.com/bdhirsh
2024-07-30 23:19:42 +00:00
5a33657b31 [micro_pipeline_tp] implement the pass for fused_scaled_matmul_reduce_scatter (#131951)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131951
Approved by: https://github.com/weifengpy
2024-07-30 23:02:49 +00:00
524aac413c Initial OpInfo-based testing for NJTs (#131704)
This PR utilizes the info from the existing OpInfo database `op_db` to contribute to general NJT testing.
* New tests in `TestNestedTensorOpInfo`
    * `test_forward()` - compares forward output to an unbind-based reference
    * `test_backward()` - compares forward output and grads to an unbind-based reference
    * `test_forward_compile()` - compares forward compile output (`backend="aot_eager_decomp_partition"`) to eager
    * `test_backward_compile()` - compares forward compile output (`backend="aot_eager_decomp_partition"`) and grads to eager
* To avoid adding a bunch of NJT-specific stuff to the `OpInfo` structure, this PR translates `op_db` -> a NJT-specific `njt_op_db`.
    * `UnaryUfuncInfo`s utilize a new `sample_inputs_unary_njt_pointwise()` which iterates through a comprehensive list of NJTs: contiguous / non-contiguous, dims 2, 3, and 4, transposed / not, etc.
    * `BinaryUfuncInfo`s utilize a new `sample_inputs_binary_njt_pointwise()` which iterates through a comprehensive list of NJTs: contiguous / non-contiguous, dims 2, 3, and 4, transposed / not, etc.
    * `ReductionOpInfo`s utilize a new `sample_inputs_njt_reduction()` which covers full reductions, reductions over the jagged dim, and reductions over the non-jagged dim
* Several xfails were added to get things passing

TODO (future PRs):
* Pass non-contiguous / non-contiguous with holes NJTs (maybe we should have separate tests for these? most ops don't support NJTs with holes today)
* Mixed (NT, T), (T, NT) inputs for binary ops
* Handle other types of OpInfos (beyond unary pointwise, binary pointwise, and reduction) by manually by writing sample_inputs_funcs
* Address all xfails via fixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131704
Approved by: https://github.com/soulitzer
ghstack dependencies: #131898
2024-07-30 23:02:24 +00:00
93facac02c [NeuralNetInference] Bring up iOS builds (#131917)
Summary: Mirror Android setup to static link & use lite interpreter on iOS

Test Plan: CI

Reviewed By: EscapeZero

Differential Revision: D60156611

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131917
Approved by: https://github.com/cccclai
2024-07-30 23:01:09 +00:00
53a5e0f1a8 [BE] delete spmd module (#132072)
Summary:
as titled, fully delete spmd module as we stopped working on this and the code is already broken with no unit tests enabled.

We should not keep it in the codebase as it provides no value anymore, and it constantly burdens DTensor with maintaining compatibility with it (i.e., code paths/imports).

Test Plan: sandcastle

Differential Revision: D60402105

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132072
Approved by: https://github.com/awgu, https://github.com/XilunWu, https://github.com/fegin, https://github.com/seemethere, https://github.com/albanD, https://github.com/yifuwang
2024-07-30 22:20:21 +00:00
a141334c88 migitate wrong tensor.dim_order() (#131366)
Summary:
There are some issues with dim order creation; T194410923 has a detailed illustration.

One of the reasons is that the `is_contiguous` function may sometimes generate an ambiguous memory format result (some tensors might be both channels_last and contiguous at the same time; see the illustration below), and dim order generation relies on the memory format result underneath as a shortcut.
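
A small illustration of the ambiguity (a contiguous NCHW tensor with C == 1 satisfies both layout checks):

```
import torch

x = torch.randn(2, 1, 4, 4)
print(x.is_contiguous())                                   # True
print(x.is_contiguous(memory_format=torch.channels_last))  # also True -> ambiguous
```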

To mitigate the issue, we make dim order use the shortcut if and only if the tensor belongs to a single memory format. Otherwise, we will still recalculate it.

Test Plan: CI

Differential Revision: D60056793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131366
Approved by: https://github.com/ezyang
2024-07-30 21:58:15 +00:00
2b43fab555 [DTensor] Added naive support for nn.init.orthogonal_ (#132104)
Try to unblock https://github.com/pytorch/pytorch/issues/131991

- `nn.init.orthogonal_` uses `tensor.new`, which is the legacy factory function. We change this to `tensor.new_empty` (empty is okay since it will be immediately followed by `.normal_()` to fill the tensor) so that it preserves `DTensor`-ness (see the sketch below).
- `nn.init.orthogonal_` uses QR decomposition (`aten.linalg_qr.default`) and `torch.diag` (calling into `aten.diagonal_copy.default`). For simplicity, we use naive replicate strategies for now. `aten.diagonal_copy.default` could do something more sophisticated for sharded inputs, but I would rather defer that to later due to the complexity. For `orthogonal_` support specifically, since the result of the QR decomp will be replicated, the input to `aten.diagonal_copy.default` will be replicated.
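
A minimal sketch of the factory-function change (plain tensors used here for illustration; with a `DTensor` input, `new_empty` preserves the subclass while the legacy `.new()` does not):

```
import torch

w = torch.empty(4, 3)
rows, cols = w.shape

# was: flattened = w.new(rows, cols).normal_(0, 1)
flattened = w.new_empty((rows, cols)).normal_(0, 1)

q, r = torch.linalg.qr(flattened)  # QR decomposition used by nn.init.orthogonal_
d = torch.diag(r, 0)               # diag -> aten.diagonal_copy under the hood
```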

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132104
Approved by: https://github.com/albanD, https://github.com/wanchaol
2024-07-30 21:55:09 +00:00
3e142d766a [EZ] Make consistent with scale-config.yml (#132164)
Fix inconsistencies from test-infra's scale-config.yml file

To be followed up by https://github.com/pytorch/test-infra/pull/5513 which will catch such inconsistencies going forward
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132164
Approved by: https://github.com/clee2000, https://github.com/malfet, https://github.com/zxiiro
2024-07-30 21:42:23 +00:00
69c34f6e4c Corrects Error Codes from cudaHostRegister (#132089)
This was causing some terrible error messages, e.g.:

```
# printing directly: cudaError.???
# casting to int first: 712

Traceback (most recent call last):
  File "/data/users/lpasqualin/fbsource/fbcode/scripts/lpasqualin/playground.py", line 15, in <module>
    main()
  File "/data/users/lpasqualin/fbsource/fbcode/scripts/lpasqualin/playground.py", line 11, in main
    _create_cpu_state_dict(sd, share_memory=True, pin_memory=True)
  File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 436, in _create_cpu_state_dict
    ret = _iterate_state_dict(
          ^^^^^^^^^^^^^^^^^^^^
  File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 143, in _iterate_state_dict
    ret = {
          ^
  File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 144, in <dictcomp>
    key: _iterate_state_dict(
         ^^^^^^^^^^^^^^^^^^^^
  File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 125, in _iterate_state_dict
    ret = tensor_func(iter_object, pg, device, companion_obj)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/lpasqualin/pytorch/torch/distributed/_state_dict_utils.py", line 428, in tensor_func
    succ == 0
AssertionError: Pinning shared memory failed with error-code: cudaError.???
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132089
Approved by: https://github.com/Skylion007
2024-07-30 21:42:00 +00:00
ff377e16ab Improve logging in the TSConverter (#132082)
Summary: Currently, running explain with TORCH_LOGS enabled will produce duplicate log messages because explain uses the exact same code path as conversion. This PR disables logging when running explain, and moves all logging to convert() to prevent logging from __init__ when we are just using explain.

Test Plan: Manual testing with attached outputs.

Reviewed By: SherlockNoMad, angelayi

Differential Revision: D60199007

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132082
Approved by: https://github.com/ydwu4
2024-07-30 21:37:44 +00:00
495d413519 Include code object of frame being compiled in stack (#132161)
This is pretty useful to have!

Test plan: https://internalfb.com/intern/fblearner/details/586653862/

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132161
Approved by: https://github.com/oulgen
2024-07-30 21:33:27 +00:00
19db4f6014 [capture_triton] fix special kwargs path (#132143)
I didn't test this path when creating the orchestrator. This PR fixes
that path to work in the capture_triton path. The problem is that we are
handling a value that is an int (in the capture_triton path) and a
ConstantVariable (in the Dynamo triton path) so we abstract that out in
the orchestrator.

Test Plan:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132143
Approved by: https://github.com/oulgen
2024-07-30 20:30:40 +00:00
1118c74b5f [PT2] Port fuse_chunk_reshape_unsqueeze_concat_pass to PT2 pre_grad passes (#131902) (#132078)
Summary:

Port fuse_chunk_reshape_unsqueeze_concat_pass to PT2 pre_grad passes

Test Plan: run new UTs

Reviewed By: frank-wei

Differential Revision: D60258724

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132078
Approved by: https://github.com/frank-wei
2024-07-30 20:17:06 +00:00
d53b11bb6e Strict shape checking for NJTs with TestCase.assertEqual() (#131898)
**Background**: `TestCase.assertEqual()` is commonly used during test case validation. Historically, to support NSTs, the logic was written to compare two nested tensors by unbinding them and comparing their components. This logic applied to NJTs as well, which in practice meant that two NJTs with different nested ints in their shapes could compare equal if their components were equal.

This PR changes the above logic so that NJTs are no longer unbound during comparison, allowing them to receive full shape validation. This makes `TestCase.assertEqual()` stricter for NJTs, requiring them to have the same nested ints in their shapes to compare equal.

Note that some tests rely on the old, looser behavior. To address this, the PR introduces a base `NestedTensorTestCase` that defines a helper function `assertEqualIgnoringNestedInts()` so that these tests can explicitly opt in to the looser comparison behavior.
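For illustration, a hedged sketch of what the opt-in looser helper can look like, assuming a component-wise comparison via unbind (not necessarily the PR's exact code):

```python
import torch
from torch.testing._internal.common_utils import TestCase

class NestedTensorTestCase(TestCase):
    def assertEqualIgnoringNestedInts(self, a, b):
        # Compare two NJTs component-wise so that differing nested ints in
        # their shapes do not cause the check to fail.
        for x, y in zip(a.unbind(), b.unbind()):
            self.assertEqual(x, y)
```
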
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131898
Approved by: https://github.com/soulitzer
2024-07-30 20:05:48 +00:00
58f76bc301 Revise skip torchrec logic (#130783)
Summary:
The previous logic added skipped files when the file was imported, which happens at a very early stage. However, skip_torchrec could be set at a later stage (e.g., in APS, we set it during trainer execution). In that case, the skip logic would still take effect since the skipped files had already been added.

So in this diff, we revise the logic so that it adapts to changes to skip_torchrec made at later stages.

Test Plan:
Tested on APS models:

  buck2 run mode/opt //aps_models/ads/icvr:icvr_launcher_live -- mode=local_ig_fm_uhm_mini model_name=ig_fm_one_sparse_benchmark features=ig_fm_one_sparse_benchmark model=ig_fm_one_sparse_benchmark training.pipeline_type=pt2

commit: 2fb485d9e

torchrec related paths were not skipped.

Differential Revision: D59779153

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130783
Approved by: https://github.com/yanboliang
2024-07-30 19:55:20 +00:00
964f97539f [MPS] Correct nonzero warning and fix the test (#132127)
#125355 lifted the natively supported macOS version to 14.

Fixes #132110
Probably fixes this flaky test disabling issue: #126492

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132127
Approved by: https://github.com/malfet
2024-07-30 19:46:25 +00:00
f2dedc910e Improve SpeculationLog error message (#131982)
There are some substantive changes. Instead of recording the *next* instruction in the speculation log, I record the *current* instruction. I think this is more intuitive: we always call speculation at the beginning of executing an instruction, so logically the entry is associated with the current instruction. (Note that self.instruction_pointer is the next instruction, since conventionally we increment the IP before calling speculate.)

The cosmetic change is to also pass in the Instruction corresponding to the IP and print it, and beef up the error message, including notes about the previous instruction that was run before it failed (this is typically the critical instruction).

At time of submission, this test case triggered the error:

```
diff --git a/test/distributed/test_dynamo_distributed.py b/test/distributed/test_dynamo_distributed.py
index 5ade17856e1..60ef89be346 100644
--- a/test/distributed/test_dynamo_distributed.py
+++ b/test/distributed/test_dynamo_distributed.py
@@ -844,6 +844,39 @@ class TestMultiProc(DynamoDistributedMultiProcTestCase):
             for r in res[1:]:
                 self.assertEqual(res[0], r)

+    @unittest.skipIf(not has_triton(), "Inductor+gpu needs triton and recent GPU arch")
+    @config.patch(enable_compiler_collectives=True)
+    def test_compiler_collectives_automatic_dynamic_speculation_divergence(self):
+        with _dynamo_dist_per_rank_init(self.rank, self.world_size):
+            torch._dynamo.utils.clear_compilation_metrics()
+
+            # TODO: This should be possible to do inside the function, but
+            device = f"cuda:{self.rank}"
+
+            @torch.compile()
+            def f(x, y):
+                zx = x.shape
+                zy = y.shape
+                return x.sum() + y.sum()
+
+            if self.rank == 0:
+                dataloader = [4, 4]
+            else:
+                dataloader = [3, 4]
+
+            for data in dataloader:
+                f(
+                    torch.randn(data, device=self.rank),
+                    torch.randn(data, device=self.rank),
+                )
+
+            metrics = torch._dynamo.utils.get_compilation_metrics()
+            # Number of compiles same on all nodes
+            res = [None] * self.world_size
+            torch.distributed.all_gather_object(res, len(metrics))
+            for r in res[1:]:
+                self.assertEqual(res[0], r)
+

 @requires_nccl()
```

although I plan to fix this soon.

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131982
Approved by: https://github.com/anijain2305, https://github.com/mlazos, https://github.com/jansel
2024-07-30 19:21:31 +00:00
e6cddc9271 Fix public API tests (#131386)
This PR fixes a bug in `test_correct_module_names` introduced in #130497. It also addresses post-fix test failures in:
* `torch/ao/quantization/__init__.py` - set the correct `__module__` for several public API helpers
* `torch/library.py` - add `register_vmap` to `__all__`
* `torch/nn/attention/flex_attention.py` - make `round_up_to_multiple` private by prepending an underscore
* `torch/storage.py` - introduce `__all__` to avoid `Self` being re-exported as a public API
* `torch/distributed/pipelining/schedules.py` - add `ZeroBubbleAlgorithm` to `__all__`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131386
Approved by: https://github.com/albanD
2024-07-30 18:42:54 +00:00
f217b470cc [CMAKE] Avoid double setting of LDFLAGS (#130370)
It was observed that in some environments `LDFLAGS` gets directly appended to `CMAKE_SHARED_LINKER_FLAGS`. As a result, the same linker flag can appear twice in `CMAKE_SHARED_LINKER_FLAGS` due to the manual set here:
1bf4a44b33/CMakeLists.txt (L541-L542)
This flag collision causes build failures at the `cmake` stage.
This PR adds an instruction to `CMakeLists.txt` to avoid double setting of `LDFLAGS` into `CMAKE_SHARED_LINKER_FLAGS`.

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130370
Approved by: https://github.com/atalman, https://github.com/tinglvv, https://github.com/malfet
2024-07-30 18:16:04 +00:00
3816f6420a [BE] remove unnecessary _dispatch_sqrt by using ** 0.5 (#131358)
Based on the discussion here where ** 0.5 is not slower than math.sqrt. https://github.com/pytorch/pytorch/pull/129905#discussion_r1675605075

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131358
Approved by: https://github.com/albanD
2024-07-30 18:08:17 +00:00
9f6d7df3d9 docs(multinomial): Add reference to Multinomial class (#131904)
This PR just adds the reference to the class
`torch.distributions.multinomial.Multinomial` in `torch.multinomial`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131904
Approved by: https://github.com/jbschlosser
2024-07-30 18:05:07 +00:00
239d4d2489 Revert "[reland][inductor] switch AotCodeCompiler to new cpp_builder (#130127)"
This reverts commit 9606d61e0c921b886d20cb61454043c6c270ae89.

Reverted https://github.com/pytorch/pytorch/pull/130127 on behalf of https://github.com/ZainRizvi due to broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/130127#issuecomment-2258871791))
2024-07-30 17:39:41 +00:00
9027db1ab8 TCPStore: fix remote address (#131773) (#131913)
Summary:
This fixes corrupt remote address logs caused by dangling pointers to addrinfo_storage inside of addrinfo.

This relands it since it got reverted due to a fmt::format issue internally.

Original Pull Request: https://github.com/pytorch/pytorch/pull/131773
Approved by: https://github.com/kurman

Test Plan:
Enable debug logs and verify addresses are correct

```
TORCH_CPP_LOG_LEVEL=INFO TORCH_DISABLE_SHARE_RDZV_TCP_STORE=1 TORCH_DISTRIBUTED_DEBUG=DETAIL LOGLEVEL=INFO python test/distributed/test_store.py -v
buck2 test @//mode/dev-nosan //caffe2/test/distributed:store
```

Differential Revision: D60296583

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131913
Approved by: https://github.com/kurman, https://github.com/rsdcastro, https://github.com/Skylion007
2024-07-30 17:27:33 +00:00
3864a2d834 [profiler ut] Update event name in test_profiler.py (#131757)
Fixes #ISSUE_NUMBER
To support kernel names containing some uppercase letters.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131757
Approved by: https://github.com/aaronenyeshi
2024-07-30 17:15:31 +00:00
32c57e78ed Specialize sym node when used as device kwarg (#131811)
Fixes https://github.com/pytorch/pytorch/issues/131189.

We specialize the symint in python_arg_parser when used as kwarg device.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131811
Approved by: https://github.com/yanboliang, https://github.com/jansel, https://github.com/albanD
2024-07-30 17:11:57 +00:00
33ce9cf7f9 [FSDP2] Relaxed overlap timing check to avoid flakiness (#132116)
Trying to fix https://github.com/pytorch/pytorch/issues/131081

See https://github.com/pytorch/pytorch/issues/131081#issuecomment-2239443504 for detailed context. This PR is relaxing one assertion against the _baseline_ to try to fix the flakiness.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132116
Approved by: https://github.com/Skylion007
2024-07-30 14:28:12 +00:00
16e0868a3d [FSDP] Add hpu device to _get_remote_device_str (#132120)
In `_create_chunk_sharded_tensor`, `_get_remote_device_str` is used. By default it uses the node count to determine the device instance. For HPU, we need to use the current device to get the device instance.

Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132120
Approved by: https://github.com/awgu
2024-07-30 14:24:24 +00:00
a843178529 Let dynamo inline functional_call (#128646)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128646
Approved by: https://github.com/zou3519
2024-07-30 14:22:23 +00:00
12b67bd998 Fix pyi annotation for ProcessGroupGloo.Options (#132080)
This PR fixes the pyi annotation for `ProcessGroupGloo.Options` based on the definition in the `torch/csrc/distributed/c10d/init.cpp` file.

Fixes #132054

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132080
Approved by: https://github.com/Skylion007
2024-07-30 13:52:31 +00:00
499ead96ff Revert "Grouped Query Attention (#128898)"
This reverts commit d039b14207fe659d664c590efc06cc0a2abc96c0.

Reverted https://github.com/pytorch/pytorch/pull/128898 on behalf of https://github.com/albanD due to Broken test on main ([comment](https://github.com/pytorch/pytorch/pull/128898#issuecomment-2258314481))
2024-07-30 13:11:24 +00:00
cyy
bdf57da6a6 [3/N] Enable clang-tidy on torch/csrc/inductor (#132101)
Follows #132040
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132101
Approved by: https://github.com/Skylion007
2024-07-30 13:04:57 +00:00
cyy
eccbd408e5 [10/N] Fix clang-tidy warnings in jit (#132122)
Follows #132010

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132122
Approved by: https://github.com/Skylion007
2024-07-30 12:56:31 +00:00
83db609ee5 [inductor] fix the cudagraph tree test (#132043)
Summary:
There are two kinds of exceptions:
Case #1:
```
static input data pointer changed.
input name: primals_2. data pointer changed from 140315748992000 to 140315748993536. input stack trace:   File "/dev/shm/uid-30083/c0899c70-seed-nspid4026535598_cgpid16622182-ns-4026535192/caffe2/test/inductor/test_cudagraph_trees.py", line 1826, in forward
    return self.static_tensor + x + self.goo(x)
  File "/dev/shm/uid-30083/c0899c70-seed-nspid4026535598_cgpid16622182-ns-4026535192/caffe2/test/inductor/test_cudagraph_trees.py", line 1816, in forward
    return self.linear(x)

input name: primals_3. data pointer changed from 140315748990976 to 140315748993024. input stack trace:   File "/dev/shm/uid-30083/c0899c70-seed-nspid4026535598_cgpid16622182-ns-4026535192/caffe2/test/inductor/test_cudagraph_trees.py", line 1825, in forward
    self.static_tensor.add_(torch.ones((2, 2), device="cuda"))

```
Case #2:
```
static input data pointer changed.
input name: primals_2. data pointer changed from 139852509086720 to 139852509088256. input stack trace: None
input name: primals_3. data pointer changed from 139852509085696 to 139852509087744. input stack trace:   File "/dev/shm/uid-30083/f61ee184-seed-nspid4026560782_cgpid769179-ns-4026560865/caffe2/test/inductor/test_cudagraph_trees.py", line 1825, in forward
    self.static_tensor.add_(torch.ones((2, 2), device="cuda"))

```
The current impl only covered case #2.

Test Plan: https://www.internalfb.com/intern/testinfra/testrun/15481123762274476

Differential Revision: D60340212

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132043
Approved by: https://github.com/BoyuanFeng
2024-07-30 08:35:56 +00:00
36e8289129 [PT2][Optimus] Optimize cat node inputs pattern (#131866)
Test Plan:
# unit test
```
buck2 test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/inductor:split_cat_fx_passes
```

# benchmark

```
CUDA_VISIBLE_DEVICES=3 OC_CAUSE=1 buck2 run mode/opt //scripts/jackiexu0313/pt2:local_model_with_pt2 -- --test_mode batch-split --model_type "ig_ctr" --flow_id 584880697
```

Counter({'pattern_matcher_nodes': 1589, 'pattern_matcher_count': 1497, 'extern_calls': 393, 'normalization_pass': 342, 'merge_splits_pass': 19, 'fxgraph_cache_miss': 12, 'scmerge_cat_added': 4, 'scmerge_cat_removed': 4, 'scmerge_split_removed': 3, 'unbind_stack_pass': 3, 'batch_tanh': 2, 'scmerge_split_sections_removed': 2, 'scmerge_split_added': 2, 'merge_stack_tahn_unbind_pass': 1, 'optimize_cat_inputs_pass': 1})

P1496150856

Differential Revision: D60274533

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131866
Approved by: https://github.com/jackiexu1992
2024-07-30 07:49:26 +00:00
54d4f6bbca [Inductor][FlexAttention] Correct partial/full blocks naming (#131993)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131993
Approved by: https://github.com/drisspg
2024-07-30 06:40:40 +00:00
03e058189e [dynamo] Support dict unpack of MutableMapping objects (#131961)
Fixes https://github.com/pytorch/pytorch/issues/128067

The basic functionality was already introduced earlier. This just ensures
that we support UserDefinedObjectVariable.
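A hedged, repro-style sketch of the now-supported pattern (the MyMap class below is a hypothetical user-defined MutableMapping, not code from the PR):

```python
from collections.abc import MutableMapping

import torch

class MyMap(MutableMapping):
    def __init__(self, d):
        self._d = dict(d)
    def __getitem__(self, k):
        return self._d[k]
    def __setitem__(self, k, v):
        self._d[k] = v
    def __delitem__(self, k):
        del self._d[k]
    def __iter__(self):
        return iter(self._d)
    def __len__(self):
        return len(self._d)

@torch.compile(backend="eager")
def f(x, m):
    return x + sum({**m}.values())  # dict unpack of a MutableMapping object

f(torch.ones(3), MyMap({"a": 1, "b": 2}))
```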

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131961
Approved by: https://github.com/williamwen42, https://github.com/mlazos, https://github.com/yanboliang
ghstack dependencies: #131827, #131956
2024-07-30 05:49:58 +00:00
f806128619 [dynamo] Skip <frozen abc> to skip __isisintance__ check on abc objects (#131956)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131956
Approved by: https://github.com/williamwen42, https://github.com/mlazos
ghstack dependencies: #131827
2024-07-30 05:49:58 +00:00
13457d1da0 [dynamo][log] Suggest to use pytree when graph-break on optree (#131827)
Discovered while working on https://github.com/pytorch/pytorch/issues/121369
On the model above, the log looks like this

~~~
/home/anijain/local/pytorch2/torch/_dynamo/variables/functions.py:698: UserWarning: Graph break for an optree C/C++ function optree._C.PyCapsule.flatten. Consider using torch._utils.pytree - https://github.com/pytorch/pytorch/blob/main/torch/utils/_pytree.py.
  torch._dynamo.utils.warn_once(msg)
/home/anijain/local/pytorch2/torch/_dynamo/variables/functions.py:698: UserWarning: Graph break for an optree C/C++ function optree.PyCapsule.unflatten. Consider using torch._utils.pytree - https://github.com/pytorch/pytorch/blob/main/torch/utils/_pytree.py.
  torch._dynamo.utils.warn_once(msg)
  ~~~
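
For reference, a hedged sketch of the suggested alternative; the module at the linked path is torch.utils._pytree:

```python
import torch.utils._pytree as pytree

# Flatten/unflatten a nested container without going through optree's C extension,
# which avoids the graph break warned about above.
leaves, spec = pytree.tree_flatten({"a": 1, "b": [2, 3]})
restored = pytree.tree_unflatten(leaves, spec)
assert restored == {"a": 1, "b": [2, 3]}
```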

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131827
Approved by: https://github.com/zou3519, https://github.com/mlazos
2024-07-30 05:49:58 +00:00
fc6066b80f improve mkldnn_linear_pointwise_binary performance for contiguous tensor with non default contiguous strides (#132019)
Fixes https://github.com/pytorch/pytorch/issues/131734

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132019
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5
2024-07-30 05:02:38 +00:00
40f8db5741 [audio hash update] update the pinned audio hash (#132105)
This PR is auto-generated nightly by [this action](https://github.com/pytorch/pytorch/blob/main/.github/workflows/nightly.yml).
Update the pinned audio hash.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132105
Approved by: https://github.com/pytorchbot
2024-07-30 03:39:27 +00:00
aa1488fe02 [inductor] turn on enable_kernel_profile on Windows. (#132025)
Enable `TORCHINDUCTOR_CPP_ENABLE_KERNEL_PROFILE` on Windows inductor.

Tested locally, passing:
![image](https://github.com/user-attachments/assets/a82351af-cc56-4ba1-a8f4-08f1c38713d1)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132025
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-30 03:02:09 +00:00
475da800c7 [inductor] optimize cflags for Windows. (#131980)
changes:
1. optimize cflags for Windows. Ref: https://github.com/pytorch/pytorch/blob/v2.4.0/torch/utils/cpp_extension.py#L215

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131980
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-30 02:59:51 +00:00
bdc42e3fb8 [inductor] validate_can_generate_cpp_wrapper add win32 support. (#131978)
Changes:
1. `validate_can_generate_cpp_wrapper` adds win32 support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131978
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-30 02:59:48 +00:00
baa4c9ca46 Optimize aten.cat calls of a repeated element (#132081)
This was a particular problem for a model I saw which would have a large number of repeats, making compilation slow.
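
A hedged illustration of the pattern being optimized (the actual Inductor rewrite is not shown here): concatenating many copies of the same tensor is equivalent to a single repeat.

```python
import torch

x = torch.randn(2, 3)
n = 1000

out_cat = torch.cat([x] * n, dim=0)  # the repeated-element pattern seen in the model
out_repeat = x.repeat(n, 1)          # equivalent result expressed as one op
assert torch.equal(out_cat, out_repeat)
```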

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132081
Approved by: https://github.com/shunting314
2024-07-30 02:56:00 +00:00
f8e4060484 [Inductor][CPP] Enhance cppcsevar data type deduce (#130827)
**Summary**
Previously, we used `data_type_propagation` at the start of `codegen` to deduce the data type of each node and save this information in `node.meta[OptimizationContext.key]`. Then, we used this node metadata to update the cppcsevar data type in `update_on_args`. However, this method is not always correct. For example, in the codegen of `indirect_indexing` (see [here](096dc444ce/torch/_inductor/codegen/common.py (L1844))), we insert nodes on the fly and reuse the node of `indirect_indexing` to set the `cppcsevar` data type. In this PR, we plan to enhance the `cppcsevar` data type deduction:

- We will deduce the `cppcsevar` data type in `update_on_args` by reusing the code in `data_type_propagation`.

- To align the data type of scalar and vector variables, we previously always cast the scalar to the vector's data type. This caused a data type misalignment between `codegen` and `data_type_propagation`. We should use the same data type promotion logic to align the data types of scalar and vector variables.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130827
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-30 02:51:31 +00:00
b6c1490cc0 [dynamo] make more unpack_var_sequence calls forced (#132069)
Fixes [T197204962](https://www.internalfb.com/intern/tasks/?t=197204962) (example failure: https://www.internalfb.com/intern/testinfra/diagnostics/11540474088277914.281475138576374.1722221031/)

Added tests contain a simple repro for the observed failure (`test_map_unpack_vars`).

Also fixes https://github.com/pytorch/pytorch/issues/132044

Differential Revision: [D60420335](https://our.internmc.facebook.com/intern/diff/D60420335)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132069
Approved by: https://github.com/anijain2305
2024-07-30 02:30:08 +00:00
8721b21b38 Fix fake_tensor w/ non-view tensor (#132050)
Summary: This code was overly complex and was confusing some guards - basically, if a cached result tensor isn't a view, there's no reason to be messing with its storage.

Test Plan: unit tests pass

Differential Revision: D60387821

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132050
Approved by: https://github.com/oulgen
2024-07-30 02:17:18 +00:00
9598c58618 Add config option to skip autotuning conv (#131839)
Requested internally because, for some models, the conv templates are not very helpful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131839
Approved by: https://github.com/oulgen
ghstack dependencies: #131400
2024-07-30 01:57:53 +00:00
5a2620302b [inductor] Replace self_cuda_time_total function calls with self_dev… (#131029)
…ice_time_total for wrapper_bench

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131029
Approved by: https://github.com/shunting314
2024-07-30 01:57:39 +00:00
a147fa577b [MPS] Fix masked_fill_ in non_contiguous cases (#131957)
fixes #131285

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131957
Approved by: https://github.com/DenisVieriu97
2024-07-30 01:34:48 +00:00
3716934b1a [Inductor] Refactor autotuning utils to compute max block sizes (#131730)
These OSS changes are part of a larger MTIA diff. The OSS part is a simple refactor that makes it easier to query max block sizes by the prefix of the grid dimension, e.g. `"X"`, as opposed to having to use separate functions for `get_xmax()`, `get_ymax()`, etc.

Differential Revision: D60195669

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131730
Approved by: https://github.com/eellison
2024-07-30 01:04:53 +00:00
7a7dd8c29e Revert "[NestedTensor] Integrate the softmax operator along the jagged dimension into NestedTensor (#131518)"
This reverts commit bcf5c68c18c6a109e1fa00829eea0428d44cfb6b.

Reverted https://github.com/pytorch/pytorch/pull/131518 on behalf of https://github.com/ZainRizvi due to Sorry, reverting this since this is based on an internal diff that has diverged from actual internal commit (the final PR and diff must always be identical). Conflicts arise when that happens which block the diff train. Let's revert both this PR and the internal diff, and then reland them as a proper new codev diff ([comment](https://github.com/pytorch/pytorch/pull/131518#issuecomment-2257259839))
2024-07-30 00:55:10 +00:00
ab9791c0e3 [export] Add print_readable to unflattener (#128617)
Taking inspiration from `GraphModule.print_readable` (aka I copied its [code](17b45e905a/torch/fx/graph_module.py (L824))), I added a `print_readable` to the unflattened module, because it's kind of nontrivial to print the contents of this module.

Example print from `python test/export/test_unflatten.py -k test_unflatten_nested`
```
class UnflattenedModule(torch.nn.Module):
    def forward(self, x: "f32[2, 3]"):
        # No stacktrace found for following nodes
        rootparam: "f32[2, 3]" = self.rootparam

        # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:99 in forward, code: x = x * self.rootparam
        mul: "f32[2, 3]" = torch.ops.aten.mul.Tensor(x, rootparam);  x = rootparam = None

        # No stacktrace found for following nodes
        foo: "f32[2, 3]" = self.foo(mul);  mul = None
        bar: "f32[2, 3]" = self.bar(foo);  foo = None
        return (bar,)

    class foo(torch.nn.Module):
        def forward(self, mul: "f32[2, 3]"):
            # No stacktrace found for following nodes
            child1param: "f32[2, 3]" = self.child1param
            nested: "f32[2, 3]" = self.nested(mul);  mul = None

            # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:79 in forward, code: return x + self.child1param
            add: "f32[2, 3]" = torch.ops.aten.add.Tensor(nested, child1param);  nested = child1param = None
            return add

        class nested(torch.nn.Module):
            def forward(self, mul: "f32[2, 3]"):
                # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:67 in forward, code: return x / x
                div: "f32[2, 3]" = torch.ops.aten.div.Tensor(mul, mul);  mul = None
                return div

    class bar(torch.nn.Module):
        def forward(self, add: "f32[2, 3]"):
            # No stacktrace found for following nodes
            child2buffer: "f32[2, 3]" = self.child2buffer

            # File: /data/users/angelayi/pytorch2/test/export/test_unflatten.py:87 in forward, code: return x - self.child2buffer
            sub: "f32[2, 3]" = torch.ops.aten.sub.Tensor(add, child2buffer);  add = child2buffer = None
            return sub
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128617
Approved by: https://github.com/zhxchen17, https://github.com/pianpwk
2024-07-30 00:41:44 +00:00
2a4d9aa548 Disable expandable segments checkpointing internally (#132048)
Differential Revision: [D60388286](https://our.internmc.facebook.com/intern/diff/D60388286)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132048
Approved by: https://github.com/ezyang, https://github.com/eqy
2024-07-30 00:26:39 +00:00
be5e44192d Revert "[NestedTensor] Integrate the layer normalization operator along the jagged dimension into NestedTensor (#131519)"
This reverts commit 8fe2bf212dc5e01b15cbe728958f940873230d64.

Reverted https://github.com/pytorch/pytorch/pull/131519 on behalf of https://github.com/ZainRizvi due to Sorry, reverting this since this is based on an internal diff that has diverged from actual internal commit.  Weird conflicts arise when that happens.  Let's revert both this PR and the internal diff, and then reland them as a proper new codev diff ([comment](https://github.com/pytorch/pytorch/pull/131519#issuecomment-2257230717))
2024-07-30 00:18:22 +00:00
b1ccd0c407 [CI] Update environment varible setting for aarch64 (#132046)
Summary: JEMALLOC_LIB and core_number need to be set differently on aarch64.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132046
Approved by: https://github.com/huydhn
2024-07-30 00:09:59 +00:00
e3dc20c94b [NJT] support cat backward (#132076)
cat_tensors_backward uses narrow_symint, so we need to support aten::narrow for NJT.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132076
Approved by: https://github.com/davidberard98
2024-07-29 23:49:26 +00:00
5298acb5c7 Back out "[1/2] PT2 Inductor ComboKernels - Foreach cases (#124969)" (#132065)
Summary:
Original commit changeset: 1d8cfdcef69d

Original Phabricator Diff: D54134695

back out: D54134695

Test Plan: more details see: https://docs.google.com/document/d/1noPTmTdNYHVDFyk7AJSSO7jQoNw6fTo4o6k9eTNeZh8/edit#heading=h.xeo30usu77nc

Reviewed By: zw2326

Differential Revision: D60397377

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132065
Approved by: https://github.com/zw2326, https://github.com/qchip
2024-07-29 22:48:29 +00:00
8b507a922a Mode to emulate amp numerics (#131595)
```
# Mode to emulate pytorch eager numerics for lower precision (fp16, bf16)
# Pytorch eager computes bf16/fp16 by upcasting inputs to fp32 and downcasting after
# For multiple, fused pointwise nodes, inductor will elide the intermediary upcasts and downcasts
# Typically this should be closer to fp64 ref numerics. However, it can be useful for debugging
# to emulate the eager numerics.
```

We add extra upcasts and downcasts for pointwise nodes that correspond to casts that existed in the original user program (excluding pointwise nodes that are emitted during decomposition). Since this is mostly for debugging, I added this information in the `meta` so that this mode does not have unintended side effects like changing pattern matching.
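
A hedged, purely illustrative sketch of the difference this mode targets, based on the description above (not the Inductor implementation):

```python
import torch

a = torch.randn(1024, dtype=torch.bfloat16)
b = torch.randn(1024, dtype=torch.bfloat16)

# Eager-style numerics: each pointwise op upcasts to fp32 and downcasts back to bf16
eager_like = ((a.float() * b.float()).bfloat16().float() + 1.0).bfloat16()

# Fused-style numerics: the intermediary downcast/upcast is elided
fused_like = ((a.float() * b.float()) + 1.0).bfloat16()

# The two can differ slightly; the new mode makes compiled code match the eager-style result
print((eager_like.float() - fused_like.float()).abs().max())
```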

In theory there could also be some other casts with fused reduction -> reduction, although I haven't seen this in practice as much. This could be done as a follow-up. Note: this mode only works with the CUDA backend right now.

This mode was sufficient to eliminate compile differences from https://fb.workplace.com/groups/385893200869952/posts/464263173032954/?comment_id=465199259606012&reply_comment_id=465676792891592.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131595
Approved by: https://github.com/shunting314, https://github.com/bdhirsh, https://github.com/jansel
2024-07-29 22:42:23 +00:00
884eadcd19 Fix multi grad hooks thread safety (#132055)
Thanks @awgu  for spotting this

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132055
Approved by: https://github.com/Skylion007, https://github.com/awgu, https://github.com/albanD
2024-07-29 22:32:59 +00:00
e55e9d8126 Clear speculation log when restarting due to compiler collective (#131983)
The compiler collective can trigger an input to become dynamic, which
can trigger operations to be recorded to the graph, which would change
the speculation log entries (since they only start being recorded once
we have a non-empty output graph). The test case triggers this situation.

Production instance:
https://www.internalfb.com/mlhub/pipelines/runs/mast/f584750649-TrainingApplication?job_attempt=2&version=0&env=PRODUCTION

Signed-off-by: Edward Z. Yang <ezyang@meta.com>

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131983
Approved by: https://github.com/anijain2305, https://github.com/mlazos
2024-07-29 22:32:10 +00:00
62b2e7a553 Revert "Add config option to skip autotuning conv (#131839)"
This reverts commit 3d4de8e96d0bb1fe19b25734a97a19dd85313692.

Reverted https://github.com/pytorch/pytorch/pull/131839 on behalf of https://github.com/eellison due to wrong config name ([comment](https://github.com/pytorch/pytorch/pull/131839#issuecomment-2257117221))
2024-07-29 22:31:51 +00:00
8fe2bf212d [NestedTensor] Integrate the layer normalization operator along the jagged dimension into NestedTensor (#131519)
Modify the existing `layer normalization` operator in PyTorch, invoked by `torch.layer_norm`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the `aten` padding operator, enables PyTorch users to invoke `torch.nn.functional.layer_norm` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` or `(B, *, M, N)` nested tensor.

Write unit tests based on the `softmax` jagged operator to verify the accuracy of the ragged reduction implementation for `torch.nn.functional.layer_norm`. Add unit tests to verify error handling for unsupported features.

Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. The layer normalization operator also requires an operation on a 2-dimensional layer; for nested tensors with 4 or more dimensions, I flatten the extra dimensions, then unflatten them after performing layer normalization.
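
A hedged sketch of the kind of call this enables; the exact `normalized_shape` handling for ragged reductions is an assumption here, not taken verbatim from the PR:

```python
import torch
import torch.nn.functional as F

# (B, *, M) jagged nested tensor with ragged_idx == 1
nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)

# Normalize over the ragged dimension `*` and the trailing M dimension together
out = F.layer_norm(nt, normalized_shape=(nt.shape[1], nt.shape[2]))
```
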
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131519
Approved by: https://github.com/davidberard98
ghstack dependencies: #131518
2024-07-29 22:16:32 +00:00
d039b14207 Grouped Query Attention (#128898)
### Approach: Using the current function declaration

**Constraint:** Q_Heads % KV_Heads == 0

**Major change:**
- Added a new argument enable_gqa: bool to sdpa function call
- It adds a meaning to the third-to-last (head) dimension.

Sample use cases this would enable:
LLama3

```
# LLama3 8b call to SDPA
query = torch.rand(batch, 32, seq_len_q, D)
key = torch.rand(batch, 8, seq_len_kv, D)
value = torch.rand(batch, 8, seq_len_kv, D)

output = scaled_dot_product_attention(query, key, value, is_causal=True, enable_gqa=True)

# Output Shape
(batch, 32, seq_len_q, D)
```

### Design Choice:

- Check that Query.size(-3) == Key.size(-3) == Value.size(-3), or that Query.size(-3) % Key.size(-3) == 0
- The function adjusts the key and value tensors to match the query tensor's head dimension by using repeat_interleave if their numbers of heads are not equal, facilitating correct and efficient computation in attention mechanisms (see the sketch after this list).
- By default the enable_gqa flag is set to False, which ensures that regular sdpa functionality remains unchanged.
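
A minimal sketch of the head-expansion idea described above, assuming q_heads % kv_heads == 0 (illustrative only, not the fused kernel path):

```python
import torch
import torch.nn.functional as F

batch, q_heads, kv_heads, seq_q, seq_kv, d = 2, 32, 8, 16, 16, 64
q = torch.randn(batch, q_heads, seq_q, d)
k = torch.randn(batch, kv_heads, seq_kv, d)
v = torch.randn(batch, kv_heads, seq_kv, d)

repeat = q_heads // kv_heads
k_expanded = k.repeat_interleave(repeat, dim=-3)  # (batch, 32, seq_kv, d)
v_expanded = v.repeat_interleave(repeat, dim=-3)  # (batch, 32, seq_kv, d)
out = F.scaled_dot_product_attention(q, k_expanded, v_expanded, is_causal=True)
```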

### Benchmarks:

- **sdpa.py: #130634**
For different batch sizes, enable_gqa=True shows a substantial improvement in the runtime of sdpa.

 | batch_size | q_num_heads | kv_num_heads | q_seq_len | kv_seq_len | embed_dim | forward_time when enable_gqa=True   |   forward_time when enable_gqa=False    |
| ------------ | ------------- | -------------- | ----------- | ------------ | ----------- | ----------- | ---------------- |
|     1      |     32      |      8       |   2048    |    2048    |   2048    |   100.71  |  119.70  |
|     8      |     32      |      8       |   2048    |    2048    |   2048    |   539.78  |  628.83  |
|     16     |     32      |      8       |   2048    |    2048    |   2048    |   1056.81  |  1225.48  |
|     32      |     32      |      8       |   2048    |    2048    |   2048    |   2099.54  |  2440.45  |

![Screenshot 2024-07-25 at 9 07 40 PM](https://github.com/user-attachments/assets/a3e5f716-c39f-4096-9e6c-82a735e57b7b)

- **TorchTitan: https://github.com/pytorch/torchtitan/pull/458**

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128898
Approved by: https://github.com/drisspg
2024-07-29 21:49:06 +00:00
05a8540041 [cpp-wrapper] create null pointer for zero-size array (#132023)
Zero-size arrays are not supported by the C or C++ standards,
so we create a null pointer for them.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132023
Approved by: https://github.com/desertfire
2024-07-29 21:40:33 +00:00
d8358a2d86 Made register_multi_grad_hook return type RemovableHandle (#132074)
`_MultiHandle` is private. Let us return `RemovableHandle`, which is public.
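A hedged usage sketch: the returned handle can now be treated as a plain `RemovableHandle`.

```python
import torch

a = torch.randn(3, requires_grad=True)
b = torch.randn(3, requires_grad=True)

handle = torch.autograd.graph.register_multi_grad_hook(
    (a, b), lambda grads: print("both grads are ready")
)
(a * b).sum().backward()
handle.remove()  # standard RemovableHandle API
```
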
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132074
Approved by: https://github.com/soulitzer
2024-07-29 21:29:34 +00:00
d5e9fbb012 Revert "BE: reset dynamo before each test in test_module.py (#131372)"
This reverts commit 527901f054a947976dc587bb9cf72c86992b7c87.

Reverted https://github.com/pytorch/pytorch/pull/131372 on behalf of https://github.com/kit1980 due to Broke test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10149118852/job/28065175173) [HUD commit link](ca8153ae67) ([comment](https://github.com/pytorch/pytorch/pull/131372#issuecomment-2257019116))
2024-07-29 21:15:25 +00:00
a4723b566f Revert "BE: reset dynamo before each test in test_ops_gradients.py (#131397)"
This reverts commit ca8153ae6758fbf33cc767cfd0cb384b87b8d3ca.

Reverted https://github.com/pytorch/pytorch/pull/131397 on behalf of https://github.com/kit1980 due to Broke test_modules.py::TestModuleCUDA::test_cpu_gpu_parity_nn_CTCLoss_cuda_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10149118852/job/28065175173) [HUD commit link](ca8153ae67) ([comment](https://github.com/pytorch/pytorch/pull/131372#issuecomment-2257019116))
2024-07-29 21:15:25 +00:00
bdf5a6dca9 Add decomposition for unsqueeze_copy (#130942)
* Extracted from #128416
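
A hedged sketch of roughly what such a decomposition expresses (not necessarily the PR's exact reference implementation):

```python
import torch

def unsqueeze_copy(x: torch.Tensor, dim: int) -> torch.Tensor:
    # *_copy operators return a fresh tensor rather than a view of the input
    return torch.unsqueeze(x, dim).clone()
```
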
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130942
Approved by: https://github.com/peterbell10
2024-07-29 21:13:37 +00:00
3c1562158e [BE] Fix torch.compile docstring formatting issues (#131837)
Fixes #131815

<img width="1098" alt="Screenshot 2024-07-25 at 6 58 39 PM" src="https://github.com/user-attachments/assets/d0f6edc3-419e-4096-803b-cecd45d8644b">

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131837
Approved by: https://github.com/williamwen42
2024-07-29 20:52:28 +00:00
dcb03106b7 [Land Internally] MTIA equivalent of torch.cuda.memory_stats (#132007)
Summary: as title

Test Plan: pytorch ci failing: https://github.com/pytorch/pytorch/issues/131962

Differential Revision: D60335413

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132007
Approved by: https://github.com/hanzlfs, https://github.com/egienvalue
2024-07-29 20:47:18 +00:00
082d0b80ca Min and max NaN propagation fix in MPS backend (#130445)
Partial fix to issue #130295

Moves the min and max ops to use the NaN-propagating API in MPS to align with the PyTorch convention. Adds a regression test to validate that the fix achieves parity with the CPU backend.
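
A small, hedged illustration of the expected NaN-propagating behavior for the reduction variants:

```python
import torch

x = torch.tensor([1.0, float("nan"), 3.0])
# With NaN propagation, both reductions return NaN, matching the CPU backend
print(x.min(), x.max())  # tensor(nan) tensor(nan)
```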
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130445
Approved by: https://github.com/malfet
2024-07-29 20:09:15 +00:00
f44446e851 [dynamo] Turn on inline_inbuilt_nn_modules (#131275)
Known issues that are deliberately kept open and will be fixed later are tracked here - https://github.com/pytorch/pytorch/issues/131696

Training dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/anijain2305/435/head&lCommit=408b9358b8fca3a5d08b39741419fe8a596941aa&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))

![image](https://github.com/user-attachments/assets/08ef081c-37d7-436d-905b-4b9e2b470644)

Inference dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=gh/anijain2305/435/head&lCommit=914244fa2fe0055917e039e35183b21fa90afdc6&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))
![image](https://github.com/user-attachments/assets/32136eff-a39e-4cde-a438-e51a665bc3c9)

Inference sees a little bit more perf degradation but we are ok with that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131275
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #132053
2024-07-29 20:01:51 +00:00
4c2bcf92cb [inductor] Enable FX graph caching in OSS by default (#125863)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125863
Approved by: https://github.com/eellison, https://github.com/oulgen
2024-07-29 19:19:54 +00:00
484852c02b [Doc] update guide install mkl-static from conda to pip (#130026)
<img width="619" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/4ac3ca68-57dc-42c7-ac7a-876dc377ebcf">

The Conda intel channel is not available now.
Use `pip` install instead of `conda`.

`Windows` and `Linux` are available:
Binary list: https://pypi.org/project/mkl-static/#files

`MacOS` is available for an old version:
https://pypi.org/project/mkl-static/2021.3.0/#files

TODO:
1. cherry-pick to `release/2.4` branch, @atalman .
2. fix it also in `release/2.3` branch: https://github.com/pytorch/pytorch/pull/131853

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130026
Approved by: https://github.com/jgong5, https://github.com/atalman
2024-07-29 19:19:15 +00:00
301ec32ae8 [EASY][TEST][CUDA] Fix typo in test_graph_make_graphed_callables_same_pool (#132059)
Per title.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132059
Approved by: https://github.com/Skylion007
2024-07-29 19:15:37 +00:00
5cc34f61d1 [CI] add new test config label ci-test-showlocals to control test log verbosity (#131981)
Add a new label `ci-test-showlocals` and add it to test config filter.
If the PR is labeled with `ci-test-showlocals`, or "ci-test-showlocals" is
present in the PR comment, the test config filter will set an environment
variable `TEST_SHOWLOCALS`. Then `pytest` will show local variables on
failures for better debugging.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131981
Approved by: https://github.com/malfet
ghstack dependencies: #131151
2024-07-29 18:53:14 +00:00
4694ee1ad2 [BE][tests] show local variables on failure in tests (#131151)
------

As per the title, add argument `--locals` for `unittest` and `--showlocals --tb=long` for `pytest` in CI.

Some failures cannot be reproduced on the local machine but exist on cloud CI. This change allows us to investigate the test failure more easily.

Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361

```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000

    @classmethod
    def eval(cls, base, divisor):
        # python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
        # Assert triggered by inequality solver
        # assert base.is_integer, base
        # assert divisor.is_integer, divisor

        # We don't provide the same error message as in Python because SymPy
        # makes it difficult to check the types.
        if divisor.is_zero:
            raise ZeroDivisionError("division by zero")
        if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
            int_oo,
            -int_oo,
            sympy.oo,
            -sympy.oo,
        ):
            return sympy.nan
        if base is sympy.nan or divisor is sympy.nan:
            return sympy.nan

        if base.is_zero:
            return sympy.S.Zero
        if base.is_integer and divisor == 1:
            return base
        if base.is_integer and divisor == -1:
            return sympy.Mul(base, -1)
        if (
            isinstance(base, sympy.Number)
            and isinstance(divisor, sympy.Number)
            and (
                base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
                or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
            )
        ):
            r = float(base) / float(divisor)
            if r == math.inf:
                return int_oo
            elif r == -math.inf:
                return -int_oo
            elif math.isnan(r):
                return sympy.nan
            else:
                return sympy.Integer(math.floor(r))
        if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
            return sympy.Integer(int(base) // int(divisor))
        if isinstance(base, FloorDiv):
            return FloorDiv(base.args[0], base.args[1] * divisor)

        # Expands (x + y) // b into x // b + y // b.
        # This only works if floor is an identity, i.e. x / b is an integer.
        for term in sympy.Add.make_args(base):
            quotient = term / divisor
            if quotient.is_integer and isinstance(divisor, sympy.Integer):
                # NB: this is correct even if the divisor is not an integer, but it
                # creates rational expressions that cause problems with dynamic
                # shapes.
                return FloorDiv(base - term, divisor) + quotient

        try:
            gcd = sympy.gcd(base, divisor)
            if gcd != 1:
>               return FloorDiv(
                    sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
                )

base       = -1.00000000000000
cls        = FloorDiv
divisor    = -1.00000000000000
gcd        = 1.00000000000000
quotient   = 1.00000000000000
term       = -1.00000000000000

/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}

    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
>           retval = cfunc(*args, **kwargs)
E           RecursionError: maximum recursion depth exceeded in comparison
E
E           To execute this test, run the following from the base repo dir:
E               python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E           This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

args       = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc      = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func       = <function Function.__new__ at 0x7fc530317280>
kwargs     = {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
2024-07-29 18:53:14 +00:00
cyy
ab912b7fef [2/N] Fix clang-tidy warnings in inductor (#132040)
Follows #131979
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132040
Approved by: https://github.com/Skylion007
2024-07-29 18:41:24 +00:00
cyy
c764ef6d53 [9/N] Fix clang-tidy warnings in jit (#132010)
Follows  #131997

Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132010
Approved by: https://github.com/Skylion007
2024-07-29 18:38:35 +00:00
f389bca2e9 [dynamo][inline_inbuilt_nn_modules] Skip test_dpp_graphs for now (#132053)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132053
Approved by: https://github.com/laithsakka
2024-07-29 17:59:47 +00:00
6c6fbb4691 Fix pyi annotation for ProcessGroupNCCL.Options (#130957)
Probably all the other options need updating too, but this is the one I
needed.  The accurate annotation was determined by reading
torch/csrc/distributed/c10d/init.cpp

Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130957
Approved by: https://github.com/wconstab, https://github.com/fduwjj
2024-07-29 17:46:01 +00:00
025242d065 [cpu-test] enable test_cpu_repro in fbcode (#132022)
Summary: This diff enables test_cpu_repro in fbcode

Test Plan: ci

Differential Revision: D60364517

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132022
Approved by: https://github.com/desertfire
2024-07-29 17:45:26 +00:00
ca8153ae67 BE: reset dynamo before each test in test_ops_gradients.py (#131397)
https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR got reverted a couple of times because we saw post-land test failures that we didn't see before merge. This PR only resets dynamo before each test in `test_ops_gradients.py` to make it easier to land.

Eventually, after we reset dynamo in each individual test file, we can move the change to the base class (TestCase) and remove the change from the individual test files.
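
A minimal sketch of the per-test reset, assuming it lives in setUp (illustrative, not the PR's exact diff):

```python
import unittest

import torch

class DynamoResettingTestCase(unittest.TestCase):
    def setUp(self):
        super().setUp()
        torch._dynamo.reset()  # clear compilation caches so tests don't leak state into each other
```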

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131397
Approved by: https://github.com/zou3519
ghstack dependencies: #131551, #131388, #131372
2024-07-29 17:39:23 +00:00
527901f054 BE: reset dynamo before each test in test_module.py (#131372)
https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR got reverted a couple of times because we saw post-land test failures that we didn't see before merge. This PR only resets dynamo before each test in `test_module.py` to make it easier to land.

Eventually, after we reset dynamo in each individual test file, we can move the change to the base class (TestCase) and remove the change from the individual test files.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131372
Approved by: https://github.com/zou3519
ghstack dependencies: #131551, #131388
2024-07-29 17:39:23 +00:00
bd1a29b158 [BE][Ez]: Update ruff to 0.5.5. Bugfixes and better LSP support (#132037)
Updates ruff to the latest and greatest, mainly better LSP support and bugfixes
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132037
Approved by: https://github.com/malfet
2024-07-29 16:57:13 +00:00
6cf493158e Revert "Enable FlashAttention on Windows (#131906)"
This reverts commit b90bc66766c3503c1f229660710a803488d53c16.

Reverted https://github.com/pytorch/pytorch/pull/131906 on behalf of https://github.com/atalman due to Windows nightly failures ([comment](https://github.com/pytorch/pytorch/pull/131906#issuecomment-2256421183))
2024-07-29 16:49:23 +00:00
3d4de8e96d Add config option to skip autotuning conv (#131839)
Requested internally because, for some models, the conv templates are not very helpful.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131839
Approved by: https://github.com/oulgen
ghstack dependencies: #131400
2024-07-29 16:43:58 +00:00
e73a4cb21f Revert "[pt2e][quant] Ensure BN node is erased after convert (#131651)"
This reverts commit eba2ffd278a004df8fd335328ab8ba00c978e471.

Reverted https://github.com/pytorch/pytorch/pull/131651 on behalf of https://github.com/ZainRizvi due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/131651#issuecomment-2256407968))
2024-07-29 16:42:24 +00:00
f72266ecea Revert "Let dynamo inline functional_call (#128646)"
This reverts commit 5aab1acc84ff4a4374c9ddd179be48b07c6c8a74.

Reverted https://github.com/pytorch/pytorch/pull/128646 on behalf of https://github.com/clee2000 due to the newly added test dynamo/test_higher_order_ops.py::FuncTorchHigherOrderOpTests::test_functional_call_sequential_params_and_buffers [GH job link](https://github.com/pytorch/pytorch/actions/runs/10147452270/job/28058682000) [HUD commit link](5aab1acc84) is broken, probably a landrace since it passed on PR ([comment](https://github.com/pytorch/pytorch/pull/128646#issuecomment-2256375501))
2024-07-29 16:26:50 +00:00
962f248437 Add decomposition for expand_copy (#130940)
* Extracted from #129476

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130940
Approved by: https://github.com/peterbell10
2024-07-29 16:23:56 +00:00
e393c7fa05 Tighten torch.library.infer_schema input types (#130705)
Made the following changes:
- mutates_args is now keyword-only and mandatory. This is to align with
  torch.library.custom_op (which makes it mandatory because it's easy to
  miss)
- op_name is now keyword-only. This helps the readability of the API
- updated all usages of infer_schema

This change is not BC-breaking because we introduced
torch.library.infer_schema a couple of days ago.
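
A hedged usage sketch matching the tightened signature (the example function below is hypothetical):

```python
import torch
from torch.library import infer_schema

def mysin(x: torch.Tensor) -> torch.Tensor:
    return x.sin()

# mutates_args is now mandatory and keyword-only; op_name is keyword-only as well
schema = infer_schema(mysin, mutates_args=())
print(schema)  # e.g. "(Tensor x) -> Tensor"
```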

Test Plan:
- tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130705
Approved by: https://github.com/yushangdi
ghstack dependencies: #131777
2024-07-29 16:01:19 +00:00
957a89f56c Revert "[inductor] Fix unsoundness with negative-valued indexing expressions (#131761)"
This reverts commit 03760be2714c6ed3b4f44c4dc3ea016f557d8597.

Reverted https://github.com/pytorch/pytorch/pull/131761 on behalf of https://github.com/atalman due to Broke CI: inductor/test_cpu_cpp_wrapper.py::DynamicShapesCppWrapperCpuTests::test_linear_binary_dynamic_shapes_cpp_wrapper [GH job link](https://github.com/pytorch/pytorch/actions/runs/10145214748/job/28051168920) [HUD commit link](03760be271) ([comment](https://github.com/pytorch/pytorch/pull/131761#issuecomment-2256287736))
2024-07-29 15:52:08 +00:00
ca254d145f [BE][Ez]: Update fmtlib submodule to 11.0.2 (#132036)
Updates fmtlib to 11.0.2 which mainly includes minor bugfixes for edge cases such as move-only iterators and formatting on non-posix systems.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132036
Approved by: https://github.com/malfet
2024-07-29 15:50:00 +00:00
5aab1acc84 Let dynamo inline functional_call (#128646)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128646
Approved by: https://github.com/zou3519
ghstack dependencies: #129091, #130490
2024-07-29 15:41:03 +00:00
e0e4e84ef9 wrap self.call_function(...) in try finally block to undo changes to self.kw_names (#130490)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130490
Approved by: https://github.com/williamwen42, https://github.com/zou3519
ghstack dependencies: #129091
2024-07-29 15:41:03 +00:00
1e9cdf7d91 Relax constraints for creating a GenericContextWrappingVariable (#129091)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129091
Approved by: https://github.com/yanboliang, https://github.com/zou3519
2024-07-29 15:40:59 +00:00
6cbad37bee make _inductor.config.rocm.supported_arch set order deterministic for caching (#131921)
This fixes some AOTAutograd caching tests that were failing flakily internally because they would occasionally miss the cache.

[T195598220](https://www.internalfb.com/intern/tasks/?t=195598220)

I found it by running some stress tests and diffing the AOT cache information on each run, and ended up with this diff (`rocm.supported_arch` order was changing from run to run, although apparently not in OSS):
```
--- tmpa.txt    2024-07-26 11:03:46.220924798 -0700
+++ tmpb.txt    2024-07-26 11:03:44.053586437 -0700
@@ -1,4 +1,4 @@
-Autograd graph cache hash details for key ati644hstroc45hvmc6dcgzmxz7n4ezi46vbb2iriu634aojza74:
+Autograd graph cache hash details for key ayfqecv56xcczljwuvigh73sjd7dfvgr6akzf3ikr46nq7dfm6eh:
 [z76jr26kn3enjhz7b3ks3a2dgpwolnnqsqmo3wn6ddml3vxjtam] aot_config: (0, True, False, False, False, [LocalSource(local_name='x', cell_or_freevar=False)], True, False)
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] grad_enabled: False
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] disable_amp: False
@@ -184,7 +184,7 @@
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[rocm.print_kernel_resource_usage]: False
 [tquy2we2efmowuj4wuqzcfcfdcrkzkzmwdae6hprj7fa64jpusq] inductor_config[rocm.rocm_home]: None
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[rocm.save_temps]: False
-[xr3ayxgy2xduff3r5ey7o3ypfndexy7edha62kibw2dexijjvdr] inductor_config[rocm.supported_arch]: {'gfx941', 'gfx942', 'gfx940'}
+[qauhp44riavgubamhd3ehrifxdgm7pkwx2nehsqg5toy54dqqmn] inductor_config[rocm.supported_arch]: {'gfx942', 'gfx940', 'gfx941'}
 [cev5uo2jlwdhw2uyzcm7vr6cl23azjfw437f5r5lskm7spucos6] inductor_config[rocm.use_fast_math]: True
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[rocm.use_preselected_instances]: False
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[save_args]: False
@@ -231,7 +231,7 @@
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[verbose_progress]: False
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[warn_mix_layout]: False
 [a44txxznx23htuc7zxw7larc7yxpxzxmiqzloxznw7z2k2azqj3] inductor_config[worker_start_method]: fork
-Autograd graph cache hash details for key ati644hstroc45hvmc6dcgzmxz7n4ezi46vbb2iriu634aojza74:
+Autograd graph cache hash details for key ayfqecv56xcczljwuvigh73sjd7dfvgr6akzf3ikr46nq7dfm6eh:
 [z76jr26kn3enjhz7b3ks3a2dgpwolnnqsqmo3wn6ddml3vxjtam] aot_config: (0, True, False, False, False, [LocalSource(local_name='x', cell_or_freevar=False)], True, False)
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] grad_enabled: False
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] disable_amp: False
@@ -417,7 +417,7 @@
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[rocm.print_kernel_resource_usage]: False
 [tquy2we2efmowuj4wuqzcfcfdcrkzkzmwdae6hprj7fa64jpusq] inductor_config[rocm.rocm_home]: None
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[rocm.save_temps]: False
-[xr3ayxgy2xduff3r5ey7o3ypfndexy7edha62kibw2dexijjvdr] inductor_config[rocm.supported_arch]: {'gfx941', 'gfx942', 'gfx940'}
+[qauhp44riavgubamhd3ehrifxdgm7pkwx2nehsqg5toy54dqqmn] inductor_config[rocm.supported_arch]: {'gfx942', 'gfx940', 'gfx941'}
 [cev5uo2jlwdhw2uyzcm7vr6cl23azjfw437f5r5lskm7spucos6] inductor_config[rocm.use_fast_math]: True
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[rocm.use_preselected_instances]: False
 [esstihe2nyydk4mhzpvox3qkajyu5y5t23hk3fi2me7jn75xi3o] inductor_config[save_args]: False
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131921
Approved by: https://github.com/jamesjwu, https://github.com/oulgen
2024-07-29 15:29:04 +00:00
14108c1677 Fix error handling in _triton.py (#132006)
On Windows, _triton.py raises a confusing error ("RuntimeError: Should never be installed"), as triton is not supported on Windows. This error is not caught by the current PyTorch exception handling. This pull request adds handling for that runtime error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132006
Approved by: https://github.com/oulgen
2024-07-29 15:02:25 +00:00
be3eba382f [CI] Run perf test for perf_cpu_aarch64 (#132038)
Summary: Run perf test for perf_cpu_aarch64 instead of regular CI test (test_linux_aarch64).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132038
Approved by: https://github.com/malfet
2024-07-29 13:48:40 +00:00
c35f21e5fc Revert "[BE][tests] show local variables on failure in tests (#131151)"
This reverts commit 14158d892a2bd9b34edb5637f9a05217ea0330bd.

Reverted https://github.com/pytorch/pytorch/pull/131151 on behalf of https://github.com/atalman due to Broke CI: test_testing.py::TestTestingCUDA::test_cuda_assert_should_stop_common_device_type_test_suite_cuda [GH job link](https://github.com/pytorch/pytorch/actions/runs/10131415299/job/28014665693) [HUD commit link](14158d892a) ([comment](https://github.com/pytorch/pytorch/pull/131151#issuecomment-2255921015))
2024-07-29 13:19:38 +00:00
06fe99a097 Revert "[CI] add new test config label ci-test-showlocals to control test log verbosity (#131981)"
This reverts commit dfa18bf3f39c5a90b48baf956e50fa7da4462d3d.

Reverted https://github.com/pytorch/pytorch/pull/131981 on behalf of https://github.com/atalman due to Sorry, need to revert bottom PR, which broke CI: https://github.com/pytorch/pytorch/pull/131151 ([comment](https://github.com/pytorch/pytorch/pull/131981#issuecomment-2255892628))
2024-07-29 13:09:41 +00:00
7ef927da15 Revert "[dynamo] Turn on inline_inbuilt_nn_modules (#131275)"
This reverts commit 6de65d5dd4226b6bae15352b575c81a6750c819b.

Reverted https://github.com/pytorch/pytorch/pull/131275 on behalf of https://github.com/atalman due to Broke CI: dynamo/test_structured_trace.py::StructuredTraceTest::test_ddp_graphs [GH job link](https://github.com/pytorch/pytorch/actions/runs/10132084288/job/28016215101) [HUD commit link](6de65d5dd4) ([comment](https://github.com/pytorch/pytorch/pull/131275#issuecomment-2255839646))
2024-07-29 12:48:27 +00:00
cyy
efca51e171 [8/N] Fix clang-tidy warnings in jit (#131997)
Follows #131996
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131997
Approved by: https://github.com/Skylion007
2024-07-29 12:40:42 +00:00
eb9409511e Revert "support zb1p and zb2p algorithms (#130752)"
This reverts commit 8fe5b93667b60e37c12d288659a25cbd5ae53c79.

Reverted https://github.com/pytorch/pytorch/pull/130752 on behalf of https://github.com/atalman due to Broke Periodic CI: distributed/pipelining/test_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass4 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10131472868/job/28014900187) [HUD commit link](8fe5b93667) ([comment](https://github.com/pytorch/pytorch/pull/130752#issuecomment-2255819078))
2024-07-29 12:40:00 +00:00
9d497887b8 Changes to support clang-19 (#131905)
Co-authored-by: pruthvistony <pruthvigithub@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131905
Approved by: https://github.com/jeffdaily, https://github.com/Skylion007
2024-07-29 12:38:23 +00:00
cyy
b67811abda [1/N] Fix clang-tidy warnings in inductor (#131979)
Fixes clang-tidy warnings in inductor.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131979
Approved by: https://github.com/Skylion007
2024-07-29 12:37:56 +00:00
d47c470f47 [dynamo] implement var_getattr in UserFunctionVariable (#130413)
This PR adds `getattr` support to `UserFunctionVariable`. Although this usage is uncommon, it does appear in [Megatron's code](https://github.com/NVIDIA/Megatron-LM/blob/main/megatron/core/tensor_parallel/layers.py#L635).

```
def linear_with_grad_accumulation_and_async_allreduce(...):
    ....
    if not linear_with_grad_accumulation_and_async_allreduce.warned:
        ....
    ....

linear_with_grad_accumulation_and_async_allreduce.warned = False
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130413
Approved by: https://github.com/yanboliang
2024-07-29 08:29:59 +00:00
dfa18bf3f3 [CI] add new test config label ci-test-showlocals to control test log verbosity (#131981)
Add a new label `ci-test-showlocals` and add it to the test config filter.
If the PR is labeled with `ci-test-showlocals`, or "ci-test-showlocals"
is present in the PR comment, the test config filter will set an environment
variable `TEST_SHOWLOCALS`. Then `pytest` will show local variables on
failure for better debugging.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131981
Approved by: https://github.com/malfet
2024-07-29 07:40:42 +00:00
f151f25c0b BE: reset dynamo before each test in test_torch.py (#131388)
https://github.com/pytorch/pytorch/pull/126586 tried to reset dynamo before each unit test. That PR got reverted a couple of times because we saw post-land test failures that we didn't see before merge. This PR only resets dynamo before each test in `test_torch.py` to make it easier to land.

Eventually, after we reset dynamo in each individual test file, we can move the change to the base class (TestCase) and remove the change from the individual test files.
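
A minimal sketch of the per-test reset pattern, written against plain `unittest` (the actual change hooks this into the shared PyTorch test harness):

```python
import unittest

import torch
import torch._dynamo


class MyTorchTest(unittest.TestCase):
    def setUp(self):
        super().setUp()
        # clear compiled-code caches and guards so state left behind by a
        # previous test cannot affect this one
        torch._dynamo.reset()

    def test_add(self):
        fn = torch.compile(lambda x: x + 1, backend="eager")
        self.assertTrue(torch.equal(fn(torch.ones(2)), torch.full((2,), 2.0)))


if __name__ == "__main__":
    unittest.main()
```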

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131388
Approved by: https://github.com/zou3519
ghstack dependencies: #131551
2024-07-29 04:57:34 +00:00
30e7fc0fe1 Cpp wrapper: set args to CppWrapperKernelArgs in cpp template kernel (#129557)
Fix the compilation error:
```cpp
/tmp/tmpywg34bca/tg/ctg7wbli6pvydsjr2xsxamdbamkquhlincuky3dzopa3ilrxqdwt.cpp:401:24: error: cannot convert ‘at::Tensor’ to ‘const bfloat16*’ {aka ‘const c10::BFloat16*’}
  401 |     cpp_fused_div_mm_0(arg2_1, constant2, _frozen_param1, buf1);
      |                        ^~~~~~
      |                        |
      |                        at::Tensor
```

The generated code after the fix will be:
```cpp
cpp_fused_div_mm_0((bfloat16*)(arg2_1.data_ptr()), (bfloat16*)(constant2.data_ptr()), (bfloat16*)(_frozen_param1.data_ptr()), (bfloat16*)(buf1.data_ptr()));
```

Multiple changes are required for ABI compatible mode. Separate it into a follow-up PR in this ghstack: https://github.com/pytorch/pytorch/pull/131841

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129557
Approved by: https://github.com/leslie-fang-intel
2024-07-29 04:01:17 +00:00
03760be271 [inductor] Fix unsoundness with negative-valued indexing expressions (#131761)
This fixes a few instances where we assumed indexing expressions were
non-negative. This is not valid when we have more complicated
expressions involving masking e.g. pointwise cat.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131761
Approved by: https://github.com/ezyang
2024-07-29 03:14:13 +00:00
2a02b5cd22 [Intel GPU] Dispatch Stub support (#130019)
# Motivation
Structured codegen is beneficial because it makes it easier to decouple tensor meta setting from kernel implementation. At present, XPU operators need to handle tensor metas in a hand-written way.

We plan to leverage the codegen system to auto-generate structured operators. This PR adds `DispatchStub` support for Intel GPUs. Based on that, XPU operators will be able to register kernel functors to operator stubs.

This is a prerequisite of PR #130082, where we will modify the codegen system to generate XPU needed source files and headers.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130019
Approved by: https://github.com/EikanWang, https://github.com/gujinghui, https://github.com/albanD
2024-07-29 02:18:52 +00:00
cyy
5b3b2b9cc7 [7/N] Fix clang-tidy warnings in jit (#131996)
Follows #131986

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131996
Approved by: https://github.com/ezyang
2024-07-29 01:21:18 +00:00
cyy
ddd539ba6c [6/N] Fix clang-tidy warnings in jit (#131986)
Follows  #131969
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131986
Approved by: https://github.com/ezyang
2024-07-29 00:49:08 +00:00
7b0e10f0e5 fix _MaskPartial when multiple embeddings coexist (#131264)
Previously, using `_MaskPartial` when multiple embeddings coexist had the following issues:
1. Suppose an `nn.Embedding` has shape `[vocab_size, emb_size]`. When there is more than one embedding sharing the same `vocab_size` but with different `emb_size`s, they would not share an `OpStrategy`, since each, when involved in computation, would have a different `OpSchema`; however, there would be a cache hit for redistribute (specifically `_gen_transform_infos` in `torch/distributed/_tensor/_redistribute.py` when doing `Replicate` -> `_MaskPartial`), as the `_MaskPartial` only has `vocab_size` as `logical_dim_size` but not `emb_size` as an attribute. This cache hit is undesirable and would cause trouble when doing all-reduce/reduce-scatter on the new `_MaskPartial` in a separate `OpStrategy`. The error was reported in #130725. In this PR, we introduce `offset_shape` to represent the embedding's full shape, to avoid cache hits between embeddings of different shapes.
2. The second issue arises when we have two `nn.Embedding`s `emb1` and `emb2` with the same shape. There will be a cache hit not only in `_gen_transform_infos`, but also in `OpStrategy` generation. Previously, if we sequentially did `Replicate` -> `_MaskPartial` for both `emb1` and `emb2` and then did a reduction on the `_MaskPartial` of `emb1`, it would destroy the `MaskBuffer` and `emb2` would hit an error. This PR adds a `refcount` for the `MaskBuffer` so that it can be properly shared by multiple `nn.Embedding`s.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131264
Approved by: https://github.com/wanchaol
2024-07-29 00:40:58 +00:00
0ab6551bcb [inductor] Handle NoneLayout in count_numel (#131645)
We're currently under-counting mutations from ExternKernel since they use `NoneLayout` which doesn't have an associated shape and dtype. Instead, we can get that information from the buffer being mutated.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131645
Approved by: https://github.com/jansel
2024-07-28 23:02:22 +00:00
cyy
7c1fbc7fe9 [5/N] Remove unused parameter (#131998)
Follows #131291

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131998
Approved by: https://github.com/ezyang
2024-07-28 21:29:06 +00:00
f901b02066 [Distributed] Do not expose nlohmann/json.hpp in public headers (#131925)
Move the `<nlohmann/json.hpp>` dependency, as well as the `NCCLTraceBuffer::getCollectiveTraceJson` and `NCCLTraceBuffer::dump_json` implementations introduced by https://github.com/pytorch/pytorch/pull/129505, from the header into the .cpp file. This relaxes the requirement on all downstream clients to depend on the library.

Fixes https://github.com/pytorch/pytorch/issues/130678

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131925
Approved by: https://github.com/albanD, https://github.com/d4l3k, https://github.com/fduwjj, https://github.com/c-p-i-o
ghstack dependencies: #131922
2024-07-28 18:45:24 +00:00
75c8d59ea1 Remove mypy ignore from torch/_dynamo/variables/lazy.py (#131785)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131785
Approved by: https://github.com/aorenste, https://github.com/zou3519
ghstack dependencies: #131786, #131870
2024-07-28 17:13:53 +00:00
7c29665f77 Remove mypy ignore from torch/testing/_internal/distributed/ (#131870)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131870
Approved by: https://github.com/aakhundov
ghstack dependencies: #131786
2024-07-28 17:13:53 +00:00
2e4807575c Remove mypy ignore from torch/_dynamo/polyfill.py (#131786)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131786
Approved by: https://github.com/aorenste, https://github.com/zou3519
2024-07-28 17:13:49 +00:00
cc512ea0f6 [inductor] Fix flaky tests in test_aot_inductor.py (#131994)
Summary:
The `test_model_modified_weights` in `test_aot_inductor.py` has been failing internally for a while. The behavior leading to the test failure was that, after updating the eager model's weights and recompiling the (CPU) model with AOTI, the output of the model was identical to the one before the weights were updated.

The root cause is here in Python:

8927fc209f/test/inductor/test_aot_inductor_utils.py (L69-L71)

which, in turn, instantiates the `Runner` object in C++, relying on `dlopen` to load the *.so. The problem is that a repeated `dlopen` call does not reload the library from the same path unless `dlclose` is called in between the two `dlopen` calls. There is a `dlclose` in the `Runner`'s destructor, but it's not called, likely due to the way the loaded `runner` gets closed over in Python:

8927fc209f/test/inductor/test_aot_inductor_utils.py (L83-L94)

Here we copy the *.so file to a unique temporary path right before loading it into a `runner`, to avoid the `dlopen` staleness described above. This fixes `test_model_modified_weights` and, hopefully, will help avoid similar errors in future tests.
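
A rough sketch of the workaround, with a hypothetical helper name (the real change lives in the AOTI test utilities):

```python
import ctypes
import shutil
import tempfile
from pathlib import Path


def load_so_fresh(so_path: str) -> ctypes.CDLL:
    # Copy the library to a unique path so dlopen cannot return a cached
    # handle for a previously loaded library at the same path.
    tmp_dir = tempfile.mkdtemp()
    unique_path = Path(tmp_dir) / Path(so_path).name
    shutil.copy(so_path, unique_path)
    return ctypes.CDLL(str(unique_path))
```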

Test Plan: Tested internally.

Differential Revision: D60348165

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131994
Approved by: https://github.com/chenyang78
2024-07-28 16:55:22 +00:00
6de65d5dd4 [dynamo] Turn on inline_inbuilt_nn_modules (#131275)
Known issues that are deliberately kept open and will be fixed later are tracked here - https://github.com/pytorch/pytorch/issues/131696

Training dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=training&dtype=amp&lBranch=gh/anijain2305/435/head&lCommit=408b9358b8fca3a5d08b39741419fe8a596941aa&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))

![image](https://github.com/user-attachments/assets/08ef081c-37d7-436d-905b-4b9e2b470644)

Inference dashboard ([link](https://hud.pytorch.org/benchmark/compilers?dashboard=torchinductor&startTime=Thu%2C%2018%20Jul%202024%2000%3A03%3A50%20GMT&stopTime=Thu%2C%2025%20Jul%202024%2000%3A03%3A50%20GMT&granularity=hour&suite=torchbench&mode=inference&dtype=bfloat16&lBranch=gh/anijain2305/435/head&lCommit=914244fa2fe0055917e039e35183b21fa90afdc6&rBranch=gh/anijain2305/435/base&rCommit=d31f2ae904ba2cf0884bf24413ba2109c3585d51))
![image](https://github.com/user-attachments/assets/32136eff-a39e-4cde-a438-e51a665bc3c9)

Inference sees a little bit more perf degradation but we are ok with that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131275
Approved by: https://github.com/ezyang, https://github.com/jansel
ghstack dependencies: #131744, #131928, #131948
2024-07-28 13:23:00 +00:00
8927fc209f [inductor] Add type hints to functions in debug.py (#131836)
Summary: ATT

Test Plan: lintrunner

Reviewers:

Subscribers:

Tasks:

Tags:

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131836
Approved by: https://github.com/eellison
2024-07-28 04:54:22 +00:00
500aea8d50 Build PT aarch64 on arm runner (#131964)
Another fix is needed to address https://github.com/pytorch/pytorch/actions/runs/10118374576/job/27985575620.  The build needs to be done on an arm runner to stay compatible with the Docker image.

### Testing

https://github.com/pytorch/pytorch/actions/runs/10118589329/job/27985670691

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131964
Approved by: https://github.com/malfet
2024-07-28 04:50:38 +00:00
945bf78894 Revert "[BE] typing for decorators - fx/_compatibility (#131568)"
This reverts commit 193f62fde91ee20deb5ddcd9ff4593cd78d74c64.

Reverted https://github.com/pytorch/pytorch/pull/131568 on behalf of https://github.com/clee2000 due to same as https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359 but I clicked the wrong link by accident.  This is where it actually starts ([comment](https://github.com/pytorch/pytorch/pull/131568#issuecomment-2254330781))
2024-07-28 03:43:39 +00:00
b002ec61b6 Revert "[BE] typing for decorators - masked/_ops (#131569)"
This reverts commit aa58af8b43ad0e615415b4d754255f5be481d41a.

Reverted https://github.com/pytorch/pytorch/pull/131569 on behalf of https://github.com/clee2000 due to same as https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359 but I clicked the wrong link by accident.  This is where it actually starts ([comment](https://github.com/pytorch/pytorch/pull/131568#issuecomment-2254330781))
2024-07-28 03:43:39 +00:00
a3ba405871 Revert "[BE] typing for decorators - library (#131570)"
This reverts commit 5731b486c87bedff69aa0264d6c934bf723eb513.

Reverted https://github.com/pytorch/pytorch/pull/131570 on behalf of https://github.com/clee2000 due to same as https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359 but I clicked the wrong link by accident.  This is where it actually starts ([comment](https://github.com/pytorch/pytorch/pull/131568#issuecomment-2254330781))
2024-07-28 03:43:39 +00:00
a0abb77007 Revert "[BE] typing for decorators - distributed/_tensor/ops/utils (#131571)"
This reverts commit 4b985e6f803023ec301238d2b4bab4fbea4dd03c.

Reverted https://github.com/pytorch/pytorch/pull/131571 on behalf of https://github.com/clee2000 due to same as https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359 but I clicked the wrong link by accident.  This is where it actually starts ([comment](https://github.com/pytorch/pytorch/pull/131568#issuecomment-2254330781))
2024-07-28 03:43:39 +00:00
a8a9882899 Implement fused_scaled_matmul_reduce_scatter for async-TP (#131950)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131950
Approved by: https://github.com/weifengpy
ghstack dependencies: #131410, #131831, #131832, #131833
2024-07-28 03:39:12 +00:00
0538a69a8d [micro_pipeline_tp] support all-gather -> _scaled_mm (#131833)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131833
Approved by: https://github.com/weifengpy
ghstack dependencies: #131410, #131831, #131832
2024-07-28 03:39:11 +00:00
492e9a4886 [micro_pipeline_tp] add support for type-erased all-gather pattern observed in DTensor + float8_experimental (#131832)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131832
Approved by: https://github.com/weifengpy
ghstack dependencies: #131410, #131831
2024-07-28 03:39:11 +00:00
fd5b7d4bf9 Revert "[BE] typing for decorators - _meta_registrations (#131572)"
This reverts commit bfe0079b72aa3ed315ae8f140c97a5826c401a65.

Reverted https://github.com/pytorch/pytorch/pull/131572 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:32 +00:00
609447a626 Revert "[BE] typing for decorators - _jit_internal (#131573)"
This reverts commit f0f20f7e97716b4b077dca2a1a42930ccf990c1c.

Reverted https://github.com/pytorch/pytorch/pull/131573 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:32 +00:00
4684b8e9d7 Revert "[BE] typing for decorators - _inductor/lowering (#131574)"
This reverts commit b2cbcf710b26c4cb92d810fff46b6ddcb8d10cbf.

Reverted https://github.com/pytorch/pytorch/pull/131574 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:32 +00:00
07b7f51877 Revert "[BE] typing for decorators - _inductor/fx_passes/post_grad (#131575)"
This reverts commit 42dc5a47a157f9a441ceba53cf569cc42a640732.

Reverted https://github.com/pytorch/pytorch/pull/131575 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:32 +00:00
6a0c3bae21 Revert "[BE] typing for decorators - fx/experimental/migrate_gradual_types/constraint_generator (#131576)"
This reverts commit 37d76c7d48353cff5ed0d868b7ca486ad092ceaf.

Reverted https://github.com/pytorch/pytorch/pull/131576 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:32 +00:00
b1d640a2b7 Revert "[BE] typing for decorators - ao/quantization/quantizer/xnnpack_quantizer_utils (#131577)"
This reverts commit 5ee6a6dacc926da37ebe06e4206dcc307bf891f5.

Reverted https://github.com/pytorch/pytorch/pull/131577 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:32 +00:00
d3c17fea90 Revert "[BE] typing for decorators - _library/custom_ops (#131578)"
This reverts commit c65b197b85aeee61ed4c09527a8f6eecf8c20e27.

Reverted https://github.com/pytorch/pytorch/pull/131578 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:32 +00:00
065d0fe570 Revert "[BE] typing for decorators - fx/experimental/graph_gradual_typechecker (#131579)"
This reverts commit 79f0c4dc04c7976b734767d64c4833932219dcfb.

Reverted https://github.com/pytorch/pytorch/pull/131579 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:31 +00:00
5ced63a005 Revert "[BE] typing for decorators - utils/flop_counter (#131580)"
This reverts commit 81c26ba5ae1edf95da8f6956ae4b5ad23c9833c6.

Reverted https://github.com/pytorch/pytorch/pull/131580 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:31 +00:00
2c4023d65f Revert "[BE] typing for decorators - _refs/nn/functional (#131581)"
This reverts commit dbf7c318b2dd4652467f11f4aaebaa3ed372e728.

Reverted https://github.com/pytorch/pytorch/pull/131581 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:31 +00:00
e448f32944 Revert "[BE] typing for decorators - signal/windows/windows (#131582)"
This reverts commit 8689d377f9b60b70efa6608e654a3889f947f4d8.

Reverted https://github.com/pytorch/pytorch/pull/131582 on behalf of https://github.com/clee2000 due to breaking lint internally D60265575 ([comment](https://github.com/pytorch/pytorch/pull/131572#issuecomment-2254328359))
2024-07-28 03:29:31 +00:00
d90f6b45c0 Revert "[inductor] Add type hints to functions in mkldnn_fusion.py (#131820)"
This reverts commit fb3ddafbcfe6de1c4b208c020bc5ff4c4c4faf79.

Reverted https://github.com/pytorch/pytorch/pull/131820 on behalf of https://github.com/clee2000 due to reverting this to revert something else, only action you should need to do is to rebase and merge again, sorry for the churn ([comment](https://github.com/pytorch/pytorch/pull/131820#issuecomment-2254327833))
2024-07-28 03:26:14 +00:00
8f5cf46405 Revert "Fix public API tests (#131386)"
This reverts commit 91fcfd87600545c19b975bd6ea134f2f931bf84a.

Reverted https://github.com/pytorch/pytorch/pull/131386 on behalf of https://github.com/clee2000 due to reverting this to revert something else, only action you should need to do is to rebase and merge again, sorry for the churn ([comment](https://github.com/pytorch/pytorch/pull/131386#issuecomment-2254327487))
2024-07-28 03:23:04 +00:00
cyy
7be0ce51b6 Fix handle serialization error (#131871)
This fixes a bug where code tried to serialize a `std::string` through the C API.
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131871
Approved by: https://github.com/Skylion007
2024-07-28 00:33:20 +00:00
3e0ccb3a9f Fixing fake tensor SymInt caching (#131966)
Summary: Some tests are failing because of a weird interaction between the symbolic sizes and the `set()` - back it out for now.

Differential Revision: D60320595

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131966
Approved by: https://github.com/oulgen
2024-07-27 22:43:57 +00:00
d07a125af2 [Inductor] supporting pointwise intermediate nodes in B2B-GEMM (#131685)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131685
Approved by: https://github.com/eellison
2024-07-27 20:11:20 +00:00
14158d892a [BE][tests] show local variables on failure in tests (#131151)
------

As per the title, add the `--locals` argument for `unittest` and `--showlocals --tb=long` for `pytest` in CI.

Some failures cannot be reproduced on the local machine but occur on cloud CI. This change allows us to investigate such test failures more easily.

Example output: https://github.com/pytorch/pytorch/actions/runs/9961546996/job/27523888353?pr=130710#step:20:3361

```text
/opt/conda/envs/py_3.8/lib/python3.8/site-packages/sympy/core/function.py:307:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

cls = FloorDiv, base = -1.00000000000000, divisor = -1.00000000000000

    @classmethod
    def eval(cls, base, divisor):
        # python test/test_dynamic_shapes.py -k TestDimConstraints.test_dim_constraints_solve_full
        # Assert triggered by inequality solver
        # assert base.is_integer, base
        # assert divisor.is_integer, divisor

        # We don't provide the same error message as in Python because SymPy
        # makes it difficult to check the types.
        if divisor.is_zero:
            raise ZeroDivisionError("division by zero")
        if base in (int_oo, -int_oo, sympy.oo, -sympy.oo) and divisor in (
            int_oo,
            -int_oo,
            sympy.oo,
            -sympy.oo,
        ):
            return sympy.nan
        if base is sympy.nan or divisor is sympy.nan:
            return sympy.nan

        if base.is_zero:
            return sympy.S.Zero
        if base.is_integer and divisor == 1:
            return base
        if base.is_integer and divisor == -1:
            return sympy.Mul(base, -1)
        if (
            isinstance(base, sympy.Number)
            and isinstance(divisor, sympy.Number)
            and (
                base in (int_oo, -int_oo, sympy.oo, -sympy.oo)
                or divisor in (int_oo, -int_oo, sympy.oo, -sympy.oo)
            )
        ):
            r = float(base) / float(divisor)
            if r == math.inf:
                return int_oo
            elif r == -math.inf:
                return -int_oo
            elif math.isnan(r):
                return sympy.nan
            else:
                return sympy.Integer(math.floor(r))
        if isinstance(base, sympy.Integer) and isinstance(divisor, sympy.Integer):
            return sympy.Integer(int(base) // int(divisor))
        if isinstance(base, FloorDiv):
            return FloorDiv(base.args[0], base.args[1] * divisor)

        # Expands (x + y) // b into x // b + y // b.
        # This only works if floor is an identity, i.e. x / b is an integer.
        for term in sympy.Add.make_args(base):
            quotient = term / divisor
            if quotient.is_integer and isinstance(divisor, sympy.Integer):
                # NB: this is correct even if the divisor is not an integer, but it
                # creates rational expressions that cause problems with dynamic
                # shapes.
                return FloorDiv(base - term, divisor) + quotient

        try:
            gcd = sympy.gcd(base, divisor)
            if gcd != 1:
>               return FloorDiv(
                    sympy.simplify(base / gcd), sympy.simplify(divisor / gcd)
                )

base       = -1.00000000000000
cls        = FloorDiv
divisor    = -1.00000000000000
gcd        = 1.00000000000000
quotient   = 1.00000000000000
term       = -1.00000000000000

/opt/conda/envs/py_3.8/lib/python3.8/site-packages/torch/utils/_sympy/functions.py:159:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

args = (FloorDiv, -1.00000000000000, -1.00000000000000), kwargs = {}

    @wraps(func)
    def wrapper(*args, **kwargs):
        try:
>           retval = cfunc(*args, **kwargs)
E           RecursionError: maximum recursion depth exceeded in comparison
E
E           To execute this test, run the following from the base repo dir:
E               python test/test_sympy_utils.py -k TestValueRanges.test_binary_ref_fn_floordiv_dtype_float
E
E           This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0

args       = (FloorDiv, -1.00000000000000, -1.00000000000000)
cfunc      = <functools._lru_cache_wrapper object at 0x7fc5303173a0>
func       = <function Function.__new__ at 0x7fc530317280>
kwargs     = {}
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131151
Approved by: https://github.com/ezyang
2024-07-27 19:39:40 +00:00
466ea8ce54 Add fallback() to torch.library (#131707)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131707
Approved by: https://github.com/zou3519
2024-07-27 18:02:35 +00:00
cyy
8e5a367311 [5/N] Fix clang-tidy warnings in jit (#131969)
Follows #131903
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131969
Approved by: https://github.com/ezyang
2024-07-27 17:54:20 +00:00
918ece4f4d [BE][Easy][11/19] enforce style for empty lines in import segments in test/dy*/ (#129762)
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by linter.

You can review these PRs via:

```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129762
Approved by: https://github.com/anijain2305
2024-07-27 17:43:53 +00:00
ae9f17a821 [aoti] Rename OSS DynamicArg and OpKernel (#131862)
Summary: Fixing P1495466240 which I think is due to the fact that internal also has an "OpKernel" in the same namespace, using thrift instead of json.

Test Plan: https://www.internalfb.com/intern/testinfra/testrun/4785074844896831

Differential Revision: D60273354

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131862
Approved by: https://github.com/desertfire
2024-07-27 17:34:50 +00:00
8cdfdb41bc Revert "[NestedTensor] Integrate the layer normalization operator along the jagged dimension into NestedTensor (#131519)"
This reverts commit f862f457304f1952e75336f9f74e4ea3d2a5eb72.

Reverted https://github.com/pytorch/pytorch/pull/131519 on behalf of https://github.com/atalman due to broke CI: test_nestedtensor.py::TestNestedTensorSubclassCPU::test_layer_norm_with_lengths_requires_grad_False_components_require_grad_False_cpu_float32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10121747545/job/27996722731) [HUD commit link](f862f45730) ([comment](https://github.com/pytorch/pytorch/pull/131519#issuecomment-2254167994))
2024-07-27 14:45:47 +00:00
07389163f0 [C10][BE] Use range loop (#131922)
Non-functional change that iterates over entries in `getCollectiveTraceJson` and uses `C10_UNUSED` rather than the `(void)i;` trick

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131922
Approved by: https://github.com/XilunWu
2024-07-27 11:26:27 +00:00
cyy
f83ef69b84 Fix typo in assignment operators (#131890)
Most typos were introduced in #131077
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131890
Approved by: https://github.com/Skylion007
2024-07-27 11:13:42 +00:00
cyy
c82441e07a Fix std::optional checking bug (#131874)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131874
Approved by: https://github.com/Skylion007
2024-07-27 11:08:10 +00:00
93a4671746 Add out_dtypes to fused_all_gather_scaled_matmul's args (#131831)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131831
Approved by: https://github.com/weifengpy
ghstack dependencies: #131410
2024-07-27 11:07:43 +00:00
12cd040edd [micro_pipeline_tp] exclude simple overlappable collectives as micro-pipeline TP candidates when reorder_for_compute_comm_overlap is enabled (#131410)
When a collective can be hidden through either simple overlapping or micro-pipeline TP, we prefer simple overlapping to avoid the overhead associated with decomposition. If `reorder_for_compute_comm_overlap` is enabled, we identify collectives that can be hidden through simple overlapping and exclude them from micro-pipeline TP candidates.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131410
Approved by: https://github.com/weifengpy
2024-07-27 11:07:43 +00:00
36d24925c6 [inline_inbuilt_nn_modules][inductor-cpu] More skips for dynamic shapes when inlining enabled (#131948)
The issue is tracked here - https://github.com/pytorch/pytorch/issues/131929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131948
Approved by: https://github.com/eellison, https://github.com/leslie-fang-intel
ghstack dependencies: #131744, #131928
2024-07-27 10:03:49 +00:00
aee6bcdba4 [Traceable FSDP2][Inductor] Apply compute/comm reordering passes to achieve overlap (#131614)
This PR enables the Inductor compute/comm reordering passes to Traceable FSDP2 to achieve overlap. Note that the overlap is not maximally optimized yet and the follow-up work will be done in subsequent PRs.

Test commands:
- `pytest -rA  test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131614
Approved by: https://github.com/yifuwang
ghstack dependencies: #131510
2024-07-27 08:39:58 +00:00
9e06572704 [Traceable FSDP2][Inductor] Create grouped nodes for FSDP2 all-gather code block and reduce-scatter code block (after Buffer/Operation split) (#131510)
This PR creates these `GroupedSchedulerNode`s:
- One for each all-gather code block (cast + copy-in + all-gather)
- One for each all-gather-wait code block (all-gather-wait + copy-out)
- One for each reduce-scatter code block (copy-in + reduce-scatter)
- One for each reduce-scatter-wait code block (reduce-scatter-wait)

This serves two goals:
- Prevent outside ops from being fused into these op groups, in order to have more predictable memory usage.
- Make it easier to specify the dependency e.g. from `i+1` all-gather group node to the `i` all-gather-wait group node, to enforce FSDP2 comm ordering (i.e. "serialization of comms").

The actual "reorder-for-FSDP-compute-comm-overlap" PR will come next.

Test commands:
- `pytest -rA  test/distributed/test_compute_comm_reordering.py::TestComputeCommReorderingMultiProc`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_transformer_backend_inductor`
- `pytest -rA test/distributed/_composable/fsdp/test_fully_shard_compile.py::TestFullyShardCompile::test_nested_fully_shard_backend_inductor`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131510
Approved by: https://github.com/yifuwang
2024-07-27 08:39:58 +00:00
cyy
99e13e68e9 [4/N] Fix clang-tidy warnings in jit (#131903)
Follows #131830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131903
Approved by: https://github.com/Skylion007
2024-07-27 08:08:14 +00:00
f862f45730 [NestedTensor] Integrate the layer normalization operator along the jagged dimension into NestedTensor (#131519)
Modify the existing `layer normalization` operator in PyTorch, invoked by `torch.layer_norm`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the `aten` padding operator, enables PyTorch users to invoke `torch.nn.functional.layer_norm` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` or `(B, *, M, N)` nested tensor.

Write unit tests based on the `softmax` jagged operator to verify the accuracy of the ragged reduction implementation for `torch.nn.functional.layer_norm`. Add unit tests to verify error handling for unsupported features.

Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. The layer normalization operator also requires an operation on a 2-dimensional layer; for nested tensors with 4 or more dimensions, I flatten the extra dimensions, then unflatten them after performing layer normalization.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131519
Approved by: https://github.com/davidberard98
ghstack dependencies: #131518
2024-07-27 07:09:10 +00:00
bcf5c68c18 [NestedTensor] Integrate the softmax operator along the jagged dimension into NestedTensor (#131518)
Modify the existing `softmax` operator in PyTorch, invoked by `torch.softmax`, to allow for reductions along the jagged dimension of a nested tensor. The function originally had a basic implementation for reducing along 1 non-ragged dimension. This diff, which uses the aten padding operator, enables PyTorch users to invoke `torch.softmax` on a nested tensor when reducing along the ragged dimension, e.g. `*` in a `(B, *, M)` nested tensor.

Write unit tests based on the `sum` and `mean` jagged operators to verify the accuracy of the ragged reduction implementation for `torch.softmax`. Add unit tests to verify error handling for unsupported features in `NestedTensor` `torch.softmax`.

Note that this implementation is limited to nested tensors with `ragged_idx == 1`, i.e. the ragged dimension is not transposed. In addition, the `softmax` operator is required to take in as input an integer for the reduction dimension `dim`, requiring new unit tests heavily inspired by the `sum` and `mean` jagged operator unit tests. `Softmax` also allows for reducing along the batch dimension.
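
A hypothetical usage sketch, assuming the post-PR behavior on a jagged-layout nested tensor with `ragged_idx == 1` (here `dim=1` is the ragged dimension):

```python
import torch

# (B, *, M) nested tensor with a ragged second dimension
nt = torch.nested.nested_tensor(
    [torch.randn(3, 8), torch.randn(5, 8)], layout=torch.jagged
)

# softmax over the ragged dimension, which this change enables
out = torch.nn.functional.softmax(nt, dim=1)
```
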
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131518
Approved by: https://github.com/davidberard98
2024-07-27 07:09:10 +00:00
c49e857d32 [pt] immutable accessors in graph signature (#131940)
Summary: splitting PT part of D60253955

Test Plan: existing tests

Differential Revision: D60296909

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131940
Approved by: https://github.com/angelayi, https://github.com/zhxchen17
2024-07-27 05:32:53 +00:00
96c1862e0b Remove mypy ignore from torch/_dynamo/variables/__init__.py (#131784)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131784
Approved by: https://github.com/aorenste, https://github.com/zou3519, https://github.com/Skylion007
2024-07-27 05:07:33 +00:00
1bfe7eb7e6 Update how we do sdpa testing (#131743)
## Motivation

This refactor aligns our testing methodology with the Flash Attention upstream repository while addressing several key issues:

1. **Standardized comparison**: We now compare fused kernels against float64 references, using the maximum of a calculated tolerance (based on same-precision math implementation) or standard float32 `atol`.

2. **Reduced redundancy**: Utilizing the same tensors for both same-precision math and fused kernel runs eliminates duplication.

3. **Improved maintainability**: The new approach simplifies tolerance adjustments across all affected tests.

4. **Consistency**: Standardizing tensor comparisons ensures a more uniform and reliable testing suite.

These changes collectively simplify our testing code, improve its maintainability, and provide a more robust framework for validating our attention mechanisms.
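
A simplified sketch of the comparison scheme described above; the fudge factor and default `atol` values are illustrative assumptions, not the exact numbers used in the tests:

```python
import torch


def check_fused_against_ref(out_fused, out_ref_lowp, out_ref_fp64,
                            default_atol=1e-5, fudge=4.0):
    # error already incurred by the same-precision math implementation
    ref_error = (out_ref_lowp.to(torch.float64) - out_ref_fp64).abs().max().item()
    # tolerance is the larger of the scaled reference error and the float32 default
    atol = max(fudge * ref_error, default_atol)
    torch.testing.assert_close(
        out_fused.to(torch.float64), out_ref_fp64, atol=atol, rtol=0.0
    )
```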

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131743
Approved by: https://github.com/jainapurva, https://github.com/jbschlosser
2024-07-27 03:58:49 +00:00
bcdba9f91d Added hpu backend support in fsdp utils (#127757)
In fsdp init_utils, add support for the hpu backend device in the `_get_device` API.

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127757
Approved by: https://github.com/wconstab, https://github.com/jgong5, https://github.com/awgu
2024-07-27 03:30:59 +00:00
28fd2e905d [inductor] enhance cpp_builder lint check. (#131752)
enhance cpp_builder `mypy` check.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131752
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-27 02:46:27 +00:00
a90b8b967a [inductor] enable windows inductor UTs (#131767)
Changes:
1. Add `skipIfWindows` function.
2. Fix `fresh_inductor_cache` raising an error on Windows because loaded modules cannot be deleted.
3. Disable some UTs which do not pass on Windows.
4. Enable test_torchinductor in Windows CI.

I have tested this and it passes on my dev machine:
<img width="864" alt="image" src="https://github.com/user-attachments/assets/91d5a62f-7383-44b3-b614-99940f196fdb">

TODO: review and fix the skipped cases.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131767
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-27 02:46:03 +00:00
3768faec2f carry cond in data-dependent error (#131932)
Test Plan: existing

Differential Revision: D60302877

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131932
Approved by: https://github.com/zhxchen17
2024-07-27 02:13:04 +00:00
9606d61e0c [reland][inductor] switch AotCodeCompiler to new cpp_builder (#130127)
Changes:
1. Switch `AotCodeCompiler` to the new cpp_builder.
2. Only use `deprecated_cpp_compile_command` for `fb_code`, since I can't debug it anymore without Meta-internal environment access.
3. Add `TODO` comments asking for help from Meta employees to continue this work.
4. Due to item 3, the only remaining use of `deprecated_cpp_compile_command` left to fix is for `fb_code`, so let's remove `validate_new_cpp_commands`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130127
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-07-27 01:46:13 +00:00
fdf1451bfa Add __all__ to torch.optim to define public interface (#131959)
There was a regression in the public interface for `torch.optim` introduced in #125452 when `torch/optim/__init__.pyi` was merged into `torch/optim/__init__.py`. [The import aliases were not preserved and so now `pyright` thinks that these classes are not publicly exported from `torch/optim/__init__.py`.](https://github.com/pytorch/pytorch/pull/125452/files#diff-941595c1e1aa06bec94578499dd3654532a5183d0bc1bcd94d1f33b47e0d0adfL1-L15)

```
error: "SGD" is not exported from module "torch.optim"
```

Adding these classes/modules to `__all__` fixes this.
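
A minimal sketch of the fix pattern; the real `__all__` covers all optimizer classes and submodules, only two are shown here for illustration:

```python
# torch/optim/__init__.py (abridged sketch)
from torch.optim.adam import Adam as Adam
from torch.optim.sgd import SGD as SGD

# Explicitly declaring the public names lets type checkers like pyright treat
# them as re-exported from torch.optim.
__all__ = ["Adam", "SGD"]
```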

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131959
Approved by: https://github.com/ezyang
2024-07-27 01:03:25 +00:00
8458980bbf Move benchmarks/dynamo/huggingface configuration to YAML (#131724)
Similar to https://github.com/pytorch/pytorch/pull/120299

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131724
Approved by: https://github.com/shunting314
2024-07-27 00:55:04 +00:00
ef8d118c67 Sync with changes to test-infra's scale-config.yml (#131955)
This synchronizes lf-canary-scale-config and lf-scale-config with the ones in test-infra.

This really needs some automatic validation to prevent it from drifting out of sync over and over again (coming soon...)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131955
Approved by: https://github.com/malfet
2024-07-27 00:25:40 +00:00
8b04edcac1 Delete unused yml files (#131298)
To be landed at least 3 days later after previous commit
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131298
Approved by: https://github.com/ZainRizvi
ghstack dependencies: #130762
2024-07-27 00:21:22 +00:00
1e00f055a4 Move distributed experimental jobs back to the amazon2 for now (#131963)
Something about the new Amazon2023 AMI is making some distributed tests fail. Moving them back to the old AMI until the issue is fixed

These particular jobs are causing this test to fail:
https://github.com/pytorch/pytorch/issues/129539

More details in https://github.com/pytorch/pytorch/issues/131962
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131963
Approved by: https://github.com/clee2000
2024-07-26 23:44:56 +00:00
91fcfd8760 Fix public API tests (#131386)
This PR fixes a bug in `test_correct_module_names` introduced in #130497. It also addresses post-fix test failures in:
* `torch/ao/quantization/__init__.py` - set the correct `__module__` for several public API helpers
* `torch/library.py` - add `register_vmap` to `__all__`
* `torch/nn/attention/flex_attention.py` - make `round_up_to_multiple` private by prepending an underscore
* `torch/storage.py` - introduce `__all__` to avoid `Self` being re-exported as a public API
* `torch/distributed/pipelining/schedules.py` - add `ZeroBubbleAlgorithm` to `__all__`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131386
Approved by: https://github.com/albanD
2024-07-26 23:38:43 +00:00
02b922900b [aoti] Fix float16 and bfloat16 for generated GPU code (#131437)
Fixes #131333

Summary:
- Add header to define `float16` and `bfloat16` as `at::Half` and `at::BFloat16`.
- change `float16` and `bfloat16` to `float` before passing to kernel.

code generated before:
```cpp
.....
    half var_1;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float16(convert_arrayref_tensor_to_tensor(arg1_1), &var_1));
....
```

code generated now:
```cpp
typedef at::Half half;
typedef at::BFloat16 bfloat16;
.....
    half var_1_tmp;
    AOTI_TORCH_ERROR_CODE_CHECK(aoti_torch_item_float16(convert_arrayref_tensor_to_tensor(arg1_1), &var_1_tmp));
    float var_1 = float(var_1_tmp);
....
```

Test plan: `TORCHINDUCTOR_ABI_COMPATIBLE=1 TORCHINDUCTOR_CPP_WRAPPER=1 python test/inductor/test_torchinductor.py -k GPUTests.test_unspec_inputs_cuda`
Work in progress.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131437
Approved by: https://github.com/desertfire
2024-07-26 23:36:11 +00:00
0272934238 [Inductor][CPU] Fix an InvalidVecISA issue on CI (#131812)
Summary: CPU CI nodes failed to find a valid VecISA because importing torch under the default pytorch directory fails with the following message, so switch the cwd to a tmp directory.

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/var/lib/jenkins/workspace/torch/__init__.py", line 66, in <module>
    from torch.torch_version import __version__ as __version__
  File "/var/lib/jenkins/workspace/torch/torch_version.py", line 4, in <module>
    from torch.version import __version__ as internal_version
ModuleNotFoundError: No module named 'torch.version'
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131812
Approved by: https://github.com/eellison, https://github.com/malfet
2024-07-26 22:31:44 +00:00
5489ff8e94 Use Mermaid for the diagram in torch/ao/quantization/fx/README.md (#131412)
preview 3a0efcdfa3/torch/ao/quantization/fx/README.md
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131412
Approved by: https://github.com/jerryzh168
2024-07-26 22:01:21 +00:00
16cd1aaa1d [inductor] Improve sort kernel perf (#131719)
Closes #129507

This makes two changes to the sort kernel:
1. Use int16 for the indices since we only operate on small dims anyway
2. Instead of passing an explicit mask, we pass the rnumel and derive the
   mask from it, which saves an additional reduction in the sort
   kernel's inner loop.

In my benchmarks, this gives enough of a perf improvement to bump up the
max rblock to 512.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131719
Approved by: https://github.com/eellison
2024-07-26 21:56:47 +00:00
b90bc66766 Enable FlashAttention on Windows (#131906)
Let's just give this a try.

Reland of https://github.com/pytorch/pytorch/pull/131875.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131906
Approved by: https://github.com/drisspg
2024-07-26 21:41:56 +00:00
d73b55d64b Support meta tensors as inputs to the triton_kernel_wrapper HOPs (#131896)
We automatically generate FakeTensor support for them (the FakeTensor
kernel for a triton kernel is "return None"). The same thing should
apply to the meta kernel.

Tests:
- new tests

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131896
Approved by: https://github.com/oulgen
2024-07-26 21:41:03 +00:00
fb98cd33f1 [inline_inbuilt_nn_modules][inductor-cpu] Skip test_quantized_linear_amx (#131928)
The issue is tracked here - https://github.com/pytorch/pytorch/issues/131929

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131928
Approved by: https://github.com/eellison
ghstack dependencies: #131744
2024-07-26 21:28:17 +00:00
c8626a4e1f [BE] add a list of inductor test files to skip resetting dynamo (#131551)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131551
Approved by: https://github.com/zou3519
2024-07-26 21:08:15 +00:00
fde577702d [TD] More synonyms for filepath (#131838)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131838
Approved by: https://github.com/PaliC, https://github.com/ZainRizvi
2024-07-26 21:02:42 +00:00
1bda3a3135 Migrate nightly.yml workflow & docs to Amazon 2023 (#131821)
A continuation of the migration started in
- https://github.com/pytorch/pytorch/pull/131250

Migrates nightly jobs and the linux-docs job in pull.yml

To preserve reusability, I'm switching to a new format here that allows one to specify only the runner prefix instead of the full runner name, allowing multiple jobs to continue using the same base runner type as they did before.

**Validation:**
- Nightly builds passed in the prev commit: https://github.com/pytorch/pytorch/actions/runs/10102118461/job/27937632823?pr=131821
- Latest commit only updated the docs job in pull.yml, and that has already passed: https://github.com/pytorch/pytorch/actions/runs/10114635537/job/27974392472?pr=131821

The other in-progress jobs are irrelevant
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131821
Approved by: https://github.com/atalman, https://github.com/seemethere
2024-07-26 20:54:43 +00:00
0e6df1e0fb Disable remote cache on test (#131908)
Summary: Fixes test internally

Test Plan:
buck2 test 'fbcode//mode/opt' fbcode//caffe2/test/inductor:cudagraph_trees -- --exact 'caffe2/test/inductor:cudagraph_trees - test_cache_hit_forward_miss_backward (caffe2.test.inductor.test_cudagraph_trees.CudaGraphTreeTests)'

Passes

Differential Revision: D60293177

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131908
Approved by: https://github.com/clee2000
2024-07-26 20:19:02 +00:00
071ac38141 fast-path FakeTensor detach (#131899)
Fixes https://github.com/pytorch/pytorch/issues/128281, see investigation at https://github.com/pytorch/pytorch/issues/128281#issuecomment-2252976926.

benchmark:
```
python benchmarks/dynamo/huggingface.py --performance --timing --explain --backend aot_eager --device cuda --training --float32 --only BertForMaskedLM
```

time before:
```
TIMING: entire_frame_compile:30.85435 backend_compile:23.98599 total_wall_time:30.85435
```

time after:
```
TIMING: entire_frame_compile:24.35898 backend_compile:18.15235 total_wall_time:24.35898
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131899
Approved by: https://github.com/ezyang, https://github.com/zou3519, https://github.com/albanD
2024-07-26 20:16:08 +00:00
2ec8312a28 Add rerun_disabled_tests for inductor (#131681)
Test in prod?

This also turns on the mem leak check

Briefly checked that
```
 python3 ".github/scripts/filter_test_configs.py" \
    --workflow "inductor" \
    --job-name "cuda12.1-py3.10-gcc9-sm86 / build" \
    --test-matrix "{ include: [
    { config: "inductor", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_distributed", shard: 1, num_shards: 1, runner: "linux.g5.12xlarge.nvidia.gpu" },
    { config: "inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "dynamic_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_huggingface", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_timm", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_timm", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_torchbench", shard: 1, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "aot_inductor_torchbench", shard: 2, num_shards: 2, runner: "linux.g5.4xlarge.nvidia.gpu" },
    { config: "inductor_cpp_wrapper_abi_compatible", shard: 1, num_shards: 1, runner: "linux.g5.4xlarge.nvidia.gpu" },
  ]}
  " \
    --selected-test-configs "" \
    --pr-number "${PR_NUMBER}" \
    --tag "${TAG}" \
    --event-name "schedule" \
    --schedule "29 8 * * *" \
    --branch "${HEAD_BRANCH}"
```
has rerun disabled tests option in the test matrix

I don't think all these things need to run but I'm not sure which ones (probably just inductor?)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131681
Approved by: https://github.com/zou3519
2024-07-26 20:05:24 +00:00
da1a1fa55f Move load_yaml_file to common (#131924)
This is for https://github.com/pytorch/pytorch/pull/131724 and future timm_models.py refactoring.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131924
Approved by: https://github.com/shunting314, https://github.com/huydhn
2024-07-26 19:47:52 +00:00
6c95f79645 [CI] Increase the timeout for aarch64 docker build (#131926)
Summary: Increase the timeout limit for pytorch-linux-jammy-aarch64-py3.10-gcc11-inductor-benchmarks. If slow build is a problem later, we can upgrade the arm64 CI instance capability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131926
Approved by: https://github.com/avikchaudhuri
2024-07-26 19:27:45 +00:00
782efd8e5b Revert "Add rerun_disabled_tests for inductor (#131681)"
This reverts commit 85fa66be04b6f78139da4f0ec8f8b1956291e1c5.

Reverted https://github.com/pytorch/pytorch/pull/131681 on behalf of https://github.com/clee2000 due to this is the wrong file ([comment](https://github.com/pytorch/pytorch/pull/131681#issuecomment-2253318038))
2024-07-26 19:08:59 +00:00
0f9bf208ec Revert "[BE][tests] show local variables on failure in tests (#131151)"
This reverts commit 054d214c504b415b155ef2da1a70764a115e1276.

Reverted https://github.com/pytorch/pytorch/pull/131151 on behalf of https://github.com/jbschlosser due to pollutes test failure output for OpInfo tests ([comment](https://github.com/pytorch/pytorch/pull/131151#issuecomment-2253310448))
2024-07-26 19:03:10 +00:00
a3cdbd8189 [FlopCounterMode] Fix register_flop_formula (#131777)
Previously, FlopCounterMode would ignore any custom ops registered
through `register_flop_formula`. The problem was:
- register_flop_formula(target) requires target to be an OpOverloadPacket.
- register_flop_formula used register_decomposition to populate its registry
- register_decomposition decomposes the OpOverloadPacket into OpOverload before
  putting it into the registry
- FlopCounterMode ignores OpOverloads in its registry (it assumes the
  registry is a dictionary mapping OpOverloadPacket to flop formula).

register_decomposition is too heavy of a hammer, plus this isn't a
decomposition, so I changed the registration mechanism.

Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131777
Approved by: https://github.com/Chillee
2024-07-26 18:44:50 +00:00
cd53698df0 Add hpu backend support for dynamo torchVariable _in_graph_classes() function (#129948)
Fixes #ISSUE_NUMBER

A recent change from PR#
f657b2b1f8 (diff-4a52059570bb96333d8383ce6a9d01bbb114c5e34aff6028f820899ca39b5a26R80) hard-coded the flow to a CUDA stream in the in-graph function. For a non-CUDA backend (HPU in our case), this breaks the graph.

As part of this PR, we add HPU backend support to the dynamo variables function `_in_graph_classes()`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129948
Approved by: https://github.com/yanboliang
2024-07-26 18:38:03 +00:00
5f2c80d16d Add inductor OrderedSet (#130003)
Implemented by extending `collections.abc.MutableSet` and backing it with a dictionary, which is ordered. From collections.abc.MutableSet:

```
    A mutable set is a finite, iterable container.

    This class provides concrete generic implementations of all
    methods except for __contains__, __iter__, __len__,
    add(), and discard().
```

In addition to implementing those methods, I also had to define some methods of Python's `set` that are not implemented in MutableSet.

I reused the tests from Python's standard library. A few tests didn't pass because of edge-case behavior that is not necessary to reimplement:
- support for self-referencing repr
- erroring when a member's `__eq__` function would modify the set itself
- MutableSet supports iterables as inputs, but not sequences (pretty rare..)
- some specifics of which exact equivalent error types are thrown
- [The protocol for automatic conversion to immutable](https://docs.python.org/2/library/sets.html#protocol-for-automatic-conversion-to-immutable)
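
For illustration, a minimal sketch of the general construction described above (a dict-backed `MutableSet`); this is not inductor's actual class, just the pattern:

```
from collections.abc import MutableSet
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")

class OrderedSet(MutableSet):
    def __init__(self, iterable: Iterable[T] = ()) -> None:
        # dict preserves insertion order; keys are the members, values unused
        self._dict = dict.fromkeys(iterable)

    def __contains__(self, item: T) -> bool:
        return item in self._dict

    def __iter__(self) -> Iterator[T]:
        return iter(self._dict)

    def __len__(self) -> int:
        return len(self._dict)

    def add(self, item: T) -> None:
        self._dict[item] = None

    def discard(self, item: T) -> None:
        self._dict.pop(item, None)

s = OrderedSet(["c", "a", "b", "a"])
s.add("d")
assert list(s) == ["c", "a", "b", "d"]  # insertion order kept, duplicates dropped
```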

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130003
Approved by: https://github.com/aorenste
2024-07-26 18:16:57 +00:00
1dd10ac802 [BE] [Reland] Make nn.Module state_dict load_state_dict pre-hook and state_dict post-hook public (#131690)
Reland https://github.com/pytorch/pytorch/pull/126704

#### Fixes the issue with type of `nn.Module._state_dict_hooks` being changed in that PR which was problematic:
Instead of using `Tuple(Callable, bool)` to keep track of whether the private `_register_state_dict_hook` or the public `register_state_dict_post_hook` API was used to register the hook and toggle the behavior accordingly, I set an attribute on the Callable in the private API, which is never cleaned up.

If a callable previously registered using the private API is registered via the public API, a RuntimeError will be raised

#### Copied from previous PR description
Fixes https://github.com/pytorch/pytorch/issues/75287 and https://github.com/pytorch/pytorch/issues/117437

- `nn.Module._register_state_dict_hook` --> add public `nn.Module.register_state_dict_post_hook`
   - Add a test as this API was previously untested
- `nn.Module._register_load_state_dict_pre_hook` --> add public `nn.Module.register_load_state_dict_pre_hook` (remove the `with_module` flag, default it to `True`); see the usage sketch after this list
    ~- For consistency with optimizer `load_state_dict_pre_hook` raised by @janeyx99, allow the pre-hook to return a new `state_dict`~
 - For the issue raised in https://github.com/pytorch/pytorch/issues/117437 regarding the `_register_state_dict_hook` semantic of a returned new state_dict only being respected for the root module (private hook):
       - Document this for private `_register_state_dict_hook`
       - Remove this for the public `register_state_dict_post_hook`
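
For illustration, a hedged usage sketch of the two public APIs; the exact hook signatures shown here are assumptions based on this description, not authoritative documentation:

```
import torch
import torch.nn as nn

m = nn.Linear(4, 4)

# Assumed post-hook signature: (module, state_dict, prefix, local_metadata).
def post_hook(module, state_dict, prefix, local_metadata):
    state_dict[prefix + "extra_note"] = torch.tensor(1.0)  # mutate in place; no return value

# Assumed pre-hook signature mirroring the private hook, with the module passed first.
def load_pre_hook(module, state_dict, prefix, local_metadata, strict,
                  missing_keys, unexpected_keys, error_msgs):
    state_dict.pop(prefix + "extra_note", None)  # drop the extra key before loading

m.register_state_dict_post_hook(post_hook)
m.register_load_state_dict_pre_hook(load_pre_hook)

sd = m.state_dict()   # contains 'extra_note' added by the post-hook
m.load_state_dict(sd) # pre-hook removes it, so strict loading still succeeds
```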

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131690
Approved by: https://github.com/albanD
2024-07-26 18:14:07 +00:00
8158cf2f59 [c10d] Fix split_group usage when there is a single rank (#131824)
Summary:
This is a request from xlformer team to allow single rank PG/comms
Test Plan:
UT

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131824
Approved by: https://github.com/pavanbalaji, https://github.com/fduwjj
2024-07-26 18:11:17 +00:00
e191b83462 Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633)"
This reverts commit 709ddf7a9dcfa1268848b72f6f56b55afa6728d6.

Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to still failing internally D60265673 ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2253239607))
2024-07-26 18:08:20 +00:00
e4db5dc1c4 Revert "[BE] remove unnecessary _dispatch_sqrt by using ** 0.5 (#131358)"
This reverts commit 4c7f22dee25649cd895bc382192d29f39e482215.

Reverted https://github.com/pytorch/pytorch/pull/131358 on behalf of https://github.com/janeyx99 due to Internal uses this private API and landing that has been a pain so we're reverting this first ([comment](https://github.com/pytorch/pytorch/pull/131358#issuecomment-2253190654))
2024-07-26 17:35:27 +00:00
2576dbbc35 [dynamo] implement IteratorVariable and polyfill fallbacks for enumerate (#131725)
Fixes https://github.com/pytorch/pytorch/issues/112794.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131725
Approved by: https://github.com/anijain2305
ghstack dependencies: #131413, #131716
2024-07-26 17:17:09 +00:00
35b4de32fa [dynamo] add itertools repeat/count bytecode reconstruction (#131716)
Also fix bugs in the count iterator variable implementation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131716
Approved by: https://github.com/anijain2305
ghstack dependencies: #131413
2024-07-26 17:17:09 +00:00
40cc5c0697 [AOT Autograd] Donated Buffer (#130580)
Implements the donated buffer feature and adds unit tests. A donated buffer is a saved tensor that is not aliased with forward inputs, forward outputs (except saved tensors), or backward outputs. We detect donated buffers during `aot_dispatch_autograd` and store them in `ViewAndMutationMetadata`, so that they can be accessed in inductor.

Fixes #129496

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130580
Approved by: https://github.com/bdhirsh
2024-07-26 17:14:34 +00:00
9589d986fa [UT] Relax atol for test_non_contiguous_input_* (3 tests) (#131822)
BE task T195600898 (internal).

The 3 tests
```
test_non_contiguous_input_mm
test_non_contiguous_input_bmm
test_non_contiguous_input_addmm
```
had the following error in TestX:
```
self.assertTrue(torch.allclose(ref, act, atol=1e-2, rtol=1e-2))
AssertionError: False is not true
```

The tolerance comparing eager and compiled results is too small, perhaps because of a Triton update that changed numerics:
```
Mismatched elements: 25 / 38597376 (0.0%)
Greatest absolute difference: 0.015625 at index (3771, 509) (up to 0.01 allowed)
Greatest relative difference: 9.375 at index (13687, 48) (up to 0.01 allowed)
```

Change the absolute tolerance from 0.01 to 0.02. Also switch to use `torch.testing.assert_close` which prints out the greatest absolute/relative difference like above when the assert fails.
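
For reference, a small standalone sketch of the difference (not the actual test code):

```
import torch

ref = torch.randn(64, 64)
act = ref + 1e-3 * torch.randn(64, 64)

# Before: a failure only reports "False is not true".
assert torch.allclose(ref, act, atol=2e-2, rtol=1e-2)

# After: a failure reports mismatch counts and the greatest absolute/relative differences.
torch.testing.assert_close(act, ref, atol=2e-2, rtol=1e-2)
```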

`test_non_contiguous_input_mm_plus_mm` has a different problem, just switching to `torch.testing.assert_close` to be uniform with the other tests.

Test commands:
```
python test/inductor/test_max_autotune.py -k TestMaxAutotune.test_non_contiguous_input_mm

python test/inductor/test_max_autotune.py -k TestMaxAutotune.test_non_contiguous_input_addmm

python test/inductor/test_max_autotune.py -k TestMaxAutotune.test_non_contiguous_input_bmm
```
Internal stress tests pass now.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131822
Approved by: https://github.com/shunting314
2024-07-26 17:11:35 +00:00
161bb67116 Revert "Fix static py::object dangling pointer with py::gil_safe_call_once_and_store (#130341)"
This reverts commit ace6decc9948e434dfe2e253bc28341bb22aa983.

Reverted https://github.com/pytorch/pytorch/pull/130341 on behalf of https://github.com/clee2000 due to unfortunately the internal pybind update got reverted cc @malfet ([comment](https://github.com/pytorch/pytorch/pull/130341#issuecomment-2253147079))
2024-07-26 17:02:56 +00:00
c382fc3fea [Reland] Fix vulkan builds with missing overrides errors (#131760)
Followup after https://github.com/pytorch/pytorch/pull/131524

Add note explaining why C10 macros should not be used in that header
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131760
Approved by: https://github.com/atalman
2024-07-26 17:01:51 +00:00
1a2edf6dca [AOTI] Fix _mm_plus_mm codegen (#131689)
Summary: Fixes https://github.com/pytorch/pytorch/issues/128474

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131689
Approved by: https://github.com/chenyang78
2024-07-26 16:50:12 +00:00
696e83a1da Revert "TCPStore: fix remote address (#131773)"
This reverts commit 9039131a89a5fdb8746bd86b0a4dd91559821e36.

Reverted https://github.com/pytorch/pytorch/pull/131773 on behalf of https://github.com/clee2000 due to broke internal builds D60265883, something about formatter ([comment](https://github.com/pytorch/pytorch/pull/131773#issuecomment-2253123800))
2024-07-26 16:47:57 +00:00
404a8ae8f6 [export] fix set_grad x tensor constant. (#131787)
Summary:
Fixes https://github.com/pytorch/pytorch/issues/130379.

The original error is that the verifier finds the placeholder nodes' meta["val"] missing in the subgraph of the WrapSetGradEnabled HOP.

In this PR, we fix it by re-ordering replace_set_grad_with_hop_pass to run after the lift_constant_tensor pass, because only after lift_constant_tensor do all the constant attrs have meta["val"].

Test Plan: buck2 test test:test_export -- -r "test_setgrad_lifted_tensor"

Differential Revision: D60244935

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131787
Approved by: https://github.com/yushangdi
2024-07-26 16:41:59 +00:00
bb64702eb3 Revert "[reland][inductor] switch AotCodeCompiler to new cpp_builder (#130127)"
This reverts commit 520182dbffe09943be74a8a9cd58618fc171738f.

Reverted https://github.com/pytorch/pytorch/pull/130127 on behalf of https://github.com/clee2000 due to broke internal tests D60265910 ([comment](https://github.com/pytorch/pytorch/pull/130127#issuecomment-2253113689))
2024-07-26 16:40:03 +00:00
d57de73fe0 AutoHeuristic: Add support for kernel choice selection (#131610)
This PR enables AutoHeuristic for kernel choice selection, where the feedback cannot be provided immediately when AutoHeuristic is called, but only after autotuning has happened. The steps are the following:

1. When the AutoHeuristic constructor is called, AutoHeuristic registers a function in select_algorithm.py.
2. After autotuning in select_algorithm.py has happened, and there is an entry in autoheuristic_registry, select_algorithm provides the autotuning results to AutoHeuristic, which stores the results.
I enabled AutoHeuristic for mixed_mm to have an example to test it on. We probably want to add more context, and also add an augment_context function. I will add support for this in another PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131610
Approved by: https://github.com/eellison
2024-07-26 16:35:55 +00:00
a38890a53f Revert "[2/3] 3D Composability - move pp tests (#129801)"
This reverts commit 29571c5c06f6e5fd143d85c18d8a6b87d2e4e1d3.

Reverted https://github.com/pytorch/pytorch/pull/129801 on behalf of https://github.com/atalman due to Broke periodic CI: distributed/_composable/test_composability/test_pp_composability.py::ComposabilityTest::test_manual_with_data_parallel_dp_type_DDP_ScheduleClass4 [GH job link](https://github.com/pytorch/pytorch/actions/runs/10083807511/job/27882848654) [HUD commit link](544f950d14) ([comment](https://github.com/pytorch/pytorch/pull/129801#issuecomment-2253099894))
2024-07-26 16:30:29 +00:00
13ab92b72d [dynamo][recompile-logs] Suggest force_parameter_static_shapes on the recompile log for parameter-related recomps (#131825)
Discovered in https://github.com/pytorch/pytorch/issues/121369

On the user-empathy-day model, the logs look like these
~~~
W0725 15:33:58.022000 1967777 torch/_dynamo/convert_frame.py:807] [0/8] torch._dynamo hit config.cache_size_limit (8)
W0725 15:33:58.022000 1967777 torch/_dynamo/convert_frame.py:807] [0/8]    function: 'auto_repeat_tensors_for_time' (/home/anijain/local/lumiere-pytorch/lumiere_pytorch/lumiere.py:545)
W0725 15:33:58.022000 1967777 torch/_dynamo/convert_frame.py:807] [0/8]    last reason: 0/0: len(L['args']) == 1
W0725 15:33:58.022000 1967777 torch/_dynamo/convert_frame.py:807] [0/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0725 15:33:58.022000 1967777 torch/_dynamo/convert_frame.py:807] [0/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
W0725 15:34:00.282000 1967777 torch/_dynamo/convert_frame.py:807] [11/8] torch._dynamo hit config.cache_size_limit (8)
W0725 15:34:00.282000 1967777 torch/_dynamo/convert_frame.py:807] [11/8]    function: 'forward' (/home/anijain/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/denoising_diffusion_pytorch/karras_unet.py:150)
W0725 15:34:00.282000 1967777 torch/_dynamo/convert_frame.py:807] [11/8]    last reason: 11/0: tensor 'L['x']' size mismatch at index 0. expected 16, actual 8
W0725 15:34:00.282000 1967777 torch/_dynamo/convert_frame.py:807] [11/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0725 15:34:00.282000 1967777 torch/_dynamo/convert_frame.py:807] [11/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
W0725 15:34:10.216000 1967777 torch/_dynamo/convert_frame.py:807] [40/8] torch._dynamo hit config.cache_size_limit (8)
W0725 15:34:10.216000 1967777 torch/_dynamo/convert_frame.py:807] [40/8]    function: 'normalize_weight' (/home/anijain/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/denoising_diffusion_pytorch/karras_unet.py:127)
W0725 15:34:10.216000 1967777 torch/_dynamo/convert_frame.py:807] [40/8]    last reason: 40/1: tensor 'L['weight']' size mismatch at index 0. expected 64, actual 16. Guard failed on a parameter, consider using torch._dynamo.config.force_parameter_static_shapes = False to allow dynamism on parameters.
W0725 15:34:10.216000 1967777 torch/_dynamo/convert_frame.py:807] [40/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0725 15:34:10.216000 1967777 torch/_dynamo/convert_frame.py:807] [40/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
W0725 15:34:11.643000 1967777 torch/_dynamo/convert_frame.py:807] [58/8] torch._dynamo hit config.cache_size_limit (8)
W0725 15:34:11.643000 1967777 torch/_dynamo/convert_frame.py:807] [58/8]    function: 'pack_one' (/home/anijain/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/denoising_diffusion_pytorch/karras_unet.py:38)
W0725 15:34:11.643000 1967777 torch/_dynamo/convert_frame.py:807] [58/8]    last reason: 58/1: tensor 'L['t']' stride mismatch at index 0. expected 32, actual 8. Guard failed on a parameter, consider using torch._dynamo.config.force_parameter_static_shapes = False to allow dynamism on parameters.
W0725 15:34:11.643000 1967777 torch/_dynamo/convert_frame.py:807] [58/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0725 15:34:11.643000 1967777 torch/_dynamo/convert_frame.py:807] [58/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
W0725 15:34:12.029000 1967777 torch/_dynamo/convert_frame.py:807] [62/8] torch._dynamo hit config.cache_size_limit (8)
W0725 15:34:12.029000 1967777 torch/_dynamo/convert_frame.py:807] [62/8]    function: 'torch_dynamo_resume_in_pack_at_70' (/home/anijain/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/einops-0.8.0-py3.10.egg/einops/packing.py:70)
W0725 15:34:12.029000 1967777 torch/_dynamo/convert_frame.py:807] [62/8]    last reason: 62/0: tensor 'L['tensors'][0]' size mismatch at index 0. expected 16, actual 32. Guard failed on a parameter, consider using torch._dynamo.config.force_parameter_static_shapes = False to allow dynamism on parameters.
W0725 15:34:12.029000 1967777 torch/_dynamo/convert_frame.py:807] [62/8] To log all recompilation reasons, use TORCH_LOGS="recompiles".
W0725 15:34:12.029000 1967777 torch/_dynamo/convert_frame.py:807] [62/8] To diagnose recompilation issues, see https://pytorch.org/docs/main/torch.compiler_troubleshooting.html.
W0725 15:34:12.357000 1967777 torch/_dynamo/convert_frame.py:807] [65/8] torch._dynamo hit config.cache_size_limit (8)
W0725 15:34:12.357000 1967777 torch/_dynamo/convert_frame.py:807] [65/8]    function: 'reshape' (/home/anijain/.conda/envs/pytorch-3.10/lib/python3.10/site-packages/einops-0.8.0-py3.10.egg/einops/_backends.py:91)
W0725 15:34:12.357000 1967777 torch/_dynamo/convert_frame.py:807] [65/8]    last reason: 65/0: tensor 'L['x']' size mismatch at index 0. expected 32, actual 8. Guard failed on a parameter, consider using torch._dynamo.config.force_parameter_static_shapes = False to allow dynamism on parameters.
~~~~
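
For reference, a hedged sketch of acting on the suggestion in these logs (the module below is a stand-in, not the user-empathy-day model):

```
import torch
import torch._dynamo

# Config flag named in the recompile messages above; trades fewer recompiles
# for dynamic-shape guards on parameters.
torch._dynamo.config.force_parameter_static_shapes = False

model = torch.nn.Linear(4, 4)   # placeholder module for illustration
compiled = torch.compile(model)
compiled(torch.randn(2, 4))
```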

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131825
Approved by: https://github.com/ezyang
ghstack dependencies: #131795, #131801, #131804
2024-07-26 16:25:21 +00:00
7feaa73057 [export] Remove deprecated fields from ExportedProgram ctor. (#131697)
Summary: as title.

Test Plan: CI

Reviewed By: SherlockNoMad

Differential Revision: D60078426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131697
Approved by: https://github.com/ydwu4
2024-07-26 16:19:46 +00:00
546df5daf8 Revert "[3/3] 3D Composability - move tp dp tests (#129802)"
This reverts commit ec3829795dfb58a58ebc9ca241f7949efd60bfda.

Reverted https://github.com/pytorch/pytorch/pull/129802 on behalf of https://github.com/atalman due to Need to revert https://github.com/pytorch/pytorch/pull/129801 that got remerged ([comment](https://github.com/pytorch/pytorch/pull/129802#issuecomment-2253082995))
2024-07-26 16:19:25 +00:00
cyy
2988d33c80 [3/N] Fix clang-tidy warnings in jit (#131830)
Follows #131735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131830
Approved by: https://github.com/ezyang
2024-07-26 15:46:28 +00:00
5612408735 _get_operation_overload: dont raise exception when overload does not exist (#131554)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131554
Approved by: https://github.com/ezyang, https://github.com/zou3519
ghstack dependencies: #131403, #131482, #131665
2024-07-26 15:38:11 +00:00
eba2ffd278 [pt2e][quant] Ensure BN node is erased after convert (#131651)
Summary: Previously, when folding BN into conv, we rely on DCE
to clean up the unused BN node from the graph. This works if
the model is already in eval mode, but fails if the model is
still in train mode because DCE doesn't remove nodes with
potential side effects (in this case `_native_batch_norm_legit`).
This required users to move the model to eval mode before calling
convert in order to get a properly DCE'd graph.

To solve this, we manually erase the BN node after folding
instead of relying on DCE. This relaxes the ordering constraints
between `move_exported_model_to_eval` and `convert_pt2e`.

Test Plan:
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn1d.test_fold_bn_erases_bn_node
python test/test_quantization.py TestQuantizePT2EQAT_ConvBn2d.test_fold_bn_erases_bn_node

Reviewers: jerryzh168, yushangdi

Subscribers: jerryzh168, yushangdi, supriyar
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131651
Approved by: https://github.com/yushangdi
2024-07-26 15:30:45 +00:00
9440a4824d [CI][dashboard] Add a workflow to collect A10g perf (#131816)
Summary: This is experimental work. Depending on the performance stability and benchmark coverage on A10g, we may consider using A10g for manually-triggered per-PR performance comparisons instead of exhausting expensive A100 instances.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131816
Approved by: https://github.com/huydhn
2024-07-26 14:36:14 +00:00
535c17efb3 [torch] Implement c10::BFloat16 ctor from __hip_bfloat16 (#131359)
Summary: Pretty straightforward. ROCm 6.2.0 changed the `__hip_bfloat16` API (see [this PR](481912a1fd)), so we gate the implementation on the `__BF16_HOST_DEVICE__` macro to support both older and newer versions of ROCm.

Test Plan: CI

Differential Revision: D60024830

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131359
Approved by: https://github.com/houseroad
2024-07-26 14:30:49 +00:00
e4ace1a396 AOTDispatcher: properly bump version counter on input mutations in inference graphs (#131665)
This ensures that in an inference setting, we properly bump the VC of mutated graph inputs. Previously, we would only properly bump the VC for training graphs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131665
Approved by: https://github.com/ezyang, https://github.com/zou3519
ghstack dependencies: #131403, #131482
2024-07-26 14:22:20 +00:00
5570a0da0a dont dispatch aten.conj(scalar_tensor) back to python (#131482)
https://github.com/pytorch/pytorch/issues/105290

The problem in the original flow is that:

(1) the user calls `torch.mul(complex_tensor, complex_scalar)`
(2) python arg parser wraps the complex scalar in a `scalar_tensor`, and dispatches to `aten.mul.Tensor(self, scalar_other)`
(3) autograd sees `aten.mul.Tensor`, calls `scalar_other.conj()` [here](https://github.com/pytorch/pytorch/blob/main/torch/csrc/autograd/FunctionsManual.cpp#L597)
(4) during proxy tensor tracing, this gets dispatched to `aten._conj(scalar_tensor)`
(5) when we hit __torch_dispatch__, the scalar_tensor is converted back into a plain python scalar
(6) we error during tracing, because in `FunctionalTensorMode.__torch_dispatch__` we try to redispatch on `aten._conj.default(plain_python_scalar)`, and this overload does not accept python scalars.

My attempted fix in this PR is to update `TensorBase::conj()` to check if the current tensor is a scalar tensor (wrapped number), and if so, manually:
(1) convert the scalar tensor back into a scalar
(2) call scalar.conj() directly
(3) convert the result back into a wrapped tensor

This avoids having to go through python entirely in the tracing case (which is fine, because these scalar tensors are constants that we can const-prop during tracing anyway).
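
For context, a hedged sketch of the kind of user program that exercises this path (not the exact repro from the issue):

```
import torch

def fn(x):
    # complex python scalar; the arg parser wraps it into a scalar_tensor
    return torch.mul(x, 2 + 3j)

x = torch.randn(4, dtype=torch.complex64, requires_grad=True)
# Tracing through AOTAutograd previously hit aten._conj on the wrapped scalar.
out = torch.compile(fn, backend="aot_eager")(x)
```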

Notably, I did **not** add e.g. a new `aten._conj.Scalar` overload. This would not actually fix the problem, since the bug is that we call `aten._conj.default(python_scalar)` directly. We would also need to muck with all `__torch_dispatch__` call sites to know to convert python scalars back into tensors directly.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131482
Approved by: https://github.com/zou3519, https://github.com/ezyang
ghstack dependencies: #131403
2024-07-26 14:22:20 +00:00
8bb9aa93a7 dynamo: mutations on .data should be invisible to autograd (#131403)
Fixes https://github.com/pytorch/pytorch/issues/121353

our handle for `.data` in dynamo today basically just converts `y = x.data` into `y = x.detach()`. The semantics of these two ops are not quite the same, because:

(1) any future mutations on `x.data` will be fully ignored by autograd
(2) any mutations on `x.detach()` will bump x's version counter

the linked model does a .data mutation that is hidden from autograd in eager, but ends up erroring during AOTDispatcher tracing.
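
For reference, a hedged eager-mode sketch of the two behaviors above (the private `_version` counter is used purely for illustration):

```
import torch

x = torch.ones(3, requires_grad=True)
y = x.detach()
y.add_(1)          # shares x's version counter, so autograd can notice the mutation
print(x._version)  # 1

z = torch.ones(3, requires_grad=True)
z.data.add_(1)     # .data gets a fresh version counter; the mutation is hidden from autograd
print(z._version)  # 0
```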

I updated dynamo's handling so that:

(1) when dynamo sees a call to `getattr(tensor, "data")` and calls `.detach()` we set a flag on the returned `TensorVariable` indicating it came from `.data`

(2) on any tensor method that we call with an input `TensorVariable` with this flag turned on, we proxy autograd's `preserve_version_counter` logic into the graph, to properly reset the VC after the op is run.

One thing to note is that I don't actually do this on every op that we pass the tensor to: I only do it for tensor methods that appear to be mutations (by checking for a trailing underscore). My thought was that:

(1) I didn't want to do this for **every** op that you pass `y` into, since that will e.g. triple the number of nodes in the graph, and could cause compile time regressions if you use .data

(2) this situation is pretty rare in general, and I'm hoping that "tensor method mutations" cover most reasonable mutation cases. If we manage to miss a case, you will get a loud error during tracing anyway, so there is not a safety issue.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131403
Approved by: https://github.com/anijain2305, https://github.com/zou3519
2024-07-26 14:22:20 +00:00
7339c8ab28 Revert "immutable accessors in graph signature (#131807)"
This reverts commit 6fd28fc228f900863d63b1c83912dcc000b084e3.

Reverted https://github.com/pytorch/pytorch/pull/131807 on behalf of https://github.com/atalman due to Broke CI: [GH job link](https://github.com/pytorch/pytorch/actions/runs/10111847569/job/27965364355) [HUD commit link](608057afe2) ([comment](https://github.com/pytorch/pytorch/pull/131807#issuecomment-2252875417))
2024-07-26 14:21:12 +00:00
e76e566cfb [Dynamo] Support zip_longest (#131497)
Fixes #121348

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131497
Approved by: https://github.com/mlazos, https://github.com/jansel, https://github.com/zou3519
2024-07-26 14:06:10 +00:00
c9888c2739 Revert "[BE] typing for decorators - optim/optimizer (#131583)"
This reverts commit a1dad77dfa4e244a867ca7c73e9f6b6fe36a1340.

Reverted https://github.com/pytorch/pytorch/pull/131583 on behalf of https://github.com/atalman due to Breaks CI: [GH job link](https://github.com/pytorch/pytorch/actions/runs/10105959146/job/27947741162) [HUD commit link](a1dad77dfa) ([comment](https://github.com/pytorch/pytorch/pull/131583#issuecomment-2252784280))
2024-07-26 13:41:22 +00:00
7ee6831ae8 Revert "Fix vulkan builds with missing overrides errors (#131760)"
This reverts commit 7260eaeca056ffa013de769c10a2bfce9505d937.

Reverted https://github.com/pytorch/pytorch/pull/131760 on behalf of https://github.com/malfet due to Does not work with internal builds ([comment](https://github.com/pytorch/pytorch/pull/131760#issuecomment-2252783645))
2024-07-26 13:38:28 +00:00
d3e932dc10 [CI] Add inductor cpu accuracy test running on AVX2 runners (#128682)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128682
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-07-26 13:24:41 +00:00
e73fa28ec8 [CI] Fix arm64 docker build arch (#131869)
Attempt to fix arm64 docker build arch on https://github.com/pytorch/pytorch/pull/131855
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131869
Approved by: https://github.com/desertfire
2024-07-26 13:19:36 +00:00
608057afe2 [inductor] Fix duplicated range tree codegen in split scan (#131669)
Looks like in the halide codegen refactor, the range tree codegen was
split out from initialize_range_tree into its own function, but
triton_split_scan.py wasn't updated to reflect this change.

The result was the codegen gets invoked twice which is benign but makes
the kernel harder to read.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131669
Approved by: https://github.com/Chillee
2024-07-26 13:11:26 +00:00
945946e817 [AOTI] Fix another ABI-compatible CPU issue (#131798)
Summary: This problem is seen on AOTI CPU dashboard runs, a cpp compilation error because ConstantHandle::get doesn't exist. This PR adds ConstantHandle::get so that the interface is consistent with RAIIAtenTensorHandle.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131798
Approved by: https://github.com/zou3519, https://github.com/chenyang78
ghstack dependencies: #131791
2024-07-26 11:27:58 +00:00
7d282d8755 [dynamo] add lazy IteratorVariable implementations for map and zip (#131413)
Fixes https://github.com/pytorch/pytorch/issues/130750.

Repro of lazy/eager `map` discrepancy without `islice`:
```python
    def fn(a, b):
        y = 1

        def f(x):
            nonlocal y
            y += 1
            return x

        l = list(zip([a, b], map(f, [1, 2, 3, 4])))
        return a + y
```

The major change is that we implement `MapVariable` and `ZipVariable` based on `IteratorVariable`. Before, `map` and `zip` were being traced by immediately unpacking the result as a `TupleVariable`, which is wrong in cases such as the example above.

`MapVariable`s are not allowed to be unpacked while `ZipVariable`s can only be unpacked if all of its iterables can also be unpacked.

We also add new `[has_]force_unpack_var_sequence` methods to `VariableTracker` for the case where it is safe to unpack the entire sequence lazily, e.g., when building a list from a map (i.e. `list(map(f, ...))`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131413
Approved by: https://github.com/anijain2305
2024-07-26 10:47:38 +00:00
115994fea2 [aotd] Align partitioner graph output type to tuple (#131759)
Brian debugged the difference in output type between the inference and train graphs.
The partitioner sometimes returns a list output type.

After this PR it will always return a tuple.

Potentially, some new graphs inside tests will land between the time this PR's CI jobs finish and the time it lands.
This can easily be fixed with a fast-forward fix:
```
EXPECTTEST_ACCEPT=1 python test/test.py
```

Adding ciflows/periodic to minimize this probability

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131759
Approved by: https://github.com/ezyang, https://github.com/bdhirsh
2024-07-26 09:46:29 +00:00
1e24f7875e [AOTI] Fix ABI-compatible mode link issue for CPU (#131791)
Summary: Found this "cannot find -ltorch: No such file or directory" issue when collecting AOTI CPU perf for the dashboard. Debugging on the CI machine revealed two problems: 1) no valid VEC_ISA was picked; 2) when 1 happens, libtorch path is not specified in the linker path.

This PR fixes the second problem. A later PR will fix the first problem, but somehow finding the right VEC_ISA causes a performance regression, which needs more investigation.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131791
Approved by: https://github.com/zou3519, https://github.com/chenyang78
2024-07-26 09:02:13 +00:00
6fd28fc228 immutable accessors in graph signature (#131807)
Test Plan: existing tests

Differential Revision: D60253955

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131807
Approved by: https://github.com/ydwu4
2024-07-26 08:56:19 +00:00
bceb91222c Fix meta error in _convert_weight_to_int4pack (#130915)
This PR is to fix meta error in _convert_weight_to_int4pack.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130915
Approved by: https://github.com/jerryzh168
2024-07-26 08:36:30 +00:00
2bf649f5ae suggested fix for data-dependent error (#125378)
Suggests fixes for data-dependent errors in non-strict export.

Any data-dependent error has an unresolved condition on unbacked symints. A mechanizable strategy for fixing such errors, which this PR enables, is to "bash" them using `torch._check()`s. For each error we suggest using `torch._check()` on the condition or its negation. The user selects and copy-pastes the suggested fix and continues.

For example, here's an existing data-dependent error message with the suffix following `<snip>...</snip>` added by this PR:
```
Could not guard on data-dependent expression Eq(u2, u1) (unhinted: Eq(u2, u1)).  (Size-like symbols: u1)

<snip>...</snip>

User code:
  File "test/export/test_export.py", line 1944, in forward
    return r.view(items[0], items[2])

Suggested fixes (please choose one of the following):
  1. torch._check(items[2] == r.shape[1])
  2. torch._check(items[2] != r.shape[1])"
```

Tests in this PR illustrate this workflow, by taking common examples of data-dependent errors and bashing them until success, purely based on suggested fixes. In particular, we test this workflow on the "puzzlers" in https://www.internalfb.com/intern/anp/view/?id=5330476 (thanks @ezyang).
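
For illustration, a hedged toy sketch of this workflow in non-strict export; the shapes and the chosen check are invented for the example, not taken from the PR's tests:

```
import torch
from torch.export import export

class M(torch.nn.Module):
    def forward(self, r, n_tensor):
        n = n_tensor.item()            # unbacked symint
        # The kind of fix the new error message suggests copy-pasting:
        torch._check(n == r.shape[0])
        return r.view(n, r.shape[1])

ep = export(M(), (torch.randn(4, 3), torch.tensor(4)), strict=False)
print(ep)
```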

In terms of implementation, we focus on non-strict mode, where we can intercept torch function calls to install a handler that walks up the stack from the error, finding the closest non-torch frame and inspecting its locals for symints appearing in the error. The suggested fixes then access these symints through the local variables so that they can be (a) easily understood by the user (b) directly added to the code.

Implementing this idea in strict mode is follow-up work—we have already investigated what it would take, and decided to separate it out of this PR for reasons described next.

It's not too hard to map symints to locals in Dynamo (although it needs to happen elsewhere, i.e., intercepting torch function calls won't work). However, unfortunately this doesn't seem to be enough; the graph modules created by Dynamo when going through AOTAutograd can raise further data-dependent errors in some cases, and thus we need yet another mechanism to map symints to locals for graph modules, via captured source-level metadata and FX node walking. This latter component will require some care to build properly, or we might conclude it is altogether unnecessary and fix Dynamo instead.

Differential Revision: D56867432

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125378
Approved by: https://github.com/ezyang
2024-07-26 08:34:50 +00:00
fb3ddafbcf [inductor] Add type hints to functions in mkldnn_fusion.py (#131820)
Summary: ATT

Test Plan: lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131820
Approved by: https://github.com/eellison
2024-07-26 08:11:34 +00:00
13e806a591 [NestedTensor] Add support for transposed NestedTensors where ragged_idx > 1 for sum and mean operators (#131517)
Add support for transposed, non-contiguous `NestedTensor`s, where `ragged_idx > 1`, for the aten operators `sum` and `mean`. This diff enables reducing along the jagged dimension for non-contiguous `NestedTensor`s, transposed between non-batch dimensions as well as between a ragged and a non-batch dimension. For example, users can now reduce a `NestedTensor` of shape `(B, M, *, N)` along `*` or `(B, N, M, *)` along `*`.

Parametrize existing unit tests and add new unit tests verifying the accuracy of implementations on `NestedTensor`s that transpose between 2 non-batch dimensions as well as between a ragged and a non-batch dimension.

Differential Revision: [D59847927](https://our.internmc.facebook.com/intern/diff/D59847927/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131517
Approved by: https://github.com/davidberard98
2024-07-26 07:21:32 +00:00
63374dda69 [BE][Easy] explicitly define global constants in torch.testing._internal.common_utils (#129826)
This appeases IDE warnings like "torch.testing._internal.common_utils has no member TEST_WITH_ROCM".

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129826
Approved by: https://github.com/Skylion007
2024-07-26 06:32:08 +00:00
aebfd3d4de [CUDAGraph] skip cudagraph if too many distinct sizes (#131387)
Current implementation records a new cudagraph for every distinct input size. This leads to significant overhead if there are too many distinct input sizes.

While we currently hint about re-recording cudagraphs due to dynamic shapes, the hint is at [info level](https://github.com/pytorch/pytorch/blob/main/torch/_inductor/cudagraph_trees.py#L363-L366), which is easy to overlook and has led to several issues, such as Issue #119640 and Issue #128424.

This PR checks the number of cudagraphs recorded due to dynamic shapes and warns loudly if that count exceeds a threshold `cudagraph_dynamic_shape_limit` (=50).

Fixes #119640

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131387
Approved by: https://github.com/eellison
2024-07-26 06:17:35 +00:00
16d7cb5049 [CUDAGraph] Type annotation for cudagraph_trees.py (#131621)
As a Better Engineer effort, this PR adds type annotation to `cudagraph_trees.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131621
Approved by: https://github.com/eellison
2024-07-26 06:14:06 +00:00
dfba85c26b Update torch-xpu-ops pin (ATen XPU implementation) (#131643)
# Motivation
Regular update.
1. Some new ATen ops support
2. ABI=0 build support
3. Remove dispatched implementation of pin_memory&is_pinned
4. Enhance deterministic usage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131643
Approved by: https://github.com/EikanWang
2024-07-26 05:51:58 +00:00
baa93e160f [MPS] Add native implementation for shift ops (#131813)
Similar to how AND/OR/XOR ops are implemented

TODO: Consider using MPS method calls rather than metal kernels

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131813
Approved by: https://github.com/manuelcandales
2024-07-26 05:01:20 +00:00
a1dad77dfa [BE] typing for decorators - optim/optimizer (#131583)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131583
Approved by: https://github.com/janeyx99
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577, #131578, #131579, #131580, #131581, #131582
2024-07-26 05:00:07 +00:00
8689d377f9 [BE] typing for decorators - signal/windows/windows (#131582)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131582
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577, #131578, #131579, #131580, #131581
2024-07-26 05:00:07 +00:00
dbf7c318b2 [BE] typing for decorators - _refs/nn/functional (#131581)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131581
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577, #131578, #131579, #131580
2024-07-26 05:00:03 +00:00
81c26ba5ae [BE] typing for decorators - utils/flop_counter (#131580)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131580
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577, #131578, #131579
2024-07-26 04:59:58 +00:00
33069630ce [inductor] Add type hints to functions in decompositions.py (#131780)
Summary: ATT

Test Plan: lintrunner

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131780
Approved by: https://github.com/eellison
2024-07-26 04:50:23 +00:00
5b05ad9697 fix non-persistent buffers (#131756)
Summary:
Dynamo doesn't track whether buffers are `persistent`. This led to some ugly code where we would mark buffers as always persistent when creating signatures, then later check whether the buffers were not in the state dict to infer whether they were non-persistent, and use this to fix up the signature.

This PR instead defines a utility to look up all the non-persistent buffers registered inside a module (this information is recorded in a private `_non_persistent_buffers_set` module attribute), and uses it to (a) correctly set the persistent flag on buffers when creating signatures (b) transfer this information to a Dynamo-traced graph module, which then causes non-persistent buffers to (correctly) not show up in the state dict.
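
For reference, a small sketch of the behavior being propagated (standard `nn.Module` semantics, not the exported-program code itself):

```
import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.register_buffer("running", torch.zeros(3))                    # persistent
        self.register_buffer("scratch", torch.zeros(3), persistent=False)  # non-persistent

m = M()
print(m._non_persistent_buffers_set)   # {'scratch'}
print(list(m.state_dict().keys()))     # ['running'] -- non-persistent buffer excluded
```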

Test Plan: existing tests + new case with non-persistent buffer in nested module

Differential Revision: D60224656

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131756
Approved by: https://github.com/zhxchen17, https://github.com/ydwu4
2024-07-26 04:45:30 +00:00
a617919541 [dynamo] Do not guard on keys for _forward_hooks and _forward_pre_hooks (#131682)
Fixes https://github.com/pytorch/pytorch/issues/125836

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131682
Approved by: https://github.com/bdhirsh
2024-07-26 04:39:54 +00:00
3d7c424a75 [inductor] update users to buffers instead of scheduler nodes (#131796)
After a recent refactoring of inductor, `.users` are now associated with buffers instead of scheduler nodes.

In `debug.py`, one such usage of `.users` is not updated accordingly, and the change here fixes that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131796
Approved by: https://github.com/yf225
2024-07-26 03:34:26 +00:00
6dbf343936 Fix aten implementation for low memory max_pool2d (#131717)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131717
Approved by: https://github.com/peterbell10
2024-07-26 03:23:16 +00:00
c2f3266c8e Not remove collective ops in dce since they have side-effect (#131023)
Fixes #130918

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131023
Approved by: https://github.com/yf225
2024-07-26 03:03:32 +00:00
e0d3e4a498 remove unused code for XPU (#131856)
# Motivation
This PR removes unused XPU code in PyTorch, following https://github.com/pytorch/pytorch/pull/128179.
Without this PR, CI would be blocked.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131856
Approved by: https://github.com/EikanWang
2024-07-26 02:57:12 +00:00
236d055330 [Traceable FSDP2] Add partial-graph (graph-break) unit tests (#131747)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131747
Approved by: https://github.com/bdhirsh
2024-07-26 02:51:57 +00:00
03f49c9523 Revert "[CUDAGraph] Type annotation for cudagraph_trees.py (#131621)"
This reverts commit 16699c7d848fca669865d83ffff205bcbb8665be.

Reverted https://github.com/pytorch/pytorch/pull/131621 on behalf of https://github.com/atalman due to lint is failing, please rebase fix lint and reland ([comment](https://github.com/pytorch/pytorch/pull/131621#issuecomment-2251831163))
2024-07-26 02:08:45 +00:00
16699c7d84 [CUDAGraph] Type annotation for cudagraph_trees.py (#131621)
As a Better Engineer effort, this PR adds type annotation to `cudagraph_trees.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131621
Approved by: https://github.com/eellison
2024-07-26 01:40:23 +00:00
2ff98bc57f [inductor][autotune_at_compile_time] fix some codegen-ing for standalone autotuning file (#131726)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131726
Approved by: https://github.com/desertfire
ghstack dependencies: #131253
2024-07-26 00:58:04 +00:00
b343644f3a Revert "MTIA equivalent of torch.cuda.memory_stats (#131673)"
This reverts commit 513ce5f69a7f53742b7aa5798082dd158beec2ed.

Reverted https://github.com/pytorch/pytorch/pull/131673 on behalf of https://github.com/clee2000 due to linked internal diff has internal changes, not sure what happened here, but this shouldn't have been merged externally without also merging the internal diff ([comment](https://github.com/pytorch/pytorch/pull/131673#issuecomment-2251749644))
2024-07-26 00:54:37 +00:00
b893a57f96 [Dynamo] Fix guard_on_nn_modules unit tests discrepancy between OSS and fbcode (#131810)
Fixes Meta internal task: [T195592220](https://www.internalfb.com/intern/tasks/?t=195592220)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131810
Approved by: https://github.com/zou3519
2024-07-26 00:24:46 +00:00
246e32055a [benchmark] Add hf_T5_generate to inline_inbuilt_nn_modules (#131804)
Fixes https://github.com/pytorch/pytorch/issues/121989

We are turning on the flag by default in another PR. But that PR can go
through reverts. So, forcibly adding the benchmark to prevent dashboard
fluctuation in case of reverts.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131804
Approved by: https://github.com/yanboliang, https://github.com/shunting314
ghstack dependencies: #131795, #131801
2024-07-26 00:20:42 +00:00
c92f2a19a4 [BE] Use assertEqual in MultiKernel tests (#127725)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127725
Approved by: https://github.com/lezcano
ghstack dependencies: #131044, #127724
2024-07-26 00:12:43 +00:00
9ae288f4be [inductor] Simplify multi-kernel codegen by unifying kernel args (#127724)
Persistent kernels are sometimes able to remove intermediate buffers that would
otherwise be needed for the non-persistent reduction kernel. This makes
multi kernel's codegen more complicated as it needs to drop these extra
arguments at runtime after selecting the correct kernel to run.

Instead, this PR updates the persistent kernel's `must_keep_buffers` so these
aren't dropped during codegen so both kernels have the same signature.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127724
Approved by: https://github.com/shunting314
ghstack dependencies: #131044
2024-07-26 00:12:43 +00:00
14920c149b Revert "[dynamo] Turn on inline_inbuilt_nn_modules (#131275)"
This reverts commit 0455344777f354dcbbd8e661a46ca2ca20e8a913.

Reverted https://github.com/pytorch/pytorch/pull/131275 on behalf of https://github.com/clee2000 due to I think this broke inductor/test_cpu_select_algorithm.py::TestSelectAlgorithmDynamicShapesCPU::test_quantized_linear_amx_dynamic_shapes_batch_size_16_in_features_4_out_features_64_bias_True_cpu [GH job link](https://github.com/pytorch/pytorch/actions/runs/10102272826/job/27938970118) [HUD commit link](0455344777) not run on PR due to bad TD ([comment](https://github.com/pytorch/pytorch/pull/131275#issuecomment-2251609554))
2024-07-26 00:12:40 +00:00
adbe4f5ecf TCPStore: add better logging on wait timeout (#131808)
This makes TCPStore `wait` timeout print actually useful info instead of a generic `Socket Timeout` message on timeout.

Bonus:

* fix weirdness where `connect_timeout` only supported seconds unlike the rest of our timeouts (thus the minimum timeout was 1s)
* Fixed tests that used a 10s timeout (test_store now only takes 20s instead of 40s)

Ex:

```
DistStoreError: wait timeout after 100ms, keys: /the_key
```
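
For reference, a hedged sketch of exercising the new message with a standalone store; the host, port, and timeouts are illustrative:

```
from datetime import timedelta

import torch.distributed as dist

store = dist.TCPStore("127.0.0.1", 29500, world_size=1, is_master=True,
                      timeout=timedelta(seconds=5))
try:
    store.wait(["/the_key"], timedelta(milliseconds=100))
except Exception as e:   # DistStoreError / timeout error
    print(e)             # e.g. "wait timeout after 100ms, keys: /the_key"
```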

Test plan:

```
python test/distributed/test_store.py
python test/distributed/test_c10d_gloo.py -v -k timeout
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131808
Approved by: https://github.com/kurman
2024-07-25 23:54:41 +00:00
e9443860e7 add python binding for _get_current_graph_task_keep_graph (#131038)
Inductor would like a way to have activations that do not escape the backward graph marked as "donated", so we can re-use their memory during memory planning here: https://github.com/pytorch/pytorch/pull/130580

For this to be safe though, we need to know at runtime that autograd does not plan to retain the current autograd graph (either for another call to .backward() later, or if double backward is being used). In the linked PR, the current plan is to error when we detect this situation, and ask the user to turn off the donated buffer config (although if/once we get to the point of always delaying backward compilation to runtime, we can just wait until we know the runtime value to compile).

There isn't a way to know if the currently running backward is run with `retain_graph=True` from python - @soulitzer helped me figure out where to grab it so I added a python binding for it under `ctx.is_retain_graph()`

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131038
Approved by: https://github.com/soulitzer
2024-07-25 23:50:40 +00:00
cyy
eac83479cc Enable Wunused-function and Wunused-result globally (#131596)
Fixes #ISSUE_NUMBER

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131596
Approved by: https://github.com/zou3519
2024-07-25 23:50:12 +00:00
2a4ca5ccc4 [dynamo] Pop the exception stack on handling the StopIteration natively (#131801)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131801
Approved by: https://github.com/yanboliang
ghstack dependencies: #131795
2024-07-25 23:33:19 +00:00
11673851d9 [dynamo][exception][bugfix] Add a pop for < 3.11 version (#131795)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131795
Approved by: https://github.com/yanboliang
2024-07-25 23:33:19 +00:00
f885a70fab [inductor][autotune_at_compile_time] support Triton kernel with sympy fn str arg (#131253)
## What is sympy fn str arg?
It's a string such as `sqrt` that also happens to name a real sympy function (e.g. `sympy.sqrt`).

## Crash

```
torch/_inductor/sizevars.py", line 468, in symbolic_hint
    expr = self.simplify(expr)        # where expr is 'sqrt'
torch/_inductor/sizevars.py", line 66, in simplify
    return sympy.expand(expr).xreplace(self.replacements)
sympy/core/function.py", line 2816, in expand
    return sympify(e).expand(deep=deep, modulus=modulus, **hints)
AttributeError: 'function' object has no attribute 'expand'
```
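
For context, a hedged minimal reproduction of that failure mode outside inductor:

```
import sympy

print(sympy.sympify("sqrt"))   # resolves to the Python function sympy.sqrt, not an Expr
sympy.expand("sqrt")           # AttributeError: 'function' object has no attribute 'expand'
```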

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131253
Approved by: https://github.com/desertfire
2024-07-25 23:31:20 +00:00
b4b62d3945 update to 2.5.8 (#131684)
# Summary
This stack brings the current fork of FAv2 near the tip of main, which is 2.6.2.

Notably, we need to update cutlass to 3.5.0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131684
Approved by: https://github.com/jainapurva
2024-07-25 23:15:03 +00:00
51f4f87718 [Reland] Ensure staticmethods can be allowed in graph (#131789)
Fixes https://github.com/pytorch/pytorch/issues/124735

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131789
Approved by: https://github.com/anijain2305
2024-07-25 22:54:18 +00:00
4de85e3c30 [DeviceMesh] Remove _parent_mesh as an attribute from DeviceMesh and remove it from DeviceMesh's hash (#131636)
We recently revisited the hash implementation and think `_parent_mesh` information should not be burned into DeviceMesh but rather be inferred from the MeshEnv which manages device meshes.

Since `mesh_dim_names` is considered in the device mesh's hash, this should not affect the issue brought up in https://github.com/pytorch/pytorch/issues/121799.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131636
Approved by: https://github.com/wanchaol
2024-07-25 22:47:22 +00:00
79f0c4dc04 [BE] typing for decorators - fx/experimental/graph_gradual_typechecker (#131579)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131579
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577, #131578
2024-07-25 22:24:19 +00:00
c65b197b85 [BE] typing for decorators - _library/custom_ops (#131578)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131578
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576, #131577
2024-07-25 22:24:19 +00:00
5ee6a6dacc [BE] typing for decorators - ao/quantization/quantizer/xnnpack_quantizer_utils (#131577)
See #131429
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131577
Approved by: https://github.com/oulgen, https://github.com/zou3519
ghstack dependencies: #131568, #131569, #131570, #131571, #131572, #131573, #131574, #131575, #131576
2024-07-25 22:24:19 +00:00
3353 changed files with 148260 additions and 68808 deletions


@ -1,5 +1,5 @@
0.6b
0.7b
manylinux_2_17
rocm6.1
7f07e8a1cb1f99627eb6d77f5c0e9295c775f3c7
77c29fa3f3b614e187d7213d745e989a92708cee2bc6020419ab49019af399d1
rocm6.2
9be04068c3c0857a4cfd17d7e39e71d0423ebac2
3e9e1959d23b93d78a08fcc5f868125dc3854dece32fd9458be9ef4467982291


@ -92,7 +92,7 @@ _UCC_COMMIT=20eae37090a4ce1b32bcce6144ccad0b49943e0b
# from scratch
case "$image" in
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)
CUDA_VERSION=12.4.0
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
@ -120,7 +120,7 @@ case "$image" in
TRITON=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9-inductor-benchmarks)
CUDA_VERSION=12.4.0
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
@ -165,7 +165,7 @@ case "$image" in
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3.12-gcc9-inductor-benchmarks)
CUDA_VERSION=12.4.0
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.12
GCC_VERSION=9
@ -194,7 +194,7 @@ case "$image" in
TRITON=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)
CUDA_VERSION=12.4.0
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
@ -222,7 +222,7 @@ case "$image" in
TRITON=yes
;;
pytorch-linux-focal-cuda12.4-cudnn9-py3-gcc9)
CUDA_VERSION=12.4.0
CUDA_VERSION=12.4.1
CUDNN_VERSION=9
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
@ -236,7 +236,7 @@ case "$image" in
TRITON=yes
;;
pytorch-linux-focal-py3-clang10-onnx)
ANACONDA_PYTHON_VERSION=3.8
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=10
PROTOBUF=yes
DB=yes
@ -245,7 +245,7 @@ case "$image" in
ONNX=yes
;;
pytorch-linux-focal-py3-clang9-android-ndk-r21e)
ANACONDA_PYTHON_VERSION=3.8
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=9
LLVMDEV=yes
PROTOBUF=yes
@ -254,8 +254,8 @@ case "$image" in
GRADLE_VERSION=6.8.3
NINJA_VERSION=1.9.0
;;
pytorch-linux-focal-py3.8-clang10)
ANACONDA_PYTHON_VERSION=3.8
pytorch-linux-focal-py3.9-clang10)
ANACONDA_PYTHON_VERSION=3.9
CLANG_VERSION=10
PROTOBUF=yes
DB=yes
@ -276,8 +276,8 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-focal-py3.8-gcc9)
ANACONDA_PYTHON_VERSION=3.8
pytorch-linux-focal-py3.9-gcc9)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=9
PROTOBUF=yes
DB=yes
@ -286,7 +286,7 @@ case "$image" in
TRITON=yes
;;
pytorch-linux-focal-rocm-n-1-py3)
ANACONDA_PYTHON_VERSION=3.8
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
@ -297,7 +297,7 @@ case "$image" in
TRITON=yes
;;
pytorch-linux-focal-rocm-n-py3)
ANACONDA_PYTHON_VERSION=3.8
ANACONDA_PYTHON_VERSION=3.10
GCC_VERSION=9
PROTOBUF=yes
DB=yes
@ -308,7 +308,7 @@ case "$image" in
TRITON=yes
;;
pytorch-linux-jammy-xpu-2024.0-py3)
ANACONDA_PYTHON_VERSION=3.8
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
PROTOBUF=yes
DB=yes
@ -318,8 +318,8 @@ case "$image" in
CONDA_CMAKE=yes
TRITON=yes
;;
pytorch-linux-jammy-py3.8-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.8
pytorch-linux-jammy-py3.9-gcc11-inductor-benchmarks)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
PROTOBUF=yes
DB=yes
@ -330,8 +330,8 @@ case "$image" in
DOCS=yes
INDUCTOR_BENCHMARKS=yes
;;
pytorch-linux-jammy-cuda11.8-cudnn9-py3.8-clang12)
ANACONDA_PYTHON_VERSION=3.8
pytorch-linux-jammy-cuda11.8-cudnn9-py3.9-clang12)
ANACONDA_PYTHON_VERSION=3.9
CUDA_VERSION=11.8
CUDNN_VERSION=9
CLANG_VERSION=12
@ -355,8 +355,8 @@ case "$image" in
CONDA_CMAKE=yes
VISION=yes
;;
pytorch-linux-jammy-py3.8-gcc11)
ANACONDA_PYTHON_VERSION=3.8
pytorch-linux-jammy-py3.9-gcc11)
ANACONDA_PYTHON_VERSION=3.9
GCC_VERSION=11
PROTOBUF=yes
DB=yes


@ -108,10 +108,10 @@ ENV CMAKE_C_COMPILER cc
ENV CMAKE_CXX_COMPILER c++
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton-rocm.txt triton-rocm.txt
COPY ci_commit_pins/triton.txt triton.txt
COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton-rocm.txt triton_version.txt
RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt
# Install AOTriton (Early fail)
COPY ./aotriton_version.txt aotriton_version.txt


@ -1 +1 @@
48da61aa34b73ea8e2ee815a6a79eea817e361db
cd1c833b079adb324871dcbbe75b43d42ffc0ade


@ -1 +1 @@
340136fec6d3ebc73e7a19eba1663e9b0ba8ab2d
461c12871f336fe6f57b55d6a297f13ef209161b


@ -1 +1 @@
730b907b4d45a4713cbc425cbf224c46089fd514
ac3470188b914c5d7a5058a7e28b9eb685a62427


@ -1 +0,0 @@
21eae954efa5bf584da70324b640288c3ee7aede


@ -1 +1 @@
1b2f15840e0d70eec50d84c7a0575cb835524def
91b14bf5593cf58a8541f3e6b9125600a867d4ef


@ -1 +1 @@
dedb7bdf339a3546896d4820366ca562c586bfa0
5fe38ffd73c2ac6ed6323b554205186696631c6f


@ -1,5 +0,0 @@
0.6b
manylinux_2_17
rocm6.1
04b5df8c8123f90cba3ede7e971e6fbc6040d506
77c29fa3f3b614e187d7213d745e989a92708cee2bc6020419ab49019af399d1


@ -4,12 +4,12 @@ set -ex
source "$(dirname "${BASH_SOURCE[0]}")/common_utils.sh"
TARBALL='aotriton.tar.bz2'
TARBALL='aotriton.tar.gz'
# This read command alwasy returns with exit code 1
read -d "\n" VER MANYLINUX ROCMBASE PINNED_COMMIT SHA256 < aotriton_version.txt || true
ARCH=$(uname -m)
AOTRITON_INSTALL_PREFIX="$1"
AOTRITON_URL="https://github.com/ROCm/aotriton/releases/download/${VER}/aotriton-${VER}-${MANYLINUX}_${ARCH}-${ROCMBASE}-shared.tar.bz2"
AOTRITON_URL="https://github.com/ROCm/aotriton/releases/download/${VER}/aotriton-${VER}-${MANYLINUX}_${ARCH}-${ROCMBASE}-shared.tar.gz"
cd "${AOTRITON_INSTALL_PREFIX}"
# Must use -L to follow redirects


@ -5,32 +5,22 @@ set -ex
# Optionally install conda
if [ -n "$ANACONDA_PYTHON_VERSION" ]; then
BASE_URL="https://repo.anaconda.com/miniconda"
CONDA_FILE="Miniconda3-latest-Linux-x86_64.sh"
if [[ $(uname -m) == "aarch64" ]] || [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download"
CONDA_FILE="Miniforge3-Linux-$(uname -m).sh"
fi
MAJOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 1)
MINOR_PYTHON_VERSION=$(echo "$ANACONDA_PYTHON_VERSION" | cut -d . -f 2)
if [[ $(uname -m) == "aarch64" ]]; then
BASE_URL="https://github.com/conda-forge/miniforge/releases/latest/download"
case "$MAJOR_PYTHON_VERSION" in
3)
CONDA_FILE="Miniforge3-Linux-aarch64.sh"
;;
3);;
*)
echo "Unsupported ANACONDA_PYTHON_VERSION: $ANACONDA_PYTHON_VERSION"
exit 1
;;
esac
else
case "$MAJOR_PYTHON_VERSION" in
3)
CONDA_FILE="Miniconda3-latest-Linux-x86_64.sh"
;;
*)
echo "Unsupported ANACONDA_PYTHON_VERSION: $ANACONDA_PYTHON_VERSION"
exit 1
;;
esac
fi
mkdir -p /opt/conda
chown jenkins:jenkins /opt/conda
@ -78,19 +68,20 @@ fi
CONDA_COMMON_DEPS="astunparse pyyaml setuptools openblas==0.3.25=*openmp* ninja==1.11.1 scons==4.5.2"
if [ "$ANACONDA_PYTHON_VERSION" = "3.8" ]; then
conda_install numpy=1.24.4 ${CONDA_COMMON_DEPS}
NUMPY_VERSION=1.24.4
else
conda_install numpy=1.26.2 ${CONDA_COMMON_DEPS}
NUMPY_VERSION=1.26.2
fi
else
CONDA_COMMON_DEPS="astunparse pyyaml mkl=2021.4.0 mkl-include=2021.4.0 setuptools"
if [ "$ANACONDA_PYTHON_VERSION" = "3.11" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.12" ] || [ "$ANACONDA_PYTHON_VERSION" = "3.13" ]; then
conda_install numpy=1.26.0 ${CONDA_COMMON_DEPS}
NUMPY_VERSION=1.26.0
else
conda_install numpy=1.21.2 ${CONDA_COMMON_DEPS}
NUMPY_VERSION=1.21.2
fi
fi
conda_install ${CONDA_COMMON_DEPS}
# Install llvm-8 as it is required to compile llvmlite-0.30.0 from source
# and libpython-static for torch deploy
@ -112,7 +103,7 @@ fi
# Install some other packages, including those needed for Python test reporting
pip_install -r /opt/conda/requirements-ci.txt
pip_install numpy=="$NUMPY_VERSION"
pip_install -U scikit-learn
if [ -n "$DOCS" ]; then

View File

@ -7,7 +7,7 @@ PYTHON_DOWNLOAD_GITHUB_BRANCH=https://github.com/python/cpython/archive/refs/hea
GET_PIP_URL=https://bootstrap.pypa.io/get-pip.py
# Python versions to be installed in /opt/$VERSION_NO
CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.8.1 3.9.0 3.10.1 3.11.0 3.12.0 3.13.0"}
CPYTHON_VERSIONS=${CPYTHON_VERSIONS:-"3.8.1 3.9.0 3.10.1 3.11.0 3.12.0 3.13.0 3.13.0t"}
function check_var {
if [ -z "$1" ]; then
@ -22,6 +22,13 @@ function do_cpython_build {
check_var $py_ver
check_var $py_folder
tar -xzf Python-$py_ver.tgz
local additional_flags=""
if [ "$py_ver" == "3.13.0t" ]; then
additional_flags=" --disable-gil"
mv cpython-3.13/ cpython-3.13t/
fi
pushd $py_folder
local prefix="/opt/_internal/cpython-${py_ver}"
@ -37,8 +44,10 @@ function do_cpython_build {
local openssl_flags="--with-openssl=${WITH_OPENSSL} --with-openssl-rpath=auto"
fi
# -Wformat added for https://bugs.python.org/issue17547 on Python 2.6
CFLAGS="-Wformat" ./configure --prefix=${prefix} ${openssl_flags} ${shared_flags} > /dev/null
CFLAGS="-Wformat" ./configure --prefix=${prefix} ${openssl_flags} ${shared_flags} ${additional_flags} > /dev/null
make -j40 > /dev/null
make install > /dev/null
@ -58,7 +67,8 @@ function do_cpython_build {
if [ -e ${prefix}/bin/pip3 ] && [ ! -e ${prefix}/bin/pip ]; then
ln -s pip3 ${prefix}/bin/pip
fi
${prefix}/bin/pip install wheel==0.34.2
# install setuptools, which is required to use distutils on python 3.12+
${prefix}/bin/pip install wheel==0.34.2 setuptools==68.2.2
local abi_tag=$(${prefix}/bin/python -c "from wheel.pep425tags import get_abbr_impl, get_impl_ver, get_abi_tag; print('{0}{1}-{2}'.format(get_abbr_impl(), get_impl_ver(), get_abi_tag()))")
ln -s ${prefix} /opt/python/${abi_tag}
}
@ -68,7 +78,14 @@ function build_cpython {
check_var $py_ver
check_var $PYTHON_DOWNLOAD_URL
local py_ver_folder=$py_ver
if [ "$py_ver" = "3.13.0" ]; then
if [ "$py_ver" = "3.13.0t" ]; then
PY_VER_SHORT="3.13"
PYT_VER_SHORT="3.13t"
check_var $PYTHON_DOWNLOAD_GITHUB_BRANCH
wget $PYTHON_DOWNLOAD_GITHUB_BRANCH/$PY_VER_SHORT.tar.gz -O Python-$py_ver.tgz
do_cpython_build $py_ver cpython-$PYT_VER_SHORT
elif [ "$py_ver" = "3.13.0" ]; then
PY_VER_SHORT="3.13"
check_var $PYTHON_DOWNLOAD_GITHUB_BRANCH
wget $PYTHON_DOWNLOAD_GITHUB_BRANCH/$PY_VER_SHORT.tar.gz -O Python-$py_ver.tgz
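On the new 3.13.0t entry handled above: the 't' suffix selects CPython's free-threaded build, produced by the extra '--disable-gil' configure flag. A quick sanity check one might run against the resulting interpreter, assuming CPython 3.13's 'Py_GIL_DISABLED' sysconfig variable:
/opt/_internal/cpython-3.13.0t/bin/python -c "import sysconfig; print(sysconfig.get_config_var('Py_GIL_DISABLED'))"   # expected to print 1 for the free-threaded build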

View File

@ -27,6 +27,17 @@ function install_cusparselt_052 {
rm -rf tmp_cusparselt
}
function install_cusparselt_062 {
# cuSparseLt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && pushd tmp_cusparselt
wget -q https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-x86_64/libcusparse_lt-linux-x86_64-0.6.2.3-archive.tar.xz
tar xf libcusparse_lt-linux-x86_64-0.6.2.3-archive.tar.xz
cp -a libcusparse_lt-linux-x86_64-0.6.2.3-archive/include/* /usr/local/cuda/include/
cp -a libcusparse_lt-linux-x86_64-0.6.2.3-archive/lib/* /usr/local/cuda/lib64/
popd
rm -rf tmp_cusparselt
}
function install_118 {
echo "Installing CUDA 11.8 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.4.0"
rm -rf /usr/local/cuda-11.8 /usr/local/cuda
@ -94,13 +105,13 @@ function install_121 {
}
function install_124 {
echo "Installing CUDA 12.4 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"
echo "Installing CUDA 12.4.1 and cuDNN ${CUDNN_VERSION} and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"
rm -rf /usr/local/cuda-12.4 /usr/local/cuda
# install CUDA 12.4.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux.run
chmod +x cuda_12.4.0_550.54.14_linux.run
./cuda_12.4.0_550.54.14_linux.run --toolkit --silent
rm -f cuda_12.4.0_550.54.14_linux.run
# install CUDA 12.4.1 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux.run
chmod +x cuda_12.4.1_550.54.15_linux.run
./cuda_12.4.1_550.54.15_linux.run --toolkit --silent
rm -f cuda_12.4.1_550.54.15_linux.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.4 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement
@ -121,7 +132,7 @@ function install_124 {
cd ..
rm -rf nccl
install_cusparselt_052
install_cusparselt_062
ldconfig
}

View File

@ -17,13 +17,13 @@ function install_cusparselt_052 {
}
function install_124 {
echo "Installing CUDA 12.4 and cuDNN 9.1 and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"
echo "Installing CUDA 12.4.1 and cuDNN 9.1 and NCCL ${NCCL_VERSION} and cuSparseLt-0.5.2"
rm -rf /usr/local/cuda-12.4 /usr/local/cuda
# install CUDA 12.4.0 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.0/local_installers/cuda_12.4.0_550.54.14_linux_sbsa.run
chmod +x cuda_12.4.0_550.54.14_linux_sbsa.run
./cuda_12.4.0_550.54.14_linux_sbsa.run --toolkit --silent
rm -f cuda_12.4.0_550.54.14_linux_sbsa.run
# install CUDA 12.4.1 in the same container
wget -q https://developer.download.nvidia.com/compute/cuda/12.4.1/local_installers/cuda_12.4.1_550.54.15_linux_sbsa.run
chmod +x cuda_12.4.1_550.54.15_linux_sbsa.run
./cuda_12.4.1_550.54.15_linux_sbsa.run --toolkit --silent
rm -f cuda_12.4.1_550.54.15_linux_sbsa.run
rm -f /usr/local/cuda && ln -s /usr/local/cuda-12.4 /usr/local/cuda
# cuDNN license: https://developer.nvidia.com/cudnn/license_agreement

View File

@ -0,0 +1,25 @@
#!/bin/bash
set -ex
# cudss license: https://docs.nvidia.com/cuda/cudss/license.html
mkdir tmp_cudss && cd tmp_cudss
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[1-4]$ ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
arch_path='x86_64'
fi
CUDSS_NAME="libcudss-linux-${arch_path}-0.3.0.9_cuda12-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cudss/redist/libcudss/linux-${arch_path}/${CUDSS_NAME}.tar.xz
# only for cuda 12
tar xf ${CUDSS_NAME}.tar.xz
cp -a ${CUDSS_NAME}/include/* /usr/local/cuda/include/
cp -a ${CUDSS_NAME}/lib/* /usr/local/cuda/lib64/
fi
cd ..
rm -rf tmp_cudss
ldconfig
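A small sketch of the two gates in this new cuDSS script, with illustrative values: the substring slice keeps only major.minor for the regex test, and TARGETARCH falls back to 'uname -m' when Docker does not set it:
CUDA_VERSION=12.4.1
echo "${CUDA_VERSION:0:4}"                     # 12.4, which matches ^12\.[1-4]$ so cuDSS gets installed
export TARGETARCH=${TARGETARCH:-$(uname -m)}   # amd64 or x86_64 selects the x86_64 archive, anything else uses sbsa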

View File

@ -5,7 +5,15 @@ set -ex
# cuSPARSELt license: https://docs.nvidia.com/cuda/cusparselt/license.html
mkdir tmp_cusparselt && cd tmp_cusparselt
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[1-4]$ ]]; then
if [[ ${CUDA_VERSION:0:4} =~ ^12\.[2-4]$ ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then
arch_path='x86_64'
fi
CUSPARSELT_NAME="libcusparse_lt-linux-${arch_path}-0.6.2.3-archive"
curl --retry 3 -OLs https://developer.download.nvidia.com/compute/cusparselt/redist/libcusparse_lt/linux-${arch_path}/${CUSPARSELT_NAME}.tar.xz
elif [[ ${CUDA_VERSION:0:4} == "12.1" ]]; then
arch_path='sbsa'
export TARGETARCH=${TARGETARCH:-$(uname -m)}
if [ ${TARGETARCH} = 'amd64' ] || [ "${TARGETARCH}" = 'x86_64' ]; then

View File

@ -57,7 +57,10 @@ MIOPEN_CMAKE_COMMON_FLAGS="
-DMIOPEN_BUILD_DRIVER=OFF
"
# Pull MIOpen repo and set DMIOPEN_EMBED_DB based on ROCm version
if [[ $ROCM_INT -ge 60100 ]] && [[ $ROCM_INT -lt 60200 ]]; then
if [[ $ROCM_INT -ge 60200 ]] && [[ $ROCM_INT -lt 60300 ]]; then
echo "ROCm 6.2 MIOpen does not need any patches, do not build from source"
exit 0
elif [[ $ROCM_INT -ge 60100 ]] && [[ $ROCM_INT -lt 60200 ]]; then
echo "ROCm 6.1 MIOpen does not need any patches, do not build from source"
exit 0
elif [[ $ROCM_INT -ge 60000 ]] && [[ $ROCM_INT -lt 60100 ]]; then
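The ROCM_INT thresholds above read as a packed MAJOR*10000 + MINOR*100 + PATCH integer; a hypothetical reconstruction (the real computation happens earlier in the script and is not shown in this diff):
ROCM_VERSION=6.2.1
IFS=. read -r MAJ MIN PATCH <<< "${ROCM_VERSION}"
ROCM_INT=$((MAJ * 10000 + MIN * 100 + ${PATCH:-0}))
echo "${ROCM_INT}"   # 60201, which lands in [60200, 60300) and takes the new ROCm 6.2 early-exit branch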

View File

@ -0,0 +1,20 @@
#!/bin/bash
set -ex
function install_nvpl {
mkdir -p /opt/nvpl/lib /opt/nvpl/include
wget https://developer.download.nvidia.com/compute/nvpl/redist/nvpl_blas/linux-sbsa/nvpl_blas-linux-sbsa-0.3.0-archive.tar.xz
tar xf nvpl_blas-linux-sbsa-0.3.0-archive.tar.xz
cp -r nvpl_blas-linux-sbsa-0.3.0-archive/lib/* /opt/nvpl/lib/
cp -r nvpl_blas-linux-sbsa-0.3.0-archive/include/* /opt/nvpl/include/
wget https://developer.download.nvidia.com/compute/nvpl/redist/nvpl_lapack/linux-sbsa/nvpl_lapack-linux-sbsa-0.2.3.1-archive.tar.xz
tar xf nvpl_lapack-linux-sbsa-0.2.3.1-archive.tar.xz
cp -r nvpl_lapack-linux-sbsa-0.2.3.1-archive/lib/* /opt/nvpl/lib/
cp -r nvpl_lapack-linux-sbsa-0.2.3.1-archive/include/* /opt/nvpl/include/
}
install_nvpl

View File

@ -15,7 +15,7 @@ pip_install \
flatbuffers==2.0 \
mock==5.0.1 \
ninja==1.10.2 \
networkx==2.0 \
networkx==2.5 \
numpy==1.24.2
# ONNXRuntime should be installed before installing
@ -30,10 +30,9 @@ pip_install \
pip_install coloredlogs packaging
pip_install onnxruntime==1.18
pip_install onnx==1.16.0
# pip_install "onnxscript@git+https://github.com/microsoft/onnxscript@3e869ef8ccf19b5ebd21c10d3e9c267c9a9fa729" --no-deps
pip_install onnxscript==0.1.0.dev20240613 --no-deps
pip_install onnxruntime==1.18.1
pip_install onnx==1.16.2
pip_install onnxscript==0.1.0.dev20240831 --no-deps
# required by onnxscript
pip_install ml_dtypes

View File

@ -12,10 +12,7 @@ conda_reinstall() {
as_jenkins conda install -q -n py_$ANACONDA_PYTHON_VERSION -y --force-reinstall $*
}
if [ -n "${ROCM_VERSION}" ]; then
TRITON_REPO="https://github.com/openai/triton"
TRITON_TEXT_FILE="triton-rocm"
elif [ -n "${XPU_VERSION}" ]; then
if [ -n "${XPU_VERSION}" ]; then
TRITON_REPO="https://github.com/intel/intel-xpu-backend-for-triton"
TRITON_TEXT_FILE="triton-xpu"
else
@ -41,19 +38,33 @@ if [ -z "${MAX_JOBS}" ]; then
export MAX_JOBS=$(nproc)
fi
# Git checkout triton
mkdir /var/lib/jenkins/triton
chown -R jenkins /var/lib/jenkins/triton
chgrp -R jenkins /var/lib/jenkins/triton
pushd /var/lib/jenkins/
as_jenkins git clone ${TRITON_REPO} triton
cd triton
as_jenkins git checkout ${TRITON_PINNED_COMMIT}
cd python
# TODO: remove this setup.py patch once we have a proper fix for https://github.com/triton-lang/triton/issues/4527
as_jenkins sed -i -e 's/https:\/\/tritonlang.blob.core.windows.net\/llvm-builds/https:\/\/oaitriton.blob.core.windows.net\/public\/llvm-builds/g' setup.py
if [ -n "${UBUNTU_VERSION}" ] && [ -n "${GCC_VERSION}" ] && [[ "${GCC_VERSION}" == "7" ]]; then
# Triton needs at least gcc-9 to build
apt-get install -y g++-9
CXX=g++-9 pip_install "git+${TRITON_REPO}@${TRITON_PINNED_COMMIT}#subdirectory=python"
CXX=g++-9 pip_install -e .
elif [ -n "${UBUNTU_VERSION}" ] && [ -n "${CLANG_VERSION}" ]; then
# Triton needs <filesystem>, which surprisingly is not available with the clang-9 toolchain
add-apt-repository -y ppa:ubuntu-toolchain-r/test
apt-get install -y g++-9
CXX=g++-9 pip_install "git+${TRITON_REPO}@${TRITON_PINNED_COMMIT}#subdirectory=python"
CXX=g++-9 pip_install -e .
else
pip_install "git+${TRITON_REPO}@${TRITON_PINNED_COMMIT}#subdirectory=python"
pip_install -e .
fi
if [ -n "${CONDA_CMAKE}" ]; then

View File

@ -16,11 +16,11 @@ function install_ubuntu() {
apt-get update -y
apt-get install -y gpg-agent wget
# To add the online network package repository for the GPU Driver LTS releases
# To add the online network package repository for the GPU Driver
wget -qO - https://repositories.intel.com/gpu/intel-graphics.key \
| gpg --yes --dearmor --output /usr/share/keyrings/intel-graphics.gpg
echo "deb [arch=amd64 signed-by=/usr/share/keyrings/intel-graphics.gpg] \
https://repositories.intel.com/gpu/ubuntu ${VERSION_CODENAME}/lts/2350 unified" \
https://repositories.intel.com/gpu/ubuntu ${VERSION_CODENAME}${XPU_DRIVER_VERSION} unified" \
| tee /etc/apt/sources.list.d/intel-gpu-${VERSION_CODENAME}.list
# To add the online network package repository for the Intel Support Packages
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB \
@ -45,9 +45,9 @@ function install_ubuntu() {
apt-get install -y libigc-dev intel-igc-cm libigdfcl-dev libigfxcmrt-dev level-zero-dev
# Install Intel Support Packages
if [ -n "$XPU_VERSION" ]; then
apt-get install -y intel-for-pytorch-gpu-dev-${XPU_VERSION}
apt-get install -y intel-for-pytorch-gpu-dev-${XPU_VERSION} intel-pti-dev
else
apt-get install -y intel-for-pytorch-gpu-dev
apt-get install -y intel-for-pytorch-gpu-dev intel-pti-dev
fi
# Cleanup
@ -55,52 +55,6 @@ function install_ubuntu() {
rm -rf /var/lib/apt/lists/* /tmp/* /var/tmp/*
}
function install_centos() {
dnf install -y 'dnf-command(config-manager)'
dnf config-manager --add-repo \
https://repositories.intel.com/gpu/rhel/8.6/production/2328/unified/intel-gpu-8.6.repo
# To add the EPEL repository needed for DKMS
dnf -y install https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
# https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
# Create the YUM repository file in the /temp directory as a normal user
tee > /tmp/oneAPI.repo << EOF
[oneAPI]
name=Intel® oneAPI repository
baseurl=https://yum.repos.intel.com/oneapi
enabled=1
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://yum.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB
EOF
# Move the newly created oneAPI.repo file to the YUM configuration directory /etc/yum.repos.d
mv /tmp/oneAPI.repo /etc/yum.repos.d
# The xpu-smi packages
dnf install -y flex bison xpu-smi
# Compute and Media Runtimes
dnf install -y \
intel-opencl intel-media intel-mediasdk libmfxgen1 libvpl2\
level-zero intel-level-zero-gpu mesa-dri-drivers mesa-vulkan-drivers \
mesa-vdpau-drivers libdrm mesa-libEGL mesa-libgbm mesa-libGL \
mesa-libxatracker libvpl-tools intel-metrics-discovery \
intel-metrics-library intel-igc-core intel-igc-cm \
libva libva-utils intel-gmmlib libmetee intel-gsc intel-ocloc hwinfo clinfo
# Development packages
dnf install -y --refresh \
intel-igc-opencl-devel level-zero-devel intel-gsc-devel libmetee-devel \
level-zero-devel
# Install Intel® oneAPI Base Toolkit
dnf install intel-basekit -y
# Cleanup
dnf clean all
rm -rf /var/cache/yum
rm -rf /var/lib/yum/yumdb
rm -rf /var/lib/yum/history
}
function install_rhel() {
. /etc/os-release
if [[ "${ID}" == "rhel" ]]; then
@ -114,9 +68,9 @@ function install_rhel() {
fi
dnf install -y 'dnf-command(config-manager)'
# To add the online network package repository for the GPU Driver LTS releases
# To add the online network package repository for the GPU Driver
dnf config-manager --add-repo \
https://repositories.intel.com/gpu/rhel/${VERSION_ID}/lts/2350/unified/intel-gpu-${VERSION_ID}.repo
https://repositories.intel.com/gpu/rhel/${VERSION_ID}${XPU_DRIVER_VERSION}/unified/intel-gpu-${VERSION_ID}.repo
# To add the online network package repository for the Intel Support Packages
tee > /etc/yum.repos.d/intel-for-pytorch-gpu-dev.repo << EOF
[intel-for-pytorch-gpu-dev]
@ -131,7 +85,7 @@ EOF
# The xpu-smi packages
dnf install -y xpu-smi
# Compute and Media Runtimes
dnf install -y \
dnf install --skip-broken -y \
intel-opencl intel-media intel-mediasdk libmfxgen1 libvpl2\
level-zero intel-level-zero-gpu mesa-dri-drivers mesa-vulkan-drivers \
mesa-vdpau-drivers libdrm mesa-libEGL mesa-libgbm mesa-libGL \
@ -160,9 +114,9 @@ function install_sles() {
exit
fi
# To add the online network package repository for the GPU Driver LTS releases
# To add the online network package repository for the GPU Driver
zypper addrepo -f -r \
https://repositories.intel.com/gpu/sles/${VERSION_SP}/lts/2350/unified/intel-gpu-${VERSION_SP}.repo
https://repositories.intel.com/gpu/sles/${VERSION_SP}${XPU_DRIVER_VERSION}/unified/intel-gpu-${VERSION_SP}.repo
rpm --import https://repositories.intel.com/gpu/intel-graphics.key
# To add the online network package repository for the Intel Support Packages
zypper addrepo https://yum.repos.intel.com/intel-for-pytorch-gpu-dev intel-for-pytorch-gpu-dev
@ -181,6 +135,12 @@ function install_sles() {
}
# Default use GPU driver LTS releases
XPU_DRIVER_VERSION="/lts/2350"
if [[ "${XPU_DRIVER_TYPE,,}" == "rolling" ]]; then
# Use GPU driver rolling releases
XPU_DRIVER_VERSION=""
fi
# The installation depends on the base OS
ID=$(grep -oP '(?<=^ID=).+' /etc/os-release | tr -d '"')
@ -188,9 +148,6 @@ case "$ID" in
ubuntu)
install_ubuntu
;;
centos)
install_centos
;;
rhel|almalinux)
install_rhel
;;
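A short sketch of what the new XPU_DRIVER_VERSION switch does to the repository paths above, using the Ubuntu jammy codename as an example:
VERSION_CODENAME=jammy
XPU_DRIVER_VERSION="/lts/2350"   # default: GPU driver LTS releases
echo "https://repositories.intel.com/gpu/ubuntu ${VERSION_CODENAME}${XPU_DRIVER_VERSION} unified"
# -> https://repositories.intel.com/gpu/ubuntu jammy/lts/2350 unified
XPU_DRIVER_VERSION=""            # when XPU_DRIVER_TYPE is set to "rolling"
echo "https://repositories.intel.com/gpu/ubuntu ${VERSION_CODENAME}${XPU_DRIVER_VERSION} unified"
# -> https://repositories.intel.com/gpu/ubuntu jammy unified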

View File

@ -21,9 +21,8 @@ RUN sed -i s/^#.*baseurl=http/baseurl=http/g /etc/yum.repos.d/*.repo
RUN sed -i s/^mirrorlist=http/#mirrorlist=http/g /etc/yum.repos.d/*.repo
RUN yum install -y devtoolset-${DEVTOOLSET_VERSION}-gcc devtoolset-${DEVTOOLSET_VERSION}-gcc-c++ devtoolset-${DEVTOOLSET_VERSION}-gcc-gfortran devtoolset-${DEVTOOLSET_VERSION}-binutils
# EPEL for cmake
RUN wget http://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm && \
rpm -ivh epel-release-latest-7.noarch.rpm && \
rm -f epel-release-latest-7.noarch.rpm
RUN yum --enablerepo=extras install -y epel-release
# cmake
RUN yum install -y cmake3 && \
ln -s /usr/bin/cmake3 /usr/bin/cmake

View File

@ -89,7 +89,7 @@ RUN bash ./install_rocm_magma.sh && rm install_rocm_magma.sh
# Install AOTriton
COPY ./common/common_utils.sh common_utils.sh
COPY ./common/aotriton_version.txt aotriton_version.txt
COPY ./aotriton_version.txt aotriton_version.txt
COPY ./common/install_aotriton.sh install_aotriton.sh
RUN bash ./install_aotriton.sh /opt/rocm && rm install_aotriton.sh aotriton_version.txt
ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton

View File

@ -29,7 +29,7 @@ RUN bash ./install_conda.sh && rm install_conda.sh common_utils.sh /opt/conda/re
# Install cuda and cudnn
ARG CUDA_VERSION
RUN wget -q https://raw.githubusercontent.com/pytorch/builder/main/common/install_cuda.sh -O install_cuda.sh
COPY ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh
ENV DESIRED_CUDA ${CUDA_VERSION}
ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:$PATH

View File

@ -29,9 +29,7 @@ RUN yum install -y devtoolset-${DEVTOOLSET_VERSION}-gcc devtoolset-${DEVTOOLSET_
ENV PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib64:/opt/rh/devtoolset-${DEVTOOLSET_VERSION}/root/usr/lib:$LD_LIBRARY_PATH
RUN wget http://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm && \
rpm -ivh epel-release-latest-7.noarch.rpm && \
rm -f epel-release-latest-7.noarch.rpm
RUN yum --enablerepo=extras install -y epel-release
# cmake-3.18.4 from pip
RUN yum install -y python3-pip && \
@ -117,7 +115,8 @@ RUN yum install -y \
yasm
RUN yum install -y \
https://repo.ius.io/ius-release-el7.rpm \
https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm
RUN yum swap -y git git236-core
# git236+ would refuse to run git commands in repos owned by other users,
# which causes the version check to fail, as the pytorch repo is bind-mounted into the image
@ -197,7 +196,7 @@ RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh
# Install AOTriton
COPY ./common/common_utils.sh common_utils.sh
COPY ./common/aotriton_version.txt aotriton_version.txt
COPY ./aotriton_version.txt aotriton_version.txt
COPY ./common/install_aotriton.sh install_aotriton.sh
RUN bash ./install_aotriton.sh /opt/rocm && rm install_aotriton.sh aotriton_version.txt
ENV AOTRITON_INSTALLED_PREFIX /opt/rocm/aotriton

View File

@ -93,7 +93,8 @@ RUN yum install -y \
yasm
RUN yum install -y \
https://repo.ius.io/ius-release-el7.rpm \
https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm
RUN yum swap -y git git236-core
# git236+ would refuse to run git commands in repos owned by other users,
# which causes the version check to fail, as the pytorch repo is bind-mounted into the image

View File

@ -87,10 +87,10 @@ RUN yum install -y \
xz \
gcc-toolset-${DEVTOOLSET_VERSION}-toolchain \
glibc-langpack-en
RUN yum install -y \
https://repo.ius.io/ius-release-el7.rpm \
https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
https://ossci-linux.s3.amazonaws.com/epel-release-7-14.noarch.rpm
RUN yum swap -y git git236-core
# git236+ would refuse to run git commands in repos owned by other users,
# which causes the version check to fail, as the pytorch repo is bind-mounted into the image
@ -145,9 +145,13 @@ ADD ./common/install_miopen.sh install_miopen.sh
RUN bash ./install_miopen.sh ${ROCM_VERSION} && rm install_miopen.sh
FROM cpu_final as xpu_final
# XPU CD uses the rolling driver
ENV XPU_DRIVER_TYPE ROLLING
# cmake-3.28.4 from pip
RUN python3 -m pip install --upgrade pip && \
python3 -mpip install cmake==3.28.4
# Install setuptools and wheel for python 3.13
RUN /opt/python/cp313-cp313/bin/python -m pip install setuptools wheel
ADD ./common/install_xpu.sh install_xpu.sh
RUN bash ./install_xpu.sh && rm install_xpu.sh
RUN pushd /opt/_internal && tar -xJf static-libs-for-embedding-only.tar.xz && popd

View File

@ -75,17 +75,17 @@ ARG BASE_CUDA_VERSION
ADD ./common/install_magma.sh install_magma.sh
RUN bash ./install_magma.sh ${BASE_CUDA_VERSION} && rm install_magma.sh
FROM base as openblas
# Install openblas
ADD ./common/install_openblas.sh install_openblas.sh
RUN bash ./install_openblas.sh && rm install_openblas.sh
FROM base as nvpl
# Install nvpl
ADD ./common/install_nvpl.sh install_nvpl.sh
RUN bash ./install_nvpl.sh && rm install_nvpl.sh
FROM final as cuda_final
ARG BASE_CUDA_VERSION
RUN rm -rf /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=cuda /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=magma /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda-${BASE_CUDA_VERSION}
COPY --from=openblas /opt/OpenBLAS/ /opt/OpenBLAS/
COPY --from=nvpl /opt/nvpl/lib/ /usr/local/lib/
COPY --from=nvpl /opt/nvpl/include/ /usr/local/include/
RUN ln -sf /usr/local/cuda-${BASE_CUDA_VERSION} /usr/local/cuda
ENV PATH=/usr/local/cuda/bin:$PATH
ENV LD_LIBRARY_PATH=/opt/OpenBLAS/lib:$LD_LIBRARY_PATH

View File

@ -30,9 +30,14 @@ dill==0.3.7
#Pinned versions: 0.3.7
#test that import: dynamo/test_replay_record.py test_dataloader.py test_datapipe.py test_serialization.py
expecttest==0.1.6
expecttest==0.2.1
#Description: method for writing tests where test framework auto populates
# the expected output based on previous runs
#Pinned versions: 0.2.1
#test that import:
fbscribelogger==0.1.6
#Description: write to scribe from authenticated jobs on CI
#Pinned versions: 0.1.6
#test that import:
@ -85,7 +90,7 @@ librosa>=0.6.2 ; python_version < "3.11"
#Pinned versions:
#test that import:
mypy==1.10.0
mypy==1.11.2
# Pin MyPy version because new errors are likely to appear with each release
#Description: linter
#Pinned versions: 1.10.0
@ -104,7 +109,7 @@ networkx==2.8.8
#test that import: run_test.py, test_cpp_extensions_aot.py,test_determination.py
numba==0.49.0 ; python_version < "3.9"
numba==0.54.1 ; python_version == "3.9"
numba==0.55.2 ; python_version == "3.9"
numba==0.55.2 ; python_version == "3.10"
#Description: Just-In-Time Compiler for Numerical Functions
#Pinned versions: 0.54.1, 0.49.0, <=0.49.1
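These requirement lines rely on PEP 508 environment markers: pip evaluates the condition after ';' against the running interpreter and skips the line when it is false, so only one numba pin applies per Python version. For instance, assuming a Python 3.10 environment:
pip install 'numba==0.55.2 ; python_version == "3.10"'   # installs 0.55.2 on 3.10, is a no-op elsewhere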
@ -218,7 +223,7 @@ pygments==2.15.0
#test that import:
scikit-image==0.19.3 ; python_version < "3.10"
scikit-image==0.20.0 ; python_version >= "3.10"
scikit-image==0.22.0 ; python_version >= "3.10"
#Description: image processing routines
#Pinned versions:
#test that import: test_nn.py
@ -269,6 +274,10 @@ lintrunner==0.12.5
#Pinned versions: 0.12.5
#test that import:
redis>=4.0.0
#Description: redis database
#test that import: anything that tests OSS caching/mocking (inductor/test_codecache.py, inductor/test_max_autotune.py)
rockset==1.0.3
#Description: queries Rockset
#Pinned versions: 1.0.3
@ -312,3 +321,24 @@ lxml==5.0.0
# Python-3.9 binaries
PyGithub==2.3.0
sympy==1.12.1 ; python_version == "3.8"
sympy==1.13.1 ; python_version >= "3.9"
#Description: Required by coremltools, also pinned in .github/requirements/pip-requirements-macOS.txt
#Pinned versions:
#test that import:
onnx==1.16.1
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
onnxscript==0.1.0.dev20240817
#Description: Required by mypy and test_public_bindings.py when checking torch.onnx._internal
#Pinned versions:
#test that import:
parameterized==0.8.1
#Description: Parameterizes unittests, both the tests themselves and the entire testing class
#Pinned versions:
#test that import:

View File

@ -1 +1 @@
3.0.0
3.1.0

View File

@ -156,6 +156,12 @@ COPY ./common/install_cusparselt.sh install_cusparselt.sh
RUN bash install_cusparselt.sh
RUN rm install_cusparselt.sh
# Install CUDSS
ARG CUDA_VERSION
COPY ./common/install_cudss.sh install_cudss.sh
RUN bash install_cudss.sh
RUN rm install_cudss.sh
# Delete /usr/local/cuda-11.X/cuda-11.X symlinks
RUN if [ -h /usr/local/cuda-11.6/cuda-11.6 ]; then rm /usr/local/cuda-11.6/cuda-11.6; fi
RUN if [ -h /usr/local/cuda-11.7/cuda-11.7 ]; then rm /usr/local/cuda-11.7/cuda-11.7; fi

View File

@ -100,10 +100,10 @@ ARG TRITON
# try to reach out to S3, which docker build runners don't have access to
COPY ./common/install_triton.sh install_triton.sh
COPY ./common/common_utils.sh common_utils.sh
COPY ci_commit_pins/triton-rocm.txt triton-rocm.txt
COPY ci_commit_pins/triton.txt triton.txt
COPY triton_version.txt triton_version.txt
RUN if [ -n "${TRITON}" ]; then bash ./install_triton.sh; fi
RUN rm install_triton.sh common_utils.sh triton-rocm.txt triton_version.txt
RUN rm install_triton.sh common_utils.sh triton.txt triton_version.txt
# Install AOTriton
COPY ./aotriton_version.txt aotriton_version.txt

View File

@ -30,6 +30,7 @@ RUN bash ./install_docs_reqs.sh && rm install_docs_reqs.sh
ARG ANACONDA_PYTHON_VERSION
ARG CONDA_CMAKE
ARG DOCS
ARG BUILD_ENVIRONMENT
ENV ANACONDA_PYTHON_VERSION=$ANACONDA_PYTHON_VERSION
ENV PATH /opt/conda/envs/py_$ANACONDA_PYTHON_VERSION/bin:/opt/conda/bin:$PATH
ENV DOCS=$DOCS

View File

@ -50,7 +50,7 @@ RUN bash ./install_lcov.sh && rm install_lcov.sh
# Install cuda and cudnn
ARG CUDA_VERSION
RUN wget -q https://raw.githubusercontent.com/pytorch/builder/main/common/install_cuda.sh -O install_cuda.sh
COPY ./common/install_cuda.sh install_cuda.sh
RUN bash ./install_cuda.sh ${CUDA_VERSION} && rm install_cuda.sh
ENV DESIRED_CUDA ${CUDA_VERSION}
ENV PATH /usr/local/nvidia/bin:/usr/local/cuda/bin:$PATH

View File

@ -176,7 +176,8 @@ fi
if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
# shellcheck disable=SC1091
source /opt/intel/oneapi/compiler/latest/env/vars.sh
export USE_XPU=1
# XPU kineto feature dependencies are not fully ready, disable kineto build as temp WA
export USE_KINETO=0
fi
# sccache will fail for CUDA builds if all cores are used for compiling
@ -284,9 +285,8 @@ else
if [[ "$BUILD_ENVIRONMENT" != *rocm* &&
"$BUILD_ENVIRONMENT" != *xla* ]]; then
if [[ "$BUILD_ENVIRONMENT" != *py3.8* ]]; then
# Install numpy-2.0 release candidate for builds
# Which should be backward compatible with Numpy-1.X
python -mpip install --pre numpy==2.0.0rc1
# Install numpy-2.0.2 for builds which are backward compatible with 1.X
python -mpip install --pre numpy==2.0.2
fi
WERROR=1 python setup.py clean

View File

@ -179,7 +179,7 @@ function install_torchvision() {
}
function install_tlparse() {
pip_install --user "tlparse==0.3.7"
pip_install --user "tlparse==0.3.25"
PATH="$(python -m site --user-base)/bin:$PATH"
}

View File

@ -9,15 +9,13 @@ if [[ -n "$CONDA_ENV" ]]; then
export PATH="$CONDA_ENV/bin":$PATH
fi
# Test that OpenMP is enabled for non-arm64 build
if [[ ${BUILD_ENVIRONMENT} != *arm64* ]]; then
pushd test
if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available()))") == "1" ]]; then
echo "Build should have OpenMP enabled, but torch.backends.openmp.is_available() is False"
exit 1
fi
popd
# Test that OpenMP is enabled
pushd test
if [[ ! $(python -c "import torch; print(int(torch.backends.openmp.is_available()))") == "1" ]]; then
echo "Build should have OpenMP enabled, but torch.backends.openmp.is_available() is False"
exit 1
fi
popd
setup_test_python() {
# The CircleCI worker hostname doesn't resolve to an address.
@ -27,8 +25,9 @@ setup_test_python() {
echo "Ninja version: $(ninja --version)"
echo "Python version: $(which python) ($(python --version))"
# Increase default limit on open file handles from 256 to 1024
ulimit -n 1024
# Set the limit on open file handles to 16384
# might help with intermittent compiler test failures
ulimit -n 16384
}
test_python_all() {

View File

@ -6,6 +6,9 @@
set -ex
# Suppress ANSI color escape sequences
export TERM=vt100
# shellcheck source=./common.sh
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
@ -166,7 +169,7 @@ fi
if [[ "$BUILD_ENVIRONMENT" == *xpu* ]]; then
# Source the Intel oneAPI environment script to enable xpu runtime related libraries
# refer to https://www.intel.com/content/www/us/en/docs/oneapi/programming-guide/2024-0/use-the-setvars-and-oneapi-vars-scripts-with-linux.html
# refer to https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpu/2-5.html
# shellcheck disable=SC1091
source /opt/intel/oneapi/compiler/latest/env/vars.sh
# Check XPU status before testing
@ -316,6 +319,7 @@ test_inductor_distributed() {
python test/run_test.py -i inductor/test_aot_inductor.py -k test_replicate_on_devices --verbose
python test/run_test.py -i distributed/test_c10d_functional_native.py --verbose
python test/run_test.py -i distributed/_tensor/test_dtensor_compile.py --verbose
python test/run_test.py -i distributed/tensor/parallel/test_micro_pipeline_tp.py --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_comm.py --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_multi_group --verbose
python test/run_test.py -i distributed/_composable/fsdp/test_fully_shard_training.py -k test_train_parity_with_activation_checkpointing --verbose
@ -357,10 +361,12 @@ test_inductor_shard() {
test_inductor_aoti() {
# docker build uses bdist_wheel which does not work with test_aot_inductor
# TODO: need a faster way to build
if [[ "$BUILD_ENVIRONMENT" != *rocm* ]]; then
BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop
CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference
if [[ "$BUILD_ENVIRONMENT" == *rocm* ]]; then
# We need to hipify before building again
python3 tools/amd_build/build_amd.py
fi
BUILD_AOT_INDUCTOR_TEST=1 python setup.py develop
CPP_TESTS_DIR="${BUILD_BIN_DIR}" LD_LIBRARY_PATH="${TORCH_LIB_DIR}" python test/run_test.py --cpp --verbose -i cpp/test_aoti_abi_check cpp/test_aoti_inference
}
test_inductor_cpp_wrapper_abi_compatible() {
@ -389,7 +395,22 @@ test_inductor_cpp_wrapper_abi_compatible() {
# .github/workflows/inductor-perf-test-nightly.yml
DYNAMO_BENCHMARK_FLAGS=()
if [[ "${TEST_CONFIG}" == *dynamo_eager* ]]; then
pr_time_benchmarks() {
pip_install --user "fbscribelogger"
TEST_REPORTS_DIR=$(pwd)/test/test-reports
mkdir -p "$TEST_REPORTS_DIR"
PYTHONPATH=$(pwd)/benchmarks/dynamo/pr_time_benchmarks source benchmarks/dynamo/pr_time_benchmarks/benchmark_runner.sh "$TEST_REPORTS_DIR/pr_time_benchmarks_results.csv" "benchmarks/dynamo/pr_time_benchmarks/benchmarks"
echo "benchmark results on current PR: "
cat "$TEST_REPORTS_DIR/pr_time_benchmarks_results.csv"
}
if [[ "${TEST_CONFIG}" == *pr_time_benchmarks* ]]; then
pr_time_benchmarks
exit 0
elif [[ "${TEST_CONFIG}" == *dynamo_eager* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--backend eager)
elif [[ "${TEST_CONFIG}" == *aot_eager* ]]; then
DYNAMO_BENCHMARK_FLAGS+=(--backend aot_eager)
@ -428,7 +449,6 @@ test_perf_for_dashboard() {
local targets=(accuracy performance)
local device=cuda
local taskset=""
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
if [[ "${TEST_CONFIG}" == *cpu_x86* ]]; then
device=cpu_x86
@ -436,8 +456,8 @@ test_perf_for_dashboard() {
device=cpu_aarch64
fi
test_inductor_set_cpu_affinity
end_core=$(( $(test_inductor_get_core_number)-1 ))
taskset="taskset -c 0-$end_core"
elif [[ "${TEST_CONFIG}" == *cuda_a10g* ]]; then
device=cuda_a10g
fi
for mode in "${modes[@]}"; do
@ -455,43 +475,49 @@ test_perf_for_dashboard() {
fi
if [[ "$DASHBOARD_TAG" == *default-true* ]]; then
$taskset python "benchmarks/dynamo/$suite.py" \
$TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_no_cudagraphs_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *cudagraphs-true* ]]; then
$taskset python "benchmarks/dynamo/$suite.py" \
$TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" \
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *dynamic-true* ]]; then
$taskset python "benchmarks/dynamo/$suite.py" \
$TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --dynamic-shapes \
--dynamic-batch-only "$@" \
--output "$TEST_REPORTS_DIR/${backend}_dynamic_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *cppwrapper-true* ]] && [[ "$mode" == "inference" ]]; then
TORCHINDUCTOR_CPP_WRAPPER=1 $taskset python "benchmarks/dynamo/$suite.py" \
TORCHINDUCTOR_CPP_WRAPPER=1 $TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_cpp_wrapper_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *freezing_cudagraphs-true* ]] && [[ "$mode" == "inference" ]]; then
$taskset python "benchmarks/dynamo/$suite.py" \
$TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" --freezing \
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *freeze_autotune_cudagraphs-true* ]] && [[ "$mode" == "inference" ]]; then
TORCHINDUCTOR_MAX_AUTOTUNE=1 $taskset python "benchmarks/dynamo/$suite.py" \
TORCHINDUCTOR_MAX_AUTOTUNE=1 $TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" --freezing \
--output "$TEST_REPORTS_DIR/${backend}_with_cudagraphs_freezing_autotune_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *aotinductor-true* ]] && [[ "$mode" == "inference" ]]; then
TORCHINDUCTOR_ABI_COMPATIBLE=1 $taskset python "benchmarks/dynamo/$suite.py" \
if [[ "$target" == "accuracy" ]]; then
# Also collect Export pass rate and display as a separate row
$TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --export --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_export_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
TORCHINDUCTOR_ABI_COMPATIBLE=1 $TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --export-aot-inductor --disable-cudagraphs "$@" \
--output "$TEST_REPORTS_DIR/${backend}_aot_inductor_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
if [[ "$DASHBOARD_TAG" == *maxautotune-true* ]]; then
TORCHINDUCTOR_MAX_AUTOTUNE=1 $taskset python "benchmarks/dynamo/$suite.py" \
TORCHINDUCTOR_MAX_AUTOTUNE=1 $TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --"$dtype" --backend "$backend" "$@" \
--output "$TEST_REPORTS_DIR/${backend}_max_autotune_${suite}_${dtype}_${mode}_${device}_${target}.csv"
fi
@ -499,7 +525,7 @@ test_perf_for_dashboard() {
# TODO: This has a new dtype called quant and the benchmarks script needs to be updated to support this.
# The tentative command is as follows. It doesn't work now, but it's ok because we only need mock data
# to fill the dashboard.
$taskset python "benchmarks/dynamo/$suite.py" \
$TASKSET python "benchmarks/dynamo/$suite.py" \
"${target_flag[@]}" --"$mode" --quant --backend "$backend" "$@" \
--output "$TEST_REPORTS_DIR/${backend}_cudagraphs_low_precision_${suite}_quant_${mode}_${device}_${target}.csv" || true
# Copy cudagraph results as mock data, easiest choice?
@ -547,6 +573,13 @@ test_single_dynamo_benchmark() {
# For the CPU device, we prefer non-ABI-compatible mode on CI when testing AOTInductor.
export TORCHINDUCTOR_ABI_COMPATIBLE=1
fi
if [[ "${TEST_CONFIG}" == *_avx2* ]]; then
TEST_CONFIG=${TEST_CONFIG//_avx2/}
fi
if [[ "${TEST_CONFIG}" == *_avx512* ]]; then
TEST_CONFIG=${TEST_CONFIG//_avx512/}
fi
python "benchmarks/dynamo/$suite.py" \
--ci --accuracy --timing --explain \
"${DYNAMO_BENCHMARK_FLAGS[@]}" \
@ -563,6 +596,9 @@ test_single_dynamo_benchmark() {
test_inductor_micro_benchmark() {
TEST_REPORTS_DIR=$(pwd)/test/test-reports
if [[ "${TEST_CONFIG}" == *cpu* ]]; then
test_inductor_set_cpu_affinity
fi
python benchmarks/gpt_fast/benchmark.py --output "${TEST_REPORTS_DIR}/gpt_fast_benchmark.csv"
}
@ -632,8 +668,7 @@ test_inductor_torchbench_smoketest_perf() {
# https://github.com/pytorch/pytorch/actions/runs/7158691360/job/19491437314,
# and thus we lower its threshold to reduce flakiness. If this continues to be a problem,
# we will switch to some other model.
# lowering threshold from 4.9 to 4.7 for cu124. Will bump it up after cuda 12.4.0->12.4.1 update
python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv" -t 4.7
python benchmarks/dynamo/check_perf_csv.py -f "$TEST_REPORTS_DIR/inductor_inference_smoketest.csv" -t 4.9
# Check memory compression ratio for a few models
for test in hf_Albert timm_vision_transformer; do
@ -657,19 +692,30 @@ test_inductor_torchbench_smoketest_perf() {
}
test_inductor_get_core_number() {
echo $(($(lscpu | grep 'Socket(s):' | awk '{print $2}') * $(lscpu | grep 'Core(s) per socket:' | awk '{print $4}')))
if [[ "${TEST_CONFIG}" == *aarch64* ]]; then
echo "$(($(lscpu | grep 'Cluster(s):' | awk '{print $2}') * $(lscpu | grep 'Core(s) per cluster:' | awk '{print $4}')))"
else
echo "$(($(lscpu | grep 'Socket(s):' | awk '{print $2}') * $(lscpu | grep 'Core(s) per socket:' | awk '{print $4}')))"
fi
}
test_inductor_set_cpu_affinity(){
# Set jemalloc
JEMALLOC_LIB="/usr/lib/x86_64-linux-gnu/libjemalloc.so.2"
IOMP_LIB="$(dirname "$(which python)")/../lib/libiomp5.so"
export LD_PRELOAD="$JEMALLOC_LIB":"$IOMP_LIB":"$LD_PRELOAD"
JEMALLOC_LIB="$(find /usr/lib -name libjemalloc.so.2)"
export LD_PRELOAD="$JEMALLOC_LIB":"$LD_PRELOAD"
export MALLOC_CONF="oversize_threshold:1,background_thread:true,metadata_thp:auto,dirty_decay_ms:-1,muzzy_decay_ms:-1"
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
if [[ "${TEST_CONFIG}" != *aarch64* ]]; then
# Use Intel OpenMP for x86
IOMP_LIB="$(dirname "$(which python)")/../lib/libiomp5.so"
export LD_PRELOAD="$IOMP_LIB":"$LD_PRELOAD"
export KMP_AFFINITY=granularity=fine,compact,1,0
export KMP_BLOCKTIME=1
fi
cores=$(test_inductor_get_core_number)
export OMP_NUM_THREADS=$cores
end_core=$((cores-1))
export TASKSET="taskset -c 0-$end_core"
}
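A minimal sketch of how the exported TASKSET is meant to be consumed by the benchmark invocations below (the core count is illustrative):
test_inductor_set_cpu_affinity                                          # suppose 32 cores: OMP_NUM_THREADS=32, TASKSET="taskset -c 0-31"
$TASKSET python -c "import os; print(sorted(os.sched_getaffinity(0)))"  # confirms the process is pinned to cores 0-31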
test_inductor_torchbench_cpu_smoketest_perf(){
@ -677,7 +723,6 @@ test_inductor_torchbench_cpu_smoketest_perf(){
mkdir -p "$TEST_REPORTS_DIR"
test_inductor_set_cpu_affinity
end_core=$(( $(test_inductor_get_core_number)-1 ))
MODELS_SPEEDUP_TARGET=benchmarks/dynamo/expected_ci_speedup_inductor_torchbench_cpu.csv
grep -v '^ *#' < "$MODELS_SPEEDUP_TARGET" | while IFS=',' read -r -a model_cfg
@ -694,11 +739,11 @@ test_inductor_torchbench_cpu_smoketest_perf(){
local output_name="$TEST_REPORTS_DIR/inductor_inference_${model_cfg[0]}_${model_cfg[1]}_${model_cfg[2]}_${model_cfg[3]}_cpu_smoketest.csv"
if [[ ${model_cfg[3]} == "dynamic" ]]; then
taskset -c 0-"$end_core" python benchmarks/dynamo/torchbench.py \
$TASKSET python benchmarks/dynamo/torchbench.py \
--inference --performance --"$data_type" -dcpu -n50 --only "$model_name" --dynamic-shapes \
--dynamic-batch-only --freezing --timeout 9000 --"$backend" --output "$output_name"
else
taskset -c 0-"$end_core" python benchmarks/dynamo/torchbench.py \
$TASKSET python benchmarks/dynamo/torchbench.py \
--inference --performance --"$data_type" -dcpu -n50 --only "$model_name" \
--freezing --timeout 9000 --"$backend" --output "$output_name"
fi
@ -706,6 +751,17 @@ test_inductor_torchbench_cpu_smoketest_perf(){
# The threshold value needs to be actively maintained to make this check useful.
python benchmarks/dynamo/check_perf_csv.py -f "$output_name" -t "$speedup_target"
done
# Add a few ABI-compatible accuracy tests for CPU. These can be removed once we turn on ABI-compatible as default.
TORCHINDUCTOR_ABI_COMPATIBLE=1 python benchmarks/dynamo/timm_models.py --device cpu --accuracy \
--bfloat16 --inference --export-aot-inductor --disable-cudagraphs --only adv_inception_v3 \
--output "$TEST_REPORTS_DIR/aot_inductor_smoke_test.csv"
TORCHINDUCTOR_ABI_COMPATIBLE=1 python benchmarks/dynamo/timm_models.py --device cpu --accuracy \
--bfloat16 --inference --export-aot-inductor --disable-cudagraphs --only beit_base_patch16_224 \
--output "$TEST_REPORTS_DIR/aot_inductor_smoke_test.csv"
python benchmarks/dynamo/check_accuracy.py \
--actual "$TEST_REPORTS_DIR/aot_inductor_smoke_test.csv" \
--expected "benchmarks/dynamo/ci_expected_accuracy/aot_inductor_timm_inference.csv"
}
test_torchbench_gcp_smoketest(){
@ -1019,11 +1075,113 @@ test_xla() {
assert_git_not_dirty
}
function check_public_api_test_fails {
test_name=$1
invalid_item_name=$2
invalid_item_desc=$3
echo "Running public API test '${test_name}'..."
test_output=$(python test/test_public_bindings.py -k "${test_name}" 2>&1) && ret=$? || ret=$?
# Ensure test fails correctly.
if [ "$ret" -eq 0 ]; then
cat << EOF
Expected the public API test '${test_name}' to fail after introducing
${invalid_item_desc}, but it succeeded! Check test/test_public_bindings.py
for any changes that may have broken the test.
EOF
return 1
fi
# Ensure invalid item is in the test output.
echo "${test_output}" | grep -q "${invalid_item_name}" && ret=$? || ret=$?
if [ $ret -ne 0 ]; then
cat << EOF
Expected the public API test '${test_name}' to identify ${invalid_item_desc}, but
it didn't! It's possible the test may not have run. Check test/test_public_bindings.py
for any changes that may have broken the test.
EOF
return 1
fi
echo "Success! '${test_name}' identified ${invalid_item_desc} ${invalid_item_name}."
return 0
}
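The 'cmd && ret=$? || ret=$?' pattern used in this helper captures an exit status without tripping 'set -e'; a minimal illustration:
set -e
false && ret=$? || ret=$?
echo "ret=${ret}"   # prints ret=1 and the script keeps running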
# Do NOT run this test before any other tests, like test_python_shard, etc.,
# because this function uninstalls the torch built from the branch and installs
# the torch built on its base commit.
test_forward_backward_compatibility() {
set -x
# First, validate the public API tests in the torch built from the branch.
# Step 1. Make sure the public API test "test_correct_module_names" fails when a new file
# introduces an invalid public API function.
new_filename=$(mktemp XXXXXXXX.py -p "${TORCH_INSTALL_DIR}")
BAD_PUBLIC_FUNC=$(
cat << 'EOF'
def new_public_func():
pass
# valid public API functions have __module__ set correctly
new_public_func.__module__ = None
EOF
)
echo "${BAD_PUBLIC_FUNC}" >> "${new_filename}"
invalid_api="torch.$(basename -s '.py' "${new_filename}").new_public_func"
echo "Created an invalid public API function ${invalid_api}..."
check_public_api_test_fails \
"test_correct_module_names" \
"${invalid_api}" \
"an invalid public API function" && ret=$? || ret=$?
rm -v "${new_filename}"
if [ "$ret" -ne 0 ]; then
exit 1
fi
# Step 2. Make sure that the public API test "test_correct_module_names" fails when an existing
# file is modified to introduce an invalid public API function.
EXISTING_FILEPATH="${TORCH_INSTALL_DIR}/nn/parameter.py"
cp -v "${EXISTING_FILEPATH}" "${EXISTING_FILEPATH}.orig"
echo "${BAD_PUBLIC_FUNC}" >> "${EXISTING_FILEPATH}"
invalid_api="torch.nn.parameter.new_public_func"
echo "Appended an invalid public API function to existing file ${EXISTING_FILEPATH}..."
check_public_api_test_fails \
"test_correct_module_names" \
"${invalid_api}" \
"an invalid public API function" && ret=$? || ret=$?
mv -v "${EXISTING_FILEPATH}.orig" "${EXISTING_FILEPATH}"
if [ "$ret" -ne 0 ]; then
exit 1
fi
# Step 3. Make sure that the public API test "test_modules_can_be_imported" fails when a module
# cannot be imported.
new_module_dir=$(mktemp XXXXXXXX -d -p "${TORCH_INSTALL_DIR}")
echo "invalid syntax garbage" > "${new_module_dir}/__init__.py"
invalid_module_name="torch.$(basename "${new_module_dir}")"
check_public_api_test_fails \
"test_modules_can_be_imported" \
"${invalid_module_name}" \
"a non-importable module" && ret=$? || ret=$?
rm -rv "${new_module_dir}"
if [ "$ret" -ne 0 ]; then
exit 1
fi
# Next, build torch from the merge base.
REPO_DIR=$(pwd)
if [[ "${BASE_SHA}" == "${SHA1}" ]]; then
echo "On trunk, we should compare schemas with torch built from the parent commit"
@ -1225,14 +1383,16 @@ test_executorch() {
assert_git_not_dirty
}
test_linux_aarch64(){
test_linux_aarch64() {
python test/run_test.py --include test_modules test_mkldnn test_mkldnn_fusion test_openmp test_torch test_dynamic_shapes \
test_transformers test_multiprocessing test_numpy_interop --verbose
test_transformers test_multiprocessing test_numpy_interop \
--shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose
# Dynamo tests
python test/run_test.py --include dynamo/test_compile dynamo/test_backends dynamo/test_comptime dynamo/test_config \
dynamo/test_functions dynamo/test_fx_passes_pre_grad dynamo/test_interop dynamo/test_model_output dynamo/test_modules \
dynamo/test_optimizers dynamo/test_recompile_ux dynamo/test_recompiles --verbose
dynamo/test_optimizers dynamo/test_recompile_ux dynamo/test_recompiles \
--shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose
# Inductor tests
python test/run_test.py --include inductor/test_torchinductor inductor/test_benchmark_fusion inductor/test_codecache \
@ -1242,14 +1402,15 @@ test_linux_aarch64(){
inductor/test_max_autotune inductor/test_memory_planning inductor/test_metrics inductor/test_multi_kernel inductor/test_pad_mm \
inductor/test_pattern_matcher inductor/test_perf inductor/test_profiler inductor/test_select_algorithm inductor/test_smoke \
inductor/test_split_cat_fx_passes inductor/test_standalone_compile inductor/test_torchinductor \
inductor/test_torchinductor_codegen_dynamic_shapes inductor/test_torchinductor_dynamic_shapes --verbose
inductor/test_torchinductor_codegen_dynamic_shapes inductor/test_torchinductor_dynamic_shapes \
--shard "$SHARD_NUMBER" "$NUM_TEST_SHARDS" --verbose
}
if ! [[ "${BUILD_ENVIRONMENT}" == *libtorch* || "${BUILD_ENVIRONMENT}" == *-bazel-* ]]; then
(cd test && python -c "import torch; print(torch.__config__.show())")
(cd test && python -c "import torch; print(torch.__config__.parallel_info())")
fi
if [[ "$BUILD_ENVIRONMENT" == *aarch64* ]]; then
if [[ "${BUILD_ENVIRONMENT}" == *aarch64* && "${TEST_CONFIG}" != *perf_cpu_aarch64* ]]; then
test_linux_aarch64
elif [[ "${TEST_CONFIG}" == *backward* ]]; then
test_forward_backward_compatibility
@ -1301,9 +1462,9 @@ elif [[ "${TEST_CONFIG}" == *torchbench* ]]; then
checkout_install_torchbench hf_Bert hf_Albert nanogpt timm_vision_transformer
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_smoketest_perf
elif [[ "${TEST_CONFIG}" == *inductor_torchbench_cpu_smoketest_perf* ]]; then
checkout_install_torchbench timm_vision_transformer phlippe_densenet basic_gnn_gcn \
checkout_install_torchbench timm_vision_transformer phlippe_densenet basic_gnn_edgecnn \
llama_v2_7b_16h resnet50 timm_efficientnet mobilenet_v3_large timm_resnest \
shufflenet_v2_x1_0 hf_GPT2 yolov3 mobilenet_v2 resnext50_32x4d hf_T5_base
functorch_maml_omniglot yolov3 mobilenet_v2 resnext50_32x4d densenet121 mnasnet1_0
PYTHONPATH=$(pwd)/torchbench test_inductor_torchbench_cpu_smoketest_perf
elif [[ "${TEST_CONFIG}" == *torchbench_gcp_smoketest* ]]; then
checkout_install_torchbench
@ -1324,8 +1485,9 @@ elif [[ "${TEST_CONFIG}" == *inductor* ]]; then
install_torchvision
test_inductor_shard "${SHARD_NUMBER}"
if [[ "${SHARD_NUMBER}" == 1 ]]; then
test_inductor_aoti
test_inductor_distributed
if [[ "${BUILD_ENVIRONMENT}" != linux-jammy-py3.9-gcc11-build ]]; then
test_inductor_distributed
fi
fi
elif [[ "${TEST_CONFIG}" == *dynamo* ]]; then
install_torchvision

View File

@ -24,6 +24,12 @@ call %INSTALLER_DIR%\install_sccache.bat
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
if "%USE_XPU%"=="1" (
:: Install xpu support packages
call %INSTALLER_DIR%\install_xpu.bat
if errorlevel 1 exit /b 1
)
:: Miniconda has been installed as part of the Windows AMI with all the dependencies.
:: We just need to activate it here
call %INSTALLER_DIR%\activate_miniconda3.bat
@ -43,6 +49,16 @@ if "%VC_VERSION%" == "" (
)
if errorlevel 1 goto fail
if not errorlevel 0 goto fail
if "%USE_XPU%"=="1" (
:: Activate xpu environment - VS env is required for xpu
call "C:\Program Files (x86)\Intel\oneAPI\setvars.bat"
if errorlevel 1 exit /b 1
:: Reduce build time. Only MTL self-hosted runners are available for now
SET TORCH_XPU_ARCH_LIST=xe-lpg
SET USE_KINETO=0
)
@echo on
popd
@ -65,13 +81,6 @@ set CUDA_PATH_V%VERSION_SUFFIX%=%CUDA_PATH%
set CUDNN_LIB_DIR=%CUDA_PATH%\lib\x64
set CUDA_TOOLKIT_ROOT_DIR=%CUDA_PATH%
set CUDNN_ROOT_DIR=%CUDA_PATH%
set NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt
set PATH=%CUDA_PATH%\bin;%CUDA_PATH%\libnvvp;%PATH%
set CUDNN_LIB_DIR=%CUDA_PATH%\lib\x64
set CUDA_TOOLKIT_ROOT_DIR=%CUDA_PATH%
set CUDNN_ROOT_DIR=%CUDA_PATH%
set NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt
set PATH=%CUDA_PATH%\bin;%CUDA_PATH%\libnvvp;%PATH%
:cuda_build_end

View File

@ -0,0 +1,91 @@
@echo on
REM Description: Install Intel Support Packages on Windows
REM BKM reference: https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpu/2-5.html
set XPU_INSTALL_MODE=%~1
if "%XPU_INSTALL_MODE%"=="" goto xpu_bundle_install_start
if "%XPU_INSTALL_MODE%"=="bundle" goto xpu_bundle_install_start
if "%XPU_INSTALL_MODE%"=="driver" goto xpu_driver_install_start
if "%XPU_INSTALL_MODE%"=="all" goto xpu_driver_install_start
:arg_error
echo Illegal XPU installation mode. The value can be "bundle"/"driver"/"all"
echo If the value is left empty, the default "bundle" mode will be used
exit /b 1
:xpu_driver_install_start
:: TODO Need more testing for driver installation
set XPU_DRIVER_LINK=https://downloadmirror.intel.com/830975/gfx_win_101.5972.exe
curl -o xpu_driver.exe --retry 3 --retry-all-errors -k %XPU_DRIVER_LINK%
echo "XPU Driver installing..."
start /wait "Intel XPU Driver Installer" "xpu_driver.exe"
if errorlevel 1 exit /b 1
del xpu_driver.exe
if "%XPU_INSTALL_MODE%"=="driver" goto xpu_install_end
:xpu_bundle_install_start
set XPU_BUNDLE_PARENT_DIR=C:\Program Files (x86)\Intel\oneAPI
set XPU_BUNDLE_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/9d1a91e2-e8b8-40a5-8c7f-5db768a6a60c/w_intel-for-pytorch-gpu-dev_p_0.5.3.37_offline.exe
set XPU_PTI_URL=https://registrationcenter-download.intel.com/akdlm/IRC_NAS/9d1a91e2-e8b8-40a5-8c7f-5db768a6a60c/w_intel-pti-dev_p_0.9.0.37_offline.exe
set XPU_BUNDLE_VERSION=0.5.3+31
set XPU_PTI_VERSION=0.9.0+36
set XPU_BUNDLE_PRODUCT_NAME=intel.oneapi.win.intel-for-pytorch-gpu-dev.product
set XPU_PTI_PRODUCT_NAME=intel.oneapi.win.intel-pti-dev.product
set XPU_BUNDLE_INSTALLED=0
set XPU_PTI_INSTALLED=0
set XPU_BUNDLE_UNINSTALL=0
set XPU_PTI_UNINSTALL=0
:: Check if XPU bundle is target version or already installed
if exist "%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" goto xpu_bundle_ver_check
goto xpu_bundle_install
:xpu_bundle_ver_check
"%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" --list-products > xpu_bundle_installed_ver.log
for /f "tokens=1,2" %%a in (xpu_bundle_installed_ver.log) do (
if "%%a"=="%XPU_BUNDLE_PRODUCT_NAME%" (
echo %%a Installed Version: %%b
set XPU_BUNDLE_INSTALLED=1
if not "%XPU_BUNDLE_VERSION%"=="%%b" (
start /wait "Installer Title" "%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" --action=remove --eula=accept --silent --product-id %XPU_BUNDLE_PRODUCT_NAME% --product-ver %%b --log-dir uninstall_bundle
set XPU_BUNDLE_UNINSTALL=1
)
)
if "%%a"=="%XPU_PTI_PRODUCT_NAME%" (
echo %%a Installed Version: %%b
set XPU_PTI_INSTALLED=1
if not "%XPU_PTI_VERSION%"=="%%b" (
start /wait "Installer Title" "%XPU_BUNDLE_PARENT_DIR%\Installer\installer.exe" --action=remove --eula=accept --silent --product-id %XPU_PTI_PRODUCT_NAME% --product-ver %%b --log-dir uninstall_bundle
set XPU_PTI_UNINSTALL=1
)
)
)
if errorlevel 1 exit /b 1
if exist xpu_bundle_installed_ver.log del xpu_bundle_installed_ver.log
if "%XPU_BUNDLE_INSTALLED%"=="0" goto xpu_bundle_install
if "%XPU_BUNDLE_UNINSTALL%"=="1" goto xpu_bundle_install
if "%XPU_PTI_INSTALLED%"=="0" goto xpu_pti_install
if "%XPU_PTI_UNINSTALL%"=="1" goto xpu_pti_install
goto xpu_install_end
:xpu_bundle_install
curl -o xpu_bundle.exe --retry 3 --retry-all-errors -k %XPU_BUNDLE_URL%
echo "XPU Bundle installing..."
start /wait "Intel Pytorch Bundle Installer" "xpu_bundle.exe" --action=install --eula=accept --silent --log-dir install_bundle
if errorlevel 1 exit /b 1
del xpu_bundle.exe
:xpu_pti_install
curl -o xpu_pti.exe --retry 3 --retry-all-errors -k %XPU_PTI_URL%
echo "XPU PTI installing..."
start /wait "Intel PTI Installer" "xpu_pti.exe" --action=install --eula=accept --silent --log-dir install_bundle
if errorlevel 1 exit /b 1
del xpu_pti.exe
:xpu_install_end

View File

@ -40,7 +40,6 @@ set CUDA_PATH_V%VERSION_SUFFIX%=%CUDA_PATH%
set CUDNN_LIB_DIR=%CUDA_PATH%\lib\x64
set CUDA_TOOLKIT_ROOT_DIR=%CUDA_PATH%
set CUDNN_ROOT_DIR=%CUDA_PATH%
set NVTOOLSEXT_PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt
set PATH=%CUDA_PATH%\bin;%CUDA_PATH%\libnvvp;%PATH%
set NUMBAPRO_CUDALIB=%CUDA_PATH%\bin
set NUMBAPRO_LIBDEVICE=%CUDA_PATH%\nvvm\libdevice

View File

@ -31,6 +31,6 @@ if ERRORLEVEL 1 exit /b 1
:: Run tests C++-side and load the exported script module.
cd build
set PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt\bin\x64;%TMP_DIR_WIN%\build\torch\lib;%PATH%
set PATH=%TMP_DIR_WIN%\build\torch\lib;%PATH%
test_custom_backend.exe model.pt
if ERRORLEVEL 1 exit /b 1

View File

@ -31,6 +31,6 @@ if ERRORLEVEL 1 exit /b 1
:: Run tests C++-side and load the exported script module.
cd build
set PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt\bin\x64;%TMP_DIR_WIN%\build\torch\lib;%PATH%
set PATH=%TMP_DIR_WIN%\build\torch\lib;%PATH%
test_custom_ops.exe model.pt
if ERRORLEVEL 1 exit /b 1

View File

@ -5,7 +5,7 @@ if errorlevel 1 exit /b 1
set CWD=%cd%
set CPP_TESTS_DIR=%TMP_DIR_WIN%\build\torch\bin
set PATH=C:\Program Files\NVIDIA Corporation\NvToolsExt\bin\x64;%TMP_DIR_WIN%\build\torch\lib;%PATH%
set PATH=%TMP_DIR_WIN%\build\torch\lib;%PATH%
set TORCH_CPP_TEST_MNIST_PATH=%CWD%\test\cpp\api\mnist
python tools\download_mnist.py --quiet -d %TORCH_CPP_TEST_MNIST_PATH%

View File

@ -40,6 +40,12 @@ python -m pip install pytest-rerunfailures==10.3 pytest-cpp==2.3.0 tensorboard==
# Install Z3 optional dependency for Windows builds.
python -m pip install z3-solver==4.12.2.0
# Install tlparse for test\dynamo\test_structured_trace.py UTs.
python -m pip install tlparse==0.3.25
# Install parameterized
python -m pip install parameterized==0.8.1
run_tests() {
# Run nvidia-smi if available
for path in '/c/Program Files/NVIDIA Corporation/NVSMI/nvidia-smi.exe' /c/Windows/System32/nvidia-smi.exe; do

View File

@ -116,15 +116,14 @@ if [[ "$PACKAGE_TYPE" == libtorch ]]; then
cd /tmp/libtorch
fi
if [[ "$GPU_ARCH_TYPE" == xpu ]]; then
# Workaround for __mkl_tmp_MOD unbound variable issue, refer https://github.com/pytorch/pytorch/issues/130543
set +u
source /opt/intel/oneapi/pytorch-gpu-dev-0.5/oneapi-vars.sh
fi
# Test the package
/builder/check_binary.sh
if [[ "\$GPU_ARCH_TYPE" != *s390x* && "\$GPU_ARCH_TYPE" != *xpu* && "\$GPU_ARCH_TYPE" != *rocm* && "$PACKAGE_TYPE" != libtorch ]]; then
# Exclude s390, xpu, rocm and libtorch builds from smoke testing
python /builder/test/smoke_test/smoke_test.py --package=torchonly --torch-compile-check disabled
fi
# Clean temp files
cd /builder && git clean -ffdx

View File

@ -90,7 +90,7 @@ fi
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*rocm.* && $(uname) == "Linux" ]]; then
TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton-rocm.txt)
TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton.txt)
TRITON_REQUIREMENT="pytorch-triton-rocm==${TRITON_VERSION}+${TRITON_SHORTHASH}; ${TRITON_CONSTRAINT}"
fi
if [[ -z "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then
@ -102,10 +102,10 @@ fi
# Set triton via PYTORCH_EXTRA_INSTALL_REQUIREMENTS for triton xpu package
if [[ "$PACKAGE_TYPE" =~ .*wheel.* && -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*xpu.* && $(uname) == "Linux" ]]; then
TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}"
TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}; ${TRITON_CONSTRAINT}"
if [[ -n "$PYTORCH_BUILD_VERSION" && "$PYTORCH_BUILD_VERSION" =~ .*dev.* ]]; then
TRITON_SHORTHASH=$(cut -c1-10 $PYTORCH_ROOT/.ci/docker/ci_commit_pins/triton-xpu.txt)
TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}+${TRITON_SHORTHASH}"
TRITON_REQUIREMENT="pytorch-triton-xpu==${TRITON_VERSION}+${TRITON_SHORTHASH}; ${TRITON_CONSTRAINT}"
fi
if [[ -z "${PYTORCH_EXTRA_INSTALL_REQUIREMENTS:-}" ]]; then
export PYTORCH_EXTRA_INSTALL_REQUIREMENTS="${TRITON_REQUIREMENT}"
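
For illustration, a small Python sketch of how the pinned requirement above is assembled: read the commit pin file, keep the first 10 characters, and append it as a local version tag. The paths and version value are assumptions for the sketch, and the optional environment-marker constraint is omitted.

# Hedged sketch of the TRITON_SHORTHASH / TRITON_REQUIREMENT logic above.
# The pin path and the version string are illustrative assumptions.
from pathlib import Path

def triton_xpu_requirement(pytorch_root: str, triton_version: str, is_dev: bool) -> str:
    requirement = f"pytorch-triton-xpu=={triton_version}"
    if is_dev:
        pin_file = Path(pytorch_root) / ".ci/docker/ci_commit_pins/triton-xpu.txt"
        short_hash = pin_file.read_text().strip()[:10]
        requirement = f"pytorch-triton-xpu=={triton_version}+{short_hash}"
    return requirement

# Example: triton_xpu_requirement("/pytorch", "3.0.0", is_dev=True)
# -> "pytorch-triton-xpu==3.0.0+<first 10 chars of the pinned commit>"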

View File

@ -10,6 +10,11 @@ export SCCACHE_BUCKET=ossci-compiler-cache
export SCCACHE_IGNORE_SERVER_IO_ERROR=1
export VC_YEAR=2019
if [[ "$DESIRED_CUDA" == 'xpu' ]]; then
export VC_YEAR=2022
export USE_SCCACHE=0
fi
echo "Free space on filesystem before build:"
df -h

View File

@ -6,6 +6,10 @@ source "${BINARY_ENV_FILE:-/c/w/env}"
export CUDA_VERSION="${DESIRED_CUDA/cu/}"
export VC_YEAR=2019
if [[ "$DESIRED_CUDA" == 'xpu' ]]; then
export VC_YEAR=2022
fi
pushd "$BUILDER_ROOT"
./windows/internal/smoke_test.bat

View File

@ -7,7 +7,7 @@ max-line-length = 120
# C408 ignored because we like the dict keyword argument syntax
# E501 is not flexible enough, we're using B950 instead
ignore =
E203,E305,E402,E501,E721,E741,F405,F841,F999,W503,W504,C408,E302,W291,E303,
E203,E305,E402,E501,E704,E721,E741,F405,F841,F999,W503,W504,C408,E302,W291,E303,
# shebang has extra meaning in fbcode lints, so I think it's not worth trying
# to line this up with executable bit
EXE001,
@ -55,6 +55,9 @@ per-file-ignores =
torch/distributed/_functional_collectives.py: TOR901
torch/distributed/_spmd/data_parallel.py: TOR901
torch/distributed/_tensor/_collective_utils.py: TOR901
# This is a full package that happen to live within the test
# folder, so ok to skip
test/cpp_extensions/open_registration_extension/pytorch_openreg/_aten_impl.py: TOR901
optional-ascii-coding = True
exclude =
./.git,

View File

@ -3,17 +3,20 @@ self-hosted-runner:
# GitHub hosted x86 Linux runners
- linux.20_04.4x
- linux.20_04.16x
# Repo-specific LF hosted ARC runners
- linux.large.arc
# Organization-wide AWS Linux Runners
- linux.large
- linux.2xlarge
- linux.4xlarge
- linux.9xlarge.ephemeral
- am2.linux.9xlarge.ephemeral
- linux.12xlarge
- linux.12xlarge.ephemeral
- linux.24xlarge
- linux.24xlarge.ephemeral
- linux.arm64.2xlarge
- linux.arm64.2xlarge.ephemeral
- linux.arm64.m7g.4xlarge
- linux.arm64.m7g.4xlarge.ephemeral
- linux.4xlarge.nvidia.gpu
- linux.8xlarge.nvidia.gpu
- linux.16xlarge.nvidia.gpu
@ -36,6 +39,8 @@ self-hosted-runner:
- amz2023.linux.12xlarge
- amz2023.linux.24xlarge
- amz2023.linux.arm64.2xlarge
- amz2023.linux.arm64.m7g.4xlarge
- amz2023.linux.arm64.m7g.4xlarge.ephemeral
- amz2023.linux.4xlarge.nvidia.gpu
- amz2023.linux.8xlarge.nvidia.gpu
- amz2023.linux.16xlarge.nvidia.gpu
@ -54,6 +59,9 @@ self-hosted-runner:
# Repo-specific IBM hosted S390x runner
- linux.s390x
# Organization wide AWS Windows runners
- windows.g4dn.xlarge
- windows.g4dn.xlarge.nonephemeral
- windows.4xlarge
- windows.4xlarge.nonephemeral
- windows.8xlarge.nvidia.gpu
- windows.8xlarge.nvidia.gpu.nonephemeral

View File

@ -41,6 +41,9 @@ outputs:
ci-verbose-test-logs:
description: True if ci-verbose-test-logs label was on PR or [ci-verbose-test-logs] in PR body.
value: ${{ steps.filter.outputs.ci-verbose-test-logs }}
ci-test-showlocals:
description: True if ci-test-showlocals label was on PR or [ci-test-showlocals] in PR body.
value: ${{ steps.filter.outputs.ci-test-showlocals }}
ci-no-test-timeout:
description: True if ci-no-test-timeout label was on PR or [ci-no-test-timeout] in PR body.
value: ${{ steps.filter.outputs.ci-no-test-timeout }}
@ -54,7 +57,7 @@ outputs:
runs:
using: composite
steps:
- uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
- uses: nick-fields/retry@v3.0.0
name: Setup dependencies
env:
GITHUB_TOKEN: ${{ inputs.github-token }}

View File

@ -1,226 +0,0 @@
name: linux-build
inputs:
build-environment:
required: true
description: Top-level label for what's being built/tested.
docker-image-name:
required: true
description: Name of the base docker image to build with.
build-generates-artifacts:
required: false
default: "true"
description: If set, upload generated build artifacts.
build-with-debug:
required: false
default: "false"
description: If set, build in debug mode.
sync-tag:
required: false
default: ""
description: |
If this is set, our linter will use this to make sure that every other
job with the same `sync-tag` is identical.
cuda-arch-list:
required: false
default: "5.2"
description: List of CUDA architectures the CI build should target.
runner:
required: false
default: "linux.2xlarge"
description: Runner label to select worker type
test-matrix:
required: false
type: string
description: |
An optional JSON description of what test configs to run later on. This
is moved here from the Linux test workflow so that we can apply filter
logic using test-config labels earlier and skip unnecessary builds
s3-bucket:
description: S3 bucket to download artifact
required: false
default: "gha-artifacts"
aws-role-to-assume:
description: role to assume for downloading artifacts
required: false
default: ""
GITHUB_TOKEN:
description: GitHub token
required: true
HUGGING_FACE_HUB_TOKEN:
description: Hugging Face Hub token
required: false
default: ""
use_split_build:
description: |
[Experimental] Build a libtorch-only wheel first, then build pytorch such that
its binaries are built from that libtorch wheel.
required: false
type: boolean
default: false
outputs:
docker-image:
value: ${{ steps.calculate-docker-image.outputs.docker-image }}
description: The docker image containing the built PyTorch.
test-matrix:
value: ${{ steps.filter.outputs.test-matrix }}
description: An optional JSON description of what test configs to run later on.
runs:
using: composite
steps:
- name: Setup Linux
uses: ./.github/actions/setup-linux
- name: configure aws credentials
uses: aws-actions/configure-aws-credentials@v3
if: ${{ inputs.aws-role-to-assume != '' }}
with:
role-to-assume: ${{ inputs.aws-role-to-assume }}
role-session-name: gha-linux-build
role-duration-seconds: 10800
aws-region: us-east-1
- name: Calculate docker image
id: calculate-docker-image
uses: pytorch/test-infra/.github/actions/calculate-docker-image@main
with:
docker-image-name: ${{ inputs.docker-image-name }}
- name: Use following to pull public copy of the image
id: print-ghcr-mirror
env:
ECR_DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
shell: bash
run: |
tag=${ECR_DOCKER_IMAGE##*/}
echo "docker pull ghcr.io/pytorch/ci-image:${tag/:/-}"
- name: Pull docker image
uses: pytorch/test-infra/.github/actions/pull-docker-image@main
with:
docker-image: ${{ steps.calculate-docker-image.outputs.docker-image }}
- name: Parse ref
id: parse-ref
shell: bash
run: .github/scripts/parse_ref.py
- name: Get workflow job id
id: get-job-id
uses: ./.github/actions/get-workflow-job-id
if: always()
with:
github-token: ${{ inputs.GITHUB_TOKEN }}
# Apply the filter logic to the build step too if the test-config label is already there
- name: Select all requested test configurations (if the test matrix is available)
id: filter
uses: ./.github/actions/filter-test-configs
with:
github-token: ${{ inputs.GITHUB_TOKEN }}
test-matrix: ${{ inputs.test-matrix }}
job-name: ${{ steps.get-job-id.outputs.job-name }}
- name: Download pytest cache
uses: ./.github/actions/pytest-cache-download
continue-on-error: true
with:
cache_dir: .pytest_cache
job_identifier: ${{ github.workflow }}_${{ inputs.build-environment }}
s3_bucket: ${{ inputs.s3-bucket }}
- name: Build
if: steps.filter.outputs.is-test-matrix-empty == 'False' || inputs.test-matrix == ''
id: build
env:
BUILD_ENVIRONMENT: ${{ inputs.build-environment }}
BRANCH: ${{ steps.parse-ref.outputs.branch }}
# TODO duplicated
AWS_DEFAULT_REGION: us-east-1
PR_NUMBER: ${{ github.event.pull_request.number }}
SHA1: ${{ github.event.pull_request.head.sha || github.sha }}
SCCACHE_BUCKET: ossci-compiler-cache-circleci-v2
SCCACHE_S3_KEY_PREFIX: ${{ github.workflow }}
XLA_CLANG_CACHE_S3_BUCKET_NAME: ossci-compiler-clang-cache-circleci-xla
PR_LABELS: ${{ toJson(github.event.pull_request.labels.*.name) }}
TORCH_CUDA_ARCH_LIST: ${{ inputs.cuda-arch-list }}
DOCKER_IMAGE: ${{ steps.calculate-docker-image.outputs.docker-image }}
XLA_CUDA: ${{ contains(inputs.build-environment, 'xla') && '0' || '' }}
DEBUG: ${{ inputs.build-with-debug == 'true' && '1' || '0' }}
OUR_GITHUB_JOB_ID: ${{ steps.get-job-id.outputs.job-id }}
HUGGING_FACE_HUB_TOKEN: ${{ inputs.HUGGING_FACE_HUB_TOKEN }}
USE_SPLIT_BUILD: ${{ inputs.use_split_build }}
shell: bash
run: |
# detached container should get cleaned up by teardown_ec2_linux
container_name=$(docker run \
-e BUILD_ENVIRONMENT \
-e MAX_JOBS="$(nproc --ignore=2)" \
-e AWS_DEFAULT_REGION \
-e PR_NUMBER \
-e SHA1 \
-e BRANCH \
-e SCCACHE_BUCKET \
-e SCCACHE_S3_KEY_PREFIX \
-e XLA_CUDA \
-e XLA_CLANG_CACHE_S3_BUCKET_NAME \
-e SKIP_SCCACHE_INITIALIZATION=1 \
-e TORCH_CUDA_ARCH_LIST \
-e PR_LABELS \
-e OUR_GITHUB_JOB_ID \
-e HUGGING_FACE_HUB_TOKEN \
-e USE_SPLIT_BUILD \
--env-file="/tmp/github_env_${GITHUB_RUN_ID}" \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
--tty \
--detach \
--user jenkins \
-v "${GITHUB_WORKSPACE}:/var/lib/jenkins/workspace" \
-w /var/lib/jenkins/workspace \
"${DOCKER_IMAGE}"
)
docker exec -t "${container_name}" sh -c '.ci/pytorch/build.sh'
- name: Archive artifacts into zip
if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped'
shell: bash
run: |
zip -1 -r artifacts.zip dist/ build/custom_test_artifacts build/lib build/bin .additional_ci_files
- name: Store PyTorch Build Artifacts on S3
uses: seemethere/upload-artifact-s3@v5
if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped' && inputs.use_split_build != 'true'
with:
name: ${{ inputs.build-environment }}
retention-days: 14
if-no-files-found: error
path: artifacts.zip
s3-bucket: ${{ inputs.s3-bucket }}
- name: Store PyTorch Build Artifacts on S3 for split build
uses: seemethere/upload-artifact-s3@v5
if: inputs.build-generates-artifacts == 'true' && steps.build.outcome != 'skipped' && inputs.use_split_build == 'true'
with:
name: ${{ inputs.build-environment }}-experimental-split-build
retention-days: 14
if-no-files-found: error
path: artifacts.zip
s3-bucket: ${{ inputs.s3-bucket }}
- name: Upload sccache stats
if: steps.build.outcome != 'skipped'
uses: seemethere/upload-artifact-s3@v5
with:
s3-prefix: |
${{ github.repository }}/${{ github.run_id }}/${{ github.run_attempt }}/artifact
retention-days: 365
if-no-files-found: warn
path: sccache-stats-*.json
s3-bucket: ${{ inputs.s3-bucket }}
- name: Teardown Linux
uses: pytorch/test-infra/.github/actions/teardown-linux@main
if: always()

View File

@ -167,6 +167,7 @@ runs:
REENABLED_ISSUES: ${{ steps.keep-going.outputs.reenabled-issues }}
CONTINUE_THROUGH_ERROR: ${{ steps.keep-going.outputs.keep-going }}
VERBOSE_TEST_LOGS: ${{ steps.keep-going.outputs.ci-verbose-test-logs }}
TEST_SHOWLOCALS: ${{ steps.keep-going.outputs.ci-test-showlocals }}
NO_TEST_TIMEOUT: ${{ steps.keep-going.outputs.ci-no-test-timeout }}
NO_TD: ${{ steps.keep-going.outputs.ci-no-td }}
TD_DISTRIBUTED: ${{ steps.keep-going.outputs.ci-td-distributed }}

View File

@ -17,7 +17,7 @@ inputs:
runs:
using: composite
steps:
- uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
- uses: nick-fields/retry@v3.0.0
name: Setup dependencies
with:
shell: bash

View File

@ -24,7 +24,7 @@ inputs:
runs:
using: composite
steps:
- uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
- uses: nick-fields/retry@v3.0.0
name: Setup dependencies
with:
shell: bash

View File

@ -44,7 +44,7 @@ runs:
fi
- name: Log in to ECR
uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
uses: nick-fields/retry@v3.0.0
env:
AWS_RETRY_MODE: standard
AWS_MAX_ATTEMPTS: "5"
@ -59,6 +59,13 @@ runs:
aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \
--password-stdin "$AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com"
# For LF Runners we need to make sure we also login to Meta's ECR docker registry too.
META_AWS_ACCOUNT_ID=308535385114
if [ "$AWS_ACCOUNT_ID" != "$META_AWS_ACCOUNT_ID" ] ; then
aws ecr get-login-password --region "$AWS_DEFAULT_REGION" | docker login --username AWS \
--password-stdin "$META_AWS_ACCOUNT_ID.dkr.ecr.$AWS_DEFAULT_REGION.amazonaws.com"
fi
- name: Preserve github env variables for use in docker
shell: bash
run: |

View File

@ -31,7 +31,7 @@ runs:
# retry this step several time similar to how checkout-pytorch GHA does
- name: Cleanup workspace
if: always()
uses: nick-fields/retry@v2.8.2
uses: nick-fields/retry@v3.0.0
env:
EXTRA_DELETE_DIR: ${{ inputs.extra-delete-dir }}
with:

View File

@ -1 +1 @@
69b2a0adc2ec03ab99990d7e8be3d4510438c148
ba696ea3dfec4cbe693bf06a84c75dc196077f5b

View File

@ -1 +1 @@
5ea4535f0699f366adb554183a65ebf7dc34a8be
2eb4a60ed14a38260b85b0c765161f0ce45be6d1

View File

@ -1,13 +1,50 @@
# Use this to auto apply labels based on other labels. Applies to both PRs and
# issues. Currently only supports any and all
- any:
- "module: custom operators"
- "module: opcheck"
then:
- "module: custom-operators"
- any:
- "module: custom-operators"
- "module: functionalization"
- "module: aotdispatch"
- "module: higher order operators"
- "module: fakeTensor"
- "module: ProxyTensor"
- "module: library"
- "module: reinplacing"
then:
- "module: pt2-dispatcher"
- any:
- "module: vmap"
then:
- "module: functorch"
- any:
- "module: reinplacing"
then:
- "module: inductor"
- any:
- "module: pt2 optimizer"
then:
- "module: dynamo"
- any:
- "module: flex attention"
then:
- "module: higher order operators"
- any:
- "module: aotinductor"
then:
- "oncall: export"
- any:
- "module: dynamo"
- "module: pt2-dispatcher"
- "module: inductor"
- "module: aotinductor"
- "module: cudagraphs"
- "oncall: export"
- "module: startup-tracing-compile"
- "module: compiled autograd"
- "module: flex attention"
- "module: dynamic shapes"
then:
- "oncall: pt2"
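
The file above only uses any rules; as a rough sketch of how such label-to-label rules could be evaluated (this is illustrative, not the actual labeling bot), the snippet below applies any/all triggers to a set of existing labels until no new labels are added.

# Hedged sketch: evaluate "any"/"all" label-to-label rules like the YAML above.
# Illustration only; not the bot that actually consumes this file.
from typing import Dict, List, Set

def apply_rules(labels: Set[str], rules: List[Dict[str, List[str]]]) -> Set[str]:
    result = set(labels)
    changed = True
    while changed:  # iterate to a fixed point so chained rules fire
        changed = False
        for rule in rules:
            then = set(rule.get("then", []))
            if "any" in rule:
                hit = any(lbl in result for lbl in rule["any"])
            elif "all" in rule:
                hit = all(lbl in result for lbl in rule["all"])
            else:
                hit = False
            if hit and not then.issubset(result):
                result |= then
                changed = True
    return result

example_rules = [
    {"any": ["module: vmap"], "then": ["module: functorch"]},
    {"any": ["module: dynamo", "module: inductor"], "then": ["oncall: pt2"]},
]
print(apply_rules({"module: vmap", "module: dynamo"}, example_rules))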

.github/labeler.yml
View File

@ -29,7 +29,6 @@
- torch/fx/experimental/recording.py
- torch/fx/experimental/sym_node.py
- torch/fx/experimental/validator.py
- torch/fx/experimental/_sym_dispatch_mode.py
- torch/fx/experimental/proxy_tensor.py
- test/distributed/_tensor/test_dtensor_compile.py
- test/distributed/tensor/parallel/test_fsdp_2d_parallel.py

View File

@ -1,13 +1,27 @@
# Defines runner types that will be provisioned by LF Self-hosted
# runners for pytorch/pytorch-canary and their labels.
# This file is generated by .github/scripts/validate_scale_config.py in test-infra
# It defines runner types that will be provisioned by LF Self-hosted runners
# scale-config.yml:
# Powers what instance types are available for GHA auto-scaled
# runners. Runners listed here will be available as self hosted
# runners, configuration is directly pulled from the main branch.
#
# Runners listed here will be available as self hosted runners.
# Configuration is directly pulled from the main branch.
#
# Default values:
# NOTES:
# - Linux runners are by default non-ephemeral to reduce the amount of CreateInstances calls
# to avoid RequestLimitExceeded issues
# - When updating this file, run the following command to validate the YAML and to generate
# corresponding versions of scale-config for the pytorch/pytorch repo and merge the
# pytorch/pytorch changes before merging these changes.
# `python .github/scripts/validate_scale_config.py --test-infra-repo-root [path_to_test-infra_root] --pytorch-repo-root [path_to_pytorch_root]`
#
# TODO: Add some documentation on how the auto-scaling works
#
# NOTE: Default values,
#
# runner_types:
# runner_label: # label to specify in the Github Actions workflow
# runner_label:
# instance_type: m4.large
# os: linux
# max_available: 20
@ -21,107 +35,202 @@ runner_types:
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.linux.10xlarge.avx2:
disk_size: 200
instance_type: m4.10xlarge
is_ephemeral: false
max_available: 450
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.linux.24xl.spr-metal:
disk_size: 200
instance_type: c7i.metal-24xl
is_ephemeral: false
max_available: 30
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.linux.16xlarge.spr:
disk_size: 200
instance_type: c7i.16xlarge
is_ephemeral: false
max_available: 30
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.linux.9xlarge.ephemeral:
disk_size: 200
instance_type: c5.9xlarge
is_ephemeral: true
max_available: 50
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
lf.c.linux.12xlarge.ephemeral:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: true
max_available: 300
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.linux.16xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.16xlarge
is_ephemeral: false
max_available: 30
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.linux.24xlarge:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: false
max_available: 250
max_available: 500
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.linux.24xlarge.ephemeral:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.linux.2xlarge:
disk_size: 150
instance_type: c5.2xlarge
is_ephemeral: false
max_available: 3120
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.linux.4xlarge:
disk_size: 150
instance_type: c5.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.linux.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.4xlarge
is_ephemeral: false
max_available: 520
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.linux.8xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.8xlarge
is_ephemeral: false
max_available: 400
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.linux.g4dn.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g4dn.12xlarge
is_ephemeral: false
max_available: 50
max_available: 250
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.linux.g4dn.metal.nvidia.gpu:
disk_size: 150
instance_type: g4dn.metal
is_ephemeral: false
max_available: 30
max_available: 300
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.linux.g5.48xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.48xlarge
is_ephemeral: false
max_available: 20
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.linux.g5.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.12xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.linux.g5.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 1200
max_available: 2400
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.linux.g6.4xlarge.experimental.nvidia.gpu:
disk_size: 150
instance_type: g6.4xlarge
is_ephemeral: false
max_available: 50
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.linux.large:
max_available: 1200
disk_size: 15
instance_type: c5.large
is_ephemeral: false
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.linux.arm64.2xlarge:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: false
max_available: 200
os: linux
lf.c.linux.arm64.m7g.2xlarge:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
lf.c.linux.arm64.m7g.4xlarge:
disk_size: 256
instance_type: m7g.2xlarge
instance_type: m7g.4xlarge
is_ephemeral: false
max_available: 20
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
lf.c.linux.arm64.2xlarge.ephemeral:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
lf.c.linux.arm64.m7g.4xlarge.ephemeral:
disk_size: 256
instance_type: m7g.4xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
lf.c.linux.arm64.m7g.metal:
disk_size: 256
instance_type: m7g.metal
is_ephemeral: false
max_available: 100
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
lf.c.windows.g4dn.xlarge:
disk_size: 256
instance_type: g4dn.xlarge
is_ephemeral: true
max_available: 100
os: windows
lf.c.windows.g4dn.xlarge.nonephemeral:
disk_size: 256
instance_type: g4dn.xlarge
is_ephemeral: false
max_available: 100
os: windows
lf.c.windows.4xlarge:
disk_size: 256
instance_type: c5d.4xlarge
@ -138,7 +247,7 @@ runner_types:
disk_size: 256
instance_type: p3.2xlarge
is_ephemeral: true
max_available: 150
max_available: 300
os: windows
lf.c.windows.8xlarge.nvidia.gpu.nonephemeral:
disk_size: 256
@ -152,130 +261,3 @@ runner_types:
is_ephemeral: false
max_available: 250
os: windows
### Setup runner types to test the Amazon Linux 2023 AMI
lf.c.amz2023.linux.12xlarge:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.24xl.spr-metal:
disk_size: 200
instance_type: c7i.metal-24xl
is_ephemeral: false
max_available: 30
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.16xlarge.spr:
disk_size: 200
instance_type: c7i.16xlarge
is_ephemeral: false
max_available: 30
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.12xlarge.ephemeral:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: true
max_available: 300
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.16xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.16xlarge
is_ephemeral: false
max_available: 30
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.24xlarge:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: false
max_available: 250
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.2xlarge:
disk_size: 150
instance_type: c5.2xlarge
is_ephemeral: false
max_available: 3120
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.4xlarge:
disk_size: 150
instance_type: c5.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.4xlarge
is_ephemeral: false
max_available: 520
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.8xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.8xlarge
is_ephemeral: false
max_available: 400
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.g4dn.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g4dn.12xlarge
is_ephemeral: false
max_available: 50
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.g4dn.metal.nvidia.gpu:
disk_size: 150
instance_type: g4dn.metal
is_ephemeral: false
max_available: 30
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.g5.48xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.48xlarge
is_ephemeral: false
max_available: 20
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.g5.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.12xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.g5.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 1200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.large:
disk_size: 15
instance_type: c5.large
is_ephemeral: false
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.arm64.2xlarge:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.c.amz2023.linux.arm64.m7g.2xlarge:
disk_size: 256
instance_type: m7g.2xlarge
is_ephemeral: false
max_available: 20
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64

View File

@ -1,13 +1,27 @@
# Defines runner types that will be provisioned by LF Self-hosted
# runners for pytorch/pytorch and their labels.
# This file is generated by .github/scripts/validate_scale_config.py in test-infra
# It defines runner types that will be provisioned by LF Self-hosted runners
# scale-config.yml:
# Powers what instance types are available for GHA auto-scaled
# runners. Runners listed here will be available as self hosted
# runners, configuration is directly pulled from the main branch.
#
# Runners listed here will be available as self hosted runners.
# Configuration is directly pulled from the main branch.
#
# Default values:
# NOTES:
# - Linux runners are by default non-ephemeral to reduce the amount of CreateInstances calls
# to avoid RequestLimitExceeded issues
# - When updating this file, run the following command to validate the YAML and to generate
# corresponding versions of scale-config for the pytorch/pytorch repo and merge the
# pytorch/pytorch changes before merging these changes.
# `python .github/scripts/validate_scale_config.py --test-infra-repo-root [path_to_test-infra_root] --pytorch-repo-root [path_to_pytorch_root]`
#
# TODO: Add some documentation on how the auto-scaling works
#
# NOTE: Default values,
#
# runner_types:
# runner_label: # label to specify in the Github Actions workflow
# runner_label:
# instance_type: m4.large
# os: linux
# max_available: 20
@ -21,107 +35,202 @@ runner_types:
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.linux.10xlarge.avx2:
disk_size: 200
instance_type: m4.10xlarge
is_ephemeral: false
max_available: 450
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.linux.24xl.spr-metal:
disk_size: 200
instance_type: c7i.metal-24xl
is_ephemeral: false
max_available: 30
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.linux.16xlarge.spr:
disk_size: 200
instance_type: c7i.16xlarge
is_ephemeral: false
max_available: 30
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.linux.9xlarge.ephemeral:
disk_size: 200
instance_type: c5.9xlarge
is_ephemeral: true
max_available: 50
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
am2:
ami: amzn2-ami-hvm-2.0.20240306.2-x86_64-ebs
lf.linux.12xlarge.ephemeral:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: true
max_available: 300
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.linux.16xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.16xlarge
is_ephemeral: false
max_available: 30
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.linux.24xlarge:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: false
max_available: 250
max_available: 500
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.linux.24xlarge.ephemeral:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.linux.2xlarge:
disk_size: 150
instance_type: c5.2xlarge
is_ephemeral: false
max_available: 3120
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.linux.4xlarge:
disk_size: 150
instance_type: c5.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.linux.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.4xlarge
is_ephemeral: false
max_available: 520
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.linux.8xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.8xlarge
is_ephemeral: false
max_available: 400
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.linux.g4dn.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g4dn.12xlarge
is_ephemeral: false
max_available: 50
max_available: 250
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.linux.g4dn.metal.nvidia.gpu:
disk_size: 150
instance_type: g4dn.metal
is_ephemeral: false
max_available: 30
max_available: 300
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.linux.g5.48xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.48xlarge
is_ephemeral: false
max_available: 20
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.linux.g5.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.12xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.linux.g5.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 1200
max_available: 2400
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.linux.g6.4xlarge.experimental.nvidia.gpu:
disk_size: 150
instance_type: g6.4xlarge
is_ephemeral: false
max_available: 50
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
variants:
amz2023:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.linux.large:
max_available: 1200
disk_size: 15
instance_type: c5.large
is_ephemeral: false
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.linux.arm64.2xlarge:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: false
max_available: 200
os: linux
lf.linux.arm64.m7g.2xlarge:
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
lf.linux.arm64.m7g.4xlarge:
disk_size: 256
instance_type: m7g.2xlarge
instance_type: m7g.4xlarge
is_ephemeral: false
max_available: 20
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
lf.linux.arm64.2xlarge.ephemeral:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
lf.linux.arm64.m7g.4xlarge.ephemeral:
disk_size: 256
instance_type: m7g.4xlarge
is_ephemeral: true
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
lf.linux.arm64.m7g.metal:
disk_size: 256
instance_type: m7g.metal
is_ephemeral: false
max_available: 100
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-arm64
lf.windows.g4dn.xlarge:
disk_size: 256
instance_type: g4dn.xlarge
is_ephemeral: true
max_available: 100
os: windows
lf.windows.g4dn.xlarge.nonephemeral:
disk_size: 256
instance_type: g4dn.xlarge
is_ephemeral: false
max_available: 100
os: windows
lf.windows.4xlarge:
disk_size: 256
instance_type: c5d.4xlarge
@ -138,7 +247,7 @@ runner_types:
disk_size: 256
instance_type: p3.2xlarge
is_ephemeral: true
max_available: 150
max_available: 300
os: windows
lf.windows.8xlarge.nvidia.gpu.nonephemeral:
disk_size: 256
@ -152,130 +261,3 @@ runner_types:
is_ephemeral: false
max_available: 250
os: windows
### Setup runner types to test the Amazon Linux 2023 AMI
lf.amz2023.linux.12xlarge:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.24xl.spr-metal:
disk_size: 200
instance_type: c7i.metal-24xl
is_ephemeral: false
max_available: 30
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.16xlarge.spr:
disk_size: 200
instance_type: c7i.16xlarge
is_ephemeral: false
max_available: 30
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.12xlarge.ephemeral:
disk_size: 200
instance_type: c5.12xlarge
is_ephemeral: true
max_available: 300
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.16xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.16xlarge
is_ephemeral: false
max_available: 30
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.24xlarge:
disk_size: 150
instance_type: c5.24xlarge
is_ephemeral: false
max_available: 250
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.2xlarge:
disk_size: 150
instance_type: c5.2xlarge
is_ephemeral: false
max_available: 3120
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.4xlarge:
disk_size: 150
instance_type: c5.4xlarge
is_ephemeral: false
max_available: 1000
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.4xlarge
is_ephemeral: false
max_available: 520
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.8xlarge.nvidia.gpu:
disk_size: 150
instance_type: g3.8xlarge
is_ephemeral: false
max_available: 400
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.g4dn.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g4dn.12xlarge
is_ephemeral: false
max_available: 50
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.g4dn.metal.nvidia.gpu:
disk_size: 150
instance_type: g4dn.metal
is_ephemeral: false
max_available: 30
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.g5.48xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.48xlarge
is_ephemeral: false
max_available: 20
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.g5.12xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.12xlarge
is_ephemeral: false
max_available: 150
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.g5.4xlarge.nvidia.gpu:
disk_size: 150
instance_type: g5.4xlarge
is_ephemeral: false
max_available: 1200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.large:
disk_size: 15
instance_type: c5.large
is_ephemeral: false
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.arm64.2xlarge:
disk_size: 256
instance_type: t4g.2xlarge
is_ephemeral: false
max_available: 200
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
lf.amz2023.linux.arm64.m7g.2xlarge:
disk_size: 256
instance_type: m7g.2xlarge
is_ephemeral: false
max_available: 20
os: linux
ami: al2023-ami-2023.5.20240701.0-kernel-6.1-x86_64
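
A minimal sanity-check sketch for files like the two scale-configs above, assuming only the field layout visible in the entries and comments (instance_type, os, disk_size, is_ephemeral, max_available, plus optional variants with ami overrides). It is not the real .github/scripts/validate_scale_config.py mentioned in the header comments.

# Hedged sketch: basic structural check of a scale-config runner_types mapping.
# Not the real validate_scale_config.py; field names come from the entries above.
import yaml  # assumes PyYAML is available

REQUIRED_FIELDS = {"instance_type", "os", "disk_size", "is_ephemeral", "max_available"}

def check_scale_config(path: str) -> list:
    problems = []
    with open(path) as f:
        config = yaml.safe_load(f)
    for name, spec in (config.get("runner_types") or {}).items():
        missing = REQUIRED_FIELDS - set(spec)
        if missing:
            problems.append(f"{name}: missing {sorted(missing)}")
        for variant, override in (spec.get("variants") or {}).items():
            if "ami" not in override:
                problems.append(f"{name}.{variant}: variant defines no ami override")
    return problems

# Example (path is an assumption): print(check_scale_config(".github/lf-scale-config.yml"))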

View File

@ -86,6 +86,18 @@
- pull
- inductor
- name: OSS CI / pytorchbot / slow tests
patterns:
- test/slow_tests.json
approved_by:
- pytorchbot
ignore_flaky_failures: false
mandatory_checks_name:
- EasyCLA
- Lint
- pull
- slow
- name: OSS CI /pytorchbot / Executorch
patterns:
- .ci/docker/ci_commit_pins/executorch.txt
@ -107,8 +119,8 @@
mandatory_checks_name:
- EasyCLA
- Lint
- pull / linux-focal-py3_8-clang9-xla / build
- pull / linux-focal-py3_8-clang9-xla / test (xla, 1, 1, linux.12xlarge)
- pull / linux-focal-py3_9-clang9-xla / build
- pull / linux-focal-py3_9-clang9-xla / test (xla, 1, 1, linux.12xlarge)
- name: Documentation
patterns:
@ -282,9 +294,11 @@
- torch/_C/_distributed*
- torch/csrc/distributed/**
- torch/testing/_internal/distributed/**
- torch/multiprocessing/**
- test/distributed/**
- test/cpp/dist_autograd/**
- test/cpp/rpc/**
- test/*multiprocessing*
approved_by:
- wconstab
- mrshenli
@ -523,6 +537,13 @@
- Skylion007
- ngimel
- peterbell10
- eqy
- jansel
- jeffdaily
- eellison
- anijain2305
- bdhirsh
- zou3519
mandatory_checks_name:
- EasyCLA
- Lint
@ -537,6 +558,8 @@
- ezyang
- dzhulgakov
- malfet
- albanD
- ptrblck
mandatory_checks_name:
- EasyCLA
- Lint

.github/nitpicks.yml (new file)
View File

@ -0,0 +1,5 @@
- markdown: |
## Attention! native_functions.yaml was changed
If you are adding a new function or defaulted argument to native_functions.yaml, you cannot use it from pre-existing Python frontend code until our FC window passes (two weeks). Split your PR into two PRs, one which adds the new C++ functionality, and one that makes use of it from Python, and land them two weeks apart. See https://github.com/pytorch/pytorch/wiki/PyTorch's-Python-Frontend-Backward-and-Forward-Compatibility-Policy#forwards-compatibility-fc for more info.
pathFilter:
- 'aten/src/ATen/native/native_functions.yaml'
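
As a rough sketch of the mechanism this new file feeds (the bot that consumes it is not shown here), the snippet below matches a pull request's changed files against each entry's pathFilter and collects the markdown bodies to post; the glob matching is an assumption of the sketch.

# Hedged sketch: apply nitpicks-style pathFilter rules to a list of changed files.
# Illustration only; the real consumer of .github/nitpicks.yml is not shown here.
from fnmatch import fnmatch
from typing import Dict, List

def matching_nitpicks(changed_files: List[str], nitpicks: List[Dict]) -> List[str]:
    comments = []
    for nitpick in nitpicks:
        patterns = nitpick.get("pathFilter", [])
        if any(fnmatch(path, pattern) for path in changed_files for pattern in patterns):
            comments.append(nitpick["markdown"])
    return comments

example = [{
    "markdown": "## Attention! native_functions.yaml was changed",
    "pathFilter": ["aten/src/ATen/native/native_functions.yaml"],
}]
print(matching_nitpicks(["aten/src/ATen/native/native_functions.yaml"], example))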

View File

@ -9,6 +9,7 @@ ciflow_push_tags:
- ciflow/inductor-rocm
- ciflow/inductor-perf-compare
- ciflow/inductor-micro-benchmark
- ciflow/inductor-micro-benchmark-cpu-x86
- ciflow/inductor-cu124
- ciflow/linux-aarch64
- ciflow/mps

View File

@ -4,4 +4,4 @@ ninja=1.10.2
numpy=1.23.3
pyyaml=6.0
setuptools=68.2.2
typing-extensions=4.9.0
typing-extensions=4.11.0

View File

@ -1,6 +1,7 @@
boto3==1.19.12
hypothesis==6.56.4
expecttest==0.1.6
expecttest==0.2.1
fbscribelogger==0.1.6
librosa>=0.6.2
mpmath==1.3.0
networkx==2.8.7
@ -18,7 +19,7 @@ pytest-rerunfailures==10.3
pytest-flakefinder==1.1.0
scipy==1.10.1
sympy==1.12.1 ; python_version == "3.8"
sympy>=1.13.0 ; python_version >= "3.9"
sympy==1.13.1 ; python_version >= "3.9"
unittest-xml-reporting<=3.2.0,>=2.0.0
xdoctest==1.1.0
filelock==3.6.0
@ -30,3 +31,4 @@ optree==0.12.1
# NB: test_hparams_* from test_tensorboard is failing with protobuf 5.26.0 in
# which the stringify metadata is wrong when escaping double quote
protobuf==3.20.2
parameterized==0.8.1

View File

@ -15,9 +15,7 @@ REPO_DIR = SCRIPT_DIR.parent.parent
def read_triton_pin(device: str = "cuda") -> str:
triton_file = "triton.txt"
if device == "rocm":
triton_file = "triton-rocm.txt"
elif device == "xpu":
if device == "xpu":
triton_file = "triton-xpu.txt"
with open(REPO_DIR / ".ci" / "docker" / "ci_commit_pins" / triton_file) as f:
return f.read().strip()
@ -50,6 +48,25 @@ def patch_init_py(
f.write(orig)
# TODO: remove patch_setup_py() once we have a proper fix for https://github.com/triton-lang/triton/issues/4527
def patch_setup_py(path: Path) -> None:
with open(path) as f:
orig = f.read()
try:
orig = check_and_replace(
orig,
"https://tritonlang.blob.core.windows.net/llvm-builds/",
"https://oaitriton.blob.core.windows.net/public/llvm-builds/",
)
with open(path, "w") as f:
f.write(orig)
except RuntimeError as e:
print(
f"Applying patch_setup_py() for llvm-build package failed: {e}.",
"If you are trying to build a newer version of Triton, you can ignore this.",
)
def build_triton(
*,
version: str,
@ -91,6 +108,9 @@ def build_triton(
else:
check_call(["git", "checkout", commit_hash], cwd=triton_basedir)
# TODO: remove this and patch_setup_py() once we have a proper fix for https://github.com/triton-lang/triton/issues/4527
patch_setup_py(triton_pythondir / "setup.py")
if build_conda:
with open(triton_basedir / "meta.yaml", "w") as meta:
print(

View File

@ -27,6 +27,12 @@ def parse_args() -> Any:
parser = ArgumentParser("Check PR labels")
parser.add_argument("pr_num", type=int)
# add a flag to return a non-zero exit code if the PR does not have the required labels
parser.add_argument(
"--exit-non-zero",
action="store_true",
help="Return a non-zero exit code if the PR does not have the required labels",
)
return parser.parse_args()
@ -41,10 +47,13 @@ def main() -> None:
if not has_required_labels(pr):
print(LABEL_ERR_MSG)
add_label_err_comment(pr)
if args.exit_non_zero:
sys.exit(1)
else:
delete_all_label_err_comments(pr)
except Exception as e:
pass
if args.exit_non_zero:
sys.exit(1)
sys.exit(0)

View File

@ -169,7 +169,8 @@ def create_cherry_pick_branch(
repo.create_branch_and_checkout(branch=cherry_pick_branch)
# We might want to support ghstack later
repo._run_git("cherry-pick", "-x", "-X", "theirs", commit_sha)
# We don't want to resolve conflicts here.
repo._run_git("cherry-pick", "-x", commit_sha)
repo.push(branch=cherry_pick_branch, dry_run=False)
return cherry_pick_branch

View File

@ -505,6 +505,9 @@ def perform_misc_tasks(
"ci-verbose-test-logs",
check_for_setting(labels, pr_body, "ci-verbose-test-logs"),
)
set_output(
"ci-test-showlocals", check_for_setting(labels, pr_body, "ci-test-showlocals")
)
set_output(
"ci-no-test-timeout", check_for_setting(labels, pr_body, "ci-no-test-timeout")
)
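
The new ci-test-showlocals output follows the same convention as the neighboring flags: a setting is considered enabled when its label is on the PR or its bracketed form appears in the PR body. A minimal sketch of such a check, assuming this is roughly what check_for_setting does (the real helper lives elsewhere in filter_test_configs.py):

# Hedged sketch of a check_for_setting-style helper: a setting is on when it is
# present as a PR label or as a bracketed tag in the PR body. Illustration only.
from typing import Set

def check_for_setting(labels: Set[str], pr_body: str, setting: str) -> bool:
    return setting in labels or f"[{setting}]" in pr_body

print(check_for_setting({"ci-test-showlocals"}, "", "ci-test-showlocals"))  # True
print(check_for_setting(set(), "please run with [ci-no-td]", "ci-no-td"))   # True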

View File

@ -18,13 +18,13 @@ from typing import Dict, List, Optional, Tuple
CUDA_ARCHES = ["11.8", "12.1", "12.4"]
CUDA_ARCHES_FULL_VERSION = {"11.8": "11.8.0", "12.1": "12.1.1", "12.4": "12.4.0"}
CUDA_ARCHES_FULL_VERSION = {"11.8": "11.8.0", "12.1": "12.1.1", "12.4": "12.4.1"}
CUDA_ARCHES_CUDNN_VERSION = {"11.8": "9", "12.1": "9", "12.4": "9"}
ROCM_ARCHES = ["6.0", "6.1"]
ROCM_ARCHES = ["6.1", "6.2"]
XPU_ARCHES = ["xpu"]
@ -68,18 +68,18 @@ PYTORCH_EXTRA_INSTALL_REQUIREMENTS = {
"nvidia-nvtx-cu12==12.1.105; platform_system == 'Linux' and platform_machine == 'x86_64'"
),
"12.4": (
"nvidia-cuda-nvrtc-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-runtime-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-nvrtc-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-runtime-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cuda-cupti-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cudnn-cu12==9.1.0.70; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu12==12.4.2.65; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu12==11.2.0.44; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu12==10.3.5.119; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu12==11.6.0.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.3.0.142; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cublas-cu12==12.4.5.8; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cufft-cu12==11.2.1.3; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-curand-cu12==10.3.5.147; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusolver-cu12==11.6.1.9; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-cusparse-cu12==12.3.1.170; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nccl-cu12==2.21.5; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvtx-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.4.99; platform_system == 'Linux' and platform_machine == 'x86_64'"
"nvidia-nvtx-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64' | "
"nvidia-nvjitlink-cu12==12.4.127; platform_system == 'Linux' and platform_machine == 'x86_64'"
),
}
@ -215,7 +215,7 @@ LIBTORCH_CONTAINER_IMAGES: Dict[Tuple[str, str], str] = {
("cpu", CXX11_ABI): f"pytorch/libtorch-cxx11-builder:cpu-{DEFAULT_TAG}",
}
FULL_PYTHON_VERSIONS = ["3.8", "3.9", "3.10", "3.11", "3.12"]
FULL_PYTHON_VERSIONS = ["3.9", "3.10", "3.11", "3.12"]
def translate_desired_cuda(gpu_arch_type: str, gpu_arch_version: str) -> str:
@ -325,6 +325,7 @@ def generate_wheels_matrix(
os: str,
arches: Optional[List[str]] = None,
python_versions: Optional[List[str]] = None,
use_split_build: bool = False,
) -> List[Dict[str, str]]:
package_type = "wheel"
if os == "linux" or os == "linux-aarch64" or os == "linux-s390x":
@ -340,7 +341,7 @@ def generate_wheels_matrix(
if os == "linux":
arches += CPU_CXX11_ABI_ARCH + CUDA_ARCHES + ROCM_ARCHES + XPU_ARCHES
elif os == "windows":
arches += CUDA_ARCHES
arches += CUDA_ARCHES + XPU_ARCHES
elif os == "linux-aarch64":
# Only want the one arch as the CPU type is different and
# uses different build/test scripts
@ -365,13 +366,23 @@ def generate_wheels_matrix(
else arch_version
)
# TODO: Enable python 3.13 on rocm, xpu, aarch64, windows
# TODO: Enable python 3.13 on rocm, aarch64, windows
if (
gpu_arch_type in ["rocm", "xpu"] or os != "linux"
gpu_arch_type == "rocm" or (os != "linux" and os != "linux-s390x")
) and python_version == "3.13":
continue
if use_split_build and (
arch_version not in ["12.4", "12.1", "11.8", "cpu"] or os != "linux"
):
raise RuntimeError(
"Split build is only supported on linux with cuda 12.4, 12.1, 11.8, and cpu.\n"
f"Currently attempting to build on arch version {arch_version} and os {os}.\n"
"Please modify the matrix generation to exclude this combination."
)
# 12.1 linux wheels require PYTORCH_EXTRA_INSTALL_REQUIREMENTS to install
if (
arch_version in ["12.4", "12.1", "11.8"]
and os == "linux"
@ -385,6 +396,7 @@ def generate_wheels_matrix(
"desired_cuda": translate_desired_cuda(
gpu_arch_type, gpu_arch_version
),
"use_split_build": "True" if use_split_build else "False",
"devtoolset": (
"cxx11-abi" if arch_version == "cuda-aarch64" else ""
),
@ -400,7 +412,8 @@ def generate_wheels_matrix(
),
}
)
if arch_version != "cuda-aarch64":
# Special build building to use on Colab. Python 3.11 for 12.1 CUDA
if python_version == "3.11" and arch_version == "12.1":
ret.append(
{
"python_version": python_version,
@ -409,40 +422,16 @@ def generate_wheels_matrix(
"desired_cuda": translate_desired_cuda(
gpu_arch_type, gpu_arch_version
),
"use_split_build": "True",
"use_split_build": "True" if use_split_build else "False",
"devtoolset": "",
"container_image": WHEEL_CONTAINER_IMAGES[arch_version],
"package_type": package_type,
"pytorch_extra_install_requirements": (
PYTORCH_EXTRA_INSTALL_REQUIREMENTS[arch_version] # fmt: skip
if os != "linux-aarch64"
else ""
),
"build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}-split".replace( # noqa: B950
"pytorch_extra_install_requirements": "",
"build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}-full".replace( # noqa: B950
".", "_"
),
}
)
# Special build building to use on Colab. Python 3.10 for 12.1 CUDA
if python_version == "3.10" and arch_version == "12.1":
ret.append(
{
"python_version": python_version,
"gpu_arch_type": gpu_arch_type,
"gpu_arch_version": gpu_arch_version,
"desired_cuda": translate_desired_cuda(
gpu_arch_type, gpu_arch_version
),
"use_split_build": "False",
"devtoolset": "",
"container_image": WHEEL_CONTAINER_IMAGES[arch_version],
"package_type": package_type,
"pytorch_extra_install_requirements": "",
"build_name": f"{package_type}-py{python_version}-{gpu_arch_type}{gpu_arch_version}-full".replace( # noqa: B950
".", "_"
),
}
)
else:
ret.append(
{
@ -452,10 +441,9 @@ def generate_wheels_matrix(
"desired_cuda": translate_desired_cuda(
gpu_arch_type, gpu_arch_version
),
"use_split_build": "True" if use_split_build else "False",
"devtoolset": (
"cxx11-abi"
if arch_version in ["cpu-cxx11-abi", "xpu"]
else ""
"cxx11-abi" if arch_version == "cpu-cxx11-abi" else ""
),
"container_image": WHEEL_CONTAINER_IMAGES[arch_version],
"package_type": package_type,
@ -464,11 +452,12 @@ def generate_wheels_matrix(
),
"pytorch_extra_install_requirements": (
PYTORCH_EXTRA_INSTALL_REQUIREMENTS["12.1"] # fmt: skip
if os != "linux"
if os != "linux" and gpu_arch_type != "xpu"
else ""
),
}
)
return ret
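
A brief usage sketch of the split-build guard added above: requesting a split build for a supported linux/CUDA combination yields matrix entries with use_split_build set, while an unsupported OS trips the new RuntimeError. Assumes the snippet runs from .github/scripts in a pytorch checkout so the module imports directly.

# Hedged usage sketch for generate_wheels_matrix(..., use_split_build=True).
# Assumes execution from .github/scripts so the module is importable as-is.
import generate_binary_build_matrix as build_matrix

# Supported: linux + CUDA 12.4 wheels with the split build enabled.
for entry in build_matrix.generate_wheels_matrix(
    "linux", arches=["12.4"], python_versions=["3.11"], use_split_build=True
):
    print(entry["build_name"], entry["use_split_build"])

# Unsupported: split build outside linux should raise the RuntimeError above.
try:
    build_matrix.generate_wheels_matrix(
        "windows", arches=["12.4"], python_versions=["3.11"], use_split_build=True
    )
except RuntimeError as err:
    print("rejected as expected:", err)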

View File

@ -61,6 +61,7 @@ class BinaryBuildWorkflow:
# Mainly for macos
cross_compile_arm64: bool = False
macos_runner: str = "macos-14-xlarge"
use_split_build: bool = False
def __post_init__(self) -> None:
if self.abi_version:
@ -69,6 +70,9 @@ class BinaryBuildWorkflow:
)
else:
self.build_environment = f"{self.os}-binary-{self.package_type}"
if self.use_split_build:
# added to distinguish concurrency groups
self.build_environment += "-split"
def generate_workflow_file(self, workflow_template: jinja2.Template) -> None:
output_file_path = (
@ -110,6 +114,20 @@ LINUX_BINARY_BUILD_WORFKLOWS = [
isolated_workflow=True,
),
),
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="manywheel",
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.LINUX,
use_split_build=True,
arches=["11.8", "12.1", "12.4", "cpu"],
),
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_BINARIES, LABEL_CIFLOW_BINARIES_WHEEL},
isolated_workflow=True,
),
use_split_build=True,
),
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="conda",
@ -158,10 +176,25 @@ LINUX_BINARY_SMOKE_WORKFLOWS = [
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.LINUX,
arches=["11.8", "12.1", "12.4"],
python_versions=["3.8"],
python_versions=["3.9"],
),
branches="main",
),
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="manywheel",
build_configs=generate_binary_build_matrix.generate_wheels_matrix(
OperatingSystem.LINUX,
arches=["11.8", "12.1", "12.4"],
python_versions=["3.9"],
use_split_build=True,
),
ciflow_config=CIFlowConfig(
labels={LABEL_CIFLOW_PERIODIC},
),
branches="main",
use_split_build=True,
),
BinaryBuildWorkflow(
os=OperatingSystem.LINUX,
package_type="libtorch",

View File

@ -46,16 +46,24 @@ def gh_fetch_url_and_headers(
with urlopen(Request(url, headers=headers, data=data_, method=method)) as conn:
return conn.headers, reader(conn)
except HTTPError as err:
if err.code == 403 and all(
key in err.headers for key in ["X-RateLimit-Limit", "X-RateLimit-Used"]
if (
err.code == 403
and all(
key in err.headers
for key in ["X-RateLimit-Limit", "X-RateLimit-Remaining"]
)
and int(err.headers["X-RateLimit-Remaining"]) == 0
):
print(
f"""Rate limit exceeded:
f"""{url}
Rate limit exceeded:
Used: {err.headers['X-RateLimit-Used']}
Limit: {err.headers['X-RateLimit-Limit']}
Remaining: {err.headers['X-RateLimit-Remaining']}
Resets at: {err.headers['x-RateLimit-Reset']}"""
)
else:
print(f"Error fetching {url} {err}")
raise
@ -160,6 +168,14 @@ def gh_post_commit_comment(
)
def gh_close_pr(org: str, repo: str, pr_num: int, dry_run: bool = False) -> None:
url = f"{GITHUB_API_URL}/repos/{org}/{repo}/pulls/{pr_num}"
if dry_run:
print(f"Dry run closing PR {pr_num}")
else:
gh_fetch_url(url, method="PATCH", data={"state": "closed"})
def gh_delete_comment(org: str, repo: str, comment_id: int) -> None:
url = f"{GITHUB_API_URL}/repos/{org}/{repo}/issues/comments/{comment_id}"
gh_fetch_url(url, method="DELETE")
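
The new rate-limit branch above only logs the X-RateLimit-* headers; a caller that hits the limit could use the same headers to decide how long to back off. A minimal sketch of that idea, not part of github_utils.py:

# Hedged sketch: compute a back-off from the X-RateLimit-* headers inspected
# above. Illustration only; not repo code.
import time
from email.message import Message

def seconds_until_reset(headers: Message) -> int:
    remaining = int(headers.get("X-RateLimit-Remaining", "1"))
    if remaining > 0:
        return 0
    reset_at = int(headers.get("X-RateLimit-Reset", "0"))
    return max(0, reset_at - int(time.time()))

# Example with fabricated headers:
fake = Message()
fake["X-RateLimit-Remaining"] = "0"
fake["X-RateLimit-Reset"] = str(int(time.time()) + 90)
print(f"sleep ~{seconds_until_reset(fake)}s before retrying")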

View File

@ -445,7 +445,6 @@ def retries_decorator(
print(
f'Attempt {idx} of {num_retries} to call {f.__name__} failed with "{e}"'
)
pass
return cast(T, rc)
return wrapper

View File

@ -1,23 +1,96 @@
# flake8: noqa: G004
"""
This runner determinator is used to determine which set of runners to run a
GitHub job on. It uses the first comment of a GitHub issue (by default
https://github.com/pytorch/test-infra/issues/5132) to define the configuration
of which runners should be used to run which job.
The configuration has two parts, the settings and a list of opted-in users,
separated by a line containing "---". If the line is not present, the
settings are considered to be empty with only the second part, the user
list, defined.
The first part is a YAML block that defines the rollout settings. This can be
used to define any settings that are needed to determine which runners to use.
Its fields are defined by the RolloutSettings class below.
The second part is a list of users who are explicitly opted in to the LF fleet.
Each entry in that list can also carry a comma-separated list of additional
features or experiments which the user is opted in to.
The user list has the following rules:
- Users are GitHub usernames, which must start with the @ prefix
- Each user is also a comma-separated list of features/experiments to enable
- A "#" prefix opts the user out of all experiments
Example config:
# A list of experiments that can be opted into.
# This defines the behavior they'll induce when opted into.
# Expected syntax is:
# [experiment_name]: # Name of the experiment. Also used for the label prefix.
# rollout_perc: [int] # % of workflows to run with this experiment when users are not opted in.
experiments:
lf:
rollout_perc: 25
---
# Opt-ins:
# Users can opt into the LF fleet by adding their GitHub username to this list
# and specifying experiments to enable in a comma-separated list.
# Experiments should be from the above list.
@User1,lf,split_build
@User2,lf
@User3,split_build
"""
import logging
import os
import random
from argparse import ArgumentParser
from logging import LogRecord
from typing import Any, Iterable
from typing import Any, Dict, Iterable, List, NamedTuple, Tuple
import yaml
from github import Auth, Github
from github.Issue import Issue
WORKFLOW_LABEL_META = "" # use meta runners
DEFAULT_LABEL_PREFIX = "" # use meta runners
WORKFLOW_LABEL_LF = "lf." # use runners from the linux foundation
WORKFLOW_LABEL_LF_CANARY = "lf.c." # use canary runners from the linux foundation
GITHUB_OUTPUT = os.getenv("GITHUB_OUTPUT", "")
GH_OUTPUT_KEY_AMI = "runner-ami"
GH_OUTPUT_KEY_LABEL_TYPE = "label-type"
SETTING_EXPERIMENTS = "experiments"
LF_FLEET_EXPERIMENT = "lf"
CANARY_FLEET_SUFFIX = ".c"
class Experiment(NamedTuple):
rollout_perc: float = (
0 # Percentage of workflows to experiment on when user is not opted-in.
)
# Add more fields as needed
class Settings(NamedTuple):
"""
Settings for the experiments that can be opted into.
"""
experiments: Dict[str, Experiment] = {}
class ColorFormatter(logging.Formatter):
"""Color codes the log messages based on the log level"""
@ -109,11 +182,14 @@ def get_issue(gh: Github, repo: str, issue_num: int) -> Issue:
def get_potential_pr_author(
gh: Github, repo: str, username: str, ref_type: str, ref_name: str
github_token: str, repo: str, username: str, ref_type: str, ref_name: str
) -> str:
# If the trigger was a new tag added by a bot, this is a ciflow case
# Fetch the actual username from the original PR. The PR number is
# embedded in the tag name: ciflow/<name>/<pr-number>
gh = get_gh_client(github_token)
if username == "pytorch-bot[bot]" and ref_type == "tag":
split_tag = ref_name.split("/")
if (
@ -135,80 +211,233 @@ def get_potential_pr_author(
def is_exception_branch(branch: str) -> bool:
"""
Branches that get opted out of all experiments and should always use Meta runners
"""
return branch.split("/")[0] in {"main", "nightly", "release", "landchecks"}
def get_workflow_type(issue: Issue, workflow_requestors: Iterable[str]) -> str:
def load_yaml(yaml_text: str) -> Any:
try:
first_comment = issue.get_comments()[0].body.strip("\n\t ")
data = yaml.safe_load(yaml_text)
return data
except yaml.YAMLError as exc:
log.exception("Error loading YAML")
raise
if first_comment[0] == "!":
log.info("LF Workflows are disabled for everyone. Using meta runners.")
return WORKFLOW_LABEL_META
elif first_comment[0] == "*":
log.info("LF Workflows are enabled for everyone. Using LF runners.")
return WORKFLOW_LABEL_LF
else:
all_opted_in_users = {
usr_raw.strip("\n\t@ ") for usr_raw in first_comment.split()
}
opted_in_requestors = {
usr for usr in workflow_requestors if usr in all_opted_in_users
}
if opted_in_requestors:
def extract_settings_user_opt_in_from_text(rollout_state: str) -> Tuple[str, str]:
"""
Extracts the text with settings, if any, and the opted in users from the rollout state.
If the issue body contains "---" then the text above that is the settings
and the text below is the list of opted in users.
If it doesn't contain "---" then the settings are empty and the rest is the users.
"""
rollout_state_parts = rollout_state.split("---")
if len(rollout_state_parts) >= 2:
return rollout_state_parts[0], rollout_state_parts[1]
else:
return "", rollout_state
class UserOptins(Dict[str, List[str]]):
"""
Dictionary of users with a list of features they have opted into
"""
def parse_user_opt_in_from_text(user_optin_text: str) -> UserOptins:
"""
Parse the user opt-in text into a mapping of username to the list of features they have opted into.
Users are GitHub usernames with the @ prefix, each followed by a comma-separated list of features/experiments to enable.
- Example line: "@User1,lf,split_build"
- A "#" prefix indicates the user is opted out of all experiments
"""
optins = UserOptins()
for user in user_optin_text.split("\n"):
user = user.strip("\r\n\t -")
if not user or not user.startswith("@"):
# Not a valid user. Skip
continue
if user:
usr_name = user.split(",")[0].strip("@")
optins[usr_name] = [exp.strip(" ") for exp in user.split(",")[1:]]
return optins
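To make the opt-in parsing concrete, here is a small illustrative snippet (not part of the diff; it mirrors the unit tests added later in this change and imports the script as `runner_determinator`, as those tests do):
```python
import runner_determinator as rd

# "@User1,lf,split_build" -> User1 opts into the "lf" and "split_build" experiments.
# The "#"-prefixed line does not start with "@", so it is skipped (treated as opted out).
optins = rd.parse_user_opt_in_from_text("@User1,lf,split_build\n#@User2,lf")
assert optins == {"User1": ["lf", "split_build"]}
```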
def parse_settings_from_text(settings_text: str) -> Settings:
"""
Parse the experiments from the issue body into a Settings object
"""
try:
if settings_text:
# Escape the backtick as well so that we can have the settings in a code block on the GH issue
# for easy reading
# Note: Using ascii for the backtick so that the cat step in _runner-determinator.yml doesn't choke on
# the backtick character in shell commands.
backtick = chr(96) # backtick character
settings_text = settings_text.strip(f"\r\n\t{backtick} ")
settings = load_yaml(settings_text)
# For now we just load experiments. We can expand this if/when we add more settings
experiments = {}
for exp_name, exp_settings in settings.get(SETTING_EXPERIMENTS).items():
valid_settings = {}
for setting in exp_settings:
if setting not in Experiment._fields:
log.warning(
f"Unexpected setting in experiment: {setting} = {exp_settings[setting]}"
)
else:
valid_settings[setting] = exp_settings[setting]
experiments[exp_name] = Experiment(**valid_settings)
return Settings(experiments)
except Exception:
log.exception("Failed to parse settings")
return Settings()
def parse_settings(rollout_state: str) -> Settings:
"""
Parse settings, if any, from the rollout state.
If the issue body contains "---" then the text above that is the settings
and the text below is the list of opted in users.
If it doesn't contain "---" then the settings are empty and the default values are used.
"""
settings_text, _ = extract_settings_user_opt_in_from_text(rollout_state)
return parse_settings_from_text(settings_text)
def parse_users(rollout_state: str) -> UserOptins:
"""
Parse users from the rollout state.
"""
_, users_text = extract_settings_user_opt_in_from_text(rollout_state)
return parse_user_opt_in_from_text(users_text)
def is_user_opted_in(user: str, user_optins: UserOptins, experiment_name: str) -> bool:
"""
Check if a user is opted into an experiment
"""
return experiment_name in user_optins.get(user, [])
def get_runner_prefix(
rollout_state: str, workflow_requestors: Iterable[str], is_canary: bool = False
) -> str:
settings = parse_settings(rollout_state)
user_optins = parse_users(rollout_state)
fleet_prefix = ""
prefixes = []
for experiment_name, experiment_settings in settings.experiments.items():
enabled = False
# Is any workflow_requestor opted in to this experiment?
opted_in_users = [
requestor
for requestor in workflow_requestors
if is_user_opted_in(requestor, user_optins, experiment_name)
]
if opted_in_users:
log.info(
f"{', '.join(opted_in_users)} have opted into experiment {experiment_name}."
)
enabled = True
elif experiment_settings.rollout_perc:
# If no user is opted in, then we randomly enable the experiment based on the rollout percentage
if random.uniform(0, 100) <= experiment_settings.rollout_perc:
log.info(
f"LF Workflows are enabled for {', '.join(opted_in_requestors)}. Using LF runners."
f"Based on rollout percentage of {experiment_settings.rollout_perc}%, enabling experiment {experiment_name}."
)
return WORKFLOW_LABEL_LF
enabled = True
if enabled:
label = experiment_name
if experiment_name == LF_FLEET_EXPERIMENT:
# We give some special treatment to the "lf" experiment since it determines the fleet we use
# - If it's enabled, then we always list its prefix first
# - If we're in the canary branch, then we append ".c" to the lf prefix
if is_canary:
label += CANARY_FLEET_SUFFIX
fleet_prefix = label
else:
log.info(
f"LF Workflows are disabled for {', '.join(workflow_requestors)}. Using meta runners."
)
return WORKFLOW_LABEL_META
prefixes.append(label)
except Exception as e:
if len(prefixes) > 1:
log.error(
f"Failed to get determine workflow type. Falling back to meta runners. Exception: {e}"
f"Only a fleet and one other experiment can be enabled for a job at any time. Enabling {prefixes[0]} and ignoring the rest, which are {', '.join(prefixes[1:])}"
)
return WORKFLOW_LABEL_META
prefixes = prefixes[:1]
# Fleet always comes first
if fleet_prefix:
prefixes.insert(0, fleet_prefix)
return ".".join(prefixes) + "." if prefixes else ""
def get_rollout_state_from_issue(github_token: str, repo: str, issue_num: int) -> str:
"""
Gets the first comment of the issue, which contains the desired rollout state.
The default issue we use - https://github.com/pytorch/test-infra/issues/5132
"""
gh = get_gh_client(github_token)
issue = get_issue(gh, repo, issue_num)
return str(issue.get_comments()[0].body.strip("\n\t "))
def main() -> None:
args = parse_args()
if args.github_ref_type == "branch" and is_exception_branch(args.github_branch):
log.info(f"Exception branch: '{args.github_branch}', using meta runners")
label_type = WORKFLOW_LABEL_META
log.info(
f"Exception branch: '{args.github_branch}', using Meta runners and no experiments."
)
runner_label_prefix = DEFAULT_LABEL_PREFIX
else:
try:
gh = get_gh_client(args.github_token)
# The default issue we use - https://github.com/pytorch/test-infra/issues/5132
issue = get_issue(gh, args.github_issue_repo, args.github_issue)
rollout_state = get_rollout_state_from_issue(
args.github_token, args.github_issue_repo, args.github_issue
)
username = get_potential_pr_author(
gh,
args.github_token,
args.github_repo,
args.github_actor,
args.github_ref_type,
args.github_branch,
)
label_type = get_workflow_type(
issue,
(
args.github_issue_owner,
username,
),
is_canary = args.github_repo == "pytorch/pytorch-canary"
runner_label_prefix = get_runner_prefix(
rollout_state, (args.github_issue_owner, username), is_canary
)
except Exception as e:
log.error(
f"Failed to get issue. Falling back to meta runners. Exception: {e}"
f"Failed to get issue. Defaulting to Meta runners and no experiments. Exception: {e}"
)
label_type = WORKFLOW_LABEL_META
# For Canary builds use canary runners
if args.github_repo == "pytorch/pytorch-canary" and label_type == WORKFLOW_LABEL_LF:
label_type = WORKFLOW_LABEL_LF_CANARY
set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, label_type)
set_github_output(GH_OUTPUT_KEY_LABEL_TYPE, runner_label_prefix)
if __name__ == "__main__":


@ -3,7 +3,7 @@
## Install prerequisites.
```
$ sudo dnf install docker
$ sudo dnf install podman podman-docker jq
```
## Add services.
@ -27,23 +27,48 @@ $ sudo systemctl enable --now qemu-user-static
## Rebuild the image
In order to build or update the `iiilinuxibmcom/actions-runner` image, e.g. to get the
latest OS security fixes, use the following commands:
First build the s390x builder image `docker.io/pytorch/manylinuxs390x-builder`,
using the following commands:
```
$ cd ~
$ git clone https://github.com/pytorch/pytorch
$ cd pytorch
$ git submodule update --init --recursive
$ GPU_ARCH_TYPE=cpu-s390x "$(pwd)/.ci/docker/manywheel/build.sh" manylinuxs390x-builder
$ docker image tag localhost/pytorch/manylinuxs390x-builder docker.io/pytorch/manylinuxs390x-builder:cpu-s390x
$ docker image save -o ~/manywheel-s390x.tar docker.io/pytorch/manylinuxs390x-builder:cpu-s390x
```
The next step is to build the `actions-runner` image using:
```
$ cd self-hosted-builder
$ sudo docker build \
--build-arg repo=<owner>/<name> \
--build-arg token=<***> \
--pull \
-f actions-runner.Dockerfile \
-t iiilinuxibmcom/actions-runner \
-t iiilinuxibmcom/actions-runner.<name> \
.
```
If it fails, ensure that selinux doesn't prevent it from working.
If there are failures, ensure that selinux doesn't prevent it from working.
In the worst case, selinux can be disabled with `setenforce 0`.
Now prepare all necessary files for runner registration:
```
$ sudo mkdir -p /etc/actions-runner/<name>
$ sudo chmod 700 /etc/actions-runner/<name>
$ sudo /bin/cp <github_app_private_key_file> /etc/actions-runner/<name>/key_private.pem
$ sudo echo <github_app_id> | sudo tee /etc/actions-runner/<name>/appid.env
$ sudo echo <github_app_install_id> | sudo tee /etc/actions-runner/<name>/installid.env
$ sudo echo NAME=<worker_name> | sudo tee /etc/actions-runner/<name>/env
$ sudo echo ORG=<github_org> | sudo tee -a /etc/actions-runner/<name>/env
$ cd self-hosted-builder
$ sudo /bin/cp helpers/*.sh /usr/local/bin/
$ sudo chmod 755 /usr/local/bin/app_token.sh /usr/local/bin/gh_token_generator.sh
```
## Autostart the runner.
```


@ -1,12 +1,12 @@
# Self-Hosted IBM Z Github Actions Runner.
# Temporary image: amd64 dependencies.
FROM docker.io/amd64/ubuntu:22.04 as ld-prefix
FROM docker.io/amd64/ubuntu:23.10 as ld-prefix
ENV DEBIAN_FRONTEND=noninteractive
RUN apt-get update && apt-get -y install ca-certificates libicu70 libssl3
RUN apt-get update && apt-get -y install ca-certificates libicu72 libssl3
# Main image.
FROM docker.io/s390x/ubuntu:22.04
FROM docker.io/s390x/ubuntu:23.10
# Packages for pytorch building and testing.
ENV DEBIAN_FRONTEND=noninteractive
@ -16,6 +16,7 @@ RUN apt-get update && apt-get -y install \
gcc \
git \
jq \
zip \
libxml2-dev \
libxslt-dev \
ninja-build \
@ -43,24 +44,28 @@ COPY fs/ /
RUN chmod +x /usr/bin/actions-runner /usr/bin/entrypoint
# install podman
RUN apt -y install podman podman-docker
# amd64 Github Actions Runner.
RUN useradd -m actions-runner
USER actions-runner
WORKDIR /home/actions-runner
RUN curl -L https://github.com/actions/runner/releases/download/v2.309.0/actions-runner-linux-x64-2.309.0.tar.gz | tar -xz
# repository
ARG repo
# set up python virtual environment which is later used by runner.
# build workflows use "python -m pip install ...",
# and it doesn't work for non-root user
RUN virtualenv --system-site-packages venv
# repository token
ARG token
# copy prebuilt manywheel docker image for builds and tests
# build command is:
# GPU_ARCH_TYPE=cpu-s390x "$(pwd)/manywheel/build_docker.sh"
# and save command is:
# docker image save -o manywheel-s390x.tar pytorch/manylinuxs390x-builder:cpu-s390x
#
COPY --chown=actions-runner:actions-runner manywheel-s390x.tar /home/actions-runner/manywheel-s390x.tar
RUN ./config.sh \
--unattended \
--url "https://github.com/${repo}" \
--token "${token}" \
--no-default-labels \
--labels self-hosted,linux.s390x
RUN curl -L https://github.com/actions/runner/releases/download/v2.317.0/actions-runner-linux-x64-2.317.0.tar.gz | tar -xz
ENTRYPOINT ["/usr/bin/entrypoint"]
CMD ["/usr/bin/actions-runner"]


@ -8,12 +8,16 @@ StartLimitIntervalSec=0
Type=simple
Restart=always
ExecStartPre=-/usr/bin/docker rm --force actions-runner.%i
ExecStartPre=-/usr/local/bin/gh_token_generator.sh /etc/actions-runner/%i/appid.env /etc/actions-runner/%i/installid.env /etc/actions-runner/%i/key_private.pem /etc/actions-runner/%i/ghtoken.env
ExecStart=/usr/bin/docker run \
--env-file=/etc/actions-runner/%i/env \
--env-file=/etc/actions-runner/%i/ghtoken.env \
--init \
--interactive \
--name=actions-runner.%i \
--rm \
iiilinuxibmcom/actions-runner
--privileged \
iiilinuxibmcom/actions-runner.%i
ExecStop=/bin/sh -c "docker exec actions-runner.%i kill -INT -- -1"
ExecStop=/bin/sh -c "docker wait actions-runner.%i"
ExecStop=/bin/sh -c "docker rm actions-runner.%i"


@ -2,5 +2,45 @@
set -e -u
# first import docker image
if [ -f ./manywheel-s390x.tar ] ; then
docker image load --input manywheel-s390x.tar
docker image tag docker.io/pytorch/manylinuxs390x-builder:cpu-s390x docker.io/pytorch/manylinuxs390x-builder:cpu-s390x-main
rm -f manywheel-s390x.tar
fi
token_file=registration-token.json
# Generate registration token
curl \
-X POST \
-H "Accept: application/vnd.github.v3+json" \
-H "Authorization: Bearer ${ACCESS_TOKEN}" \
"https://api.github.com/orgs/${ORG}/actions/runners/registration-token" \
-o "$token_file"
unset ACCESS_TOKEN
# register runner as ephemeral runner
# it does one job, stops and unregisters
registration_token=$(jq --raw-output .token "$token_file")
./config.sh \
--unattended \
--ephemeral \
--url "https://github.com/${ORG}" \
--token "${registration_token}" \
--name "${NAME}" \
--no-default-labels \
--labels self-hosted,linux.s390x
unset registration_token
rm -f "$token_file"
# enter into python virtual environment.
# build workflows use "python -m pip install ...",
# and it doesn't work for non-root user
source venv/bin/activate
# Run one job.
./run.sh --once
./run.sh


@ -0,0 +1,84 @@
#!/usr/bin/env bash
#
# Request an ACCESS_TOKEN to be used by a GitHub APP
# Environment variables that need to be set up:
# * APP_ID, the GitHub App's ID
# * INSTALL_ID, the GitHub App's installation ID
# * APP_PRIVATE_KEY, the content of the GitHub App's private key in PEM format.
#
# https://github.com/orgs/community/discussions/24743#discussioncomment-3245300
#
set -o pipefail
_GITHUB_HOST=${GITHUB_HOST:="github.com"}
# If URL is not github.com then use the enterprise api endpoint
if [[ ${GITHUB_HOST} = "github.com" ]]; then
URI="https://api.${_GITHUB_HOST}"
else
URI="https://${_GITHUB_HOST}/api/v3"
fi
API_VERSION=v3
API_HEADER="Accept: application/vnd.github.${API_VERSION}+json"
CONTENT_LENGTH_HEADER="Content-Length: 0"
APP_INSTALLATIONS_URI="${URI}/app/installations"
# JWT parameters based off
# https://docs.github.com/en/developers/apps/building-github-apps/authenticating-with-github-apps#authenticating-as-a-github-app
#
# JWT token issuance and expiration parameters
JWT_IAT_DRIFT=60
JWT_EXP_DELTA=600
JWT_JOSE_HEADER='{
"alg": "RS256",
"typ": "JWT"
}'
build_jwt_payload() {
now=$(date +%s)
iat=$((now - JWT_IAT_DRIFT))
jq -c \
--arg iat_str "${iat}" \
--arg exp_delta_str "${JWT_EXP_DELTA}" \
--arg app_id_str "${APP_ID}" \
'
($iat_str | tonumber) as $iat
| ($exp_delta_str | tonumber) as $exp_delta
| ($app_id_str | tonumber) as $app_id
| .iat = $iat
| .exp = ($iat + $exp_delta)
| .iss = $app_id
' <<< "{}" | tr -d '\n'
}
base64url() {
base64 | tr '+/' '-_' | tr -d '=\n'
}
rs256_sign() {
openssl dgst -binary -sha256 -sign <(echo "$1")
}
request_access_token() {
jwt_payload=$(build_jwt_payload)
encoded_jwt_parts=$(base64url <<<"${JWT_JOSE_HEADER}").$(base64url <<<"${jwt_payload}")
encoded_mac=$(echo -n "$encoded_jwt_parts" | rs256_sign "${APP_PRIVATE_KEY}" | base64url)
generated_jwt="${encoded_jwt_parts}.${encoded_mac}"
auth_header="Authorization: Bearer ${generated_jwt}"
app_installations_response=$(curl -sX POST \
-H "${auth_header}" \
-H "${API_HEADER}" \
--header "X-GitHub-Api-Version: 2022-11-28" \
--url "https://api.github.com/app/installations/${INSTALL_ID}/access_tokens" \
)
echo "$app_installations_response" | jq --raw-output '.token'
}
request_access_token
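For readers less familiar with the jq/openssl pipeline above, the same GitHub App token exchange can be sketched in Python. This is an illustrative equivalent, not part of the diff, and assumes the third-party `PyJWT` and `requests` packages are available:
```python
import time

import jwt       # PyJWT
import requests


def request_access_token(app_id: int, install_id: int, private_key_pem: str) -> str:
    iat = int(time.time()) - 60  # backdate issued-at to tolerate clock drift (JWT_IAT_DRIFT)
    payload = {
        "iat": iat,
        "exp": iat + 600,        # token lifetime (JWT_EXP_DELTA)
        "iss": app_id,           # the GitHub App ID
    }
    encoded_jwt = jwt.encode(payload, private_key_pem, algorithm="RS256")
    resp = requests.post(
        f"https://api.github.com/app/installations/{install_id}/access_tokens",
        headers={
            "Authorization": f"Bearer {encoded_jwt}",
            "Accept": "application/vnd.github+json",
            "X-GitHub-Api-Version": "2022-11-28",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["token"]
```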


@ -0,0 +1,10 @@
#!/usr/bin/env bash
SCRIPT_DIR=$(dirname "$0")
APP_ID=$1
INSTALL_ID=$2
APP_PRIVATE_KEY=$3
DST_FILE="$4"
ACCESS_TOKEN="$(APP_ID="$(<"${APP_ID}")" INSTALL_ID="$(<"${INSTALL_ID}")" APP_PRIVATE_KEY="$(<"${APP_PRIVATE_KEY}")" "${SCRIPT_DIR}/app_token.sh")"
echo "ACCESS_TOKEN=${ACCESS_TOKEN}" > "${DST_FILE}"


@ -1,35 +0,0 @@
#!/bin/bash
set -eoux pipefail
SYNC_BRANCH=pytorch-stable-prototype
git config user.email "fake@example.com"
git config user.name "PyTorch Stable Bot"
git fetch origin main
git fetch origin "$SYNC_BRANCH"
git checkout "$SYNC_BRANCH"
# Using a hardcoded SHA here is a massive speedup as we can skip the entire history of the pytorch GitHub repo.
# This specific SHA was chosen as it was before the "branch point" of the stable branch
for SHA in $(git log ba3b05fdf37ddbc3c301294d6a560a816335e717..origin/main --pretty="%h" -- torch/distributed torch/csrc/distributed test/distributed test/cpp/c10d benchmarks/distributed)
do
# `git merge-base --is-ancestor` exits with code 0 if the given SHA is an ancestor, and non-0 otherwise
if git merge-base --is-ancestor $SHA HEAD || [[ $(git log --grep="(cherry picked from commit $SHA") ]]
then
echo "Skipping $SHA"
continue
fi
echo "Copying $SHA"
git cherry-pick -x "$SHA" -X theirs
git reset --soft HEAD~1
git add torch/distributed torch/csrc/distributed test/distributed test/cpp/c10d benchmarks/distributed
git checkout .
git commit --reuse-message=HEAD@{1}
git clean -f
done
if [[ "${WITH_PUSH}" == true ]]; then
git push
fi


@ -51,6 +51,8 @@ def main() -> None:
for platform_image in platform_images: # type: ignore[attr-defined]
for arch in platform_image.keys(): # type: ignore[attr-defined]
if arch == "cpu-s390x":
continue
tag_image(
platform_image[arch], # type: ignore[index]
default_tag,


@ -18,6 +18,7 @@ def mock_parse_args() -> object:
class Object:
def __init__(self) -> None:
self.pr_num = 76123
self.exit_non_zero = False
return Object()


@ -683,6 +683,7 @@ class TestConfigFilter(TestCase):
def _gen_expected_string(
keep_going: bool = False,
ci_verbose_test_logs: bool = False,
ci_test_showlocals: bool = False,
ci_no_test_timeout: bool = False,
ci_no_td: bool = False,
ci_td_distributed: bool = False,
@ -692,6 +693,7 @@ class TestConfigFilter(TestCase):
return (
f"keep-going={keep_going}\n"
f"ci-verbose-test-logs={ci_verbose_test_logs}\n"
f"ci-test-showlocals={ci_test_showlocals}\n"
f"ci-no-test-timeout={ci_no_test_timeout}\n"
f"ci-no-td={ci_no_td}\n"
f"ci-td-distributed={ci_td_distributed}\n"
@ -733,6 +735,21 @@ class TestConfigFilter(TestCase):
),
"description": "No pipe logs label and no test timeout in PR body",
},
{
"labels": {"ci-test-showlocals"},
"test_matrix": '{include: [{config: "default"}]}',
"job_name": "A job name",
"expected": _gen_expected_string(ci_test_showlocals=True),
"description": "Has ci-test-showlocals",
},
{
"labels": {},
"test_matrix": '{include: [{config: "default"}]}',
"job_name": "A job name",
"pr_body": "[ci-test-showlocals]",
"expected": _gen_expected_string(ci_test_showlocals=True),
"description": "ci-test-showlocals in body",
},
{
"labels": {"ci-no-test-timeout"},
"test_matrix": '{include: [{config: "default"}]}',


@ -0,0 +1,237 @@
from unittest import main, TestCase
from unittest.mock import Mock, patch
import runner_determinator as rd
class TestRunnerDeterminatorIssueParser(TestCase):
def test_parse_settings(self) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 25
otherExp:
rollout_perc: 0
---
Users:
@User1,lf
@User2,lf,otherExp
"""
settings = rd.parse_settings(settings_text)
self.assertTupleEqual(
rd.Experiment(rollout_perc=25),
settings.experiments["lf"],
"lf settings not parsed correctly",
)
self.assertTupleEqual(
rd.Experiment(rollout_perc=0),
settings.experiments["otherExp"],
"otherExp settings not parsed correctly",
)
def test_parse_settings_in_code_block(self) -> None:
settings_text = """
```
experiments:
lf:
rollout_perc: 25
otherExp:
rollout_perc: 0
```
---
Users:
@User1,lf
@User2,lf,otherExp
"""
settings = rd.parse_settings(settings_text)
self.assertTupleEqual(
rd.Experiment(rollout_perc=25),
settings.experiments["lf"],
"lf settings not parsed correctly",
)
self.assertTupleEqual(
rd.Experiment(rollout_perc=0),
settings.experiments["otherExp"],
"otherExp settings not parsed correctly",
)
def test_parse_users(self) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 0
otherExp:
rollout_perc: 0
---
Users:
@User1,lf
@User2,lf,otherExp
"""
users = rd.parse_users(settings_text)
self.assertDictEqual(
{"User1": ["lf"], "User2": ["lf", "otherExp"]},
users,
"Users not parsed correctly",
)
def test_parse_users_without_settings(self) -> None:
settings_text = """
@User1,lf
@User2,lf,otherExp
"""
users = rd.parse_users(settings_text)
self.assertDictEqual(
{"User1": ["lf"], "User2": ["lf", "otherExp"]},
users,
"Users not parsed correctly",
)
class TestRunnerDeterminatorGetRunnerPrefix(TestCase):
def test_opted_in_user(self) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 0
otherExp:
rollout_perc: 0
---
Users:
@User1,lf
@User2,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User1"])
self.assertEqual("lf.", prefix, "Runner prefix not correct for User1")
def test_opted_in_user_two_experiments(self) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 0
otherExp:
rollout_perc: 0
---
Users:
@User1,lf
@User2,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User2"])
self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for User2")
@patch("random.uniform", return_value=50)
def test_opted_out_user(self, mock_uniform: Mock) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 25
otherExp:
rollout_perc: 25
---
Users:
@User1,lf
@User2,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User3"])
self.assertEqual("", prefix, "Runner prefix not correct for user")
@patch("random.uniform", return_value=10)
def test_opted_out_user_was_pulled_in_by_rollout(self, mock_uniform: Mock) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 25
otherExp:
rollout_perc: 25
---
Users:
@User1,lf
@User2,lf,otherExp
"""
# User3 is opted out, but is pulled into both experiments by the 10% rollout
prefix = rd.get_runner_prefix(settings_text, ["User3"])
self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for user")
def test_lf_prefix_always_comes_first(self) -> None:
settings_text = """
experiments:
otherExp:
rollout_perc: 0
lf:
rollout_perc: 0
---
Users:
@User1,lf
@User2,otherExp,lf
"""
prefix = rd.get_runner_prefix(settings_text, ["User2"])
self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for user")
def test_ignores_commented_users(self) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 0
otherExp:
rollout_perc: 0
---
Users:
#@User1,lf
@User2,lf,otherExp
"""
prefix = rd.get_runner_prefix(settings_text, ["User1"])
self.assertEqual("", prefix, "Runner prefix not correct for user")
def test_ignores_extra_experiments(self) -> None:
settings_text = """
experiments:
lf:
rollout_perc: 0
otherExp:
rollout_perc: 0
foo:
rollout_perc: 0
---
Users:
@User1,lf,otherExp,foo
"""
prefix = rd.get_runner_prefix(settings_text, ["User1"])
self.assertEqual("lf.otherExp.", prefix, "Runner prefix not correct for user")
if __name__ == "__main__":
main()


@ -36,6 +36,7 @@ from warnings import warn
import yaml
from github_utils import (
gh_close_pr,
gh_fetch_json_list,
gh_fetch_merge_base,
gh_fetch_url,
@ -1116,15 +1117,20 @@ class GitHubPR:
msg = self.get_title() + f" (#{self.pr_num})\n\n"
msg += msg_body
# Mention PR co-authors
for author_login, author_name in self.get_authors().items():
if author_login != self.get_pr_creator_login():
msg += f"\nCo-authored-by: {author_name}"
msg += f"\nPull Request resolved: {self.get_pr_url()}\n"
msg += f"Approved by: {approved_by_urls}\n"
if ghstack_deps:
msg += f"ghstack dependencies: {', '.join([f'#{pr.pr_num}' for pr in ghstack_deps])}\n"
# Mention PR co-authors, which should be at the end of the message
# And separated from the body by two newlines
first_coauthor = True
for author_login, author_name in self.get_authors().items():
if author_login != self.get_pr_creator_login():
if first_coauthor:
msg, first_coauthor = (msg + "\n", False)
msg += f"\nCo-authored-by: {author_name}"
return msg
def add_numbered_label(self, label_base: str, dry_run: bool) -> None:
@ -1169,11 +1175,11 @@ class GitHubPR:
for pr in additional_merged_prs:
pr.add_numbered_label(MERGE_COMPLETE_LABEL, dry_run)
if comment_id and self.pr_num:
# When the merge process reaches this part, we can assume that the commit
# has been successfully pushed to trunk
merge_commit_sha = repo.rev_parse(name=REMOTE_MAIN_BRANCH)
# When the merge process reaches this part, we can assume that the commit
# has been successfully pushed to trunk
merge_commit_sha = repo.rev_parse(name=self.default_branch())
if comment_id and self.pr_num:
# Finally, upload the record to Rockset. The list of pending and failed
# checks are at the time of the merge
save_merge_record(
@ -1198,6 +1204,17 @@ class GitHubPR:
else:
print("Missing comment ID or PR number, couldn't upload to Rockset")
# Usually Github will see that the commit has "resolves <pr_num>" in the
# commit message and close the PR, but sometimes it doesn't, leading to
# confusion. When it doesn't, we close it manually.
time.sleep(60) # Give Github some time to close the PR
manually_close_merged_pr(
pr=self,
additional_merged_prs=additional_merged_prs,
merge_commit_sha=merge_commit_sha,
dry_run=dry_run,
)
def merge_changes(
self,
repo: GitRepo,
@ -1498,6 +1515,34 @@ def checks_to_markdown_bullets(
]
def manually_close_merged_pr(
pr: GitHubPR,
additional_merged_prs: List[GitHubPR],
merge_commit_sha: str,
dry_run: bool,
) -> None:
def _comment_and_close(pr: GitHubPR, comment: str) -> None:
pr = GitHubPR(pr.org, pr.project, pr.pr_num) # Refresh the PR
if not pr.is_closed():
gh_post_pr_comment(pr.org, pr.project, pr.pr_num, comment, dry_run)
gh_close_pr(pr.org, pr.project, pr.pr_num, dry_run)
message = (
f"This PR (#{pr.pr_num}) was merged in {merge_commit_sha} but it is still open, likely due to a Github bug, "
"so mergebot is closing it manually. If you think this is a mistake, please feel free to reopen and contact Dev Infra."
)
_comment_and_close(pr, message)
for additional_pr in additional_merged_prs:
message = (
f"This PR (#{additional_pr.pr_num}) was merged as part of PR #{pr.pr_num} in the stack under {merge_commit_sha} "
"but it is still open, likely due to a Github bug, so mergebot is closing it manually. "
"If you think this is a mistake, please feel free to reopen and contact Dev Infra."
)
_comment_and_close(additional_pr, message)
print(f"PR {pr.pr_num} and all additional PRs in the stack have been closed.")
@retries_decorator()
def save_merge_record(
comment_id: int,


@ -1,7 +1,7 @@
{%- set upload_artifact_s3_action = "seemethere/upload-artifact-s3@v5" -%}
{%- set download_artifact_s3_action = "seemethere/download-artifact-s3@v4" -%}
{%- set upload_artifact_action = "actions/upload-artifact@v3" -%}
{%- set download_artifact_action = "actions/download-artifact@v3" -%}
{%- set upload_artifact_action = "actions/upload-artifact@v4.4.0" -%}
{%- set download_artifact_action = "actions/download-artifact@v4.1.7" -%}
{%- set timeout_minutes = 240 -%}


@ -52,19 +52,32 @@ env:
!{{ common.concurrency(build_environment) }}
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
{%- for config in build_configs %}
!{{ config["build_name"] }}-build:
if: ${{ github.repository_owner == 'pytorch' }}
uses: ./.github/workflows/_binary-build-linux.yml
needs: get-label-type
with:!{{ upload.binary_env_as_input(config) }}
{%- if "aarch64" in build_environment %}
runs_on: linux.arm64.m7g.4xlarge
runs_on: linux.arm64.m7g.4xlarge.ephemeral
ALPINE_IMAGE: "arm64v8/alpine"
{%- elif "s390x" in build_environment %}
runs_on: linux.s390x
ALPINE_IMAGE: "docker.io/s390x/alpine"
{%- elif "conda" in build_environment and config["gpu_arch_type"] == "cuda" %}
runs_on: linux.24xlarge
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.24xlarge.ephemeral
{%- else %}
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
{%- endif %}
build_name: !{{ config["build_name"] }}
build_environment: !{{ build_environment }}
@ -80,7 +93,9 @@ jobs:
{%- if config["gpu_arch_type"] != "cuda-aarch64" %}
!{{ config["build_name"] }}-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: !{{ config["build_name"] }}-build
needs:
- !{{ config["build_name"] }}-build
- get-label-type
{%- if config["gpu_arch_type"] not in ["rocm", "xpu"] %}
uses: ./.github/workflows/_binary-test-linux.yml
with:!{{ upload.binary_env_as_input(config) }}
@ -95,8 +110,10 @@ jobs:
{%- elif config["gpu_arch_type"] == "rocm" %}
runs_on: linux.rocm.gpu
{%- elif config["gpu_arch_type"] == "cuda" %}
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge.nvidia.gpu
{%- else %}
runner_prefix: "${{ needs.get-label-type.outputs.label-type }}"
runs_on: linux.4xlarge
{%- endif %}
secrets:


@ -64,9 +64,6 @@ jobs:
{%- if config.pytorch_extra_install_requirements is defined and config.pytorch_extra_install_requirements|d('')|length > 0 %}
PYTORCH_EXTRA_INSTALL_REQUIREMENTS: !{{ config.pytorch_extra_install_requirements }}
{%- endif %}
# For sccache access (only on non-forked PRs)
AWS_ACCESS_KEY_ID: ${{ secrets.MACOS_SCCACHE_S3_ACCESS_KEY_ID }}
AWS_SECRET_ACCESS_KEY: ${{ secrets.MACOS_SCCACHE_S3_SECRET_ACCESS_KEY }}
steps:
!{{ set_runner_specific_vars() }}
- name: Install conda and dependencies
@ -84,7 +81,7 @@ jobs:
!{{ common.checkout(deep_clone=False, directory="pytorch") }}
!{{ common.checkout(deep_clone=False, directory="builder", repository=common.builder_repo, branch=common.builder_branch) }}
- name: Install sccache (only for non-forked PRs, and pushes to trunk)
uses: nick-fields/retry@v2.8.2
uses: nick-fields/retry@v3.0.0
if: ${{ github.event_name == 'push' || github.event.pull_request.head.repo.full_name == github.repository }}
with:
timeout_minutes: 5
@ -104,7 +101,7 @@ jobs:
# shellcheck disable=SC1091
source "${RUNNER_TEMP}/anaconda/bin/activate"
"${PYTORCH_ROOT}/.circleci/scripts/binary_macos_build.sh"
- uses: actions/upload-artifact@v3
- uses: actions/upload-artifact@v4.4.0
if: always()
with:
name: !{{ config["build_name"] }}


@ -45,7 +45,7 @@
{%- if is_windows %}
# This is a dummy value for libtorch to work correctly with our batch scripts
# without this value pip does not get installed for some reason
DESIRED_PYTHON: "3.8"
DESIRED_PYTHON: "3.9"
{%- endif %}
{%- else %}


@ -53,10 +53,24 @@ env:
!{{ common.concurrency(build_environment) }}
jobs:
get-label-type:
name: get-label-type
uses: ./.github/workflows/_runner-determinator.yml
with:
triggering_actor: ${{ github.triggering_actor }}
issue_owner: ${{ github.event.pull_request.user.login || github.event.issue.user.login }}
curr_branch: ${{ github.head_ref || github.ref_name }}
curr_ref_type: ${{ github.ref_type }}
{%- for config in build_configs %}
!{{ config["build_name"] }}-build:
if: ${{ github.repository_owner == 'pytorch' }}
runs-on: windows.4xlarge.nonephemeral
needs: get-label-type
{%- if branches == "nightly" %}
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge"
{%- else %}
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
{%- endif %}
timeout-minutes: !{{ common.timeout_minutes }}
!{{ upload.binary_env(config, True) }}
{%- if config.pytorch_extra_install_requirements is defined and config.pytorch_extra_install_requirements|d('')|length > 0 %}
@ -85,15 +99,17 @@ jobs:
!{{ common.wait_and_kill_ssh_windows('pytorch') }}
!{{ config["build_name"] }}-test: # Testing
if: ${{ github.repository_owner == 'pytorch' }}
needs: !{{ config["build_name"] }}-build
needs:
- !{{ config["build_name"] }}-build
- get-label-type
{%- if config["gpu_arch_type"] == "cuda" %}
{%- if branches == "nightly" %}
runs-on: windows.8xlarge.nvidia.gpu
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.g4dn.xlarge"
{%- else %}
runs-on: windows.8xlarge.nvidia.gpu.nonephemeral
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.g4dn.xlarge.nonephemeral"
{%- endif %}
{%- else %}
runs-on: windows.4xlarge.nonephemeral
runs-on: "${{ needs.get-label-type.outputs.label-type }}windows.4xlarge.nonephemeral"
{%- endif %}
timeout-minutes: !{{ common.timeout_minutes }}
!{{ upload.binary_env(config, True) }}


@ -11,11 +11,16 @@ on:
required: true
type: string
description: The build environment
runner_prefix:
required: false
default: ""
type: string
description: prefix for runner label
runs_on:
required: false
default: linux.12xlarge
default: linux.12xlarge.ephemeral
type: string
description: Hardware to run this "build"job on, linux.12xlarge or linux.arm64.2xlarge.
description: Hardware to run this "build" job on, linux.12xlarge or linux.arm64.2xlarge.
timeout-minutes:
required: false
default: 210
@ -89,7 +94,7 @@ on:
jobs:
build:
runs-on: ${{ inputs.runs_on }}
runs-on: ${{ inputs.runner_prefix}}${{ inputs.runs_on }}
timeout-minutes: ${{ inputs.timeout-minutes }}
env:
PYTORCH_ROOT: ${{ inputs.PYTORCH_ROOT }}
@ -278,7 +283,7 @@ jobs:
# Ensure the working directory gets chowned back to the current user
docker run --rm -v "${RUNNER_TEMP}/artifacts:/v" -w /v "${ALPINE_IMAGE}" chown -R "$(id -u):$(id -g)" .
- uses: actions/upload-artifact@v3
- uses: actions/upload-artifact@v4.4.0
if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' }}
with:
name: ${{ inputs.build_name }}


@ -59,6 +59,11 @@ on:
required: false
type: string
description: Desired python version
runner_prefix:
required: false
default: ""
type: string
description: prefix for runner label
runs_on:
required: true
type: string
@ -77,7 +82,7 @@ on:
jobs:
test:
runs-on: ${{ inputs.runs_on }}
runs-on: ${{ inputs.runner_prefix}}${{ inputs.runs_on }}
timeout-minutes: 240
env:
PYTORCH_ROOT: ${{ inputs.PYTORCH_ROOT }}
@ -205,7 +210,7 @@ jobs:
- name: Download Build Artifacts
if: ${{ steps.filter.outputs.is-test-matrix-empty == 'False' }}
uses: actions/download-artifact@v3
uses: actions/download-artifact@v4.1.7
with:
name: ${{ inputs.build_name }}
path: "${{ runner.temp }}/artifacts/"


@ -126,7 +126,7 @@ jobs:
# NB: When the previous build job is skipped, there won't be any artifacts and
# this step will fail. Binary build jobs can only be skipped on CI, not nightly
continue-on-error: true
uses: actions/download-artifact@v3
uses: actions/download-artifact@v4.1.7
with:
name: ${{ inputs.build_name }}
path: "${{ runner.temp }}/artifacts/"


@ -8,6 +8,11 @@ on:
type: string
description: |
A JSON description of what configs to run later on.
runner_prefix:
required: false
type: string
description: |
Prefix for runner label
defaults:
run:
@ -16,7 +21,7 @@ defaults:
jobs:
filter:
if: github.repository_owner == 'pytorch'
runs-on: [self-hosted, linux.large]
runs-on: [self-hosted, "${{ inputs.runner_prefix }}linux.large"]
outputs:
test-matrix: ${{ steps.filter.outputs.test-matrix }}
is-test-matrix-empty: ${{ steps.filter.outputs.is-test-matrix-empty }}
@ -59,7 +64,7 @@ jobs:
environment-file: .github/requirements/conda-env-${{ runner.os }}-${{ runner.arch }}
- name: Install Buck
uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
uses: nick-fields/retry@v3.0.0
with:
timeout_minutes: 10
max_attempts: 5
@ -69,7 +74,7 @@ jobs:
sudo apt install ./buck.2021.01.12.01_all.deb
- name: Download third party libraries and generate wrappers
uses: nick-fields/retry@3e91a01664abd3c5cd539100d10d33b9c5b68482
uses: nick-fields/retry@v3.0.0
with:
timeout_minutes: 10
max_attempts: 5

Some files were not shown because too many files have changed in this diff.