Update ruff to 0.4.1.
This version fixes a lot of false negatives/false positives, is 20-40% faster, and includes various other bug fixes.
Below is a before-and-after table showing the execution time of ruff lint and ruff format in milliseconds, courtesy of https://astral.sh/blog/ruff-v0.4.0:
| Repository | Linter (v0.3) | Linter (v0.4) | Formatter (v0.3) | Formatter (v0.4) |
|----------------------------------------------------|---------------|---------------|------------------|------------------|
| [pytorch/pytorch](https://github.com/pytorch/pytorch) | 328.7 | 251.8 | 351.1 | 274.9 |
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124549
Approved by: https://github.com/ezyang
Fixes the GELU, LeakyReLU, and Mish activation functions on non-contiguous tensors (for instance, when a transpose operation was applied to the tensors prior to the MPS operator), in both the forward and backward passes.
I also extended the tests on the three activation functions to cover full precision and half precision, contiguous and non-contiguous inputs, and several tensor shapes: scalars, 1D, empty, 2D, and >3D.
I had issues with the Mish and GELU activations when asserting the gradients vs. CPU with sum() in some cases, so I reverted to the previous setup by passing a gradient argument to .backward().
This PR also fixes an issue with LeakyReLU on empty tensors.
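As a rough illustration (a minimal sketch, not taken from the added tests), the failing pattern looked like applying one of these activations to a transposed, hence non-contiguous, tensor on MPS and comparing against CPU:
```python
import torch
import torch.nn.functional as F

cpu_x = torch.randn(8, 4, requires_grad=True)
mps_x = cpu_x.detach().clone().to("mps").requires_grad_(True)

# .t() returns a non-contiguous view; this is the case that used to give
# wrong results for GELU/LeakyReLU/Mish on MPS.
cpu_out = F.gelu(cpu_x.t())
mps_out = F.gelu(mps_x.t())
torch.testing.assert_close(cpu_out, mps_out.cpu())

# Backward with an explicit gradient argument, as mentioned above.
grad = torch.ones_like(cpu_out)
cpu_out.backward(grad)
mps_out.backward(grad.to("mps"))
torch.testing.assert_close(cpu_x.grad, mps_x.grad.cpu())
```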
Fixes #98212, huggingface/transformers#22468, huggingface/transformers#19353
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123049
Approved by: https://github.com/kulinseth
In certain **rare** scenarios, inductor can generate a reduction kernel with really bad perf. E.g., if
- the reduction kernel contains a reduction node followed by a pointwise node,
- the pointwise node uses a transposed layout,
- the reduction node is an inner reduction,
- and rnumel <= 1024,
then inductor will generate a persistent reduction kernel, which causes really bad perf when doing tl.store for the pointwise node since we use a very skinny tile `(XBLOCK=1, RBLOCK=next_power_of_2(rnumel))`.
I've tried a few versions of the fix.
- The first version: if any pointwise node in a reduction kernel uses a non-contiguous dependency, we use ReductionHint.DEFAULT. This caused an 8s compilation time increase for huggingface with no perf wins... The reason is that ReductionHint.DEFAULT does more autotuning.
- Then I changed the code to be more specific. We change the hint from INNER to DEFAULT if we are sure that the pointwise kernel can use a >1 stride for the lowest dimension. Kernels meeting this condition should mostly have really bad perf anyway.
The situation mentioned above is rare, but it has been reported by internal users. I'll also run one more perf test.
Testing script: https://gist.github.com/shunting314/9d3389891fa43633b49b8b7564ad6d8b . Something equivalent is also added as a unit test.
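For reference, a hedged sketch (not the linked gist) of the kind of pattern described above: an inner reduction with rnumel <= 1024 followed by a pointwise op whose output is stored through a transposed layout.
```python
import torch

@torch.compile
def f(x, y):
    # Inner reduction over the last dimension (rnumel = 512 <= 1024).
    s = x.sum(dim=-1, keepdim=True)
    # Pointwise node whose output is materialized in a transposed layout.
    return (s + y).t().contiguous()

# CUDA device assumed; whether inductor picks the problematic persistent
# reduction here depends on the heuristics discussed above.
x = torch.randn(4096, 512, device="cuda")
y = torch.randn(4096, 1, device="cuda")
f(x, y)
```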
For this specific test from user reports, we improve the mentioned reduction kernel's perf by **4.14x** (451us -> 109us).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124131
Approved by: https://github.com/jansel
Summary: This is actually quite noisy and my logs are full of this soft assertion message. Maybe we should make it log only once?
Test Plan:
On the AMD GPU side, I got a lot of these warnings:
```
W0415 01:40:45.109864 917160 collection.cpp:602] Warning: Memcpy ? (? -> ?) (function operator())
```
So just suppress the excessive logs.
Reviewed By: aaronenyeshi, yoyoyocmu
Differential Revision: D55602788
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124469
Approved by: https://github.com/aaronenyeshi
Motivations:
- This is pretty redundant with test_aot_dispatch_dynamic.
- The user story for opcheck is that a user should use opcheck to see if their operator was "registered correctly". If a user's custom op only supports dynamic shapes, then it's a bit awkward for one of the tests (e.g. `test_aot_dispatch_static`) to fail; see the sketch after this list.
- We've already stopped running test_aot_dispatch_static in all of
our opcheck tests.
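For context, a hedged sketch of an opcheck call after this change; the custom op `mylib::my_sin` below is a hypothetical example, and the `torch.library.opcheck` entry point with its `test_utils` argument is assumed. Note that `test_aot_dispatch_static` is not in the list:
```python
import torch

# Hypothetical custom op, only for illustration.
@torch.library.custom_op("mylib::my_sin", mutates_args=())
def my_sin(x: torch.Tensor) -> torch.Tensor:
    return torch.sin(x)

@my_sin.register_fake
def _(x):
    return torch.empty_like(x)

# Run the remaining opcheck tests; test_aot_dispatch_static is no longer part
# of the suite, only the dynamic-shape AOTDispatch test is.
torch.library.opcheck(
    my_sin,
    (torch.randn(3),),
    test_utils=(
        "test_schema",
        "test_autograd_registration",
        "test_faketensor",
        "test_aot_dispatch_dynamic",
    ),
)
```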
Test Plan:
- wait for CI
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124495
Approved by: https://github.com/williamwen42
ghstack dependencies: #124180, #124200, #124299, #124134, #124199, #124403, #124414
This PR adds a `set_reshard_after_backward` method to allow disabling resharding after backward. `reshard_after_backward=False` can be used with `reshard_after_forward=False` to implement "ZeRO-1", where there is only all-gather on the first microbatch forward and reduce-scatter on the last microbatch backward.
```
for microbatch_idx, microbatch in enumerate(dataloader):
    is_last_microbatch = microbatch_idx == num_microbatches - 1
    # Only reduce-scatter gradients and reshard parameters on the last microbatch.
    model.set_requires_gradient_sync(is_last_microbatch)
    model.set_reshard_after_backward(is_last_microbatch)
    model.set_is_last_backward(is_last_microbatch)
    microbatch_fwd_bwd(model, microbatch, microbatch_idx)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124319
Approved by: https://github.com/weifengpy
Summary:
With pre-dispatch export and ep.run_decompositions(), range constraints are updated by looking at ShapeEnv.var_to_range. However, the lower bounds on these may be incorrect: analysis on un-specialized symbols is done with a lower bound of 2, which mismatches user-specified bounds (which may be 0 or 1).
This updates `_get_updated_range_constraints()` to use the old range constraints if possible.
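A hedged sketch of the flow this touches (hypothetical module; the `torch.export.Dim` and `run_decompositions()` APIs are assumed): a user-specified lower bound of 1 should survive `run_decompositions()` instead of being replaced by the analysis-time lower bound of 2.
```python
import torch
from torch.export import Dim, export

class M(torch.nn.Module):
    def forward(self, x):
        return x * 2

batch = Dim("batch", min=1, max=1024)  # user-specified lower bound of 1
ep = export(M(), (torch.randn(4, 8),), dynamic_shapes={"x": {0: batch}})
ep = ep.run_decompositions()
# The lower bound recorded here should reflect the user bound (1), not the
# un-specialized analysis bound (2).
print(ep.range_constraints)
```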
Test Plan: Existing pre-dispatch/dynamic shapes test case.
Differential Revision: D55899872
Pull Request resolved: https://github.com/pytorch/pytorch/pull/123602
Approved by: https://github.com/tugsbayasgalan
The biggest movement is 4% on HF inference and 9% on TIMM inference. Note that this is max-autotune mode, so we are more tolerant of compilation time increases. We could improve compilation time by limiting:
```
# Take how many of the top triton kernels to benchmark epilogue
max_epilogue_benchmarked_choices = 3
```
There is a hf_Whisper failure which you can repro on main without this stack with `TORCHINDUCTOR_MAX_AUTOTUNE_GEMM_BACKENDS=TRITON TORCHINDUCTOR_MAX_AUTOTUNE=1 python benchmarks/dynamo/torchbench.py --backend inductor --amp --accuracy --training --only hf_Whisper`. Turning off epilogue fusion fixes the accuracy. I bisected the failure to an epilogue; however, when you compare the results of that epilogue with the corresponding separate kernels, the outputs are equivalent.
Inference:
<img width="1686" alt="image" src="https://github.com/pytorch/pytorch/assets/11477974/0b240080-cd33-4c08-89d3-583103b1fb0c">
Training:
<img width="1329" alt="Screenshot 2024-04-16 at 6 16 30 PM" src="https://github.com/pytorch/pytorch/assets/11477974/db0afcc9-7288-4c27-84ce-4fc1a5690788">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124031
Approved by: https://github.com/Chillee, https://github.com/shunting314
ghstack dependencies: #124030, #122642, #123229, #122825
gcc is more stringent than clang when equivalently sized NEON registers are cast to each other. In particular, at one point `uint16x4_t` was cast to `int16x4_t`, which gcc does not allow. Added `vreinterpret_s16_u16` (which is a no-op) to solve this; tested in https://godbolt.org/z/sYb4ThM6M
Test plan: Build aarch64 wheels
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124511
Approved by: https://github.com/mikekgfb
Two changes:
- Make the flag for multi template buffer independent of benchmark fusion. While benchmark fusion can be useful, the compilation time/performance trade-offs are different from those for just templates, which we'd like to enable by default.
- Don't do MultiTemplateBuffers/benchmark fusion for templates that have custom input gen fns (which currently only exist internally). Threading the custom input gen fns through to benchmark fusion is NYI.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/122825
Approved by: https://github.com/shunting314
ghstack dependencies: #124030, #122642, #123229
Summary: It seems super confusing that if we set DISABLE_ADDMM_HIP_LT + PYTORCH_TUNABLEOP_ENABLED, the former takes priority. This is because the former goes through gemm_and_bias, while tunable op is integrated with the gemm path. Until we can integrate tunable op with gemm_and_bias, we'll just let tunable op take priority.
Test Plan: Ran a simple linear program and verified the behavior.
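Something along these lines (a hedged sketch, not the exact internal test; a ROCm build is assumed, and only the two env vars named in the summary are used):
```python
import os

# Both knobs from the summary; set before torch is imported so they are
# picked up at initialization.
os.environ["DISABLE_ADDMM_HIP_LT"] = "1"
os.environ["PYTORCH_TUNABLEOP_ENABLED"] = "1"

import torch

# A simple linear program on a ROCm GPU ("cuda" is the HIP device name).
linear = torch.nn.Linear(256, 128, device="cuda")
out = linear(torch.randn(32, 256, device="cuda"))
print(out.shape)  # with this change, the tunable-op GEMM path should be taken
```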
Differential Revision: D56183954
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124161
Approved by: https://github.com/jeffdaily, https://github.com/nmacchioni
Summary:
This is not great, but our ATen-cpu is not completely GPU agnostic. Previously we worked on D54453492 (https://github.com/pytorch/pytorch/pull/121082) and D54528255, but there are a few things we haven't resolved, and it's exploding here. So we'll continue to fix them until all are gone.
This ROCm block is for 4.3, which is very old. I don't think it should be supported anymore, so let's just kill this macro.
Test Plan: CI
Differential Revision: D56172660
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124158
Approved by: https://github.com/jeffdaily, https://github.com/nmacchioni