lint:
- test/test_fake_tensor.py
- test/test_flop_counter.py
- torch/_export/verifier.py
with the same rules as other files. It was a nightmare for me to update tests in one of the skipped files
without being able to lint it locally like other files with `lintrunner -a`.
Note that those files do have active development; they are not old, untouched files.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154261
Approved by: https://github.com/angelayi, https://github.com/Skylion007
Fixes #126268
I've basically followed @ezyang's suggestion (I think) to use `func.decompose(...)`. Since `__torch_dispatch__` won't be called a second time for the same op, I've added a second `TorchDispatchMode` (`_DecomposedCounterMode`) that simply dispatches to the parent flop counter. Using `self` as the inner context manager is not possible, since the second call to `__enter__` would re-initialize the counter's tracking state.
Let me know if there's something wrong with this implementation, since I'm quite unsure how the decomposition machinery actually works :D
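For reference, a minimal sketch of the pattern described above, assuming the parent counter's bookkeeping is reachable through its own `__torch_dispatch__`; this is an illustration, not the exact code merged here:
```
from torch.utils._python_dispatch import TorchDispatchMode

class _DecomposedCounterMode(TorchDispatchMode):
    """Forwards ops produced by func.decompose() back to the parent counter."""

    def __init__(self, parent):
        self.parent = parent  # the active FlopCounterMode

    def __torch_dispatch__(self, func, types, args=(), kwargs=None):
        # Delegate counting to the parent without re-entering it, since a
        # second __enter__ would reset the parent's tracking state.
        return self.parent.__torch_dispatch__(func, types, args, kwargs)

# Inside the parent counter's __torch_dispatch__, roughly:
#     with _DecomposedCounterMode(self):
#         return func.decompose(*args, **kwargs)
```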
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138508
Approved by: https://github.com/ezyang
Co-authored-by: Edward Z. Yang <ezyang@meta.com>
Previously, FlopCounterMode would ignore any custom ops registered
through `register_flop_formula`. The problem was:
- register_flop_formula(target) requires target to be an OpOverloadPacket.
- register_flop_formula used register_decomposition to populate its registry
- register_decomposition resolves the OpOverloadPacket into its OpOverloads before
putting them into the registry
- FlopCounterMode ignores OpOverloads in its registry (it assumes the
registry is a dictionary mapping OpOverloadPacket to flop formula).
register_decomposition is too heavy of a hammer, plus this isn't a
decomposition, so I changed the registration mechanism.
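To make the fixed behavior concrete, here is a hedged sketch of the registration path this targets; the custom op `mylib::my_matmul` is hypothetical, and the formula signature is assumed to match the built-in shape-based formulas:
```
import torch
from torch.utils.flop_counter import FlopCounterMode, register_flop_formula

torch.library.define("mylib::my_matmul", "(Tensor a, Tensor b) -> Tensor")

@torch.library.impl("mylib::my_matmul", "CompositeExplicitAutograd")
def my_matmul(a, b):
    return a @ b

# Register against the OpOverloadPacket, as required by register_flop_formula.
@register_flop_formula(torch.ops.mylib.my_matmul)
def my_matmul_flop(a_shape, b_shape, *args, out_shape=None, **kwargs) -> int:
    m, k = a_shape
    _, n = b_shape
    return 2 * m * k * n  # multiply-adds for an (M, K) @ (K, N) matmul

with FlopCounterMode(display=False) as counter:
    torch.ops.mylib.my_matmul(torch.randn(4, 8), torch.randn(8, 16))
print(counter.get_total_flops())  # expected: 2 * 4 * 8 * 16
```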
Test Plan:
- new tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131777
Approved by: https://github.com/Chillee
Looks like one of the first failures seen is `test_causal_variants_compile_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` when `test_causal_variants_causal_variant_CausalVariant_LOWER_RIGHT_shape0_cuda` passes.
What seems interesting here is that the `torch.compile` version fails while the eager version passes. Not sure what the difference would be here...
Nevertheless, is there a recommended mechanism to skip cuDNN SDPA as a backend for this test? CC @drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125343
Approved by: https://github.com/Skylion007
This patch implements `with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):` by reusing AOTriton's accelerated SDPA implementation.
Known limitations:
- Only supports MI200/MI300X GPUs
- Does not support varlen
- Does not support `CausalVariant`
- Optional arguments `causal_diagonal` and `seqlen_k` in `_efficient_attention_forward/backward` must be null
- Does not work well with inductor's SDPA rewriter. The rewriter has been updated to only use math and flash attention on ROCM.
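A minimal usage sketch of the context manager enabled by this patch, assuming a supported MI200/MI300X device; the shapes are arbitrary:
```
import torch
import torch.nn.functional as F
from torch.nn.attention import sdpa_kernel, SDPBackend

q = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
k = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)
v = torch.randn(2, 8, 128, 64, device="cuda", dtype=torch.float16)

# Restrict SDPA to the memory-efficient backend (AOTriton-backed on ROCm).
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)
```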
This PR also uses a different approach of installing AOTriton binary instead of building it from source in the base docker image. More details on motivation: https://github.com/pytorch/pytorch/pull/124885#issuecomment-2153229129
`PYTORCH_TEST_WITH_ROCM=1 PYTORCH_TESTING_DEVICE_ONLY_FOR="cuda" python test/test_transformers.py` yields "55028 passed, 20784 skipped" with this change. [Previous result](https://hud.pytorch.org/pr/127528) of `test_transformers.py` was 0 errors, 0 failures, 55229 skipped out of 75517 tests in total (the XML report does not contain the total number of passed tests).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124885
Approved by: https://github.com/malfet
This adds implementations for:
* _flash_attention_forward
* _efficient_attention_forward
* _flash_attention_backward
* _efficient_attention_backward
These flop counts are implemented as follows:
* Unbind the batch elements
* Calculate flops individually for each element in the batch
* Sum the per-element results
This means that we access the concrete sequence lengths (which could be slow, and may trigger a GPU/CPU sync), but the FLOP numbers will vary with the sparsity of the NestedTensor, which is more accurate than assuming everything is padded.
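A hedged sketch of that per-element accounting for SDPA-style attention; `sdpa_flop_count` here is illustrative (using the usual 4 * H * S_q * S_kv * D estimate), not the exact helper added in this PR:
```
import torch

def sdpa_flop_count(q_shape, k_shape, v_shape) -> int:
    # q: (H, S_q, D), k/v: (H, S_kv, D); 2 FLOPs per multiply-add for
    # q @ k^T plus attn @ v.
    h, s_q, d = q_shape
    _, s_kv, _ = k_shape
    return 4 * h * s_q * s_kv * d

def nested_sdpa_flops(query, key, value) -> int:
    # Unbind the jagged batch dimension, count each element with its concrete
    # sequence lengths, and sum the results.
    total = 0
    for q, k, v in zip(query.unbind(0), key.unbind(0), value.unbind(0)):
        total += sdpa_flop_count(q.shape, k.shape, v.shape)
    return total
```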
Differential Revision: [D57120139](https://our.internmc.facebook.com/intern/diff/D57120139)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125776
Approved by: https://github.com/Chillee
This does a few things that were originally separate PRs, but I am on a new machine and don't have ghstack.
If it is too problematic to review, I can re-split; just let me know.
This does:
- Cleanup context manager use in test_flop_counter
- Remove need for mod argument in FlopCounterMode, warning about it
- Re-implement a Module tracker from scratch using global forward Module hooks and multi_grad_hook (we cannot use the global backward Module hooks because they don't look for nested Tensors and they are custom-Function based instead of multi_grad_hook based). A hedged usage sketch follows after this list.
- Update FlopCounterMode to use the new ModuleTracker. The entire existing test suite passes as-is (the only changes there are the new tests and the refactoring mentioned above).
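A hedged usage sketch of the new tracker, assuming it is exposed as `torch.utils.module_tracker.ModuleTracker` with a `parents` set naming the modules currently active in forward (and, via `multi_grad_hook`, in backward):
```
import torch
from torch.utils.module_tracker import ModuleTracker

tracker = ModuleTracker()

class Inner(torch.nn.Module):
    def forward(self, x):
        # Names of the modules currently on the call stack, per the tracker.
        print(sorted(tracker.parents))
        return x.relu()

mod = torch.nn.Sequential(torch.nn.Linear(4, 4), Inner())
with tracker:
    mod(torch.randn(2, 4)).sum().backward()
```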
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125352
Approved by: https://github.com/mikaylagawarecki
Note about the Updates:
This PR:
1. Skips more flash-attention-related UTs on MI200
2. Fixes additional ATen compilation errors after hipification
3. Fixes the author "root" of a specific commit
4. Includes the patch from Nikita in favor of block-level static initialization.
CAVEAT: This revised PR has a commit that modifies the CI to force its running on MI200 nodes. That specific commit must be reverted before merge.
Original PR (https://github.com/pytorch/pytorch/pull/114309) Note:
This pull request adds initial Flash Attention support for the AMD/ROCm platform. It adds a specialized Triton repository/branch as a compile-time dependency for the Flash Attention math library on AMD/ROCm. This Triton submodule is not used at runtime and will not be shipped in the final PyTorch package. We plan to release this specialized Triton as a separate project.
Known limitations:
- Only supports MI200 series GPUs (i.e., `gcnArchName == gfx90a:sramecc+:xnack-`).
- Only supports power-of-two sequence lengths.
- No support for varlen APIs.
- Only supports head dimensions 16, 32, 64, and 128.
- Performance is still being optimized.
Fixes #112997
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115981
Approved by: https://github.com/malfet
# Summary:
This pull request **removes** support for non-square sequence lengths in causal attention when using FlashAttention V2.
### Why we are doing this
// FlashAttention 2 updated the default mask meaning for causal in this PR:
// 9e5e8bc91e it is now aligned to lower_right which would be a BC break
// for non-square masks. We will not support non-square masks for causal w/ FAV2
For more context see:
https://github.com/pytorch/pytorch/issues/108108
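A small illustration (not from this PR) of why the alignment matters for non-square shapes: with `seq_len_q=2` and `seq_len_kv=4`, an upper-left-aligned causal mask and a lower-right-aligned one keep different key positions.
```
import torch

s_q, s_kv = 2, 4
upper_left = torch.ones(s_q, s_kv, dtype=torch.bool).tril(diagonal=0)
lower_right = torch.ones(s_q, s_kv, dtype=torch.bool).tril(diagonal=s_kv - s_q)
print(upper_left)
# tensor([[ True, False, False, False],
#         [ True,  True, False, False]])
print(lower_right)
# tensor([[ True,  True,  True, False],
#         [ True,  True,  True,  True]])
```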
### Followup
A large number of people will likely want to use FAV2 with lower_right causal attention for non-equal sequence lengths. See this RFC: https://github.com/pytorch/pytorch/issues/110681
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111007
Approved by: https://github.com/cpuhrsch
# Summary
## PR Dependencies
I don't use ghstack :( and this is a PR where it would have been helpful. That being said, I am going to peel off some PRs to make reviewing this easier:
- [x] Separate build flags for Flash and MemEff: #107985
### Description
This pull request updates the version of _scaled_dot_product_flash_attention from version 1 to version 2. The changes are based on the flash attention code originally authored by @tridao.
### Changes Made
The majority of the changes in this pull request involve:
- Copying over the flash_attention sources.
- Updating header files.
- Removing padding and slicing code from within the flash_attention kernel and relocating it to the composite implicit region of the SDPA. This was needed to make the kernel functional and appease autograd.
- Introducing a simple kernel generator to generate different instantiations of the forward and backward flash templates.
- Adding conditional compilation (ifdef) to prevent building when nvcc is invoked with gencode < sm80.
- Introducing a separate dependent option for mem_eff_attention, as flash_attention v2 lacks support for Windows and cannot be built for sm50 generation codes.
- Modifying build.sh to reduce parallelization on sm86 runners and to lower the maximum parallelization on the manywheel builds. This adjustment was made to address out-of-memory issues during the compilation of FlashAttentionV2 sources.
- Adding/Updating tests.
### Notes for Reviewers
This is not a fun review, and I apologize in advance.
Most of the files changed are in the flash_attn/ folder. The only files of interest here, IMO, are:
- aten/src/ATen/native/transformers/cuda/flash_attn/flash_api.cpp
- aten/src/ATen/native/transformers/cuda/flash_attn/kernels/generate_kernels.py (this has been incorporated upstream into the flash-attention GitHub repo)
There are a number of files all related to avoiding OOMs in CI/CD. These are typically shell scripts.
### Follow up items
- Include the updates from e07aa036db and 9e5e8bc91e | https://github.com/pytorch/pytorch/issues/108108
### Work Items
- [x] I don't think Windows will be supported for 3.1.0 - Need to update cmake
- [x] Let multi_query/attention pass through and test | UPDATE: I have the fast path implemented here: https://github.com/pytorch/pytorch/pull/106730 but since this will require changes to semantics of math to call repeat_interleave, I think this should be done as a followup.
- [x] Had to drop cutlass back to 3.0.0 to get it to compile. Need to figure out how to upgrade to 3.1.0 and later. Spoke with Tri and he is going to be taking a look. Note: compiling with clang currently errors for the cute headers.
- [x] Update test exercise above codepath
- [x] Still need to disable on seq_len % 128 != 0 for backward (Tri beat me to it: a4f148b6ab)
- [x] Add determinism warning to BWD, Tri got to this one as well: 1c41d2b
- [x] Update dispatcher to universally prefer FlashV2
- [x] Update tests to exercise new head_dims
- [x] Move the head_dim padding from kernel to top level composite implicit function in order to make it purely functional
- [x] Create template generator script
- [x] Initial cmake support for building kernels/ folder
- [x] Replay CudaGraph changes
### Results
#### Forward only
The TFLOPs reported here are on an underclocked A100.

#### Forward+Backward
Ran a sweep, and for large compute-bound sizes we see a ~2x performance increase for forward+backward.
<img width="1684" alt="Screenshot 2023-07-20 at 3 47 47 PM" src="https://github.com/pytorch/pytorch/assets/32754868/fdd26e07-0077-4878-a417-f3a418b6fb3b">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/105602
Approved by: https://github.com/huydhn, https://github.com/cpuhrsch
This PR fixes https://github.com/pytorch/pytorch/issues/103684.
- Instead of registering forward hooks in `__init__()`, do it upon `__enter__()`.
- De-register those forward hooks upon `__exit__()`.
- Achieve this by saving an additional mapping `_module_to_forward_hook_handles: Dict[nn.Module, _ForwardHookHandles]`. Only the values in the mapping (i.e. not the keys) are useful for this change. (A `List[_ForwardHookHandles]` would suffice.)
- The unit test accesses private attributes `_forward_hooks` and `_forward_pre_hooks` :/
Note that this PR is technically not backward compatible since it does not register the hooks upon `__init__()`, which means that you will not get the flops counting without the context manager.
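A minimal sketch (not the PR's exact code) of the `__enter__`/`__exit__` hook lifecycle described above, for a hypothetical mode that tracks a list of modules:
```
from typing import List
import torch.nn as nn

class _HookedMode:
    def __init__(self, mods: List[nn.Module]):
        self._mods = mods
        self._forward_hook_handles = []

    def __enter__(self):
        # Register hooks lazily, only when the context manager is entered.
        for mod in self._mods:
            self._forward_hook_handles.append(
                mod.register_forward_pre_hook(self._pre_hook))
            self._forward_hook_handles.append(
                mod.register_forward_hook(self._post_hook))
        return self

    def __exit__(self, *exc):
        # De-register everything so no hooks are left behind on exit.
        for handle in self._forward_hook_handles:
            handle.remove()
        self._forward_hook_handles.clear()

    def _pre_hook(self, module, inputs):
        pass  # push `module` onto the tracking hierarchy

    def _post_hook(self, module, inputs, output):
        pass  # pop `module` from the tracking hierarchy
```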
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103744
Approved by: https://github.com/Chillee
High level approach:
1. I generated a bunch of data comparing FlashAttention and Cutlass implementations (https://pastebin.com/pe0j3YeK)
2. I trained a decision tree using standard train/val split methodology and hyperparameter sweeps (https://pastebin.com/fjYX1HjR).
2a. I did a bunch of feature augmentation to capture interactions between features.
The heuristic I ended up with is:
```
use_flash = seq_len / (num_heads * batch_size) > 6
```
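For example, on a hypothetical shape with `batch_size=8`, `num_heads=16`, `seq_len=2048`:
```
batch_size, num_heads, seq_len = 8, 16, 2048
use_flash = seq_len / (num_heads * batch_size) > 6  # 2048 / 128 = 16.0 > 6 -> True
```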
TL;DR: On my dataset, where FlashAttention and Cutlass differ by more than 10%, the existing heuristic achieves 69% accuracy. My new heuristic achieves 94% accuracy.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99644
Approved by: https://github.com/ngimel, https://github.com/drisspg
Overall, here is an example usage. Note that this *also* captures backward FLOPs.
```
import torchvision.models as models
import torch
from torch.utils.flop_counter import FlopCounterMode
inp = torch.randn(1, 3, 224, 224, device='cpu')
mod = models.resnet18()
flop_counter = FlopCounterMode(mod, depth=1)
with flop_counter:
    mod(inp).sum().backward()
```
<img width="326" alt="image" src="https://user-images.githubusercontent.com/6355099/222023068-3491e405-f195-4e11-b679-36b19a1380c7.png">
You can control the depth of the module hierarchy with the `depth` attribute (which defaults to 2). For example, if I don't limit it, this is what it outputs.
<img width="366" alt="image" src="https://user-images.githubusercontent.com/6355099/222023306-3d880bb6-f534-4f98-bf10-83c4353acefc.png">
## Other APIs
- `FlopCounterMode(custom_mapping=...)`: Allows for custom flop counting functions
- `FlopCounterMode.get_table(depth=...)`: Explicitly get the table as a string
- `FlopCounterMode.flop_counts`: Contains the flop information as a `Dict[hierarchy: str, Dict[Op, int]]`
- `FlopCounterMode.register_hierarchy(f, name)`: Allows you to register additional "hierarchies" for a function.
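A hedged usage sketch of a couple of these APIs; the custom formula below is hypothetical, and its calling convention is assumed to match the built-in shape-based formulas:
```
import torch
from torch.utils.flop_counter import FlopCounterMode

def double_counted_mm_flop(a_shape, b_shape, *args, out_shape=None, **kwargs) -> int:
    m, k = a_shape
    _, n = b_shape
    return 4 * m * k * n  # deliberately different from the default 2 * M * K * N

counter = FlopCounterMode(custom_mapping={torch.ops.aten.mm: double_counted_mm_flop})
with counter:
    torch.randn(32, 64) @ torch.randn(64, 16)

print(counter.get_table(depth=1))  # the table as a string
print(counter.flop_counts)         # Dict[hierarchy, Dict[Op, int]]
```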
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95751
Approved by: https://github.com/ngimel, https://github.com/albanD