pytorch

mirror of https://github.com/pytorch/pytorch.git synced 2025-10-21 05:34:18 +08:00

Author	SHA1	Message	Date
Dmitry Nikolaev	c99efc08fb	[ROCm] skip test_RNN_dropout_state (#149446 ) PR to skip test_nn.py::TestNN::test_RNN_dropout_state Currently ROCm doesn't support dropout value for RNN PR to enable RNN dropout on ROCm still in review and blocked pytorch/pytorch#144572 Fixes: https://github.com/pytorch/pytorch/issues/68849 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149446 Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily	2025-03-20 17:22:39 +00:00
Nikita Shulga	2e0c98ff05	[MPS] Add `bicubic2d_aa` (#149378 ) Which is currently the most frequently requested op in https://github.com/pytorch/pytorch/issues/141287 Mostly done by refactoring `upsample_bilinear2d_aa` to accept Functor as one of the template arguments, which closely ideas from `eec43cfbc0/src/libImaging/Resample.c` as well as `bb42e4d137/aten/src/ATen/native/cuda/UpSampleBilinear2d.cu (L472-L478)` Populate unit tests by copying upsample_bilinear_2d_aa and reusing it as upsample_bicubic2d_aa At that point, only difference between upsample_bilinear2d_aa and upsample_bicubic2d_aa are convolution kernel function and size: for bilinear it's 3x3, for bicubic it's 5x5 Pull Request resolved: https://github.com/pytorch/pytorch/pull/149378 Approved by: https://github.com/dcci	2025-03-18 05:35:41 +00:00
zeshengzong	a7f8de2198	Add `nn.Bilinear` param validation (#149018 ) Fixes #103425 ## Changes - Add doc description size value `must be > 0` - Add validation for `in1_features` param Currently, only `in1_features` will cause runtime error, if add checks for `in2_features` and `out_features` as well, might be kind of BC breaking. ```python import torch from torch import nn class lenet(nn.Module): def __init__(self): super(lenet, self).__init__() self.conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=5, stride=1) # Error, `in1_features=1, in2_features=0, out_features=0` no error self.linear = nn.Bilinear(in1_features=0, in2_features=0, out_features=0) def forward(self, x): # 1st block x = self.conv(x) x = self.linear(x) return x if __name__ == '__main__': net = lenet() ``` ## Test Result ```bash pytest test/test_nn.py -k test_bilinear -vv ``` ![image](https://github.com/user-attachments/assets/20617ba9-bac5-4db2-aecc-1831dbc8eb43) ![image](https://github.com/user-attachments/assets/401e4e1f-051a-4e1c-952b-48e85de64b0b) Pull Request resolved: https://github.com/pytorch/pytorch/pull/149018 Approved by: https://github.com/mikaylagawarecki	2025-03-14 19:26:12 +00:00
zeshengzong	97272e4b49	Fix `torch.nn.functional.hardswish` gradients corner case (#148049 ) Fixes #147801 ## Changes - Change hardswish gradient compute condition as [torch.nn.functional.hardswish](https://pytorch.org/docs/stable/generated/torch.nn.functional.hardswish.html) - Enable cuda for test `test_hardswish_grad_corner` - Add test case for value=-3 ## Test Result ```bash pytest test/test_nn.py -k test_hardswish pytest test/test_unary_ufuncs.py -k test_hardswish pytest test/inductor/test_torchinductor.py -k test_hardswish ``` ![image](https://github.com/user-attachments/assets/000cb5c4-15f5-4bfd-ab45-f52bf810ff3d) ![image](https://github.com/user-attachments/assets/38b08cf8-ea84-47a2-8e37-0a213da3e0c8) ![image](https://github.com/user-attachments/assets/54bc57be-2c57-46cc-ab90-94ea6cbe1c34) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148049 Approved by: https://github.com/soulitzer	2025-03-14 18:53:10 +00:00
cyy	970fefcc53	Remove outdated skipCUDAIfCudnnVersionLessThan decoration (#148940 ) Test conditions for CUDNN 7 and 8 were removed because we have moved to CUDNN 9. Pull Request resolved: https://github.com/pytorch/pytorch/pull/148940 Approved by: https://github.com/mikaylagawarecki	2025-03-13 18:02:50 +00:00
riccardofelluga	8f71d4563e	Fix rms_norm in fp16/bf16 (#147203 ) Fixes #134106. This PR moves the `upcasted_result` down-casting after all computation is done. Since the multiplication with the weight_opt input is not done in half precision, the current code path is doing the following: fp16 -> fp32 -> fp16 -> fp32 -> fp16. What we want tho is to avoid down-casting and this PR proposes: fp16 -> fp32 -> fp16. This results in better accuracy as it avoids truncating. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147203 Approved by: https://github.com/eqy	2025-03-08 04:43:18 +00:00
PyTorch MergeBot	abcca2fcbb	Revert "Fix `torch.nn.functional.hardswish` gradients corner case (#148049 )" This reverts commit 29b28e9d9f93d78092099a44a7bcc28cfbae06e3. Reverted https://github.com/pytorch/pytorch/pull/148049 on behalf of https://github.com/soulitzer due to This may be causing an accuracy failure on inductor ([comment](https://github.com/pytorch/pytorch/pull/148049#issuecomment-2706839169))	2025-03-07 16:05:56 +00:00
zeshengzong	29b28e9d9f	Fix `torch.nn.functional.hardswish` gradients corner case (#148049 ) Fixes #147801 ## Changes - Change hardswish gradient compute condition as [torch.nn.functional.hardswish](https://pytorch.org/docs/stable/generated/torch.nn.functional.hardswish.html) - Enable cuda for test `test_hardswish_grad_corner` - Add test case for value=-3 ## Test Result ```bash pytest test/test_nn.py -k test_hardswish pytest test/test_unary_ufuncs.py -k test_hardswish pytest test/inductor/test_torchinductor.py -k test_hardswish ``` ![image](https://github.com/user-attachments/assets/000cb5c4-15f5-4bfd-ab45-f52bf810ff3d) ![image](https://github.com/user-attachments/assets/38b08cf8-ea84-47a2-8e37-0a213da3e0c8) ![image](https://github.com/user-attachments/assets/54bc57be-2c57-46cc-ab90-94ea6cbe1c34) Pull Request resolved: https://github.com/pytorch/pytorch/pull/148049 Approved by: https://github.com/soulitzer	2025-03-06 19:04:52 +00:00
CaoE	8b818ab58f	Use float data type for Half sum in fallback implementation of batchnorm backward on CPU (#147353 ) Fixes #147303. Use float data type for Half sum in fallback implementation of batchnorm backward on CPU as the representation range of Half is small. Pull Request resolved: https://github.com/pytorch/pytorch/pull/147353 Approved by: https://github.com/leslie-fang-intel, https://github.com/cpuhrsch	2025-02-21 01:33:33 +00:00
Ahmad Sharif	ae5cc19ba7	[pytorch][cuda] Improve softmax backward pass native CUDA implementation (#145866 ) This PR is similar to https://github.com/pytorch/pytorch/pull/122970, but works on the softmax backward pass. Specifically, it uses shared memory to cache the gradOutput when it can fit in shared memory. Before this PR we were reading gradOutput twice. On my H100 this seems to improve the softmax backward pass performance by about 5% for problem sizes that fit within shared memory. (Note that this is not the only kernel that runs when you call softmax backward pass -- there is an elementwise kernel that runs before this; optimizing that can be a separate PR). Important Note: Currently the softmax backward pass consists of an [element-wise multiply operator](`7f65a20884/aten/src/ATen/native/cuda/SoftMax.cu (L1216)`), followed by [this function](`7f65a20884/aten/src/ATen/native/cuda/SoftMax.cu (L1062)`) which calls the `cunn_SoftMaxBackward` kernel. With my change the kernel time reduces by about 12% (see screenshot below), while the total time (including the elementwise) reduces by about 5%. ``` Baseline This PR N size FP32 bandwidth FP16 bandwidth N size FP32 bandwidth FP16 bandwidth fp32 diff fp16 diff 0 256 134.340966 70.042039 0 256 133.70146 70.342753 -0.48% 0.43% 1 512 233.501185 129.945803 1 512 234.057145 132.933066 0.24% 2.30% 2 1024 340.667966 229.280464 2 1024 338.833265 226.441699 -0.54% -1.24% 3 2048 379.643726 337.452058 3 2048 399.559017 338.432284 5.25% 0.29% 4 4096 416.597537 383.625364 4 4096 428.252403 396.137506 2.80% 3.26% 5 6000 431.198241 384.384384 5 6000 457.744577 406.06275 6.16% 5.64% 6 8192 462.811252 427.292573 6 8192 474.791032 428.281563 2.59% 0.23% 7 10000 464.258731 429.050294 7 10000 483.7643 446.849381 4.20% 4.15% 8 10013 465.199701 429.824179 8 10013 464.904407 428.72184 -0.06% -0.26% 9 10240 477.07359 428.853737 9 10240 485.317024 444.902586 1.73% 3.74% 10 11000 473.038785 430.778663 10 11000 488.161438 453.462162 3.20% 5.27% 11 12000 474.342475 432.594814 11 12000 490.532418 458.427653 3.41% 5.97% 12 16384 487.468854 473.611576 12 16384 488.154406 476.264631 0.14% 0.56% 13 20000 482.029793 465.666186 13 20000 482.147092 483.886193 0.02% 3.91% 14 24000 478.368093 474.159464 14 24000 478.364948 491.447921 0.00% 3.65% 15 32000 476.523796 473.18868 15 32000 476.523796 474.398962 0.00% 0.26% 16 32768 476.104723 477.493634 16 32768 476.704463 477.330606 0.13% -0.03% 17 36864 477.900663 475.472787 17 36864 477.973279 475.728454 0.02% 0.05% 18 40960 477.707561 475.559064 18 40960 478.445017 476.088067 0.15% 0.11% 19 45056 479.169812 475.865134 19 45056 479.143266 475.878202 -0.01% 0.00% 20 49152 477.804907 475.382982 20 49152 477.868404 475.976377 0.01% 0.12% 21 65536 481.274125 478.171806 21 65536 481.537733 478.703926 0.05% 0.11% 22 66000 481.64652 480.095457 22 66000 481.856013 480.466388 0.04% 0.08% 23 68608 481.745774 479.034704 23 68608 481.917596 478.856209 0.04% -0.04% 24 80000 483.409361 480.356529 24 80000 483.330481 480.375277 -0.02% 0.00% 25 98304 480.736301 481.396882 25 98304 480.789858 481.320143 0.01% -0.02% ``` NCU profiler shows lower DRAM fetches with the new kernel: ![image](https://github.com/user-attachments/assets/f3606725-d8fc-4ea5-ae6d-9c188bf32d72) NCU reports about 12% elapsed time reduction in this kernel alone compared to baseline (and because of other kernels that are run, the overall backward pass time as seen by the user gets reduced by 5%). I compared the binary size increase by running `python setup.py develop` before and after and diffing the .so files: ![image](https://github.com/user-attachments/assets/8e6cee2e-3c7a-4fa4-8836-954047ce8ffc) libtorch_cuda.so goes from 274,752,224 bytes to 274,787,072 bytes. The increase in size is 34kB which is about 0.01%. I measured the compilation time for incremental development: ``` touch ./aten/src/ATen/native/cuda/SoftMax.cu time python setup.py develop real 0m10.083s user 0m8.197s sys 0m3.149s ``` Note that this uses `ccache` and does a bunch of copies and is not just measuring the `nvcc` time. I measured the `nvcc` time separately by capturing the `nvcc` command shown in [1] below and running it on the baseline and modified kernels: ``` # baseline nvcc time for SoftMax.cu real 0m35.341s user 0m33.801s sys 0m1.289s # this PR's nvcc time for SoftMax.cu real 0m36.513s user 0m34.722s sys 0m1.408s ``` So the `nvcc` time increases by about 1 second, or ~3% of the baseline. [1] `nvcc` command is here: ``` # This is the nvcc command /usr/local/cuda/bin/nvcc -forward-unknown-to-host-compiler -DAT_PER_OPERATOR_HEADERS -DFLASHATTENTION_DISABLE_ALIBI -DFMT_HEADER_ONLY=1 -DHAVE_MALLOC_USABLE_SIZE=1 -DHAVE_MMAP=1 -DHAVE_SHM_OPEN=1 -DHAVE_SHM_UNLINK=1 -DMINIZ_DISABLE_ZIP_READER_CRC32_CHECKS -DONNXIFI_ENABLE_EXT=1 -DONNX_ML=1 -DONNX_NAMESPACE=onnx_torch -DTORCH_CUDA_BUILD_MAIN_LIB -DTORCH_CUDA_USE_NVTX3 -DUSE_C10D_GLOO -DUSE_C10D_NCCL -DUSE_CUDA -DUSE_DISTRIBUTED -DUSE_EXTERNAL_MZCRC -DUSE_FLASH_ATTENTION -DUSE_MEM_EFF_ATTENTION -DUSE_NCCL -DUSE_RPC -DUSE_TENSORPIPE -D_FILE_OFFSET_BITS=64 -Dtorch_cuda_EXPORTS -I/home/ahmads/personal/pytorch/build/aten/src -I/home/ahmads/personal/pytorch/aten/src -I/home/ahmads/personal/pytorch/build -I/home/ahmads/personal/pytorch -I/home/ahmads/personal/pytorch/cmake/../third_party/benchmark/include -I/home/ahmads/personal/pytorch/third_party/onnx -I/home/ahmads/personal/pytorch/build/third_party/onnx -I/home/ahmads/personal/pytorch/nlohmann -I/home/ahmads/personal/pytorch/aten/src/THC -I/home/ahmads/personal/pytorch/aten/src/ATen/cuda -I/home/ahmads/personal/pytorch/third_party/fmt/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/include -I/home/ahmads/personal/pytorch/aten/src/ATen/../../../third_party/cutlass/tools/util/include -I/home/ahmads/personal/pytorch/build/caffe2/aten/src -I/home/ahmads/personal/pytorch/aten/src/ATen/.. -I/home/ahmads/personal/pytorch/build/nccl/include -I/home/ahmads/personal/pytorch/c10/cuda/../.. -I/home/ahmads/personal/pytorch/c10/.. -I/home/ahmads/personal/pytorch/third_party/tensorpipe -I/home/ahmads/personal/pytorch/build/third_party/tensorpipe -I/home/ahmads/personal/pytorch/third_party/tensorpipe/third_party/libnop/include -I/home/ahmads/personal/pytorch/torch/csrc/api -I/home/ahmads/personal/pytorch/torch/csrc/api/include -isystem /home/ahmads/personal/pytorch/build/third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/gloo -isystem /home/ahmads/personal/pytorch/cmake/../third_party/tensorpipe/third_party/libuv/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/googletest/googletest/include -isystem /home/ahmads/personal/pytorch/third_party/protobuf/src -isystem /home/ahmads/personal/pytorch/third_party/XNNPACK/include -isystem /home/ahmads/personal/pytorch/third_party/ittapi/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/eigen -isystem /usr/local/cuda/include -isystem /home/ahmads/personal/pytorch/torch/include -isystem /home/ahmads/personal/pytorch/third_party/ideep/include -isystem /home/ahmads/personal/pytorch/torch/include/oneapi/dnnl -isystem /home/ahmads/personal/pytorch/INTERFACE -isystem /home/ahmads/personal/pytorch/third_party/nlohmann/include -isystem /home/ahmads/personal/pytorch/third_party/NVTX/c/include -isystem /home/ahmads/personal/pytorch/cmake/../third_party/cudnn_frontend/include -DLIBCUDACXX_ENABLE_SIMPLIFIED_COMPLEX_OPERATIONS -D_GLIBCXX_USE_CXX11_ABI=1 -Xfatbin -compress-all -DONNX_NAMESPACE=onnx_torch -gencode arch=compute_90,code=sm_90 -Xcudafe --diag_suppress=cc_clobber_ignored,--diag_suppress=field_without_dll_interface,--diag_suppress=base_class_has_different_dll_interface,--diag_suppress=dll_interface_conflict_none_assumed,--diag_suppress=dll_interface_conflict_dllexport_assumed,--diag_suppress=bad_friend_decl --expt-relaxed-constexpr --expt-extended-lambda -Wno-deprecated-gpu-targets --expt-extended-lambda -DCUB_WRAPPED_NAMESPACE=at_cuda_detail -DCUDA_HAS_FP16=1 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -O3 -DNDEBUG -std=c++17 -Xcompiler=-fPIC -DTORCH_USE_LIBUV -DCAFFE2_USE_GLOO -Xcompiler -Wall -Wextra -Wdeprecated -Wno-unused-parameter -Wno-missing-field-initializers -Wno-array-bounds -Wno-unknown-pragmas -Wno-strict-overflow -Wno-strict-aliasing -Wunused-function -Wunused-variable -Wunused-but-set-variable -Wno-maybe-uninitialized -MD -MT caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o -MF caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o.d -x cu -c /home/ahmads/personal/pytorch/aten/src/ATen/native/cuda/SoftMax.cu -o caffe2/CMakeFiles/torch_cuda.dir/__/aten/src/ATen/native/cuda/SoftMax.cu.o ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/145866 Approved by: https://github.com/ngimel	2025-02-12 07:54:41 +00:00
Nikita Shulga	611ca163fd	[MPS] Add bilineard2d_aa implementation (#145526 ) Interesting quirk of the algorithm, that is not very well documented, is that value of align_corners is ignored in antialias mode, see arguments of `e8304f08fe/aten/src/ATen/native/cpu/UpSampleKernel.cpp (L747-L751)` Error out on uint8 implementation(as it relies on a very fragile integer integer arithmetic), as it's not implemented on any other Accelerator devices at the moment. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145526 Approved by: https://github.com/dcci	2025-02-10 15:03:14 +00:00
Danial Javady	bb4964013f	Add determinmistic kernel for reflection2d (#136241 ) Adds feature for #98925 Tests pass for both existing reflectionpad2d and the new one I inserted. Summary of the work: Simple conditional check for deterministic mode that will dispatch to a different kernel. This kernel does not use any atomic operations, and will lead to deterministic results as instead of going from the output to input(1:1) relationship, I am doing the opposite. I am going from input -> all outputs, which is 1 to many. These operations are done in the same order every execution as I simply traverse the data set with a grid stride loop and use simple linearized indexing into the input tensor. So each thread will compute the 4 conditionals, which are then used to see if the input has an output in the 8 regions. These 8 regions are top left, top, top right, left, right, bottom left, bottom, bottom right`. I did not focus on performance for this PR as that would expand the scope heavily. If there are any performance questions though i can answer. Pull Request resolved: https://github.com/pytorch/pytorch/pull/136241 Approved by: https://github.com/eqy, https://github.com/albanD	2025-01-29 20:34:03 +00:00
Benjamin Glass	5aa5a5763e	[inductor triton] Disable incorrect TF32 usage on CUDA capability < 8 (#145684 ) Triton 2.2 and greater have a bug where allowing TF32 generation for a GPU that does not support TF32 will cause code generation errors. Patch around this problem by: 1. Adding a function to `torch.cuda` that determines whether CUDA hardware is capable of using the TF32 format. 2. Using that function to explicitly disable TF32 generation when calling Triton, where needed. To demonstrate that this fix works, try running `test/inductor/test_max_autotune.py` on a GPU with CUDA compute capability < 8 (e.g. any NVIDIA consumer GPU) without this fix. Pull Request resolved: https://github.com/pytorch/pytorch/pull/145684 Approved by: https://github.com/eqy	2025-01-28 22:01:08 +00:00
CaoE	a9bfc5f70c	Fix boundary conditions for hardswish backward (#143899 ) Fixes #136345. Pull Request resolved: https://github.com/pytorch/pytorch/pull/143899 Approved by: https://github.com/jgong5, https://github.com/ezyang	2025-01-16 20:26:27 +00:00
Mario Vasilev	49bdc418be	Add strict kwarg to `nn.Module.set_submodule` and fix bug for non dot delineated strings (#143455 ) Before fixing set_submodule, it used to create leaf modules when the target was not a dot-delimited string. After the fix it will not create a new attribute if target is a non-dot-delimited string. If you want to create leaf nodes of `nn.Module` parent nodes, you can use `replace_or_create_new_leaf_module`. Fixes https://github.com/pytorch/pytorch/issues/143441 Pull Request resolved: https://github.com/pytorch/pytorch/pull/143455 Approved by: https://github.com/mikaylagawarecki	2025-01-16 05:06:33 +00:00
eqy	63569d9745	[CUDA][TF32] Add some missing TF32 decorators to `test_nn.py` (#144592 ) Original authored by @bilal2vec Pull Request resolved: https://github.com/pytorch/pytorch/pull/144592 Approved by: https://github.com/Skylion007	2025-01-11 16:20:59 +00:00
hongxyan	aa69d73e6b	[ROCm] fix torch.layer_norm invalid configuration problem when input is large tensor (#144007 ) Fixes #136291 This PR is to fix the `invalid configuration argument` problem happened on ROCm when input is a large tensor when calling `torch.layer_norm`. ``` File "/opt/conda/envs/py_3.9/lib/python3.9/site-packages/torch/nn/functional.py", line 2573, in layer_norm return torch.layer_norm RuntimeError: HIP error: invalid configuration argument ``` After investigation, I found that the reason why this error happened is: The amd compute language runtime checks whether `gridDim.x * blockDim.x` is greater than `std::numeric_limits<uint32_t>::max()` or not. If yes, it will error out with the "invalid configuration argument" message. The fix is to split the whole task to several chunks so that each chunk will not trigger the failure condition. This will ensure the correctness and completeness given the current kernel implementation logic of `vectorized_layer_norm_kernel`. Also added a largeTensor layer_norm unit test `test_layer_norm_large_tensor` with the same shape `[16, 3000, 3000, 16]` as the one used by the pytorch issue #136291 so that the unit test can check the expected output value to ensure correctness. The future work may include performance optimization of layer_norm and CK layer_norm integration. Pull Request resolved: https://github.com/pytorch/pytorch/pull/144007 Approved by: https://github.com/eqy	2025-01-07 19:17:02 +00:00
eqy	7e3cd0e488	[CUDA] Check `size` calculation in `ilpReduce` for `softmax` (#144009 ) For #143644 Pull Request resolved: https://github.com/pytorch/pytorch/pull/144009 Approved by: https://github.com/Skylion007	2025-01-04 02:31:15 +00:00
eqy	dbdda654af	[64-bit][CUDA] Upsample2D 64-bit indexing fix attempt 2 (#141923 ) #141831 Block/thread math requires a cast... Pull Request resolved: https://github.com/pytorch/pytorch/pull/141923 Approved by: https://github.com/ngimel	2025-01-04 02:30:38 +00:00
Tom Ritchford	d8c8ba2440	Fix unused Python variables in test/[e-z]* (#136964 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/136964 Approved by: https://github.com/justinchuby, https://github.com/albanD	2024-12-18 23:02:30 +00:00
Sun, Jiayi	863e6e4567	Improve input dimensions check for reflection_pad1d, reflection_pad2d and reflection_pad3d (#141670 ) Fix https://github.com/pytorch/pytorch/issues/141447. Pull Request resolved: https://github.com/pytorch/pytorch/pull/141670 Approved by: https://github.com/mingfeima, https://github.com/malfet	2024-12-18 17:46:26 +00:00
albanD	792f1c47e9	No actual change, just remove variable contain Tensors from global scope (#143225 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/143225 Approved by: https://github.com/ezyang	2024-12-17 16:14:25 +00:00
isalia20	37fe8015ac	softshrink nan fixes (#138421 ) Fixes #138385 . Currently contains fixes for cpu and cuda. Will add fixes to mps as well soon if my mac can build it from source.(Had some issues with building it on my linux pc due to limited memory) Pull Request resolved: https://github.com/pytorch/pytorch/pull/138421 Approved by: https://github.com/mikaylagawarecki	2024-11-21 23:06:08 +00:00
Michael Diggin	723498aab8	Gaussian nll loss scalar variance support (#138931 ) Fixes #138747 Adds support for `variance` being a Tensor or a float in `gaussian_nll_loss` to avoid a cpu-gpu sync point in the loss function, when the variance is a static tensor like `<scalar>*torch.ones_like(input)` Pull Request resolved: https://github.com/pytorch/pytorch/pull/138931 Approved by: https://github.com/mikaylagawarecki	2024-11-21 18:20:09 +00:00
PyTorch MergeBot	109f8274a8	Revert "Add NHWC support for group normalization (#126635 )" This reverts commit ed0e63e938317fd254a705f00580caeb68768f9c. Reverted https://github.com/pytorch/pytorch/pull/126635 on behalf of https://github.com/kit1980 due to Reverted internally at Meta, see D65979564 ([comment](https://github.com/pytorch/pytorch/pull/126635#issuecomment-2480130943))	2024-11-15 23:38:15 +00:00
Nikita Shulga	0f739b8f66	[Codemod] `skipIfMps`->`skipIfMPS` (#140562 ) As `MPS` is an acronym that stands for Metal Performance Shaders Also to closer align with `skipCUDAIf` not `skipCudaIf` Pull Request resolved: https://github.com/pytorch/pytorch/pull/140562 Approved by: https://github.com/ZainRizvi, https://github.com/r-barnes	2024-11-13 19:45:08 +00:00
zeshengzong	cb71bcc542	Replace clone.detach with detach.clone (#140264 ) Fixes #64532 As state in issue, replace `clone.detach` by `detach.clone` Pull Request resolved: https://github.com/pytorch/pytorch/pull/140264 Approved by: https://github.com/soulitzer	2024-11-13 07:01:02 +00:00
Mikayla Gawarecki	2ee91db03d	Add APIs to separate norm calculation and gradient scaling in `nn.utils.clip_grad_norm_` (#139662 ) Fixes https://github.com/pytorch/pytorch/issues/139467 Refactor `nn.utils.clip_grad_norm_` into `nn.utils.get_total_norm` and then `nn.utils.clip_grads_with_norm_` . `clip_grad_norm_` now calls into these two new ops, `get_total_norm` is generalized (rather than `get_grad_norm` due to the discussion on the issue from @awgu) Pull Request resolved: https://github.com/pytorch/pytorch/pull/139662 Approved by: https://github.com/H-Huang	2024-11-07 23:13:23 +00:00
Danial Javady	ed0e63e938	Add NHWC support for group normalization (#126635 ) Fixes #111824 Currently it is the case that if the user specifies their group normalization to be of NHWC format, pytorch will default to NCHW tensors and convert. This conversion is not immediately obvious to the user unless they check the format themselves which is not intuitive. This PR adds suppor for NHWC for cuda by adding necessary kernels. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126635 Approved by: https://github.com/eqy, https://github.com/mikaylagawarecki	2024-11-07 01:12:08 +00:00
Donald Tolley	c1e7d85ce6	Add Weighted Loss Functions to PyTorch : WMSE, WMAE, and Weighted Huber Loss (#132049 ) #### Summary This pull request introduces new weighted loss functions to the PyTorch library: `weighted_huber_loss`, `wmse_loss`, and `wmae_loss`. These functions allow for precise control over the influence of each sample during training, important for imbalanced data or when certain samples are more significant than others. #### Changes - `weighted_huber_loss`: Huber loss modified to incorporate weights, providing a balance between L1 and L2 loss based on the `delta` parameter. - `wmse_loss` (Weighted Mean Squared Error): Applies weights to the standard MSE loss, useful for emphasizing certain samples in regression tasks. - `wmae_loss` (Weighted Mean Absolute Error): Adjusts MAE loss calculation by including weights, ideal for datasets with outliers. #### Code Details - Input Validation: Ensures `input`, `target`, and `weights` tensors match in size to prevent broadcasting errors. - Reduction Options: Supports `none`, `mean`, and `sum` reductions to suit various computational needs. - Backward Compatibility: Maintains support for deprecated arguments `size_average` and `reduce`, while encouraging use of the `reduction` argument. #### Usage Example ```python import torch input = torch.tensor([0.5, 2.5, 2.0], dtype=torch.float32) target = torch.tensor([0.0, 2.0, 1.5], dtype=torch.float32) weights = torch.tensor([1.0, 0.5, 1.5], dtype=torch.float32) loss = weighted_huber_loss(input, target, weights, delta=1.0) print(loss) ``` --- Feedback on these implementations is welcome; please let me know if further modifications are required. Resolves #132465 Pull Request resolved: https://github.com/pytorch/pytorch/pull/132049 Approved by: https://github.com/mikaylagawarecki Co-authored-by: mikaylagawarecki <mikaylagawarecki@gmail.com>	2024-10-31 21:59:43 +00:00
Siddharth Kotapati	e27c0048db	Enable additional tests for MPS CI runs (#134356 ) As part of the follow up for https://github.com/pytorch/pytorch/issues/133520, adapting existing unused tests for use in MPS CI runs. Focusing on nhwc & other memory formatting tests Pull Request resolved: https://github.com/pytorch/pytorch/pull/134356 Approved by: https://github.com/malfet, https://github.com/eqy, https://github.com/huydhn	2024-10-04 21:52:38 +00:00
Nikita Shulga	c6192f32f1	[MPS] Add upsample_bicubic2d as Metal op (#136123 ) More or less literal copy-n-paste of `c33b0580e6/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu (L24)` and `c33b0580e6/aten/src/ATen/native/cuda/UpSampleBicubic2d.cu (L99)` Missing `uint8` implementation mimics CUDA behavior Initial version coded live in https://www.youtube.com/watch?v=shi6Kb5xxvk Later refinements: - Switch from 2D dispatch to 1D one (to match CUDA behavior) - Added batch + channel loops - Fixed scale computation to match align corners behavior - Added backward implementation Backward implementation again, mimics CUDA, so it has issues precision issue for `torch.half` as well as a somewhat slow simulation of atomic adds using atomic compare and exchange of the pair of adjacent values, i.e. ```metal emplate <typename T> static inline void atomic_add_helper( device atomic<int>* data, long offset, float value) { auto ptr = data + (offset >> 1); auto old = atomic_load_explicit(ptr, memory_order_relaxed); union { int i; T t[2]; } val; do { val.i = old; val.t[offset & 1] += static_cast<T>(value); } while (!atomic_compare_exchange_weak_explicit( ptr, &old, val.i, memory_order_relaxed, memory_order_relaxed)); } ``` Bump basic Metal language version to 3.0, as it's supported on MacOS13 and that's the first version that has `atomic_float` Pull Request resolved: https://github.com/pytorch/pytorch/pull/136123 Approved by: https://github.com/albanD	2024-09-24 18:58:11 +00:00
Xinya Zhang	74fd1bf965	[ROCm] Update to AOTriton 0.7b (#134498 ) Notable changes: 1. Enable CudaGraph related tests 2. Fix UT problems 3. EXPERIMENTAL Navi31 support. User should enable Navi31 support with Env Var `TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1` Know Problem: 1. `test/test_transformers.py` will massive failures and/or NaN outputs with `--use-pytest` + Update: Confirmed skip `class TestSDPAPrivateUse1Only` can fix the problem with `--use-pytest` Note: AOTriton 0.7b adds support to nestedtenosrs+SDPA but need more work (and consequently a separate PR) to enable it. Fixes #133540 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134498 Approved by: https://github.com/pruthvistony, https://github.com/jeffdaily, https://github.com/malfet	2024-09-11 20:34:01 +00:00
Roy Hvaara	23b1486185	[MPS] Allow nan mean reduction in `nll_loss` (#135434 ) This PR allows results from `nn_loss` to be `nan`, which is the same behavior as with CUDA and CPU https://github.com/pytorch/pytorch/pull/64572#issuecomment-926504162. Fixes #134431 Ref #64572 #119108 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135434 Approved by: https://github.com/malfet	2024-09-10 08:37:59 +00:00
Guilherme Leobas	136e28f616	Enable forward AD in functional.affine_grid (#135494 ) Fixes #121411 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135494 Approved by: https://github.com/zou3519, https://github.com/soulitzer	2024-09-10 00:07:07 +00:00
CaoE	dfb2b661f7	Use float data type for Half var_sum in batchnorm stats updating on CPU (#126525 ) Using float data type for Half `var_sum` in batchnorm stats updating on CPU to avoid `var_sum` overflow since the representation range of Half is small. Pull Request resolved: https://github.com/pytorch/pytorch/pull/126525 Approved by: https://github.com/jgong5, https://github.com/peterbell10	2024-09-09 15:31:38 +00:00
Roy Hvaara	5a69e0ebbe	[MPS] Update decorator comments with issue ref (#135448 ) Updating the comments with references to better places for context now that the bugs have been identified. xref #135442 #135447 #134184 Pull Request resolved: https://github.com/pytorch/pytorch/pull/135448 Approved by: https://github.com/ezyang	2024-09-09 15:18:52 +00:00
Nikita Shulga	f95085fd91	[BE][MPS] Prefer xfail to skip (#134858 ) This essentially undoes large skips on everything but MacOS Sequoia to nn.modules made by https://github.com/pytorch/pytorch/pull/128393 Instead it uses existing `xfail`, but guards it on `_macos15_or_newer` boolean Before the change if run on MacOS 14: ``` % python3 ../test/test_modules.py -v -k Hardswish 2>&1\|tail -n3 Ran 57 tests in 0.053s OK (skipped=32) ``` After ``` % python3 ../test/test_modules.py -v -k Hardswish 2>&1\|tail -n3 Ran 57 tests in 0.229s OK (skipped=10, expected failures=2) ``` Pull Request resolved: https://github.com/pytorch/pytorch/pull/134858 Approved by: https://github.com/janeyx99	2024-08-31 00:29:48 +00:00
Nikita Shulga	8de0d7690c	Use newer `toAccumulateType` signature in `Normalization.cpp` (#134540 ) Which fixes BatchNorm behavior for if called with empty tensors on MPS backed. Removed `expectedFailureMPS` in test_nn.py, deleted expected failure in `test_mps.py` and adjusted `skipIfMPS` to `expectedFailureMPS` in BatchNorm2d OpInfo decorator, but restrict it only to the memory format tests Test Plan: CI + `python3 -c "import torch; print(torch.nn.BatchNorm2d(3, device='mps')(torch.rand(0, 3, 2, 2, device='mps')))"` Fixes https://github.com/pytorch/pytorch/issues/134423 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134540 Approved by: https://github.com/Skylion007, https://github.com/albanD	2024-08-27 18:09:20 +00:00
Roy Hvaara	1565940114	[MPS] Add `test/test_nn.py` to test suite (#134184 ) This PR increases test coverage by including the tests in `test/test_nn.py` in the test suite of MPS. Some of the tests are decorated with `@expectedFailureMPS` for various reasons. Either that the op is not implemented, or that the outputs do not align. Those tests that contain differing results should be investigated further to rule out any live bugs. ```bash $ python test/run_test.py --mps --verbose -k TestNN Running test batch 'tests to run' cost 84.76 seconds ``` Ref #133520 Pull Request resolved: https://github.com/pytorch/pytorch/pull/134184 Approved by: https://github.com/albanD, https://github.com/malfet	2024-08-26 23:48:23 +00:00
drisspg	fb26b84390	Update fused kernels and call _safe_softmax from SDPA (#133882 ) # UPDATE: This is take 3 of https://github.com/pytorch/pytorch/pull/131863 which was landed via co dev but not applying correclty # Summary Changes the stance of SDPA on what to do for fully masked out rows ## Current Behavior Several PyTorch users have expressed frustration over this issue: - https://github.com/pytorch/pytorch/issues/41508 - https://github.com/pytorch/pytorch/issues/103749 - https://github.com/pytorch/pytorch/issues/103963 These are significant issues with extensive discussion but no satisfactory resolution. The PyTorch team's consensus, as stated here: https://github.com/pytorch/pytorch/issues/24816#issuecomment-524415617 Can be paraphrased as follows: When passing in fully masked out rows, attention becomes ambiguous. We have two main options: 1. Uniformly attend to all values: ```python scores[masked_out_rows] = 1 / len(row) out[masked_out_rows] = 1 / len(row) * value ``` 2. Decide that attention between no queries (masked) and no keys (masked) is meaningless: ```python output[fully_masked_rows] = NaN ``` We went with option 2. Partially because it was easier to implement, but also people argued that users can slice the output to remove the NaNs: ``` Python >fill_value = -float("inf") >row0 = torch.randn(4) >row1 = torch.tensor([(fill_value for _ in range(4)]) >matrix = torch.stack([row0, row1]).requires_grad_(True) >out = torch.softmax(matrix, 1) >out = out[0] >print(out) tensor([0.5377, 0.2729, 0.0692, 0.1201]) ``` Cool, problem solved. But what happends when you call backwards.. ```Python >out.backward(torch.ones_like(out)) >print(matrix.grad) tensor([[3.0957e-08, 1.4157e-08, 7.7802e-10, 1.3713e-08], [ nan, nan, nan, nan]]) ``` Those pesky NaNs are back! ## Why do we see NaNs today? The core of the problem revolves around using softmax function in sdpa: ```python > row = torch.tensor([(-float("inf")) for _ in range(4)]) > torch.softmax(row, 0) tensor([nan, nan, nan, nan]) ``` ## Quick Aside: Masking in Attention Attention itself doesn't have a concept of masking. The `sdpa` function has an argument called `attn_mask`, which would be more accurately named `attn_bias`. This is because we don't actually "mask" entries when computing attention. Instead, due to implementation details([performance](https://github.com/pytorch/pytorch/issues/25110#issuecomment-524519087)), we add a value to the masked-out query/key pairs. We use a large negative number (typically -inf) to decrease the attention weight, as softmax assigns more weight to larger values. ## Alternative Approaches If we use a very large negative number instead of -inf: ```python > row = torch.tensor([(-1e6) for _ in range(4)]) > torch.softmax(row, 0) tensor([0.2500, 0.2500, 0.2500, 0.2500]) ``` However if users always remembered to "slice" out their outputs i.e.: ```Python >fill_value = -1e6 >... >out.backward(torch.ones_like(out)) >print(matrix.grad) tensor([[-0.0563, -0.0564, 0.1613, -0.0486], [ 0.0000, 0.0000, 0.0000, 0.0000]]) ``` This would bring us back into a better state. ## A Third Option We don't necessarily need to alter the behavior of softmax for -inf or very large negative numbers. The fundamental goal is to exclude certain query/key pairs from attention, regardless of the underlying implementation. This PR implements the new semantic for masking w/ attention in fully masked-out rows: ```python out[masked_out_rows] = 0 ``` Important Note: This idea isn't entirely new. The [MaskedTensor](https://pytorch.org/tutorials/prototype/maskedtensor_overview#safe-softmax) prototype, a tensor subclass, was designed to handle such cases. However, it remains a prototype feature and hasn't gained widespread adoption. ## Details This PR stack does 3 things: 1. Adds a PRIVATE _safe_softmax op 2. Updates semantic for flash_cpu fused kernel 3. Updates semantic for efficient_cuda fused kernel _safe_softmax is not supposed to be used generically and is only meant to be used within the context of SDPA. Due to this fact instead of decomposing softmax and checking for -inf rows we instead "cheat" and use nan_to_num. Why I think this is okay? (please find a counter point if avail) There are multiple ways NaNs can emerge. For the fully masked out rows case nan_to_num works. But what if there were other NaNs, wouldn't this silently remove them? The only case that this can happen is if the input itself had a NaN or an Inf For example: ```Python a = torch.ones([4], requires_grad=False, dtype=torch.float16) a[1] = torch.finfo(torch.float16).max print(a.softmax(-1)) ``` Will return `tensor([0., 1., 0., 0.], dtype=torch.float16)` Where ```Python a = torch.ones([4], requires_grad=False, dtype=torch.float16) a[1] = float("inf") a.softmax(-1) ``` returns: `tensor([nan, nan, nan, nan], dtype=torch.float16)` If we dont want to even allow for the possibility of "inf" or "NaN" attention scores to be converted to 0 then we can implemented it something like this ```Python max = torch.max(a, dim=-1, keepdim=True) exp = torch.exp(a - max.values) denom = torch.sum(exp, dim=-1, keepdim=True) softmax = exp / denom softmax = torch.where(max.values == float('-inf'), 0.0, softmax) ``` however we would be paying for this in math performance. ## Why Now I think one point that has substantially changed where PyTorch should lie on this argument is the fact that we have fused implementations for SDPA now. And these fused implementations allow us to easily and performantly support this new semantic. Differential Revision: [D61418679](https://our.internmc.facebook.com/intern/diff/D61418679) Pull Request resolved: https://github.com/pytorch/pytorch/pull/133882 Approved by: https://github.com/soulitzer	2024-08-19 18:53:11 +00:00
Mikayla Gawarecki	d9576c9440	Fix failures when default is flipped for weights_only (#127627 ) Tests on XLA shard not fixed yet but there is an issue here https://github.com/pytorch/xla/issues/7799 Pull Request resolved: https://github.com/pytorch/pytorch/pull/127627 Approved by: https://github.com/albanD ghstack dependencies: #132349	2024-08-16 00:22:43 +00:00
PyTorch MergeBot	cfec69e2a1	Revert "Update fused kernels and call _safe_softmax from SDPA (#131863 )" This reverts commit caba37e99b03d2199848197de4e452b78c8c2a23. Reverted https://github.com/pytorch/pytorch/pull/131863 on behalf of https://github.com/izaitsevfb due to breaks executorch test executorch/backends/apple/coreml:test - test_vit_skip_conv (executorch.backends.apple.coreml.test.test_coreml_partitioner.TestCoreMLPartitioner) ([comment](https://github.com/pytorch/pytorch/pull/131863#issuecomment-2291855634))	2024-08-15 17:55:07 +00:00
drisspg	caba37e99b	Update fused kernels and call _safe_softmax from SDPA (#131863 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131863 Approved by: https://github.com/jbschlosser, https://github.com/Chillee	2024-08-13 23:37:50 +00:00
PyTorch MergeBot	4cca18d5b6	Revert "Update fused kernels and call _safe_softmax from SDPA (#131863 )" This reverts commit e61def65d5c6268e79f52776f75277ee60f01462. Reverted https://github.com/pytorch/pytorch/pull/131863 on behalf of https://github.com/albanD due to Broke forward AD tests in main ([comment](https://github.com/pytorch/pytorch/pull/131863#issuecomment-2286432628))	2024-08-13 14:44:08 +00:00
drisspg	e61def65d5	Update fused kernels and call _safe_softmax from SDPA (#131863 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/131863 Approved by: https://github.com/jbschlosser	2024-08-13 00:51:55 +00:00
PyTorch MergeBot	31ef900a65	Revert "added persistent option to buffers and namedbuffers (#132994 )" This reverts commit 8707c6dfacaed293ddc40cbb5ecf5841568df0e6. Reverted https://github.com/pytorch/pytorch/pull/132994 on behalf of https://github.com/PaliC due to breaking internal pyre tests ([comment](https://github.com/pytorch/pytorch/pull/132994#issuecomment-2278487672))	2024-08-09 18:14:53 +00:00
Randolf Scholz	8707c6dfac	added persistent option to buffers and namedbuffers (#132994 ) Fixes #85235 Alternative to PR https://github.com/pytorch/pytorch/pull/129655, implements 3-valued option (None or bool). - adds keyword only argument `persistent: Optional[bool] = None` to `nn.Module.buffers` - updated docstrings slightly. - added test. Pull Request resolved: https://github.com/pytorch/pytorch/pull/132994 Approved by: https://github.com/mikaylagawarecki	2024-08-08 21:39:01 +00:00
Oguz Ulgen	221350e3a4	Add None return type to init -- tests (#132352 ) Pull Request resolved: https://github.com/pytorch/pytorch/pull/132352 Approved by: https://github.com/ezyang ghstack dependencies: #132335, #132351	2024-08-01 15:44:51 +00:00
ekamiti	9e473fd868	Make adding Buffers more like adding Parameters (#125971 ) Add similar semantics for creating a buffer object similar to creating a parameter. This is done by introducing a new Buffer class that can be used for type disambiguation. The underlying functionality of registering a buffer remains the same as the register_buffer method has not been changed. The persistent parameter in the Buffer type is to indicate whether a buffer object should be persistent or not. Other non-test changes have to do with getting the new Buffer type recognized by inductor and dynamo. Remaining changes are test changes to make sure that the Buffer type can be used as a drop in replacement for register_buffer as it just leads to register_buffer being called. The addition of this new functionality still allows for normal tensors to be used as buffers so these changes are intended to be backwards compatible. Fixes #35735 Co-authored-by: Mikayla Gawarecki <mikaylagawarecki@gmail.com> Pull Request resolved: https://github.com/pytorch/pytorch/pull/125971 Approved by: https://github.com/albanD, https://github.com/anijain2305, https://github.com/mlazos	2024-07-31 10:32:40 +00:00

1 2 3 4 5 ...

1592 Commits