36871622f1
[2/N] Mark unused parameters in C++ code ( #165121 )
...
This is a follow-up of #164912 that marks unused C++ parameters to improve code readability.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165121
Approved by: https://github.com/Skylion007
2025-10-15 03:04:39 +00:00
04a393507b
Fused RMSNorm implementation ( #153666 )
...
Relevant #72643
Benchmarked versus the unfused torch implementation and the torch.compile implementation. Around 9x speedup vs the unfused implementation on CUDA, and slightly faster than the inductor-compiled version on a 5090.
```py
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        norm_x = x.norm(2, dim=-1, keepdim=True)
        rms_x = norm_x * torch.rsqrt(torch.tensor(x.shape[-1], dtype=x.dtype))
        x_normed = x / (rms_x + self.eps)
        return self.scale * x_normed

def benchmark_rmsnorm_cuda(input_shape, normalized_dim, num_iterations=100, warmup_iterations=10, dtype=torch.float16):
    rms_norm_layer = torch.nn.RMSNorm(normalized_dim, device='cuda', dtype=dtype)
    input_data = torch.randn(input_shape, device='cuda', dtype=dtype)

    for _ in range(warmup_iterations):
        _ = rms_norm_layer(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = rms_norm_layer(input_data)
    end_event.record()
    torch.cuda.synchronize()

    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations
    print(f"--- RMSNorm CUDA Benchmark ---")
    print(f"Input Shape: {input_shape}")
    print(f"Normalized Dimension: {normalized_dim}")
    print(f"Benchmark Iterations: {num_iterations}")
    print(f"--- Fused Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    compiled_rms_norm = torch.compile(RMSNorm(dim=normalized_dim)).cuda()
    for _ in range(warmup_iterations):
        _ = compiled_rms_norm(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = compiled_rms_norm(input_data)
    end_event.record()
    torch.cuda.synchronize()

    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations
    print(f"--- TorchCompile Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")
    print("-" * 50)

if __name__ == '__main__':
    parameter_sets = [
        {'batch_size': 16, 'sequence_length': 256, 'hidden_features': 512, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float16},
        {'batch_size': 64, 'sequence_length': 1024, 'hidden_features': 1024, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float32},
        {'batch_size': 8, 'sequence_length': 2048, 'hidden_features': 2048, 'dtype': torch.float16},
    ]
    num_benchmark_iterations = 200
    num_warmup_iterations = 20

    for params in parameter_sets:
        batch_size = params['batch_size']
        sequence_length = params['sequence_length']
        hidden_features = params['hidden_features']
        data_type = params.get('dtype', torch.float16)

        shape = (batch_size, sequence_length, hidden_features)
        norm_dim_to_normalize = hidden_features

        print(f"Benchmarking with: BS={batch_size}, SeqLen={sequence_length}, Hidden={hidden_features}, DType={data_type}")
        benchmark_rmsnorm_cuda(input_shape=shape,
                               normalized_dim=norm_dim_to_normalize,
                               num_iterations=num_benchmark_iterations,
                               warmup_iterations=num_warmup_iterations,
                               dtype=data_type)
```
Here are the triton compile tests run on a 5090 (comparing this branch vs main):
```py
import torch
import torch.nn as nn
from torch._inductor.utils import run_and_get_code, run_fw_bw_and_get_code

torch.manual_seed(0)
device = torch.device("cuda")

for batch in range(0, 9):
    for i in range(9, 16):
        normalized_shape_arg = (2**batch, 2**i)
        input_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
        weight_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)

        model = torch.nn.functional.rms_norm
        compiled_model = torch.compile(model)
        loss = torch.randn_like(input_tensor)

        num_iter = 5
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()
        num_iter = 10
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)
        end_event.record()
        torch.cuda.synchronize()

        elapsed_time_ms = start_event.elapsed_time(end_event)
        avg_time_ms = round(elapsed_time_ms / num_iter, 5)
        print(2**batch, 2**i, avg_time_ms)
```
main
```
32 512 0.1812
32 1024 0.19021
32 2048 0.18871
32 4096 0.17019
32 8192 0.21944
32 16384 0.38871
32 32768 0.83282
64 512 0.14705
64 1024 0.13987
64 2048 0.14111
64 4096 0.21699
64 8192 0.43141
64 16384 0.90652
64 32768 2.18573
128 512 0.19361
128 1024 0.1963
128 2048 0.20122
128 4096 0.38888
128 8192 0.93795
128 16384 2.23437
128 32768 5.50079
256 512 0.16722
256 1024 0.22856
256 2048 0.39421
256 4096 0.96621
256 8192 2.48746
256 16384 5.53571
256 32768 11.97932
```
current branch
```
32 512 0.16328
32 1024 0.18104
32 2048 0.15508
32 4096 0.14356
32 8192 0.20111
32 16384 0.45974
32 32768 0.94799
64 512 0.16874
64 1024 0.18701
64 2048 0.16107
64 4096 0.20152
64 8192 0.46568
64 16384 0.96599
64 32768 2.21661
128 512 0.14982
128 1024 0.15565
128 2048 0.22241
128 4096 0.46128
128 8192 0.88883
128 16384 2.3097
128 32768 5.84448
256 512 0.14346
256 1024 0.2007
256 2048 0.45927
256 4096 0.87876
256 8192 2.10571
256 16384 5.73948
256 32768 12.98581
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153666
Approved by: https://github.com/ngimel , https://github.com/albanD
2025-07-22 22:25:44 +00:00
35f1b4ad9e
Revert "Fused RMSNorm implementation ( #153666 )"
...
This reverts commit 15ef4f28df0a14e9f0d55a57a4e2db415a303be7.
Reverted https://github.com/pytorch/pytorch/pull/153666 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking tests internally. @albanD can you please help land this change? You can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts. See D78599667 for more info ([comment](https://github.com/pytorch/pytorch/pull/153666#issuecomment-3097690935 ))
2025-07-21 17:31:42 +00:00
15ef4f28df
Fused RMSNorm implementation ( #153666 )
...
Relevant #72643
Benchmarked versus the unfused torch implementation and the torch.compile implementation. Around 9x speedup vs the unfused implementation on CUDA, and slightly faster than the inductor-compiled version on a 5090.
```py
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        norm_x = x.norm(2, dim=-1, keepdim=True)
        rms_x = norm_x * torch.rsqrt(torch.tensor(x.shape[-1], dtype=x.dtype))
        x_normed = x / (rms_x + self.eps)
        return self.scale * x_normed

def benchmark_rmsnorm_cuda(input_shape, normalized_dim, num_iterations=100, warmup_iterations=10, dtype=torch.float16):
    rms_norm_layer = torch.nn.RMSNorm(normalized_dim, device='cuda', dtype=dtype)
    input_data = torch.randn(input_shape, device='cuda', dtype=dtype)

    for _ in range(warmup_iterations):
        _ = rms_norm_layer(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = rms_norm_layer(input_data)
    end_event.record()
    torch.cuda.synchronize()

    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations
    print(f"--- RMSNorm CUDA Benchmark ---")
    print(f"Input Shape: {input_shape}")
    print(f"Normalized Dimension: {normalized_dim}")
    print(f"Benchmark Iterations: {num_iterations}")
    print(f"--- Fused Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    compiled_rms_norm = torch.compile(RMSNorm(dim=normalized_dim)).cuda()
    for _ in range(warmup_iterations):
        _ = compiled_rms_norm(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = compiled_rms_norm(input_data)
    end_event.record()
    torch.cuda.synchronize()

    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations
    print(f"--- TorchCompile Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")
    print("-" * 50)

if __name__ == '__main__':
    parameter_sets = [
        {'batch_size': 16, 'sequence_length': 256, 'hidden_features': 512, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float16},
        {'batch_size': 64, 'sequence_length': 1024, 'hidden_features': 1024, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float32},
        {'batch_size': 8, 'sequence_length': 2048, 'hidden_features': 2048, 'dtype': torch.float16},
    ]
    num_benchmark_iterations = 200
    num_warmup_iterations = 20

    for params in parameter_sets:
        batch_size = params['batch_size']
        sequence_length = params['sequence_length']
        hidden_features = params['hidden_features']
        data_type = params.get('dtype', torch.float16)

        shape = (batch_size, sequence_length, hidden_features)
        norm_dim_to_normalize = hidden_features

        print(f"Benchmarking with: BS={batch_size}, SeqLen={sequence_length}, Hidden={hidden_features}, DType={data_type}")
        benchmark_rmsnorm_cuda(input_shape=shape,
                               normalized_dim=norm_dim_to_normalize,
                               num_iterations=num_benchmark_iterations,
                               warmup_iterations=num_warmup_iterations,
                               dtype=data_type)
```
Here are the triton compile tests run on a 5090 (comparing this branch vs main):
```py
import torch
import torch.nn as nn
from torch._inductor.utils import run_and_get_code, run_fw_bw_and_get_code

torch.manual_seed(0)
device = torch.device("cuda")

for batch in range(0, 9):
    for i in range(9, 16):
        normalized_shape_arg = (2**batch, 2**i)
        input_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
        weight_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)

        model = torch.nn.functional.rms_norm
        compiled_model = torch.compile(model)
        loss = torch.randn_like(input_tensor)

        num_iter = 5
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()
        num_iter = 10
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)
        end_event.record()
        torch.cuda.synchronize()

        elapsed_time_ms = start_event.elapsed_time(end_event)
        avg_time_ms = round(elapsed_time_ms / num_iter, 5)
        print(2**batch, 2**i, avg_time_ms)
```
main
```
32 512 0.1812
32 1024 0.19021
32 2048 0.18871
32 4096 0.17019
32 8192 0.21944
32 16384 0.38871
32 32768 0.83282
64 512 0.14705
64 1024 0.13987
64 2048 0.14111
64 4096 0.21699
64 8192 0.43141
64 16384 0.90652
64 32768 2.18573
128 512 0.19361
128 1024 0.1963
128 2048 0.20122
128 4096 0.38888
128 8192 0.93795
128 16384 2.23437
128 32768 5.50079
256 512 0.16722
256 1024 0.22856
256 2048 0.39421
256 4096 0.96621
256 8192 2.48746
256 16384 5.53571
256 32768 11.97932
```
current branch
```
32 512 0.16328
32 1024 0.18104
32 2048 0.15508
32 4096 0.14356
32 8192 0.20111
32 16384 0.45974
32 32768 0.94799
64 512 0.16874
64 1024 0.18701
64 2048 0.16107
64 4096 0.20152
64 8192 0.46568
64 16384 0.96599
64 32768 2.21661
128 512 0.14982
128 1024 0.15565
128 2048 0.22241
128 4096 0.46128
128 8192 0.88883
128 16384 2.3097
128 32768 5.84448
256 512 0.14346
256 1024 0.2007
256 2048 0.45927
256 4096 0.87876
256 8192 2.10571
256 16384 5.73948
256 32768 12.98581
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153666
Approved by: https://github.com/ngimel , https://github.com/eqy , https://github.com/albanD
2025-07-18 23:24:21 +00:00
7c1f627828
Fix 'dllimport attribute ignored on inline function' ( #157670 )
...
There are lots of warnings in builds:
```
2025-07-05T16:59:46.9208806Z C:\actions-runner\_work\pytorch\pytorch\build\aten\src\ATen\core\TensorBody.h(5043,29): warning: 'at::Tensor::less_' redeclared inline; 'dllimport' attribute ignored [-Wignored-attributes]
2025-07-05T16:59:46.9209030Z 5043 | inline at::Tensor & Tensor::less_(const at::Scalar & other) const {
2025-07-05T16:59:46.9209104Z | ^
2025-07-05T16:59:46.9209671Z C:\actions-runner\_work\pytorch\pytorch\build\aten\src\ATen\core\TensorBody.h(5048,29): warning: 'at::Tensor::less_' redeclared inline; 'dllimport' attribute ignored [-Wignored-attributes]
2025-07-05T16:59:46.9209860Z 5048 | inline at::Tensor & Tensor::less_(const at::Tensor & other) const
```
This PR fixes them and turns the warning into an error.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157670
Approved by: https://github.com/albanD
2025-07-07 16:57:48 +00:00
6401d1d53d
Revert "Fused RMSNorm implementation ( #153666 )"
...
This reverts commit e1aee86646aa6d1b9cb9d34351e43936401c5efc.
Reverted https://github.com/pytorch/pytorch/pull/153666 on behalf of https://github.com/davidberard98 due to causing build failures on main branch [GH job link](https://github.com/pytorch/pytorch/actions/runs/16007148842/job/45156382001 ) [HUD commit link](e1aee86646) ([comment](https://github.com/pytorch/pytorch/pull/153666#issuecomment-3025146176 ))
2025-07-01 18:46:45 +00:00
e1aee86646
Fused RMSNorm implementation ( #153666 )
...
Relevant #72643
Benchmarked versus the unfused torch implementation and the torch.compile implementation. Around 9x speedup vs the unfused implementation on CUDA, and slightly faster than the inductor-compiled version on a 5090.
```py
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        norm_x = x.norm(2, dim=-1, keepdim=True)
        rms_x = norm_x * torch.rsqrt(torch.tensor(x.shape[-1], dtype=x.dtype))
        x_normed = x / (rms_x + self.eps)
        return self.scale * x_normed

def benchmark_rmsnorm_cuda(input_shape, normalized_dim, num_iterations=100, warmup_iterations=10, dtype=torch.float16):
    rms_norm_layer = torch.nn.RMSNorm(normalized_dim, device='cuda', dtype=dtype)
    input_data = torch.randn(input_shape, device='cuda', dtype=dtype)

    for _ in range(warmup_iterations):
        _ = rms_norm_layer(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = rms_norm_layer(input_data)
    end_event.record()
    torch.cuda.synchronize()

    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations
    print(f"--- RMSNorm CUDA Benchmark ---")
    print(f"Input Shape: {input_shape}")
    print(f"Normalized Dimension: {normalized_dim}")
    print(f"Benchmark Iterations: {num_iterations}")
    print(f"--- Fused Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    compiled_rms_norm = torch.compile(RMSNorm(dim=normalized_dim)).cuda()
    for _ in range(warmup_iterations):
        _ = compiled_rms_norm(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = compiled_rms_norm(input_data)
    end_event.record()
    torch.cuda.synchronize()

    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations
    print(f"--- TorchCompile Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")
    print("-" * 50)

if __name__ == '__main__':
    parameter_sets = [
        {'batch_size': 16, 'sequence_length': 256, 'hidden_features': 512, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float16},
        {'batch_size': 64, 'sequence_length': 1024, 'hidden_features': 1024, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float32},
        {'batch_size': 8, 'sequence_length': 2048, 'hidden_features': 2048, 'dtype': torch.float16},
    ]
    num_benchmark_iterations = 200
    num_warmup_iterations = 20

    for params in parameter_sets:
        batch_size = params['batch_size']
        sequence_length = params['sequence_length']
        hidden_features = params['hidden_features']
        data_type = params.get('dtype', torch.float16)

        shape = (batch_size, sequence_length, hidden_features)
        norm_dim_to_normalize = hidden_features

        print(f"Benchmarking with: BS={batch_size}, SeqLen={sequence_length}, Hidden={hidden_features}, DType={data_type}")
        benchmark_rmsnorm_cuda(input_shape=shape,
                               normalized_dim=norm_dim_to_normalize,
                               num_iterations=num_benchmark_iterations,
                               warmup_iterations=num_warmup_iterations,
                               dtype=data_type)
```
Here are the triton compile tests run on a 5090 (comparing this branch vs main):
```py
import torch
import torch.nn as nn
from torch._inductor.utils import run_and_get_code, run_fw_bw_and_get_code

torch.manual_seed(0)
device = torch.device("cuda")

for batch in range(0, 9):
    for i in range(9, 16):
        normalized_shape_arg = (2**batch, 2**i)
        input_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
        weight_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)

        model = torch.nn.functional.rms_norm
        compiled_model = torch.compile(model)
        loss = torch.randn_like(input_tensor)

        num_iter = 5
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()
        num_iter = 10
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)
        end_event.record()
        torch.cuda.synchronize()

        elapsed_time_ms = start_event.elapsed_time(end_event)
        avg_time_ms = round(elapsed_time_ms / num_iter, 5)
        print(2**batch, 2**i, avg_time_ms)
```
main
```
32 512 0.1812
32 1024 0.19021
32 2048 0.18871
32 4096 0.17019
32 8192 0.21944
32 16384 0.38871
32 32768 0.83282
64 512 0.14705
64 1024 0.13987
64 2048 0.14111
64 4096 0.21699
64 8192 0.43141
64 16384 0.90652
64 32768 2.18573
128 512 0.19361
128 1024 0.1963
128 2048 0.20122
128 4096 0.38888
128 8192 0.93795
128 16384 2.23437
128 32768 5.50079
256 512 0.16722
256 1024 0.22856
256 2048 0.39421
256 4096 0.96621
256 8192 2.48746
256 16384 5.53571
256 32768 11.97932
```
current branch
```
32 512 0.16328
32 1024 0.18104
32 2048 0.15508
32 4096 0.14356
32 8192 0.20111
32 16384 0.45974
32 32768 0.94799
64 512 0.16874
64 1024 0.18701
64 2048 0.16107
64 4096 0.20152
64 8192 0.46568
64 16384 0.96599
64 32768 2.21661
128 512 0.14982
128 1024 0.15565
128 2048 0.22241
128 4096 0.46128
128 8192 0.88883
128 16384 2.3097
128 32768 5.84448
256 512 0.14346
256 1024 0.2007
256 2048 0.45927
256 4096 0.87876
256 8192 2.10571
256 16384 5.73948
256 32768 12.98581
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153666
Approved by: https://github.com/ngimel
2025-07-01 18:22:24 +00:00
55e62ff74a
bf16 grouped gemm ( #150374 )
...
Enabled bf16 grouped gemm with an API similar to _scaled_group_gemm, except without the scale and fast-accum arguments. All transpose variants are enabled, unlike scaled gemm. Ideally we'd factor out a lot more code from scaled gemm; currently there's a lot of repetition between the scaled and non-scaled versions. For now, I factored out only a helper kernel that prepares arguments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150374
Approved by: https://github.com/drisspg
2025-04-06 04:53:24 +00:00
59f14d19ae
Implement gradient for the residuals of torch.linalg.lstsq ( #148526 )
...
Fixes #147543.
I have written some tests in Python using `gradcheck`. Please advise where I should put these tests.
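For reference, here is a minimal sketch of the kind of `gradcheck` test being discussed (the driver choice and setup are assumptions, not the PR's actual test):
```python
import torch
from torch.autograd import gradcheck

# Tall, (almost surely) full-rank system so that lstsq actually returns residuals.
A = torch.randn(6, 3, dtype=torch.double)
B = torch.randn(6, 2, dtype=torch.double, requires_grad=True)

def residuals_of(B):
    # 'gelsd' computes residuals for overdetermined, full-rank problems.
    return torch.linalg.lstsq(A, B, driver="gelsd").residuals

print(gradcheck(residuals_of, (B,)))  # True once the residuals gradient is implemented
```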
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148526
Approved by: https://github.com/lezcano
2025-03-10 12:35:09 +00:00
882b6af219
c10::string_view -> std::string_view in autograd ( #142354 )
...
Differential Revision: D66939966
Pull Request resolved: https://github.com/pytorch/pytorch/pull/142354
Approved by: https://github.com/Skylion007
2024-12-10 15:43:41 +00:00
96be048f06
[1/N] Avoid copy in std::get ( #141812 )
...
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141812
Approved by: https://github.com/Skylion007
2024-12-01 03:53:35 +00:00
fb26b84390
Update fused kernels and call _safe_softmax from SDPA ( #133882 )
...
# UPDATE:
This is take 3 of https://github.com/pytorch/pytorch/pull/131863, which was landed via codev but did not apply correctly.
# Summary
Changes the stance of SDPA on what to do for fully masked out rows
## Current Behavior
Several PyTorch users have expressed frustration over this issue:
- https://github.com/pytorch/pytorch/issues/41508
- https://github.com/pytorch/pytorch/issues/103749
- https://github.com/pytorch/pytorch/issues/103963
These are significant issues with extensive discussion but no satisfactory resolution. The PyTorch team's consensus, as stated here:
https://github.com/pytorch/pytorch/issues/24816#issuecomment-524415617
can be paraphrased as follows:
When passing in fully masked out rows, attention becomes ambiguous. We have two main options:
1. Uniformly attend to all values:
```python
scores[masked_out_rows] = 1 / len(row)
out[masked_out_rows] = 1 / len(row) * value
```
2. Decide that attention between no queries (masked) and no keys (masked) is meaningless:
```python
output[fully_masked_rows] = NaN
```
We went with option 2, partly because it was easier to implement, but also because people argued that users can slice the output to remove the NaNs:
``` Python
>fill_value = -float("inf")
>row0 = torch.randn(4)
>row1 = torch.tensor([fill_value for _ in range(4)])
>matrix = torch.stack([row0, row1]).requires_grad_(True)
>out = torch.softmax(matrix, 1)
>out = out[0]
>print(out)
tensor([0.5377, 0.2729, 0.0692, 0.1201])
```
Cool, problem solved. But what happens when you call backward...
```Python
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[3.0957e-08, 1.4157e-08, 7.7802e-10, 1.3713e-08],
[ nan, nan, nan, nan]])
```
Those pesky NaNs are back!
## Why do we see NaNs today?
The core of the problem revolves around using the softmax function in SDPA:
```python
> row = torch.tensor([(-float("inf")) for _ in range(4)])
> torch.softmax(row, 0)
tensor([nan, nan, nan, nan])
```
## Quick Aside: Masking in Attention
Attention itself doesn't have a concept of masking. The `sdpa` function has an argument called `attn_mask`, which would be more accurately named `attn_bias`. This is because we don't actually "mask" entries when computing attention. Instead, due to implementation details([performance](https://github.com/pytorch/pytorch/issues/25110#issuecomment-524519087 )), we add a value to the masked-out query/key pairs.
We use a large negative number (typically -inf) to decrease the attention weight, as softmax assigns more weight to larger values.
## Alternative Approaches
If we use a very large negative number instead of -inf:
```python
> row = torch.tensor([(-1e6) for _ in range(4)])
> torch.softmax(row, 0)
tensor([0.2500, 0.2500, 0.2500, 0.2500])
```
However, if users always remember to "slice" out their outputs, i.e.:
```Python
>fill_value = -1e6
>...
>out.backward(torch.ones_like(out))
>print(matrix.grad)
tensor([[-0.0563, -0.0564, 0.1613, -0.0486],
[ 0.0000, 0.0000, 0.0000, 0.0000]])
```
This would bring us back into a better state.
## A Third Option
We don't necessarily need to alter the behavior of softmax for -inf or very large negative numbers. The fundamental goal is to exclude certain query/key pairs from attention, regardless of the underlying implementation.
This PR implements the new semantic for masking w/ attention in fully masked-out rows:
```python
out[masked_out_rows] = 0
```
**Important Note**: This idea isn't entirely new. The [MaskedTensor](https://pytorch.org/tutorials/prototype/maskedtensor_overview#safe-softmax ) prototype, a tensor subclass, was designed to handle such cases. However, it remains a prototype feature and hasn't gained widespread adoption.
## Details
This PR stack does 3 things:
1. Adds a PRIVATE _safe_softmax op
2. Updates semantic for flash_cpu fused kernel
3. Updates semantic for efficient_cuda fused kernel
_safe_softmax is not supposed to be used generically and is only meant to be used within the context of SDPA. Because of this, instead of decomposing softmax and checking for -inf rows, we instead "cheat" and use nan_to_num.
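Roughly, the behavior can be sketched in eager PyTorch as follows (an illustration of the idea only, not the actual `_safe_softmax` implementation):
```python
import torch

def safe_softmax_sketch(scores, dim=-1):
    # Rows that are entirely -inf make softmax produce NaN; mapping those NaNs
    # back to 0 means fully masked-out rows attend to nothing.
    return torch.softmax(scores, dim=dim).nan_to_num(0.0)

row = torch.full((4,), float("-inf"))
print(torch.softmax(row, 0))        # tensor([nan, nan, nan, nan])
print(safe_softmax_sketch(row, 0))  # tensor([0., 0., 0., 0.])
```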
Why do I think this is okay? (please find a counterpoint if available)
There are multiple ways NaNs can emerge. For the fully masked-out rows case, nan_to_num works. But what if there were other NaNs; wouldn't this silently remove them?
The only case where this can happen is if the input itself had a NaN or an Inf.
For example:
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = torch.finfo(torch.float16).max
print(a.softmax(-1))
```
Will return
`tensor([0., 1., 0., 0.], dtype=torch.float16)`
Where
```Python
a = torch.ones([4], requires_grad=False, dtype=torch.float16)
a[1] = float("inf")
a.softmax(-1)
```
returns:
`tensor([nan, nan, nan, nan], dtype=torch.float16)`
If we don't want to allow even the possibility of "inf" or "NaN" attention scores being converted to 0, then we could implement it something like this:
```Python
max = torch.max(a, dim=-1, keepdim=True)
exp = torch.exp(a - max.values)
denom = torch.sum(exp, dim=-1, keepdim=True)
softmax = exp / denom
softmax = torch.where(max.values == float('-inf'), 0.0, softmax)
```
However, we would be paying for this in math performance.
## Why Now
I think one point that has substantially changed where PyTorch should stand on this argument is that we now have fused implementations for SDPA, and these fused implementations allow us to support this new semantic easily and performantly.
Differential Revision: [D61418679](https://our.internmc.facebook.com/intern/diff/D61418679 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133882
Approved by: https://github.com/soulitzer
2024-08-19 18:53:11 +00:00
929d2f8253
[3/N] Fix clang-tidy warnings in torch/csrc/autograd ( #133389 )
...
Follows #133295
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133389
Approved by: https://github.com/Skylion007
2024-08-16 00:57:54 +00:00
cfec69e2a1
Revert "Update fused kernels and call _safe_softmax from SDPA ( #131863 )"
...
This reverts commit caba37e99b03d2199848197de4e452b78c8c2a23.
Reverted https://github.com/pytorch/pytorch/pull/131863 on behalf of https://github.com/izaitsevfb due to breaks executorch test executorch/backends/apple/coreml:test - test_vit_skip_conv (executorch.backends.apple.coreml.test.test_coreml_partitioner.TestCoreMLPartitioner) ([comment](https://github.com/pytorch/pytorch/pull/131863#issuecomment-2291855634 ))
2024-08-15 17:55:07 +00:00
caba37e99b
Update fused kernels and call _safe_softmax from SDPA ( #131863 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131863
Approved by: https://github.com/jbschlosser , https://github.com/Chillee
2024-08-13 23:37:50 +00:00
623d0204f0
[NJT] Support Chunk backward for simple cases ( #132193 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132193
Approved by: https://github.com/soulitzer
2024-08-06 21:20:09 +00:00
798b9652f7
[6/N] Replace c10::optional with std::optional ( #130438 )
...
Follows #130408
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130438
Approved by: https://github.com/janeyx99
2024-07-11 01:15:37 +00:00
f4dcf2ae93
[1/N] Change #include <c10/util/Optional.h> to #include <optional> ( #128301 )
...
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301
Approved by: https://github.com/ezyang , https://github.com/r-barnes
2024-07-08 07:03:53 +00:00
e1c1052829
Backward support for unbind() with NJT ( #128032 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128032
Approved by: https://github.com/soulitzer
2024-06-21 14:05:23 +00:00
5ffb032be6
Revert "Backward support for unbind() with NJT ( #128032 )"
...
This reverts commit 5dc4f652bc5c068ef15130c955e3f2ffe11f4b74.
Reverted https://github.com/pytorch/pytorch/pull/128032 on behalf of https://github.com/jbschlosser due to reverting to revert parent PR ([comment](https://github.com/pytorch/pytorch/pull/128032#issuecomment-2177296325 ))
2024-06-19 00:26:40 +00:00
5dc4f652bc
Backward support for unbind() with NJT ( #128032 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128032
Approved by: https://github.com/soulitzer
2024-06-18 20:29:00 +00:00
846bb30e13
Revert "[1/N] Change #include <c10/util/Optional.h> to #include <optional> ( #128301 )"
...
This reverts commit bd72e28314d8d63bb347becb8309f5ac7761c6b5.
Reverted https://github.com/pytorch/pytorch/pull/128301 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it fails XLA build bd72e28314. Please rebase your PR before relanding because I think the failure is hidden by an unrelated broken trunk XLA failure from your current base commit ([comment](https://github.com/pytorch/pytorch/pull/128301#issuecomment-2169035822 ))
2024-06-15 01:58:20 +00:00
bd72e28314
[1/N] Change #include <c10/util/Optional.h> to #include <optional> ( #128301 )
...
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128301
Approved by: https://github.com/ezyang
2024-06-14 23:21:01 +00:00
ed327876f5
[codemod] c10::optional -> std::optional ( #126135 )
...
Generated by running the following from PyTorch root:
```
find . -regex ".*\.\(cpp\|h\|cu\|hpp\|cc\|cxx\)$" | grep -v "build/" | xargs -n 50 -P 4 perl -pi -e 's/c10::optional/std::optional/'
```
`c10::optional` is just an alias for `std::optional`. This removes usages of that alias in preparation for eliminating it entirely.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126135
Approved by: https://github.com/Skylion007 , https://github.com/malfet , https://github.com/albanD , https://github.com/aaronenyeshi
2024-05-14 19:35:51 +00:00
20f769544c
[12/N] Apply clang-tidy and fix warnings in headers of torch/csrc ( #116486 )
...
This PR follows #116751 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116486
Approved by: https://github.com/albanD
2024-01-10 08:48:14 +00:00
0aa50909f3
Revert "[12/N] Apply clang-tidy and fix warnings in headers of torch/csrc ( #116486 )"
...
This reverts commit 5aa258eb09d5ecd62aea4d2bd02bbfa5eda0d554.
Reverted https://github.com/pytorch/pytorch/pull/116486 on behalf of https://github.com/izaitsevfb due to Reverting, as it depends on https://github.com/pytorch/pytorch/pull/116353 , which has to be reverted ([comment](https://github.com/pytorch/pytorch/pull/116486#issuecomment-1876042948 ))
2024-01-03 22:18:54 +00:00
5aa258eb09
[12/N] Apply clang-tidy and fix warnings in headers of torch/csrc ( #116486 )
...
Fixes #ISSUE_NUMBER
Pull Request resolved: https://github.com/pytorch/pytorch/pull/116486
Approved by: https://github.com/albanD
2023-12-30 18:38:53 +00:00
194d57dae7
Add values backward support for sparse CSR, CSC, BSR, and BSC tensors ( #115586 )
...
Fixes https://github.com/pytorch/pytorch/issues/107286
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115586
Approved by: https://github.com/cpuhrsch , https://github.com/albanD
2023-12-14 23:09:13 +00:00
070b2d3cff
cholesky_solve_backward: speed up using output_mask ( #112981 )
...
Introduces a faster path for `cholesky_solve_backward` when the gradient with respect to the cholesky factor isn't required.
Adds test coverage in `test_linalg.py`.
# Example
## Setup
```py
import torch
torch.set_num_threads(1)
mat = torch.randn(500, 1000)
mat = mat @ mat.T
L = torch.linalg.cholesky(mat, upper=False)
rhs = torch.randn(500, 1)
rhs.requires_grad = True
sol = torch.cholesky_solve(rhs, L, upper=False).sum(dim=0)
```
## Before
```
%timeit torch.autograd.grad(sol, rhs, retain_graph=True)
2.61 ms ± 18.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
```
## After
```
%timeit torch.autograd.grad(sol, rhs, retain_graph=True)
109 µs ± 3.42 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112981
Approved by: https://github.com/lezcano
2023-11-16 18:30:57 +00:00
325e0fdfdd
Enable masked_scatter_backward for inductor ( #109642 )
...
masked_scatter_backward was previously implemented as a
CompositeExplicitAutograd, which involved a decomp that calls
masked_select, and masked_select in general produces data-dependent
shapes that inductor doesn't support. But masked_scatter_backward
reshapes the return value of masked_select such that the end result has
a static shape again.
I have converted masked_scatter_backward into an aten op to avoid this
issue.
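For intuition, a rough Python sketch of the decomposition being described (the actual op lives in ATen; the helper below is illustrative only):
```python
import torch

def masked_scatter_backward_ref(grad, mask, source_shape):
    # masked_select produces a data-dependent number of elements...
    selected = grad.masked_select(mask)
    # ...but writing them into a zero tensor of the source's (static) shape
    # restores a statically shaped result.
    out = grad.new_zeros(source_shape)
    out.reshape(-1)[: selected.numel()] = selected
    return out
```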
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109642
Approved by: https://github.com/ezyang
ghstack dependencies: #108170
2023-11-09 01:27:57 +00:00
c84c86f018
SymIntify convolution ( #111599 )
...
Signed-off-by: Edward Z. Yang <ezyang@meta.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/111599
Approved by: https://github.com/wanchaol , https://github.com/bdhirsh
2023-10-21 03:03:20 +00:00
17348b0f51
Implement split_with_sizes backward for NT ( #110647 )
...
Needed internally. Note that `split_with_sizes()` for NT is currently supported only on `dim=-1`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110647
Approved by: https://github.com/cpuhrsch , https://github.com/soulitzer
ghstack dependencies: #110646
2023-10-06 18:44:22 +00:00
48240ec62e
Make unbind() overrideable for NT subclass ( #110646 )
...
Reland of #109122. Fixed the memory leak by not saving the outputs of `unbind()` for backward. Rather, the NT sizes are saved so undefined grads can be replaced with zeros of the correct size.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110646
Approved by: https://github.com/soulitzer , https://github.com/cpuhrsch
2023-10-06 18:44:22 +00:00
b083058e45
Revert "Make unbind() overrideable for NT subclass ( #109122 )"
...
This reverts commit f5a23ca78d13c5e536f5062325c815c50be5f4c2.
Reverted https://github.com/pytorch/pytorch/pull/109122 on behalf of https://github.com/PaliC due to breaking slow tests ([comment](https://github.com/pytorch/pytorch/pull/109122#issuecomment-1741555305 ))
2023-09-29 22:41:56 +00:00
f5a23ca78d
Make unbind() overrideable for NT subclass ( #109122 )
...
Goal: avoid making unbind composite implicit so we can override it within `__torch_dispatch__()` for the NT subclass.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109122
Approved by: https://github.com/cpuhrsch , https://github.com/soulitzer
2023-09-28 01:26:22 +00:00
51d2d825ab
[3/N] apply clang-tidy in torch/csrc/autograd ( #109368 )
...
This PR applies clang-tidy fixes in torch/csrc/autograd/FunctionsManual.cpp. There are also other fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109368
Approved by: https://github.com/Skylion007
2023-09-17 07:26:59 +00:00
a14d30d8d1
[1/N] apply clang-tidy in torch/csrc/autograd ( #109032 )
...
This PR begins a new series of patches for enabling clang-tidy checks in torch/csrc/autograd.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/109032
Approved by: https://github.com/albanD , https://github.com/Skylion007
2023-09-15 23:28:43 +00:00
63dc24b4a6
Expose some APIs in FunctionsManual.h ( #104684 )
...
Fixes #ISSUE_NUMBER
Expose some APIs in FunctionsManual.h for custom devices. These can be used in codegen features.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104684
Approved by: https://github.com/albanD
2023-07-07 11:22:40 +00:00
437bc5b1b7
sparse_mask: backward support for sparse lhs (take 2) ( #104341 )
...
This is a copy of https://github.com/pytorch/pytorch/pull/95165 with some bug fixes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/104341
Approved by: https://github.com/albanD , https://github.com/pearu , https://github.com/amjames
2023-07-03 14:12:44 +00:00
5cf3a99013
sampled_addmm: backward performance improvements ( #103544 )
...
No need to do double `sparse_mask`, let's squash everything into one call!
This PR exercises https://github.com/pytorch/pytorch/pull/103750, so here is the autogenerated code for the backward pass.
```
at::Tensor sparse_sampled_addmm(c10::DispatchKeySet ks, const at::Tensor & self, const at::Tensor & mat1, const at::Tensor & mat2, const at::Scalar & beta, const at::Scalar & alpha) {
  auto& self_ = unpack(self, "self", 0);
  auto& mat1_ = unpack(mat1, "mat1", 1);
  auto& mat2_ = unpack(mat2, "mat2", 2);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( self, mat1, mat2 );
  std::shared_ptr<SparseSampledAddmmBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<SparseSampledAddmmBackward0>(new SparseSampledAddmmBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( self, mat1, mat2 ));
    grad_fn->alpha = alpha;
    grad_fn->beta = beta;
    if (grad_fn->should_compute_output(2)) {
      grad_fn->mat1_ = SavedVariable(mat1, false);
    }
    if (grad_fn->should_compute_output(1)) {
      grad_fn->mat2_ = SavedVariable(mat2, false);
    }
    grad_fn->self_ = SavedVariable(self, false);
  }
```
As you can see, we do not save tensors unless needed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103544
Approved by: https://github.com/soulitzer
2023-06-28 08:49:54 +00:00
567b5e5b28
Multioutput backward formula: allow conditional guards against saving ( #103750 )
...
Multi-output backward formulas break the ability of autogen to decide which variables have to be stored in the graph.
This PR introduces a macro `wrap_opt_if` that can be used to hint autogen about variable interdependence.
For example, the following code is generated for `_trilinear` with this modification:
```
at::Tensor _trilinear(c10::DispatchKeySet ks, const at::Tensor & i1, const at::Tensor & i2, const at::Tensor & i3, at::IntArrayRef expand1, at::IntArrayRef expand2, at::IntArrayRef expand3, at::IntArrayRef sumdim, int64_t unroll_dim) {
  auto& i1_ = unpack(i1, "i1", 0);
  auto& i2_ = unpack(i2, "i2", 1);
  auto& i3_ = unpack(i3, "i3", 2);
  [[maybe_unused]] auto _any_requires_grad = compute_requires_grad( i1, i2, i3 );
  [[maybe_unused]] auto _any_has_forward_grad_result = (isFwGradDefined(i1) || isFwGradDefined(i2) || isFwGradDefined(i3));
  std::shared_ptr<TrilinearBackward0> grad_fn;
  if (_any_requires_grad) {
    grad_fn = std::shared_ptr<TrilinearBackward0>(new TrilinearBackward0(), deleteNode);
    grad_fn->set_next_edges(collect_next_edges( i1, i2, i3 ));
    grad_fn->expand1 = expand1.vec();
    grad_fn->expand2 = expand2.vec();
    grad_fn->expand3 = expand3.vec();
    if (grad_fn->should_compute_output(1) || grad_fn->should_compute_output(2)) {
      grad_fn->i1_ = SavedVariable(i1, false);
    }
    if (grad_fn->should_compute_output(0) || grad_fn->should_compute_output(2)) {
      grad_fn->i2_ = SavedVariable(i2, false);
    }
    if (grad_fn->should_compute_output(0) || grad_fn->should_compute_output(1)) {
      grad_fn->i3_ = SavedVariable(i3, false);
    }
    grad_fn->sumdim = sumdim.vec();
  }
```
with the following backward modifications:
```
- name: _trilinear(Tensor i1, Tensor i2, Tensor i3, int[] expand1, int[] expand2, int[] expand3, int[] sumdim, int unroll_dim=1) -> Tensor
- i1, i2, i3: _trilinear_backward(grad, i1, i2, i3, expand1, expand2, expand3, sumdim, grad_input_mask)
+ i1, i2, i3: "_trilinear_backward(grad,
+ wrap_opt_if(i1, grad_input_mask[1] || grad_input_mask[2]),
+ wrap_opt_if(i2, grad_input_mask[0] || grad_input_mask[2]),
+ wrap_opt_if(i3, grad_input_mask[0] || grad_input_mask[1]),
+ expand1, expand2, expand3, sumdim, grad_input_mask)"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103750
Approved by: https://github.com/soulitzer
2023-06-27 15:12:09 +00:00
7274582390
Revert "sparse_mask: backward support for sparse lhs ( #95165 )"
...
This reverts commit f090fdf3b49164679fb6316e9ae15e0c4fb3c9eb.
Reverted https://github.com/pytorch/pytorch/pull/95165 on behalf of https://github.com/huydhn due to Sorry for reverting this. I think one of the tests test_sparse.py::TestSparseCUDA::test_sparse_mask_backward_cuda_complex128 is failing on slow gradcheck f090fdf3b4 ([comment](https://github.com/pytorch/pytorch/pull/95165#issuecomment-1604696109 ))
2023-06-23 18:40:15 +00:00
f090fdf3b4
sparse_mask: backward support for sparse lhs ( #95165 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95165
Approved by: https://github.com/pearu , https://github.com/cpuhrsch
2023-06-23 12:27:27 +00:00
1c2dfdf30c
Add renorm forward-ad ( #100798 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100798
Approved by: https://github.com/soulitzer
2023-06-05 20:25:35 +00:00
36d91b5513
Add differentiable mkldnn_rnn_layer_backward to support double backward of LSTM ( #100627 )
...
### Description
This PR is to fix #99413, which shows the limitation of double backward using oneDNN in LSTM.
This PR does not implement the double backward function itself, because that is pretty hard to spell out. Instead, it implements mkldnn_rnn_layer_backward using differentiable operations, so that double backward can be done automatically.
The backward process needs the gates and hidden states between cells within a layer. However, these intermediate variables are stored in the `workspace` and are hard to extract, so in backward we need to re-calculate them first.
A corresponding UT has been added based on the failing case in #99413. The existing UT with gradcheck and gradgradcheck added in https://github.com/pytorch/pytorch/pull/26660 cannot test LSTM using oneDNN, because it only supports the `double` datatype, which oneDNN does not support.
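For illustration, the double-backward pattern this enables looks roughly like the following (module configuration and shapes here are arbitrary assumptions):
```python
import torch

lstm = torch.nn.LSTM(input_size=4, hidden_size=3, num_layers=1)
x = torch.randn(5, 2, 4, requires_grad=True)  # (seq_len, batch, input_size)

out, _ = lstm(x)
# First-order gradient, keeping the graph so we can differentiate through it again.
(grad_x,) = torch.autograd.grad(out.sum(), x, create_graph=True)
# Second-order (double) backward through the first gradient.
grad_x.pow(2).sum().backward()
```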
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100627
Approved by: https://github.com/jgong5 , https://github.com/soulitzer
2023-05-09 12:58:57 +00:00
6af509860e
Add logcumsumexp forward-ad ( #100629 )
...
### 🤖 Generated by Copilot at 8bb6158
This pull request adds forward and backward AD support for the `logcumsumexp` operator in functorch, a library for composable function transformations. It implements a forward-mode formula and a decomposition in `derivatives.yaml`, a C++ function for computing directional derivatives in `FunctionsManual.cpp`, and updates the tests and metadata in `test_ops.py` and `common_methods_invocations.py`.
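As a usage sketch of the forward-mode support being added (not taken from the PR's tests):
```python
import torch
import torch.autograd.forward_ad as fwAD

x = torch.randn(5, dtype=torch.double)
tangent = torch.randn(5, dtype=torch.double)

with fwAD.dual_level():
    dual = fwAD.make_dual(x, tangent)
    out = torch.logcumsumexp(dual, dim=0)
    jvp = fwAD.unpack_dual(out).tangent  # directional derivative along `tangent`
```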
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100629
Approved by: https://github.com/soulitzer
2023-05-06 04:08:55 +00:00
d66add688f
Revert "Add logcumsumexp forward-ad ( #100629 )"
...
This reverts commit d658c62677b7c096b0fda3ce7a4f0accc727430e.
Reverted https://github.com/pytorch/pytorch/pull/100629 on behalf of https://github.com/clee2000 due to broke slow test, see above comment for details ([comment](https://github.com/pytorch/pytorch/pull/100629#issuecomment-1536575442 ))
2023-05-05 17:42:35 +00:00
d658c62677
Add logcumsumexp forward-ad ( #100629 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/100629
Approved by: https://github.com/soulitzer
2023-05-05 02:21:27 +00:00
f89b7c2bec
[pt2] add SymInt support for roll ( #99114 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/99114
Approved by: https://github.com/ezyang
2023-04-15 18:01:39 +00:00
ca791b6909
[MPS] Add higher order derivatives warning to max_pool2d ( #98582 )
...
The higher-order derivative calculations of `max_pool2d` require the pooling indices, but the `mps_max_pool2d` kernel doesn't compute them. Computing the indices afterwards during back propagation would be expensive and unnecessary, since users can directly call `max_pool2d` with `return_indices=True`, which computes `indices` along the way.
This PR adds a warning for it.
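For example, the indices can be obtained directly like this (a minimal illustration; the device selection is an assumption):
```python
import torch
import torch.nn.functional as F

device = "mps" if torch.backends.mps.is_available() else "cpu"
x = torch.randn(1, 1, 8, 8, device=device)
out, indices = F.max_pool2d(x, kernel_size=2, return_indices=True)
```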
Pull Request resolved: https://github.com/pytorch/pytorch/pull/98582
Approved by: https://github.com/soulitzer
2023-04-11 18:03:46 +00:00