f02e3947f6
Expand type checking to mypy strict files ( #165697 )
...
Expands Pyrefly type checking to cover the files outlined in the mypy-strict.ini configuration file.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165697
Approved by: https://github.com/ezyang
2025-10-18 04:34:45 +00:00
b2953f5643
[9/N] Apply ruff UP035 rule ( #165515 )
...
This is a follow-up to #165214, continuing to apply the ruff UP035 rule to the code base.
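For reference, a typical UP035 fix looks like this (illustrative snippet, not taken from the PR): the rule flags imports of deprecated `typing` aliases and asks for the builtin generics / `collections.abc` equivalents.
```python
# Before (flagged by UP035):
#   from typing import Callable, List
#   def apply_all(fns: List[Callable[[int], int]], x: int) -> int: ...

# After:
from collections.abc import Callable

def apply_all(fns: list[Callable[[int], int]], x: int) -> int:
    for fn in fns:
        x = fn(x)
    return x
```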
Pull Request resolved: https://github.com/pytorch/pytorch/pull/165515
Approved by: https://github.com/Lucaskabela
2025-10-17 00:09:51 +00:00
a029675f6f
More ruff SIM fixes ( #164695 )
...
This PR applies ruff `SIM` rules to more files. Most changes simplify `dict.get` calls where `None` is passed explicitly, since `None` is already the default value.
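A typical instance of that simplification (illustrative example, not from the PR; this is ruff's dict-get-with-None-default check):
```python
config = {"device": "cuda"}

dtype = config.get("dtype", None)  # flagged: None is already dict.get's default
dtype = config.get("dtype")        # simplified form
```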
Pull Request resolved: https://github.com/pytorch/pytorch/pull/164695
Approved by: https://github.com/ezyang
2025-10-09 03:24:50 +00:00
c6329524d8
Revert "Add magic TORCH_MAKE_PYBIND_ENUM_FASTER macro ( #163527 )"
...
This reverts commit 50c0550f5a5b1e35885d892081a7d5115d8b4489.
Reverted https://github.com/pytorch/pytorch/pull/163527 on behalf of https://github.com/swolchok due to breaking import torch in debug builds, see #164297 ([comment](https://github.com/pytorch/pytorch/pull/163527#issuecomment-3361919142 ))
2025-10-02 15:42:42 +00:00
50c0550f5a
Add magic TORCH_MAKE_PYBIND_ENUM_FASTER macro ( #163527 )
...
See the comment on the macro definition. In short, pybind11 3.x
added `py::native_enum`, and supporting that new way to bind enums added
overhead on the critical path for calling functions that take
regular old `py::enum_`s as arguments (for example, `__eq__`).
Differential Revision: [D82873169](https://our.internmc.facebook.com/intern/diff/D82873169/ )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163527
Approved by: https://github.com/ezyang
2025-09-26 17:59:22 +00:00
deb7ebe0a3
Revert "[Reland] Use std::string_view in torchgen ( #158625 )"
...
This reverts commit 972e409829343cc2062aeee0994a9c1c735d216a.
Reverted https://github.com/pytorch/pytorch/pull/158625 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break a couple of ExecuTorch tests for Vulkan backend ([comment](https://github.com/pytorch/pytorch/pull/158625#issuecomment-3287754275 ))
2025-09-13 07:52:50 +00:00
972e409829
[Reland] Use std::string_view in torchgen ( #158625 )
...
Reland of #157050 , which was incidentally closed.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158625
Approved by: https://github.com/albanD
2025-09-12 08:31:54 +00:00
d65ffdef3d
[ROCm] fix miopen batchnorm changing output format ( #162112 )
...
It was found that the MIOpen batchnorm integration caused the output to always be in the default contiguous memory format, even when the input was channels-last. This change also unskips a number of related unit tests.
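A minimal sketch of the behavior being fixed (illustrative only; the real coverage lives in the unskipped unit tests): with a channels-last input, the batchnorm output is expected to stay channels-last rather than silently becoming default-contiguous.
```python
import torch

bn = torch.nn.BatchNorm2d(8).cuda()
x = torch.randn(4, 8, 16, 16, device="cuda").to(memory_format=torch.channels_last)

y = bn(x)
# Before the fix on ROCm/MIOpen the output could come back default-contiguous;
# the expected behavior is that the channels-last layout is preserved:
assert y.is_contiguous(memory_format=torch.channels_last)
```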
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162112
Approved by: https://github.com/jeffdaily
Co-authored-by: Jeff Daily <jeff.daily@amd.com >
Co-authored-by: Dmitry Nikolaev <dmitry.nikolaev@amd.com >
Co-authored-by: Jithun Nair <37884920+jithunnair-amd@users.noreply.github.com >
2025-09-11 19:37:48 +00:00
189a054cfb
Remove guard_size_oblivious from default contiguity python check, and add aten.sym_is_contiguous. [attempt2] ( #160869 )
...
[relanding again after fixing internal build]
Summary:
This might cause some new DDEs on call sites that do not use is_contiguous_or_false() or sym_is_contiguous(),
but we want to find those call sites and handle them properly by calling is_contiguous_or_false() rather than is_contiguous() explicitly where appropriate.
I had to fix one issue after removing the implicit size-oblivious reasoning. Here is the context:
In https://github.com/pytorch/pytorch/pull/157472 we defined sym_is_contiguous to be the function that computes contiguity for dynamic shapes in C++. It returns a symbolic expression that represents contiguity and is guaranteed not to throw a DDE.
When callers use is_contiguous, we do sym_is_contiguous().guard_bool().
When callers use is_contiguous_or_false, we do sym_is_contiguous().guard_or_false().
One path that was not handled well was this:
```
c10::SymBool TensorImpl::sym_is_contiguous_custom(
    at::MemoryFormat memory_format) const {
  if (C10_UNLIKELY(matches_python_custom(SizesStridesPolicy::CustomStrides))) {
    return pyobj_slot_.load_pyobj_interpreter()->is_contiguous(
        this, memory_format);
  }
  return sym_is_contiguous_default(memory_format);
}
```
Namely, if sym_is_contiguous_custom is called and matches_python_custom(SizesStridesPolicy::CustomStrides) returns true, we used to call is_contiguous(this, memory_format).
That went through load_pyobj_interpreter and ended up calling the Python is_contiguous, which used the implicit size-oblivious reasoning.
Once that implicit size-oblivious reasoning is removed, the right call is
return pyobj_slot_.load_pyobj_interpreter()->sym_is_contiguous(this, memory_format);
otherwise we would get a DDE even when the caller uses sym_is_contiguous.
So I had to define sym_is_contiguous on the PyInterpreter and then override it for nested tensors.
Approved by: https://github.com/ezyang
Test Plan:
contbuild & OSS CI, see e444cd24d4
Rollback Plan:
Differential Revision: D80435179
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160869
Approved by: https://github.com/ezyang
2025-09-08 22:59:13 +00:00
29280864d9
Add new parameter for gen_pyi.py to make it more configurable. ( #161772 )
...
This is a reposting of PR #128519 .
This change is important to how we maintain PyTorch at Google.
From the previous PR:
"
This will make the script more flexible for the directory where it is executed.
...
We plan to use the deprecated_yaml from a blaze genrule that invokes pyi.py. As an input to pyi.py, the genrule requires the input file to be explicitly listed out. When we fed the value of tools/autograd/deprecated.yaml to the genrule, it failed to resolve, since tools/autograd is a package from blaze's perspective. Any file under a blaze package needs a proper blaze target in order to be accessed.
"
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161772
Approved by: https://github.com/albanD
Co-authored-by: Haifeng Jin <haifeng-jin@users.noreply.github.com >
2025-09-05 00:48:15 +00:00
b82aa3df20
Revert "Remove guard_size_oblivious from default contiguity python check, and add aten.sym_is_contiguous. ( #159197 )"
...
This reverts commit e444cd24d48b3a46f067974f2cc157f5ed27709f.
Reverted https://github.com/pytorch/pytorch/pull/159197 on behalf of https://github.com/laithsakka due to internal build failures ([comment](https://github.com/pytorch/pytorch/pull/159197#issuecomment-3195436668 ))
2025-08-18 07:22:13 +00:00
e444cd24d4
Remove guard_size_oblivious from default contiguity python check, and add aten.sym_is_contiguous. ( #159197 )
...
This might cause some new DDEs on call sites that do not use is_contiguous_or_false() or sym_is_contiguous(),
but we want to find those call sites and handle them properly by calling is_contiguous_or_false() rather than is_contiguous() explicitly where appropriate.
I had to fix one issue after removing the implicit size-oblivious reasoning. Here is the context:
In https://github.com/pytorch/pytorch/pull/157472 we defined sym_is_contiguous to be the function that computes contiguity for dynamic shapes in C++. It returns a symbolic expression that represents contiguity and is guaranteed not to throw a DDE.
When callers use is_contiguous, we do sym_is_contiguous().guard_bool().
When callers use is_contiguous_or_false, we do sym_is_contiguous().guard_or_false().
One path that was not handled well was this:
```
c10::SymBool TensorImpl::sym_is_contiguous_custom(
    at::MemoryFormat memory_format) const {
  if (C10_UNLIKELY(matches_python_custom(SizesStridesPolicy::CustomStrides))) {
    return pyobj_slot_.load_pyobj_interpreter()->is_contiguous(
        this, memory_format);
  }
  return sym_is_contiguous_default(memory_format);
}
```
Namely, if sym_is_contiguous_custom is called and matches_python_custom(SizesStridesPolicy::CustomStrides) returns true, we used to call is_contiguous(this, memory_format).
That went through load_pyobj_interpreter and ended up calling the Python is_contiguous, which used the implicit size-oblivious reasoning.
Once that implicit size-oblivious reasoning is removed, the right call is
return pyobj_slot_.load_pyobj_interpreter()->sym_is_contiguous(this, memory_format);
otherwise we would get a DDE even when the caller uses sym_is_contiguous.
So I had to define sym_is_contiguous on the PyInterpreter and then override it for nested tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/159197
Approved by: https://github.com/ezyang
2025-08-16 09:15:58 +00:00
1128f4c2a8
[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 ( #149282 )
...
Clean up tuple/tensor boilerplate in cuDNN SDPA, in preparation for nested/ragged tensor backward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282
Approved by: https://github.com/drisspg
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com >
2025-08-08 22:22:48 +00:00
7f649ed4f8
Add basic torch.hash_tensor op ( #154149 )
...
Added a `torch.hash_tensor` reduction function with a `mode` argument that defaults to reduction with xor. A usage sketch follows the list.
- The hash is always uint64.
- Integers are cast to uint64 before performing the xor_sum reduction.
- Floats are upcast to double and then bitcast to uint64 before performing the xor_sum reduction.
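A minimal usage sketch based on the description above (only the defaulted `mode` argument is assumed; nothing else about the signature is):
```python
import torch

t = torch.arange(16)          # int64 input; cast to uint64 internally per the notes above
h = torch.hash_tensor(t)      # xor-sum reduction by default
print(h, h.dtype)             # expected dtype: torch.uint64
```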
Pull Request resolved: https://github.com/pytorch/pytorch/pull/154149
Approved by: https://github.com/albanD
2025-07-23 22:28:03 +00:00
04a393507b
Fused RMSNorm implementation ( #153666 )
...
Relevant: #72643
Benchmarked versus the unfused torch implementation and a torch.compile implementation: around 9x speedup over the unfused implementation on CUDA, and slightly faster than the inductor-compiled version on a 5090.
```py
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        norm_x = x.norm(2, dim=-1, keepdim=True)
        rms_x = norm_x * torch.rsqrt(torch.tensor(x.shape[-1], dtype=x.dtype))
        x_normed = x / (rms_x + self.eps)
        return self.scale * x_normed


def benchmark_rmsnorm_cuda(input_shape, normalized_dim, num_iterations=100, warmup_iterations=10, dtype=torch.float16):
    rms_norm_layer = torch.nn.RMSNorm(normalized_dim, device='cuda', dtype=dtype)
    input_data = torch.randn(input_shape, device='cuda', dtype=dtype)

    for _ in range(warmup_iterations):
        _ = rms_norm_layer(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = rms_norm_layer(input_data)
    end_event.record()
    torch.cuda.synchronize()

    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations
    print(f"--- RMSNorm CUDA Benchmark ---")
    print(f"Input Shape: {input_shape}")
    print(f"Normalized Dimension: {normalized_dim}")
    print(f"Benchmark Iterations: {num_iterations}")
    print(f"--- Fused Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    compiled_rms_norm = torch.compile(RMSNorm(dim=normalized_dim)).cuda()
    for _ in range(warmup_iterations):
        _ = compiled_rms_norm(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = compiled_rms_norm(input_data)
    end_event.record()
    torch.cuda.synchronize()

    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations
    print(f"--- TorchCompile Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    print("-" * 50)


if __name__ == '__main__':
    parameter_sets = [
        {'batch_size': 16, 'sequence_length': 256, 'hidden_features': 512, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float16},
        {'batch_size': 64, 'sequence_length': 1024, 'hidden_features': 1024, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float32},
        {'batch_size': 8, 'sequence_length': 2048, 'hidden_features': 2048, 'dtype': torch.float16},
    ]

    num_benchmark_iterations = 200
    num_warmup_iterations = 20

    for params in parameter_sets:
        batch_size = params['batch_size']
        sequence_length = params['sequence_length']
        hidden_features = params['hidden_features']
        data_type = params.get('dtype', torch.float16)

        shape = (batch_size, sequence_length, hidden_features)
        norm_dim_to_normalize = hidden_features

        print(f"Benchmarking with: BS={batch_size}, SeqLen={sequence_length}, Hidden={hidden_features}, DType={data_type}")
        benchmark_rmsnorm_cuda(input_shape=shape,
                               normalized_dim=norm_dim_to_normalize,
                               num_iterations=num_benchmark_iterations,
                               warmup_iterations=num_warmup_iterations,
                               dtype=data_type)
```
Here are the triton compile tests run on a 5090 (comparing this branch vs. main):
```py
import torch
import torch.nn as nn
from torch._inductor.utils import run_and_get_code, run_fw_bw_and_get_code

torch.manual_seed(0)
device = torch.device("cuda")

for batch in range(0, 9):
    for i in range(9, 16):
        normalized_shape_arg = (2**batch, 2**i)
        input_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
        weight_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)

        model = torch.nn.functional.rms_norm
        compiled_model = torch.compile(model)

        loss = torch.randn_like(input_tensor)

        num_iter = 5
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()

        num_iter = 10
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        end_event.record()
        torch.cuda.synchronize()

        elapsed_time_ms = start_event.elapsed_time(end_event)
        avg_time_ms = round(elapsed_time_ms / num_iter, 5)
        print(2**batch, 2**i, avg_time_ms)
```
main
```
32 512 0.1812
32 1024 0.19021
32 2048 0.18871
32 4096 0.17019
32 8192 0.21944
32 16384 0.38871
32 32768 0.83282
64 512 0.14705
64 1024 0.13987
64 2048 0.14111
64 4096 0.21699
64 8192 0.43141
64 16384 0.90652
64 32768 2.18573
128 512 0.19361
128 1024 0.1963
128 2048 0.20122
128 4096 0.38888
128 8192 0.93795
128 16384 2.23437
128 32768 5.50079
256 512 0.16722
256 1024 0.22856
256 2048 0.39421
256 4096 0.96621
256 8192 2.48746
256 16384 5.53571
256 32768 11.97932
```
current branch
```
32 512 0.16328
32 1024 0.18104
32 2048 0.15508
32 4096 0.14356
32 8192 0.20111
32 16384 0.45974
32 32768 0.94799
64 512 0.16874
64 1024 0.18701
64 2048 0.16107
64 4096 0.20152
64 8192 0.46568
64 16384 0.96599
64 32768 2.21661
128 512 0.14982
128 1024 0.15565
128 2048 0.22241
128 4096 0.46128
128 8192 0.88883
128 16384 2.3097
128 32768 5.84448
256 512 0.14346
256 1024 0.2007
256 2048 0.45927
256 4096 0.87876
256 8192 2.10571
256 16384 5.73948
256 32768 12.98581
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153666
Approved by: https://github.com/ngimel , https://github.com/albanD
2025-07-22 22:25:44 +00:00
35f1b4ad9e
Revert "Fused RMSNorm implementation ( #153666 )"
...
This reverts commit 15ef4f28df0a14e9f0d55a57a4e2db415a303be7.
Reverted https://github.com/pytorch/pytorch/pull/153666 on behalf of https://github.com/ZainRizvi due to Sorry but this is breaking tests internally. @albanD can you please help land this change? You can follow the instructions here: https://fburl.com/fixing-ghfirst-reverts . See D78599667 for more info ([comment](https://github.com/pytorch/pytorch/pull/153666#issuecomment-3097690935 ))
2025-07-21 17:31:42 +00:00
15ef4f28df
Fused RMSNorm implementation ( #153666 )
...
Relevant: #72643
Benchmarked versus the unfused torch implementation and a torch.compile implementation: around 9x speedup over the unfused implementation on CUDA, and slightly faster than the inductor-compiled version on a 5090.
```py
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        norm_x = x.norm(2, dim=-1, keepdim=True)
        rms_x = norm_x * torch.rsqrt(torch.tensor(x.shape[-1], dtype=x.dtype))
        x_normed = x / (rms_x + self.eps)
        return self.scale * x_normed


def benchmark_rmsnorm_cuda(input_shape, normalized_dim, num_iterations=100, warmup_iterations=10, dtype=torch.float16):
    rms_norm_layer = torch.nn.RMSNorm(normalized_dim, device='cuda', dtype=dtype)
    input_data = torch.randn(input_shape, device='cuda', dtype=dtype)

    for _ in range(warmup_iterations):
        _ = rms_norm_layer(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = rms_norm_layer(input_data)
    end_event.record()
    torch.cuda.synchronize()

    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations
    print(f"--- RMSNorm CUDA Benchmark ---")
    print(f"Input Shape: {input_shape}")
    print(f"Normalized Dimension: {normalized_dim}")
    print(f"Benchmark Iterations: {num_iterations}")
    print(f"--- Fused Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    compiled_rms_norm = torch.compile(RMSNorm(dim=normalized_dim)).cuda()
    for _ in range(warmup_iterations):
        _ = compiled_rms_norm(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = compiled_rms_norm(input_data)
    end_event.record()
    torch.cuda.synchronize()

    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations
    print(f"--- TorchCompile Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    print("-" * 50)


if __name__ == '__main__':
    parameter_sets = [
        {'batch_size': 16, 'sequence_length': 256, 'hidden_features': 512, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float16},
        {'batch_size': 64, 'sequence_length': 1024, 'hidden_features': 1024, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float32},
        {'batch_size': 8, 'sequence_length': 2048, 'hidden_features': 2048, 'dtype': torch.float16},
    ]

    num_benchmark_iterations = 200
    num_warmup_iterations = 20

    for params in parameter_sets:
        batch_size = params['batch_size']
        sequence_length = params['sequence_length']
        hidden_features = params['hidden_features']
        data_type = params.get('dtype', torch.float16)

        shape = (batch_size, sequence_length, hidden_features)
        norm_dim_to_normalize = hidden_features

        print(f"Benchmarking with: BS={batch_size}, SeqLen={sequence_length}, Hidden={hidden_features}, DType={data_type}")
        benchmark_rmsnorm_cuda(input_shape=shape,
                               normalized_dim=norm_dim_to_normalize,
                               num_iterations=num_benchmark_iterations,
                               warmup_iterations=num_warmup_iterations,
                               dtype=data_type)
```
Here are the triton compile tests run on a 5090 (comparing this branch vs. main):
```py
import torch
import torch.nn as nn
from torch._inductor.utils import run_and_get_code, run_fw_bw_and_get_code

torch.manual_seed(0)
device = torch.device("cuda")

for batch in range(0, 9):
    for i in range(9, 16):
        normalized_shape_arg = (2**batch, 2**i)
        input_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
        weight_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)

        model = torch.nn.functional.rms_norm
        compiled_model = torch.compile(model)

        loss = torch.randn_like(input_tensor)

        num_iter = 5
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()

        num_iter = 10
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        end_event.record()
        torch.cuda.synchronize()

        elapsed_time_ms = start_event.elapsed_time(end_event)
        avg_time_ms = round(elapsed_time_ms / num_iter, 5)
        print(2**batch, 2**i, avg_time_ms)
```
main
```
32 512 0.1812
32 1024 0.19021
32 2048 0.18871
32 4096 0.17019
32 8192 0.21944
32 16384 0.38871
32 32768 0.83282
64 512 0.14705
64 1024 0.13987
64 2048 0.14111
64 4096 0.21699
64 8192 0.43141
64 16384 0.90652
64 32768 2.18573
128 512 0.19361
128 1024 0.1963
128 2048 0.20122
128 4096 0.38888
128 8192 0.93795
128 16384 2.23437
128 32768 5.50079
256 512 0.16722
256 1024 0.22856
256 2048 0.39421
256 4096 0.96621
256 8192 2.48746
256 16384 5.53571
256 32768 11.97932
```
current branch
```
32 512 0.16328
32 1024 0.18104
32 2048 0.15508
32 4096 0.14356
32 8192 0.20111
32 16384 0.45974
32 32768 0.94799
64 512 0.16874
64 1024 0.18701
64 2048 0.16107
64 4096 0.20152
64 8192 0.46568
64 16384 0.96599
64 32768 2.21661
128 512 0.14982
128 1024 0.15565
128 2048 0.22241
128 4096 0.46128
128 8192 0.88883
128 16384 2.3097
128 32768 5.84448
256 512 0.14346
256 1024 0.2007
256 2048 0.45927
256 4096 0.87876
256 8192 2.10571
256 16384 5.73948
256 32768 12.98581
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153666
Approved by: https://github.com/ngimel , https://github.com/eqy , https://github.com/albanD
2025-07-18 23:24:21 +00:00
bfe5674e22
Revert "[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90
, sm100
( #149282 )"
...
This reverts commit 0797b2b6a80cf70a7accc3d5413186e7693d4451.
Reverted https://github.com/pytorch/pytorch/pull/149282 on behalf of https://github.com/wdvr due to reverting as discussed with @drisspg - @eqy please reach out to @drisspg for more info ([comment](https://github.com/pytorch/pytorch/pull/149282#issuecomment-3084759671 ))
2025-07-17 16:55:55 +00:00
0797b2b6a8
[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 ( #149282 )
...
Clean up tuple/tensor boilerplate in cuDNN SDPA, in preparation for nested/ragged tensor backward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282
Approved by: https://github.com/drisspg
Co-authored-by: Aaron Gokaslan <aaronGokaslan@gmail.com >
2025-07-11 16:07:54 +00:00
7cfd054075
[attempt 2] Compute contiguity symbolically to avoid dde, and introduce c++ sym_is_contiguous ( #157472 )
...
Summary:
When we compute contiguity for a tensor with dynamic shapes, we first:
1) Try to compute it without guarding.
2) If all shapes are hinted, compute it, potentially adding guards.
3) If any input is not hinted, compute it symbolically.
sym_is_contiguous returns a SymBool that is then either evaluated, or guard_or_false can be called
on it to avoid data-dependent errors. For example:
bool is_contiguous = input.sym_is_contiguous().guard_or_false(__FILE__, __LINE__);
is_contiguous_or_false is a helper function that does exactly that.
In this PR I only handle default contiguity; changes for other formats such as channels_last will follow.
We use this pattern in several locations in this PR to avoid DDEs.
Test Plan:
contbuild & OSS CI,
Rollback Plan:
Reviewed By: malfet
Differential Revision: D77639021
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157472
Approved by: https://github.com/aorenste
2025-07-02 23:12:29 +00:00
c6a27bae36
Revert "[do not revert] Compute contiguity symbolically to avoid dde, and introduce c++ sym_is_contiguous ( #155590 )"
...
This reverts commit d0a9629435aaceb5acbf31aad70f2109cb8a3ea2.
Reverted https://github.com/pytorch/pytorch/pull/155590 on behalf of https://github.com/laithsakka due to was asked by to land this internally ([comment](https://github.com/pytorch/pytorch/pull/155590#issuecomment-3025796794 ))
2025-07-01 22:58:14 +00:00
d0a9629435
[do not revert] Compute contiguity symbolically to avoid dde, and introduce c++ sym_is_contiguous ( #155590 )
...
When we compute contiguity for a tensor with dynamic shapes, we first:
1) Try to compute it without guarding.
2) If all shapes are hinted, compute it, potentially adding guards.
3) If any input is not hinted, compute it symbolically.
sym_is_contiguous returns a SymBool that is then either evaluated, or guard_or_false can be called
on it to avoid data-dependent errors. For example:
bool is_contiguous = input.sym_is_contiguous().guard_or_false(__FILE__, __LINE__);
is_contiguous_or_false is a helper function that does exactly that.
In this PR I only handle default contiguity; changes for other formats such as channels_last will follow.
We use this pattern in several locations in this PR to avoid DDEs.
Differential Revision: [D77183032](https://our.internmc.facebook.com/intern/diff/D77183032 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155590
Approved by: https://github.com/ezyang
2025-07-01 21:39:38 +00:00
6401d1d53d
Revert "Fused RMSNorm implementation ( #153666 )"
...
This reverts commit e1aee86646aa6d1b9cb9d34351e43936401c5efc.
Reverted https://github.com/pytorch/pytorch/pull/153666 on behalf of https://github.com/davidberard98 due to causing build failures on main branch [GH job link](https://github.com/pytorch/pytorch/actions/runs/16007148842/job/45156382001 ) [HUD commit link](e1aee86646) ([comment](https://github.com/pytorch/pytorch/pull/153666#issuecomment-3025146176 ))
2025-07-01 18:46:45 +00:00
e1aee86646
Fused RMSNorm implementation ( #153666 )
...
Relevant: #72643
Benchmarked versus the unfused torch implementation and a torch.compile implementation: around 9x speedup over the unfused implementation on CUDA, and slightly faster than the inductor-compiled version on a 5090.
```py
import torch
import torch.nn as nn


class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.scale = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        norm_x = x.norm(2, dim=-1, keepdim=True)
        rms_x = norm_x * torch.rsqrt(torch.tensor(x.shape[-1], dtype=x.dtype))
        x_normed = x / (rms_x + self.eps)
        return self.scale * x_normed


def benchmark_rmsnorm_cuda(input_shape, normalized_dim, num_iterations=100, warmup_iterations=10, dtype=torch.float16):
    rms_norm_layer = torch.nn.RMSNorm(normalized_dim, device='cuda', dtype=dtype)
    input_data = torch.randn(input_shape, device='cuda', dtype=dtype)

    for _ in range(warmup_iterations):
        _ = rms_norm_layer(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = rms_norm_layer(input_data)
    end_event.record()
    torch.cuda.synchronize()

    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations
    print(f"--- RMSNorm CUDA Benchmark ---")
    print(f"Input Shape: {input_shape}")
    print(f"Normalized Dimension: {normalized_dim}")
    print(f"Benchmark Iterations: {num_iterations}")
    print(f"--- Fused Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    compiled_rms_norm = torch.compile(RMSNorm(dim=normalized_dim)).cuda()
    for _ in range(warmup_iterations):
        _ = compiled_rms_norm(input_data)
    torch.cuda.synchronize()

    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    start_event.record()
    for _ in range(num_iterations):
        _ = compiled_rms_norm(input_data)
    end_event.record()
    torch.cuda.synchronize()

    elapsed_time_ms = start_event.elapsed_time(end_event)
    avg_time_ms = elapsed_time_ms / num_iterations
    print(f"--- TorchCompile Implementation ---")
    print(f"Average Time per Iteration: {avg_time_ms:.4f} ms")
    print(f"Total Time for {num_iterations} Iterations: {elapsed_time_ms:.3f} ms")

    print("-" * 50)


if __name__ == '__main__':
    parameter_sets = [
        {'batch_size': 16, 'sequence_length': 256, 'hidden_features': 512, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float16},
        {'batch_size': 64, 'sequence_length': 1024, 'hidden_features': 1024, 'dtype': torch.float16},
        {'batch_size': 32, 'sequence_length': 512, 'hidden_features': 768, 'dtype': torch.float32},
        {'batch_size': 8, 'sequence_length': 2048, 'hidden_features': 2048, 'dtype': torch.float16},
    ]

    num_benchmark_iterations = 200
    num_warmup_iterations = 20

    for params in parameter_sets:
        batch_size = params['batch_size']
        sequence_length = params['sequence_length']
        hidden_features = params['hidden_features']
        data_type = params.get('dtype', torch.float16)

        shape = (batch_size, sequence_length, hidden_features)
        norm_dim_to_normalize = hidden_features

        print(f"Benchmarking with: BS={batch_size}, SeqLen={sequence_length}, Hidden={hidden_features}, DType={data_type}")
        benchmark_rmsnorm_cuda(input_shape=shape,
                               normalized_dim=norm_dim_to_normalize,
                               num_iterations=num_benchmark_iterations,
                               warmup_iterations=num_warmup_iterations,
                               dtype=data_type)
```
Here are the triton compile tests run on a 5090 (comparing this branch vs. main):
```py
import torch
import torch.nn as nn
from torch._inductor.utils import run_and_get_code, run_fw_bw_and_get_code

torch.manual_seed(0)
device = torch.device("cuda")

for batch in range(0, 9):
    for i in range(9, 16):
        normalized_shape_arg = (2**batch, 2**i)
        input_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)
        weight_tensor = torch.randn(2**batch, 2**i, device=device, requires_grad=True)

        model = torch.nn.functional.rms_norm
        compiled_model = torch.compile(model)

        loss = torch.randn_like(input_tensor)

        num_iter = 5
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        start_event = torch.cuda.Event(enable_timing=True)
        end_event = torch.cuda.Event(enable_timing=True)
        start_event.record()

        num_iter = 10
        for j in range(num_iter):
            output = compiled_model(input_tensor, normalized_shape_arg, weight_tensor)
            output.backward(loss)

        end_event.record()
        torch.cuda.synchronize()

        elapsed_time_ms = start_event.elapsed_time(end_event)
        avg_time_ms = round(elapsed_time_ms / num_iter, 5)
        print(2**batch, 2**i, avg_time_ms)
```
main
```
32 512 0.1812
32 1024 0.19021
32 2048 0.18871
32 4096 0.17019
32 8192 0.21944
32 16384 0.38871
32 32768 0.83282
64 512 0.14705
64 1024 0.13987
64 2048 0.14111
64 4096 0.21699
64 8192 0.43141
64 16384 0.90652
64 32768 2.18573
128 512 0.19361
128 1024 0.1963
128 2048 0.20122
128 4096 0.38888
128 8192 0.93795
128 16384 2.23437
128 32768 5.50079
256 512 0.16722
256 1024 0.22856
256 2048 0.39421
256 4096 0.96621
256 8192 2.48746
256 16384 5.53571
256 32768 11.97932
```
current branch
```
32 512 0.16328
32 1024 0.18104
32 2048 0.15508
32 4096 0.14356
32 8192 0.20111
32 16384 0.45974
32 32768 0.94799
64 512 0.16874
64 1024 0.18701
64 2048 0.16107
64 4096 0.20152
64 8192 0.46568
64 16384 0.96599
64 32768 2.21661
128 512 0.14982
128 1024 0.15565
128 2048 0.22241
128 4096 0.46128
128 8192 0.88883
128 16384 2.3097
128 32768 5.84448
256 512 0.14346
256 1024 0.2007
256 2048 0.45927
256 4096 0.87876
256 8192 2.10571
256 16384 5.73948
256 32768 12.98581
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153666
Approved by: https://github.com/ngimel
2025-07-01 18:22:24 +00:00
1586521461
Revert "Compute contiguity symbolically to avoid dde, and introduce c++ sym_is_contiguous ( #155590 )"
...
This reverts commit 2c76f31221e117b217b8a6a96a5405f626d2218a.
Reverted https://github.com/pytorch/pytorch/pull/155590 on behalf of https://github.com/jeanschmidt due to Breaking 1000s of internal builds, it can't be properly landed internally, there are no options except revert and codev. ([comment](https://github.com/pytorch/pytorch/pull/155590#issuecomment-3023503929 ))
2025-07-01 11:23:00 +00:00
d5e6f42094
Revert "Use std::string_view in torchgen ( #157050 )"
...
This reverts commit 064288cbab94c9931ca2296a2b9723e864f9050a.
Reverted https://github.com/pytorch/pytorch/pull/157050 on behalf of https://github.com/jeanschmidt due to Seems to have broken internal builds, more details on D77449943. @ezyang may I count on your help to get those changes merged? ([comment](https://github.com/pytorch/pytorch/pull/157050#issuecomment-3020222668 ))
2025-06-30 18:08:54 +00:00
064288cbab
Use std::string_view in torchgen ( #157050 )
...
Let the generated code use std::string_view.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/157050
Approved by: https://github.com/ezyang
2025-06-27 06:36:10 +00:00
2c76f31221
Compute contiguity symbolically to avoid dde, and introduce c++ sym_is_contiguous ( #155590 )
...
When we compute contiguity for a tensor with dynamic shapes, we first:
1) Try to compute it without guarding.
2) If all shapes are hinted, compute it, potentially adding guards.
3) If any input is not hinted, compute it symbolically.
sym_is_contiguous returns a SymBool that is then either evaluated, or guard_or_false can be called
on it to avoid data-dependent errors. For example:
bool is_contiguous = input.sym_is_contiguous().guard_or_false(__FILE__, __LINE__);
is_contiguous_or_false is a helper function that does exactly that.
In this PR I only handle default contiguity; changes for other formats such as channels_last will follow.
We use this pattern in several locations in this PR to avoid DDEs.
Differential Revision: [D77183032](https://our.internmc.facebook.com/intern/diff/D77183032 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155590
Approved by: https://github.com/ezyang
2025-06-27 04:59:52 +00:00
a69785b3ec
[BE] fix typos in tools/ ( #156082 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156082
Approved by: https://github.com/soulitzer
ghstack dependencies: #156079
2025-06-17 19:25:50 +00:00
8817e5ac80
Render Example: and not Example:: in docs ( #153978 )
...
Everything here is a grep, except the changes in tools/autograd/load_derivatives.py, which I corrected manually.
The correct notation is:
```
Example::

    >>> ...
```
It is common and wrong to have:
```
Example::
    >>> ...
```
In the wrong example, we get these pesky double colons in the rendered output.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/153978
Approved by: https://github.com/soulitzer , https://github.com/malfet
2025-05-21 01:03:26 +00:00
014726d9d3
[torchgen] Refactor torchgen.utils.FileManager to accept pathlib.Path ( #150726 )
...
This PR allows `FileManager` to accept `pathlib.Path` as arguments while keeping the original `str` path support.
This allows us to simplify the code (a short sketch follows the list below), such as:
1. `os.path.join(..., ...)` with `Path.__truediv__(..., ...)` (the `/` operator).
95a5958db4/torchgen/utils.py (L155)
95a5958db4/torchgen/utils.py (L176)
2. `os.path.basename(...)` with `Path(...).name`.
95a5958db4/torchgen/utils.py (L161)
3. Manual file extension split with `Path(...).with_stem(new_stem)`
95a5958db4/torchgen/utils.py (L241-L256)
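A small sketch of the idioms this enables (the file names below are made up for illustration):
```python
from pathlib import Path

out_dir = Path("torchgen") / "packaged" / "ATen"            # replaces os.path.join(...)
template = out_dir / "native_functions.yaml"

print(template.name)                                         # replaces os.path.basename(...)
print(template.with_stem(template.stem + "_generated"))      # replaces manual stem/suffix splitting
```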
------
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150726
Approved by: https://github.com/aorenste
2025-05-15 02:52:24 +00:00
f363a3f51a
Revert "[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90
, sm100
( #149282 )"
...
This reverts commit 9386701b51aadce951bf38daf497b0257a3f2211.
Reverted https://github.com/pytorch/pytorch/pull/149282 on behalf of https://github.com/jeanschmidt due to Breaking internal builds, see [D74729259](https://www.internalfb.com/diff/D74729259 ). @drisspg may you help out the author have their PR merged? ([comment](https://github.com/pytorch/pytorch/pull/149282#issuecomment-2881546951 ))
2025-05-14 20:53:49 +00:00
9386701b51
[cuDNN][SDPA] cuDNN SDPA refactor/cleanup, nested tensor backward, test priority bump for sm90, sm100 ( #149282 )
...
Clean up tuple/tensor boilerplate in cuDNN SDPA, in preparation for nested/ragged tensor backward.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149282
Approved by: https://github.com/drisspg
2025-05-14 01:39:24 +00:00
22d1359bc6
Move warning from item to specific number conversions ( #152709 )
...
Follow-up to https://github.com/pytorch/pytorch/pull/143261 so that a plain .item() call does not warn.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152709
Approved by: https://github.com/malfet , https://github.com/ngimel
2025-05-05 20:46:05 +00:00
45efa1aaa8
[3/N] Use internal linkage in C++ files ( #151297 )
...
Follows #151070 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151297
Approved by: https://github.com/Skylion007
2025-05-05 17:48:39 +00:00
1c04ea4e59
Revert "[torchgen] Refactor torchgen.utils.FileManager
to accept pathlib.Path
( #150726 )"
...
This reverts commit 4b5b1adb21f5d7d66945d78a1f89d2f9d86f15bb.
Reverted https://github.com/pytorch/pytorch/pull/150726 on behalf of https://github.com/malfet due to This breaks Windows builds, see a765e2ddda/1 ([comment](https://github.com/pytorch/pytorch/pull/150726#issuecomment-2845858846 ))
2025-05-01 21:52:35 +00:00
4b5b1adb21
[torchgen] Refactor torchgen.utils.FileManager to accept pathlib.Path ( #150726 )
...
This PR allows `FileManager` to accept `pathlib.Path` as arguments while keeping the original `str` path support.
This allows us to simplify the code such as:
1. `os.path.join(..., ...)` with `Path.__truediv__(..., ...)` (the `/` operator).
95a5958db4/torchgen/utils.py (L155)
95a5958db4/torchgen/utils.py (L176)
2. `os.path.basename(...)` with `Path(...).name`.
95a5958db4/torchgen/utils.py (L161)
3. Manual file extension split with `Path(...).with_stem(new_stem)`
95a5958db4/torchgen/utils.py (L241-L256)
------
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150726
Approved by: https://github.com/zou3519
2025-05-01 17:43:16 +00:00
632b89af43
[dynamic shapes] support SymInt inputs for kthvalue ( #152151 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/152151
Approved by: https://github.com/tugsbayasgalan , https://github.com/malfet
2025-05-01 03:47:23 +00:00
017a6bd593
add min/max_seqlen to non_differentiable ( #151750 )
...
Fixes #148988
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151750
Approved by: https://github.com/soulitzer
2025-04-22 21:46:02 +00:00
55e62ff74a
bf16 grouped gemm ( #150374 )
...
Enabled bf16 grouped gemm with an API similar to _scaled_group_gemm, except without the scale and fast-accum arguments. All transpose variants are enabled, unlike scaled gemm. Ideally we'd factor out a lot more code from scaled gemm; currently there's a lot of repetition between the scaled and non-scaled versions. For now, I factored out only a helper kernel that prepares arguments.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150374
Approved by: https://github.com/drisspg
2025-04-06 04:53:24 +00:00
5005e1bc47
support multinomial for dynamic num_samples ( #149463 )
...
Test Plan: added test
Fixes #149048
Differential Revision: D71434914
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149463
Approved by: https://github.com/pianpwk
2025-03-19 23:15:29 +00:00
457ff9b7ae
[reland][ca] side-effect free initial trace: compiled_args ( #148376 )
...
This reverts commit ea12fc8a9ff7da808e0b661ca07e9d4ce75d04bc.
Reland of https://github.com/pytorch/pytorch/pull/147804 ; there was a bad import inserted by my linter.
Differential Revision: [D70582747](https://our.internmc.facebook.com/intern/diff/D70582747 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148376
Approved by: https://github.com/jansel
2025-03-11 01:57:36 +00:00
59f14d19ae
Implement gradient for the residuals of torch.linalg.lstsq ( #148526 )
...
Fixes #147543 .
I have written some tests in Python using `gradcheck` (a sketch of this kind of test is shown below). Please advise where I should put these tests.
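A hedged sketch of the kind of `gradcheck` test described above (the driver choice and shapes are assumptions; residuals are only populated for tall, full-rank systems with a driver that computes them):
```python
import torch
from torch.autograd import gradcheck

# double precision inputs, as gradcheck recommends
A = torch.randn(6, 3, dtype=torch.double, requires_grad=True)
b = torch.randn(6, 2, dtype=torch.double, requires_grad=True)

def residuals(A, b):
    # "gelsd" is assumed here so that residuals are actually returned for m > n
    return torch.linalg.lstsq(A, b, driver="gelsd").residuals

print(gradcheck(residuals, (A, b)))  # compares analytic vs. numerical gradients of the residuals
```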
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148526
Approved by: https://github.com/lezcano
2025-03-10 12:35:09 +00:00
ea12fc8a9f
Revert D70262395 ( #148164 )
...
Summary:
This reverts #147804 due to internal revert.
---
This diff reverts D70262395
Reviewed By: RossMcKenzie
Differential Revision: D70318024
@diff-train-skip-merge
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148164
Approved by: https://github.com/xmfan
2025-02-28 06:39:48 +00:00
c73a92fbf5
[BE][CI] bump ruff to 0.9.2: multiline assert statements ( #144546 )
...
Reference: https://docs.astral.sh/ruff/formatter/black/#assert-statements
> Unlike Black, Ruff prefers breaking the message over breaking the assertion, similar to how both Ruff and Black prefer breaking the assignment value over breaking the assignment target:
>
> ```python
> # Input
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
>
> # Black
> assert (
>     len(policy_types) >= priority + num_duplicates
> ), f"This tests needs at least {priority+num_duplicates} many types."
>
> # Ruff
> assert len(policy_types) >= priority + num_duplicates, (
>     f"This tests needs at least {priority + num_duplicates} many types."
> )
> ```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144546
Approved by: https://github.com/malfet
2025-02-27 20:46:16 +00:00
fd1220e386
[ca] side-effect free initial trace: compiled_args ( #147804 )
...
Const methods to prevent accidental mutation; changes are mainly in Error nodes and PyNode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147804
Approved by: https://github.com/jansel
ghstack dependencies: #147242 , #147796
2025-02-26 16:37:27 +00:00
143f0f0006
Revert "[ca] side-effect free inital trace: compiled_args ( #147804 )"
...
This reverts commit ec768d8dc04b334e01db1a90e4e6646e4e867e67.
Reverted https://github.com/pytorch/pytorch/pull/147804 on behalf of https://github.com/wdvr due to failing tests in the slow workflow, see below ([comment](https://github.com/pytorch/pytorch/pull/147804#issuecomment-2683594740 ))
2025-02-26 00:31:40 +00:00
3ecfe6be25
[Submodule] Turning flash-attention integration into 3rd party submod ( #144120 ) ( #146372 )
...
Summary:
# Summary
### Sticky points
CUDA-graph RNG handling has changed / deviated from the original implementation. We will be left with a dangling 'offset' val and confusing naming due to BC.
## Dependencies
- Flash PR: https://github.com/Dao-AILab/flash-attention/pull/1419
### Other Points
- The BC linter is complaining about losing generate.py and its functions, which is not a real BC surface.
cc albanD
imported-using-ghimport
Test Plan:
Imported from OSS
Building in dev
`buck build @//mode/dev-nosan -c fbcode.nvcc_arch=h100a //caffe2:ATen-cu --show-full-output `
Running `nm` on the .so, I do see that the flash symbols are correctly named:
```
0000000001c3dfb0 t pytorch_flash::run_mha_bwd(pytorch_flash::Flash_bwd_params&, CUstream_st*)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c36080 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c360e0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
0000000001c35fc0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const
0000000001c36020 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const
```
Reviewed By: vkuzo
Differential Revision: D68502879
Pulled By: drisspg
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146372
Approved by: https://github.com/jbschlosser
2025-02-26 00:10:59 +00:00
ec768d8dc0
[ca] side-effect free initial trace: compiled_args ( #147804 )
...
Const methods to prevent accidental mutation; changes are mainly in Error nodes and PyNode.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147804
Approved by: https://github.com/jansel
ghstack dependencies: #147242 , #147796
2025-02-25 20:38:51 +00:00
754fb834db
[BE][CI] bump ruff to 0.9.0: string quote styles ( #144569 )
...
Reference: https://docs.astral.sh/ruff/formatter/#f-string-formatting
- Change the outer quotes to double quotes for nested f-strings
```diff
- f'{", ".join(args)}'
+ f"{', '.join(args)}"
```
- Change the inner quotes to double quotes for triple f-strings
```diff
string = """
- {', '.join(args)}
+ {", ".join(args)}
"""
```
- Join implicitly concatenated strings
```diff
- string = "short string " "short string " f"{var}"
+ string = f"short string short string {var}"
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/144569
Approved by: https://github.com/Skylion007
ghstack dependencies: #146509
2025-02-24 19:56:09 +00:00