d5b1d99f78
Enable more nightly tests on s390x ( #148452 )
...
Also enable some tests that were probably disabled accidentally.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148452
Approved by: https://github.com/seemethere , https://github.com/malfet
2025-03-18 16:09:39 +00:00
381d0cb239
[DCP] Avoid in-place update and deepcopy during dedupe ( #149320 )
...
Summary:
Avoid in-place update and deepcopy during dedupe. Deepcopy becomes prohibitively expensive for models with a huge number of FQNs; this also manifested in the Ads 2K experiment. Here are the results from the TextRay model in Mitra:
#### Control job with deepcopy regression:
First save ~24.8s
Global step latency ~7-8s
#### Test job with the new fix to avoid deepcopy:
First save ~21s
Global step latency ~2s
Test Plan:
```
buck test 'fbcode//mode/dev-nosan' fbcode//caffe2/test/distributed/checkpoint:test_planner
```
https://www.internalfb.com/intern/testinfra/testrun/3940649945104822
Differential Revision: D71245218
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149320
Approved by: https://github.com/MeetVadakkanchery
2025-03-18 16:08:40 +00:00
c41196a4d0
[EZ][Docker] Remove install_db.sh ( #149360 )
...
This is a vestige of the caffe2 days and has been a no-op since https://github.com/pytorch/pytorch/pull/125092
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149360
Approved by: https://github.com/atalman , https://github.com/cyyever , https://github.com/seemethere , https://github.com/Skylion007
2025-03-18 16:07:47 +00:00
fdacf3c920
[ONNX] Update types in VerificationInfo ( #149377 )
...
torch.types.Number was rendered as-is in the documentation, which can be confusing. We write out the original types instead to reduce confusion for users.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149377
Approved by: https://github.com/titaiwangms
2025-03-18 15:37:39 +00:00
405025778d
Revert "[AOTI] Update test runner to use the new APIs ( #147105 )"
...
This reverts commit 9a78513c3cb21a5f506135e2a56f967cf1fddc60.
Reverted https://github.com/pytorch/pytorch/pull/147105 on behalf of https://github.com/facebook-github-bot due to Diff reverted internally ([comment](https://github.com/pytorch/pytorch/pull/147105#issuecomment-2733656413 ))
2025-03-18 15:25:40 +00:00
5ba437fb45
Revert "[AOTI] Forward fix unit test failures ( #149401 )"
...
This reverts commit ec9e11145e1a86300aae0fe09a1d8917d21deba1.
Reverted https://github.com/pytorch/pytorch/pull/149401 on behalf of https://github.com/desertfire due to reverting the original PR instead ([comment](https://github.com/pytorch/pytorch/pull/149401#issuecomment-2733633516 ))
2025-03-18 15:18:48 +00:00
213eea216a
[MTIA] Add _mtia_maybeExchangeDevice to MTIA module ( #149340 )
...
Summary: The FlexAttention path uses `_maybe_exchange_device`, so it will be needed eventually for MTIA as well.
Test Plan: `buck2 test fbcode//mtia/host_runtime/torch_mtia/tests:test_torch_mtia_api -- test_maybe_exchange_device`
Reviewed By: chaos5958
Differential Revision: D70072063
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149340
Approved by: https://github.com/chaos5958
2025-03-18 15:15:12 +00:00
ec9e11145e
[AOTI] Forward fix unit test failures ( #149401 )
...
Summary: There is a land conflict between https://github.com/pytorch/pytorch/pull/149161 and https://github.com/pytorch/pytorch/pull/147105 . We just need to update the APIs used in two new unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149401
Approved by: https://github.com/ZainRizvi
2025-03-18 15:02:01 +00:00
6e2b2660b9
Make numpy check optional ( #149356 )
...
We may want to skip the numpy smoke tests, hence making the check optional.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149356
Approved by: https://github.com/ZainRizvi
2025-03-18 15:00:01 +00:00
bc88f6faa1
Use TorchVersion for triton version check ( #149136 )
...
Follow-up to https://github.com/pytorch/pytorch/pull/149092#issuecomment-2721990321 :
use TorchVersion for parsing the triton version.
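A minimal sketch of the kind of comparison this enables, assuming the standard `torch.torch_version` module (the version string is illustrative):
```python
from torch.torch_version import TorchVersion

# TorchVersion supports PEP 440-style comparisons directly against plain
# strings, avoiding hand-rolled parsing of the triton version string.
triton_version = TorchVersion("3.2.0")  # hypothetical installed version
if triton_version >= "3.2":
    print("triton is new enough")
```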
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149136
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com >
2025-03-18 13:48:46 +00:00
b06b5c3e27
[ROCm] Use alternate mirror for drm repo ( #149380 )
...
Fixes an issue with building ROCm manywheel and libtorch images, e.g. https://github.com/pytorch/pytorch/actions/runs/13887711267/job/38854659005#step:4:8328
```
#53 2.832 Cloning into 'drm'...
#53 2.849 fatal: unable to access 'https://gitlab.freedesktop.org/mesa/drm.git/ ': The requested URL returned error: 503
#53 2.851 ./install_rocm_drm.sh: line 29: pushd: drm: No such file or directory
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149380
Approved by: https://github.com/jeffdaily
2025-03-18 13:33:25 +00:00
6055a4f612
Refresh benchmark results ( #149347 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149347
Approved by: https://github.com/jamesjwu
2025-03-18 08:53:49 +00:00
9b92828d4b
Add batch dim sharding rule to sdpa ( #149253 )
...
This is a trivial rule that isn't needed in most cases, but if we want to consider the input data to actually be `Shard(0)` (instead of `Replicate()`, as is currently assumed), then we need it.
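A minimal sketch of the case this rule covers, assuming a 1-D mesh and execution under `torchrun --nproc-per-node=2` (shapes are illustrative):
```python
import torch
import torch.nn.functional as F
from torch.distributed.device_mesh import init_device_mesh
from torch.distributed.tensor import Shard, distribute_tensor

mesh = init_device_mesh("cpu", (2,))
# Batch-sharded q/k/v: each rank holds half of the batch dimension.
q = distribute_tensor(torch.randn(4, 8, 16, 32), mesh, [Shard(0)])
k = distribute_tensor(torch.randn(4, 8, 16, 32), mesh, [Shard(0)])
v = distribute_tensor(torch.randn(4, 8, 16, 32), mesh, [Shard(0)])
# With the batch-dim rule, SDPA can propagate Shard(0) inputs to a
# Shard(0) output instead of redistributing to Replicate first.
out = F.scaled_dot_product_attention(q, k, v)
```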
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149253
Approved by: https://github.com/XilunWu
2025-03-18 07:54:02 +00:00
9cd52da45c
[MPS/inductor] Add support for modified_bessel_i1 ( #149379 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149379
Approved by: https://github.com/malfet
2025-03-18 06:02:33 +00:00
6c2db8fab0
Enable qint8 and quint8 add for AArch64 using ACL directly ( #148653 )
...
This enables qint8 and quint8 add for AArch64 through Arm Compute Library (ACL) directly.
The relative performance improvement is ~15x with OMP_NUM_THREADS=1 and ~5.4x with OMP_NUM_THREADS=32.
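A minimal sketch of the user-visible op this accelerates; using qint8 here assumes a build where the ACL path is available:
```python
import torch

a, b = torch.randn(64, 64), torch.randn(64, 64)
qa = torch.quantize_per_tensor(a, scale=0.1, zero_point=0, dtype=torch.qint8)
qb = torch.quantize_per_tensor(b, scale=0.1, zero_point=0, dtype=torch.qint8)
# Elementwise add on quantized tensors; the output scale/zero_point
# are supplied by the caller.
qc = torch.ops.quantized.add(qa, qb, 0.2, 0)
print(qc.dequantize()[:2, :2])
```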
Co-authored-by: David Svantesson <david.svantesson-yeung@arm.com >
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148653
Approved by: https://github.com/malfet
ghstack dependencies: #148585
2025-03-18 05:38:39 +00:00
2e0c98ff05
[MPS] Add bicubic2d_aa ( #149378 )
...
This is currently the most frequently requested op in https://github.com/pytorch/pytorch/issues/141287
Mostly done by refactoring `upsample_bilinear2d_aa` to accept a Functor as one of the template arguments, closely following the ideas from eec43cfbc0/src/libImaging/Resample.c
as well as
bb42e4d137/aten/src/ATen/native/cuda/UpSampleBilinear2d.cu (L472-L478)
Unit tests are populated by copying the upsample_bilinear2d_aa tests and reusing them for upsample_bicubic2d_aa.
At that point, the only differences between upsample_bilinear2d_aa and upsample_bicubic2d_aa are the convolution kernel function and its size: 3x3 for bilinear, 5x5 for bicubic.
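A minimal sketch of the user-facing call that now hits this kernel, assuming an MPS-capable machine:
```python
import torch
import torch.nn.functional as F

x = torch.rand(1, 3, 256, 256, device="mps")
# Anti-aliased bicubic downsampling, previously unimplemented on MPS.
y = F.interpolate(x, size=(128, 128), mode="bicubic", antialias=True)
print(y.shape)  # torch.Size([1, 3, 128, 128])
```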
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149378
Approved by: https://github.com/dcci
2025-03-18 05:35:41 +00:00
dea7157160
nccl: upgrade to 2.26.2 to avoid hang on ncclCommAbort ( #149351 )
...
Fixes #149153
YAML generated from:
```
python .github/scripts/generate_ci_workflows.py
```
Test plan:
Repro in https://gist.github.com/d4l3k/16a19b475952bc40ddd7f2febcc297b7
```
rm -rf third_party/nccl
python setup.py develop
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149351
Approved by: https://github.com/kwen2501 , https://github.com/atalman , https://github.com/malfet
2025-03-18 05:23:18 +00:00
b8f91bcb14
[pt2_provenance_tracking] add support for cpp kernel ( #149185 )
...
Summary:
As title: add the inductor cpp kernel to the post-grad graph node mapping, plus a UT.
Context:
Raised as a feature request for the AOTI CPU case.
https://fb.workplace.com/groups/1028545332188949/permalink/1169020841474730/
Differential Revision: D71181284
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149185
Approved by: https://github.com/jingsh
2025-03-18 04:43:07 +00:00
7869196482
Fix torchbind schema str generation ( #149239 )
...
Summary: Fix Torchbind HOP schema generation when there's no input
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:torchbind -- -r schema
```
Differential Revision: D71231164
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149239
Approved by: https://github.com/zou3519
2025-03-18 04:29:56 +00:00
bca75fe97a
[MAIA] [Autocast] Enable autocast on MAIA device ( #148511 )
...
Fixes #148510 .
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148511
Approved by: https://github.com/albanD
2025-03-18 03:46:22 +00:00
c43e35d6f7
[MPS] Implement support for modified_bessel_i1 in eager. ( #149368 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149368
Approved by: https://github.com/malfet
Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com >
2025-03-18 03:29:10 +00:00
bb42e4d137
[AOTInductor] Add function to free buffer ( #149161 )
...
Summary:
We add a function that allows users to free the unused buffer.
Test Plan:
Testing correctness:
python test/inductor/test_aot_inductor.py -k free_inactive
Testing memory consumption:
LD_LIBRARY_PATH=/data/users/$USER/pytorch/build/lib
/home/$USER/local/pytorch/build/bin/test_aoti_inference
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149161
Approved by: https://github.com/chenyang78 , https://github.com/desertfire
ghstack dependencies: #149249
2025-03-18 02:43:14 +00:00
cccdf860e2
[BE] Add STABLE_LIBRARY test for multiple returns ( #149230 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149230
Approved by: https://github.com/albanD , https://github.com/zou3519
ghstack dependencies: #149052
2025-03-18 02:40:54 +00:00
988827cdfb
Use schema as source of truth + support ones_like/empty_like ( #149052 )
...
This change does 2 important things:
(a) Instead of relying on the IValue type as the source of truth, we use the schema, which is important because IValue types are overloaded and can convert ambiguously and incorrectly. For example, a MemoryFormat looks like an int and would get converted to an int64_t rather than a MemoryFormat!
(b) This PR expands support to many more types, encompassing far more schemas, e.g. Optional, Device, dtype, etc. The main win is the ability for aoti_torch_call_dispatcher to call TensorFactory ops like ones_like/empty_like!
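For reference, the schema in question can be inspected from Python; a quick illustration (note `_schema` is an internal attribute):
```python
import torch

# The dispatcher schema spells out the true argument types, e.g. the
# Optional dtype/device/memory_format arguments of ones_like.
print(torch.ops.aten.ones_like.default._schema)
```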
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149052
Approved by: https://github.com/albanD
2025-03-18 02:40:54 +00:00
ebabd0efdd
[ONNX] Expose verification utilities ( #148603 )
...
Expose verification utilities to public documentation.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148603
Approved by: https://github.com/titaiwangms
2025-03-18 02:10:34 +00:00
c36ac16da1
[Inductor] optimize welford reduction ( #145061 )
...
Fix https://github.com/pytorch/pytorch/issues/141541 .
Fix https://github.com/pytorch/pytorch/issues/142839 .
Fix https://github.com/pytorch/pytorch/issues/143182 .
**Summary:**
In order to fix the insufficient accuracy of the Welford reduction, we follow the eager implementation and combine the Welford algorithm with cascade summation to improve numerical stability. Specifically:
1. Use the Welford algorithm to compute mean and variance.
2. Use cascade summation for the sums over the input used by both mean and variance.
I tested the Inductor benchmark with this PR on CPU; no performance gains or regressions were seen.
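For intuition, a small Python sketch of the combined scheme (illustrative only, not the Inductor codegen): compute a Welford state per chunk, then combine the chunk states pairwise in a cascade:
```python
def welford_chunk(xs):
    # Classic Welford update over one chunk.
    mean, m2, n = 0.0, 0.0, 0
    for x in xs:
        n += 1
        d = x - mean
        mean += d / n
        m2 += d * (x - mean)
    return mean, m2, n

def welford_combine(a, b):
    # Chan et al. parallel combination of two Welford states.
    mean_a, m2_a, n_a = a
    mean_b, m2_b, n_b = b
    n = n_a + n_b
    if n == 0:
        return a
    d = mean_b - mean_a
    mean = mean_a + d * n_b / n
    m2 = m2_a + m2_b + d * d * n_a * n_b / n
    return mean, m2, n

def welford_cascade(data, chunk=4096):
    states = [welford_chunk(data[i:i + chunk]) for i in range(0, len(data), chunk)]
    while len(states) > 1:  # cascade: pairwise-combine chunk states
        states = [
            welford_combine(states[i], states[i + 1]) if i + 1 < len(states) else states[i]
            for i in range(0, len(states), 2)
        ]
    mean, m2, n = states[0]
    return mean, m2 / n  # mean and (biased) variance, as GroupNorm uses
```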
**Example:**
Take https://github.com/pytorch/pytorch/issues/141541 as an example:
```
import torch
import torch.nn as nn
torch.manual_seed(0)
class Model(nn.Module):
def __init__(self):
super().__init__()
self.gn = nn.GroupNorm(num_groups=32, num_channels=32)
def forward(self, x):
return self.gn(x)
model = Model().eval()
c_model = torch.compile(model)
x = torch.randn(1, 32, 128, 128, 128)
with torch.no_grad():
output = model(x)
c_output = c_model(x)
print(torch.max(torch.abs(output - c_output)))
print(torch.allclose(output, c_output, 1.3e-6, 1e-5))
```
**logs**
- before
```
tensor(7.0095e-05)
False
```
- After
```
tensor(9.5367e-07)
True
```
- on CUDA
```
tensor(1.4305e-06, device='cuda:0', grad_fn=<MaxBackward1>)
True
```
**Generated code:**
- before
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/pi/cpicxudqmdsjh5cm4klbtbrvy2cxwr7whxl3md2zzdjdf3orvfdf.h"
extern "C" void kernel(const float* in_ptr0,
const float* in_ptr1,
const float* in_ptr2,
float* out_ptr0,
float* out_ptr1,
float* out_ptr2)
{
{
#pragma GCC ivdep
for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
{
{
Welford<float> tmp_acc0 = Welford<float>();
Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
static WeightRecp<at::vec::Vectorized<float>> wrecps0(static_cast<int64_t>(131072L));
for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(16L))
{
{
if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(2097152L)))
{
auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + 2097152L*x0), static_cast<int64_t>(16));
tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &wrecps0);
}
}
}
tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
out_ptr0[static_cast<int64_t>(x0)] = static_cast<float>(tmp_acc0.mean);
out_ptr1[static_cast<int64_t>(x0)] = static_cast<float>(tmp_acc0.m2);
}
}
}
{
#pragma GCC ivdep
for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
{
for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(16L))
{
{
if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(2097152L)))
{
auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + 2097152L*x0), static_cast<int64_t>(16));
auto tmp1 = out_ptr0[static_cast<int64_t>(x0)];
auto tmp4 = out_ptr1[static_cast<int64_t>(x0)];
auto tmp12 = in_ptr1[static_cast<int64_t>(x0)];
auto tmp15 = in_ptr2[static_cast<int64_t>(x0)];
auto tmp2 = at::vec::Vectorized<float>(tmp1);
auto tmp3 = tmp0 - tmp2;
auto tmp5 = static_cast<float>(2097152.0);
auto tmp6 = tmp4 / tmp5;
auto tmp7 = static_cast<float>(1e-05);
auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
auto tmp9 = 1 / std::sqrt(tmp8);
auto tmp10 = at::vec::Vectorized<float>(tmp9);
auto tmp11 = tmp3 * tmp10;
auto tmp13 = at::vec::Vectorized<float>(tmp12);
auto tmp14 = tmp11 * tmp13;
auto tmp16 = at::vec::Vectorized<float>(tmp15);
auto tmp17 = tmp14 + tmp16;
tmp17.store(out_ptr2 + static_cast<int64_t>(x1 + 2097152L*x0));
}
}
}
}
}
}
''')
```
- After
```
cpp_fused_native_group_norm_0 = async_compile.cpp_pybinding(['const float*', 'const float*', 'const float*', 'float*', 'float*', 'float*'], '''
#include "/tmp/torchinductor_jiayisun/ln/clnlak27xpvmq3klpqyj6xzyq2thf4ecrezve5ddy4f4xaz4sb7w.h"
extern "C" void kernel(const float* in_ptr0,
const float* in_ptr1,
const float* in_ptr2,
float* out_ptr0,
float* out_ptr1,
float* out_ptr2)
{
{
#pragma GCC ivdep
for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
{
{
Welford<float> tmp_acc0 = Welford<float>();
Welford<at::vec::Vectorized<float>> tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
Welford<at::vec::Vectorized<float>> masked_tmp_acc0_vec = Welford<at::vec::Vectorized<float>>();
WelfordHelper<at::vec::Vectorized<float>> welford_helper0(static_cast<int64_t>(131072L));
static WelfordHelper<at::vec::Vectorized<float>> masked_welford_helper0(static_cast<int64_t>(0L));
for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(16L))
{
{
if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(2097152L)))
{
auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + 2097152L*x0), static_cast<int64_t>(16));
tmp_acc0_vec = welford_combine(tmp_acc0_vec, tmp0, &welford_helper0);
}
}
}
tmp_acc0_vec = welford_combine(tmp_acc0_vec, &welford_helper0);
masked_tmp_acc0_vec = welford_combine(masked_tmp_acc0_vec, &masked_welford_helper0);
tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(masked_tmp_acc0_vec));
tmp_acc0 = welford_combine(tmp_acc0, welford_vec_reduce_all(tmp_acc0_vec));
out_ptr0[static_cast<int64_t>(x0)] = static_cast<float>(tmp_acc0.mean);
out_ptr1[static_cast<int64_t>(x0)] = static_cast<float>(tmp_acc0.m2);
}
}
}
{
#pragma GCC ivdep
for(int64_t x0=static_cast<int64_t>(0L); x0<static_cast<int64_t>(32L); x0+=static_cast<int64_t>(1L))
{
for(int64_t x1=static_cast<int64_t>(0L); x1<static_cast<int64_t>(2097152L); x1+=static_cast<int64_t>(16L))
{
{
if(C10_LIKELY(x1 >= static_cast<int64_t>(0) && x1 < static_cast<int64_t>(2097152L)))
{
auto tmp0 = at::vec::Vectorized<float>::loadu(in_ptr0 + static_cast<int64_t>(x1 + 2097152L*x0), static_cast<int64_t>(16));
auto tmp1 = out_ptr0[static_cast<int64_t>(x0)];
auto tmp4 = out_ptr1[static_cast<int64_t>(x0)];
auto tmp12 = in_ptr1[static_cast<int64_t>(x0)];
auto tmp15 = in_ptr2[static_cast<int64_t>(x0)];
auto tmp2 = at::vec::Vectorized<float>(tmp1);
auto tmp3 = tmp0 - tmp2;
auto tmp5 = static_cast<float>(2097152.0);
auto tmp6 = tmp4 / tmp5;
auto tmp7 = static_cast<float>(1e-05);
auto tmp8 = decltype(tmp6)(tmp6 + tmp7);
auto tmp9 = 1 / std::sqrt(tmp8);
auto tmp10 = at::vec::Vectorized<float>(tmp9);
auto tmp11 = tmp3 * tmp10;
auto tmp13 = at::vec::Vectorized<float>(tmp12);
auto tmp14 = tmp11 * tmp13;
auto tmp16 = at::vec::Vectorized<float>(tmp15);
auto tmp17 = tmp14 + tmp16;
tmp17.store(out_ptr2 + static_cast<int64_t>(x1 + 2097152L*x0));
}
}
}
}
}
}
''')
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145061
Approved by: https://github.com/leslie-fang-intel , https://github.com/jgong5 , https://github.com/jansel
2025-03-18 02:05:35 +00:00
1096443467
Use torch_compile_options for c10 libraries ( #147821 )
...
c10, c10_cuda, c10_hip and c10_xpu are given additional compile options by torch_compile_options, which are more restrictive and can help reveal potential bugs inside the code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147821
Approved by: https://github.com/guangyey , https://github.com/malfet
2025-03-18 01:54:23 +00:00
60523540f1
Force build to conform C++ standard on windows by adding /permissive- flag ( #149035 )
...
Fixes #147366
1. Add `/permissive-` to the `torch_compile_options` for the build to conform to the C++ standard.
2. Fix the error when trying to assign a string literal to a non-const ptr.
The `/permissive-` flag can be found at https://learn.microsoft.com/en-us/cpp/build/reference/permissive-standards-conformance?view=msvc-170
From the above [doc](https://learn.microsoft.com/en-us/cpp/build/reference/permissive-standards-conformance?view=msvc-170#remarks ),
> By default, the /permissive- option is set in new projects created by Visual Studio 2017 version 15.5 and later versions.
> The /permissive- option is implicitly set by the /std:c++latest option starting in Visual Studio 2019 version 16.8, and in version 16.11 by the /std:c++20 option.
Thus, it is reasonable to add this flag to the existing project.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149035
Approved by: https://github.com/guangyey , https://github.com/malfet
2025-03-18 01:51:46 +00:00
c1dd75e4dc
Add AOTI shim for _weight_int4pack_mm_cpu_tensor ( #149031 )
...
**Summary**
The previous implementation of the shim did not align with the design and was removed by https://github.com/pytorch/pytorch/pull/148907
This PR adds it back in the MKLDNN backend files and re-enables the CPP wrapper UT.
**Test plan**
```
pytest -s test/inductor/test_cpu_cpp_wrapper.py -k test_woq_int4
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149031
Approved by: https://github.com/leslie-fang-intel , https://github.com/EikanWang , https://github.com/desertfire
2025-03-18 01:33:13 +00:00
425c6d8eba
Replace c10::is_pod with std::is_trivial ( #149286 )
...
These remaining c10::is_pod calls can be replaced without compromising the semantics.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149286
Approved by: https://github.com/zou3519
2025-03-18 01:33:01 +00:00
f9a787224c
[dynamo][guards][serialization] Dont use ID_MATCH guard for bool and None ( #149228 )
...
Doing this removes the need to collect `id`s and therefore facilitates serialization. It also improves readability on recompilations; earlier, the recompile message would just show the `id`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149228
Approved by: https://github.com/jansel
2025-03-18 01:25:37 +00:00
186cc7327c
[MPS/BE] Remove decorator that skipped test on macOS 12. ( #149365 )
...
macOS 12 is not really supported anymore.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149365
Approved by: https://github.com/malfet
2025-03-18 00:58:08 +00:00
a0ac63cbd9
[BE]: Apply ruff PERF403 to use dict comprehensions more often ( #149257 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149257
Approved by: https://github.com/jansel
2025-03-18 00:46:07 +00:00
811f587d86
[MPS/BE] @parametrize generation of pointwise_ops. ( #149363 )
...
Makes this less error-prone and reduces duplication.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149363
Approved by: https://github.com/malfet
2025-03-18 00:37:43 +00:00
9a78513c3c
[AOTI] Update test runner to use the new APIs ( #147105 )
...
Summary: Switch to the newer aoti_compile_and_package APIs. Some tests still use the legacy APIs; we will follow up with internal test refactoring.
Differential Revision: [D69609685](https://our.internmc.facebook.com/intern/diff/D69609685 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147105
Approved by: https://github.com/jingsh
2025-03-18 00:27:09 +00:00
b52a8bef01
Revert "[dynamo][guards][serialization] Dont use ID_MATCH guard for bool and None ( #149228 )"
...
This reverts commit 5905bbe745b0acb4909243c93014c0e6f3512c2d.
Reverted https://github.com/pytorch/pytorch/pull/149228 on behalf of https://github.com/malfet due to I wonder if this will fix the pr-time-benchmark regressions ([comment](https://github.com/pytorch/pytorch/pull/149228#issuecomment-2731237949 ))
2025-03-18 00:10:50 +00:00
46226a90c8
[EZ][BE] Remove cross-compilation options from mac-build.yml ( #149237 )
...
Cross-compilation support has long been gone.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149237
Approved by: https://github.com/seemethere , https://github.com/atalman
2025-03-17 23:50:31 +00:00
523bffd388
cd: Add no-cache for test binaries ( #149218 )
...
This makes it so that we don't hit issues like https://github.com/pytorch/vision/actions/runs/13861462856/job/38795684317#step:13:212
```
ERROR: THESE PACKAGES DO NOT MATCH THE HASHES FROM THE REQUIREMENTS FILE. If you have updated the package versions, please update the hashes. Otherwise, examine the package contents carefully; someone may have tampered with them.
unknown package:
Expected sha256 8e34a6f02ac5a63763251953063a19ba9df855ac2c8a13ef409dfef708e2ba26
Got 341156cc5067488565c1e103be6e95105b0fc0d87d8ac24ff8891f63fd33216f
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149218
Approved by: https://github.com/ZainRizvi , https://github.com/atalman , https://github.com/malfet
2025-03-17 23:26:20 +00:00
37c914ca0c
fix simple-spec crash ( #147723 )
...
Found an issue while running `python torchgen/fuse/gen_patterns.py`.
Exact error:
```shell
Traceback (most recent call last):
File "/Users/mayankmishra/Desktop/non-IBM/pytorch/torchgen/fuse/gen_patterns.py", line 19, in <module>
joint_graph.lazy_init()
File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 2096, in lazy_init
result = fn()
File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/fx_passes/joint_graph.py", line 53, in lazy_init
_pad_mm_init()
File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/fx_passes/pad_mm.py", line 905, in _pad_mm_init
gen_register_replacement(
File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 1584, in gen_register_replacement
pat = _serialize_pattern(
File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 1539, in _serialize_pattern
file_template = get_file_template()
File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/site-packages/torch/_inductor/pattern_matcher.py", line 1513, in get_file_template
if isinstance(attr, type) and issubclass(attr, (PatternExpr, _TargetExpr)):
File "/Users/mayankmishra/miniconda3/envs/ai/lib/python3.10/abc.py", line 123, in __subclasscheck__
return _abc_subclasscheck(cls, subclass)
TypeError: issubclass() arg 1 must be a class
```
This PR fixes this issue.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147723
Approved by: https://github.com/aorenste
Co-authored-by: Aaron Orenstein <aorenste@meta.com >
2025-03-17 23:25:48 +00:00
78715a181f
Convert Tensor lr to 0-dim as needed for the optimizer to work normally ( #145674 )
...
Fixes #145461
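A minimal sketch of the usage this fixes, assuming an optimizer that accepts a Tensor lr (e.g. Adam):
```python
import torch

p = torch.zeros(3, requires_grad=True)
# Tensor lr; with this fix, a 1-element tensor such as torch.tensor([0.01])
# is converted to 0-dim internally so it behaves like a float lr.
opt = torch.optim.Adam([p], lr=torch.tensor(0.01))
p.sum().backward()
opt.step()
```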
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145674
Approved by: https://github.com/janeyx99
Co-authored-by: Jane (Yuan) Xu <31798555+janeyx99@users.noreply.github.com >
2025-03-17 23:07:05 +00:00
1157367c78
[AOTInductor] [BE] Add macro for loading symbols in aoti runner ( #149249 )
...
Summary:
Add macro for loading symbols in aoti runner
Test Plan:
Existing tests
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149249
Approved by: https://github.com/chenyang78
2025-03-17 23:02:01 +00:00
24cfeec2c7
Revert "[BE]: Apply ruff PERF403 to use dict comprehensions more often ( #149257 )"
...
This reverts commit bfee141666319c80b6c5284394905beef8682515.
Reverted https://github.com/pytorch/pytorch/pull/149257 on behalf of https://github.com/malfet due to Let's see if it helps restore compiler benchmark sanity, see 8bc7bd94a5/1
([comment](https://github.com/pytorch/pytorch/pull/149257#issuecomment-2731133812 ))
2025-03-17 22:57:00 +00:00
afa1eda901
Revert "[PGNCCL] Launch kernel on current stream & remove record_stream
entirely ( #148590 )"
...
This reverts commit ef6296e7f20d744a0cfed81cab573d60204e7626.
Reverted https://github.com/pytorch/pytorch/pull/148590 on behalf of https://github.com/izaitsevfb due to reverted internally, see D71292427 ([comment](https://github.com/pytorch/pytorch/pull/148590#issuecomment-2731114626 ))
2025-03-17 22:43:15 +00:00
a16ada41b9
Fix outdated docstring of torch.export.export regarding strict flag ( #149077 )
...
Summary: Fix outdated docstring of torch.export.export regarding strict flag
Test Plan: None, doc-only change
Differential Revision: D71068215
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149077
Approved by: https://github.com/zhxchen17
2025-03-17 22:29:20 +00:00
d25617255c
Fix AOTI update_constant_buffer issue. ( #149243 )
...
Summary:
In D69553929 we changed the logic of constant & buffer updates in AOTI. However, this is incompatible with the current Sigmoid runtime since we have different logic for passing in buffers, resulting in errors like
```
I0310 17:29:24.456960 3679102 AOTIDelegateExecutor.cpp:89] AOTIDelegateExecutor processing weights
*** Aborted at 1741652964 (Unix time, try 'date -d 1741652964') ***
*** Signal 11 (SIGSEGV) (0x30) received by PID 3679102 (pthread TID 0x7f9933e49000) (linux TID 3679102) (code: address not mapped to object), stack trace: ***
@ 00000000000040b9 folly::symbolizer::(anonymous namespace)::signalHandler(int, siginfo_t*, void*)
./fbcode/folly/debugging/symbolizer/SignalHandler.cpp:453
@ 0000000000006c45 folly::fibers::(anonymous namespace)::sigsegvSignalHandler(int, siginfo_t*, void*)
./fbcode/folly/fibers/GuardPageAllocator.cpp:237
@ 000000000004455f (unknown)
/home/engshare/third-party2/glibc/2.34/src/glibc-2.34/signal/../sysdeps/unix/sysv/linux/libc_sigaction.c:8
-> /home/engshare/third-party2/glibc/2.34/src/glibc-2.34/signal/../sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c
@ 00000000001e8164 torch::aot_inductor::AOTInductorModelContainer::update_constant_buffer(std::unordered_map<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, AtenTensorOpaque*, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, AtenTensorOpaque*> > > const&, bool, bool)
```
Test Plan:
1) Generate lowered merge net
```
CUDA_VISIBLE_DEVICES=0 ../buck-out/v2/gen/fbcode/b5b13003c82cbdec/caffe2/torch/fb/model_transform/fx2trt/packaging/__generate_merge_net_file__/generate_merge_net_file.par --action=generate --input-file=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_input --output-file=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_sigmoid --lower-backend=aot_inductor --use_sigmoid=true --aot_inductor_config="{'max_autotune': True, 'comprehensive_padding': False}" --add_passes=use_matmul_lce_replace_normal_LCE,use_triton_dot_compress,use_matmul_fuse_lce_replace_first_LCE,use_contiguous_linear_reduction_replace_linear_reduction --disable_acc_tracer=false
```
2) Load net predictor
```
CUDA_VISIBLE_DEVICES=1 ../buck-out/v2/gen/fbcode/103717df3cc2b97a/caffe2/torch/fb/model_transform/fx2trt/packaging/__load_net_predictor__/load_net_predictor --loadMode=AccuracyAB --inputNetFile=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_ts --otherNetFile=/home/shengqin/models/aoti_sigmoid_test/cmf_interformer_with_custom_triton_kernels_691990503_0_output.aoti_sigmoid --moduleName=merge --benchmarkEnableProfiling=false --predictor_hardware_type=1 --disableStaticRuntime=true
```
Reviewed By: hl475
Differential Revision: D71236710
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149243
Approved by: https://github.com/hl475 , https://github.com/jingsh
2025-03-17 22:10:57 +00:00
a3c6e3139a
Allow extra args for parameterization of tests in inductor ( #149154 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149154
Approved by: https://github.com/amjames , https://github.com/eellison
2025-03-17 22:05:06 +00:00
e4f6e4ac84
[MPS] Add inductor support for modified_bessel_i0 ( #149342 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/149342
Approved by: https://github.com/malfet
2025-03-17 21:45:51 +00:00
8bc7bd94a5
[ROCm] Input vectorization in elementwise kernels for tensors with heterogeneous types ( #147527 )
...
This patch demonstrates its use for input tensors with types (float, bfloat16) when the functor type is float(float,float).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147527
Approved by: https://github.com/jeffdaily
Co-authored-by: Hashem Hashemi <hashem.hashemi@amd.com >
2025-03-17 20:51:36 +00:00
e8dd58b8cf
cpp_wrapper: Precompile device-specific header files ( #146928 )
...
This saves us about a second per compilation, which is _massive_ for the OpInfo tests. Total OpInfo test runtime is down about 2x from this change alone.
Relands #144002 , with changes needed by fbcode internals.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146928
Approved by: https://github.com/desertfire
2025-03-17 20:40:15 +00:00
5e9f792479
[ROCm] Unskip flex attention UTs after triton 3.3 bump ( #148327 )
...
Enable `test_flex_attention.py::TestLearnableBiases` unit tests.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148327
Approved by: https://github.com/jeffdaily
2025-03-17 20:15:14 +00:00