Fixes https://github.com/pytorch/pytorch/issues/159995
Currently there are two problems with extern kernels in subgraphs:
1. They don't get serialized to the extern kernel json file because we only look at the toplevel graph.
2. Since the scope of each extern_kernel list is within its own subgraph, the indices referencing the operator is messed up because each subgraph will start counting from 0.
So, this PR moves the extern_kernels list to a global view (under virtualized) so that we can count the extern kernels across subgraphs and the toplevel graph.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/160004
Approved by: https://github.com/ydwu4
From the perivous PR: https://github.com/pytorch/pytorch/pull/157608 , I added `format_consts_to_cpp` to build consts bytes.
But it still raise clang ASAN `stack alloction`, when build large size consts.
This PR:
1. add `test_aot_inductor_consts_cpp_build` to stack allocation skip list.
2. add ATTRIBUTE_NO_SANITIZE_ADDRESS to skip ASAN check, because consts array is locate in global area.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/158175
Approved by: https://github.com/jansel
Sometimes a test file reports success according to pytest, but fails afterwards, and the rerun logic doesn't handle that correctly.
The name of the last run test is saved in order to do more efficient reruns (target the last run test for a rerun without rerunning the entire file). This usually correct, ex test fails and pytest catches it -> lastrun = the test that failed, test segfaults (pytest doesn't catch) -> lastrun is the test that segfaulted. But sometimes pytest reports a success, but the process has non zero exit code. The two cases I know of are hangs and double freeing at exit. In this case, its unclear which test caused the failure, so lastrun is set to be the first test that ran in that session, so that during the next session it will start from the beginning in an attempt to replicate the error (an alternate solution would be to just fail and not rerun, which might be the better option). But then it reruns with runsingle, which prevents lastrun from being reset (not sure why, I'm pretty sure there's no difference between resetting and not normally), so lastrun becomes the last test that ran, and its not always true that lastrun is the one that caused it. Then on the next run, it starts from the last test and the process now exits cleanly
Short term solution here: ensure the lastrun is always set to the initial value if the session succeeds. This is correct even in the normal path because initial value shouldn't change in that case
Things that still need to be fixed:
* log says "running single test" which is not true
* no xml reports get generated here
* also no xml reports get generated on segfault
* docs for this
I think I have a PR that fixes the above but its old so I need to take another look
Testing:
This from when I was based on a commit that had a hang for macs, and before I added the skips in inductor array ref:
cc862d2c14
Pull Request resolved: https://github.com/pytorch/pytorch/pull/155853
Approved by: https://github.com/malfet
Summary: See internal Diff for more details.
In ExternKernel, the FakeTensors do not have associated real tensors, because they are just created from ir.Node's shape and stride.
Test Plan:
```
buck run fbcode//mode/dev-nosan //caffe2/test/inductor:test_aot_inductor -- -r aoti_data_dependent_ex
buck2 run mode/dev-nosan fbcode//caffe2/test/inductor:aot_inductor_arrayref_cpu -- -r data_dependent_extern_kernel_op
```
Differential Revision: D73002775
Pull Request resolved: https://github.com/pytorch/pytorch/pull/151377
Approved by: https://github.com/angelayi
Previously, we require all inputs of while_loop to be on the same device. However, there're use cases where we want to keep some of the inputs on cpu while others on gpu e.g. an loop_idx on cpu will save the gpu to device copies. This PR relaxes the constraint and only check if carry and input at the same position have the same device.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148019
Approved by: https://github.com/eellison, https://github.com/jansel
In this PR, we extract `codegen_unbacked_symbol_defs` of FallbackKernel out as a `codegen_unbacked_symbol_defs_for_outputs` method in wrapper. With it, HOPs can support the case where the subgraph returns a tensor with unbacked symints. This PR only do it for cond, we'll have follow up PRs for others (e.g. while_loop) as well.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/147567
Approved by: https://github.com/jansel
Before the PR, we're getting an undefined symbol error for output code when an unbacked symint is **only** used in the hop because we didn't correctly record the dependency of the unbacked symbols for hops and it gets DCEed accidentally.
This PR adds the symbol arguments to `constant_args`, where the dependencies can be correctly constructed when `get_unbacked_symbol_uses` is called to check constant_args.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/143456
Approved by: https://github.com/desertfire
Summary: Add an option in config to not include weights in .so
Test Plan: `test/inductor:test_aot_inductor -- -r test_so_without_weight_cuda`
Reviewed By: desertfire
Differential Revision: D65968885
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141997
Approved by: https://github.com/desertfire
# Why?
I want the following code to work.
minimal repro:
```
class M(torch.nn.Module):
def forward(self, dilate_flag):
return dilate_flag.item()
input1 = (torch.tensor([1], dtype=torch.bool, device="cuda"),)
model = M().cuda()
ep = torch.export.export(model, input1, strict=True)
path = torch._inductor.aot_compile(ep.module(), input1)
aot_model = torch._export.aot_load(path, device="cuda")
actual_output = aot_model(*input1)
```
error: AssertionError: Encountered an unsupported object of type <class 'torch.SymBool'> while writing the metadata for exported program
second error will be handled by https://github.com/pytorch/pytorch/pull/138760
# Motivation
I could technically bypass it with a torch.int tensor. However, it doesn't work with torch.cond. I want the following to work. It would also require https://github.com/pytorch/pytorch/pull/138760 for aot compile to work.
```
class M(torch.nn.Module):
def __init__(self) -> None:
super().__init__()
self.dilate_flag = 0
def forward(self, dilate_flag):
self.dilate_flag = dilate_flag.item()
def true_fn(dilate_flag):
return dilate_flag.clone()
def false_fn(dilate_flag):
return dilate_flag.clone()
torch.cond(
self.dilate_flag,
true_fn,
false_fn,
(dilate_flag,),
)
return self.dilate_flag
input1 = (torch.tensor([1], dtype=torch.bool, device="cuda"),)
input2 = (torch.tensor([0], dtype=torch.bool, device="cuda"),)
inputs = (input1, input2)
model = M().cuda()
for input in inputs:
expected_output = model(*input)
ep = torch.export.export(model, input, strict=False)
path = torch._inductor.aot_compile(ep.module(), input)
aot_model = torch._export.aot_load(path, device="cuda")
actual_output = aot_model(*input)
assert (
expected_output == actual_output
), f"henry they are not equal {expected_output} != {actual_output}"
```
Differential Revision: D64867504
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138765
Approved by: https://github.com/ydwu4
Fixes the failure of INT8 DLRM using AOTI.
The previous code calculates `consts_size` directly using `tensor` from `graph.constants`:
```
consts_size = sum(
get_nbytes_of_tensor(tensor, all_cuda)
for (name, tensor) in graph.constants.items()
if name not in graph.folded_constants
)
```
Meanwhile, the actual bytes to serialize (`serialized_weights`) is using `graph.get_original_value_of_constant(name)`:
```
serialized_weights = b"".join(
_to_bytes(graph.get_original_value_of_constant(name), all_cuda)
for name in graph.constants.keys()
if name not in graph.folded_constants
)
```
`tensor` from `graph.constants` could be different from `graph.get_original_value_of_constant(name)` thus making the `consts_size` inconsistent with the actual byte size of the `serialized_weights`, resulting in runtime error `weights_offset must be aligned to 16K boundary`, similar to what happened in https://github.com/pytorch/pytorch/pull/135205.
This PR direclty gets `consts_size ` using `len(serialized_weights)`, which fixes the inconsistency.
We also added a `reduce_range` argument to the `get_default_x86_inductor_quantization_config` function, which is needed in the unit test to avoid accuracy issue on CI machines (earlier CPUs without VNNI).
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139054
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
Summary: The problem happened after splitting CppWrapperCpu and CppWrapperCpuArrayRef, because CppWrapperCpuArrayRef.generate_index_put_fallback missed a statement.
Running test_aot_inductor.py as a whole didn't reveal the problem, but running test_index_put_with_none_index_cpu_with_stack_allocation individually did. Digging deeper, the root cause is init_backend_registration has incorrectly cached CPU CppWrapperCodegen class, which means CppWrapperCpuArrayRef was never picked when running test_aot_inductor.py as a whole. To fix the problem, all the ArrayRef tests are split into a separate file. Also a code checking is added to regex match AOTInductorModelRunMinimalArrayrefInterface so this kind of false passing signal won't be unnoticed.
Differential Revision: [D64734106](https://our.internmc.facebook.com/intern/diff/D64734106)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138541
Approved by: https://github.com/frank-wei