Summary:
Add parsing for kwargs that are lists of strings. AOTInductor profiling needs
this to record input information for Triton kernels.
Test Plan:
Included in commit.
test_profiler_op_event_kwargs_list_of_strings
Reviewers:
Subscribers:
Tasks:
Tags:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163593
Approved by: https://github.com/sraikund16
As per comment in source code:
```
# If we are coalescing on xblock (not ReductionHint.INNER) and this is not a tiny kernel
# (not ReductionHint.OUTER_TINY), do not use persistent reduction if it induces tile
# quantization. Persistent reduction forces rblock == rnumel, if the bounds between lower
# and upper are large, for the lower values we will be masking off large % of read/writes,
# when we could expand the coalescing xblock instead.
```
For the test case in question, this PR improves perf from 0.8573521325143717 to 0.043151492193814305, because we were egregiously masking out rblock values (58 of 64 values).
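To make the "58 of 64" figure concrete, here is a minimal sketch of the masking arithmetic. The helper name `masked_fraction` is hypothetical and is not part of Inductor; it only assumes that a persistent reduction sizes `rblock` for the upper bound while the runtime `rnumel` sits at the lower bound.

```python
def masked_fraction(rnumel: int, rblock: int) -> float:
    """Fraction of an rblock-wide tile that is masked off for a given rnumel.

    Hypothetical helper for illustration only: the block is sized for the
    upper bound, so every lane past rnumel does no useful work.
    """
    if not 0 < rnumel <= rblock:
        raise ValueError("expected 0 < rnumel <= rblock")
    return (rblock - rnumel) / rblock


# Assumed numbers matching the "58/64 values" above: 6 live lanes in a 64-wide block.
wasted = masked_fraction(rnumel=6, rblock=64)
print(f"{wasted:.1%} of the lanes are masked off")
```

When the lower and upper bounds are close, the fraction is small and persistent reduction stays attractive; the heuristic quoted above only backs off when the gap is large.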
Differential Revision: [D82853279](https://our.internmc.facebook.com/intern/diff/D82853279)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163365
Approved by: https://github.com/shunting314, https://github.com/PaulZhang12, https://github.com/jansel, https://github.com/v0i0
Landing this instead of https://github.com/pytorch/pytorch/pull/162994.
Here is how I think the whole dynamo + frame construction logic works:
1) There is no way to create a frame object in Python land, since frames are created at runtime by CPython. That's why aot_compile creates FrameInfo this way (kind of like simulating the runtime). I guess you could write your own very simple eval_frame.c where you intercept the frame construction, but we probably don't want that.
2) When there is no wrapper (the old export or aot_compile), we first assign sources by iterating over f_locals, which contains both local args and closure variables (this is an implementation detail of CPython frame construction). That's why closure variables end up getting LocalSource names, as shown in this test case (f6ea41ead2/test/export/test_export.py (L1369)). Note that L["self"] here means we are referring to the local object self. The important thing to keep in mind is that this self is not actually the model's self, but the outer self.
3) When we switch to the wrapper case, we end up trying to inline the original inner module. When doing so, we need to track all locals and closures for this inner module, as can be seen here (f6ea41ead2/torch/_dynamo/variables/functions.py (L463)). Here we are not looking into the inner frame's f_locals but directly at the closures. I guess this is because we are one more frame up, so there is no access to the frame's f_locals at this point. It is probably not a good idea to change dynamo's logic here. As a result, I get the following error message, which is different from old export:
"While exporting, we found certain side effects happened in the model.forward. Here are the list of potential sources you can double check: ["L['self']._export_root.forward.__func__.__closure__[1].cell_contents.bank", "L['self']._export_root.forward.__func__.__closure__[1].cell_contents.bank_dict", "L['self']._export_root.forward.__func__.__closure__[0].cell_contents"]"
My initial attempt at solving this was to take the inner closures and put them into f_locals for the frame I am constructing, which turned out to be too complicated because we needed to muck around with bytecode instructions as well. So I think we should just update the test to reflect the new names and follow up with a better post-processing step to produce better names.
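The f_locals versus closure distinction in points 2 and 3 can be reproduced in plain CPython, without dynamo. This is only a sketch: `make_forward`, `forward`, and `bank` are made-up names, and it shows the interpreter behaviour, not how dynamo assigns sources.

```python
import sys


def make_forward():
    bank = ["closure", "state"]  # captured by the inner function below

    def forward(x):
        # Snapshot the running frame's f_locals. In CPython this mapping
        # holds the local argument `x` and also the free variable `bank`.
        seen = dict(sys._getframe().f_locals)
        return seen, bank

    return forward


forward = make_forward()
seen_locals, bank = forward(3)

# Inside the frame, the closure variable looks like any other local name.
print(sorted(seen_locals))

# From outside the frame, the same object is only reachable through the
# function's closure cells, which is where the longer
# `__closure__[i].cell_contents` style names come from.
closure_names = forward.__code__.co_freevars
closure_values = [cell.cell_contents for cell in forward.__closure__]
print(closure_names, closure_values[0] is bank)
```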
Differential Revision: [D82582029](https://our.internmc.facebook.com/intern/diff/D82582029)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163107
Approved by: https://github.com/avikchaudhuri
Introduces a variant of size-hint multi-kernel where, for novel runtime shapes, instead of performing full benchmarking to determine the optimal kernel, we select one of the kernels pre-generated from the multi-kernel hints, based on similarity between the hint and runtime input & output shapes (L1 distance in log2 space).
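As a rough illustration of the selection rule, the sketch below picks the hint whose tuning shapes are closest to the runtime shapes under an L1 distance in log2 space. The function names and the dictionary layout are assumptions made for this example; they are not the Inductor implementation.

```python
import math
from typing import Dict, Sequence

Shapes = Sequence[Sequence[int]]


def log2_l1_distance(hint_shapes: Shapes, runtime_shapes: Shapes) -> float:
    """Sum of |log2(hint_dim) - log2(runtime_dim)| over all dims of all tensors."""
    total = 0.0
    for hint_shape, runtime_shape in zip(hint_shapes, runtime_shapes):
        for hint_dim, runtime_dim in zip(hint_shape, runtime_shape):
            total += abs(math.log2(hint_dim) - math.log2(runtime_dim))
    return total


def select_hint(shapes_by_hint: Dict[int, Shapes], runtime_shapes: Shapes) -> int:
    """Return the hint whose tuning shapes are nearest to the runtime shapes."""
    return min(
        shapes_by_hint,
        key=lambda hint: log2_l1_distance(shapes_by_hint[hint], runtime_shapes),
    )


# Tuning shapes for hints 64 and 256 (hypothetical layout, one entry per hint).
shapes_by_hint = {
    64: ([64, 64], [64, 64]),
    256: ([256, 256], [256, 256]),
}

print(select_hint(shapes_by_hint, ([100, 100], [100, 100])))
print(select_hint(shapes_by_hint, ([200, 200], [200, 200])))
```

Working in log2 space makes the distance depend on the ratio between sizes rather than their absolute difference, so 100 is treated as closer to 64 than to 256.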
Some caveats/changes:
- Size-hint multi-kernel now only kicks in if the kernel has dynamic shapes
- Pre-generation still only does a 1-d search over the specified hints, e.g. `matmul([s0, s1], [s1, s2])` with size-hints `[64, 256]` only generates 2 kernels, based on tuning shapes ([64, 64], [64, 64]) and ([256, 256], [256, 256]). Extending this to a reasonable n-d search (via user API?) is left as a follow-up
Benchmarking results, compared to multi-kernel w/ full benchmarking (hints 64, 4096), and compiling with the ground truth hint:
<img width="1902" height="1222" alt="550541081_1088709150049684_6528797079439730237_n" src="https://github.com/user-attachments/assets/056cca48-c16a-4451-9b4a-fa13a7a058a9" />
Full benchmarking doing worse is extremely weird, but we did see similar spikes in #156628
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163090
Approved by: https://github.com/bobrenjc93
Fixes #160547
### Summary:
Bug: `torch.export.export` with `strict=False` errors when the inputs are passed as a namedtuple.
```
def test_namedtuple(self):
    from collections import namedtuple

    Point = namedtuple('Point', 'x y')

    class M(torch.nn.Module):
        def forward(self, x, y):
            return x + y

    inp = Point(torch.ones(3), torch.ones(3))
    print(M()(*inp))

    # errors
    ep = torch.export.export(M(), inp, strict=False)
    print(ep)

    # succeeds
    ep = torch.export.export(M(), inp, strict=True)
    print(ep)

    # workaround could be to convert namedtuple to a kwarg
    inp_kwargs = {field: getattr(inp, field) for field in inp._fields}
    ep = torch.export.export(M(), (), inp_kwargs)
    print(ep)
```
Fix:
A namedtuple is a subclass of tuple, but the non-strict export path did not
expect a namedtuple as input.
This change handles the namedtuple case.
I have added a test case for this as well.
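For context, here is a small pure-Python sketch of why code written for plain tuples can trip over a namedtuple. The helpers `is_namedtuple` and `rebuild_like` are hypothetical and do not reflect the actual change in torch.export; they only show one common pitfall, which may or may not be the one hit here.

```python
from collections import namedtuple

Point = namedtuple("Point", "x y")


def is_namedtuple(obj) -> bool:
    """Heuristic check: a tuple subclass whose type defines `_fields`."""
    return isinstance(obj, tuple) and hasattr(type(obj), "_fields")


def rebuild_like(container, items):
    """Rebuild a container of the same type as `container` from `items`."""
    items = list(items)
    if is_namedtuple(container):
        # A namedtuple constructor takes one positional argument per field.
        return type(container)(*items)
    # A plain tuple constructor takes a single iterable.
    return type(container)(items)


point = Point(1, 2)
rebuilt = rebuild_like(point, (value * 10 for value in point))
print(isinstance(point, tuple), rebuilt)

# The tuple-style call fails for a namedtuple, because the whole list is
# bound to the first field and the second field is missing.
try:
    Point([10, 20])
except TypeError as err:
    print("tuple-style construction failed:", err)
```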
Pull Request resolved: https://github.com/pytorch/pytorch/pull/162959
Approved by: https://github.com/angelayi
Co-authored-by: Angela Yi <angelayi@meta.com>
Summary: `.contiguous()` will discard the original storage size of the tensor, and could lead to issues during loading.
Test Plan:
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_1D_tensor_slicing
buck2 run mode/dev-nosan caffe2/test:test_export -- -r test_2D_tensor_slicing
Differential Revision: D83016250
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163587
Approved by: https://github.com/angelayi
Fixes part of #163314
In particular, this addresses **Bug 1: H=None Broadcasting Produces Incorrect Results**.
This fixes a shape bug when slicing a BlockMask on the Q-tile axis with an int (**mask[:, :, i]**). That form of indexing collapses the Q dimension, so kv_num_blocks/kv_indices lose their expected [B, H, Q_tiles, …] shape. Although the mask_mod remains "interpretable", the kernel’s stride math then reads from the wrong offsets. The result is silent numerical mismatches compared to regular SDPA, especially with single-position decoding or H broadcasting.
The case where B=None, H=None works is accidental: with singleton batch/head the kernel maps to index 0 via `sparse_idx_z = off_zq % 1` and `sparse_idx_hq = off_hq % 1`, and with a single Q tile `q_start // SPARSE_Q_MULTIPLE = 0`. The missing Q-tiles stride is multiplied by 0, so the bad offset from the collapsed Q axis doesn’t move the pointer and the kernel happens to read the first tile correctly. Once H > 1 or there are multiple Q tiles, those terms become nonzero and the kernel indexes with the wrong strides, which causes silent errors.
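The "accidentally correct" case can be illustrated with a toy model of the stride arithmetic. This is not the flex attention kernel: the shapes, the helper names, and the assumption that the kernel reads the first three strides as (batch, head, Q tile) are all made up for the example.

```python
from typing import Sequence, Tuple


def contiguous_strides(shape: Sequence[int]) -> Tuple[int, ...]:
    """Row-major strides (in elements) for a contiguous tensor of `shape`."""
    strides = []
    running = 1
    for dim in reversed(shape):
        strides.append(running)
        running *= dim
    return tuple(reversed(strides))


def tile_offset(strides: Sequence[int], z: int, h: int, q_tile: int) -> int:
    """Offset computed by a consumer that assumes a [B, H, Q_tiles, ...] layout."""
    return z * strides[0] + h * strides[1] + q_tile * strides[2]


# Expected layout [B, H, Q_tiles, KV] versus the layout after the Q axis
# has been collapsed by integer indexing, [B, H, KV].
expected = contiguous_strides((1, 4, 3, 8))
collapsed = contiguous_strides((1, 4, 8))

# All indices zero: every stride is multiplied by 0, so both layouts agree.
print(tile_offset(expected, 0, 0, 0), tile_offset(collapsed, 0, 0, 0))

# A nonzero head or Q-tile index exposes the mismatched strides.
print(tile_offset(expected, 0, 1, 0), tile_offset(collapsed, 0, 1, 0))
print(tile_offset(expected, 0, 0, 1), tile_offset(collapsed, 0, 0, 1))
```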
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163426
Approved by: https://github.com/drisspg
Differential Revision: D82933509
Over the weekend I realized that some of the cache implementation was a bit silly and too constrained to be actually generic. For example, InMemoryCache[str, bytes] was odd, since we'd probably want to be able to store more than just str keys with bytes values. TL;DR: everything is now generic, with the one constraint being that Key and Value must both be pickle-able types. This makes things a lot simpler for us, since all caches can now be str -> bytes caches under the hood if we'd like, and Key/Value just get pickled on the way in and out.
With this change, there were also some improvements made to the testing: mainly better coverage, but we now also test each cache across every combination of Key/Value types to ensure that they will work with the types we might specify later.
I also hardened some things here and there. For example, we now use literal_eval (I forgot who mentioned this on the first PR, but thank you for the suggestion!), and all errors coming from the caching will be wrapped in CacheError from now on (although we still raise from the original error context where possible).
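A minimal sketch of the design described above, assuming a str -> bytes store with pickling at the boundary. The class `PickledInMemoryCache` and its `insert`/`get` methods are hypothetical names for illustration and are not the API added in this PR; only the `CacheError` name comes from the description.

```python
import pickle
from typing import Dict, Generic, Optional, TypeVar

Key = TypeVar("Key")
Value = TypeVar("Value")


class CacheError(Exception):
    """Single error type surfaced by the cache (name taken from the PR text)."""


class PickledInMemoryCache(Generic[Key, Value]):
    """Generic cache that is a plain str -> bytes mapping under the hood."""

    def __init__(self) -> None:
        self._store: Dict[str, bytes] = {}

    @staticmethod
    def _dumps(obj: object) -> bytes:
        try:
            return pickle.dumps(obj)
        except Exception as err:
            # Wrap the failure, but keep the original error as the cause.
            raise CacheError(f"cannot pickle {type(obj).__name__}") from err

    def insert(self, key: Key, value: Value) -> bool:
        """Store value under key; return False if the key is already present."""
        encoded_key = self._dumps(key).hex()
        if encoded_key in self._store:
            return False
        self._store[encoded_key] = self._dumps(value)
        return True

    def get(self, key: Key) -> Optional[Value]:
        """Return the stored value, or None on a miss."""
        payload = self._store.get(self._dumps(key).hex())
        return None if payload is None else pickle.loads(payload)


cache: PickledInMemoryCache[tuple, list] = PickledInMemoryCache()
cache.insert(("matmul", 64), [1.0, 2.0])
print(cache.get(("matmul", 64)))
```

One caveat of this sketch: equal keys are not guaranteed to pickle to identical bytes for every type (for example, dicts built in different insertion orders), so a real implementation may need a more careful key encoding.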
Putting this PR up now for feedback. In the process of generalizing the code I removed the documentation since it was becoming outdated, but I will add it back in after the PR is green.
I have the next PR ready as well (it implements a fresh cache context manager); I will export it once this lands.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163488
Approved by: https://github.com/aorenste, https://github.com/masnesral
## Why this PR?
I've tried to follow the guidance of the `OpenReg` [usage example](https://github.com/pytorch/pytorch/tree/main/test/cpp_extensions/open_registration_extension/torch_openreg/third_party/openreg) and found that the command for compiling `example.cpp` (`g++ -o out example/example.cpp -L ./build -lopenreg`) is not compatible with my `gcc` (v11.4).
I installed my `gcc` through `apt install build-essential`, which I think is a common way for developers to install it, so I believe it's necessary to slightly modify the command by adding `-I ./` to explicitly indicate the header file search path.
## What I've changed?
- I added `-I ./` to correctly search for `./include/openreg.h`.
- I also added a `pwd` comment for better readability and removed unused imports in `example/example.cpp`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/163235
Approved by: https://github.com/FFFrog, https://github.com/albanD
Co-authored-by: Jiawei Li <ljw1101.vip@gmail.com>