Commit Graph

15 Commits

Author SHA1 Message Date
ea614fb2b1 Flip default value for mypy disallow_untyped_defs [2/11] (#127839)
See #127836 for details.
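For context, `disallow_untyped_defs` makes mypy reject function definitions that lack annotations. A minimal sketch of the per-file directive's effect (the function names are illustrative):

```
# mypy: disallow-untyped-defs

def typed(x: int) -> int:  # OK: fully annotated
    return x + 1

def untyped(x):  # error: Function is missing a type annotation
    return x + 1
```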

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127839
Approved by: https://github.com/oulgen
2024-06-08 18:23:08 +00:00
e4623de4cf typing scheduler.py [2/2]: Apply types (#126656)
Add `# mypy: disallow-untyped-defs` to scheduler.py and then fix the resulting fallout.

We should probably eventually add a new node type between BaseSchedulerNode and all the non-FusedSchedulerNode types to indicate the split between nodes that have a valid `self.node` and ones that don't. That would make a lot of the `assert self.node is not None` churn go away, but it would be a bigger change because a lot of code makes assumptions about types that aren't reflected in the types themselves. The churn is illustrated in the sketch below.
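A simplified, hypothetical sketch of that churn (not the actual scheduler code):

```
from typing import Optional

class IRNode:
    origins: str = "..."

class BaseSchedulerNode:
    def __init__(self, node: Optional[IRNode]) -> None:
        self.node = node  # None for fused/grouped nodes

    def debug_str(self) -> str:
        # The assert narrows Optional[IRNode] to IRNode so mypy
        # accepts the attribute access; without a dedicated node
        # type for "has a valid self.node", this repeats at every
        # use site.
        assert self.node is not None
        return self.node.origins
```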

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126656
Approved by: https://github.com/eellison
2024-05-22 20:33:31 +00:00
b40fb2de59 [AOTI] Fix a codegen issue when .item() is used for kernel arg (#126575)
Summary: fixes https://github.com/pytorch/pytorch/issues/126574. Pass kernel argument type information into generate_args_decl so it can generate the argument declarations from the actual types instead of relying on string matching.
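The shape of the fix, as a hypothetical sketch (the parameter names and type strings here are made up; the real signature lives in AOTI's C++ wrapper codegen):

```
# Hypothetical: declare each kernel argument from its recorded type
# instead of inferring the type by string-matching the argument name.
def generate_args_decl(args, arg_types):
    decls = []
    for name, typ in zip(args, arg_types):
        if typ == "int64_t":  # e.g. a scalar produced by .item()
            decls.append(f"int64_t {name};")
        else:                 # default: a device pointer
            decls.append(f"CUdeviceptr {name};")
    return "\n".join(decls)
```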

Pull Request resolved: https://github.com/pytorch/pytorch/pull/126575
Approved by: https://github.com/chenyang78
ghstack dependencies: #126369
2024-05-21 18:20:20 +00:00
3267814d53 [inductor] refactor: device dispatch inside do_bench (#125736)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125736
Approved by: https://github.com/shunting314
2024-05-09 23:50:02 +00:00
9fec26e231 Fix typo under torch/_inductor directory (#119658)
This PR fixes typo in comments and msgs under `torch/_inductor` directory, and also changes the corresponding test.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/119658
Approved by: https://github.com/colesbury
2024-04-30 22:28:56 +00:00
7fd8870e6b [inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557)
I am planning to make the compile_worker process not import torch so it can start up much faster.  This stack is prep for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124557
Approved by: https://github.com/yanboliang
ghstack dependencies: #124552, #124553
2024-04-22 18:46:24 +00:00
0b90af0bf5 Revert "[inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557)"
This reverts commit fcf28b0ad59b1912d5783688b0f25f18b46efeb3.

Reverted https://github.com/pytorch/pytorch/pull/124557 on behalf of https://github.com/jeanschmidt due to There are internal breakages, already discussed with author and he'll FF ([comment](https://github.com/pytorch/pytorch/pull/124552#issuecomment-2070548223))
2024-04-22 18:28:05 +00:00
fcf28b0ad5 [inductor] Refactor runtime files into torch._inductor.runtime (part 3) (#124557)
I am planning to make the compile_worker process not import torch so it can start up much faster.  This stack is prep for that.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124557
Approved by: https://github.com/yanboliang
ghstack dependencies: #124552, #124553
2024-04-22 04:51:15 +00:00
fd0bf96c2b [inductor] make multi-kernel work with cpp-wrapper (#117813)
Make multi-kernel work with cpp-wrapper. Multi-kernel generates two equivalent variants for a reduction, and at runtime the faster one is picked. But cpp-wrapper needs to save the cubin file during codegen, so initially the two features did not work with each other.

Thanks to Jason for suggesting a neat way to integrate the two. cpp-wrapper currently does two codegen passes. In the first pass, we still generate multi-kernel code and run it; in the second pass, we load the cubin file for the faster kernel directly. Multi-kernel Python code is not generated in the second pass since it should not be needed.
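Schematically, the two passes look something like this (a self-contained, hypothetical sketch; the helper names are illustrative, not the real inductor API):

```
import time

def persistent_reduction():  # stand-in for kernel variant 1
    return sum(range(1000))

def regular_reduction():     # stand-in for kernel variant 2
    return sum(range(1000))

def pick_faster_kernel(kernels):
    # Time each variant once and return the faster one.
    def bench(k):
        t0 = time.perf_counter()
        k()
        return time.perf_counter() - t0
    return min(kernels, key=bench)

def codegen_pass(pass_num, kernels, cache):
    if pass_num == 1:
        # First pass: generate and run the multi-kernel code,
        # recording which variant wins the benchmark.
        cache["winner"] = pick_faster_kernel(kernels)
        return "<multi-kernel wrapper source>"
    # Second pass: emit code that loads the winner's cubin directly;
    # no multi-kernel Python code is generated.
    return f"<load cubin for {cache['winner'].__name__}>"

cache = {}
codegen_pass(1, [persistent_reduction, regular_reduction], cache)
codegen_pass(2, [], cache)
```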

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117813
Approved by: https://github.com/jansel
2024-02-05 23:35:41 +00:00
b964a1222c Revert "[inductor] make multi-kernel work with cpp-wrapper (#117813)"
This reverts commit c24ffc3f66b2270dfc65a404687b91b55ed580e9.

Reverted https://github.com/pytorch/pytorch/pull/117813 on behalf of https://github.com/atalman due to Failing internal tests ([comment](https://github.com/pytorch/pytorch/pull/117813#issuecomment-1927877102))
2024-02-05 19:25:39 +00:00
c24ffc3f66 [inductor] make multi-kernel work with cpp-wrapper (#117813)
Make multi-kernel work with cpp-wrapper. Multi-kernel generates two equivalent variants for a reduction, and at runtime the faster one is picked. But cpp-wrapper needs to save the cubin file during codegen, so initially the two features did not work with each other.

Thanks to Jason for suggesting a neat way to integrate the two. cpp-wrapper currently does two codegen passes. In the first pass, we still generate multi-kernel code and run it; in the second pass, we load the cubin file for the faster kernel directly. Multi-kernel Python code is not generated in the second pass since it should not be needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117813
Approved by: https://github.com/jansel
2024-02-03 00:06:21 +00:00
796278b57e Revert "[inductor] make multi-kernel work with cpp-wrapper (#117813)"
This reverts commit 20484a193626ef72e0b3f35914f17deb2a89b8fc.

Reverted https://github.com/pytorch/pytorch/pull/117813 on behalf of https://github.com/atalman due to broke linux-focal-rocm5.7-py3.8 tests ([comment](https://github.com/pytorch/pytorch/pull/117813#issuecomment-1922613135))
2024-02-02 01:19:19 +00:00
20484a1936 [inductor] make multi-kernel work with cpp-wrapper (#117813)
Make multi-kernel work with cpp-wrapper. Multi-kernel generates two equivalent variants for a reduction, and at runtime the faster one is picked. But cpp-wrapper needs to save the cubin file during codegen, so initially the two features did not work with each other.

Thanks to Jason for suggesting a neat way to integrate the two. cpp-wrapper currently does two codegen passes. In the first pass, we still generate multi-kernel code and run it; in the second pass, we load the cubin file for the faster kernel directly. Multi-kernel Python code is not generated in the second pass since it should not be needed.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/117813
Approved by: https://github.com/jansel
2024-02-01 21:29:02 +00:00
1562dae62c [BE]: Apply RUF025 dict.fromkeys preview rule (#118637)
Simplifies and optimizes dict construction using the `fromkeys` classmethod constructor. This also makes it really obvious when all the keys will have the same static value, which could be a bug if unintentional. It is also significantly faster than using a dict comprehension. The rule is in preview, but I am adding a forward fix for when it becomes stable.
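For example (standard dict behavior, including the shared-value caveat the rule makes obvious):

```
keys = ["a", "b", "c"]

# Before: dict comprehension with a constant value
flags = {k: False for k in keys}

# After: clearer and faster with the classmethod constructor
flags = dict.fromkeys(keys, False)
assert flags == {"a": False, "b": False, "c": False}

# Caveat fromkeys makes obvious: the value object is shared, not copied.
shared = dict.fromkeys(keys, [])
shared["a"].append(1)
assert shared["b"] == [1]  # every key references the same list
```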

Pull Request resolved: https://github.com/pytorch/pytorch/pull/118637
Approved by: https://github.com/albanD
2024-01-30 20:46:54 +00:00
e432b2e607 [inductor] multi-kernel support (#103469)
For a persistent reduction, we generate two flavors of 'equivalent' kernels at the same time:
- a persistent reduction
- a regular reduction

A MultiKernel wraps these two kernels and picks the one with better performance at runtime.
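In spirit, the wrapper behaves like this simplified, hypothetical sketch (the real implementation benchmarks generated Triton kernels inside inductor):

```
import time
from typing import Callable, Optional

class MultiKernel:
    # Wraps two equivalent kernels; the first call benchmarks both,
    # and every later call dispatches to the faster one.
    def __init__(self, persistent: Callable, regular: Callable) -> None:
        self.kernels = [persistent, regular]
        self.winner: Optional[Callable] = None

    def __call__(self, *args):
        if self.winner is None:
            def bench(k):
                t0 = time.perf_counter()
                k(*args)
                return time.perf_counter() - t0
            self.winner = min(self.kernels, key=bench)
        return self.winner(*args)
```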

Here are more implementation details:
- Inductor maintains state for generating kernels, e.g. the wrapper code. After we generate code for one kernel, we need to restore the inductor state before we can generate the counterpart.

***There is one thing I need comments from others on***:
There is one tricky thing about kernel arguments. In general, inductor removes a buffer from the argument list if it's only used inside the kernel. But a buffer removed by the persistent reduction kernel may still be kept by the regular (non-persistent) reduction kernel because of a CSE invalidation rule. My current implementation avoids removing buffers if multi_kernel is enabled. This makes sure both flavors of reduction have a consistent argument list. Another idea is to generate the multi-kernel definition with the union of arguments from both sub-kernels and let each sub-kernel pick the subset of arguments it wants, but that would make the codegen for multi-kernel much more complex.

I'm not sure if there is an easy and clean way to resolve this.

Testing command:
```
TORCHINDUCTOR_MULTI_KERNEL=1 TORCH_LOGS=+torch._inductor.graph TORCHINDUCTOR_UNIQUE_KERNEL_NAMES=1 python benchmarks/dynamo/huggingface.py --backend inductor --amp --performance --only BertForMaskedLM --training
```

Pull Request resolved: https://github.com/pytorch/pytorch/pull/103469
Approved by: https://github.com/jansel
2024-01-18 23:16:31 +00:00