44186a0a4e
Move Sympy printers to torch/utils/_sympy/printers.py ( #140597 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140597
Approved by: https://github.com/ezyang , https://github.com/anijain2305
2024-11-26 18:11:00 +00:00
f23621ec56
Revert "Move Sympy printers to torch/utils/_sympy/printers.py ( #140597 )"
...
This reverts commit c25b201583fc28243b87c460a2f18e2531a676e7.
Reverted https://github.com/pytorch/pytorch/pull/140597 on behalf of https://github.com/huydhn due to Trunk is sad again after this lands, this looks like a landrace this time, so please do a rebase ([comment](https://github.com/pytorch/pytorch/pull/140597#issuecomment-2494052978 ))
2024-11-22 15:43:39 +00:00
c25b201583
Move Sympy printers to torch/utils/_sympy/printers.py ( #140597 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140597
Approved by: https://github.com/ezyang , https://github.com/anijain2305
2024-11-22 02:04:36 +00:00
701e06b643
Revert "Move Sympy printers to torch/utils/_sympy/printers.py ( #140597 )"
...
This reverts commit aefcdb3c9fa787f9d43864f6f99a3590c914324a.
Reverted https://github.com/pytorch/pytorch/pull/140597 on behalf of https://github.com/huydhn due to Sorry for reverting your change but I think it fails inductor/test_padding in trunk. This is a target determination miss and that failed test was not run in your PR ([comment](https://github.com/pytorch/pytorch/pull/140597#issuecomment-2489641453 ))
2024-11-20 22:13:57 +00:00
aefcdb3c9f
Move Sympy printers to torch/utils/_sympy/printers.py ( #140597 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140597
Approved by: https://github.com/ezyang , https://github.com/anijain2305
2024-11-20 20:26:49 +00:00
12e95aa4ee
[BE]: Apply PERF401 autofixes from ruff ( #140980 )
...
* Automatically applies ruff rule PERF401, converting loops into equivalent list comprehensions, which are faster and do not leak loop variables into the enclosing scope (see the sketch below).
* List comprehensions often type-check better, cut loop overhead by 50% or more compared with explicit for loops, preserve length information, and are easier for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt.
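A minimal illustration of the kind of rewrite PERF401 performs (hypothetical names, not taken from the PR diff):
```py
# before: explicit loop that appends into a list
names = []
for buf in buffers:
    names.append(buf.name)

# after: equivalent list comprehension (less overhead, no leaked loop variable)
names = [buf.name for buf in buffers]
```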
Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby , https://github.com/malfet
2024-11-20 17:52:07 +00:00
ed30fa74ab
[inductor] sympy.Integer([01]) -> sympy.S.(Zero|One) ( #139523 )
...
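An illustrative sketch of the substitution the title describes (not taken from the PR diff):
```py
import sympy

sympy.Integer(0), sympy.Integer(1)  # before: constructs Integer objects
sympy.S.Zero, sympy.S.One           # after: reuses sympy's cached singletons
```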
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139523
Approved by: https://github.com/ezyang
ghstack dependencies: #139364 , #139365 , #139370 , #139452
2024-11-04 04:28:40 +00:00
98e11b0021
Revert "[inductor] sympy.Integer([01]) -> sympy.S.(Zero|One) ( #139523 )"
...
This reverts commit c53beab3775671b5b7ec6106737c0d8939b8455a.
Reverted https://github.com/pytorch/pytorch/pull/139523 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it is failing lots of internal tests in D65345157 ([comment](https://github.com/pytorch/pytorch/pull/139364#issuecomment-2452897337 ))
2024-11-02 06:49:10 +00:00
c53beab377
[inductor] sympy.Integer([01]) -> sympy.S.(Zero|One) ( #139523 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139523
Approved by: https://github.com/ezyang
ghstack dependencies: #139364 , #139365 , #139370 , #139452
2024-11-02 03:04:22 +00:00
f9ef880c0b
[inductor] Refactor kernel args into SIMDKernelFeatures ( #139327 )
...
This is a refactor PR to move stuff around. I'm planning to use the SIMDKernelFeatures class (in a future PR) to host new heuristics for selecting kernel types and block sizes.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/139327
Approved by: https://github.com/eellison , https://github.com/shunting314
2024-11-01 00:30:14 +00:00
2b937e4e6d
[inductor] Cooperative reductions ( #137756 )
...
Example generated code for `(x+y).sum()`:
```py
@triton.jit
def triton_unk_fused_add_sum_0(in_ptr0, in_ptr1, out_ptr0, ws_ptr, semaphores_ptr, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr, RSPLIT : tl.constexpr):
xnumel = 1
rnumel = 1048576
rsplit_id = tl.program_id(0)
num_rblocks = (rnumel + RBLOCK - 1) // RBLOCK
rsplit_chunk = (num_rblocks + RSPLIT - 1) // RSPLIT * RBLOCK
rsplit_start = rsplit_chunk * rsplit_id
rsplit_end = rsplit_chunk * (rsplit_id + 1)
xoffset = tl.program_id(1) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
xmask = tl.full([XBLOCK, RBLOCK], True, tl.int1)
rbase = tl.arange(0, RBLOCK)[None, :]
_tmp4 = tl.full([XBLOCK, RBLOCK], 0, tl.float32)
for roffset in range(rsplit_start, rsplit_end, RBLOCK):
rindex = roffset + rbase
rmask = rindex < rnumel
r0 = rindex
tmp0 = tl.load(in_ptr0 + (r0), rmask, eviction_policy='evict_first', other=0.0)
tmp1 = tl.load(in_ptr1 + (r0), rmask, eviction_policy='evict_first', other=0.0)
tmp2 = tmp0 + tmp1
tmp3 = tl.broadcast_to(tmp2, [XBLOCK, RBLOCK])
tmp5 = _tmp4 + tmp3
_tmp4 = tl.where(rmask, tmp5, _tmp4)
tmp4 = tl.sum(_tmp4, 1)[:, None]
if RSPLIT > 1:
tmp4_ws = (ws_ptr + 0).to(tl.pointer_type(tl.float32))
tl.store(tmp4_ws + (xindex * RSPLIT + rsplit_id), tmp4, None)
if RSPLIT > 1:
triton_helpers.gpu_barrier(semaphores_ptr + (2 * tl.program_id(1) + 0), RSPLIT, True)
if RSPLIT > 1:
tmp4_peers = tl.load(tmp4_ws + (xindex * RSPLIT + tl.arange(0, RSPLIT)[None,:]), None, eviction_policy='evict_first')
tmp4 = tl.sum(tmp4_peers, 1)[:, None]
if rsplit_id == (0 % RSPLIT):
tl.store(out_ptr0 + (tl.full([XBLOCK, 1], 0, tl.int32)), tmp4, None)
```
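For readers of the generated kernel above, a rough Python sketch of what the RSPLIT path does, with hypothetical shapes (illustrative only):
```py
import numpy as np

def cooperative_sum(x, rsplit):
    # Each of the RSPLIT programs reduces its own chunk of the reduction
    # dimension (rsplit_start..rsplit_end in the Triton code above).
    chunks = np.array_split(x, rsplit)
    workspace = np.empty(rsplit, dtype=x.dtype)  # plays the role of ws_ptr
    for pid in range(rsplit):                    # each pid runs as a separate program
        workspace[pid] = chunks[pid].sum()       # partial sum stored to the workspace
    # gpu_barrier: all partial sums are now visible to every program.
    return workspace.sum()                       # peers reloaded and re-reduced; program 0 stores the result
```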
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137756
Approved by: https://github.com/eellison
2024-10-29 00:45:53 +00:00
60d1c7138d
Revert "[inductor] Cooperative reductions ( #137756 )"
...
This reverts commit fed37dbfbceefe306af648ff4fe1e0124c4d7844.
Reverted https://github.com/pytorch/pytorch/pull/137756 on behalf of https://github.com/jeanschmidt due to ROCM tests are timing out :( ([comment](https://github.com/pytorch/pytorch/pull/137756#issuecomment-2441579322 ))
2024-10-28 13:24:33 +00:00
fed37dbfbc
[inductor] Cooperative reductions ( #137756 )
...
Example generated code for `(x+y).sum()`:
```py
@triton.jit
def triton_unk_fused_add_sum_0(in_ptr0, in_ptr1, out_ptr0, ws_ptr, semaphores_ptr, xnumel, rnumel, XBLOCK : tl.constexpr, RBLOCK : tl.constexpr, RSPLIT : tl.constexpr):
xnumel = 1
rnumel = 1048576
rsplit_id = tl.program_id(0)
num_rblocks = (rnumel + RBLOCK - 1) // RBLOCK
rsplit_chunk = (num_rblocks + RSPLIT - 1) // RSPLIT * RBLOCK
rsplit_start = rsplit_chunk * rsplit_id
rsplit_end = rsplit_chunk * (rsplit_id + 1)
xoffset = tl.program_id(1) * XBLOCK
xindex = xoffset + tl.arange(0, XBLOCK)[:, None]
xmask = tl.full([XBLOCK, RBLOCK], True, tl.int1)
rbase = tl.arange(0, RBLOCK)[None, :]
_tmp4 = tl.full([XBLOCK, RBLOCK], 0, tl.float32)
for roffset in range(rsplit_start, rsplit_end, RBLOCK):
rindex = roffset + rbase
rmask = rindex < rnumel
r0 = rindex
tmp0 = tl.load(in_ptr0 + (r0), rmask, eviction_policy='evict_first', other=0.0)
tmp1 = tl.load(in_ptr1 + (r0), rmask, eviction_policy='evict_first', other=0.0)
tmp2 = tmp0 + tmp1
tmp3 = tl.broadcast_to(tmp2, [XBLOCK, RBLOCK])
tmp5 = _tmp4 + tmp3
_tmp4 = tl.where(rmask, tmp5, _tmp4)
tmp4 = tl.sum(_tmp4, 1)[:, None]
if RSPLIT > 1:
tmp4_ws = (ws_ptr + 0).to(tl.pointer_type(tl.float32))
tl.store(tmp4_ws + (xindex * RSPLIT + rsplit_id), tmp4, None)
if RSPLIT > 1:
triton_helpers.gpu_barrier(semaphores_ptr + (2 * tl.program_id(1) + 0), RSPLIT, True)
if RSPLIT > 1:
tmp4_peers = tl.load(tmp4_ws + (xindex * RSPLIT + tl.arange(0, RSPLIT)[None,:]), None, eviction_policy='evict_first')
tmp4 = tl.sum(tmp4_peers, 1)[:, None]
if rsplit_id == (0 % RSPLIT):
tl.store(out_ptr0 + (tl.full([XBLOCK, 1], 0, tl.int32)), tmp4, None)
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/137756
Approved by: https://github.com/eellison
ghstack dependencies: #138970
2024-10-27 16:31:38 +00:00
ba6526814a
Add dtype attribute to CSEVariable ( #136778 )
...
Summary:
- This diff introduces a `dtype` attribute on `TritonCSEVariable` and a dtype-propagation helper that infers each op's output dtype from its input dtypes (see the sketch below).
- A follow-up diff will use this `dtype` information in `TritonCSEVariable` to perform dtype-aware codegen.
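A hypothetical helper in the spirit of the propagation described above; for binary pointwise ops the output dtype follows torch's type-promotion rules (names are illustrative, not from the PR):
```py
import torch

def promote_binary_dtype(a_dtype: torch.dtype, b_dtype: torch.dtype) -> torch.dtype:
    # Output dtype of a binary pointwise op under torch's promotion rules.
    return torch.promote_types(a_dtype, b_dtype)

assert promote_binary_dtype(torch.float16, torch.float32) == torch.float32
```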
Test Plan: CI
Differential Revision: D61815079
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136778
Approved by: https://github.com/eellison , https://github.com/blaine-rister
2024-10-25 18:00:30 +00:00
4632594546
[inductor] Move V.graph.scheduler.current_device to V.graph.current_device ( #138252 )
...
There are some places where it would be nice to use this, but the scheduler hasn't yet been created.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/138252
Approved by: https://github.com/eellison
ghstack dependencies: #138170
2024-10-18 23:05:54 +00:00
71aac59e93
Add Triton CPU as an Inductor backend ( #133408 )
...
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.
Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel , https://github.com/blaine-rister , https://github.com/malfet
2024-09-30 20:24:52 +00:00
36428f91e9
Revert "Add Triton CPU as an Inductor backend ( #133408 )"
...
This reverts commit 31c0467594c7c41c8e8ff1828bf01fa31fc4454f.
Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/int3 due to internal tests failing ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2379692517 ))
2024-09-27 16:54:27 +00:00
31c0467594
Add Triton CPU as an Inductor backend ( #133408 )
...
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.
Differential Revision: [D63298968](https://our.internmc.facebook.com/intern/diff/D63298968 )
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel , https://github.com/blaine-rister , https://github.com/malfet
2024-09-26 15:35:26 +00:00
d0cebedb31
Revert "Add Triton CPU as an Inductor backend ( #133408 )"
...
This reverts commit e498b02b472e45cfd6b7a08db0d6c1babec655c5.
Reverted https://github.com/pytorch/pytorch/pull/133408 on behalf of https://github.com/jeanschmidt due to Broke internal signals, see D62737208 for more details ([comment](https://github.com/pytorch/pytorch/pull/133408#issuecomment-2353623816 ))
2024-09-16 18:33:33 +00:00
e498b02b47
Add Triton CPU as an Inductor backend ( #133408 )
...
The goal is to use Inductor-generated kernels to stress test the new Triton CPU backend.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133408
Approved by: https://github.com/jansel
2024-09-14 21:45:19 +00:00
13ee85ca5e
[Inductor] Generalize cuda cpp wrapper as common triton based GPU cpp wrapper, will be reused by xpu in next PR. ( #135312 )
...
[Inductor] Generalize cuda cpp wrapper as common triton based GPU cpp wrapper, will be reused by xpu in next PR.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135312
Approved by: https://github.com/jansel , https://github.com/desertfire , https://github.com/eellison
2024-09-11 23:59:54 +00:00
1b10a5c652
Allow SymInts and SymFloats as other in div_softmax_pattern ( #133989 )
...
Fixes https://github.com/pytorch/pytorch/issues/133759
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133989
Approved by: https://github.com/ezyang
2024-08-22 14:36:01 +00:00
7470ae85e4
Fix triton codegen with math.trunc ( #133354 )
...
Fixes https://github.com/pytorch/pytorch/issues/133172
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133354
Approved by: https://github.com/ezyang , https://github.com/jansel
2024-08-15 16:38:26 +00:00
09f9c256ad
Add basic mypy annotations to inductor ( #132416 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132416
Approved by: https://github.com/XuehaiPan , https://github.com/jamesjwu
ghstack dependencies: #132415
2024-08-04 18:43:37 +00:00
f2ddd5e9e0
Revert "Add basic mypy annotations to inductor ( #132416 )"
...
This reverts commit 78927d37f6085a0b30269cceb731d8097302c091.
Reverted https://github.com/pytorch/pytorch/pull/132416 on behalf of https://github.com/ZainRizvi due to Sorry, this PR has entered a weird state in the diff train. Trying to revert it to skip it, and then we can try relanding it ([comment](https://github.com/pytorch/pytorch/pull/132415#issuecomment-2267631785 ))
2024-08-04 18:39:29 +00:00
78927d37f6
Add basic mypy annotations to inductor ( #132416 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132416
Approved by: https://github.com/XuehaiPan , https://github.com/jamesjwu
ghstack dependencies: #132415
2024-08-01 20:14:25 +00:00
f32ab3b9e3
Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet ( #130004 )
...
Python's built-in set is nondeterministic in iteration order. We recently ran into an internal failure that did not fail consistently; see the repro here: P1453035092.
With these changes it now fails consistently. As a follow-up we could also consider adding a lint rule for uses of set() or set literals.
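A minimal sketch of an insertion-ordered set built on dict keys, which preserve insertion order (illustrative only; not the OrderedSet implementation used in the PR):
```py
class OrderedSet:
    """Set with deterministic, insertion-ordered iteration."""

    def __init__(self, items=()):
        self._data = dict.fromkeys(items)

    def add(self, item):
        self._data[item] = None

    def __contains__(self, item):
        return item in self._data

    def __iter__(self):
        return iter(self._data)

print(list(OrderedSet(["c", "a", "b"])))  # always ['c', 'a', 'b']
```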
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
2024-08-01 04:37:15 +00:00
784a6ec5a3
Revert "Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet ( #130004 )"
...
This reverts commit 13d744464f10e35c0de50feb4e2340d4dae8e05f.
Reverted https://github.com/pytorch/pytorch/pull/130004 on behalf of https://github.com/clee2000 due to broke lint [GH job link](https://github.com/pytorch/pytorch/actions/runs/10183945999/job/28170099930 ) [HUD commit link](13d744464f ) probably a landrace, the base is 21 hours old ([comment](https://github.com/pytorch/pytorch/pull/130004#issuecomment-2260946562 ))
2024-07-31 16:49:21 +00:00
13d744464f
Migrate Inductor scheduler, dependencies, ir, and codegen/common to use OrderedSet ( #130004 )
...
Python's built-in set is nondeterministic in iteration order. We recently ran into an internal failure that did not fail consistently; see the repro here: P1453035092.
With these changes it now fails consistently. As a follow-up we could also consider adding a lint rule for uses of set() or set literals.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130004
Approved by: https://github.com/oulgen
2024-07-31 16:22:11 +00:00
8b507a922a
Mode to emulate amp numerics ( #131595 )
...
```
# Mode to emulate pytorch eager numerics for lower precision (fp16, bf16)
# Pytorch eager computes bf16/fp16 by upcasting inputs to fp32 and downcasting after
# For multiple, fused pointwise nodes, inductor will elide the intermediary upcasts and downcasts
# Typically this should be closer to fp64 ref numerics. However, it can be useful for debugging
# to emulate the eager numerics.
```
We add extra upcasts and downcasts for pointwise nodes that correspond to casts present in the original user program (excluding pointwise nodes emitted during decomposition). Since this is mostly for debugging, the information lives in `meta` so that this mode does not have unintended side effects like changing pattern matching.
In theory there could also be other casts for fused reduction -> reduction, although I have not seen that much in practice; it could be done as a follow-up. Note: this only works with the CUDA backend right now.
This mode was sufficient to eliminate compile differences from https://fb.workplace.com/groups/385893200869952/posts/464263173032954/?comment_id=465199259606012&reply_comment_id=465676792891592 .
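A rough sketch of the eager numerics being emulated, with hypothetical tensors (not code from the PR): eager upcasts each pointwise op to fp32 and rounds back to bf16 between ops, whereas the fused chain keeps the intermediate in fp32.
```py
import torch

x = torch.randn(8, dtype=torch.bfloat16)

# eager-style: upcast, op, downcast around every pointwise node
eager_like = x.to(torch.float32).sin().to(torch.bfloat16).to(torch.float32).cos().to(torch.bfloat16)

# fused: intermediate downcast/upcast elided, closer to fp64 reference numerics
fused_like = x.to(torch.float32).sin().cos().to(torch.bfloat16)
```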
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131595
Approved by: https://github.com/shunting314 , https://github.com/bdhirsh , https://github.com/jansel
2024-07-29 22:42:23 +00:00
5772c13f56
Dont wrap negative indexing in scatter reduce ( #131503 )
...
Fix for https://github.com/pytorch/pytorch/issues/131321
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131503
Approved by: https://github.com/shunting314
2024-07-24 04:01:32 +00:00
b6d477fd56
[BE][Easy][16/19] enforce style for empty lines in import segments in torch/_i*/ ( #129768 )
...
See https://github.com/pytorch/pytorch/pull/129751#issue-2380881501. Most changes are auto-generated by the linter.
You can review these PRs via:
```bash
git diff --ignore-all-space --ignore-blank-lines HEAD~1
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129768
Approved by: https://github.com/jansel
2024-07-20 16:20:58 +00:00
dc7725cc16
[halide-backend] Random number generation ( #130211 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130211
Approved by: https://github.com/jansel
2024-07-15 05:03:24 +00:00
d325aaef39
[halide-backend] Use get_reduction_combine_fn for reduction ops ( #130212 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130212
Approved by: https://github.com/eellison
2024-07-08 17:23:32 +00:00
acd03ca2d9
[halide-backend] Support scan kernels ( #129035 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129035
Approved by: https://github.com/shunting314 , https://github.com/eellison
ghstack dependencies: #130129
2024-07-06 03:49:50 +00:00
c5110f6388
[halide-backend] Use 0D scalar inputs/outputs ( #130129 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130129
Approved by: https://github.com/shunting314
2024-07-06 03:49:50 +00:00
4fc9157e90
[halide-backend] Disable split reductions for Halide ( #129320 )
...
In theory Halide doesn't need the split reduction stuff we do for Triton since it can generate multiple kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129320
Approved by: https://github.com/shunting314 , https://github.com/eellison
ghstack dependencies: #129321
2024-07-03 05:56:40 +00:00
0abcca85b7
[halide-backend] Support manual schedules ( #129321 )
...
Currently using this for some by-hand hacking, but might need to implement our own scheduler later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129321
Approved by: https://github.com/shunting314
2024-07-03 05:56:40 +00:00
e385bf8ef8
Revert "[halide-backend] Disable split reductions for Halide ( #129320 )"
...
This reverts commit a18eb651d352e45860a96869abaf9fb7b215eac6.
Reverted https://github.com/pytorch/pytorch/pull/129320 on behalf of https://github.com/jeanschmidt due to This PR is breaking internal builds, please check comments on it D59204360 ([comment](https://github.com/pytorch/pytorch/pull/129320#issuecomment-2200351678 ))
2024-07-01 14:44:35 +00:00
a83eaf1c3a
Revert "[halide-backend] Support manual schedules ( #129321 )"
...
This reverts commit 9ae78a578caff195821ad535a9e8d8ef59552142.
Reverted https://github.com/pytorch/pytorch/pull/129321 on behalf of https://github.com/jeanschmidt due to Reverting, as it is required to do so in order to revert #129320 ([comment](https://github.com/pytorch/pytorch/pull/129321#issuecomment-2200345664 ))
2024-07-01 14:42:33 +00:00
9ae78a578c
[halide-backend] Support manual schedules ( #129321 )
...
Currently using this for some by-hand hacking, but might need to implement our own scheduler later.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129321
Approved by: https://github.com/shunting314
ghstack dependencies: #126417 , #129025 , #129026 , #127506 , #129036 , #129320
2024-06-29 14:06:28 +00:00
a18eb651d3
[halide-backend] Disable split reductions for Halide ( #129320 )
...
In theory Halide doesn't need the split reduction stuff we do for Triton since it can generate multiple kernels.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129320
Approved by: https://github.com/shunting314 , https://github.com/eellison
ghstack dependencies: #126417 , #129025 , #129026 , #127506 , #129036
2024-06-29 14:06:28 +00:00
4cb8cb04a7
[halide-backend] Enable bfloat16 support ( #129036 )
...
Requires https://github.com/halide/Halide/pull/8255
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129036
Approved by: https://github.com/shunting314 , https://github.com/eellison
ghstack dependencies: #126417 , #129025 , #129026 , #127506
2024-06-29 14:06:25 +00:00
b93bf55b6a
[halide-backend] Add GPU support ( #127506 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127506
Approved by: https://github.com/shunting314 , https://github.com/eellison
ghstack dependencies: #126417 , #129025 , #129026
2024-06-29 14:06:21 +00:00
86cadc6385
[halide-backend] Dimension-based indexing ( #129026 )
...
Prior to this, the generated Halide code was a rather literal translation of the Triton code, with XBLOCK/YBLOCK/RBLOCK and 1D inputs. Halide prefers explicit dimensions, and the 1D indexing triggered a lot of bugs and perf issues. This PR infers dimensions and changes the indexing in the generated code.
Before
```py
@hl.generator(name="kernel")
class Kernel:
in_ptr0 = hl.InputBuffer(hl.Float(32), 1)
out_ptr3 = hl.OutputBuffer(hl.Float(32), 2)
def generate(g):
in_ptr0 = g.in_ptr0
out_ptr3 = g.out_ptr3
xindex = hl.Var('xindex')
rindex = hl.Var('rindex')
r1 = rindex
x0 = xindex
idom = hl.RDom([hl.Range(0, 16), hl.Range(0, 32)])
odom = hl.RDom([hl.Range(0, 16)])
rdom = hl.RDom([hl.Range(0, 32)])
xindex_idom = idom.x
xindex_odom = odom.x
rindex_idom = idom.y
r1_idom = rindex_idom
x0_idom = xindex_idom
x0_odom = xindex_odom
tmp0 = hl.Func('tmp0')
tmp0[rindex, xindex] = in_ptr0[r1 + (32*x0)]
tmp1 = hl.Func('tmp1')
tmp1[xindex] = hl.maximum(rdom, tmp0[rdom, xindex])
tmp2 = hl.Func('tmp2')
tmp2[rindex, xindex] = tmp0[rindex, xindex] - tmp1[xindex]
tmp3 = hl.Func('tmp3')
tmp3[rindex, xindex] = hl.fast_exp(hl.cast(hl.Float(32), tmp2[rindex, xindex])) if tmp2.type().bits() <= 32 else hl.exp(tmp2[rindex, xindex])
tmp4 = hl.Func('tmp4')
tmp4[xindex] = hl.sum(rdom, tmp3[rdom, xindex])
tmp5 = hl.Func('tmp5')
tmp5[rindex, xindex] = tmp3[rindex, xindex] / tmp4[xindex]
out_ptr3_i0 = hl.Var('out_ptr3_i0')
out_ptr3_i1 = hl.Var('out_ptr3_i1')
out_ptr3[out_ptr3_i0, out_ptr3_i1] = hl.cast(out_ptr3.type(), tmp5[out_ptr3_i0, out_ptr3_i1])
assert g.using_autoscheduler()
in_ptr0.set_estimates([hl.Range(0, 512)])
out_ptr3.set_estimates([hl.Range(0, 32), hl.Range(0, 16)])
```
After
```py
@hl.generator(name="kernel")
class Kernel:
in_ptr0 = hl.InputBuffer(hl.Float(32), 2)
out_ptr3 = hl.OutputBuffer(hl.Float(32), 2)
def generate(g):
in_ptr0 = g.in_ptr0
out_ptr3 = g.out_ptr3
h0 = hl.Var('h0')
h1 = hl.Var('h1')
rdom = hl.RDom([hl.Range(0, 32)])
hr1 = rdom[0]
tmp0 = hl.Func('tmp0')
tmp0[h0, h1] = in_ptr0[h0, h1,]
tmp1 = hl.Func('tmp1')
tmp1[h1] = hl.maximum(rdom, tmp0[hr1, h1])
tmp2 = hl.Func('tmp2')
tmp2[h0, h1] = tmp0[h0, h1] - tmp1[h1]
tmp3 = hl.Func('tmp3')
tmp3[h0, h1] = hl.fast_exp(hl.cast(hl.Float(32), tmp2[h0, h1])) if tmp2.type().bits() <= 32 else hl.exp(tmp2[h0, h1])
tmp4 = hl.Func('tmp4')
tmp4[h1] = hl.sum(rdom, tmp3[hr1, h1])
tmp5 = hl.Func('tmp5')
tmp5[h0, h1] = tmp3[h0, h1] / tmp4[h1]
out_ptr3[h0, h1,] = hl.cast(hl.Float(32), tmp5[h0, h1])
assert g.using_autoscheduler()
in_ptr0.dim(0).set_min(0)
in_ptr0.dim(0).set_stride(1)
in_ptr0.dim(0).set_extent(32)
in_ptr0.dim(1).set_min(0)
in_ptr0.dim(1).set_stride(32)
in_ptr0.dim(1).set_extent(16)
in_ptr0.set_estimates([hl.Range(0, 32), hl.Range(0, 16)])
out_ptr3.set_estimates([hl.Range(0, 32), hl.Range(0, 16)])
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129026
Approved by: https://github.com/shunting314 , https://github.com/eellison
ghstack dependencies: #126417 , #129025
2024-06-29 14:06:16 +00:00
da5f37515e
[halide-backend] Generate standalone runtime ( #129025 )
...
This puts the Halide runtime in a global shared object rather than copying it into each kernel; having many copies of the runtime causes many issues with CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129025
Approved by: https://github.com/shunting314 , https://github.com/eellison
ghstack dependencies: #126417
2024-06-29 14:06:12 +00:00
e34b7e6af3
[halide-backend] Initial implementation of HalideKernel and HalideScheduling ( #126417 )
...
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126417
Approved by: https://github.com/shunting314 , https://github.com/eellison
2024-06-29 14:06:08 +00:00
1a54bb0f96
Revert "[halide-backend] Initial implementation of HalideKernel and HalideScheduling ( #126417 )"
...
This reverts commit 4f9399bd0d2bc0cbd14348b80e32b263de5c6bc0.
Reverted https://github.com/pytorch/pytorch/pull/126417 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/126417#issuecomment-2186999121 ))
2024-06-24 16:50:15 +00:00
063facf352
Revert "[halide-backend] Generate standalone runtime ( #129025 )"
...
This reverts commit 10c64c3b49e2008a50f9229e600c68c8a3d49292.
Reverted https://github.com/pytorch/pytorch/pull/129025 on behalf of https://github.com/fbgheith due to breaking internal builds ([comment](https://github.com/pytorch/pytorch/pull/129025#issuecomment-2186995467 ))
2024-06-24 16:47:25 +00:00
10c64c3b49
[halide-backend] Generate standalone runtime ( #129025 )
...
This puts the Halide runtime in a global shared object rather than copying it into each kernel; having many copies of the runtime causes many issues with CUDA.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129025
Approved by: https://github.com/shunting314 , https://github.com/eellison
ghstack dependencies: #126417
2024-06-22 17:39:52 +00:00