pytorch

frozenleaves/pytorch


mirror of https://github.com/pytorch/pytorch.git synced 2025-10-20 21:14:14 +08:00

Commit Graph


Shunting Zhang

74074fe8d8

[inductor] handle offset in ReinterpretView for alignment (#151859)

Fix https://github.com/pytorch/pytorch/issues/151589

It's interesting that the Q4_K dequantization example in the referenced GH issue does not crash even though Inductor passes Triton the wrong alignment information. I dug into this a bit. The main reason is that two things in Triton decide the vectorization size:
1. alignment
2. the max number of contiguous elements a thread needs to process

Here is the Triton code that decides the vectorization size [link](c5fed8e1ca/third_party/nvidia/lib/TritonNVIDIAGPUToLLVM/LoadStoreOpToLLVM.cpp (L147-L157)), and here is the Triton code that considers contiguity for vectorization [link](c5fed8e1ca/lib/Analysis/AxisInfo.cpp (L1250-L1269)).

When Inductor wrongly tells Triton that an unaligned tensor is aligned, Triton may still not vectorize (or not fully vectorize) because of the second restriction.
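To make the interaction between the two restrictions concrete, here is a minimal sketch of a simplified model. It is not Triton's actual implementation: the function name, its parameters, and the 16-byte cap are assumptions chosen for illustration only.

```python
def vector_width_elems(claimed_alignment_bytes, contiguous_elems_per_thread,
                       elem_bytes, max_vector_bytes=16):
    """Simplified model (assumption): the vector width is limited by the
    claimed alignment, by per-thread contiguity, and by a hardware cap."""
    by_alignment = max(claimed_alignment_bytes // elem_bytes, 1)
    by_hardware = max(max_vector_bytes // elem_bytes, 1)
    limit = min(by_alignment, contiguous_elems_per_thread, by_hardware)
    # Round down to a power of two, since vector loads come in widths 1, 2, 4.
    width = 1
    while width * 2 <= limit:
        width *= 2
    return width


# float32 elements with a (wrongly) claimed 16-byte alignment:
# the width is then driven entirely by how many contiguous elements
# each thread handles.
for contiguous in (1, 2, 4):
    print(contiguous, vector_width_elems(16, contiguous, elem_bytes=4))
```

In this model a wrong alignment claim only matters once the contiguity limit allows a width greater than one element.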

Check this test:
```
    @parametrize(
        "size",
        (
            128,
            1024,
            1024 * 1024,
        ),
    )
    def test_slice_view_dtype(self, size):
        offset = 1

        def f(x):
            return x[2:].view(dtype=torch.float32) + 1

        x = torch.randn((size + offset) * 2, dtype=torch.bfloat16, device=self.device)
        self.common(f, (x,), reference_in_float=False)
```
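The arithmetic behind the misalignment in this test can be sketched as follows. The helper name is hypothetical, and the sketch assumes that the base allocation itself is aligned and that the required alignment is 16 bytes.

```python
def view_is_aligned(storage_offset_elems, elem_bytes, alignment_bytes=16):
    """Return True if a view starting at the given element offset begins on
    an alignment boundary, assuming the base allocation is aligned."""
    return (storage_offset_elems * elem_bytes) % alignment_bytes == 0


BF16_BYTES = 2

# x[2:] skips two bfloat16 elements, so the view starts 4 bytes into the buffer.
print(2 * BF16_BYTES)                  # byte offset of the sliced view
print(view_is_aligned(2, BF16_BYTES))  # False: 4 is not a multiple of 16
print(view_is_aligned(0, BF16_BYTES))  # True: the unsliced tensor
```

Reinterpreting the slice as float32 does not move its start address, so the resulting tensor keeps the same 4-byte offset.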

Before the fix, Inductor would tell Triton that the output tensor of aten.view.dtype is aligned even though it's not. That tensor is then passed to the Triton kernel for the aten.add. Triton may make different vectorization decisions depending on the tensor size:
1. when size = 128, Triton picks ld.global.b32 to load data from global memory
2. when size = 1024, Triton uses ld.global.v2.b32
3. when size = 1024 * 1024, Triton uses ld.global.v4.b32
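As a rough illustration of why only some of these cases are at risk: a PTX vector load needs its address aligned to the total size of the access, which is 4, 8, and 16 bytes for the three instructions above. The sketch below checks each one against a base byte offset; the table and function names are hypothetical, and the base allocation is assumed to be aligned.

```python
# Total bytes accessed per thread by each load instruction.
ACCESS_BYTES = {
    "ld.global.b32": 4,
    "ld.global.v2.b32": 8,
    "ld.global.v4.b32": 16,
}


def misaligned_loads(base_byte_offset):
    """List the instructions whose alignment requirement the offset violates."""
    return [name for name, nbytes in ACCESS_BYTES.items()
            if base_byte_offset % nbytes != 0]


# The tensor in the test starts 4 bytes into its buffer.
print(misaligned_loads(4))
```

With a 4-byte offset the scalar load is still legal, which is consistent with the smallest size not hitting a problem.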

So whether wrong alignment metadata causes an issue depends on whether Triton picks the vectorized instructions. That in turn depends on the Triton config (block size) chosen by Inductor and on Triton's internal logic (how it assigns elements to each thread). We should make sure Inductor always generates correct metadata so that such hidden issues do not turn into crashes later.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151859
Approved by: https://github.com/jansel, https://github.com/eellison
ghstack dependencies: #151841

2025-04-23 01:50:49 +00:00

Shunting Zhang

a48ccf02f9

[Inductor] move alignment tests to a separate file (#151841)

This is pure code movement. test_torchinductor.py is already 15K lines of code. Move the alignment-related tests I added recently to a separate file. I need to add more tests of this kind.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151841
Approved by: https://github.com/jansel, https://github.com/eellison

2025-04-22 20:18:58 +00:00

2 Commits