Commit Graph

29 Commits

Author SHA1 Message Date
4dce5b71a0 [build] modernize build-frontend: python setup.py develop/install -> [uv ]pip install --no-build-isolation [-e ]. (#156027)
Modernize the development installation:

```bash
# python setup.py develop
python -m pip install --no-build-isolation -e .

# python setup.py install
python -m pip install --no-build-isolation .
```

Now, the `python setup.py develop` is a wrapper around `python -m pip install -e .` since `setuptools>=80.0`:

- pypa/setuptools#4955

`python setup.py install` is deprecated and will emit a warning during run. The warning will become an error on October 31, 2025.

- 9c4d383631/setuptools/command/install.py (L58-L67)

> ```python
> SetuptoolsDeprecationWarning.emit(
>     "setup.py install is deprecated.",
>     """
>     Please avoid running ``setup.py`` directly.
>     Instead, use pypa/build, pypa/installer or other
>     standards-based tools.
>     """,
>     see_url="https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html",
>     due_date=(2025, 10, 31),
> )
> ```

- pypa/setuptools#3849

Additional Resource:

- [Why you shouldn't invoke setup.py directly](https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/156027
Approved by: https://github.com/ezyang
2025-07-09 11:24:27 +00:00
cec2977ed2 [BE][6/16] fix typos in torch/ (#156316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156316
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315
2025-06-23 02:57:34 +00:00
3f44fdc03d Revert "[BE][6/16] fix typos in torch/ (#156316)"
This reverts commit b210cf1ea56bcd9f937a2805d9e70d8684d25ee4.

Reverted https://github.com/pytorch/pytorch/pull/156316 on behalf of https://github.com/atalman due to export/test_torchbind.py::TestCompileTorchbind::test_compile_error_on_input_aliasing_contents_backend_aot_eager [GH job link](https://github.com/pytorch/pytorch/actions/runs/15804799771/job/44548489912) [HUD commit link](c95f7fa874) ([comment](https://github.com/pytorch/pytorch/pull/156313#issuecomment-2994171213))
2025-06-22 12:31:57 +00:00
b210cf1ea5 [BE][6/16] fix typos in torch/ (#156316)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/156316
Approved by: https://github.com/albanD
ghstack dependencies: #156313, #156314, #156315
2025-06-22 08:43:33 +00:00
0afd335174 PEP585 update - torch/nn torch/optim torch/package torch/profiler torch/serialization torch/sparse torch/xpu (#145175)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145175
Approved by: https://github.com/bobrenjc93
2025-01-21 16:57:27 +00:00
5fd881a5b6 Revert "PEP585 update - torch/nn torch/optim torch/package torch/profiler torch/serialization torch/sparse torch/xpu (#145175)"
This reverts commit 54a00af2c6026a830f40d9e6a659ff81d51f9bc6.

Reverted https://github.com/pytorch/pytorch/pull/145175 on behalf of https://github.com/huydhn due to Sorry for reverting your change but it seems to break some trunk tests ([comment](https://github.com/pytorch/pytorch/pull/145175#issuecomment-2603418267))
2025-01-21 00:49:55 +00:00
54a00af2c6 PEP585 update - torch/nn torch/optim torch/package torch/profiler torch/serialization torch/sparse torch/xpu (#145175)
See #145101 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145175
Approved by: https://github.com/bobrenjc93
2025-01-20 22:32:59 +00:00
12e95aa4ee [BE]: Apply PERF401 autofixes from ruff (#140980)
* Automatically applies ruff rule 401. Turns loops into equivalent list comprehensions which are faster and do not leak the scope of the loop variables.
* list comprehensions not only often have better typing, but are 50+% faster than for loops on overhead. They also preserve length information etc and are better for the interpreter to optimize.
* Manually went back and made mypy happy after the change.
* Also fixed style lints in files covered by flake8 but not by pyfmt

Pull Request resolved: https://github.com/pytorch/pytorch/pull/140980
Approved by: https://github.com/justinchuby, https://github.com/malfet
2024-11-20 17:52:07 +00:00
8c840fb921 Add out_dtype kw argument to optimize_bsr_dense_addmm (#136626)
As in the title.

Addresses the task in https://github.com/pytorch/ao/pull/821#issuecomment-2373290266

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136626
Approved by: https://github.com/amjames, https://github.com/cpuhrsch
2024-10-22 09:52:25 +00:00
c0582fd0f8 Remove unused Python variables in torch/[b-z]* (#136963)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136963
Approved by: https://github.com/ezyang
2024-10-19 16:45:22 +00:00
8f2a4cc4b1 Tune bsr_dense_addmm for int8 inputs on A100 (#136088)
As in the title. The tuning is done for dimensions 1280 and 5120 that are used in Vit-H.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136088
Approved by: https://github.com/cpuhrsch
2024-09-25 00:24:12 +00:00
b76d1b79e6 Add scaling arguments to bsr_dense_addmm (#136104)
As in the title.

Tackles https://github.com/pytorch/ao/pull/821/files#r1759821413

The PR assumes that the existing tuning parameters are good also when using scaling arguments. This needs to be verified as a follow-up task.

Also, this PR redefines triton-contiguous tensors: the tensor must have strides not larger than 1. This will now allow zero strides that previously triggered `contiguous` call although the underlying memory buffer was contiguous.

Re: "a considerable slow-down occurs because tensor data is copied element-wise rather than chunk-wise" - this note should refer to a code (torch or triton?) that implements the element/chunk-wise copy so that we could verify that allowing zero strides indeed would not trigger element-wise copies. Atm, the performance increase in ViT-H benchmarks (that involve using 0 strides) is an evidence that allowing zero strides does not lead to slow-downs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/136104
Approved by: https://github.com/cpuhrsch
2024-09-16 20:26:54 +00:00
b346e99376 remove fast_flush arguments (#135387)
I've removed them from upstream Triton in https://github.com/triton-lang/triton/pull/4485. It looks like most places in the code use the default value of `fast_flush=True` anyway, though there are two PRs from @pearu that use `False`. To my knowledge, there's no reason to use the `False` value.

Differential Revision: [D62325778](https://our.internmc.facebook.com/intern/diff/D62325778)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135387
Approved by: https://github.com/nmacchioni, https://github.com/jansel
2024-09-13 08:13:46 +00:00
64cfcbd8a3 Tune _int_bsr_dense_addmm for int8 inputs on A100 (#134035)
As in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134035
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #133855
2024-08-22 06:43:11 +00:00
345578afb4 Add int8 support to bsr_dense_addmm and bsr_dense_mm Triton kernels (#133855)
As in the title. In addition, the PR introduces `_int_bsr_dense_addmm` that is equivalent to `bsr_dense_addmm` except for int8 inputs the operation result is int32 tensor (similar to existing `_int_mm`).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133855
Approved by: https://github.com/cpuhrsch
2024-08-21 20:44:40 +00:00
1471473b84 Add tests to bsr_dense_addmm_meta. Tune bsr_dense_addmm kernel for ViT shapes. (#132646)
As in the title.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132646
Approved by: https://github.com/cpuhrsch
2024-08-05 20:22:33 +00:00
973037be6a [BE][Easy] apply autofix for ruff rules unnecessary-collection-call (C408): list() / tuple() / dict() (#130199)
This PR changes the empty collection factory call to Python literals:

- `list()` -> `[]`
- `tuple()` -> `()`
- `dict()` -> `{}`

The Python literals are more performant and safer. For example, the bytecode for building an empty dictionary:

```bash
$ python3 -m dis - <<EOS
import collections

d1 = {}
d2 = dict()

dict = collections.OrderedDict
d3 = dict()
EOS
```

```text
  0           0 RESUME                   0

  1           2 LOAD_CONST               0 (0)
              4 LOAD_CONST               1 (None)
              6 IMPORT_NAME              0 (collections)
              8 STORE_NAME               0 (collections)

  3          10 BUILD_MAP                0
             12 STORE_NAME               1 (d1)

  4          14 PUSH_NULL
             16 LOAD_NAME                2 (dict)
             18 CALL                     0
             26 STORE_NAME               3 (d2)

  6          28 LOAD_NAME                0 (collections)
             30 LOAD_ATTR                8 (OrderedDict)
             50 STORE_NAME               2 (dict)

  7          52 PUSH_NULL
             54 LOAD_NAME                2 (dict)
             56 CALL                     0
             64 STORE_NAME               5 (d3)
             66 RETURN_CONST             1 (None)
```

The dict literal `{}` only has one bytecode `BUILD_MAP`, while the factory call `dict()` has three `PUSH_NULL + LOAD_NAME + CALL`. Also, the factory call is not safe if users override the `dict` name in `locals` or `globals` (see the example of replacing with `OrderedDict` above).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130199
Approved by: https://github.com/malfet
2024-07-11 17:30:28 +00:00
8db9dfa2d7 Flip default value for mypy disallow_untyped_defs [9/11] (#127846)
See #127836 for details.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127846
Approved by: https://github.com/ezyang
ghstack dependencies: #127842, #127843, #127844, #127845
2024-06-08 18:50:06 +00:00
49f0d127fb Fix a bug in retrieving approximate bsr_dense_addmm kernel meta data (#124371)
Fixes #124333

Pull Request resolved: https://github.com/pytorch/pytorch/pull/124371
Approved by: https://github.com/eqy, https://github.com/lezcano
2024-04-24 13:59:18 +00:00
a39e638707 Update bsr_dense_addmm kernel parameters for sizes 3 x 2 ^ N (#122506)
As in the title. The speed-ups for a particular set of input sizes range from about 7 to 85 % depending on the used BSR tensor block sizes.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122506
Approved by: https://github.com/cpuhrsch
2024-03-23 11:54:33 +00:00
e918461377 Add instructions for generating optimal Triton kernel parameters of bsr_dense_addmm (#115504)
As in the title.

In addition, enable verbose output when executing the torch/sparse/_triton_ops_meta.py script.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115504
Approved by: https://github.com/cpuhrsch
ghstack dependencies: #115499
2023-12-12 16:44:51 +00:00
32286512cc Add tune_bsr_dense_addmm as an API to find optimal triton kernel parameters for bsr_dense_addmm (#115499)
As in the title.

In addition:
- improve the algorithm for finding a minima of operation timings: break the inner loop early when a next minima candidate is found
- add tests and fix bugs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115499
Approved by: https://github.com/cpuhrsch
2023-12-12 16:44:51 +00:00
12085914b8 Replace bsr_dense_mm triton kernel with bsr_dense_addm triton kernel (#115030)
The `bsr_dense_addmm` triton kernel introduced in https://github.com/pytorch/pytorch/pull/114595 is a generalization of `bsr_dense_mm` triton kernel and a more efficient version of it because it uses an extra kernel parameter `SPLIT_N` that has notable effect to performance for r.h.s operand with a larger number of columns.

This PR eliminates the `bsr_dense_mm` triton kernel in favor of using `bsr_dense_addmm` triton kernel.

The performance increase of `bsr_dense_mm` is as follows (float16, `NVIDIA A100-SXM4-80GB`):
- with 16x16 blocks, the average/maximal speed up is 50/71 %
- with 32x32 blocks, the average/maximal speed up is 30/63 %
- with 64x64 blocks, the average/maximal speed up is 12/26 %
- with 128x128 blocks, the average/maximal speed up is 7/17 %

Pull Request resolved: https://github.com/pytorch/pytorch/pull/115030
Approved by: https://github.com/cpuhrsch
2023-12-05 22:29:24 +00:00
4ba37e1804 Add tests for bsr_dense_addmm and bsr_dense_mm triton kernels (#114800)
As in the title.

In addition,
- resolve https://github.com/pytorch/pytorch/pull/114757#discussion_r1409547917 re triton-contiguous inputs
- support non-contiguous inputs and outputs in triton kernels
- fix a couple of minor bugs

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114800
Approved by: https://github.com/cpuhrsch
2023-12-04 22:07:47 +00:00
69c4819f53 Add bsr_dense_addmm triton kernel (#114595)
As in the title.

The `bsr_dense_addmm` kernel implemented in this PR is a generalization of `bsr_dense_mm` in the following respects (in addition of having input, beta, and alpha parameters):
- it implements `SPLIT_N` kernel parameter that enables efficient kernel launches in the case of wide inputs. For instance, the timing of nn.linear with 256x256 BSR weights having 16x16 blocks and 256x131072 strided input reduced about 16x (this corresponds to the 94 % speed up value listed below).
- it supports rectangular blocks in sparse BSR tensor weights

The performance increase of nn.linear is as follows (float16, `NVIDIA A100-SXM4-80GB`):
- with 16x16 blocks, the average/maximal speed up is  55/94 %
- with 32x32 blocks, the average/maximal speed up is  33/63 %
- with 64x64 blocks, the average/maximal speed up is  23/42 %
- with 128x128 blocks, the average/maximal speed up is  15/39 %

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114595
Approved by: https://github.com/cpuhrsch
2023-11-29 05:29:25 +00:00
12f95df0e9 Eliminate unnecessary multiplications by 1 in addmm with sparse compressed tensor operand (#114026)
This PR:
- updates `torch/sparse/_triton_ops_meta.py` for the API change in `triton.testing.do_bench`
- force `num_stages` to be 1 when blocksize is 128x128 to avoid out of resources exception when `bsr_dense_mm` is called from `nn.linear`.
- as in the title. The performance of `nn.linear` on BSR tensor weights (dtypes `float16` and `bfloat16`) is increased as follows (`NVIDIA A100-SXM4-80GB`):
  - for blocksize 16x16, the average/maximum speed up is about 11/20 %
  - for blocksize 32x32, the average/maximum speed up is about 15/24 %
  - for blocksize 64x64, the average/maximum speed up is about 18/26 %
  - for blocksize 128x128, the average/maximum speed up is about 15/28 %

Pull Request resolved: https://github.com/pytorch/pytorch/pull/114026
Approved by: https://github.com/cpuhrsch
2023-11-19 12:13:54 +00:00
e1c872e009 Add optimal triton kernel parameters to bsr_dense_mm and scatter_mm for bfloat16 and float32 dtypes (#113553)
As in the title.

This PR is a follow-up to PR https://github.com/pytorch/pytorch/pull/112737 to address bfloat16 and float32 dtype cases. The performance increase is as follows (`NVIDIA A100-SXM4-80GB`):

- bsr_scatter_mm and bfloat16
  - for blocksize 16x16, the average/maximum speed up is about 29/75 %.
  - for blocksize 32x32, the average/maximum speed up is about 23/58 %.
  - for blocksize 64x64, the average/maximum speed up is about 27/66 %.
  - for blocksize 128x128, the average/maximum speed up is about 33/72 %.
- bsr_dense_mm and bfloat16
  - for blocksize 16x16, the average/maximum speed up is about 47/61 %.
  - for blocksize 32x32, the average/maximum speed up is about 29/43 %.
  - for blocksize 64x64, the average/maximum speed up is about 21/41 %.
  - for blocksize 128x128, the average/maximum speed up is about 12/29 %.
- bsr_dense_mm and  float32
  - for blocksize 16x16, the average/maximum speed up is about 35/49 %.
  - for blocksize 32x32, the average/maximum speed up is about 2/5 %.
  - for blocksize 64x64, the average/maximum speed up is about 2/21 %.
  - for blocksize 128x128, the average/maximum speed up is about 79/84 %.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113553
Approved by: https://github.com/cpuhrsch
2023-11-14 00:47:59 +00:00
8219bf051b [BE]: Apply RUF015 to torch folder (#113025)
Removes unnecessary allocations of iterators. There is a small chance this may have side effects as the entire iterator is no longer consumed, but this is a way more efficient method for retrieving the first element.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/113025
Approved by: https://github.com/ezyang, https://github.com/malfet
2023-11-07 00:48:15 +00:00
e64d250210 Add a tool for a semi-automatic optimization of bsr_dense_mm meta parameters. (#112737)
Finding optimal meta parameters for bsr_dense_mm and bsr_scatter_mm triton kernels is a tedious job. This PR introduces a tool (a Python script `torch/sparse/_triton_ops_meta.py`) that finds the optimal set of meta parameters for a given set of matrix multiplication inputs and their block sizes. Currently, such a set is found for square bsr tensor inputs with sizes 256...16384 and square blocksizes 16...128, and dense tensor inputs with sizes 256...131072.
As a result, bsr_dense_mm performance has increased as follows (`NVIDIA A100-SXM4-80GB`):
- for blocksize 16x16, the average/maximum speed up is about 40/60 %.
- for blocksize 32x32, the average/maximum speed up is about 28/45 %.
- for blocksize 64x64, the average/maximum speed up is about 26/43 %.
- for blocksize 128x128, the average/maximum speed up is about 12/28 %.

To enable the performance improvements through meta parameter optimization for other CUDA devices, one must execute the `_triton_ops_meta.py` which will calculate the optimal meta parameters and store the results in a dictionary object defined in `_triton_ops_meta.py`.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112737
Approved by: https://github.com/cpuhrsch
2023-11-05 12:52:09 +00:00