Summary:
Update the fbgemm pinned version in PyTorch.
Related update in fbgemm: D74434751
Included changes:
- Update fbgemm external dependencies directory in setup.py
- Add DISABLE_FBGEMM_AUTOVEC flag to disable fbgemm's autovec
Test Plan: PyTorch OSS CI
Differential Revision: D75073516
Pull Request resolved: https://github.com/pytorch/pytorch/pull/153950
Approved by: https://github.com/Skylion007, https://github.com/ngimel
Description:
1. Quantize Linear Layer Weights to 4 Bits:
Quantize the weights of the Linear layer to 4 bits, using symmetric quantization.
Pack two 4-bit weights into one uint8 container.
Choose a quantization scheme (channel-wise or group-wise), with the group size being a multiple of 32.
2. Prepare Quantized Weights, Scales, and Optional Bias:
After quantizing, obtain the quantized_weights, scales, and groupsize.
If the original Linear layer has a bias, prepare it as well.
3. Pack the Weights Efficiently:
Use torch.ops.aten._dyn_quant_pack_4bit_weight to optimally pack the weights, scales, and optional bias.
```python
packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(weight, scales_and_zeros, bias, groupsize, in_features, out_features)
```
Input parameters should include:
in_features and out_features (these must match the Linear layer's corresponding parameters).
4. Perform Dynamic Quantized Matrix Multiplication:
Use torch.ops.aten._dyn_quant_matmul_4bit to perform matrix multiplication with quantized weights.
```python
output = torch.ops.aten._dyn_quant_matmul_4bit(input, packed_weights, groupsize, in_features, out_features)
```
Inputs required include:
The input tensor, packed_weights, groupsize, and the same in_features and out_features used when packing.
API Usage: https://github.com/pytorch/pytorch/issues/143289
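Putting steps 1–4 together, a minimal end-to-end sketch is shown below. This is a hedged simplification, not the reference implementation: the `quantize_4bit_symmetric` helper is hypothetical, and the exact tensor layouts the packing op expects (e.g. for `scales_and_zeros`, which has no zero points under the symmetric scheme) may differ; see the API usage issue linked above for authoritative details.
```python
import torch

def quantize_4bit_symmetric(weight: torch.Tensor, groupsize: int = 32):
    # Hypothetical helper for steps 1-2: group-wise symmetric 4-bit quantization
    # of a Linear weight of shape (out_features, in_features).
    out_features, in_features = weight.shape
    assert groupsize % 32 == 0 and in_features % groupsize == 0
    w = weight.reshape(out_features, in_features // groupsize, groupsize)
    # Symmetric scheme: map the largest magnitude in each group to +/-7.
    scales = w.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(w / scales), -8, 7).to(torch.int8)
    # Shift to unsigned nibbles [0, 15] and pack two 4-bit values per uint8.
    q = (q + 8).to(torch.uint8).reshape(out_features, in_features)
    packed = (q[:, ::2] << 4) | q[:, 1::2]
    return packed, scales.reshape(out_features, in_features // groupsize)

def dyn_quant_linear(x, weight, bias=None, groupsize=32):
    # Steps 3-4, using the op signatures described above: pack once
    # (e.g. at load time), then run the dynamic-quant matmul.
    out_features, in_features = weight.shape
    q_weight, scales = quantize_4bit_symmetric(weight, groupsize)
    packed_weights = torch.ops.aten._dyn_quant_pack_4bit_weight(
        q_weight, scales, bias, groupsize, in_features, out_features)
    return torch.ops.aten._dyn_quant_matmul_4bit(
        x, packed_weights, groupsize, in_features, out_features)
```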
Model Perf:
7B Transformer model:
Prefill: 340 t/s
Decode: 40 t/s
2B Transformer model:
Prefill: 747 t/s
Decode: 80 t/s
Tests:
python test/test_linalg.py -k test__dyn_quant_pack_4bit_weight
Ran 1 test in 0.016s
OK
python test/test_linalg.py -k test__dyn_quant_matmul_4bit
Ran 8 tests in 0.077s
OK
python test/test_linalg.py -k test_compile_dyn_quant_matmul_4bit
Ran 8 tests in 11.454s
Change-Id: Ia1672bad5e6ec94e64d8bb1971395d60f4b3a452
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134124
Approved by: https://github.com/digantdesai, https://github.com/malfet
Summary:
Expose the nlohmann json library so that it can be used from inside PyTorch. The library already exists in the `third_party` directory. This PR makes the `nlohmann/json.hpp` header available to be used from `torch.distributed`.
The next PR makes actual use of this header.
imported-using-ghimport
Test Plan: Imported from OSS
Reviewed By: malfet
Differential Revision: D59035246
Pulled By: c-p-i-o
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129570
Approved by: https://github.com/d4l3k, https://github.com/malfet
- Update `WORKSPACE` to actually use Python-3.10, as the job name claims it does
- Get rid of the unneeded `future` and `six` dependencies (removed a long time ago)
- Update `requests`, `typing-extensions` and `setuptools` to the latest releases
- Mark `tools/build/bazel/requirements.txt` as a generated file
This also updates idna to 3.7, which contains a fix for [CVE-2024-3651](https://github.com/advisories/GHSA-jjg7-2v4v-x38h), though as we are not shipping a binary with it, it does not expose the CI system to any actual risk
TODOs:
- Add a periodic job that runs `pip-compile` to update these to the latest versions
- Unify the various requirements .txt files (i.e. the bazel requirements and requirements-ci should be one and the same)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124076
Approved by: https://github.com/seemethere, https://github.com/DanilBaibak
### 🤖 Generated by Copilot at 27084ed
This pull request simplifies and cleans up the code that uses the cuDNN library for convolution, batch normalization, CTC loss, and quantized operations. It removes the unnecessary checks and conditions for older cuDNN versions and the experimental cuDNN v8 API, and ~~replaces them with the stable `cudnn_frontend` API that requires cuDNN v8 or higher. It also adds the dependency and configuration for the `cudnn_frontend` library in the cmake and bazel files.~~ Correction: The v7 API will still be available with this PR, and can still be used, without any changes to the defaults. This change simply always _builds_ the v8 API, and removes the case where _only_ the v7 API is built.
This is a re-land of https://github.com/pytorch/pytorch/pull/91527
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95722
Approved by: https://github.com/malfet, https://github.com/atalman
- It was installed twice: once in the `/usr/local/cuda/lib64` folder and a second time in `/usr/lib64`
- And don't install cuDNN headers thrice, only in `/usr/local/cuda/include`
- Error on unknown CUDA version
- Modify bazel builds to look for cudnn in `/usr/local/cuda` folder
Pull Request resolved: https://github.com/pytorch/pytorch/pull/115872
Approved by: https://github.com/huydhn
This PR adds bazel python targets, so that the bazel build can be used from python, e.g. `import torch`.
Notable changes:
- Add the python targets.
- Add the version.py.tpl generation.
- In order to achieve `USE_GLOBAL_DEPS = False` just for the bazel build, employ a monkey-patch hack in the mentioned `version.py.tpl`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/101003
Approved by: https://github.com/huydhn
Fixes #79348
This change is mostly focused on enabling nvcc+sccache in the PyTorch CI.
Along the way we had to make a couple of tweaks:
1. Split rules_cc from the rules_cuda that embedded it before. This is needed in order to apply a different patch to rules_cc than the one rules_cuda applies by default. That, in turn, is needed because we need to work around an nvcc behavior where it doesn't forward `-iquote xxx` to the host compiler, but it does forward `-isystem xxx`. So we work around this problem by (ab)using `-isystem` instead. Without it we get errors like `xxx` is not found.
2. Work around a bazel bug https://github.com/bazelbuild/bazel/issues/10167 that prevents us from using a straightforward and honest `nvcc` sccache wrapper. Instead we generate an ad hoc, bazel-specific nvcc wrapper that has internal knowledge of the relative bazel paths to local_cuda. This allows us to work around the issue with CUDA symlinks. Without it we get `undeclared inclusion(s) in rule` errors all over the place for CUDA headers.
## Test plan
Green CI build https://github.com/pytorch/pytorch/actions/runs/4267147180/jobs/7428431740
Note that now it says "CUDA" in the sccache output
```
+ sccache --show-stats
Compile requests 9784
Compile requests executed 6726
Cache hits 6200
Cache hits (C/C++) 6131
Cache hits (CUDA) 69
Cache misses 519
Cache misses (C/C++) 201
Cache misses (CUDA) 318
Cache timeouts 0
Cache read errors 0
Forced recaches 0
Cache write errors 0
Compilation failures 0
Cache errors 7
Cache errors (C/C++) 7
Non-cacheable compilations 0
Non-cacheable calls 2893
Non-compilation calls 165
Unsupported compiler calls 0
Average cache write 0.116 s
Average cache read miss 23.722 s
Average cache read hit 0.057 s
Failed distributed compilations 0
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95528
Approved by: https://github.com/huydhn
An action item from https://github.com/pytorch/pytorch/issues/94346
Although the security practice of setting the checksum is good, it doesn't work when the archive is downloaded from sites like GitHub, because the checksum can change. Specifically, GitHub gives no guarantee to keep the same value forever: https://github.com/community/community/discussions/46034.
This also adds a new linter so that redundant SHA checksums for GitHub-hosted archives can be flagged and removed quickly. The WORKSPACE file is actually updated using the new linter:
```
>>> Lint for WORKSPACE:
Advice (BAZEL_LINTER) format
Redundant SHA checksum. Run `lintrunner -a` to apply this patch.
You can run `lintrunner -a` to apply this patch.
5 5 |
6 6 | http_archive(
7 7 | name = "rules_cuda",
7 |- sha256 = "f80438bee9906e9ecb1a8a4ae2365374ac1e8a283897281a2db2fb7fcf746333",
9 8 | strip_prefix = "runtime-b1c7cce21ba4661c17ac72421c6a0e2015e7bef3/third_party/rules_cuda",
10 9 | urls = ["b1c7cce21b.tar.gz"],
11 10 | )
--------------------------------------------------------------------------------
29 28 | name = "pybind11_bazel",
30 29 | strip_prefix = "pybind11_bazel-992381ced716ae12122360b0fbadbc3dda436dbf",
31 30 | urls = ["992381ced7.zip"],
31 |- sha256 = "3dc6435bd41c058453efe102995ef084d0a86b0176fd6a67a6b7100a2e9a940e",
33 31 | )
34 32 |
35 33 | new_local_repository(
--------------------------------------------------------------------------------
52 50 | urls = [
53 51 | "https://github.com/gflags/gflags/archive/v2.2.2.tar.gz",
54 52 | ],
54 |- sha256 = "34af2f15cf7367513b352bdcd2493ab14ce43692d2dcd9dfc499492966c64dcf",
56 53 | )
57 54 |
58 55 | new_local_repository(
```
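For illustration, a hedged sketch of the idea behind such a linter follows. This is not the actual BAZEL_LINTER implementation, just a simplified regex-based approximation of the check it performs.
```python
import re
import sys

def find_redundant_shas(workspace_text: str):
    # Flag sha256 attributes inside http_archive(...) blocks whose urls point at
    # github.com, since GitHub-generated archive checksums are not guaranteed stable.
    findings = []
    for match in re.finditer(r"http_archive\(.*?\n\)", workspace_text, re.DOTALL):
        block = match.group(0)
        if "github.com" in block and re.search(r'sha256\s*=\s*"[0-9a-f]{64}"', block):
            findings.append(block.splitlines()[1].strip())  # e.g. 'name = "rules_cuda",'
    return findings

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "WORKSPACE"
    with open(path) as f:
        for finding in find_redundant_shas(f.read()):
            print(f"Redundant SHA checksum in block starting with: {finding}")
```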
Pull Request resolved: https://github.com/pytorch/pytorch/pull/95039
Approved by: https://github.com/ZainRizvi
Fixes https://github.com/pytorch/pytorch/issues/77509
This PR supersedes https://github.com/pytorch/pytorch/pull/77510.
It allows both `bazel query //...` and `bazel build --config=gpu //...` to work.
Concretely the changes are:
1. Add "GenerateAten" mnemonic -- this is a convenience thing, so anybody who uses [Remote Execution](https://bazel.build/docs/remote-execution) can add a
```
build:rbe --strategy=GenerateAten=sandboxed,local
```
line to the `~/.bazelrc` and build this action locally (it doesn't have hermetic dependencies at the moment).
2. Replaced a few `http_archive` repos with the proper existing submodules to avoid code drift.
3. Updated `pybind11_bazel` and added `python_version="3"` to `python_configure` (see the WORKSPACE sketch after this list). This prevents hard-to-debug errors caused by an attempt to build with python2 on systems where it is the default python (Ubuntu 18.04, for example).
4. Added `unused_` repos; their purpose is to hide the unwanted submodules of submodules, which often have bazel targets in them.
5. Updated CI to build //... -- this is a great step forward to prevent regressions in targets not only in the top-level BUILD.bazel file, but in other folders too.
6. Switch default bazel build to use gpu support.
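To illustrate point 3, here is a sketch of the corresponding WORKSPACE configuration. The `load` path and the `local_config_python` repository name are assumptions for illustration; only the `python_version = "3"` attribute comes from the description above.
```
# WORKSPACE (Starlark) sketch: pin python_configure to Python 3 so the build
# never silently falls back to a system python2 interpreter.
# (repository name and load path are illustrative assumptions)
load("@pybind11_bazel//:python_configure.bzl", "python_configure")

python_configure(
    name = "local_config_python",
    python_version = "3",
)
```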
Pull Request resolved: https://github.com/pytorch/pytorch/pull/78870
Approved by: https://github.com/ezyang
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/71412
This is only in CMake and internal builds right now. Add to Bazel for
parity.
ghstack-source-id: 150235094
Test Plan: Built and ran locally. Rely on CI to verify.
Reviewed By: malfet
Differential Revision: D33635743
fbshipit-source-id: b9e5abbef5feabd52c53a9c2b95713b87ce81681
(cherry picked from commit 11700dbc80200093fdd74b1be066b4e740cee516)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70201
Included functions:
save_mobile_module -> saves a mobile::Module to flatbuffer
load_mobile_module_from_file -> loads a flatbuffer into mobile::Module
parse_mobile_module -> parses from bytes or deserialized flatbuffer module object
Compared to previous attempts, this diff only adds flatbuffer to the cmake target and leaves the fbcode/xplat ones unchanged.
Test Plan: unittest
Reviewed By: malfet, gmagogsfm
Differential Revision: D33239362
fbshipit-source-id: b9ca36b83d6af2d78cc50b9eb9e2a6fa7fce0763
Summary:
Fixes https://github.com/pytorch/pytorch/issues/35316
On master, the bazel cuda build is disabled due to the lack of a proper `cu_library` rule. This PR:
- Add `rules_cuda` to the WORKSPACE and forward `cu_library` to `rules_cuda`.
- Use simple local cuda and cudnn repositories (adopted from TRTorch) for cuda 11.3.
- Fix current broken cuda build.
- Enable cuda build in CI, not just for `:torch` target but all the test binaries to catch undefined symbols.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/66241
Reviewed By: ejguan
Differential Revision: D31544091
Pulled By: malfet
fbshipit-source-id: fd3c34d0e8f80fee06f015694a4c13a8e9e12206