1867 Commits

Author SHA1 Message Date
cyy
cc28634172 [Submodule] Bump pybind11 to v2.13.5 (#135202)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135202
Approved by: https://github.com/Skylion007
2024-09-06 00:09:00 +00:00
b99ef1a02e Update torch-xpu-ops pin (ATen XPU implementation) (#135185)
Release cycle for PyTorch 2.5
1. Update specific AOT targets for Windows. On Windows, the AOT target list prefers Intel client GPUs.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/135185
Approved by: https://github.com/EikanWang
2024-09-05 10:05:23 +00:00
679b8fe426 Update generate-xnnpack-wrappers.py parsing to handle build identifier (#134724)
Fixes an issue where parsing the XNNPACK CMakeLists breaks after updating XNNPACK. I've just ignored the generated build identifier for now, since it's not used, and we would need to update the buck build to generate it at build time.

Remove unused ukernels_xop XNNPACK target as it has no sources (after the recent update) and causes buck1 to complain.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134724
Approved by: https://github.com/mcr229
2024-09-04 08:45:46 +00:00
2443507acc Update torch-xpu-ops pin (ATen XPU implementation) (#134983)
Release cycle for PyTorch 2.5
1. Enable the Windows build in the latest torch-xpu-ops. Resolves the large binary issue.
2. Refine test infrastructure for compatibility on different HW platforms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134983
Approved by: https://github.com/EikanWang
2024-09-03 12:14:37 +00:00
39935e0fde Update cpuinfo submodule (#134891)
Last time it was done in June by https://github.com/pytorch/pytorch/pull/127505
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134891
Approved by: https://github.com/Skylion007
2024-09-03 09:29:59 +00:00
3b40b07efb Update PyTorch for XNNPACK 87ee0b4 (#134518)
Summary: Update XNNPACK library version.

Test Plan: Combined diff CI is clean: D61586079 (all changes, has to be split out for export).

Differential Revision: D61822610

Pull Request resolved: https://github.com/pytorch/pytorch/pull/134518
Approved by: https://github.com/mcr229
2024-08-28 19:24:04 +00:00
c9c84ae3ee [BE][Ez]: Update CUDNN_frontend submodule to 1.6.1 (#134007)
Update the cudnn_frontend submodule to 1.6.1 to pick up some minor bug fixes and compiler fixes.
# Bug fix
* Fixed an issue where custom dropout mask was not correctly applied.
* Added -fvisibility=hidden for the pip wheels generated to avoid symbol conflicts with other modules that use cudnn frontend.
* Fixed an issue in the sdpa operation that, when deserialized, led to numerical mismatches.
* Fixed an issue in sdpa fp8 fprop operation (in inference mode).
# Samples
* Added a new sample to showcase how a custom dropout mask can be applied to an sdpa operation.
* Added a sample to showcase convolutions on large (c * d * h * w > 2 **31) tensors.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134007
Approved by: https://github.com/eqy
2024-08-22 13:34:17 +00:00
b7baa062fc Update torch-xpu-ops pin (ATen XPU implementation) (#133850)
Bug fixes for PyTorch 2.5:
1. Use the SYCL group algorithm API instead of the old style for sub-group shift utilities.
2. Add preprocessing in the reduction kernel for cases requiring a data type cast.
3. Make group norm compatible with memory formats.
4. ZeroTensor: a. Remove unnecessary ATen operator registrations; otherwise ZeroTensor processing is bypassed. b. Align preprocessing with the in-tree implementation in aten::copy_.
5. Rebase checkIndexTensorTypes usage.
6. Align with the latest semantics of PyTorch foreach operators: return multiple tensors with offset=0.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133850
Approved by: https://github.com/EikanWang
2024-08-22 06:27:03 +00:00
2a73ba298c Upgrade submodule oneDNN to v3.5.3 (#131620)
This PR upgrades the oneDNN submodule to v3.5.3.

## Improvements

- [experimental] Introduced a [microkernel API](https://oneapi-src.github.io/oneDNN/ukernels.html) for Intel Architecture Processors. This API exposes internal mechanisms used in the matmul and convolution implementations to expert users.
- Improved performance of matmul primitive with sum post-op for batched cases on processors with Intel AMX instruction set support.
- Introduced fp64 matmul support. This functionality is currently implemented on Intel GPUs with hardware acceleration for fp64 math only.

## Validation results on CPU
No regression was found.

1. NLP models accuracy/inference/training

Model Name | Mode Name | Precision | OneDNN | Baseline | OneDNN/Baseline
-- | -- | -- | -- | -- | --
bert-large | realtime | bf16 | 192.498 | 189.664 | 1.014942214
bert-large | throughput | bf16 | 202.424 | 202.156 | 1.001325709
bert-large | train_phase2 | bf16 | 15.955 | 16.029 | 0.995383368
LCM | throughput | bf16 | 1.01983 | 1.06632 | 0.956401455
stable-diffusion | throughput | bf16 | 0.10313 | 0.10184 | 1.012666929
ViT | realtime | bf16 | 1086.48 | 928.43 | 1.17023362
ViT | throughput | bf16 | 1419.07 | 1393.81 | 1.018122987
yolov7 | realtime | bf16 | 413.468682 | 415.16503 | 0.995914039
yolov7 | throughput | bf16 | 369.697 | 366.789 | 1.007928264
bert-large | realtime | fp32 | 46.685 | 46.652 | 1.000707365
bert-large | throughput | fp32 | 47.766 | 48.007 | 0.994979899
bert-large | train_phase2 | fp32 | 7.101 | 7.104 | 0.999577703
LCM | throughput | fp32 | 0.5501 | 0.55023 | 0.999763735
stable-diffusion | throughput | fp32 | 0.04012 | 0.04002 | 1.002498751
ViT | realtime | fp32 | 337.27 | 335.19 | 1.006205436
ViT | throughput | fp32 | 346.52 | 350.08 | 0.989830896
yolov7 | realtime | fp32 | 107.138054 | 107.242747 | 0.999023775
yolov7 | throughput | fp32 | 103.383 | 104.301 | 0.99119855
bert-large | realtime | int8 | 283.541 | 289.569 | 0.979182855
LCM | throughput | int8 | 1.09864 | 1.08998 | 1.0079451
stable-diffusion | throughput | int8 | 0.10617 | 0.10604 | 1.001225952
ViT | realtime | int8 | 1562.11 | 1554.68 | 1.004779119
ViT | throughput | int8 | 1904.38 | 1903.39 | 1.000520125
yolov7 | realtime | int8 | 540.489493 | 539.902488 | 1.001087243
yolov7 | throughput | int8 | 499.999 | 500.757 | 0.998486292

Device | Dtype | Geomean (higher is better)
-- | -- | --
All | all | 101.17%
All | fp32 | 99.83%
All | bf16 | 102.24%
All | int8 | 99.91%
All | fp16 | 103.61%
SPR | all | 100.54%
SPR | fp32 | 99.82%
SPR | bf16 | 101.78%
SPR | int8 | 99.90%
GNR | all | 101.58%
GNR | fp32 | 99.85%
GNR | bf16 | 102.66%
GNR | int8 | 99.93%
GNR | fp16 | 103.61%

2. Torchbench cpu userbenchmark inference & training

Perf_Geomean | Ratio (oneDNN/baseline)
-- | --
eager_throughtput_bf16_infer | 1.00x
eager_throughtput_fp32_infer | 1.00x
jit_llga_throughtput_amp_bf16 | 1.00x
jit_llga_throughtput_fp32 | 1.00x
eager_throughtput_fx_int8 | 0.99x
eager_throughtput_bf16_train | 1.01x
eager_throughtput_fp32_train | 1.00x

3. Inductor quantization

Static quant:
Perf_Geomean | Ratio (oneDNN/baseline)
-- | --
PTQ | 1.00x
PTQ_CPP_WRAPPER | 1.00x
QAT | 1.00x

ACC_Geomean | Ratio (oneDNN/baseline)
-- | --
PTQ | 1.00x
PTQ_CPP_WRAPPER | 1.00x
QAT | 1.00x

Dynamic quant:

  | Ratio (oneDNN/baseline)
-- | --
Performance | 1.04x
Accuracy | 1.00x

4. Dynamo benchmarks
GEOMEAN summary
![image](https://github.com/user-attachments/assets/82fc4b76-50f6-4f06-9ba9-034b932f1158)

FP32 Static shape, default wrapper
![image](https://github.com/user-attachments/assets/9335268e-3e99-426b-91f8-f9df90a2007c)

FP32 Dynamic shape, default wrapper
![image](https://github.com/user-attachments/assets/e7cf3f4f-2a62-4b58-9461-5e5ba254d822)

AMP Static shape, default wrapper
![image](https://github.com/user-attachments/assets/12392c88-e44f-4c95-904a-4fa5fc9f34a2)

AMP Dynamic shape, default wrapper
![image](https://github.com/user-attachments/assets/13930b0d-9bb2-46de-9ecb-5d2585d5c2f6)

## Validation results on XPU
Category | Eager | Inductor
-- | -- | --
huggingface_amp_fp16_training | 1.002456 | 0.999998
huggingface_bfloat16_inference | 1.005386 | 1.003511
huggingface_float32_training | 1.002533 | 1.003098
torchbench_amp_fp16_training | 1.009065 | 1.01323
torchbench_bfloat16_inference | 1.003371 | 1.001534
torchbench_float32_training | 1.012102 | 1.011596
timm_models_amp_fp16_training | 1.005511 | 1.010329
timm_models_bfloat16_inference | 1.000935 | 1.000538
timm_models_float32_training | 0.991873 | 0.99721

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131620
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-08-21 23:40:02 +00:00
cyy
c3d02fa390 [Reland2] Update NVTX to NVTX3 (#109843)
Another attempt to update NVTX to NVTX3. This time we avoid changing the NVTX header inclusion of existing code. The advantage of NVTX3 over NVTX is that it is a header-only library, so linking with NVTX3 greatly simplifies our CMake and other build scripts for finding libraries in user environments. In addition, NVTX is indeed still present in the latest CUDA versions, but it is no longer a compiled library: it is now header-only, which is why there is no .lib file anymore.
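
A minimal sketch of the NVTX annotations PyTorch exposes from Python via `torch.cuda.nvtx`; this is unrelated to the build change itself, just the integration this submodule backs:
```
import torch
from torch.cuda import nvtx

x = torch.randn(1024, 1024, device="cuda")
nvtx.range_push("matmul_block")  # opens an NVTX range, visible in Nsight tools
y = x @ x
nvtx.mark("matmul_done")         # instantaneous marker
nvtx.range_pop()                 # closes the range
```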

Pull Request resolved: https://github.com/pytorch/pytorch/pull/109843
Approved by: https://github.com/peterbell10, https://github.com/eqy

Co-authored-by: Ivan Zaitsev <108101595+izaitsevfb@users.noreply.github.com>
2024-08-20 16:33:26 +00:00
3ac527ac5f [BE][Ez]: Update cudnn_frontend submodule to 1.6.0 (#133687)
Updates CUDNN_frontend header only library to make the most of the newest CUDNN features and decrease the overhead of the library.

Release notes copied from the upstream commit:
New API
- Graph Slice Operation: Introduced the graph.slice operation for slicing input tensors. Refer to docs/operations/Slice.md for detailed documentation and samples/cpp/misc/slice.cpp for a C++ sample. Pybinds for this operation have also been added.
- SM Carveout Feature: Added the set_sm_count(int32_t type) graph property to support the SM Carveout feature introduced in Ampere and Hopper GPUs. Engines that do not support SM_COUNT will return NOT_SUPPORTED.
Bug Fixes
- Convolution Mode Attribute: Added the missing set_convolution_mode attribute to convolution attributes in forward propagation (fprop), data gradient (dgrad), and weight gradient (wgrad). Previously, this was hardcoded to CUDNN_CROSS_CORRELATION in the 1.x API.
- SDPA FP8 Backward Node: Fixed an issue with the deserialization of the sdpa_fp8_backward node.
Enhancements
- Graph Execution Overhead: Reduced the overhead of graph.execute() by optimizing sub-node tree traversal, collected UIDs, workspace modifications, and workspace size.
- Graph Validation Performance: Significantly improved (~10x) the performance of graph.validate() by deferring graph expansion to a later stage (build_operation_graph).
- Optional Running Stats for BatchNorm: Made the running statistics for the batch normalization operation optional, supported by cuDNN backend version 9.3.0 and later.
- Shape and Stride Inferencing: Enhanced shape and stride inferencing to preserve the stride order of the input.
- Diagnostic Error Message: Added a diagnostic error message to create_execution_plans if called without the preceding build_operation_graph.
- JSON Schema and Deserialization: Improved the JSON schema and deserialization logic with additional checks.
- Logging Overhead: Reduced logging overhead, resulting in faster graph.build() calls.
- CMake Integration: Replaced CMAKE_SOURCE_DIR with PROJECT_SOURCE_DIR in CMake files for better integration. See the relevant pull request for more details.
Samples
- Jupyter Notebooks: Added Jupyter notebooks for RMSNorm, InstanceNorm, and LayerNorm. Refer to the samples/python folder for more information.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133687
Approved by: https://github.com/eqy, https://github.com/malfet
2024-08-16 20:27:23 +00:00
b833990a8f Revert "[CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade (#131493)"
This reverts commit 4aa66f68a803927ddd127ceaaa1521b8d6e90e5f.

Reverted https://github.com/pytorch/pytorch/pull/131493 on behalf of https://github.com/izaitsevfb due to breaks internal builds with identifier "std::numeric_limits< ::cutlass::half_t> ::infinity" is undefined in device code ([comment](https://github.com/pytorch/pytorch/pull/131493#issuecomment-2293939390))
2024-08-16 18:09:33 +00:00
4aa66f68a8 [CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade (#131493)
Unblocks/unbreaks against newer CUTLASS (3.5+)

CC @nWEIdia @xwang233 @ptrblck @thakkarV

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131493
Approved by: https://github.com/Skylion007
2024-08-15 18:33:22 +00:00
018e48c337 [Reland] Add wrappers for synchronous GPUDirect Storage APIs (#133489)
Reland #130633

USE_CUFILE is turned off by default in this version.
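
A rough sketch of exercising the synchronous wrappers from Python; the `torch.cuda.gds.GdsFile` name and method signatures below are assumptions based on how the API surfaced in later releases, and a build with USE_CUFILE=1 is required:
```
import os
import torch

# Assumed API shape; requires a CUDA build with USE_CUFILE=1.
t = torch.randn(1024, device="cuda")
f = torch.cuda.gds.GdsFile("/tmp/t.bin", os.O_CREAT | os.O_RDWR)
f.save_storage(t.untyped_storage())  # GPU -> file, bypassing a host bounce buffer
f.load_storage(t.untyped_storage())  # file -> GPU
```
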
Pull Request resolved: https://github.com/pytorch/pytorch/pull/133489
Approved by: https://github.com/albanD
2024-08-15 17:11:52 +00:00
d2ecdcb2f7 [Profiler] Add API for Dynamic Activity Toggling [2/n] (#133035)
Summary: During PT2 there are many GPU/CPU events that are unnecessary to profile in between a given step. To remedy this, we can add an API that takes in a list of activities and an arg for whether to toggle said activities. In this diff we add the profiler API that propagates down to Kineto (and, in the future, the collection.cpp logic). Subsequent diffs will add CPU toggling and e2e testing.
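
A sketch of what the toggling could look like from Python once the whole stack lands; the `toggle_collection_dynamic` name comes from later work in this series, so treat it as an assumption:
```
import torch
from torch.profiler import profile, ProfilerActivity

a = torch.randn(512, 512, device="cuda")
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    a @ a  # collected
    # Assumed API from later diffs in this series: drop CUDA events for a region.
    prof.toggle_collection_dynamic(False, [ProfilerActivity.CUDA])
    a @ a  # GPU activity here would be skipped
    prof.toggle_collection_dynamic(True, [ProfilerActivity.CUDA])
```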

Test Plan: Tested by toggling backward gpu traces off and got following trace: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Jul_31_13_40_55.3251726.pt.trace.json.gz&bucket=gpu_traces

Reviewed By: aaronenyeshi

Differential Revision: D60541767

Pull Request resolved: https://github.com/pytorch/pytorch/pull/133035
Approved by: https://github.com/aaronenyeshi
2024-08-09 21:54:54 +00:00
465e071898 Revert "[CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade (#131493)"
This reverts commit 927b4c11143e047eb6e3430e4c7c912064572f1b.

Reverted https://github.com/pytorch/pytorch/pull/131493 on behalf of https://github.com/nmacchioni due to breaking many tests ([comment](https://github.com/pytorch/pytorch/pull/131493#issuecomment-2277738114))
2024-08-09 11:30:23 +00:00
927b4c1114 [CUDA][CUTLASS][submodule] Fixes for CUTLASS upgrade (#131493)
Unblocks/unbreaks against newer CUTLASS (3.5+)

CC @nWEIdia @xwang233 @ptrblck @thakkarV

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131493
Approved by: https://github.com/Skylion007
2024-08-09 07:35:38 +00:00
cyy
05e8e87a69 [Submodule] Remove foxi (#132976)
It is not used after removal of Caffe2 code.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132976
Approved by: https://github.com/ezyang
2024-08-09 03:46:52 +00:00
26b0011fb8 [XPU][Kineto Submodule] Introduce kineto-based XPU profiler (#130811)
As XPU became a PyTorch built-in device, profiler support is an indispensable part of functionality completeness. This PR is associated with the PR that introduces the XPU profiler plugin into Kineto. When USE_XPU is enabled, the LIBKINETO_NOXPUPTI option is suppressed accordingly, which allows Kineto to build with the XPU profiler plugin.

Associated PR to introduce kineto-based XPU profiler into kineto:
https://github.com/pytorch/kineto/pull/961

Also updates the Kineto Submodule to include XPU changes.
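
With such a build, profiling an XPU workload is expected to follow the familiar profiler pattern; a sketch that assumes the `ProfilerActivity.XPU` enum accompanying this work and a CUDA-style sort key:
```
import torch
from torch.profiler import profile, ProfilerActivity

x = torch.randn(1024, 1024, device="xpu")
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.XPU]) as prof:
    x @ x
# The sort key mirrors the CUDA naming convention and is an assumption here.
print(prof.key_averages().table(sort_by="self_xpu_time_total", row_limit=5))
```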

Co-authored-by: Aaron Enye Shi <enye.shi@gmail.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130811
Approved by: https://github.com/aaronenyeshi
2024-08-07 18:41:37 +00:00
cyy
522fa03e91 [Submodule] Bump ONNX to v1.16.2 (#132566)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132566
Approved by: https://github.com/justinchuby
2024-08-04 07:01:54 +00:00
81b8d3586f Update torch-xpu-ops pin (ATen XPU implementation) (#132390)
Regular update.
1. 69 new ATen operators and variants are added. See https://github.com/intel/torch-xpu-ops/blob/main/yaml/xpu_functions.yaml.
2. Align with PyTorch in-tree to use safe data pointer access APIs.
3. Enable FP64 conversion emulation for some platforms.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132390
Approved by: https://github.com/EikanWang
2024-08-04 02:22:46 +00:00
bcac71517c [Profiler] Test Logging for Empty Traces (#132444)
Summary: Tests D60311331. Please see that diff for explanation

Test Plan: This diff is adding a test itself

Reviewed By: aaronenyeshi

Differential Revision: D60311555

Pull Request resolved: https://github.com/pytorch/pytorch/pull/132444
Approved by: https://github.com/aaronenyeshi
2024-08-02 22:04:15 +00:00
ca254d145f [BE][Ez]: Update fmtlib submodule to 11.0.2 (#132036)
Updates fmtlib to 11.0.2 which mainly includes minor bugfixes for edge cases such as move-only iterators and formatting on non-posix systems.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/132036
Approved by: https://github.com/malfet
2024-07-29 15:50:00 +00:00
e191b83462 Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633)"
This reverts commit 709ddf7a9dcfa1268848b72f6f56b55afa6728d6.

Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to still failing internally D60265673 ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2253239607))
2024-07-26 18:08:20 +00:00
dfba85c26b Update torch-xpu-ops pin (ATen XPU implementation) (#131643)
# Motivation
Regular update.
1. Support for some new ATen ops
2. ABI=0 build support
3. Remove the dispatched implementation of pin_memory & is_pinned
4. Enhance deterministic usage

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131643
Approved by: https://github.com/EikanWang
2024-07-26 05:51:58 +00:00
709ddf7a9d Add wrappers for synchronous GPUDirect Storage APIs (#130633)
Based in part on https://github.com/NVIDIA/apex/pull/1774

Differential Revision: [D60155434](https://our.internmc.facebook.com/intern/diff/D60155434)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-25 22:23:38 +00:00
89bdd9c18f [kineto] populate src/dst rank for p2p (#130812)
Summary:
As title: populate the src/dst rank (global rank) for p2p kernels.

Differential Revision: D59794535

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130812
Approved by: https://github.com/aaronenyeshi
2024-07-25 21:10:57 +00:00
b90aa18569 [aoti] Add initial custom op support (#127034)
Re-land of https://github.com/pytorch/pytorch/pull/125242

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127034
Approved by: https://github.com/malfet
2024-07-24 20:29:55 +00:00
e4b5645f83 Revert "Add wrappers for synchronous GPUDirect Storage APIs (#130633)"
This reverts commit 5b5e0698a5f560decb9bbdd150ed7b0622eb7777.

Reverted https://github.com/pytorch/pytorch/pull/130633 on behalf of https://github.com/clee2000 due to breaking a lot of jobs and build rules internally D60085885, possibly needs to update some bazel build? ([comment](https://github.com/pytorch/pytorch/pull/130633#issuecomment-2245806738))
2024-07-23 17:19:34 +00:00
5b5e0698a5 Add wrappers for synchronous GPUDirect Storage APIs (#130633)
Based in part on https://github.com/NVIDIA/apex/pull/1774

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130633
Approved by: https://github.com/albanD
2024-07-22 14:51:24 +00:00
25d8a0480b [lint] Remove unnecessary BUCKRESTRICTEDSYNTAX suppressions
Differential Revision: D59935630

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131187
2024-07-19 07:19:11 -07:00
b556d31586 Update torch-xpu-ops pin (ATen XPU implementation) (#131015)
Regular update.
1. 90 new ATen operators and their variants are supported for XPU.
2. Bug fixes: a. Fix out-of-bounds memory access in the index_put kernel. b. Fix a debug build error.
3. Binary change: split the device AOT code of SYCL kernels into multiple libraries to avoid linkage failure.
4. torch-xpu-ops test case enhancement: a. Hook PyTorch's testing op_db to align OpInfo configuration with CUDA. b. Hook _check_arg_device2 and freeze_rng_state to make XPU happy.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131015
Approved by: https://github.com/EikanWang
2024-07-19 02:18:55 +00:00
83eedf66b9 Update libfmt submodule to 11.0.1 (#130628)
Update libfmt to 11.0.1; reopen of https://github.com/pytorch/pytorch/pull/129962. Requires a Kineto update; fmt::join moved into a separate include, so that include was added where necessary.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130628
Approved by: https://github.com/aaronenyeshi
2024-07-16 06:12:11 +00:00
ac28ae18dc [BE][Ez]: Update pybind11 submodule to v2.13.1 (#129827)
Updates pybind11 submodule to v2.13.1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/129827
Approved by: https://github.com/XuehaiPan, https://github.com/atalman, https://github.com/albanD
2024-07-15 08:58:56 +00:00
cf090e222e Update torch-xpu-ops pin (ATen XPU implementation) (#130333)
1. Fix a compilation error due to a PyTorch update: the prototype of the helper function `checkIndexTensorTypes` changed.
2. Fix a compilation error due to a PyTorch update: PyTorch forced -Werror=unused-function.
3. Fix an Inductor case failure due to a CUDA-biased implementation in the test case. https://github.com/pytorch/pytorch/issues/130426

Pull Request resolved: https://github.com/pytorch/pytorch/pull/130333
Approved by: https://github.com/EikanWang, https://github.com/atalman
2024-07-10 18:10:53 +00:00
a3ce9eddd6 [BE][Easy] apply autofix for ruff rule unnecessary-literal-set (C405) and unnecessary-map (C417) (#130198)
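
For reference, the kind of rewrites these two rules perform (illustrative snippets, not the actual diffed code):
```
# C405 (unnecessary-literal-set): prefer a set literal over set() on a list literal.
colors = set(["red", "green"])   # before
colors = {"red", "green"}        # after

# C417 (unnecessary-map): prefer a comprehension over map() with a lambda.
squares = list(map(lambda x: x * x, range(5)))  # before
squares = [x * x for x in range(5)]             # after
```
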
Pull Request resolved: https://github.com/pytorch/pytorch/pull/130198
Approved by: https://github.com/Skylion007
2024-07-07 00:58:22 +00:00
e98587c58d Update torch-xpu-ops pin (ATen XPU implementation) (#129353)
188 new ATen operators/variants are added in this pin update, covering eager and torch.compile usage on HuggingFace, TIMM, and TorchBench models. 16 new unit tests were ported to enhance functionality coverage. The source file directory structure was aligned with ATen native. Fixed corner-case failures in aten::resize, aten::index_add, and aten::index_put.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129353
Approved by: https://github.com/EikanWang
2024-07-04 07:36:17 +00:00
a21d4363d2 [Profiler] Remove all instances of TMP_USE_TSC_AS_TIMESTAMP (#129973)
Summary: Now that D56584521 is in, we can remove all instances of TMP_USE_TSC_AS_TIMESTAMP.

Test Plan:
Ran resnet. Trace looks good
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Jun_27_14_46_01.1967733.pt.trace.json.gz&bucket=gpu_traces

Reviewed By: aaronenyeshi, swolchok

Differential Revision: D59132793

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129973
Approved by: https://github.com/aaronenyeshi
2024-07-03 19:28:52 +00:00
6cb0ad3375 [BE]: Update NCCL submodule to 2.21.5 (#124014)
Update NCCL to the latest version. This release is mostly bugfixes with a few new minor features.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124014
Approved by: https://github.com/eqy, https://github.com/ezyang, https://github.com/nWEIdia, https://github.com/malfet, https://github.com/atalman
2024-07-02 14:39:33 +00:00
1d0efedc85 [Profiler] Add TSC Clock Callback to CUPTI (#125036)
Summary:
Right now we use the default clock for CUPTI, which is neither monotonic nor particularly fast. We have already added the Kineto side of the implementation here: https://www.internalfb.com/diff/D56525885

This diff only adds the compile flags so that the TSC format is used, and sets the converter using a libkineto call in the profiler.

Test Plan:
Obtained following trace using resnet test:
https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/dynocli/devvm2185.cco0.facebook.com/rank-0.Apr_25_11_03_18.3862943.pt.trace.json.gz&bucket=gpu_traces

TBD: Add benchmarks

Differential Revision: D56584521

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125036
Approved by: https://github.com/aaronenyeshi
2024-06-27 21:07:43 +00:00
e19042481b [cuDNN][cuDNN Frontend] Bump cuDNN FE submodule to 1.5.2 (#129592)
Some relevant fixes include stride-0 support 👀

CC @drisspg @Skylion007 @vedaanta

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129592
Approved by: https://github.com/Skylion007
2024-06-27 04:01:23 +00:00
64f1111d38 Expose nlohmann json to torch (#129570)
Summary:

Expose the nlohmann json library so that it can be used from inside PyTorch. The library already exists in the `third_party` directory. This PR makes the `nlohmann/json.hpp` header available to be used from `torch.distributed`.
The next PR makes actual use of this header.

imported-using-ghimport

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D59035246

Pulled By: c-p-i-o

Pull Request resolved: https://github.com/pytorch/pytorch/pull/129570
Approved by: https://github.com/d4l3k, https://github.com/malfet
2024-06-26 21:59:26 +00:00
d52684e9a8 [BE]: Update CUDNN_frontend submodule to v1.5.1 (#128612)
Updates submodule to cudnn_frontend v1.5.1
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128612
Approved by: https://github.com/eqy, https://github.com/ezyang
2024-06-21 18:17:35 +00:00
9dd8f8cf8b [cpuinfo][submodule] bump cpuinfo to the latest to support amx isa check (#127505)
Fix https://github.com/pytorch/pytorch/issues/127368

Pull Request resolved: https://github.com/pytorch/pytorch/pull/127505
Approved by: https://github.com/ezyang
2024-06-21 00:17:44 +00:00
fcf2a1378b Enable fp8 rowwise scaling kernel on cuda, TAKE 2: #125204 (#128989)
# Summary
The first PR got reverted and needed a redo.

This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met:
- `x`'s scale should be a 1-dimensional tensor of length `M`.
- `y`'s scale should be a 1-dimensional tensor of length `N`.

It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row".
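
For illustration, a hedged sketch of driving this through the private `torch._scaled_mm` entry point. The exact signature and the expected scale shapes have shifted across releases (later versions want `(M, 1)`/`(1, N)` scales), so treat every detail here as an assumption:
```
import torch

M, K, N = 16, 32, 8
x = torch.randn(M, K, device="cuda").to(torch.float8_e4m3fn)
# TN format: the second operand must be K-major (column-major).
y = torch.randn(N, K, device="cuda").to(torch.float8_e4m3fn).t()
scale_x = torch.rand(M, device="cuda", dtype=torch.float32)  # one scale per row of x
scale_y = torch.rand(N, device="cuda", dtype=torch.float32)  # one scale per column of y
out = torch._scaled_mm(x, y, scale_a=scale_x, scale_b=scale_y, out_dtype=torch.bfloat16)
```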

The following two PRs were required to enable local builds:
- [PR #126185](https://github.com/pytorch/pytorch/pull/126185)
- [PR #125523](https://github.com/pytorch/pytorch/pull/125523)

### Todo
We still do not build our Python wheels with this architecture.

@ptrblck @malfet, should we replace `sm_90` with `sm_90a`?

The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit:
https://github.com/pytorch/pytorch/pull/125204/files#r1586986954

#### ifdef

I tried to use `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate the building of the kernel. I was having a hell of a time with this, so I am not really sure of the right way to do it.

Kernel Credit:
@jwfromm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128989
Approved by: https://github.com/yangsiyu007, https://github.com/vkuzo
2024-06-19 04:49:39 +00:00
6767e38267 Fix manual licensing (#128630)
It has come to my attention that some of our licenses are incorrect, so I attempted to rectify a few of them based on given recommendations for:
clog - BSD-3
eigen - MPL-2.0
ffnvcodec - LGPL-2.1
-> **hungarian - Permissive (free to use)**
irrlicht - The Irrlicht Engine License (zlib/libpng)
-> **pdcurses - Public Domain for core**
-> **sigslot - Public Domain**
test - BSD-3
Vulkan - Apache-2.0 or MIT
fb-only: more context is here https://fb.workplace.com/groups/osssupport/posts/26333256012962998/?comment_id=26333622989592967

This PR addresses the manual licensing mismatches mentioned above (the two bolded; one is being addressed in #128085). Everything else is generated by pulling from other files, so I did not address those entries; it is unclear what needs to be updated for the remaining ones to be accurate, or whether they are inaccurate today.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128630
Approved by: https://github.com/malfet
2024-06-14 00:12:09 +00:00
de9a072ac4 Updating the sigslot license to Public Domain (#128085)
It seems that Sigslot's license is Public Domain, not Apache 2. https://sigslot.sourceforge.net

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128085
Approved by: https://github.com/janeyx99
2024-06-13 18:13:54 +00:00
136bdb96cb Update Kineto submodule with fix to test_basic_chrome_trace (#128333)
Summary: We've updated the sort_index in Kineto chrome traces to support device IDs for up to 16 devices. This should make chrome trace rows be ordered the same way as for CUDA. We need to update the unit test as well.
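
Outside the internal runner, the same kind of trace comes from the standard export path; row ordering in the resulting JSON is what sort_index controls:
```
import torch
from torch.profiler import profile, ProfilerActivity

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    torch.randn(128, 128, device="cuda").sum()
prof.export_chrome_trace("trace.json")  # rows appear ordered as in CUDA traces
```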

Test Plan:
Ran locally the changing test:
```
$ buck2 test 'fbcode//mode/opt' fbcode//caffe2/test:test_profiler_cuda -- --exact 'caffe2/test:test_profiler_cuda - test_basic_chrome_trace (profiler.test_profiler.TestProfiler)'
File changed: fbcode//caffe2/third_party/kineto.submodule.txt
Buck UI: https://www.internalfb.com/buck2/f4fd1e9a-99f1-4422-aeed-b54903c64146
Test UI: https://www.internalfb.com/intern/testinfra/testrun/16888498639845776
Network: Up: 5.4KiB  Down: 8.6KiB  (reSessionID-0329120e-7fa2-4bc0-b539-7e58058f8fce)
Jobs completed: 6. Time elapsed: 1:01.2s.
Tests finished: Pass 2. Fail 0. Fatal 0. Skip 0. Build failure 0
```

Differential Revision: D58362964

Pulled By: aaronenyeshi

Pull Request resolved: https://github.com/pytorch/pytorch/pull/128333
Approved by: https://github.com/Skylion007
2024-06-10 18:12:34 +00:00
a5b86a1ec0 Revert "FP8 rowwise scaling (#125204)"
This reverts commit 5dc912822913b3d90f4938891c7eca722a057cf1.

Reverted https://github.com/pytorch/pytorch/pull/125204 on behalf of https://github.com/atalman due to Sorry, need to revert this; it is failing on internal CI. I suggest reimporting this and trying to land it internally, resolving all issues ([comment](https://github.com/pytorch/pytorch/pull/125204#issuecomment-2152905513))
2024-06-06 16:12:34 +00:00
5dc9128229 FP8 rowwise scaling (#125204)
# Summary
This pull request introduces an fp8 row-scaling kernel as an optional implementation for `scaled_mm`. The kernel selection is based on the scaling tensors of the inputs. For inputs `x` and `y` of shape `[M, K]` and `[K, N]` respectively, the following conditions must be met:
- `x`'s scale should be a 1-dimensional tensor of length `M`.
- `y`'s scale should be a 1-dimensional tensor of length `N`.

It's important to note that this kernel is not called "rowwise, columnwise" scaling because, although the scales for `y` are semantically along its columns, this implementation only supports the TN format. This means the scaling is along the faster-moving dimension, or the "row".

The following two PRs were required to enable local builds:
- [PR #126185](https://github.com/pytorch/pytorch/pull/126185)
- [PR #125523](https://github.com/pytorch/pytorch/pull/125523)

### Todo
We still do not build our Python wheels with this architecture.

@ptrblck @malfet, should we replace `sm_90` with `sm_90a`?

The NVRTC TMA shadowing feels wrong, but I am not sure of the right way to spoof the symbol for this compilation unit:
https://github.com/pytorch/pytorch/pull/125204/files#r1586986954

#### ifdef

I tried to use `#if !defined(USE_ROCM) && defined(CUDA_VERSION) && CUDA_VERSION >= 12000 && defined(__CUDA_ARCH__) && __CUDA_ARCH__ > 900` to gate the building of the kernel. I was having a hell of a time with this, so I am not really sure of the right way to do it.

Kernel Credit:
@jwfromm

Pull Request resolved: https://github.com/pytorch/pytorch/pull/125204
Approved by: https://github.com/lw, https://github.com/malfet
2024-06-05 15:46:40 +00:00