35 Commits

Author SHA1 Message Date
f2f25a5444 Upgrade submodule oneDNN to v3.7.1 (#148293)
This PR upgrades the oneDNN submodule to v3.7.1.

## Improvements

- Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Improved performance of int8 and fp32 forward convolution primitive on processors with Intel AVX2 instruction set support.
- Improved performance of fp8 matmul primitives with bf16 and fp16 bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Introduced initial optimizations for Intel GPUs based on Xe3 architecture.
- Added bfloat16 support for SDPA and implemented fp16 and bf16 GEMM kernels in SDPA (see the sketch after this list).
- Fixed f16 matmul accuracy, an issue where SDPA could not be dispatched to the ukernel, bf16/fp16/fp32 convolution performance, an int8 kernel page fault, a deconvolution precision issue with complex128 and fp64, and a GEMM correctness issue with float16.
- Improved bf16 matmul performance with fp32 destination with Arm Compute Library (ACL).
- Improved bf16 to fp32 reorder performance.
- Improved bf16 reorder performance.
- Improved bf16 convolution with ACL.
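
As a quick smoke test of the bf16 SDPA path from ATen, here is a minimal C++ sketch; the shapes are illustrative assumptions, and whether the oneDNN kernels are actually picked depends on the build and device:

```cpp
#include <ATen/ATen.h>

int main() {
  // Illustrative shapes: batch=1, heads=8, seq_len=128, head_dim=64.
  auto q = at::randn({1, 8, 128, 64}, at::kBFloat16);
  auto k = at::randn({1, 8, 128, 64}, at::kBFloat16);
  auto v = at::randn({1, 8, 128, 64}, at::kBFloat16);

  // With oneDNN v3.7.1, bf16 SDPA can be served by the new fp16/bf16
  // GEMM kernels when the oneDNN backend is selected for this op.
  auto out = at::scaled_dot_product_attention(q, k, v);
  return out.defined() ? 0 : 1;
}
```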

Fixes https://github.com/pytorch/pytorch/issues/136348.

## Validation results on CPU

1. NLP models accuracy/inference/training
![image](https://github.com/user-attachments/assets/859279b8-1631-4268-b226-7de9ac5870d8)

![image](https://github.com/user-attachments/assets/30ec7151-41ca-482a-9d2d-0c4850e75bab)

2. Torchbench cpu userbenchmark inference & training

![image](https://github.com/user-attachments/assets/71c9807c-caf9-4385-9990-d2ab637031cd)

3. Inductor quantization

![image](https://github.com/user-attachments/assets/3d2a3bd3-82fa-4566-8050-7ea5d6b61675)

4. Dynamo benchmarks
![image](https://github.com/user-attachments/assets/554ecce3-c85c-4a0e-88f1-2e73983c5dcd)
![image](https://github.com/user-attachments/assets/148c88f8-4367-4428-bb54-ce8a4deefd1b)
![image](https://github.com/user-attachments/assets/f2e744f4-d710-4699-acf4-1f130ecfadf1)
![image](https://github.com/user-attachments/assets/97128b80-4d0e-495a-aeda-dde3e70c96fd)
![image](https://github.com/user-attachments/assets/a9afce37-684c-45c0-b938-6dd7e0383805)
![image](https://github.com/user-attachments/assets/b8714236-9681-4fbe-8d98-be93deedab88)
![image](https://github.com/user-attachments/assets/4423061f-d133-45ba-98bd-d2f739e50431)
![image](https://github.com/user-attachments/assets/7955da10-3d23-493e-99fa-658f7f40035b)

## Validation results on XPU
Accuracy is the same as the baseline. Performance is shown below.
![image](https://github.com/user-attachments/assets/7645304d-5b1d-43f9-b840-9f846ed380a0)

## Validation results on ARM
![image](https://github.com/user-attachments/assets/080f7c02-0238-436f-ad20-5a9e3f6aafbb)
![image](https://github.com/user-attachments/assets/443742aa-ca61-41de-ae80-5d4c65cd0c87)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148293
Approved by: https://github.com/mingfeima, https://github.com/atalman
2025-03-04 13:56:45 +00:00
e72b4c61bf Revert "Upgrade submodule oneDNN to v3.7 (#147498)"
This reverts commit 576ed1e400d069ec2fff6162f82a71ff0bd81f7c.

Reverted https://github.com/pytorch/pytorch/pull/147498 on behalf of https://github.com/wdvr due to failing some tests on trunk - see below ([comment](https://github.com/pytorch/pytorch/pull/147498#issuecomment-2679867286))
2025-02-24 22:57:39 +00:00
576ed1e400 Upgrade submodule oneDNN to v3.7 (#147498)
This PR upgrades the oneDNN submodule to v3.7.

## Improvements

- Improved performance of convolution and matmul primitives on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Improved performance of int8 and fp32 forward convolution primitive on processors with Intel AVX2 instruction set support.
- Improved performance of fp8 matmul primitives with bf16 and fp16 bias data type on Intel Xeon processors with Intel AMX instruction set support (formerly Sapphire Rapids and Granite Rapids).
- Introduced initial optimizations for Intel GPUs based on Xe3 architecture.
- Added bfloat16 support for SDPA and implemented fp16 and bf16 GEMM kernels in SDPA.
- Fixed f16 matmul accuracy, an issue where SDPA could not be dispatched to the ukernel, bf16/fp16/fp32 convolution performance, an int8 kernel page fault, a deconvolution precision issue with complex128 and fp64, and a GEMM correctness issue with float16.
- Improved bf16 matmul performance with fp32 destination with Arm Compute Library (ACL).
- Improved bf16 to fp32 reorder performance.
- Improved bf16 reorder performance.
- Improved bf16 convolution with ACL.

Fixes https://github.com/pytorch/pytorch/issues/136348.

## Validation results on CPU

1. NLP models accuracy/inference/training
![image](https://github.com/user-attachments/assets/859279b8-1631-4268-b226-7de9ac5870d8)

![image](https://github.com/user-attachments/assets/30ec7151-41ca-482a-9d2d-0c4850e75bab)

2. Torchbench cpu userbenchmark inference & training

![image](https://github.com/user-attachments/assets/71c9807c-caf9-4385-9990-d2ab637031cd)

3. Inductor quantization

![image](https://github.com/user-attachments/assets/3d2a3bd3-82fa-4566-8050-7ea5d6b61675)

4. Dynamo benchmarks
![image](https://github.com/user-attachments/assets/554ecce3-c85c-4a0e-88f1-2e73983c5dcd)
![image](https://github.com/user-attachments/assets/148c88f8-4367-4428-bb54-ce8a4deefd1b)
![image](https://github.com/user-attachments/assets/f2e744f4-d710-4699-acf4-1f130ecfadf1)
![image](https://github.com/user-attachments/assets/97128b80-4d0e-495a-aeda-dde3e70c96fd)
![image](https://github.com/user-attachments/assets/a9afce37-684c-45c0-b938-6dd7e0383805)
![image](https://github.com/user-attachments/assets/b8714236-9681-4fbe-8d98-be93deedab88)
![image](https://github.com/user-attachments/assets/4423061f-d133-45ba-98bd-d2f739e50431)
![image](https://github.com/user-attachments/assets/7955da10-3d23-493e-99fa-658f7f40035b)

## Validation results on XPU
Accuracy is the same as the baseline. Performance is shown below.
![image](https://github.com/user-attachments/assets/7645304d-5b1d-43f9-b840-9f846ed380a0)

## Validation results on ARM
![image](https://github.com/user-attachments/assets/080f7c02-0238-436f-ad20-5a9e3f6aafbb)
![image](https://github.com/user-attachments/assets/443742aa-ca61-41de-ae80-5d4c65cd0c87)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/147498
Approved by: https://github.com/fadara01, https://github.com/mingfeima, https://github.com/atalman
2025-02-24 14:32:51 +00:00
f7c0c06692 Add oneDNN BRGEMM support on CPU (#131878)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/131878
Approved by: https://github.com/jgong5, https://github.com/peterbell10
2024-09-07 13:22:30 +00:00
2a73ba298c Upgrade submodule oneDNN to v3.5.3 (#131620)
This PR upgrades the oneDNN submodule to v3.5.3.

## Improvements

- [experimental] Introduced [microkernel API](https://oneapi-src.github.io/oneDNN/ukernels.html) for Intel Architecture Processors. This API exposes internal mechanisms used in matmul and convolution implementation to expert users.
- Improved performance of matmul primitive with sum post-op for batched cases on processors with Intel AMX instruction set support.
- Introduced fp64 matmul support. This functionality is currently implemented on Intel GPUs with hardware acceleration for fp64 math only.
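
For reference, the batched matmul-with-sum-post-op pattern called out above looks roughly like this in the oneDNN C++ API; a minimal sketch with assumed fp32 shapes, error handling omitted:

```cpp
#include <oneapi/dnnl/dnnl.hpp>

int main() {
  using namespace dnnl;
  using dt = memory::data_type;
  using tag = memory::format_tag;

  engine eng(engine::kind::cpu, 0);
  stream s(eng);

  // Assumed batched shapes: C[MB,M,N] += A[MB,M,K] * B[MB,K,N].
  const memory::dim MB = 4, M = 32, K = 64, N = 32;
  auto a_md = memory::desc({MB, M, K}, dt::f32, tag::abc);
  auto b_md = memory::desc({MB, K, N}, dt::f32, tag::abc);
  auto c_md = memory::desc({MB, M, N}, dt::f32, tag::abc);

  // The sum post-op accumulates into the existing destination values.
  post_ops po;
  po.append_sum(1.0f);
  primitive_attr attr;
  attr.set_post_ops(po);

  auto pd = matmul::primitive_desc(eng, a_md, b_md, c_md, attr);
  auto mm = matmul(pd);

  memory a_mem(a_md, eng), b_mem(b_md, eng), c_mem(c_md, eng);
  mm.execute(s, {{DNNL_ARG_SRC, a_mem},
                 {DNNL_ARG_WEIGHTS, b_mem},
                 {DNNL_ARG_DST, c_mem}});
  s.wait();
  return 0;
}
```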

## Validation results on CPU
No regression was found.

1. NLP models accuracy/inference/training

Model Name | Mode Name | Precision | OneDNN | Baseline | OneDNN/Baseline
-- | -- | -- | -- | -- | --
bert-large | realtime | bf16 | 192.498 | 189.664 | 1.014942214
bert-large | throughput | bf16 | 202.424 | 202.156 | 1.001325709
bert-large | train_phase2 | bf16 | 15.955 | 16.029 | 0.995383368
LCM | throughput | bf16 | 1.01983 | 1.06632 | 0.956401455
stable-diffusion | throughput | bf16 | 0.10313 | 0.10184 | 1.012666929
ViT | realtime | bf16 | 1086.48 | 928.43 | 1.17023362
ViT | throughput | bf16 | 1419.07 | 1393.81 | 1.018122987
yolov7 | realtime | bf16 | 413.468682 | 415.16503 | 0.995914039
yolov7 | throughput | bf16 | 369.697 | 366.789 | 1.007928264
bert-large | realtime | fp32 | 46.685 | 46.652 | 1.000707365
bert-large | throughput | fp32 | 47.766 | 48.007 | 0.994979899
bert-large | train_phase2 | fp32 | 7.101 | 7.104 | 0.999577703
LCM | throughput | fp32 | 0.5501 | 0.55023 | 0.999763735
stable-diffusion | throughput | fp32 | 0.04012 | 0.04002 | 1.002498751
ViT | realtime | fp32 | 337.27 | 335.19 | 1.006205436
ViT | throughput | fp32 | 346.52 | 350.08 | 0.989830896
yolov7 | realtime | fp32 | 107.138054 | 107.242747 | 0.999023775
yolov7 | throughput | fp32 | 103.383 | 104.301 | 0.99119855
bert-large | realtime | int8 | 283.541 | 289.569 | 0.979182855
LCM | throughput | int8 | 1.09864 | 1.08998 | 1.0079451
stable-diffusion | throughput | int8 | 0.10617 | 0.10604 | 1.001225952
ViT | realtime | int8 | 1562.11 | 1554.68 | 1.004779119
ViT | throughput | int8 | 1904.38 | 1903.39 | 1.000520125
yolov7 | realtime | int8 | 540.489493 | 539.902488 | 1.001087243
yolov7 | throughput | int8 | 499.999 | 500.757 | 0.998486292

Device | Dtype | Geomean (higher is better)
-- | -- | --
All | all | 101.17%
All | fp32 | 99.83%
All | bf16 | 102.24%
All | int8 | 99.91%
All | fp16 | 103.61%
SPR | all | 100.54%
SPR | fp32 | 99.82%
SPR | bf16 | 101.78%
SPR | int8 | 99.90%
GNR | all | 101.58%
GNR | fp32 | 99.85%
GNR | bf16 | 102.66%
GNR | int8 | 99.93%
GNR | fp16 | 103.61%

2. Torchbench cpu userbenchmark inference & training

Perf_Geomean | Ratio (oneDNN/baseline)
-- | --
eager_throughtput_bf16_infer | 1.00x
eager_throughtput_fp32_infer | 1.00x
jit_llga_throughtput_amp_bf16 | 1.00x
jit_llga_throughtput_fp32 | 1.00x
eager_throughtput_fx_int8 | 0.99x
eager_throughtput_bf16_train | 1.01x
eager_throughtput_fp32_train | 1.00x

3. Inductor quantization

Static quant:
Perf_Geomean | Ratio (oneDNN/baseline)
-- | --
PTQ | 1.00x
PTQ_CPP_WRAPPER | 1.00x
QAT | 1.00x

ACC_Geomean | Ratio (oneDNN/baseline)
-- | --
PTQ | 1.00x
PTQ_CPP_WRAPPER | 1.00x
QAT | 1.00x

Dynamic quant:

Metric | Ratio (oneDNN/baseline)
-- | --
Performance | 1.04x
Accuracy | 1.00x

4. Dynamo benchmarks
GEOMEAN summary
![image](https://github.com/user-attachments/assets/82fc4b76-50f6-4f06-9ba9-034b932f1158)

FP32 Static shape, default wrapper
![image](https://github.com/user-attachments/assets/9335268e-3e99-426b-91f8-f9df90a2007c)

FP32 Dynamic shape, default wrapper
![image](https://github.com/user-attachments/assets/e7cf3f4f-2a62-4b58-9461-5e5ba254d822)

AMP Static shape, default wrapper
![image](https://github.com/user-attachments/assets/12392c88-e44f-4c95-904a-4fa5fc9f34a2)

AMP Dynamic shape, default wrapper
![image](https://github.com/user-attachments/assets/13930b0d-9bb2-46de-9ecb-5d2585d5c2f6)

## Validation results on XPU
Category | Eager | Inductor
-- | -- | --
huggingface_amp_fp16_training | 1.002456 | 0.999998
huggingface_bfloat16_inference | 1.005386 | 1.003511
huggingface_float32_training | 1.002533 | 1.003098
torchbench_amp_fp16_training | 1.009065 | 1.01323
torchbench_bfloat16_inference | 1.003371 | 1.001534
torchbench_float32_training | 1.012102 | 1.011596
timm_models_amp_fp16_training | 1.005511 | 1.010329
timm_models_bfloat16_inference | 1.000935 | 1.000538
timm_models_float32_training | 0.991873 | 0.99721

Pull Request resolved: https://github.com/pytorch/pytorch/pull/131620
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-08-21 23:40:02 +00:00
cyy
d44daebdbc [Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2024-05-31 01:20:45 +00:00
67739d8c6f Revert "[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)"
This reverts commit 699db7988d84d163ebb6919f78885e4630182a7a.

Reverted https://github.com/pytorch/pytorch/pull/127051 on behalf of https://github.com/PaliC due to This PR needs to be synced using the import button as there is a bug in our diff train ([comment](https://github.com/pytorch/pytorch/pull/127051#issuecomment-2138496995))
2024-05-30 01:16:57 +00:00
cyy
699db7988d [Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2024-05-29 11:58:03 +00:00
cdbb2c9acc Revert "[Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)"
This reverts commit 4fdbaa794f9d5af2f171f772a51cb710c51c925f.

Reverted https://github.com/pytorch/pytorch/pull/127051 on behalf of https://github.com/PaliC due to This PR needs to be synced using the import button as there is a bug in our diff train ([comment](https://github.com/pytorch/pytorch/pull/127051#issuecomment-2136428735))
2024-05-29 03:02:35 +00:00
cyy
4fdbaa794f [Submodule] Remove deprecated USE_TBB option and TBB submodule (#127051)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127051
Approved by: https://github.com/cpuhrsch, https://github.com/malfet
2024-05-27 03:54:03 +00:00
c2f8c75129 [Reopen] Upgrade submodule oneDNN to v3.4.2 (#126137)
Reopen of https://github.com/pytorch/pytorch/pull/122472

## Improvements
This upgrade fixes the following issues:
- https://github.com/pytorch/pytorch/issues/120982

This upgrade brings the following new features:
- Introduced memory descriptor serialization API. This API is needed to support freezing on CPU in AOTInductor (https://github.com/pytorch/pytorch/issues/114450)
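
A rough sketch of how a blob-based serialization API of this kind can be used; the `get_blob()` method and the blob constructor below follow the oneDNN v3.4 release notes as I read them, and should be treated as assumptions rather than confirmed PyTorch usage:

```cpp
#include <oneapi/dnnl/dnnl.hpp>

#include <cstdint>
#include <vector>

int main() {
  using namespace dnnl;

  // A plain fp32 NCHW descriptor; the shape is illustrative.
  memory::desc md({8, 3, 224, 224}, memory::data_type::f32,
                  memory::format_tag::nchw);

  // Serialize the descriptor into an opaque byte blob, e.g. so an
  // AOTInductor artifact can embed it for CPU freezing...
  std::vector<uint8_t> blob = md.get_blob();

  // ...then reconstruct an equivalent descriptor from the blob later.
  memory::desc restored(blob);
  return (md == restored) ? 0 : 1;
}
```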

## Validation results on CPU
Original results with oneDNN v3.4.1 are here: https://github.com/pytorch/pytorch/pull/122472#issue-2201602846

Need to rerun validation and update results.

Co-authored-by: Sunita Nadampalli <nadampal@amazon.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126137
Approved by: https://github.com/jgong5, https://github.com/snadampal, https://github.com/atalman
2024-05-16 12:00:16 +00:00
ee0c47349c Revert "Upgrade submodule oneDNN to v3.4 (#122472)"
This reverts commit dbcf123105a3f11d02f04067ca0cb377ed09e88c.

Reverted https://github.com/pytorch/pytorch/pull/122472 on behalf of https://github.com/atalman due to broke aarch64 builds and tests ([comment](https://github.com/pytorch/pytorch/pull/122472#issuecomment-2096750000))
2024-05-06 19:28:20 +00:00
dbcf123105 Upgrade submodule oneDNN to v3.4 (#122472)
## Improvements
This upgrade fixes the following issues:
- https://github.com/pytorch/pytorch/issues/120982

This upgrade brings the following new features:
- Introduced memory descriptor serialization API. This API is needed to support freezing on CPU in AOTInductor (https://github.com/pytorch/pytorch/issues/114450)

## Validation results on CPU
No regression was found.

1. NLP models accuracy/inference/training

Model Name | Mode | Precision | New | Baseline | New/Baseline
-- | -- | -- | -- | -- | --
bert-large | accuracy | fp32 | 93.15325 | 93.15325 | 100.00%
bert-large | accuracy | bf16 | 93.20125 | 93.20125 | 100.00%
bert-large | accuracy | int8 | 92.66641 | 92.66641 | 100.00%
LCM | accuracy | fp32 | 44.11152 | 44.11154 | 100.00%
LCM | accuracy | bf16 | 43.57667 | 43.65096 | 100.17%
ViT | accuracy | fp32 | 0.8033 | 0.8033 | 100.00%
ViT | accuracy | bf16 | 0.8031 | 0.8031 | 100.00%
ViT | accuracy | int8 | 0.7985 | 0.7985 | 100.00%
yolov7 | accuracy | fp32 | 0.512 | 0.512 | 100.00%
yolov7 | accuracy | bf16 | 0.504 | 0.504 | 100.00%
yolov7 | accuracy | int8 | 0.507 | 0.507 | 100.00%
bert-large | realtime | fp32 | 37.433 | 39.136 | 95.65%
bert-large | realtime | bf16 | 166.592 | 160.134 | 104.03%
bert-large | realtime | int8 | 230.876 | 222.594 | 103.72%
ViT | realtime | fp32 | 288.19 | 282.05 | 102.18%
ViT | realtime | bf16 | 755.42 | 741.1 | 101.93%
ViT | realtime | int8 | 1060.94 | 1092.47 | 97.11%
yolov7 | realtime | fp32 | 17.06927 | 16.47995 | 103.58%
yolov7 | realtime | bf16 | 54.68561 | 54.00723 | 101.26%
yolov7 | realtime | int8 | 78.38271 | 77.63214 | 100.97%
bert-large | throughput | fp32 | 47.142 | 47.341 | 99.58%
bert-large | throughput | bf16 | 200.365 | 200.806 | 99.78%
bert-large | throughput | int8 | 144.999 | 145.295 | 99.80%
LCM | throughput | fp32 | 0.54913 | 0.54897 | 100.03%
LCM | throughput | bf16 | 1.062417 | 1.07772 | 98.58%
stable-diffusion | throughput | fp32 | 0.03301 | 0.0331 | 99.73%
stable-diffusion | throughput | bf16 | 0.08773 | 0.08849 | 99.14%
stable-diffusion | throughput | int8 | 0.0491 | 0.05024 | 97.73%
ViT | throughput | fp32 | 342.55 | 346.47 | 98.87%
ViT | throughput | bf16 | 1263.4 | 1268.32 | 99.61%
ViT | throughput | int8 | 1331.3 | 1345.32 | 98.96%
yolov7 | throughput | fp32 | 115.313 | 115.612 | 99.74%
yolov7 | throughput | bf16 | 323.364 | 323.747 | 99.88%
yolov7 | throughput | int8 | 388.137 | 384.236 | 101.02%
bert-large | train_phase1 | fp32 | 34.223 | 34.309 | 99.75%
bert-large | train_phase1 | bf16 | 90.372 | 88.453 | 102.17%
bert-large | train_phase2 | fp32 | 7.307 | 7.318 | 99.85%

Data Type | Geomean
-- | --
fp32 | 99.88%
bf16 | 100.70%
int8 | 99.88%
all | 100.16%

2. Torchbench cpu userbenchmark inference & training

Test suite | Geomean Ratio (New/baseline)
-- | --
eager_throughtput_bf16_infer | 1.00x
eager_throughtput_fp32_infer | 1.00x
jit_llga_throughtput_amp_bf16 | 0.99x
jit_llga_throughtput_fp32 | 1.01x
eager_throughtput_fx_int8 | 1.00x
eager_throughtput_bf16_train | 1.00x
eager_throughtput_fp32_train | 1.00x

3. Inductor quantization (static & dynamic) accuracy & performance

Config | Performance geomean ratio (New/baseline) | Accuracy ratio (New/baseline)
-- | -- | --
Static quant PTQ | 0.99x | 1.00x
Static quant PTQ_CPP_WRAPPER | 0.98x | 1.00x
Static quant QAT | 0.99x | 1.00x
Dynamic quant PTQ | 1.00x | 1.00x

4. Dynamo benchmarks

Precision | Shape | Wrapper | Thread | Eager old/new GEOMEAN | Inductor old/new GEOMEAN
-- | -- | -- | -- | -- | --
Float32 | Static | Default | Multiple | 0.998776 | 1.002091
Float32 | Static | Default | Single | 1.014086 | 1.01054
Float32 | Dynamic | Default | Multiple | 1.00386 | 1.005975
Float32 | Dynamic | Default | Single | 1.011036 | 1.008317
AMP | Static | Default | Multiple | 0.996965 | 1.005117
AMP | Static | Default | Single | 1.00092 | 0.995666
AMP | Dynamic | Default | Multiple | 0.9959 | 0.995048
AMP | Dynamic | Default | Single | 1.002569 | 0.994085

---

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122472
Approved by: https://github.com/jgong5, https://github.com/EikanWang, https://github.com/atalman
2024-05-01 20:59:17 +00:00
481c9bb1fc Upgrade submodule oneDNN to v3.3.6 (#122164)
As the title says, this upgrades submodule oneDNN to v3.3.6, including the following aarch64 issue fixes:
- https://github.com/oneapi-src/oneDNN/pull/1831
- https://github.com/oneapi-src/oneDNN/pull/1834

---

## Validation results
(on Intel CPU + Linux)
**Static quantization with Inductor on CV models**

Quant method | Geomean throughput ratio (v3.3.6/baseline)
-- | --
ptq | 0.982937
ptq (cpp wrapper) | 0.978384
qat | 0.978828

**Torchbench cpu userbenchmark with Inductor**

Items | Perf Geomean Ratio (v3.3.6/baseline)
-- | --
eager_throughtput_bf16_infer | 1.00x
eager_throughtput_fp32_infer | 1.00x
jit_llga_throughtput_amp_bf16 | 1.01x
jit_llga_throughtput_fp32 | 1.00x
eager_throughtput_fx_int8 | 1.00x
eager_throughtput_bf16_train | 1.46x
eager_throughtput_fp32_train | 1.41x

**Dynamo benchmarks tests**
Precision | Shape | Wrapper | Thread | Eager old/new GEOMEAN | Inductor old/new GEOMEAN
-- | -- | -- | -- | -- | --
Float32 | Static | Default | Multiple | 1.003836812 | 1.003425
Float32 | Static | Default | Single | 1.000181451 | 0.999611
Float32 | Dynamic | Default | Multiple | 1.003980183 | 1.006563
Float32 | Dynamic | Default | Single | 1.000076939 | 0.999969
AMP | Static | Default | Multiple | 0.996824772 | 0.998715
AMP | Static | Default | Single | 0.996402574 | 1.001483
AMP | Dynamic | Default | Multiple | 0.994919866 | 1.000467
AMP | Dynamic | Default | Single | 0.9962054 | 1.000767

(on Aarch64)
https://github.com/pytorch/pytorch/pull/122164#issuecomment-2007912919

---

Pull Request resolved: https://github.com/pytorch/pytorch/pull/122164
Approved by: https://github.com/snadampal, https://github.com/malfet, https://github.com/atalman
2024-03-28 21:36:27 +00:00
daf89b4101 Update oneDNN submodule to v3.3.2 (#112700)
Update oneDNN submodule to v3.3.2.
Add a macro to check the version of `third_party/ideep`.
Since we have versioning now, the changes won't break any pipeline even if `third_party/ideep` is not updated at the same time.
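
oneDNN already ships compile-time version macros (`DNNL_VERSION_MAJOR` and friends); a version gate for ideep in the same spirit might look like the sketch below, where the `IDEEP_VERSION_*` and `IDEEP_PREREQ` names are hypothetical stand-ins rather than the exact macro this PR adds:

```cpp
// Hypothetical sketch; the real macro and version values live in
// third_party/ideep, and the names here are illustrative only.
#define IDEEP_VERSION_MAJOR 3   // would normally come from ideep's headers
#define IDEEP_VERSION_MINOR 3

#define IDEEP_PREREQ(major, minor)                                   \
  ((IDEEP_VERSION_MAJOR > (major)) ||                                \
   (IDEEP_VERSION_MAJOR == (major) && IDEEP_VERSION_MINOR >= (minor)))

#if IDEEP_PREREQ(3, 3)
// Use the ideep API that matches oneDNN v3.3.x.
#else
// Fall back to the older code path, so an out-of-sync submodule still builds.
#endif
```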

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112700
Approved by: https://github.com/leslie-fang-intel, https://github.com/atalman
2023-12-05 17:51:55 +00:00
62df4f3428 Revert "Update oneDNN submodule to v3.3.2 (#112700)"
This reverts commit afbaa0c1650cf15100fb5dc579ceeba24fb8665a.

Reverted https://github.com/pytorch/pytorch/pull/112700 on behalf of https://github.com/atalman due to Diff broke internal tests ([comment](https://github.com/pytorch/pytorch/pull/112700#issuecomment-1839350284))
2023-12-04 19:41:12 +00:00
afbaa0c165 Update oneDNN submodule to v3.3.2 (#112700)
Update oneDNN submodule to v3.3.2.
Add a macro to check the version of `third_party/ideep`.
Since we have versioning now, the changes won't break any pipeline even if `third_party/ideep` is not updated at the same time.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/112700
Approved by: https://github.com/leslie-fang-intel, https://github.com/atalman
2023-12-01 18:40:07 +00:00
97a291f6bd [ONEDNN][BC-breaking] update onednn from v2.7.3 to v3.1.1 (#97957)
**Summary**
Update onednn from v2.7.3 to v3.1.1.
It is BC-breaking because some APIs changed on the oneDNN side. Changes include:
- PyTorch code where oneDNN is directly called
- Submodule `third_party/ideep` to adapt to oneDNN's new API.
- CMake files to fix build issues.

**Test plan**
Building issues and correctness are covered by CI checks.
For performance, we have run TorchBench models to ensure there is no regression. Below is the comparison before and after the oneDNN update.
![image](https://github.com/pytorch/pytorch/assets/12522207/415a4ff0-7566-40c6-aed0-24997a475b0e)

Note:
- Base commit of PyTorch: da322ea
- CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz (Ice Lake)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/97957
Approved by: https://github.com/jgong5, https://github.com/jerryzh168
2023-08-25 12:13:18 +00:00
85edb58179 Fix oneDNN double checkout issue and Upgrade oneDNN to v2.7.3 (#92239)
### Description

This PR fixes the oneDNN double-checkout issue mentioned in https://github.com/pytorch/pytorch/pull/87061#issuecomment-1284384276 and upgrades oneDNN to v2.7.3 to fix #92138.

### Performance test

TorchBench tests were run on ICX with 40 cores.
Intel OpenMP & jemalloc were preloaded.
![image](https://user-images.githubusercontent.com/61222868/212634378-b91c20b5-0e85-474f-861c-c1d2f6962de1.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/92239
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-01-17 01:54:21 +00:00
39306c1dfb Use @pytorch// in bazel build files (#89660)
This change aims to make the bazel build more embedding-friendly.
Namely, when PyTorch is included as an external repo in another project, it is usually included like this
```
        native.local_repository(
            name = "pytorch",
            path = ...,
            repo_mapping = repo_mapping,
        )
```
Or
```
        http_archive(
            name = "pytorch",
            urls = ...
            repo_mapping = repo_mapping,
        )
```
In this case, references to `@//` would resolve to the top-level WORKSPACE that includes PyTorch.
That makes upgrades harder because we need to carry around this patch.
Note that under some edge-case circumstances even `//` resolves to the top-level `WORKSPACE`.

This change makes the embedding of the bazel build easier without compromising anything for the main repo, since the `@pytorch//` still refers to the same thing.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/89660
Approved by: https://github.com/kit1980
2022-12-22 05:14:55 +00:00
dc40b6d043 Upgrade oneDNN to v2.7.2 (#90051)
This PR upgrades oneDNN to v2.7.2.

### oneDNN v2.7.1 & 2.7.2 changes:
- Fixes #89104
- Updated ITT API version to 3.23.0

### Performance Benchmark
TorchBench tests were run on ICX with 40 cores.
Intel OpenMP & tcmalloc were preloaded.
![image](https://user-images.githubusercontent.com/61222868/205240855-04e2d50f-8b3a-4097-9038-fdd0c0fc93b9.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/90051
Approved by: https://github.com/XiaobingSuper, https://github.com/jgong5
2022-12-08 09:41:02 +00:00
c56be31d2e Upgrade oneDNN to v2.7 (#87061)
This PR upgrades oneDNN to v2.7.

### oneDNN v2.7 changes:

**Performance Optimizations**
- Improved performance for future Intel Xeon Scalable processors (code name Sapphire Rapids).
- Introduced performance optimizations for [bf16 floating point math mode](http://oneapi-src.github.io/oneDNN/group_dnnl_api_mathmode.html) on Intel Xeon Scalable processors (code name Sapphire Rapids). The bf16 math mode allows oneDNN to use bf16 arithmetic and Intel AMX instructions in computations on fp32 data.
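
The bf16 math mode can be requested globally or scoped to a single primitive via its attributes; a minimal sketch (illustrative, not code from this PR):

```cpp
#include <oneapi/dnnl/dnnl.hpp>

int main() {
  // Allow implicit fp32 -> bf16 down-conversion process-wide, letting
  // compute-heavy primitives use bf16 arithmetic / Intel AMX on fp32 data.
  dnnl::set_default_fpmath_mode(dnnl::fpmath_mode::bf16);

  // Or scope the relaxation to one primitive through its attributes.
  dnnl::primitive_attr attr;
  attr.set_fpmath_mode(dnnl::fpmath_mode::bf16);
  return 0;
}
```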

Please go to https://github.com/oneapi-src/oneDNN/releases/tag/v2.7 for more detailed changes.

### oneDNN v2.6.1 & 2.6.2 changes:

**Functionality**

- Updated ITT API to 3.22.5
- Fixed correctness issue in fp32 convolution implementation for cases with large spatial size (https://github.com/pytorch/pytorch/issues/84488)

### Performance Benchmark
TorchBench tests were run on ICX with 40 cores.
Intel OpenMP & tcmalloc were preloaded.
![image](https://user-images.githubusercontent.com/61222868/196121957-656faebc-9f4a-49f0-9ef0-0784416c3a47.png)

Pull Request resolved: https://github.com/pytorch/pytorch/pull/87061
Approved by: https://github.com/jgong5, https://github.com/XiaobingSuper, https://github.com/weiwangmeta
2022-10-18 19:07:58 +00:00
58d773ad29 Upgrade oneDNN to v2.6.0 (#75398)
Summary:
This PR upgrades oneDNN to v2.6.0.

v2.6.0 changes:

- Improved performance for future Intel Xeon® Scalable processors (code name Sapphire Rapids). The functionality requires Linux kernel 5.16 or later.
- Improved performance of matmul primitive for processors with Intel AVX-512 support.
- Introduced bfloat16 destination support for int8 convolution, matmul and inner product primitives for processors with Intel AVX-512 support and for future Intel Xeon® Scalable processors (code name Sapphire Rapids).
- Extended RNN primitive with support for [AUGRU cell](https://oneapi-src.github.io/oneDNN/dev_guide_rnn.html#augru).
- Added support for non-zero negative slope in ReLU post-op for batch normalization primitive.
- Introduced support for mixed source and destination data types in softmax primitive.
- Introduced [persistent cache API](https://oneapi-src.github.io/oneDNN/dev_guide_persistent_cache.html). This functionality makes it possible to serialize and reuse JIT kernels.
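
A rough sketch of the persistent-cache flow; the `get_cache_blob()` method and the blob-taking constructor follow the oneDNN documentation as I recall it, so treat the exact names as assumptions:

```cpp
#include <oneapi/dnnl/dnnl.hpp>

#include <cstdint>
#include <vector>

// Build a primitive once and extract its compiled-kernel blob, which the
// caller can persist to disk...
std::vector<uint8_t> save_kernel(
    const dnnl::convolution_forward::primitive_desc &pd) {
  dnnl::convolution_forward conv(pd);
  return conv.get_cache_blob();
}

// ...and later rebuild the primitive from that blob, skipping JIT generation.
dnnl::convolution_forward load_kernel(
    const dnnl::convolution_forward::primitive_desc &pd,
    const std::vector<uint8_t> &blob) {
  return dnnl::convolution_forward(pd, blob);
}
```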

Pull Request resolved: https://github.com/pytorch/pytorch/pull/75398

Reviewed By: dagitses, bigfootjon

Differential Revision: D35976827

Pulled By: frank-wei

fbshipit-source-id: a77ae15cd77fc7c114ab9722453c28dc64aac679
(cherry picked from commit e376698d3c772aaa2dfbe51a4d1a75c8d17d0eee)
2022-05-02 22:07:42 +00:00
4567d5ded4 Upgrade oneDNN to v2.5.2 (#71546)
Summary:
This PR upgrades oneDNN to v2.5.2 and includes the build changes needed for it.

v2.4 changes:
- Improved performance for future Intel Xeon Scalable processors (code name Sapphire Rapids). The functionality is disabled by default and should be enabled via CPU dispatcher control.
- Improved binary primitive performance for cases when one of the tensors is broadcasted.
- Improved performance of reduction primitive, reorder, shuffle primitives.
- Improved performance of depthwise convolution forward propagation for processors with Intel AVX-512 support.
- Improved performance of forward inner product primitive for shapes with minibatch equal to 1 for processors with Intel AVX-512 support.
- Improved performance of int8 matmul and inner product primitives for processors with Intel AVX2 and Intel DL Boost support.

v2.5 changes:
- Improved performance for future Intel Xeon Scalable processors (code name Sapphire Rapids). The functionality is now enabled by default and requires Linux kernel 5.16.
- Improved performance of matmul primitive for processors with Intel AVX-512 support.

v2.5.2 changes:
- Fixed performance regression in binary primitive with broadcast.
- Fixed segmentation fault in depthwise convolution primitive for shapes with huge spatial size for processors with Intel AVX-512 support.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/71546

Reviewed By: george-qi

Differential Revision: D33827108

Pulled By: VitalyFedyunin

fbshipit-source-id: 8f5a19b331c82af5b0783f081e061e1034a93952
(cherry picked from commit 9705212fe9b7b0838cc010d040c37d1175be83ce)
2022-02-01 18:34:58 +00:00
847dbb8684 CMake: Clean up unused definitions (#69216)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/69216

This cleans up 4 pre-processor defines not used by any code:
- HAVE_GCC_GET_CPUID
- USE_GCC_GET_CPUID
- USE_AVX
- USE_AVX2

`cpuid` isn't used in PyTorch any more; we only use `cpuinfo`.
`USE_AVX*` is also not used; instead, `HAVE_*_CPU_DEFINITIONS` tells
you which `CPU_CAPABILITY` flags are being compiled.
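
For context, a sketch of how such definitions are typically consumed in C++, assuming the `HAVE_*_CPU_DEFINITION` spelling used in ATen's build:

```cpp
// Each HAVE_*_CPU_DEFINITION macro records that kernels for the matching
// CPU_CAPABILITY were compiled in; runtime dispatch then selects among them.
#ifdef HAVE_AVX2_CPU_DEFINITION
// AVX2 kernel variants are available to the dispatcher.
#endif
#ifdef HAVE_AVX512_CPU_DEFINITION
// AVX-512 kernel variants are available to the dispatcher.
#endif
```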

There is also `fbgemm`'s code path adding `third_party` as an include
path, despite `fbgemm` having a dedicated include directory and a
CMake setup that properly includes it.

Test Plan: Imported from OSS

Reviewed By: albanD

Differential Revision: D33794424

Pulled By: malfet

fbshipit-source-id: 99d504af088818d4a26c2f6ce67ec0d59a5eb703
(cherry picked from commit 2e099d41f0e2f7d96c6013ac83223a75f4e4f862)
2022-01-31 22:49:11 +00:00
304efd8e9a Change TH_BLAS_MKL into AT_MKL_ENABLED() (#70219)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70219

Pull Request resolved: https://github.com/pytorch/pytorch/pull/69419

Test Plan: Imported from OSS

Reviewed By: malfet

Differential Revision: D33246758

Pulled By: ngimel

fbshipit-source-id: aedef4c9ef97b6aa9f574313c94f774b77df2748
2021-12-21 10:36:55 -08:00
9ad05f2c3a Upgrade oneDNN to v2.3.3 and package oneDNN Graph API together (#63748)
Summary:
This PR upgrades oneDNN to [v2.3.3](https://github.com/oneapi-src/oneDNN/releases/tag/v2.3.3) and includes [Graph API preview release](https://github.com/oneapi-src/oneDNN/releases/tag/graph-v0.2) in one package.

- oneDNN will be located at `pytorch/third_party/ideep/mkl-dnn/third_party/oneDNN`
- The version of oneDNN will be [v2.3.3](https://github.com/oneapi-src/oneDNN/releases/tag/v2.3.3)
  The main changes on CPU:

  - v2.3
    - Extended primitive cache to improve primitive descriptor creation performance.
    - Improved primitive cache performance in multithreaded configurations.
    - Introduced initial optimizations for bfloat16 compute functionality for future Intel Xeon Scalable processor (code name Sapphire Rapids).
    - Improved performance of binary primitive and binary post-op for cases with broadcast and mixed source and destination formats.
    - Improved performance of reduction primitive
    - Improved performance of depthwise convolution primitive with NHWC activations for training cases
  - v2.3.1
    -  Improved int8 GEMM performance for processors with Intel AVX2 and Intel DL Boost support
    - Fixed integer overflow for inner product implementation on CPUs
    - Fixed out of bounds access in GEMM implementation for Intel SSE 4.1
  - v2.3.2
    - Fixed performance regression in fp32 inner product primitive for processors with Intel AVX512 support
  - v2.3.3
    - Reverted check for memory descriptor stride validity for unit dimensions
    - Fixed memory leak in CPU GEMM implementation

  More changes can be found in https://github.com/oneapi-src/oneDNN/releases.
- The Graph API provides a flexible API for aggressive fusion, and preview2 supports fusion for FP32 inference. See the [Graph API release branch](https://github.com/oneapi-src/oneDNN/tree/dev-graph-preview2) and [spec](https://spec.oneapi.io/onednn-graph/latest/introduction.html) for more details. A separate PR will be submitted to integrate the oneDNN Graph API into the TorchScript graph.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/63748

Reviewed By: albanD

Differential Revision: D32153889

Pulled By: malfet

fbshipit-source-id: 536071168ffe312d452f75d54f34c336ca3778c1
2021-12-09 13:42:40 -08:00
ef13341a8d upgrade onednn to v2.2.3 (#57928)
Summary:
This PR upgrades oneDNN to v2.2.3 (including the v2.2 and v2.2.3 changes), which brings the following main CPU changes:

v2.2 changes:
- Improved performance of compute functionality for future Intel Core processors with Intel AVX2 and Intel DL Boost instructions support (code name Alder Lake).
- Improved fp32 inner product forward propagation performance for processors with Intel AVX-512 support.
- Improved dnnl_gemm performance for cases with n=1 on all supported processors.

v2.2.3 changes:
- Fixed a bug in int8 depthwise convolution primitive with groups and 1d spatial size for processors with Intel AVX-512 and Intel AVX2 support.
- Fixed correctness issue for PReLU primitive on Intel Processor Graphics.
- Fixed correctness issue in reorder for blocked layouts with zero padding.
- Improved performance of weights reorders used by BRGEMM-based convolution primitive for processors with Intel AVX-512 support.

More changes can be found in https://github.com/oneapi-src/oneDNN/releases.

The ideep version used is pytorch-rls-v2.2.3.
The oneDNN version used is v2.2.3.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/57928

Reviewed By: bdhirsh

Differential Revision: D29037857

Pulled By: VitalyFedyunin

fbshipit-source-id: db74534858bdcf5d6c7dcf58e224fc756188bc31
2021-06-14 11:57:45 -07:00
f8788d5188 Upgrade onednn to v2.1.2 (#54956)
Summary:
This PR upgrades oneDNN to v2.1.2, which includes the following main CPU changes:

- Improved performance of forward convolution with plain activations for processors with Intel AVX-512 support

- Improved performance of fp32 depthwise convolution with plain activations on CPU.

More changes can be found in https://github.com/oneapi-src/oneDNN/releases.

The ideep version used is [pytorch-rls-v2.1.2](https://github.com/intel/ideep/tree/pytorch-rls-v2.1.2).
The oneDNN version used is [v2.1.2](https://github.com/oneapi-src/oneDNN/tree/v2.1.2).

Pull Request resolved: https://github.com/pytorch/pytorch/pull/54956

Reviewed By: ejguan

Differential Revision: D27466741

Pulled By: VitalyFedyunin

fbshipit-source-id: ff96e2cbda4b6bf04d299b9978e9125a013ce32f
2021-04-06 10:51:57 -07:00
1eed54d17a Upgrade oneDNN (mkl-dnn) to v1.7 (#47853)
Summary:
Bump oneDNN (mkl-dnn) to 1.7 for bug fixes and performance optimizations
- Fixes https://github.com/pytorch/pytorch/issues/42115. Fixed a build issue on Windows for the case when oneDNN is built as a submodule.
- Fixes https://github.com/pytorch/pytorch/issues/45746. Fixed a segmentation fault for convolution weight gradient on systems with Intel AVX-512 support.

This PR also contains a few changes in ideep for a follow-up update (not enabled in the current PR yet):
- Performance improvements for the CPU path of Convolution
- Channel-last support

Pull Request resolved: https://github.com/pytorch/pytorch/pull/47853

Reviewed By: bdhirsh

Differential Revision: D25275268

Pulled By: VitalyFedyunin

fbshipit-source-id: 75a589d57e3d19a7f23272a67045ad7494f1bdbe
2020-12-03 11:54:31 -08:00
63e5a53b8c DNNL: fix build error when DNNL using TBB threading pool (#40699)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/40699

Differential Revision: D22286334

Pulled By: albanD

fbshipit-source-id: 0635a0a5e4bf80d44d90c86945d92e98e26ef480
2020-06-29 13:53:18 -07:00
7f270233fb Upgrade DNNL to 1.5 (#40088)
Summary:
- Bump DNNL to 1.5
- Bug fixes and improvements in ideep
  - suppress g++ Wreorder warning
  - avoid rebuilding `libmkldnn.so` https://github.com/oneapi-src/oneDNN/issues/743
  - enable conv3d (integration code was checked in by Xiaobing https://github.com/pytorch/pytorch/pull/35662)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/40088

Differential Revision: D22071530

Pulled By: albanD

fbshipit-source-id: e7a53d7421e8a7a03e36a7dfb68edc565a2f00df
2020-06-16 11:42:30 -07:00
447bcd341d Bazel build of pytorch with gating CI (#36011)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/36011

Differential Revision: D20873430

Pulled By: malfet

fbshipit-source-id: 8ffffd10ca0ff8bdab578a70a9b2b777aed985d0
2020-04-06 22:50:33 -07:00
6be9c77998 Revert D20783179: [pytorch][PR] Bazel build of pytorch
Test Plan: revert-hammer

Differential Revision:
D20783179

Original commit changeset: b160908a3e10

fbshipit-source-id: 5b7b36305525e7ccc49540b48991149cf0a759f4
2020-04-03 17:59:10 -07:00
585f153d00 Bazel build of pytorch (#35220)
Summary: Pull Request resolved: https://github.com/pytorch/pytorch/pull/35220

Reviewed By: seemethere

Differential Revision: D20783179

Pulled By: malfet

fbshipit-source-id: b160908a3e107790fa06057a77de9d6d23493bbc
2020-04-03 17:13:58 -07:00