32 Commits

Author SHA1 Message Date
0fd976b65c Enable mimalloc on non-Windows platforms and make default for AArch64 builds (#164741)
This change removes the Windows requirement for mimalloc builds, and makes mimalloc the default c10 system allocator for AArch64 builds. This significantly improves the performance of AArch64 builds of PyTorch as large allocations are better cached by mimalloc than glibc.

**Updated Results**

Torchbench FP32 eager Inference, 16 threads:
<img width="1510" height="733" alt="mimalloc-v2-fp32-diff" src="https://github.com/user-attachments/assets/7fe3ea0c-3b52-42e7-879b-612444479c90" />

Torchbench BF16 eager Inference, 16 threads:
<img width="1510" height="733" alt="mimalloc-v2-bf16-diff" src="https://github.com/user-attachments/assets/56469a72-9e06-4d57-ae2a-aeb139ca79a3" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164741
Approved by: https://github.com/fadara01, https://github.com/aditew01, https://github.com/malfet
2025-10-09 20:49:46 +00:00
688efd9741 Revert "Enable mimalloc on non-Windows platforms and make default for AArch64 builds (#164741)"
This reverts commit 87eccf10e8484c9e59ef81ae7bdee68d3db4f605.

Reverted https://github.com/pytorch/pytorch/pull/164741 on behalf of https://github.com/malfet because it breaks macOS builds, see https://github.com/pytorch/pytorch/actions/runs/18382886648/job/52373781138 ([comment](https://github.com/pytorch/pytorch/pull/164741#issuecomment-3386859778))
2025-10-09 17:30:25 +00:00
87eccf10e8 Enable mimalloc on non-Windows platforms and make default for AArch64 builds (#164741)
This change removes the Windows requirement for mimalloc builds, and makes mimalloc the default c10 system allocator for AArch64 builds. This significantly improves the performance of AArch64 builds of PyTorch as large allocations are better cached by mimalloc than glibc.

**Updated Results**

Torchbench FP32 eager Inference, 16 threads:
<img width="1510" height="733" alt="mimalloc-v2-fp32-diff" src="https://github.com/user-attachments/assets/7fe3ea0c-3b52-42e7-879b-612444479c90" />

Torchbench BF16 eager Inference, 16 threads:
<img width="1510" height="733" alt="mimalloc-v2-bf16-diff" src="https://github.com/user-attachments/assets/56469a72-9e06-4d57-ae2a-aeb139ca79a3" />

Pull Request resolved: https://github.com/pytorch/pytorch/pull/164741
Approved by: https://github.com/fadara01, https://github.com/aditew01, https://github.com/malfet
2025-10-09 16:45:31 +00:00
cyy
e9e1aacef8 Enable -Wunused on torch targets (#150077)
For GCC, ``-Wunused`` contains:
```
-Wunused-function
Warn whenever a static function is declared but not defined or a non-inline static function is unused.

-Wunused-label
Warn whenever a label is declared but not used.
To suppress this warning use the unused attribute.

-Wunused-parameter
Warn whenever a function parameter is unused aside from its declaration.
To suppress this warning use the unused attribute.

-Wunused-variable
Warn whenever a local variable or non-constant static variable is unused aside from its declaration
To suppress this warning use the unused attribute.
```
For Clang, some of the diagnostics controlled by ``-Wunused`` are enabled by default:
```
Controls [-Wunused-argument](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-argument),
[-Wunused-but-set-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-but-set-variable),
[-Wunused-function](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-function),
[-Wunused-label](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-label), [-Wunused-lambda-capture](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-lambda-capture),
[-Wunused-local-typedef](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-local-typedef),
[-Wunused-private-field](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-private-field),
[-Wunused-property-ivar](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-property-ivar),
[-Wunused-value](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-value), [-Wunused-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-variable).
```
These checks are all useful. This PR aims to enable ``-Wunused`` without breaking code.
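
As a rough illustration of the kind of code these warnings flag, the sketch below shows how the standard C++17 `[[maybe_unused]]` attribute can mark an intentionally unused parameter or variable. The function `padded_size` and its `context` parameter are hypothetical and are not taken from the PR.

```cpp
#include <cstddef>

// Hypothetical helper: the caller's interface fixes the parameter list, but
// this implementation never reads `context`. The attribute documents that
// this is intentional and keeps -Wunused-parameter quiet.
inline std::size_t padded_size(std::size_t nbytes,
                               [[maybe_unused]] void* context) {
  // A name that is only read by logging code which may be compiled out;
  // without the attribute -Wunused-variable could flag it.
  [[maybe_unused]] constexpr const char* kDebugTag = "padded_size";
  constexpr std::size_t kAlignment = 64;  // assumed alignment for the sketch
  // Round nbytes up to the next multiple of kAlignment.
  return (nbytes + kAlignment - 1) / kAlignment * kAlignment;
}
```

Where an entity is genuinely dead rather than intentionally unused, deleting it is usually preferable to annotating it.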

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150077
Approved by: https://github.com/zou3519, https://github.com/wdvr
2025-05-02 07:14:19 +00:00
6dadfc4457 Revert "Enable -Wunused on torch targets (#150077)"
This reverts commit 688adc9941f855e78dd4d595682eea16317b7f54.

Reverted https://github.com/pytorch/pytorch/pull/150077 on behalf of https://github.com/wdvr due to failing internally with use of undeclared identifier ([comment](https://github.com/pytorch/pytorch/pull/150077#issuecomment-2846499828))
2025-05-02 06:53:20 +00:00
cyy
688adc9941 Enable -Wunused on torch targets (#150077)
For GCC, ``-Wunused`` contains:
```
-Wunused-function
Warn whenever a static function is declared but not defined or a non-inline static function is unused.

-Wunused-label
Warn whenever a label is declared but not used.
To suppress this warning use the unused attribute.

-Wunused-parameter
Warn whenever a function parameter is unused aside from its declaration.
To suppress this warning use the unused attribute.

-Wunused-variable
Warn whenever a local variable or non-constant static variable is unused aside from its declaration
To suppress this warning use the unused attribute.
```
For Clang, some of the diagnostics controlled by ``-Wunused`` are enabled by default:
```
Controls [-Wunused-argument](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-argument),
[-Wunused-but-set-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-but-set-variable),
[-Wunused-function](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-function),
[-Wunused-label](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-label), [-Wunused-lambda-capture](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-lambda-capture),
[-Wunused-local-typedef](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-local-typedef),
[-Wunused-private-field](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-private-field),
[-Wunused-property-ivar](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-property-ivar),
[-Wunused-value](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-value), [-Wunused-variable](https://clang.llvm.org/docs/DiagnosticsReference.html#wunused-variable).
```
These checks are all useful. This PR aims to enable ``-Wunused`` without breaking code.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/150077
Approved by: https://github.com/zou3519
2025-05-01 04:09:06 +00:00
4ac2ee573d [sigmoid] memory planner C10 deps (#151275)
Summary: perf-sensitive util functions for use in our memory planner

Test Plan: CI

Differential Revision: D73002726

Pull Request resolved: https://github.com/pytorch/pytorch/pull/151275
Approved by: https://github.com/georgiaphillips
2025-04-24 01:46:32 +00:00
cyy
8fa81a6066 Enable misc-use-internal-linkage check and apply fixes (#148948)
Enables the clang-tidy rule [`misc-use-internal-linkage`](https://clang.llvm.org/extra/clang-tidy/checks/misc/use-internal-linkage.html). This check was introduced in Clang-Tidy 18 and is available now that Clang-Tidy has been updated to version 19.

The check marks functions and variables used only in the translation unit as static. Therefore undesired symbols are not leaked into other units, more link time optimisations are possible and the resulting binaries may be smaller.

Most of the detected violations were fixed by adding `static`. In other cases the symbols were in fact used by other files, so the headers declaring them were included instead. Some declarations were simply wrong and have been fixed.
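
A minimal sketch of the pattern the check encourages, assuming a single translation unit with one helper and one variable that nothing else refers to. All names here are hypothetical and are not taken from the PR.

```cpp
#include <cstddef>

// Used only inside this translation unit, so it gets internal linkage via
// `static`; no symbol for it is exported to other units.
static std::size_t align_up(std::size_t n, std::size_t alignment) {
  return (n + alignment - 1) / alignment * alignment;
}

namespace {
// An unnamed namespace gives internal linkage as well; it is used here for a
// file-local variable.
std::size_t g_call_count = 0;
} // namespace

// Meant to be called from other translation units, so it keeps external
// linkage and would normally be declared in a header that this file includes.
std::size_t default_aligned_size(std::size_t n) {
  ++g_call_count;
  return align_up(n, 64);
}

// Also external: lets callers observe the file-local counter.
std::size_t default_aligned_size_calls() {
  return g_call_count;
}
```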

Pull Request resolved: https://github.com/pytorch/pytorch/pull/148948
Approved by: https://github.com/Skylion007
2025-03-12 14:22:56 +00:00
cyy
09291817b2 Fix extra semicolon warning (#148291)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/148291
Approved by: https://github.com/Skylion007
2025-03-03 18:51:44 +00:00
cyy
116af809eb Use std::string_view (#145906)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145906
Approved by: https://github.com/albanD
2025-01-30 03:14:27 +00:00
cyy
1bdb92cbff [2/N] Use thread-safe strerror (#141011)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/141011
Approved by: https://github.com/ezyang
2024-11-22 07:02:30 +00:00
96b30dcb25 [Windows][cpu] mkl use mimalloc as allocator on Windows (#138419)
We have done a lot of optimization work for PyTorch on Windows and have made good progress. Still, some models show a performance gap between PyTorch on Windows and PyTorch on Linux. Ref: https://pytorch.org/blog/performance-boost-windows/#conclusion
As the blog's conclusion notes, `ResNet50` is a typical example.

Focusing on `ResNet50`, we collected the following profiling log:
```cmd
(nightly) D:\xu_git\dnnl_cb>python test_script_resnet50.py
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                             Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
                  model_inference         3.91%     682.427ms       100.00%       17.448s       17.448s             1
                     aten::conv2d         0.18%      30.906ms        64.79%       11.305s       2.133ms          5300
                aten::convolution         0.45%      78.031ms        64.62%       11.275s       2.127ms          5300
               aten::_convolution         0.30%      51.670ms        64.17%       11.196s       2.113ms          5300
         aten::mkldnn_convolution        63.58%       11.093s        63.87%       11.145s       2.103ms          5300
                 aten::batch_norm         0.13%      23.536ms        20.10%        3.506s     661.580us          5300
     aten::_batch_norm_impl_index         0.28%      49.486ms        19.96%        3.483s     657.139us          5300
          aten::native_batch_norm        19.26%        3.360s        19.64%        3.427s     646.615us          5300
                 aten::max_pool2d         0.01%       1.038ms         5.84%        1.018s      10.181ms           100
    aten::max_pool2d_with_indices         5.83%        1.017s         5.83%        1.017s      10.171ms           100
                       aten::add_         3.38%     588.907ms         3.38%     588.907ms      85.349us          6900
                      aten::relu_         0.35%      60.358ms         1.67%     292.155ms      59.624us          4900
                 aten::clamp_min_         1.33%     231.797ms         1.33%     231.797ms      47.306us          4900
                      aten::empty         0.46%      80.195ms         0.46%      80.195ms       1.513us         53000
                     aten::linear         0.01%     927.300us         0.23%      39.353ms     393.532us           100
                      aten::addmm         0.20%      35.379ms         0.21%      37.016ms     370.155us           100
                 aten::empty_like         0.12%      20.455ms         0.17%      29.976ms       5.656us          5300
                aten::as_strided_         0.11%      18.830ms         0.11%      18.830ms       3.553us          5300
        aten::adaptive_avg_pool2d         0.00%     419.900us         0.08%      14.265ms     142.647us           100
                       aten::mean         0.01%       1.737ms         0.08%      13.845ms     138.448us           100
                        aten::sum         0.05%       8.113ms         0.05%       8.648ms      86.479us           100
                    aten::resize_         0.03%       5.182ms         0.03%       5.182ms       0.978us          5300
                       aten::div_         0.01%       1.445ms         0.02%       3.460ms      34.600us           100
                         aten::to         0.00%     337.000us         0.01%       2.015ms      20.154us           100
                   aten::_to_copy         0.01%     977.500us         0.01%       1.678ms      16.784us           100
                      aten::copy_         0.01%       1.474ms         0.01%       1.474ms       7.371us           200
                          aten::t         0.00%     775.900us         0.01%       1.410ms      14.104us           100
                    aten::flatten         0.00%     420.900us         0.01%       1.311ms      13.106us           100
                       aten::view         0.01%     889.700us         0.01%     889.700us       8.897us           100
                  aten::transpose         0.00%     410.700us         0.00%     634.500us       6.345us           100
                     aten::expand         0.00%     496.800us         0.00%     566.800us       5.668us           100
                      aten::fill_         0.00%     534.800us         0.00%     534.800us       5.348us           100
                 aten::as_strided         0.00%     293.800us         0.00%     293.800us       1.469us           200
              aten::empty_strided         0.00%     241.700us         0.00%     241.700us       2.417us           100
               aten::resolve_conj         0.00%      54.800us         0.00%      54.800us       0.274us           200
---------------------------------  ------------  ------------  ------------  ------------  ------------  ------------
Self CPU time total: 17.448s

Execution time: 20.02380895614624
```
The kernel that consumes the most CPU time is `aten::mkldnn_convolution`, which is dispatched to `MKLDNN`.
We had already optimized memory allocation by integrating mimalloc into the PyTorch C10 module. That gives PyTorch on Windows a large boost, but it does not cover the intermediate temporary memory used by `MKL` and `MKLDNN`.
There is therefore still room to improve PyTorch Windows performance by optimizing the intermediate temporary memory of `MKL` and `MKLDNN`.

I discussed this with the Intel MKL team and learned of a method to register a high-performance memory allocation API with MKL, which helps MKL improve its memory performance. See the online documentation: https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-windows/2023-0/redefining-memory-functions.html

This PR optimizes MKL memory allocation performance on Windows by registering mi_malloc with MKL. PR changes:
1. Add the CMake option `USE_MIMALLOC_ON_MKL`, a sub-option of `USE_MIMALLOC`.
2. Wrap and export the mi_malloc APIs in C10 when `USE_MIMALLOC_ON_MKL` is `ON`.
3. Add MklAllocationHelp.cpp to register the allocation APIs with MKL when `USE_MIMALLOC_ON_MKL` is `ON`.
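
The registration idea in the list above can be sketched roughly as follows. This is not the code from the PR: `fake_lib` stands in for a library that exposes replaceable allocation hooks, and `fast_alloc` stands in for the wrapped mi_malloc exports. Both namespaces and every name inside them are hypothetical and defined locally so that the sketch compiles without MKL or mimalloc.

```cpp
#include <cstddef>
#include <cstdlib>

// Stand-in for a library with replaceable allocation hooks. These are NOT
// the real MKL symbols; they only mimic the function-pointer pattern.
namespace fake_lib {
using malloc_fn = void* (*)(std::size_t);
using free_fn = void (*)(void*);
malloc_fn hook_malloc = std::malloc;
free_fn hook_free = std::free;

// The library routes its temporary buffers through the hooks.
void* make_scratch(std::size_t nbytes) { return hook_malloc(nbytes); }
void drop_scratch(void* p) { hook_free(p); }
} // namespace fake_lib

// Stand-in for the exported allocator wrappers. They forward to the C
// runtime and count live blocks so the redirection is observable.
namespace fast_alloc {
int live_blocks = 0;
void* alloc(std::size_t nbytes) {
  void* p = std::malloc(nbytes);
  if (p != nullptr) ++live_blocks;
  return p;
}
void release(void* p) {
  if (p != nullptr) --live_blocks;
  std::free(p);
}
} // namespace fast_alloc

// Registration should happen before the library allocates anything, so that
// no block is allocated by one allocator and freed by the other.
void register_fast_allocator() {
  fake_lib::hook_malloc = fast_alloc::alloc;
  fake_lib::hook_free = fast_alloc::release;
}
```

Replacing the allocation and free hooks together matters: swapping only one of them would hand memory from one allocator to the other's free function.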

For `oneDNN`, the equivalent change is still being tracked in this proposal: https://github.com/oneapi-src/oneDNN/issues/1898

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138419
Approved by: https://github.com/jgong5, https://github.com/ezyang
2024-10-24 05:29:47 +00:00
fddabc6e0b C10_UNUSED to [[maybe_unused]] (#6357) (#138364)
Summary: Pull Request resolved: https://github.com/pytorch/executorch/pull/6357

Pull Request resolved: https://github.com/pytorch/pytorch/pull/138364
Approved by: https://github.com/Skylion007, https://github.com/eqy
2024-10-19 13:17:43 +00:00
cyy
8c860aef0d [Reland][Environment Variable][3/N] Use thread-safe getenv functions (#137942)
Reland of #137328, which was reverted due to reverting a dependent PR.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137942
Approved by: https://github.com/eqy
2024-10-15 07:47:24 +00:00
df0c2f5cae Revert "[Environment Variable][3/N] Use thread-safe getenv wrapper (#137328)"
This reverts commit 25ac5652d003c5526f496bd1e2cdfbe697c58ba4.

Reverted https://github.com/pytorch/pytorch/pull/137328 on behalf of https://github.com/clee2000 due to need to revert this in order to revert #133896, please rebase and reland, sorry for the churn ([comment](https://github.com/pytorch/pytorch/pull/137328#issuecomment-2412143739))
2024-10-14 20:22:26 +00:00
25ac5652d0 [Environment Variable][3/N] Use thread-safe getenv wrapper (#137328)
Follows #124485

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137328
Approved by: https://github.com/eqy
2024-10-11 23:23:57 +00:00
cyy
a2396b2dd8 [2/N] Fix extra warnings brought by clang-tidy-17 (#137459)
Follows #137407

Pull Request resolved: https://github.com/pytorch/pytorch/pull/137459
Approved by: https://github.com/Skylion007
2024-10-08 19:05:02 +00:00
277ab8a4c0 Revert "[Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449)"
This reverts commit a56e057814565b2ae33b2106b4d0136179aa18f8.

Reverted https://github.com/pytorch/pytorch/pull/119449 on behalf of https://github.com/jeanschmidt due to Broken internal signals, @albanD please help get this sorted :) ([comment](https://github.com/pytorch/pytorch/pull/119449#issuecomment-2069716129))
2024-04-22 14:44:44 +00:00
cyy
a56e057814 [Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449)
This PR is the beginning of attempts to wrap thread-unsafe getenv and set_env functions inside a RW mutex.
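
A minimal sketch of what wrapping environment access in a reader-writer mutex can look like, assuming C++17 and a POSIX or Windows C runtime. The namespace `env_sketch` and its functions are hypothetical and are not the API added by this PR.

```cpp
#include <cstdlib>
#include <mutex>
#include <optional>
#include <shared_mutex>
#include <string>

namespace env_sketch { // hypothetical namespace, not the c10 API

inline std::shared_mutex& env_mutex() {
  static std::shared_mutex m;
  return m;
}

// Readers share the lock. The value is copied while the lock is held because
// the pointer returned by getenv may be invalidated by a later write.
inline std::optional<std::string> get_env(const char* name) {
  std::shared_lock<std::shared_mutex> lock(env_mutex());
  const char* value = std::getenv(name);
  if (value == nullptr) {
    return std::nullopt;
  }
  return std::string(value);
}

// Writers take the lock exclusively.
inline bool set_env(const char* name, const char* value) {
  std::unique_lock<std::shared_mutex> lock(env_mutex());
#ifdef _WIN32
  return _putenv_s(name, value) == 0;
#else
  return setenv(name, value, /*overwrite=*/1) == 0;
#endif
}

} // namespace env_sketch
```

The lock only protects callers that go through these wrappers; code that calls `getenv` or `setenv` directly can still race with them.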
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119449
Approved by: https://github.com/malfet, https://github.com/albanD
2024-04-19 13:39:41 +00:00
61bc188f42 Revert "[Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449)"
This reverts commit b51f66c1950a582dd18d1b2ee67df840a8c4dbbe.

Reverted https://github.com/pytorch/pytorch/pull/119449 on behalf of https://github.com/malfet due to Broke gcc9 builds ([comment](https://github.com/pytorch/pytorch/pull/119449#issuecomment-2064936414))
2024-04-18 18:53:59 +00:00
cyy
b51f66c195 [Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449)
This PR is the beginning of attempts to wrap thread-unsafe getenv and set_env functions inside a RW mutex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119449
Approved by: https://github.com/albanD
2024-04-18 13:35:48 +00:00
f5049de242 Revert "[Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449)"
This reverts commit 5bef127c2ea49280e7fda4f9fa7cad6fa4078e7d.

Reverted https://github.com/pytorch/pytorch/pull/119449 on behalf of https://github.com/PaliC due to your using TORCH_INTERNAL_ASSERT incorrectly ([comment](https://github.com/pytorch/pytorch/pull/119449#issuecomment-2062696010))
2024-04-17 23:44:00 +00:00
cyy
5bef127c2e [Environment Variable][1/N] Use thread-safe env variable API in c10 (#119449)
This PR is the beginning of attempts to wrap thread-unsafe getenv and set_env functions inside a RW mutex.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/119449
Approved by: https://github.com/albanD
2024-04-16 04:39:20 +00:00
88207b10ca Enable thp(transparent huge pages) for buffer sizes >=2MB (#107697)
2MB transparent huge pages (THP) provide better allocation latency than the standard 4KB pages. This change has shown substantial improvement for batch-mode use cases where the tensor sizes are larger than 100MB.

Only enabled if THP_MEM_ALLOC_ENABLE environment variable is set.
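
The gating described above might look roughly like the sketch below, assuming a POSIX system with `posix_memalign` and, on Linux, `madvise` with `MADV_HUGEPAGE`. The function names, the 64-byte default alignment, and the choice to align large buffers to 2MB are assumptions of this sketch, not details confirmed by the commit message.

```cpp
#include <cstddef>
#include <cstdlib>
#if defined(__linux__) && !defined(__ANDROID__)
#include <sys/mman.h>
#endif

constexpr std::size_t kSketchAlignment = 64;              // assumed default
constexpr std::size_t kThpThreshold = 2 * 1024 * 1024;    // 2MB, from the text

// Read the opt-in switch once; any non-zero integer enables the THP path.
inline bool thp_enabled() {
  static const bool enabled = [] {
    const char* v = std::getenv("THP_MEM_ALLOC_ENABLE");
    return v != nullptr && std::atoi(v) != 0;
  }();
  return enabled;
}

// THP is considered only for buffers of at least 2MB.
inline bool use_thp(std::size_t nbytes) {
  return thp_enabled() && nbytes >= kThpThreshold;
}

inline void* alloc_buffer(std::size_t nbytes) {
  if (nbytes == 0) {
    return nullptr;
  }
  const bool thp = use_thp(nbytes);
  const std::size_t alignment = thp ? kThpThreshold : kSketchAlignment;
  void* data = nullptr;
  if (posix_memalign(&data, alignment, nbytes) != 0) {
    return nullptr;
  }
#if defined(__linux__) && !defined(__ANDROID__)
  if (thp) {
    // Advisory only: the kernel is free to ignore the hint, so the return
    // value is deliberately not treated as an error.
    (void)madvise(data, nbytes, MADV_HUGEPAGE);
  }
#endif
  return data; // release with std::free
}
```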

Relanding https://github.com/pytorch/pytorch/pull/93888 with functionality disabled for Android

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/107697
Approved by: https://github.com/malfet
2023-12-16 18:16:19 +00:00
6c1ccccf21 Enable mimalloc on pytorch Windows (#102595)
This PR implements option 2 of [#102534](https://github.com/pytorch/pytorch/issues/102534).
Major changes:
1. Add mimalloc as a submodule.
2. Add the build option "USE_MIMALLOC".
3. It is enabled only on Windows builds, where it improves PyTorch memory allocation performance.
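
One plausible way a compile-time option such as `USE_MIMALLOC` can switch the allocator is sketched below. It assumes the option is passed to the compiler as a preprocessor definition of the same name and that the mimalloc header is on the include path when it is set. The `sketch_alloc` and `sketch_free` names are hypothetical; `mi_malloc_aligned` and `mi_free` are mimalloc functions.

```cpp
#include <cstddef>
#include <cstdlib>
#if defined(USE_MIMALLOC)
#include <mimalloc.h>
#elif defined(_MSC_VER)
#include <malloc.h>
#endif

// Allocate nbytes with the requested alignment from whichever allocator the
// build selected. Returns nullptr on failure.
inline void* sketch_alloc(std::size_t nbytes, std::size_t alignment) {
#if defined(USE_MIMALLOC)
  return mi_malloc_aligned(nbytes, alignment);
#elif defined(_MSC_VER)
  return _aligned_malloc(nbytes, alignment);
#else
  void* p = nullptr;
  return posix_memalign(&p, alignment, nbytes) == 0 ? p : nullptr;
#endif
}

// Free with the function that matches the allocator used above.
inline void sketch_free(void* p) {
#if defined(USE_MIMALLOC)
  mi_free(p);
#elif defined(_MSC_VER)
  _aligned_free(p);
#else
  std::free(p);
#endif
}
```

Keeping the allocation and free paths behind the same switch avoids releasing a block with a different allocator than the one that produced it.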

Additional Test:
<img width="953" alt="image" src="https://github.com/pytorch/pytorch/assets/8433590/4b2ec2dc-16f1-4ad9-b457-cfeb37e489d3">
This PR also builds and statically links mimalloc successfully on Linux.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/102595
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-06-27 08:53:26 +00:00
a22b92d8ba Revert "Enable thp(transparent huge pages) for buffer sizes >=2MB (#95963)"
This reverts commit 3bb16a084298ed8b9a1e59622afd80418ff4a2f1.

Reverted https://github.com/pytorch/pytorch/pull/95963 on behalf of https://github.com/izaitsevfb due to Breaks internal android builds: unused function c10_compute_alignment  [-Werror,-Wunused-function]
2023-03-14 02:15:08 +00:00
3bb16a0842 Enable thp(transparent huge pages) for buffer sizes >=2MB (#95963)
2MB transparent huge pages (THP) provide better allocation latency than the standard 4KB pages. This change has shown substantial improvement for batch-mode use cases where the tensor sizes are larger than 100MB.

Only enabled if THP_MEM_ALLOC_ENABLE environment variable is set.

re-landing https://github.com/pytorch/pytorch/pull/93888

Pull Request resolved: https://github.com/pytorch/pytorch/pull/95963
Approved by: https://github.com/malfet
2023-03-10 13:58:01 +00:00
2936c8b9ce Revert "Enable thp(transparent huge pages) for buffer sizes >=2MB (#93888)"
This reverts commit 2cc845eb1a45c7ea494c33262a97f9a348818261.

Reverted https://github.com/pytorch/pytorch/pull/93888 on behalf of https://github.com/seemethere due to Reverting due to internal build issues, Meta employees see: https://fburl.com/sandcastle/1p4zvldk
2023-03-01 22:33:04 +00:00
2cc845eb1a Enable thp(transparent huge pages) for buffer sizes >=2MB (#93888)
2MB transparent huge pages (THP) provide better allocation latency than the standard 4KB pages. This change has shown significant improvement for batch-mode use cases where the tensor sizes are larger than 100MB.

Only enabled if `THP_MEM_ALLOC_ENABLE` environment variable is set.

Pull Request resolved: https://github.com/pytorch/pytorch/pull/93888
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-02-28 21:12:46 +00:00
cyy
37f7c00a8a More fixes and improved clang-tidy checkers (#93213)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/93213
Approved by: https://github.com/Skylion007
2023-02-01 14:44:17 +00:00
6d7eddbb75 Make allocator check C10_UNLIKELY
This came up while looking at possible causes for https://github.com/pytorch/pytorch/issues/78800
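
For readers unfamiliar with branch hints, the sketch below shows the general idea of marking an allocation-failure check as the unlikely path. `SKETCH_UNLIKELY` is a local stand-in defined here for illustration, assuming a GCC- or Clang-compatible `__builtin_expect`; it is not the C10 macro itself, and `checked_alloc` is a hypothetical function.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>

// Local stand-in for a branch-prediction hint macro. On compilers without
// __builtin_expect it degrades to the plain expression.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_UNLIKELY(expr) (__builtin_expect(static_cast<bool>(expr), 0))
#else
#define SKETCH_UNLIKELY(expr) (expr)
#endif

inline void* checked_alloc(std::size_t nbytes) {
  if (nbytes == 0) {
    return nullptr;
  }
  void* p = std::malloc(nbytes);
  // Allocation failure is rare, so the hint lets the compiler lay out the
  // success path as the fall-through.
  if (SKETCH_UNLIKELY(p == nullptr)) {
    throw std::bad_alloc();
  }
  return p;
}
```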

Pull Request resolved: https://github.com/pytorch/pytorch/pull/78801

Approved by: https://github.com/ezyang
2022-06-03 19:41:29 +00:00
844a4b47df extract out //c10/core:alloc_cpu (#70859)
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/70859

ghstack-source-id: 147642534

Test Plan: Extracting code unmodified to a new library: relying on CI to validate.

Reviewed By: malfet

Differential Revision: D33329688

fbshipit-source-id: f60327467d197ec1862fb3554f8b83e6c84cab5c
(cherry picked from commit f82e7c0e9beba1113defe6d55cf8a232551e913b)
2022-01-27 07:34:52 +00:00