54c2b66592
Replace _device_t with torch.types.Device in torch/cpu/__init__.py (#161031)
...
Fixes #152952
Replace `_device_t` with `torch.types.Device` in `torch/cpu/__init__.py`. Did a basic smoke test by running tests that `import torch.cpu`, including `test/distributed/test_c10d_functional_native.py` and `test/test_decomp.py`.
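A minimal sketch of the annotation change (`current_stream` is one of the functions in `torch/cpu/__init__.py`; exact signatures may differ):
```python
import torch

# Before (module-local alias in torch/cpu/__init__.py):
#   def current_stream(device: _device_t = None) -> Stream: ...

# After (shared public alias from torch.types):
def current_stream(device: torch.types.Device = None) -> torch.cpu.Stream:
    ...
```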
Based this PR off #152935, which is referenced in the main issue.
(also, this is my first contribution but I followed the contributing guide closely)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/161031
Approved by: https://github.com/janeyx99
2025-08-21 00:22:43 +00:00
4c5cf18ee0
[device_mesh] improve device selection logic (#150897)
...
As titled, this PR improves the device selection logic when the user did not set the device before calling the DeviceMesh constructor; as a device manager, DeviceMesh should try to set the device for users in a sensible way.
The set_device behavior before:
* If the user calls init_process_group to init a world process group, we assume the user already called set_device and we don't set the device for them.
* If the user does not init a world process group themselves, we init one for them and follow a heuristic to set the device.
This is OK, but sometimes the set_device heuristic doesn't work well (e.g. if the user uses TORCH_CUDA_VISIBLE_DEVICES).
So this PR improves the device selection logic to:
* If the default CUDA context is initialized by the time we init DeviceMesh, then we assume the user must have run some CUDA operation before and therefore must have selected the device themselves.
* Otherwise, we check whether the launcher (i.e. torchrun) set the "LOCAL_RANK" and "WORLD_SIZE" env vars; if so, we use "LOCAL_RANK" to set the device for the current process, which is a very standard practice. (This solves the TORCH_CUDA_VISIBLE_DEVICES issue.)
* Otherwise, we warn users about the situation and fall back to the old heuristic, as sketched below.
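A rough sketch of that selection order (the helper name and the fallback shown here are illustrative, not the actual DeviceMesh internals):
```python
import os
import warnings

import torch
import torch.distributed as dist

def _maybe_set_device() -> None:
    if torch.cuda.is_initialized():
        # A CUDA context already exists, so the user must have picked a device.
        return
    local_rank = os.environ.get("LOCAL_RANK")
    if local_rank is not None and "WORLD_SIZE" in os.environ:
        # Launched via torchrun (or similar): LOCAL_RANK is the standard choice.
        torch.cuda.set_device(int(local_rank))
        return
    warnings.warn("Unable to infer the device; falling back to the old heuristic.")
    # Old heuristic: spread ranks across the visible devices.
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
```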
Pull Request resolved: https://github.com/pytorch/pytorch/pull/150897
Approved by: https://github.com/tianyu-l
ghstack dependencies: #150898
2025-05-14 06:29:16 +00:00
6971b77510
[CPU Stream] Add noop for CPU stream record_event() and wait_event() (#145935)
...
Summary: Adds `wait_event` and `record_event` endpoints to the CPU stream in order to facilitate device-agnostic code. Both methods are no-ops.
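A hedged sketch of the device-agnostic pattern this enables (the generic `torch.Event` here is an assumption for illustration):
```python
import torch

stream = torch.cpu.Stream()
event = torch.Event()       # assumption: a generic event object works here
stream.record_event(event)  # no-op on CPU
stream.wait_event(event)    # no-op on CPU
```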
Test Plan: CI
Differential Revision: D68833927
Pull Request resolved: https://github.com/pytorch/pytorch/pull/145935
Approved by: https://github.com/Skylion007
2025-02-20 18:50:55 +00:00
e56dcf2772
[CPUInductor] Fix SVE256 detection (#146207)
...
This PR removes `torch.cpu._is_arm_sve_supported()` and replaces it with the stable `torch.backends.cpu.get_cpu_capability()`.
I should have reviewed https://github.com/pytorch/pytorch/pull/134672 more thoroughly, because it introduced a duplicate but slightly different API for detecting CPU architectures, which resulted in runtime crashes on systems that support SVE128 rather than SVE256.
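A minimal sketch of the stable detection path (the capability strings listed in the comment are assumed and may vary by build):
```python
import torch

# get_cpu_capability() returns a string such as "DEFAULT", "AVX2",
# "AVX512", or "SVE256", depending on the host CPU.
if torch.backends.cpu.get_cpu_capability() == "SVE256":
    ...  # only then emit SVE-256 vectorized kernels
```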
Fixes https://github.com/pytorch/pytorch/issues/145441
Pull Request resolved: https://github.com/pytorch/pytorch/pull/146207
Approved by: https://github.com/angelayi
2025-02-01 18:51:34 +00:00
9e14d86573
[Inductor][CPP] Add oneDNN BRGEMM config for Half cpp gemm template (#136255)
...
`kernel_micro_gemm` generated using BRGEMM:
```
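// Generated micro-kernel: computes C (M x N, fp32) = A (M x K, half) * B (K x N, half)
// via oneDNN BRGEMM, accumulating into the existing C when `accum` is true (beta = 1).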
template <bool accum>
inline void kernel_micro_gemm(
const half* __restrict__ A,
const half* __restrict__ B,
float* __restrict__ C,
int64_t M,
int64_t N,
int64_t K,
int64_t lda,
int64_t ldb,
int64_t ldc
) {
at::native::cpublas::brgemm(
M, N, K,
lda, ldb, ldc,
1.f, accum ? 1.f : 0.f,
A,
B,
C);
}
```
Pull Request resolved: https://github.com/pytorch/pytorch/pull/136255
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-11-05 05:33:29 +00:00
575f260229
Extend vectorization with SVE (ARM) for Torch Compile (Inductor) (#134672)
...
**Motivation**
Enable SVE vectorization with `torch.compile`
Extends PR: #119571
* This PR enables vectorization in the codegen part using SVE-256 (vector length)
* The changes can be extended to other SVE vector lengths
I've done some comparisons of the existing NEON implementation against the SVE-vectorized route for `torch.compile`.
Test results are for 8 cores on an ARM Neoverse_V1.
<img width="359" alt="Screenshot 2024-08-28 at 16 02 07" src="https://github.com/user-attachments/assets/6961fbea-8285-4ca3-b92e-934a2db50ee2 ">
It's worth mentioning that for the standalone `SiLU` op there's a `~1.8x` speedup with `torch.compile`.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/134672
Approved by: https://github.com/jgong5, https://github.com/malfet
2024-10-10 13:20:40 +00:00
43dcb4bb61
Revise CPU vectorization ISA support API (#135075)
...
Revises (mostly renames) the CPU vectorization ISA support API (not frontend-user-facing). Also adds an AVX512_BF16 ISA detection API.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/135075
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/ezyang
2024-09-05 12:14:56 +00:00
914d3ca2ba
[inductor][cpp] BF16 AMX micro-gemm support (#127195)
...
This PR adds an intrinsics-based micro-gemm for BF16 using Advanced Matrix eXtensions (AMX) instructions available in 4th- and 5th-generation Intel Xeon processors. A compilation check is added to `codecache.py` to verify compiler support. Also, since AMX requires the Linux kernel to enable its extra register states, an initialization function is added to do that, triggered via `codecache.py`.
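For context, a minimal sketch of the measurement setup described below (the toy model stands in for the benchmarked ones):
```python
import torch

model = torch.nn.Linear(1024, 1024)                   # toy stand-in for the benchmarked models
inputs = torch.randn(64, 1024)
compiled = torch.compile(model, mode="max-autotune")  # enables the autotuned C++ GEMM templates
with torch.autocast("cpu", dtype=torch.bfloat16):     # BF16 AMP, as in the results below
    out = compiled(inputs)
```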
Performance speedups of >=10% with BF16 AMP, max_autotune vs. no autotune, measured on Intel(R) Xeon(R) Platinum 8488C:
Static shapes
Single-threaded
| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| timm_models | mixer_b16_224 | 1.54 |
| timm_models | convit_base | 1.53 |
| huggingface | MobileBertForQuestionAnswering | 1.52 |
| torchbench | fastNLP_Bert | 1.44 |
| torchbench | llama | 1.33 |
| timm_models | swin_base_patch4_window7_224 | 1.31 |
| torchbench | dlrm | 1.28 |
| torchbench | timm_vision_transformer_large | 1.28 |
| huggingface | MobileBertForMaskedLM | 1.27 |
| timm_models | vit_base_patch16_224 | 1.26 |
| timm_models | beit_base_patch16_224 | 1.23 |
| timm_models | jx_nest_base | 1.21 |
| torchbench | pyhpc_equation_of_state | 1.18 |
| huggingface | Speech2Text2ForCausalLM | 1.15 |
| timm_models | pit_b_224 | 1.14 |
| timm_models | twins_pcpvt_base | 1.14 |
| torchbench | maml_omniglot | 1.1 |
| timm_models | botnet26t_256 | 1.1 |
Multi-threaded
| Model Family | Model Name | Speedup |
|--------------|------------|---------|
| torchbench | BERT_pytorch | 1.35 |
| torchbench | lennard_jones | 2.43 |
| torchbench | hf_Albert | 1.35 |
| torchbench | hf_T5 | 1.34 |
| torchbench | soft_actor_critic | 1.34 |
| torchbench | fastNLP_Bert | 1.28 |
| huggingface | LayoutLMForSequenceClassification | 1.26 |
| torchbench | llama | 1.24 |
| huggingface | GPT2ForSequenceClassification | 1.19 |
| torchbench | hf_Bart | 1.17 |
| torchbench | hf_Bert_large | 1.16 |
| torchbench | hf_GPT2 | 1.16 |
| timm_models | gmixer_24_224 | 1.16 |
| torchbench | hf_GPT2_large | 1.15 |
| torchbench | maml_omniglot | 1.14 |
| torchbench | hf_Bert | 1.13 |
| torchbench | hf_DistilBert | 1.13 |
| torchbench | hf_T5_large | 1.12 |
| huggingface | MT5ForConditionalGeneration | 1.11 |
Dynamic shapes
Single-threaded
| Model Family | Model Name | Speedup |
|--------------|------------|-------|
| timm_models | mixer_b16_224 | 1.52 |
| timm_models | convit_base | 1.5 |
| huggingface | MobileBertForQuestionAnswering | 1.49 |
| torchbench | fastNLP_Bert | 1.42 |
| torchbench | timm_vision_transformer_large | 1.28 |
| timm_models | swin_base_patch4_window7_224 | 1.27 |
| torchbench | llama | 1.26 |
| huggingface | MobileBertForMaskedLM | 1.25 |
| timm_models | vit_base_patch16_224 | 1.25 |
| timm_models | beit_base_patch16_224 | 1.24 |
| timm_models | jx_nest_base | 1.2 |
| torchbench | dlrm | 1.19 |
| timm_models | pit_b_224 | 1.13 |
| timm_models | twins_pcpvt_base | 1.13 |
| torchbench | hf_Bert_large | 1.12 |
| torchbench | hf_BigBird | 1.11 |
| huggingface | Speech2Text2ForCausalLM | 1.11 |
| timm_models | eca_botnext26ts_256 | 1.11 |
| timm_models | botnet26t_256 | 1.1 |
Multi-threaded
| Model Family | Model Name | Speedup |
|--------------|------------|-------|
| torchbench | BERT_pytorch | 1.18 |
| torchbench | lennard_jones | 2.18 |
| torchbench | hf_Albert | 1.37 |
| torchbench | soft_actor_critic | 1.31 |
| huggingface | GPT2ForSequenceClassification | 1.29 |
| torchbench | hf_T5 | 1.28 |
| torchbench | fastNLP_Bert | 1.27 |
| torchbench | hf_Bart | 1.21 |
| torchbench | hf_Bert_large | 1.19 |
| torchbench | hf_T5_large | 1.19 |
| torchbench | hf_Bert | 1.16 |
| torchbench | hf_GPT2 | 1.16 |
| huggingface | CamemBert | 1.16 |
| torchbench | hf_GPT2_large | 1.13 |
| torchbench | functorch_maml_omniglot | 1.12 |
| huggingface | BertForMaskedLM | 1.12 |
| huggingface | MT5ForConditionalGeneration | 1.12 |
| torchbench | hf_DistilBert | 1.11 |
| timm_models | mixnet_l | 1.11 |
| timm_models | tf_mixnet_l | 1.11 |
No perf regressions.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127195
Approved by: https://github.com/jansel
2024-06-21 07:21:47 +00:00
9e39c62908
Correct avx512_vnni ISA name (#128318)
...
`x86` currently has two VNNI ISAs: `avx2_vnni` and `avx512_vnni`.
This PR corrects the function name to `avx512_vnni`.
Co-authored-by: Jiong Gong <jiong.gong@intel.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/128318
Approved by: https://github.com/leslie-fang-intel, https://github.com/jgong5, https://github.com/desertfire
2024-06-12 16:12:49 +00:00
62bcdc0ac9
Flip default value for mypy disallow_untyped_defs [4/11] (#127841)
...
See #127836 for details.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/127841
Approved by: https://github.com/oulgen
2024-06-08 18:36:48 +00:00
ba81c3c290
[inductor] add cpp builder code. (take 2) (#125849)
...
A fully manual rebase of the code from PR https://github.com/pytorch/pytorch/pull/124045.
The old PR appears to have broken due to too many commits and too many rebases. Please reference: https://github.com/pytorch/pytorch/pull/124045#issuecomment-2103744588
-------
It is the first step of RFC https://github.com/pytorch/pytorch/issues/124245.
Changes:
1. Add cpp builder code; the new cpp_builder supports Windows OS.
2. Add a CPU ISA checker which is cross-OS and exported from the cpuinfo backend.
3. Switch the compiler ISA checker to the new cpp builder.
4. Make CppCodeCache use the new ISA checker.
5. Add a temporary `test_new_cpp_build_logical` UT to help with the transfer to the new code.
<img width="1853" alt="Image" src="https://github.com/pytorch/pytorch/assets/8433590/ce6519ab-ba92-4204-b1d6-7d15d2ba2cbe ">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/125849
Approved by: https://github.com/jgong5, https://github.com/desertfire
2024-06-07 20:49:58 +00:00
bdea4904c1
Add some type annotations to python stream and event classes (#126171)
...
For recent device-agnostic code changes, we need type hinting on the parent classes for better tooling support.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/126171
Approved by: https://github.com/ezyang
2024-05-15 04:58:07 +00:00
2e237fcd70
Revert "[inductor] add cpp builder code. ( #124045 )"
...
This reverts commit 469383755fe416eb1c41fa724762ad3eaecdff07.
Reverted https://github.com/pytorch/pytorch/pull/124045 on behalf of https://github.com/clee2000 due to breaking inductor/test_codecache and inductor/test_max_autotune 469383755f
https://github.com/pytorch/pytorch/actions/runs/8996772350/job/24724775182 ([comment](https://github.com/pytorch/pytorch/pull/124045#issuecomment-2100851419))
2024-05-08 15:33:20 +00:00
469383755f
[inductor] add cpp builder code. (#124045)
...
The previous full PR https://github.com/pytorch/pytorch/pull/115248 failed to merge because fb_code was hard to debug.
I also tried to submit it as two pieces, https://github.com/pytorch/pytorch/pull/118514 and https://github.com/pytorch/pytorch/pull/118515, which passed PreCI at the time.
Now I've split https://github.com/pytorch/pytorch/pull/115248 into smaller pieces; this is the first step of RFC https://github.com/pytorch/pytorch/issues/124245.
Changes:
1. Add cpp builder code; the new cpp_builder supports Windows OS.
2. Add a CPU ISA checker which is cross-OS and exported from the cpuinfo backend.
3. Switch the compiler ISA checker to the new cpp builder.
4. Make CppCodeCache use the new ISA checker.
5. Add a temporary `test_new_cpp_build_logical` UT to help with the transfer to the new code.
<img width="1853" alt="Image" src="https://github.com/pytorch/pytorch/assets/8433590/ce6519ab-ba92-4204-b1d6-7d15d2ba2cbe ">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124045
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-05-08 05:27:15 +00:00
2f79a18324
Revert "[inductor] add cpp builder code. ( #124045 )"
...
This reverts commit 7864d287a1e56685aa754285cc2d3c31ff055f62.
Reverted https://github.com/pytorch/pytorch/pull/124045 on behalf of https://github.com/huydhn due to failing trunk jobs, including lint 7864d287a1 ([comment](https://github.com/pytorch/pytorch/pull/124045#issuecomment-2099306071))
2024-05-07 21:04:49 +00:00
7864d287a1
[inductor] add cpp builder code. (#124045)
...
The previous full PR https://github.com/pytorch/pytorch/pull/115248 failed to merge because fb_code was hard to debug.
I also tried to submit it as two pieces, https://github.com/pytorch/pytorch/pull/118514 and https://github.com/pytorch/pytorch/pull/118515, which passed PreCI at the time.
Now I've split https://github.com/pytorch/pytorch/pull/115248 into smaller pieces; this is the first step of RFC https://github.com/pytorch/pytorch/issues/124245.
Changes:
1. Add cpp builder code; the new cpp_builder supports Windows OS.
2. Add a CPU ISA checker which is cross-OS and exported from the cpuinfo backend.
3. Switch the compiler ISA checker to the new cpp builder.
4. Make CppCodeCache use the new ISA checker.
5. Add a temporary `test_new_cpp_build_logical` UT to help with the transfer to the new code.
<img width="1853" alt="Image" src="https://github.com/pytorch/pytorch/assets/8433590/ce6519ab-ba92-4204-b1d6-7d15d2ba2cbe ">
Pull Request resolved: https://github.com/pytorch/pytorch/pull/124045
Approved by: https://github.com/jgong5, https://github.com/jansel
2024-05-07 20:07:41 +00:00
c608b0eb35
[Dist] Enable FSDP on CPU (#112145)
...
Differential Revision: [D50688958](https://our.internmc.facebook.com/intern/diff/D50688958/)
Pull Request resolved: https://github.com/pytorch/pytorch/pull/112145
Approved by: https://github.com/fegin
ghstack dependencies: #112144
2023-11-07 01:37:02 +00:00
a614281ea9
Add current_device() to torch.cpu (#110987)
...
To better support device-agnostic code, add `current_device()` to torch.cpu with a "cpu" return value, so that we won't run into `AttributeError: module 'torch.cpu' has no attribute 'current_device'`.
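A small device-agnostic sketch enabled by this addition (note that `torch.cpu.current_device()` returns the string "cpu", while `torch.cuda.current_device()` returns an index):
```python
import torch

mod = torch.cuda if torch.cuda.is_available() else torch.cpu
dev = mod.current_device()  # "cpu" on the CPU module, an int on CUDA
```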
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110987
Approved by: https://github.com/wanchaol
2023-10-11 05:13:10 +00:00
28d7d7fc42
device agnostic: torch.cpu.set_device (#110716)
...
To support device-agnostic code, add a dummy placeholder in torch.cpu.
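Minimal illustration of the placeholder: it accepts a device argument and does nothing.
```python
import torch

torch.cpu.set_device("cpu")  # no-op; mirrors torch.cuda.set_device for generic code
```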
Pull Request resolved: https://github.com/pytorch/pytorch/pull/110716
Approved by: https://github.com/albanD
2023-10-09 23:00:15 +00:00
3bf922a6ce
Apply UFMT to low traffic torch modules (#106249)
...
Signed-off-by: Edward Z. Yang <ezyang@meta.com>
Pull Request resolved: https://github.com/pytorch/pytorch/pull/106249
Approved by: https://github.com/Skylion007
2023-07-29 23:37:30 +00:00
fc012d716d
[core] Bring cpu device module closer to cuda's. (#103172)
...
By implementing some of the functionality used by CUDA, we make implementing device-agnostic code a lot easier.
With this set of changes it's now possible to get FSDP to wrap a trivial module. FWD/BWD still TBD.
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103172
Approved by: https://github.com/wz337, https://github.com/wanchaol
2023-07-12 19:43:22 +00:00
9832cfbbfe
Quantization oneDNN backend only supports VNNI CPU (#103653)
...
**Summary**
- Update the quantization documentation to recommend that the default qconfig with the oneDNN backend be used on CPUs with Vector Neural Network Instruction (VNNI) support.
- Add a warning message when a user uses the default qconfig with the oneDNN backend on a CPU without VNNI support.
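A sketch of the documented recommendation (the FX-style helper is used here for brevity):
```python
from torch.ao.quantization import get_default_qconfig

qconfig = get_default_qconfig("onednn")  # best used on CPUs with VNNI support
```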
Pull Request resolved: https://github.com/pytorch/pytorch/pull/103653
Approved by: https://github.com/jgong5, https://github.com/malfet
2023-06-19 09:50:07 +00:00
0ede83db7a
enable torch.cpu.amp.autocast (#57386)
...
Summary:
Pull Request resolved: https://github.com/pytorch/pytorch/pull/57386
Here is the PR for what's discussed in the RFC https://github.com/pytorch/pytorch/issues/55374 to enable autocast for the CPU device. Currently, this PR only enables BF16 as the lower-precision datatype.
Changes:
1. Enable the new API `torch.cpu.amp.autocast` for autocast on the CPU device: includes the Python API, C++ API, a new DispatchKey, etc.
2. Consolidate the implementation so each cast policy is shared between CPU and GPU devices.
3. Add the operation lists to the corresponding cast policies for CPU autocast.
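A minimal usage sketch of the new API (matmul is assumed to be on the BF16 cast list):
```python
import torch

a = torch.randn(8, 8)
b = torch.randn(8, 8)
with torch.cpu.amp.autocast():
    c = torch.mm(a, b)  # runs in bfloat16 under CPU autocast
assert c.dtype == torch.bfloat16
```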
Test Plan: Imported from OSS
Reviewed By: soulitzer
Differential Revision: D28572219
Pulled By: ezyang
fbshipit-source-id: db3db509973b16a5728ee510b5e1ee716b03a152
2021-05-20 17:48:36 -07:00